Statistics of Common Crawl Monthly Archives

Number of pages, distribution of top-level domains, crawl overlaps, etc. - basic metrics about Common Crawl Monthly Crawl Archives
Latest crawl: CC-MAIN-2022-21

Top-Level Domains

Top-level domains (abbreviated “TLD”/“TLDs”) are a significant indicator of how representative the data is: they show whether the data set or a particular crawl is biased towards certain countries, regions or languages.

Metrics about top-level domains are shown on the following pages:

Note that a top-level domain is defined here as the right-most element of a host name (for example, com). Country-code second-level domains (“ccSLD”) and public suffixes are not covered by these metrics.
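
As a rough illustration of this definition, the sketch below takes the last dot-separated label of a host name, which is the position where com appears, and counts how often each label occurs. The function names and the sample host names are assumptions made for this example. They are not taken from the project's code, and the sketch does not special-case IP addresses or internationalized names.

```python
from collections import Counter


def top_level_domain(host_name: str) -> str:
    """Return the last dot-separated label of a host name, lowercased."""
    # Drop surrounding whitespace and a trailing root dot ("example.org.").
    labels = host_name.strip().rstrip(".").lower().split(".")
    return labels[-1]


def tld_counts(host_names) -> Counter:
    """Count how many host names fall under each top-level domain."""
    return Counter(top_level_domain(host) for host in host_names)


# Hypothetical host names, used only for illustration.
hosts = ["www.example.com", "news.example.co.uk", "example.org.", "blog.example.com"]
for tld, count in tld_counts(hosts).most_common():
    print(tld, count)
```

Under this definition a host name below a country-code second-level domain such as co.uk is counted under uk only, which matches the note that ccSLDs and public suffixes are not covered.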