Statistics of Common Crawl Monthly Archives

Number of pages, distribution of top-level domains, crawl overlaps, etc. - basic metrics about Common Crawl Monthly Crawl Archives
Latest crawl: CC-MAIN-2024-51


Overlaps between Common Crawl Monthly Archives

Overlaps between monthly crawl archives are calculated and plotted as the Jaccard similarity of unique URLs or content digests. The cardinalities of the individual monthly crawls and of the union of two crawls are HyperLogLog estimates; see plot/overlap.py for details.
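As an illustration of how a Jaccard similarity can be derived from three cardinality estimates, the following minimal sketch applies the inclusion-exclusion principle: the intersection size is |A| + |B| - |A ∪ B|, and the similarity is that value divided by |A ∪ B|. The function name and the clamping step are assumptions made for this example; they are not taken from plot/overlap.py, which may handle these details differently.

```python
def jaccard_from_cardinalities(card_a: float, card_b: float, card_union: float) -> float:
    """Jaccard similarity of sets A and B from |A|, |B| and |A ∪ B|.

    Hypothetical helper for illustration only, not the project's code.
    """
    if card_union <= 0:
        return 0.0
    # Inclusion-exclusion: |A ∩ B| = |A| + |B| - |A ∪ B|
    intersection = card_a + card_b - card_union
    # The inputs are estimates, so noise can push the intersection below zero
    # or above the smaller set; clamp it to the valid range (an assumption).
    intersection = max(0.0, min(intersection, card_a, card_b))
    return intersection / card_union


# Example with exact (not estimated) counts: 50 shared items out of 150 distinct
print(jaccard_from_cardinalities(100, 100, 150))
```

With estimated inputs the result inherits the estimation error of all three cardinalities, which matters most when the true overlap is small.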

URL overlap (Jaccard similarity) between Common Crawl monthly crawls

Content overlap between Common Crawl monthly crawls (Jaccard similarity on unique content digests)

Note that the content overlaps are small, of the same order of magnitude as the 1% error rate of the HyperLogLog cardinality estimates.
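To see why an overlap of about 1% is hard to distinguish from noise, the sketch below computes a rough range for the Jaccard similarity when each cardinality estimate may be off by up to a given relative error. Treating the 1% error rate as a hard worst-case bound is a simplifying assumption made here (it is more properly a typical error), and the function name and the toy numbers are hypothetical.

```python
def jaccard_bounds(card_a: float, card_b: float, card_union: float,
                   rel_err: float = 0.01) -> tuple[float, float]:
    """Rough lower and upper Jaccard values if each estimate is off by up to rel_err.

    Hypothetical helper for illustration only.
    """
    def jaccard(a: float, b: float, u: float) -> float:
        # Inclusion-exclusion, with negative intersections clamped to zero
        return max(0.0, a + b - u) / u

    # The similarity grows with |A| and |B| and shrinks with |A ∪ B|,
    # so the extremes are reached by shifting the estimates in opposite directions.
    low = jaccard(card_a * (1 - rel_err), card_b * (1 - rel_err), card_union * (1 + rel_err))
    high = jaccard(card_a * (1 + rel_err), card_b * (1 + rel_err), card_union * (1 - rel_err))
    return low, high


# Toy numbers: two sets of 1000 items sharing 20, so the true similarity is about 0.01
low, high = jaccard_bounds(1000, 1000, 1980)
print(low, high)
```

In this toy case the lower value is zero and the upper value is roughly three times the true similarity, so the small content overlaps are best read as indicative rather than exact.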