Number of pages, distribution of top-level domains, crawl overlaps, etc. - basic metrics about Common Crawl Monthly Crawl Archives
Latest crawl: CC-MAIN-2023-14
Overlaps between monthly crawl archives are calculated and plotted as the Jaccard similarity of unique URLs or content digests. The cardinalities of the monthly crawls and of the union of two crawls are HyperLogLog estimates; see plot/overlap.py for details.
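As a rough illustration of how a Jaccard similarity can be derived from three cardinalities (two crawls and their union), here is a minimal sketch. It is not the code from plot/overlap.py: the function name `jaccard_from_cardinalities` is hypothetical, and exact Python sets stand in for the HyperLogLog estimates.

```python
def jaccard_from_cardinalities(card_a, card_b, card_union):
    """Jaccard similarity from |A|, |B| and |A union B|.

    Uses inclusion-exclusion: |A intersect B| = |A| + |B| - |A union B|.
    Hypothetical helper for illustration, not taken from plot/overlap.py.
    """
    if card_union <= 0:
        return 0.0
    intersection = card_a + card_b - card_union
    # With estimated cardinalities the difference can come out slightly
    # negative, so clamp the result to the valid range [0, 1].
    intersection = max(intersection, 0.0)
    return min(intersection / card_union, 1.0)


# Toy data: exact sets used here in place of HyperLogLog sketches.
crawl_a = {"url1", "url2", "url3", "url4"}
crawl_b = {"url3", "url4", "url5"}

similarity = jaccard_from_cardinalities(
    len(crawl_a), len(crawl_b), len(crawl_a | crawl_b)
)
print(similarity)
```

Only the three cardinalities are needed, which is why estimates of the individual crawls and of their union are sufficient to compute the overlap.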
Note that the content overlaps are small and of the same order of magnitude as the 1% error rate of the HyperLogLog cardinality estimates.
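To see why a small overlap is hard to distinguish from estimation noise, the following sketch shifts each cardinality by its relative error in the least favourable direction and reports the resulting range of Jaccard values. This is a simplified worst-case assumption, not a statistical error model, and the function name `jaccard_range` and the crawl sizes are made up for illustration.

```python
def jaccard_range(card_a, card_b, card_union, rel_err=0.01):
    """Worst-case range of the Jaccard similarity when every cardinality
    may be off by +/- rel_err (hypothetical helper, simplified model)."""

    def jaccard(total, union):
        # Inclusion-exclusion, clamped to [0, 1].
        intersection = max(total - union, 0.0)
        return min(intersection / union, 1.0)

    total = card_a + card_b
    # Lowest value: crawls underestimated, union overestimated.
    low = jaccard(total * (1 - rel_err), card_union * (1 + rel_err))
    # Highest value: crawls overestimated, union underestimated.
    high = jaccard(total * (1 + rel_err), card_union * (1 - rel_err))
    return low, high


# Made-up example: two crawls of 3 billion items with a true Jaccard of 0.01.
card_a = card_b = 3_000_000_000
true_jaccard = 0.01
# From |A| + |B| = |union| * (1 + J) it follows that:
card_union = (card_a + card_b) / (1 + true_jaccard)

low, high = jaccard_range(card_a, card_b, card_union)
print(f"true: {true_jaccard}, possible range: {low:.4f} to {high:.4f}")
```

Under this assumption the possible range spans from zero to a few times the true value, so small content overlaps should be read as approximate.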