Statistics of Common Crawl Monthly Archives

Number of pages, distribution of top-level domains, crawl overlaps, etc. - basic metrics about Common Crawl Monthly Crawl Archives
Latest crawl: CC-MAIN-2024-22

Crawler-Related Metrics

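Crawler-related metrics are extracted from the crawler log files (cf. ../stats/crawler/) and include the fetch status of URLs and the usage of the http and https URL schemes, as shown in the figures below.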

The first plot shows the absolute numbers for these metrics.

Crawler metrics

The second plot shows the relative share of each fetch status.

Percentage of fetch status
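
As a rough illustration of how such percentages could be derived from absolute counts, the following Python sketch normalizes per-status counts to shares of the total. The status labels and numbers are made up for the example and are not taken from the crawler logs; `status_shares` is a hypothetical helper, not part of the project's code.

```python
def status_shares(counts):
    """Convert absolute per-status counts into percentages of the total."""
    total = sum(counts.values())
    if total == 0:
        # Avoid division by zero for an empty crawl
        return {status: 0.0 for status in counts}
    return {status: 100.0 * n / total for status, n in counts.items()}


# Toy counts with hypothetical status labels (not real crawl data)
counts = {"success": 80, "redirect": 12, "not_found": 8}
shares = status_shares(counts)

for status, pct in shares.items():
    print(f"{status}: {pct:.1f}%")
```

Because the shares are normalized per crawl, they remain comparable across monthly crawls of different sizes, which absolute numbers are not.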

The next figure shows the relative usage of the http and https URL protocols (schemes). It reflects the increasing use of HTTPS on the web. However, crawler properties such as sampling, deduplication, and URL canonicalization may also influence the share of HTTPS URLs in a single monthly crawl.

Percentage of HTTP vs. HTTPS URLs
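
A minimal sketch of how the scheme share could be computed for a list of URLs, assuming the URLs are available as plain strings. The URL list and the `scheme_shares` helper are hypothetical and only serve as illustration; the project's actual extraction from the crawler logs may work differently.

```python
from collections import Counter
from urllib.parse import urlsplit


def scheme_shares(urls):
    """Return the percentage of URLs per scheme (e.g. http, https)."""
    # urlsplit() returns the scheme in lower case
    counts = Counter(urlsplit(url).scheme for url in urls)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {scheme: 100.0 * n / total for scheme, n in counts.items()}


# Made-up example URLs (not real crawl data)
urls = [
    "https://example.com/",
    "http://example.org/a",
    "https://example.net/b",
    "HTTPS://example.com/c",
]
scheme_pct = scheme_shares(urls)
print(scheme_pct)
```

Note that the result depends on the input list: deduplicating or canonicalizing the URLs before counting would change the shares, which is the kind of crawler-side influence mentioned above.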