Statistics of Common Crawl Monthly Archives

Number of pages, distribution of top-level domains, crawl overlaps, etc. - basic metrics about Common Crawl Monthly Crawl Archives
Latest crawl: CC-MAIN-2025-08


Estimation of Representativeness of a Recent Crawl

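The representativeness of the November 2019 crawl (CC-MAIN-2019-47) is estimated by comparing the frequency of top-level domains (TLDs) in the crawl with their frequency in three lists of popular domains/sites, labeled alexa, cisco and majestic in the tables below.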

All three lists were fetched at the time the crawl was performed. The TLDs were extracted for the one million domains/sites in the lists, and the relative frequency of each TLD was calculated and compared with its relative frequency among the pages, URLs, hosts and domains in the crawl. Comparisons for older crawls are available via the git version history of the project.
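The extraction step described above can be sketched as follows. This is a minimal illustration, not the project's actual code: it assumes the TLD is simply the last dot-separated label of a host name (which matches the single-label TLDs in the tables), and the names `tld_of`, `tld_relative_frequency` and the `sample` hosts are hypothetical.

```python
from collections import Counter


def tld_of(host: str) -> str:
    """Return the last dot-separated label of a host name, lowercased."""
    return host.strip().rstrip(".").lower().rsplit(".", 1)[-1]


def tld_relative_frequency(hosts):
    """Map each TLD to its share (in percent) of the given host names."""
    counts = Counter(tld_of(h) for h in hosts if h.strip())
    total = sum(counts.values())
    return {tld: 100.0 * n / total for tld, n in counts.items()}


# Hypothetical sample standing in for a one-million-entry list.
sample = ["example.com", "example.org", "news.example.com", "example.de"]
freq = tld_relative_frequency(sample)
for tld, share in sorted(freq.items(), key=lambda kv: -kv[1]):
    print(f"{tld:4s} {share:6.3f}")
```

Applying the same function to the pages, URLs, hosts and domains of a crawl would give columns that can be compared directly with those of the lists.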

The first table shows Spearman’s rank correlation coefficient (ρ) for the 76 TLDs which each cover at least 0.05% of the URLs. The method is similar to Sebastian Spiegler’s analysis of the 2012 crawl archives. He reported ρ = 0.84, using for comparison the W3Techs TLD usage statistics, which were/are derived from the top Alexa sites.
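As an illustration of the coefficient, the sketch below computes Spearman’s ρ as the Pearson correlation of rank vectors, with tied values receiving the mean of their ranks. It is a plain-Python sketch under that textbook definition and not necessarily how the project computes its matrix; the function names are hypothetical. The five input rows are copied from the relative-frequency table, so the result on this small subset will not match the values reported for the full set of TLDs.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation of the rank-transformed data."""
    return pearson(average_ranks(x), average_ranks(y))


# Rows com, ru, org, de, net: "urls" and "alexa" columns of the table.
urls = [46.286, 4.518, 5.811, 3.845, 3.652]
alexa = [47.796, 4.498, 4.875, 3.107, 4.064]
rho = spearman_rho(urls, alexa)
print(f"rho = {rho:.3f}")
```

Because only the ranks enter the calculation, the dominant share of com does not outweigh the other TLDs the way it would in a correlation of the raw percentages.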

Because the three lists used for comparison each have a different notion of popularity, their correlations with the crawl differ. There are also small differences between pages/URLs and hosts/domains. It is an open question whether differences in the relative frequency by TLD are caused by Common Crawl’s crawling strategy or by a different average size of sites under the various TLDs.
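One rough way to look at the pages versus hosts difference is the ratio of a TLD's share of pages to its share of hosts. The sketch below uses a handful of rows copied from the relative-frequency table; the ratio and the name `pages_per_host_index` are illustrative assumptions, not a metric defined by the project. A ratio above 1 means the TLD contributes a larger share of pages than of hosts, but it cannot by itself tell whether that comes from the crawling strategy or from site size.

```python
# (pages %, hosts %) copied from the relative-frequency table.
shares = {
    "edu": (1.056, 0.257),
    "gov": (0.358, 0.055),
    "com": (46.309, 47.160),
    "cn": (1.382, 3.217),
    "site": (0.075, 0.478),
}


def pages_per_host_index(pages_share, hosts_share):
    """Ratio > 1: the TLD holds a larger share of pages than of hosts."""
    return pages_share / hosts_share


index = {tld: pages_per_host_index(p, h) for tld, (p, h) in shares.items()}
ranked = sorted(index.items(), key=lambda kv: kv[1], reverse=True)
for tld, ratio in ranked:
    print(f"{tld:5s} {ratio:5.2f}")
```

In this small selection, gov and edu sit well above 1 while cn and site sit well below it, and com stays close to 1.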

The second table shows the relative frequency per TLD for the lists and the recent crawl. The data in this table was used to calculate the correlation matrix.

Spearman’s rank correlation (ρ) between the TLD frequencies of the crawl and the lists

          pages    urls     hosts    domains  alexa    cisco    majestic
pages     1.000    1.000    0.869    0.862    0.812    0.734    0.812
urls      1.000    1.000    0.870    0.864    0.812    0.734    0.813
hosts     0.869    0.870    1.000    0.999    0.820    0.717    0.868
domains   0.862    0.864    0.999    1.000    0.819    0.721    0.873
alexa     0.812    0.812    0.820    0.819    1.000    0.765    0.831
cisco     0.734    0.734    0.717    0.721    0.765    1.000    0.825
majestic  0.812    0.813    0.868    0.873    0.831    0.825    1.000

Relative frequency per TLD (in percent)

tld pages urls hosts domains alexa cisco majestic
hu 0.437 0.437 0.397 0.388 0.368 0.090 0.226
bg 0.127 0.127 0.059 0.058 0.108 0.026 0.043
se 0.629 0.628 0.686 0.690 0.406 0.126 0.356
ca 0.818 0.819 0.818 0.820 0.682 0.361 0.641
uk 2.300 2.295 2.630 2.623 1.684 1.130 2.567
jp 1.821 1.826 1.546 1.534 1.074 0.360 1.525
it 1.621 1.622 1.584 1.582 1.130 0.428 0.985
at 0.400 0.400 0.587 0.582 0.310 0.087 0.275
be 0.483 0.483 0.601 0.585 0.331 0.147 0.283
cn 1.382 1.391 3.217 3.233 0.265 0.632 2.942
id 0.202 0.202 0.180 0.182 0.198 0.140 0.111
il 0.267 0.267 0.172 0.171 0.157 0.121 0.114
in 0.353 0.354 0.394 0.394 1.407 0.632 0.603
ir 0.240 0.240 0.235 0.231 1.293 0.078 0.323
kr 0.343 0.345 0.386 0.351 0.258 0.100 0.144
me 0.265 0.266 0.224 0.231 0.359 0.787 0.206
nz 0.218 0.218 0.266 0.266 0.137 0.075 0.148
rs 0.153 0.153 0.077 0.073 0.101 0.017 0.045
ru 4.513 4.518 2.987 2.991 4.498 1.440 4.311
th 0.093 0.093 0.079 0.074 0.141 0.053 0.053
vn 0.381 0.381 0.243 0.244 0.162 0.269 0.316
za 0.192 0.193 0.296 0.294 0.386 0.054 0.256
au 0.928 0.928 1.122 1.128 1.110 0.192 0.848
br 1.252 1.253 1.192 1.158 2.417 0.488 0.892
fr 1.681 1.681 1.640 1.643 0.968 0.479 0.956
pl 1.479 1.479 1.360 1.346 1.551 0.339 1.154
us 0.224 0.224 0.252 0.253 0.335 0.396 0.660
fi 0.338 0.338 0.317 0.316 0.208 0.054 0.150
no 0.363 0.362 0.299 0.298 0.364 0.106 0.163
ar 0.205 0.205 0.216 0.206 0.481 0.064 0.148
ro 0.458 0.458 0.310 0.306 0.589 0.065 0.282
tr 0.186 0.186 0.194 0.183 0.326 0.105 0.216
az 0.058 0.058 0.017 0.016 0.163 0.006 0.022
biz 0.180 0.181 0.282 0.286 0.219 0.996 0.306
by 0.195 0.196 0.116 0.116 0.141 0.042 0.116
cat 0.164 0.164 0.068 0.065 0.051 0.011 0.048
cc 0.112 0.110 0.137 0.137 0.138 0.544 0.156
ch 0.549 0.549 0.808 0.782 0.462 0.147 0.329
ua 0.758 0.760 0.429 0.435 0.592 0.259 0.402
cl 0.107 0.108 0.155 0.154 0.311 0.057 0.147
tw 0.523 0.526 0.593 0.596 0.323 0.090 0.716
co 0.417 0.418 0.379 0.387 0.698 0.539 0.554
com 46.309 46.286 47.160 47.446 47.796 58.393 49.626
ee 0.194 0.192 0.096 0.090 0.048 0.014 0.051
es 0.900 0.900 0.725 0.727 0.698 0.227 0.525
ge 0.053 0.054 0.023 0.022 0.036 0.005 0.025
gr 0.421 0.422 0.292 0.282 0.656 0.081 0.211
hk 0.088 0.089 0.076 0.073 0.097 0.062 0.088
hr 0.161 0.161 0.087 0.082 0.110 0.021 0.065
kz 0.118 0.119 0.079 0.079 0.129 0.023 0.102
lv 0.148 0.148 0.075 0.071 0.058 0.014 0.053
mx 0.187 0.187 0.196 0.190 0.633 0.152 0.164
my 0.095 0.095 0.092 0.090 0.105 0.064 0.109
pt 0.254 0.255 0.182 0.178 0.213 0.119 0.129
sg 0.065 0.065 0.070 0.070 0.098 0.049 0.083
pro 0.064 0.064 0.070 0.072 0.167 0.071 0.072
cz 0.970 0.969 0.851 0.836 0.780 0.114 0.370
de 3.847 3.845 5.121 4.999 3.107 0.925 2.398
dk 0.395 0.394 0.537 0.538 0.317 0.125 0.240
edu 1.056 1.059 0.257 0.263 0.306 1.040 0.398
eu 0.704 0.704 0.649 0.641 0.501 0.372 0.480
gov 0.358 0.357 0.055 0.053 0.107 0.511 0.172
ie 0.144 0.144 0.139 0.139 0.177 0.044 0.132
lt 0.247 0.247 0.135 0.130 0.115 0.025 0.084
info 1.115 1.119 0.741 0.743 0.863 0.754 1.062
io 0.140 0.140 0.345 0.361 0.626 0.939 0.207
is 0.060 0.060 0.037 0.037 0.041 0.034 0.036
net 3.656 3.652 3.491 3.492 4.064 15.653 4.480
nl 1.360 1.359 1.944 1.930 0.568 0.453 1.119
nu 0.056 0.056 0.078 0.078 0.031 0.012 0.243
org 5.809 5.811 4.129 4.064 4.875 4.228 7.252
si 0.152 0.152 0.102 0.098 0.096 0.013 0.049
sk 0.308 0.308 0.260 0.255 0.380 0.049 0.108
su 0.074 0.074 0.056 0.056 0.112 0.027 0.091
tk 0.064 0.064 0.073 0.076 0.054 0.013 0.101
tv 0.222 0.222 0.109 0.108 0.325 0.384 0.202
club 0.105 0.105 0.132 0.137 0.188 0.097 0.815
xyz 0.097 0.097 0.170 0.175 0.221 0.109 0.189
top 0.063 0.063 0.272 0.282 0.079 0.054 0.211
site 0.075 0.075 0.478 0.500 0.103 0.036 0.167
online 0.063 0.063 0.080 0.083 0.169 0.060 0.106
live 0.151 0.152 0.303 0.320 0.057 0.033 0.052
blog 0.097 0.097 0.093 0.098 0.031 0.004 0.007