Number of pages, distribution of top-level domains, crawl overlaps, etc. - basic metrics about Common Crawl Monthly Crawl Archives
Latest crawl: CC-MAIN-2024-46
The representativeness of the November 2019 crawl (CC-MAIN-2019-47) is estimated by comparing the frequency of top-level domains (TLDs) in the crawl with their frequency in three lists of the one million most popular domains or sites. These lists appear as the columns alexa, cisco and majestic in the tables below.
All three lists were fetched at the time the crawl was performed. The TLD was extracted for each of the one million domains or sites in every list, and the relative frequency of each TLD was calculated and compared to the relative frequency of pages, URLs, hosts and domains in the crawl. Comparisons for older crawls are available via the git version history of the project.
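The per-list calculation can be sketched as follows. This is a minimal illustration, not the project's actual code: the helper names `tld_of` and `tld_relative_frequency` and the sample host names are hypothetical, and the TLD is simply taken to be the last dot-separated label of a host name. Frequencies are expressed in percent, which is what the values in the tables appear to be.

```python
from collections import Counter


def tld_of(host: str) -> str:
    """Return the last dot-separated label of a host name, lower-cased."""
    return host.strip().rstrip(".").lower().rsplit(".", 1)[-1]


def tld_relative_frequency(hosts):
    """Map each TLD to its share (in percent) of the given host names."""
    counts = Counter(tld_of(h) for h in hosts if h.strip())
    total = sum(counts.values())
    return {tld: 100.0 * n / total for tld, n in counts.items()}


# Hypothetical sample standing in for one of the top-million lists.
sample = ["example.com", "example.org", "news.example.com", "example.de"]
freq = tld_relative_frequency(sample)
for tld, share in sorted(freq.items(), key=lambda kv: -kv[1]):
    print(f"{tld}\t{share:.3f}")
```

Running the same function over the crawl's pages, URLs, hosts and domains would give the remaining columns that the lists are compared against.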
The first table shows Spearman’s rank correlation coefficient (ρ) for the 76 TLDs which cover at least 0.05% of the URLs. The method is similar to Sebastian Spiegler’s analysis of the 2012 crawl archives. He reported ρ = 0.84, using for comparison the W3Techs TLD usage statistics, which were (and are) derived from the top Alexa sites.
Because the three lists used for comparison rely on different notions of popularity, their correlation results differ. There are also small differences between pages/URLs on the one hand and hosts/domains on the other. It is an open question whether differences in the relative frequency by TLD are caused by Common Crawl’s crawling strategy or by a different average size of sites under the various TLDs.
The second table shows the relative frequency (in percent) per TLD for the lists and the recent crawl. The data in this table was used to calculate the correlation matrix.
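For reference, Spearman’s ρ is the Pearson correlation of the rank-transformed values. The notation below is an assumption introduced for illustration: $x_i$ and $y_i$ are the relative frequencies of TLD $i$ in the two sources being compared, $R(\cdot)$ is the rank within a source, and $n$ is the number of TLDs (76 here). The text does not say how ties were handled, so the closed form applies only if all ranks are distinct.

```latex
\rho = \frac{\operatorname{cov}\bigl(R(x), R(y)\bigr)}{\sigma_{R(x)}\,\sigma_{R(y)}},
\qquad
\rho = 1 - \frac{6 \sum_{i=1}^{n} d_i^{2}}{n\,(n^{2} - 1)}
\quad\text{with } d_i = R(x_i) - R(y_i) \text{ when there are no ties.}
```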
| | pages | urls | hosts | domains | alexa | cisco | majestic |
---|---|---|---|---|---|---|---|
pages | 1.000 | 1.000 | 0.869 | 0.862 | 0.812 | 0.734 | 0.812 |
urls | 1.000 | 1.000 | 0.870 | 0.864 | 0.812 | 0.734 | 0.813 |
hosts | 0.869 | 0.870 | 1.000 | 0.999 | 0.820 | 0.717 | 0.868 |
domains | 0.862 | 0.864 | 0.999 | 1.000 | 0.819 | 0.721 | 0.873 |
alexa | 0.812 | 0.812 | 0.820 | 0.819 | 1.000 | 0.765 | 0.831 |
cisco | 0.734 | 0.734 | 0.717 | 0.721 | 0.765 | 1.000 | 0.825 |
majestic | 0.812 | 0.813 | 0.868 | 0.873 | 0.831 | 0.825 | 1.000 |

tld | pages | urls | hosts | domains | alexa | cisco | majestic |
---|---|---|---|---|---|---|---|
hu | 0.437 | 0.437 | 0.397 | 0.388 | 0.368 | 0.090 | 0.226 |
bg | 0.127 | 0.127 | 0.059 | 0.058 | 0.108 | 0.026 | 0.043 |
se | 0.629 | 0.628 | 0.686 | 0.690 | 0.406 | 0.126 | 0.356 |
ca | 0.818 | 0.819 | 0.818 | 0.820 | 0.682 | 0.361 | 0.641 |
uk | 2.300 | 2.295 | 2.630 | 2.623 | 1.684 | 1.130 | 2.567 |
jp | 1.821 | 1.826 | 1.546 | 1.534 | 1.074 | 0.360 | 1.525 |
it | 1.621 | 1.622 | 1.584 | 1.582 | 1.130 | 0.428 | 0.985 |
at | 0.400 | 0.400 | 0.587 | 0.582 | 0.310 | 0.087 | 0.275 |
be | 0.483 | 0.483 | 0.601 | 0.585 | 0.331 | 0.147 | 0.283 |
cn | 1.382 | 1.391 | 3.217 | 3.233 | 0.265 | 0.632 | 2.942 |
id | 0.202 | 0.202 | 0.180 | 0.182 | 0.198 | 0.140 | 0.111 |
il | 0.267 | 0.267 | 0.172 | 0.171 | 0.157 | 0.121 | 0.114 |
in | 0.353 | 0.354 | 0.394 | 0.394 | 1.407 | 0.632 | 0.603 |
ir | 0.240 | 0.240 | 0.235 | 0.231 | 1.293 | 0.078 | 0.323 |
kr | 0.343 | 0.345 | 0.386 | 0.351 | 0.258 | 0.100 | 0.144 |
me | 0.265 | 0.266 | 0.224 | 0.231 | 0.359 | 0.787 | 0.206 |
nz | 0.218 | 0.218 | 0.266 | 0.266 | 0.137 | 0.075 | 0.148 |
rs | 0.153 | 0.153 | 0.077 | 0.073 | 0.101 | 0.017 | 0.045 |
ru | 4.513 | 4.518 | 2.987 | 2.991 | 4.498 | 1.440 | 4.311 |
th | 0.093 | 0.093 | 0.079 | 0.074 | 0.141 | 0.053 | 0.053 |
vn | 0.381 | 0.381 | 0.243 | 0.244 | 0.162 | 0.269 | 0.316 |
za | 0.192 | 0.193 | 0.296 | 0.294 | 0.386 | 0.054 | 0.256 |
au | 0.928 | 0.928 | 1.122 | 1.128 | 1.110 | 0.192 | 0.848 |
br | 1.252 | 1.253 | 1.192 | 1.158 | 2.417 | 0.488 | 0.892 |
fr | 1.681 | 1.681 | 1.640 | 1.643 | 0.968 | 0.479 | 0.956 |
pl | 1.479 | 1.479 | 1.360 | 1.346 | 1.551 | 0.339 | 1.154 |
us | 0.224 | 0.224 | 0.252 | 0.253 | 0.335 | 0.396 | 0.660 |
fi | 0.338 | 0.338 | 0.317 | 0.316 | 0.208 | 0.054 | 0.150 |
no | 0.363 | 0.362 | 0.299 | 0.298 | 0.364 | 0.106 | 0.163 |
ar | 0.205 | 0.205 | 0.216 | 0.206 | 0.481 | 0.064 | 0.148 |
ro | 0.458 | 0.458 | 0.310 | 0.306 | 0.589 | 0.065 | 0.282 |
tr | 0.186 | 0.186 | 0.194 | 0.183 | 0.326 | 0.105 | 0.216 |
az | 0.058 | 0.058 | 0.017 | 0.016 | 0.163 | 0.006 | 0.022 |
biz | 0.180 | 0.181 | 0.282 | 0.286 | 0.219 | 0.996 | 0.306 |
by | 0.195 | 0.196 | 0.116 | 0.116 | 0.141 | 0.042 | 0.116 |
cat | 0.164 | 0.164 | 0.068 | 0.065 | 0.051 | 0.011 | 0.048 |
cc | 0.112 | 0.110 | 0.137 | 0.137 | 0.138 | 0.544 | 0.156 |
ch | 0.549 | 0.549 | 0.808 | 0.782 | 0.462 | 0.147 | 0.329 |
ua | 0.758 | 0.760 | 0.429 | 0.435 | 0.592 | 0.259 | 0.402 |
cl | 0.107 | 0.108 | 0.155 | 0.154 | 0.311 | 0.057 | 0.147 |
tw | 0.523 | 0.526 | 0.593 | 0.596 | 0.323 | 0.090 | 0.716 |
co | 0.417 | 0.418 | 0.379 | 0.387 | 0.698 | 0.539 | 0.554 |
com | 46.309 | 46.286 | 47.160 | 47.446 | 47.796 | 58.393 | 49.626 |
ee | 0.194 | 0.192 | 0.096 | 0.090 | 0.048 | 0.014 | 0.051 |
es | 0.900 | 0.900 | 0.725 | 0.727 | 0.698 | 0.227 | 0.525 |
ge | 0.053 | 0.054 | 0.023 | 0.022 | 0.036 | 0.005 | 0.025 |
gr | 0.421 | 0.422 | 0.292 | 0.282 | 0.656 | 0.081 | 0.211 |
hk | 0.088 | 0.089 | 0.076 | 0.073 | 0.097 | 0.062 | 0.088 |
hr | 0.161 | 0.161 | 0.087 | 0.082 | 0.110 | 0.021 | 0.065 |
kz | 0.118 | 0.119 | 0.079 | 0.079 | 0.129 | 0.023 | 0.102 |
lv | 0.148 | 0.148 | 0.075 | 0.071 | 0.058 | 0.014 | 0.053 |
mx | 0.187 | 0.187 | 0.196 | 0.190 | 0.633 | 0.152 | 0.164 |
my | 0.095 | 0.095 | 0.092 | 0.090 | 0.105 | 0.064 | 0.109 |
pt | 0.254 | 0.255 | 0.182 | 0.178 | 0.213 | 0.119 | 0.129 |
sg | 0.065 | 0.065 | 0.070 | 0.070 | 0.098 | 0.049 | 0.083 |
pro | 0.064 | 0.064 | 0.070 | 0.072 | 0.167 | 0.071 | 0.072 |
cz | 0.970 | 0.969 | 0.851 | 0.836 | 0.780 | 0.114 | 0.370 |
de | 3.847 | 3.845 | 5.121 | 4.999 | 3.107 | 0.925 | 2.398 |
dk | 0.395 | 0.394 | 0.537 | 0.538 | 0.317 | 0.125 | 0.240 |
edu | 1.056 | 1.059 | 0.257 | 0.263 | 0.306 | 1.040 | 0.398 |
eu | 0.704 | 0.704 | 0.649 | 0.641 | 0.501 | 0.372 | 0.480 |
gov | 0.358 | 0.357 | 0.055 | 0.053 | 0.107 | 0.511 | 0.172 |
ie | 0.144 | 0.144 | 0.139 | 0.139 | 0.177 | 0.044 | 0.132 |
lt | 0.247 | 0.247 | 0.135 | 0.130 | 0.115 | 0.025 | 0.084 |
info | 1.115 | 1.119 | 0.741 | 0.743 | 0.863 | 0.754 | 1.062 |
io | 0.140 | 0.140 | 0.345 | 0.361 | 0.626 | 0.939 | 0.207 |
is | 0.060 | 0.060 | 0.037 | 0.037 | 0.041 | 0.034 | 0.036 |
net | 3.656 | 3.652 | 3.491 | 3.492 | 4.064 | 15.653 | 4.480 |
nl | 1.360 | 1.359 | 1.944 | 1.930 | 0.568 | 0.453 | 1.119 |
nu | 0.056 | 0.056 | 0.078 | 0.078 | 0.031 | 0.012 | 0.243 |
org | 5.809 | 5.811 | 4.129 | 4.064 | 4.875 | 4.228 | 7.252 |
si | 0.152 | 0.152 | 0.102 | 0.098 | 0.096 | 0.013 | 0.049 |
sk | 0.308 | 0.308 | 0.260 | 0.255 | 0.380 | 0.049 | 0.108 |
su | 0.074 | 0.074 | 0.056 | 0.056 | 0.112 | 0.027 | 0.091 |
tk | 0.064 | 0.064 | 0.073 | 0.076 | 0.054 | 0.013 | 0.101 |
tv | 0.222 | 0.222 | 0.109 | 0.108 | 0.325 | 0.384 | 0.202 |
club | 0.105 | 0.105 | 0.132 | 0.137 | 0.188 | 0.097 | 0.815 |
xyz | 0.097 | 0.097 | 0.170 | 0.175 | 0.221 | 0.109 | 0.189 |
top | 0.063 | 0.063 | 0.272 | 0.282 | 0.079 | 0.054 | 0.211 |
site | 0.075 | 0.075 | 0.478 | 0.500 | 0.103 | 0.036 | 0.167 |
online | 0.063 | 0.063 | 0.080 | 0.083 | 0.169 | 0.060 | 0.106 |
live | 0.151 | 0.152 | 0.303 | 0.320 | 0.057 | 0.033 | 0.052 |
blog | 0.097 | 0.097 | 0.093 | 0.098 | 0.031 | 0.004 | 0.007 |
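The way a correlation matrix follows from the table above can be sketched on a small excerpt. This is an illustrative sketch only: the function names are hypothetical, ties are given average ranks (an assumption, since the handling of ties is not stated), and six rows with three columns will not reproduce the published coefficients, which are based on the full set of TLDs.

```python
from math import sqrt


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; undefined (division by zero) for constant input."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman's rho: Pearson correlation of the rank-transformed values."""
    return pearson(average_ranks(x), average_ranks(y))


# Excerpt of the table above (percent): tld -> (pages, alexa, cisco)
excerpt = {
    "com": (46.309, 47.796, 58.393),
    "org": (5.809, 4.875, 4.228),
    "ru": (4.513, 4.498, 1.440),
    "de": (3.847, 3.107, 0.925),
    "net": (3.656, 4.064, 15.653),
    "uk": (2.300, 1.684, 1.130),
}
columns = ("pages", "alexa", "cisco")
series = {name: [row[i] for row in excerpt.values()]
          for i, name in enumerate(columns)}
matrix = {a: {b: spearman(series[a], series[b]) for b in columns}
          for a in columns}

for a in columns:
    print(a, " ".join(f"{matrix[a][b]:.3f}" for b in columns))
```

Because only ranks enter the calculation, the dominant share of `com` has no more influence on ρ than any other TLD, which is one reason a rank correlation suits such a skewed distribution.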