Statistics of Common Crawl Monthly Archives

Number of pages, distribution of top-level domains, crawl overlaps, etc. - basic metrics about Common Crawl Monthly Crawl Archives
Latest crawl: CC-MAIN-2022-21


Character Encoding of HTML Pages

The character set (encoding) is identified for HTML pages only, using Tika’s AutoDetectReader. The table shows the percentage of HTML pages encoded in each character set for the latest monthly crawls. The underlying data, including page counts, is provided in charsets.csv.
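As a rough illustration of how per-crawl percentages relate to page counts, the following minimal Python sketch divides each charset's page count by the total for its crawl. The column names `crawl`, `charset` and `pages` are assumptions and may not match the actual layout of charsets.csv; the counts and crawl names in `SAMPLE` are made up for illustration only.

```python
import csv
import io
from collections import defaultdict


def charset_percentages(rows):
    """Return {(crawl, charset): percentage of that crawl's pages}.

    `rows` is an iterable of dicts with the keys 'crawl', 'charset' and
    'pages'. These key names are assumed, not taken from charsets.csv.
    """
    totals = defaultdict(int)   # pages per crawl
    counts = defaultdict(int)   # pages per (crawl, charset)
    for row in rows:
        pages = int(row["pages"])
        totals[row["crawl"]] += pages
        counts[(row["crawl"], row["charset"])] += pages
    return {
        key: round(100.0 * n / totals[key[0]], 4)
        for key, n in counts.items()
    }


# Made-up counts, only to show the calculation.
SAMPLE = """crawl,charset,pages
CRAWL-A,UTF-8,900
CRAWL-A,ISO-8859-1,100
CRAWL-B,UTF-8,750
CRAWL-B,ISO-8859-1,250
"""

shares = charset_percentages(csv.DictReader(io.StringIO(SAMPLE)))
for (crawl, charset), pct in sorted(shares.items()):
    print(f"{crawl} {charset} {pct:.4f}")
```

To try this on the real data, the `io.StringIO(SAMPLE)` argument would be replaced by an opened file handle, after adjusting the key names to whatever header the file actually uses.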

charset CC-MAIN-2021-49 CC-MAIN-2022-05 CC-MAIN-2022-21
<other> 0.0004 0.0005 0.0003
<unknown> 2.7912 1.9513 1.8127
Big5 0.0852 0.0627 0.0618
Big5-HKSCS 0.0000 0.0000 0.0000
EUC-JP 0.1607 0.1473 0.1273
EUC-KR 0.1077 0.0960 0.0760
GB18030 0.0310 0.0285 0.0226
GB2312 0.4783 0.4138 0.3621
GBK 0.2355 0.2083 0.1698
IBM420 0.0090 0.0076 0.0059
IBM424 0.0025 0.0019 0.0011
IBM500 0.0009 0.0021 0.0014
IBM855 0.0000 0.0000 0.0000
IBM866 0.0003 0.0003 0.0003
ISO-2022-JP 0.0015 0.0010 0.0012
ISO-8859-1 3.4010 2.8968 2.4839
ISO-8859-13 0.0001 0.0001 0.0001
ISO-8859-15 0.0912 0.0827 0.0734
ISO-8859-2 0.1778 0.1427 0.1136
ISO-8859-3 0.0003 0.0002 0.0007
ISO-8859-4 0.0009 0.0009 0.0009
ISO-8859-5 0.0038 0.0026 0.0023
ISO-8859-6 0.0004 0.0003 0.0003
ISO-8859-7 0.0107 0.0085 0.0085
ISO-8859-8 0.0005 0.0005 0.0003
ISO-8859-9 0.0393 0.0309 0.0262
KOI8-R 0.0086 0.0072 0.0055
KOI8-U 0.0001 0.0001 0.0001
Shift_JIS 0.2596 0.2140 0.1892
TIS-620 0.0102 0.0089 0.0074
US-ASCII 0.0416 0.0412 0.0311
UTF-16 0.0025 0.0023 0.0021
UTF-16BE 0.0001 0.0001 0.0002
UTF-16LE 0.0473 0.0017 0.0011
UTF-32 0.0000 0.0000 0.0001
UTF-8 90.7609 92.5441 93.4445
windows-1250 0.0979 0.0889 0.0778
windows-1251 0.7218 0.6597 0.6017
windows-1252 0.2842 0.2399 0.2018
windows-1253 0.0037 0.0035 0.0029
windows-1254 0.0198 0.0156 0.0140
windows-1255 0.0163 0.0127 0.0101
windows-1256 0.0679 0.0512 0.0391
windows-1257 0.0090 0.0061 0.0064
windows-31j 0.0022 0.0023 0.0011
x-windows-874 0.0158 0.0130 0.0114
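
The trend across crawls can be read off the table by comparing the first and last columns. The short Python sketch below does this for three rows copied from the table above; the helper name `point_change` is hypothetical, and the result is a difference in percentage points, not a relative change.

```python
# Crawl identifiers in column order, as in the table above.
CRAWLS = ["CC-MAIN-2021-49", "CC-MAIN-2022-05", "CC-MAIN-2022-21"]

# Percentages copied from three rows of the table above.
SHARES = {
    "UTF-8": [90.7609, 92.5441, 93.4445],
    "ISO-8859-1": [3.4010, 2.8968, 2.4839],
    "windows-1251": [0.7218, 0.6597, 0.6017],
}


def point_change(values):
    """Difference in percentage points between the last and first crawl."""
    return round(values[-1] - values[0], 4)


for charset, values in SHARES.items():
    print(f"{charset}: {point_change(values):+.4f} points "
          f"from {CRAWLS[0]} to {CRAWLS[-1]}")
```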