Statistics of Common Crawl Monthly Archives

Number of pages, distribution of top-level domains, crawl overlaps, etc. - basic metrics about Common Crawl Monthly Crawl Archives
Latest crawl: CC-MAIN-2025-51


Character Encoding of HTML Pages

The character set (encoding) is identified by Tika’s AutoDetectReader, and only for HTML pages. The table shows, for each of the latest monthly crawls, the percentage of crawled HTML pages detected with each character set. The underlying data, including page counts, is provided in charsets.csv.
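As a rough illustration of how such percentages relate to page counts, the sketch below turns per-crawl counts into shares. The column names `crawl`, `charset`, and `pages`, the crawl labels, and the sample counts are all assumptions made for this example; check the header of charsets.csv before reusing it.

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample in an assumed layout (crawl, charset, pages).
# The real charsets.csv may use different column names.
SAMPLE = """crawl,charset,pages
CC-MAIN-EXAMPLE-A,UTF-8,900
CC-MAIN-EXAMPLE-A,ISO-8859-1,60
CC-MAIN-EXAMPLE-A,<unknown>,40
CC-MAIN-EXAMPLE-B,UTF-8,450
CC-MAIN-EXAMPLE-B,ISO-8859-1,50
"""


def charset_percentages(csv_file):
    """Return {crawl: {charset: percent of that crawl's pages}}."""
    counts = defaultdict(lambda: defaultdict(int))
    for row in csv.DictReader(csv_file):
        # Accumulate, in case a charset is listed more than once per crawl.
        counts[row["crawl"]][row["charset"]] += int(row["pages"])

    result = {}
    for crawl, by_charset in counts.items():
        total = sum(by_charset.values())
        result[crawl] = {
            name: round(100.0 * pages / total, 4)
            for name, pages in by_charset.items()
        }
    return result


shares = charset_percentages(io.StringIO(SAMPLE))
for crawl, by_charset in shares.items():
    print(crawl, by_charset)
```

To run this against the real file, pass an open file handle instead of the `io.StringIO` sample and adjust the column names as needed.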

charset CC-MAIN-2025-43 CC-MAIN-2025-47 CC-MAIN-2025-51
<other> 0.0000 0.0001 0.0001
<unknown> 1.4313 1.7191 1.7917
Big5 0.0430 0.0464 0.0572
Big5-HKSCS 0.0001 0.0000 0.0000
EUC-JP 0.1202 0.1345 0.1365
EUC-KR 0.0702 0.0783 0.0804
GB18030 0.0130 0.0160 0.0210
GB2312 0.2100 0.2491 0.2997
GBK 0.0968 0.1042 0.1178
IBM420 0.0033 0.0038 0.0053
IBM424 0.0015 0.0014 0.0015
IBM500 0.0010 0.0012 0.0011
IBM855 0.0000 NaN 0.0000
IBM866 0.0002 0.0002 0.0002
ISO-2022-JP 0.0009 0.0011 0.0011
ISO-8859-1 5.4660 5.7471 5.5513
ISO-8859-13 0.0001 0.0001 0.0001
ISO-8859-15 0.0403 0.0449 0.0460
ISO-8859-16 0.0002 0.0002 0.0002
ISO-8859-2 0.0811 0.0888 0.0851
ISO-8859-3 0.0003 0.0004 0.0000
ISO-8859-4 0.0006 0.0007 0.0007
ISO-8859-5 0.0010 0.0012 0.0011
ISO-8859-6 0.0000 0.0000 0.0000
ISO-8859-7 0.0044 0.0045 0.0046
ISO-8859-8 0.0006 0.0007 0.0008
ISO-8859-9 0.0212 0.0219 0.0223
KOI8-R 0.0065 0.0070 0.0062
KOI8-U 0.0000 0.0000 0.0000
Shift_JIS 0.1460 0.1550 0.1633
TIS-620 0.0047 0.0041 0.0043
US-ASCII 0.0153 0.0184 0.0207
UTF-16 0.0043 0.0048 0.0036
UTF-16BE 0.0002 0.0004 0.0002
UTF-16LE 0.0009 0.0011 0.0011
UTF-32 0.0000 0.0001 0.0000
UTF-32LE 0.0002 0.0002 0.0002
UTF-8 91.5255 90.7888 90.8089
windows-1250 0.0655 0.0706 0.0711
windows-1251 0.4347 0.4767 0.4955
windows-1252 0.1281 0.1405 0.1331
windows-1253 0.0019 0.0021 0.0022
windows-1254 0.0102 0.0108 0.0099
windows-1255 0.0060 0.0065 0.0064
windows-1256 0.0297 0.0322 0.0329
windows-1257 0.0059 0.0067 0.0064
windows-31j 0.0004 0.0004 0.0004
x-iso-8859-11 0.0001 0.0001 0.0001
x-windows-874 0.0068 0.0075 0.0075
x-windows-949 0.0000 0.0000 0.0000
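
To compare crawls, the rows of the table can be parsed as whitespace-separated text. The sketch below is one way to do this, using a few rows copied from the table; the helper names `parse_rows` and `change` are made up for illustration, and NaN cells are treated as missing values.

```python
import math

CRAWLS = ["CC-MAIN-2025-43", "CC-MAIN-2025-47", "CC-MAIN-2025-51"]

# A few rows copied verbatim from the table.
ROWS = """\
<unknown> 1.4313 1.7191 1.7917
IBM855 0.0000 NaN 0.0000
ISO-8859-1 5.4660 5.7471 5.5513
UTF-8 91.5255 90.7888 90.8089
windows-1251 0.4347 0.4767 0.4955
"""


def parse_rows(text):
    """Map each charset to {crawl: percentage}; NaN cells become float('nan')."""
    table = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        name, *cells = line.split()
        table[name] = dict(zip(CRAWLS, (float(cell) for cell in cells)))
    return table


def change(table, charset, first=CRAWLS[0], last=CRAWLS[-1]):
    """Percentage-point change between two crawls, or None if a value is missing."""
    a, b = table[charset][first], table[charset][last]
    if math.isnan(a) or math.isnan(b):
        return None
    return round(b - a, 4)


table = parse_rows(ROWS)
for charset in table:
    print(charset, change(table, charset))
```

On these rows, the UTF-8 share falls by about 0.72 percentage points between CC-MAIN-2025-43 and CC-MAIN-2025-51, while the share of pages with an unknown encoding rises by about 0.36 points.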