Statistics of Common Crawl Monthly Archives

Number of pages, distribution of top-level domains, crawl overlaps, etc. - basic metrics about Common Crawl Monthly Crawl Archives
Latest crawl: CC-MAIN-2025-18


Character Encoding of HTML Pages

The character set (encoding) is identified by Tika’s AutoDetectReader, and only for HTML pages. The table shows, for each of the latest monthly crawls, the percentage of HTML pages encoded in each character set. The underlying data, including page counts, is provided in charsets.csv.
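As a rough illustration of how percentages like these can be derived from per-charset page counts, here is a minimal Python sketch. The column names `crawl`, `charset`, and `pages`, the crawl label `CC-MAIN-EXAMPLE`, and the sample counts are all assumptions made up for this example; the actual layout of charsets.csv may differ.

```python
import csv
import io
from collections import defaultdict


def charset_percentages(csv_text):
    """Return {crawl: {charset: percentage of that crawl's pages}}.

    Assumes (hypothetically) one row per crawl/charset pair with the
    columns 'crawl', 'charset' and 'pages'.
    """
    counts = defaultdict(lambda: defaultdict(int))
    for row in csv.DictReader(io.StringIO(csv_text)):
        counts[row["crawl"]][row["charset"]] += int(row["pages"])

    result = {}
    for crawl, per_charset in counts.items():
        total = sum(per_charset.values())
        # Guard against a crawl whose rows all report zero pages.
        result[crawl] = {
            cs: (100.0 * n / total if total else 0.0)
            for cs, n in per_charset.items()
        }
    return result


# Made-up sample data, not taken from charsets.csv.
sample = """crawl,charset,pages
CC-MAIN-EXAMPLE,UTF-8,940
CC-MAIN-EXAMPLE,ISO-8859-1,26
CC-MAIN-EXAMPLE,windows-1251,18
CC-MAIN-EXAMPLE,<unknown>,16
"""

shares = charset_percentages(sample)
for charset, pct in sorted(shares["CC-MAIN-EXAMPLE"].items()):
    print(f"{charset:<14}{pct:8.4f}")
```

Percentages are computed per crawl, so each crawl's column sums to 100 regardless of how many pages that crawl contains.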

charset CC-MAIN-2025-08 CC-MAIN-2025-13 CC-MAIN-2025-18
<other> 0.0000 0.0000 0.0000
<unknown> 1.6852 1.6162 1.5980
Big5 0.0498 0.0461 0.0449
Big5-HKSCS 0.0000 0.0000 0.0000
EUC-JP 0.1266 0.1206 0.1239
EUC-KR 0.0962 0.0914 0.0910
GB18030 0.0151 0.0148 0.0132
GB2312 0.2352 0.2143 0.2036
GBK 0.1165 0.1087 0.0981
IBM420 0.0047 0.0042 0.0044
IBM424 0.0016 0.0014 0.0014
IBM500 0.0015 0.0009 0.0009
IBM855 0.0000 0.0000 0.0000
IBM866 0.0003 0.0002 0.0002
ISO-2022-JP 0.0011 0.0011 0.0011
ISO-8859-1 2.5702 2.8223 2.6461
ISO-8859-13 0.0001 0.0001 0.0001
ISO-8859-15 0.0468 0.0454 0.0470
ISO-8859-16 0.0003 0.0004 0.0004
ISO-8859-2 0.1255 0.1024 0.1063
ISO-8859-3 0.0000 0.0000 0.0000
ISO-8859-4 0.0007 0.0006 0.0008
ISO-8859-5 0.0011 0.0021 0.0023
ISO-8859-6 0.0000 0.0000 0.0000
ISO-8859-7 0.0086 0.0070 0.0066
ISO-8859-8 0.0007 0.0007 0.0008
ISO-8859-9 0.0264 0.0230 0.0258
KOI8-R 0.0079 0.0062 0.0064
KOI8-U 0.0000 0.0000 0.0001
Shift_JIS 0.1750 0.1750 0.1834
TIS-620 0.0050 0.0049 0.0047
US-ASCII 0.0282 0.0232 0.0236
UTF-16 0.0059 0.0056 0.0059
UTF-16BE 0.0003 0.0003 0.0001
UTF-16LE 0.0011 0.0011 0.0009
UTF-32 0.0001 0.0000 0.0000
UTF-32LE 0.0003 0.0001 0.0003
UTF-8 93.8159 93.7645 93.9793
windows-1250 0.0729 0.0727 0.0745
windows-1251 0.5222 0.5045 0.4835
windows-1252 0.1493 0.1398 0.1420
windows-1253 0.0031 0.0025 0.0024
windows-1254 0.0142 0.0127 0.0121
windows-1255 0.0092 0.0070 0.0077
windows-1256 0.0521 0.0395 0.0390
windows-1257 0.0118 0.0077 0.0077
windows-1258 0.0000 0.0000 0.0000
windows-31j 0.0006 0.0005 0.0005
x-iso-8859-11 0.0001 0.0001 0.0001
x-windows-874 0.0106 0.0081 0.0088
x-windows-949 0.0000 0.0000 0.0000
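To compare crawls, the rows of the table can be parsed and the change in percentage points between the first and last listed crawl computed. The sketch below is only an illustration: it copies four rows from the table above, and the helper name `parse_rows` is hypothetical.

```python
def parse_rows(text):
    """Parse 'charset v1 v2 v3' lines into {charset: [v1, v2, v3]}."""
    table = {}
    for line in text.strip().splitlines():
        # Split from the right so the charset name stays intact.
        name, *values = line.rsplit(None, 3)
        table[name] = [float(v) for v in values]
    return table


# Four rows copied from the table (CC-MAIN-2025-08, -13, -18).
rows = """\
UTF-8 93.8159 93.7645 93.9793
ISO-8859-1 2.5702 2.8223 2.6461
<unknown> 1.6852 1.6162 1.5980
windows-1251 0.5222 0.5045 0.4835
"""

table = parse_rows(rows)

# Change in percentage points from the first to the last listed crawl.
changes = {name: round(v[-1] - v[0], 4) for name, v in table.items()}

for name, delta in sorted(changes.items(), key=lambda kv: kv[1]):
    print(f"{name:<14}{delta:+.4f}")
```

The differences are percentage points, not relative changes; a small absolute shift for a rare charset can still be a large relative one.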