Statistics of Common Crawl Monthly Archives

Number of pages, distribution of top-level domains, crawl overlaps, etc. - basic metrics about Common Crawl Monthly Crawl Archives
Latest crawl: CC-MAIN-2025-13


Character Encoding of HTML Pages

The character set (encoding) is identified for HTML pages only, using Tika’s AutoDetectReader. The table shows the percentage of HTML pages encoded in each character set for the three latest monthly crawls. The underlying data, including page counts, is provided in charsets.csv.

charset CC-MAIN-2025-05 CC-MAIN-2025-08 CC-MAIN-2025-13
<other> 0.0000 0.0000 0.0000
<unknown> 1.4664 1.6852 1.6162
Big5 0.0463 0.0498 0.0461
Big5-HKSCS 0.0000 0.0000 0.0000
EUC-JP 0.1181 0.1266 0.1206
EUC-KR 0.0839 0.0962 0.0914
GB18030 0.0129 0.0151 0.0148
GB2312 0.2164 0.2352 0.2143
GBK 0.1015 0.1165 0.1087
IBM420 0.0041 0.0047 0.0042
IBM424 0.0014 0.0016 0.0014
IBM500 0.0013 0.0015 0.0009
IBM855 0.0000 0.0000 0.0000
IBM866 0.0003 0.0003 0.0002
ISO-2022-JP 0.0009 0.0011 0.0011
ISO-8859-1 2.2951 2.5702 2.8223
ISO-8859-13 0.0001 0.0001 0.0001
ISO-8859-15 0.0433 0.0468 0.0454
ISO-8859-16 0.0002 0.0003 0.0004
ISO-8859-2 0.1059 0.1255 0.1024
ISO-8859-3 0.0001 0.0000 0.0000
ISO-8859-4 0.0008 0.0007 0.0006
ISO-8859-5 0.0012 0.0011 0.0021
ISO-8859-6 0.0000 0.0000 0.0000
ISO-8859-7 0.0060 0.0086 0.0070
ISO-8859-8 0.0008 0.0007 0.0007
ISO-8859-9 0.0228 0.0264 0.0230
KOI8-R 0.0057 0.0079 0.0062
KOI8-U 0.0000 0.0000 0.0000
Shift_JIS 0.1589 0.1750 0.1750
TIS-620 0.0047 0.0050 0.0049
US-ASCII 0.0314 0.0282 0.0232
UTF-16 0.0051 0.0059 0.0056
UTF-16BE 0.0002 0.0003 0.0003
UTF-16LE 0.0010 0.0011 0.0011
UTF-32 0.0000 0.0001 0.0000
UTF-32LE 0.0003 0.0003 0.0001
UTF-8 94.4849 93.8159 93.7645
windows-1250 0.0677 0.0729 0.0727
windows-1251 0.4926 0.5222 0.5045
windows-1252 0.1395 0.1493 0.1398
windows-1253 0.0025 0.0031 0.0025
windows-1254 0.0123 0.0142 0.0127
windows-1255 0.0074 0.0092 0.0070
windows-1256 0.0404 0.0521 0.0395
windows-1257 0.0071 0.0118 0.0077
windows-1258 0.0000 0.0000 0.0000
windows-31j 0.0005 0.0006 0.0005
x-iso-8859-11 0.0000 0.0001 0.0001
x-windows-874 0.0082 0.0106 0.0081
x-windows-949 0.0000 0.0000 0.0000
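
The percentages in the table can be derived from per-crawl page counts. The following is a minimal sketch of that calculation, not the project's actual code: the column names `crawl`, `charset` and `pages`, the crawl labels and the counts in `SAMPLE_CSV` are all hypothetical, and the real charsets.csv may be laid out differently.

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample in an assumed layout (crawl, charset, pages).
# The real charsets.csv may use different column names and values.
SAMPLE_CSV = """crawl,charset,pages
CC-EXAMPLE-A,UTF-8,900
CC-EXAMPLE-A,ISO-8859-1,75
CC-EXAMPLE-A,<unknown>,25
CC-EXAMPLE-B,UTF-8,450
CC-EXAMPLE-B,windows-1251,50
"""


def charset_percentages(rows):
    """Return {crawl: {charset: percentage of that crawl's pages}}."""
    # Sum page counts per crawl and charset.
    counts = defaultdict(dict)
    for row in rows:
        by_charset = counts[row["crawl"]]
        by_charset[row["charset"]] = by_charset.get(row["charset"], 0) + int(row["pages"])

    # Normalize each crawl separately so that its shares add up to 100.
    result = {}
    for crawl, by_charset in counts.items():
        total = sum(by_charset.values())
        result[crawl] = {
            charset: round(100.0 * pages / total, 4)
            for charset, pages in by_charset.items()
        }
    return result


percentages = charset_percentages(csv.DictReader(io.StringIO(SAMPLE_CSV)))

# Print one line per crawl and charset, with four decimals as in the table.
for crawl, shares in sorted(percentages.items()):
    for charset, pct in sorted(shares.items()):
        print(f"{crawl} {charset} {pct:.4f}")
```

Each crawl is normalized on its own, so every column of the resulting table sums to roughly 100. With four decimal places, a charset shown as 0.0000 may still have a small nonzero page count in the underlying data.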