Statistics of Common Crawl Monthly Archives

Number of pages, distribution of top-level domains, crawl overlaps, etc. - basic metrics about Common Crawl Monthly Crawl Archives
Latest crawl: CC-MAIN-2025-08


Character Encoding of HTML Pages

The character set (encoding) is identified for HTML pages only, using Tika’s AutoDetectReader. The table shows the percentage of HTML pages encoded with each character set in the latest monthly crawls. The underlying data, including page counts, is provided in charsets.csv.
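As a rough illustration of how such percentages can be derived from page counts, the sketch below assumes a hypothetical layout with one row per crawl and charset (columns `crawl`, `charset`, `pages`). The column names, the crawl labels, and the counts are all made up for the example; the actual charsets.csv may be organized differently.

```python
import csv
import io
from collections import defaultdict

# Hypothetical layout and toy counts (NOT real Common Crawl data).
# The real charsets.csv may use different column names.
SAMPLE = """crawl,charset,pages
CC-MAIN-EXAMPLE-A,UTF-8,900
CC-MAIN-EXAMPLE-A,ISO-8859-1,75
CC-MAIN-EXAMPLE-A,<unknown>,25
CC-MAIN-EXAMPLE-B,UTF-8,1900
CC-MAIN-EXAMPLE-B,ISO-8859-1,100
"""


def charset_percentages(csv_text):
    """Return {crawl: {charset: percentage of that crawl's pages}}."""
    counts = defaultdict(dict)
    for row in csv.DictReader(io.StringIO(csv_text)):
        counts[row["crawl"]][row["charset"]] = int(row["pages"])

    result = {}
    for crawl, per_charset in counts.items():
        total = sum(per_charset.values())
        # Round to four decimals, matching the precision of the table.
        result[crawl] = {
            charset: round(100.0 * pages / total, 4)
            for charset, pages in per_charset.items()
        }
    return result


shares = charset_percentages(SAMPLE)
for crawl, per_charset in shares.items():
    print(crawl, per_charset)
```

Each percentage is computed per crawl, so the values within one crawl add up to 100 apart from rounding.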

charset CC-MAIN-2024-51 CC-MAIN-2025-05 CC-MAIN-2025-08
UTF-8 93.8442 94.4849 93.8159
ISO-8859-1 2.4097 2.2951 2.5702
<unknown> 1.7906 1.4664 1.6852
windows-1251 0.5437 0.4926 0.5222
GB2312 0.2477 0.2164 0.2352
Shift_JIS 0.1821 0.1589 0.1750
windows-1252 0.1628 0.1395 0.1493
EUC-JP 0.1279 0.1181 0.1266
GBK 0.1194 0.1015 0.1165
ISO-8859-2 0.1165 0.1059 0.1255
EUC-KR 0.0917 0.0839 0.0962
windows-1250 0.0734 0.0677 0.0729
ISO-8859-15 0.0496 0.0433 0.0468
Big5 0.0458 0.0463 0.0498
US-ASCII 0.0407 0.0314 0.0282
windows-1256 0.0392 0.0404 0.0521
ISO-8859-9 0.0221 0.0228 0.0264
GB18030 0.0152 0.0129 0.0151
windows-1254 0.0127 0.0123 0.0142
x-windows-874 0.0082 0.0082 0.0106
windows-1255 0.0076 0.0074 0.0092
ISO-8859-7 0.0065 0.0060 0.0086
windows-1257 0.0064 0.0071 0.0118
IBM420 0.0063 0.0041 0.0047
KOI8-R 0.0061 0.0057 0.0079
TIS-620 0.0056 0.0047 0.0050
UTF-16 0.0055 0.0051 0.0059
windows-1253 0.0027 0.0025 0.0031
IBM424 0.0017 0.0014 0.0016
IBM500 0.0012 0.0013 0.0015
ISO-2022-JP 0.0012 0.0009 0.0011
ISO-8859-5 0.0012 0.0012 0.0011
UTF-16LE 0.0011 0.0010 0.0011
ISO-8859-4 0.0008 0.0008 0.0007
ISO-8859-8 0.0008 0.0008 0.0007
ISO-8859-3 0.0005 0.0001 0.0000
windows-31j 0.0005 0.0005 0.0006
IBM866 0.0003 0.0003 0.0003
ISO-8859-16 0.0002 0.0002 0.0003
UTF-16BE 0.0002 0.0002 0.0003
ISO-8859-13 0.0001 0.0001 0.0001
KOI8-U 0.0001 0.0000 0.0000
UTF-32LE 0.0001 0.0003 0.0003
<other> 0.0000 0.0000 0.0000
Big5-HKSCS 0.0000 0.0000 0.0000
IBM855 0.0000 0.0000 0.0000
ISO-8859-6 0.0000 0.0000 0.0000
UTF-32 0.0000 0.0000 0.0001
windows-1258 0.0000 0.0000 0.0000
x-iso-8859-11 0.0000 0.0000 0.0001
x-windows-949 0.0000 0.0000 0.0000
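
The rows above can also be processed directly as whitespace-separated text. The following is a minimal sketch, assuming the rows are copied into a string as shown; the helper names `parse_rows` and `non_utf8_share` are hypothetical, and only the first four rows of the table are included for brevity.

```python
# Crawl identifiers in the column order used by the table.
CRAWLS = ["CC-MAIN-2024-51", "CC-MAIN-2025-05", "CC-MAIN-2025-08"]

# First four data rows copied from the table (values are percentages).
ROWS = """\
UTF-8 93.8442 94.4849 93.8159
ISO-8859-1 2.4097 2.2951 2.5702
<unknown> 1.7906 1.4664 1.6852
windows-1251 0.5437 0.4926 0.5222
"""


def parse_rows(text):
    """Parse 'charset v1 v2 v3' lines into {charset: {crawl: percentage}}."""
    table = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        charset, *values = line.split()
        table[charset] = dict(zip(CRAWLS, (float(v) for v in values)))
    return table


def non_utf8_share(table, crawl):
    """Percentage of HTML pages not identified as UTF-8 in the given crawl."""
    return round(100.0 - table["UTF-8"][crawl], 4)


table = parse_rows(ROWS)
for crawl in CRAWLS:
    print(crawl, non_utf8_share(table, crawl))
```

Note that the non-UTF-8 share computed this way includes pages whose encoding was reported as `<unknown>`.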