Statistics of Common Crawl Monthly Archives

Number of pages, distribution of top-level domains, crawl overlaps, etc. - basic metrics about Common Crawl Monthly Crawl Archives
Latest crawl: CC-MAIN-2025-08


Character Encoding of HTML Pages

The character set (encoding) is identified for HTML pages only, using Tika’s AutoDetectReader. The table shows, for each of the latest monthly crawls, the percentage of crawled HTML pages encoded in each character set. The underlying data, including page counts, is provided in charsets.csv.
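The percentages can in principle be recomputed from per-crawl page counts. The following is a minimal sketch of that arithmetic, not the project's actual code: the function name `charset_percentages` and the `(crawl, charset, pages)` row layout are assumptions, since the column layout of charsets.csv is not described here, and the counts in the usage lines are made-up toy values.

```python
from collections import defaultdict


def charset_percentages(rows):
    """Turn (crawl, charset, pages) rows into {crawl: {charset: percent}}.

    The row layout is an assumption for illustration; adapt it to the
    actual columns of the data file.
    """
    rows = list(rows)  # the rows are iterated twice

    # First pass: total number of pages per crawl.
    totals = defaultdict(int)
    for crawl, _charset, pages in rows:
        totals[crawl] += pages

    # Second pass: share of each charset within its crawl, in percent.
    result = {}
    for crawl, charset, pages in rows:
        share = 100.0 * pages / totals[crawl]
        result.setdefault(crawl, {})[charset] = round(share, 4)
    return result


# Toy counts, NOT real Common Crawl numbers.
toy_rows = [
    ("crawl-A", "UTF-8", 940),
    ("crawl-A", "ISO-8859-1", 60),
]
toy_percentages = charset_percentages(toy_rows)
```

Rounding to four decimal places matches the precision of the table; a charset listed as 0.0000 may therefore still have a small non-zero page count.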

charset CC-MAIN-2024-51 CC-MAIN-2025-05 CC-MAIN-2025-08
<other> 0.0000 0.0000 0.0000
<unknown> 1.7906 1.4664 1.6852
Big5 0.0458 0.0463 0.0498
Big5-HKSCS 0.0000 0.0000 0.0000
EUC-JP 0.1279 0.1181 0.1266
EUC-KR 0.0917 0.0839 0.0962
GB18030 0.0152 0.0129 0.0151
GB2312 0.2477 0.2164 0.2352
GBK 0.1194 0.1015 0.1165
IBM420 0.0063 0.0041 0.0047
IBM424 0.0017 0.0014 0.0016
IBM500 0.0012 0.0013 0.0015
IBM855 0.0000 0.0000 0.0000
IBM866 0.0003 0.0003 0.0003
ISO-2022-JP 0.0012 0.0009 0.0011
ISO-8859-1 2.4097 2.2951 2.5702
ISO-8859-13 0.0001 0.0001 0.0001
ISO-8859-15 0.0496 0.0433 0.0468
ISO-8859-16 0.0002 0.0002 0.0003
ISO-8859-2 0.1165 0.1059 0.1255
ISO-8859-3 0.0005 0.0001 0.0000
ISO-8859-4 0.0008 0.0008 0.0007
ISO-8859-5 0.0012 0.0012 0.0011
ISO-8859-6 0.0000 0.0000 0.0000
ISO-8859-7 0.0065 0.0060 0.0086
ISO-8859-8 0.0008 0.0008 0.0007
ISO-8859-9 0.0221 0.0228 0.0264
KOI8-R 0.0061 0.0057 0.0079
KOI8-U 0.0001 0.0000 0.0000
Shift_JIS 0.1821 0.1589 0.1750
TIS-620 0.0056 0.0047 0.0050
US-ASCII 0.0407 0.0314 0.0282
UTF-16 0.0055 0.0051 0.0059
UTF-16BE 0.0002 0.0002 0.0003
UTF-16LE 0.0011 0.0010 0.0011
UTF-32 0.0000 0.0000 0.0001
UTF-32LE 0.0001 0.0003 0.0003
UTF-8 93.8442 94.4849 93.8159
windows-1250 0.0734 0.0677 0.0729
windows-1251 0.5437 0.4926 0.5222
windows-1252 0.1628 0.1395 0.1493
windows-1253 0.0027 0.0025 0.0031
windows-1254 0.0127 0.0123 0.0142
windows-1255 0.0076 0.0074 0.0092
windows-1256 0.0392 0.0404 0.0521
windows-1257 0.0064 0.0071 0.0118
windows-1258 0.0000 0.0000 0.0000
windows-31j 0.0005 0.0005 0.0006
x-iso-8859-11 0.0000 0.0000 0.0001
x-windows-874 0.0082 0.0082 0.0106
x-windows-949 0.0000 0.0000 0.0000
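
To work with the table as plain text, each data row can be split into a charset name followed by one percentage per crawl. The sketch below assumes whitespace-separated rows as shown above and charset names without spaces; `parse_row` and `top_charsets` are hypothetical helper names, and the sample lines are copied from the table.

```python
def parse_row(line):
    """Split a data row into (charset, [percent per crawl])."""
    name, *values = line.split()
    return name, [float(v) for v in values]


def top_charsets(lines, column=-1, n=3):
    """Return the n charsets with the highest share in one crawl column.

    column=-1 selects the last column, i.e. the most recent crawl.
    Pass data rows only; the header line is not numeric.
    """
    rows = [parse_row(line) for line in lines if line.strip()]
    rows.sort(key=lambda row: row[1][column], reverse=True)
    return [(name, values[column]) for name, values in rows[:n]]


# A few rows copied from the table above.
sample = [
    "<unknown> 1.7906 1.4664 1.6852",
    "ISO-8859-1 2.4097 2.2951 2.5702",
    "UTF-8 93.8442 94.4849 93.8159",
    "windows-1251 0.5437 0.4926 0.5222",
]
print(top_charsets(sample))
```

Note that `<other>` and `<unknown>` are parsed like any other name, so they may need to be filtered out before ranking actual encodings.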