Statistics of Common Crawl Monthly Archives

Number of pages, distribution of top-level domains, crawl overlaps, etc. - basic metrics about Common Crawl Monthly Crawl Archives
Latest crawl: CC-MAIN-2025-30


Character Encoding of HTML Pages

The character set (encoding) is identified for HTML pages only, using Tika’s AutoDetectReader. The table shows, for each of the latest monthly crawls, the percentage of crawled HTML pages encoded in each character set. The underlying data, including page counts, is provided in charsets.csv.
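As a rough illustration of how percentage shares relate to page counts, the following Python sketch converts per-crawl counts into percentages. The column names (`crawl`, `charset`, `pages`), the crawl labels, and the counts are hypothetical and not taken from charsets.csv, whose actual layout may differ.

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample in an assumed layout; NOT the real charsets.csv content.
SAMPLE_CSV = """crawl,charset,pages
CRAWL-A,UTF-8,900
CRAWL-A,ISO-8859-1,75
CRAWL-A,<unknown>,25
CRAWL-B,UTF-8,450
CRAWL-B,ISO-8859-1,50
"""


def charset_percentages(csv_text):
    """Return {crawl: {charset: percentage of that crawl's pages}}."""
    counts = defaultdict(dict)
    for row in csv.DictReader(io.StringIO(csv_text)):
        counts[row["crawl"]][row["charset"]] = int(row["pages"])

    shares = {}
    for crawl, per_charset in counts.items():
        # Each crawl is normalized by its own total page count.
        total = sum(per_charset.values())
        shares[crawl] = {
            charset: round(100.0 * pages / total, 4)
            for charset, pages in per_charset.items()
        }
    return shares


if __name__ == "__main__":
    for crawl, per_charset in charset_percentages(SAMPLE_CSV).items():
        for charset, pct in sorted(per_charset.items()):
            print(f"{crawl}\t{charset}\t{pct:.4f}")
```

Each crawl is normalized by its own total, so the values within one crawl column sum to 100 up to rounding.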

charset CC-MAIN-2025-21 CC-MAIN-2025-26 CC-MAIN-2025-30
<other> 0.0000 0.0000 0.0000
<unknown> 1.4997 1.5590 1.5919
Big5 0.0434 0.0433 0.0418
Big5-HKSCS 0.0001 0.0001 0.0001
EUC-JP 0.1120 0.1219 0.1282
EUC-KR 0.0760 0.0816 0.0813
GB18030 0.0123 0.0130 0.0160
GB2312 0.1919 0.2126 0.2133
GBK 0.0990 0.1088 0.1091
IBM420 0.0036 0.0035 0.0036
IBM424 0.0012 0.0012 0.0011
IBM500 0.0009 0.0013 0.0006
IBM855 0.0000 0.0000 0.0000
IBM866 0.0002 0.0002 0.0001
ISO-2022-JP 0.0009 0.0010 0.0009
ISO-8859-1 2.5288 7.0541 5.4270
ISO-8859-13 0.0001 0.0001 0.0001
ISO-8859-15 0.0475 0.0488 0.0491
ISO-8859-16 0.0002 0.0002 0.0001
ISO-8859-2 0.0982 0.0963 0.0935
ISO-8859-3 0.0003 0.0003 0.0002
ISO-8859-4 0.0006 0.0006 0.0005
ISO-8859-5 0.0022 0.0017 0.0014
ISO-8859-6 0.0000 0.0000 0.0000
ISO-8859-7 0.0051 0.0049 0.0050
ISO-8859-8 0.0007 0.0007 0.0007
ISO-8859-9 0.0242 0.0221 0.0230
KOI8-R 0.0074 0.0073 0.0072
KOI8-U 0.0000 0.0001 0.0000
Shift_JIS 0.1556 0.1548 0.1565
TIS-620 0.0046 0.0036 0.0045
US-ASCII 0.0221 0.0207 0.0203
UTF-16 0.0050 0.0046 0.0044
UTF-16BE 0.0002 0.0002 0.0001
UTF-16LE 0.0010 0.0010 0.0011
UTF-32 0.0001 0.0001 0.0000
UTF-32LE 0.0002 0.0003 0.0003
UTF-8 94.3166 89.6676 91.2574
windows-1250 0.0663 0.0687 0.0694
windows-1251 0.4647 0.4832 0.4884
windows-1252 0.1370 0.1418 0.1336
windows-1253 0.0022 0.0023 0.0025
windows-1254 0.0106 0.0096 0.0092
windows-1255 0.0072 0.0070 0.0059
windows-1256 0.0350 0.0349 0.0357
windows-1257 0.0067 0.0065 0.0062
windows-1258 0.0000 0.0000 0.0000
windows-31j 0.0005 0.0006 0.0004
x-MacCyrillic NaN NaN 0.0000
x-iso-8859-11 0.0001 0.0001 0.0000
x-windows-874 0.0079 0.0075 0.0079
x-windows-949 0.0000 0.0000 0.0000