Statistics of Common Crawl Monthly Archives

Number of pages, distribution of top-level domains, crawl overlaps, etc. - basic metrics about Common Crawl Monthly Crawl Archives
Latest crawl: CC-MAIN-2025-26


Character Encoding of HTML Pages

The character set (encoding) is identified for HTML pages only, using Tika’s AutoDetectReader. The table shows, for each of the latest monthly crawls, the percentage of crawled HTML pages encoded in each character set. The underlying data, including page counts, is provided in charsets.csv.

charset CC-MAIN-2025-18 CC-MAIN-2025-21 CC-MAIN-2025-26
<other> 0.0000 0.0000 0.0000
<unknown> 1.5980 1.4997 1.5590
Big5 0.0449 0.0434 0.0433
Big5-HKSCS 0.0000 0.0001 0.0001
EUC-JP 0.1239 0.1120 0.1219
EUC-KR 0.0910 0.0760 0.0816
GB18030 0.0132 0.0123 0.0130
GB2312 0.2036 0.1919 0.2126
GBK 0.0981 0.0990 0.1088
IBM420 0.0044 0.0036 0.0035
IBM424 0.0014 0.0012 0.0012
IBM500 0.0009 0.0009 0.0013
IBM855 0.0000 0.0000 0.0000
IBM866 0.0002 0.0002 0.0002
ISO-2022-JP 0.0011 0.0009 0.0010
ISO-8859-1 2.6461 2.5288 7.0541
ISO-8859-13 0.0001 0.0001 0.0001
ISO-8859-15 0.0470 0.0475 0.0488
ISO-8859-16 0.0004 0.0002 0.0002
ISO-8859-2 0.1063 0.0982 0.0963
ISO-8859-3 0.0000 0.0003 0.0003
ISO-8859-4 0.0008 0.0006 0.0006
ISO-8859-5 0.0023 0.0022 0.0017
ISO-8859-6 0.0000 0.0000 0.0000
ISO-8859-7 0.0066 0.0051 0.0049
ISO-8859-8 0.0008 0.0007 0.0007
ISO-8859-9 0.0258 0.0242 0.0221
KOI8-R 0.0064 0.0074 0.0073
KOI8-U 0.0001 0.0000 0.0001
Shift_JIS 0.1834 0.1556 0.1548
TIS-620 0.0047 0.0046 0.0036
US-ASCII 0.0236 0.0221 0.0207
UTF-16 0.0059 0.0050 0.0046
UTF-16BE 0.0001 0.0002 0.0002
UTF-16LE 0.0009 0.0010 0.0010
UTF-32 0.0000 0.0001 0.0001
UTF-32LE 0.0003 0.0002 0.0003
UTF-8 93.9793 94.3166 89.6676
windows-1250 0.0745 0.0663 0.0687
windows-1251 0.4835 0.4647 0.4832
windows-1252 0.1420 0.1370 0.1418
windows-1253 0.0024 0.0022 0.0023
windows-1254 0.0121 0.0106 0.0096
windows-1255 0.0077 0.0072 0.0070
windows-1256 0.0390 0.0350 0.0349
windows-1257 0.0077 0.0067 0.0065
windows-1258 0.0000 0.0000 0.0000
windows-31j 0.0005 0.0005 0.0006
x-iso-8859-11 0.0001 0.0001 0.0001
x-windows-874 0.0088 0.0079 0.0075
x-windows-949 0.0000 0.0000 0.0000
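
Percentages like those in the table can be derived from per-crawl page counts. The following is a minimal sketch, not the project's actual code: it assumes a CSV layout with the columns `crawl`, `charset` and `pages`, which may differ from the real charsets.csv, and it uses made-up crawl names and counts purely for illustration.

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample data: the column names, crawl names and counts are
# assumptions for illustration, not taken from the real charsets.csv.
SAMPLE_CSV = """crawl,charset,pages
CC-MAIN-A,UTF-8,940
CC-MAIN-A,ISO-8859-1,40
CC-MAIN-A,<unknown>,20
CC-MAIN-B,UTF-8,1800
CC-MAIN-B,windows-1251,200
"""


def charset_percentages(csv_file):
    """Return {crawl: {charset: percentage of that crawl's pages}}."""
    counts = defaultdict(lambda: defaultdict(int))
    for row in csv.DictReader(csv_file):
        counts[row["crawl"]][row["charset"]] += int(row["pages"])

    result = {}
    for crawl, per_charset in counts.items():
        total = sum(per_charset.values())
        # Round to four decimals, matching the precision shown in the table.
        result[crawl] = {
            charset: round(100.0 * pages / total, 4)
            for charset, pages in per_charset.items()
        }
    return result


percentages = charset_percentages(io.StringIO(SAMPLE_CSV))
for crawl in sorted(percentages):
    for charset, pct in sorted(percentages[crawl].items()):
        print(f"{crawl} {charset} {pct:.4f}")
```

To run the sketch against a downloaded file, pass an open file handle instead of the in-memory sample, after adjusting the assumed column names to match the file's header.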