Statistics of Common Crawl Monthly Archives

Number of pages, distribution of top-level domains, crawl overlaps, etc. - basic metrics about Common Crawl Monthly Crawl Archives
Latest crawl: CC-MAIN-2026-08


Character Encoding of HTML Pages

The character set (encoding) is identified by Tika’s AutoDetectReader, and only for HTML pages. The table shows, for each of the latest monthly crawls, the percentage of crawled HTML pages encoded in each character set. The underlying data, including page counts, is provided in charsets.csv.
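As a rough illustration of how such percentages can be derived from page counts, the following minimal sketch aggregates counts per crawl and charset and converts them to shares. The column names `crawl`, `charset`, and `pages`, the crawl labels `CC-TEST-A` and `CC-TEST-B`, and all counts are assumptions made up for this example; they are not taken from charsets.csv, whose actual layout may differ.

```python
import csv
import io
from collections import defaultdict


def read_counts(fileobj):
    """Yield (crawl, charset, pages) from a CSV with hypothetical column names."""
    for rec in csv.DictReader(fileobj):
        yield rec["crawl"], rec["charset"], int(rec["pages"])


def charset_percentages(records):
    """Return {crawl: {charset: percent of that crawl's pages}}."""
    counts = defaultdict(lambda: defaultdict(int))
    for crawl, charset, pages in records:
        counts[crawl][charset] += pages

    table = {}
    for crawl, per_charset in counts.items():
        total = sum(per_charset.values())
        if total == 0:
            continue  # no pages counted for this crawl, nothing to normalize
        table[crawl] = {
            charset: round(100.0 * n / total, 4)
            for charset, n in per_charset.items()
        }
    return table


# Invented sample data, only to make the sketch runnable.
SAMPLE = """crawl,charset,pages
CC-TEST-A,UTF-8,900
CC-TEST-A,ISO-8859-1,75
CC-TEST-A,windows-1251,25
CC-TEST-B,UTF-8,450
CC-TEST-B,ISO-8859-1,50
"""

percentages = charset_percentages(read_counts(io.StringIO(SAMPLE)))
for crawl, shares in sorted(percentages.items()):
    for charset, pct in sorted(shares.items()):
        print(f"{crawl}\t{charset}\t{pct:.4f}")
```

Each crawl is normalized by its own page total, so every column of the resulting table sums to roughly 100 regardless of how large the crawl was.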

charset CC-MAIN-2025-51 CC-MAIN-2026-04 CC-MAIN-2026-08
<other> 0.0001 0.0001 0.0001
<unknown> 1.7917 1.6272 1.8164
Big5 0.0572 0.0253 0.0219
Big5-HKSCS 0.0000 0.0000 0.0000
EUC-JP 0.1365 0.1285 0.1311
EUC-KR 0.0804 0.0770 0.0785
GB18030 0.0210 0.0165 0.0149
GB2312 0.2997 0.2195 0.2307
GBK 0.1178 0.0991 0.1013
IBM420 0.0053 0.0040 0.0051
IBM424 0.0015 0.0018 0.0023
IBM500 0.0011 0.0010 0.0011
IBM855 0.0000 0.0000 NaN
IBM866 0.0002 0.0002 0.0003
ISO-2022-JP 0.0011 0.0009 0.0010
ISO-8859-1 5.5513 4.3993 6.6472
ISO-8859-13 0.0001 0.0001 0.0000
ISO-8859-15 0.0460 0.0456 0.0466
ISO-8859-16 0.0002 0.0003 0.0002
ISO-8859-2 0.0851 0.0824 0.0906
ISO-8859-3 0.0000 0.0000 0.0002
ISO-8859-4 0.0007 0.0006 0.0006
ISO-8859-5 0.0011 0.0015 0.0016
ISO-8859-6 0.0000 0.0000 0.0000
ISO-8859-7 0.0046 0.0046 0.0043
ISO-8859-8 0.0008 0.0006 0.0007
ISO-8859-9 0.0223 0.0198 0.0196
KOI8-R 0.0062 0.0057 0.0067
KOI8-U 0.0000 0.0001 0.0001
Shift_JIS 0.1633 0.1606 0.1749
TIS-620 0.0043 0.0037 0.0038
US-ASCII 0.0207 0.0194 0.0189
UTF-16 0.0036 0.0025 0.0027
UTF-16BE 0.0002 0.0002 0.0002
UTF-16LE 0.0011 0.0012 0.0015
UTF-32 0.0000 0.0001 0.0001
UTF-32LE 0.0002 0.0002 0.0002
UTF-8 90.8089 92.3263 89.8190
windows-1250 0.0711 0.0668 0.0681
windows-1251 0.4955 0.4681 0.4820
windows-1252 0.1331 0.1271 0.1433
windows-1253 0.0022 0.0020 0.0019
windows-1254 0.0099 0.0100 0.0104
windows-1255 0.0064 0.0059 0.0059
windows-1256 0.0329 0.0310 0.0303
windows-1257 0.0064 0.0059 0.0062
windows-31j 0.0004 0.0004 0.0004
x-iso-8859-11 0.0001 0.0001 0.0001
x-windows-874 0.0075 0.0070 0.0072
x-windows-949 0.0000 0.0000 0.0000
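
To compare crawls, rows of the table above can be parsed as whitespace-separated values. The sketch below copies three rows from the table and computes the change in percentage points between the first and the last listed crawl. Treating `NaN` as a missing value is an assumption of this example, and the helper names `parse_rows` and `change` are hypothetical.

```python
# Three rows copied from the table above: charset, then one value per crawl.
TABLE_EXCERPT = """\
IBM855 0.0000 0.0000 NaN
ISO-8859-1 5.5513 4.3993 6.6472
UTF-8 90.8089 92.3263 89.8190
"""


def parse_rows(text):
    """Map each charset name to its list of percentages (None for NaN)."""
    rows = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        name, *values = line.split()
        rows[name] = [None if v == "NaN" else float(v) for v in values]
    return rows


def change(values):
    """Difference last minus first in percentage points, or None if either is missing."""
    first, last = values[0], values[-1]
    if first is None or last is None:
        return None
    return round(last - first, 4)


rows = parse_rows(TABLE_EXCERPT)
changes = {name: change(vals) for name, vals in rows.items()}
for name, delta in changes.items():
    print(name, "n/a" if delta is None else f"{delta:+.4f}")
```

Returning `None` for a missing endpoint keeps a `NaN` cell from silently turning into a numeric difference.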