Statistics of Common Crawl Monthly Archives

Number of pages, distribution of top-level domains, crawl overlaps, etc. - basic metrics about Common Crawl Monthly Crawl Archives
Latest crawl: CC-MAIN-2026-12


Character Encoding of HTML Pages

The character set (encoding) is identified for HTML pages only, using Tika’s AutoDetectReader. The table shows the percentage of HTML pages encoded with each character set in the latest monthly crawls. The underlying data, including page counts, is provided in charsets.csv.
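The percentages in the table can be derived from per-crawl page counts. The following is a minimal sketch in Python, assuming the counts are available as (crawl, charset, pages) records. The record layout, the function name charset_percentages and the sample counts are hypothetical and are not taken from charsets.csv, whose exact column layout is not shown here.

```python
from collections import defaultdict


def charset_percentages(records):
    """Turn (crawl, charset, pages) records into per-crawl percentages.

    Returns {charset: {crawl: percentage}}. A charset that was not seen
    in a crawl has no entry for that crawl.
    """
    totals = defaultdict(int)                       # pages per crawl
    counts = defaultdict(lambda: defaultdict(int))  # pages per charset and crawl
    for crawl, charset, pages in records:
        totals[crawl] += pages
        counts[charset][crawl] += pages
    return {
        charset: {
            crawl: round(100.0 * n / totals[crawl], 4)
            for crawl, n in per_crawl.items()
        }
        for charset, per_crawl in counts.items()
    }


# Made-up counts for illustration only (not real crawl data).
sample = [
    ("crawl-A", "UTF-8", 900),
    ("crawl-A", "ISO-8859-1", 100),
    ("crawl-B", "UTF-8", 750),
    ("crawl-B", "ISO-8859-1", 200),
    ("crawl-B", "windows-1251", 50),
]
table = charset_percentages(sample)
print(table["UTF-8"])  # {'crawl-A': 90.0, 'crawl-B': 75.0}
```

In this sketch a charset that is absent from a crawl has no entry for it. When such data is pivoted into a table, the gap would typically appear as a missing value, which is one plausible reading of a NaN cell.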

charset CC-MAIN-2026-04 CC-MAIN-2026-08 CC-MAIN-2026-12
<other> 0.0001 0.0001 0.0001
<unknown> 1.6272 1.8164 1.8232
Big5 0.0253 0.0219 0.0198
Big5-HKSCS 0.0000 0.0000 0.0000
EUC-JP 0.1285 0.1311 0.1340
EUC-KR 0.0770 0.0785 0.0819
GB18030 0.0165 0.0149 0.0165
GB2312 0.2195 0.2307 0.2388
GBK 0.0991 0.1013 0.1000
IBM420 0.0040 0.0051 0.0049
IBM424 0.0018 0.0023 0.0021
IBM500 0.0010 0.0011 0.0011
IBM855 0.0000 NaN 0.0000
IBM866 0.0002 0.0003 0.0002
ISO-2022-JP 0.0009 0.0010 0.0012
ISO-8859-1 4.3993 6.6472 3.3383
ISO-8859-13 0.0001 0.0000 0.0000
ISO-8859-15 0.0456 0.0466 0.0478
ISO-8859-16 0.0003 0.0002 0.0003
ISO-8859-2 0.0824 0.0906 0.0882
ISO-8859-3 0.0000 0.0002 0.0003
ISO-8859-4 0.0006 0.0006 0.0007
ISO-8859-5 0.0015 0.0016 0.0015
ISO-8859-6 0.0000 0.0000 0.0000
ISO-8859-7 0.0046 0.0043 0.0043
ISO-8859-8 0.0006 0.0007 0.0008
ISO-8859-9 0.0198 0.0196 0.0201
KOI8-R 0.0057 0.0067 0.0066
KOI8-U 0.0001 0.0001 0.0001
Shift_JIS 0.1606 0.1749 0.1813
TIS-620 0.0037 0.0038 0.0038
US-ASCII 0.0194 0.0189 0.0200
UTF-16 0.0025 0.0027 0.0023
UTF-16BE 0.0002 0.0002 0.0002
UTF-16LE 0.0012 0.0015 0.0015
UTF-32 0.0001 0.0001 0.0001
UTF-32LE 0.0002 0.0002 0.0002
UTF-8 92.3263 89.8190 93.0693
windows-1250 0.0668 0.0681 0.0689
windows-1251 0.4681 0.4820 0.5023
windows-1252 0.1271 0.1433 0.1557
windows-1253 0.0020 0.0019 0.0024
windows-1254 0.0100 0.0104 0.0129
windows-1255 0.0059 0.0059 0.0059
windows-1256 0.0310 0.0303 0.0276
windows-1257 0.0059 0.0062 0.0059
windows-31j 0.0004 0.0004 0.0004
x-iso-8859-11 0.0001 0.0001 0.0001
x-windows-874 0.0070 0.0072 0.0067
x-windows-949 0.0000 0.0000 0.0000
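
To work with the table programmatically, each row can be split into a charset name followed by one percentage per crawl. The following is a minimal sketch, assuming whitespace-separated rows as printed above; the helper name parse_row is hypothetical, and the three sample rows are copied from the table.

```python
import math


def parse_row(line):
    """Split one table row into (charset, [percentages]).

    float() accepts the literal "NaN", so missing cells become math.nan.
    """
    name, *values = line.split()
    return name, [float(v) for v in values]


rows = [
    "UTF-8 92.3263 89.8190 93.0693",
    "ISO-8859-1 4.3993 6.6472 3.3383",
    "IBM855 0.0000 NaN 0.0000",
]
parsed = dict(parse_row(r) for r in rows)

# Change in UTF-8 share between the first and last crawl, in percentage points.
utf8_change = round(parsed["UTF-8"][-1] - parsed["UTF-8"][0], 4)

# Column positions with a missing value in the IBM855 row.
missing = [i for i, v in enumerate(parsed["IBM855"]) if math.isnan(v)]
```

Checking for NaN with math.isnan before aggregating avoids silently propagating missing values into sums or averages.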