Statistics of Common Crawl Monthly Archives

Number of pages, distribution of top-level domains, crawl overlaps, etc. - basic metrics about Common Crawl Monthly Crawl Archives
Latest crawl: CC-MAIN-2026-17

Character Encoding of HTML Pages

The character set (encoding) is identified for HTML pages only, using Tika’s AutoDetectReader. The table shows the percentage of HTML pages encoded with each character set in the latest monthly crawls. The underlying data, including page counts, is provided in charsets.csv.
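
As a rough illustration of how per-crawl percentages can be derived from page counts, here is a minimal Python sketch. The column names (`crawl`, `charset`, `pages`), the crawl labels, and the counts are hypothetical and are not taken from the real charsets.csv, whose layout may differ.

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample data: the column names, crawl labels and counts are
# illustrative only and do not come from the real charsets.csv.
SAMPLE = """crawl,charset,pages
CC-MAIN-A,UTF-8,900
CC-MAIN-A,ISO-8859-1,90
CC-MAIN-A,windows-1251,10
CC-MAIN-B,UTF-8,950
CC-MAIN-B,ISO-8859-1,50
"""


def charset_percentages(csv_text):
    """Return {crawl: {charset: percent of that crawl's HTML pages}}."""
    counts = defaultdict(dict)
    for row in csv.DictReader(io.StringIO(csv_text)):
        counts[row["crawl"]][row["charset"]] = int(row["pages"])

    result = {}
    for crawl, per_charset in counts.items():
        total = sum(per_charset.values())
        # Round to four decimals, matching the precision shown in the table.
        result[crawl] = {
            charset: round(100.0 * pages / total, 4)
            for charset, pages in per_charset.items()
        }
    return result


percentages = charset_percentages(SAMPLE)
for crawl, shares in percentages.items():
    print(crawl, shares)
```

Each crawl is normalized against its own page total, so the columns can be compared even when the crawls differ in size.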

charset CC-MAIN-2026-08 CC-MAIN-2026-12 CC-MAIN-2026-17
<other> 0.0001 0.0001 0.0001
<unknown> 1.8164 1.8232 1.8918
Big5 0.0219 0.0198 0.0298
Big5-HKSCS 0.0000 0.0000 0.0000
EUC-JP 0.1311 0.1340 0.1266
EUC-KR 0.0785 0.0819 0.0771
GB18030 0.0149 0.0165 0.0150
GB2312 0.2307 0.2388 0.2011
GBK 0.1013 0.1000 0.0934
IBM420 0.0051 0.0049 0.0051
IBM424 0.0023 0.0021 0.0011
IBM500 0.0011 0.0011 0.0008
IBM855 NaN 0.0000 NaN
IBM866 0.0003 0.0002 0.0001
ISO-2022-JP 0.0010 0.0012 0.0010
ISO-8859-1 6.6472 3.3383 2.7125
ISO-8859-13 0.0000 0.0000 0.0000
ISO-8859-15 0.0466 0.0478 0.0444
ISO-8859-16 0.0002 0.0003 0.0003
ISO-8859-2 0.0906 0.0882 0.0817
ISO-8859-3 0.0002 0.0003 0.0003
ISO-8859-4 0.0006 0.0007 0.0007
ISO-8859-5 0.0016 0.0015 0.0008
ISO-8859-6 0.0000 0.0000 0.0000
ISO-8859-7 0.0043 0.0043 0.0040
ISO-8859-8 0.0007 0.0008 0.0007
ISO-8859-9 0.0196 0.0201 0.0235
KOI8-R 0.0067 0.0066 0.0076
KOI8-U 0.0001 0.0001 0.0001
Shift_JIS 0.1749 0.1813 0.1479
TIS-620 0.0038 0.0038 0.0036
US-ASCII 0.0189 0.0200 0.0203
UTF-16 0.0027 0.0023 0.0022
UTF-16BE 0.0002 0.0002 0.0003
UTF-16LE 0.0015 0.0015 0.0039
UTF-32 0.0001 0.0001 0.0000
UTF-32LE 0.0002 0.0002 0.0002
UTF-8 89.8190 93.0693 93.7434
windows-1250 0.0681 0.0689 0.0666
windows-1251 0.4820 0.5023 0.4838
windows-1252 0.1433 0.1557 0.1526
windows-1253 0.0019 0.0024 0.0024
windows-1254 0.0104 0.0129 0.0123
windows-1255 0.0059 0.0059 0.0053
windows-1256 0.0303 0.0276 0.0240
windows-1257 0.0062 0.0059 0.0053
windows-31j 0.0004 0.0004 0.0004
x-iso-8859-11 0.0001 0.0001 0.0000
x-windows-874 0.0072 0.0067 0.0057
x-windows-949 0.0000 0.0000 0.0000
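
The rows of the table can be compared across crawls with a few lines of Python. The sketch below parses three rows copied from the table and computes the change in percentage points between the first and the last crawl. The helper names are hypothetical, and treating NaN as "no value for that crawl" is an assumption, since the page does not explain what NaN stands for.

```python
import math

# Three rows copied from the table: charset name followed by the
# percentages for CC-MAIN-2026-08, CC-MAIN-2026-12 and CC-MAIN-2026-17.
ROWS = """\
IBM855 NaN 0.0000 NaN
ISO-8859-1 6.6472 3.3383 2.7125
UTF-8 89.8190 93.0693 93.7434
"""


def parse_rows(text):
    """Map each charset name to its list of per-crawl percentages."""
    table = {}
    for line in text.splitlines():
        name, *values = line.split()
        table[name] = [float(v) for v in values]  # float("NaN") yields nan
    return table


def change(values):
    """Percentage-point change from first to last crawl, or None if either is NaN."""
    first, last = values[0], values[-1]
    if math.isnan(first) or math.isnan(last):
        return None  # assumption: NaN means no value is available
    return round(last - first, 4)


table = parse_rows(ROWS)
deltas = {name: change(values) for name, values in table.items()}
for name, delta in deltas.items():
    print(name, delta)
```

Returning `None` for rows with NaN keeps missing values from propagating silently into later arithmetic.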