Statistics of Common Crawl Monthly Archives

Number of pages, distribution of top-level domains, crawl overlaps, etc. - basic metrics about Common Crawl Monthly Crawl Archives
Latest crawl: CC-MAIN-2026-21


Character Encoding of HTML Pages

The character set (encoding) is identified by Tika’s AutoDetectReader, and only for HTML pages. The table shows, for each of the latest monthly crawls, the percentage of crawled HTML pages encoded in each character set. The underlying data, including page counts, is provided in charsets.csv.

charset CC-MAIN-2026-12 CC-MAIN-2026-17 CC-MAIN-2026-21
<other> 0.0001 0.0001 0.0001
<unknown> 1.8232 1.8918 2.0799
Big5 0.0198 0.0298 0.0221
Big5-HKSCS 0.0000 0.0000 0.0000
EUC-JP 0.1340 0.1266 0.1283
EUC-KR 0.0819 0.0771 0.0757
GB18030 0.0165 0.0150 0.0120
GB2312 0.2388 0.2011 0.1915
GBK 0.1000 0.0934 0.0905
IBM420 0.0049 0.0051 0.0052
IBM424 0.0021 0.0011 0.0009
IBM500 0.0011 0.0008 0.0008
IBM855 0.0000 NaN 0.0000
IBM866 0.0002 0.0001 0.0001
ISO-2022-JP 0.0012 0.0010 0.0011
ISO-8859-1 3.3383 2.7125 2.8610
ISO-8859-13 0.0000 0.0000 0.0000
ISO-8859-15 0.0478 0.0444 0.0446
ISO-8859-16 0.0003 0.0003 0.0003
ISO-8859-2 0.0882 0.0817 0.0794
ISO-8859-3 0.0003 0.0003 0.0003
ISO-8859-4 0.0007 0.0007 0.0007
ISO-8859-5 0.0015 0.0008 0.0007
ISO-8859-6 0.0000 0.0000 0.0000
ISO-8859-7 0.0043 0.0040 0.0038
ISO-8859-8 0.0008 0.0007 0.0006
ISO-8859-9 0.0201 0.0235 0.0220
KOI8-R 0.0066 0.0076 0.0076
KOI8-U 0.0001 0.0001 0.0001
Shift_JIS 0.1813 0.1479 0.1567
TIS-620 0.0038 0.0036 0.0031
US-ASCII 0.0200 0.0203 0.0241
UTF-16 0.0023 0.0022 0.0020
UTF-16BE 0.0002 0.0003 0.0003
UTF-16LE 0.0015 0.0039 0.0036
UTF-32 0.0001 0.0000 0.0001
UTF-32LE 0.0002 0.0002 0.0001
UTF-8 93.0693 93.7434 93.3986
windows-1250 0.0689 0.0666 0.0698
windows-1251 0.5023 0.4838 0.4885
windows-1252 0.1557 0.1526 0.1684
windows-1253 0.0024 0.0024 0.0025
windows-1254 0.0129 0.0123 0.0117
windows-1255 0.0059 0.0053 0.0059
windows-1256 0.0276 0.0240 0.0241
windows-1257 0.0059 0.0053 0.0051
windows-31j 0.0004 0.0004 0.0002
x-iso-8859-11 0.0001 0.0000 0.0000
x-windows-874 0.0067 0.0057 0.0059
x-windows-949 0.0000 0.0000 0.0000
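
As a rough illustration of how percentages like those in the table above could be derived from per-crawl page counts, here is a minimal Python sketch. The column names `crawl`, `charset` and `pages`, the helper `charset_percentages`, and the sample rows are all assumptions made for this example; the actual layout of charsets.csv may differ. In this sketch a charset with no rows for a given crawl is reported as NaN, which is one possible explanation for a NaN cell, not a confirmed one.

```python
import csv
import io
import math
from collections import defaultdict


def charset_percentages(csv_text):
    """Return {charset: {crawl: percent}} from rows of (crawl, charset, pages).

    The column names are assumed for illustration only.
    """
    counts = defaultdict(dict)  # charset -> crawl -> page count
    totals = defaultdict(int)   # crawl -> total page count
    for row in csv.DictReader(io.StringIO(csv_text)):
        pages = int(row["pages"])
        counts[row["charset"]][row["crawl"]] = pages
        totals[row["crawl"]] += pages

    crawls = sorted(totals)
    table = {}
    for charset in sorted(counts):
        table[charset] = {
            # A charset with no row for a crawl is reported as NaN.
            crawl: (100.0 * counts[charset][crawl] / totals[crawl]
                    if crawl in counts[charset] else math.nan)
            for crawl in crawls
        }
    return table


# Made-up sample data, not taken from any real crawl.
SAMPLE = """crawl,charset,pages
CC-MAIN-A,UTF-8,930
CC-MAIN-A,ISO-8859-1,50
CC-MAIN-A,IBM855,20
CC-MAIN-B,UTF-8,940
CC-MAIN-B,ISO-8859-1,60
"""

table = charset_percentages(SAMPLE)
for charset, row in table.items():
    # One line per charset, one column per crawl, four decimal places.
    print(charset, " ".join(f"{row[crawl]:.4f}" for crawl in sorted(row)))
```

Each percentage is normalized by the total page count of its own crawl, so every column sums to 100 over the charsets present in that crawl.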