Statistics of Common Crawl Monthly Archives

Number of pages, distribution of top-level domains, crawl overlaps, etc. - basic metrics about Common Crawl Monthly Crawl Archives
Latest crawl: CC-MAIN-2026-04


Character Encoding of HTML Pages

The character set (encoding) of each HTML page is identified by Tika’s AutoDetectReader; only HTML pages are considered. The table shows, for each of the latest monthly crawls, the percentage of HTML pages encoded in each character set. The underlying data, including page counts, is provided in charsets.csv.
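As a rough illustration of how page counts turn into the percentages reported here, the following sketch divides each charset's page count by the total for its crawl. The column names (`crawl`, `charset`, `pages`), the crawl labels, and the counts are all assumptions made up for this example; the actual layout of charsets.csv is not described on this page and may differ.

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample in an ASSUMED layout (crawl, charset, pages).
# The real charsets.csv may use different columns; these counts are invented.
SAMPLE_CSV = """crawl,charset,pages
CC-MAIN-EXAMPLE-A,UTF-8,900
CC-MAIN-EXAMPLE-A,ISO-8859-1,60
CC-MAIN-EXAMPLE-A,<unknown>,40
CC-MAIN-EXAMPLE-B,UTF-8,950
CC-MAIN-EXAMPLE-B,ISO-8859-1,50
"""


def charset_percentages(csv_text):
    """Return {crawl: {charset: percentage of that crawl's HTML pages}}."""
    counts = defaultdict(dict)
    for row in csv.DictReader(io.StringIO(csv_text)):
        counts[row["crawl"]][row["charset"]] = int(row["pages"])

    shares = {}
    for crawl, per_charset in counts.items():
        total = sum(per_charset.values())
        # Round to four decimals, matching the precision used in the table.
        shares[crawl] = {
            charset: round(100.0 * pages / total, 4)
            for charset, pages in per_charset.items()
        }
    return shares


shares = charset_percentages(SAMPLE_CSV)
```

A charset that has no row for a given crawl is simply absent from that crawl's dictionary, so callers should use `.get()` when comparing crawls.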

charset (% of HTML pages) CC-MAIN-2025-47 CC-MAIN-2025-51 CC-MAIN-2026-04
<other> 0.0001 0.0001 0.0001
<unknown> 1.7191 1.7917 1.6272
Big5 0.0464 0.0572 0.0253
Big5-HKSCS 0.0000 0.0000 0.0000
EUC-JP 0.1345 0.1365 0.1285
EUC-KR 0.0783 0.0804 0.0770
GB18030 0.0160 0.0210 0.0165
GB2312 0.2491 0.2997 0.2195
GBK 0.1042 0.1178 0.0991
IBM420 0.0038 0.0053 0.0040
IBM424 0.0014 0.0015 0.0018
IBM500 0.0012 0.0011 0.0010
IBM855 NaN 0.0000 0.0000
IBM866 0.0002 0.0002 0.0002
ISO-2022-JP 0.0011 0.0011 0.0009
ISO-8859-1 5.7471 5.5513 4.3993
ISO-8859-13 0.0001 0.0001 0.0001
ISO-8859-15 0.0449 0.0460 0.0456
ISO-8859-16 0.0002 0.0002 0.0003
ISO-8859-2 0.0888 0.0851 0.0824
ISO-8859-3 0.0004 0.0000 0.0000
ISO-8859-4 0.0007 0.0007 0.0006
ISO-8859-5 0.0012 0.0011 0.0015
ISO-8859-6 0.0000 0.0000 0.0000
ISO-8859-7 0.0045 0.0046 0.0046
ISO-8859-8 0.0007 0.0008 0.0006
ISO-8859-9 0.0219 0.0223 0.0198
KOI8-R 0.0070 0.0062 0.0057
KOI8-U 0.0000 0.0000 0.0001
Shift_JIS 0.1550 0.1633 0.1606
TIS-620 0.0041 0.0043 0.0037
US-ASCII 0.0184 0.0207 0.0194
UTF-16 0.0048 0.0036 0.0025
UTF-16BE 0.0004 0.0002 0.0002
UTF-16LE 0.0011 0.0011 0.0012
UTF-32 0.0001 0.0000 0.0001
UTF-32LE 0.0002 0.0002 0.0002
UTF-8 90.7888 90.8089 92.3263
windows-1250 0.0706 0.0711 0.0668
windows-1251 0.4767 0.4955 0.4681
windows-1252 0.1405 0.1331 0.1271
windows-1253 0.0021 0.0022 0.0020
windows-1254 0.0108 0.0099 0.0100
windows-1255 0.0065 0.0064 0.0059
windows-1256 0.0322 0.0329 0.0310
windows-1257 0.0067 0.0064 0.0059
windows-31j 0.0004 0.0004 0.0004
x-iso-8859-11 0.0001 0.0001 0.0001
x-windows-874 0.0075 0.0075 0.0070
x-windows-949 0.0000 0.0000 0.0000
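
To compare crawls, the rows of the table above can be parsed and differenced. The sketch below is a minimal example that copies a few rows verbatim from the table and reports the change in percentage points between two crawls. The helper names `parse_rows` and `change` are hypothetical. The sketch treats `NaN` as a missing value; presumably it marks a charset with no recorded pages in that crawl, but that reading is an assumption.

```python
import math

# Column order of the table above.
CRAWLS = ["CC-MAIN-2025-47", "CC-MAIN-2025-51", "CC-MAIN-2026-04"]

# A few rows copied from the table, in the same whitespace-separated layout.
TABLE = """\
IBM855 NaN 0.0000 0.0000
ISO-8859-1 5.7471 5.5513 4.3993
UTF-8 90.7888 90.8089 92.3263
windows-1251 0.4767 0.4955 0.4681
"""


def parse_rows(text):
    """Parse 'charset v1 v2 v3' lines into {charset: {crawl: percentage}}."""
    rows = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != len(CRAWLS) + 1:
            continue  # skip blank or malformed lines
        charset, values = parts[0], parts[1:]
        # float("NaN") yields a NaN value, so missing entries survive parsing.
        rows[charset] = dict(zip(CRAWLS, (float(v) for v in values)))
    return rows


def change(rows, charset, start, end):
    """Change in percentage points from start to end; None if either is NaN."""
    a, b = rows[charset][start], rows[charset][end]
    if math.isnan(a) or math.isnan(b):
        return None
    return round(b - a, 4)


rows = parse_rows(TABLE)
utf8_change = change(rows, "UTF-8", "CC-MAIN-2025-47", "CC-MAIN-2026-04")
latin1_change = change(rows, "ISO-8859-1", "CC-MAIN-2025-47", "CC-MAIN-2026-04")
ibm855_change = change(rows, "IBM855", "CC-MAIN-2025-47", "CC-MAIN-2026-04")
```

For the rows shown, UTF-8 gains roughly 1.54 percentage points between CC-MAIN-2025-47 and CC-MAIN-2026-04, while ISO-8859-1 loses roughly 1.35. The IBM855 comparison returns `None` because its first value is missing.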