Statistics of Common Crawl Monthly Archives

Number of pages, distribution of top-level domains, crawl overlaps, etc. - basic metrics about Common Crawl Monthly Crawl Archives
Latest crawl: CC-MAIN-2025-21

Character Encoding of HTML Pages

The character set (encoding) is identified for HTML pages only, using Tika’s AutoDetectReader. The table shows, for each of the latest monthly crawls, the percentage of crawled HTML pages encoded in each character set. The underlying data, including page counts, is provided in charsets.csv.

charset CC-MAIN-2025-13 CC-MAIN-2025-18 CC-MAIN-2025-21
<other> 0.0000 0.0000 0.0000
<unknown> 1.6162 1.5980 1.4997
Big5 0.0461 0.0449 0.0434
Big5-HKSCS 0.0000 0.0000 0.0001
EUC-JP 0.1206 0.1239 0.1120
EUC-KR 0.0914 0.0910 0.0760
GB18030 0.0148 0.0132 0.0123
GB2312 0.2143 0.2036 0.1919
GBK 0.1087 0.0981 0.0990
IBM420 0.0042 0.0044 0.0036
IBM424 0.0014 0.0014 0.0012
IBM500 0.0009 0.0009 0.0009
IBM855 0.0000 0.0000 0.0000
IBM866 0.0002 0.0002 0.0002
ISO-2022-JP 0.0011 0.0011 0.0009
ISO-8859-1 2.8223 2.6461 2.5288
ISO-8859-13 0.0001 0.0001 0.0001
ISO-8859-15 0.0454 0.0470 0.0475
ISO-8859-16 0.0004 0.0004 0.0002
ISO-8859-2 0.1024 0.1063 0.0982
ISO-8859-3 0.0000 0.0000 0.0003
ISO-8859-4 0.0006 0.0008 0.0006
ISO-8859-5 0.0021 0.0023 0.0022
ISO-8859-6 0.0000 0.0000 0.0000
ISO-8859-7 0.0070 0.0066 0.0051
ISO-8859-8 0.0007 0.0008 0.0007
ISO-8859-9 0.0230 0.0258 0.0242
KOI8-R 0.0062 0.0064 0.0074
KOI8-U 0.0000 0.0001 0.0000
Shift_JIS 0.1750 0.1834 0.1556
TIS-620 0.0049 0.0047 0.0046
US-ASCII 0.0232 0.0236 0.0221
UTF-16 0.0056 0.0059 0.0050
UTF-16BE 0.0003 0.0001 0.0002
UTF-16LE 0.0011 0.0009 0.0010
UTF-32 0.0000 0.0000 0.0001
UTF-32LE 0.0001 0.0003 0.0002
UTF-8 93.7645 93.9793 94.3166
windows-1250 0.0727 0.0745 0.0663
windows-1251 0.5045 0.4835 0.4647
windows-1252 0.1398 0.1420 0.1370
windows-1253 0.0025 0.0024 0.0022
windows-1254 0.0127 0.0121 0.0106
windows-1255 0.0070 0.0077 0.0072
windows-1256 0.0395 0.0390 0.0350
windows-1257 0.0077 0.0077 0.0067
windows-1258 0.0000 0.0000 0.0000
windows-31j 0.0005 0.0005 0.0005
x-iso-8859-11 0.0001 0.0001 0.0001
x-windows-874 0.0081 0.0088 0.0079
x-windows-949 0.0000 0.0000 0.0000
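
The percentages in the table can in principle be recomputed from per-crawl page counts such as those in charsets.csv. The following is a minimal sketch, not the project's actual code: it assumes a CSV with columns named `crawl`, `charset` and `pages`, which are hypothetical names and may differ from the real header of charsets.csv. The function name `charset_percentages` and the sample counts are likewise made up for illustration.

```python
import csv
import io
from collections import defaultdict


def charset_percentages(csv_file):
    """Return {crawl: {charset: percentage of that crawl's pages}}.

    Assumption: the input has columns 'crawl', 'charset' and 'pages'.
    These names are hypothetical; adjust them to the real CSV header.
    """
    # Sum page counts per (crawl, charset) pair.
    pages = defaultdict(lambda: defaultdict(int))
    for row in csv.DictReader(csv_file):
        pages[row["crawl"]][row["charset"]] += int(row["pages"])

    shares = {}
    for crawl, counts in pages.items():
        total = sum(counts.values())
        if total == 0:
            continue  # no pages counted for this crawl, nothing to normalize
        # Normalize within each crawl and round to four decimals,
        # matching the precision shown in the table.
        shares[crawl] = {
            charset: round(100.0 * n / total, 4)
            for charset, n in counts.items()
        }
    return shares


# Made-up counts for illustration only, not real Common Crawl data.
sample = io.StringIO(
    "crawl,charset,pages\n"
    "CRAWL-A,UTF-8,940\n"
    "CRAWL-A,ISO-8859-1,30\n"
    "CRAWL-A,windows-1251,30\n"
    "CRAWL-B,UTF-8,950\n"
    "CRAWL-B,ISO-8859-1,50\n"
)
shares = charset_percentages(sample)
print(shares["CRAWL-A"]["UTF-8"])  # 94.0
```

Normalizing within each crawl, rather than across all crawls, is what makes each column of the table sum to roughly 100. Rounding to four decimals also explains why very rare character sets appear as 0.0000.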