Statistics of Common Crawl Monthly Archives

Number of pages, distribution of top-level domains, crawl overlaps, etc. - basic metrics about Common Crawl Monthly Crawl Archives
Latest crawl: CC-MAIN-2025-33

Character Encoding of HTML Pages

The character set (encoding) is identified for HTML pages only, using Tika’s AutoDetectReader. The table shows, for each of the latest monthly crawls, the percentage of HTML pages encoded in each character set. The underlying data, including page counts, is provided in charsets.csv.

charset CC-MAIN-2025-26 CC-MAIN-2025-30 CC-MAIN-2025-33
<other> 0.0000 0.0000 0.0000
<unknown> 1.5590 1.5919 1.5070
Big5 0.0433 0.0418 0.0445
Big5-HKSCS 0.0001 0.0001 0.0001
EUC-JP 0.1219 0.1282 0.1225
EUC-KR 0.0816 0.0813 0.0787
GB18030 0.0130 0.0160 0.0135
GB2312 0.2126 0.2133 0.1913
GBK 0.1088 0.1091 0.0963
IBM420 0.0035 0.0036 0.0036
IBM424 0.0012 0.0011 0.0010
IBM500 0.0013 0.0006 0.0008
IBM855 0.0000 0.0000 0.0000
IBM866 0.0002 0.0001 0.0002
ISO-2022-JP 0.0010 0.0009 0.0010
ISO-8859-1 7.0541 5.4270 5.2115
ISO-8859-13 0.0001 0.0001 0.0001
ISO-8859-15 0.0488 0.0491 0.0473
ISO-8859-16 0.0002 0.0001 0.0001
ISO-8859-2 0.0963 0.0935 0.0863
ISO-8859-3 0.0003 0.0002 0.0002
ISO-8859-4 0.0006 0.0005 0.0005
ISO-8859-5 0.0017 0.0014 0.0015
ISO-8859-6 0.0000 0.0000 0.0000
ISO-8859-7 0.0049 0.0050 0.0049
ISO-8859-8 0.0007 0.0007 0.0007
ISO-8859-9 0.0221 0.0230 0.0200
KOI8-R 0.0073 0.0072 0.0068
KOI8-U 0.0001 0.0000 0.0000
Shift_JIS 0.1548 0.1565 0.1573
TIS-620 0.0036 0.0045 0.0048
US-ASCII 0.0207 0.0203 0.0201
UTF-16 0.0046 0.0044 0.0047
UTF-16BE 0.0002 0.0001 0.0002
UTF-16LE 0.0010 0.0011 0.0010
UTF-32 0.0001 0.0000 0.0000
UTF-32LE 0.0003 0.0003 0.0003
UTF-8 89.6676 91.2574 91.6428
windows-1250 0.0687 0.0694 0.0668
windows-1251 0.4832 0.4884 0.4726
windows-1252 0.1418 0.1336 0.1241
windows-1253 0.0023 0.0025 0.0023
windows-1254 0.0096 0.0092 0.0097
windows-1255 0.0070 0.0059 0.0062
windows-1256 0.0349 0.0357 0.0330
windows-1257 0.0065 0.0062 0.0063
windows-1258 0.0000 0.0000 0.0000
windows-31j 0.0006 0.0004 0.0004
x-MacCyrillic NaN 0.0000 NaN
x-iso-8859-11 0.0001 0.0000 0.0000
x-windows-874 0.0075 0.0079 0.0068
x-windows-949 0.0000 0.0000 0.0000
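
As a rough illustration of how percentages like those above could be derived from per-crawl page counts, here is a minimal Python sketch. It assumes a long-format CSV with the columns crawl, charset and pages; the actual layout of charsets.csv may differ. The crawl names and counts in the sample are invented for the example and are not Common Crawl data, and the function names are hypothetical. A charset with no count for a crawl is printed as NaN, which is one plausible reading of the NaN cells in the table.

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample in an ASSUMED long format (crawl, charset, pages).
# The crawl names and counts are invented; they are not Common Crawl data.
SAMPLE_CSV = """crawl,charset,pages
CC-MAIN-A,UTF-8,900
CC-MAIN-A,ISO-8859-1,80
CC-MAIN-A,windows-1251,20
CC-MAIN-B,UTF-8,950
CC-MAIN-B,ISO-8859-1,50
"""


def charset_percentages(csv_text):
    """Return {crawl: {charset: percentage of that crawl's pages}}."""
    counts = defaultdict(lambda: defaultdict(int))
    for row in csv.DictReader(io.StringIO(csv_text)):
        counts[row["crawl"]][row["charset"]] += int(row["pages"])

    shares = {}
    for crawl, by_charset in counts.items():
        total = sum(by_charset.values())
        shares[crawl] = {cs: 100.0 * n / total for cs, n in by_charset.items()}
    return shares


def format_table(shares):
    """Render one row per charset and one column per crawl."""
    crawls = sorted(shares)
    charsets = sorted({cs for by_charset in shares.values() for cs in by_charset})
    lines = ["charset " + " ".join(crawls)]
    for cs in charsets:
        cells = []
        for crawl in crawls:
            value = shares[crawl].get(cs)
            # A charset never seen in a crawl has no share at all, hence NaN.
            cells.append("NaN" if value is None else f"{value:.4f}")
        lines.append(cs + " " + " ".join(cells))
    return "\n".join(lines)


if __name__ == "__main__":
    print(format_table(charset_percentages(SAMPLE_CSV)))
```

To try this against the real data, the sample string would be replaced by the contents of charsets.csv, with the column names adjusted to whatever that file actually uses.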