Statistics of Common Crawl Monthly Archives

Number of pages, distribution of top-level domains, crawl overlaps, etc. - basic metrics about Common Crawl Monthly Crawl Archives
Latest crawl: CC-MAIN-2025-38


Character Encoding of HTML Pages

The character set (encoding) is identified for HTML pages only, using Tika’s AutoDetectReader. The table shows the percentage of HTML pages encoded with each character set in the latest monthly crawls. The underlying data, including page counts, is provided in charsets.csv.

charset CC-MAIN-2025-30 CC-MAIN-2025-33 CC-MAIN-2025-38
<other> 0.0000 0.0000 0.0000
<unknown> 1.5919 1.5070 1.4892
Big5 0.0418 0.0445 0.0467
Big5-HKSCS 0.0001 0.0001 0.0001
EUC-JP 0.1282 0.1225 0.1254
EUC-KR 0.0813 0.0787 0.0777
GB18030 0.0160 0.0135 0.0146
GB2312 0.2133 0.1913 0.1972
GBK 0.1091 0.0963 0.1001
IBM420 0.0036 0.0036 0.0034
IBM424 0.0011 0.0010 0.0010
IBM500 0.0006 0.0008 0.0008
IBM855 0.0000 0.0000 0.0000
IBM866 0.0001 0.0002 0.0003
ISO-2022-JP 0.0009 0.0010 0.0008
ISO-8859-1 5.4270 5.2115 5.6088
ISO-8859-13 0.0001 0.0001 0.0001
ISO-8859-15 0.0491 0.0473 0.0444
ISO-8859-16 0.0001 0.0001 0.0002
ISO-8859-2 0.0935 0.0863 0.0788
ISO-8859-3 0.0002 0.0002 0.0003
ISO-8859-4 0.0005 0.0005 0.0004
ISO-8859-5 0.0014 0.0015 0.0015
ISO-8859-6 0.0000 0.0000 0.0000
ISO-8859-7 0.0050 0.0049 0.0044
ISO-8859-8 0.0007 0.0007 0.0006
ISO-8859-9 0.0230 0.0200 0.0231
KOI8-R 0.0072 0.0068 0.0061
KOI8-U 0.0000 0.0000 0.0000
Shift_JIS 0.1565 0.1573 0.1568
TIS-620 0.0045 0.0048 0.0053
US-ASCII 0.0203 0.0201 0.0192
UTF-16 0.0044 0.0047 0.0044
UTF-16BE 0.0001 0.0002 0.0002
UTF-16LE 0.0011 0.0010 0.0010
UTF-32 0.0000 0.0000 0.0000
UTF-32LE 0.0003 0.0003 0.0003
UTF-8 91.2574 91.6428 91.2632
windows-1250 0.0694 0.0668 0.0620
windows-1251 0.4884 0.4726 0.4674
windows-1252 0.1336 0.1241 0.1291
windows-1253 0.0025 0.0023 0.0017
windows-1254 0.0092 0.0097 0.0108
windows-1255 0.0059 0.0062 0.0046
windows-1256 0.0357 0.0330 0.0337
windows-1257 0.0062 0.0063 0.0063
windows-31j 0.0004 0.0004 0.0004
x-MacCyrillic 0.0000 NaN NaN
x-iso-8859-11 0.0000 0.0000 0.0001
x-windows-874 0.0079 0.0068 0.0073
x-windows-949 0.0000 0.0000 0.0000
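
A table of this shape can be derived from per-crawl page counts by dividing each charset's count by the total for its crawl. The following is a minimal sketch of that arithmetic, not the project's actual code. The column names (`crawl`, `charset`, `pages`), the crawl labels, and the counts are all hypothetical and invented for illustration; the real layout of charsets.csv may differ. A charset with no rows for a crawl is rendered as `nan` here, which is one plausible reading of the NaN cells above, but that is an assumption.

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample in an assumed layout (crawl, charset, pages).
# These counts are invented and are NOT taken from charsets.csv.
SAMPLE_CSV = """crawl,charset,pages
CC-MAIN-EXAMPLE-A,UTF-8,900
CC-MAIN-EXAMPLE-A,ISO-8859-1,60
CC-MAIN-EXAMPLE-A,<unknown>,40
CC-MAIN-EXAMPLE-B,UTF-8,950
CC-MAIN-EXAMPLE-B,ISO-8859-1,50
"""


def charset_percentages(csv_text):
    """Return {crawl: {charset: percent of that crawl's HTML pages}}."""
    counts = defaultdict(dict)
    for row in csv.DictReader(io.StringIO(csv_text)):
        counts[row["crawl"]][row["charset"]] = int(row["pages"])

    table = {}
    for crawl, per_charset in counts.items():
        total = sum(per_charset.values())
        # Percentages are rounded to four decimals, matching the table's precision.
        table[crawl] = {
            cs: round(100.0 * n / total, 4) for cs, n in per_charset.items()
        }
    return table


def format_table(table):
    """Render one row per charset and one column per crawl."""
    crawls = sorted(table)
    charsets = sorted({cs for row in table.values() for cs in row})
    lines = ["charset " + " ".join(crawls)]
    for cs in charsets:
        # A charset absent from a crawl has no count, so it is shown as nan.
        cells = [f"{table[c].get(cs, float('nan')):.4f}" for c in crawls]
        lines.append(cs + " " + " ".join(cells))
    return "\n".join(lines)


if __name__ == "__main__":
    print(format_table(charset_percentages(SAMPLE_CSV)))
```

Because each column is normalized by its own crawl total, the values in a column sum to roughly 100, up to rounding.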