Statistics of Common Crawl Monthly Archives

Number of pages, distribution of top-level domains, crawl overlaps, etc. - basic metrics about Common Crawl Monthly Crawl Archives
Latest crawl: CC-MAIN-2025-43


Character Encoding of HTML Pages

The character set (encoding) of each HTML page is identified by Tika’s AutoDetectReader; only HTML pages are considered. The table shows the percentage of HTML pages encoded in each character set for the latest monthly crawls. The underlying data, including page counts, is provided in charsets.csv.

charset CC-MAIN-2025-33 CC-MAIN-2025-38 CC-MAIN-2025-43
<other> 0.0000 0.0000 0.0000
<unknown> 1.5070 1.4892 1.4313
Big5 0.0445 0.0467 0.0430
Big5-HKSCS 0.0001 0.0001 0.0001
EUC-JP 0.1225 0.1254 0.1202
EUC-KR 0.0787 0.0777 0.0702
GB18030 0.0135 0.0146 0.0130
GB2312 0.1913 0.1972 0.2100
GBK 0.0963 0.1001 0.0968
IBM420 0.0036 0.0034 0.0033
IBM424 0.0010 0.0010 0.0015
IBM500 0.0008 0.0008 0.0010
IBM855 0.0000 0.0000 0.0000
IBM866 0.0002 0.0003 0.0002
ISO-2022-JP 0.0010 0.0008 0.0009
ISO-8859-1 5.2115 5.6088 5.4660
ISO-8859-13 0.0001 0.0001 0.0001
ISO-8859-15 0.0473 0.0444 0.0403
ISO-8859-16 0.0001 0.0002 0.0002
ISO-8859-2 0.0863 0.0788 0.0811
ISO-8859-3 0.0002 0.0003 0.0003
ISO-8859-4 0.0005 0.0004 0.0006
ISO-8859-5 0.0015 0.0015 0.0010
ISO-8859-6 0.0000 0.0000 0.0000
ISO-8859-7 0.0049 0.0044 0.0044
ISO-8859-8 0.0007 0.0006 0.0006
ISO-8859-9 0.0200 0.0231 0.0212
KOI8-R 0.0068 0.0061 0.0065
KOI8-U 0.0000 0.0000 0.0000
Shift_JIS 0.1573 0.1568 0.1460
TIS-620 0.0048 0.0053 0.0047
US-ASCII 0.0201 0.0192 0.0153
UTF-16 0.0047 0.0044 0.0043
UTF-16BE 0.0002 0.0002 0.0002
UTF-16LE 0.0010 0.0010 0.0009
UTF-32 0.0000 0.0000 0.0000
UTF-32LE 0.0003 0.0003 0.0002
UTF-8 91.6428 91.2632 91.5255
windows-1250 0.0668 0.0620 0.0655
windows-1251 0.4726 0.4674 0.4347
windows-1252 0.1241 0.1291 0.1281
windows-1253 0.0023 0.0017 0.0019
windows-1254 0.0097 0.0108 0.0102
windows-1255 0.0062 0.0046 0.0060
windows-1256 0.0330 0.0337 0.0297
windows-1257 0.0063 0.0063 0.0059
windows-31j 0.0004 0.0004 0.0004
x-iso-8859-11 0.0000 0.0001 0.0001
x-windows-874 0.0068 0.0073 0.0068
x-windows-949 0.0000 0.0000 0.0000
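
The percentages in the table can be derived from per-crawl page counts. The following is a minimal sketch of that calculation, not the project's actual code: the function name `charset_percentages`, the `(crawl, charset, pages)` row shape, and the sample counts are all assumptions made for illustration, and the real column layout of charsets.csv may differ.

```python
from collections import defaultdict


def charset_percentages(rows):
    """Turn (crawl, charset, pages) rows into {crawl: {charset: percent}}.

    The row shape is a hypothetical one chosen for this sketch; adapt it
    to the actual columns of charsets.csv.
    """
    totals = defaultdict(int)   # total HTML pages per crawl
    counts = defaultdict(dict)  # pages per charset, grouped by crawl
    for crawl, charset, pages in rows:
        counts[crawl][charset] = counts[crawl].get(charset, 0) + pages
        totals[crawl] += pages
    # Share of each charset within its crawl, rounded to four decimals
    # like the table above.
    return {
        crawl: {cs: round(100.0 * n / totals[crawl], 4)
                for cs, n in per_crawl.items()}
        for crawl, per_crawl in counts.items()
    }


# Made-up page counts, for illustration only (not real crawl data).
sample = [
    ("CRAWL-A", "UTF-8", 900),
    ("CRAWL-A", "ISO-8859-1", 75),
    ("CRAWL-A", "<unknown>", 25),
]
shares = charset_percentages(sample)
print(shares["CRAWL-A"])
```

Because values are rounded to four decimal places, an entry shown as 0.0000 indicates a share below 0.00005 percent, not necessarily zero pages.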