Statistics of Common Crawl Monthly Archives

Number of pages, distribution of top-level domains, crawl overlaps, etc. - basic metrics about Common Crawl Monthly Crawl Archives
Latest crawl: CC-MAIN-2025-47


Character Encoding of HTML Pages

The character set (encoding) is identified by Tika’s AutoDetectReader, and only for HTML pages. The table shows, for each of the latest monthly crawls, the percentage of HTML pages encoded in each character set. The underlying data, including page counts, is provided in charsets.csv.

charset CC-MAIN-2025-38 CC-MAIN-2025-43 CC-MAIN-2025-47
<other> 0.0000 0.0000 0.0001
<unknown> 1.4892 1.4313 1.7191
Big5 0.0467 0.0430 0.0464
Big5-HKSCS 0.0001 0.0001 0.0000
EUC-JP 0.1254 0.1202 0.1345
EUC-KR 0.0777 0.0702 0.0783
GB18030 0.0146 0.0130 0.0160
GB2312 0.1972 0.2100 0.2491
GBK 0.1001 0.0968 0.1042
IBM420 0.0034 0.0033 0.0038
IBM424 0.0010 0.0015 0.0014
IBM500 0.0008 0.0010 0.0012
IBM855 0.0000 0.0000 NaN
IBM866 0.0003 0.0002 0.0002
ISO-2022-JP 0.0008 0.0009 0.0011
ISO-8859-1 5.6088 5.4660 5.7471
ISO-8859-13 0.0001 0.0001 0.0001
ISO-8859-15 0.0444 0.0403 0.0449
ISO-8859-16 0.0002 0.0002 0.0002
ISO-8859-2 0.0788 0.0811 0.0888
ISO-8859-3 0.0003 0.0003 0.0004
ISO-8859-4 0.0004 0.0006 0.0007
ISO-8859-5 0.0015 0.0010 0.0012
ISO-8859-6 0.0000 0.0000 0.0000
ISO-8859-7 0.0044 0.0044 0.0045
ISO-8859-8 0.0006 0.0006 0.0007
ISO-8859-9 0.0231 0.0212 0.0219
KOI8-R 0.0061 0.0065 0.0070
KOI8-U 0.0000 0.0000 0.0000
Shift_JIS 0.1568 0.1460 0.1550
TIS-620 0.0053 0.0047 0.0041
US-ASCII 0.0192 0.0153 0.0184
UTF-16 0.0044 0.0043 0.0048
UTF-16BE 0.0002 0.0002 0.0004
UTF-16LE 0.0010 0.0009 0.0011
UTF-32 0.0000 0.0000 0.0001
UTF-32LE 0.0003 0.0002 0.0002
UTF-8 91.2632 91.5255 90.7888
windows-1250 0.0620 0.0655 0.0706
windows-1251 0.4674 0.4347 0.4767
windows-1252 0.1291 0.1281 0.1405
windows-1253 0.0017 0.0019 0.0021
windows-1254 0.0108 0.0102 0.0108
windows-1255 0.0046 0.0060 0.0065
windows-1256 0.0337 0.0297 0.0322
windows-1257 0.0063 0.0059 0.0067
windows-31j 0.0004 0.0004 0.0004
x-iso-8859-11 0.0001 0.0001 0.0001
x-windows-874 0.0073 0.0068 0.0075
x-windows-949 0.0000 0.0000 0.0000
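
As a rough illustration of how percentages like those above could be derived from per-charset page counts, here is a minimal Python sketch. The input layout, the function name charset_percentages and the sample numbers are all assumptions made for this example; they do not reflect the actual format of charsets.csv or the code used to produce this page. Treating a charset that is absent from a crawl as NaN is likewise only a guess at how a NaN entry can arise.

```python
import math


def charset_percentages(page_counts):
    """Convert per-crawl page counts into percentage shares.

    page_counts: {crawl_id: {charset: number_of_pages}} (hypothetical layout).
    Returns {charset: {crawl_id: percentage}}, rounded to 4 decimal places.
    A charset with no count in a given crawl is reported as NaN.
    """
    crawls = sorted(page_counts)
    charsets = sorted({cs for counts in page_counts.values() for cs in counts})
    # Total number of HTML pages per crawl, used as the denominator.
    totals = {crawl: sum(page_counts[crawl].values()) for crawl in crawls}

    table = {}
    for cs in charsets:
        row = {}
        for crawl in crawls:
            count = page_counts[crawl].get(cs)
            if count is None or totals[crawl] == 0:
                row[crawl] = math.nan
            else:
                row[crawl] = round(100.0 * count / totals[crawl], 4)
        table[cs] = row
    return table


# Made-up counts for illustration only (not real Common Crawl data).
example_counts = {
    "crawl-A": {"UTF-8": 900, "ISO-8859-1": 60, "<unknown>": 40},
    "crawl-B": {"UTF-8": 950, "ISO-8859-1": 50},
}
shares = charset_percentages(example_counts)

for charset, row in shares.items():
    print(charset, " ".join(f"{row[crawl]:.4f}" for crawl in sorted(row)))
```

Because each column is normalized by its own crawl total, the values in a column add up to roughly 100, apart from rounding.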