Statistics of Common Crawl Monthly Archives

Number of pages, distribution of top-level domains, crawl overlaps, etc. - basic metrics about Common Crawl Monthly Crawl Archives
Latest crawl: CC-MAIN-2024-10


Character Encoding of HTML Pages

The character set (encoding) of each HTML page is identified by Tika’s AutoDetectReader; only HTML pages are considered. The table shows the percentage of HTML pages encoded in each character set for the latest monthly crawls. The underlying data, including page counts, is provided in charsets.csv.

charset CC-MAIN-2023-40 CC-MAIN-2023-50 CC-MAIN-2024-10
<other> 0.0000 0.0001 0.0000
<unknown> 1.7751 1.9997 1.6892
Big5 0.0622 0.0610 0.0554
Big5-HKSCS 0.0000 0.0000 0.0000
EUC-JP 0.1089 0.1110 0.1060
EUC-KR 0.0832 0.0957 0.0806
GB18030 0.0166 0.0204 0.0157
GB2312 0.2485 0.3646 0.2435
GBK 0.0975 0.1331 0.1052
IBM420 0.0060 0.0055 0.0056
IBM424 0.0023 0.0034 0.0022
IBM500 0.0007 0.0008 0.0007
IBM855 0.0000 0.0000 0.0000
IBM866 0.0002 0.0002 0.0001
ISO-2022-JP 0.0008 0.0011 0.0012
ISO-8859-1 2.2454 2.2951 2.3258
ISO-8859-13 0.0001 0.0000 0.0000
ISO-8859-15 0.0584 0.0553 0.0500
ISO-8859-16 0.0002 0.0002 0.0001
ISO-8859-2 0.1236 0.1236 0.1261
ISO-8859-3 0.0005 0.0005 0.0005
ISO-8859-4 0.0011 0.0008 0.0007
ISO-8859-5 0.0028 0.0028 0.0026
ISO-8859-6 0.0000 0.0000 0.0001
ISO-8859-7 0.0086 0.0084 0.0095
ISO-8859-8 0.0005 0.0007 0.0009
ISO-8859-9 0.0220 0.0264 0.0261
KOI8-R 0.0060 0.0064 0.0077
KOI8-U 0.0000 0.0001 0.0000
Shift_JIS 0.1604 0.1953 0.1865
TIS-620 0.0074 0.0062 0.0051
US-ASCII 0.0272 0.0323 0.0349
UTF-16 0.0034 0.0034 0.0034
UTF-16BE 0.0008 0.0005 0.0003
UTF-16LE 0.0014 0.0019 0.0015
UTF-32 0.0001 0.0001 0.0000
UTF-32LE 0.0006 0.0006 0.0003
UTF-8 94.0352 93.5115 94.0329
windows-1250 0.0822 0.0758 0.0748
windows-1251 0.5314 0.5618 0.5250
windows-1252 0.1830 0.2031 0.1836
windows-1253 0.0031 0.0029 0.0026
windows-1254 0.0102 0.0123 0.0095
windows-1255 0.0041 0.0069 0.0076
windows-1256 0.0552 0.0478 0.0538
windows-1257 0.0111 0.0096 0.0125
windows-1258 0.0000 0.0000 0.0000
windows-31j 0.0009 0.0009 0.0005
x-IBM949 0.0000 0.0000 0.0000
x-windows-874 0.0108 0.0102 0.0097
x-windows-949 0.0001 0.0001 0.0001
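
The per-crawl percentages in a table like this one can be derived from raw page counts. The following is a minimal sketch of that calculation, not the project's actual code. It assumes a CSV with the hypothetical columns `crawl`, `charset` and `pages`; the real layout of charsets.csv may differ. The function name `charset_percentages` and the sample counts are invented for illustration.

```python
import csv
import io
from collections import defaultdict


def charset_percentages(csv_text):
    """Return {crawl: {charset: percent}} from rows of (crawl, charset, pages).

    The column names are assumptions; adjust them to the actual file.
    """
    # Sum page counts per (crawl, charset) pair.
    counts = defaultdict(lambda: defaultdict(int))
    for row in csv.DictReader(io.StringIO(csv_text)):
        counts[row["crawl"]][row["charset"]] += int(row["pages"])

    # Convert counts to percentages of all HTML pages within each crawl.
    result = {}
    for crawl, by_charset in counts.items():
        total = sum(by_charset.values())
        result[crawl] = {
            charset: round(100.0 * pages / total, 4)
            for charset, pages in by_charset.items()
        }
    return result


# Made-up page counts for illustration only (not real crawl data).
SAMPLE = """crawl,charset,pages
CC-EXAMPLE-A,UTF-8,940
CC-EXAMPLE-A,ISO-8859-1,40
CC-EXAMPLE-A,<unknown>,20
CC-EXAMPLE-B,UTF-8,450
CC-EXAMPLE-B,windows-1251,50
"""

if __name__ == "__main__":
    for crawl, shares in charset_percentages(SAMPLE).items():
        for charset, pct in sorted(shares.items()):
            print(f"{crawl}\t{charset}\t{pct:.4f}")
```

Because each value is rounded to four decimal places, a charset with very few pages can show as 0.0000 even though its page count is not zero, which may explain rows such as Big5-HKSCS or windows-1258 in the table.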