Statistics of Common Crawl Monthly Archives

Number of pages, distribution of top-level domains, crawl overlaps, etc. - basic metrics about Common Crawl Monthly Crawl Archives
Latest crawl: CC-MAIN-2023-06


Character Encoding of HTML Pages

The character set (encoding) is identified for HTML pages only, using Tika’s AutoDetectReader. The table shows, for each of the latest monthly crawls, the percentage of HTML pages encoded in each character set. The underlying data, including page counts, is provided in charsets.csv.
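As a rough illustration of how percentages like these can be derived from page counts, here is a minimal Python sketch. The column names `crawl`, `charset`, and `pages` are assumptions made for this example and may not match the actual layout of charsets.csv. The sample counts are made up to keep the arithmetic easy to follow.

```python
import csv
import io
from collections import defaultdict


def charset_percentages(csv_text):
    """Return {crawl: {charset: percentage}} from (crawl, charset, pages) rows.

    The column names are hypothetical; adjust them to the real file header.
    """
    counts = defaultdict(dict)
    for row in csv.DictReader(io.StringIO(csv_text)):
        counts[row["crawl"]][row["charset"]] = int(row["pages"])

    shares = {}
    for crawl, per_charset in counts.items():
        total = sum(per_charset.values())
        # Share of each charset among all HTML pages of that crawl, in percent.
        shares[crawl] = {
            name: round(100.0 * pages / total, 4)
            for name, pages in per_charset.items()
        }
    return shares


# Toy input with made-up counts, only to show the shape of the calculation.
SAMPLE = """crawl,charset,pages
CRAWL-A,UTF-8,940
CRAWL-A,ISO-8859-1,40
CRAWL-A,windows-1251,20
"""

if __name__ == "__main__":
    for crawl, per_charset in charset_percentages(SAMPLE).items():
        for name, pct in sorted(per_charset.items()):
            print(crawl, name, pct)
```

Percentages are computed per crawl, so each crawl's column sums to roughly 100 regardless of how many pages that crawl contains.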

charset CC-MAIN-2022-40 CC-MAIN-2022-49 CC-MAIN-2023-06
<other> 0.0005 0.0005 0.0005
<unknown> 1.7267 1.7970 1.6751
Big5 0.0627 0.0562 0.0493
Big5-HKSCS 0.0000 0.0000 0.0000
EUC-JP 0.1113 0.1180 0.1047
EUC-KR 0.0829 0.0841 0.0837
GB18030 0.0224 0.0200 0.0183
GB2312 0.3777 0.3298 0.2980
GBK 0.1564 0.1379 0.1164
IBM420 0.0061 0.0056 0.0053
IBM424 0.0017 0.0011 0.0015
IBM500 0.0016 0.0006 0.0006
IBM855 0.0000 0.0000 0.0000
IBM866 0.0002 0.0001 0.0003
ISO-2022-JP 0.0009 0.0010 0.0008
ISO-8859-1 2.2999 2.3616 2.2630
ISO-8859-13 0.0001 0.0001 0.0001
ISO-8859-15 0.0697 0.0731 0.0647
ISO-8859-2 0.1180 0.1297 0.1249
ISO-8859-3 0.0004 0.0003 0.0004
ISO-8859-4 0.0007 0.0009 0.0008
ISO-8859-5 0.0019 0.0025 0.0024
ISO-8859-6 0.0005 0.0001 0.0001
ISO-8859-7 0.0076 0.0087 0.0087
ISO-8859-8 0.0004 0.0004 0.0004
ISO-8859-9 0.0199 0.0233 0.0291
KOI8-R 0.0059 0.0062 0.0069
KOI8-U 0.0000 0.0000 0.0000
Shift_JIS 0.1575 0.1619 0.1619
TIS-620 0.0075 0.0067 0.0068
US-ASCII 0.0293 0.0269 0.0283
UTF-16 0.0024 0.0026 0.0029
UTF-16BE 0.0001 0.0001 0.0001
UTF-16LE 0.0013 0.0015 0.0013
UTF-32 0.0001 0.0001 0.0001
UTF-8 93.8166 93.7229 94.0560
windows-1250 0.0784 0.0790 0.0759
windows-1251 0.5612 0.5590 0.5468
windows-1252 0.1871 0.1864 0.1769
windows-1253 0.0025 0.0029 0.0029
windows-1254 0.0117 0.0121 0.0116
windows-1255 0.0043 0.0038 0.0037
windows-1256 0.0466 0.0547 0.0499
windows-1257 0.0067 0.0095 0.0081
windows-31j 0.0010 0.0011 0.0011
x-windows-874 0.0098 0.0098 0.0098
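
To compare crawls, the whitespace-separated rows of the table can be parsed directly. The sketch below copies four rows from the table above and computes the change in percentage points between the first crawl (CC-MAIN-2022-40) and the last (CC-MAIN-2023-06). The function names are illustrative and not part of the project's tooling.

```python
# A few rows copied verbatim from the table above:
# charset, then the percentage for each of the three crawls.
ROWS = """\
UTF-8 93.8166 93.7229 94.0560
ISO-8859-1 2.2999 2.3616 2.2630
GB2312 0.3777 0.3298 0.2980
windows-1251 0.5612 0.5590 0.5468
"""


def parse_rows(text):
    """Map each charset name to its list of per-crawl percentages."""
    table = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) < 2:
            continue  # skip blank or header-only lines
        table[parts[0]] = [float(value) for value in parts[1:]]
    return table


def change_first_to_last(table):
    """Percentage-point change between the first and the last crawl column."""
    return {
        name: round(values[-1] - values[0], 4)
        for name, values in table.items()
    }


if __name__ == "__main__":
    for name, delta in change_first_to_last(parse_rows(ROWS)).items():
        print(f"{name}: {delta:+.4f}")
```

The differences are in percentage points, not relative change, so a small absolute shift for a rare charset can still be a large relative one.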