Statistics of Common Crawl Monthly Archives

Number of pages, distribution of top-level domains, crawl overlaps, etc. - basic metrics about Common Crawl Monthly Crawl Archives
Latest crawl: CC-MAIN-2021-31


Character Encoding of HTML Pages

The character set (encoding) is identified by Tika’s AutoDetectReader, and only HTML pages are considered. The table shows, as a percentage of HTML pages, how often each character set was used to encode the pages crawled in the three latest monthly crawls. The underlying data, including page counts, is provided in charsets.csv.

charset CC-MAIN-2021-21 CC-MAIN-2021-25 CC-MAIN-2021-31
<other> 0.0001 0.0001 0.0001
<unknown> 2.0043 1.5420 1.0458
Big5 0.0596 0.0764 0.0552
Big5-HKSCS 0.0000 0.0000 0.0000
EUC-JP 0.1300 0.1326 0.1138
EUC-KR 0.0861 0.0809 0.0651
GB18030 0.0313 0.0344 0.0245
GB2312 0.4277 1.3178 0.5127
GBK 0.1667 0.1947 0.1507
IBM855 0.0002 0.0002 0.0002
IBM866 0.0013 0.0013 0.0011
ISO-2022-JP 0.0010 0.0008 0.0005
ISO-8859-1 2.9139 2.6655 2.2023
ISO-8859-13 0.0000 0.0000 0.0000
ISO-8859-15 0.0953 0.0944 0.0850
ISO-8859-2 0.1387 0.1200 0.0984
ISO-8859-3 0.0001 0.0001 0.0001
ISO-8859-4 0.0006 0.0006 0.0005
ISO-8859-5 0.0016 0.0026 0.0013
ISO-8859-6 0.0003 0.0003 0.0003
ISO-8859-7 0.0100 0.0091 0.0078
ISO-8859-8 0.0007 0.0005 0.0004
ISO-8859-9 0.0322 0.0283 0.0257
KOI8-R 0.0068 0.0062 0.0049
KOI8-U 0.0001 0.0001 0.0001
Shift_JIS 0.2856 0.2260 0.1725
TIS-620 0.0073 0.0093 0.0067
US-ASCII 0.0369 0.0508 0.0306
UTF-16 0.0031 0.0035 0.0019
UTF-16BE 0.0000 0.0000 0.0000
UTF-16LE 0.0015 0.0326 0.0319
UTF-32 0.0001 0.0000 0.0001
UTF-8 92.0045 92.0921 94.2676
windows-1250 0.1107 0.0953 0.0835
windows-1251 0.8452 0.6804 0.6262
windows-1252 0.4739 0.3925 0.2948
windows-1253 0.0039 0.0037 0.0026
windows-1254 0.0166 0.0161 0.0137
windows-1255 0.0135 0.0121 0.0105
windows-1256 0.0638 0.0522 0.0424
windows-1257 0.0080 0.0070 0.0060
windows-31j 0.0014 0.0018 0.0012
x-MacCyrillic 0.0034 0.0028 0.0022
x-windows-874 0.0119 0.0128 0.0092
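
The percentages in the table can in principle be recomputed from the page counts in charsets.csv. The following is a minimal sketch, not the project's actual code: the column names `crawl`, `charset` and `pages` are assumptions about the file layout, the function name `charset_percentages` is hypothetical, and the sample input uses made-up counts purely to show the call shape.

```python
import csv
import io
from collections import defaultdict


def charset_percentages(csv_text):
    """Return {crawl: {charset: percent}} computed from CSV text.

    Assumes a header row with the hypothetical columns
    'crawl', 'charset' and 'pages'; the real charsets.csv may differ.
    """
    # Sum page counts per (crawl, charset) pair.
    counts = defaultdict(dict)
    for row in csv.DictReader(io.StringIO(csv_text)):
        crawl = row["crawl"]
        charset = row["charset"]
        counts[crawl][charset] = counts[crawl].get(charset, 0) + int(row["pages"])

    # Convert counts to percentages of all pages within each crawl.
    percentages = {}
    for crawl, per_charset in counts.items():
        total = sum(per_charset.values())
        percentages[crawl] = {
            charset: round(100.0 * pages / total, 4)
            for charset, pages in per_charset.items()
        }
    return percentages


# Toy input with made-up counts (not real Common Crawl data).
sample = """crawl,charset,pages
CC-MAIN-TOY,UTF-8,940
CC-MAIN-TOY,ISO-8859-1,50
CC-MAIN-TOY,<unknown>,10
"""
result = charset_percentages(sample)
```

Rounding to four decimal places mirrors the precision shown in the table, which is why very rare character sets appear there as 0.0000.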