Statistics of Common Crawl Monthly Archives

Number of pages, distribution of top-level domains, crawl overlaps, etc. - basic metrics about Common Crawl Monthly Crawl Archives

Character Encoding of HTML Pages

The character set (encoding) of each HTML page is identified by Tika’s AutoDetectReader; only HTML pages are considered. The table shows the percentage of HTML pages encoded in each character set, for each of the latest monthly crawls.
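
As a rough illustration of how such percentages can be derived, the following Python sketch turns a list of per-page detection results into a share per character set. It is not the project's actual pipeline: the function name `charset_percentages`, the input format, and the mapping of `None` to `<unknown>` are assumptions made for this example, and the sample input is hypothetical.

```python
from collections import Counter

def charset_percentages(detected):
    """Return {charset: percent of pages}, rounded to four decimals.

    `detected` holds one charset name per HTML page. None stands for a page
    whose charset could not be determined and is counted as "<unknown>"
    (an assumption of this sketch, not something the source states).
    """
    counts = Counter(name if name is not None else "<unknown>" for name in detected)
    total = sum(counts.values())
    # Sort by charset name so the output order resembles the table's layout.
    return {name: round(100.0 * n / total, 4) for name, n in sorted(counts.items())}

# Hypothetical detection results for eight pages.
pages = ["UTF-8"] * 6 + ["ISO-8859-1", None]
shares = charset_percentages(pages)
for name, percent in shares.items():
    print(f"{name} {percent:.4f}")
```

Rounding to four decimals means that a very small share can appear as 0.0000 even when a few pages were counted.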

charset CC-MAIN-2019-39 CC-MAIN-2019-43 CC-MAIN-2019-47
<other> 0.0000 0.0000 0.0000
<unknown> 1.9997 1.2558 1.3988
Big5 0.0649 0.0782 0.0655
Big5-HKSCS 0.0004 0.0006 0.0007
EUC-JP 0.2599 0.2522 0.2361
EUC-KR 0.1461 0.1561 0.1385
GB18030 0.0760 0.0849 0.0957
GB2312 1.1137 1.6685 1.3610
GBK 0.8499 1.1153 0.6279
IBM855 0.0002 0.0003 0.0003
IBM866 0.0007 0.0006 0.0005
ISO-2022-JP 0.0007 0.0006 0.0005
ISO-8859-1 3.6044 3.6676 3.4432
ISO-8859-13 0.0005 0.0004 0.0003
ISO-8859-15 0.1731 0.1250 0.1210
ISO-8859-2 0.2134 0.2175 0.1995
ISO-8859-3 0.0002 0.0005 0.0003
ISO-8859-4 0.0019 0.0020 0.0016
ISO-8859-5 0.0024 0.0029 0.0021
ISO-8859-6 0.0005 0.0004 0.0005
ISO-8859-7 0.0135 0.0139 0.0152
ISO-8859-8 0.0011 0.0008 0.0009
ISO-8859-9 0.0488 0.0449 0.0420
KOI8-R 0.0090 0.0100 0.0095
KOI8-U 0.0001 0.0002 0.0002
Shift_JIS 0.3077 0.3321 0.3149
TIS-620 0.0151 0.0203 0.0166
US-ASCII 0.0427 0.0389 0.0299
UTF-16 0.0034 0.0038 0.0037
UTF-16BE 0.0000 0.0000 0.0002
UTF-16LE 0.0007 0.0006 0.0008
UTF-32 0.0001 0.0001 0.0001
UTF-8 89.0196 88.9559 90.0568
windows-1250 0.1638 0.1453 0.1410
windows-1251 1.1090 1.0689 0.9893
windows-1252 0.5208 0.5188 0.4794
windows-1253 0.0055 0.0053 0.0059
windows-1254 0.0353 0.0297 0.0299
windows-1255 0.0338 0.0289 0.0309
windows-1256 0.1214 0.1109 0.1012
windows-1257 0.0149 0.0128 0.0135
windows-1258 0.0000 0.0000 0.0000
windows-31j 0.0025 0.0031 0.0023
x-IBM949 0.0001 0.0000 0.0001
x-MacCyrillic 0.0043 0.0038 0.0037
x-iso-8859-11 0.0001 0.0001 0.0000
x-windows-874 0.0177 0.0212 0.0180
x-windows-949 0.0000 0.0000 0.0000
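
To work with these figures programmatically, the rows can be parsed as whitespace-separated fields: a charset name followed by one percentage per crawl. The sketch below is a minimal example under that assumption; the helper names `parse_rows` and `top_charsets` are made up for illustration, and only a handful of rows are copied from the table.

```python
# Crawl identifiers, in the column order used by the table.
CRAWLS = ["CC-MAIN-2019-39", "CC-MAIN-2019-43", "CC-MAIN-2019-47"]

# A few rows copied verbatim from the table (percent of HTML pages).
SAMPLE_ROWS = """\
<unknown> 1.9997 1.2558 1.3988
GB2312 1.1137 1.6685 1.3610
ISO-8859-1 3.6044 3.6676 3.4432
UTF-8 89.0196 88.9559 90.0568
windows-1251 1.1090 1.0689 0.9893
"""

def parse_rows(text):
    """Map each charset name to a {crawl: percentage} dict."""
    table = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != len(CRAWLS) + 1:
            continue  # skip blank or malformed lines
        charset, values = parts[0], parts[1:]
        table[charset] = dict(zip(CRAWLS, map(float, values)))
    return table

def top_charsets(table, crawl, n=3):
    """Return the n charsets with the largest share in the given crawl."""
    return sorted(table, key=lambda c: table[c][crawl], reverse=True)[:n]

table = parse_rows(SAMPLE_ROWS)

# Change in the UTF-8 share between the first and last listed crawl,
# in percentage points.
utf8_change = round(table["UTF-8"][CRAWLS[-1]] - table["UTF-8"][CRAWLS[0]], 4)
print("UTF-8 change:", utf8_change)
print("Largest shares in", CRAWLS[-1], ":", top_charsets(table, CRAWLS[-1]))
```

In the sampled rows, UTF-8 grows by roughly one percentage point between CC-MAIN-2019-39 and CC-MAIN-2019-47, and ISO-8859-1 remains the second most common detected charset.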