Statistics of Common Crawl Monthly Archives

Number of pages, distribution of top-level domains, crawl overlaps, etc. - basic metrics about Common Crawl Monthly Crawl Archives


Character Encoding of HTML Pages

The character set (encoding) is identified for HTML pages only, using Tika’s AutoDetectReader. The table shows the percentage of HTML pages encoded in each character set for the latest monthly crawls. The underlying data, including page counts, is provided in charsets.csv.

charset CC-MAIN-2020-10 CC-MAIN-2020-16 CC-MAIN-2020-24
<other> 0.0000 0.0000 0.0000
<unknown> 1.2277 1.5628 1.1357
Big5 0.0920 0.0735 0.0984
Big5-HKSCS 0.0004 0.0003 0.0001
EUC-JP 0.2636 0.2530 0.2600
EUC-KR 0.1694 0.1401 0.1755
GB18030 0.1197 0.1195 0.0833
GB2312 1.5664 1.2630 1.2549
GBK 0.6878 0.5450 0.5764
IBM855 0.0004 0.0004 0.0005
IBM866 0.0006 0.0006 0.0007
ISO-2022-JP 0.0013 0.0007 0.0006
ISO-8859-1 4.0105 3.2558 3.4870
ISO-8859-13 0.0003 0.0003 0.0003
ISO-8859-15 0.1245 0.1180 0.1200
ISO-8859-2 0.2219 0.1939 0.2135
ISO-8859-3 0.0003 0.0003 0.0004
ISO-8859-4 0.0015 0.0019 0.0021
ISO-8859-5 0.0022 0.0018 0.0019
ISO-8859-6 0.0007 0.0004 0.0009
ISO-8859-7 0.0140 0.0114 0.0141
ISO-8859-8 0.0008 0.0009 0.0007
ISO-8859-9 0.0477 0.0426 0.0503
KOI8-R 0.0113 0.0095 0.0107
KOI8-U 0.0001 0.0002 0.0002
Shift_JIS 0.4284 0.3293 0.3178
TIS-620 0.0174 0.0138 0.0174
US-ASCII 0.0473 0.0366 0.0320
UTF-16 0.0037 0.0036 0.0047
UTF-16BE 0.0002 0.0001 0.0002
UTF-16LE 0.0017 0.0009 0.0007
UTF-32 0.0001 0.0002 0.0001
UTF-8 88.7701 90.1309 90.1324
windows-1250 0.1516 0.1304 0.1364
windows-1251 1.1631 1.0931 1.1218
windows-1252 0.6409 0.4858 0.5206
windows-1253 0.0067 0.0056 0.0059
windows-1254 0.0325 0.0286 0.0346
windows-1255 0.0286 0.0294 0.0327
windows-1256 0.1065 0.0835 0.1153
windows-1257 0.0109 0.0116 0.0142
windows-1258 0.0001 0.0000 0.0001
windows-31j 0.0016 0.0013 0.0013
x-IBM949 0.0000 0.0000 0.0000
x-MacCyrillic 0.0047 0.0036 0.0041
x-iso-8859-11 0.0000 0.0000 0.0000
x-windows-874 0.0185 0.0157 0.0190
x-windows-949 0.0001 0.0001 0.0000
x-windows-950 0.0000 0.0000 0.0000
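
Percentages like those in the table can be derived from per-crawl page counts by dividing each charset's count by the crawl's total. The following is a minimal sketch only: the column names (crawl, charset, pages) and the sample counts are assumptions made for illustration and may not match the actual layout or contents of charsets.csv.

```python
import csv
import io
from collections import defaultdict

# Hypothetical excerpt: the column names and counts are invented for
# illustration and may differ from the real charsets.csv.
SAMPLE_CSV = """crawl,charset,pages
CC-MAIN-2020-24,UTF-8,9000
CC-MAIN-2020-24,ISO-8859-1,350
CC-MAIN-2020-24,windows-1251,150
CC-MAIN-2020-24,<unknown>,500
"""


def charset_percentages(csv_file):
    """Return {crawl: {charset: percentage of that crawl's pages}}."""
    # Sum page counts per crawl and charset (duplicate rows are added up).
    counts = defaultdict(lambda: defaultdict(int))
    for row in csv.DictReader(csv_file):
        counts[row["crawl"]][row["charset"]] += int(row["pages"])

    result = {}
    for crawl, by_charset in counts.items():
        total = sum(by_charset.values())
        # Round to four decimals, matching the precision used in the table.
        result[crawl] = {
            charset: round(100.0 * pages / total, 4)
            for charset, pages in by_charset.items()
        }
    return result


percentages = charset_percentages(io.StringIO(SAMPLE_CSV))
for charset, pct in sorted(percentages["CC-MAIN-2020-24"].items()):
    print(f"{charset:<14}{pct:8.4f}")
```

To try this on real data, pass an open file handle instead of the in-memory sample and adjust the column names to whatever the file actually uses.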