Statistics of Common Crawl Monthly Archives

Number of pages, distribution of top-level domains, crawl overlaps, etc. - basic metrics about Common Crawl Monthly Crawl Archives
Latest crawl: CC-MAIN-2023-06


MIME Types

The crawled content is dominated by HTML pages and contains only a small percentage of other document formats. The tables show, for each of the latest monthly crawls, the percentage of pages accounted for by each of the top 100 media (MIME) types.

The first table is based on the Content-Type HTTP header, while the second uses the MIME type detected by Apache Tika from the actual content. The underlying data, including page counts, is provided in mimetypes.csv and mimetypes_detected.csv, respectively.
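As a rough illustration of how percentages like those in the tables could be derived from page counts, here is a minimal sketch. The column names (`crawl`, `mimetype`, `pages`) and the sample rows are assumptions made for this example; they are synthetic and may not match the actual layout of mimetypes.csv.

```python
import csv
import io
from collections import defaultdict

# Synthetic sample in an ASSUMED layout (crawl, mimetype, pages).
# The crawl names and counts are made up; they are not Common Crawl data.
SAMPLE_CSV = """crawl,mimetype,pages
CC-MAIN-EXAMPLE-A,text/html,980
CC-MAIN-EXAMPLE-A,application/pdf,15
CC-MAIN-EXAMPLE-A,image/jpeg,5
CC-MAIN-EXAMPLE-B,text/html,1900
CC-MAIN-EXAMPLE-B,application/pdf,100
"""


def percentages_by_crawl(csv_file):
    """Return {crawl: {mimetype: percentage of that crawl's pages}}."""
    counts = defaultdict(lambda: defaultdict(int))
    for row in csv.DictReader(csv_file):
        # Accumulate in case a MIME type appears on several rows per crawl.
        counts[row["crawl"]][row["mimetype"]] += int(row["pages"])

    result = {}
    for crawl, by_type in counts.items():
        total = sum(by_type.values())
        result[crawl] = {
            mimetype: round(100.0 * pages / total, 4)
            for mimetype, pages in by_type.items()
        }
    return result


table = percentages_by_crawl(io.StringIO(SAMPLE_CSV))
for crawl in sorted(table):
    for mimetype, pct in sorted(table[crawl].items()):
        print(f"{crawl} {mimetype} {pct:.4f}")
```

To try this on a real file, the `io.StringIO(SAMPLE_CSV)` argument could be replaced with an open file handle, after adjusting the column names to whatever the file actually uses.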

MIME type from the Content-Type HTTP header (values in percent, one column per crawl)

mimetype CC-MAIN-2022-40 CC-MAIN-2022-49 CC-MAIN-2023-06
<other> 0.0681 0.0732 0.0739
application/atom+xml 0.0703 0.0912 0.0933
application/download 0.0021 0.0028 0.0026
application/epub+zip 0.0012 0.0012 0.0011
application/force-download 0.0046 0.0055 0.0062
application/javascript 0.0002 0.0002 0.0002
application/json 0.0243 0.0250 0.0232
application/ld+json 0.0013 0.0012 0.0012
application/marc 0.0004 0.0004 0.0003
application/msword 0.0028 0.0029 0.0029
application/octet-stream 0.0482 0.0496 0.0466
application/octetstream 0.0003 0.0003 0.0003
application/pdf 0.7067 0.8052 0.7813
application/pgp-encrypted 0.0001 0.0001 0.0001
application/pgp-signature 0.0021 0.0023 0.0026
application/postscript 0.0001 0.0002 0.0001
application/rdf+xml 0.0042 0.0038 0.0041
application/rss+xml 0.0562 0.0641 0.0638
application/rtf 0.0011 0.0013 0.0012
application/save-to-disk 0.0000 0.0000 0.0000
application/text 0.0023 0.0024 0.0011
application/unknown 0.0002 0.0002 0.0003
application/vnd.android.package-archive 0.0001 0.0002 0.0001
application/vnd.google-earth.kml+xml 0.0015 0.0016 0.0025
application/vnd.google-earth.kmz 0.0003 0.0003 0.0003
application/vnd.ms-excel 0.0018 0.0018 0.0019
application/vnd.ms-powerpoint 0.0003 0.0004 0.0004
application/vnd.ms-word 0.0005 0.0006 0.0003
application/vnd.ms-word.document.macroenabled.12 0.0000 0.0000 0.0000
application/vnd.oasis.opendocument.text 0.0007 0.0008 0.0008
application/vnd.openxmlformats-officedocument.presentationml.presentation 0.0002 0.0003 0.0002
application/vnd.openxmlformats-officedocument.spreadsheetml.sheet 0.0017 0.0017 0.0016
application/vnd.openxmlformats-officedocument.wordprocessingml.document 0.0024 0.0027 0.0026
application/vnd.wap.xhtml+xml 0.0009 0.0008 0.0008
application/x-bibtex 0.0031 0.0029 0.0032
application/x-bittorrent 0.0002 0.0002 0.0001
application/x-bzip2 0.0007 0.0006 0.0005
application/x-debian-package 0.0000 0.0000 0.0000
application/x-download 0.0017 0.0019 0.0017
application/x-endnote-refer 0.0034 0.0034 0.0020
application/x-gzip 0.0013 0.0010 0.0010
application/x-httpd-php 0.0002 0.0003 0.0003
application/x-java-jnlp-file 0.0003 0.0001 0.0003
application/x-javascript 0.0004 0.0004 0.0002
application/x-json 0.0000 0.0000 0.0000
application/x-mobipocket-ebook 0.0002 0.0002 0.0001
application/x-msdownload 0.0005 0.0003 0.0003
application/x-netcdf 0.0015 0.0014 0.0029
application/x-rar-compressed 0.0002 0.0001 0.0001
application/x-research-info-systems 0.0069 0.0064 0.0068
application/x-shockwave-flash 0.0002 0.0002 0.0002
application/x-tar 0.0007 0.0007 0.0008
application/x-tex 0.0002 0.0002 0.0001
application/x-troff-man 0.0008 0.0007 0.0006
application/x-zip-compressed 0.0005 0.0007 0.0007
application/xhtml+xml 0.0486 0.0436 0.0347
application/xml 0.0264 0.0273 0.0249
application/zip 0.0061 0.0052 0.0046
audio/mp3 0.0002 0.0003 0.0002
audio/mp4 0.0000 0.0000 0.0000
audio/mpeg 0.0031 0.0033 0.0033
audio/x-mpegurl 0.0009 0.0007 0.0007
audio/x-pn-realaudio 0.0000 0.0000 0.0000
audio/x-scpls 0.0001 0.0001 0.0001
audio/x-wav 0.0002 0.0002 0.0002
binary/octet-stream 0.0005 0.0006 0.0007
image/gif 0.0042 0.0044 0.0040
image/jp2 0.0000 0.0000 0.0000
image/jpeg 0.1559 0.1492 0.1312
image/jpg 0.0024 0.0023 0.0024
image/pjpeg 0.0005 0.0006 0.0005
image/png 0.0313 0.0263 0.0204
image/svg+xml 0.0010 0.0009 0.0006
image/tiff 0.0005 0.0006 0.0006
image/vnd.djvu 0.0004 0.0002 0.0003
image/webp 0.0010 0.0012 0.0020
message/rfc822 0.0003 0.0003 0.0002
text/calendar 0.0341 0.0352 0.0296
text/css 0.0005 0.0005 0.0005
text/csv 0.0041 0.0041 0.0035
text/enriched 0.0008 0.0008 0.0004
text/html 98.5070 98.3885 98.4612
text/javascript 0.0002 0.0002 0.0002
text/pdf 0.0000 0.0000 0.0000
text/plain 0.0649 0.0615 0.0584
text/tab-separated-values 0.0006 0.0009 0.0001
text/turtle 0.0015 0.0013 0.0012
text/vcard 0.0009 0.0010 0.0010
text/x-bibtex 0.0005 0.0004 0.0005
text/x-c 0.0002 0.0001 0.0002
text/x-csrc 0.0007 0.0006 0.0004
text/x-diff 0.0002 0.0003 0.0002
text/x-patch 0.0004 0.0004 0.0001
text/x-perl 0.0001 0.0001 0.0001
text/x-vcalendar 0.0004 0.0004 0.0005
text/x-vcard 0.0018 0.0021 0.0019
text/xml 0.0651 0.0660 0.0665
unknown/unknown 0.0002 0.0002 0.0001
video/mp4 0.0020 0.0019 0.0016
video/webm 0.0001 0.0001 0.0001
video/x-ms-asf 0.0001 0.0001 0.0001

MIME type detected by Apache Tika (values in percent, one column per crawl)

mimetype_detected CC-MAIN-2022-40 CC-MAIN-2022-49 CC-MAIN-2023-06
<other> 0.0151 0.0158 0.0144
application/atom+xml 0.0754 0.0965 0.0965
application/epub+zip 0.0015 0.0015 0.0014
application/gpx+xml 0.0003 0.0004 0.0004
application/gzip 0.0032 0.0024 0.0018
application/javascript 0.0007 0.0008 0.0005
application/json 0.0236 0.0244 0.0229
application/marc 0.0014 0.0014 0.0013
application/mbox 0.0036 0.0031 0.0021
application/msword 0.0019 0.0020 0.0020
application/octet-stream 0.0266 0.0249 0.0187
application/pdf 0.7135 0.8161 0.7913
application/pgp-signature 0.0028 0.0032 0.0029
application/pkcs7-signature 0.0003 0.0004 0.0004
application/postscript 0.0001 0.0001 0.0001
application/rdf+xml 0.0091 0.0086 0.0087
application/rss+xml 0.0973 0.1063 0.1010
application/rtf 0.0019 0.0018 0.0013
application/text 0.0005 0.0004 0.0001
application/vnd.android.package-archive 0.0002 0.0002 0.0002
application/vnd.google-earth.kml+xml 0.0025 0.0026 0.0035
application/vnd.google-earth.kmz 0.0003 0.0003 0.0003
application/vnd.ms-excel 0.0013 0.0013 0.0013
application/vnd.ms-powerpoint 0.0002 0.0002 0.0003
application/vnd.oasis.opendocument.text 0.0008 0.0010 0.0010
application/vnd.openxmlformats-officedocument.presentationml.presentation 0.0002 0.0003 0.0002
application/vnd.openxmlformats-officedocument.spreadsheetml.sheet 0.0016 0.0017 0.0015
application/vnd.openxmlformats-officedocument.wordprocessingml.document 0.0023 0.0026 0.0025
application/x-bibtex-text-file 0.0069 0.0069 0.0071
application/x-bittorrent 0.0003 0.0003 0.0002
application/x-bzip2 0.0015 0.0010 0.0007
application/x-debian-package 0.0001 0.0001 0.0008
application/x-dosexec 0.0002 0.0002 0.0002
application/x-endnote-refer 0.0019 0.0018 0.0014
application/x-hdf 0.0014 0.0031 0.0029
application/x-mobipocket-ebook 0.0003 0.0003 0.0003
application/x-ms-asx 0.0001 0.0001 0.0001
application/x-msdownload 0.0002 0.0002 0.0002
application/x-rar-compressed 0.0000 0.0000 0.0000
application/x-research-info-systems 0.0002 0.0002 0.0002
application/x-sh 0.0012 0.0011 0.0010
application/x-shockwave-flash 0.0002 0.0003 0.0002
application/x-stata-do 0.0018 0.0016 0.0010
application/x-tex 0.0004 0.0004 0.0002
application/x-tika-msoffice 0.0027 0.0030 0.0030
application/x-tika-ooxml 0.0027 0.0030 0.0029
application/x-wais-source 0.0003 0.0002 0.0002
application/x-xz 0.0003 0.0008 0.0008
application/xhtml+xml 12.0352 11.6760 11.3311
application/xml 0.0676 0.0588 0.0619
application/zip 0.0099 0.0091 0.0083
application/zlib 0.0002 0.0001 0.0002
application/zstd 0.0015 0.0016 0.0014
audio/midi 0.0001 0.0000 0.0001
audio/mp4 0.0001 0.0001 0.0001
audio/mpeg 0.0049 0.0052 0.0052
audio/vnd.wave 0.0006 0.0005 0.0004
audio/x-mpegurl 0.0002 0.0002 0.0001
image/bmp 0.0001 0.0001 0.0001
image/gif 0.0037 0.0037 0.0035
image/jpeg 0.1645 0.1583 0.1393
image/png 0.0309 0.0261 0.0211
image/svg+xml 0.0013 0.0011 0.0008
image/tiff 0.0006 0.0012 0.0007
image/vnd.djvu 0.0005 0.0003 0.0005
image/vnd.dxf; format=ascii 0.0003 0.0003 0.0003
image/vnd.microsoft.icon 0.0001 0.0001 0.0000
image/webp 0.0012 0.0013 0.0022
message/rfc822 0.0028 0.0028 0.0018
text/asp NaN NaN 0.0000
text/calendar 0.0371 0.0394 0.0333
text/css 0.0005 0.0005 0.0005
text/csv 0.0040 0.0039 0.0033
text/html 86.4851 86.7310 87.1496
text/plain 0.1052 0.1045 0.1030
text/prs.lines.tag 0.0041 0.0031 0.0055
text/tab-separated-values 0.0007 0.0010 0.0002
text/troff 0.0006 0.0007 0.0006
text/turtle 0.0014 0.0012 0.0012
text/vtt 0.0006 0.0006 0.0007
text/x-c++src 0.0003 0.0003 0.0003
text/x-chdr 0.0009 0.0008 0.0008
text/x-csrc 0.0017 0.0015 0.0013
text/x-diff 0.0018 0.0016 0.0013
text/x-jsp 0.0001 0.0001 0.0001
text/x-log 0.0021 0.0024 0.0028
text/x-matlab 0.0024 0.0024 0.0022
text/x-perl 0.0030 0.0020 0.0019
text/x-php 0.0037 0.0031 0.0031
text/x-python 0.0007 0.0005 0.0004
text/x-vcalendar 0.0005 0.0005 0.0006
text/x-vcard 0.0033 0.0040 0.0034
text/x-web-markdown 0.0004 0.0003 0.0004
video/mp4 0.0021 0.0020 0.0017
video/quicktime 0.0004 0.0003 0.0003
video/webm 0.0001 0.0001 0.0001
video/x-m4v 0.0000 0.0000 0.0000
video/x-matroska 0.0000 0.0000 0.0000
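
The two tables can be compared directly to see where the Content-Type header and the detected type disagree. The sketch below is only an illustration: it copies a handful of CC-MAIN-2023-06 values from the tables above and reports the shift in percentage points. The selection of MIME types and the variable names are choices made for this example.

```python
# CC-MAIN-2023-06 percentages copied from the two tables above.
declared = {  # from the Content-Type HTTP header
    "text/html": 98.4612,
    "application/xhtml+xml": 0.0347,
    "application/pdf": 0.7813,
    "application/rss+xml": 0.0638,
    "application/octet-stream": 0.0466,
    "text/plain": 0.0584,
}
detected = {  # as detected by Apache Tika from the content
    "text/html": 87.1496,
    "application/xhtml+xml": 11.3311,
    "application/pdf": 0.7913,
    "application/rss+xml": 0.1010,
    "application/octet-stream": 0.0187,
    "text/plain": 0.1030,
}

# Shift in percentage points: positive means the type is detected
# more often than it is declared in the header.
shift = {m: detected[m] - declared[m] for m in declared}

# Rank by absolute shift, largest disagreement first.
ranked = sorted(shift.items(), key=lambda item: abs(item[1]), reverse=True)
for mimetype, delta in ranked:
    print(f"{mimetype:28s} {delta:+.4f}")

# Combined share of the two HTML-family types under each method.
html_family = ("text/html", "application/xhtml+xml")
declared_html = sum(declared[m] for m in html_family)
detected_html = sum(detected[m] for m in html_family)
print(f"HTML family declared: {declared_html:.4f}")
print(f"HTML family detected: {detected_html:.4f}")
```

For these values, the largest shifts are text/html (about 11.3 points lower when detected) and application/xhtml+xml (about 11.3 points higher), while the combined share of the two stays close to 98.5 percent under both methods. This suggests that many pages served as text/html are detected as XHTML.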