Statistics of Common Crawl Monthly Archives

Number of pages, distribution of top-level domains, crawl overlaps, etc. - basic metrics about Common Crawl Monthly Crawl Archives
Latest crawl: CC-MAIN-2025-47


MIME Types

The crawled content is dominated by HTML pages and contains only a small percentage of other document formats. The tables show, for each of the latest monthly crawls, the percentage of pages accounted for by each of the top 100 media (MIME) types.

While the first table is based on the Content-Type HTTP header, the second uses the MIME type detected by Apache Tika from the actual content. The underlying data, including page counts, is provided in mimetypes.csv and mimetypes_detected.csv, respectively.
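As a minimal sketch of how such percentages could be derived from page counts, the Python snippet below sums the pages per crawl and divides each MIME type's count by that total. The column names (`crawl`, `mimetype`, `pages`) and the sample numbers are assumptions made for illustration; the actual layout of mimetypes.csv may differ and should be checked before reuse.

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample in an assumed layout (crawl, mimetype, pages);
# the real mimetypes.csv may use different column names and values.
SAMPLE_CSV = """crawl,mimetype,pages
CC-MAIN-2025-43,text/html,9890
CC-MAIN-2025-43,application/pdf,60
CC-MAIN-2025-43,image/jp2,50
CC-MAIN-2025-47,text/html,1970
CC-MAIN-2025-47,application/pdf,30
"""


def percentage_table(csv_text):
    """Return {mimetype: {crawl: percent of that crawl's listed pages}}."""
    counts = defaultdict(dict)
    totals = defaultdict(int)
    for row in csv.DictReader(io.StringIO(csv_text)):
        pages = int(row["pages"])
        counts[row["mimetype"]][row["crawl"]] = pages
        totals[row["crawl"]] += pages
    return {
        mime: {
            crawl: round(100.0 * n / totals[crawl], 4)
            for crawl, n in per_crawl.items()
        }
        for mime, per_crawl in counts.items()
    }


table = percentage_table(SAMPLE_CSV)
crawls = ["CC-MAIN-2025-43", "CC-MAIN-2025-47"]
for mime in sorted(table):
    # A MIME type with no row for a crawl has no value and is printed as nan.
    cells = [f"{table[mime].get(c, float('nan')):.4f}" for c in crawls]
    print(mime, *cells)
```

In this sketch the per-crawl total is the sum of the rows present in the input, so the percentages add up to 100 only if every MIME type (or a catch-all row) is included.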

Percentage of pages per crawl, by MIME type from the Content-Type HTTP header:
mimetype CC-MAIN-2025-38 CC-MAIN-2025-43 CC-MAIN-2025-47
<other> 0.0700 0.0722 0.0779
application/atom+xml 0.1157 0.0950 0.1434
application/calendar 0.0002 0.0001 0.0004
application/download 0.0027 0.0025 0.0043
application/epub+zip 0.0015 0.0015 0.0018
application/force-download 0.0093 0.0083 0.0114
application/gpx+xml 0.0006 0.0006 0.0007
application/ics 0.0003 0.0003 0.0005
application/javascript 0.0015 0.0009 0.0007
application/json 0.0277 0.0239 0.0279
application/ld+json 0.0018 0.0018 0.0018
application/marc 0.0004 0.0005 0.0004
application/msword 0.0018 0.0018 0.0022
application/octet-stream 0.0390 0.0358 0.0410
application/octetstream 0.0002 0.0002 0.0002
application/pdf 0.5601 0.5740 0.6816
application/pgp-encrypted 0.0001 0.0001 0.0001
application/pgp-signature 0.0012 0.0012 0.0014
application/postscript 0.0001 0.0001 0.0002
application/rdf+xml 0.0043 0.0038 0.0044
application/rss+xml 0.0403 0.0393 0.0439
application/rtf 0.0009 0.0008 0.0010
application/save-to-disk 0.0000 0.0000 0.0000
application/text 0.0001 0.0001 0.0002
application/unknown 0.0002 0.0002 0.0002
application/vnd.android.package-archive 0.0000 0.0000 0.0000
application/vnd.google-earth.kml+xml 0.0009 0.0010 0.0011
application/vnd.google-earth.kmz 0.0002 0.0002 0.0003
application/vnd.ms-excel 0.0013 0.0012 0.0014
application/vnd.ms-powerpoint 0.0002 0.0002 0.0002
application/vnd.ms-word 0.0002 0.0002 0.0002
application/vnd.oasis.opendocument.text 0.0004 0.0004 0.0007
application/vnd.openxmlformats-officedocument.presentationml.presentation 0.0002 0.0002 0.0002
application/vnd.openxmlformats-officedocument.spreadsheetml.sheet 0.0012 0.0016 0.0016
application/vnd.openxmlformats-officedocument.wordprocessingml.document 0.0019 0.0019 0.0023
application/vnd.wap.xhtml+xml 0.0007 0.0005 0.0006
application/x-bibtex 0.0073 0.0066 0.0077
application/x-bittorrent 0.0002 0.0001 0.0001
application/x-bzip2 0.0001 0.0001 0.0001
application/x-debian-package 0.0000 0.0000 0.0000
application/x-download 0.0017 0.0016 0.0021
application/x-endnote-refer 0.0008 0.0008 0.0008
application/x-gzip 0.0001 0.0001 0.0001
application/x-httpd-php 0.0003 0.0004 0.0005
application/x-java-jnlp-file 0.0001 0.0001 0.0001
application/x-javascript 0.0001 0.0001 0.0001
application/x-json 0.0000 0.0000 0.0000
application/x-mobipocket-ebook 0.0005 0.0005 0.0006
application/x-msdownload 0.0004 0.0003 0.0003
application/x-netcdf 0.0001 0.0003 0.0003
application/x-research-info-systems 0.0098 0.0089 0.0107
application/x-shockwave-flash 0.0001 0.0001 0.0001
application/x-tar 0.0001 0.0001 0.0000
application/x-tex 0.0002 0.0002 0.0001
application/x-troff-man 0.0003 0.0003 0.0002
application/x-zip-compressed 0.0000 0.0000 0.0000
application/xhtml+xml 0.0099 0.0104 0.0119
application/xml 0.0230 0.0222 0.0245
application/zip 0.0001 0.0001 0.0002
audio/mpeg 0.0000 0.0000 0.0001
audio/x-mpegurl 0.0004 0.0004 0.0005
audio/x-scpls 0.0001 0.0001 0.0001
audio/x-wav 0.0000 0.0000 0.0000
binary/octet-stream 0.0004 0.0004 0.0003
image/gif 0.0004 0.0003 0.0004
image/jp2 NaN 0.0000 NaN
image/jpeg 0.0003 0.0003 0.0003
image/jpg 0.0000 0.0000 0.0000
image/pjpeg 0.0000 0.0000 0.0000
image/png 0.0001 0.0002 0.0003
image/svg+xml 0.0000 0.0000 0.0000
image/tiff 0.0000 0.0000 0.0000
image/vnd.djvu 0.0002 0.0002 0.0003
image/webp 0.0000 0.0000 0.0000
message/rfc822 0.0003 0.0003 0.0004
text/calendar 0.0302 0.0314 0.0400
text/css 0.0005 0.0005 0.0007
text/csv 0.0031 0.0032 0.0032
text/directory 0.0002 0.0002 0.0002
text/enriched 0.0000 0.0000 0.0001
text/html 98.9025 98.9265 98.7109
text/javascript 0.0003 0.0003 0.0003
text/pdf 0.0000 0.0000 0.0000
text/plain 0.0472 0.0428 0.0496
text/prs.lines.tag 0.0022 0.0025 0.0025
text/tab-separated-values 0.0004 0.0004 0.0005
text/turtle 0.0019 0.0018 0.0017
text/vcard 0.0010 0.0010 0.0012
text/x-bibtex 0.0004 0.0003 0.0004
text/x-c 0.0001 0.0001 0.0001
text/x-csrc 0.0001 0.0001 0.0001
text/x-diff 0.0002 0.0002 0.0002
text/x-patch 0.0002 0.0002 0.0002
text/x-perl 0.0000 0.0000 0.0000
text/x-vcalendar 0.0004 0.0003 0.0004
text/x-vcard 0.0015 0.0017 0.0018
text/xml 0.0627 0.0583 0.0663
unknown/unknown 0.0000 0.0000 0.0000
video/mp4 0.0000 0.0000 0.0000
video/webm NaN 0.0000 0.0000
video/x-ms-asf 0.0002 0.0003 0.0001

Percentage of pages per crawl, by MIME type detected by Apache Tika:
mimetype_detected CC-MAIN-2025-38 CC-MAIN-2025-43 CC-MAIN-2025-47
<other> 0.0103 0.0094 0.0105
application/atom+xml 0.1175 0.0966 0.1450
application/epub+zip 0.0017 0.0018 0.0020
application/gpx+xml 0.0006 0.0006 0.0007
application/gzip 0.0000 0.0000 0.0000
application/javascript 0.0015 0.0009 0.0008
application/json 0.0277 0.0240 0.0282
application/marc 0.0010 0.0010 0.0009
application/mbox 0.0013 0.0010 0.0012
application/msword 0.0014 0.0014 0.0017
application/octet-stream 0.0107 0.0094 0.0112
application/pdf 0.5719 0.5853 0.6951
application/pgp-signature 0.0012 0.0010 0.0014
application/pkcs7-signature 0.0003 0.0004 0.0004
application/postscript 0.0003 0.0003 0.0003
application/rdf+xml 0.0082 0.0076 0.0085
application/rss+xml 0.0655 0.0623 0.0698
application/rtf 0.0013 0.0012 0.0015
application/text 0.0000 0.0000 0.0000
application/vnd.android.package-archive 0.0000 0.0000 0.0000
application/vnd.google-earth.kml+xml 0.0019 0.0020 0.0023
application/vnd.google-earth.kmz 0.0002 0.0002 0.0003
application/vnd.ms-excel 0.0007 0.0005 0.0007
application/vnd.ms-powerpoint 0.0002 0.0002 0.0002
application/vnd.oasis.opendocument.spreadsheet 0.0003 0.0004 0.0003
application/vnd.oasis.opendocument.text 0.0005 0.0005 0.0009
application/vnd.openxmlformats-officedocument.presentationml.presentation 0.0002 0.0002 0.0002
application/vnd.openxmlformats-officedocument.spreadsheetml.sheet 0.0012 0.0016 0.0016
application/vnd.openxmlformats-officedocument.wordprocessingml.document 0.0019 0.0019 0.0023
application/x-bibtex-text-file 0.0124 0.0111 0.0135
application/x-bittorrent 0.0002 0.0002 0.0002
application/x-dosexec NaN NaN 0.0000
application/x-endnote-refer 0.0020 0.0017 0.0020
application/x-mobipocket-ebook 0.0006 0.0006 0.0008
application/x-ms-asx 0.0002 0.0003 0.0001
application/x-msdownload 0.0000 0.0000 0.0000
application/x-pds 0.0009 0.0018 0.0016
application/x-rar-compressed 0.0000 0.0000 0.0000
application/x-research-info-systems 0.0000 0.0000 0.0000
application/x-sh 0.0009 0.0007 0.0008
application/x-shockwave-flash 0.0001 0.0001 0.0001
application/x-stata-do 0.0004 0.0003 0.0004
application/x-tex 0.0003 0.0003 0.0003
application/x-tex-tfm 0.0004 0.0002 0.0001
application/x-tika-msoffice 0.0019 0.0019 0.0023
application/x-tika-ooxml 0.0014 0.0014 0.0015
application/x-wais-source 0.0002 0.0001 0.0001
application/x-xz 0.0000 NaN NaN
application/xhtml+xml 7.7676 7.9563 8.8559
application/xml 0.0671 0.0623 0.0699
application/zip 0.0000 0.0000 0.0000
application/zlib 0.0001 0.0001 0.0001
audio/mp4 NaN NaN 0.0000
audio/mpeg 0.0000 0.0000 0.0000
audio/vnd.wave 0.0000 0.0000 0.0000
audio/x-mpegurl 0.0000 0.0000 0.0000
image/gif 0.0000 0.0000 0.0000
image/jpeg 0.0002 0.0001 0.0001
image/png 0.0000 0.0000 0.0000
image/svg+xml 0.0000 0.0000 0.0000
image/tiff 0.0000 0.0000 0.0000
image/vnd.djvu 0.0003 0.0003 0.0005
image/webp 0.0000 0.0000 0.0000
message/rfc822 0.0007 0.0006 0.0007
text/calendar 0.0358 0.0369 0.0475
text/css 0.0005 0.0005 0.0007
text/csv 0.0031 0.0032 0.0032
text/html 91.1680 91.0110 89.8990
text/plain 0.0825 0.0749 0.0866
text/prs.lines.tag 0.0044 0.0045 0.0048
text/tab-separated-values 0.0004 0.0004 0.0005
text/troff 0.0004 0.0004 0.0004
text/turtle 0.0019 0.0018 0.0017
text/vtt 0.0009 0.0006 0.0007
text/x-c++src 0.0001 0.0001 0.0001
text/x-chdr 0.0003 0.0004 0.0008
text/x-csrc 0.0005 0.0004 0.0005
text/x-diff 0.0009 0.0008 0.0009
text/x-jsp 0.0001 0.0000 0.0000
text/x-log 0.0020 0.0015 0.0020
text/x-matlab 0.0017 0.0016 0.0018
text/x-patch 0.0001 0.0000 0.0001
text/x-perl 0.0012 0.0011 0.0011
text/x-php 0.0031 0.0030 0.0030
text/x-python 0.0004 0.0002 0.0002
text/x-vcalendar 0.0005 0.0004 0.0004
text/x-vcard 0.0030 0.0032 0.0036
text/x-web-markdown 0.0007 0.0007 0.0008
text/x-yaml 0.0003 0.0003 0.0004
video/mp4 0.0000 0.0000 0.0000
video/quicktime 0.0000 0.0000 0.0000
video/webm NaN NaN 0.0000
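
The two tables can be compared directly. As a small worked example, the Python snippet below adds up the text/html and application/xhtml+xml percentages for CC-MAIN-2025-47 in each table, using values copied from the rows above. The helper name `html_family_share` is hypothetical. The result suggests that the combined share of HTML and XHTML is nearly the same in both tables, while content detection assigns a much larger part of it to application/xhtml+xml than the Content-Type header does.

```python
# Percentages for CC-MAIN-2025-47, copied from the two tables above.
by_header = {"text/html": 98.7109, "application/xhtml+xml": 0.0119}
detected = {"text/html": 89.8990, "application/xhtml+xml": 8.8559}


def html_family_share(shares):
    """Sum the text/html and application/xhtml+xml percentages."""
    return round(shares["text/html"] + shares["application/xhtml+xml"], 4)


header_total = html_family_share(by_header)      # 98.7109 + 0.0119
detected_total = html_family_share(detected)     # 89.8990 + 8.8559
print(header_total, detected_total)
```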