Statistics of Common Crawl Monthly Archives

Number of pages, distribution of top-level domains, crawl overlaps, etc. - basic metrics about Common Crawl Monthly Crawl Archives
Latest crawl: CC-MAIN-2025-38


MIME Types

The crawled content is dominated by HTML pages and contains only a small percentage of other document formats. The tables show the percentage share of each of the top 100 media (MIME) types in the latest monthly crawls.

The first table is based on the Content-Type HTTP header, while the second uses the MIME type detected by Apache Tika from the actual content. The underlying data, including page counts, is provided in mimetypes.csv and mimetypes_detected.csv, respectively.
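The percentages in the tables can in principle be recomputed from the page counts in the CSV files. The following is a minimal sketch of that calculation, not the project's own code: the column names `crawl`, `mimetype`, and `pages`, the crawl labels, and the counts in `SAMPLE_CSV` are all assumptions made for illustration, so check the actual header of mimetypes.csv before adapting it.

```python
import csv
import io
from collections import defaultdict

# Hypothetical excerpt: the column names (crawl, mimetype, pages) are assumed,
# and the crawl labels and counts are invented for illustration only.
SAMPLE_CSV = """crawl,mimetype,pages
CC-MAIN-EXAMPLE-A,text/html,9880
CC-MAIN-EXAMPLE-A,application/pdf,60
CC-MAIN-EXAMPLE-A,text/plain,60
CC-MAIN-EXAMPLE-B,text/html,4950
CC-MAIN-EXAMPLE-B,application/pdf,50
"""


def percentages_by_crawl(csv_text):
    """Return {crawl: {mimetype: percentage of that crawl's pages}}."""
    counts = defaultdict(dict)
    for row in csv.DictReader(io.StringIO(csv_text)):
        counts[row["crawl"]][row["mimetype"]] = int(row["pages"])

    shares = {}
    for crawl, by_type in counts.items():
        total = sum(by_type.values())
        # Round to four decimals, matching the precision used in the tables.
        shares[crawl] = {
            mime: round(100.0 * pages / total, 4)
            for mime, pages in by_type.items()
        }
    return shares


shares = percentages_by_crawl(SAMPLE_CSV)
for crawl, by_type in shares.items():
    for mime, pct in sorted(by_type.items()):
        print(f"{crawl}  {mime:<20} {pct:8.4f}")
```

To run this against the real file, the sample string would be replaced by the contents of mimetypes.csv, with the column names adjusted to whatever that file actually uses.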

mimetype CC-MAIN-2025-30 CC-MAIN-2025-33 CC-MAIN-2025-38
<other> 0.0763 0.0768 0.0706
application/atom+xml 0.1310 0.1185 0.1157
application/calendar 0.0002 0.0002 0.0002
application/download 0.0025 0.0028 0.0027
application/epub+zip 0.0016 0.0018 0.0015
application/force-download 0.0101 0.0078 0.0093
application/ics 0.0004 0.0003 0.0003
application/javascript 0.0012 0.0010 0.0015
application/json 0.0275 0.0261 0.0277
application/ld+json 0.0017 0.0019 0.0018
application/marc 0.0005 0.0004 0.0004
application/msword 0.0022 0.0020 0.0018
application/octet-stream 0.0419 0.0426 0.0390
application/octetstream 0.0003 0.0002 0.0002
application/pdf 0.6394 0.5800 0.5601
application/pgp-encrypted 0.0001 0.0001 0.0001
application/pgp-signature 0.0018 0.0013 0.0012
application/postscript 0.0002 0.0001 0.0001
application/rdf+xml 0.0044 0.0043 0.0043
application/rss+xml 0.0431 0.0414 0.0403
application/rtf 0.0008 0.0009 0.0009
application/save-to-disk 0.0000 0.0000 0.0000
application/text 0.0002 0.0001 0.0001
application/unknown 0.0002 0.0002 0.0002
application/vnd.android.package-archive 0.0000 0.0000 0.0000
application/vnd.google-earth.kml+xml 0.0012 0.0012 0.0009
application/vnd.google-earth.kmz 0.0003 0.0003 0.0002
application/vnd.ms-excel 0.0013 0.0013 0.0013
application/vnd.ms-powerpoint 0.0002 0.0002 0.0002
application/vnd.ms-word 0.0002 0.0002 0.0002
application/vnd.oasis.opendocument.text 0.0006 0.0005 0.0004
application/vnd.openxmlformats-officedocument.presentationml.presentation 0.0003 0.0002 0.0002
application/vnd.openxmlformats-officedocument.spreadsheetml.sheet 0.0015 0.0014 0.0012
application/vnd.openxmlformats-officedocument.wordprocessingml.document 0.0022 0.0020 0.0019
application/vnd.wap.xhtml+xml 0.0002 0.0001 0.0007
application/x-bibtex 0.0072 0.0071 0.0073
application/x-bittorrent 0.0001 0.0002 0.0002
application/x-bzip2 0.0000 0.0001 0.0001
application/x-debian-package 0.0000 0.0000 0.0000
application/x-download 0.0020 0.0018 0.0017
application/x-endnote-refer 0.0008 0.0009 0.0008
application/x-gzip 0.0001 0.0001 0.0001
application/x-httpd-php 0.0003 0.0003 0.0003
application/x-java-jnlp-file 0.0001 0.0001 0.0001
application/x-javascript 0.0001 0.0001 0.0001
application/x-json 0.0000 0.0000 0.0000
application/x-mobipocket-ebook 0.0005 0.0006 0.0005
application/x-msdownload 0.0003 0.0004 0.0004
application/x-netcdf 0.0003 0.0007 0.0001
application/x-rar-compressed 0.0000 0.0000 0.0000
application/x-research-info-systems 0.0104 0.0100 0.0098
application/x-shockwave-flash 0.0001 0.0001 0.0001
application/x-tar 0.0001 0.0001 0.0001
application/x-tex 0.0001 0.0002 0.0002
application/x-troff-man 0.0005 0.0004 0.0003
application/x-zip-compressed 0.0000 0.0000 0.0000
application/xhtml+xml 0.0102 0.0101 0.0099
application/xml 0.0241 0.0238 0.0230
application/zip 0.0001 0.0001 0.0001
audio/mpeg 0.0000 0.0000 0.0000
audio/x-mpegurl 0.0004 0.0004 0.0004
audio/x-scpls 0.0001 0.0001 0.0001
audio/x-wav 0.0000 0.0000 0.0000
binary/octet-stream 0.0004 0.0004 0.0004
image/gif 0.0003 0.0003 0.0004
image/jp2 NaN 0.0000 NaN
image/jpeg 0.0004 0.0003 0.0003
image/jpg 0.0000 0.0000 0.0000
image/pjpeg 0.0000 0.0000 0.0000
image/png 0.0006 0.0001 0.0001
image/svg+xml 0.0000 0.0000 0.0000
image/tiff 0.0000 0.0000 0.0000
image/vnd.djvu 0.0011 0.0001 0.0002
image/webp 0.0000 0.0000 0.0000
message/rfc822 0.0003 0.0003 0.0003
text/calendar 0.0316 0.0309 0.0302
text/css 0.0005 0.0005 0.0005
text/csv 0.0031 0.0032 0.0031
text/directory 0.0003 0.0002 0.0002
text/enriched 0.0000 0.0000 0.0000
text/html 98.7794 98.8693 98.9025
text/javascript 0.0003 0.0003 0.0003
text/pdf 0.0000 0.0000 0.0000
text/plain 0.0559 0.0496 0.0472
text/prs.lines.tag 0.0021 0.0020 0.0022
text/tab-separated-values 0.0003 0.0003 0.0004
text/turtle 0.0021 0.0021 0.0019
text/vcard 0.0011 0.0008 0.0010
text/x-bibtex 0.0004 0.0004 0.0004
text/x-c 0.0001 0.0001 0.0001
text/x-csrc 0.0003 0.0002 0.0001
text/x-diff 0.0003 0.0003 0.0002
text/x-patch 0.0003 0.0002 0.0002
text/x-perl 0.0000 0.0000 0.0000
text/x-vcalendar 0.0004 0.0004 0.0004
text/x-vcard 0.0020 0.0018 0.0015
text/xml 0.0628 0.0600 0.0627
unknown/unknown 0.0000 0.0000 0.0000
video/mp4 0.0000 0.0000 0.0000
video/webm 0.0000 NaN NaN
video/x-ms-asf 0.0001 0.0001 0.0002

mimetype_detected CC-MAIN-2025-30 CC-MAIN-2025-33 CC-MAIN-2025-38
<other> 0.0111 0.0105 0.0104
application/atom+xml 0.1327 0.1202 0.1175
application/epub+zip 0.0019 0.0020 0.0017
application/gpx+xml 0.0006 0.0006 0.0006
application/gzip 0.0000 0.0000 0.0000
application/javascript 0.0007 0.0006 0.0015
application/json 0.0276 0.0262 0.0277
application/marc 0.0013 0.0011 0.0010
application/mbox 0.0016 0.0014 0.0013
application/msword 0.0017 0.0015 0.0014
application/octet-stream 0.0117 0.0159 0.0107
application/pdf 0.6530 0.5931 0.5719
application/pgp-signature 0.0014 0.0014 0.0012
application/pkcs7-signature 0.0004 0.0004 0.0003
application/postscript 0.0004 0.0003 0.0003
application/rdf+xml 0.0091 0.0085 0.0082
application/rss+xml 0.0693 0.0665 0.0655
application/rtf 0.0013 0.0014 0.0013
application/text 0.0000 0.0000 0.0000
application/vnd.android.package-archive 0.0000 0.0000 0.0000
application/vnd.google-earth.kml+xml 0.0020 0.0021 0.0019
application/vnd.google-earth.kmz 0.0003 0.0003 0.0002
application/vnd.ms-excel 0.0008 0.0007 0.0007
application/vnd.ms-powerpoint 0.0002 0.0002 0.0002
application/vnd.oasis.opendocument.spreadsheet 0.0003 0.0002 0.0003
application/vnd.oasis.opendocument.text 0.0007 0.0006 0.0005
application/vnd.openxmlformats-officedocument.presentationml.presentation 0.0003 0.0002 0.0002
application/vnd.openxmlformats-officedocument.spreadsheetml.sheet 0.0015 0.0014 0.0012
application/vnd.openxmlformats-officedocument.wordprocessingml.document 0.0022 0.0020 0.0019
application/x-bibtex-text-file 0.0131 0.0122 0.0124
application/x-bittorrent 0.0002 0.0003 0.0002
application/x-debian-package 0.0000 0.0000 NaN
application/x-dosexec NaN 0.0000 NaN
application/x-endnote-refer 0.0021 0.0017 0.0020
application/x-mobipocket-ebook 0.0007 0.0007 0.0006
application/x-ms-asx 0.0001 0.0001 0.0002
application/x-msdownload 0.0000 0.0000 0.0000
application/x-pds 0.0022 0.0015 0.0009
application/x-rar-compressed 0.0000 0.0000 0.0000
application/x-research-info-systems 0.0001 0.0000 0.0000
application/x-sh 0.0009 0.0006 0.0009
application/x-shockwave-flash 0.0001 0.0001 0.0001
application/x-stata-do 0.0004 0.0003 0.0004
application/x-tex 0.0002 0.0003 0.0003
application/x-tex-tfm 0.0003 0.0003 0.0004
application/x-tika-msoffice 0.0024 0.0023 0.0019
application/x-tika-ooxml 0.0016 0.0015 0.0014
application/x-wais-source 0.0002 0.0001 0.0002
application/x-xz NaN 0.0000 0.0000
application/xhtml+xml 8.4630 8.1620 7.7676
application/xml 0.0685 0.0652 0.0671
application/zip 0.0000 0.0000 0.0000
application/zlib 0.0001 0.0002 0.0001
audio/mp4 0.0000 NaN NaN
audio/mpeg 0.0000 0.0000 0.0000
audio/vnd.wave 0.0000 0.0000 0.0000
audio/x-mpegurl 0.0000 0.0000 0.0000
image/gif 0.0000 0.0000 0.0000
image/jpeg 0.0002 0.0002 0.0002
image/png 0.0000 0.0000 0.0000
image/svg+xml 0.0000 0.0000 0.0000
image/tiff 0.0000 NaN 0.0000
image/vnd.djvu 0.0013 0.0004 0.0003
image/webp 0.0000 0.0000 0.0000
message/rfc822 0.0007 0.0008 0.0007
text/calendar 0.0374 0.0362 0.0358
text/css 0.0005 0.0005 0.0005
text/csv 0.0031 0.0032 0.0031
text/html 90.3470 90.7412 91.1680
text/plain 0.0947 0.0863 0.0825
text/prs.lines.tag 0.0044 0.0040 0.0044
text/tab-separated-values 0.0004 0.0003 0.0004
text/troff 0.0006 0.0005 0.0004
text/turtle 0.0021 0.0021 0.0019
text/vtt 0.0010 0.0010 0.0009
text/x-c++src 0.0001 0.0001 0.0001
text/x-chdr 0.0004 0.0004 0.0003
text/x-csrc 0.0007 0.0005 0.0005
text/x-diff 0.0010 0.0009 0.0009
text/x-jsp 0.0000 0.0001 0.0001
text/x-log 0.0022 0.0019 0.0020
text/x-matlab 0.0013 0.0015 0.0017
text/x-perl 0.0016 0.0015 0.0012
text/x-php 0.0033 0.0031 0.0031
text/x-python 0.0003 0.0002 0.0004
text/x-vcalendar 0.0005 0.0004 0.0005
text/x-vcard 0.0037 0.0031 0.0030
text/x-web-markdown 0.0005 0.0006 0.0007
text/x-yaml 0.0004 0.0004 0.0003
video/mp4 0.0000 0.0000 0.0000
video/quicktime 0.0000 0.0000 0.0000
video/webm 0.0000 NaN NaN
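
One way to read the two tables together is to compare the share a MIME type has according to the Content-Type header with the share it has after content detection. The sketch below does this for a few CC-MAIN-2025-38 values copied from the tables above. The function name `share_shift` and the decision to skip types that appear in only one table are illustrative choices, not part of the project.

```python
# Percentages for CC-MAIN-2025-38, copied from the two tables above.
declared = {  # based on the Content-Type HTTP header
    "text/html": 98.9025,
    "application/xhtml+xml": 0.0099,
    "application/pdf": 0.5601,
    "text/xml": 0.0627,
}
detected = {  # based on content detection
    "text/html": 91.1680,
    "application/xhtml+xml": 7.7676,
    "application/pdf": 0.5719,
    "text/vtt": 0.0009,
}


def share_shift(declared, detected):
    """Return (mimetype, detected - declared) pairs, largest shift first.

    Types present in only one of the two tables are skipped, because a
    missing entry does not necessarily mean a share of zero.
    """
    shifts = []
    for mime in set(declared) & set(detected):
        shifts.append((mime, round(detected[mime] - declared[mime], 4)))
    shifts.sort(key=lambda item: abs(item[1]), reverse=True)
    return shifts


shifts = share_shift(declared, detected)
for mime, delta in shifts:
    print(f"{mime:<24} {delta:+.4f}")
```

For these values the largest shifts are a gain of roughly 7.76 percentage points for application/xhtml+xml and a loss of roughly 7.73 points for text/html. This suggests, though the tables alone cannot confirm it, that many pages detected as XHTML are served with a text/html header.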