Statistics of Common Crawl Monthly Archives

Number of pages, distribution of top-level domains, crawl overlaps, etc. - basic metrics about Common Crawl Monthly Crawl Archives
Latest crawl: CC-MAIN-2026-21


MIME Types

The crawled content is dominated by HTML pages and contains only a small percentage of other document formats. The tables show, for each of the three latest monthly crawls, the percentage share of the top 100 media (MIME) types.

The first table is based on the Content-Type HTTP header, while the second uses the MIME type detected by Apache Tika from the actual content. The underlying data, including page counts, is provided in mimetypes.csv and mimetypes_detected.csv, respectively.
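Percentages like those in the tables can be derived from per-crawl page counts. The following is a minimal sketch, not the project's actual code: the column names `crawl`, `mimetype`, and `pages`, the function name `percentages_by_crawl`, and the sample counts are all assumptions made for illustration, and the real layout of mimetypes.csv may differ.

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample in an ASSUMED layout (crawl, mimetype, pages).
# The crawl names and counts are invented and are not real crawl data.
SAMPLE_CSV = """crawl,mimetype,pages
CC-MAIN-A,text/html,980
CC-MAIN-A,application/pdf,15
CC-MAIN-A,application/json,5
CC-MAIN-B,text/html,1940
CC-MAIN-B,application/pdf,60
"""


def percentages_by_crawl(csv_file):
    """Return {mimetype: {crawl: percent of that crawl's pages}}."""
    counts = defaultdict(dict)   # mimetype -> crawl -> page count
    totals = defaultdict(int)    # crawl -> total page count
    for row in csv.DictReader(csv_file):
        pages = int(row["pages"])
        counts[row["mimetype"]][row["crawl"]] = pages
        totals[row["crawl"]] += pages
    return {
        mime: {crawl: 100.0 * n / totals[crawl] for crawl, n in per_crawl.items()}
        for mime, per_crawl in counts.items()
    }


table = percentages_by_crawl(io.StringIO(SAMPLE_CSV))
for mime in sorted(table):
    cells = " ".join(f"{crawl}={pct:.4f}" for crawl, pct in sorted(table[mime].items()))
    print(mime, cells)
```

To run the sketch against a downloaded file, pass an open file object instead of the `io.StringIO` sample and adjust the column names as needed. A MIME type with no row for a crawl simply has no entry in the result; the NaN cells in the tables may reflect the same situation, though that is an assumption.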

MIME type from the Content-Type HTTP header (values in % per crawl)
mimetype CC-MAIN-2026-12 CC-MAIN-2026-17 CC-MAIN-2026-21
<other> 0.0646 0.0610 0.0618
application/atom+xml 0.1511 0.1469 0.1503
application/calendar 0.0002 0.0002 0.0002
application/download 0.0013 0.0018 0.0030
application/epub+zip 0.0019 0.0015 0.0019
application/force-download 0.0075 0.0073 0.0088
application/gpx+xml 0.0008 0.0009 0.0009
application/ics 0.0004 0.0004 0.0004
application/javascript 0.0008 0.0008 0.0006
application/json 0.0256 0.0251 0.0241
application/ld+json 0.0014 0.0015 0.0013
application/marc 0.0003 0.0003 0.0003
application/msword 0.0023 0.0025 0.0029
application/octet-stream 0.0487 0.0469 0.0488
application/octetstream 0.0002 0.0002 0.0002
application/pdf 0.7988 0.8655 1.0857
application/pgp-encrypted 0.0001 0.0001 0.0001
application/pgp-signature 0.0016 0.0020 0.0022
application/postscript 0.0002 0.0002 0.0002
application/rdf+xml 0.0036 0.0031 0.0030
application/rss+xml 0.0439 0.0418 0.0431
application/rtf 0.0007 0.0005 0.0005
application/save-to-disk 0.0000 0.0000 0.0000
application/text 0.0001 0.0001 0.0001
application/unknown 0.0002 0.0002 0.0002
application/vnd.android.package-archive 0.0000 0.0000 0.0000
application/vnd.google-earth.kml+xml 0.0012 0.0011 0.0018
application/vnd.google-earth.kmz 0.0005 0.0005 0.0004
application/vnd.ms-excel 0.0015 0.0012 0.0015
application/vnd.ms-powerpoint 0.0001 0.0002 0.0002
application/vnd.ms-word 0.0001 0.0001 0.0001
application/vnd.oasis.opendocument.text 0.0008 0.0007 0.0011
application/vnd.openxmlformats-officedocument.presentationml.presentation 0.0003 0.0002 0.0003
application/vnd.openxmlformats-officedocument.spreadsheetml.sheet 0.0017 0.0016 0.0019
application/vnd.openxmlformats-officedocument.wordprocessingml.document 0.0028 0.0028 0.0033
application/vnd.wap.xhtml+xml 0.0010 0.0009 0.0008
application/x-bibtex 0.0090 0.0088 0.0091
application/x-bittorrent 0.0001 0.0001 0.0001
application/x-debian-package 0.0000 0.0000 0.0000
application/x-download 0.0025 0.0022 0.0023
application/x-endnote-refer 0.0004 0.0003 0.0003
application/x-gzip 0.0002 0.0002 0.0002
application/x-httpd-php 0.0003 0.0003 0.0004
application/x-java-jnlp-file 0.0001 0.0001 0.0001
application/x-javascript 0.0002 0.0001 0.0002
application/x-json 0.0000 0.0000 0.0000
application/x-mobipocket-ebook 0.0007 0.0004 0.0005
application/x-msdownload 0.0003 0.0003 0.0003
application/x-netcdf 0.0003 0.0004 0.0002
application/x-research-info-systems 0.0117 0.0116 0.0113
application/x-shockwave-flash 0.0001 0.0000 0.0000
application/x-tar 0.0001 0.0001 0.0001
application/x-tex 0.0002 0.0001 0.0002
application/x-troff-man 0.0003 0.0004 0.0003
application/x-zip-compressed 0.0000 0.0000 0.0000
application/xhtml+xml 0.0109 0.0104 0.0096
application/xml 0.0266 0.0254 0.0257
application/zip 0.0001 0.0002 0.0002
audio/mpeg 0.0001 0.0001 0.0001
audio/x-mpegurl 0.0006 0.0006 0.0005
audio/x-scpls 0.0001 0.0001 0.0001
audio/x-wav 0.0000 0.0000 0.0000
binary/octet-stream 0.0005 0.0005 0.0005
image/gif 0.0003 0.0002 0.0001
image/jp2 0.0000 NaN 0.0000
image/jpeg 0.0004 0.0002 0.0003
image/jpg 0.0000 0.0000 0.0000
image/pjpeg 0.0000 0.0000 0.0000
image/png 0.0004 0.0001 0.0002
image/svg+xml 0.0000 0.0000 0.0000
image/tiff 0.0000 0.0000 0.0000
image/vnd.djvu 0.0003 0.0005 0.0004
image/webp 0.0000 0.0000 0.0000
message/rfc822 0.0003 0.0002 0.0003
text/calendar 0.0383 0.0354 0.0340
text/css 0.0006 0.0006 0.0007
text/csv 0.0032 0.0030 0.0030
text/directory 0.0002 0.0002 0.0002
text/enriched 0.0000 0.0000 0.0000
text/html 98.5946 98.5388 98.3251
text/javascript 0.0004 0.0004 0.0004
text/markdown 0.0034 0.0298 0.0134
text/pdf 0.0000 0.0000 0.0000
text/plain 0.0541 0.0485 0.0482
text/prs.lines.tag 0.0039 0.0030 0.0027
text/tab-separated-values 0.0004 0.0004 0.0002
text/turtle 0.0012 0.0010 0.0008
text/vcard 0.0012 0.0012 0.0013
text/x-bibtex 0.0005 0.0005 0.0006
text/x-c 0.0002 0.0001 0.0001
text/x-csrc 0.0001 0.0001 0.0002
text/x-diff 0.0003 0.0003 0.0003
text/x-patch 0.0003 0.0002 0.0002
text/x-perl 0.0001 0.0000 0.0000
text/x-vcalendar 0.0004 0.0004 0.0004
text/x-vcard 0.0019 0.0019 0.0020
text/xml 0.0613 0.0496 0.0502
unknown/unknown 0.0000 0.0000 0.0000
video/mp4 0.0000 0.0000 0.0000
video/webm NaN 0.0000 0.0000
video/x-ms-asf 0.0002 0.0002 0.0001

MIME type detected by Apache Tika (values in % per crawl)
mimetype_detected CC-MAIN-2026-12 CC-MAIN-2026-17 CC-MAIN-2026-21
<other> 0.0122 0.0112 0.0122
application/atom+xml 0.1527 0.1486 0.1518
application/epub+zip 0.0022 0.0018 0.0023
application/gpx+xml 0.0008 0.0009 0.0009
application/gzip 0.0000 0.0000 0.0000
application/javascript 0.0011 0.0011 0.0009
application/json 0.0258 0.0252 0.0240
application/marc 0.0007 0.0007 0.0006
application/mbox 0.0014 0.0011 0.0013
application/msword 0.0017 0.0018 0.0022
application/octet-stream 0.0155 0.0139 0.0133
application/pdf 0.8127 0.8794 1.1018
application/pgp-signature 0.0016 0.0020 0.0022
application/pkcs7-signature 0.0004 0.0004 0.0006
application/postscript 0.0002 0.0001 0.0001
application/rdf+xml 0.0074 0.0063 0.0061
application/rss+xml 0.0684 0.0660 0.0676
application/rtf 0.0013 0.0009 0.0008
application/text 0.0000 0.0000 0.0000
application/vnd.android.package-archive 0.0000 0.0000 0.0000
application/vnd.google-earth.kml+xml 0.0025 0.0024 0.0031
application/vnd.google-earth.kmz 0.0005 0.0005 0.0005
application/vnd.ms-excel 0.0007 0.0007 0.0009
application/vnd.ms-powerpoint 0.0001 0.0002 0.0002
application/vnd.oasis.opendocument.spreadsheet 0.0004 0.0003 0.0005
application/vnd.oasis.opendocument.text 0.0009 0.0009 0.0013
application/vnd.openxmlformats-officedocument.presentationml.presentation 0.0003 0.0002 0.0003
application/vnd.openxmlformats-officedocument.spreadsheetml.sheet 0.0017 0.0016 0.0018
application/vnd.openxmlformats-officedocument.wordprocessingml.document 0.0028 0.0028 0.0033
application/x-bibtex-text-file 0.0143 0.0127 0.0137
application/x-bittorrent 0.0002 0.0002 0.0002
application/x-bzip2 0.0000 NaN 0.0000
application/x-dosexec NaN NaN 0.0000
application/x-endnote-refer 0.0016 0.0014 0.0013
application/x-mobipocket-ebook 0.0008 0.0006 0.0008
application/x-ms-asx 0.0002 0.0002 0.0001
application/x-msdownload 0.0000 0.0000 0.0000
application/x-pds 0.0012 0.0012 0.0010
application/x-rar-compressed NaN 0.0000 0.0000
application/x-research-info-systems 0.0000 0.0000 0.0000
application/x-sh 0.0010 0.0011 0.0010
application/x-shockwave-flash 0.0001 0.0000 0.0000
application/x-stata-do 0.0004 0.0004 0.0004
application/x-tex 0.0003 0.0003 0.0004
application/x-tex-tfm 0.0003 0.0004 0.0002
application/x-tika-msoffice 0.0028 0.0024 0.0025
application/x-tika-ooxml 0.0018 0.0019 0.0021
application/x-wais-source 0.0001 0.0001 0.0002
application/xhtml+xml 8.6282 8.1569 8.0295
application/xml 0.0697 0.0584 0.0562
application/zip 0.0000 0.0000 0.0000
application/zlib 0.0002 0.0003 0.0002
audio/mp4 0.0000 0.0000 NaN
audio/mpeg 0.0001 0.0000 0.0000
audio/vnd.wave 0.0000 0.0000 0.0000
audio/x-mpegurl 0.0000 0.0000 0.0000
image/gif 0.0000 0.0000 0.0000
image/jpeg 0.0002 0.0001 0.0002
image/png 0.0000 0.0000 0.0000
image/svg+xml 0.0000 0.0000 0.0000
image/tiff NaN 0.0000 NaN
image/vnd.djvu 0.0004 0.0007 0.0007
image/webp 0.0000 0.0000 0.0000
message/rfc822 0.0007 0.0007 0.0007
text/asp 0.0000 NaN 0.0000
text/aspdotnet 0.0000 NaN NaN
text/calendar 0.0456 0.0424 0.0417
text/css 0.0006 0.0006 0.0007
text/csv 0.0032 0.0030 0.0030
text/html 89.9924 90.4072 90.3224
text/markdown 0.0024 0.0284 0.0042
text/plain 0.0898 0.0832 0.0851
text/prs.lines.tag 0.0067 0.0054 0.0048
text/tab-separated-values 0.0004 0.0004 0.0002
text/troff 0.0005 0.0005 0.0003
text/turtle 0.0012 0.0010 0.0008
text/vtt 0.0008 0.0008 0.0008
text/x-c++src 0.0002 0.0001 0.0002
text/x-chdr 0.0006 0.0005 0.0007
text/x-csrc 0.0007 0.0006 0.0005
text/x-diff 0.0010 0.0009 0.0010
text/x-jsp 0.0001 0.0001 0.0001
text/x-log 0.0016 0.0016 0.0013
text/x-matlab 0.0010 0.0013 0.0014
text/x-perl 0.0013 0.0008 0.0009
text/x-php 0.0033 0.0034 0.0025
text/x-python 0.0004 0.0003 0.0005
text/x-vcalendar 0.0004 0.0004 0.0004
text/x-vcard 0.0037 0.0036 0.0039
text/x-web-markdown 0.0013 0.0019 0.0100
text/x-yaml 0.0004 0.0004 0.0003
video/mp4 0.0000 0.0000 0.0000
video/quicktime NaN 0.0000 0.0000
video/webm NaN 0.0000 NaN
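
The header-based and Tika-detected shares can be compared per MIME type. The sketch below uses three values copied from the CC-MAIN-2026-21 columns of the two tables; the helper name `shifts` is hypothetical, and the comparison is only an illustration, not part of the published statistics.

```python
# Percent values copied from the CC-MAIN-2026-21 column of the two tables.
declared = {  # from the Content-Type HTTP header
    "text/html": 98.3251,
    "application/xhtml+xml": 0.0096,
    "application/pdf": 1.0857,
}
detected = {  # as detected by Apache Tika
    "text/html": 90.3224,
    "application/xhtml+xml": 8.0295,
    "application/pdf": 1.1018,
}


def shifts(declared, detected):
    """Return detected minus declared share, in percentage points, per MIME type."""
    return {
        mime: round(detected[mime] - declared[mime], 4)
        for mime in declared
        if mime in detected
    }


diff = shifts(declared, detected)
# List the largest drop first.
for mime, delta in sorted(diff.items(), key=lambda kv: kv[1]):
    print(f"{mime:25s} {delta:+.4f}")
```

For these three types, the drop in text/html is of nearly the same size as the gain in application/xhtml+xml, which suggests (but does not prove) that many pages served as text/html are detected as XHTML from their content.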