Statistics of Common Crawl Monthly Archives

Number of pages, distribution of top-level domains, crawl overlaps, etc. - basic metrics about Common Crawl Monthly Crawl Archives
Latest crawl: CC-MAIN-2026-12

MIME Types

The crawled content is dominated by HTML pages; other document formats make up only a small percentage. The tables show the percentage share of the top 100 media types (MIME types) in the latest monthly crawls.

The first table is based on the Content-Type HTTP header, while the second uses the MIME type detected by Apache Tika from the actual content. The underlying data, including page counts, is provided in mimetypes.csv and mimetypes_detected.csv, respectively.
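As a rough illustration of how percentages like these can be derived from page counts, here is a minimal sketch. The column names (crawl, mimetype, pages), the crawl labels, and the counts are all hypothetical and are not taken from mimetypes.csv; check the real file's header before adapting it. A MIME type with no row for a crawl is printed as NaN, which is one plausible reading of the NaN cells in the tables, not a confirmed one.

```python
import csv
import io
from collections import defaultdict

# Hypothetical excerpt: column names and counts are assumptions made up
# for illustration; they do not come from the real mimetypes.csv.
SAMPLE_CSV = """crawl,mimetype,pages
CRAWL-A,text/html,9800
CRAWL-A,application/pdf,150
CRAWL-A,text/plain,50
CRAWL-B,text/html,4900
CRAWL-B,application/pdf,100
"""


def percentage_table(csv_text):
    """Return {mimetype: {crawl: percent}} from long-format page counts."""
    counts = defaultdict(dict)   # mimetype -> crawl -> page count
    totals = defaultdict(int)    # crawl -> total page count
    for row in csv.DictReader(io.StringIO(csv_text)):
        pages = int(row["pages"])
        counts[row["mimetype"]][row["crawl"]] = pages
        totals[row["crawl"]] += pages
    # Convert each count to a percentage of its crawl's total.
    return {
        mime: {
            crawl: round(100.0 * n / totals[crawl], 4)
            for crawl, n in per_crawl.items()
        }
        for mime, per_crawl in counts.items()
    }


table = percentage_table(SAMPLE_CSV)
crawls = ("CRAWL-A", "CRAWL-B")
for mime in sorted(table):
    # Missing (mimetype, crawl) combinations are shown as NaN.
    cells = ["%.4f" % table[mime][c] if c in table[mime] else "NaN" for c in crawls]
    print(mime, *cells)
```

Each percentage is computed against the total of its own crawl, so columns can be compared across crawls even when the crawls differ in size.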

MIME type from the Content-Type HTTP header (percentage of pages per crawl)
mimetype CC-MAIN-2026-04 CC-MAIN-2026-08 CC-MAIN-2026-12
<other> 0.0669 0.0654 0.0679
application/atom+xml 0.1338 0.1461 0.1511
application/calendar 0.0003 0.0003 0.0002
application/download 0.0012 0.0020 0.0013
application/epub+zip 0.0015 0.0017 0.0019
application/force-download 0.0066 0.0089 0.0075
application/gpx+xml 0.0007 0.0008 0.0008
application/ics 0.0004 0.0004 0.0004
application/javascript 0.0007 0.0008 0.0008
application/json 0.0242 0.0262 0.0256
application/ld+json 0.0016 0.0015 0.0014
application/marc 0.0003 0.0003 0.0003
application/msword 0.0022 0.0025 0.0023
application/octet-stream 0.0423 0.0449 0.0487
application/octetstream 0.0002 0.0002 0.0002
application/pdf 0.6211 0.7330 0.7988
application/pgp-encrypted 0.0001 0.0001 0.0001
application/pgp-signature 0.0014 0.0014 0.0016
application/postscript 0.0001 0.0002 0.0002
application/rdf+xml 0.0036 0.0038 0.0036
application/rss+xml 0.0417 0.0430 0.0439
application/rtf 0.0007 0.0008 0.0007
application/save-to-disk 0.0000 0.0000 0.0000
application/text 0.0001 0.0001 0.0001
application/unknown 0.0002 0.0002 0.0002
application/vnd.android.package-archive 0.0000 0.0000 0.0000
application/vnd.google-earth.kml+xml 0.0011 0.0012 0.0012
application/vnd.google-earth.kmz 0.0003 0.0003 0.0005
application/vnd.ms-excel 0.0014 0.0014 0.0015
application/vnd.ms-powerpoint 0.0001 0.0001 0.0001
application/vnd.ms-word 0.0002 0.0002 0.0001
application/vnd.oasis.opendocument.text 0.0006 0.0007 0.0008
application/vnd.openxmlformats-officedocument.presentationml.presentation 0.0002 0.0002 0.0003
application/vnd.openxmlformats-officedocument.spreadsheetml.sheet 0.0017 0.0019 0.0017
application/vnd.openxmlformats-officedocument.wordprocessingml.document 0.0022 0.0024 0.0028
application/vnd.wap.xhtml+xml 0.0003 0.0005 0.0010
application/x-bibtex 0.0081 0.0086 0.0090
application/x-bittorrent 0.0006 0.0001 0.0001
application/x-bzip2 0.0000 0.0000 0.0001
application/x-debian-package 0.0000 0.0000 0.0000
application/x-download 0.0025 0.0026 0.0025
application/x-endnote-refer 0.0004 0.0004 0.0004
application/x-gzip 0.0001 0.0003 0.0002
application/x-httpd-php 0.0002 0.0003 0.0003
application/x-java-jnlp-file 0.0001 0.0001 0.0001
application/x-javascript 0.0001 0.0002 0.0002
application/x-json 0.0000 0.0000 0.0000
application/x-mobipocket-ebook 0.0004 0.0005 0.0007
application/x-msdownload 0.0002 0.0003 0.0003
application/x-netcdf 0.0003 0.0001 0.0003
application/x-research-info-systems 0.0108 0.0115 0.0117
application/x-shockwave-flash 0.0001 0.0001 0.0001
application/x-tar 0.0001 0.0000 0.0001
application/x-tex 0.0001 0.0001 0.0002
application/x-troff-man 0.0004 0.0004 0.0003
application/x-zip-compressed 0.0000 0.0000 0.0000
application/xhtml+xml 0.0101 0.0111 0.0109
application/xml 0.0234 0.0291 0.0266
application/zip 0.0001 0.0001 0.0001
audio/mpeg 0.0001 0.0001 0.0001
audio/x-mpegurl 0.0005 0.0005 0.0006
audio/x-scpls 0.0001 0.0002 0.0001
audio/x-wav 0.0000 0.0000 0.0000
binary/octet-stream 0.0005 0.0005 0.0005
image/gif 0.0005 0.0004 0.0003
image/jp2 0.0000 NaN 0.0000
image/jpeg 0.0003 0.0003 0.0004
image/jpg 0.0000 0.0000 0.0000
image/pjpeg 0.0000 0.0000 0.0000
image/png 0.0002 0.0002 0.0004
image/svg+xml 0.0000 0.0000 0.0000
image/tiff 0.0000 0.0000 0.0000
image/vnd.djvu 0.0002 0.0006 0.0003
image/webp 0.0000 0.0000 0.0000
message/rfc822 0.0002 0.0002 0.0003
text/calendar 0.0344 0.0374 0.0383
text/css 0.0006 0.0006 0.0006
text/csv 0.0032 0.0034 0.0032
text/directory 0.0002 0.0002 0.0002
text/enriched 0.0000 0.0000 0.0000
text/html 98.8254 98.6720 98.5946
text/javascript 0.0003 0.0004 0.0004
text/pdf 0.0000 0.0001 0.0000
text/plain 0.0457 0.0521 0.0541
text/prs.lines.tag 0.0022 0.0025 0.0039
text/tab-separated-values 0.0004 0.0005 0.0004
text/turtle 0.0013 0.0014 0.0012
text/vcard 0.0010 0.0011 0.0012
text/x-bibtex 0.0004 0.0006 0.0005
text/x-c 0.0001 0.0001 0.0002
text/x-csrc 0.0001 0.0001 0.0001
text/x-diff 0.0003 0.0002 0.0003
text/x-patch 0.0002 0.0003 0.0003
text/x-perl 0.0000 0.0000 0.0001
text/x-vcalendar 0.0003 0.0003 0.0004
text/x-vcard 0.0017 0.0020 0.0019
text/xml 0.0616 0.0622 0.0613
unknown/unknown 0.0000 0.0000 0.0000
video/mp4 0.0000 0.0000 0.0000
video/webm 0.0000 0.0000 NaN
video/x-ms-asf 0.0001 0.0001 0.0002

MIME type detected by Apache Tika (percentage of pages per crawl)
mimetype_detected CC-MAIN-2026-04 CC-MAIN-2026-08 CC-MAIN-2026-12
<other> 0.0101 0.0115 0.0145
application/atom+xml 0.1353 0.1476 0.1527
application/epub+zip 0.0017 0.0020 0.0022
application/gpx+xml 0.0007 0.0008 0.0008
application/gzip 0.0000 0.0000 0.0000
application/javascript 0.0009 0.0010 0.0011
application/json 0.0243 0.0262 0.0258
application/marc 0.0007 0.0007 0.0007
application/mbox 0.0010 0.0012 0.0014
application/msword 0.0016 0.0018 0.0017
application/octet-stream 0.0152 0.0164 0.0155
application/pdf 0.6335 0.7460 0.8127
application/pgp-signature 0.0014 0.0011 0.0016
application/pkcs7-signature 0.0003 0.0003 0.0004
application/postscript 0.0002 0.0002 0.0002
application/rdf+xml 0.0076 0.0081 0.0074
application/rss+xml 0.0661 0.0681 0.0684
application/rtf 0.0011 0.0013 0.0013
application/text 0.0000 0.0000 0.0000
application/vnd.android.package-archive 0.0000 0.0000 0.0000
application/vnd.google-earth.kml+xml 0.0024 0.0024 0.0025
application/vnd.google-earth.kmz 0.0003 0.0003 0.0005
application/vnd.ms-excel 0.0006 0.0008 0.0007
application/vnd.ms-powerpoint 0.0001 0.0001 0.0001
application/vnd.oasis.opendocument.spreadsheet 0.0003 0.0004 0.0004
application/vnd.oasis.opendocument.text 0.0007 0.0008 0.0009
application/vnd.openxmlformats-officedocument.presentationml.presentation 0.0002 0.0002 0.0003
application/vnd.openxmlformats-officedocument.spreadsheetml.sheet 0.0016 0.0018 0.0017
application/vnd.openxmlformats-officedocument.wordprocessingml.document 0.0022 0.0025 0.0028
application/x-bibtex-text-file 0.0128 0.0143 0.0143
application/x-bittorrent 0.0007 0.0002 0.0002
application/x-bzip2 NaN NaN 0.0000
application/x-endnote-refer 0.0014 0.0016 0.0016
application/x-mobipocket-ebook 0.0006 0.0007 0.0008
application/x-ms-asx 0.0001 0.0001 0.0002
application/x-msdownload 0.0000 0.0000 0.0000
application/x-pds 0.0013 0.0012 0.0012
application/x-rar-compressed NaN 0.0000 NaN
application/x-research-info-systems 0.0000 0.0000 0.0000
application/x-sh 0.0008 0.0007 0.0010
application/x-shockwave-flash 0.0001 0.0001 0.0001
application/x-stata-do 0.0004 0.0005 0.0004
application/x-tex 0.0002 0.0003 0.0003
application/x-tex-tfm 0.0003 0.0003 0.0003
application/x-tika-msoffice 0.0020 0.0023 0.0028
application/x-tika-ooxml 0.0016 0.0017 0.0018
application/x-wais-source 0.0001 0.0001 0.0001
application/x-xz 0.0000 NaN NaN
application/xhtml+xml 8.4126 8.4156 8.6282
application/xml 0.0661 0.0714 0.0697
application/xspf+xml 0.0002 0.0002 0.0002
application/zip 0.0000 0.0000 0.0000
application/zlib 0.0001 0.0001 0.0002
audio/mp4 NaN NaN 0.0000
audio/mpeg 0.0000 0.0000 0.0001
audio/vnd.wave 0.0000 0.0000 0.0000
audio/x-mpegurl 0.0000 0.0000 0.0000
image/gif 0.0000 0.0000 0.0000
image/jpeg 0.0001 0.0001 0.0002
image/png 0.0000 0.0000 0.0000
image/svg+xml 0.0000 0.0000 0.0000
image/tiff 0.0000 NaN NaN
image/vnd.djvu 0.0004 0.0008 0.0004
image/webp 0.0000 0.0000 0.0000
message/rfc822 0.0006 0.0008 0.0007
text/asp 0.0000 0.0000 0.0000
text/aspdotnet NaN NaN 0.0000
text/calendar 0.0410 0.0449 0.0456
text/css 0.0006 0.0006 0.0006
text/csv 0.0032 0.0034 0.0032
text/html 90.4431 90.2830 89.9924
text/plain 0.0773 0.0871 0.0898
text/prs.lines.tag 0.0040 0.0049 0.0067
text/tab-separated-values 0.0004 0.0005 0.0004
text/troff 0.0005 0.0005 0.0005
text/turtle 0.0013 0.0014 0.0012
text/vtt 0.0010 0.0009 0.0008
text/x-c++src 0.0001 0.0002 0.0002
text/x-chdr 0.0005 0.0006 0.0006
text/x-csrc 0.0005 0.0005 0.0007
text/x-diff 0.0009 0.0011 0.0010
text/x-jsp 0.0000 0.0001 0.0001
text/x-log 0.0015 0.0017 0.0016
text/x-matlab 0.0008 0.0010 0.0010
text/x-perl 0.0010 0.0012 0.0013
text/x-php 0.0039 0.0032 0.0033
text/x-python 0.0004 0.0003 0.0004
text/x-vcalendar 0.0003 0.0004 0.0004
text/x-vcard 0.0031 0.0036 0.0037
text/x-web-markdown 0.0012 0.0014 0.0013
text/x-yaml 0.0005 0.0003 0.0004
video/mp4 0.0000 0.0000 0.0000
video/quicktime NaN 0.0000 NaN
video/webm 0.0000 NaN NaN
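
To see how the declared and detected types differ, the following sketch compares a few rows of the two tables for CC-MAIN-2026-12. The values are copied from the tables; the selection of rows and the helper name `shift` are only illustrative choices. It suggests that most of the drop in text/html under detection is matched by a rise in application/xhtml+xml, so the combined share of the two stays nearly the same.

```python
# Percentages for CC-MAIN-2026-12, copied from the two tables.
header = {
    "text/html": 98.5946,
    "application/xhtml+xml": 0.0109,
    "application/pdf": 0.7988,
}
detected = {
    "text/html": 89.9924,
    "application/xhtml+xml": 8.6282,
    "application/pdf": 0.8127,
}


def shift(mime):
    """Detected share minus Content-Type share, in percentage points."""
    return round(detected[mime] - header[mime], 4)


html_family = ("text/html", "application/xhtml+xml")
header_html = round(sum(header[m] for m in html_family), 4)
detected_html = round(sum(detected[m] for m in html_family), 4)

for mime in header:
    print(mime, shift(mime))
print("HTML + XHTML by header:  ", header_html)
print("HTML + XHTML by detection:", detected_html)
```

The comparison covers shares only; it does not show which individual pages were reclassified, since that would require the per-page records rather than these aggregates.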