Statistics of Common Crawl Monthly Archives

Number of pages, distribution of top-level domains, crawl overlaps, etc. - basic metrics about Common Crawl Monthly Crawl Archives
Latest crawl: CC-MAIN-2026-25


MIME Types

The crawled content is dominated by HTML pages; other document formats make up only a small percentage. The tables below show, for each of the latest monthly crawls, the percentage of pages for the 100 most frequent media (MIME) types.

The first table is based on the Content-Type HTTP header, while the second uses the MIME type detected by Apache Tika from the actual content. The underlying data, including page counts, is provided in mimetypes.csv and mimetypes_detected.csv, respectively.
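
As a rough illustration of how page counts could be turned into a percentage table like the ones below, here is a minimal Python sketch. The input layout (tuples of crawl, MIME type, page count), the function name `percentage_table`, and the sample numbers are all hypothetical; the real column layout of mimetypes.csv is not described here. The sketch also assumes that the `<other>` row collects every type outside the top N, and that a type missing from a crawl is simply absent from that crawl's column (which may be what shows up as NaN in the tables).

```python
from collections import defaultdict


def percentage_table(rows, top_n=100):
    """Turn (crawl, mimetype, pages) tuples into {crawl: {mimetype: percent}}.

    Assumption: the top_n types are ranked by page count summed over all
    crawls, and everything else is folded into a single "<other>" entry.
    """
    counts = defaultdict(lambda: defaultdict(int))
    totals_by_type = defaultdict(int)
    totals_by_crawl = defaultdict(int)
    for crawl, mimetype, pages in rows:
        counts[crawl][mimetype] += pages
        totals_by_type[mimetype] += pages
        totals_by_crawl[crawl] += pages

    # Keep only the top_n most frequent types by name.
    ranked = sorted(totals_by_type, key=totals_by_type.get, reverse=True)
    top = set(ranked[:top_n])

    table = {}
    for crawl, per_type in counts.items():
        column = defaultdict(float)
        for mimetype, pages in per_type.items():
            key = mimetype if mimetype in top else "<other>"
            column[key] += 100.0 * pages / totals_by_crawl[crawl]
        table[crawl] = dict(column)
    return table


# Made-up page counts, for illustration only.
SAMPLE_ROWS = [
    ("CRAWL-A", "text/html", 980),
    ("CRAWL-A", "application/pdf", 15),
    ("CRAWL-A", "image/png", 5),
    ("CRAWL-B", "text/html", 970),
    ("CRAWL-B", "application/pdf", 30),
]

if __name__ == "__main__":
    for crawl, column in percentage_table(SAMPLE_ROWS, top_n=2).items():
        for mimetype, percent in sorted(column.items()):
            print(f"{crawl} {mimetype} {percent:.4f}")
```

With `top_n=2`, `image/png` falls outside the top types and is reported under `<other>` for CRAWL-A, while CRAWL-B has no `<other>` entry at all.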

mimetype CC-MAIN-2026-17 CC-MAIN-2026-21 CC-MAIN-2026-25
<other> 0.0610 0.0618 0.0635
application/atom+xml 0.1469 0.1503 0.1595
application/calendar 0.0002 0.0002 0.0003
application/download 0.0018 0.0030 0.0045
application/epub+zip 0.0015 0.0019 0.0018
application/force-download 0.0073 0.0088 0.0113
application/gpx+xml 0.0009 0.0009 0.0009
application/ics 0.0004 0.0004 0.0004
application/javascript 0.0008 0.0006 0.0008
application/json 0.0251 0.0241 0.0256
application/ld+json 0.0015 0.0013 0.0012
application/marc 0.0003 0.0003 0.0003
application/msword 0.0025 0.0029 0.0028
application/octet-stream 0.0469 0.0488 0.0468
application/octetstream 0.0002 0.0002 0.0002
application/pdf 0.8655 1.0857 0.9791
application/pgp-encrypted 0.0001 0.0001 0.0002
application/pgp-signature 0.0020 0.0022 0.0019
application/postscript 0.0002 0.0002 0.0002
application/rdf+xml 0.0031 0.0030 0.0028
application/rss+xml 0.0418 0.0431 0.0428
application/rtf 0.0005 0.0005 0.0007
application/save-to-disk 0.0000 0.0000 0.0000
application/text 0.0001 0.0001 0.0002
application/unknown 0.0002 0.0002 0.0002
application/vnd.android.package-archive 0.0000 0.0000 0.0000
application/vnd.google-earth.kml+xml 0.0011 0.0018 0.0011
application/vnd.google-earth.kmz 0.0005 0.0004 0.0004
application/vnd.ms-excel 0.0012 0.0015 0.0015
application/vnd.ms-powerpoint 0.0002 0.0002 0.0002
application/vnd.ms-word 0.0001 0.0001 0.0001
application/vnd.oasis.opendocument.text 0.0007 0.0011 0.0009
application/vnd.openxmlformats-officedocument.presentationml.presentation 0.0002 0.0003 0.0002
application/vnd.openxmlformats-officedocument.spreadsheetml.sheet 0.0016 0.0019 0.0018
application/vnd.openxmlformats-officedocument.wordprocessingml.document 0.0028 0.0033 0.0031
application/vnd.wap.xhtml+xml 0.0009 0.0008 0.0007
application/x-bibtex 0.0088 0.0091 0.0093
application/x-bittorrent 0.0001 0.0001 0.0003
application/x-debian-package 0.0000 0.0000 0.0000
application/x-download 0.0022 0.0023 0.0024
application/x-endnote-refer 0.0003 0.0003 0.0004
application/x-gzip 0.0002 0.0002 0.0002
application/x-httpd-php 0.0003 0.0004 0.0004
application/x-java-jnlp-file 0.0001 0.0001 0.0000
application/x-javascript 0.0001 0.0002 0.0001
application/x-json 0.0000 0.0000 0.0000
application/x-mobipocket-ebook 0.0004 0.0005 0.0005
application/x-msdownload 0.0003 0.0003 0.0003
application/x-netcdf 0.0004 0.0002 0.0002
application/x-research-info-systems 0.0116 0.0113 0.0115
application/x-shockwave-flash 0.0000 0.0000 0.0000
application/x-tar 0.0001 0.0001 0.0000
application/x-tex 0.0001 0.0002 0.0002
application/x-troff-man 0.0004 0.0003 0.0003
application/x-zip-compressed 0.0000 0.0000 0.0000
application/xhtml+xml 0.0104 0.0096 0.0098
application/xml 0.0254 0.0257 0.0256
application/zip 0.0002 0.0002 0.0002
audio/mpeg 0.0001 0.0001 0.0001
audio/x-mpegurl 0.0006 0.0005 0.0005
audio/x-scpls 0.0001 0.0001 0.0001
audio/x-wav 0.0000 0.0000 0.0000
binary/octet-stream 0.0005 0.0005 0.0005
image/gif 0.0002 0.0001 0.0001
image/jp2 NaN 0.0000 0.0000
image/jpeg 0.0002 0.0003 0.0003
image/jpg 0.0000 0.0000 0.0000
image/pjpeg 0.0000 0.0000 0.0000
image/png 0.0001 0.0002 0.0000
image/svg+xml 0.0000 0.0000 0.0000
image/tiff 0.0000 0.0000 0.0000
image/vnd.djvu 0.0005 0.0004 0.0003
image/webp 0.0000 0.0000 0.0000
message/rfc822 0.0002 0.0003 0.0002
text/calendar 0.0354 0.0340 0.0352
text/css 0.0006 0.0007 0.0006
text/csv 0.0030 0.0030 0.0029
text/directory 0.0002 0.0002 0.0003
text/enriched 0.0000 0.0000 0.0000
text/html 98.5388 98.3251 98.4122
text/javascript 0.0004 0.0004 0.0004
text/markdown 0.0298 0.0134 0.0146
text/pdf 0.0000 0.0000 0.0000
text/plain 0.0485 0.0482 0.0526
text/prs.lines.tag 0.0030 0.0027 0.0024
text/tab-separated-values 0.0004 0.0002 0.0003
text/turtle 0.0010 0.0008 0.0011
text/vcard 0.0012 0.0013 0.0013
text/x-bibtex 0.0005 0.0006 0.0005
text/x-c 0.0001 0.0001 0.0001
text/x-csrc 0.0001 0.0002 0.0001
text/x-diff 0.0003 0.0003 0.0003
text/x-patch 0.0002 0.0002 0.0002
text/x-perl 0.0000 0.0000 0.0001
text/x-vcalendar 0.0004 0.0004 0.0004
text/x-vcard 0.0019 0.0020 0.0019
text/xml 0.0496 0.0502 0.0501
unknown/unknown 0.0000 0.0000 0.0000
video/mp4 0.0000 0.0000 0.0000
video/webm 0.0000 0.0000 0.0000
video/x-ms-asf 0.0002 0.0001 0.0002

mimetype_detected CC-MAIN-2026-17 CC-MAIN-2026-21 CC-MAIN-2026-25
<other> 0.0112 0.0122 0.0127
application/atom+xml 0.1486 0.1518 0.1609
application/epub+zip 0.0018 0.0023 0.0021
application/gpx+xml 0.0009 0.0009 0.0009
application/gzip 0.0000 0.0000 0.0000
application/javascript 0.0011 0.0009 0.0012
application/json 0.0252 0.0240 0.0257
application/marc 0.0007 0.0006 0.0007
application/mbox 0.0011 0.0013 0.0011
application/msword 0.0018 0.0022 0.0020
application/octet-stream 0.0139 0.0133 0.0127
application/pdf 0.8794 1.1018 0.9945
application/pgp-signature 0.0020 0.0022 0.0023
application/pkcs7-signature 0.0004 0.0006 0.0005
application/postscript 0.0001 0.0001 0.0001
application/rdf+xml 0.0063 0.0061 0.0059
application/rss+xml 0.0660 0.0676 0.0674
application/rtf 0.0009 0.0008 0.0011
application/text 0.0000 0.0000 0.0000
application/vnd.android.package-archive 0.0000 0.0000 0.0000
application/vnd.google-earth.kml+xml 0.0024 0.0031 0.0023
application/vnd.google-earth.kmz 0.0005 0.0005 0.0004
application/vnd.ms-excel 0.0007 0.0009 0.0009
application/vnd.ms-powerpoint 0.0002 0.0002 0.0002
application/vnd.oasis.opendocument.spreadsheet 0.0003 0.0005 0.0005
application/vnd.oasis.opendocument.text 0.0009 0.0013 0.0011
application/vnd.openxmlformats-officedocument.presentationml.presentation 0.0002 0.0003 0.0002
application/vnd.openxmlformats-officedocument.spreadsheetml.sheet 0.0016 0.0018 0.0018
application/vnd.openxmlformats-officedocument.wordprocessingml.document 0.0028 0.0033 0.0031
application/x-bibtex-text-file 0.0127 0.0137 0.0141
application/x-bittorrent 0.0002 0.0002 0.0003
application/x-bzip2 NaN 0.0000 NaN
application/x-dosexec NaN 0.0000 0.0000
application/x-endnote-refer 0.0014 0.0013 0.0018
application/x-mobipocket-ebook 0.0006 0.0008 0.0007
application/x-ms-asx 0.0002 0.0001 0.0002
application/x-msdownload 0.0000 0.0000 0.0000
application/x-pds 0.0012 0.0010 0.0006
application/x-rar-compressed 0.0000 0.0000 0.0000
application/x-research-info-systems 0.0000 0.0000 0.0000
application/x-sh 0.0011 0.0010 0.0012
application/x-shockwave-flash 0.0000 0.0000 0.0000
application/x-stata-do 0.0004 0.0004 0.0004
application/x-tex 0.0003 0.0004 0.0004
application/x-tex-tfm 0.0004 0.0002 0.0003
application/x-tika-msoffice 0.0024 0.0025 0.0026
application/x-tika-ooxml 0.0019 0.0021 0.0023
application/x-wais-source 0.0001 0.0002 0.0001
application/xhtml+xml 8.1569 8.0295 7.9275
application/xml 0.0584 0.0562 0.0563
application/zip 0.0000 0.0000 0.0000
application/zlib 0.0003 0.0002 0.0002
audio/mp4 0.0000 NaN NaN
audio/mpeg 0.0000 0.0000 0.0000
audio/vnd.wave 0.0000 0.0000 0.0000
audio/x-mpegurl 0.0000 0.0000 0.0000
image/gif 0.0000 0.0000 0.0000
image/jpeg 0.0001 0.0002 0.0002
image/png 0.0000 0.0000 0.0000
image/svg+xml 0.0000 0.0000 0.0000
image/tiff 0.0000 NaN 0.0000
image/vnd.djvu 0.0007 0.0007 0.0005
image/webp 0.0000 0.0000 0.0000
message/rfc822 0.0007 0.0007 0.0007
text/asp NaN 0.0000 NaN
text/calendar 0.0424 0.0417 0.0433
text/css 0.0006 0.0007 0.0006
text/csv 0.0030 0.0030 0.0030
text/html 90.4072 90.3224 90.5115
text/markdown 0.0284 0.0042 0.0054
text/plain 0.0832 0.0851 0.0914
text/prs.lines.tag 0.0054 0.0048 0.0043
text/tab-separated-values 0.0004 0.0002 0.0003
text/troff 0.0005 0.0003 0.0004
text/turtle 0.0010 0.0008 0.0011
text/vtt 0.0008 0.0008 0.0008
text/x-c++src 0.0001 0.0002 0.0002
text/x-chdr 0.0005 0.0007 0.0008
text/x-csrc 0.0006 0.0005 0.0005
text/x-diff 0.0009 0.0010 0.0010
text/x-jsp 0.0001 0.0001 0.0001
text/x-log 0.0016 0.0013 0.0012
text/x-matlab 0.0013 0.0014 0.0022
text/x-perl 0.0008 0.0009 0.0009
text/x-php 0.0034 0.0025 0.0027
text/x-python 0.0003 0.0005 0.0004
text/x-vcalendar 0.0004 0.0004 0.0004
text/x-vcard 0.0036 0.0039 0.0038
text/x-web-markdown 0.0019 0.0100 0.0103
text/x-yaml 0.0004 0.0003 0.0004
video/mp4 0.0000 0.0000 0.0000
video/quicktime 0.0000 0.0000 NaN
video/webm 0.0000 NaN 0.0000
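
The two tables can be compared row by row to see where the declared Content-Type and the detected type diverge. The following Python sketch does this for a handful of rows of CC-MAIN-2026-25, with the percentages copied from the tables above. The helper names `shift` and `combined` and the grouping `HTML_FAMILY` are illustrative choices, not part of the project.

```python
# Percentages for CC-MAIN-2026-25, copied from the two tables above.
DECLARED = {  # based on the Content-Type HTTP header
    "text/html": 98.4122,
    "application/xhtml+xml": 0.0098,
    "application/pdf": 0.9791,
    "application/octet-stream": 0.0468,
    "text/plain": 0.0526,
}
DETECTED = {  # based on the MIME type detected from the content
    "text/html": 90.5115,
    "application/xhtml+xml": 7.9275,
    "application/pdf": 0.9945,
    "application/octet-stream": 0.0127,
    "text/plain": 0.0914,
}

# Illustrative grouping: HTML and XHTML treated as one family.
HTML_FAMILY = ("text/html", "application/xhtml+xml")


def shift(mimetype):
    """Detected minus declared share, in percentage points."""
    return DETECTED.get(mimetype, 0.0) - DECLARED.get(mimetype, 0.0)


def combined(table, mimetypes):
    """Sum of the shares of several MIME types within one table."""
    return sum(table.get(m, 0.0) for m in mimetypes)


if __name__ == "__main__":
    for mimetype in sorted(DECLARED):
        print(f"{mimetype:26s} {shift(mimetype):+.4f}")
    print(f"HTML family declared: {combined(DECLARED, HTML_FAMILY):.4f}")
    print(f"HTML family detected: {combined(DETECTED, HTML_FAMILY):.4f}")
```

In these numbers the largest shift is between text/html and application/xhtml+xml, while the combined share of the two stays close to 98.4% in both tables. The tables alone do not say why individual pages are classified differently.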