Statistics of Common Crawl Monthly Archives

Number of pages, distribution of top-level domains, crawl overlaps, etc. - basic metrics about Common Crawl Monthly Crawl Archives
Latest crawl: CC-MAIN-2025-26


MIME Types

The crawled content is dominated by HTML pages and contains only a small percentage of other document formats. The tables show, for each of the three latest monthly crawls, the percentage share of the 100 most frequent media (MIME) types.

The first table is based on the Content-Type HTTP header, while the second uses the MIME type detected by Apache Tika from the actual content. The underlying data, including page counts, is provided in mimetypes.csv and mimetypes_detected.csv, respectively.
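As a rough illustration of how percentage tables like these could be derived from per-crawl page counts, here is a minimal sketch using only the Python standard library. The column names `crawl`, `mimetype` and `pages`, the crawl labels and the sample counts are assumptions made for the example; the actual layout of mimetypes.csv may differ. A MIME type with no row for a given crawl ends up as NaN, which is presumably how the NaN cells in the second table arise.

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample in an assumed long format (crawl, mimetype, pages).
# The real mimetypes.csv may use different column names and values.
SAMPLE_CSV = """crawl,mimetype,pages
CC-MAIN-A,text/html,9800
CC-MAIN-A,application/pdf,150
CC-MAIN-A,application/x-bzip2,50
CC-MAIN-B,text/html,4900
CC-MAIN-B,application/pdf,100
"""


def percentage_table(csv_file):
    """Return {mimetype: {crawl: percent}} from rows of per-crawl page counts."""
    counts = defaultdict(dict)  # mimetype -> crawl -> pages
    totals = defaultdict(int)   # crawl -> total pages in that crawl
    for row in csv.DictReader(csv_file):
        pages = int(row["pages"])
        counts[row["mimetype"]][row["crawl"]] = pages
        totals[row["crawl"]] += pages

    crawls = sorted(totals)
    table = {}
    for mimetype, per_crawl in counts.items():
        # Share within each crawl, in percent; NaN where the type has no row.
        table[mimetype] = {
            crawl: round(100.0 * per_crawl[crawl] / totals[crawl], 4)
            if crawl in per_crawl else float("nan")
            for crawl in crawls
        }
    return table


table = percentage_table(io.StringIO(SAMPLE_CSV))
for mimetype in sorted(table):
    cells = " ".join(f"{value:.4f}" for value in table[mimetype].values())
    print(mimetype, cells)
```

To try this on a downloaded file, pass an open file object instead of the in-memory sample, after adjusting the column names to match the real header.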

mimetype CC-MAIN-2025-18 CC-MAIN-2025-21 CC-MAIN-2025-26
<other> 0.1336 0.1631 0.1764
application/atom+xml 0.1059 0.1047 0.1283
application/calendar 0.0002 0.0002 0.0002
application/download 0.0032 0.0025 0.0019
application/epub+zip 0.0019 0.0015 0.0018
application/force-download 0.0075 0.0068 0.0066
application/ics 0.0004 0.0003 0.0004
application/javascript 0.0002 0.0010 0.0013
application/json 0.0312 0.0239 0.0264
application/ld+json 0.0022 0.0015 0.0017
application/marc 0.0008 0.0006 0.0005
application/msword 0.0027 0.0017 0.0019
application/octet-stream 0.0574 0.0372 0.0404
application/octetstream 0.0002 0.0002 0.0002
application/pdf 0.6627 0.5871 0.6135
application/pgp-encrypted 0.0001 0.0001 0.0001
application/pgp-signature 0.0031 0.0015 0.0018
application/postscript 0.0001 0.0001 0.0001
application/rdf+xml 0.0055 0.0041 0.0041
application/rss+xml 0.0554 0.0431 0.0435
application/rtf 0.0009 0.0008 0.0008
application/save-to-disk 0.0000 0.0000 0.0000
application/text 0.0002 0.0002 0.0002
application/unknown 0.0002 0.0001 0.0002
application/vnd.android.package-archive 0.0000 0.0000 0.0000
application/vnd.google-earth.kml+xml 0.0015 0.0012 0.0014
application/vnd.google-earth.kmz 0.0003 0.0003 0.0003
application/vnd.ms-excel 0.0018 0.0009 0.0009
application/vnd.ms-powerpoint 0.0002 0.0002 0.0001
application/vnd.ms-word 0.0002 0.0002 0.0002
application/vnd.oasis.opendocument.text 0.0006 0.0005 0.0005
application/vnd.openxmlformats-officedocument.presentationml.presentation 0.0002 0.0002 0.0002
application/vnd.openxmlformats-officedocument.spreadsheetml.sheet 0.0017 0.0012 0.0014
application/vnd.openxmlformats-officedocument.wordprocessingml.document 0.0022 0.0020 0.0020
application/vnd.wap.xhtml+xml 0.0011 0.0010 0.0002
application/x-bibtex 0.0067 0.0070 0.0072
application/x-bittorrent 0.0001 0.0001 0.0002
application/x-bzip2 0.0001 0.0000 0.0000
application/x-debian-package 0.0000 0.0000 0.0000
application/x-download 0.0025 0.0019 0.0020
application/x-endnote-refer 0.0006 0.0007 0.0008
application/x-gzip 0.0001 0.0001 0.0001
application/x-httpd-php 0.0005 0.0004 0.0004
application/x-java-jnlp-file 0.0003 0.0003 0.0002
application/x-javascript 0.0001 0.0001 0.0001
application/x-json 0.0000 0.0000 0.0000
application/x-mobipocket-ebook 0.0007 0.0005 0.0005
application/x-msdownload 0.0004 0.0002 0.0003
application/x-netcdf 0.0002 0.0002 0.0001
application/x-rar-compressed 0.0000 0.0000 0.0000
application/x-research-info-systems 0.0098 0.0098 0.0095
application/x-shockwave-flash 0.0001 0.0001 0.0001
application/x-tar 0.0001 0.0000 0.0000
application/x-tex 0.0002 0.0001 0.0001
application/x-troff-man 0.0009 0.0003 0.0004
application/x-zip-compressed 0.0000 0.0000 0.0000
application/xhtml+xml 0.0102 0.0097 0.0092
application/xml 0.0240 0.0218 0.0235
application/zip 0.0002 0.0001 0.0001
audio/mpeg 0.0001 0.0000 0.0000
audio/x-mpegurl 0.0005 0.0004 0.0004
audio/x-scpls 0.0001 0.0001 0.0001
audio/x-wav 0.0000 0.0000 0.0000
binary/octet-stream 0.0006 0.0004 0.0005
image/gif 0.0005 0.0005 0.0005
image/jp2 0.0000 0.0000 0.0000
image/jpeg 0.0004 0.0002 0.0003
image/jpg 0.0001 0.0000 0.0000
image/pjpeg 0.0000 0.0000 0.0000
image/png 0.0001 0.0002 0.0010
image/svg+xml 0.0000 0.0000 0.0000
image/tiff 0.0000 0.0000 0.0000
image/vnd.djvu 0.0005 0.0004 0.0009
image/webp 0.0000 0.0000 0.0000
message/rfc822 0.0003 0.0003 0.0003
text/calendar 0.0323 0.0299 0.0314
text/css 0.0006 0.0005 0.0007
text/csv 0.0043 0.0029 0.0031
text/directory 0.0002 0.0002 0.0003
text/enriched 0.0000 0.0001 0.0001
text/html 98.6477 98.8044 98.7237
text/javascript 0.0003 0.0003 0.0003
text/pdf 0.0000 0.0000 0.0000
text/plain 0.0717 0.0490 0.0522
text/prs.lines.tag 0.0068 0.0017 0.0014
text/tab-separated-values 0.0003 0.0003 0.0005
text/turtle 0.0024 0.0020 0.0021
text/vcard 0.0009 0.0010 0.0010
text/x-bibtex 0.0004 0.0004 0.0004
text/x-c 0.0003 0.0001 0.0001
text/x-csrc 0.0004 0.0001 0.0002
text/x-diff 0.0004 0.0002 0.0002
text/x-patch 0.0004 0.0002 0.0003
text/x-perl 0.0001 0.0000 0.0000
text/x-vcalendar 0.0004 0.0003 0.0004
text/x-vcard 0.0020 0.0018 0.0020
text/xml 0.0816 0.0584 0.0609
unknown/unknown 0.0000 0.0000 0.0000
video/mp4 0.0000 0.0000 0.0000
video/webm 0.0000 0.0000 0.0000
video/x-ms-asf 0.0001 0.0001 0.0001

mimetype_detected CC-MAIN-2025-18 CC-MAIN-2025-21 CC-MAIN-2025-26
<other> 0.0137 0.0105 0.0105
application/atom+xml 0.1077 0.1067 0.1304
application/epub+zip 0.0023 0.0017 0.0021
application/gpx+xml 0.0006 0.0006 0.0006
application/gzip 0.0000 0.0000 0.0000
application/javascript 0.0006 0.0006 0.0007
application/json 0.0323 0.0242 0.0266
application/marc 0.0016 0.0014 0.0012
application/mbox 0.0017 0.0012 0.0014
application/msword 0.0019 0.0013 0.0014
application/octet-stream 0.0187 0.0124 0.0123
application/pdf 0.6806 0.5982 0.6259
application/pgp-signature 0.0029 0.0013 0.0015
application/pkcs7-signature 0.0005 0.0003 0.0004
application/postscript 0.0003 0.0003 0.0004
application/rdf+xml 0.0096 0.0085 0.0088
application/rss+xml 0.0874 0.0684 0.0695
application/rtf 0.0014 0.0012 0.0013
application/text 0.0000 0.0000 0.0000
application/vnd.android.package-archive 0.0000 0.0000 0.0000
application/vnd.google-earth.kml+xml 0.0025 0.0022 0.0023
application/vnd.google-earth.kmz 0.0003 0.0003 0.0003
application/vnd.ms-excel 0.0012 0.0007 0.0006
application/vnd.ms-powerpoint 0.0002 0.0002 0.0001
application/vnd.oasis.opendocument.spreadsheet 0.0003 0.0003 0.0003
application/vnd.oasis.opendocument.text 0.0007 0.0006 0.0006
application/vnd.openxmlformats-officedocument.presentationml.presentation 0.0002 0.0002 0.0002
application/vnd.openxmlformats-officedocument.spreadsheetml.sheet 0.0017 0.0012 0.0013
application/vnd.openxmlformats-officedocument.wordprocessingml.document 0.0023 0.0020 0.0020
application/x-bibtex-text-file 0.0152 0.0113 0.0124
application/x-bittorrent 0.0002 0.0002 0.0002
application/x-bzip2 0.0000 NaN NaN
application/x-debian-package NaN 0.0000 0.0000
application/x-dosexec 0.0000 NaN NaN
application/x-endnote-refer 0.0017 0.0016 0.0018
application/x-mobipocket-ebook 0.0008 0.0006 0.0006
application/x-ms-asx 0.0001 0.0001 0.0001
application/x-msdownload 0.0000 0.0000 0.0000
application/x-pds 0.0025 0.0009 0.0012
application/x-rar-compressed 0.0000 0.0000 NaN
application/x-research-info-systems 0.0001 0.0001 0.0001
application/x-sh 0.0010 0.0008 0.0007
application/x-shockwave-flash 0.0001 0.0001 0.0001
application/x-stata-do 0.0004 0.0004 0.0005
application/x-tex 0.0003 0.0003 0.0002
application/x-tex-tfm 0.0011 0.0003 0.0002
application/x-tika-msoffice 0.0026 0.0018 0.0020
application/x-tika-ooxml 0.0020 0.0014 0.0017
application/x-wais-source 0.0002 0.0002 0.0002
application/x-xz NaN 0.0000 NaN
application/xhtml+xml 8.7939 8.3765 8.5312
application/xml 0.0808 0.0628 0.0671
application/zip 0.0000 0.0000 0.0000
application/zlib 0.0001 0.0001 0.0001
audio/mp4 0.0000 NaN NaN
audio/mpeg 0.0000 0.0000 0.0000
audio/vnd.wave 0.0000 0.0000 0.0000
audio/x-mpegurl 0.0000 0.0000 0.0000
image/gif 0.0000 0.0000 0.0000
image/jpeg 0.0002 0.0001 0.0002
image/png 0.0000 0.0000 0.0000
image/svg+xml 0.0000 0.0000 0.0000
image/tiff 0.0000 0.0000 NaN
image/vnd.djvu 0.0007 0.0006 0.0011
image/webp 0.0000 0.0000 0.0000
message/rfc822 0.0007 0.0006 0.0006
text/asp 0.0000 0.0000 0.0000
text/calendar 0.0382 0.0358 0.0373
text/css 0.0006 0.0005 0.0007
text/csv 0.0043 0.0030 0.0031
text/html 89.8826 90.4555 90.2679
text/plain 0.1617 0.1760 0.1431
text/prs.lines.tag 0.0116 0.0037 0.0035
text/tab-separated-values 0.0004 0.0003 0.0005
text/troff 0.0010 0.0004 0.0005
text/turtle 0.0024 0.0020 0.0021
text/vtt 0.0010 0.0008 0.0009
text/x-c++src 0.0003 0.0002 0.0001
text/x-chdr 0.0006 0.0005 0.0004
text/x-csrc 0.0008 0.0005 0.0006
text/x-diff 0.0014 0.0008 0.0009
text/x-jsp 0.0000 0.0000 0.0000
text/x-log 0.0038 0.0020 0.0020
text/x-matlab 0.0018 0.0012 0.0011
text/x-perl 0.0018 0.0018 0.0018
text/x-php 0.0028 0.0029 0.0030
text/x-python 0.0003 0.0002 0.0002
text/x-vcalendar 0.0004 0.0004 0.0005
text/x-vcard 0.0034 0.0033 0.0037
text/x-web-markdown 0.0005 0.0004 0.0005
text/x-yaml 0.0003 0.0003 0.0004
video/mp4 0.0000 0.0000 0.0000
video/quicktime 0.0000 0.0000 0.0000
video/webm 0.0000 0.0000 0.0000
video/x-m4v NaN 0.0000 NaN
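
To make the difference between the declared and the detected MIME types concrete, the following sketch compares a handful of rows for CC-MAIN-2025-26. The percentages are copied from the two tables; the selection of rows and the helper name `differences` are choices made for this illustration only.

```python
# Percentages for CC-MAIN-2025-26, copied from the two tables.
declared = {  # based on the Content-Type HTTP header
    "text/html": 98.7237,
    "application/xhtml+xml": 0.0092,
    "application/pdf": 0.6135,
    "text/plain": 0.0522,
    "application/rss+xml": 0.0435,
}
detected = {  # based on content detection by Apache Tika
    "text/html": 90.2679,
    "application/xhtml+xml": 8.5312,
    "application/pdf": 0.6259,
    "text/plain": 0.1431,
    "application/rss+xml": 0.0695,
}


def differences(declared, detected):
    """Detected minus declared share, in percentage points, largest shift first."""
    diffs = {
        mimetype: round(detected[mimetype] - declared[mimetype], 4)
        for mimetype in declared
        if mimetype in detected
    }
    return sorted(diffs.items(), key=lambda item: abs(item[1]), reverse=True)


shifts = differences(declared, detected)
for mimetype, diff in shifts:
    print(f"{mimetype:25s} {diff:+.4f}")
```

In this selection the drop in text/html (about 8.46 percentage points) is almost matched by the rise in application/xhtml+xml (about 8.52 points), which suggests that many pages served with a text/html header are identified as XHTML when the content itself is inspected.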