Statistics of Common Crawl Monthly Archives

Number of pages, distribution of top-level domains, crawl overlaps, etc. - basic metrics about Common Crawl Monthly Crawl Archives
Latest crawl: CC-MAIN-2025-08


MIME Types

The crawled content is dominated by HTML pages and contains only a small percentage of other document formats. The tables show, for each of the latest monthly crawls, the percentage of pages for the 100 most frequent media (MIME) types.

The first table is based on the Content-Type HTTP header, while the second uses the MIME type detected by Apache Tika from the actual content. The underlying data, including page counts, is provided in mimetypes.csv and mimetypes_detected.csv, respectively.
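As a minimal sketch of how such percentages could be derived from page counts: the column names `crawl`, `mimetype` and `pages` below are assumptions for illustration and may not match the actual layout of mimetypes.csv, and the sample counts are made up.

```python
import csv
import io
from collections import defaultdict


def mime_percentages(csv_text, type_column="mimetype"):
    """Return {mime_type: {crawl: percentage of pages}} from rows of page counts.

    Assumes (hypothetically) one row per crawl and MIME type with the
    columns `crawl`, `pages` and the column named by `type_column`.
    """
    counts = defaultdict(lambda: defaultdict(int))  # mime type -> crawl -> pages
    totals = defaultdict(int)                       # crawl -> total pages
    for row in csv.DictReader(io.StringIO(csv_text)):
        pages = int(row["pages"])
        counts[row[type_column]][row["crawl"]] += pages
        totals[row["crawl"]] += pages
    return {
        mime: {crawl: round(100.0 * n / totals[crawl], 4)
               for crawl, n in per_crawl.items()}
        for mime, per_crawl in counts.items()
    }


# Made-up counts for illustration only, not real Common Crawl figures.
SAMPLE = """crawl,mimetype,pages
CC-MAIN-2025-05,text/html,9900
CC-MAIN-2025-05,application/pdf,100
CC-MAIN-2025-08,text/html,9850
CC-MAIN-2025-08,application/pdf,150
"""

percentages = mime_percentages(SAMPLE)
for mime, per_crawl in sorted(percentages.items()):
    print(mime, per_crawl)
```

Passing `type_column="mimetype_detected"` would apply the same calculation to a file keyed by the detected type, again assuming that column name.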

mimetype CC-MAIN-2024-51 CC-MAIN-2025-05 CC-MAIN-2025-08
<other> 0.1136 0.0862 0.1017
application/atom+xml 0.1099 0.1023 0.1129
application/calendar 0.0004 0.0004 0.0006
application/download 0.0034 0.0026 0.0034
application/epub+zip 0.0021 0.0012 0.0016
application/force-download 0.0073 0.0065 0.0085
application/ics 0.0004 0.0004 0.0005
application/javascript 0.0003 0.0002 0.0002
application/json 0.0345 0.0308 0.0369
application/ld+json 0.0023 0.0021 0.0023
application/marc 0.0008 0.0006 0.0008
application/msword 0.0028 0.0024 0.0028
application/octet-stream 0.0557 0.0460 0.0554
application/octetstream 0.0002 0.0002 0.0003
application/pdf 0.8173 0.6426 0.7679
application/pgp-encrypted 0.0001 0.0001 0.0001
application/pgp-signature 0.0028 0.0031 0.0032
application/postscript 0.0002 0.0001 0.0002
application/rdf+xml 0.0056 0.0050 0.0055
application/rss+xml 0.0571 0.0505 0.0575
application/rtf 0.0010 0.0009 0.0009
application/save-to-disk 0.0000 0.0000 0.0000
application/text 0.0002 0.0001 0.0002
application/unknown 0.0003 0.0002 0.0002
application/vnd.android.package-archive 0.0000 0.0000 0.0000
application/vnd.google-earth.kml+xml 0.0014 0.0012 0.0014
application/vnd.google-earth.kmz 0.0005 0.0003 0.0004
application/vnd.ms-excel 0.0020 0.0017 0.0020
application/vnd.ms-powerpoint 0.0002 0.0001 0.0002
application/vnd.ms-word 0.0002 0.0002 0.0003
application/vnd.oasis.opendocument.text 0.0008 0.0006 0.0007
application/vnd.openxmlformats-officedocument.presentationml.presentation 0.0002 0.0002 0.0002
application/vnd.openxmlformats-officedocument.spreadsheetml.sheet 0.0015 0.0013 0.0014
application/vnd.openxmlformats-officedocument.wordprocessingml.document 0.0024 0.0020 0.0024
application/vnd.wap.xhtml+xml 0.0007 0.0006 0.0010
application/x-bibtex 0.0067 0.0059 0.0070
application/x-bittorrent 0.0002 0.0001 0.0002
application/x-bzip2 0.0001 0.0001 0.0001
application/x-debian-package 0.0000 0.0000 0.0000
application/x-download 0.0022 0.0020 0.0025
application/x-endnote-refer 0.0012 0.0008 0.0009
application/x-gzip 0.0002 0.0002 0.0001
application/x-httpd-php 0.0004 0.0003 0.0007
application/x-java-jnlp-file 0.0003 0.0003 0.0003
application/x-javascript 0.0001 0.0001 0.0001
application/x-json 0.0000 0.0000 0.0000
application/x-mobipocket-ebook 0.0007 0.0003 0.0006
application/x-msdownload 0.0003 0.0003 0.0004
application/x-netcdf 0.0002 0.0002 0.0005
application/x-rar-compressed 0.0000 0.0000 0.0000
application/x-research-info-systems 0.0117 0.0089 0.0105
application/x-shockwave-flash 0.0002 0.0002 0.0002
application/x-tar 0.0001 0.0001 0.0001
application/x-tex 0.0002 0.0002 0.0002
application/x-troff-man 0.0011 0.0008 0.0010
application/x-zip-compressed 0.0000 0.0000 0.0000
application/xhtml+xml 0.0152 0.0140 0.0144
application/xml 0.0264 0.0278 0.0282
application/zip 0.0002 0.0002 0.0001
audio/mpeg 0.0001 0.0001 0.0001
audio/x-mpegurl 0.0005 0.0005 0.0006
audio/x-scpls 0.0001 0.0001 0.0001
audio/x-wav 0.0000 0.0000 0.0000
binary/octet-stream 0.0007 0.0009 0.0007
image/gif 0.0005 0.0006 0.0004
image/jp2 NaN 0.0000 0.0000
image/jpeg 0.0004 0.0003 0.0003
image/jpg 0.0000 0.0000 0.0001
image/pjpeg 0.0000 0.0000 0.0000
image/png 0.0001 0.0001 0.0002
image/svg+xml 0.0000 0.0000 0.0000
image/tiff 0.0000 0.0000 0.0000
image/vnd.djvu 0.0000 0.0000 0.0000
image/webp 0.0000 0.0000 0.0000
message/rfc822 0.0003 0.0003 0.0003
text/calendar 0.0331 0.0289 0.0343
text/css 0.0006 0.0006 0.0006
text/csv 0.0040 0.0036 0.0042
text/directory 0.0002 0.0002 0.0002
text/enriched 0.0001 0.0001 0.0001
text/html 98.5165 98.7787 98.5619
text/javascript 0.0002 0.0002 0.0003
text/pdf 0.0000 0.0000 0.0000
text/plain 0.0672 0.0557 0.0669
text/prs.lines.tag 0.0050 0.0061 0.0059
text/tab-separated-values 0.0004 0.0003 0.0004
text/turtle 0.0024 0.0020 0.0024
text/vcard 0.0010 0.0009 0.0010
text/x-bibtex 0.0007 0.0009 0.0009
text/x-c 0.0001 0.0001 0.0002
text/x-csrc 0.0004 0.0001 0.0003
text/x-diff 0.0003 0.0002 0.0003
text/x-patch 0.0004 0.0004 0.0004
text/x-perl 0.0001 0.0000 0.0001
text/x-vcalendar 0.0006 0.0004 0.0005
text/x-vcard 0.0021 0.0017 0.0021
text/xml 0.0662 0.0599 0.0723
unknown/unknown 0.0000 0.0000 0.0000
video/mp4 0.0000 0.0000 0.0000
video/webm 0.0000 0.0000 0.0000
video/x-ms-asf 0.0001 0.0001 0.0002
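
Some rows contain NaN instead of a percentage. The page does not define it; presumably it marks a crawl for which no value is available for that type. A small sketch for reading one table row, assuming whitespace-separated fields with one value per crawl column (`parse_row` is a hypothetical helper, not part of the project):

```python
import math


def parse_row(line, n_crawls=3):
    """Split a table row into (mime_type, [values]); NaN becomes None."""
    # Split from the right so only the last n_crawls fields count as values.
    mime_type, *fields = line.rsplit(None, n_crawls)
    values = []
    for field in fields:
        number = float(field)  # float("NaN") parses to a NaN float
        values.append(None if math.isnan(number) else number)
    return mime_type, values


row_with_gap = parse_row("image/jp2 NaN 0.0000 0.0000")
row_complete = parse_row("application/pdf 0.8173 0.6426 0.7679")
print(row_with_gap)
print(row_complete)
```

Mapping NaN to `None` keeps a missing value distinct from a percentage that merely rounds to 0.0000.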

mimetype_detected CC-MAIN-2024-51 CC-MAIN-2025-05 CC-MAIN-2025-08
<other> 0.0121 0.0165 0.0153
application/atom+xml 0.1126 0.1045 0.1149
application/epub+zip 0.0024 0.0014 0.0019
application/gpx+xml 0.0006 0.0005 0.0007
application/gzip 0.0000 0.0000 0.0000
application/javascript 0.0005 0.0005 0.0005
application/json 0.0356 0.0316 0.0379
application/marc 0.0016 0.0013 0.0017
application/mbox 0.0019 0.0015 0.0024
application/msword 0.0021 0.0017 0.0021
application/octet-stream 0.0176 0.0141 0.0162
application/pdf 0.8354 0.6574 0.7839
application/pgp-signature 0.0030 0.0033 0.0035
application/pkcs7-signature 0.0005 0.0004 0.0004
application/postscript 0.0003 0.0003 0.0003
application/rdf+xml 0.0098 0.0091 0.0098
application/rss+xml 0.0910 0.0807 0.0910
application/rtf 0.0016 0.0014 0.0014
application/text 0.0000 0.0000 0.0000
application/vnd.android.package-archive 0.0000 0.0000 0.0000
application/vnd.google-earth.kml+xml 0.0023 0.0023 0.0024
application/vnd.google-earth.kmz 0.0005 0.0004 0.0004
application/vnd.ms-excel 0.0010 0.0009 0.0011
application/vnd.ms-powerpoint 0.0002 0.0001 0.0002
application/vnd.oasis.opendocument.spreadsheet 0.0005 0.0004 0.0004
application/vnd.oasis.opendocument.text 0.0009 0.0007 0.0008
application/vnd.openxmlformats-officedocument.presentationml.presentation 0.0002 0.0002 0.0002
application/vnd.openxmlformats-officedocument.spreadsheetml.sheet 0.0015 0.0012 0.0013
application/vnd.openxmlformats-officedocument.wordprocessingml.document 0.0024 0.0020 0.0024
application/x-bibtex-text-file 0.0112 0.0105 0.0119
application/x-bittorrent 0.0002 0.0002 0.0002
application/x-dosexec 0.0000 0.0000 0.0000
application/x-endnote-refer 0.0018 0.0014 0.0016
application/x-mobipocket-ebook 0.0009 0.0005 0.0008
application/x-ms-asx 0.0001 0.0001 0.0002
application/x-msdownload 0.0000 0.0000 0.0000
application/x-pds 0.0016 0.0028 0.0016
application/x-rar-compressed 0.0000 0.0000 0.0000
application/x-research-info-systems 0.0002 0.0001 0.0001
application/x-sh 0.0010 0.0008 0.0011
application/x-shockwave-flash 0.0002 0.0002 0.0002
application/x-stata-do 0.0005 0.0003 0.0005
application/x-tex 0.0003 0.0003 0.0003
application/x-tex-tfm 0.0004 0.0007 0.0006
application/x-tika-msoffice 0.0026 0.0024 0.0028
application/x-tika-ooxml 0.0023 0.0019 0.0021
application/x-wais-source 0.0003 0.0003 0.0003
application/x-xz NaN 0.0000 NaN
application/xhtml+xml 9.2090 8.7049 9.1378
application/xml 0.0658 0.0641 0.0744
application/zip 0.0000 0.0000 0.0000
application/zlib 0.0001 0.0001 0.0003
audio/mpeg 0.0000 0.0000 0.0000
audio/vnd.wave 0.0000 0.0000 0.0000
audio/x-mpegurl 0.0000 0.0000 0.0000
image/gif 0.0000 0.0000 0.0000
image/jpeg 0.0002 0.0002 0.0002
image/png 0.0000 0.0001 0.0000
image/svg+xml 0.0000 0.0000 0.0000
image/tiff 0.0000 NaN 0.0000
image/vnd.djvu 0.0000 0.0000 0.0000
image/webp 0.0000 0.0000 0.0000
message/rfc822 0.0011 0.0008 0.0008
text/asp 0.0000 0.0000 0.0000
text/aspdotnet 0.0000 0.0000 0.0000
text/calendar 0.0391 0.0343 0.0413
text/css 0.0006 0.0006 0.0006
text/csv 0.0040 0.0036 0.0042
text/html 89.3434 90.0979 89.4580
text/plain 0.1408 0.1050 0.1280
text/prs.lines.tag 0.0101 0.0112 0.0116
text/tab-separated-values 0.0004 0.0004 0.0004
text/troff 0.0012 0.0008 0.0011
text/turtle 0.0024 0.0020 0.0024
text/vtt 0.0010 0.0008 0.0009
text/x-c++src 0.0003 0.0002 0.0002
text/x-chdr 0.0007 0.0004 0.0006
text/x-csrc 0.0010 0.0006 0.0009
text/x-diff 0.0013 0.0010 0.0013
text/x-jsp 0.0001 0.0001 0.0001
text/x-log 0.0028 0.0023 0.0030
text/x-matlab 0.0023 0.0021 0.0027
text/x-perl 0.0017 0.0014 0.0017
text/x-php 0.0032 0.0035 0.0044
text/x-python 0.0004 0.0003 0.0004
text/x-vcalendar 0.0006 0.0004 0.0005
text/x-vcard 0.0037 0.0031 0.0037
text/x-web-markdown 0.0005 0.0004 0.0004
text/x-yaml 0.0003 0.0003 0.0004
video/mp4 0.0000 0.0000 0.0000
video/quicktime 0.0000 0.0000 0.0000
video/webm NaN NaN 0.0000
video/x-m4v 0.0000 NaN 0.0000
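
The two tables can be compared type by type. The sketch below uses a few CC-MAIN-2025-08 values copied from the tables above and computes the shift from the declared (Content-Type header) to the detected (Apache Tika) percentage. The helper name `shifts` is hypothetical, and only types present in both tables are compared, since a type missing from one table is not necessarily at zero.

```python
# CC-MAIN-2025-08 percentages taken from the two tables on this page.
declared = {
    "text/html": 98.5619,
    "application/xhtml+xml": 0.0144,
    "application/pdf": 0.7679,
    "text/plain": 0.0669,
}
detected = {
    "text/html": 89.4580,
    "application/xhtml+xml": 9.1378,
    "application/pdf": 0.7839,
    "text/plain": 0.1280,
}


def shifts(declared, detected):
    """Return detected minus declared, in percentage points, per shared type."""
    shared = sorted(set(declared) & set(detected))
    return {t: round(detected[t] - declared[t], 4) for t in shared}


result = shifts(declared, detected)
for mime_type, delta in result.items():
    print(f"{mime_type}: {delta:+.4f}")
```

For this crawl, text/html drops by roughly 9.1 percentage points while application/xhtml+xml rises by about the same amount. This suggests, without proving it, that many pages served as text/html are identified as XHTML by content detection.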