Statistics of Common Crawl Monthly Archives

Number of pages, distribution of top-level domains, crawl overlaps, etc. - basic metrics about Common Crawl Monthly Crawl Archives
Latest crawl: CC-MAIN-2025-05

MIME Types

The crawled content is dominated by HTML pages and contains only a small percentage of other document formats. The tables below show, for each of the latest monthly crawls, the percentage of pages accounted for by each of the top 100 media (MIME) types.

The first table is based on the Content-Type HTTP header, while the second uses the MIME type detected by Apache Tika from the actual content. The underlying data, including page counts, is provided in mimetypes.csv and mimetypes_detected.csv, respectively.
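
As a rough illustration of how a percentage table of this kind could be derived from page counts, the sketch below pivots per-crawl counts into percentages. The column names (`crawl`, `mimetype`, `pages`), the crawl labels, and the sample rows are all assumptions made for this example, not the actual layout of mimetypes.csv; check the file's header before adapting it. A type that is missing from a crawl is reported as NaN, which is one plausible reading of the NaN cells in the tables.

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample in an assumed layout (crawl, mimetype, pages).
# The real mimetypes.csv may use different column names and values.
SAMPLE_CSV = """crawl,mimetype,pages
CC-MAIN-A,text/html,985
CC-MAIN-A,application/pdf,10
CC-MAIN-A,image/jp2,5
CC-MAIN-B,text/html,1970
CC-MAIN-B,application/pdf,30
"""


def percentage_table(csv_text):
    """Return {mimetype: {crawl: percent}}, with NaN where a type is missing."""
    counts = defaultdict(dict)  # mimetype -> crawl -> page count
    totals = defaultdict(int)   # crawl -> total page count
    for row in csv.DictReader(io.StringIO(csv_text)):
        pages = int(row["pages"])
        counts[row["mimetype"]][row["crawl"]] = pages
        totals[row["crawl"]] += pages

    crawls = sorted(totals)
    table = {}
    for mimetype in sorted(counts):
        table[mimetype] = {
            crawl: (100.0 * counts[mimetype][crawl] / totals[crawl]
                    if crawl in counts[mimetype] else float("nan"))
            for crawl in crawls
        }
    return table


if __name__ == "__main__":
    # Print one row per MIME type, one column per crawl.
    for mimetype, row in percentage_table(SAMPLE_CSV).items():
        cells = " ".join(f"{value:.4f}" for value in row.values())
        print(mimetype, cells)
```

The sketch uses only the standard library so that it runs anywhere; with real data, a pivot in a dataframe library would be a natural alternative.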

MIME type from the Content-Type HTTP header (percentage of pages per crawl)
mimetype CC-MAIN-2024-46 CC-MAIN-2024-51 CC-MAIN-2025-05
<other> 0.1398 0.1136 0.0862
application/atom+xml 0.0989 0.1099 0.1023
application/calendar 0.0004 0.0004 0.0004
application/download 0.0032 0.0034 0.0026
application/epub+zip 0.0017 0.0021 0.0012
application/force-download 0.0066 0.0073 0.0065
application/ics 0.0003 0.0004 0.0004
application/javascript 0.0002 0.0003 0.0002
application/json 0.0321 0.0345 0.0308
application/ld+json 0.0019 0.0023 0.0021
application/marc 0.0006 0.0008 0.0006
application/msword 0.0026 0.0028 0.0024
application/octet-stream 0.0496 0.0557 0.0460
application/octetstream 0.0002 0.0002 0.0002
application/pdf 0.7784 0.8173 0.6426
application/pgp-encrypted 0.0001 0.0001 0.0001
application/pgp-signature 0.0033 0.0028 0.0031
application/postscript 0.0001 0.0002 0.0001
application/rdf+xml 0.0052 0.0056 0.0050
application/rss+xml 0.0525 0.0571 0.0505
application/rtf 0.0007 0.0010 0.0009
application/save-to-disk 0.0000 0.0000 0.0000
application/text 0.0002 0.0002 0.0001
application/unknown 0.0003 0.0003 0.0002
application/vnd.android.package-archive 0.0000 0.0000 0.0000
application/vnd.google-earth.kml+xml 0.0018 0.0014 0.0012
application/vnd.google-earth.kmz 0.0007 0.0005 0.0003
application/vnd.ms-excel 0.0018 0.0020 0.0017
application/vnd.ms-powerpoint 0.0004 0.0002 0.0001
application/vnd.ms-word 0.0002 0.0002 0.0002
application/vnd.oasis.opendocument.text 0.0007 0.0008 0.0006
application/vnd.openxmlformats-officedocument.presentationml.presentation 0.0002 0.0002 0.0002
application/vnd.openxmlformats-officedocument.spreadsheetml.sheet 0.0014 0.0015 0.0013
application/vnd.openxmlformats-officedocument.wordprocessingml.document 0.0024 0.0024 0.0020
application/vnd.wap.xhtml+xml 0.0006 0.0007 0.0006
application/x-bibtex 0.0062 0.0067 0.0059
application/x-bittorrent 0.0002 0.0002 0.0001
application/x-bzip2 0.0001 0.0001 0.0001
application/x-debian-package 0.0000 0.0000 0.0000
application/x-download 0.0020 0.0022 0.0020
application/x-endnote-refer 0.0014 0.0012 0.0008
application/x-gzip 0.0003 0.0002 0.0002
application/x-httpd-php 0.0004 0.0004 0.0003
application/x-java-jnlp-file 0.0002 0.0003 0.0003
application/x-javascript 0.0001 0.0001 0.0001
application/x-json 0.0000 0.0000 0.0000
application/x-mobipocket-ebook 0.0006 0.0007 0.0003
application/x-msdownload 0.0002 0.0003 0.0003
application/x-netcdf 0.0002 0.0002 0.0002
application/x-rar-compressed 0.0000 0.0000 0.0000
application/x-research-info-systems 0.0117 0.0117 0.0089
application/x-shockwave-flash 0.0002 0.0002 0.0002
application/x-tar 0.0001 0.0001 0.0001
application/x-tex 0.0002 0.0002 0.0002
application/x-troff-man 0.0007 0.0011 0.0008
application/x-zip-compressed 0.0000 0.0000 0.0000
application/xhtml+xml 0.0116 0.0152 0.0140
application/xml 0.0542 0.0264 0.0278
application/zip 0.0001 0.0002 0.0002
audio/mpeg 0.0000 0.0001 0.0001
audio/x-mpegurl 0.0004 0.0005 0.0005
audio/x-scpls 0.0001 0.0001 0.0001
audio/x-wav 0.0000 0.0000 0.0000
binary/octet-stream 0.0007 0.0007 0.0009
image/gif 0.0005 0.0005 0.0006
image/jp2 0.0000 NaN 0.0000
image/jpeg 0.0003 0.0004 0.0003
image/jpg 0.0000 0.0000 0.0000
image/pjpeg 0.0000 0.0000 0.0000
image/png 0.0001 0.0001 0.0001
image/svg+xml 0.0000 0.0000 0.0000
image/tiff 0.0000 0.0000 0.0000
image/vnd.djvu 0.0000 0.0000 0.0000
image/webp 0.0000 0.0000 0.0000
message/rfc822 0.0003 0.0003 0.0003
text/calendar 0.0300 0.0331 0.0289
text/css 0.0006 0.0006 0.0006
text/csv 0.0035 0.0040 0.0036
text/directory 0.0002 0.0002 0.0002
text/enriched 0.0001 0.0001 0.0001
text/html 98.5402 98.5165 98.7787
text/javascript 0.0002 0.0002 0.0002
text/pdf 0.0000 0.0000 0.0000
text/plain 0.0651 0.0672 0.0557
text/prs.lines.tag 0.0043 0.0050 0.0061
text/tab-separated-values 0.0003 0.0004 0.0003
text/turtle 0.0020 0.0024 0.0020
text/vcard 0.0011 0.0010 0.0009
text/x-bibtex 0.0009 0.0007 0.0009
text/x-c 0.0001 0.0001 0.0001
text/x-csrc 0.0003 0.0004 0.0001
text/x-diff 0.0005 0.0003 0.0002
text/x-patch 0.0004 0.0004 0.0004
text/x-perl 0.0001 0.0001 0.0000
text/x-vcalendar 0.0005 0.0006 0.0004
text/x-vcard 0.0020 0.0021 0.0017
text/xml 0.0649 0.0662 0.0599
unknown/unknown 0.0000 0.0000 0.0000
video/mp4 0.0000 0.0000 0.0000
video/webm NaN 0.0000 0.0000
video/x-ms-asf 0.0002 0.0001 0.0001

MIME type detected by Apache Tika (percentage of pages per crawl)
mimetype_detected CC-MAIN-2024-46 CC-MAIN-2024-51 CC-MAIN-2025-05
<other> 0.0152 0.0125 0.0172
application/atom+xml 0.1012 0.1126 0.1045
application/epub+zip 0.0020 0.0024 0.0014
application/gpx+xml 0.0006 0.0006 0.0005
application/gzip 0.0000 0.0000 0.0000
application/javascript 0.0004 0.0005 0.0005
application/json 0.0329 0.0356 0.0316
application/marc 0.0014 0.0016 0.0013
application/mbox 0.0019 0.0019 0.0015
application/msword 0.0020 0.0021 0.0017
application/octet-stream 0.0152 0.0176 0.0141
application/pdf 0.7948 0.8354 0.6574
application/pgp-signature 0.0033 0.0030 0.0033
application/pkcs7-signature 0.0004 0.0005 0.0004
application/postscript 0.0003 0.0003 0.0003
application/rdf+xml 0.0088 0.0098 0.0091
application/rss+xml 0.0837 0.0910 0.0807
application/rtf 0.0012 0.0016 0.0014
application/text 0.0000 0.0000 0.0000
application/vnd.android.package-archive 0.0000 0.0000 0.0000
application/vnd.google-earth.kml+xml 0.0030 0.0023 0.0023
application/vnd.google-earth.kmz 0.0008 0.0005 0.0004
application/vnd.ms-excel 0.0010 0.0010 0.0009
application/vnd.ms-powerpoint 0.0004 0.0002 0.0001
application/vnd.oasis.opendocument.spreadsheet 0.0004 0.0005 0.0004
application/vnd.oasis.opendocument.text 0.0008 0.0009 0.0007
application/vnd.openxmlformats-officedocument.presentationml.presentation 0.0002 0.0002 0.0002
application/vnd.openxmlformats-officedocument.spreadsheetml.sheet 0.0014 0.0015 0.0012
application/vnd.openxmlformats-officedocument.wordprocessingml.document 0.0024 0.0024 0.0020
application/x-bibtex-text-file 0.0110 0.0112 0.0105
application/x-bittorrent 0.0003 0.0002 0.0002
application/x-dosexec NaN 0.0000 0.0000
application/x-endnote-refer 0.0019 0.0018 0.0014
application/x-mobipocket-ebook 0.0007 0.0009 0.0005
application/x-ms-asx 0.0001 0.0001 0.0001
application/x-msdownload 0.0000 0.0000 0.0000
application/x-pds 0.0019 0.0016 0.0028
application/x-rar-compressed NaN 0.0000 0.0000
application/x-research-info-systems 0.0001 0.0002 0.0001
application/x-sh 0.0011 0.0010 0.0008
application/x-shockwave-flash 0.0002 0.0002 0.0002
application/x-stata-do 0.0004 0.0005 0.0003
application/x-tex 0.0003 0.0003 0.0003
application/x-tika-msoffice 0.0020 0.0026 0.0024
application/x-tika-ooxml 0.0021 0.0023 0.0019
application/x-wais-source 0.0004 0.0003 0.0003
application/x-xz NaN NaN 0.0000
application/xhtml+xml 8.9008 9.2090 8.7049
application/xml 0.0955 0.0658 0.0641
application/zip 0.0000 0.0000 0.0000
application/zlib 0.0001 0.0001 0.0001
audio/mpeg 0.0000 0.0000 0.0000
audio/vnd.wave 0.0000 0.0000 0.0000
audio/x-mpegurl 0.0000 0.0000 0.0000
image/gif 0.0000 0.0000 0.0000
image/jpeg 0.0002 0.0002 0.0002
image/png 0.0000 0.0000 0.0001
image/svg+xml 0.0000 0.0000 0.0000
image/tiff NaN 0.0000 NaN
image/vnd.djvu NaN 0.0000 0.0000
image/webp 0.0000 0.0000 0.0000
message/rfc822 0.0009 0.0011 0.0008
text/asp 0.0000 0.0000 0.0000
text/aspdotnet 0.0000 0.0000 0.0000
text/calendar 0.0353 0.0391 0.0343
text/css 0.0005 0.0006 0.0006
text/csv 0.0035 0.0040 0.0036
text/html 89.6775 89.3434 90.0979
text/plain 0.1566 0.1408 0.1050
text/prs.lines.tag 0.0091 0.0101 0.0112
text/tab-separated-values 0.0004 0.0004 0.0004
text/troff 0.0008 0.0012 0.0008
text/turtle 0.0020 0.0024 0.0020
text/vtt 0.0008 0.0010 0.0008
text/x-c++src 0.0003 0.0003 0.0002
text/x-chdr 0.0005 0.0007 0.0004
text/x-csrc 0.0009 0.0010 0.0006
text/x-diff 0.0014 0.0013 0.0010
text/x-jsp 0.0001 0.0001 0.0001
text/x-log 0.0021 0.0028 0.0023
text/x-matlab 0.0023 0.0023 0.0021
text/x-perl 0.0015 0.0017 0.0014
text/x-php 0.0034 0.0032 0.0035
text/x-python 0.0004 0.0004 0.0003
text/x-vcalendar 0.0005 0.0006 0.0004
text/x-vcard 0.0036 0.0037 0.0031
text/x-web-markdown 0.0005 0.0005 0.0004
text/x-yaml 0.0004 0.0003 0.0003
video/mp4 0.0000 0.0000 0.0000
video/quicktime NaN 0.0000 0.0000
video/x-m4v NaN 0.0000 NaN
video/x-matroska NaN 0.0000 NaN
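
One way to read the two tables together is to add up the HTML-family types in each. The short sketch below does this for CC-MAIN-2025-05, using values copied from the tables. The grouping of text/html with application/xhtml+xml is a choice made for this example, not something the source defines. Both sums come out close to 98.8%, which suggests (but does not prove) that most pages Tika identifies as application/xhtml+xml are served with a text/html Content-Type header.

```python
# Percentages for CC-MAIN-2025-05, copied from the two tables.
HEADER = {"text/html": 98.7787, "application/xhtml+xml": 0.0140}
DETECTED = {"text/html": 90.0979, "application/xhtml+xml": 8.7049}


def html_family_share(row):
    """Sum the shares of text/html and application/xhtml+xml (example grouping)."""
    return row["text/html"] + row["application/xhtml+xml"]


header_share = html_family_share(HEADER)
detected_share = html_family_share(DETECTED)

print(f"Content-Type header: {header_share:.4f}%")
print(f"Detected by Tika:    {detected_share:.4f}%")
print(f"Difference:          {detected_share - header_share:+.4f} percentage points")
```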