Statistics of Common Crawl Monthly Archives

Number of pages, distribution of top-level domains, crawl overlaps, etc. - basic metrics about Common Crawl Monthly Crawl Archives
Latest crawl: CC-MAIN-2025-13


MIME Types

The crawled content is dominated by HTML pages and contains only a small percentage of other document formats. The tables show the percentage share of each of the top 100 media (MIME) types in the three latest monthly crawls.

The first table is based on the Content-Type HTTP header, while the second uses the MIME type detected by Apache Tika from the actual content. The underlying data, including page counts, is provided in mimetypes.csv and mimetypes_detected.csv, respectively.
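As a rough illustration of how percentages like these can be derived from page counts, the sketch below assumes input rows of the form (crawl, mimetype, pages). The column layout of mimetypes.csv is not shown on this page, so this row shape, the function name `mime_percentages`, and the sample counts are all hypothetical.

```python
from collections import defaultdict

def mime_percentages(rows):
    """Return {crawl: {mimetype: percentage of that crawl's pages}}.

    rows: iterable of (crawl, mimetype, pages) tuples. This layout is an
    assumption, not the documented schema of mimetypes.csv.
    """
    totals = defaultdict(int)                       # pages per crawl
    counts = defaultdict(lambda: defaultdict(int))  # pages per crawl and type
    for crawl, mimetype, pages in rows:
        totals[crawl] += pages
        counts[crawl][mimetype] += pages
    return {
        crawl: {m: round(100.0 * n / totals[crawl], 4) for m, n in by_type.items()}
        for crawl, by_type in counts.items()
    }

# Hypothetical counts, for illustration only (not real crawl data).
sample = [
    ("CRAWL-A", "text/html", 9870),
    ("CRAWL-A", "application/pdf", 70),
    ("CRAWL-A", "<other>", 60),
]
result = mime_percentages(sample)
print(result["CRAWL-A"])
```

Rounding to four decimal places matches the precision used in the tables. Types with very few pages then show up as 0.0000 even though their count is not zero.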

mimetype CC-MAIN-2025-05 CC-MAIN-2025-08 CC-MAIN-2025-13
<other> 0.0862 0.1017 0.1037
application/atom+xml 0.1023 0.1129 0.1113
application/calendar 0.0004 0.0006 0.0002
application/download 0.0026 0.0034 0.0033
application/epub+zip 0.0012 0.0016 0.0018
application/force-download 0.0065 0.0085 0.0072
application/ics 0.0004 0.0005 0.0003
application/javascript 0.0002 0.0002 0.0001
application/json 0.0308 0.0369 0.0327
application/ld+json 0.0021 0.0023 0.0021
application/marc 0.0006 0.0008 0.0009
application/msword 0.0024 0.0028 0.0028
application/octet-stream 0.0460 0.0554 0.0564
application/octetstream 0.0002 0.0003 0.0003
application/pdf 0.6426 0.7679 0.6743
application/pgp-encrypted 0.0001 0.0001 0.0001
application/pgp-signature 0.0031 0.0032 0.0024
application/postscript 0.0001 0.0002 0.0002
application/rdf+xml 0.0050 0.0055 0.0057
application/rss+xml 0.0505 0.0575 0.0538
application/rtf 0.0009 0.0009 0.0009
application/save-to-disk 0.0000 0.0000 0.0000
application/text 0.0001 0.0002 0.0001
application/unknown 0.0002 0.0002 0.0002
application/vnd.android.package-archive 0.0000 0.0000 0.0000
application/vnd.google-earth.kml+xml 0.0012 0.0014 0.0012
application/vnd.google-earth.kmz 0.0003 0.0004 0.0004
application/vnd.ms-excel 0.0017 0.0020 0.0019
application/vnd.ms-powerpoint 0.0001 0.0002 0.0002
application/vnd.ms-word 0.0002 0.0003 0.0003
application/vnd.oasis.opendocument.text 0.0006 0.0007 0.0006
application/vnd.openxmlformats-officedocument.presentationml.presentation 0.0002 0.0002 0.0002
application/vnd.openxmlformats-officedocument.spreadsheetml.sheet 0.0013 0.0014 0.0015
application/vnd.openxmlformats-officedocument.wordprocessingml.document 0.0020 0.0024 0.0022
application/vnd.wap.xhtml+xml 0.0006 0.0010 0.0011
application/x-bibtex 0.0059 0.0070 0.0068
application/x-bittorrent 0.0001 0.0002 0.0001
application/x-bzip2 0.0001 0.0001 0.0001
application/x-debian-package 0.0000 0.0000 0.0000
application/x-download 0.0020 0.0025 0.0028
application/x-endnote-refer 0.0008 0.0009 0.0008
application/x-gzip 0.0002 0.0001 0.0002
application/x-httpd-php 0.0003 0.0007 0.0005
application/x-java-jnlp-file 0.0003 0.0003 0.0004
application/x-javascript 0.0001 0.0001 0.0001
application/x-json 0.0000 0.0000 0.0000
application/x-mobipocket-ebook 0.0003 0.0006 0.0008
application/x-msdownload 0.0003 0.0004 0.0005
application/x-netcdf 0.0002 0.0005 0.0009
application/x-rar-compressed 0.0000 0.0000 0.0000
application/x-research-info-systems 0.0089 0.0105 0.0102
application/x-shockwave-flash 0.0002 0.0002 0.0001
application/x-tar 0.0001 0.0001 0.0001
application/x-tex 0.0002 0.0002 0.0002
application/x-troff-man 0.0008 0.0010 0.0014
application/x-zip-compressed 0.0000 0.0000 0.0000
application/xhtml+xml 0.0140 0.0144 0.0101
application/xml 0.0278 0.0282 0.0253
application/zip 0.0002 0.0001 0.0001
audio/mpeg 0.0001 0.0001 0.0001
audio/x-mpegurl 0.0005 0.0006 0.0005
audio/x-scpls 0.0001 0.0001 0.0002
audio/x-wav 0.0000 0.0000 0.0000
binary/octet-stream 0.0009 0.0007 0.0005
image/gif 0.0006 0.0004 0.0006
image/jp2 0.0000 0.0000 0.0000
image/jpeg 0.0003 0.0003 0.0003
image/jpg 0.0000 0.0001 0.0001
image/pjpeg 0.0000 0.0000 0.0000
image/png 0.0001 0.0002 0.0007
image/svg+xml 0.0000 0.0000 0.0000
image/tiff 0.0000 0.0000 0.0000
image/vnd.djvu 0.0000 0.0000 0.0008
image/webp 0.0000 0.0000 0.0000
message/rfc822 0.0003 0.0003 0.0003
text/calendar 0.0289 0.0343 0.0310
text/css 0.0006 0.0006 0.0006
text/csv 0.0036 0.0042 0.0040
text/directory 0.0002 0.0002 0.0002
text/enriched 0.0001 0.0001 0.0000
text/html 98.7787 98.5619 98.6744
text/javascript 0.0002 0.0003 0.0003
text/pdf 0.0000 0.0000 0.0000
text/plain 0.0557 0.0669 0.0702
text/prs.lines.tag 0.0061 0.0059 0.0091
text/tab-separated-values 0.0003 0.0004 0.0004
text/turtle 0.0020 0.0024 0.0023
text/vcard 0.0009 0.0010 0.0009
text/x-bibtex 0.0009 0.0009 0.0004
text/x-c 0.0001 0.0002 0.0002
text/x-csrc 0.0001 0.0003 0.0002
text/x-diff 0.0002 0.0003 0.0003
text/x-patch 0.0004 0.0004 0.0004
text/x-perl 0.0000 0.0001 0.0000
text/x-vcalendar 0.0004 0.0005 0.0005
text/x-vcard 0.0017 0.0021 0.0020
text/xml 0.0599 0.0723 0.0667
unknown/unknown 0.0000 0.0000 0.0000
video/mp4 0.0000 0.0000 0.0000
video/webm 0.0000 0.0000 0.0000
video/x-ms-asf 0.0001 0.0002 0.0002

mimetype_detected CC-MAIN-2025-05 CC-MAIN-2025-08 CC-MAIN-2025-13
<other> 0.0165 0.0153 0.0135
application/atom+xml 0.1045 0.1149 0.1131
application/epub+zip 0.0014 0.0019 0.0020
application/gpx+xml 0.0005 0.0007 0.0005
application/gzip 0.0000 0.0000 0.0000
application/javascript 0.0005 0.0005 0.0005
application/json 0.0316 0.0379 0.0336
application/marc 0.0013 0.0017 0.0017
application/mbox 0.0015 0.0024 0.0022
application/msword 0.0017 0.0021 0.0021
application/octet-stream 0.0141 0.0162 0.0189
application/pdf 0.6574 0.7839 0.6893
application/pgp-signature 0.0033 0.0035 0.0035
application/pkcs7-signature 0.0004 0.0004 0.0004
application/postscript 0.0003 0.0003 0.0004
application/rdf+xml 0.0091 0.0098 0.0103
application/rss+xml 0.0807 0.0910 0.0862
application/rtf 0.0014 0.0014 0.0013
application/text 0.0000 0.0000 0.0000
application/vnd.android.package-archive 0.0000 0.0000 0.0000
application/vnd.google-earth.kml+xml 0.0023 0.0024 0.0022
application/vnd.google-earth.kmz 0.0004 0.0004 0.0004
application/vnd.ms-excel 0.0009 0.0011 0.0011
application/vnd.ms-powerpoint 0.0001 0.0002 0.0002
application/vnd.oasis.opendocument.spreadsheet 0.0004 0.0004 0.0005
application/vnd.oasis.opendocument.text 0.0007 0.0008 0.0007
application/vnd.openxmlformats-officedocument.presentationml.presentation 0.0002 0.0002 0.0002
application/vnd.openxmlformats-officedocument.spreadsheetml.sheet 0.0012 0.0013 0.0015
application/vnd.openxmlformats-officedocument.wordprocessingml.document 0.0020 0.0024 0.0022
application/x-bibtex-text-file 0.0105 0.0119 0.0119
application/x-bittorrent 0.0002 0.0002 0.0002
application/x-bzip2 NaN NaN 0.0000
application/x-dosexec 0.0000 0.0000 0.0000
application/x-endnote-refer 0.0014 0.0016 0.0018
application/x-mobipocket-ebook 0.0005 0.0008 0.0009
application/x-ms-asx 0.0001 0.0002 0.0002
application/x-msdownload 0.0000 0.0000 0.0000
application/x-pds 0.0028 0.0016 0.0031
application/x-rar-compressed 0.0000 0.0000 0.0000
application/x-research-info-systems 0.0001 0.0001 0.0001
application/x-sh 0.0008 0.0011 0.0011
application/x-shockwave-flash 0.0002 0.0002 0.0001
application/x-stata-do 0.0003 0.0005 0.0005
application/x-tex 0.0003 0.0003 0.0003
application/x-tex-tfm 0.0007 0.0006 0.0006
application/x-tika-msoffice 0.0024 0.0028 0.0030
application/x-tika-ooxml 0.0019 0.0021 0.0021
application/x-wais-source 0.0003 0.0003 0.0003
application/x-xz 0.0000 NaN NaN
application/xhtml+xml 8.7049 9.1378 8.8010
application/xml 0.0641 0.0744 0.0694
application/zip 0.0000 0.0000 0.0000
application/zlib 0.0001 0.0003 0.0001
audio/mpeg 0.0000 0.0000 0.0000
audio/vnd.wave 0.0000 0.0000 0.0000
audio/x-mpegurl 0.0000 0.0000 0.0000
image/gif 0.0000 0.0000 0.0000
image/jpeg 0.0002 0.0002 0.0002
image/png 0.0001 0.0000 0.0000
image/svg+xml 0.0000 0.0000 0.0000
image/tiff NaN 0.0000 0.0000
image/vnd.djvu 0.0000 0.0000 0.0010
image/webp 0.0000 0.0000 0.0000
message/rfc822 0.0008 0.0008 0.0007
text/asp 0.0000 0.0000 0.0000
text/aspdotnet 0.0000 0.0000 NaN
text/calendar 0.0343 0.0413 0.0365
text/css 0.0006 0.0006 0.0006
text/csv 0.0036 0.0042 0.0040
text/html 90.0979 89.4580 89.9029
text/plain 0.1050 0.1280 0.1302
text/prs.lines.tag 0.0112 0.0116 0.0150
text/tab-separated-values 0.0004 0.0004 0.0004
text/troff 0.0008 0.0011 0.0014
text/turtle 0.0020 0.0024 0.0023
text/vtt 0.0008 0.0009 0.0009
text/x-c++src 0.0002 0.0002 0.0002
text/x-chdr 0.0004 0.0006 0.0005
text/x-csrc 0.0006 0.0009 0.0008
text/x-diff 0.0010 0.0013 0.0013
text/x-jsp 0.0001 0.0001 0.0001
text/x-log 0.0023 0.0030 0.0033
text/x-matlab 0.0021 0.0027 0.0025
text/x-perl 0.0014 0.0017 0.0019
text/x-php 0.0035 0.0044 0.0032
text/x-python 0.0003 0.0004 0.0003
text/x-vcalendar 0.0004 0.0005 0.0005
text/x-vcard 0.0031 0.0037 0.0034
text/x-web-markdown 0.0004 0.0004 0.0005
text/x-yaml 0.0003 0.0004 0.0004
video/mp4 0.0000 0.0000 0.0000
video/quicktime 0.0000 0.0000 NaN
video/webm NaN 0.0000 0.0000
video/x-m4v NaN 0.0000 0.0000
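
The two tables can be compared type by type to see where the Content-Type header and the detected type disagree. The following is a minimal sketch that uses a few CC-MAIN-2025-13 values copied from the tables. The function name `shift` and the dictionary layout are illustrative assumptions, and NaN is treated as "no value for that crawl", which is an interpretation rather than something the page states.

```python
import math

# Percentages for CC-MAIN-2025-13, copied from the two tables.
declared = {
    "text/html": 98.6744,
    "application/xhtml+xml": 0.0101,
    "application/pdf": 0.6743,
    "application/rss+xml": 0.0538,
}
detected = {
    "text/html": 89.9029,
    "application/xhtml+xml": 8.8010,
    "application/pdf": 0.6893,
    "application/rss+xml": 0.0862,
}

def shift(declared, detected):
    """Detected minus declared percentage for types present in both tables.

    NaN entries are skipped (assumed to mean no value for that crawl).
    """
    out = {}
    for mime in declared.keys() & detected.keys():
        a, b = declared[mime], detected[mime]
        if math.isnan(a) or math.isnan(b):
            continue
        out[mime] = round(b - a, 4)
    return out

diffs = shift(declared, detected)
# Print from the largest drop to the largest gain.
for mime, delta in sorted(diffs.items(), key=lambda kv: kv[1]):
    print(f"{mime:25s} {delta:+.4f}")
```

In this small sample the drop in text/html is close in size to the rise in application/xhtml+xml. That suggests many pages served with a text/html header are detected as XHTML, though the tables alone do not confirm it.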