Statistics of Common Crawl Monthly Archives

Number of pages, distribution of top-level domains, crawl overlaps, etc. - basic metrics about Common Crawl Monthly Crawl Archives
Latest crawl: CC-MAIN-2024-46

MIME Types

The crawled content is dominated by HTML pages and contains only a small percentage of other document formats. The tables show the percentage share of the 100 most frequent media (MIME) types in the latest monthly crawls.

While the first table is based on the Content-Type HTTP header, the second uses the MIME type detected by Apache Tika from the actual content. The underlying data, including page counts, is provided in mimetypes.csv and mimetypes_detected.csv, respectively.
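
As a rough illustration of how such a percentage table could be derived from per-crawl page counts, the sketch below uses only the Python standard library. The column names `crawl`, `mimetype` and `pages`, the helper `percentage_table`, and the sample counts are assumptions made for the example, not the documented layout of mimetypes.csv. A MIME type that is absent from a crawl is reported as NaN, which is one plausible reading of the NaN cells in the tables.

```python
import csv
import io
import math
from collections import defaultdict

# Hypothetical sample in an assumed layout (crawl, mimetype, pages);
# the real mimetypes.csv may use different column names.
SAMPLE_CSV = """crawl,mimetype,pages
CC-MAIN-A,text/html,980
CC-MAIN-A,application/pdf,15
CC-MAIN-A,video/webm,5
CC-MAIN-B,text/html,1970
CC-MAIN-B,application/pdf,30
"""


def percentage_table(csv_text):
    """Return {mimetype: {crawl: percent of that crawl's pages}}."""
    counts = defaultdict(dict)  # mimetype -> crawl -> page count
    totals = defaultdict(int)   # crawl -> total page count
    for row in csv.DictReader(io.StringIO(csv_text)):
        pages = int(row["pages"])
        counts[row["mimetype"]][row["crawl"]] = pages
        totals[row["crawl"]] += pages

    table = {}
    for mimetype, per_crawl in counts.items():
        # A type missing from a crawl becomes NaN instead of 0.
        table[mimetype] = {
            crawl: (100.0 * per_crawl[crawl] / total
                    if crawl in per_crawl else math.nan)
            for crawl, total in totals.items()
        }
    return table


if __name__ == "__main__":
    table = percentage_table(SAMPLE_CSV)
    for mimetype in sorted(table):
        cells = " ".join(
            f"{table[mimetype][crawl]:.4f}" for crawl in sorted(table[mimetype])
        )
        print(mimetype, cells)
```

Keeping NaN distinct from 0.0000 separates "not present in this crawl's data" from "present, but below the rounding precision".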

mimetype CC-MAIN-2024-38 CC-MAIN-2024-42 CC-MAIN-2024-46
<other> 0.1286 0.1490 0.1403
application/atom+xml 0.1005 0.1106 0.0989
application/download 0.0023 0.0035 0.0032
application/epub+zip 0.0011 0.0017 0.0017
application/force-download 0.0057 0.0082 0.0066
application/ics 0.0003 0.0005 0.0003
application/javascript 0.0002 0.0002 0.0002
application/json 0.0259 0.0306 0.0321
application/ld+json 0.0018 0.0020 0.0019
application/marc 0.0005 0.0006 0.0006
application/msword 0.0024 0.0034 0.0026
application/octet-stream 0.0443 0.0544 0.0496
application/octetstream 0.0003 0.0004 0.0002
application/pdf 0.6007 1.0165 0.7784
application/pgp-encrypted 0.0001 0.0001 0.0001
application/pgp-signature 0.0027 0.0035 0.0033
application/postscript 0.0001 0.0002 0.0001
application/rdf+xml 0.0050 0.0058 0.0052
application/rss+xml 0.0497 0.0577 0.0525
application/rtf 0.0012 0.0009 0.0007
application/save-to-disk 0.0000 0.0000 0.0000
application/text 0.0002 0.0002 0.0002
application/unknown 0.0002 0.0003 0.0003
application/vnd.android.package-archive 0.0000 0.0000 0.0000
application/vnd.google-earth.kml+xml 0.0012 0.0014 0.0018
application/vnd.google-earth.kmz 0.0006 0.0008 0.0007
application/vnd.ms-excel 0.0018 0.0019 0.0018
application/vnd.ms-powerpoint 0.0003 0.0004 0.0004
application/vnd.ms-word 0.0002 0.0002 0.0002
application/vnd.oasis.opendocument.text 0.0005 0.0008 0.0007
application/vnd.openxmlformats-officedocument.presentationml.presentation 0.0002 0.0003 0.0002
application/vnd.openxmlformats-officedocument.spreadsheetml.sheet 0.0012 0.0017 0.0014
application/vnd.openxmlformats-officedocument.wordprocessingml.document 0.0018 0.0027 0.0024
application/vnd.wap.xhtml+xml 0.0004 0.0006 0.0006
application/x-bibtex 0.0052 0.0063 0.0062
application/x-bittorrent 0.0001 0.0002 0.0002
application/x-bzip2 0.0001 0.0002 0.0001
application/x-debian-package 0.0000 0.0000 0.0000
application/x-download 0.0018 0.0021 0.0020
application/x-endnote-refer 0.0010 0.0013 0.0014
application/x-gzip 0.0003 0.0004 0.0003
application/x-httpd-php 0.0003 0.0003 0.0004
application/x-java-jnlp-file 0.0003 0.0003 0.0002
application/x-javascript 0.0001 0.0001 0.0001
application/x-json 0.0000 0.0000 0.0000
application/x-mobipocket-ebook 0.0004 0.0006 0.0006
application/x-msdownload 0.0002 0.0002 0.0002
application/x-netcdf 0.0003 0.0001 0.0002
application/x-rar-compressed 0.0000 0.0000 0.0000
application/x-research-info-systems 0.0092 0.0118 0.0117
application/x-shockwave-flash 0.0002 0.0002 0.0002
application/x-tar 0.0001 0.0002 0.0001
application/x-tex 0.0002 0.0003 0.0002
application/x-troff-man 0.0008 0.0008 0.0007
application/x-zip-compressed 0.0000 0.0000 0.0000
application/xhtml+xml 0.0090 0.0115 0.0116
application/xml 0.0238 0.0265 0.0542
application/zip 0.0001 0.0002 0.0001
audio/mp3 0.0000 0.0000 0.0000
audio/mpeg 0.0001 0.0001 0.0000
audio/x-mpegurl 0.0004 0.0006 0.0004
audio/x-scpls 0.0001 0.0001 0.0001
audio/x-wav 0.0000 0.0000 0.0000
binary/octet-stream 0.0006 0.0007 0.0007
image/gif 0.0005 0.0005 0.0005
image/jp2 0.0000 0.0000 0.0000
image/jpeg 0.0003 0.0004 0.0003
image/jpg 0.0000 0.0000 0.0000
image/pjpeg 0.0000 0.0000 0.0000
image/png 0.0001 0.0001 0.0001
image/svg+xml 0.0000 0.0000 0.0000
image/tiff 0.0000 0.0000 0.0000
image/vnd.djvu NaN NaN 0.0000
image/webp 0.0000 0.0000 0.0000
message/rfc822 0.0002 0.0003 0.0003
text/calendar 0.0247 0.0336 0.0300
text/css 0.0006 0.0007 0.0006
text/csv 0.0033 0.0039 0.0035
text/directory 0.0002 0.0003 0.0002
text/enriched 0.0001 0.0001 0.0001
text/html 98.8067 98.2824 98.5402
text/javascript 0.0002 0.0002 0.0002
text/pdf 0.0000 0.0000 0.0000
text/plain 0.0599 0.0717 0.0651
text/prs.lines.tag 0.0043 0.0041 0.0043
text/tab-separated-values 0.0002 0.0003 0.0003
text/turtle 0.0019 0.0022 0.0020
text/vcard 0.0007 0.0010 0.0011
text/x-bibtex 0.0006 0.0009 0.0009
text/x-c 0.0001 0.0002 0.0001
text/x-csrc 0.0002 0.0004 0.0003
text/x-diff 0.0003 0.0005 0.0005
text/x-patch 0.0004 0.0004 0.0004
text/x-perl 0.0000 0.0000 0.0001
text/x-vcalendar 0.0004 0.0005 0.0005
text/x-vcard 0.0019 0.0024 0.0020
text/xml 0.0556 0.0667 0.0649
unknown/unknown 0.0000 0.0000 0.0000
video/mp4 0.0000 0.0000 0.0000
video/webm 0.0000 0.0000 NaN
video/x-ms-asf 0.0002 0.0002 0.0002

mimetype_detected CC-MAIN-2024-38 CC-MAIN-2024-42 CC-MAIN-2024-46
<other> 0.0134 0.0153 0.0152
application/atom+xml 0.1018 0.1131 0.1012
application/epub+zip 0.0013 0.0021 0.0020
application/gpx+xml 0.0005 0.0008 0.0006
application/gzip 0.0000 0.0000 0.0000
application/javascript 0.0004 0.0005 0.0004
application/json 0.0264 0.0305 0.0329
application/marc 0.0013 0.0014 0.0014
application/mbox 0.0016 0.0020 0.0019
application/msword 0.0017 0.0027 0.0020
application/octet-stream 0.0137 0.0144 0.0152
application/pdf 0.6158 1.0364 0.7948
application/pgp-signature 0.0023 0.0044 0.0033
application/pkcs7-signature 0.0003 0.0005 0.0004
application/postscript 0.0003 0.0003 0.0003
application/rdf+xml 0.0090 0.0099 0.0088
application/rss+xml 0.0790 0.0909 0.0837
application/rtf 0.0013 0.0014 0.0012
application/text 0.0000 0.0000 0.0000
application/vnd.android.package-archive NaN 0.0000 0.0000
application/vnd.google-earth.kml+xml 0.0023 0.0025 0.0030
application/vnd.google-earth.kmz 0.0006 0.0009 0.0008
application/vnd.ms-excel 0.0012 0.0011 0.0010
application/vnd.ms-powerpoint 0.0003 0.0004 0.0004
application/vnd.oasis.opendocument.spreadsheet 0.0003 0.0006 0.0004
application/vnd.oasis.opendocument.text 0.0006 0.0009 0.0008
application/vnd.openxmlformats-officedocument.presentationml.presentation 0.0002 0.0003 0.0002
application/vnd.openxmlformats-officedocument.spreadsheetml.sheet 0.0012 0.0017 0.0014
application/vnd.openxmlformats-officedocument.wordprocessingml.document 0.0018 0.0027 0.0024
application/x-bibtex-text-file 0.0096 0.0120 0.0110
application/x-bittorrent 0.0002 0.0003 0.0003
application/x-dosexec 0.0000 0.0000 NaN
application/x-endnote-refer 0.0015 0.0020 0.0019
application/x-mobipocket-ebook 0.0005 0.0008 0.0007
application/x-ms-asx 0.0001 0.0002 0.0001
application/x-msdownload 0.0000 0.0000 0.0000
application/x-pds 0.0022 0.0022 0.0019
application/x-rar-compressed 0.0000 0.0000 NaN
application/x-research-info-systems 0.0002 0.0002 0.0001
application/x-sh 0.0010 0.0011 0.0011
application/x-shockwave-flash 0.0002 0.0002 0.0002
application/x-stata-do 0.0004 0.0004 0.0004
application/x-tex 0.0004 0.0004 0.0003
application/x-tika-msoffice 0.0021 0.0027 0.0020
application/x-tika-ooxml 0.0024 0.0025 0.0021
application/x-wais-source 0.0003 0.0003 0.0004
application/x-xz 0.0000 NaN NaN
application/xhtml+xml 8.6992 9.1678 8.9008
application/xml 0.0584 0.0667 0.0955
application/zip 0.0000 0.0000 0.0000
application/zlib 0.0001 0.0001 0.0001
audio/mpeg 0.0000 0.0000 0.0000
audio/vnd.wave 0.0000 0.0000 0.0000
audio/x-mpegurl 0.0000 0.0000 0.0000
image/gif 0.0000 0.0000 0.0000
image/jpeg 0.0002 0.0003 0.0002
image/png 0.0000 0.0000 0.0000
image/svg+xml 0.0000 0.0000 0.0000
image/tiff NaN 0.0000 NaN
image/vnd.djvu NaN 0.0000 NaN
image/webp 0.0000 0.0000 0.0000
message/rfc822 0.0008 0.0008 0.0009
text/asp 0.0000 0.0000 0.0000
text/aspdotnet 0.0000 0.0000 0.0000
text/calendar 0.0289 0.0400 0.0353
text/css 0.0006 0.0007 0.0005
text/csv 0.0033 0.0039 0.0035
text/html 90.1584 89.1643 89.6775
text/plain 0.1220 0.1595 0.1566
text/prs.lines.tag 0.0082 0.0093 0.0091
text/tab-separated-values 0.0002 0.0003 0.0004
text/troff 0.0007 0.0008 0.0008
text/turtle 0.0019 0.0022 0.0020
text/vtt 0.0006 0.0007 0.0008
text/x-c++src 0.0002 0.0002 0.0003
text/x-chdr 0.0004 0.0006 0.0005
text/x-csrc 0.0007 0.0009 0.0009
text/x-diff 0.0012 0.0014 0.0014
text/x-jsp 0.0001 0.0001 0.0001
text/x-log 0.0028 0.0036 0.0021
text/x-matlab 0.0022 0.0026 0.0023
text/x-perl 0.0016 0.0017 0.0015
text/x-php 0.0031 0.0028 0.0034
text/x-python 0.0003 0.0004 0.0004
text/x-vcalendar 0.0004 0.0005 0.0005
text/x-vcard 0.0030 0.0041 0.0036
text/x-web-markdown 0.0004 0.0005 0.0005
text/x-yaml 0.0003 0.0004 0.0004
video/mp4 0.0000 0.0000 0.0000
video/quicktime 0.0000 0.0000 NaN
video/webm 0.0000 0.0000 NaN
video/x-m4v 0.0000 0.0000 NaN
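
One way to read the two tables together is to compare the share that servers label as text/html with the combined share that content detection assigns to text/html and application/xhtml+xml. The sketch below does this arithmetic with values copied from the tables above. The variable and function names are illustrative, and treating the two detected types as the counterpart of the header value is an assumption made for the example, not a statement from the data provider.

```python
# Percentages copied from the two tables above, keyed by crawl.
HEADER_HTML = {
    "CC-MAIN-2024-38": 98.8067,
    "CC-MAIN-2024-42": 98.2824,
    "CC-MAIN-2024-46": 98.5402,
}
DETECTED_HTML = {
    "CC-MAIN-2024-38": 90.1584,
    "CC-MAIN-2024-42": 89.1643,
    "CC-MAIN-2024-46": 89.6775,
}
DETECTED_XHTML = {
    "CC-MAIN-2024-38": 8.6992,
    "CC-MAIN-2024-42": 9.1678,
    "CC-MAIN-2024-46": 8.9008,
}


def detected_markup_share(crawl):
    """Combined detected share of text/html and application/xhtml+xml."""
    return DETECTED_HTML[crawl] + DETECTED_XHTML[crawl]


if __name__ == "__main__":
    for crawl, header in HEADER_HTML.items():
        combined = detected_markup_share(crawl)
        print(
            f"{crawl}: header {header:.4f}, "
            f"detected {combined:.4f}, "
            f"difference {combined - header:+.4f}"
        )
```

Under that assumption, the two views differ by less than 0.06 percentage points in each of the three crawls, which suggests that most pages detected as application/xhtml+xml are served with a text/html Content-Type header.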