Number of pages, distribution of top-level domains, crawl overlaps, etc. - basic metrics about Common Crawl Monthly Crawl Archives
Latest crawl: CC-MAIN-2023-40
The crawled content is dominated by HTML pages and contains only a small percentage of other document formats. The tables show the percentage of the top 100 media or MIME types of the latest monthly crawls.
While the first table is based on the Content-Type HTTP header, the second uses the MIME type detected by Apache Tika from the actual content. The underlying data, including page counts, is provided in mimetypes.csv and mimetypes_detected.csv, respectively.
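As a rough illustration of how percentages like those in the tables can be derived from page counts, here is a minimal sketch. The column names (crawl, mimetype, pages), the crawl labels, and the counts are assumptions made up for this example; they are not taken from mimetypes.csv, whose actual layout should be checked before adapting the code.

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample in an ASSUMED layout (crawl, mimetype, pages).
# The counts are invented for illustration and are not real crawl data.
SAMPLE_CSV = """crawl,mimetype,pages
CRAWL-A,text/html,980
CRAWL-A,application/pdf,15
CRAWL-A,image/jpeg,5
CRAWL-B,text/html,1900
CRAWL-B,application/pdf,100
"""


def percentage_shares(rows):
    """Return {crawl: {mimetype: percent of that crawl's pages}}."""
    counts = defaultdict(dict)
    for row in rows:
        counts[row["crawl"]][row["mimetype"]] = int(row["pages"])

    shares = {}
    for crawl, by_type in counts.items():
        total = sum(by_type.values())
        # Each share is relative to the total page count of the same crawl.
        shares[crawl] = {
            mimetype: round(100.0 * pages / total, 4)
            for mimetype, pages in by_type.items()
        }
    return shares


shares = percentage_shares(csv.DictReader(io.StringIO(SAMPLE_CSV)))
for crawl, by_type in shares.items():
    for mimetype, pct in sorted(by_type.items()):
        print(f"{crawl}  {mimetype:<20} {pct:8.4f}")
```

To run this against a real file, replace the in-memory sample with an opened file handle and adjust the column names to whatever the CSV header actually contains.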
mimetype | CC-MAIN-2023-14 (%) | CC-MAIN-2023-23 (%) | CC-MAIN-2023-40 (%) |
---|---|---|---|
<other> | 0.0752 | 0.0751 | 0.0718 |
application/atom+xml | 0.1097 | 0.0939 | 0.0994 |
application/download | 0.0030 | 0.0033 | 0.0030 |
application/epub+zip | 0.0014 | 0.0011 | 0.0010 |
application/force-download | 0.0063 | 0.0063 | 0.0063 |
application/javascript | 0.0003 | 0.0002 | 0.0003 |
application/json | 0.0231 | 0.0234 | 0.0209 |
application/ld+json | 0.0013 | 0.0014 | 0.0014 |
application/marc | 0.0003 | 0.0004 | 0.0003 |
application/msword | 0.0031 | 0.0032 | 0.0032 |
application/octet-stream | 0.0495 | 0.0434 | 0.0448 |
application/octetstream | 0.0004 | 0.0003 | 0.0003 |
application/pdf | 0.9179 | 0.8598 | 0.8661 |
application/pgp-encrypted | 0.0001 | 0.0001 | 0.0001 |
application/pgp-signature | 0.0027 | 0.0032 | 0.0023 |
application/postscript | 0.0002 | 0.0001 | 0.0002 |
application/rdf+xml | 0.0041 | 0.0040 | 0.0038 |
application/rss+xml | 0.0637 | 0.0649 | 0.0537 |
application/rtf | 0.0014 | 0.0014 | 0.0011 |
application/save-to-disk | 0.0000 | 0.0000 | 0.0000 |
application/text | 0.0008 | 0.0007 | 0.0006 |
application/unknown | 0.0003 | 0.0002 | 0.0002 |
application/vnd.android.package-archive | 0.0001 | 0.0001 | 0.0001 |
application/vnd.google-earth.kml+xml | 0.0024 | 0.0018 | 0.0025 |
application/vnd.google-earth.kmz | 0.0005 | 0.0004 | 0.0005 |
application/vnd.ms-excel | 0.0022 | 0.0019 | 0.0018 |
application/vnd.ms-powerpoint | 0.0007 | 0.0003 | 0.0003 |
application/vnd.ms-word | 0.0003 | 0.0003 | 0.0003 |
application/vnd.oasis.opendocument.text | 0.0009 | 0.0007 | 0.0008 |
application/vnd.openxmlformats-officedocument.presentationml.presentation | 0.0003 | 0.0003 | 0.0002 |
application/vnd.openxmlformats-officedocument.spreadsheetml.sheet | 0.0019 | 0.0016 | 0.0014 |
application/vnd.openxmlformats-officedocument.wordprocessingml.document | 0.0029 | 0.0023 | 0.0023 |
application/vnd.wap.xhtml+xml | 0.0006 | 0.0005 | 0.0002 |
application/x-bibtex | 0.0032 | 0.0037 | 0.0038 |
application/x-bittorrent | 0.0002 | 0.0002 | 0.0002 |
application/x-bzip2 | 0.0008 | 0.0001 | 0.0001 |
application/x-debian-package | 0.0000 | 0.0000 | 0.0000 |
application/x-download | 0.0014 | 0.0015 | 0.0015 |
application/x-endnote-refer | 0.0016 | 0.0018 | 0.0013 |
application/x-gzip | 0.0006 | 0.0004 | 0.0004 |
application/x-httpd-php | 0.0004 | 0.0004 | 0.0005 |
application/x-java-jnlp-file | 0.0001 | 0.0002 | 0.0002 |
application/x-javascript | 0.0002 | 0.0002 | 0.0001 |
application/x-json | 0.0000 | 0.0000 | 0.0000 |
application/x-mobipocket-ebook | 0.0002 | 0.0001 | 0.0002 |
application/x-msdownload | 0.0003 | 0.0002 | 0.0002 |
application/x-netcdf | 0.0027 | 0.0014 | 0.0014 |
application/x-rar-compressed | 0.0001 | 0.0001 | 0.0001 |
application/x-research-info-systems | 0.0075 | 0.0076 | 0.0075 |
application/x-shockwave-flash | 0.0002 | 0.0002 | 0.0002 |
application/x-tar | 0.0005 | 0.0002 | 0.0002 |
application/x-tex | 0.0001 | 0.0002 | 0.0002 |
application/x-troff-man | 0.0007 | 0.0007 | 0.0005 |
application/x-zip-compressed | 0.0006 | 0.0004 | 0.0003 |
application/xhtml+xml | 0.0235 | 0.0181 | 0.0165 |
application/xml | 0.0252 | 0.0250 | 0.0231 |
application/zip | 0.0033 | 0.0025 | 0.0023 |
audio/mp3 | 0.0001 | 0.0001 | 0.0001 |
audio/mp4 | 0.0000 | 0.0000 | 0.0000 |
audio/mpeg | 0.0024 | 0.0015 | 0.0014 |
audio/x-mpegurl | 0.0009 | 0.0009 | 0.0009 |
audio/x-pn-realaudio | 0.0000 | 0.0000 | 0.0000 |
audio/x-scpls | 0.0002 | 0.0001 | 0.0002 |
audio/x-wav | 0.0001 | 0.0001 | 0.0001 |
binary/octet-stream | 0.0007 | 0.0005 | 0.0007 |
image/gif | 0.0044 | 0.0044 | 0.0035 |
image/jp2 | 0.0001 | 0.0001 | 0.0000 |
image/jpeg | 0.1485 | 0.1574 | 0.1529 |
image/jpg | 0.0022 | 0.0018 | 0.0016 |
image/pjpeg | 0.0005 | 0.0005 | 0.0005 |
image/png | 0.0281 | 0.0312 | 0.0297 |
image/svg+xml | 0.0009 | 0.0009 | 0.0009 |
image/tiff | 0.0004 | 0.0002 | 0.0003 |
image/vnd.djvu | 0.0002 | 0.0002 | 0.0003 |
image/webp | 0.0022 | 0.0022 | 0.0023 |
message/rfc822 | 0.0002 | 0.0002 | 0.0002 |
text/calendar | 0.0290 | 0.0308 | 0.0290 |
text/css | 0.0006 | 0.0005 | 0.0005 |
text/csv | 0.0037 | 0.0037 | 0.0035 |
text/directory | 0.0003 | 0.0003 | 0.0002 |
text/enriched | 0.0003 | 0.0003 | 0.0002 |
text/html | 98.2865 | 98.3693 | 98.3908 |
text/javascript | 0.0002 | 0.0002 | 0.0002 |
text/pdf | 0.0000 | 0.0000 | 0.0000 |
text/plain | 0.0609 | 0.0587 | 0.0575 |
text/tab-separated-values | 0.0002 | 0.0002 | 0.0002 |
text/turtle | 0.0015 | 0.0015 | 0.0013 |
text/vcard | 0.0011 | 0.0011 | 0.0009 |
text/x-bibtex | 0.0006 | 0.0006 | 0.0006 |
text/x-c | 0.0002 | 0.0002 | 0.0001 |
text/x-csrc | 0.0005 | 0.0003 | 0.0002 |
text/x-diff | 0.0003 | 0.0004 | 0.0003 |
text/x-patch | 0.0001 | 0.0003 | 0.0002 |
text/x-perl | 0.0001 | 0.0000 | 0.0001 |
text/x-vcalendar | 0.0005 | 0.0005 | 0.0004 |
text/x-vcard | 0.0021 | 0.0022 | 0.0017 |
text/xml | 0.0670 | 0.0638 | 0.0640 |
unknown/unknown | 0.0000 | 0.0000 | 0.0000 |
video/mp4 | 0.0013 | 0.0007 | 0.0007 |
video/webm | 0.0001 | 0.0000 | 0.0000 |
video/x-ms-asf | 0.0001 | 0.0001 | 0.0001 |
mimetype_detected | CC-MAIN-2023-14 (%) | CC-MAIN-2023-23 (%) | CC-MAIN-2023-40 (%) |
---|---|---|---|
<other> | 0.0155 | 0.0127 | 0.0127 |
application/atom+xml | 0.1119 | 0.0965 | 0.1018 |
application/epub+zip | 0.0018 | 0.0016 | 0.0012 |
application/gpx+xml | 0.0005 | 0.0004 | 0.0004 |
application/gzip | 0.0021 | 0.0018 | 0.0014 |
application/javascript | 0.0006 | 0.0005 | 0.0006 |
application/json | 0.0228 | 0.0231 | 0.0202 |
application/marc | 0.0011 | 0.0016 | 0.0010 |
application/mbox | 0.0021 | 0.0019 | 0.0020 |
application/msword | 0.0022 | 0.0023 | 0.0022 |
application/octet-stream | 0.0214 | 0.0229 | 0.0321 |
application/pdf | 0.9236 | 0.8657 | 0.8619 |
application/pgp-signature | 0.0033 | 0.0035 | 0.0029 |
application/pkcs7-signature | 0.0004 | 0.0004 | 0.0005 |
application/postscript | 0.0001 | 0.0001 | 0.0002 |
application/rdf+xml | 0.0089 | 0.0096 | 0.0084 |
application/rss+xml | 0.1013 | 0.1015 | 0.0883 |
application/rtf | 0.0013 | 0.0015 | 0.0013 |
application/text | 0.0001 | 0.0002 | 0.0001 |
application/vnd.android.package-archive | 0.0001 | 0.0001 | 0.0001 |
application/vnd.google-earth.kml+xml | 0.0034 | 0.0028 | 0.0037 |
application/vnd.google-earth.kmz | 0.0005 | 0.0004 | 0.0005 |
application/vnd.ms-excel | 0.0014 | 0.0015 | 0.0014 |
application/vnd.ms-powerpoint | 0.0004 | 0.0003 | 0.0003 |
application/vnd.oasis.opendocument.text | 0.0010 | 0.0009 | 0.0009 |
application/vnd.openxmlformats-officedocument.presentationml.presentation | 0.0003 | 0.0003 | 0.0002 |
application/vnd.openxmlformats-officedocument.spreadsheetml.sheet | 0.0018 | 0.0015 | 0.0014 |
application/vnd.openxmlformats-officedocument.wordprocessingml.document | 0.0028 | 0.0022 | 0.0023 |
application/x-bibtex-text-file | 0.0080 | 0.0083 | 0.0089 |
application/x-bittorrent | 0.0002 | 0.0002 | 0.0002 |
application/x-bzip2 | 0.0009 | 0.0000 | 0.0000 |
application/x-debian-package | 0.0001 | 0.0000 | 0.0000 |
application/x-dosexec | 0.0001 | 0.0001 | 0.0001 |
application/x-endnote-refer | 0.0014 | 0.0015 | 0.0013 |
application/x-hdf | 0.0010 | 0.0008 | 0.0006 |
application/x-mobipocket-ebook | 0.0004 | 0.0003 | 0.0003 |
application/x-ms-asx | 0.0001 | 0.0001 | 0.0001 |
application/x-msdownload | 0.0002 | 0.0001 | 0.0001 |
application/x-rar-compressed | 0.0000 | 0.0000 | 0.0000 |
application/x-research-info-systems | 0.0002 | 0.0002 | 0.0002 |
application/x-sh | 0.0009 | 0.0010 | 0.0011 |
application/x-shockwave-flash | 0.0002 | 0.0003 | 0.0002 |
application/x-stata-do | 0.0011 | 0.0007 | 0.0005 |
application/x-tex | 0.0003 | 0.0003 | 0.0003 |
application/x-tika-msoffice | 0.0030 | 0.0022 | 0.0022 |
application/x-tika-ooxml | 0.0032 | 0.0028 | 0.0026 |
application/x-wais-source | 0.0003 | 0.0003 | 0.0002 |
application/x-xz | 0.0004 | 0.0007 | 0.0003 |
application/xhtml+xml | 11.0476 | 10.9793 | 9.9459 |
application/xml | 0.0618 | 0.0592 | 0.0577 |
application/zip | 0.0067 | 0.0054 | 0.0048 |
application/zlib | 0.0001 | 0.0002 | 0.0001 |
application/zstd | 0.0013 | 0.0000 | 0.0000 |
audio/midi | 0.0001 | 0.0001 | 0.0001 |
audio/mp4 | 0.0000 | 0.0000 | 0.0000 |
audio/mpeg | 0.0037 | 0.0025 | 0.0021 |
audio/vnd.wave | 0.0004 | 0.0004 | 0.0003 |
audio/x-mpegurl | 0.0002 | 0.0002 | 0.0001 |
image/bmp | 0.0001 | 0.0001 | 0.0001 |
image/gif | 0.0036 | 0.0036 | 0.0029 |
image/jpeg | 0.1560 | 0.1645 | 0.1594 |
image/png | 0.0288 | 0.0314 | 0.0301 |
image/svg+xml | 0.0010 | 0.0010 | 0.0010 |
image/tiff | 0.0005 | 0.0003 | 0.0003 |
image/vnd.djvu | 0.0004 | 0.0004 | 0.0004 |
image/vnd.dxf; format=ascii | 0.0002 | 0.0002 | 0.0002 |
image/vnd.microsoft.icon | 0.0000 | 0.0001 | 0.0001 |
image/webp | 0.0023 | 0.0023 | 0.0024 |
message/rfc822 | 0.0014 | 0.0023 | 0.0020 |
text/asp | 0.0000 | 0.0000 | 0.0000 |
text/aspdotnet | 0.0000 | 0.0000 | 0.0000 |
text/calendar | 0.0333 | 0.0353 | 0.0335 |
text/css | 0.0005 | 0.0005 | 0.0005 |
text/csv | 0.0034 | 0.0034 | 0.0031 |
text/html | 87.2520 | 87.3946 | 88.4515 |
text/plain | 0.1100 | 0.1084 | 0.1063 |
text/prs.lines.tag | 0.0042 | 0.0045 | 0.0044 |
text/tab-separated-values | 0.0003 | 0.0003 | 0.0002 |
text/troff | 0.0006 | 0.0007 | 0.0005 |
text/turtle | 0.0014 | 0.0014 | 0.0013 |
text/vtt | 0.0007 | 0.0008 | 0.0007 |
text/x-c++src | 0.0003 | 0.0003 | 0.0002 |
text/x-chdr | 0.0012 | 0.0007 | 0.0006 |
text/x-csrc | 0.0018 | 0.0012 | 0.0010 |
text/x-diff | 0.0010 | 0.0014 | 0.0013 |
text/x-jsp | 0.0001 | 0.0002 | 0.0001 |
text/x-log | 0.0037 | 0.0019 | 0.0034 |
text/x-matlab | 0.0018 | 0.0025 | 0.0021 |
text/x-perl | 0.0018 | 0.0019 | 0.0020 |
text/x-php | 0.0038 | 0.0033 | 0.0030 |
text/x-python | 0.0003 | 0.0003 | 0.0003 |
text/x-vcalendar | 0.0006 | 0.0006 | 0.0005 |
text/x-vcard | 0.0038 | 0.0038 | 0.0032 |
text/x-web-markdown | 0.0004 | 0.0004 | 0.0003 |
video/mp4 | 0.0014 | 0.0008 | 0.0008 |
video/quicktime | 0.0002 | 0.0001 | 0.0001 |
video/webm | 0.0001 | 0.0000 | 0.0000 |
video/x-m4v | 0.0000 | 0.0000 | 0.0000 |
video/x-matroska | 0.0000 | 0.0000 | 0.0000 |
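The two tables can be compared row by row to see where the Content-Type header and the Tika detection disagree. The sketch below does this for a handful of MIME types, using the CC-MAIN-2023-40 percentages copied from the tables above; the selection of types and the helper names are choices made for this example only. It suggests that the combined share of text/html and application/xhtml+xml is nearly the same in both tables, so most of the gap in text/html appears to come from pages that are served as text/html but detected as XHTML.

```python
# CC-MAIN-2023-40 percentages copied from the two tables above.
DECLARED = {  # based on the Content-Type HTTP header
    "text/html": 98.3908,
    "application/xhtml+xml": 0.0165,
    "application/pdf": 0.8661,
    "text/plain": 0.0575,
    "application/rss+xml": 0.0537,
}
DETECTED = {  # detected by Apache Tika from the content
    "text/html": 88.4515,
    "application/xhtml+xml": 9.9459,
    "application/pdf": 0.8619,
    "text/plain": 0.1063,
    "application/rss+xml": 0.0883,
}


def differences(declared, detected):
    """Return detected minus declared, in percentage points, per MIME type."""
    return {
        mimetype: round(detected[mimetype] - declared[mimetype], 4)
        for mimetype in declared
        if mimetype in detected
    }


def html_family(shares):
    """Combined share of text/html and application/xhtml+xml."""
    return round(shares["text/html"] + shares["application/xhtml+xml"], 4)


diffs = differences(DECLARED, DETECTED)
# Print the largest disagreements first.
for mimetype, delta in sorted(diffs.items(), key=lambda kv: -abs(kv[1])):
    print(f"{mimetype:<24} {delta:+9.4f}")

print("HTML family, declared:", html_family(DECLARED))
print("HTML family, detected:", html_family(DETECTED))
```

The same comparison can be extended to every MIME type that appears in both tables; types present in only one table (for example application/x-tika-ooxml) have no counterpart and are skipped by the membership check.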