Number of pages, distribution of top-level domains, crawl overlaps, etc. - basic metrics about Common Crawl Monthly Crawl Archives
Latest crawl: CC-MAIN-2026-17
The crawled content is dominated by HTML pages and contains only a small percentage of other document formats. The tables show, for each of the latest monthly crawls, the percentage of pages for each of the top 100 media (MIME) types.
The first table is based on the Content-Type HTTP header, while the second uses the MIME type detected by Apache Tika from the actual content. The underlying data, including page counts, is provided in mimetypes.csv and mimetypes_detected.csv, respectively.
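
The percentages in the tables can in principle be derived from per-crawl page counts. The following is a minimal sketch of that arithmetic, not the actual code that generates the tables. The column names `crawl`, `mimetype` and `pages`, the sample data, and the rule for folding low-ranked types into `<other>` are all assumptions made for illustration; check the real layout of mimetypes.csv before adapting it.

```python
import csv
import io
from collections import defaultdict


def mime_percentages(rows, top_n=100):
    """Return {crawl: {mimetype: percent of pages}} from count records.

    Each record is a mapping with the (assumed) keys "crawl", "mimetype"
    and "pages". Types ranked below top_n within a crawl are summed
    into a single "<other>" entry.
    """
    totals = defaultdict(int)
    counts = defaultdict(lambda: defaultdict(int))
    for row in rows:
        pages = int(row["pages"])
        counts[row["crawl"]][row["mimetype"]] += pages
        totals[row["crawl"]] += pages

    result = {}
    for crawl, by_type in counts.items():
        # Rank MIME types by page count, largest first.
        ranked = sorted(by_type.items(), key=lambda kv: kv[1], reverse=True)
        kept, rest = ranked[:top_n], ranked[top_n:]
        shares = {mime: 100.0 * n / totals[crawl] for mime, n in kept}
        if rest:
            shares["<other>"] = 100.0 * sum(n for _, n in rest) / totals[crawl]
        result[crawl] = shares
    return result


# Made-up sample in the assumed column layout (not real crawl data).
SAMPLE = """crawl,mimetype,pages
CC-EXAMPLE-1,text/html,980
CC-EXAMPLE-1,application/pdf,15
CC-EXAMPLE-1,text/plain,5
"""

shares = mime_percentages(csv.DictReader(io.StringIO(SAMPLE)), top_n=2)
for mime, pct in shares["CC-EXAMPLE-1"].items():
    print(f"{mime}: {pct:.4f}")
```

To run the same function on a real file, pass `csv.DictReader(open(path, newline=""))` instead of the in-memory sample, after adjusting the column names to match the file.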

MIME types according to the Content-Type HTTP header (percentage of pages per crawl):

| mimetype | CC-MAIN-2026-08 | CC-MAIN-2026-12 | CC-MAIN-2026-17 |
|---|---|---|---|
| \<other\> | 0.0642 | 0.0646 | 0.0610 |
| application/atom+xml | 0.1461 | 0.1511 | 0.1469 |
| application/calendar | 0.0003 | 0.0002 | 0.0002 |
| application/download | 0.0020 | 0.0013 | 0.0018 |
| application/epub+zip | 0.0017 | 0.0019 | 0.0015 |
| application/force-download | 0.0089 | 0.0075 | 0.0073 |
| application/gpx+xml | 0.0008 | 0.0008 | 0.0009 |
| application/ics | 0.0004 | 0.0004 | 0.0004 |
| application/javascript | 0.0008 | 0.0008 | 0.0008 |
| application/json | 0.0262 | 0.0256 | 0.0251 |
| application/ld+json | 0.0015 | 0.0014 | 0.0015 |
| application/marc | 0.0003 | 0.0003 | 0.0003 |
| application/msword | 0.0025 | 0.0023 | 0.0025 |
| application/octet-stream | 0.0449 | 0.0487 | 0.0469 |
| application/octetstream | 0.0002 | 0.0002 | 0.0002 |
| application/pdf | 0.7330 | 0.7988 | 0.8655 |
| application/pgp-encrypted | 0.0001 | 0.0001 | 0.0001 |
| application/pgp-signature | 0.0014 | 0.0016 | 0.0020 |
| application/postscript | 0.0002 | 0.0002 | 0.0002 |
| application/rdf+xml | 0.0038 | 0.0036 | 0.0031 |
| application/rss+xml | 0.0430 | 0.0439 | 0.0418 |
| application/rtf | 0.0008 | 0.0007 | 0.0005 |
| application/save-to-disk | 0.0000 | 0.0000 | 0.0000 |
| application/text | 0.0001 | 0.0001 | 0.0001 |
| application/unknown | 0.0002 | 0.0002 | 0.0002 |
| application/vnd.android.package-archive | 0.0000 | 0.0000 | 0.0000 |
| application/vnd.google-earth.kml+xml | 0.0012 | 0.0012 | 0.0011 |
| application/vnd.google-earth.kmz | 0.0003 | 0.0005 | 0.0005 |
| application/vnd.ms-excel | 0.0014 | 0.0015 | 0.0012 |
| application/vnd.ms-powerpoint | 0.0001 | 0.0001 | 0.0002 |
| application/vnd.ms-word | 0.0002 | 0.0001 | 0.0001 |
| application/vnd.oasis.opendocument.text | 0.0007 | 0.0008 | 0.0007 |
| application/vnd.openxmlformats-officedocument.presentationml.presentation | 0.0002 | 0.0003 | 0.0002 |
| application/vnd.openxmlformats-officedocument.spreadsheetml.sheet | 0.0019 | 0.0017 | 0.0016 |
| application/vnd.openxmlformats-officedocument.wordprocessingml.document | 0.0024 | 0.0028 | 0.0028 |
| application/vnd.wap.xhtml+xml | 0.0005 | 0.0010 | 0.0009 |
| application/x-bibtex | 0.0086 | 0.0090 | 0.0088 |
| application/x-bittorrent | 0.0001 | 0.0001 | 0.0001 |
| application/x-debian-package | 0.0000 | 0.0000 | 0.0000 |
| application/x-download | 0.0026 | 0.0025 | 0.0022 |
| application/x-endnote-refer | 0.0004 | 0.0004 | 0.0003 |
| application/x-gzip | 0.0003 | 0.0002 | 0.0002 |
| application/x-httpd-php | 0.0003 | 0.0003 | 0.0003 |
| application/x-java-jnlp-file | 0.0001 | 0.0001 | 0.0001 |
| application/x-javascript | 0.0002 | 0.0002 | 0.0001 |
| application/x-json | 0.0000 | 0.0000 | 0.0000 |
| application/x-mobipocket-ebook | 0.0005 | 0.0007 | 0.0004 |
| application/x-msdownload | 0.0003 | 0.0003 | 0.0003 |
| application/x-netcdf | 0.0001 | 0.0003 | 0.0004 |
| application/x-research-info-systems | 0.0115 | 0.0117 | 0.0116 |
| application/x-shockwave-flash | 0.0001 | 0.0001 | 0.0000 |
| application/x-tar | 0.0000 | 0.0001 | 0.0001 |
| application/x-tex | 0.0001 | 0.0002 | 0.0001 |
| application/x-troff-man | 0.0004 | 0.0003 | 0.0004 |
| application/x-zip-compressed | 0.0000 | 0.0000 | 0.0000 |
| application/xhtml+xml | 0.0111 | 0.0109 | 0.0104 |
| application/xml | 0.0291 | 0.0266 | 0.0254 |
| application/zip | 0.0001 | 0.0001 | 0.0002 |
| audio/mpeg | 0.0001 | 0.0001 | 0.0001 |
| audio/x-mpegurl | 0.0005 | 0.0006 | 0.0006 |
| audio/x-scpls | 0.0002 | 0.0001 | 0.0001 |
| audio/x-wav | 0.0000 | 0.0000 | 0.0000 |
| binary/octet-stream | 0.0005 | 0.0005 | 0.0005 |
| image/gif | 0.0004 | 0.0003 | 0.0002 |
| image/jp2 | NaN | 0.0000 | NaN |
| image/jpeg | 0.0003 | 0.0004 | 0.0002 |
| image/jpg | 0.0000 | 0.0000 | 0.0000 |
| image/pjpeg | 0.0000 | 0.0000 | 0.0000 |
| image/png | 0.0002 | 0.0004 | 0.0001 |
| image/svg+xml | 0.0000 | 0.0000 | 0.0000 |
| image/tiff | 0.0000 | 0.0000 | 0.0000 |
| image/vnd.djvu | 0.0006 | 0.0003 | 0.0005 |
| image/webp | 0.0000 | 0.0000 | 0.0000 |
| message/rfc822 | 0.0002 | 0.0003 | 0.0002 |
| text/calendar | 0.0374 | 0.0383 | 0.0354 |
| text/css | 0.0006 | 0.0006 | 0.0006 |
| text/csv | 0.0034 | 0.0032 | 0.0030 |
| text/directory | 0.0002 | 0.0002 | 0.0002 |
| text/enriched | 0.0000 | 0.0000 | 0.0000 |
| text/html | 98.6720 | 98.5946 | 98.5388 |
| text/javascript | 0.0004 | 0.0004 | 0.0004 |
| text/markdown | 0.0013 | 0.0034 | 0.0298 |
| text/pdf | 0.0001 | 0.0000 | 0.0000 |
| text/plain | 0.0521 | 0.0541 | 0.0485 |
| text/prs.lines.tag | 0.0025 | 0.0039 | 0.0030 |
| text/tab-separated-values | 0.0005 | 0.0004 | 0.0004 |
| text/turtle | 0.0014 | 0.0012 | 0.0010 |
| text/vcard | 0.0011 | 0.0012 | 0.0012 |
| text/x-bibtex | 0.0006 | 0.0005 | 0.0005 |
| text/x-c | 0.0001 | 0.0002 | 0.0001 |
| text/x-csrc | 0.0001 | 0.0001 | 0.0001 |
| text/x-diff | 0.0002 | 0.0003 | 0.0003 |
| text/x-patch | 0.0003 | 0.0003 | 0.0002 |
| text/x-perl | 0.0000 | 0.0001 | 0.0000 |
| text/x-vcalendar | 0.0003 | 0.0004 | 0.0004 |
| text/x-vcard | 0.0020 | 0.0019 | 0.0019 |
| text/xml | 0.0622 | 0.0613 | 0.0496 |
| unknown/unknown | 0.0000 | 0.0000 | 0.0000 |
| video/mp4 | 0.0000 | 0.0000 | 0.0000 |
| video/webm | 0.0000 | NaN | 0.0000 |
| video/x-ms-asf | 0.0001 | 0.0002 | 0.0002 |

MIME types detected by Apache Tika (percentage of pages per crawl):

| mimetype_detected | CC-MAIN-2026-08 | CC-MAIN-2026-12 | CC-MAIN-2026-17 |
|---|---|---|---|
| \<other\> | 0.0114 | 0.0122 | 0.0112 |
| application/atom+xml | 0.1476 | 0.1527 | 0.1486 |
| application/epub+zip | 0.0020 | 0.0022 | 0.0018 |
| application/gpx+xml | 0.0008 | 0.0008 | 0.0009 |
| application/gzip | 0.0000 | 0.0000 | 0.0000 |
| application/javascript | 0.0010 | 0.0011 | 0.0011 |
| application/json | 0.0262 | 0.0258 | 0.0252 |
| application/marc | 0.0007 | 0.0007 | 0.0007 |
| application/mbox | 0.0012 | 0.0014 | 0.0011 |
| application/msword | 0.0018 | 0.0017 | 0.0018 |
| application/octet-stream | 0.0164 | 0.0155 | 0.0139 |
| application/pdf | 0.7460 | 0.8127 | 0.8794 |
| application/pgp-signature | 0.0011 | 0.0016 | 0.0020 |
| application/pkcs7-signature | 0.0003 | 0.0004 | 0.0004 |
| application/postscript | 0.0002 | 0.0002 | 0.0001 |
| application/rdf+xml | 0.0081 | 0.0074 | 0.0063 |
| application/rss+xml | 0.0681 | 0.0684 | 0.0660 |
| application/rtf | 0.0013 | 0.0013 | 0.0009 |
| application/text | 0.0000 | 0.0000 | 0.0000 |
| application/vnd.android.package-archive | 0.0000 | 0.0000 | 0.0000 |
| application/vnd.google-earth.kml+xml | 0.0024 | 0.0025 | 0.0024 |
| application/vnd.google-earth.kmz | 0.0003 | 0.0005 | 0.0005 |
| application/vnd.ms-excel | 0.0008 | 0.0007 | 0.0007 |
| application/vnd.ms-powerpoint | 0.0001 | 0.0001 | 0.0002 |
| application/vnd.oasis.opendocument.spreadsheet | 0.0004 | 0.0004 | 0.0003 |
| application/vnd.oasis.opendocument.text | 0.0008 | 0.0009 | 0.0009 |
| application/vnd.openxmlformats-officedocument.presentationml.presentation | 0.0002 | 0.0003 | 0.0002 |
| application/vnd.openxmlformats-officedocument.spreadsheetml.sheet | 0.0018 | 0.0017 | 0.0016 |
| application/vnd.openxmlformats-officedocument.wordprocessingml.document | 0.0025 | 0.0028 | 0.0028 |
| application/x-bibtex-text-file | 0.0143 | 0.0143 | 0.0127 |
| application/x-bittorrent | 0.0002 | 0.0002 | 0.0002 |
| application/x-bzip2 | NaN | 0.0000 | NaN |
| application/x-endnote-refer | 0.0016 | 0.0016 | 0.0014 |
| application/x-mobipocket-ebook | 0.0007 | 0.0008 | 0.0006 |
| application/x-ms-asx | 0.0001 | 0.0002 | 0.0002 |
| application/x-msdownload | 0.0000 | 0.0000 | 0.0000 |
| application/x-pds | 0.0012 | 0.0012 | 0.0012 |
| application/x-rar-compressed | 0.0000 | NaN | 0.0000 |
| application/x-research-info-systems | 0.0000 | 0.0000 | 0.0000 |
| application/x-sh | 0.0007 | 0.0010 | 0.0011 |
| application/x-shockwave-flash | 0.0001 | 0.0001 | 0.0000 |
| application/x-stata-do | 0.0005 | 0.0004 | 0.0004 |
| application/x-tex | 0.0003 | 0.0003 | 0.0003 |
| application/x-tex-tfm | 0.0003 | 0.0003 | 0.0004 |
| application/x-tika-msoffice | 0.0023 | 0.0028 | 0.0024 |
| application/x-tika-ooxml | 0.0017 | 0.0018 | 0.0019 |
| application/x-wais-source | 0.0001 | 0.0001 | 0.0001 |
| application/xhtml+xml | 8.4156 | 8.6282 | 8.1569 |
| application/xml | 0.0714 | 0.0697 | 0.0584 |
| application/zip | 0.0000 | 0.0000 | 0.0000 |
| application/zlib | 0.0001 | 0.0002 | 0.0003 |
| audio/mp4 | NaN | 0.0000 | 0.0000 |
| audio/mpeg | 0.0000 | 0.0001 | 0.0000 |
| audio/vnd.wave | 0.0000 | 0.0000 | 0.0000 |
| audio/x-mpegurl | 0.0000 | 0.0000 | 0.0000 |
| image/gif | 0.0000 | 0.0000 | 0.0000 |
| image/jpeg | 0.0001 | 0.0002 | 0.0001 |
| image/png | 0.0000 | 0.0000 | 0.0000 |
| image/svg+xml | 0.0000 | 0.0000 | 0.0000 |
| image/tiff | NaN | NaN | 0.0000 |
| image/vnd.djvu | 0.0008 | 0.0004 | 0.0007 |
| image/webp | 0.0000 | 0.0000 | 0.0000 |
| message/rfc822 | 0.0008 | 0.0007 | 0.0007 |
| text/asp | 0.0000 | 0.0000 | NaN |
| text/aspdotnet | NaN | 0.0000 | NaN |
| text/calendar | 0.0449 | 0.0456 | 0.0424 |
| text/css | 0.0006 | 0.0006 | 0.0006 |
| text/csv | 0.0034 | 0.0032 | 0.0030 |
| text/html | 90.2830 | 89.9924 | 90.4072 |
| text/markdown | 0.0003 | 0.0024 | 0.0284 |
| text/plain | 0.0871 | 0.0898 | 0.0832 |
| text/prs.lines.tag | 0.0049 | 0.0067 | 0.0054 |
| text/tab-separated-values | 0.0005 | 0.0004 | 0.0004 |
| text/troff | 0.0005 | 0.0005 | 0.0005 |
| text/turtle | 0.0014 | 0.0012 | 0.0010 |
| text/vtt | 0.0009 | 0.0008 | 0.0008 |
| text/x-c++src | 0.0002 | 0.0002 | 0.0001 |
| text/x-chdr | 0.0006 | 0.0006 | 0.0005 |
| text/x-csrc | 0.0005 | 0.0007 | 0.0006 |
| text/x-diff | 0.0011 | 0.0010 | 0.0009 |
| text/x-jsp | 0.0001 | 0.0001 | 0.0001 |
| text/x-log | 0.0017 | 0.0016 | 0.0016 |
| text/x-matlab | 0.0010 | 0.0010 | 0.0013 |
| text/x-perl | 0.0012 | 0.0013 | 0.0008 |
| text/x-php | 0.0032 | 0.0033 | 0.0034 |
| text/x-python | 0.0003 | 0.0004 | 0.0003 |
| text/x-vcalendar | 0.0004 | 0.0004 | 0.0004 |
| text/x-vcard | 0.0036 | 0.0037 | 0.0036 |
| text/x-web-markdown | 0.0014 | 0.0013 | 0.0019 |
| text/x-yaml | 0.0003 | 0.0004 | 0.0004 |
| video/mp4 | 0.0000 | 0.0000 | 0.0000 |
| video/quicktime | 0.0000 | NaN | 0.0000 |
| video/webm | NaN | NaN | 0.0000 |
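
As a small worked example of reading the two tables side by side, the sketch below adds up the text/html and application/xhtml+xml shares for CC-MAIN-2026-17 in each table. The four numbers are copied from the tables above; the helper name `html_family_share` is hypothetical. The near-equal sums suggest, without proving it, that most pages Tika detects as XHTML were served with a text/html Content-Type header.

```python
# CC-MAIN-2026-17 percentages copied from the two tables above.
header = {"text/html": 98.5388, "application/xhtml+xml": 0.0104}
detected = {"text/html": 90.4072, "application/xhtml+xml": 8.1569}


def html_family_share(shares):
    """Combined percentage of HTML and XHTML pages (hypothetical helper)."""
    return round(shares["text/html"] + shares["application/xhtml+xml"], 4)


print("Content-Type header:", html_family_share(header))    # 98.5492
print("Detected by Tika:   ", html_family_share(detected))  # 98.5641
```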