Description

This webpage presents operational statistics derived from Common Crawl's Web Graph releases, which show the structure and connectivity of the web as captured in the crawl releases. The data consists of host- and domain-level graphs, where hostnames are formatted in reverse domain name notation. These graphs include all types of links, such as those pointing to images, JS libraries, web fonts, and so on. However, only hostnames with valid IANA top-level domains are considered, excluding URLs that use IP addresses as host components.
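
As a rough illustration of reverse domain name notation, the following sketch flips the labels of a hostname. The helper name `reverse_host` and the lowercasing step are assumptions made for this example; they are not taken from the Common Crawl tooling.

```python
def reverse_host(hostname: str) -> str:
    """Reverse the dot-separated labels: 'www.example.com' -> 'com.example.www'."""
    # Lowercasing and stripping a trailing dot are assumptions of this sketch.
    labels = hostname.strip().strip(".").lower().split(".")
    return ".".join(reversed(labels))


if __name__ == "__main__":
    hosts = ["www.example.com", "blog.example.com", "commoncrawl.org"]
    # Sorting the reversed names places hosts of the same domain next to each other.
    for name in sorted(reverse_host(h) for h in hosts):
        print(name)
```

Applying the same function to a reversed name restores the usual order, so `reverse_host("org.commoncrawl")` gives `commoncrawl.org`.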

The domain-level graphs are constructed by aggregating host-level data at the pay-level domain (PLD) level, using the public suffix list maintained on publicsuffix.org. This methodology provides a comprehensive view of the web's hierarchical structure, which is useful for research in areas like ranking algorithms, graph analysis, and link spam detection.
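
The aggregation step can be sketched as follows. This is a simplified illustration, not the project's implementation: `TOY_SUFFIXES` is a tiny hypothetical stand-in for the public suffix list, the wildcard and exception rules of the real list are ignored, and the helper names are made up for this example. Whether links inside a single domain are kept is left open here; the sketch keeps them.

```python
# Tiny hypothetical subset of public suffixes (assumption for illustration only).
TOY_SUFFIXES = {"com", "org", "uk", "co.uk"}


def pay_level_domain(hostname, suffixes=TOY_SUFFIXES):
    """Return the public suffix plus one more label, or None if that is not possible."""
    labels = hostname.lower().strip(".").split(".")
    # Candidates are tried from longest to shortest, so "co.uk" wins over "uk".
    for i in range(len(labels)):
        if ".".join(labels[i:]) in suffixes:
            if i == 0:
                return None  # the hostname is itself a public suffix
            return ".".join(labels[i - 1:])
    return None  # no known suffix matched


def aggregate_links(host_links, suffixes=TOY_SUFFIXES):
    """Collapse host-level (source, target) links into distinct domain-level arcs."""
    arcs = set()
    for src, dst in host_links:
        src_pld = pay_level_domain(src, suffixes)
        dst_pld = pay_level_domain(dst, suffixes)
        if src_pld and dst_pld:
            arcs.add((src_pld, dst_pld))
    return arcs


if __name__ == "__main__":
    links = [("a.example.com", "www.example.org"), ("b.example.com", "example.org")]
    print(aggregate_links(links))
```

Both host-level links in the usage example collapse into a single arc from `example.com` to `example.org`.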

For those interested in exploring the Web Graphs, we provide tools and instructions through the cc-webgraph project on GitHub. We also have a Jupyter Notebook in our cc-notebooks repository. Some related papers on how these Web Graphs can be used can be found in the Related Reading section below. Additionally, the list of graph releases is accessible via the graphinfo.json file.

The top ten highest ranked hosts and domains from each release are listed below. Ranks are derived from Harmonic Centrality, and we also show PageRank for comparison.

Top 1000 Ranks

These ranks can be found by running the following:

INDEX_URL="https://index.commoncrawl.org/graphinfo.json"
DATA_BASE_URL="https://data.commoncrawl.org/projects/hyperlinkgraph"

GRAPH_LEVEL="domain"  # "domain" or "host"
RESULTS=1000  # how many results to retrieve

GRAPH_RELEASE="$(curl -fsSL "$INDEX_URL" | jq -r '.[0].id')"
# ... or a specific release e.g. "cc-main-2025-26-dec-jan-feb"

curl -fsSL \
  "$DATA_BASE_URL/$GRAPH_RELEASE/$GRAPH_LEVEL/$GRAPH_RELEASE-$GRAPH_LEVEL-ranks.txt.gz" \
  2>/dev/null | gzip -dc | head -n "$((RESULTS + 1))"

These rank files are multiple GiB each, so we stream the download through gzip -dc (equivalent to zcat or gunzip -c) and use head to peek at the first lines without downloading the whole file.
Note: head can stop the stream early, but tail cannot, because it has to read the whole gzipped stream before it reaches the last lines.
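
For readers who prefer Python, the same idea might look roughly like the sketch below. The two URLs and the file name pattern are copied from the shell command above; the helper names `ranks_url`, `head_gzip`, and `fetch_top_ranks` are hypothetical. The `results + 1` mirrors the shell command, where the extra line presumably accounts for a header row. Only `fetch_top_ranks` touches the network, and it is not called in the demo.

```python
import gzip
import io
import json
import urllib.request
from itertools import islice

INDEX_URL = "https://index.commoncrawl.org/graphinfo.json"
DATA_BASE_URL = "https://data.commoncrawl.org/projects/hyperlinkgraph"


def ranks_url(release, level="domain"):
    """Build the ranks file URL with the same pattern as the shell command."""
    return f"{DATA_BASE_URL}/{release}/{level}/{release}-{level}-ranks.txt.gz"


def head_gzip(fileobj, n_lines):
    """Return the first n_lines text lines of a gzip stream, then stop reading."""
    with gzip.GzipFile(fileobj=fileobj) as gz:
        return [raw.decode("utf-8").rstrip("\n") for raw in islice(gz, n_lines)]


def fetch_top_ranks(level="domain", results=1000):
    """Network-dependent: first lines of the ranks file for the first listed release."""
    with urllib.request.urlopen(INDEX_URL) as resp:
        release = json.load(resp)[0]["id"]  # mirrors jq -r '.[0].id'
    with urllib.request.urlopen(ranks_url(release, level)) as resp:
        return head_gzip(resp, results + 1)


if __name__ == "__main__":
    # Offline demo on an in-memory gzip stream with made-up content.
    sample = io.BytesIO(gzip.compress(b"header\nrow1\nrow2\nrow3\n"))
    print(head_gzip(sample, 2))
```

Because the decompressor only pulls as many compressed bytes as it needs, closing the response after the first lines avoids transferring the rest of the file, much like head closing the pipe.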


What Are These Ranks?

Harmonic Centrality considers how close a node is to others, directly or indirectly. The closer a node is to others, the higher its score. It's based on proximity, not the importance or behaviour of neighbours. We calculate this with HyperBall. For more details, see the paper Axioms for Centrality by Boldi and Vigna (2013) and the talk A modern view of centrality measures.

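Harmonic Centrality: H(v) = sum of 1/d(v,u) for all u not equal to v, where d(v,u) is the shortest-path distance between v and u, and unreachable nodes contribute 0.
PageRank: PR(v) = sum of PR(u)/L(u) for all u in backlinks of v, where L(u) is the number of outgoing links of u. This simplified form leaves out the damping factor.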
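
To make the Harmonic Centrality formula above concrete, here is a minimal sketch that evaluates it exactly with breadth-first search on a toy graph. It is not how the published ranks are computed: HyperBall approximates these sums so that they scale to the full Web Graph. The dict-of-lists graph format and the function name are assumptions, and so is the direction of the distances, which are taken along links pointing toward v as in Boldi and Vigna's definition.

```python
from collections import deque


def harmonic_centrality(graph):
    """Sum 1/distance over all other nodes that can reach v, for every node v.

    `graph` maps each node to the list of nodes it links to.
    Nodes that cannot reach v contribute 0 to its score.
    """
    nodes = set(graph)
    for targets in graph.values():
        nodes.update(targets)
    # Transposed adjacency: a BFS from v then walks links backwards.
    incoming = {v: [] for v in nodes}
    for src, targets in graph.items():
        for dst in targets:
            incoming[dst].append(src)

    scores = {}
    for v in nodes:
        dist = {v: 0}
        queue = deque([v])
        while queue:
            node = queue.popleft()
            for pred in incoming[node]:
                if pred not in dist:
                    dist[pred] = dist[node] + 1
                    queue.append(pred)
        scores[v] = sum(1.0 / d for u, d in dist.items() if u != v)
    return scores


if __name__ == "__main__":
    toy = {"a": ["hub"], "b": ["hub"], "c": ["a"], "hub": []}
    print(harmonic_centrality(toy))
```

In the toy graph, `hub` is one step away from `a` and `b` and two steps away from `c`, so its score is 1 + 1 + 1/2 = 2.5.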

With PageRank, each node's score depends on how many important nodes link to it, and how those nodes distribute their importance. We calculate this with PageRankParallelGaussSeidel.
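
The idea can be illustrated with a small power-iteration sketch. Note that this swaps in plain power iteration for the Gauss-Seidel solver named above, and that the damping factor of 0.85, the handling of dangling nodes, and the dict-of-lists graph format are assumptions of this example rather than details taken from the Common Crawl setup.

```python
def pagerank(graph, damping=0.85, iterations=100):
    """Power iteration on a graph given as {node: [targets]} (each target listed once)."""
    nodes = set(graph)
    for targets in graph.values():
        nodes.update(targets)
    nodes = sorted(nodes)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}

    for _ in range(iterations):
        # Rank held by dangling nodes (no outlinks) is spread evenly over all nodes.
        dangling_mass = sum(rank[v] for v in nodes if not graph.get(v))
        base = (1.0 - damping) / n + damping * dangling_mass / n
        new_rank = {v: base for v in nodes}
        for src in nodes:
            targets = graph.get(src, [])
            if targets:
                # Each node passes an equal share of its rank to every target.
                share = damping * rank[src] / len(targets)
                for dst in targets:
                    new_rank[dst] += share
        rank = new_rank
    return rank


if __name__ == "__main__":
    toy = {"a": ["hub"], "b": ["hub"], "c": ["a"], "hub": []}
    for node, score in sorted(pagerank(toy).items(), key=lambda kv: -kv[1]):
        print(node, round(score, 4))
```

The scores sum to 1, and `hub` ends up on top because it collects the shares passed on by both `a` and `b`.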

PageRank is susceptible to manipulation (e.g., link farming, or creating many interconnected spam pages). These artificial links can inflate the importance of a spam node. Harmonic Centrality is more resistant to this kind of spam, because it is harder to 'game', or exploit through artificial link patterns.

Domain Lookup

Domain rankings from the Common Crawl Web Graph · Updated 2026-03-06

Embed this widget on your own site with an iframe. It resizes automatically to fit the content:

<iframe id="cc-domain-lookup"
  src="https://commoncrawl.github.io/cc-webgraph-statistics/?embed=domain-lookup&domain=example.com"
  width="100%" height="300" frameborder="0"
  style="border:1px solid #e2e8f0;border-radius:8px;">
</iframe>
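<script>
// Resize the iframe whenever the embedded widget reports its content height.
window.addEventListener('message', function(e) {
  // Only react to messages tagged as coming from the domain-lookup embed.
  if (e.data && e.data.ccEmbed === 'domain-lookup') {
    document.getElementById('cc-domain-lookup').style.height = e.data.height + 'px';
  }
});
</script>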

To compare two domains, add &compare=other.com to the URL.
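
If the embed URL is generated programmatically, a small helper along these lines could assemble it. The base URL and the `embed`, `domain`, and `compare` parameters come from the snippet above; the function name `embed_url` is hypothetical.

```python
from urllib.parse import urlencode

EMBED_BASE = "https://commoncrawl.github.io/cc-webgraph-statistics/"


def embed_url(domain, compare=None):
    """Build the iframe src for the domain lookup widget, optionally with a comparison."""
    params = {"embed": "domain-lookup", "domain": domain}
    if compare:
        params["compare"] = compare
    return EMBED_BASE + "?" + urlencode(params)


if __name__ == "__main__":
    print(embed_url("example.com"))
    print(embed_url("example.com", compare="other.com"))
```

Using `urlencode` keeps the query string well-formed even if a domain contains characters that need escaping.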

Search for a domain to see its Harmonic Centrality and PageRank over time. Enter a second domain to compare them side by side. Only domains that appear in the top 1,000 for at least one release are available.

Statistics Plots

The following interactive charts show Web Graph statistics for all previous releases. Hover for values, click legend items to toggle series, and drag the range bar to zoom.

nodes

The total number of unique nodes in the graph. Each node represents one entity in the web graph: a host in the host-level graph, or a domain in the domain-level graph.

arcs

The total number of directed edges (or arcs) in the graph, representing links between nodes.

successoravggap

The average gap between the IDs of successive nodes in the adjacency (successor) lists of the graph. Small gaps mean that a node's link targets have nearby IDs, which reflects the ordering and clustering of nodes in the graph.

avglocality

A measure of the locality of edges in the graph: how far apart linked nodes are in the graph's node ordering, averaged over all arcs.

maxoutdegree

The highest number of outgoing edges (links) from a single node in the graph. This is the outdegree of the node with the most outlinks.

dangling

The total number of dangling nodes in the graph, which are nodes (vertices) with zero outgoing edges (arcs).

percdangling

The percentage of nodes in the graph that are dangling nodes (see above).

avgdegree

The average number of edges per node (arcs divided by nodes). In a directed graph, the average indegree and average outdegree are equal, so this single value represents both.

successoravglogdelta

The average logarithmic difference between successive node IDs in the adjacency list. This reflects the dispersion of node IDs in the graph structure.

maxindegree

The highest number of incoming edges (links) to a single node in the graph. This is the indegree of the most referenced, or most linked-to, node.
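
The counting statistics described so far (nodes, arcs, maxoutdegree, maxindegree, dangling, percdangling, avgdegree) can be reproduced on a toy graph with a few lines of Python. This is only a sketch to make the definitions concrete: the dict-of-lists input and the function name `degree_stats` are assumptions, the graph is assumed to be non-empty, and the published values are computed on the compressed graphs rather than like this.

```python
def degree_stats(graph):
    """Compute basic degree statistics for a graph given as {node: [targets]}."""
    nodes = set(graph)
    for targets in graph.values():
        nodes.update(targets)

    # Duplicate targets are collapsed so that each arc is counted once.
    successors = {v: set(graph.get(v, [])) for v in nodes}
    outdeg = {v: len(successors[v]) for v in nodes}
    indeg = {v: 0 for v in nodes}
    for src in nodes:
        for dst in successors[src]:
            indeg[dst] += 1

    n = len(nodes)
    arcs = sum(outdeg.values())
    dangling = sum(1 for v in nodes if outdeg[v] == 0)
    return {
        "nodes": n,
        "arcs": arcs,
        "maxoutdegree": max(outdeg.values()),
        "maxindegree": max(indeg.values()),
        "dangling": dangling,
        "percdangling": 100.0 * dangling / n,
        "avgdegree": arcs / n,
    }


if __name__ == "__main__":
    toy = {"a": ["b", "c"], "b": ["c"], "c": [], "d": ["c"]}
    for name, value in degree_stats(toy).items():
        print(name, value)
```

In the toy graph, `c` is the only dangling node and also the most linked-to one, with three incoming arcs.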

sccs

The total number of strongly connected components (SCCs) in the graph. SCCs are maximal subgraphs in which every node is reachable from every other node within the subgraph.

maxsccsize

The size (number of nodes) of the largest strongly connected component (SCC) in the graph. This indicates the largest cluster of nodes that are mutually reachable.

percmaxscc

The percentage of nodes in the graph that belong to the largest strongly connected component (SCC). It shows how dominant the largest SCC is in the overall graph.

percminscc

The percentage of nodes in the graph that belong to the smallest strongly connected components (SCCs), which are typically isolated nodes or trivial single-node SCCs. This indicates the prevalence of disconnected or minimally connected components.
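
As an illustration of the SCC statistics above, the sketch below finds strongly connected components with an iterative version of Kosaraju's algorithm and derives sccs, maxsccsize, and percmaxscc from them. The choice of algorithm, the dict-of-lists input, and the function names are assumptions for this example, not a description of how the published numbers are produced. percminscc is left out because its exact formula is not spelled out above.

```python
def strongly_connected_components(graph):
    """Return the SCCs of a graph given as {node: [targets]}, as lists of nodes."""
    nodes = set(graph)
    for targets in graph.values():
        nodes.update(targets)
    reverse = {v: [] for v in nodes}
    for src, targets in graph.items():
        for dst in targets:
            reverse[dst].append(src)

    # Pass 1: record nodes in order of DFS completion on the original graph.
    visited, order = set(), []
    for start in nodes:
        if start in visited:
            continue
        visited.add(start)
        stack = [(start, iter(graph.get(start, [])))]
        while stack:
            node, successors = stack[-1]
            for nxt in successors:
                if nxt not in visited:
                    visited.add(nxt)
                    stack.append((nxt, iter(graph.get(nxt, []))))
                    break
            else:
                # All successors explored: the node is finished.
                order.append(node)
                stack.pop()

    # Pass 2: sweep the reversed graph in decreasing order of completion.
    assigned, components = set(), []
    for start in reversed(order):
        if start in assigned:
            continue
        assigned.add(start)
        component, stack = [], [start]
        while stack:
            node = stack.pop()
            component.append(node)
            for prev in reverse[node]:
                if prev not in assigned:
                    assigned.add(prev)
                    stack.append(prev)
        components.append(component)
    return components


def scc_stats(graph):
    """Derive sccs, maxsccsize and percmaxscc from the component sizes."""
    sizes = [len(c) for c in strongly_connected_components(graph)]
    total = sum(sizes)
    return {
        "sccs": len(sizes),
        "maxsccsize": max(sizes),
        "percmaxscc": 100.0 * max(sizes) / total,
    }


if __name__ == "__main__":
    toy = {"a": ["b"], "b": ["c"], "c": ["a", "d"], "d": []}
    print(scc_stats(toy))
```

In the toy graph, `a`, `b`, and `c` form a cycle and therefore one SCC, while `d` is a trivial SCC of its own, so the largest component covers 75% of the nodes.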

Download Data

domain.tsv host.tsv

Credits

  • Web Data Commons, for their web graph data set and everything related.
  • Common Search; we first used their web graph to expand the crawler frontier, and Common Search's cosr-back project was an important source of inspiration for how to process our data using PySpark.
  • The authors of the WebGraph framework, whose software simplifies the computation of rankings.
  • This project is maintained by Common Crawl. View the project on GitHub.