My understanding is that a few hundred thousand popular websites have been indexed in snapshots over a number of years, and that the sample has expanded dramatically in the last couple of years.
That’s correct. We’re up to about 6-7M URLs monthly now.
We are interested in harvesting all the outbound links (links to other websites) from within each website at different points in time.
As newbies to BigQuery and HTTP Archive we would really welcome any advice, tips or examples as to what queries may yield links (or counts of unique URLs and unique domain counts they sit within) at different points in time.
There are two approaches that I can think of: using regexp to parse <a href=... from the HTML response bodies, and writing a custom metric to query the DOM for a[href].
Parsing HTML with regular expressions
Pros:
- can be applied to 4 years of historical data
Cons:
- extremely expensive (one table is 14TB, so at $5/TB a single full scan costs about $70)
- regular expressions are much more brittle compared to querying the DOM
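To make the regexp approach concrete, here is a minimal sketch in plain JavaScript. The function name extractOutboundHosts and the sample inputs are hypothetical, and the pattern only handles quoted href values, so treat it as an illustration of the brittleness mentioned above rather than a production parser. It also compares hostnames exactly, so www.example.com and example.com would count as different sites.

```javascript
// Hypothetical helper (not part of HTTP Archive): collect the unique external
// hostnames linked from a raw HTML string using a regular expression.
function extractOutboundHosts(html, pageUrl) {
  const pageHost = new URL(pageUrl).hostname;
  // Matches <a ... href="..."> or <a ... href='...'>; unquoted values are missed.
  const hrefPattern = /<a\s[^>]*?href\s*=\s*(["'])(.*?)\1/gi;
  const hosts = new Set();
  for (const match of html.matchAll(hrefPattern)) {
    let target;
    try {
      // Resolve relative links against the page URL.
      target = new URL(match[2], pageUrl);
    } catch (e) {
      continue; // Skip values that cannot be parsed as a URL.
    }
    // Ignore mailto:, javascript:, tel: and other non-web schemes.
    if (target.protocol !== 'http:' && target.protocol !== 'https:') continue;
    if (target.hostname !== pageHost) hosts.add(target.hostname);
  }
  return Array.from(hosts).sort();
}
```

The same pattern could in principle be adapted to a query over the response bodies, but the cost concern above applies regardless of how the matching is done.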
Querying the DOM with a custom metric
A custom metric is a snippet of JS that is executed at runtime while the web page is being tested. We can use native DOM APIs like querySelectorAll to extract exactly what we want from the page. See img-loading-attr.js for example:
return JSON.stringify(Array.from(document.querySelectorAll('img[loading]')).map(img => {
  return img.getAttribute('loading').toLowerCase();
}));
The team researching the SEO chapter of the 2020 Web Almanac has also implemented some link-related custom metrics, but they aggregate stats based on metadata like number of internal/external links and not necessarily the external domains themselves.
Pros:
- highly accurate and reliable
- less expensive to query
Cons:
- inapplicable to historical data
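For the outbound links use case, a custom metric might look something like the sketch below. This is not an existing HTTP Archive metric; collectExternalHosts is a hypothetical name, and the logic is wrapped in a function that takes the document as a parameter only so that it can be exercised outside a browser. It resolves links against the page URL and so ignores any <base href> on the page.

```javascript
// Hypothetical sketch of a link-collecting custom metric.
// doc is expected to expose querySelectorAll, like a browser document.
function collectExternalHosts(doc, pageUrl) {
  const pageHost = new URL(pageUrl).hostname;
  const hosts = new Set();
  for (const anchor of doc.querySelectorAll('a[href]')) {
    let target;
    try {
      // Resolve relative hrefs against the page URL.
      target = new URL(anchor.getAttribute('href'), pageUrl);
    } catch (e) {
      continue; // Skip hrefs that cannot be parsed as a URL.
    }
    // Keep only web links; drop mailto:, javascript:, tel:, etc.
    if (target.protocol !== 'http:' && target.protocol !== 'https:') continue;
    if (target.hostname !== pageHost) hosts.add(target.hostname);
  }
  return Array.from(hosts).sort();
}

// Inside an actual custom metric, the body would presumably end with:
// return JSON.stringify(collectExternalHosts(document, location.href));
```

Returning only the unique external hostnames keeps the output small, which matters because the result is stored for every page tested. Full URLs could be returned instead if per-link detail is needed.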
Similar to Analyzing stylesheets with a JS-based parser, I’ve been dreaming of a third option, which would be to post-process the HTML response bodies with some kind of JS-based DOM parser. It’s only theoretical now and would still require a big upfront expense to query and process the HTML.