How can you query HTTP Archive to yield outbound links (and domains) from homepages?

rviscomi · August 12, 2020, 3:15pm

My understanding is that a few hundred thousand popular websites have been indexed in snapshots over a number of years, and that the sample has expanded dramatically in the last couple of years.

That’s correct. We’re up to about 6-7M URLs monthly now.

We are interested in harvesting all the outbound links (links to other websites) from within each website at different points in time.

As newbies to BigQuery and HTTP Archive we would really welcome any advice, tips or examples as to what queries may yield links (or counts of unique URLs and unique domain counts they sit within) at different points in time.

There are two approaches that I can think of: using regexp to parse <a href=... from the HTML response bodies, and writing a custom metric to query the DOM for a[href].

Parsing HTML with regular expressions

Pros:

can be applied to 4 years of historical data

Cons:

extremely expensive (one table is 14TB at $5/TB, so roughly $70 for a single full scan)
regular expressions are far more brittle than querying the DOM
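
To make the regexp idea concrete, here is a minimal sketch of the matching logic in plain JavaScript, applied to one HTML string at a time. The function name extractOutboundLinks and the sample URLs are hypothetical, and this is not how the HTTP Archive tables are actually queried; it only illustrates the pattern and why it is brittle (unquoted href values and multi-line attributes are missed, and "external" here just means a different hostname, so www.example.com and example.com count as different sites).

```javascript
// Sketch only: pull outbound link URLs and hostnames out of an HTML string.
// extractOutboundLinks is a hypothetical helper, not part of HTTP Archive.
function extractOutboundLinks(html, pageUrl) {
  const pageHost = new URL(pageUrl).hostname;
  // Matches href="..." or href='...' inside an <a ...> tag.
  // Unquoted attribute values are not matched (one source of brittleness).
  const anchorHref = /<a\b[^>]*?\shref\s*=\s*(["'])(.*?)\1/gi;
  const urls = new Set();
  const domains = new Set();
  for (const match of html.matchAll(anchorHref)) {
    let target;
    try {
      // Resolve relative hrefs against the page URL.
      target = new URL(match[2], pageUrl);
    } catch (e) {
      continue; // skip hrefs that cannot be parsed as URLs
    }
    // Ignore mailto:, javascript:, tel: and similar schemes.
    if (target.protocol !== 'http:' && target.protocol !== 'https:') continue;
    // Keep only links whose hostname differs from the page's own.
    if (target.hostname === pageHost) continue;
    urls.add(target.href);
    domains.add(target.hostname);
  }
  return { urls: [...urls], domains: [...domains] };
}

const sample = '<a href="/about">About</a> <a class="ext" href="https://other.example/page">Other</a>';
console.log(extractOutboundLinks(sample, 'https://example.com/'));
```

The two Sets give the unique URL and unique domain counts asked about in the question; the same pattern would need to be ported to whatever regexp functions the query engine provides.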

Querying the DOM with a custom metric

A custom metric is a snippet of JS that is executed at runtime while the web page is being tested. We can use native DOM APIs like querySelectorAll to extract exactly what we want from the page. See img-loading-attr.js for example:

return JSON.stringify(Array.from(document.querySelectorAll('img[loading]')).map(img => {
  return img.getAttribute('loading').toLowerCase();
}));

The team researching the SEO chapter of the 2020 Web Almanac has also implemented some link-related custom metrics, but those aggregate stats such as the number of internal and external links, and do not necessarily record the external domains themselves.
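
A custom metric that does return the external domains could look roughly like the sketch below. It is an assumption, not an existing HTTP Archive metric: the function name outboundLinkDomains is hypothetical, and the document and page URL are passed in as parameters only so the logic can also be exercised outside a browser. As in the example above, the result is serialized with JSON.stringify.

```javascript
// Hypothetical custom-metric sketch: count links per external hostname.
// `doc` and `pageUrl` stand in for `document` and `location.href`.
function outboundLinkDomains(doc, pageUrl) {
  const pageHost = new URL(pageUrl).hostname;
  const counts = {};
  for (const a of doc.querySelectorAll('a[href]')) {
    let target;
    try {
      // Resolve relative hrefs against the page URL.
      target = new URL(a.getAttribute('href'), pageUrl);
    } catch (e) {
      continue; // skip hrefs that cannot be parsed as URLs
    }
    // Ignore mailto:, javascript:, tel: and similar schemes.
    if (target.protocol !== 'http:' && target.protocol !== 'https:') continue;
    // "External" here simply means a different hostname than the page.
    if (target.hostname === pageHost) continue;
    counts[target.hostname] = (counts[target.hostname] || 0) + 1;
  }
  return JSON.stringify(counts);
}

// Inside the browser during a test run, the metric body would reduce to:
//   return outboundLinkDomains(document, location.href);
```

Returning a hostname-to-count map keeps the stored result small while still allowing unique-domain counts to be computed later in a query.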

Pros:

highly accurate and reliable
less expensive to query

Cons:

inapplicable to historical data

Similar to "Analyzing stylesheets with a JS-based parser", I’ve been dreaming of a third option: post-processing the HTML response bodies with some kind of JS-based DOM parser. It’s only theoretical for now, and it would still require a big upfront expense to query and process the HTML.
