As an extension of ongoing research into online diversity, colleagues and I are exploring potential large-scale sources of queryable web corpora that may give insights into long-term patterns in the evolving web.
A colleague recommended HTTP Archive as one potential avenue to explore, as it contains a relatively large archive sample of popular websites and is queryable via BigQuery, so it may be usable for statistical analysis. My understanding is that a few hundred thousand popular websites have been indexed in snapshots over a number of years, and that the sample has expanded dramatically in the last couple of years.
We are interested in harvesting all the outbound links (links to other websites) from within each website at different points in time.
As newbies to BigQuery and HTTP Archive, we would really welcome any advice, tips or examples as to what queries might yield links (or counts of unique URLs and of the unique domains they sit within) at different points in time.
My understanding is that a few hundred thousand popular websites have been indexed in snapshots over a number of years, and that the sample has expanded dramatically in the last couple of years.
That’s correct. We’re up to about 6-7M URLs monthly now.
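If you want to check how the sample has grown over time yourself, a rough (untested) sketch would be something like the following. It assumes the summary_pages tables, one table per crawl per client:

# Untested sketch: number of pages per crawl, per client.
SELECT
  _TABLE_SUFFIX AS crawl,
  COUNT(0) AS pages
FROM
  `httparchive.summary_pages.*`
GROUP BY
  crawl
ORDER BY
  crawl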
We are interested in harvesting all the outbound links (links to other websites) from within each website at different points in time.
As newbies to BigQuery and HTTP Archive, we would really welcome any advice, tips or examples as to what queries might yield links (or counts of unique URLs and of the unique domains they sit within) at different points in time.
There are two approaches that I can think of: using a regexp to parse <a href=... from the HTML response bodies, and writing a custom metric to query the DOM for a[href].
Parsing HTML with regular expressions
Pros:
- can be applied to 4 years of historical data
Cons:
- extremely expensive (one table is 14 TB at $5/TB)
- regular expressions are much more brittle compared to querying the DOM
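For illustration, a rough (untested) sketch of the regexp approach might look something like this. It assumes the page, url and body columns of the response_bodies schema, treats the row where url = page as the root HTML document, and only catches double-quoted absolute hrefs, which is exactly the kind of brittleness mentioned above:

# Untested sketch: outbound link domains from one crawl's HTML documents.
# Column names (page, url, body) and the table name are assumptions to verify,
# and even a single table is many TB, so check the dry-run estimate first.
SELECT
  NET.REG_DOMAIN(page) AS source_domain,
  NET.REG_DOMAIN(href) AS target_domain,
  COUNT(DISTINCT href) AS unique_urls
FROM
  `httparchive.response_bodies.2020_09_01_desktop`,
  UNNEST(REGEXP_EXTRACT_ALL(body, r'(?i)<a\s[^>]*href="(https?://[^"]+)"')) AS href
WHERE
  url = page  # only parse the root HTML document for each page
  AND NET.REG_DOMAIN(href) != NET.REG_DOMAIN(page)  # keep cross-domain links only
GROUP BY
  source_domain,
  target_domain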
Querying the DOM with a custom metric
A custom metric is a snippet of JS that is executed at runtime while the web page is being tested. We can use native DOM APIs like querySelectorAll to extract exactly what we want from the page. See img-loading-attr.js for example:
return JSON.stringify(Array.from(document.querySelectorAll('img[loading]')).map(img => {
  return img.getAttribute('loading').toLowerCase();
}));
The team researching the SEO chapter of the 2020 Web Almanac has also implemented some link-related custom metrics, but they aggregate stats based on metadata like number of internal/external links and not necessarily the external domains themselves.
Pros:
- highly accurate and reliable
- less expensive to query
Cons:
- not applicable to historical data (a new metric only takes effect in future crawls)
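For what it’s worth, once a link-oriented custom metric is added to a future crawl, querying its output from the pages tables might look roughly like this. It’s an untested sketch that assumes a hypothetical metric named outbound_links returning a JSON array of absolute external URLs, stored in the HAR payload under an _outbound_links key:

# Untested sketch against a hypothetical future crawl and custom metric.
SELECT
  NET.REG_DOMAIN(url) AS source_domain,
  NET.REG_DOMAIN(JSON_EXTRACT_SCALAR(link, '$')) AS target_domain,
  COUNT(DISTINCT JSON_EXTRACT_SCALAR(link, '$')) AS unique_urls
FROM
  `httparchive.pages.2021_01_01_desktop`,
  UNNEST(JSON_EXTRACT_ARRAY(JSON_EXTRACT_SCALAR(payload, '$._outbound_links'))) AS link
GROUP BY
  source_domain,
  target_domain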
Similar to Analyzing stylesheets with a JS-based parser, I’ve been dreaming of a third option, which would be to post-process the HTML response bodies with some kind of JS-based DOM parser. It’s only theoretical now and would still require a big upfront expense to query and process the HTML.
That’s awesome Rick - thank you!
Sounds like option one is best as it’s the full-monty time-series.
When you say one table is 14 TB at $5/TB, does that mean the likely cost for the total project is around $70? I’m not sure what kind of scale the data is likely to be, or whether it’s likely to be an order of magnitude more than that.
There are two tables for every crawl: desktop and mobile. The crawl has been monthly since January 2019; prior to that it was semi-monthly. The size of the corpus has grown over time, but using 14 TB per table as an upper limit, that would be:
(20 monthly crawls * 2 tables per crawl) + (72 semi-monthly crawls * 2 tables per crawl) = 184 tables
184 tables * 14 TB per table = 2,576 TB
2,576 TB * $5 per TB = $12,880
In any case, querying all response_bodies tables is going to be very expensive.
Edit: We can get a more accurate estimate of the total cost by writing (but not running!) a query over the entire dataset and letting BigQuery calculate the actual total bytes for us:
# Don't ever run this query! This is just for demonstration purposes!
SELECT * FROM `httparchive.response_bodies.*`
This query will process 448.9 TB when run.
So 448.9 TB * $5 per TB = $2,244.5
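One way to keep the cost down is to filter on _TABLE_SUFFIX so that only a subset of the wildcard's tables is scanned and billed, for example one desktop snapshot per year. A sketch you could dry-run (again, don't actually run it without checking the estimate) to see the narrower scope, assuming the YYYY_MM_DD_client table naming:

# Dry-run this to estimate the cost of an annual desktop time series.
SELECT page, url, body
FROM `httparchive.response_bodies.*`
WHERE _TABLE_SUFFIX IN ('2017_09_01_desktop', '2018_09_01_desktop',
                        '2019_09_01_desktop', '2020_09_01_desktop')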
Okay, I see - thanks Rick, much appreciated.