Where to find HTTP headers in the HTTP Archive datasets

Can someone point me to where in the data set I can find the response headers that would include Link and “nopush” attributes mentioned here?

Thanks!

If you click the three-dot menu at the top right of any of the Web Almanac stats, you get a link to the Query (and the Data):

This query uses the httparchive.almanac.requests table.

Using the almanac dataset has several advantages:

  • Certain attributes (including response headers!) are pulled out in an easy to query format.
  • It is a lot cheaper to query because it is clustered better, so you often don’t pay the cost of a full table scan.
  • Other columns (like page URL) are easily queried, without having to join to translate a pageid to a page URL.
  • They have a sample_data dataset, which allows you to query quickly and cheaply before hitting the full table.

And some disadvantages:

  • It’s only updated once a year. The last update was 1st July, which is a while ago now, so the further we get from the last Almanac edition, the less useful this is.
  • All data for all Almanac years is in the one table, so make sure you specify a date!

Basically the Almanac data model is better, and we have talked a lot about backporting its improvements to the monthly tables, but that is a good chunk of work. So no ETA on when (or even if!) we’ll do that.

You can also get this in the monthly tables, but it is a bit more difficult to query from there.

You can get it from the main monthly request tables:

SELECT page, url, JSON_EXTRACT(payload, "$.response.headers") AS response_headers
FROM `httparchive.requests.2022_03_01_mobile` LIMIT 1000

But this is a VERY large, and so expensive, table to query.
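
If you export some rows from that query, something like the Python below could pick out the Link headers that carry nopush. Treat it as a rough sketch only: it assumes response_headers comes back as a JSON array of objects with "name" and "value" keys (the HAR layout), and link_headers_with_nopush is a made-up helper name for illustration, not part of any HTTP Archive tooling.

```python
import json
import re


def link_headers_with_nopush(response_headers_json):
    """Return the Link header values that carry a bare `nopush` parameter.

    Assumes the input is a JSON array of {"name": ..., "value": ...} objects.
    """
    # A NULL column or a JSON null both end up as an empty list.
    headers = json.loads(response_headers_json or "[]") or []
    matches = []
    for header in headers:
        # Header names are case-insensitive, so compare in lower case.
        if header.get("name", "").lower() != "link":
            continue
        value = header.get("value", "")
        # Rough tokenisation: split on ';' (parameters) and ',' (multiple links).
        # Not a full Link header parser; URLs containing ',' or ';' are split too.
        tokens = [token.strip().lower() for token in re.split(r"[;,]", value)]
        if "nopush" in tokens:
            matches.append(value)
    return matches


# Hypothetical sample row, made up for illustration.
sample = json.dumps([
    {"name": "content-type", "value": "text/html"},
    {"name": "Link", "value": "</style.css>; rel=preload; as=style; nopush"},
    {"name": "link", "value": "</app.js>; rel=preload; as=script"},
])
print(link_headers_with_nopush(sample))
```

Filtering locally like this only makes sense on a small export; on the full table you would want to push the filter into the query itself.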

The summary_requests tables also have them and are much cheaper to query, but the headers are in a text format, and any header that doesn’t have its own column is bundled into a respOtherHeaders column.

For example:

SELECT pageid, respOtherHeaders FROM `httparchive.summary_requests.2022_03_01_mobile` LIMIT 1000

Personally I think the Almanac tables are pretty good for most cases and easier to query (not least because you have access to all the Almanac queries to start from!). The web doesn’t generally move that fast so, unless you’re looking at a new technology, or one that is likely to have changed a lot since the last Almanac run, see if you can use them.

Hope that helps.

Wow, this is incredibly helpful. You have saved me a ton of time. I would fire off a follow-up question, except you anticipated all of those, so all I can say is…

Thanks very much!

Peter
