Help finding a list of home pages with a specific HTTP response header

I’m new to the HTTP Archive dataset as well as BigQuery, but I have some very basic SQL knowledge. I’m hoping to get help with a certain query.

I’m looking to generate a list of home page URLs where the HTTP response headers include vary=x-wf-forwarded-proto.

I’m trying to understand BigQuery pricing, but I’m concerned about accidentally racking up a huge charge. What is a resource-efficient way to make this first query? Thanks!

If you are only looking for home pages that have that response header, then the query below is very efficient (~43 GB, well below the 1 TB monthly cap):

SELECT DISTINCT
  date,
  client,
  url
FROM
  `httparchive.all.requests`,
  UNNEST(response_headers) AS resp_headers
WHERE
  date = "2023-04-01" AND
  lower(resp_headers.name) = 'vary' AND
  lower(resp_headers.value) LIKE '%x-wf-forwarded-proto%' AND
  is_main_document
  --AND root_page = url --uncomment this if you just want home pages and not also secondary pages

This uses the new all schema, which is clustered to allow cheaper querying for certain filters (like is_main_document).

However, if you are looking for ALL requests that might have that vary header (i.e. including subresources on a page where that might be set), then you can remove the is_main_document clause above, but then it’s much more expensive at 3.2 TB (or $15).

BTW I highly recommend this post on ensuring you don’t run up a big bill:
https://timkadlec.com/remembers/2019-12-10-using-bigquery-without-breaking-the-bank/


Thank you for taking the time to explain and share this query. I appreciate it immensely. This is really helpful.

A quick follow up question:

When I paste the query it estimates the processing amount at 3.08 TB (screenshot below). What am I missing?

My fault. Now I see the comment at the bottom.

However, when I uncomment that, I’m still seeing an estimate of 3.17 TB.

Hmmm that’s really annoying. It appears only admins get the true estimate, while non-admins get a much worse, “worst-case” estimate.

Running it will only consume 43 GB (I tested this with a non-admin account just to be 100% sure before advising you on this), but there’s no way of you knowing that other than taking the leap, which is a bit poor to be honest :frowning:

Ooooh bummer, that’s a very odd user experience. But no worries, thanks for testing on my behalf. I feel confident pulling the trigger and can cough up a few bucks if for some reason it works the other way.

It does make me question how I can practice writing efficient queries moving forward without accurate estimates.

Thanks for the help!

I raised a request here: Google Issue Tracker

In general, looking at a table’s partitioning and clustering details will help guide you to cheaper queries.
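One cheap way to see those details outside the UI is to query BigQuery’s standard INFORMATION_SCHEMA views. This is generic BigQuery metadata rather than anything HTTP Archive specific, so treat it as a sketch:

SELECT
  table_name,
  column_name,
  is_partitioning_column,
  clustering_ordinal_position
FROM
  `httparchive.all.INFORMATION_SCHEMA.COLUMNS`
WHERE
  table_name IN ('pages', 'requests') AND
  -- keep only the columns the tables are partitioned or clustered on
  (is_partitioning_column = 'YES' OR clustering_ordinal_position IS NOT NULL)

That lists which columns each table is partitioned and clustered on, which are exactly the columns worth filtering on first.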

But yes, it’s annoying that you don’t see this.

For httparchive.all.pages, setting rank = 1000 is a great way of testing queries super cheaply, as it limits the scan to the top 1,000 URLs rather than the millions. Also avoid the payload and lighthouse columns where you can (less of an issue when using rank = 1000, but they will cost you if doing a full run).
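For example, a quick sanity check of a pages query might look something like this (just a few small columns plus the rank filter):

SELECT
  date,
  client,
  page
FROM
  `httparchive.all.pages`
WHERE
  date = "2023-04-01" AND
  rank = 1000 -- top 1,000 sites only, so the scan stays tiny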

For httparchive.all.requests, using is_main_document (as in the query above) is usually a great way of narrowing things down to just the main document request (again excluding 99% of the rows). Also avoiding the payload and especially the response_body columns is a good way to keep costs reasonable.

The best option of all is to use an account without any billing details set. That way you can’t ever be charged.

@tunetheweb I’ve been playing around with the original query you provided and am attempting to also grab the technology names (technologies.technology) from the pages table and store them as a comma-separated list.

I’m running into an error when trying to select that field in conjunction with the SELECT DISTINCT statement. Do I need to create a third query to access the technology names, convert the array to a string, and then join with the result of the other two queries? I’m inexperienced with SQL, so this is a bit of a challenge.
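Roughly, what I think I’m after is something like this, though I’m guessing at the join columns (r.page / p.page) and the array handling, so it may well be wrong:

SELECT DISTINCT
  r.date,
  r.client,
  r.url,
  -- flatten the technologies array into a string so it plays nicely with DISTINCT
  ARRAY_TO_STRING(ARRAY(SELECT t.technology FROM UNNEST(p.technologies) AS t), ', ') AS technologies
FROM
  `httparchive.all.requests` AS r
JOIN
  `httparchive.all.pages` AS p
ON
  r.date = p.date AND
  r.client = p.client AND
  r.page = p.page
CROSS JOIN
  UNNEST(r.response_headers) AS resp_headers
WHERE
  r.date = "2023-04-01" AND
  p.date = "2023-04-01" AND -- filter both tables on date so partition pruning applies
  lower(resp_headers.name) = 'vary' AND
  lower(resp_headers.value) LIKE '%x-wf-forwarded-proto%' AND
  r.is_main_document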

Appreciate your thoughts!