Subsetting response_bodies to 50,000 random pages

The response_bodies tables are now over 7 TB in size. That’s about 10x bigger than in 2017 (~700 GB), but BigQuery pricing is the same (currently $5 per TB scanned, with the first TB per month free). At 7 TB × $5/TB, that works out to roughly $35 per full-table query, which seems a bit steep to me.

An easy option is to just query an old table that is smaller, but then you don’t get the latest data.

So I want the latest data, but not necessarily the whole thing. In particular not while iterating on a query.

It seems easy to create a subset with just “LIMIT x”, but that may not produce a useful subset: it could include random resources without including all of the resources for any given “page”.

Here’s how I created a subset table of 50,000 random page URLs, where each “page” includes all of its resources.

First, query the pages table to get the random URLs, using RAND(). The denominator (4,371,974) is the total number of rows in the pages table, so the filter keeps roughly 50,000 of them.

SELECT url FROM `httparchive.pages.2019_07_01_desktop` WHERE RAND() < 50000/4371974
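
If you’re doing this against a different crawl, a quick count gives you the total to plug into the denominator, for example:

SELECT COUNT(*) AS total_pages
FROM `httparchive.pages.2019_07_01_desktop`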

Save the result of this query as a table, for example your-project.random_urls_50k (the name used in the join below).
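
Alternatively, if you prefer to skip the UI’s save step, the sampling and saving can be done in one statement with CREATE TABLE … AS (a sketch; your-project is a placeholder for your own project and dataset):

-- your-project is a placeholder; substitute your own project/dataset
CREATE TABLE `your-project.random_urls_50k` AS
SELECT url
FROM `httparchive.pages.2019_07_01_desktop`
WHERE RAND() < 50000/4371974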

Next, write this query.

SELECT
  response_bodies.page AS page,
  response_bodies.url AS url,
  response_bodies.body AS body,
  response_bodies.truncated AS truncated
FROM `httparchive.response_bodies.2019_07_01_desktop` AS response_bodies 
RIGHT JOIN `your-project.random_urls_50k` AS random_urls
ON response_bodies.page = random_urls.url

Before running it, click “More” > “Query settings”, set a destination table, and tick the “Allow large results” checkbox. Otherwise the query will error out with a responseTooLarge error.

Click Run. This query takes around 10 minutes.

And that’s it. You can now query this table instead of the full thing. This table is 82.85 GB and has 1,840,810 rows.
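
For example, a query like the one below scans roughly the 83 GB subset instead of the 7+ TB original (your-project.response_bodies_50k is a placeholder for whatever destination table name you chose above):

-- your-project.response_bodies_50k is a placeholder for the destination table set above
SELECT COUNT(DISTINCT page) AS pages_referencing_jquery
FROM `your-project.response_bodies_50k`
WHERE body LIKE '%jquery%'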

Note that storing the table is not free (see https://cloud.google.com/bigquery/pricing#storage). For the table above it’s $1.64 per month as active storage, and half that once it counts as long-term storage after 90 days without modification.

This is a great guide, thanks for writing it up! A few related tips:

@paulcalvano created a sample_data dataset consisting of 1K and 10K subsets of all tables in the 2019_05_01 crawl. This is useful if the results you’re interested in are not particularly time-sensitive and you’re trying to refine your query inexpensively.
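
You just point your query at the sample_data dataset instead; for example (check the dataset listing for the exact table names, but they look something like this):

-- the table name is assumed; check httparchive.sample_data for the actual names
SELECT COUNT(*) AS https_pages
FROM `httparchive.sample_data.pages_desktop_10k`
WHERE url LIKE 'https://%'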

I’ve also created an almanac dataset with several 1K subset tables of the 2019_07_01 crawl, which contains some Almanac-specific results (for example, custom metrics). I’m also experimenting with a new way of organizing the data to be easier and less expensive to query:

- Combining both desktop and mobile into one table and clustering it, so you can select one or the other and only incur half the cost of the full table.
- Adding conveniences like the page field to the summary_requests table (previously you would have to join the pageid with the summary_pages table).
- Joining the summary_requests metadata with the response_bodies table and clustering by popular fields like firstHtml and type, so you can write queries like "count the HTML pages that contain the word Almanac".

The entire table is 17.4 TB, but because of the clustering the following query only costs about 400 GB:

SELECT
  COUNT(0)
FROM
  `httparchive.almanac.summary_response_bodies`
WHERE
  date = '2019-07-01' AND
  client = 'desktop' AND
  firstHtml AND
  body LIKE '%Almanac%'

I’m thinking about reorganizing the entire dataset to fit this pattern, to help improve query efficiency and reduce end-user costs. Hopefully everything will always fit under the free 1 TB monthly quota.

I hadn’t seen sample_data, very nice!

> I’m also experimenting with a new way of organizing the data to be easier and less expensive to query.

Good stuff, thank you 🙂
