The response_bodies tables are now over 7TB in size. This is about 10x bigger than it was in 2017 (~700 GB), but the BigQuery pricing is the same (currently $5 per TB, first TB per month is free). A fee of $35 per query seems a bit steep I think.
An easy option is to just query an old table that is smaller, but then you don’t get the latest data.
So I want the latest data, but not necessarily the whole thing. In particular not while iterating on a query.
It seems easy to create a subset with just “
LIMIT x”, but that may not create a useful subset, as it could include random resources but not include all resources for a specific “page”.
Here’s how I created a subsetted table, of 50,000 random
pages URLs, where each “page” includes all of its resources.
First, query the
pages table to get the random URLs, using RAND().
SELECT url FROM [httparchive:pages.2019_07_01_desktop] WHERE RAND() < 50000/4371974
Save this table.
Next, write this query.
SELECT response_bodies.page AS page, response_bodies.url AS url, response_bodies.body AS body, response_bodies.truncated AS truncated FROM `httparchive.response_bodies.2019_07_01_desktop` AS response_bodies RIGHT JOIN `your-project.random_urls_50k` AS random_urls ON response_bodies.page = random_urls.url
Before running it, click “More”, “Query settings” and set a destination table, and click the “Allow large results” checkbox. Otherwise it will error out with a responseTooLarge error.
Click Run. This query takes around 10 minutes.
And that’s it. You can now query this table instead of the full thing. This table is 82.85 GB and has 1,840,810 rows.
Note that storing the table is not free, see https://cloud.google.com/bigquery/pricing#storage - for the above table it’s $1.64 per month if it’s active storage, and half that after 90 days when it’s long-term storage.