Warning: $14,000 BigQuery Charge in 2 Hours

Yeah, hopefully the changes (going live soon) will help to raise awareness of the risks and best practices for controlling costs.

Unfortunately, though, the only way to enforce cost controls is in each individual user’s own Google Cloud project. As the BigQuery dataset owner, HTTP Archive doesn’t have the ability to set those kinds of restrictions for users.
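One guardrail you can set on your own side is a per-query byte cap via the Python client’s `maximum_bytes_billed` option: if a query would bill more than the cap, it fails before anything is charged. A minimal sketch, assuming the google-cloud-bigquery package and an illustrative 10 GB cap (the table and query are placeholders):

```python
from google.cloud import bigquery

client = bigquery.Client()  # runs in your own Google Cloud project

# Reject the query up front if it would bill more than ~10 GB.
# The cap is illustrative; pick a limit that matches your budget.
job_config = bigquery.QueryJobConfig(maximum_bytes_billed=10 * 1024**3)

sql = "SELECT page, url FROM `httparchive.all.requests` LIMIT 10"  # placeholder

try:
    for row in client.query(sql, job_config=job_config).result():
        print(row.page, row.url)
except Exception as exc:
    # A query over the cap errors out instead of running (and billing).
    print(f"Query rejected: {exc}")
```

Note that `LIMIT` doesn’t reduce the bytes scanned, so a query like this against the full requests table would be rejected by the cap rather than racking up a bill.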

Do you know about the dry_run option in the Python client? Granted, the estimate is provided in bytes, but it should give you an idea of the costs.
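A minimal sketch of how that looks, assuming the google-cloud-bigquery package (the query and the price per TiB are just placeholders; check the current on-demand rate for your region):

```python
from google.cloud import bigquery

client = bigquery.Client()

# dry_run=True validates the query and reports the bytes it would process,
# without actually running it or incurring any charges.
job_config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)

sql = "SELECT page, url FROM `httparchive.all.requests`"  # placeholder query

job = client.query(sql, job_config=job_config)
gib = job.total_bytes_processed / 1024**3

# Assumed on-demand rate of roughly $6.25 per TiB scanned; verify current pricing.
estimated_cost = job.total_bytes_processed / 1024**4 * 6.25
print(f"Would scan {gib:,.1f} GiB, roughly ${estimated_cost:,.2f}")
```

Running the dry run first and sanity-checking the number is a cheap habit that would flag a multi-terabyte scan before it costs anything.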

No, the warning only relates to the query drafted in the current tab. You could have multiple queries running concurrently, but each one will incur its own separate cost.

And the way this UI is used in BigQuery is mostly to show you that the current query is syntactically valid, hence the checkmark. If there isn’t enough horizontal space, BigQuery will actually hide the estimate behind a tooltip. And if there are any syntax errors, the estimate is replaced with an error message.

It’s a few factors:

  1. The web is very big
  2. It’s getting heavier
  3. We collect a lot of metadata
  4. We’ve been at it for a long time :sweat_smile:

You can see the number of websites we test each month in this report. As of today it’s about 26 million. Recently we started testing one interior page, in addition to the home page, for each site. Last month we tested 49,802,459 pages.

It hasn’t always been this large, but it’s been running consistently for over 13 years, and for most of that time the crawl has actually run twice per month.

For each page, we save a lot of metadata about it and its resources. For text-based resources, we save their full response bodies in BigQuery too.

On top of the growth in the number of websites we test, the websites themselves are loading more and heavier resources.

As of last month, the page-level data weighs in at 38 TB and the request-level data at 204 TB, an average of 1.5 MB and 7.8 MB of data per page, respectively.
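If you want to sanity-check those sizes yourself before writing a query, the client can read a table’s metadata for free, without scanning it (the table names here are illustrative of the current pages and requests tables):

```python
from google.cloud import bigquery

client = bigquery.Client()

# get_table() only fetches metadata, so this costs nothing to run.
for table_id in ("httparchive.all.pages", "httparchive.all.requests"):
    table = client.get_table(table_id)
    tib = table.num_bytes / 1024**4
    print(f"{table_id}: {tib:,.1f} TiB across {table.num_rows:,} rows")
```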

Because it’s so data-rich, the dataset is extremely valuable for understanding the state of the web and how it’s trending, including lots of academic research use cases. But as @Tim rightly points out, the dataset’s size makes it a double-edged sword that needs to be used with extreme caution.