Warning: $14,000 BigQuery Charge in 2 Hours

Yeah, hopefully the changes (going live soon) will help to raise awareness of the risks and best practices for controlling costs.

Unfortunately, though, the only way to enforce cost controls is in each individual user’s own Google Cloud project. As the BigQuery dataset owner, HTTP Archive doesn’t have the ability to set those kinds of restrictions for users.
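One guardrail you can set on your own side is a per-query byte cap via the Python client’s `maximum_bytes_billed` option: if a query would bill more than the cap, it fails before anything is charged. A minimal sketch, assuming the google-cloud-bigquery package and an illustrative 10 GB cap (the table and query are placeholders):

```python
from google.cloud import bigquery

client = bigquery.Client()  # runs in your own Google Cloud project

# Reject the query up front if it would bill more than ~10 GB.
# The cap is illustrative; pick a limit that matches your budget.
job_config = bigquery.QueryJobConfig(maximum_bytes_billed=10 * 1024**3)

sql = "SELECT page, url FROM `httparchive.all.requests` LIMIT 10"  # placeholder

try:
    for row in client.query(sql, job_config=job_config).result():
        print(row.page, row.url)
except Exception as exc:
    # A query over the cap errors out instead of running (and billing).
    print(f"Query rejected: {exc}")
```

Note that `LIMIT` doesn’t reduce the bytes scanned, so a query like this against the full requests table would be rejected by the cap rather than racking up a bill.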

Do you know about the dry_run option in the Python client? Granted, the estimate is provided in bytes, but it should give you an idea of the costs.
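A minimal sketch of how that looks, assuming the google-cloud-bigquery package (the query and the price per TiB are just placeholders; check the current on-demand rate for your region):

```python
from google.cloud import bigquery

client = bigquery.Client()

# dry_run=True validates the query and reports the bytes it would process,
# without actually running it or incurring any charges.
job_config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)

sql = "SELECT page, url FROM `httparchive.all.requests`"  # placeholder query

job = client.query(sql, job_config=job_config)
gib = job.total_bytes_processed / 1024**3

# Assumed on-demand rate of roughly $6.25 per TiB scanned; verify current pricing.
estimated_cost = job.total_bytes_processed / 1024**4 * 6.25
print(f"Would scan {gib:,.1f} GiB, roughly ${estimated_cost:,.2f}")
```

Running the dry run first and sanity-checking the number is a cheap habit that would flag a multi-terabyte scan before it costs anything.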

No, the warning only relates to the query drafted in the current tab. You could have multiple queries running concurrently, but each one will incur its own separate cost.

And the way this UI is used in BigQuery is mostly to show you that the current query is syntactically valid, hence the checkmark. If there isn’t enough horizontal space, BigQuery will actually hide the estimate behind a tooltip. And if there are any syntax errors, the estimate is replaced with an error message.

It’s a few factors:

  1. The web is very big
  2. It’s getting heavier
  3. We collect a lot of metadata
  4. We’ve been at it for a long time :sweat_smile:

You can see the number of websites we test each month in this report. As of today it’s about 26 million. Recently we started testing one interior page, in addition to the home page, for each site. Last month we tested 49,802,459 pages.

It hasn’t always been this large, but it’s been running consistently for over 13 years, and for most of that time the crawl has actually run twice per month.

For each page, we save a lot of metadata about it and its resources. For text-based resources, we save their full response bodies in BigQuery too.

On top of the growth in the number of websites we test, the websites themselves are loading more and heavier resources.

As of last month, the page-level data weighs in at 38 TB and the request-level data at 204 TB, an average of 1.5 MB and 7.8 MB of data per page, respectively.
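If you want to sanity-check those sizes yourself before writing a query, the client can read a table’s metadata for free, without scanning it (the table names here are illustrative of the current pages and requests tables):

```python
from google.cloud import bigquery

client = bigquery.Client()

# get_table() only fetches metadata, so this costs nothing to run.
for table_id in ("httparchive.all.pages", "httparchive.all.requests"):
    table = client.get_table(table_id)
    tib = table.num_bytes / 1024**4
    print(f"{table_id}: {tib:,.1f} TiB across {table.num_rows:,} rows")
```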

Because it’s so data-rich, the dataset is extremely valuable for understanding the state of the web and how it’s trending, including lots of academic research use cases. But as @Tim rightly points out, the dataset’s size makes it a double-edged sword that needs to be used with extreme caution.