Warning: $14,000 BigQuery Charge in 2 Hours

This website makes it seem like this “public” dataset is for the community to use, but it is instead a for-profit moneymaker for Google Cloud, and you can lose tens of thousands of dollars.

Last week I ran a script on BigQuery for historical HTTP Archive data and was billed $14,000 by Google Cloud with zero warning whatsoever, and they won’t remove the fee.

This official website should be updated to warn people that Google is apparently now hosting this dataset to make money. I don’t think that was the original mission, but that’s what it is today: there’s basically zero customer support, and you can lose $14k in the blink of an eye.

Academics, especially grad students, need to be aware of this before they give a credit card number to Google. In fact, I’d caution against using this dataset at all with this new business model attached.

Hi @Tim, we’ve already chatted over DM, but I just wanted to reiterate a few things here for anyone else who finds this thread.

First, I’m really sorry this happened to you. I can imagine how horrifying it would be to get an unexpected bill for that amount. At your suggestion, I’ve created a PR to add a more explicit warning about this to the website’s FAQ page, which I think is where you discovered the BigQuery dataset, so everyone else will know about the risks before diving in.

99% of the people who access HTTP Archive’s data do so through the free monthly reports and annual Web Almanac reports. BigQuery is really only for the 1% of power users who need lower-level access to the raw data.

In your case, $14,000 works out to processing about 2.5 petabytes of data: at the on-demand rate of $6.25 per TiB, $14,000 pays for roughly 2,240 TiB of processing. I don’t know exactly what query you wrote to do that much work, but here’s an example of a query that incurs a similar cost:
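(A minimal sketch of such a query; the wildcard over the response_bodies tables is an illustrative assumption, not necessarily the query involved here.)

```sql
-- SELECT * over every monthly response_bodies table. The trailing wildcard
-- matches every table in the dataset, so BigQuery bills for a full scan of
-- all columns in all of them: petabytes in total.
SELECT *
FROM `httparchive.response_bodies.*`
```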

In the BigQuery console, behind the friendly-looking checkmark is a very important warning: “This query will process 2.66 PB when run.” At $6.25 per TiB, that’s about the cost you ran into. I won’t get into the dangers of SELECT * here, but you get the idea.

Again, I’m very sorry about this and I hope it doesn’t happen to anyone else. Also, as we discussed, I’ll see if there’s any way I can escalate your support ticket internally.

I hope you’re able to set up new cost controls to regain your trust in BigQuery and continue using the HTTP Archive dataset without any more surprises. And for anyone else looking to get started with the dataset, know that you do not need to provide a credit card to use BigQuery. Skipping the credit card automatically puts you in the free tier, which gives you a 1 TB/month query quota, and you’ll never be charged a cent. For tips on getting the most out of the dataset within that 1 TB/month, check out our guide on minimizing query costs.
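If you do set up billing, one concrete cost control worth enabling in your own project is a per-query cap on bytes billed. Here’s a minimal sketch, assuming the official google-cloud-bigquery Python client; the cap value and the table name are illustrative choices:

```python
from google.cloud import bigquery

client = bigquery.Client()

# Hard-cap this query at ~100 GB billed. If it would process more than that,
# BigQuery refuses to run it and raises an error instead of charging you.
job_config = bigquery.QueryJobConfig(maximum_bytes_billed=100 * 10**9)

query = """
SELECT COUNT(0) AS pages
FROM `httparchive.all.pages`
WHERE date = '2023-06-01'  -- filtering on the partition column keeps scans small
"""

for row in client.query(query, job_config=job_config).result():
    print(row.pages)
```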

How can there be 2500 TB of data about how the web is built and is evolving over time?

Hi @rviscomi,

I have a question about the warning message shown in your image. So, when we see that warning message, does it mean another query might be running concurrently? And if so, does that affect our billing? Just trying to understand how it all works.


Thanks for the very helpful response, Rick, much appreciated! I think one thing that would help is to highlight that people should enable the cost controls prior to running queries, as they are not on by default. Likewise, I was running my queries from a Python script with the official GCP libraries, and unlike the web UI there’s no mechanism to show costs for a query.

As discussed, this sucks for me (but hopefully can be fixed). My main concern, though, is students who may be using the data for research, so if there is any way to create some roadblocks or friction before somebody incurs such a massive fee, it could be useful for others.

If I had tripped a circuit breaker at $5k (or less) that cut off access until I manually confirmed I wanted to continue, that would be an ideal solution for everybody, I think.

Yeah, hopefully the changes (going live soon) will help to raise awareness of the risks and best practices for controlling costs.

Unfortunately, though, the only way to enforce the cost controls is in the individual’s own Google Cloud project. As the BigQuery dataset owner, HTTP Archive doesn’t have the ability to set those kinds of restrictions for users.

Do you know about the dry_run option in the Python client? Granted, the estimate is provided in bytes, but it should give you an idea of the costs.
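For anyone following along, a minimal sketch of what that looks like (the query text is just a placeholder):

```python
from google.cloud import bigquery

client = bigquery.Client()

# A dry run validates the query and reports how many bytes it WOULD process,
# without actually running it or incurring any cost.
job_config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)

query = "SELECT page FROM `httparchive.all.pages` WHERE date = '2023-06-01'"
job = client.query(query, job_config=job_config)

tib = job.total_bytes_processed / 2**40
print(f"Would process {tib:.2f} TiB, about ${tib * 6.25:,.2f} at $6.25/TiB")
```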

No, the warning only relates to the query drafted in the current tab. You could have multiple queries running concurrently, but they will each incur their own separate costs.

And that UI element exists mostly to show you that the current query is syntactically valid, hence the checkmark. If there isn’t enough horizontal space, BigQuery will actually hide the estimate behind a tooltip. And if there are any syntax errors, the data estimate is replaced with an error message.

It’s a few factors:

  1. The web is very big
  2. It’s getting heavier
  3. We collect a lot of metadata
  4. We’ve been at it for a long time :sweat_smile:

You can see the number of websites we test each month in this report. As of today it’s about 26 million. Recently we started testing one interior page, in addition to the home page, for each site. Last month we tested 49,802,459 pages.

It hasn’t always been this large, but it’s been running consistently for over 13 years, and throughout most of that time it’s actually run at a pace of twice per month.

For each page, we save a lot of metadata about it and its resources. For text-based resources, we save their full response bodies in BigQuery too.

On top of the growth in the number of websites we test, the websites themselves are loading more, heavier resources.

As of last month, the page-level data weighs in at 38 TB and the request-level data at 204 TB, an average of 1.5 MB and 7.8 MB of data per page, respectively.
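If you want to check the current table sizes yourself, you can do it without scanning any table data; metadata queries like this sketch cost essentially nothing (the `httparchive.all` dataset name is an assumption for illustration):

```sql
-- __TABLES__ exposes per-table metadata such as size_bytes without
-- touching the table contents themselves.
SELECT table_id, ROUND(size_bytes / POW(10, 12), 1) AS size_tb
FROM `httparchive.all.__TABLES__`
ORDER BY size_bytes DESC
```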

Because it’s so data-rich, the dataset is extremely valuable for understanding the state of the web and how it’s trending, including lots of academic research use cases. But as @Tim rightly points out, the dataset’s size makes it a double-edged sword that needs to be used with extreme caution.