Queries returning numbers larger than size of tables?

amirian · July 10, 2017, 10:55pm

I’m using the tables from the June 15, 2017 run and running into some strange numbers.

For example, when I run select count(*) from httparchive:runs.2017_06_15_pages I get 474,696 rows. I ran select count(distinct(pageid)) from httparchive:runs.2017_06_15_pages to see if there were any duplicate pageids, and got 498,599 back. My understanding is that there is one pageid per row.

I’m pretty stuck at this point. The second query shouldn’t be returning a number larger than the first, at least based on my understanding from the docs. What am I missing?

rviscomi · July 11, 2017, 5:03pm

Yeah that is strange. I rewrote your second query to group duplicate pageids, if they exist:

SELECT
  COUNT(0),
  pageid
FROM
  [httparchive:runs.2017_06_15_pages]
GROUP BY
  2
ORDER BY
  1 DESC

And there were no duplicates. Also the number of rows in the result was 474696.
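
For anyone who wants to stay in legacy SQL, here is a minimal sketch of an exact count, assuming the table is still queryable under this name. As far as I know, legacy SQL's COUNT(DISTINCT ...) switches to a statistical approximation once there are more than 1,000 distinct values, which would explain a result larger than the row count, and EXACT_COUNT_DISTINCT is the exact alternative. The column aliases are just illustrative names.

```sql
#legacySQL
-- Compare the row count with an exact and an approximate distinct count.
SELECT
  COUNT(*) AS total_rows,                        -- every row in the table
  EXACT_COUNT_DISTINCT(pageid) AS exact_pageids, -- exact, but slower on large tables
  COUNT(DISTINCT pageid) AS approx_pageids       -- approximate above the default threshold
FROM
  [httparchive:runs.2017_06_15_pages]
```

If pageid really is unique per row, total_rows and exact_pageids should match, and only approx_pageids should drift.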

amirian · July 12, 2017, 12:04am

Thanks for the modified query!

I’m really curious about the underlying reason though… any guesses as to what caused the original problem? Does it have something to do with BigQuery’s query syntax?

rviscomi · July 12, 2017, 1:06am

Yeah I think so. Just trying to do SELECT DISTINCT(pageid) ... yielded an error so it seems GROUP BY is the right way to go.

cc @fhoffa in case he has any advice

fhoffa · July 12, 2017, 1:22am

Switch to #standardSQL - COUNT(DISTINCT) is exact there, while in legacy SQL it is an approximation by default.

And to see the advantages of an approximate method:

https://medium.freecodecamp.org/counting-uniques-faster-in-bigquery-with-hyperloglog-5d3764493a5a

With #standardSQL, the approximate results are better too:

#standardSQL
select count(distinct(pageid)) 
from `httparchive.runs.2017_06_15_pages`

474696

#standardSQL
select APPROX_COUNT_DISTINCT(pageid) from `httparchive.runs.2017_06_15_pages`

473038
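
To see all of these numbers side by side, something like the following should work in standard SQL. It is only a sketch: it assumes the same table is still available, and the aliases and the relative_error column are my own additions for illustration.

```sql
#standardSQL
-- One pass over the table: row count, exact distinct count,
-- approximate distinct count, and how far off the approximation is.
SELECT
  COUNT(*) AS total_rows,
  COUNT(DISTINCT pageid) AS exact_pageids,
  APPROX_COUNT_DISTINCT(pageid) AS approx_pageids,
  SAFE_DIVIDE(
    ABS(APPROX_COUNT_DISTINCT(pageid) - COUNT(DISTINCT pageid)),
    COUNT(DISTINCT pageid)
  ) AS relative_error
FROM
  `httparchive.runs.2017_06_15_pages`
```

Going by the figures in this thread, the standard SQL approximation (473038 vs. 474696) is off by about 0.35%, while the legacy SQL result of 498599 was off by about 5%.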
