Queries returning numbers larger than size of tables?


#1

I’m using the tables from the June 15, 2017 run and running into some strange numbers.

For example, when I run select count(*) from httparchive:runs.2017_06_15_pages I get 474,696 rows. I ran select count(distinct(pageid)) from httparchive:runs.2017_06_15_pages to see if there were any pageid’s that were duplicate, and received 498599 as an answer. My understanding is that there is one pageid per row.

I’m pretty stuck at this point. The second query shouldn’t be returning a number larger than the first, at least based on my understanding from the docs. What am I missing?


#2

Yeah that is strange. I rewrote your second query to group duplicate pageids, if they exist:

SELECT
  COUNT(0),
  pageid
FROM
  [httparchive:runs.2017_06_15_pages]
GROUP BY
  2
ORDER BY
  1 DESC

And there were no duplicates. Also the number of rows in the result was 474696.


#3

Thanks for the modified query!

I’m really curious about the underlying reason though…any guesses as to why that original problem is? Does it have something to do with the querying syntax with BigQuery?


#4

Yeah I think so. Just trying to do SELECT DISTINCT(pageid) ... yielded an error so it seems GROUP BY is the right way to go.

cc @fhoffa in case he has any advice


#5

Switch to #standardSQL - COUNT DISTINCT is exact there.

And to see the advantages of an approximate method:

With #standardSQL, the approximate results are better too:

#standardSQL
select count(distinct(pageid)) 
from `httparchive.runs.2017_06_15_pages`

474696	

#standardSQL
select APPROX_COUNT_DISTINCT(pageid) from `httparchive.runs.2017_06_15_pages`

473038