I’m using the tables from the June 15, 2017 run and running into some strange numbers.
For example, when I run select count(*) from httparchive:runs.2017_06_15_pages I get 474,696 rows. To check for duplicate pageids, I ran select count(distinct(pageid)) from httparchive:runs.2017_06_15_pages and got 498,599. My understanding is that there is one pageid per row.
I’m pretty stuck at this point. The second query shouldn’t be returning a number larger than the first, at least based on my understanding from the docs. What am I missing?
I’m really curious about the underlying reason though…any guesses as to why the original problem occurs? Does it have something to do with BigQuery’s query syntax?