Null values in num_scripts_sync and num_scripts_async


#1

It appears that num_scripts_sync and num_scripts_async are null in some of the rows of summary_pages.

Are there specific circumstances in which this may happen?


#2

The req_host field in summary_requests also seems to have missing values.

numDomains from summary_pages
SELECT url, pageid, numDomains, wptid, wptrun
FROM
`httparchive.summary_pages.2018_01_01_desktop`
WHERE pageid = 84767573


numDomains calculated from summary_requests
SELECT COUNT(DISTINCT req_host) AS distinctReqHost
FROM
`httparchive.summary_requests.2018_01_01_desktop`
WHERE pageid = 84767573


Looking at the req_host values in the summary_requests table shows that some of them are missing, which leads to this mismatch.
This matters because a previous post on calculating 1st vs 3rd party requests depends on the req_host field: https://discuss.httparchive.org/t/what-is-the-distribution-of-1st-party-vs-3rd-party-resources/100.
In the above example, 244 out of 423 requests have blank req_host values.
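
The blank count can be checked directly with a query along these lines. This is a minimal sketch in BigQuery standard SQL; it assumes "blank" means either NULL or an empty string, which the thread does not specify.

```sql
-- Count all requests for the page and how many of them lack a req_host.
-- Assumption: a "blank" req_host is either NULL or the empty string.
SELECT
  COUNT(*) AS totalRequests,
  COUNTIF(req_host IS NULL OR req_host = '') AS blankReqHost
FROM
  `httparchive.summary_requests.2018_01_01_desktop`
WHERE pageid = 84767573
```

If the figures quoted above hold, blankReqHost should come out near 244 and totalRequests near 423.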

Revised numDomains calculated from summary_requests
I was able to fill in the missing values by extracting the host from url with the following query, which gives a value matching numDomains. (In the 1st vs 3rd party query, I replaced req_host with the same extraction expression.)
-- Extract "://host" from the URL, then drop the leading "://" (SUBSTR from position 4).
SELECT COUNT(DISTINCT SUBSTR(REGEXP_EXTRACT(url, r'([:]//[a-z0-9-._~%]+)'), 4)) AS actualDistinctReqHost
FROM
`httparchive.summary_requests.2018_01_01_desktop`
WHERE pageid = 84767573

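
As a possible alternative to the regular expression, BigQuery standard SQL has a built-in NET.HOST function that returns the host part of a URL. The following is only a sketch and has not been checked against this table; it may not match the regex version exactly, for example when a host contains uppercase characters.

```sql
-- Count distinct hosts derived from the request URL instead of req_host.
-- NET.HOST is a BigQuery standard SQL function; it is not available in legacy SQL.
SELECT
  COUNT(DISTINCT NET.HOST(url)) AS distinctHostFromUrl
FROM
  `httparchive.summary_requests.2018_01_01_desktop`
WHERE pageid = 84767573
```

The same expression could stand in for req_host in the 1st vs 3rd party query, in the same way as the regex extraction.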

@rviscomi is there something I am missing about req_host, num_scripts_sync, and num_scripts_async?


#3

@raghuramakrishnan - It looks like the null values for sync and async scripts were recorded for every site in the 6/1 dataset, but not in the 5/15 or 6/15 datasets. The page IDs you referenced for dhl-usa and rediffmailpro were from the 5/15 dataset.
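
A comparison like this can be made with a wildcard query over the three crawls. This is a rough sketch in BigQuery standard SQL; the table suffixes are assumed from the naming pattern used earlier in the thread and have not been verified here.

```sql
-- Per crawl: total pages and how many have null script counts.
-- Assumption: desktop tables exist for these three dates under summary_pages.
SELECT
  _TABLE_SUFFIX AS crawl,
  COUNT(*) AS pages,
  COUNTIF(num_scripts_sync IS NULL) AS nullSync,
  COUNTIF(num_scripts_async IS NULL) AS nullAsync
FROM
  `httparchive.summary_pages.2018_*`
WHERE _TABLE_SUFFIX IN ('05_15_desktop', '06_01_desktop', '06_15_desktop')
GROUP BY crawl
ORDER BY crawl
```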

I think this may be related to https://twitter.com/patmeenan/status/1011958674598875136, but @pmeenan or @rviscomi can confirm…


#4

Looks like there are more errors in the 2018-06-01 dataset: about 8000 sites have crawlid 547, which isn’t in the stats table.
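
One way to see this is to group the pages by crawlid. The query below is a sketch only; it assumes crawlid is a column of summary_pages and that the desktop table for this date follows the usual naming pattern.

```sql
-- Number of pages per crawlid in the 2018-06-01 desktop crawl.
-- Assumptions: the table name follows the pattern above and has a crawlid column.
SELECT
  crawlid,
  COUNT(*) AS pages
FROM
  `httparchive.summary_pages.2018_06_01_desktop`
GROUP BY crawlid
ORDER BY pages DESC
```

A crawl with consistent data would be expected to return a single crawlid, so any extra rows point at the stray pages.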