Null values in num_scripts_sync and num_scripts_async

The req_host field in the summary_requests table also seems to have missing values.

numDomains from summary_pages
SELECT url, pageid, numDomains, wptid, wptrun
FROM
  `httparchive.summary_pages.2018_01_01_desktop`
WHERE pageid = 84767573

numDomains calculated from summary_requests
SELECT COUNT(DISTINCT req_host) AS distinctReqHost
FROM
  `httparchive.summary_requests.2018_01_01_desktop`
WHERE pageid = 84767573

Looking at the req_host values in the summary_requests table shows that some of them are missing, which leads to this mismatch.
A previous post on calculating 1st vs 3rd party requests depends on the req_host field values (https://discuss.httparchive.org/t/what-is-the-distribution-of-1st-party-vs-3rd-party-resources/100).
In the above example, 244 out of 423 requests have blank req_host values.
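
For anyone who wants to check the blank count on their own, a query along these lines should do it. This is only a sketch: it assumes BigQuery standard SQL and that a "blank" req_host is stored either as NULL or as an empty string, which I have not confirmed. The column aliases are made up for illustration.

```sql
-- Count all requests for the page and how many of them have no req_host.
-- Assumption: a missing req_host is either NULL or an empty string.
SELECT
  COUNT(*) AS totalRequests,
  COUNTIF(req_host IS NULL OR req_host = '') AS blankReqHost
FROM
  `httparchive.summary_requests.2018_01_01_desktop`
WHERE pageid = 84767573
```

If the two numbers come back different from the 423 and 244 quoted above, the blanks may be stored in some other form (for example as whitespace).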

Revised numDomains calculated from summary_requests
I was able to fill in the missing values by extracting the host from the url field with the following query, which gives a value matching numDomains. (In the 1st vs 3rd party query, I replaced req_host with the same extraction expression.)
SELECT COUNT(DISTINCT SUBSTR(REGEXP_EXTRACT(url, r'([:]//[a-z0-9-._~%]+)'), 4)) AS actualDistinctReqHost
FROM
  `httparchive.summary_requests.2018_01_01_desktop`
WHERE pageid = 84767573
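
A possibly simpler variant of the query above, offered as a sketch rather than something I have checked against this table: BigQuery standard SQL has a built-in NET.HOST function that returns the host part of a URL, so the regex and SUBSTR could be swapped for it. The alias distinctHostFromUrl is just an illustrative name. Note that NET.HOST keeps the original letter case, so the count could differ from the regex version if the same host appears in mixed case.

```sql
-- Derive the host from the request URL instead of relying on req_host.
-- Assumption: every row has a well-formed absolute URL in the url column.
SELECT COUNT(DISTINCT NET.HOST(url)) AS distinctHostFromUrl
FROM
  `httparchive.summary_requests.2018_01_01_desktop`
WHERE pageid = 84767573
```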

@rviscomi, is there something I am missing about req_host, num_scripts_sync, and num_scripts_async?