See Changes to the HTTP Archive corpus for more info on when and why the number of URLs changed.
There is a firstHtml boolean field in the summary_requests dataset that indicates whether the request returns the root document of the page, after redirects. You can use it like this:
SELECT
  respOtherHeaders
FROM
  `httparchive.latest.summary_requests_desktop`
WHERE
  -- Keep only the request that returned each page's root document (after redirects).
  firstHtml
Or if you need access to the raw JSON request object, you can join the tables together:
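SELECT
  page,
  url,
  -- payload holds the raw JSON request object.
  JSON_EXTRACT(payload, '$._headers.response') AS all_response_headers
FROM
  `httparchive.latest.summary_requests_desktop`
JOIN
  `httparchive.latest.requests_desktop`
USING
  -- Joining on url alone may also match the same URL requested by other pages.
  (url)
WHERE
  firstHtml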