Data collection in HTTP Archive

See "Changes to the HTTP Archive corpus" for details on when and why the number of URLs changed.

The summary_requests tables have a boolean firstHtml field that indicates whether the request returned the root document of the page, after any redirects. You can filter on it like this:

SELECT
  respOtherHeaders
FROM
  `httparchive.latest.summary_requests_desktop`
WHERE
  -- Keep only requests for each page's root document
  firstHtml

Or, if you need the raw JSON request object, you can join the summary table with the requests table:

SELECT
  page,
  url,
  -- Response headers as recorded in the raw request payload
  JSON_EXTRACT(payload, '$._headers.response') AS all_response_headers
FROM
  `httparchive.latest.summary_requests_desktop`
JOIN
  `httparchive.latest.requests_desktop`
USING
  (url)
WHERE
  firstHtml
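
The all_response_headers column comes back as a JSON-encoded string, so it usually needs one more parsing step on the client. The Python below is a minimal sketch of that step, not part of HTTP Archive itself. It assumes the value is a JSON array of raw header lines such as "Content-Type: text/html" (an assumption about the payload shape, not something the query guarantees), and parse_response_headers is a hypothetical helper name.

```python
import json


def parse_response_headers(all_response_headers):
    """Turn the JSON string from JSON_EXTRACT into {name: [values]}.

    Assumption: the input is a JSON array of "Name: value" lines.
    Lines without a header name (for example an HTTP status line or an
    HTTP/2 pseudo-header such as ":status: 200") are skipped.
    """
    headers = {}
    if not all_response_headers:
        # SQL NULL arrives as None when the JSON path is missing.
        return headers
    for line in json.loads(all_response_headers) or []:
        name, sep, value = line.partition(":")
        name = name.strip().lower()
        if not sep or not name:
            continue
        # A header can repeat (e.g. Set-Cookie), so keep every value.
        headers.setdefault(name, []).append(value.strip())
    return headers


# Hypothetical sample value, made up for illustration.
sample = (
    '["HTTP/1.1 200 OK", "Content-Type: text/html; charset=utf-8", '
    '"Set-Cookie: a=1", "Set-Cookie: b=2"]'
)
print(parse_response_headers(sample))
```

Header names are lowercased because HTTP header names are case-insensitive, and values are kept in a list so repeated headers are not overwritten.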