Data collection in HTTPArchive

rviscomi · January 15, 2019, 6:19pm

See Changes to the HTTP Archive corpus for more info on when and why the number of URLs changed.

There is a firstHtml boolean field in the summary_requests dataset that indicates whether the request returns the root document of the page, after redirects. You can use it like this:

SELECT
  respOtherHeaders
FROM
  `httparchive.latest.summary_requests_desktop`
WHERE
  firstHtml

Or if you need access to the raw JSON request object, you can join the tables together:

SELECT
  page,
  url,
  JSON_EXTRACT(payload, '$._headers.response') AS all_response_headers
FROM
  `httparchive.latest.summary_requests_desktop`
JOIN
  `httparchive.latest.requests_desktop`
USING
  (url)
WHERE
  firstHtml
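
The `all_response_headers` value returned by the join is a JSON string, so it usually needs a little post-processing on the client. Here is a minimal Python sketch of that step. It assumes the string decodes to an array of raw header lines (a status line followed by `Name: value` entries); that shape is an assumption on my part, so check it against a few rows first. The function name `parse_response_headers` and the sample data are hypothetical.

```python
import json


def parse_response_headers(all_response_headers):
    """Group raw header lines by lowercased header name.

    Assumes input such as '["HTTP/1.1 200 OK", "Content-Type: text/html"]'.
    Each name maps to a list, since a header can appear more than once.
    """
    headers = {}
    for line in json.loads(all_response_headers):
        name, sep, value = line.partition(":")
        name = name.strip()
        # Skip the status line, pseudo-headers, and anything that is
        # not in "Name: value" form.
        if not sep or not name or " " in name:
            continue
        headers.setdefault(name.lower(), []).append(value.strip())
    return headers


# Hypothetical sample, not real HTTP Archive data.
sample = (
    '["HTTP/1.1 200 OK", "Content-Type: text/html", '
    '"Set-Cookie: a=1", "Set-Cookie: b=2"]'
)
print(parse_response_headers(sample))
```

Values are kept as lists rather than joined strings so that repeated headers such as `Set-Cookie` stay separate.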
