Pages and Requests later than Apr 2022 in gs://httparchive/downloads/

Hello,
I used to download the list of pages and requests from gs://httparchive/downloads/, where there are files like httparchive_Sep_1_2021_requests.gz.
However, there is no data newer than April 2022. Has the data moved, or is it no longer offered?
Thank you in advance!
M

Hi @marty90, yeah unfortunately HTTP Archive no longer provides downloadable summaries on GCS. Would you be able to tell us a bit more about your use case?

All of the data is still available on BigQuery, so as a workaround, you can export the data manually.

Note that this may be an expensive operation, given the size of the tables.

  1. Go to https://console.cloud.google.com/bigquery
  2. Navigate to a summary table like httparchive.summary_pages.2023_10_01_desktop
  3. Select “Export to GCS” and select your project’s own GCS bucket

Once it’s on GCS you should be able to process the data like before.
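
If you would rather script the export than click through the console, BigQuery's EXPORT DATA statement can write a table to GCS from a query. This is only a minimal sketch: the bucket name `your-bucket` and the object path are hypothetical placeholders, and it assumes the bucket is in a location compatible with the dataset and that the summary table has no nested columns (otherwise CSV will not work and a format such as JSON would be needed). Because this runs as a query, it is billed for the bytes it reads from the table.

```sql
-- Sketch: export one summary table to your own GCS bucket as gzipped CSV.
-- 'your-bucket' and the path below are placeholders, not real locations.
EXPORT DATA OPTIONS (
  -- The URI must contain a single * so the output can be split into several files
  uri = 'gs://your-bucket/httparchive/summary_pages_2023_10_01_desktop_*.csv.gz',
  format = 'CSV',
  compression = 'GZIP',
  header = true,
  overwrite = false
) AS
SELECT
  *
FROM
  `httparchive.summary_pages.2023_10_01_desktop`;
```

The statement has to be run from your own GCP project, since that is where the query is billed.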

Dear @rviscomi thank you for your help.
Actually, we are working on studying the pervasiveness of Consent Management Platforms (e.g., Iubenda) over time.
So having the summary of pages and requests is fine for us.
I can try to export the tables using BigQuery.
But is it a paid service? Is there any free option to get the data?
Thank You
Ciao!
M

GCP does offer a free tier which may or may not be enough for what you’re trying to do.

If you’re comfortable exploring the data directly on BigQuery, you can try some of these techniques to minimize query costs.

For example, the free tier of BigQuery includes 1 TB of query processing per month, and this query consumes only 40 MB:

-- Top CMPs used on the 100k most popular websites' home pages
SELECT
  t.technology,
  COUNT(0) AS sites
FROM
  `httparchive.all.pages`,
  UNNEST(technologies) AS t,
  UNNEST(t.categories) AS category
WHERE
  -- Only the October 2023 dataset
  date = '2023-10-01' AND
  -- Only mobile pages
  client = 'mobile' AND
  -- Only home pages
  is_root_page AND
  -- Only websites ranked in the top 100k
  rank <= 100000 AND
  category = 'Cookie compliance'
GROUP BY
  technology
ORDER BY
  sites DESC

technology               sites
OneTrust                 5,364
Funding Choices          4,720
Didomi                   1,128
Cookiebot                1,037
iubenda                    404
Conversant Consent Tool    401
Usercentrics               400
CookieYes                  356
AdRoll CMP System          241
TrustArc                   220

If you’re only interested in tracking CMP adoption, then the Core Web Vitals Tech Report should get you what you need for free. For example, this dashboard produces results that are effectively the same as the direct query. (The differences are a side effect of joining HTTP Archive with the CrUX dataset).

Technology               Origins
OneTrust                   5,739
Funding Choices            5,139
Didomi                     1,204
Cookiebot                  1,138
Conversant Consent Tool      457
Usercentrics                 428
iubenda                      411
CookieYes                    353
AdRoll CMP System            255
TrustArc                     241

Anything more complicated and you’ll probably need to query the data directly on BigQuery. If you need to stay within the free tier, there are some tricks/tradeoffs you can make to optimize the queries for efficiency.
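
As a rough illustration of tracking CMP adoption over time, the query above can be extended by grouping on the crawl date as well. This is a sketch, not a tested report: the date range is an arbitrary assumption, and the bytes processed will grow with the number of months covered, so it is worth checking the estimate in the BigQuery console before running it.

```sql
-- Sketch: CMP adoption per monthly crawl on the top 100k home pages.
-- The date range is an arbitrary example; narrow it to reduce bytes processed.
SELECT
  date,
  t.technology,
  COUNT(0) AS sites
FROM
  `httparchive.all.pages`,
  UNNEST(technologies) AS t,
  UNNEST(t.categories) AS category
WHERE
  -- Restrict the scan to the crawls of interest
  date BETWEEN '2023-01-01' AND '2023-10-01' AND
  client = 'mobile' AND
  is_root_page AND
  rank <= 100000 AND
  category = 'Cookie compliance'
GROUP BY
  date,
  technology
ORDER BY
  date,
  sites DESC
```

Keeping the filters on date, client, is_root_page, and rank is the main tradeoff here: it keeps the scan small, at the price of only covering popular home pages.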

Thank you.
It is a bit surprising that there is no way to download the data without paying…
We will try the queries you suggested.
Question: you use Wappalyzer to extract the “Technology” list from a website, right?

Yes, although we’re using a somewhat stale copy of the Wappalyzer detections as of their final open source commit a few months ago.

Perfect, thank you. We will try running the SQL queries on Google Cloud!