I used to download the list of pages and requests from gs://httparchive/downloads/, which contains files like httparchive_Sep_1_2021_requests.gz
However, there is no data newer than April 2022. Has the data been moved, or is it no longer offered?
Thank you in advance!
Hi @marty90, yeah unfortunately HTTP Archive no longer provides downloadable summaries on GCS. Would you be able to tell us a bit more about your use case?
All of the data is still available on BigQuery, so as a workaround, you can export the data manually.
Note that this may be an expensive operation, given the size of the tables.
- Go to https://console.cloud.google.com/bigquery
- Navigate to a summary table like
- Select “Export to GCS” and select your project’s own GCS bucket
Once it’s on GCS you should be able to process the data like before.
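For anyone who prefers scripting the export over clicking through the console, here is a minimal sketch using the `google-cloud-bigquery` Python client. It is only an illustration under assumptions: the table ID and bucket name are placeholders you must supply yourself, the `httparchive_export` prefix and both helper names are made up for this example, and CSV output assumes the table has no nested or repeated columns (BigQuery cannot export those to CSV).

```python
def export_uri(bucket: str, table_name: str, prefix: str = "httparchive_export") -> str:
    """Build a wildcard GCS URI so a large export can be split into many files.

    The prefix is a hypothetical folder name chosen for this example.
    """
    return f"gs://{bucket}/{prefix}/{table_name}-*.csv.gz"


def export_table_to_gcs(table_id: str, bucket: str) -> None:
    """Export one BigQuery table to gzip-compressed CSV files in your own bucket.

    table_id is a full "project.dataset.table" string (placeholder, not from the thread).
    """
    # Imported lazily: needs `pip install google-cloud-bigquery` and GCP credentials.
    from google.cloud import bigquery

    client = bigquery.Client()  # runs the job in your own default project

    job_config = bigquery.ExtractJobConfig()
    job_config.destination_format = bigquery.DestinationFormat.CSV
    job_config.compression = bigquery.Compression.GZIP

    table_name = table_id.split(".")[-1]
    job = client.extract_table(
        table_id,
        export_uri(bucket, table_name),
        job_config=job_config,
    )
    job.result()  # block until the export job finishes


if __name__ == "__main__":
    # Only the URI helper runs without credentials.
    print(export_uri("my-bucket", "some_table"))
```

The wildcard in the destination URI matters because BigQuery refuses to write more than 1 GB to a single file, so large tables have to be sharded.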
Dear @rviscomi thank you for your help.
Actually, we are working on studying the pervasiveness of Consent Management Platforms (e.g., Iubenda) over time.
So having the summary of pages and requests is fine for us.
I can try to export the tables using BigQuery.
But is it a paid service? Is there any free option to get the data?
GCP does offer a free tier which may or may not be enough for what you’re trying to do.
If you’re comfortable exploring the data directly on BigQuery, you can try some of these techniques to minimize query costs.
For example, the free tier of BigQuery includes 1 TB/month and this query consumes only 40 MB:
    -- Top CMPs used on the 100k most popular websites' home pages
    SELECT
      t.technology,
      COUNT(0) AS sites
    FROM
      `httparchive.all.pages`,
      UNNEST(technologies) AS t,
      UNNEST(t.categories) AS category
    WHERE
      -- Only the October 2023 dataset
      date = '2023-10-01' AND
      -- Only mobile pages
      client = 'mobile' AND
      -- Only home pages
      is_root_page AND
      -- Only websites ranked in the top 100k
      rank <= 100000 AND
      category = 'Cookie compliance'
    GROUP BY
      technology
    ORDER BY
      sites DESC
| technology | sites |
|---|---|
| Conversant Consent Tool | 401 |
| AdRoll CMP System | 241 |
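To put the 40 MB figure in perspective against a 1 TB/month allowance, here is a rough sketch of the arithmetic, plus a way to check how many bytes a query would scan before running it. It assumes decimal units (1 TB = 10^12 bytes) for simplicity, and the function names are invented for this example. The dry-run part relies on the `google-cloud-bigquery` client and needs your own GCP credentials.

```python
FREE_TIER_BYTES_PER_MONTH = 10**12  # 1 TB/month as quoted above (decimal units assumed)
EXAMPLE_QUERY_BYTES = 40 * 10**6    # the ~40 MB query above


def queries_within_free_tier(bytes_per_query: int,
                             monthly_quota: int = FREE_TIER_BYTES_PER_MONTH) -> int:
    """Return how many queries of a given size fit in the monthly quota."""
    if bytes_per_query <= 0:
        raise ValueError("bytes_per_query must be positive")
    return monthly_quota // bytes_per_query


def estimate_query_bytes(sql: str) -> int:
    """Ask BigQuery how many bytes a query would process, without running it."""
    # Imported lazily: needs `pip install google-cloud-bigquery` and GCP credentials.
    from google.cloud import bigquery

    client = bigquery.Client()
    job_config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)
    job = client.query(sql, job_config=job_config)  # a dry run returns immediately
    return job.total_bytes_processed


if __name__ == "__main__":
    print(queries_within_free_tier(EXAMPLE_QUERY_BYTES))  # prints 25000
```

A dry run is not billed, so it is a cheap way to compare a narrow query like the one above against a broader one before spending any quota.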
If you’re only interested in tracking CMP adoption, then the Core Web Vitals Tech Report should get you what you need for free. For example, this dashboard produces results that are effectively the same as the direct query. (The differences are a side effect of joining HTTP Archive with the CrUX dataset.)
|Conversant Consent Tool||457|
|AdRoll CMP System||255|
Anything more complicated and you’ll probably need to query the data directly on BigQuery. If you need to stay within the free tier, there are some tricks/tradeoffs you can make to optimize the queries for efficiency.
It is a bit surprising that there is no way to download the data without paying…
We will try the queries you suggested.
Question: you use Wappalyzer to extract the “Technology” list from a website, right?
Yes, although we’re using a somewhat stale copy of the Wappalyzer detections as of their final open source commit a few months ago.
Perfect, thank you. We will try using the SQL queries on Google Cloud!