New Release: httparchive.crawl Dataset
We’re thrilled to announce the launch of the new `httparchive.crawl` dataset, now the default and only dataset for HTTP Archive crawl data. This update simplifies data access and brings significant improvements.
After November 2024, this will be the only dataset updated with our monthly crawls, and after February 2025 we plan to delete the older datasets to avoid confusion.
What’s New:
- Complete History: All HTTP Archive crawl data since June 2011 is now in a single dataset.
- Lean & Fast: Cleanup and deduplication make `crawl.pages` 40% smaller and `crawl.requests` 15% smaller on average.
- Secondary Pages: Since 2022, we’ve been testing secondary pages (`is_root_page = FALSE`), offering deeper insights.
- Partitioned & Clustered: Optimized performance with partitioning by date and up to 4 clustered columns (e.g., `client`, `rank`, `is_root_page`).
- Efficient Queries:
  - Drill into specific pages or domains (e.g., `example.com`) in seconds, scanning just ~10MB.
  - Focus on specific data types (e.g., images) with clustered `type` columns in `crawl.requests`.
- Streamlined Data:
  - HTTP headers are now easy-to-use RECORD types.
  - JSON data is stored as native types: no need for `JSON_QUERY` or `JSON_EXTRACT_SCALAR`!
    -- Read a JSON field directly; the date filter limits the scan to one partition
    SELECT
      page,
      summary.bytesHtml
    FROM httparchive.crawl.pages
    WHERE
      date = '2024-10-01' AND
      rank <= 1000
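The RECORD-typed headers mentioned above can be read with a plain `UNNEST`. The query below is a minimal sketch, not an official recipe: it assumes `crawl.requests` exposes a `response_headers` array of records with `name` and `value` fields, an `is_main_document` flag, and `'mobile'` as a valid `client` value. None of these names are spelled out in this post, so check them against the table schema before relying on them.

```sql
-- Sketch: Content-Type header of the main document for top-ranked root pages.
-- Assumed (not confirmed in this post): response_headers, name, value,
-- is_main_document, and the 'mobile' client value.
SELECT
  page,
  header.value AS content_type
FROM httparchive.crawl.requests,
  UNNEST(response_headers) AS header
WHERE
  date = '2024-10-01' AND
  client = 'mobile' AND
  is_root_page = TRUE AND
  is_main_document = TRUE AND
  rank <= 1000 AND
  LOWER(header.name) = 'content-type'
```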
  - Popular `custom_metrics` columns are now separate, slashing query costs (e.g., 90GB vs. 9TB for some queries).
    -- Select only the cookies custom metric instead of the whole custom_metrics record
    SELECT
      client,
      rank,
      page,
      custom_metrics.cookies
    FROM httparchive.crawl.pages
    WHERE
      date = '2024-10-01'
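To make the partitioning, clustering, and secondary-page points from the list above more concrete, here are two hedged sketches. The first counts root versus secondary pages for one crawl; the second lists image requests for a single page. The `'mobile'` and `'desktop'` client values, the `'image'` type value, the `url` column, and the exact form of the page URL are assumptions for illustration.

```sql
-- Sketch 1: root vs. secondary pages in one crawl.
-- The date filter restricts the scan to a single partition.
SELECT
  is_root_page,
  COUNT(*) AS pages
FROM httparchive.crawl.pages
WHERE
  date = '2024-10-01' AND
  client = 'mobile'          -- assumed client value
GROUP BY
  is_root_page;

-- Sketch 2: image requests made by one page.
-- Filtering on clustered columns (client, is_root_page, type) keeps the scan small.
SELECT
  url,                       -- assumed column name
  type
FROM httparchive.crawl.requests
WHERE
  date = '2024-10-01' AND
  client = 'desktop' AND     -- assumed client value
  is_root_page = TRUE AND
  page = 'https://example.com/' AND  -- assumed URL form
  type = 'image';            -- assumed type value
```

Keeping the `date` filter in every query is the main design choice here: without it, BigQuery has to consider every monthly partition since 2011.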
What’s Next?
- Explore: Review our migration guide and test-drive the updated data!
- Update Your Queries: Adjust your queries to migrate from legacy datasets (`all`, `pages`, `summary_pages`, `requests`, `summary_requests`, `response_bodies`, `lighthouse`, `technologies`) to `crawl.pages` and `crawl.requests`.
- Plan Ahead:
  - Legacy datasets won’t update after November 2024.
  - Public access to legacy datasets ends February 2025.
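As a rough illustration of what a migration can look like, here is a before-and-after sketch. It assumes the legacy `httparchive.all.pages` table stored `summary` as a JSON-formatted string and had the same `date` and `rank` columns. The migration guide is the authoritative reference, so treat this only as an example of the pattern.

```sql
-- Before (legacy table; summary assumed to be a JSON-formatted STRING).
-- The path is backtick-quoted because ALL is a reserved keyword.
SELECT
  page,
  JSON_EXTRACT_SCALAR(summary, '$.bytesHtml') AS bytes_html
FROM `httparchive.all.pages`
WHERE
  date = '2024-10-01' AND
  rank <= 1000;

-- After (new dataset; summary is a native JSON column).
SELECT
  page,
  summary.bytesHtml AS bytes_html
FROM httparchive.crawl.pages
WHERE
  date = '2024-10-01' AND
  rank <= 1000;
```

Note that the new query returns `bytes_html` as a JSON value rather than a string. If a typed value is needed, BigQuery's JSON conversion functions, such as `INT64()`, may be applied on top of the field access.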
Questions?
We’re here to help! Join the discussion in Slack, this forum, or the GitHub project for support.
Let’s unlock more accessible insights!
— HTTP Archive Maintainers Team