New Release: httparchive.crawl Dataset
We’re thrilled to announce the launch of the new httparchive.crawl dataset, now the default dataset for HTTP Archive crawl data and soon the only one. This update simplifies data access and brings significant improvements.
After November 2024, this will be the only dataset updated with our monthly crawls, and after February 2025 we plan to delete the older datasets to avoid confusion.
What’s New:
- Complete History: All HTTP Archive crawl data since June 2011 is now in a single dataset.
- Lean & Fast: Cleanup and deduplication make `crawl.pages` 40% smaller and `crawl.requests` 15% smaller on average.
- Secondary Pages: Since 2022, we’ve been testing secondary pages (`is_root_page = FALSE`), offering deeper insights.
- Partitioned & Clustered: Optimized performance with partitioning by date and up to 4 clustered columns (e.g., `client`, `rank`, `is_root_page`).
- Efficient Queries:
  - Drill into specific pages or domains (e.g., `example.com`) in seconds, scanning just ~10MB.
  - Focus on specific data types (e.g., images) with clustered `type` columns in `crawl.requests`.
- Streamlined Data:
  - HTTP headers are now easy-to-use RECORD types.
  - JSON data is stored as native types, so there is no need for `JSON_QUERY` or `JSON_EXTRACT_SCALAR`!
SELECT
page,
summary.bytesHtml
FROM httparchive.crawl.pages
WHERE
date = '2024-10-01' AND
rank <= 1000
  - Popular `custom_metrics` columns are now separate, slashing query costs (e.g., 90GB vs. 9TB for some queries).
SELECT
client,
rank,
page,
custom_metrics.cookies
FROM httparchive.crawl.pages
WHERE
date = '2024-10-01'
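The clustering and RECORD-typed headers described above can be combined in a single query. The following is a minimal sketch, not an official example: it assumes that `crawl.requests` has `page` and `url` columns, that `'mobile'` and `'image'` are valid values for `client` and `type`, and that `response_headers` is a repeated record with `name` and `value` fields. Check the table schema before relying on any of these names.

```sql
-- Sketch: image requests made by one page, with their Content-Type header.
-- Names other than date, client, is_root_page, and type are assumptions;
-- verify them against the crawl.requests schema.
SELECT
  url,
  header.value AS content_type
FROM httparchive.crawl.requests,
  UNNEST(response_headers) AS header   -- one row per (request, header) pair
WHERE
  date = '2024-10-01' AND              -- partition filter: scan a single crawl
  client = 'mobile' AND                -- clustered column (assumed value)
  is_root_page AND                     -- use NOT is_root_page for secondary pages
  type = 'image' AND                   -- clustered column (assumed value)
  page = 'https://example.com/' AND    -- assumed to hold the full page URL
  LOWER(header.name) = 'content-type'  -- header names are case-insensitive
```

Filtering on the partition column and the clustered columns first is what should keep the scanned bytes low; the header match is only applied to the rows that survive those filters.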
What’s Next?
- Explore: Review our migration guide and test-drive the updated data!
- Update Your Queries: Adjust your queries to migrate from legacy datasets (`all`, `pages`, `summary_pages`, `requests`, `summary_requests`, `response_bodies`, `lighthouse`, `technologies`) to `crawl.pages` and `crawl.requests`.
- Plan Ahead:
  - Legacy datasets won’t update after November 2024.
  - Public access to legacy datasets ends February 2025.
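
As a rough illustration of the query update described above, the sketch below rewrites a legacy per-crawl summary query against `crawl.pages`. The legacy table name and its `url`, `bytesHtml`, and `rank` columns are assumptions about the old layout, and `'mobile'` is an assumed `client` value; the migration guide remains the authoritative mapping.

```sql
-- Legacy form (assumed naming: one table per crawl date and client):
--   SELECT url, bytesHtml
--   FROM `httparchive.summary_pages.2024_10_01_mobile`
--   WHERE rank <= 1000

-- Sketch of an equivalent query against the new dataset: the crawl date and
-- client become column filters instead of parts of the table name.
SELECT
  page,
  summary.bytesHtml
FROM httparchive.crawl.pages
WHERE
  date = '2024-10-01' AND
  client = 'mobile' AND   -- assumed value
  is_root_page AND        -- drop this filter to include secondary pages
  rank <= 1000
```

Always including a `date` filter matters more in the new layout, because a query without one would scan every crawl since 2011 rather than a single table.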
Questions?
We’re here to help! Join the discussion in Slack, this forum, or the GitHub project for support.
Let’s unlock more accessible insights!
— HTTP Archive Maintainers Team