Announcing HTTP Archive Beta

rviscomi · October 23, 2017, 11:17pm

Today we’re excited to announce the open beta for the new HTTP Archive website. It’s more than just a fresh coat of paint: we’ve upgraded how you research web stats and trends with combined desktop and mobile reporting, introduced a new report structure and new data visualization capabilities, improved the documentation, made the site mobile friendly, and lots more.

Best of all, new reports are now powered directly by our public dataset on BigQuery. This change lets us generate reports, and iterate on their definitions, much faster, drawing on a much richer set of metrics available in the HAR files, Lighthouse reports, and deep trace files that we record as part of the crawl. With this new infrastructure we can more easily surface new and more powerful reports. For example, in the initial beta release we’re adding metrics like Time to Consistently Interactive (TTCI) and First Meaningful Paint (FMP).

Give the new site a try by visiting https://beta.httparchive.org.

If you have any suggestions or feedback, please let us know here, or open an issue on GitHub. We’d love to hear your ideas on how we can improve HTTP Archive. And if you’re feeling motivated, the code is all open source and we’d welcome your help and contributions!

Rick, on behalf of the HTTP Archive team

charlie.clark · October 24, 2017, 9:30am

Hi Rick,

looks good even if at the moment all the charts have an element of sameness. I always loved the quick overview of the classic site.

Are the raw datasets still going to be available as downloads? Or will I need to write my own import routine?

rviscomi · October 24, 2017, 5:11pm

The legacy website at httparchive.org will stick around for a while longer, so you can continue to use all of the features there.

One amenity we’ve added with the beta website is a CDN wrapper around the public Google Cloud Storage bucket. So you can access the raw datasets if you know the URL pattern, for example https://cdn.httparchive.org/Oct_1_2017/pages.csv.gz. The subdirectory name is the month, day, and year of the crawl, optionally prefixed by “mobile_”. There will always be one pages.csv.gz file. The requests files are split into 1 GB chunks named requests_aa.gz, requests_ab.gz, etc. These are also gzipped CSV files, but the .csv part of the file extension is missing. There may be 6-8 of these. I wish there was a directory-level file explorer, but I hope this helps.
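To make the pattern concrete, here is a rough Python sketch that builds the URLs for a given crawl date. Treat it as an illustration rather than an official client: the helper names are made up, the three-letter month abbreviations are extrapolated from the single "Oct" example, and the chunk suffixes are assumed to continue alphabetically after "ab". It only builds strings, so you would still need to check which chunk URLs actually exist.

```python
from datetime import date

CDN_BASE = "https://cdn.httparchive.org"

# Assumption: every month uses a three-letter English abbreviation,
# extrapolated from the one known example ("Oct_1_2017").
MONTHS = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]


def crawl_dir(crawl_date: date, mobile: bool = False) -> str:
    """Return the subdirectory name, e.g. 'Oct_1_2017' or 'mobile_Oct_1_2017'."""
    name = f"{MONTHS[crawl_date.month - 1]}_{crawl_date.day}_{crawl_date.year}"
    return f"mobile_{name}" if mobile else name


def pages_url(crawl_date: date, mobile: bool = False) -> str:
    """URL of the single pages.csv.gz file for a crawl."""
    return f"{CDN_BASE}/{crawl_dir(crawl_date, mobile)}/pages.csv.gz"


def requests_urls(crawl_date: date, mobile: bool = False, chunks: int = 8) -> list[str]:
    """Candidate URLs for the requests chunks (requests_aa.gz, requests_ab.gz, ...).

    The real number of chunks varies per crawl, so some of these may not exist.
    """
    urls = []
    for i in range(chunks):
        # Assumption: suffixes run aa, ab, ac, ... in alphabetical order.
        suffix = chr(ord("a") + i // 26) + chr(ord("a") + i % 26)
        urls.append(f"{CDN_BASE}/{crawl_dir(crawl_date, mobile)}/requests_{suffix}.gz")
    return urls


if __name__ == "__main__":
    print(pages_url(date(2017, 10, 1)))
    for url in requests_urls(date(2017, 10, 1), mobile=True, chunks=3):
        print(url)
```

Since the chunk count isn't fixed, one approach is to request the candidate URLs in order and stop at the first one that is missing.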

charlie.clark · October 24, 2017, 8:39pm

Thanks for the info. I only really need access to pages so the wrapper is great. Any chance of renaming the order folder using ISO format, e.g. 2017-10-01?
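In the meantime the existing names can be mapped to ISO dates on the client side. A minimal Python sketch, assuming folder names always look like Oct_1_2017 with an optional mobile_ prefix and three-letter English month abbreviations (the function name is just for illustration):

```python
from datetime import date

# Assumption: months are three-letter English abbreviations, as in "Oct_1_2017".
MONTHS = {name: number for number, name in enumerate(
    ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
     "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"], start=1)}


def folder_to_iso(folder: str) -> str:
    """Convert 'Oct_1_2017' or 'mobile_Oct_1_2017' to '2017-10-01'."""
    prefix = "mobile_"
    if folder.startswith(prefix):
        folder = folder[len(prefix):]
    month, day, year = folder.split("_")
    return date(int(year), MONTHS[month], int(day)).isoformat()


if __name__ == "__main__":
    print(folder_to_iso("Oct_1_2017"))
```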

For requests, BigQuery is probably the way to go anyway.

rviscomi · October 24, 2017, 8:53pm

Totally agree that the naming scheme could be more consistent and predictable, but there's a lot of inertia behind keeping it the way it is tbh.
