Feature to view website trends MoM or YoY - GOING AWAY


#1

https://legacy.httparchive.org/viewsite.php?pageid=87266328 is an extremely good feature where a developer can check how a website has been faring over a period of time and do a good amount of analysis around it.
I hear that we are phasing this feature out. It would be terrible if you guys did that. Please don’t :frowning:
There really isn’t any other way to look at trends besides this.


#2

Could you elaborate on what features in particular were useful to you? E.g. particular stats, metrics, or something else?


#3

I’ve also heard @addyosmani, @paulcalvano, and others find this feature useful.

You can always query the raw data on BigQuery but I agree a UI is much easier and quicker to work with.

Perhaps I can look into a tutorial for standing up a personalized dashboard in Data Studio based on the raw data. This would be much more customizable than the current /viewsite.php page, which shows all charts.

In the corresponding email thread I noted that this page represents 10% of total traffic to the legacy website, or 5% if you ignore the automated views that a couple of particular pageids have been getting. Still, I think that’s a significant enough amount of traffic to warrant some kind of support plan going forward.
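
In the meantime, for anyone comfortable with BigQuery, here’s a minimal sketch of pulling one site’s trend out of the summary tables (the column names come from the summary_pages schema; the URL is a placeholder, and the exact url format needs to match how the crawl recorded it):

```sql
-- Page weight, request count, and load time over time for a single site.
-- Note: scanning `summary_pages.*` reads these columns from every crawl
-- table, so the billed bytes add up.
SELECT
  _TABLE_SUFFIX AS crawl,  -- e.g. 2018_04_15_desktop
  reqTotal,
  bytesTotal,
  onLoad
FROM `httparchive.summary_pages.*`
WHERE url = 'https://www.example.com/'  -- placeholder URL
  AND _TABLE_SUFFIX LIKE '%desktop'
ORDER BY crawl
```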


#4

Actually I think it makes much more sense to work with normalised data for this kind of query. I haven’t been keeping the data snapshots up to date, but this kind of report is still available on mamasnewbag.org (already ported to Python), as I use the underlying Postgres database for my own benchmark reports (in Excel via Pandas, but that’s another story).
If you want to keep the reports around then I’d be happy to contribute, but I would strongly recommend normalising the schema: it saves space and makes some queries a lot faster.
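
To make that concrete, here’s a minimal sketch of the kind of layout I mean (table and column names are illustrative, not my actual schema); most of the space saving comes from storing each URL once:

```sql
-- One row per site, one row per (site, crawl) measurement.
CREATE TABLE sites (
  site_id serial PRIMARY KEY,
  url     text UNIQUE NOT NULL
);

CREATE TABLE page_summaries (
  site_id     integer NOT NULL REFERENCES sites (site_id),
  crawl_date  date    NOT NULL,  -- a real date object, not a label string
  req_total   integer,
  bytes_total bigint,
  on_load_ms  integer,
  PRIMARY KEY (site_id, crawl_date)
);

-- A per-site trend is then an indexed range scan, not a full scan.
SELECT crawl_date, bytes_total
FROM page_summaries
JOIN sites USING (site_id)
WHERE url = 'https://www.example.com/'
ORDER BY crawl_date;
```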


#5

Yeah the summary_pages table contains pretty much everything you see on the legacy website graphs. For basic perf metrics, that’s the table I’d recommend people query.

The limitation of that table, of course, is that it’s just a summary. If you want deeper or more modern insights like Lighthouse data or raw HAR metrics, you need to query the larger, more expensive JSON tables (pages, requests, lighthouse) or the gigantic response_bodies tables.
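
For example, getting a Lighthouse score means digging into the JSON report column (a sketch; the exact JSON path depends on the Lighthouse version that produced the report, and the URL is a placeholder):

```sql
-- Each row's report column holds the full Lighthouse JSON for one page.
SELECT
  url,
  JSON_EXTRACT_SCALAR(report, '$.categories.performance.score') AS perf_score
FROM `httparchive.lighthouse.2018_04_01_mobile`
WHERE url = 'https://www.example.com/'  -- placeholder URL
```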

That’s also why the new HTTP Archive reports rely much less on the summary tables: we want to surface some of the richer and more modern metrics.


#6

Horses for courses: anything that runs regularly should be normalised. The pages table is great for queries across websites and can be extended fairly easily.


#7

Yeah I can see the case for adding more things to the summary table as they become top-level metrics on the web UI.

There’s a caveat to that though, which is that modifications to the table schema will break some existing queries. For example, if we add lighthouse_score to the newest summary_pages tables, we would be unable to use a wildcard query like SELECT lighthouse_score FROM summary_pages.* because that field would be missing from older tables. It wouldn’t help even if we used a clause like WHERE _TABLE_SUFFIX > '2018_04_01'. A workaround would be to update all existing tables to use the new schema and fall back to null values. Not sure of the exact resource costs to do that for hundreds of tables.
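
Another way around it, without touching the old tables, would be to union them explicitly and pad the missing column with NULLs (a sketch; lighthouse_score is still hypothetical and the table names are illustrative), though spelling that out for hundreds of tables is its own maintenance burden:

```sql
-- The newer table has lighthouse_score; the older one is padded with NULL.
SELECT label, lighthouse_score
FROM `httparchive.summary_pages.2018_05_01_desktop`
UNION ALL
SELECT label, NULL AS lighthouse_score
FROM `httparchive.summary_pages.2018_04_15_desktop`
```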


#8

Yes, but that has always been the case. New metrics have been added and older ones retired over time. Gets a lot easier to deal with once real date objects are used in the schema:
SELECT lighthouse_score FROM summary_pages WHERE label > '2018-01-01'. And, to reiterate the original question, we’re not just talking about the summary tables but about the results for individual websites.
I think some of the recent queries that take advantage of the extended datasets are fascinating and very useful, but when it comes to tracking changes in the behaviour of individual websites, direct comparisons over time are essential.


#9

The summary tables summarize each individual website’s results.

We use dated tables, so all date fields within any one table would hold the same value. Unless your suggestion is to merge everything into a single table, which may be worth exploring. @fhoffa had a similar suggestion once before. I’m just concerned about how to reconcile that with other user feedback that the tables are too large/expensive to query.
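
Date partitioning might reconcile the two, since a query that filters on the partition column only scans (and bills for) the partitions it touches. A hypothetical sketch (table and column names are made up):

```sql
-- Hypothetical merged table, partitioned on the crawl date.
CREATE TABLE `httparchive.summary.pages` (
  crawl_date  DATE,
  url         STRING,
  req_total   INT64,
  bytes_total INT64
)
PARTITION BY crawl_date;

-- Only partitions from 2018 onwards are scanned and billed.
SELECT crawl_date, bytes_total
FROM `httparchive.summary.pages`
WHERE url = 'https://www.example.com/'
  AND crawl_date >= DATE '2018-01-01'
```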


#10

I wonder if we’re talking at cross-purposes here. I still don’t do anything with BigQuery, so I’m not familiar with its structure and I don’t know how suitable it is for this kind of task, whereas I know relational tables are. I rely on importing the CSV files into Postgres and running queries on tables slightly modified from the “official” MySQL one, which, for example, isn’t normalised and doesn’t use date objects. If you want to compare data for, say, google.com with amazon.com over time, then map/reduce is not the way to go, as Paul notes. All my queries run in ms.
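
For reference, the import side is nothing exotic (a sketch; the file name and index columns are illustrative, and \copy is the psql client command):

```sql
-- Load one crawl's CSV dump into the (modified) summary table.
\copy summary_pages FROM 'summary_pages_2018_04_15.csv' WITH (FORMAT csv, HEADER true)

-- Index the columns the trend queries filter and sort on; this is what
-- keeps the per-site lookups in the millisecond range.
CREATE INDEX IF NOT EXISTS summary_pages_url_date_idx
  ON summary_pages (url, crawl_date);
```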


#11

Ok, yeah I’m referring to the BigQuery interface that the rest of us use :grin:


#12

The “Trends” and “Compare to” features of the legacy “Websites” section are extremely useful and it will be unfortunate if this goes away. These features allow a user to view trend graphs for page weight, requests by content type, etc. When an anomaly is detected, one can go back to the WPT entry to see exactly what contributed to the anomaly.

While it is possible to extract this information from BigQuery, it adds a technical hurdle for those who don’t use BigQuery. @rviscomi - I really like the Data Studio idea!

I still see value in continuing to make the summary trends reports available. If it’s possible to add key metrics to the summary table (from the more expensive-to-query tables mentioned above, and without the breakage or cost issues), then the value of this feature could be increased over time as well.