What are the effects of the COVID-19 pandemic on web usage and experience?

The world has changed a lot in the past few months with the spread of the coronavirus. The web has been one area that has seen a dramatic shift, with so many people working and learning from home and using it to stay up to date with the latest news and information.

I’m especially curious what kinds of trends the web community can identify using the millions of websites in the HTTP Archive and Chrome UX Report transparency datasets. We can use this data to get some clues to answer the question at the top of this thread: What are the effects of the COVID-19 pandemic on web usage and experience?

There is so much metadata available in these datasets that I think there’s a lot we can learn from this. Some findings may be purely coincidental, but it would be interesting to see the correlations even if we can’t definitively prove causation. Here are some ideas to pique your curiosity:

  • Has the number of websites grown, indicating more widespread use of the web? (see the sketch after this list)
  • Are more local/state/federal/international websites appearing in the dataset?
  • Are there dramatic shifts within countries affected by coronavirus outbreaks or are these trends happening worldwide?
  • As more and more people shift to using the web for their everyday work, education, and social life, has that affected web performance in any way?
  • Can we tell if the websites themselves are getting slower under the load or if the users’ network connections are congested?
  • How have websites adapted to the shift to the web: are pages shedding bytes to accommodate more load? are CDNs becoming more prevalent?
  • Have there been any shifts in the proportion of users’ device types (desktop/phone/tablet)?
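
To make the first question concrete, here’s a minimal sketch of a growth query, assuming the page counts in the summary_pages tables (described later in this thread) are a reasonable proxy for dataset size:

#standardSQL
-- A rough sketch: count pages per monthly table to gauge dataset growth.
SELECT
  _TABLE_SUFFIX AS month_client,
  COUNT(0) AS pages
FROM
  `httparchive.summary_pages.2020_*`
GROUP BY
  month_client
ORDER BY
  month_client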

To encourage exploration, we’re able to provide each of you with 60 TB worth of free BigQuery credit (while supplies last). To claim your credit, please reply to this thread indicating what you’re interested in researching and I’ll DM you with a redeemable code. Whatever turns up, please reply to this thread and share your queries and analysis with the community so we can all learn from it.

2 Likes

This will be interesting. I was wondering what networks would do to handle all this newfound time to hit the web, as people use it as the main means of communication now. Nearly every platform has a video chat feature, and people are using it, that’s for sure. Opensignal just posted info about the first week of quarantine in Italy and what the network congestion looked like.
I would also imagine that with such a large majority of people not being at work, there must be a rocketing use of mobile, the kind we see on weekends vs. weekdays. This should be fun, Rick!

1 Like

For those just getting started with these datasets, here’s a quick introduction.

Read @paulcalvano’s excellent Getting Started with BigQuery guide to get your environment set up. Paul also has a few sections of his Guided Tour complete, which goes into detail on each dataset (still a work in progress).

HTTP Archive

HTTP Archive is a monthly dataset of how the web is built, containing metadata from ~5 million web pages. This is considered “lab data” in that the results come from a single test of a page load, but those results contain a wealth of information about the page.

The March 2020 dataset is available in tables named “2020_03_01” and suffixed with “desktop” or “mobile” accordingly. For example, the 2020_03_01_mobile table of the summary_pages dataset contains summary data about 5,484,239 mobile pages.

Note that the table name corresponds with March 1, 2020 but the tests took place throughout the month of March.
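
For example, a quick sanity check of the naming scheme, which should match the figure above (assuming one row per page in the summary tables):

#standardSQL
-- Count the mobile pages summarized in the March 2020 dataset.
SELECT
  COUNT(0) AS pages
FROM
  `httparchive.summary_pages.2020_03_01_mobile`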

Other datasets exist with different kinds of information about each page:

| Dataset | Description |
| --- | --- |
| pages | JSON data about each page, including loading metrics, CPU stats, optimization info, and more |
| requests | JSON data about each request, including request/response headers, networking info, payload size, MIME type, etc. |
| response_bodies | (very expensive) Full payloads for text-based resources like HTML, JS, and CSS |
| summary_pages | A subset of high-level stats per page |
| summary_requests | A subset of high-level stats per request |
| technologies | A list of which technologies are used per page, as detected by Wappalyzer |
| blink_features | A list of which JavaScript, HTML, or CSS APIs are used per page |
| lighthouse | (mobile only) Full JSON Lighthouse report containing audit results in areas of accessibility, performance, mobile friendliness, SEO, and more |
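
As a quick example of querying one of these, here’s a sketch against the technologies dataset. The url and app column names are my assumptions based on its Wappalyzer-style detections:

#standardSQL
-- A sketch: the most commonly detected technologies on mobile pages in March.
-- Column names (url, app) are assumptions; adjust to the actual schema.
SELECT
  app,
  COUNT(DISTINCT url) AS pages
FROM
  `httparchive.technologies.2020_03_01_mobile`
GROUP BY
  app
ORDER BY
  pages DESC
LIMIT 10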

Chrome UX Report

The Chrome UX Report is a monthly dataset of how the web is experienced. This is considered “field data” in that it is sourced from real Chrome users. Data in this project encapsulates the user experience with a small number of metrics, including: time to first byte, first paint, first contentful paint, largest contentful paint, DOM content loaded, onload, first input delay, cumulative layout shift, and notification permission acceptance rates. You can query the data by origin (website), month, country, form factor (desktop/mobile/tablet), and effective connection type (4G, 3G, 2G, slow 2G, offline).

The most recent dataset is 202002 (February 2020). The March 2020 dataset will be released on April 14 and will include user experience data for the full calendar month of March.

Data for each metric is organized as a histogram so that you can measure the percent of user experiences for a given range of times, for example, how often users experience TTFB between 0 and 200 ms. If you want fine-grained control over these ranges, you can query the all or country-specific datasets. If you want to query these ranges over time, you should use the experimental dataset, which is optimized for month-to-month analysis. See this post for more info.
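
Here’s a minimal sketch of that kind of fine-grained query, assuming the standard CrUX schema (using FCP under 1 second against the US country-specific dataset; zoom.us is just an example origin):

#standardSQL
-- Percent of first contentful paint experiences faster than 1 second.
-- bin.start and bin.density come from the FCP histogram for this origin.
SELECT
  SUM(IF(bin.start < 1000, bin.density, 0)) / SUM(bin.density) AS pct_fast_fcp
FROM
  `chrome-ux-report.country_us.202002`,
  UNNEST(first_contentful_paint.histogram.bin) AS bin
WHERE
  origin = 'https://zoom.us'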

If you don’t need fine-grained control of the histogram ranges, we summarize key “fast”, “average”, and “slow” percentages in the materialized dataset. This is also optimized for month-to-month analysis.
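
A sketch of the month-to-month version, where the metrics_summary table name and fast_fcp column are my assumptions about the materialized dataset’s layout:

#standardSQL
-- Month-over-month share of "fast" FCP experiences for one origin.
-- Table/column names are assumptions; adjust to the actual materialized schema.
SELECT
  yyyymmdd,
  fast_fcp
FROM
  `chrome-ux-report.materialized.metrics_summary`
WHERE
  origin = 'https://zoom.us'
ORDER BY
  yyyymmdd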


For even more info, check out the Web Almanac methodology for an in-depth explanation of how this transparency data is used for a large-scale research project.

1 Like

If useful, one idea to help identify/group websites with medical information is to search the structured data for the https://schema.org/MedicalWebPage @type and other related types. The custom metric built for the Web Almanac includes a function that extracts all of these types for a page, so we could remix the original query:

#standardSQL
CREATE TEMPORARY FUNCTION getSchemaTypes(payload STRING)
RETURNS ARRAY<STRING> LANGUAGE js AS '''
  try {
    var $ = JSON.parse(payload);
    var almanac = JSON.parse($._almanac);
    return almanac['10.5'].map(element => {
        // strip any @context
        var split = element.split('/');
        return split[split.length - 1];
    });
  } catch (e) {
    return [];
  }
''';

SELECT
  schema_type,
  COUNT(DISTINCT url) AS freq
FROM
  `httparchive.pages.2020_03_01_mobile`,
  UNNEST(getSchemaTypes(payload)) AS schema_type
WHERE
  STARTS_WITH(schema_type, 'Medical')
GROUP BY
  schema_type
ORDER BY
  freq DESC

Here are the top results:

| schema_type | freq |
| --- | --- |
| MedicalOrganization | 1536 |
| MedicalClinic | 1455 |
| MedicalBusiness | 657 |
| MedicalWebPage | 469 |
| MedicalCondition | 186 |
| MedicalProcedure | 119 |
| MedicalSpecialty | 85 |
| MedicalEntity | 81 |
| MedicalClinic,MedicalOrganization | 68 |
| MedicalScholarlyArticle | 57 |
| MedicalTherapy | 55 |
| MedicalAudience | 48 |
| MedicalCode | 26 |
| MedicalSpecialty,MedicalProcedure | 23 |
| MedicalTest | 11 |

I’ve saved the exact URLs and corresponding schema_types to a scratchspace.medical table for easier analysis. Here’s an example of counting the number of medical websites in the dataset since January:

SELECT
  _TABLE_SUFFIX AS month_client,
  COUNT(DISTINCT page.url) AS pages
FROM
  `httparchive.summary_pages.2020_*` AS page
JOIN
  `httparchive.scratchspace.medical` AS medical
ON
  page.url = medical.url AND
  ENDS_WITH(_TABLE_SUFFIX, medical.client)
WHERE
  schema_type IN ('MedicalOrganization', 'MedicalClinic', 'MedicalBusiness', 'MedicalWebPage')
GROUP BY
  month_client
ORDER BY
  month_client ASC

| month_client | pages |
| --- | --- |
| 01_01_desktop | 2107 |
| 01_01_mobile | 2474 |
| 02_01_desktop | 2347 |
| 02_01_mobile | 2904 |
| 03_01_desktop | 3066 |
| 03_01_mobile | 3858 |

This is telling us that about a thousand medical websites (~30%) visited in March were not popular enough to be included in the CrUX/HA datasets in January. We can improve this query by going through older datasets and extracting schema types to get a better idea of how many medical websites did exist in the January dataset (2474 sites on mobile plus some that didn’t also exist in March). But this is a quick example of the kind of insights we can get.

1 Like

I am a CS grad student at Stony Brook University and I am planning to work on this as part of my wireless class project. It would be really helpful if you could share the code for BigQuery to help get started.

1 Like

That’s great, welcome aboard @jainromil! Here’s the BigQuery project.

Also see this comment, which contains more info to get started. Let me know if you have any questions. I’ll DM you with a coupon code for additional quota.

Interesting report here from Fastly: https://www.fastly.com/blog/how-covid-19-is-affecting-internet-performance

2 Likes

Hi @rviscomi, we would love to analyze the impact of the surge in worldwide traffic demand on web performance and web QoE in the context of my PhD studies. These datasets look like an invaluable resource. Would you mind sharing a coupon code for free BigQuery credit to get the project started?

Thanks!

1 Like

Thank you for sharing this, @rviscomi! I am a grad student interested in the correlation between changes in network traffic and COVID-19 measures. What @tunetheweb shared presents nice work on this, and I want to study it further! Could you share the BigQuery credit with me so that I can get involved? Thanks!

1 Like

Hello,

I am a CS grad student looking to do some data analysis for a final project in one of my classes. My teammate and I would be very interested in collaborating on this project. Could you please help us get started by providing the BigQuery code? Thank you!

1 Like

Hello Rick,

I’m a CS graduate student taking a wireless course. For my final project, I’m interested in the effect of the coronavirus on video streaming. It would be greatly appreciated if you could provide me with BigQuery credit.

Thank You!

1 Like

Hi @rviscomi, do you know when the “materialized” dataset gets updated? The most recent entry I saw by selecting max(yyyymmdd) in the country_summary table is 2020-01-01, so unfortunately it looks like we don’t have data from the last few months…

The country summary table is due for an update but the others (device/metrics) should be up to date with the March dataset.
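
If it helps, the same freshness check works on each table, for example (assuming a device_summary table alongside country_summary):

#standardSQL
-- Most recent date available in the device summary table.
-- The device_summary table name is an assumption based on the reply above.
SELECT
  MAX(yyyymmdd) AS latest
FROM
  `chrome-ux-report.materialized.device_summary`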

1 Like

Hi @rviscomi, together with Sarah we have been working on the data to understand the impact of COVID on web users’ performance, and while we see some indications that performance was degrading in some cases, we don’t know whether these differences are statistically significant. The question, related to the histograms: is there any way to know the number of samples (measurements) used for a specific histogram? As we only have the “densities”, we can’t do plain comparisons, as differences are sensitive to the number of measurements. Thanks in advance! Pedro 🙂

1 Like

Hi @pecasas. Absolute sample counts are intentionally obfuscated in the CrUX dataset and relative percentages are used instead. From this you can still reason about the differences in dimensions for an origin. For example, if you add up the densities for all FCP experiences on phone compared to desktop, that will tell you the relative proportions of phone and desktop experiences (sketched below).
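
Here’s a minimal sketch of that density-summing approach, assuming the standard CrUX schema (zoom.us as the example origin):

#standardSQL
-- Relative proportions of phone vs desktop FCP experiences for one origin.
-- Each form factor's densities sum to its share of all experiences.
SELECT
  form_factor.name AS device,
  SUM(bin.density) AS proportion
FROM
  `chrome-ux-report.all.202003`,
  UNNEST(first_contentful_paint.histogram.bin) AS bin
WHERE
  origin = 'https://zoom.us'
GROUP BY
  device
ORDER BY
  proportion DESC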

Do you have any queries to look at as an example?

Thanks for your quick answer, Rick! Still one doubt: without the actual number of samples, how is it possible to add or compare histograms at all? Wouldn’t this bias results toward origins with fewer measurements? Or maybe there is a minimum number of samples required to keep the results in the DB, which would filter out the cases with far fewer measurements?

We were originally working with the aggregated results (fast, average, slow), and then wanted to go deeper into the histograms, but started to doubt how to properly compare the histograms of one origin (e.g., we took https://zoom.us) against a group of other origins (the Alexa top 100). We first thought of building a single histogram for the whole top 100, but adding densities derived from different numbers of measurements and re-normalizing won’t work. An alternative we thought of is to weight the densities of each top-100 origin by its Alexa popularity. Thanks!

To add, what looks interesting is that some popular sites seem to have actually been optimized and even improved their performance (like https://zoom.us), and we wanted to be sure that this was not an artifact of a bigger number of measurements due to the increase in popularity.

@pecasas great questions. I’d recommend reading the Analysis tips & best practices section of the CrUX docs. There are a few considerations given for comparing data across origins.

To summarize, if we see the % of fast experiences increase month-over-month, we can’t say whether the website’s own performance got faster, whether specific users’ infrastructure (internet connection, device type) improved, whether relatively more users with fast infrastructure started using the website, etc. CrUX tells us the “what” not the “why”.

Hey @rviscomi, any chance that the BigQuery credit is still available? We would like to study the impact of the pandemic on the adoption & practice of good “security hygiene” on the web, to answer questions such as the following:

  • Is there a larger delay in updating web app software?
  • Is there a significant change in the adoption of security mechanisms over time, with regard to how and when different countries were affected?
  • Do the (corona-related) websites that were recently launched follow good security practices (assuming they were hastily put together)?

Yes, it’s still available. Check your DMs!