Missing websites in April 2022

Dear HTTP Archive Community,

We are researchers investigating trends in website request data using the HTTP Archive, and we came across a strange issue that we would like to understand better.

We have observed that several websites (both popular and less popular) were skipped from the archive in April 2022, even though they appear in the CrUX lists as well as in the archive in the months before and after April 2022. A notable example is https://drive.google.com

For a smaller subset of these websites, there is no archive for either March and April or April and May, but they are archived continuously outside these months.

We would like to know if there is an explanation for this behaviour, for example a bug in the scraping code or server failures that could have caused the issue.
If there is no concrete explanation, we would be grateful for a partial explanation or some likely reasons.

Thank you for taking the time! We appreciate your help!

This sounds like the effects of a change that happened in the upstream CrUX dataset around that time. For a site to be included, CrUX checks that it meets eligibility criteria around public discoverability and sufficient popularity. Earlier this year there was a change in the CrUX data pipeline to use a different system that classifies whether an origin or page is public. We identified some differences between the old and new systems and the team has been working to close those gaps. That would explain some of your temporary observations. If the Drive website is still not included in the dataset, it’s possible that the new system has determined that it no longer meets the public discoverability criteria.

cc @tunetheweb

Thank you for your reply!

We appreciate your insights. However, we suspect that this is not an issue with the CrUX dataset itself, as it always seems to include the Google Drive URL (it is quite popular).
We can also share some other notable examples: https://www.deepl.com, https://www.spotify.com, https://www.microsoft.com, https://www.tumblr.com

All of these are reasonably popular public websites that are always included in the CrUX datasets around that time, but fail to appear in HTTP Archive ONLY IN APRIL 2022.
Could there have been some other issues with the HTTP Archive itself?

Thanks again for your help!

Could you share more about what led you to believe that these websites aren’t available in the April 2022 dataset?

It does seem like they’re all there:

SELECT
  _TABLE_SUFFIX AS client,
  url
FROM
  `httparchive.summary_pages.2022_04_01_*`
WHERE
  url IN (
    'https://drive.google.com/',
    'https://www.spotify.com/',
    'https://www.deepl.com/',
    'https://www.microsoft.com/',
    'https://www.tumblr.com/'
  )

Thank you again for the response and apologies for the delay.

To clarify, we used the pages table from 2022_04_01_desktop to get the archival status for a list of websites that we were matching.

Our query looked something like this:

SELECT
  websites.website,
  COUNT(*) AS archive
FROM
  `our custom list` AS websites  -- placeholder for our own table of websites
INNER JOIN (
  SELECT url
  FROM `httparchive.pages.2022_04_01_desktop`
) AS pages
ON
  websites.website = pages.url
GROUP BY
  websites.website
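
Since the join above matches on exact string equality, any formatting difference between the two URL columns would make a site look absent. The URLs in the earlier summary_pages query all end in `/`, while the examples listed in this thread do not. Below is a minimal Python sketch of one way to rule that out before joining. The helper names `normalize_origin` and `split_by_match` are hypothetical, and the assumption that a trailing slash or letter case could differ is not confirmed as the cause of the April 2022 gap.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize_origin(url: str) -> str:
    """Lowercase the scheme and host and use '/' when the path is empty."""
    parts = urlsplit(url.strip())
    path = parts.path or "/"
    # Drop any fragment; keep the query string unchanged.
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), path, parts.query, "")
    )


def split_by_match(websites, archived_urls):
    """Split `websites` into (found, missing) after normalizing both sides."""
    archived = {normalize_origin(u) for u in archived_urls}
    found, missing = [], []
    for site in websites:
        if normalize_origin(site) in archived:
            found.append(site)
        else:
            missing.append(site)
    return found, missing


if __name__ == "__main__":
    # Illustrative inputs only: a custom list without trailing slashes,
    # and archived URLs written the way the summary_pages query writes them.
    custom_list = ["https://drive.google.com", "https://www.spotify.com"]
    archived = ["https://drive.google.com/", "https://www.spotify.com/"]
    print(split_by_match(custom_list, archived))
```

If sites that were reported missing show up under `found` once both sides are normalized, the gap comes from the string comparison rather than from the crawl itself.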