How many https websites are using Mixed Content images

We (Mozilla) wanted to find out how the landscape has changed since 2020 with regard to the usage of mixed content images on https websites, and how many of those images are broken anyway (fail to fetch). Here’s my method to find out:

SELECT
 summary_pages.url AS page_url, summary_requests.url AS mixed_content_image_url, status
FROM
 `httparchive.summary_pages.2023_01_01_desktop` AS summary_pages
 JOIN `httparchive.summary_requests.2023_01_01_desktop` AS summary_requests
 ON summary_pages.pageid = summary_requests.pageid
WHERE
 summary_requests.type = "image"
 AND STARTS_WITH(summary_pages.url, 'https:')
 AND STARTS_WITH(summary_requests.url, 'http:')
ORDER BY summary_pages.url ASC

This returns 426,648 mixed content image requests. To get the number of pages:

SELECT COUNT(DISTINCT page_url) FROM `saved_table`

70,772 pages, out of the total of 12,675,392.

To find how many failed to fetch, filter for “status” >= 400: 10,996 requests, on 7,288 pages.
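As a sketch (assuming the results of the first query were saved to `saved_table`, with the same column names), the failed-fetch counts can be reproduced like this:

```sql
-- Count failed mixed content image requests and the distinct pages they
-- appear on; `saved_table` is assumed to hold the first query's results.
SELECT
  COUNT(*) AS failed_requests,
  COUNT(DISTINCT page_url) AS failed_pages
FROM `saved_table`
WHERE status >= 400
```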

The number of pages with mixed content images and without broken images is thus:

(70772 - 7288) / 12675392 * 100 ≈ 0.5%

However, I don’t know if my query is comparable to the one done by the Chromium team back in 2020 (“this reduces the percentage of affected navigations to 0.69%”): https://groups.google.com/a/chromium.org/g/blink-dev/c/gFNMWmg7iOw/m/Dw58z-UXAgAJ

Chrome’s use counter for Mixed Content Image seems to have increased from ~2% in early 2020 to 3.4% today (after going down to 0.5% in October 2020): Chrome Platform Status - why is the use counter up but the httparchive numbers down?

Further, I don’t know how it’s recorded in summary_requests if an image request fails to be fetched at a lower layer than HTTP e.g. TCP or DNS. Is it captured in the status column, or somewhere else, or not included at all?

(Similar thread: How Many HTTPS websites are using Mixed Content)


Hey Simon,

> Chrome’s use counter for Mixed Content Image seems to have increased from ~2% in early 2020 to 3.4% today (after going down to 0.5% in October 2020): Chrome Platform Status - why is the use counter up but the httparchive numbers down?

Chrome’s use counter is based on page views of Chrome users, while HTTP Archive’s stats in the tables you looked at are based on a single crawl of the home page. So some differences include:

  1. Non-home pages will be counted in Chromestatus but not in your HTTP Archive query (more on this later).
  2. Popular sites will weigh the same as niche sites in HTTP Archive (as it’s based on crawling each domain equally), but more in Chromestatus (as it’s based on actual page loads).
  3. Non-public websites (e.g. Intranet sites) will be counted in Chromestatus but not HTTP Archive (which only crawls public sites).

That said, there is more we can do with HTTP Archive. For a start, we can look at our secondary pages, using the new data model:

SELECT
  client,
  is_root_page,
  COUNT(DISTINCT page) AS mixed_content_pages,
  total_pages,
  COUNT(DISTINCT page) / total_pages AS pct_mixed_content_pages,
  ARRAY_TO_STRING(ARRAY_AGG(DISTINCT page LIMIT 5), ' ') AS sample_urls
FROM
  `httparchive.all.requests`
JOIN
  (
    SELECT
      client,
      is_root_page,
      COUNT(DISTINCT page) as total_pages
    FROM
      `httparchive.all.requests`
    WHERE
      date = '2023-01-01'
    GROUP BY
      client,
      is_root_page
  )
USING (client, is_root_page)
WHERE
  date = '2023-01-01' AND
  type = "image" AND
  STARTS_WITH(page, 'https:') AND
  STARTS_WITH(url, 'http:')
GROUP BY
  client,
  is_root_page,
  total_pages
ORDER BY
  client,
  is_root_page

When I run this, I see the following:

| is_root_page | desktop | mobile |
| --- | --- | --- |
| FALSE | 0.49% | 0.54% |
| TRUE | 0.56% | 0.63% |

This seems to imply home pages are worse offenders for mixed content images, which surprised me to be honest! I would have thought home pages would be better cared for and less likely to have mixed content, whereas it’s easier for it to slip into a secondary page.

However, when I restrict this to status 200 images by adding `AND summary LIKE '%"status": 200%'` to the WHERE clause (WARNING: this turns it into a 9.3 TB query!!), the numbers drop to almost half:

| is_root_page | desktop | mobile |
| --- | --- | --- |
| FALSE | 0.28% | 0.31% |
| TRUE | 0.35% | 0.39% |
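For reference, the modified WHERE clause of the query above would look like this (the LIKE match is a crude string test against the JSON `summary` blob, which is what makes the query so expensive to scan):

```sql
WHERE
  date = '2023-01-01' AND
  type = "image" AND
  STARTS_WITH(page, 'https:') AND
  STARTS_WITH(url, 'http:') AND
  -- keep only requests whose JSON summary records an HTTP 200 response
  summary LIKE '%"status": 200%'
```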

Not sure if this is due to redirects actually upgrading the images in the end (with the HTML still referring to the http: version), or 404s meaning those images aren’t loaded, but nonetheless the original request was HTTP, so the risk remains.

When I look at the blink counters for the HTTP Archive run (I’ve used the old tables here as they’re easier to query, so it’s home pages only again):

SELECT
  *
FROM
  `httparchive.blink_features.usage`
WHERE
  yyyymmdd = '20230101' AND
  feature = 'MixedContentImage'

This shows a much smaller number (0.03%), which again surprised me that it was so much lower. I had a look at some pages and it appears an insecure favicon does not trigger the blink feature counter (possibly fair enough, as it’s not mixed content on the page?), but would have been caught by our previous query. Not sure if there are other reasons for the big difference between that and the requests count (e.g. a CSP preventing the mixed content from attempting to be used, even if it was downloaded?).
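To dig into which pages trigger the counter, there is also a per-page table alongside `usage` (a sketch; I’m assuming the `blink_features.features` table name and columns here, so double-check against the schema before running):

```sql
-- List a few home pages that triggered the MixedContentImage blink counter
-- in the 2023-01-01 desktop crawl.
SELECT url
FROM `httparchive.blink_features.features`
WHERE
  yyyymmdd = '20230101' AND
  client = 'desktop' AND
  feature = 'MixedContentImage'
LIMIT 10
```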

That still begs the question as to why the Chromestatus counters (from real users) are so much higher than the HTTP Archive crawl (3.4% of page loads, compared to a tenth of that for HTTP Archive) and increasing. I don’t really have any answers for that unfortunately, beyond what I’ve suggested above (other inner pages, non-public pages like intranets, some popular sites having this on a large number of pages). Weird that it’s increasing though… It may be worth contacting the people on that original email thread to see if they can explain it?

FWIW, the HTTP Archive crawl extracts the feature status flags for each page load as well. You can see it in the “Adoption of the feature on top sites” table below the overall status, and it certainly looks like it took a huge drop when the crawl expanded to cover more pages.

It would also be interesting to see if upgrade-insecure-requests triggers the feature counter but doesn’t show up as a mixed-content URL since it would load it with HTTPS.

I was wondering about upgrade-insecure-requests too, but presumed it would make the request over HTTPS and so it would show as such in the requests table, rather than as HTTP?
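One way to get a rough sense of how common that directive is (a sketch; I’m assuming the `response_headers` array and `is_main_document` columns in the new data model, and it would miss a CSP delivered via a `<meta>` tag):

```sql
-- Count pages whose main document sends upgrade-insecure-requests
-- in a Content-Security-Policy response header.
SELECT
  client,
  COUNT(DISTINCT page) AS uir_pages
FROM
  `httparchive.all.requests`,
  UNNEST(response_headers) AS header
WHERE
  date = '2023-01-01' AND
  is_main_document AND
  LOWER(header.name) = 'content-security-policy' AND
  LOWER(header.value) LIKE '%upgrade-insecure-requests%'
GROUP BY client
```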

Thanks! I didn’t know deep pages were available now, that’s cool.

Yeah I also tried to use MixedContentImage use counter with httparchive, but was perplexed by the number of matching pages being so different from the number of page views for the use counter. I understand that they can be different for the reasons you mention, but usually they are reasonably close.

Oh, no, the matching pages vs use counters can be off by HUGE amounts. The top-of-the-top sites for popularity drive an inordinate percentage of “page views”.

The fact that the two data sources are in a similar ballpark for some counters is sheer luck. If Facebook, Google or Youtube have a feature that triggers broadly it can dominate. This is a fairly common source of frustration in using web-wide stats for driving priorities (making a change might only move the usage-based stats if it improves one of the top 10 sites).

I don’t remember the most recent version of the stat quoted but it was something like “The top 10% of sites are responsible for 50% of the web usage”. This 2-year-old infographic doesn’t look to be too far off.

We only have one secondary page per origin, btw. We pick the first link in the LCP element on the home page as the secondary page to crawl. The hope was that the LCP link was likely a representative good page (e.g. an article on a newspaper site, a product on an ecommerce site, etc.) rather than a random link. Though I have noticed some sites (e.g. Amazon) often have a splash page for the promo of the month rather than a product page, which we might have preferred. But still, a vast improvement.

FWIW, it’s not technically the LCP element. The actual code is here but it basically aggregates the viewport area covered by links by URL (so multiple links to the same content count in aggregate).

Often that will catch the LCP element if the LCP element is a link but it’s basically the same-origin URL that has the largest in-viewport clickable area.