Hey Simon,
Chrome’s use counter for Mixed Content Image seems to have increased from ~2% in early 2020 to 3.4% today (after going down to 0.5% in October 2020): Chrome Platform Status - why is the use counter up but the httparchive numbers down?
Chrome’s use counter is based on page views by Chrome users, while the HTTP Archive stats in the tables you looked at are based on a single crawl of each site’s home page. So some differences include:
- Non-home pages will be counted in Chromestatus but not in your HTTP Archive query (more on this later).
- Popular sites weigh the same as niche sites in HTTP Archive (as it’s based on crawling each domain equally), but count for more in Chromestatus (as it’s based on actual page loads).
- Non-public websites (e.g. Intranet sites) will be counted in Chromestatus but not HTTP Archive (which only crawls public sites).
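To make the weighting point concrete, here is a tiny sketch that uses entirely made-up sites and page-view counts (none of these figures come from HTTP Archive or Chromestatus). It only illustrates how counting each site once and counting each page view once can give very different percentages for the same set of sites:

```sql
-- Hypothetical data for illustration only: one popular site with mixed
-- content and three niche sites without it.
WITH sites AS (
  SELECT * FROM UNNEST([
    STRUCT('popular.example' AS site, 9000 AS page_views, TRUE AS has_mixed_content),
    STRUCT('niche-a.example' AS site, 500 AS page_views, FALSE AS has_mixed_content),
    STRUCT('niche-b.example' AS site, 300 AS page_views, FALSE AS has_mixed_content),
    STRUCT('niche-c.example' AS site, 200 AS page_views, FALSE AS has_mixed_content)
  ])
)
SELECT
  -- Crawl-style weighting: every site counts once.
  COUNTIF(has_mixed_content) / COUNT(*) AS pct_of_sites,
  -- Use-counter-style weighting: every page view counts once.
  SUM(IF(has_mixed_content, page_views, 0)) / SUM(page_views) AS pct_of_page_views
FROM
  sites
```

With these invented numbers, a quarter of the sites are affected (0.25) but 90% of the page views are (0.9), so a single popular site can move a page-load-based counter a lot without moving a per-site crawl at all.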
That said, there is more we can do with HTTP Archive. For a start, the new data model lets us look at secondary pages as well as home pages:
SELECT
  client,
  is_root_page,
  COUNT(DISTINCT page) AS mixed_content_pages,
  total_pages,
  COUNT(DISTINCT page) / total_pages AS pct_mixed_content_pages,
  ARRAY_TO_STRING(ARRAY_AGG(DISTINCT page LIMIT 5), ' ') AS sample_urls
FROM
  `httparchive.all.requests`
JOIN
  (
    -- Total pages crawled per client and page type (the denominator)
    SELECT
      client,
      is_root_page,
      COUNT(DISTINCT page) AS total_pages
    FROM
      `httparchive.all.requests`
    WHERE
      date = '2023-01-01'
    GROUP BY
      client,
      is_root_page
  )
USING (client, is_root_page)
WHERE
  date = '2023-01-01' AND
  type = 'image' AND
  -- Mixed content: an HTTPS page requesting an image over HTTP
  STARTS_WITH(page, 'https:') AND
  STARTS_WITH(url, 'http:')
GROUP BY
  client,
  is_root_page,
  total_pages
ORDER BY
  client,
  is_root_page
When I run this, I see the following:
| is_root_page | desktop | mobile |
|---|---|---|
| FALSE | 0.49% | 0.54% |
| TRUE | 0.56% | 0.63% |
This seems to imply home pages are worse offenders for mixed content images, which surprised me to be honest! I would have thought home pages would be better cared for and less likely to have mixed content, whereas it would be easier for it to slip into a secondary page.
However, when I restrict this to status: 200 images by adding AND summary LIKE '%"status": 200%' to the WHERE clause (WARNING: this turns it into a 9.3 TB query!!), the numbers drop almost to half:
| is_root_page | desktop | mobile |
|---|---|---|
| FALSE | 0.28% | 0.31% |
| TRUE | 0.35% | 0.39% |
Not sure if this is due to redirects actually upgrading the images in the end (with the HTML still referring to the http version), or 404s meaning those images aren’t loaded, or something else. Nonetheless, the original request was HTTP, so the risk remains.
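For anyone wanting to reproduce the status-filtered numbers, the query might look roughly like the sketch below. Two things here are assumptions rather than something stated above: the extra condition is placed in the outer WHERE only (so the total page count in the denominator stays unfiltered), and `summary` is treated as a STRING column, which is what the LIKE match implies. The `totals` CTE is just a restructuring of the original subquery.

```sql
-- WARNING: filtering on `summary` makes this a very large query
-- (about 9.3 TB when it was originally run).
WITH totals AS (
  -- Denominator: all pages per client and page type, unfiltered.
  SELECT
    client,
    is_root_page,
    COUNT(DISTINCT page) AS total_pages
  FROM
    `httparchive.all.requests`
  WHERE
    date = '2023-01-01'
  GROUP BY
    client,
    is_root_page
)
SELECT
  client,
  is_root_page,
  COUNT(DISTINCT page) AS mixed_content_pages,
  total_pages,
  COUNT(DISTINCT page) / total_pages AS pct_mixed_content_pages
FROM
  `httparchive.all.requests`
JOIN
  totals
USING (client, is_root_page)
WHERE
  date = '2023-01-01' AND
  type = 'image' AND
  STARTS_WITH(page, 'https:') AND
  STARTS_WITH(url, 'http:') AND
  -- Assumed placement: only the numerator is restricted to 200 responses.
  summary LIKE '%"status": 200%'
GROUP BY
  client,
  is_root_page,
  total_pages
ORDER BY
  client,
  is_root_page
```

If `summary` is exposed as a JSON column rather than a string, the LIKE condition would need to be rewritten accordingly.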
When I look at the blink counters for the HTTP Archive run (I’ve used the old tables here as they’re easier to query, so home pages only again):
SELECT
  *
FROM
  `httparchive.blink_features.usage`
WHERE
  yyyymmdd = '20230101' AND
  feature = 'MixedContentImage'
This shows a much smaller number (0.03%), and again I was surprised it was so much lower. I had a look at some pages and it appears an insecure favicon does not trigger the blink feature counter (possibly fair enough, as it’s not mixed content on the page?), but it would have been caught by our previous query. Not sure if there are other reasons for the big difference between that and the requests count (e.g. a CSP preventing the mixed content from being used, even if it was downloaded?).
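One way to get a rough feel for how much of the request-based count comes from favicons would be to split the earlier mixed-content query by whether the image URL looks like a favicon. This is only a sketch: the "contains favicon" test is a crude heuristic of my own (favicons can live at any URL), and I have not run it, so I can't say what it costs or returns.

```sql
-- Hypothetical check: split home pages with HTTP images on HTTPS pages
-- by whether the image URL looks like a favicon.
SELECT
  client,
  -- Crude heuristic (an assumption): any URL containing "favicon".
  REGEXP_CONTAINS(LOWER(url), r'favicon') AS looks_like_favicon,
  COUNT(DISTINCT page) AS mixed_content_pages
FROM
  `httparchive.all.requests`
WHERE
  date = '2023-01-01' AND
  is_root_page AND
  type = 'image' AND
  STARTS_WITH(page, 'https:') AND
  STARTS_WITH(url, 'http:')
GROUP BY
  client,
  looks_like_favicon
ORDER BY
  client,
  looks_like_favicon
```

Note that a page with both a favicon and a non-favicon HTTP image is counted in both groups, so the two rows per client won't sum to the earlier total.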
That still raises the question of why the Chromestatus counter (from real users) is so much higher than the HTTP Archive crawl (3.4% of page loads, compared to a tenth of that for HTTP Archive) and increasing. I don’t really have any answers for that unfortunately, beyond what I’ve suggested above (other inner pages, non-public pages like intranets, some popular sites having this on a large number of pages). Weird that it’s increasing though… It may be worth contacting the people on that original email thread to see if they can explain it?