How Many HTTPS websites are using Mixed Content

Just after seeing @igrigorik’s tweet about Chrome 36 and Mixed Content blocking, I was wondering how many websites were using Mixed Content (a firstHtml page served over HTTPS that fetches resources over non-secure HTTP).

I wanted to gauge the interest of adding a rule to detect such an issue with www.dareboost.com (a website testing tool).

I started by documenting myself on the browsers’ current policies; here are some useful links if you want to know more:

  • Article about the Firefox policy, with useful information: https://blog.mozilla.org/tanvi/2013/04/10/mixed-content-blocking-enabled-in-firefox-23/
  • You can test your current browser’s policy here: https://www.ssllabs.com/ssltest/viewMyClient.html

I first searched for pages with a url field starting with https. There were very few of them (18!). So I looked directly in the requests table for requests with a true firstHtml field and a url starting with https, and found 9185 entries.
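For reference, here is a minimal sketch of those two lookups, run as two separate queries against the same [httparchive:runs.latest_pages] and [httparchive:runs.latest_requests] tables used in the full query below:

# Pages whose url field starts with https (only 18 hits)
SELECT COUNT(pageid) AS https_pages
FROM [httparchive:runs.latest_pages]
WHERE url LIKE ("https%");

# firstHtml requests served over HTTPS (9185 entries)
SELECT COUNT(pageid) AS https_first_html
FROM [httparchive:runs.latest_requests]
WHERE firstHtml = 1 AND url LIKE ("https%");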

Next, for all the matching pageids of these HTTPS firstHtml requests, I looked for requests that were not HTTPS, excluding 301 and 302 status codes in order to avoid detecting this redirect scheme as Mixed Content:
http://example.com --(301|302)--> https://example.com

SELECT pages.pageid, pages.url, pages.rank, COUNT(requests.url) AS resource_count
FROM [httparchive:runs.latest_requests] requests
JOIN EACH (
  SELECT url, rank, pageid FROM [httparchive:runs.latest_pages]
) pages ON pages.pageid = requests.pageid
WHERE requests.status != 301 AND requests.status != 302
AND requests.pageid IN (
  # Here we have the pageids for which firstHtml is an HTTPS resource - 9185 entries
  SELECT pages.pageid FROM [httparchive:runs.latest_requests] requests
  JOIN EACH (
    SELECT rank, pageid FROM [httparchive:runs.latest_pages]
  ) pages ON pages.pageid = requests.pageid
  WHERE requests.firstHtml = 1 AND requests.url LIKE ("https%")
)
AND requests.url NOT LIKE ("https%")
GROUP BY pages.pageid, pages.url, pages.rank
ORDER BY pages.rank ASC;

Among the 9185 home pages served over SSL, the query shows that 1031 (11%) are using Mixed Content.

As I found websites such as paypal.jp or usbank.com in the results, I first thought I had made a mistake, so I manually checked some HARs downloaded via httparchive.org… That confirmed the Mixed Content issues.
But it still disappoints me that there are so many Mixed Content websites. Did I miss something?

NB :

  • I make no distinction between active and passive mixed content
  • some cases are probably excluded by the WHERE clause on the status field, but I did not find a simpler way to exclude false positives caused by non-SSL redirects occurring before the firstHtml page is reached over SSL.
  • I also found that for some websites the issues have since been fixed. It would be interesting to compare the current results with the next run’s (see the sketch below).
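As a rough sketch of how such a comparison could be done, the same query can be pointed at a specific dated crawl instead of the latest one. This assumes dated tables of the form runs.YYYY_MM_DD_requests exist alongside latest_requests in the httparchive dataset; the date below is only a placeholder:

# Hypothetical example: count HTTPS home pages in a specific (dated) run.
# Swap the placeholder date for an actual crawl date, and apply the same
# substitution to latest_pages/latest_requests in the full query above
# to rerun it per run.
SELECT COUNT(DISTINCT pageid) AS https_home_pages
FROM [httparchive:runs.2014_07_15_requests]
WHERE firstHtml = 1 AND url LIKE ("https%");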

That’s depressing. Here’s another relevant study (with even worse numbers):

Thanks for the comment!
Depressing, that’s the word…

From the crawled HTTPS websites, 7,980 (43%) were found to have at least one type of mixed content

I can’t explain for sure why there is such a gap between that value and mine, but here are some reasons I found:

  • httparchive crawls use IE9, which blocks mixed-content XMLHttpRequests and does not support WebSockets; this minimizes the number of Mixed Content cases found by the query.
  • Excluding https-to-http requests might have a significant effect; I will probably take a look (see the sketch after this list).
  • In the linked study, “200 HTTPS page URLs for each website” are crawled, and if one of them contains Mixed Content, the website seems to be counted among the 43%, whereas with httparchive we only consider the home page. I think this is the main factor explaining the difference in the numbers.
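To get a sense of how many requests the 301/302 exclusion actually filters out, here is a rough sketch that only counts the non-HTTPS 301/302 requests on pages whose firstHtml is served over HTTPS (same tables and columns as the main query above):

# Rough sketch: non-HTTPS 301/302 requests removed by the status filter,
# restricted to pages whose firstHtml is served over HTTPS
SELECT COUNT(requests.url) AS excluded_redirects
FROM [httparchive:runs.latest_requests] requests
WHERE (requests.status = 301 OR requests.status = 302)
AND requests.url NOT LIKE ("https%")
AND requests.pageid IN (
  SELECT pageid FROM [httparchive:runs.latest_requests]
  WHERE firstHtml = 1 AND url LIKE ("https%")
);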

Hi guys,

I wanted to get the latest results with this query, as the new CSP directive upgrade-insecure-requests sounds like really good news.
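For reference, that directive is delivered through the Content-Security-Policy response header (or the equivalent meta tag), for example:

Content-Security-Policy: upgrade-insecure-requests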

But unfortunately, I discovered that BigQuery is not free anymore. I managed to access the console and to find my httparchive project, but when running the query (or one of the latest ones from this forum), I still have no results after several minutes of waiting… Any idea?

Sorry to bother you about it!

That sounds like a bug with the query. The same rules about processing limits (free quotas) apply as before.

Thanks @igrigorik for your answer. No idea why, but running this same query now works…

Anyway, here are the updated results:
Among 27182 home pages served over SSL, the query shows that 3107 (11%) are using Mixed Content.

The ratio is about the same as one year ago (even though there are a lot more HTTPS pages: firstly because there are more pages in the HTTP Archive project, but also because HTTPS usage is growing!).
