How many HTTPS websites are using Mixed Content?

Just after seeing @igrigorik’s tweet about Chrome 36 and Mixed Content blocking, I wondered how many websites use Mixed Content (a firstHtml page served over HTTPS that fetches resources over non-secure HTTP).

I wanted to gauge whether it is worth adding a rule to detect this issue in www.dareboost.com (a website testing tool).

I started by reading up on the browsers’ current policies. Here are some useful links if you want to know more:
Article about the Firefox policy, with useful information: https://blog.mozilla.org/tanvi/2013/04/10/mixed-content-blocking-enabled-in-firefox-23/
You can test your current browser’s policy here: https://www.ssllabs.com/ssltest/viewMyClient.html

I first searched for pages with a url field starting with https. There were very few of them (18!). So I tried looking directly in the requests table, for requests with a true firstHtml field and a url starting with https. I found 9185 entries.

Next, for every pageid matching these HTTPS firstHtml requests, I looked for requests that were not HTTPS, excluding 301 and 302 status codes so that the following pattern is not detected as Mixed Content:
http://example.com -- 301|302 --> https://example.com

SELECT pages.pageid, pages.url, pages.rank, COUNT(requests.url) AS resource_count
FROM [httparchive:runs.latest_requests] requests
JOIN EACH (
  SELECT url, rank, pageid FROM [httparchive:runs.latest_pages]
) pages ON pages.pageid = requests.pageid
WHERE requests.status != 301 AND requests.status != 302
  AND requests.pageid IN (
    # Here we have the pageids for which firstHtml is an https resource - 9185 entries
    SELECT pages.pageid
    FROM [httparchive:runs.latest_requests] requests
    JOIN EACH (
      SELECT rank, pageid FROM [httparchive:runs.latest_pages]
    ) pages ON pages.pageid = requests.pageid
    WHERE requests.firstHtml = 1 AND requests.url LIKE ("https%")
  )
  AND requests.url NOT LIKE ("https%")
GROUP BY pages.pageid, pages.url, pages.rank
ORDER BY pages.rank ASC;
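
For readers without BigQuery access, the same filtering logic can be sketched in Python over in-memory rows. This is only an illustration of what the query does, not how it was actually run: the `count_mixed_content` helper and the sample rows are hypothetical, and the dictionary keys simply reuse the column names from the query.

```python
from collections import defaultdict

REDIRECT_STATUSES = {301, 302}


def count_mixed_content(requests):
    """Return {pageid: number of non-HTTPS, non-redirect requests}.

    Only pages whose firstHtml request is an https URL are considered,
    mirroring the IN (...) subquery of the SQL query.
    """
    # Step 1: page ids whose first HTML document is an https URL.
    https_pages = {
        r["pageid"]
        for r in requests
        if r["firstHtml"] and r["url"].startswith("https")
    }

    # Step 2: on those pages, count requests that are neither HTTPS nor 301/302.
    counts = defaultdict(int)
    for r in requests:
        if (
            r["pageid"] in https_pages
            and r["status"] not in REDIRECT_STATUSES
            and not r["url"].startswith("https")
        ):
            counts[r["pageid"]] += 1
    return dict(counts)


# Hypothetical sample rows, NOT taken from the HTTP Archive data.
sample = [
    # Page 1: http -> https redirect (ignored), then one insecure image.
    {"pageid": 1, "url": "http://example.com/", "status": 301, "firstHtml": False},
    {"pageid": 1, "url": "https://example.com/", "status": 200, "firstHtml": True},
    {"pageid": 1, "url": "http://cdn.example.com/logo.png", "status": 200, "firstHtml": False},
    # Page 2: served entirely over HTTPS.
    {"pageid": 2, "url": "https://example.org/", "status": 200, "firstHtml": True},
    {"pageid": 2, "url": "https://example.org/app.js", "status": 200, "firstHtml": False},
    # Page 3: plain HTTP page, out of scope.
    {"pageid": 3, "url": "http://example.net/", "status": 200, "firstHtml": True},
]

print(count_mixed_content(sample))  # {1: 1}
```

Only page 1 is reported: its initial 301 is skipped by the status filter, while the insecure image is counted.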

Among the 9185 home pages served over SSL, 1031 are using Mixed Content (11%).

As I found websites such as paypal.jp or usbank.com in the results, I first thought I had made a mistake, so I manually checked some HARs downloaded via httparchive.org… That confirmed the Mixed Content issues.
But it still disappoints me that there are so many Mixed Content websites. Did I miss something?

NB:

  • I make no distinction between active and passive Mixed Content.
  • Some cases are probably excluded due to the WHERE clause on the status field, but I did not find another simple way to exclude false positives caused by non-SSL redirects that occur before reaching firstHtml pages over SSL.
  • I also found that some websites have fixed their issues since then. It would be interesting to compare the current results with those of the next run.
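
If someone wants to split the results into active and passive Mixed Content, a rough first pass could classify each insecure request by its content type. The sketch below assumes the usual browser distinction (images, audio and video are passive; everything else, such as scripts, stylesheets or frames, is treated as active). `classify_mixed_request` and its `content_type` argument are hypothetical names, not fields or functions from HTTP Archive.

```python
# Content-type prefixes treated as passive ("display") content.
# Assumption: anything else (scripts, CSS, frames, fonts, XHR...) is active.
PASSIVE_PREFIXES = ("image/", "audio/", "video/")


def classify_mixed_request(content_type):
    """Return "passive" or "active" for an insecure request's content type."""
    # Drop parameters such as "; charset=utf-8" and normalise case.
    main_type = content_type.split(";")[0].strip().lower()
    if main_type.startswith(PASSIVE_PREFIXES):
        return "passive"
    return "active"


print(classify_mixed_request("image/png"))
print(classify_mixed_request("application/javascript; charset=utf-8"))
```

A real classification would depend on how the resource is used in the page (for example, an image fetched through XHR), so the content type alone is only an approximation.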

That’s depressing. Here’s another relevant study (with even worse numbers):

Thanks for the comment!
Depressing, that’s the word…

From the crawled HTTPS websites, 7,980 (43%) were found to have at least
one type of mixed content

I can’t explain for sure why there is such a gap between this value and mine, but here are some reasons I found:

  • httparchive crawls use IE9, which blocks XMLHttpRequests and does not support WebSockets. This reduces the number of Mixed Content cases found by the query.
  • Excluding https-to-http requests might have a significant effect. I will probably take a look.
  • In the linked study, “200 HTTPS page URLs for each website” are crawled. If one of them contains Mixed Content, it seems that the website is counted among the 43%, whereas in httparchive we only consider the home page. I think this is the main factor explaining the difference in numbers.
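
To illustrate the last point, here is a small sketch of how the two counting methods can diverge on the same sites. The data is a made-up toy example (four fictional sites, three pages each, home page listed first), so only the mechanism matters, not the numbers; `homepage_rate` and `any_page_rate` are hypothetical helper names.

```python
def homepage_rate(sites):
    """Share of sites whose home page (first entry) has Mixed Content."""
    return sum(1 for pages in sites.values() if pages[0]) / len(sites)


def any_page_rate(sites):
    """Share of sites where at least one crawled page has Mixed Content."""
    return sum(1 for pages in sites.values() if any(pages)) / len(sites)


# Toy data, purely illustrative: True means "this page has Mixed Content".
toy_sites = {
    "a.example": [False, False, True],
    "b.example": [True, False, False],
    "c.example": [False, False, False],
    "d.example": [False, True, False],
}

print(homepage_rate(toy_sites))  # 0.25
print(any_page_rate(toy_sites))  # 0.75
```

The more pages are crawled per site, the more likely a site is to be flagged, so a site-level rate based on many pages will naturally be higher than a home-page-only rate.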

Hi guys,

I wanted to get the latest results with this query, as the new CSP directive upgrade-insecure-requests sounds like really good news.
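
For anyone who has not tried the directive yet, it is sent as a `Content-Security-Policy: upgrade-insecure-requests` response header. Below is a minimal sketch of adding it with a WSGI wrapper in Python. The `upgrade_insecure_requests` wrapper and `hello_app` are hypothetical names for illustration, and a real deployment would more likely set the header in the web server configuration.

```python
def upgrade_insecure_requests(app):
    """Wrap a WSGI app so every response carries the CSP directive."""

    def wrapped(environ, start_response):
        def patched_start_response(status, headers, exc_info=None):
            # Append the directive to whatever headers the app already set.
            headers = list(headers) + [
                ("Content-Security-Policy", "upgrade-insecure-requests")
            ]
            return start_response(status, headers, exc_info)

        return app(environ, patched_start_response)

    return wrapped


def hello_app(environ, start_response):
    """Toy app whose page references an image over plain HTTP."""
    start_response("200 OK", [("Content-Type", "text/html; charset=utf-8")])
    return [b"<img src='http://example.com/logo.png'>"]


app = upgrade_insecure_requests(hello_app)
```

With this header, a supporting browser fetches the image over HTTPS instead of reporting Mixed Content, provided the resource is actually available over HTTPS.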

But unfortunately, I discovered that BigQuery is not free anymore. I managed to access the console and find my httparchive project, but when running the query (or one of the latest from this forum), I still don’t have results after waiting several minutes… Any idea?

Sorry to bother you about it!

That sounds like a bug with the query. The same rules about processing limits (free quotas) apply as before.

Thanks @igrigorik for your answer. No idea why, but running this same query now works…

Anyway, here are the updated results:
Among the 27182 home pages served over SSL, 3107 are using Mixed Content (11%).

The ratio is about the same as one year ago, even though there are many more HTTPS pages: firstly because there are more pages in the HTTPArchive project, but also because HTTPS usage is growing!
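
For the record, the two ratios can be recomputed from the counts given in this thread. The `mixed_content_share` helper name is just for illustration.

```python
def mixed_content_share(mixed, total):
    """Percentage of HTTPS home pages that load at least one non-HTTPS resource."""
    return 100.0 * mixed / total


first_run = mixed_content_share(1031, 9185)    # first run in this thread
second_run = mixed_content_share(3107, 27182)  # run one year later

print(f"{first_run:.1f}% -> {second_run:.1f}%")  # 11.2% -> 11.4%
```

So the share moved from roughly 11.2% to 11.4%, which rounds to the same 11% in both runs.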