Recent articles have shown that the trackers embedded on many websites add to the information that Facebook, Google, and other companies can use to learn about us, even if we are not logged in (or do not have an account at all!). I started thinking about how many different sites might be tracking my browsing on the web.
“They’re tracking us.” -Princess Leia
How do we know where and how much companies are tracking us?
How many sites are using Facebook tracking APIs? Twitter APIs? Amazon? The number of companies/trackers is probably endless, but in this article, I’ll focus on a small handful - Facebook, Google, Twitter, LinkedIn and Amazon. I know that as long as I have urls, HTTPArchive can help!
What are the tracker urls I based my queries on? In an attempt at irony, I ran a Google search on “Facebook privacy”, and I chose the top 6 articles:
I then used Ghostery (a Chrome plugin that IDs trackers/ads, etc) to identify the trackers on these pages. For example, here are two Facebook trackers:
Now, it is possible that some of the urls I flagged are innocuous and not tracking users across the web. I didn’t dig into each API or what data they collect. Just using the urls in these reports, I built the following query for Facebook:
pages.pageid = requests.pageid
(requests.url CONTAINS "facebook.com/tr" ||
requests.url CONTAINS "graph.facebook.com" ||
requests.url CONTAINS "facebook.com/impression" ||
requests.url CONTAINS "facebook.com/connect" ||
requests.url CONTAINS "connect.facebook.net" ||
requests.url CONTAINS "connect.facebook.com" ||
requests.url CONTAINS "facebook.com/brandlift")
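The filter above is a plain substring match over request URLs. As a rough illustration of the same logic outside BigQuery, here is a minimal Python sketch. The sample rows and the helper name `count_tracking` are invented for the example and are not part of the HTTPArchive schema.

```python
# Substrings taken from the Facebook query above.
FACEBOOK_PATTERNS = [
    "facebook.com/tr",
    "graph.facebook.com",
    "facebook.com/impression",
    "facebook.com/connect",
    "connect.facebook.net",
    "connect.facebook.com",
    "facebook.com/brandlift",
]


def count_tracking(requests, patterns):
    """Return (matching request count, set of pageids with a match).

    `requests` is an iterable of (pageid, url) pairs, a hypothetical
    stand-in for the joined pages/requests tables.
    """
    hits = 0
    pages = set()
    for pageid, url in requests:
        # Equivalent of chaining CONTAINS clauses with OR.
        if any(p in url for p in patterns):
            hits += 1
            pages.add(pageid)
    return hits, pages


# Invented sample rows, for illustration only.
sample = [
    (1, "https://www.facebook.com/tr?id=123&ev=PageView"),
    (1, "https://connect.facebook.net/en_US/fbevents.js"),
    (1, "https://example.com/app.js"),
    (2, "https://example.org/style.css"),
    (3, "https://graph.facebook.com/?id=https://example.net"),
]

hits, pages = count_tracking(sample, FACEBOOK_PATTERNS)
print(hits, len(pages))
```

Substring matching is deliberately broad: "facebook.com/tr" also matches any other path that starts with /tr, which is one reason some flagged urls may turn out to be innocuous.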
So, How Much is Facebook Tracking You?
I found 805k Facebook tracking requests in the database, spread over 147.5k sites. That’s a median of 5 Facebook trackers per page, on roughly 33% of the sites in the HTTPArchive.
Interestingly, breaking this down by rank shows that the top 100 sites use far less Facebook tracking (11%) than the remainder of the dataset, possibly because many of Facebook’s competitors are in the top 100.
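One way to picture a rank breakdown like this is to bucket pages by rank and compute the share of each bucket that has at least one matching request. The sketch below uses invented data; the `rank` values, the cutoff of 100, and the function name are assumptions for illustration, not the query actually used.

```python
from collections import defaultdict


def share_by_rank_bucket(page_ranks, tracked_pageids, cutoff=100):
    """Share of pages with a tracker, split into rank <= cutoff and the rest.

    page_ranks: dict mapping pageid -> rank (hypothetical field).
    tracked_pageids: set of pageids with at least one tracking request.
    """
    totals = defaultdict(int)
    tracked = defaultdict(int)
    for pageid, rank in page_ranks.items():
        bucket = "top" if rank <= cutoff else "rest"
        totals[bucket] += 1
        if pageid in tracked_pageids:
            tracked[bucket] += 1
    return {b: tracked[b] / totals[b] for b in totals}


# Invented example: 4 top-ranked pages and 4 others.
ranks = {1: 5, 2: 40, 3: 77, 4: 99, 5: 150, 6: 2000, 7: 31000, 8: 400000}
with_tracker = {2, 5, 6, 8}

shares = share_by_rank_bucket(ranks, with_tracker)
print(shares)
```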
How about Twitter?
In the 6 pages examined, I found 2 tracking urls:
This query results in 117.5k tracking requests across 39.7k sites.
6k instances across 2.4k sites
22.5k instances across 7.6k sites
In my limited sample of sites, Google had the largest number of urls indicated as trackers. So large, in fact, that BigQuery ran into memory issues when I tried to run them all at once. I was able to break these up into smaller queries and still get the complete picture:
Google Ads (Not DoubleClick)
requests.url CONTAINS "https://www.googletagservices.com/tag/js" ||
requests.url CONTAINS "https://www.google-analytics.com/collect" ||
requests.url CONTAINS "https://ssl.google-analytics.com" ||
requests.url CONTAINS "pagead2.googlesyndication.com/pagead" ||
requests.url CONTAINS "www.googleadservices.com/pagead/" ||
requests.url CONTAINS "imasdk.googleapis.com/js/sdkloader"
600k entries across 202k sites (44% of all sites tested have one of these urls)
Google Ads - DoubleClick
requests.url CONTAINS "stats.g.doubleclick.net/r/collect" ||
requests.url CONTAINS "securepubads.g.doubleclick.net/gpt/" ||
requests.url CONTAINS "googleads.g.doubleclick.net"
1.02M entries across 250k sites (54% of all sites tested have one of these urls)
Adding up the entries gives 1.79M Google trackers. A query for just the page urls does run successfully, and it shows these trackers on 268k sites (58% of all sites).
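Request counts from separate queries can be added together, but site counts cannot, because the same site often shows up in more than one result. That is why the combined site figure is well below the sum of the per-query site counts. A minimal sketch with invented page ids, assuming no single request matches both pattern lists:

```python
# Invented results from two separate queries: (request count, set of pageids).
ads_requests, ads_sites = 6, {1, 2, 3, 4}
doubleclick_requests, doubleclick_sites = 9, {3, 4, 5}

# Requests are distinct rows, so their counts add directly.
total_requests = ads_requests + doubleclick_requests

# Sites overlap, so take the union rather than summing the sizes.
total_sites = len(ads_sites | doubleclick_sites)

print(total_requests, total_sites)
```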
Combining all these results can give us the sum of all trackers found in the HTTPArchive:
And we can look at the number of sites with each set of urls present:
64% of sites in the HTTPArchive use at least one of the 22 urls I have specified above. You might guess that most sites use more than one of these trackers, and you’d be correct:
(The max value is 290, but I removed it to keep the y-axis scale reasonable). The median count is 7 per page.
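A per-page distribution like this can be built by counting matched requests per page and then counting how many pages share each total. The sketch below uses invented rows. The median is a useful summary here because a single extreme page (such as the one with 290) barely moves it.

```python
from collections import Counter
from statistics import median

# Invented (pageid, url) rows that already matched one of the tracker patterns.
matched = [
    (1, "a"), (1, "b"), (1, "c"),
    (2, "a"),
    (3, "a"), (3, "b"), (3, "c"), (3, "d"), (3, "e"),
]

# Number of tracking requests per page.
per_page = Counter(pageid for pageid, _ in matched)

# Histogram: how many pages have exactly n tracking requests.
histogram = Counter(per_page.values())

print(sorted(histogram.items()))
print(median(per_page.values()))
```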
Not all tracking is the same: different trackers attach different levels of personal information to each click or page visit. Added up, though, they can paint a pretty clear picture of an individual’s likes, dislikes, and so on.
With a very limited list of tracking urls from Ghostery and 6 news articles, I found that there are many trackers being used across a large cross section of the web:
64% of all sites had at least one tracker present, and the median site with trackers utilized 7 of the 22 urls I queried on.