Recent articles have shown that the trackers embedded on many websites add to the information that Facebook, Google, and other companies can use to learn about us, even if we are not logged in (or do not have an account at all!). That got me thinking about how many different sites might be tracking my browsing on the web.
TL;DR
“They’re tracking us.” -Princess Leia
How do we know where and how much companies are tracking us?
How many sites are using Facebook tracking APIs? Twitter APIs? Amazon? The number of companies/trackers is probably endless, but in this article, I’ll focus on a small handful - Facebook, Google, Twitter, LinkedIn and Amazon. I know that as long as I have urls, HTTPArchive can help!
Methodology
What are the tracker urls I based my queries on? In an attempt at irony, I ran a Google search on “Facebook privacy”, and I chose the top 6 articles:
https://www.chronicle.com/blogs/profhacker/firefox-add-on-protects-against-most-facebook-tracking/65281
Facebook Tracking Your Web History? Time to Switch Your Browser
Limiting Facebook's data brokers won’t stop tracking | The Daily Star
Facebook, Google and others are tracking you. Here’s how to stop targeted ads - National | Globalnews.ca
Facebook tracking calls - how the Facebook Messenger app collected phone and text logs on Android phones - CBS News
https://www.makeuseof.com/tag/facebook-tracking-stop/
I then used Ghostery (a Chrome plugin that IDs trackers/ads, etc) to identify the trackers on these pages. For example, here are two Facebook trackers:
Now, it is possible that some of the urls I flagged are innocuous and not tracking users across the web. I didn’t dig into each API or what data they collect. Just using the urls in these reports, I built the following query for Facebook:
SELECT
  pages.rank,
  pages.url,
  requests.url,
  ext
FROM
  httparchive.runs.latest_requests_mobile requests
JOIN (
  SELECT
    rank,
    pageid,
    url
  FROM
    httparchive.runs.latest_pages_mobile) pages
ON
  pages.pageid = requests.pageid
WHERE
  (requests.url CONTAINS "facebook.com/tr" ||
   requests.url CONTAINS "graph.facebook.com" ||
   requests.url CONTAINS "Impression | Nottingham" ||
   requests.url CONTAINS "Redirecting..." ||
   requests.url CONTAINS "connect.facebook.net" ||
   requests.url CONTAINS "connect.facebook.com" ||
   requests.url CONTAINS "Redirecting..."
  )
ORDER BY
  rank ASC
So, How Much is Facebook Tracking You?
I found 805k Facebook tracking requests in the database, spread over 147.5k sites. That’s a median of 5 Facebook trackers per page, on roughly 33% of the sites in the dataset.
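To make the counting concrete, here is a minimal local sketch in Python of the same logic: substring matching (like CONTAINS), then the request count, the number of distinct sites, the median per site, and the share of sites. The function names and the sample rows are invented for illustration and are not HTTPArchive data, and the two query entries that do not look like urls are left out.

```python
from statistics import median

# Substring patterns copied from the Facebook query above.
# The entries that do not look like urls are omitted from this sketch.
FACEBOOK_PATTERNS = [
    "facebook.com/tr",
    "graph.facebook.com",
    "connect.facebook.net",
    "connect.facebook.com",
]


def is_tracker(url, patterns=FACEBOOK_PATTERNS):
    """Mimic the CONTAINS checks: True if any pattern is a substring of url."""
    return any(p in url for p in patterns)


def summarize(requests, total_pages, patterns=FACEBOOK_PATTERNS):
    """requests: iterable of (pageid, request_url) pairs (hypothetical shape)."""
    per_page = {}
    for pageid, url in requests:
        if is_tracker(url, patterns):
            per_page[pageid] = per_page.get(pageid, 0) + 1
    counts = list(per_page.values())
    return {
        "tracker_requests": sum(counts),
        "sites": len(per_page),
        "median_per_site": median(counts) if counts else 0,
        "share_of_sites": len(per_page) / total_pages,
    }


# Hypothetical rows, invented for illustration only.
sample_requests = [
    (1, "https://www.facebook.com/tr/?id=123&ev=PageView"),
    (1, "https://connect.facebook.net/en_US/fbevents.js"),
    (1, "https://example.com/app.js"),
    (2, "https://example.org/style.css"),
    (3, "https://graph.facebook.com/?id=https://example.net"),
]
summary = summarize(sample_requests, total_pages=3)
print(summary)
```

Note that plain substring matching is deliberately loose: a pattern such as "facebook.com/tr" would also match any other path that merely starts with "/tr", which is one way innocuous urls can end up in the counts.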
Interestingly, breaking this down by rank shows that the top 100 sites use far LESS Facebook tracking (11%) than the remainder of the dataset, possibly because many of Facebook’s competitors are in the top 100.
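A breakdown by rank like this can be sketched as follows. This is only an illustration: it assumes each page has an integer rank and that we already have the set of page ids with at least one matching request, and the function name, cutoff parameter, and sample data are all hypothetical.

```python
def share_by_rank_bucket(pages, tracked_pageids, cutoff=100):
    """pages: iterable of (pageid, rank) pairs.

    Returns the fraction of pages with a tracker for ranks <= cutoff ("top")
    and for everything else ("rest").
    """
    totals = {"top": 0, "rest": 0}
    hits = {"top": 0, "rest": 0}
    for pageid, rank in pages:
        bucket = "top" if rank <= cutoff else "rest"
        totals[bucket] += 1
        if pageid in tracked_pageids:
            hits[bucket] += 1
    return {b: (hits[b] / totals[b] if totals[b] else 0.0) for b in totals}


# Hypothetical data, invented for illustration only.
sample_pages = [(1, 5), (2, 50), (3, 500), (4, 900), (5, 2000)]
sample_tracked = {2, 3, 4}
shares = share_by_rank_bucket(sample_pages, sample_tracked)
print(shares)
```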
How about Twitter?
In the 6 pages examined, I found 2 tracking urls:
requests.url CONTAINS "syndication.twitter.com/i/jot" ||
requests.url CONTAINS "platform.twitter.com/widgets.js"
This query results in 117.5k tracking requests across 39.7k sites.
Amazon?
amazon-adsystem.com/widgets
6k instances across 2.4k sites
LinkedIn?
requests.url CONTAINS "px.ads.linkedin.com/collect" ||
requests.url CONTAINS "snap.licdn.com/li.lms-analytics"
22.5k instances across 7.6k sites
Google
In my limited sample of sites, Google had the largest number of urls indicated as trackers. So large, in fact, that BigQuery ran into memory issues when I tried to run them all at once. I was able to break them up into smaller queries and get the complete picture:
Google Analytics
requests.url CONTAINS "https://www.google-analytics.com/collect" ||
requests.url CONTAINS "https://ssl.google-analytics.com"
176k results across 103.7k sites (23% of all sites in the dataset)
Google Ads (Not DoubleClick)
requests.url CONTAINS "https://www.googletagservices.com/tag/js" ||
requests.url CONTAINS "https://www.google-analytics.com/collect" ||
requests.url CONTAINS "https://ssl.google-analytics.com" ||
requests.url CONTAINS "pagead2.googlesyndication.com/pagead" ||
requests.url CONTAINS "www.googleadservices.com/pagead/" ||
requests.url CONTAINS "imasdk.googleapis.com/js/sdkloader"
600k entries across 202k sites (44% of all sites tested have one of these urls)
Google Ads - DoubleClick
requests.url CONTAINS "stats.g.doubleclick.net/r/collect" ||
requests.url CONTAINS "securepubads.g.doubleclick.net/gpt/" ||
requests.url CONTAINS "googleads.g.doubleclick.net"
1.02M entries across 250k sites (54% of all sites tested have one of these urls)
Adding up the entries gives 1.79M Google trackers. If I run a query for just the page urls, the query runs successfully, and these trackers appear on 268k sites (58% of all sites).
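The reason the site total (268k) is smaller than the sum of the three per-query site counts is that the same site can show up in several of the smaller queries. Here is a minimal Python sketch of that idea, assuming made-up rows and a hypothetical match_pages helper: request counts are added up per batch, while sites are combined as a set union. The batches use a subset of the patterns listed above.

```python
def match_pages(requests, patterns):
    """Return (matching_request_count, set_of_pageids) for one batch of patterns."""
    count = 0
    pageids = set()
    for pageid, url in requests:
        if any(p in url for p in patterns):
            count += 1
            pageids.add(pageid)
    return count, pageids


# A subset of the Google patterns from above, split into smaller batches.
BATCHES = {
    "analytics": [
        "https://www.google-analytics.com/collect",
        "https://ssl.google-analytics.com",
    ],
    "ads": [
        "pagead2.googlesyndication.com/pagead",
        "www.googleadservices.com/pagead/",
    ],
    "doubleclick": [
        "stats.g.doubleclick.net/r/collect",
        "googleads.g.doubleclick.net",
    ],
}

# Hypothetical rows, invented for illustration only.
sample_requests = [
    (1, "https://www.google-analytics.com/collect?v=1"),
    (1, "https://googleads.g.doubleclick.net/pagead/id"),
    (2, "https://pagead2.googlesyndication.com/pagead/js/adsbygoogle.js"),
    (2, "https://stats.g.doubleclick.net/r/collect?v=1"),
    (3, "https://example.com/main.js"),
]

total_requests = 0
per_batch_sites = {}
all_pages = set()
for name, patterns in BATCHES.items():
    count, pageids = match_pages(sample_requests, patterns)
    total_requests += count          # requests can simply be added up
    per_batch_sites[name] = len(pageids)
    all_pages |= pageids             # sites must be merged, not added

print(total_requests, per_batch_sites, len(all_pages))
```

Adding request counts across batches is only safe when no pattern appears in more than one batch; a request matched by two batches would be counted twice.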
Totals
Combining all these results can give us the sum of all trackers found in the HTTPArchive:
And we can look at the number of sites with each set of urls present:
64% of sites in the HTTPArchive use at least one of the 22 urls I have specified above. If you guessed that most sites use more than one of these trackers, you’d be correct:
(The max value is 290, but I removed it to keep the y-axis scale reasonable). The median count is 7 per page.
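For readers who want to reproduce this kind of distribution, here is a small Python sketch, assuming we already have one tracker count per page. It computes the median on the full data and only drops the single largest value for display, which is one reasonable reading of the note above. The function name and the sample counts are invented for illustration.

```python
from collections import Counter
from statistics import median


def tracker_histogram(per_page_counts, drop_max=True):
    """per_page_counts: one tracker-request count per page with at least one tracker.

    Returns (median over all pages, histogram of counts for plotting).
    """
    counts = sorted(per_page_counts)
    med = median(counts)              # median uses every page, outlier included
    if drop_max and len(counts) > 1:
        counts = counts[:-1]          # drop only the single largest value
    return med, Counter(counts)


# Hypothetical counts, invented for illustration only.
sample_counts = [1, 3, 7, 7, 7, 9, 12, 290]
med, hist = tracker_histogram(sample_counts)
for value in sorted(hist):
    print(f"{value:>3} | {'#' * hist[value]}")
```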
Conclusion:
All tracking is different - with different levels of personal information attached to each click or page visit. But, when added up, it can lead to a pretty clear picture of an individual’s likes, dislikes, etc.
With a very limited list of tracking urls from Ghostery and 6 news articles, I found that many trackers are in use across a large cross section of the web:
64% of all sites had at least one tracker present, and the median site with trackers utilized 7 of the 22 urls I queried on.