Tracking: Your Privacy and the Web

Recent articles have shown that the trackers embedded on many websites add to the information that Facebook, Google (and other companies) can use to learn about us - even if we are not logged in (or even if we do not have an account!). That got me thinking about how many different sites might be tracking my browsing on the web.

TL;DR
“They’re tracking us.” -Princess Leia



How do we know where and how much companies are tracking us?

https://twitter.com/danbarker/status/979571712961507328

How many sites are using Facebook tracking APIs? Twitter APIs? Amazon? The number of companies/trackers is probably endless, but in this article, I’ll focus on a small handful - Facebook, Google, Twitter, LinkedIn and Amazon. I know that as long as I have urls, HTTPArchive can help!

Methodology
What are the tracker urls I based my queries on? In an attempt at irony, I ran a Google search on “Facebook privacy”, and I chose the top 6 articles:

https://www.chronicle.com/blogs/profhacker/firefox-add-on-protects-against-most-facebook-tracking/65281
https://www.thequint.com/tech-and-auto/tech-news/prevent-facebook-data-access-with-this-firefox-web-extension
http://www.thedailystar.net/world/limiting-facebooks-data-brokers-wont-stop-tracking-1555630
https://globalnews.ca/news/4110311/how-to-stop-targeted-ads-facebook-google-browser/
https://www.cbsnews.com/news/how-facebook-was-able-to-collect-android-phone-and-text-logs/
https://www.makeuseof.com/tag/facebook-tracking-stop/

I then used Ghostery (a Chrome plugin that IDs trackers/ads, etc) to identify the trackers on these pages. For example, here are two Facebook trackers:
image
image

Now, it is possible that some of the urls I flagged are innocuous and not tracking users across the web. I didn’t dig into each API or what data they collect. Just using the urls in these reports, I built the following query for Facebook:

SELECT
  pages.rank,
  pages.url,
  requests.url,
  ext
FROM
  httparchive.runs.latest_requests_mobile requests
JOIN (
  SELECT
    rank,
    pageid,
    url
  FROM
    httparchive.runs.latest_pages_mobile) pages
ON
  pages.pageid = requests.pageid
WHERE
  (requests.url CONTAINS "facebook.com/tr" ||
  requests.url CONTAINS "graph.facebook.com" ||
  requests.url CONTAINS "facebook.com/impression" ||
  requests.url CONTAINS "facebook.com/connect" ||
  requests.url CONTAINS "connect.facebook.net" ||
  requests.url CONTAINS "connect.facebook.com" ||
  requests.url CONTAINS "facebook.com/brandlift"
  )
ORDER BY
  rank ASC
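
To make the matching rule concrete, here is a minimal Python sketch of what the WHERE clause does: a request counts as a Facebook tracker if any of the seven substrings appears anywhere in its URL. The function names and the sample rows are hypothetical and only for illustration; the real work happens in BigQuery. Note that substring matching is broad, so a pattern like "facebook.com/tr" also matches any longer path that starts with "/tr", which is one reason some flagged urls may be innocuous.

```python
# Substrings taken from the Facebook query above.
FACEBOOK_PATTERNS = [
    "facebook.com/tr",
    "graph.facebook.com",
    "facebook.com/impression",
    "facebook.com/connect",
    "connect.facebook.net",
    "connect.facebook.com",
    "facebook.com/brandlift",
]


def is_tracker(request_url, patterns=FACEBOOK_PATTERNS):
    """Return True if any pattern occurs anywhere in the URL (like CONTAINS)."""
    return any(pattern in request_url for pattern in patterns)


def tracker_requests(rows, patterns=FACEBOOK_PATTERNS):
    """Keep rows whose request URL matches, ordered by page rank.

    rows: iterable of (rank, page_url, request_url) tuples.
    """
    hits = [row for row in rows if is_tracker(row[2], patterns)]
    return sorted(hits, key=lambda row: row[0])


# Hypothetical sample rows, not real HTTPArchive data.
sample_rows = [
    (2, "https://example.org/", "https://connect.facebook.net/en_US/sdk.js"),
    (1, "https://example.com/", "https://www.facebook.com/tr?id=123"),
    (1, "https://example.com/", "https://example.com/style.css"),
]

for rank, page_url, request_url in tracker_requests(sample_rows):
    print(rank, page_url, request_url)
```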

So, How Much is Facebook Tracking You?
I found 805k Facebook tracking requests in the database, spread over 147.5k sites. That’s a median of 5 Facebook trackers per page, on roughly 33% of the sites in the dataset.

Interestingly, breaking this down by rank shows that the top 100 sites use far LESS Facebook tracking than the remainder of the dataset (only 11% of the top 100 have it) - possibly because many of Facebook’s competitors are in the top 100.
image

image
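
A rank breakdown like this one can be sketched in Python as follows. This is only an illustration of the idea, assuming a list of (rank, url) pairs for all pages and a set of page urls that had at least one tracker match; the function name, the `top_n` parameter and the toy data are hypothetical.

```python
def share_by_rank_bucket(pages, tracked_urls, top_n=100):
    """Return the share of pages with a tracker for the top_n ranks vs. the rest.

    pages: iterable of (rank, url) for every page in the dataset.
    tracked_urls: set of page urls with at least one matching request.
    """
    buckets = {"top": [0, 0], "rest": [0, 0]}  # [tracked, total] per bucket
    for rank, url in pages:
        key = "top" if rank <= top_n else "rest"
        buckets[key][1] += 1
        if url in tracked_urls:
            buckets[key][0] += 1
    return {
        key: (tracked / total if total else 0.0)
        for key, (tracked, total) in buckets.items()
    }


# Toy data: 4 pages, "top" means rank <= 2 here.
toy_pages = [(1, "a.com"), (2, "b.com"), (3, "c.com"), (4, "d.com")]
toy_tracked = {"b.com", "c.com", "d.com"}
print(share_by_rank_bucket(toy_pages, toy_tracked, top_n=2))
```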

How about Twitter?
In the 6 pages examined, I found 2 tracking urls:

requests.url CONTAINS "syndication.twitter.com/i/jot" ||
requests.url CONTAINS "platform.twitter.com/widgets.js"

This query results in 117.5k tracking requests across 39.7k sites.

Amazon?
amazon-adsystem.com/widgets
6k instances across 2.4k sites

LinkedIn

requests.url CONTAINS "px.ads.linkedin.com/collect" ||
requests.url CONTAINS "snap.licdn.com/li.lms-analytics"

22.5k instances across 7.6k sites

Google
In my limited sample of sites, Google had the largest number of urls indicated as trackers. So large, in fact, that BigQuery has memory issues if I try to run them all at once. I was able to break these up into smaller queries and still get the complete picture:

Google Analytics

requests.url CONTAINS "https://www.google-analytics.com/collect" ||
requests.url CONTAINS "https://ssl.google-analytics.com"
176k results across 103.7k sites (23% of all sites in the dataset)

Google Ads (Not DoubleClick)

requests.url CONTAINS "https://www.googletagservices.com/tag/js" ||
requests.url CONTAINS "https://www.google-analytics.com/collect" ||
requests.url CONTAINS "https://ssl.google-analytics.com" ||
requests.url CONTAINS "pagead2.googlesyndication.com/pagead" ||
requests.url CONTAINS "www.googleadservices.com/pagead/" ||
requests.url CONTAINS "imasdk.googleapis.com/js/sdkloader"
600k entries across 202k sites (44% of all sites tested have one of these urls)

Google Ads - DoubleClick

requests.url CONTAINS "stats.g.doubleclick.net/r/collect" ||
requests.url CONTAINS "securepubads.g.doubleclick.net/gpt/" ||
requests.url CONTAINS "googleads.g.doubleclick.net"
1.02M entries across 250k sites (54% of all sites tested have one of these urls)

Adding up the entries gives 1.79M Google trackers. If I run a query for just the page urls, the query runs successfully, and these trackers appear on 268k sites (58% of all sites).
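
The reason the entries can be added while the sites need their own query is that request counts from separate queries never overlap, but the same site can show up in several of them. Here is a small Python sketch of that difference. The rounded request counts are the ones reported above (they sum to 1,796k, in line with the roughly 1.79M quoted); the `combine` helper and the toy site sets are hypothetical.

```python
# Request counts in thousands, as reported above for each Google group.
google_entries_k = {
    "Google Analytics": 176,
    "Google Ads (not DoubleClick)": 600,
    "DoubleClick": 1020,
}
total_entries_k = sum(google_entries_k.values())
print(total_entries_k)


def combine(chunks):
    """Merge results from several smaller queries.

    chunks: list of (request_count, set_of_sites), one per query.
    Requests are summed; sites are unioned so that a site using
    several trackers is only counted once.
    """
    total_requests = sum(count for count, _ in chunks)
    all_sites = set().union(*(sites for _, sites in chunks))
    return total_requests, len(all_sites)


# Toy data: b.com and c.com each appear in two of the three queries.
toy_chunks = [
    (3, {"a.com", "b.com"}),
    (5, {"b.com", "c.com"}),
    (2, {"c.com"}),
]
print(combine(toy_chunks))
```

The same effect is visible in the real numbers: the three per-group site counts (103.7k, 202k and 250k) add up to far more than the 268k sites found when querying for page urls directly.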

Totals
Combining all these results can give us the sum of all trackers found in the HTTPArchive:
image

And we can look at the number of sites with each set of urls present:
image

64% of sites in the HTTPArchive use at least one of the 22 urls I have specified above. As you can probably guess, most of these sites use more than one tracker:
image
(The max value is 290, but I removed it to keep the y-axis scale reasonable). The median count is 7 per page.
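
For anyone who wants to reproduce this kind of summary from a query export, here is a minimal Python sketch. It assumes you have one page url per matching request (for example, the page url column of the query results) plus the total number of pages in the dataset; the function names and the toy data are hypothetical.

```python
from collections import Counter
from statistics import median


def trackers_per_page(tracker_hits):
    """Count matching requests per page.

    tracker_hits: iterable of page urls, one entry per matching request.
    """
    return Counter(tracker_hits)


def summarize(tracker_hits, total_pages):
    """Return (share of pages with a tracker, median count, max count)."""
    counts = trackers_per_page(tracker_hits)
    share = len(counts) / total_pages
    return share, median(counts.values()), max(counts.values())


# Toy data: 3 of 5 pages have trackers, with 3, 1 and 8 matching requests.
toy_hits = ["a.com"] * 3 + ["b.com"] + ["c.com"] * 8
print(summarize(toy_hits, total_pages=5))
```

The median is a reasonable choice here because a single extreme page (like the one with 290 trackers) barely moves it, whereas it would pull an average upward.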

Conclusion:

All tracking is different - with different levels of personal information attached to each click or page visit. But, when added up, it can lead to a pretty clear picture of an individual’s likes, dislikes, etc.

With a very limited list of tracking urls from Ghostery and 6 news articles, I found that there are many trackers being used across a large cross section of the web:

64% of all sites had at least one tracker present, and the median site with trackers utilized 7 of the 22 urls I queried on.
