Tracking: Your Privacy and the Web




Recent articles have shown that the trackers embedded on many websites add to the information that Facebook, Google, and other companies can use to learn about us, even if we are not logged in, or do not have an account at all! I started thinking about how many different sites might be tracking my browsing on the web.

TL;DR


“They’re tracking us.” -Princess Leia



How do we know where and how much companies are tracking us?

How many sites are using Facebook tracking APIs? Twitter APIs? Amazon? The number of companies/trackers is probably endless, but in this article, I’ll focus on a small handful - Facebook, Google, Twitter, LinkedIn and Amazon. I know that as long as I have urls, HTTPArchive can help!

Methodology
What are the tracker urls I based my queries on? In an attempt at irony, I ran a Google search on “Facebook privacy”, and I chose the top 6 articles:

https://www.chronicle.com/blogs/profhacker/firefox-add-on-protects-against-most-facebook-tracking/65281
https://www.thequint.com/tech-and-auto/tech-news/prevent-facebook-data-access-with-this-firefox-web-extension
http://www.thedailystar.net/world/limiting-facebooks-data-brokers-wont-stop-tracking-1555630
https://globalnews.ca/news/4110311/how-to-stop-targeted-ads-facebook-google-browser/
https://www.cbsnews.com/news/how-facebook-was-able-to-collect-android-phone-and-text-logs/
https://www.makeuseof.com/tag/facebook-tracking-stop/

I then used Ghostery (a Chrome plugin that IDs trackers/ads, etc) to identify the trackers on these pages. For example, here are two Facebook trackers:
[Images: two Facebook trackers flagged by Ghostery]

Now, it is possible that some of the urls I flagged are innocuous and not tracking users across the web. I didn’t dig into each API or what data they collect. Just using the urls in these reports, I built the following query for Facebook:

SELECT
  pages.rank,
  pages.url,
  requests.url,
  ext
FROM
  httparchive.runs.latest_requests_mobile requests
JOIN (
  SELECT
    rank,
    pageid,
    url
  FROM
    httparchive.runs.latest_pages_mobile) pages
ON
  pages.pageid = requests.pageid
WHERE
  -- keep any request whose url matches one of the Facebook urls flagged by Ghostery
  (requests.url CONTAINS "facebook.com/tr" OR
  requests.url CONTAINS "graph.facebook.com" OR
  requests.url CONTAINS "facebook.com/impression" OR
  requests.url CONTAINS "facebook.com/connect" OR
  requests.url CONTAINS "connect.facebook.net" OR
  requests.url CONTAINS "connect.facebook.com" OR
  requests.url CONTAINS "facebook.com/brandlift")
ORDER BY
  rank ASC
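Since the later sections reuse this query with a different list of urls each time, the WHERE clause could be generated rather than typed by hand. The following is a minimal sketch, not the method used for this article: `build_where_clause` is a hypothetical helper, and it joins the tests with OR on the assumption that this is equivalent to the operator in the query above.

```python
# Url fragments copied from the Facebook query above.
FACEBOOK_FRAGMENTS = [
    "facebook.com/tr",
    "graph.facebook.com",
    "facebook.com/impression",
    "facebook.com/connect",
    "connect.facebook.net",
    "connect.facebook.com",
    "facebook.com/brandlift",
]


def build_where_clause(fragments, column="requests.url"):
    """Hypothetical helper: one CONTAINS test per fragment, joined with OR."""
    tests = ['{} CONTAINS "{}"'.format(column, fragment) for fragment in fragments]
    return "(" + " OR\n ".join(tests) + ")"


# Print the clause so it can be pasted into the query.
print(build_where_clause(FACEBOOK_FRAGMENTS))
```

Swapping in the Twitter or LinkedIn fragments would give the clause for those sections without retyping each line, which also avoids unbalanced quotes.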

So, How Much is Facebook Tracking You?
I find 805k Facebook tracking requests in the database across 147.5k sites. That’s a median of 5 Facebook trackers per page, on roughly 33% of the sites in the dataset.
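As a quick sanity check on these figures, the totals can be turned into a mean per site and an implied dataset size. This is only arithmetic on the rounded numbers quoted here; the variable names are my own, and the mean is a different statistic from the median reported in the text.

```python
# Figures quoted in the text, rounded as in the article.
facebook_requests = 805_000
facebook_sites = 147_500
share_of_dataset = 0.33  # "roughly 33%"

# Mean Facebook requests per site that has at least one (not the median).
mean_per_site = facebook_requests / facebook_sites

# Dataset size implied by 147.5k sites being roughly 33% of the total.
implied_dataset_size = facebook_sites / share_of_dataset

print(round(mean_per_site, 2))
print(round(implied_dataset_size))
```

The mean comes out at about 5.5 requests per site, close to the median of 5, and the implied dataset is on the order of 450k sites.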

Interestingly, breaking this down by rank shows that the top 100 sites use far less Facebook tracking (11%) than the remainder of the dataset, possibly because many of Facebook’s competitors are in the top 100.
[Images: Facebook tracking broken down by site rank]

How about Twitter?
In the 6 pages examined, I found 2 tracking urls:

requests.url CONTAINS "syndication.twitter.com/i/jot" OR
requests.url CONTAINS "platform.twitter.com/widgets.js"

This query results in 117.5k tracking requests across 39.7k sites.

Amazon?
amazon-adsystem.com/widgets
6k instances across 2.4k sites

LinkedIn

requests.url CONTAINS "px.ads.linkedin.com/collect" OR
requests.url CONTAINS "snap.licdn.com/li.lms-analytics"

22.5k instances across 7.6k sites

Google
In my limited sample of sites, Google had the largest number of urls flagged as trackers. So many, in fact, that BigQuery ran into memory issues when I tried to query them all at once. I was able to break these up into smaller queries to get the complete picture:

Google Analytics

requests.url CONTAINS "https://www.google-analytics.com/collect" OR
requests.url CONTAINS "https://ssl.google-analytics.com"
176k results across 103.7k sites (23% of all sites in the dataset)

Google Ads (Not DoubleClick)

requests.url CONTAINS "https://www.googletagservices.com/tag/js" OR
requests.url CONTAINS "https://www.google-analytics.com/collect" OR
requests.url CONTAINS "https://ssl.google-analytics.com" OR
requests.url CONTAINS "pagead2.googlesyndication.com/pagead" OR
requests.url CONTAINS "www.googleadservices.com/pagead/" OR
requests.url CONTAINS "imasdk.googleapis.com/js/sdkloader"
600k entries across 202k sites (44% of all sites tested have one of these urls)

Google Ads - DoubleClick

requests.url CONTAINS "stats.g.doubleclick.net/r/collect" OR
requests.url CONTAINS "securepubads.g.doubleclick.net/gpt/" OR
requests.url CONTAINS "googleads.g.doubleclick.net"
1.02M entries across 250k sites (54% of all sites tested have one of these urls)

Adding up the entries gives 1.79M Google trackers. If I run a query for just the page urls, the query runs successfully, and these trackers appear on 268k sites (58% of all sites).
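Entries can simply be added, but site counts cannot, because one site can match several of the Google queries; that is why a separate query over page urls is needed. The sketch below illustrates the difference. The entry counts are the rounded figures from the text, while the site names are made-up toy data, not results from the HTTPArchive.

```python
# Entry counts quoted in the text, in thousands.
entries_k = {"analytics": 176, "ads": 600, "doubleclick": 1020}
print(sum(entries_k.values()))  # total entries, in thousands

# Hypothetical toy data: the sites matched by each group of urls.
sites_by_group = {
    "analytics": {"a.example", "b.example"},
    "ads": {"b.example", "c.example"},
    "doubleclick": {"a.example", "b.example", "c.example", "d.example"},
}

# Adding the per-group site counts counts overlapping sites more than once.
naive_sum = sum(len(sites) for sites in sites_by_group.values())

# The union counts each site once, like a query over distinct page urls.
distinct_sites = set().union(*sites_by_group.values())

print(naive_sum, len(distinct_sites))
```

With the rounded inputs the entries total about 1.8M, in line with the figure above, while the toy site data shows how three overlapping groups collapse into a much smaller set of distinct sites.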

Totals
Combining all these results can give us the sum of all trackers found in the HTTPArchive:
[Image: total tracker requests found, by company]

And we can look at the number of sites with each set of urls present:
[Image: number of sites with each set of urls present]

64% of sites in the HTTPArchive use at least one of the 22 urls I have specified above. As you might guess, most of these sites use more than one tracker:
[Image: distribution of tracker counts per page]
(The max value is 290, but I removed it to keep the y-axis scale reasonable). The median count is 7 per page.
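The per-page count behind this median can be sketched as: flag each request whose url contains one of the tracker fragments, count the flagged requests per page, and take the median over pages that have at least one. The rows and the shortened fragment list below are hypothetical stand-ins for the joined pages and requests tables, not real data.

```python
from collections import Counter
from statistics import median

# Hypothetical subset of the url fragments queried in this article.
TRACKER_FRAGMENTS = [
    "connect.facebook.net",
    "platform.twitter.com/widgets.js",
    "googleads.g.doubleclick.net",
]

# Hypothetical (page url, request url) rows standing in for the joined tables.
rows = [
    ("https://one.example/", "https://connect.facebook.net/en_US/sdk.js"),
    ("https://one.example/", "https://platform.twitter.com/widgets.js"),
    ("https://one.example/", "https://one.example/app.js"),
    ("https://two.example/", "https://googleads.g.doubleclick.net/pagead/id"),
    ("https://three.example/", "https://three.example/style.css"),
]


def is_tracker(url):
    """Substring match, mirroring the CONTAINS tests in the queries."""
    return any(fragment in url for fragment in TRACKER_FRAGMENTS)


# Count tracker requests per page; pages without trackers never appear.
per_page = Counter(page for page, url in rows if is_tracker(url))

print(dict(per_page))
print(median(per_page.values()))
```

Pages with no matching requests are left out of the counter, so the median here describes only sites that have trackers.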

Conclusion:

All tracking is different - with different levels of personal information attached to each click or page visit. But, when added up, it can lead to a pretty clear picture of an individual’s likes, dislikes, etc.

With a very limited list of tracking urls from Ghostery and 6 news articles, I found that there are many trackers being used across a large cross section of the web:

64% of all sites had at least one tracker present, and the median site with trackers utilized 7 of the 22 urls I queried on.