Tracking: Your Privacy and the Web

doug_sillars · April 1, 2018, 8:30pm

Recent articles have shown that the trackers embedded on many websites add to the information that Facebook, Google, and other companies can use to learn about us, even if we are not logged in (or do not have an account at all). I started thinking about how many different sites might be tracking my browsing on the web.

TL;DR

“They’re tracking us.” -Princess Leia

How do we know where and how much companies are tracking us?

How many sites are using Facebook tracking APIs? Twitter APIs? Amazon? The number of companies/trackers is probably endless, but in this article, I’ll focus on a small handful - Facebook, Google, Twitter, LinkedIn and Amazon. I know that as long as I have urls, HTTPArchive can help!

Methodology
What are the tracker urls I based my queries on? In an attempt at irony, I ran a Google search on “Facebook privacy”, and I chose the top 6 articles:

https://www.chronicle.com/blogs/profhacker/firefox-add-on-protects-against-most-facebook-tracking/65281
Facebook Tracking Your Web History? Time to Switch Your Browser
Limiting Facebook's data brokers won’t stop tracking | The Daily Star
Facebook, Google and others are tracking you. Here’s how to stop targeted ads - National | Globalnews.ca
Facebook tracking calls - how the Facebook Messenger app collected phone and text logs on Android phones - CBS News
https://www.makeuseof.com/tag/facebook-tracking-stop/

I then used Ghostery (a Chrome plugin that IDs trackers/ads, etc) to identify the trackers on these pages. For example, here are two Facebook trackers:

Now, it is possible that some of the urls I flagged are innocuous and not tracking users across the web. I didn’t dig into each API or what data they collect. Just using the urls in these reports, I built the following query for Facebook:

SELECT
  pages.rank,
  pages.url,
  requests.url,
  ext
FROM
  httparchive.runs.latest_requests_mobile requests
JOIN (
  SELECT
    rank,
    pageid,
    url
  FROM
    httparchive.runs.latest_pages_mobile) pages
ON
  pages.pageid = requests.pageid
WHERE
  (requests.url CONTAINS "facebook.com/tr" ||
   requests.url CONTAINS "graph.facebook.com" ||
   requests.url CONTAINS "Impression | Nottingham" ||
   requests.url CONTAINS "Redirecting..." ||
   requests.url CONTAINS "connect.facebook.net" ||
   requests.url CONTAINS "connect.facebook.com" ||
   requests.url CONTAINS "Redirecting..."
  )
ORDER BY
  rank ASC
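
For anyone who wants to play with the matching logic without a BigQuery account, here is a minimal sketch in Python using an in-memory SQLite database. It is not the query I ran: the page and request rows are made up, only the four patterns that are legible above are included, and SQLite's instr() stands in for CONTAINS. The table and column names simply mirror the query above.

```python
import sqlite3

# Hypothetical toy data: (pageid, rank, url) and (pageid, request url).
PAGES = [
    (1, 10, "https://news.example/"),
    (2, 250, "https://shop.example/"),
    (3, 9000, "https://blog.example/"),
]
REQUESTS = [
    (1, "https://connect.facebook.net/en_US/fbevents.js"),
    (1, "https://www.facebook.com/tr?id=123&ev=PageView"),
    (1, "https://news.example/app.js"),
    (2, "https://graph.facebook.com/v2.12/me"),
    (3, "https://blog.example/style.css"),
]
# Only the patterns that are legible in the query above.
PATTERNS = [
    "facebook.com/tr",
    "graph.facebook.com",
    "connect.facebook.net",
    "connect.facebook.com",
]


def facebook_requests(pages, requests, patterns):
    """Return (rank, page url, request url) rows whose request URL contains a pattern."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE pages (pageid INTEGER, rank INTEGER, url TEXT)")
    db.execute("CREATE TABLE requests (pageid INTEGER, url TEXT)")
    db.executemany("INSERT INTO pages VALUES (?, ?, ?)", pages)
    db.executemany("INSERT INTO requests VALUES (?, ?)", requests)
    # instr(haystack, needle) > 0 plays the role of CONTAINS.
    where = " OR ".join("instr(requests.url, ?) > 0" for _ in patterns)
    sql = (
        "SELECT pages.rank, pages.url, requests.url "
        "FROM requests JOIN pages ON pages.pageid = requests.pageid "
        f"WHERE {where} "
        "ORDER BY pages.rank ASC, requests.url ASC"
    )
    rows = db.execute(sql, patterns).fetchall()
    db.close()
    return rows


rows = facebook_requests(PAGES, REQUESTS, PATTERNS)
for row in rows:
    print(row)
```

With this toy data, three of the five requests match, spread over two of the three pages.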

So, How Much is Facebook Tracking You?
I found 805k Facebook tracking requests in the database, spread over 147.5k sites. That’s a median of 5 Facebook trackers per page, across roughly 33% of the sites in the HTTPArchive dataset.
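
To make those summary numbers concrete, here is a small sketch of how the request count, site count, median per page, and share of sites could be derived from a list of matching requests. The function name and the toy page ids are hypothetical. Note that dividing the totals gives a mean, which is a different statistic from the median quoted above.

```python
from collections import Counter
from statistics import median


def summarize(tracker_page_ids, total_pages):
    """Summarize matches, given one page id per matching request."""
    per_page = Counter(tracker_page_ids)
    return {
        "requests": sum(per_page.values()),
        "pages": len(per_page),
        "median_per_page": median(per_page.values()),
        "share_of_pages": len(per_page) / total_pages,
    }


# Toy input: page 1 has 5 matching requests, page 2 has 3, page 3 has 9.
hits = [1] * 5 + [2] * 3 + [3] * 9
stats = summarize(hits, total_pages=9)
print(stats)

# The mean implied by the totals in the post (805k requests over 147.5k sites).
mean_from_post = round(805_000 / 147_500, 2)
print(mean_from_post)  # 5.46
```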

Interestingly, breaking this down by rank shows that the top 100 sites use far LESS Facebook tracking (11%) than the remainder of the dataset - possibly because many of Facebook’s competitors are in the top 100.
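
A rank breakdown like this one boils down to comparing the share of tracked pages above and below a rank cutoff. The sketch below shows one way to do that. The function name, the cutoff parameter, and the toy data are my own assumptions, with the toy data chosen so the output resembles the percentages in this post.

```python
def share_with_tracker(pages, tracked_ids, cutoff=100):
    """Return (share of top-ranked pages, share of remaining pages) with a tracker.

    pages is an iterable of (pageid, rank); tracked_ids is a set of page ids.
    """
    top = [pid for pid, rank in pages if rank <= cutoff]
    rest = [pid for pid, rank in pages if rank > cutoff]
    top_share = sum(pid in tracked_ids for pid in top) / len(top)
    rest_share = sum(pid in tracked_ids for pid in rest) / len(rest)
    return top_share, rest_share


# Toy data: 1,000 pages whose id equals their rank.
pages = [(i, i) for i in range(1, 1001)]
# 11 tracked pages in the top 100, 300 tracked pages among the other 900.
tracked = set(range(1, 12)) | set(range(101, 401))

top_share, rest_share = share_with_tracker(pages, tracked)
print(f"top 100: {top_share:.0%}, rest: {rest_share:.0%}")
```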

How about Twitter?
In the 6 pages examined, I found 2 tracking urls:

requests.url CONTAINS "syndication.twitter.com/i/jot" ||
requests.url CONTAINS "platform.twitter.com/widgets.js"

This query results in 117.5k tracking requests across 39.7k sites.

Amazon?
amazon-adsystem.com/widgets
6k instances across 2.4k sites

LinkedIn

requests.url CONTAINS "px.ads.linkedin.com/collect" ||
requests.url CONTAINS "snap.licdn.com/li.lms-analytics"

22.5k instances across 7.6k sites

Google
In my limited sample of sites, Google had the largest number of urls indicated as trackers. So large, in fact, that BigQuery ran into memory issues when I tried to run them all at once. I was able to break these up into smaller queries and still get the complete picture:
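
The idea of splitting one large pattern list into several smaller queries can be sketched as follows. This is not BigQuery code: run_query is a hypothetical stand-in for a single query job, and the requests are toy data. The point is that matches are combined as sets, so a request or page that shows up in more than one batch is only counted once.

```python
def chunks(items, size):
    """Yield consecutive slices of items with at most size elements each."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def run_query(requests, patterns):
    """Stand-in for one query: indexes of requests whose URL contains any pattern."""
    return {i for i, (_, url) in enumerate(requests) if any(p in url for p in patterns)}


def run_in_batches(requests, patterns, size):
    """Run the patterns in batches; return (matching requests, distinct pages)."""
    matched = set()
    for batch in chunks(patterns, size):
        matched |= run_query(requests, batch)
    pages = {requests[i][0] for i in matched}
    return len(matched), len(pages)


# The Google url patterns listed in this post.
GOOGLE_PATTERNS = [
    "https://www.google-analytics.com/collect",
    "https://ssl.google-analytics.com",
    "https://www.googletagservices.com/tag/js",
    "pagead2.googlesyndication.com/pagead",
    "www.googleadservices.com/pagead/",
    "imasdk.googleapis.com/js/sdkloader",
    "stats.g.doubleclick.net/r/collect",
    "securepubads.g.doubleclick.net/gpt/",
    "googleads.g.doubleclick.net",
]
# Hypothetical (pageid, request url) rows.
TOY_REQUESTS = [
    (1, "https://www.google-analytics.com/collect?v=1"),
    (1, "https://stats.g.doubleclick.net/r/collect?v=1"),
    (2, "https://googleads.g.doubleclick.net/pagead/id"),
    (2, "https://pagead2.googlesyndication.com/pagead/js/adsbygoogle.js"),
    (3, "https://example.org/main.js"),
    (4, "https://www.googletagservices.com/tag/js/gpt.js"),
]

batched = run_in_batches(TOY_REQUESTS, GOOGLE_PATTERNS, size=3)
single = run_in_batches(TOY_REQUESTS, GOOGLE_PATTERNS, size=len(GOOGLE_PATTERNS))
print(batched, single)
```

Both calls should agree, since batching only changes how the work is split, not which requests match.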

Google Analytics

requests.url CONTAINS "https://www.google-analytics.com/collect" ||
requests.url CONTAINS "https://ssl.google-analytics.com"
176k results across 103.7k sites (23% of all sites in the dataset)

Google Ads (Not DoubleClick)

requests.url CONTAINS "https://www.googletagservices.com/tag/js" ||
requests.url CONTAINS "https://www.google-analytics.com/collect" ||
requests.url CONTAINS "https://ssl.google-analytics.com" ||
requests.url CONTAINS "pagead2.googlesyndication.com/pagead" ||
requests.url CONTAINS "www.googleadservices.com/pagead/" ||
requests.url CONTAINS "imasdk.googleapis.com/js/sdkloader"
600k entries across 202k sites (44% of all sites tested have one of these urls)

Google Ads - DoubleClick

requests.url CONTAINS "stats.g.doubleclick.net/r/collect" ||
requests.url CONTAINS "securepubads.g.doubleclick.net/gpt/" ||
requests.url CONTAINS "googleads.g.doubleclick.net"
1.02M entries across 250k sites (54% of all sites tested have one of these urls)

Adding up the entries gives 1.79M Google trackers. If I run a query for just the page urls, the query runs successfully, and these trackers appear on 268k sites (58% of all sites).
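
Site counts cannot be added up across groups, because the same site can appear in several groups. That is why the distinct-site count needs its own query. A quick sketch with made-up page ids shows the difference between summing and taking a union:

```python
# Hypothetical page ids matched by each group of Google urls.
analytics = {1, 2, 3, 4}
ads = {2, 3, 4, 5, 6}
doubleclick = {3, 4, 5, 6, 7, 8}

summed = len(analytics) + len(ads) + len(doubleclick)
distinct = len(analytics | ads | doubleclick)
print(summed, distinct)  # 15 8

# The same effect with the site counts reported in this post:
# the three groups sum to far more than the 268k distinct sites.
article_sum = 103_700 + 202_000 + 250_000
print(article_sum)  # 555700
```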

Totals
Combining all these results can give us the sum of all trackers found in the HTTPArchive:

And we can look at the number of sites with each set of urls present:

64% of sites in the HTTPArchive use at least one of the 22 urls I have specified above. As you can probably guess, most of these sites use more than one of these trackers:

(The max value is 290, but I removed it to keep the y-axis scale reasonable). The median count is 7 per page.

Conclusion:

All tracking is different - with different levels of personal information attached to each click or page visit. But, when added up, it can lead to a pretty clear picture of an individual’s likes, dislikes, etc.

With a very limited list of tracking urls from Ghostery and 6 news articles, I found that there are many trackers being used across a large cross section of the web:

64% of all sites had at least one tracker present, and the median site with trackers utilized 7 of the 22 urls I queried on.
