What is the distribution of 1st party vs 3rd party resources?

First, let’s classify requests into first vs third party:

SELECT 
  origin,
  IF (req_host CONTAINS REGEXP_EXTRACT(origin, r'([\w-]+)'), INTEGER(1), INTEGER(3)) AS party,
  COUNT(*) as cnt
FROM [httparchive:runs.latest_requests] requests JOIN EACH (
  SELECT pageid, DOMAIN(url) as origin
  FROM [httparchive:runs.latest_pages]
) pages ON pages.pageid = requests.pageid
GROUP BY origin, party
ORDER BY origin
LIMIT 100

There is no exact way to tell if the request is first vs third party, but a simple heuristic goes a long way:

  1. Extract the domain and page ID of the page we’re loading (from latest_pages table)
  2. Join the resulting table against request table
  3. Extract the ‘name’ part of the domain (e.g. awesome-site.co.uk > awesome-site)
  4. Check if hostname of the requested resource contains ‘name’

The implicit assumption here ‘first party’ domain is any domain that has the site name in its hostname.

With that, we can now run an aggregate function:

SELECT party,
  NTH(50, quantiles(cnt)) p50,
  NTH(75, quantiles(cnt)) p75,
  NTH(90, quantiles(cnt)) p90,
  NTH(95, quantiles(cnt)) p95,
FROM (
  SELECT 
    origin,
    IF (req_host CONTAINS REGEXP_EXTRACT(origin, r'([\w-]+)'), INTEGER(1), INTEGER(3)) AS party,
    COUNT(*) as cnt
  FROM [httparchive:runs.latest_requests] requests JOIN EACH (
    SELECT pageid, DOMAIN(url) as origin
    FROM [httparchive:runs.latest_pages]
  ) pages ON pages.pageid = requests.pageid
  GROUP BY origin, party
)
GROUP by party

Finally, we have our answer:

For the median case, there are 39 requests served from first party domain(s) and 24 from third party domain(s). From there, the further into the tail, the more third party dependencies there are… I guess, that’s not entirely surprising, but interesting nonetheless: the more requests you have on the page, the more likely that you’ll have more third party dependencies.

5 Likes

Great data, and definitely more accurate than the raw number of domains check, but I think it falls short quite frequently… Just looking at a handful of small sites, you quickly come across 1st party requests that don’t fit the pattern:

I think this is a step ahead, and it’s also good to see the number of requests, not just number of domains, that are fetched from the same top level domain vs not, but I think it still falls short when identifying 3rd party content.

@guypod walmart case is covered by above query, the other three would indeed come out as third-party false positives. That said, based on inspecting the first couple thousand classifications, I think you’re overestimating the prevalence of such domains.

I ran the same query for Nov 1 in 2011, 2012, and 2013:. The percentage of 3rd party content is 32%, 36%, and 38% respectively:

2011:

2012:

2013:

1 Like

I ran a similar query to see the distribution of frontend SPOF resources (JS, CSS & font). Query:

SELECT party,  
  NTH(10, quantiles(cnt)) p10,  
  NTH(20, quantiles(cnt)) p20,
  NTH(30, quantiles(cnt)) p30,
  NTH(40, quantiles(cnt)) p40,
  NTH(50, quantiles(cnt)) p50,
  NTH(60, quantiles(cnt)) p60,
  NTH(70, quantiles(cnt)) p70,
  NTH(80, quantiles(cnt)) p80,
  NTH(90, quantiles(cnt)) p90,
  NTH(99, quantiles(cnt)) p99
FROM (  
  SELECT 
    origin,
    IF (req_host CONTAINS REGEXP_EXTRACT(origin, r'([\w-]+)'), INTEGER(1), INTEGER(3)) AS party,
    COUNT(*) as cnt
  FROM [httparchive:runs.2014_04_15_requests] requests JOIN EACH (
    SELECT pageid, DOMAIN(url) as origin
    FROM [httparchive:runs.2014_04_15_pages]
  ) pages ON pages.pageid = requests.pageid
  GROUP BY origin, party
)
GROUP by party

The results:

And just fonts:

Results for Nov 1 2014:

For the median, 1st party* is 60% and 3rd party is 40%.

* subject to the simple “name of origin is somewhere in the hostname” heuristic - i.e. likely under-counting, see discussion above.

For Nov 1st 2015

Row party p50 p75 p90 p95
1 1 41 69 104 131
2 3 29 64 117 164

(Will tidy formatting a bit later)

Just my 2c, since I think this is a very useful analysis: I happened to use the same pattern-matching heuristic for identifying 1st/3rd party domains recently and also concluded it can miss several cases. I also tried looking at the number of objects from a domain: if img.xmpl.com serves 30 objects for www.example.com, it’s likely a 1st party domain even though the pattern doesn’t match. I don’t think this heuristic alone is better, as there could be false positives on either direction no matter what the object threshold (and I’m not sure what a good value is). But maybe combining the two would increase the accuracy. I guess it depends on whether one would rather only likely under-count, or potentially over-count.

Also, Results for July 14, 2016:
party p50 p75 p90 p95
1 39 68 105 133
3 29 71 146 244

(the only trend I can discern is that the gap in the tails seems to increase over time)

Not sure if there’s a better approach since this was originally posted, but comparing apples to apples here, third party content (probably not too surprising) has really ballooned since @igrigorik first ran the query in Nov of 2013.

49 PM

The first party counts have stayed mostly the same, very slight variation. But 3rd party is definitely a different story. That’s 52.8% 3rd-party components at the median, now.

1 Like

I thought it would be interesting to explore this in terms of histograms. But first, let’s update the above query with Standard SQL syntax. Here’s what I believe is the equivalent query in Standard SQL:

SELECT party,
       APPROX_QUANTILES(cnt, 100)[SAFE_ORDINAL(50)] p50,
       APPROX_QUANTILES(cnt, 100)[SAFE_ORDINAL(75)] p75,
       APPROX_QUANTILES(cnt, 100)[SAFE_ORDINAL(90)] p90,
       APPROX_QUANTILES(cnt, 100)[SAFE_ORDINAL(95)] p95
FROM (
    SELECT origin,
           IF (STRPOS(req_host,REGEXP_EXTRACT(origin, r'([\w-]+)'))>0, 1, 3) AS party,
           COUNT(*) as cnt
    FROM httparchive.runs.2017_09_15_requests requests JOIN (
         SELECT pageid, NET.REG_DOMAIN(url) as origin
         FROM httparchive.runs.2017_09_15_pages
    ) pages ON pages.pageid = requests.pageid
    GROUP BY origin, party 
)
GROUP by party

Note that I’ve swapped:

IF (req_host CONTAINS REGEXP_EXTRACT(origin, r'([\w-]+)'), INTEGER(1), INTEGER(3)) AS party,

for

IF (STRPOS(req_host,REGEXP_EXTRACT(origin, r'([\w-]+)'))>0, 1, 3) AS party,

Next I decided to take the same classification logic, and calculate the percentage of 3rd party resources per page

SELECT percent_third_party, count(*) as total
FROM (
    SELECT pages.url, FLOOR((SUM(IF(STRPOS(NET.HOST(requests.url),REGEXP_EXTRACT(NET.HOST(pages.url), r'([\w-]+)'))>0, 0, 1)) / COUNT(*))*100) percent_third_party
    FROM httparchive.runs.2017_09_15_pages pages 
    JOIN httparchive.runs.2017_09_15_requests requests
    ON pages.pageid = requests.pageid
    GROUP BY pages.url
	)
GROUP BY percent_third_party
ORDER BY percent_third_party

I wound up with the following histogram, which shows that 6.4% of sites had no 3rd party content, and 38% had more than 75% third party content. The rest of the population spanned the entire range
image

I was curious to see if there was any correlation to Alexa ranking, so I ran the same analysis for Alexa rankings < 100K, 50K, <10K and <1K. The results visually appear consistent across these as you can see in the 2 graphs below:
image

Looking at the numbers, I can see that the percentage is skewed towards more 3rd party content for more popular sites.

image

But what about historical trends? Let’s compare the histograms for Sept 15th for the past 5 years: (Note I trimmed the Y axis in this graph for readability). What’s interesting in tihs is that it seems like the top 25% of sites w/ 3rd party content is increasing each year except for 2015…

image

Looking at the same stats, I can see there were more sites in 2015 - which explains the skew. There were also significantly less sites in 2012-2014. So what this data is essentially telling us is:

  • There’s an increase in 3rd party usage from 2016 to 2017
  • There’s an increase from 2012 to 2013
  • There was no signficant change from 2013 to 2015.
  • There’s not much fluctaution in the top 15% from year to year

image

3 Likes

This is great analysis. Something I’ve been curious about is how much more granular we can look at third-party distribution:

  • Is there a Top 10 list of third-party hosts that are being referenced? My assumption is the majority of scripts referenced are analytics, ads/tag managers and social media sharing widgets.
  • Is there a Top 10 list of specific third-party resources being referenced? Again, my gut hints at things like google-analytics.com/analytics.js are quite prevalent.

But what about historical trends? Let’s compare the histograms for Sept 15th for the past 5 years

This is fascinating. Over on the Chrome Loading team we assumed that the usage of CDNs were on the decline (although this was based on anecdotal data). I would be curious to what degree the increase in 3rd party usage we’re seeing year on year (2016-2017) is down to popular CDNs getting more traction (like Akamai etc).

Wanted to know how the “Ratio of sum of all time spent in 3rd party resources to the total time spent in all content” looked like (it is logical that the more the third party requests, more the fraction of time spent on third party content). Used the time attribute of the requests table (assuming it would have the major time components like sending and receiving).

First wrote the following query to get the percentiles. Used the same logic mentioned in the post to detect first and third party origins. Added a check for time > 0 to eliminate erroneous data (if any).

SELECT
NTH(10, quantiles(thirdpartytimetototaltimeratio)) p10,
NTH(20, quantiles(thirdpartytimetototaltimeratio)) p20,
NTH(30, quantiles(thirdpartytimetototaltimeratio)) p30,
NTH(40, quantiles(thirdpartytimetototaltimeratio)) p40,
NTH(50, quantiles(thirdpartytimetototaltimeratio)) p50,
NTH(60, quantiles(thirdpartytimetototaltimeratio)) p60,
NTH(70, quantiles(thirdpartytimetototaltimeratio)) p70,
NTH(80, quantiles(thirdpartytimetototaltimeratio)) p80,
NTH(90, quantiles(thirdpartytimetototaltimeratio)) p90,
NTH(99, quantiles(thirdpartytimetototaltimeratio)) p99,
COUNT(*) totalpages
FROM
(
SELECT
thirdpartytime/totaltime thirdpartytimetototaltimeratio
FROM
(
SELECT
pages.pageid,
SUM(IF (req_host CONTAINS REGEXP_EXTRACT(origin, r’([\w-]+)’), 0, requests.time)) AS thirdpartytime,
SUM(requests.time) as totaltime
FROM httparchive:runs.2017_08_01_requests requests JOIN EACH (
SELECT pages.pageid, DOMAIN(url) as origin
FROM httparchive:runs.2017_08_01_pages pages
) pages ON pages.pageid = requests.pageid
GROUP BY pages.pageid
)
WHERE totaltime > 0
)

The results:
image
image

Next tried to do a histogram.

SELECT
thirdpartytimetototaltimeratiobucket, COUNT(*) totalpages
FROM
(
SELECT
ROUND((thirdpartytime/totaltime) * 100, 0) thirdpartytimetototaltimeratiobucket
FROM
(
SELECT
pages.pageid,
SUM(IF (req_host CONTAINS REGEXP_EXTRACT(origin, r’([\w-]+)’), 0, requests.time)) AS thirdpartytime,
SUM(requests.time) as totaltime
FROM httparchive:runs.2017_08_01_requests requests JOIN EACH (
SELECT pages.pageid, DOMAIN(url) as origin
FROM httparchive:runs.2017_08_01_pages pages
) pages ON pages.pageid = requests.pageid
GROUP BY pages.pageid
)
WHERE totaltime > 0
)
GROUP BY thirdpartytimetototaltimeratiobucket
ORDER BY thirdpartytimetototaltimeratiobucket ASC

The results:
image
image

The last time we looked at this query, we had 470K sites in the HTTP Archive. Now that we’re upwards of 1.3 million, I was wondering whether this changed much.

Since this query was run, the runs tables were moved to summary_pages and summary_requests. So here’s an updated query (this will process 17GB)

SELECT percent_third_party, count(*) as total
FROM (
    SELECT pages.url, FLOOR((SUM(IF(STRPOS(NET.HOST(requests.url),REGEXP_EXTRACT(NET.HOST(pages.url), r'([\w-]+)'))>0, 0, 1)) / COUNT(*))*100) percent_third_party
    FROM httparchive.summary_pages.2018_08_15_desktop pages 
    JOIN httparchive.summary_requests.2018_08_15_desktop requests
    ON pages.pageid = requests.pageid
    GROUP BY pages.url
	)
GROUP BY percent_third_party
ORDER BY percent_third_party

It’s interesting to note that the % of third party content has not changed much, despite the increase in the number of URLs monitored…

Last year we saw 7% of 470K pages with no 3rd parties. Today 7.5% of 1.3 million sites have no third parties.

The statistic about 38% of sites with > 75% third party content is still consistent with the larger dataset.

I found the >90% distribution to be extremely interesting through. Apparently 27% of sites have between 90% and 99% third party content!

image

5 Likes

Google Adsense in play ? when enabled i see double the amount of requests on my sites !

Similar issues once we enable GoogleTagManager & DFP ads. Ideally there should be a max cap for ads shown in a page.

Do you think so many websites have 0% third-party domains is an error? It’s a little bit odd and I can’t figure out why.