What is the distribution of 1st party vs 3rd party resources?

I thought it would be interesting to explore this in terms of histograms. But first, let’s update the above query with Standard SQL syntax. Here’s what I believe is the equivalent query in Standard SQL:

SELECT party,
       APPROX_QUANTILES(cnt, 100)[SAFE_ORDINAL(50)] p50,
       APPROX_QUANTILES(cnt, 100)[SAFE_ORDINAL(75)] p75,
       APPROX_QUANTILES(cnt, 100)[SAFE_ORDINAL(90)] p90,
       APPROX_QUANTILES(cnt, 100)[SAFE_ORDINAL(95)] p95
FROM (
    SELECT origin,
           IF (STRPOS(req_host,REGEXP_EXTRACT(origin, r'([\w-]+)'))>0, 1, 3) AS party,
           COUNT(*) as cnt
    FROM httparchive.runs.2017_09_15_requests requests JOIN (
         SELECT pageid, NET.REG_DOMAIN(url) as origin
         FROM httparchive.runs.2017_09_15_pages
    ) pages ON pages.pageid = requests.pageid
    GROUP BY origin, party 
)
GROUP by party

Note that I’ve swapped:

IF (req_host CONTAINS REGEXP_EXTRACT(origin, r'([\w-]+)'), INTEGER(1), INTEGER(3)) AS party,

for

IF (STRPOS(req_host,REGEXP_EXTRACT(origin, r'([\w-]+)'))>0, 1, 3) AS party,

Next I decided to take the same classification logic, and calculate the percentage of 3rd party resources per page

SELECT percent_third_party, count(*) as total
FROM (
    SELECT pages.url, FLOOR((SUM(IF(STRPOS(NET.HOST(requests.url),REGEXP_EXTRACT(NET.HOST(pages.url), r'([\w-]+)'))>0, 0, 1)) / COUNT(*))*100) percent_third_party
    FROM httparchive.runs.2017_09_15_pages pages 
    JOIN httparchive.runs.2017_09_15_requests requests
    ON pages.pageid = requests.pageid
    GROUP BY pages.url
	)
GROUP BY percent_third_party
ORDER BY percent_third_party

I wound up with the following histogram, which shows that 6.4% of sites had no 3rd party content, and 38% had more than 75% third party content. The rest of the population spanned the entire range
image

I was curious to see if there was any correlation to Alexa ranking, so I ran the same analysis for Alexa rankings < 100K, 50K, <10K and <1K. The results visually appear consistent across these as you can see in the 2 graphs below:
image

Looking at the numbers, I can see that the percentage is skewed towards more 3rd party content for more popular sites.

image

But what about historical trends? Let’s compare the histograms for Sept 15th for the past 5 years: (Note I trimmed the Y axis in this graph for readability). What’s interesting in tihs is that it seems like the top 25% of sites w/ 3rd party content is increasing each year except for 2015…

image

Looking at the same stats, I can see there were more sites in 2015 - which explains the skew. There were also significantly less sites in 2012-2014. So what this data is essentially telling us is:

  • There’s an increase in 3rd party usage from 2016 to 2017
  • There’s an increase from 2012 to 2013
  • There was no signficant change from 2013 to 2015.
  • There’s not much fluctaution in the top 15% from year to year

image

3 Likes