What is the distribution of 1st party vs 3rd party resources?

The last time we looked at this query, we had 470K sites in the HTTP Archive. Now that we’re upwards of 1.3 million, I was wondering whether the distribution has changed much.

Since that query was run, the runs tables have been moved to summary_pages and summary_requests, so here’s an updated query (it will process 17 GB):

SELECT percent_third_party, COUNT(*) AS total
FROM (
    -- Per page: percentage of requests whose host does not contain
    -- the first [\w-]+ token of the page's host, rounded down.
    SELECT
        pages.url,
        FLOOR((SUM(IF(STRPOS(NET.HOST(requests.url), REGEXP_EXTRACT(NET.HOST(pages.url), r'([\w-]+)')) > 0, 0, 1)) / COUNT(*)) * 100) AS percent_third_party
    FROM `httparchive.summary_pages.2018_08_15_desktop` pages
    JOIN `httparchive.summary_requests.2018_08_15_desktop` requests
    ON pages.pageid = requests.pageid
    GROUP BY pages.url
)
-- Count how many pages fall into each whole-percent bucket.
GROUP BY percent_third_party
ORDER BY percent_third_party
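
To make the per-request test in the inner SELECT easier to follow, here is a rough Python sketch of the same heuristic. It is only an approximation for illustration: the function names are made up, and urlsplit(...).hostname stands in for BigQuery's NET.HOST, which may treat edge cases differently.

```python
import math
import re
from urllib.parse import urlsplit

# Same pattern as the query; re.ASCII keeps \w limited to [0-9A-Za-z_].
TOKEN = re.compile(r"[\w-]+", re.ASCII)


def host(url):
    """Approximate NET.HOST: return the host part of a URL, or '' if none."""
    return urlsplit(url).hostname or ""


def is_third_party(page_url, request_url):
    """Mirror IF(STRPOS(request_host, token) > 0, 0, 1) from the query."""
    match = TOKEN.search(host(page_url))
    if match is None:
        # No token extracted: the SQL IF falls through to 1 (third party).
        return True
    return match.group(0) not in host(request_url)


def percent_third_party(page_url, request_urls):
    """Mirror FLOOR((SUM(...) / COUNT(*)) * 100) for a single page."""
    third = sum(is_third_party(page_url, r) for r in request_urls)
    return math.floor(third / len(request_urls) * 100)


if __name__ == "__main__":
    # Hypothetical requests for a hypothetical page.
    requests = [
        "http://example.com/app.js",
        "http://cdn.example.com/style.css",
        "https://fonts.googleapis.com/css",
    ]
    print(percent_third_party("http://example.com/", requests))
```

One property of the heuristic that the sketch makes visible: the token is the first run of word characters in the page host, so for a page on www.example.com the token is www, and any request host containing www is then counted as first party.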

It’s interesting to note that the percentage of third party content has not changed much, despite the increase in the number of URLs monitored.

Last year we saw 7% of 470K pages with no 3rd parties. Today 7.5% of 1.3 million sites have no third parties.

The earlier statistic that 38% of sites have more than 75% third party content still holds in the larger dataset.

I found the >90% part of the distribution to be extremely interesting, though. Apparently 27% of sites have between 90% and 99% third party content!

