The last time we looked at this query, we had 470K sites in the HTTP Archive. Now that we’re upwards of 1.3 million, I was wondering whether the results have changed much.
Since that query was run, the runs tables were moved to summary_pages and summary_requests. So here’s an updated query (this will process 17GB):
-- Outer query: number of pages in each third-party percentage bucket.
SELECT percent_third_party, COUNT(*) AS total
FROM (
  -- Inner query: per page, the share of requests whose host does not
  -- contain the first label of the page's own host.
  SELECT
    pages.url,
    FLOOR((SUM(IF(STRPOS(NET.HOST(requests.url), REGEXP_EXTRACT(NET.HOST(pages.url), r'([\w-]+)')) > 0, 0, 1)) / COUNT(*)) * 100) AS percent_third_party
  FROM `httparchive.summary_pages.2018_08_15_desktop` pages
  JOIN `httparchive.summary_requests.2018_08_15_desktop` requests
  ON pages.pageid = requests.pageid
  GROUP BY pages.url
)
GROUP BY percent_third_party
ORDER BY percent_third_party
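To make the matching rule in the inner query easier to follow, here is a minimal Python sketch of the same heuristic. It is an approximation, not the query itself: `host`, `counted_as_third_party` and `percent_third_party` are hypothetical helper names, `urlsplit(...).hostname` stands in for `NET.HOST` (it lowercases the host), and the example URLs are made up.

```python
import math
import re
from urllib.parse import urlsplit


def host(url):
    # Rough stand-in for BigQuery's NET.HOST; note that hostname is lowercased here.
    return urlsplit(url).hostname or ""


def counted_as_third_party(page_url, request_url):
    # First run of word characters or hyphens in the page host, like r'([\w-]+)'.
    match = re.search(r"[\w-]+", host(page_url), flags=re.ASCII)
    if match is None:
        # In the SQL, a NULL comparison falls through to the IF's else branch (1).
        return True
    # A request counts as first party when that label appears anywhere in its host.
    return match.group(0) not in host(request_url)


def percent_third_party(page_url, request_urls):
    # Mirrors FLOOR(SUM(...) / COUNT(*) * 100) for a single page.
    third = sum(counted_as_third_party(page_url, r) for r in request_urls)
    return math.floor(third / len(request_urls) * 100)


requests_for_page = [
    "https://example.com/app.js",
    "https://cdn.example.com/logo.png",
    "https://www.google-analytics.com/analytics.js",
    "https://fonts.googleapis.com/css",
]
print(percent_third_party("https://example.com/", requests_for_page))
```

One thing this sketch suggests: for a page URL such as `https://www.example.com/`, the first label extracted is `www`, so a request to any host containing `www` would be counted as first party. That may be worth keeping in mind when reading the percentages.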
It’s interesting to note that the % of third party content has not changed much, despite the increase in the number of URLs monitored…
Last year we saw 7% of 470K pages with no 3rd parties. Today 7.5% of 1.3 million sites have no third parties.
The statistic about 38% of sites with > 75% third party content is still consistent with the larger dataset.
I found the >90% distribution to be extremely interesting though. Apparently 27% of sites have between 90% and 99% third party content!
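For anyone who wants to reproduce these shares from the query output, here is a small Python sketch of the arithmetic. It assumes the result has been exported as (percent_third_party, total) pairs; `share_in_range` is a hypothetical helper, and the rows below are toy numbers for illustration only, not HTTP Archive data. Treating "> 75%" as buckets 76 through 100 is one reading of that cutoff.

```python
def share_in_range(rows, low, high):
    # rows: iterable of (percent_third_party, total) pairs from the query result.
    rows = list(rows)
    all_pages = sum(total for _, total in rows)
    in_range = sum(total for pct, total in rows if low <= pct <= high)
    return 100 * in_range / all_pages


# Toy rows, invented for this example only.
toy_rows = [(0, 10), (50, 50), (80, 20), (95, 20)]

print(share_in_range(toy_rows, 0, 0))     # share of pages with no third parties
print(share_in_range(toy_rows, 76, 100))  # share of pages above 75% third party
print(share_in_range(toy_rows, 90, 99))   # share of pages in the 90% to 99% range
```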