First, let’s classify requests into first vs third party:
SELECT
origin,
IF (req_host CONTAINS REGEXP_EXTRACT(origin, r'([\w-]+)'), INTEGER(1), INTEGER(3)) AS party,
COUNT(*) as cnt
FROM [httparchive:runs.latest_requests] requests JOIN EACH (
SELECT pageid, DOMAIN(url) as origin
FROM [httparchive:runs.latest_pages]
) pages ON pages.pageid = requests.pageid
GROUP BY origin, party
ORDER BY origin
LIMIT 100
There is no exact way to tell if the request is first vs third party, but a simple heuristic goes a long way:
- Extract the domain and page ID of the page we’re loading (from latest_pages table)
- Join the resulting table against request table
- Extract the ‘name’ part of the domain (e.g. awesome-site.co.uk > awesome-site)
- Check if hostname of the requested resource contains ‘name’
The implicit assumption here ‘first party’ domain is any domain that has the site name in its hostname.
With that, we can now run an aggregate function:
SELECT party,
NTH(50, quantiles(cnt)) p50,
NTH(75, quantiles(cnt)) p75,
NTH(90, quantiles(cnt)) p90,
NTH(95, quantiles(cnt)) p95,
FROM (
SELECT
origin,
IF (req_host CONTAINS REGEXP_EXTRACT(origin, r'([\w-]+)'), INTEGER(1), INTEGER(3)) AS party,
COUNT(*) as cnt
FROM [httparchive:runs.latest_requests] requests JOIN EACH (
SELECT pageid, DOMAIN(url) as origin
FROM [httparchive:runs.latest_pages]
) pages ON pages.pageid = requests.pageid
GROUP BY origin, party
)
GROUP by party
Finally, we have our answer:
For the median case, there are 39 requests served from first party domain(s) and 24 from third party domain(s). From there, the further into the tail, the more third party dependencies there are… I guess, that’s not entirely surprising, but interesting nonetheless: the more requests you have on the page, the more likely that you’ll have more third party dependencies.