For reference, here’s the query I started with:
SELECT
name,
COUNT(0) AS requests,
APPROX_QUANTILES(CAST(JSON_EXTRACT_SCALAR(payload, "$['_bytesIn']") AS INT64), 1000)[OFFSET(500)] AS median_bytes,
APPROX_QUANTILES(CAST(JSON_EXTRACT_SCALAR(payload, "$['_cpu.EvaluateScript']") AS INT64), 1000)[OFFSET(500)] AS median_js_eval,
APPROX_QUANTILES(CAST(JSON_EXTRACT_SCALAR(payload, "$['_cpu.FunctionCall']") AS INT64), 1000)[OFFSET(500)] AS median_js_exec
FROM
`httparchive.requests.2019_02_01_desktop`
JOIN
`httparchive.scratchspace.third_parties`
ON
NET.HOST(url) IN UNNEST(domains)
WHERE
'ad' IN UNNEST(categories) AND
JSON_EXTRACT_SCALAR(payload, '$._contentType') = 'text/javascript'
GROUP BY
name
ORDER BY
requests DESC
name |
requests |
median_bytes |
median_js_eval |
median_js_exec |
Google/Doubleclick Ads |
3656674 |
8721 |
8 |
0 |
Criteo |
130062 |
12645 |
24 |
1 |
Yahoo Ads |
24760 |
14425 |
21 |
16 |
WordAds |
15266 |
1541 |
42 |
|
Rubicon Project |
10588 |
7558 |
18 |
35 |
33 Across |
5193 |
788 |
180 |
|
Popads |
4992 |
44 |
0 |
0 |
OpenX |
4939 |
17994 |
83 |
45 |
DoubleVerify |
4309 |
3095 |
23 |
1 |
Market GID |
2768 |
38808 |
34 |
94 |
Pubmatic |
637 |
10677 |
3 |
151 |
Media Math |
44 |
19075 |
34 |
704 |
Adroll |
23 |
57 |
9 |
|
Taboola |
13 |
63648 |
39 |
8 |
I also wrote a query to generate a timeseries of these values. It’s SUPER EXPENSIVE to run, so nobody should probably run this:
# WARNING!!! 51.2 TB !!!
SELECT
SUBSTR(_TABLE_SUFFIX, 0, 10) AS date,
IF(ENDS_WITH(_TABLE_SUFFIX, 'desktop'), 'desktop', 'mobile') AS client,
name,
COUNT(0) AS requests,
APPROX_QUANTILES(CAST(JSON_EXTRACT_SCALAR(payload, "$['_bytesIn']") AS INT64), 1000)[OFFSET(500)] AS median_bytes,
APPROX_QUANTILES(CAST(JSON_EXTRACT_SCALAR(payload, "$['_cpu.EvaluateScript']") AS INT64), 1000)[OFFSET(500)] AS median_js_eval,
APPROX_QUANTILES(CAST(JSON_EXTRACT_SCALAR(payload, "$['_cpu.FunctionCall']") AS INT64), 1000)[OFFSET(500)] AS median_js_exec
FROM
`httparchive.requests.*`
JOIN
`httparchive.scratchspace.third_parties`
ON
NET.HOST(url) IN UNNEST(domains)
WHERE
'ad' IN UNNEST(categories) AND
JSON_EXTRACT_SCALAR(payload, '$._contentType') = 'text/javascript'
GROUP BY
date,
client,
name
ORDER BY
date,
client,
name
Browse the raw data
It’s important to note that unlike @patrickhulce’s approach which uses Lighthouse, this approach only accounts for the bytes and execution time directly attributed to the safelist of ad provider hostnames. I’d need to either query the Lighthouse dataset (limited to mobile only) or implement some kind of request mapping to roll up dependent scripts to their parent.