Ads impact (bytes transferred, execution time) on the web over time

#1

Hi all,
@rviscomi shared some amazing analysis by @patrickhulce here. A couple of questions:

  1. Does Patrick’s execution time include all subsequent ad redirects and resources loaded from domains referenced as part of the original ad? E.g. doubleclick.net (the primary ad tag on the page) -> loads an “ad” from criteo.com -> loads the final ad from celtra.com -> loads some additional resources from hostingservice.com.

  2. Similar to execution time above, is there a way to get at total bytes transferred that includes the entire chain?

  3. Trends over the past few years for both 1 and 2, on mobile & desktop, are what I’m really after.

I’m new to this forum, so apologies if I’m missing anything obvious.
Cheers.

#2

Glad you found it helpful :slight_smile:

  1. The execution time for the entire page load is analyzed and broken down by tasks. Each main-thread task is attributed to the first recognizable script in the stack that triggered it. That is, script executions from doubleclick will be attributed to doubleclick, but everything that the criteo assets trigger will be attributed to criteo, and so on.
  2. Getting the total bytes data and adding it to the repo has been brought up before; I just haven’t gotten around to it yet. The queries for bytes should actually be much easier than the task-time ones, too :tada:
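A rough sketch of the attribution logic described in point 1 (a simplified model of my own, not the actual implementation; `ENTITY_BY_HOST`, `attributeTask`, and the task/stack shape are all hypothetical):

```javascript
// Hypothetical sketch: each main-thread task carries a stack of script URLs,
// and the task is attributed to the first script whose host we recognize.
const ENTITY_BY_HOST = {
  'doubleclick.net': 'Google/Doubleclick Ads',
  'criteo.com': 'Criteo',
};

// Reduce a URL to its registrable base domain (naive two-label version).
function hostOf(url) {
  return new URL(url).hostname.split('.').slice(-2).join('.');
}

// `task.stack` lists the script URLs that led to this task, innermost first.
function attributeTask(task) {
  for (const scriptUrl of task.stack) {
    const entity = ENTITY_BY_HOST[hostOf(scriptUrl)];
    if (entity) return entity; // first recognizable script wins
  }
  return 'unattributed';
}
```

So a task triggered by a criteo asset lands in Criteo’s bucket even if doubleclick kicked off the chain, matching the per-script attribution described above.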
#3

For reference, here’s the query I started with:

SELECT
  name,
  COUNT(0) AS requests,
  APPROX_QUANTILES(CAST(JSON_EXTRACT_SCALAR(payload, "$['_bytesIn']") AS INT64), 1000)[OFFSET(500)] AS median_bytes,
  APPROX_QUANTILES(CAST(JSON_EXTRACT_SCALAR(payload, "$['_cpu.EvaluateScript']") AS INT64), 1000)[OFFSET(500)] AS median_js_eval,
  APPROX_QUANTILES(CAST(JSON_EXTRACT_SCALAR(payload, "$['_cpu.FunctionCall']") AS INT64), 1000)[OFFSET(500)] AS median_js_exec
FROM
  `httparchive.requests.2019_02_01_desktop`
JOIN
  `httparchive.scratchspace.third_parties`
ON
  NET.HOST(url) IN UNNEST(domains)
WHERE
  'ad' IN UNNEST(categories) AND
  JSON_EXTRACT_SCALAR(payload, '$._contentType') = 'text/javascript'
GROUP BY
  name
ORDER BY
  requests DESC
name requests median_bytes median_js_eval median_js_exec
Google/Doubleclick Ads 3656674 8721 8 0
Criteo 130062 12645 24 1
Yahoo Ads 24760 14425 21 16
WordAds 15266 1541 42
Rubicon Project 10588 7558 18 35
33 Across 5193 788 180
Popads 4992 44 0 0
OpenX 4939 17994 83 45
DoubleVerify 4309 3095 23 1
Market GID 2768 38808 34 94
Pubmatic 637 10677 3 151
Media Math 44 19075 34 704
Adroll 23 57 9
Taboola 13 63648 39 8

I also wrote a query to generate a timeseries of these values. It’s SUPER EXPENSIVE to run, so you probably shouldn’t run it:

# WARNING!!! 51.2 TB !!!
SELECT
  SUBSTR(_TABLE_SUFFIX, 0, 10) AS date,
  IF(ENDS_WITH(_TABLE_SUFFIX, 'desktop'), 'desktop', 'mobile') AS client,
  name,
  COUNT(0) AS requests,
  APPROX_QUANTILES(CAST(JSON_EXTRACT_SCALAR(payload, "$['_bytesIn']") AS INT64), 1000)[OFFSET(500)] AS median_bytes,
  APPROX_QUANTILES(CAST(JSON_EXTRACT_SCALAR(payload, "$['_cpu.EvaluateScript']") AS INT64), 1000)[OFFSET(500)] AS median_js_eval,
  APPROX_QUANTILES(CAST(JSON_EXTRACT_SCALAR(payload, "$['_cpu.FunctionCall']") AS INT64), 1000)[OFFSET(500)] AS median_js_exec
FROM
  `httparchive.requests.*`
JOIN
  `httparchive.scratchspace.third_parties`
ON
  NET.HOST(url) IN UNNEST(domains)
WHERE
  'ad' IN UNNEST(categories) AND
  JSON_EXTRACT_SCALAR(payload, '$._contentType') = 'text/javascript'
GROUP BY
  date,
  client,
  name
ORDER BY
  date,
  client,
  name

Browse the raw data

It’s important to note that, unlike @patrickhulce’s approach, which uses Lighthouse, this approach only accounts for the bytes and execution time directly attributed to the safelist of ad-provider hostnames. I’d need to either query the Lighthouse dataset (limited to mobile only) or implement some kind of request mapping to roll up dependent scripts to their parents.
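The request-mapping idea could look roughly like this (a hypothetical sketch, not anything in the HTTP Archive schema; the `initiator` field, the `AD_HOSTS` set, and `rollUpAdBytes` are all illustrative):

```javascript
// Hypothetical sketch: walk each request's initiator chain upward and, when an
// ancestor belongs to a known ad host, roll the request's bytes up to the
// outermost ad entity in the chain (the original ad tag).
const AD_HOSTS = new Set(['doubleclick.net', 'criteo.com']);

// Naive registrable-domain extraction (two labels).
function baseHost(url) {
  return new URL(url).hostname.split('.').slice(-2).join('.');
}

// `requests` maps url -> { initiator: url | null, bytes: number }
function rollUpAdBytes(requests) {
  const totals = {};
  for (const [url, req] of Object.entries(requests)) {
    let cur = url;
    let root = null;
    while (cur) {
      // The last ad host seen while walking up is the outermost one.
      if (AD_HOSTS.has(baseHost(cur))) root = baseHost(cur);
      cur = requests[cur] && requests[cur].initiator;
    }
    if (root) totals[root] = (totals[root] || 0) + req.bytes;
  }
  return totals;
}
```

Under this scheme a doubleclick -> criteo -> celtra chain would attribute the celtra bytes all the way up to doubleclick, which is the roll-up behavior @jasti asked about.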

#4

How were you thinking of implementing this? The challenge is rolling up all dependencies of the third party to the common initiator. With the JS timing data, Lighthouse does all the work of aggregating dependencies.

Also, do you have any plans to track how these stats change over time?

I’m helping @jasti to answer his questions but would love to reuse any work you’ve already done.

#5

Fairly similar to what you’ve already done :slight_smile:

IMO, the initiator of the request isn’t as critical for request data, since each request has to go to some domain, and the hope is that we’ve already labelled that domain and attributed it to the same entity. The new dataset tracks ~3k domains now, so the coverage is pretty good.

For CPU/JS timing data, the alternative to using the initiator is having nothing to attribute the time to at all, so the need for LH to do the heavy lifting is much greater.

My plan for eventually doing it was just to modify the origin-grouping request query to grab total bytes too, which is exactly what you’ve already done.

So far it’s just been monthly manual updates to the GH repo with the JSON data. There’s been chatter about making it more official, which I’d love to spend some time on :slight_smile:

#6

Oh that’s great to hear that you’re tracking 3k domains now. For my earlier analysis, I converted your JSON array of entities into a BigQuery table. I’ll update that table now with your newest entity data.

For posterity, here’s how I’m doing the conversion:

  1. Copy the entity JSON into the Dev Tools console and assign it to a variable named entities
  2. Convert to a newline-delimited JSON string and save as a local file: entities.map(e=>JSON.stringify(e)).join('\n')
  3. Upload the file in the BigQuery UI to create a new table

I reran the analysis using the March dataset and the updated third_parties table:

SELECT
  name,
  COUNT(0) AS requests,
  APPROX_QUANTILES(CAST(JSON_EXTRACT_SCALAR(payload, "$['_bytesIn']") AS INT64), 1000)[OFFSET(500)] AS median_bytes,
  APPROX_QUANTILES(CAST(JSON_EXTRACT_SCALAR(payload, "$['_cpu.EvaluateScript']") AS INT64), 1000)[OFFSET(500)] AS median_js_eval,
  APPROX_QUANTILES(CAST(JSON_EXTRACT_SCALAR(payload, "$['_cpu.FunctionCall']") AS INT64), 1000)[OFFSET(500)] AS median_js_exec
FROM
  `httparchive.requests.2019_03_01_desktop`
JOIN
  `httparchive.scratchspace.third_parties`
ON
  NET.HOST(url) IN UNNEST(domains)
WHERE
  'ad' IN UNNEST(categories) AND
  JSON_EXTRACT_SCALAR(payload, '$._contentType') = 'text/javascript'
GROUP BY
  name
ORDER BY
  requests DESC
name requests median_bytes median_js_eval median_js_exec
Google/Doubleclick Ads 3834246 9282 8 0
Criteo 113005 13097 30 1
Media.net 26517 14739 30 20
WordAds 14876 1792 61
Rubicon Project 9876 7558 19 42
ActiveCampaign 7836 0 0
Pubmatic 6033 10677 3 84
33 Across 4738 809 174
Popads 4563 44 1 44
DoubleVerify 4395 3255 24 1
OpenX 4361 17994 112 50
Branch Metrics 3329 90 3
fam 2642 2018
Market GID 1795 38506 36 81
Yahoo! 1197 309 1
LongTail Ad Solutions 794 35272 28 1
Fidelity Media 783 1759 1 38
Smarter Click 719 16316 10 61
Numberly 521 144 4
AcuityAds 491 2453 8
Magnetic 486 37845 11
PollDaddy (Automattic) 451 6326 9 1
NextPerf 396 0 0
Bidtellect 381 3539 5 1
Impact Radius 310 5591 10 2
Reklama 236 1427 6 0
VisualVisitor 195 269 1
Geniee 190 43 0
Affiliatly 121 1776 1 135
Q1Media 60 17245 22 27
Media Math 47 19075 36 774
Batch Media 43 34 1
Lead Forensics 28 135 0
Adroll 23 57 8
ADventori 22 205 1
Clickadu (Winner Solutions) 22 1223 23
Converge-Digital 16 893 1
Eco Rebates 15 12027 2 1
LoopMe 15 74 1
Video Media Groep 14 1219 2 22
Taboola 9 66338 38 1
Communicator 4 14995
engage:BDR 4 3739 2
Unbounce 2 2050 1
Talkable 2 652 0 0
ToutApp 2 0 0
Fresh 8 Gaming 2 874 2
ClickDimensions 1 0 0
Admixer for Publishers 1 116179 53 74
RebelMouse 1 3997 25 1

@patrickhulce does this look right?

#7

The volume of requests seems off for quite a few of the entities. For example, Taboola has ~26,000 appearances in the script dataset, so we should see at least that many requests (to account for downloading the scripts and any extra assets they need), but in this query result we only see 9. In fact, the threshold for inclusion in the third-party-web dataset is at least 100 appearances, so nothing should be below 100.

Not sure what’s going on there. Maybe it’s the explicit content-type filter?

#8

Reran without the JS type limitation:

SELECT
  name,
  COUNT(0) AS requests,
  APPROX_QUANTILES(CAST(JSON_EXTRACT_SCALAR(payload, "$['_bytesIn']") AS INT64), 1000)[OFFSET(500)] AS median_bytes,
  APPROX_QUANTILES(CAST(JSON_EXTRACT_SCALAR(payload, "$['_cpu.EvaluateScript']") AS INT64), 1000)[OFFSET(500)] AS median_js_eval,
  APPROX_QUANTILES(CAST(JSON_EXTRACT_SCALAR(payload, "$['_cpu.FunctionCall']") AS INT64), 1000)[OFFSET(500)] AS median_js_exec
FROM
  `httparchive.requests.2019_03_01_desktop`
JOIN
  `httparchive.scratchspace.third_parties`
ON
  NET.HOST(url) IN UNNEST(domains)
WHERE
  'ad' IN UNNEST(categories)
GROUP BY
  name
ORDER BY
  requests DESC
name requests median_bytes median_js_eval median_js_exec
Google/Doubleclick Ads 10950994 365 4 0
The Trade Desk 910727 0 0
Bidswitch 725999 0 5 1
Rubicon Project 520786 42 19 24
Drawbridge 501321 0
Adroll 421226 0 13
AppNexus 407349 43 0 9
Amazon Ads 382701 43
Scorecard Research 358456 0 4 18
Pubmatic 356540 4257 9 68
AOL / Oath / Verizon Media 245947 0 10
Taboola 226775 0 31 17
Criteo 220558 7701 30 1
Yandex Ads 197976 43 64 76
WordAds 197716 51 50 6
Yahoo! 192827 1455 0 5
Bing Ads 183769 0 2 20
Media.net 114414 236 29 19
Integral Ad Science 81415 13148 12 122
Geniee 54811 0 0
Twitter Online Conversion Tracking 49448 1954 2 2
Clickagy Audience Lab 44140 0
Fidelity Media 40561 0 1 31
OpenX 36820 319 2 50
Republer 35244 0 42
DoubleVerify 34336 5086 49 94
Moat 25336 86547 189 89
Bidtellect 23871 0 5 1
Sizmek 22889 2751 8 7
AcuityAds 16577 0 8
Unbounce 15965 5157 2
TrafficStars 15702 332 3
Privy 14676 0
MediaVine 13507 27816 39 74
33 Across 12866 806 171 265
LongTail Ad Solutions 12732 0 28 1
fam 12027 2201 3 12
Neustar AdAdvisor 11709 0
OneTag 11696 0 3 0
Innovid 11000 43
Popads 9730 806 6 53
RTB House AdPilot 9393 0 7 7
Xaxis 8478 0
ActiveCampaign 7839 0 0
MGID 5082 1098 97
Datonics 5023 43 0
PushCrew 4689 3597 2 4
Branch Metrics 3681 90 3 31
Numberly 3485 43 4
Storygize 2722 0
Admeta 2702 43 0
Smarter Click 2553 3117 4 11
Ziff Davis Tech 2522 0
Conversant Ad Server 2346 49 1
SaleCycle 2346 2463 27 7
Underdog Media 2119 43 17
DialogTech SourceTrak 2024 24055 19 1
Crosswise 1962 150
Market GID 1840 38452 36 81
Affiliatly 1662 1808 3 135
Adverline Board 1649 43 3 1
Flockler 1345 18418 18
TrafficJunky 1326 19508 33
Impact Radius 1320 25048 10 2
Auto Link Maker 1235 6160 5 0
Talkable 1090 26680 6 14
NextPerf 962 0 0
ReachDynamics 860 0
piano 853 2237 4 8
Video Media Groep 849 20 0 22
Performio 760 0 8 15
Reklama 749 233 6 0
VideoHub 731 0
Justuno 677 3057 5 0
Extole 672 2827 8 63
Friendbuy 539 41605 15 13
Remarketing Pixel 532 7
Better Banners 527 2689 1 5
Navegg 505 178 0
Exponential Interactive 498 3117 2 53
Magnetic 492 37845 11
Yieldify 489 7964 48 150
PollDaddy (Automattic) 453 6326 9 1
VisualVisitor 390 279 3 5
Profitshare 375 0 8 1
QuinStreet 367 43 11
LoyaltyLion 333 19713 26 30
ClickDimensions 311 3585 1 3
ADventori 295 0 5 5
Permutive 280 138834 53 51
Adthink 253 99 15 1
Retention Science 230 16415 23 1
SmartAdServer 220 43 3 5
Vero 217 5623 13 1
Soundest 210 14671 31 0
Semcasting Site Visitor Attribution 191 0 0
RebelMouse 186 0 23 0
Admedo 152 0
Developer Media 128 1208 1 1
One by AOL:Mobile 126 0 0
Sourcepoint 125 42771 116 28
Finsbury Media 113 4432 26 7
JustPremium Ads 102 13218 2 11

(truncated)

The bytes column is out of whack due to what I assume are many 0-byte tracking pixels or similar. Any alternative ideas for detecting JS so these non-script requests can be filtered out?

#9

Oh awesome!!

I wonder how just "ends in .js" would fare?

Those volume numbers look a lot more reasonable :slight_smile: How are those _cpu stats computed? I didn’t even know those were there! I’m assuming they’re partly lower due to desktop vs. mobile, but I’d also be curious to understand the attribution logic there.

#10

I’ll take a sample and see how reliable .js is. We also need to factor in query-string params, etc., but it could be a useful signal. I’m curious to see how much better it is than the content type.
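One way to handle the query-string caveat is to test the extension of the URL path rather than the full URL (a quick sketch of the heuristic only, not a definitive detector; minified bundles without a .js extension would still slip through):

```javascript
// Check whether a request URL looks like a script by its path extension,
// ignoring query strings and fragments (e.g. https://x.com/ad.js?v=2&cb=99).
function looksLikeScript(url) {
  const path = new URL(url).pathname;
  return path.toLowerCase().endsWith('.js');
}
```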

cc @patmeenan to answer the question about CPU stats