Ads impact (bytes transferred, execution time) on the web over time

#1

Hi all,
@rviscomi shared some amazing analysis by @patrickhulce here. A couple of questions:

  1. Does Patrick’s execution time include all subsequent ad redirects and resources loaded from domains referenced as part of the original ad? E.g. doubleclick.net (the primary ad tag on the page) -> loads an “ad” from criteo.com -> loads the final ad from celtra.com -> loads some additional resources from hostingservice.com.

  2. Similar to execution time above, is there a way to get at total bytes transferred that includes the entire chain?

  3. Trends over the past few years for both 1 and 2, on mobile & desktop, are what I’m really after.

I’m new to this forum, so apologies if I’m missing anything obvious.
Cheers.

#2

Glad you found it helpful :slight_smile:

  1. The execution time for the entire page load is analyzed and broken down by tasks. Each main-thread task is attributed to the first recognizable script in the stack that triggered it. That is, script executions from doubleclick will be attributed to doubleclick, but everything that the criteo assets trigger will be attributed to criteo, and so on.
  2. Getting the total bytes data and adding it to the repo has been brought up before; I just haven’t gotten around to it yet. The queries for bytes should actually be much easier than the task-time ones, too :tada:
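A rough sketch of the attribution logic described in point 1 (a simplified model of my own, not the actual implementation; `ENTITY_BY_HOST`, `attributeTask`, and the task/stack shape are all hypothetical):

```javascript
// Hypothetical sketch: each main-thread task carries a stack of script URLs,
// and the task is attributed to the first script whose host we recognize.
const ENTITY_BY_HOST = {
  'doubleclick.net': 'Google/Doubleclick Ads',
  'criteo.com': 'Criteo',
};

// Reduce a URL to its registrable base domain (naive two-label version).
function hostOf(url) {
  return new URL(url).hostname.split('.').slice(-2).join('.');
}

// `task.stack` lists the script URLs that led to this task, innermost first.
function attributeTask(task) {
  for (const scriptUrl of task.stack) {
    const entity = ENTITY_BY_HOST[hostOf(scriptUrl)];
    if (entity) return entity; // first recognizable script wins
  }
  return 'unattributed';
}
```

So a task triggered by a criteo asset lands in Criteo’s bucket even if doubleclick kicked off the chain, matching the per-script attribution described above.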
#3

For reference, here’s the query I started with:

SELECT
  name,
  COUNT(0) AS requests,
  APPROX_QUANTILES(CAST(JSON_EXTRACT_SCALAR(payload, "$['_bytesIn']") AS INT64), 1000)[OFFSET(500)] AS median_bytes,
  APPROX_QUANTILES(CAST(JSON_EXTRACT_SCALAR(payload, "$['_cpu.EvaluateScript']") AS INT64), 1000)[OFFSET(500)] AS median_js_eval,
  APPROX_QUANTILES(CAST(JSON_EXTRACT_SCALAR(payload, "$['_cpu.FunctionCall']") AS INT64), 1000)[OFFSET(500)] AS median_js_exec
FROM
  `httparchive.requests.2019_02_01_desktop`
JOIN
  `httparchive.scratchspace.third_parties`
ON
  NET.HOST(url) IN UNNEST(domains)
WHERE
  'ad' IN UNNEST(categories) AND
  JSON_EXTRACT_SCALAR(payload, '$._contentType') = 'text/javascript'
GROUP BY
  name
ORDER BY
  requests DESC
name requests median_bytes median_js_eval median_js_exec
Google/Doubleclick Ads 3656674 8721 8 0
Criteo 130062 12645 24 1
Yahoo Ads 24760 14425 21 16
WordAds 15266 1541 42
Rubicon Project 10588 7558 18 35
33 Across 5193 788 180
Popads 4992 44 0 0
OpenX 4939 17994 83 45
DoubleVerify 4309 3095 23 1
Market GID 2768 38808 34 94
Pubmatic 637 10677 3 151
Media Math 44 19075 34 704
Adroll 23 57 9
Taboola 13 63648 39 8

I also wrote a query to generate a timeseries of these values. It’s SUPER EXPENSIVE to run, so you probably shouldn’t run it:

# WARNING!!! 51.2 TB !!!
SELECT
  SUBSTR(_TABLE_SUFFIX, 0, 10) AS date,
  IF(ENDS_WITH(_TABLE_SUFFIX, 'desktop'), 'desktop', 'mobile') AS client,
  name,
  COUNT(0) AS requests,
  APPROX_QUANTILES(CAST(JSON_EXTRACT_SCALAR(payload, "$['_bytesIn']") AS INT64), 1000)[OFFSET(500)] AS median_bytes,
  APPROX_QUANTILES(CAST(JSON_EXTRACT_SCALAR(payload, "$['_cpu.EvaluateScript']") AS INT64), 1000)[OFFSET(500)] AS median_js_eval,
  APPROX_QUANTILES(CAST(JSON_EXTRACT_SCALAR(payload, "$['_cpu.FunctionCall']") AS INT64), 1000)[OFFSET(500)] AS median_js_exec
FROM
  `httparchive.requests.*`
JOIN
  `httparchive.scratchspace.third_parties`
ON
  NET.HOST(url) IN UNNEST(domains)
WHERE
  'ad' IN UNNEST(categories) AND
  JSON_EXTRACT_SCALAR(payload, '$._contentType') = 'text/javascript'
GROUP BY
  date,
  client,
  name
ORDER BY
  date,
  client,
  name

Browse the raw data

It’s important to note that, unlike @patrickhulce’s approach, which uses Lighthouse, this approach only accounts for the bytes and execution time directly attributed to the safelist of ad-provider hostnames. I’d need to either query the Lighthouse dataset (limited to mobile only) or implement some kind of request mapping to roll up dependent scripts to their parents.
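The request-mapping idea could look roughly like this (a hypothetical sketch, not anything in the HTTP Archive schema; the `initiator` field, the `AD_HOSTS` set, and `rollUpAdBytes` are all illustrative):

```javascript
// Hypothetical sketch: walk each request's initiator chain upward and, when an
// ancestor belongs to a known ad host, roll the request's bytes up to the
// outermost ad entity in the chain (the original ad tag).
const AD_HOSTS = new Set(['doubleclick.net', 'criteo.com']);

// Naive registrable-domain extraction (two labels).
function baseHost(url) {
  return new URL(url).hostname.split('.').slice(-2).join('.');
}

// `requests` maps url -> { initiator: url | null, bytes: number }
function rollUpAdBytes(requests) {
  const totals = {};
  for (const [url, req] of Object.entries(requests)) {
    let cur = url;
    let root = null;
    while (cur) {
      // The last ad host seen while walking up is the outermost one.
      if (AD_HOSTS.has(baseHost(cur))) root = baseHost(cur);
      cur = requests[cur] && requests[cur].initiator;
    }
    if (root) totals[root] = (totals[root] || 0) + req.bytes;
  }
  return totals;
}
```

Under this scheme a doubleclick -> criteo -> celtra chain would attribute the celtra bytes all the way up to doubleclick, which is the roll-up behavior @jasti asked about.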

#4

How were you thinking of implementing this? The challenge is rolling up all dependencies of the third party to the common initiator. With the JS timing data, Lighthouse does all the work of aggregating dependencies.

Also, do you have any plans to track how these stats change over time?

I’m helping @jasti to answer his questions but would love to reuse any work you’ve already done.

#5

Fairly similar to what you’ve already done :slight_smile:

IMO, the initiator of the request isn’t as critical for request data, since each request has to go to some domain, and the hope is that we’ve already labelled that domain and attributed it to the same entity. The new dataset tracks ~3k domains now, so the coverage is pretty good.

For CPU/JS timing data, the alternative to using the initiator is having nothing to attribute the time to at all, so the need for LH to do the heavy lifting is much greater.

My plan for eventually doing it was just to modify the origin-grouping request query to grab total bytes too, which is exactly what you’ve already done.

So far it’s just been monthly manual updates to the GH repo with the JSON data. There’s been chatter about making it more official, which I’d love to spend some time on :slight_smile:

#6

Oh that’s great to hear that you’re tracking 3k domains now. For my earlier analysis, I converted your JSON array of entities into a BigQuery table. I’ll update that table now with your newest entity data.

For posterity, here’s how I’m doing the conversion:

  1. Copy the entity JSON into the Dev Tools console and assign it to a variable named entities
  2. Convert to a newline-delimited JSON string and save as a local file: entities.map(e=>JSON.stringify(e)).join('\n')
  3. Upload the file in the BigQuery UI to create a new table

I reran the analysis using the March dataset and the updated third_parties table:

SELECT
  name,
  COUNT(0) AS requests,
  APPROX_QUANTILES(CAST(JSON_EXTRACT_SCALAR(payload, "$['_bytesIn']") AS INT64), 1000)[OFFSET(500)] AS median_bytes,
  APPROX_QUANTILES(CAST(JSON_EXTRACT_SCALAR(payload, "$['_cpu.EvaluateScript']") AS INT64), 1000)[OFFSET(500)] AS median_js_eval,
  APPROX_QUANTILES(CAST(JSON_EXTRACT_SCALAR(payload, "$['_cpu.FunctionCall']") AS INT64), 1000)[OFFSET(500)] AS median_js_exec
FROM
  `httparchive.requests.2019_03_01_desktop`
JOIN
  `httparchive.scratchspace.third_parties`
ON
  NET.HOST(url) IN UNNEST(domains)
WHERE
  'ad' IN UNNEST(categories) AND
  JSON_EXTRACT_SCALAR(payload, '$._contentType') = 'text/javascript'
GROUP BY
  name
ORDER BY
  requests DESC
name requests median_bytes median_js_eval median_js_exec
Google/Doubleclick Ads 3834246 9282 8 0
Criteo 113005 13097 30 1
Media.net 26517 14739 30 20
WordAds 14876 1792 61
Rubicon Project 9876 7558 19 42
ActiveCampaign 7836 0 0
Pubmatic 6033 10677 3 84
33 Across 4738 809 174
Popads 4563 44 1 44
DoubleVerify 4395 3255 24 1
OpenX 4361 17994 112 50
Branch Metrics 3329 90 3
fam 2642 2018
Market GID 1795 38506 36 81
Yahoo! 1197 309 1
LongTail Ad Solutions 794 35272 28 1
Fidelity Media 783 1759 1 38
Smarter Click 719 16316 10 61
Numberly 521 144 4
AcuityAds 491 2453 8
Magnetic 486 37845 11
PollDaddy (Automattic) 451 6326 9 1
NextPerf 396 0 0
Bidtellect 381 3539 5 1
Impact Radius 310 5591 10 2
Reklama 236 1427 6 0
VisualVisitor 195 269 1
Geniee 190 43 0
Affiliatly 121 1776 1 135
Q1Media 60 17245 22 27
Media Math 47 19075 36 774
Batch Media 43 34 1
Lead Forensics 28 135 0
Adroll 23 57 8
ADventori 22 205 1
Clickadu (Winner Solutions) 22 1223 23
Converge-Digital 16 893 1
Eco Rebates 15 12027 2 1
LoopMe 15 74 1
Video Media Groep 14 1219 2 22
Taboola 9 66338 38 1
Communicator 4 14995
engage:BDR 4 3739 2
Unbounce 2 2050 1
Talkable 2 652 0 0
ToutApp 2 0 0
Fresh 8 Gaming 2 874 2
ClickDimensions 1 0 0
Admixer for Publishers 1 116179 53 74
RebelMouse 1 3997 25 1

@patrickhulce does this look right?

#7

The volume of requests seems off for quite a few of the entities. For example, Taboola has ~26,000 appearances in the script dataset, so we should see at least that many requests (to account for downloading the scripts and any extra assets they need), but in this query result we only see 9. In fact, the threshold for inclusion in the third-party-web dataset is at least 100 appearances, so nothing should be below 100.

Not sure what’s going on there. Maybe it’s the explicit content-type filter?

#8

Reran without the JS type limitation:

SELECT
  name,
  COUNT(0) AS requests,
  APPROX_QUANTILES(CAST(JSON_EXTRACT_SCALAR(payload, "$['_bytesIn']") AS INT64), 1000)[OFFSET(500)] AS median_bytes,
  APPROX_QUANTILES(CAST(JSON_EXTRACT_SCALAR(payload, "$['_cpu.EvaluateScript']") AS INT64), 1000)[OFFSET(500)] AS median_js_eval,
  APPROX_QUANTILES(CAST(JSON_EXTRACT_SCALAR(payload, "$['_cpu.FunctionCall']") AS INT64), 1000)[OFFSET(500)] AS median_js_exec
FROM
  `httparchive.requests.2019_03_01_desktop`
JOIN
  `httparchive.scratchspace.third_parties`
ON
  NET.HOST(url) IN UNNEST(domains)
WHERE
  'ad' IN UNNEST(categories)
GROUP BY
  name
ORDER BY
  requests DESC
name requests median_bytes median_js_eval median_js_exec
Google/Doubleclick Ads 10950994 365 4 0
The Trade Desk 910727 0 0
Bidswitch 725999 0 5 1
Rubicon Project 520786 42 19 24
Drawbridge 501321 0
Adroll 421226 0 13
AppNexus 407349 43 0 9
Amazon Ads 382701 43
Scorecard Research 358456 0 4 18
Pubmatic 356540 4257 9 68
AOL / Oath / Verizon Media 245947 0 10
Taboola 226775 0 31 17
Criteo 220558 7701 30 1
Yandex Ads 197976 43 64 76
WordAds 197716 51 50 6
Yahoo! 192827 1455 0 5
Bing Ads 183769 0 2 20
Media.net 114414 236 29 19
Integral Ad Science 81415 13148 12 122
Geniee 54811 0 0
Twitter Online Conversion Tracking 49448 1954 2 2
Clickagy Audience Lab 44140 0
Fidelity Media 40561 0 1 31
OpenX 36820 319 2 50
Republer 35244 0 42
DoubleVerify 34336 5086 49 94
Moat 25336 86547 189 89
Bidtellect 23871 0 5 1
Sizmek 22889 2751 8 7
AcuityAds 16577 0 8
Unbounce 15965 5157 2
TrafficStars 15702 332 3
Privy 14676 0
MediaVine 13507 27816 39 74
33 Across 12866 806 171 265
LongTail Ad Solutions 12732 0 28 1
fam 12027 2201 3 12
Neustar AdAdvisor 11709 0
OneTag 11696 0 3 0
Innovid 11000 43
Popads 9730 806 6 53
RTB House AdPilot 9393 0 7 7
Xaxis 8478 0
ActiveCampaign 7839 0 0
MGID 5082 1098 97
Datonics 5023 43 0
PushCrew 4689 3597 2 4
Branch Metrics 3681 90 3 31
Numberly 3485 43 4
Storygize 2722 0
Admeta 2702 43 0
Smarter Click 2553 3117 4 11
Ziff Davis Tech 2522 0
Conversant Ad Server 2346 49 1
SaleCycle 2346 2463 27 7
Underdog Media 2119 43 17
DialogTech SourceTrak 2024 24055 19 1
Crosswise 1962 150
Market GID 1840 38452 36 81
Affiliatly 1662 1808 3 135
Adverline Board 1649 43 3 1
Flockler 1345 18418 18
TrafficJunky 1326 19508 33
Impact Radius 1320 25048 10 2
Auto Link Maker 1235 6160 5 0
Talkable 1090 26680 6 14
NextPerf 962 0 0
ReachDynamics 860 0
piano 853 2237 4 8
Video Media Groep 849 20 0 22
Performio 760 0 8 15
Reklama 749 233 6 0
VideoHub 731 0
Justuno 677 3057 5 0
Extole 672 2827 8 63
Friendbuy 539 41605 15 13
Remarketing Pixel 532 7
Better Banners 527 2689 1 5
Navegg 505 178 0
Exponential Interactive 498 3117 2 53
Magnetic 492 37845 11
Yieldify 489 7964 48 150
PollDaddy (Automattic) 453 6326 9 1
VisualVisitor 390 279 3 5
Profitshare 375 0 8 1
QuinStreet 367 43 11
LoyaltyLion 333 19713 26 30
ClickDimensions 311 3585 1 3
ADventori 295 0 5 5
Permutive 280 138834 53 51
Adthink 253 99 15 1
Retention Science 230 16415 23 1
SmartAdServer 220 43 3 5
Vero 217 5623 13 1
Soundest 210 14671 31 0
Semcasting Site Visitor Attribution 191 0 0
RebelMouse 186 0 23 0
Admedo 152 0
Developer Media 128 1208 1 1
One by AOL:Mobile 126 0 0
Sourcepoint 125 42771 116 28
Finsbury Media 113 4432 26 7
JustPremium Ads 102 13218 2 11

(truncated)

The bytes column is out of whack due to what I assume are many 0-byte tracking pixels or similar. Any alternative ideas for detecting JS so these non-script requests can be filtered out?

#9

Oh awesome!!

I wonder how just "ends in .js" would fare?

Those volume numbers look a lot more reasonable :slight_smile: How are those _cpu stats computed? I didn’t even know those were there! I’m assuming they’re partly lower due to desktop vs. mobile, but I’d also be curious to understand the attribution logic there.

#10

I’ll take a sample and see how reliable .js is. We also need to factor in query-string params, etc., but it could be a useful signal. I’m curious to see how much better it is than the content type.
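One way to handle the query-string caveat is to test the extension of the URL path rather than the full URL (a quick sketch of the heuristic only, not a definitive detector; minified bundles without a .js extension would still slip through):

```javascript
// Check whether a request URL looks like a script by its path extension,
// ignoring query strings and fragments (e.g. https://x.com/ad.js?v=2&cb=99).
function looksLikeScript(url) {
  const path = new URL(url).pathname;
  return path.toLowerCase().endsWith('.js');
}
```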

cc @patmeenan to answer the question about CPU stats