How and where is document.write used on the web?


#1

As of the June 1 mobile crawl, Lighthouse reports are available in the HAR dataset on BigQuery.

As an example, let’s find out how many sites are still using document.write in their JS.

We’ll write our query against the httparchive:har.latest_lighthouse_mobile table, which always points to the most recent results as new crawls complete. Since the data is in JSON format, we’ll use JSON_EXTRACT_SCALAR to pluck out a single value by its address in the JSON object. For this particular audit score, the address is:

"$.audits.no-document-write.score"

The leading $ represents the root of the JSON object, and each dot-separated property name descends one level deeper into the object.
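
To make the addressing concrete, here is a minimal JavaScript sketch of how a simple dot-notation path resolves against a parsed report. The extractScalar helper and the sample report are made up for illustration; this is not how BigQuery implements JSON_EXTRACT_SCALAR, just the same idea for plain dotted paths.

```javascript
// Hypothetical helper: resolves a simple "$.a.b.c" path against a JSON string.
// Only plain dot-notation is handled (no array indexes or wildcards).
function extractScalar(json, path) {
  const keys = path.replace(/^\$\.?/, '').split('.').filter(Boolean);
  let node = JSON.parse(json);
  for (const key of keys) {
    if (node === null || typeof node !== 'object' || !(key in node)) {
      return null; // the path does not exist in this object
    }
    node = node[key];
  }
  // Only scalar values are returned, and always as strings.
  return node === null || typeof node === 'object' ? null : String(node);
}

// Made-up, heavily trimmed report for illustration.
const sampleReport = JSON.stringify({
  audits: { 'no-document-write': { score: false } },
});

// Prints the string "false".
console.log(extractScalar(sampleReport, '$.audits.no-document-write.score'));
```

A path that does not exist (or that points at an object rather than a scalar) comes back as null in this sketch, which is why the query filters out NULL scores.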

So here’s the full query to get a breakdown of scores for this particular audit:

SELECT
  JSON_EXTRACT_SCALAR(report, "$.audits.no-document-write.score") AS score,
  COUNT(0) AS volume
FROM
  [httparchive:har.latest_lighthouse_mobile]
WHERE
  report IS NOT NULL
GROUP BY
  score
HAVING
  score IS NOT NULL
ORDER BY
  score

Run it on BigQuery

The results are split pretty evenly:

score	volume
false	213456
true	214479

So about half of the pages crawled by HTTP Archive (213,456 of 427,935, or 49.9%) fail the audit, meaning they are still using document.write. The best part of it all is that we get to watch these metrics update every couple of weeks and monitor how the web is changing.

Feel free to comment in this thread if you’ve got any other interesting findings from the Lighthouse data.


#2

This is cool (if terrifying). I’m curious: is there information in the dataset that could help determine if this comes largely from hand-coding or rather if it is dominated by one stupid library being used a lot? I ask because if it’s the latter, a campaign to push for upgrades might achieve something.

If not, well, we’ll just have to wait for the content to die :slight_smile:


#3

I bet it is related to some AdServer.


#4

“I bet it is related to some AdServer.”

Winner!

Yeah, it looks like ads are by far the predominant culprit.

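-- CREATE TEMPORARY FUNCTION is standard SQL syntax, so run this as standard SQL.
CREATE TEMPORARY FUNCTION getViolations(items STRING)
RETURNS ARRAY<STRING>
LANGUAGE js AS """
  try {
    // Each item is a row of cells; i[0].text is the offending URL.
    return JSON.parse(items).map(i => {
      const url = i[0].text;
      // Drop the query string so that requests group by base URL.
      return url.substr(0, url.indexOf('?'));
    });
  } catch (e) {
    // Missing or malformed details produce no rows.
    return [];
  }
""";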

SELECT
  COUNT(0) AS volume,
  url
FROM (
  SELECT
    getViolations(JSON_EXTRACT(report, "$.audits.no-document-write.details.items")) AS urls
  FROM
    `httparchive.har.2017_06_15_android_lighthouse`)
CROSS JOIN
  UNNEST(urls) AS url
WHERE
  url != ''
GROUP BY
  url
ORDER BY
  volume DESC

(BigQuery)
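
For anyone who wants to poke at the UDF outside BigQuery, here is the same logic as a plain JavaScript function run against a made-up items value. The sample URLs are hypothetical, and the input shape (an array of rows whose first cell has a text property) is inferred from the i[0].text access in the function above.

```javascript
// Same logic as the getViolations UDF above, as a standalone function.
function getViolations(items) {
  try {
    return JSON.parse(items).map((i) => {
      const url = i[0].text;
      // Keep everything before the query string.
      return url.substr(0, url.indexOf('?'));
    });
  } catch (e) {
    return [];
  }
}

// Hypothetical input; the shape is inferred from the i[0].text access.
const items = JSON.stringify([
  [{ text: 'https://ads.example.com/pagead/ads?client=123' }],
  [{ text: 'https://cdn.example.com/lib.js' }],
]);

console.log(getViolations(items));
console.log(getViolations('not json'));
```

One thing this makes visible: a URL with no ? in it comes back as an empty string, because indexOf returns -1. Those rows are then dropped by the url != '' condition in the query, so the counts only cover requests that had a query string.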

Top 10 results:

Row	volume	url	 
1	66561	https://googleads.g.doubleclick.net/pagead/ads	 
2	29176	https://image6.pubmatic.com/AdServer/PugMaster	 
3	14088	https://cdn.cdnvideofiles.com/JVS/Display08/ams.js	 
4	13671	http://image6.pubmatic.com/AdServer/PugMaster	 
5	12570	https://cdn.cdnvideofiles.com/JVS/Display08/pg.js	 
6	8849	https://cdn.cdnvideofiles.com/JVS/Display08/ttm.js	 
7	7795	https://presentation-hkg1.turn.com/server/ads.js	 
8	6496	https://ssl.directferries.com/partners/deal_finder.aspx	 
9	6170	https://www6.smartadserver.com/ac	 
10	3910	http://a.tribalfusion.com/j.ad

And here’s a variation of the previous query that groups by domain instead of base URL (BigQuery). Top 10 results:

Row	volume	domain	 
1	332851	googlesyndication.com	 
2	188415	doubleclick.net	 
3	105081	pubmatic.com	 
4	53367	aniview.com	 
5	48295	cdnvideofiles.com	 
6	43169	lijit.com	 
7	37627	googleadservices.com	 
8	24159	rubiconproject.com	 
9	23099	googletagservices.com	 
10	20818	exoclick.com
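
The domain variation is only linked, not shown, so as a rough illustration here is one way the grouping could work in plain JavaScript. The getDomain and countByDomain helpers are hypothetical, and the "last two labels of the hostname" rule is an assumption on my part; it is not necessarily what the linked query does.

```javascript
// Naive "domain" extraction: keep the last two labels of the hostname.
// Assumption: this mishandles suffixes like .co.uk; a public suffix list
// would be needed to do it properly.
function getDomain(url) {
  try {
    const host = new URL(url).hostname;
    return host.split('.').slice(-2).join('.');
  } catch (e) {
    return ''; // unparseable URL
  }
}

// Tally offending requests per domain, highest volume first.
function countByDomain(urls) {
  const counts = new Map();
  for (const url of urls) {
    const domain = getDomain(url);
    if (domain === '') continue;
    counts.set(domain, (counts.get(domain) || 0) + 1);
  }
  return [...counts.entries()].sort((a, b) => b[1] - a[1]);
}

console.log(countByDomain([
  'https://googleads.g.doubleclick.net/pagead/ads',
  'http://image6.pubmatic.com/AdServer/PugMaster',
  'https://image6.pubmatic.com/AdServer/PugMaster',
]));
```

Grouping this way folds the http and https variants of the same host together, which is one reason the per-domain numbers are larger than the per-URL ones.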

#5

Wait. So given that DoubleClick is owned by Google, if we add the Google properties (that I know of) in that top ten we get:

1	332851	googlesyndication.com	 
2	188415	doubleclick.net	 
7	37627	googleadservices.com	 
9	23099	googletagservices.com	

That adds up to 581,992, which is almost three times the number of sites that feature document.write in the first place, so obviously something is off. I don’t know BigQuery, but this counts multiple hits when an ad appears several times in a page, right?

I have tried to dedupe a bit: this query deduplicates URLs. The numbers are lower but still insane: the total for Google properties is still 268,910. I don’t know what libraries are available in a BigQuery context, so I made a super hacky version that tries to dedupe domains first. It’s clearly a bit buggy, but on the assumption that it’s not too buggy I get 121,917 for Google properties. Measured against the 213,456 pages that fail the audit, that’s roughly 57% of the Web’s document.write problem coming from Google.

There are many places in which I could have screwed this up since I have no idea what I’m doing and just basically modified your query until I got results that looked possible — but this still seems to point at a fairly big potential win just from reaching out to Google :slight_smile:


#6

Just remember that the counts are of offending requests, of which there may be many per page. The original query found 213456 pages with at least one request containing document.write. The follow-up query counted all offending requests to get an idea of magnitude. For example, this approach is influenced by requests that are included multiple times per page, all of which add to the severity of the problem. Your approach counts distinct domains per page, which has value but answers a different question.
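
To make the difference between the counts concrete, here is a small JavaScript sketch over made-up data. The page names and offender lists are invented; only the three ways of counting correspond to what has been discussed in this thread.

```javascript
// Made-up data: each page lists the domains of its offending requests.
const pages = [
  { page: 'a.example', offenders: ['doubleclick.net', 'doubleclick.net', 'googlesyndication.com'] },
  { page: 'b.example', offenders: ['doubleclick.net'] },
  { page: 'c.example', offenders: [] },
];

// 1. Pages with at least one offending request (what the original query measures).
const failingPages = pages.filter((p) => p.offenders.length > 0).length;

// 2. Every offending request counts (what the follow-up query measures).
function countRequests(domain) {
  return pages.reduce(
    (sum, p) => sum + p.offenders.filter((d) => d === domain).length,
    0
  );
}

// 3. A domain counts at most once per page (the deduplicated approach).
function countPages(domain) {
  return pages.filter((p) => new Set(p.offenders).has(domain)).length;
}

console.log(failingPages);                     // 2
console.log(countRequests('doubleclick.net')); // 3
console.log(countPages('doubleclick.net'));    // 2
```

Only the third number is comparable to the count of failing pages; the second can exceed it whenever a script is requested more than once on the same page.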

I’ll see if there’s anything that can be done along the lines of enforcement of code quality.


#7

Even if it looks like the problem comes from Google, the real issue is on the creative side. Media companies and advertisers don’t care about speed or good code, so you will find not only document.write but also 300×250 banners weighing several megabytes… obviously insane :confounded:.

We still have a lot to do on web perf education, especially for the advertising part of the web, because it is a big part of the pie.

Maybe forcing AMP for ads could be a great idea… better than forcing it on website owners in exchange for visibility on Google. In this case AMP would make a lot of sense :slight_smile:


#8

A few folks noted this on Twitter already: doc.write inside of an iframe does not block the top-level frame.

As next steps, to better understand the impact of the numbers we’re citing here:

  • Can we isolate top-level frame use of doc.write vs child frame here?
  • For child frames, should we group by the child’s src, or by the parent script that injected the frame into the parent?
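
As a rough illustration of the top-level vs. child frame distinction, here is a JavaScript sketch of in-page instrumentation. This is not something the current dataset exposes, and isTopLevelFrame and instrumentDocumentWrite are hypothetical helpers, not Lighthouse APIs.

```javascript
// Returns true when the given window is the top-level browsing context.
// In a page you would call isTopLevelFrame(window).
function isTopLevelFrame(win) {
  return win.top === win.self;
}

// Hypothetical wrapper: records which frame each document.write call comes
// from, then delegates to the original implementation.
function instrumentDocumentWrite(win, log) {
  const original = win.document.write;
  win.document.write = function (...args) {
    log.push({
      topLevel: isTopLevelFrame(win),
      frameUrl: win.location.href,
    });
    return original.apply(win.document, args);
  };
}
```

With something like this running in every frame, calls logged with topLevel set to true would be the ones that can block the main document's parser.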

#9

Lighthouse knows the script, so I guess it needs a way to query the DOM for which scripts are in the head and synchronous (not the full case, but perhaps an important one?).

Adobe Tag Manager and Maxymiser (A/B) are two 3rd parties that we commonly see using doc.write and that aren’t iframed.
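
For the "in the head and synchronous" check, a minimal JavaScript sketch might look like the following. The findSyncHeadScripts name is made up, and this only approximates the case described above; it is not how Lighthouse does it.

```javascript
// Finds external scripts in <head> that are neither async, deferred, nor
// modules. Inline scripts (no src) are ignored in this sketch.
// Pass the page's document object, e.g. findSyncHeadScripts(document).
function findSyncHeadScripts(doc) {
  const candidates = Array.from(doc.querySelectorAll('head script[src]'));
  return candidates.filter(
    (s) => !s.async && !s.defer && s.type !== 'module'
  );
}
```

In a browser console, findSyncHeadScripts(document).map(s => s.src) would list the candidate URLs, which could then be matched against the scripts flagged for document.write.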


#10

Right, LH should have all the necessary info through the debug protocol to trace ancestry… If not, we may have this data in separate traces (cc @patmeenan), but that will require a lot of additional processing. My wishlist here would be:

  • Content of the doc.write?
  • URL of script (or frame URL, if inline script) that triggered doc.write
  • Ancestor URL of the script that triggered doc.write
  • Script may have been injected by another script, etc.
  • Frame URL where doc.write is triggered and parent’s URL if frame is a nested context
  • A “top-level doc.write” boolean would help isolate worst case offenders.
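
To make the wishlist a bit more tangible, here is a JavaScript sketch of what one record per doc.write occurrence could look like. Every field name and value below is a placeholder I made up to mirror the bullets; none of them come from Lighthouse or the debug protocol.

```javascript
// Hypothetical record shape for a single document.write occurrence.
const exampleRecords = [
  {
    content: '<img src="https://ads.example.com/pixel.gif">',
    scriptUrl: 'https://ads.example.com/tag.js', // script that called doc.write
    ancestorUrls: ['https://site.example/loader.js'], // scripts that injected it, outermost first
    frameUrl: 'https://site.example/', // frame where doc.write ran
    parentFrameUrl: null, // null when frameUrl is the top-level document
    topLevel: true,
  },
  {
    content: '<div>ad</div>',
    scriptUrl: 'https://ads.example.com/creative.js',
    ancestorUrls: [],
    frameUrl: 'https://ads.example.com/frame.html',
    parentFrameUrl: 'https://site.example/',
    topLevel: false,
  },
];

// Worst-case offenders: doc.write calls made in the top-level frame.
function worstOffenders(records) {
  return records.filter((r) => r.topLevel);
}

console.log(worstOffenders(exampleRecords).map((r) => r.scriptUrl));
```

With a topLevel flag on each record, isolating the worst cases becomes a simple filter rather than a reconstruction from frame and script ancestry.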

#11

My vote would be to identify cases where the doc.write isn’t a problem and file issues with the Lighthouse project so we can improve the source of the audits rather than try to hack around or recreate them (FWIW, there are also issues with the “render blocking scripts” audit where I’ve seen it pick up scripts at the end of the body).


#12

@patmeenan could you file an issue with a repro of that case? Only scripts in the head should be getting picked up there.

As for the doc.write wishlist, we recently switched to using Chrome’s built-in violations to detect document.write usage, which gives lower-precision (and occasionally missing) attribution compared with our previous method. Documenting cases where it’s currently lacking would help greatly in improving the audit.