As an example, let’s find out how many sites are still using document.write in their JS.
We’ll start writing our query to use the httparchive:har.latest_lighthouse_mobile table, which automatically matches the most recent results as new crawls complete. Since the data is in JSON format, we’ll use JSON_EXTRACT_SCALAR to pluck out a piece of data using its address in the JSON object. So for this particular audit score, the address is:
"$.audits.no-document-write.score"
The leading $ represents the root of the JSON object and each dot-notation property is a deeper level in the object.
So here’s the full query to get a breakdown of scores for this particular audit:
SELECT
JSON_EXTRACT_SCALAR(report, "$.audits.no-document-write.score") AS score,
COUNT(0) AS volume
FROM
[httparchive:har.latest_lighthouse_mobile]
WHERE
report IS NOT NULL
GROUP BY
score
HAVING
score IS NOT NULL
ORDER BY
score
So about half of the pages crawled by HTTP Archive are still using document.write. The best part of it all is that we get to watch these metrics update every couple of weeks and monitor how the web is changing.
Feel free to comment in this thread if you’ve got any other interesting findings from the Lighthouse data.
This is cool (if terrifying). I’m curious: is there information in the dataset that could help determine if this comes largely from hand-coding or rather if it is dominated by one stupid library being used a lot? I ask because if it’s the latter, a campaign to push for upgrades might achieve something.
If not, well, we’ll just have to wait for the content to die
Yeah it looks like ads are by far the most predominant culprit.
CREATE TEMPORARY FUNCTION getViolations(items STRING)
RETURNS ARRAY<STRING>
LANGUAGE js AS """
try {
return JSON.parse(items).map(i => {
const url = i[0].text;
return url.substr(0, url.indexOf('?'));
});
} catch (e) {
return [];
}
""";
SELECT
COUNT(0) AS volume,
url
FROM (
SELECT
getViolations(JSON_EXTRACT(report, "$.audits.no-document-write.details.items")) AS urls
FROM
`httparchive.har.2017_06_15_android_lighthouse`)
CROSS JOIN
UNNEST(urls) AS url
WHERE
url != ''
GROUP BY
url
ORDER BY
volume DESC
That adds up to 581,992. Which is almost three times as many as the number of sites that feature document.write in the first place — obviously something is off. I don’t know BigQuery, but this counts multiple hits when an ad appears several times in a page, right?
I have tried to dedupe a bit, this query will deduplicate URLs. The numbers are lower but still insane: the total for Google properties is still 268,910. I don’t know what libraries are available in a BigQuery context so I made a super hacky version that tries to dedupe domains first. It’s clearly a bit buggy, but on the assumption that it’s not too buggy I get 121,917 for Google properties. That’s 56% of the Web’s document.write problem coming from Google.
There are many places in which I could have screwed this up since I have no idea what I’m doing and just basically modified your query until I got results that looked possible — but this still seems to point at a fairly big potential win just from reaching out to Google
Just remember that the counts are of offending requests, of which there may be many per page. The original query found 213456 pages with at least one request containing document.write. The follow-up query counted all offending requests to get an idea of magnitude. For example, this approach is influenced by requests that are included multiple times per page, all of which add to the severity of the problem. Your approach counts distinct domains per page, which has value but answers a different question.
I’ll see if there’s anything that can be done along the lines of enforcement of code quality.
Even if it looks like the problem come from Google the real issue is on the creativity side. Media company and advertisers don’t care about speed or good code so you will find document.write, but also 300*250 banners with a weight of more than some megabytes…obviously insane .
We have still lot to do on the web perf education side and the adv part of the web cause it is a big part of the pie.
Maybe forcing AMP for ads could be a great idea…better than forcing website owner in exchange of visibility on Google. In this case AMP would make a lot of sense
Lighthouse knows the script so guess it needs a way to query the DOM for which scripts are in the head, and synchronous (not the full case but perhaps an important one?)
Adobe Tag Manager and Maxymiser (A/B) are two 3rd parties that we commonly see using doc.write and aren’t iframed
Right, LH should have all the necessary info through the debug protocol to trace ancestry… If not, we may have this data in separate traces (cc @patmeenan), but that will require a lot of additional processing. My wishlist here would be:
Content of the doc.write?
URL of script (or frame URL, if inline script) that triggered doc.write
Ancestor URL of the script that triggered doc.write
Script may have been injected by another script, etc.
Frame URL where doc.write is triggered and parent’s URL if frame is a nested context
A “top-level doc.write” boolean would help isolate worst case offenders.
My vote would be to identify cases where the doc.write isn’t a problem and file issues with the Lighthouse project so we can improve the source of the audits rather than try to hack around or recreate them (FWIW, there are also issues with the “render blocking scripts” audit where I’ve seen it pick up scripts at the end of the body).
@patmeenan could you file an issue with a repro of that case? Only scripts in the head should be getting picked up there.
As for the doc.write wishlist, we recently switched to using Chrome’s built-in violations to detect document.write usage which has lower precision (and occasionally missing) attribution than our previous method. Documenting cases where it’s currently lacking would help greatly in improving the audit.