The lang attribute is used to describe the language of the child content. For example, <html lang="en"> indicates that the content of the HTML document is in English.
Lighthouse includes an audit that validates this attribute, so we can write a query to see how the results are distributed:
SELECT
JSON_EXTRACT_SCALAR(report, "$.audits.valid-lang.score") AS score,
COUNT(0) AS volume
FROM
[httparchive:har.latest_lighthouse_mobile]
WHERE
report IS NOT NULL
GROUP BY
score
HAVING
score IS NOT NULL
ORDER BY
score
Results:
Row score volume
1 false 424
2 true 424917
This means that 99.9% of websites are using the lang attribute correctly. That’s surprisingly good. Maybe too good to be true? In any case, let’s dive deeper and try to learn more about how the attribute is being used incorrectly.
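The 99.9% figure follows directly from the two result rows. As a quick check of the arithmetic, here is the calculation in JavaScript, using only the volumes reported above (the variable names are just for illustration):

```javascript
// Volumes copied from the query results above.
const failing = 424;    // rows where score = false
const passing = 424917; // rows where score = true

const total = passing + failing;
const passRate = (passing / total) * 100;

// Prints: 99.9% of 425341 audited pages pass valid-lang
console.log(`${passRate.toFixed(1)}% of ${total} audited pages pass valid-lang`);
```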
Since there may be multiple instances of the lang attribute on a page, the invalid occurrences are in an array. The BigQuery JSON_EXTRACT function doesn’t yet play nicely with plucking data out of arrays, so we can work around that by writing a User-Defined Function (UDF) that does the manipulation directly in JavaScript. Here’s the query:
CREATE TEMPORARY FUNCTION getViolations(items STRING)
RETURNS ARRAY<STRING>
LANGUAGE js AS """
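try {
// match() returns null when nothing matches; fall back to an empty array.
return items.match(/lang="([^"]*)"/ig) || [];
} catch (e) {
// items was NULL or not a string.
return [];
}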
""";
SELECT
COUNT(0) AS volume,
lang
FROM (
SELECT
getViolations(JSON_EXTRACT(report, "$.audits.valid-lang.details.items")) AS langs
FROM
`httparchive.har.2017_06_15_android_lighthouse`)
CROSS JOIN
UNNEST(langs) AS lang
GROUP BY
lang
ORDER BY
volume DESC
This counts how often each distinct invalid lang value occurs and sorts the values by frequency. There are 837 results, but here are the top 25:
Row volume lang
1 123 lang="cz"
2 107 lang="zh_CN"
3 54 lang="zh_TW"
4 52 lang="join-count"
5 43 lang="0"
6 42 lang="a1"
7 42 lang="pt_PT"
8 39 lang="en_US"
9 28 lang="gr"
10 27 lang="ge"
11 27 lang="x-normage"
12 24 lang="main"
13 24 lang="X-NONE"
14 22 lang="jp"
15 22 lang="none"
16 21 lang="2"
17 20 lang="cn"
18 20 lang="KZ"
19 18 lang="en_GB"
20 18 lang="1"
21 15 lang="#0066cc"
22 13 lang="menu_list"
23 13 lang="english"
24 12 lang="3"
25 11 lang="\\'EN-US\\'"
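To make the query's behaviour concrete, here is a minimal sketch that runs the same regular expression in plain JavaScript and then performs the tally that UNNEST, GROUP BY and ORDER BY do in SQL. The sampleItems strings are made up for illustration and are not real Lighthouse output.

```javascript
// Same logic as the UDF body: pull every lang="..." occurrence out of a string.
function getViolations(items) {
  try {
    // match() returns null when nothing matches, so fall back to an empty array.
    return items.match(/lang="([^"]*)"/ig) || [];
  } catch (e) {
    // items was null or not a string.
    return [];
  }
}

// Hypothetical stand-ins for the serialized audit items of three pages.
const sampleItems = [
  '<html lang="cz"> <p lang="cz">',
  '<html lang="zh_CN">',
  null,
];

// Equivalent of UNNEST(langs) + GROUP BY lang + COUNT(0).
const counts = new Map();
for (const items of sampleItems) {
  for (const lang of getViolations(items)) {
    counts.set(lang, (counts.get(lang) || 0) + 1);
  }
}

// Equivalent of ORDER BY volume DESC.
const rows = [...counts.entries()].sort((a, b) => b[1] - a[1]);
console.log(rows);
```

In this sample the first page contributes two occurrences of the same value on its own, so the volume column counts occurrences rather than pages.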
It’s important to note upfront that a single page could include many incorrect language attributes and single-handedly skew these results. That said, there are a few interesting kinds of invalid values. Let’s break them down:
Garbage: There are nonsense values like join-count, menu_list, #0066cc, a1, 1, 2, and 3 that are obviously not valid languages. It’s as if a templating system mistakenly added these values to the wrong attribute.
Null: Some values seem to attempt to convey that there is no language, e.g. X-NONE and none. Instead of supplying an invalid value, the attribute could have been omitted entirely.
Incorrect value: Some values use the wrong language codes. For example, jp is presumably meant to be Japanese, but according to the Language Subtag Registry (LSR) the correct value is actually ja. The number one invalid language value is cz. Looking at the URLs corresponding to the Lighthouse reports, most of them have the .cz TLD, which is used for the Czech Republic. Consulting the LSR, the correct subtag for Czech is cs.
Incorrect syntax: Many values provide a correct language and region, but they use underscores to delimit. The correct syntax is to delimit with hyphens. For example, the second most frequently used invalid value is zh_CN, which is close to the tag for the Chinese language and region of China. While changing it to zh-CN would make it valid, including the region may be redundant:
The golden rule when creating language tags is to keep the tag as short as possible. Avoid region, script or other subtags except where they add useful distinguishing information. For instance, use ja for Japanese and not ja-JP, unless there is a particular reason that you need to say that this is Japanese as spoken in Japan, rather than elsewhere.
Language tags in HTML and XML
So to summarize, the overwhelming majority of the web uses the lang attribute correctly. The few cases of invalid attributes could be explained as misinterpretations of the language tag standards. So remember to:
- Use the correct language symbol. ja is Japanese and cs is Czech.
- Use hyphens to delimit language and region codes, not underscores.
- Omit the attribute to indicate no particular language.
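The four kinds of invalid values and the checklist above can be sketched as a small classifier. This is a heuristic illustration only, not how Lighthouse validates tags: the function name classifyLang, the KNOWN_FIXES table (which holds just the two corrections named in this post), the NULLISH set and the simplified language-plus-optional-region pattern are all assumptions made for the example.

```javascript
// The two corrections mentioned in the text; a real tool would consult the
// Language Subtag Registry instead of a hard-coded table.
const KNOWN_FIXES = new Map([["cz", "cs"], ["jp", "ja"]]);

// Values that try to say "no language".
const NULLISH = new Set(["none", "x-none"]);

// Simplified shape: 2-3 letter language, optional 2-letter region, hyphenated.
const SHAPE = /^[a-z]{2,3}(-[a-z]{2})?$/i;

function classifyLang(value) {
  const v = value.trim();
  const lower = v.toLowerCase();

  if (NULLISH.has(lower)) {
    // Rule: omit the attribute to indicate no particular language.
    return { kind: "null", suggestion: null };
  }
  if (KNOWN_FIXES.has(lower)) {
    // Rule: use the correct language symbol.
    return { kind: "incorrect value", suggestion: KNOWN_FIXES.get(lower) };
  }
  if (v.includes("_")) {
    // Rule: delimit with hyphens, not underscores.
    const hyphenated = v.replace(/_/g, "-");
    if (SHAPE.test(hyphenated)) {
      return { kind: "incorrect syntax", suggestion: hyphenated };
    }
  }
  if (SHAPE.test(v)) {
    // Looks like a tag, but only a registry lookup can confirm it.
    return { kind: "well-formed (unverified)", suggestion: v };
  }
  return { kind: "garbage", suggestion: null };
}

for (const sample of ["cz", "zh_CN", "X-NONE", "join-count", "#0066cc"]) {
  console.log(sample, classifyLang(sample));
}
```

Values such as gr or ge pass the shape check even though they appear in the list of invalid values, which is why a lookup against the registry is still needed to decide whether a well-formed tag is actually valid.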