What are the invalid uses of the lang attribute?


#1

The lang attribute is used to describe the language of the child content. For example, <html lang="en"> indicates that the content of the HTML document is in English.

Lighthouse includes an audit that validates this attribute, so we can write a query to see how the results are distributed:

SELECT
  JSON_EXTRACT_SCALAR(report, "$.audits.valid-lang.score") AS score,
  COUNT(0) AS volume
FROM
  [httparchive:har.latest_lighthouse_mobile]
WHERE
  report IS NOT NULL
GROUP BY
  score
HAVING
  score IS NOT NULL
ORDER BY
  score

Results:

Row	score	volume	 
1	false	424	 
2	true	424917	 

This means that 99.9% of websites (424,917 of the 425,341 pages with a score) are using the lang attribute correctly. That’s surprisingly good. Maybe too good to be true? In any case, let’s dive deeper and try to learn more about how the attribute is being used incorrectly.
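As a quick sanity check on that percentage, the arithmetic can be reproduced from the two rows of the results table. This is only a sketch; the variable names are made up for illustration.

```javascript
// Counts copied from the results table above.
const failing = 424;
const passing = 424917;

// Share of scored pages whose valid-lang audit passed.
const total = passing + failing;
const passRate = (passing / total) * 100;

console.log(`${passRate.toFixed(1)}% of ${total} pages pass the audit`);
```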

Since there may be multiple instances of the lang attribute on a page, the invalid occurrences are in an array. The BigQuery JSON_EXTRACT function doesn’t yet play nice with plucking data out of arrays, so we can work around that by writing a User-Defined Function (UDF) that does the manipulation directly in JavaScript. Here’s the query:

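CREATE TEMPORARY FUNCTION getViolations(items STRING)
RETURNS ARRAY<STRING>
LANGUAGE js AS """
  try {
    // With the g flag, match() returns every full match (e.g. lang="cz"),
    // or null when nothing matches, so fall back to an empty array.
    return items.match(/lang="([^"]*)"/ig) || [];
  } catch (e) {
    // items is null, so there are no audit details to inspect.
    return [];
  }
""";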

SELECT
  COUNT(0) AS volume,
  lang
FROM (
  SELECT
    getViolations(JSON_EXTRACT(report, "$.audits.valid-lang.details.items")) AS langs
  FROM
    `httparchive.har.2017_06_15_android_lighthouse`)
CROSS JOIN
  UNNEST(langs) AS lang
GROUP BY
  lang
ORDER BY
  volume DESC
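To see what the UDF and the surrounding query do, here is a rough standalone JavaScript sketch of the same steps. The input strings are hypothetical, not real HTTP Archive data, and they assume the extracted items contain HTML snippets with plain double quotes, as the results suggest.

```javascript
// Same regular expression as the UDF above; returns [] when the
// input is null or contains no lang attribute.
function getViolations(items) {
  try {
    return items.match(/lang="([^"]*)"/ig) || [];
  } catch (e) {
    return [];
  }
}

// Hypothetical inputs, one string per page (not real data).
const pages = [
  '<html lang="cz"> <span lang="zh_CN">',
  '<html lang="cz">',
  null,
];

// Flatten the per-page arrays (UNNEST) and tally each value (GROUP BY).
const counts = new Map();
for (const lang of pages.flatMap(getViolations)) {
  counts.set(lang, (counts.get(lang) || 0) + 1);
}

// Sort by volume, highest first (ORDER BY volume DESC).
const rows = [...counts].sort((a, b) => b[1] - a[1]);
console.log(rows);
```

Because the regular expression uses the g flag, match() returns the full matched text rather than the capture group, which is why each result row includes the lang="..." wrapper.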

This counts the occurrences of each distinct lang value and sorts them by frequency. There are 837 results, but here are the top 25:

Row	volume	lang	 
1	123	lang="cz"	 
2	107	lang="zh_CN"	 
3	54	lang="zh_TW"	 
4	52	lang="join-count"	 
5	43	lang="0"	 
6	42	lang="a1"	 
7	42	lang="pt_PT"	 
8	39	lang="en_US"	 
9	28	lang="gr"	 
10	27	lang="ge"	 
11	27	lang="x-normage"	 
12	24	lang="main"	 
13	24	lang="X-NONE"	 
14	22	lang="jp"	 
15	22	lang="none"	 
16	21	lang="2"	 
17	20	lang="cn"	 
18	20	lang="KZ"	 
19	18	lang="en_GB"	 
20	18	lang="1"	 
21	15	lang="#0066cc"	 
22	13	lang="menu_list"	 
23	13	lang="english"	 
24	12	lang="3"	 
25	11	lang="\\'EN-US\\'"	 

It’s important to note upfront that a single page could include many incorrect language attributes and single-handedly skew these results. That said, there are a few interesting kinds of invalid values. Let’s break them down:

Garbage: There are nonsense values like join-count, menu_list, #0066cc, a1, 1, 2, and 3 that are obviously not valid languages. It’s as if a templating system mistakenly added these values to the wrong attribute.

Null: Some values seem to attempt to convey that there is no language, e.g. X-NONE and none. Instead of supplying an invalid value, the attribute could have been omitted entirely.

Incorrect value: Some values use the wrong language codes. For example, jp is presumably meant to be Japanese, but according to the Language Subtag Registry (LSR) the correct value is actually ja. The number one invalid language value is cz. Looking at the URLs corresponding to the Lighthouse reports, most of them have the .cz TLD, which is used for the Czech Republic. Consulting the LSR, the correct subtag for Czech is cs.

Incorrect syntax: Many values provide a correct language and region, but they use underscores to delimit. The correct syntax is to delimit with hyphens. For example, the second most frequently used invalid value is zh_CN, which is close to the tag for the Chinese language and region of China. While changing it to zh-CN would make it valid, including the region may be redundant:

The golden rule when creating language tags is to keep the tag as short as possible. Avoid region, script or other subtags except where they add useful distinguishing information. For instance, use ja for Japanese and not ja-JP, unless there is a particular reason that you need to say that this is Japanese as spoken in Japan, rather than elsewhere.
https://www.w3.org/International/articles/language-tags/
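The underscore problem is easy to check for in JavaScript. The sketch below leans on the built-in Intl.getCanonicalLocales, which throws a RangeError for tags that are not well formed. The helper name canonicalTag is made up for illustration, and the check is syntactic only: a well-formed but wrong code such as jp is not guaranteed to be flagged.

```javascript
// Returns the canonical form of a well-formed language tag,
// or null when the tag is not well formed.
function canonicalTag(tag) {
  try {
    return Intl.getCanonicalLocales(tag)[0];
  } catch (e) {
    return null; // RangeError: not a well-formed language tag
  }
}

console.log(canonicalTag('zh_CN')); // underscore delimiter is rejected
console.log(canonicalTag('zh-cn')); // hyphen delimiter is accepted and case-normalized
```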


So to summarize, the overwhelming majority of the web uses the lang attribute correctly. The few cases of invalid attributes could be explained as misinterpretations of the language tag standards. So remember to:

  • Use the correct language subtag. ja is Japanese and cs is Czech.
  • Use hyphens to delimit language and region codes, not underscores.
  • Omit the attribute to indicate no particular language.
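Those three rules could be applied mechanically along the following lines. This is a minimal sketch, not a validator: the function name suggestLang and both lookup tables are hypothetical, the corrections cover only the two codes discussed above, and garbage values like join-count are not detected.

```javascript
// Corrections named in the post. Extend with care, since other
// two-letter codes may be ambiguous.
const CORRECTIONS = new Map([['jp', 'ja'], ['cz', 'cs']]);

// Placeholder values that try to say "no language".
const NO_LANGUAGE = new Set(['none', 'x-none']);

// Returns a suggested replacement value, or null when the
// attribute should simply be omitted.
function suggestLang(value) {
  // Rule 2: delimit subtags with hyphens, not underscores.
  const tag = value.trim().replace(/_/g, '-');

  // Rule 3: omit the attribute instead of using a placeholder.
  if (NO_LANGUAGE.has(tag.toLowerCase())) return null;

  // Rule 1: swap a known-wrong primary subtag for the correct one.
  const [language, ...rest] = tag.split('-');
  const fixed = CORRECTIONS.get(language.toLowerCase()) || language;
  return [fixed, ...rest].join('-');
}

for (const value of ['cz', 'zh_CN', 'jp', 'X-NONE', 'en_GB']) {
  console.log(value, '->', suggestLang(value));
}
```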

#2

Keep in mind that language can influence text rendering. Specifically, I believe that people label content with zh-CN and zh-TW to get Simplified and Traditional Chinese, respectively. There are better, non-geographic alternatives (zh-Hans and zh-Hant), but I wouldn’t be surprised if people preferred the geographic ones as more mnemonic.
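The link between the geographic tags and the script subtags can be seen with the built-in Intl.Locale API, which can expand a tag with its most likely script. This is only a sketch: the helper name likelyScript is made up, it assumes a runtime with Intl.Locale support (recent Node.js or browsers), and the result depends on the locale data that runtime ships.

```javascript
// Expand a tag with its most likely subtags and return the script.
function likelyScript(tag) {
  return new Intl.Locale(tag).maximize().script;
}

console.log(likelyScript('zh-CN')); // Hans (Simplified)
console.log(likelyScript('zh-TW')); // Hant (Traditional)
```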


#3

Great point, thanks @robin