What are the invalid uses of the lang attribute?

The lang attribute is used to describe the language of the child content. For example, <html lang="en"> indicates that the content of the HTML document is in English.

Lighthouse includes an audit (valid-lang) that validates this attribute, so we can write a query to see how the results are distributed:

SELECT
  JSON_EXTRACT_SCALAR(report, "$.audits.valid-lang.score") AS score,
  COUNT(0) AS volume
FROM
  [httparchive:har.latest_lighthouse_mobile]
WHERE
  report IS NOT NULL
GROUP BY
  score
HAVING
  score IS NOT NULL
ORDER BY
  score

Results:

Row	score	volume	 
1	false	424	 
2	true	424917	 

This means that 99.9% of the pages with a score are using the lang attribute correctly. That’s surprisingly good. Maybe too good to be true? In any case, let’s dive deeper and try to learn more about how the attribute is being used incorrectly.
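
As a quick check of that figure, the pass rate can be recomputed from the two counts in the results table. This is only a minimal sketch in Python; the variable names are illustrative, and it covers just the pages that produced a score.

```python
# Counts copied from the results table above.
failing = 424
passing = 424917

# Share of scored pages that pass the valid-lang audit.
pass_rate = passing / (passing + failing)
print(f"{pass_rate:.2%}")
```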

Since there may be multiple instances of the lang attribute on a page, the invalid occurrences are in an array. The BigQuery JSON_EXTRACT function doesn’t yet play nice with plucking data out of arrays, so we can work around that by writing a User-Defined Function (UDF) that does the manipulation directly in JavaScript. Here’s the query:

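#standardSQL
CREATE TEMPORARY FUNCTION getViolations(items STRING)
RETURNS ARRAY<STRING>
LANGUAGE js AS """
  // Return every lang="..." occurrence found in the serialized audit items.
  try {
    // match() returns null when nothing matches, so fall back to an empty array.
    return items.match(/lang="([^"]*)"/ig) || [];
  } catch (e) {
    // items may be null, in which case calling match() throws.
    return [];
  }
""";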

SELECT
  COUNT(0) AS volume,
  lang
FROM (
  SELECT
    getViolations(JSON_EXTRACT(report, "$.audits.valid-lang.details.items")) AS langs
  FROM
    `httparchive.har.2017_06_15_android_lighthouse`)
CROSS JOIN
  UNNEST(langs) AS lang
GROUP BY
  lang
ORDER BY
  volume DESC

This counts the occurrences of each distinct invalid lang attribute and sorts them by frequency. There are 837 results, but here are the top 25:

Row	volume	lang	 
1	123	lang="cz"	 
2	107	lang="zh_CN"	 
3	54	lang="zh_TW"	 
4	52	lang="join-count"	 
5	43	lang="0"	 
6	42	lang="a1"	 
7	42	lang="pt_PT"	 
8	39	lang="en_US"	 
9	28	lang="gr"	 
10	27	lang="ge"	 
11	27	lang="x-normage"	 
12	24	lang="main"	 
13	24	lang="X-NONE"	 
14	22	lang="jp"	 
15	22	lang="none"	 
16	21	lang="2"	 
17	20	lang="cn"	 
18	20	lang="KZ"	 
19	18	lang="en_GB"	 
20	18	lang="1"	 
21	15	lang="#0066cc"	 
22	13	lang="menu_list"	 
23	13	lang="english"	 
24	12	lang="3"	 
25	11	lang="\\'EN-US\\'"	 
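
To make the extraction step easier to follow, here is a rough Python equivalent of what the getViolations UDF does, followed by the same kind of count the GROUP BY performs. The sample strings are hypothetical stand-ins; the real serialized Lighthouse items may be structured differently.

```python
import re
from collections import Counter

# Same pattern as the UDF: every lang="..." occurrence, case-insensitive.
LANG_ATTR = re.compile(r'lang="([^"]*)"', re.IGNORECASE)

def get_violations(items):
    """Return each full lang="..." match in the serialized items, or []."""
    if not items:
        return []
    return [m.group(0) for m in LANG_ATTR.finditer(items)]

# Hypothetical items payloads, one per page.
sample_items = '<html lang="cz"> <p lang="zh_CN">'
pages = [sample_items, '<html lang="cz">', None]

# Count each distinct value across all pages, like the GROUP BY in the query.
counts = Counter(v for page in pages for v in get_violations(page))
print(counts.most_common())
```

Note that one page can contribute several values, which is the skew mentioned below.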

It’s important to note upfront that a single page could include many incorrect language attributes and single-handedly skew these results. That said, there are a few interesting kinds of invalid values. Let’s break them down:

Garbage: There are nonsense values like join-count, menu_list, #0066cc, a1, 1, 2, and 3 that are obviously not valid languages. It’s as if a templating system mistakenly added these values to the wrong attribute.

Null: Some values seem to attempt to convey that there is no language, e.g. X-NONE and none. Instead of supplying an invalid value, the attribute could have been omitted entirely.

Incorrect value: Some values use the wrong language codes. For example, jp is presumably meant to be Japanese, but according to the Language Subtag Registry (LSR) the correct value is actually ja. The number one invalid language value is cz. Looking at the URLs corresponding to the Lighthouse reports, most of them have the .cz TLD, which is used for the Czech Republic. According to the LSR, the correct subtag for Czech is cs.

Incorrect syntax: Many values provide a correct language and region but delimit them with underscores. The correct delimiter is a hyphen. For example, the second most frequently used invalid value is zh_CN, which is close to the tag for the Chinese language and the region of China. Changing it to zh-CN would make it valid, but including the region may be redundant:

"The golden rule when creating language tags is to keep the tag as short as possible. Avoid region, script or other subtags except where they add useful distinguishing information. For instance, use ja for Japanese and not ja-JP, unless there is a particular reason that you need to say that this is Japanese as spoken in Japan, rather than elsewhere."
(from "Language tags in HTML and XML")

So to summarize, the overwhelming majority of the web uses the lang attribute correctly. The few cases of invalid attributes could be explained as misinterpretations of the language tag standards. So remember to:

  • Use the correct language subtag. ja is Japanese and cs is Czech.
  • Use hyphens to delimit language and region codes, not underscores.
  • Omit the attribute to indicate no particular language.
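
The three reminders above could be sketched as a small helper. This is only an illustration under stated assumptions: suggest_lang, KNOWN_FIXES, and NULL_VALUES are hypothetical names, the correction table holds just the two fixes named in this post, and nothing here validates against the Language Subtag Registry.

```python
# Corrections named in this post; not an exhaustive list.
KNOWN_FIXES = {"jp": "ja", "cz": "cs"}
# Values that try to say "no language"; the attribute should be omitted instead.
NULL_VALUES = {"none", "x-none"}

def suggest_lang(value):
    """Return a suggested lang value, or None to suggest omitting the attribute."""
    # Hyphens, not underscores, delimit language and region.
    cleaned = value.strip().replace("_", "-")
    if cleaned.lower() in NULL_VALUES:
        return None
    # Fix the primary language subtag and keep any region as written.
    primary, sep, rest = cleaned.partition("-")
    primary = KNOWN_FIXES.get(primary.lower(), primary.lower())
    return primary + sep + rest

for raw in ["zh_CN", "jp", "cz", "X-NONE", "en_US"]:
    print(raw, "->", suggest_lang(raw))
```

Garbage values such as join-count pass through unchanged, so a real check would still need the registry.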

Keep in mind that language can influence text rendering. Specifically, I believe that people label content with zh-CN and zh-TW to get respectively simplified and traditional Chinese. There are better, non-geographic alternatives (zh-Hans and zh-Hant) but I wouldn’t be surprised if people preferred the geographic ones as more mnemonic.

Great point, thanks @robin!

Ran a similar query today to find out which 2-char lang values are used on pages in Indonesia:

SELECT
  APPROX_TOP_COUNT(LOWER(REGEXP_EXTRACT(body, '(?i)<html[^>]*lang=[\'"]?([a-z]{2})')), 10) AS lang
FROM
  `httparchive.response_bodies.2019_02_01_desktop`
JOIN
  `chrome-ux-report.country_id.201901`
ON
  CONCAT(origin, '/') = page

Besides the null values (pages with no lang attrs or response bodies for resources that aren’t even HTML), the top language is English with 167,191 values. Next is Indonesian with 20,142. Runners-up: Japanese, Korean, Chinese, Russian, Spanish, French, and German.

lang	count
null	6028012
en	167191
id	20142
ja	1844
ko	1170
zh	861
ru	823
es	540
fr	480
de	449
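
For anyone who wants to try the pattern before spending query bytes, here is a rough Python port of the REGEXP_EXTRACT expression. The sample bodies are made up for illustration, and Python’s re module stands in for BigQuery’s regex engine, so treat it as a sketch rather than an exact reproduction.

```python
import re
from collections import Counter

# Port of the pattern in the query: the first two letters of the lang
# value on the <html> tag, matched case-insensitively.
HTML_LANG = re.compile(r'<html[^>]*lang=[\'"]?([a-z]{2})', re.IGNORECASE)

def html_lang(body):
    """Return the lowercased 2-letter prefix of the <html> lang value, or None."""
    match = HTML_LANG.search(body or "")
    return match.group(1).lower() if match else None

# Hypothetical response bodies.
bodies = [
    '<html lang="en-US">',
    "<HTML LANG='id'>",
    '<html class="no-js">',  # no lang attribute
    '{"not": "html"}',       # not HTML at all
]
print(Counter(html_lang(b) for b in bodies).most_common())
```

Because the pattern only looks for lang= somewhere inside the html tag, it also picks up attributes that merely end in lang, such as xml:lang.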

Top 50 lang values over all pages from the 2019_04_01_desktop dataset:

#standardSQL
# WARNING! This query consumes 6.2 TB!
SELECT
  APPROX_TOP_COUNT(LOWER(REGEXP_EXTRACT(body, '(?i)<html[^>]*lang=[\'"]?([a-z]{2})')), 50) AS lang
FROM
  `httparchive.response_bodies.2019_04_01_desktop`
WHERE
  page = url

Language	Lang	Count	Percent
English	en	1571474	39.82%
-	null	1095890	27.77%
Japanese	ja	188711	4.78%
Spanish	es	160215	4.06%
Russian	ru	159318	4.04%
French	fr	112883	2.86%
Portuguese	pt	107759	2.73%
German	de	89616	2.27%
Dutch	nl	58122	1.47%
Italian	it	57720	1.46%
Polish	pl	51615	1.31%
Korean	ko	41660	1.06%
Chinese	zh	41632	1.05%
Turkish	tr	35342	0.90%
Czech	cs	26374	0.67%
Hungarian	hu	20690	0.52%
Swedish	sv	19802	0.50%
Vietnamese	vi	16591	0.42%
Danish	da	15176	0.38%
Romanian	ro	14313	0.36%
Greek	el	12478	0.32%
Hebrew	he	11886	0.30%
Thai	th	10843	0.27%
Slovak	sk	9838	0.25%
Arabic	ar	9682	0.25%
Finnish	fi	9658	0.24%
Ukrainian	uk	7696	0.20%
Bulgarian	bg	7676	0.19%
Persian	fa	6726	0.17%
Indonesian	id	6281	0.16%
Norwegian Bokmål	nb	5874	0.15%
Lithuanian	lt	5232	0.13%
Croatian	hr	4260	0.11%
Norwegian	no	4045	0.10%
Serbian	sr	3921	0.10%
Slovenian	sl	3546	0.09%
Catalan	ca	3421	0.09%
Estonian	et	3269	0.08%
Latvian	lv	2326	0.06%
Icelandic	is	1122	0.03%
-	jp	1018	0.03%
-	us	937	0.02%
-	ua	923	0.02%
-	zx	868	0.02%
Bosnian	bs	797	0.02%
-	cz	755	0.02%
Georgian	ka	710	0.02%
Breton	br	668	0.02%
Malay	ms	574	0.01%
-	eu	553	0.01%