What are the invalid uses of the lang attribute?

rviscomi · July 11, 2017, 9:22pm

The lang attribute is used to describe the language of the child content. For example, <html lang="en"> indicates that the content of the HTML document is in English.

Lighthouse includes an audit, valid-lang, that validates this attribute, so we can write a query to see how the results are distributed:

SELECT
  JSON_EXTRACT_SCALAR(report, "$.audits.valid-lang.score") AS score,
  COUNT(0) AS volume
FROM
  [httparchive:har.latest_lighthouse_mobile]
WHERE
  report IS NOT NULL
GROUP BY
  score
HAVING
  score IS NOT NULL
ORDER BY
  score

Results:

Row	score	volume	 
1	false	424	 
2	true	424917

This means that 99.9% of websites (424,917 of 425,341) pass the audit and are using the lang attribute correctly. That’s surprisingly good. Maybe too good to be true? In any case, let’s dive deeper and try to learn more about how the attribute is being used incorrectly.

Since there may be multiple instances of the lang attribute on a page, the invalid occurrences are reported in an array. BigQuery’s JSON_EXTRACT function doesn’t yet play nicely with plucking data out of arrays, so we can work around that by writing a user-defined function (UDF) that does the manipulation directly in JavaScript. Here’s the query:

CREATE TEMPORARY FUNCTION getViolations(items STRING)
RETURNS ARRAY<STRING>
LANGUAGE js AS """
  try {
    return items.match(/lang="([^"]*)"/ig);
  } catch (e) {
    return [];
  }
""";

SELECT
  COUNT(0) AS volume,
  lang
FROM (
  SELECT
    getViolations(JSON_EXTRACT(report, "$.audits.valid-lang.details.items")) AS langs
  FROM
    `httparchive.har.2017_06_15_android_lighthouse`)
CROSS JOIN
  UNNEST(langs) AS lang
GROUP BY
  lang
ORDER BY
  volume DESC

This counts the occurrences of each distinct invalid lang attribute and sorts them by frequency. There are 837 results, but here are the top 25:

Row	volume	lang	 
1	123	lang="cz"	 
2	107	lang="zh_CN"	 
3	54	lang="zh_TW"	 
4	52	lang="join-count"	 
5	43	lang="0"	 
6	42	lang="a1"	 
7	42	lang="pt_PT"	 
8	39	lang="en_US"	 
9	28	lang="gr"	 
10	27	lang="ge"	 
11	27	lang="x-normage"	 
12	24	lang="main"	 
13	24	lang="X-NONE"	 
14	22	lang="jp"	 
15	22	lang="none"	 
16	21	lang="2"	 
17	20	lang="cn"	 
18	20	lang="KZ"	 
19	18	lang="en_GB"	 
20	18	lang="1"	 
21	15	lang="#0066cc"	 
22	13	lang="menu_list"	 
23	13	lang="english"	 
24	12	lang="3"	 
25	11	lang="\\'EN-US\\'"
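
To see why every row above includes the lang= prefix rather than just the value, the UDF body can be run locally. This is a minimal sketch: the function body is copied from the query, but the sample input is a made-up, simplified string and not the real shape of the Lighthouse details.items payload.

```javascript
// Same body as the getViolations UDF, wrapped as a plain function for local testing.
function getViolations(items) {
  try {
    return items.match(/lang="([^"]*)"/ig);
  } catch (e) {
    return [];
  }
}

// Hypothetical, simplified input for illustration only.
const sample = '<html lang="cz"><p lang="zh_CN">';

console.log(getViolations(sample)); // [ 'lang="cz"', 'lang="zh_CN"' ]
console.log(getViolations(null));   // [] (null.match throws, so the catch block runs)
```

Because of the g flag, the capture group is ignored and each result is the whole lang="..." match. A string with no match yields null rather than an empty array; a null input throws and falls through to the catch block.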

It’s important to note upfront that a single page could include many incorrect language attributes and single-handedly skew these results. That said, there are a few interesting kinds of invalid values. Let’s break them down:

Garbage: There are nonsense values like join-count, menu_list, #0066cc, a1, 1, 2, and 3 that are obviously not valid languages. It’s as if a templating system mistakenly added these values to the wrong attribute.

Null: Some values seem to attempt to convey that there is no language, e.g. X-NONE and none. Instead of supplying an invalid value, the attribute could have been omitted entirely.

Incorrect value: Some values use the wrong language codes. For example, jp is assumed to be the Japanese language, but according to the Language Subtag Registry (LSR) the correct value is actually ja. The number one invalid language value is cz. Looking at the URLs corresponding to those Lighthouse reports, most of them have the .cz TLD, which is used for the Czech Republic. Consulting the LSR, the correct subtag for Czech is cs.

Incorrect syntax: Many values provide a correct language and region, but they use underscores to delimit. The correct syntax is to delimit with hyphens. For example, the second most frequently used invalid value is zh_CN, which is close to the tag for the Chinese language and region of China. While it could be changed to zh-CN and still be valid, including the region may be redundant:

“The golden rule when creating language tags is to keep the tag as short as possible. Avoid region, script or other subtags except where they add useful distinguishing information. For instance, use ja for Japanese and not ja-JP, unless there is a particular reason that you need to say that this is Japanese as spoken in Japan, rather than elsewhere.”
Source: Language tags in HTML and XML

So to summarize, the overwhelming majority of the web uses the lang attribute correctly. The few cases of invalid attributes could be explained as misinterpretations of the language tag standards. So remember to:

Use the correct language subtag. ja is Japanese and cs is Czech.
Use hyphens to delimit language and region codes, not underscores.
Omit the attribute to indicate no particular language.
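
The three rules above can be sketched as a small helper. This is only an illustration: suggestLang, KNOWN_FIXES and NO_LANGUAGE are hypothetical names, the lookup table holds just the two corrections mentioned in this post (jp and cz), and a real checker would validate against the full Language Subtag Registry. Garbage values such as join-count are not detected here.

```javascript
// Corrections named in this post only; not a complete list of wrong codes.
const KNOWN_FIXES = new Map([['jp', 'ja'], ['cz', 'cs']]);

// Placeholder values from the results that mean "no language".
const NO_LANGUAGE = new Set(['none', 'x-none']);

// Returns a suggested lang value, or null when the attribute should be omitted.
function suggestLang(value) {
  const trimmed = value.trim();
  if (NO_LANGUAGE.has(trimmed.toLowerCase())) {
    return null;
  }
  // Underscore delimiters become hyphens, e.g. zh_CN becomes zh-CN.
  const parts = trimmed.replace(/_/g, '-').split('-');
  const primary = parts[0].toLowerCase();
  // Swap the primary subtag if it is one of the known wrong codes.
  parts[0] = KNOWN_FIXES.get(primary) || primary;
  return parts.join('-');
}

console.log(suggestLang('zh_CN'));  // zh-CN
console.log(suggestLang('jp'));     // ja
console.log(suggestLang('X-NONE')); // null
```

The helper keeps the region subtag when one is present; whether to drop it is a judgment call, as the quoted guidance on keeping tags short suggests.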

robin · July 11, 2017, 9:51pm

Keep in mind that language can influence text rendering. Specifically, I believe that people label content with zh-CN and zh-TW to get respectively simplified and traditional Chinese. There are better, non-geographic alternatives (zh-Hans and zh-Hant) but I wouldn’t be surprised if people preferred the geographic ones as more mnemonic.

rviscomi · July 11, 2017, 9:53pm

Great point, thanks @robin

rviscomi · February 26, 2019, 9:37pm

Ran a similar query today to find out which 2-char lang values are used on pages in Indonesia:

SELECT
  APPROX_TOP_COUNT(LOWER(REGEXP_EXTRACT(body, '(?i)<html[^>]*lang=[\'"]?([a-z]{2})')), 10) AS lang
FROM
  `httparchive.response_bodies.2019_02_01_desktop`
JOIN
  `chrome-ux-report.country_id.201901`
ON
  CONCAT(origin, '/') = page

Besides the null values (pages with no lang attrs or response bodies for resources that aren’t even HTML) the top language is English with 167,191 values. Next is Indonesian with 20,142. Runners-up: Japanese, Korean, Chinese, Russian, Spanish, French, and German.

Lang	Count
null	6028012
en	167191
id	20142
ja	1844
ko	1170
zh	861
ru	823
es	540
fr	480
de	449
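
As a rough local check of what the regular expression in the query above captures, here is a JavaScript approximation. BigQuery uses RE2, so the inline (?i) flag is replaced here by JavaScript's i flag; extractHtmlLang is a hypothetical helper name and the sample markup is invented.

```javascript
// Approximates LOWER(REGEXP_EXTRACT(body, '(?i)<html[^>]*lang=[\'"]?([a-z]{2})')).
// Returns the first two letters of the lang value on the <html> tag, or null.
function extractHtmlLang(body) {
  const match = /<html[^>]*lang=['"]?([a-z]{2})/i.exec(body);
  return match ? match[1].toLowerCase() : null;
}

console.log(extractHtmlLang('<html lang="en-US">'));   // en
console.log(extractHtmlLang('<HTML LANG=ID>'));        // id
console.log(extractHtmlLang('<html lang="english">')); // en
console.log(extractHtmlLang('<html><body>'));          // null
```

Note that the pattern keeps only the first two letters, so lang="english" and lang="en-US" are both counted as en, and invalid two-letter codes such as jp or cz are captured as-is.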

rviscomi · May 29, 2019, 10:33pm

Top 50 lang values over all pages from the 2019_04_01_desktop dataset:

#standardSQL
# WARNING! This query consumes 6.2 TB!
SELECT
  APPROX_TOP_COUNT(LOWER(REGEXP_EXTRACT(body, '(?i)<html[^>]*lang=[\'"]?([a-z]{2})')), 50) AS lang
FROM
  `httparchive.response_bodies.2019_04_01_desktop`
WHERE
  page = url

Language	Lang	Count	Percent
English	en	1571474	39.82%
		1095890	27.77%
Japanese	ja	188711	4.78%
Spanish	es	160215	4.06%
Russian	ru	159318	4.04%
French	fr	112883	2.86%
Portuguese	pt	107759	2.73%
German	de	89616	2.27%
Dutch	nl	58122	1.47%
Italian	it	57720	1.46%
Polish	pl	51615	1.31%
Korean	ko	41660	1.06%
Chinese	zh	41632	1.05%
Turkish	tr	35342	0.90%
Czech	cs	26374	0.67%
Hungarian	hu	20690	0.52%
Swedish	sv	19802	0.50%
Vietnamese	vi	16591	0.42%
Danish	da	15176	0.38%
Romanian	ro	14313	0.36%
Greek	el	12478	0.32%
Hebrew	he	11886	0.30%
Thai	th	10843	0.27%
Slovak	sk	9838	0.25%
Arabic	ar	9682	0.25%
Finnish	fi	9658	0.24%
Ukrainian	uk	7696	0.20%
Bulgarian	bg	7676	0.19%
Persian	fa	6726	0.17%
Indonesian	id	6281	0.16%
Norwegian Bokmål	nb	5874	0.15%
Lithuanian	lt	5232	0.13%
Croatian	hr	4260	0.11%
Norwegian	no	4045	0.10%
Serbian	sr	3921	0.10%
Slovenian	sl	3546	0.09%
Catalan	ca	3421	0.09%
Estonian	et	3269	0.08%
Latvian	lv	2326	0.06%
Icelandic	is	1122	0.03%
	jp	1018	0.03%
	us	937	0.02%
	ua	923	0.02%
	zx	868	0.02%
Bosnian	bs	797	0.02%
	cz	755	0.02%
Georgian	ka	710	0.02%
Breton	br	668	0.02%
Malay	ms	574	0.01%
	eu	553	0.01%
