What are the invalid uses of the lang attribute?


The lang attribute is used to describe the language of the child content. For example, <html lang="en"> indicates that the content of the HTML document is in English.

Lighthouse includes an audit that validates this attribute, so we can write a query to see how the results are distributed:

SELECT
  JSON_EXTRACT_SCALAR(report, "$.audits.valid-lang.score") AS score,
  COUNT(0) AS volume
FROM ...
WHERE report IS NOT NULL
GROUP BY score
HAVING score IS NOT NULL


Row	score	volume	 
1	false	424	 
2	true	424917	 
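
As a quick sanity check, the pass rate can be worked out from the two rows above. This is only a sketch; the variable names are made up for illustration.

```javascript
// Counts copied from the result table above.
const failing = 424;
const passing = 424917;

// Share of audited pages whose valid-lang audit passed.
const passRate = passing / (passing + failing);

console.log(`${(passRate * 100).toFixed(1)}% pass`);
```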

This means that 99.9% of websites are using the lang attribute correctly. That’s surprisingly good. Maybe too good to be true? In any case, let’s dive deeper and try to learn more about how the attribute is being used incorrectly.

Since there may be multiple instances of the lang attribute on a page, the invalid occurrences are reported as an array. The BigQuery JSON_EXTRACT function doesn’t yet play nicely with plucking data out of arrays, so we can work around that with a User-Defined Function (UDF) that does the manipulation directly in JavaScript. Here’s the query:

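CREATE TEMPORARY FUNCTION getViolations(items STRING)
RETURNS ARRAY<STRING> LANGUAGE js AS """
  try {
    return items.match(/lang="([^"]*)"/ig);
  } catch (e) {
    return [];
  }
""";

SELECT
  COUNT(0) AS volume,
  lang
FROM (SELECT
    getViolations(JSON_EXTRACT(report, "$.audits.valid-lang.details.items")) AS langs
  FROM ...),
  UNNEST(langs) AS lang
GROUP BY lang
ORDER BY volume DESC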

This counts each distinct invalid lang value and sorts the values by frequency. There are 837 distinct values, but here are the top 25:

Row	volume	lang	 
1	123	lang="cz"	 
2	107	lang="zh_CN"	 
3	54	lang="zh_TW"	 
4	52	lang="join-count"	 
5	43	lang="0"	 
6	42	lang="a1"	 
7	42	lang="pt_PT"	 
8	39	lang="en_US"	 
9	28	lang="gr"	 
10	27	lang="ge"	 
11	27	lang="x-normage"	 
12	24	lang="main"	 
13	24	lang="X-NONE"	 
14	22	lang="jp"	 
15	22	lang="none"	 
16	21	lang="2"	 
17	20	lang="cn"	 
18	20	lang="KZ"	 
19	18	lang="en_GB"	 
20	18	lang="1"	 
21	15	lang="#0066cc"	 
22	13	lang="menu_list"	 
23	13	lang="english"	 
24	12	lang="3"	 
25	11	lang="\\'EN-US\\'"	 
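
To make the UDF step more concrete, here is a rough JavaScript sketch of what getViolations and the surrounding UNNEST/GROUP BY do. The input strings are made up for illustration and are not the real shape of the Lighthouse details.items JSON. The `|| []` guard and the `tally` helper are additions for this sketch, not part of the original query.

```javascript
// Same logic as the UDF body; the "|| []" guard is an addition, because
// String.prototype.match returns null when nothing matches.
function getViolations(items) {
  try {
    return items.match(/lang="([^"]*)"/ig) || [];
  } catch (e) {
    return []; // e.g. items is null because the audit had no details
  }
}

// Rough equivalent of UNNEST + GROUP BY lang + ORDER BY volume DESC.
function tally(pages) {
  const counts = new Map();
  for (const items of pages) {
    for (const lang of getViolations(items)) {
      counts.set(lang, (counts.get(lang) || 0) + 1);
    }
  }
  return [...counts.entries()].sort((a, b) => b[1] - a[1]);
}

// Made-up inputs, one string per page.
const pages = [
  '<html lang="cz"><p lang="zh_CN"><span lang="cz">',
  '<html lang="cz">',
  null,
];
console.log(tally(pages));
```

Two things fall out of this. With the g flag, match returns the whole matched text rather than the capture group, which is why the table lists lang="cz" and not just cz. And the first sample page contributes two of the three cz hits, so a single page really can inflate a value's count.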

It’s important to note upfront that a single page could include many incorrect language attributes and single-handedly skew these results. That said, there are a few interesting kinds of invalid values. Let’s break them down:

Garbage: There are nonsense values like join-count, menu_list, #0066cc, a1, 1, 2, and 3 that are obviously not valid languages. It’s as if a templating system mistakenly added these values to the wrong attribute.

Null: Some values seem to attempt to convey that there is no language, e.g., X-NONE and none. Instead of supplying an invalid value, the attribute could have been omitted entirely.

Incorrect value: Some values use the wrong language codes. For example, jp is presumably meant to be Japanese, but according to the Language Subtag Registry (LSR) the correct value is actually ja. The number one invalid language value is cz. Looking at the URLs corresponding to the Lighthouse reports, most of them have the .cz TLD, which is used for the Czech Republic. Consulting the LSR, the correct subtag for Czech is cs.

Incorrect syntax: Many values provide a correct language and region, but they use underscores to delimit. The correct syntax is to delimit with hyphens. For example, the second most frequently used invalid value is zh_CN, which is close to the tag for the Chinese language and region of China. While it could be changed to zh-CN to become valid, including the region may be redundant:

The golden rule when creating language tags is to keep the tag as short as possible. Avoid region, script or other subtags except where they add useful distinguishing information. For instance, use ja for Japanese and not ja-JP, unless there is a particular reason that you need to say that this is Japanese as spoken in Japan, rather than elsewhere.
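
One way to experiment with this rule is the standard Intl.Locale API, whose minimize() method drops subtags that the runtime's locale data considers implied. This is only an approximation of the advice quoted above, not a procedure the quote prescribes, and the `shorten` helper is a made-up name.

```javascript
// minimize() removes subtags that are implied by the rest of the tag,
// based on the likely-subtags data shipped with the JavaScript runtime.
const shorten = (tag) => new Intl.Locale(tag).minimize().toString();

console.log(shorten('ja-JP')); // ja
console.log(shorten('zh-TW')); // region is kept because it changes the meaning
```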

So to summarize, the overwhelming majority of the web uses the lang attribute correctly. The few cases of invalid attributes could be explained as misinterpretations of the language tag standards. So remember to:

  • Use the correct language subtag: ja is Japanese and cs is Czech.
  • Use hyphens to delimit language and region codes, not underscores.
  • Omit the attribute to indicate no particular language.
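
The three reminders above can be sketched as a small cleanup helper. Everything here is an assumption for illustration: `suggestLang`, `SUBTAG_FIXES`, and `NO_LANGUAGE` are hypothetical names, and the correction table contains only the two examples discussed in this post. Intl.getCanonicalLocales checks structure only, not the registry, so a well-formed but unregistered value such as english would still pass.

```javascript
// Corrections taken only from the examples discussed above; not a registry.
const SUBTAG_FIXES = new Map([['jp', 'ja'], ['cz', 'cs']]);
// Values that appear to mean "no language".
const NO_LANGUAGE = new Set(['none', 'x-none']);

// Returns a suggested tag, or null when the attribute is better omitted
// or the value cannot be repaired automatically.
function suggestLang(value) {
  const tag = value.trim().replace(/_/g, '-'); // underscores -> hyphens
  if (NO_LANGUAGE.has(tag.toLowerCase())) return null;
  const [primary, ...rest] = tag.split('-');
  const fixed = SUBTAG_FIXES.get(primary.toLowerCase()) || primary;
  try {
    // Throws RangeError for structurally invalid tags such as "#0066cc".
    return Intl.getCanonicalLocales([fixed, ...rest].join('-'))[0];
  } catch (e) {
    return null;
  }
}

for (const v of ['zh_CN', 'jp', 'cz', 'X-NONE', '#0066cc']) {
  console.log(v, '->', suggestLang(v));
}
```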

Keep in mind that language can influence text rendering. Specifically, I believe that people label content with zh-CN and zh-TW to get simplified and traditional Chinese, respectively. There are better, non-geographic alternatives (zh-Hans and zh-Hant), but I wouldn’t be surprised if people preferred the geographic ones as more mnemonic.
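
For anyone who wants to check which script a region-based tag implies, a minimal sketch using the standard Intl.Locale API follows. It relies on the likely-subtags data bundled with the JavaScript runtime, and `scriptOf` is a made-up helper name.

```javascript
// maximize() fills in the most likely script and region for a tag.
const scriptOf = (tag) => new Intl.Locale(tag).maximize().script;

console.log(scriptOf('zh-CN')); // Hans (simplified)
console.log(scriptOf('zh-TW')); // Hant (traditional)
```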


Great point, thanks @robin


Ran a similar query today to find out which 2-char lang values are used on pages in Indonesia:

SELECT
  APPROX_TOP_COUNT(LOWER(REGEXP_EXTRACT(body, '(?i)<html[^>]*lang=[\'"]?([a-z]{2})')), 10) AS lang
FROM ...
WHERE CONCAT(origin, '/') = page

Besides the null values (pages with no lang attrs or response bodies for resources that aren’t even HTML), the top language is English with 167,191 values. Next is Indonesian with 20,142. Runners-up: Japanese, Korean, Chinese, Russian, Spanish, French, and German.

lang	count
null	6028012
en	167191
id	20142
ja	1844
ko	1170
zh	861
ru	823
es	540
fr	480
de	449
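
To see what the REGEXP_EXTRACT pattern picks up, here is a rough JavaScript port of it. The `twoCharLang` helper and the sample bodies are made up for illustration; only the pattern itself comes from the query.

```javascript
// JavaScript port of the RE2 pattern from the query; (?i) becomes the /i flag.
const HTML_LANG = /<html[^>]*lang=['"]?([a-z]{2})/i;

// Returns the first two letters of the <html> lang value, lowercased,
// or null when the body has no matching <html ... lang=...> tag.
function twoCharLang(body) {
  const m = HTML_LANG.exec(body || '');
  return m ? m[1].toLowerCase() : null;
}

console.log(twoCharLang('<html lang="id">'));            // id
console.log(twoCharLang('<HTML class="x" LANG=EN-us>')); // en
console.log(twoCharLang('<html>'));                      // null
console.log(twoCharLang('{"not": "html"}'));             // null
console.log(twoCharLang('<html lang="english">'));       // en
```

The last case shows a limitation: the pattern only captures the first two letters, so longer values, including invalid ones like english, get folded into a two-letter bucket.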