Language Tags and Locale Identifiers for the World Wide Web

This document provides definitions and best practices related to the identification of the natural language of content in document formats, specifications, and implementations on the Web. It describes how language tags are used to indicate a user's locale preferences which, in turn, are used to process, format, and display information to the user.

Languages and Language Tags

Tags for identifying the natural language of content or the international preferences of users are one of the fundamental building blocks of the Web. The language tags found in Web and Internet formats and protocols are defined by [[BCP47]]. Consistent use of language tags provides applications the ability to perform language-specific formatting or processing. For example, a user-agent might use the language to select an appropriate font for displaying text or a Web page designer might style text differently in one language than in another.

Many of the core standards for the Web include support for language tags; these include the xml:lang attribute in [[XML10]], the lang and hreflang attributes in [[HTML]], the language property in [[XSL10]], the :lang pseudo-class in CSS [[CSS3-SELECTORS]], and many others, including SVG, TTML, and SSML.

Natural Language (or, in this document, just language). The spoken, written, or signed communications used by human beings.

There are many ways that languages might be identified and many reasons that software might need to identify the language of content on the Web. Document formats and protocols on the Web generally use the identifiers used in most other parts of the Internet, consisting of the language tags defined in [[BCP47]]. "BCP" nomenclature refers to the current set of IETF RFCs that form the "best current practice".

Language tag. A string used as an identifier for a language. In this document, the term language tag always refers explicitly to a [[BCP47]] language tag. These language tags consist of one or more subtags.

Specifications for the Web that require language identification MUST refer to [[BCP47]].

Specifications SHOULD NOT refer to specific component RFCs of [[BCP47]].

[[BCP47]] is a multipart document consisting, at the time this document was published, of two separate RFCs. The first part, called Tags for Identifying Languages [[RFC5646]], defines the grammar, form, and terminology of language tags. The second part, called Matching of Language Tags [[RFC4647]], describes several schemes for matching, comparing, and selecting content using language tags and includes useful terminology related to comparison of language preferences to tagged content.

Formulations such as "RFC 5646 or its successor" MAY be used, but only in cases where the specific document version is necessary.

While this style of reference was once popular, using the BCP reference is more accurate. Since the grammar of language tags has been fixed since [[RFC4646]], referring to the BCP will not incur additional compliance risk to most implementations.

Specifications MUST NOT reference obsolete versions of [[BCP47]], such as [[RFC1766]] or [[RFC3066]].

Specifications that need to preserve compatibility with obsolete versions of [[BCP47]] MUST reference the production obs-language-tag in [[BCP47]].

Beginning with [[RFC4646]], [[BCP47]] defined a more complex, machine-readable syntax for language tags. This syntax is stable and is not expected to change in the foreseeable future. Some specifications might desire or require compatibility with the older language tag grammar found in previous versions of BCP47 (specifically [[RFC1766]] and [[RFC3066]]). This grammar was more permissive and is described in [[BCP47]] as the ABNF production obs-language-tag. [[RFC4646]], which introduced the current grammar for language tags, was replaced by [[RFC5646]] as part of the current [[BCP47]].

Applications that provide language information as part of URIs (e.g. in the realm of RDF) SHOULD use [[BCP47]].

Currently, URIs expressing language information often use values from parts of ISO 639. This leads to ambiguity about what the proper value should be: for German, for example, either de from ISO 639-1 or ger from ISO 639-2 could be used. Using BCP 47 and its Language Subtag Registry avoids such ambiguities; for German, the registry contains only de.

Subtag. A sequence of ASCII letters or digits separated from other subtags by the hyphen-minus character and identifying a specific element of meaning within the overall language tag. In [[BCP47]], subtags can consist of upper or lowercase ASCII letters (the case carries no distinction) or ASCII digits. Subtags are limited to no more than eight characters (although additional length restrictions apply depending on the specific use of the subtag).

Selecting content or behavior based on the language tag requires a few additional concepts defined by [[BCP47]] (in [[RFC4647]]). In this document, we adopt the following terminology taken directly from [[BCP47]]:

IANA Language Subtag Registry. A machine-readable text file available via IANA which contains a comprehensive list of all of the subtags valid in language tags. (Link: Registry)

Specifications SHOULD NOT reference [[BCP47]]'s underlying standards that contribute to the IANA Language Subtag Registry, such as ISO639, ISO15924, ISO3166, or UN M.49.

Some standards might directly consume one of [[BCP47]]'s contributory standards, in which case a reference is wholly appropriate. However, in most cases, the purpose of the reference is to specify a valid list of codes and their meanings. [[BCP47]]'s subtag registry is stabilized and resolves ambiguity in a number of useful ways and so should be the preferred source for this type of reference.

[[BCP47]] defines two different levels of conformance. See classes of conformance in [[BCP47]] for specifics. For language tags, the levels of conformance correspond to the type of checking that an implementation applies to language tag values.

Well-formed language tag. A language tag that follows the grammar defined in [[BCP47]]. That is, it is structurally correct, consisting of ASCII letters and digit subtags of the prescribed length, separated by hyphens.
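The structural check described in this definition can be sketched in Python. This is a minimal sketch rather than a conformance tool: the name is_well_formed is hypothetical, the pattern follows the langtag and privateuse productions of [[RFC5646]], and the grandfathered tags that the grammar also permits are deliberately left out.

```python
import re

# Simplified rendering of the RFC 5646 "langtag" and "privateuse" productions.
# Assumption: grandfathered tags (such as "i-klingon") are NOT covered here.
_LANGTAG = re.compile(
    r"""
    (?:
        (?:[a-z]{2,3}(?:-[a-z]{3}){0,3}|[a-z]{4}|[a-z]{5,8})  # language, optional extlang
        (?:-[a-z]{4})?                                        # script
        (?:-(?:[a-z]{2}|[0-9]{3}))?                           # region
        (?:-(?:[a-z0-9]{5,8}|[0-9][a-z0-9]{3}))*              # variants
        (?:-[0-9a-wy-z](?:-[a-z0-9]{2,8})+)*                  # extensions
        (?:-x(?:-[a-z0-9]{1,8})+)?                            # private-use suffix
    |
        x(?:-[a-z0-9]{1,8})+                                  # private-use-only tag
    )
    """,
    re.IGNORECASE | re.VERBOSE | re.ASCII,
)


def is_well_formed(tag: str) -> bool:
    """Return True if the tag matches the grammar; no registry lookup is done."""
    return _LANGTAG.fullmatch(tag) is not None
```

Because no registry data is consulted, a tag can pass this check and still fail the stricter validity requirement.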

Valid language tag. A language tag that is well-formed and which also conforms to the additional conformance requirements in [[BCP47]], notably that each of the subtags appears in the IANA Language Subtag Registry.

Specifications SHOULD require that language tags be well-formed.

Specifications MAY require that language tags be valid.

Specifications SHOULD require that content authors use valid language tags.

Note that this is stricter than what is recommended for implementations.

Content validators SHOULD check if content uses valid language tags where feasible.

Checking if a tag is valid requires access to or a copy of the registry plus additional runtime logic. While content authors are advised to choose, generate, and exchange only valid values, language tag matching and other common language tag operations are designed so that validity checking is not needed. Features or functions that need to understand the specific semantic content of subtags are the main reason that a specification would normatively require valid tags as part of the protocol or document format.
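To illustrate why validity checking needs registry data, the following Python sketch checks only tags of the shape language[-script][-region] against a small, hand-copied excerpt. Everything here is an assumption made for illustration: the names REGISTRY_EXCERPT and is_valid_simple are hypothetical, and a real checker would load the complete IANA Language Subtag Registry and also handle variants, extensions, and the other rules in [[BCP47]].

```python
# Hypothetical excerpt; the real registry contains thousands of subtags.
REGISTRY_EXCERPT = {
    "language": {"de", "en", "fr", "hi", "ja", "zh"},
    "script": {"latn", "hans", "hant", "deva"},
    "region": {"us", "gb", "de", "ch", "fr", "in", "jp", "cn"},
}

_ORDER = ["language", "script", "region"]


def is_valid_simple(tag: str, registry=REGISTRY_EXCERPT) -> bool:
    """Check a language[-script][-region] tag against the supplied subtag sets."""
    subtags = tag.lower().split("-")
    if subtags[0] not in registry["language"]:
        return False
    next_kind = 1  # index into _ORDER of the earliest subtag kind still allowed
    for subtag in subtags[1:]:
        if len(subtag) == 4 and subtag.isalpha():
            kind = 1  # script
        elif (len(subtag) == 2 and subtag.isalpha()) or (
            len(subtag) == 3 and subtag.isdigit()
        ):
            kind = 2  # region
        else:
            return False  # variants, extensions, etc. are outside this sketch
        if kind < next_kind or subtag not in registry[_ORDER[kind]]:
            return False
        next_kind = kind + 1
    return True
```

A result of False from this sketch only means the tag is not covered by the excerpt or by the simplified shape, not that the tag is invalid in general.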

Language tag extension or extension. A system of additional [[BCP47]] subtags introduced by a single letter or digit subtag registered with IANA and permitting additional types of language identification.

Specifications MAY reference registered extensions to [[BCP47]] as necessary.

In particular, [[RFC6067]] defines the BCP 47 Extension U, also known as "Unicode Locales". This extension to [[BCP47]] provides additional subtag sequences for selecting specific locale variations.

Specifications SHOULD NOT restrict the length of language tags or permit or encourage the removal of extensions.

Language range. A string similar in structure to a language tag that is used for "identifying sets of language tags that share specific attributes".

Language priority list. A collection of one or more language ranges identifying the user's language preferences for use in matching. As the name suggests, such lists are normally ordered or weighted according to the user's preferences. The HTTP [[RFC2616]] Accept-Language [[RFC3282]] header is an example of one kind of language priority list.

Basic language range. A language range consisting of a sequence of subtags separated by hyphens. That is, it is identical in appearance to a language tag.

Extended language range. A language range consisting of a sequence of hyphen-separated subtags. In an extended language range, a subtag can either be a valid subtag or the wildcard subtag *, which matches any value.

Some language priority lists, such as the Accept-Language [[RFC3282]] header mentioned earlier, provide "weights" for values appearing in the list. Such weighting cannot be depended on for anything other than ordering the list.
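As an illustration of using weights only for ordering, the following Python sketch turns an Accept-Language value into an ordered list of language ranges. The function name parse_priority_list is hypothetical, and the handling of malformed or zero weights (the range is dropped) is an assumption of this sketch rather than a requirement taken from this document.

```python
def parse_priority_list(header: str) -> list[str]:
    """Return the language ranges of an Accept-Language value, most preferred first."""
    entries = []
    for position, item in enumerate(header.split(",")):
        item = item.strip()
        if not item:
            continue
        language_range, _, params = item.partition(";")
        weight = 1.0  # a range without an explicit weight counts as q=1
        params = params.strip()
        if params.lower().startswith("q="):
            try:
                weight = float(params[2:])
            except ValueError:
                weight = 0.0  # assumption: treat a malformed weight as "not acceptable"
        if weight > 0:
            # The weight is kept only as a sort key; ties keep their header order.
            entries.append((-weight, position, language_range.strip()))
    return [language_range for _, _, language_range in sorted(entries)]
```

The returned list carries no numeric weights, which reflects the point above: the values are useful for ordering and nothing else.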

Specifications that define language tag matching or language negotiation MUST specify whether language ranges used are a basic language range or an extended language range.

Specifications that define language tag matching MUST specify whether the results of a matching operation contain a single result (lookup as defined in [[RFC4647]]), or a possibly-empty (zero or more) set of results (filtering as defined in [[RFC4647]]).

Specifications that define language tag matching MUST specify the matching algorithms available and the selection mechanism.

For example, JavaScript internationalization [[ECMA-402]] and [[CLDR]] provide a "best fit" algorithm which can be tailored by implementers.
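The difference between filtering and lookup can be sketched in Python using basic language ranges. This is a simplified reading of the basic filtering and lookup schemes in [[RFC4647]], not the "best fit" algorithm mentioned above; the function names and the default parameter are hypothetical.

```python
def basic_filter(priority_list: list[str], tags: list[str]) -> list[str]:
    """Filtering: return every tag matched by any range (possibly none)."""
    result = []
    for language_range in priority_list:
        r = language_range.lower()
        for tag in tags:
            t = tag.lower()
            # A range matches a tag that equals it or that it prefixes at a hyphen.
            if (r == "*" or t == r or t.startswith(r + "-")) and tag not in result:
                result.append(tag)
    return result


def lookup(priority_list: list[str], tags: list[str], default=None):
    """Lookup: return the single best tag, truncating each range from the end."""
    by_lower = {t.lower(): t for t in tags}
    for language_range in priority_list:
        if language_range == "*":
            continue  # the wildcard range is skipped during lookup
        subtags = language_range.lower().split("-")
        while subtags:
            candidate = "-".join(subtags)
            if candidate in by_lower:
                return by_lower[candidate]
            subtags.pop()
            # A single-character subtag left at the end is removed as well.
            while subtags and len(subtags[-1]) == 1:
                subtags.pop()
    return default
```

Filtering answers "which of these items are acceptable", while lookup answers "which one item should be used", which is why a specification needs to state which of the two it means.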

Locales and Internationalization

This section defines basic terminology related to internationalization and localization.

Users who speak different languages or come from different cultural backgrounds usually require software and services that are adapted to correctly process information using their native languages, writing systems, measurement systems, calendars, and other linguistic rules and cultural conventions.

Language tags can also be used to identify international preferences associated with a given piece of content or user because these preferences are linked to the natural language, regional association, or culture of the end user. Such preferences are applied to processes such as presenting numbers, dates, or times; sorting lists linguistically; providing defaults for items such as the presentation of a calendar, or common units of measurement; selecting between 12- vs. 24-hour time presentation; and many other details that users might find too tedious to set individually. Collectively, an identifier for these preferences is usually called a locale. The extensions to [[BCP47]] that define Unicode locales [[CLDR]] provide the basis for internationalization APIs on the Web. Notably, the JavaScript language [[ECMASCRIPT]] uses Unicode locales as the basis for the APIs found in [[ECMA-402]].

International Preferences. A user's particular set of language and formatting preferences and associated cultural conventions. Software can use these preferences to correctly process or present information exchanged with that user.

Many kinds of international preference may be offered on the Web in order for content or a service to be considered usable and acceptable by users around the world. Some of these preferences might include:

Natural language for text processing, such as parsing, spell checking, and grammar checking;
User interface language, which may include items like images, colors, sounds, formats, and navigational elements as well as the visible text strings;
Presentation (human-oriented formatting) of dates, times, numbers, lists, and other values;
Collation, sorting, and organization of content (such as in a phone book or a dictionary);
Alternate time-keeping and calendars, which may include holidays, work rules, weekday/weekend distinctions, the number and organization of months, the numbering of years, and so forth;
Tax or regulatory regime;
Currency

... and many more.

Internationalization. The design and development of a product that is enabled for target audiences that vary in culture, region, or language. Internationalization is sometimes abbreviated i18n because there are eighteen letters between the "I" and the "N" in the English word.

Localization. The tailoring of a system to the individual cultural expectations of a specific target market or group of individuals. Localization includes, but is not limited to, the translation of user-facing text and messages. Localization is sometimes abbreviated as l10n because there are ten letters between the "L" and the "N" in the English word. When a particular set of content and preferences corresponding to a specific set of international preferences is operationally available, then the system is said to be localized.

Locale. An identifier (such as a language tag) for a set of international preferences. Usually this identifier indicates the preferred language of the user and possibly includes other information, such as a geographic region (such as a country). A locale is passed in APIs or set in the operating environment to obtain culturally-affected behavior within a system or process.

Locale-aware (or Enabled). A system that can respond to changes in the locale with culturally and language-specific behavior or content. Generally, systems that are internationalized can support a wide range of locales in order to meet the international preferences of many kinds of users.

Language tags can provide information about the language, script, region, and various specially-registered variants using subtags. But sometimes there are international preferences that do not correlate directly with any of these. For example, many cultures have more than one way of sorting content items, and so the appropriate sort ordering cannot always be inferred from the language tag by itself. Thus a German language user might want to choose between the sort ordering used in a dictionary versus that used in a phone book.

Historically, locales were associated with and specific to the programming language or operating environment of the user. These application-specific identifiers often could be inferred from or converted into language tags. Some examples of locale models include Java's java.util.Locale, POSIX (with identifiers such as de_CH@utf8), Oracle databases (AMERICAN_AMERICA.AL32UTF8), or Microsoft's LCIDs (which used numeric codes such as 0x0409). The relationship between several of these models, the underlying standards such as ISO639 or ISO3166, and early language tags (such as [[RFC1766]]) was entirely intentional. Implementations often mapped (and continue to map) language tags from an existing protocol, such as HTTP's Accept-Language header, to proprietary or platform-specific locale models.

Since the adoption of the current [[BCP47]] identifier syntax, a number of locale models have adopted BCP47 directly or provided adaptation or mappings between proprietary models and language tags. Notably, the development and adoption of the open-source repository of locale data known as [[CLDR]] has led to wider general adoption of language tags as locale identifiers.

Common Locale Data Repository (or [[CLDR]]). The Common Locale Data Repository is a Unicode Consortium project that defines, collects, and curates sets of data needed to enable locales in systems or operating environments. CLDR data and its locale model are widely adopted, particularly in browsers.

Unicode Locale Identifier or Unicode Locale. A language tag that follows the additional rules and restrictions on subtag choice defined in UTR#35 [[LDML]]. Any valid Unicode locale identifier is also a valid [[BCP47]] language tag, but a few valid language tags are not also valid Unicode locale identifiers.

Canonical Unicode locale identifier. A well-formed language tag resulting from the application of the Unicode locale identifier canonicalization rules found in [[LDML]] (see Section 3). This process converts any valid [[BCP47]] language tag into a valid Unicode locale identifier. For example, deprecated subtags or irregular grandfathered tags are replaced with their preferred value from the IANA language subtag registry.
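One small, registry-free part of producing a consistent tag is letter case. The following Python sketch applies the case conventions described in [[RFC5646]] (language lowercase, script titlecase, region uppercase, everything after a singleton lowercase). It is only a partial illustration: the name format_case is hypothetical, and the replacement of deprecated subtags or grandfathered tags, which requires registry and [[LDML]] data, is not attempted.

```python
def format_case(tag: str) -> str:
    """Apply conventional casing to a well-formed language tag."""
    formatted = []
    after_singleton = False
    for index, subtag in enumerate(tag.split("-")):
        subtag = subtag.lower()
        if index > 0 and not after_singleton:
            if len(subtag) == 2:
                subtag = subtag.upper()        # region, e.g. "CN"
            elif len(subtag) == 4:
                subtag = subtag.capitalize()   # script, e.g. "Hans"
        if index > 0 and len(subtag) == 1:
            after_singleton = True             # extensions and private use stay lowercase
        formatted.append(subtag)
    return "-".join(formatted)
```

Since case carries no meaning in a language tag, this step changes only the appearance of the tag, never what it identifies.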

[[CLDR]] defines and maintains two language tag extensions ([[RFC6067]] and [[RFC6497]]) that are related to Unicode locale identifiers. These extensions allow a language tag to express some international preference variations that go beyond linguistic or regional variation or to select formatting behavior or content when there are multiple options or user preferences within a given locale. Unicode locale identifiers are not required to include these extensions: they are only used when the locale being identified requires additional tailoring provided by one of these extensions. [[CLDR]] also applies specific interpretation of certain subtags when used as a locale identifier. See Section 3.2 of [[LDML]] for details.

The Unicode locale language tag extension [[RFC6067]] uses the -u- subtag, and provides subtags for selecting different locale-based formats and behaviors. See Section 3.6 of [[LDML]] for details.
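As an illustration of how the -u- extension carries its settings, the following Python sketch extracts the key and type pairs from a tag. It assumes the input is already well-formed, ignores any attributes that precede the first key, and uses a hypothetical function name; it is not a full [[LDML]] parser.

```python
def unicode_extension_keywords(tag: str) -> dict[str, str]:
    """Return the keywords of the -u- extension, e.g. {"nu": "deva"}."""
    subtags = tag.lower().split("-")
    start = None
    for index, subtag in enumerate(subtags[1:], start=1):
        if subtag == "x":
            break  # private-use section: anything after it is not an extension
        if subtag == "u":
            start = index + 1
            break
    if start is None:
        return {}

    keywords: dict[str, list[str]] = {}
    key = None
    for subtag in subtags[start:]:
        if len(subtag) == 1:
            break                      # the next singleton ends the -u- extension
        if len(subtag) == 2:
            key = subtag               # keys are exactly two characters long
            keywords[key] = []
        elif key is not None:
            keywords[key].append(subtag)  # longer subtags form the key's type
    # A key without a type is treated as "true" in this sketch.
    return {k: "-".join(v) or "true" for k, v in keywords.items()}
```

For example, the tag hi-IN-u-nu-deva yields a single keyword with key nu and type deva.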

The transformed content language tag extension [[RFC6497]], which uses the -t- subtag, provides subtags for text transformations, such as transliteration between scripts. See Section 3.7 of [[LDML]] for details.

Unicode Locales increasingly form the basis for internationalization on the Web, particularly as part of the Intl locale framework [[ECMA-402]] in JavaScript [[ECMASCRIPT]].

Content authors SHOULD choose language tags that are canonical Unicode locale identifiers.

The additional content restrictions and normalization steps found in Section 3 of [[LDML]] provide for better interoperability and consistency than that afforded by [[BCP47]] directly.

Implementations SHOULD only emit language tags that are canonical Unicode locale identifiers and SHOULD normalize language tags that they consume using the rules for producing canonical tags.

As above, the additional content restrictions and normalization steps found in Section 3 of [[LDML]] provide for better interoperability and consistency than that afforded by [[BCP47]] directly. This best practice should not be interpreted as meaning that implementations need to support, generate, process, or understand either of [[CLDR]]'s extensions.

Content authors SHOULD NOT include language tag extensions in a language tag unless the specific application requires the additional tailoring.

It is important to remember that every Unicode locale identifier is also a well-formed [[BCP47]] language tag. Unicode locale identifiers do not require the use of either of [[CLDR]]'s language tag extensions.

Some international and cultural preferences are individual and are left to content authors, service providers, operating environments, or user agents to define and manage on behalf of the user.

Here are a few selected examples of Unicode Locale identifiers and the variations associated with them.

In this example, the value 123456789.5678 is formatted using the locale rules represented by the various language tags. Notice how the u extension and its nu keyword are used to select between Latin and Devanagari digit shapes in the Hindi-as-used-in-India (hi-IN) locale and between Latin and Arabic script digit shapes in the Arabic (ar) locale.

Variation Type	Value	Locale	Formatted Value
Numbering System	`123456789.5678`	en-US	123,456,789.5678
		de	123.456.789,5678
		hi-IN-u-nu-latn	12,34,56,789.5678
		hi-IN-u-nu-deva	१२,३४,५६,७८९.५६७८
		ar-u-nu-latn	123,456,789.5678
		ar-u-nu-arab	١٢٣٬٤٥٦٬٧٨٩٫٥٦٧٨
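The digit substitution selected by the nu keyword can be sketched in Python as a simple character mapping. This sketch covers digit shapes only: grouping and the choice of separators (which also differ in the table above) come from locale data such as [[CLDR]] and are not modeled. The names DIGIT_ZERO and apply_numbering_system are hypothetical.

```python
# Code point of the digit zero in each numbering system handled by this sketch.
DIGIT_ZERO = {"latn": 0x0030, "deva": 0x0966, "arab": 0x0660}


def apply_numbering_system(formatted: str, nu: str) -> str:
    """Replace ASCII digits in an already-formatted number with the digits of `nu`."""
    zero = DIGIT_ZERO[nu]
    table = {ord("0") + i: chr(zero + i) for i in range(10)}
    return formatted.translate(table)
```

Applied to the already-grouped string 12,34,56,789.5678 with nu set to deva, this reproduces the digits shown for hi-IN-u-nu-deva.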

In this example, a date value corresponding to 8 October 2020 on the Gregorian calendar is formatted using various different locales. In the tables below we present both the local-language and English (en) locale format of the same date value with different corresponding extension sequences supplied. This demonstrates the interplay between different locales and calendars when formatting a locale-neutral date value. Note that the language tag extensions can be applied to any language tag to modify the resulting Unicode locale.

Here are some presentational differences between English, French, and Japanese locales without using language tag extensions (each of which happens to use the Gregorian calendar):

Value	Locale	Formatted Value
`2020-10-08T12:00:00Z`	en	October 8, 2020
	fr	8 octobre 2020
	ja	2020年10月8日

Thailand uses the Thai Buddhist calendar, which can be represented using the extension sequence -u-ca-buddhist. This calendar is similar to the Gregorian calendar, but uses a different year numbering scheme.

Value	Locale	Formatted Value
`2020-10-08T12:00:00Z`	en	October 8, 2020
	th-u-ca-gregory	8 ตุลาคม ค.ศ. 2020
	th-u-ca-buddhist	8 ตุลาคม 2563
	en-u-ca-buddhist	October 8, 2563 BE

In addition to the Gregorian calendar, Japan uses other calendar systems for different cultural or official purposes. One such calendar is the Japanese Imperial calendar denoted by the extension sequence -u-ca-japanese. This calendar is also similar to the Gregorian calendar, but uses a different year numbering scheme.

Value	Locale	Formatted Value
`2020-10-08T12:00:00Z`	en	October 8, 2020
	ja-u-ca-japanese	令和2年10月8日
	en-u-ca-japanese	October 8, 2 Reiwa
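The year numbering used in the two preceding tables can be checked with a little arithmetic. The Python sketch below assumes the commonly used offset of 543 years for the Thai Buddhist calendar and a Reiwa era start of 1 May 2019; earlier Japanese eras and all other calendar details are left out, and the function names are hypothetical.

```python
from datetime import date

REIWA_START = date(2019, 5, 1)  # assumption: first day of the Reiwa era


def buddhist_year(d: date) -> int:
    """Year number in the Thai Buddhist calendar for a Gregorian date."""
    return d.year + 543


def reiwa_year(d: date) -> int:
    """Year number within the Reiwa era; earlier eras are outside this sketch."""
    if d < REIWA_START:
        raise ValueError("date precedes the Reiwa era")
    return d.year - REIWA_START.year + 1
```

For 8 October 2020 these give 2563 and 2 respectively, matching the formatted values shown above; month and day are unchanged because both calendars follow the Gregorian months.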

Some countries or cultures use non-Gregorian calendars for official, religious, or cultural purposes. One such calendar is represented by the extension sequence -u-ca-islamic. This particular calendar is based on lunar months and thus 2020-10-08 (Gregorian) corresponds to the 21st day of the 2nd month (called "Safar" when rendered into English). This calendar also uses a different year numbering scheme.

Value	Locale	Formatted Value
`2020-10-08T12:00:00Z`	en	October 8, 2020
	ar-u-ca-islamic	٢١ صفر ١٤٤٢ هـ
	en-u-ca-islamic	Safar 21, 1442 AH

Non-linguistic Field. Any element of a data structure not intended for the storage or interchange of natural language textual data. This includes non-string data types, such as booleans, numbers, dates, and so forth. It also includes strings, such as program or protocol internal identifiers. This document uses the term field as a shorthand for this concept.

Specifications for document formats or protocols usually define the exchange, processing, or display of various data values or data structures. The Web primarily relies on text files for the serialization and exchange of data: even raw bytes are usually transmitted using a string serialization such as base64. Thus non-linguistic fields on the Web are also normally made up of strings. The important distinction here is that non-linguistic fields are generally interpreted by or meant for consumption by the underlying application, rather than by a user.

Locale-neutral. A non-linguistic field is said to be locale-neutral when it is stored or exchanged in a format that is not specifically appropriate for any given language, locale, or culture and which can be interpreted unambiguously for presentation in a locale-aware way.

Many specifications use a serialization scheme, such as those provided by [[XMLSCHEMA11-2]] or [[JSON-LD]], to provide a locale-neutral encoding of non-linguistic fields in document formats or protocols.

A locale-neutral representation might itself be linked to a specific cultural preference, but such linkages should be minimized. For example, many of the ISO8601 date/time value serializations are linked to the Gregorian calendar, but the format, field order, separators, and visual appearance are not specifically suitable to any locale (they are intended to be machine readable) and, as shown in the example above, the value can be converted for display into any calendar or locale.

Suppose your application needs to collect and store some value in a field. The system can use a locale-neutral format for storing and exchanging the value. For instance, schema languages such as [[XMLSCHEMA11-2]] or data formats such as [[JSON]] provide ready-made types for this purpose. When the user is entering or editing the value, however, the user expects to interact with a more human-friendly format. For example, if your application needed to input a user's birth date and the value they were trying to enter were 2020-01-31:

The input field might look like this in HTML:

<input type="date" id="birthDate" value="2020-01-31" lang=… >

The lang attribute here should control the display and formatting of the value, including the expected input pattern. Note that this guidance is at odds with what browsers did at the time this document was published.

Value	Language Tag	Display	Input Format Pattern
`2020-01-31`	en-GB	31/01/2020	dd/MM/yyyy
	en-US	01/31/2020	MM/dd/yyyy
	fr-FR	31-01-2020	dd-MM-yyyy
	zh-Hans-CN	2020-01-31	yyyy-MM-dd
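The separation between the stored value and its presentation can be sketched in Python. The pattern table below is a hypothetical stand-in that simply restates the rows above in strftime notation; a real implementation would obtain its patterns from locale data such as [[CLDR]] rather than a hard-coded dictionary.

```python
from datetime import date, datetime

# Hypothetical pattern table mirroring the rows above (strftime notation).
DISPLAY_PATTERNS = {
    "en-GB": "%d/%m/%Y",
    "en-US": "%m/%d/%Y",
    "fr-FR": "%d-%m-%Y",
    "zh-Hans-CN": "%Y-%m-%d",
}


def to_display(stored: str, language_tag: str) -> str:
    """Format a locale-neutral ISO 8601 date for presentation in the given language."""
    return date.fromisoformat(stored).strftime(DISPLAY_PATTERNS[language_tag])


def from_display(entered: str, language_tag: str) -> str:
    """Parse what the user typed and return the locale-neutral value for storage."""
    pattern = DISPLAY_PATTERNS[language_tag]
    return datetime.strptime(entered, pattern).date().isoformat()
```

Only the presentation varies with the language tag; the value that is stored or exchanged stays 2020-01-31 in every case.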

Language negotiation. The process of matching a user's international preferences to available locales, localized resources, content, or processing.

Locale fallback. The process of searching for translated content, locale data, or other resources by "falling back" from more-specific resources to more-general ones following a deterministic pattern.

A user's preferences are usually expressed as a locale or prioritized list of locales. When negotiating the language, the system follows some sort of algorithm to get the best matching content or functionality from the available resources. In many cases the language negotiation algorithm uses locale fallback.
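Locale fallback is often applied per resource rather than per locale, so that a more-specific locale only needs to supply what differs from its parent. The Python sketch below illustrates this under simplifying assumptions: tags contain no extensions, resource keys use the same letter case as the requested tag, and the names MESSAGES, fallback_chain, and resolve are hypothetical.

```python
# Hypothetical resource bundles: "de-CH" overrides only what differs from "de".
MESSAGES = {
    "de": {"greeting": "Hallo", "farewell": "Tschüss"},
    "de-CH": {"greeting": "Grüezi"},
}


def fallback_chain(tag: str) -> list[str]:
    """List the tag followed by progressively more general tags."""
    subtags = tag.split("-")
    return ["-".join(subtags[:n]) for n in range(len(subtags), 0, -1)]


def resolve(messages: dict, tag: str, key: str):
    """Return the first value for `key` found while walking the fallback chain."""
    for candidate in fallback_chain(tag):
        bundle = messages.get(candidate, {})
        if key in bundle:
            return bundle[key]
    return None  # nothing available; the caller decides on a default
```

Here a request for the farewell message in de-CH falls back to the de bundle, while the greeting is served from de-CH directly.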

Specifications that present fields in a document format SHOULD require that data is formatted according to the language of the surrounding content.

When non-linguistic fields are presented to the user as part of a document or application, the document or application forms the "context" where the data is being viewed. Content authors or application developers need a way to make the fields seem like a natural part of the experience and need a way to control the presentation. This is indicated by the language tag of the context in which the content appears: usually enabled implementations interpret the tag as a locale in order to accomplish this. Using the runtime locale or localization of the user-agent as the locale for presenting non-linguistic fields should only be a last resort.

Specifications that present forms or receive input of non-linguistic fields in a document format or application SHOULD require that the values be presented to the user localized in the format of the language of the content or markup immediately surrounding the value.

Specifications that present, exchange, or allow the input of non-linguistic fields MUST use a locale-neutral format for storage and interchange.

Implementations SHOULD present non-linguistic fields in a document format or application using a format consistent with the language of the surrounding content and are encouraged to provide controls which are localized to the same locale for input or editing.

Users expect form fields and other data inputs to use a presentation for non-linguistic fields that is consistent with the document or application where the values appear. Users usually expect their input to match the document's context rather than that of the user-agent or operating environment, and expect input validation, prompting, and controls to be consistent with the content as well. This gives content authors the ability to create a wholly localized customer experience and is generally in keeping with customer expectations.

Choosing between metadata and text-processing language

There are two common uses for language tags in document formats, protocols, and specifications. In some cases, language tags are used to provide metadata about intended audience for collections of content, such as at the record or document level. In other cases, language tags are used to identify the language of specific bits of text in order to facilitate text processing.

The language of the intended audience

Metadata that describes the language of the intended audience is about the document as a whole. Such metadata may be used for searching, serving the right language version, classification, etc. Where there are language changes in a document, information about the language of the intended audience is not specific enough to support the text processing needed for text-to-speech, styling, automatic font assignment, etc.

The language of the intended audience does not include every language used in a document. Many documents on the Web contain embedded fragments of content in different languages, whereas the page is clearly aimed at speakers of one particular language. For example, a German city-guide for Beijing may contain useful phrases in Chinese, but it is aimed at a German-speaking audience, not a Chinese one.

On the other hand, it is also possible to imagine a situation where a document contains the same or parallel content in more than one language. For example, a Web page may welcome Canadian readers with French content in the left column, and the same content in English in the right-hand column. Here the document is equally targeted at speakers of both languages, so there are two audience languages. This situation is not as common on the Web as in printed material since it is easy to link to separate pages on the Web for different audiences, but it does occur where there are multilingual communities. Another use case is a blog or a news page aimed at a multilingual community, where some articles on a page are in one language and some in another.

There are also pages where the navigational information, including the page title, is in one language but the real content of the page is in another. While this is not necessarily good practice, it doesn't change the fact that the language of the intended audience is usually that of the content, regardless of the language at the top of the document source.

Metadata about the language of the intended audience is usually best declared outside the document, such as in the HTTP Content-Language header.

The text-processing language

When specifying the text-processing language you are declaring the language in which a specific range of text is actually written, so that user agents or applications that manipulate the text (such as voice browsers, spell checkers, or style processors) can process the text in a language-appropriate manner. So we are, by necessity, talking about associating a single language with a specific range of text.

This specificity distinguishes the declaration of the language for text-processing from that of the language of the intended audience.

The language for text-processing is usually best declared using attributes on elements, including setting a document-wide default.

For example, the html element in [[HTML]] contains all of the content of the document, so setting the lang attribute sets the text-processing language for the whole document except where locally overridden. Enclosed elements inherit the declared value, but you can, of course, override an initial declaration by specifying a different language on embedded elements where the language changes, e.g., a French phrase in an English paragraph:

<html lang="en" dir="ltr">
   <head>
      <title>This example is in English</title>
      ...
   </head>
   <body>
       <h1>This also inherits from <code>html</code></h1>
       
       <p>The following example is in French:
           <!-- Text-processing in French inside the 'span' tag -->
           <span lang="fr">cet exemple est en français</span>
           <!-- Text-processing reverts to English here -->
       </p>
   </body>
</html>
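The inheritance behavior shown in this markup can be sketched with Python's standard html.parser module. This is a simplified illustration and not how a user agent resolves language: it considers only the lang attribute, ignores xml:lang and any HTTP or meta declarations, and uses the hypothetical names LangTracker and text_runs.

```python
from html.parser import HTMLParser


class LangTracker(HTMLParser):
    """Record each run of text together with the lang value it inherits."""

    def __init__(self):
        super().__init__()
        self.stack = [(None, "")]  # (element name, effective lang); "" means unknown
        self.runs = []

    def handle_starttag(self, tag, attrs):
        inherited = self.stack[-1][1]
        declared = dict(attrs).get("lang")
        self.stack.append((tag, declared if declared is not None else inherited))

    def handle_endtag(self, tag):
        # Pop back to the matching open element; unclosed children go with it.
        for index in range(len(self.stack) - 1, 0, -1):
            if self.stack[index][0] == tag:
                del self.stack[index:]
                break

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.runs.append((self.stack[-1][1], text))


def text_runs(markup: str) -> list[tuple[str, str]]:
    """Return (lang, text) pairs in document order."""
    tracker = LangTracker()
    tracker.feed(markup)
    tracker.close()
    return tracker.runs
```

Feeding it a paragraph in an English document that contains a span with lang="fr" yields the surrounding text tagged en and the span's text tagged fr, with the language reverting to en after the span closes.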
