This document describes the requirements for language and base direction metadata for data formats used on the Web.

Sending comments on this document

If you wish to make comments regarding this document, please raise them as GitHub issues. Only send comments by email if you are unable to raise issues on GitHub (see links below). All comments are welcome.

To make it easier to track comments, please raise separate issues or emails for each comment, and point to the section you are commenting on using a URL for the dated version of the document.

Introduction

Natural language information on the Web depends on and benefits from the presence of language and direction metadata. Along with support for Unicode, mechanisms for including and specifying the base direction and language of spans of text are one of the key considerations in development of new formats and technologies for the Web.

Markup formats, such as HTML and XML, as well as related styling languages, such as CSS and XSL, are reasonably mature and provide support for the interchange and presentation of the world's languages via built-in features.

This document was developed as a result of observations by the Internationalization Working Group over a series of specification reviews related to formats based on JSON, WebIDL, and other non-markup data languages. Unlike markup formats, such as XML, these data languages generally do not provide extensible attributes and were not conceived with built-in language or direction metadata.

Why is this important?

The language of content is important when processing and presenting natural language data for a variety of reasons. When this data is not present, the resulting degradation in appearance or functionality can frustrate users—or render the content unintelligible. Some of the affected processes include:

Similarly, direction metadata is important to the Web. When a string contains text in a script that runs right-to-left (RTL), it must be possible to eventually display that string correctly when it reaches an end user. For that to happen, it is necessary to establish what 'base direction' needs to be applied to the string as a whole. The appropriate base direction cannot always be deduced by simply looking at the string; even if it were possible, the producer and consumer of the string would need to use the same heuristics to interpret its direction.

Static content, such as the body of a Web page or the contents of an e-book, often has language or direction information provided by the document format or as part of the content metadata. Data formats found on the Web generally do not supply this metadata. Base specifications such as Microformats, WebIDL, JSON, and more, have tended to store natural language text in string objects, without additional metadata.

This places a burden on application authors and data format designers to provide the metadata on their own initiative. When standardized formats do not address the resulting issues, the result can be that, while the data arrives intact, its processing or presentation cannot be wholly recovered.

Suppose that you are building a Web page to show a customer's library of e-books. The e-books exist in a catalog of data and consist of the usual data values. A JSON file for a single entry might look something like:

{
    "id": "978-0-1234-5678-X",
    "title": "Moby Dick",
    "authors": [ "Herman Melville" ],
    "language": "en-US",
    "pubDate": "1851-10-18",
    "publisher": "Mark Twain Press",
    "coverImage": "https://example.com/images/mobidick_cover.jpg",
    // etc.
},

Each of the above is a data field in a database somewhere. There is even language information about the contents of the book ("language": "en-US").

A well-internationalized catalog would include additional metadata to what is shown above, though. For each of the fields containing natural language text, such as the title and authors fields, there will be a language attribute and base direction stored as metadata. (There may be other values as well, such as pronunciation metadata for sorting East Asian language information.) These data fields are used in a variety of ways to influence and enable the processing and display of the items. But the data structure provides no place to store these.

One work-around might be to encode the values using a mix of HTML and Unicode bidi controls, so that the data value looks like this:

   "title": "<span lang='en-US' dir='ltr'>&lrm;Moby Dick</span>"

But JSON is a data interchange format. The content may not end up displaying the title field in an HTML context. The JSON might very well be used to populate, say, a local data store which uses native controls to show the title. And neither the producer nor the consumer of the data currently expects to introspect it. They want to generate it directly from a local data store, such as a database, or push it directly into processing. They may have other considerations, such as field length, that are affected by the insertion of additional controls or markup.

Isn't Unicode Enough?

[[!Unicode]] and its character encodings (such as UTF-8) are key elements of the Web and its formats. They provide the ability to encode and exchange text in any language consistently throughout the Internet. However, Unicode by itself does not guarantee perfect presentation and processing of natural language text, even though it does guarantee perfect interchange.

Several features of Unicode are sometimes suggested as part of the solution to providing language and direction metadata. Specifically, Unicode bidi controls are suggested for handling direction metadata. In addition, there are "tag" characters in the U+E0000 block of Unicode for use as language tags. These characters are deprecated and their use is "strongly discouraged" (to quote Unicode).

There are a variety of reasons why the addition of characters to data in an interchange format is not a good idea. These include:

Language Identification

Definitions

Language metadata typically indicates the intended linguistic audience or user of the resource as a whole, and it's possible to imagine that this could, for a multilingual resource, involve a property value that is a list of languages. A property that is about language metadata may have more than one value, since it aims to describe all potential users of the information.

Text-processing language is the language of a particular range of text (which could be a whole resource or just part of it). A property that represents the text-processing language needs to have a single value, because it describes the text content in such a way that tools such as spell-checkers, default font applicators, hyphenation and line breakers, case converters, voice browsers, and other language-sensitive applications know which set of rules or resources to apply to a specific range of text. Such applications generally need an unambiguous statement about the language they are working on.
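The distinction between the two kinds of language information can be sketched as a data structure. The following is only an illustration, not a proposed format: the class names AnnotationBody and TextRange, the field names, and the language_at helper are all hypothetical. Language metadata is modelled as a list of tags, while each text-processing range carries exactly one tag.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class TextRange:
    """A span of text with a single text-processing language (hypothetical)."""
    start: int  # index of the first character in the range
    end: int    # index one past the last character in the range
    lang: str   # exactly one language tag, e.g. "ja"


@dataclass
class AnnotationBody:
    """A resource with language metadata plus per-range languages (hypothetical)."""
    text: str
    audience_languages: List[str]  # language metadata: may hold several tags
    ranges: List[TextRange] = field(default_factory=list)

    def language_at(self, index: int) -> Optional[str]:
        """Return the text-processing language covering a character, if any."""
        for r in self.ranges:
            if r.start <= index < r.end:
                return r.lang
        return None  # no unambiguous language is known for this character


body = AnnotationBody(
    text="日本語 and 中文",
    audience_languages=["ja", "zh"],
    ranges=[TextRange(0, 3, "ja"), TextRange(8, 10, "zh")],
)
```

A spell-checker or font selector would consult language_at for a specific character, whereas a retrieval tool would look at audience_languages.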

Language Tagging Use Cases

Kensuke is reading an old Tibetan manuscript from the Dunhuang collection. The tool he is using to read the manuscript has access to annotations created by scholars working in the various languages of the International Dunhuang Project, who are commenting on the text. The section of the manuscript he is currently looking at has commentaries by people writing in Chinese, Japanese and Russian. Each of these commentaries is stored in a separate annotation, but the annotations point to the same point in the target document. Each commentary is mainly written in the language of the scholar, but may contain excerpts from the manuscript and other sources written in Tibetan as well as quoted text in Chinese and English.

Kensuke speaks Japanese, so he wants to be presented with the Japanese commentary.

The body containing the Japanese commentary has a language property set to ja (Japanese). The tool he is using knows that he wants to read Japanese commentaries, and it uses this information to select and present to him the text contained in that body. This is language information being used as metadata – it indicates to the application doing the retrieval that the intended consumer of the information wants Japanese.

The Japanese commentary for this particular annotation starts with a sentence in Japanese, but later contains some excerpts from Chinese and Tibetan sources. It's possible for the value of the language property, when used as metadata, to contain three language tags, ja,zh,bo (Japanese, Chinese, and Tibetan, respectively), but it is not clear how useful that is in this particular use case.

Having identified the relevant annotation text to present to Kensuke, his application has to then display it so that he can read it. It's important to apply the correct font to the text. In the following examples, the first paragraph has no language tag applied. The subsequent paragraphs are labeled ja (Japanese), zh-Hans (Simplified Chinese), and zh-Hant (Traditional Chinese) respectively.

雪, 刃, 直, 令, 垔

雪, 刃, 直, 令, 垔

雪, 刃, 直, 令, 垔

雪, 刃, 直, 令, 垔

You should be able to see slight but important differences in Japanese vs Chinese fonts. It's important to apply a Japanese font to the Japanese text that Kensuke is reading. There are also language-specific differences in the way text is wrapped at the end of a line. For these reasons we need to identify the actual language of the text to which the font or the wrapping algorithm will be applied. Also, a voice browser will need to know whether to use Japanese or Chinese pronunciations for the ideographic characters contained in the annotation body text, and as mentioned before, various other text rendering or analysis tools need to know the language of the text they are dealing with.

If the language property value contains only ja, that's a good indicator that the application should expect the first sentence and the annotation in general to be in Japanese, unless instructed otherwise. If, however, the language property has the value bo,ja,zh, it's not clear what the default font, etc, should be. In that case, we need a way to indicate that the first sentence in the text presented to Kensuke is actually in Japanese.

We also need a way to indicate the change of language to Chinese and Tibetan later in the commentary for this annotation, so that appropriate fonts and wrapping algorithms can be applied there. One proposal from members of the Annotation WG was to require HTML/XML formats for such annotation bodies, and use the lang or xml:lang attributes in markup to denote the language changes.

(Use case from Felix) If Kensuke's body contains quoted text in Chinese and Tibetan it would be useful to know that if you were someone who wanted to locate all annotations containing text in more than one language.

Bidirectional Use Cases

If your specification or application provides a way of correctly displaying the following strings when they reach the point of display to the user, you will have solved the majority of the problems. For that to happen, there must be a way to tell the required base direction for each string.

All examples show characters from left to right in the order they are stored in memory. We use Hebrew text so as to avoid issues related to the display of cursive characters in Arabic. We will also use these tests as examples for the concepts on this page.

Tests 1-4 need to be displayed using an RTL base direction. Test 5 needs to be displayed as LTR text.

Test #1

"בינלאומי!"

For presentation to a user, the characters above should be presented in the reverse order to what you see on this page. The Hebrew characters will be reversed by applying the Unicode Bidirectional Algorithm (UBA). However, the UBA cannot make the exclamation mark appear to the left of the Hebrew text, where it belongs, unless the base direction is set to RTL.

This is what the text of the string should look like if displayed correctly by a consumer:  Test 1

Test #2

"bidi בינלאומי"

For presentation to a user, the text "bidi" must appear to the right of the Hebrew letters. The UBA cannot do this unless it knows that the overall base direction is RTL.

This is what the text of the string should look like if displayed correctly by a consumer:  Test 2

Test #3

"<span>בינלאומי!</span>"

This test is intended for consuming applications that treat the markup in the string as actual markup. As for test #1, the exclamation mark must appear to the left of the Hebrew letters, regardless of the LTR directionality of the markup surrounding it.

This is what the text of the string should look like if displayed correctly by a consumer:  Test 3

Test #4

"<span dir='rtl'>one שתיים three</span>"

If the consuming application is expected to parse the markup as actual markup, the list in the element content above should be displayed to the user in the order "three שתיים one". This requires the UBA to know that the intended base direction of the string is RTL. The key point of this test is that the base direction information is carried in the markup.

This is what the text of the string should look like if displayed correctly by a consumer:  Test 4

Test #5

"123 456 789"

When presented to a user, the order of the numbers must remain the same even when the directional context of the surrounding text is RTL. There are no strong directional characters in this string.

This is what the text of the string should look like if displayed correctly by a consumer:  Test 5

The main issue

The main issue is how a consumer of a string will know what base direction should be used for that string when it is eventually displayed to a user. A number of alternatives are considered below.

Ancillary issues

Apart from the question of how a consumer will know what base direction to use for a string, the following are things that need to be considered for an end to end process that supports proper application of base direction to strings.

Producing

A string may be created in a number of ways, including a content author typing strings into a plain text editor or text message, or a script scraping text from web pages, or acquisition of an existing set of strings from another application or repository, or, if you are lucky, a dedicated system with an interface that allows base direction to be specified during input. In this article, the producer of a string is the human or mechanism that creates a string for storage or transmission to a consumer of strings.

When a string is created, it's necessary to (a) detect the appropriate base direction to be associated with the string, and (b) take steps, where needed, to set the string up in a way that communicates the base direction.

For example, in the case of a string that is extracted from an HTML form, the base direction can be detected from the computed value of the form's field. Such a value could be inherited from an earlier element, such as the html element, or set using markup or styling on the input element itself. The user could also set the direction of the text by using keyboard shortcut keys to change the direction of the form field. The dirname attribute provides a way of automatically communicating that value with a form submission.

If the producer of the string is receiving the string from a location where it was stored by another producer, and where the base direction has already been established, the producer should understand that the base direction has already been set.

Consuming

A consumer is an application or process that takes a string and places it into a context where it will be exposed to a user. It must ensure that the base direction of the string is correctly applied to the string in that context.

Applying the base direction may involve constructing additional markup or adding control codes or some such to indicate the base direction. The consumer must also isolate embedded strings from the surrounding text to avoid spill-over effects of the bidi algorithm.
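As a minimal sketch of what a consumer might do, the following shows two ways of applying a known base direction while isolating the string from its surroundings: Unicode isolate controls for a plain text context, and a span with a dir attribute for an HTML context. The function names are hypothetical, and the HTML variant assumes the string is plain text (it is escaped, not parsed) and that the rendering engine treats elements with a dir attribute as isolated.

```python
import html

# Unicode isolate formatting characters
LRI, RLI, FSI, PDI = "\u2066", "\u2067", "\u2068", "\u2069"


def isolate_plain(text, direction=None):
    """Wrap text in isolate controls; fall back to FSI if direction is unknown."""
    opener = {"ltr": LRI, "rtl": RLI}.get(direction, FSI)
    return opener + text + PDI


def isolate_html(text, direction=None):
    """Wrap plain text in a span carrying the base direction.

    The text is escaped, so any angle brackets in it are shown literally
    rather than being interpreted as markup.
    """
    dir_value = direction if direction in ("ltr", "rtl") else "auto"
    return '<span dir="{}">{}</span>'.format(dir_value, html.escape(text))
```

Note that the plain text variant changes the value of the string, so it is probably best applied only at the point of display, not to the stored data.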

Decoding information

Any time a producer of a string takes special steps to add information about the base direction of that string it must do so with the expectation that the consumer of the string will understand how the producer did so. Even if no action is taken by the producer, the consumer must decide what rules to follow in order to decide on the appropriate base direction.

In some systems, the behaviour of the producer and the consumer of a string will both be specified. In others, such agreements may not be available.

First-strong

First-strong detection looks for the first character with a strong Unicode directional property in a string, and sets the base direction to match it. Many developers assume that this provides a robust solution, but first-strong detection alone is not always adequate to communicate base direction.

Note that, if the producer is relying on the consumer using first-strong character detection to establish the contextual base direction of a string, the consumer needs to be aware that it should also use that approach. Although first-strong detection is outlined in the UBA, it is not the only possible higher level protocol mentioned for detecting string direction. For example, Twitter and Facebook currently use different default heuristics for guessing the base direction of text – neither uses just simple first-strong detection, and one uses a completely different method.

The first-strong detection algorithm needs to skip characters at the start of the string that don't have a strong directional property.

It also needs to skip embedded runs of text that are directionally isolated from the text around them, if it is to follow the UBA. Isolation may be achieved by Unicode formatting characters, such as RLI, LRI and FSI, or by markup in the string if that markup is to be interpreted as actual markup by the consumer (eg. <span dir="rtl"> in HTML5).

The principal problem encountered with first-strong detection is that the first strong character is not always representative of the base direction that needs to be applied to that string, such as in test #2 above.

If a string contains markup that will be parsed by the consumer as markup, there are additional problems. Any such markup at the start of the string must also be skipped when searching for the first strong directional character. If, however, there is angle bracket content that is intended to be an example of markup, rather than actual markup, the markup must not be skipped. It isn't clear how a consumer of the string would know the difference between this case and the previous one.

If parseable markup in the string contains information about the intended direction of the string, that information should be used rather than relying on first-strong heuristics. This is problematic in a couple of ways: (a) it assumes that the consumer of the string understands the semantics of the markup, which may be ok if there is an agreement between all parties to use, say, HTML markup only, but would be problematic when dealing with random XML vocabularies, and (b) the consumer must be able to recognise and handle a situation where only the initial part of the string has markup, ie. the markup applies to an inline span of text rather than the string as a whole.

If no strong directional character is found in the string, the direction should be assumed to be LTR.
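The first-strong rules described in this section can be sketched as follows. This is an illustrative sketch, not a normative algorithm: the function name is hypothetical, it treats the string purely as plain text (it makes no attempt to skip or interpret markup), and it relies on the bidi classes reported by Python's unicodedata module.

```python
import unicodedata

# Unicode isolate formatting characters
LRI, RLI, FSI, PDI = "\u2066", "\u2067", "\u2068", "\u2069"
ISOLATE_INITIATORS = {LRI, RLI, FSI}


def first_strong_direction(text, default="ltr"):
    """Guess the base direction from the first strong character.

    Characters without a strong directional property are skipped, as are
    runs enclosed in isolate controls. If no strong character is found,
    the default (LTR) is returned.
    """
    depth = 0  # nesting depth of isolated runs currently being skipped
    for ch in text:
        if ch in ISOLATE_INITIATORS:
            depth += 1
        elif ch == PDI:
            if depth > 0:
                depth -= 1
        elif depth == 0:
            bidi_class = unicodedata.bidirectional(ch)
            if bidi_class == "L":
                return "ltr"
            if bidi_class in ("R", "AL"):
                return "rtl"
    return default
```

Applied to the tests above, this sketch returns "rtl" for test #1 but "ltr" for test #2, which is the wrong answer for that string and illustrates why first-strong detection alone is not sufficient.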

The remaining sections look at ways that a string may be stored with additional information where text direction cannot be detected accurately by the first-strong method.

Augmenting first-strong by inserting RLM markers

It is possible for a producer of a string to attach an RLM/LRM character to the beginning of those strings where the wrong base direction would otherwise be assumed when using a simple first-strong heuristic.

If the producer is a human, they could theoretically apply one of these characters when creating a string in order to signal the directionality, although they are very likely to not have a way of inputting RLM/LRM characters, especially on mobile devices.

However, humans often create text that will later become strings in environments where such action is unnecessary. For example, if a person types information into an HTML form and relies on the form's base direction or use of shortcut keys to make the string look correct in the form field, they would not need to add RLM/LRM to make the string 'look correct' for themselves, but outside of that context the string would look incorrect unless an appropriate strong character was added to it. Similarly, strings scraped from a web page that has dir=rtl set in the html element would not normally have or need an RLM/LRM character at the start of the string in HTML.

This approach is therefore only appropriate for general use if it is acceptable to change the value of the string.

Apart from changing the identity of the string, adding characters to it may have an effect on things such as string length or pointer positions, which may become problematic.

As a variant of the first-strong heuristic approach, the consumer would still need to also use first-strong heuristics to apply the correct directionality to the string.

If directional information is contained in markup that will be parsed as such by the consumer (eg. dir=rtl in HTML), the producer of the string should understand that markup in order to set or not set an RLM/LRM character as appropriate. If the producer always adds RLM/LRM to the start of such strings, the consumer should know that. If it relies instead on the markup being understood, the consumer needs to understand the markup.

The producer of a string should not automatically apply RLM or LRM to the start of the string, but should test whether it is needed. For example, if there's already an RLM there, no need to add another. If the context is correctly conveyed by first-strong heuristics, no need to add additional characters either. Note, however, that testing whether supplementary directional information of this kind is needed is only possible if the producer has access, and knows that it has access, to the original context of the string.
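A producer that knows the intended base direction might apply this test along the following lines. This is a sketch under stated assumptions: the function names are hypothetical, the intended direction is assumed to be available from the original context, and the embedded first-strong check is simplified (it does not skip isolated runs or markup).

```python
import unicodedata

LRM, RLM = "\u200e", "\u200f"  # left-to-right mark, right-to-left mark


def _first_strong(text):
    """Simplified first-strong check; defaults to LTR if nothing strong is found."""
    for ch in text:
        bidi_class = unicodedata.bidirectional(ch)
        if bidi_class == "L":
            return "ltr"
        if bidi_class in ("R", "AL"):
            return "rtl"
    return "ltr"


def add_mark_if_needed(text, intended):
    """Prepend RLM or LRM only when first-strong would guess the wrong direction."""
    if _first_strong(text) == intended:
        return text  # heuristics already give the right answer; leave unchanged
    return (RLM if intended == "rtl" else LRM) + text


marked = add_mark_if_needed("bidi בינלאומי", "rtl")
```

Because RLM and LRM are themselves strong characters, a string that already starts with the right mark is left alone. The marked string is one character longer than the original and no longer compares equal to it, which is the cost described above.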

Paired formatting characters

This approach inserts paired Unicode formatting characters at the start and end of a string to indicate the base direction.

If paired formatting characters are used, they should be isolating, ie. starting with RLI, LRI, FSI, and not with RLE or LRE.

However, it would not be enough to simply apply the UBA first-strong heuristics to such a string, because the Unicode bidi algorithm is unable to ascertain the base direction for a string that starts with RLI/LRI/FSI and ends with PDI. This is because the algorithm skips over isolated sequences and treats them as a neutral character. A consumer of the string would have to take special steps, in this case, to uncover the first-strong character.

This approach is also only appropriate if it is acceptable to change the value of the string. In addition to possible issues such as changed string length or pointer positions, this approach runs the risk of one of the paired characters getting lost, either through handling errors, or through text truncation, etc.

A producer and a consumer of a string would need to recognise and handle a situation where a string begins with a paired formatting character but doesn't end with it because the formatting characters only describe a part of the string.

Unicode specifies a limit to the number of embeddings that are effective, and embeddings could build up over time to exceed that limit.

Consuming applications would need to recognise and appropriately handle the isolating formatting characters. At the moment such support for RLI/LRI/FSI is not pervasive.
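The special steps a consumer would need to take might look something like the following sketch. The function name and return convention are hypothetical. It reports a direction only when a single isolate spans the whole string, and otherwise returns the string untouched, which covers both the case where the formatting characters describe only part of the string and the case where the closing PDI has been lost.

```python
# Unicode isolate formatting characters
LRI, RLI, FSI, PDI = "\u2066", "\u2067", "\u2068", "\u2069"
_DIRECTION = {LRI: "ltr", RLI: "rtl", FSI: "auto"}


def unwrap_isolate(text):
    """Return (direction, inner_text) if one isolate wraps the whole string.

    Otherwise return (None, text). A direction of "auto" (from FSI) means
    the consumer still has to run first-strong detection on the inner text.
    """
    if len(text) < 2 or text[0] not in _DIRECTION:
        return None, text
    depth = 0
    for i, ch in enumerate(text):
        if ch in _DIRECTION:
            depth += 1
        elif ch == PDI:
            depth -= 1
            if depth == 0:
                # The opening isolate closes here; it must be the last character.
                if i == len(text) - 1:
                    return _DIRECTION[text[0]], text[1:-1]
                return None, text
    return None, text  # the matching PDI is missing, eg. after truncation
```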

Metadata

If it is possible to pass metadata with the string and the consumer knows how to retrieve the meaning of that metadata, this can provide a simple, effective and efficient method of communicating the intended base direction without affecting the actual content of the string.

Metadata not only removes the problem of whether or not, and how, to parse markup in a string to determine the direction, but even in the simplest strings, without markup, it avoids the need to inspect and run heuristics on the string to determine its base direction.

There needs to be metadata available for each individual string. Alternatively, metadata can be inherited, but some mechanism must be available to override the inherited direction for a particular string which differs in direction from the inherited value.

Metadata is probably most effective, however (especially for the original creator of the strings), if it is only passed with a string in those cases where first-strong detection is otherwise going to produce a wrong result. This would mean that consumers of strings should not only recognise the metadata, but should also expect to rely on first-strong heuristics for strings without metadata. It also means that producers of strings need to recognise situations where directional information is needed and set the metadata.
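One way a consumer might combine these sources of information is sketched below. The field names value, dir and lang, the function names, and the precedence order (per-string metadata, then inherited metadata, then first-strong) are assumptions made for illustration, not something this document defines. The first-strong check is simplified and ignores isolated runs and markup.

```python
import json
import unicodedata


def first_strong(text):
    """Simplified first-strong check; defaults to LTR if nothing strong is found."""
    for ch in text:
        bidi_class = unicodedata.bidirectional(ch)
        if bidi_class == "L":
            return "ltr"
        if bidi_class in ("R", "AL"):
            return "rtl"
    return "ltr"


def resolve_direction(text, own_dir=None, inherited_dir=None):
    """Pick a base direction: per-string metadata, then inherited, then heuristics."""
    if own_dir in ("ltr", "rtl"):
        return own_dir
    if inherited_dir in ("ltr", "rtl"):
        return inherited_dir
    return first_strong(text)


# Hypothetical record: metadata is supplied only where first-strong would fail.
entry = json.loads("""
{
    "title": {"value": "bidi בינלאומי", "dir": "rtl", "lang": "he"},
    "publisher": {"value": "Example Press"}
}
""")

title_dir = resolve_direction(entry["title"]["value"], entry["title"].get("dir"))
publisher_dir = resolve_direction(entry["publisher"]["value"],
                                  entry["publisher"].get("dir"))
```

The string values themselves are left unchanged, so string length and equality comparisons are unaffected by the direction information.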

Acknowledgements

The Internationalization (I18N) Working Group would like to thank the following contributors to this document: Felix Sasaki,