This document describes string searching operations on the Web in order to allow greater interoperability. String searching refers to natural language string matching such as the "find" command in a Web browser. This document builds upon the concepts found in Character Model for the World Wide Web 1.0: Fundamentals [[CHARMOD]] and Character Model for the World Wide Web 1.0: String Matching [[CHARMOD-NORM]] to provide authors of specifications, software developers, and content developers the information they need to describe and implement search features suitable for global audiences.
Caution: work in progress
This document is not being actively developed by the Internationalization Working Group. It was created to capture information about substring matching in [=natural language=] text that was off-topic in other I18N documents as well as to serve as a ready reference for conversations about the problems of substring matching on the Web.
Readers should not expect that the materials found here represent full guidance for the implementation or specification of "find" features.
This document describes the problems, requirements, and considerations for specification or implementations of string searching operations. A common example of string searching is the "find" command in a Web browser, but there are many other forms of searching that a specification might wish to define.
This document builds on Character Model for the World Wide Web: Fundamentals [[CHARMOD]] and Character Model for the Word Wide Web: String Matching [[CHARMOD-NORM]]. Understanding the concepts in those documents are important to being able to understand and apply this document successfully.
The main target audience of this specification is W3C specification developers who need to define some form of search or find algorithm: the goal is to provide a stable reference to the concepts, terms, and requirements needed.
The concepts described in this document provide authors of specifications, software developers, and content developers with a common reference for consistent, interoperable text searching on the World Wide Web. Working together, these three groups can build a globally accessible Web.
This document contains best practices and requirements for other specifications, as well as recommendations for implementations and content authors. These best practices for specifications (and others) can also be found in the Internationalization Working Group's document Internationalization Best Practices for Spec Developers [[INTERNATIONAL-SPECS]], which is intended to serve as a general reference for all Internationalization best practices in W3C specifications.
In this document [[RFC2119]] keywords in uppercase italics have their usual meaning. We also use these stylistic conventions:
Definitions appear with a different background color and decoration like this.
Best practices appear with a different background color and decoration like this.
Issues, gaps, and recommendations for future work appear with a different background color and decoration like this.
This section contains terminology specific to this document.
Much of the terminology needed to understand this document is provided by the Internationalization Glossary [[I18N-GLOSSARY]]. Some terms are also defined by [[CHARMOD-NORM]] and can be found in the Terminology and Notation section of that document.
Unicode, also known as the Universal Character Set, allows Web documents to be authored in any of the world's writing systems, scripts, or languages, on any computing platforms and then to be exchanged, read, and searched by the Web's users around the world. The first few chapters of the Unicode Standard [[Unicode]] provide useful background reading. Also see the Unicode Collation Algorithm [[UTS10]], which contains a chapter on searching.
Corpus The natural language text contained by a document or set of documents which the user would like to search.
Segmentation The process of breaking natural language text up into distinct words and phrases. This often includes operations such as "named entity recognition" (such as recognizing that the three word sequence Dr. Jonas Salk is a person's name).
Stemming A process or operation that reduces words to their "stem" or root. For example, the words runs, ran, and running all share the stem run. This is sometimes called (more formally) lemmatization and the stem is sometimes called the lemma.
Full-Text Search refers to searches that process the entire contents of the textual document or set of documents. Full-text queries perform linguistic searches against text data in full-text indexes by operating on words and phrases based on the rules of a particular language such as English or Japanese. Full-text queries can include simple words and phrases or multiple forms of a word or phrase.
Frequently this means that a full-text search employs indexes and natural language processing. When you are using a search engine, you are using a form of full text search. Full text search often breaks natural language text into words or phrases (this is called segmentation) and may apply complex processing to get at the semantic "root" values of words (this is called stemming). These processes are sensitive to language, context, and many other aspects of textual variation.
Natural Language Processing (NLP) refers to the domain of software designed to understand, process, and manipulate human languages (that is, natural language). This is a very wide ranging term. It can cover relatively simple problems, such as word tokenization, or more complex behaviors, such as deriving "meaning" from text, recognizing parts of speech, performing accurate translation, and much else.
Users of the Web often want to search for specific text in a document or collection of documents without having to read line-by-line. Specifications sometimes seek to support this desire by exposing text searching in the Web platform.
There are different types of document searching. One type, called a full text search, is the sort of searching most often found in applications such as a search engine. This type of searching is complex, can be resource intensive, and often depends on processes outside the scope of a given search request.
A more limited form of text search (and the topic of this document) is sub-string matching
. One familiar form of sub-string matching is the find
feature of browsers and other types of user-agent. For user agents with physical keyboards, this functionality is often accessed via a key combination such as Cmd+F or Ctrl+F. Such a feature might be exposed on the Web via the API window.find
, which is currently not fully standardized, or capabilities such as the proposed [[?SCROLL-TO-TEXT-FRAGMENT]].
Find operations can provide optional mechanisms for improving or tailoring the matching behavior. For example, the abilility to add (or remove) case sensitivity, whether the feature supports different aspects of a regular expression language such as wildcard characters, or whether to limit matches to whole words.
One way that sub-string matching usually differs from full-text search is that, while it might use various algorithms in an attempt to suppress or ignore textual variations, it usually does not produce matches that contain additional or unspecified character sequences, words, or phrases, such as would result from stemming or other NLP processes.
When attempting to standardize sub-string matching, specification authors often struggle with the complexity that is inherent in the encoding of natural language in computer systems, including the different mechanisms employed to encode characters in the [[Unicode]] standard.
Quite often, the user's input doesn't consist of exactly the same sequence of code points as that used in the document being searched, while the user still expects a match to occur. This can happen for a variety of reasons. Sometimes it is because the text being searched varies in ways the user could not have predicted. In other cases it is because the user's keyboard or input method does not provide ready access to the textual variations needed. It can even be because the user cannot be bothered to input the text accurately.
In this section, we examine various common cases known to us which specification authors need to take into consideration when specifying a sub-string match API or mechanism.
User expectations about whether their search term matches a given part of a document or [=corpus=] sometimes depends on the user's language, the language of the document, or both. It might also involve other factors, such as which keyboards or input methods are available on a given device. This might be because various operations that are part of searching, such as case folding, are locale-affected, or that, given the complexity of human language and culture, that expectations about matching or about the use and interpretation of various character sequences differs, even within a given script. Similarly, the handling of accents, alternate scripts, or character encoding (such as variations in the formation of grapheme clusters) is linked to the specific language of the text in question.
It is important to emphasize that we mean language here, and not script. Many different languages that share a script apply different processing or imply different expectations.
Implementations of a "find" feature often have to guess what language the user intended based solely on the user's input or on various "hints" in the runtime environment, such as the operating environment locale, the user agent's localization, or the language of the active keyboard. These hints are, at best, a proxy for the user's intent, particularly when the user is searching a document that doesn't match any of these or when the searched document contains more than one language.
A user might expect a term entered in lowercase to match uppercase equivalents (and perhaps vice-versa). Sub-string matching features, such as the browser "find" command, often offer a user-selectable option for matching (or not) the case of the input to that of the text.
For a survey of case folding, see the discussion here in [[CHARMOD-NORM]].
Unicode defines canonical and compatibility relationships between characters which can impact user perceptions of string searching. For a detailed discussion of Unicode Normalization forms see Section 2.2 of [[CHARMOD-NORM]] as well as the definitions found in Unicode Normalization Forms [[UAX15]].
In many complex scripts it is possible to encode letters or vowel-signs in more than one way, but the alternatives are canonically equivalent.
Some languages are written in more than one script. A user searching a document might type in text in one script, but wish to find equivalent text in both scripts.
Some compatibility characters were encoded into Unicode to account for single- or multibyte representation in legacy character encodings or for compatibility with certain layout behaviors in East Asian languages.
Many scripts have their own digit characters for the numbers from 0 to 9. In some Web applications, the familiar ASCII digits are replaced for display purposes with the local digit shapes. In other cases, the text actually might contain the Unicode characters for the local digits. Users attempting to search a document might expect that typing one form of digit will find the eqivalent digits.
Some languages have different orthographic traditions that vary by region or dialect or allow different spellings of the same word. Searches and spell-checking may need to know about these variations.
Indic script languages have many instances of this kind of problem. Sometimes these are spelling errors, but in other cases multiple spellings are acceptable.
For example, the Bengali language (language tag bn
) is notorious for having a wide range of spelling variations permitted by the language: nearly 80% of Bengali words have at least two spellings. Many words have 3, 4, or more variations—with at least one word having 16 different valid spellings.
Other Indic scripts provide alternative mechanisms for representing particular sounds, and in most cases either representation is considered equally valid. The most common instance of this involves representation of syllable-final nasals.
For example, the /n/ sound in the word for snake in Hindi can be written using either ँ [U+0901 DEVANAGARI SIGN CANDRABINDU] and ं [U+0902 DEVANAGARI SIGN ANUSVARA] Both of the following are possible valid spellings:
In an additional twist to this story, two diacritics with different code points could be used here. In our previous example we used ं [U+0902 DEVANAGARI SIGN ANUSVARA ] to represent the nasal sound because the accompanying vowel-sign rises above the hanging baseline. If the vowel-sign was one that didn't rise above the hanging baseline, we would normally use ँ [U+0901 DEVANAGARI SIGN CANDRABINDU ] instead. The function of both of these diacritics is the same, but their code points are different.
The alternative use of either a letter or a diacritic for syllable-final nasals is common to several other Indian languages. In addition to Devanagari, used to write languages such as Hindi (language tag hi
) or Marathi (language tag mr
, scripts such as Malayalam, Gujarati, Odia, and others provide similar spelling options.
Some languages use whitespace to separate words, sentences, or paragraphs while others do not. When performing sub-string matching, different forms of whitespace found in [[Unicode]] must be normalized so that the match succeeds.
Users will sometimes vary their input when dealing with letters that contain accents or diacritic marks when entering search terms in scripts (such as the Latin script) that use various diacritics, even though the text they are searching includes the additional marks. This is particularly true on mobile keyboards, where input of these characters can require additional effort. In these cases, users generally expect the search operation to be more "promiscuous" to make up for their failure to make the additional effort needed.
This effect might vary depending on context as well. For example, a person using a physical keyboard may have direct access to accented letters, while a virtual or on-screen keyboard may require extra effort to access and select the same letters.
In some orthographies it is necessary to match strings with different numbers of characters.
A prime example of this involves vowel diacritics in abjads. For example, some languages that use the Arabic and Hebrew scripts do not require (but optionally allow) the user to input short vowels. (For some other languages in these scripts, the inclusion of the short vowels is not optional.) The presence or absence of vowels in the text being input or searched might impede a match if the user doesn't enter or know to enter them.
In some cases, visually similar or identical glyph patterns can be made from different sequences of code points. Sometimes this is intentional and variations can be removed via Unicode normalization. But there are other cases in which similar-appearing [= graphemes =] are not made the same by normalisation, and they are not semantically equivalent.
Some languages which use the Arabic script also have [= graphemes =] which can be encoded in more than one way. In some cases, these variations are handled by Unicode Normalization, but in other cases they are not considered equivalent by Unicode, even if they appear visually to be identical. Sometimes these variations are considered to be valid spelling variations. In other cases they are the result of user's mistaken perception.
Some languages, such as English or Arabic, use spaces between words. Other languages, such as Chinese, Japanese, or Thai, do not. Some language use spaces to separate other text units, such as phrases. In those languages that do not use spaces between words, computing "whole word" matching often depends on the ability to determine word boundaries when the boundaries are not themselves encoded into the text.
This section was identified as a new area needing document as part of the overall rearchitecting of the document. The text here is incomplete and needs further development. Contributions from the community are invited.
Implementers often need to provide simple "find text" algorithms and specifications often try to define APIs to support these needs. Find operations on text generate different user expectations and thus have different requirements from the need for absolute identity matching needed by document formats and protocols. It is important to note that domain-specific requirements may impose additional restrictions or alter the considerations presented here.
Increasing input effort from the user SHOULD be mirrored by more selective matching.
When the user expends more effort on the input—by using the shift key to produce uppercase or by entering a letter with diacritics instead of just the base letter—they might expect their search results to match (only) their more-specific input.
When creating a string search API or algorithm, the following textual options might be useful to users:
The W3C Internationalization Working Group and Interest Group, as well as others, provided many comments and suggestions. The Working Group would like to thank: all of the contributors to the Character Model series of documents over the many years of their development.
The examples in this example were taken from a page authored by Henri Sivonen, as were a number of concepts and ideas recorded by him in this issue.