Search term to key
To look up words quickly, the trie model creates a search key: it takes the latest word (as determined by the word breaker) and converts it into an internal form. The purpose of this internal form is to make searching for a word work as expected, regardless of accents, diacritics, letter case, and minor spelling variations.
The internal form is called the key. Typically, the key is in lowercase and lacks all accents and diacritics. For example, the key for “naïve” is naive and the key for “Canada” is canada.
The form of the word that is stored is “regularized” through the use of a key function, which you can define in TypeScript code.
Note: this function runs both on every word when the wordlist is compiled and on the input, whenever a suggestion is requested. This way, whatever a user types is matched to something stored in the lexical model, without the user having to type things in a specific way.
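The following is a minimal sketch of why applying the same key function at compile time and at lookup time makes matching work. It is not how the trie is implemented: a plain Map stands in for the trie, and the names toKey, buildIndex, and lookup are hypothetical helpers invented for this illustration.

```typescript
// Hypothetical, simplified key function: lowercase, no combining diacritics.
function toKey(term: string): string {
  return term.normalize('NFD').toLowerCase().replace(/[\u0300-\u036f]/g, '');
}

// "Compile" step: file every stored word form under its key.
// A Map is used here only as a stand-in for the real trie.
function buildIndex(wordlist: string[]): Map<string, string[]> {
  const index = new Map<string, string[]>();
  for (const word of wordlist) {
    const key = toKey(word);
    const entries = index.get(key) || [];
    entries.push(word);
    index.set(key, entries);
  }
  return index;
}

// "Suggestion" step: convert the typed text with the SAME key function,
// then look the key up to recover the stored word forms.
function lookup(index: Map<string, string[]>, typed: string): string[] {
  return index.get(toKey(typed)) || [];
}

const index = buildIndex(['na\u00EFve', 'Canada']); // "naïve", "Canada"
console.log(lookup(index, 'NAIVE'));  // finds the stored form "naïve"
console.log(lookup(index, 'canada')); // finds the stored form "Canada"
```

Because both sides pass through toKey, the user can type "NAIVE" or "naive" and still reach the stored form with its original accents and capitalization.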
The key function takes a string which is the raw search term, and returns a new string, being the “regularized” key. As an example, consider the default key function; that is, the key function that is used if you do not specify one:
searchTermToKey: function (term: string): string {
  // Use this pattern to remove common diacritical marks.
  // See: https://www.compart.com/en/unicode/block/U+0300
  const COMBINING_DIACRITICAL_MARKS = /[\u0300-\u036f]/g;

  // Converts to Unicode Normalization form D.
  // This means that MOST accents and diacritics have been "decomposed" and
  // are stored as separate characters. We can then remove these separate
  // characters!
  //
  // e.g., Å → A + ˚
  let normalizedTerm = term.normalize('NFD');

  // Now, make it lowercase.
  //
  // e.g., A + ˚ → a + ˚
  let lowercasedTerm = normalizedTerm.toLowerCase();

  // Now, using the pattern above, replace each accent and diacritic with the
  // empty string. This effectively removes all accents and diacritics!
  //
  // e.g., a + ˚ → a
  let termWithoutDiacritics = lowercasedTerm.replace(COMBINING_DIACRITICAL_MARKS, '');

  // The resultant key is lowercased, and has no accents or diacritics.
  return termWithoutDiacritics;
},
This should be sufficient for most Latin-based writing systems. However, there are cases, such as with SENĆOŦEN, where some characters do not decompose into a base letter and a diacritic. In this case, it is necessary to write your own key function.
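As a rough sketch of such a custom key function, the example below runs the same steps as the default and then folds a few letters that have no Unicode decomposition. The mapping table and the name customSearchTermToKey are hypothetical and chosen only for illustration; which letters to fold, if any, is a decision for the language community and is not prescribed here.

```typescript
// Hypothetical table: lowercase letters with no canonical decomposition,
// each mapped to the base letter used in the key.
const NON_DECOMPOSING: { [letter: string]: string } = {
  '\u0167': 't', // ŧ (t with stroke)
  '\u023C': 'c', // ȼ (c with stroke)
  '\u019A': 'l', // ƚ (l with bar)
  '\u2C65': 'a', // ⱥ (a with stroke)
};

function customSearchTermToKey(term: string): string {
  const COMBINING_DIACRITICAL_MARKS = /[\u0300-\u036f]/g;

  // Same steps as the default: decompose, lowercase, drop combining marks.
  const base = term
    .normalize('NFD')
    .toLowerCase()
    .replace(COMBINING_DIACRITICAL_MARKS, '');

  // Extra step: replace the letters that NFD cannot decompose.
  return base.replace(/[\u0167\u023C\u019A\u2C65]/g, (ch) => NON_DECOMPOSING[ch]);
}

// Ć decomposes (C + combining acute), but Ŧ does not, so it needs the table.
console.log(customSearchTermToKey('SEN\u0106O\u0166EN')); // "SENĆOŦEN" as input
```

With the default steps alone, the Ŧ in the input would survive into the key as ŧ; the lookup table is what turns it into a plain t.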
Use in your model definition file
To use this in your model definition file, provide a function as the searchTermToKey property of the lexical model source:
const source: LexicalModelSource = {
  format: 'trie-1.0',
  sources: ['wordlist.tsv'],
  searchTermToKey: function (wordform: string): string {
    // Your searchTermToKey function goes here!
    let key = wordform.toLowerCase();
    return key;
  },
  // other customizations go here:
};

export default source;
Suggested customizations
- For all writing systems, normalize into NFC or NFD form using wordform = wordform.normalize('NFC').
- For Latin-based scripts, lowercase the word, and remove diacritics.
- For scripts that use the U+200C zero-width non-joiner (ZWNJ) and/or the U+200D zero-width joiner (ZWJ) (e.g., Brahmic scripts), remove the ZWJ or ZWNJ from the end of the input with
wordform = wordform.replace(/[\u200C\u200D]+$/
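The customizations listed above can be combined into one key function. The sketch below is one possible arrangement, not a recommended default: the name exampleSearchTermToKey is hypothetical, and each step should be kept or dropped depending on the writing system (for instance, the diacritic-removal step is aimed at Latin-based scripts).

```typescript
function exampleSearchTermToKey(wordform: string): string {
  // Remove any run of ZWNJ (U+200C) or ZWJ (U+200D) at the END of the input.
  // Joiners inside the word are left untouched.
  let key = wordform.replace(/[\u200C\u200D]+$/, '');

  // Decompose, lowercase, and strip combining diacritical marks
  // (mainly useful for Latin-based scripts).
  key = key.normalize('NFD').toLowerCase().replace(/[\u0300-\u036f]/g, '');

  // Recompose so that every key is stored in one consistent form (NFC).
  return key.normalize('NFC');
}

console.log(exampleSearchTermToKey('Na\u00EFve'));          // "Naïve" as input
console.log(exampleSearchTermToKey('\u0915\u094D\u200D'));  // KA + VIRAMA + trailing ZWJ
```

Stripping only trailing joiners matters because a user may have typed a joiner in anticipation of the next character; a joiner in the middle of a word can be meaningful and is kept in this sketch.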