Step 4: Editing a model definition file
We have exported our wordlist to wordlist.tsv. We now need to tell the lexical model compiler how to turn this raw word list into a lexical model that is quick to use on a smartphone. To do this, we must create a model definition file. This is a small TypeScript source code file that tells the compiler where to find the word list file, and optionally describes a little more about our language's spelling system, or orthography.
The model definition template
Keyman Developer will provide you with a model definition similar to the following. If you want to create the file yourself, copy and paste the following template, and save it as model.ts. Place this file in the same folder as wordlist.tsv.
/*
  sencoten 1.0 generated from template.
  This is a minimal lexical model source that uses a tab delimited wordlist.
  See documentation online at https://help.keyman.com/developer/ for
  additional parameters.
*/
const source: LexicalModelSource = {
  format: 'trie-1.0',
  sources: ['wordlist.tsv'],
};
export default source;
Let's step through this file, line-by-line.
On the first line, we're declaring the source code of a new lexical model.
const source: LexicalModelSource = {
On the second line, we're saying the lexical model will use the trie-1.0 format. The trie format creates a lexical model from one or more word lists; the trie structures the lexical model so that it can make predictions from thousands of words very quickly.
format: 'trie-1.0',
On the third line, we're telling the trie where to find our wordlist.
sources: ['wordlist.tsv'],
The fourth line marks the end of the lexical model source object. If we specify any customizations, they must be declared above this line:
};
The fifth line is necessary to allow external applications to read the lexical model source code.
export default source;
Customizing our lexical model
The template, as described in the previous section, is a good starting point, and may be all you need for your language. However, most languages require a few customizations. The trie model supports the following customizations:
- word breaking: how to determine when words start and end in the writing system
- search term to key: how and when to ignore accents and letter case
Word breaking
The trie family of lexical models needs to know what a word is in running text. In languages using the Latin script, such as English, French, and SENĆOŦEN, finding words is easy: words are separated by spaces or punctuation. The actual rules for where to find words can get quite tricky to describe, but Keyman implements the Unicode Standard Annex #29 §4.1 Default Word Boundary Specification, which works well for most languages.
However, in languages written in other scripts, especially East and Southeast Asian scripts such as Chinese, Japanese, Khmer, Lao, and Thai, there is no obvious break between words. For these languages, there must be special rules for determining where words start and stop. This is what a word breaking function is responsible for: it is a little bit of code that looks at some text to determine where the words are.
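As a concrete illustration, here is a minimal sketch of a model definition with a custom word breaking function. It assumes your version of the lexical model compiler accepts a wordBreaker property whose value is a function that takes the text and returns an array of spans (with start, end, length, and text properties); check the Keyman Developer reference for the exact interface. This toy breaker treats every run of non-space characters as one word, so it mainly shows where such a function plugs in; a real breaker for a script without spaces would need language-specific rules, such as dictionary-based segmentation.

const source: LexicalModelSource = {
  format: 'trie-1.0',
  sources: ['wordlist.tsv'],
  // Sketch only: treat each run of non-whitespace characters as one word.
  wordBreaker: function (phrase: string) {
    const spans: { start: number; end: number; length: number; text: string }[] = [];
    const pattern = /\S+/g;
    let match: RegExpExecArray | null;
    // Find each run of non-whitespace characters in the phrase.
    while ((match = pattern.exec(phrase)) !== null) {
      spans.push({
        start: match.index,                  // index of the word's first character
        end: match.index + match[0].length,  // index just past its last character
        length: match[0].length,
        text: match[0],
      });
    }
    return spans;
  },
};
export default source;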
Search term to key
To look up words quickly, the trie model creates a search key that takes the latest word (as determined by the word breaker) and converts it into a “regular” form. The purpose of this “regular” form is to make searching for a word work regardless of things such as accents, diacritics, letter case, and minor spelling variations.
This “regular” form is called the key. Typically, the key is in lowercase and lacks all accents and diacritics. For example, the key form of “naïve” is “naive”, and the key form of “Canada” is “canada”.
The form of the word that is stored is “regularized” through the use of a key function, which you can define in TypeScript code.
The key function takes a string (the raw search term) and returns a string: the “regular” key. As an example, consider the default key function, which is used if you do not specify one:
searchTermToKey: function (term) {
  // Use this pattern to remove common diacritical marks (combining characters).
  // See: https://www.compart.com/en/unicode/block/U+0300
  const COMBINING_DIACRITICAL_MARKS = /[\u0300-\u036f]/g;

  // Convert to Unicode Normalization Form D.
  // This means that MOST accents and diacritics are "decomposed" and
  // stored as separate characters. We can then remove these separate
  // characters!
  //
  // e.g., Å → A + ˚
  let normalizedTerm = term.normalize('NFD');

  // Now, make it lowercase.
  //
  // e.g., A + ˚ → a + ˚
  let lowercasedTerm = normalizedTerm.toLowerCase();

  // Now, using the pattern above, replace each accent and diacritic with the
  // empty string. This effectively removes all accents and diacritics!
  //
  // e.g., a + ˚ → a
  let termWithoutDiacritics = lowercasedTerm.replace(COMBINING_DIACRITICAL_MARKS, '');

  // The resultant key is lowercased, and has no accents or diacritics.
  return termWithoutDiacritics;
},
This should be sufficient for most Latin-based writing systems. However, there are cases, such as with SENĆOŦEN, where some characters do not decompose into a base letter and a diacritic. In this case, it is necessary to write your own key function.
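Here is one possible sketch of such a key function. The specific character mappings are illustrative only: letters with a stroke, such as Ŧ (U+0166) and Ȼ (U+023B), have no NFD decomposition, so they are folded to a base letter by hand before the usual normalization steps. Choose the mappings appropriate for your own orthography.

const source: LexicalModelSource = {
  format: 'trie-1.0',
  sources: ['wordlist.tsv'],
  searchTermToKey: function (term: string): string {
    // Illustrative mappings only: these lowercase letters do not decompose
    // into a base letter plus a combining diacritic under NFD.
    const CUSTOM_FOLDINGS: { [letter: string]: string } = {
      'ŧ': 't',  // U+0167: no decomposition into t + diacritic
      'ȼ': 'c',  // U+023C: no decomposition into c + diacritic
    };

    return term
      .toLowerCase()
      // Fold the non-decomposable letters by hand first...
      .replace(/[ŧȼ]/g, (letter) => CUSTOM_FOLDINGS[letter])
      // ...then strip the accents and diacritics that DO decompose.
      .normalize('NFD')
      .replace(/[\u0300-\u036f]/g, '');
  },
};
export default source;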
Once customization is done
We may still want to make a few tweaks, but first we need to actually build and test our lexical model. This is discussed in the next step.