Language identification of linguistic examples
This utility identifies the subject language of linguistic examples, specifically for use in the ODIN data-acquisition pipeline. Unlike common language identification methods that look for characteristic n-grams of characters or words, this tool attempts to identify languages for which there might be little or no other example data. This is accomplished by looking for language mentions in the context of the examples. If a language name is mentioned in the document containing the example, it is a candidate for the subject language of the example. Other features, such as the proximity of language mentions to the example, or n-gram matches to a language model (if such a model exists) also contribute to the determination.
In addition to accounting for a lack of language model data, this package also does not assume any supervised training data. In order to make a classifier that can predict a language when no positive instances of the language have been seen, we generalize the problem so the classifier predicts the probability that some mentioned language is associated with the current example. The language with the highest probability is then chosen.
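The selection step described above can be illustrated with a minimal sketch: score every language mentioned in the document and pick the argmax. The function name and the example scores below are hypothetical; the real classifier derives its probabilities from trained feature weights.

```python
# Minimal sketch of the selection step (hypothetical names; not lgid's
# actual classifier). Each mentioned language gets a probability of
# being the example's subject language, and the mention with the
# highest probability is chosen.

def pick_language(mention_scores):
    """mention_scores: {(language_name, iso_code): probability}"""
    if not mention_scores:
        return None  # no languages were mentioned in the document
    return max(mention_scores, key=mention_scores.get)

scores = {("swahili", "swh"): 0.82, ("english", "eng"): 0.11}
print(pick_language(scores))  # -> ('swahili', 'swh')
```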
The `lgid` package includes functionality for building resources from training data, listing language mentions in a document, training models, classifying data, and testing the trained models.
The basic requirements are probably satisfied by any modern Linux or Mac setup:
- bash
- git
- Python (3.3 or higher)
- virtualenv
First, clone the repository:

```
~$ git clone https://github.com/xigt/lgid.git
```
If you wish to install a pre-trained model and ODIN language model files alongside the code, you can download the latest `code_and_data.zip` file from the releases page and unzip it instead of cloning the git repository. Alternatively, clone the git repository, then download the latest `data_only.zip` file from the releases page and unzip it in the cloned directory.
After cloning this repository, run the `setup-env.sh` script to create a virtual environment and download the necessary dependencies into it:

```
~$ cd lgid/
~/lgid$ bash setup-env.sh
```
The script installs the project's Python dependencies into this environment. When it completes, run `lgid.sh` as the front end to all tasks, as it manages the activation/deactivation of the virtual environment:
```
~/lgid$ ./lgid.sh
Usage:
  lgid [-v...] train --model=PATH [--vectors=DIR] CONFIG INFILE...
  lgid [-v...] test --model=PATH [--vectors=DIR] CONFIG INFILE...
  lgid [-v...] validate --model=PATH [--vectors=DIR] CONFIG INFILE...
  lgid [-v...] classify --model=PATH --out=PATH [--vectors=DIR] CONFIG INFILE...
  lgid [-v...] get-lg-recall CONFIG INFILE...
  lgid [-v...] list-model-weights --model=PATH CONFIG
  lgid [-v...] list-mentions CONFIG INFILE...
  lgid [-v...] count-mentions CONFIG INFILE...
  lgid [-v...] find-common-codes CONFIG INFILE...
  lgid [-v...] download-crubadan-data CONFIG
  lgid [-v...] build-odin-lm CONFIG
```
Try `lgid.sh --help` for more usage information.
Test the performance of the prebuilt model on the included sample input:

```
./lgid.sh -v test --model=model/sample_model config.ini sample/input/*
```
Classify the sample input and output the files with predicted languages marked:

```
./lgid.sh -v classify --model=model/sample_model --out=sample/output config.ini sample/input/*
```
Here is an overview of the important files in the repository:

```
lgid
├── setup-env.sh   # first-time setup
├── lgid.sh        # main script for running the program
├── config.ini     # configuration parameters
├── lgid           # source code
│   ├── analyzers.py  # functions for extracting info from input documents
│   ├── buildlms.py   # functions for building language model files out of ODIN data
│   ├── features.py   # functions for activating model features
│   ├── main.py       # main process; entry point for the lgid.sh command
│   ├── models.py     # abstraction of the machine learning model(s)
│   └── util.py       # utility functions for reading/transforming resources
├── res            # resources for running the program
│   ├── Crubadan.csv      # index file for downloading Crúbadán data
│   ├── lang_table.txt    # language name-code mapping
│   ├── common_codes.txt  # code most commonly paired with each language name
│   ├── english_word_language_names.txt  # list of language names that are English words
│   ├── crubadan_directory_index.csv     # which directory holds Crúbadán data for each language
│   ├── language_index.txt  # table of each language and its ID
│   ├── word_index.txt      # table of each word present in a language name and its ID
│   └── word_language_mapping.txt  # which languages (in ID form) each language-name word appears in
├── sample         # results from sample runs
└── test           # files for testing the program
    ├── mentions_gold_output.txt         # gold-standard output of list-mentions on mentions_test.freki
    ├── mentions_single_gold_output.txt  # gold-standard output of list-mentions on mentions_test.freki with `single-longest-mention = yes`
    └── mentions_test.freki              # freki file for testing list-mentions
```
In the repository, the `lgid/` subdirectory contains all code for data analysis, model building, and document classification. The `res/` subdirectory contains resource files used in model building. Only static files that we have rights to, like the language name-code mapping table, should be checked in. Other resources, like the compiled language model or Crúbadán data, may reside here on a local machine, but they should not be committed to the remote repository.
All of the commands that take `INFILE` as an argument expect the file or files to be in Freki format. The `classify` command produces Freki files as output. The `build-odin-lm` command expects its input files (whose location is specified in the config file) to be in Xigt format, version 2.1.
The ODIN language model files have one n-gram per line, in the format `<ngram>\t<count>`. No special symbols mark the beginning or end of a line. Each file contains n-grams for all values of n: 1-3 for characters and 1-2 for words. The morpheme language models are built using the word data.
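Based on the format described above, reading an ODIN language model file reduces to splitting each line on its last tab. This is a hedged sketch, not lgid's own loader; the function name `load_odin_lm` is hypothetical.

```python
# Sketch of a loader for an ODIN LM file: one n-gram per line, the
# n-gram and its count separated by a tab, with no special symbols for
# the beginning or end of a line. (Hypothetical helper, not lgid's API.)
from collections import Counter

def load_odin_lm(path):
    """Return a Counter mapping each n-gram string to its count."""
    counts = Counter()
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            if not line:
                continue
            # word n-grams may contain spaces, so split on the last tab
            ngram, count = line.rsplit("\t", 1)
            counts[ngram] = int(count)
    return counts
```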
The Crúbadán language model files have one n-gram per line, in the format `<ngram> <count>`. The `\n` character indicates the beginning or end of a line for word n-grams, and the `<` and `>` characters mark the beginning and end of a word, respectively, for character n-grams. Each file contains n-grams for only one value of n; the Crúbadán language models have only trigrams for characters, and both unigrams and bigrams for words.
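Because word bigrams contain an internal space while the count is always the last space-separated field, a Crúbadán file can be parsed by splitting each line on its last space. Again a hedged sketch with a hypothetical function name, not lgid's own code.

```python
# Sketch of a loader for a Crúbadán LM file: one n-gram per line, with
# the count as the last space-separated field. Word bigrams contain an
# internal space, so split from the right. (Hypothetical helper.)

def load_crubadan_lm(path):
    """Return a dict mapping each n-gram string to its count."""
    counts = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            if not line:
                continue
            ngram, count = line.rsplit(" ", 1)
            # for character trigrams, "<th" means "th" at word start
            # and "he>" means "he" at word end
            counts[ngram] = int(count)
    return counts
```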
The `config.ini` file contains parameters for managing builds of the model. The `[locations]` section contains paths for finding various resources:
| location | description |
|---|---|
| lgid-dir | location of the system files on disk |
| language-table | location of the master language table |
| most-common-codes | location of the table of most common codes for each language |
| english-word-names | location of the list of languages whose names are also English words |
| word-index | location of the file mapping words to IDs |
| language-index | location of the file mapping languages to IDs |
| word-language-mapping | location of the file mapping words to the languages they appear in |
| odin-source | location of the Xigt files from which ODIN language models are built |
| odin-language-model | directory containing the ODIN language model files |
| crubadan-index | location of the file containing the index and download location for Crúbadán language model data |
| crubadan-base-uri | base URL from which Crúbadán data files are downloaded |
| crubadan-language-model | directory containing the Crúbadán language model |
| crubadan-directory-index | location of the table tracking the Crúbadán language model directory for each language |
| classify-error-file | text file to which language-id errors are written |
The `[parameters]` section contains parameters for modifying the behavior of feature functions. The available parameters are described below:
| parameter | description |
|---|---|
| window-size | number of lines before an IGT to consider |
| after-window-size | number of lines after an IGT to consider |
| close-window-size | smaller window before an IGT |
| after-close-window-size | smaller window after an IGT |
| word-n-gram-size | number of tokens in word-LM n-grams |
| morpheme-n-gram-size | number of tokens in morpheme-LM n-grams |
| character-n-gram-size | number of characters in character-LM n-grams |
| crubadan-char-size | number of characters in Crúbadán character-LM n-grams |
| crubadan-word-size | number of tokens in Crúbadán word-LM n-grams |
| morpheme-delimiter | regular expression for tokenizing morphemes |
| frequent-mention-threshold | minimum window mentions to be "frequent" |
| after-frequent-mention-threshold | minimum mentions after an IGT to be "frequent" |
| article-frequent-mention-threshold | minimum mentions in the document to be "frequent" |
| mention-capitalization | case normalization for language mentions |
| short-name-size | a language name shorter than or equal to this length is flagged, as very short names are often false-positive mentions |
| single-longest-mention | return all mentions in a given span or only the single longest one; `yes` for on, anything else for off |
| code-only-odin-lms | use ODIN language models for code+language or just the code; `yes` for on, anything else for off |
The `[features]` section has boolean flags for turning specific features on or off. The value `yes` turns a feature on; any other value turns it off. The available features are:
| feature name | description | notes |
|---|---|---|
| GL-first-lines | language mentioned in the first window of the document | |
| GL-last-lines | language mentioned in the last window of the document | |
| GL-frequent | language mentioned N+ times in the document | |
| GL-most-frequent | language is the most frequently mentioned one | |
| GL-most-frequent-code | code is the most frequent one paired with the language | |
| GL-possible-english-word | language name is possibly an English word or name | |
| GL-short-lang-name | language name is shorter than short-name-size, and may be a false positive because it occurs as a word in some language | |
| GL-is-english | language is English | |
| GL-multi-word-name | language name is multiple words | |
| W-prev | language mentioned within the IGT's preceding window | |
| W-close | language mentioned within a smaller preceding window | |
| W-closest | language is closest to the IGT in the preceding window | |
| W-frequent | language mentioned N+ times in the preceding window | |
| W-after | language mentioned within the IGT's following window | |
| W-close-after | language mentioned within a smaller following window | |
| W-closest-after | language is closest to the IGT in the following window | |
| W-frequent-after | language mentioned N+ times in the following window | |
| L-in-line | language mentioned in the IGT's language line | |
| G-in-line | language mentioned in the IGT's gloss line | |
| T-in-line | language mentioned in the IGT's translation line | |
| M-in-line | language mentioned in the IGT's meta lines | |
| L-LMw | more than M% of word n-grams occur in the training data, using ODIN data | |
| L-LMm | more than M% of morpheme n-grams occur in the training data, using ODIN data | |
| L-LMc | more than M% of character n-grams occur in the training data, using ODIN data | |
| L-CR-LMw | same as L-LMw, but using Crúbadán data | |
| L-CR-LMc | same as L-LMc, but using Crúbadán data | |
| G-overlap | at least M% of gloss tokens occur in the training data | not implemented |
| W-prevclass | language is predicted for the previous IGT | not implemented |
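Putting the three sections together, a `config.ini` might look like the fragment below. The section and option names come from the tables above; the values and the selection of options shown are purely illustrative, so consult the `config.ini` shipped with the repository for the real defaults.

```ini
[locations]
lgid-dir = .
language-table = res/lang_table.txt
most-common-codes = res/common_codes.txt
english-word-names = res/english_word_language_names.txt
word-index = res/word_index.txt
language-index = res/language_index.txt
word-language-mapping = res/word_language_mapping.txt
crubadan-index = res/Crubadan.csv
crubadan-directory-index = res/crubadan_directory_index.csv

[parameters]
; values here are examples only
window-size = 10
after-window-size = 5
frequent-mention-threshold = 3
single-longest-mention = yes

[features]
GL-first-lines = yes
GL-frequent = yes
W-prev = yes
L-in-line = yes
G-overlap = no
```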
Note that the features have prefixes that group them into categories. The categories are:
| feature prefix | description |
|---|---|
| GL- | feature is relevant globally |
| W- | feature is relevant within a window |
| L- | feature is relevant for the language line |
| G- | feature is relevant for the gloss line |
| T- | feature is relevant for the translation line |
| M- | feature is relevant for a meta line |