Easy to use, cross-platform toolkit to train argos-translate models, which can be used by LibreTranslate 🚂
It can also convert pre-trained Opus-MT models.
- Python >= 3.8
- NVIDIA CUDA graphics card (not required, but highly recommended)
git clone https://github.com/LibreTranslate/Locomotive --depth 1
cd Locomotive
pip install -r requirements.txt
Language models can be trained by providing lots of example translations from a source language to a target language. All you need to get started is a set of two files (source
and target
). The source file containing sentences written in the source language and a corresponding file with sentences written in the target language.
For example:
source.txt
:
Hello
I'm a train!
Goodbye
target.txt
:
Hola
¡Soy un tren!
Adiós
You'll need a few million sentences to train decent models, and at least ~100k sentences to get some results. OPUS has a good collection of datasets to get started. You can also use any of the data sources listed on the argos-train index. Also check NLLU.
Place source.txt
and target.txt
files in a folder (e.g. mydataset-en_es
) of your choice:
mydataset-en_es/
├── source.txt
└── target.txt
Create a config.json
file specifying your sources:
{
"from": {
"name": "English",
"code": "en"
},
"to": {
"name": "Spanish",
"code": "es"
},
"version": "1.0",
"sources": [
"file://D:\\path\\to\\mydataset-en_es",
"opus://Ubuntu",
"http://data.argosopentech.com/data-ccaligned-en_es.argosdata"
]
}
Note you can specify, local folders (using the file://
prefix), internet URLs to .zip archives (using the http://
or https://
prefix) or OPUS datasets (using the opus://
prefix). For a complete list of OPUS datasets, see OPUS.md and note that they are case-sensitive.
Then run:
python train.py --config config.json
Training can take a while and depending on the size of datasets can require a graphics card with lots of memory.
The output will be saved in run/[model]/translate-[from]_[to]-[version].argosmodel
.
If you're running out of CUDA memory, decrease the batch_size
parameter, which by default is set to 8192
:
{
"from": {
"name": "English",
"code": "en"
},
"to": {
"name": "Spanish",
"code": "es"
},
"version": "1.0",
"sources": [
"file://D:\\path\\to\\mydataset-en_es",
"http://data.argosopentech.com/data-ccaligned-en_es.argosdata"
],
"batch_size": 2048
}
Once you have trained a model from source => target
, you can easily train a reverse model target => source
model by passing --reverse
:
python train.py --config config.json --reverse
TensorBoard allows tracking and visualizing metrics such as loss and accuracy, visualizing the model graph and other features. You can enable tensorboard with the --tensorboard
option:
python train.py --config config.json --tensorboard
The model is generated using sensible default values. You can override the default configuration by adding values directly to your config.json
. For example, to use a smaller dictionary size, add a vocab_size
key in config.json
:
{
"from": {
"name": "English",
"code": "en"
},
"to": {
"name": "Spanish",
"code": "es"
},
"version": "1.0",
"sources": [
"file://D:\\path\\to\\mydataset-en_es",
"http://data.argosopentech.com/data-ccaligned-en_es.argosdata"
],
"vocab_size": 30000
}
Locomotive provides various filters, transforms and augmenters which can be used to dynamically cleanup, modify and augment the input sources before training:
{
"filters": [
"duplicates",
{"source_target_ratio": {"min": 0.6, "max": 1.5}}
],
"transforms":[
"remove_unpaired_quotes_and_brackets"
],
"augmenters":[
"single_word_punctuation"
],
"sources": [
{
"source": "file://D:\\path\\to\\mydataset-en_es",
"filters": [
{"char_length": {"min": 20}}
]
}
]
}
Filters, transforms and augmenters can be specified globally (applied to all sources) as well as per-source (applied only to the specified source).
It's possible to specify weights for each source, for example, it's possible to instruct the training to use less samples for certain datasets:
{
"sources": [
{"source": "file://D:\\path\\to\\mydataset-en_es", "weight": 1},
{"source": "http://data.argosopentech.com/data-ccaligned-en_es.argosdata", "weight": 5}
]
}
In the example above, 1 sample will be taken from mydataset and 5 will will be taken from CCAligned.
Specifying weights disables filtering, transformations and augmentations. The datasets are used as-is. No merging or shuffling is performed either. A weight of 1 can be used to instruct Locomotive to not preprocess a source.
You can evaluate the model by running:
python eval.py --config config.json
Starting interactive mode
(en)> Hello!
(es)> ¡Hola!
(en)>
You can also compute BLEU scores against the flores200 dataset for the model by running:
python eval.py --config config.json --bleu
BLEU score: 45.12354
Locomotive provides a convenient script to convert pre-trained models from OPUS-MT to make them compatible with LibreTranslate:
python opus_mt_convert.py -s en -t it
This will attempt to automatically find/download the OPUS-MT's model archive from https://github.com/Helsinki-NLP/OPUS-MT-train/tree/master/models/ or https://github.com/Helsinki-NLP/Tatoeba-Challenge/tree/master/models/. This doesn't always work, and will not always pick the best model. You can specify a model archive manually by using the --model-url
parameter:
Some models also need a beginning of sentence (BOS) token for the model to work. You can specify a BOS token by using the --bos
parameter:
python opus_mt_convert.py -s en -t vi --model-url https://object.pouta.csc.fi/Tatoeba-MT-models/eng-vie/opus+bt-2021-04-10.zip --bos ">>vie<<"
To run evaluation:
python eval.py --config run/en_it-opus_1.0/config.json
The script is experimental. If you find issues, feel free to open a pull request!
Some models fail to execute with int8 quantization. If you get a lot of repeated words, try to set -q float32
to keep full precision.
Want to share your model with the world? Post it on community.libretranslate.com and we'll include in future releases of LibreTranslate. Make sure to share both a forward and reverse model (e.g. en => es
and es => en
), otherwise we won't be able to include it in the model repository.
We also welcome contributions to Locomotive! Just open a pull request.
To install the resulting .argosmodel file, locate the ~/.local/share/argos-translate/packages
folder. On Windows this is the %userprofile%\.local\share\argos-translate\packages
folder. Then create a [from-code]_[to-code]
folder (e.g. en_es
). If it already exists, delete or move it.
Extract the contents of the .argosmodel file (which is just a .zip file, you might need to change the extension to .zip) into this folder. Then restart LibreTranslate.
You can also install .argosmodel packages from Python:
import pathlib
import argostranslate.package
package_path = pathlib.Path("/root/translate-en_it-2_0.argosmodel")
argostranslate.package.install_from_path(package_path)
In no particular order, we'd like to thank:
For making Locomotive possible.
AGPLv3