Translate Word Documents with NLLB
SILNLP can translate docx files as well as txt and USFM files. When translating Word documents the paragraph structure is preserved, but inline formatting is lost. (It would be necessary to fine-tune NLLB to recognize docx markup in order to preserve the formatting.) The primary reason for supporting docx was to add the ability to translate between any of the 200 languages supported by NLLB, so the translate script is able to use an NLLB model without fine-tuning.
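In practice this means each paragraph's text is translated and written back as a single block of plain text. A minimal sketch of that idea, using python-docx and a HuggingFace translation pipeline (this is only an illustration of the approach, not the SILNLP implementation; the file names, model, and language codes are example values):

```python
# Sketch: translate a .docx paragraph by paragraph with an NLLB checkpoint.
# Not the SILNLP code path; file names, model, and language codes are examples.
from docx import Document
from transformers import pipeline

translator = pipeline(
    "translation",
    model="facebook/nllb-200-distilled-600M",
    src_lang="eng_Latn",
    tgt_lang="spa_Latn",
)

doc = Document("source.docx")
for para in doc.paragraphs:
    if para.text.strip():
        # Replacing the whole paragraph text keeps the paragraph break but
        # drops run-level (inline) formatting, as noted above.
        para.text = translator(para.text)[0]["translation_text"]
doc.save("translated.docx")
```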
Create an experiment folder with a config file as usual, but don't specify any corpus pairs. This will allow you to configure decoding hyperparameters and which model to use. Here is an example of a config file:
model: facebook/nllb-200-1.3B
data:
  seed: 111
  lang_codes:
    en: eng_Latn
    es: spa_Latn
params:
  label_smoothing_factor: 0.2
infer:
  infer_batch_size: 16
  num_beams: 2
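The lang_codes section maps the ISO codes used in the experiment (en, es) to NLLB's FLORES-200 codes (eng_Latn, spa_Latn), and the infer section controls decoding. Roughly speaking, those values end up driving a HuggingFace generate call along these lines (a sketch only; SILNLP's internals may differ):

```python
# Sketch of how the lang_codes and infer settings relate to a HuggingFace
# NLLB call; this is not the exact SILNLP code path.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_name = "facebook/nllb-200-1.3B"
tokenizer = AutoTokenizer.from_pretrained(model_name, src_lang="eng_Latn")
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

inputs = tokenizer("Hello, world!", return_tensors="pt")
outputs = model.generate(
    **inputs,
    # Force decoding to start with the target language token (Spanish).
    forced_bos_token_id=tokenizer.convert_tokens_to_ids("spa_Latn"),
    num_beams=2,  # corresponds to infer.num_beams in the config above
)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))
```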
Here is an example of how to call the translate script:
python -m silnlp.nmt.translate <experiment> --src <source_folder> --src-iso en --trg-iso es
This will recursively translate every file in the <source_folder> directory and write the results to the infer directory inside the experiment folder (MT/experiments/<experiment>). The --src parameter will also accept a single file path, and you can specify the output directory or file with --trg. The new changes to the translate script should also work on ClearML.
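For example, a hypothetical invocation that translates a single file and names the output explicitly with --src and --trg (the experiment name and paths are placeholders):
python -m silnlp.nmt.translate <experiment> --src <path_to>/document.docx --trg <output_path>/document.es.docx --src-iso en --trg-iso es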
The following NLLB checkpoints can be used for the model setting in the config file.

Smallest, 600M parameters (distilled):
model: facebook/nllb-200-distilled-600M

Medium, 1.3B parameters (distilled):
model: facebook/nllb-200-distilled-1.3B

Medium, 1.3B parameters:
model: facebook/nllb-200-1.3B

Large, 3.3B parameters:
model: facebook/nllb-200-3.3B
More information about the NLLB model is available from HuggingFace. The metrics link on that page gives a list of the language and script pairs (e.g., eng_Latn) that are included in the model.
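If you want to check which language codes a particular checkpoint supports from Python, the NLLB tokenizer exposes the FLORES-200 codes as special tokens. A quick sketch (assuming a recent transformers release; older versions expose a lang_code_to_id mapping instead):

```python
# List the FLORES-200 language codes known to an NLLB tokenizer.
# Assumes a recent transformers release; older versions use lang_code_to_id.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("facebook/nllb-200-distilled-600M")
lang_codes = [tok for tok in tokenizer.additional_special_tokens if "_" in tok]
print(len(lang_codes), "language codes, e.g.", lang_codes[:5])
print("spa_Latn supported:", "spa_Latn" in lang_codes)
```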