
Translate Word Documents with NLLB


SILNLP can translate docx files as well as txt and USFM. When translating Word documents, the paragraph structure is preserved but inline formatting is lost. (Preserving the formatting would require fine-tuning NLLB to recognize docx markup.) The primary reason for supporting docx was to add the ability to translate between any of the 200 languages supported by NLLB, so the translate script can use an NLLB model without fine-tuning.
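
To picture what this means, here is a minimal sketch using the python-docx library (an illustration only, not SILNLP's actual implementation): reading a docx file yields one plain-text string per paragraph, so paragraph boundaries survive while run-level formatting such as bold or italics does not. The file name is hypothetical.

# Illustrative only: extract paragraph text from a Word document with python-docx.
from docx import Document

doc = Document("example.docx")                   # hypothetical input file
paragraphs = [p.text for p in doc.paragraphs]    # one plain string per paragraph
for text in paragraphs:
    print(text)                                  # inline formatting is already gone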

Setup for translating docx files with NLLB

Create an experiment folder with a config file as usual, but don't specify any corpus pairs. This will allow you to configure decoding hyperparameters and which model to use. Here is an example of a config file:

model: facebook/nllb-200-1.3B
data:
  seed: 111
  lang_codes:
    en: eng_Latn
    es: spa_Latn
params:
  label_smoothing_factor: 0.2
infer:
  infer_batch_size: 16
  num_beams: 2
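
As a point of reference for what the settings above control, here is a minimal sketch of translating a single sentence with the same NLLB model directly through HuggingFace transformers (this is not how the translate script invokes the model; the sentence and language codes are just examples). The lang_codes entries map the ISO codes used on the command line to NLLB's own language codes, and num_beams is a standard generation parameter.

# Illustrative only: direct use of facebook/nllb-200-1.3B via transformers.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_name = "facebook/nllb-200-1.3B"
tokenizer = AutoTokenizer.from_pretrained(model_name, src_lang="eng_Latn")
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

inputs = tokenizer("Hello, world!", return_tensors="pt")
outputs = model.generate(
    **inputs,
    forced_bos_token_id=tokenizer.convert_tokens_to_ids("spa_Latn"),  # target language
    num_beams=2,
)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True)[0])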

Here is an example of how to call the translate script:

python -m silnlp.nmt.translate <source_folder> --src MT/experiments/experiment --src-iso en --trg-iso es

This will translate every file in the MT/experiments/<source_folder> directory recursively and output the results to the infer directory within the experiment directory. The --src parameter will also accept a file path. You can specify the target directory or file using --trg. The updated translate script should also work on ClearML.

Available models are:

Smallest 600M parameter model (distilled):

model: facebook/nllb-200-distilled-600M

Medium 1.3B parameter model (distilled):

model: facebook/nllb-200-distilled-1.3B

Medium 1.3B parameter model:

model: facebook/nllb-200-1.3B

Large 3.3B parameter model:

model: facebook/nllb-200-3.3B

Further information and languages

More information about the NLLB model is available from HuggingFace. The metrics link on that page gives a list of the language and script pairs that are included in the model.
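
If you would rather check the supported codes locally, the sketch below lists them from the tokenizer. It assumes the transformers NLLB tokenizer exposes the language codes as additional special tokens, which is how current releases store them, but this is worth verifying against your installed version.

# Illustrative only: list the language codes known to an NLLB tokenizer.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("facebook/nllb-200-distilled-600M")
lang_codes = tokenizer.additional_special_tokens  # e.g. "eng_Latn", "spa_Latn", ...
print(len(lang_codes), "language codes, e.g.:", lang_codes[:5])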