UC Berkeley
- September 20, 2024: Paper accepted at EMNLP 2024 main proceedings!
This project investigates whether modern large language models (LLMs) understand concept variation across languages. We focus on the task of lexical selection—choosing the most suitable lexeme in the target language to match a single lexeme in the source language, based on the context of the source sentence.
To support this study, we introduce DTAiLS, a dataset covering nine languages with expert-verified sentence pairs that demonstrate concept variation in translation. Additionally, we explore the impact of lexical rules (natural language descriptions of concept variation) on model performance during lexical selection.
Our key takeaways are:
- All LLMs can reason over source-side context to resolve ambiguity given high-quality rules, but only the strongest closed-weight LLMs can generate such rules
- Lexical rules can help bridge gaps in knowledge of concept variation in low-resource languages across models
- Clone the repository:

```shell
git clone https://github.com/Berkeley-NLP/Lex-Rules.git
cd Lex-Rules
```

- Create and activate the conda environment:

```shell
conda env create -f environment.yml
conda activate lex-rules
```

- Install PyTorch locally following the instructions on the PyTorch website. For example, with CUDA 12.1:

```shell
pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
```

- To use GPT-4-Turbo, set your API key:

```shell
export OPENAI_API_KEY=<your-api-key>
```
The DTAiLS dataset is located in the `./data/` directory, organized by language pair (e.g., `en-af` for English-Afrikaans).
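The per-pair file layout can be sketched in a few lines of Python. The file names below are the ones that appear in the example commands later in this README (`train_dataset.csv`, `expert_dataset.csv`, `results.csv`); treat the exact layout as an assumption if your checkout differs:

```python
from pathlib import Path

# Sketch of the expected per-pair file layout, using the file names that
# appear in the example commands. Adjust if your checkout differs.
def dtails_paths(code: str, root: str = "data") -> dict:
    base = Path(root) / f"en-{code}"
    return {
        "train": base / "train_dataset.csv",
        "test": base / "expert_dataset.csv",
        "results": base / "results.csv",
    }

print(dtails_paths("af")["test"])
```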
Generate lexical rules using language models with the following script:
```shell
python generate_lexical_rules.py \
    --model="gpt-4-turbo" \
    --source_language="English" \
    --target_language="Afrikaans" \
    --source_concepts="data/en-af/af.words" \
    --out_file="lexical_rules/gpt-4-turbo-afrikaans.csv" \
    --use_incontext_sentences \
    --num_sentences=25 \
    --train_dataset="data/en-af/train_dataset.csv" \
    --max_retries=3
```
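To give a sense of the output, here is a purely hypothetical sketch of a generated rules file and how to inspect it. The column names and example rule text are illustrative assumptions, not the script's actual schema:

```python
import pandas as pd
from io import StringIO

# Hypothetical rules CSV: column names and rule text are illustrative only,
# not the actual output schema of generate_lexical_rules.py.
rules = pd.read_csv(StringIO(
    'source_word,target_word,rule\n'
    'wall,muur,"Use for exterior or free-standing walls"\n'
    'wall,wand,"Use for interior walls of a room"\n'
))
# Count how many target-language variations each source concept has.
print(rules.groupby("source_word")["target_word"].nunique().to_dict())
```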
| Parameter | Description | Options |
|---|---|---|
| `--model` | Model to use | `gpt-4-turbo`, `llama-3-8b`, `gemma-7b` |
| `--source_concepts` | Path to source concepts file | Path String |
| `--out_file` | Output file path | Path String |
| `--use_incontext_sentences` | Include example sentences in prompt | Flag |
| `--num_sentences` | Number of example sentences per variation | Integer |
| `--train_dataset` | Path to training dataset | Path String |
Evaluate language models on lexical selection:
```shell
python evaluate.py \
    --model="gpt-4-turbo" \
    --target_language="Afrikaans" \
    --test_dataset="data/en-af/expert_dataset.csv" \
    --eval_mode="lexical_rules" \
    --lexical_rules="lexical_rules/gpt-4-turbo-afrikaans.csv" \
    --out_file="results/afrikaans_gpt-4-turbo_predictions.csv" \
    --results_file="data/en-af/results.csv" \
    --results_file_column="GPT4" \
    --shuffle_order \
    --max_retries=2
```
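If you want to score a predictions file yourself, accuracy can be computed in a few lines. The `gold` and `prediction` column names below are assumptions about the output schema, used here only for illustration:

```python
import pandas as pd
from io import StringIO

# Hypothetical predictions CSV: "gold" and "prediction" column names are
# assumptions about evaluate.py's output schema.
preds = pd.read_csv(StringIO(
    "source_word,gold,prediction\n"
    "river,rivier,rivier\n"
    "wall,muur,wand\n"
))
accuracy = (preds["gold"] == preds["prediction"]).mean()
print(f"accuracy = {accuracy:.2f}")  # → accuracy = 0.50
```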
| Parameter | Description | Options |
|---|---|---|
| `--model` | Model to use for evaluation | `gpt-4-turbo`, `llama-3-8b`, `gemma-7b` |
| `--target_language` | Target language for translation | String (e.g., "Afrikaans") |
| `--test_dataset` | Path to test dataset | Path String |
| `--eval_mode` | Evaluation mode | `lexical_rules`, `baseline` |
| `--lexical_rules` | Path to lexical rules file | Path String |
| `--out_file` | Output file for predictions | Path String |
| `--results_file` | File for accumulating results | Path String |
| `--results_file_column` | Column name in results file | `GPT4`, `GPT4-NoRules`, `Llama-3-8b-GPT4Rules`, `Llama-3-8b`, `Llama-3-8b-NoRules`, `Gemma-7b-GPT4Rules`, `Gemma-7b`, `Gemma-7b-NoRules` |
| `--shuffle_order` | Randomize order of lexical variations in prompt | Flag |
| `--max_retries` | Maximum attempts to generate valid response | Integer |
Evaluate neural machine translation models:
```shell
python mt_evaluate.py \
    --model="madlad400-10b" \
    --test_dataset="data/en-af/expert_dataset.csv" \
    --target_lang_id="af" \
    --results_file="data/en-af/results.csv" \
    --column_name="MADLAD400-10b"
```
| Parameter | Description | Options |
|---|---|---|
| `--model` | MT model to evaluate | `madlad400-10b`, `nllb-200-3.3b` |
| `--test_dataset` | Path to test dataset | Path String |
| `--target_lang_id` | Target language code | Two-letter code (see supported languages table) |
| `--results_file` | File for accumulating results | Path String |
| `--column_name` | Column name in results file | `MADLAD400-10b`, `NLLB-200-3.3b` |
Models:
- Language Models: `gpt-4-turbo`, `llama-3-8b`, `gemma-7b`
- MT Systems: `madlad400-10b`, `nllb-200-3.3b`
Target Languages:

| Code | Language |
|---|---|
| af | Afrikaans |
| ta | Tamil |
| te | Telugu |
| gl | Galician |
| hi | Hindi |
| hy | Armenian |
| ja | Japanese |
| fa | Farsi |
| lv | Latvian |
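For scripting, the codes in the table above map directly onto the `en-XX` pair IDs used for the `data/` subdirectories:

```python
# Language codes from the supported-languages table; pair IDs follow the
# en-XX convention used for the data/ subdirectories.
codes = {"af": "Afrikaans", "ta": "Tamil", "te": "Telugu", "gl": "Galician",
         "hi": "Hindi", "hy": "Armenian", "ja": "Japanese", "fa": "Farsi",
         "lv": "Latvian"}
pair_ids = [f"en-{c}" for c in codes]
print(pair_ids[:3])  # → ['en-af', 'en-ta', 'en-te']
```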
Calculate micro-averages across all concepts:
```shell
LANGS=("afrikaans" "tamil" "telugu" "galician" "hindi" "armenian" "japanese" "farsi" "latvian")
LANG_IDS=("en-af" "en-ta" "en-te" "en-gl" "en-hi" "en-hy" "en-ja" "en-fa" "en-lv")
for (( i=0; i<9; i++ )); do
    python computeOverallAccuracy.py \
        --results_file="data/${LANG_IDS[$i]}/results.csv" \
        --output_file="accuracy.csv" \
        --language=${LANGS[$i]}
done
```
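A micro-average pools correct/total counts across all concepts and divides once, so each test example carries equal weight (unlike a macro-average, which weights each concept equally). A minimal sketch with made-up counts:

```python
# Micro-average: pool (correct, total) counts across concepts, then divide.
# The counts here are made up for illustration.
per_concept = [(8, 10), (45, 50), (1, 5)]
micro = sum(c for c, _ in per_concept) / sum(t for _, t in per_concept)
macro = sum(c / t for c, t in per_concept) / len(per_concept)
print(round(micro, 3), round(macro, 3))  # → 0.831 0.633
```

Note how the small, hard concept (1/5) drags the macro-average down far more than the micro-average.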
Please consider citing our paper if you find our work or dataset helpful for your research:
```bibtex
@inproceedings{barua24emnlp,
    title = {Using Language Models to Disambiguate Lexical Choices in Translation},
    author = {Barua, Josh and Subramanian, Sanjay and Yin, Kayo and Suhr, Alane},
    booktitle = {EMNLP},
    month = {November},
    year = {2024}
}
```