This repository contains the code, data, and evaluation scripts to reproduce the results of the paper *Make Every Letter Count: Building Dialect Variation Dictionaries from Monolingual Corpora*, which has been accepted at EMNLP 2025 (Findings).
Our Bavarian dialect variation dictionary contains 5,124 lemmas with dialect variants (i.e., translations) and inflected variants. The file can be found in data/dictionary.jsonl.
Example:
```json
{
  "id": "1702",
  "pos": "NOUN",
  "term": "Ortschaft",
  "variants": [
    "Ortschft",
    "Ortschoft",
    "Ortschaoft",
    "Oatschaft",
    "Ortschåft",
    "Ortsschoft"
  ],
  "inflected_variants": [
    "Ortschaftn",
    "Ortschaftnn",
    "Ortschafta"
  ]
}
```
Format:
- `"id"`: Unique identifier for German lemmas.
- `"pos"`: Majority part-of-speech (POS) tag assigned by the de_core_news_lg POS tagger in spaCy.
- `"term"`: German lemma for which we collected (inflected) variants.
- `"variants"`: Bavarian terms that were annotated as direct translations.
- `"inflected_variants"`: Bavarian terms that were annotated as inflected translations.

The dictionary was created by running `python build_dictionary.py`.
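For convenience, here is a minimal sketch of loading the dictionary in Python; the lemma-to-variants lookup at the end is purely illustrative:

```python
import json

# Read the dictionary: one JSON object per line (JSON Lines).
with open("data/dictionary.jsonl", encoding="utf-8") as f:
    entries = [json.loads(line) for line in f]

# Illustrative lookup from German lemma to its Bavarian variants.
variants_by_lemma = {e["term"]: e["variants"] for e in entries}
print(variants_by_lemma["Ortschaft"])
# ['Ortschft', 'Ortschoft', 'Ortschaoft', 'Oatschaft', 'Ortschåft', 'Ortsschoft']
```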
We created two dialect NLP task datasets based on 100K human-annotated German-Bavarian word pairs:
- Judging Translation Candidates (Recognition)
- Dialect-to-Standard Translation (Translation)
| Task | Split | # instances | File |
|---|---|---|---|
| Recognition | Dev | 300 | recognition_dev.csv |
| Recognition | Test | 97,000 | recognition_test.csv |
| Translation | Dev | 301 | translation_dev.csv |
| Translation | Test | 10,775 | translation_test.csv |
The dev and test splits for both tasks are created with `python split_datasets.py`.
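A quick way to load a split is with pandas. The data/ paths below are an assumption (adjust them to wherever the CSV files live in this repository), and the column names depend on the task:

```python
import pandas as pd

# Assumed paths; adjust to the actual output location of split_datasets.py.
dev = pd.read_csv("data/recognition_dev.csv")
test = pd.read_csv("data/recognition_test.csv")
print(len(dev), len(test))   # expected: 300 and 97000 (see the table above)
print(dev.columns.tolist())  # inspect the task-specific columns
```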
To print the dataset statistics (Table 7), run `python statistics.py`.
Below, we show the steps to reproduce the results in our paper.
Create a Python environment and install the required packages:

```
conda create --name dvar python=3.10
conda activate dvar
pip install -r requirements.txt
python -m spacy download de_core_news_lg
```
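To sanity-check the setup, you can verify that the spaCy model loads and tags as expected (the example word is taken from the dictionary entry above):

```python
import spacy

# Load the German pipeline used for the POS tags in the dictionary.
nlp = spacy.load("de_core_news_lg")
print(nlp("Ortschaft")[0].pos_)  # expected: NOUN
```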
The following steps describe how we created the annotation files:
- Download the Wikipedia dumps of a standard language and dialect (link); a sketch for reading these dumps follows this list. Example:

  ```
  wget https://dumps.wikimedia.org/other/cirrussearch/20250310/barwiki-20250310-cirrussearch-content.json.gz
  wget https://dumps.wikimedia.org/other/cirrussearch/20250310/dewiki-20250310-cirrussearch-content.json.gz
  ```
- Run `python dialemma_pipeline.py` to create the annotation files (.xls). The output is split into ten chunks to avoid large files.
- Upload and share the data with annotators (e.g., via Google Sheets).
- Annotate pairs of German lemmas and Bavarian terms (see the annotation guidelines) and download the annotated records as CSV files.*
- Run `python merge_files.py` to create one file with all records.
*Note: Word pairs were annotated with respect to the POS tag of the lemma. We found rare cases (80 out of 99,700 instances) where words could be seen as "inflected" adjectives. Since those words were tagged as adverbs (which cannot be inflected), they received the label "no". We share a list of these ambiguous instances for further use.
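For orientation, here is a sketch of streaming one of the downloaded dumps in Python. It assumes the CirrusSearch bulk format, where index-action lines alternate with article documents; the actual processing is done by dialemma_pipeline.py:

```python
import gzip
import json

# Stream the gzipped dump line by line. CirrusSearch dumps follow the
# Elasticsearch bulk format (assumption): index-action lines alternate
# with document lines carrying fields such as "title" and "text".
with gzip.open("barwiki-20250310-cirrussearch-content.json.gz", "rt", encoding="utf-8") as f:
    for line in f:
        doc = json.loads(line)
        if "title" in doc:  # skip the index-action lines
            print(doc["title"])
            break
```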
We use Ollama to set up a local endpoint that is compatible with the OpenAI Python library. We ran our experiments with Ollama v0.6.7.
- Download an LLM with `ollama pull llama3.1:8b-instruct-fp16`. The list of models used in our study can be found in models.txt.
- Run the LLM server with `ollama serve`. If needed, change the port with `export OLLAMA_HOST=127.0.0.1:11435`.
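Since the endpoint is OpenAI-compatible, you can query a pulled model with the OpenAI Python library. A minimal sketch, assuming the default Ollama port 11434 and an illustrative prompt (the real templates live in prompt-templates/):

```python
from openai import OpenAI

# Ollama serves an OpenAI-compatible API under /v1 (default port: 11434).
client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")  # key is required but unused

# Illustrative prompt; the study's actual prompts are in prompt-templates/.
response = client.chat.completions.create(
    model="llama3.1:8b-instruct-fp16",
    messages=[{"role": "user", "content": "Is 'Ortschoft' a Bavarian variant of 'Ortschaft'? Answer yes or no."}],
)
print(response.choices[0].message.content)
```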
You can find the list of prompts and their German translations in the prompt-templates/ folder. Note that in our study we only evaluated the German translation of the best-performing prompt. Run `python prompt_llm.py --split dev --task {recognition,translation}` to prompt all LLMs with all prompts on instances of the development set. The results are written to results/dev/recognition/ and results/dev/translation/. Folder names indicate the prompt language and the index of the prompt used to generate the results. For example, the folder en_0/ contains one CSV file for each LLM and a text file with the first prompt (id: 0), written in English:
```
en_0/
├── aya-expanse:32b-fp16.csv
├── ...
└── prompt_template.txt
```
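A small sketch for inspecting such a results folder (the folder name follows the layout above):

```python
from pathlib import Path

# Print the prompt template and list the per-LLM result files.
folder = Path("results/dev/recognition/en_0")
print((folder / "prompt_template.txt").read_text(encoding="utf-8"))
for csv_path in sorted(folder.glob("*.csv")):
    print(csv_path.name)
```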
To obtain predictions of each LLM on the test dataset using its best-performing prompt, run:

```
python prompt_llm.py --split test --task {recognition,translation} --prompt_lang {en,de} --use_context
```
The prompt ids of the best-performing prompts are hard-coded. To reproduce the results of our ablation experiments, use one of the following two flags:
- `--prompt_lang de`: Uses the German translations of the best-performing prompts (default: en).
- `--use_context`: Runs the context ablation experiments (default: no context).
The output is written into results/test/recognition/ and results/test/translation/.
```
test/
├── recognition
├── recognition+context
├── recognition-with_de_prompts
├── translation
├── translation+context
└── translation-with_de_prompts
```
- To reproduce the main results of our paper (Tables 1 and 2), run `python evaluate.py --task {recognition,translation} --split {dev,test}`.
- To reproduce the results of the ablation experiments (Figures 3 and 4), use the `--prompt_lang de` or `--use_context` parameters.
- To reproduce the confusion matrices (Tables 5-6 and 8-14), use `--confusion_matrix`. Use this only with `--task recognition`.
- To reproduce the baseline results (Random, Levenshtein, Majority, Logistic Regression; Tables 1-2 and 16-17), use `--baselines`. This applies only to the recognition task (test set).
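For intuition on the Levenshtein baseline, here is a minimal sketch of an edit-distance recognition heuristic. The decision rule and threshold are illustrative assumptions, not the repository's implementation (use `--baselines` for that):

```python
def levenshtein(a: str, b: str) -> int:
    # Classic dynamic-programming edit distance, kept to two rows.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

# Illustrative decision rule: accept a candidate as a variant if the
# normalized edit distance to the German lemma is below a threshold.
def is_variant(lemma: str, candidate: str, threshold: float = 0.4) -> bool:
    dist = levenshtein(lemma.lower(), candidate.lower())
    return dist / max(len(lemma), len(candidate)) <= threshold

print(is_variant("Ortschaft", "Ortschoft"))  # True: a single substitution
```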
Please consider citing our paper if you use the code in this repository:
```
@misc{litschko2025make-every-letter-count,
  title={Make Every Letter Count: Building Dialect Variation Dictionaries from Monolingual Corpora},
  author={Litschko, Robert and Blaschke, Verena and Burkhardt, Diana and Plank, Barbara and Frassinelli, Diego},
  year={2025},
  eprint={2509.17855},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2509.17855},
}
```