This repository contains the code, data, and evaluation scripts to reproduce the results of the paper *Make Every Letter Count: Building Dialect Variation Dictionaries from Monolingual Corpora*, which has been accepted at EMNLP 2025 (Findings).
Our Bavarian dialect variation dictionary contains 5,124 lemmas with dialect variants (i.e., translations) and inflected variants. The file can be found in data/dictionary.jsonl.
Example:
```json
{
  "id": "1702",
  "pos": "NOUN",
  "term": "Ortschaft",
  "variants": [
    "Ortschft",
    "Ortschoft",
    "Ortschaoft",
    "Oatschaft",
    "Ortschåft",
    "Ortsschoft"
  ],
  "inflected_variants": [
    "Ortschaftn",
    "Ortschaftnn",
    "Ortschafta"
  ]
}
```
Format:
- `"id"`: Unique identifier for German lemmas.
- `"pos"`: Majority part-of-speech (POS) tag assigned by the de_core_news_lg POS tagger in spaCy.
- `"term"`: German lemma for which we collected (inflected) variants.
- `"variants"`: Bavarian terms that were annotated as direct translations.
- `"inflected_variants"`: Bavarian terms that were annotated as inflected translations.

The dictionary was created by running `python build_dictionary.py`.
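For convenience, here is a minimal sketch of loading the dictionary in Python; the lemma-to-variants lookup at the end is purely illustrative:

```python
import json

# Read the dictionary: one JSON object per line (JSON Lines).
with open("data/dictionary.jsonl", encoding="utf-8") as f:
    entries = [json.loads(line) for line in f]

# Illustrative lookup from German lemma to its Bavarian variants.
variants_by_lemma = {e["term"]: e["variants"] for e in entries}
print(variants_by_lemma["Ortschaft"])
# ['Ortschft', 'Ortschoft', 'Ortschaoft', 'Oatschaft', 'Ortschåft', 'Ortsschoft']
```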
We created two dialect NLP task datasets based on 100K human-annotated German-Bavarian word pairs:
- Judging Translation Candidates (Recognition)
- Dialect-to-Standard Translation (Translation)
| Task | Split | # instances | File |
|---|---|---|---|
| Recognition | Dev | 300 | recognition_dev.csv |
| Recognition | Test | 97,000 | recognition_test.csv |
| Translation | Dev | 301 | translation_dev.csv |
| Translation | Test | 10,775 | translation_test.csv |
The dev and test splits for both tasks are created with `python split_datasets.py`.
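A quick way to load a split is with pandas. The data/ paths below are an assumption (adjust them to wherever the CSV files live in this repository), and the column names depend on the task:

```python
import pandas as pd

# Assumed paths; adjust to the actual output location of split_datasets.py.
dev = pd.read_csv("data/recognition_dev.csv")
test = pd.read_csv("data/recognition_test.csv")
print(len(dev), len(test))   # expected: 300 and 97000 (see the table above)
print(dev.columns.tolist())  # inspect the task-specific columns
```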
To print the dataset statistics (Table 7), run `python statistics.py`.
Below, we show the steps to reproduce the results in our paper.
Create a Python environment and install the required packages:

```
conda create --name dvar python=3.10
conda activate dvar
pip install -r requirements.txt
python -m spacy download de_core_news_lg
```
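To sanity-check the setup, you can verify that the spaCy model loads and tags as expected (the example word is taken from the dictionary entry above):

```python
import spacy

# Load the German pipeline used for the POS tags in the dictionary.
nlp = spacy.load("de_core_news_lg")
print(nlp("Ortschaft")[0].pos_)  # expected: NOUN
```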
The following steps describe how we created the annotation files:
- Download the Wikipedia dumps of a standard language and dialect (link); a sketch for reading these dumps follows this list. Example:

  ```
  wget https://dumps.wikimedia.org/other/cirrussearch/20250310/barwiki-20250310-cirrussearch-content.json.gz
  wget https://dumps.wikimedia.org/other/cirrussearch/20250310/dewiki-20250310-cirrussearch-content.json.gz
  ```
- Run `python dialemma_pipeline.py` to create the annotation files (.xls). The output is split into ten chunks to avoid large files.
- Upload and share the data with annotators (e.g., via Google Sheets).
- Annotate pairs of German lemmas and Bavarian terms (see the annotation guidelines) and download the annotated records as CSV files.*
- Run `python merge_files.py` to create one file with all records.
*Note: Word pairs were annotated with respect to the POS tag of the lemma. We found rare cases (80 out of 99,700 instances) where words could be seen as "inflected" adjectives. Since those words were tagged as adverbs (which cannot be inflected), they received the label "no". We share a list of these ambiguous instances for further use.
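For orientation, here is a sketch of streaming one of the downloaded dumps in Python. It assumes the CirrusSearch bulk format, where index-action lines alternate with article documents; the actual processing is done by dialemma_pipeline.py:

```python
import gzip
import json

# Stream the gzipped dump line by line. CirrusSearch dumps follow the
# Elasticsearch bulk format (assumption): index-action lines alternate
# with document lines carrying fields such as "title" and "text".
with gzip.open("barwiki-20250310-cirrussearch-content.json.gz", "rt", encoding="utf-8") as f:
    for line in f:
        doc = json.loads(line)
        if "title" in doc:  # skip the index-action lines
            print(doc["title"])
            break
```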
We use Ollama to set up a local endpoint that is compatible with the OpenAI Python library. We ran our experiments with Ollama v0.6.7.
- Download an LLM with `ollama pull llama3.1:8b-instruct-fp16`. The list of models used in our study can be found in models.txt.
- Run the LLM server with `ollama serve`. If needed, change the port with `export OLLAMA_HOST=127.0.0.1:11435`.
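Since the endpoint is OpenAI-compatible, you can query a pulled model with the OpenAI Python library. A minimal sketch, assuming the default Ollama port 11434 and an illustrative prompt (the real templates live in prompt-templates/):

```python
from openai import OpenAI

# Ollama serves an OpenAI-compatible API under /v1 (default port: 11434).
client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")  # key is required but unused

# Illustrative prompt; the study's actual prompts are in prompt-templates/.
response = client.chat.completions.create(
    model="llama3.1:8b-instruct-fp16",
    messages=[{"role": "user", "content": "Is 'Ortschoft' a Bavarian variant of 'Ortschaft'? Answer yes or no."}],
)
print(response.choices[0].message.content)
```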
You can find the list of prompts and their German translations in the prompt-templates/ folder. Note that in our study we only evaluated the German translation of the best-performing prompt. Run `python prompt_llm.py --split dev --task {recognition,translation}` to prompt all LLMs with all prompts on instances of the development set. The results are written to results/dev/recognition/ and results/dev/translation/. Folder names indicate the prompt language and the index of the prompt used to generate the results. For example, the folder en_0/ contains one CSV file for each LLM and a text file with the first prompt (id: 0), written in English:
```
en_0/
├── aya-expanse:32b-fp16.csv
├── ...
└── prompt_template.txt
```
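A small sketch for inspecting such a results folder (the folder name follows the layout above):

```python
from pathlib import Path

# Print the prompt template and list the per-LLM result files.
folder = Path("results/dev/recognition/en_0")
print((folder / "prompt_template.txt").read_text(encoding="utf-8"))
for csv_path in sorted(folder.glob("*.csv")):
    print(csv_path.name)
```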
To obtain predictions of each LLM on the test dataset using its best-performing prompt, run:

```
python prompt_llm.py --split test --task {recognition,translation} --prompt_lang {en,de} --use_context
```
The prompt ids of the best-performing prompts are hard-coded. To reproduce the results of our ablation experiments, use one of the following two flags:
- `--prompt_lang de`: Uses the German translations of the best-performing prompts (default: en).
- `--use_context`: Runs the context ablation experiments (default: no context).
The output is written into results/test/recognition/ and results/test/translation/.
```
test/
├── recognition
├── recognition+context
├── recognition-with_de_prompts
├── translation
├── translation+context
└── translation-with_de_prompts
```
- To reproduce the main results of our paper (Tables 1 and 2), run `python evaluate.py --task {recognition,translation} --split {dev,test}`.
- To reproduce the results of the ablation experiments (Figures 3 and 4), use the `--prompt_lang de` or `--use_context` parameters.
- To reproduce the confusion matrices (Tables 5-6 and 8-14), use `--confusion_matrix`. Use this only with `--task recognition`.
- To reproduce the baseline results (Random, Levenshtein, Majority, Logistic Regression; Tables 1-2 and 16-17), use `--baselines`. This applies only to the recognition task (test set).
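For intuition on the Levenshtein baseline, here is a minimal sketch of an edit-distance recognition heuristic. The decision rule and threshold are illustrative assumptions, not the repository's implementation (use `--baselines` for that):

```python
def levenshtein(a: str, b: str) -> int:
    # Classic dynamic-programming edit distance, kept to two rows.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

# Illustrative decision rule: accept a candidate as a variant if the
# normalized edit distance to the German lemma is below a threshold.
def is_variant(lemma: str, candidate: str, threshold: float = 0.4) -> bool:
    dist = levenshtein(lemma.lower(), candidate.lower())
    return dist / max(len(lemma), len(candidate)) <= threshold

print(is_variant("Ortschaft", "Ortschoft"))  # True: a single substitution
```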
Please consider citing our paper if you use the code in this repository:
```
@misc{litschko2025make-every-letter-count,
  title={Make Every Letter Count: Building Dialect Variation Dictionaries from Monolingual Corpora},
  author={Litschko, Robert and Blaschke, Verena and Burkhardt, Diana and Plank, Barbara and Frassinelli, Diego},
  year={2025},
  eprint={2509.17855},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2509.17855},
}
```