AMTA 2024 Tutorial: Edit Distances and their application to downstream tasks, in research and commercial contexts.
This repository contains the code and resources for our tutorial presented at the 16th Biennial Conference of the Association for Machine Translation in the Americas (AMTA 2024). The tutorial covers the theoretical foundations of edit distances, their applications in Natural Language Processing (NLP), and the challenges and limitations of applying these metrics to tasks such as Machine Translation (MT), Quality Estimation (QE), and Automatic Post-Editing (APE).
The tutorial is structured into four parts:
- Part 1: Edit distances and their different implementations and applications
- Part 2: Analysing an incrementally-complex sequence of edits
- Part 3: Building a Computational Perspective
- Part 4: Implications for research and commercial applications of edit distances
- `code/`: Contains the Python notebook used during the tutorial, including:
  - Examples for calculating Levenshtein, Damerau-Levenshtein, LCS, and N-gram distances.
  - Examples for computing TER (using multiple implementations), BLEU, and chrF scores.
  - Scripts for visualizing the results of different metrics using plots and charts.
- `data/`: Example dataset used in the tutorial, which includes the reference and hypothesis translations for analysis in two separate files.
- `AMTA-ED-Tutorial-2024.pdf`: Slide deck used for the presentation.
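
For readers who want a feel for the kind of calculation the notebook covers before opening it, here is a minimal sketch of Levenshtein distance in plain Python. It is not taken from the notebook: the function name `levenshtein` and the example sentences are illustrative assumptions, and the notebook may rely on library implementations instead. The same function works on characters (strings) and on words (lists of tokens).

```python
from typing import Sequence


def levenshtein(source: Sequence, target: Sequence) -> int:
    """Minimum number of insertions, deletions, and substitutions
    needed to turn `source` into `target` (illustrative sketch)."""
    # previous[j] is the distance between the source prefix processed
    # so far and the first j items of target.
    previous = list(range(len(target) + 1))
    for i, source_item in enumerate(source, start=1):
        current = [i]  # turning i source items into an empty target costs i deletions
        for j, target_item in enumerate(target, start=1):
            substitution_cost = 0 if source_item == target_item else 1
            current.append(min(
                previous[j] + 1,                      # delete source_item
                current[j - 1] + 1,                   # insert target_item
                previous[j - 1] + substitution_cost,  # match or substitute
            ))
        previous = current
    return previous[-1]


# Hypothetical hypothesis/reference pair, not from the tutorial dataset.
hypothesis = "the cat sat on mat"
reference = "the cat sat on the mat"

char_distance = levenshtein(hypothesis, reference)                  # character level
word_distance = levenshtein(hypothesis.split(), reference.split())  # word level
print(char_distance, word_distance)
```

The character-level and word-level distances differ for the same sentence pair, which is one reason the choice of edit unit matters when comparing metrics. Note that this sketch covers only insertions, deletions, and substitutions; it does not model the block shifts that TER adds.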