This repository contains information about the code and resources used in the paper "Grammar-based Data Augmentation for Low-Resource Languages: The Case of Guarani-Spanish Neural Machine Translation", accepted at NAACL 2024.
✨TL;DR✨ In this work we explore the feasibility of using grammar-generated text as a data augmentation strategy for low-resource languages, and we find that synthetic text is useful for boosting Guarani-Spanish machine translation systems.
Authors:
- Agustín Lucas
- Alexis Baladón
- Victoria Pardiñas
- Marvin Agüero-Torales
- Santiago Góngora
- Luis Chiruzzo
The work presented in this paper was part of an undergraduate thesis titled "Generación de datos sintéticos para traducción automática entre español y guaraní" (Synthetic Data Generation for Spanish-Guarani Machine Translation).
One of the main components of this work is a Feature Grammar that models simple Spanish-Guarani sentence pairs. The code for the grammar and the resources needed to run it can be found in this repository.
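To give a rough idea of what "feature grammar" means here, the following is a minimal, self-contained sketch and not the grammar shipped in this repository. It assumes a toy lexicon of three pronouns and three forms of one verb, each annotated with `person` and `number` features, and a single hypothetical rule `S -> PRON V` that only fires when those features agree. All names (`PRONOUNS`, `VERBS`, `agrees`, `generate_pairs`) are illustrative assumptions.

```python
from itertools import product

# Toy lexicon (illustrative only, not taken from the repository's grammar).
# Each entry carries agreement features plus aligned Spanish ("es") and
# Guarani ("gn") surface forms.
PRONOUNS = [
    {"person": 1, "number": "sg", "es": "yo", "gn": "che"},
    {"person": 2, "number": "sg", "es": "tú", "gn": "nde"},
    {"person": 3, "number": "sg", "es": "él", "gn": "ha'e"},
]

VERBS = [
    {"person": 1, "number": "sg", "es": "camino", "gn": "aguata"},
    {"person": 2, "number": "sg", "es": "caminas", "gn": "reguata"},
    {"person": 3, "number": "sg", "es": "camina", "gn": "oguata"},
]


def agrees(a, b, features=("person", "number")):
    """Return True when both entries share the same value for every feature."""
    return all(a[f] == b[f] for f in features)


def generate_pairs(pronouns=PRONOUNS, verbs=VERBS):
    """Expand the hypothetical rule S -> PRON V under feature agreement.

    Returns a list of (spanish_sentence, guarani_sentence) tuples.
    """
    pairs = []
    for pron, verb in product(pronouns, verbs):
        if agrees(pron, verb):  # discard combinations such as "yo caminas"
            spanish = f"{pron['es']} {verb['es']}"
            guarani = f"{pron['gn']} {verb['gn']}"
            pairs.append((spanish, guarani))
    return pairs


if __name__ == "__main__":
    # Print one tab-separated sentence pair per line.
    for spanish, guarani in generate_pairs():
        print(f"{spanish}\t{guarani}")
```

Because both sides of each pair are built from the same feature-checked derivation, every generated Spanish sentence comes with an aligned Guarani counterpart, which is what makes this kind of output usable as synthetic parallel data.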
The experiments were carried out using MarianNMT. You can find the code here.
To fine-tune and test the models we use the Jojajovai parallel corpus, consisting of 8 subsets with different kinds of text (e.g. journalistic, user interfaces).
Published paper (ACL Anthology): https://aclanthology.org/2024.naacl-long.354/
If you use any part of this work in your research, please cite:
Agustín Lucas, Alexis Baladón, Victoria Pardiñas, Marvin Agüero-Torales, Santiago Góngora, and Luis Chiruzzo. 2024. Grammar-based Data Augmentation for Low-Resource Languages: The Case of Guarani-Spanish Neural Machine Translation. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 6385–6397, Mexico City, Mexico. Association for Computational Linguistics.