Grammar-based Data Augmentation for Low-Resource Languages - NAACL2024

This repository contains information about the code and resources used in the paper "Grammar-based Data Augmentation for Low-Resource Languages: The Case of Guarani-Spanish Neural Machine Translation", accepted at NAACL 2024.

✨TL;DR✨ In this work we explore the feasibility of using grammar-generated text as a data augmentation strategy for low-resource languages, and we find that synthetic text helps improve Guarani-Spanish machine translation systems.

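Authors: Agustín Lucas, Alexis Baladón, Victoria Pardiñas, Marvin Agüero-Torales, Santiago Góngora, and Luis Chiruzzo.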

Related resources

The work presented in this paper was part of an undergraduate thesis titled "Generación de datos sintéticos para traducción automática entre español y guaraní" (Synthetic Data Generation for Spanish-Guarani Machine Translation).

Grammars for Spanish and Guarani

One of the main components of this work is a Feature Grammar that models simple Spanish-Guarani sentence pairs. The code for the grammar and the resources needed to run it can be found in this repository.
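To give a rough idea of how a feature grammar can produce parallel sentence pairs, here is a minimal sketch in plain Python. It is not the grammar from the paper: the three-entry lexicon, the single rule `S -> NP[person=p] V[person=p]`, and the names `SUBJECTS`, `VERBS` and `generate_pairs` are all assumptions made for illustration. The only idea it borrows is that a shared feature (here, grammatical person) constrains which subject and verb forms may combine, so every generated pair is well-formed in both languages.

```python
from itertools import product

# Toy lexicon (illustrative assumption, not taken from the paper's grammar).
# Each entry carries a "person" feature plus its Spanish and Guarani forms.
SUBJECTS = [
    {"person": 1, "es": "yo", "gn": "che"},
    {"person": 2, "es": "tú", "gn": "nde"},
    {"person": 3, "es": "ella", "gn": "ha'e"},
]

VERBS = [
    {"person": 1, "es": "camino", "gn": "aguata"},
    {"person": 2, "es": "caminas", "gn": "reguata"},
    {"person": 3, "es": "camina", "gn": "oguata"},
]


def generate_pairs(subjects, verbs):
    """Yield (Spanish, Guarani) pairs for the rule S -> NP[person=p] V[person=p]."""
    for subj, verb in product(subjects, verbs):
        # Agreement check: subject and verb must share the same person value.
        if subj["person"] == verb["person"]:
            yield (f"{subj['es']} {verb['es']}", f"{subj['gn']} {verb['gn']}")


pairs = list(generate_pairs(SUBJECTS, VERBS))

# Print one tab-separated sentence pair per line.
for es, gn in pairs:
    print(f"{es}\t{gn}")
```

Of the nine possible subject-verb combinations, only the three that agree in person are emitted. A real feature grammar would track many more features and cover richer structures than this two-word rule.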

NMT experiments using MarianNMT (C++)

The experiments were carried out using MarianNMT. You can find the code here.

Jojajovai corpus

To fine-tune and test the models, we use the Jojajovai parallel corpus, which consists of 8 subsets covering different kinds of text (e.g., journalistic text and user interfaces).

Paper

Published paper (ACL Anthology): https://aclanthology.org/2024.naacl-long.354/

Citation

If you use some part of this work in your research, please cite:

Agustín Lucas, Alexis Baladón, Victoria Pardiñas, Marvin Agüero-Torales, Santiago Góngora, and Luis Chiruzzo. 2024. Grammar-based Data Augmentation for Low-Resource Languages: The Case of Guarani-Spanish Neural Machine Translation. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 6385–6397, Mexico City, Mexico. Association for Computational Linguistics.

Repository: pln-fing-udelar/guarani-grammar-NAACL2024
