Cross-Lingual Annotation Projection for Argument Mining in Portuguese
===============

Sample source code and data for our EPIA 2021 paper:

@inproceedings{10.1007/978-3-030-86230-5_59,
author="Sousa, Afonso
and Leite, Bernardo
and Rocha, Gil
and Lopes Cardoso, Henrique",
editor="Marreiros, Goreti
and Melo, Francisco S.
and Lau, Nuno
and Lopes Cardoso, Henrique
and Reis, Lu{\'i}s Paulo",
title="Cross-Lingual Annotation Projection for Argument Mining in Portuguese",
booktitle="Progress in Artificial Intelligence",
year="2021",
publisher="Springer International Publishing",
address="Cham",
pages="752--765"
}

Abstract: While Argument Mining has seen increasing success in monolingual settings, especially for the English language, other less-resourced languages are still lagging behind. In this paper, we build a Portuguese projected version of the Persuasive Essays corpus and evaluate it both intrinsically (through back-projection) and extrinsically (in a sequence tagging task). To build the corpus, we project the token-level annotations into a new Portuguese version using translations and respective alignments. Intrinsic evaluation entails rebuilding the English corpus using back alignment and back projection from the Portuguese version, comparing against the original English annotations. For extrinsic evaluation, we assess and compare the performance of machine learning models on several language variants of the corpus (including the Portuguese one), following both in-language/projection training and direct transfer. Our evaluation highlights the quality of the generated corpus. Experimental results show the effectiveness of the projection approach, while providing competitive baselines for the Portuguese version of the corpus. The corpus and code are available (https://github.com/AfonsoSalgadoSousa/argumentation_mining_pt).

Drop us a line or report an issue if something is broken (and shouldn't be) or if you have any questions.

This repository contains experimental software and is published for the sole purpose of giving additional background details on the respective publication. It uses code from the following third-party repositories:

Requirements

  • NLTK
  • NumPy
  • SciPy
  • PyTorch
  • Scikit-learn
  • Transformers 3.1.0 (later versions might throw an error)
  • NetworkX
  • tqdm

Usage

Use the script files in this folder to build the annotation projection corpus and run the intrinsic evaluation (further explanations below). Alternatively, you can find the Portuguese dataset and all of the intermediate files in this folder.

For sequence tagging, we used both TAGGER and NeuroNLP2. For a detailed explanation of how to use these tools, please refer to the TAGGER or NeuroNLP2 repository.

Building the Portuguese Dataset

To build the Portuguese version of the Persuasive Essays corpus, we used the CoNLL-formatted version of the dataset from this repo, assuming the following file structure:

  ├── DATASET_ROOT_DIR
  │   ├── en_pe                   # Persuasive Essays CoNLL-formatted
  │   │   ├── train.dat            
  │   │   ├── dev.dat
  │   │   ├── test.dat

Free-text

To start building the dataset, create free-text files for each train/dev/test file.

python src/convert_to_free_text.py data/auxiliary/train/train_ft.txt data/en_pe/train.dat
python src/convert_to_free_text.py data/auxiliary/dev/dev_ft.txt data/en_pe/dev.dat
python src/convert_to_free_text.py data/auxiliary/test/test_ft.txt data/en_pe/test.dat

These scripts create the "auxiliary" folder in the root folder to store further auxiliary files for the construction of the dataset.
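As a rough illustration of what the free-text conversion involves, the sketch below joins the tokens of a CoNLL-style file back into plain text, one line per blank-line-separated block. It is not the code in `src/convert_to_free_text.py`; the function name `conll_to_free_text` is hypothetical, and it assumes that the token is the first whitespace-separated column and that blocks are separated by empty lines.

```python
def conll_to_free_text(lines):
    """Join the token column of CoNLL-style lines into one text line per block.

    Assumption: the token is the first whitespace-separated column and
    blocks are separated by empty lines.
    """
    blocks, tokens = [], []
    for line in lines:
        line = line.strip()
        if not line:
            # An empty line closes the current block.
            if tokens:
                blocks.append(" ".join(tokens))
                tokens = []
            continue
        tokens.append(line.split()[0])
    if tokens:
        blocks.append(" ".join(tokens))
    return blocks


example = [
    "Smoking\tB-Claim",
    "kills\tI-Claim",
    ".\tO",
    "",
    "Ban\tB-Premise",
    "it\tI-Premise",
]
print(conll_to_free_text(example))  # ['Smoking kills .', 'Ban it']
```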

Translation

Next, translate the free-text files.

python src/translator.py data/auxiliary/train/train_ft.txt --src_lang en --trg_lang pt
python src/translator.py data/auxiliary/dev/dev_ft.txt --src_lang en --trg_lang pt
python src/translator.py data/auxiliary/test/test_ft.txt --src_lang en --trg_lang pt

You will end up with a file of parallel data in which source and translation are separated by the "|||" sequence, sentences are separated by line breaks, and paragraphs are separated by empty lines.
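The layout described above can be read back with a few lines of Python. This is a minimal sketch, assuming each non-empty line holds exactly one `source ||| target` pair; the function name `read_parallel` and the sample sentences are made up for illustration and are not part of the repository.

```python
def read_parallel(text, sep="|||"):
    """Parse parallel text into paragraphs of (source, target) sentence pairs."""
    paragraphs, current = [], []
    for line in text.splitlines():
        if not line.strip():
            # An empty line ends the current paragraph.
            if current:
                paragraphs.append(current)
                current = []
            continue
        src, _, trg = line.partition(sep)
        current.append((src.strip(), trg.strip()))
    if current:
        paragraphs.append(current)
    return paragraphs


sample = (
    "I agree . ||| Eu concordo .\n"
    "It is true . ||| É verdade .\n"
    "\n"
    "Thanks . ||| Obrigado .\n"
)
for paragraph in read_parallel(sample):
    print(paragraph)
```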

Alignment

Next, generate alignment files for the previously created files with translations.

python src/align.py data/auxiliary/train/train_ft_translated.txt
python src/align.py data/auxiliary/dev/dev_ft_translated.txt
python src/align.py data/auxiliary/test/test_ft_translated.txt

The generated file follows the structure of the translation file, but contains per-token index pairs instead of parallel text.
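To make the idea of per-token index pairs concrete, here is a small sketch that parses one alignment line. The exact format written by `src/align.py` is not documented here, so the sketch assumes the common `i-j` notation (source index, hyphen, target index, pairs separated by spaces); both helper names are hypothetical.

```python
from collections import defaultdict


def parse_alignment(line):
    """Parse 'i-j' pairs into a list of (source_index, target_index) tuples.

    Assumption: pairs are space-separated and written as 'i-j'.
    """
    pairs = []
    for item in line.split():
        src_idx, trg_idx = item.split("-")
        pairs.append((int(src_idx), int(trg_idx)))
    return pairs


def source_to_targets(pairs):
    """Group target indices by the source index they are aligned to."""
    mapping = defaultdict(list)
    for src_idx, trg_idx in pairs:
        mapping[src_idx].append(trg_idx)
    return dict(mapping)


pairs = parse_alignment("0-0 1-2 2-1 2-3")
print(pairs)                     # [(0, 0), (1, 2), (2, 1), (2, 3)]
print(source_to_targets(pairs))  # {0: [0], 1: [2], 2: [1, 3]}
```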

Annotation Projection

Finally, project the annotations.

python src/project_annotations.py data/en_pe/train.dat data/auxiliary/train/train_ft_translated.txt data/auxiliary/train/train_ft_translated_alignment.txt --output_dir data/pt_pe
python src/project_annotations.py data/en_pe/dev.dat data/auxiliary/dev/dev_ft_translated.txt data/auxiliary/dev/dev_ft_translated_alignment.txt --output_dir data/pt_pe
python src/project_annotations.py data/en_pe/test.dat data/auxiliary/test/test_ft_translated.txt data/auxiliary/test/test_ft_translated_alignment.txt --output_dir data/pt_pe

These scripts create the "pt_pe" folder to store the Portuguese version of the dataset.
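The general idea of token-level annotation projection can be sketched as follows. This is a simplified illustration and not necessarily what `src/project_annotations.py` does: it assumes BIO labels, a list of (source index, target index) alignment pairs, and a hypothetical function name `project_labels`. Each target token inherits the component type of the first aligned, labelled source token, and the B-/I- prefixes are then rebuilt from left to right.

```python
def project_labels(src_labels, alignment, trg_len):
    """Project BIO labels from source tokens onto target tokens.

    src_labels: BIO labels of the source tokens, e.g. "B-Claim", "O".
    alignment:  list of (source_index, target_index) pairs.
    trg_len:    number of target tokens.
    """
    # Step 1: copy the component type (label without B-/I-) across the alignment.
    trg_types = [None] * trg_len
    for src_idx, trg_idx in alignment:
        label = src_labels[src_idx]
        if label != "O" and trg_types[trg_idx] is None:
            trg_types[trg_idx] = label.split("-", 1)[1]

    # Step 2: rebuild BIO prefixes; unaligned target tokens default to "O".
    projected, previous = [], None
    for component in trg_types:
        if component is None:
            projected.append("O")
        elif component == previous:
            projected.append("I-" + component)
        else:
            projected.append("B-" + component)
        previous = component
    return projected


# "The red car wins" -> "O carro vermelho vence" (adjective and noun swap places)
src_labels = ["O", "B-Claim", "I-Claim", "I-Claim"]
alignment = [(0, 0), (1, 2), (2, 1), (3, 3)]
print(project_labels(src_labels, alignment, trg_len=4))
# ['O', 'B-Claim', 'I-Claim', 'I-Claim']
```

One limitation of this simplification is that two adjacent components of the same type are merged into one span, and gaps caused by unaligned tokens inside a component are left as "O".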

Evaluation

We performed both intrinsic and extrinsic evaluation of the corpus.

Intrinsic Evaluation

To replicate the results in the paper for intrinsic evaluation, run:

python src/eval_en_from_pt.py data/en_pe/ data/en_from_pt_pe_pad/

Alternatively, you can run the same evaluation per split using the --split flag, and with or without padding using --with_pad.
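Conceptually, the intrinsic evaluation compares the labels of the back-projected English corpus against the original English annotations. The sketch below shows one simple way such a comparison could be scored, as token-level accuracy. It is an illustration only: the metrics reported in the paper and computed by `src/eval_en_from_pt.py` may differ, and the function name `token_accuracy` is hypothetical.

```python
def token_accuracy(gold, predicted):
    """Fraction of positions where the back-projected label equals the gold label."""
    if len(gold) != len(predicted):
        raise ValueError("Label sequences must have the same length")
    if not gold:
        return 0.0
    matches = sum(g == p for g, p in zip(gold, predicted))
    return matches / len(gold)


gold = ["B-Claim", "I-Claim", "I-Claim", "O"]
back_projected = ["B-Claim", "I-Claim", "O", "O"]
print(token_accuracy(gold, back_projected))  # 0.75
```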

Licenses

There are two licenses for this project:
