
🇺🇾 UY22 corpus

This repo contains the notebooks used for scraping the following Uruguayan media sites:

  • El Observador (elobservador.com.uy)
  • El País (elpais.com.uy)
  • La Diaria (ladiaria.com.uy)
  • Montevideo Portal (montevideo.com.uy)

Every scraped article is stored as a .json file with the following structure:

{
    "url":      string,
    "id":       int,
    "date":     string,
    "category": string,
    "title":    string,
    "keywords": []string,
    "cover":    string,
    "body":     string,
}

where

  • url: URL pointing to the original article
  • id: numeric ID (if one exists; otherwise a random UID)
  • date: article's timestamp
  • category: article's category
  • title: article's title or header
  • keywords: article's tags
  • cover: URL pointing to the article's front image (if any)
  • body: article's body
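
A minimal sketch of reading one of these files with Python's standard json module (the file name below is just a placeholder):

import json

# Load a single scraped article (placeholder path).
with open("article.json", encoding="utf-8") as f:
    article = json.load(f)

print(article["title"])
print(article["date"], article["category"])
print(article["keywords"])  # list of tag strings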

Every site is assigned its own directory, and every article is stored inside a subdirectory named after its publishing year.

e.g., uy22-raw/ep22/2019/20190101120000-142502-Los_datos_del_Rey_de.json
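
To walk the whole tree, something like this works (a sketch assuming the uy22-raw layout above):

from pathlib import Path

# Walk the corpus tree: uy22-raw/<site>/<year>/<article>.json
for path in sorted(Path("uy22-raw").glob("*/*/*.json")):
    site, year = path.parts[1], path.parts[2]  # e.g. "ep22", "2019"
    print(site, year, path.name)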

For every corpus, two versions are available:

  1. raw: where body contains the article's raw, unprocessed HTML
  2. clean: where body contains only the text, with all HTML tags removed

The raw and clean versions are about 6 GiB and 4 GiB respectively (totalling 10.3 GiB) and can be downloaded from here or here.
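
The cleaning step itself is not documented here; as a rough approximation (an assumption, not the actual pipeline), a clean body can be derived from a raw one by stripping tags with BeautifulSoup:

from bs4 import BeautifulSoup  # pip install beautifulsoup4

def strip_html(raw_body: str) -> str:
    # Approximate the clean version: drop tags, keep visible text.
    return BeautifulSoup(raw_body, "html.parser").get_text(separator=" ", strip=True)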

For every site there is also a unified+split version: all of its articles concatenated into a single .txt file (totalling 2.4 GiB across sites). Split means that every line contains a single sentence, and unified means that articles are separated by a blank line. The sentence splitting was done with pln-fing-udelar/sentence-splitter:

542M Dec 28 20:47 ep22-unified-splitted.txt
876M Dec 27 23:04 eo22-unified-splitted.txt
854M Dec 27 18:58 mp22-unified-splitted.txt
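
For illustration, a unified+split file could be rebuilt from the clean corpus along these lines; the regex splitter below is only a naive stand-in for pln-fing-udelar/sentence-splitter, and the uy22-clean path is an assumed mirror of the raw layout:

import json
import re
from pathlib import Path

def split_sentences(text: str) -> list[str]:
    # Naive stand-in for pln-fing-udelar/sentence-splitter.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

with open("ep22-unified-splitted.txt", "w", encoding="utf-8") as out:
    for path in sorted(Path("uy22-clean/ep22").glob("*/*.json")):
        body = json.loads(path.read_text(encoding="utf-8"))["body"]
        out.write("\n".join(split_sentences(body)))  # one sentence per line
        out.write("\n\n")  # blank line between articles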

The concatenation of these files was used to train a RoBERTa-like language model with the Hugging Face library; the data can be found at huggingface.co/datasets/pln-udelar/uy22 or on archive.org.
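
Assuming the Hub dataset is in a format the datasets library can read directly (an assumption; split names may differ), loading it should be as simple as:

from datasets import load_dataset  # pip install datasets

ds = load_dataset("pln-udelar/uy22")
print(ds)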

Cite this work

@inproceedings{rouberta2024,
  title={A Language Model Trained on Uruguayan Spanish News Text},
  author={Filevich, Juan Pablo and Marco, Gonzalo and Castro, Santiago and Chiruzzo, Luis and Ros{\'a}, Aiala},
  booktitle={Proceedings of the Second International Workshop Towards Digital Language Equality (TDLE): Focusing on Sustainability @ LREC-COLING 2024},
  pages={53--60},
  year={2024}
}