Dataset Preprocessing

Yiyi Chen edited this page May 13, 2024 · 3 revisions

folder: data_preprocessing

  1. Deduplicate OSCAR data

dedup_oscar.sh

Deduplicates the OSCAR corpus with text-dedup (thanks to its contributors for their great work!).

  • Install text-dedup: python3 -m pip install git+https://github.com/ChenghaoMou/text-dedup
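To illustrate what deduplication does, here is a minimal sketch of exact-match deduplication by hashing. It is not what dedup_oscar.sh or text-dedup runs (text-dedup also provides near-duplicate methods); the function name `exact_dedup` and the whitespace and case normalization are assumptions made for this example.

```python
import hashlib


def exact_dedup(texts):
    """Yield each text the first time its normalized form is seen."""
    seen = set()
    for text in texts:
        # Lowercase and collapse whitespace so trivially different copies
        # map to the same hash (an assumption of this sketch).
        normalized = " ".join(text.lower().split())
        key = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            yield text


if __name__ == "__main__":
    docs = ["Hello  world", "hello world", "Something else"]
    for doc in exact_dedup(docs):
        print(doc)
```

Only the first copy of each document is kept, and the original order is preserved.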
  2. Quality control

OSCAR 23.01 is annotated with the following length-based annotations (refer to the paper):

  • tiny: The document has a low number of lines (≤ 5).
  • short_sentences: The document has a high proportion of short lines (≥ 50%).
  • header: The document starts with short lines, which indicates that low-quality content could be present at the start of the document.
  • footer: The same check as header, applied to the tail of the document.

For embedding inversion, we remove documents annotated with "tiny" or "header".

python oscar.py amh_Latn output/am
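The filtering rule above can be sketched as follows. This is a hedged illustration, not the contents of oscar.py: it assumes each record is a JSON line whose annotations sit in a list (or null) under `metadata`, and the field name `quality_warnings` as well as the helpers `keep_document` and `filter_jsonl` are assumptions made for this example.

```python
import json

# Annotations that disqualify a document for embedding inversion.
REMOVED = {"tiny", "header"}


def keep_document(annotations):
    """Return True if the document carries none of the REMOVED annotations."""
    # Treat a missing or null annotation list as "no annotations".
    return not (set(annotations or []) & REMOVED)


def filter_jsonl(lines, field="quality_warnings"):
    """Yield parsed records whose annotations pass keep_document.

    `field` is the assumed name of the annotation list inside "metadata".
    """
    for line in lines:
        record = json.loads(line)
        annotations = (record.get("metadata") or {}).get(field)
        if keep_document(annotations):
            yield record


if __name__ == "__main__":
    sample = [
        '{"content": "a", "metadata": {"quality_warnings": ["tiny"]}}',
        '{"content": "b", "metadata": {"quality_warnings": null}}',
        '{"content": "c", "metadata": {"quality_warnings": ["footer"]}}',
    ]
    for record in filter_jsonl(sample):
        print(record["content"])
```

Documents annotated only with short_sentences or footer are kept, matching the rule stated above.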