Dataset Preprocessing


Yiyi Chen edited this page May 13, 2024 · 3 revisions

folder: data_preprocessing

Deduplicate OSCAR data.

dedup_oscar.sh

Deduplicates the OSCAR corpus with text-dedup (thanks to the contributors for their great work!).

Install: python3 -m pip install git+https://github.com/ChenghaoMou/text-dedup
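To illustrate what deduplication means at its simplest, here is a minimal sketch of exact deduplication by content hash. This is a simplified stand-in, not what dedup_oscar.sh or text-dedup does internally; the function name `exact_dedup` and the whitespace normalisation are assumptions made for illustration.

```python
import hashlib
from typing import Iterable, Iterator


def exact_dedup(texts: Iterable[str]) -> Iterator[str]:
    """Yield each text the first time its normalised content is seen."""
    seen = set()
    for text in texts:
        # Collapse whitespace so trivially different copies hash the same
        # (an assumption for this sketch, not a text-dedup setting).
        normalised = " ".join(text.split())
        key = hashlib.sha256(normalised.encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            yield text


sample = ["hello world", "hello   world", "something else"]
unique = list(exact_dedup(sample))
```

Exact hashing only catches identical documents; near-duplicate detection needs fuzzier methods than this sketch covers.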

Quality control

OSCAR 2301 is annotated with the following length-based annotations (refer to the paper for details):

tiny: The document has a low (≤ 5) number of lines
short_sentences: The document has a high number (≥ 50%) of short lines
header: The document has short lines at its start, which indicates that low-quality content could be present there.
footer: The same check as header, applied to the tail of the document.
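The two count-based annotations above can be sketched as follows. This is only an illustration under stated assumptions: the list above does not say what counts as a "short" line, so `short_line_chars` is a hypothetical parameter, and skipping blank lines is likewise an assumption. The header and footer checks are left out because their exact rule is not given here.

```python
from typing import List


def length_annotations(text: str, short_line_chars: int = 100) -> List[str]:
    """Return the "tiny" / "short_sentences" annotations for one document.

    short_line_chars is a hypothetical threshold, not a value taken
    from OSCAR.
    """
    # Ignore blank lines when counting (assumption for this sketch).
    lines = [line for line in text.splitlines() if line.strip()]
    annotations = []

    # tiny: the document has at most 5 lines.
    if len(lines) <= 5:
        annotations.append("tiny")

    # short_sentences: at least 50% of the lines are short.
    short = sum(1 for line in lines if len(line) < short_line_chars)
    if lines and short / len(lines) >= 0.5:
        annotations.append("short_sentences")

    return annotations


long_line = "x" * 200
few_long_lines = "\n".join([long_line] * 3)
many_mixed_lines = "\n".join([long_line] * 5 + ["short"] * 5)
many_long_lines = "\n".join([long_line] * 10)
```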

For embedding inversion, we remove documents annotated with "tiny" or "header".

python oscar.py amh_Latn output/am
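The removal rule for embedding inversion could look like the minimal sketch below. It is not the contents of oscar.py. The record layout (a dict with an "annotations" list that may be missing or None) and the names `keep_document` and `filter_documents` are assumptions for illustration, so adapt the key to however your copy of OSCAR stores its annotations.

```python
from typing import Dict, Iterable, Iterator, Optional

# Annotations whose documents are dropped for embedding inversion.
EXCLUDED_ANNOTATIONS = frozenset({"tiny", "header"})


def keep_document(annotations: Optional[Iterable[str]]) -> bool:
    """Return True if the document carries none of the excluded annotations."""
    # A document without annotations may store None (assumption).
    return EXCLUDED_ANNOTATIONS.isdisjoint(annotations or ())


def filter_documents(
    docs: Iterable[Dict], key: str = "annotations"
) -> Iterator[Dict]:
    """Yield only the documents that pass keep_document.

    `key` is a hypothetical field name for this sketch.
    """
    for doc in docs:
        if keep_document(doc.get(key)):
            yield doc


docs = [
    {"text": "a", "annotations": ["tiny"]},
    {"text": "b", "annotations": ["footer", "short_sentences"]},
    {"text": "c", "annotations": None},
    {"text": "d", "annotations": ["header", "footer"]},
]
kept = [doc["text"] for doc in filter_documents(docs)]
```

Documents annotated only with "footer" or "short_sentences" are kept, matching the rule stated above.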