Dataset Preprocessing
Yiyi Chen edited this page May 13, 2024 · 3 revisions
folder: data_preprocessing
- Deduplicate oscar data.
dedup_oscar.sh
Deduplicates the OSCAR corpus with text-dedup (thanks to the contributors for their great work!)
- install
python3 -m pip install git+https://github.com/ChenghaoMou/text-dedup
- quality control
OSCAR 23.01 documents carry the following length-based annotations (refer to the paper for details):
- tiny: The document has a low (≤ 5) number of lines
- short_sentences: The document has a high number (≥ 50%) of short lines
- header: The document has short lines at its start, indicating that low-quality content could be present there.
- footer: The same check as header, applied to the tail of the document.
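The two threshold-based annotations above can be illustrated with a small sketch. This is not the OSCAR pipeline's code: the function name `length_annotations` is hypothetical, and the cutoff for what counts as a "short" line (`short_line_chars=100`) is an assumption, since this page does not state it. Only the ≤ 5 lines and ≥ 50% thresholds come from the list above.

```python
def length_annotations(text, short_line_chars=100, tiny_max_lines=5, short_ratio=0.5):
    """Return the length-based annotations that apply to one document.

    short_line_chars is an assumed cutoff for a "short" line; the real
    OSCAR pipeline may define it differently.
    """
    # Ignore blank lines so trailing newlines do not inflate the count.
    lines = [ln for ln in text.splitlines() if ln.strip()]
    annotations = []

    # tiny: the document has at most `tiny_max_lines` lines.
    if len(lines) <= tiny_max_lines:
        annotations.append("tiny")

    # short_sentences: at least `short_ratio` of the lines are short.
    if lines:
        short = sum(1 for ln in lines if len(ln) < short_line_chars)
        if short / len(lines) >= short_ratio:
            annotations.append("short_sentences")

    return annotations
```

For example, a three-line document of single words would receive both annotations under these assumptions, while a document of ten long lines would receive none.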
For embedding inversion, we remove documents annotated as "tiny" or "header".
python oscar.py amh_Latn output/am
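The contents of oscar.py are not shown on this page. As a rough sketch of the filtering step only, the snippet below drops any record that carries a "tiny" or "header" annotation. It assumes each record is a JSON object whose annotations live in a list at `metadata.quality_warnings` (possibly null); that field name, and the helper names `keep_document` and `filter_jsonl`, are assumptions for illustration and may not match the actual script.

```python
import json

# Annotations whose presence causes a document to be dropped.
DROP = {"tiny", "header"}


def keep_document(record, drop=DROP):
    """Return True if the record carries none of the dropped annotations."""
    # Assumed layout: record["metadata"]["quality_warnings"] is a list or None.
    warnings = (record.get("metadata") or {}).get("quality_warnings") or []
    return not (set(warnings) & set(drop))


def filter_jsonl(lines, drop=DROP):
    """Yield parsed records from JSON-lines input that pass the filter."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        if keep_document(record, drop):
            yield record
```

A document annotated only with short_sentences or footer is kept by this filter, in line with the rule above.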