
(Parent) XRI to extract ETL script #472

Open
rminsil opened this issue Aug 3, 2024 · 0 comments
Assignees
Labels: enhancement (New feature or request), pipeline 2: extract (Issue related to extracting parallel corpora)


rminsil commented Aug 3, 2024

Overview

A parent issue for a new script in silnlp that converts data dumps in the XRI format into the extract files used to train machine translation models. Essentially it's an ETL process.

Child issues:

Usage

The script is a Python CLI application. Usage would look something like:

$ python xri_etl.py input.tsv swa kcz XRI-2024-08-12
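
As a rough sketch, argument parsing might look something like this (the argument names are illustrative, not a final interface):

# Sketch only: parses the four positional arguments shown in the usage above.
import argparse
from pathlib import Path


def parse_args() -> argparse.Namespace:
    parser = argparse.ArgumentParser(description="Convert an XRI tsv data dump into extract files")
    parser.add_argument("input_file", type=Path, help="path to the XRI tsv file")
    parser.add_argument("source_iso", help="ISO 639-3 code for the source (LWC) language")
    parser.add_argument("target_iso", help="ISO 639-3 code for the target (vernacular) language")
    parser.add_argument("dataset_name", help="dataset identifier, e.g. XRI-2024-08-12")
    return parser.parse_args()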

Stack

The stack matches the current stack and style used for other scripts in the silnlp repo:

  • python 3.8
  • argparse
  • dataclass

etc...

Static analysis tools are used, similar to other projects:

  • black
  • pyflakes
  • mypy

Input data

Input data is a tsv file in the XRI format.

This is a small sample from a kcz file:

id	source	target	split
0	Ilikuwa ni aibu, jinsi desturi hizi, ambazo zilikuwa mwanzo kabisa zimekufa, hazikuwa tena sehemu ya jamii yetu ya kisasa.	Ndɨkʉbha nsoni, kʉmɨzimʉ ɨzi, ɨzyo hambʉkɨ nhana ndɨfwaa, ndɨzɨlʉdʉhu hangɨ ʉlwande lwa bhantʉ vɨsʉ ibha bhalʉno.	train
1	John angeweza kujifunza zaidi kuhusu kundi la kijeshi ikiwa kama mwanahistoria angelijibu swali lake.	Ʉjohn mbee ndɨɨ akakoole ɨlɨhembeka mno kwɨdale ɨlya bhashɨlɨkale mbee akabhee mʉkʉmʉke mbee ndɨajɨbʉ ɨswali lyakwe.	train
2	Akiwa anafanya kazi ya kuonyesha michongo ya zamani kwenye mnara, alikasirikia alipokuwa baada ya kuona grafiti iliyoharibu uso wake.	Haho ndɨakʉbhezya ʉmlɨmo ʉgwailangɨsha ʉʉpʉnzi wa kale kʉ mnara, ndɨagaya haho ndɨaona kwitʉngo grafiti ɨyo ndɨyanonanga ʉshʉ wakwe.	train

The schema is:

  • id - 0-indexed incrementing integer id
  • source - original sentence from the LWC (language of wider communication)
  • target - vernacular translation
  • split - an assignment to train/dev/test

We will assume input files contain fewer than 10K sentence pairs and are safe to load into memory. The samples seen so far are around 2K pairs.
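
A rough sketch of the load step, assuming the schema above (the SentencePair dataclass and function name are illustrative, not a final design):

import csv
from dataclasses import dataclass
from pathlib import Path
from typing import List


@dataclass(frozen=True)
class SentencePair:
    id: int
    source: str
    target: str
    split: str  # "train", "dev" or "test"


def load_pairs(input_file: Path) -> List[SentencePair]:
    # Input files are assumed small enough (<10K pairs) to load eagerly.
    with open(input_file, encoding="utf-8", newline="") as f:
        reader = csv.DictReader(f, delimiter="\t")
        return [
            SentencePair(int(row["id"]), row["source"], row["target"], row["split"])
            for row in reader
        ]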

Output

There would be two sets of output produced by the tool:

(1) source/target files without any kind of train/dev/test annotations (*.all.txt files)
(2) source/target data split up into 2x3=6 smaller files based on train/dev/test

                            SPLIT

                   train     dev     test
        ------------------------------------
        source |         |        |        |
        ------------------------------------
LANG    target |         |        |        |
        ------------------------------------

For (1), file naming conventions are:

# General form
<source_iso>-<dataset_name>.all.txt
<target_iso>-<dataset_name>.all.txt

# Examples
Source: asa-XRI-2024-07-12.all.txt
        ^^^ ^^^^^^^^^^^^^^
        iso dataset name

Target: ngq-XRI-2024-07-12.all.txt
        ^^^ ^^^^^^^^^^^^^^
        iso dataset name

Where the ISO codes for languages are the ISO 639-3 codes defined here.

For (2), file naming conventions are the same except the split is added:

# General form
<source_iso>-<dataset_name>.<split>.txt
<target_iso>-<dataset_name>.<split>.txt

# Examples

                              train                      dev/val                         test
        --------------------------------------------------------------------------------------------------
        source | asa-XRI-2024-07-12.train.txt | asa-XRI-2024-07-12.val.txt | asa-XRI-2024-07-12.test.txt |
        --------------------------------------------------------------------------------------------------
LANG    target | ngq-XRI-2024-07-12.train.txt | ngq-XRI-2024-07-12.val.txt | ngq-XRI-2024-07-12.test.txt |
        --------------------------------------------------------------------------------------------------
                                                                   ^^^
                                                                   NOTE

Note that "val" is used instead of "dev" to make working with downstream training tools easier.

The split is determined by the original input file, not by any split logic in the script.
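
A small helper could implement the naming scheme above, including the dev -> val renaming (the function and mapping names are illustrative):

from typing import Optional

# "dev" in the input data becomes "val" in output filenames
SPLIT_NAMES = {"train": "train", "dev": "val", "test": "test"}


def extract_filename(iso: str, dataset_name: str, split: Optional[str] = None) -> str:
    if split is None:
        return f"{iso}-{dataset_name}.all.txt"
    return f"{iso}-{dataset_name}.{SPLIT_NAMES[split]}.txt"


# extract_filename("asa", "XRI-2024-07-12")         -> "asa-XRI-2024-07-12.all.txt"
# extract_filename("ngq", "XRI-2024-07-12", "dev")  -> "ngq-XRI-2024-07-12.val.txt"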

The tsv data format doesn't specify the source_iso or the dataset name, so these must be provided by the user.
They can potentially be inferred from the tsv filename,
e.g. ngq_parallel_dataset_unified_2024-07-12_15-28-15_1.tsv begins with ngq, indicating Ngoreme,
and has some description of the data.

But to keep things simple and consistent, initially all 3 identifiers must be specified by the user as cli arguments:

  • source_iso
  • target_iso
  • dataset name

Note that currently no English extract files are required. Not all tsv files include English so it's out of scope for now.

Data transformations

There are some known data quality issues and some unknown issues we'll hit as more data arrives.

Some known issues:

  • when translators can't translate a source sentence, they put "!" in the target field - we would filter these out
  • potential duplicate entries
  • trailing whitespace at the end of some target sentences
  • inconsistent use of double vs single quotes
  • inconsistent spacing around punctuation like commas and periods

The script would have some intelligence to skip or repair data in some cases. Initially this logic wouldn't be very sophisticated, and would improve as the script matures and we get more samples to learn from.
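
A rough sketch of the filtering pass, covering just the known issues listed above and reusing the SentencePair sketch from earlier (the function name is illustrative):

from typing import List


def clean_pairs(pairs: List[SentencePair]) -> List[SentencePair]:
    cleaned = []
    seen = set()
    for pair in pairs:
        target = pair.target.rstrip()  # trailing whitespace on some targets
        if target == "!":              # marker for untranslatable sentences
            continue
        key = (pair.source, target)
        if key in seen:                # drop potential duplicate entries
            continue
        seen.add(key)
        cleaned.append(SentencePair(pair.id, pair.source, target, pair.split))
    return cleaned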

Normalization around punctuation and casing could be achieved with the wildebeest library later on.

Statistics

The script will provide some statistics around the data.

Initially this would just be very basic: the number of sentence pairs before and after filtering the data.
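
For example, something as simple as (illustrative only):

def report_stats(raw_count: int, cleaned_count: int) -> None:
    print(f"Pairs before filtering: {raw_count}")
    print(f"Pairs after filtering:  {cleaned_count}")
    print(f"Dropped:                {raw_count - cleaned_count}")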

Later we can potentially look at alignment scores for each pair of extract files, as well as generating a source/target word lexicon for the *.all.txt extract files. We have tools for doing those tasks in the silnlp repo, and would just need some scripts to automate those steps.

@rminsil rminsil self-assigned this Aug 3, 2024
@ddaspit ddaspit moved this from 🆕 New to 🏗 In progress in SIL-NLP Research Aug 9, 2024
@ddaspit ddaspit added enhancement New feature or request pipeline 2: extract Issue related to extracting parallel corpora labels Aug 9, 2024