
(Parent) XRI to extract ETL script #472

Open
rminsil opened this issue Aug 3, 2024 · 0 comments
Assignees
Labels: enhancement (New feature or request), pipeline 2: extract (Issue related to extracting parallel corpora)


rminsil commented Aug 3, 2024

Overview

A parent issue for a new script in silnlp that converts data dumps in the XRI format into the extract files used to train machine translation models. Essentially it's an ETL process.

Child issues:

Usage

The script is a Python CLI application. Usage would look something like:

$ python xri_etl.py input.tsv swa kcz XRI-2024-08-12
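
As a rough sketch, argument parsing might look something like this (the argument names are illustrative, not a final interface):

# Sketch only: parses the four positional arguments shown in the usage above.
import argparse
from pathlib import Path


def parse_args() -> argparse.Namespace:
    parser = argparse.ArgumentParser(description="Convert an XRI tsv data dump into extract files")
    parser.add_argument("input_file", type=Path, help="path to the XRI tsv file")
    parser.add_argument("source_iso", help="ISO 639-3 code for the source (LWC) language")
    parser.add_argument("target_iso", help="ISO 639-3 code for the target (vernacular) language")
    parser.add_argument("dataset_name", help="dataset identifier, e.g. XRI-2024-08-12")
    return parser.parse_args()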

Stack

The stack matches the current stack and style used for other scripts in the silnlp repo:

  • python 3.8
  • argparse
  • dataclass

etc...

Static analysis tools are used, similar to other projects:

  • black
  • pyflakes
  • mypy

Input data

Input data is a tsv file in the XRI format.

This is a small sample from a kcz file:

id	source	target	split
0	Ilikuwa ni aibu, jinsi desturi hizi, ambazo zilikuwa mwanzo kabisa zimekufa, hazikuwa tena sehemu ya jamii yetu ya kisasa.	Ndɨkʉbha nsoni, kʉmɨzimʉ ɨzi, ɨzyo hambʉkɨ nhana ndɨfwaa, ndɨzɨlʉdʉhu hangɨ ʉlwande lwa bhantʉ vɨsʉ ibha bhalʉno.	train
1	John angeweza kujifunza zaidi kuhusu kundi la kijeshi ikiwa kama mwanahistoria angelijibu swali lake.	Ʉjohn mbee ndɨɨ akakoole ɨlɨhembeka mno kwɨdale ɨlya bhashɨlɨkale mbee akabhee mʉkʉmʉke mbee ndɨajɨbʉ ɨswali lyakwe.	train
2	Akiwa anafanya kazi ya kuonyesha michongo ya zamani kwenye mnara, alikasirikia alipokuwa baada ya kuona grafiti iliyoharibu uso wake.	Haho ndɨakʉbhezya ʉmlɨmo ʉgwailangɨsha ʉʉpʉnzi wa kale kʉ mnara, ndɨagaya haho ndɨaona kwitʉngo grafiti ɨyo ndɨyanonanga ʉshʉ wakwe.	train

The schema is:

  • id - 0-indexed incrementing integer id
  • source - original sentence from the LWC (language of wider communication)
  • target - vernacular translation
  • split - an assignment to train/dev/test

We will assume input files contain fewer than 10K sentence pairs and are safe to load into memory. The samples seen so far are around 2K pairs.
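
A rough sketch of the load step, assuming the schema above (the SentencePair dataclass and function name are illustrative, not a final design):

import csv
from dataclasses import dataclass
from pathlib import Path
from typing import List


@dataclass(frozen=True)
class SentencePair:
    id: int
    source: str
    target: str
    split: str  # "train", "dev" or "test"


def load_pairs(input_file: Path) -> List[SentencePair]:
    # Input files are assumed small enough (<10K pairs) to load eagerly.
    with open(input_file, encoding="utf-8", newline="") as f:
        reader = csv.DictReader(f, delimiter="\t")
        return [
            SentencePair(int(row["id"]), row["source"], row["target"], row["split"])
            for row in reader
        ]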

Output

There would be two sets of output produced by the tool:

(1) source/target files without any kind of train/dev/test annotations (*.all.txt files)
(2) source/target data split up into 2x3=6 smaller files based on train/dev/test

                            SPLIT

                   train     dev     test
        ------------------------------------
        source |         |        |        |
        ------------------------------------
LANG    target |         |        |        |
        ------------------------------------

For (1), file naming conventions are:

# General form
<source_iso>-<dataset_name>.all.txt
<target_iso>-<dataset_name>.all.txt

# Examples
Source: asa-XRI-2024-07-12.all.txt
        ^^^ ^^^^^^^^^^^^^^
        iso dataset name

Target: ngq-XRI-2024-07-12.all.txt
        ^^^ ^^^^^^^^^^^^^^
        iso dataset name

Where the ISO codes for languages are the ISO 639-3 codes defined here.

For (2), file naming conventions are the same except the split is added:

# General form
<source_iso>-<dataset_name>.<split>.txt
<target_iso>-<dataset_name>.<split>.txt

# Examples

                              train                      dev/val                         test
        --------------------------------------------------------------------------------------------------
        source | asa-XRI-2024-07-12.train.txt | asa-XRI-2024-07-12.val.txt | asa-XRI-2024-07-12.test.txt |
        --------------------------------------------------------------------------------------------------
LANG    target | ngq-XRI-2024-07-12.train.txt | ngq-XRI-2024-07-12.val.txt | ngq-XRI-2024-07-12.test.txt |
        --------------------------------------------------------------------------------------------------
                                                                   ^^^
                                                                   NOTE

Note that "val" is used instead of "dev" to make working with downstream training tools easier.

The split is determined by the original input file, not by any split logic in the script.
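
A small helper could implement the naming scheme above, including the dev -> val renaming (the function and mapping names are illustrative):

from typing import Optional

# "dev" in the input data becomes "val" in output filenames
SPLIT_NAMES = {"train": "train", "dev": "val", "test": "test"}


def extract_filename(iso: str, dataset_name: str, split: Optional[str] = None) -> str:
    if split is None:
        return f"{iso}-{dataset_name}.all.txt"
    return f"{iso}-{dataset_name}.{SPLIT_NAMES[split]}.txt"


# extract_filename("asa", "XRI-2024-07-12")         -> "asa-XRI-2024-07-12.all.txt"
# extract_filename("ngq", "XRI-2024-07-12", "dev")  -> "ngq-XRI-2024-07-12.val.txt"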

The tsv data format doesn't specify the source_iso or the dataset name, so these must be provided by the user.
They can potentially be inferred from the tsv filename,
e.g. ngq_parallel_dataset_unified_2024-07-12_15-28-15_1.tsv begins with ngq, indicating Ngoreme,
and has some description of the data.

But to keep things simple and consistent, initially all 3 identifiers must be specified by the user as cli arguments:

  • source_iso
  • target_iso
  • dataset name

Note that currently no English extract files are required. Not all tsv files include English so it's out of scope for now.

Data transformations

There are some known data quality issues and some unknown issues we'll hit as more data arrives.

Some known issues:

  • when translators can't translate a source sentence, they put "!" in the target field - we would filter these out
  • potential duplicate entries
  • trailing whitespace at the end of some target sentences
  • inconsistent use of double vs single quotes
  • inconsistent spacing around punctuation like commas and periods

The script would have some intelligence to skip or repair data in some cases. Initially this logic wouldn't be very sophisticated, and would improve as the script matures and we get more samples to learn from.
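
A rough sketch of the filtering pass, covering just the known issues listed above and reusing the SentencePair sketch from earlier (the function name is illustrative):

from typing import List


def clean_pairs(pairs: List[SentencePair]) -> List[SentencePair]:
    cleaned = []
    seen = set()
    for pair in pairs:
        target = pair.target.rstrip()  # trailing whitespace on some targets
        if target == "!":              # marker for untranslatable sentences
            continue
        key = (pair.source, target)
        if key in seen:                # drop potential duplicate entries
            continue
        seen.add(key)
        cleaned.append(SentencePair(pair.id, pair.source, target, pair.split))
    return cleaned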

Normalization around punctuation and casing could be achieved with the wildebeest library later on.

Statistics

The script will provide some statistics around the data.

Initially this would just be very basic: the number of sentence pairs before and after filtering the data.
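
For example, something as simple as (illustrative only):

def report_stats(raw_count: int, cleaned_count: int) -> None:
    print(f"Pairs before filtering: {raw_count}")
    print(f"Pairs after filtering:  {cleaned_count}")
    print(f"Dropped:                {raw_count - cleaned_count}")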

Later we can potentially look at alignment scores for each pair of extract files, as well as generating a source/target word lexicon for the *.all.txt extract files. We have tools for doing those tasks in the silnlp repo, and would just need some scripts to automate those steps.

@rminsil rminsil self-assigned this Aug 3, 2024
@ddaspit ddaspit moved this from 🆕 New to 🏗 In progress in SIL-NLP Research Aug 9, 2024
@ddaspit ddaspit added enhancement New feature or request pipeline 2: extract Issue related to extracting parallel corpora labels Aug 9, 2024