Overview
A parent issue for a new script in silnlp that converts data dumps in the XRI format into extract files used to train machine translation models. Essentially it's an ETL process.
Usage
The script is a python cli application. Usage would look something like:
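The following is a minimal sketch of the intended shape of the CLI; the module path, argument names, and ordering are placeholders, not a settled interface:

# Placeholder module path and argument names - just to illustrate the shape of the CLI.
#   python -m silnlp.common.convert_xri data.tsv asa ngq XRI-2024-07-12
import argparse

def parse_args() -> argparse.Namespace:
    parser = argparse.ArgumentParser(description="Convert an XRI tsv data dump into extract files")
    parser.add_argument("input_file", help="path to the XRI tsv file")
    parser.add_argument("source_iso", help="ISO 639-3 code of the source (LWC) language, e.g. asa")
    parser.add_argument("target_iso", help="ISO 639-3 code of the target (vernacular) language, e.g. ngq")
    parser.add_argument("dataset", help="dataset name used in output filenames, e.g. XRI-2024-07-12")
    return parser.parse_args()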
Stack
The stack is the current stack and style used in the silnlp repo for other scripts:
python 3.8
argparse
dataclass
etc...
Static analysis tools are used, similar to other projects:
black
pyflakes
mypy
Input data
Input data is a tsv file in the XRI format.
This is a small sample from a kcz file:
id	source	target	split
0	Ilikuwa ni aibu, jinsi desturi hizi, ambazo zilikuwa mwanzo kabisa zimekufa, hazikuwa tena sehemu ya jamii yetu ya kisasa.	Ndɨkʉbha nsoni, kʉmɨzimʉ ɨzi, ɨzyo hambʉkɨ nhana ndɨfwaa, ndɨzɨlʉdʉhu hangɨ ʉlwande lwa bhantʉ vɨsʉ ibha bhalʉno.	train
1	John angeweza kujifunza zaidi kuhusu kundi la kijeshi ikiwa kama mwanahistoria angelijibu swali lake.	Ʉjohn mbee ndɨɨ akakoole ɨlɨhembeka mno kwɨdale ɨlya bhashɨlɨkale mbee akabhee mʉkʉmʉke mbee ndɨajɨbʉ ɨswali lyakwe.	train
2	Akiwa anafanya kazi ya kuonyesha michongo ya zamani kwenye mnara, alikasirikia alipokuwa baada ya kuona grafiti iliyoharibu uso wake.	Haho ndɨakʉbhezya ʉmlɨmo ʉgwailangɨsha ʉʉpʉnzi wa kale kʉ mnara, ndɨagaya haho ndɨaona kwitʉngo grafiti ɨyo ndɨyanonanga ʉshʉ wakwe.	train
The schema is:
id - 0-indexed incrementing integer id
source - original sentence from the LWC
target - vernacular translation
split - an assignment to train/dev/test
We will assume input files contain fewer than 10K sentence pairs and are safe to load into memory. The samples seen so far are around 2K pairs.
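Given that, a minimal parsing sketch might look like the following (SentencePair and load_pairs are illustrative names, not an agreed interface):

import csv
from dataclasses import dataclass
from pathlib import Path
from typing import List

@dataclass(frozen=True)
class SentencePair:
    id: int          # 0-indexed incrementing integer id
    source: str      # original sentence from the LWC
    target: str      # vernacular translation
    split: str       # train/dev/test assignment from the input file

def load_pairs(tsv_path: Path) -> List[SentencePair]:
    # Files are assumed to be well under 10K rows, so reading everything into memory is fine.
    with tsv_path.open(encoding="utf-8", newline="") as f:
        reader = csv.DictReader(f, delimiter="\t")
        return [
            SentencePair(int(row["id"]), row["source"], row["target"], row["split"])
            for row in reader
        ]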
Output
There would be two sets of output produced by the tool:
(1) source/target files without any kind of train/dev/test annotations (*.all.txt files)
(2) source/target data split up into 2x3=6 smaller files based on train/dev/test
                      SPLIT
               train     dev      test
             ----------------------------
      source |        |        |        |
             ----------------------------
 LANG target |        |        |        |
             ----------------------------
For (1), file naming conventions are:
# General form
<source_iso>-<dataset_name>.all.txt
<target_iso>-<dataset_name>.all.txt
# Examples
Source: asa-XRI-2024-07-12.all.txt
        ^^^ ^^^^^^^^^^^^^^
        iso dataset name
Target: ngq-XRI-2024-07-12.all.txt
        ^^^ ^^^^^^^^^^^^^^
        iso dataset name
Where the iso codes for languages are the ISO 639-3 codes defined here.
For (2), file naming conventions are the same except the split is added:
# General form
<source_iso>-<dataset_name>.<split>.txt
<target_iso>-<dataset_name>.<split>.txt
# Examples
                          train                         dev/val                        test
             ------------------------------------------------------------------------------------------
      source | asa-XRI-2024-07-12.train.txt | asa-XRI-2024-07-12.val.txt | asa-XRI-2024-07-12.test.txt |
             ------------------------------------------------------------------------------------------
 LANG target | ngq-XRI-2024-07-12.train.txt | ngq-XRI-2024-07-12.val.txt | ngq-XRI-2024-07-12.test.txt |
             ------------------------------------------------------------------------------------------
                                                                 ^^^
                                                                 NOTE
Note that "val" is used instead of "dev" to make working with downstream training tools easier.
The split is determined by the original input file, not by any split logic in the script.
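Putting the naming scheme and the split handling together, a minimal sketch of writing the outputs (building on the SentencePair sketch above; write_extracts and its parameters are illustrative, not a fixed design) might be:

from pathlib import Path
from typing import Dict, List

# Map the input "dev" split to "val" in output filenames, per the note above.
SPLIT_NAMES: Dict[str, str] = {"train": "train", "dev": "val", "test": "test"}

def write_lines(path: Path, lines: List[str]) -> None:
    path.write_text("\n".join(lines) + "\n", encoding="utf-8")

def write_extracts(pairs: List[SentencePair], source_iso: str, target_iso: str,
                   dataset: str, output_dir: Path) -> None:
    # (1) .all.txt files containing every pair, with no split annotations
    write_lines(output_dir / f"{source_iso}-{dataset}.all.txt", [p.source for p in pairs])
    write_lines(output_dir / f"{target_iso}-{dataset}.all.txt", [p.target for p in pairs])
    # (2) one source and one target file per split, using the split column from the input tsv
    for input_split, output_split in SPLIT_NAMES.items():
        subset = [p for p in pairs if p.split == input_split]
        write_lines(output_dir / f"{source_iso}-{dataset}.{output_split}.txt", [p.source for p in subset])
        write_lines(output_dir / f"{target_iso}-{dataset}.{output_split}.txt", [p.target for p in subset])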
The tsv data format doesn't specify the source_iso, target_iso, or dataset name, so these must be provided by the user.
These can potentially be inferred from the tsv filename,
e.g. ngq_parallel_dataset_unified_2024-07-12_15-28-15_1.tsv begins with ngq (indicating Ngoreme)
and contains some description of the data.
But to keep things simple and consistent, initially all 3 identifiers must be specified by the user as cli arguments:
source_iso
target_iso
dataset name
Note that currently no English extract files are required. Not all tsv files include English so it's out of scope for now.
Data transformations
There are some known data quality issues and some unknown issues we'll hit as more data arrives.
Some known issues:
when translators can't translate a source sentence, they put "!" in the target field - we would filter these out
potential duplicate entries
trailing whitespace at the end of some target sentences
inconsistent use of double vs single quotes
inconsistent spacing around punctuation like commas and periods
The script would have some intelligence to skip or repair data in some cases. Initially this logic wouldn't be very sophisticated, and it would improve as the script matures and we get more samples to learn from.
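As an illustration, the first pass of that logic might look something like this sketch (clean_pairs is an illustrative name and only covers some of the known issues above):

from typing import List, Set, Tuple

def clean_pairs(pairs: List[SentencePair]) -> List[SentencePair]:
    cleaned: List[SentencePair] = []
    seen: Set[Tuple[str, str]] = set()
    for p in pairs:
        source = p.source.strip()
        target = p.target.strip()          # drop leading/trailing whitespace
        if target == "!":                  # "!" marks an untranslatable sentence; skip it
            continue
        if not source or not target:       # skip empty rows
            continue
        key = (source, target)
        if key in seen:                    # drop exact duplicate entries
            continue
        seen.add(key)
        cleaned.append(SentencePair(p.id, source, target, p.split))
    return cleaned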
Normalization around punctuation and casing could be achieved with the wildebeest library later on.
Statistics
The script will provide some statistics around the data.
Initially this would just be very basic data for the number of training pairs before and after filtering the data.
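For example, the initial reporting could be as simple as this sketch (report_stats is an illustrative name):

def report_stats(raw: List[SentencePair], cleaned: List[SentencePair]) -> None:
    # Very basic initial statistics: pair counts before and after filtering, overall and per split.
    print(f"Pairs before filtering: {len(raw)}")
    print(f"Pairs after filtering:  {len(cleaned)}")
    for split in ("train", "dev", "test"):
        before = sum(1 for p in raw if p.split == split)
        after = sum(1 for p in cleaned if p.split == split)
        print(f"  {split}: {before} -> {after}")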
Later we can potentially look at alignment scores for each pair of extract files as well as generating a source/target word lexicon for the *.all.txt extract files. We have tools for doing those tasks in the silnlp repo, and would just need some scripts to automate those steps.