
Partition the build dataset into train, dev, and test

Thamme Gowda edited this page May 2, 2019 · 1 revision

Use the tool at saral/tools/corpus_splitter.py

$ corpus_splitter.py  -h
usage: corpus_splitter.py [-h] -i IN -o OUT -dev DEV -test TEST

Corpus Splitter - makes test and devsplits, by isolating all segments of each
document into a split, and also satisfies word count constraints set in
commandline args.

optional arguments:
  -h, --help            show this help message and exit
  -i IN, --in IN        material data file
  -o OUT, --out OUT     Output prefix
  -dev DEV, --dev DEV   Development size in number of tokens
  -test TEST, --test TEST
                        Test Size in number of tokens
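
The help text describes the splitting idea only briefly: every segment of a document stays in the same split, and the dev and test splits are filled until they reach the requested token counts. A minimal sketch of that idea follows. It is not the code in corpus_splitter.py; the function name `split_by_document`, the `(doc_id, text)` input shape, the random document order, and the whitespace token count are all assumptions made for illustration.

```python
import random
from collections import defaultdict


def split_by_document(segments, dev_tokens, test_tokens, seed=0):
    """Assign whole documents to dev, test, or train.

    segments: iterable of (doc_id, text) pairs (hypothetical input shape).
    Returns a dict mapping split name to a list of (doc_id, text) pairs.
    """
    # Group segments by document so that a document is never divided.
    docs = defaultdict(list)
    for doc_id, text in segments:
        docs[doc_id].append(text)

    # Shuffle document ids reproducibly; sorting first makes the result
    # independent of the input order.
    doc_ids = sorted(docs)
    random.Random(seed).shuffle(doc_ids)

    budgets = {"dev": dev_tokens, "test": test_tokens}
    counts = {"dev": 0, "test": 0}
    splits = {"dev": [], "test": [], "train": []}

    for doc_id in doc_ids:
        # Fill dev first, then test; every remaining document goes to train.
        target = "train"
        for name in ("dev", "test"):
            if counts[name] < budgets[name]:
                target = name
                break
        for text in docs[doc_id]:
            splits[target].append((doc_id, text))
            if target != "train":
                # Tokens are counted by whitespace splitting (an assumption).
                counts[target] += len(text.split())
    return splits
```

Because whole documents are added one at a time, a split in this sketch can overshoot its token budget by up to one document. The real tool may handle that boundary differently.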

Example:

corpus_splitter.py  -i  BUILD/bitext/MATERIAL_OP1-2S-BUILD_bitext.txt \
 -o mt-out/build1/2S/2S-build -dev 40000 -test 40000

Then rename files as follows:

2S-builddev.orig.src-ref.tsv
2S-buildtest.orig.src-ref.tsv
2S-buildtrain.orig.src-ref.tsv
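
To sanity-check that the dev and test files came out near the requested sizes, a small token counter like the one below may help. It is a sketch under assumptions: the helper `count_tokens` is hypothetical, the files are assumed to be tab-separated with one segment per row, and the column that holds the text to count (`column=0` here) is a guess that should be checked against the actual files.

```python
import csv


def count_tokens(tsv_path, column=0):
    """Count whitespace-separated tokens in one column of a TSV file."""
    total = 0
    with open(tsv_path, encoding="utf-8", newline="") as handle:
        # QUOTE_NONE keeps quote characters literal, which suits raw bitext.
        reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        for row in reader:
            # Skip rows that are too short to contain the requested column.
            if len(row) > column:
                total += len(row[column].split())
    return total
```

For example, `count_tokens("2S-builddev.orig.src-ref.tsv")` would be expected to land near the value passed to -dev, assuming the tool counts tokens on that same column.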