Partition build dataset into train dev test
Thamme Gowda edited this page May 2, 2019 · 1 revision
Use the tool at saral/tools/corpus_splitter.py
$ corpus_splitter.py -h
usage: corpus_splitter.py [-h] -i IN -o OUT -dev DEV -test TEST

Corpus Splitter - makes test and devsplits, by isolating all segments of each
document into a split, and also satisfies word count constraints set in
commandline args.

optional arguments:
  -h, --help            show this help message and exit
  -i IN, --in IN        material data file
  -o OUT, --out OUT     Output prefix
  -dev DEV, --dev DEV   Development size in number of tokens
  -test TEST, --test TEST
                        Test Size in number of tokens
Example:
corpus_splitter.py -i BUILD/bitext/MATERIAL_OP1-2S-BUILD_bitext.txt \
-o mt-out/build1/2S/2S-build -dev 40000 -test 40000
Then rename the output files as follows:
2S-builddev.orig.src-ref.tsv
2S-buildtest.orig.src-ref.tsv
2S-buildtrain.orig.src-ref.tsv
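
The help text says the tool keeps all segments of a document in the same split while meeting token-count targets for dev and test. The following is a minimal sketch of that idea, not the actual code in corpus_splitter.py. The function name split_by_document, the (doc_id, text) input shape, whitespace tokenization, and the seeded shuffle are all assumptions made for illustration.

```python
import random
from collections import defaultdict


def split_by_document(segments, dev_tokens, test_tokens, seed=0):
    """Assign whole documents to dev, test, or train.

    segments: iterable of (doc_id, text) pairs (hypothetical input shape).
    dev_tokens / test_tokens: minimum token counts wanted in each split.
    """
    # Group segments by document so a document is never divided.
    docs = defaultdict(list)
    for doc_id, text in segments:
        docs[doc_id].append(text)

    # Shuffle document order reproducibly (an assumption, not from the tool).
    doc_ids = sorted(docs)
    random.Random(seed).shuffle(doc_ids)

    splits = {"dev": [], "test": [], "train": []}
    counts = {"dev": 0, "test": 0}
    budgets = {"dev": dev_tokens, "test": test_tokens}

    for doc_id in doc_ids:
        # Whitespace tokenization stands in for the tool's real token count.
        n_tokens = sum(len(text.split()) for text in docs[doc_id])
        target = "train"
        # Fill dev first, then test; everything else goes to train.
        for name in ("dev", "test"):
            if counts[name] < budgets[name]:
                target = name
                counts[name] += n_tokens
                break
        splits[target].extend((doc_id, text) for text in docs[doc_id])

    return splits
```

Because whole documents are assigned at once, a split can overshoot its target by up to one document's worth of tokens.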