GitHub - esalesky/NYFNN: project code for CMU 11-747 and online BPE expansion from the paper "Optimizing Segmentation Granularity for Neural Machine Translation"

project code for 11-747

Single step postprocessing

./external_scripts/run-bleu-score.sh

Usage: run-bleu-score.sh [-d] [-h] output ref.txt ref.xml src.xml tgt_lang
Options:
    -d Detokenize the test file after removing bpe splits
    -h Display this help
    Note: tgt_lang should be written out in full, e.g. czech or english

useful preprocessing/postprocessing commands

to strip XML from a data directory (currently runs on all files in the directory):

python3 data_xml_to_txt.py -d dir

to tokenize/detokenize English (use -l to select another language):

perl external_scripts/tokenizer.perl (-l [en|cs|...]) (-threads 4) < textfile > tokenizedfile
perl external_scripts/detokenizer.perl (-l [en|cs|...]) < tokenizedfile > detokenizedfile

to tokenize Czech (currently runs on all files in the directory):

python3 morphology/run_czech_transform.py -d dir -y morphodita_dict -t morphodita_parser_file

to generate bpe:

./subword-nmt/learn_bpe.py -s {num_operations} < {train_file} > {codes_file}
./subword-nmt/apply_bpe.py -c {codes_file} < {test_file}
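
For intuition about what the learned codes file contains, here is a toy sketch of BPE merge learning. It illustrates the general algorithm on a tiny hand-made vocabulary and is not the subword-nmt implementation; the function names (count_pairs, merge_pair, learn_merges) and the toy vocabulary are assumptions made for this example.

```python
import collections
import re


def count_pairs(vocab):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = collections.Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for left, right in zip(symbols, symbols[1:]):
            pairs[left, right] += freq
    return pairs


def merge_pair(pair, vocab):
    """Join every whole-symbol occurrence of `pair` into one symbol."""
    pattern = re.compile(r"(?<!\S)" + re.escape(" ".join(pair)) + r"(?!\S)")
    merged = "".join(pair)
    return {pattern.sub(lambda m: merged, word): freq
            for word, freq in vocab.items()}


def learn_merges(vocab, num_operations):
    """Return the first `num_operations` merges, most frequent pair first."""
    merges = []
    for _ in range(num_operations):
        pairs = count_pairs(vocab)
        if not pairs:
            break
        # Ties are broken by first occurrence (dict insertion order).
        best = max(pairs, key=pairs.get)
        vocab = merge_pair(best, vocab)
        merges.append(best)
    return merges


# Toy vocabulary: words split into characters, "</w>" marks end of word.
toy_vocab = {
    "l o w </w>": 5,
    "l o w e r </w>": 2,
    "n e w e s t </w>": 6,
    "w i d e s t </w>": 3,
}
print(learn_merges(toy_vocab, 3))
# [('e', 's'), ('es', 't'), ('est', '</w>')]
```

Each merge corresponds roughly to one operation counted by the -s {num_operations} argument above: more operations yield longer subword units and fewer splits per word.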

to de-bpe:

sed -r 's/(@@ )|(@@ ?$)//g'
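
The same substitution can be done from Python when post-processing inside a script. This is a minimal sketch that reuses the regular expression from the sed command above; the function name remove_bpe is an assumption made for this example and is not part of the repository.

```python
import re

# Same pattern as the sed command: drop "@@ " joiners inside a line
# and a dangling "@@" (optionally followed by a space) at the end.
BPE_JOIN = re.compile(r"(@@ )|(@@ ?$)")


def remove_bpe(line: str) -> str:
    """Undo BPE splitting on a single line of text."""
    # Strip the newline first so "$" anchors at the true end of the line.
    return BPE_JOIN.sub("", line.rstrip("\n"))


print(remove_bpe("the new@@ est mod@@ el\n"))
# the newest model
```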

to score with BLEU (-lc lowercases before scoring):

perl external_scripts/multi-bleu.perl -lc ref < hyp

repository contents:

config
data
external_scripts
morphology
subword-nmt @ 27f1ab8
test
.gitignore
.gitmodules
README.md
batching.py
beam_search.py
bleu.py
bpe_dir.sh
conditional_gru.py
count_novel_words.py
data_xml_to_txt.py
encdec.py
main.py
params.py
plot_attention.py
preprocessing.py
run_generate.py
train_monitor.py
training.py
utils.py
vocab_check.py
