This is some old corpus augmentation code I wrote around 2014. The main algorithm is in Filter.cpp and is based on the paper "Submodularity for Data Selection in Statistical Machine Translation". There are also supporting scripts for data prep, for transforming the output into a WFST (arpa2fst), for the main mining step (mine_google.py), and so on.
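For readers unfamiliar with the idea, here is a minimal Python sketch of greedy selection under a feature-based submodular objective, the general family of functions that paper discusses. It is not a port of Filter.cpp. The function names (`ngrams`, `greedy_select`), the uniform feature weights, the square-root concave function, the bigram cutoff, and the token budget are all assumptions made for illustration.

```python
import math
from collections import Counter


def ngrams(tokens, n_max=2):
    """Yield every n-gram of order 1..n_max from a list of tokens."""
    for n in range(1, n_max + 1):
        for i in range(len(tokens) - n + 1):
            yield tuple(tokens[i:i + n])


def greedy_select(sentences, budget, n_max=2):
    """Greedily pick sentence indices to maximize
    f(X) = sum_u sqrt(m_u(X)), where m_u(X) is the count of n-gram u
    in the selected set X, subject to a total token budget.

    Assumptions (not taken from Filter.cpp): uniform feature weights,
    sqrt as the concave function, plain (not cost-scaled) marginal gain.
    """
    feats = [Counter(ngrams(s.split(), n_max)) for s in sentences]
    lengths = [len(s.split()) for s in sentences]
    totals = Counter()            # n-gram counts accumulated over the selection
    selected = []
    used = 0
    remaining = set(range(len(sentences)))

    while remaining:
        best, best_gain = None, 0.0
        for i in sorted(remaining):
            if lengths[i] == 0 or used + lengths[i] > budget:
                continue
            # Marginal gain of adding sentence i to the current selection.
            gain = sum(
                math.sqrt(totals[u] + c) - math.sqrt(totals[u])
                for u, c in feats[i].items()
            )
            if gain > best_gain:  # strict '>' keeps the earliest index on ties
                best, best_gain = i, gain
        if best is None:          # nothing left that fits the budget
            break
        selected.append(best)
        remaining.remove(best)
        totals.update(feats[best])
        used += lengths[best]
    return selected
```

Because the square root has diminishing returns, a sentence whose n-grams are already covered by the selection gains less than one that contributes new n-grams, so near-duplicates tend to be picked late or not at all. For example, `greedy_select(["a b", "a b", "c d"], budget=4)` picks the first and third sentences and skips the repeated one.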
TimeDelta/submodularity-for-data-selection