Initial add (with Git LFS for big data files)
proycon committed Oct 3, 2017
0 parents commit 914bd4e
Showing 12 changed files with 722,463 additions and 0 deletions.
41 changes: 41 additions & 0 deletions README.md
@@ -0,0 +1,41 @@
Oersetter Models
=================

Maarten van Gompel
Centre for Language and Speech Technology
Radboud University Nijmegen
Licensed under [OPEN LICENCE TO BE DETERMINED BY FRYSKE AKADEMY]

This repository contains models for Oersetter, the Frisian-Dutch Machine
Translation system developed by Radboud University Nijmegen in close
collaboration with Fryske Akademy. This repository does not contain the
literary sources for the models, as supplied by the Fryske Akademy, because
those are copyrighted. It only contains derivative data from which the sources
cannot be reconstructed.

It contains the following:

* ``nl-fy`` - *Dutch to Frisian*
    * ``moses.ini`` - Configuration for [Moses](http://www.statmt.org/moses/) with parameters optimized on a held-out development set using MERT. This file references all the others; please read the notices inside.
    * ``fy.lm`` - Language model (ARPA-style, generated with SRILM, should also run with the KenLM supplied with Moses)
    * ``phrase-table.gz`` - The phrase-translation table
    * ``reordering-table.wbe-msd-bidirectional-fe.gz`` - Reordering table
* ``fy-nl`` - *Frisian to Dutch*
    * ``moses.ini`` - Configuration for [Moses](http://www.statmt.org/moses/) with parameters optimized on a held-out development set using MERT. This file references all the others; please read the notices inside.
    * ``nl.lm.gz`` - Language model (ARPA-style, generated with SRILM, should also run with the KenLM supplied with Moses). This is the large model, trained on the Frisian parallel corpora, OpenSubtitles and Europarl.
    * ``nl.tiny.lm`` - A small language model trained only on the Frisian parallel corpora and used during testing
    * ``phrase-table.gz`` - The phrase-translation table
    * ``reordering-table.wbe-msd-bidirectional-fe.gz`` - Reordering table

This system is to be used with [Moses](http://www.statmt.org/moses/). A moses2
server can then be started as follows:

```
moses2 -f moses.ini --server --server-port 2002 --mark-unknown --unknown-word-prefix "<em>" --unknown-word-suffix "</em>"
```
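Once the server is running, a client can talk to it over XML-RPC. The following is a minimal sketch using only the Python standard library, not part of this repository. It assumes the moses2 server exposes the usual Moses `translate` method at `/RPC2` on the port given above and returns a struct with a `text` field. The helper names (`build_request`, `unknown_words`, `translate`) are invented for illustration.

```python
import re
import xmlrpc.client

# Assumption: host and port match the moses2 command above, and /RPC2 is
# the endpoint the Moses XML-RPC server listens on.
SERVER_URL = "http://localhost:2002/RPC2"

# Matches words the decoder wrapped in the <em>...</em> markers configured
# via --unknown-word-prefix and --unknown-word-suffix.
UNKNOWN_RE = re.compile(r"<em>(.*?)</em>")


def build_request(text):
    """Build the parameter struct for the `translate` call."""
    return {"text": text}


def unknown_words(translation):
    """Return the words that were marked as unknown in a translation."""
    return UNKNOWN_RE.findall(translation)


def translate(text, url=SERVER_URL):
    """Send one sentence to the server and return the translated text."""
    proxy = xmlrpc.client.ServerProxy(url)
    result = proxy.translate(build_request(text))
    return result["text"]
```

Calling `translate(...)` needs a running server; `unknown_words` can then be applied to its return value to list the words the models did not cover.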

A RESTful webservice wrapper that communicates with such a Moses server (and
also provides a web-interface for users) is provided
[separately](https://github.com/proycon/oersetter-webservice) and is powered by
[CLAM](https://proycon.github.io/clam).

3 changes: 3 additions & 0 deletions fy-nl/.gitattributes
@@ -0,0 +1,3 @@
phrase-table.gz filter=lfs diff=lfs merge=lfs -text
reordering-table.wbe-msd-bidirectional-fe.gz filter=lfs diff=lfs merge=lfs -text
nl.lm.gz filter=lfs diff=lfs merge=lfs -text
44 changes: 44 additions & 0 deletions fy-nl/moses.ini
@@ -0,0 +1,44 @@
# MERT optimized configuration
# decoder moses
# BLEU 0.576313 on dev dev.fy
# We were before running iteration 5
# finished Wed Aug 31 15:27:29 CEST 2016
### MOSES CONFIG FILE ###
#########################

# input factors
[input-factors]
0

# mapping steps
[mapping]
0 T 0

[distortion-limit]
6

# feature functions
# **NOTICE**: path= statements are relative, you may need to turn them into absolute paths in your environment!
[feature]
UnknownWordPenalty
WordPenalty
PhrasePenalty
PhraseDictionaryMemory name=TranslationModel0 num-features=4 path=phrase-table.gz input-factor=0 output-factor=0
LexicalReordering name=LexicalReordering0 num-features=6 type=wbe-msd-bidirectional-fe-allff input-factor=0 output-factor=0 path=reordering-table.wbe-msd-bidirectional-fe.gz
Distortion
KENLM name=LM0 factor=0 path=nl.lm.gz order=2
#NOTICE: ^-- If the above language model gives any trouble, gunzip it and adapt the path

# dense weights for feature functions

[threads]
10
[weight]

LexicalReordering0= 0.0934955 0.00140425 0.0376116 0.115159 0.0329388 0.0612791
Distortion0= 0.0701483
LM0= 0.108341
WordPenalty0= 0.128481
PhrasePenalty0= 0.102767
TranslationModel0= 0.149808 0.0126496 0.050318 0.035599
UnknownWordPenalty0= 1
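
The NOTICE in this configuration warns that the `path=` statements are relative and may need to be made absolute. One way to do that is sketched below, assuming the model files sit in a single directory next to `moses.ini`. The function name `absolutize_paths` and the directory used in the usage note are hypothetical, not something this repository provides.

```python
import os
import re

# Matches a `path=<value>` argument on a feature line of moses.ini.
PATH_RE = re.compile(r"\bpath=(\S+)")


def absolutize_paths(ini_text, model_dir):
    """Rewrite relative path= values so they point inside model_dir."""
    def fix(match):
        path = match.group(1)
        if os.path.isabs(path):
            return match.group(0)  # already absolute, keep as-is
        return "path=" + os.path.join(model_dir, path)

    fixed_lines = []
    for line in ini_text.splitlines():
        # Leave comment lines untouched; only feature lines carry real paths.
        if not line.lstrip().startswith("#"):
            line = PATH_RE.sub(fix, line)
        fixed_lines.append(line)
    return "\n".join(fixed_lines) + "\n"
```

For example, reading `moses.ini`, passing its text through `absolutize_paths(text, "/opt/oersetter/fy-nl")` and writing the result to a new file leaves the original configuration intact.
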
3 changes: 3 additions & 0 deletions fy-nl/nl.lm.gz
Git LFS file not shown
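
The three lines added for `nl.lm.gz` are presumably a Git LFS pointer rather than the model itself, since the file is tracked through `.gitattributes`. If a clone was made without Git LFS installed, Moses would be handed that small text file instead of the gzipped model. The sketch below checks for that case; it relies on the pointer layout from the Git LFS specification (a `version` line followed by `oid` and `size`), and the function names are hypothetical.

```python
LFS_VERSION_LINE = "version https://git-lfs.github.com/spec/v1"


def parse_lfs_pointer(data):
    """Return {'oid': ..., 'size': ...} if data is an LFS pointer, else None."""
    try:
        text = data.decode("utf-8")
    except UnicodeDecodeError:
        return None  # real binary content, e.g. an actual gzip file
    lines = text.splitlines()
    if not lines or lines[0] != LFS_VERSION_LINE:
        return None
    # Remaining lines are "key value" pairs separated by a single space.
    fields = dict(line.split(" ", 1) for line in lines[1:] if " " in line)
    if "oid" not in fields or not fields.get("size", "").isdigit():
        return None
    return {"oid": fields["oid"], "size": int(fields["size"])}


def is_lfs_pointer(path):
    """Check whether the file at path is still an unfetched LFS pointer."""
    with open(path, "rb") as handle:
        # Pointer files are tiny, so the first kilobyte is enough.
        return parse_lfs_pointer(handle.read(1024)) is not None
```

If `is_lfs_pointer("fy-nl/nl.lm.gz")` returns `True`, running `git lfs pull` in the clone should fetch the real file.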