Word alignment try 2 #267
Codecov Report
Attention: Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## master #267 +/- ##
==========================================
- Coverage 69.96% 69.92% -0.05%
==========================================
Files 379 379
Lines 31778 31798 +20
Branches 4456 4456
==========================================
Hits 22235 22235
- Misses 8509 8529 +20
Partials 1034 1034
What is this change for?
Reviewable status: 0 of 3 files reviewed, all discussions resolved
This is needed for adding the word alignment engine to Serval. Just exposing the alignment endpoints to the interactive engine.
This needs to be merged and released before the Serval changes will be able to compile.
I'm still not sure I understand what this is for. There are already interfaces for word alignment models. Also, phrase alignment isn't word alignment. That is specific to the Thot SMT engine.
The ThotSmtModel appears to be the best place to add the alignment routines, since the "phrase alignment" just means that the tokenizer can be configured. If I don't use ThotSmtModel, what specifically would I use instead? IWordAligner assumes that the source and target are already tokenized. Also, how would it interact with loading models built by machine.py?
For word alignment, you should use one of the classes that inherits from `ThotWordAlignmentModel`. For SMT and word alignment models, you will need to tokenize the text. We should just use the `LatinWordTokenizer` like we do for the SMT engine.
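To illustrate the point about tokenization, here is a minimal, self-contained Python sketch of wrapping an aligner that expects pre-tokenized input with a tokenizer. Every name in it (`simple_word_tokenizer`, `IdenticalTokenAligner`, `TokenizingAligner`) is hypothetical and is not part of Machine, machine.py, or Thot; the regex tokenizer only stands in for something like `LatinWordTokenizer`, and the toy aligner only stands in for a trained word alignment model.

```python
import re
from typing import Callable, List, Sequence, Set, Tuple

Tokenizer = Callable[[str], List[str]]
Alignment = Set[Tuple[int, int]]  # (source token index, target token index)


def simple_word_tokenizer(text: str) -> List[str]:
    """Hypothetical stand-in for a word tokenizer: splits words and punctuation."""
    return re.findall(r"\w+|[^\w\s]", text)


class IdenticalTokenAligner:
    """Toy aligner over pre-tokenized segments (not a Thot model).

    It links every pair of tokens that match case-insensitively.
    """

    def align(self, source: Sequence[str], target: Sequence[str]) -> Alignment:
        return {
            (i, j)
            for i, s in enumerate(source)
            for j, t in enumerate(target)
            if s.lower() == t.lower()
        }


class TokenizingAligner:
    """Accepts raw text, tokenizes both sides, then delegates to the aligner."""

    def __init__(self, aligner: IdenticalTokenAligner,
                 source_tokenizer: Tokenizer, target_tokenizer: Tokenizer) -> None:
        self.aligner = aligner
        self.source_tokenizer = source_tokenizer
        self.target_tokenizer = target_tokenizer

    def align_text(self, source_text: str,
                   target_text: str) -> Tuple[List[str], List[str], Alignment]:
        source = self.source_tokenizer(source_text)
        target = self.target_tokenizer(target_text)
        return source, target, self.aligner.align(source, target)


if __name__ == "__main__":
    wrapper = TokenizingAligner(IdenticalTokenAligner(),
                                simple_word_tokenizer, simple_word_tokenizer)
    src, trg, links = wrapper.align_text("Paul went home.", "paul ging heim.")
    print(src, trg, sorted(links))
```

The design choice sketched here is that tokenization lives in a thin wrapper, so the underlying aligner can keep its tokenized-input contract and does not need to be an SMT model.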
Hmm. It would be quite a bit of reworking. I would have to use a different wording than Otherwise, I think I would have to create a base class of ThotSmtModel called ThotSymmetrizedWordAlignmentModelWithTokenizer? in which half of the functionality of ThotSmtModel is implemented. And even then, all the configurations and trainers and everything else would need to be torn apart and rewritten. I think this minimal change is the best solution: it looks like a word aligner on Serval but is just an SMT model underneath.
The `ThotSmtModel` is a full phrase-based SMT system and takes a lot more computation and time to train. The phrase alignment from the SMT model uses a different algorithm than the word alignment models and is much more expensive. Unfortunately, it is not a replacement for the word alignment models. We should meet to discuss how best to proceed. I'm sure that if I had a better understanding of what you are trying to achieve, we could come up with a good solution.
Add word alignment engine to IInteractiveTranslationEngine.