Word alignment try 2 #267

johnml1135 · 2024-11-05T21:58:46Z

Add word alignment engine to IInteractiveTranslationEngine.

codecov-commenter · 2024-11-05T22:00:50Z

Codecov Report

Attention: Patch coverage is 0% with 20 lines in your changes missing coverage. Please review.

Project coverage is 69.92%. Comparing base (7f2af4e) to head (08a2719).

Files with missing lines	Patch %	Lines
...SIL.Machine/Translation/HybridTranslationEngine.cs	0.00%	20 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##           master     #267      +/-   ##
==========================================
- Coverage   69.96%   69.92%   -0.05%     
==========================================
  Files         379      379              
  Lines       31778    31798      +20     
  Branches     4456     4456              
==========================================
  Hits        22235    22235              
- Misses       8509     8529      +20     
  Partials     1034     1034

ddaspit

What is this change for?

Reviewable status: 0 of 3 files reviewed, all discussions resolved

johnml1135 · 2024-11-06T16:09:45Z

This is needed for adding the word alignment engine to Serval. Just exposing the alignment endpoints to the interactive engine.

johnml1135 · 2024-11-06T16:11:41Z

This needs to be merged and released before the Serval changes will be able to compile.

ddaspit

I'm still not sure I understand what this is for. There are already interfaces for word alignment models. Also, phrase alignment isn't word alignment. That is specific to the Thot SMT engine.

johnml1135 · 2024-11-07T15:37:22Z

The ThotSmtModel appears to be the best place to add the alignment routines, since the "phrase alignment" just means that the tokenizer can be configured. If I don't use ThotSmtModel, what specifically would I use? IWordAligner assumes that the source and target are already tokenized. Also, how would it interact with loading models built by machine.py?

ddaspit

For word alignment, you should use one of the classes that inherits from ThotWordAlignmentModel. For SMT and word alignment models, you will need to tokenize the text. We should just use the LatinWordTokenizer like we do for the SMT engine.
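
To illustrate the suggestion above (tokenize the text first, then hand the tokens to a token-level aligner), here is a minimal, library-independent sketch in Python. It is not code from this PR or from SIL.Machine: `simple_word_tokenize`, `diagonal_aligner`, and `TokenizingAligner` are hypothetical names, and the regex tokenizer is only a stand-in for a real tokenizer such as LatinWordTokenizer.

```python
import re
from typing import Callable, List, Set, Tuple

# A link is (source token index, target token index).
Alignment = Set[Tuple[int, int]]
Tokenizer = Callable[[str], List[str]]
TokenAligner = Callable[[List[str], List[str]], Alignment]


def simple_word_tokenize(text: str) -> List[str]:
    """Hypothetical stand-in tokenizer: splits into words and punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)


def diagonal_aligner(source: List[str], target: List[str]) -> Alignment:
    """Toy aligner that links tokens at the same position.

    A trained word alignment model would take its place in practice.
    """
    return {(i, i) for i in range(min(len(source), len(target)))}


class TokenizingAligner:
    """Wraps a token-level aligner so that callers can pass untokenized text."""

    def __init__(
        self,
        aligner: TokenAligner,
        source_tokenizer: Tokenizer,
        target_tokenizer: Tokenizer,
    ) -> None:
        self.aligner = aligner
        self.source_tokenizer = source_tokenizer
        self.target_tokenizer = target_tokenizer

    def align(self, source_text: str, target_text: str):
        # Tokenize both sides, then delegate to the token-level aligner.
        source_tokens = self.source_tokenizer(source_text)
        target_tokens = self.target_tokenizer(target_text)
        links = self.aligner(source_tokens, target_tokens)
        return source_tokens, target_tokens, links


if __name__ == "__main__":
    wrapper = TokenizingAligner(
        diagonal_aligner, simple_word_tokenize, simple_word_tokenize
    )
    src, trg, links = wrapper.align("Hello, world!", "Hola, mundo!")
    print(src, trg, sorted(links))
```

The point of the wrapper is that the aligner itself keeps assuming tokenized input, while the tokenizer choice stays a separate, swappable component.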

johnml1135 · 2024-11-08T21:23:15Z

Hmm. It would be quite a bit of reworking. I would have to use a different wording than ThotWordAlignmentModel, because that refers only to the asymmetrical alignment, not the symmetrical alignment with a tokenizer. In Python, the word aligner has the tokenizer connected to it. I could rework the Machine word aligner to include the tokenizer, but that would be a fair amount of work. The solution I have appears to be a good minimal one: treat the ThotSmtModel as a SymmetrizedWordAlignmentModel with tokenizers. It already allows the truecaser to be null.

Otherwise, I think I would have to create a base class of ThotSmtModel, called something like ThotSymmetrizedWordAlignmentModelWithTokenizer, in which half of the functionality of ThotSmtModel is implemented. And even then, all the configurations, trainers, and everything else would need to be torn apart and rewritten.

I think this minimal change is the best solution - it looks like a word aligner on Serval but is just an SMT model underneath.
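
For readers unfamiliar with the asymmetrical versus symmetrical distinction mentioned above, the following Python sketch shows the general idea of combining a source-to-target alignment with a target-to-source alignment. It is an illustration only: intersection and union are two simple, well-known symmetrization heuristics, and nothing here claims to be what Thot or SIL.Machine actually does. All function names are hypothetical.

```python
from typing import Set, Tuple

# A link is (source token index, target token index).
Alignment = Set[Tuple[int, int]]


def invert(alignment: Set[Tuple[int, int]]) -> Set[Tuple[int, int]]:
    """Swaps the two indices of every link."""
    return {(j, i) for i, j in alignment}


def symmetrize_intersection(direct: Alignment, inverse: Alignment) -> Alignment:
    """Keeps only links that both directions agree on.

    `direct` holds (source, target) links from the source-to-target model;
    `inverse` holds (target, source) links from the target-to-source model.
    """
    return direct & invert(inverse)


def symmetrize_union(direct: Alignment, inverse: Alignment) -> Alignment:
    """Keeps every link proposed by either direction."""
    return direct | invert(inverse)


if __name__ == "__main__":
    direct = {(0, 0), (1, 2), (2, 1)}
    inverse = {(0, 0), (2, 1), (3, 2)}  # stored as (target, source)
    print(sorted(symmetrize_intersection(direct, inverse)))
    print(sorted(symmetrize_union(direct, inverse)))
```

Intersection favors precision and union favors recall; practical systems often use heuristics that fall between the two.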

ddaspit

The ThotSmtModel is a full phrase-based SMT system and takes a lot more computation and time to train. The phrase alignment from the SMT model uses a different algorithm than the word alignment models and is much more expensive. Unfortunately, it is not a replacement for the word alignment models. We should meet to discuss how best to proceed. I'm sure that if I had a better understanding of what you are trying to achieve, we could come up with a good solution.

johnml1135 added 2 commits November 5, 2024 16:04

a start

16fb76d

Add word alignment to hybrid engine

08a2719

johnml1135 requested a review from ddaspit November 5, 2024 21:58

ddaspit reviewed Nov 5, 2024

ddaspit reviewed Nov 6, 2024

ddaspit reviewed Nov 7, 2024

ddaspit reviewed Nov 9, 2024
