-
-
Notifications
You must be signed in to change notification settings - Fork 7
Verse segmentation fix #883
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
…usly, it crashed in this case)
Enkidu93
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Enkidu93 reviewed 1 of 1 files at r1, all commit messages.
Reviewable status:complete! all files reviewed, all discussions resolved (waiting on @benjaminking)
…can be fed directly into NLLB)
|
I've also added some code to output the resulting verses in vref format (so that they can easily be fed directly into NLLB). @Enkidu93, would you also be able to review this code? Thank you! |
Enkidu93
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure! This looks good.
My one concern is versification: It looks like the source and target verses are created without having a versification specified. In both cases, the versification will default to the English versification (luckily the same!). The vref.txt format, I believe, follows the Original versification.
I think we could address handling different versifications properly in a separate issue (unless you'd like to just do it now). For the time being could you confirm that there's a mismatch and if so just change the versification of the verse refs in verse_map to Original before str()ing them so the lookup will be apples-to-apples?
@Enkidu93 reviewed all commit messages.
Reviewable status: 0 of 1 files reviewed, all discussions resolved (waiting on @benjaminking)
|
That's a good point, thanks for catching that! I've been digging into the code around versification, but I've never worked with it before, so I'm not sure that I've implemented this correctly. In Paratext, the source project is marked as having a versification of 4 (would that be English versification?). I'm not sure what the versification of the target project is, since I don't have access to their Paratext project. |
Enkidu93
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Versification is a hairy thing 😬 😁 . 4 would be English, yes. Yeah, it seems like we just didn't think of this before. Probably not making much difference at the moment but we'll definitely want to address this as we move to a more established pipeline.
This works well, but I put in a comment for consideration!
@Enkidu93 reviewed all commit messages.
Reviewable status: 0 of 1 files reviewed, 1 unresolved discussion (waiting on @benjaminking and @laura-burdick-sil)
silnlp/alignment/segment_verses.py line 729 at r3 (raw file):
for trg_passage in trg_segmented_passages: for verse in trg_passage.verses: verse.reference.change_versification(ORIGINAL_VERSIFICATION)
You might consider using VerseRef.to_versification which doesn't mutate the original reference and call it while iterating below. Besides saving a few lines, it would guarantee that steps after this step which may access those verse references will access the unchanged references.
|
Thanks for the comment - that's a good suggestion! I'll add an issue about addressing this versification issue more thoroughly at some point. |
Previously, the verse segmentation algorithm was crashing when there was a passage that was only a single verse. This fixes that behavior.
This change is