Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

extract_corpora: Invalid book ID causing multiple books not to be extracted from the project #574

Open
mmartin9684-sil opened this issue Oct 25, 2024 · 2 comments
Assignees
Labels
bug Something isn't working enhancement New feature or request

Comments

@mmartin9684-sil
Copy link
Collaborator

A project (Galego_2024_10_24.save) has an invalid book ID in the "\id" tag ("\id JUD") for the book of Judges; it should be "JDG" instead of "JUD".

No error was reported for this mismatch, but the book content was not extracted into the extract file. Also, most other books in the project were not extracted into the extract file either. The extract file contained verse extracts only for GEN, JUD, and REV, even though the project has a complete NT and multiple OT books.

A warning about the book ID mistake would be helpful, and the error should not affect the extraction of the other books.

@mmartin9684-sil mmartin9684-sil added bug Something isn't working enhancement New feature or request labels Oct 25, 2024
@ddaspit ddaspit removed their assignment Oct 28, 2024
@ddaspit ddaspit moved this from 🆕 New to 🔖 Ready in SIL-NLP Research Oct 28, 2024
@mshannon-sil
Copy link
Collaborator

I took a deep dive into what's going on, and this is what I found. In the paratext project, as you mentioned, the file for Judges is using "JUD" for the "\id", which is the id for the book of Jude, not Judges. So when the project_corpus is initialized inside the extract_corpora function, the versification for the project_corpus rows incorrectly uses JUD when it gets to Judges, although at this point there aren't any verses missing. The verses go missing when the project_corpus is then aligned with the reference_corpus (the corpus that contains every possible verse ref in the chosen versification) to create the parallel_corpus. The get_rows method for the parallel_corpus first goes through Genesis like normal, and then when it reaches the second book Judges, it sees the JUD id in the project_corpus rows, and then iterates through the reference corpus rows until it gets to JUD. Then since the reference corpus is now at JUD, once all the JUD rows are finished, REV is the next verse ref in the reference_corpus, and get_rows will iterate through the project_corpus until it gets to REV as well, skipping any verses that come before that.

Saying all this, I think the best solution would be to prevent this scenario from happening by throwing an error if the "\id" tag does not match the book id in the filename. I don't think allowing the error to exist without affecting the extraction process and just throwing a warning is practical, since it would mean needing to completely redesign the algorithm for aligning rows, and that's probably not necessary just to handle malformed data.

@ddaspit
Copy link
Collaborator

ddaspit commented Nov 11, 2024

@mshannon-sil I agree with your analysis. When there is a mismatch between the file name and the "\id" tag, we don't know which one is correct. We could decide to always use the book code from the file name and ignore the "\id" tag and that would fix the issue in this case, but in other cases, it might be the opposite. The safest thing to do is to throw an exception.

@mshannon-sil mshannon-sil moved this from 🔖 Ready to 🏗 In progress in SIL-NLP Research Nov 11, 2024
@mshannon-sil mshannon-sil moved this from 🏗 In progress to 👀 In review in SIL-NLP Research Nov 11, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working enhancement New feature or request
Projects
Status: 👀 In review
Development

No branches or pull requests

4 participants