extract_corpora: Invalid book ID causing multiple books not to be extracted from the project #574

mmartin9684-sil · 2024-10-25T13:29:41Z

A project (Galego_2024_10_24.save) has an invalid book ID in the "\id" tag ("\id JUD") for the book of Judges; it should be "JDG" instead of "JUD".

No error was reported for this mismatch, but the book content was not extracted into the extract file. Also, most other books in the project were not extracted into the extract file either. The extract file contained verse extracts only for GEN, JUD, and REV, even though the project has a complete NT and multiple OT books.

A warning about the book ID mistake would be helpful, and the error should not affect the extraction of the other books.

mshannon-sil · 2024-11-11T20:27:33Z

I took a deep dive into what's going on, and this is what I found. In the paratext project, as you mentioned, the file for Judges is using "JUD" for the "\id", which is the id for the book of Jude, not Judges. So when the project_corpus is initialized inside the extract_corpora function, the versification for the project_corpus rows incorrectly uses JUD when it gets to Judges, although at this point there aren't any verses missing. The verses go missing when the project_corpus is then aligned with the reference_corpus (the corpus that contains every possible verse ref in the chosen versification) to create the parallel_corpus. The get_rows method for the parallel_corpus first goes through Genesis like normal, and then when it reaches the second book Judges, it sees the JUD id in the project_corpus rows, and then iterates through the reference corpus rows until it gets to JUD. Then since the reference corpus is now at JUD, once all the JUD rows are finished, REV is the next verse ref in the reference_corpus, and get_rows will iterate through the project_corpus until it gets to REV as well, skipping any verses that come before that.

Saying all this, I think the best solution would be to prevent this scenario from happening by throwing an error if the "\id" tag does not match the book id in the filename. I don't think allowing the error to exist without affecting the extraction process and just throwing a warning is practical, since it would mean needing to completely redesign the algorithm for aligning rows, and that's probably not necessary just to handle malformed data.

ddaspit · 2024-11-11T21:28:18Z

@mshannon-sil I agree with your analysis. When there is a mismatch between the file name and the "\id" tag, we don't know which one is correct. We could decide to always use the book code from the file name and ignore the "\id" tag and that would fix the issue in this case, but in other cases, it might be the opposite. The safest thing to do is to throw an exception.

mmartin9684-sil added bug Something isn't working enhancement New feature or request labels Oct 25, 2024

mmartin9684-sil assigned ddaspit, isaac091 and mshannon-sil Oct 25, 2024

ddaspit removed their assignment Oct 28, 2024

ddaspit added this to SIL-NLP Research Oct 28, 2024

github-project-automation bot moved this to 🆕 New in SIL-NLP Research Oct 28, 2024

ddaspit moved this from 🆕 New to 🔖 Ready in SIL-NLP Research Oct 28, 2024

mshannon-sil moved this from 🔖 Ready to 🏗 In progress in SIL-NLP Research Nov 11, 2024

mshannon-sil mentioned this issue Nov 11, 2024

Raise error when id tag doesn't match filename book id sillsdev/machine.py#141

Merged

mshannon-sil moved this from 🏗 In progress to 👀 In review in SIL-NLP Research Nov 11, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

extract_corpora: Invalid book ID causing multiple books not to be extracted from the project #574

extract_corpora: Invalid book ID causing multiple books not to be extracted from the project #574

mmartin9684-sil commented Oct 25, 2024

mshannon-sil commented Nov 11, 2024

ddaspit commented Nov 11, 2024

extract_corpora: Invalid book ID causing multiple books not to be extracted from the project #574

extract_corpora: Invalid book ID causing multiple books not to be extracted from the project #574

Comments

mmartin9684-sil commented Oct 25, 2024

mshannon-sil commented Nov 11, 2024

ddaspit commented Nov 11, 2024