Skip to content

Do not translate markers #153

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 6 commits into from
Feb 12, 2025
Merged

Do not translate markers #153

merged 6 commits into from
Feb 12, 2025

Conversation

johnml1135
Copy link
Collaborator

@johnml1135 johnml1135 commented Jan 31, 2025

This is a mirror PR to sillsdev/machine#281.


This change is Reviewable

@codecov-commenter
Copy link

codecov-commenter commented Jan 31, 2025

Codecov Report

Attention: Patch coverage is 98.45560% with 4 lines in your changes missing coverage. Please review.

Project coverage is 88.38%. Comparing base (50cb1de) to head (13853af).

Files with missing lines Patch % Lines
tests/corpora/test_usfm_memory_text.py 85.71% 3 Missing ⚠️
tests/corpora/test_usfm_manual.py 0.00% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #153      +/-   ##
==========================================
+ Coverage   88.30%   88.38%   +0.07%     
==========================================
  Files         275      275              
  Lines       16235    16389     +154     
==========================================
+ Hits        14337    14485     +148     
- Misses       1898     1904       +6     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@johnml1135
Copy link
Collaborator Author

@isaac091 and @ddaspit - this is ready for reviews.

Copy link
Collaborator

@isaac091 isaac091 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should very minimally affect my current code

Reviewed 12 of 12 files at r1, all commit messages.
Reviewable status: all files reviewed, 1 unresolved discussion (waiting on @ddaspit and @johnml1135)


machine/corpora/paratext_project_text_updater_base.py line 30 at r1 (raw file):

        text_behavior: UpdateUsfmTextBehavior = UpdateUsfmTextBehavior.PREFER_EXISTING,
        embed_behavior: UpdateUsfmIntraVerseMarkerBehavior = UpdateUsfmIntraVerseMarkerBehavior.PRESERVE,
        syle_behavior: UpdateUsfmIntraVerseMarkerBehavior = UpdateUsfmIntraVerseMarkerBehavior.STRIP,

"style" is misspelled

Copy link
Contributor

@ddaspit ddaspit left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 12 of 12 files at r1, all commit messages.
Reviewable status: all files reviewed, 10 unresolved discussions (waiting on @johnml1135)


machine/corpora/scripture_ref_usfm_parser_handler.py line 35 at r1 (raw file):

        return ScriptureTextType.NONE if len(self._cur_text_type_stack) == 0 else self._cur_text_type_stack[-1]

    def get_in_note_text(self) -> bool:

This method should be prefixed with an underscore, since it is only intended to be used internally or by subclasses. Also, _is_in_note_text would be a more consistent name.


machine/corpora/scripture_ref_usfm_parser_handler.py line 122 at r1 (raw file):

        self._in_embed = False

    def start_embed(self, state: UsfmParserState, marker: str, caller: str, category: Optional[str]) -> None:

This method should be prefixed with an underscore, since it is only intended to be used internally or by subclasses.


machine/corpora/scripture_ref_usfm_parser_handler.py line 130 at r1 (raw file):

            self._next_element(marker)

    def end_embed(

This method should be prefixed with an underscore, since it is only intended to be used internally or by subclasses.


machine/corpora/scripture_ref_usfm_parser_handler.py line 173 at r1 (raw file):

    def _end_non_verse_text(self, state: UsfmParserState, scripture_ref: ScriptureRef) -> None: ...

    def start_note_text(self, state: UsfmParserState):

This method should be prefixed with an underscore, since it is only intended to be used internally or by subclasses.


machine/corpora/scripture_ref_usfm_parser_handler.py line 180 at r1 (raw file):

    def _start_note_text(self, state: UsfmParserState, scripture_ref: ScriptureRef) -> None: ...

    def end_note_text(self, state: UsfmParserState):

This method should be prefixed with an underscore, since it is only intended to be used internally or by subclasses.


tests/testutils/data/usfm/Tes/41MATTes.SFM line 2 at r1 (raw file):

\id MAT - Test
\f + \fr 1.0 \ft \f*

We are trying to transition away from testing USFM reading/writing using these test files. I would like to stop using this file to test new cases.


tests/corpora/tests.cs line 0 at r1 (raw file):
I think this file got checked-in accidentally.


machine/corpora/update_usfm_parser_handler.py line 18 at r1 (raw file):

class UpdateUsfmIntraVerseMarkerBehavior(Enum):

I think we can just shorten this to UpdateUsfmMarkerBehavior.


machine/corpora/update_usfm_parser_handler.py line 25 at r1 (raw file):

class UpdateUsfmParserHandler(ScriptureRefUsfmParserHandler):

    _untranslatable_paragraph_tags = ("r", "rem")

This class could be used for other cases where you want to update the text in USFM and not just translation. Can we handle these specific translation cases in Serval, i.e. don't pass in translation segments for these tags?

@johnml1135
Copy link
Collaborator Author

machine/corpora/paratext_project_text_updater_base.py line 30 at r1 (raw file):

Previously, isaac091 (Isaac Schifferer) wrote…

"style" is misspelled

Nice catch :-)

@johnml1135
Copy link
Collaborator Author

machine/corpora/scripture_ref_usfm_parser_handler.py line 35 at r1 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

This method should be prefixed with an underscore, since it is only intended to be used internally or by subclasses. Also, _is_in_note_text would be a more consistent name.

Done.

@johnml1135
Copy link
Collaborator Author

machine/corpora/scripture_ref_usfm_parser_handler.py line 122 at r1 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

This method should be prefixed with an underscore, since it is only intended to be used internally or by subclasses.

Done.

@johnml1135
Copy link
Collaborator Author

machine/corpora/scripture_ref_usfm_parser_handler.py line 130 at r1 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

This method should be prefixed with an underscore, since it is only intended to be used internally or by subclasses.

It's different for dotnet - underscores are only for private variables, not protected. I'll update it.

@johnml1135
Copy link
Collaborator Author

machine/corpora/scripture_ref_usfm_parser_handler.py line 173 at r1 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

This method should be prefixed with an underscore, since it is only intended to be used internally or by subclasses.

Done.

@johnml1135
Copy link
Collaborator Author

machine/corpora/scripture_ref_usfm_parser_handler.py line 180 at r1 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

This method should be prefixed with an underscore, since it is only intended to be used internally or by subclasses.

Done.

@johnml1135
Copy link
Collaborator Author

machine/corpora/update_usfm_parser_handler.py line 18 at r1 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

I think we can just shorten this to UpdateUsfmMarkerBehavior.

Done.

@johnml1135
Copy link
Collaborator Author

tests/corpora/tests.cs line at r1 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

I think this file got checked-in accidentally.

Done.

@johnml1135
Copy link
Collaborator Author

tests/testutils/data/usfm/Tes/41MATTes.SFM line 2 at r1 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

We are trying to transition away from testing USFM reading/writing using these test files. I would like to stop using this file to test new cases.

I tried to keep the changes minimal and move many of the tests other places. Because the parsing was different, I needed to at least update some of the results, especially around the notes.

Copy link
Contributor

@ddaspit ddaspit left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 9 of 9 files at r2, all commit messages.
Reviewable status: all files reviewed, 3 unresolved discussions (waiting on @isaac091 and @johnml1135)


machine/corpora/scripture_ref_usfm_parser_handler.py line 130 at r1 (raw file):

Previously, johnml1135 (John Lambert) wrote…

It's different for dotnet - underscores are only for private variables, not protected. I'll update it.

Yeah, by convention, we've been using underscores for private and protected fields/methods in Machine.py.


tests/testutils/data/usfm/Tes/41MATTes.SFM line 2 at r1 (raw file):

Previously, johnml1135 (John Lambert) wrote…

I tried to keep the changes minimal and move many of the tests other places. Because the parsing was different, I needed to at least update some of the results, especially around the notes.

I'm not sure I understand why changes were necessary to this file. Shouldn't UsfmTextBase still parse this file the same way it did before? The tests for this file are in place to guard against regressions.

Copy link
Collaborator

@isaac091 isaac091 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed all commit messages.
Reviewable status: all files reviewed, 2 unresolved discussions (waiting on @johnml1135)

@johnml1135
Copy link
Collaborator Author

machine/corpora/update_usfm_parser_handler.py line 25 at r1 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

This class could be used for other cases where you want to update the text in USFM and not just translation. Can we handle these specific translation cases in Serval, i.e. don't pass in translation segments for these tags?

Ok - this really needs to be done in Serval then, not in Machine. I'll tear it out.

@johnml1135
Copy link
Collaborator Author

tests/testutils/data/usfm/Tes/41MATTes.SFM line 2 at r1 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

I'm not sure I understand why changes were necessary to this file. Shouldn't UsfmTextBase still parse this file the same way it did before? The tests for this file are in place to guard against regressions.

Yes, I could have made the changes with independent tests, rather then overloading these. The only actual "change" per say is the 1:esb changing to 2:esb - it parses differently because the \fm tag registers as an embed. If we need to, I can remove the modifications and add more individual tests to verify the edge cases.

@johnml1135
Copy link
Collaborator Author

machine/corpora/update_usfm_parser_handler.py line 25 at r1 (raw file):

Previously, johnml1135 (John Lambert) wrote…

Ok - this really needs to be done in Serval then, not in Machine. I'll tear it out.

Done

Copy link
Contributor

@ddaspit ddaspit left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 1 of 1 files at r3, all commit messages.
Reviewable status: all files reviewed, 1 unresolved discussion (waiting on @johnml1135)


tests/testutils/data/usfm/Tes/41MATTes.SFM line 2 at r1 (raw file):

Previously, johnml1135 (John Lambert) wrote…

Yes, I could have made the changes with independent tests, rather then overloading these. The only actual "change" per say is the 1:esb changing to 2:esb - it parses differently because the \fm tag registers as an embed. If we need to, I can remove the modifications and add more individual tests to verify the edge cases.

Yes, that is the plan moving forward. We no longer want to use this file for new tests. It has just gotten too difficult to maintain and it isn't explicit what we are testing for. We should create individual tests for all new cases.

@johnml1135
Copy link
Collaborator Author

tests/testutils/data/usfm/Tes/41MATTes.SFM line 2 at r1 (raw file):

Previously, johnml1135 (John Lambert) wrote…

Yes, I could have made the changes with independent tests, rather then overloading these. The only actual "change" per say is the 1:esb changing to 2:esb - it parses differently because the \fm tag registers as an embed. If we need to, I can remove the modifications and add more individual tests to verify the edge cases.

@ddaspit - I am inclined to leave it as is right now, but if you think it best, I can work on minimizing changes to the MAT file.

@johnml1135
Copy link
Collaborator Author

tests/testutils/data/usfm/Tes/41MATTes.SFM line 2 at r1 (raw file):

Previously, johnml1135 (John Lambert) wrote…

@ddaspit - I am inclined to leave it as is right now, but if you think it best, I can work on minimizing changes to the MAT file.

Ok - I will create the new tests.

…m machine.py

Updated logic
* Add "nested" embeds
* If we strip embeds and there is an updated embed, strip it
All tests pass
@johnml1135 johnml1135 force-pushed the do_not_translate_markers branch from 38c87db to ba783fb Compare February 10, 2025 19:57
@johnml1135
Copy link
Collaborator Author

@ddaspit - the changes requested have been implemented.

@johnml1135
Copy link
Collaborator Author

tests/testutils/data/usfm/Tes/41MATTes.SFM line 2 at r1 (raw file):

Previously, johnml1135 (John Lambert) wrote…

Ok - I will create the new tests.

done

@johnml1135 johnml1135 force-pushed the do_not_translate_markers branch from 49edd20 to 5ebd6af Compare February 11, 2025 18:59
Copy link
Contributor

@ddaspit ddaspit left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 11 of 11 files at r4, all commit messages.
Reviewable status: all files reviewed, 5 unresolved discussions (waiting on @johnml1135)


tests/testutils/data/usfm/Tes/41MATTes.SFM line 2 at r1 (raw file):

Previously, johnml1135 (John Lambert) wrote…

done

You forgot to revert the footnote marker that occurs right after the \id marker. I believe that was added to check for a specific issue.


machine/corpora/scripture_ref_usfm_parser_handler.py line 278 at r4 (raw file):

        return marker == "ft"

    def _is_embed_part(self, marker: Optional[str]) -> bool:

The name of this method does not clearly communicate the purpose of this method is and how it is different than other "embed" methods.


machine/corpora/scripture_ref_usfm_parser_handler.py line 281 at r4 (raw file):

        return marker is not None and marker.startswith(EMBED_STARTING_CHARS)

    def _is_embed_character(self, marker: Optional[str]) -> bool:

I wasn't sure what this meant at first. I would rename this to _is_embed_character_style.


machine/corpora/scripture_ref_usfm_parser_handler.py line 282 at r4 (raw file):

    def _is_embed_character(self, marker: Optional[str]) -> bool:
        return marker in ("f", "fe", "fig", "fm", "x")

This set can be a constant.


machine/corpora/usfm_text_base.py line 208 at r4 (raw file):

                state.token is not None
                and self._is_in_embed(state.token.marker)
                and (not self._is_in_note_text() or self._is_in_nested_embed(state.token.marker))

It is unclear what this if statement is for. Could you put a comment explaining or (probably better) put the conditional in a method/property with a name that describes what the conditional is checking?

@johnml1135
Copy link
Collaborator Author

machine/corpora/scripture_ref_usfm_parser_handler.py line 278 at r4 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

The name of this method does not clearly communicate the purpose of this method is and how it is different than other "embed" methods.

Whether the marker is in the group of markers associated with all types of embeds - figs, f, fe, ft, fr, xt, xta, etc.

@johnml1135
Copy link
Collaborator Author

machine/corpora/scripture_ref_usfm_parser_handler.py line 281 at r4 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

I wasn't sure what this meant at first. I would rename this to _is_embed_character_style.

Done.

@johnml1135
Copy link
Collaborator Author

machine/corpora/scripture_ref_usfm_parser_handler.py line 278 at r4 (raw file):

Previously, johnml1135 (John Lambert) wrote…

Whether the marker is in the group of markers associated with all types of embeds - figs, f, fe, ft, fr, xt, xta, etc.

_is_embed_part_style

@johnml1135
Copy link
Collaborator Author

machine/corpora/scripture_ref_usfm_parser_handler.py line 282 at r4 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

This set can be a constant.

Done.

@johnml1135
Copy link
Collaborator Author

machine/corpora/usfm_text_base.py line 208 at r4 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

It is unclear what this if statement is for. Could you put a comment explaining or (probably better) put the conditional in a method/property with a name that describes what the conditional is checking?

Done.

@johnml1135
Copy link
Collaborator Author

tests/testutils/data/usfm/Tes/41MATTes.SFM line 2 at r1 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

You forgot to revert the footnote marker that occurs right after the \id marker. I believe that was added to check for a specific issue.

It was - but I moved it to another test. The numbering is now done differently so that if I keep it, all the numbers are thrown off. By removing it here and moving it to a different test, it minimizes the changes needed.

Copy link
Contributor

@ddaspit ddaspit left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 3 of 3 files at r5, all commit messages.
Reviewable status: all files reviewed, 1 unresolved discussion (waiting on @johnml1135)


machine/corpora/scripture_ref_usfm_parser_handler.py line 281 at r4 (raw file):

Previously, johnml1135 (John Lambert) wrote…

Done.

Now that I understand the difference between the methods, I would rename this to _is_embed_style so that it is consistent with _is_embed_part_style. I promise this is the last comment.

@johnml1135
Copy link
Collaborator Author

machine/corpora/scripture_ref_usfm_parser_handler.py line 281 at r4 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

Now that I understand the difference between the methods, I would rename this to _is_embed_style so that it is consistent with _is_embed_part_style. I promise this is the last comment.

Done.

Copy link
Contributor

@ddaspit ddaspit left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Reviewed 2 of 2 files at r6, all commit messages.
Reviewable status: :shipit: complete! all files reviewed, all discussions resolved (waiting on @johnml1135)

@johnml1135 johnml1135 merged commit ee14f5b into main Feb 12, 2025
14 checks passed
@johnml1135 johnml1135 deleted the do_not_translate_markers branch February 12, 2025 22:03
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants