Do not translate markers #153

johnml1135 · 2025-01-31T14:57:45Z

This is a mirror PR to sillsdev/machine#281.

This change is

codecov-commenter · 2025-01-31T16:33:38Z

Codecov Report

Attention: Patch coverage is 98.45560% with 4 lines in your changes missing coverage. Please review.

Project coverage is 88.38%. Comparing base (50cb1de) to head (13853af).

Files with missing lines	Patch %	Lines
tests/corpora/test_usfm_memory_text.py	85.71%	3 Missing ⚠️
tests/corpora/test_usfm_manual.py	0.00%	1 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #153      +/-   ##
==========================================
+ Coverage   88.30%   88.38%   +0.07%     
==========================================
  Files         275      275              
  Lines       16235    16389     +154     
==========================================
+ Hits        14337    14485     +148     
- Misses       1898     1904       +6

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

johnml1135 · 2025-02-03T20:11:25Z

@isaac091 and @ddaspit - this is ready for reviews.

isaac091

This should very minimally affect my current code

Reviewed 12 of 12 files at r1, all commit messages.
Reviewable status: all files reviewed, 1 unresolved discussion (waiting on @ddaspit and @johnml1135)

machine/corpora/paratext_project_text_updater_base.py line 30 at r1 (raw file):

        text_behavior: UpdateUsfmTextBehavior = UpdateUsfmTextBehavior.PREFER_EXISTING,
        embed_behavior: UpdateUsfmIntraVerseMarkerBehavior = UpdateUsfmIntraVerseMarkerBehavior.PRESERVE,
        syle_behavior: UpdateUsfmIntraVerseMarkerBehavior = UpdateUsfmIntraVerseMarkerBehavior.STRIP,

"style" is misspelled

ddaspit

Reviewed 12 of 12 files at r1, all commit messages.
Reviewable status: all files reviewed, 10 unresolved discussions (waiting on @johnml1135)

machine/corpora/scripture_ref_usfm_parser_handler.py line 35 at r1 (raw file):

        return ScriptureTextType.NONE if len(self._cur_text_type_stack) == 0 else self._cur_text_type_stack[-1]

    def get_in_note_text(self) -> bool:

This method should be prefixed with an underscore, since it is only intended to be used internally or by subclasses. Also, _is_in_note_text would be a more consistent name.

machine/corpora/scripture_ref_usfm_parser_handler.py line 122 at r1 (raw file):

        self._in_embed = False

    def start_embed(self, state: UsfmParserState, marker: str, caller: str, category: Optional[str]) -> None:

This method should be prefixed with an underscore, since it is only intended to be used internally or by subclasses.

machine/corpora/scripture_ref_usfm_parser_handler.py line 130 at r1 (raw file):

            self._next_element(marker)

    def end_embed(

This method should be prefixed with an underscore, since it is only intended to be used internally or by subclasses.

machine/corpora/scripture_ref_usfm_parser_handler.py line 173 at r1 (raw file):

    def _end_non_verse_text(self, state: UsfmParserState, scripture_ref: ScriptureRef) -> None: ...

    def start_note_text(self, state: UsfmParserState):

This method should be prefixed with an underscore, since it is only intended to be used internally or by subclasses.

machine/corpora/scripture_ref_usfm_parser_handler.py line 180 at r1 (raw file):

    def _start_note_text(self, state: UsfmParserState, scripture_ref: ScriptureRef) -> None: ...

    def end_note_text(self, state: UsfmParserState):

This method should be prefixed with an underscore, since it is only intended to be used internally or by subclasses.

tests/testutils/data/usfm/Tes/41MATTes.SFM line 2 at r1 (raw file):

\id MAT - Test
\f + \fr 1.0 \ft \f*

We are trying to transition away from testing USFM reading/writing using these test files. I would like to stop using this file to test new cases.

tests/corpora/tests.cs line 0 at r1 (raw file):
I think this file got checked-in accidentally.

machine/corpora/update_usfm_parser_handler.py line 18 at r1 (raw file):

class UpdateUsfmIntraVerseMarkerBehavior(Enum):

I think we can just shorten this to UpdateUsfmMarkerBehavior.

machine/corpora/update_usfm_parser_handler.py line 25 at r1 (raw file):

class UpdateUsfmParserHandler(ScriptureRefUsfmParserHandler):

    _untranslatable_paragraph_tags = ("r", "rem")

This class could be used for other cases where you want to update the text in USFM and not just translation. Can we handle these specific translation cases in Serval, i.e. don't pass in translation segments for these tags?

johnml1135 · 2025-02-04T13:12:39Z

machine/corpora/paratext_project_text_updater_base.py line 30 at r1 (raw file):

Previously, isaac091 (Isaac Schifferer) wrote…

"style" is misspelled

Nice catch :-)

johnml1135 · 2025-02-04T13:37:37Z

machine/corpora/scripture_ref_usfm_parser_handler.py line 35 at r1 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

This method should be prefixed with an underscore, since it is only intended to be used internally or by subclasses. Also, _is_in_note_text would be a more consistent name.

Done.

johnml1135 · 2025-02-04T13:38:14Z

machine/corpora/scripture_ref_usfm_parser_handler.py line 122 at r1 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

This method should be prefixed with an underscore, since it is only intended to be used internally or by subclasses.

Done.

johnml1135 · 2025-02-04T13:39:17Z

machine/corpora/scripture_ref_usfm_parser_handler.py line 130 at r1 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

This method should be prefixed with an underscore, since it is only intended to be used internally or by subclasses.

It's different for dotnet - underscores are only for private variables, not protected. I'll update it.

johnml1135 · 2025-02-04T13:43:51Z

machine/corpora/scripture_ref_usfm_parser_handler.py line 173 at r1 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

This method should be prefixed with an underscore, since it is only intended to be used internally or by subclasses.

Done.

johnml1135 · 2025-02-04T13:44:10Z

machine/corpora/scripture_ref_usfm_parser_handler.py line 180 at r1 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

This method should be prefixed with an underscore, since it is only intended to be used internally or by subclasses.

Done.

johnml1135 · 2025-02-04T13:44:58Z

machine/corpora/update_usfm_parser_handler.py line 18 at r1 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

I think we can just shorten this to UpdateUsfmMarkerBehavior.

Done.

johnml1135 · 2025-02-04T13:46:03Z

tests/corpora/tests.cs line at r1 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

I think this file got checked-in accidentally.

Done.

johnml1135 · 2025-02-04T13:47:38Z

tests/testutils/data/usfm/Tes/41MATTes.SFM line 2 at r1 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

We are trying to transition away from testing USFM reading/writing using these test files. I would like to stop using this file to test new cases.

I tried to keep the changes minimal and move many of the tests other places. Because the parsing was different, I needed to at least update some of the results, especially around the notes.

ddaspit

Reviewed 9 of 9 files at r2, all commit messages.
Reviewable status: all files reviewed, 3 unresolved discussions (waiting on @isaac091 and @johnml1135)

machine/corpora/scripture_ref_usfm_parser_handler.py line 130 at r1 (raw file):

Previously, johnml1135 (John Lambert) wrote…

It's different for dotnet - underscores are only for private variables, not protected. I'll update it.

Yeah, by convention, we've been using underscores for private and protected fields/methods in Machine.py.

tests/testutils/data/usfm/Tes/41MATTes.SFM line 2 at r1 (raw file):

Previously, johnml1135 (John Lambert) wrote…

I tried to keep the changes minimal and move many of the tests other places. Because the parsing was different, I needed to at least update some of the results, especially around the notes.

I'm not sure I understand why changes were necessary to this file. Shouldn't UsfmTextBase still parse this file the same way it did before? The tests for this file are in place to guard against regressions.

isaac091

Reviewed all commit messages.
Reviewable status: all files reviewed, 2 unresolved discussions (waiting on @johnml1135)

johnml1135 · 2025-02-04T20:24:08Z

machine/corpora/update_usfm_parser_handler.py line 25 at r1 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

This class could be used for other cases where you want to update the text in USFM and not just translation. Can we handle these specific translation cases in Serval, i.e. don't pass in translation segments for these tags?

Ok - this really needs to be done in Serval then, not in Machine. I'll tear it out.

johnml1135 · 2025-02-04T20:40:08Z

tests/testutils/data/usfm/Tes/41MATTes.SFM line 2 at r1 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

I'm not sure I understand why changes were necessary to this file. Shouldn't UsfmTextBase still parse this file the same way it did before? The tests for this file are in place to guard against regressions.

Yes, I could have made the changes with independent tests, rather then overloading these. The only actual "change" per say is the 1:esb changing to 2:esb - it parses differently because the \fm tag registers as an embed. If we need to, I can remove the modifications and add more individual tests to verify the edge cases.

johnml1135 · 2025-02-06T15:16:31Z

machine/corpora/update_usfm_parser_handler.py line 25 at r1 (raw file):

Previously, johnml1135 (John Lambert) wrote…

Ok - this really needs to be done in Serval then, not in Machine. I'll tear it out.

Done

ddaspit

Reviewed 1 of 1 files at r3, all commit messages.
Reviewable status: all files reviewed, 1 unresolved discussion (waiting on @johnml1135)

tests/testutils/data/usfm/Tes/41MATTes.SFM line 2 at r1 (raw file):

Previously, johnml1135 (John Lambert) wrote…

Yes, I could have made the changes with independent tests, rather then overloading these. The only actual "change" per say is the 1:esb changing to 2:esb - it parses differently because the \fm tag registers as an embed. If we need to, I can remove the modifications and add more individual tests to verify the edge cases.

Yes, that is the plan moving forward. We no longer want to use this file for new tests. It has just gotten too difficult to maintain and it isn't explicit what we are testing for. We should create individual tests for all new cases.

johnml1135 · 2025-02-06T15:17:23Z

tests/testutils/data/usfm/Tes/41MATTes.SFM line 2 at r1 (raw file):

Previously, johnml1135 (John Lambert) wrote…

Yes, I could have made the changes with independent tests, rather then overloading these. The only actual "change" per say is the 1:esb changing to 2:esb - it parses differently because the \fm tag registers as an embed. If we need to, I can remove the modifications and add more individual tests to verify the edge cases.

@ddaspit - I am inclined to leave it as is right now, but if you think it best, I can work on minimizing changes to the MAT file.

johnml1135 · 2025-02-06T15:18:04Z

tests/testutils/data/usfm/Tes/41MATTes.SFM line 2 at r1 (raw file):

Previously, johnml1135 (John Lambert) wrote…

@ddaspit - I am inclined to leave it as is right now, but if you think it best, I can work on minimizing changes to the MAT file.

Ok - I will create the new tests.

…m machine.py Updated logic * Add "nested" embeds * If we strip embeds and there is an updated embed, strip it All tests pass

johnml1135 · 2025-02-10T21:26:57Z

@ddaspit - the changes requested have been implemented.

johnml1135 · 2025-02-10T21:27:04Z

tests/testutils/data/usfm/Tes/41MATTes.SFM line 2 at r1 (raw file):

Previously, johnml1135 (John Lambert) wrote…

Ok - I will create the new tests.

done

ddaspit

Reviewed 11 of 11 files at r4, all commit messages.
Reviewable status: all files reviewed, 5 unresolved discussions (waiting on @johnml1135)

tests/testutils/data/usfm/Tes/41MATTes.SFM line 2 at r1 (raw file):

Previously, johnml1135 (John Lambert) wrote…

done

You forgot to revert the footnote marker that occurs right after the \id marker. I believe that was added to check for a specific issue.

machine/corpora/scripture_ref_usfm_parser_handler.py line 278 at r4 (raw file):

        return marker == "ft"

    def _is_embed_part(self, marker: Optional[str]) -> bool:

The name of this method does not clearly communicate the purpose of this method is and how it is different than other "embed" methods.

machine/corpora/scripture_ref_usfm_parser_handler.py line 281 at r4 (raw file):

        return marker is not None and marker.startswith(EMBED_STARTING_CHARS)

    def _is_embed_character(self, marker: Optional[str]) -> bool:

I wasn't sure what this meant at first. I would rename this to _is_embed_character_style.

machine/corpora/scripture_ref_usfm_parser_handler.py line 282 at r4 (raw file):

    def _is_embed_character(self, marker: Optional[str]) -> bool:
        return marker in ("f", "fe", "fig", "fm", "x")

This set can be a constant.

machine/corpora/usfm_text_base.py line 208 at r4 (raw file):

                state.token is not None
                and self._is_in_embed(state.token.marker)
                and (not self._is_in_note_text() or self._is_in_nested_embed(state.token.marker))

It is unclear what this if statement is for. Could you put a comment explaining or (probably better) put the conditional in a method/property with a name that describes what the conditional is checking?

johnml1135 · 2025-02-11T20:22:35Z

machine/corpora/scripture_ref_usfm_parser_handler.py line 278 at r4 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

The name of this method does not clearly communicate the purpose of this method is and how it is different than other "embed" methods.

Whether the marker is in the group of markers associated with all types of embeds - figs, f, fe, ft, fr, xt, xta, etc.

johnml1135 · 2025-02-11T20:23:10Z

machine/corpora/scripture_ref_usfm_parser_handler.py line 281 at r4 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

I wasn't sure what this meant at first. I would rename this to _is_embed_character_style.

Done.

johnml1135 · 2025-02-11T20:24:05Z

machine/corpora/scripture_ref_usfm_parser_handler.py line 278 at r4 (raw file):

Previously, johnml1135 (John Lambert) wrote…

Whether the marker is in the group of markers associated with all types of embeds - figs, f, fe, ft, fr, xt, xta, etc.

_is_embed_part_style

johnml1135 · 2025-02-11T20:25:04Z

machine/corpora/scripture_ref_usfm_parser_handler.py line 282 at r4 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

This set can be a constant.

Done.

johnml1135 · 2025-02-11T20:28:12Z

machine/corpora/usfm_text_base.py line 208 at r4 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

It is unclear what this if statement is for. Could you put a comment explaining or (probably better) put the conditional in a method/property with a name that describes what the conditional is checking?

Done.

johnml1135 · 2025-02-11T20:29:16Z

tests/testutils/data/usfm/Tes/41MATTes.SFM line 2 at r1 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

You forgot to revert the footnote marker that occurs right after the \id marker. I believe that was added to check for a specific issue.

It was - but I moved it to another test. The numbering is now done differently so that if I keep it, all the numbers are thrown off. By removing it here and moving it to a different test, it minimizes the changes needed.

ddaspit

Reviewed 3 of 3 files at r5, all commit messages.
Reviewable status: all files reviewed, 1 unresolved discussion (waiting on @johnml1135)

machine/corpora/scripture_ref_usfm_parser_handler.py line 281 at r4 (raw file):

Previously, johnml1135 (John Lambert) wrote…

Done.

Now that I understand the difference between the methods, I would rename this to _is_embed_style so that it is consistent with _is_embed_part_style. I promise this is the last comment.

johnml1135 · 2025-02-12T13:05:05Z

machine/corpora/scripture_ref_usfm_parser_handler.py line 281 at r4 (raw file):

Previously, ddaspit (Damien Daspit) wrote…

Now that I understand the difference between the methods, I would rename this to _is_embed_style so that it is consistent with _is_embed_part_style. I promise this is the last comment.

Done.

ddaspit

Reviewed 2 of 2 files at r6, all commit messages.
Reviewable status: complete! all files reviewed, all discussions resolved (waiting on @johnml1135)

johnml1135 requested review from ddaspit and isaac091 January 31, 2025 14:57

isaac091 requested changes Feb 3, 2025

View reviewed changes

ddaspit requested changes Feb 3, 2025

View reviewed changes

ddaspit reviewed Feb 4, 2025

View reviewed changes

isaac091 reviewed Feb 4, 2025

View reviewed changes

ddaspit reviewed Feb 6, 2025

View reviewed changes

johnml1135 added 2 commits February 10, 2025 14:56

Synchronize with dotnet implementation

d4c7802

Move "untraslatable" to Serval/SILNLP. or just at least remove it fro…

ba783fb

…m machine.py Updated logic * Add "nested" embeds * If we strip embeds and there is an updated embed, strip it All tests pass

johnml1135 force-pushed the do_not_translate_markers branch from 38c87db to ba783fb Compare February 10, 2025 19:57

johnml1135 added 2 commits February 11, 2025 13:58

windows/linux line ending ignore

454a5fc

rename in_nested_embed

5ebd6af

johnml1135 force-pushed the do_not_translate_markers branch from 49edd20 to 5ebd6af Compare February 11, 2025 18:59

ddaspit reviewed Feb 11, 2025

View reviewed changes

reviewer comments.

dbdbd1c

ddaspit reviewed Feb 11, 2025

View reviewed changes

_is_embed_style

13853af

ddaspit approved these changes Feb 12, 2025

View reviewed changes

johnml1135 merged commit ee14f5b into main Feb 12, 2025
14 checks passed

johnml1135 deleted the do_not_translate_markers branch February 12, 2025 22:03

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Do not translate markers #153

Do not translate markers #153

johnml1135 commented Jan 31, 2025 •

edited by ddaspit

Loading

codecov-commenter commented Jan 31, 2025 •

edited

Loading

johnml1135 commented Feb 3, 2025

isaac091 left a comment

ddaspit left a comment

johnml1135 commented Feb 4, 2025

johnml1135 commented Feb 4, 2025

johnml1135 commented Feb 4, 2025

johnml1135 commented Feb 4, 2025

johnml1135 commented Feb 4, 2025

johnml1135 commented Feb 4, 2025

johnml1135 commented Feb 4, 2025

johnml1135 commented Feb 4, 2025

johnml1135 commented Feb 4, 2025

ddaspit left a comment

isaac091 left a comment

johnml1135 commented Feb 4, 2025

johnml1135 commented Feb 4, 2025

johnml1135 commented Feb 6, 2025

ddaspit left a comment

johnml1135 commented Feb 6, 2025

johnml1135 commented Feb 6, 2025

johnml1135 commented Feb 10, 2025

johnml1135 commented Feb 10, 2025

ddaspit left a comment

johnml1135 commented Feb 11, 2025

johnml1135 commented Feb 11, 2025

johnml1135 commented Feb 11, 2025

johnml1135 commented Feb 11, 2025

johnml1135 commented Feb 11, 2025

johnml1135 commented Feb 11, 2025

ddaspit left a comment

johnml1135 commented Feb 12, 2025

ddaspit left a comment

Do not translate markers #153

Do not translate markers #153

Conversation

johnml1135 commented Jan 31, 2025 • edited by ddaspit Loading

codecov-commenter commented Jan 31, 2025 • edited Loading

Codecov Report

johnml1135 commented Feb 3, 2025

isaac091 left a comment

Choose a reason for hiding this comment

ddaspit left a comment

Choose a reason for hiding this comment

johnml1135 commented Feb 4, 2025

johnml1135 commented Feb 4, 2025

johnml1135 commented Feb 4, 2025

johnml1135 commented Feb 4, 2025

johnml1135 commented Feb 4, 2025

johnml1135 commented Feb 4, 2025

johnml1135 commented Feb 4, 2025

johnml1135 commented Feb 4, 2025

johnml1135 commented Feb 4, 2025

ddaspit left a comment

Choose a reason for hiding this comment

isaac091 left a comment

Choose a reason for hiding this comment

johnml1135 commented Feb 4, 2025

johnml1135 commented Feb 4, 2025

johnml1135 commented Feb 6, 2025

ddaspit left a comment

Choose a reason for hiding this comment

johnml1135 commented Feb 6, 2025

johnml1135 commented Feb 6, 2025

johnml1135 commented Feb 10, 2025

johnml1135 commented Feb 10, 2025

ddaspit left a comment

Choose a reason for hiding this comment

johnml1135 commented Feb 11, 2025

johnml1135 commented Feb 11, 2025

johnml1135 commented Feb 11, 2025

johnml1135 commented Feb 11, 2025

johnml1135 commented Feb 11, 2025

johnml1135 commented Feb 11, 2025

ddaspit left a comment

Choose a reason for hiding this comment

johnml1135 commented Feb 12, 2025

ddaspit left a comment

Choose a reason for hiding this comment

johnml1135 commented Jan 31, 2025 •

edited by ddaspit

Loading

codecov-commenter commented Jan 31, 2025 •

edited

Loading