[Apertium] Improve handling of reserved characters in inputs to and outputs from Apertium #673

rmlockwood · 2024-07-17T14:52:12Z

Currently FLExTrans doesn't handle reserved characters in Apertium very well. Here are two examples:

- If there is an asterisk in a lemma, FT converts it to an underscore. Users have to be aware of this conversion and refer to lemmas with underscores in place of asterisks. Asterisks can be common in FLEx lemmas: if the user sets the morph type to bound root or bound stem, FLEx automatically prepends a * to the lexeme form to make the headword.
- If there is a forward slash in a symbol string in the bilingual lexicon, FT converts it to 'SLASH', which again requires the user to use this string in rules.

Other characters may be problematic either in the data stream going into Apertium or in the bilingual lexicon. All such characters should be identified, and the appropriate quoting or conversion should be applied. Ideally, users should not have to change how they reference lemmas or affixes in the rules from what they see in the FLEx lexicon. (Except, of course, the dot to underscore conversion that I don't think we can avoid.)
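As a rough illustration of the quoting approach (as opposed to converting * to _ or / to 'SLASH'), here is a minimal Python sketch that backslash-escapes reserved characters in a lemma before it goes into the stream. The function name `escape_reserved` and the exact character set are assumptions for illustration, based on the characters that have meaning in the Apertium stream format; this is not the FLExTrans implementation.

```python
# Characters assumed to have special meaning in the Apertium stream format.
# This set is an assumption for illustration, not the list FLExTrans uses.
RESERVED = set('^$/<>{}\\*@#+~[]')


def escape_reserved(text):
    """Prefix every reserved character in text with a backslash."""
    return ''.join('\\' + ch if ch in RESERVED else ch for ch in text)


if __name__ == '__main__':
    print(escape_reserved('*lieb1.1'))  # asterisk from a bound root headword
    print(escape_reserved('a/b'))       # slash inside a lemma
```

With this approach the user keeps writing the lemma as it appears in FLEx, and only the stream carries the escaped form.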

This work should be done off of a new branch from master.

mr-martian · 2024-08-07T13:27:31Z

apertium-transfer doesn't actually care about the presence of *, but lt-proc does, and simply escaping it doesn't currently work (though I could probably make it work).
If we put \/ in the stream, the rules can just refer to a/b without issue.

mr-martian · 2024-08-07T13:59:16Z

The annoying part is that they only need to be escaped inside <test>. In <def-cat> and <out> the unescaped versions are fine.

mr-martian · 2024-08-07T14:02:55Z

Since we're mangling the files anyway, I suppose I could add a step to escape all the symbols that need escaping in the transfer file before running it. Then the user probably wouldn't need to escape anything at all.

- ensure all characters that can change behavior are in the escape list
- don't escape things that don't need to be escaped (slashes are fine in tags in newer apertium versions)
- when stripping the doctype from transfer files, also escape anything that could be a lemma
- strip trailing whitespace because my editor is picky

rmlockwood · 2024-09-03T14:47:56Z

Slash in the biling. lex. is now working, but asterisk doesn't seem to be. I updated the code in three places in the reserved-characters branch to not change * to _. So now if you test with German-Swedish Reserved characters you should get *lieb1.1 in the biling. lex., but Apertium isn't translating it to älska1.1.

mr-martian · 2024-09-03T18:04:08Z

The bilingual.dix file in that project hasn't been regenerated and still has _.

rmlockwood · 2024-09-04T04:10:50Z

I changed it to use * in the Utils code. It still doesn't work. Please try running the Build Bilingual Lexicon module yourself with the latest code on the reserved-characters branch.

mr-martian · 2024-09-04T13:19:46Z

After 43c6f71 it works for me.

rmlockwood · 2024-09-05T07:37:27Z

The Apertium tools don't work when another symbol follows a symbol containing a slash.
I have the following in my biling. lex.:
<e><p><l>*lobwana1.1<s n="n" /><s n="1/2" /></l><r>*lopwana1.1<s n="n" /><s n="1/2" /></r></p></e>
I have this in my source text:
^\*lobwana1.1<n><1/2><x>$
I get this result (no rules applied):
^*lopwana1.1<n><1<x>$

If the source text doesn't have the <x>, it works fine.
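To make the expected behavior concrete, here is a small Python sketch that splits a lexical unit like the one above into its lemma and tags. It only illustrates how the source text should tokenize (three tags: n, 1/2, and x); the helper `split_lexical_unit` is hypothetical and is not how lt-proc is implemented.

```python
import re


def split_lexical_unit(lu):
    """Split '^lemma<tag1><tag2>$' into (lemma, [tags]).

    The lemma is returned as written in the stream, escapes included.
    """
    body = lu.strip()
    if body.startswith('^') and body.endswith('$'):
        body = body[1:-1]
    # The lemma is a run of escaped characters or anything except '<' and '\'.
    match = re.match(r'((?:\\.|[^<\\])*)', body)
    lemma = match.group(1)
    # Everything after the lemma is a sequence of <tag> items.
    tags = re.findall(r'<([^<>]+)>', body[match.end():])
    return lemma, tags


if __name__ == '__main__':
    print(split_lexical_unit(r'^\*lobwana1.1<n><1/2><x>$'))
```

In the reported output the tag 1/2 is cut at the slash, so the second tag arrives as `<1` and the unit no longer tokenizes this way.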

mr-martian · 2024-09-05T13:46:26Z

The problem there is in lt-proc and there's a fix in apertium/lttoolbox#185

rmlockwood · 2024-10-10T19:38:20Z

Fixed in PR #726

rmlockwood added enhancement high priority labels Jul 17, 2024

rmlockwood assigned mr-martian Jul 17, 2024

mr-martian added a commit that referenced this issue Aug 7, 2024

Escape all characters which have meaning in Apertium (#673)

29326c1

mr-martian added a commit that referenced this issue Aug 7, 2024

Slashes in tags are actually fine (#673)

0c4314a

mr-martian added a commit that referenced this issue Aug 7, 2024

Escaped symbols still need to be escaped in some parts of rules so be more selective (#673)

6794566

mr-martian added a commit that referenced this issue Aug 14, 2024

Escape problematic characters in lemmas in rule files

0f80c47

Related to #673. This way users never need to escape anything by hand except for periods in tags.

mr-martian mentioned this issue Aug 26, 2024

Apertium: update binaries and makefiles #708

Merged

rmlockwood closed this as completed Oct 10, 2024
