[Apertium] Improve handling of reserved characters in inputs to and outputs from Apertium #673

Closed
rmlockwood opened this issue Jul 17, 2024 · 10 comments

Comments

@rmlockwood
Owner

Currently FLExTrans doesn't handle reserved characters in Apertium very well. Here are two examples:

  1. If there is an asterisk in a lemma, FT converts it to an underscore. Users have to be aware of this conversion and refer to lemmas with underscores in place of asterisks. Asterisks can be common in FLEx lemmas: if the user sets the morph type to bound root or bound stem, FLEx automatically prepends a * to the lexeme form to make the headword.
  2. If there is a forward slash in a symbol string in the bilingual lexicon, FT converts it to 'SLASH', which again requires the user to use this string in rules.

Other characters may be problematic either in the data stream going into Apertium or in the bilingual lexicon. All of the characters should be identified and the appropriate quoting or converting should be done. Ideally, the user should not have to change how he/she references lemmas or affixes in the rules from what he/she sees in the FLEx lexicon. (Except, of course, the dot to underscore conversion that I don't think we can avoid.)

This work should be done off of a new branch from master.

@mr-martian
Collaborator

  1. apertium-transfer doesn't actually care about the presence of *, but lt-proc does, and simply escaping it doesn't currently work (though I could probably make it work)
  2. If we put \/ in the stream, the rules can just refer to a/b without issue.
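
To make the escaping mentioned in the two points above concrete, here is a minimal Python sketch of backslash-escaping a lemma before it is written to the stream. The names `RESERVED`, `escape_lemma`, and `unescape` are hypothetical, and the reserved set is an assumption based on the characters the Apertium stream format treats as special; it should be checked against the installed Apertium version. This is not FLExTrans's actual code.

```python
# Assumed set of characters that are special in the Apertium stream format.
# Verify against the Apertium version in use before relying on it.
RESERVED = set(r'\^$@*/<>{}[]')


def escape_lemma(lemma, reserved=RESERVED):
    """Prefix every reserved character in `lemma` with a backslash."""
    return ''.join('\\' + ch if ch in reserved else ch for ch in lemma)


def unescape(text):
    """Undo escape_lemma: drop each escaping backslash, keep the next char."""
    out = []
    chars = iter(text)
    for ch in chars:
        if ch == '\\':
            out.append(next(chars, ''))  # a trailing lone backslash is dropped
        else:
            out.append(ch)
    return ''.join(out)


print(escape_lemma('*lieb1.1'))  # \*lieb1.1
print(escape_lemma('a/b'))       # a\/b
```

With this kind of escaping applied on the way in, a rule could keep referring to `a/b` or `*lieb1.1` exactly as the lemma appears in FLEx.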

@mr-martian
Collaborator

The annoying part is that they only need to be escaped inside <test>. In <def-cat> and <out> the unescaped versions are fine.

@mr-martian
Collaborator

Since we're mangling the files anyway, I suppose I could add a step to escape all the symbols that need escaping in the transfer file before running it. Then the user probably wouldn't need to escape anything at all.

mr-martian added a commit that referenced this issue Aug 14, 2024
related to #673

This way users never need to escape anything by hand except for periods in tags.
mr-martian added a commit that referenced this issue Aug 16, 2024
- ensure all characters that can change behavior are in the escape list
- don't escape things that don't need to be escaped (slashes are fine in tags in newer apertium versions)
- when stripping the doctype from transfer files, also escape anything that could be a lemma
- strip trailing whitespace because my editor is picky
@rmlockwood
Owner Author

Slash in the biling. lex. is now working, but asterisk doesn't seem to be. I updated the code in three places in the reserved-characters branch to not change * to _. So now if you test with German-Swedish Reserved characters, you should get *lieb1.1 in the biling. lex., but Apertium isn't translating it to älska1.1.

@mr-martian
Collaborator

The bilingual.dix file in that project hasn't been regenerated and still has _.

@rmlockwood
Owner Author

I changed it to use * in the Utils code. It still doesn't work. Please try running the Build Bilingual Lexicon module yourself with the latest code on the reserved-characters branch.

@mr-martian
Collaborator

After 43c6f71 it works for me.

@rmlockwood
Owner Author

The Apertium tools are not working when another symbol follows a symbol that contains a slash.
I have the following in my biling. lex.:
<e><p><l>*lobwana1.1<s n="n" /><s n="1/2" /></l><r>*lopwana1.1<s n="n" /><s n="1/2" /></r></p></e>
I have this in my source text:
^\*lobwana1.1<n><1/2><x>$
I get this result (no rules applied):
^*lopwana1.1<n><1<x>$

If the source text doesn't have the <x>, it works fine.
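
For reference, the reading of the source text that the report expects (tags `n`, `1/2`, `x`, with the slash kept inside its tag) can be sketched in Python as below. The function `split_lexical_unit` is hypothetical and purely illustrative: it handles a single reading only, ignores the `/` separator between readings, and is not how lt-proc parses the stream.

```python
def split_lexical_unit(lu):
    """Split '^lemma<tag1><tag2>$' into (lemma, [tags]).

    Simplified: one reading per unit, and a backslash escapes the next
    character. A '/' inside <...> is treated as part of the tag name.
    """
    body = lu.strip()
    if body.startswith('^') and body.endswith('$'):
        body = body[1:-1]
    lemma, tags = [], []
    i = 0
    while i < len(body):
        ch = body[i]
        if ch == '\\' and i + 1 < len(body):
            lemma.append(body[i + 1])  # keep the escaped character literally
            i += 2
        elif ch == '<':
            end = body.index('>', i)   # tag runs up to the next '>'
            tags.append(body[i + 1:end])
            i = end + 1
        else:
            lemma.append(ch)
            i += 1
    return ''.join(lemma), tags


print(split_lexical_unit(r'^\*lobwana1.1<n><1/2><x>$'))
# ('*lobwana1.1', ['n', '1/2', 'x'])
```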

@mr-martian
Collaborator

The problem there is in lt-proc and there's a fix in apertium/lttoolbox#185

@rmlockwood
Owner Author

Fixed in PR #726
