[Apertium] Improve handling of reserved characters in inputs to and outputs from Apertium #673
The annoying part is that they only need to be escaped inside |
Since we're mangling the files anyway, I suppose I could add a step to escape all the symbols that need escaping in the transfer file before running it. Then the user probably wouldn't need to escape anything at all. |
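As a rough illustration of that pre-processing step, here is a minimal Python sketch. It assumes that lemma literals sit in the `v` attribute of `<lit>` elements in the transfer file and that a backslash is the escape character. The reserved-character set and the function names are hypothetical, not taken from the FLExTrans code, and the sketch will double-escape a literal that the user has already escaped by hand.

```python
import xml.etree.ElementTree as ET

# Assumed set of characters that need a backslash inside a lemma literal.
# This list is illustrative only; check it against the Apertium stream format.
RESERVED = frozenset('^$/<>{}\\*#@+~[]')


def escape_lemma(text):
    """Prefix every reserved character with a backslash."""
    return ''.join('\\' + ch if ch in RESERVED else ch for ch in text)


def escape_transfer_literals(xml_text):
    """Escape the v attribute of every <lit> element (hypothetical helper).

    <lit-tag> elements are left alone because iter('lit') matches only
    elements whose tag is exactly 'lit'.
    """
    root = ET.fromstring(xml_text)
    for lit in root.iter('lit'):
        value = lit.get('v')
        if value:
            lit.set('v', escape_lemma(value))
    return ET.tostring(root, encoding='unicode')


if __name__ == '__main__':
    sample = '<transfer><lit v="*lieb1.1"/><lit-tag v="n.m"/></transfer>'
    print(escape_transfer_literals(sample))
```

With a step like this in place, the rule author could keep writing the lemma as it appears in the lexicon, and the escaped form would exist only in the processed copy of the file.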
Related to #673. This way, users never need to escape anything by hand except for periods in tags.
- ensure all characters that can change behavior are in the escape list
- don't escape things that don't need to be escaped (slashes are fine in tags in newer apertium versions)
- when stripping the doctype from transfer files, also escape anything that could be a lemma
- strip trailing whitespace because my editor is picky
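The idea of a context-dependent escape list can be sketched as follows. This is only an illustration: the two character sets and the function names are assumptions made for the example and are not the lists used on the branch. The one detail taken from the list above is that slashes are left unescaped in tags.

```python
# Hypothetical escape lists. Lemmas get the full set; tags omit the slash,
# which newer Apertium versions are said to accept unescaped.
LEMMA_RESERVED = frozenset('^$/<>{}\\*#@+~[]')
TAG_RESERVED = LEMMA_RESERVED - {'/'}


def escape_reserved(text, reserved=LEMMA_RESERVED):
    """Put a backslash in front of each character found in `reserved`."""
    return ''.join('\\' + ch if ch in reserved else ch for ch in text)


def unescape_reserved(text):
    """Drop each escaping backslash and keep the character that follows it."""
    out = []
    i = 0
    while i < len(text):
        if text[i] == '\\' and i + 1 < len(text):
            out.append(text[i + 1])
            i += 2
        else:
            out.append(text[i])
            i += 1
    return ''.join(out)


if __name__ == '__main__':
    print(escape_reserved('*lieb1.1'))
    print(escape_reserved('a/b', TAG_RESERVED))
```

Keeping the lists as named constants makes it easy to audit which characters are covered in each context, and the unescape helper lets output from Apertium be mapped back to the form the user sees in FLEx.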
Slash in the biling. lex. is now working, but asterisk doesn't seem to be. I updated the code in three places in the reserved-characters branch so that it no longer changes * to _. So now, if you test with German-Swedish Reserved characters, you should get *lieb1.1 in the biling. lex., but Apertium isn't translating it to älska1.1. |
The |
I changed it to use * in the Utils code. It still doesn't work. Please try running the Build Bilingual Lexicon module yourself with the latest code on the reserved-characters branch. |
After 43c6f71 it works for me. |
Apertium tools not working when another symbol follows a symbol with a slash. If the source text doesn't have the |
The problem there is in |
Fixed in PR #726 |
Currently, FLExTrans doesn't handle reserved characters in Apertium very well. Here are two examples:
Other characters may be problematic either in the data stream going into Apertium or in the bilingual lexicon. All of the problematic characters should be identified, and the appropriate quoting or conversion should be applied to each. Ideally, users should not have to reference lemmas or affixes in the rules any differently from how they appear in the FLEx lexicon. (Except, of course, for the dot-to-underscore conversion, which I don't think we can avoid.)
This work should be done off of a new branch from master.