Input normalization #38

Open
martinpopel opened this issue Jul 5, 2021 · 0 comments


Currently, a sentence consisting of a single space is translated as "I know" in cs->en and as "Ano" in en->cs (to reproduce this, you need to enter another non-empty sentence/word on the previous line).
I admit I should fix the NMT model (its training data), but meanwhile I think the web interface should never query the backend model with a whitespace-only sentence.

I think whitespace in the input sentences (all Unicode whitespace, not just the ASCII space) should be normalized:

s/^\s+//;
s/\s+$//;
s/\s+/ /g;

The current MT models will merge multiple spaces into a single space anyway. I am not sure about whitespace characters other than the plain space, but I doubt we have enough vertical tabs, thin/non-breaking spaces, etc. in the training data, so it seems wiser to normalize such whitespace characters to a single space.
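For illustration, here is a minimal Python sketch of the same three substitutions, plus a guard that skips whitespace-only sentences. The names `normalize_sentence`, `translate_lines`, and the `backend` callable are hypothetical and not taken from the actual web interface code. It relies on Python's `\s` matching Unicode whitespace for `str` patterns, which covers vertical tabs, thin spaces and non-breaking spaces.

```python
import re

# For str patterns, \s matches Unicode whitespace (including U+00A0, U+2009, \v).
_WHITESPACE_RUN = re.compile(r"\s+")


def normalize_sentence(sentence: str) -> str:
    """Collapse every whitespace run to one space, then trim both ends.

    Equivalent to s/^\\s+//; s/\\s+$//; s/\\s+/ /g applied to one sentence.
    """
    return _WHITESPACE_RUN.sub(" ", sentence).strip(" ")


def translate_lines(lines, backend):
    """Translate each line, never sending an empty sentence to the backend.

    `backend` is a hypothetical callable (str -> str) standing in for the
    real MT query. Empty lines are kept as "" so that output lines stay
    aligned with input lines.
    """
    results = []
    for line in lines:
        sentence = normalize_sentence(line)
        results.append(backend(sentence) if sentence else "")
    return results
```

With this guard, an input of `["hello", " "]` would result in a single backend call for `"hello"` and an empty string for the second line, rather than a hallucinated translation.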
