Input normalization #38

martinpopel · 2021-07-05T22:04:40Z

Currently, a sentence consisting of a single space is translated as "I know" in cs->en and as "Ano" in en->cs (to reproduce, enter another non-empty sentence or word on the previous line).
I admit I should fix the NMT model (its training data), but in the meantime I think the web interface should never query the backend model with a whitespace-only sentence.

I think whitespace in the input sentences should be normalized (covering all Unicode whitespace characters):

s/^\s+//;
s/\s+$//;
s/\s+/ /g;

The current MT models will merge multiple spaces into a single space anyway. I am not sure about whitespace characters other than the plain space, but I doubt we have enough vertical tabs, thin spaces, non-breaking spaces, etc. in the training data, so it seems wiser to normalize any run of such characters to a single space.
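
For illustration, here is a minimal Python sketch of the three substitutions above, plus a guard that skips the backend for whitespace-only lines. The function names (`normalize_sentence`, `translate_lines`) and the `translate` callback are hypothetical and not taken from the actual web interface. The one-sentence-per-line input format is also an assumption. In Python 3, `\s` applied to a `str` matches Unicode whitespace, including the vertical tab, thin space and non-breaking space.

```python
import re

# One or more whitespace characters; for str patterns, \s is Unicode-aware.
_WHITESPACE_RUN = re.compile(r"\s+")


def normalize_sentence(sentence):
    """Collapse every whitespace run to one space and trim both ends."""
    # Equivalent to s/\s+/ /g followed by s/^\s+// and s/\s+$//:
    # after collapsing, at most one plain space remains at each end.
    return _WHITESPACE_RUN.sub(" ", sentence).strip(" ")


def translate_lines(text, translate):
    """Translate line by line, never sending an empty sentence to `translate`.

    `translate` stands in for the call to the backend model (hypothetical).
    Lines that are empty after normalization stay empty in the output,
    so input and output lines remain aligned.
    """
    output = []
    for line in text.split("\n"):
        sentence = normalize_sentence(line)
        output.append(translate(sentence) if sentence else "")
    return "\n".join(output)
```

With this guard, a line consisting of a single space never reaches the model, so it can no longer come back as "I know" or "Ano".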


martinpopel mentioned this issue Mar 13, 2023

Normalize input sentences before translation #59

Open
