Added space_between_characters #197
base: main
Conversation
I think the scope for this could be expanded by not adding spaces between every letter of a word. Something like "house" -> "h ouse" or "h o use" is a much more plausible error for OCR systems.
This could easily be implemented by moving away from whitespace tokenization to a character-based one, which would also make the transformation applicable to many more languages.
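A minimal sketch of what that character-based variant could look like (the function name and `char_prob` argument are illustrative, not code from this PR):

```python
import random

def insert_spaces(text, char_prob=0.2, seed=0):
    # Walk over characters instead of whitespace tokens and insert a
    # space after a character with probability `char_prob`. With no
    # word tokenization step, this also works for scripts that do not
    # delimit words with spaces.
    rng = random.Random(seed)
    out = []
    for i, ch in enumerate(text):
        out.append(ch)
        if i < len(text) - 1 and not ch.isspace() and rng.random() < char_prob:
            out.append(" ")
    return "".join(out)

print(insert_spaces("house", char_prob=0.3, seed=1))  # "h ous e"
```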
## What are the limitations of this transformation?
- The transformation's outputs are very simple.
- It is not capable of generating linguistically diverse text.
- This transformation will mainly affect the performance of token/word-level models, while character-level models should be much robust.
nit: "much more robust"
Thank you @sebastianGehrmann for your suggestion. I agree that it is an interesting extension to not always insert a space. I have implemented it in b551a5c, where I added a new argument controlling the probability of inserting a space between two characters in a token.
I have also updated the README in 28b1301.
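For concreteness, here is a rough sketch of such a two-level scheme, in which a token is first selected for corruption and spaces are then inserted between its characters. All names (`space_between_characters`, `token_prob`, `char_prob`) are hypothetical and need not match the actual code in b551a5c:

```python
import random

def space_between_characters(sentence, token_prob=0.5, char_prob=0.2, seed=0):
    # Two-level scheme: each whitespace-delimited token is selected
    # for corruption with probability `token_prob`; inside a selected
    # token, a space is inserted between two adjacent characters with
    # probability `char_prob`.
    rng = random.Random(seed)
    result = []
    for token in sentence.split():
        if rng.random() < token_prob:
            chars = []
            for i, ch in enumerate(token):
                chars.append(ch)
                if i < len(token) - 1 and rng.random() < char_prob:
                    chars.append(" ")
            token = "".join(chars)
        result.append(token)
    return " ".join(result)

print(space_between_characters("the house is red", seed=3))
```

Note that the `sentence.split()` call is the only language-dependent part of this design; the per-character simplification suggested below removes it.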
TaskType.TEXT_TO_TEXT_GENERATION,
TaskType.TEXT_TAGGING,
]
languages = ["en"] |
By using another tokenizer, this could also work for other languages. The "en" choice is surprising here.
You are right, thank you for spotting this. I have changed it to "all" in 28b1301.
Thanks for the changes! A couple small things now:
I think an easy fix for (1) is to drop the per-word probability and just have a per-character probability. That way you are truly language-agnostic.
Thanks for the comments:
Thank you.