Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Added space_between_characters #197

Open
wants to merge 5 commits into
base: main
Choose a base branch
from

Conversation

marco-digio
Copy link
Contributor

No description provided.

Copy link
Collaborator

@sebastianGehrmann sebastianGehrmann left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the scope for this could be expanded by not adding spaces between every letter of a word. Something like "house" -> "h ouse" or "h o use" is a much more plausible error for OCR systems.

This could easily be implemented by moving away from whitespace tokenization to a character-based one which would also make the transformation applicable to many more languages.

## What are the limitations of this transformation?
- The transformation's outputs are very simple.
- It is not capable of generating linguistically diverse text.
- This transformation will mainly affect the perfornamce of token/word-level models, while character-level models should be much robust.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: "much more robust"

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @sebastianGehrmann for your suggestion. I agree that it is very interesting to expand this transformation by adding the possibility of not having a space. I have implemented it in b551a5c where I added a new argument controlling the probability of inserting a space between 2 characters in a token.
I have also updated the README in 28b1301

TaskType.TEXT_TO_TEXT_GENERATION,
TaskType.TEXT_TAGGING,
]
languages = ["en"]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By using another tokenizer, this could also work for other languages. The "en" choice is surprising here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are right, thank you for spotting this. I have changed it to "all" in 28b1301

@sebastianGehrmann
Copy link
Collaborator

Thanks for the changes! A couple small things now:

  1. You are still using a whitespace tokenizer which I am not sure is completely necessary (especially since it excludes languages like Chinese which don't use spaces
  2. The tests don't cover the new cases

I think an easy fix for (1) is to no longer differentiate between probability per-word and instead just have the probability per character. That way you are truly language-agnostic.

@marco-digio
Copy link
Contributor Author

Thanks for the changes! A couple small things now:

  1. You are still using a whitespace tokenizer which I am not sure is completely necessary (especially since it excludes languages like Chinese which don't use spaces
  2. The tests don't cover the new cases

I think an easy fix for (1) is to no longer differentiate between probability per-word and instead just have the probability per character. That way you are truly language-agnostic.

Thanks for the comments:

  1. Languages that do not use spaces, like Chinese, will not benefit from this transformation. The goal of this PR is to transform words in a full-width style, separating every character with spaces. Since Chinese and similar languages do not use spaces, I believe that this transformation should not be applied to them. I think that similar transformations applied to languages that do not use spaces could be useful (I do not speak any of those languages so I am not completely sure about this) but they should be kept separate from this one.
  2. You are right, but I don't know how to set the argument prob_char different from the default value in the test cases. If you know a transformation that uses different arguments in the tests, please link it to me so I can check how to implement it here. I have not found similar cases in the ones that I have checked, nor documentation that describes it.

Thank you.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants