Added space_between_characters #197

marco-digio · 2021-08-04T09:09:46Z

No description provided.

sebastianGehrmann

I think the scope for this could be expanded by not adding spaces between every letter of a word. Something like "house" -> "h ouse" or "h o use" is a much more plausible error for OCR systems.

This could easily be implemented by moving from whitespace tokenization to character-based tokenization, which would also make the transformation applicable to many more languages.
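A minimal sketch of the character-based idea described above, not the code in this PR. The helper name `insert_spaces` and its `prob` and `seed` parameters are assumptions made for illustration. It walks over adjacent character pairs and inserts a space between two non-whitespace characters with probability `prob`, so it never relies on whitespace tokenization.

```python
import random


def insert_spaces(text: str, prob: float = 0.5, seed: int = 0) -> str:
    """Insert a space between adjacent non-space characters with probability `prob`.

    Hypothetical helper: the name and parameters are illustrative only.
    """
    rng = random.Random(seed)
    out = []
    for prev, curr in zip(text, text[1:]):
        out.append(prev)
        # Only split inside a run of non-whitespace characters, so that
        # existing spaces are never doubled.
        if not prev.isspace() and not curr.isspace() and rng.random() < prob:
            out.append(" ")
    out.append(text[-1:])  # last character (empty string for empty input)
    return "".join(out)


if __name__ == "__main__":
    print(insert_spaces("house", prob=1.0))     # every gap gets a space
    print(insert_spaces("house", prob=0.3))     # only some gaps get a space
    print(insert_spaces("你好世界", prob=1.0))  # works without whitespace tokens
```

With `prob=1.0` this reproduces the original "space between every character" behaviour, and lower values give partial splits of the "h ouse" kind. Because the loop only looks at characters, the same function applies to scripts that do not separate words with spaces.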

sebastianGehrmann · 2021-09-08T20:41:16Z

transformations/space_between_characters/README.md

+## What are the limitations of this transformation?
+- The transformation's outputs are very simple.
+- It is not capable of generating linguistically diverse text.
+- This transformation will mainly affect the perfornamce of token/word-level models, while character-level models should be much robust.


nit: "much more robust"

marco-digio · 2021-09-09

Thank you, @sebastianGehrmann, for your suggestion. I agree that it would be interesting to extend this transformation so that a space is not always inserted. I implemented this in b551a5c, which adds a new argument controlling the probability of inserting a space between two characters in a token. I also updated the README in 28b1301.

sebastianGehrmann · 2021-09-08T20:42:08Z

transformations/space_between_characters/transformation.py

+        TaskType.TEXT_TO_TEXT_GENERATION,
+        TaskType.TEXT_TAGGING,
+    ]
+    languages = ["en"]


By using another tokenizer, this could also work for other languages. The "en" choice is surprising here.

marco-digio · 2021-09-09

You are right, thank you for spotting this. I have changed it to "all" in 28b1301.

sebastianGehrmann · 2021-09-29T19:30:08Z

Thanks for the changes! A couple of small things now:

1. You are still using a whitespace tokenizer, which I am not sure is completely necessary (especially since it excludes languages like Chinese, which don't use spaces).
2. The tests don't cover the new cases.

I think an easy fix for (1) is to stop distinguishing a per-word probability and use only a per-character probability. That way you are truly language-agnostic.

marco-digio · 2021-09-29T21:06:14Z

Thanks for the comments:

1. Languages that do not use spaces, like Chinese, will not benefit from this transformation. The goal of this PR is to render words in a full-width style, separating every character with spaces. Since Chinese and similar languages do not use spaces, I believe this transformation should not be applied to them. Similar transformations for languages that do not use spaces could be useful (I do not speak any of those languages, so I am not completely sure), but they should be kept separate from this one.
2. You are right, but I don't know how to set the argument prob_char to a non-default value in the test cases. If you know a transformation that uses different arguments in its tests, please link it so I can check how to implement it here. I have not found such a case in the ones I checked, nor documentation that describes it.

Thank you.
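One way the non-default `prob_char` cases could be covered, sketched as plain Python assertions rather than the repository's own test format (which is not shown in this thread). The `SpaceBetweenCharacters` class below is a hypothetical stand-in, not the PR's implementation. Its constructor arguments, the default of `1.0`, and the `generate` method returning a list are all assumptions.

```python
import random


class SpaceBetweenCharacters:
    """Hypothetical stand-in for the transformation (not the PR's code)."""

    def __init__(self, prob_char: float = 1.0, seed: int = 0):
        self.prob_char = prob_char  # assumed default: always insert a space
        self.seed = seed

    def generate(self, sentence: str):
        rng = random.Random(self.seed)
        words = []
        for word in sentence.split():
            chars = [word[0]]
            for ch in word[1:]:
                # Insert a space before each later character with prob_char.
                if rng.random() < self.prob_char:
                    chars.append(" ")
                chars.append(ch)
            words.append("".join(chars))
        return [" ".join(words)]


def test_default_spaces_every_character():
    assert SpaceBetweenCharacters().generate("my house") == ["m y h o u s e"]


def test_zero_probability_leaves_text_unchanged():
    out = SpaceBetweenCharacters(prob_char=0.0).generate("my house")
    assert out == ["my house"]


def test_intermediate_probability_only_adds_spaces():
    first = SpaceBetweenCharacters(prob_char=0.5, seed=42).generate("my house")[0]
    second = SpaceBetweenCharacters(prob_char=0.5, seed=42).generate("my house")[0]
    assert first == second                        # same seed, same output
    assert first.replace(" ", "") == "myhouse"    # no characters lost or added


if __name__ == "__main__":
    test_default_spaces_every_character()
    test_zero_probability_leaves_text_unchanged()
    test_intermediate_probability_only_adds_spaces()
```

The boundary values `0.0` and `1.0` give outputs that do not depend on the random stream, and the intermediate case checks properties (determinism under a fixed seed, characters preserved) instead of a hard-coded string.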

Added space_between_characters (851acd8)

kaustubhdhole added the transformation label Aug 10, 2021

sebastianGehrmann requested changes Sep 8, 2021

marco-digio added 4 commits September 9, 2021 09:30

- Add probability of inserting space between characters (b551a5c)
- Small update README (28b1301)
- Add keywords (b418cbe)
- Merge branch 'main' into space_between_characters (85e4549)
