Character duplication #184
Hi @marco-digio, thank you very much for your changes. I would suggest combining character_duplication and underscore_trick into one. Also, please mention in the README any previous PRs (open or merged) that are similar to yours.
Hi @kaustubhdhole, thank you. I am sorry, but I messed up a bit with git: I accidentally included an old commit in the new branch. It is now fixed by removing underscore_trick from the character_duplication pull request, since I had already opened a separate underscore_trick pull request.
Hi @marco-digio, I am a reviewer assigned to this PR. Overall, everything looks good; you just need to include keywords for the transformation as explained here. I will provide the final feedback after you have added the keywords.
A few minor comments:
- There is a typo in the last line of README.md ("perfornamce").
- If this transformation was proposed in any existing work, please include the relevant citation.
Thank you @uyaseen for the feedback. I have now added the keywords in 22a16e7 and fixed the README typo in 97866a7.
@marco-digio thanks for making the changes.
Here's my general review:
Clarity: The README clearly explains the transformation
Correctness: All checks have passed
Interface: The interface seems correct
Adding New Libraries: No new libraries were added
Test Cases: 5 test cases added
Evaluating Robustness: Robustness evaluation is not yet conducted
"sentence": "Andrew finally returned the French book to Chris that I bought last week"
},
"outputs": [{
"sentence": "Anndrew ffinnallly returrned thee French book too Chhris that I bought last week"
Triple duplication in the same word doesn't seem like a typical situation. I would suggest adding some rules to limit the generation of such unlikely human input.
Here the triple duplication happens only because one of the two 'l' characters in the word "finally" was duplicated, which yields the same letter three times in a row.
I am not sure how likely this is in real data compared with the duplication of a character that appears only once in the word.
However, I believe that trained models should be able to process words like "ffinallly" in the same way as "finally", since humans can easily understand the meaning of a word with this kind of typo.
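To make the two positions above concrete, here is a minimal sketch, not the code in this PR. It assumes a hypothetical helper `duplicate_chars` that duplicates each character with probability `prob`, and an optional `max_run` limit (also an assumption) that skips a duplication when it would create more than `max_run` identical characters in a row.

```python
import random


def duplicate_chars(text, prob=0.1, max_run=None, seed=0):
    """Randomly duplicate characters of `text` (hypothetical helper).

    Keyword arguments:
    prob -- probability of duplicating each character (default 0.1)
    max_run -- if set, longest allowed run of identical characters
    seed -- seed of the random generator, for reproducibility (default 0)
    """
    rng = random.Random(seed)
    out = []
    for i, ch in enumerate(text):
        out.append(ch)
        if rng.random() >= prob:
            continue  # this character is not selected for duplication
        if max_run is not None:
            # Count identical characters already at the end of the output...
            run = 0
            for prev in reversed(out):
                if prev != ch:
                    break
                run += 1
            # ...plus identical characters that follow in the input.
            j = i + 1
            while j < len(text) and text[j] == ch:
                run += 1
                j += 1
            if run + 1 > max_run:
                continue  # duplicating would exceed the allowed run length
        out.append(ch)
    return "".join(out)


# With every character selected, the limit keeps "ll" from growing further.
print(duplicate_chars("finally", prob=1.0, max_run=2))
print(duplicate_chars("finally", prob=1.0))
```

Leaving `max_run` unset corresponds to the behaviour defended in the reply, while `max_run=2` corresponds to the stricter rule the reviewer asks for.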
"sentence": "Alice in Wonderland is a 2010 American live-action/animated dark fantasy adventure film"
},
"outputs": [{
"sentence": "Allice inn WWondderland is a 200110 American livve-aaction/animated dark fanntasy adventure film"
The same applies to the doubled letter at the beginning of a word and to the 6-digit number, which should represent a year. Please consider adding some rules to change that behaviour.
For the same reason as before, I disagree about the doubled letter at the beginning of a word, but I agree with you about not duplicating digits. I have added a rule to exclude digits from duplication in eb09bbc. Thank you for the suggestion.
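A rule of this kind could look like the sketch below. This is an illustration under assumptions, not the code from eb09bbc: the function name `duplicate_non_digits` and its parameters are hypothetical, and the thread does not say whether whitespace or punctuation may be duplicated, so only digits are excluded here.

```python
import random


def duplicate_non_digits(text, prob=0.1, seed=0):
    """Randomly duplicate characters of `text`, never digits (hypothetical).

    Keyword arguments:
    prob -- probability of duplicating each non-digit character (default 0.1)
    seed -- seed of the random generator, for reproducibility (default 0)
    """
    rng = random.Random(seed)
    out = []
    for ch in text:
        out.append(ch)
        # Digits are skipped, so a year such as "2010" is left untouched.
        if not ch.isdigit() and rng.random() < prob:
            out.append(ch)
    return "".join(out)


print(duplicate_non_digits("a 2010 film", prob=0.5))
```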
from interfaces.SentenceOperation import SentenceOperation
from tasks.TaskTypes import TaskType
Please consider adding docstrings, comments, and error handling logic.
I added a brief docstring in eb09bbc. I believe the code is simple enough to understand without further comments.
Please add a description of your arguments, using the docstring convention:
`def complex(real=0.0, imag=0.0):
    """Form a complex number.
    Keyword arguments:
    real -- the real part (default 0.0)
    imag -- the imaginary part (default 0.0)
    """
    if imag == 0.0 and real == 0.0:
        return complex_zero
    ...`
as stated in the official docstring conventions for Python: https://www.python.org/dev/peps/pep-0257/
Don't forget about error handling logic: what happens if the user passes an illegal value for one of the parameters? Will they receive a human-readable message pointing out what they did wrong, or a generic Python traceback when the invalid parameter breaks the code?
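As an illustration of what such error handling might look like, here is a minimal sketch. It assumes the transformation takes a duplication probability `prob` and an integer `seed`; both parameter names and the helper `validate_params` are hypothetical and not taken from the PR.

```python
def validate_params(prob, seed):
    """Raise a readable error if a parameter is invalid (hypothetical helper).

    Keyword arguments:
    prob -- duplication probability, expected to be a number in [0, 1]
    seed -- random seed, expected to be an integer
    """
    # bool is a subclass of int, so it is rejected explicitly.
    if isinstance(prob, bool) or not isinstance(prob, (int, float)):
        raise TypeError(
            f"prob must be a number between 0 and 1, got {type(prob).__name__}"
        )
    if not 0.0 <= prob <= 1.0:
        raise ValueError(f"prob must be between 0 and 1, got {prob}")
    if isinstance(seed, bool) or not isinstance(seed, int):
        raise TypeError(f"seed must be an integer, got {type(seed).__name__}")


try:
    validate_params(prob=1.5, seed=0)
except ValueError as err:
    print(err)
```

Calling a check like this at the start of the constructor would surface the problem immediately, rather than later as an unrelated error deep inside the generation loop.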
tasks = [
    TaskType.TEXT_CLASSIFICATION,
    TaskType.TEXT_TO_TEXT_GENERATION,
    TaskType.TEXT_TAGGING,
How is TaskType.TEXT_TAGGING relevant to this PR?
Thank you for spotting this. You are completely right, and I have removed it in eb09bbc.
Thank you for the contribution. Could you please explain the value of your PR compared to the [transformation](https://github.com/GEM-benchmark/NL-Augmenter/tree/main/transformations/butter_fingers_perturbation), which addresses the same typo issue?
The transformation is similar in that it adds typo-like noise. However, Butter Fingers Perturbation swaps two characters, while this PR (character duplication) duplicates a character. I hope this clarifies your doubts, @asnota.