Character duplication #184
# Character Duplication

This perturbation adds noise to all types of text sources (sentence, paragraph, etc.) by mimicking the noise produced by keyboard typos and common spelling errors.

Author name: Marco Di Giovanni
Author email: marco.digiovanni@polimi.it
Author Affiliation: Politecnico di Milano and University of Bologna

## What type of a transformation is this?

This transformation acts as a perturbation to test robustness. A few characters, picked at random, are duplicated. The generated sentences remain highly similar to the source sentences.

## What tasks does it intend to benefit?

- This perturbation would benefit all tasks that take a sentence/paragraph/document as input, such as text classification and text generation.
- The generated texts mimic typing mistakes.

## What are the limitations of this transformation?

- This transformation is not capable of generating linguistically diverse text.
- This transformation will mainly affect the performance of token/word-level models, while character-level models should be much more robust.
from .transformation import *
{
  "type": "character_duplication",
  "test_cases": [
    {
      "class": "CharacterDuplication",
      "inputs": {
        "sentence": "Andrew finally returned the French book to Chris that I bought last week"
      },
      "outputs": [{
        "sentence": "Anndrew ffinnallly returrned thee French book too Chhris that I bought last week"
      }]
    },
    {
      "class": "CharacterDuplication",
      "inputs": {
        "sentence": "Sentences with gapping, such as Paul likes coffee and Mary tea, lack an overt predicate to indicate the relation between two or more arguments."
      },
      "outputs": [{
        "sentence": "Seentencees witth gappiing, succhh as Paul likess cooffee and Mary tea, lackk an overt predicate ttoo indiicate tthe relation between two orr moree arrguuments."
      }]
    },
    {
      "class": "CharacterDuplication",
      "inputs": {
        "sentence": "Alice in Wonderland is a 2010 American live-action/animated dark fantasy adventure film"
      },
      "outputs": [{
        "sentence": "Allice inn WWondderland is a 2010 AAmmerican live-acctioon/animated dark fantasyy adventure film"
      }]
    },
    {
      "class": "CharacterDuplication",
      "inputs": {
        "sentence": "Ujjal Dev Dosanjh served as 33rd Premier of British Columbia from 2000 to 2001"
      },
      "outputs": [{
        "sentence": "Ujjjal Deev Dossanjh seerved ass 33rd Premier of Briitish Columbia from 2000 to 2001"
      }]
    },
    {
      "class": "CharacterDuplication",
      "inputs": {
        "sentence": "Neuroplasticity is a continuous processing allowing short-term, medium-term, and long-term remodeling of the neuronosynaptic organization."
      },
      "outputs": [{
        "sentence": "Neeuroplaastticiity is aa continnuuous processingg alllowing short-term, mediium-term, and long-terrmm remoodelingg of the neuronosynaptic orrganizzatiionn."
      }]
    }
  ]
}
import random

from interfaces.SentenceOperation import SentenceOperation
from tasks.TaskTypes import TaskType

**Reviewer:** Please consider adding docstrings, comments, and error handling logic.

**Author:** I added a brief docstring in eb09bbc. I believe that the code is simple enough to understand without the need for more comments.

**Reviewer:** Please add the description of your arguments, following the official docstring convention for Python (PEP 257): https://www.python.org/dev/peps/pep-0257/

**Reviewer:** Don't forget about the error handling logic: what happens if the user enters an illegal value for one of the parameters? Will they receive a human-readable message pointing out what they did wrong, or a generic Python error log when the wrong parameter breaks the code?
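The error handling the reviewer asks for could look like the sketch below. The helper name `validate_params` and the exact messages are assumptions, not part of the PR; the parameter names match the `duplicate` function in this diff.

```python
# Hypothetical validation helper (not part of the PR): fail early with a
# human-readable message instead of a generic Python traceback.
def validate_params(prob, max_outputs):
    if not isinstance(prob, (int, float)) or not 0.0 <= prob <= 1.0:
        raise ValueError(f"prob must be a number between 0 and 1, got {prob!r}")
    if not isinstance(max_outputs, int) or max_outputs < 1:
        raise ValueError(f"max_outputs must be a positive integer, got {max_outputs!r}")
```

`duplicate` would then call `validate_params(prob, max_outputs)` at the top, before seeding the RNG.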
def duplicate(text, prob=0.1, seed=42, max_outputs=1):
    """
    Duplicate random characters (but not digits) in `text` with probability `prob`.

    Returns a list of `max_outputs` perturbed strings.
    """
    random.seed(seed)

    original_text = list(text)
    perturbed_texts = []
    for _ in range(max_outputs):
        # Keep digits and, with probability 1 - prob, other characters as-is;
        # otherwise emit the character twice.
        perturbed_text = [
            [letter]
            if letter.isdigit() or random.random() > prob
            else [letter, letter]
            for letter in original_text
        ]
        # Flatten the list of one- and two-character sublists.
        perturbed_text = [
            letter for sublist in perturbed_text for letter in sublist
        ]
        perturbed_texts.append("".join(perturbed_text))
    return perturbed_texts
class CharacterDuplication(SentenceOperation):
    tasks = [
        TaskType.TEXT_CLASSIFICATION,
        TaskType.TEXT_TO_TEXT_GENERATION,
    ]
    languages = ["All"]
    keywords = [
        "morphological",
        "noise",
        "rule-based",
        "highly-meaning-preserving",
        "high-precision",
        "high-coverage",
        "high-generations",
    ]

    def __init__(self, seed=42, max_outputs=1, prob=0.1):
        super().__init__(seed, max_outputs=max_outputs)
        self.prob = prob

    def generate(self, sentence: str):
        perturbed_texts = duplicate(
            text=sentence,
            prob=self.prob,
            seed=self.seed,
            max_outputs=self.max_outputs,
        )
        return perturbed_texts
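For a quick standalone check of the behaviour, the `duplicate` helper can be exercised on its own. The snippet below restates the function's logic from this diff (flattened into a plain loop) so it runs by itself without the repo's `interfaces` package.

```python
import random

# Restatement of duplicate() from this PR, so the example runs standalone.
def duplicate(text, prob=0.1, seed=42, max_outputs=1):
    random.seed(seed)
    perturbed_texts = []
    for _ in range(max_outputs):
        chars = []
        for letter in text:
            chars.append(letter)
            # Digits are never duplicated; any other character is duplicated
            # with probability `prob`.
            if not letter.isdigit() and random.random() <= prob:
                chars.append(letter)
        perturbed_texts.append("".join(chars))
    return perturbed_texts

# With prob=1.0 every non-digit character (including the space) is doubled,
# while the digits are left alone.
print(duplicate("ab 2000", prob=1.0))  # ['aabb  2000']
```

Because the RNG is re-seeded on every call, the same `seed` always yields the same perturbations, which is what makes the JSON test cases above reproducible.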
**Reviewer:** Triple duplication in the same word doesn't seem like a typical situation. I would suggest adding some rules to limit the generation of such unlikely human input.

**Author:** Here the triple duplication happens just because one of the two 'l' characters in the word "finally" was duplicated, yielding the same letter three times in total. I am not sure how likely this is in real data compared to duplication of characters that appear only once in a word. However, I believe that trained models should be able to process words like "ffinallly" in a similar way to "finally", since humans can easily understand the meaning of a word with this kind of typo.
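One way to implement the rule the reviewer suggests could be to never duplicate a character whose neighbour in the source is already the same letter. This is a sketch, not part of the PR; the name `duplicate_no_triples` is hypothetical, and it assumes that simply skipping already-doubled letters is an acceptable restriction.

```python
import random

def duplicate_no_triples(text, prob=0.1, seed=42, max_outputs=1):
    """Like duplicate(), but never duplicates a character that already
    neighbours an identical one, so a doubled letter such as the 'll'
    in 'finally' can never be turned into 'lll'."""
    random.seed(seed)
    perturbed_texts = []
    for _ in range(max_outputs):
        chars = []
        for i, letter in enumerate(text):
            chars.append(letter)
            # A letter adjacent to an identical letter is left alone.
            repeated = (i > 0 and text[i - 1] == letter) or (
                i + 1 < len(text) and text[i + 1] == letter
            )
            if not letter.isdigit() and not repeated and random.random() <= prob:
                chars.append(letter)
        perturbed_texts.append("".join(chars))
    return perturbed_texts

# Even at prob=1.0, the 'll' in "finally" stays as-is while every other
# character is doubled.
print(duplicate_no_triples("finally", prob=1.0))  # ['ffiinnaallyy']
```

Since a duplicated character is only ever inserted next to characters that differ from it, this variant can never create a run of three identical letters that was not already present in the input.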