-
-
Notifications
You must be signed in to change notification settings - Fork 2.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add support for NoiseBench #3512
Conversation
flair/datasets/sequence_labeling.py
Outdated
**corpusargs: The arguments propagated to :meth:'flair.datasets.ColumnCorpus.__init__'. | ||
""" | ||
if noise not in ["clean", "crowd", "crowdbest", "expert", "distant", "weak", "llm"]: | ||
raise Exception("Please choose a valid version") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is recommended to raise ValueError for wrong and unexpected values passed to the argument.
Also, an error message would be more helpful if it listed valid values. Something like:
if noise not in VALID_NOISE_VALUES:
raise ValueError(f"Unsupported value for noise argument. Got {noise}, expected one of {VALUE_NOISE_VALUES}!")
flair/datasets/sequence_labeling.py
Outdated
if noise not in ["clean", "crowd", "crowdbest", "expert", "distant", "weak", "llm"]: | ||
raise Exception("Please choose a valid version") | ||
|
||
self._set_path(base_path) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would remove this function and just set the base_path
here in __init__
:
if base_path:
self.base_path = Path(base_path)
else:
self.base_path = flair.cache_root / "datasets" / "noisebench"
flair/datasets/sequence_labeling.py
Outdated
self.base_path = flair.cache_root / "datasets" / "noisebench" if not base_path else Path(base_path) | ||
|
||
@staticmethod | ||
def read_column_file(filename): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You are adding a new public method to NER_NOISEBENCH
. I would avoid this. Also, you are relying on this method only in non-public methods in the class (except in generate_data_files
, which I would also put as non-public.)
My suggestion is to rename the method to _read_column_file
and add type annotations for filename
argument.
flair/datasets/sequence_labeling.py
Outdated
return all_x | ||
|
||
@staticmethod | ||
def save_to_column_file(filename, list): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- Rename to
_save_to_column_file
- Add type annotations to arguments
- Avoid using the name of built-in modules/types/etc. for argument and variable names. Here, you are using
list
as the name for the argument. Seems like a more appropriate name issentences
(which is of typelist
).
flair/datasets/sequence_labeling.py
Outdated
|
||
def generate_data_files(self, filename): | ||
|
||
with open(os.path.join(self.base_path, "annotations_only", "index.txt")) as index_file: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can avoid this os.path.join
since base_path
is of type Path. Just use /
to concatenate the path. Also, we should avoid relying on implicit default values, it is better to explicitly mention that we want to read the file with utf-8:
with open(self.base_path / "annotations_only" / "index.txt", "r", encoding="utf-8") as fp:
# ...
flair/datasets/sequence_labeling.py
Outdated
with open(os.path.join(self.base_path, "annotations_only", "index.txt")) as index_file: | ||
token_indices = index_file.readlines() | ||
|
||
all_clean_sentences = self.read_column_file(os.path.join(self.cleanconll_base_path, "cleanconll.train")) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same here, use self.cleanconll_base_path / "cleanconll.train"
instead of os.path.join
.
flair/datasets/sequence_labeling.py
Outdated
) | ||
|
||
# copy test set | ||
all_clean_test_sentences = self.read_column_file(os.path.join(self.cleanconll_base_path, "cleanconll.test")) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same here, use self.cleanconll_base_path / "cleanconll.test"
instead of os.path.join
.
flair/datasets/sequence_labeling.py
Outdated
new_s.append([token[0], token[4]]) | ||
test_sentences.append(new_s) | ||
|
||
self.save_to_column_file(os.path.join(self.base_path, "clean.test"), test_sentences) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same here, use self.base_path / "clean.test"
instead of os.path.join
.
flair/datasets/sequence_labeling.py
Outdated
new_s = [] | ||
for token in s: | ||
new_s.append([token[0], token[4]]) | ||
test_sentences.append(new_s) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This can be refactored to a more idiomatic approach with list comprehensions:
for sentence in all_clean_test_sentences:
new_sentence = [[tokens[0], tokens[4]] for tokens in sentence]
test_sentences.append(new_sentence)
flair/datasets/sequence_labeling.py
Outdated
for token, label in zip(clean_sentence, sentence): | ||
label[0] = token[0] # token[0] -> text, token[1] -> BIO label | ||
if self.SAVE_TRAINDEV_FILE: | ||
self.save_to_column_file(os.path.join(self.base_path, f"{corpus}.traindev"), noisy_labels) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same here, use self.base_path / f"{corpus}.traindev"
instead of os.path.join
.
flair/datasets/sequence_labeling.py
Outdated
|
||
noisy_labels = self.read_column_file(os.path.join(self.base_path, "annotations_only", f"{corpus}.traindev")) | ||
# print(noisy_labels) | ||
# print(token_indices) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remove these debugging prints.
flair/datasets/sequence_labeling.py
Outdated
def _merge_tokens_labels(self, corpus, all_clean_sentences, token_indices): | ||
# generate NoiseBench dataset variants, given CleanCoNLL, noisy label files and index file | ||
|
||
noisy_labels = self.read_column_file(os.path.join(self.base_path, "annotations_only", f"{corpus}.traindev")) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same here, use self.base_path / "annotations_only" / f"{corpus}.traindev"
instead of os.path.join
.
flair/datasets/sequence_labeling.py
Outdated
point = [] | ||
for line in lines: | ||
if "\t" in line.strip(): | ||
stripped_line = line.strip().split("\t") if "\t" in line.strip() else line.strip().split(" ") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like stripped_line = line.strip().split("\t") if "\t" in line.strip()
will always be set due to the condition if "\t" in line.strip()
in line 5076. Refactor to:
if "\t" in line.strip():
stripped_line = line.strip().split("\t")
else:
stripped_line = line.strip().split(" ")
def save_to_column_file(filename, list): | ||
with open(filename, "w") as f: | ||
for sentence in list: | ||
for token in sentence: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rename the iteration variable to tokens
to convert the meaning that this is a container of elements and not a single token.
@fkdosilovic thanks a lot for your review, this is very helpful! Please note that this is still a draft PR and one major change is still planned. |
@fkdosilovic Thanks a lot for the review, it was very helpful and I incorporated most of your feedback! Also, the CLEANCONLL Corpus is now a part of Flair, so I modified this PR to use it, instead of downloading and running an external .sh script. |
@elenamer thanks for adding this! |
This PR adds support for NoiseBench, a subset of CoNLL-03 for English with 7 different label sets, including 6 different types of label noise.
Example usage (intialize dataset with a crowdsourced version of the NER labels):
Potential issues:To create the dataset, we need the CleanCoNLL labels as gold labels. This required running an external script in this implementation: https://github.com/flairNLP/CleanCoNLL/blob/main/create_cleanconll_from_conll03.shEDIT: The implementation was modified to use a CLEANCONLL object, which is why this issue is not relevant anymore.