Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for NoiseBench #3512

Merged
merged 14 commits into from
Dec 19, 2024
Merged

Add support for NoiseBench #3512

merged 14 commits into from
Dec 19, 2024

Conversation

elenamer
Copy link
Collaborator

@elenamer elenamer commented Jul 24, 2024

This PR adds support for NoiseBench, a subset of CoNLL-03 for English with 7 different label sets, including 6 different types of label noise.

Example usage (intialize dataset with a crowdsourced version of the NER labels):

corpus = NER_NOISEBENCH(noise="crowd")

Potential issues:

To create the dataset, we need the CleanCoNLL labels as gold labels. This required running an external script in this implementation: https://github.com/flairNLP/CleanCoNLL/blob/main/create_cleanconll_from_conll03.sh

EDIT: The implementation was modified to use a CLEANCONLL object, which is why this issue is not relevant anymore.

**corpusargs: The arguments propagated to :meth:'flair.datasets.ColumnCorpus.__init__'.
"""
if noise not in ["clean", "crowd", "crowdbest", "expert", "distant", "weak", "llm"]:
raise Exception("Please choose a valid version")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is recommended to raise ValueError for wrong and unexpected values passed to the argument.

Also, an error message would be more helpful if it listed valid values. Something like:

if noise not in VALID_NOISE_VALUES:
    raise ValueError(f"Unsupported value for noise argument. Got {noise}, expected one of {VALUE_NOISE_VALUES}!")

if noise not in ["clean", "crowd", "crowdbest", "expert", "distant", "weak", "llm"]:
raise Exception("Please choose a valid version")

self._set_path(base_path)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would remove this function and just set the base_path here in __init__:

if base_path:
    self.base_path = Path(base_path)
else:
    self.base_path = flair.cache_root / "datasets" / "noisebench"

self.base_path = flair.cache_root / "datasets" / "noisebench" if not base_path else Path(base_path)

@staticmethod
def read_column_file(filename):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are adding a new public method to NER_NOISEBENCH. I would avoid this. Also, you are relying on this method only in non-public methods in the class (except in generate_data_files, which I would also put as non-public.)

My suggestion is to rename the method to _read_column_file and add type annotations for filename argument.

return all_x

@staticmethod
def save_to_column_file(filename, list):
Copy link
Contributor

@fkdosilovic fkdosilovic Oct 31, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. Rename to _save_to_column_file
  2. Add type annotations to arguments
  3. Avoid using the name of built-in modules/types/etc. for argument and variable names. Here, you are using list as the name for the argument. Seems like a more appropriate name is sentences (which is of type list).


def generate_data_files(self, filename):

with open(os.path.join(self.base_path, "annotations_only", "index.txt")) as index_file:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can avoid this os.path.join since base_path is of type Path. Just use / to concatenate the path. Also, we should avoid relying on implicit default values, it is better to explicitly mention that we want to read the file with utf-8:

with open(self.base_path / "annotations_only" / "index.txt", "r", encoding="utf-8") as fp:
    # ...

with open(os.path.join(self.base_path, "annotations_only", "index.txt")) as index_file:
token_indices = index_file.readlines()

all_clean_sentences = self.read_column_file(os.path.join(self.cleanconll_base_path, "cleanconll.train"))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here, use self.cleanconll_base_path / "cleanconll.train" instead of os.path.join.

)

# copy test set
all_clean_test_sentences = self.read_column_file(os.path.join(self.cleanconll_base_path, "cleanconll.test"))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here, use self.cleanconll_base_path / "cleanconll.test" instead of os.path.join.

new_s.append([token[0], token[4]])
test_sentences.append(new_s)

self.save_to_column_file(os.path.join(self.base_path, "clean.test"), test_sentences)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here, use self.base_path / "clean.test" instead of os.path.join.

new_s = []
for token in s:
new_s.append([token[0], token[4]])
test_sentences.append(new_s)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can be refactored to a more idiomatic approach with list comprehensions:

for sentence in all_clean_test_sentences:
    new_sentence = [[tokens[0], tokens[4]] for tokens in sentence]
    test_sentences.append(new_sentence)

for token, label in zip(clean_sentence, sentence):
label[0] = token[0] # token[0] -> text, token[1] -> BIO label
if self.SAVE_TRAINDEV_FILE:
self.save_to_column_file(os.path.join(self.base_path, f"{corpus}.traindev"), noisy_labels)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here, use self.base_path / f"{corpus}.traindev" instead of os.path.join.


noisy_labels = self.read_column_file(os.path.join(self.base_path, "annotations_only", f"{corpus}.traindev"))
# print(noisy_labels)
# print(token_indices)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove these debugging prints.

def _merge_tokens_labels(self, corpus, all_clean_sentences, token_indices):
# generate NoiseBench dataset variants, given CleanCoNLL, noisy label files and index file

noisy_labels = self.read_column_file(os.path.join(self.base_path, "annotations_only", f"{corpus}.traindev"))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here, use self.base_path / "annotations_only" / f"{corpus}.traindev" instead of os.path.join.

point = []
for line in lines:
if "\t" in line.strip():
stripped_line = line.strip().split("\t") if "\t" in line.strip() else line.strip().split(" ")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like stripped_line = line.strip().split("\t") if "\t" in line.strip() will always be set due to the condition if "\t" in line.strip() in line 5076. Refactor to:

if "\t" in line.strip():
    stripped_line = line.strip().split("\t")
else:
    stripped_line = line.strip().split(" ")

def save_to_column_file(filename, list):
with open(filename, "w") as f:
for sentence in list:
for token in sentence:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rename the iteration variable to tokens to convert the meaning that this is a container of elements and not a single token.

@alanakbik
Copy link
Collaborator

@fkdosilovic thanks a lot for your review, this is very helpful! Please note that this is still a draft PR and one major change is still planned.

@elenamer
Copy link
Collaborator Author

@fkdosilovic Thanks a lot for the review, it was very helpful and I incorporated most of your feedback!

Also, the CLEANCONLL Corpus is now a part of Flair, so I modified this PR to use it, instead of downloading and running an external .sh script.

@elenamer elenamer marked this pull request as ready for review December 16, 2024 10:08
@alanakbik
Copy link
Collaborator

@elenamer thanks for adding this!

@alanakbik alanakbik merged commit 016079e into master Dec 19, 2024
2 checks passed
@alanakbik alanakbik deleted the add_noisebench branch December 19, 2024 12:48
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants