Closes #355 | Add Dataloader TotalDefMeme #602
Conversation
Hi @akhdanfadh! Thank you for your contribution. I have some suggestions regarding the dataloader:
- Let's move the new tasks and schema to another PR.
        subset_id=_DATASETNAME,
    ),
    SEACrowdConfig(
        name=f"{_DATASETNAME}_{_SEACROWD_SCHEMA['IMC_MULTI']}",
f"{_DATASETNAME}_{_SEACROWD_SCHEMA['IMC_MULTI']}"
--> f"{_DATASETNAME}_topic_{_SEACROWD_SCHEMA['IMC_MULTI']}"
Changing this will break the test, since the config name must follow a certain template based on constants.py.
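To illustrate the naming constraint (a minimal sketch; the schema suffix below is made up, and the real template is defined in constants.py):

```python
# Hypothetical sketch of why renaming the config breaks the test: the
# test suite rebuilds the expected config names from a fixed template.
# The actual template lives in constants.py; the suffix here is made up.
_DATASETNAME = "total_defense_meme"
schema_suffix = "seacrowd_imtext"  # assumed value of _SEACROWD_SCHEMA['IMC_MULTI']

expected_name = f"{_DATASETNAME}_{schema_suffix}"        # what the test looks up
proposed_name = f"{_DATASETNAME}_topic_{schema_suffix}"  # the suggested rename

print(expected_name)                   # total_defense_meme_seacrowd_imtext
print(proposed_name == expected_name)  # False: the test would not find the config
```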
    "image_path": image_path,
    "metadata": {
        "tags": tags,
        "stances": [pillar["stance"] for pillar in pillar_stances],
Could you please change "stances" to "agreed_stances", and add another variable, "all_stances"? The `all_stances` labels contain all the stance annotations, while the `agreed_stances` labels are the "correct labels", based on this processing detail in Section 3.3 Quality Control Measures from the paper:
Lastly, the annotators annotate the meme’s stances towards the assigned pillars: support, against, or neutral. To ensure the reliability of the dataset, each meme is annotated by two annotators. If the disagreements contain similar opinions, the overlap annotations will be considered correct labels. However, if there are disagreements with entirely different perspectives, a third annotator will be brought in to provide an additional annotation for the meme. The overlapping annotations between at least two annotators will then be considered the correct labels. In the extreme case where all three annotators have different opinions, the meme will be flagged and removed from the dataset.
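A minimal sketch of the procedure quoted above, where a stance label counts as a "correct label" when at least two annotators overlap on it (function and field names are illustrative, not the dataset's):

```python
# Sketch of the Section 3.3 quality-control rule: keep the stance labels
# that at least two annotators agree on; if no label reaches an overlap
# of two, the meme would be flagged and removed from the dataset.
from collections import Counter

def agreed_stances(all_stances):
    """all_stances: per-annotator stance labels for one pillar."""
    counts = Counter(all_stances)
    return [stance for stance, n in counts.items() if n >= 2]

print(agreed_stances(["support", "support"]))             # ['support']
print(agreed_stances(["support", "against", "support"]))  # ['support']
print(agreed_stances(["support", "against", "neutral"]))  # [] -> flagged and removed
```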
In the extreme case where all three annotators have different opinions, the meme will be flagged and removed from the dataset.
Actually, it's not. As an example, this is the raw data from annotation.json (see img_4124).
To ensure the reliability of the dataset, each meme is annotated by two annotators. If the disagreements contain similar opinions, the overlap annotations will be considered correct labels. However, if there are disagreements with entirely different perspectives, a third annotator will be brought in to provide an additional annotation for the meme. The overlapping annotations between at least two annotators will then be considered the correct labels.
There is also no "agreed stances" property in the dataset, unless we add a script to process it. But IMO, since we are just "loading" the dataset, that processing should be left to the user.
    if raw_pillar_stances:
        for pillar, stances in raw_pillar_stances:
            category = pillar.split(" ")[0]
            pillar_stances.append({"category": category, "stance": stances})
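One hedged way the requested change could look (the raw input format below is approximated from the snippet in this PR, not verified against the actual dataset):

```python
# Hypothetical extension of the snippet above: keep every annotation in
# "all_stances" and the >=2-annotator overlap in "agreed_stances".
from collections import Counter

raw_pillar_stances = [("Military Defence", ["support", "support", "against"])]

pillar_stances = []
if raw_pillar_stances:
    for pillar, stances in raw_pillar_stances:
        category = pillar.split(" ")[0]
        counts = Counter(stances)
        pillar_stances.append({
            "category": category,
            "all_stances": list(stances),
            "agreed_stances": [s for s, n in counts.items() if n >= 2],
        })

print(pillar_stances[0]["category"])        # Military
print(pillar_stances[0]["agreed_stances"])  # ['support']
```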
Could you please change "stance" to "agreed_stances", and add another variable, "all_stances"? The `all_stances` labels contain all the stance annotations, while the `agreed_stances` labels are the "correct labels", based on this processing detail in Section 3.3 Quality Control Measures from the paper:
Lastly, the annotators annotate the meme’s stances towards the assigned pillars: support, against, or neutral. To ensure the reliability of the dataset, each meme is annotated by two annotators. If the disagreements contain similar opinions, the overlap annotations will be considered correct labels. However, if there are disagreements with entirely different perspectives, a third annotator will be brought in to provide an additional annotation for the meme. The overlapping annotations between at least two annotators will then be considered the correct labels. In the extreme case where all three annotators have different opinions, the meme will be flagged and removed from the dataset.
See my comment above.
@holylovenia I've made the changes that could be made; see my comments on your review.
I'll open the new PR once everything on the dataloader part is done and reviewed.
Thanks for the discussion, @akhdanfadh! I understand all your comments and decisions. LGTM. Let's wait for @raileymontalan's review.
Tested and reviewed. LGTM!
Closes #355
There are 2 tasks with different schemas. The OCR task is intended for all the images, but the ImageClassification task is only for those having the `pillar_stances` attribute, since the dataset is about pillar classification, CMIIW.

Also, the new `image` schema is added here instead of in a new PR for example's sake. Once checked and okay, I will open a new PR to add the new schema and remove the relevant files from this PR.

Also, similar to #556 and #566: I use a third-party library to download the GDrive data, i.e., `pip install gdown`, because it is more reliable than the `dl_manager`. Similarly, I also store the downloaded data in `data/total_defense_meme/`. I am aware that I should make or wait for a PR on those two things, so I am currently waiting for further instruction.

Checkbox
- Dataloader script at `seacrowd/sea_datasets/my_dataset/my_dataset.py` (please use only lowercase and underscore for dataset naming).
- `_CITATION`, `_DATASETNAME`, `_DESCRIPTION`, `_HOMEPAGE`, `_LICENSE`, `_URLs`, `_SUPPORTED_TASKS`, `_SOURCE_VERSION`, and `_SEACROWD_VERSION` variables.
- `_info()`, `_split_generators()`, and `_generate_examples()` in dataloader script.
- `BUILDER_CONFIGS` class attribute is a list with at least one `SEACrowdConfig` for the source schema and one for a seacrowd schema.
- The dataloader works with the `datasets.load_dataset` function.
- The dataloader passes the test suite: `python -m tests.test_seacrowd seacrowd/sea_datasets/<my_dataset>/<my_dataset>.py`.