
Closes #355 | Add Dataloader TotalDefMeme #602

Merged · 3 commits · May 1, 2024

Conversation

Collaborator

@akhdanfadh akhdanfadh commented Apr 2, 2024

Closes #355

There are two tasks with different schemas. The OCR task covers all the images, but the ImageClassification task covers only those with a pillar_stances attribute, since the dataset is about pillar classification (correct me if I'm wrong).

Also, the new image schema is added here rather than in a separate PR, for the sake of example. Once it is checked and approved, I will open a new PR to add the new schema and remove the relevant files from this PR.

Also, similar to #556 and #566: I use a third-party library to download the GDrive data (pip install gdown) because it is more reliable than the dl_manager. Likewise, I store the downloaded data in data/total_defense_meme/. I am aware that I should open or wait for a PR on those two things, so I am currently waiting for further instruction.
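For reference, a minimal sketch of the gdown-based download approach described above. The file ID and filename are hypothetical placeholders; only the data/total_defense_meme/ cache path comes from this PR.

```python
import os

DATA_DIR = "data/total_defense_meme"  # local cache dir used in this PR


def ensure_downloaded(file_id: str, filename: str, data_dir: str = DATA_DIR) -> str:
    """Download a Google Drive file via gdown only if it is not already cached."""
    out_path = os.path.join(data_dir, filename)
    if not os.path.exists(out_path):
        os.makedirs(data_dir, exist_ok=True)
        import gdown  # third-party: pip install gdown

        gdown.download(id=file_id, output=out_path, quiet=False)
    return out_path
```

The download is skipped when the file already exists, so repeated test runs do not re-hit Google Drive.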

Checkbox

  • Confirm that this PR is linked to the dataset issue.
  • Create the dataloader script seacrowd/sea_datasets/my_dataset/my_dataset.py (please use only lowercase and underscore for dataset naming).
  • Provide values for the _CITATION, _DATASETNAME, _DESCRIPTION, _HOMEPAGE, _LICENSE, _URLs, _SUPPORTED_TASKS, _SOURCE_VERSION, and _SEACROWD_VERSION variables.
  • Implement _info(), _split_generators() and _generate_examples() in dataloader script.
  • Make sure that the BUILDER_CONFIGS class attribute is a list with at least one SEACrowdConfig for the source schema and one for a seacrowd schema.
  • Confirm dataloader script works with datasets.load_dataset function.
  • Confirm that your dataloader script passes the test suite run with python -m tests.test_seacrowd seacrowd/sea_datasets/<my_dataset>/<my_dataset>.py.
  • If my dataset is local, I have provided an output of the unit tests in the PR (please copy-paste). This is OPTIONAL for public datasets, as we can test these without access to the data files.

@holylovenia holylovenia assigned raileymontalan and unassigned haryoa Apr 8, 2024
@holylovenia holylovenia removed the top-priority Needs to get done ASAP for the experiments label Apr 11, 2024
Contributor

@holylovenia holylovenia left a comment


Hi @akhdanfadh! Thank you for your contribution. I have some suggestions regarding the dataloader:

  1. Let's move the new tasks and schema to another PR.

        subset_id=_DATASETNAME,
    ),
    SEACrowdConfig(
        name=f"{_DATASETNAME}_{_SEACROWD_SCHEMA['IMC_MULTI']}",
Contributor

f"{_DATASETNAME}_{_SEACROWD_SCHEMA['IMC_MULTI']}" --> f"{_DATASETNAME}_topic_{_SEACROWD_SCHEMA['IMC_MULTI']}"

Collaborator Author

Changing this would break the tests, since config names must follow a fixed template defined in constants.py.

"image_path": image_path,
"metadata": {
"tags": tags,
"stances": [pillar["stance"] for pillar in pillar_stances],
Contributor

Could you please make this "stances" --> "agreed_stances" and add another variable "all_stances"?

The all_stances labels have all the stance annotations while the agreed_stances labels are the "correct labels" based on this processing detail in Section 3.3 Quality Control Measures from the paper.

Lastly, the annotators annotate the meme’s stances towards the assigned pillars: support, against, or neutral. To ensure the reliability of the dataset, each meme is annotated by two annotators. If the disagreements contain similar opinions, the overlap annotations will be considered correct labels. However, if there are disagreements with entirely different perspectives, a third annotator will be brought in to provide an additional annotation for the meme. The overlapping annotations between at least two annotators will then be considered the correct labels. In the extreme case where all three annotators have different opinions, the meme will be flagged and removed from the dataset.

Collaborator Author

In the extreme case where all three annotators have different opinions, the meme will be flagged and removed from the dataset.

Actually, it is not removed in the released data. As an example, here is the raw entry from annotation.json for img_4124 (screenshot omitted).

To ensure the reliability of the dataset, each meme is annotated by two annotators. If the disagreements contain similar opinions, the overlap annotations will be considered correct labels. However, if there are disagreements with entirely different perspectives, a third annotator will be brought in to provide an additional annotation for the meme. The overlapping annotations between at least two annotators will then be considered the correct labels.

There is also no "agreed stances" property in the dataset, unless we add a script to process it ourselves. But IMO, since we are just "loading" the dataset, that processing should be left to the user.
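If a user does want the paper's agreed labels, a minimal user-side sketch could look like the following. This is my reading of the rule in Section 3.3 (keep stances chosen by at least two annotators), not part of the dataloader:

```python
from collections import Counter


def agreed_stances(all_stances):
    """Return stances annotated by at least two annotators.

    `all_stances` is a flat list of per-annotator stance labels for one
    pillar, e.g. ["support", "support", "neutral"].
    """
    counts = Counter(all_stances)
    return [stance for stance, n in counts.items() if n >= 2]
```

With two annotators agreeing, e.g. ["support", "support", "neutral"], this keeps "support"; with total disagreement it returns an empty list, matching the inconsistent cases noted below.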

Collaborator Author

Also to add about the inconsistencies from the dataset, see below.

  • img_4129: a perfect example with three annotators; we can obtain the agreed stances.
  • img_4130: where is the third annotator?
  • img_4131: the first and second annotators already agreed on neutral, so why is there a third annotator? (screenshot omitted)

if raw_pillar_stances:
    for pillar, stances in raw_pillar_stances:
        category = pillar.split(" ")[0]
        pillar_stances.append({"category": category, "stance": stances})
Contributor

Could you please make this "stance" --> "agreed_stances" and add another variable "all_stances"?

(Same request as the comment above: all_stances would hold every stance annotation, while the agreed stances are the "correct labels" per the process quoted from Section 3.3, Quality Control Measures.)

Collaborator Author

See my comment above.

@akhdanfadh
Collaborator Author

@holylovenia I've made the changes that could be made. See my comments on your review.

Let's move the new tasks and schema to another PR.

I'll make the new PR once everything on the dataloader part is done and reviewed.

Contributor

@holylovenia holylovenia left a comment


Thanks for the discussion, @akhdanfadh! I understand all your comments and decisions. LGTM. Let's wait for @raileymontalan's review.

Collaborator

@raileymontalan raileymontalan left a comment


Tested and reviewed. LGTM!

@holylovenia holylovenia merged commit 8a35006 into SEACrowd:master May 1, 2024
1 check passed
@akhdanfadh akhdanfadh deleted the total_defense_meme branch May 6, 2024 23:01

Successfully merging this pull request may close these issues.

Create dataset loader for TotalDefMeme