
Closes #355 | Add Dataloader TotalDefMeme #602

Merged · 3 commits · May 1, 2024

Conversation

Collaborator

@akhdanfadh akhdanfadh commented Apr 2, 2024

Closes #355

There are two tasks with different schemas. The OCR task covers all the images, but the ImageClassification task covers only those with a pillar_stances attribute, since the dataset is about pillar classification (correct me if I'm wrong).

Also, the new image schema is added here rather than in a separate PR, for the sake of example. Once it is checked and approved, I will open a new PR to add the new schema and remove the relevant files from this PR.

Also, similar to #556 and #566: I use a third-party library to download the GDrive data (pip install gdown) because it is more reliable than the dl_manager. Likewise, I store the downloaded data in data/total_defense_meme/. I am aware that I should open or wait for a PR on those two things, so I am currently waiting for further instruction.
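For reference, a minimal sketch of the gdown-based download approach described above. The file ID and filename are hypothetical placeholders; only the data/total_defense_meme/ cache path comes from this PR.

```python
import os

DATA_DIR = "data/total_defense_meme"  # local cache dir used in this PR


def ensure_downloaded(file_id: str, filename: str, data_dir: str = DATA_DIR) -> str:
    """Download a Google Drive file via gdown only if it is not already cached."""
    out_path = os.path.join(data_dir, filename)
    if not os.path.exists(out_path):
        os.makedirs(data_dir, exist_ok=True)
        import gdown  # third-party: pip install gdown

        gdown.download(id=file_id, output=out_path, quiet=False)
    return out_path
```

The download is skipped when the file already exists, so repeated test runs do not re-hit Google Drive.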

Checkbox

  • Confirm that this PR is linked to the dataset issue.
  • Create the dataloader script seacrowd/sea_datasets/my_dataset/my_dataset.py (please use only lowercase and underscore for dataset naming).
  • Provide values for the _CITATION, _DATASETNAME, _DESCRIPTION, _HOMEPAGE, _LICENSE, _URLs, _SUPPORTED_TASKS, _SOURCE_VERSION, and _SEACROWD_VERSION variables.
  • Implement _info(), _split_generators() and _generate_examples() in dataloader script.
  • Make sure that the BUILDER_CONFIGS class attribute is a list with at least one SEACrowdConfig for the source schema and one for a seacrowd schema.
  • Confirm dataloader script works with datasets.load_dataset function.
  • Confirm that your dataloader script passes the test suite run with python -m tests.test_seacrowd seacrowd/sea_datasets/<my_dataset>/<my_dataset>.py.
  • If my dataset is local, I have provided an output of the unit tests in the PR (please copy-paste). This is OPTIONAL for public datasets, as we can test these without access to the data files.

@holylovenia holylovenia assigned raileymontalan and unassigned haryoa Apr 8, 2024
@holylovenia holylovenia removed the top-priority Needs to get done ASAP for the experiments label Apr 11, 2024
Contributor

@holylovenia holylovenia left a comment


Hi @akhdanfadh! Thank you for your contribution. I have some suggestions regarding the dataloader:

  1. Let's move the new tasks and schema to another PR.

        subset_id=_DATASETNAME,
    ),
    SEACrowdConfig(
        name=f"{_DATASETNAME}_{_SEACROWD_SCHEMA['IMC_MULTI']}",
Contributor

f"{_DATASETNAME}_{_SEACROWD_SCHEMA['IMC_MULTI']}" --> f"{_DATASETNAME}_topic_{_SEACROWD_SCHEMA['IMC_MULTI']}"

Collaborator Author

Changing this would break the tests, since config names must follow a fixed template defined in constants.py.

"image_path": image_path,
"metadata": {
"tags": tags,
"stances": [pillar["stance"] for pillar in pillar_stances],
Contributor

Could you please make this "stances" --> "agreed_stances" and add another variable "all_stances"?

The all_stances labels have all the stance annotations while the agreed_stances labels are the "correct labels" based on this processing detail in Section 3.3 Quality Control Measures from the paper.

Lastly, the annotators annotate the meme’s stances towards the assigned pillars: support, against, or neutral. To ensure the reliability of the dataset, each meme is annotated by two annotators. If the disagreements contain similar opinions, the overlap annotations will be considered correct labels. However, if there are disagreements with entirely different perspectives, a third annotator will be brought in to provide an additional annotation for the meme. The overlapping annotations between at least two annotators will then be considered the correct labels. In the extreme case where all three annotators have different opinions, the meme will be flagged and removed from the dataset.

Collaborator Author

In the extreme case where all three annotators have different opinions, the meme will be flagged and removed from the dataset.

Actually, it is not removed in the released data. As an example, here is the raw entry from annotation.json for img_4124 (screenshot omitted).

To ensure the reliability of the dataset, each meme is annotated by two annotators. If the disagreements contain similar opinions, the overlap annotations will be considered correct labels. However, if there are disagreements with entirely different perspectives, a third annotator will be brought in to provide an additional annotation for the meme. The overlapping annotations between at least two annotators will then be considered the correct labels.

There is also no "agreed stances" property in the dataset, unless we add a script to process it ourselves. But IMO, since we are just "loading" the dataset, that processing should be left to the user.
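If a user does want the paper's agreed labels, a minimal user-side sketch could look like the following. This is my reading of the rule in Section 3.3 (keep stances chosen by at least two annotators), not part of the dataloader:

```python
from collections import Counter


def agreed_stances(all_stances):
    """Return stances annotated by at least two annotators.

    `all_stances` is a flat list of per-annotator stance labels for one
    pillar, e.g. ["support", "support", "neutral"].
    """
    counts = Counter(all_stances)
    return [stance for stance, n in counts.items() if n >= 2]
```

With two annotators agreeing, e.g. ["support", "support", "neutral"], this keeps "support"; with total disagreement it returns an empty list, matching the inconsistent cases noted below.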

Collaborator Author

Also to add about the inconsistencies from the dataset, see below.

  • img_4129: a perfect example with three annotators; we can obtain the agreed stances.
  • img_4130: where is the third annotator?
  • img_4131: the first and second annotators already agreed on neutral, so why is there a third annotator? (screenshot omitted)

if raw_pillar_stances:
    for pillar, stances in raw_pillar_stances:
        category = pillar.split(" ")[0]
        pillar_stances.append({"category": category, "stance": stances})
Contributor

Could you please make this "stance" --> "agreed_stances" and add another variable "all_stances"?

(Same request as the comment above: all_stances would hold every stance annotation, while the agreed stances are the "correct labels" per the process quoted from Section 3.3, Quality Control Measures.)

Collaborator Author

See my comment above.

@akhdanfadh
Collaborator Author

@holylovenia I've made the changes that could be made. See my comments on your review.

Let's move the new tasks and schema to another PR.

I'll make the new PR once everything on the dataloader part is done and reviewed.

Contributor

@holylovenia holylovenia left a comment


Thanks for the discussion, @akhdanfadh! I understand all your comments and decisions. LGTM. Let's wait for @raileymontalan's review.

Collaborator

@raileymontalan raileymontalan left a comment


Tested and reviewed. LGTM!

@holylovenia holylovenia merged commit 8a35006 into SEACrowd:master May 1, 2024
1 check passed
@akhdanfadh akhdanfadh deleted the total_defense_meme branch May 6, 2024 23:01

Successfully merging this pull request may close these issues.

Create dataset loader for TotalDefMeme