
Create dataset loader for M3LS #228

Closed
SamuelCahyawijaya opened this issue Dec 26, 2023 · 10 comments · Fixed by #675

Comments

@SamuelCahyawijaya
Collaborator

Dataloader name: m3ls/m3ls.py
DataCatalogue: http://seacrowd.github.io/seacrowd-catalogue/card.html?m3ls

| Field | Value |
| --- | --- |
| Dataset | m3ls |
| Description | The multilingual multimodal summarization dataset (M3LS) consists of over a million instances of document-image pairs along with a professionally annotated multimodal summary for each pair. It is derived from news articles published by the British Broadcasting Corporation (BBC) over a decade and spans 20 total languages. |
| Subsets | - |
| Languages | ind |
| Tasks | Summarization |
| License | MIT (mit) |
| Homepage | https://github.com/anubhav-jangra/M3LS |
| HF URL | - |
| Paper URL | https://aclanthology.org/2023.eacl-main.263/ |
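
For whoever picks this up: SeaCrowd loaders follow the Hugging Face `datasets` builder pattern, so the file would look roughly like the sketch below. Everything here (class name, feature schema, manual-download split) is an illustrative assumption, not the final `m3ls.py`.

```python
import os

import datasets


class M3LSLoader(datasets.GeneratorBasedBuilder):
    """Illustrative skeleton only; not the merged m3ls.py."""

    VERSION = datasets.Version("1.0.0")

    def _info(self):
        return datasets.DatasetInfo(
            description="M3LS: multilingual multimodal summarization (BBC news).",
            features=datasets.Features(
                {
                    "id": datasets.Value("string"),
                    "document": datasets.Value("string"),
                    "summary": datasets.Value("string"),
                    # Image paths rather than decoded images, since the
                    # archive layout is not documented yet.
                    "image_paths": datasets.Sequence(datasets.Value("string")),
                }
            ),
            homepage="https://github.com/anubhav-jangra/M3LS",
            license="MIT",
        )

    def _split_generators(self, dl_manager):
        # Assumes a manually downloaded archive (the data sits on Google
        # Drive; see the download discussion further down this thread).
        data_dir = dl_manager.manual_dir
        return [
            datasets.SplitGenerator(
                name=datasets.Split.TRAIN,
                gen_kwargs={"data_dir": data_dir},
            )
        ]

    def _generate_examples(self, data_dir):
        # Placeholder walk; replace with real parsing once the folder
        # structure is known.
        for idx, fname in enumerate(sorted(os.listdir(data_dir))):
            yield idx, {
                "id": fname,
                "document": "",
                "summary": "",
                "image_paths": [],
            }
```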
@sedrickkeh

#self-assign


Hi, may I know if you are still working on this issue? Please let @holylovenia @SamuelCahyawijaya @sabilmakbar know if you need any help.

@sedrickkeh

Working on it. Will try to finish this week

@holylovenia
Contributor

> Working on it. Will try to finish this week

No problem! Feel free to let us know anytime you would like to discuss.


Hi, may I know if you are still working on this issue? Please let @holylovenia @SamuelCahyawijaya @sabilmakbar know if you need any help.

@sabilmakbar
Collaborator

#self-assign

@sabilmakbar
Collaborator

And for this next one, I'll try to get to it immediately after this issue is PR-ed:
#449

@sabilmakbar
Collaborator

Hi, it seems the dataset itself is exceedingly large (the zipped version is around 14 GB; I'm unsure about the actual size after unzipping -- extracting it now).

Also, there is a foreseeable blocker on bypassing the Google Drive download process by passing the GDrive URL to either datasets.DownloadManager or gdown.download. I'm trying to fix the issue and looking at other workarounds as well (I saw a possible workaround in the #206 discussion a while back).

If none of them is possible, maybe the last resort is to change the format into a local-based dataset.
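
For reference, the `gdown` route mentioned above looks roughly like this; the file ID is a placeholder, not the actual M3LS archive ID:

```python
# gdown handles Google Drive's large-file confirmation page, which breaks
# a plain datasets.DownloadManager download. The ID below is a placeholder.
import gdown

url = "https://drive.google.com/uc?id=FILE_ID"  # placeholder, not the real ID
gdown.download(url, output="m3ls.zip", quiet=False, fuzzy=True)
```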

@sabilmakbar
Collaborator

Updates:

1. There are multiple files that can be used, but there's no clear documentation of the contents/folder structure in the description. We probably need to skim the paper or the scraper code to get some hints (a quick way to inspect the archive layout is sketched after this comment).
2. From what I inspected, this data probably contains much more info than we initially thought (not just text summarization of the articles). I don't know whether the images there can be stitched together to get a multimodal dataset, though (it would be valuable if we can somehow pull that off).

Any thoughts or ideas? @holylovenia @SamuelCahyawijaya
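
As a quick sanity check for point 1, the top-level layout can be listed without extracting the full archive (the local filename `m3ls.zip` is an assumption):

```python
# Count entries per top-level directory inside the zip without extracting it.
import zipfile
from collections import Counter

with zipfile.ZipFile("m3ls.zip") as zf:  # assumed local filename
    top_level = Counter(name.split("/", 1)[0] for name in zf.namelist())
    for folder, count in top_level.most_common():
        print(f"{folder}: {count} entries")
```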

@holylovenia
Contributor

holylovenia commented May 2, 2024

Thanks for inspecting this dataset, @sabilmakbar! I think this dataset is a multimodal dataset, precisely a multilingual multimodal summarization dataset. Or did you mean stitching together another multimodal dataset?

fhudi added a commit that referenced this issue May 31, 2024
* add m3ls

* Update seacrowd/sea_datasets/m3ls/m3ls.py

* Apply suggestions from code review

update to comply w/ `black` formatter

Co-authored-by: Frederikus Hudi <frederikus.hudi@gmail.com>

* Update m3ls.py

* Update m3ls.py

* Update m3ls.py following `black` formatter

---------

Co-authored-by: Lj Miranda <12949683+ljvmiranda921@users.noreply.github.com>
Co-authored-by: Frederikus Hudi <frederikus.hudi@gmail.com>