Create dataset loader for Lio and the Central Flores languages #312

SamuelCahyawijaya · 2024-01-10T05:52:28Z

Dataloader name: lio_and_central_flores/lio_and_central_flores.py
DataCatalogue: http://seacrowd.github.io/seacrowd-catalogue/card.html?lio_and_central_flores

Dataset	lio_and_central_flores
Description	This dataset is a collection of language resources for Li'o, Ende, Nage, and So'a, collected in Ende, Flores, East Nusa Tenggara. It comes from the MA thesis "Lio and the Central Flores languages" by Alexander Elias.
Subsets	Lio Collection
Languages	end, nxe, ssq, ljl, eng
Tasks	Automatic Speech Recognition, Machine Translation
License	Unknown (unknown)
Homepage	https://archive.mpi.nl/tla/islandora/search/alexander%20elias?type=dismax&islandora_solr_search_navigation=0&f%5B0%5D=cmd.Contributor%3A%22Alexander%5C%20Elias%22
HF URL	-
Paper URL	https://studenttheses.universiteitleiden.nl/handle/1887/69452


joanitolopo · 2024-01-12T04:52:30Z

#self-assign

joanitolopo · 2024-01-19T05:07:13Z

Hi!
I have a question regarding this dataset. For the sake of simplicity, would you mind if we separated it into one data loader per task (Speech Recognition and Machine Translation)? If not, could you please share a reference that implements two or more tasks in a single data loader?
Thanks!

holylovenia · 2024-01-25T08:56:05Z

> Hi! I have a question regarding this dataset. For the sake of simplicity, would you mind if we separated it into one data loader per task (Speech Recognition and Machine Translation)? If not, could you please share a reference that implements two or more tasks in a single data loader? Thanks!

Hi @joanitolopo, thank you for taking on this dataloader. Could we have multiple subsets instead of multiple dataloaders?

`seacrowd` subsets

lio_and_central_flores_asr_{lang}_seacrowd_sptext for all of the SEA languages
lio_and_central_flores_mt_{lang}_seacrowd_sptext for all of the SEA languages

`source` subsets

lio_and_central_flores_asr_{lang}_source for all of the SEA languages
lio_and_central_flores_mt_{lang}_source for all of the SEA languages
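
The naming scheme proposed above can be enumerated mechanically. The following is a minimal sketch, not the merged dataloader: `LANGS` is taken from the Languages field of the datasheet with `eng` left out, and `subset_names` is a hypothetical helper introduced only for illustration.

```python
from itertools import product

# SEA language codes from the datasheet's "Languages" field (eng excluded).
LANGS = ["end", "nxe", "ssq", "ljl"]
TASKS = ["asr", "mt"]
# Schema suffixes exactly as written in the proposal above.
SCHEMAS = ["source", "seacrowd_sptext"]


def subset_names(dataset="lio_and_central_flores"):
    """Return every subset name of the form {dataset}_{task}_{lang}_{schema}."""
    return [
        f"{dataset}_{task}_{lang}_{schema}"
        for task, lang, schema in product(TASKS, LANGS, SCHEMAS)
    ]


if __name__ == "__main__":
    for name in subset_names():
        print(name)
```

Keeping the task and the language in the subset name lets a single dataloader file serve both tasks, which is the alternative to splitting the loader that this comment suggests.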

joanitolopo · 2024-02-05T09:24:53Z

Hi @holylovenia,

> Could we have multiple subsets instead of multiple dataloaders?

I assumed that we have 16 configs in total, because there are four languages, two tasks, and two schemas (source and seacrowd).

For the seacrowd subsets, I used lio_and_central_flores_asr_{lang}_seacrowd_sptext for the ASR task and lio_and_central_flores_mt_{lang}_seacrowd_t2t for the MT task. Am I right? Thank you.

github-actions · 2024-02-20T01:56:25Z

Hi @, may I know if you are still working on this issue? Please let @holylovenia @SamuelCahyawijaya @sabilmakbar know if you need any help.

holylovenia · 2024-02-26T07:56:43Z

> I assumed that we have 16 configs in total, because there are four languages, two tasks, and two schemas (source and seacrowd).
>
> For the seacrowd subsets, I used lio_and_central_flores_asr_{lang}_seacrowd_sptext for the ASR task and lio_and_central_flores_mt_{lang}_seacrowd_t2t for the MT task. Am I right? Thank you.

Yes. For the MT, could you please use lio_and_central_flores_mt_eng_{lang}_seacrowd_t2t instead of lio_and_central_flores_mt_{lang}_seacrowd_t2t? Just for clarity's sake.

Sorry for the late reply.
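
The convention agreed in the exchange above can be sketched as a small name builder. This is an illustration under stated assumptions, not the code merged in #561: `config_name` and `ALL_CONFIGS` are hypothetical names, and the source configs are assumed to keep the originally proposed `{task}_{lang}_source` pattern, since the thread only revises the seacrowd MT name.

```python
LANGS = ["end", "nxe", "ssq", "ljl"]  # SEA languages listed in the datasheet
DATASET = "lio_and_central_flores"


def config_name(task, lang, seacrowd):
    """Build one config name following the convention agreed in this thread."""
    if not seacrowd:
        # Assumption: source configs keep the originally proposed pattern.
        return f"{DATASET}_{task}_{lang}_source"
    if task == "asr":
        return f"{DATASET}_asr_{lang}_seacrowd_sptext"
    if task == "mt":
        # English is named explicitly alongside the SEA language.
        return f"{DATASET}_mt_eng_{lang}_seacrowd_t2t"
    raise ValueError(f"unknown task: {task}")


ALL_CONFIGS = [
    config_name(task, lang, seacrowd)
    for task in ("asr", "mt")
    for lang in LANGS
    for seacrowd in (False, True)
]

if __name__ == "__main__":
    # 2 tasks x 4 languages x 2 schemas
    print(len(ALL_CONFIGS))
```

Under these assumptions the list has 2 tasks x 4 languages x 2 schemas entries, which matches the 16 configs mentioned earlier in the thread.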

holylovenia · 2024-03-12T05:08:08Z

Adding top-priority and bonus+ labels because we would need this for the experiments.

holylovenia · 2024-03-25T05:19:41Z

Hi @joanitolopo, may I know if you need any help with the dataloader?

SamuelCahyawijaya · 2024-03-30T02:00:52Z

Hi @holylovenia, I had a discussion with @joanitolopo earlier, and it seems almost impossible to create a useful ASR dataset from this data: the video (audio) cannot be aligned with the transcription because there are no clear timestamps, and the audio is noisy and sometimes repetitive.

Nonetheless, I think we can keep the machine translation task, as there are source-to-English sentence pairs provided in the transcription file.

holylovenia · 2024-04-01T05:56:45Z

> Hi @holylovenia, I had a discussion with @joanitolopo earlier, and it seems almost impossible to create a useful ASR dataset from this data: the video (audio) cannot be aligned with the transcription because there are no clear timestamps, and the audio is noisy and sometimes repetitive.
>
> Nonetheless, I think we can keep the machine translation task, as there are source-to-English sentence pairs provided in the transcription file.

Noted, thanks @joanitolopo @SamuelCahyawijaya! But I'll keep the datasheet as-is with ASR and MT tasks since the dataset provides the resources needed for these tasks—albeit with additional postprocessing steps.


github-actions bot assigned joanitolopo Jan 12, 2024

github-actions bot added the staled-issue label Feb 20, 2024

github-actions bot removed the staled-issue label Feb 27, 2024

holylovenia added the bonus +1 and top-priority (Needs to get done ASAP for the experiments) labels Mar 12, 2024

joanitolopo mentioned this issue Mar 31, 2024

Closes #312 | Create lio_and_central_flores dataset loader #561 (merged)

holylovenia added the pr-ready (A PR that closes this issue is ready to be reviewed) label Apr 1, 2024

ljvmiranda921 closed this as completed in #561 Apr 21, 2024

ljvmiranda921 pushed a commit that referenced this issue Apr 21, 2024

Closes #312 | Create lio_and_central_flores dataset loader (#561)

0915fe8

* Create lio_and_central_flores dataset loader * fix requirement issue * adding docstring and run make file
