Create dataset loader for CMU Wilderness Multilingual Speech Dataset #343

SamuelCahyawijaya · 2024-01-22T06:51:32Z

Dataloader name: cmu_wilderness_multilingual_speech_dataset/cmu_wilderness_multilingual_speech_dataset.py
DataCatalogue: http://seacrowd.github.io/seacrowd-catalogue/card.html?cmu_wilderness_multilingual_speech_dataset

Dataset	cmu_wilderness_multilingual_speech_dataset
Description	The CMU Wilderness Multilingual Speech Dataset is a speech dataset of aligned sentences and audio for around 700 different languages. It is based on readings of the New Testament from Bible.is. It provides data for building Kaldi ASR models and Festvox TTS voices in the target languages.
Subsets	-
Languages	mhx, ifk, tlb, nod, ilo, frd, cgc, tha, cfm, bgr, blt, atq, dtp, cmr, amk, ptu, jav, lsi, nij, mhy, acn, prf, alj, lnd, kzf, pww, sda, mbb, ify, mbt, iba, pse, kje, gbi, mog, alp, twb, law, dni, ahk, rej, bcl, nlc, plw, zyp, lew, mad, txa, bpr, min, kne, agn, mqj, itv, gor, bts, twu, mwv, sml, npy, khm, sas, krj, ury, obo, kqe, mrw, ifb, mvp, cmo, por, xsb, ljp, bru, ban, ind, cnk, sgb, mak, nia, sun, hnn, ceb, btd, lao, pam, kac, ifa, blz, bps, ctd, mnb, pmf, hil, sxn, bep, ppk, mej, ace, ifu, tgl, lex, vie, btx, lhu, pag, xmm, bhz, tby
Tasks	Automatic Speech Recognition, Text-To-Speech Synthesis
License	Unknown (unknown)
Homepage	http://festvox.org/cmu_wilderness/
HF URL	-
Paper URL	https://ieeexplore.ieee.org/document/8683536

akhdanfadh · 2024-03-01T03:08:23Z

It seems the speech data needs to be scraped from a particular website, and the official codebase for the paper does not account for the changed website structure. These links may be helpful:

Opened issues on the repo: festvox/datasets-CMU_Wilderness#11 and festvox/datasets-CMU_Wilderness#1
Reference scraper here; it probably fixes the problem, but I have not tested it yet

holylovenia · 2024-03-12T05:02:24Z

It seems the speech data needs to be scraped from a particular website, and the official codebase for the paper does not account for the changed website structure. These links may be helpful:
Opened issues on the repo: festvox/datasets-CMU_Wilderness#11 and festvox/datasets-CMU_Wilderness#1
Reference scraper here; it probably fixes the problem, but I have not tested it yet

Thanks for inspecting this, @akhdanfadh! May I ask if you'll be able to check if the reference scraper fixed the problem or not (at least for the SEA languages)?

Also, it seems implementing this dataloader warrants a bonus since it's more complex than the others.

akhdanfadh · 2024-03-12T05:51:34Z

Got it @holylovenia, will do by Friday night.

holylovenia · 2024-03-12T05:53:15Z

Thanks a lot, @akhdanfadh!!

akhdanfadh · 2024-03-13T13:49:37Z

After further investigation, the problem is less about the outdated website scraper and more about the dataset itself being out of date. The language IDs used on the current Bible website do not match the LANGIDs used on the dataset website. For example, the dataset lists three LANGIDs for Indonesian (INZNTV, INZSHL, INZTSI), whereas the current Bible website uses the codes INDASV and INDTSI for Indonesian. Implementing the dataloader will therefore be difficult, because someone would have to match the existing dataset against the latest data on the website for every ASEAN language.
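To make the mismatch concrete, here is a minimal sketch that compares the two sets of Indonesian codes quoted above. It assumes the last three characters of each code identify the Bible version and can serve as a matching key; that heuristic is only an illustration and is not documented by the dataset or the website.

```python
# Codes quoted in this comment. The suffix-matching heuristic is an assumption.
DATASET_IDS = ["INZNTV", "INZSHL", "INZTSI"]  # CMU Wilderness LANGIDs
WEBSITE_IDS = ["INDASV", "INDTSI"]            # codes on the current Bible website


def match_by_version_suffix(dataset_ids, website_ids, suffix_len=3):
    """Pair IDs whose trailing `suffix_len` characters agree.

    Returns (matches, unmatched), where `matches` maps a dataset ID to the
    website IDs sharing its suffix and `unmatched` lists the remaining
    dataset IDs in their original order.
    """
    by_suffix = {}
    for wid in website_ids:
        by_suffix.setdefault(wid[-suffix_len:], []).append(wid)

    matches, unmatched = {}, []
    for did in dataset_ids:
        candidates = by_suffix.get(did[-suffix_len:], [])
        if candidates:
            matches[did] = candidates
        else:
            unmatched.append(did)
    return matches, unmatched


matches, unmatched = match_by_version_suffix(DATASET_IDS, WEBSITE_IDS)
print(matches)    # {'INZTSI': ['INDTSI']}
print(unmatched)  # ['INZNTV', 'INZSHL']
```

Even under this generous heuristic, two of the three Indonesian LANGIDs have no counterpart, which is why manual matching would likely be needed.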

@holylovenia @SamuelCahyawijaya @sabilmakbar

holylovenia · 2024-03-14T05:28:48Z

Tough. I was looking through the dataset website too and it seems like they have outgrown CMU Wilderness dataset's coverage.

Have you taken a look at their API, @akhdanfadh? It seems like we should be able to access all of their data through the API.

holylovenia · 2024-03-25T05:14:11Z

May I know if there is any update on this, @akhdanfadh?

akhdanfadh · 2024-03-25T07:58:18Z

I haven't looked into this yet; will do this week.

holylovenia · 2024-03-26T02:54:51Z

Sure! @yongzx will also help inspect this issue.

akhdanfadh · 2024-03-28T07:13:21Z

I have just requested an API key. This was their response:

We will review your request and get back to you within one week. In the mean time feel free to start reading the documentation.

I will update here on anything related from their docs.

EDIT: Done @holylovenia 😵‍💫

API key is needed to access the data as a whole

An API key is required for development and production use. However, sometimes you just want to get a feel for what the data looks like, and plan how you will interact with the data. For that use only, a generic API key is provided as part of the collection. This key is rate-limited to 1000 requests per month.

We can actually explore the data through the Example Workflows. However, if someone wants to access it through a dataloader, the code will need to accept an API key as input.
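One way a dataloader could accept the key without hard-coding it is to read it from an environment variable. The sketch below is only an illustration: the variable name `BIBLE_API_KEY` and the helper `get_api_key` are hypothetical and do not come from the API documentation or the SEACrowd codebase.

```python
import os

# Hypothetical variable name; the API documentation does not prescribe one.
API_KEY_ENV_VAR = "BIBLE_API_KEY"


def get_api_key(env=None):
    """Return the user's API key, or raise a clear error if it is missing.

    `env` defaults to os.environ; passing a plain dict makes testing easy.
    """
    env = os.environ if env is None else env
    key = env.get(API_KEY_ENV_VAR, "").strip()
    if not key:
        raise RuntimeError(
            f"Set the {API_KEY_ENV_VAR} environment variable to your own API key."
        )
    return key
```

Keeping the key out of the repository also fits the license note that content is tied to the application associated with each key.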

Not all contents are available to download

From the API core concept:

The license allows for certain content to be downloaded for offline personal use within the application if the content is specifically marked as permitted for download within the API. The API indicates applicable content via the /download/list endpoint. This endpoint requires an API Key; any fileset from the resulting content list can be downloaded via the /download/:filesetid endpoint. Note that the content must remain within the application; the license allows the content to only be consumed by the application associated with the API Key.

Note the sentence "Note that the content must remain within the application" (bold in the original). I am not sure what "application" means here.
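The two endpoints named in the quote could be addressed roughly as follows. This is a sketch under stated assumptions: the base URL is a placeholder, and passing the key as a `key` query parameter is a guess, since neither detail appears in this thread. Only the `/download/list` and `/download/:filesetid` paths come from the quoted documentation.

```python
from urllib.parse import quote, urlencode, urljoin

# Placeholder base URL and query-parameter name: both are assumptions.
BASE_URL = "https://api.example.org/"
KEY_PARAM = "key"


def download_list_url(api_key, base_url=BASE_URL):
    """URL of the endpoint that lists content permitted for download."""
    return urljoin(base_url, "download/list") + "?" + urlencode({KEY_PARAM: api_key})


def download_fileset_url(fileset_id, api_key, base_url=BASE_URL):
    """URL for downloading one fileset taken from the download list."""
    path = "download/" + quote(fileset_id, safe="")
    return urljoin(base_url, path) + "?" + urlencode({KEY_PARAM: api_key})


print(download_fileset_url("INZNTVN2DA", "MY_KEY"))
# https://api.example.org/download/INZNTVN2DA?key=MY_KEY
```

A dataloader would presumably call the list endpoint first and only request filesets that appear in its result.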

Testing on INZNTV `FilesetId`

1. Download the index files used to reconstruct alignments: INZNTV.tar.gz, provided on the dataset website.
2. Extract and open INZNTV to get the full FilesetId: INZNTVN2DA. This will be used to access the API.
3. Try to download the data using their API. This returned "403 Forbidden".
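The lookup of the FilesetId could be automated along these lines. This is a sketch, assuming the FilesetId appears somewhere in the archive's files as the six-character LANGID followed by four more uppercase letters or digits (as in INZNTVN2DA and EN1NIVN2DA); the actual layout of the index files is not described in this thread, and the demo archive contents below are made up.

```python
import io
import re
import tarfile


def find_fileset_ids(archive, langid):
    """Scan every file in a .tar.gz for tokens of the form LANGID + 4 chars.

    `archive` is a binary file object, e.g. open("INZNTV.tar.gz", "rb").
    """
    pattern = re.compile(re.escape(langid).encode() + rb"[A-Z0-9]{4}")
    found = set()
    with tarfile.open(fileobj=archive, mode="r:gz") as tar:
        for member in tar:
            if not member.isfile():
                continue
            data = tar.extractfile(member).read()
            found.update(match.decode() for match in pattern.findall(data))
    return sorted(found)


def make_demo_archive():
    """Build a tiny in-memory archive with invented contents for the demo."""
    payload = b"INZNTVN2DA_B01 example line\n"
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as tar:
        info = tarfile.TarInfo("INZNTV/example.txt")
        info.size = len(payload)
        tar.addfile(info, io.BytesIO(payload))
    buf.seek(0)
    return buf


print(find_fileset_ids(make_demo_archive(), "INZNTV"))  # ['INZNTVN2DA']
```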

I also tested with what I assume is the most accessed data, English with FilesetId EN1NIVN2DA, and still got "403 Forbidden". I guess I have to wait for the requested API key.

Release date confusion

The release date listed here for the data matching the INZNTV ID (see the entry with the INDNTV ID) is mid-2021, whereas the CMU dataset was released in March 2019. I have not yet found any data versioning in the Bible website API, so I am not sure whether the data will match.

holylovenia · 2024-04-01T06:18:35Z

Thanks a lot, @akhdanfadh! It seems that we will have to update some info for the corresponding datasheet too. 😵‍💫 Tagging @yongzx here too in case we need another pair of eyes for discussion and/or this dataloader implementation.

Note that the content must remain within the application.

I think it means that users are not permitted to upload the data anywhere else. All usage should go through the API with an API key.

Release date confusion

By how things unfold, CMU Wilderness and the current Bible website seem to have different sets of datasets and distinct metadata. Let's follow the current Bible website since it's the one that provides the dataset now. We can even change the datasheet name and the dataloader name if needed.

cc: @SamuelCahyawijaya @sabilmakbar for your information.

akhdanfadh · 2024-04-18T04:40:01Z

I have not received any update on the API key request yet and am still waiting for further instructions. @holylovenia @yongzx

holylovenia · 2024-04-22T07:11:17Z

Got it, there's nothing we can do without the API access for now. 👍 It seems unlikely we can use this dataset for the experiment as well.

If there's no response by the end of SEACrowd, I might add a note on the corresponding datasheet or deprecate it.

Thanks @akhdanfadh! Please keep us updated if there's some news.

sabilmakbar added the "help wanted" label on Jan 30, 2024.

holylovenia added the "bonus +2", "top-priority", and "help wanted" labels and removed the "help wanted" label on Mar 12, 2024.

sabilmakbar assigned akhdanfadh on Mar 12, 2024.

github-actions bot added the "staled-issue" label on Apr 16, 2024.

github-actions bot removed the "staled-issue" label on Apr 19, 2024.

holylovenia added the "in-progress" label and removed the "top-priority" label on Apr 22, 2024.
