
Create dataset loader for CMU Wilderness Multilingual Speech Dataset #343

Open
SamuelCahyawijaya opened this issue Jan 22, 2024 · 13 comments
Labels: bonus +2, help wanted, in-progress

Comments

@SamuelCahyawijaya
Collaborator

Dataloader name: cmu_wilderness_multilingual_speech_dataset/cmu_wilderness_multilingual_speech_dataset.py
DataCatalogue: http://seacrowd.github.io/seacrowd-catalogue/card.html?cmu_wilderness_multilingual_speech_dataset

Dataset cmu_wilderness_multilingual_speech_dataset
Description The CMU Wilderness Multilingual Speech Dataset is a speech dataset of aligned sentences and audio for around 700 different languages. It is based on readings of the New Testament from Bible.is. It provides data for building Kaldi ASR models and Festvox TTS voices in the target languages.
Subsets -
Languages mhx, ifk, tlb, nod, ilo, frd, cgc, tha, cfm, bgr, blt, atq, dtp, cmr, amk, ptu, jav, lsi, nij, mhy, acn, prf, alj, lnd, kzf, pww, sda, mbb, ify, mbt, iba, pse, kje, gbi, mog, alp, twb, law, dni, ahk, rej, bcl, nlc, plw, zyp, lew, mad, txa, bpr, min, kne, agn, mqj, itv, gor, bts, twu, mwv, sml, npy, khm, sas, krj, ury, obo, kqe, mrw, ifb, mvp, cmo, por, xsb, ljp, bru, ban, ind, cnk, sgb, mak, nia, sun, hnn, ceb, btd, lao, pam, kac, ifa, blz, bps, ctd, mnb, pmf, hil, sxn, bep, ppk, mej, ace, ifu, tgl, lex, vie, btx, lhu, pag, xmm, bhz, tby
Tasks Automatic Speech Recognition, Text-To-Speech Synthesis
License Unknown (unknown)
Homepage http://festvox.org/cmu_wilderness/
HF URL -
Paper URL https://ieeexplore.ieee.org/document/8683536
@sabilmakbar added the help wanted label on Jan 30, 2024
@akhdanfadh
Collaborator

It seems the speech data needs to be scraped from a particular website, and the official codebases for the paper do not account for the website's changed structure. These links may be helpful:

  • Open issues on the repo: festvox/datasets-CMU_Wilderness#11 and festvox/datasets-CMU_Wilderness#1
  • Reference scraper here, which probably fixes the problem, though I have not tested it yet

@holylovenia
Contributor

holylovenia commented Mar 12, 2024

> It seems the speech data needs to be scraped from a particular website, and the official codebases for the paper do not account for the website's changed structure. These links may be helpful:
> Open issues on the repo: festvox/datasets-CMU_Wilderness#11 and festvox/datasets-CMU_Wilderness#1
> Reference scraper here, which probably fixes the problem, though I have not tested it yet

Thanks for inspecting this, @akhdanfadh! May I ask if you'll be able to check if the reference scraper fixed the problem or not (at least for the SEA languages)?

Also, it seems implementing this dataloader warrants a bonus since it's more complex than the others.

@holylovenia added the bonus +2, top-priority, and help wanted labels and removed the help wanted label on Mar 12, 2024
@akhdanfadh
Collaborator

Got it @holylovenia, will do by Friday night.

@holylovenia
Contributor

> Got it @holylovenia, will do by Friday night.

Thanks a lot, @akhdanfadh!!

@akhdanfadh
Collaborator

akhdanfadh commented Mar 13, 2024

On closer inspection, the problem is less about the outdated website scraper and more about the dataset itself being out of date. The language IDs used on the current Bible website do not match the LANGIDs used on the dataset website. For example, the dataset has three LANGIDs for Indonesian (INZNTV, INZSHL, INZTSI), but the current Bible website lists Indonesian under INDASV and INDTSI. Given this, I think the dataloader will be difficult to implement, because someone would have to match the existing dataset against the latest data on the website for every ASEAN language.
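To give a feel for what that matching might involve, here is a minimal Python sketch. It assumes, purely as a heuristic, that the last three characters of an ID name the Bible version, so IDs sharing that suffix are candidate matches. The function name `match_by_version_suffix` is hypothetical, and the only IDs used are the Indonesian ones mentioned above. Any candidate would still need manual checking.

```python
from collections import defaultdict


def match_by_version_suffix(old_ids, new_ids, suffix_len=3):
    """Pair each old ID with new IDs that share its trailing characters.

    Assumption (not confirmed by either website): the trailing characters
    identify the Bible version, so a shared suffix hints at the same text.
    """
    by_suffix = defaultdict(list)
    for new_id in new_ids:
        by_suffix[new_id[-suffix_len:]].append(new_id)
    # Use .get so that unmatched suffixes yield an empty candidate list.
    return {old_id: by_suffix.get(old_id[-suffix_len:], []) for old_id in old_ids}


cmu_ids = ["INZNTV", "INZSHL", "INZTSI"]   # LANGIDs on the dataset website
site_ids = ["INDASV", "INDTSI"]            # codes on the current Bible website
candidates = match_by_version_suffix(cmu_ids, site_ids)
```

With these inputs only INZTSI gets a candidate (INDTSI); INZNTV and INZSHL come back with empty lists, which is the gap described above.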

@holylovenia @SamuelCahyawijaya @sabilmakbar

@holylovenia
Contributor

holylovenia commented Mar 14, 2024

> On closer inspection, the problem is less about the outdated website scraper and more about the dataset itself being out of date. The language IDs used on the current Bible website do not match the LANGIDs used on the dataset website. For example, the dataset has three LANGIDs for Indonesian (INZNTV, INZSHL, INZTSI), but the current Bible website lists Indonesian under INDASV and INDTSI. Given this, I think the dataloader will be difficult to implement, because someone would have to match the existing dataset against the latest data on the website for every ASEAN language.
>
> @holylovenia @SamuelCahyawijaya @sabilmakbar

Tough. I was looking through the dataset website too, and it seems they have outgrown the CMU Wilderness dataset's coverage.

Have you taken a look at their API, @akhdanfadh? It seems like we should be able to access all of their data through the API.

@holylovenia
Contributor

May I know if there is any update on this, @akhdanfadh?

@akhdanfadh
Collaborator

I haven't looked into this yet; I will do so this week.

@holylovenia
Contributor

> I haven't looked into this yet; I will do so this week.

Sure! @yongzx will also help inspect this issue.

@akhdanfadh
Collaborator

akhdanfadh commented Mar 28, 2024

I requested the API key just now. This is what they said:

> We will review your request and get back to you within one week. In the mean time feel free to start reading the documentation.

I will update here on anything related from their docs.

EDIT: Done @holylovenia 😵‍💫


API key is needed to access the data as a whole

> An API key is required for development and production use. However, sometimes you just want to get a feel for what the data looks like, and plan how you will interact with the data. For that use only, a generic API key is provided as part of the collection. This key is rate-limited to 1000 requests per month.

We can explore the data through the Example Workflows. However, accessing it through a dataloader would require the code to accept an API key as input.
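As a rough illustration of what an API key input could look like, here is a minimal Python sketch that reads the key from an environment variable and fails early with a clear message if it is missing. The variable name `BIBLE_API_KEY` and the helper `get_api_key` are hypothetical; nothing here comes from the API documentation.

```python
import os

# Hypothetical variable name, chosen for illustration only.
API_KEY_ENV_VAR = "BIBLE_API_KEY"


def get_api_key(env=None):
    """Return the user's API key, or raise if it has not been provided.

    `env` defaults to os.environ; passing a dict makes the helper easy to test.
    """
    env = os.environ if env is None else env
    key = env.get(API_KEY_ENV_VAR, "").strip()
    if not key:
        raise RuntimeError(
            f"Set the {API_KEY_ENV_VAR} environment variable to your own API key "
            "before loading this dataset."
        )
    return key
```

Reading the key from the environment keeps it out of the dataloader source, so each user supplies their own key and nothing is committed to the repository.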

Not all contents are available to download

From the API core concept:

> The license allows for certain content to be downloaded for offline personal use within the application if the content is specifically marked as permitted for download within the API. The API indicates applicable content via the /download/list endpoint. This endpoint requires an API Key; any fileset from the resulting content list can be downloaded via the /download/:filesetid endpoint. Note that the content must remain within the application; the license allows the content to only be consumed by the application associated with the API Key.

Note the last sentence (bold in the original). I am not sure what "application" means here.

Testing on INZNTV FilesetId

  1. Download the index files used to reconstruct the alignments: INZNTV.tar.gz, provided on the dataset website.
  2. Extract and open INZNTV to get the full FilesetId: INZNTVN2DA. This is used to access the API.
  3. Try to download the data using their API; this returns "403 Forbidden".

I also tested with what I assume is the most-accessed data, English with FilesetId EN1NIVN2DA, and still got "403 Forbidden". I guess I need to wait for the requested API key.
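Steps 2 and 3 could be sketched in Python roughly as follows. This is only an illustration under several assumptions: that a full FilesetId is the six-character LANGID followed by four uppercase letters or digits (inferred from INZNTVN2DA and EN1NIVN2DA alone), that the key is passed as a `key` query parameter, and that the base URL is supplied by the caller. Only the /download/:filesetid path comes from the quoted docs; both helper names are hypothetical, and no request is sent.

```python
import re
from urllib.parse import quote, urlencode


def find_fileset_ids(text, lang_id):
    """Return unique tokens made of lang_id plus four more characters.

    The ID shape is an assumption based on the two examples in this thread.
    """
    pattern = re.compile(
        r"(?<![A-Z0-9])" + re.escape(lang_id) + r"[A-Z0-9]{4}(?![A-Z0-9])"
    )
    found = []
    for token in pattern.findall(text):
        if token not in found:
            found.append(token)
    return found


def download_url(base_url, fileset_id, api_key):
    """Build a /download/:filesetid URL; the `key` parameter name is assumed."""
    query = urlencode({"key": api_key})
    return f"{base_url.rstrip('/')}/download/{quote(fileset_id)}?{query}"


sample = "aligned/INZNTVN2DA_B01.wav aligned/INZNTVN2DA_B02.wav INZNTV"
ids = find_fileset_ids(sample, "INZNTV")
url = download_url("https://example.invalid/api/", ids[0], "my-key")
```

The `sample` string is made up; the real index files may lay out the FilesetId differently, so the pattern would need adjusting after looking at an actual archive.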

Release date confusion

The release date for the INZNTV data specified here (see the entry with the INDNTV id) is mid-2021, whereas the CMU dataset was released in March 2019. I have not yet found any data versioning in the Bible website API, so I am not sure whether the data will match.

@holylovenia
Contributor

> I requested the API key just now. This is what they said:
>
> > We will review your request and get back to you within one week. In the mean time feel free to start reading the documentation.
>
> I will update here on anything related from their docs.
>
> EDIT: Done @holylovenia 😵‍💫

Thanks a lot, @akhdanfadh! It seems that we will have to update some info for the corresponding datasheet too. 😵‍💫 Tagging @yongzx here too in case we need another pair of eyes for discussion and/or this dataloader implementation.

> Note that the content must remain within the application.

I think it means that users are not permitted to upload the data anywhere else. All usage should go through the API with an API key.

> Release date confusion

From how things have unfolded, CMU Wilderness and the current Bible website seem to have different sets of data and distinct metadata. Let's follow the current Bible website, since it is the one that provides the dataset now. We can even change the datasheet name and the dataloader name if needed.

cc: @SamuelCahyawijaya @sabilmakbar for your information.

@akhdanfadh
Collaborator

akhdanfadh commented Apr 18, 2024

> I requested the API key just now. This is what they said:
>
> > We will review your request and get back to you within one week. In the mean time feel free to start reading the documentation.

I have not received any update on the API key yet. Still waiting for further instructions. @holylovenia @yongzx

@holylovenia added the in-progress label and removed the top-priority label on Apr 22, 2024
@holylovenia
Contributor

> > I requested the API key just now. This is what they said:
> >
> > > We will review your request and get back to you within one week. In the mean time feel free to start reading the documentation.
>
> I have not received any update on the API key yet. Still waiting for further instructions. @holylovenia @yongzx

Got it, there's nothing we can do without the API access for now. 👍 It seems unlikely we can use this dataset for the experiment as well.

If there's no response by the end of SEACrowd, I might add a note on the corresponding datasheet or deprecate it.

Thanks @akhdanfadh! Please keep us updated if there's some news.
