Create dataset loader for CMU Wilderness Multilingual Speech Dataset #343
Comments
Thanks for inspecting this, @akhdanfadh! May I ask if you'll be able to check whether the reference scraper fixed the problem or not (at least for the SEA languages)? Also, it seems implementing this dataloader warrants a bonus since it's more complex than the others.
Got it @holylovenia, will do by Friday night.
Thanks a lot, @akhdanfadh!!
After further observation, the problem is more that the dataset itself is out of date, not just that the website scraper is outdated. The language IDs used on the current Bible website do not match the LANGIDs used on the dataset website. For example, there are three LANGIDs for the Indonesian dataset (INZNTV, INZSHL, INZTSI), but on the current Bible website the codes for Indonesian are INDASV and INDTSI. Given this, I think it will be difficult to implement the dataloader, because someone would inevitably have to match the existing dataset against the latest data on the website for all ASEAN languages.
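To illustrate the mismatch, here is a minimal sketch using only the Indonesian codes quoted above. The helper names, and the idea of pairing IDs by their last three characters, are assumptions for illustration; neither website documents such a rule.

```python
# LANGIDs listed for Indonesian on the CMU Wilderness dataset website (from this thread).
dataset_ids = {"INZNTV", "INZSHL", "INZTSI"}
# Codes listed for Indonesian on the current Bible website (from this thread).
website_ids = {"INDASV", "INDTSI"}


def unmatched_ids(dataset_ids, website_ids):
    """Return dataset LANGIDs that have no exact counterpart on the website."""
    return sorted(set(dataset_ids) - set(website_ids))


def candidates_by_suffix(dataset_ids, website_ids, suffix_len=3):
    """Suggest website codes that share the same trailing characters.

    ASSUMPTION: the trailing characters identify the Bible version.
    This is a guess for illustration and would need manual verification.
    """
    by_suffix = {}
    for wid in website_ids:
        by_suffix.setdefault(wid[-suffix_len:], []).append(wid)
    return {
        did: sorted(by_suffix.get(did[-suffix_len:], []))
        for did in sorted(dataset_ids)
    }


missing = unmatched_ids(dataset_ids, website_ids)
pairs = candidates_by_suffix(dataset_ids, website_ids)
```

None of the three dataset LANGIDs has an exact match, and only INZTSI has a suffix-based candidate (INDTSI), so the remaining IDs would still have to be matched by hand.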
Tough. I was looking through the dataset website too and it seems like they have outgrown CMU Wilderness dataset's coverage. Have you taken a look at their API, @akhdanfadh? It seems like we should be able to access all of their data through the API.
May I know if there is any update on this, @akhdanfadh?
I haven't looked into this yet; will do this week.
Sure! @yongzx will also help inspect this issue. |
I am requesting the API key just now. This was what they said.
I will update here on anything related from their docs.
EDIT: Done @holylovenia 😵‍💫

API key is needed to access the data as a whole
We can actually explore them from Example Workflows. But if someone wants to access it through a dataloader, it is necessary to implement an API key input in the code later on.

Not all contents are available to download
From the API core concept:
Note the bold sentence. Not sure what application means.

Testing on INZNTV
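A minimal sketch of how a dataloader might accept the API key as input, assuming the key is either passed explicitly or read from an environment variable. The variable name `BIBLE_API_KEY` and the helper `get_api_key` are hypothetical; nothing here is taken from the API's documentation.

```python
import os

# Hypothetical environment variable name; the real dataloader may choose another.
API_KEY_ENV = "BIBLE_API_KEY"


def get_api_key(explicit_key=None, env=None):
    """Return the API key from an explicit argument or the environment.

    Raises ValueError when no key is available, so the dataloader fails
    early with a clear message instead of sending unauthenticated requests.
    """
    env = os.environ if env is None else env
    key = explicit_key or env.get(API_KEY_ENV)
    if not key:
        raise ValueError(
            f"No API key found. Pass one explicitly or set {API_KEY_ENV}."
        )
    return key
```

Reading the key at load time keeps it out of the repository, which seems in line with the restriction that all usage go through the API and each user's own key.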
Thanks a lot, @akhdanfadh! It seems that we will have to update some info for the corresponding datasheet too. 😵💫 Tagging @yongzx here too in case we need another pair of eyes for discussion and/or this dataloader implementation.
I think it means that users are not permitted to upload the data anywhere else. All usage should go through the API with an API key.
From how things have unfolded, CMU Wilderness and the current Bible website seem to have different sets of datasets and distinct metadata. Let's follow the current Bible website since it's the one that provides the dataset now. We can even change the datasheet name and the dataloader name if needed. cc: @SamuelCahyawijaya @sabilmakbar for your information.
I've got no update on the API key request yet. Still waiting for further instructions. @holylovenia @yongzx
Got it, there's nothing we can do without the API access for now. 👍 It seems unlikely we can use this dataset for the experiment as well. If there's no response by the end of SEACrowd, I might add a note on the corresponding datasheet or deprecate it. Thanks @akhdanfadh! Please keep us updated if there's any news.
Dataloader name: cmu_wilderness_multilingual_speech_dataset/cmu_wilderness_multilingual_speech_dataset.py
DataCatalogue: http://seacrowd.github.io/seacrowd-catalogue/card.html?cmu_wilderness_multilingual_speech_dataset