@3manifold 3manifold commented Jul 14, 2025

✨ Description

Because the Emilia and Emilia-YODAS datasets are quite large, users may want to download only specific parts of the data. This PR adds a downloader tool that selectively downloads data from the Emilia dataset to a specified destination.

It supports:

  • A data path pattern as input, e.g. datasets/amphion/Emilia-Dataset/Emilia-YODAS/JA/*.tar
  • Resuming the download after an interruption.
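
The two features above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the helper names (`split_hf_path`, `select_files`, `pending_files`, `download`) are hypothetical, and it assumes the pattern is resolved with `HfFileSystem.glob` and each file is fetched with `hf_hub_download` from `huggingface_hub`. Resume is approximated here at the file level by skipping files that already exist in the destination.

```python
import fnmatch
import os
from typing import Iterable, List, Tuple


def split_hf_path(path: str) -> Tuple[str, str]:
    """Split 'datasets/<org>/<repo>/<file path>' into (repo_id, path inside the repo)."""
    parts = path.split("/")
    if len(parts) < 4 or parts[0] != "datasets":
        raise ValueError(f"expected 'datasets/<org>/<repo>/<file>', got {path!r}")
    return "/".join(parts[1:3]), "/".join(parts[3:])


def select_files(all_paths: Iterable[str], pattern: str) -> List[str]:
    """Offline stand-in for a remote glob.

    Note: unlike a filesystem glob, fnmatch lets '*' match '/' as well.
    """
    return sorted(p for p in all_paths if fnmatch.fnmatchcase(p, pattern))


def pending_files(paths: Iterable[str], output_dir: str) -> List[str]:
    """Return the paths whose target file is not yet present under output_dir."""
    todo = []
    for path in paths:
        _, filename = split_hf_path(path)
        if not os.path.exists(os.path.join(output_dir, filename)):
            todo.append(path)
    return todo


def download(pattern: str, output_dir: str, token: str) -> None:
    """Download every file matching pattern that is still missing locally."""
    # Imported lazily so the helpers above work without huggingface_hub installed.
    from huggingface_hub import HfFileSystem, hf_hub_download

    fs = HfFileSystem(token=token)
    todo = pending_files(fs.glob(pattern), output_dir)
    print(f"Number of files to download: {len(todo)}")
    for path in todo:
        repo_id, filename = split_hf_path(path)
        hf_hub_download(
            repo_id=repo_id,
            filename=filename,
            repo_type="dataset",
            token=token,
            local_dir=output_dir,
        )
```

Skipping files that already exist keeps a rerun cheap after an interruption. Whether a partially written file is also resumed depends on the download library's own behaviour, which this sketch does not handle itself.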

Usage

python3 preprocessors/Emilia/utils/data_downloader.py \
  --output_data_path "/mnt/Emilia-YODAS/data" \
  --emilia_token hf_xx \
  --data_path_pattern "datasets/amphion/Emilia-Dataset/Emilia-YODAS/JA/*.tar"
Number of files to download: 30

JA-B000000.tar: 100%|██████████████████████████████████████████████████████| 1.07G/1.07G [02:56<00:00, 6.08MB/s]
20xx-07-xx 10:11:42.724579 downloaded file: Emilia-YODAS/data/Emilia-YODAS/JA/JA-B000000.tar
...
...
downloading dataset complete
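
For reference, the three flags used in the command above could be declared with `argparse` along these lines. This is only a sketch: the function name `build_parser`, the help texts, and the choice to mark every flag as required are assumptions, not taken from the script in this PR.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser for the three flags shown in the usage example."""
    parser = argparse.ArgumentParser(
        description="Selectively download parts of the Emilia dataset."
    )
    # Making the flags required is an assumption of this sketch.
    parser.add_argument("--output_data_path", required=True,
                        help="Local directory to download into.")
    parser.add_argument("--emilia_token", required=True,
                        help="Hugging Face access token.")
    parser.add_argument("--data_path_pattern", required=True,
                        help="Pattern selecting the remote files to download.")
    return parser


# Parse an explicit argument list instead of sys.argv, for illustration.
args = build_parser().parse_args([
    "--output_data_path", "/mnt/Emilia-YODAS/data",
    "--emilia_token", "hf_xx",
    "--data_path_pattern", "datasets/amphion/Emilia-Dataset/Emilia-YODAS/JA/*.tar",
])
```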

👨‍💻 Changes Proposed

  • Add Emilia data downloader tool

✅ Checklist

  • Code complies with the project's code standards and best practices
  • Code has passed all tests
  • Code does not affect the normal use of existing features
  • Code has been commented properly
  • Documentation has been updated (if applicable)
  • Demo/checkpoint has been attached (if applicable)

@3manifold 3manifold changed the title Emilia data downloader Emilia dataset selective downloader Jul 14, 2025
@3manifold 3manifold changed the title Emilia dataset selective downloader Emilia dataset selective downloader tool Jul 14, 2025