This repo is a one-stop tool for downloading and aggregating open-source music source separation (MSS) datasets (MUSDB18-HQ, MoisesDB, MedleyDB) into unified stem folders ready for MSS training.
Design decisions were made with the goal of maximising the available data per stem, rather than keeping each stem uniform. This trades a small degree of data imbalance for more data and greater diversity.
GitHub install:
pip install git+https://github.com/crlandsc/mss-datasets.git

Or clone the repository, then install locally:
git clone https://github.com/crlandsc/mss-datasets.git
cd mss-datasets
# Local install
pip install -e .
# Development install (adds dev dependencies)
pip install -e ".[dev]"

Requires Python >= 3.9.
Downloads MUSDB18-HQ and MedleyDB automatically, then aggregates all datasets. MoisesDB must be downloaded manually.
NOTE: Downloading and aggregation will take several hours. Raw datasets require ~242 GB (MUSDB18-HQ ~31 GB, MedleyDB ~62 GB, MoisesDB ~149 GB), and aggregation adds ~175–205 GB depending on profile and options, so plan for ~420–450 GB total. We recommend leaving it running in the background or overnight.
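Before starting a multi-hour run, it can help to confirm that the target disk has room. The snippet below is a minimal sketch using only the standard library; the 450 GB threshold is the upper bound from the note above, and the function names are illustrative, not part of mss-datasets.

```python
import shutil

# Upper bound from the note above (raw datasets + aggregated output).
REQUIRED_GB = 450


def free_space_gb(path="."):
    """Return the free space, in GB (10**9 bytes), of the filesystem holding `path`."""
    return shutil.disk_usage(path).free / 1e9


def has_enough_space(path=".", required_gb=REQUIRED_GB):
    """True if the filesystem holding `path` has at least `required_gb` GB free."""
    return free_space_gb(path) >= required_gb
```

If the raw downloads and the aggregated output live on different disks, check each location separately against its own share of the total.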
Complete all of the following before running. These steps cannot be skipped.
1. Create a Zenodo account and personal access token
   - Create an account at https://zenodo.org
   - Go to https://zenodo.org/account/settings/applications/
   - Click "New token", name it anything (no scopes need to be selected; the token is used for authentication only)
   - Copy the token
2. Request access to the MedleyDB records
   - Visit both pages below while logged in and click "Request access":
     - MedleyDB v1: https://zenodo.org/records/1649325
     - MedleyDB v2: https://zenodo.org/records/1715175
   - You must wait for the dataset owners to approve your request. Approval typically happens within minutes, but timing is not guaranteed. Until approved, the download will fail with "No files found".
3. Download MoisesDB manually
   - Visit https://music.ai/research/ and scroll down to the MoisesDB dataset section
   - Select Download and enter the required fields to request the download. A download link will be sent to your email, usually within minutes.
   - Unzip the archive
   - It will unzip to a moisesdb_v0.1/ subfolder. Use this as the base moisesdb_path.
Using a config file is recommended for better organization and repeatability.
# provide arguments in config.yaml
datasets:
  # musdb18hq_path and medleydb_path are omitted (auto-set by --download)
  moisesdb_path: /path/to/moisesdb # manual download from step 3
output: ./data
profile: vdbo
workers: 4
data_dir: ./datasets
zenodo_token: YOUR_ZENODO_TOKEN

mss-datasets --config config.yaml --download --aggregate

If you do not want to use a config, all arguments can be added directly in the CLI.
mss-datasets --download --aggregate \
  --moisesdb-path /path/to/moisesdb \
  --zenodo-token YOUR_ZENODO_TOKEN \
  --data-dir ./datasets \
  --output ./data \
  --workers 4

NOTE: You can also provide the Zenodo token via environment variable (ZENODO_TOKEN) or .env file instead of the CLI or config flags.
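With several possible token sources, a lookup order matters. The sketch below shows one plausible resolution. The CLI-over-config order follows the rule that CLI flags override config values, but placing the environment variable last is an assumption, and .env parsing is left out. The function name is hypothetical.

```python
import os


def resolve_zenodo_token(cli_token=None, config_token=None, environ=None):
    """Return the first non-empty token among CLI flag, config value, and ZENODO_TOKEN.

    Assumed precedence: CLI > config > environment. Returns None if nothing is set.
    """
    environ = os.environ if environ is None else environ
    for candidate in (cli_token, config_token, environ.get("ZENODO_TOKEN")):
        if candidate:
            return candidate
    return None
```

Keeping the token in the environment or a .env file avoids committing it to version control along with config.yaml.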
Downloads are resumable — if interrupted, re-run the same command to continue.
If you already have the datasets downloaded, point the tool at their directories. Use any combination — all three are optional, but at least one is required.
MUSDB18-HQ — train/ and test/ subdirs, each with Artist - Title folders containing stem WAVs:
musdb18hq/
├── train/
│   ├── Artist - Title/
│   │   ├── vocals.wav
│   │   ├── drums.wav
│   │   ├── bass.wav
│   │   ├── other.wav
│   │   └── mixture.wav
│   └── ...
└── test/
    ├── Artist - Title/
    │   └── (same stem files)
    └── ...
MedleyDB — Audio/ subdir with ArtistName_TrackTitle folders, each containing metadata YAML and a _STEMS/ subdir:
medleydb/
└── Audio/
    ├── ArtistName_TrackTitle/
    │   ├── ArtistName_TrackTitle_METADATA.yaml
    │   └── ArtistName_TrackTitle_STEMS/
    │       ├── ArtistName_TrackTitle_STEM_01.wav
    │       ├── ArtistName_TrackTitle_STEM_02.wav
    │       └── ...
    └── ...
If metadata YAML files are missing (common with Zenodo downloads), they will be auto-downloaded from GitHub on first run.
MoisesDB — either the official moisesdb_v0.1/ layout or flat UUID-named track directories:
moisesdb/
├── moisesdb_v0.1/ # official layout
│   └── <provider>/
│       └── <track-uuid>/
│           └── data.json
└── ...
# OR flat layout (common after manual unzip):
moisesdb/
├── <track-uuid>/
│   └── data.json
├── <track-uuid>/
│   └── data.json
└── ...
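To sanity-check a MoisesDB directory before a long run, you can count the track folders yourself. The following is a rough sketch that treats any directory containing a data.json as a track, which covers both layouts shown above. It is not how mss-datasets discovers tracks internally, and the function name is made up for illustration.

```python
from pathlib import Path


def find_moisesdb_tracks(root):
    """Return sorted track directories under `root`, i.e. those holding a data.json.

    Searching recursively handles the official moisesdb_v0.1/<provider>/<track-uuid>/
    layout as well as flat <track-uuid>/ directories.
    """
    return sorted(p.parent for p in Path(root).rglob("data.json"))
```

For a complete download, `len(find_moisesdb_tracks("/path/to/moisesdb"))` should match the 240 tracks listed for MoisesDB.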
# provide arguments in config.yaml
datasets:
  musdb18hq_path: /path/to/musdb18hq
  moisesdb_path: /path/to/moisesdb
  medleydb_path: /path/to/medleydb
output: ./data
profile: vdbo
workers: 4

mss-datasets --config config.yaml

mss-datasets --aggregate \
  --musdb18hq-path /path/to/musdb18hq \
  --moisesdb-path /path/to/moisesdb \
  --medleydb-path /path/to/medleydb \
  --output ./data \
  --workers 4

Dry run (preview what would be processed without writing files):
mss-datasets --config config.yaml --dry-run
# or
mss-datasets --dry-run --musdb18hq-path /path/to/musdb18hq

See config.example.yaml for a fully annotated config template with explanations for every option. Copy it and adjust paths/settings:
cp config.example.yaml config.yaml
# edit config.yaml with your paths
mss-datasets --config config.yaml

Config files use YAML format. Dataset paths go under a datasets: key; all other options are top-level. CLI flags always override config file values.
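The override rule can be pictured as a small merge step. This is a sketch only, assuming the YAML has already been parsed into a dict and that an unset CLI flag is represented as None. The function name and dict shapes are illustrative, not the tool's internals.

```python
def merge_options(config, cli):
    """Flatten a parsed config dict and overlay the CLI values that were actually given.

    `config` mirrors config.yaml: dataset paths sit under "datasets", all else is top-level.
    `cli` maps option names to values, with None meaning "flag not provided".
    """
    merged = {key: value for key, value in config.items() if key != "datasets"}
    # `datasets:` may parse to None when it only contains comments.
    merged.update(config.get("datasets") or {})
    # CLI flags always win over config file values.
    merged.update({key: value for key, value in cli.items() if value is not None})
    return merged
```

For example, a config with `workers: 4` combined with `--workers 8` on the command line resolves to 8, while options left off the command line keep their config values.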
| Flag | Default | Description |
|---|---|---|
| --download | off | Download MUSDB18-HQ and MedleyDB |
| --aggregate | off | Aggregate datasets into unified stem folders |
| --musdb18hq-path | -- | Path to MUSDB18-HQ dataset |
| --moisesdb-path | -- | Path to MoisesDB dataset |
| --medleydb-path | -- | Path to MedleyDB dataset |
| --output, -o | ./output | Output directory |
| --profile | vdbo | vdbo (4-stem) or vdbo+gp (6-stem) |
| --workers | 1 | Parallel workers (MoisesDB always sequential) |
| --group-by-dataset | off | Add source dataset subfolders within each stem folder |
| --split-output | off | Organize output into train/ and val/ directories |
| --include-mixtures | off | Generate mixture WAV files |
| --include-bleed | off | Include tracks with stem bleed (excluded by default) |
| --verify-mixtures | off | Verify written stem sums match mixture files (requires --include-mixtures) |
| --dry-run | off | Preview what would be processed without writing |
| --config | -- | Path to YAML config file (implies --aggregate) |
| --data-dir | ./datasets | Directory for raw dataset downloads |
| --zenodo-token | -- | Zenodo access token for MedleyDB (also: ZENODO_TOKEN env var) |
| --verbose, -v | off | Debug logging |
| --version | -- | Show version and exit |
| --help | -- | Show help message and exit |
At least one mode flag is required: --download, --aggregate, or --dry-run. Using --config alone implies --aggregate.
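The mode rule above can be expressed as a short check. The sketch below is one reading of it, where "--config alone" is taken to mean that no mode flag was passed. It is illustrative and not copied from the tool's argument handling.

```python
def resolve_modes(download=False, aggregate=False, dry_run=False, config=None):
    """Apply the mode rules: --config with no mode flag implies --aggregate,
    and at least one mode must end up active."""
    if config is not None and not (download or aggregate or dry_run):
        aggregate = True
    if not (download or aggregate or dry_run):
        raise ValueError("specify at least one of --download, --aggregate, --dry-run")
    return {"download": download, "aggregate": aggregate, "dry_run": dry_run}
```

Under this reading, `--config config.yaml --download` only downloads, which is why the full-pipeline example passes both --download and --aggregate.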
Output follows the Music-Source-Separation-Training Type 2 layout, with one folder per stem:
vdbo (4-stem) — ~164 GB, 2,204 files:
output/
├── vocals/ (410 files)
├── drums/ (446 files)
├── bass/ (430 files)
├── other/ (458 files)
├── mixture/ (460 files, with --include-mixtures)
└── metadata/
vdbo+gp (6-stem) — ~188 GB, 2,522 files:
output/
├── vocals/ (410 files)
├── drums/ (446 files)
├── bass/ (430 files)
├── other/ (331 files)
├── guitar/ (290 files)
├── piano/ (155 files)
├── mixture/ (460 files, with --include-mixtures)
└── metadata/
--split-output — organize by train/val split:
output/
├── train/
│   ├── vocals/
│   ├── drums/
│   ├── bass/
│   └── other/
├── val/
│   ├── vocals/
│   ├── drums/
│   ├── bass/
│   └── other/
└── metadata/
MUSDB18-HQ "test" tracks are remapped to "val" — there is no "test" directory.
--group-by-dataset — add source dataset subfolders within each stem folder:
output/
├── vocals/
│   ├── musdb18hq/
│   ├── medleydb/
│   └── moisesdb/
├── drums/
│   ├── musdb18hq/
│   ├── medleydb/
│   └── moisesdb/
├── bass/
│   └── ...
├── other/
│   └── ...
└── metadata/
--include-mixtures — generate mixture WAV files in a separate folder:
output/
├── vocals/
├── drums/
├── bass/
├── other/
├── mixture/
└── metadata/
Combines with --group-by-dataset for nested layouts (e.g. output/vocals/musdb18hq/) and with --split-output for split+grouped layouts (e.g. output/train/vocals/musdb18hq/). All flags are composable.
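How the layout flags compose can be summarised as a path-building rule: an optional split level, then the stem, then an optional dataset level. The helper below is a hypothetical sketch of that rule, built from the example paths above rather than from the tool's source.

```python
from pathlib import Path


def stem_dir(output, stem, split=None, dataset=None):
    """Build the directory for a stem.

    `split` models --split-output ("train" or "val"); `dataset` models
    --group-by-dataset (e.g. "musdb18hq"). Leave either as None when the flag is off.
    """
    path = Path(output)
    if split is not None:
        path = path / split
    path = path / stem
    if dataset is not None:
        path = path / dataset
    return path
```

With both flags on, `stem_dir("output", "vocals", split="train", dataset="musdb18hq")` yields `output/train/vocals/musdb18hq`, matching the split+grouped example.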
The metadata/ directory contains: manifest.json, splits.json, overlap_registry.json, errors.json, config.yaml.
460 tracks total across all three datasets (360 train / 100 val). All output: 44.1 kHz, float32, stereo WAV. Stem folders have independent file counts — not every track appears in every folder.
Filename format: {source}_{split}_{index:04d}_{artist}_{title}.wav
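The format string maps directly onto a Python f-string. The sketch below only demonstrates the zero-padded index. The artist and title values are placeholders, and any sanitization the tool applies to them (spaces, special characters) is not covered here.

```python
def stem_filename(source, split, index, artist, title):
    """Render {source}_{split}_{index:04d}_{artist}_{title}.wav."""
    return f"{source}_{split}_{index:04d}_{artist}_{title}.wav"


# Placeholder values, not a real track from any dataset.
example = stem_filename("musdb18hq", "train", 7, "Artist", "Title")
```

Here `example` is `musdb18hq_train_0007_Artist_Title.wav`.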
- MUSDB18-HQ: 150 tracks, 4 stems. 100 train / 50 test.
  - Direct 1:1 stem mapping. Only has 4 stems, so it does not contribute to guitar/piano in vdbo+gp.
  - Mixture copied directly from source (not summed from stems).
- MoisesDB: 240 tracks, 11 top-level stems. 50-track val set (deterministic random selection, seed=42).
  - percussion and bass are routed at the sub-stem level, not top-level:
    - Bass guitar, bass synth, contrabass → bass; tuba, bassoon → other
    - Atonal percussion → drums; pitched percussion → other
  - Always processed sequentially (library constraint).
- MedleyDB v1+v2: 196 tracks, ~121 instrument labels mapped to stems. 5 tracks are excluded (stem bleed), 38 individual stems are excluded (audio content doesn't match routed category), and 1 stem is rerouted to the correct category. These overrides were decided after manual review, so they are not necessarily comprehensive and may not be perfect. See MedleyDB Exclusions for details.
  - 2 special labels: "Main System" → excluded, "Unlabeled" → other.
  - Multiple stems mapping to the same output category are summed.
  - has_bleed metadata field controls --include-bleed. The 5 override-excluded tracks are always excluded regardless of --include-bleed.
  - Metadata YAML auto-downloaded from GitHub if missing.
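The summing rule, combined with the profile choice (vdbo folds guitar and piano into other, vdbo+gp keeps them separate), can be illustrated with a toy example. This sketch uses plain Python lists in place of audio arrays and made-up category names as input. The real tool's data structures are not shown in this README.

```python
from collections import defaultdict

# Categories that only get their own stem in the vdbo+gp profile.
GP_STEMS = {"guitar", "piano"}


def target_stem(category, profile="vdbo"):
    """Map a routed category to its output stem for the given profile."""
    if category in GP_STEMS and profile == "vdbo":
        return "other"
    return category


def sum_by_stem(tracks, profile="vdbo"):
    """Sum sources that land in the same output stem.

    `tracks` is an iterable of (category, samples) pairs, where `samples` are
    equal-length lists of floats standing in for audio buffers.
    """
    buckets = defaultdict(list)
    for category, samples in tracks:
        buckets[target_stem(category, profile)].append(samples)
    # zip(*group) walks the buffers sample by sample; sum() mixes them.
    return {stem: [sum(frame) for frame in zip(*group)] for stem, group in buckets.items()}
```

Given a guitar source and a piano source, vdbo produces a single summed other buffer, while vdbo+gp returns the two buffers unchanged under guitar and piano.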
46 tracks overlap between MUSDB18-HQ and MedleyDB — MedleyDB is preferred (more granular stems). Cross-dataset deduplication is automatic and only triggers when both MUSDB18-HQ and MedleyDB are present.
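The deduplication rule can be sketched as a filter over the MUSDB18-HQ track list. The overlap mapping below is a hypothetical structure (the format of overlap_registry.json is not documented here), and keeping the MUSDB18-HQ copy when its MedleyDB counterpart is absent is an assumption.

```python
def deduplicate(musdb_tracks, medleydb_tracks, overlap):
    """Drop MUSDB18-HQ tracks whose MedleyDB counterpart is present.

    `overlap` maps a MUSDB18-HQ track name to its MedleyDB track name
    (hypothetical structure). Nothing is dropped unless both datasets are present.
    """
    if not musdb_tracks or not medleydb_tracks:
        return list(musdb_tracks), list(medleydb_tracks)
    present = set(medleydb_tracks)
    kept = [name for name in musdb_tracks if overlap.get(name) not in present]
    return kept, list(medleydb_tracks)
```

When only one of the two datasets is supplied, both lists pass through unchanged, which matches the "only triggers when both are present" behaviour.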
- vdbo routes guitar/piano labels to "other"; vdbo+gp gives them dedicated stems.
- Mono sources are converted to dual-mono. No resampling: all sources are expected to be 44.1 kHz.
- Silent stems are automatically skipped.
- MedleyDB overlap tracks inherit the MUSDB18-HQ split; non-overlap tracks default to train.
- Splits are locked via splits.json for reproducibility across re-runs.
- Processing is resumable: tracks with existing output files are skipped on re-run.
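Split locking can be pictured as "compute once, then always reuse the file". The sketch below pairs a seeded selection (seed 42 and 50 validation tracks, as described for MoisesDB) with a JSON file. The file schema and the exact selection procedure are assumptions, not the tool's actual implementation.

```python
import json
import random
from pathlib import Path


def load_or_create_splits(track_ids, splits_path, val_size=50, seed=42):
    """Return a {track_id: "train" | "val"} mapping, reusing `splits_path` if it exists."""
    splits_path = Path(splits_path)
    if splits_path.exists():
        # A locked split always wins, so re-runs stay reproducible.
        return json.loads(splits_path.read_text())
    ordered = sorted(track_ids)  # sort so the result does not depend on input order
    val = set(random.Random(seed).sample(ordered, val_size))
    splits = {tid: ("val" if tid in val else "train") for tid in ordered}
    splits_path.write_text(json.dumps(splits, indent=2))
    return splits
```

Using a dedicated `random.Random(seed)` instance keeps the selection independent of any other random state in the program.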
Contributions are welcome! Please open an issue or submit a pull request if you have any bug fixes, improvements, or new features to suggest.
This tool is MIT-licensed. The underlying datasets have their own licenses — see LICENSE for details.
If you use these datasets in your research, please cite the original papers:
MUSDB18-HQ:
Z. Rafii, A. Liutkus, F.-R. Stöter, S. I. Mimilakis, and R. Bittner. "MUSDB18-HQ — an uncompressed version of MUSDB18," 2019. DOI: 10.5281/zenodo.3338373
MoisesDB:
I. Pereira, F. Araújo, F. Korzeniowski, and R. Vogl. "MoisesDB: A dataset for source separation beyond 4-stems," in Proc. International Society for Music Information Retrieval Conference (ISMIR), Milan, Italy, 2023. arXiv:2307.15913
MedleyDB:
R. Bittner, J. Salamon, M. Tierney, M. Mauch, C. Cannam, and J. P. Bello. "MedleyDB: A multitrack dataset for annotation-intensive MIR research," in Proc. International Society for Music Information Retrieval Conference (ISMIR), Taipei, Taiwan, 2014.
R. Bittner, J. Wilkins, H. Yip, and J. P. Bello. "MedleyDB 2.0: New data and a system for sustainable data collection," in Proc. International Society for Music Information Retrieval Conference (ISMIR) Late Breaking and Demo Papers, New York, USA, 2016.
