crlandsc/mss-datasets

This repo is a one-stop tool for downloading and aggregating open-source music source separation (MSS) datasets (MUSDB18-HQ, MoisesDB, MedleyDB) into unified stem folders ready for MSS training.

Design decisions were made to maximise the data available per stem rather than to keep the stems uniform. The tradeoff is a small degree of data imbalance in exchange for more data and greater diversity.

Installation

GitHub install:

pip install git+https://github.com/crlandsc/mss-datasets.git

or clone repository, then install locally:

git clone https://github.com/crlandsc/mss-datasets.git
cd mss-datasets
# Local install
pip install -e .

# Development install (adds dev dependencies)
pip install -e ".[dev]"

Requires Python >= 3.9.

Flow 1: Download + Aggregate (All-in-One)

Downloads MUSDB18-HQ and MedleyDB automatically, then aggregates all datasets. MoisesDB must be downloaded manually.

NOTE: Downloading and aggregation take several hours. Raw datasets require ~242 GB (MUSDB18-HQ ~31 GB, MedleyDB ~62 GB, MoisesDB ~149 GB), and aggregation adds ~175–205 GB depending on profile and options, so plan for ~420–450 GB in total. We recommend leaving it running in the background or overnight.
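
To check the storage estimate before starting, a small pre-flight script can compare it with the free space on the target drive. This is a minimal sketch and not part of mss-datasets: the function names are hypothetical, the sizes are the approximate figures quoted above, and it assumes the raw downloads and the output live on the same drive.

```python
import shutil

# Approximate raw dataset sizes in GB, taken from the note above.
RAW_SIZES_GB = {"musdb18hq": 31, "medleydb": 62, "moisesdb": 149}
# Upper end of the quoted aggregation overhead (~175-205 GB).
AGGREGATION_MAX_GB = 205


def required_gb(raw_sizes=RAW_SIZES_GB, aggregation_gb=AGGREGATION_MAX_GB):
    """Return a rough upper bound on the disk space needed, in GB."""
    return sum(raw_sizes.values()) + aggregation_gb


def has_enough_space(path=".", needed_gb=None):
    """Compare free space on the drive holding `path` with the estimate."""
    needed_gb = required_gb() if needed_gb is None else needed_gb
    free_gb = shutil.disk_usage(path).free / 1e9
    return free_gb >= needed_gb


if __name__ == "__main__":
    print(f"Estimated need: ~{required_gb()} GB")
    print("Enough free space:", has_enough_space("."))
```

If the raw datasets and the output sit on different drives, call has_enough_space once per drive with the relevant portion of the estimate.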

Prerequisites

Complete all of the following before running. These steps cannot be skipped.

  1. Create a Zenodo account and personal access token

  2. Request access to the MedleyDB records

    • Visit both pages below while logged in and click "Request access":
    • You must wait for the dataset owners to approve your request. Approval typically happens within minutes, but timing is not guaranteed. Until it is approved, the download will fail with "No files found".
  3. Download MoisesDB manually

    • Visit https://music.ai/research/ and scroll down to the MoisesDB dataset section
    • Select Download and fill in the required fields to request the download. A download link will be sent to your email, usually within minutes.
    • Unzip the archive.
    • It will unzip to a moisesdb_v0.1/ subfolder; use this as the base moisesdb_path.

Run mss-datasets

Option 1: Config file

Recommended for better organization & repeatability.

# provide arguments in config.yaml
datasets:
  # musdb18hq_path and medleydb_path are omitted — auto-set by --download
  moisesdb_path: /path/to/moisesdb  # manual download from step 3
output: ./data
profile: vdbo
workers: 4
data_dir: ./datasets
zenodo_token: YOUR_ZENODO_TOKEN

mss-datasets --config config.yaml --download --aggregate

Option 2: CLI

If you do not want to use a config file, all arguments can be passed directly on the command line.

mss-datasets --download --aggregate \
  --moisesdb-path /path/to/moisesdb \
  --zenodo-token YOUR_ZENODO_TOKEN \
  --data-dir ./datasets \
  --output ./data \
  --workers 4

NOTE: You can also provide the Zenodo token via an environment variable (ZENODO_TOKEN) or a .env file instead of the CLI flag or config file.

Downloads are resumable — if interrupted, re-run the same command to continue.

Flow 2: Aggregate Pre-Downloaded Datasets

If you already have the datasets downloaded, point the tool at their directories. Use any combination — all three are optional, but at least one is required.

Dataset Directory Structures

MUSDB18-HQ: train/ and test/ subdirs, each with Artist - Title folders containing stem WAVs:

musdb18hq/
├── train/
│   ├── Artist - Title/
│   │   ├── vocals.wav
│   │   ├── drums.wav
│   │   ├── bass.wav
│   │   ├── other.wav
│   │   └── mixture.wav
│   └── ...
└── test/
    ├── Artist - Title/
    │   └── (same stem files)
    └── ...
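
Before aggregating, it can be useful to confirm that a local copy matches the layout above. The following is a minimal sketch, not the tool's own validation: find_incomplete_tracks is a hypothetical helper that only checks for the five file names shown in the tree.

```python
from pathlib import Path

# File names expected in every track folder, per the tree above.
EXPECTED_FILES = {"vocals.wav", "drums.wav", "bass.wav", "other.wav", "mixture.wav"}


def find_incomplete_tracks(root):
    """Return {track_dir: [missing files]} for folders lacking expected WAVs."""
    problems = {}
    for split in ("train", "test"):
        split_dir = Path(root) / split
        if not split_dir.is_dir():
            problems[str(split_dir)] = ["(directory missing)"]
            continue
        for track_dir in sorted(p for p in split_dir.iterdir() if p.is_dir()):
            present = {f.name for f in track_dir.iterdir() if f.is_file()}
            missing = EXPECTED_FILES - present
            if missing:
                problems[str(track_dir)] = sorted(missing)
    return problems
```

An empty result means every track folder under train/ and test/ contains all five files.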

MedleyDB: Audio/ subdir with ArtistName_TrackTitle folders, each containing metadata YAML and a _STEMS/ subdir:

medleydb/
└── Audio/
    ├── ArtistName_TrackTitle/
    │   ├── ArtistName_TrackTitle_METADATA.yaml
    │   └── ArtistName_TrackTitle_STEMS/
    │       ├── ArtistName_TrackTitle_STEM_01.wav
    │       ├── ArtistName_TrackTitle_STEM_02.wav
    │       └── ...
    └── ...

If metadata YAML files are missing (common with Zenodo downloads), they will be auto-downloaded from GitHub on first run.

MoisesDB — either the official moisesdb_v0.1/ layout or flat UUID-named track directories:

moisesdb/
├── moisesdb_v0.1/       # official layout
│   └── <provider>/
│       └── <track-uuid>/
│           └── data.json
└── ...

# OR flat layout (common after manual unzip):
moisesdb/
├── <track-uuid>/
│   └── data.json
├── <track-uuid>/
│   └── data.json
└── ...
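
The two layouts can be told apart by where the data.json files sit. The sketch below illustrates one way to do that; it is an assumption about how such detection could work, not the tool's actual logic, and find_moisesdb_tracks is a hypothetical name.

```python
from pathlib import Path


def find_moisesdb_tracks(root):
    """Return (layout, track_dirs) for a MoisesDB directory.

    A track directory is any folder that directly contains data.json.
    """
    root = Path(root)
    official = root / "moisesdb_v0.1"
    if official.is_dir():
        # Official layout: moisesdb_v0.1/<provider>/<track-uuid>/data.json
        tracks = sorted(p.parent for p in official.glob("*/*/data.json"))
        return "official", tracks
    # Flat layout: <track-uuid>/data.json directly under the root
    tracks = sorted(p.parent for p in root.glob("*/data.json"))
    return "flat", tracks
```

Counting the returned track directories gives a quick sanity check that the archive unzipped completely.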

Run

Option 1: Config file

# provide arguments in config.yaml
datasets:
  musdb18hq_path: /path/to/musdb18hq
  moisesdb_path: /path/to/moisesdb
  medleydb_path: /path/to/medleydb
output: ./data
profile: vdbo
workers: 4

mss-datasets --config config.yaml

Option 2: CLI

mss-datasets --aggregate \
  --musdb18hq-path /path/to/musdb18hq \
  --moisesdb-path /path/to/moisesdb \
  --medleydb-path /path/to/medleydb \
  --output ./data \
  --workers 4

Dry run — preview what would be processed without writing files:

mss-datasets --config config.yaml --dry-run
# or
mss-datasets --dry-run --musdb18hq-path /path/to/musdb18hq

Configuration

See config.example.yaml for a fully annotated config template with explanations for every option. Copy it and adjust paths/settings:

cp config.example.yaml config.yaml
# edit config.yaml with your paths
mss-datasets --config config.yaml

Config files use YAML format. Dataset paths go under a datasets: key; all other options are top-level. CLI flags always override config file values.
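
The override rule can be pictured as a layered dictionary merge. This is a minimal sketch under stated assumptions: merge_options is a hypothetical helper, the flattened result is for illustration only, and a CLI value of None stands for "flag not given".

```python
def merge_options(config, cli_args):
    """Overlay CLI values on config values; None means the flag was not given."""
    merged = {}
    # Dataset paths live under the nested `datasets` key in the config file.
    merged.update(config.get("datasets") or {})
    # All other options are top-level keys.
    merged.update({k: v for k, v in config.items() if k != "datasets"})
    # CLI flags win whenever they were explicitly supplied.
    merged.update({k: v for k, v in cli_args.items() if v is not None})
    return merged


config = {
    "datasets": {"moisesdb_path": "/path/to/moisesdb"},
    "output": "./data",
    "workers": 4,
}
cli_args = {"workers": 8, "output": None}
print(merge_options(config, cli_args))
```

Here workers comes from the CLI, while output and moisesdb_path fall back to the config file.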

CLI Reference

| Flag | Default | Description |
| --- | --- | --- |
| --download | off | Download MUSDB18-HQ and MedleyDB |
| --aggregate | off | Aggregate datasets into unified stem folders |
| --musdb18hq-path | -- | Path to MUSDB18-HQ dataset |
| --moisesdb-path | -- | Path to MoisesDB dataset |
| --medleydb-path | -- | Path to MedleyDB dataset |
| --output, -o | ./output | Output directory |
| --profile | vdbo | vdbo (4-stem) or vdbo+gp (6-stem) |
| --workers | 1 | Parallel workers (MoisesDB always sequential) |
| --group-by-dataset | off | Add source dataset subfolders within each stem folder |
| --split-output | off | Organize output into train/ and val/ directories |
| --include-mixtures | off | Generate mixture WAV files |
| --include-bleed | off | Include tracks with stem bleed (excluded by default) |
| --verify-mixtures | off | Verify written stem sums match mixture files (requires --include-mixtures) |
| --dry-run | off | Preview what would be processed without writing |
| --config | -- | Path to YAML config file (implies --aggregate) |
| --data-dir | ./datasets | Directory for raw dataset downloads |
| --zenodo-token | -- | Zenodo access token for MedleyDB (also: ZENODO_TOKEN env var) |
| --verbose, -v | off | Debug logging |
| --version | -- | Show version and exit |
| --help | -- | Show help message and exit |

At least one mode flag is required: --download, --aggregate, or --dry-run. Using --config alone implies --aggregate.
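
The mode-flag rule can be expressed with argparse as follows. This is a sketch of the rule as described here, not the tool's actual parser: the program name and resolve_modes are hypothetical, and it reads "--config alone" as "--config with no other mode flag".

```python
import argparse


def build_parser():
    """Build a parser with only the mode-related flags."""
    parser = argparse.ArgumentParser(prog="mss-datasets-sketch")
    parser.add_argument("--download", action="store_true")
    parser.add_argument("--aggregate", action="store_true")
    parser.add_argument("--dry-run", action="store_true")
    parser.add_argument("--config")
    return parser


def resolve_modes(argv):
    """Parse argv and apply the mode rules described above."""
    parser = build_parser()
    args = parser.parse_args(argv)
    any_mode = args.download or args.aggregate or args.dry_run
    # --config on its own implies --aggregate.
    if args.config and not any_mode:
        args.aggregate = True
        any_mode = True
    if not any_mode:
        parser.error("one of --download, --aggregate, or --dry-run is required")
    return args
```

For example, resolve_modes(["--config", "config.yaml"]) sets aggregate to True, while resolve_modes([]) exits with a usage error.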

Output Format

Output follows the Music-Source-Separation-Training Type 2 layout, with one folder per stem:

vdbo (4-stem): ~164 GB, 2,204 files (file count includes the 460 mixture files written with --include-mixtures):

output/
├── vocals/    (410 files)
├── drums/     (446 files)
├── bass/      (430 files)
├── other/     (458 files)
├── mixture/   (460 files, with --include-mixtures)
└── metadata/

vdbo+gp (6-stem): ~188 GB, 2,522 files (file count includes the 460 mixture files written with --include-mixtures):

output/
├── vocals/    (410 files)
├── drums/     (446 files)
├── bass/      (430 files)
├── other/     (331 files)
├── guitar/    (290 files)
├── piano/     (155 files)
├── mixture/   (460 files, with --include-mixtures)
└── metadata/

--split-output — organize by train/val split:

output/
├── train/
│   ├── vocals/
│   ├── drums/
│   ├── bass/
│   └── other/
├── val/
│   ├── vocals/
│   ├── drums/
│   ├── bass/
│   └── other/
└── metadata/

MUSDB18-HQ "test" tracks are remapped to "val" — there is no "test" directory.

--group-by-dataset — add source dataset subfolders within each stem folder:

output/
├── vocals/
│   ├── musdb18hq/
│   ├── medleydb/
│   └── moisesdb/
├── drums/
│   ├── musdb18hq/
│   ├── medleydb/
│   └── moisesdb/
├── bass/
│   └── ...
├── other/
│   └── ...
└── metadata/

--include-mixtures — generate mixture WAV files in a separate folder:

output/
├── vocals/
├── drums/
├── bass/
├── other/
├── mixture/
└── metadata/

Combines with --group-by-dataset for nested layouts (e.g. output/vocals/musdb18hq/) and with --split-output for split+grouped layouts (e.g. output/train/vocals/musdb18hq/). All flags are composable.
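
The way these flags compose can be sketched as a small path-building function. It is an illustration of the layouts shown above, not the tool's code; stem_dir and its parameters are hypothetical.

```python
from pathlib import Path


def stem_dir(output, stem, split=None, dataset=None):
    """Build the folder a stem file would land in for a given flag combination."""
    parts = [Path(output)]
    if split is not None:      # --split-output adds train/ or val/
        parts.append(split)
    parts.append(stem)
    if dataset is not None:    # --group-by-dataset adds the source dataset
        parts.append(dataset)
    return Path(*parts)


print(stem_dir("output", "vocals").as_posix())
print(stem_dir("output", "vocals", dataset="musdb18hq").as_posix())
print(stem_dir("output", "vocals", split="train", dataset="musdb18hq").as_posix())
```

The three calls print output/vocals, output/vocals/musdb18hq, and output/train/vocals/musdb18hq, matching the examples above.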

The metadata/ directory contains: manifest.json, splits.json, overlap_registry.json, errors.json, config.yaml.

460 tracks total across all three datasets (360 train / 100 val). All output: 44.1 kHz, float32, stereo WAV. Stem folders have independent file counts — not every track appears in every folder.
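
Because stem folders have independent file counts, counting the WAV files per folder is a quick way to compare a finished run with the figures listed above. A minimal sketch, assuming the default layout (or --group-by-dataset) without --split-output; count_wavs is a hypothetical helper:

```python
from pathlib import Path


def count_wavs(output):
    """Return {folder name: number of .wav files} for each top-level folder."""
    counts = {}
    for folder in sorted(p for p in Path(output).iterdir() if p.is_dir()):
        if folder.name == "metadata":
            continue  # metadata/ holds JSON and YAML files, not audio
        # rglob also picks up files inside dataset subfolders.
        counts[folder.name] = sum(1 for _ in folder.rglob("*.wav"))
    return counts
```

With --split-output, run it once on output/train and once on output/val instead.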

Filename format: {source}_{split}_{index:04d}_{artist}_{title}.wav
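
As a worked example, the format string expands as follows. The artist and title values are placeholders, stem_filename is a hypothetical helper, and any sanitising the tool applies to artist or title strings is not documented here, so none is shown.

```python
def stem_filename(source, split, index, artist, title):
    """Fill in the documented filename pattern; the index is zero-padded to 4 digits."""
    return f"{source}_{split}_{index:04d}_{artist}_{title}.wav"


print(stem_filename("musdb18hq", "train", 7, "Artist", "Title"))
```

This prints musdb18hq_train_0007_Artist_Title.wav.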

Datasets

  • MUSDB18-HQ: 150 tracks, 4 stems. 100 train / 50 test.
    • Direct 1:1 stem mapping. Only has 4 stems — does not contribute to guitar/piano in vdbo+gp.
    • Mixture copied directly from source (not summed from stems).
  • MoisesDB: 240 tracks, 11 top-level stems. 50-track val set (deterministic random selection, seed=42).
    • percussion and bass are routed at the sub-stem level, not top-level:
      • Bass guitar, bass synth, contrabass → bass; tuba, bassoon → other
      • Atonal percussion → drums; pitched percussion → other
    • Always processed sequentially (library constraint).
  • MedleyDB v1+v2: 196 tracks, ~121 instrument labels mapped to stems. 5 tracks are excluded (stem bleed), 38 individual stems are excluded (audio content doesn't match routed category), and 1 stem is rerouted to the correct category. The decisions for these overrides were made after manual review, so they are not necessarily comprehensive and may not be perfect. See MedleyDB Exclusions for details.
    • 2 special labels: "Main System" → excluded, "Unlabeled" → other.
    • Multiple stems mapping to the same output category are summed.
    • has_bleed metadata field controls --include-bleed. The 5 override-excluded tracks are always excluded regardless of --include-bleed.
    • Metadata YAML auto-downloaded from GitHub if missing.
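
The MoisesDB sub-stem routing described in the list above amounts to a lookup table. The sketch below is illustrative only: the label strings are descriptive placeholders rather than the exact MoisesDB sub-stem names, and route_substem is a hypothetical function.

```python
# (top-level stem, sub-stem) -> output category, per the routing rules above.
# Label spellings are placeholders, not the exact MoisesDB identifiers.
SUBSTEM_ROUTES = {
    ("bass", "bass guitar"): "bass",
    ("bass", "bass synth"): "bass",
    ("bass", "contrabass"): "bass",
    ("bass", "tuba"): "other",
    ("bass", "bassoon"): "other",
    ("percussion", "atonal percussion"): "drums",
    ("percussion", "pitched percussion"): "other",
}


def route_substem(top_level, substem):
    """Return the output category for a bass or percussion sub-stem."""
    key = (top_level.lower(), substem.lower())
    if key not in SUBSTEM_ROUTES:
        raise KeyError(f"no routing rule for {top_level!r} / {substem!r}")
    return SUBSTEM_ROUTES[key]


print(route_substem("bass", "Tuba"))
print(route_substem("percussion", "Atonal percussion"))
```

Routing at the sub-stem level is what lets a tuba recorded under the bass stem end up in other while a bass guitar stays in bass.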

46 tracks overlap between MUSDB18-HQ and MedleyDB — MedleyDB is preferred (more granular stems). Cross-dataset deduplication is automatic and only triggers when both MUSDB18-HQ and MedleyDB are present.
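
The deduplication rule can be sketched as a filter over the MUSDB18-HQ track list. This is an assumption about the logic, not the tool's implementation: deduplicate is a hypothetical function, and the overlap mapping stands in for whatever overlap_registry.json actually contains.

```python
def deduplicate(musdb_tracks, medleydb_tracks, overlap):
    """Drop MUSDB18-HQ tracks whose MedleyDB counterpart is present.

    `overlap` maps a MUSDB18-HQ track name to its MedleyDB track name.
    """
    # Deduplication only applies when both datasets are present.
    if not musdb_tracks or not medleydb_tracks:
        return list(musdb_tracks), list(medleydb_tracks)
    medley_set = set(medleydb_tracks)
    # MedleyDB is preferred, so the MUSDB18-HQ copy is the one removed.
    kept_musdb = [t for t in musdb_tracks if overlap.get(t) not in medley_set]
    return kept_musdb, list(medleydb_tracks)


musdb = ["Artist A - Song 1", "Artist B - Song 2"]
medley = ["ArtistA_Song1", "ArtistC_Song3"]
overlap = {"Artist A - Song 1": "ArtistA_Song1"}
print(deduplicate(musdb, medley, overlap))
```

In this example only "Artist B - Song 2" survives on the MUSDB18-HQ side, and both MedleyDB tracks are kept.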

Aggregation Notes

  • vdbo routes guitar/piano labels to "other"; vdbo+gp gives them dedicated stems.
  • Mono sources are converted to dual-mono. No resampling is performed; all sources are expected to be 44.1 kHz.
  • Silent stems are automatically skipped.
  • MedleyDB overlap tracks inherit the MUSDB18-HQ split; non-overlap tracks default to train.
  • Splits are locked via splits.json for reproducibility across re-runs.
  • Processing is resumable — tracks with existing output files are skipped on re-run.
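
The split rules for MUSDB18-HQ and MedleyDB in the notes above can be summarised in one function. It is a sketch under stated assumptions: assign_split and its parameters are hypothetical, and the seeded MoisesDB val selection is left out because its exact procedure is not described here.

```python
def assign_split(source, original_split=None, musdb_overlap_split=None):
    """Return "train" or "val" for a MUSDB18-HQ or MedleyDB track.

    original_split: the MUSDB18-HQ split ("train" or "test").
    musdb_overlap_split: for MedleyDB tracks that overlap MUSDB18-HQ,
    the split of the matching MUSDB18-HQ track; None otherwise.
    """
    if source == "musdb18hq":
        # MUSDB18-HQ "test" tracks are remapped to "val".
        return "val" if original_split == "test" else "train"
    if source == "medleydb":
        # Overlap tracks inherit the MUSDB18-HQ split; others default to train.
        inherited = musdb_overlap_split or "train"
        return "val" if inherited == "test" else inherited
    raise ValueError(f"split rule not sketched for source {source!r}")


print(assign_split("musdb18hq", original_split="test"))
print(assign_split("medleydb"))
print(assign_split("medleydb", musdb_overlap_split="test"))
```

The three calls print val, train, and val respectively.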

Contributing

Contributions are welcome! Please open an issue or submit a pull request if you have any bug fixes, improvements, or new features to suggest.

License

This tool is MIT-licensed. The underlying datasets have their own licenses — see LICENSE for details.

References

If you use these datasets in your research, please cite the original papers:

MUSDB18-HQ:

Z. Rafii, A. Liutkus, F.-R. Stöter, S. I. Mimilakis, and R. Bittner. "MUSDB18-HQ — an uncompressed version of MUSDB18," 2019. DOI: 10.5281/zenodo.3338373

MoisesDB:

I. Pereira, F. Araújo, F. Korzeniowski, and R. Vogl. "MoisesDB: A dataset for source separation beyond 4-stems," in Proc. International Society for Music Information Retrieval Conference (ISMIR), Milan, Italy, 2023. arXiv:2307.15913

MedleyDB:

R. Bittner, J. Salamon, M. Tierney, M. Mauch, C. Cannam, and J. P. Bello. "MedleyDB: A multitrack dataset for annotation-intensive MIR research," in Proc. International Society for Music Information Retrieval Conference (ISMIR), Taipei, Taiwan, 2014.

R. Bittner, J. Wilkins, H. Yip, and J. P. Bello. "MedleyDB 2.0: New data and a system for sustainable data collection," in Proc. International Society for Music Information Retrieval Conference (ISMIR) Late Breaking and Demo Papers, New York, USA, 2016.
