
DrumTrack

Isolate drums from any song and generate a playable MIDI drum track.

Upload an MP3 or paste a YouTube URL, and DrumTrack will separate the drums, detect individual hits, and produce a MIDI file you can drag into your DAW. You can review the detected drum types, relabel them, and regenerate the MIDI before downloading.

How It Works

DrumTrack processes audio through a multi-stage pipeline that combines deep learning models with signal processing.

Screenshots

| Scenario | Screenshot |
|---|---|
| Upload a song or YouTube link | (screenshot) |
| Get song output | (screenshot) |
| MIDI player | (screenshot) |

Examples

| Input | Output (drums are MIDI) |
|---|---|
| Neat - I remember | neat-i-remember-midi-drums.mp4 |

Pipeline Overview

Audio Input (MP3 upload or YouTube URL)
    |
    v
1. Stem Separation (Demucs)
   Isolates drums from bass, vocals, and other instruments
    |
    v
2. Drum Instrument Separation (DrumSep MDX23C)
   Splits the drum stem into 5 individual instruments:
   kick, snare, toms, hi-hat, cymbals
    |
    v
3. Onset Detection (librosa)
   Finds the precise time and velocity of each drum hit
    |
    v
4. Quantization & MIDI Generation (pretty_midi)
   Snaps hits to a musical grid and writes a MIDI file
    |
    v
5. Interactive Review
   Listen back, relabel misclassified hits, re-export

References

- https://github.com/adefossez/demucs
- https://docs.google.com/document/d/17fjNvJzj8ZGSer7c7OFe_CNfUKbAxEh_OBv94ZdRG5c/edit?tab=t.0
- https://github.com/ZFTurbo/Music-Source-Separation-Training/tree/main

Stem Separation

The first step isolates the drum track from the rest of the mix using Facebook Research's Demucs (the htdemucs model), a hybrid transformer/waveform U-Net that separates audio into four stems: drums, bass, vocals, and other. It runs locally on CPU or GPU.

The non-drum stems are mixed back together into a "backing track" that can be played alongside the MIDI output for reference.

Audio hashing (SHA-256) enables deduplication: if you process the same audio twice, stems from the first run are reused.
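The deduplication idea can be sketched as follows; the function and cache names here are illustrative, not the project's actual code, and the real pipeline would persist stems on disk rather than in memory:

```python
import hashlib

def audio_hash(data: bytes) -> str:
    """Content hash of the raw audio, used as the cache key for stems."""
    return hashlib.sha256(data).hexdigest()

# Illustrative in-memory cache keyed by audio hash.
_stem_cache: dict[str, dict] = {}

def separate_with_cache(data: bytes, separate):
    """Only run the (expensive) separation step for unseen audio."""
    key = audio_hash(data)
    if key not in _stem_cache:
        _stem_cache[key] = separate(data)
    return _stem_cache[key]
```

Because the key is derived from the audio bytes, re-uploading the same file (or re-fetching the same YouTube audio) hits the cache regardless of filename.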

Drum Instrument Separation

The isolated drum stem is further separated into five individual instruments using DrumSep, an MDX23C model (TFC-TDF-net architecture). The model operates in the frequency domain:

  1. STFT converts the waveform to a spectrogram (16384-point FFT, 2048 hop)
  2. An encoder-decoder network with TFC-TDF blocks (temporal-frequency convolutions + time-domain filter bottlenecks) predicts five instrument masks
  3. Each mask is applied to the input spectrogram and inverse-STFT'd back to audio

The five output stems are: kick, snare, toms, hi-hat, and cymbals. Model weights (~200MB) are downloaded automatically on first run.
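Conceptually, each predicted mask is applied elementwise to the mixture spectrogram before the inverse STFT. A toy numpy illustration of steps 2-3 (the real model uses a 16384-point FFT and learned soft masks; the tiny sizes and binary mask here are purely for clarity):

```python
import numpy as np

# Toy "spectrogram": 4 frequency bins x 3 frames of complex STFT values.
mix_spec = np.arange(12, dtype=np.complex128).reshape(4, 3)

# One mask per instrument, values in [0, 1]; here a binary mask that
# keeps only the two lowest frequency bins (kick-like energy).
kick_mask = np.array([[1.0], [1.0], [0.0], [0.0]])  # broadcasts over frames

kick_spec = kick_mask * mix_spec  # elementwise masking
# The real pipeline would now inverse-STFT kick_spec back to a waveform.
```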

Onset Detection

Each drum stem is analyzed independently using librosa's onset detection, with per-instrument parameters tuned for different hit characteristics:

| Stem | Threshold (delta) | Min Wait (frames) |
|---|---|---|
| Kick | 0.08 | 3 |
| Snare | 0.06 | 2 |
| Toms | 0.07 | 3 |
| Hi-hat | 0.05 | 1 |
| Cymbals | 0.06 | 4 |

Velocity estimation measures RMS amplitude in a 50ms window around each onset, converts to dB, and maps to MIDI velocity (0-127) on a logarithmic scale. Quiet hits (~-60 dB) map to velocity 20; loud hits (~0 dB) map to 127.
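A sketch of the dB-to-velocity mapping described above, using the two endpoints from the text (the exact curve in the project may differ):

```python
def db_to_velocity(db: float, db_min: float = -60.0, db_max: float = 0.0,
                   vel_min: int = 20, vel_max: int = 127) -> int:
    """Map an RMS level in dB to a MIDI velocity. Linear in dB means
    logarithmic in amplitude; values outside the range are clamped."""
    db = max(db_min, min(db_max, db))
    frac = (db - db_min) / (db_max - db_min)
    return round(vel_min + frac * (vel_max - vel_min))
```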

Deduplication removes double-triggers using type-specific minimum gaps (kick: 35ms, snare: 40ms, hi-hat: 25ms, cymbals: 150ms). When two hits fall within the gap, the louder one is kept.
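The double-trigger removal can be sketched as follows, assuming hits are (time, velocity) pairs (a simplification of whatever structure the project actually stores):

```python
def dedupe_hits(hits, min_gap=0.035):
    """Collapse hits closer than min_gap seconds, keeping the louder one.
    min_gap is per-instrument (e.g. 0.035 for kick, 0.150 for cymbals)."""
    out = []
    for t, vel in sorted(hits):
        if out and t - out[-1][0] < min_gap:
            if vel > out[-1][1]:      # louder hit wins the slot
                out[-1] = (t, vel)
        else:
            out.append((t, vel))
    return out
```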

Quantization

Detected hit times are snapped to a 16th-note grid based on the user-provided BPM:

grid_step = 60 / BPM / 4
grid_position = round(time / grid_step)
quantized_time = grid_position * grid_step

A 30% tolerance preserves swing feel: if a hit deviates from the grid by more than 30% of a grid step, its original timing is kept. This prevents the MIDI from sounding overly mechanical on tracks with intentional swing.
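The snap-with-tolerance logic above can be sketched as:

```python
def quantize_hit(t: float, bpm: float, tolerance: float = 0.30) -> float:
    """Snap a hit time (seconds) to the nearest 16th note, unless it
    deviates by more than `tolerance` of a grid step (preserving swing)."""
    grid_step = 60.0 / bpm / 4.0            # 16th-note duration
    snapped = round(t / grid_step) * grid_step
    if abs(t - snapped) > tolerance * grid_step:
        return t                             # too far off-grid: keep original
    return snapped
```

At 120 BPM the grid step is 0.125 s, so any hit within 37.5 ms of a 16th note gets snapped and anything farther out keeps its original timing.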

MIDI Generation

Quantized events are written to a Standard MIDI File using pretty_midi:

  • All notes are on MIDI channel 10 (General MIDI drum channel)
  • Each hit becomes a 50ms note at the appropriate MIDI note number (e.g., kick=36, snare=38, hi-hat=42)
  • Velocity values from the onset detection stage are preserved

Interactive Review

The web UI allows you to review the automatic classification before downloading. Each detected cluster of hits can be relabeled (e.g., changing a misidentified tom to a snare), and the MIDI is regenerated on the fly. A built-in player syncs the MIDI playback with the backing track so you can hear exactly what the output sounds like.

Architecture

drumtrack/
  app/                    # Next.js app router (pages)
  components/             # React components (shadcn/ui)
  hooks/                  # React hooks (polling, MIDI player)
  lib/                    # API client, MIDI playback engine
  types/                  # TypeScript type definitions
  backend/
    app/
      routers/            # FastAPI endpoints
      services/           # Processing pipeline, ML models
      models/             # Pydantic data models
      storage/            # Job persistence, file management
      ml/                 # MDX23C neural network definition
    models/               # Downloaded model weights (gitignored)
    static/
      samples/
        default/          # Built-in sample set (9 WAVs + kit.json)
    storage/
      jobs/               # Job artifacts per UUID (gitignored)
      samples/            # User-created sample sets (gitignored)

  • Frontend: Next.js 16, React 19, Tailwind CSS 4, shadcn/ui, Tone.js
  • Backend: FastAPI, PyTorch, Demucs, librosa, pretty_midi

Local Setup

Prerequisites

  • Node.js 18+ and pnpm
  • Python 3.11+ and uv
  • ffmpeg (required by Demucs and yt-dlp for audio conversion)
  • yt-dlp (optional, for YouTube URL support)

Backend

cd backend
uv sync

This installs all Python dependencies including PyTorch, Demucs, librosa, and FastAPI. On first run, the DrumSep model weights (~200MB) will be downloaded automatically.

Start the backend:

cd backend
uv run uvicorn app.main:app --reload

The API will be available at http://localhost:8000. Interactive API docs at http://localhost:8000/docs.

Frontend

pnpm install
pnpm dev

The frontend will be available at http://localhost:3000.

By default it connects to the backend at http://localhost:8000. To change this, set the NEXT_PUBLIC_API_URL environment variable.

Usage

  1. Open http://localhost:3000
  2. Upload an MP3 file (or paste a YouTube URL) and enter the song's BPM
  3. Click Start Processing and watch the progress bar
  4. Once complete, review the detected drum hits in the cluster review panel
  5. Use the built-in player to listen to the MIDI drums synced with the backing track
  6. Relabel any misclassified hits and click Regenerate MIDI if needed
  7. Download the MIDI file, individual drum stems, backing track, or full drum track

The sidebar shows all previous jobs. Jobs persist across server restarts.

API

| Method | Path | Description |
|---|---|---|
| POST | /api/jobs/upload | Upload MP3 + BPM, start processing |
| POST | /api/jobs/youtube | Submit YouTube URL + BPM |
| GET | /api/jobs/ | List all jobs (newest first) |
| GET | /api/jobs/{id} | Get job status and progress |
| GET | /api/jobs/{id}/midi | Download MIDI file |
| GET | /api/jobs/{id}/drum-track | Download isolated drum track (MP3) |
| GET | /api/jobs/{id}/other-track | Download backing track (MP3) |
| GET | /api/jobs/{id}/stems/{name} | Download individual drum stem (WAV) |
| GET | /api/jobs/{id}/events | Get drum events as JSON |
| GET | /api/jobs/{id}/clusters | Get clusters and events |
| PUT | /api/jobs/{id}/clusters | Update cluster labels, regenerate MIDI |
| GET | /api/samples | List available sample sets |
| GET | /api/samples/{set} | Get kit manifest (instrument → WAV files) |
| GET | /api/samples/{set}/{file} | Serve a sample WAV file |

Stem names: kick, snare, toms, hh, cymbals.

Evaluation Framework

The transcription pipeline (onset detection → quantization → MIDI) is evaluated using a synthetic dataset approach: MIDI drum patterns are rendered into audio using the real sample kits, then the algorithm runs on that audio and its output is compared against the known MIDI ground truth. This bypasses DrumSep (which runs on pre-separated stems anyway) and evaluates everything downstream of it.

How It Works

Built-in MIDI patterns  ──►  render with sample kit  ──►  5 DrumSep-format stems
                                                                    │
                                                        detect_onsets_from_stems()
                                                                    │
                                                            predicted events
                                                                    │
                                                    compare vs ground_truth.json
                                                                    │
                                              F-measure / onset MAE / velocity RMSE

Evaluation is done at the stem-group level (5 groups) rather than the 9 individual drum types, because the algorithm cannot distinguish tom-high from tom-mid (they share one stem):

| Stem group | Drum types included |
|---|---|
| kick | kick |
| snare | snare |
| toms | tom_high, tom_mid, tom_low |
| hh | closed_hihat, open_hihat |
| cymbals | crash, ride |

Matching uses a 50 ms tolerance window (MIREX standard) on the quantized onset time.
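The matching and F-measure computation can be sketched as a greedy match within the tolerance window (a simplification; the actual evaluator may pair events differently):

```python
def match_onsets(pred, truth, tol=0.050):
    """Greedy 1:1 matching of predicted vs. ground-truth onset times
    within a +/- tol window; returns (precision, recall, f1)."""
    used = [False] * len(truth)
    tp = 0
    for p in sorted(pred):
        for i, t in enumerate(sorted(truth)):
            if not used[i] and abs(p - t) <= tol:
                used[i] = True
                tp += 1
                break
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(truth) if truth else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
```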

Generate Synthetic Data

cd backend

# Step 1 — write 7 built-in MIDI patterns (rock beat, 16th HH, four-on-the-floor, etc.)
uv run python -m eval.evaluate generate-patterns \
  --output-dir ./eval/midis --bpm 120

# Step 2 — render patterns to audio using a sample kit
uv run python -m eval.evaluate generate-dataset \
  --midi-dir ./eval/midis \
  --sample-kit ./static/samples/default/kit.json \
  --output-dir ./eval/dataset

# Optional: generate a noisy version to test robustness
uv run python -m eval.evaluate generate-dataset \
  --midi-dir ./eval/midis \
  --sample-kit ./static/samples/default/kit.json \
  --output-dir ./eval/dataset-noisy \
  --snr 10

Each pattern directory contains mix.wav, stems/ (5 DrumSep-format WAVs), ground_truth.json, and meta.json.

Run Evaluation

cd backend

# Fast evaluation (no neural nets — reads pre-rendered stems directly)
uv run python -m eval.evaluate evaluate \
  --dataset ./eval/dataset \
  --tolerance 50 \
  --output-json ./eval/results.json

Output: per-sample table (P/R/F1/TP/FP/FN per stem group), aggregate mean±std table, and a 5×5 confusion matrix. The --output-json flag writes all results to a file.

Current Results

Evaluated on 7 synthetic patterns rendered with the test kit at BPM=120, clean audio (no added noise):

| Stem group | F1 (mean) | Precision | Recall |
|---|---|---|---|
| kick | 82.7% | 85.7% | 79.9% |
| snare | 85.7% | 85.7% | 85.7% |
| toms | 14.2% | 14.3% | 14.1% |
| hh | 69.8% | 71.4% | 68.6% |
| cymbals | 27.6% | 28.6% | 26.8% |
| overall | 98.4% | 100.0% | 96.9% |
  • Onset MAE: 1.84 ± 1.73 ms (pre-quantization timing accuracy)
  • Velocity RMSE: 22.1 ± 9.2 (MIDI units, 0–127 scale)

Notes on per-group variance: the high std on individual groups is because each pattern only uses a subset of instruments — e.g. tom_fill has no kick (0% for that group), while rock_beat has no toms. The overall F1 (micro-averaged across all groups that are actually present) is a more meaningful single-number summary.

Hi-hat and cymbals score lower because the pipeline applies heuristic post-processing (open/closed HH inference, crash accent filtering, ride re-labeling) that can alter event counts relative to the literal MIDI ground truth. This is expected behaviour — those heuristics are designed for real audio and may over- or under-fire on perfectly synthetic stems.

At SNR=10 dB (significant background noise), overall F1 drops to ~36% driven mostly by false positives, confirming the detector is operating near its noise floor at that level.

Sample Kits

Each sample kit is a directory containing WAV files and a kit.json manifest that maps instrument names to one or more sample files (for round-robin variation):

{
  "kick": ["kick.wav"],
  "snare": ["snare.wav", "snare-alt.wav"],
  "hihat-closed": ["hihat-closed.wav"],
  "hihat-open": ["hihat-open.wav"],
  "tom-low": ["tom-low.wav"],
  "tom-mid": ["tom-mid.wav"],
  "tom-high": ["tom-high.wav"],
  "crash": ["crash.wav"],
  "ride": ["ride.wav"]
}

When multiple files are listed for an instrument, the player cycles through them on each hit (round-robin) for a more natural sound.
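Round-robin playback can be sketched with `itertools.cycle`, independent of however the web player actually implements it:

```python
from itertools import cycle

class RoundRobinKit:
    """Cycle through an instrument's sample files on successive hits."""
    def __init__(self, manifest: dict[str, list[str]]):
        # One independent cycle per instrument, in manifest order.
        self._cycles = {name: cycle(files) for name, files in manifest.items()}

    def next_sample(self, instrument: str) -> str:
        return next(self._cycles[instrument])

kit = RoundRobinKit({"snare": ["snare.wav", "snare-alt.wav"],
                     "kick": ["kick.wav"]})
```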

The built-in default kit lives in backend/static/samples/default/ and is committed to git.

Adding a Custom Kit

  1. Create a new folder under backend/storage/samples/ (e.g., backend/storage/samples/my-kit/)
  2. Add your WAV files to the folder
  3. Create a kit.json mapping instrument names to filenames (see format above)
  4. Refresh the player — your custom kit will appear in the "Sample Kit" dropdown

Only kits with a valid kit.json are listed. The API merges kits from both backend/static/samples/ (built-in) and backend/storage/samples/ (user-created).
