
DrumTrack

Isolate drums from any song and generate a playable MIDI drum track.

Upload an MP3 or paste a YouTube URL, and DrumTrack will separate the drums, detect individual hits, and produce a MIDI file you can drag into your DAW. You can review the detected drum types, relabel them, and regenerate the MIDI before downloading.

How It Works

DrumTrack processes audio through a multi-stage pipeline that combines deep learning models with signal processing.

Screenshots

| Scenario | Screenshot |
|---|---|
| Upload a song or YouTube link | (screenshot) |
| Get song output | (screenshot) |
| MIDI player | (screenshot) |

Examples

| Input | Output (drums are MIDI) |
|---|---|
| Neat - I remember | neat-i-remember-midi-drums.mp4 |

Pipeline Overview

Audio Input (MP3 upload or YouTube URL)
    |
    v
1. Stem Separation (Demucs)
   Isolates drums from bass, vocals, and other instruments
    |
    v
2. Drum Instrument Separation (DrumSep MDX23C)
   Splits the drum stem into 5 individual instruments:
   kick, snare, toms, hi-hat, cymbals
    |
    v
3. Onset Detection (librosa)
   Finds the precise time and velocity of each drum hit
    |
    v
4. Quantization & MIDI Generation (pretty_midi)
   Snaps hits to a musical grid and writes a MIDI file
    |
    v
5. Interactive Review
   Listen back, relabel misclassified hits, re-export

References

- https://github.com/adefossez/demucs
- https://docs.google.com/document/d/17fjNvJzj8ZGSer7c7OFe_CNfUKbAxEh_OBv94ZdRG5c/edit?tab=t.0
- https://github.com/ZFTurbo/Music-Source-Separation-Training/tree/main

Stem Separation

The first step isolates the drum track from the rest of the mix using Facebook Research's Demucs (the htdemucs model), a hybrid transformer/waveform U-Net that separates audio into four stems: drums, bass, vocals, and other. It runs locally on CPU or GPU.

The non-drum stems are mixed back together into a "backing track" that can be played alongside the MIDI output for reference.

Audio hashing (SHA-256) enables deduplication: if you process the same audio twice, stems from the first run are reused.
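The deduplication idea can be sketched as follows; the function and cache names here are illustrative, not the project's actual code, and the real pipeline would persist stems on disk rather than in memory:

```python
import hashlib

def audio_hash(data: bytes) -> str:
    """Content hash of the raw audio, used as the cache key for stems."""
    return hashlib.sha256(data).hexdigest()

# Illustrative in-memory cache keyed by audio hash.
_stem_cache: dict[str, dict] = {}

def separate_with_cache(data: bytes, separate):
    """Only run the (expensive) separation step for unseen audio."""
    key = audio_hash(data)
    if key not in _stem_cache:
        _stem_cache[key] = separate(data)
    return _stem_cache[key]
```

Because the key is derived from the audio bytes, re-uploading the same file (or re-fetching the same YouTube audio) hits the cache regardless of filename.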

Drum Instrument Separation

The isolated drum stem is further separated into five individual instruments using DrumSep, an MDX23C model (TFC-TDF-net architecture). The model operates in the frequency domain:

  1. STFT converts the waveform to a spectrogram (16384-point FFT, 2048 hop)
  2. An encoder-decoder network with TFC-TDF blocks (temporal-frequency convolutions + time-domain filter bottlenecks) predicts five instrument masks
  3. Each mask is applied to the input spectrogram and inverse-STFT'd back to audio

The five output stems are: kick, snare, toms, hi-hat, and cymbals. Model weights (~200MB) are downloaded automatically on first run.
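Conceptually, each predicted mask is applied elementwise to the mixture spectrogram before the inverse STFT. A toy numpy illustration of steps 2-3 (the real model uses a 16384-point FFT and learned soft masks; the tiny sizes and binary mask here are purely for clarity):

```python
import numpy as np

# Toy "spectrogram": 4 frequency bins x 3 frames of complex STFT values.
mix_spec = np.arange(12, dtype=np.complex128).reshape(4, 3)

# One mask per instrument, values in [0, 1]; here a binary mask that
# keeps only the two lowest frequency bins (kick-like energy).
kick_mask = np.array([[1.0], [1.0], [0.0], [0.0]])  # broadcasts over frames

kick_spec = kick_mask * mix_spec  # elementwise masking
# The real pipeline would now inverse-STFT kick_spec back to a waveform.
```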

Onset Detection

Each drum stem is analyzed independently using librosa's onset detection, with per-instrument parameters tuned for different hit characteristics:

| Stem | Threshold (delta) | Min Wait (frames) |
|---|---|---|
| Kick | 0.08 | 3 |
| Snare | 0.06 | 2 |
| Toms | 0.07 | 3 |
| Hi-hat | 0.05 | 1 |
| Cymbals | 0.06 | 4 |

Velocity estimation measures RMS amplitude in a 50ms window around each onset, converts to dB, and maps to MIDI velocity (0-127) on a logarithmic scale. Quiet hits (~-60 dB) map to velocity 20; loud hits (~0 dB) map to 127.
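A sketch of the dB-to-velocity mapping described above, using the two endpoints from the text (the exact curve in the project may differ):

```python
def db_to_velocity(db: float, db_min: float = -60.0, db_max: float = 0.0,
                   vel_min: int = 20, vel_max: int = 127) -> int:
    """Map an RMS level in dB to a MIDI velocity. Linear in dB means
    logarithmic in amplitude; values outside the range are clamped."""
    db = max(db_min, min(db_max, db))
    frac = (db - db_min) / (db_max - db_min)
    return round(vel_min + frac * (vel_max - vel_min))
```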

Deduplication removes double-triggers using type-specific minimum gaps (kick: 35ms, snare: 40ms, hi-hat: 25ms, cymbals: 150ms). When two hits fall within the gap, the louder one is kept.
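The double-trigger removal can be sketched as follows, assuming hits are (time, velocity) pairs (a simplification of whatever structure the project actually stores):

```python
def dedupe_hits(hits, min_gap=0.035):
    """Collapse hits closer than min_gap seconds, keeping the louder one.
    min_gap is per-instrument (e.g. 0.035 for kick, 0.150 for cymbals)."""
    out = []
    for t, vel in sorted(hits):
        if out and t - out[-1][0] < min_gap:
            if vel > out[-1][1]:      # louder hit wins the slot
                out[-1] = (t, vel)
        else:
            out.append((t, vel))
    return out
```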

Quantization

Detected hit times are snapped to a 16th-note grid based on the user-provided BPM:

grid_step = 60 / BPM / 4
grid_position = round(time / grid_step)
quantized_time = grid_position * grid_step

A 30% tolerance preserves swing feel: if a hit deviates from the grid by more than 30% of a grid step, its original timing is kept. This prevents the MIDI from sounding overly mechanical on tracks with intentional swing.
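The snap-with-tolerance logic above can be sketched as:

```python
def quantize_hit(t: float, bpm: float, tolerance: float = 0.30) -> float:
    """Snap a hit time (seconds) to the nearest 16th note, unless it
    deviates by more than `tolerance` of a grid step (preserving swing)."""
    grid_step = 60.0 / bpm / 4.0            # 16th-note duration
    snapped = round(t / grid_step) * grid_step
    if abs(t - snapped) > tolerance * grid_step:
        return t                             # too far off-grid: keep original
    return snapped
```

At 120 BPM the grid step is 0.125 s, so any hit within 37.5 ms of a 16th note gets snapped and anything farther out keeps its original timing.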

MIDI Generation

Quantized events are written to a Standard MIDI File using pretty_midi:

  • All notes are on MIDI channel 10 (General MIDI drum channel)
  • Each hit becomes a 50ms note at the appropriate MIDI note number (e.g., kick=36, snare=38, hi-hat=42)
  • Velocity values from the onset detection stage are preserved

Interactive Review

The web UI allows you to review the automatic classification before downloading. Each detected cluster of hits can be relabeled (e.g., changing a misidentified tom to a snare), and the MIDI is regenerated on the fly. A built-in player syncs the MIDI playback with the backing track so you can hear exactly what the output sounds like.

Architecture

drumtrack/
  app/                    # Next.js app router (pages)
  components/             # React components (shadcn/ui)
  hooks/                  # React hooks (polling, MIDI player)
  lib/                    # API client, MIDI playback engine
  types/                  # TypeScript type definitions
  backend/
    app/
      routers/            # FastAPI endpoints
      services/           # Processing pipeline, ML models
      models/             # Pydantic data models
      storage/            # Job persistence, file management
      ml/                 # MDX23C neural network definition
    models/               # Downloaded model weights (gitignored)
    static/
      samples/
        default/          # Built-in sample set (9 WAVs + kit.json)
    storage/
      jobs/               # Job artifacts per UUID (gitignored)
      samples/            # User-created sample sets (gitignored)

  • Frontend: Next.js 16, React 19, Tailwind CSS 4, shadcn/ui, Tone.js
  • Backend: FastAPI, PyTorch, Demucs, librosa, pretty_midi

Local Setup

Prerequisites

  • Node.js 18+ and pnpm
  • Python 3.11+ and uv
  • ffmpeg (required by Demucs and yt-dlp for audio conversion)
  • yt-dlp (optional, for YouTube URL support)

Backend

cd backend
uv sync

This installs all Python dependencies including PyTorch, Demucs, librosa, and FastAPI. On first run, the DrumSep model weights (~200MB) will be downloaded automatically.

Start the backend:

cd backend
uv run uvicorn app.main:app --reload

The API will be available at http://localhost:8000. Interactive API docs at http://localhost:8000/docs.

Frontend

pnpm install
pnpm dev

The frontend will be available at http://localhost:3000.

By default it connects to the backend at http://localhost:8000. To change this, set the NEXT_PUBLIC_API_URL environment variable.

Usage

  1. Open http://localhost:3000
  2. Upload an MP3 file (or paste a YouTube URL) and enter the song's BPM
  3. Click Start Processing and watch the progress bar
  4. Once complete, review the detected drum hits in the cluster review panel
  5. Use the built-in player to listen to the MIDI drums synced with the backing track
  6. Relabel any misclassified hits and click Regenerate MIDI if needed
  7. Download the MIDI file, individual drum stems, backing track, or full drum track

The sidebar shows all previous jobs. Jobs persist across server restarts.

API

| Method | Path | Description |
|---|---|---|
| POST | /api/jobs/upload | Upload MP3 + BPM, start processing |
| POST | /api/jobs/youtube | Submit YouTube URL + BPM |
| GET | /api/jobs/ | List all jobs (newest first) |
| GET | /api/jobs/{id} | Get job status and progress |
| GET | /api/jobs/{id}/midi | Download MIDI file |
| GET | /api/jobs/{id}/drum-track | Download isolated drum track (MP3) |
| GET | /api/jobs/{id}/other-track | Download backing track (MP3) |
| GET | /api/jobs/{id}/stems/{name} | Download individual drum stem (WAV) |
| GET | /api/jobs/{id}/events | Get drum events as JSON |
| GET | /api/jobs/{id}/clusters | Get clusters and events |
| PUT | /api/jobs/{id}/clusters | Update cluster labels, regenerate MIDI |
| GET | /api/samples | List available sample sets |
| GET | /api/samples/{set} | Get kit manifest (instrument → WAV files) |
| GET | /api/samples/{set}/{file} | Serve a sample WAV file |

Stem names: kick, snare, toms, hh, cymbals.

Evaluation Framework

The transcription pipeline (onset detection → quantization → MIDI) is evaluated using a synthetic dataset approach: MIDI drum patterns are rendered into audio using the real sample kits, then the algorithm runs on that audio and its output is compared against the known MIDI ground truth. This bypasses DrumSep (which runs on pre-separated stems anyway) and evaluates everything downstream of it.

How It Works

Built-in MIDI patterns  ──►  render with sample kit  ──►  5 DrumSep-format stems
                                                                    │
                                                        detect_onsets_from_stems()
                                                                    │
                                                            predicted events
                                                                    │
                                                    compare vs ground_truth.json
                                                                    │
                                              F-measure / onset MAE / velocity RMSE

Evaluation is done at the stem-group level (5 groups) rather than the 9 individual drum types, because the algorithm cannot distinguish tom-high from tom-mid (they share one stem):

| Stem group | Drum types included |
|---|---|
| kick | kick |
| snare | snare |
| toms | tom_high, tom_mid, tom_low |
| hh | closed_hihat, open_hihat |
| cymbals | crash, ride |

Matching uses a 50 ms tolerance window (MIREX standard) on the quantized onset time.
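The matching and F-measure computation can be sketched as a greedy match within the tolerance window (a simplification; the actual evaluator may pair events differently):

```python
def match_onsets(pred, truth, tol=0.050):
    """Greedy 1:1 matching of predicted vs. ground-truth onset times
    within a +/- tol window; returns (precision, recall, f1)."""
    used = [False] * len(truth)
    tp = 0
    for p in sorted(pred):
        for i, t in enumerate(sorted(truth)):
            if not used[i] and abs(p - t) <= tol:
                used[i] = True
                tp += 1
                break
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(truth) if truth else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
```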

Generate Synthetic Data

cd backend

# Step 1 — write 7 built-in MIDI patterns (rock beat, 16th HH, four-on-the-floor, etc.)
uv run python -m eval.evaluate generate-patterns \
  --output-dir ./eval/midis --bpm 120

# Step 2 — render patterns to audio using a sample kit
uv run python -m eval.evaluate generate-dataset \
  --midi-dir ./eval/midis \
  --sample-kit ./static/samples/default/kit.json \
  --output-dir ./eval/dataset

# Optional: generate a noisy version to test robustness
uv run python -m eval.evaluate generate-dataset \
  --midi-dir ./eval/midis \
  --sample-kit ./static/samples/default/kit.json \
  --output-dir ./eval/dataset-noisy \
  --snr 10

Each pattern directory contains mix.wav, stems/ (5 DrumSep-format WAVs), ground_truth.json, and meta.json.

Run Evaluation

cd backend

# Fast evaluation (no neural nets — reads pre-rendered stems directly)
uv run python -m eval.evaluate evaluate \
  --dataset ./eval/dataset \
  --tolerance 50 \
  --output-json ./eval/results.json

Output: per-sample table (P/R/F1/TP/FP/FN per stem group), aggregate mean±std table, and a 5×5 confusion matrix. The --output-json flag writes all results to a file.

Current Results

Evaluated on 7 synthetic patterns rendered with the test kit at BPM=120, clean audio (no added noise):

| Stem group | F1 (mean) | Precision | Recall |
|---|---|---|---|
| kick | 82.7% | 85.7% | 79.9% |
| snare | 85.7% | 85.7% | 85.7% |
| toms | 14.2% | 14.3% | 14.1% |
| hh | 69.8% | 71.4% | 68.6% |
| cymbals | 27.6% | 28.6% | 26.8% |
| overall | 98.4% | 100.0% | 96.9% |
  • Onset MAE: 1.84 ± 1.73 ms (pre-quantization timing accuracy)
  • Velocity RMSE: 22.1 ± 9.2 (MIDI units, 0–127 scale)

Notes on per-group variance: the high std on individual groups is because each pattern only uses a subset of instruments — e.g. tom_fill has no kick (0% for that group), while rock_beat has no toms. The overall F1 (micro-averaged across all groups that are actually present) is a more meaningful single-number summary.

Hi-hat and cymbals score lower because the pipeline applies heuristic post-processing (open/closed HH inference, crash accent filtering, ride re-labeling) that can alter event counts relative to the literal MIDI ground truth. This is expected behaviour — those heuristics are designed for real audio and may over- or under-fire on perfectly synthetic stems.

At SNR=10 dB (significant background noise), overall F1 drops to ~36% driven mostly by false positives, confirming the detector is operating near its noise floor at that level.

Sample Kits

Each sample kit is a directory containing WAV files and a kit.json manifest that maps instrument names to one or more sample files (for round-robin variation):

{
  "kick": ["kick.wav"],
  "snare": ["snare.wav", "snare-alt.wav"],
  "hihat-closed": ["hihat-closed.wav"],
  "hihat-open": ["hihat-open.wav"],
  "tom-low": ["tom-low.wav"],
  "tom-mid": ["tom-mid.wav"],
  "tom-high": ["tom-high.wav"],
  "crash": ["crash.wav"],
  "ride": ["ride.wav"]
}

When multiple files are listed for an instrument, the player cycles through them on each hit (round-robin) for a more natural sound.
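Round-robin playback can be sketched with `itertools.cycle`, independent of however the web player actually implements it:

```python
from itertools import cycle

class RoundRobinKit:
    """Cycle through an instrument's sample files on successive hits."""
    def __init__(self, manifest: dict[str, list[str]]):
        # One independent cycle per instrument, in manifest order.
        self._cycles = {name: cycle(files) for name, files in manifest.items()}

    def next_sample(self, instrument: str) -> str:
        return next(self._cycles[instrument])

kit = RoundRobinKit({"snare": ["snare.wav", "snare-alt.wav"],
                     "kick": ["kick.wav"]})
```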

The built-in default kit lives in backend/static/samples/default/ and is committed to git.

Adding a Custom Kit

  1. Create a new folder under backend/storage/samples/ (e.g., backend/storage/samples/my-kit/)
  2. Add your WAV files to the folder
  3. Create a kit.json mapping instrument names to filenames (see format above)
  4. Refresh the player — your custom kit will appear in the "Sample Kit" dropdown

Only kits with a valid kit.json are listed. The API merges kits from both backend/static/samples/ (built-in) and backend/storage/samples/ (user-created).
