Commit 40d75d1

Merge pull request #6 from audiohacking/copilot/resolve-merge-conflicts
Resolve merge conflicts: resync with upstream while preserving fork features
2 parents 0899b63 + b237e8e commit 40d75d1

31 files changed: +7437 −196 lines changed

README.md

Lines changed: 37 additions & 63 deletions
@@ -31,10 +31,6 @@ cmake --build . --config Release -j$(nproc)
 
 Builds two binaries: `ace-qwen3` (LLM) and `dit-vae` (DiT + VAE).
 
-**CI (GitHub Actions)**
-- **Build**: on every push/PR, builds on Ubuntu (BLAS) and macOS (Metal); smoke test runs each binary `--help`.
-- **Test generation**: on release or manual trigger only; runs the same checks as **local** `tests/run-generation-tests.sh`. Validate locally first (build + `./models.sh`, then `tests/run-generation-tests.sh`), then use CI to confirm. See `.github/workflows/`.
-
 ## Models
 
 Pre-quantized GGUFs on [Hugging Face](https://huggingface.co/Serveurperso/ACE-Step-1.5-GGUF).
@@ -143,16 +139,10 @@ cd examples
 ./partial.sh   # caption + lyrics + duration
 ./full.sh      # all metadata provided
 ./dit-only.sh  # skip LLM, DiT from noise
-./cover.sh           # cover mode: decode precomputed audio_codes (no LLM)
-./cover-reference.sh # cover + reference_audio for timbre (WAV/MP3; needs reference.wav or .mp3)
-./test-reference.sh  # reference_audio (WAV or MP3) + audio_cover_strength
-./lora.sh            # DiT + LoRA adapter
 ```
 
 Each example has a `-sft` variant (SFT model, 50 steps, CFG 7.0)
-alongside the turbo default (8 steps, no CFG). For **reference timbre**, set `reference_audio` to a **WAV or MP3** path; dit-vae loads it (MP3 decoded in memory via header-only minimp3, no temp files), encodes with the VAE encoder (requires a full VAE GGUF that includes encoder weights).
-
-**LoRA adapters**: use `--lora <path>` and optional `--lora-scale <float>` with dit-vae to run the DiT with PEFT-style Ace-Step LoRAs.
+alongside the turbo default (8 steps, no CFG).
 
 ## Generation modes
 
@@ -180,11 +170,10 @@ Run `dit-vae` to decode existing codes. See `examples/dit-only.json`.
 
 ## Request JSON reference
 
-All fields with defaults. Only `caption` is required. Built-in modes (text2music, cover, repaint) and audio inputs follow the [ACE-Step 1.5 Tutorial](https://github.com/ace-step/ACE-Step-1.5/blob/main/docs/en/Tutorial.md); see [docs/MODES.md](docs/MODES.md) for what is implemented.
+All fields with defaults. Only `caption` is required.
 
 ```json
 {
-  "task_type": "text2music",
   "caption": "",
   "lyrics": "",
   "instrumental": false,
@@ -199,12 +188,7 @@ All fields with defaults. Only `caption` is required. Built-in modes (text2music
   "lm_top_p": 0.9,
   "lm_top_k": 0,
   "lm_negative_prompt": "",
-  "reference_audio": "",
-  "src_audio": "",
   "audio_codes": "",
-  "audio_cover_strength": 1.0,
-  "repainting_start": 0.0,
-  "repainting_end": 0.0,
   "inference_steps": 8,
   "guidance_scale": 7.0,
   "shift": 3.0
@@ -214,12 +198,7 @@ All fields with defaults. Only `caption` is required. Built-in modes (text2music
 Key fields: `seed` -1 means random (resolved once, then +1 per batch
 element). `audio_codes` is generated by ace-qwen3 and consumed by
 dit-vae (comma separated FSQ token IDs). When present, the LLM is
-skipped entirely (cover-style generation). `reference_audio`: path to a **WAV or MP3** file for global timbre/style (MP3 decoded in memory; encoded via built-in VAE encoder; requires VAE GGUF with encoder weights). `src_audio`: path to a **WAV or MP3** for cover source; dit-vae encodes it (VAE + FSQ nearest-codeword) to codes internally, no Python required (see docs/MODES.md).
-
-**Reference and cover strength (not the same as guidance_scale):**
-- **`audio_cover_strength`** (0.0–1.0): Controls how strongly the **cover/source** (from `audio_codes` or `src_audio`) influences the DiT context. The context is blended with silence: `(1 - audio_cover_strength)*silence + audio_cover_strength*decoded`. Use 1.0 for full cover influence, lower values to soften it. Only applies when cover context is present.
-- **`reference_audio`**: Timbre from the reference file is applied at full strength; there is no separate strength parameter for reference timbre.
-- **`guidance_scale`**: This is **DiT classifier-free guidance** (conditioned vs unconditioned prediction), not reference or cover strength. Turbo models ignore it (forced to 1.0).
+skipped entirely.
 
 Turbo preset: `inference_steps=8, shift=3.0` (no guidance_scale, turbo models don't use CFG).
 SFT preset: `inference_steps=50, guidance_scale=4.0, shift=6.0`.
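The seed rule kept in this hunk (-1 resolves to a random base once, then each batch element adds its index) can be sketched as follows. This is an illustrative helper under assumed names, not the project's API:

```python
import random

def resolve_seeds(seed, batch):
    """Sketch of the documented rule: seed == -1 picks a random base seed
    once; batch element b then uses base + b. Hypothetical helper name."""
    base = random.randrange(2**31) if seed == -1 else seed
    return [base + b for b in range(batch)]

assert resolve_seeds(42, 3) == [42, 43, 44]
assert len(set(resolve_seeds(-1, 4))) == 4  # consecutive, hence distinct
```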
@@ -241,6 +220,7 @@ Output naming: input.json -> input0.json, input1.json, ... (last digit = batch i
 Debug:
   --max-seq <N>         KV cache size (default: 8192)
   --no-fsm              Disable FSM constrained decoding
+  --no-fa               Disable flash attention
   --dump-logits <path>  Dump prefill logits (binary f32)
   --dump-tokens <path>  Dump prompt token IDs (CSV)
 ```
@@ -262,10 +242,6 @@ Required:
   --dit <gguf>          DiT GGUF file
   --vae <gguf>          VAE GGUF file
 
-LoRA:
-  --lora <path>         LoRA adapter (adapter_model.safetensors)
-  --lora-scale <float>  LoRA scale, e.g. alpha/rank (default: 1.0)
-
 Batch:
   --batch <N>           DiT variations per request (default: 1, max 9)
@@ -276,6 +252,7 @@ VAE tiling (memory control):
   --vae-overlap <N>     Overlap frames per side (default: 64)
 
 Debug:
+  --no-fa               Disable flash attention
   --dump <dir>          Dump intermediate tensors
 ```
 
@@ -320,10 +297,7 @@ conditional and N unconditional sequences are packed into a single forward pass
 `logits = uncond + scale * (cond - uncond)`. The KV cache is a single 4D tensor
 `[D, max_seq, Nkv, n_sets]` shared across all batch elements and CFG paths. Shared
 prompts are prefilled once and cloned to other KV sets via copy, avoiding redundant
-prefills. Embedding lookup bypasses ggml_get_rows entirely: rows are read directly
-from the mmap'd GGUF file on CPU, dequantized, and uploaded as F32 input tensors.
-Decode uses a dedicated single-backend graph allocator (gallocr) with no scheduler
-dispatch overhead, while prefill uses the multi-backend scheduler for flexibility.
+prefills.
 
 ## Accuracy
 
@@ -343,42 +317,42 @@ python3 debug-dit-cossim.py # DiT: per-layer cossim GGML vs Python (turbo/
 
 ## Patched GGML fork
 
-Uses a patched GGML fork (submodule) with ops added for the Oobleck VAE decoder.
+Uses a patched GGML fork (submodule) with two new ops and a CUDA bugfix for the Oobleck
+VAE decoder. All backends: CPU, CUDA, Metal, Vulkan. F32/F16/BF16 data types.
+The DiT uses only standard GGML ops and needs no patches.
 
 The VAE reconstructs audio from latent space through 5 upsampling blocks (total 1920x),
 each running a transposed convolution followed by 3 WaveNet-style residual units with
 dilated convolutions and Snake activations. A single tile builds a graph of 36 snake
 activations, 5 transposed convolutions, and 32 regular convolutions. At the final blocks,
-sequence lengths reach 491520 timesteps, which stresses GGML ops designed for short NLP sequences.
-The DiT (flow matching diffusion transformer) uses only standard GGML ops and needs no patches.
-
-Patches on top of upstream GGML, oldest first:
-
-| Commit | Scope | Description |
-|--------|-------|-------------|
-| `8c70db84` | CUDA | `conv_transpose_1d`: replace O(T_in) brute-force loop with bounded range |
-| `b65bf458` | CUDA | `im2col`: grid-stride loop on OW to fix gridDim.y overflow when T > 65535 |
-| `e0e36f3c` | Metal | `conv_transpose_1d`: same bounded loop fix as CUDA |
-| `2b9080bd` | CPU, CUDA, Metal | New `GGML_OP_COL2IM_1D`: scatter-add for GEMM-based conv_transpose_1d decomposition |
-| `02c8041f` | CPU, CUDA, Metal | New `GGML_OP_SNAKE`: fused activation y = x + sin^2(a*x) / b (replaces 5 element-wise ops) |
-| `3f60b19c` | Metal | Fix snake kernel to use current C wrapper API |
-| `cb5d7067` | Vulkan | Guard `VK_EXT_layer_settings` for legacy Vulkan SDK (fixes MI50/gfx906) |
-| `1f0f4214` | Vulkan | `col2im_1d`: add Vulkan backend |
-| `efbf3df6` | Vulkan | `snake`: add Vulkan backend |
-| `6608cd11` | Vulkan | Fix rvalue ref for `col2im_1d` and `snake` push constants |
-| `06101d38` | Vulkan | Fix double-division dispatch for `col2im_1d` and `snake` |
-| `91416cee` | CPU, CUDA, Metal, Vulkan | `col2im_1d`: fuse padding crop via p0 parameter (saves 5 allocs + 5 memcpy per VAE tile) |
-| `20675b09` | Vulkan | `col2im_1d`, `snake`: 2D dispatch (fixes workgroup overflow on MI50) |
-
-**Why col2im_1d**: upstream `ggml_conv_transpose_1d` uses a naive CUDA kernel (one scalar
-FMA loop per output element, no shared memory, no tensor cores). The VAE spends 40% of its
-FLOP budget on transposed convolutions. We decompose it as `mul_mat + col2im_1d`, routing
-the heavy GEMM through cuBLAS/BLAS/MPS tensor cores. The col2im_1d gather has a 2-iteration
-inner loop and is pure bandwidth.
-
-**Why snake**: the Oobleck VAE uses Snake1d activation (x + sin^2(a*x) / b) 36 times per
-tile. Without a fused op, each activation requires 5 separate GGML kernels (mul, sin, sqr,
-mul, add), causing 5x the memory traffic. The fused kernel reads x once, writes y once.
+sequence lengths reach 491520 timesteps, which stresses GGML ops designed for short NLP
+sequences.
+
+### `GGML_OP_SNAKE` (fused Snake activation)
+
+Computes y = x + sin^2(a * x) * inv_b in a single kernel.
+The Oobleck VAE calls this 36 times per tile. Without a fused op, each activation
+requires 5 separate GGML kernels (mul, sin, sqr, mul, add), causing 5x the memory
+traffic. The fused kernel reads x once and writes y once. BF16 cast nodes before/after
+each snake call halve memory bandwidth at the cost of negligible precision loss
+(cossim > 0.999 vs F32 baseline).
+
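The fused op and the five-kernel chain it replaces are numerically identical; a numpy sketch of the equivalence (illustration, not the GGML kernels):

```python
import numpy as np

def snake_5_ops(x, a, inv_b):
    # Unfused path: mul, sin, sqr, mul, add -- five kernels, 5x memory traffic
    t = a * x          # mul
    t = np.sin(t)      # sin
    t = t * t          # sqr
    t = t * inv_b      # mul
    return x + t       # add

def snake_fused(x, a, inv_b):
    # Fused GGML_OP_SNAKE: y = x + sin^2(a*x) * inv_b in one pass over x
    return x + np.sin(a * x) ** 2 * inv_b

x = np.linspace(-3.0, 3.0, 7)
assert np.allclose(snake_5_ops(x, 0.5, 2.0), snake_fused(x, 0.5, 2.0))
```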
+### `GGML_OP_COL2IM_1D` (scatter-add for GEMM-based conv_transpose_1d)
+
+Gather-based reconstruction of a 1D signal from GEMM columns [K*OC, T_in] to
+[T_out, OC], with fused padding crop via the p0 parameter.
+Upstream `ggml_conv_transpose_1d` uses a naive kernel (one scalar FMA loop per output
+element, no shared memory, no tensor cores). The VAE spends 40% of its FLOP budget on
+transposed convolutions. We decompose each as `mul_mat + col2im_1d`, routing the heavy
+GEMM through cuBLAS/BLAS/MPS tensor cores. The col2im_1d gather has a 2-iteration inner
+loop and is pure bandwidth. BF16 cast nodes around col2im_1d halve the scatter bandwidth.
+
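The `mul_mat + col2im_1d` decomposition can be checked against a naive transposed convolution in numpy. A minimal sketch with simplified tensor layout (no padding crop) relative to GGML's:

```python
import numpy as np

def conv_transpose_1d_naive(x, w, stride):
    # x: [IC, T_in], w: [IC, OC, K]; one scalar accumulation per output tap
    IC, T_in = x.shape
    _, OC, K = w.shape
    y = np.zeros((OC, (T_in - 1) * stride + K))
    for t in range(T_in):
        for k in range(K):
            y[:, t * stride + k] += w[:, :, k].T @ x[:, t]
    return y

def conv_transpose_1d_gemm(x, w, stride):
    IC, T_in = x.shape
    _, OC, K = w.shape
    # Heavy GEMM: [K*OC, IC] @ [IC, T_in] -> columns [K*OC, T_in]
    cols = w.transpose(2, 1, 0).reshape(K * OC, IC) @ x
    # col2im_1d: pure-bandwidth scatter-add of the columns into the signal
    y = np.zeros((OC, (T_in - 1) * stride + K))
    for k in range(K):
        y[:, k : k + (T_in - 1) * stride + 1 : stride] += cols[k * OC:(k + 1) * OC]
    return y

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 10))    # IC=4, T_in=10
w = rng.standard_normal((4, 3, 5))  # IC=4, OC=3, K=5
assert np.allclose(conv_transpose_1d_naive(x, w, 2), conv_transpose_1d_gemm(x, w, 2))
```

In the real pipeline the GEMM runs on tensor cores and the scatter is the only custom kernel.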
+### Bugfix: `im2col` gridDim.y overflow (CUDA)
+
+Upstream `im2col_kernel` uses OW directly as grid dimension Y, which exceeds the CUDA
+65535 gridDim limit on long sequences. The VAE calls `ggml_conv_1d` (im2col path) 32
+times per tile at output widths up to 491520. Fixed with a grid-stride loop on OW and
+`MIN(OW, MAX_GRIDDIM_Z)` clamping.
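The clamp-plus-grid-stride pattern guarantees full coverage of OW no matter how far it exceeds the grid limit. A Python simulation of the index arithmetic (not the CUDA kernel itself):

```python
MAX_GRIDDIM = 65535  # CUDA limit on gridDim.y / gridDim.z

def covered_columns(OW, grid_limit=MAX_GRIDDIM):
    """Simulate the fixed launch: clamp the Y grid to the limit and let
    each block grid-stride over the remaining output columns."""
    grid_y = min(OW, grid_limit)
    seen = []
    for block_y in range(grid_y):   # blockIdx.y
        ow = block_y
        while ow < OW:              # grid-stride loop on OW
            seen.append(ow)
            ow += grid_y            # stride = gridDim.y
    return seen

# A small grid limit stands in for 65535: every column is still visited
# exactly once, even when OW is far larger than the grid.
assert sorted(covered_columns(10, grid_limit=4)) == list(range(10))
```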
 
 ## Acknowledgements
 
_codeql_detected_source_root

Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
+.
buildcuda.sh

Lines changed: 8 additions & 0 deletions
@@ -0,0 +1,8 @@
+#!/bin/bash
+
+rm -rf build
+mkdir build
+cd build
+
+cmake .. -DGGML_CUDA=ON -DCMAKE_CUDA_COMPILER=/usr/local/cuda/bin/nvcc
+cmake --build . --config Release -j "$(nproc)"
src/cond-enc.h

Lines changed: 6 additions & 2 deletions
@@ -69,6 +69,7 @@ struct CondGGML {
     ggml_backend_t backend;
     ggml_backend_t cpu_backend;
     ggml_backend_sched_t sched;
+    bool use_flash_attn;
     WeightCtx wctx;
 };
 
@@ -78,6 +79,7 @@ static void cond_ggml_init_backend(CondGGML * m) {
     m->backend = bp.backend;
     m->cpu_backend = bp.cpu_backend;
     m->sched = backend_sched_new(bp, 8192);
+    m->use_flash_attn = true;
 }
 
 // Load from ACEStep DiT GGUF
@@ -191,7 +193,8 @@ static void cond_ggml_forward(CondGGML * m,
     for (int i = 0; i < m->lyric_cfg.n_layers; i++) {
         struct ggml_tensor * layer_mask = (i % 2 == 0) ? lyric_slide_mask : NULL;
         lyric_h = qwen3_build_layer(ctx, m->lyric_cfg, &m->lyric_layers[i],
-                                    lyric_h, lyric_pos, layer_mask, S_lyric);
+                                    lyric_h, lyric_pos, layer_mask, S_lyric,
+                                    m->use_flash_attn);
     }
     lyric_h = qwen3_rms_norm(ctx, lyric_h, m->lyric_norm, m->lyric_cfg.rms_norm_eps);
 
@@ -236,7 +239,8 @@ static void cond_ggml_forward(CondGGML * m,
     for (int i = 0; i < m->timbre_cfg.n_layers; i++) {
         struct ggml_tensor * layer_mask = (i % 2 == 0) ? timbre_slide_mask : NULL;
         timbre_h = qwen3_build_layer(ctx, m->timbre_cfg, &m->timbre_layers[i],
-                                     timbre_h, timbre_pos, layer_mask, S_ref);
+                                     timbre_h, timbre_pos, layer_mask, S_ref,
+                                     m->use_flash_attn);
     }
     timbre_h = qwen3_rms_norm(ctx, timbre_h, m->timbre_norm, m->timbre_cfg.rms_norm_eps);
 
src/fsq-detok.h

Lines changed: 4 additions & 1 deletion
@@ -64,6 +64,7 @@ struct DetokGGML {
    ggml_backend_t backend;
    ggml_backend_t cpu_backend;
    ggml_backend_sched_t sched;
+    bool use_flash_attn;
    WeightCtx wctx;
};
 
@@ -73,6 +74,7 @@ static bool detok_ggml_load(DetokGGML * m, const char * gguf_path,
    m->cfg = detok_config();
    m->backend = backend;
    m->cpu_backend = cpu_backend;
+    m->use_flash_attn = true;
 
    GGUFModel gf;
    if (!gf_load(&gf, gguf_path)) {
@@ -169,7 +171,8 @@ static int detok_ggml_decode(DetokGGML * m, const int * codes, int T_5Hz,
 
    // 2L encoder + norm (non-causal, no mask needed at S=5)
    hidden = qwen3_build_layers(ctx, m->cfg, m->layers, m->norm,
-                               hidden, positions, NULL, P);
+                               hidden, positions, NULL, P,
+                               m->use_flash_attn);
 
    // proj_out: [2048, 5] -> [64, 5]
    struct ggml_tensor * output = ggml_mul_mat(ctx, m->proj_out_w, hidden);
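The `use_flash_attn` flag threaded through these structs selects between a fused flash-attention path and the plain softmax path (cf. the new `--no-fa` CLI option); both compute the same attention output. A numpy sketch of why the streaming, flash-style formulation matches the naive one (illustrative math only, not the GGML kernels):

```python
import numpy as np

def attn_naive(Q, K, V):
    # Materializes the full [S, S] score matrix, then softmax @ V
    S = Q @ K.T / np.sqrt(Q.shape[-1])
    P = np.exp(S - S.max(axis=-1, keepdims=True))
    P /= P.sum(axis=-1, keepdims=True)
    return P @ V

def attn_online(Q, K, V, block=2):
    # Flash-attention-style streaming: running row max + rescaled
    # accumulators, never holding the full score matrix
    d = Q.shape[-1]
    out = np.zeros_like(Q, dtype=float)
    m = np.full(Q.shape[0], -np.inf)   # running row max
    l = np.zeros(Q.shape[0])           # running softmax normalizer
    for j in range(0, K.shape[0], block):
        s = Q @ K[j:j + block].T / np.sqrt(d)   # scores for this KV block
        m_new = np.maximum(m, s.max(axis=-1))
        scale = np.exp(m - m_new)               # rescale old accumulators
        p = np.exp(s - m_new[:, None])
        out = out * scale[:, None] + p @ V[j:j + block]
        l = l * scale + p.sum(axis=-1)
        m = m_new
    return out / l[:, None]

rng = np.random.default_rng(1)
Q, K, V = (rng.standard_normal((5, 8)) for _ in range(3))
assert np.allclose(attn_naive(Q, K, V), attn_online(Q, K, V))
```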
