Merged
35 changes: 18 additions & 17 deletions README.md
@@ -8,15 +8,16 @@ A high-performance FFT implementation in WebAssembly Text format that **significantly

Benchmarked against [pffft-wasm](https://www.npmjs.com/package/@echogarden/pffft-wasm) (PFFFT with SIMD):

| Size | wat-fft (f32) | pffft-wasm (f32) | Speedup |
| ------ | -------------------- | ---------------- | -------- |
| N=16 | **16,700,000 ops/s** | 13,900,000 ops/s | **+20%** |
| N=64 | **6,040,000 ops/s** | 4,440,000 ops/s | **+36%** |
| N=128 | **3,040,000 ops/s** | 1,950,000 ops/s | **+56%** |
| N=256 | **1,640,000 ops/s** | 980,000 ops/s | **+67%** |
| N=512 | **736,000 ops/s** | 404,000 ops/s | **+82%** |
| N=1024 | **365,000 ops/s** | 201,000 ops/s | **+81%** |
| N=2048 | **163,000 ops/s** | 84,000 ops/s | **+94%** |
| N=4096 | **81,000 ops/s** | 41,000 ops/s | **+95%** |

```mermaid
---
Expand All @@ -30,18 +31,18 @@ config:
---
xychart-beta
title "Complex FFT Performance (Million ops/s)"
x-axis [N=16, N=64, N=128, N=256, N=512, N=1024, N=2048, N=4096]
y-axis "Million ops/s" 0 --> 18
line [17.57, 3.83, 1.74, 0.96, 0.37, 0.19, 0.080, 0.044]
line [16.68, 6.04, 3.04, 1.64, 0.74, 0.36, 0.163, 0.081]
line [13.88, 4.44, 1.95, 0.98, 0.40, 0.20, 0.084, 0.041]
line [11.50, 2.80, 1.07, 0.56, 0.22, 0.11, 0.047, 0.023]
line [6.05, 1.86, 0.80, 0.44, 0.18, 0.10, 0.041, 0.022]
```

> 🟢 **wat-fft f64** · 🔵 **wat-fft f32** · 🟠 **pffft-wasm** · 🟣 **fft.js** · 🔴 **kissfft-js**

**wat-fft f32 beats pffft-wasm by 20-95%** across all sizes. It's also **2-3x faster** than fft.js (the fastest pure JS). **Choose f64** (`fft_combined.wasm`) for double precision. **Choose f32** (`fft_stockham_f32_dual.wasm`) for maximum single-precision speed.

### Real FFT

45 changes: 45 additions & 0 deletions docs/optimization/EXPERIMENT_LOG.md
@@ -50,6 +50,7 @@ Detailed record of all optimization experiments.
| 41 | Buffer Copy Unrolling | INCONCLUSIVE | Within variance, V8 handles simple loops well |
| 42 | Performance Analysis | COMPLETE | Optimization complete; beats all competitors |
| 43 | SIMD Split-Format IFFT | SUCCESS | 4x throughput for IFFT conjugation phases |
| 44 | f32 N=16 Radix-4 Codelet | SUCCESS +18% | Radix-4 codelet closes gap with f64 |

---

@@ -1264,3 +1265,47 @@ Further gains would require:
**Lesson**: Consistency across modules makes the codebase easier to maintain. SIMD patterns that work in one module should be applied systematically.

**Files modified**: `modules/fft_split_native_f32.wat`

---

## Experiment 44: f32 N=16 Radix-4 Codelet (2026-01-28)

**Goal**: Improve f32 complex FFT performance at N=16, which underperformed f64.

**Observation**: The f32 complex FFT at N=16 (14.1M ops/s) was 20% slower than f64 (17.6M ops/s). This is counterintuitive since f32 should be faster due to 2x SIMD throughput. The f32 module fell through to `$fft_general` for N=16, while f64 had a specialized `$fft_16` radix-4 codelet.

**Hypothesis**: A radix-4 N=16 codelet for f32 would eliminate loop overhead and match f64 performance.

**Approach**:

- Port the f64 `$fft_16` radix-4 algorithm to f32
- Use single-complex-per-lane (like f64) rather than dual-complex packing
- 2 radix-4 stages instead of the 4 stages radix-2 Stockham requires
- Hardcoded twiddle factors (W_16^k for k=1,2,3,4,6,9)
- Update dispatch in both `fft` and `ifft` paths (via shared `$fft_dispatch`)
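The decomposition described above can be sketched in scalar Python (a hypothetical model of the codelet's dataflow, not the WAT itself; `fft16_radix4` and `bf4` are illustrative names):

```python
import cmath

def fft16_radix4(x):
    """N=16 DIT radix-4 FFT: stage 1 has no twiddles, stage 2 applies W_16^(g*q)."""
    assert len(x) == 16
    W = [cmath.exp(-2j * cmath.pi * k / 16) for k in range(16)]

    def bf4(a, b, c, d):
        # Radix-4 butterfly; (b - d) * -1j is the "t3 *= -j" rotation
        # done with the shuffle + sign-mask trick in the WAT.
        t0, t1 = a + c, a - c
        t2, t3 = b + d, (b - d) * -1j
        return t0 + t2, t1 + t3, t0 - t2, t1 - t3

    # Stage 1: four radix-4 butterflies over stride-4 groups (g, g+4, g+8, g+12)
    y = [0j] * 16
    for g in range(4):
        y[g], y[g + 4], y[g + 8], y[g + 12] = bf4(x[g], x[g + 4], x[g + 8], x[g + 12])

    # Stage 2: twiddle element g of group q by W_16^(g*q), then butterfly;
    # outputs land at bins q, q+4, q+8, q+12 (hence the hardcoded k=1,2,3,4,6,9)
    out = [0j] * 16
    for q in range(4):
        a, b, c, d = (y[4 * q + g] * W[(g * q) % 16] for g in range(4))
        out[q], out[q + 4], out[q + 8], out[q + 12] = bf4(a, b, c, d)
    return out
```

Group 0 of each stage sees only `W_16^0 = 1`, which is why the codelet skips twiddles there.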

**Result**: SUCCESS - +18% improvement at N=16

| Metric | Before | After | Change |
| ------------- | ---------- | ---------- | ------ |
| f32 N=16 | 14.1M op/s | 16.7M op/s | +18% |
| Gap vs f64 | 20% slower | 5% slower | +15pp |
| vs pffft-wasm | +0% | +20% | +20pp |
| vs fft.js | +22% | +45% | +23pp |

**Analysis**:

The radix-4 algorithm reduces N=16 from 4 stages (radix-2) to 2 stages. Key benefits:

1. **Fewer iterations**: 2 stages × 4 groups vs 4 stages × varying groups
2. **No loop overhead**: Fully unrolled butterflies
3. **Inline twiddles**: `v128.const` eliminates memory loads
4. **Better register usage**: 20 locals vs dynamic allocation in general loop

The f32 codelet uses the same single-complex-per-lane approach as f64. Dual-complex packing was attempted but the complex shuffling required for radix-4 negated the benefits (similar to Experiment 34's N=16 DIT finding).
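The inline complex multiply the codelet uses per lane (`result = a*wr + swap(a)*wi*sign`, where `swap` exchanges the re/im lanes) reduces to the standard formula; a scalar Python check with illustrative names:

```python
def simd_cmul(ar, ai, wr, wi):
    """Per-lane arithmetic of the codelet's shuffle-based complex multiply."""
    re = ar * wr + ai * wi * -1.0  # lane 0: a*wr + swap(a)*wi * (-1)
    im = ai * wr + ar * wi * +1.0  # lane 1: a*wr + swap(a)*wi * (+1)
    return re, im                  # = (ar*wr - ai*wi, ai*wr + ar*wi)
```

The `(-1, +1)` sign vector is the `v128.const f32x4 -1.0 1.0 -1.0 1.0` mask in the WAT.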

**Key implementation detail**: The IFFT was initially broken because it called `$fft_general` directly instead of going through dispatch. Fixed by creating a shared `$fft_dispatch` function used by both `fft` export and `ifft`.
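The conj → FFT → conj+scale identity that makes the shared dispatch matter is `ifft(x) = conj(fft(conj(x))) / N`; sketched in Python with a naive DFT standing in for the WASM `fft` export (illustrative names):

```python
import cmath

def dft(x):
    """Naive forward DFT, stand-in for the module's fft export."""
    n = len(x)
    return [sum(x[m] * cmath.exp(-2j * cmath.pi * m * k / n) for m in range(n))
            for k in range(n)]

def idft_via_conj(x):
    """Inverse via conjugation: conj -> forward transform -> conj + 1/N scale."""
    n = len(x)
    y = dft([v.conjugate() for v in x])
    return [v.conjugate() / n for v in y]
```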

**Lesson**: When f32 underperforms f64 at a specific size, check if f64 has a specialized codelet that f32 lacks. Direct algorithm ports often work well.

**Files modified**: `modules/fft_stockham_f32_dual.wat`
244 changes: 242 additions & 2 deletions modules/fft_stockham_f32_dual.wat
@@ -300,6 +300,238 @@
)


;; ============================================================================
;; N=16 Radix-4 Codelet (single-complex per lane, matching f64 structure)
;; ============================================================================
;; Uses radix-4 algorithm: 2 stages instead of 4 stages for radix-2.
;; Each v128 holds one f32 complex in low 64 bits [re, im, 0, 0].
;; This matches the successful f64 approach but with f32 precision.
;;
;; Stage 1: Four radix-4 butterflies on groups (0,4,8,12), (1,5,9,13), etc.
;; Stage 2: Four radix-4 butterflies with twiddle factors

(func $fft_16
(local $x0 v128) (local $x1 v128) (local $x2 v128) (local $x3 v128)
(local $x4 v128) (local $x5 v128) (local $x6 v128) (local $x7 v128)
(local $x8 v128) (local $x9 v128) (local $x10 v128) (local $x11 v128)
(local $x12 v128) (local $x13 v128) (local $x14 v128) (local $x15 v128)
(local $t0 v128) (local $t1 v128) (local $t2 v128) (local $t3 v128)
(local $tmp v128)

;; Load all 16 complex numbers (8 bytes each = 64 bits)
(local.set $x0 (v128.load64_zero (i32.const 0)))
(local.set $x1 (v128.load64_zero (i32.const 8)))
(local.set $x2 (v128.load64_zero (i32.const 16)))
(local.set $x3 (v128.load64_zero (i32.const 24)))
(local.set $x4 (v128.load64_zero (i32.const 32)))
(local.set $x5 (v128.load64_zero (i32.const 40)))
(local.set $x6 (v128.load64_zero (i32.const 48)))
(local.set $x7 (v128.load64_zero (i32.const 56)))
(local.set $x8 (v128.load64_zero (i32.const 64)))
(local.set $x9 (v128.load64_zero (i32.const 72)))
(local.set $x10 (v128.load64_zero (i32.const 80)))
(local.set $x11 (v128.load64_zero (i32.const 88)))
(local.set $x12 (v128.load64_zero (i32.const 96)))
(local.set $x13 (v128.load64_zero (i32.const 104)))
(local.set $x14 (v128.load64_zero (i32.const 112)))
(local.set $x15 (v128.load64_zero (i32.const 120)))

;; ============================================================================
;; Stage 1: Four radix-4 butterflies (no twiddles)
;; ============================================================================

;; Group 0: x0, x4, x8, x12
(local.set $t0 (f32x4.add (local.get $x0) (local.get $x8)))
(local.set $t1 (f32x4.sub (local.get $x0) (local.get $x8)))
(local.set $t2 (f32x4.add (local.get $x4) (local.get $x12)))
(local.set $t3 (f32x4.sub (local.get $x4) (local.get $x12)))
;; t3 *= -j: [re,im] -> [im,-re]
(local.set $t3 (f32x4.mul
(i8x16.shuffle 4 5 6 7 0 1 2 3 4 5 6 7 0 1 2 3 (local.get $t3) (local.get $t3))
(v128.const f32x4 1.0 -1.0 1.0 -1.0)))
(local.set $x0 (f32x4.add (local.get $t0) (local.get $t2)))
(local.set $x4 (f32x4.add (local.get $t1) (local.get $t3)))
(local.set $x8 (f32x4.sub (local.get $t0) (local.get $t2)))
(local.set $x12 (f32x4.sub (local.get $t1) (local.get $t3)))

;; Group 1: x1, x5, x9, x13
(local.set $t0 (f32x4.add (local.get $x1) (local.get $x9)))
(local.set $t1 (f32x4.sub (local.get $x1) (local.get $x9)))
(local.set $t2 (f32x4.add (local.get $x5) (local.get $x13)))
(local.set $t3 (f32x4.sub (local.get $x5) (local.get $x13)))
(local.set $t3 (f32x4.mul
(i8x16.shuffle 4 5 6 7 0 1 2 3 4 5 6 7 0 1 2 3 (local.get $t3) (local.get $t3))
(v128.const f32x4 1.0 -1.0 1.0 -1.0)))
(local.set $x1 (f32x4.add (local.get $t0) (local.get $t2)))
(local.set $x5 (f32x4.add (local.get $t1) (local.get $t3)))
(local.set $x9 (f32x4.sub (local.get $t0) (local.get $t2)))
(local.set $x13 (f32x4.sub (local.get $t1) (local.get $t3)))

;; Group 2: x2, x6, x10, x14
(local.set $t0 (f32x4.add (local.get $x2) (local.get $x10)))
(local.set $t1 (f32x4.sub (local.get $x2) (local.get $x10)))
(local.set $t2 (f32x4.add (local.get $x6) (local.get $x14)))
(local.set $t3 (f32x4.sub (local.get $x6) (local.get $x14)))
(local.set $t3 (f32x4.mul
(i8x16.shuffle 4 5 6 7 0 1 2 3 4 5 6 7 0 1 2 3 (local.get $t3) (local.get $t3))
(v128.const f32x4 1.0 -1.0 1.0 -1.0)))
(local.set $x2 (f32x4.add (local.get $t0) (local.get $t2)))
(local.set $x6 (f32x4.add (local.get $t1) (local.get $t3)))
(local.set $x10 (f32x4.sub (local.get $t0) (local.get $t2)))
(local.set $x14 (f32x4.sub (local.get $t1) (local.get $t3)))

;; Group 3: x3, x7, x11, x15
(local.set $t0 (f32x4.add (local.get $x3) (local.get $x11)))
(local.set $t1 (f32x4.sub (local.get $x3) (local.get $x11)))
(local.set $t2 (f32x4.add (local.get $x7) (local.get $x15)))
(local.set $t3 (f32x4.sub (local.get $x7) (local.get $x15)))
(local.set $t3 (f32x4.mul
(i8x16.shuffle 4 5 6 7 0 1 2 3 4 5 6 7 0 1 2 3 (local.get $t3) (local.get $t3))
(v128.const f32x4 1.0 -1.0 1.0 -1.0)))
(local.set $x3 (f32x4.add (local.get $t0) (local.get $t2)))
(local.set $x7 (f32x4.add (local.get $t1) (local.get $t3)))
(local.set $x11 (f32x4.sub (local.get $t0) (local.get $t2)))
(local.set $x15 (f32x4.sub (local.get $t1) (local.get $t3)))

;; ============================================================================
;; Stage 2: Four radix-4 butterflies with twiddles
;; Twiddles: W_16^k = e^(-2πik/16) = (cos(πk/8), -sin(πk/8))
;; W_16^1 = (0.9238795, -0.3826834)
;; W_16^2 = (0.7071068, -0.7071068)
;; W_16^3 = (0.3826834, -0.9238795)
;; W_16^4 = (0, -1) = -j
;; W_16^6 = (-0.7071068, -0.7071068)
;; W_16^9 = (-0.9238795, 0.3826834)
;; ============================================================================

;; Group 0: x0, x1, x2, x3 -> outputs 0,4,8,12 (no twiddles)
(local.set $t0 (f32x4.add (local.get $x0) (local.get $x2)))
(local.set $t1 (f32x4.sub (local.get $x0) (local.get $x2)))
(local.set $t2 (f32x4.add (local.get $x1) (local.get $x3)))
(local.set $t3 (f32x4.sub (local.get $x1) (local.get $x3)))
(local.set $t3 (f32x4.mul
(i8x16.shuffle 4 5 6 7 0 1 2 3 4 5 6 7 0 1 2 3 (local.get $t3) (local.get $t3))
(v128.const f32x4 1.0 -1.0 1.0 -1.0)))
(v128.store64_lane 0 (i32.const 0) (f32x4.add (local.get $t0) (local.get $t2)))
(v128.store64_lane 0 (i32.const 32) (f32x4.add (local.get $t1) (local.get $t3)))
(v128.store64_lane 0 (i32.const 64) (f32x4.sub (local.get $t0) (local.get $t2)))
(v128.store64_lane 0 (i32.const 96) (f32x4.sub (local.get $t1) (local.get $t3)))

;; Group 1: x4, x5, x6, x7 -> outputs 1,5,9,13
;; Apply W_16^1 to x5, W_16^2 to x6, W_16^3 to x7
;; cmul: (a+bi)(c+di) = (ac-bd, ad+bc)
;; Using inline complex multiply: result = a*wr + swap(a)*wi*sign
(local.set $tmp (local.get $x5))
(local.set $x5 (f32x4.add
(f32x4.mul (local.get $tmp) (v128.const f32x4 0.9238795 0.9238795 0.9238795 0.9238795))
(f32x4.mul
(f32x4.mul (i8x16.shuffle 4 5 6 7 0 1 2 3 4 5 6 7 0 1 2 3 (local.get $tmp) (local.get $tmp))
(v128.const f32x4 -0.3826834 -0.3826834 -0.3826834 -0.3826834))
(v128.const f32x4 -1.0 1.0 -1.0 1.0))))

(local.set $tmp (local.get $x6))
(local.set $x6 (f32x4.add
(f32x4.mul (local.get $tmp) (v128.const f32x4 0.7071068 0.7071068 0.7071068 0.7071068))
(f32x4.mul
(f32x4.mul (i8x16.shuffle 4 5 6 7 0 1 2 3 4 5 6 7 0 1 2 3 (local.get $tmp) (local.get $tmp))
(v128.const f32x4 -0.7071068 -0.7071068 -0.7071068 -0.7071068))
(v128.const f32x4 -1.0 1.0 -1.0 1.0))))

(local.set $tmp (local.get $x7))
(local.set $x7 (f32x4.add
(f32x4.mul (local.get $tmp) (v128.const f32x4 0.3826834 0.3826834 0.3826834 0.3826834))
(f32x4.mul
(f32x4.mul (i8x16.shuffle 4 5 6 7 0 1 2 3 4 5 6 7 0 1 2 3 (local.get $tmp) (local.get $tmp))
(v128.const f32x4 -0.9238795 -0.9238795 -0.9238795 -0.9238795))
(v128.const f32x4 -1.0 1.0 -1.0 1.0))))

(local.set $t0 (f32x4.add (local.get $x4) (local.get $x6)))
(local.set $t1 (f32x4.sub (local.get $x4) (local.get $x6)))
(local.set $t2 (f32x4.add (local.get $x5) (local.get $x7)))
(local.set $t3 (f32x4.sub (local.get $x5) (local.get $x7)))
(local.set $t3 (f32x4.mul
(i8x16.shuffle 4 5 6 7 0 1 2 3 4 5 6 7 0 1 2 3 (local.get $t3) (local.get $t3))
(v128.const f32x4 1.0 -1.0 1.0 -1.0)))
(v128.store64_lane 0 (i32.const 8) (f32x4.add (local.get $t0) (local.get $t2)))
(v128.store64_lane 0 (i32.const 40) (f32x4.add (local.get $t1) (local.get $t3)))
(v128.store64_lane 0 (i32.const 72) (f32x4.sub (local.get $t0) (local.get $t2)))
(v128.store64_lane 0 (i32.const 104) (f32x4.sub (local.get $t1) (local.get $t3)))

;; Group 2: x8, x9, x10, x11 -> outputs 2,6,10,14
;; Apply W_16^2 to x9, W_16^4=-j to x10, W_16^6 to x11
(local.set $tmp (local.get $x9))
(local.set $x9 (f32x4.add
(f32x4.mul (local.get $tmp) (v128.const f32x4 0.7071068 0.7071068 0.7071068 0.7071068))
(f32x4.mul
(f32x4.mul (i8x16.shuffle 4 5 6 7 0 1 2 3 4 5 6 7 0 1 2 3 (local.get $tmp) (local.get $tmp))
(v128.const f32x4 -0.7071068 -0.7071068 -0.7071068 -0.7071068))
(v128.const f32x4 -1.0 1.0 -1.0 1.0))))

;; x10 *= -j: [re,im] -> [im,-re]
(local.set $x10 (f32x4.mul
(i8x16.shuffle 4 5 6 7 0 1 2 3 4 5 6 7 0 1 2 3 (local.get $x10) (local.get $x10))
(v128.const f32x4 1.0 -1.0 1.0 -1.0)))

(local.set $tmp (local.get $x11))
(local.set $x11 (f32x4.add
(f32x4.mul (local.get $tmp) (v128.const f32x4 -0.7071068 -0.7071068 -0.7071068 -0.7071068))
(f32x4.mul
(f32x4.mul (i8x16.shuffle 4 5 6 7 0 1 2 3 4 5 6 7 0 1 2 3 (local.get $tmp) (local.get $tmp))
(v128.const f32x4 -0.7071068 -0.7071068 -0.7071068 -0.7071068))
(v128.const f32x4 -1.0 1.0 -1.0 1.0))))

(local.set $t0 (f32x4.add (local.get $x8) (local.get $x10)))
(local.set $t1 (f32x4.sub (local.get $x8) (local.get $x10)))
(local.set $t2 (f32x4.add (local.get $x9) (local.get $x11)))
(local.set $t3 (f32x4.sub (local.get $x9) (local.get $x11)))
(local.set $t3 (f32x4.mul
(i8x16.shuffle 4 5 6 7 0 1 2 3 4 5 6 7 0 1 2 3 (local.get $t3) (local.get $t3))
(v128.const f32x4 1.0 -1.0 1.0 -1.0)))
(v128.store64_lane 0 (i32.const 16) (f32x4.add (local.get $t0) (local.get $t2)))
(v128.store64_lane 0 (i32.const 48) (f32x4.add (local.get $t1) (local.get $t3)))
(v128.store64_lane 0 (i32.const 80) (f32x4.sub (local.get $t0) (local.get $t2)))
(v128.store64_lane 0 (i32.const 112) (f32x4.sub (local.get $t1) (local.get $t3)))

;; Group 3: x12, x13, x14, x15 -> outputs 3,7,11,15
;; Apply W_16^3 to x13, W_16^6 to x14, W_16^9 to x15
(local.set $tmp (local.get $x13))
(local.set $x13 (f32x4.add
(f32x4.mul (local.get $tmp) (v128.const f32x4 0.3826834 0.3826834 0.3826834 0.3826834))
(f32x4.mul
(f32x4.mul (i8x16.shuffle 4 5 6 7 0 1 2 3 4 5 6 7 0 1 2 3 (local.get $tmp) (local.get $tmp))
(v128.const f32x4 -0.9238795 -0.9238795 -0.9238795 -0.9238795))
(v128.const f32x4 -1.0 1.0 -1.0 1.0))))

(local.set $tmp (local.get $x14))
(local.set $x14 (f32x4.add
(f32x4.mul (local.get $tmp) (v128.const f32x4 -0.7071068 -0.7071068 -0.7071068 -0.7071068))
(f32x4.mul
(f32x4.mul (i8x16.shuffle 4 5 6 7 0 1 2 3 4 5 6 7 0 1 2 3 (local.get $tmp) (local.get $tmp))
(v128.const f32x4 -0.7071068 -0.7071068 -0.7071068 -0.7071068))
(v128.const f32x4 -1.0 1.0 -1.0 1.0))))

(local.set $tmp (local.get $x15))
(local.set $x15 (f32x4.add
(f32x4.mul (local.get $tmp) (v128.const f32x4 -0.9238795 -0.9238795 -0.9238795 -0.9238795))
(f32x4.mul
(f32x4.mul (i8x16.shuffle 4 5 6 7 0 1 2 3 4 5 6 7 0 1 2 3 (local.get $tmp) (local.get $tmp))
(v128.const f32x4 0.3826834 0.3826834 0.3826834 0.3826834))
(v128.const f32x4 -1.0 1.0 -1.0 1.0))))

(local.set $t0 (f32x4.add (local.get $x12) (local.get $x14)))
(local.set $t1 (f32x4.sub (local.get $x12) (local.get $x14)))
(local.set $t2 (f32x4.add (local.get $x13) (local.get $x15)))
(local.set $t3 (f32x4.sub (local.get $x13) (local.get $x15)))
(local.set $t3 (f32x4.mul
(i8x16.shuffle 4 5 6 7 0 1 2 3 4 5 6 7 0 1 2 3 (local.get $t3) (local.get $t3))
(v128.const f32x4 1.0 -1.0 1.0 -1.0)))
(v128.store64_lane 0 (i32.const 24) (f32x4.add (local.get $t0) (local.get $t2)))
(v128.store64_lane 0 (i32.const 56) (f32x4.add (local.get $t1) (local.get $t3)))
(v128.store64_lane 0 (i32.const 88) (f32x4.sub (local.get $t0) (local.get $t2)))
(v128.store64_lane 0 (i32.const 120) (f32x4.sub (local.get $t1) (local.get $t3)))
)


;; ============================================================================
;; General Dual-Complex Stockham FFT
;; ============================================================================
@@ -751,14 +983,21 @@
;; Main FFT Entry Point
;; ============================================================================

;; Internal dispatch function used by both fft and ifft
(func $fft_dispatch (param $n i32)
(if (i32.eq (local.get $n) (i32.const 4))
(then (call $fft_4) (return)))
(if (i32.eq (local.get $n) (i32.const 8))
(then (call $fft_8_dit) (return)))
(if (i32.eq (local.get $n) (i32.const 16))
(then (call $fft_16) (return)))
(call $fft_general (local.get $n))
)

(func (export "fft") (param $n i32)
(call $fft_dispatch (local.get $n))
)


;; ============================================================================
;; Main IFFT Entry Point
@@ -780,8 +1019,9 @@
(then (call $ifft_4) (return)))

;; General case: conj -> FFT -> conj+scale
;; Note: We use $fft_dispatch to ensure consistency with forward FFT
(call $conjugate_buffer (local.get $n))
(call $fft_dispatch (local.get $n))
(call $scale_and_conjugate (local.get $n))
)
)