Merged
35 changes: 18 additions & 17 deletions README.md
@@ -8,15 +8,16 @@ A high-performance FFT implementation in WebAssembly Text format that **significantly

Benchmarked against [pffft-wasm](https://www.npmjs.com/package/@echogarden/pffft-wasm) (PFFFT with SIMD):

| Size | wat-fft (f32) | pffft-wasm (f32) | Speedup |
| ------ | -------------------- | ---------------- | -------- |
| N=16 | **16,700,000 ops/s** | 13,900,000 ops/s | **+20%** |
| N=64 | **6,040,000 ops/s** | 4,440,000 ops/s | **+36%** |
| N=128 | **3,040,000 ops/s** | 1,950,000 ops/s | **+56%** |
| N=256 | **1,640,000 ops/s** | 980,000 ops/s | **+67%** |
| N=512 | **736,000 ops/s** | 404,000 ops/s | **+82%** |
| N=1024 | **365,000 ops/s** | 201,000 ops/s | **+81%** |
| N=2048 | **163,000 ops/s** | 84,000 ops/s | **+94%** |
| N=4096 | **81,000 ops/s** | 41,000 ops/s | **+95%** |

```mermaid
---
Expand All @@ -30,18 +31,18 @@ config:
---
xychart-beta
title "Complex FFT Performance (Million ops/s)"
x-axis [N=16, N=64, N=128, N=256, N=512, N=1024, N=2048, N=4096]
y-axis "Million ops/s" 0 --> 18
line [17.57, 3.83, 1.74, 0.96, 0.37, 0.19, 0.080, 0.044]
line [16.68, 6.04, 3.04, 1.64, 0.74, 0.36, 0.163, 0.081]
line [13.88, 4.44, 1.95, 0.98, 0.40, 0.20, 0.084, 0.041]
line [11.50, 2.80, 1.07, 0.56, 0.22, 0.11, 0.047, 0.023]
line [6.05, 1.86, 0.80, 0.44, 0.18, 0.10, 0.041, 0.022]
```

> 🟢 **wat-fft f64** · 🔵 **wat-fft f32** · 🟠 **pffft-wasm** · 🟣 **fft.js** · 🔴 **kissfft-js**

**wat-fft f32 beats pffft-wasm by 20-95%** across all sizes. It's also **2-3x faster** than fft.js (the fastest pure JS). **Choose f64** (`fft_combined.wasm`) for double precision. **Choose f32** (`fft_stockham_f32_dual.wasm`) for maximum single-precision speed.

### Real FFT

45 changes: 45 additions & 0 deletions docs/optimization/EXPERIMENT_LOG.md
@@ -50,6 +50,7 @@ Detailed record of all optimization experiments.
| 41 | Buffer Copy Unrolling | INCONCLUSIVE | Within variance, V8 handles simple loops well |
| 42 | Performance Analysis | COMPLETE | Optimization complete; beats all competitors |
| 43 | SIMD Split-Format IFFT | SUCCESS | 4x throughput for IFFT conjugation phases |
| 44 | f32 N=16 Radix-4 Codelet | SUCCESS +18% | Radix-4 codelet closes gap with f64 |

---

@@ -1264,3 +1265,47 @@ Further gains would require:
**Lesson**: Consistency across modules makes the codebase easier to maintain. SIMD patterns that work in one module should be applied systematically.

**Files modified**: `modules/fft_split_native_f32.wat`

---

## Experiment 44: f32 N=16 Radix-4 Codelet (2026-01-28)

**Goal**: Improve f32 complex FFT performance at N=16, which underperformed f64.

**Observation**: The f32 complex FFT at N=16 (14.1M ops/s) was 20% slower than f64 (17.6M ops/s). This is counterintuitive since f32 should be faster due to 2x SIMD throughput. The f32 module fell through to `$fft_general` for N=16, while f64 had a specialized `$fft_16` radix-4 codelet.

**Hypothesis**: A radix-4 N=16 codelet for f32 would eliminate loop overhead and match f64 performance.

**Approach**:

- Port the f64 `$fft_16` radix-4 algorithm to f32
- Use single-complex-per-lane (like f64) rather than dual-complex packing
- 2 radix-4 stages instead of the 4 stages radix-2 Stockham requires
- Hardcoded twiddle factors (W_16^k for k=1,2,3,4,6,9)
- Update dispatch in both `fft` and `ifft` paths (via shared `$fft_dispatch`)
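The decomposition described above can be sketched in scalar Python (a hypothetical model of the codelet's dataflow, not the WAT itself; `fft16_radix4` and `bf4` are illustrative names):

```python
import cmath

def fft16_radix4(x):
    """N=16 DIT radix-4 FFT: stage 1 has no twiddles, stage 2 applies W_16^(g*q)."""
    assert len(x) == 16
    W = [cmath.exp(-2j * cmath.pi * k / 16) for k in range(16)]

    def bf4(a, b, c, d):
        # Radix-4 butterfly; (b - d) * -1j is the "t3 *= -j" rotation
        # done with the shuffle + sign-mask trick in the WAT.
        t0, t1 = a + c, a - c
        t2, t3 = b + d, (b - d) * -1j
        return t0 + t2, t1 + t3, t0 - t2, t1 - t3

    # Stage 1: four radix-4 butterflies over stride-4 groups (g, g+4, g+8, g+12)
    y = [0j] * 16
    for g in range(4):
        y[g], y[g + 4], y[g + 8], y[g + 12] = bf4(x[g], x[g + 4], x[g + 8], x[g + 12])

    # Stage 2: twiddle element g of group q by W_16^(g*q), then butterfly;
    # outputs land at bins q, q+4, q+8, q+12 (hence the hardcoded k=1,2,3,4,6,9)
    out = [0j] * 16
    for q in range(4):
        a, b, c, d = (y[4 * q + g] * W[(g * q) % 16] for g in range(4))
        out[q], out[q + 4], out[q + 8], out[q + 12] = bf4(a, b, c, d)
    return out
```

Group 0 of each stage sees only `W_16^0 = 1`, which is why the codelet skips twiddles there.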

**Result**: SUCCESS - +18% improvement at N=16

| Metric | Before | After | Change |
| ------------- | ---------- | ---------- | ------ |
| f32 N=16 | 14.1M op/s | 16.7M op/s | +18% |
| Gap vs f64 | 20% slower | 5% slower | +15pp |
| vs pffft-wasm | +0% | +20% | +20pp |
| vs fft.js | +22% | +45% | +23pp |

**Analysis**:

The radix-4 algorithm reduces N=16 from 4 stages (radix-2) to 2 stages. Key benefits:

1. **Fewer iterations**: 2 stages × 4 groups vs 4 stages × varying groups
2. **No loop overhead**: Fully unrolled butterflies
3. **Inline twiddles**: `v128.const` eliminates memory loads
4. **Better register usage**: 20 locals vs dynamic allocation in general loop

The f32 codelet uses the same single-complex-per-lane approach as f64. Dual-complex packing was attempted but the complex shuffling required for radix-4 negated the benefits (similar to Experiment 34's N=16 DIT finding).
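The inline complex multiply the codelet uses per lane (`result = a*wr + swap(a)*wi*sign`, where `swap` exchanges the re/im lanes) reduces to the standard formula; a scalar Python check with illustrative names:

```python
def simd_cmul(ar, ai, wr, wi):
    """Per-lane arithmetic of the codelet's shuffle-based complex multiply."""
    re = ar * wr + ai * wi * -1.0  # lane 0: a*wr + swap(a)*wi * (-1)
    im = ai * wr + ar * wi * +1.0  # lane 1: a*wr + swap(a)*wi * (+1)
    return re, im                  # = (ar*wr - ai*wi, ai*wr + ar*wi)
```

The `(-1, +1)` sign vector is the `v128.const f32x4 -1.0 1.0 -1.0 1.0` mask in the WAT.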

**Key implementation detail**: The IFFT was initially broken because it called `$fft_general` directly instead of going through dispatch. Fixed by creating a shared `$fft_dispatch` function used by both `fft` export and `ifft`.
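The conj → FFT → conj+scale identity that makes the shared dispatch matter is `ifft(x) = conj(fft(conj(x))) / N`; sketched in Python with a naive DFT standing in for the WASM `fft` export (illustrative names):

```python
import cmath

def dft(x):
    """Naive forward DFT, stand-in for the module's fft export."""
    n = len(x)
    return [sum(x[m] * cmath.exp(-2j * cmath.pi * m * k / n) for m in range(n))
            for k in range(n)]

def idft_via_conj(x):
    """Inverse via conjugation: conj -> forward transform -> conj + 1/N scale."""
    n = len(x)
    y = dft([v.conjugate() for v in x])
    return [v.conjugate() / n for v in y]
```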

**Lesson**: When f32 underperforms f64 at a specific size, check if f64 has a specialized codelet that f32 lacks. Direct algorithm ports often work well.

**Files modified**: `modules/fft_stockham_f32_dual.wat`
244 changes: 242 additions & 2 deletions modules/fft_stockham_f32_dual.wat
@@ -300,6 +300,238 @@
)


;; ============================================================================
;; N=16 Radix-4 Codelet (single-complex per lane, matching f64 structure)
;; ============================================================================
;; Uses radix-4 algorithm: 2 stages instead of 4 stages for radix-2.
;; Each v128 holds one f32 complex in low 64 bits [re, im, 0, 0].
;; This matches the successful f64 approach but with f32 precision.
;;
;; Stage 1: Four radix-4 butterflies on groups (0,4,8,12), (1,5,9,13), etc.
;; Stage 2: Four radix-4 butterflies with twiddle factors

(func $fft_16
(local $x0 v128) (local $x1 v128) (local $x2 v128) (local $x3 v128)
(local $x4 v128) (local $x5 v128) (local $x6 v128) (local $x7 v128)
(local $x8 v128) (local $x9 v128) (local $x10 v128) (local $x11 v128)
(local $x12 v128) (local $x13 v128) (local $x14 v128) (local $x15 v128)
(local $t0 v128) (local $t1 v128) (local $t2 v128) (local $t3 v128)
(local $tmp v128)

;; Load all 16 complex numbers (8 bytes each = 64 bits)
(local.set $x0 (v128.load64_zero (i32.const 0)))
(local.set $x1 (v128.load64_zero (i32.const 8)))
(local.set $x2 (v128.load64_zero (i32.const 16)))
(local.set $x3 (v128.load64_zero (i32.const 24)))
(local.set $x4 (v128.load64_zero (i32.const 32)))
(local.set $x5 (v128.load64_zero (i32.const 40)))
(local.set $x6 (v128.load64_zero (i32.const 48)))
(local.set $x7 (v128.load64_zero (i32.const 56)))
(local.set $x8 (v128.load64_zero (i32.const 64)))
(local.set $x9 (v128.load64_zero (i32.const 72)))
(local.set $x10 (v128.load64_zero (i32.const 80)))
(local.set $x11 (v128.load64_zero (i32.const 88)))
(local.set $x12 (v128.load64_zero (i32.const 96)))
(local.set $x13 (v128.load64_zero (i32.const 104)))
(local.set $x14 (v128.load64_zero (i32.const 112)))
(local.set $x15 (v128.load64_zero (i32.const 120)))

;; ============================================================================
;; Stage 1: Four radix-4 butterflies (no twiddles)
;; ============================================================================

;; Group 0: x0, x4, x8, x12
(local.set $t0 (f32x4.add (local.get $x0) (local.get $x8)))
(local.set $t1 (f32x4.sub (local.get $x0) (local.get $x8)))
(local.set $t2 (f32x4.add (local.get $x4) (local.get $x12)))
(local.set $t3 (f32x4.sub (local.get $x4) (local.get $x12)))
;; t3 *= -j: [re,im] -> [im,-re]
(local.set $t3 (f32x4.mul
(i8x16.shuffle 4 5 6 7 0 1 2 3 4 5 6 7 0 1 2 3 (local.get $t3) (local.get $t3))
(v128.const f32x4 1.0 -1.0 1.0 -1.0)))
(local.set $x0 (f32x4.add (local.get $t0) (local.get $t2)))
(local.set $x4 (f32x4.add (local.get $t1) (local.get $t3)))
(local.set $x8 (f32x4.sub (local.get $t0) (local.get $t2)))
(local.set $x12 (f32x4.sub (local.get $t1) (local.get $t3)))

;; Group 1: x1, x5, x9, x13
(local.set $t0 (f32x4.add (local.get $x1) (local.get $x9)))
(local.set $t1 (f32x4.sub (local.get $x1) (local.get $x9)))
(local.set $t2 (f32x4.add (local.get $x5) (local.get $x13)))
(local.set $t3 (f32x4.sub (local.get $x5) (local.get $x13)))
(local.set $t3 (f32x4.mul
(i8x16.shuffle 4 5 6 7 0 1 2 3 4 5 6 7 0 1 2 3 (local.get $t3) (local.get $t3))
(v128.const f32x4 1.0 -1.0 1.0 -1.0)))
(local.set $x1 (f32x4.add (local.get $t0) (local.get $t2)))
(local.set $x5 (f32x4.add (local.get $t1) (local.get $t3)))
(local.set $x9 (f32x4.sub (local.get $t0) (local.get $t2)))
(local.set $x13 (f32x4.sub (local.get $t1) (local.get $t3)))

;; Group 2: x2, x6, x10, x14
(local.set $t0 (f32x4.add (local.get $x2) (local.get $x10)))
(local.set $t1 (f32x4.sub (local.get $x2) (local.get $x10)))
(local.set $t2 (f32x4.add (local.get $x6) (local.get $x14)))
(local.set $t3 (f32x4.sub (local.get $x6) (local.get $x14)))
(local.set $t3 (f32x4.mul
(i8x16.shuffle 4 5 6 7 0 1 2 3 4 5 6 7 0 1 2 3 (local.get $t3) (local.get $t3))
(v128.const f32x4 1.0 -1.0 1.0 -1.0)))
(local.set $x2 (f32x4.add (local.get $t0) (local.get $t2)))
(local.set $x6 (f32x4.add (local.get $t1) (local.get $t3)))
(local.set $x10 (f32x4.sub (local.get $t0) (local.get $t2)))
(local.set $x14 (f32x4.sub (local.get $t1) (local.get $t3)))

;; Group 3: x3, x7, x11, x15
(local.set $t0 (f32x4.add (local.get $x3) (local.get $x11)))
(local.set $t1 (f32x4.sub (local.get $x3) (local.get $x11)))
(local.set $t2 (f32x4.add (local.get $x7) (local.get $x15)))
(local.set $t3 (f32x4.sub (local.get $x7) (local.get $x15)))
(local.set $t3 (f32x4.mul
(i8x16.shuffle 4 5 6 7 0 1 2 3 4 5 6 7 0 1 2 3 (local.get $t3) (local.get $t3))
(v128.const f32x4 1.0 -1.0 1.0 -1.0)))
(local.set $x3 (f32x4.add (local.get $t0) (local.get $t2)))
(local.set $x7 (f32x4.add (local.get $t1) (local.get $t3)))
(local.set $x11 (f32x4.sub (local.get $t0) (local.get $t2)))
(local.set $x15 (f32x4.sub (local.get $t1) (local.get $t3)))

;; ============================================================================
;; Stage 2: Four radix-4 butterflies with twiddles
;; Twiddles: W_16^k = e^(-2πik/16) = (cos(πk/8), -sin(πk/8))
;; W_16^1 = (0.9238795, -0.3826834)
;; W_16^2 = (0.7071068, -0.7071068)
;; W_16^3 = (0.3826834, -0.9238795)
;; W_16^4 = (0, -1) = -j
;; W_16^6 = (-0.7071068, -0.7071068)
;; W_16^9 = (-0.9238795, 0.3826834)
;; ============================================================================

;; Group 0: x0, x1, x2, x3 -> outputs 0,4,8,12 (no twiddles)
(local.set $t0 (f32x4.add (local.get $x0) (local.get $x2)))
(local.set $t1 (f32x4.sub (local.get $x0) (local.get $x2)))
(local.set $t2 (f32x4.add (local.get $x1) (local.get $x3)))
(local.set $t3 (f32x4.sub (local.get $x1) (local.get $x3)))
(local.set $t3 (f32x4.mul
(i8x16.shuffle 4 5 6 7 0 1 2 3 4 5 6 7 0 1 2 3 (local.get $t3) (local.get $t3))
(v128.const f32x4 1.0 -1.0 1.0 -1.0)))
(v128.store64_lane 0 (i32.const 0) (f32x4.add (local.get $t0) (local.get $t2)))
(v128.store64_lane 0 (i32.const 32) (f32x4.add (local.get $t1) (local.get $t3)))
(v128.store64_lane 0 (i32.const 64) (f32x4.sub (local.get $t0) (local.get $t2)))
(v128.store64_lane 0 (i32.const 96) (f32x4.sub (local.get $t1) (local.get $t3)))

;; Group 1: x4, x5, x6, x7 -> outputs 1,5,9,13
;; Apply W_16^1 to x5, W_16^2 to x6, W_16^3 to x7
;; cmul: (a+bi)(c+di) = (ac-bd, ad+bc)
;; Using inline complex multiply: result = a*wr + swap(a)*wi*sign
(local.set $tmp (local.get $x5))
(local.set $x5 (f32x4.add
(f32x4.mul (local.get $tmp) (v128.const f32x4 0.9238795 0.9238795 0.9238795 0.9238795))
(f32x4.mul
(f32x4.mul (i8x16.shuffle 4 5 6 7 0 1 2 3 4 5 6 7 0 1 2 3 (local.get $tmp) (local.get $tmp))
(v128.const f32x4 -0.3826834 -0.3826834 -0.3826834 -0.3826834))
(v128.const f32x4 -1.0 1.0 -1.0 1.0))))

(local.set $tmp (local.get $x6))
(local.set $x6 (f32x4.add
(f32x4.mul (local.get $tmp) (v128.const f32x4 0.7071068 0.7071068 0.7071068 0.7071068))
(f32x4.mul
(f32x4.mul (i8x16.shuffle 4 5 6 7 0 1 2 3 4 5 6 7 0 1 2 3 (local.get $tmp) (local.get $tmp))
(v128.const f32x4 -0.7071068 -0.7071068 -0.7071068 -0.7071068))
(v128.const f32x4 -1.0 1.0 -1.0 1.0))))

(local.set $tmp (local.get $x7))
(local.set $x7 (f32x4.add
(f32x4.mul (local.get $tmp) (v128.const f32x4 0.3826834 0.3826834 0.3826834 0.3826834))
(f32x4.mul
(f32x4.mul (i8x16.shuffle 4 5 6 7 0 1 2 3 4 5 6 7 0 1 2 3 (local.get $tmp) (local.get $tmp))
(v128.const f32x4 -0.9238795 -0.9238795 -0.9238795 -0.9238795))
(v128.const f32x4 -1.0 1.0 -1.0 1.0))))

(local.set $t0 (f32x4.add (local.get $x4) (local.get $x6)))
(local.set $t1 (f32x4.sub (local.get $x4) (local.get $x6)))
(local.set $t2 (f32x4.add (local.get $x5) (local.get $x7)))
(local.set $t3 (f32x4.sub (local.get $x5) (local.get $x7)))
(local.set $t3 (f32x4.mul
(i8x16.shuffle 4 5 6 7 0 1 2 3 4 5 6 7 0 1 2 3 (local.get $t3) (local.get $t3))
(v128.const f32x4 1.0 -1.0 1.0 -1.0)))
(v128.store64_lane 0 (i32.const 8) (f32x4.add (local.get $t0) (local.get $t2)))
(v128.store64_lane 0 (i32.const 40) (f32x4.add (local.get $t1) (local.get $t3)))
(v128.store64_lane 0 (i32.const 72) (f32x4.sub (local.get $t0) (local.get $t2)))
(v128.store64_lane 0 (i32.const 104) (f32x4.sub (local.get $t1) (local.get $t3)))

;; Group 2: x8, x9, x10, x11 -> outputs 2,6,10,14
;; Apply W_16^2 to x9, W_16^4=-j to x10, W_16^6 to x11
(local.set $tmp (local.get $x9))
(local.set $x9 (f32x4.add
(f32x4.mul (local.get $tmp) (v128.const f32x4 0.7071068 0.7071068 0.7071068 0.7071068))
(f32x4.mul
(f32x4.mul (i8x16.shuffle 4 5 6 7 0 1 2 3 4 5 6 7 0 1 2 3 (local.get $tmp) (local.get $tmp))
(v128.const f32x4 -0.7071068 -0.7071068 -0.7071068 -0.7071068))
(v128.const f32x4 -1.0 1.0 -1.0 1.0))))

;; x10 *= -j: [re,im] -> [im,-re]
(local.set $x10 (f32x4.mul
(i8x16.shuffle 4 5 6 7 0 1 2 3 4 5 6 7 0 1 2 3 (local.get $x10) (local.get $x10))
(v128.const f32x4 1.0 -1.0 1.0 -1.0)))

(local.set $tmp (local.get $x11))
(local.set $x11 (f32x4.add
(f32x4.mul (local.get $tmp) (v128.const f32x4 -0.7071068 -0.7071068 -0.7071068 -0.7071068))
(f32x4.mul
(f32x4.mul (i8x16.shuffle 4 5 6 7 0 1 2 3 4 5 6 7 0 1 2 3 (local.get $tmp) (local.get $tmp))
(v128.const f32x4 -0.7071068 -0.7071068 -0.7071068 -0.7071068))
(v128.const f32x4 -1.0 1.0 -1.0 1.0))))

(local.set $t0 (f32x4.add (local.get $x8) (local.get $x10)))
(local.set $t1 (f32x4.sub (local.get $x8) (local.get $x10)))
(local.set $t2 (f32x4.add (local.get $x9) (local.get $x11)))
(local.set $t3 (f32x4.sub (local.get $x9) (local.get $x11)))
(local.set $t3 (f32x4.mul
(i8x16.shuffle 4 5 6 7 0 1 2 3 4 5 6 7 0 1 2 3 (local.get $t3) (local.get $t3))
(v128.const f32x4 1.0 -1.0 1.0 -1.0)))
(v128.store64_lane 0 (i32.const 16) (f32x4.add (local.get $t0) (local.get $t2)))
(v128.store64_lane 0 (i32.const 48) (f32x4.add (local.get $t1) (local.get $t3)))
(v128.store64_lane 0 (i32.const 80) (f32x4.sub (local.get $t0) (local.get $t2)))
(v128.store64_lane 0 (i32.const 112) (f32x4.sub (local.get $t1) (local.get $t3)))

;; Group 3: x12, x13, x14, x15 -> outputs 3,7,11,15
;; Apply W_16^3 to x13, W_16^6 to x14, W_16^9 to x15
(local.set $tmp (local.get $x13))
(local.set $x13 (f32x4.add
(f32x4.mul (local.get $tmp) (v128.const f32x4 0.3826834 0.3826834 0.3826834 0.3826834))
(f32x4.mul
(f32x4.mul (i8x16.shuffle 4 5 6 7 0 1 2 3 4 5 6 7 0 1 2 3 (local.get $tmp) (local.get $tmp))
(v128.const f32x4 -0.9238795 -0.9238795 -0.9238795 -0.9238795))
(v128.const f32x4 -1.0 1.0 -1.0 1.0))))

(local.set $tmp (local.get $x14))
(local.set $x14 (f32x4.add
(f32x4.mul (local.get $tmp) (v128.const f32x4 -0.7071068 -0.7071068 -0.7071068 -0.7071068))
(f32x4.mul
(f32x4.mul (i8x16.shuffle 4 5 6 7 0 1 2 3 4 5 6 7 0 1 2 3 (local.get $tmp) (local.get $tmp))
(v128.const f32x4 -0.7071068 -0.7071068 -0.7071068 -0.7071068))
(v128.const f32x4 -1.0 1.0 -1.0 1.0))))

(local.set $tmp (local.get $x15))
(local.set $x15 (f32x4.add
(f32x4.mul (local.get $tmp) (v128.const f32x4 -0.9238795 -0.9238795 -0.9238795 -0.9238795))
(f32x4.mul
(f32x4.mul (i8x16.shuffle 4 5 6 7 0 1 2 3 4 5 6 7 0 1 2 3 (local.get $tmp) (local.get $tmp))
(v128.const f32x4 0.3826834 0.3826834 0.3826834 0.3826834))
(v128.const f32x4 -1.0 1.0 -1.0 1.0))))

(local.set $t0 (f32x4.add (local.get $x12) (local.get $x14)))
(local.set $t1 (f32x4.sub (local.get $x12) (local.get $x14)))
(local.set $t2 (f32x4.add (local.get $x13) (local.get $x15)))
(local.set $t3 (f32x4.sub (local.get $x13) (local.get $x15)))
(local.set $t3 (f32x4.mul
(i8x16.shuffle 4 5 6 7 0 1 2 3 4 5 6 7 0 1 2 3 (local.get $t3) (local.get $t3))
(v128.const f32x4 1.0 -1.0 1.0 -1.0)))
(v128.store64_lane 0 (i32.const 24) (f32x4.add (local.get $t0) (local.get $t2)))
(v128.store64_lane 0 (i32.const 56) (f32x4.add (local.get $t1) (local.get $t3)))
(v128.store64_lane 0 (i32.const 88) (f32x4.sub (local.get $t0) (local.get $t2)))
(v128.store64_lane 0 (i32.const 120) (f32x4.sub (local.get $t1) (local.get $t3)))
)


;; ============================================================================
;; General Dual-Complex Stockham FFT
;; ============================================================================
@@ -751,14 +983,21 @@
;; Main FFT Entry Point
;; ============================================================================

;; Internal dispatch function used by both fft and ifft
(func $fft_dispatch (param $n i32)
(if (i32.eq (local.get $n) (i32.const 4))
(then (call $fft_4) (return)))
(if (i32.eq (local.get $n) (i32.const 8))
(then (call $fft_8_dit) (return)))
(if (i32.eq (local.get $n) (i32.const 16))
(then (call $fft_16) (return)))
(call $fft_general (local.get $n))
)

(func (export "fft") (param $n i32)
(call $fft_dispatch (local.get $n))
)


;; ============================================================================
;; Main IFFT Entry Point
@@ -780,8 +1019,9 @@
(then (call $ifft_4) (return)))

;; General case: conj -> FFT -> conj+scale
;; Note: We use $fft_dispatch to ensure consistency with forward FFT
(call $conjugate_buffer (local.get $n))
(call $fft_dispatch (local.get $n))
(call $scale_and_conjugate (local.get $n))
)
)