@nfrumkin nfrumkin commented Dec 9, 2025

Motivation

A fast Hadamard transform implementation using blocked GEMM, to be used in conjunction with MXFP4 for better generation quality of quantized LLMs.

Technical Details

Even though FWHT is O(n log n), the batched GEMM is faster for small Hadamard sizes because we can materialize the entire Hadamard matrix in registers and avoid loading it from main memory. We provide a triton.jit function that constructs the Hadamard matrix in-kernel, so there is no need to tl.load a pre-existing Hadamard matrix.
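For reference, a Sylvester Hadamard matrix can be generated from the row/column indices alone via H[i, j] = (-1)^popcount(i & j), which is the kind of closed form an in-kernel construction can use. The sketch below is a NumPy illustration of that idea, not the PR's actual triton.jit code:

```python
import numpy as np

def hadamard_from_indices(n: int) -> np.ndarray:
    """Unnormalized Sylvester Hadamard matrix built from indices only,
    via H[i, j] = (-1)**popcount(i & j). Illustrative sketch; the
    actual in-kernel construction in this PR may differ."""
    assert n > 0 and (n & (n - 1)) == 0, "n must be a power of two"
    idx = np.arange(n)
    anded = idx[:, None] & idx[None, :]
    # Parity of the popcount, accumulated bit by bit.
    parity = np.zeros_like(anded)
    for b in range(n.bit_length()):
        parity ^= (anded >> b) & 1
    return np.where(parity == 0, 1.0, -1.0)
```

With the matrix in hand, the transform is a plain GEMM: `y = x @ H` (H is symmetric, so no transpose is needed).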

Test Plan

We compare with the reference torch implementation and a triton-based FWHT implementation.
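A minimal NumPy sketch of this kind of cross-check, comparing the GEMM formulation against an O(n log n) FWHT reference (names and shapes here are illustrative, not the PR's test code):

```python
import numpy as np

def fwht(x: np.ndarray) -> np.ndarray:
    # Reference O(n log n) iterative fast Walsh-Hadamard transform,
    # applied along the last axis (unnormalized).
    x = np.array(x, dtype=np.float64)
    n = x.shape[-1]
    h = 1
    while h < n:
        for i in range(0, n, 2 * h):
            for j in range(i, i + h):
                a = x[..., j].copy()
                b = x[..., j + h].copy()
                x[..., j] = a + b
                x[..., j + h] = a - b
        h *= 2
    return x

def sylvester(n: int) -> np.ndarray:
    # Unnormalized Sylvester Hadamard matrix via Kronecker recursion.
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.kron(np.array([[1.0, 1.0], [1.0, -1.0]]), H)
    return H

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 32))  # batch=4, N=32
H = sylvester(32)                 # symmetric, so x @ H == x @ H.T
mse = np.mean((fwht(x) - x @ H) ** 2)
assert mse < 1e-14
```

The same comparison structure extends to each pair of implementations (GEMM vs. reference, GEMM vs. Triton FWHT), reporting the MSE for each.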

Test Result

MSE below 1e-14 against all baselines:

Testing N=32, batch=1
----------------------------------------
matmul vs ref: 9.38e-15
matmul vs triton: 7.83e-15
ref vs triton: 3.85e-15
blocked gemm vs. triton: 1.37e-14
blocked gemm vs. matmul: 7.75e-15
fast blocked gemm vs. blocked gemm: 0.0

Testing N=32, batch=4
----------------------------------------
matmul vs ref: 1.61e-14
matmul vs triton: 1.28e-14
ref vs triton: 1.02e-14
blocked gemm vs. triton: 1.77e-14
blocked gemm vs. matmul: 9.22e-15
fast blocked gemm vs. blocked gemm: 0.0

Submission Checklist

@nfrumkin nfrumkin self-assigned this Dec 9, 2025