Description
It's very easy to come up with a pair of E5M2 inputs that produce absurd numerical results, even when the inputs are upcast to BF16 and F32 is used for the actual product and accumulation. This is a fundamental and broad problem, but here is a real example from the CI on a tiny, randomly generated 8-dimensional input:
| | i = 0 | i = 1 | i = 2 | i = 3 | i = 4 | i = 5 | i = 6 | i = 7 |
|---|---|---|---|---|---|---|---|---|
| aᵢ | 0.00122 | 20480 | −0.00122 | 1.5 | −0.00586 | −3072 | −640 | 0.00146 |
| bᵢ | −40 | 320 | −1280 | −7.63e−5 | 0 | 0.000427 | 10240 | −4.58e−5 |
| aᵢ·bᵢ | −0.04883 | 6553600 | 1.5625 | −0.000114 | 0 | −1.3125 | −6553600 | ≈ 0 |
Why F32 accumulation fails here. The accurate sum of these 8 products is ≈ 0.201. After two `vfmaq_f32` calls, the 4 accumulator lanes hold pairwise sums of products, and lanes 1 and 2 carry values around ±6.5 M. At that magnitude the F32 ULP is 0.5, so the small meaningful terms (−0.049, 1.563, −1.313, −0.0001) are all below one ULP and get absorbed during the pairwise reduction. The large terms then cancel exactly to zero, and the information is gone. Final F32 result: 0.0 instead of 0.201.
Sadly, this issue is not limited to E5M2. The E4M3, E5M2, F16, and BF16 kernels all use similar mixed-precision schemes, and all of them are constrained by F32 precision.
That said, F32 dot-products use F64 numerics and aren't affected. F64 dot-products use "Dot2"-like schemes for stable multiplication and addition and aren't affected either.
There are a few ways to mitigate this:
- F64 accumulators for all of E4M3, E5M2, F16, & BF16. Pretty much no modern hardware properly accelerates dot-products with 64-bit accumulators, except for the `smef64` capability level on Apple M4 chips; even there, the upcast will be manual.
- Separate accumulators for small & large magnitudes and positive & negative products. Such methods are often leveraged in traditional HPC environments. Still, they occupy too many registers for the "accumulator state", making them inapplicable to GEMM-like tiled kernels.
- Exact integer arithmetic for E*x*M*y* dot-products with low *x* and *y*. We already leverage this approach for E2M3 via I8 and for E3M2 via I16, using trivial algebraic transforms.
For now, we'll just accept the catastrophic errors in some cases, assuming they don't affect applications like USearch, where many products with neighboring nodes are computed before picking a new traversal direction. But if new functionality for cheaper widening and multi-precision numerics emerges, it should be considered for future releases.
Can you contribute to the implementation?
- I can contribute
Is your feature request specific to a certain interface?
It applies to everything
Contact Details
No response
Is there an existing issue for this?
- I have searched the existing issues
Code of Conduct
- I agree to follow this project's Code of Conduct