
Feature: Hardening E4M3, E5M2, F16, & BF16 dot-products #308

@ashvardanian

Description

It's very easy to come up with a pair of E5M2 inputs that produces absurd numerical results, even when the inputs are upcast to BF16 and F32 is used for the actual product and accumulation. This is a fundamental and broad problem, but here is a real example from the CI on a tiny randomly generated 8-dimensional input:

|       | i = 0    | i = 1   | i = 2    | i = 3     | i = 4    | i = 5    | i = 6    | i = 7    |
| ----- | -------- | ------- | -------- | --------- | -------- | -------- | -------- | -------- |
| aᵢ    | 0.00122  | 20480   | −0.00122 | 1.5       | −0.00586 | −3072    | −640     | 0.00146  |
| bᵢ    | −40      | 320     | −1280    | −7.63e⁻⁵  | 0        | 0.000427 | 10240    | −4.58e⁻⁵ |
| aᵢ·bᵢ | −0.04883 | 6553600 | 1.5625   | −0.000114 | 0        | −1.3125  | −6553600 | ≈ 0      |

Why F32 accumulation fails here. The accurate sum of these 8 products is ≈ 0.201. After two vfmaq_f32 calls, the four accumulator lanes hold pairwise sums of products: lanes 1 and 2 carry values around ±6.5 M. At that magnitude the F32 ULP is 0.5, so the small meaningful terms (−0.049, 1.563, −1.313, −0.0001) are comparable to or below one ULP and get rounded away during the horizontal reduction. The large terms then cancel exactly to zero, and the information is gone. Final F32 result: 0.0 instead of ≈ 0.201.
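The failure is easy to reproduce without any SIMD hardware by rounding every intermediate result to F32 in software. A minimal stdlib-only Python sketch, mimicking the two-`vfmaq_f32` lane layout and pairwise reduction described above:

```python
import struct

def f32(x: float) -> float:
    """Round a Python float (F64) to the nearest F32, like a 32-bit register."""
    return struct.unpack("f", struct.pack("f", x))[0]

# The 8-dimensional inputs from the table above (decimal approximations)
a = [0.00122, 20480, -0.00122, 1.5, -0.00586, -3072, -640, 0.00146]
b = [-40, 320, -1280, -7.63e-5, 0, 0.000427, 10240, -4.58e-5]

# F64 reference: the eight products sum to roughly 0.201
exact = sum(x * y for x, y in zip(a, b))

# Two 4-lane FMAs from a zero accumulator: lane i ends up holding
# a[i] * b[i] + a[i + 4] * b[i + 4], each FMA rounding once to F32
lanes = [f32(f32(a[i] * b[i]) + a[i + 4] * b[i + 4]) for i in range(4)]

# Pairwise horizontal reduction, each step rounding to F32
total = f32(f32(lanes[0] + lanes[1]) + f32(lanes[2] + lanes[3]))

print(f"{exact:.3f}")  # 0.201
print(total)           # 0.0 - the small terms vanish next to the +-6.5M lanes
```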

Sadly, this issue is not limited to E5M2. The E4M3, E5M2, F16, & BF16 kernels all rely on similar mixed-precision mechanisms, and all are constrained by F32 precision.

That said, F32 dot-products use F64 numerics and aren't affected. F64 dot-products use "Dot2"-like schemes for stable multiplication and addition and aren't affected either.
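For reference, a "Dot2"-style compensated dot product (in the spirit of Ogita, Rump & Oishi) can be sketched in a few lines of Python. This is an illustration of the technique, not the library's actual F64 kernel; it uses Dekker's FMA-free TwoProduct, while a real kernel would use hardware FMA:

```python
def two_sum(a: float, b: float) -> tuple[float, float]:
    """Knuth's error-free addition: a + b == s + e exactly."""
    s = a + b
    t = s - a
    e = (a - (s - t)) + (b - t)
    return s, e

def split(x: float) -> tuple[float, float]:
    """Dekker's splitting of an F64 into two non-overlapping halves."""
    c = (2**27 + 1) * x
    hi = c - (c - x)
    return hi, x - hi

def two_prod(a: float, b: float) -> tuple[float, float]:
    """Dekker's error-free multiplication: a * b == p + e exactly, no FMA needed."""
    p = a * b
    a_hi, a_lo = split(a)
    b_hi, b_lo = split(b)
    e = ((a_hi * b_hi - p) + a_hi * b_lo + a_lo * b_hi) + a_lo * b_lo
    return p, e

def dot2(a: list[float], b: list[float]) -> float:
    """Compensated dot product: as accurate as if computed in twice the precision."""
    s, comp = 0.0, 0.0
    for x, y in zip(a, b):
        p, e1 = two_prod(x, y)
        s, e2 = two_sum(s, p)
        comp += e1 + e2
    return s + comp

# An ill-conditioned case where naive F64 summation loses the 1.0 entirely
print(sum(x * y for x, y in zip([1e16, 1.0, -1e16], [1.0, 1.0, 1.0])))  # 0.0
print(dot2([1e16, 1.0, -1e16], [1.0, 1.0, 1.0]))                        # 1.0
```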

There are a few ways to mitigate this:

  1. F64 accumulators for all of E4M3, E5M2, F16, & BF16. Pretty much no modern hardware properly accelerates dot-products with 64-bit accumulators, except for the smef64 capability level on Apple M4 chips, and even there the upcast is manual.
  2. Separate accumulators for small & large magnitudes and for positive & negative products. Such methods are common in traditional HPC environments, but they occupy too many registers for the "accumulator state", making them inapplicable to GEMM-like tiled kernels.
  3. Exact integer arithmetic for ExMy dot-products with small x and y. We already use this approach for E2M3 via I8 and for E3M2 via I16, relying on trivial algebraic transforms.
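To illustrate option 3: if every finite value of a tiny format equals an integer mantissa times a power of two, products become integer multiplies and the whole dot-product is exact. The 6-bit sign + 2-exponent + 3-mantissa layout and bias below are assumptions made for this sketch, not the library's actual encoding, and Python integers stand in for the I8/I16 accumulators mentioned above:

```python
# Hypothetical E2M3-style layout: 1 sign bit, 2 exponent bits, 3 mantissa
# bits, bias 1 (all assumed for illustration)
def decode_e2m3(code: int) -> tuple[int, int]:
    """Return (mantissa, exponent) such that the value is mantissa * 2**exponent."""
    sign = -1 if code & 0x20 else 1
    e = (code >> 3) & 0x3
    m = code & 0x7
    if e == 0:                       # subnormal: value = (m / 8) * 2**0
        return sign * m, -3
    return sign * (8 + m), e - 4     # normal: value = (1 + m / 8) * 2**(e - 1)

def dot_exact(a_codes: list[int], b_codes: list[int]) -> float:
    """Exact dot product: every product is an integer in units of 2**-6."""
    acc = 0  # the smallest possible product exponent is -6, so rebase to it
    for x, y in zip(a_codes, b_codes):
        mx, ex = decode_e2m3(x)
        my, ey = decode_e2m3(y)
        acc += (mx * my) << (ex + ey + 6)
    return acc * 2.0**-6

# 0b001000 encodes 1.0 and 0b001100 encodes 1.5 under the assumed layout
print(dot_exact([0b001000, 0b001100], [0b001000, 0b001000]))  # 2.5
```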

For now, we'll just accept the catastrophic errors in some cases, assuming they don't affect applications like USearch, where many products with neighboring nodes will be computed before picking a new traversal direction. But if new functionality for cheaper widening and multi-precision numerics emerges, it should be considered for the next releases.

Can you contribute to the implementation?

  • I can contribute

Is your feature request specific to a certain interface?

It applies to everything

Contact Details

No response

Is there an existing issue for this?

  • I have searched the existing issues

Code of Conduct

  • I agree to follow this project's Code of Conduct
