Description
It's very easy to come up with a pair of E5M2 inputs that produce absurd numerical results, even when the inputs are upcast to BF16 and F32 is used for the actual product and accumulation. This is a fundamental and broad problem, but here is a real example from the CI on a tiny, randomly generated 8-dimensional input:
| | i = 0 | i = 1 | i = 2 | i = 3 | i = 4 | i = 5 | i = 6 | i = 7 |
|---|---|---|---|---|---|---|---|---|
| aᵢ | 0.00122 | 20480 | −0.00122 | 1.5 | −0.00586 | −3072 | −640 | 0.00146 |
| bᵢ | −40 | 320 | −1280 | −7.63e−5 | 0 | 0.000427 | 10240 | −4.58e−5 |
| aᵢ·bᵢ | −0.04883 | 6553600 | 1.5625 | −0.000114 | 0 | −1.3125 | −6553600 | ≈ 0 |
Why F32 accumulation fails here. The accurate sum of these 8 products is ≈ 0.201. After two `vfmaq_f32` calls, the 4 accumulator lanes hold pairwise sums of products, and lanes 1 and 2 carry values around ±6.5 M. At that magnitude the F32 ULP is 0.5, so the small meaningful terms (−0.049, 1.563, −1.313, −0.0001) are all below one ULP and get absorbed during the pairwise reduction. The large terms then cancel exactly to zero, and the information is gone. Final F32 result: 0.0 instead of 0.201.
Sadly, this issue is not limited to E5M2. The E4M3, E5M2, F16, and BF16 kernels all use similar mixed-precision schemes, and all of them are constrained by F32 precision.
That said, F32 dot-products use F64 numerics and aren't affected. F64 dot-products use "Dot2"-like schemes for stable multiplication and addition and aren't affected either.
There are a few ways to mitigate this:
- F64 accumulators for all of E4M3, E5M2, F16, & BF16. Pretty much no modern hardware properly accelerates dot-products with 64-bit accumulators, except for the `smef64` capability level on Apple M4 chips; even there, the upcast will be manual.
- Separate accumulators for small & large magnitudes and positive & negative products. Such methods are often leveraged in traditional HPC environments. Still, they occupy too many registers for the "accumulator state", making them inapplicable to GEMM-like tiled kernels.
- Exact integer arithmetic for E*x*M*y* dot-products with low *x* and *y*. We already leverage this approach for E2M3 via I8 and for E3M2 via I16, using trivial algebraic transforms.
For now, we'll just accept the catastrophic errors in some cases, assuming they don't affect applications like USearch, where many products with neighboring nodes are computed before picking a new traversal direction. But if new functionality for cheaper widening and multi-precision numerics emerges, it should be considered for future releases.
Can you contribute to the implementation?
- I can contribute
Is your feature request specific to a certain interface?
It applies to everything
Contact Details
No response
Is there an existing issue for this?
- I have searched the existing issues
Code of Conduct
- I agree to follow this project's Code of Conduct