
Conversation

@swfsql swfsql commented Nov 27, 2025

  • Closes #1550 (Add f16 support)
  • This is a minor implementation, given that it is easy to add new float types to ndarray (see the usage sketch after this list).
  • On a best-effort basis, this adds tests/benches wherever both f32 and f64 were already tested/benchmarked.
  • Adds a mention of the feature to the README.
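
For context, a minimal usage sketch of what the added element types allow. The array values are made up, and the operations shown rely on the half crate's trait impls together with ndarray's generic arithmetic rather than on any API specific to this PR:

```rust
use half::f16;
use ndarray::{array, Array1};

fn main() {
    // Build f16 arrays from f32 literals; half provides the conversions.
    let a: Array1<f16> = array![1.0f32, 2.0, 3.0].mapv(f16::from_f32);
    let b: Array1<f16> = array![0.5f32, 0.25, 0.125].mapv(f16::from_f32);

    // Elementwise arithmetic, analogous to the existing f32/f64 usage.
    let sum = &a + &b;
    println!("{}", sum);
}
```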

@swfsql swfsql marked this pull request as ready for review November 28, 2025 06:23
Comment on lines +826 to +829
(m127, 127, 127, 127) // ~128x slower than f32
(mix16x4, 32, 4, 32)
(mix32x2, 32, 2, 32)
// (mix10000, 128, 10000, 128) // too slow
Collaborator

You mention that f16 is slower in the issue and several times here. Is it faster on some operations? If not, why do you (and others) want to use it? Only to save space?

Author

@swfsql swfsql Dec 1, 2025

Yes, on my architecture (x86_64) it is indeed quite slow. The half docs mention that aarch64 has support for all half::f16 operations, so it is possibly viable in performance on that architecture, but I haven't tested it. The crate also has specialized methods for storing/loading half::{f16, bf16} data to/from other types (u16, f32, f64), which could also improve performance, but I didn't leverage those operations when adding the types to ndarray (I don't really know if/how they could be leveraged).
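
For reference, a minimal sketch of the kind of conversion methods referred to above, using only the half crate's public scalar f16/bf16 API (half also offers slice-level variants of these conversions; nothing here is specific to this PR):

```rust
use half::{bf16, f16};

fn main() {
    // Scalar round-trips between the half types and f32/f64.
    let a = f16::from_f32(1.5_f32);
    let b = bf16::from_f64(2.25_f64);
    println!("{} {}", a.to_f32(), b.to_f64());

    // Raw u16 bit patterns, useful when storing/loading half data.
    let bits: u16 = a.to_bits();
    assert_eq!(a, f16::from_bits(bits));
}
```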

Although it is a bit disappointing (at least to me) that it is slow, I find it useful for debugging fp16 machine-learning models, given that some architectures behave poorly on fp16 and it is easier to debug them when everything runs on the CPU.
For wasm targets, some builds may not enable CPU features, so the performance gap between f32 and half::f16 should be narrower; in that case I believe the memory savings could be meaningful, but that is a niche target.
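
As a concrete example of that debugging use case, here is a minimal sketch (the array and its values are hypothetical) that casts f32 data down to f16 and back to inspect the precision loss on the CPU:

```rust
use half::f16;
use ndarray::Array1;

fn main() {
    // Hypothetical f32 "weights"; in practice these would come from a model.
    let weights: Array1<f32> = Array1::linspace(0.0, 1.0, 5);

    // Cast down to f16 and back up to f32 to see where precision is lost.
    let half_weights: Array1<f16> = weights.mapv(f16::from_f32);
    let round_trip: Array1<f32> = half_weights.mapv(f16::to_f32);

    // Largest elementwise deviation introduced by the f16 round-trip.
    let max_abs_err = (&weights - &round_trip)
        .mapv(f32::abs)
        .fold(0.0f32, |acc, &x| acc.max(x));
    println!("max abs error = {}", max_abs_err);
}
```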

I don't know whether proper SIMD support is possible for fp16; it appears that some work towards this is active (for the primitive f16 type).

With that said, I still ask for half::{f16, bf16} support in ndarray. It makes development and debugging smoother even if fp16 doesn't have proper SIMD support, given that the bulk of the training happens on GPUs. In the future, SIMD improvements could come either from the underlying half crate or from the addition of (or replacement by) a primitive f16 type.
I also understand that ndarray holds performance in high regard, and might therefore opt to delay or decline inclusion of the fp16 types.

