Support pure float16 add/sub/mul/div operations in the CUDA (and CPU) backend #1121
Conversation
If at all possible, please also add a CPU implementation and a test case for `test-backend-ops` to assert that the implementations are consistent.
@JohannesGaessler Thanks, no problem. It looks like the current CPU implementation for fp16 addition isn't working correctly when adding an N-sized tensor to a single-element tensor. Also, I had to change the initial assert to match what the f32 CPU implementation does. For example, this test produces incorrect values on the CPU, so I'm not sure if fp16 addition is actually being used with the CPU backend right now. I'll dig further into why the CPU implementation for fp16 isn't giving correct results.
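For reference, a minimal standalone repro along these lines might look like the sketch below. This is my own illustration, not code from the PR: the sizes and values are arbitrary, the `ggml-cpu.h` include is assumed for newer source layouts (older trees declare `ggml_graph_compute_with_ctx` in `ggml.h`), and on an unpatched tree the shape assert mentioned above may fire instead of producing wrong values.

```cpp
// sketch: fp16 add of an N-sized tensor with a broadcast single-element tensor on the CPU backend
#include <cstdio>
#include "ggml.h"
#include "ggml-cpu.h" // assumed location of ggml_graph_compute_with_ctx in newer trees

int main() {
    struct ggml_init_params params = {
        /*.mem_size   =*/ 16*1024*1024,
        /*.mem_buffer =*/ NULL,
        /*.no_alloc   =*/ false,
    };
    struct ggml_context * ctx = ggml_init(params);

    const int n = 8;
    struct ggml_tensor * a = ggml_new_tensor_1d(ctx, GGML_TYPE_F16, n);
    struct ggml_tensor * b = ggml_new_tensor_1d(ctx, GGML_TYPE_F16, 1); // broadcast over a

    ggml_fp16_t * a_data = (ggml_fp16_t *) a->data;
    ggml_fp16_t * b_data = (ggml_fp16_t *) b->data;
    for (int i = 0; i < n; ++i) {
        a_data[i] = ggml_fp32_to_fp16((float) i);
    }
    b_data[0] = ggml_fp32_to_fp16(1.0f);

    struct ggml_tensor * c = ggml_add(ctx, a, b); // result stays fp16

    struct ggml_cgraph * gf = ggml_new_graph(ctx);
    ggml_build_forward_expand(gf, c);
    ggml_graph_compute_with_ctx(ctx, gf, /*n_threads=*/1);

    const ggml_fp16_t * c_data = (const ggml_fp16_t *) c->data;
    for (int i = 0; i < n; ++i) {
        printf("%g ", ggml_fp16_to_fp32(c_data[i])); // mathematically expected: 1 2 3 ...
    }
    printf("\n");

    ggml_free(ctx);
    return 0;
}
```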
@cmdr2 Thanks. We are interested in this PR.
@JohannesGaessler I've added fp16 op support for add/sub/mul/div on the CPU backend as well, and added test cases in `test-backend-ops`.

Can you please take a look when you get a chance? Thanks!
As a side note, since I saw a plan to refactor the CPU backend: right now, as you know, there are several copies of each operator function, and many of them still don't support broadcasting or non-contiguous cases, simply because that wasn't copied over from the fp32 implementation. We could easily de-duplicate a lot of them using function templates (including the various type permutations). I'm happy to help if it's simply a lack of manpower, in a different PR of course. Thanks!
Yes, we should switch to a C++ implementation and reduce code duplication. The most important thing is to not go overboard with fancy C++ features and to keep them to a minimum: basically just templates and trivial containers.
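As an illustration of the kind of de-duplication being discussed, here is a standalone sketch (not ggml code): the `as_float` trait and the operator structs are hypothetical placeholders, and in ggml the fp16 path would go through `ggml_fp16_to_fp32` / `ggml_fp32_to_fp16` instead of a plain cast.

```cpp
// standalone sketch: one templated loop covering every (type, operator) combination
#include <cstdint>
#include <cstdio>
#include <vector>

// hypothetical conversion trait; ggml would plug its f16/bf16 helpers in here
template <typename T> struct as_float {
    static float load (T v)     { return static_cast<float>(v); }
    static T     store(float v) { return static_cast<T>(v); }
};

struct op_add { static float apply(float a, float b) { return a + b; } };
struct op_mul { static float apply(float a, float b) { return a * b; } };

// one implementation for add/sub/mul/div and all element types,
// including broadcasting src1 when it has a single element
template <typename T, typename Op>
void binary_op(const T * src0, const T * src1, T * dst, int64_t n0, int64_t n1) {
    for (int64_t i = 0; i < n0; ++i) {
        const float a = as_float<T>::load(src0[i]);
        const float b = as_float<T>::load(src1[i % n1]); // broadcast if n1 == 1
        dst[i] = as_float<T>::store(Op::apply(a, b));
    }
}

int main() {
    std::vector<float> a = {1, 2, 3, 4}, b = {0.5f}, out(4);
    binary_op<float, op_add>(a.data(), b.data(), out.data(), 4, 1);
    for (float v : out) printf("%g ", v); // 1.5 2.5 3.5 4.5
    printf("\n");
}
```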
I'm looking into the CI failure for Mac: https://github.com/ggml-org/ci/tree/results/ggml/73/8a3aea59f1c0c7751d65307d1228c1dbbf6a84/ggml-100-mac-m4
You'll most likely need to adjust the `supports_op` checks.
@JohannesGaessler Thanks, this is what I'm planning:

```cpp
case GGML_OP_ADD:
case GGML_OP_SUB:
case GGML_OP_MUL:
case GGML_OP_DIV:
    return op->src[0]->type == GGML_TYPE_F32;
```
@JohannesGaessler Sent a PR for this: #1123. Thanks for your help!
This change increases the operator coverage for the float16 data type in the CUDA backend (for add/sub/mul/div).
At present, ggml requires the second tensor to be converted to float32, which doubles the VRAM requirement for that tensor and makes ggml a bit unintuitive, especially since it's entirely possible to add or multiply two float16 tensors in ggml.

float16 is fairly common for inference, and since ggml is a tensor library for ML, it's not uncommon to want to add or multiply two large tensors. Requiring the `src1` tensor to be float32 doubles the VRAM requirement for that tensor.

I tested that this works (on my 3060 12 GB), has half the peak VRAM usage compared to float32+float32 ops, and is decently faster than float32+float32 addition. `test-backend-ops` also continues to pass.

Example program for float16 addition, example program for float32 addition (a sketch along the lines of the float16 example is included below).
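The following is a minimal sketch of such a float16 addition program, assuming the current ggml-backend API (`ggml_backend_cuda_init`, `ggml_backend_alloc_ctx_tensors`, `ggml_backend_graph_compute`); the tensor sizes and values are illustrative and this is not the exact example program linked above.

```cpp
// sketch: add two float16 tensors directly on the CUDA backend
#include <cstdio>
#include "ggml.h"
#include "ggml-alloc.h"
#include "ggml-backend.h"
#include "ggml-cuda.h"

int main() {
    const int n = 8;

    // the context only holds tensor/graph metadata; data lives in backend buffers
    struct ggml_init_params params = {
        /*.mem_size   =*/ ggml_tensor_overhead()*8 + ggml_graph_overhead(),
        /*.mem_buffer =*/ NULL,
        /*.no_alloc   =*/ true,
    };
    struct ggml_context * ctx = ggml_init(params);

    struct ggml_tensor * a = ggml_new_tensor_1d(ctx, GGML_TYPE_F16, n);
    struct ggml_tensor * b = ggml_new_tensor_1d(ctx, GGML_TYPE_F16, n);
    struct ggml_tensor * c = ggml_add(ctx, a, b); // result stays in float16

    struct ggml_cgraph * gf = ggml_new_graph(ctx);
    ggml_build_forward_expand(gf, c);

    ggml_backend_t backend = ggml_backend_cuda_init(0); // device 0
    ggml_backend_buffer_t buf = ggml_backend_alloc_ctx_tensors(ctx, backend);

    // upload fp16 input data
    ggml_fp16_t a_data[8], b_data[8];
    for (int i = 0; i < n; ++i) {
        a_data[i] = ggml_fp32_to_fp16((float) i);
        b_data[i] = ggml_fp32_to_fp16(0.5f);
    }
    ggml_backend_tensor_set(a, a_data, 0, sizeof(a_data));
    ggml_backend_tensor_set(b, b_data, 0, sizeof(b_data));

    ggml_backend_graph_compute(backend, gf);

    ggml_fp16_t c_data[8];
    ggml_backend_tensor_get(c, c_data, 0, sizeof(c_data));
    for (int i = 0; i < n; ++i) {
        printf("%g ", ggml_fp16_to_fp32(c_data[i])); // expect 0.5 1.5 2.5 ...
    }
    printf("\n");

    ggml_backend_buffer_free(buf);
    ggml_backend_free(backend);
    ggml_free(ctx);
    return 0;
}
```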
Related issue: #455
Thanks!
PS: I'm new to ggml, apologies if I missed something obvious! Happy to fix.