Support pure float16 add/sub/mul/div operations in the CUDA (and CPU) backend #1121
Conversation
If at all possible, please also add a CPU implementation and a test case for `test-backend-ops` to assert that the implementations are consistent.
@JohannesGaessler Thanks, no problem. It looks like the current CPU implementation for fp16 addition isn't working correctly when adding an N-sized tensor to a single-element tensor. Also, I had to change the initial assert to match what the f32 CPU implementation does. For example, this test produces incorrect values on the CPU, so I'm not sure if fp16 addition is actually being used with the CPU backend right now. I'll dig further into why the CPU implementation for fp16 isn't giving correct results.
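For reference, a minimal standalone repro along these lines might look like the sketch below. This is my own illustration, not code from the PR: the sizes and values are arbitrary, the `ggml-cpu.h` include is assumed for newer source layouts (older trees declare `ggml_graph_compute_with_ctx` in `ggml.h`), and on an unpatched tree the shape assert mentioned above may fire instead of producing wrong values.

```cpp
// sketch: fp16 add of an N-sized tensor with a broadcast single-element tensor on the CPU backend
#include <cstdio>
#include "ggml.h"
#include "ggml-cpu.h" // assumed location of ggml_graph_compute_with_ctx in newer trees

int main() {
    struct ggml_init_params params = {
        /*.mem_size   =*/ 16*1024*1024,
        /*.mem_buffer =*/ NULL,
        /*.no_alloc   =*/ false,
    };
    struct ggml_context * ctx = ggml_init(params);

    const int n = 8;
    struct ggml_tensor * a = ggml_new_tensor_1d(ctx, GGML_TYPE_F16, n);
    struct ggml_tensor * b = ggml_new_tensor_1d(ctx, GGML_TYPE_F16, 1); // broadcast over a

    ggml_fp16_t * a_data = (ggml_fp16_t *) a->data;
    ggml_fp16_t * b_data = (ggml_fp16_t *) b->data;
    for (int i = 0; i < n; ++i) {
        a_data[i] = ggml_fp32_to_fp16((float) i);
    }
    b_data[0] = ggml_fp32_to_fp16(1.0f);

    struct ggml_tensor * c = ggml_add(ctx, a, b); // result stays fp16

    struct ggml_cgraph * gf = ggml_new_graph(ctx);
    ggml_build_forward_expand(gf, c);
    ggml_graph_compute_with_ctx(ctx, gf, /*n_threads=*/1);

    const ggml_fp16_t * c_data = (const ggml_fp16_t *) c->data;
    for (int i = 0; i < n; ++i) {
        printf("%g ", ggml_fp16_to_fp32(c_data[i])); // mathematically expected: 1 2 3 ...
    }
    printf("\n");

    ggml_free(ctx);
    return 0;
}
```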
@cmdr2 Thanks. We are interested in this PR.
@JohannesGaessler I've added fp16 op support for add/sub/mul/div on the CPU backend as well, and added test cases in `test-backend-ops`.

Can you please take a look when you get a chance? Thanks!
As a side note, since I saw a plan to refactor the CPU backend: right now, as you know, there are several copies of each operator function, and many of them still don't support broadcasting or non-contiguous cases, simply because that wasn't copied over from the fp32 implementation. We could easily de-duplicate a lot of them using function templates (including the various type permutations). I'm happy to help if it's simply a lack of manpower, in a different PR of course. Thanks!
Yes, we should switch to a C++ implementation and reduce code duplication. The most important thing is to not go overboard with fancy C++ features and to keep them to a minimum: basically just templates and trivial containers.
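As an illustration of the kind of de-duplication being discussed, here is a standalone sketch (not ggml code): the `as_float` trait and the operator structs are hypothetical placeholders, and in ggml the fp16 path would go through `ggml_fp16_to_fp32` / `ggml_fp32_to_fp16` instead of a plain cast.

```cpp
// standalone sketch: one templated loop covering every (type, operator) combination
#include <cstdint>
#include <cstdio>
#include <vector>

// hypothetical conversion trait; ggml would plug its f16/bf16 helpers in here
template <typename T> struct as_float {
    static float load (T v)     { return static_cast<float>(v); }
    static T     store(float v) { return static_cast<T>(v); }
};

struct op_add { static float apply(float a, float b) { return a + b; } };
struct op_mul { static float apply(float a, float b) { return a * b; } };

// one implementation for add/sub/mul/div and all element types,
// including broadcasting src1 when it has a single element
template <typename T, typename Op>
void binary_op(const T * src0, const T * src1, T * dst, int64_t n0, int64_t n1) {
    for (int64_t i = 0; i < n0; ++i) {
        const float a = as_float<T>::load(src0[i]);
        const float b = as_float<T>::load(src1[i % n1]); // broadcast if n1 == 1
        dst[i] = as_float<T>::store(Op::apply(a, b));
    }
}

int main() {
    std::vector<float> a = {1, 2, 3, 4}, b = {0.5f}, out(4);
    binary_op<float, op_add>(a.data(), b.data(), out.data(), 4, 1);
    for (float v : out) printf("%g ", v); // 1.5 2.5 3.5 4.5
    printf("\n");
}
```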
I'm looking into the CI failure for Mac: https://github.com/ggml-org/ci/tree/results/ggml/73/8a3aea59f1c0c7751d65307d1228c1dbbf6a84/ggml-100-mac-m4
You'll most likely need to adjust the `supports_op` checks.
@JohannesGaessler Thanks, this is what I'm planning:

```cpp
case GGML_OP_ADD:
case GGML_OP_SUB:
case GGML_OP_MUL:
case GGML_OP_DIV:
    return op->src[0]->type == GGML_TYPE_F32;
```
@JohannesGaessler Sent a PR for this: #1123. Thanks for your help!
This change increases the operator coverage for the float16 data type in the CUDA backend (for add/sub/mul/div).
At present, ggml requires the second tensor to be converted to float32, which doubles the VRAM requirement for that tensor and makes ggml a bit unintuitive, especially since it's entirely possible to add or multiply two float16 tensors in ggml.

float16 is fairly common for inference, and since ggml is a tensor library for ML, it's not uncommon to want to add or multiply two large tensors. Requiring the `src1` tensor to be float32 doubles the VRAM requirement for that tensor.

I tested that this works (on my 3060 12 GB), has half the peak VRAM usage compared to float32+float32 ops, and is decently faster than float32+float32 addition. `test-backend-ops` also continues to pass.

Example program for float16 addition, example program for float32 addition (a sketch along the lines of the float16 example is included below).
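The following is a minimal sketch of such a float16 addition program, assuming the current ggml-backend API (`ggml_backend_cuda_init`, `ggml_backend_alloc_ctx_tensors`, `ggml_backend_graph_compute`); the tensor sizes and values are illustrative and this is not the exact example program linked above.

```cpp
// sketch: add two float16 tensors directly on the CUDA backend
#include <cstdio>
#include "ggml.h"
#include "ggml-alloc.h"
#include "ggml-backend.h"
#include "ggml-cuda.h"

int main() {
    const int n = 8;

    // the context only holds tensor/graph metadata; data lives in backend buffers
    struct ggml_init_params params = {
        /*.mem_size   =*/ ggml_tensor_overhead()*8 + ggml_graph_overhead(),
        /*.mem_buffer =*/ NULL,
        /*.no_alloc   =*/ true,
    };
    struct ggml_context * ctx = ggml_init(params);

    struct ggml_tensor * a = ggml_new_tensor_1d(ctx, GGML_TYPE_F16, n);
    struct ggml_tensor * b = ggml_new_tensor_1d(ctx, GGML_TYPE_F16, n);
    struct ggml_tensor * c = ggml_add(ctx, a, b); // result stays in float16

    struct ggml_cgraph * gf = ggml_new_graph(ctx);
    ggml_build_forward_expand(gf, c);

    ggml_backend_t backend = ggml_backend_cuda_init(0); // device 0
    ggml_backend_buffer_t buf = ggml_backend_alloc_ctx_tensors(ctx, backend);

    // upload fp16 input data
    ggml_fp16_t a_data[8], b_data[8];
    for (int i = 0; i < n; ++i) {
        a_data[i] = ggml_fp32_to_fp16((float) i);
        b_data[i] = ggml_fp32_to_fp16(0.5f);
    }
    ggml_backend_tensor_set(a, a_data, 0, sizeof(a_data));
    ggml_backend_tensor_set(b, b_data, 0, sizeof(b_data));

    ggml_backend_graph_compute(backend, gf);

    ggml_fp16_t c_data[8];
    ggml_backend_tensor_get(c, c_data, 0, sizeof(c_data));
    for (int i = 0; i < n; ++i) {
        printf("%g ", ggml_fp16_to_fp32(c_data[i])); // expect 0.5 1.5 2.5 ...
    }
    printf("\n");

    ggml_backend_buffer_free(buf);
    ggml_backend_free(backend);
    ggml_free(ctx);
    return 0;
}
```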
Related issue: #455
Thanks!
PS: I'm new to ggml, apologies if I missed something obvious! Happy to fix.