
Support pure float16 add/sub/mul/div operations in the CUDA (and CPU) backend #1121

Merged: 3 commits into ggml-org:master on Feb 25, 2025

Conversation

@cmdr2 (Contributor) commented on Feb 24, 2025

This change increases the operator coverage for the float16 data type in the CUDA backend (for add/sub/mul/div).

At present, ggml requires the second tensor to be converted to float32, which doubles the VRAM requirement and makes ggml a bit unintuitive, especially since it is entirely possible to add or multiply two float16 tensors directly.

float16 is fairly common for inference, and since ggml is a tensor library for ML, it's not uncommon to want to add or multiply two large tensors. Requiring the src1 tensor to be float32 doubles the VRAM requirement for that tensor.

I tested that this works (on my 3060 12 GB): it has half the peak VRAM usage compared to float32+float32 ops and is decently faster than float32+float32 addition. test-backend-ops also continues to pass.

Example program for float16 addition; example program for float32 addition.
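For reference, a minimal sketch of what such an fp16 example program can look like against the public ggml backend API (the sizes, device index, and data-transfer steps are illustrative; error handling is omitted):

```cpp
#include "ggml.h"
#include "ggml-alloc.h"
#include "ggml-backend.h"
#include "ggml-cuda.h"

int main(void) {
    // the context only holds tensor/graph metadata; the data lives in a backend buffer
    struct ggml_init_params params = {
        /*.mem_size   =*/ ggml_tensor_overhead() * 8 + ggml_graph_overhead(),
        /*.mem_buffer =*/ NULL,
        /*.no_alloc   =*/ true,
    };
    struct ggml_context * ctx = ggml_init(params);

    const int64_t n = 1024;
    struct ggml_tensor * a = ggml_new_tensor_1d(ctx, GGML_TYPE_F16, n);
    struct ggml_tensor * b = ggml_new_tensor_1d(ctx, GGML_TYPE_F16, n);
    struct ggml_tensor * c = ggml_add(ctx, a, b);  // F16 + F16 -> F16, no F32 copy of b needed

    struct ggml_cgraph * gf = ggml_new_graph(ctx);
    ggml_build_forward_expand(gf, c);

    ggml_backend_t backend = ggml_backend_cuda_init(0);                        // CUDA device 0
    ggml_backend_buffer_t buf = ggml_backend_alloc_ctx_tensors(ctx, backend);  // allocates a, b, c

    // ... upload ggml_fp16_t data into a and b with ggml_backend_tensor_set() ...

    ggml_backend_graph_compute(backend, gf);

    // ... read the F16 result back with ggml_backend_tensor_get(c, ...) ...

    ggml_backend_buffer_free(buf);
    ggml_backend_free(backend);
    ggml_free(ctx);
    return 0;
}
```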

Related issue: #455

> For example, ggml_mul currently also works with just F32 input, which prevents having 1D F16 norm tensors. This is not a huge drawback since these tensors are usually small, but it would be nice to also support F16.

Thanks!
PS: I'm new to ggml, apologies if I missed something obvious! Happy to fix.

@JohannesGaessler (Collaborator) left a comment


If at all possible, please also add a CPU implementation and a test case for test-backend-ops to assert that the implementations are consistent.

@cmdr2 (Contributor, Author) commented on Feb 24, 2025

@JohannesGaessler Thanks, no problem. It looks like the current CPU implementation for fp16 addition isn't working correctly when adding an N-sized tensor to a single-element tensor.

Also, I had to change the initial assert to match what the f32 CPU implementation does (can_repeat instead of same_shape), otherwise it would fail to run several test cases.

For example, this test produces incorrect values on the CPU: ADD(type=f16,ne=[1,1,1,1],nr=[32,1,1,1]). I printed the computed values; the CUDA backend values are correct, while the CPU values are not.

So I'm not sure fp16 addition is actually exercised with the CPU backend right now, but I'll dig further into why the CPU implementation for fp16 isn't producing correct results.
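For context, the broadcast-aware fix on the CPU side essentially has to do what the f32 path already does: walk the dst elements and wrap the src1 index so a smaller src1 gets repeated, converting through float for the arithmetic. A simplified sketch of that inner loop (contiguous case only, hypothetical function name; the real ggml_compute_forward_add routines also handle strides and threading):

```cpp
#include <stdint.h>
#include "ggml.h"  // ggml_fp16_t, ggml_fp16_to_fp32(), ggml_fp32_to_fp16()

// simplified row body: dst = src0 + broadcast(src1), all tensors GGML_TYPE_F16;
// ne0 is the row length of src0/dst, ne10 the row length of src1 (1 for a single-element src1)
static void add_row_f16(ggml_fp16_t * dst, const ggml_fp16_t * src0,
                        const ggml_fp16_t * src1, int64_t ne0, int64_t ne10) {
    for (int64_t i = 0; i < ne0; i++) {
        // wrap the src1 index so a 1-element src1 is repeated across the whole row
        const float x = ggml_fp16_to_fp32(src0[i]);
        const float y = ggml_fp16_to_fp32(src1[i % ne10]);
        dst[i] = ggml_fp32_to_fp16(x + y);
    }
}
```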

@WilliamTambellini (Contributor) commented:

@cmdr2 Thanks. We are interested in this PR.

cmdr2 changed the title from "Support pure float16 add/sub/mul/div operations in the CUDA backend" to "Support pure float16 add/sub/mul/div operations in the CUDA (and CPU) backend" on Feb 25, 2025.
@cmdr2 (Contributor, Author) commented on Feb 25, 2025

@JohannesGaessler I've added fp16 op support for add/sub/mul/div on the CPU backend as well. Also added test cases in test-backend-ops.
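Roughly, the new cases follow the existing test_bin_bcast pattern in tests/test-backend-ops.cpp, with GGML_TYPE_F16 tested alongside GGML_TYPE_F32; the shapes below are illustrative rather than the exact ones added in the PR:

```cpp
// illustrative: register F16 variants of the binary broadcast tests,
// reusing the test_bin_bcast fixture that the existing F32 cases use
for (ggml_type type : {GGML_TYPE_F16, GGML_TYPE_F32}) {
    for (auto op : {ggml_add, ggml_sub, ggml_mul, ggml_div}) {
        // ne is the src1 shape; src0 gets ne scaled by the per-dimension repeat factors nr
        test_cases.emplace_back(new test_bin_bcast(op, type, {16, 10, 1, 1}, {1, 1, 1, 1}));
        test_cases.emplace_back(new test_bin_bcast(op, type, {1, 1, 1, 1}, {32, 1, 1, 1}));
    }
}
```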

test-backend-ops passes.

Can you please take a look when you get a chance? Thanks!

@cmdr2 (Contributor, Author) commented on Feb 25, 2025

As a side note, since I saw a plan to refactor ggml-cpu.c: it would really help if it were written as a C++ file, since we could use function templates to cut down a huge number of redundant lines of code. Maybe that's already the plan?

Right now, as you know, there are several copies of each operator function, and many of them still don't support broadcasting or non-contiguous cases, simply because that logic hasn't been copied over from the fp32 implementation.

We could easily de-duplicate a lot of them using function templates (including the permutations of src0/src1/dst data types).

I'm happy to help, if it's simply a lack of manpower. In a different PR, of course.

Thanks

@ggerganov (Member) commented:

> As a side note, since I saw a plan to refactor ggml-cpu.c: it would really help if it were written as a C++ file, since we could use function templates to cut down a huge number of redundant lines of code. Maybe that's already the plan?
>
> Right now, as you know, there are several copies of each operator function, and many of them still don't support broadcasting or non-contiguous cases, simply because that logic hasn't been copied over from the fp32 implementation.
>
> We could easily de-duplicate a lot of them using function templates (including the permutations of src0/src1/dst data types).
>
> I'm happy to help, if it's simply a lack of manpower. In a different PR, of course.
>
> Thanks

Yes, we should switch to a C++ implementation and reduce code duplication. The most important thing is not to go overboard with fancy C++ features and to keep them to a minimum: basically just templates plus trivial containers such as std::vector where it makes sense.
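To make the idea concrete, here is a rough sketch of the kind of templated kernel being discussed; the names are hypothetical rather than taken from ggml-cpu, and the conversions go through ggml's public fp16 helpers:

```cpp
#include <stdint.h>
#include "ggml.h"  // ggml_fp16_t, ggml_fp16_to_fp32(), ggml_fp32_to_fp16()

// per-element conversion through float, resolved by overload
static inline float elem_to_f32(float x)       { return x; }
static inline float elem_to_f32(ggml_fp16_t x) { return ggml_fp16_to_fp32(x); }

static inline void f32_to_elem(float x, float * y)       { *y = x; }
static inline void f32_to_elem(float x, ggml_fp16_t * y) { *y = ggml_fp32_to_fp16(x); }

inline float op_add(float a, float b) { return a + b; }
inline float op_mul(float a, float b) { return a * b; }

// one row body shared by add/sub/mul/div and all src0/src1/dst type combinations;
// ne10 is the src1 row length, so a 1-element src1 broadcasts across the row
template <float (*op)(float, float), typename src0_t, typename src1_t, typename dst_t>
static void binary_op_row(dst_t * dst, const src0_t * src0, const src1_t * src1,
                          int64_t ne0, int64_t ne10) {
    for (int64_t i = 0; i < ne0; i++) {
        f32_to_elem(op(elem_to_f32(src0[i]), elem_to_f32(src1[i % ne10])), &dst[i]);
    }
}

// e.g. F16 = F16 + F16:
//   binary_op_row<op_add, ggml_fp16_t, ggml_fp16_t, ggml_fp16_t>(dst, s0, s1, ne0, ne10);
```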

slaren merged commit 738a3ae into ggml-org:master on Feb 25, 2025. 3 checks passed.
@cmdr2 (Contributor, Author) commented on Feb 25, 2025

@JohannesGaessler (Collaborator) commented:

You'll most likely need to adjust the supports_op function for the Metal backend. It should be fairly straightforward.

@cmdr2 (Contributor, Author) commented on Feb 25, 2025

@JohannesGaessler Thanks, this is what I'm planning in ggml-metal.m. Does this look okay?

case GGML_OP_ADD:
case GGML_OP_SUB:
case GGML_OP_MUL:
case GGML_OP_DIV:
    return op->src[0]->type == GGML_TYPE_F32;

@cmdr2 (Contributor, Author) commented on Feb 25, 2025

@JohannesGaessler Sent a PR for this: #1123

Thanks for your help!
