ggml-cpu: Add CPU backend support for KleidiAI library #11390
Conversation
Is this supported on other platforms such as Raspberry Pi and Android phones with Snapdragon?
Yes, this should work on other platforms that support SME, I8MM, or dot product.
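To make the platform question above concrete: on AArch64 Linux and Android these features can be probed at runtime through the hwcap auxiliary vector. The helper below is a hypothetical sketch, not part of this PR or of KleidiAI; the `HWCAP`/`HWCAP2` bits are the standard Linux ones, and on non-AArch64 builds the probe simply reports no features.

```c
#include <stddef.h>
#if defined(__aarch64__) && defined(__linux__)
#include <sys/auxv.h>
#include <asm/hwcap.h>
#endif

/* Hypothetical helper (not part of llama.cpp or KleidiAI): probe the CPU
 * features relevant to these kernels via the Linux hwcap aux vectors. */
enum { FEAT_DOTPROD = 1 << 0, FEAT_I8MM = 1 << 1, FEAT_SME = 1 << 2 };

static int probe_arm_features(void) {
    int feats = 0;
#if defined(__aarch64__) && defined(__linux__)
    unsigned long hwcap  = getauxval(AT_HWCAP);
    unsigned long hwcap2 = getauxval(AT_HWCAP2);
    if (hwcap & HWCAP_ASIMDDP) feats |= FEAT_DOTPROD;  /* dot product */
#ifdef HWCAP2_I8MM
    if (hwcap2 & HWCAP2_I8MM)  feats |= FEAT_I8MM;     /* int8 matmul */
#endif
#ifdef HWCAP2_SME
    if (hwcap2 & HWCAP2_SME)   feats |= FEAT_SME;      /* scalable matrix ext. */
#endif
#endif
    return feats;
}
```

On a Raspberry Pi 5 (Cortex-A76) this would report dot product but not I8MM or SME; recent Snapdragon cores add I8MM.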
What performance can we expect from KleidiAI over the current kernels? I tried this on an M3 Max, and at least on this hardware it does not seem to be faster than the current AARCH64 kernels.

master:

PR:
Cross-compiling for the Android platform as described in android.md reports an error. Could you please have a look?
@njsyw1997 we'll look into the issue. Did you use Termux or the Android NDK for the Android build?
KleidiAI's performance is generally comparable to the current AARCH64 kernels. However, we have observed performance degradation when using a high number of threads on hardware with a high core count. We are actively investigating this issue and will update the PR with a fix. KleidiAI is under active development, and we expect continuous improvements over time. Future updates should further optimize performance across different hardware configurations and add additional kernels.
@chaxu01 I am using the Android SDK, cross-compiling on a Linux x86 server. It seems that compiling natively on a Raspberry Pi 5 (quad-core A76) also produces an error. In CMake I got a warning; I ignored the warning and compiled directly.
@njsyw1997 Thanks for the info, and we'll update the PR for the issue.
I've been looking forward to getting SME enabled and finally had some time to review and benchmark this PR. Based on my observations so far, SME on the M4 does not scale beyond two threads, which aligns with my earlier findings. See the benchmark results below. I compared llama.cpp master-I8MM vs kleidi-I8MM vs kleidi-SME. It looks like even the Kleidi I8MM kernels perform worse than what we have in master. My thinking so far is that using SME-only kernels for the LLM is not going to work very well. @chaxu01 Have you considered my suggestion above (i.e. a mix of SME and I8MM threads)?

Build with Apple Clang
Build with LLVM Clang 19.1.7 (installed via homebrew)
PP bench : master-i8mm, kleidi-i8mm, kleidi-sme
build: 6eecde3 (4621)
build: 119d3bf (4542)
build: 119d3bf (4542)

TG bench : master-i8mm, kleidi-i8mm, kleidi-sme
build: 6eecde3 (4621)
build: 119d3bf (4542)
build: 119d3bf (4542)

PP and TG bench on Galaxy S25 (Snapdragon 8 Elite) : master-i8mm | kleidi-i8mm
Built with Android NDK 27c (armv8.7a+dotprod+i8mm)
build: 6eecde3 (4621)
build: 119d3bf (4542)
build: 6eecde3 (4621)
build: 119d3bf (4542)

@slaren

I'd say I see worse performance across the board even on 1-2 threads with I8MM kernels.
None of the currently available Snapdragon-based devices support SME. |
@max-krasnyansky Thanks for running the benchmarks. It looks like the commit f4eb1b3, which adds support for multithreaded LHS packing, wasn't included in your tests.

Yes, we've considered this approach. The main issue is that different kernels may use different RHS encodings, which would require repacking and storing multiple copies of the weights in memory, one copy for each kernel. This would increase memory usage and model loading time. BTW, I've updated the PR with the commit 3e08f37, which switches the kernel priority order to dotprod before i8mm, as we observed that dotprod gives better performance when supported.
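The priority-ordered selection described above (dotprod preferred over i8mm when both are supported) can be sketched as a walk over a preference-sorted kernel table. The names and types below are illustrative, not the PR's actual API:

```c
#include <stddef.h>
#include <string.h>

/* Sketch of priority-ordered kernel selection. The table is assumed to be
 * sorted by preference (dotprod first, per the PR comment above), and each
 * entry records the CPU features it requires. */
typedef struct {
    unsigned required_features;  /* bitmask of CPU features this kernel needs */
    const char *name;
} kernel_desc;

/* Return the first (highest-priority) kernel whose requirements are all met,
 * or NULL to fall back to the generic ggml path. */
static const kernel_desc *select_kernel(const kernel_desc *table, size_t n,
                                        unsigned cpu_features) {
    for (size_t i = 0; i < n; ++i) {
        unsigned req = table[i].required_features;
        if ((req & cpu_features) == req) {
            return &table[i];
        }
    }
    return NULL;
}
```

With this layout, changing the preferred kernel is just a reordering of the table, which matches the kind of change commit 3e08f37 describes.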
@slaren Thanks for the review. I've pushed a new commit addressing your comments. Please let me know if any further changes are needed for the PR.
Reorganize and rename the files in the following structure, following the amx code as an example:
# old
ggml-kleidiai/
├── ggml-kleidiai.cpp
├── ggml-kleidiai.h
├── kleidiai_kernels.cpp
└── kleidiai_kernels.h
# new
kleidiai/
├── kleidiai.cpp
├── kleidiai.h
├── kernels.cpp
└── kernels.h
@ggerganov Thanks for your code review. I've pushed a commit that addresses your comments. Please let me know if any further changes are needed for the PR.
ggml/src/ggml-cpu/kleidiai/kernels.h (Outdated)
@@ -2,10 +2,6 @@
// SPDX-License-Identifier: MIT
//

#pragma once
The #pragma once should still be present in the header.
@ggerganov Thanks again for the quick review. I've pushed a new commit to address your comments. |
Is this expected to run on an M4 Mac Mini? I think it should support SME, but I get the following build error:

cmake -D CMAKE_BUILD_TYPE=Release -D GGML_METAL=OFF -D GGML_BLAS=OFF -D GGML_CPU_KLEIDIAI=ON ..
-- The C compiler identification is AppleClang 16.0.0.16000026
-- The CXX compiler identification is AppleClang 16.0.0.16000026
...
-- ARM detected
-- Performing Test GGML_COMPILER_SUPPORTS_FP16_FORMAT_I3E
-- Performing Test GGML_COMPILER_SUPPORTS_FP16_FORMAT_I3E - Failed
-- ARM -mcpu not found, -mcpu=native will be used
-- Performing Test GGML_MACHINE_SUPPORTS_dotprod
-- Performing Test GGML_MACHINE_SUPPORTS_dotprod - Success
-- Performing Test GGML_MACHINE_SUPPORTS_i8mm
-- Performing Test GGML_MACHINE_SUPPORTS_i8mm - Success
-- Performing Test GGML_MACHINE_SUPPORTS_sve
-- Performing Test GGML_MACHINE_SUPPORTS_sve - Failed
-- Performing Test GGML_MACHINE_SUPPORTS_nosve
-- Performing Test GGML_MACHINE_SUPPORTS_nosve - Success
-- Performing Test GGML_MACHINE_SUPPORTS_sme
-- Performing Test GGML_MACHINE_SUPPORTS_sme - Success
-- ARM feature DOTPROD enabled
-- ARM feature MATMUL_INT8 enabled
-- ARM feature FMA enabled
-- ARM feature FP16_VECTOR_ARITHMETIC enabled
-- ARM feature SME enabled
-- Using KleidiAI optimized kernels if applicable
-- The ASM compiler identification is AppleClang
-- Found assembler: /Library/Developer/CommandLineTools/usr/bin/cc
-- Adding CPU backend variant ggml-cpu: -mcpu=native+dotprod+i8mm+nosve+sme
...
make -j
[ 16%] Building C object ggml/src/CMakeFiles/ggml-cpu.dir/__/__/_deps/kleidiai_download-src/kai/ukernels/matmul/matmul_clamp_f32_qsi8d32p_qsi4c32p/kai_matmul_clamp_f32_qsi8d32p1x4_qsi4c32p4vlx4_1x4vl_sme2_sdot.c.o
/Users/ggml/work/llama.cpp/build-kai/_deps/kleidiai_download-src/kai/ukernels/matmul/matmul_clamp_f32_qsi8d32p_qsi4c32p/kai_matmul_clamp_f32_qsi8d32p1vlx4_qsi4c32p4vlx4_1vlx4vl_sme2_mopa.c:8:2: error: This file must be compiled for AArch64, FEAT_SVE2 or FEAT_SME2.
8 | #error This file must be compiled for AArch64, FEAT_SVE2 or FEAT_SME2.
| ^
/Users/ggml/work/llama.cpp/build-kai/_deps/kleidiai_download-src/kai/ukernels/matmul/matmul_clamp_f32_qsi8d32p_qsi4c32p/kai_matmul_clamp_f32_qsi8d32p1vlx4_qsi4c32p4vlx4_1vlx4vl_sme2_mopa.c:387:41: warning: ISO C requires a translation unit to contain at least one declaration [-Wempty-translation-unit]
387 | #endif // Architectural features check.
| ^
1 warning and 1 error generated.
make[2]: *** [ggml/src/CMakeFiles/ggml-cpu.dir/__/__/_deps/kleidiai_download-src/kai/ukernels/matmul/matmul_clamp_f32_qsi8d32p_qsi4c32p/kai_matmul_clamp_f32_qsi8d32p1vlx4_qsi4c32p4vlx4_1vlx4vl_sme2_mopa.c.o] Error 1
make[2]: *** Waiting for unfinished jobs....
/Users/ggml/work/llama.cpp/build-kai/_deps/kleidiai_download-src/kai/ukernels/matmul/matmul_clamp_f32_qsi8d32p_qsi4c32p/kai_matmul_clamp_f32_qsi8d32p1x4_qsi4c32p4vlx4_1x4vl_sme2_sdot.c:8:2: error: This file must be compiled for AArch64, FEAT_SVE2 or FEAT_SME2.
8 | #error This file must be compiled for AArch64, FEAT_SVE2 or FEAT_SME2.
| ^
/Users/ggml/work/llama.cpp/build-kai/_deps/kleidiai_download-src/kai/ukernels/matmul/matmul_clamp_f32_qsi8d32p_qsi4c32p/kai_matmul_clamp_f32_qsi8d32p1x4_qsi4c32p4vlx4_1x4vl_sme2_sdot.c:357:41: warning: ISO C requires a translation unit to contain at least one declaration [-Wempty-translation-unit]
357 | #endif // Architectural features check.
| ^
1 warning and 1 error generated.
make[2]: *** [ggml/src/CMakeFiles/ggml-cpu.dir/__/__/_deps/kleidiai_download-src/kai/ukernels/matmul/matmul_clamp_f32_qsi8d32p_qsi4c32p/kai_matmul_clamp_f32_qsi8d32p1x4_qsi4c32p4vlx4_1x4vl_sme2_sdot.c.o] Error 1
make[1]: *** [ggml/src/CMakeFiles/ggml-cpu.dir/all] Error 2
make: *** [all] Error 2

Commit: 02315a8
We recently identified that our fix for an Android build issue unintentionally introduced an SME-related build error. We're currently working on a resolution and will upload a fix soon. |
@ggerganov I've updated the patch to address your previous comments. Let me know if you have any further feedback or concerns. Happy to make any additional adjustments if needed.
I tested it on my Galaxy A34 running Android and noticed a slight degradation in performance.

This PR:

Master build:
build: 63ac128 (4738)
@rmatif, thanks for testing this PR on your Galaxy A34 and for the feedback. The KleidiAI kernels integrated into llama.cpp for this PR are currently optimized for lower thread counts, which may explain the performance degradation you observed. I've tested it on a Pixel 8 (Cortex-X + 4x Cortex-A715), and the performance figures show that KleidiAI performs slightly better when using between 1 and 4 cores. However, we're aware of the performance gap at higher thread counts and are actively working on addressing it. Improvements for better scaling with more threads will be included in a future commit.

PR:

Master:
ggml/src/ggml-cpu/CMakeLists.txt (Outdated)
    set(PRIVATE_ARCH_FLAGS "${PRIVATE_ARCH_FLAGS}+sve+sve2")
endif()

list(APPEND GGML_CDEF_PUBLIC GGML_USE_CPU_KLEIDIAI)
GGML_CDEF_PUBLIC has been removed, and this will have no effect. If it is important to add this definition, it should be done with target_compile_definitions.
Let's merge after addressing the GGML_CDEF_PUBLIC comment and if there are no other reviews.
Using the new Arm instruction sets in ggml will be important for CPU-based on-device inference, and it seems that the KleidiAI library is focused on adding support for these hardware features.

Note that we generally avoid adding third-party dependencies to the project, mainly to keep things simple and easy to deploy. Still, the KAI library appears to be lightweight and designed in a way that is easy to integrate, so I think it's OK to make an exception and add this option with the prospect of better on-device performance in the future. For the time being, usage of the KAI kernels will be gated by build and environment flags. My understanding is that the performance of the implementation will improve in the future, with a focus on low-power use cases, so I will be looking forward to that.
It would be very useful to add CI workflows to cover these instruction sets in order to improve long-term support and maintenance.
@chaxu01 Will be pinging you for support for any issues that might arise from these changes.
@ggerganov @slaren Thanks for the review and for considering the integration of KleidiAI. I've updated the PR to address the GGML_CDEF_PUBLIC comment.
As a heads up: it seems as though support for DOWNLOAD_EXTRACT_TIMESTAMP is only available in CMake 3.24 and beyond. Relevant logs:

Upgrading CMake fixed this behavior.
This commit integrates Arm's KleidiAI library, which provides optimized matrix-multiplication (matmul) kernels tailored for hardware features such as SME, I8MM, and dot-product acceleration. The feature can be enabled with the build option GGML_CPU_KLEIDIAI.
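For reference, enabling the option from the command line might look like the following. This is a sketch based on the GGML_CPU_KLEIDIAI flag named in this PR; the rest is ordinary CMake usage, and exact output paths will vary by platform:

```shell
# Configure with the KleidiAI-backed CPU kernels enabled
cmake -B build -DCMAKE_BUILD_TYPE=Release -DGGML_CPU_KLEIDIAI=ON
cmake --build build --config Release -j
```

As noted in the thread above, this path requires CMake 3.24 or newer because the KleidiAI sources are fetched with DOWNLOAD_EXTRACT_TIMESTAMP.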