
ggml-cpu: Add CPU backend support for KleidiAI library #11390

Merged 10 commits into ggml-org:master on Feb 20, 2025

Conversation

@chaxu01 (Collaborator) commented Jan 24, 2025

This commit integrates Arm's KleidiAI library, which provides optimized matrix-multiplication (matmul) kernels tailored for hardware features such as SME, i8mm, and dot-product acceleration. The feature can be enabled with the build option GGML_CPU_KLEIDIAI.
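
For reference, a minimal configure-and-build invocation with the feature enabled might look like this (a sketch; everything besides GGML_CPU_KLEIDIAI is standard CMake usage):

$ cmake -B build -DCMAKE_BUILD_TYPE=Release -DGGML_CPU_KLEIDIAI=ON
$ cmake --build build -j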

The github-actions bot added the ggml label (changes relating to the ggml tensor library for machine learning) on Jan 24, 2025.
@njsyw1997 commented

Is this supported on other platforms, such as the Raspberry Pi and Android phones with Snapdragon?

@chaxu01 (Collaborator, Author) commented Jan 30, 2025 via email

@slaren (Member) commented Jan 30, 2025

What performance can we expect with KleidiAI over the current kernels?

I tried this on M3 Max, and at least with this hardware it does not seem to be faster than the current AARCH64 kernels:

master:

| model | size | params | backend | threads | test | t/s |
| --- | ---: | ---: | --- | ---: | ---: | ---: |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | CPU | 12 | pp128 | 229.89 ± 1.38 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | CPU | 12 | tg32 | 31.54 ± 0.01 |

PR:

| model | size | params | backend | threads | test | t/s |
| --- | ---: | ---: | --- | ---: | ---: | ---: |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | CPU | 12 | pp128 | 192.93 ± 0.14 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | CPU | 12 | tg32 | 30.54 ± 0.00 |

@njsyw1997 commented

Cross-compiling for the Android platform as described in android.md reports an error:

CMake Error at ggml/src/ggml-cpu/CMakeLists.txt:373 (string):
  string sub-command FIND requires 3 or 4 parameters.
Call Stack (most recent call first):
  ggml/src/CMakeLists.txt:301 (ggml_add_cpu_backend_variant_impl)


CMake Error at ggml/src/ggml-cpu/CMakeLists.txt:374 (string):
  string sub-command FIND requires 3 or 4 parameters.
Call Stack (most recent call first):
  ggml/src/CMakeLists.txt:301 (ggml_add_cpu_backend_variant_impl)


CMake Error at ggml/src/ggml-cpu/CMakeLists.txt:375 (string):
  string sub-command FIND requires 3 or 4 parameters.
Call Stack (most recent call first):
  ggml/src/CMakeLists.txt:301 (ggml_add_cpu_backend_variant_impl)

Could you please have a look?
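
For context, this particular CMake error typically appears when string(FIND) is handed an unquoted variable that expands to nothing, so the sub-command receives too few arguments. A minimal, hypothetical reproduction (not the actual lines from ggml-cpu/CMakeLists.txt):

    set(ARCH_FLAGS "")                          # empty, e.g. never populated for this toolchain
    string(FIND ${ARCH_FLAGS} "sme" SME_POS)    # unquoted empty var -> only 2 parameters -> error
    string(FIND "${ARCH_FLAGS}" "sme" SME_POS)  # quoting preserves the empty string as an argument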

@chaxu01 (Collaborator, Author) commented Jan 31, 2025

@njsyw1997 we'll look into the issue. Did you use Termux or Android NDK for the Android build?

@chaxu01 (Collaborator, Author) commented Jan 31, 2025

> What performance can we expect with KleidiAI over the current kernels?

KleidiAI's performance is generally comparable to the current AARCH64 kernels. However, we have observed performance degradation when using a high number of threads on hardware with high core counts. We are actively investigating this issue and will update the PR with a fix.

KleidiAI is under active development, and we expect continuous improvements over time. Future updates should further optimize performance across different hardware configurations and add additional kernels.

@njsyw1997 commented

@chaxu01 I am using the Android SDK, cross-compiling on a Linux x86 server. It seems that ARCH_FLAGS is not properly set in this CMakeLists.txt.

Compiling natively on a Raspberry Pi 5 (quad-core Cortex-A76) also fails. CMake emits this warning:

-- Adding CPU backend variant ggml-cpu: -mcpu=cortex-a76+crypto+dotprod+noi8mm+nosve+nosme
cc1: error: invalid feature modifier ‘sme’ in ‘-mcpu=cortex-a76+crypto+dotprod+noi8mm+nosve+nosme’
cc1: note: valid arguments are: fp simd crypto crc lse fp16 rcpc rdma dotprod aes sha2 sha3 sm4 fp16fml sve profile rng memtag sb ssbs predres sve2 sve2-sm4 sve2-aes sve2-sha3 sve2-bitperm tme i8mm f32mm f64mm bf16 flagm pauth ls64 mops; did you mean ‘sm4’?
CMake Warning at ggml/src/ggml-cpu/CMakeLists.txt:152 (message):
  Failed to get ARM features
Call Stack (most recent call first):
  ggml/src/CMakeLists.txt:301 (ggml_add_cpu_backend_variant_impl)

Ignoring the warning and compiling anyway yields:

cc1plus: error: invalid feature modifier ‘sme’ in ‘-mcpu=cortex-a76+crypto+dotprod+noi8mm+nosve+nosme’
cc1plus: note: valid arguments are: fp simd crypto crc lse fp16 rcpc rdma dotprod aes sha2 sha3 sm4 fp16fml sve profile rng memtag sb ssbs predres sve2 sve2-sm4 sve2-aes sve2-sha3 sve2-bitperm tme i8mm f32mm f64mm bf16 flagm pauth ls64 mops; did you mean ‘sm4’?
cc1plus: error: invalid feature modifier ‘sme’ in ‘-mcpu=cortex-a76+crypto+dotprod+noi8mm+nosve+nosme’
cc1plus: note: valid arguments are: fp simd crypto crc lse fp16 rcpc rdma dotprod aes sha2 sha3 sm4 fp16fml sve profile rng memtag sb ssbs predres sve2 sve2-sm4 sve2-aes sve2-sha3 sve2-bitperm tme i8mm f32mm f64mm bf16 flagm pauth ls64 mops; did you mean ‘sm4’?
make[2]: *** [ggml/src/CMakeFiles/ggml-cpu.dir/build.make:90: ggml/src/CMakeFiles/ggml-cpu.dir/ggml-cpu/ggml-cpu.cpp.o] Error 1
make[2]: *** Waiting for unfinished jobs....
make[2]: *** [ggml/src/CMakeFiles/ggml-cpu.dir/build.make:104: ggml/src/CMakeFiles/ggml-cpu.dir/ggml-cpu/ggml-cpu-aarch64.cpp.o] Error 1
cc1plus: error: invalid feature modifier ‘sme’ in ‘-mcpu=cortex-a76+crypto+dotprod+noi8mm+nosve+nosme’
cc1plus: note: valid arguments are: fp simd crypto crc lse fp16 rcpc rdma dotprod aes sha2 sha3 sm4 fp16fml sve profile rng memtag sb ssbs predres sve2 sve2-sm4 sve2-aes sve2-sha3 sve2-bitperm tme i8mm f32mm f64mm bf16 flagm pauth ls64 mops; did you mean ‘sm4’?
make[2]: *** [ggml/src/CMakeFiles/ggml-cpu.dir/build.make:118: ggml/src/CMakeFiles/ggml-cpu.dir/ggml-cpu/ggml-cpu-hbm.cpp.o] Error 1
cc1: error: invalid feature modifier ‘sme’ in ‘-mcpu=cortex-a76+crypto+dotprod+noi8mm+nosve+nosme’
cc1: note: valid arguments are: fp simd crypto crc lse fp16 rcpc rdma dotprod aes sha2 sha3 sm4 fp16fml sve profile rng memtag sb ssbs predres sve2 sve2-sm4 sve2-aes sve2-sha3 sve2-bitperm tme i8mm f32mm f64mm bf16 flagm pauth ls64 mops; did you mean ‘sm4’?
make[2]: *** [ggml/src/CMakeFiles/ggml-cpu.dir/build.make:76: ggml/src/CMakeFiles/ggml-cpu.dir/ggml-cpu/ggml-cpu.c.o] Error 1
make[1]: *** [CMakeFiles/Makefile2:1757: ggml/src/CMakeFiles/ggml-cpu.dir/all] Error 2
make: *** [Makefile:146: all] Error 2
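
For context, the GCC used here does not know the sme feature modifier at all, so any -mcpu string containing +sme or +nosme is rejected outright. A probe in the spirit of the existing GGML_MACHINE_SUPPORTS_* checks (a sketch, not the PR's actual fix) would catch this before the flag reaches the build:

    include(CheckCXXCompilerFlag)
    check_cxx_compiler_flag("-mcpu=native+nosme" GGML_MACHINE_SUPPORTS_nosme)
    if (NOT GGML_MACHINE_SUPPORTS_nosme)
        # compiler rejects the modifier entirely; leave +nosme out of ARCH_FLAGS
    endif()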

@chaxu01 (Collaborator, Author) commented Feb 3, 2025

@njsyw1997 Thanks for the info; we'll update the PR to fix the issue.

@max-krasnyansky (Collaborator) commented Feb 5, 2025

I've been looking forward to getting SME enabled and finally had some time to review and benchmark this PR.

Based on my observations so far, SME on the M4 does not scale beyond two threads. This is aligned with findings from others (e.g. https://scalable.uni-jena.de/opt/sme/micro.html). It looks like the M4 Pro has two SME units (i.e. SMT sharing of the hardware).

See the benchmark results below. I compared llama.cpp master-I8MM vs kleidi-I8MM vs kleidi-SME. SME does much better on 1 and 2 threads and much worse from then on. Tested with both Apple Clang and LLVM 19.

It looks like even the Kleidi I8MM kernels perform worse than what we have in master. After noticing that on the M4 Pro, I ran the benchmarks on the Galaxy S25 (Snapdragon 8 Elite). The 8 Elite doesn't support SME, so I just compared master-I8MM vs kleidi-I8MM, and I see the same issue (i.e. Kleidi performs worse).

My thinking so far is that using SME-only kernels for the LLM is not going to work very well. Instead, we should be starting 2x SME threads and 8x I8MM threads (the M4 Pro has 10 performance cores) to get the best performance.

@chaxu01 Have you guys considered my suggestion above (i.e. a mix of SME and I8MM threads)? SME kernels would have to work on the same tensor layout as the I8MM kernels, so perhaps there will be additional overhead to load that layout into the ZA array. Hopefully not too much, though.

Build with Apple Clang

$ cmake -D CMAKE_BUILD_TYPE=Release -D GGML_METAL=OFF -D GGML_BLAS=OFF -D GGML_CPU_KLEIDIAI=ON -G Ninja -B build-macos .

-- ARM feature DOTPROD enabled
-- ARM feature MATMUL_INT8 enabled
-- ARM feature FMA enabled
-- ARM feature FP16_VECTOR_ARITHMETIC enabled
-- ARM feature SME enabled
-- Using KleidiAI optimized kernels if applicable
-- The ASM compiler identification is AppleClang
-- Found assembler: /Library/Developer/CommandLineTools/usr/bin/cc
-- Adding CPU backend variant ggml-cpu: -mcpu=native+dotprod+i8mm+nosve+sme 

Build with LLVM Clang 19.1.7 (installed via homebrew)

$ CC=clang CXX=clang++ CFLAGS="-march=armv9.2a+nosve+sme" CXXFLAGS="-march=armv9.2a+nosve+sme" cmake -D CMAKE_BUILD_TYPE=Release -D GGML_METAL=OFF -D GGML_BLAS=OFF -D GGML_CPU_KLEIDIAI=ON -D GGML_NATIVE=OFF -D GGML_CPU_ARM_ARCH="armv9.2a+nosve+sme+dotprod+i8mm" -D GGML_OPENMP=OFF -G Ninja -B build-macos-llvm

-- The C compiler identification is Clang 19.1.7
-- The CXX compiler identification is Clang 19.1.7
...
-- ARM feature DOTPROD enabled
-- ARM feature MATMUL_INT8 enabled
-- ARM feature FMA enabled
-- ARM feature FP16_VECTOR_ARITHMETIC enabled
-- ARM feature SME enabled
-- Using KleidiAI optimized kernels if applicable
-- The ASM compiler identification is Clang with GNU-like command-line
-- Found assembler: /opt/homebrew/opt/llvm/bin/clang
-- Adding CPU backend variant ggml-cpu: -march=armv9.2a+nosve+sme+dotprod+i8mm

PP bench: master-i8mm, kleidi-i8mm, kleidi-sme

~/src/llama.cpp-master$ ./build-macos/bin/llama-bench -m ../gguf/llama-v3.2-1b-instruct.q4_0.gguf -ngl 0 -t 1,2,4,6,8,10 -p 128 -n 0 --delay 5

| model | size | params | backend | threads | test | t/s |
| --- | ---: | ---: | --- | ---: | ---: | ---: |
| llama 1B Q4_0 | 727.75 MiB | 1.24 B | CPU | 1 | pp128 | 160.94 ± 0.42 |
| llama 1B Q4_0 | 727.75 MiB | 1.24 B | CPU | 2 | pp128 | 312.10 ± 2.97 |
| llama 1B Q4_0 | 727.75 MiB | 1.24 B | CPU | 4 | pp128 | 586.83 ± 0.75 |
| llama 1B Q4_0 | 727.75 MiB | 1.24 B | CPU | 6 | pp128 | 829.32 ± 6.44 |
| llama 1B Q4_0 | 727.75 MiB | 1.24 B | CPU | 8 | pp128 | 982.49 ± 18.52 |
| llama 1B Q4_0 | 727.75 MiB | 1.24 B | CPU | 10 | pp128 | 1173.93 ± 24.68 |

build: 6eecde3 (4621)

~/src/llama.cpp-kai$ GGML_KLEIDIAI_SME=0 ./build-macos/bin/llama-bench -m ../gguf/llama-v3.2-1b-instruct.q4_0.gguf -ngl 0 -t 1,2,4,6,8,10 -p 128 -n 0 --delay 5

| model | size | params | backend | threads | test | t/s |
| --- | ---: | ---: | --- | ---: | ---: | ---: |
| llama 1B Q4_0 | 727.75 MiB | 1.24 B | CPU | 1 | pp128 | 162.40 ± 0.28 |
| llama 1B Q4_0 | 727.75 MiB | 1.24 B | CPU | 2 | pp128 | 304.78 ± 2.64 |
| llama 1B Q4_0 | 727.75 MiB | 1.24 B | CPU | 4 | pp128 | 531.04 ± 0.76 |
| llama 1B Q4_0 | 727.75 MiB | 1.24 B | CPU | 6 | pp128 | 733.72 ± 1.68 |
| llama 1B Q4_0 | 727.75 MiB | 1.24 B | CPU | 8 | pp128 | 928.32 ± 3.53 |
| llama 1B Q4_0 | 727.75 MiB | 1.24 B | CPU | 10 | pp128 | 1082.70 ± 7.14 |

build: 119d3bf (4542)

~/src/llama.cpp-kai$ GGML_KLEIDIAI_SME=1 ./build-macos/bin/llama-bench -m ../gguf/llama-v3.2-1b-instruct.q4_0.gguf -ngl 0 -t 1,2,4,6,8,10 -p 128 -n 0 --delay 5

| model | size | params | backend | threads | test | t/s |
| --- | ---: | ---: | --- | ---: | ---: | ---: |
| llama 1B Q4_0 | 727.75 MiB | 1.24 B | CPU | 1 | pp128 | 205.39 ± 0.58 |
| llama 1B Q4_0 | 727.75 MiB | 1.24 B | CPU | 2 | pp128 | 389.43 ± 0.21 |
| llama 1B Q4_0 | 727.75 MiB | 1.24 B | CPU | 4 | pp128 | 326.86 ± 0.09 |
| llama 1B Q4_0 | 727.75 MiB | 1.24 B | CPU | 6 | pp128 | 396.02 ± 2.46 |
| llama 1B Q4_0 | 727.75 MiB | 1.24 B | CPU | 8 | pp128 | 466.28 ± 3.85 |
| llama 1B Q4_0 | 727.75 MiB | 1.24 B | CPU | 10 | pp128 | 435.39 ± 4.12 |

build: 119d3bf (4542)

TG bench: master-i8mm, kleidi-i8mm, kleidi-sme

~/src/llama.cpp-master$ ./build-macos/bin/llama-bench -m ../gguf/llama-v3.2-1b-instruct.q4_0.gguf -ngl 0 -t 1,2,4,6,8,10 -p 0 -n 64 --delay 5

| model | size | params | backend | threads | test | t/s |
| --- | ---: | ---: | --- | ---: | ---: | ---: |
| llama 1B Q4_0 | 727.75 MiB | 1.24 B | CPU | 1 | tg64 | 47.84 ± 0.08 |
| llama 1B Q4_0 | 727.75 MiB | 1.24 B | CPU | 2 | tg64 | 87.31 ± 0.16 |
| llama 1B Q4_0 | 727.75 MiB | 1.24 B | CPU | 4 | tg64 | 157.63 ± 1.01 |
| llama 1B Q4_0 | 727.75 MiB | 1.24 B | CPU | 6 | tg64 | 188.15 ± 9.39 |
| llama 1B Q4_0 | 727.75 MiB | 1.24 B | CPU | 8 | tg64 | 237.40 ± 0.71 |
| llama 1B Q4_0 | 727.75 MiB | 1.24 B | CPU | 10 | tg64 | 243.65 ± 0.14 |

build: 6eecde3 (4621)

~/src/llama.cpp-kai$ GGML_KLEIDIAI_SME=0 ./build-macos/bin/llama-bench -m ../gguf/llama-v3.2-1b-instruct.q4_0.gguf -ngl 0 -t 1,2,4,6,8,10 -p 0 -n 64 --delay 5

| model | size | params | backend | threads | test | t/s |
| --- | ---: | ---: | --- | ---: | ---: | ---: |
| llama 1B Q4_0 | 727.75 MiB | 1.24 B | CPU | 1 | tg64 | 48.15 ± 0.15 |
| llama 1B Q4_0 | 727.75 MiB | 1.24 B | CPU | 2 | tg64 | 86.01 ± 0.29 |
| llama 1B Q4_0 | 727.75 MiB | 1.24 B | CPU | 4 | tg64 | 153.94 ± 0.16 |
| llama 1B Q4_0 | 727.75 MiB | 1.24 B | CPU | 6 | tg64 | 194.26 ± 1.96 |
| llama 1B Q4_0 | 727.75 MiB | 1.24 B | CPU | 8 | tg64 | 226.74 ± 0.18 |
| llama 1B Q4_0 | 727.75 MiB | 1.24 B | CPU | 10 | tg64 | 232.53 ± 0.58 |

build: 119d3bf (4542)

~/src/llama.cpp-kai$ GGML_KLEIDIAI_SME=1 ./build-macos/bin/llama-bench -m ../gguf/llama-v3.2-1b-instruct.q4_0.gguf -ngl 0 -t 1,2,4,6,8,10 -p 0 -n 64 --delay 5

| model | size | params | backend | threads | test | t/s |
| --- | ---: | ---: | --- | ---: | ---: | ---: |
| llama 1B Q4_0 | 727.75 MiB | 1.24 B | CPU | 1 | tg64 | 62.22 ± 0.14 |
| llama 1B Q4_0 | 727.75 MiB | 1.24 B | CPU | 2 | tg64 | 99.95 ± 0.22 |
| llama 1B Q4_0 | 727.75 MiB | 1.24 B | CPU | 4 | tg64 | 110.52 ± 3.40 |
| llama 1B Q4_0 | 727.75 MiB | 1.24 B | CPU | 6 | tg64 | 132.43 ± 4.76 |
| llama 1B Q4_0 | 727.75 MiB | 1.24 B | CPU | 8 | tg64 | 149.24 ± 1.45 |
| llama 1B Q4_0 | 727.75 MiB | 1.24 B | CPU | 10 | tg64 | 162.36 ± 2.87 |

build: 119d3bf (4542)

PP and TG bench on Galaxy S25 (Snapdragon 8 Elite): master-i8mm | kleidi-i8mm

Built with Android NDK 27c (armv8.7a+dotprod+i8mm)

./master/bin/llama-bench --mmap 0 -m /data/local/tmp/lmcp/../gguf/llama-v3.2-1b-instruct.q4_0.gguf -t 1,2,4,6 --delay 5 -p 128 -n 0

| model | size | params | backend | threads | mmap | test | t/s |
| --- | ---: | ---: | --- | ---: | ---: | ---: | ---: |
| llama 1B Q4_0 | 868.65 MiB | 1.50 B | CPU | 1 | 0 | pp128 | 181.23 ± 0.05 |
| llama 1B Q4_0 | 868.65 MiB | 1.50 B | CPU | 2 | 0 | pp128 | 314.97 ± 5.73 |
| llama 1B Q4_0 | 868.65 MiB | 1.50 B | CPU | 4 | 0 | pp128 | 259.68 ± 3.36 |
| llama 1B Q4_0 | 868.65 MiB | 1.50 B | CPU | 6 | 0 | pp128 | 377.92 ± 5.39 |

build: 6eecde3 (4621)

./kai/bin/llama-bench --mmap 0 -m /data/local/tmp/lmcp/../gguf/llama-v3.2-1b-instruct.q4_0.gguf -t 1,2,4,6 --delay 5 -p 128 -n 0

| model | size | params | backend | threads | mmap | test | t/s |
| --- | ---: | ---: | --- | ---: | ---: | ---: | ---: |
| llama 1B Q4_0 | 868.65 MiB | 1.50 B | CPU | 1 | 0 | pp128 | 174.13 ± 0.05 |
| llama 1B Q4_0 | 868.65 MiB | 1.50 B | CPU | 2 | 0 | pp128 | 290.73 ± 4.80 |
| llama 1B Q4_0 | 868.65 MiB | 1.50 B | CPU | 4 | 0 | pp128 | 244.46 ± 0.99 |
| llama 1B Q4_0 | 868.65 MiB | 1.50 B | CPU | 6 | 0 | pp128 | 344.21 ± 1.75 |

build: 119d3bf (4542)

./master/bin/llama-bench --mmap 0 -m /data/local/tmp/lmcp/../gguf/llama-v3.2-1b-instruct.q4_0.gguf -t 1,2,4,6 --delay 5 -p 0 -n 64

| model | size | params | backend | threads | mmap | test | t/s |
| --- | ---: | ---: | --- | ---: | ---: | ---: | ---: |
| llama 1B Q4_0 | 868.65 MiB | 1.50 B | CPU | 1 | 0 | tg64 | 35.71 ± 0.17 |
| llama 1B Q4_0 | 868.65 MiB | 1.50 B | CPU | 2 | 0 | tg64 | 58.65 ± 1.43 |
| llama 1B Q4_0 | 868.65 MiB | 1.50 B | CPU | 4 | 0 | tg64 | 58.72 ± 0.64 |
| llama 1B Q4_0 | 868.65 MiB | 1.50 B | CPU | 6 | 0 | tg64 | 71.24 ± 0.68 |

build: 6eecde3 (4621)

./kai/bin/llama-bench --mmap 0 -m /data/local/tmp/lmcp/../gguf/llama-v3.2-1b-instruct.q4_0.gguf -t 1,2,4,6 --delay 5 -p 0 -n 64

| model | size | params | backend | threads | mmap | test | t/s |
| --- | ---: | ---: | --- | ---: | ---: | ---: | ---: |
| llama 1B Q4_0 | 868.65 MiB | 1.50 B | CPU | 1 | 0 | tg64 | 34.83 ± 0.51 |
| llama 1B Q4_0 | 868.65 MiB | 1.50 B | CPU | 2 | 0 | tg64 | 56.47 ± 1.56 |
| llama 1B Q4_0 | 868.65 MiB | 1.50 B | CPU | 4 | 0 | tg64 | 57.56 ± 0.69 |
| llama 1B Q4_0 | 868.65 MiB | 1.50 B | CPU | 6 | 0 | tg64 | 69.46 ± 0.73 |

build: 119d3bf (4542)

@max-krasnyansky (Collaborator) commented

> What performance can we expect with KleidiAI over the current kernels?
>
> I tried this on M3 Max, and at least with this hardware it does not seem to be faster than the current AARCH64 kernels:
>
> master:
>
> | model | size | params | backend | threads | test | t/s |
> | --- | ---: | ---: | --- | ---: | ---: | ---: |
> | llama 7B Q4_0 | 3.56 GiB | 6.74 B | CPU | 12 | pp128 | 229.89 ± 1.38 |
> | llama 7B Q4_0 | 3.56 GiB | 6.74 B | CPU | 12 | tg32 | 31.54 ± 0.01 |
>
> PR:
>
> | model | size | params | backend | threads | test | t/s |
> | --- | ---: | ---: | --- | ---: | ---: | ---: |
> | llama 7B Q4_0 | 3.56 GiB | 6.74 B | CPU | 12 | pp128 | 192.93 ± 0.14 |
> | llama 7B Q4_0 | 3.56 GiB | 6.74 B | CPU | 12 | tg32 | 30.54 ± 0.00 |

@slaren See my report above. Our I8MM kernels in master are definitely more performant.

@max-krasnyansky (Collaborator) commented

> What performance can we expect with KleidiAI over the current kernels?
>
> KleidiAI's performance is generally comparable to the current AARCH64 kernels. However, we have observed performance degradation when using a high number of threads on hardware with high core counts. We are actively investigating this issue and will update the PR with a fix.
>
> KleidiAI is under active development, and we expect continuous improvements over time. Future updates should further optimize performance across different hardware configurations and add additional kernels.

I'd say I see worse performance across the board, even on 1-2 threads with the I8MM kernels. See my report from the M4 Pro and Snapdragon 8 Elite above.

@max-krasnyansky (Collaborator) commented

> Is this supported on other platforms, such as the Raspberry Pi and Android phones with Snapdragon?

None of the currently available Snapdragon-based devices support SME. You're better off using the CPU backend with I8MM (-march=armv8.7a) or the OpenCL Adreno backend.
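
For reference, a cross-compile invocation along those lines might be (a sketch; the NDK path and API level are placeholders):

$ cmake -B build-android \
      -DCMAKE_TOOLCHAIN_FILE=$ANDROID_NDK/build/cmake/android.toolchain.cmake \
      -DANDROID_ABI=arm64-v8a -DANDROID_PLATFORM=android-28 \
      -DGGML_NATIVE=OFF -DGGML_CPU_ARM_ARCH=armv8.7a+dotprod+i8mm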

@chaxu01 (Collaborator, Author) commented Feb 5, 2025

@max-krasnyansky Thanks for running the benchmarks. It looks like the commit f4eb1b3, which adds support for multithreaded LHS packing, wasn’t included in your tests.

> Have you guys considered my suggestion above (i.e. a mix of SME and I8MM threads)?

Yes, we've considered this approach. The main issue is that different kernels may use different RHS encodings, which would require repacking and storing multiple copies of the weights in memory, one copy per kernel. This would increase memory usage and model loading time.

BTW, I've updated the PR with commit 3e08f37, which switches the kernel priority order to prefer dotprod over i8mm, as we observed that dotprod gives better performance when supported.

@chaxu01 (Collaborator, Author) commented Feb 7, 2025

@slaren Thanks for the review. I've pushed a new commit addressing your comments. Please let me know if any further changes are needed for the PR.

@ggerganov (Member) left a review comment:

Reorganize and rename the files in the following structure, following the amx code as an example:

# old
ggml-kleidiai/
├── ggml-kleidiai.cpp
├── ggml-kleidiai.h
├── kleidiai_kernels.cpp
└── kleidiai_kernels.h

# new
kleidiai/
├── kleidiai.cpp
├── kleidiai.h
├── kernels.cpp
└── kernels.h

@chaxu01 (Collaborator, Author) commented Feb 10, 2025

@ggerganov Thanks for your code review. I've pushed a commit that addresses your comments. Please let me know if any further changes are needed for the PR.

@@ -2,10 +2,6 @@
// SPDX-License-Identifier: MIT
//

#pragma once
@ggerganov (Member) commented on this diff:

The #pragma once should still be present in the header.

@chaxu01 (Collaborator, Author) commented Feb 10, 2025

@ggerganov Thanks again for the quick review. I've pushed a new commit to address your comments.

@ggerganov (Member) commented

Is this expected to run on M4 Mac Mini? I think it should support SME, but I get the following build error:

cmake -D CMAKE_BUILD_TYPE=Release -D GGML_METAL=OFF -D GGML_BLAS=OFF -D GGML_CPU_KLEIDIAI=ON ..

-- The C compiler identification is AppleClang 16.0.0.16000026
-- The CXX compiler identification is AppleClang 16.0.0.16000026
...
-- ARM detected
-- Performing Test GGML_COMPILER_SUPPORTS_FP16_FORMAT_I3E
-- Performing Test GGML_COMPILER_SUPPORTS_FP16_FORMAT_I3E - Failed
-- ARM -mcpu not found, -mcpu=native will be used
-- Performing Test GGML_MACHINE_SUPPORTS_dotprod
-- Performing Test GGML_MACHINE_SUPPORTS_dotprod - Success
-- Performing Test GGML_MACHINE_SUPPORTS_i8mm
-- Performing Test GGML_MACHINE_SUPPORTS_i8mm - Success
-- Performing Test GGML_MACHINE_SUPPORTS_sve
-- Performing Test GGML_MACHINE_SUPPORTS_sve - Failed
-- Performing Test GGML_MACHINE_SUPPORTS_nosve
-- Performing Test GGML_MACHINE_SUPPORTS_nosve - Success
-- Performing Test GGML_MACHINE_SUPPORTS_sme
-- Performing Test GGML_MACHINE_SUPPORTS_sme - Success
-- ARM feature DOTPROD enabled
-- ARM feature MATMUL_INT8 enabled
-- ARM feature FMA enabled
-- ARM feature FP16_VECTOR_ARITHMETIC enabled
-- ARM feature SME enabled
-- Using KleidiAI optimized kernels if applicable
-- The ASM compiler identification is AppleClang
-- Found assembler: /Library/Developer/CommandLineTools/usr/bin/cc
-- Adding CPU backend variant ggml-cpu: -mcpu=native+dotprod+i8mm+nosve+sme 
...

make -j

[ 16%] Building C object ggml/src/CMakeFiles/ggml-cpu.dir/__/__/_deps/kleidiai_download-src/kai/ukernels/matmul/matmul_clamp_f32_qsi8d32p_qsi4c32p/kai_matmul_clamp_f32_qsi8d32p1x4_qsi4c32p4vlx4_1x4vl_sme2_sdot.c.o
/Users/ggml/work/llama.cpp/build-kai/_deps/kleidiai_download-src/kai/ukernels/matmul/matmul_clamp_f32_qsi8d32p_qsi4c32p/kai_matmul_clamp_f32_qsi8d32p1vlx4_qsi4c32p4vlx4_1vlx4vl_sme2_mopa.c:8:2: error: This file must be compiled for AArch64, FEAT_SVE2 or FEAT_SME2.
    8 | #error This file must be compiled for AArch64, FEAT_SVE2 or FEAT_SME2.
      |  ^
/Users/ggml/work/llama.cpp/build-kai/_deps/kleidiai_download-src/kai/ukernels/matmul/matmul_clamp_f32_qsi8d32p_qsi4c32p/kai_matmul_clamp_f32_qsi8d32p1vlx4_qsi4c32p4vlx4_1vlx4vl_sme2_mopa.c:387:41: warning: ISO C requires a translation unit to contain at least one declaration [-Wempty-translation-unit]
  387 | #endif  // Architectural features check.
      |                                         ^
1 warning and 1 error generated.
make[2]: *** [ggml/src/CMakeFiles/ggml-cpu.dir/__/__/_deps/kleidiai_download-src/kai/ukernels/matmul/matmul_clamp_f32_qsi8d32p_qsi4c32p/kai_matmul_clamp_f32_qsi8d32p1vlx4_qsi4c32p4vlx4_1vlx4vl_sme2_mopa.c.o] Error 1
make[2]: *** Waiting for unfinished jobs....
/Users/ggml/work/llama.cpp/build-kai/_deps/kleidiai_download-src/kai/ukernels/matmul/matmul_clamp_f32_qsi8d32p_qsi4c32p/kai_matmul_clamp_f32_qsi8d32p1x4_qsi4c32p4vlx4_1x4vl_sme2_sdot.c:8:2: error: This file must be compiled for AArch64, FEAT_SVE2 or FEAT_SME2.
    8 | #error This file must be compiled for AArch64, FEAT_SVE2 or FEAT_SME2.
      |  ^
/Users/ggml/work/llama.cpp/build-kai/_deps/kleidiai_download-src/kai/ukernels/matmul/matmul_clamp_f32_qsi8d32p_qsi4c32p/kai_matmul_clamp_f32_qsi8d32p1x4_qsi4c32p4vlx4_1x4vl_sme2_sdot.c:357:41: warning: ISO C requires a translation unit to contain at least one declaration [-Wempty-translation-unit]
  357 | #endif  // Architectural features check.
      |                                         ^
1 warning and 1 error generated.
make[2]: *** [ggml/src/CMakeFiles/ggml-cpu.dir/__/__/_deps/kleidiai_download-src/kai/ukernels/matmul/matmul_clamp_f32_qsi8d32p_qsi4c32p/kai_matmul_clamp_f32_qsi8d32p1x4_qsi4c32p4vlx4_1x4vl_sme2_sdot.c.o] Error 1
make[1]: *** [ggml/src/CMakeFiles/ggml-cpu.dir/all] Error 2
make: *** [all] Error 2

Commit: 02315a8

@chaxu01 (Collaborator, Author) commented Feb 12, 2025

We recently identified that our fix for an Android build issue unintentionally introduced an SME-related build error. We're currently working on a resolution and will upload a fix soon.

@chaxu01 (Collaborator, Author) commented Feb 17, 2025

@ggerganov I’ve updated the patch to address your previous comments. Let me know if you have any further feedback or concerns. Happy to make any additional adjustments if needed.

@rmatif commented Feb 18, 2025

I tested it on my Galaxy A34 running Android and noticed a slight degradation in performance.

This PR

~/kleidiai/llama.cpp $ ./build/bin/llama-bench -m ../../Llama-3.2-1B-Instruct-Q4_0.gguf -t 8

| model | size | params | backend | threads | test | t/s |
| --- | ---: | ---: | --- | ---: | ---: | ---: |
| llama 1B Q4_0 | 729.75 MiB | 1.24 B | CPU | 8 | pp512 | 87.11 ± 2.35 |
| llama 1B Q4_0 | 729.75 MiB | 1.24 B | CPU | 8 | tg128 | 11.00 ± 0.23 |

Master build

build: 63ac128 (4738)

~/ggml/llama.cpp $ ./build/bin/llama-bench -m ../../Llama-3.2-1B-Instruct-Q4_0.gguf -t 8

| model | size | params | backend | threads | test | t/s |
| --- | ---: | ---: | --- | ---: | ---: | ---: |
| llama 1B Q4_0 | 729.75 MiB | 1.24 B | CPU | 8 | pp512 | 84.25 ± 3.21 |
| llama 1B Q4_0 | 729.75 MiB | 1.24 B | CPU | 8 | tg128 | 13.79 ± 0.46 |

@chaxu01 (Collaborator, Author) commented Feb 19, 2025

@rmatif , thanks for testing this PR on your Galaxy A34 and for the feedback. The KleidiAI kernels integrated into llama.cpp for this PR are currently optimized for lower thread counts, which may explain the performance degradation you observed. I’ve tested it on a Pixel 8 (Cortex-X + 4x Cortex-A715), and the performance figures show that KleidiAI performs slightly better when using between 1 and 4 cores. However, we’re aware of the performance gap at higher thread counts and are actively working on addressing it. Improvements for better scaling with more threads will be included in a future commit.

PR:

| model                          |       size |     params | backend    | threads |          test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | ------------: | -------------------: |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | CPU        |       1 |         pp512 |         20.48 ± 0.29 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | CPU        |       1 |         tg128 |         11.58 ± 0.02 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | CPU        |       2 |         pp512 |         40.40 ± 0.24 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | CPU        |       2 |         tg128 |         21.40 ± 0.47 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | CPU        |       4 |         pp512 |         74.00 ± 0.30 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | CPU        |       4 |         tg128 |         28.27 ± 0.03 |

build: 84a55f15 (4599)

Master:

| model                          |       size |     params | backend    | threads |          test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | ------------: | -------------------: |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | CPU        |       1 |         pp512 |         19.33 ± 0.14 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | CPU        |       1 |         tg128 |          9.75 ± 0.04 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | CPU        |       2 |         pp512 |         37.59 ± 0.38 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | CPU        |       2 |         tg128 |         18.48 ± 0.12 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | CPU        |       4 |         pp512 |         69.59 ± 0.95 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | CPU        |       4 |         tg128 |         27.42 ± 1.21 |

build: d774ab3a (4644)

set(PRIVATE_ARCH_FLAGS "${PRIVATE_ARCH_FLAGS}+sve+sve2")
endif()

list(APPEND GGML_CDEF_PUBLIC GGML_USE_CPU_KLEIDIAI)
A maintainer (Member) commented on this diff:
GGML_CDEF_PUBLIC has been removed, and this will have no effect. If it is important to add this definition, it should be done with target_compile_definitions.
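
A sketch of the suggested replacement (the ggml-cpu target name and PUBLIC visibility here are illustrative):

    target_compile_definitions(ggml-cpu PUBLIC GGML_USE_CPU_KLEIDIAI)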

@ggerganov (Member) left a review:

Let's merge after addressing the GGML_CDEF_PUBLIC comment and if there are no other reviews.

Using the new Arm instruction sets in ggml will be important for CPU-based on-device inference and it seems that the KleidiAI library is focused on adding support for these hardware features. Note that we generally avoid adding 3rd-party dependencies to the project, mainly to keep things simple and easy to deploy. Still, the KAI library appears to be lightweight and designed in a way that is easy to integrate, so I think it's OK to make an exception and add this option with the prospect of better on-device performance in the future. For the time being, usage of the KAI kernels will be gated by the build and environment flags. My understanding is that the performance of the implementation will improve in the future and will focus on low-power use cases, so will be looking forward to that.

It would be very useful to add CI workflows to cover these instruction sets in order to improve long-term support and maintenance.

@chaxu01 Will be pinging you for support for any issues that might arise from these changes.

@chaxu01 (Collaborator, Author) commented Feb 20, 2025

@ggerganov @slaren Thanks for the review and for considering the integration of KleidiAI. I've updated the PR to address the GGML_CDEF_PUBLIC comment.
We understand the preference for minimizing third-party dependencies, and we've designed the KleidiAI integration to be gated behind build and environment flags. Our focus remains on enhancing the KleidiAI performance, and we appreciate the opportunity to contribute these optimizations.
Regarding CI workflows, that's a great suggestion. We'll explore this topic.
I'll also be available to assist with any issues that arise from these changes—please feel free to reach out anytime.
Thanks again for the feedback and for considering this PR. Looking forward to further improvements ahead.

@ggerganov merged commit c5d91a7 into ggml-org:master on Feb 20, 2025 (42 of 45 checks passed).
@jishminor commented Feb 20, 2025

As a heads up: if you build on Ubuntu 22.04 with the apt-default CMake 3.22, the build with KleidiAI enabled will fail, because that CMake version doesn't accept:

FetchContent_Declare(KleidiAI_Download
            URL ${KLEIDIAI_DOWNLOAD_URL}
            DOWNLOAD_EXTRACT_TIMESTAMP NEW
            URL_HASH MD5=${KLEIDIAI_ARCHIVE_MD5})

It seems that support for DOWNLOAD_EXTRACT_TIMESTAMP is only available in CMake 3.24 and later.

Relevant logs:

[cmake] -- Using KleidiAI optimized kernels if applicable
[cmake] CMake Error at /usr/share/cmake-3.22/Modules/ExternalProject.cmake:2806 (message):
[cmake]   At least one entry of URL is a path (invalid in a list)
[cmake] Call Stack (most recent call first):
[cmake]   /usr/share/cmake-3.22/Modules/ExternalProject.cmake:3716 (_ep_add_download_command)
[cmake]   CMakeLists.txt:15 (ExternalProject_Add)
[cmake] 
[cmake] 
[cmake] -- Configuring incomplete, errors occurred!
[cmake] See also "/home/ubuntu/wavefront_llama/out/build/arm-linux-relwithdebinfo-kleidiai/_deps/kleidiai_download-subbuild/CMakeFiles/CMakeOutput.log".
[cmake] 
[cmake] CMake Error at /usr/share/cmake-3.22/Modules/FetchContent.cmake:1075 (message):
[cmake]   CMake step for kleidiai_download failed: 1
[cmake] Call Stack (most recent call first):
[cmake]   /usr/share/cmake-3.22/Modules/FetchContent.cmake:1216:EVAL:2 (__FetchContent_directPopulate)
[cmake]   /usr/share/cmake-3.22/Modules/FetchContent.cmake:1216 (cmake_language)
[cmake]   /usr/share/cmake-3.22/Modules/FetchContent.cmake:1259 (FetchContent_Populate)
[cmake]   ggml/src/ggml-cpu/CMakeLists.txt:342 (FetchContent_MakeAvailable)
[cmake]   ggml/src/CMakeLists.txt:318 (ggml_add_cpu_backend_variant_impl)
[cmake] 
[cmake] 
[cmake] -- Configuring incomplete, errors occurred!

Upgrading CMake fixed this.
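
For builds that must keep supporting older CMake, the option could also be passed conditionally (a sketch, not something this PR does; KAI_EXTRA_ARGS is a hypothetical helper variable):

    if (CMAKE_VERSION VERSION_GREATER_EQUAL "3.24")
        set(KAI_EXTRA_ARGS DOWNLOAD_EXTRACT_TIMESTAMP NEW)
    endif()
    FetchContent_Declare(KleidiAI_Download
        URL ${KLEIDIAI_DOWNLOAD_URL}
        URL_HASH MD5=${KLEIDIAI_ARCHIVE_MD5}
        ${KAI_EXTRA_ARGS})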
