
ggml-cpu: Add CPU backend support for KleidiAI library #11390

Merged 10 commits into ggml-org:master on Feb 20, 2025

Conversation

@chaxu01 (Collaborator) commented Jan 24, 2025

This commit integrates Arm's KleidiAI library, which provides optimized matrix-multiplication (matmul) kernels tailored for hardware features such as SME, i8mm, and dot-product acceleration. The feature can be enabled with the build option GGML_CPU_KLEIDIAI.
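
For reference, a minimal configure-and-build invocation with the feature enabled might look like this (a sketch; everything besides GGML_CPU_KLEIDIAI is standard CMake usage):

$ cmake -B build -DCMAKE_BUILD_TYPE=Release -DGGML_CPU_KLEIDIAI=ON
$ cmake --build build -j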

The github-actions bot added the ggml label (changes relating to the ggml tensor library for machine learning) on Jan 24, 2025.
@njsyw1997 commented

Is this supported on other platforms, such as the Raspberry Pi and Android phones with Snapdragon?

@chaxu01 (Collaborator, Author) commented Jan 30, 2025 via email

@slaren (Member) commented Jan 30, 2025

What performance can we expect with KleidiAI over the current kernels?

I tried this on M3 Max, and at least with this hardware it does not seem to be faster than the current AARCH64 kernels:

master:

| model | size | params | backend | threads | test | t/s |
| --- | ---: | ---: | --- | ---: | ---: | ---: |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | CPU | 12 | pp128 | 229.89 ± 1.38 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | CPU | 12 | tg32 | 31.54 ± 0.01 |

PR:

| model | size | params | backend | threads | test | t/s |
| --- | ---: | ---: | --- | ---: | ---: | ---: |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | CPU | 12 | pp128 | 192.93 ± 0.14 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | CPU | 12 | tg32 | 30.54 ± 0.00 |

@njsyw1997 commented

Cross-compiling for the Android platform as described in android.md reports an error:

CMake Error at ggml/src/ggml-cpu/CMakeLists.txt:373 (string):
  string sub-command FIND requires 3 or 4 parameters.
Call Stack (most recent call first):
  ggml/src/CMakeLists.txt:301 (ggml_add_cpu_backend_variant_impl)


CMake Error at ggml/src/ggml-cpu/CMakeLists.txt:374 (string):
  string sub-command FIND requires 3 or 4 parameters.
Call Stack (most recent call first):
  ggml/src/CMakeLists.txt:301 (ggml_add_cpu_backend_variant_impl)


CMake Error at ggml/src/ggml-cpu/CMakeLists.txt:375 (string):
  string sub-command FIND requires 3 or 4 parameters.
Call Stack (most recent call first):
  ggml/src/CMakeLists.txt:301 (ggml_add_cpu_backend_variant_impl)

Could you please have a look?
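
For context, this particular CMake error typically appears when string(FIND) is handed an unquoted variable that expands to nothing, so the sub-command receives too few arguments. A minimal, hypothetical reproduction (not the actual lines from ggml-cpu/CMakeLists.txt):

    set(ARCH_FLAGS "")                          # empty, e.g. never populated for this toolchain
    string(FIND ${ARCH_FLAGS} "sme" SME_POS)    # unquoted empty var -> only 2 parameters -> error
    string(FIND "${ARCH_FLAGS}" "sme" SME_POS)  # quoting preserves the empty string as an argument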

@chaxu01 (Collaborator, Author) commented Jan 31, 2025

@njsyw1997 we'll look into the issue. Did you use Termux or Android NDK for the Android build?

@chaxu01 (Collaborator, Author) commented Jan 31, 2025

> What performance can we expect with KleidiAI over the current kernels?

KleidiAI's performance is generally comparable to the current AARCH64 kernels. However, we have observed performance degradation when using a high number of threads on hardware with high core counts. We are actively investigating this issue and will update the PR with a fix.

KleidiAI is under active development, and we expect continuous improvements over time. Future updates should further optimize performance across different hardware configurations and add additional kernels.

@njsyw1997 commented

@chaxu01 I am using the Android SDK, cross-compiling on a Linux x86 server. It seems that ARCH_FLAGS is not properly set in this CMakeLists.txt.

Compiling natively on a Raspberry Pi 5 (quad-core Cortex-A76) also fails. CMake emits this warning:

-- Adding CPU backend variant ggml-cpu: -mcpu=cortex-a76+crypto+dotprod+noi8mm+nosve+nosme
cc1: error: invalid feature modifier ‘sme’ in ‘-mcpu=cortex-a76+crypto+dotprod+noi8mm+nosve+nosme’
cc1: note: valid arguments are: fp simd crypto crc lse fp16 rcpc rdma dotprod aes sha2 sha3 sm4 fp16fml sve profile rng memtag sb ssbs predres sve2 sve2-sm4 sve2-aes sve2-sha3 sve2-bitperm tme i8mm f32mm f64mm bf16 flagm pauth ls64 mops; did you mean ‘sm4’?
CMake Warning at ggml/src/ggml-cpu/CMakeLists.txt:152 (message):
  Failed to get ARM features
Call Stack (most recent call first):
  ggml/src/CMakeLists.txt:301 (ggml_add_cpu_backend_variant_impl)

Ignoring the warning and compiling anyway yields:

cc1plus: error: invalid feature modifier ‘sme’ in ‘-mcpu=cortex-a76+crypto+dotprod+noi8mm+nosve+nosme’
cc1plus: note: valid arguments are: fp simd crypto crc lse fp16 rcpc rdma dotprod aes sha2 sha3 sm4 fp16fml sve profile rng memtag sb ssbs predres sve2 sve2-sm4 sve2-aes sve2-sha3 sve2-bitperm tme i8mm f32mm f64mm bf16 flagm pauth ls64 mops; did you mean ‘sm4’?
cc1plus: error: invalid feature modifier ‘sme’ in ‘-mcpu=cortex-a76+crypto+dotprod+noi8mm+nosve+nosme’
cc1plus: note: valid arguments are: fp simd crypto crc lse fp16 rcpc rdma dotprod aes sha2 sha3 sm4 fp16fml sve profile rng memtag sb ssbs predres sve2 sve2-sm4 sve2-aes sve2-sha3 sve2-bitperm tme i8mm f32mm f64mm bf16 flagm pauth ls64 mops; did you mean ‘sm4’?
make[2]: *** [ggml/src/CMakeFiles/ggml-cpu.dir/build.make:90: ggml/src/CMakeFiles/ggml-cpu.dir/ggml-cpu/ggml-cpu.cpp.o] Error 1
make[2]: *** Waiting for unfinished jobs....
make[2]: *** [ggml/src/CMakeFiles/ggml-cpu.dir/build.make:104: ggml/src/CMakeFiles/ggml-cpu.dir/ggml-cpu/ggml-cpu-aarch64.cpp.o] Error 1
cc1plus: error: invalid feature modifier ‘sme’ in ‘-mcpu=cortex-a76+crypto+dotprod+noi8mm+nosve+nosme’
cc1plus: note: valid arguments are: fp simd crypto crc lse fp16 rcpc rdma dotprod aes sha2 sha3 sm4 fp16fml sve profile rng memtag sb ssbs predres sve2 sve2-sm4 sve2-aes sve2-sha3 sve2-bitperm tme i8mm f32mm f64mm bf16 flagm pauth ls64 mops; did you mean ‘sm4’?
make[2]: *** [ggml/src/CMakeFiles/ggml-cpu.dir/build.make:118: ggml/src/CMakeFiles/ggml-cpu.dir/ggml-cpu/ggml-cpu-hbm.cpp.o] Error 1
cc1: error: invalid feature modifier ‘sme’ in ‘-mcpu=cortex-a76+crypto+dotprod+noi8mm+nosve+nosme’
cc1: note: valid arguments are: fp simd crypto crc lse fp16 rcpc rdma dotprod aes sha2 sha3 sm4 fp16fml sve profile rng memtag sb ssbs predres sve2 sve2-sm4 sve2-aes sve2-sha3 sve2-bitperm tme i8mm f32mm f64mm bf16 flagm pauth ls64 mops; did you mean ‘sm4’?
make[2]: *** [ggml/src/CMakeFiles/ggml-cpu.dir/build.make:76: ggml/src/CMakeFiles/ggml-cpu.dir/ggml-cpu/ggml-cpu.c.o] Error 1
make[1]: *** [CMakeFiles/Makefile2:1757: ggml/src/CMakeFiles/ggml-cpu.dir/all] Error 2
make: *** [Makefile:146: all] Error 2
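
For context, the GCC used here does not know the sme feature modifier at all, so any -mcpu string containing +sme or +nosme is rejected outright. A probe in the spirit of the existing GGML_MACHINE_SUPPORTS_* checks (a sketch, not the PR's actual fix) would catch this before the flag reaches the build:

    include(CheckCXXCompilerFlag)
    check_cxx_compiler_flag("-mcpu=native+nosme" GGML_MACHINE_SUPPORTS_nosme)
    if (NOT GGML_MACHINE_SUPPORTS_nosme)
        # compiler rejects the modifier entirely; leave +nosme out of ARCH_FLAGS
    endif()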

@chaxu01 (Collaborator, Author) commented Feb 3, 2025

@njsyw1997 Thanks for the info; we'll update the PR to fix the issue.

@max-krasnyansky (Collaborator) commented Feb 5, 2025

I've been looking forward to getting SME enabled and finally had some time to review and benchmark this PR.

Based on my observations so far, SME on the M4 does not scale beyond two threads. This is aligned with findings from others (e.g. https://scalable.uni-jena.de/opt/sme/micro.html). It looks like the M4 Pro has two SME units (i.e. SMT sharing of the hardware).

See the benchmark results below. I compared llama.cpp master-I8MM vs kleidi-I8MM vs kleidi-SME. SME does much better on 1 and 2 threads and much worse from then on. Tested with both Apple Clang and LLVM 19.

It looks like even the Kleidi I8MM kernels perform worse than what we have in master. After noticing that on the M4 Pro, I ran the benchmarks on the Galaxy S25 (Snapdragon 8 Elite). The 8 Elite doesn't support SME, so I just compared master-I8MM vs kleidi-I8MM, and I see the same issue (i.e. Kleidi performs worse).

My thinking so far is that using SME-only kernels for the LLM is not going to work very well. Instead, we should be starting 2x SME threads and 8x I8MM threads (the M4 Pro has 10 performance cores) to get the best performance.

@chaxu01 Have you guys considered my suggestion above (i.e. a mix of SME and I8MM threads)? SME kernels would have to work on the same tensor layout as the I8MM kernels, so perhaps there will be additional overhead to load that layout into the ZA array. Hopefully not too much, though.

Build with Apple Clang

$ cmake -D CMAKE_BUILD_TYPE=Release -D GGML_METAL=OFF -D GGML_BLAS=OFF -D GGML_CPU_KLEIDIAI=ON -G Ninja -B build-macos .

-- ARM feature DOTPROD enabled
-- ARM feature MATMUL_INT8 enabled
-- ARM feature FMA enabled
-- ARM feature FP16_VECTOR_ARITHMETIC enabled
-- ARM feature SME enabled
-- Using KleidiAI optimized kernels if applicable
-- The ASM compiler identification is AppleClang
-- Found assembler: /Library/Developer/CommandLineTools/usr/bin/cc
-- Adding CPU backend variant ggml-cpu: -mcpu=native+dotprod+i8mm+nosve+sme 

Build with LLVM Clang 19.1.7 (installed via homebrew)

$ CC=clang CXX=clang++ CFLAGS="-march=armv9.2a+nosve+sme" CXXFLAGS="-march=armv9.2a+nosve+sme" cmake -D CMAKE_BUILD_TYPE=Release -D GGML_METAL=OFF -D GGML_BLAS=OFF -D GGML_CPU_KLEIDIAI=ON -D GGML_NATIVE=OFF -D GGML_CPU_ARM_ARCH="armv9.2a+nosve+sme+dotprod+i8mm" -D GGML_OPENMP=OFF -G Ninja -B build-macos-llvm

-- The C compiler identification is Clang 19.1.7
-- The CXX compiler identification is Clang 19.1.7
...
-- ARM feature DOTPROD enabled
-- ARM feature MATMUL_INT8 enabled
-- ARM feature FMA enabled
-- ARM feature FP16_VECTOR_ARITHMETIC enabled
-- ARM feature SME enabled
-- Using KleidiAI optimized kernels if applicable
-- The ASM compiler identification is Clang with GNU-like command-line
-- Found assembler: /opt/homebrew/opt/llvm/bin/clang
-- Adding CPU backend variant ggml-cpu: -march=armv9.2a+nosve+sme+dotprod+i8mm

PP bench: master-i8mm, kleidi-i8mm, kleidi-sme

~/src/llama.cpp-master$ ./build-macos/bin/llama-bench -m ../gguf/llama-v3.2-1b-instruct.q4_0.gguf -ngl 0 -t 1,2,4,6,8,10 -p 128 -n 0 --delay 5

| model | size | params | backend | threads | test | t/s |
| --- | ---: | ---: | --- | ---: | ---: | ---: |
| llama 1B Q4_0 | 727.75 MiB | 1.24 B | CPU | 1 | pp128 | 160.94 ± 0.42 |
| llama 1B Q4_0 | 727.75 MiB | 1.24 B | CPU | 2 | pp128 | 312.10 ± 2.97 |
| llama 1B Q4_0 | 727.75 MiB | 1.24 B | CPU | 4 | pp128 | 586.83 ± 0.75 |
| llama 1B Q4_0 | 727.75 MiB | 1.24 B | CPU | 6 | pp128 | 829.32 ± 6.44 |
| llama 1B Q4_0 | 727.75 MiB | 1.24 B | CPU | 8 | pp128 | 982.49 ± 18.52 |
| llama 1B Q4_0 | 727.75 MiB | 1.24 B | CPU | 10 | pp128 | 1173.93 ± 24.68 |

build: 6eecde3 (4621)

~/src/llama.cpp-kai$ GGML_KLEIDIAI_SME=0 ./build-macos/bin/llama-bench -m ../gguf/llama-v3.2-1b-instruct.q4_0.gguf -ngl 0 -t 1,2,4,6,8,10 -p 128 -n 0 --delay 5

| model | size | params | backend | threads | test | t/s |
| --- | ---: | ---: | --- | ---: | ---: | ---: |
| llama 1B Q4_0 | 727.75 MiB | 1.24 B | CPU | 1 | pp128 | 162.40 ± 0.28 |
| llama 1B Q4_0 | 727.75 MiB | 1.24 B | CPU | 2 | pp128 | 304.78 ± 2.64 |
| llama 1B Q4_0 | 727.75 MiB | 1.24 B | CPU | 4 | pp128 | 531.04 ± 0.76 |
| llama 1B Q4_0 | 727.75 MiB | 1.24 B | CPU | 6 | pp128 | 733.72 ± 1.68 |
| llama 1B Q4_0 | 727.75 MiB | 1.24 B | CPU | 8 | pp128 | 928.32 ± 3.53 |
| llama 1B Q4_0 | 727.75 MiB | 1.24 B | CPU | 10 | pp128 | 1082.70 ± 7.14 |

build: 119d3bf (4542)

~/src/llama.cpp-kai$ GGML_KLEIDIAI_SME=1 ./build-macos/bin/llama-bench -m ../gguf/llama-v3.2-1b-instruct.q4_0.gguf -ngl 0 -t 1,2,4,6,8,10 -p 128 -n 0 --delay 5

| model | size | params | backend | threads | test | t/s |
| --- | ---: | ---: | --- | ---: | ---: | ---: |
| llama 1B Q4_0 | 727.75 MiB | 1.24 B | CPU | 1 | pp128 | 205.39 ± 0.58 |
| llama 1B Q4_0 | 727.75 MiB | 1.24 B | CPU | 2 | pp128 | 389.43 ± 0.21 |
| llama 1B Q4_0 | 727.75 MiB | 1.24 B | CPU | 4 | pp128 | 326.86 ± 0.09 |
| llama 1B Q4_0 | 727.75 MiB | 1.24 B | CPU | 6 | pp128 | 396.02 ± 2.46 |
| llama 1B Q4_0 | 727.75 MiB | 1.24 B | CPU | 8 | pp128 | 466.28 ± 3.85 |
| llama 1B Q4_0 | 727.75 MiB | 1.24 B | CPU | 10 | pp128 | 435.39 ± 4.12 |

build: 119d3bf (4542)

TG bench: master-i8mm, kleidi-i8mm, kleidi-sme

~/src/llama.cpp-master$ ./build-macos/bin/llama-bench -m ../gguf/llama-v3.2-1b-instruct.q4_0.gguf -ngl 0 -t 1,2,4,6,8,10 -p 0 -n 64 --delay 5

| model | size | params | backend | threads | test | t/s |
| --- | ---: | ---: | --- | ---: | ---: | ---: |
| llama 1B Q4_0 | 727.75 MiB | 1.24 B | CPU | 1 | tg64 | 47.84 ± 0.08 |
| llama 1B Q4_0 | 727.75 MiB | 1.24 B | CPU | 2 | tg64 | 87.31 ± 0.16 |
| llama 1B Q4_0 | 727.75 MiB | 1.24 B | CPU | 4 | tg64 | 157.63 ± 1.01 |
| llama 1B Q4_0 | 727.75 MiB | 1.24 B | CPU | 6 | tg64 | 188.15 ± 9.39 |
| llama 1B Q4_0 | 727.75 MiB | 1.24 B | CPU | 8 | tg64 | 237.40 ± 0.71 |
| llama 1B Q4_0 | 727.75 MiB | 1.24 B | CPU | 10 | tg64 | 243.65 ± 0.14 |

build: 6eecde3 (4621)

~/src/llama.cpp-kai$ GGML_KLEIDIAI_SME=0 ./build-macos/bin/llama-bench -m ../gguf/llama-v3.2-1b-instruct.q4_0.gguf -ngl 0 -t 1,2,4,6,8,10 -p 0 -n 64 --delay 5

| model | size | params | backend | threads | test | t/s |
| --- | ---: | ---: | --- | ---: | ---: | ---: |
| llama 1B Q4_0 | 727.75 MiB | 1.24 B | CPU | 1 | tg64 | 48.15 ± 0.15 |
| llama 1B Q4_0 | 727.75 MiB | 1.24 B | CPU | 2 | tg64 | 86.01 ± 0.29 |
| llama 1B Q4_0 | 727.75 MiB | 1.24 B | CPU | 4 | tg64 | 153.94 ± 0.16 |
| llama 1B Q4_0 | 727.75 MiB | 1.24 B | CPU | 6 | tg64 | 194.26 ± 1.96 |
| llama 1B Q4_0 | 727.75 MiB | 1.24 B | CPU | 8 | tg64 | 226.74 ± 0.18 |
| llama 1B Q4_0 | 727.75 MiB | 1.24 B | CPU | 10 | tg64 | 232.53 ± 0.58 |

build: 119d3bf (4542)

~/src/llama.cpp-kai$ GGML_KLEIDIAI_SME=1 ./build-macos/bin/llama-bench -m ../gguf/llama-v3.2-1b-instruct.q4_0.gguf -ngl 0 -t 1,2,4,6,8,10 -p 0 -n 64 --delay 5

| model | size | params | backend | threads | test | t/s |
| --- | ---: | ---: | --- | ---: | ---: | ---: |
| llama 1B Q4_0 | 727.75 MiB | 1.24 B | CPU | 1 | tg64 | 62.22 ± 0.14 |
| llama 1B Q4_0 | 727.75 MiB | 1.24 B | CPU | 2 | tg64 | 99.95 ± 0.22 |
| llama 1B Q4_0 | 727.75 MiB | 1.24 B | CPU | 4 | tg64 | 110.52 ± 3.40 |
| llama 1B Q4_0 | 727.75 MiB | 1.24 B | CPU | 6 | tg64 | 132.43 ± 4.76 |
| llama 1B Q4_0 | 727.75 MiB | 1.24 B | CPU | 8 | tg64 | 149.24 ± 1.45 |
| llama 1B Q4_0 | 727.75 MiB | 1.24 B | CPU | 10 | tg64 | 162.36 ± 2.87 |

build: 119d3bf (4542)

PP and TG bench on Galaxy S25 (Snapdragon 8 Elite): master-i8mm | kleidi-i8mm

Built with Android NDK 27c (armv8.7a+dotprod+i8mm)

./master/bin/llama-bench --mmap 0 -m /data/local/tmp/lmcp/../gguf/llama-v3.2-1b-instruct.q4_0.gguf -t 1,2,4,6 --delay 5 -p 128 -n 0

| model | size | params | backend | threads | mmap | test | t/s |
| --- | ---: | ---: | --- | ---: | ---: | ---: | ---: |
| llama 1B Q4_0 | 868.65 MiB | 1.50 B | CPU | 1 | 0 | pp128 | 181.23 ± 0.05 |
| llama 1B Q4_0 | 868.65 MiB | 1.50 B | CPU | 2 | 0 | pp128 | 314.97 ± 5.73 |
| llama 1B Q4_0 | 868.65 MiB | 1.50 B | CPU | 4 | 0 | pp128 | 259.68 ± 3.36 |
| llama 1B Q4_0 | 868.65 MiB | 1.50 B | CPU | 6 | 0 | pp128 | 377.92 ± 5.39 |

build: 6eecde3 (4621)

./kai/bin/llama-bench --mmap 0 -m /data/local/tmp/lmcp/../gguf/llama-v3.2-1b-instruct.q4_0.gguf -t 1,2,4,6 --delay 5 -p 128 -n 0

| model | size | params | backend | threads | mmap | test | t/s |
| --- | ---: | ---: | --- | ---: | ---: | ---: | ---: |
| llama 1B Q4_0 | 868.65 MiB | 1.50 B | CPU | 1 | 0 | pp128 | 174.13 ± 0.05 |
| llama 1B Q4_0 | 868.65 MiB | 1.50 B | CPU | 2 | 0 | pp128 | 290.73 ± 4.80 |
| llama 1B Q4_0 | 868.65 MiB | 1.50 B | CPU | 4 | 0 | pp128 | 244.46 ± 0.99 |
| llama 1B Q4_0 | 868.65 MiB | 1.50 B | CPU | 6 | 0 | pp128 | 344.21 ± 1.75 |

build: 119d3bf (4542)

./master/bin/llama-bench --mmap 0 -m /data/local/tmp/lmcp/../gguf/llama-v3.2-1b-instruct.q4_0.gguf -t 1,2,4,6 --delay 5 -p 0 -n 64

| model | size | params | backend | threads | mmap | test | t/s |
| --- | ---: | ---: | --- | ---: | ---: | ---: | ---: |
| llama 1B Q4_0 | 868.65 MiB | 1.50 B | CPU | 1 | 0 | tg64 | 35.71 ± 0.17 |
| llama 1B Q4_0 | 868.65 MiB | 1.50 B | CPU | 2 | 0 | tg64 | 58.65 ± 1.43 |
| llama 1B Q4_0 | 868.65 MiB | 1.50 B | CPU | 4 | 0 | tg64 | 58.72 ± 0.64 |
| llama 1B Q4_0 | 868.65 MiB | 1.50 B | CPU | 6 | 0 | tg64 | 71.24 ± 0.68 |

build: 6eecde3 (4621)

./kai/bin/llama-bench --mmap 0 -m /data/local/tmp/lmcp/../gguf/llama-v3.2-1b-instruct.q4_0.gguf -t 1,2,4,6 --delay 5 -p 0 -n 64

| model | size | params | backend | threads | mmap | test | t/s |
| --- | ---: | ---: | --- | ---: | ---: | ---: | ---: |
| llama 1B Q4_0 | 868.65 MiB | 1.50 B | CPU | 1 | 0 | tg64 | 34.83 ± 0.51 |
| llama 1B Q4_0 | 868.65 MiB | 1.50 B | CPU | 2 | 0 | tg64 | 56.47 ± 1.56 |
| llama 1B Q4_0 | 868.65 MiB | 1.50 B | CPU | 4 | 0 | tg64 | 57.56 ± 0.69 |
| llama 1B Q4_0 | 868.65 MiB | 1.50 B | CPU | 6 | 0 | tg64 | 69.46 ± 0.73 |

build: 119d3bf (4542)

@max-krasnyansky (Collaborator) commented

> What performance can we expect with KleidiAI over the current kernels?
>
> I tried this on M3 Max, and at least with this hardware it does not seem to be faster than the current AARCH64 kernels:
>
> master:
>
> | model | size | params | backend | threads | test | t/s |
> | --- | ---: | ---: | --- | ---: | ---: | ---: |
> | llama 7B Q4_0 | 3.56 GiB | 6.74 B | CPU | 12 | pp128 | 229.89 ± 1.38 |
> | llama 7B Q4_0 | 3.56 GiB | 6.74 B | CPU | 12 | tg32 | 31.54 ± 0.01 |
>
> PR:
>
> | model | size | params | backend | threads | test | t/s |
> | --- | ---: | ---: | --- | ---: | ---: | ---: |
> | llama 7B Q4_0 | 3.56 GiB | 6.74 B | CPU | 12 | pp128 | 192.93 ± 0.14 |
> | llama 7B Q4_0 | 3.56 GiB | 6.74 B | CPU | 12 | tg32 | 30.54 ± 0.00 |

@slaren See my report above. Our I8MM kernels in master are definitely more performant.

@max-krasnyansky (Collaborator) commented

> What performance can we expect with KleidiAI over the current kernels?
>
> KleidiAI's performance is generally comparable to the current AARCH64 kernels. However, we have observed performance degradation when using a high number of threads on hardware with high core counts. We are actively investigating this issue and will update the PR with a fix.
>
> KleidiAI is under active development, and we expect continuous improvements over time. Future updates should further optimize performance across different hardware configurations and add additional kernels.

I'd say I see worse performance across the board, even on 1-2 threads with the I8MM kernels. See my report from the M4 Pro and Snapdragon 8 Elite above.

@max-krasnyansky (Collaborator) commented

> Is this supported on other platforms, such as the Raspberry Pi and Android phones with Snapdragon?

None of the currently available Snapdragon-based devices support SME. You're better off using the CPU backend with I8MM (-march=armv8.7a) or the OpenCL Adreno backend.
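
For reference, a cross-compile invocation along those lines might be (a sketch; the NDK path and API level are placeholders):

$ cmake -B build-android \
      -DCMAKE_TOOLCHAIN_FILE=$ANDROID_NDK/build/cmake/android.toolchain.cmake \
      -DANDROID_ABI=arm64-v8a -DANDROID_PLATFORM=android-28 \
      -DGGML_NATIVE=OFF -DGGML_CPU_ARM_ARCH=armv8.7a+dotprod+i8mm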

@chaxu01 (Collaborator, Author) commented Feb 5, 2025

@max-krasnyansky Thanks for running the benchmarks. It looks like the commit f4eb1b3, which adds support for multithreaded LHS packing, wasn’t included in your tests.

> Have you guys considered my suggestion above (i.e. a mix of SME and I8MM threads)?

Yes, we've considered this approach. The main issue is that different kernels may use different RHS encodings, which would require repacking and storing multiple copies of the weights in memory, one copy per kernel. This would increase memory usage and model loading time.

BTW, I've updated the PR with commit 3e08f37, which switches the kernel priority order to prefer dotprod over i8mm, as we observed that dotprod gives better performance when supported.

@chaxu01 (Collaborator, Author) commented Feb 7, 2025

@slaren Thanks for the review. I've pushed a new commit addressing your comments. Please let me know if any further changes are needed for the PR.

@ggerganov (Member) left a review comment:

Reorganize and rename the files in the following structure, following the amx code as an example:

# old
ggml-kleidiai/
├── ggml-kleidiai.cpp
├── ggml-kleidiai.h
├── kleidiai_kernels.cpp
└── kleidiai_kernels.h

# new
kleidiai/
├── kleidiai.cpp
├── kleidiai.h
├── kernels.cpp
└── kernels.h

@chaxu01 (Collaborator, Author) commented Feb 10, 2025

@ggerganov Thanks for your code review. I've pushed a commit that addresses your comments. Please let me know if any further changes are needed for the PR.

@@ -2,10 +2,6 @@
// SPDX-License-Identifier: MIT
//

#pragma once
@ggerganov (Member) commented on this diff:

The #pragma once should still be present in the header.

@chaxu01 (Collaborator, Author) commented Feb 10, 2025

@ggerganov Thanks again for the quick review. I've pushed a new commit to address your comments.

@ggerganov (Member) commented

Is this expected to run on M4 Mac Mini? I think it should support SME, but I get the following build error:

cmake -D CMAKE_BUILD_TYPE=Release -D GGML_METAL=OFF -D GGML_BLAS=OFF -D GGML_CPU_KLEIDIAI=ON ..

-- The C compiler identification is AppleClang 16.0.0.16000026
-- The CXX compiler identification is AppleClang 16.0.0.16000026
...
-- ARM detected
-- Performing Test GGML_COMPILER_SUPPORTS_FP16_FORMAT_I3E
-- Performing Test GGML_COMPILER_SUPPORTS_FP16_FORMAT_I3E - Failed
-- ARM -mcpu not found, -mcpu=native will be used
-- Performing Test GGML_MACHINE_SUPPORTS_dotprod
-- Performing Test GGML_MACHINE_SUPPORTS_dotprod - Success
-- Performing Test GGML_MACHINE_SUPPORTS_i8mm
-- Performing Test GGML_MACHINE_SUPPORTS_i8mm - Success
-- Performing Test GGML_MACHINE_SUPPORTS_sve
-- Performing Test GGML_MACHINE_SUPPORTS_sve - Failed
-- Performing Test GGML_MACHINE_SUPPORTS_nosve
-- Performing Test GGML_MACHINE_SUPPORTS_nosve - Success
-- Performing Test GGML_MACHINE_SUPPORTS_sme
-- Performing Test GGML_MACHINE_SUPPORTS_sme - Success
-- ARM feature DOTPROD enabled
-- ARM feature MATMUL_INT8 enabled
-- ARM feature FMA enabled
-- ARM feature FP16_VECTOR_ARITHMETIC enabled
-- ARM feature SME enabled
-- Using KleidiAI optimized kernels if applicable
-- The ASM compiler identification is AppleClang
-- Found assembler: /Library/Developer/CommandLineTools/usr/bin/cc
-- Adding CPU backend variant ggml-cpu: -mcpu=native+dotprod+i8mm+nosve+sme 
...

make -j

[ 16%] Building C object ggml/src/CMakeFiles/ggml-cpu.dir/__/__/_deps/kleidiai_download-src/kai/ukernels/matmul/matmul_clamp_f32_qsi8d32p_qsi4c32p/kai_matmul_clamp_f32_qsi8d32p1x4_qsi4c32p4vlx4_1x4vl_sme2_sdot.c.o
/Users/ggml/work/llama.cpp/build-kai/_deps/kleidiai_download-src/kai/ukernels/matmul/matmul_clamp_f32_qsi8d32p_qsi4c32p/kai_matmul_clamp_f32_qsi8d32p1vlx4_qsi4c32p4vlx4_1vlx4vl_sme2_mopa.c:8:2: error: This file must be compiled for AArch64, FEAT_SVE2 or FEAT_SME2.
    8 | #error This file must be compiled for AArch64, FEAT_SVE2 or FEAT_SME2.
      |  ^
/Users/ggml/work/llama.cpp/build-kai/_deps/kleidiai_download-src/kai/ukernels/matmul/matmul_clamp_f32_qsi8d32p_qsi4c32p/kai_matmul_clamp_f32_qsi8d32p1vlx4_qsi4c32p4vlx4_1vlx4vl_sme2_mopa.c:387:41: warning: ISO C requires a translation unit to contain at least one declaration [-Wempty-translation-unit]
  387 | #endif  // Architectural features check.
      |                                         ^
1 warning and 1 error generated.
make[2]: *** [ggml/src/CMakeFiles/ggml-cpu.dir/__/__/_deps/kleidiai_download-src/kai/ukernels/matmul/matmul_clamp_f32_qsi8d32p_qsi4c32p/kai_matmul_clamp_f32_qsi8d32p1vlx4_qsi4c32p4vlx4_1vlx4vl_sme2_mopa.c.o] Error 1
make[2]: *** Waiting for unfinished jobs....
/Users/ggml/work/llama.cpp/build-kai/_deps/kleidiai_download-src/kai/ukernels/matmul/matmul_clamp_f32_qsi8d32p_qsi4c32p/kai_matmul_clamp_f32_qsi8d32p1x4_qsi4c32p4vlx4_1x4vl_sme2_sdot.c:8:2: error: This file must be compiled for AArch64, FEAT_SVE2 or FEAT_SME2.
    8 | #error This file must be compiled for AArch64, FEAT_SVE2 or FEAT_SME2.
      |  ^
/Users/ggml/work/llama.cpp/build-kai/_deps/kleidiai_download-src/kai/ukernels/matmul/matmul_clamp_f32_qsi8d32p_qsi4c32p/kai_matmul_clamp_f32_qsi8d32p1x4_qsi4c32p4vlx4_1x4vl_sme2_sdot.c:357:41: warning: ISO C requires a translation unit to contain at least one declaration [-Wempty-translation-unit]
  357 | #endif  // Architectural features check.
      |                                         ^
1 warning and 1 error generated.
make[2]: *** [ggml/src/CMakeFiles/ggml-cpu.dir/__/__/_deps/kleidiai_download-src/kai/ukernels/matmul/matmul_clamp_f32_qsi8d32p_qsi4c32p/kai_matmul_clamp_f32_qsi8d32p1x4_qsi4c32p4vlx4_1x4vl_sme2_sdot.c.o] Error 1
make[1]: *** [ggml/src/CMakeFiles/ggml-cpu.dir/all] Error 2
make: *** [all] Error 2

Commit: 02315a8

@chaxu01 (Collaborator, Author) commented Feb 12, 2025

We recently identified that our fix for an Android build issue unintentionally introduced an SME-related build error. We're currently working on a resolution and will upload a fix soon.

@chaxu01 (Collaborator, Author) commented Feb 17, 2025

@ggerganov I’ve updated the patch to address your previous comments. Let me know if you have any further feedback or concerns. Happy to make any additional adjustments if needed.

@rmatif commented Feb 18, 2025

I tested it on my Galaxy A34 running Android and noticed a slight degradation in performance.

This PR

~/kleidiai/llama.cpp $ ./build/bin/llama-bench -m ../../Llama-3.2-1B-Instruct-Q4_0.gguf -t 8

| model | size | params | backend | threads | test | t/s |
| --- | ---: | ---: | --- | ---: | ---: | ---: |
| llama 1B Q4_0 | 729.75 MiB | 1.24 B | CPU | 8 | pp512 | 87.11 ± 2.35 |
| llama 1B Q4_0 | 729.75 MiB | 1.24 B | CPU | 8 | tg128 | 11.00 ± 0.23 |

Master build

build: 63ac128 (4738)

~/ggml/llama.cpp $ ./build/bin/llama-bench -m ../../Llama-3.2-1B-Instruct-Q4_0.gguf -t 8

| model | size | params | backend | threads | test | t/s |
| --- | ---: | ---: | --- | ---: | ---: | ---: |
| llama 1B Q4_0 | 729.75 MiB | 1.24 B | CPU | 8 | pp512 | 84.25 ± 3.21 |
| llama 1B Q4_0 | 729.75 MiB | 1.24 B | CPU | 8 | tg128 | 13.79 ± 0.46 |

@chaxu01 (Collaborator, Author) commented Feb 19, 2025

@rmatif , thanks for testing this PR on your Galaxy A34 and for the feedback. The KleidiAI kernels integrated into llama.cpp for this PR are currently optimized for lower thread counts, which may explain the performance degradation you observed. I’ve tested it on a Pixel 8 (Cortex-X + 4x Cortex-A715), and the performance figures show that KleidiAI performs slightly better when using between 1 and 4 cores. However, we’re aware of the performance gap at higher thread counts and are actively working on addressing it. Improvements for better scaling with more threads will be included in a future commit.

PR:

| model                          |       size |     params | backend    | threads |          test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | ------------: | -------------------: |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | CPU        |       1 |         pp512 |         20.48 ± 0.29 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | CPU        |       1 |         tg128 |         11.58 ± 0.02 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | CPU        |       2 |         pp512 |         40.40 ± 0.24 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | CPU        |       2 |         tg128 |         21.40 ± 0.47 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | CPU        |       4 |         pp512 |         74.00 ± 0.30 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | CPU        |       4 |         tg128 |         28.27 ± 0.03 |

build: 84a55f15 (4599)

Master:

| model                          |       size |     params | backend    | threads |          test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | ------------: | -------------------: |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | CPU        |       1 |         pp512 |         19.33 ± 0.14 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | CPU        |       1 |         tg128 |          9.75 ± 0.04 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | CPU        |       2 |         pp512 |         37.59 ± 0.38 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | CPU        |       2 |         tg128 |         18.48 ± 0.12 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | CPU        |       4 |         pp512 |         69.59 ± 0.95 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | CPU        |       4 |         tg128 |         27.42 ± 1.21 |

build: d774ab3a (4644)

set(PRIVATE_ARCH_FLAGS "${PRIVATE_ARCH_FLAGS}+sve+sve2")
endif()

list(APPEND GGML_CDEF_PUBLIC GGML_USE_CPU_KLEIDIAI)
A maintainer (Member) commented on this diff:
GGML_CDEF_PUBLIC has been removed, and this will have no effect. If it is important to add this definition, it should be done with target_compile_definitions.
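
A sketch of the suggested replacement (the ggml-cpu target name and PUBLIC visibility here are illustrative):

    target_compile_definitions(ggml-cpu PUBLIC GGML_USE_CPU_KLEIDIAI)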

@ggerganov (Member) left a review:

Let's merge after addressing the GGML_CDEF_PUBLIC comment and if there are no other reviews.

Using the new Arm instruction sets in ggml will be important for CPU-based on-device inference and it seems that the KleidiAI library is focused on adding support for these hardware features. Note that we generally avoid adding 3rd-party dependencies to the project, mainly to keep things simple and easy to deploy. Still, the KAI library appears to be lightweight and designed in a way that is easy to integrate, so I think it's OK to make an exception and add this option with the prospect of better on-device performance in the future. For the time being, usage of the KAI kernels will be gated by the build and environment flags. My understanding is that the performance of the implementation will improve in the future and will focus on low-power use cases, so will be looking forward to that.

It would be very useful to add CI workflows to cover these instruction sets in order to improve long-term support and maintenance.

@chaxu01 Will be pinging you for support for any issues that might arise from these changes.

@chaxu01 (Collaborator, Author) commented Feb 20, 2025

@ggerganov @slaren Thanks for the review and for considering the integration of KleidiAI. I've updated the PR to address the GGML_CDEF_PUBLIC comment.
We understand the preference for minimizing third-party dependencies, and we've designed the KleidiAI integration to be gated behind build and environment flags. Our focus remains on enhancing the KleidiAI performance, and we appreciate the opportunity to contribute these optimizations.
Regarding CI workflows, that's a great suggestion. We'll explore this topic.
I'll also be available to assist with any issues that arise from these changes—please feel free to reach out anytime.
Thanks again for the feedback and for considering this PR. Looking forward to further improvements ahead.

@ggerganov merged commit c5d91a7 into ggml-org:master on Feb 20, 2025 (42 of 45 checks passed).
@jishminor commented Feb 20, 2025

As a heads up: if you build on Ubuntu 22.04 with the apt-default CMake 3.22, the build with KleidiAI enabled will fail, because that CMake version doesn't accept:

FetchContent_Declare(KleidiAI_Download
            URL ${KLEIDIAI_DOWNLOAD_URL}
            DOWNLOAD_EXTRACT_TIMESTAMP NEW
            URL_HASH MD5=${KLEIDIAI_ARCHIVE_MD5})

It seems that support for DOWNLOAD_EXTRACT_TIMESTAMP is only available in CMake 3.24 and later.

Relevant logs:

[cmake] -- Using KleidiAI optimized kernels if applicable
[cmake] CMake Error at /usr/share/cmake-3.22/Modules/ExternalProject.cmake:2806 (message):
[cmake]   At least one entry of URL is a path (invalid in a list)
[cmake] Call Stack (most recent call first):
[cmake]   /usr/share/cmake-3.22/Modules/ExternalProject.cmake:3716 (_ep_add_download_command)
[cmake]   CMakeLists.txt:15 (ExternalProject_Add)
[cmake] 
[cmake] 
[cmake] -- Configuring incomplete, errors occurred!
[cmake] See also "/home/ubuntu/wavefront_llama/out/build/arm-linux-relwithdebinfo-kleidiai/_deps/kleidiai_download-subbuild/CMakeFiles/CMakeOutput.log".
[cmake] 
[cmake] CMake Error at /usr/share/cmake-3.22/Modules/FetchContent.cmake:1075 (message):
[cmake]   CMake step for kleidiai_download failed: 1
[cmake] Call Stack (most recent call first):
[cmake]   /usr/share/cmake-3.22/Modules/FetchContent.cmake:1216:EVAL:2 (__FetchContent_directPopulate)
[cmake]   /usr/share/cmake-3.22/Modules/FetchContent.cmake:1216 (cmake_language)
[cmake]   /usr/share/cmake-3.22/Modules/FetchContent.cmake:1259 (FetchContent_Populate)
[cmake]   ggml/src/ggml-cpu/CMakeLists.txt:342 (FetchContent_MakeAvailable)
[cmake]   ggml/src/CMakeLists.txt:318 (ggml_add_cpu_backend_variant_impl)
[cmake] 
[cmake] 
[cmake] -- Configuring incomplete, errors occurred!

Upgrading CMake fixed this.
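
For builds that must keep supporting older CMake, the option could also be passed conditionally (a sketch, not something this PR does; KAI_EXTRA_ARGS is a hypothetical helper variable):

    if (CMAKE_VERSION VERSION_GREATER_EQUAL "3.24")
        set(KAI_EXTRA_ARGS DOWNLOAD_EXTRACT_TIMESTAMP NEW)
    endif()
    FetchContent_Declare(KleidiAI_Download
        URL ${KLEIDIAI_DOWNLOAD_URL}
        URL_HASH MD5=${KLEIDIAI_ARCHIVE_MD5}
        ${KAI_EXTRA_ARGS})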
