-
This could be a dramatic improvement for speculative decoding... at the moment a huge bottleneck is running a small batch of 2-4 tokens through the larger model.
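To make that bottleneck concrete, here is a minimal sketch of the verification step, assuming an mlx_lm-style `model(tokens, cache=...)` call. The helper and argument names are made up for illustration, the acceptance rule is simplified to greedy prefix matching, and cache rollback on rejection is omitted, so treat it as a sketch rather than a real implementation:

```python
import mlx.core as mx

def verify_draft(target_model, target_cache, last_token, draft_tokens):
    """Run the last accepted token plus k draft tokens through the larger
    (target) model in a single small-batch forward pass, then greedily
    accept the longest prefix the target model agrees with."""
    inp = mx.array([[last_token] + draft_tokens])     # shape (1, 1 + k)
    logits = target_model(inp, cache=target_cache)    # the expensive small-batch forward
    preds = mx.argmax(logits[0], axis=-1).tolist()    # target's pick at each position

    accepted = []
    for i, tok in enumerate(draft_tokens):
        if preds[i] != tok:                           # first disagreement with the draft
            break
        accepted.append(tok)

    # Token the target model proposes right after the accepted prefix:
    # a correction on mismatch, or a "bonus" token if everything matched.
    next_token = preds[len(accepted)]
    return accepted, next_token
```

That single forward pass over 2-4 tokens is exactly the small-batch latency being measured in the benchmarks below.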
-
I modified the benchmark slightly to make sure the prompt cache is fixed length.

```python
import time
import copy

import mlx.core as mx
import mlx_lm
from mlx_lm.models import cache

# model, tokenizer = mlx_lm.load("mlx-community/Llama-3.2-1B-Instruct-bf16")
model, tokenizer = mlx_lm.load("mlx-community/Llama-3.2-1B-Instruct-4bit")
# model, tokenizer = mlx_lm.load("mlx-community/Meta-Llama-3.1-8B-Instruct-4bit")

prompt_cache = cache.make_prompt_cache(model)

# Prefill a fixed 256-token prompt so every timed forward pass sees the
# same cache length.
prompt = mx.array([[100] * 256])
logits = model(prompt, cache=prompt_cache)
mx.eval(logits)

seqlen = 1
print("Seqlen | Toks/s | Fwd Time (ms)")
print("------ | ------ | -------------")
while seqlen <= 32:
    inp = mx.array([[100] * seqlen])
    tic = time.perf_counter()
    its = 25
    for _ in range(its):
        # Deep-copy the cache so it is not mutated across iterations.
        logits = model(inp, cache=copy.deepcopy(prompt_cache))
        mx.eval(logits)
    toc = time.perf_counter()
    s = (toc - tic) / its
    tps = seqlen / s
    ms = 1000 * s
    print(f"{seqlen} | {tps:.3f} | {ms:.3f}")
    seqlen *= 2
```

Overall the results are not that bad. For the quantized 1B model I see a nice increase in toks/sec from
-
Related llama.cpp PR: ggerganov/llama.cpp#10581
-
Hi, since transformer inference is memory bound, the forward time should increase in a step-wise manner as the number of processed tokens grows.
The following image shows the median forward times with 4-bit quantized weights for several sequence lengths. Notably, there is a 4.3x increase when going from sequence length 1 to 8, while the times for 8 to 32 tokens lie on the same step. This suggests that the QMM kernels for small sequence lengths are likely under-optimized.
As a comparison, here are the results for FP16 weights, where the forward times for lengths between 1 and 64 lie on the same step.
*(Image: Screenshot 2024-11-16 at 17 26 53, median forward times with FP16 weights)*
I'm pretty uncertain about this, but based on the last image, shouldn't we expect the QMM times for lengths between 1 and 64/4 = 16 to be on the same step, rather than several times higher?
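A rough roofline-style sanity check of that 64/4 = 16 figure, assuming the per-layer weight matmuls dominate, with illustrative (not measured) symbols: P the parameter count, b bytes per weight, BW memory bandwidth, F peak FLOP/s, and n the number of processed tokens:

$$
t_{\text{fwd}}(n) \approx \max\!\left(\frac{bP}{BW},\ \frac{2nP}{F}\right)
\quad\Longrightarrow\quad
n^{*} \approx \frac{F\,b}{2\,BW}
$$

The width of the flat step scales with the bytes per weight, so going from FP16 (b = 2) to 4-bit (b = 0.5) should shrink it by roughly 4x, i.e. from about n ≤ 64 to about n ≤ 16, which is the expectation behind the question above.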
Experiments were run on an M1 MacBook Air (8 GB) with mlx 0.20.0 and 0.19.3, on macOS 15.2 Beta, using Llama-3.2-1B-Instruct 4-bit/bf16 and the following code: