230 changes: 137 additions & 93 deletions real-multi-round-qa/README.md

## Overview

This benchmark is designed to explore how TTFT changes across different $(C, S)$ combinations by sweeping concurrency ($C$) and session depth ($S$) independently. This helps isolate whether compute capacity or KV-cache pressure is the primary limiting factor.


We highly recommend monitoring vLLM/LMCache/GPU/storage metrics at the same time; the benchmark's JSON output already includes the vLLM/LMCache metrics.
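vLLM exposes these metrics in Prometheus text format at its `/metrics` endpoint. Below is a minimal stdlib polling sketch (the metric name used in the example is one of vLLM's gauges; substitute whichever series you care about):

```python
from urllib.request import urlopen

def fetch_metrics(base_url: str = "http://localhost:8000") -> str:
    """Fetch the Prometheus-format metrics text from the serving endpoint."""
    with urlopen(f"{base_url}/metrics", timeout=5) as resp:
        return resp.read().decode()

def metric_value(metrics_text: str, name: str):
    """Return the first sample value reported for a metric name, or None."""
    for line in metrics_text.splitlines():
        if line.startswith(name):  # comment lines start with '#', so they never match
            return float(line.rsplit(" ", 1)[-1])
    return None
```

Scraping between runs (or relying on the metrics embedded in the JSON output) lets you correlate cache usage and queue depth with each $(C, S)$ cell.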

This benchmark feeds full‑length novels to your LLM server and asks many follow‑up questions, just like a book critic. It is handy for testing long‑context handling and KV‑cache tools such as LMCache.

The benchmark is called CxS (pronounced "six" for simplicity), referring to the product Concurrent $\times$ Session Depth.

## Two simple knobs

| Option | What it means |
| ---- | ---- |
| `--concurrent` (C) | How many threads run in parallel. |
| `--session-depth` (S) | How many sessions each thread serves in turn. |

You can:
* raise $C$ to test compute-side capability (higher GPU utilization; total KV footprint also rises).
* raise $S$ to test KV-cache pressure (larger resident KV per GPU, little change in instantaneous GPU utilization).

## Execution model

```
Concurrent: {A,B}
Session Depth: {X,Y}
All Session: {AX,AY,BX,BY}

Timeline
-------------------------------------------------
Thread A:
Turn 0 → SessionAX: Q1 "Read and summarize this novel. {AX novel contents}" → Get Response
Turn 0 → SessionAY: Q1 "Read and summarize this novel. {AY novel contents}" → Get Response
Turn 1 → SessionAX: Q2 "Write down the author's feelings." → Get Response
Turn 1 → SessionAY: Q2 "Write down the author's feelings." → Get Response
...
Thread B:
Turn 0 → SessionBX: Q1 "Read and summarize this novel. {BX novel contents}" → Get Response
Turn 0 → SessionBY: Q1 "Read and summarize this novel. {BY novel contents}" → Get Response
Turn 1 → SessionBX: Q2 "Write down the author's feelings." → Get Response
Turn 1 → SessionBY: Q2 "Write down the author's feelings." → Get Response
...
```
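The timeline above can be sketched as a nested loop: $C$ concurrent workers, each serving its $S$ sessions round-robin, one turn at a time (a schematic with the actual API call stubbed out; the names are illustrative):

```python
import asyncio

async def run_turn(thread_id: str, session_id: str, turn: int) -> str:
    # Stand-in for the real chat-completion request.
    await asyncio.sleep(0)
    return f"Thread {thread_id}: Turn {turn} -> Session {thread_id}{session_id}"

async def run_group(thread_id: str, session_ids, num_rounds: int):
    # One worker serves its S sessions round-robin, one turn at a time.
    log = []
    for turn in range(num_rounds):
        for sid in session_ids:
            log.append(await run_turn(thread_id, sid, turn))
    return log

async def run_all_concurrent(thread_ids, session_ids, num_rounds):
    # C workers run in parallel; each owns its own set of S sessions.
    return await asyncio.gather(
        *(run_group(t, session_ids, num_rounds) for t in thread_ids)
    )

logs = asyncio.run(run_all_concurrent(["A", "B"], ["X", "Y"], num_rounds=2))
```

Because every session's first turn carries a full novel, each worker keeps $S$ long contexts resident at once, which is exactly the KV-cache pressure the $S$ knob is meant to create.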

## For system competition

The CxS benchmark provides a scalar score to encourage healthy competition, but its use is not mandatory.

### Definition

Let us define the set of candidate pairs:

$$
\mathcal{D} = \{ (C_i, S_i) \mid \mathrm{TTFT}_{95}^{(i)} \leq 2 \}
$$

### Objective

More precisely, we aim to find the pair that maximizes the harmonic mean among all candidates in $\mathcal{D}$:


$$
\underset{(C_i, S_i) \in \mathcal{D}}{\arg\max} \left( \frac{2 C_i S_i}{C_i + S_i} \right)
$$
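Given a sweep of measured $(C, S, \mathrm{TTFT}_{95})$ triples, the selection is a filter plus an argmax. A minimal sketch (the sweep numbers are hypothetical):

```python
def best_pair(results, ttft_limit=2.0):
    """Pick the (C, S) pair maximizing the harmonic mean 2CS/(C+S)
    among pairs whose 95th-percentile TTFT is within the limit."""
    feasible = [(c, s) for c, s, ttft95 in results if ttft95 <= ttft_limit]
    if not feasible:
        return None
    return max(feasible, key=lambda p: 2 * p[0] * p[1] / (p[0] + p[1]))

# Hypothetical sweep results as (C, S, TTFT_95 in seconds):
sweep = [(4, 2, 0.50), (4, 4, 33.6), (4, 3, 0.79), (3, 4, 42.5)]
print(best_pair(sweep))  # -> (4, 3)
```

The harmonic mean rewards balanced pairs: (4, 3) beats (4, 2) even though both satisfy the TTFT constraint.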

## For business metric

As a business metric, we report the product, CxS.
For example, we say "Our system can keep up to {C×S} user sessions active!"

## Getting Started

```bash
python prepare.py --output data --model Qwen/Qwen2.5-7B-Instruct-1M # Models use

```bash
# Run the benchmark many times
BASE_URL="http://localhost:8000"
MODEL="Qwen/Qwen2.5-7B-Instruct-1M"
NUM_ROUNDS=12
OUTPUT_DIR="bench_dir"
SRC_DIR="./data/128k"
mkdir -p "$OUTPUT_DIR"
for c in {1..4}; do # You can change c and s to any value you like.
for s in {1..4}; do
TIMESTAMP=$(date +%s)
OUTPUT_FILE="${OUTPUT_DIR}/bench_c${c}_s${s}_${TIMESTAMP}.json"
echo "Running benchmark: C=${c}, S=${s}"
python multi-round-qa.py -c "$c" -s "$s" --num-rounds "$NUM_ROUNDS" --model "$MODEL" --base-url "$BASE_URL" --output "$OUTPUT_FILE" --src-dir "$SRC_DIR"
done
done
```
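Each output JSON stores the per-turn `Result` records, including a `ttft` field. A small sketch of the 95th-percentile computation (assuming the file holds those records as a flat list):

```python
import json
import math

def p95(values):
    """Nearest-rank 95th percentile of a non-empty sequence."""
    ordered = sorted(values)
    idx = math.ceil(0.95 * len(ordered)) - 1
    return ordered[idx]

def ttft_p95(path):
    """TTFT_95 for one benchmark output file, over successful turns only."""
    with open(path) as f:
        records = json.load(f)
    return p95([r["ttft"] for r in records if r.get("status") == "success"])
```

Running this over every `bench_c*_s*_*.json` yields the $(C, S, \mathrm{TTFT}_{95})$ table that `plot.py` prints.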

For the demo, we compare two systems:

System A
* Model
* Qwen/Qwen2.5-7B-Instruct-1M
* Dataset
* 32k
* CPU/GPU
* NVIDIA GH200 480GB
* vLLM
* v0.9.0.1
* enable prefix-caching
* enable chunked prefill
* LMCache
* local_cpu: True
* max_local_cpu_size: 200
* pipelined_backend: True
* save_decode_cache: True

System B
* Model
* Qwen/Qwen2.5-7B-Instruct-1M
* Dataset
* 32k
* CPU/GPU
* NVIDIA GH200 480GB
* vLLM
* v0.9.0.1
* enable prefix-caching
* enable chunked prefill
* LMCache
* local_cpu: True
* max_local_cpu_size: 200
* pipelined_backend: True
* save_decode_cache: True
* local_disk: file:///data/tmp
* max_local_disk_size: 400
* Storage
* DDN EXAScaler 2.14.0
* stripe count is 8
* stripe size is 1MiB

```bash
# Plot and Show Result
$ python plot.py lmcache_bench_dir-1749973344 lmcache_with_cpu_200g.png
c s ttft_95
0 8 16 2.674693
1 12 32 3.268448
2 4 32 2.496206
3 16 16 3.310291
4 4 8 0.146159
5 8 32 2.801732
6 12 24 3.283783
7 12 16 3.185047
8 12 8 0.390896
9 4 24 0.217809
10 16 8 3.799740
11 8 8 0.347083
12 16 24 3.171192
13 16 32 3.032414
14 8 24 3.383691
15 4 16 0.253737
Best (C,S) with TTFT_95 ≤ 2 s → C=12.0, S=8.0, HarmonicMean=9.60, C×S=96.0
Saved: lmcache_with_cpu_200g.png
$ python plot.py lmcache_bench_dir-1749897431 lmcache_with_cpu_200g_exa_400g.png
c s ttft_95
0 4 16 0.255378
1 8 24 3.213307
2 16 24 4.067904
3 4 24 0.612876
4 8 32 4.389398
5 4 8 0.158686
6 12 24 3.939205
7 12 8 0.634048
8 4 32 1.191106
9 12 32 3.475115
10 16 16 3.156051
11 8 8 0.264291
12 12 16 2.739532
13 16 32 3.853057
14 8 16 1.424959
15 16 8 3.470811
Best (C,S) with TTFT_95 ≤ 2 s → C=8.0, S=16.0, HarmonicMean=10.67, C×S=128.0
Saved: lmcache_with_cpu_200g_exa_400g.png
```

This result shows that adding external storage (DDN EXAScaler) as a tier in the KV-cache hierarchy can increase the number of active sessions.
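The capacity gain follows directly from the two best scores above:

```python
best_cpu_only = 12 * 8    # C=12, S=8 with the CPU cache tier only
best_with_disk = 8 * 16   # C=8, S=16 with the EXAScaler tier added
gain = best_with_disk / best_cpu_only
print(f"{best_with_disk} vs {best_cpu_only} sessions: {gain:.2f}x")  # -> 128 vs 96 sessions: 1.33x
```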

## Viz

The white dashed line indicates the TTFT = 2s boundary.

System A result:

![LMCache+CPU Plot](lmcache_with_cpu_200g.png)

System B result:

![LMCache+CPU+Storage400g Plot](lmcache_with_cpu_200g_exa_400g.png)
Binary file removed real-multi-round-qa/lmcache.png
Binary file added real-multi-round-qa/lmcache_with_cpu_200g.png
25 changes: 17 additions & 8 deletions real-multi-round-qa/multi-round-qa.py
from dataclasses import dataclass, asdict
from typing import List
import json
import requests

FIRST_PROMPT = "Read and summarize this novel.\n\n{}"
FOLLOWUP_PROMPTS = [
class Result:
session_id: str
turn: int
start_time: float
latency: float
ttft: float
generation_time: float
prompt_tokens: int
completion_tokens: int
metrics: str
status: str

class ChatSession:
def append_assistant_message(self, content):
self.messages.append({"role": "assistant", "content": content})
self.turns += 1

async def run_turn(session: ChatSession, client: openai.AsyncOpenAI, base_url: str) -> Result:
prompt = session.get_next_prompt()
session.append_user_message(prompt)

prompt_tokens = 0

print(f"Session {session.session_id}, Turn {session.turns}: {prompt[:50]}...")

resp = requests.get(f"{base_url}/metrics")
resp.raise_for_status()

response = await client.chat.completions.create(
model=session.model,
messages=session.messages,
result = Result(
session_id=session.session_id,
turn=session.turns,
start_time=start_time,
latency=latency,
ttft=ttft,
generation_time=generation_time,
prompt_tokens=prompt_tokens,
completion_tokens=completion_tokens,
metrics=resp.text,
status="success",
)

session.append_assistant_message(content)

return result

async def run_group(args) -> List[Result]:
client = openai.AsyncOpenAI(base_url=f"{args.base_url}/v1", api_key="EMPTY")
sessions = [ChatSession(args) for _ in range(args.session_depth)]
results = []

while any(not s.is_finished() for s in sessions):
for session in sessions:
if session.is_finished():
continue
result = await run_turn(session, client, args.base_url)
results.append(result)

return results

async def run_all_concurrent(args):
tasks = [run_group(args) for _ in range(args.concurrent)]
all_results = await asyncio.gather(*tasks)
return [asdict(r) for group in all_results for r in group]

def parse_args():
parser = argparse.ArgumentParser()
parser.add_argument("-c", "--concurrent", type=int, required=True)
parser.add_argument("-s", "--session-depth", type=int, required=True)
parser.add_argument("--model", type=str, required=True)
parser.add_argument("--base-url", type=str, required=True)
parser.add_argument("--num-rounds", type=int, default=10)