[Bug] Slower TTFT Performance #3854

Open · wahaha22 opened this issue Feb 25, 2025 · 3 comments
Labels: help wanted (Extra attention is needed)

wahaha22 commented Feb 25, 2025

Checklist

  • 1. I have searched related issues but cannot get the expected help.
  • 2. The bug has not been fixed in the latest version.
  • 3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.
  • 4. If the issue you raised is not a bug but a question, please raise a discussion at https://github.com/sgl-project/sglang/discussions/new/choose. Otherwise, it will be closed.
  • 5. Please use English, otherwise it will be closed.

Describe the bug

I'm comparing Time-To-First-Token (TTFT) performance between SGLang (0.4.3.post2) and vLLM (v0.7.2) on the same hardware (H100) and model (meta-llama/Llama-3.1-8B).

My benchmark results:
SGLang:

Batch 64: 16 batches, total 25783.48ms, avg 1611.47ms, std 143.69ms, p95 1795.47ms, p999 2068.31ms
Batch 128: 8 batches, total 12996.97ms, avg 1624.62ms, std 33.60ms, p95 1673.05ms, p999 1675.57ms
Batch 256: 4 batches, total 10848.67ms, avg 2712.17ms, std 47.40ms, p95 2769.73ms, p999 2782.20ms
Batch 512: 2 batches, total 8716.49ms, avg 4358.24ms, std 61.40ms, p95 4397.32ms, p999 4401.57ms
Batch 1024: 1 batches, total 8171.80ms, avg 8171.80ms, std 0.00ms, p95 8171.80ms, p999 8171.80ms

vLLM:

Batch 64: 16 batches, total 8432.63ms, avg 527.04ms, std 49.41ms, p95 623.41ms, p999 643.26ms
Batch 128: 8 batches, total 8359.63ms, avg 1044.95ms, std 92.96ms, p95 1190.91ms, p999 1257.67ms
Batch 256: 4 batches, total 8341.63ms, avg 2085.41ms, std 109.07ms, p95 2208.48ms, p999 2227.08ms
Batch 512: 2 batches, total 8369.78ms, avg 4184.89ms, std 28.69ms, p95 4203.15ms, p999 4205.14ms
Batch 1024: 1 batches, total 8411.21ms, avg 8411.21ms, std 0.00ms, p95 8411.21ms, p999 8411.21ms
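
For reference, the per-batch-size gap can be computed directly from the averages above (values copied from the two result blocks; a minimal sketch, not part of the benchmark):

sglang_avg = {64: 1611.47, 128: 1624.62, 256: 2712.17, 512: 4358.24, 1024: 8171.80}
vllm_avg = {64: 527.04, 128: 1044.95, 256: 2085.41, 512: 4184.89, 1024: 8411.21}

for bs in sglang_avg:
    # ratio > 1 means SGLang's average batch latency is higher than vLLM's
    print(f"batch {bs:>4}: sglang/vllm = {sglang_avg[bs] / vllm_avg[bs]:.2f}x")

With these numbers the gap is about 3.1x at batch 64 and shrinks to roughly 1.0x at batch 1024.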

Reproduction

Dataset: prompt_128.txt
1,024 prompts sampled from ShareGPT_V3_unfiltered_cleaned_split.json; sampling script:

import argparse
import json
from collections import defaultdict

def main():
    # Parse arguments
    parser = argparse.ArgumentParser()
    parser.add_argument("--dataset_path", required=True, help="Path to dataset file")
    parser.add_argument("--token_lens", type=lambda s: sorted({int(x) for x in s.split(',')}, reverse=True),
                        required=True, help="Comma-separated target token lengths (descending order)")
    parser.add_argument("--num_prompt", type=int, required=True, help="Number of prompts per length")
    args = parser.parse_args()

    # Load and filter dataset
    with open(args.dataset_path) as f:
        dataset = [data for data in json.load(f) if len(data["conversations"]) >= 2]

    # Preprocess prompts: split each first-turn prompt into words once
    prompt_words = []
    for data in dataset:
        words = data["conversations"][0]["value"].split()
        prompt_words.append((words, len(words)))

    # Group prompts by target lengths (descending order)
    length_groups = defaultdict(list)
    for words, word_count in prompt_words:
        for target_len in args.token_lens:
            if word_count >= target_len:
                if len(length_groups[target_len]) < args.num_prompt:
                    truncated = " ".join(words[:target_len])
                    length_groups[target_len].append(truncated)
                break  # Prioritize longer lengths first

    # Write outputs
    for length, prompts in length_groups.items():
        if len(prompts) < args.num_prompt:
            print(f"Warning: Insufficient prompts for length {length} ({len(prompts)}/{args.num_prompt})")
        with open(f"prompt_{length}.txt", "w") as f:
            f.write("\n".join(prompts))

if __name__ == "__main__":
    main()

Command to generate the dataset:

python sample.py --dataset_path ./ShareGPT_V3_unfiltered_cleaned_split.json --token_lens=128 --num_prompt 1024
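
A quick sanity check of the generated file (the script writes prompt_128.txt for --token_lens=128; this check is separate from the benchmark) confirms the prompt count and per-prompt word counts:

with open("prompt_128.txt") as f:
    lengths = [len(line.split()) for line in f if line.strip()]

# The sampling script truncates each prompt to exactly 128 whitespace-separated words.
print(f"{len(lengths)} prompts; word count min={min(lengths)}, max={max(lengths)}")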

Script to benchmark SGLang:

import argparse
import time
import numpy as np
import statistics
import sglang as sgl

if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument("--input", type=str, required=True, help="Path to input file containing prompts")
    parser.add_argument("--batchsize", type=lambda s: [int(x) for x in s.split(',')], 
                        default=[4], help="Comma-separated list of batch sizes to benchmark")
    parser.add_argument("--model", type=str, required=True, help="Path to model")
    args = parser.parse_args()

    with open(args.input, 'r') as f:
        all_prompts = [line.strip() for line in f.readlines()]

    sampling_params = {"temperature": 0.8, "top_p": 0.95}
    llm = sgl.Engine(model_path=args.model)

    for batch_size in args.batchsize:
        print(f"\n{'='*40} Benchmarking batch size: {batch_size} {'='*40}")
        
        batch_times = []
        total_time = 0
        batch_count = 0

        for i in range(0, len(all_prompts), batch_size):
            batch_prompts = all_prompts[i:i + batch_size]
            
            st = time.perf_counter()
            outputs = llm.generate(batch_prompts, sampling_params)
            batch_time = time.perf_counter() - st
            
            batch_times.append(batch_time)
            total_time += batch_time
            batch_count += 1
            
            print(f"Batch {batch_count} ({batch_size} prompts) time: {batch_time * 1000:.2f} ms")

        print(f"Batch {batch_size}: {batch_count} batches, total {total_time*1000:.2f}ms, avg {statistics.mean(batch_times)*1000:.2f}ms, std {(statistics.stdev(batch_times)*1000 if len(batch_times)>1 else 0):.2f}ms, p95 {np.percentile(batch_times,95)*1000:.2f}ms, p999 {np.percentile(batch_times,99.9)*1000:.2f}ms")

Run with:

python sglang_bench.py \
  --batchsize 64,128 \
  --model /path/to/meta-llama_Llama-3.1-8B \
  --input ./prompt_128.txt

Script to benchmark vLLM:

import argparse
import time
import numpy as np
import statistics
from vllm import LLM, SamplingParams

parser = argparse.ArgumentParser()
parser.add_argument("--input", type=str, required=True, help="Path to input file containing prompts")
parser.add_argument("--batchsize", type=lambda s: [int(x) for x in s.split(',')], 
                    default=[4], help="Comma-separated list of batch sizes to benchmark")
parser.add_argument("--model", type=str, required=True, help="Path to model")
args = parser.parse_args()

with open(args.input, 'r') as f:
    all_prompts = [line.strip() for line in f.readlines()]

sampling_params = SamplingParams(max_tokens=1, temperature=0.8, top_p=0.95)
llm = LLM(model=args.model, trust_remote_code=True)

results = []
for batch_size in args.batchsize:
    batch_times = []
    total_time = 0
    batch_count = 0

    for i in range(0, len(all_prompts), batch_size):
        batch_prompts = all_prompts[i:i + batch_size]
        if len(batch_prompts) < batch_size:
            continue
            
        st = time.perf_counter()
        outputs = llm.generate(batch_prompts, sampling_params)
        batch_time = time.perf_counter() - st
        
        batch_times.append(batch_time)
        total_time += batch_time
        batch_count += 1

    results.append(f"Batch {batch_size}: {batch_count} batches, total {total_time*1000:.2f}ms, avg {statistics.mean(batch_times)*1000:.2f}ms, std {(statistics.stdev(batch_times)*1000 if len(batch_times)>1 else 0):.2f}ms, p95 {np.percentile(batch_times,95)*1000:.2f}ms, p999 {np.percentile(batch_times,99.9)*1000:.2f}ms")

print("\n".join(results))

Run with:

python vllm_bench.py \
  --batchsize 64,128 \
  --model /path/to/meta-llama_Llama-3.1-8B \
  --input ./prompt_128.txt
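
Note: the vLLM script caps generation at a single token (max_tokens=1), while the SGLang sampling params above leave max_new_tokens unset. To measure prefill/TTFT the same way on the SGLang side, the limit could be added as well; a minimal sketch, assuming SGLang's max_new_tokens sampling parameter:

# In sglang_bench.py: limit generation to one token so the per-batch wall time
# approximates prefill (TTFT) only, mirroring the vLLM script's max_tokens=1.
sampling_params = {"temperature": 0.8, "top_p": 0.95, "max_new_tokens": 1}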

Environment

  • CUDA: 12.4
  • Python: 3.10
  • Hardware: 2 × H100
  • SGLang version: 0.4.3.post2
  • vLLM version: 0.7.2
  • Model tested: meta-llama/Llama-3.1-8B
zhaochenyang20 (Collaborator) commented:

Thanks so much for your detailed profiling and reproduction commands. We will look into this soon. Could Frank take a look? @FrankLeeeee

@zhaochenyang20 zhaochenyang20 self-assigned this Feb 26, 2025
@zhaochenyang20 zhaochenyang20 added the help wanted Extra attention is needed label Feb 26, 2025
@zhaochenyang20 zhaochenyang20 changed the title [Bug] Slower TTFT Performance Compared to vLLM [Bug] Slower TTFT Performance Feb 26, 2025

lambda7xx commented Feb 26, 2025

May I take this task?

zhaochenyang20 (Collaborator) commented:

@lambda7xx Sure. Please go and make it!
