[Bug] Slower TTFT Performance #3854

Open · wahaha22 opened this issue Feb 25, 2025 · 3 comments
Labels: help wanted (Extra attention is needed)

wahaha22 commented Feb 25, 2025

Checklist

  • 1. I have searched related issues but cannot get the expected help.
  • 2. The bug has not been fixed in the latest version.
  • 3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.
  • 4. If the issue you raised is not a bug but a question, please raise a discussion at https://github.com/sgl-project/sglang/discussions/new/choose. Otherwise, it will be closed.
  • 5. Please use English, otherwise it will be closed.

Describe the bug

I'm comparing Time-To-First-Token (TTFT) performance between SGLang (0.4.3.post2) and vLLM (v0.7.2) on the same hardware (H100) and model (meta-llama/Llama-3.1-8B).

My benchmark results:
SGLang:

Batch 64: 16 batches, total 25783.48ms, avg 1611.47ms, std 143.69ms, p95 1795.47ms, p999 2068.31ms
Batch 128: 8 batches, total 12996.97ms, avg 1624.62ms, std 33.60ms, p95 1673.05ms, p999 1675.57ms
Batch 256: 4 batches, total 10848.67ms, avg 2712.17ms, std 47.40ms, p95 2769.73ms, p999 2782.20ms
Batch 512: 2 batches, total 8716.49ms, avg 4358.24ms, std 61.40ms, p95 4397.32ms, p999 4401.57ms
Batch 1024: 1 batches, total 8171.80ms, avg 8171.80ms, std 0.00ms, p95 8171.80ms, p999 8171.80ms

vLLM:

Batch 64: 16 batches, total 8432.63ms, avg 527.04ms, std 49.41ms, p95 623.41ms, p999 643.26ms
Batch 128: 8 batches, total 8359.63ms, avg 1044.95ms, std 92.96ms, p95 1190.91ms, p999 1257.67ms
Batch 256: 4 batches, total 8341.63ms, avg 2085.41ms, std 109.07ms, p95 2208.48ms, p999 2227.08ms
Batch 512: 2 batches, total 8369.78ms, avg 4184.89ms, std 28.69ms, p95 4203.15ms, p999 4205.14ms
Batch 1024: 1 batches, total 8411.21ms, avg 8411.21ms, std 0.00ms, p95 8411.21ms, p999 8411.21ms
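
For reference, the per-batch-size gap can be computed directly from the averages above (values copied from the two result blocks; a minimal sketch, not part of the benchmark):

sglang_avg = {64: 1611.47, 128: 1624.62, 256: 2712.17, 512: 4358.24, 1024: 8171.80}
vllm_avg = {64: 527.04, 128: 1044.95, 256: 2085.41, 512: 4184.89, 1024: 8411.21}

for bs in sglang_avg:
    # ratio > 1 means SGLang's average batch latency is higher than vLLM's
    print(f"batch {bs:>4}: sglang/vllm = {sglang_avg[bs] / vllm_avg[bs]:.2f}x")

With these numbers the gap is about 3.1x at batch 64 and shrinks to roughly 1.0x at batch 1024.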

Reproduction

Dataset: prompt_128.txt
1,024 prompts sampled from ShareGPT_V3_unfiltered_cleaned_split.json; sampling script:

import argparse
import json
from collections import defaultdict

def main():
    # Parse arguments
    parser = argparse.ArgumentParser()
    parser.add_argument("--dataset_path", required=True, help="Path to dataset file")
    parser.add_argument("--token_lens", type=lambda s: sorted({int(x) for x in s.split(',')}, reverse=True),
                        required=True, help="Comma-separated target token lengths (descending order)")
    parser.add_argument("--num_prompt", type=int, required=True, help="Number of prompts per length")
    args = parser.parse_args()

    # Load and filter dataset
    with open(args.dataset_path) as f:
        dataset = [data for data in json.load(f) if len(data["conversations"]) >= 2]

    # Preprocess prompts: split each first-turn prompt into words once
    prompt_words = []
    for data in dataset:
        words = data["conversations"][0]["value"].split()
        prompt_words.append((words, len(words)))

    # Group prompts by target lengths (descending order)
    length_groups = defaultdict(list)
    for words, word_count in prompt_words:
        for target_len in args.token_lens:
            if word_count >= target_len:
                if len(length_groups[target_len]) < args.num_prompt:
                    truncated = " ".join(words[:target_len])
                    length_groups[target_len].append(truncated)
                break  # Prioritize longer lengths first

    # Write outputs
    for length, prompts in length_groups.items():
        if len(prompts) < args.num_prompt:
            print(f"Warning: Insufficient prompts for length {length} ({len(prompts)}/{args.num_prompt})")
        with open(f"prompt_{length}.txt", "w") as f:
            f.write("\n".join(prompts))

if __name__ == "__main__":
    main()

Command to generate the dataset:

python sample.py --dataset_path ./ShareGPT_V3_unfiltered_cleaned_split.json --token_lens=128 --num_prompt 1024
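
A quick sanity check of the generated file (the script writes prompt_128.txt for --token_lens=128; this check is separate from the benchmark) confirms the prompt count and per-prompt word counts:

with open("prompt_128.txt") as f:
    lengths = [len(line.split()) for line in f if line.strip()]

# The sampling script truncates each prompt to exactly 128 whitespace-separated words.
print(f"{len(lengths)} prompts; word count min={min(lengths)}, max={max(lengths)}")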

Script to benchmark SGLang:

import argparse
import time
import numpy as np
import statistics
import sglang as sgl

if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument("--input", type=str, required=True, help="Path to input file containing prompts")
    parser.add_argument("--batchsize", type=lambda s: [int(x) for x in s.split(',')], 
                        default=[4], help="Comma-separated list of batch sizes to benchmark")
    parser.add_argument("--model", type=str, required=True, help="Path to model")
    args = parser.parse_args()

    with open(args.input, 'r') as f:
        all_prompts = [line.strip() for line in f.readlines()]

    sampling_params = {"temperature": 0.8, "top_p": 0.95}
    llm = sgl.Engine(model_path=args.model)

    for batch_size in args.batchsize:
        print(f"\n{'='*40} Benchmarking batch size: {batch_size} {'='*40}")
        
        batch_times = []
        total_time = 0
        batch_count = 0

        for i in range(0, len(all_prompts), batch_size):
            batch_prompts = all_prompts[i:i + batch_size]
            
            st = time.perf_counter()
            outputs = llm.generate(batch_prompts, sampling_params)
            batch_time = time.perf_counter() - st
            
            batch_times.append(batch_time)
            total_time += batch_time
            batch_count += 1
            
            print(f"Batch {batch_count} ({batch_size} prompts) time: {batch_time * 1000:.2f} ms")

        print(f"Batch {batch_size}: {batch_count} batches, total {total_time*1000:.2f}ms, avg {statistics.mean(batch_times)*1000:.2f}ms, std {(statistics.stdev(batch_times)*1000 if len(batch_times)>1 else 0):.2f}ms, p95 {np.percentile(batch_times,95)*1000:.2f}ms, p999 {np.percentile(batch_times,99.9)*1000:.2f}ms")

Run with:

python sglang_bench.py \
  --batchsize 64,128 \
  --model /path/to/meta-llama_Llama-3.1-8B \
  --input ./prompt_128.txt

Script to benchmark vLLM:

import argparse
import time
import numpy as np
import statistics
from vllm import LLM, SamplingParams

parser = argparse.ArgumentParser()
parser.add_argument("--input", type=str, required=True, help="Path to input file containing prompts")
parser.add_argument("--batchsize", type=lambda s: [int(x) for x in s.split(',')], 
                    default=[4], help="Comma-separated list of batch sizes to benchmark")
parser.add_argument("--model", type=str, required=True, help="Path to model")
args = parser.parse_args()

with open(args.input, 'r') as f:
    all_prompts = [line.strip() for line in f.readlines()]

sampling_params = SamplingParams(max_tokens=1, temperature=0.8, top_p=0.95)
llm = LLM(model=args.model, trust_remote_code=True)

results = []
for batch_size in args.batchsize:
    batch_times = []
    total_time = 0
    batch_count = 0

    for i in range(0, len(all_prompts), batch_size):
        batch_prompts = all_prompts[i:i + batch_size]
        if len(batch_prompts) < batch_size:
            continue
            
        st = time.perf_counter()
        outputs = llm.generate(batch_prompts, sampling_params)
        batch_time = time.perf_counter() - st
        
        batch_times.append(batch_time)
        total_time += batch_time
        batch_count += 1

    results.append(f"Batch {batch_size}: {batch_count} batches, total {total_time*1000:.2f}ms, avg {statistics.mean(batch_times)*1000:.2f}ms, std {(statistics.stdev(batch_times)*1000 if len(batch_times)>1 else 0):.2f}ms, p95 {np.percentile(batch_times,95)*1000:.2f}ms, p999 {np.percentile(batch_times,99.9)*1000:.2f}ms")

print("\n".join(results))

Run with:

python vllm_bench.py \
  --batchsize 64,128 \
  --model /path/to/meta-llama_Llama-3.1-8B \
  --input ./prompt_128.txt
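
Note: the vLLM script caps generation at a single token (max_tokens=1), while the SGLang sampling params above leave max_new_tokens unset. To measure prefill/TTFT the same way on the SGLang side, the limit could be added as well; a minimal sketch, assuming SGLang's max_new_tokens sampling parameter:

# In sglang_bench.py: limit generation to one token so the per-batch wall time
# approximates prefill (TTFT) only, mirroring the vLLM script's max_tokens=1.
sampling_params = {"temperature": 0.8, "top_p": 0.95, "max_new_tokens": 1}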

Environment

  • CUDA: 12.4
  • Python: 3.10
  • Hardware: 2 × H100
  • SGLang version: 0.4.3.post2
  • vLLM version: 0.7.2
  • Model tested: meta-llama/Llama-3.1-8B
zhaochenyang20 (Collaborator) commented:

Thanks so much for your detailed profiling and reproduction commands. We will look into this soon. Could Frank take a look? @FrankLeeeee

@zhaochenyang20 zhaochenyang20 self-assigned this Feb 26, 2025
@zhaochenyang20 zhaochenyang20 added the help wanted Extra attention is needed label Feb 26, 2025
@zhaochenyang20 zhaochenyang20 changed the title [Bug] Slower TTFT Performance Compared to vLLM [Bug] Slower TTFT Performance Feb 26, 2025

lambda7xx commented Feb 26, 2025

May I take this task?

zhaochenyang20 (Collaborator) commented:

@lambda7xx Sure. Please go and make it!
