
vLLM-Tuner


An intelligent tuner for vLLM that automatically monitors GPU metrics and uses Bayesian optimization to tune parameters (batch_size, max_num_batched_tokens, max_num_seqs, gpu_memory_utilization), maximizing throughput while minimizing latency and balancing memory use, all subject to user-provided constraints.

Features

  • Intelligent Profiling: Monitor GPU memory, utilization, and vLLM metrics automatically
  • Adaptive Parameter Search: Bayesian optimization (Optuna) with multi-objective support (throughput, latency, memory)
  • vLLM-Aware Integration: Parse vLLM logs for KV cache utilization, preemption tracking, and guidance
  • Multi-GPU Support: Handle data-parallel and model-parallel (tensor/pipeline) configurations
  • User-Friendly Configuration: Simple YAML configs to specify objectives and constraints
  • Rich Reporting: Plotly interactive HTML reports with trial progression, Pareto front, and GPU telemetry
  • Extensibility: Custom workloads and plugins for specific deployment scenarios
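The adaptive search and Pareto-front reporting above can be illustrated with a stdlib-only sketch (the `benchmark` function is a stub standing in for a real vLLM benchmark run, and the metric formulas are invented for illustration):

```python
import random

def benchmark(max_num_seqs, gpu_mem):
    """Stub for a real vLLM benchmark; returns (throughput, latency).

    The formulas below are fake: they just encode a plausible trade-off
    where larger batches raise both throughput and latency.
    """
    throughput = max_num_seqs * gpu_mem
    latency = 0.5 + max_num_seqs / 512
    return throughput, latency

def pareto_front(trials):
    """Keep trials not dominated on (maximize throughput, minimize latency)."""
    front = []
    for t in trials:
        dominated = any(
            o is not t
            and o["throughput"] >= t["throughput"]
            and o["latency"] <= t["latency"]
            for o in trials
        )
        if not dominated:
            front.append(t)
    return front

random.seed(0)
trials = []
for _ in range(20):
    seqs = random.randint(16, 256)
    mem = random.uniform(0.6, 0.99)
    tp, lat = benchmark(seqs, mem)
    trials.append({"max_num_seqs": seqs, "gpu_memory_utilization": mem,
                   "throughput": tp, "latency": lat})

print(f"{len(pareto_front(trials))} Pareto-optimal trials out of {len(trials)}")
```

The actual tool delegates this search to Optuna's samplers rather than random sampling; the sketch only shows the shape of the multi-objective trade-off.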

Installation

# Create and activate uv environment
uv venv --seed --python 3.10
source .venv/bin/activate

# Install vllm-tuner
uv pip install git+https://github.com/jranaraki/vllm-tuner

# Install vLLM
uv pip install vllm --torch-backend=auto

Configuration

Configuration is done via a YAML file (see default.yaml). The key settings are:

Multi-Objective Weights (must sum to 100)

objectives:
  throughput: 60  # Weight for throughput maximization
  latency: 30     # Weight for latency minimization
  memory: 10      # Weight for memory efficiency
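One way such percentage weights can be combined into a single score is a normalized weighted sum; the sketch below is an assumption about the scoring scheme, not the tool's actual formula, and all bounds are illustrative:

```python
def weighted_score(throughput, latency, memory_used, weights,
                   max_throughput, max_latency, total_memory):
    """Scalarize three objectives with percentage weights summing to 100.

    Throughput is maximized; latency and memory use are minimized, so
    their normalized values are inverted before weighting.
    """
    assert sum(weights.values()) == 100
    tp_norm = throughput / max_throughput
    lat_norm = 1 - latency / max_latency
    mem_norm = 1 - memory_used / total_memory
    return (weights["throughput"] * tp_norm
            + weights["latency"] * lat_norm
            + weights["memory"] * mem_norm) / 100

score = weighted_score(throughput=900, latency=0.2, memory_used=60,
                       weights={"throughput": 60, "latency": 30, "memory": 10},
                       max_throughput=1000, max_latency=1.0, total_memory=80)
print(round(score, 3))  # 0.805
```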

Search Space

search_space:
  batch_size: [1, 256]  # Range or override defaults
  gpu_memory_utilization: [0.6, 0.99]
  tensor_parallel_size: [1, 2, 4]
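The search-space entries above mix continuous ranges and discrete choice lists. A sketch of how a tuner might interpret them (purely illustrative, not the tool's actual parser; note the heuristic would misread a two-element choice list as a range):

```python
import random

def sample(search_space, rng=random):
    """Draw one configuration from a YAML-style search space.

    Two-element [low, high] lists of a uniform numeric type are treated
    as ranges; anything else is a categorical choice list.
    """
    config = {}
    for name, spec in search_space.items():
        if len(spec) == 2 and all(isinstance(v, int) for v in spec):
            config[name] = rng.randint(spec[0], spec[1])      # integer range
        elif len(spec) == 2 and all(isinstance(v, float) for v in spec):
            config[name] = rng.uniform(spec[0], spec[1])      # float range
        else:
            config[name] = rng.choice(spec)                   # categorical

    return config

space = {
    "batch_size": [1, 256],
    "gpu_memory_utilization": [0.6, 0.99],
    "tensor_parallel_size": [1, 2, 4],
}
print(sample(space))
```

Optuna's samplers would replace the random draws here with model-guided suggestions, but the range-vs-choice distinction is the same.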

Workload

workload:
  dataset_name: "tatsu-lab/alpaca"  # HF dataset
  sample_size: 100                  # Number of prompts
  concurrent_requests: 10           # Concurrent clients
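These workload settings map naturally onto a bounded concurrent client driver. A stdlib-only sketch, where `send_request` is a stub standing in for an actual call to a running vLLM server:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def send_request(prompt):
    """Stub: a real driver would send the prompt to the vLLM server."""
    start = time.perf_counter()
    time.sleep(0.001)  # pretend the server generated a completion
    return time.perf_counter() - start

def run_workload(prompts, concurrent_requests):
    """Issue prompts through a bounded pool; collect per-request latencies."""
    with ThreadPoolExecutor(max_workers=concurrent_requests) as pool:
        return list(pool.map(send_request, prompts))

# sample_size=100 prompts, concurrent_requests=10 clients, as in the config above
latencies = run_workload([f"prompt {i}" for i in range(100)],
                         concurrent_requests=10)
print(f"p50 latency: {sorted(latencies)[len(latencies) // 2]:.4f}s")
```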

Store the config file under the configs folder.

Run

Basic Tuning

# Run tuning study
vllm-tuner tune --config configs/default.yaml --study-name my_study

Output Structure

Studies are saved under studies/<study_name>/ and reports under reports/<study_name>/:

├── configs
│   └── default.yaml                    # vLLM-Tuner config
├── reports
│   └── my_study
│       └── report.html                 # Interactive Plotly report
└── studies
    └── my_study
        ├── baseline                    # Baseline metrics (if enabled)
        │   ├── baseline_config.yaml
        │   ├── baseline_metrics.json
        │   ├── baseline_summary.txt
        │   └── logs
        │       └── vllm_baseline.log
        ├── configs                     # Summary & best configs
        │   ├── best_config.json
        │   ├── best_config.yaml
        │   ├── summary.json
        │   └── trials.json
        ├── logs                        # vLLM server logs
        │   ├── vllm_trial_0.log
        │   ├── vllm_trial_1.log
        │   ├── vllm_trial_2.log
        │   ├── vllm_trial_3.log
        │   ├── vllm_trial_4.log
        │   └── ...
        └── optuna.db                   # SQLite study database
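The artifacts under studies/<study_name>/configs are plain JSON, so they can be inspected programmatically. A sketch assuming a simple trials.json layout (the actual schema may differ; the records below are invented for illustration):

```python
import json

# Hypothetical trials.json contents; the real schema may differ.
trials_json = json.dumps([
    {"trial": 0, "params": {"max_num_seqs": 64},  "throughput": 812.5},
    {"trial": 1, "params": {"max_num_seqs": 128}, "throughput": 1024.0},
    {"trial": 2, "params": {"max_num_seqs": 256}, "throughput": 976.3},
])

# Pick the trial with the highest throughput.
trials = json.loads(trials_json)
best = max(trials, key=lambda t: t["throughput"])
print(best["trial"], best["params"])  # 1 {'max_num_seqs': 128}
```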

An example of the final interactive report is shown in report_screenshot.png in the repository.

Documentation

For detailed information, see the comprehensive documentation.

Citing

If you find vllm-tuner useful and would like to cite this work, please use the following BibTeX entry:

@software{vllmtuner2026,
  author = {Javad Anaraki},
  title = {vllm-tuner: Automated Parameter Tuning for vLLM via Bayesian Optimization},
  url = {https://github.com/jranaraki/vllm-tuner},
  version = {0.1.0},
  year = {2026},
}

Acknowledgments
