vllm-tuner is an intelligent tuner for vLLM that automatically monitors GPU metrics and uses Bayesian optimization to tune parameters (`batch_size`, `max_num_batched_tokens`, `max_num_seqs`, `gpu_memory_utilization`), maximizing throughput while minimizing latency and balancing memory use, all within user-provided constraints.
- Intelligent Profiling: Monitor GPU memory, utilization, and vLLM metrics automatically
- Adaptive Parameter Search: Bayesian optimization (Optuna) with multi-objective support (throughput, latency, memory)
- vLLM-Aware Integration: Parse vLLM logs for KV cache utilization, preemption tracking, and tuning guidance
- Multi-GPU Support: Handle data-parallel and model-parallel (tensor/pipeline) configurations
- User-Friendly Configuration: Simple YAML configs to specify objectives and constraints
- Rich Reporting: Plotly interactive HTML reports with trial progression, Pareto front, and GPU telemetry
- Extensibility: Custom workloads and plugins for specific deployment scenarios
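The multi-objective weighting can be illustrated with a short sketch. Everything below (the function, the metric values) is illustrative rather than vllm-tuner's actual API; it only shows how per-objective weights like those in the config could fold throughput, latency, and memory into a single scalar for comparing trials.

```python
def weighted_score(metrics, weights):
    """Combine trial metrics into one scalar using objective weights.

    Throughput is maximized; latency and memory footprint are minimized,
    so they enter with a negative sign. Metrics are assumed pre-normalized
    to comparable [0, 1] scales (an assumption of this sketch, not the tool).
    """
    total = sum(weights.values())
    return (
        weights["throughput"] * metrics["throughput"]
        - weights["latency"] * metrics["latency"]
        - weights["memory"] * metrics["memory"]
    ) / total

# Two hypothetical trials with normalized metrics
weights = {"throughput": 60, "latency": 30, "memory": 10}
trial_a = {"throughput": 0.9, "latency": 0.4, "memory": 0.7}
trial_b = {"throughput": 0.7, "latency": 0.2, "memory": 0.3}

print(weighted_score(weights=weights, metrics=trial_a))
print(weighted_score(weights=weights, metrics=trial_b))
```

With the default 60/30/10 weighting, a trial with higher throughput can outrank one with better latency and memory, which is the intended trade-off.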
```shell
# Create and activate uv environment
uv venv --seed --python 3.10
source .venv/bin/activate

# Install vllm-tuner
uv pip install git+https://github.com/jranaraki/vllm-tuner

# Install vLLM
uv pip install vllm --torch-backend=auto
```

Configuration is done via a YAML file (see `default.yaml`). The key settings are:
```yaml
objectives:
  throughput: 60                      # Weight for throughput maximization
  latency: 30                         # Weight for latency minimization
  memory: 10                          # Weight for memory efficiency

search_space:
  batch_size: [1, 256]                # Range or override defaults
  gpu_memory_utilization: [0.6, 0.99]
  tensor_parallel_size: [1, 2, 4]

workload:
  dataset_name: "tatsu-lab/alpaca"    # HF dataset
  sample_size: 100                    # Number of prompts
  concurrent_requests: 10             # Concurrent clients
```

Store the config file under the `configs` folder.
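How a `search_space` entry maps onto a sampled value depends on its shape; reading a two-element numeric list as a `[low, high]` range and a longer list as a discrete choice is one plausible interpretation. The sketch below uses that assumed convention (it is not documented vllm-tuner behavior) with plain stdlib random sampling standing in for the Bayesian sampler:

```python
import random

def sample_param(spec, rng):
    """Sample one value from a search-space spec.

    Assumed convention (illustrative only): a 2-element list of numbers
    is a [low, high] range; any other list is a categorical choice.
    """
    if len(spec) == 2 and all(isinstance(v, (int, float)) for v in spec):
        low, high = spec
        if isinstance(low, float) or isinstance(high, float):
            return rng.uniform(low, high)   # continuous range
        return rng.randint(low, high)        # inclusive integer range
    return rng.choice(spec)                  # categorical

search_space = {
    "batch_size": [1, 256],                  # integer range
    "gpu_memory_utilization": [0.6, 0.99],   # float range
    "tensor_parallel_size": [1, 2, 4],       # categorical (3 entries)
}

rng = random.Random(0)
trial = {name: sample_param(spec, rng) for name, spec in search_space.items()}
print(trial)
```

Note the ambiguity this convention implies: `[1, 2, 4]` is treated as three discrete options only because it has more than two entries.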
```shell
# Run tuning study
vllm-tuner tune --config configs/default.yaml --study-name my_study
```

Studies and reports are saved to `studies/<study_name>/` and `reports/<study_name>/`, respectively:
```
├── configs
│   └── default.yaml              # vLLM-Tuner config
├── reports
│   └── my_study
│       └── report.html           # Interactive Plotly report
└── studies
    └── my_study
        ├── baseline              # Baseline metrics (if enabled)
        │   ├── baseline_config.yaml
        │   ├── baseline_metrics.json
        │   ├── baseline_summary.txt
        │   └── logs
        │       └── vllm_baseline.log
        ├── configs               # Summary & best configs
        │   ├── best_config.json
        │   ├── best_config.yaml
        │   ├── summary.json
        │   └── trials.json
        ├── logs                  # vLLM server logs
        │   ├── vllm_trial_0.log
        │   ├── vllm_trial_1.log
        │   ├── vllm_trial_2.log
        │   ├── vllm_trial_3.log
        │   ├── vllm_trial_4.log
        │   └── ...
        └── optuna.db             # SQLite study database
```
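The per-trial vLLM server logs under `logs/` are what the tuner mines for KV-cache and preemption signals. A hedged sketch of that kind of parsing follows; the patterns target the periodic stats lines recent vLLM versions emit, but the exact log format varies by version, and this is not vllm-tuner's actual parser:

```python
import re

# vLLM's periodic engine stats include lines like "GPU KV cache usage: 37.5%"
KV_RE = re.compile(r"GPU KV cache usage: ([\d.]+)%")
PREEMPT_RE = re.compile(r"preempt", re.IGNORECASE)

def summarize_log(lines):
    """Extract peak KV cache usage and a preemption count from log lines."""
    kv_usages, preemptions = [], 0
    for line in lines:
        m = KV_RE.search(line)
        if m:
            kv_usages.append(float(m.group(1)))
        if PREEMPT_RE.search(line):
            preemptions += 1
    return {
        "peak_kv_cache_pct": max(kv_usages, default=0.0),
        "preemptions": preemptions,
    }

# Example lines in the style of vLLM's engine stats (illustrative)
sample = [
    "INFO ... Avg generation throughput: 412.3 tokens/s, GPU KV cache usage: 37.5%",
    "INFO ... GPU KV cache usage: 92.1%",
    "WARNING ... Sequence group 12 is preempted due to insufficient KV cache space",
]
print(summarize_log(sample))  # → {'peak_kv_cache_pct': 92.1, 'preemptions': 1}
```

High peak KV-cache usage together with preemptions is the kind of signal that would steer the search away from overly aggressive `gpu_memory_utilization` or `max_num_seqs` settings.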
The final report visualizes trial progression, the Pareto front, and GPU telemetry.
For detailed information, see the comprehensive documentation.
If you find vllm-tuner useful and would like to cite this work, please use the following BibTeX entry:
```bibtex
@software{vllmtuner2026,
  author  = {Javad Anaraki},
  title   = {vllm-tuner: Automated Parameter Tuning for vLLM via Bayesian Optimization},
  url     = {https://github.com/jranaraki/vllm-tuner},
  version = {0.1.0},
  year    = {2026},
}
```
vllm-tuner builds on:

- Optuna for Bayesian optimization
- vLLM for high-performance serving
- Hugging Face Datasets for workloads

