vLLM Responses

FastAPI gateway that exposes an OpenAI-style Responses API (/v1/responses) in front of a vLLM OpenAI-compatible server (/v1/chat/completions), with:

SSE streaming event shape + ordering
previous_response_id statefulness (ResponseStore)
gateway-executed built-in tool: code_interpreter
gateway-hosted MCP tools (tools[].type="mcp" with configured server_label)

Current MCP boundary:

tools[].type="mcp" is gateway-hosted MCP resolved via VR_MCP_CONFIG_PATH.
Request-declared MCP targets (server_url, connector_id) are not supported yet.

📚 Full User Documentation (Guides, API Reference, Examples)

Design docs (maintainer-facing): design_docs/index.md.

Install

The vllm-responses CLI is provided by the Python package in responses/.

Prerequisites: Python 3.12+ and uv.

Install from a prebuilt wheel (Linux x86_64) (Recommended)

Download a prebuilt wheel (vllm_responses-*.whl) from GitHub Releases (preferred) or a CI run artifact, then install it:

uv venv --python=3.12
source .venv/bin/activate
uv pip install path/to/vllm_responses-*.whl

On Linux x86_64 wheels, the Code Interpreter server binary is bundled, so Bun is not required. Currently, wheels are only built for Linux x86_64.

Install from source (repo checkout) (Development)

git clone https://github.com/EmbeddedLLM/vllm-responses
cd vllm-responses

uv venv --python=3.12
source .venv/bin/activate
uv pip install -e ./responses

# Development: enable Code Interpreter via Bun fallback
# - Required for source checkouts when running with `code_interpreter` enabled (default)
cd responses/python/vllm_responses/tools/code_interpreter
bun install
export VR_CODE_INTERPRETER_DEV_BUN_FALLBACK=1
cd -

vllm-responses --help

Verify installation:

vllm-responses --help

Optional dependency sets (extras)

Install any combination via:

uv pip install -e './responses[<extra1>,<extra2>]'

Available extras:

docs: MkDocs toolchain (contributors).
lint: Ruff + Markdown formatting.
test: Pytest + coverage + load testing tools.
tracing: OpenTelemetry tracing support (only needed if you enable VR_TRACING_ENABLED=true).
build: Package build/publish tools.
all: Everything above.

Run

one-command local runtime (`vllm-responses serve`)

Prereqs:

If you want to spawn vLLM: vllm must be installed (e.g. uv pip install vllm).
If code_interpreter is enabled (default), the first start may download the Pyodide runtime (~400MB) into a cache directory (see VR_PYODIDE_CACHE_DIR). This requires tar to be installed.
For non-Linux platforms (or source installs without the bundled binary), you can disable the tool via --code-interpreter disabled. For development you can also enable the Bun-based fallback via VR_CODE_INTERPRETER_DEV_BUN_FALLBACK=1.

External upstream (you start vLLM yourself; /v1 is optional):

vllm-responses serve --upstream http://127.0.0.1:8457

Spawn vLLM (everything after -- is forwarded to vllm serve):

vllm-responses serve --gateway-workers 4 -- \
  meta-llama/Llama-3.2-3B-Instruct \
  --dtype auto \
  --port 8457

The Responses endpoint is:

POST http://127.0.0.1:5969/v1/responses

Remote access note:

If you bind the gateway with --gateway-host 0.0.0.0, use the machine’s IP/hostname to connect (not 0.0.0.0).

Optional: ResponseStore hot cache (Redis)

previous_response_id hydration reads the previous response state from the DB. For multi-worker deployments, you can optionally enable a Redis-backed hot cache to reduce DB reads/latency.

Env vars (default off):

VR_RESPONSE_STORE_CACHE=1
VR_RESPONSE_STORE_CACHE_TTL_SECONDS=3600

Redis connection:

VR_REDIS_HOST, VR_REDIS_PORT

Quick smoke test (OpenAI Python SDK)

from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:5969/v1", api_key="dummy")

with client.responses.stream(
    model="MiniMaxAI/MiniMax-M2.1",
    input=[{"role": "user", "content": "You MUST call the code_interpreter tool. Execute: 2+2. Reply with ONLY the number."}],
    tools=[{"type": "code_interpreter"}],
    tool_choice="auto",
    include=["code_interpreter_call.outputs"],
) as stream:
    for evt in stream:
        if getattr(evt, "type", "").endswith(".delta"):
            continue
        print(getattr(evt, "type", evt))
    r1 = stream.get_final_response().id

with client.responses.stream(
    model="MiniMaxAI/MiniMax-M2.1",
    previous_response_id=r1,
    input=[{"role": "user", "content": "What number did you just compute? Reply with ONLY the number."}],
    tool_choice="none",
) as stream:
    for evt in stream:
        if getattr(evt, "type", "").endswith(".delta"):
            continue
        print(getattr(evt, "type", evt))

Name		Name	Last commit message	Last commit date
Latest commit History 7 Commits
.github		.github
.vscode		.vscode
docs		docs
responses		responses
scripts		scripts
.env.example		.env.example
.gitattributes		.gitattributes
.gitignore		.gitignore
.markdownlint.yaml		.markdownlint.yaml
.prettierignore		.prettierignore
.prettierrc		.prettierrc
LICENSE		LICENSE
README.md		README.md
THIRD_PARTY_NOTICES.md		THIRD_PARTY_NOTICES.md
mkdocs.yml		mkdocs.yml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

vLLM Responses

Install

Install from a prebuilt wheel (Linux x86_64) (Recommended)

Install from source (repo checkout) (Development)

Optional dependency sets (extras)

Run

one-command local runtime (`vllm-responses serve`)

Optional: ResponseStore hot cache (Redis)

Quick smoke test (OpenAI Python SDK)

About

Uh oh!

Releases 1

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

vLLM Responses

Install

Install from a prebuilt wheel (Linux x86_64) (Recommended)

Install from source (repo checkout) (Development)

Optional dependency sets (extras)

Run

one-command local runtime (vllm-responses serve)

Optional: ResponseStore hot cache (Redis)

Quick smoke test (OpenAI Python SDK)

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases 1

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

one-command local runtime (`vllm-responses serve`)

Packages