
Conversation

@ehhuang (Contributor) commented on Sep 30, 2025

# What does this PR do?

#3462 made it possible to start the llama stack server with uvicorn, which supports spawning multiple workers.

This PR enables launching more than one worker from `llama stack run` (the worker-count parameter itself will be added in a follow-up PR, to keep this PR focused on simplification) by removing the old way of launching the stack server and consolidating all launching through `uvicorn.run` only.
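
For illustration, here is a minimal sketch of what consolidating on `uvicorn.run` looks like, assuming a hypothetical app factory named `create_app` (the actual entrypoint and module layout in llama-stack may differ):

```python
# Minimal sketch: launching an ASGI app via uvicorn.run with multiple workers.
# `create_app` is an assumed factory name, not necessarily the real entrypoint.
import uvicorn

def main(port: int = 8321, workers: int = 1) -> None:
    uvicorn.run(
        # With workers > 1, uvicorn needs an import string (not an app object)
        # so each worker process can re-import the application itself.
        "llama_stack.core.server.server:create_app",
        factory=True,        # treat the target as an app factory
        host="0.0.0.0",
        port=port,
        workers=workers,     # >1 spawns multiple worker processes
    )

if __name__ == "__main__":
    main()
```

Because `uvicorn.run` becomes the single launch path, adding a worker-count parameter to `llama stack run` later only needs to thread one extra argument through.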

## Test Plan

Ran `llama stack run starter`
CI

@meta-cla bot added the "CLA Signed" label (managed by the Meta Open Source bot) on Sep 30, 2025
@ehhuang force-pushed the pr3625 branch 3 times, most recently from 7343031 to 70736e8 on October 3, 2025 00:09
@ehhuang changed the title from "remove main" to "chore: use uvicorn to start llama stack server everywhere" on Oct 3, 2025
@ehhuang changed the title from "chore: use uvicorn to start llama stack server everywhere" to "chore!: use uvicorn to start llama stack server everywhere" on Oct 3, 2025
@ehhuang force-pushed the pr3625 branch 5 times, most recently from efcfdd0 to a22534c on October 3, 2025 04:57
@ehhuang ehhuang marked this pull request as ready for review October 3, 2025 05:16
@leseb (Collaborator) left a comment

Love it! But it's missing a ton of --env cleanup in the docs and other places:

CONTRIBUTING.md:uv run --env-file .env -- pytest -v tests/integration/inference/test_text_inference.py --text-model=meta-llama/Llama-3.1-8B-Instruct
docs/docs/building_applications/tools.mdx:--env TAVILY_SEARCH_API_KEY=${TAVILY_SEARCH_API_KEY}
docs/docs/building_applications/tools.mdx:    --env WOLFRAM_ALPHA_API_KEY=${WOLFRAM_ALPHA_API_KEY}
docs/docs/contributing/index.mdx:uv run --env-file .env -- pytest -v tests/integration/inference/test_text_inference.py --text-model=meta-llama/Llama-3.1-8B-Instruct
docs/docs/contributing/new_api_provider.mdx: typically references some environment variables for specifying API keys and the like. You can set these in the environment or pass these via the `--env` flag to the test command.
docs/docs/distributions/building_distro.mdx:  --env INFERENCE_MODEL=$INFERENCE_MODEL \
docs/docs/distributions/building_distro.mdx:  --env OLLAMA_URL=http://host.docker.internal:11434
docs/docs/distributions/building_distro.mdx:* `--env INFERENCE_MODEL=$INFERENCE_MODEL`: Sets the model to use for inference
docs/docs/distributions/building_distro.mdx:* `--env OLLAMA_URL=http://host.docker.internal:11434`: Configures the URL for the Ollama service
docs/docs/distributions/building_distro.mdx:usage: llama stack run [-h] [--port PORT] [--image-name IMAGE_NAME] [--env KEY=VALUE]
docs/docs/distributions/building_distro.mdx:  --env KEY=VALUE       Environment variables to pass to the server in KEY=VALUE format. Can be specified multiple times. (default: None)
docs/docs/distributions/configuration.mdx:- Notice that configuration can reference environment variables (with default values), which are expanded at runtime. When you run a stack server (via docker or via `llama stack run`), you can specify `--env OLLAMA_URL=http://my-server:11434` to override the default value.
docs/docs/distributions/configuration.mdx:llama stack run --config run.yaml --env API_KEY=sk-123 --env BASE_URL=https://custom-api.com
docs/docs/distributions/remote_hosted_distro/watsonx.md:  --env WATSONX_API_KEY=$WATSONX_API_KEY \
docs/docs/distributions/remote_hosted_distro/watsonx.md:  --env WATSONX_PROJECT_ID=$WATSONX_PROJECT_ID \
docs/docs/distributions/remote_hosted_distro/watsonx.md:  --env WATSONX_BASE_URL=$WATSONX_BASE_URL
docs/docs/distributions/self_hosted_distro/dell.md:  --env INFERENCE_MODEL=$INFERENCE_MODEL \
docs/docs/distributions/self_hosted_distro/dell.md:  --env DEH_URL=$DEH_URL \
docs/docs/distributions/self_hosted_distro/dell.md:  --env CHROMA_URL=$CHROMA_URL
docs/docs/distributions/self_hosted_distro/dell.md:  --env INFERENCE_MODEL=$INFERENCE_MODEL \
docs/docs/distributions/self_hosted_distro/dell.md:  --env DEH_URL=$DEH_URL \
docs/docs/distributions/self_hosted_distro/dell.md:  --env SAFETY_MODEL=$SAFETY_MODEL \
docs/docs/distributions/self_hosted_distro/dell.md:  --env DEH_SAFETY_URL=$DEH_SAFETY_URL \
docs/docs/distributions/self_hosted_distro/dell.md:  --env CHROMA_URL=$CHROMA_URL
docs/docs/distributions/self_hosted_distro/dell.md:  --env INFERENCE_MODEL=$INFERENCE_MODEL \
docs/docs/distributions/self_hosted_distro/dell.md:  --env DEH_URL=$DEH_URL \
docs/docs/distributions/self_hosted_distro/dell.md:  --env CHROMA_URL=$CHROMA_URL
docs/docs/distributions/self_hosted_distro/dell.md:  --env INFERENCE_MODEL=$INFERENCE_MODEL \
docs/docs/distributions/self_hosted_distro/dell.md:  --env DEH_URL=$DEH_URL \
docs/docs/distributions/self_hosted_distro/dell.md:  --env SAFETY_MODEL=$SAFETY_MODEL \
docs/docs/distributions/self_hosted_distro/dell.md:  --env DEH_SAFETY_URL=$DEH_SAFETY_URL \
docs/docs/distributions/self_hosted_distro/dell.md:  --env CHROMA_URL=$CHROMA_URL
docs/docs/distributions/self_hosted_distro/meta-reference-gpu.md:  --env INFERENCE_MODEL=meta-llama/Llama-3.2-3B-Instruct
docs/docs/distributions/self_hosted_distro/meta-reference-gpu.md:  --env INFERENCE_MODEL=meta-llama/Llama-3.2-3B-Instruct \
docs/docs/distributions/self_hosted_distro/meta-reference-gpu.md:  --env SAFETY_MODEL=meta-llama/Llama-Guard-3-1B
docs/docs/distributions/self_hosted_distro/meta-reference-gpu.md:  --env INFERENCE_MODEL=meta-llama/Llama-3.2-3B-Instruct
docs/docs/distributions/self_hosted_distro/meta-reference-gpu.md:  --env INFERENCE_MODEL=meta-llama/Llama-3.2-3B-Instruct \
docs/docs/distributions/self_hosted_distro/meta-reference-gpu.md:  --env SAFETY_MODEL=meta-llama/Llama-Guard-3-1B
docs/docs/distributions/self_hosted_distro/nvidia.md:  --env NVIDIA_API_KEY=$NVIDIA_API_KEY
docs/docs/distributions/self_hosted_distro/nvidia.md:  --env NVIDIA_API_KEY=$NVIDIA_API_KEY \
docs/docs/distributions/self_hosted_distro/nvidia.md:  --env INFERENCE_MODEL=$INFERENCE_MODEL
docs/docs/getting_started/detailed_tutorial.mdx:  --env OLLAMA_URL=http://host.docker.internal:11434
docs/docs/getting_started/detailed_tutorial.mdx:  --env OLLAMA_URL=http://localhost:11434
docs/getting_started_llama4.ipynb:        "        f\"uv run --with llama-stack llama stack run meta-reference-gpu --image-type venv --env INFERENCE_MODEL={model_id}\",\n",
docs/zero_to_hero_guide/README.md:      --env INFERENCE_MODEL=$INFERENCE_MODEL \
docs/zero_to_hero_guide/README.md:      --env SAFETY_MODEL=$SAFETY_MODEL \
docs/zero_to_hero_guide/README.md:      --env OLLAMA_URL=$OLLAMA_URL
llama_stack/cli/stack/run.py:            "--env",
llama_stack/cli/stack/run.py:                    run_args.extend(["--env", f"{key}={value}"])
llama_stack/core/build.py:            "--env-name",
llama_stack/core/build_venv.sh:  echo "Usage: $0 --env-name <env_name> --normal-deps <pip_dependencies> [--external-provider-deps <external_provider_deps>] [--optional-deps <special_pip_deps>]"
llama_stack/core/build_venv.sh:  echo "Example: $0 --env-name mybuild --normal-deps 'numpy pandas scipy' --external-provider-deps 'foo' --optional-deps 'bar'"
llama_stack/core/build_venv.sh:    --env-name)
llama_stack/core/build_venv.sh:        echo "Error: --env-name requires a string value" >&2
llama_stack/core/build_venv.sh:  echo "Error: --env-name and --normal-deps are required." >&2
llama_stack/core/server/server.py:        "--env",
llama_stack/core/start_stack.sh:  echo "Usage: $0 <env_type> <env_path_or_name> <port> [--config <yaml_config>] [--env KEY=VALUE]..."
llama_stack/core/start_stack.sh:    --env)
llama_stack/core/start_stack.sh:        env_vars="$env_vars --env $2"
llama_stack/core/start_stack.sh:        echo -e "${RED}Error: --env requires a KEY=VALUE argument${NC}" >&2
llama_stack/distributions/dell/doc_template.md:  --env INFERENCE_MODEL=$INFERENCE_MODEL \
llama_stack/distributions/dell/doc_template.md:  --env DEH_URL=$DEH_URL \
llama_stack/distributions/dell/doc_template.md:  --env CHROMA_URL=$CHROMA_URL
llama_stack/distributions/dell/doc_template.md:  --env INFERENCE_MODEL=$INFERENCE_MODEL \
llama_stack/distributions/dell/doc_template.md:  --env DEH_URL=$DEH_URL \
llama_stack/distributions/dell/doc_template.md:  --env SAFETY_MODEL=$SAFETY_MODEL \
llama_stack/distributions/dell/doc_template.md:  --env DEH_SAFETY_URL=$DEH_SAFETY_URL \
llama_stack/distributions/dell/doc_template.md:  --env CHROMA_URL=$CHROMA_URL
llama_stack/distributions/dell/doc_template.md:  --env INFERENCE_MODEL=$INFERENCE_MODEL \
llama_stack/distributions/dell/doc_template.md:  --env DEH_URL=$DEH_URL \
llama_stack/distributions/dell/doc_template.md:  --env CHROMA_URL=$CHROMA_URL
llama_stack/distributions/dell/doc_template.md:  --env INFERENCE_MODEL=$INFERENCE_MODEL \
llama_stack/distributions/dell/doc_template.md:  --env DEH_URL=$DEH_URL \
llama_stack/distributions/dell/doc_template.md:  --env SAFETY_MODEL=$SAFETY_MODEL \
llama_stack/distributions/dell/doc_template.md:  --env DEH_SAFETY_URL=$DEH_SAFETY_URL \
llama_stack/distributions/dell/doc_template.md:  --env CHROMA_URL=$CHROMA_URL
llama_stack/distributions/meta-reference-gpu/doc_template.md:  --env INFERENCE_MODEL=meta-llama/Llama-3.2-3B-Instruct
llama_stack/distributions/meta-reference-gpu/doc_template.md:  --env INFERENCE_MODEL=meta-llama/Llama-3.2-3B-Instruct \
llama_stack/distributions/meta-reference-gpu/doc_template.md:  --env SAFETY_MODEL=meta-llama/Llama-Guard-3-1B
llama_stack/distributions/meta-reference-gpu/doc_template.md:  --env INFERENCE_MODEL=meta-llama/Llama-3.2-3B-Instruct
llama_stack/distributions/meta-reference-gpu/doc_template.md:  --env INFERENCE_MODEL=meta-llama/Llama-3.2-3B-Instruct \
llama_stack/distributions/meta-reference-gpu/doc_template.md:  --env SAFETY_MODEL=meta-llama/Llama-Guard-3-1B
llama_stack/distributions/nvidia/doc_template.md:  --env NVIDIA_API_KEY=$NVIDIA_API_KEY
llama_stack/distributions/nvidia/doc_template.md:  --env NVIDIA_API_KEY=$NVIDIA_API_KEY \
llama_stack/distributions/nvidia/doc_template.md:  --env INFERENCE_MODEL=$INFERENCE_MODEL
llama_stack/env.py:            f"\n3. Pass directly to pytest: pytest --env {key}=your-key"
scripts/install.sh:      --env OLLAMA_URL="http://ollama-server:${OLLAMA_PORT}")
tests/README.md:- Any API keys you need to use should be set in the environment, or can be passed in with the --env option.
tests/integration/README.md:- `--env`: set environment variables, e.g. --env KEY=value. this is a utility option to set environment variables required by various providers.
tests/integration/conftest.py:    env_vars = config.getoption("--env") or []
tests/integration/conftest.py:    parser.addoption("--env", action="append", help="Set environment variables, e.g. --env KEY=value")
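
The conftest lines above register `--env` as an append-style option; as a rough sketch (illustrative only, not the repo's exact code), repeated KEY=VALUE pairs can be applied to the process environment like this:

```python
# Sketch: apply repeated --env KEY=VALUE options to os.environ.
# Mirrors the append-style option shown in conftest.py above; the exact
# llama-stack implementation may differ.
import argparse
import os

parser = argparse.ArgumentParser()
parser.add_argument(
    "--env",
    action="append",
    metavar="KEY=VALUE",
    help="Set environment variables, e.g. --env KEY=value",
)
args = parser.parse_args(["--env", "OLLAMA_URL=http://localhost:11434"])

for pair in args.env or []:
    key, sep, value = pair.partition("=")
    if not sep or not key:
        raise ValueError(f"--env requires KEY=VALUE, got: {pair!r}")
    os.environ[key] = value  # make the value visible to providers/tests
```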

@ehhuang changed the title from "chore!: use uvicorn to start llama stack server everywhere" to "chore: use uvicorn to start llama stack server everywhere" on Oct 3, 2025
@ehhuang (Contributor, Author) commented on Oct 3, 2025

@leseb I've moved the --env deprecation out of this PR for ease of reviewing :)

@leseb (Collaborator) commented on Oct 6, 2025

> @leseb I've moved the --env deprecation out of this PR for ease of reviewing :)

Makes sense!

@leseb (Collaborator) left a comment

Pulled and tested locally with a few OpenAI chat completion calls. Thanks!

@leseb merged commit 426cac0 into llamastack:main on Oct 6, 2025
43 checks passed
leseb added a commit to leseb/llama-stack-k8s-operator that referenced this pull request Oct 6, 2025
0.3.0 with llamastack/llama-stack#3625 forces us
to use "llama stack run" and the server module doesn't execute the
server anymore.

Signed-off-by: Sébastien Han <seb@redhat.com>
leseb added a commit to leseb/llama-stack-distribution that referenced this pull request Oct 10, 2025
The next 0.3.0 version will have a different way to run the server. The server module does not run the server anymore; it now happens via

```
llama stack run <path to config>
```

For more details llamastack/llama-stack#3625

Signed-off-by: Sébastien Han <seb@redhat.com>
leseb added a commit to opendatahub-io/llama-stack-distribution that referenced this pull request Oct 10, 2025
The next 0.3.0 version will have a different way to run the server. The server module does not run the server anymore; it now happens via

```
llama stack run <path to config>
```

For more details llamastack/llama-stack#3625

## Summary by CodeRabbit

- Chores
  - Updated container entrypoint to launch via the CLI, providing more consistent startup behavior across environments.
  - Preserves existing configuration (same run YAML path); no action required from users.
  - May result in slightly different log format and signal handling during startup/shutdown.
  - No changes to user-facing APIs or features.

leseb added a commit to llamastack/llama-stack-k8s-operator that referenced this pull request Oct 16, 2025
0.3.0 with llamastack/llama-stack#3625 forces us
to use "llama stack run" and the server module doesn't execute the
server anymore.

Signed-off-by: Sébastien Han <seb@redhat.com>
VaishnaviHire pushed a commit to VaishnaviHire/llama-stack-k8s-operator that referenced this pull request Oct 17, 2025
0.3.0 with llamastack/llama-stack#3625 forces us
to use "llama stack run" and the server module doesn't execute the
server anymore.

Signed-off-by: Sébastien Han <seb@redhat.com>
(cherry picked from commit 90365e7)