A FastAPI service that calls a locally hosted Ollama model for chat-style interactions. It uses the official `ollama` Python client and is designed with a simple, clean structure and reasonable best practices.
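For orientation, the snippet below is a minimal sketch of how a FastAPI route can forward chat messages through the official `ollama` client. The `/api/chat` path and the `OLLAMA_HOST` variable come from this README; the request schema, module layout, and default model name are illustrative assumptions, not the project's actual code.

```python
# Minimal sketch (not the project's actual code): a FastAPI route that
# forwards chat messages to a local Ollama server via the official client.
import os

from fastapi import FastAPI
from ollama import AsyncClient
from pydantic import BaseModel

app = FastAPI()
client = AsyncClient(host=os.getenv("OLLAMA_HOST", "http://localhost:11434"))


class ChatRequest(BaseModel):
    messages: list[dict]       # [{"role": "user", "content": "..."}]
    temperature: float = 0.2   # illustrative default


@app.post("/api/chat")
async def chat(req: ChatRequest):
    # Non-streaming call; the real app also supports stream=True (see the API notes below).
    response = await client.chat(
        model=os.getenv("OLLAMA_MODEL", "llama3:8b"),
        messages=req.messages,
        options={"temperature": req.temperature},
    )
    return response
```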
- Python 3.10+
- Ollama running in a container (default `http://localhost:11434`)
- Given the heavy resource usage of larger models, the `llama3:8b` model pulled into the local Ollama instance (for testing the app's functionality)
The app reads settings from `.env` in the project root.

Required:
- Instance sizing for 20B (recommended):
  - GPU: 24 GB VRAM fits 4-bit quantization comfortably (e.g., g5.2xlarge).
  - CPU-only: favor 32–64 GB RAM; e.g., c7i.4xlarge (16 vCPU, 32 GB) for better headroom.
Optional:
- `OLLAMA_HOST` — default `http://ollama:11434`
- `APP_NAME` — default `Local gpt-oss-20b Chat API`
- `DEBUG` — default `false`
- `CORS_ORIGINS` — comma-separated list (defaults to `*`)
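As a sketch of how these variables might be loaded, assuming a pydantic-settings based config module (the real project may wire this up differently):

```python
# Hypothetical settings module; field names mirror the env vars listed above.
from pydantic_settings import BaseSettings, SettingsConfigDict


class Settings(BaseSettings):
    model_config = SettingsConfigDict(env_file=".env", env_file_encoding="utf-8")

    ollama_host: str = "http://ollama:11434"        # OLLAMA_HOST
    app_name: str = "Local gpt-oss-20b Chat API"    # APP_NAME
    debug: bool = False                             # DEBUG
    cors_origins: str = "*"                         # CORS_ORIGINS (comma-separated)

    @property
    def cors_origin_list(self) -> list[str]:
        # Split the comma-separated CORS_ORIGINS value into a list.
        return [origin.strip() for origin in self.cors_origins.split(",")]


settings = Settings()
```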
python -m venv .venv
source .venv/bin/activate  # Windows: .venv\Scripts\activate
pip install -r requirements.txt
uvicorn app.main:app --reload --port 8000
Build the image:
docker build -t local-llm-api:latest .
Run an Ollama container too (optional):
docker compose --profile ollama up --build
The API is available at http://localhost:8000. Health check: `GET /api/health`.
- If both run in Compose, use the Ollama service name, e.g. `OLLAMA_HOST=http://ollama:11434`.
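To sanity-check the running service from Python, a small smoke test along these lines can be used (it assumes the `requests` package, which is not necessarily part of this project's dependencies):

```python
# Smoke test: confirm the API and its Ollama backend are reachable.
import requests

resp = requests.get("http://localhost:8000/api/health", timeout=10)
resp.raise_for_status()
print(resp.json())  # expected to report app + Ollama status
```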
To deploy this app with the `gpt-oss-20b` model on AWS EC2 (instance sizing, GPU options, Compose configurations, and hardening), see `docs/aws-ec2-deployment.md`.
Quick starts on EC2:
- Host Ollama (CPU or GPU on host):
- Compose-managed Ollama (same box):
  - `export OLLAMA_HOST=http://ollama:11434 && COMPOSE_PROFILES=ollama docker compose up --build -d`
  - `docker compose --profile ollama exec ollama ollama pull gpt-oss-20b`

Helper scripts:
- `bash scripts/ollama_container_setup.sh --model gpt-oss-20b [--gpus]` to run Ollama in a container and pull the model.
Run a single script to build the image, pull the model, and start services:
# Host Ollama (default):
bash scripts/build_and_pull.sh --model "$OLLAMA_MODEL"
Notes:
- If `--model` is omitted, the script reads `OLLAMA_MODEL` from `.env`.
- `GET /` — basic info
- `GET /api/health` — app + Ollama health
- `POST /api/chat` — chat completion
Body:
{
  "messages": [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is Langchain in one line?"}
  ],
  "stream": false,
  "temperature": 0.2
}
- If `stream: true`, the response is newline-delimited JSON (`application/x-ndjson`) compatible with Ollama streaming chunks.
- If `stream: false` or omitted, the response is a single JSON object from Ollama.
- Startup performs a best-effort health check but does not fail the app if Ollama is down.
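For reference, a hedged client-side example of both modes, again assuming the `requests` package; the exact chunk fields are whatever Ollama returns and may differ between versions:

```python
# Call POST /api/chat in both non-streaming and streaming (NDJSON) modes.
import json

import requests

BASE = "http://localhost:8000"
payload = {
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is Langchain in one line?"},
    ],
    "temperature": 0.2,
}

# 1) Non-streaming: a single JSON object from Ollama.
resp = requests.post(f"{BASE}/api/chat", json={**payload, "stream": False}, timeout=120)
resp.raise_for_status()
print(resp.json())

# 2) Streaming: newline-delimited JSON chunks (application/x-ndjson).
with requests.post(
    f"{BASE}/api/chat", json={**payload, "stream": True}, stream=True, timeout=120
) as r:
    r.raise_for_status()
    for line in r.iter_lines():
        if line:
            chunk = json.loads(line)
            # Each chunk mirrors Ollama's streaming format; content is typically
            # found under chunk["message"]["content"].
            print(chunk.get("message", {}).get("content", ""), end="", flush=True)
```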