A FastAPI service that calls a locally hosted Ollama model for chat-style interactions. Uses the official ollama Python client. Designed with a simple, clean structure and reasonable best practices.
- Python 3.10+
- Ollama running locally or in a container (default `http://localhost:11434`)
- A `llama3:8b` model available in local Ollama for testing the app's functionality, since the 20B model is resource-intensive (one way to pull it is sketched below)
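For the smaller test model, one option is to pull it through the same official `ollama` Python client the app uses; a minimal sketch, assuming Ollama listens on the default local port (the CLI equivalent is `ollama pull llama3:8b`):

```python
# Pull the smaller test model via the official ollama Python client.
# Equivalent to running `ollama pull llama3:8b` with the Ollama CLI.
import ollama

client = ollama.Client(host="http://localhost:11434")  # default local Ollama
client.pull("llama3:8b")
```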
The app reads settings from `.env` in the project root.
Required:
- Instance sizing for 20B (recommended):
  - GPU: 24 GB VRAM fits a 4-bit quantization comfortably (e.g., g5.2xlarge).
  - CPU-only: favor 32–64 GB RAM; e.g., c7i.4xlarge (16 vCPU, 32 GB) for better headroom.
Optional:
- `OLLAMA_HOST` (default `http://ollama:11434`)
- `APP_NAME` (default `Local gpt-oss-20b Chat API`)
- `DEBUG` (default `false`)
- `CORS_ORIGINS`: comma-separated list (defaults to `*`)
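A minimal `.env` sketch pulling these settings together; the values are illustrative, and `OLLAMA_MODEL` is the variable the helper scripts read (see the notes further down):

```env
# Illustrative example only; adjust values for your environment.
OLLAMA_HOST=http://ollama:11434
OLLAMA_MODEL=gpt-oss-20b
APP_NAME=Local gpt-oss-20b Chat API
DEBUG=false
CORS_ORIGINS=*
```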
```bash
python -m venv .venv
source .venv/bin/activate   # Windows: .venv\Scripts\activate
pip install -r requirements.txt
uvicorn app.main:app --reload --port 8000
```

Build the image:
```bash
docker build -t local-llm-api:latest .
```

Run an Ollama container too (optional):
```bash
docker compose --profile ollama up --build
```

The API is available at http://localhost:8000. Health check: `GET /api/health`.
- If both run in Compose, use the Ollama service name, e.g. `OLLAMA_HOST=http://ollama:11434`.
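To confirm the service and its Ollama backend are up, a quick check of the health endpoint might look like this (a sketch assuming the default port 8000 and that `httpx` is installed):

```python
# Sanity check against the running API (assumes the default port 8000).
import httpx

resp = httpx.get("http://localhost:8000/api/health", timeout=10)
resp.raise_for_status()
print(resp.json())  # reports app + Ollama health
```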
To deploy this app with the gpt-oss-20b model on AWS EC2 (instance sizing, GPU options, Compose configurations, and hardening), see `docs/aws-ec2-deployment.md`.
Quick starts on EC2:
- Host Ollama (CPU or GPU on host): see Helper scripts below.
- Compose-managed Ollama (same box):

  ```bash
  export OLLAMA_HOST=http://ollama:11434 && COMPOSE_PROFILES=ollama docker compose up --build -d
  docker compose --profile ollama exec ollama ollama pull gpt-oss-20b
  ```
Helper scripts:
- `bash scripts/ollama_container_setup.sh --model gpt-oss-20b [--gpus]` to run Ollama in a container and pull the model.
Run a single script to build the image, pull the model, and start services:
```bash
# Host Ollama (default):
bash scripts/build_and_pull.sh --model "$OLLAMA_MODEL"
```

Notes:
- If `--model` is omitted, the script reads `OLLAMA_MODEL` from `.env`.
- `GET /`: basic info
- `GET /api/health`: app + Ollama health
- `POST /api/chat`: chat completion
Body:

```json
{
  "messages": [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is Langchain in one line?"}
  ],
  "stream": false,
  "temperature": 0.2
}
```

- If `stream: true`, the response is newline-delimited JSON (`application/x-ndjson`) compatible with Ollama streaming chunks.
- If `stream: false` or omitted, the endpoint returns a single JSON object from Ollama (see the client sketch below).
- Startup performs a best-effort health check but does not fail the app if Ollama is down.
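For illustration, a small client sketch exercising both modes. Assumptions: the service listens on http://localhost:8000, `httpx` is installed, and streamed chunks follow Ollama's chat format with a `message.content` field.

```python
# Sketch of calling POST /api/chat in non-streaming and streaming modes.
# Assumes the API at http://localhost:8000 and Ollama-style streaming chunks.
import json

import httpx

BASE = "http://localhost:8000"
payload = {
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is Langchain in one line?"},
    ],
    "temperature": 0.2,
}

# Non-streaming: a single JSON object from Ollama.
resp = httpx.post(f"{BASE}/api/chat", json={**payload, "stream": False}, timeout=120)
resp.raise_for_status()
print(resp.json())

# Streaming: newline-delimited JSON (application/x-ndjson), one chunk per line.
with httpx.stream("POST", f"{BASE}/api/chat", json={**payload, "stream": True}, timeout=None) as r:
    r.raise_for_status()
    for line in r.iter_lines():
        if not line:
            continue
        chunk = json.loads(line)
        # Field layout assumed to match Ollama chat streaming responses.
        print(chunk.get("message", {}).get("content", ""), end="", flush=True)
```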