Native, Apple Silicon–only local LLM server. Built on Apple's MLX for maximum performance on M‑series chips, with Apple Foundation Models integration when available. SwiftUI app + SwiftNIO server with OpenAI‑compatible and Ollama‑compatible endpoints.
Created by Dinoki Labs (dinoki.ai), makers of a fully native desktop AI assistant and companion.
📚 View Documentation - Guides, tutorials, and comprehensive documentation
- Native MLX runtime: Optimized for Apple Silicon using MLX/MLXLLM
- Apple Foundation Models: Use the system default model via `model: "foundation"` or `model: "default"` on supported macOS versions; accelerated by Apple Neural Engine (ANE) when available
- Apple Silicon only: Designed and tested for M‑series Macs
- OpenAI API compatible: `/v1/models` and `/v1/chat/completions` (stream and non‑stream)
- Ollama‑compatible: `/chat` endpoint with NDJSON streaming for OllamaKit and other Ollama clients
- Function/Tool calling: OpenAI‑style `tools` + `tool_choice`, with `tool_calls` parsing and streaming deltas
- Fast token streaming: Server‑Sent Events for low‑latency output
- In‑app Chat overlay: Chat directly with your models in a resizable glass window — streaming, Markdown, model picker, and a global hotkey (default ⌘;)
- Model manager UI: Browse, download, and manage MLX models from `mlx-community`
- System resource monitor: Real-time CPU and RAM usage visualization
- Self‑contained: SwiftUI app with an embedded SwiftNIO HTTP server
- macOS 15.5+
- Apple Silicon (M1 or newer)
- Xcode 16.4+ (to build from source)
- Apple Intelligence features require macOS 26 (Tahoe)
osaurus/
├── App/
│ ├── osaurus.xcodeproj
│ └── osaurus/
│ ├── osaurusApp.swift # Thin app entry point
│ └── Assets.xcassets/
└── Packages/
├── OsaurusCore/ # Swift Package (all app logic & deps)
│ ├── Controllers/ # NIO server lifecycle
│ ├── Managers/ # Model discovery & downloads (Hugging Face)
│ ├── Models/ # DTOs, config, health, etc.
│ ├── Networking/ # Router, handlers, response writers
│ ├── Services/ # MLX runtime, Foundation, Hugging Face, etc.
│ ├── Theme/
│ └── Views/ # SwiftUI views (popover, chat, managers)
└── OsaurusCLI/ # Swift Package (executable CLI)
Notes:
- Dependencies are managed by Swift Package Manager in `Packages/OsaurusCore/Package.swift`.
- The macOS app target depends only on `OsaurusCore`.
- Native MLX text generation with model
- Model manager with curated suggestions (Llama, Qwen, Gemma, Mistral, etc.)
- Download sizes estimated via Hugging Face metadata
- Streaming and non‑streaming chat completions
- Multiple response formats: SSE (OpenAI‑style) and NDJSON (Ollama‑style)
- Compatible with OllamaKit and other Ollama client libraries
- OpenAI‑compatible function calling with robust parser for model outputs (handles code fences/formatting noise)
- Auto‑detects stop sequences and BOS token from tokenizer configs
- Health endpoint and simple status UI
- Real-time system resource monitoring
- Path normalization for API compatibility
- Overlay chat UI accessible from the menu bar bubble button or a global hotkey (default ⌘;)
- Foundation‑first model picker, plus any installed MLX models; `foundation` appears when available
- Real‑time token streaming with a Stop button and smooth auto‑scroll
- Rich Markdown rendering with one‑click copy per message
- Input shortcuts: Return or ⌘Return to send; Shift+Return inserts a newline
- Optional global system prompt is prepended to every chat
The following are 20-run averages from our batch benchmark suite. See raw results for details and variance.
| Server | Model | TTFT avg (ms) | Total avg (ms) | Chars/s avg | TTFT rel | Total rel | Chars/s rel | Success |
|---|---|---|---|---|---|---|---|---|
| Osaurus | llama-3.2-3b-instruct-4bit | 87 | 1237 | 554 | 0% | 0% | 0% | 100% |
| Ollama | llama3.2 | 33 | 1622 | 430 | +63% | -31% | -22% | 100% |
| LM Studio | llama-3.2-3b-instruct | 113 | 1221 | 588 | -30% | +1% | +6% | 100% |
- Metrics: TTFT = time-to-first-token, Total = time to final token, Chars/s = streaming throughput.
- Relative % vs Osaurus baseline: TTFT/Total computed as 1 - other/osaurus; Chars/s as other/osaurus - 1. Positive means the compared server did better than Osaurus on that metric (see the short script after this list).
- Data sources: `results/osaurus-vs-ollama-lmstudio-batch.summary.json`, `results/osaurus-vs-ollama-lmstudio-batch.results.csv`.
- How to reproduce: `scripts/run_bench.sh` calls `scripts/benchmark_models.py` to run prompts across servers and write results.
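The relative columns can be re-derived from the averages above using that formula; the following is a minimal Python sketch (expect deviations of a percentage point or so, since the displayed averages are rounded):

```python
# Recompute the relative columns from the rounded averages in the table above.
# Results may differ slightly from the table, whose percentages were computed
# from the unrounded raw results.
osaurus = {"ttft": 87, "total": 1237, "cps": 554}
others = {
    "Ollama": {"ttft": 33, "total": 1622, "cps": 430},
    "LM Studio": {"ttft": 113, "total": 1221, "cps": 588},
}

for name, m in others.items():
    ttft_rel = 1 - m["ttft"] / osaurus["ttft"]     # positive: faster first token than Osaurus
    total_rel = 1 - m["total"] / osaurus["total"]  # positive: finished sooner than Osaurus
    cps_rel = m["cps"] / osaurus["cps"] - 1        # positive: higher streaming throughput
    print(f"{name}: TTFT {ttft_rel:+.0%}, Total {total_rel:+.0%}, Chars/s {cps_rel:+.0%}")
```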
- `GET /` → Plain text status
- `GET /health` → JSON health info
- `GET /models` → OpenAI‑compatible models list
- `GET /tags` → Ollama‑compatible models list
- `POST /chat/completions` → OpenAI‑compatible chat completions
- `POST /chat` → Ollama‑compatible chat endpoint
Path normalization: All endpoints support common API prefixes (/v1, /api, /v1/api). For example:
- `/v1/models` → `/models`
- `/api/chat/completions` → `/chat/completions`
- `/api/chat` → `/chat` (Ollama‑style)
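A quick way to smoke‑test these endpoints once the server is running is shown below; a minimal sketch using only the Python standard library, assuming the default port 1337 and that prefixed paths resolve per the normalization rule above:

```python
import json
import urllib.request

BASE = "http://127.0.0.1:1337"  # adjust if you start the server on another port

# Health check
with urllib.request.urlopen(f"{BASE}/health") as resp:
    print("health:", json.load(resp))

# Path normalization: these should all reach the same models endpoint
for path in ("/models", "/v1/models", "/api/models"):
    with urllib.request.urlopen(BASE + path) as resp:
        print(path, "->", resp.status)
```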
Download the latest signed build from the Releases page.
The easiest way to install Osaurus is through Homebrew cask (app bundle):
brew install --cask osaurus

This installs Osaurus.app. The CLI (`osaurus`) is embedded inside the app and will be auto-linked by the cask if available. If the `osaurus` command isn't found on your PATH, run one of the following:
# One-liner: symlink the embedded CLI into your Homebrew bin (Helpers preferred)
ln -sf "/Applications/Osaurus.app/Contents/Helpers/osaurus" "$(brew --prefix)/bin/osaurus" || \
ln -sf "$HOME/Applications/Osaurus.app/Contents/Helpers/osaurus" "$(brew --prefix)/bin/osaurus"
# Or use the helper script (auto-detects paths and Homebrew prefix)
curl -fsSL https://raw.githubusercontent.com/dinoki-ai/osaurus/main/scripts/install_cli_symlink.sh | bash

Once installed, you can launch Osaurus from:
- Spotlight: Press `⌘ Space` and type "osaurus"
- Applications folder: Find Osaurus in `/Applications`
- Terminal: Run `osaurus ui` (or `open -a osaurus`)
The app will appear in your menu bar, ready to serve local LLMs on your Mac.
- Open `osaurus.xcworkspace` (recommended for editing app + packages), or open `App/osaurus.xcodeproj` to build the app target directly
- Build and run the `osaurus` target
- In the UI, configure the port via the gear icon (default `1337`) and press Start
- Open the model manager to download a model (e.g., "Llama 3.2 3B Instruct 4bit")
- Open the Chat overlay via the chat bubble icon or press `⌘;` to start chatting
Models are stored by default at `~/MLXModels`. Override with the environment variable `OSU_MODELS_DIR`.
- Open the configuration popover (gear icon) → Chat
- Global Hotkey: record a shortcut to toggle the Chat overlay (default `⌘;`)
- System Prompt: optional text prepended to all chats
- Settings are saved locally and the hotkey applies immediately
The CLI lets you start/stop the server and open the UI from your terminal. If osaurus isn’t found in your PATH after installing the app:
- Run the one-liner above to create the symlink, or
- From a cloned repo, run `scripts/install_cli_symlink.sh`, or
- For development builds: `make install-cli` (uses DerivedData output)
# Start on localhost (default)
osaurus serve --port 1337
# Start exposed on your LAN (will prompt for confirmation)
osaurus serve --port 1337 --expose
# Start exposed without prompt (non-interactive)
osaurus serve --port 1337 --expose --yes
# Open the UI (menu bar popover)
osaurus ui
# Check status
osaurus status
# Stop the server
osaurus stop
# List model IDs
osaurus list
# Interactive chat with a downloaded model (use an ID from `osaurus list`)
osaurus run llama-3.2-3b-instruct-4bit

Tip: Set `OSU_PORT` to override the default/auto-detected port for CLI commands.
Notes:
- When started via CLI without `--expose`, Osaurus binds to `127.0.0.1` only. `--expose` binds to `0.0.0.0` (LAN). There is no authentication; use only on trusted networks.
- Management is local-only via macOS Distributed Notifications; there are no HTTP start/stop endpoints.
Base URL: http://127.0.0.1:1337 (or your chosen port)
📚 Need more help? Check out our comprehensive documentation for detailed guides, tutorials, and advanced usage examples.
List models:
curl -s http://127.0.0.1:1337/v1/models | jq

If your system supports Apple Foundation Models, you will also see a `foundation` entry representing the system default model. You can target it explicitly with `model: "foundation"`, or by passing `model: "default"` or an empty string (the server routes default requests to the system model when available).
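To detect Foundation support programmatically, check the models list for the `foundation` entry. A minimal Python sketch, assuming the default port and the standard OpenAI list shape (a `data` array of objects with an `id` field):

```python
import json
import urllib.request

with urllib.request.urlopen("http://127.0.0.1:1337/v1/models") as resp:
    model_ids = [m["id"] for m in json.load(resp)["data"]]

print("Installed models:", model_ids)
print("Apple Foundation Models available:", "foundation" in model_ids)
```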
Ollama‑compatible models list:
curl -s http://127.0.0.1:1337/v1/tags | jq

Non‑streaming chat completion:
curl -s http://127.0.0.1:1337/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "llama-3.2-3b-instruct-4bit",
"messages": [{"role":"user","content":"Write a haiku about dinosaurs"}],
"max_tokens": 200
}'

Non‑streaming with Apple Foundation Models (when available):
curl -s http://127.0.0.1:1337/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "foundation",
"messages": [{"role":"user","content":"Write a haiku about dinosaurs"}],
"max_tokens": 200
}'

Streaming chat completion (SSE format for /chat/completions):
curl -N http://127.0.0.1:1337/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "llama-3.2-3b-instruct-4bit",
"messages": [{"role":"user","content":"Summarize Jurassic Park in one paragraph"}],
"stream": true
}'

Streaming with Apple Foundation Models (when available):
curl -N http://127.0.0.1:1337/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "default",
"messages": [{"role":"user","content":"Summarize Jurassic Park in one paragraph"}],
"stream": true
}'

Ollama‑compatible streaming (NDJSON format for /chat):
curl -N http://127.0.0.1:1337/v1/chat \
-H "Content-Type: application/json" \
-d '{
"model": "llama-3.2-3b-instruct-4bit",
"messages": [{"role":"user","content":"Tell me about dinosaurs"}],
"stream": true
}'

This endpoint is compatible with OllamaKit and other Ollama client libraries.
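If you are not using an Ollama client library, the NDJSON stream is easy to consume by hand: each line is a standalone JSON object. A minimal Python sketch (standard library only; it assumes the Ollama‑style chunk shape with `message.content` fragments and a final `done` flag, and that the model below is downloaded):

```python
import json
import urllib.request

req = urllib.request.Request(
    "http://127.0.0.1:1337/chat",
    data=json.dumps({
        "model": "llama-3.2-3b-instruct-4bit",
        "messages": [{"role": "user", "content": "Tell me about dinosaurs"}],
        "stream": True,
    }).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)

with urllib.request.urlopen(req) as resp:
    for raw_line in resp:  # one JSON object per line (NDJSON)
        line = raw_line.decode("utf-8").strip()
        if not line:
            continue
        chunk = json.loads(line)
        print(chunk.get("message", {}).get("content", ""), end="", flush=True)
        if chunk.get("done"):
            break
print()
```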
Tip: Model names are lower‑cased with hyphens (derived from the friendly name), for example: Llama 3.2 3B Instruct 4bit → llama-3.2-3b-instruct-4bit.
If you're building a macOS app (Swift/Objective‑C/SwiftUI/Electron) and want to discover and connect to a running Osaurus instance, see the Shared Configuration guide: SHARED_CONFIGURATION_GUIDE.md.
Osaurus supports OpenAI‑style function calling. Send tools and optional tool_choice in your request. The model is instructed to reply with an exact JSON object containing tool_calls, and the server parses it, including common formatting like code fences.
Notes on Apple Foundation Models:
- When using `model: "foundation"` / `"default"` on supported systems, tool calls are mapped through Apple Foundation Models' tool interface. In streaming mode, Osaurus emits OpenAI‑style `tool_calls` deltas so your client code works unchanged.
Define tools and let the model decide (tool_choice: "auto"):
curl -s http://127.0.0.1:1337/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "llama-3.2-3b-instruct-4bit",
"messages": [
{"role":"system","content":"You can call functions to answer queries succinctly."},
{"role":"user","content":"What'\''s the weather in SF?"}
],
"tools": [
{
"type": "function",
"function": {
"name": "get_weather",
"description": "Get weather by city name",
"parameters": {
"type": "object",
"properties": {"city": {"type": "string"}},
"required": ["city"]
}
}
}
],
"tool_choice": "auto"
}'

Non‑stream response will include `message.tool_calls` and `finish_reason: "tool_calls"`. Streaming responses emit OpenAI‑style deltas for `tool_calls` (id, type, function name, and chunked arguments), finishing with `finish_reason: "tool_calls"` and `[DONE]`.
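Because `tool_calls` arguments arrive in chunks when streaming, clients typically accumulate the deltas before executing the tool. A minimal sketch with the OpenAI Python client (assumes the server is on the default port and the model above is downloaded):

```python
from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:1337/v1", api_key="osaurus")

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get weather by city name",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

stream = client.chat.completions.create(
    model="llama-3.2-3b-instruct-4bit",
    messages=[{"role": "user", "content": "What is the weather in SF?"}],
    tools=tools,
    tool_choice="auto",
    stream=True,
)

# Accumulate chunked tool_call deltas by index until the stream finishes.
calls = {}
for chunk in stream:
    if not chunk.choices:
        continue
    for tc in chunk.choices[0].delta.tool_calls or []:
        entry = calls.setdefault(tc.index, {"id": None, "name": "", "arguments": ""})
        entry["id"] = tc.id or entry["id"]
        if tc.function:
            entry["name"] += tc.function.name or ""
            entry["arguments"] += tc.function.arguments or ""

print(calls)  # e.g. {0: {'id': 'call_1', 'name': 'get_weather', 'arguments': '{"city":"SF"}'}}
```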
Note: Tool‑calling is supported on the OpenAI‑style /chat/completions endpoint. The Ollama‑style /chat (NDJSON) endpoint streams text only and does not emit tool_calls deltas.
After you execute a tool, continue the conversation by sending a tool role message with tool_call_id:
curl -s http://127.0.0.1:1337/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "llama-3.2-3b-instruct-4bit",
"messages": [
{"role":"user","content":"What'\''s the weather in SF?"},
{"role":"assistant","content":"","tool_calls":[{"id":"call_1","type":"function","function":{"name":"get_weather","arguments":"{\"city\":\"SF\"}"}}]},
{"role":"tool","tool_call_id":"call_1","content":"{\"tempC\":18,\"conditions\":\"Foggy\"}"}
]
}'

Notes:
- Only `type: "function"` tools are supported.
- Arguments must be a JSON‑escaped string in the assistant response; Osaurus also tolerates a nested `parameters` object and will normalize it.
- The parser accepts minor formatting noise like code fences and `assistant:` prefixes.
Point your client at Osaurus and use any placeholder API key.
Python example:
from openai import OpenAI
client = OpenAI(base_url="http://127.0.0.1:1337/v1", api_key="osaurus")
resp = client.chat.completions.create(
model="llama-3.2-3b-instruct-4bit",
messages=[{"role": "user", "content": "Hello there!"}],
)
print(resp.choices[0].message.content)

Python with tools (non‑stream):
import json
from openai import OpenAI
client = OpenAI(base_url="http://127.0.0.1:1337/v1", api_key="osaurus")
tools = [
{
"type": "function",
"function": {
"name": "get_weather",
"description": "Get weather by city",
"parameters": {
"type": "object",
"properties": {"city": {"type": "string"}},
"required": ["city"],
},
},
}
]
resp = client.chat.completions.create(
model="llama-3.2-3b-instruct-4bit",
messages=[{"role": "user", "content": "Weather in SF?"}],
tools=tools,
tool_choice="auto",
)
tool_calls = resp.choices[0].message.tool_calls or []
for call in tool_calls:
args = json.loads(call.function.arguments)
result = {"tempC": 18, "conditions": "Foggy"} # your tool result
followup = client.chat.completions.create(
model="llama-3.2-3b-instruct-4bit",
messages=[
{"role": "user", "content": "Weather in SF?"},
{"role": "assistant", "content": "", "tool_calls": tool_calls},
{"role": "tool", "tool_call_id": call.id, "content": json.dumps(result)},
],
)
print(followup.choices[0].message.content)
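Streaming works the same way through the OpenAI client. A minimal sketch (assumes the server is on the default port and the model is downloaded):

```python
from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:1337/v1", api_key="osaurus")

stream = client.chat.completions.create(
    model="llama-3.2-3b-instruct-4bit",
    messages=[{"role": "user", "content": "Summarize Jurassic Park in one paragraph"}],
    stream=True,
)

# Print tokens as they arrive over SSE.
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()
```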
Osaurus includes built‑in CORS support for browser clients.

- Disabled by default: No CORS headers are sent unless you configure allowed origins.
- Enable via UI: gear icon → Advanced Settings → CORS Settings → Allowed Origins.
  - Enter a comma‑separated list, for example: `http://localhost:3000, http://127.0.0.1:5173, https://app.example.com`
  - Use `*` to allow any origin (recommended only for local development).
- Expose to network: If you need to access from other devices, also enable "Expose to network" in Network Settings.
Behavior when CORS is enabled:
- Requests with an allowed `Origin` receive `Access-Control-Allow-Origin` (either the specific origin or `*`).
- Preflight `OPTIONS` requests are answered with `204 No Content` and these headers:
  - `Access-Control-Allow-Methods`: echoes the requested method or defaults to `GET, POST, OPTIONS, HEAD`
  - `Access-Control-Allow-Headers`: echoes the requested headers or defaults to `Content-Type, Authorization`
  - `Access-Control-Max-Age: 600`
- Streaming endpoints also include CORS headers on their responses.
Quick examples
Configure via UI (persists to app settings). The underlying config includes:
{
"allowedOrigins": ["http://localhost:3000", "https://app.example.com"]
}

Browser fetch from a web app running on http://localhost:3000:
await fetch("http://127.0.0.1:1337/v1/chat/completions", {
method: "POST",
headers: { "Content-Type": "application/json" },
body: JSON.stringify({
model: "llama-3.2-3b-instruct-4bit",
messages: [{ role: "user", content: "Hello!" }],
}),
});

Notes
- Leave the field empty to disable CORS entirely.
- `*` cannot be combined with credentials; Osaurus does not use cookies, so this is typically fine for local use.
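To verify your CORS settings without a browser, you can replay a preflight request from Python. A minimal sketch, assuming `http://localhost:3000` is in your Allowed Origins and the server is on the default port:

```python
import urllib.request

# Simulate the browser's preflight for a cross-origin POST.
req = urllib.request.Request(
    "http://127.0.0.1:1337/v1/chat/completions",
    method="OPTIONS",
    headers={
        "Origin": "http://localhost:3000",
        "Access-Control-Request-Method": "POST",
        "Access-Control-Request-Headers": "Content-Type",
    },
)

with urllib.request.urlopen(req) as resp:
    print("status:", resp.status)  # expected: 204 when the origin is allowed
    for key, value in resp.headers.items():
        if key.lower().startswith("access-control-"):
            print(f"{key}: {value}")
```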
- Curated suggestions include Llama, Qwen, Gemma, Mistral, Phi, DeepSeek, etc. (4‑bit variants for speed)
- Discovery pulls from Hugging Face `mlx-community` and computes size estimates
- Required files are fetched automatically (tokenizer/config/weights)
- Change the models directory with `OSU_MODELS_DIR`
Foundation Models:
- On macOS versions that provide Apple Foundation Models, the `/v1/models` list includes a virtual `foundation` entry representing the system default language model. You can select it via `model: "foundation"` or `model: "default"`.
- Apple Silicon only (requires MLX); Intel Macs are not supported
- Localhost by default; `--expose` enables LAN access. No authentication; use only on trusted networks or behind a reverse proxy.
- `/transcribe` endpoints are placeholders pending Whisper integration
- Apple Foundation Models availability depends on macOS version and frameworks. If unavailable, requests with `model: "foundation"` / `"default"` will return an error. Use `/v1/models` to detect support.
- Apple Intelligence requires macOS 26 (Tahoe).
- Tool‑calling deltas are only available on `/chat/completions` (SSE). The `/chat` (NDJSON) endpoint is text‑only.
- `temperature`: Supported on all backends.
- `max_tokens`: Supported on all backends.
- `top_p`: If provided per request, overrides the server default; otherwise the server uses the configured `genTopP`.
- `frequency_penalty` / `presence_penalty`: Mapped to a repetition penalty on MLX backends (`repetitionPenalty = 1.0 + max(fp, pp)` when positive). If both are missing or ≤ 0, no repetition penalty is applied.
- `stop`: Array of strings. Honored in both streaming and non‑streaming modes on MLX and Foundation backends; output is trimmed before the first stop sequence.
- `n`: Only `1` is supported; other values are ignored.
- `session_id`: Accepted but not currently used for KV‑cache reuse.
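With the OpenAI Python client these map onto the standard request fields. A short sketch (assumes the model is downloaded; the MLX repetition‑penalty mapping is noted inline):

```python
from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:1337/v1", api_key="osaurus")

resp = client.chat.completions.create(
    model="llama-3.2-3b-instruct-4bit",
    messages=[{"role": "user", "content": "List three facts about dinosaurs"}],
    temperature=0.7,
    top_p=0.9,              # overrides the server's configured genTopP for this request
    max_tokens=150,
    frequency_penalty=0.2,  # on MLX backends: repetitionPenalty = 1.0 + max(fp, pp) = 1.2
    stop=["\n\n"],          # generation is trimmed before the first stop sequence
)
print(resp.choices[0].message.content)
```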
- Managed via Swift Package Manager in `Packages/OsaurusCore/Package.swift`:
  - SwiftNIO (HTTP server)
  - IkigaJSON (fast JSON)
  - Sparkle (updates)
  - MLX‑Swift, MLXLLM, MLXLMCommon (runtime and generation)
  - Hugging Face swift‑transformers (Hub/Tokenizers)
- wizardeur — first PR creator
- 📚 Browse our Documentation for guides and tutorials
- 💬 Join us on Discord
- 📖 Read the Contributing Guide and our Code of Conduct
- 🔒 See our Security Policy for reporting vulnerabilities
- ❓ Get help in Support
- 🚀 Pick up a good first issue or help wanted
If you find Osaurus useful, please ⭐ the repo and share it!
