One endpoint. 6 providers. 25+ models. Zero dollars.
FreeLLM is an OpenAI-compatible gateway that routes across Groq, Gemini, Mistral, Cerebras, NVIDIA NIM, and Ollama. When one rate-limits, the next one answers. You stop seeing 429s.
Stack 3 keys per provider and you get ~360 free requests per minute. Including Llama 3.3 70B, Gemini 2.5 Pro, and DeepSeek R1.
Drop-in for any OpenAI SDK. Swap the base URL. Keep your code.
Website · Docs · Quickstart · Providers · How it works · Multi-tenant · Browser tokens · API · Dashboard
If you've ever burned $20 testing prompts, star the repo. It helps other builders find it.
Every major provider has a free tier. Groq, Gemini, Mistral, Cerebras, NVIDIA. All of them.
But using them is painful.
Each one ships its own SDK. Each one has its own rate limits. Each one goes down at the worst possible time. So you end up writing provider-switching logic, handling 429s, and babysitting API keys across five different dashboards.
I built FreeLLM because I was tired of paying OpenAI $20 to test a prompt I'd run 30 times in an afternoon.
One line replaces all of that:
curl http://localhost:3000/v1/chat/completions \
-d '{"model": "free-fast", "messages": [{"role": "user", "content": "Hello!"}]}'The request goes to the fastest available provider. If that one is rate-limited or down, FreeLLM tries the next. You get a response. Every time.
- Drop-in OpenAI SDK. Swap your base URL. Keep your code.
- Automatic failover. Groq rate-limited? Routes to Gemini, then Mistral, then Cerebras.
- Three meta-models.
free-fastfor speed,free-smartfor reasoning,freefor max availability. - Multi-key rotation. Stack keys per provider for 3-4× the free RPM.
- Response caching. Identical prompts return in ~23ms with zero quota burn.
- Token tracking. Rolling 24h budget per provider, surfaced in the dashboard.
- Circuit breakers. Failing providers get sidelined and tested for recovery.
- Real-time dashboard. Provider health, request log, latency, cache hit rate.
- Transparent routing. Every response tells you which provider answered, and why.
- Strict mode. Opt in and refuse silent provider substitution.
- Privacy routing. Skip providers that train on free-tier prompts.
- Virtual sub-keys. Issue scoped keys with per-key request and token caps.
- Per-user rate limits. Safely expose the gateway to your app's end users.
- Browser-safe tokens. Short-lived HMAC-signed tokens for static sites, no auth backend needed.
- Streaming tool calls that work. Gemini and Ollama streaming tool_call bugs normalized at the gateway.
- JSON mode across all providers.
json_schemaworks on NIM (translated toguided_jsonautomatically), and truncated JSON responses carry a warning header so you don't discover the break at parse time. - Gemini reasoning handled for you. Gemini 2.5 models burn most of your output budget on internal thinking by default. FreeLLM sets the right
reasoning_effortper model so yourmax_tokensactually buys you output. - Zero cost. Every provider runs on its free tier.
| Provider | Models | Free tier (per key) |
|---|---|---|
| Groq | Llama 3.3 70B, Llama 3.1 8B, Llama 4 Scout, Qwen3 32B | ~30 req/min |
| Gemini | Gemini 2.5 Flash, 2.5 Pro, 2.0 Flash, 2.0 Flash Lite | ~15 req/min |
| Mistral | Mistral Small, Medium, Nemo | ~5 req/min |
| Cerebras | Llama 3.1 8B, Qwen3 235B, GPT-OSS 120B | ~30 req/min |
| NVIDIA NIM | Llama 3.3 70B, Llama 3.1 405B, Nemotron 70B, DeepSeek R1 | ~40 req/min |
| Ollama | Any local model | Unlimited |
Baseline: ~120 req/min combined. With 3 keys per provider: ~360 req/min. All $0.
Get free keys: Groq, Gemini, Mistral, Cerebras, NVIDIA NIM
One-click deploy (no terminal needed):
Or run locally with Docker:
docker run -d -p 3000:3000 \
-e GROQ_API_KEY=gsk_... \
-e GEMINI_API_KEY=AI... \
ghcr.io/devansh-365/freellm:latestOr clone for local dev:
git clone https://github.com/Devansh-365/freellm.git
cd freellm
cp .env.example .env # add your keys
pnpm install && pnpm devAPI runs on http://localhost:3000. Dashboard on http://localhost:5173.
from openai import OpenAI
client = OpenAI(base_url="http://localhost:3000/v1", api_key="unused")
response = client.chat.completions.create(
model="free-smart",
messages=[{"role": "user", "content": "Explain quantum computing in one paragraph."}]
)
print(response.choices[0].message.content)import OpenAI from "openai";
const client = new OpenAI({ baseURL: "http://localhost:3000/v1", apiKey: "unused" });
const response = await client.chat.completions.create({
model: "free-fast",
messages: [{ role: "user", content: "Hello!" }],
});Don't pick a provider. Pick a strategy.
| Model | What it does | Use when |
|---|---|---|
free |
Rotates across all available providers | You want max uptime |
free-fast |
Lowest-latency provider first (Groq, Cerebras, Gemini, NIM) | You're building a chatbot or real-time UI |
free-smart |
Most capable provider first (Gemini, NIM, Groq, Mistral) | You need stronger reasoning or longer context |
Need a specific model? Target it directly: groq/llama-3.3-70b-versatile, gemini/gemini-2.5-flash, nim/deepseek-ai/deepseek-r1.
Every provider env var accepts a comma-separated list. FreeLLM rotates round-robin, and each key gets its own rate-limit budget and cooldown.
GROQ_API_KEY=gsk_key1,gsk_key2,gsk_key3,gsk_key4 # 4× the free RPMWhen one key hits its window, FreeLLM silently uses the next. A 429 on key1 only sidelines that key, not the whole provider. Per-key state is exposed via GET /v1/status.
Stack 3 keys across all 5 cloud providers and you get ~360 req/min of free inference. No other LLM gateway does this because they all assume you pay per token.
Identical prompts return in ~23ms with zero quota burn. The cache keys on (model, messages, temperature, max_tokens, top_p, stop) via SHA-256, uses LRU eviction, and respects per-entry TTL (default 1 hour).
Call A (cold) cached=false latency=200ms → Groq
Call B (same prompt) cached=true latency=23ms ← cache
That's a 9× speedup on duplicate requests. During development you typically hammer the same prompt 10-20 times while iterating. That's now 10-20 free hits.
Configure in .env:
CACHE_ENABLED=true
CACHE_TTL_MS=3600000 # 1 hour
CACHE_MAX_ENTRIES=1000Streaming and error responses are never cached. Cached responses are marked with x_freellm_cached: true.
Every response carries headers that tell you exactly how FreeLLM handled the request:
X-FreeLLM-Provider: groq
X-FreeLLM-Model: groq/llama-3.3-70b-versatile
X-FreeLLM-Requested-Model: free-fast
X-FreeLLM-Cached: false
X-FreeLLM-Route-Reason: meta
X-Request-Id: 4d6c9e1a-...
Route-Reason is one of direct, meta, cache, or failover. Every response, successful or not, carries a unique X-Request-Id that also appears in the server logs and the error body, so a single grep correlates everything.
If you want to refuse silent substitution, opt into strict mode:
X-FreeLLM-Strict: true
In strict mode meta-models are rejected (400), and concrete models are tried against exactly one provider. If that provider fails, the upstream error surfaces verbatim instead of failing over to a different one.
When all providers are exhausted, the response body now tells you how to recover:
{
"error": {
"type": "rate_limit_error",
"code": "all_providers_exhausted",
"retry_after_ms": 12000,
"providers": [
{ "id": "groq", "retry_after_ms": 12000, "keys_available": 0, "circuit_state": "closed" },
{ "id": "gemini", "retry_after_ms": 5000, "keys_available": 0, "circuit_state": "closed" }
],
"suggestions": [{ "model": "free-fast", "available_in_ms": 5000 }]
}
}The response also carries an HTTP Retry-After header in seconds so any standard client retries at the right time.
Not every free tier treats your prompts the same way. Send X-FreeLLM-Privacy: no-training and the router will only consider providers that contractually exclude free-tier data from training:
| Provider | Policy |
|---|---|
| Groq | no-training |
| Cerebras | no-training |
| NVIDIA NIM | no-training |
| Ollama | local |
| Mistral | configurable |
| Gemini | free-tier trains |
If no configured provider can satisfy the posture for the model you asked for, you get a clean 400 model_not_supported up front. Catalog entries carry source URLs and last_verified dates; the server warns at boot for any entry older than 90 days.
Building a side project and want to safely expose the gateway to your visitors without giving everyone your master key? FreeLLM ships two independent mechanisms that compose:
Virtual sub-keys. Declare them in a JSON file:
{
"keys": [
{
"id": "sk-freellm-portfolio-abc123",
"label": "My portfolio site",
"dailyRequestCap": 500,
"dailyTokenCap": 200000,
"allowedModels": ["free-fast", "free"],
"expiresAt": "2026-07-01T00:00:00Z"
}
]
}Point FREELLM_VIRTUAL_KEYS_PATH at the file, restart, and authenticate with Authorization: Bearer sk-freellm-portfolio-abc123. The gateway enforces the allowlist and caps before touching any upstream provider, and records usage only after a successful response so failed routes never burn quota. Caps are in-memory rolling 24h windows (soft cap, not a billing system; documented and logged at boot).
Per-identifier rate limits. Tag each request with X-FreeLLM-Identifier: user-42 (anything matching ^[A-Za-z0-9_.:-]{1,128}$) and each identifier gets its own sliding-window bucket, independent from the per-IP and per-provider limiters. One noisy user hitting their cap does not affect anyone else. Configure via FREELLM_IDENTIFIER_LIMIT=<max>/<windowMs> (default 60/60000). Responses carry X-FreeLLM-Identifier-Remaining and -Reset so clients can self-throttle.
Missing identifier falls back to the client IP. Literal "undefined" and "null" strings are treated as missing. Tainted values are rejected with a clear 400 instead of landing in logs.
Want to drop an AI chatbot into a static site without giving every visitor your master key? Mint a short-lived HMAC-signed token from a one-file serverless function and pass it straight to the browser. The token is bound to an origin, expires in 15 minutes, and counts against a per-identifier bucket so one noisy user cannot burn your quota.
# Backend: mint a token using your master key
curl https://your-gateway/v1/tokens/issue \
-H "Authorization: Bearer $FREELLM_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"origin": "https://yoursite.com",
"identifier": "session-abc",
"ttlSeconds": 900
}'
# => { "token": "flt.eyJ2Ijox...", "expiresAt": "...", "origin": "...", "identifier": "..." }<!-- Browser: use the token directly with the official OpenAI SDK -->
<script type="module">
import OpenAI from "https://esm.sh/openai@^4";
const { token } = await fetch("/api/freellm-token").then((r) => r.json());
const client = new OpenAI({
baseURL: "https://your-gateway/v1",
apiKey: token,
dangerouslyAllowBrowser: true,
});
const stream = await client.chat.completions.create({
model: "free-fast",
messages: [{ role: "user", content: "Hi" }],
stream: true,
});
for await (const chunk of stream) console.log(chunk.choices[0].delta.content ?? "");
</script>Security model: max 15 minute TTL, origin-bound (browser Origin header verified on every request), per-identifier rate limiting ties into the existing bucket system, and FREELLM_TOKEN_SECRET must be at least 32 bytes or the gateway refuses to boot. Full walkthrough on the Browser integration docs page and a runnable example in examples/browser-chatbot/.
Gemini and Ollama both ship known bugs in their streaming tool_call output (Gemini drops the index field, Ollama flattens arguments outside the function wrapper). Every agent framework currently maintains its own workaround for these. FreeLLM fixes both at the gateway so the same stream works unchanged in the OpenAI SDK, Cline, Cursor, Aider, or anything else that expects OpenAI-spec SSE. Verified with real calls against live Gemini and reassembled by the real openai npm SDK.
Gemini 2.5 Flash and 2.5 Pro are reasoning models. Left to their own devices, they spend 90-98% of your max_tokens thinking internally before writing a single visible word. Ask for 1000 tokens of output and you'll get back 37.
FreeLLM fixes this per model. For 2.5 Flash, thinking is disabled entirely (reasoning_effort: "none") so your full budget goes to the actual answer. For 2.5 Pro (which refuses to run without some thinking), reasoning is set to "low" so it thinks briefly and gives you the rest.
If you want the full reasoning power back, override it per request:
curl https://your-gateway/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "gemini/gemini-2.5-flash",
"messages": [{"role": "user", "content": "Prove P != NP"}],
"max_tokens": 4000,
"reasoning_effort": "high"
}'With "high" and a larger budget (4000+), the model gets room for both thinking and output. The point is: you choose the trade-off, not Google's default.
FreeLLM accepts response_format: { type: "json_object" } and { type: "json_schema", json_schema: { schema: {...} } } and forwards them to the upstream provider. Most providers support this natively. For NVIDIA NIM, which uses a proprietary guided_json parameter instead, FreeLLM translates the standard format automatically so you don't have to special-case your code per provider.
When a JSON-mode response hits max_tokens and the output is almost certainly broken (missing closing brackets, truncated strings), the response carries a X-FreeLLM-Warning: json-possibly-truncated header. You'll know the JSON is incomplete before you try to parse it.
All optional. Leave empty for local dev.
| Variable | What it does |
|---|---|
FREELLM_API_KEY |
Master key. Requires Authorization: Bearer <key> on every request. |
FREELLM_ADMIN_KEY |
Separate key for admin endpoints (circuit breaker reset, routing switch). |
FREELLM_VIRTUAL_KEYS_PATH |
Path to a JSON file declaring virtual sub-keys with per-key caps. |
FREELLM_IDENTIFIER_LIMIT |
Per-identifier rate limit, format <max>/<windowMs> (default 60/60000). |
FREELLM_IDENTIFIER_MAX_BUCKETS |
Hard ceiling on distinct identifiers tracked (default 10000). |
FREELLM_TOKEN_SECRET |
HMAC secret for browser tokens, minimum 32 bytes. Short = fatal boot failure. Unset = browser tokens disabled, rest of the gateway runs normally. |
STREAM_IDLE_TIMEOUT_MS |
Heartbeat cadence for SSE keep-alive comments (default 30000). |
ALLOWED_ORIGINS |
Comma-separated CORS allowlist. Required for browser-token-backed frontends. |
Dependency posture and trust details live on the dedicated website pages:
- Security and dependencies — direct dep list, what is deliberately not in the codebase, image verification
- Privacy and training — the full provider catalog with source links
- Benchmarks — cold start and overhead numbers with methodology
Fully OpenAI-compatible. Available at /v1/....
| Method | Endpoint | Description |
|---|---|---|
POST |
/v1/chat/completions |
Chat completion (streaming and non-streaming) |
GET |
/v1/models |
List all available models + meta-models |
GET |
/v1/status |
Provider states, per-key state, token usage, cache stats |
POST |
/v1/status/providers/{id}/reset |
Force-reset a provider's circuit breaker |
PATCH |
/v1/status/routing |
Switch between round_robin and random |
Every response includes observability headers so you know exactly how the request was handled:
X-FreeLLM-Provider,X-FreeLLM-Model,X-FreeLLM-Requested-Model,X-FreeLLM-Cached,X-FreeLLM-Route-ReasonX-FreeLLM-Identifier,-Remaining,-Resetwhen identifier rate limiting is in playX-Request-Idon every response, matching the id in logs and error bodies
Request-side headers you can opt into:
X-FreeLLM-Strict: truerefuses silent provider substitutionX-FreeLLM-Privacy: no-trainingfilters out providers that train on free-tier dataX-FreeLLM-Identifier: <id>tags the request with a per-user bucket
A built-in web UI for monitoring your gateway in real time. Provider health, cache hit rate, per-provider token usage, multi-key status, live request log, routing controls, and circuit breaker management.
PRs welcome. See CONTRIBUTING.md.

