A FastAPI service implementing token-gated LLM execution using LangGraph. This system enforces predictable cost envelopes across planning, retrieval, generation, and quality assessment phases.
For the conceptual deep-dive, see the accompanying article: LLM Cost Optimization and Token Gating.
```
   User Request
         │
         ▼
┌─────────────────┐
│ Budget Manager  │ ← Token budget as first-class state
└────────┬────────┘
         │
         ▼
┌─────────────────┐     ┌─────────────┐
│     Planner     │────▶│  Retriever  │
│   (800 tokens)  │     │ (dynamic k) │
└─────────────────┘     └──────┬──────┘
                               │
                               ▼
                        ┌─────────────┐
                        │  Generator  │
                        │(2500 tokens)│
                        └──────┬──────┘
                               │
                               ▼
                        ┌─────────────┐
                        │   Critic    │
                        │ (800 tokens)│
                        └──────┬──────┘
                               │
              ┌────────────────┼────────────────┐
              ▼                ▼                ▼
      [quality ≥ 0.85]   [budget low]        [loop]
              │                │                │
              ▼                ▼                ▼
          Finalize        Summarizer        Planner
              │                │            (retry)
              ▼                ▼
             END              END
```
- Token Budget Enforcement: Every node checks and consumes tokens from a shared budget (see the state sketch below)
- Dynamic RAG: Retriever adjusts `top_k` based on remaining budget after reserving generation capacity
- Graceful Degradation: System summarizes and exits cleanly when the budget is exhausted
- Quality-Driven Termination: Exits early when quality threshold is met
- Loop Prevention: Maximum step count prevents infinite agent loops
- Full Observability: Token usage breakdown by node in every response
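For a concrete picture of how the budget travels through the graph, here is a minimal sketch of the shared state. Field names are inferred from the snippets later in this README; the actual definition lives in `app/state.py`:

```python
from typing import TypedDict

class AgentState(TypedDict):
    user_query: str
    plan: str                    # assumed: output of the planner
    context: list[str]           # assumed: chunks from the retriever
    draft_answer: str            # assumed: output of the generator
    quality_score: float         # set by the critic
    total_token_budget: int
    remaining_tokens: int        # decremented by every node
    tokens_used: dict[str, int]  # per-node breakdown for observability
    steps_executed: int
    max_steps: int
    status: str                  # e.g. "COMPLETED", "COMPLETED_WITH_SUMMARY"
```

Because the budget is ordinary graph state rather than an external counter, every node can gate itself with a plain dictionary lookup.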
Install dependencies and configure the environment:

```bash
pip install -r requirements.txt
cp .env.example .env
# Edit .env and add your OpenAI API key
```

Required environment variables:
- `OPENAI_API_KEY`: Your OpenAI API key

Optional:

- `OPENAI_MODEL_NAME`: Model for chat completions (default: `gpt-4o`)
- `OPENAI_EMBEDDING_MODEL`: Model for embeddings (default: `text-embedding-3-small`)
- `DEFAULT_TOKEN_BUDGET`: Default budget per request (default: `10000`)
- `DEFAULT_MAX_STEPS`: Maximum agent loop iterations (default: `5`)
- `CHROMA_PERSIST_DIR`: ChromaDB storage path (default: `./chroma_db`)
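These are presumably loaded by the Pydantic settings in `app/config.py`. A minimal sketch, assuming `pydantic-settings` and these field names:

```python
from pydantic_settings import BaseSettings, SettingsConfigDict

class Settings(BaseSettings):
    # pydantic-settings matches env vars to fields case-insensitively,
    # so OPENAI_API_KEY populates openai_api_key.
    model_config = SettingsConfigDict(env_file=".env")

    openai_api_key: str
    openai_model_name: str = "gpt-4o"
    openai_embedding_model: str = "text-embedding-3-small"
    default_token_budget: int = 10000
    default_max_steps: int = 5
    chroma_persist_dir: str = "./chroma_db"

settings = Settings()
```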
Start the server:

```bash
uvicorn app.main:app --reload
```

Or:

```bash
python -m app.main
```

The API will be available at http://localhost:8000.
Health check endpoint:

```bash
curl http://localhost:8000/health
```

Ingest documents into the vector store for RAG retrieval:
```bash
curl -X POST http://localhost:8000/documents \
  -H "Content-Type: application/json" \
  -d '{
    "texts": [
      "LangGraph is a library for building stateful, multi-actor applications with LLMs.",
      "Token gating enforces budget constraints across LLM execution.",
      "RAG combines retrieval with generation for grounded responses."
    ]
  }'
```

Execute a token-gated query:
```bash
curl -X POST http://localhost:8000/query \
  -H "Content-Type: application/json" \
  -d '{
    "query": "What is token gating and why is it important?",
    "token_budget": 8000,
    "max_steps": 3
  }'
```

Response:
```json
{
  "answer": "Token gating is a budget enforcement layer...",
  "status": "COMPLETED",
  "tokens_used": {
    "planner": 450,
    "retriever": 800,
    "generator": 1200,
    "critic": 350
  },
  "total_tokens": 2800,
  "steps_executed": 1
}
```

Status Values:
- `COMPLETED`: Quality threshold met, full answer generated
- `COMPLETED_WITH_SUMMARY`: Budget exhausted, summarized answer returned
- `INSUFFICIENT_BUDGET_FOR_PLANNING`: Not enough budget to start
- `INSUFFICIENT_BUDGET_FOR_GENERATION`: Budget depleted before generation
Project structure:

```
llm-token-gating/
├── app/
│   ├── __init__.py
│   ├── main.py               # FastAPI app and endpoints
│   ├── config.py             # Pydantic settings
│   ├── state.py              # AgentState TypedDict
│   ├── token_accounting.py   # Centralized token tracking
│   ├── graph.py              # LangGraph assembly
│   ├── nodes/
│   │   ├── __init__.py
│   │   ├── planner.py        # Bounded planning
│   │   ├── retriever.py      # Token-aware RAG
│   │   ├── generator.py      # Budgeted generation
│   │   ├── critic.py         # Optional quality check
│   │   └── summarizer.py     # Safety exit
│   └── rag/
│       ├── __init__.py
│       ├── embeddings.py     # OpenAI embeddings
│       └── vector_store.py   # ChromaDB integration
├── requirements.txt
├── .env.example
└── README.md
```
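`graph.py` assembles these nodes into the flow from the architecture diagram. A condensed sketch of what that assembly plausibly looks like; the node imports and routing labels are assumptions, and `should_continue` is the router shown later in this README:

```python
from langgraph.graph import StateGraph, END

from app.state import AgentState
from app.nodes import planner, retriever, generator, critic, summarizer

graph = StateGraph(AgentState)
graph.add_node("planner", planner)
graph.add_node("retriever", retriever)
graph.add_node("generator", generator)
graph.add_node("critic", critic)
graph.add_node("summarizer", summarizer)

graph.set_entry_point("planner")
graph.add_edge("planner", "retriever")
graph.add_edge("retriever", "generator")
graph.add_edge("generator", "critic")
graph.add_conditional_edges(
    "critic",
    should_continue,  # routes on budget and quality; defined below
    {"end": END, "summarize": "summarizer", "loop": "planner"},
)
graph.add_edge("summarizer", END)

compiled_graph = graph.compile()
```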
Every request initializes with a token budget:
```python
state = initialize_state(
    user_query="...",
    total_token_budget=10000,
    max_steps=5,
)
```
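A plausible implementation of `initialize_state`, assuming the `AgentState` fields sketched earlier:

```python
def initialize_state(user_query: str, total_token_budget: int, max_steps: int) -> AgentState:
    # The budget enters the graph as ordinary state, so every
    # downstream node can read it, spend from it, and gate on it.
    return AgentState(
        user_query=user_query,
        plan="",
        context=[],
        draft_answer="",
        quality_score=0.0,
        total_token_budget=total_token_budget,
        remaining_tokens=total_token_budget,
        tokens_used={},
        steps_executed=0,
        max_steps=max_steps,
        status="IN_PROGRESS",  # assumed initial status value
    )
```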
if state["remaining_tokens"] < REQUIRED_BUDGET:
state["status"] = "INSUFFICIENT_BUDGET"
return stateAll token usage flows through one function:
```python
state = consume_tokens(state, "generator", actual_tokens_used)
```
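A minimal sketch of what `consume_tokens` might do internally (the real implementation lives in `app/token_accounting.py`):

```python
def consume_tokens(state: AgentState, node: str, tokens: int) -> AgentState:
    # Record the per-node breakdown that surfaces in API responses,
    # then decrement the shared budget in a single place.
    state["tokens_used"][node] = state["tokens_used"].get(node, 0) + tokens
    state["remaining_tokens"] = max(0, state["remaining_tokens"] - tokens)
    return state
```

Centralizing the decrement keeps the accounting consistent: no node subtracts from the budget on its own.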
The retriever adjusts `top_k` based on remaining budget:

```python
available_for_context = remaining_tokens - MIN_GENERATION_BUDGET
top_k = max(1, available_for_context // TOKENS_PER_CHUNK)
```
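For example, with 6,000 tokens remaining, a `MIN_GENERATION_BUDGET` of 2,500 (matching the generator's reservation in the diagram), and an illustrative `TOKENS_PER_CHUNK` of 500, the retriever requests `top_k = (6000 - 2500) // 500 = 7` chunks; as the budget shrinks, the context it pulls in shrinks with it.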
The graph routes based on budget and quality:

```python
def should_continue(state):
    if state["remaining_tokens"] <= 500:
        return "summarize"
    if state["quality_score"] >= 0.85:
        return "end"
    return "loop"
```

License: MIT