2389-research/simmer-sdk

simmer-sdk

Programmatic implementation of the simmer iterative refinement loop using the Claude Agent SDK. Same architecture as the Claude Code skill — generate, evaluate, judge, reflect — but callable from scripts, pipelines, and any Python environment.

Install

uv add simmer-sdk
# or
pip install simmer-sdk

The PyPI package name is simmer-sdk; the import name is simmer_sdk.

Requires one of the following: ANTHROPIC_API_KEY set in your environment (direct Anthropic API), AWS credentials (Bedrock mode), or a running Ollama instance (local models).

Usage

import anyio
from pathlib import Path
from simmer_sdk import refine

async def main():
    result = await refine(
        artifact=(
            "A one-shot DND adventure hook for a party of 4 level-5 characters. "
            "The setting: a coastal town where fishermen have been pulling up bones "
            "instead of fish for the past week. The town's mayor has gone missing. "
            "Should be 300-500 words, playable in a 3-4 hour session."
        ),
        criteria={
            "narrative_tension": (
                "scenes have escalating stakes, time pressure, and meaningful "
                'consequences — 10/10 means every scene raises the question '
                '"what happens if we don\'t act?"'
            ),
            "player_agency": (
                "multiple decision points where the party's choices genuinely "
                "change the outcome — 10/10 means no railroading, at least 3 "
                "distinct paths through the adventure"
            ),
            "specificity": (
                "concrete names, locations, sensory details, NPC motivations — "
                "10/10 means a DM could run this cold without inventing anything"
            ),
        },
        primary="player_agency",
        iterations=3,
        mode="seedless",
        judge_mode="board",
        output_dir=Path("/tmp/simmer-dnd-test"),
    )

    print(f"Best: iteration {result.best_iteration} ({result.composite}/10)")
    print(result.best_candidate)

anyio.run(main)

Configuration

The refine() function accepts a config that maps to the simmer skill's setup brief:

ARTIFACT          -> artifact         # text content, file path, directory path, or description
ARTIFACT_TYPE     -> (auto-detected)  # "single-file" or "workspace"
CRITERIA          -> criteria         # dict of {name: "what 10/10 looks like"}
PRIMARY           -> primary          # criterion name for best-so-far comparison
EVALUATOR         -> evaluator        # shell command (supports {candidate_path}, {iteration}, {output_dir})
BACKGROUND        -> background       # constraints, available resources
OUTPUT_CONTRACT   -> output_contract  # valid output format description
VALIDATION_COMMAND -> validation_command # quick check command
SEARCH_SPACE      -> search_space     # what's in scope to explore
JUDGE_MODE        -> judge_mode       # "auto", "single", "board"
JUDGE_PANEL       -> judge_panel      # custom judge definitions [{name, lens}]
JUDGE_COUNT       -> judge_count      # number of judges on the board (default 3, min 2)
ITERATIONS        -> iterations       # number of generate-judge-reflect cycles (default 3)
MODE              -> mode             # "auto", "seedless", "from-file", "from-paste", "from-workspace"
OUTPUT_DIR        -> output_dir       # where iteration files go (default "docs/simmer")
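
A few constraints in the mapping above can be checked before spending any API calls: primary should name a key in criteria, judge_count has a stated minimum of 2, and judge_mode and mode each take a fixed set of values. The helper below is a hypothetical pre-flight check (check_config is not part of simmer-sdk, and the SDK may well perform its own validation); it only inspects a plain dict of keyword arguments.

```python
# Allowed values as listed in the configuration mapping.
JUDGE_MODES = {"auto", "single", "board"}
MODES = {"auto", "seedless", "from-file", "from-paste", "from-workspace"}


def check_config(config: dict) -> list[str]:
    """Return a list of problems found in a dict of refine() keyword arguments.

    Hypothetical helper, not part of simmer-sdk. Keys that are absent are
    left alone so that the SDK's own defaults apply.
    """
    problems = []
    criteria = config.get("criteria", {})
    primary = config.get("primary")
    if primary is not None and primary not in criteria:
        problems.append(f"primary {primary!r} is not a key in criteria")
    if "judge_mode" in config and config["judge_mode"] not in JUDGE_MODES:
        problems.append(f"unknown judge_mode {config['judge_mode']!r}")
    if "mode" in config and config["mode"] not in MODES:
        problems.append(f"unknown mode {config['mode']!r}")
    if "judge_count" in config and config["judge_count"] < 2:
        problems.append("judge_count must be at least 2")
    return problems
```

If the returned list is empty, the same dict can be passed on as refine(**config).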

Models

Every LLM call is independently configurable:

result = await refine(
    ...,
    generator_model="claude-sonnet-4-6",  # default
    judge_model="claude-sonnet-4-6",      # default
    clerk_model="claude-haiku-4-5",       # default, board composition + deliberation + reflect
)

AWS Bedrock

To use AWS Bedrock instead of the direct Anthropic API:

result = await refine(
    ...,
    api_provider="bedrock",
    aws_access_key="AKIA...",
    aws_secret_key="...",
    aws_region="us-east-1",
    generator_model="claude-sonnet-4-5",  # auto-mapped to Bedrock ID
    judge_model="claude-sonnet-4-5",
    clerk_model="claude-haiku-4-5",
)

Model IDs are auto-mapped to Bedrock format (e.g., claude-sonnet-4-6 → us.anthropic.claude-sonnet-4-6-20260217-v1:0). You can also pass Bedrock model IDs directly to bypass mapping. Requires boto3 (included as a dependency).
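
In scripts it is usually safer to read the keys from the environment than to hardcode them as in the snippet above. A minimal sketch, assuming the standard AWS variable names AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, AWS_REGION, and AWS_DEFAULT_REGION; bedrock_kwargs is a hypothetical helper, not part of simmer-sdk, and the us-east-1 fallback simply mirrors the example above.

```python
import os
from collections.abc import Mapping


def bedrock_kwargs(env: Mapping[str, str] = os.environ) -> dict:
    """Build the Bedrock-related keyword arguments for refine() from env vars.

    Hypothetical helper. Raises RuntimeError if a credential is missing.
    """
    required = ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY")
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise RuntimeError("missing AWS credentials: " + ", ".join(missing))
    return {
        "api_provider": "bedrock",
        "aws_access_key": env["AWS_ACCESS_KEY_ID"],
        "aws_secret_key": env["AWS_SECRET_ACCESS_KEY"],
        # Fall back to the region used in the example if none is configured.
        "aws_region": env.get("AWS_REGION") or env.get("AWS_DEFAULT_REGION") or "us-east-1",
    }
```

The returned dict can then be unpacked into the call, for example refine(..., **bedrock_kwargs()).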

Ollama (Local Models)

To run the full simmer pipeline against local models via Ollama:

result = await refine(
    ...,
    api_provider="ollama",
    ollama_url="http://localhost:11434",  # default
    generator_model="qwen3:32b",
    judge_model="qwen3:32b",
    clerk_model="qwen3.5:9b",
)

No ANTHROPIC_API_KEY required. Ollama exposes an Anthropic-compatible /v1/messages endpoint, so the full pipeline (generator agents, judge board, deliberation, synthesis) works without code changes.

Requirements:

  • Ollama running locally or on a network host
  • Models pulled before use: ollama pull qwen3:32b
  • Claude CLI installed (for judge agents): npm install -g @anthropic-ai/claude-code

Tested models: qwen3:32b, qwen3.5:27b, qwen3.5:9b, gemma3:27b, llama4:16x17b. Any Ollama model tag works — pass it directly as the model ID.

Docker Compose: When running in containers, set ollama_url to the Ollama service name (e.g., http://ollama:11434). The judge board agents inherit the URL via the ANTHROPIC_BASE_URL environment variable.
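
Because models must be pulled before use, it can help to check for them up front. The sketch below assumes Ollama's GET /api/tags endpoint, which lists locally available models in the form {"models": [{"name": ...}, ...]}; fetch_tags and missing_models are hypothetical helper names, not part of simmer-sdk.

```python
import json
import urllib.request


def fetch_tags(ollama_url: str = "http://localhost:11434") -> dict:
    """Fetch the list of locally available models from a running Ollama instance."""
    url = f"{ollama_url.rstrip('/')}/api/tags"
    with urllib.request.urlopen(url, timeout=5) as resp:
        return json.load(resp)


def missing_models(tags: dict, required: list[str]) -> list[str]:
    """Return the required model tags that do not appear in an /api/tags payload.

    Names are compared exactly, so pass full tags such as "qwen3:32b".
    """
    available = {model["name"] for model in tags.get("models", [])}
    return [name for name in required if name not in available]
```

A caller might run missing_models(fetch_tags(), ["qwen3:32b", "qwen3.5:9b"]) and issue ollama pull for anything returned before starting a run.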

Evaluators

The evaluator is a shell command run after each generator step. Its stdout and stderr are passed to the judge as evidence. The command supports the template variables {candidate_path}, {iteration}, and {output_dir}:

# Word count check
evaluator="wc -w {candidate_path}"

# Run a test suite
evaluator="python eval_scorer.py --spec {candidate_path} --output {output_dir}/eval_v{iteration}"

# Shell script
evaluator="./run_eval.sh {candidate_path} {output_dir}"
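
An evaluator can be any program that prints evidence for the judge. As an illustration, here is a hypothetical check_length.py (the file name and the 300-500 word window are assumptions borrowed from the brief in the usage example, not something the SDK ships) that could be wired in as evaluator="python check_length.py {candidate_path}".

```python
import sys
from pathlib import Path


def length_report(text: str, low: int = 300, high: int = 500) -> str:
    """Describe whether the word count of `text` falls inside [low, high]."""
    words = len(text.split())
    if words < low:
        verdict = f"too short by {low - words} words"
    elif words > high:
        verdict = f"too long by {words - high} words"
    else:
        verdict = "within range"
    return f"word_count={words} target={low}-{high} verdict: {verdict}"


def main(argv: list[str]) -> int:
    """Print a report for the file named in argv[1]; return a process exit code."""
    if len(argv) != 2:
        print("usage: check_length.py CANDIDATE_PATH", file=sys.stderr)
        return 2
    print(length_report(Path(argv[1]).read_text(encoding="utf-8")))
    return 0


# When saved as a script, finish the file with:
#     raise SystemExit(main(sys.argv))
```

Printing a one-line, machine-readable verdict keeps the evidence short and easy for a judge to quote.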

Callbacks

async def on_iteration(record, trajectory, trajectory_table):
    """Called after each iteration with the trajectory table."""
    print(trajectory_table)

async def on_plateau(trajectory):
    """Called after 3 iterations without improvement. Only triggered when judge_mode="single".
    Return True to upgrade to board and extend the run by 2 iterations."""
    return True

result = await refine(..., on_iteration=on_iteration, on_plateau=on_plateau)
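
Callbacks are also a convenient place to persist progress or enforce a budget. A minimal sketch using only the callback signatures shown above: make_logger and upgrade_within are hypothetical factory names, and two points are assumptions rather than documented behavior, namely that trajectory supports len() with one entry per iteration, and that returning False from on_plateau declines the upgrade.

```python
def make_logger(tables: list[str]):
    """Build an on_iteration callback that appends each trajectory table to `tables`."""

    async def on_iteration(record, trajectory, trajectory_table):
        tables.append(str(trajectory_table))

    return on_iteration


def upgrade_within(budget: int):
    """Build an on_plateau callback that accepts the board upgrade only if
    the 2 extra iterations still fit inside `budget` total iterations."""

    async def on_plateau(trajectory):
        # Assumption: len(trajectory) is the number of iterations run so far.
        return len(trajectory) + 2 <= budget

    return on_plateau
```

These would be passed as refine(..., on_iteration=make_logger(tables), on_plateau=upgrade_within(8)).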

Output

refine() returns a SimmerResult:

result.best_candidate   # str — the best artifact text
result.best_iteration   # int — which iteration produced the best
result.best_scores      # dict — per-criterion scores for best
result.composite        # float — best composite score
result.trajectory       # list[IterationRecord] — full history
result.stable_wins      # list[str] — what's been working
result.not_working      # list[str] — what's been tried and failed
result.output_dir       # Path — where iteration files were written
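
For logs or CI output, the fields above can be condensed into a short summary. The sketch uses a stand-in dataclass (FakeResult, hypothetical) that carries a subset of the documented field names so that it runs without the SDK; a real SimmerResult could be passed to summarize the same way, assuming its attributes behave as listed.

```python
from dataclasses import dataclass, field


@dataclass
class FakeResult:
    """Stand-in with a subset of the SimmerResult fields listed above."""
    best_iteration: int
    composite: float
    best_scores: dict = field(default_factory=dict)
    stable_wins: list = field(default_factory=list)
    not_working: list = field(default_factory=list)


def summarize(result) -> str:
    """Format the best iteration, its scores, and the reflection notes as text."""
    lines = [f"best: iteration {result.best_iteration} ({result.composite:.1f}/10)"]
    for name, score in sorted(result.best_scores.items()):
        lines.append(f"  {name}: {score}")
    if result.stable_wins:
        lines.append("working: " + "; ".join(result.stable_wins))
    if result.not_working:
        lines.append("not working: " + "; ".join(result.not_working))
    return "\n".join(lines)
```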

Output Directory

{output_dir}/
  iteration-0-candidate.md     # seed (or seedless first generation)
  iteration-0-judgment.md      # raw judge output (scores, evidence, improvements per criterion)
  iteration-1-candidate.md     # each improved candidate
  iteration-1-judgment.md      # raw judge output
  iteration-2-candidate.md
  iteration-2-judgment.md
  iteration-3-candidate.md
  iteration-3-judgment.md
  trajectory.md                # running score table
  result.md                    # final best output

The iteration-N-judgment.md files contain per-criterion scores, evidence, and improvement suggestions from the judge. They are useful for building UIs that show the judge's reasoning and for downstream agents that extract structured detail.
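
A downstream tool usually starts by pairing each candidate with its judgment. A minimal sketch, assuming the single-file layout and file names shown above (workspace runs may be laid out differently); collect_iterations is a hypothetical helper, not part of simmer-sdk.

```python
import re
from pathlib import Path

# Matches e.g. "iteration-2-candidate.md" and "iteration-2-judgment.md".
_ITERATION_FILE = re.compile(r"^iteration-(\d+)-(candidate|judgment)\.md$")


def collect_iterations(output_dir: Path) -> dict[int, dict[str, Path]]:
    """Map iteration number -> {"candidate": path, "judgment": path}.

    Files that do not match the pattern (trajectory.md, result.md) are skipped.
    Iterations are returned in numeric order.
    """
    found: dict[int, dict[str, Path]] = {}
    for path in output_dir.iterdir():
        match = _ITERATION_FILE.match(path.name)
        if match:
            number, kind = int(match.group(1)), match.group(2)
            found.setdefault(number, {})[kind] = path
    return dict(sorted(found.items()))
```

Sorting on the parsed integer rather than the file name keeps iteration 10 after iteration 9.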

Architecture

The SDK is a 1:1 translation of the six simmer Claude Code skill files. The actual skill markdown files are bundled in src/simmer_sdk/skill_reference/ and used as prompts.

Role               Implementation                                               Tools
Generator          Agent SDK subagent                                           Read, Edit, Write, Bash, Glob, Grep
Judge (single)     Agent SDK subagent                                           Read, Grep, Glob
Judge (board)      N Agent SDK subagents in parallel (default 3, configurable)  Read, Grep, Glob
Reflect            Agent SDK subagent                                           Read, Write, Glob
Evaluator          subprocess.run()                                             (user-provided command)
Clerk (synthesis)  anthropic messages.create()                                  (no tools)
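
How the roles in the table fit together can be sketched as a toy loop. Everything here is illustrative: refine_loop is a hypothetical name, the four callables are stand-ins supplied by the caller, and the real SDK runs these roles as asynchronous Agent SDK subagents. Generating from the best-so-far candidate and comparing on the primary criterion alone are simplifying assumptions; the SDK's actual rules may differ.

```python
def refine_loop(seed, generate, evaluate, judge, reflect, primary, iterations=3):
    """Toy generate -> evaluate -> judge -> reflect loop with best-so-far tracking.

    Iteration 0 scores the seed; iterations 1..N each produce a new candidate.
    Returns (best_candidate, best_iteration, best_primary_score).
    """
    candidate, notes = seed, ""
    best, best_score, best_iteration = seed, float("-inf"), 0
    for i in range(iterations + 1):
        if i > 0:
            # Assumption: the generator works from the best candidate so far.
            candidate = generate(best, notes)
        evidence = evaluate(candidate)        # e.g. output of a shell command
        scores = judge(candidate, evidence)   # dict of criterion -> score
        if scores[primary] > best_score:
            best, best_score, best_iteration = candidate, scores[primary], i
        notes = reflect(candidate, scores, notes)
    return best, best_iteration, best_score
```

With iterations=3 the loop scores four candidates, which lines up with the iteration-0 through iteration-3 files in the output directory listing.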

Development

uv sync --all-extras
uv run pytest tests/ -m "not integration"        # unit tests (no API calls)
ANTHROPIC_API_KEY=... uv run pytest -m integration -v -s  # integration tests (real API)
