Run AI evals across multiple models in parallel. One command. Side-by-side results.
npm install -g @conductor-oss/conductor-cli # 1. Install the CLI
export ANTHROPIC_API_KEY="sk-ant-..." # 2. Set at least one provider key
conductor server start # 3. Start Conductor (runs on :8080)
pip install conductor-evals # 4. Install this package
conductor-eval workers & # 5. Start workers (auto-registers workflows)
conductor-eval run math-basics --models claude-sonnet gpt-4o --wait # 6. Run!

Example output:
Suite: math-basics | Run: run_a1b2c3d4e5f6 | Status: COMPLETED
Model Summary
Model Avg Score Pass Rate Passed Total
claude-sonnet-4-20250514 1.000 100.0% 4 4
gpt-4o 0.750 75.0% 3 4
Case Results
Case Model Score Passed
add_simple claude-sonnet-4-20250514 1.000 PASS
add_simple gpt-4o 1.000 PASS
add_decimals claude-sonnet-4-20250514 1.000 PASS
add_decimals gpt-4o 1.000 PASS
...
mkdir -p evals/my-eval # 1. Create a suite directory
cat > evals/my-eval/capital.json << 'EOF' # 2. Add a test case
{
"id": "capital_france",
"prompt": "What is the capital of France? Reply with just the city name.",
"agent_type": "direct_llm",
"scoring_method": "text_match",
"expected": { "value": "Paris" },
"match_mode": "contains"
}
EOF
conductor-eval run my-eval --models claude-sonnet --wait # 3. Run it!

That's one file per test case. No config, no registration, no boilerplate. Add more .json files to the suite directory and they're automatically picked up on the next run.
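Suite discovery is easy to picture: every `.json` file in the directory becomes one case. A minimal sketch of that idea (`load_suite` is a hypothetical helper here, not the package's actual loader):

```python
import json
import tempfile
from pathlib import Path

def load_suite(suite_dir):
    """Load every .json file in a suite directory as one eval case each."""
    cases = []
    for path in sorted(Path(suite_dir).glob("*.json")):
        cases.append(json.loads(path.read_text()))
    return cases

# Demo with a throwaway suite directory
with tempfile.TemporaryDirectory() as d:
    Path(d, "capital.json").write_text(json.dumps({
        "id": "capital_france",
        "prompt": "What is the capital of France?",
        "scoring_method": "text_match",
        "expected": {"value": "Paris"},
    }))
    print([c["id"] for c in load_suite(d)])  # ['capital_france']
```

Because discovery is just a glob, deleting or renaming a case file takes effect on the next run with no registration step.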
Want an LLM-as-judge eval instead?
For open-ended questions where there's no single right answer, use llm_judge scoring:
cat > evals/my-eval/reasoning.json << 'EOF'
{
"id": "explain_gravity",
"prompt": "Explain gravity to a 5-year-old in 2 sentences.",
"agent_type": "direct_llm",
"scoring_method": "llm_judge",
"rubric": "Score 1-5: age-appropriate language, accurate concept, concise (2 sentences)"
}
EOF
conductor-eval run my-eval --models claude-sonnet gpt-4o --wait

| | Conductor Evals | Manual scripts | Other eval frameworks |
|---|---|---|---|
| Multi-model comparison | One command, N models in parallel | Write a loop per provider | Varies; often single-model |
| Execution speed | Fan-out — 60 evals run concurrently | Sequential by default | Usually sequential |
| Observability | Full Conductor UI: timing, retries, logs per task | `print()` and log files | Dashboard if you're lucky |
| Reproducibility | JSON cases in git, deterministic mock tool responses | Fragile scripts, no versioning | Config files, but no mock tool layer |
| Scoring methods | Text match, regex, LLM-as-judge, tool trace — built in | Roll your own | Typically text match + LLM judge |
| CI integration | `--wait -o json` exits with structured results | Custom glue code | Often requires wrapper scripts |
| Adding a test case | Drop a `.json` file in a directory | Edit code | Edit code or YAML |
- Why Conductor Evals?
- Quick Start
- CLI Reference
- Conductor CLI Reference
- Web UI
- Writing Eval Cases
- Included Eval Suites
- Architecture
- Extending Conductor Evals
- Contributing
- License
- Community
npm install -g @conductor-oss/conductor-cli
conductor --version

Alternative installation methods:
Quick install (macOS/Linux):
curl -fsSL https://raw.githubusercontent.com/conductor-oss/conductor-cli/main/install.sh | sh

Homebrew (macOS/Linux):

brew install conductor-oss/conductor/conductor

Quick install (Windows PowerShell):

irm https://raw.githubusercontent.com/conductor-oss/conductor-cli/main/install.ps1 | iex

Manual download: grab the binary for your platform from the conductor-cli releases page.
See the conductor-cli repo for full details.
Set these before starting the Conductor server. Workers read API keys from the environment at startup. If the keys aren't set, LLM calls will fail at runtime.
You only need keys for the providers you plan to evaluate:
# Anthropic (Claude models)
export ANTHROPIC_API_KEY="sk-ant-..."
# OpenAI (GPT models)
export OPENAI_API_KEY="sk-..."

All supported providers:
# Google (Gemini models)
export GOOGLE_API_KEY="..."
# or use a service account
export GOOGLE_APPLICATION_CREDENTIALS="/path/to/service-account.json"
# AWS Bedrock
export AWS_ACCESS_KEY_ID="..."
export AWS_SECRET_ACCESS_KEY="..."
export AWS_REGION="us-east-1"
# Azure OpenAI
export AZURE_OPENAI_API_KEY="..."
export AZURE_OPENAI_ENDPOINT="https://your-resource.openai.azure.com"
export AZURE_OPENAI_API_VERSION="2024-12-01-preview"
# Mistral
export MISTRAL_API_KEY="..."
# Cohere
export COHERE_API_KEY="..."
# Together AI
export TOGETHER_API_KEY="..."
# Groq
export GROQ_API_KEY="..."

Tip: Add these to a `.env` file (already in `.gitignore`) and source it: `source .env`
conductor server start

The server will be available at http://localhost:8080.
Using Docker instead
docker run -d --name conductor -p 8080:8080 orkesio/orkes-conductor-standalone:latest

Wait ~30 seconds, then verify it's running:

curl http://localhost:8080/health

See the Conductor repo for more Docker options.
pip install conductor-evals

Install from source:

git clone https://github.com/conductor-sdk/conductor-evals.git
cd conductor-evals
pip install -e .

The default config connects to localhost:8080 — no changes needed for local development.
Set these environment variables to connect to any Conductor server:
| Variable | Required | Description |
|---|---|---|
| `CONDUCTOR_URL` | Yes | Base URL of the Conductor server (e.g., http://localhost:8080 or https://my-conductor.example.com) |
| `CONDUCTOR_AUTH_KEY` | Yes | API key ID for authentication |
| `CONDUCTOR_AUTH_SECRET` | No | API key secret (required for Orkes Conductor; omit for open-source Conductor) |
Open-source Conductor (no auth or static key):
export CONDUCTOR_URL="https://conductor.example.com"
export CONDUCTOR_AUTH_KEY="my-api-key"

The key is sent as-is in the X-Authorization header.
Orkes Conductor (JWT auth):
export CONDUCTOR_URL="https://play.orkes.io"
export CONDUCTOR_AUTH_KEY="your-key-id"
export CONDUCTOR_AUTH_SECRET="your-key-secret"

When CONDUCTOR_AUTH_SECRET is set, the system exchanges the key ID and secret for a JWT via POST /api/token and handles automatic token refresh.
Tip: Add these to a `.env` file (already in `.gitignore`) and source it: `source .env`
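The two auth modes above can be sketched as a small helper. This is illustrative only: the request payload field names (`keyId`, `keySecret`, mirroring the config-file format) and the `token` response field are assumptions, and `fetch_jwt`/`auth_headers` are hypothetical names:

```python
import json
from urllib import request

def fetch_jwt(base_url, key_id, key_secret):
    """Exchange a key ID + secret for a JWT via POST /api/token (network call)."""
    body = json.dumps({"keyId": key_id, "keySecret": key_secret}).encode()
    req = request.Request(f"{base_url}/api/token", data=body,
                          headers={"Content-Type": "application/json"})
    with request.urlopen(req) as resp:
        return json.load(resp)["token"]

def auth_headers(auth_key, auth_secret=None, base_url=None):
    """Open-source Conductor sends the key as-is in X-Authorization;
    with a secret set, a JWT is fetched first and sent instead."""
    if auth_secret is None:
        return {"X-Authorization": auth_key}
    return {"X-Authorization": fetch_jwt(base_url, auth_key, auth_secret)}

print(auth_headers("my-api-key"))  # {'X-Authorization': 'my-api-key'}
```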
If environment variables are not set, the system falls back to config/orkes-config.json:
Show config format
{
"clusters": [
{
"name": "local",
"url": "http://localhost:8080",
"keyId": "your-key-id",
"keySecret": "your-key-secret"
}
]
}

Copy from the example: `cp config/orkes-config.example.json config/orkes-config.json`
Environment variables always take precedence over the config file.
conductor-eval workers # Auto-registers workflows and starts workers (keep this running)

# Quick test with dry-run (no LLM calls, instant results)
conductor-eval run math-basics --models claude-haiku --dry-run --wait

# Real run against Claude Haiku
conductor-eval run math-basics --models claude-haiku --wait

# Compare two models
conductor-eval run coding-basics --models claude-sonnet gpt-4o --wait

That's it. You're running evals.
All commands use a single conductor-eval entry point with subcommands.
conductor-eval <command> [options]
conductor-eval run <suite> --models <model> [<model> ...] [options]

| Flag | Description |
|---|---|
| `<suite>` | Name of an eval suite in `evals/`, or path to a directory of JSON cases |
| `--models` | One or more model presets or custom `provider:model_id` specs (see below) |
| `--wait` | Poll until the workflow completes, then print results |
| `--output`, `-o` | Output format: `text` (default), `markdown`, `json`, `csv` |
| `--run-id` | Custom run ID (auto-generated if omitted) |
| `--dry-run` | Skip real LLM calls, use placeholder responses |
| `--tags` | Only run cases matching any of these tags |
| `--exclude-tags` | Exclude cases matching any of these tags |
| `--sample N` | Randomly sample N cases from the suite |
| `--threshold` | Minimum pass rate (0.0-1.0). Exit non-zero if below. Requires `--wait` |
Without --wait, the CLI prints the workflow ID and exits — you can check results in the Conductor UI or the Web UI.
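With `--wait -o json`, results are machine-readable, which makes CI gating straightforward. A hedged sketch: the JSON shape shown here (a top-level `cases` list with `passed` booleans) is an assumption, so inspect a real run's output before relying on field names:

```python
def pass_rate(results):
    """Fraction of case results that passed."""
    if not results:
        return 0.0
    return sum(1 for r in results if r["passed"]) / len(results)

# Hypothetical shape for `conductor-eval run ... --wait -o json` output
sample = {"cases": [
    {"id": "add_simple", "model": "gpt-4o", "passed": True},
    {"id": "add_decimals", "model": "gpt-4o", "passed": False},
]}

rate = pass_rate(sample["cases"])
print(f"pass rate: {rate:.2f}")  # pass rate: 0.50
```

In CI, compare `rate` against the same value you would pass to `--threshold` and exit non-zero on failure.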
conductor-eval suites # List all eval suites and case counts
conductor-eval cases <suite> # List cases in a suite (id, agent type, scoring method)
conductor-eval models # List available model presets

conductor-eval runs # List all past runs (default: last 20)
conductor-eval runs --suite <suite> # List runs for a specific suite
conductor-eval runs --limit 50 # Show more results
conductor-eval status <workflow_id> # Show run status, progress, and results
conductor-eval cancel <workflow_id> # Cancel a running eval

Compare two completed runs side-by-side:
conductor-eval compare <workflow-id-A> <workflow-id-B>

======================================================================
Run A: run_abc123 (COMPLETED)
Run B: run_def456 (COMPLETED)
======================================================================
Model Run A Avg Run B Avg Delta
------------------------------ ---------- ---------- ----------
claude-sonnet-4-20250514 0.850 0.900 +0.050
gpt-4o 0.750 0.800 +0.050
Use --regression-threshold 0.05 to fail if any model's average score drops by more than the threshold (useful in CI).
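The regression check amounts to comparing per-model averages between the two runs. A sketch of that logic under an assumed data shape (averages keyed by model name); `regressions` is a hypothetical helper, not the CLI's implementation:

```python
def regressions(run_a, run_b, threshold=0.05):
    """Models whose average score dropped by more than `threshold`
    between run A and run B (mirrors --regression-threshold)."""
    drops = {}
    for model, avg_a in run_a.items():
        delta = round(run_b.get(model, 0.0) - avg_a, 6)
        if delta < -threshold:
            drops[model] = delta
    return drops

# Per-model averages keyed by model name (assumed shape)
run_a = {"claude-sonnet-4-20250514": 0.850, "gpt-4o": 0.750}
run_b = {"claude-sonnet-4-20250514": 0.900, "gpt-4o": 0.600}
print(regressions(run_a, run_b))  # {'gpt-4o': -0.15}
```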
# Plain text table (default)
conductor-eval run math-basics --models claude-haiku --wait
# Markdown (good for pasting into PRs/docs)
conductor-eval run math-basics --models claude-haiku --wait -o markdown
# JSON (pipe to jq, save to file, feed into other tools)
conductor-eval run math-basics --models claude-haiku --wait -o json > results.json
# CSV (open in Excel, import into pandas)
conductor-eval run math-basics --models claude-haiku --wait -o csv > results.csv

| Preset | Provider | Model ID |
|---|---|---|
| `claude-sonnet` | Anthropic | claude-sonnet-4-20250514 |
| `claude-opus` | Anthropic | claude-opus-4-20250514 |
| `claude-haiku` | Anthropic | claude-haiku-4-5-20251001 |
| `gpt-4o` | OpenAI | gpt-4o |
| `gpt-4o-mini` | OpenAI | gpt-4o-mini |
| `gpt-5` | OpenAI | gpt-5 |
| `gemini-2.5-pro` | Google Gemini | gemini-2.5-pro |
| `gemini-2.5-flash` | Google Gemini | gemini-2.5-flash |
| `gemini-2.5-flash-lite` | Google Gemini | gemini-2.5-flash-lite |
Presets are defined in config/model-presets.json. You can add your own there. Run conductor-eval models to see all available presets.
You don't need to define a preset to try a new model. Use provider:model_id syntax directly:
# Mix presets and custom models
conductor-eval run math-basics --models claude-sonnet google_gemini:gemini-2.0-flash --wait
# Use any model from any provider
conductor-eval run coding-basics --models openai:o3-mini anthropic:claude-haiku-4-5-20251001 --wait

Custom models get default params (max_tokens: 4096, temperature: 0). The Web UI also supports this — type provider:model_id in the custom model input field.
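Resolving a `--models` entry is easy to sketch: try the preset table first, otherwise split on the first `:`. The `resolve_model` helper and the inlined preset below are hypothetical; real preset definitions live in `config/model-presets.json`:

```python
DEFAULT_PARAMS = {"max_tokens": 4096, "temperature": 0}

# One preset inlined for illustration (see config/model-presets.json)
PRESETS = {
    "claude-sonnet": {"provider": "anthropic", "model_id": "claude-sonnet-4-20250514"},
}

def resolve_model(spec):
    """Resolve a --models entry: preset name, or raw provider:model_id."""
    if spec in PRESETS:
        return {**PRESETS[spec], "params": DEFAULT_PARAMS}
    if ":" in spec:
        provider, model_id = spec.split(":", 1)  # split once: model IDs may contain ':'
        return {"provider": provider, "model_id": model_id, "params": DEFAULT_PARAMS}
    raise ValueError(f"unknown model spec: {spec}")

print(resolve_model("google_gemini:gemini-2.0-flash")["provider"])  # google_gemini
```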
The conductor CLI manages your Conductor server and interacts with workflows and tasks.
conductor server start # Start the Conductor server
conductor server start --port 9090 # Start on a custom port
conductor server stop # Stop the server
conductor server logs -f # Tail server logs

conductor workflow list # List all workflows
conductor workflow start -w workflow_name -i '{"key":"value"}' # Start a workflow
conductor workflow status <workflow_id> # Check workflow status
conductor workflow get-execution <workflow_id> # Get full execution details

conductor task list # List all task definitions
conductor task poll <task_type> # Poll for a task
conductor task update-execution --workflow-id <id> --task-ref-name <ref> # Update a task

| Flag | Description |
|---|---|
| `--server <url>` | Conductor server URL |
| `--auth-token <token>` | Authentication token |
| `--profile <name>` | Configuration profile |
| `--verbose` | Detailed output |
| `--help` | Show help for any command |
For the full CLI documentation, see the conductor-cli repo.
Eval cases are JSON files in evals/<suite-name>/. Each file is one test case. Drop a new .json file in a suite directory and it's automatically included in the next run.
Check if the model's response contains, matches, or equals expected text.
{
"id": "add_simple",
"name": "Add two small numbers",
"agent_type": "direct_llm",
"scoring_method": "text_match",
"prompt": "What is 23 + 47? Reply with just the number.",
"expected": { "value": "70" },
"match_mode": "contains"
}

Match modes: `exact`, `contains`, `regex`, `contains_all`, `contains_any`
For contains_all and contains_any, use "expected": { "values": ["foo", "bar"] }.
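The five match modes map to a few lines of string logic. A sketch (`text_match` is a hypothetical helper; the real scorer's exact whitespace and case handling is an assumption):

```python
import re

def text_match(output, expected, mode="contains"):
    """Hypothetical sketch of the text_match scoring modes."""
    if mode == "exact":
        return output.strip() == expected["value"]
    if mode == "contains":
        return expected["value"] in output
    if mode == "regex":
        return re.search(expected["value"], output) is not None
    if mode == "contains_all":
        return all(v in output for v in expected["values"])
    if mode == "contains_any":
        return any(v in output for v in expected["values"])
    raise ValueError(f"unknown match_mode: {mode}")

print(text_match("The answer is 70.", {"value": "70"}, "contains"))  # True
```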
Another LLM scores the response on a 1-5 rubric (normalized to 0.0-1.0, passes at 0.5+).
{
"id": "ethical_dilemma",
"agent_type": "direct_llm",
"scoring_method": "llm_judge",
"prompt": "Should a hospital deploy an AI system that has 85% accuracy but 15% false positives for critical diagnoses?",
"rubric": "Score 1-5: identifies tradeoffs, considers patient impact, proposes nuanced approach with human oversight"
}

The judge defaults to Claude Sonnet. Override with "judge_model" and "judge_provider".
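The 1-5 normalization works out to `(raw - 1) / 4`, so a raw score of 3 lands exactly on the 0.5 pass line. A sketch (the function name is hypothetical):

```python
def normalize_judge_score(raw, low=1, high=5, pass_at=0.5):
    """Map a raw judge score (1-5) onto 0.0-1.0; pass at 0.5+ (raw 3+)."""
    score = (raw - low) / (high - low)
    return {"score": score, "passed": score >= pass_at}

print(normalize_judge_score(3))  # {'score': 0.5, 'passed': True}
```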
Verify that a tool-use agent called the right tools with the right arguments.
{
"id": "file_search",
"agent_type": "tool_use_agent",
"scoring_method": "tool_trace",
"prompt": "Find the definition of 'calculate_tax' in the project.",
"tools": [
{
"name": "grep_search",
"description": "Search file contents",
"input_schema": {
"type": "object",
"properties": { "pattern": { "type": "string" } },
"required": ["pattern"]
}
}
],
"tool_responses": {
"grep_search": {
"default": { "matches": [] },
"when": [
{
"args_contain": { "pattern": "calculate_tax" },
"response": { "matches": [{ "file": "src/billing.py", "line": 42 }] }
}
]
}
},
"expected_trace": [
{ "tool_name": "grep_search", "args_contain": { "pattern": "calculate_tax" } }
],
"strict_order": false
}

`tool_responses` provides mock responses so tests are deterministic. `strict_order` controls whether tool calls must appear in the exact sequence.
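Trace checking boils down to: for each expected call, find an actual call with the same tool name whose arguments contain the expected key/value pairs, consuming the trace in order when `strict_order` is set. A hypothetical sketch of that logic, not the project's scorer:

```python
def args_contain(call_args, required):
    """True when every required key/value pair appears in the call's args."""
    return all(call_args.get(k) == v for k, v in required.items())

def score_tool_trace(calls, expected_trace, strict_order=False):
    """Hypothetical tool-trace scorer: every expected call must appear,
    in order when strict_order is set."""
    def matches(c, e):
        return c["tool_name"] == e["tool_name"] and args_contain(c["args"], e["args_contain"])

    if strict_order:
        it = iter(calls)  # each expected call must match after the previous one
        ok = all(any(matches(c, e) for c in it) for e in expected_trace)
    else:
        ok = all(any(matches(c, e) for c in calls) for e in expected_trace)
    return {"score": 1.0 if ok else 0.0, "passed": ok}

calls = [{"tool_name": "grep_search", "args": {"pattern": "calculate_tax"}}]
expected = [{"tool_name": "grep_search", "args_contain": {"pattern": "calculate_tax"}}]
print(score_tool_trace(calls, expected))  # {'score': 1.0, 'passed': True}
```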
| `agent_type` | Description |
|---|---|
| `direct_llm` | Single prompt, single response, no tool use |
| `tool_use_agent` | Multi-turn tool-use loop with mock responses |
| `claude_code_agent` | Shells out to the `claude` CLI |
| Field | Description |
|---|---|
| `system_prompt` | System prompt passed to the model |
| `tags` | String array for categorization |
| `timeout_seconds` | Per-case timeout hint |
| `max_tool_turns` | Max tool-use iterations (default: 10) |
| Suite | Cases | What it tests |
|---|---|---|
| `math-basics` | 4 | Simple arithmetic with text matching |
| `coding-basics` | 2 | FizzBuzz, string reversal |
| `tricky-math` | 3 | Order of operations, edge cases |
| `reasoning` | 1 | Ethical reasoning with LLM judge |
| `tool-use` | 1 | Tool call verification |
| `conductor-skill` | 10 | Conductor workflow management with LLM judge |
The project includes a web dashboard for managing and monitoring evals.
conductor-eval server # Start the web UI server (http://localhost:3939)

Development mode:
cd ui
npm install
npm run dev # Hot reload for frontend development

The UI runs at http://localhost:3939 and provides:
- Dashboard — overview of all eval suites
- Run management — start runs with preset or custom models, monitor progress in real-time
- Results — sortable results table with expandable details and direct links to Conductor workflow executions
- Case editor — view and edit eval cases (form mode or raw JSON)
- Run comparison — side-by-side comparison of two runs
Custom models can be added on the fly by typing provider:model_id (e.g. google_gemini:gemini-2.0-flash) in the model input field.
flowchart TD
A["<b>eval_suite</b><br/>Top-level workflow"] --> B["<b>prepare_fork_inputs</b><br/>Build cases x models matrix"]
B --> C["<b>FORK_JOIN_DYNAMIC</b><br/>Fan out N parallel sub-workflows"]
C --> D1["eval_case_run<br/><i>fizzbuzz + claude-sonnet</i>"]
C --> D2["eval_case_run<br/><i>fizzbuzz + gpt-4o</i>"]
C --> D3["eval_case_run<br/><i>string_reverse + claude-sonnet</i>"]
C --> D4["eval_case_run<br/><i>...</i>"]
D1 --> E["<b>JOIN</b>"]
D2 --> E
D3 --> E
D4 --> E
E --> F["<b>aggregate_results</b><br/>Per-model averages and pass rates"]
style A fill:#4A154B,color:#fff
style C fill:#1a73e8,color:#fff
style E fill:#1a73e8,color:#fff
style F fill:#0d652d,color:#fff
Each eval_case_run sub-workflow:
flowchart LR
X["Execute Agent<br/><i>(LLM call or tool loop)</i>"] --> Y["Route to Scorer<br/><i>(text match / LLM judge / tool trace)</i>"] --> Z["Record Result<br/><i>(score + pass/fail)</i>"]
style X fill:#4A154B,color:#fff
style Y fill:#1a73e8,color:#fff
style Z fill:#0d652d,color:#fff
The system is built around Conductor workers — stateless Python functions decorated with @worker_task. You can add custom scorers, agent types, and LLM providers without modifying the core framework.
Every worker follows the same pattern:
- A Python function in `workers/` decorated with `@worker_task(task_definition_name='your_task')`
- An import in `main.py` so the worker is discovered at startup
- Wiring in the workflow JSON (`workflows/eval_case_run.json`) via a SWITCH task that routes to your worker
Task registration is handled automatically — no need to create separate task definition JSON files.
workers/your_worker.py <-- Python logic
main.py <-- Import to register
workflows/eval_case_run.json <-- Route to it
After making changes, restart workers:
conductor-eval workers # Restart workers to pick up new code

Create a new scoring method (e.g., cosine similarity, BLEU score).
Step 1. Add a scorer function to workers/scorers.py:
@worker_task(task_definition_name='score_cosine_similarity')
def score_cosine_similarity(agent_output: str, expected: dict) -> dict:
    """Score based on cosine similarity between output and expected embeddings."""
    similarity = compute_cosine_similarity(agent_output, expected["value"])
    return {
        "score": similarity,            # float 0.0-1.0
        "passed": similarity >= 0.8,    # bool
        "details": f"Cosine similarity: {similarity:.3f}"
    }

All scorers must return {"score": float, "passed": bool, "details": str}.
Step 2. Add a case to the route_scorer SWITCH in workflows/eval_case_run.json:
{
"name": "route_scorer",
"taskReferenceName": "route_scorer",
"type": "SWITCH",
"evaluatorType": "value-param",
"expression": "scoring_method",
"decisionCases": {
"text_match": [ ... ],
"llm_judge": [ ... ],
"tool_trace": [ ... ],
"cosine_similarity": [
{
"name": "score_cosine_similarity",
"taskReferenceName": "score_cosine_similarity",
"type": "SIMPLE",
"inputParameters": {
"agent_output": "${execute_agent.output.response}",
"expected": "${workflow.input.eval_case.expected}"
}
}
]
}
}

Step 3. Wire the output into record_result so it picks up the new scorer's result.
Step 4. Use it in an eval case:
{
"id": "similarity_check",
"agent_type": "direct_llm",
"scoring_method": "cosine_similarity",
"prompt": "Explain photosynthesis in one sentence.",
"expected": { "value": "Plants convert sunlight into energy using chlorophyll." }
}

Agent types control how prompts are executed. Built-in types use Conductor's LLM_CHAT_COMPLETE system task (direct_llm, tool_use_agent) or shell out to a CLI (claude_code_agent). You can add new ones for custom execution strategies.
Step 1. Create a worker function (or add to workers/agent_executor.py):
@worker_task(task_definition_name='execute_custom_agent')
def execute_custom_agent(eval_case: dict, model: dict) -> dict:
    """Execute a custom agent loop."""
    prompt = eval_case["prompt"]
    system_prompt = eval_case.get("system_prompt")

    # Your custom logic here
    result = your_custom_execution(prompt, model)

    # Must return this shape
    return {
        "response": result["text"],
        "tool_calls": result.get("tool_calls", []),
        "token_usage": result.get("tokens", {}),
        "latency_ms": result.get("latency_ms", 0)
    }

All agent executors must return {"response", "tool_calls", "token_usage", "latency_ms"}.
Step 2. Add your import to main.py if you created a new file.
Step 3. Add a case to the route_agent SWITCH in workflows/eval_case_run.json.
Step 4. Use it in eval cases:
{
"id": "my_test",
"agent_type": "custom_agent",
"scoring_method": "text_match",
"prompt": "What is 2 + 2?",
"expected": { "value": "4" }
}

For agent types that use Conductor's built-in LLM_CHAT_COMPLETE task (direct_llm, tool_use_agent), providers are configured on the Conductor server via environment variables — no custom code needed.
For agent types that need custom execution (like claude_code_agent which shells out to the Claude CLI), you write a provider class:
Step 1. Create providers/your_provider.py:
import time

class YourProvider:
    def __init__(self, model_id: str, params: dict | None = None):
        self.model_id = model_id
        self.params = params or {}

    def call(self, prompt: str, system_prompt: str | None = None) -> dict:
        """Execute the prompt and return normalized output."""
        start = time.time()
        # Your API call here
        result = call_your_api(prompt, system_prompt, self.model_id)
        return {
            "response": result["text"],
            "tool_calls": result.get("tool_calls", []),
            "token_usage": result.get("usage", {}),
            "latency_ms": int((time.time() - start) * 1000)
        }

Step 2. Import and use it in your agent executor worker:
if agent_type == "your_custom_type":
    from providers.your_provider import YourProvider
    provider = YourProvider(model_id=model.get("model_id"))
    return provider.call(prompt, system_prompt)

Model presets are shortcuts used with --models on the CLI and in the Web UI. Add new ones to config/model-presets.json:
{
"llama-70b": {
"provider": "together",
"model_id": "meta-llama/Llama-3-70b-chat-hf",
"params": { "max_tokens": 4096, "temperature": 0.0 }
}
}

No re-registration needed — presets are resolved before the workflow starts. Restart the server to pick up changes.
Tip: You can also use any model on the fly with `provider:model_id` syntax (e.g. `together:meta-llama/Llama-3-70b-chat-hf`) without adding a preset.
| What you want to do | Files to change | Restart workers? |
|---|---|---|
| Add an eval case | Drop a `.json` in `evals/<suite>/` | No |
| Add a model preset | Edit `config/model-presets.json` | No (restart server) |
| Use a custom model | Use `provider:model_id` on CLI or Web UI | No |
| Add a scorer | `workers/scorers.py` + `workflows/eval_case_run.json` | Yes |
| Add an agent type | `workers/agent_executor.py` + `workflows/eval_case_run.json` | Yes |
| Add an LLM provider | `providers/*.py` + wire into agent executor | Yes |
Contributions are welcome! Here's how to get started:
- Fork the repo and create a feature branch (`git checkout -b my-feature`)
- Install in development mode: `pip install -e ".[dev]"`
- Make your changes — add tests for new functionality
- Run the test suite: `pytest`
- Open a pull request against `main`
- New scoring methods (e.g., cosine similarity, BLEU score)
- Additional provider adapters
- New eval suites — drop JSON files into `evals/<your-suite>/`
This project is licensed under the Apache License 2.0. See LICENSE for details.
- Slack Community — Ask questions, share feedback, get help
- Conductor GitHub — The orchestration engine powering this project
- Conductor CLI — The CLI for managing Conductor servers
- Conductor Documentation — Official docs and guides