EvalOps Kit

CI-native evals for tool-using agents.

EvalOps Kit gives you a deterministic local workflow to:

run a suite against dataset + traces (evalops-kit run)
compare candidate vs baseline (evalops-kit diff)
fail CI when regression gates are violated

Problem this solves

Agent behavior regresses when prompts, tool wiring, model settings, or orchestration change. Manual spot checks miss issues and are hard to reproduce.

EvalOps Kit makes regressions explicit with stable artifacts:

summary.json for aggregate metrics
cases.jsonl for per-case outcomes and tags
report.md / diff Markdown for humans and CI logs

Why CI-native evals matter

If evals are not part of CI, they become optional and drift out of release flow.

With EvalOps Kit, you can gate merges on deterministic checks:

score drops (overall_avg_drop_max, per_grader_avg_drop_max)
quality regressions surfaced as top regressed cases
reliability regressions (missing_traces_increase_max, invalid_traces_increase_max)

The diff command exits non-zero on gate failure, which is directly usable in GitHub Actions.

Quickstart (from zero)

Requirements:

Git
Python 3.10+

From a fresh machine or shell:

git clone https://github.com/mireya-o/evalops-kit.git
cd evalops-kit
python3 -m venv .venv
./.venv/bin/python -m pip install -U pip
./.venv/bin/python -m pip install -e .
./.venv/bin/evalops-kit run --suite examples/minimal/suite.toml --out /tmp/evalops-run
./.venv/bin/evalops-kit diff --base /tmp/evalops-run --cand /tmp/evalops-run --suite examples/minimal/suite.toml

After run, inspect:

/tmp/evalops-run/summary.json
/tmp/evalops-run/cases.jsonl
/tmp/evalops-run/report.md

Important:

This quickstart is a deterministic sanity check on the bundled minimal fixtures.
The bundled minimal suite intentionally includes one missing trace, so seeing traces_missing=1 in the diff output is expected.
For a true baseline-vs-candidate pass/fail regression demo, run ./.venv/bin/python scripts/demo_golden_path.py.

If you want contributor tooling as well, use:

./.venv/bin/python -m pip install -e ".[dev]"

Golden path demo (from a fresh clone)

From the repo root, after the install above:

./.venv/bin/python scripts/demo_golden_path.py

What it does:

Runs a baseline from committed fixtures in examples/golden_path/.
Runs a no-regression candidate and verifies diff exits 0.
Runs a regressed candidate and verifies diff exits non-zero.
Writes pass/fail diff reports and prints their paths.

The script exits 0 only when both expected behaviors are observed.

Diff + gates

Basic usage:

evalops-kit diff \
  --base /path/to/baseline-run \
  --cand /path/to/candidate-run \
  --suite examples/golden_path/suite_baseline.toml \
  --out /tmp/evalops-diff.md

Behavior:

writes Markdown diff report (--out optional; otherwise stdout)
returns 0 when all configured gates pass
returns 2 when one or more gates fail

Supported gate kinds in suite TOML:

overall_avg_drop_max
per_grader_avg_drop_max
missing_traces_increase_max
invalid_traces_increase_max

GitHub Actions workflow template

A copy-paste-ready template is provided at:

examples/github_actions/regression-check.yml

Copy it into .github/workflows/ in your repo, then adapt the suite paths and candidate trace generation step.

OpenAI Agents SDK integration (optional)

Install optional adapter dependencies:

python -m pip install -e ".[agents]"

Minimal integration:

from pathlib import Path

from agents import RunConfig, Runner
from evalops_kit.adapters.openai_agents import (
    EvalOpsAgentsCollector,
    get_processor,
    install_agents_processor,
)

collector = EvalOpsAgentsCollector(Path("traces"))
processor = get_processor(collector=collector)
install_agents_processor(processor, replace_existing=True)

case_id = "case-1"
result = Runner.run_sync(
    agent,
    "Say hello.",
    run_config=RunConfig(trace_metadata={"case_id": case_id}, workflow_name="evalops"),
)
collector.set_final_output(case_id, str(result.final_output))

Use replace_existing=True when EvalOps should be the only trace exporter. See examples/openai_agents/README.md for a complete usage note.

Repository structure

src/evalops_kit/            Core package (CLI, run/diff pipeline, suite parsing, graders)
tests/                      Deterministic unit/integration tests
examples/minimal/           Small starter suite + fixtures
examples/golden_path/       Baseline/regression fixtures for end-to-end demo
examples/github_actions/    Workflow templates (not auto-triggered in this repo)
scripts/demo_golden_path.py Portable Python demo runner

Roadmap (short)

Add more deterministic graders and richer failure tags.
Add more trace adapters behind optional extras.
Stabilize artifact schema/versioning for long-term CI compatibility.
Add migration guidance and release automation for public adoption.

Development checks

python -m ruff check .
python -m ruff format --check .
python -m pytest

Name		Name	Last commit message	Last commit date
Latest commit History 3 Commits
.github/workflows		.github/workflows
examples		examples
scripts		scripts
src/evalops_kit		src/evalops_kit
tests		tests
.editorconfig		.editorconfig
.gitignore		.gitignore
CHANGELOG.md		CHANGELOG.md
CONTRIBUTING.md		CONTRIBUTING.md
LICENSE		LICENSE
README.md		README.md
pyproject.toml		pyproject.toml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

EvalOps Kit

Problem this solves

Why CI-native evals matter

Quickstart (from zero)

Golden path demo (from a fresh clone)

Diff + gates

GitHub Actions workflow template

OpenAI Agents SDK integration (optional)

Repository structure

Roadmap (short)

Development checks

About

Uh oh!

Releases 1

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

EvalOps Kit

Problem this solves

Why CI-native evals matter

Quickstart (from zero)

Golden path demo (from a fresh clone)

Diff + gates

GitHub Actions workflow template

OpenAI Agents SDK integration (optional)

Repository structure

Roadmap (short)

Development checks

About

Resources

License

Contributing

Uh oh!

Stars

Watchers

Forks

Releases 1

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages