@lukeqin-oai
Collaborator

Overview

This PR adds two Codex readiness skills that evaluate repository guidance quality
and end‑to‑end agentic execution. The unit test combines deterministic checks
with in‑session LLM evaluation of AGENTS.md/PLANS.md quality, while the
integration test runs a full agentic loop and scores the resulting code changes
and build/test outcomes.

1) codex-readiness-unit-test (LLM Codex Readiness Unit Test)

Goal

Validate that AGENTS.md and PLANS.md provide sufficient, usable guidance using
deterministic checks plus in‑session LLM evaluation, and generate a scored
JSON + HTML report.

How it works

This skill builds a report from two pipelines: deterministic filesystem checks
and in‑session LLM evaluation of AGENTS.md/PLANS.md guidance. It writes a
timestamped run directory containing evidence, LLM results, and a scored
JSON/HTML report. In optional execute mode it also runs a user‑approved plan and
includes the execution logs in scoring. JSON outputs are strictly validated, and
invalid output goes through a retry + json‑fix loop.
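
As a rough illustration of the retry + json‑fix loop, here is a minimal Python sketch. It is not the skill's actual code: the names `extract_json_object` and `validated_json`, the `produce` callback, the required-key check, and the attempt limit are all assumptions made for this example. The "fix" step shown here only trims a Markdown fence or stray prose around the outermost braces; the real skill may repair output differently.

```python
from __future__ import annotations

import json
from typing import Callable, Optional


def extract_json_object(text: str) -> str:
    """Hypothetical json-fix step: keep only the outermost {...} span."""
    start = text.find("{")
    end = text.rfind("}")
    if start == -1 or end < start:
        return text  # nothing to trim; let the parser report the error
    return text[start:end + 1]


def validated_json(
    produce: Callable[[Optional[str]], str],
    required_keys: set[str],
    max_attempts: int = 3,
) -> dict:
    """Call produce() until it yields a JSON object with the required keys.

    produce receives the previous validation error (or None on the first
    attempt) so the caller can feed it back to whatever generates the text.
    """
    error: Optional[str] = None
    for _ in range(max_attempts):
        raw = produce(error)
        # Try the raw text first, then the trimmed ("fixed") version.
        for candidate in (raw, extract_json_object(raw)):
            try:
                data = json.loads(candidate)
            except json.JSONDecodeError as exc:
                error = f"invalid JSON: {exc}"
                continue
            if not isinstance(data, dict):
                error = "top-level value must be an object"
                continue
            missing = required_keys - data.keys()
            if missing:
                error = f"missing keys: {sorted(missing)}"
                continue
            return data
    raise ValueError(f"no valid JSON after {max_attempts} attempts: {error}")
```

Passing the last error back into `produce` lets a retry prompt say what went wrong, which is one common way to make such loops converge; whether the skill does this is not stated in the PR.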

2) codex-readiness-integration-test (LLM Codex Readiness Integration Test)

Goal

Validate real agentic execution quality by running Codex CLI against the repo,
executing an approved change prompt, and scoring results with evidence + LLM
evaluation.

How it works

This skill runs an end‑to‑end agentic execution against the repo using Codex
CLI, then executes a build/test plan and scores the run from evidence plus LLM
evaluation. It starts the Codex session by launching the CLI as a subprocess
with HOME/XDG_CACHE_HOME pointed at the repo‑local .codex-home, and passes it
the approved prompt.json (change prompt + agentic_loop settings) so the CLI
reads AGENTS.md and operates in the repo. It requires a repo‑local login, always
runs in execute mode, and writes results to a timestamped run directory with
agentic logs, LLM results, and a summary.
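
The subprocess isolation described above can be sketched in Python roughly as follows. This is an illustration under stated assumptions, not the skill's code: `isolated_env` and `run_in_repo` are hypothetical helper names, the command to run is supplied by the caller (no Codex CLI flags are assumed here), and pointing both HOME and XDG_CACHE_HOME at the same `.codex-home` directory is one reading of the description.

```python
from __future__ import annotations

import os
import subprocess
from pathlib import Path


def isolated_env(repo_root: Path) -> dict[str, str]:
    """Copy the current environment, redirecting HOME and XDG_CACHE_HOME.

    Assumption: both variables point at <repo_root>/.codex-home.
    """
    codex_home = repo_root / ".codex-home"
    codex_home.mkdir(parents=True, exist_ok=True)
    env = dict(os.environ)
    env["HOME"] = str(codex_home)
    env["XDG_CACHE_HOME"] = str(codex_home)
    return env


def run_in_repo(command: list[str], repo_root: Path) -> subprocess.CompletedProcess:
    """Run a caller-supplied command inside the repo with the isolated env."""
    return subprocess.run(
        command,
        cwd=repo_root,
        env=isolated_env(repo_root),
        capture_output=True,  # keep stdout/stderr so they can be logged
        text=True,
        check=False,  # let the caller decide how to score a non-zero exit
    )
```

Keeping the child's HOME inside the repo means any login state or cache the CLI writes stays in `.codex-home` rather than the user's real home directory, which fits the stated requirement for a repo‑local login.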

@lukeqin-oai lukeqin-oai requested a review from a team January 21, 2026 22:00