Luke/codex readiness skills #48
Open
Overview
This PR adds two Codex readiness skills that evaluate repository guidance quality
and end‑to‑end agentic execution. The unit test combines deterministic
checks with in‑session LLM evaluation of AGENTS.md/PLANS.md quality, while the
integration test runs a full agentic loop and scores the resulting code changes
and build/test outcomes.
1) codex-readiness-unit-test (LLM Codex Readiness Unit Test)
Goal
Validate that AGENTS.md and PLANS.md provide sufficient, usable guidance using
deterministic checks plus in‑session LLM evaluation, and generate a scored
JSON + HTML report.
How it works
This skill builds a report from two pipelines: deterministic filesystem checks
and in‑session LLM evaluation of AGENTS.md/PLANS.md guidance. It writes a
timestamped run directory with evidence, LLM results, and a scored JSON/HTML
report. In the optional execute mode, it also runs a user‑approved plan and
includes the execution logs in scoring. JSON outputs are strictly validated,
with a retry + json‑fix loop for output that fails validation.
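The retry + json‑fix loop could look roughly like the sketch below. It is an illustration under stated assumptions, not the skill's actual code: `fix_json`, `validated_json`, the `produce` callable, and the `required_keys` check are all hypothetical names, and the real validation may be stricter (for example, a full schema check).

```python
import json
from typing import Callable


def fix_json(text: str) -> str:
    """Hypothetical repair step: keep only the outermost {...} span, which
    drops Markdown fences or chatter around an otherwise valid object."""
    start, end = text.find("{"), text.rfind("}")
    return text[start:end + 1] if 0 <= start < end else text


def validated_json(produce: Callable[[], str],
                   required_keys: set[str],
                   max_attempts: int = 3) -> dict:
    """Call produce() until its output is a JSON object with the required keys.

    Each attempt tries the raw text first, then the repaired text, before
    asking produce() for a fresh output.
    """
    last_error: Exception | None = None
    for _ in range(max_attempts):
        raw = produce()
        for candidate in (raw, fix_json(raw)):
            try:
                data = json.loads(candidate)
            except json.JSONDecodeError as exc:
                last_error = exc
                continue
            if isinstance(data, dict) and required_keys.issubset(data):
                return data
            last_error = ValueError("not an object or missing required keys")
    raise ValueError(f"no valid JSON after {max_attempts} attempts: {last_error}")
```

Here `produce` stands in for whatever step yields the JSON text (in this skill, the in‑session LLM evaluation). Keeping the repair step separate from the retry means a cheap local fix is attempted before a new output is requested.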
2) codex-readiness-integration-test (LLM Codex Readiness Integration Test)
Goal
Validate real agentic execution quality by running Codex CLI against the repo,
executing an approved change prompt, and scoring results with evidence + LLM
evaluation.
How it works
This skill runs an end‑to‑end agentic execution against the repo using Codex
CLI, then executes a build/test plan and scores the run from evidence plus LLM
evaluation. It starts the Codex session by launching the CLI as a subprocess
with HOME and XDG_CACHE_HOME pointed at the repo‑local .codex-home, using the
approved prompt.json (change prompt + agentic_loop settings), so that the CLI
reads AGENTS.md and operates in the repo. It requires a repo‑local login,
always runs in execute mode, and writes results to a timestamped run directory
with agentic logs, LLM results, and a summary.
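The subprocess isolation described above can be sketched as follows. This is a minimal illustration, not the skill's implementation: `isolated_env` and `run_in_repo` are hypothetical helpers, setting both variables to exactly `.codex-home` (rather than a subdirectory) is an assumption, and the actual Codex CLI command line is left to the caller because it is not specified here.

```python
import os
import subprocess
import sys
from pathlib import Path


def isolated_env(repo_root: Path) -> dict[str, str]:
    """Copy the current environment, redirecting HOME and XDG_CACHE_HOME
    to <repo>/.codex-home (assumed layout; created if missing)."""
    codex_home = repo_root / ".codex-home"
    codex_home.mkdir(parents=True, exist_ok=True)
    env = os.environ.copy()
    env["HOME"] = str(codex_home)
    env["XDG_CACHE_HOME"] = str(codex_home)
    return env


def run_in_repo(command: list[str], repo_root: Path) -> subprocess.CompletedProcess:
    """Run a command with the repo as working directory and the isolated
    environment, capturing stdout/stderr so they can be saved as logs."""
    return subprocess.run(
        command,
        cwd=repo_root,
        env=isolated_env(repo_root),
        capture_output=True,
        text=True,
        check=False,  # a non-zero exit is evidence to score, not an exception
    )


if __name__ == "__main__":
    # Stand-in command: a child Python process that reports the HOME it sees.
    probe = [sys.executable, "-c", "import os; print(os.environ['HOME'])"]
    print(run_in_repo(probe, Path.cwd()).stdout.strip())
```

Because the child process sees only the redirected HOME, any login state and cache it reads or writes stay inside the repo, which is consistent with the repo‑local login requirement.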