diff --git a/.gitignore b/.gitignore
index f583b7e5..9e76d3f4 100644
--- a/.gitignore
+++ b/.gitignore
@@ -11,3 +11,8 @@
 progress.txt
 #Claude
 .claude/
+
+# Parallel mode state
+.ralph/
+agent_logs/
+progress-agent-*.txt
diff --git a/AGENTS.md b/AGENTS.md
index 9da9ecd1..0f00dcea 100644
--- a/AGENTS.md
+++ b/AGENTS.md
@@ -45,3 +45,14 @@ npm run dev
 - Memory persists via git history, `progress.txt`, and `prd.json`
 - Stories should be small enough to complete in one context window
 - Always update AGENTS.md with discovered patterns for future iterations
+
+## Parallel Mode
+
+Ralph supports running multiple agents in parallel via Docker containers. See `parallel/README.md` for details.
+
+- Parallel scripts live in `parallel/` — orchestrator, status, stop
+- Docker image and container entrypoint live in `docker/`
+- Agents claim stories via the `claimed_by` field in prd.json using git atomic push
+- Each agent writes to its own `progress-<agent-id>.txt` to avoid merge conflicts
+- Builder agents have restricted network access (Claude API + npm only)
+- Researcher agents have full internet access
diff --git a/CLAUDE.md b/CLAUDE.md
index f95bb927..121fb5ef 100644
--- a/CLAUDE.md
+++ b/CLAUDE.md
@@ -7,7 +7,7 @@ You are an autonomous coding agent working on a software project.
 1. Read the PRD at `prd.json` (in the same directory as this file)
 2. Read the progress log at `progress.txt` (check Codebase Patterns section first)
 3. Check you're on the correct branch from PRD `branchName`. If not, check it out or create from main.
-4. Pick the **highest priority** user story where `passes: false`
+4. Pick the **highest priority** user story where `passes: false` and all `dependsOn` story IDs (if any) have `passes: true`
 5. Implement that single user story
 6. Run quality checks (e.g., typecheck, lint, test - use whatever your project requires)
 7. Update CLAUDE.md files if you discover reusable patterns (see below)
diff --git a/README.md b/README.md
index d79d8b62..ccf2864a 100644
--- a/README.md
+++ b/README.md
@@ -143,6 +143,8 @@ Ralph will:
 | `skills/ralph/` | Skill for converting PRDs to JSON (works with Amp and Claude Code) |
 | `.claude-plugin/` | Plugin manifest for Claude Code marketplace discovery |
 | `flowchart/` | Interactive visualization of how Ralph works |
+| `docker/` | Dockerfile and container scripts for parallel mode |
+| `parallel/` | Parallel mode orchestrator, status, and stop scripts |
 
 ## Flowchart
 
@@ -232,6 +234,58 @@ After copying `prompt.md` (for Amp) or `CLAUDE.md` (for Claude Code) to your pro
 Ralph automatically archives previous runs when you start a new feature (different `branchName`). Archives are saved to `archive/YYYY-MM-DD-feature-name/`.
 
+## Parallel Mode (Docker)
+
+Ralph includes a parallel mode that runs N containerized Claude Code agents simultaneously against the same PRD. Each agent runs in a Docker container with:
+
+- **Network restrictions** — builder agents can only reach Claude API and npm registry
+- **Resource limits** — configurable memory and CPU caps per container
+- **Story claiming** — agents claim stories via git atomic push to avoid duplicate work
+- **Automatic recovery** — stale claims are cleared, crashed containers are restarted
+
+### Prerequisites (Parallel Mode)
+
+- Docker installed and running
+- A Claude Code auth token (env var, file, or 1Password)
+- `jq` installed
+
+### Quick Start (Parallel Mode)
+
+```bash
+# Set your Claude auth token
+export RALPH_CLAUDE_TOKEN='<your-token>'
+
+# Run 3 agents in parallel
+./parallel/ralph-parallel.sh --agents 3
+
+# Check status
+./parallel/status.sh
+
+# Graceful shutdown
+./parallel/stop.sh
+```
+
+### Options
+
+```bash
+./parallel/ralph-parallel.sh \
+  --agents 3 \                           # number of builder agents (default: 2)
+  --model claude-sonnet-4-5-20250929 \   # model (default: sonnet)
+  --memory 4g \                          # per-container memory
limit + --cpus 2 \ # per-container CPU limit + --researcher 1 \ # researcher agents with full internet access + [max_iterations] # per-agent iteration cap (default: 0 = until PRD complete) +``` + +### Auth Token + +Priority order (first wins): +1. `RALPH_CLAUDE_TOKEN` environment variable +2. `.ralph/token` file in the project directory +3. 1Password via `op read` (interactive, startup only) + +See [parallel/README.md](parallel/README.md) for full documentation. + ## References - [Geoffrey Huntley's Ralph article](https://ghuntley.com/ralph/) diff --git a/docker/Dockerfile b/docker/Dockerfile new file mode 100644 index 00000000..40434b44 --- /dev/null +++ b/docker/Dockerfile @@ -0,0 +1,41 @@ +FROM node:20-slim + +# System deps for networking, git, and general tooling +RUN apt-get update && apt-get install -y --no-install-recommends \ + git \ + iptables \ + ipset \ + iproute2 \ + dnsutils \ + jq \ + curl \ + sudo \ + ca-certificates \ + && rm -rf /var/lib/apt/lists/* + +# Install claude-code globally +RUN npm install -g @anthropic-ai/claude-code + +# Create non-root agent user (UID 1001 since node:20-slim uses 1000 for 'node') +RUN useradd -m -s /bin/bash -u 1001 agent + +# Allow agent to run firewall init scripts via sudo (no password) +RUN echo "agent ALL=(root) NOPASSWD:SETENV: /opt/ralph/init-firewall-builder.sh, /opt/ralph/init-firewall-researcher.sh" \ + > /etc/sudoers.d/agent-firewall && chmod 0440 /etc/sudoers.d/agent-firewall + +# Copy scripts +COPY agent-loop.sh /opt/ralph/agent-loop.sh +COPY init-firewall-builder.sh /opt/ralph/init-firewall-builder.sh +COPY init-firewall-researcher.sh /opt/ralph/init-firewall-researcher.sh +RUN chmod +x /opt/ralph/agent-loop.sh /opt/ralph/init-firewall-builder.sh /opt/ralph/init-firewall-researcher.sh + +# Workspace for cloned repo +RUN mkdir -p /workspace && chown agent:agent /workspace + +# Claude config directory +RUN mkdir -p /home/agent/.claude && chown agent:agent /home/agent/.claude + +USER agent +WORKDIR 
/workspace + +ENTRYPOINT ["/opt/ralph/agent-loop.sh"] diff --git a/docker/agent-loop.sh b/docker/agent-loop.sh new file mode 100755 index 00000000..d72eb8ee --- /dev/null +++ b/docker/agent-loop.sh @@ -0,0 +1,586 @@ +#!/usr/bin/env bash +set -euo pipefail +# +# agent-loop.sh — Container entrypoint for Ralph parallel agents. +# +# Clones from a bind-mounted project directory, claims stories from prd.json +# via git atomic push, runs Claude Code per iteration, and pushes results. +# +# Expected environment variables: +# AGENT_ID - Unique agent identifier (e.g., "agent-1") +# AGENT_ROLE - One of: builder, researcher +# MAX_ITERATIONS - Max loop iterations (0 = infinite, default: 0) +# CLAUDE_MODEL - Model to use (default: claude-sonnet-4-5-20250929) +# +# Auth: Claude credentials are mounted via Docker volume at /home/agent/.claude +# +# Exit codes: +# 0 - Clean exit (all stories complete, stop requested, or iteration limit) +# 1 - General failure (missing credentials, prompt error, etc.) +# 2 - Auth failure (expired/invalid credentials after MAX_AUTH_FAILURES retries) +# The orchestrator treats exit 2 as a signal to halt all agents. 
+# + +AGENT_ID="${AGENT_ID:?AGENT_ID is required}" +AGENT_ROLE="${AGENT_ROLE:?AGENT_ROLE is required}" +MAX_ITERATIONS="${MAX_ITERATIONS:-0}" +CLAUDE_MODEL="${CLAUDE_MODEL:-claude-sonnet-4-5-20250929}" + +REPO_PATH="/repo.git" +WORKSPACE="/workspace" +PROMPT_DIR="/parallel-prompt" +PROMPT_FILE="$PROMPT_DIR/CLAUDE-parallel.md" +STOP_FILE="/harness-state/stop_requested" +LOG_DIR="/agent-logs" +ITERATION=0 + +echo "[$AGENT_ID] Starting agent loop (role=$AGENT_ROLE, model=$CLAUDE_MODEL, max_iterations=$MAX_ITERATIONS)" + +# --- Step 1: Initialize firewall based on role --- +echo "[$AGENT_ID] Initializing firewall for role: $AGENT_ROLE" +case "$AGENT_ROLE" in + researcher) + sudo /opt/ralph/init-firewall-researcher.sh + ;; + *) + sudo RALPH_EXTRA_DOMAINS="${RALPH_EXTRA_DOMAINS:-}" /opt/ralph/init-firewall-builder.sh + ;; +esac + +# --- Step 2: Copy Claude auth credentials from mounted volume --- +if [ ! -f /claude-auth/.credentials.json ]; then + echo "[$AGENT_ID] ERROR: No Claude credentials found at /claude-auth/.credentials.json" + echo "[$AGENT_ID] Ensure the ralph-claude-auth volume is mounted." 
+ exit 1 +fi +mkdir -p ~/.claude +cp /claude-auth/.credentials.json ~/.claude/.credentials.json +chmod 600 ~/.claude/.credentials.json +echo "[$AGENT_ID] Claude credentials copied" + +# --- Step 3: Clone or update workspace from bare repo --- +setup_workspace() { + if [ -d "$WORKSPACE/.git" ]; then + echo "[$AGENT_ID] Fetching latest changes" + cd "$WORKSPACE" + git fetch origin + # Reset to current branch's remote tracking, not hard-coded main + local current_branch + current_branch=$(git branch --show-current 2>/dev/null || echo "") + if [ -n "$current_branch" ]; then + git reset --hard "origin/$current_branch" 2>/dev/null || true + else + git reset --hard origin/main 2>/dev/null || git reset --hard origin/master 2>/dev/null || true + fi + else + echo "[$AGENT_ID] Cloning bare repo into workspace" + git clone "$REPO_PATH" "$WORKSPACE" + cd "$WORKSPACE" + fi +} + +# --- Step 4: Set git identity --- +setup_git_identity() { + if [ -n "${GIT_AUTHOR_NAME_OVERRIDE:-}" ] && [ -n "${GIT_AUTHOR_EMAIL_OVERRIDE:-}" ]; then + git config user.name "$GIT_AUTHOR_NAME_OVERRIDE" + git config user.email "$GIT_AUTHOR_EMAIL_OVERRIDE" + echo "[$AGENT_ID] Committing as: $GIT_AUTHOR_NAME_OVERRIDE <$GIT_AUTHOR_EMAIL_OVERRIDE>" + else + git config user.name "$AGENT_ID" + git config user.email "${AGENT_ID}@ralph-agent.local" + fi + git config pull.rebase true + git config push.autoSetupRemote true +} + +# --- Step 5: Check out the correct branch from prd.json --- +checkout_prd_branch() { + if [ ! 
-f "$WORKSPACE/prd.json" ]; then + echo "[$AGENT_ID] WARNING: No prd.json found in workspace" + return 1 + fi + + local branch_name + branch_name=$(jq -r '.branchName // empty' "$WORKSPACE/prd.json" 2>/dev/null || echo "") + + if [ -z "$branch_name" ]; then + echo "[$AGENT_ID] No branchName in prd.json, staying on current branch" + return 0 + fi + + local current_branch + current_branch=$(git branch --show-current 2>/dev/null || echo "") + + if [ "$current_branch" = "$branch_name" ]; then + echo "[$AGENT_ID] Already on branch: $branch_name" + return 0 + fi + + echo "[$AGENT_ID] Checking out branch: $branch_name" + if git show-ref --verify --quiet "refs/heads/$branch_name" 2>/dev/null; then + git checkout "$branch_name" + elif git show-ref --verify --quiet "refs/remotes/origin/$branch_name" 2>/dev/null; then + git checkout -b "$branch_name" "origin/$branch_name" + else + git checkout -b "$branch_name" + fi +} + +# --- Step 5b: Check if this agent already owns an incomplete story --- +# Returns 0 and prints story ID if found, returns 1 if no active claim +check_existing_claim() { + cd "$WORKSPACE" + + # Pull latest prd.json + local current_branch + current_branch=$(git branch --show-current 2>/dev/null || echo "") + # Always fetch first so we discover remote branches created by other agents + git fetch origin >&2 2>&1 || true + if git rev-parse --verify "origin/$current_branch" >/dev/null 2>&1; then + git pull --rebase >&2 2>&1 || { + git rebase --abort >/dev/null 2>&1 || true + git fetch origin >&2 2>&1 + git reset --hard "origin/$current_branch" >&2 2>&1 + } + fi + + if [ ! 
-f prd.json ]; then + return 1 + fi + + local story_id + story_id=$(jq -r --arg agent "$AGENT_ID" ' + .userStories + | map(select(.passes == false and .claimed_by == $agent)) + | first + | .id // empty + ' prd.json 2>/dev/null || echo "") + + if [ -n "$story_id" ]; then + echo "[$AGENT_ID] Already owns incomplete story: $story_id" >&2 + echo "$story_id" + return 0 + fi + return 1 +} + +# --- Step 5c: Release a claim after Claude failure --- +release_claim() { + local story_id="$1" + cd "$WORKSPACE" + + if [ ! -f prd.json ]; then + return 1 + fi + + echo "[$AGENT_ID] Releasing claim on $story_id" >&2 + + jq --arg sid "$story_id" ' + .userStories |= map( + if .id == $sid then + .claimed_by = null | .claimed_at = null + else . end + ) + ' prd.json > prd.json.tmp && mv prd.json.tmp prd.json + + git add prd.json >&2 2>&1 + git commit -m "[$AGENT_ID] Release: $story_id (Claude failure)" >&2 2>&1 || { + git checkout -- prd.json >/dev/null 2>&1 || true + return 1 + } + git push >&2 2>&1 || { + echo "[$AGENT_ID] Failed to push release for $story_id" >&2 + git reset --hard HEAD~1 >&2 2>&1 + return 1 + } + return 0 +} + +# --- Step 6: Claim a story in prd.json --- +# Returns 0 and prints story ID if claimed, returns 1 if no stories available +claim_story() { + cd "$WORKSPACE" + + # Pull latest prd.json (all git output to stderr to keep stdout clean for return value) + # On a new branch with no remote tracking yet, pull will fail — that's fine, we continue + local current_branch + current_branch=$(git branch --show-current 2>/dev/null || echo "") + # Always fetch first so we discover remote branches created by other agents + git fetch origin >&2 2>&1 || true + if git rev-parse --verify "origin/$current_branch" >/dev/null 2>&1; then + git pull --rebase >&2 2>&1 || { + git rebase --abort >/dev/null 2>&1 || true + git fetch origin >&2 2>&1 + git reset --hard "origin/$current_branch" >&2 2>&1 + } + fi + + if [ ! 
-f prd.json ]; then + echo "[$AGENT_ID] No prd.json found" >&2 + return 1 + fi + + # Find highest-priority unclaimed story whose dependencies are all satisfied + local story_id + story_id=$(jq -r ' + . as $prd | + ($prd.userStories | map(select(.passes == true)) | map(.id)) as $passed | + $prd.userStories + | map(select( + .passes == false + and (.claimed_by == null or .claimed_by == "") + and ((.dependsOn // []) | all(. as $dep | $passed | any(. == $dep))) + )) + | sort_by(.priority) + | first + | .id // empty + ' prd.json 2>/dev/null || echo "") + + if [ -z "$story_id" ]; then + echo "[$AGENT_ID] No unclaimed stories available" >&2 + return 1 + fi + + # Claim it by setting claimed_by and claimed_at + local timestamp + timestamp=$(date -u +"%Y-%m-%dT%H:%M:%SZ") + + jq --arg agent "$AGENT_ID" --arg ts "$timestamp" --arg sid "$story_id" ' + .userStories |= map( + if .id == $sid then + .claimed_by = $agent | .claimed_at = $ts + else . end + ) + ' prd.json > prd.json.tmp && mv prd.json.tmp prd.json + + git add prd.json >&2 2>&1 + git commit -m "[$AGENT_ID] Claim: $story_id" >&2 2>&1 || { + echo "[$AGENT_ID] Failed to commit claim for $story_id" >&2 + git checkout -- prd.json >/dev/null 2>&1 || true + return 1 + } + + # Atomic push — if this fails, another agent claimed something concurrently + if git push >&2 2>&1; then + echo "$story_id" + return 0 + else + echo "[$AGENT_ID] Push failed (concurrent claim). Resetting and retrying..." 
>&2 + git reset --hard HEAD~1 >&2 2>&1 + local retry_branch + retry_branch=$(git branch --show-current 2>/dev/null || echo "") + # Fetch first to discover remote branches created by other agents + git fetch origin >&2 2>&1 || true + if git rev-parse --verify "origin/$retry_branch" >/dev/null 2>&1; then + git pull --rebase >&2 2>&1 || { + git rebase --abort >/dev/null 2>&1 || true + git fetch origin >&2 2>&1 + git reset --hard "origin/$retry_branch" >&2 2>&1 + } + fi + return 1 + fi +} + +# --- Step 6b: Claim a story for verification --- +# Returns 0 and prints story ID if claimed, returns 1 if no stories ready for verification +claim_verification() { + cd "$WORKSPACE" + + # Pull latest prd.json + local current_branch + current_branch=$(git branch --show-current 2>/dev/null || echo "") + # Always fetch first so we discover remote branches created by other agents + git fetch origin >&2 2>&1 || true + if git rev-parse --verify "origin/$current_branch" >/dev/null 2>&1; then + git pull --rebase >&2 2>&1 || { + git rebase --abort >/dev/null 2>&1 || true + git fetch origin >&2 2>&1 + git reset --hard "origin/$current_branch" >&2 2>&1 + } + fi + + if [ ! -f prd.json ]; then + echo "[$AGENT_ID] No prd.json found" >&2 + return 1 + fi + + # Find stories with passes=true, verified!=true, and no verified_by claim + local story_id + story_id=$(jq -r ' + .userStories + | map(select(.passes == true and .verified != true and (.verified_by == null or .verified_by == ""))) + | first + | .id // empty + ' prd.json 2>/dev/null || echo "") + + if [ -z "$story_id" ]; then + echo "[$AGENT_ID] No stories ready for verification" >&2 + return 1 + fi + + # Claim it by setting verified_by and verified_at + local timestamp + timestamp=$(date -u +"%Y-%m-%dT%H:%M:%SZ") + + jq --arg agent "$AGENT_ID" --arg ts "$timestamp" --arg sid "$story_id" ' + .userStories |= map( + if .id == $sid then + .verified_by = $agent | .verified_at = $ts + else . 
end + ) + ' prd.json > prd.json.tmp && mv prd.json.tmp prd.json + + git add prd.json >&2 2>&1 + git commit -m "[$AGENT_ID] Verify claim: $story_id" >&2 2>&1 || { + echo "[$AGENT_ID] Failed to commit verification claim for $story_id" >&2 + git checkout -- prd.json >/dev/null 2>&1 || true + return 1 + } + + # Atomic push + if git push >&2 2>&1; then + echo "$story_id" + return 0 + else + echo "[$AGENT_ID] Push failed (concurrent claim). Resetting and retrying..." >&2 + git reset --hard HEAD~1 >&2 2>&1 + local retry_branch + retry_branch=$(git branch --show-current 2>/dev/null || echo "") + # Fetch first to discover remote branches created by other agents + git fetch origin >&2 2>&1 || true + if git rev-parse --verify "origin/$retry_branch" >/dev/null 2>&1; then + git pull --rebase >&2 2>&1 || { + git rebase --abort >/dev/null 2>&1 || true + git fetch origin >&2 2>&1 + git reset --hard "origin/$retry_branch" >&2 2>&1 + } + fi + return 1 + fi +} + +# --- Step 7: Check if all stories are complete --- +all_stories_complete() { + if [ ! -f "$WORKSPACE/prd.json" ]; then + return 1 + fi + + if [ "$AGENT_ROLE" = "verifier" ]; then + # Verifiers check that all stories are both passing AND verified + local incomplete + incomplete=$(jq '[.userStories[] | select(.passes == false or .verified != true)] | length' "$WORKSPACE/prd.json" 2>/dev/null || echo "1") + [ "$incomplete" -eq 0 ] + else + local incomplete + incomplete=$(jq '[.userStories[] | select(.passes == false)] | length' "$WORKSPACE/prd.json" 2>/dev/null || echo "1") + [ "$incomplete" -eq 0 ] + fi +} + +# --- Step 8: Push changes with retry --- +push_with_retry() { + local max_attempts=3 + local attempt=0 + local branch + branch=$(git branch --show-current 2>/dev/null || echo "main") + + while [ $attempt -lt $max_attempts ]; do + if git push origin "$branch" 2>&1; then + echo "[$AGENT_ID] Push successful." + return 0 + fi + attempt=$((attempt + 1)) + echo "[$AGENT_ID] Push failed (attempt $attempt/$max_attempts). 
Rebasing..." + git pull --rebase origin "$branch" 2>&1 || { + echo "[$AGENT_ID] Rebase conflict. Aborting rebase and resetting." + git rebase --abort 2>/dev/null || true + git fetch origin + git reset --hard "origin/$branch" + return 1 + } + done + + echo "[$AGENT_ID] Failed to push after $max_attempts attempts." + return 1 +} + +# --- Step 9: Prepare the prompt with agent identity --- +prepare_prompt() { + if [ ! -f "$PROMPT_FILE" ]; then + echo "[$AGENT_ID] ERROR: Prompt file not found at $PROMPT_FILE" + return 1 + fi + + # Inject agent identity and claimed story into prompt + sed -e "s/{{AGENT_ID}}/$AGENT_ID/g" -e "s/{{CLAIMED_STORY}}/$CLAIMED_STORY/g" "$PROMPT_FILE" +} + +# --- Main loop --- +setup_workspace +setup_git_identity +checkout_prd_branch + +echo "[$AGENT_ID] Entering main loop" + +AUTH_FAILURES=0 +MAX_AUTH_FAILURES=5 + +while true; do + # Check stop signal + if [ -s "$STOP_FILE" ]; then + echo "[$AGENT_ID] Stop requested. Exiting gracefully." + exit 0 + fi + + # Check iteration limit + if [ "$MAX_ITERATIONS" -gt 0 ] && [ "$ITERATION" -ge "$MAX_ITERATIONS" ]; then + echo "[$AGENT_ID] Reached max iterations ($MAX_ITERATIONS). Exiting." + exit 0 + fi + + # Check if all stories are done + if all_stories_complete; then + echo "[$AGENT_ID] All stories complete. Exiting." + exit 0 + fi + + # Clean any unstaged changes from previous iteration to prevent rebase failures + cd "$WORKSPACE" + git checkout -- . 
2>/dev/null || true + git clean -fd 2>/dev/null || true + + ITERATION=$((ITERATION + 1)) + COMMIT=$(git rev-parse --short=6 HEAD 2>/dev/null || echo "000000") + LOGFILE="${LOG_DIR}/${AGENT_ID}_iter${ITERATION}_${COMMIT}.log" + TIMESTAMP=$(date -u +"%Y-%m-%dT%H:%M:%SZ") + + echo "[$AGENT_ID] === Iteration $ITERATION (commit: $COMMIT) at $TIMESTAMP ===" + + # First check if this agent already owns an incomplete story (from a failed iteration) + CLAIMED_STORY="" + + if [ "$AGENT_ROLE" = "verifier" ]; then + CLAIM_FUNC="claim_verification" + else + # Check for existing claim before trying to grab a new one + if CLAIMED_STORY=$(check_existing_claim); then + echo "[$AGENT_ID] Resuming existing claim: $CLAIMED_STORY" + fi + CLAIM_FUNC="claim_story" + fi + + # If no existing claim, attempt to claim a new story (retry up to 3 times) + CLAIM_ATTEMPTS=0 + while [ $CLAIM_ATTEMPTS -lt 3 ] && [ -z "$CLAIMED_STORY" ]; do + # Use if to prevent set -e from killing the script on claim failure + if CLAIMED_STORY=$($CLAIM_FUNC); then + break + else + CLAIMED_STORY="" + CLAIM_ATTEMPTS=$((CLAIM_ATTEMPTS + 1)) + sleep 2 + fi + done + + if [ -z "$CLAIMED_STORY" ]; then + if [ "$AGENT_ROLE" = "verifier" ]; then + echo "[$AGENT_ID] No stories ready for verification. Checking if all complete..." + if all_stories_complete; then + echo "[$AGENT_ID] All stories verified. Exiting." + exit 0 + fi + echo "[$AGENT_ID] Builders still working. Waiting 30s..." + sleep 30 + continue + else + echo "[$AGENT_ID] Could not claim any story. Checking if all complete..." + if all_stories_complete; then + echo "[$AGENT_ID] All stories complete. Exiting." + exit 0 + fi + echo "[$AGENT_ID] Stories exist but couldn't claim. Waiting 30s..." + sleep 30 + continue + fi + fi + + echo "[$AGENT_ID] Claimed story: $CLAIMED_STORY" + + # Prepare prompt + PROMPT=$(prepare_prompt) || { + echo "[$AGENT_ID] Failed to prepare prompt. Sleeping 10s..." 
+ sleep 10 + continue + } + + # Run Claude + echo "[$AGENT_ID] Running Claude (model: $CLAUDE_MODEL) for story: $CLAIMED_STORY" + CLAUDE_EXIT=0 + claude --dangerously-skip-permissions \ + --print \ + --model "$CLAUDE_MODEL" \ + -p "$PROMPT" \ + &> "$LOGFILE" || CLAUDE_EXIT=$? + + # Reset auth failure counter on successful invocation + if [ $CLAUDE_EXIT -eq 0 ]; then + AUTH_FAILURES=0 + fi + + # Detect hard failures (auth errors, crashes) — release claim so other agents can take it + if [ $CLAUDE_EXIT -ne 0 ] && [ -f "$LOGFILE" ]; then + if grep -q "authentication_error\|OAuth token has expired\|Failed to authenticate" "$LOGFILE"; then + AUTH_FAILURES=$((AUTH_FAILURES + 1)) + echo "[$AGENT_ID] Claude auth failure #$AUTH_FAILURES/$MAX_AUTH_FAILURES. Releasing claim on $CLAIMED_STORY." + echo "[$AGENT_ID] Error details from log:" + grep -i "error\|rate\|limit\|auth" "$LOGFILE" | tail -5 || true + release_claim "$CLAIMED_STORY" || true + if [ "$AUTH_FAILURES" -ge "$MAX_AUTH_FAILURES" ]; then + echo "[$AGENT_ID] Reached max auth failures ($MAX_AUTH_FAILURES). Exiting to avoid infinite loop." + exit 2 + fi + # Exponential backoff: 60, 120, 240, 480, 480 (capped) + BACKOFF=$((60 * (1 << (AUTH_FAILURES - 1)))) + [ "$BACKOFF" -gt 480 ] && BACKOFF=480 + echo "[$AGENT_ID] Waiting ${BACKOFF}s before retrying (exponential backoff)..." + sleep "$BACKOFF" + continue + fi + echo "[$AGENT_ID] Claude exited with error (code: $CLAUDE_EXIT). Check log: $LOGFILE" + fi + + echo "[$AGENT_ID] Claude session complete. Pushing changes..." + + # Stage and commit any remaining unstaged changes + if ! git diff --quiet || ! git diff --cached --quiet; then + git add -A + if ! 
git diff --cached --quiet; then + git commit -m "[$AGENT_ID] Iteration $ITERATION: $CLAIMED_STORY" || true + fi + fi + + # Push with retry + push_with_retry + + # Write per-agent progress + { + echo "## $TIMESTAMP - $CLAIMED_STORY (Iteration $ITERATION)" + echo "- Agent: $AGENT_ID" + echo "- Commit: $COMMIT" + echo "---" + } >> "$WORKSPACE/progress-${AGENT_ID}.txt" + + # Check for completion sentinel in output + if [ -f "$LOGFILE" ] && grep -q "COMPLETE" "$LOGFILE"; then + echo "[$AGENT_ID] Completion sentinel detected. Verifying all stories..." + git pull --rebase 2>/dev/null || true + if all_stories_complete; then + echo "[$AGENT_ID] All stories confirmed complete. Exiting." + exit 0 + fi + fi + + echo "[$AGENT_ID] Iteration $ITERATION complete. Sleeping 5s..." + sleep 5 +done diff --git a/docker/init-firewall-builder.sh b/docker/init-firewall-builder.sh new file mode 100755 index 00000000..5da643fe --- /dev/null +++ b/docker/init-firewall-builder.sh @@ -0,0 +1,65 @@ +#!/usr/bin/env bash +set -euo pipefail +# +# Builder firewall: whitelist only Claude API and user-specified domains. +# Everything else is denied. Must run as root (called via sudo). +# +# Extra domains are passed via the RALPH_EXTRA_DOMAINS env var (comma-separated). 
+#
+
+# Flush existing rules
+iptables -F OUTPUT 2>/dev/null || true
+iptables -F INPUT 2>/dev/null || true
+
+# Allow loopback
+iptables -A OUTPUT -o lo -j ACCEPT
+iptables -A INPUT -i lo -j ACCEPT
+
+# Allow established/related connections (responses to our outbound requests)
+iptables -A INPUT -m state --state ESTABLISHED,RELATED -j ACCEPT
+iptables -A OUTPUT -m state --state ESTABLISHED,RELATED -j ACCEPT
+
+# Allow DNS (needed to resolve hostnames before we can whitelist IPs)
+iptables -A OUTPUT -p udp --dport 53 -j ACCEPT
+iptables -A OUTPUT -p tcp --dport 53 -j ACCEPT
+
+# Always-allowed domains (Claude Code and npm installs need these)
+ALLOWED_DOMAINS=(
+  "api.anthropic.com"
+  "statsig.anthropic.com"
+  "registry.npmjs.org"
+)
+
+# Append user-specified domains from RALPH_EXTRA_DOMAINS env var
+if [ -n "${RALPH_EXTRA_DOMAINS:-}" ]; then
+  IFS=',' read -ra extra <<< "$RALPH_EXTRA_DOMAINS"
+  for domain in "${extra[@]}"; do
+    # Trim whitespace
+    domain=$(echo "$domain" | xargs)
+    [ -n "$domain" ] && ALLOWED_DOMAINS+=("$domain")
+  done
+fi
+
+# Resolve and allow each whitelisted domain
+for domain in "${ALLOWED_DOMAINS[@]}"; do
+  # Resolve all IPs for the domain
+  ips=$(dig +short "$domain" 2>/dev/null | grep -E '^[0-9]+\.' || true)
+  for ip in $ips; do
+    iptables -A OUTPUT -p tcp -d "$ip" --dport 443 -j ACCEPT
+    echo "[firewall] Allowed: $domain -> $ip:443"
+  done
+
+  # Also resolve CNAME targets (CDNs etc)
+  cnames=$(dig +short "$domain" 2>/dev/null | grep -v -E '^[0-9]+\.' || true)
+  for cname in $cnames; do
+    cname_ips=$(dig +short "$cname" 2>/dev/null | grep -E '^[0-9]+\.' || true)
+    for ip in $cname_ips; do
+      iptables -A OUTPUT -p tcp -d "$ip" --dport 443 -j ACCEPT
+      echo "[firewall] Allowed: $domain (via $cname) -> $ip:443"
+    done
+  done
+done
+
+# Default deny all other outbound traffic
+iptables -A OUTPUT -j DROP
+
+echo "[firewall] Builder firewall initialized.
Allowed domains: ${ALLOWED_DOMAINS[*]}" diff --git a/docker/init-firewall-researcher.sh b/docker/init-firewall-researcher.sh new file mode 100755 index 00000000..a98855e3 --- /dev/null +++ b/docker/init-firewall-researcher.sh @@ -0,0 +1,9 @@ +#!/usr/bin/env bash +set -euo pipefail +# +# Researcher firewall: no-op. +# Docker default networking allows all outbound traffic. +# Researcher agents need full internet access for web searches, docs, etc. +# + +echo "[firewall] Researcher firewall: no restrictions (full internet access)." diff --git a/parallel/CLAUDE-parallel.md b/parallel/CLAUDE-parallel.md new file mode 100644 index 00000000..5ab0adf6 --- /dev/null +++ b/parallel/CLAUDE-parallel.md @@ -0,0 +1,132 @@ +# Ralph Parallel Agent Instructions + +You are **{{AGENT_ID}}**, an autonomous coding agent running in parallel with other agents on this project. You are in a sandboxed Docker container with `--dangerously-skip-permissions`. + +## Your Task + +Your assigned story is **{{CLAIMED_STORY}}** — it has already been claimed for you in prd.json. Do NOT claim or work on any other story. + +> **Note:** Stories may have a `dependsOn` field listing prerequisite story IDs. The harness only assigns stories whose dependencies are already complete. You don't need to check this yourself. + +1. Read the PRD at `prd.json` +2. Read ALL progress files: `progress.txt` and any `progress-*.txt` files (check Codebase Patterns section first) +3. Check you're on the correct branch from PRD `branchName`. If not, check it out or create from main. +4. `git pull --rebase` to get the latest code +5. Implement **only** story **{{CLAIMED_STORY}}** +6. Run quality checks (e.g., typecheck, lint, test - use whatever your project requires) +7. Update AGENTS.md files if you discover reusable patterns (see below) +8. If checks pass, commit ALL changes with message: `feat: {{CLAIMED_STORY}} - [Story Title]` +9. Update the PRD to set `passes: true` for **{{CLAIMED_STORY}}** only +10. 
Append your progress to `progress-{{AGENT_ID}}.txt`
+
+**Important**: Do NOT modify `claimed_by` fields or claim additional stories. The harness script manages claiming. You implement the one story assigned to you.
+
+## Per-Agent Progress Files
+
+- **Write** your progress to: `progress-{{AGENT_ID}}.txt`
+- **Read** all progress files before starting: `progress.txt` and all `progress-*.txt`
+
+This avoids merge conflicts on a shared progress file. Your per-agent progress file follows the same format:
+
+```
+## [Date/Time] - [Story ID]
+- What was implemented
+- Files changed
+- **Learnings for future iterations:**
+  - Patterns discovered
+  - Gotchas encountered
+  - Useful context
+---
+```
+
+## Conflict Resolution
+
+Since multiple agents push to the same branch:
+
+1. **Always `git pull --rebase` before pushing.** Never merge.
+2. If rebase has conflicts:
+   - Try to resolve them (prefer keeping both changes)
+   - If you can't resolve: `git rebase --abort`, `git fetch origin`, `git reset --hard origin/<branch>`, and redo your changes
+3. **3-strike push rule**: If push fails 3 times after rebase, document the blocker in `progress-{{AGENT_ID}}.txt` under "## Blockers" and move on to a different story.
+
+## Push Protocol
+
+Always follow this sequence:
+1. `git add -A`
+2. `git commit -m "[{{AGENT_ID}}] <message>"`
+3. `git pull --rebase origin <branch>`
+4. `git push origin <branch>`
+5.
If push fails, repeat from step 3 (max 3 retries) + +## Progress Report Format + +APPEND to `progress-{{AGENT_ID}}.txt` (never replace, always append): +``` +## [Date/Time] - [Story ID] +- What was implemented +- Files changed +- **Learnings for future iterations:** + - Patterns discovered (e.g., "this codebase uses X for Y") + - Gotchas encountered (e.g., "don't forget to update Z when changing W") + - Useful context (e.g., "the evaluation panel is in component X") +--- +``` + +## Consolidate Patterns + +If you discover a **reusable pattern** that future iterations should know, add it to the `## Codebase Patterns` section at the TOP of `progress-{{AGENT_ID}}.txt` (create it if it doesn't exist). Only add patterns that are **general and reusable**, not story-specific details. + +## Update AGENTS.md Files + +Before committing, check if any edited files have learnings worth preserving in nearby AGENTS.md files: + +1. **Identify directories with edited files** - Look at which directories you modified +2. **Check for existing AGENTS.md** - Look for AGENTS.md in those directories or parent directories +3. **Add valuable learnings** - If you discovered something future developers/agents should know: + - API patterns or conventions specific to that module + - Gotchas or non-obvious requirements + - Dependencies between files + - Testing approaches for that area + - Configuration or environment requirements + +**Do NOT add:** +- Story-specific implementation details +- Temporary debugging notes +- Information already in progress files + +## Quality Requirements + +- ALL commits must pass your project's quality checks (typecheck, lint, test) +- Do NOT commit broken code +- Keep changes focused and minimal +- Follow existing code patterns + +## Browser Testing (If Available) + +For any story that changes UI, verify it works in the browser if you have browser testing tools configured: + +1. Navigate to the relevant page +2. Verify the UI changes work as expected +3. 
Take a screenshot if helpful for the progress log + +If no browser tools are available, note in your progress report that manual browser verification is needed. + +## Stop Condition + +After completing a user story, check if ALL stories have `passes: true`. + +If ALL stories are complete and passing, reply with: +COMPLETE + +If there are still stories with `passes: false`, end your response normally (another iteration will pick up the next story). + +## Important + +- You are **{{AGENT_ID}}** — always use this in commit messages and progress files +- Work on ONE story per iteration +- Claim before working — never start without claiming +- Commit frequently with small, focused commits +- Keep CI green +- Read ALL progress files (yours and other agents') before starting +- Do not attempt to install system packages or modify system configuration +- Focus on making measurable progress each iteration — quality over quantity diff --git a/parallel/CLAUDE-verifier.md b/parallel/CLAUDE-verifier.md new file mode 100644 index 00000000..b18549b1 --- /dev/null +++ b/parallel/CLAUDE-verifier.md @@ -0,0 +1,101 @@ +# Ralph Verifier Agent Instructions + +You are **{{AGENT_ID}}** (role: verifier), an autonomous verification agent running in parallel with builder agents. You are in a sandboxed Docker container with `--dangerously-skip-permissions`. + +## Your Task + +Your assigned story is **{{CLAIMED_STORY}}** — it has `passes: true` set by a builder agent. Your job is to **independently verify** that the implementation actually works by running the project's tests and inspecting results against the acceptance criteria. + +1. Read the PRD at `prd.json` +2. Find story **{{CLAIMED_STORY}}** and read its acceptance criteria +3. Read ALL progress files: `progress.txt` and any `progress-*.txt` files for context on what was built +4. `git pull --rebase` to get the latest code +5. Auto-detect the test framework and run tests +6. Evaluate results against acceptance criteria +7. 
Update prd.json based on your findings (see Verification Outcomes below)
8. Commit and push your changes

## Auto-Detect Test Framework

Inspect the project root for build/test configuration:

| File | Command |
|------|---------|
| `package.json` (with `scripts.test`) | `npm test` |
| `pyproject.toml` or `setup.py` | `pytest` |
| `Cargo.toml` | `cargo test` |
| `go.mod` | `go test ./...` |
| `Makefile` (with `test` target) | `make test` |
| `build.gradle` or `build.gradle.kts` | `./gradlew test` |
| `pom.xml` | `mvn test` |

If multiple are present, prefer the one most relevant to the story's changes. If no test framework is found, note this in `verification_notes` and mark the story as verified (there are no tests to fail).

## Verification Outcomes

### If tests PASS and acceptance criteria are met:

Update prd.json for **{{CLAIMED_STORY}}**:
```json
{
  "verified": true,
  "verified_by": "{{AGENT_ID}}",
  "verified_at": "<ISO 8601 timestamp>",
  "verification_notes": "All tests pass. <brief summary>"
}
```

### If tests FAIL or acceptance criteria are NOT met:

**Bounce the story back to builders** by updating prd.json for **{{CLAIMED_STORY}}**:
```json
{
  "passes": false,
  "claimed_by": null,
  "claimed_at": null,
  "verified": false,
  "verified_by": null,
  "verified_at": null,
  "verification_notes": "FAILED: <what failed and why>"
}
```

This clears the builder's claim so another builder can pick it up and fix the issues.

## Push Protocol

Always follow this sequence:
1. `git add prd.json`
2. `git commit -m "[{{AGENT_ID}}] Verify: {{CLAIMED_STORY}} — <result>"`
3. `git pull --rebase origin <branch>`
4. `git push origin <branch>`
5. 
If push fails, repeat from step 3 (max 3 retries) + +## Critical Rules + +- **Do NOT modify source code** — you only modify `prd.json` +- **Do NOT claim additional stories** — the harness assigns stories to you +- **One story per iteration** — verify the assigned story and exit +- Run the full test suite, not just targeted tests, to catch regressions +- Be specific in `verification_notes` — builders need actionable feedback to fix failures +- If the test command itself fails to run (missing dependencies, build errors), that counts as a failure + +## Progress Report + +APPEND to `progress-{{AGENT_ID}}.txt`: +``` +## [Date/Time] - Verify {{CLAIMED_STORY}} +- Result: PASS/FAIL +- Tests run: +- Details: +--- +``` + +## Stop Condition + +After verifying a story, check if ALL stories have `passes: true` AND `verified: true`. + +If ALL stories are verified, reply with: +COMPLETE + +If there are still unverified stories, end your response normally (another iteration will pick up the next story). diff --git a/parallel/README.md b/parallel/README.md new file mode 100644 index 00000000..1a5ede0e --- /dev/null +++ b/parallel/README.md @@ -0,0 +1,243 @@ +# Ralph Parallel Mode + +Run N containerized Claude Code agents simultaneously against the same PRD. Each agent is sandboxed in Docker with network restrictions, resource limits, and no host access. + +## How It Works + +1. **Orchestrator** (`ralph-parallel.sh`) builds a Docker image, creates networks, and launches N containers +2. Each container runs the **agent loop** (`docker/agent-loop.sh`) which: + - Clones the project from a bind-mounted directory + - Claims a story in `prd.json` via git atomic push + - Runs Claude Code with the parallel prompt + - Pushes results and picks the next story +3. The orchestrator monitors container health and recovers stale claims +4. 
When all stories have `passes: true`, everything shuts down + +## Prerequisites + +- Docker installed and running +- A Claude Code auth token +- `jq` installed (`brew install jq` on macOS) +- A `prd.json` in the ralph root directory + +## Quick Start + +```bash +# Set your Claude auth token +export RALPH_CLAUDE_TOKEN='' + +# Run with 3 builder agents +./parallel/ralph-parallel.sh --agents 3 + +# Check status +./parallel/status.sh + +# Graceful shutdown +./parallel/stop.sh +``` + +## CLI Options + +``` +./parallel/ralph-parallel.sh [options] [max_iterations] + +Options: + --project DIR Project directory with prd.json (default: current dir) + --image IMAGE Custom Docker image (default: ralph-agent:latest) + --agents N Number of builder agents (default: 2) + --researcher N Number of researcher agents with full internet (default: 0) + --model MODEL Claude model (default: claude-sonnet-4-5-20250929) + --memory SIZE Per-container memory limit (default: 4g) + --cpus N Per-container CPU limit (default: 2) + --allow-domain D Extra domain to whitelist in firewall (repeatable) + +Arguments: + max_iterations Per-agent iteration cap (default: 0 = until PRD complete) +``` + +## Authentication + +Token retrieval priority (first wins): + +1. **`RALPH_CLAUDE_TOKEN` env var** — set before running +2. **`.ralph/token` file** — write your token here +3. **1Password** via `op read` — interactive, startup only + +## Story Claiming + +Agents claim stories by modifying `prd.json` and using git's atomic push as a lock: + +1. Agent finds highest-priority unclaimed story (`passes: false`, `claimed_by` empty) +2. Sets `claimed_by` and `claimed_at` fields +3. Commits and pushes +4. If push fails (another agent pushed first), rebase and pick a different story + +### Stale Claim Recovery + +The orchestrator checks for claims older than 30 minutes where the agent's container is no longer running. Stale claims are automatically cleared so other agents can pick up the work. 
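The claim sequence above (find the highest-priority unclaimed story, set the claim fields, then commit and push) can be sketched with `jq`. This is an illustrative sketch, not the harness's exact implementation: the file path and field names assume a prd.json shaped like the example schema (`userStories` entries with `id`, `priority`, `passes`, `claimed_by`).

```bash
# Demo prd.json: US-1 is done, US-2 is already claimed, US-3 is up for grabs
cat > /tmp/ralph-demo-prd.json <<'EOF'
{"userStories":[
  {"id":"US-1","priority":1,"passes":true,"claimed_by":null},
  {"id":"US-2","priority":2,"passes":false,"claimed_by":"agent-2"},
  {"id":"US-3","priority":3,"passes":false,"claimed_by":null}
]}
EOF

# Step 1: highest-priority story with passes=false and no claim
# (lower priority number = higher priority, hence sort_by ascending)
story_id=$(jq -r '[.userStories[]
  | select(.passes == false and (.claimed_by == null or .claimed_by == ""))]
  | sort_by(.priority) | .[0].id' /tmp/ralph-demo-prd.json)
echo "claiming: $story_id"

# Step 2: set claimed_by/claimed_at; a real agent would now commit and push.
# If the push is rejected, another agent won the race: rebase and pick again.
jq --arg sid "$story_id" --arg me "agent-1" \
   '.userStories |= map(if .id == $sid
      then .claimed_by = $me | .claimed_at = (now | todate)
      else . end)' /tmp/ralph-demo-prd.json > /tmp/ralph-demo-prd.json.tmp
mv /tmp/ralph-demo-prd.json.tmp /tmp/ralph-demo-prd.json

jq -r --arg sid "$story_id" \
   '.userStories[] | select(.id == $sid) | .claimed_by' /tmp/ralph-demo-prd.json
# prints: agent-1
```

The git push itself is what makes this atomic: two agents can both write a claim locally, but only one push lands, and the loser rebases and discovers the story is taken.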
+ +## Agent Roles + +| Role | Network | Purpose | +|------|---------|---------| +| `builder` | API + allowed domains only | Feature implementation, testing, code changes | +| `researcher` | Full internet | Web research, documentation lookup | + +Builder agents are restricted via iptables to only reach: +- `api.anthropic.com` (Claude API) — always allowed +- `statsig.anthropic.com` (telemetry) — always allowed +- Any domains passed via `--allow-domain` + +Use `--allow-domain` to whitelist package registries your project needs: + +```bash +# Node.js +./parallel/ralph-parallel.sh --allow-domain registry.npmjs.org + +# Python +./parallel/ralph-parallel.sh \ + --allow-domain pypi.org \ + --allow-domain files.pythonhosted.org + +# Go +./parallel/ralph-parallel.sh \ + --allow-domain proxy.golang.org \ + --allow-domain sum.golang.org + +# Rust +./parallel/ralph-parallel.sh \ + --allow-domain crates.io \ + --allow-domain static.crates.io +``` + +## Custom Images + +The default `ralph-agent:latest` image is based on `node:20-slim` (Node.js is required for Claude Code). If your project needs additional runtimes (Python, Go, Rust, etc.), extend the base image. + +### `Dockerfile.ralph` Convention + +The easiest way to make a project "ralph-ready" is to add a `Dockerfile.ralph` to the project root. When ralph detects this file, it automatically builds a project-specific image — no `--image` flag needed. + +``` +my-project/ +├── Dockerfile.ralph # <-- ralph auto-detects this +├── prd.json +├── src/ +└── ... +``` + +```bash +# Ralph sees Dockerfile.ralph and builds automatically +./parallel/ralph-parallel.sh --project /path/to/my-project --agents 3 +``` + +The image is tagged `ralph-agent-:latest` (derived from `prd.json`'s `project` field) so multiple projects don't collide. + +### Image Contract + +`Dockerfile.ralph` should extend `ralph-agent:latest`. 
When doing so, you **must**: + +- Preserve the `agent` user (UID 1001) — do not delete or change its UID +- Keep the `/opt/ralph/` scripts intact — do not remove or modify them +- Keep the default `ENTRYPOINT` (`/opt/ralph/agent-loop.sh`) +- Switch back to `USER agent` after installing system packages + +### Example: Python Project + +```dockerfile +# Dockerfile.ralph +FROM ralph-agent:latest +USER root +RUN apt-get update && apt-get install -y --no-install-recommends \ + python3 python3-pip python3-venv \ + && rm -rf /var/lib/apt/lists/* +USER agent +``` + +### Example: Go Project + +```dockerfile +# Dockerfile.ralph +FROM ralph-agent:latest +USER root +RUN curl -fsSL https://go.dev/dl/go1.22.0.linux-$(dpkg --print-architecture).tar.gz \ + | tar -C /usr/local -xz +ENV PATH="/usr/local/go/bin:${PATH}" +USER agent +``` + +### Image Resolution Order + +1. `--image IMAGE` flag — explicit override, used as-is +2. `Dockerfile.ralph` in the project directory — auto-built +3. Default `ralph-agent:latest` — base image with Node.js only + +## File Layout + +``` +docker/ +├── Dockerfile # Container image: node:20-slim + claude-code +├── agent-loop.sh # Container entrypoint: firewall → auth → clone → loop +├── init-firewall-builder.sh # iptables: whitelist API + allowed domains +└── init-firewall-researcher.sh # No-op (full internet) + +parallel/ +├── ralph-parallel.sh # Host orchestrator: launch, monitor, restart +├── stop.sh # Graceful shutdown +├── status.sh # Container status + story board + logs +├── CLAUDE-parallel.md # Parallel-aware prompt for agents +├── README.md # This file +└── lib/ + ├── auth.sh # Token retrieval: env > file > 1Password + ├── network-setup.sh # Docker network create/teardown + ├── docker-helpers.sh # Container launch/stop/restart + └── logging.sh # Timestamped log helpers +``` + +## Exit Codes & Auth Failure Halt + +Agent containers use distinct exit codes so the orchestrator can respond appropriately: + +| Exit Code | Meaning | Orchestrator 
Action |
|-----------|---------|---------------------|
| 0 | Clean exit (stories complete, stop requested, iteration limit) | No action |
| 2 | Auth failure (credentials expired after 5 retries) | **Halt all agents** |
| Other non-zero | Crash, OOM, or unexpected error | Restart the container |

Since all agents share the same credential volume, a single auth failure means none of them can authenticate. When exit code 2 is detected, the orchestrator immediately stops all remaining containers, tears down networks, and exits with a message telling you to refresh credentials. No restart loop occurs.

## Per-Agent Progress Files

Instead of all agents appending to one `progress.txt` (a merge-conflict risk), each agent writes to `progress-<agent_id>.txt`. The parallel prompt instructs agents to read ALL progress files for context and write only to their own.

## Differences from `ralph.sh`

| | `ralph.sh` (original) | `ralph-parallel.sh` (new) |
|---|---|---|
| Runs on | Host, bare metal | Host, launches Docker containers |
| Agents | 1, sequential | N, parallel |
| Prompt | `CLAUDE.md` | `parallel/CLAUDE-parallel.md` |
| Auth | Delegates to CLI | Env var / file / 1Password |
| Network | Unrestricted | iptables firewall per container |
| Progress | `progress.txt` | `progress-<agent_id>.txt` per agent |
| Config | CLI args | CLI args |

## Debugging

```bash
# Check container status
./parallel/status.sh

# View container logs directly
docker logs ralph-agent-1

# View agent log files
ls -lt agent_logs/

# Check prd.json story status
jq '.userStories[] | {id, title, passes, claimed_by}' prd.json

# Force rebuild the Docker image
docker rmi ralph-agent:latest
./parallel/ralph-parallel.sh --agents 1
```
diff --git a/parallel/lib/auth.sh b/parallel/lib/auth.sh
new file mode 100644
index 00000000..5115ac20
--- /dev/null
+++ b/parallel/lib/auth.sh
@@ -0,0 +1,59 @@
#!/usr/bin/env bash
#
# auth.sh — Claude auth token retrieval for Ralph parallel mode. 
#
# Token retrieval priority:
#   1. RALPH_CLAUDE_TOKEN env var
#   2. .ralph/token file in project directory
#   3. 1Password via `op read` (interactive — only at startup)
#

# Prints the token on stdout. All log output is sent to stderr so callers
# can safely capture the result with command substitution.
fetch_claude_token() {
  local project_dir="${1:-}"

  # Priority 1: Environment variable
  if [ -n "${RALPH_CLAUDE_TOKEN:-}" ]; then
    log_info "Using token from RALPH_CLAUDE_TOKEN env var" >&2
    echo "$RALPH_CLAUDE_TOKEN"
    return 0
  fi

  # Priority 2: Token file
  if [ -n "$project_dir" ] && [ -f "$project_dir/.ralph/token" ]; then
    local file_token
    file_token=$(cat "$project_dir/.ralph/token")
    if [ -n "$file_token" ]; then
      log_info "Using token from $project_dir/.ralph/token" >&2
      echo "$file_token"
      return 0
    fi
  fi

  # Priority 3: 1Password (interactive — will prompt for biometric/password)
  log_info "Fetching token from 1Password (may prompt for auth)..." >&2

  if ! command -v op &> /dev/null; then
    log_error "No token available."
    log_error "Provide a token via one of:"
    log_error "  1. RALPH_CLAUDE_TOKEN env var"
    log_error "  2. File at <project_dir>/.ralph/token"
    log_error "  3. Install 1Password CLI: https://developer.1password.com/docs/cli/get-started/"
    return 1
  fi

  local op_ref="${OP_ITEM_REF:-op://Private/Claude Code OAuth/credential}"
  local token
  # Capture stdout only; capturing stderr too (2>&1) would let op's warnings
  # corrupt the token value
  token=$(op read "$op_ref") || {
    log_error "Failed to read token from 1Password."
    log_error "Reference: $op_ref"
    log_error "Fallback: set RALPH_CLAUDE_TOKEN env var or write token to <project_dir>/.ralph/token"
    return 1
  }

  if [ -z "$token" ]; then
    log_error "1Password returned empty token."
    return 1
  fi

  echo "$token"
}
diff --git a/parallel/lib/docker-helpers.sh b/parallel/lib/docker-helpers.sh
new file mode 100644
index 00000000..f37c1737
--- /dev/null
+++ b/parallel/lib/docker-helpers.sh
@@ -0,0 +1,136 @@
#!/usr/bin/env bash
#
# docker-helpers.sh — Container launch and management helpers for Ralph parallel mode. 
+# + +RALPH_IMAGE="${RALPH_IMAGE:-ralph-agent:latest}" +CLAUDE_AUTH_VOLUME="ralph-claude-auth" + +build_image() { + local docker_dir="$1" + local dockerfile="${2:-}" + + log_info "Building Ralph agent image..." + if [ -n "$dockerfile" ]; then + docker build -t "$RALPH_IMAGE" -f "$dockerfile" "$docker_dir" + else + docker build -t "$RALPH_IMAGE" "$docker_dir" + fi + log_info "Image built: $RALPH_IMAGE" +} + +# Verify the shared Claude auth volume exists and has credentials +check_auth_volume() { + if ! docker volume inspect "$CLAUDE_AUTH_VOLUME" &> /dev/null; then + log_error "Claude auth volume '$CLAUDE_AUTH_VOLUME' not found." + log_error "Run the auth setup first:" + log_error " docker run -it --entrypoint bash \\" + log_error " -v $CLAUDE_AUTH_VOLUME:/home/agent/.claude \\" + log_error " $RALPH_IMAGE" + log_error " Then inside: claude login" + return 1 + fi + + # Quick check that credentials exist on the volume + local has_creds + has_creds=$(docker run --rm --entrypoint test \ + -v "$CLAUDE_AUTH_VOLUME":/claude-auth:ro \ + "$RALPH_IMAGE" -f /claude-auth/.credentials.json && echo "yes" || echo "no") + + if [ "$has_creds" != "yes" ]; then + log_error "No credentials found on auth volume. Run 'claude login' in a container first." 
+ return 1 + fi + + return 0 +} + +launch_agent() { + local agent_id="$1" + local agent_role="$2" + local project_dir="$3" + local claude_model="$4" + local max_iterations="$5" + local container_memory="${6:-4g}" + local container_cpus="${7:-2}" + local git_author_name="${8:-}" + local git_author_email="${9:-}" + + # Determine network based on role (verifiers use builder network — no internet needed) + local network + case "$agent_role" in + researcher) network="$RESEARCHER_NETWORK" ;; + *) network="$BUILDER_NETWORK" ;; + esac + + local container_name="ralph-${agent_id}" + local project_dir_abs + project_dir_abs="$(cd "$project_dir" && pwd)" + + # Select prompt based on role: verifiers get CLAUDE-verifier.md, others get CLAUDE-parallel.md + local prompt_path + if [ "$agent_role" = "verifier" ]; then + prompt_path="${VERIFIER_PROMPT:-$project_dir_abs/parallel/CLAUDE-verifier.md}" + else + prompt_path="${PARALLEL_PROMPT:-$project_dir_abs/parallel/CLAUDE-parallel.md}" + fi + + log_info "Launching container: $container_name (role=$agent_role, network=$network)" + + # Ensure log and state directories exist, and stop signal file is present + # (Docker errors on bind-mounting a nonexistent file) + mkdir -p "$project_dir_abs/agent_logs" "$project_dir_abs/.ralph" + touch "$project_dir_abs/.ralph/stop_requested" + + docker run -d \ + --name "$container_name" \ + --network "$network" \ + --memory="$container_memory" \ + --cpus="$container_cpus" \ + --pids-limit=256 \ + --cap-add=NET_ADMIN \ + --cap-add=NET_RAW \ + --label "ralph.role=$agent_role" \ + --label "ralph.agent_id=$agent_id" \ + --label "ralph.project_dir=$project_dir_abs" \ + -e "AGENT_ID=$agent_id" \ + -e "AGENT_ROLE=$agent_role" \ + -e "CLAUDE_MODEL=$claude_model" \ + -e "MAX_ITERATIONS=$max_iterations" \ + -e "GIT_AUTHOR_NAME_OVERRIDE=${git_author_name}" \ + -e "GIT_AUTHOR_EMAIL_OVERRIDE=${git_author_email}" \ + -e "RALPH_EXTRA_DOMAINS=${RALPH_EXTRA_DOMAINS:-}" \ + -v "$CLAUDE_AUTH_VOLUME:/claude-auth:ro" \ + -v 
"$project_dir_abs/.ralph/repo.git:/repo.git:rw" \ + -v "$prompt_path:/parallel-prompt/CLAUDE-parallel.md:ro" \ + -v "$project_dir_abs/agent_logs:/agent-logs:rw" \ + -v "$project_dir_abs/.ralph/stop_requested:/harness-state/stop_requested:ro" \ + "$RALPH_IMAGE" + + log_info "Container $container_name started" +} + +stop_agent() { + local container_name="$1" + local timeout="${2:-30}" + + log_info "Stopping container: $container_name (timeout=${timeout}s)" + docker stop -t "$timeout" "$container_name" 2>/dev/null || true + docker rm "$container_name" 2>/dev/null || true +} + +is_agent_running() { + local container_name="$1" + docker inspect -f '{{.State.Running}}' "$container_name" 2>/dev/null | grep -q "true" +} + +restart_agent() { + local container_name="$1" + + log_info "Restarting container: $container_name" + docker restart "$container_name" 2>/dev/null || { + log_error "Could not restart $container_name" + return 1 + } + log_info "Container $container_name restarted" +} diff --git a/parallel/lib/logging.sh b/parallel/lib/logging.sh new file mode 100644 index 00000000..8c8f6290 --- /dev/null +++ b/parallel/lib/logging.sh @@ -0,0 +1,16 @@ +#!/usr/bin/env bash +# +# logging.sh — Timestamped log helpers for Ralph parallel mode. +# + +log_info() { + echo "[$(date -u +"%H:%M:%S")] INFO: $*" +} + +log_warn() { + echo "[$(date -u +"%H:%M:%S")] WARN: $*" >&2 +} + +log_error() { + echo "[$(date -u +"%H:%M:%S")] ERROR: $*" >&2 +} diff --git a/parallel/lib/network-setup.sh b/parallel/lib/network-setup.sh new file mode 100644 index 00000000..dfeca5a0 --- /dev/null +++ b/parallel/lib/network-setup.sh @@ -0,0 +1,42 @@ +#!/usr/bin/env bash +# +# network-setup.sh — Docker network creation and teardown for Ralph parallel mode. +# + +BUILDER_NETWORK="ralph-builder" +RESEARCHER_NETWORK="ralph-researcher" + +create_networks() { + log_info "Creating Docker networks..." + + # Builder network: bridge with masquerade (firewall handles restrictions) + if ! 
docker network inspect "$BUILDER_NETWORK" &> /dev/null; then + docker network create "$BUILDER_NETWORK" \ + --driver bridge \ + --opt "com.docker.network.bridge.enable_ip_masquerade=true" + log_info "Created network: $BUILDER_NETWORK" + else + log_info "Network $BUILDER_NETWORK already exists" + fi + + # Researcher network: standard bridge with full internet + if ! docker network inspect "$RESEARCHER_NETWORK" &> /dev/null; then + docker network create "$RESEARCHER_NETWORK" \ + --driver bridge + log_info "Created network: $RESEARCHER_NETWORK" + else + log_info "Network $RESEARCHER_NETWORK already exists" + fi +} + +teardown_networks() { + log_info "Tearing down Docker networks..." + + docker network rm "$BUILDER_NETWORK" 2>/dev/null && \ + log_info "Removed network: $BUILDER_NETWORK" || \ + log_info "Network $BUILDER_NETWORK not found or in use" + + docker network rm "$RESEARCHER_NETWORK" 2>/dev/null && \ + log_info "Removed network: $RESEARCHER_NETWORK" || \ + log_info "Network $RESEARCHER_NETWORK not found or in use" +} diff --git a/parallel/ralph-parallel.sh b/parallel/ralph-parallel.sh new file mode 100755 index 00000000..370db890 --- /dev/null +++ b/parallel/ralph-parallel.sh @@ -0,0 +1,612 @@ +#!/usr/bin/env bash +set -euo pipefail +# +# ralph-parallel.sh — Parallel mode orchestrator for Ralph. +# +# Launches N containerized Claude Code agents that work on prd.json stories +# simultaneously. Each agent runs in a Docker container with network restrictions, +# resource limits, and no host access. 
+# +# Usage: ./parallel/ralph-parallel.sh [options] [max_iterations] +# +# Options: +# --project DIR Project directory containing prd.json (default: current dir) +# --image IMAGE Custom Docker image (default: ralph-agent:latest, auto-built) +# --agents N Number of builder agents (default: 2) +# --researcher N Number of researcher agents with full internet (default: 0) +# --model MODEL Claude model to use (default: claude-sonnet-4-5-20250929) +# --memory SIZE Per-container memory limit (default: 4g) +# --cpus N Per-container CPU limit (default: 2) +# + +SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)" +RALPH_ROOT="$(cd "$SCRIPT_DIR/.." && pwd)" + +# Source library scripts +source "$SCRIPT_DIR/lib/logging.sh" +source "$SCRIPT_DIR/lib/auth.sh" +source "$SCRIPT_DIR/lib/network-setup.sh" +source "$SCRIPT_DIR/lib/docker-helpers.sh" + +# --- Defaults --- +NUM_BUILDERS=2 +NUM_RESEARCHERS=0 +NUM_VERIFIERS=0 +CLAUDE_MODEL="claude-sonnet-4-5-20250929" +CONTAINER_MEMORY="4g" +CONTAINER_CPUS="2" +MAX_ITERATIONS=0 +STALE_CLAIM_MINUTES=30 +PROJECT_DIR="" +CUSTOM_IMAGE="" +declare -a EXTRA_DOMAINS=() + +# --- Parse arguments --- +while [[ $# -gt 0 ]]; do + case $1 in + --project) + PROJECT_DIR="$2" + shift 2 + ;; + --project=*) + PROJECT_DIR="${1#*=}" + shift + ;; + --image) + CUSTOM_IMAGE="$2" + shift 2 + ;; + --image=*) + CUSTOM_IMAGE="${1#*=}" + shift + ;; + --agents) + NUM_BUILDERS="$2" + shift 2 + ;; + --agents=*) + NUM_BUILDERS="${1#*=}" + shift + ;; + --researcher) + NUM_RESEARCHERS="$2" + shift 2 + ;; + --researcher=*) + NUM_RESEARCHERS="${1#*=}" + shift + ;; + --verifier) + NUM_VERIFIERS="$2" + shift 2 + ;; + --verifier=*) + NUM_VERIFIERS="${1#*=}" + shift + ;; + --model) + CLAUDE_MODEL="$2" + shift 2 + ;; + --model=*) + CLAUDE_MODEL="${1#*=}" + shift + ;; + --memory) + CONTAINER_MEMORY="$2" + shift 2 + ;; + --memory=*) + CONTAINER_MEMORY="${1#*=}" + shift + ;; + --cpus) + CONTAINER_CPUS="$2" + shift 2 + ;; + --cpus=*) + CONTAINER_CPUS="${1#*=}" + shift + ;; + 
--allow-domain) + EXTRA_DOMAINS+=("$2") + shift 2 + ;; + --allow-domain=*) + EXTRA_DOMAINS+=("${1#*=}") + shift + ;; + -h|--help) + echo "Usage: $0 [options] [max_iterations]" + echo "" + echo "Options:" + echo " --project DIR Project directory with prd.json (default: current dir)" + echo " --image IMAGE Custom Docker image (default: ralph-agent:latest)" + echo " --agents N Number of builder agents (default: 2)" + echo " --researcher N Number of researcher agents (default: 0)" + echo " --verifier N Number of verifier agents (default: 0)" + echo " --model MODEL Claude model (default: claude-sonnet-4-5-20250929)" + echo " --memory SIZE Per-container memory limit (default: 4g)" + echo " --cpus N Per-container CPU limit (default: 2)" + echo " --allow-domain D Extra domain to whitelist in firewall (repeatable)" + echo "" + echo "Arguments:" + echo " max_iterations Per-agent iteration cap (default: 0 = until PRD complete)" + exit 0 + ;; + *) + if [[ "$1" =~ ^[0-9]+$ ]]; then + MAX_ITERATIONS="$1" + else + log_error "Unknown option: $1" + exit 1 + fi + shift + ;; + esac +done + +TOTAL_AGENTS=$((NUM_BUILDERS + NUM_RESEARCHERS + NUM_VERIFIERS)) + +if [ "$TOTAL_AGENTS" -eq 0 ]; then + log_error "No agents configured. Use --agents N and/or --researcher N." + exit 1 +fi + +# --- Validate project directory --- +# Default to current working directory if --project not specified +if [ -z "$PROJECT_DIR" ]; then + PROJECT_DIR="$(pwd)" +fi +# Resolve to absolute path +PROJECT_DIR="$(cd "$PROJECT_DIR" && pwd)" +PRD_FILE="$PROJECT_DIR/prd.json" +# CLAUDE-parallel.md lives in the ralph repo, not the project +PARALLEL_PROMPT="$SCRIPT_DIR/CLAUDE-parallel.md" +VERIFIER_PROMPT="$SCRIPT_DIR/CLAUDE-verifier.md" + +if [ ! -f "$PRD_FILE" ]; then + log_error "No prd.json found in $PROJECT_DIR" + log_error "Create a prd.json first (see prd.json.example)." + exit 1 +fi + +if [ ! 
-f "$PARALLEL_PROMPT" ]; then + log_error "Missing parallel/CLAUDE-parallel.md prompt file" + exit 1 +fi + +if [ "$NUM_VERIFIERS" -gt 0 ] && [ ! -f "$VERIFIER_PROMPT" ]; then + log_error "Missing parallel/CLAUDE-verifier.md prompt file (required when --verifier > 0)" + exit 1 +fi + +if [ ! -d "$PROJECT_DIR/.git" ]; then + log_error "$PROJECT_DIR is not a git repository" + exit 1 +fi + +# --- Display config --- +PROJECT_NAME=$(jq -r '.project // "unknown"' "$PRD_FILE" 2>/dev/null || echo "unknown") +BRANCH_NAME=$(jq -r '.branchName // empty' "$PRD_FILE" 2>/dev/null || echo "") +TOTAL_STORIES=$(jq '.userStories | length' "$PRD_FILE" 2>/dev/null || echo "?") +DONE_STORIES=$(jq '[.userStories[] | select(.passes == true)] | length' "$PRD_FILE" 2>/dev/null || echo "?") + +log_info "Ralph Parallel Mode" +log_info "====================" +log_info "Project: $PROJECT_NAME" +log_info "Branch: ${BRANCH_NAME:-}" +log_info "Stories: $DONE_STORIES/$TOTAL_STORIES complete" +log_info "Agents: $NUM_BUILDERS builders, $NUM_RESEARCHERS researchers, $NUM_VERIFIERS verifiers ($TOTAL_AGENTS total)" +log_info "Image: ${CUSTOM_IMAGE:-$RALPH_IMAGE (default)}" +log_info "Model: $CLAUDE_MODEL" +log_info "Memory: $CONTAINER_MEMORY per container" +log_info "CPUs: $CONTAINER_CPUS per container" +log_info "Max iterations: $MAX_ITERATIONS (0=until PRD complete)" +if [ ${#EXTRA_DOMAINS[@]} -gt 0 ]; then + RALPH_EXTRA_DOMAINS=$(IFS=,; echo "${EXTRA_DOMAINS[*]}") + export RALPH_EXTRA_DOMAINS + log_info "Extra domains: $RALPH_EXTRA_DOMAINS" +else + RALPH_EXTRA_DOMAINS="" +fi +echo "" + +# --- Step 1: Build or verify Docker image --- +PROJECT_DOCKERFILE="$PROJECT_DIR/Dockerfile.ralph" + +if [ -n "$CUSTOM_IMAGE" ]; then + # Explicit --image flag takes priority + export RALPH_IMAGE="$CUSTOM_IMAGE" + log_info "Using custom image: $RALPH_IMAGE" + if ! docker image inspect "$RALPH_IMAGE" &> /dev/null; then + log_error "Custom image '$RALPH_IMAGE' not found. Build it first." 
+ exit 1 + fi +elif [ -f "$PROJECT_DOCKERFILE" ]; then + # Project has a Dockerfile.ralph — build a project-specific image + # Tag includes project name to avoid collisions between projects + PROJECT_IMAGE_TAG="ralph-agent-${PROJECT_NAME}:latest" + export RALPH_IMAGE="$PROJECT_IMAGE_TAG" + log_info "Found Dockerfile.ralph — building project image: $RALPH_IMAGE" + + # Always ensure the base image exists first + if ! docker image inspect "ralph-agent:latest" &> /dev/null; then + log_info "Building base image first..." + build_image "$RALPH_ROOT/docker" + fi + + build_image "$PROJECT_DIR" "$PROJECT_DOCKERFILE" +else + log_info "Checking Docker image..." + if ! docker image inspect "$RALPH_IMAGE" &> /dev/null; then + build_image "$RALPH_ROOT/docker" + else + log_info "Image $RALPH_IMAGE already exists. Use 'docker rmi $RALPH_IMAGE' to force rebuild." + fi +fi + +# --- Step 2: Create Docker networks --- +create_networks + +# --- Step 3: Verify Claude auth volume --- +log_info "Checking Claude auth volume..." +if ! check_auth_volume; then + exit 1 +fi +log_info "Claude auth volume verified" + +# --- Step 4: Create bare repo for agent coordination --- +# Agents need a shared bare repo to push to — you can't reliably push +# to a non-bare repo's checked-out branch. We create .ralph/repo.git +# as a bare clone of the project, and agents push/pull from this. +BARE_REPO="$PROJECT_DIR/.ralph/repo.git" +if [ ! -d "$BARE_REPO" ]; then + log_info "Creating bare repo for agent coordination..." + mkdir -p "$PROJECT_DIR/.ralph" + git clone --bare --filter=blob:none "file://$PROJECT_DIR" "$BARE_REPO" + log_info "Bare repo created at $BARE_REPO" +else + # Update the bare repo from the working directory + log_info "Updating bare repo from project..." + cd "$PROJECT_DIR" + git push --force "$BARE_REPO" --all 2>&1 || { + log_error "Failed to sync bare repo from project. Manual intervention required." 
+ exit 1 + } + # Clean stale feature branches from bare repo to prevent agents from seeing old state + log_info "Cleaning stale branches from bare repo..." + git --git-dir="$BARE_REPO" for-each-ref --format='%(refname:short)' refs/heads/ | \ + grep -Ev '^(main|master)$' | \ + xargs -I{} git --git-dir="$BARE_REPO" branch -D {} 2>/dev/null || true + cd - > /dev/null +fi + +# Clear any previous stop signal (truncate to empty; file is kept for Docker bind-mount) +: > "$PROJECT_DIR/.ralph/stop_requested" + +# --- Step 6: Launch agent containers --- +AGENT_NUM=0 +declare -a CONTAINER_NAMES=() + +# Load friend identities for git author spoofing (optional) +FRIENDS_FILE="$PROJECT_DIR/.ralph/friends.json" +declare -a FRIEND_NAMES=() +declare -a FRIEND_EMAILS=() +if [ -f "$FRIENDS_FILE" ]; then + while IFS= read -r name; do + FRIEND_NAMES+=("$name") + done < <(jq -r '.[].name' "$FRIENDS_FILE") + while IFS= read -r email; do + FRIEND_EMAILS+=("$email") + done < <(jq -r '.[].email' "$FRIENDS_FILE") + log_info "Loaded ${#FRIEND_NAMES[@]} friend identities from friends.json" +fi + +launch_agents_for_role() { + local role="$1" + local count="$2" + + [ "$count" -le 0 ] && return + + for i in $(seq 1 "$count"); do + AGENT_NUM=$((AGENT_NUM + 1)) + local agent_id="agent-${AGENT_NUM}" + local container_name="ralph-${agent_id}" + + # Assign friend identity if available, otherwise use default agent identity + local git_author_name="" + local git_author_email="" + local idx=$((AGENT_NUM - 1)) + if [ ${#FRIEND_NAMES[@]} -gt 0 ] && [ "$idx" -lt ${#FRIEND_NAMES[@]} ]; then + git_author_name="${FRIEND_NAMES[$idx]}" + git_author_email="${FRIEND_EMAILS[$idx]}" + log_info "Agent $agent_id will commit as: $git_author_name <$git_author_email>" + fi + + # Stop existing container with same name if present + if docker inspect "$container_name" &> /dev/null; then + log_warn "Container $container_name already exists. Removing." 
+ stop_agent "$container_name" 10 + fi + + launch_agent \ + "$agent_id" \ + "$role" \ + "$PROJECT_DIR" \ + "$CLAUDE_MODEL" \ + "$MAX_ITERATIONS" \ + "$CONTAINER_MEMORY" \ + "$CONTAINER_CPUS" \ + "$git_author_name" \ + "$git_author_email" + + CONTAINER_NAMES+=("$container_name") + done +} + +log_info "Launching agents..." +launch_agents_for_role "builder" "$NUM_BUILDERS" +launch_agents_for_role "researcher" "$NUM_RESEARCHERS" +launch_agents_for_role "verifier" "$NUM_VERIFIERS" + +log_info "All $TOTAL_AGENTS agents launched." +echo "" + +# --- Step 6: Monitor loop --- +MONITOR_INTERVAL=30 +log_info "Entering monitor loop (checking every ${MONITOR_INTERVAL}s)." +log_info "Use ./parallel/stop.sh to stop." +log_info "Use ./parallel/status.sh to check status." +echo "" + +# Helper: read a file from the bare repo without a full checkout +read_from_bare_repo() { + local file="$1" + local branch="${2:-main}" + git --git-dir="$BARE_REPO" show "${branch}:${file}" 2>/dev/null \ + || git --git-dir="$BARE_REPO" show "master:${file}" 2>/dev/null \ + || echo "" +} + +# Helper: check if all stories are complete in the bare repo +# When verifiers are in use, requires passes AND verified for all stories. 
+check_all_stories_complete() { + local prd_content + prd_content=$(read_from_bare_repo "prd.json" "$BRANCH_NAME") + [ -z "$prd_content" ] && return 1 + + if [ "$NUM_VERIFIERS" -gt 0 ]; then + # Require both passes and verified + local incomplete + incomplete=$(echo "$prd_content" | jq '[.userStories[] | select(.passes == false or .verified != true)] | length' 2>/dev/null || echo "1") + [ "$incomplete" -eq 0 ] + else + # Passes-only (backward compat) + local incomplete + incomplete=$(echo "$prd_content" | jq '[.userStories[] | select(.passes == false)] | length' 2>/dev/null || echo "1") + [ "$incomplete" -eq 0 ] + fi +} + +recover_stale_claims() { + # Read prd.json from the bare repo (agents push there, not to project dir) + local prd_content + prd_content=$(read_from_bare_repo "prd.json" "$BRANCH_NAME") + [ -z "$prd_content" ] && return + + local now_epoch + now_epoch=$(date +%s) + local stale_seconds=$((STALE_CLAIM_MINUTES * 60)) + + local claims + claims=$(echo "$prd_content" | jq -r ' + .userStories[] + | select(.passes == false and .claimed_by != null and .claimed_by != "") + | "\(.id)|\(.claimed_by)|\(.claimed_at // "")" + ' 2>/dev/null || echo "") + + [ -z "$claims" ] && return + + local cleared=false + local updated_prd="$prd_content" + while IFS='|' read -r story_id agent claimed_at; do + [ -z "$story_id" ] && continue + [ -z "$claimed_at" ] && continue + + # Parse claimed_at timestamp (GNU date -d first, macOS date -j fallback) + local claimed_epoch + claimed_epoch=$(date -d "$claimed_at" +%s 2>/dev/null \ + || date -j -f "%Y-%m-%dT%H:%M:%SZ" "$claimed_at" +%s 2>/dev/null \ + || echo "0") + + if [ "$claimed_epoch" -eq 0 ]; then + continue + fi + + local age=$((now_epoch - claimed_epoch)) + if [ "$age" -gt "$stale_seconds" ]; then + local container_name="ralph-${agent}" + if ! is_agent_running "$container_name"; then + log_warn "Stale claim detected: $story_id by $agent (${age}s old, container not running). Clearing." 
+ updated_prd=$(echo "$updated_prd" | jq --arg sid "$story_id" ' + .userStories |= map( + if .id == $sid then + del(.claimed_by) | del(.claimed_at) + else . end + ) + ') + cleared=true + fi + fi + done <<< "$claims" + + if $cleared; then + # Commit the cleared claims to the bare repo via a temp checkout + local temp_dir + temp_dir=$(mktemp -d) + git clone "$BARE_REPO" "$temp_dir/work" 2>/dev/null + cd "$temp_dir/work" + git config user.name "ralph-orchestrator" + git config user.email "orchestrator@ralph-agent.local" + if [ -n "$BRANCH_NAME" ]; then + git checkout "$BRANCH_NAME" 2>/dev/null || true + fi + echo "$updated_prd" | jq '.' > prd.json + git add prd.json + git commit -m "[orchestrator] Clear stale claims" 2>/dev/null || true + git push origin 2>/dev/null || true + cd - > /dev/null + rm -rf "$temp_dir" + fi +} + +recover_stale_verification_claims() { + [ "$NUM_VERIFIERS" -eq 0 ] && return + + local prd_content + prd_content=$(read_from_bare_repo "prd.json" "$BRANCH_NAME") + [ -z "$prd_content" ] && return + + local now_epoch + now_epoch=$(date +%s) + local stale_seconds=$((STALE_CLAIM_MINUTES * 60)) + + local vclaims + vclaims=$(echo "$prd_content" | jq -r ' + .userStories[] + | select(.passes == true and .verified != true and .verified_by != null and .verified_by != "") + | "\(.id)|\(.verified_by)|\(.verified_at // "")" + ' 2>/dev/null || echo "") + + [ -z "$vclaims" ] && return + + local cleared=false + local updated_prd="$prd_content" + while IFS='|' read -r story_id agent verified_at; do + [ -z "$story_id" ] && continue + [ -z "$verified_at" ] && continue + + local claimed_epoch + claimed_epoch=$(date -d "$verified_at" +%s 2>/dev/null \ + || date -j -f "%Y-%m-%dT%H:%M:%SZ" "$verified_at" +%s 2>/dev/null \ + || echo "0") + + if [ "$claimed_epoch" -eq 0 ]; then + continue + fi + + local age=$((now_epoch - claimed_epoch)) + if [ "$age" -gt "$stale_seconds" ]; then + local container_name="ralph-${agent}" + if ! 
is_agent_running "$container_name"; then + log_warn "Stale verification claim: $story_id by $agent (${age}s old, container not running). Clearing." + updated_prd=$(echo "$updated_prd" | jq --arg sid "$story_id" ' + .userStories |= map( + if .id == $sid then + .verified_by = null | .verified_at = null + else . end + ) + ') + cleared=true + fi + fi + done <<< "$vclaims" + + if $cleared; then + local temp_dir + temp_dir=$(mktemp -d) + git clone "$BARE_REPO" "$temp_dir/work" 2>/dev/null + cd "$temp_dir/work" + git config user.name "ralph-orchestrator" + git config user.email "orchestrator@ralph-agent.local" + if [ -n "$BRANCH_NAME" ]; then + git checkout "$BRANCH_NAME" 2>/dev/null || true + fi + echo "$updated_prd" | jq '.' > prd.json + git add prd.json + git commit -m "[orchestrator] Clear stale verification claims" 2>/dev/null || true + git push origin 2>/dev/null || true + cd - > /dev/null + rm -rf "$temp_dir" + fi +} + +while true; do + sleep "$MONITOR_INTERVAL" + + # Check if stop was requested + if [ -s "$PROJECT_DIR/.ralph/stop_requested" ]; then + log_info "Stop requested. Shutting down all agents..." + for name in "${CONTAINER_NAMES[@]}"; do + stop_agent "$name" 30 + done + teardown_networks + log_info "All agents stopped. Exiting." + exit 0 + fi + + # Recover stale claims + recover_stale_claims + recover_stale_verification_claims + + # Check container health + ALL_STOPPED=true + for name in "${CONTAINER_NAMES[@]}"; do + if is_agent_running "$name"; then + ALL_STOPPED=false + else + EXIT_CODE=$(docker inspect -f '{{.State.ExitCode}}' "$name" 2>/dev/null || echo "unknown") + + if [ "$EXIT_CODE" = "0" ]; then + log_info "Container $name exited cleanly (code 0)." + elif [ "$EXIT_CODE" = "2" ]; then + log_error "Container $name exited due to auth failure (code 2). Credentials may be expired." + log_error "Halting all agents." 
+ echo "auth_failure" > "$PROJECT_DIR/.ralph/stop_requested" + for stop_name in "${CONTAINER_NAMES[@]}"; do + [ "$stop_name" = "$name" ] && continue + stop_agent "$stop_name" 10 + done + teardown_networks + log_error "All agents stopped. Refresh your Claude credentials and re-run." + exit 1 + else + log_warn "Container $name stopped unexpectedly (exit code: $EXIT_CODE). Restarting..." + restart_agent "$name" + if is_agent_running "$name"; then + ALL_STOPPED=false + else + log_error "Failed to restart $name" + fi + fi + fi + done + + # Check if all stories are complete (read from bare repo) + if check_all_stories_complete; then + log_info "All PRD stories are complete!" + log_info "Shutting down agents..." + for name in "${CONTAINER_NAMES[@]}"; do + stop_agent "$name" 15 + done + # Sync bare repo back to project working directory + log_info "Syncing results back to project..." + cd "$PROJECT_DIR" + git fetch "$BARE_REPO" 2>/dev/null || true + git merge FETCH_HEAD 2>/dev/null || true + teardown_networks + log_info "Done. All stories passed." + exit 0 + fi + + if $ALL_STOPPED; then + log_info "All agents have exited. Cleaning up..." + for name in "${CONTAINER_NAMES[@]}"; do + docker rm "$name" 2>/dev/null || true + done + teardown_networks + log_info "Done." + exit 0 + fi +done diff --git a/parallel/status.sh b/parallel/status.sh new file mode 100755 index 00000000..924832b3 --- /dev/null +++ b/parallel/status.sh @@ -0,0 +1,148 @@ +#!/usr/bin/env bash +set -euo pipefail +# +# status.sh — Show status of Ralph parallel agents and PRD stories. +# +# Usage: ./parallel/status.sh [--project DIR] +# + +SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)" +RALPH_ROOT="$(cd "$SCRIPT_DIR/.." 
&& pwd)" + +source "$SCRIPT_DIR/lib/logging.sh" + +PROJECT_DIR="" +while [[ $# -gt 0 ]]; do + case $1 in + --project) PROJECT_DIR="$2"; shift 2 ;; + --project=*) PROJECT_DIR="${1#*=}"; shift ;; + *) shift ;; + esac +done +if [ -z "$PROJECT_DIR" ]; then + PROJECT_DIR="$(pwd)" +fi +PROJECT_DIR="$(cd "$PROJECT_DIR" && pwd)" +PRD_FILE="$PROJECT_DIR/prd.json" +BARE_REPO="$PROJECT_DIR/.ralph/repo.git" + +# Helper: read file from bare repo +read_from_bare_repo() { + local file="$1" + git --git-dir="$BARE_REPO" show "HEAD:${file}" 2>/dev/null || echo "" +} + +# Load project info — prefer bare repo (has latest agent pushes), fallback to working dir +PROJECT_NAME="unknown" +PRD_CONTENT="" +if [ -d "$BARE_REPO" ]; then + PRD_CONTENT=$(read_from_bare_repo "prd.json") +fi +if [ -z "$PRD_CONTENT" ] && [ -f "$PRD_FILE" ]; then + PRD_CONTENT=$(cat "$PRD_FILE") +fi +if [ -n "$PRD_CONTENT" ]; then + PROJECT_NAME=$(echo "$PRD_CONTENT" | jq -r '.project // "unknown"' 2>/dev/null || echo "unknown") +fi + +echo "========================================" +echo " Ralph Parallel Status: $PROJECT_NAME" +echo "========================================" +echo "" + +# --- Stop signal check --- +if [ -s "$PROJECT_DIR/.ralph/stop_requested" ]; then + echo "** STOP REQUESTED -- agents will exit after current iteration **" + echo "" +fi + +# --- Container Status --- +echo "--- Containers ---" +CONTAINERS=$(docker ps -a --filter "name=ralph-agent-" --format "table {{.Names}}\t{{.Status}}\t{{.RunningFor}}" 2>/dev/null || true) +if [ -n "$CONTAINERS" ]; then + echo "$CONTAINERS" +else + echo "No Ralph containers found." 
+fi +echo "" + +# --- Story Board from prd.json --- +echo "--- Story Board ---" +if [ -n "$PRD_CONTENT" ]; then + # Available stories (passes: false, no claim) + echo "Available:" + AVAILABLE=$(echo "$PRD_CONTENT" | jq -r ' + .userStories[] + | select(.passes == false and (.claimed_by == null or .claimed_by == "")) + | " [ ] \(.id): \(.title) (priority: \(.priority))" + ' 2>/dev/null || echo "") + if [ -n "$AVAILABLE" ]; then + echo "$AVAILABLE" + else + echo " (none)" + fi + + # Claimed stories (passes: false, has claim) + echo "Claimed:" + CLAIMED=$(echo "$PRD_CONTENT" | jq -r ' + .userStories[] + | select(.passes == false and .claimed_by != null and .claimed_by != "") + | " [~] \(.id): \(.title) (by \(.claimed_by) at \(.claimed_at // "?"))" + ' 2>/dev/null || echo "") + if [ -n "$CLAIMED" ]; then + echo "$CLAIMED" + else + echo " (none)" + fi + + # Complete stories (passes: true) + echo "Done:" + DONE=$(echo "$PRD_CONTENT" | jq -r ' + .userStories[] + | select(.passes == true) + | " [x] \(.id): \(.title)" + ' 2>/dev/null || echo "") + if [ -n "$DONE" ]; then + echo "$DONE" + else + echo " (none)" + fi + + # Summary + TOTAL=$(echo "$PRD_CONTENT" | jq '.userStories | length' 2>/dev/null || echo "?") + DONE_COUNT=$(echo "$PRD_CONTENT" | jq '[.userStories[] | select(.passes == true)] | length' 2>/dev/null || echo "?") + echo "" + echo "Progress: $DONE_COUNT/$TOTAL stories complete" +else + echo "No prd.json found." +fi +echo "" + +# --- Recent Logs --- +echo "--- Recent Logs (last 10 lines, 3 most recent) ---" +LOG_DIR="$PROJECT_DIR/agent_logs" +if [ -d "$LOG_DIR" ]; then + LATEST_LOGS=$(ls -t "$LOG_DIR"/*.log 2>/dev/null | head -3) + if [ -n "$LATEST_LOGS" ]; then + for logfile in $LATEST_LOGS; do + echo " $(basename "$logfile"):" + tail -n 10 "$logfile" | sed 's/^/ /' + echo "" + done + else + echo " No log files yet." + fi +else + echo " No log directory found." 
+fi + +# --- Git Log --- +echo "--- Recent Commits ---" +if [ -d "$BARE_REPO" ]; then + git --git-dir="$BARE_REPO" log --oneline -10 2>/dev/null || echo " No commits yet." +elif [ -d "$PROJECT_DIR/.git" ]; then + git -C "$PROJECT_DIR" log --oneline -10 2>/dev/null || echo " No commits yet." +else + echo " Not a git repository." +fi +echo "" diff --git a/parallel/stop.sh b/parallel/stop.sh new file mode 100755 index 00000000..bfe23084 --- /dev/null +++ b/parallel/stop.sh @@ -0,0 +1,71 @@ +#!/usr/bin/env bash +set -euo pipefail +# +# stop.sh — Graceful shutdown of Ralph parallel agents. +# +# Usage: ./parallel/stop.sh [--project DIR] +# +# Creates a stop_requested file that agents check each iteration. +# Waits up to 120s for graceful exit, then force-kills remaining containers. +# + +SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)" +RALPH_ROOT="$(cd "$SCRIPT_DIR/.." && pwd)" + +source "$SCRIPT_DIR/lib/logging.sh" + +PROJECT_DIR="" +while [[ $# -gt 0 ]]; do + case $1 in + --project) PROJECT_DIR="$2"; shift 2 ;; + --project=*) PROJECT_DIR="${1#*=}"; shift ;; + *) shift ;; + esac +done +if [ -z "$PROJECT_DIR" ]; then + PROJECT_DIR="$(pwd)" +fi +PROJECT_DIR="$(cd "$PROJECT_DIR" && pwd)" + +# --- Signal stop --- +log_info "Requesting graceful stop for Ralph parallel agents..." +mkdir -p "$PROJECT_DIR/.ralph" +echo "stop" > "$PROJECT_DIR/.ralph/stop_requested" + +# --- Wait for containers to stop --- +TIMEOUT=120 +ELAPSED=0 +CHECK_INTERVAL=5 + +log_info "Waiting up to ${TIMEOUT}s for agents to finish current iteration..." + +while [ $ELAPSED -lt $TIMEOUT ]; do + RUNNING=$(docker ps --filter "name=ralph-agent-" --format "{{.Names}}" 2>/dev/null || true) + + if [ -z "$RUNNING" ]; then + log_info "All agents stopped gracefully." 
+ rm -f "$PROJECT_DIR/.ralph/stop_requested" + exit 0 + fi + + RUNNING_COUNT=$(echo "$RUNNING" | wc -l | tr -d ' ') + log_info "Still running: $RUNNING_COUNT containers ($ELAPSED/${TIMEOUT}s)" + + sleep "$CHECK_INTERVAL" + ELAPSED=$((ELAPSED + CHECK_INTERVAL)) +done + +# --- Force kill remaining containers --- +REMAINING=$(docker ps --filter "name=ralph-agent-" --format "{{.Names}}" 2>/dev/null || true) + +if [ -n "$REMAINING" ]; then + log_warn "Timeout reached. Force-stopping remaining containers..." + for name in $REMAINING; do + log_warn "Force-stopping: $name" + docker kill "$name" 2>/dev/null || true + docker rm "$name" 2>/dev/null || true + done +fi + +rm -f "$PROJECT_DIR/.ralph/stop_requested" +log_info "Shutdown complete." diff --git a/prd.json.example b/prd.json.example index fbc40668..41a0326a 100644 --- a/prd.json.example +++ b/prd.json.example @@ -26,6 +26,7 @@ "Typecheck passes", "Verify in browser using dev-browser skill" ], + "dependsOn": ["US-001"], "priority": 2, "passes": false, "notes": "" @@ -41,6 +42,7 @@ "Typecheck passes", "Verify in browser using dev-browser skill" ], + "dependsOn": ["US-001"], "priority": 3, "passes": false, "notes": "" @@ -56,6 +58,7 @@ "Typecheck passes", "Verify in browser using dev-browser skill" ], + "dependsOn": ["US-002", "US-003"], "priority": 4, "passes": false, "notes": "" diff --git a/skills/ralph/SKILL.md b/skills/ralph/SKILL.md index e402ab8d..5f90c9d9 100644 --- a/skills/ralph/SKILL.md +++ b/skills/ralph/SKILL.md @@ -33,6 +33,7 @@ Take a PRD (markdown file or text) and convert it to `prd.json` in your ralph di "Criterion 2", "Typecheck passes" ], + "dependsOn": [], "priority": 1, "passes": false, "notes": "" @@ -68,11 +69,15 @@ Ralph spawns a fresh Amp instance per iteration with no memory of previous work. Stories execute in priority order. Earlier stories must not depend on later ones. 
+For explicit dependency control, use the `dependsOn` field — an array of story IDs that must have `passes: true` before this story can be claimed. This complements priority ordering by enforcing hard prerequisites, which is especially useful for parallel agents where multiple stories run concurrently. + +**If a story requires another story to be completed first, list the prerequisite IDs in `dependsOn`.** Stories without dependencies use an empty array `[]`. + **Correct order:** 1. Schema/database changes (migrations) 2. Server actions / backend logic -3. UI components that use the backend -4. Dashboard/summary views that aggregate data +3. UI components that use the backend — `dependsOn: ["US-001"]` +4. Dashboard/summary views that aggregate data — `dependsOn: ["US-002", "US-003"]` **Wrong order:** 1. UI component (depends on schema that does not exist yet) @@ -121,9 +126,10 @@ Frontend stories are NOT complete until visually verified. Ralph will use the de 1. **Each user story becomes one JSON entry** 2. **IDs**: Sequential (US-001, US-002, etc.) 3. **Priority**: Based on dependency order, then document order -4. **All stories**: `passes: false` and empty `notes` -5. **branchName**: Derive from feature name, kebab-case, prefixed with `ralph/` -6. **Always add**: "Typecheck passes" to every story's acceptance criteria +4. **dependsOn**: If a story requires another story first, list prerequisite IDs in `dependsOn`. Use `[]` for stories with no prerequisites. +5. **All stories**: `passes: false` and empty `notes` +6. **branchName**: Derive from feature name, kebab-case, prefixed with `ralph/` +7. **Always add**: "Typecheck passes" to every story's acceptance criteria --- @@ -177,6 +183,7 @@ Add ability to mark tasks with different statuses. "Generate and run migration successfully", "Typecheck passes" ], + "dependsOn": [], "priority": 1, "passes": false, "notes": "" @@ -191,6 +198,7 @@ Add ability to mark tasks with different statuses. 
"Typecheck passes", "Verify in browser using dev-browser skill" ], + "dependsOn": ["US-001"], "priority": 2, "passes": false, "notes": "" @@ -206,6 +214,7 @@ Add ability to mark tasks with different statuses. "Typecheck passes", "Verify in browser using dev-browser skill" ], + "dependsOn": ["US-001"], "priority": 3, "passes": false, "notes": "" @@ -220,6 +229,7 @@ Add ability to mark tasks with different statuses. "Typecheck passes", "Verify in browser using dev-browser skill" ], + "dependsOn": ["US-002", "US-003"], "priority": 4, "passes": false, "notes": "" @@ -252,6 +262,7 @@ Before writing prd.json, verify: - [ ] **Previous run archived** (if prd.json exists with different branchName, archive it first) - [ ] Each story is completable in one iteration (small enough) - [ ] Stories are ordered by dependency (schema to backend to UI) +- [ ] Stories with prerequisites have correct `dependsOn` arrays - [ ] Every story has "Typecheck passes" as criterion - [ ] UI stories have "Verify in browser using dev-browser skill" as criterion - [ ] Acceptance criteria are verifiable (not vague)