diff --git a/docs/FUNCTIONAL_TESTS.md b/docs/FUNCTIONAL_TESTS.md new file mode 100644 index 0000000..6b8e980 --- /dev/null +++ b/docs/FUNCTIONAL_TESTS.md @@ -0,0 +1,464 @@ +# Functional Test Plan + +Manual (or LLM-assisted) test plan for runqy. Requires a running server, Redis, and PostgreSQL. + +## Prerequisites + +```bash +# 1. Start Redis on localhost:6379 +# 2. Start PostgreSQL +# 3. Configure environment +cd runqy/app +cp .env.secret.sample .env.secret +# Edit .env.secret with credentials + +# 4. Build and start server +go build -o runqy . +./runqy serve + +# 5. Set API key for CLI tests (used with -s and -k flags for remote mode) +export RUNQY_API_KEY="your-api-key" +export RUNQY_SERVER="http://localhost:3000" +``` + +!!! note "CLI remote mode" + Most CLI commands operate in **remote mode** and require `-s $RUNQY_SERVER -k $RUNQY_API_KEY` flags (or the `./runqy login` flow). The examples below omit these flags for brevity. + +!!! note "WebUI authentication" + The monitoring dashboard at `localhost:3000/monitoring` requires login. Navigate to `localhost:3000` first — you will be redirected to a login page. + +--- + +## Section A — CLI + +### A1. Build & Startup + +| # | Command | Expected | +|---|---------|----------| +| 1 | `cd runqy/app && go build -o runqy .` | Build succeeds, binary created | +| 2 | `./runqy serve` | Log shows "Server started on :3000" | + +### A2. Queue Management + +| # | Command | Expected | +|---|---------|----------| +| 1 | `./runqy config create --name testqueue --mode one_shot --startup-cmd "python main.py" --git-url "https://github.com/example/repo"` | Success | +| 2 | `./runqy queue list` | Output contains "testqueue" | +| 3 | `./runqy queue inspect testqueue.default` | Shows pending, active, completed counts | +| 4 | `./runqy queue pause testqueue.default` | Success message | +| 5 | `./runqy queue unpause testqueue.default` | Success message | + +### A3. 
Task Lifecycle + +| # | Command | Expected | +|---|---------|----------| +| 1 | `./runqy task enqueue -q testqueue -p '{"msg":"hello"}'` | Returns task ID | +| 2 | `./runqy task list testqueue.default` | Contains the task | +| 3 | `./runqy task get testqueue.default ` | JSON details | +| 4 | `./runqy task cancel ` | Success | +| 5 | `./runqy task delete ` | Success | + +### A4. Vault Management + +| # | Step | Expected | +|---|------|----------| +| 1 | `./runqy vault list` (without RUNQY_VAULT_MASTER_KEY) | Message with hint `openssl rand -base64 32` | +| 2 | Export `RUNQY_VAULT_MASTER_KEY=$(openssl rand -base64 32)` and restart server | Server starts with vaults enabled | +| 3 | `./runqy vault create myvault` | Success | +| 4 | `./runqy vault list` | Contains "myvault" | +| 5 | `./runqy vault set myvault apikey secret123` | Success | +| 6 | `./runqy vault get myvault apikey` | Outputs "secret123" (local mode only — blocked in remote mode by design) | +| 7 | `./runqy vault entries myvault` | Shows "apikey" with masked value | +| 8 | `./runqy vault unset myvault apikey` | Success | +| 9 | `./runqy vault delete myvault --force` | Success | + +### A5. Input Validation + +| # | Command | Expected Error | +|---|---------|---------------| +| 1 | `./runqy queue inspect ""` | "queue name cannot be empty" | +| 2 | `./runqy task list q --state pendingg` | "invalid state 'pendingg', valid: pending, active, ..." | +| 3 | `./runqy task list q --limit -1` | "--limit must be positive" | +| 4 | `./runqy task list q --limit 0` | "--limit must be positive" | +| 5 | `./runqy task enqueue -q q -p '{}' --timeout -100` | "--timeout must be positive" | +| 6 | `./runqy config create test --mode invalid` | "invalid mode 'invalid', must be 'long_running' or 'one_shot'" | +| 7 | `./runqy vault create ""` | "vault name cannot be empty" | +| 8 | `./runqy vault set "" "" ""` | "vault name cannot be empty" | + +### A6. 
Worker (optional) + +| # | Step | Expected | +|---|------|----------| +| 1 | `cd runqy-worker && go build ./cmd/worker` | Build succeeds | +| 2 | Start worker with config pointing to server | "Worker registered" log | +| 3 | Enqueue a task to the worker's queue | Task processed, result in Redis | + +### A7. Worker Recovery & Auto-Restart + +Requires: a running worker connected to the server, with a `long_running` queue. + +| # | Step | Expected | +|---|------|----------| +| 1 | Start worker with a process that runs stably | Worker registered, heartbeat active | +| 2 | Kill the supervised process (not the worker): `kill ` | Worker logs: "Process exited", then "Restarting process (attempt 1/5)" | +| 3 | Wait for restart | Worker logs: "Process startup detected - service is ready" | +| 4 | Enqueue a task | Task processed successfully (recovery reconnected stdio) | +| 5 | Kill the supervised process 5+ times rapidly | Worker enters degraded state: "Circuit breaker open, max restarts reached" | +| 6 | Check worker heartbeat (API `/api/workers`) | Worker shows `recovery.state: "degraded"` or `"circuit_open"` | +| 7 | Wait for cooldown (or restart worker) | Recovery resets, worker resumes processing | + +### A8. Graceful Shutdown + +| # | Step | Expected | +|---|------|----------| +| 1 | Enqueue a long-running task | Task is "active" | +| 2 | Send `SIGTERM` to worker | Worker logs: "Received SIGTERM, shutting down gracefully" | +| 3 | Worker waits for active task to finish | Task completes before worker exits | +| 4 | Worker deregisters | Worker removed from `/api/workers` | +| 5 | Send `SIGTERM` twice quickly | Worker force-exits on second signal | + +### A9. RetryableError (Python SDK) + +Requires: a Python worker using `runqy-python` SDK. 
+ +| # | Step | Expected | +|---|------|----------| +| 1 | Worker handler raises `RetryableError("temporary failure")` | Task retried (retry count increments) | +| 2 | Worker handler raises regular `Exception("permanent")` | Task fails permanently (no retry) | +| 3 | Worker handler raises `RetryableError` with max retries exhausted | Task moves to failed state | + +### A10. Stdout Protection (Python SDK) + +| # | Step | Expected | +|---|------|----------| +| 1 | Worker handler contains `print("debug output")` | Task still completes (print goes to stderr, not protocol) | +| 2 | Worker handler writes to `sys.stdout` | Output redirected to stderr, protocol unaffected | + +--- + +## Section B — API (curl) + +Runqy exposes **two separate APIs** with different auth mechanisms: + +1. **Main API** (`/queue/*`, `/worker/*`, `/workers/*`, `/api/vaults/*`) — auth via `X-API-Key` header +2. **Monitoring API** (`/monitoring/api/*`) — auth via cookie JWT (setup + login flow) + +### B0. Monitoring Auth Setup + +Before testing monitoring endpoints, set up authentication: + +| # | Command | Expected | +|---|---------|----------| +| 1 | `curl localhost:3000/monitoring/api/auth/status` | `{"authenticated":false,"setup_required":true}` | +| 2 | `curl -X POST localhost:3000/monitoring/api/auth/setup -H 'Content-Type: application/json' -d '{"email":"admin@test.com","password":"password123","confirm_password":"password123"}'` | 200, admin created | +| 3 | `curl -c cookies.txt -X POST localhost:3000/monitoring/api/auth/login -H 'Content-Type: application/json' -d '{"email":"admin@test.com","password":"password123"}'` | 200, `Set-Cookie: runqy_auth=...` | + +!!! note "Auth types" + Main API examples below use `-H "X-API-Key: $RUNQY_API_KEY"`. + Monitoring API examples use `-b cookies.txt` (cookie from B0 login). + +### B1. 
Health & System + +| # | Command | Expected | +|---|---------|----------| +| 1 | `curl localhost:3000/health` | 200 OK | +| 2 | `curl -b cookies.txt localhost:3000/monitoring/api/redis_info` | 200, JSON with Redis version, uptime, clients | +| 3 | `curl -b cookies.txt localhost:3000/monitoring/api/database_info` | 200, JSON with connection info | + +### B2. Queue Endpoints (Monitoring API) + +| # | Command | Expected | +|---|---------|----------| +| 1 | `curl -b cookies.txt localhost:3000/monitoring/api/queues` | 200, JSON object `{"queues":[...]}` | +| 2 | `curl -b cookies.txt localhost:3000/monitoring/api/queues/testqueue.default` | 200, queue details | +| 3 | `curl -X POST -b cookies.txt localhost:3000/monitoring/api/queues/testqueue.default:pause` | 204 No Content | +| 4 | `curl -X POST -b cookies.txt localhost:3000/monitoring/api/queues/testqueue.default:resume` | 204 No Content | + +### B3. Queue Config Endpoints (Monitoring API) + +| # | Command | Expected | +|---|---------|----------| +| 1 | `curl -b cookies.txt localhost:3000/monitoring/api/queue_configs` | 200, list of configs | +| 2 | `curl -b cookies.txt localhost:3000/monitoring/api/queue_configs/testqueue` | 200, config details | +| 3 | `curl -X POST -b cookies.txt localhost:3000/monitoring/api/queue_configs -H 'Content-Type: application/json' -d '{"name":"apitest","mode":"one_shot","startup_cmd":"python main.py","priority":1}'` | 201 | +| 4 | `curl -X DELETE -b cookies.txt localhost:3000/monitoring/api/queue_configs/apitest` | 200 | + +!!! note "Required fields" + Queue config create requires `priority` (minimum 1). + +### B4. 
Task Endpoints + +| # | Command | Expected | +|---|---------|----------| +| 1 | `curl -X POST -H "X-API-Key: $RUNQY_API_KEY" -H 'Content-Type: application/json' localhost:3000/queue/add -d '{"queue":"testqueue","data":{"msg":"curl-test"}}'` | 200, returns `{"info":{"id":"..."},...,"data":{...}}` | +| 2 | `curl -b cookies.txt localhost:3000/monitoring/api/queues/testqueue.default/pending_tasks` | 200, task list | +| 3 | `curl -X DELETE -b cookies.txt localhost:3000/monitoring/api/queues/testqueue.default/pending_tasks/` | 200 | + +!!! note "Task enqueue response" + The task ID is at `info.id` in the response JSON (not a top-level `task_id` field). + +### B5. Worker Endpoints + +| # | Command | Expected | +|---|---------|----------| +| 1 | `curl -H "X-API-Key: $RUNQY_API_KEY" localhost:3000/api/workers` | 200, JSON array | +| 2 | `curl -H "X-API-Key: $RUNQY_API_KEY" localhost:3000/api/servers` | 200, JSON array | +| 3 | `curl -b cookies.txt localhost:3000/monitoring/api/servers` | 200, JSON array (same data via monitoring route) | + +### B6. Vault Endpoints (Main API) + +| # | Command | Expected | +|---|---------|----------| +| 1 | `curl -H "X-API-Key: $RUNQY_API_KEY" localhost:3000/api/vaults` | 200, list | +| 2 | `curl -X POST -H "X-API-Key: $RUNQY_API_KEY" -H 'Content-Type: application/json' localhost:3000/api/vaults -d '{"name":"apivault","description":"test"}'` | 201 | +| 3 | `curl -H "X-API-Key: $RUNQY_API_KEY" localhost:3000/api/vaults/apivault` | 200, vault details | +| 4 | `curl -X POST -H "X-API-Key: $RUNQY_API_KEY" -H 'Content-Type: application/json' localhost:3000/api/vaults/apivault/entries -d '{"key":"secret","value":"val123","is_secret":true}'` | 200 | +| 5 | `curl -X DELETE -H "X-API-Key: $RUNQY_API_KEY" localhost:3000/api/vaults/apivault/entries/secret` | 200 | +| 6 | `curl -X DELETE -H "X-API-Key: $RUNQY_API_KEY" localhost:3000/api/vaults/apivault` | 200 | + +### B7. 
Error Format Consistency + +The main API (`/queue/*`, `/api/vaults/*`) returns errors in array format: + +| # | Command | Expected | +|---|---------|----------| +| 1 | `curl -H "X-API-Key: ..." localhost:3000/api/vaults/nonexistent` | 404, `{"errors":["vault not found"]}` | +| 2 | `curl -X POST -H "X-API-Key: ..." localhost:3000/queue/add -d '{}'` | 400, `{"errors":["queue is required"]}` | +| 3 | Request without auth header on protected endpoint | 401, `{"errors":["access unauthorized"]}` | + +!!! warning "Known inconsistency" + The monitoring API (`/monitoring/api/*`) returns `{"error":"..."}` (singular string) instead of `{"errors":[...]}` (array). The monitoring handler has not yet been migrated to the `bulletproof` error format. + +### B8. Status Codes + +| # | Scenario | Expected Code | +|---|----------|--------------| +| 1 | Queue not found (monitoring API) | 404 | +| 2 | Vault not found (main API) | 404 | +| 3 | Invalid payload | 400 | +| 4 | Missing auth (main API) | 401 | +| 5 | Missing auth (monitoring API) | 401 "Unauthorized" | + +!!! note "Task poll" + `GET /queue/` is a **long-poll** endpoint: it blocks until the task completes or the connection drops. There is no 408 timeout response; the client controls the timeout. + +### B9. Batch Enqueue + +| # | Command | Expected | +|---|---------|----------| +| 1 | `curl -X POST -H "X-API-Key: ..." -H 'Content-Type: application/json' localhost:3000/queue/add-batch -d '{"queue":"testqueue","jobs":[{"data":{"i":1}},{"data":{"i":2}}]}'` | 200, `{"enqueued":2,"failed":0,"task_ids":[...]}` | +| 2 | Empty jobs: `-d '{"queue":"testqueue","jobs":[]}'` | 400 | +| 3 | Missing fields: `-d '{}'` | 400, validation error | + +!!! note "Batch format" + Batch enqueue uses `{"queue":"...", "jobs":[...]}`: a single queue name with an array of job objects. Each job contains `data` and optional fields. This is **not** `{"tasks":[{"queue":...}]}`. + +### B10. 
Auth on Previously Public Routes + +These routes were public before `bulletproof` and now require API key authentication: + +| # | Command | Expected | +|---|---------|----------| +| 1 | `curl localhost:3000/queue/` (no auth) | 401, `{"errors":["access unauthorized"]}` | +| 2 | `curl localhost:3000/workers/config/` (no auth) | 401, `{"errors":["access unauthorized"]}` | +| 3 | Same with `-H "X-API-Key: $RUNQY_API_KEY"` | 200, expected response | + +!!! note "Long-poll caveat" + `GET /queue/` with auth is a long-poll endpoint. It will block until the task is processed by a worker. Use `--max-time` with curl to avoid indefinite hangs. + +### B11. Body Size Limit + +| # | Command | Expected | +|---|---------|----------| +| 1 | `curl -X POST localhost:3000/queue/add -d ' 50MB>'` | 413 or connection reset | +| 2 | Normal-sized payload after large one | 200, server still healthy | + +### B12. Input Validation (API) + +| # | Command | Expected | +|---|---------|----------| +| 1 | `POST /queue/add` with special chars in queue name `{"queue":"test;drop"}` | Returns "queue not found" (no queue name char validation — the name passes through but the config doesn't exist) | +| 2 | `POST /api/vaults/v/entries` with empty key `{"key":"","value":"x"}` | 400, validation error | + +--- + +## Section C — WebUI Monitoring (Playwright) + +Prerequisites: server running on `localhost:3000`, Playwright MCP available, logged in to monitoring dashboard. + +### C1. Navigation & Layout + +| # | Action | Expected | +|---|--------|----------| +| 1 | Navigate to `http://localhost:3000` | Dashboard loads | +| 2 | Sidebar visible | Links: Dashboard, Queues, Workers, Vaults, System, Settings | +| 3 | Click each sidebar link | Corresponding page loads | +| 4 | Toggle sidebar collapse | Sidebar collapses/expands | + +### C2. 
Dashboard (`/`) + +| # | Action | Expected | +|---|--------|----------| +| 1 | Page loads | Stat cards: Queues, Pending, Active, Processed, Failed, Workers | +| 2 | Click "Queues" stat card | Navigates to `/queues` | +| 3 | Click "Workers" stat card | Navigates to `/workers` | +| 4 | Click Refresh button | "Last updated" timestamp updates | +| 5 | If queues exist | Queue cards in grid | +| 6 | Toggle Group/Ungroup | Sub-queues group/ungroup | +| 7 | Click queue card | Navigates to `/queues/` | + +### C3. Queues Page (`/queues`) + +| # | Action | Expected | +|---|--------|----------| +| 1 | Page loads | Queue list displayed | +| 2 | Toggle Cards/Table | View switches | +| 3 | Toggle Grouped/Ungrouped | Display changes | +| 4 | Filter: All, Running, Paused | Queues filtered | +| 5 | Search field | Filters by name | +| 6 | Click "Create Queue" | QueueConfigModal opens | +| 7 | Fill name "playwright-test", mode "one_shot", startup cmd "echo ok" | Fields filled | +| 8 | Click Create | Modal closes, toast success, queue appears | +| 9 | Click Pause on a queue | Badge changes to "Paused" | +| 10 | Click Resume | Badge changes to "Running" | +| 11 | Click Delete | Confirmation dialog appears | +| 12 | Confirm delete | Queue removed, toast success | + +### C4. 
Queue Detail (`/queues/:qname`) + +| # | Action | Expected | +|---|--------|----------| +| 1 | Navigate to existing queue | Queue name and status badge shown | +| 2 | Tabs visible | Active, Pending, Retry, Archived, Completed | +| 3 | Click each tab | Table updates | +| 4 | Click task ID (if tasks exist) | Row expands, payload JSON visible | +| 5 | Click Copy ID | Copied to clipboard | +| 6 | Checkbox selection | Bulk action buttons appear | +| 7 | Select All | All tasks selected | +| 8 | Search field | Filters tasks | +| 9 | Per-task actions | Cancel, Archive, Delete, Run (depending on tab) | +| 10 | Bulk actions | Bulk Delete, Bulk Archive, Bulk Run + confirmation | +| 11 | Batch actions | Delete All, Archive All, Run All + confirmation | +| 12 | Pause/Resume in header | Queue state toggles | + +### C5. Workers Page (`/workers`) + +| # | Action | Expected | +|---|--------|----------| +| 1 | Page loads | Worker list (table or cards) | +| 2 | Toggle Cards/Table | View switches | +| 3 | Filters | All, Processing, Idle, Bootstrapping, Stale, Stopped | +| 4 | If workers exist | Status badge visible, click navigates to detail | +| 5 | If no workers | Empty state displayed | + +### C6. Worker Detail (`/workers/:id`) + +| # | Action | Expected | +|---|--------|----------| +| 1 | Page loads | Worker ID, status badge | +| 2 | Info section | Concurrency, Started, Last beat | +| 3 | Queue badges | Queues listed | +| 4 | Metrics (if available) | CPU %, Memory, Processes | +| 5 | Log section | SSE streaming logs | +| 6 | Toggle "Pin to bottom" | Auto-scroll behavior | +| 7 | Click Back | Returns to `/workers` | + +### C7. 
Vaults Page (`/vaults`) + +| # | Action | Expected | +|---|--------|----------| +| 1 | Without RUNQY_VAULT_MASTER_KEY | "Feature disabled" message | +| 2 | With key configured, page loads | Vault list in grid | +| 3 | Search field | Filters by name | +| 4 | Click "Create Vault" | Modal opens | +| 5 | Fill name "pw-vault", description "test" | Fields filled | +| 6 | Click Create | Vault appears, toast success | +| 7 | Click vault card | Navigates to `/vaults/pw-vault` | + +### C8. Vault Detail (`/vaults/:name`) + +| # | Action | Expected | +|---|--------|----------| +| 1 | Page loads | Name and description shown | +| 2 | Entries table | Empty initially | +| 3 | Click "Add Entry" | EntryModal opens | +| 4 | Fill key "mykey", value "myval", toggle Secret ON | Fields filled | +| 5 | Click Create | Entry appears in table | +| 6 | Secret value display | Masked (****) | +| 7 | Click Edit on entry | Modal pre-filled, key read-only | +| 8 | Modify value, click Update | Table updated | +| 9 | Click Delete on entry | Confirmation dialog, entry removed | +| 10 | Click "Delete Vault" | Confirmation dialog, redirects to `/vaults` | + +### C9. Settings Page (`/settings`) + +| # | Action | Expected | +|---|--------|----------| +| 1 | Click Light theme | Theme changes to light | +| 2 | Click Dark theme | Theme changes to dark | +| 3 | Click System theme | Theme follows system | +| 4 | Toggle Comfortable/Compact | Layout density changes | +| 5 | Change poll interval | Dropdown reflects choice | +| 6 | Sidebar collapse toggle | Sidebar state changes | +| 7 | Click Reset | All settings return to defaults | + +### C10. 
System Page (`/system`) + +| # | Action | Expected | +|---|--------|----------| +| 1 | Redis Status card | Badge: Connected/Disconnected | +| 2 | If connected | Address, Version, Uptime, Clients shown | +| 3 | Database Status card | Badge with connection status | +| 4 | If connected | Type, Host, Database, Connections shown | +| 5 | Redis Memory section | Memory metrics displayed | +| 6 | Raw Redis Info | Collapsible section, click to toggle | + +### C11. Error & Empty States + +| # | Scenario | Expected | +|---|----------|----------| +| 1 | Queues page, no queues | Empty state visible | +| 2 | Workers page, no workers | Empty state visible | +| 3 | Vaults page, no vaults | Empty state with "Create your first vault" CTA | +| 4 | Vault detail, no entries | Empty state visible | + +### C12. Responsive & Toasts + +| # | Action | Expected | +|---|--------|----------| +| 1 | Successful CRUD action | Toast success appears | +| 2 | Failed action | Toast error appears | +| 3 | Click modal backdrop or Cancel | Modal closes | diff --git a/docs/worker/configuration.md b/docs/worker/configuration.md index 3b296ea..30d57ac 100644 --- a/docs/worker/configuration.md +++ b/docs/worker/configuration.md @@ -17,6 +17,11 @@ worker: deployment: dir: "./deployment" use_system_site_packages: true # Set to false for isolated virtualenv + +recovery: + enabled: true # Auto-restart crashed Python processes (default: true) + max_restarts: 5 # Circuit breaker threshold (default: 5) + cooldown_period: 10m # Stable run time to reset counter (default: 10m) ``` ## Configuration Options @@ -43,6 +48,35 @@ deployment: | `dir` | string | Yes | Directory for cloning task code | | `use_system_site_packages` | bool | No | Inherit packages from base Python environment (default: `true`). Set to `false` for isolated virtualenv | +### `recovery` + +Controls auto-recovery when a supervised Python process crashes. Enabled by default. 
+ +| Option | Type | Required | Description | +|--------|------|----------|-------------| +| `enabled` | bool | No | Enable auto-recovery (default: `true`) | +| `max_restarts` | int | No | Max consecutive restarts before entering degraded state (default: `5`) | +| `initial_delay` | duration | No | Delay before the first restart attempt (default: `1s`) | +| `max_delay` | duration | No | Maximum delay between restart attempts (default: `5m`) | +| `backoff_factor` | float | No | Multiplier for exponential backoff between restarts (default: `2.0`) | +| `cooldown_period` | duration | No | Time without crash to reset the failure counter (default: `10m`) | + +```yaml +recovery: + enabled: true + max_restarts: 5 + initial_delay: "1s" + max_delay: "5m" + backoff_factor: 2.0 + cooldown_period: "10m" +``` + +!!! info "How auto-recovery works" + When a Python process crashes, the worker automatically restarts it with exponential backoff. + If the process keeps crashing (reaching `max_restarts` without a stable run), the worker enters + **degraded state** and stops retrying — manual restart is required. If the process runs + successfully for `cooldown_period`, the failure counter resets to zero. 
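The backoff schedule implied by these options can be sketched as follows. This is a minimal illustration, not the worker's actual implementation: it assumes the delay before attempt `n` is `initial_delay * backoff_factor^(n-1)`, capped at `max_delay`, and the function name `restart_delays` is hypothetical.

```python
def restart_delays(max_restarts=5, initial_delay=1.0, max_delay=300.0, backoff_factor=2.0):
    """Return the assumed delay in seconds before each restart attempt.

    Defaults mirror the documented values (1s, 5m, 2.0, 5 restarts).
    The exact formula used by the worker is an assumption.
    """
    delays = []
    delay = initial_delay
    for _ in range(max_restarts):
        # Never wait longer than max_delay, however many attempts were made.
        delays.append(min(delay, max_delay))
        delay *= backoff_factor
    return delays


if __name__ == "__main__":
    print(restart_delays())  # [1.0, 2.0, 4.0, 8.0, 16.0]
```

Under these assumptions the default settings never reach `max_delay`; the cap only matters when `max_restarts` is raised or `initial_delay` is much larger.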
+ ## Environment Variables All configuration values can be set via environment variables, which take priority over `config.yml`: @@ -61,6 +95,11 @@ All configuration values can be set via environment variables, which take priori | `RUNQY_DEPLOYMENT_DIR` | Local deployment directory | `./deployment` | | `RUNQY_USE_SYSTEM_SITE_PACKAGES` | Inherit packages from base Python (`true`/`false`) | `true` | | `RUNQY_MAX_RETRY` | Max task retries | `3` | +| `RUNQY_RECOVERY_ENABLED` | Enable process auto-recovery (`true`/`false`) | `true` | +| `RUNQY_RECOVERY_MAX_RESTARTS` | Max consecutive restarts before degraded state | `5` | +| `RUNQY_RECOVERY_INITIAL_DELAY` | Initial delay before restart attempt | `1s` | +| `RUNQY_RECOVERY_MAX_DELAY` | Maximum backoff delay between restarts | `5m` | +| `RUNQY_RECOVERY_COOLDOWN` | Stable run time to reset failure counter | `10m` | ### Examples @@ -124,3 +163,14 @@ Each worker instance: - Has a concurrency of 1 for the Python process (though the worker can manage queue operations concurrently) To scale, run multiple worker instances. + +## Degraded State + +If the supervised Python process crashes repeatedly and exceeds `max_restarts` without a stable run: + +1. The worker enters **degraded state** and makes no more restart attempts +2. The heartbeat reports `healthy: false` with recovery state `degraded` +3. Tasks are returned to the queue for retry (but will keep failing on this worker) +4. **Manual restart of the worker is required** to recover + +To disable auto-recovery entirely and revert to the old behavior (immediate degraded state on first crash), set `recovery.enabled: false`.
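
The transition into degraded state can be modeled roughly as below. This is a hypothetical sketch for reasoning about the settings, not the worker's code: the class `RecoveryTracker`, its method names, and the choice to degrade on the crash that follows the fifth restart are all assumptions.

```python
class RecoveryTracker:
    """Hypothetical model of the restart counter described above."""

    def __init__(self, max_restarts=5, cooldown_period=600.0, enabled=True):
        self.max_restarts = max_restarts
        self.cooldown_period = cooldown_period  # seconds (10m by default)
        self.enabled = enabled
        self.restarts = 0
        self.state = "healthy"

    def on_crash(self, uptime_seconds):
        """Record a crash and return True if a restart should be attempted."""
        if self.state == "degraded":
            return False  # only a manual worker restart recovers
        if not self.enabled:
            # Old behavior: degraded immediately on the first crash.
            self.state = "degraded"
            return False
        if uptime_seconds >= self.cooldown_period:
            # The process ran stably long enough, so the counter resets.
            self.restarts = 0
        if self.restarts >= self.max_restarts:
            self.state = "degraded"
            return False
        self.restarts += 1
        return True


if __name__ == "__main__":
    tracker = RecoveryTracker()
    attempts = [tracker.on_crash(uptime_seconds=2.0) for _ in range(6)]
    print(attempts, tracker.state)
```

In this model, five rapid crashes each trigger a restart and the sixth moves the tracker to `degraded`, while a crash after at least `cooldown_period` of uptime starts the count again from zero.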