# Automation Rule Playbooks

LogForge's Alert Engine automation service applies rule definitions to container signals and executes guarded remediation. Use the playbooks below as reference implementations. They map directly onto the Rule Builder panels (`Trigger Condition`, `Rule Scope`, `Actions Chain`, and the real-time `Rule Preview`) and call out when you should visit the global `Advanced Settings` modal for guardrails.

## How the rule builder is organised
- **Trigger Condition**: choose the signal (`Keyword in Logs`, `Container Event`, or `Metric Threshold` for Pro).
- **Timeline windows**: use the `Count` and `Minutes` fields to require N events within M minutes before the trigger fires; leave them blank for immediate alerts.
- **Rule Scope**: pick whether the rule inspects every container or specific containers/groups that you already maintain inside LogForge Core. The Alert Engine only lists groups that still contain monitored containers.
- **Actions Chain**: order remediation steps (`Restart`, `Stop`, `Kill`, `Start`, `Run script`, `Send notification`) and add per-step delays. Script actions are automatically disabled when scope is global.
- **Notifier delivery**: every `Send notification` step posts the alert payload to the Notifier service, which fans it out to the destinations configured in LogForge (email, chat, PagerDuty, webhooks, etc.).
- **Advanced Settings**: open from the `Alert Rules` toolbar to adjust action cooldowns, backoff, per-hour limits, verification delay, and keyword defaults (case sensitivity, ANY/ALL matching, ignore patterns) for the entire engine.

## Rule 1: Restart Loop Stopper *(Template: Restart Loop Detection)*
- **Template**: Open the `Templates` tab, activate `Restart Loop Detection`, and set the rule name/scope if prompted. No other fields need to change for default behaviour.
- **Trigger Condition**: Container event `start` with `Count >= 3` inside a 5 minute window.
- **Rule Scope**: In the activation modal, choose `All containers` or target one of the LogForge Core groups (for example `Payments Services`); only groups with monitored containers appear. You can revisit scope later from the rule detail.
- **Actions Chain**:
  1. `Stop container` (fires immediately on the offending container).
  2. `Send notification` (Notifier pushes the alert to your configured destinations) with a 30 second delay so the stop action completes first.
- **Advanced Settings**: The template inherits global guardrails (default stop cooldown 3 minutes, per-rule max actions 3/hour). Override those system-wide only if you need more frequent remediation.
- **Operator notes**: For most teams the template defaults suffice-activate it, set the scope, and watch for duplicate-rule warnings before cloning.

## Rule 2: Crash Loop Kill Switch *(Template: Crash Loop Detection)*
- **Template**: Activate `Crash Loop Detection` from the `Templates` tab; adjust only the scope. This template kills instead of stops when a container keeps failing to start.
- **Trigger Condition**: Container event `stop` with `Count >= 5` within 10 minutes.
- **Rule Scope**: Point the rule at the LogForge Core group that owns the fragile workload (for example `Core APIs`); the picker only shows groups with monitored containers.
- **Actions Chain**:
  1. `Kill container` immediately to break the crash storm.
  2. `Send notification` (Notifier delivers the kill context to the configured destinations) after a 60 second delay.
- **Advanced Settings**: Defaults enforce a 5 minute kill cooldown and a 30 second verification delay. Increase the cooldown globally if the same app repeatedly relaunches under investigation.
- **Operator notes**: To tweak timelines or swap the kill action for a stop, convert the rule to a custom rule (Pro) or build a fresh custom rule in the builder.

## Rule 3: Error Flood Alert *(Custom rule - Keyword trigger)*
- **Trigger Condition**: Select `Keyword in Logs`, enter keywords such as `ERROR` and `panic`, and set `Count >= 5` in `2` minutes. Use the Keyword Settings tab in `Advanced Settings` to flip to case-insensitive matching or add ignore patterns for noisy stack traces.
- **Rule Scope**: Target the specific containers known for noisy logs (multi-select from the monitored container inventory) or one of your LogForge Core groups such as `Frontend Services`.
- **Actions Chain**:
  1. `Send notification` (Notifier pushes the incident payload to the destinations you configured).
  2. Optional `Restart container` with a 60 second delay if rebooting clears the condition.
- **Advanced Settings**: Keep the default restart cooldown (5 minutes) and verification delay (30 seconds); widen them globally if restart storms occur.
- **Operator notes**: Start as notify-only, watch the Alerts dashboard, then layer the restart step with a delay.

## Rule 4: Security Keyword Escalation *(Custom rule - Keyword trigger)*
- **Trigger Condition**: `Keyword in Logs`, enable case-insensitive matching, add `unauthorized`, `failed login`, `AWS_SECRET_ACCESS_KEY`, and require `Count >= 2` within `60` seconds.
- **Rule Scope**: Select the relevant ingress/authentication group that you already maintain in LogForge Core (for example `Edge Gateways`); the Alert Engine surfaces those groups so you can include them while excluding internal tooling.
- **Actions Chain**:
  1. `Send notification` (Notifier delivers to the security destinations configured in LogForge Core / Notifier).
  2. `Send notification` via `Webhook` (Notifier calls the configured webhook endpoint with the formatted alert message).
  3. Optional `Stop container` as a follow-up action after human review.
- **Advanced Settings**: Use the Keywords tab to add ignore patterns for known scanners. Tighten the stop cooldown if security remediations should run only once per hour.
- **Operator notes**: Document the response procedure inside the rule description, note which LogForge Core groups the rule covers, and enable `Require approval before restart` in the Notifier workflow if you automate container stops.

## Rule 5: Post-Start Warmup *(Custom rule - Container event trigger)*
- **Trigger Condition**: `Container event` set to `start` with no additional thresholds.
- **Rule Scope**: Select the release group that needs priming (e.g., `Frontend Services`) from the Core-managed list. Script actions are only available when the scope is container or group.
- **Actions Chain**:
  1. `Delay` 45 seconds using the step-level delay field before remediation starts.
  2. `Run script` to execute the first executable `.sh` in `/logforge-scripts/` inside the triggering container.
  3. `Send notification` (Notifier shares the captured `{{ action.output }}`) so deployers see warmup results.
- **Advanced Settings**: Ensure `run_script` cooldown (default 10 minutes) aligns with your rollout cadence. The gatekeeper also enforces max actions per container/hour.
- **Operator notes**: Provision `/logforge-scripts/` with numbered scripts (for example `01_warmup.sh`) and check script validity via the `Validate scripts` API before enabling the rule.

## Building reliable automation
- Prefer templates when they exist; they already include safe defaults and duplicate detection. Adjust scope first, then customise if needed.
- Keep remediation narrow by scoping rules to the container groups you curate in LogForge Core. The builder prevents wildcard targeting but it is still easy to over-select.
- Use the global `Advanced Settings` modal to align cooldowns, exponential backoff, per-rule/per-container limits, verification delays, and keyword defaults with your operational runbooks.
- Layer remediation gradually: run notification-only rules, review the Alerts dashboard, then append restart/stop/script actions with deliberate delays.
- When you add script actions, confirm `/bin/sh` exists, scripts are executable, and the gatekeeper's verification delay is long enough for the script to finish.
- Keep the Notifier service destinations current so every `Send notification` step lands in the right inboxes, chat channels, or webhooks.