The Alert Engine transforms raw container signals into actionable alerts and safe remediation steps. This guide explains the building blocks, workflows, and best practices for crafting reliable rules.

## Core concepts

| Concept | Description |
|---------|-------------|
| **Rule** | A saved definition that listens for events (logs, status changes, performance thresholds) and executes actions. |
| **Trigger** | One condition per rule (keyword match, metric threshold, or container event) that determines when actions fire. |
| **Scope** | Determines which containers a rule inspects (all, specific labels/groups, or explicit container IDs). |
| **Action** | What happens when the trigger fires: notify, restart, stop, kill, start, or run a script. |
| **Advanced Settings** | The "Advanced Settings" panel in the UI (Gatekeeper & keyword tabs) where cooldowns, verification delays, rate limits, and keyword behaviour are configured. |

## Rule lifecycle

1. **Create**: Start from a template or blank rule within the Alert Engine UI.
2. **Scope**: Select containers/groups. Use include/exclude lists for precision.
3. **Trigger**: Choose a trigger type (keywords, container events, metrics) and configure thresholds.
4. **Actions**: Add one or more actions with optional delays between steps.
5. **Advanced Settings**: Configure cooldowns, max executions, verification delays, backoff, and keyword behaviour.
6. **Activate**: Enable the rule. Evaluations begin immediately.
7. **Review**: Inspect alert history, acknowledgements, and audit logs to tune behavior.

## Trigger types

| Trigger | Description | Example |
|---------|-------------|---------|
| **Log keyword** | Matches one or many substrings (ANY/ALL) in container logs. Optional timeline settings require N matches within M minutes before firing. | Alert when `OutOfMemoryError` appears 3 times in 2 minutes for `backend-*` containers. |
| **Performance metric** *(LogForge Pro)* | Evaluates CPU, memory, or restart counters against a threshold, with optional sustained-time windows. | Trigger when memory usage stays above 85% for 5 minutes. |
| **Container event** | Reacts to lifecycle events emitted by the LogForge backend (`start`, `stop`, `die`, `oom`, etc.), with optional "N events in M minutes" thresholds. | Notify when a database container restarts twice within 10 minutes. |

Each rule supports only one trigger type; create additional rules if you need to combine different signal types.

## Actions

| Action | Details |
|--------|---------|
| **Notify** | Sends the alert payload to one or more channels configured in the Notifier service. Supports templated bodies and includes context (container, rule, timestamps). |
| **Restart / Stop / Start / Kill** | Executes Docker lifecycle operations via the backend. Guardrails stop repeated restarts if the container fails health checks. |
| **Run script** | Executes the first executable `.sh` script found under `/logforge-scripts/` inside the container. Ensure the directory exists, scripts are executable, and a shell (`/bin/sh`) is present. |
| **Delay** | Chain actions with delays to stage responses (e.g., notify immediately, restart after 30 seconds if not acknowledged). |

Each action has additional safeguards:

- **Verification delay**: Wait for a steady state before confirming success.
- **Max executions**: Cap the number of times the action runs within a cooldown window.
- **Cooldown**: Minimum wait before the rule can fire again.

## Templates

The UI includes templates covering common reliability and security cases:

- Crash loop detection
- High memory or CPU usage
- Log spike / noisy errors
- TLS certificate renewal reminder
- Security keyword detection
- Container start/stop notifications

Templates are editable after import. Use them to ensure guardrails are pre-populated.

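
To make the cooldown and max-executions safeguards more concrete, here is a minimal Python sketch of how the two checks might interact. It is not the Alert Engine's implementation: the `Guardrail` class, its parameter names, and the separate counting window (`window_s`) are assumptions made for illustration, and the numbers are arbitrary.

```python
from collections import deque


class Guardrail:
    """Hypothetical guardrail: a cooldown plus a cap on executions per window."""

    def __init__(self, cooldown_s: float, max_executions: int, window_s: float):
        self.cooldown_s = cooldown_s          # minimum gap between executions
        self.max_executions = max_executions  # cap inside the counting window
        self.window_s = window_s              # assumed length of that window
        self._runs = deque()                  # timestamps of allowed executions
        self._last_run = None                 # most recent allowed execution

    def allow(self, now: float) -> bool:
        """Return True and record the run if the action may execute at `now`."""
        # Forget executions that have aged out of the counting window.
        while self._runs and now - self._runs[0] >= self.window_s:
            self._runs.popleft()
        # Cooldown: the previous execution must be at least cooldown_s old.
        if self._last_run is not None and now - self._last_run < self.cooldown_s:
            return False
        # Max executions: refuse once the window already holds the cap.
        if len(self._runs) >= self.max_executions:
            return False
        self._runs.append(now)
        self._last_run = now
        return True


# Illustrative values: 10-minute cooldown, at most 2 executions per hour.
guard = Guardrail(cooldown_s=600, max_executions=2, window_s=3600)
decisions = [guard.allow(t) for t in (0, 300, 700, 1400, 3700)]
print(decisions)  # [True, False, True, False, True]
```

In this sketch the attempt at 300 s is blocked by the cooldown, the attempt at 1400 s is blocked by the execution cap, and the attempt at 3700 s passes because the first execution has left the one-hour window.
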
## Building a rule: example

Goal: Restart a worker if it throws repeated queue errors and notify Slack.

1. **Scope**: Containers tagged with group `workers`.
2. **Trigger**: Log keyword `Failed to fetch job` with frequency 3 times in 60 seconds.
3. **Actions**:
   - Notify Slack channel `#on-call` (immediate).
   - Delay 30 seconds.
   - Restart container. Verification delay 45 seconds.
4. **Guardrails**:
   - Cooldown: 10 minutes.
   - Max executions per hour: 2.
   - Abort if the container was restarted manually in the last 5 minutes.

This pattern avoids restart storms while keeping operators informed.

## Alert history & insights

- The Alerts dashboard shows the latest events, total alert count, and a rolling view of recent triggers.
- Switch to the **Stats** sub-tab to explore trend charts, rule and container breakdowns, and timeline analytics.
- The free edition retains only the most recent alerts (displayed at the top of the page); upgrading lifts that limit for deeper history.
- Use the built-in filters (rule, container, timeframe) to focus on the signals that matter, then export the data manually if needed.

## Troubleshooting rules

- Verify rule definitions in the Alert Engine UI and confirm the trigger preview matches your intent.
- Review backend logs (`docker compose logs alert-engine-backend`) for evaluation errors or guardrail messages.
- If notifications fail, ensure the Notifier service is reachable; inspect the Notifier dashboard (Logs tab) for recent delivery attempts and response codes.
- For script actions, confirm the container has `/logforge-scripts/` with an executable `.sh` script and that `/bin/sh` is available.

Advance to the [Automation Playbooks](Automation-Playbooks.md) for high-level strategies that combine multiple rules.
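
For readers who want to reason about the frequency trigger used in the worked example (`Failed to fetch job`, 3 times in 60 seconds), the following Python sketch shows one way an "N matches within M seconds" check could behave. It is an assumption-laden illustration, not the engine's code: `KeywordWindowTrigger` and `observe` are hypothetical names, and the sample log lines and timestamps are invented.

```python
from collections import deque


class KeywordWindowTrigger:
    """Hypothetical sliding-window check: N keyword matches within a time window."""

    def __init__(self, keyword: str, threshold: int, window_s: float):
        self.keyword = keyword
        self.threshold = threshold
        self.window_s = window_s
        self._hits = deque()  # timestamps of matching log lines

    def observe(self, timestamp: float, line: str) -> bool:
        """Feed one log line; return True when the threshold is reached."""
        if self.keyword not in line:
            return False
        self._hits.append(timestamp)
        # Drop matches that are older than the window relative to this line.
        while self._hits and timestamp - self._hits[0] > self.window_s:
            self._hits.popleft()
        return len(self._hits) >= self.threshold


trigger = KeywordWindowTrigger("Failed to fetch job", threshold=3, window_s=60)
events = [
    (0, "Failed to fetch job 17"),
    (20, "heartbeat ok"),
    (50, "Failed to fetch job 18"),
    (70, "Failed to fetch job 19"),   # the match at t=0 has expired by now
    (100, "Failed to fetch job 20"),  # matches at 50, 70, 100 fit in 60 s
]
fired = [trigger.observe(ts, line) for ts, line in events]
print(fired)  # [False, False, False, False, True]
```

The sketch keeps firing for as long as the window stays full; in a real rule, the cooldown and max-executions guardrails are what limit how often the actions actually run.
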