The Alert Engine transforms raw container signals into actionable alerts and safe remediation steps. This guide explains the building blocks, workflows, and best practices for crafting reliable rules.

## Core concepts

| Concept | Description |
|---------|-------------|
| **Rule** | A saved definition that listens for events (logs, status changes, performance thresholds) and executes actions. |
| **Trigger** | One condition per rule (keyword match, metric threshold, or container event) that determines when actions fire. |
| **Scope** | Determines which containers a rule inspects (all, specific labels/groups, or explicit container IDs). |
| **Action** | What happens when the trigger fires: notify, restart, stop, kill, start, or run a script. |
| **Advanced Settings** | The "Advanced Settings" panel in the UI (Gatekeeper & keyword tabs) where cooldowns, verification delays, rate limits, and keyword behaviour are configured. |

## Rule lifecycle
1. **Create**: Start from a template or blank rule within the Alert Engine UI.
2. **Scope**: Select containers/groups. Use include/exclude lists for precision.
3. **Trigger**: Choose a trigger type (keywords, container events, metrics) and configure thresholds.
4. **Actions**: Add one or more actions with optional delays between steps.
5. **Advanced Settings**: Configure cooldowns, max executions, verification delays, backoff, and keyword behaviour.
6. **Activate**: Enable the rule. Evaluations begin immediately.
7. **Review**: Inspect alert history, acknowledgements, and audit logs to tune behaviour.
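
To make the include/exclude idea from step 2 concrete, here is a minimal Python sketch. It assumes scope patterns behave like shell-style globs (such as `backend-*`) matched against container names; the function `in_scope` and the sample container names are hypothetical and are not part of the Alert Engine API.

```python
from fnmatch import fnmatchcase


def in_scope(name, include, exclude=()):
    """Return True if name matches any include pattern and no exclude pattern."""
    included = any(fnmatchcase(name, pattern) for pattern in include)
    excluded = any(fnmatchcase(name, pattern) for pattern in exclude)
    return included and not excluded


# Hypothetical container names, for illustration only.
containers = ["backend-api", "backend-canary", "db-main"]
selected = [
    name
    for name in containers
    if in_scope(name, include=["backend-*"], exclude=["*-canary"])
]
print(selected)
```

With these sample names, only `backend-api` is selected: `backend-canary` is removed by the exclude list and `db-main` never matches the include list.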

## Trigger types

| Trigger | Description | Example |
|---------|-------------|---------|
| **Log keyword** | Matches one or many substrings (ANY/ALL) in container logs. Optional timeline settings require N matches within M minutes before firing. | Alert when `OutOfMemoryError` appears 3 times in 2 minutes for `backend-*` containers. |
| **Performance metric** *(LogForge Pro)* | Evaluates CPU, memory, or restart counters against a threshold, with optional sustained-time windows. | Trigger when memory usage stays above 85% for 5 minutes. |
| **Container event** | Reacts to lifecycle events emitted by the LogForge backend (`start`, `stop`, `die`, `oom`, etc.), with optional "N events in M minutes" thresholds. | Notify when a database container restarts twice within 10 minutes. |

Each rule supports only one trigger type; create additional rules if you need to combine different signal types.
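
The "N matches within M minutes" behaviour of the log keyword trigger can be pictured as a sliding window over match timestamps. The Python sketch below is illustrative only: the class `KeywordWindowTrigger`, its parameters, the choice to evaluate ANY/ALL per log line, and the reset after firing are all assumptions, not the engine's actual implementation.

```python
from collections import deque


class KeywordWindowTrigger:
    """Fire when min_matches matching lines fall within window_seconds."""

    def __init__(self, keywords, mode="ANY", min_matches=1, window_seconds=60.0):
        if mode not in ("ANY", "ALL"):
            raise ValueError("mode must be 'ANY' or 'ALL'")
        self.keywords = list(keywords)
        self.mode = mode
        self.min_matches = min_matches
        self.window_seconds = window_seconds
        self._hits = deque()  # timestamps (seconds) of matching lines

    def line_matches(self, line):
        # Assumption: ANY/ALL is evaluated against a single log line.
        found = [keyword in line for keyword in self.keywords]
        return any(found) if self.mode == "ANY" else all(found)

    def observe(self, line, timestamp):
        """Record one log line; return True when the threshold is reached."""
        if not self.line_matches(line):
            return False
        self._hits.append(timestamp)
        # Drop matches that have fallen out of the window.
        while self._hits and timestamp - self._hits[0] > self.window_seconds:
            self._hits.popleft()
        if len(self._hits) >= self.min_matches:
            self._hits.clear()  # assumption: counting restarts after firing
            return True
        return False


trigger = KeywordWindowTrigger(["OutOfMemoryError"], min_matches=3, window_seconds=120)
line = "java.lang.OutOfMemoryError: Java heap space"
print([trigger.observe(line, t) for t in (0, 50, 100)])
```

Three matches at 0, 50, and 100 seconds fall inside the two-minute window, so only the third call returns `True`. The same three matches spread over 0, 100, and 200 seconds would never fire, because the oldest match expires before the third arrives.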

## Actions

| Action | Details |
|--------|---------|
| **Notify** | Sends the alert payload to one or more channels configured in the Notifier service. Supports templated bodies and includes context (container, rule, timestamps). |
| **Restart / Stop / Start / Kill** | Executes Docker lifecycle operations via the backend. Guardrails stop repeated restarts if the container fails health checks. |
| **Run script** | Executes the first executable `.sh` script found under `/logforge-scripts/` inside the container. Ensure the directory exists, scripts are executable, and a shell (`/bin/sh`) is present. |
| **Delay** | Chain actions with delays to stage responses (e.g., notify immediately, restart after 30 seconds if not acknowledged). |

Each action has additional safeguards:
- **Verification delay**: Wait for a steady state before confirming success.
- **Max executions**: Cap the number of times the action runs within a cooldown window.
- **Cooldown**: Minimum wait before the rule can fire again.
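
One way to reason about how cooldown and max executions interact is as a gate that every firing must pass. The Python sketch below is a simplified model under stated assumptions: `ExecutionGate` and its parameters are hypothetical names, and the counting window is treated as a rolling interval separate from the cooldown. The real engine may count differently.

```python
class ExecutionGate:
    """Allow an action only if the cooldown has elapsed and the cap is not reached."""

    def __init__(self, cooldown_seconds, max_executions, window_seconds):
        self.cooldown_seconds = cooldown_seconds
        self.max_executions = max_executions
        self.window_seconds = window_seconds
        self._runs = []  # timestamps (seconds) of allowed executions

    def allow(self, now):
        # Forget executions that are older than the counting window.
        self._runs = [t for t in self._runs if now - t < self.window_seconds]
        if self._runs and now - self._runs[-1] < self.cooldown_seconds:
            return False  # still cooling down from the last execution
        if len(self._runs) >= self.max_executions:
            return False  # cap reached for this window
        self._runs.append(now)
        return True


# 10-minute cooldown, at most 2 executions per rolling hour.
gate = ExecutionGate(cooldown_seconds=600, max_executions=2, window_seconds=3600)
print([gate.allow(t) for t in (0, 300, 600, 1500, 3600)])
```

In this run the attempt at 300 s is blocked by the cooldown, the attempt at 1500 s is blocked by the cap, and the attempt at 3600 s succeeds again because the first execution has left the rolling hour.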

## Templates

The UI includes templates covering common reliability and security cases:
- Crash loop detection
- High memory or CPU usage
- Log spike / noisy errors
- TLS certificate renewal reminder
- Security keyword detection
- Container start/stop notifications

Templates are editable after import. Use them to ensure guardrails are pre-populated.

## Building a rule: example

Goal: Restart a worker if it throws repeated queue errors and notify Slack.

1. **Scope**: Containers tagged with group `workers`.
2. **Trigger**: Log keyword `Failed to fetch job` with frequency 3 times in 60 seconds.
3. **Actions**:
   - Notify Slack channel `#on-call` (immediate).
   - Delay 30 seconds.
   - Restart container. Verification delay 45 seconds.
4. **Guardrails**:
   - Cooldown: 10 minutes.
   - Max executions per hour: 2.
   - Abort if the container was restarted manually in the last 5 minutes.

This pattern avoids restart storms while keeping operators informed.
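
To see how the delays in this example add up, the action chain can be laid out on a timeline. The Python sketch below is only a worked calculation: the `Step` structure and `timeline` function are hypothetical, each action is assumed to be issued instantly, and the verification check is assumed to happen one verification delay after the action is issued.

```python
from dataclasses import dataclass


@dataclass
class Step:
    action: str           # "notify", "delay", or "restart"
    seconds: float = 0.0  # delay length, or verification delay for an action


def timeline(steps):
    """Return (offset_seconds, description) pairs for a chain of steps."""
    clock = 0.0
    events = []
    for step in steps:
        if step.action == "delay":
            clock += step.seconds  # a delay only moves the clock forward
            continue
        events.append((clock, step.action))
        if step.seconds:
            clock += step.seconds  # wait out the verification delay
            events.append((clock, f"verify {step.action}"))
    return events


worker_rule = [Step("notify"), Step("delay", 30), Step("restart", 45)]
for offset, what in timeline(worker_rule):
    print(f"t+{offset:.0f}s  {what}")
```

Under these assumptions the Slack notification goes out at t+0s, the restart is issued at t+30s, and the restart is verified at t+75s, well inside the 10-minute cooldown.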

## Alert history & insights
- The Alerts dashboard shows the latest events, total alert count, and a rolling view of recent triggers.
- Switch to the **Stats** sub-tab to explore trend charts, rule and container breakdowns, and timeline analytics.
- The Free edition retains only the most recent alerts (the retention limit is displayed at the top of the page); upgrading lifts that limit for deeper history.
- Use the built-in filters (rule, container, timeframe) to narrow the view to the signals that matter before exporting any data manually.

## Troubleshooting rules
- Verify rule definitions in the Alert Engine UI and confirm the trigger preview matches your intent.
- Review backend logs (`docker compose logs alert-engine-backend`) for evaluation errors or guardrail messages.
- Ensure the Notifier service is reachable if notifications fail; inspect the Notifier dashboard (Logs tab) for recent delivery attempts and response codes.
- For script actions, confirm the container has `/logforge-scripts/` with an executable `.sh` script and that `/bin/sh` is available.
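
The script-action prerequisites from the last item can be checked mechanically. The Python sketch below assumes Python is available wherever the check runs (for example against a path that mirrors the container's filesystem); the function names are hypothetical, and the alphabetical ordering is an assumption, since this guide does not say how the engine decides which script is "first".

```python
import os


def find_executable_scripts(directory="/logforge-scripts"):
    """Return sorted paths of executable .sh files directly under directory."""
    if not os.path.isdir(directory):
        return []
    scripts = []
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if name.endswith(".sh") and os.path.isfile(path) and os.access(path, os.X_OK):
            scripts.append(path)
    return scripts


def script_action_ready(directory="/logforge-scripts", shell="/bin/sh"):
    """True when at least one executable script exists and the shell is executable."""
    return bool(find_executable_scripts(directory)) and os.access(shell, os.X_OK)


print(find_executable_scripts())
print(script_action_ready())
```

An empty list from `find_executable_scripts` points to a missing directory or missing execute permissions, which are the two conditions the checklist item asks you to confirm.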

Continue to the [Automation Playbooks](Automation-Playbooks.md) for high-level strategies that combine multiple rules.