diff --git a/README.md b/README.md index e064931..0c3584a 100644 --- a/README.md +++ b/README.md @@ -11,7 +11,8 @@ DISCLAIMER: This is a personal project and is not meant to be used in a producti ### Core Features - Multi-protocol support (TCP, HTTP, SMTP, DNS) -- Hierarchical configuration (Sites → Hosts → Checks) +- Hierarchical configuration (Sites → Groups → Hosts → Checks) +- High availability group monitoring - Configurable check intervals per service - Prometheus metrics integration - Rule-based monitoring with custom conditions @@ -29,7 +30,7 @@ DISCLAIMER: This is a personal project and is not meant to be used in a producti - YAML-based configuration - Modular architecture for easy extension using interfaces - Site-based infrastructure organization -- Tag inheritance (site tags are inherited by hosts) +- Tag inheritance (site tags are inherited by groups and hosts) ## Quick Start @@ -55,41 +56,36 @@ CheckMate is configured using a YAML (default: ./config.yaml) file. Here's a com ```yaml sites: - - name: us-east-1 - tags: ["prod", "aws", "use1"] - hosts: - - host: api.example.com - tags: ["public"] + - name: "mars-prod" + tags: ["region-mars", "prod"] + groups: + - name: "api-service.dev.com" + tags: ["prod"] + hosts: + - host: "127.0.0.1" + tags: ["primary"] + - host: "localhost2" + tags: ["secondary"] checks: - port: "443" - protocol: HTTP - interval: 30s - tags: ["https", "api"] + protocol: HTTPS + interval: 10s + tags: ["api"] - port: "22" protocol: TCP - interval: 1m + interval: 10s tags: ["ssh"] - - name: eu-west-1 - tags: ["prod", "aws", "euw1"] - hosts: - - host: eu.example.com - tags: ["api", "public"] - checks: - - port: "443" - protocol: HTTP - interval: 30s - rules: - name: high_latency_warning - condition: "responseTime > 2s" + condition: "responseTime > 5ms" tags: ["prod"] notifications: ["log"] - - - name: critical_downtime - condition: "downtime > 5m" + + - name: critical_downtime_prod + condition: "downtime > 15s" tags: ["prod"] - 
notifications: ["log", "slack"] + notifications: ["log"] notifications: - type: "log" @@ -97,22 +93,27 @@ notifications: ### Site Configuration - `name`: Unique identifier for the site -- `tags`: List of tags inherited by all hosts in the site -- `hosts`: List of hosts in this site - -### Host Configuration -- `host`: The hostname or IP to monitor -- `tags`: Additional tags specific to this host (combined with site tags) -- `checks`: List of service checks +- `tags`: List of tags inherited by all groups in the site +- `groups`: List of service groups in this site + +### Group Configuration +- `name`: The group identifier +- `tags`: Additional tags specific to this group (combined with site tags) +- `hosts`: List of hosts in this group +- `checks`: List of service checks applied to all hosts - `port`: Port number to check - `protocol`: One of: TCP, HTTP, SMTP, DNS - `interval`: Check frequency (e.g., "30s", "1m") - - `tags`: Additional tags specific to this check (combined with site and host tags) + - `tags`: Additional tags specific to this check + +### Host Configuration +- `host`: The hostname or IP to monitor +- `tags`: Additional tags specific to this host ### Rule Configuration - `name`: Unique rule identifier - `condition`: Expression to evaluate (uses responseTime and downtime variables) -- `tags`: List of tags to match against hosts +- `tags`: List of tags to match against groups - `notifications`: List of notification types to use when rule triggers - If omitted, all configured notifiers will be used @@ -120,6 +121,14 @@ notifications: - `type`: Type of notification ("log", with more coming soon) - Each notification type can have its own configuration options +## High Availability Monitoring + +Groups support high availability monitoring: +- A group is considered "up" if any host in the group is responding +- Response times are averaged across all successful checks in the group +- Metrics are tracked at both host and group levels +- Prometheus histograms 
are used for latency tracking + ## Metrics CheckMate exposes Prometheus metrics at `:9100/metrics` including: @@ -129,15 +138,16 @@ CheckMate exposes Prometheus metrics at `:9100/metrics` including: Labels included with metrics: - `site`: Site name -- `host`: Target hostname +- `group`: Group name +- `host`: Target hostname (empty for group-level metrics) - `port`: Service port - `protocol`: Check protocol -- `tags`: Comma-separated list of combined site and host tags +- `tags`: Comma-separated list of combined tags Example Prometheus queries: ```promql # Filter checks by site -checkmate_check_success{site="us-east-1"} +checkmate_check_success{site="mars-lab"} # Average response time for production APIs avg(checkmate_check_latency_milliseconds{tags=~".*prod.*", tags=~".*api.*"}) @@ -164,6 +174,7 @@ All health check endpoints are served on port 9100 alongside the metrics endpoin CheckMate uses structured logging with the following fields: - Basic check information: - `site`: Site name + - `group`: Group name - `host`: Target hostname - `port`: Service port - `protocol`: Check protocol @@ -200,128 +211,4 @@ This project is licensed under the GNU General Public License v3.0 - see the [LI - [x] Multi-protocol per host - [x] Service tagging system - [x] Site-based infrastructure organization - -## Technical Details - -### Implementing a New Checker - -To add support for a new protocol, implement the Checker interface: - -```go -// 1. Create a new type for your checker -type MyNewChecker struct { - protocol Protocol -} - -// 2. 
Implement the Checker interface -func (c *MyNewChecker) Check(ctx context.Context, address string) CheckResult { - // Perform your check logic here - result := CheckResult{ - Success: false, - ResponseTime: 0, - Error: nil, - } - - startTime := time.Now() - - // Your check implementation - // For example: - // - Open a connection - // - Send/receive data - // - Validate response - - result.ResponseTime = time.Since(startTime) - result.Success = true // based on check success - - return result -} - -func (c *MyNewChecker) Protocol() Protocol { - return c.protocol -} - -// 3. Register your checker in pkg/checkers/checker.go -func NewChecker(protocol string) (Checker, error) { - switch Protocol(protocol) { - // ... existing protocols ... - case ProtocolMyNew: - return &MyNewChecker{protocol: ProtocolMyNew}, nil - default: - return nil, fmt.Errorf("unsupported protocol: %s", protocol) - } -} -``` - -### Implementing a New Notifier - -To add a new notification system, implement the Notifier interface: - -```go -// 1. Create a new notification type constant -const MyNewNotification NotificationType = "mynew" - -// 2. Create your notifier type -type MyNewNotifier struct { - // Add any required fields - client *myclient.Client - apiKey string -} - -// 3. 
Implement the Notifier interface -func (n *MyNewNotifier) Initialize(ctx context.Context) error { - // Setup your notification client/connection - n.client = myclient.New(n.apiKey) - return nil -} - -func (n *MyNewNotifier) SendNotification(ctx context.Context, notification Notification) error { - // Convert the notification to your system's format - message := MyNotificationFormat{ - Text: notification.Message, - Severity: convertLevel(notification.Level), - Tags: notification.Tags, - Metadata: map[string]string{ - "host": notification.Host, - "port": notification.Port, - "protocol": notification.Protocol, - }, - } - - // Send the notification - return n.client.Send(ctx, message) -} - -func (n *MyNewNotifier) Type() NotificationType { - return MyNewNotification -} - -func (n *MyNewNotifier) Close() error { - return n.client.Close() -} - -// 4. Register your notifier in pkg/notifications/notifier.go -func NewNotifier(notifierType string, opts ...interface{}) (Notifier, error) { - switch NotificationType(notifierType) { - // ... existing notifiers ... 
- case MyNewNotification: - if len(opts) > 0 { - if apiKey, ok := opts[0].(string); ok { - return &MyNewNotifier{apiKey: apiKey}, nil - } - } - return nil, fmt.Errorf("mynew notifier requires an API key") - default: - return nil, fmt.Errorf("unsupported notification type: %s", notifierType) - } -} -``` - -### Available Make Commands -```bash -make dev # Setup development environment -make lint # Run linter -make test # Run tests -make coverage # Generate test coverage report -make docker-build # Build Docker image -make help # Show all available commands -``` \ No newline at end of file +- [x] High availability group monitoring \ No newline at end of file diff --git a/examples/config.yaml b/examples/config.yaml new file mode 100644 index 0000000..3b6708d --- /dev/null +++ b/examples/config.yaml @@ -0,0 +1,54 @@ +sites: + - name: "mars-lab" + tags: ["region-mars", "prod"] + groups: + - name: "api-service.dev.com" + tags: ["prod"] + hosts: + - host: "127.0.0.1" + tags: ["primary"] + - host: "localhost2" + tags: ["secondary"] + checks: + - port: "800" + protocol: HTTP + interval: 10s + tags: ["api"] + - port: "22" + protocol: TCP + interval: 10s + tags: ["ssh"] + # - name: "pluto-prod" + # tags: ["region-pluto", "prod"] + # domains: + # - name: "api-service.prod.com" + # tags: ["prod"] + # hosts: + # - host: "127.0.0.1" + # tags: ["primary"] + # - host: "localhost2" + # tags: ["secondary"] + # checks: + # - port: "800" + # protocol: HTTP + # interval: 10s + # tags: ["api"] + # - port: "22" + # protocol: TCP + # interval: 10s + # tags: ["ssh"] +rules: + - name: high_latency_warning + condition: "responseTime > 1ms" + tags: ["prod"] + notifications: ["log"] # This rule only sends log notifications + - name: critical_downtime_prod + condition: "downtime > 15s" + tags: ["prod"] + notifications: ["log"] # This rule would use both (when slack is implemented) + - name: critical_downtime_SSH + condition: "downtime > 10s" + tags: ["ssh"] + notifications: ["log"] # This rule 
would use both (when slack is implemented) +notifications: + - type: "log" # Currently the only implemented type \ No newline at end of file