docs: update README and add example configuration for group-based monitoring
whiskeyjimbo · Dec 30, 2024 · cb98c0c
1 parent 00058a1
commit cb98c0c
Showing 2 changed files with 104 additions and 163 deletions.
diff --git a/README.md b/README.md
@@ -11,7 +11,8 @@ DISCLAIMER: This is a personal project and is not meant to be used in a producti
 
 ### Core Features
 - Multi-protocol support (TCP, HTTP, SMTP, DNS)
-- Hierarchical configuration (Sites → Hosts → Checks)
+- Hierarchical configuration (Sites → Groups → Hosts → Checks)
+- High availability group monitoring
 - Configurable check intervals per service
 - Prometheus metrics integration
 - Rule-based monitoring with custom conditions
@@ -29,7 +30,7 @@ DISCLAIMER: This is a personal project and is not meant to be used in a producti
 - YAML-based configuration
 - Modular architecture for easy extension using interfaces
 - Site-based infrastructure organization
-- Tag inheritance (site tags are inherited by hosts)
+- Tag inheritance (site tags are inherited by groups and hosts)
 
 ## Quick Start
 
@@ -55,71 +56,79 @@ CheckMate is configured using a YAML (default: ./config.yaml) file. Here's a com
 
 ```yaml
 sites:
-  - name: us-east-1
-    tags: ["prod", "aws", "use1"]
-    hosts:
-      - host: api.example.com
-        tags: ["public"]
+  - name: "mars-prod"
+    tags: ["region-mars", "prod"]
+    groups:
+      - name: "api-service.dev.com"
+        tags: ["prod"]
+        hosts:
+          - host: "127.0.0.1"
+            tags: ["primary"]
+          - host: "localhost2"
+            tags: ["secondary"]
         checks:
           - port: "443"
-            protocol: HTTP
-            interval: 30s
-            tags: ["https", "api"]
+            protocol: HTTPS
+            interval: 10s
+            tags: ["api"]
           - port: "22"
             protocol: TCP
-            interval: 1m
+            interval: 10s
             tags: ["ssh"]
 
-  - name: eu-west-1
-    tags: ["prod", "aws", "euw1"]
-    hosts:
-      - host: eu.example.com
-        tags: ["api", "public"]
-        checks:
-          - port: "443"
-            protocol: HTTP
-            interval: 30s
-
 rules:
   - name: high_latency_warning
-    condition: "responseTime > 2s"
+    condition: "responseTime > 5ms"
     tags: ["prod"]
     notifications: ["log"]
-  
-  - name: critical_downtime
-    condition: "downtime > 5m"
+
+  - name: critical_downtime_prod
+    condition: "downtime > 15s"
     tags: ["prod"]
-    notifications: ["log", "slack"]
+    notifications: ["log"]
 
 notifications:
   - type: "log"
 ```
 
 ### Site Configuration
 - `name`: Unique identifier for the site
-- `tags`: List of tags inherited by all hosts in the site
-- `hosts`: List of hosts in this site
-
-### Host Configuration
-- `host`: The hostname or IP to monitor
-- `tags`: Additional tags specific to this host (combined with site tags)
-- `checks`: List of service checks
+- `tags`: List of tags inherited by all groups in the site
+- `groups`: List of service groups in this site
+
+### Group Configuration
+- `name`: The group identifier
+- `tags`: Additional tags specific to this group (combined with site tags)
+- `hosts`: List of hosts in this group
+- `checks`: List of service checks applied to all hosts
   - `port`: Port number to check
   - `protocol`: One of: TCP, HTTP, SMTP, DNS
   - `interval`: Check frequency (e.g., "30s", "1m")
-  - `tags`: Additional tags specific to this check (combined with site and host tags)
+  - `tags`: Additional tags specific to this check
+
+### Host Configuration
+- `host`: The hostname or IP to monitor
+- `tags`: Additional tags specific to this host
 
 ### Rule Configuration
 - `name`: Unique rule identifier
 - `condition`: Expression to evaluate (uses responseTime and downtime variables)
-- `tags`: List of tags to match against hosts
+- `tags`: List of tags to match against groups
 - `notifications`: List of notification types to use when rule triggers
   - If omitted, all configured notifiers will be used
 
 ### Notification Configuration
 - `type`: Type of notification ("log", with more coming soon)
 - Each notification type can have its own configuration options
 
+## High Availability Monitoring
+
+Groups support high availability monitoring:
+- A group is considered "up" if any host in the group is responding
+- Response times are averaged across all successful checks in the group
+- Metrics are tracked at both host and group levels
+- Prometheus histograms are used for latency tracking
+
 ## Metrics
 
 CheckMate exposes Prometheus metrics at `:9100/metrics` including:
@@ -129,15 +138,16 @@ CheckMate exposes Prometheus metrics at `:9100/metrics` including:
 
 Labels included with metrics:
 - `site`: Site name
-- `host`: Target hostname
+- `group`: Group name
+- `host`: Target hostname (empty for group-level metrics)
 - `port`: Service port
 - `protocol`: Check protocol
-- `tags`: Comma-separated list of combined site and host tags
+- `tags`: Comma-separated list of combined tags
 
 Example Prometheus queries:
 ```promql
 # Filter checks by site
-checkmate_check_success{site="us-east-1"}
+checkmate_check_success{site="mars-prod"}
 
 # Average response time for production APIs
 avg(checkmate_check_latency_milliseconds{tags=~".*prod.*", tags=~".*api.*"})
@@ -164,6 +174,7 @@ All health check endpoints are served on port 9100 alongside the metrics endpoin
 CheckMate uses structured logging with the following fields:
 - Basic check information:
   - `site`: Site name
+  - `group`: Group name
   - `host`: Target hostname
   - `port`: Service port
   - `protocol`: Check protocol
@@ -200,128 +211,4 @@ This project is licensed under the GNU General Public License v3.0 - see the [LI
 - [x] Multi-protocol per host
 - [x] Service tagging system
 - [x] Site-based infrastructure organization
-
-## Technical Details
-
-### Implementing a New Checker
-
-To add support for a new protocol, implement the Checker interface:
-
-```go
-// 1. Create a new type for your checker
-type MyNewChecker struct {
-    protocol Protocol
-}
-
-// 2. Implement the Checker interface
-func (c *MyNewChecker) Check(ctx context.Context, address string) CheckResult {
-    // Perform your check logic here
-    result := CheckResult{
-        Success:      false,
-        ResponseTime: 0,
-        Error:        nil,
-    }
-    
-    startTime := time.Now()
-    
-    // Your check implementation
-    // For example:
-    // - Open a connection
-    // - Send/receive data
-    // - Validate response
-    
-    result.ResponseTime = time.Since(startTime)
-    result.Success = true // based on check success
-    
-    return result
-}
-
-func (c *MyNewChecker) Protocol() Protocol {
-    return c.protocol
-}
-
-// 3. Register your checker in pkg/checkers/checker.go
-func NewChecker(protocol string) (Checker, error) {
-    switch Protocol(protocol) {
-    // ... existing protocols ...
-    case ProtocolMyNew:
-        return &MyNewChecker{protocol: ProtocolMyNew}, nil
-    default:
-        return nil, fmt.Errorf("unsupported protocol: %s", protocol)
-    }
-}
-```
-
-### Implementing a New Notifier
-
-To add a new notification system, implement the Notifier interface:
-
-```go
-// 1. Create a new notification type constant
-const MyNewNotification NotificationType = "mynew"
-
-// 2. Create your notifier type
-type MyNewNotifier struct {
-    // Add any required fields
-    client    *myclient.Client
-    apiKey    string
-}
-
-// 3. Implement the Notifier interface
-func (n *MyNewNotifier) Initialize(ctx context.Context) error {
-    // Setup your notification client/connection
-    n.client = myclient.New(n.apiKey)
-    return nil
-}
-
-func (n *MyNewNotifier) SendNotification(ctx context.Context, notification Notification) error {
-    // Convert the notification to your system's format
-    message := MyNotificationFormat{
-        Text:     notification.Message,
-        Severity: convertLevel(notification.Level),
-        Tags:     notification.Tags,
-        Metadata: map[string]string{
-            "host":     notification.Host,
-            "port":     notification.Port,
-            "protocol": notification.Protocol,
-        },
-    }
-    
-    // Send the notification
-    return n.client.Send(ctx, message)
-}
-
-func (n *MyNewNotifier) Type() NotificationType {
-    return MyNewNotification
-}
-
-func (n *MyNewNotifier) Close() error {
-    return n.client.Close()
-}
-
-// 4. Register your notifier in pkg/notifications/notifier.go
-func NewNotifier(notifierType string, opts ...interface{}) (Notifier, error) {
-    switch NotificationType(notifierType) {
-    // ... existing notifiers ...
-    case MyNewNotification:
-        if len(opts) > 0 {
-            if apiKey, ok := opts[0].(string); ok {
-                return &MyNewNotifier{apiKey: apiKey}, nil
-            }
-        }
-        return nil, fmt.Errorf("mynew notifier requires an API key")
-    default:
-        return nil, fmt.Errorf("unsupported notification type: %s", notifierType)
-    }
-}
-```
-
-### Available Make Commands
-```bash
-make dev          # Setup development environment
-make lint         # Run linter
-make test         # Run tests
-make coverage     # Generate test coverage report
-make docker-build # Build Docker image
-make help         # Show all available commands
-```
+- [x] High availability group monitoring
diff --git a/examples/config.yaml b/examples/config.yaml
@@ -0,0 +1,54 @@
+sites:
+  - name: "mars-lab"
+    tags: ["region-mars", "prod"]
+    groups:
+      - name: "api-service.dev.com"
+        tags: ["prod"]
+        hosts:
+          - host: "127.0.0.1"
+            tags: ["primary"]
+          - host: "localhost2"
+            tags: ["secondary"]
+        checks:
+          - port: "800"
+            protocol: HTTP
+            interval: 10s
+            tags: ["api"]
+          - port: "22"
+            protocol: TCP
+            interval: 10s
+            tags: ["ssh"]
+  # - name: "pluto-prod"
+  #   tags: ["region-pluto", "prod"]
+  #   groups:
+  #     - name: "api-service.prod.com"
+  #       tags: ["prod"]
+  #       hosts:
+  #         - host: "127.0.0.1"
+  #           tags: ["primary"]
+  #         - host: "localhost2"
+  #           tags: ["secondary"]
+  #       checks:
+  #         - port: "800"
+  #           protocol: HTTP
+  #           interval: 10s
+  #           tags: ["api"]
+  #         - port: "22"
+  #           protocol: TCP
+  #           interval: 10s
+  #           tags: ["ssh"]
+rules:
+  - name: high_latency_warning
+    condition: "responseTime > 1ms"
+    tags: ["prod"]
+    notifications: ["log"]  # This rule only sends log notifications
+  - name: critical_downtime_prod
+    condition: "downtime > 15s"
+    tags: ["prod"]
+    notifications: ["log"]  # Log only for now; slack could be added once it is implemented
+  - name: critical_downtime_SSH
+    condition: "downtime > 10s"
+    tags: ["ssh"]
+    notifications: ["log"]  # Log only for now; slack could be added once it is implemented
+notifications:
+  - type: "log"    # Currently the only implemented type
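
Note on tag matching in this example config: the README says tags are inherited down the hierarchy and that each rule's `tags` are matched against groups. The Go sketch below shows one plausible reading of that. `CombineTags` and `RuleApplies` are hypothetical helpers, and the "any shared tag matches" semantics is an assumption; the commit does not say whether a rule needs one matching tag or all of them.

```go
package main

import "fmt"

// CombineTags merges tag lists in the order given (for example
// site, group, check) and drops duplicates, keeping first occurrences.
// Hypothetical helper, not taken from the CheckMate source.
func CombineTags(lists ...[]string) []string {
	seen := make(map[string]bool)
	var out []string
	for _, list := range lists {
		for _, tag := range list {
			if !seen[tag] {
				seen[tag] = true
				out = append(out, tag)
			}
		}
	}
	return out
}

// RuleApplies reports whether at least one rule tag appears in the
// target's combined tags. ASSUMPTION: any-match semantics; the real
// matching logic may differ.
func RuleApplies(ruleTags, targetTags []string) bool {
	have := make(map[string]bool, len(targetTags))
	for _, tag := range targetTags {
		have[tag] = true
	}
	for _, tag := range ruleTags {
		if have[tag] {
			return true
		}
	}
	return false
}

func main() {
	// Tags taken from the example config: site, group, and the SSH check.
	combined := CombineTags(
		[]string{"region-mars", "prod"},
		[]string{"prod"},
		[]string{"ssh"},
	)
	fmt.Println(combined)
	fmt.Println(RuleApplies([]string{"ssh"}, combined))     // critical_downtime_SSH
	fmt.Println(RuleApplies([]string{"staging"}, combined)) // a tag not in the config
}
```

Under this reading, both the `prod` rules and the `ssh` rule would apply to the port 22 check in `api-service.dev.com`, because its combined tags contain both `prod` and `ssh`.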