docs: update README and add example configuration for group-based monitoring
whiskeyjimbo committed Dec 30, 2024
1 parent 00058a1 commit cb98c0c
Showing 2 changed files with 104 additions and 163 deletions.
213 changes: 50 additions & 163 deletions README.md
@@ -11,7 +11,8 @@ DISCLAIMER: This is a personal project and is not meant to be used in a producti

### Core Features
- Multi-protocol support (TCP, HTTP, SMTP, DNS)
- Hierarchical configuration (Sites → Hosts → Checks)
- Hierarchical configuration (Sites → Groups → Hosts → Checks)
- High availability group monitoring
- Configurable check intervals per service
- Prometheus metrics integration
- Rule-based monitoring with custom conditions
@@ -29,7 +30,7 @@ DISCLAIMER: This is a personal project and is not meant to be used in a producti
- YAML-based configuration
- Modular architecture for easy extension using interfaces
- Site-based infrastructure organization
- Tag inheritance (site tags are inherited by hosts)
- Tag inheritance (site tags are inherited by groups and hosts)

## Quick Start

@@ -55,71 +56,79 @@ CheckMate is configured using a YAML (default: ./config.yaml) file. Here's a com

```yaml
sites:
- name: us-east-1
tags: ["prod", "aws", "use1"]
hosts:
- host: api.example.com
tags: ["public"]
- name: "mars-prod"
tags: ["region-mars", "prod"]
groups:
- name: "api-service.dev.com"
tags: ["prod"]
hosts:
- host: "127.0.0.1"
tags: ["primary"]
- host: "localhost2"
tags: ["secondary"]
checks:
- port: "443"
protocol: HTTP
interval: 30s
tags: ["https", "api"]
protocol: HTTPS
interval: 10s
tags: ["api"]
- port: "22"
protocol: TCP
interval: 1m
interval: 10s
tags: ["ssh"]

- name: eu-west-1
tags: ["prod", "aws", "euw1"]
hosts:
- host: eu.example.com
tags: ["api", "public"]
checks:
- port: "443"
protocol: HTTP
interval: 30s

rules:
- name: high_latency_warning
condition: "responseTime > 2s"
condition: "responseTime > 5ms"
tags: ["prod"]
notifications: ["log"]
- name: critical_downtime
condition: "downtime > 5m"

- name: critical_downtime_prod
condition: "downtime > 15s"
tags: ["prod"]
notifications: ["log", "slack"]
notifications: ["log"]

notifications:
- type: "log"
```
### Site Configuration
- `name`: Unique identifier for the site
- `tags`: List of tags inherited by all hosts in the site
- `hosts`: List of hosts in this site

### Host Configuration
- `host`: The hostname or IP to monitor
- `tags`: Additional tags specific to this host (combined with site tags)
- `checks`: List of service checks
- `tags`: List of tags inherited by all groups in the site
- `groups`: List of service groups in this site

### Group Configuration
- `name`: The group identifier
- `tags`: Additional tags specific to this group (combined with site tags)
- `hosts`: List of hosts in this group
- `checks`: List of service checks applied to all hosts
- `port`: Port number to check
- `protocol`: One of: TCP, HTTP, SMTP, DNS
- `interval`: Check frequency (e.g., "30s", "1m")
- `tags`: Additional tags specific to this check (combined with site and host tags)
- `tags`: Additional tags specific to this check

### Host Configuration
- `host`: The hostname or IP to monitor
- `tags`: Additional tags specific to this host
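The tag inheritance described above (site tags flow to groups, then hosts, then checks) can be sketched as follows. This is a minimal illustration, not CheckMate's actual code: the `mergeTags` helper is hypothetical, and dropping duplicate tags is an assumption the README does not confirm.

```go
package main

import "fmt"

// mergeTags is a hypothetical helper: it combines tag lists in the order
// given (site, group, host, check) and drops duplicates while keeping
// first-seen order. Deduplication is an assumption, not documented behavior.
func mergeTags(levels ...[]string) []string {
	seen := make(map[string]struct{})
	merged := []string{}
	for _, tags := range levels {
		for _, t := range tags {
			if _, ok := seen[t]; ok {
				continue
			}
			seen[t] = struct{}{}
			merged = append(merged, t)
		}
	}
	return merged
}

func main() {
	site := []string{"region-mars", "prod"}
	group := []string{"prod"}
	host := []string{"primary"}
	check := []string{"api"}

	// Effective tags for one check on one host in the group.
	fmt.Println(mergeTags(site, group, host, check))
}
```

With the example configuration, the check on port 443 of host `127.0.0.1` would carry `region-mars`, `prod`, `primary`, and `api` under these assumptions.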

### Rule Configuration
- `name`: Unique rule identifier
- `condition`: Expression to evaluate (uses responseTime and downtime variables)
- `tags`: List of tags to match against hosts
- `tags`: List of tags to match against groups
- `notifications`: List of notification types to use when rule triggers
- If omitted, all configured notifiers will be used

### Notification Configuration
- `type`: Type of notification ("log", with more coming soon)
- Each notification type can have its own configuration options

## High Availability Monitoring

Groups support high availability monitoring:
- A group is considered "up" if any host in the group is responding
- Response times are averaged across all successful checks in the group
- Metrics are tracked at both host and group levels
- Prometheus histograms are used for latency tracking

## Metrics

CheckMate exposes Prometheus metrics at `:9100/metrics` including:
@@ -129,15 +138,16 @@ CheckMate exposes Prometheus metrics at `:9100/metrics` including:

Labels included with metrics:
- `site`: Site name
- `host`: Target hostname
- `group`: Group name
- `host`: Target hostname (empty for group-level metrics)
- `port`: Service port
- `protocol`: Check protocol
- `tags`: Comma-separated list of combined site and host tags
- `tags`: Comma-separated list of combined tags

Example Prometheus queries:
```promql
# Filter checks by site
checkmate_check_success{site="us-east-1"}
checkmate_check_success{site="mars-lab"}
# Average response time for production APIs
avg(checkmate_check_latency_milliseconds{tags=~".*prod.*", tags=~".*api.*"})
@@ -164,6 +174,7 @@ All health check endpoints are served on port 9100 alongside the metrics endpoin
CheckMate uses structured logging with the following fields:
- Basic check information:
- `site`: Site name
- `group`: Group name
- `host`: Target hostname
- `port`: Service port
- `protocol`: Check protocol
@@ -200,128 +211,4 @@ This project is licensed under the GNU General Public License v3.0 - see the [LI
- [x] Multi-protocol per host
- [x] Service tagging system
- [x] Site-based infrastructure organization

## Technical Details

### Implementing a New Checker

To add support for a new protocol, implement the Checker interface:

```go
// 1. Create a new type for your checker
type MyNewChecker struct {
	protocol Protocol
}

// 2. Implement the Checker interface
func (c *MyNewChecker) Check(ctx context.Context, address string) CheckResult {
	// Perform your check logic here
	result := CheckResult{
		Success:      false,
		ResponseTime: 0,
		Error:        nil,
	}
	startTime := time.Now()
	// Your check implementation
	// For example:
	// - Open a connection
	// - Send/receive data
	// - Validate response
	result.ResponseTime = time.Since(startTime)
	result.Success = true // based on check success
	return result
}

func (c *MyNewChecker) Protocol() Protocol {
	return c.protocol
}

// 3. Register your checker in pkg/checkers/checker.go
func NewChecker(protocol string) (Checker, error) {
	switch Protocol(protocol) {
	// ... existing protocols ...
	case ProtocolMyNew:
		return &MyNewChecker{protocol: ProtocolMyNew}, nil
	default:
		return nil, fmt.Errorf("unsupported protocol: %s", protocol)
	}
}
```

### Implementing a New Notifier

To add a new notification system, implement the Notifier interface:

```go
// 1. Create a new notification type constant
const MyNewNotification NotificationType = "mynew"

// 2. Create your notifier type
type MyNewNotifier struct {
	// Add any required fields
	client *myclient.Client
	apiKey string
}

// 3. Implement the Notifier interface
func (n *MyNewNotifier) Initialize(ctx context.Context) error {
	// Setup your notification client/connection
	n.client = myclient.New(n.apiKey)
	return nil
}

func (n *MyNewNotifier) SendNotification(ctx context.Context, notification Notification) error {
	// Convert the notification to your system's format
	message := MyNotificationFormat{
		Text:     notification.Message,
		Severity: convertLevel(notification.Level),
		Tags:     notification.Tags,
		Metadata: map[string]string{
			"host":     notification.Host,
			"port":     notification.Port,
			"protocol": notification.Protocol,
		},
	}
	// Send the notification
	return n.client.Send(ctx, message)
}

func (n *MyNewNotifier) Type() NotificationType {
	return MyNewNotification
}

func (n *MyNewNotifier) Close() error {
	return n.client.Close()
}

// 4. Register your notifier in pkg/notifications/notifier.go
func NewNotifier(notifierType string, opts ...interface{}) (Notifier, error) {
	switch NotificationType(notifierType) {
	// ... existing notifiers ...
	case MyNewNotification:
		if len(opts) > 0 {
			if apiKey, ok := opts[0].(string); ok {
				return &MyNewNotifier{apiKey: apiKey}, nil
			}
		}
		return nil, fmt.Errorf("mynew notifier requires an API key")
	default:
		return nil, fmt.Errorf("unsupported notification type: %s", notifierType)
	}
}
```

### Available Make Commands
```bash
make dev # Setup development environment
make lint # Run linter
make test # Run tests
make coverage # Generate test coverage report
make docker-build # Build Docker image
make help # Show all available commands
```
- [x] High availability group monitoring
54 changes: 54 additions & 0 deletions examples/config.yaml
@@ -0,0 +1,54 @@
sites:
  - name: "mars-lab"
    tags: ["region-mars", "prod"]
    groups:
      - name: "api-service.dev.com"
        tags: ["prod"]
        hosts:
          - host: "127.0.0.1"
            tags: ["primary"]
          - host: "localhost2"
            tags: ["secondary"]
        checks:
          - port: "800"
            protocol: HTTP
            interval: 10s
            tags: ["api"]
          - port: "22"
            protocol: TCP
            interval: 10s
            tags: ["ssh"]
  # - name: "pluto-prod"
  #   tags: ["region-pluto", "prod"]
  #   domains:
  #     - name: "api-service.prod.com"
  #       tags: ["prod"]
  #       hosts:
  #         - host: "127.0.0.1"
  #           tags: ["primary"]
  #         - host: "localhost2"
  #           tags: ["secondary"]
  #       checks:
  #         - port: "800"
  #           protocol: HTTP
  #           interval: 10s
  #           tags: ["api"]
  #         - port: "22"
  #           protocol: TCP
  #           interval: 10s
  #           tags: ["ssh"]
rules:
  - name: high_latency_warning
    condition: "responseTime > 1ms"
    tags: ["prod"]
    notifications: ["log"] # This rule only sends log notifications
  - name: critical_downtime_prod
    condition: "downtime > 15s"
    tags: ["prod"]
    notifications: ["log"] # This rule would use both (when slack is implemented)
  - name: critical_downtime_SSH
    condition: "downtime > 10s"
    tags: ["ssh"]
    notifications: ["log"] # This rule would use both (when slack is implemented)
notifications:
  - type: "log" # Currently the only implemented type
