CheckMate is a service monitoring tool written in Go that provides real-time health checks and metrics for infrastructure. It supports multiple protocols, customizable rules, and Prometheus integration.
DISCLAIMER: This is a personal project under heavy development. It is not feature complete, secure, or tested, and is not meant to be used in a production environment.
- Multi-protocol support (TCP, HTTP, SMTP, DNS)
- Hierarchical configuration (Sites → Hosts → Checks)
- Configurable check intervals per service
- Prometheus metrics integration
- Rule-based monitoring with custom conditions
- Flexible notification system with rule-specific routing
- Service availability status
- Response time measurements
- Rule-based alerting with customizable conditions
- Prometheus-compatible metrics endpoint
- Downtime tracking
- Latency histograms and gauges
- YAML-based configuration
- Modular architecture for easy extension using interfaces
- Site-based infrastructure organization
- Tag inheritance (site tags are inherited by hosts)
- Clone the repository:
git clone https://github.com/whiskeyjimbo/CheckMate.git
cd CheckMate
- Build using Make:
make build
- Run using Make:
make run config.yaml ## see configuration below for more details
CheckMate is configured with a YAML file (default: ./config.yaml). Here's a complete example:
sites:
  - name: us-east-1
    tags: ["prod", "aws", "use1"]
    hosts:
      - host: api.example.com
        tags: ["public"]
        checks:
          - port: "443"
            protocol: HTTP
            interval: 30s
            tags: ["https", "api"]
          - port: "22"
            protocol: TCP
            interval: 1m
            tags: ["ssh"]
  - name: eu-west-1
    tags: ["prod", "aws", "euw1"]
    hosts:
      - host: eu.example.com
        tags: ["api", "public"]
        checks:
          - port: "443"
            protocol: HTTP
            interval: 30s

rules:
  - name: high_latency_warning
    condition: "responseTime > 2s"
    tags: ["prod"]
    notifications: ["log"]
  - name: critical_downtime
    condition: "downtime > 5m"
    tags: ["prod"]
    notifications: ["log", "slack"]

notifications:
  - type: "log"
Site options:
- name: Unique identifier for the site
- tags: List of tags inherited by all hosts in the site
- hosts: List of hosts in this site

Host options:
- host: The hostname or IP to monitor
- tags: Additional tags specific to this host (combined with site tags)
- checks: List of service checks

Check options:
- port: Port number to check
- protocol: One of TCP, HTTP, SMTP, DNS
- interval: Check frequency (e.g., "30s", "1m")
- tags: Additional tags specific to this check (combined with site and host tags)

Rule options:
- name: Unique rule identifier
- condition: Expression to evaluate (uses the responseTime and downtime variables)
- tags: List of tags to match against hosts
- notifications: List of notification types to use when the rule triggers; if omitted, all configured notifiers are used

Notification options:
- type: Type of notification ("log", with more coming soon); each notification type can have its own configuration options
CheckMate exposes Prometheus metrics at :9100/metrics, including:
- checkmate_check_success: Service availability (1 = up, 0 = down)
- checkmate_check_latency_milliseconds: Response time gauge
- checkmate_check_latency_milliseconds_histogram: Response time distribution
Labels included with metrics:
- site: Site name
- host: Target hostname
- port: Service port
- protocol: Check protocol
- tags: Comma-separated list of combined site and host tags
Example Prometheus queries:
# Filter checks by site
checkmate_check_success{site="us-east-1"}
# Average response time for production APIs
avg(checkmate_check_latency_milliseconds{tags=~".*prod.*", tags=~".*api.*"})
# 95th percentile latency by site
histogram_quantile(0.95, sum(rate(checkmate_check_latency_milliseconds_histogram[5m])) by (le, site))
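To collect these metrics, point a Prometheus scrape job at the endpoint. A minimal sketch, using a placeholder target hostname (the default /metrics path applies):

scrape_configs:
  - job_name: "checkmate"
    static_configs:
      - targets: ["checkmate.internal:9100"]  # placeholder; replace with the host running CheckMate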
CheckMate provides Kubernetes-compatible health check endpoints:
- /health/live: Liveness probe. Returns 200 OK when the service is running.
- /health/ready: Readiness probe. Returns 200 OK when the service is ready to receive traffic, and 503 Service Unavailable during initialization.
All health check endpoints are served on port 9100 alongside the metrics endpoint.
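When running CheckMate in Kubernetes, the probes can point directly at these endpoints. A minimal sketch for the container spec, assuming the pod exposes port 9100 (the delay and period values are illustrative):

livenessProbe:
  httpGet:
    path: /health/live
    port: 9100
  initialDelaySeconds: 5
  periodSeconds: 10
readinessProbe:
  httpGet:
    path: /health/ready
    port: 9100
  initialDelaySeconds: 5
  periodSeconds: 10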
CheckMate uses structured logging with the following fields:
- Basic check information:
  - site: Site name
  - host: Target hostname
  - port: Service port
  - protocol: Check protocol
  - success: Check result (true/false)
  - responseTime_us: Response time in microseconds
  - tags: Array of host tags
- Rule evaluation:
  - rule: Rule name
  - ruleTags: Tags assigned to the rule
  - hostTags: Tags assigned to the host
  - condition: Rule condition
  - downtime: Current downtime duration
  - responseTime: Last check response time
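For illustration, a single check event rendered as JSON might carry fields like the following. The values echo the example configuration above; the exact encoder and any extra fields depend on how the logger is configured, so treat this as a sketch rather than exact output:

{
  "site": "us-east-1",
  "host": "api.example.com",
  "port": "443",
  "protocol": "HTTP",
  "success": true,
  "responseTime_us": 84213,
  "tags": ["prod", "aws", "use1", "public"]
}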
Contributions are welcome! Please feel free to submit a Pull Request.
This project is licensed under the GNU General Public License v3.0 - see the LICENSE file for details.
- Additional protocol support (HTTPS, TLS verification)
- Notification system integration (Slack, Email, etc.)
- Configurable notification thresholds
- Database support for historical data
- Docker container
- Web UI for monitoring (MAYBE)
- Kubernetes readiness/liveness probe support
- Multiple host monitoring
- Multi-protocol per host
- Service tagging system
- Site-based infrastructure organization
To add support for a new protocol, implement the Checker interface:
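For reference, the interface has roughly this shape (a sketch inferred from the example below; the authoritative definition lives in the pkg/checkers package):

type Checker interface {
    // Check runs a single probe against address and reports the outcome.
    Check(ctx context.Context, address string) CheckResult
    // Protocol reports which protocol this checker handles.
    Protocol() Protocol
}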
// 1. Create a new type for your checker
type MyNewChecker struct {
    protocol Protocol
}

// 2. Implement the Checker interface
func (c *MyNewChecker) Check(ctx context.Context, address string) CheckResult {
    // Perform your check logic here
    result := CheckResult{
        Success:      false,
        ResponseTime: 0,
        Error:        nil,
    }

    startTime := time.Now()

    // Your check implementation, for example:
    // - Open a connection
    // - Send/receive data
    // - Validate response

    result.ResponseTime = time.Since(startTime)
    result.Success = true // based on check success

    return result
}

func (c *MyNewChecker) Protocol() Protocol {
    return c.protocol
}

// 3. Register your checker in pkg/checkers/checker.go
func NewChecker(protocol string) (Checker, error) {
    switch Protocol(protocol) {
    // ... existing protocols ...
    case ProtocolMyNew:
        return &MyNewChecker{protocol: ProtocolMyNew}, nil
    default:
        return nil, fmt.Errorf("unsupported protocol: %s", protocol)
    }
}
To add a new notification system, implement the Notifier interface:
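For reference, the interface has roughly this shape (a sketch inferred from the example below; the authoritative definition lives in the pkg/notifications package):

type Notifier interface {
    // Initialize sets up any client or connection the notifier needs.
    Initialize(ctx context.Context) error
    // SendNotification delivers a single notification.
    SendNotification(ctx context.Context, notification Notification) error
    // Type reports which NotificationType this notifier handles.
    Type() NotificationType
    // Close releases any resources held by the notifier.
    Close() error
}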
// 1. Create a new notification type constant
const MyNewNotification NotificationType = "mynew"

// 2. Create your notifier type
type MyNewNotifier struct {
    // Add any required fields
    client *myclient.Client
    apiKey string
}

// 3. Implement the Notifier interface
func (n *MyNewNotifier) Initialize(ctx context.Context) error {
    // Setup your notification client/connection
    n.client = myclient.New(n.apiKey)
    return nil
}

func (n *MyNewNotifier) SendNotification(ctx context.Context, notification Notification) error {
    // Convert the notification to your system's format
    message := MyNotificationFormat{
        Text:     notification.Message,
        Severity: convertLevel(notification.Level),
        Tags:     notification.Tags,
        Metadata: map[string]string{
            "host":     notification.Host,
            "port":     notification.Port,
            "protocol": notification.Protocol,
        },
    }

    // Send the notification
    return n.client.Send(ctx, message)
}

func (n *MyNewNotifier) Type() NotificationType {
    return MyNewNotification
}

func (n *MyNewNotifier) Close() error {
    return n.client.Close()
}

// 4. Register your notifier in pkg/notifications/notifier.go
func NewNotifier(notifierType string, opts ...interface{}) (Notifier, error) {
    switch NotificationType(notifierType) {
    // ... existing notifiers ...
    case MyNewNotification:
        if len(opts) > 0 {
            if apiKey, ok := opts[0].(string); ok {
                return &MyNewNotifier{apiKey: apiKey}, nil
            }
        }
        return nil, fmt.Errorf("mynew notifier requires an API key")
    default:
        return nil, fmt.Errorf("unsupported notification type: %s", notifierType)
    }
}
make dev # Setup development environment
make lint # Run linter
make test # Run tests
make coverage # Generate test coverage report
make docker-build # Build Docker image
make help # Show all available commands