A comprehensive monitoring and observability solution for Kubernetes clusters. Get real-time insights, alerting, and log aggregation for 100+ services.
- Metrics: Prometheus with pre-configured alerting rules
- Visualization: Grafana dashboards for all components
- Logging: ELK Stack for centralized log management
- Tracing: Jaeger for distributed tracing
- Alerting: AlertManager with PagerDuty/Slack integration
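As a rough illustration of the PagerDuty/Slack integration, the sketch below routes `severity: critical` alerts to PagerDuty and sends everything else to Slack. The receiver names, the channel, the webhook URL, and the routing key are placeholders and assumptions, not values taken from this repository's `alertmanager/config.yaml`.

```yaml
# Hypothetical Alertmanager routing sketch (all names and keys are placeholders)
route:
  receiver: slack-default            # fallback receiver for non-critical alerts
  group_by: [alertname, namespace]
  routes:
    - matchers:
        - severity = "critical"      # matches the label set in the alerting rules
      receiver: pagerduty-critical

receivers:
  - name: slack-default
    slack_configs:
      - api_url: https://hooks.slack.com/services/REPLACE/ME   # placeholder webhook
        channel: "#alerts"                                     # assumed channel name
        send_resolved: true
  - name: pagerduty-critical
    pagerduty_configs:
      - routing_key: REPLACE_WITH_EVENTS_V2_KEY                # placeholder key
```

In practice the webhook URL and routing key would normally come from a secret rather than being written inline.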
                    ┌─────────────────────────────────────┐
                    │               Grafana               │
                    │    (Visualization & Dashboards)     │
                    └───────────────┬─────────────────────┘
                                    │
        ┌───────────────────────────┼───────────────────────────┐
        │                           │                           │
        ▼                           ▼                           ▼
┌───────────────┐           ┌───────────────┐           ┌───────────────┐
│  Prometheus   │           │ Elasticsearch │           │    Jaeger     │
│   (Metrics)   │           │    (Logs)     │           │   (Tracing)   │
└───────────────┘           └───────────────┘           └───────────────┘
        │                           │                           │
        ▼                           ▼                           ▼
┌───────────────┐           ┌───────────────┐           ┌───────────────┐
│ AlertManager  │           │    Fluentd    │           │ Jaeger Agent  │
└───────────────┘           └───────────────┘           └───────────────┘
        │                           │                           │
        └───────────────────────────┼───────────────────────────┘
                                    │
                    ┌───────────────┴───────────────┐
                    │      Kubernetes Cluster       │
                    │    (Pods, Services, Nodes)    │
                    └───────────────────────────────┘
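The diagram shows Grafana reading from Prometheus, Elasticsearch, and Jaeger. A minimal sketch of how that wiring might look as a Grafana datasource provisioning file is given below. The service hostnames, namespaces, and the `logstash-` index pattern are assumptions for illustration and are not confirmed by this repository. Note also that the kube-prometheus-stack chart usually provisions the Prometheus datasource on its own.

```yaml
# Hypothetical file under grafana/datasources/ (hostnames and namespaces are assumed)
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus-operated.monitoring.svc:9090   # assumed service name
    isDefault: true
  - name: Elasticsearch
    type: elasticsearch
    access: proxy
    url: http://elasticsearch.logging.svc:9200             # assumed service name
    jsonData:
      index: "[logstash-]YYYY.MM.DD"   # assumes Fluentd writes daily logstash-* indices
      interval: Daily
      timeField: "@timestamp"
  - name: Jaeger
    type: jaeger
    access: proxy
    url: http://jaeger-query.tracing.svc:16686             # assumed service name
```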
├── prometheus/
│   ├── prometheus.yaml
│   ├── alerting-rules/
│   │   ├── node-alerts.yaml
│   │   ├── pod-alerts.yaml
│   │   └── custom-alerts.yaml
│   └── service-monitors/
├── grafana/
│   ├── dashboards/
│   │   ├── kubernetes-cluster.json
│   │   ├── node-exporter.json
│   │   ├── pods-monitoring.json
│   │   └── custom-app.json
│   ├── datasources/
│   └── provisioning/
├── alertmanager/
│   ├── config.yaml
│   └── templates/
├── elasticsearch/
│   ├── elasticsearch.yaml
│   ├── kibana.yaml
│   └── fluentd/
├── jaeger/
│   └── jaeger-all-in-one.yaml
├── helm/
│   └── values/
└── kustomize/
    ├── base/
    └── overlays/
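The `kustomize/base/` and `kustomize/overlays/` split follows the usual Kustomize pattern: shared manifests live in the base, and each environment overrides only what differs. A minimal sketch of what a dev overlay could contain is shown below. The namespace, the `elasticsearch` workload name, and the replica count are assumptions, not the actual contents of this repository.

```yaml
# Hypothetical kustomize/overlays/dev/kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization

namespace: logging        # assumed target namespace

resources:
  - ../../base            # reuse every manifest defined in the base

replicas:
  - name: elasticsearch   # assumed workload name in the base
    count: 1              # a single replica is usually enough for development
```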
# Add Helm repos
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo add grafana https://grafana.github.io/helm-charts
helm repo update

# Clone this repository
git clone https://github.com/SanjaySundarMurthy/k8s-observability-stack.git
cd k8s-observability-stack

# Install the stack
helm upgrade --install monitoring prometheus-community/kube-prometheus-stack \
  -f helm/values/prometheus-values.yaml \
  -n monitoring --create-namespace

# Install ELK Stack
kubectl apply -k kustomize/overlays/production/

# Deploy to development
kubectl apply -k kustomize/overlays/dev/

# Deploy to production
kubectl apply -k kustomize/overlays/production/

| Dashboard | Description |
|---|---|
| Kubernetes Cluster | Cluster-wide overview |
| Node Exporter | Node-level metrics |
| Pod Monitoring | Pod resource usage |
| Nginx Ingress | Ingress metrics |
| API Server | Kubernetes API metrics |
| etcd | etcd cluster health |
| CoreDNS | DNS metrics |
| Custom Application | App-specific metrics |
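Dashboards like these are typically loaded through a Grafana dashboard provider rather than imported by hand. The following is a minimal sketch of such a provider file. The provider name, folder name, and mount path are assumptions and may differ from what `grafana/provisioning/` actually contains.

```yaml
# Hypothetical dashboard provider file (names and paths are assumed)
apiVersion: 1
providers:
  - name: default
    orgId: 1
    folder: Kubernetes                     # assumed Grafana folder name
    type: file
    disableDeletion: false
    options:
      path: /var/lib/grafana/dashboards    # assumed path where the JSON files are mounted
```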
- Node down > 5 minutes
- Pod CrashLoopBackOff
- PersistentVolume > 90% full
- API server errors > 10%
- CPU usage > 80%
- Memory usage > 85%
- Pod restart count > 5/hour
- Certificate expiry < 30 days
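To show how thresholds like those above might translate into rules, here is a hedged sketch of two of them: PersistentVolume usage above 90% and node CPU usage above 80%. The group name, the `for` durations, and the severity labels are assumptions. The expressions rely on standard kubelet and node-exporter metrics and are not copied from this repository's rule files.

```yaml
# Hypothetical rules file, e.g. under prometheus/alerting-rules/
groups:
  - name: threshold.rules
    rules:
      - alert: PersistentVolumeAlmostFull
        expr: |
          kubelet_volume_stats_used_bytes / kubelet_volume_stats_capacity_bytes > 0.90
        for: 10m                      # assumed duration
        labels:
          severity: critical          # assumed severity
        annotations:
          summary: "PVC {{ $labels.persistentvolumeclaim }} in {{ $labels.namespace }} is over 90% full"
      - alert: NodeHighCpuUsage
        expr: |
          100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 15m                      # assumed duration
        labels:
          severity: warning           # assumed severity
        annotations:
          summary: "CPU usage on {{ $labels.instance }} is above 80%"
```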
# prometheus/alerting-rules/pod-alerts.yaml
groups:
  - name: pod.rules
    rules:
      - alert: PodCrashLooping
        expr: |
          rate(kube_pod_container_status_restarts_total[15m]) * 60 * 5 > 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Pod {{ $labels.pod }} is crash looping"
          description: "Pod {{ $labels.pod }} in namespace {{ $labels.namespace }} is restarting frequently"

# helm/values/prometheus-values.yaml
prometheus:
  prometheusSpec:
    retention: 30d
    retentionSize: 50GB
    storageSpec:
      volumeClaimTemplate:
        spec:
          storageClassName: managed-premium
          resources:
            requests:
              storage: 100Gi

grafana:
  grafana.ini:
    auth.azuread:
      enabled: true
      client_id: ${AZURE_AD_CLIENT_ID}
      client_secret: ${AZURE_AD_CLIENT_SECRET}
      auth_url: https://login.microsoftonline.com/${TENANT_ID}/oauth2/v2.0/authorize
      token_url: https://login.microsoftonline.com/${TENANT_ID}/oauth2/v2.0/token

- Cluster Metrics: Node count, pod count, resource utilization
- Node Metrics: CPU, memory, disk, network
- Pod Metrics: Container resources, restart counts
- Application Metrics: Custom metrics via ServiceMonitor
- Ingress Metrics: Request rate, latency, errors
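Application metrics are picked up through `ServiceMonitor` resources. The sketch below shows roughly what one could look like for a hypothetical `custom-app` Service exposing a port named `metrics`. The app name, namespaces, and port name are assumptions. The `release: monitoring` label matches the Helm release name used in the install command, which is what kube-prometheus-stack selects on by default.

```yaml
# Hypothetical ServiceMonitor, e.g. under prometheus/service-monitors/
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: custom-app
  namespace: monitoring
  labels:
    release: monitoring        # must match the Helm release for default discovery
spec:
  namespaceSelector:
    matchNames:
      - default                # assumed namespace of the application
  selector:
    matchLabels:
      app: custom-app          # assumed label on the application's Service
  endpoints:
    - port: metrics            # assumed name of the Service port
      path: /metrics
      interval: 30s
```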
- RBAC for all components
- TLS encryption for internal communication
- Network policies for isolation
- Secret management via External Secrets Operator
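As one example of network isolation, the sketch below lets only Grafana pods reach Prometheus on port 9090 inside the `monitoring` namespace. The pod labels are the ones commonly set by the kube-prometheus-stack chart, but they are assumptions here and should be checked against the running pods. A real policy would also need to admit Alertmanager and any other clients that query Prometheus.

```yaml
# Hypothetical NetworkPolicy (pod labels are assumed)
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-grafana-to-prometheus
  namespace: monitoring
spec:
  podSelector:
    matchLabels:
      app.kubernetes.io/name: prometheus     # policy applies to Prometheus pods
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app.kubernetes.io/name: grafana   # only Grafana pods may connect
      ports:
        - protocol: TCP
          port: 9090
```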
MIT License
Sanjay S - Senior DevOps Engineer