Kubernetes Observability Stack 📊

A monitoring and observability stack for Kubernetes clusters, covering metrics, dashboards, alerting, log aggregation, and distributed tracing. It is intended for clusters running 100+ services.

🎯 Features

  • Metrics: Prometheus with pre-configured alerting rules
  • Visualization: Grafana dashboards for all components
  • Logging: EFK Stack (Elasticsearch, Fluentd, Kibana) for centralized log management
  • Tracing: Jaeger for distributed tracing
  • Alerting: Alertmanager with PagerDuty/Slack integration

🏗️ Architecture

                    ┌─────────────────────────────────────┐
                    │           Grafana                    │
                    │    (Visualization & Dashboards)      │
                    └───────────────┬─────────────────────┘
                                    │
        ┌───────────────────────────┼───────────────────────────┐
        │                           │                           │
        ▼                           ▼                           ▼
┌───────────────┐         ┌───────────────┐         ┌───────────────┐
│  Prometheus   │         │ Elasticsearch │         │    Jaeger     │
│   (Metrics)   │         │   (Logs)      │         │  (Tracing)    │
└───────────────┘         └───────────────┘         └───────────────┘
        │                           │                           │
        ▼                           ▼                           ▼
┌───────────────┐         ┌───────────────┐         ┌───────────────┐
│ Alertmanager  │         │    Fluentd    │         │ Jaeger Agent  │
└───────────────┘         └───────────────┘         └───────────────┘
        │                           │                           │
        └───────────────────────────┼───────────────────────────┘
                                    │
                    ┌───────────────┴───────────────┐
                    │     Kubernetes Cluster         │
                    │  (Pods, Services, Nodes)       │
                    └───────────────────────────────┘
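
Grafana sits on top of the three backends in the diagram, so each one needs to be registered as a datasource. The fragment below is a minimal sketch of what a provisioning file under grafana/datasources/ might contain. The file name, service hostnames, namespaces, and the Elasticsearch index pattern are all assumptions for illustration and should be adjusted to the actual deployment.

```yaml
# grafana/datasources/datasources.yaml (hypothetical file name)
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    # Assumed in-cluster service created by the Prometheus Operator
    url: http://prometheus-operated.monitoring.svc:9090
    isDefault: true
  - name: Elasticsearch
    type: elasticsearch
    access: proxy
    # Assumed service name and namespace
    url: http://elasticsearch.logging.svc:9200
    jsonData:
      # Assumed index pattern; older Grafana versions read this from `database` instead
      index: "logstash-*"
      timeField: "@timestamp"
  - name: Jaeger
    type: jaeger
    access: proxy
    # 16686 is the Jaeger query/UI port; the hostname is an assumption
    url: http://jaeger-query.tracing.svc:16686
```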

📁 Project Structure

├── prometheus/
│   ├── prometheus.yaml
│   ├── alerting-rules/
│   │   ├── node-alerts.yaml
│   │   ├── pod-alerts.yaml
│   │   └── custom-alerts.yaml
│   └── service-monitors/
├── grafana/
│   ├── dashboards/
│   │   ├── kubernetes-cluster.json
│   │   ├── node-exporter.json
│   │   ├── pods-monitoring.json
│   │   └── custom-app.json
│   ├── datasources/
│   └── provisioning/
├── alertmanager/
│   ├── config.yaml
│   └── templates/
├── elasticsearch/
│   ├── elasticsearch.yaml
│   ├── kibana.yaml
│   └── fluentd/
├── jaeger/
│   └── jaeger-all-in-one.yaml
├── helm/
│   └── values/
└── kustomize/
    ├── base/
    └── overlays/

🚀 Quick Start

Using Helm

# Add Helm repos
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo add grafana https://grafana.github.io/helm-charts
helm repo update

# Clone this repository
git clone https://github.com/SanjaySundarMurthy/k8s-observability-stack.git
cd k8s-observability-stack

# Install the stack
helm upgrade --install monitoring prometheus-community/kube-prometheus-stack \
  -f helm/values/prometheus-values.yaml \
  -n monitoring --create-namespace

# Install the EFK logging stack (Elasticsearch, Fluentd, Kibana)
kubectl apply -k kustomize/overlays/production/

Using Kustomize

# Deploy to development
kubectl apply -k kustomize/overlays/dev/

# Deploy to production
kubectl apply -k kustomize/overlays/production/
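
For orientation, an overlay is just a kustomization.yaml that points at the shared base and applies environment-specific changes. The following is a sketch of what the production overlay could look like, not the repository's actual file. The namespace, the resource name `elasticsearch`, and the replica count are assumptions.

```yaml
# kustomize/overlays/production/kustomization.yaml (illustrative contents)
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization

# Assumed target namespace for the logging components
namespace: logging

# Reuse everything defined in the shared base
resources:
  - ../../base

# Scale the (assumed) Elasticsearch workload for production
replicas:
  - name: elasticsearch
    count: 3
```

Running `kubectl kustomize kustomize/overlays/production/` renders the manifests without applying them, which is a convenient way to review an overlay before deploying it.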

📊 Pre-built Dashboards

Dashboard             Description
--------------------  ------------------------
Kubernetes Cluster    Cluster-wide overview
Node Exporter         Node-level metrics
Pod Monitoring        Pod resource usage
Nginx Ingress         Ingress metrics
API Server            Kubernetes API metrics
etcd                  etcd cluster health
CoreDNS               DNS metrics
Custom Application    App-specific metrics
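
Grafana can load dashboard JSON files from disk through a dashboard provider. A minimal sketch of such a provider, as it might appear under grafana/provisioning/, is shown here. The file name, folder name, and mount path are assumptions.

```yaml
# grafana/provisioning/dashboards.yaml (hypothetical file name)
apiVersion: 1
providers:
  - name: default
    orgId: 1
    # Grafana folder the dashboards appear in (assumed name)
    folder: Kubernetes
    type: file
    disableDeletion: false
    # How often Grafana rescans the directory for changed JSON files
    updateIntervalSeconds: 30
    options:
      # Assumed path where the files from grafana/dashboards/ are mounted
      path: /var/lib/grafana/dashboards
```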

🔔 Alerting Rules

Critical Alerts

  • Node down > 5 minutes
  • Pod CrashLoopBackOff
  • PersistentVolume > 90% full
  • API server errors > 10%
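
As an illustration, the first and third critical conditions above could be expressed roughly as follows. This is a sketch rather than the rules shipped in prometheus/alerting-rules/. The `job="node-exporter"` label, the alert names, and the 10-minute hold for the volume alert are assumptions.

```yaml
groups:
  - name: node.rules
    rules:
      - alert: NodeDown
        # `up` is 0 when Prometheus fails to scrape the target.
        # The job label value is an assumption.
        expr: up{job="node-exporter"} == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Node {{ $labels.instance }} has been unreachable for more than 5 minutes"
      - alert: PersistentVolumeAlmostFull
        # Ratio of used to total bytes, as reported by the kubelet per PVC
        expr: |
          kubelet_volume_stats_used_bytes / kubelet_volume_stats_capacity_bytes > 0.90
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: "PVC {{ $labels.persistentvolumeclaim }} in {{ $labels.namespace }} is over 90% full"
```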

Warning Alerts

  • CPU usage > 80%
  • Memory usage > 85%
  • Pod restart count > 5/hour
  • Certificate expiry < 30 days
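
The first three warning thresholds map fairly directly onto node-exporter and kube-state-metrics series. A possible encoding is sketched below. The group name, alert names, and `for` durations are assumptions, and the actual rule files may differ.

```yaml
groups:
  - name: resource-warnings.rules
    rules:
      - alert: NodeHighCpuUsage
        # Percentage of non-idle CPU time per node, averaged over all cores
        expr: |
          100 * (1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))) > 80
        for: 10m
        labels:
          severity: warning
      - alert: NodeHighMemoryUsage
        # Percentage of memory that is not available to new workloads
        expr: |
          100 * (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) > 85
        for: 10m
        labels:
          severity: warning
      - alert: PodRestartingFrequently
        # More than 5 container restarts within the last hour
        expr: increase(kube_pod_container_status_restarts_total[1h]) > 5
        labels:
          severity: warning
```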

Configuration Example

# prometheus/alerting-rules/pod-alerts.yaml
groups:
  - name: pod.rules
    rules:
      - alert: PodCrashLooping
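        # Per-second restart rate over 15m, scaled to a 5-minute window (60 * 5 seconds).
        # Any value above 0 means the container restarted during that window.
        expr: |
          rate(kube_pod_container_status_restarts_total[15m]) * 60 * 5 > 0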
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Pod {{ $labels.pod }} is crash looping"
          description: "Pod {{ $labels.pod }} in namespace {{ $labels.namespace }} is restarting frequently"
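
The `severity` label set by a rule is what Alertmanager can route on. The sketch below shows one way alertmanager/config.yaml might send critical alerts to PagerDuty and everything else to Slack. Receiver names, the channel, and the placeholder credentials are assumptions, and the `matchers` syntax assumes Alertmanager 0.22 or later.

```yaml
route:
  # Default receiver for anything not matched below
  receiver: slack-warnings
  group_by: ["alertname", "namespace"]
  routes:
    # Page on-call only for critical alerts
    - matchers:
        - severity="critical"
      receiver: pagerduty-critical

receivers:
  - name: pagerduty-critical
    pagerduty_configs:
      # Placeholder; inject the real key from a Secret rather than committing it
      - routing_key: "REPLACE_WITH_PAGERDUTY_ROUTING_KEY"
  - name: slack-warnings
    slack_configs:
      # Placeholder webhook URL and assumed channel name
      - api_url: "https://hooks.slack.com/services/REPLACE/ME"
        channel: "#alerts"
        send_resolved: true
```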

🔧 Configuration

Prometheus Storage

# helm/values/prometheus-values.yaml
prometheus:
  prometheusSpec:
    retention: 30d
    retentionSize: 50GB
    storageSpec:
      volumeClaimTemplate:
        spec:
          storageClassName: managed-premium
          resources:
            requests:
              storage: 100Gi

Grafana Azure AD (OAuth)

grafana:
  grafana.ini:
    auth.azuread:
      enabled: true
      client_id: ${AZURE_AD_CLIENT_ID}
      client_secret: ${AZURE_AD_CLIENT_SECRET}
      auth_url: https://login.microsoftonline.com/${TENANT_ID}/oauth2/v2.0/authorize
      token_url: https://login.microsoftonline.com/${TENANT_ID}/oauth2/v2.0/token
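
The `${...}` placeholders are resolved from environment variables in the Grafana container. One way to supply them, assuming the Grafana subchart's `envFromSecret` value is used, is sketched here. The Secret name is hypothetical, and the Secret must already exist in the release namespace with the keys AZURE_AD_CLIENT_ID, AZURE_AD_CLIENT_SECRET, and TENANT_ID.

```yaml
# helm/values/prometheus-values.yaml (additional fragment, illustrative)
grafana:
  # Every key in this Secret becomes an environment variable in the Grafana pod.
  # "grafana-azuread" is a hypothetical Secret name.
  envFromSecret: grafana-azuread
```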

📈 Metrics Collected

  • Cluster Metrics: Node count, pod count, resource utilization
  • Node Metrics: CPU, memory, disk, network
  • Pod Metrics: Container resources, restart counts
  • Application Metrics: Custom metrics via ServiceMonitor
  • Ingress Metrics: Request rate, latency, errors
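
Application metrics are picked up by declaring a ServiceMonitor that selects the application's Service. A minimal sketch follows. The application name, namespace, port name, and scrape interval are assumptions, and the `release: monitoring` label assumes the Helm release name used in the Quick Start, which is what kube-prometheus-stack selects on by default.

```yaml
# prometheus/service-monitors/custom-app.yaml (hypothetical file name)
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: custom-app
  namespace: monitoring
  labels:
    # Must match the label selector of the Prometheus instance
    release: monitoring
spec:
  # Namespace where the application's Service lives (assumed)
  namespaceSelector:
    matchNames:
      - default
  # Select the Service by label (assumed label)
  selector:
    matchLabels:
      app: custom-app
  endpoints:
    # Named Service port exposing Prometheus metrics (assumed name)
    - port: metrics
      path: /metrics
      interval: 30s
```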

🔐 Security

  • RBAC for all components
  • TLS encryption for internal communication
  • Network policies for isolation
  • Secret management via External Secrets Operator
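
As an example of the network isolation mentioned above, a policy along these lines would restrict who can reach Prometheus on its HTTP port. It is only a sketch: the pod labels and namespace are assumptions, and a real policy would also need to admit other clients such as Alertmanager or the Prometheus Operator.

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-grafana-to-prometheus
  namespace: monitoring
spec:
  # Applies to the Prometheus pods (assumed label)
  podSelector:
    matchLabels:
      app.kubernetes.io/name: prometheus
  policyTypes:
    - Ingress
  ingress:
    # Only Grafana pods in the same namespace may connect (assumed label)
    - from:
        - podSelector:
            matchLabels:
              app.kubernetes.io/name: grafana
      ports:
        - protocol: TCP
          port: 9090
```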

📄 License

MIT License

👤 Author

Sanjay S - Senior DevOps Engineer
