Kubernetes Observability Stack 📊

A comprehensive monitoring and observability solution for Kubernetes clusters. Get real-time insights, alerting, and log aggregation for 100+ services.

🎯 Features

Metrics: Prometheus with pre-configured alerting rules
Visualization: Grafana dashboards for all components
Logging: Elasticsearch, Fluentd, and Kibana (EFK) for centralized log management
Tracing: Jaeger for distributed tracing
Alerting: AlertManager with PagerDuty/Slack integration

🏗️ Architecture

                    ┌─────────────────────────────────────┐
                    │           Grafana                    │
                    │    (Visualization & Dashboards)      │
                    └───────────────┬─────────────────────┘
                                    │
        ┌───────────────────────────┼───────────────────────────┐
        │                           │                           │
        ▼                           ▼                           ▼
┌───────────────┐         ┌───────────────┐         ┌───────────────┐
│  Prometheus   │         │ Elasticsearch │         │    Jaeger     │
│   (Metrics)   │         │   (Logs)      │         │  (Tracing)    │
└───────────────┘         └───────────────┘         └───────────────┘
        │                           │                           │
        ▼                           ▼                           ▼
┌───────────────┐         ┌───────────────┐         ┌───────────────┐
│ AlertManager  │         │    Fluentd    │         │ Jaeger Agent  │
└───────────────┘         └───────────────┘         └───────────────┘
        │                           │                           │
        └───────────────────────────┼───────────────────────────┘
                                    │
                    ┌───────────────┴───────────────┐
                    │     Kubernetes Cluster         │
                    │  (Pods, Services, Nodes)       │
                    └───────────────────────────────┘

📁 Project Structure

├── prometheus/
│   ├── prometheus.yaml
│   ├── alerting-rules/
│   │   ├── node-alerts.yaml
│   │   ├── pod-alerts.yaml
│   │   └── custom-alerts.yaml
│   └── service-monitors/
├── grafana/
│   ├── dashboards/
│   │   ├── kubernetes-cluster.json
│   │   ├── node-exporter.json
│   │   ├── pods-monitoring.json
│   │   └── custom-app.json
│   ├── datasources/
│   └── provisioning/
├── alertmanager/
│   ├── config.yaml
│   └── templates/
├── elasticsearch/
│   ├── elasticsearch.yaml
│   ├── kibana.yaml
│   └── fluentd/
├── jaeger/
│   └── jaeger-all-in-one.yaml
├── helm/
│   └── values/
└── kustomize/
    ├── base/
    └── overlays/

🚀 Quick Start

Using Helm

# Add Helm repos
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo add grafana https://grafana.github.io/helm-charts
helm repo update

# Clone this repository
git clone https://github.com/SanjaySundarMurthy/k8s-observability-stack.git
cd k8s-observability-stack

# Install the stack
helm upgrade --install monitoring prometheus-community/kube-prometheus-stack \
  -f helm/values/prometheus-values.yaml \
  -n monitoring --create-namespace

# Install the logging stack (Elasticsearch, Fluentd, Kibana)
kubectl apply -k kustomize/overlays/production/

Using Kustomize

# Deploy to development
kubectl apply -k kustomize/overlays/dev/

# Deploy to production
kubectl apply -k kustomize/overlays/production/
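
This README does not show what the overlays contain. As a rough sketch only, a dev overlay could reuse the shared base and patch a few settings. The namespace, the StatefulSet name, and the replica count below are assumptions for illustration, not the repository's actual values.

```yaml
# kustomize/overlays/dev/kustomization.yaml (hypothetical contents)
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization

# Pull in everything defined in the shared base.
resources:
  - ../../base

# Assumed namespace; adjust to match your cluster.
namespace: monitoring-dev

# Example patch: run a single Elasticsearch replica in dev.
# The StatefulSet name "elasticsearch" is an assumption.
patches:
  - target:
      kind: StatefulSet
      name: elasticsearch
    patch: |-
      - op: replace
        path: /spec/replicas
        value: 1
```

Running `kubectl kustomize kustomize/overlays/dev/` renders the resulting manifests without applying them, which is a convenient way to review an overlay before deploying it.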

📊 Pre-built Dashboards

Dashboard             Description
Kubernetes Cluster    Cluster-wide overview
Node Exporter         Node-level metrics
Pod Monitoring        Pod resource usage
Nginx Ingress         Ingress metrics
API Server            Kubernetes API metrics
etcd                  etcd cluster health
CoreDNS               DNS metrics
Custom Application    App-specific metrics
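
When Grafana is installed through kube-prometheus-stack, as in the Quick Start, its dashboard sidecar can pick up dashboards from labeled ConfigMaps. The sketch below assumes the sidecar is enabled and watches for the label `grafana_dashboard: "1"`; check your chart values, since the label is configurable. The ConfigMap name and the placeholder dashboard JSON are hypothetical.

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: custom-app-dashboard        # hypothetical name
  namespace: monitoring
  labels:
    grafana_dashboard: "1"          # assumed sidecar label; verify in your chart values
data:
  # In practice this would hold the contents of
  # grafana/dashboards/custom-app.json; a stub is shown here.
  custom-app.json: |-
    {
      "title": "Custom Application",
      "uid": "custom-app",
      "panels": []
    }
```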

🔔 Alerting Rules

Critical Alerts

Node down > 5 minutes
Pod CrashLoopBackOff
PersistentVolume > 90% full
API server errors > 10%
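
As an illustration of how one of the critical conditions above could be expressed, here is a minimal sketch of a "node down for more than 5 minutes" rule. It is not the repository's actual rule: the group name, alert name, and the `node-exporter` job label are assumptions.

```yaml
# prometheus/alerting-rules/node-alerts.yaml (illustrative rule only)
groups:
  - name: node.rules
    rules:
      - alert: NodeDown
        # "up" is 0 when Prometheus fails to scrape a target.
        # The job label "node-exporter" is an assumption.
        expr: up{job="node-exporter"} == 0
        # Only fire once the target has been down for 5 minutes.
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Node {{ $labels.instance }} is down"
          description: "{{ $labels.instance }} has been unreachable for more than 5 minutes"
```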

Warning Alerts

CPU usage > 80%
Memory usage > 85%
Pod restart count > 5/hour
Certificate expiry < 30 days
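
A warning threshold such as "CPU usage > 80%" can be sketched with the standard node_exporter metric `node_cpu_seconds_total`. The group name, alert name, and the 10-minute hold time are assumptions chosen for illustration.

```yaml
groups:
  - name: node-resource.rules
    rules:
      - alert: HighCpuUsage
        # Non-idle CPU time per instance as a percentage,
        # averaged across cores over a 5-minute window.
        expr: |
          100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 10m   # assumed hold time before the alert fires
        labels:
          severity: warning
        annotations:
          summary: "CPU usage above 80% on {{ $labels.instance }}"
```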

Configuration Example

# prometheus/alerting-rules/pod-alerts.yaml
groups:
  - name: pod.rules
    rules:
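      - alert: PodCrashLooping
        # rate() is per second, so * 60 * 5 scales it to restarts per 5 minutes;
        # any value above 0 means a container restarted within the 15m window.
        expr: |
          rate(kube_pod_container_status_restarts_total[15m]) * 60 * 5 > 0
        # The condition must hold for 5 minutes before the alert fires.
        for: 5m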
        labels:
          severity: critical
        annotations:
          summary: "Pod {{ $labels.pod }} is crash looping"
          description: "Pod {{ $labels.pod }} in namespace {{ $labels.namespace }} is restarting frequently"
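
The `severity` label set in the rule above is what AlertManager can route on. The following is a minimal sketch of such routing, assuming critical alerts go to PagerDuty and everything else to Slack. The receiver names, the channel, and the placeholder credentials are assumptions, not the contents of the repository's `alertmanager/config.yaml`.

```yaml
# alertmanager/config.yaml (illustrative routing only)
route:
  receiver: slack-warnings            # default receiver for unmatched alerts
  group_by: ['alertname', 'namespace']
  routes:
    # Alerts labeled severity=critical page the on-call engineer.
    - matchers:
        - severity = "critical"
      receiver: pagerduty-critical

receivers:
  - name: pagerduty-critical
    pagerduty_configs:
      - routing_key: <pagerduty-integration-key>   # placeholder
  - name: slack-warnings
    slack_configs:
      - api_url: <slack-webhook-url>               # placeholder
        channel: '#alerts'                         # assumed channel
        send_resolved: true
```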

🔧 Configuration

Prometheus Storage

# helm/values/prometheus-values.yaml
prometheus:
  prometheusSpec:
    retention: 30d
    retentionSize: 50GB
    storageSpec:
      volumeClaimTemplate:
        spec:
          storageClassName: managed-premium
          resources:
            requests:
              storage: 100Gi

Grafana Azure AD Authentication

grafana:
  grafana.ini:
    auth.azuread:
      enabled: true
      client_id: ${AZURE_AD_CLIENT_ID}
      client_secret: ${AZURE_AD_CLIENT_SECRET}
      auth_url: https://login.microsoftonline.com/${TENANT_ID}/oauth2/v2.0/authorize
      token_url: https://login.microsoftonline.com/${TENANT_ID}/oauth2/v2.0/token
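
The `${...}` placeholders above have to be present in the Grafana container's environment. One possible way to supply them, sketched here under the assumption that the Grafana subchart's `envFromSecret` value is used, is to reference an existing Kubernetes Secret. The Secret name is hypothetical.

```yaml
# helm/values/prometheus-values.yaml (sketch)
grafana:
  # Expose every key of an existing Secret in the release namespace
  # as an environment variable in the Grafana container.
  # "grafana-azuread" is a hypothetical Secret that would hold the keys
  # AZURE_AD_CLIENT_ID, AZURE_AD_CLIENT_SECRET, and TENANT_ID.
  envFromSecret: grafana-azuread
```

Keeping the credentials in a Secret avoids committing them to the values file.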

📈 Metrics Collected

Cluster Metrics: Node count, pod count, resource utilization
Node Metrics: CPU, memory, disk, network
Pod Metrics: Container resources, restart counts
Application Metrics: Custom metrics via ServiceMonitor
Ingress Metrics: Request rate, latency, errors
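
For application metrics, a ServiceMonitor tells the Prometheus Operator which Services to scrape. The sketch below assumes an application Service labeled `app: custom-app` in the `default` namespace with a port named `metrics`; all of these names are hypothetical. The `release: monitoring` label matches the Helm release name used in the Quick Start, which kube-prometheus-stack uses by default to select ServiceMonitors.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: custom-app                  # hypothetical name
  namespace: monitoring
  labels:
    release: monitoring             # assumed to match the Helm release name
spec:
  # Namespace(s) in which to look for matching Services.
  namespaceSelector:
    matchNames:
      - default                     # assumed application namespace
  # Services to scrape, selected by label.
  selector:
    matchLabels:
      app: custom-app               # assumed Service label
  endpoints:
    - port: metrics                 # name of the Service port (assumed)
      path: /metrics
      interval: 30s
```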

🔐 Security

RBAC for all components
TLS encryption for internal communication
Network policies for isolation
Secret management via External Secrets Operator
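
As an illustration of network isolation, the following sketch restricts ingress to the Prometheus pods so that only Grafana can reach port 9090. The policy name and the pod labels are assumptions; check the labels on your own pods before using anything like this.

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-grafana-to-prometheus   # hypothetical name
  namespace: monitoring
spec:
  # Pods this policy applies to (label is an assumption).
  podSelector:
    matchLabels:
      app.kubernetes.io/name: prometheus
  policyTypes:
    - Ingress
  ingress:
    # Allow traffic only from Grafana pods in the same namespace.
    - from:
        - podSelector:
            matchLabels:
              app.kubernetes.io/name: grafana
      ports:
        - protocol: TCP
          port: 9090
```

Once a policy selects the Prometheus pods, any ingress not explicitly allowed is dropped, so other legitimate clients would need their own `from` entries.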

📄 License

MIT License

👤 Author

Sanjay S - Senior DevOps Engineer
