Provides teams, organizations, and aspiring DevOps engineers with a comprehensive, modern guide to DevOps best practices, tools, and workflows for building, securing, and deploying applications efficiently.
Note: These checklists are opinionated and based on industry experience with modern DevOps practices and DORA principles. They represent common patterns but are not universal truth. You should adapt them to your specific needs and context. Contributions, discussions, and improvements are more than welcome!
🚧 This repository is continuously evolving with DevOps best practices. Contributions and real-world insights are encouraged!
```
devops-checklist/
├── README.md        # Main comprehensive checklist (single source of truth)
├── CONTRIBUTING.md  # How to contribute + style guide
├── LICENSE          # Apache 2.0
├── credits.md       # Logo and attribution details
├── .github/         # Issue & PR templates
└── images/          # Header + technology logos
```
- Team & Culture, Git, CI/CD Tooling
- Docker & Artifact Management
- DevSecOps & Supply Chain Security
- Infrastructure as Code (Terraform)
- Cloud Platform (AWS)
- Kubernetes Orchestration & GitOps
- Observability (Metrics, Logs, Traces)
- Governance & Policy as Code
- FinOps & Cloud Cost Optimization
| Tool Category | Recommended Tool(s) | Why? |
|---|---|---|
| Version Control | Git (Trunk-Based) | Enables continuous, high-frequency delivery |
| CI/CD Orchestration | GitHub Actions / GitLab CI | Reduced operational overhead; native Git integration |
| Infrastructure as Code | Terraform | Multi-cloud capability; mature module ecosystem |
| Policy as Code | OPA Gatekeeper | Declarative governance for Kubernetes & infra |
| Observability (MLT) | Prometheus + Loki + Tempo | Unified open-source stack for Metrics, Logs, Tracing |
| Checklist Section | Primary DORA Metric Impacted |
|---|---|
| CI/CD Tooling | Deployment Frequency, Lead Time |
| Version Control - Git | Lead Time for Changes |
| DevSecOps | Change Failure Rate (shift-left reduces defects) |
| Observability | MTTR (faster detection & recovery) |
| Kubernetes / GitOps | Deployment Frequency, Change Failure Rate |
Guidance maps to improved deployment frequency, shorter lead times, lower MTTR, and reduced change failure rate.
This is an aspirational DevOps maturity checklist designed to help teams assess and improve their practices. Think of it as a scorecard for your DevOps journey.
Mark each item based on your current state:
- ✅ Achieved - Fully implemented and working well
- 🔄 In Progress - Partially implemented or being worked on
- ⏳ Not Yet - Not started or planned for future
- ❌ Not Applicable - Doesn't fit your context
Using the checklist:
- Assess: Go through sections relevant to your team and mark current state
- Prioritize: Identify high-impact items to work on next (focus on ⚠️ REQUIRED and ⭐ PREFERRED items first)
- Track: Revisit quarterly to measure progress
- Adapt: Not everything applies to every organization - skip what doesn't make sense for you
Calculate your DevOps maturity score per section:
- Score = (Achieved items / Total applicable items) × 100
- 0-30%: Beginning - Focus on foundations (Git, CI/CD basics, basic monitoring)
- 31-60%: Developing - Expand capabilities (security scanning, IaC, advanced monitoring)
- 61-85%: Mature - Optimize and scale (GitOps, service mesh, FinOps, policy as code)
- 86-100%: Leading - Innovation and continuous improvement
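As a quick illustration, the score formula above can be computed with a small shell function (`maturity_score` is an illustrative name, not part of the checklist):

```shell
#!/bin/sh
# Maturity score = (achieved items / total applicable items) * 100.
# Integer arithmetic is good enough for a scorecard.
maturity_score() {
  achieved=$1
  applicable=$2
  echo $(( achieved * 100 / applicable ))
}

maturity_score 7 12   # 7 of 12 applicable items achieved -> 58 (Developing)
```

Run it per section and compare against the bands above to decide where to focus next.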
Use this as a learning roadmap and career development guide:
- Check off items as you learn and gain hands-on experience
- Focus on one section at a time following the 3-month roadmap
- Build portfolio projects demonstrating key practices
- Track your progress toward DevOps engineer roles
- Start with Team Section
- Go through each technology section
- Mark what you already have in place
- Identify gaps and prioritize improvements
- Create an implementation roadmap
- Review technologies you work with
- Check off best practices you're following
- Learn from sections where you have gaps
- Apply improvements to your workflow
- Share knowledge with your team
- Start Here: For Aspiring DevOps Engineers
- Follow: The 3-month learning roadmap
- Build: Portfolio projects from the checklist
- Practice: Set up tools locally
- Track: Check off skills as you learn
- Focus on Git, Docker, and CI/CD sections
- Understand how your code reaches production
- Learn security best practices (SAST/DAST)
- Explore AWS basics
- Practice with Docker locally
Week 1 – Version Control
- Review Git workflows and enable branch protection
- Add pull request templates and required reviews
- Configure client-side hooks for linting or secrets scanning
Week 2 – CI/CD Foundations
- Stand up Jenkins or GitHub Actions
- Create an automated build-and-test pipeline
- Add notifications to Slack/Teams and start tracking build time
Week 3 – Code Quality & Security
- Integrate SonarQube (or an equivalent) into pipelines
- Enforce quality gates and remediate critical issues
- Layer in SAST/SCA scans and document remediation workflows
Week 4 – Containers & Deployment
- Harden Dockerfiles and enable image scanning
- Publish images to your chosen registry (ECR/ACR/GCR/Artifactory)
- Deploy to AWS ECS, Kubernetes, or a serverless target
| Section | What You'll Learn | Time to Read |
|---|---|---|
| Team | Roles, skills, culture, goals | 10 min |
| Production & Deployment | Release strategies, change management | 10 min |
| Git | Branching, workflows, security | 10 min |
| CI/CD Tooling | Jenkins JCasC, GitHub Actions, GitLab CI | 15 min |
| SonarQube | Code quality, coverage, quality gates | 5 min |
| Docker | Containers, registries, security | 10 min |
| Artifact Management | Artifactory, Nexus, cloud registries | 10 min |
| DevSecOps | SAST, DAST, SCA, supply chain | 15 min |
| Terraform (IaC) | Remote state, modules, automation | 15 min |
| Cloud Platform (AWS) | IAM, networking, cost | 20 min |
| Kubernetes Orchestration | EKS/GKE/AKS, Helm, GitOps | 20 min |
| Observability (MLT) | Metrics, logs, traces, SLOs | 15 min |
| Governance & Policy as Code | OPA, Sentinel, compliance | 10 min |
| FinOps & Cloud Cost | Tagging, budgets, optimization | 10 min |
Total reading time: ~3 hours (refer back often!)
# Minimal essentials
- Git
- Docker Desktop
- VS Code or preferred editor

# Full local lab
- Git
- Docker Desktop
- Jenkins (local or containerized)
- AWS CLI
- Terraform
- kubectl and Helm (if using Kubernetes)

# Cloud-first approach
- GitHub/GitLab for version control
- GitHub Actions / GitLab CI
- AWS Free Tier or preferred cloud
- Terraform Cloud (free tier)

- Create a sample application in your preferred language
- Push to GitHub/GitLab with branch protections enabled
- Containerize it with a secure Dockerfile
- Build a pipeline (Jenkins, GitHub Actions, or GitLab CI)
- Add unit tests, linting, and security scans
- Deploy to AWS ECS, Kubernetes, or a serverless target
- Capture metrics (build time, deployment duration, failure rate)
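For the last item, capturing build time can start as simply as timestamping the step in your pipeline script. A sketch (`sleep 1` stands in for your real build command):

```shell
#!/bin/sh
# Record how long a pipeline step takes, in whole seconds.
start=$(date +%s)
sleep 1                      # stand-in for the real step, e.g. `mvn package`
end=$(date +%s)
duration=$((end - start))
echo "build duration: ${duration}s"
```

Emit the duration to your CI logs or a metrics backend to start a baseline you can track over time.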
Personal Tracker Template
```markdown
# My DevOps Journey

## Completed ✅
- [x] Git basics
- [x] Docker fundamentals

## In Progress 🚧
- [ ] Jenkins or GitHub Actions pipelines
- [ ] AWS fundamentals

## Planned 📋
- [ ] Terraform modules
- [ ] Kubernetes & GitOps
```

Individual Milestones
- Build foundational skills (Git, Docker, CI/CD) in the first 3 months
- Ship an automation or infrastructure project by month 6
- Earn a certification or lead a production improvement by month 9
- Track DevOps/SRE readiness milestones every quarter
Team Cadence
- Monthly: Review 2-3 checklist sections together
- Quarterly: Recompute maturity score and update roadmap
- Bi-annually: Revisit architecture, cost, and compliance posture
- Do we need everything in this checklist? Focus on what matches your current maturity and business goals.
- What order should we learn things? Follow the 3-month roadmap and adapt as you grow.
- Is this suitable for beginners? Yes—each section scales from fundamentals to advanced practices.
- What if our tooling is different? Apply the principles; substitute equivalent tools (e.g., GitLab CI for GitHub Actions).
- How do we measure success? Track DORA metrics, SLO attainment, and cost/incident reductions.
Use the Learning Path (3 Months) checklist to structure your first quarter:
- Month 1: Git, Linux, shell scripting, CI/CD fundamentals
- Month 2: Pipelines (Jenkins/GitHub Actions), Docker, publish to a registry, AWS basics
- Month 3: Terraform, Kubernetes/ECS fundamentals, security scanning, observability basics
- Review checklist with the team
- Mark completed items with ✅
- Calculate maturity score (completed items / total items × 100)
- Identify high-impact gaps
- Create a quarterly improvement roadmap
- Reassess each quarter
- Cloud-Native Teams: Kubernetes → Observability → Governance → FinOps
- Security-First Teams: DevSecOps → Governance → Terraform security practices
- Cost-Conscious Teams: FinOps → Observability → Right-sizing & automation
- Startup / SMB: CI/CD → Docker → AWS basics → Monitoring foundations
Monitor these four keys continuously:
- Deployment Frequency – How often you ship production changes
- Lead Time for Changes – Time from code commit to production
- Mean Time to Recovery (MTTR) – Time to restore service after incidents
- Change Failure Rate – % of deployments causing failures/rollbacks
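Most of these can be derived from timestamps your CI/CD system already has. For example, lead time for changes is just deploy time minus commit time; a sketch using epoch seconds (`lead_time_hours` is an illustrative helper):

```shell
#!/bin/sh
# Lead time for changes = deploy timestamp - commit timestamp.
# In CI, commit_epoch could come from `git show -s --format=%ct HEAD`
# and deploy_epoch from `date +%s` at deploy time.
lead_time_hours() {
  commit_epoch=$1
  deploy_epoch=$2
  echo $(( (deploy_epoch - commit_epoch) / 3600 ))
}

lead_time_hours 1700000000 1700086400   # deploy 24 hours after commit -> 24
```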
This checklist intentionally includes expanded coverage beyond traditional DevOps basics:
- Multi-cloud clusters (EKS, GKE, AKS)
- Helm vs Kustomize usage guidance
- RBAC & least-privilege enforcement
- Service mesh (Istio, Linkerd, App Mesh) patterns
- GitOps (ArgoCD, Flux) deployment workflows
- RED/USE metrics patterns, Prometheus exporters
- Loki / ELK structured logging and retention
- OpenTelemetry instrumentation & tracing (Jaeger/Tempo)
- Unified Grafana dashboards and SLO-driven alerting
- OPA / Gatekeeper admission control policies
- HashiCorp Sentinel for Terraform enforcement
- Cloud Config/Azure Policy for continuous compliance
- Pipeline policy checks (conftest, Checkov, tfsec, Kyverno)
- Mandatory tagging and cost allocation hygiene
- Budget alerts & anomaly detection
- Right-sizing + autoscaling strategies
- Reserved Instances & Savings Plans adoption tracking
- Kubernetes cost management (Kubecost)
Tip: Treat these four domains (Kubernetes, Observability, Governance, FinOps) as additive maturity layers—don’t adopt all at once; layer them after strong CI/CD + security foundations.
- Start small: Don't try to implement everything at once
- Quick wins first: Tackle items that provide immediate value with low effort
- Document decisions: Record why you chose certain tools or skipped certain practices
- Share and collaborate: Use this checklist in team discussions and planning sessions
- Team 👥
- Production & Deployment
- Version Control - Git
- CI/CD Tooling
- Code Quality - SonarQube
- Containerization - Docker
- Artifact Management
- Application Security (DevSecOps)
- Infrastructure as Code - Terraform
- Cloud Platform - AWS
- Container Orchestration (ECS & Kubernetes/EKS)
- Kubernetes Orchestration
- Observability (The MLT Stack)
- Governance & Policy as Code
- FinOps & Cloud Cost Optimization
- Monitoring & Observability (The Three Pillars)
- Continuous Improvement
- Getting Started Guide
- Resources
- Contributing
- Credits
- License
Detailed checklist moved to docs/team.md.
➡️ See: Team Checklist
- Define Clear DevOps Responsibilities
- Core Responsibilities:
- CI/CD pipeline development and maintenance
- Infrastructure automation
- Deployment orchestration
- Monitoring and alerting setup
- Security integration (DevSecOps)
- Toolchain management
- Collaborative Responsibilities:
- Working with developers on deployment strategies
- Collaborating with security teams on compliance
- Supporting operations with infrastructure
- Document Everything
- Team responsibilities clearly written
- Runbooks for common operations
- Incident response procedures
- Onboarding documentation
- Tool usage guides
- Version Control (Git)
- Understanding of Git workflows (GitFlow, trunk-based)
- Branching strategies and merge strategies
- Git hooks and automation
- CI/CD Fundamentals
- Pipeline design and implementation
- Build automation
- Deployment automation
- Testing integration
- Scripting & Automation
- Shell scripting (Bash/Zsh)
- Python or another scripting language
- Configuration management basics
- Containerization
- Docker fundamentals
- Container orchestration concepts
- Image optimization
- Cloud Platform Knowledge
- At least one cloud platform (AWS, Azure, GCP)
- IaaS, PaaS, SaaS concepts
- Cloud services for compute, storage, networking
- Infrastructure as Code
- Terraform, CloudFormation, or similar
- Configuration management (Ansible, Chef, Puppet)
- Version control for infrastructure
- Security Awareness
- Secure coding practices
- Secrets management
- Vulnerability scanning
- Compliance basics (SOC2, HIPAA, etc.)
- Programming languages (Go, Python, Java)
- Kubernetes/EKS for container orchestration
- Advanced networking concepts
- Database administration basics
- Prometheus and Grafana for monitoring and observability
- Break Down Silos
- Development and Operations work together
- Shared responsibility for production
- Cross-functional collaboration
- Automation First Mindset
- Automate repetitive tasks
- Infrastructure as Code everywhere
- Self-service capabilities for developers
- Continuous Improvement
- Regular retrospectives
- Postmortem culture (blameless)
- Metrics-driven improvements
- Shift-Left Approach
- Security integrated early (DevSecOps)
- Testing early in the pipeline
- Quality checks from the start
- Define DORA Metrics (The Four Keys)
- Deployment frequency (How often you ship)
- Lead time for changes (Time from commit to production)
- Mean time to recovery (MTTR) (How fast you fix failures)
- Change failure rate (%) (How often deployments fail)
- Service Level Objectives (SLOs)
- Define SLOs for critical services (e.g., 99.9% uptime)
- Define SLIs (Service Level Indicators) to measure SLOs (e.g., latency, error rate)
- Pipeline execution time targets
- Deployment success rates
- Team Maturity Model
- Level 1: Manual Deployments - Moving to automation
- Level 2: Automated CI/CD - Pipelines established
- Level 3: Advanced Automation - Security integrated, monitoring
- Level 4: Full DevSecOps - Shift-left, self-service, optimization
Welcome to DevOps! 👋 This checklist is your roadmap.
- Month 1: Foundations & Version Control
- Master Git essentials (branching, pull requests, rebasing)
- Brush up on Linux fundamentals and shell scripting
- Learn CI/CD concepts and pipeline stages
- Build a personal knowledge base for quick notes
- Month 2: CI/CD, Containers & Cloud Basics
- Set up Jenkins or GitHub Actions locally/in the cloud
- Build a basic pipeline with automated tests
- Learn Docker fundamentals and publish an image to a registry
- Explore AWS core services (IAM, EC2, S3) using the free tier
- Month 3: Infrastructure, Security & Observability
- Provision cloud infrastructure with Terraform (remote state optional for solo learners)
- Learn Kubernetes/ECS fundamentals and deploy a sample app
- Integrate code quality (SonarQube) and security scans (SAST/SCA)
- Set up basic monitoring/logging (Prometheus/Grafana or managed services)
Build these to showcase your skills:
- Project 1: Simple CI/CD Pipeline
- Git repo → GitHub Actions/Jenkins → Build → Test → Deploy to server
- Project 2: Dockerized Application
- Multi-container app with Docker Compose
- Published to registry
- Project 3: AWS Infrastructure with Terraform
- VPC, EC2, RDS provisioned with IaC
- Documented and in version control
- Project 4: Complete DevSecOps Pipeline
- Git → CI Tool → SonarQube/SAST → SCA → Docker Build → DAST → Deploy to ECS/EKS
- Security scans integrated
- Monitoring and alerts set up
- Resume & Portfolio
- GitHub profile with projects
- LinkedIn profile optimized
- Personal blog/documentation
- Certifications (AWS, Docker, etc.)
- Networking
- Join DevOps communities
- Contribute to open source
- Attend meetups/conferences
- Follow industry leaders
Moved to a future docs/production.md (to be created if needed). Current overview retained.
- Define Deployment Strategy
- What triggers deployments? (commits, tags, manual)
- How often do you deploy? (continuous, daily, weekly)
- Who can trigger production deployments?
- Environments
- Development environment
- Testing/QA environment
- Staging environment (production-like)
- Production environment
- Environment parity: Staging closely mirrors production
- Same infrastructure configuration (IaC applied to all)
- Same resource sizes and scaling rules
- Same network topology
- Same security policies
- Same monitoring and logging setup
- Proper isolation between environments (separate AWS accounts/VPCs)
- Ephemeral test environments for feature branches (optional but recommended)
- Automatically provisioned per PR
- Automatically destroyed after merge/close
- Cost-effective testing of infrastructure changes
- Immutable Artifacts & Promotion
- Build once, deploy many times (same artifact across environments)
- Artifacts stored in registry (Docker images, JARs, etc.)
- No rebuilding for different environments (use config injection)
- Artifact versioning and traceability
- Promotion workflow: Dev → QA → Staging → Prod
- Artifact scanning before promotion
- Pipeline Stages
- Source code checkout
- Build & compile
- Unit tests
- Code quality checks (SonarQube)
- Security scans (SAST)
- Dependency scanning (SCA)
- Container image build
- Integration tests
- Security scans (DAST)
- Artifact storage (Nexus)
- Deployment to environments
- Post-deployment tests
- Choose Deployment Strategy
- Blue-Green Deployment: Two identical environments; switch traffic between them; easy rollback.
- Canary Deployment: Gradual rollout to subset of users; monitor metrics before full rollout.
- Rolling Deployment: Update instances one by one; zero downtime.
- Recreate (Not recommended for production)
- Versioning Strategy
- Semantic versioning (MAJOR.MINOR.PATCH)
- Git tags for releases
- Changelog maintained
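A release flow under semantic versioning can be mostly mechanical; a sketch (`bump_patch` is an illustrative helper, not a standard tool):

```shell
#!/bin/sh
# Bump the PATCH component of a MAJOR.MINOR.PATCH version string.
bump_patch() {
  major=${1%%.*}     # text before the first dot
  rest=${1#*.}       # text after the first dot
  minor=${rest%%.*}
  patch=${rest#*.}
  echo "${major}.${minor}.$((patch + 1))"
}

next=$(bump_patch 1.4.2)
echo "$next"   # 1.4.3
# Then cut the release as an annotated Git tag:
# git tag -a "v${next}" -m "Release v${next}" && git push origin "v${next}"
```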
- Rollback Capability
- Quick rollback procedure documented
- Automated rollback triggers
- Database migration rollback strategy
- Release Communication
- Release notes published
- Stakeholders notified
- Deployment windows communicated
Full best practices moved to docs/git.md.
➡️ See: Git Checklist
- Organization
- Separate repositories for services (microservices approach)
- Monorepo vs Multi-repo decision made
- Clear naming conventions
- Repository Content
- Application source code
- Infrastructure as Code files
- CI/CD pipeline definitions
- Documentation (README, CONTRIBUTING)
- `.gitignore` properly configured
- Forbidden Content
- NO secrets, passwords, API keys
- NO large binary files (use Git LFS if needed)
- NO compiled artifacts (use artifact repository)
- Choose a Strategy
- GitFlow (for scheduled releases)
- Trunk-Based Development (TBD) (for continuous deployment)
- GitHub Flow (simplified)
- Branch Protection
- `main` branch protected
- Require pull request reviews
- Require status checks to pass
- No force push allowed
- Require signed commits (optional but recommended)
- Commit Practices
- Clear, descriptive commit messages
- Conventional commits (feat:, fix:, docs:, etc.)
- Small, atomic commits
- No "work in progress" commits in main
- Pull Request Process
- Pull request template defined
- Code review required (at least 1-2 reviewers)
- CI checks must pass
- Link to issue/ticket
- Description of changes
- Git Hooks
- Pre-commit hooks for linting/formatting checks
- Pre-commit hooks for secret scanning
- Pre-push hooks to run local tests
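A minimal client-side pre-commit check for leaked credentials could look like the sketch below (the regexes are illustrative; dedicated tools like GitLeaks or TruffleHog are more thorough):

```shell
#!/bin/sh
# Sketch of a .git/hooks/pre-commit check: reject commits whose staged diff
# contains a likely AWS access key or a private-key header.
scan_for_secrets() {
  # Reads text on stdin; returns 1 (fail) if a likely secret is found.
  if grep -Eq 'AKIA[0-9A-Z]{16}|-----BEGIN (RSA|OPENSSH|EC) PRIVATE KEY-----'; then
    return 1
  fi
  return 0
}

# In the real hook, feed the staged diff through the check:
# git diff --cached -U0 | scan_for_secrets \
#   || { echo "Possible secret staged; aborting commit." >&2; exit 1; }
```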
- Access Control
- Least privilege principle
- Role-based access (read, write, admin)
- Regular access audits
- Security Scanning
- Git secrets scanning (detect leaked credentials with tools like GitLeaks or TruffleHog)
- Dependency vulnerability scanning
- Automated security alerts
- Backup & Disaster Recovery
- Git server/platform backups
- Disaster recovery plan documented
Full pipeline & tooling guidance moved to docs/cicd.md.
➡️ See: CI/CD Checklist
Modern CI/CD prioritizes automation, security, and maintainability. While Jenkins remains powerful with Configuration as Code (JCasC), cloud-native alternatives like GitHub Actions and GitLab CI offer reduced operational overhead and tighter integration with modern development workflows.
🎯 Key Recommendation: For new projects, prioritize GitHub Actions or GitLab CI for their simplicity and native cloud integration. Use Jenkins with JCasC for complex enterprise environments requiring extensive customization.
- Installation & Configuration
- Jenkins installed (Docker recommended)
- Jenkins Configuration as Code (JCasC) implemented for reproducible config
- JCasC YAML file version-controlled in Git
- Master-agent architecture for distributed builds
- High availability setup (for production)
- JCasC Best Practices
- All Jenkins configuration defined in `jenkins.yaml`
- Credentials managed via JCasC with external secret managers
- Plugin installation automated via JCasC
- No manual UI configuration required
- Configuration changes tested in staging first
- Backup Strategy
- Jenkins home directory backed up
- JCasC config files in Git (primary source of truth)
- Job definitions stored as code (Jenkinsfile)
- Regular automated backup schedule
- Pipeline Structure
- Declarative pipeline preferred (easier to read)
- Stages clearly defined
- Parallel execution where possible
- Proper error handling
- Build Optimization
- Use Docker agents for consistent builds
- Cache dependencies (Maven, npm, pip)
- Incremental builds where possible
- Pipeline Stages
- Checkout → Build → Unit Tests → Code Quality (SonarQube) → Security Scan (SAST/SCA) → Containerize → Publish (Nexus/Registry) → Deploy → Verify
```groovy
pipeline {
    agent {
        docker {
            image 'maven:3.8.1-jdk-11'
        }
    }
    environment {
        SONAR_TOKEN = credentials('sonar-token')
        DOCKER_REGISTRY = 'your-registry.com'
    }
    stages {
        stage('Checkout') {
            steps {
                checkout scm
            }
        }
        stage('Build') {
            steps {
                sh 'mvn clean compile'
            }
        }
        stage('Test') {
            steps {
                sh 'mvn test'
            }
            post {
                always {
                    junit 'target/surefire-reports/*.xml'
                }
            }
        }
        stage('Code Quality & SAST') {
            steps {
                sh 'mvn sonar:sonar -Dsonar.token=${SONAR_TOKEN}'
                sh 'trivy fs --security-checks vuln .'
            }
        }
        stage('Dependency Scan (SCA)') {
            steps {
                sh 'snyk test --json > snyk_report.json' // Example SCA
            }
        }
        stage('Package') {
            steps {
                sh 'mvn package -DskipTests'
            }
        }
        stage('Docker Build') {
            steps {
                sh 'docker build -t ${DOCKER_REGISTRY}/myapp:${BUILD_NUMBER} .'
            }
        }
        stage('Push to Registry') {
            steps {
                sh 'docker push ${DOCKER_REGISTRY}/myapp:${BUILD_NUMBER}'
            }
        }
        stage('Deploy to Dev') {
            steps {
                // Deployment steps
                sh './deploy.sh dev ${BUILD_NUMBER}'
            }
        }
    }
    post {
        success {
            echo "Build ${BUILD_NUMBER} succeeded" // Replace with slackSend
        }
        failure {
            echo "Build ${BUILD_NUMBER} failed" // Replace with slackSend
        }
        always {
            cleanWs()
        }
    }
}
```
- GitHub Actions ⭐ PREFERRED for GitHub-hosted projects
- Workflows defined in `.github/workflows/*.yml`
- Native integration with GitHub (PRs, Issues, Releases)
- Extensive marketplace of pre-built actions
- Built-in secrets management
- Matrix builds for multi-platform testing
- Self-hosted runners for private infrastructure
- GitLab CI/CD ⭐ PREFERRED for GitLab-hosted projects
- Pipeline defined in `.gitlab-ci.yml`
- Integrated with GitLab's full DevOps platform
- Auto DevOps for zero-config pipelines
- Container registry included
- Built-in security scanning (SAST, DAST, dependency scanning)
- GitLab Runners (shared or self-hosted)
- Modern CI/CD Benefits
- ✅ No infrastructure to maintain (managed runners)
- ✅ Native Git platform integration (better DX)
- ✅ Built-in container support (Docker, Kubernetes)
- ✅ Simplified YAML syntax (easier learning curve)
- ✅ Cost-effective for small to medium teams
- ✅ Cloud-native by design
Example GitHub Actions Workflow:
```yaml
name: CI/CD Pipeline

on:
  push:
    branches: [main, develop]
  pull_request:
    branches: [main]

jobs:
  build-and-test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Set up JDK 11
        uses: actions/setup-java@v3
        with:
          java-version: '11'
          distribution: 'temurin'
          cache: maven
      - name: Build with Maven
        run: mvn clean compile
      - name: Run Tests
        run: mvn test
      - name: SonarQube Scan
        uses: sonarsource/sonarqube-scan-action@master
        env:
          SONAR_TOKEN: ${{ secrets.SONAR_TOKEN }}
      - name: Trivy Security Scan
        uses: aquasecurity/trivy-action@master
        with:
          scan-type: 'fs'
          scan-ref: '.'
      - name: Build Docker Image
        run: docker build -t myapp:${{ github.sha }} .
      - name: Push to ECR
        uses: aws-actions/amazon-ecr-login@v1
      # ... push logic
```
- Credentials Management
- Use credentials management systems (Jenkins Credentials, Cloud Secrets Manager, or HashiCorp Vault).
- NO hardcoded secrets in pipeline files.
- Rotate credentials regularly.
- Access Control
- Role-based access control (RBAC)
- Audit logs enabled
- Essential Plugins
- Git, Docker, Pipeline, Credentials, SonarQube Scanner
- Slack/Email notifications
Section condensed. (Consider adding docs/code-quality.md later.)
- Installation
- SonarQube server installed (Docker is common)
- Database configured (PostgreSQL recommended)
- Project Setup
- Projects created for each application
- Quality profiles defined
- Define Quality Gates
- Code coverage threshold (e.g., > 80%)
- Bug and vulnerability limits (zero high/critical)
- Code smell limits
- Duplication percentage limits
- Enforcement
- Quality gate as pipeline stage
- Block deployment if quality gate fails
- Notify team of failures
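One common way to enforce the gate is `sonar.qualitygate.wait`, which makes the scanner poll the server and fail the CI step when the gate fails. A sketch of the project configuration (project key, name, and paths are placeholders):

```properties
# sonar-project.properties (sketch)
sonar.projectKey=my-service
sonar.projectName=My Service
sonar.sources=src
sonar.tests=test
# Fail the CI step if the SonarQube quality gate does not pass
sonar.qualitygate.wait=true
```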
- PR Analysis
- Pull Request decoration enabled (comments on PRs with issues).
- Block merge if quality gate fails.
Detailed Docker practices moved to future docs/docker.md (not yet created).
- Base Images
- Use official, specific tags, not `latest`.
- Use minimal base images (Alpine, distroless) for small size and minimal attack surface.
- Keep base images updated.
- Image Size
- Multi-stage builds for smaller images.
- Remove unnecessary files.
- Use `.dockerignore` file.
- Dockerfile Checklist
- Use multi-stage builds.
- Run as non-root user.
- Combine `RUN` commands to reduce layers.
- Add health checks.
- Use `ENV` for configuration.
- Add labels for metadata.
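Putting the checklist together, a hardened Dockerfile might look like the following sketch (a Node.js app is assumed; image versions, paths, port, and build commands are placeholders to adapt):

```dockerfile
# --- Build stage ---
FROM node:20-alpine AS build
WORKDIR /app
COPY package*.json ./
RUN npm ci
COPY . .
RUN npm run build

# --- Runtime stage: minimal base, non-root, labeled, health-checked ---
FROM node:20-alpine
ENV NODE_ENV=production
WORKDIR /app
COPY --from=build --chown=node:node /app/dist ./dist
COPY --from=build --chown=node:node /app/node_modules ./node_modules
USER node
HEALTHCHECK --interval=30s CMD wget -qO- http://localhost:3000/health || exit 1
LABEL org.opencontainers.image.title="myapp"
CMD ["node", "dist/server.js"]
```

The multi-stage split keeps build tooling out of the runtime image, which shrinks both size and attack surface.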
- Security Scanning
- Scan images for vulnerabilities (e.g., Trivy, Snyk, Clair).
- Block deployment of vulnerable images.
- Runtime Security
- Run containers as non-root.
- Use read-only file systems where possible.
- Limit resources (CPU, memory).
- Secrets Management
- NO secrets in images.
- Mount secrets at runtime using orchestrator (ECS Secrets, Kubernetes Secrets, Vault).
- Container Registry
- Choose registry (ECR, Docker Hub, GCR, Nexus).
- Private registry for internal images.
- Access control configured.
- Registry Operations
- Automated image builds.
- Image promotion across environments.
- Clean up old/unused images.
Full multi-tool artifact guidance moved to future docs/artifacts.md (not yet created).
Modern artifact management requires centralized storage, security, and automation. While Nexus remains popular for self-hosted solutions, JFrog Artifactory offers advanced features, and cloud-native registries (ECR/ACR/GCR) provide seamless cloud integration.
🎯 Key Recommendation: Use cloud-native registries (ECR/ACR/GCR) for container images and cloud workloads. For multi-format artifacts (Maven, npm, PyPI, etc.), consider JFrog Artifactory or Nexus based on your feature requirements.
- Repository Setup
- Hosted: Your internal artifacts
- Proxy: Cache external artifacts (Maven Central, npm, etc.)
- Group: Combine multiple repositories
- Support for Maven, npm, Docker, PyPI, NuGet, etc.
- Nexus Best Practices
- Repository health checks configured
- Blob store strategy defined
- Backup and disaster recovery plan
- Nexus High Availability for production
- Advanced Features
- Universal artifact repository (all package types)
- Xray integration for deep security/license scanning
- Advanced replication (multi-site, edge nodes)
- Build integration with full build metadata tracking
- AQL (Artifactory Query Language) for complex searches
- Artifactory Configuration
- Local, remote, and virtual repositories configured
- Repository layout standards enforced
- Cleanup policies automated
- Access federation for enterprise SSO
- AWS Elastic Container Registry (ECR) ⭐ PREFERRED for AWS workloads
- Private ECR repositories per application/service
- IAM-based access control (no credential management needed)
- Image scanning enabled (vulnerability detection)
- Lifecycle policies for automatic cleanup
- Cross-region replication configured
- Encryption at rest enabled
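As one example of an ECR lifecycle policy, the rule below expires untagged images after 14 days (the window is an arbitrary choice; tune it to your rollback needs):

```json
{
  "rules": [
    {
      "rulePriority": 1,
      "description": "Expire untagged images after 14 days",
      "selection": {
        "tagStatus": "untagged",
        "countType": "sinceImagePushed",
        "countUnit": "days",
        "countNumber": 14
      },
      "action": { "type": "expire" }
    }
  ]
}
```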
- Azure Container Registry (ACR) ⭐ PREFERRED for Azure workloads
- ACR integrated with Azure Active Directory
- Geo-replication for global deployments
- Content trust and signing enabled
- Azure Defender scanning enabled
- Google Container Registry (GCR) / Artifact Registry ⭐ PREFERRED for GCP workloads
- GCR integrated with GCP IAM
- Vulnerability scanning enabled
- Binary Authorization for deployment policy
- Multi-region storage configured
- Cloud Registry Benefits
- ✅ Zero infrastructure management
- ✅ Native cloud IAM integration
- ✅ Built-in vulnerability scanning
- ✅ Highly available by design
- ✅ Pay-per-use pricing model
- ✅ Seamless integration with cloud services (ECS, EKS, AKS, GKE)
- Versioning Strategy
- Snapshot vs Release repositories
- Semantic versioning enforced (MAJOR.MINOR.PATCH)
- Immutable releases (cannot overwrite published artifacts)
- Build metadata tracking (commit SHA, build number)
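Traceable tags can be derived mechanically in CI from metadata you already have; a sketch (`artifact_tag` is an illustrative helper):

```shell
#!/bin/sh
# Build an immutable, traceable artifact tag from the commit SHA and build number.
# In CI: artifact_tag "$(git rev-parse HEAD)" "$BUILD_NUMBER"
artifact_tag() {
  printf '%.7s-b%s\n' "$1" "$2"   # short (7-char) SHA + build number
}

artifact_tag 9fceb02d0ae598e95dc970b74767f19372d61af8 42   # 9fceb02-b42
```

Because the tag encodes the commit, any deployed artifact can be traced straight back to the source revision that produced it.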
- Promotion Strategy
- Automated promotion workflow: Dev → QA → Staging → Prod
- Quality gates at each promotion stage
- Audit trail for all promotions
- Rollback capability maintained
- Cleanup Policies
- Automated removal of old snapshots (e.g., >30 days)
- Retention policy for releases (e.g., keep last 10 versions)
- Unused artifact cleanup based on download activity
- Storage quota monitoring and alerts
- Authentication
- LDAP/Active Directory/SAML integration
- Token-based authentication for CI/CD (no passwords in pipelines)
- Multi-factor authentication (MFA) for admin access
- API keys rotation policy (90 days max)
- Authorization
- Role-based access control (RBAC)
- Read/write/delete permissions per repository
- Least privilege principle enforced
- Regular access audits and reviews
- Security Scanning
- Vulnerability scanning on artifact upload
- License compliance checking
- Malware scanning for binaries
- Quarantine mechanism for vulnerable artifacts
Detailed security workflow moved to docs/devsecops.md.
➡️ See: DevSecOps Checklist
- SAST Implementation
- SAST integrated in CI/CD pipeline.
- Scans run on every commit/PR.
- Results block pipeline if critical issues found.
- What SAST Detects
- SQL injection vulnerabilities.
- Cross-site scripting (XSS).
- Hardcoded secrets/credentials.
- SCA Implementation
- Dependency Scanning integrated in CI/CD pipeline (Snyk, Dependabot, OWASP Dependency-Check).
- Check open-source libraries for known vulnerabilities and license compliance.
- Maintain a software bill of materials (SBOM).
- DAST Implementation
- DAST runs in staging/pre-prod environment.
- Automated scans after deployment.
- OWASP ZAP (open-source) is a common tool choice.
- What DAST Detects
- Authentication/authorization flaws.
- Configuration errors.
- OWASP Top 10 vulnerabilities.
```
1. Code Commit
   ↓
2. SAST Scan (immediate feedback)
   ↓
3. Build & Unit Tests
   ↓
4. Dependency Scan (SCA)
   ↓
5. Container Image Scan (Trivy/Clair)
   ↓
6. Deploy to Staging
   ↓
7. DAST Scan (against running app)
   ↓
8. Deploy to Production
```
- Security Tools Integration
- Secrets Scanning (MANDATORY):
- GitLeaks or TruffleHog configured to scan full Git history
- Pre-commit hooks to prevent secret commits
- CI/CD pipeline blocks on secret detection
- Regular historical scans (weekly/monthly)
- Scan all branches, not just main
- License Compliance: Check license compatibility early
🚨 CRITICAL: Secrets scanning must cover the entire Git history, not just new commits. Historical secrets remain exploitable.
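The pre-commit hook mentioned above can be wired in with the hook the gitleaks project itself ships; a minimal `.pre-commit-config.yaml` sketch (pin `rev` to the release you have actually tested):

```yaml
repos:
  - repo: https://github.com/gitleaks/gitleaks
    rev: v8.18.0          # pin to a tested release
    hooks:
      - id: gitleaks      # scans staged changes before each commit
```

Developers enable it once with `pre-commit install`; CI can run the same check via `pre-commit run --all-files`.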
- Git History Scanning ⚠️ REQUIRED
- GitLeaks configured to scan entire repository history
- Scan runs on all branches and tags
- Custom regex patterns for organization-specific secrets
- Baseline exceptions documented for known false positives
- Automated scanning in CI/CD on every push
- GitLeaks Configuration Example

```toml
# .gitleaks.toml
title = "Gitleaks Config"

[[rules]]
id          = "aws-access-key"
description = "AWS Access Key"
regex       = '''AKIA[0-9A-Z]{16}'''
tags        = ["aws", "credentials"]

[[rules]]
id          = "private-key"
description = "Private Key"
regex       = '''-----BEGIN (RSA|OPENSSH|DSA|EC) PRIVATE KEY-----'''
tags        = ["private-key"]
```
- Secret Rotation & Remediation Procedure ⚠️ REQUIRED
- Immediate Actions on Secret Detection:
- Revoke/rotate compromised credentials immediately (within 15 minutes)
- Block the commit from being merged
- Alert security team via Slack/PagerDuty
- Create incident ticket with timeline
- Review access logs for unauthorized usage
- Git History Cleanup:
- Use `git-filter-repo` or `BFG Repo-Cleaner` to remove secrets
- Force-push cleaned history (coordinate with team)
- Update all developer clones
- Verify secret removal with a follow-up scan
- Prevention Measures:
- Mandatory pre-commit hooks for all developers
- Secret management training for all team members
- Use secret management tools (AWS Secrets Manager, Vault)
- Regular rotation schedule for all credentials (90 days max)
- Automated rotation where possible (AWS IAM, database passwords)
- Detection & Response:
- Real-time alerting on secret detection
- Automated revocation workflows
- Incident response playbook documented
- Post-incident review process
- Metrics tracking: time-to-detect, time-to-remediate
- Vulnerability Tracking
- Centralized vulnerability dashboard.
- Severity classification (Critical, High, Medium, Low).
- SLA for fixing vulnerabilities defined and enforced.
- Security Policies
- No critical/high vulnerabilities allowed in production.
- Regular security audits and penetration testing.
Detailed Terraform practices moved to docs/terraform.md.
➡️ See: Terraform Checklist
- Directory Layout
- `environments/`: Contains environment-specific configurations (`dev`, `staging`, `production`).
- `modules/`: Contains reusable, well-tested code blocks (e.g., `vpc`, `ecs-cluster`).
- `backend.tf`: Configuration for remote state.
- Code Quality
- Use consistent naming conventions.
- Add descriptions to all variables and outputs.
- Use `terraform fmt` for formatting.
- Run `tflint` for linting.
- Variables & Locals
- Don't hardcode values.
- Use `locals` for computed values.
- Use variable validation to enforce structure.
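Variable validation looks like the following sketch (variable name and allowed values are illustrative):

```hcl
variable "environment" {
  type        = string
  description = "Deployment environment"

  validation {
    condition     = contains(["dev", "staging", "prod"], var.environment)
    error_message = "environment must be one of: dev, staging, prod."
  }
}
```

`terraform plan` fails fast with the error message when an unexpected value is passed, instead of provisioning misnamed resources.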
- Resource Management
- Use `for_each` instead of `count` for flexibility.
- Use `prevent_destroy` for critical resources.
- Tag all resources consistently.
🚨 CRITICAL FOR TEAM ENVIRONMENTS: Remote backend with state locking is MANDATORY for any team working with Terraform. Local state files are only acceptable for individual learning/experimentation.
- Remote Backend (REQUIRED for Production) ⚠️
- S3 + DynamoDB state locking configured (AWS)
- Azure Blob Storage + state locking (Azure)
- GCS + state locking (GCP)
- Terraform Cloud/Enterprise (cross-cloud)
- State locking prevents concurrent modifications
- State versioning enabled for rollback capability
- Encryption at rest enabled (AES-256)
- Encryption in transit enforced (TLS)
- AWS S3 + DynamoDB Backend Configuration ⭐ REQUIRED PATTERN

```hcl
terraform {
  backend "s3" {
    bucket         = "your-terraform-state-bucket"
    key            = "project/environment/terraform.tfstate"
    region         = "us-east-1"
    dynamodb_table = "terraform-state-lock"
    encrypt        = true
    kms_key_id     = "arn:aws:kms:us-east-1:ACCOUNT:key/KEY-ID"
    acl            = "private" # Prevent state file manipulation
  }
}

# Note: versioning is enabled on the S3 bucket itself (it is not a
# valid backend argument) — see aws_s3_bucket_versioning.
```
- DynamoDB Table for State Locking
- Table created with `LockID` as primary key (String)
- On-demand billing mode or minimal provisioned capacity
- Table encryption enabled
- Point-in-time recovery enabled
- State Security & Access
- NEVER commit state files to Git (add `*.tfstate*` to `.gitignore`)
- Restrict S3 bucket access via IAM policies
- Enable S3 bucket versioning for state history
- S3 bucket logging enabled for audit trail
- MFA delete protection for production state buckets
- Cross-region replication for disaster recovery
- State Management Best Practices
- Separate state files per environment (dev/staging/prod)
- Separate state files per logical component/stack
- Use workspaces judiciously (prefer separate state files)
- Regular state file backups verified
- State file disaster recovery procedure documented
- Execution Control
- Implement CI/CD orchestration for terraform operations
- Use Terraform Cloud, Atlantis, or GitHub Actions for safe applies
- Require peer review of `terraform plan` output before apply
- Automated drift detection configured
- Only automated systems can apply changes (no manual applies in prod)
- Module Design
- Keep modules small and focused.
- Version modules with Git tags.
- Document module inputs/outputs.
- CI/CD Integration
- `terraform plan` on pull requests.
- Plan output commented on PRs (use Infracost for cost estimation).
- `terraform apply` triggered on merge to main/protected branches.
Detailed AWS baseline moved to docs/aws.md.
➡️ See: AWS Checklist
- Multi-Account Strategy
- Separate AWS accounts per environment (Dev, Staging, Prod).
- Use AWS Organizations for governance and consolidated billing.
- Implement Service Control Policies (SCPs).
- Account Baseline
- CloudTrail, GuardDuty, and Security Hub enabled in all accounts.
- Enable resource tagging for cost allocation.
- EC2: Use Auto Scaling Groups, Launch Templates, and IMDSv2.
- Lambda: Use environment variables for configuration and enable X-Ray tracing.
- S3: Enable encryption, block public access by default, use lifecycle policies.
- RDS: Use Multi-AZ for production, enable automated backups and encryption.
- IAM
- Enable MFA for all users.
- Use IAM roles, not users, for applications (least privilege principle).
- Rotate access keys regularly.
- Secrets Management
- Use AWS Secrets Manager or Parameter Store for runtime secrets.
- Rotate secrets automatically.
- Cost Management
- Set up billing alerts and use AWS Budgets.
- Use Cost Explorer and Cost Anomaly Detection.
- Strategies
- Right-size instances.
- Use Reserved Instances and Savings Plans for steady workloads.
- Use Spot Instances for non-critical, flexible workloads.
Core EKS/ECS comparison retained; deep Kubernetes content centralized separately.
- When to use ECS (AWS-Native): Simpler needs, lower operational overhead, tighter AWS integration, Fargate preferred for serverless compute.
- When to use EKS (Kubernetes): Multi-cloud/hybrid needs, complex orchestration (Service Mesh), necessity of the Kubernetes ecosystem, team has existing K8s expertise.
- Checklist
- Use Fargate (serverless) or EC2 (for control).
- Set appropriate CPU and memory.
- Use Task Roles for AWS API access (least privilege).
- Define health checks.
- Use secrets from Secrets Manager/Parameter Store.
- Configure logging to CloudWatch Logs.
- Setup
- Deploy to multiple Availability Zones.
- Configure Auto Scaling (CPU, Memory, Request Count).
- Set up Load Balancer integration (ALB/NLB).
- Configure deployment circuit breaker.
- Core Resources
- Understand Pods (smallest deployable unit).
- Understand Deployments (manages desired state of Pods).
- Understand Services (stable internal access/load balancing).
- Understand ConfigMaps and Secrets.
- Deployment Tooling
- Use Helm for package management and templating deployments.
- Use K8s-native CD tools like ArgoCD or Flux (GitOps philosophy).
- Rolling Update (default): Update instances one by one.
- Blue/Green Deployment: Use CodeDeploy (for ECS) or K8s Service Selector switching (for EKS).
- Canary Deployment: Deploy a small subset, monitor, and gradually shift traffic.
Full Kubernetes operational checklist moved to docs/kubernetes.md.
➡️ See: Kubernetes Checklist
Modern cloud-native applications require robust container orchestration. Kubernetes has emerged as the de facto standard for managing containerized workloads at scale across clouds.
- Managed Kubernetes Service Selection
- AWS EKS (Elastic Kubernetes Service) for AWS workloads
- Google GKE (Google Kubernetes Engine) for GCP workloads
- Azure AKS (Azure Kubernetes Service) for Azure workloads
- Control plane managed by cloud provider
- Worker nodes auto-scaling configured
- Multi-AZ deployment for high availability
- Cluster Configuration
- Kubernetes version upgrade strategy defined
- Node groups/pools for different workload types (compute, memory, GPU)
- Cluster autoscaler enabled
- Pod disruption budgets (PDB) configured
- Network policies enforced
- Private cluster endpoints configured
- EKS-Specific Best Practices
- VPC CNI plugin configured for pod networking
- IAM Roles for Service Accounts (IRSA) enabled
- EKS add-ons managed (VPC CNI, CoreDNS, kube-proxy)
- Managed node groups with launch templates
- Fargate profiles for serverless pods (where applicable)
- Helm (Package Manager) ⭐ RECOMMENDED for complex applications
- Helm 3+ installed (no Tiller required)
- Charts stored in version control or chart repositories
- Values files per environment (dev/staging/prod)
- Chart versioning and release management
- Helm hooks for pre/post operations
- Chart testing with `helm test`
- Dependencies managed via `Chart.yaml`
- Kustomize (Native K8s Configuration)
- Base configurations in `base/` directory
- Overlays per environment in `overlays/`
- Patch strategies for environment-specific changes
- No templating - pure YAML transformations
- Integrated with `kubectl apply -k`
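A minimal overlay sketch (paths, deployment name, and replica count are illustrative):

```yaml
# overlays/prod/kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - ../../base          # reuse the shared base manifests
patches:
  - patch: |-
      - op: replace
        path: /spec/replicas
        value: 5
    target:
      kind: Deployment
      name: my-app      # hypothetical deployment name
```

Applied with `kubectl apply -k overlays/prod`; the base stays untouched and environment differences live only in the overlay.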
- Package Management Best Practices
- Choose Helm for complex, reusable deployments
- Choose Kustomize for simpler, native K8s approach
- Don't mix both in same project (pick one)
- All manifests stored in Git (GitOps ready)
- Role-Based Access Control (RBAC)
- Least privilege principle enforced for all service accounts
- Cluster roles vs namespace roles clearly defined
- Default service account NOT used for applications
- Service accounts per application/microservice
- RoleBindings audited regularly
- No cluster-admin access for regular users
- Pod Security
- Pod Security Standards (PSS) enforced
- Security contexts defined for all pods
- Containers run as non-root user
- Read-only root filesystem where possible
- Privileged containers prohibited (except system)
- Host network/IPC/PID disabled
- Capabilities dropped (e.g., drop ALL, add NET_BIND_SERVICE only if needed)
- Network Security
- Network policies implemented (default deny)
- Ingress/egress rules explicitly defined
- Private container registry access configured
- Service mesh for mTLS (see next section)
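The usual starting point for "default deny" is a namespace-wide policy like this sketch (namespace name is hypothetical); explicit allow rules are then layered on top:

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: my-namespace   # hypothetical namespace
spec:
  podSelector: {}           # empty selector = every pod in the namespace
  policyTypes:
    - Ingress
    - Egress
```

Note that a CNI plugin with NetworkPolicy support (Calico, Cilium, etc.) is required for this to be enforced.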
- Secrets Management
- External secrets operator for cloud secret managers
- Secrets encrypted at rest (KMS)
- Secrets not exposed in environment variables
- Regular secret rotation policy (90 days max)
- Service Mesh Decision
- Istio for feature-rich, enterprise requirements
- Linkerd for simplicity and performance
- AWS App Mesh for AWS-native integration
- Service mesh evaluation based on: observability, traffic management, security needs
- Istio Configuration (if selected)
- Istio control plane installed
- Sidecar injection enabled per namespace
- Mutual TLS (mTLS) enforced cluster-wide
- Traffic management with VirtualServices
- Circuit breakers and retries configured
- Fault injection for chaos engineering
- Observability with Kiali, Jaeger, Prometheus
- Linkerd Configuration (if selected)
- Linkerd control plane installed
- Automatic proxy injection enabled
- Zero-trust mTLS enabled by default
- Traffic split for canary deployments
- Linkerd Viz for built-in observability
- Service profiles for per-route metrics
- Service Mesh Benefits
- ✅ Automatic mTLS between services
- ✅ Advanced traffic routing (canary, blue/green)
- ✅ Circuit breaking and fault tolerance
- ✅ Observability without code changes
- ✅ Consistent security policies
- Horizontal Pod Autoscaler (HPA)
- HPA configured based on CPU/memory metrics
- Custom metrics from Prometheus for business logic scaling
- Min/max replica counts defined
- Scale-up and scale-down policies tuned
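A representative `autoscaling/v2` HPA manifest; target utilization, replica bounds, and the deployment name are illustrative values to tune per workload:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-app
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app            # hypothetical deployment
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70   # scale out above 70% average CPU
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300  # damp flapping on scale-down
```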
- Vertical Pod Autoscaler (VPA)
- VPA for automatic resource request/limit adjustments
- Used in "recommendation mode" initially
- Combined with HPA carefully (can conflict)
- Cluster Autoscaler
- Automatically add/remove nodes based on demand
- Node group min/max size configured
- Integrated with cloud provider autoscaling groups
- Resource Management
- Resource requests defined for all containers
- Resource limits set to prevent resource exhaustion
- Quality of Service (QoS) classes understood
- Resource quotas per namespace
- Limit ranges for default constraints
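Requests and limits on a container look like this sketch (image and values are illustrative; size them from measured usage):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: my-app
spec:
  containers:
    - name: app
      image: registry.example.com/my-app:1.0   # hypothetical image
      resources:
        requests:             # what the scheduler reserves
          cpu: 250m
          memory: 256Mi
        limits:               # hard ceiling enforced at runtime
          cpu: "1"
          memory: 512Mi
```

Setting requests equal to limits on all containers yields the Guaranteed QoS class, which is evicted last under node pressure.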
- GitOps Philosophy ⭐ BEST PRACTICE for K8s
- Git as single source of truth
- Declarative infrastructure and applications
- Automated sync from Git to cluster
- Drift detection and auto-remediation
- Audit trail via Git history
- ArgoCD (Pull-based GitOps)
- ArgoCD installed in management cluster
- Applications defined as ArgoCD Application CRDs
- Auto-sync enabled with self-healing
- Multi-cluster management configured
- RBAC integrated with SSO (OIDC/SAML)
- Image updater for automated image updates
- Notification controller for Slack/email alerts
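An ArgoCD Application CRD with auto-sync and self-healing enabled, as a sketch (repo URL, path, and namespaces are hypothetical):

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: my-app
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/example/gitops-repo.git
    targetRevision: main
    path: apps/my-app
  destination:
    server: https://kubernetes.default.svc
    namespace: my-app
  syncPolicy:
    automated:
      prune: true       # delete resources removed from Git
      selfHeal: true    # revert manual drift back to the Git state
```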
- Flux (Pull-based GitOps alternative)
- Flux controllers installed via `flux bootstrap`
- GitRepository sources configured
- Kustomization resources for deployments
- Helm releases managed via HelmRelease CRD
- Image automation for automatic updates
- Multi-tenancy with namespace isolation
- GitOps Benefits
- ✅ Declarative - desired state in Git
- ✅ Auditable - all changes tracked in Git
- ✅ Automated - no manual kubectl applies
- ✅ Recoverable - easy rollback via Git revert
- ✅ Secure - no cluster credentials in CI/CD
Full MLT guidance moved to docs/observability.md.
➡️ See: Observability Checklist
Modern observability requires the "MLT" (Metrics, Logging, Tracing) approach. These three pillars work together to provide complete visibility into distributed systems.
🎯 Key Recommendation: Implement all three pillars (MLT) for production systems. Use Prometheus + Loki + Tempo for a unified, open-source stack, or leverage cloud-native solutions.
- Prometheus (Time-Series Metrics) ⭐ INDUSTRY STANDARD
- Prometheus server deployed (HA mode for production)
- Service discovery configured (Kubernetes, EC2, Consul)
- Scrape configs for all services and infrastructure
- Recording rules for precomputed queries
- Long-term storage with Thanos or Cortex
- Retention policy defined (e.g., 15 days local, years in object storage)
- Metrics Collection
- Application metrics exposed via a `/metrics` endpoint
- RED metrics (Rate, Errors, Duration) for services
- USE metrics (Utilization, Saturation, Errors) for infrastructure
- Four Golden Signals: Latency, Traffic, Errors, Saturation
- Business metrics tracked (signups, transactions, revenue)
- Custom metrics via client libraries (Prometheus SDK)
- Prometheus Exporters
- Node exporter for Linux/OS metrics
- Blackbox exporter for endpoint monitoring
- Database exporters (PostgreSQL, MySQL, Redis)
- Cloud-specific exporters (CloudWatch exporter)
- Custom exporters for legacy systems
- Grafana (Visualization) ⭐ PREFERRED UI
- Grafana deployed with persistent storage
- Prometheus configured as data source
- Pre-built dashboards imported (Node Exporter, Kubernetes)
- Custom dashboards per service/team
- Dashboard as code (JSON in version control)
- Template variables for environment selection
- Unified dashboards combining multiple data sources
- Logging Strategy
- Structured logging (JSON format) enforced
- Correlation IDs for request tracing
- Log levels properly used (DEBUG, INFO, WARN, ERROR)
- No PII/secrets logged
- Centralized log aggregation
- ELK Stack (Elasticsearch, Logstash, Kibana)
- Elasticsearch cluster for log storage (3+ nodes for HA)
- Logstash or Fluentd for log processing
- Filebeat/Fluentbit for lightweight log shipping
- Index lifecycle management (ILM) configured
- Kibana for log search and visualization
- Index patterns and saved searches defined
- Retention policy (e.g., 30 days hot, 90 days warm, archive cold)
- Grafana Loki ⭐ COST-EFFECTIVE ALTERNATIVE
- Loki deployed for log aggregation
- Promtail agents on all nodes
- Labels for efficient log indexing (don't over-label)
- Integration with Grafana for unified view
- Object storage backend (S3/GCS) for scalability
- Significantly lower cost than Elasticsearch
- Cloud-Native Logging
- CloudWatch Logs (AWS)
- Cloud Logging (GCP)
- Azure Monitor Logs (Azure)
- Log forwarding from cloud to central system
- Logging Best Practices
- ✅ Structured logs in JSON format
- ✅ Include context: service, environment, version, host
- ✅ Use correlation/trace IDs across services
- ✅ Log to stdout/stderr (not files) in containers
- ✅ Aggregate logs centrally (never rely on local logs)
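The practices above can be sketched with nothing but the standard library; a minimal JSON formatter with a correlation ID (service name and ID value are hypothetical):

```python
import json
import logging
import sys

class JsonFormatter(logging.Formatter):
    """Emit one JSON object per log record (structured logging)."""

    def format(self, record):
        entry = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            # Correlation ID lets the aggregator join lines across services.
            "correlation_id": getattr(record, "correlation_id", None),
        }
        return json.dumps(entry)

# Log to stdout, as recommended for containers.
handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("checkout")   # hypothetical service name
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("order created", extra={"correlation_id": "req-42"})
```

In practice most teams use a library (structlog, python-json-logger, or the framework's equivalent), but the output contract is the same: one JSON object per line on stdout.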
- Distributed Tracing ⚠️ CRITICAL for microservices
- Tracing implemented for all service-to-service calls
- Trace context propagated via HTTP headers (W3C Trace Context)
- Parent-child span relationships maintained
- Critical paths instrumented
- Sampling strategy configured (e.g., 1-10% in production)
- OpenTelemetry ⭐ MODERN STANDARD
- OpenTelemetry SDK integrated in applications
- Auto-instrumentation for frameworks (Spring, Express, Django)
- Custom spans for business logic
- Attributes and events added to spans
- Resource attributes configured (service.name, environment)
- Vendor-neutral implementation (portable across backends)
- Jaeger (Trace Backend)
- Jaeger deployed for trace storage and visualization
- Collector receives traces from applications
- Storage backend configured (Elasticsearch, Cassandra, or Badger)
- Jaeger UI for trace search and analysis
- Service dependency graph visualization
- Retention policy configured
- Alternative Trace Backends
- Grafana Tempo (cost-effective, integrated with Grafana)
- AWS X-Ray (for AWS-centric workloads)
- Google Cloud Trace
- Azure Application Insights
- Commercial solutions: Datadog APM, New Relic, Honeycomb
- Tracing Use Cases
- Identify slow database queries
- Find service bottlenecks
- Root cause analysis for errors
- Understand service dependencies
- Measure end-to-end request latency
- Integration Between MLT Pillars
- Trace ID in logs for correlation
- Jump from metrics → logs → traces in Grafana
- Unified query interface (LogQL, PromQL, TraceQL)
- Single pane of glass dashboard
- Context switching minimized
- Grafana Observability Stack ⭐ RECOMMENDED UNIFIED SOLUTION
- Grafana - Visualization layer
- Prometheus - Metrics
- Loki - Logs
- Tempo - Traces
- Grafana Agent - Unified collection
- All components integrated natively
- Commercial Alternatives
- Datadog (all-in-one, expensive)
- New Relic (APM focused)
- Splunk (log-centric, enterprise)
- Elastic Observability (ELK + APM)
- Dynatrace (AI-driven)
- Alerting Strategy
- Alerts based on symptoms, not causes
- Multi-window, multi-burn-rate alerts (Google SRE approach)
- Alert fatigue prevention (actionable alerts only)
- Severity levels: Critical (page), High (ticket), Low (weekly review)
- Alert grouping and deduplication
- Prometheus Alertmanager
- Alertmanager deployed in HA mode
- Alert routing rules configured
- Notification channels: PagerDuty, Slack, Email, Webhook
- Silencing and inhibition rules
- Alert templates customized
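A representative routing config sketch; receiver names, the Slack webhook, and the PagerDuty key are placeholders:

```yaml
route:
  receiver: default-slack
  group_by: ["alertname", "cluster"]   # group related alerts together
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  routes:
    - matchers:
        - severity="critical"
      receiver: pagerduty-oncall       # page only on critical

receivers:
  - name: default-slack
    slack_configs:
      - api_url: https://hooks.slack.com/services/XXX   # placeholder
        channel: "#alerts"
  - name: pagerduty-oncall
    pagerduty_configs:
      - routing_key: YOUR-PD-ROUTING-KEY                # placeholder
```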
- Service Level Objectives (SLOs)
- SLOs defined for critical services (e.g., 99.9% uptime)
- Service Level Indicators (SLIs) measured (availability, latency, throughput)
- Error budget calculated and tracked
- SLO dashboards visible to all teams
- Alerts fire when error budget depleted
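A multi-window, multi-burn-rate alert for a 99.9% availability SLO (0.1% error budget), following the Google SRE workbook pattern; the `sli:*` recording rules are assumed to exist with these names:

```yaml
groups:
  - name: slo-burn-rate
    rules:
      - alert: HighErrorBudgetBurn
        # 14.4x burn rate sustained over both 1h and 5m windows
        # would exhaust a 30-day error budget in roughly 2 days.
        expr: |
          sli:request_error_ratio:rate1h > (14.4 * 0.001)
          and
          sli:request_error_ratio:rate5m > (14.4 * 0.001)
        labels:
          severity: critical
```

The short window keeps the alert from firing on stale data; the long window keeps it from firing on brief blips.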
- Incident Response
- On-call rotation defined and automated
- Runbooks for all critical alerts
- Incident management tool (PagerDuty, Opsgenie)
- Blameless postmortems conducted
- Incident timeline and RCA documented
- Logs: CloudWatch Logs, ELK Stack, Splunk.
- Metrics: Prometheus (time series data), CloudWatch Metrics.
- Traces: AWS X-Ray or Jaeger (distributed tracing).
- Visualization Tools
- Use Grafana to visualize data from Prometheus, CloudWatch, or other sources.
- Centralized dashboards for key application and infrastructure health metrics.
- Real-time monitoring enabled.
- Alerting Setup
- Define alerts based on SLOs and critical resource thresholds.
- Integrate with notification channels (SNS, Slack, PagerDuty).
- Use Prometheus Alertmanager for sophisticated grouping and routing.
- Runbooks
- On-call rotation defined.
- Runbooks for common alerts are documented and accessible.
Detailed policy enforcement moved to docs/governance.md.
➡️ See: Governance Checklist
Modern governance requires automated policy enforcement and compliance checks. Policy as Code enables security, compliance, and operational standards to be codified, version-controlled, and automatically enforced.
🎯 Key Recommendation: Implement policy as code early to prevent drift and ensure compliance. Use OPA for Kubernetes and general-purpose policies, Sentinel for Terraform, and cloud-native tools for infrastructure compliance.
- OPA Fundamentals ⭐ RECOMMENDED for Kubernetes
- OPA installed as admission controller in Kubernetes
- Policies written in Rego language
- OPA Gatekeeper for Kubernetes-native CRDs
- Policy library maintained in Git
- Policies tested with unit tests (conftest)
- Kubernetes Policy Enforcement
- Require resource limits on all pods
- Block privileged containers
- Enforce pod security standards (restricted/baseline)
- Require specific labels (owner, environment, cost-center)
- Restrict image registries (only approved registries)
- Network policy requirements
- Ingress hostname uniqueness
- Namespace quotas enforcement
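With the `K8sRequiredLabels` constraint template from the upstream gatekeeper-library installed, requiring an `owner` label on namespaces looks like this sketch:

```yaml
apiVersion: constraints.gatekeeper.sh/v1beta1
kind: K8sRequiredLabels
metadata:
  name: ns-must-have-owner
spec:
  enforcementAction: deny        # start with "dryrun" to audit first
  match:
    kinds:
      - apiGroups: [""]
        kinds: ["Namespace"]
  parameters:
    labels:
      - key: owner
```

The `dryrun` → `deny` progression mirrors the "audit mode first, then enforce" practice listed below.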
- OPA Use Cases Beyond Kubernetes
- API authorization policies
- Infrastructure policy validation
- Data filtering and masking
- Service mesh authorization
- OPA Best Practices
- Audit mode first, then enforce
- Policies as code in version control
- Policy decision logs for compliance audits
- Regular policy reviews and updates
- Policy violations reported to teams
- Sentinel for Terraform ⭐ TERRAFORM CLOUD/ENTERPRISE
- Sentinel policies integrated in Terraform workflow
- Policy sets organized by compliance framework
- Policies run before apply
- Advisory, soft mandatory, and hard mandatory levels
- Terraform Policy Examples
- Mandatory tags on all resources (environment, owner, project)
- Prevent public S3 buckets
- Require encryption at rest for databases and storage
- Enforce instance size limits (prevent oversized instances)
- Require VPC for all resources (no default VPC)
- MFA delete for S3 buckets in production
- Backup requirements for critical data stores
- Cost controls (estimated cost limits per apply)
- Sentinel Integration
- Policy checks in Terraform Cloud/Enterprise
- Policy failures block terraform apply
- Policy override process documented
- Compliance reports generated
- AWS Config (AWS Compliance)
- AWS Config enabled in all accounts/regions
- Config rules for compliance checks
- CIS AWS Foundations Benchmark rules deployed
- Custom config rules for organization policies
- Automatic remediation with Systems Manager
- Compliance dashboard for leadership
- Non-compliant resources flagged
- AWS Config Rules Examples
- S3 buckets must have encryption enabled
- RDS instances must have backup enabled
- EC2 instances must be in VPC
- Root account MFA enabled
- IAM password policy enforced
- Security groups don't allow 0.0.0.0/0 on port 22/3389
- CloudTrail enabled and logging
- Azure Policy (Azure Compliance)
- Azure Policy definitions assigned
- Built-in policies for regulatory compliance (HIPAA, PCI-DSS)
- Custom policies for organization standards
- Deny effect for critical violations
- Audit effect for advisory policies
- Policy remediation tasks
- Google Cloud Organization Policies
- Organization policy constraints defined
- Resource location restrictions
- Allowed services and API restrictions
- VM instance requirements
- Pre-Deployment Validation
- `conftest` for policy testing in CI pipelines
- Terraform plan validated by Sentinel/OPA
- Kubernetes manifests validated by OPA/Kyverno
- Dockerfile linting with policy checks
- Infrastructure code scanning (Checkov, tfsec)
- CI/CD Integration
- Policy checks as mandatory CI/CD stages
- Policy violations fail the build
- Policy reports attached to PRs
- Override mechanism for exceptions (with approval)
- Policy Testing
- Unit tests for policy rules
- Test cases for both allow and deny scenarios
- Automated policy regression testing
- Policy test coverage measured
- Audit Trails
- All infrastructure changes logged
- API calls tracked (CloudTrail, Azure Activity Log)
- Policy decision logs stored
- Change management records maintained
- Compliance Reporting
- Automated compliance reports generated
- Dashboard showing compliance posture
- Non-compliance items tracked and remediated
- Executive summaries for leadership
- Evidence collection for auditors
- Compliance Frameworks
- SOC 2 controls mapped to policies
- PCI-DSS requirements enforced
- HIPAA compliance validated
- GDPR data protection policies
- ISO 27001 security controls
Detailed cost optimization checklist moved to docs/finops.md.
➡️ See: FinOps Checklist
Cloud costs can spiral out of control without proper governance and optimization. FinOps brings financial accountability to cloud spending through visibility, optimization, and cultural change.
🎯 Key Recommendation: Implement comprehensive tagging strategy first, then use native cloud cost tools combined with third-party solutions for deep analysis and recommendations.
- Tagging Strategy ⚠️ FOUNDATIONAL REQUIREMENT
- Mandatory tags defined and documented:
  - `environment` (dev/staging/prod)
  - `owner` or `team` (responsible team)
  - `project` or `application` (business context)
  - `cost-center` (billing allocation)
  - `managed-by` (terraform/manual)
  - `expiry-date` (for temporary resources)
- Tag policies enforced via:
- AWS Organizations Tag Policies
- Azure Policy for required tags
- Terraform validation rules
- CI/CD pre-deployment checks
- Resources without required tags blocked from creation
- Regular tag compliance audits
- Automated tag remediation where possible
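On AWS, Terraform's provider-level `default_tags` applies the mandatory tags to every taggable resource automatically, so individual resources cannot forget them; a sketch with illustrative values:

```hcl
provider "aws" {
  region = "us-east-1"

  default_tags {
    tags = {
      "environment" = "prod"
      "team"        = "platform"
      "project"     = "checkout"     # hypothetical project name
      "cost-center" = "cc-1234"
      "managed-by"  = "terraform"
    }
  }
}
```

Per-resource `tags` blocks then only need resource-specific additions; the provider merges in the defaults.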
- Cost Visibility Tools
- AWS Cost Explorer configured with custom reports
- AWS Cost and Usage Reports (CUR) enabled
- Azure Cost Management dashboards created
- GCP Cost Management configured
- Third-party tools: CloudHealth, Cloudability, or Kubecost
- Cost Allocation
- Cost allocation tags propagated to billing
- Chargeback/showback reports per team/project
- Cost trends analyzed monthly
- Cost anomaly detection alerts configured
- Budget Setup
- AWS Budgets created per account/project
- Budget thresholds at 50%, 80%, 100%, 120%
- Forecasted spend alerts enabled
- Budget alerts to team leads and finance
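Budgets and their threshold notifications can themselves be managed as code; a sketch using the AWS provider's `aws_budgets_budget` resource (amount and email are placeholders):

```hcl
resource "aws_budgets_budget" "monthly" {
  name         = "team-monthly-budget"
  budget_type  = "COST"
  limit_amount = "1000"              # placeholder monthly limit, USD
  limit_unit   = "USD"
  time_unit    = "MONTHLY"

  notification {
    comparison_operator        = "GREATER_THAN"
    threshold                  = 80              # alert at 80% of budget
    threshold_type             = "PERCENTAGE"
    notification_type          = "ACTUAL"
    subscriber_email_addresses = ["team-lead@example.com"]  # placeholder
  }
}
```

Additional `notification` blocks cover the other thresholds (50%, 100%, 120%) and `notification_type = "FORECASTED"` for forecast alerts.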
- Cost Anomaly Detection
- AWS Cost Anomaly Detection enabled
- Azure Cost Management anomaly alerts
- Real-time cost spike notifications
- Automated investigation workflows
- Spend Reviews
- Monthly cost review meetings
- Quarterly business review with finance
- Cost optimization backlog maintained
- ROI tracking for optimization efforts
- Compute Right-Sizing
- AWS Compute Optimizer recommendations reviewed
- Underutilized EC2 instances identified (< 40% CPU/memory)
- Overprovisioned instances downsized
- Instance family optimization (Graviton, AMD instances)
- Idle instances terminated or scheduled
- Auto-scaling configured to match demand
- Database Optimization
- RDS instance right-sizing
- Aurora Serverless for variable workloads
- Read replica evaluation (are they needed?)
- Database storage type optimization (gp3 vs gp2)
- Storage Optimization
- S3 Intelligent-Tiering enabled
- S3 lifecycle policies for archival
- EBS volume optimization (unused volumes deleted)
- Snapshot cleanup for old/unused snapshots
- EBS volume type optimization (gp3 over gp2)
- Kubernetes Cost Optimization
- Kubecost deployed for K8s cost visibility
- Pod resource requests match actual usage
- Cluster autoscaler fine-tuned
- Node rightsizing based on workload
- Spot instances for fault-tolerant workloads
- Commitment Discounts
- Reserved Instances (RIs) for steady-state workloads
- Savings Plans for flexible compute savings (AWS)
- Azure Reserved VM Instances
- GCP Committed Use Discounts
- RI/Savings Plan Strategy
- 1-year vs 3-year commitment analysis
- Standard vs convertible RI decision
- Coverage targets: 70-80% of steady-state compute
- Quarterly RI utilization reviews
- RI exchange/modification as workloads change
- Spot/Preemptible Instances
- Spot instances for batch processing
- Spot instances in Kubernetes (Karpenter, Spot Ocean)
- GCP Preemptible VMs for dev/test
- Graceful handling of spot terminations
- FinOps Culture
- Engineering teams aware of their cloud costs
- Cost metrics in team dashboards
- Cost optimization as sprint work items
- Recognition for cost-saving initiatives
- Cost-Aware Architecture
- Cost considerations in architecture reviews
- Serverless-first approach where appropriate
- Multi-region strategy aligned with business need
- Data transfer costs minimized
- Over-engineering avoided
- Waste Elimination
- Idle resources automatically identified
- Non-production environments shut down off-hours
- Zombie resources (unattached EBS, old snapshots) cleaned
- Unused reserved capacity released
- Duplicate data stores eliminated
- FinOps Metrics
- Cost per customer/transaction tracked
- Cloud cost as % of revenue monitored
- Cost efficiency trends over time
- Engineering cost savings tracked and celebrated
- Metrics & KPIs
- Regularly review DORA metrics and SLOs.
- Track pipeline execution time and cost.
- Monitor deployment success rates and rollback frequency
- Track mean time to detection (MTTD) for issues
- Measure infrastructure drift and compliance violations
- Retrospectives & Learning
- Regular team retrospectives (weekly/bi-weekly).
- Blameless postmortems with documented action items
- Postmortem action items tracked to completion
- Share learnings across teams (internal blog, wiki)
- Track recurring issues and address root causes
- Training & Development
- Regular cross-training sessions.
- Continuous learning culture fostered.
- Dedicated learning time (e.g., 10% of sprint)
- Internal tech talks and knowledge sharing
- Conference attendance and external training budgets
- Disaster Recovery & Resilience Testing ⚠️ CRITICAL
- DR drills conducted regularly (quarterly minimum)
- Backup restore tests automated and scheduled
- Recovery Time Objective (RTO) defined and tested
- Recovery Point Objective (RPO) defined and measured
- DR runbooks tested and updated
- Failover procedures documented and rehearsed
- Multi-region failover tested (if applicable)
- Data backup integrity verified regularly
- Chaos engineering experiments (optional but valuable)
- Fault injection in non-production
- Controlled chaos in production (with safeguards)
- Game days with simulated outages
- Tools: Chaos Monkey, Gremlin, AWS Fault Injection Simulator
- Post-DR test review with improvement actions
- Week 1-2: Foundation (Git, AWS Accounts, IAM/Security Baseline).
- Week 3-4: CI/CD Foundation (Choose and implement CI Tool, Docker basics, first pipeline).
- Week 5-6: Quality & Security (Integrate SonarQube, SAST/SCA, set up Nexus).
- Week 7-8: Infrastructure (Terraform basics, provision VPC/Network).
- Week 9-10: Container Orchestration (Set up ECS/EKS, configure load balancing).
- Week 11-12: Advanced Topics (Implement Observability, Auto-scaling, Blue/Green deployment).
- Month 1: Git + Linux + Bash scripting
- Month 2: CI/CD (Jenkins/Actions) + Docker basics
- Month 3: AWS fundamentals + Terraform
- Month 4: Build end-to-end project
- Month 5-6: Advanced topics (K8s/Prometheus) + portfolio projects
- Git Documentation
- Jenkins Documentation
- Docker Documentation
- Terraform Documentation
- AWS Documentation
- Kubernetes Documentation
Contributions are welcome! Please:
- Fork the repository
- Create a feature branch
- Make your changes
- Submit a pull request
See credits.md for image and logo attributions.
This project is licensed under the Apache License 2.0 - see the LICENSE file for details.
Remember: DevOps is a journey, not a destination. Start small, automate incrementally, and continuously improve! 🚀


