This is a collection of documentation, how-tos, tools and other information on identifying and debugging Kubernetes/container workload failures, along with performance and reliability considerations.
This investigation started with user-reported failures at the DNS, networking and application levels. The analysis showed that the actual causes of these failures were severe resource saturation and contention, IO throttling, kernel panics, etc. For an overview, see Part 1: Summary.
Through the investigation, I discovered a lack of operational / systems knowledge, tracking and general awareness of the worker nodes / Linux hosts that comprise Kubernetes clusters (including filesystem incompatibility).
There are many gotchas, mud pits and blind spots in running distributed systems, and Kubernetes is no different. My goal is to step through the past 20 years of my career, showing everyone my mistakes and what I learned from them.
Hopefully, this stuff helps you and your team.
This is an ongoing project / labor of love. It is not complete by any means.
- The rough project roadmap is here
- Issues, comments and suggestions can be filed in the tracker