Created blogpost for Observability Game days at Skyscanner (#4039)

open-telemetry · Feb 26, 2024 · 19720bb · 19720bb
1 parent 910040f
commit 19720bb

Showing 4 changed files with 137 additions and 0 deletions.
diff --git a/content/en/blog/2024/demo-skyscanner/index.md b/content/en/blog/2024/demo-skyscanner/index.md
@@ -0,0 +1,109 @@
+---
+title:
+  "Making observability fun: How we increased engineers' confidence in incident
+  management using a game"
+linkTitle: Skyscanner using OTel Demo
+date: 2024-02-26
+author: >-
+  [Jordi Bisbal Ansaldo](https://github.com/jordibisbal8) (Skyscanner)
+cSpell:ignore: Ansaldo Bisbal Jordi runbooks Skyscanner upskilled Yankova
+---
+
+At [Skyscanner](https://www.skyscanner.net), as in many organizations, teams
+tend to follow specific runbooks for individual failure modes. In modern,
+complex distributed systems, however, most errors are unknowns, which makes
+runbooks only partially applicable.
+
+After migrating our telemetry data to the OpenTelemetry standards at Skyscanner,
+we now have richer instrumentation and can rely on observability directly. As a
+result, we are ready to adopt a new
+[observability mindset](https://charity.wtf/2019/09/20/love-and-alerting-in-the-time-of-cholera-and-observability/),
+which requires training our engineers to work effectively with the new
+ecosystem. This allows them to react efficiently to any known or unknown issues,
+even under pressure.
+
+To achieve this, we believe that the best way to gain knowledge isn’t through
+one-time viewings of documents or videos. Instead, it’s through practical
+exercises that include situations with never-before-seen (or at least rarely
+seen) problems. This helps the company reduce the time to mitigate (TTM), which
+is measured from the moment a first responder acknowledges an incident until
+users stop being affected by it.
+
+## Environment
+
+To begin with, we need to set up an environment that demonstrates the best
+practices for monitoring and debugging using OpenTelemetry instrumentation and
+observability. For this, we propose the use of the official
+[OpenTelemetry Demo](/docs/demo/), which is a realistic example of a distributed
+system called Astronomy Shop. Thanks to the
+[OpenTelemetry Protocol](/docs/specs/otlp/) (OTLP), it allows us to simply point
+the standard OTLP exporter in the Collector to
+[New Relic](https://newrelic.com/), our chosen observability platform at
+Skyscanner which, like other platforms, is fully embracing open standards to
+ingest telemetry data.
+
+This system contains regressions that can be injected into the platform and
+helps us demonstrate the importance of Service Level Objectives (SLOs),
+tracing, logs, metrics, etc. For instance, we can observe traffic flow through
+various components, as shown in the image below. Since the demo is open source
+and part of the OpenTelemetry ecosystem, we can easily propose new features,
+which are then reviewed by OpenTelemetry contributors.
+
+![Distributed tracing example in Astronomy shop](tracing-example.png)
+
+## Observability game day
+
+Once the environment is set up, we can introduce the Observability Game Day, an
+initiative based on the Wheel of Misfortune practices that Google uses and
+describes in the [Site Reliability Engineering book](https://sre.google/books/).
+
+This game simulates a production incident, where a moderator known as the game
+master (GM) conducts the session and someone from the audience spins the wheel
+and explains an incident or outage. The participants are then divided into teams
+and tasked with identifying and resolving the issue as quickly as possible. If
+the solution is not optimal, the GM can help by introducing a new tool or view,
+which gives a different perspective on how to tackle the incident (knowledge
+sharing). This exercise can be repeated multiple times for different incidents.
+
+![Wheel of misfortune example](wheel.png)
+
+## Results
+
+The Observability Game Day has already been completed by multiple Skyscanner
+teams, with each team's observability expert (ambassador) running the session.
+Participants have given extremely positive feedback: 90% of respondents say
+that after the Game Day, they feel more confident debugging production
+systems and would love to have further sessions.
+
+- Hugely valuable to run against real services and to compare and contrast
+  different debugging methods. I'm certain everyone, regardless of skill level,
+  will have got something out of the session - I know I did! Thank you for
+  taking the time to set this up and promoting it for us -
+  [Dominic Fraser](https://github.com/dominicfraser) (Senior Software Engineer)
+- It is a really great (company-wide) initiative to get people upskilled in
+  observability and OpenTelemetry/New Relic and I personally found it very
+  useful, as well as a lot of fun! :D - Polly Yankova (Software Engineer)
+
+In addition, we learned that:
+
+1. OTLP makes it incredibly simple to integrate a standard application with an
+   observability vendor. Just point it to the right endpoint and job done.
+2. Our winning teams relied primarily on tracing data to analyze regressions,
+   which helped them understand the root cause faster. Tracing FTW!
+3. Front-end engineers found the Game Day lacked focus on client-side
+   observability, so we decided to contribute upstream (see next steps below).
+   This was my first contribution to the project, and it was a great experience!
+   Maintainers were very welcoming and helped me to test and release. Thanks!
+
+## Next steps
+
+The next action is to run sessions for all the engineering teams in the company
+and convert them into a Skyscanner learning course. This way, the content can be
+used during the onboarding process for new joiners or even reviewed at any time
+as a refresher for those who have been in the company longer. In addition, after
+observing common feedback, we identified that it would be beneficial to extend
+the current incidents to include more front-end-specific ones, such as incidents
+triggered by browser traffic. To achieve this, we have contributed to the
+OpenTelemetry Demo and enabled these features for other interested parties. For
+more information, please have a look at the
+[raised PR](https://github.com/open-telemetry/opentelemetry-demo/pull/1345).
diff --git a/content/en/blog/2024/demo-skyscanner/tracing-example.png b/content/en/blog/2024/demo-skyscanner/tracing-example.png
diff --git a/content/en/blog/2024/demo-skyscanner/wheel.png b/content/en/blog/2024/demo-skyscanner/wheel.png
diff --git a/static/refcache.json b/static/refcache.json
@@ -223,6 +223,10 @@
     "StatusCode": 206,
     "LastSeen": "2024-01-30T06:06:13.062554-05:00"
   },
+  "https://charity.wtf/2019/09/20/love-and-alerting-in-the-time-of-cholera-and-observability/": {
+    "StatusCode": 200,
+    "LastSeen": "2024-02-26T10:53:38.116124+01:00"
+  },
   "https://circleci.com": {
     "StatusCode": 206,
     "LastSeen": "2024-01-30T05:18:29.78394-05:00"
@@ -2291,6 +2295,10 @@
     "StatusCode": 200,
     "LastSeen": "2024-01-25T10:54:57.162378745Z"
   },
+  "https://github.com/dominicfraser": {
+    "StatusCode": 200,
+    "LastSeen": "2024-02-26T10:53:39.237712+01:00"
+  },
   "https://github.com/dotansimha/graphql-yoga": {
     "StatusCode": 200,
     "LastSeen": "2024-01-30T05:18:34.524624-05:00"
@@ -2475,6 +2483,10 @@
     "StatusCode": 200,
     "LastSeen": "2024-01-30T16:14:54.527183-05:00"
   },
+  "https://github.com/jordibisbal8": {
+    "StatusCode": 200,
+    "LastSeen": "2024-02-26T10:53:37.290066+01:00"
+  },
   "https://github.com/joshleecreates/": {
     "StatusCode": 200,
     "LastSeen": "2024-01-30T05:18:13.610639-05:00"
@@ -3011,6 +3023,10 @@
     "StatusCode": 200,
     "LastSeen": "2024-01-30T06:05:58.807859-05:00"
   },
+  "https://github.com/open-telemetry/opentelemetry-demo/pull/1345": {
+    "StatusCode": 200,
+    "LastSeen": "2024-02-26T10:53:40.03196+01:00"
+  },
   "https://github.com/open-telemetry/opentelemetry-demo/pull/432": {
     "StatusCode": 200,
     "LastSeen": "2024-01-30T15:26:30.96845-05:00"
@@ -4851,6 +4867,10 @@
     "StatusCode": 200,
     "LastSeen": "2024-01-18T19:02:19.249572-05:00"
   },
+  "https://newrelic.com/": {
+    "StatusCode": 206,
+    "LastSeen": "2024-02-26T10:53:38.368111+01:00"
+  },
   "https://newrelic.com/blog/authors/daniel-kim": {
     "StatusCode": 206,
     "LastSeen": "2024-01-18T19:10:48.917326-05:00"
@@ -6967,6 +6987,10 @@
     "StatusCode": 206,
     "LastSeen": "2024-01-30T15:25:13.019086-05:00"
   },
+  "https://sre.google/books/": {
+    "StatusCode": 206,
+    "LastSeen": "2024-02-26T10:53:38.643051+01:00"
+  },
   "https://stackoverflow.com/questions/5626193/what-is-monkey-patching": {
     "StatusCode": 200,
     "LastSeen": "2024-01-18T19:07:28.672979-05:00"
@@ -8119,6 +8143,10 @@
     "StatusCode": 200,
     "LastSeen": "2024-01-30T06:01:24.01921-05:00"
   },
+  "https://www.skyscanner.net": {
+    "StatusCode": 206,
+    "LastSeen": "2024-02-26T10:53:37.476242+01:00"
+  },
   "https://www.slideshare.net/Altinity/osa-con-2022-signal-correlation-the-ho11y-grail-michael-hausenblas-awspdf": {
     "StatusCode": 206,
     "LastSeen": "2024-01-18T19:56:02.307051-05:00"