Created blogpost for Observability Game days at Skyscanner (#4039)
jordibisbal8 authored Feb 26, 2024
1 parent 910040f commit 19720bb
Showing 4 changed files with 137 additions and 0 deletions.
109 changes: 109 additions & 0 deletions content/en/blog/2024/demo-skyscanner/index.md
@@ -0,0 +1,109 @@
---
title:
  "Making observability fun: How we increased engineers' confidence in incident
  management using a game"
linkTitle: Skyscanner using OTel Demo
date: 2024-02-26
author: >-
  [Jordi Bisbal Ansaldo](https://github.com/jordibisbal8) (Skyscanner)
cSpell:ignore: Ansaldo Bisbal Jordi runbooks Skyscanner upskilled Yankova
---

At [Skyscanner](https://www.skyscanner.net), as in many organizations, teams
tend to follow specific runbooks for individual failure modes. In modern,
complex distributed systems, however, most errors are unknowns, so runbooks are
only partially applicable.

After migrating our telemetry data to the OpenTelemetry standards at Skyscanner,
we now have richer instrumentation and can rely on observability directly. As a
result, we are ready to adopt a new
[observability mindset](https://charity.wtf/2019/09/20/love-and-alerting-in-the-time-of-cholera-and-observability/),
which requires training our engineers to work effectively with the new
ecosystem. This allows them to react efficiently to any known or unknown issues,
even under pressure.

To achieve this, we believe that the best way to gain knowledge isn’t through
one-time viewings of documents or videos. Instead, it’s through practical
exercises that include situations with never-before-seen (or at least rarely
seen) problems. This helps the company reduce the time to mitigate (TTM) an
issue: the time from when a first responder acknowledges the incident until
users are no longer affected by it.
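
As a minimal sketch of that definition, TTM is the difference between two
timestamps. The function name and the example timestamps below are
hypothetical and only illustrate the calculation; they are not taken from
Skyscanner's tooling.

```python
from datetime import datetime, timedelta


def time_to_mitigate(acknowledged_at: datetime, mitigated_at: datetime) -> timedelta:
    """Return the TTM: time from acknowledgement until users stop being affected."""
    if mitigated_at < acknowledged_at:
        raise ValueError("mitigation cannot happen before acknowledgement")
    return mitigated_at - acknowledged_at


# Hypothetical incident timeline, for illustration only.
acknowledged = datetime(2024, 2, 26, 10, 5)
mitigated = datetime(2024, 2, 26, 10, 47)

ttm = time_to_mitigate(acknowledged, mitigated)
print(f"TTM: {ttm}")
```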

## Environment

To begin with, we need to set up an environment that demonstrates the best
practices for monitoring and debugging using OpenTelemetry instrumentation and
observability. For this, we propose the use of the official
[OpenTelemetry Demo](/docs/demo/), which is a realistic example of a distributed
system called Astronomy Shop. Thanks to the
[OpenTelemetry Protocol](/docs/specs/otlp/) (OTLP), we can simply point the
standard OTLP exporter in the Collector to [New Relic](https://newrelic.com/),
our chosen observability platform at Skyscanner, which, like other platforms,
fully embraces open standards for ingesting telemetry data.
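
To give a rough idea of what "pointing the exporter" involves, the sketch below
builds the `exporters` section of a Collector configuration as a Python
dictionary and prints it as JSON. The Collector itself is configured in YAML, so
treat this only as an illustration of the shape of the settings. The exporter
name, the endpoint URL, the `api-key` header, and the environment variable name
are assumptions for this example and should be checked against your vendor's
documentation.

```python
import json


def build_otlp_exporter_config(name: str, endpoint: str, api_key_env_var: str) -> dict:
    """Build the `exporters` section for a single OTLP/HTTP exporter."""
    return {
        "exporters": {
            f"otlphttp/{name}": {
                "endpoint": endpoint,
                # `${env:VAR}` asks the Collector to read the value from its
                # environment, so the key is not stored in the config file.
                "headers": {"api-key": f"${{env:{api_key_env_var}}}"},
            }
        }
    }


# Assumed values, for illustration only.
config = build_otlp_exporter_config(
    name="newrelic",
    endpoint="https://otlp.nr-data.net",
    api_key_env_var="NEW_RELIC_LICENSE_KEY",
)

print(json.dumps(config, indent=2))
```

The exporter also has to be listed in the traces, metrics, and logs pipelines of
the Collector configuration before any data is sent to it.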

This system contains regressions that can be injected into the platform, which
helps us demonstrate the importance of Service Level Objectives (SLOs),
tracing, logs, metrics, etc. For instance, we can observe traffic flow through
various components, as shown in the image below. Since the demo is open source,
we can also contribute new features, which are reviewed by OpenTelemetry
contributors.

![Distributed tracing example in Astronomy shop](tracing-example.png)

## Observability game day

Once the environment is set up, we can introduce the Observability Game Day, an
initiative based on the Wheel of Misfortune practices that Google uses and
describes in the [Site Reliability Engineering book](https://sre.google/books/).

This game simulates a production incident, where a moderator known as the game
master (GM) conducts the session and someone from the audience spins the wheel
and explains an incident or outage. The participants are then divided into teams
and tasked with identifying and resolving the issue as quickly as possible. If
the solution is not optimal, the GM can help by introducing a new tool or view,
which gives a different perspective on how to tackle the incident (knowledge
sharing). This exercise can be repeated multiple times for different incidents.
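
The mechanics of a round can be sketched in a few lines. The incident names,
participant names, and helper functions below are hypothetical placeholders, not
the scenarios or tooling used at Skyscanner; the sketch only shows spinning the
wheel and splitting the audience into teams.

```python
import random

# Hypothetical incident scenarios that the wheel can land on.
INCIDENTS = [
    "checkout requests fail intermittently",
    "product pages respond slowly",
    "recommendations are missing",
    "error rate spikes after a deployment",
]


def spin_wheel(incidents: list[str], rng: random.Random) -> str:
    """Pick the incident for this round at random."""
    return rng.choice(incidents)


def assign_teams(participants: list[str], team_count: int, rng: random.Random) -> list[list[str]]:
    """Shuffle participants and deal them round-robin into teams."""
    shuffled = participants[:]
    rng.shuffle(shuffled)
    return [shuffled[i::team_count] for i in range(team_count)]


rng = random.Random()
participants = ["Ana", "Ben", "Chloe", "Dev", "Elif"]

incident = spin_wheel(INCIDENTS, rng)
teams = assign_teams(participants, team_count=2, rng=rng)

print(f"Incident: {incident}")
for number, team in enumerate(teams, start=1):
    print(f"Team {number}: {', '.join(team)}")
```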

![Wheel of misfortune example](wheel.png)

## Results

The Observability Game Day has already been completed by multiple Skyscanner
teams, with each team's observability expert (ambassador) running the session.
Participants have given extremely positive feedback: 90% of respondents say
that after the Game Day they feel more confident debugging production systems
and would love to have further sessions.

- Hugely valuable to run against real services and to compare and contrast
different debugging methods. I'm certain everyone, regardless of skill level,
will have got something out of the session - I know I did! Thank you for
taking the time to set this up and promoting it for us -
[Dominic Fraser](https://github.com/dominicfraser) (Senior Software Engineer)
- It is a really great (company-wide) initiative to get people upskilled in
observability and OpenTelemetry/New Relic and I personally found it very
useful, as well as a lot of fun! :D - Polly Yankova (Software Engineer)

In addition, we learned that:

1. OTLP makes it incredibly simple to integrate a standard application with an
observability vendor. Just point it to the right endpoint and job done.
2. Our winning teams relied primarily on tracing data to analyze regressions,
which helped them understand the root cause faster. Tracing FTW!
3. Front-end engineers found the Game Day lacked focus on client-side
observability, so we decided to contribute upstream (see next steps below).
This was my first contribution to the project, and it was a great experience!
Maintainers were very welcoming and helped me to test and release. Thanks!

## Next steps

The next action is to run sessions for all the engineering teams in the company
and convert them into a Skyscanner learning course. This way, the content can be
used during the onboarding process for new joiners or even reviewed at any time
as a refresher for those who have been in the company longer. In addition, after
observing common feedback, we identified that it would be beneficial to extend
the current incidents to include more front-end-specific ones, such as incidents
triggered by browser traffic. To achieve this, we have contributed to the
OpenTelemetry Demo and enabled these features for other interested parties. For
more information, please have a look at the
[raised PR](https://github.com/open-telemetry/opentelemetry-demo/pull/1345).
Binary file added content/en/blog/2024/demo-skyscanner/wheel.png
28 changes: 28 additions & 0 deletions static/refcache.json
@@ -223,6 +223,10 @@
"StatusCode": 206,
"LastSeen": "2024-01-30T06:06:13.062554-05:00"
},
"https://charity.wtf/2019/09/20/love-and-alerting-in-the-time-of-cholera-and-observability/": {
"StatusCode": 200,
"LastSeen": "2024-02-26T10:53:38.116124+01:00"
},
"https://circleci.com": {
"StatusCode": 206,
"LastSeen": "2024-01-30T05:18:29.78394-05:00"
@@ -2291,6 +2295,10 @@
"StatusCode": 200,
"LastSeen": "2024-01-25T10:54:57.162378745Z"
},
"https://github.com/dominicfraser": {
"StatusCode": 200,
"LastSeen": "2024-02-26T10:53:39.237712+01:00"
},
"https://github.com/dotansimha/graphql-yoga": {
"StatusCode": 200,
"LastSeen": "2024-01-30T05:18:34.524624-05:00"
@@ -2475,6 +2483,10 @@
"StatusCode": 200,
"LastSeen": "2024-01-30T16:14:54.527183-05:00"
},
"https://github.com/jordibisbal8": {
"StatusCode": 200,
"LastSeen": "2024-02-26T10:53:37.290066+01:00"
},
"https://github.com/joshleecreates/": {
"StatusCode": 200,
"LastSeen": "2024-01-30T05:18:13.610639-05:00"
@@ -3011,6 +3023,10 @@
"StatusCode": 200,
"LastSeen": "2024-01-30T06:05:58.807859-05:00"
},
"https://github.com/open-telemetry/opentelemetry-demo/pull/1345": {
"StatusCode": 200,
"LastSeen": "2024-02-26T10:53:40.03196+01:00"
},
"https://github.com/open-telemetry/opentelemetry-demo/pull/432": {
"StatusCode": 200,
"LastSeen": "2024-01-30T15:26:30.96845-05:00"
@@ -4851,6 +4867,10 @@
"StatusCode": 200,
"LastSeen": "2024-01-18T19:02:19.249572-05:00"
},
"https://newrelic.com/": {
"StatusCode": 206,
"LastSeen": "2024-02-26T10:53:38.368111+01:00"
},
"https://newrelic.com/blog/authors/daniel-kim": {
"StatusCode": 206,
"LastSeen": "2024-01-18T19:10:48.917326-05:00"
@@ -6967,6 +6987,10 @@
"StatusCode": 206,
"LastSeen": "2024-01-30T15:25:13.019086-05:00"
},
"https://sre.google/books/": {
"StatusCode": 206,
"LastSeen": "2024-02-26T10:53:38.643051+01:00"
},
"https://stackoverflow.com/questions/5626193/what-is-monkey-patching": {
"StatusCode": 200,
"LastSeen": "2024-01-18T19:07:28.672979-05:00"
@@ -8119,6 +8143,10 @@
"StatusCode": 200,
"LastSeen": "2024-01-30T06:01:24.01921-05:00"
},
"https://www.skyscanner.net": {
"StatusCode": 206,
"LastSeen": "2024-02-26T10:53:37.476242+01:00"
},
"https://www.slideshare.net/Altinity/osa-con-2022-signal-correlation-the-ho11y-grail-michael-hausenblas-awspdf": {
"StatusCode": 206,
"LastSeen": "2024-01-18T19:56:02.307051-05:00"