Training Overview
Learning about the PagerDuty major incident response process is an important part of being an effective on-call engineer at PagerDuty. This section goes over our training material for the various roles that are involved in our incident response, along with some additional information and training material from government agencies.

Training Guides

Our training guides are split up by role, but we encourage you to read the guides for roles you don't hold as well, since they give good insight into how those people will behave during major incidents.

  • Incident Commander Training - The "IC" is the person who drives a major incident to resolution. They're the person who will be directing everyone else.
  • Deputy Training - The Deputy is someone who supports the Incident Commander and can take over for them if necessary.
  • Scribe Training - This is intended for individuals who will be acting as a Scribe during an incident.
  • SME / Resolver Training - This is relevant to everyone at PagerDuty who is on-call for any team.
  • Customer Liaison Training - This is for individuals who will be publicly representing us and interacting with customers.
  • Internal Liaison Training - This is relevant for anyone who might be called upon to work with internal teams during an incident.

Training Courses

We've also published slides and videos of some of our training courses. These were originally used internally at PagerDuty to train our staff, and we've since adapted them for a wider audience so you can use them in your own organization.

Example Incident

This recorded call is a reenactment of an actual major incident that occurred at PagerDuty in January 2017. Some details have been changed in the interest of brevity and privacy, but the incident remains otherwise largely intact. For more details about the recording, you can read the PagerDuty blog post.

<iframe width="560" height="315" src="https://www.youtube-nocookie.com/embed/yoY_pDxc0TA?rel=0" frameborder="0" allow="autoplay; encrypted-media" allowfullscreen></iframe>

National Incident Management System (NIMS)

Our incident response process is loosely based on the US National Incident Management System (NIMS), which is described as:

A systematic, proactive approach to guide departments and agencies at all levels of government, nongovernmental organizations, and the private sector to work together seamlessly and manage incidents involving all threats and hazards—regardless of cause, size, location, or complexity—in order to reduce loss of life, property, and harm to the environment.

While it might not initially seem applicable to an IT operations environment, we've found that many of the lessons learned from major incidents in emergency management can be directly applied to our industry too. The principles are the same and span many different environments.

NIMS Training

If you want to learn more about NIMS, we recommend the ICS-100 and ICS-700 online training courses, which go over NIMS and the Incident Command System. You can also take an online examination after training to get a certificate from FEMA. There is also a wealth of additional training material and courses from FEMA on NIMS, which we encourage you to look at.

If you're based in the US and interested in taking a more active incident response role in your community, we recommend investigating your local Community Emergency Response Team (CERT) programs. Many cities offer CERT training, after which you can volunteer as a CERT contributor within your community. Not only is it an opportunity to get real-world experience with disaster response, but the skills you learn can be applied to everyday life too.

Also take a look at the Additional Reading page.

Incident Response Around the World

While NIMS is the US incident response framework, many countries have their own similar frameworks. Some are based on the US system, but many were developed on their own. There's a wealth of additional information to be learned by investigating the methods and frameworks used in countries all over the world.

A book called "Comparative Emergency Management: Understanding Disaster Policies, Organizations, and Initiatives from Around the World" (available from the FEMA website) compares the systems used by 30 or so different countries, and is an amazing collection of information on emergency management frameworks used around the world.

Here are a few of the systems we looked at in more detail in order to adapt and improve our own process at PagerDuty.

United Kingdom

The United Kingdom emergency services use a command hierarchy called Gold-Silver-Bronze Command Structure for their major operations. The framework involves three levels responsible for strategic (gold), tactical (silver), and operational (bronze) command decisions.

Here are some useful reading materials if you're interested in learning more:

New Zealand

New Zealand's system is called the Coordinated Incident Management System (CIMS) and is based upon the Incident Command System used in the US. One area we particularly liked from CIMS is its focus on common terminology, which helps prevent confusion during an incident and allows for a faster and more effective response. Some terminology has been changed from ICS (e.g. "Control" instead of "Command" to describe the management functions), but it should still be familiar.

Here are some useful reading materials if you're interested in learning more:

Australia

Australia uses a system called the Australasian Inter-Service Incident Management System (AIIMS), which is a derivative of the NIMS framework used in the US. While based on ICS, AIIMS puts a bigger focus on span of control than other frameworks. As with New Zealand's system, there are some differences in terminology (e.g. "Incident Controller" instead of "Incident Commander"), but it should still be familiar to those who know ICS.

Here are some useful reading materials if you're interested in learning more:

Canada

Canada uses its own Incident Command System (PDF), the standard for which is maintained by a network of organizations called ICS Canada. Their website has information on how to find local training courses in Canada, depending on your province.

Here are some useful reading materials if you're interested in learning more: