What to Do When You Have No Idea What is Wrong
Sometimes, things crash, and we have no idea why. SimpleReport is a system that depends on several external pieces of infrastructure, and any one of them can run into issues.
Consider this document a "Choose Your Own Adventure" guide to navigating an incident. If you run into a situation that isn't covered here, congratulations! You're going to have an amazing story to tell. Once you've recovered from the ordeal, make sure you add your knowledge to this document. Even the smallest of contributions will help others as we move forward.
Let's get started, shall we?
Remember: SimpleReport depends on Microsoft Azure. If Azure is having trouble with its data center network, whether because of a physical problem at a data center or a networking trunk issue, the critical components that make up SimpleReport may go offline. When these outages are at their worst, the Azure portal may also become inaccessible.
What to Check
Your first stop will be to check the Azure Status page. This should give you a detailed picture of what condition Azure's services are in.
Unfortunately, Microsoft is notorious for not updating this status page until after an incident has already been resolved. In the case of a very bad incident, they may not update it until they get enough pressure on Twitter. They may also be unable to update the page at all if any of its dependencies are hosted on their ailing cloud.
To get a better idea of Azure's condition, check out DownDetector. This status is crowdsourced, and it can usually tell you that Microsoft is having a problem before Microsoft knows about it.
How to Respond
If Azure is the problem:
- Turn on Maintenance Mode.
- Monitor. There's nothing more you can do until Azure comes back up.
If Azure is NOT the problem: Proceed to the next section.
If Azure is working, we need to run a sanity check to make sure our Azure subscription is properly allocated.
What to Check
Fetch a fresh SU password from the CDC network. Once you have that, log into Azure with your credentials.
What you do next depends on what results from your login.
How to Respond
If you're not able to log in:
- Escalate the PagerDuty alert to your backup. You will need someone else to handle on-call until your permissions are rectified.
If you're not able to log in, and you ARE the backup:
- See this section.
If you're able to log in, and cannot see any of our subscriptions:
- Escalate the PagerDuty alert to your backup. You will need someone else to handle on-call until your permissions are rectified.
If you're able to log in, you cannot see any of our subscriptions, and you ARE the backup:
- See this section.
If none of the above:
- Proceed to the next section.
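The failure branches above differ only in whether you are the backup. Here is a minimal sketch of that decision. The function and parameter names are hypothetical and are not part of any SimpleReport tooling.

```python
def subscription_check_next_step(can_log_in: bool,
                                 sees_subscriptions: bool,
                                 is_backup: bool) -> str:
    """Return the runbook action for the Azure subscription check."""
    # Healthy path: login works and our subscriptions are visible.
    if can_log_in and sees_subscriptions:
        return "proceed to the next section"
    # Every access failure is handled the same way; only the
    # responder's role changes the outcome.
    if is_backup:
        return "see the section linked for backups"
    return "escalate the PagerDuty alert to your backup"
```

For example, `subscription_check_next_step(True, False, False)` returns the escalation step, because a login that shows no subscriptions is treated the same as a failed login.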
Check the Backend
The backend consists of our SimpleReport App Service and PostgreSQL Flexible Server instance. Logs from the former should indicate problems with the latter, and will provide you with helpful debugging information.
What to Check
Follow the steps in our Container Debugging document.
How to Respond
If the error messages indicate a problem acquiring a database changelog lock:
- Follow the steps here.
If the error messages indicate a problem with application startup, and the problem is the DATABASE:
- See this section.
If the error messages indicate a problem with Okta:
- See this section.
If the error messages indicate a problem with ANYTHING ELSE:
- Perform a slot swap.
If swapping slots doesn't solve the problem:
- See this section.
If none of the above work:
- See this section.
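If you are scanning a lot of log output, the first-pass triage above can be roughed out as a keyword match. This is only a sketch. The patterns are assumptions about what the logs might contain (for instance, Liquibase-style "change log lock" wording), and `triage_backend_log` is a hypothetical helper, not something that exists in the repo.

```python
import re

# Ordered rules: the changelog-lock check comes first because lock errors
# may also mention the database. All patterns are illustrative assumptions.
RULES = [
    (re.compile(r"change\s?log lock", re.IGNORECASE), "release the changelog lock"),
    (re.compile(r"okta", re.IGNORECASE), "check Okta"),
    (re.compile(r"postgres|jdbc|database", re.IGNORECASE), "check the database"),
]
FALLBACK = "perform a slot swap"


def triage_backend_log(log_text: str) -> str:
    """Map raw log text to the first matching runbook action."""
    for pattern, action in RULES:
        if pattern.search(log_text):
            return action
    # Anything unrecognized falls through to the slot swap.
    return FALLBACK
```

Treat the result as a hint for which branch to read next, not as a diagnosis. The error messages themselves are still the source of truth.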
Without our database running the show, SimpleReport will grind to a halt. This is arguably our most critical component outside of the code itself.
What to Check
First, make sure the database isn't in a maintenance window. Microsoft's documentation covers managed maintenance in full, including how to find out when it is scheduled.
Next, check the basics. Does the database exist? Is the server started and running? Perform your standard sanity checks. If you need detailed instructions for troubleshooting, including diagnosing metrics that are outside of expected ranges, start here.
How to Respond
If the database is in a maintenance window:
- Monitor. There's nothing more you can do until maintenance is finished. Maintenance should be happening on a weekend outside of business hours, and usually lasts no more than an hour, so you won't need to worry about maintenance mode for this.
If the maintenance window lasts until the start of business hours:
- Call DevSecOps, and make some popcorn. There will be some fun calls with Microsoft Support you can listen to in your near future.
If the database is turned off:
- Turn it on by clicking the "Start" button.
If the database is started, but isn't responding:
- Restart the database.
If the restart doesn't fix the issue:
- Attempt to connect to the database using the Bastion procedure. More information about that is included in our secure docset, which can only be accessed if you have a CDC account and are a part of the CDCent org.
If you are unable to Bastion into the database:
- See this section.
If the database doesn't exist:
- Did you make a typo in the search box? Are you in the right subscription?
If you're sure you typed it correctly, and you verified that you're in the right subscription:
- See this section.
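The two maintenance-window branches above come down to comparing timestamps. Here is a rough sketch, assuming you know when the window started and when business hours begin. Both are inputs you supply; the function name and the returned strings are hypothetical.

```python
from datetime import datetime, timedelta

# From the guidance above: maintenance "usually lasts no more than an hour".
USUAL_MAX_DURATION = timedelta(hours=1)


def maintenance_action(window_start: datetime,
                       now: datetime,
                       business_open: datetime) -> str:
    """Pick the runbook response while the database is under maintenance."""
    # Maintenance has run into business hours: time to escalate.
    if now >= business_open:
        return "call DevSecOps"
    # Still outside business hours, but longer than the usual hour.
    if now - window_start > USUAL_MAX_DURATION:
        return "monitor closely: running longer than usual"
    return "monitor"
```

The "monitor closely" result is an addition for illustration only. The runbook itself distinguishes just "monitor" from "call DevSecOps".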
Ah, Akamai: wonderful purveyor of rapid data and blocker of unwanted DDoS attacks when it works, and an endless source of confusion and misery when it doesn't.
Akamai is our Content Delivery Network (CDN) layer. It is controlled through a CDC-owned panel, with a dedicated support team.
What to Check
First, make sure the backend is working. You'll want to follow the steps in Check the Backend above to verify that.
If the backend is up, try to navigate to the website again. If you see this screen:
[Image: Akamai error screen]
...your problem is likely Akamai. The screen above is what renders when Akamai logs an error while trying to load the origin site. To confirm, navigate to the following URL, replacing <env> with your environment's shortcode:
https://origin-<env>.simplereport.gov
What you do next will depend on what you see at the link above. Specifically, make sure that you aren't seeing a certificate error of any sort.
How to Respond
If you see the Akamai error screen, and the origin link loads with no certificate error:
- Call DevSecOps. Make it their problem...you'll thank me later. (Editor's Note: This statement was written by a DevSecOps engineer.)
If you see the Akamai error screen, and the origin link loads WITH A CERTIFICATE ERROR:
- Call DevSecOps. They forgot to renew an SSL certificate, or Azure somehow lost it. Regardless, it's their problem, now.
If you see the Akamai error screen, and the origin link DOES NOT LOAD:
- See this section.
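If you would rather not rely on a browser to spot a certificate error, the origin check can be sketched with Python's standard library. This is a minimal sketch, not an official tool. `origin_url` and `classify_origin` are hypothetical names, and treating an HTTP error status as "does not load" is an assumption.

```python
import ssl
import urllib.error
import urllib.request


def origin_url(env: str) -> str:
    """Build the origin URL from the runbook for an environment shortcode."""
    return f"https://origin-{env}.simplereport.gov"


def classify_origin(env: str, opener=urllib.request.urlopen,
                    timeout: float = 10.0) -> str:
    """Return "loads", "certificate_error", or "does_not_load"."""
    try:
        with opener(origin_url(env), timeout=timeout):
            return "loads"
    except urllib.error.HTTPError:
        # The server answered over TLS, but with an error status.
        # Assumption: treat that as the origin not loading.
        return "does_not_load"
    except urllib.error.URLError as exc:
        # urlopen wraps TLS failures in URLError; the cause is in .reason.
        if isinstance(exc.reason, ssl.SSLCertVerificationError):
            return "certificate_error"
        return "does_not_load"
    except ssl.SSLCertVerificationError:
        return "certificate_error"
    except OSError:
        # Timeouts and other socket-level failures.
        return "does_not_load"
```

The three return values line up with the three branches above, which all assume you have already seen the Akamai error screen on the public site. The `opener` parameter exists so the function can be exercised without touching the network.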
We use Okta as our central clearinghouse for authentication and authorization throughout the app. If Okta goes down, we're dead in the water.
What to Check
To see if Okta is reporting an outage, you can check their status page. They tend to be pretty good about providing updates, and will likely send notifications via email, as well.
You can also leverage DownDetector as an early-warning system for potential issues.
How to Respond
If Okta reports an outage:
- Turn on Maintenance Mode.
- Monitor. There's nothing more you can do until Okta comes back up.
If Okta is up, but authorization and authentication are still broken:
- Call DevSecOps. There is likely a configuration issue that needs to be taken care of, or Terraform decided to nuke something. Regardless, it's time to make this their problem.
If you've reached this part of the document...now you can start to panic. But, let's panic constructively!
If you have a secondary on-call member, escalate your alert to them.
If you are the secondary, or have no secondary to escalate to, now is the time to escalate the alert to the entirety of the on-call list. Make direct phone calls if you have to; our goal is to get as many eyes on the problem as possible.
If you hit this step, please remember one thing: breathe. You've done all you can; this is why we are a team. We're all here to help, and we will get through this!