Complete Zero Silent Failures Checklist #93561

Closed
18 tasks
pacerwow opened this issue Sep 24, 2024 · 2 comments
Labels
DBEX-Carbs Disability Benefits Experience - Team Carbs DBEX-TREX Disability Benefits Experience - Team T-Rex disability-experience zero-silent-failures Work related to eliminating silent failures ZSF:Documentation Tickets related to discovery, documenting flows

Comments

@pacerwow
Contributor

pacerwow commented Sep 24, 2024

Issue Description

As a DBEX team member, I want to complete the Zero Silent Failures (ZSF) checklist (below) to ensure that we provide the very best Veteran experience. The activities listed below will be completed, any future product, design, or engineering work needed to address ZSF will be identified, and tickets with clear action items will be created for that work.


Tasks

  • Complete the checklist below
  • OCTO, Team 1, Team 2 meet as needed to discuss or resolve any items
  • PMs and Eng Leads from both teams meet with the Zero Silent Failures federal lead team on October 2nd

Acceptance Criteria

  • The checklist below is completed
  • Create tickets for any additional work to address gaps

Additional Information

Text-based documentation is saved in products/disability/526ez


Checklist

Start

  • Do you know when your application shipped to production?
    • If not, use GitHub to determine, roughly, when your application shipped to users.
  • Did your application use the same APIs when it shipped as it does today?
    • If not, then you'll need to consider the path user data took through both the current architecture and the previous architecture. You will need to account for potential failures in all paths since your application shipped.

Monitoring

  • Do you monitor the API(s) that you submit to via Datadog?
  • Does your Datadog monitoring use the appropriate tagging?
  • Do errors detected by Datadog go into a Slack notifications channel?
  • Does more than one person look at the Slack notifications channel containing errors on a daily basis?
  • Do the team members monitoring the Slack channel have a system for acknowledging and responding to the errors that appear there?

⚠️ Failure to have endpoint monitoring in place is a blocking QA standard at Staging review as of 9/10/24. If you answered no to any of the questions above, you will be blocked from shipping at the Staging review touchpoint in Collab Cycle.
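As a rough illustration of what tagged monitoring data can look like, the sketch below formats a counter in the DogStatsD datagram format (`name:value|c|#tag:value,...`) without sending it anywhere. The metric name and the tag keys and values are hypothetical placeholders, not the tag set the platform requires; check the tagging guidance for the real names.

```python
def dogstatsd_counter(metric: str, tags: dict[str, str], value: int = 1) -> str:
    """Format a counter as a DogStatsD datagram: name:value|c|#key:val,key:val."""
    datagram = f"{metric}:{value}|c"
    if tags:
        # Sort the tags so the output is stable and easy to compare in tests.
        tag_part = ",".join(f"{key}:{val}" for key, val in sorted(tags.items()))
        datagram += f"|#{tag_part}"
    return datagram


# Hypothetical metric and tag names, used only for illustration.
line = dogstatsd_counter(
    "form526.submission.failure",
    {"service": "disability-application", "function": "submit", "env": "staging"},
)
print(line)
```

Consistent tags are what let a monitor be scoped to one team's endpoints and routed to that team's Slack notifications channel.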

Reporting errors

  • Have you filed issues for errors that are appearing in Datadog / Slack?
    • If not, then start filing GitHub issues for new categories of errors following this guidance
  • Do all fatal errors thrown in your application end up visible to the end user either in the user interface or via email?
    • If not, then file GitHub issues to capture error categories following this guidance

Documentation

  • Do you have a diagram of the submission path that the user data accepted by your application takes to reach a system of record?
  • Do you understand how the error is handled when each system in the submission path fails, is down for maintenance, or is completely down?
    • If not, then create documentation that captures how errors in each system are handled. Detail which systems retry a submission and what happens when those retries exhaust. Show this in your diagram.
  • Has the owner of the system of record receiving the user's data indicated in writing that their system notifies or resolves 100% of fatal errors once in their custody?
    • If not, work with OCTO to meet with the owner of the system and get their agreement in writing.
    • Please document the outcome of this conversation in your product's documentation in GitHub.
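The question of what happens when retries exhaust can be made concrete with a small sketch. The code below is a minimal, hypothetical example and not the actual 526 submission code: `send` stands in for a call to a downstream system, and `notify` stands in for whatever surfaces the failure (an alert, an email, a status update). The point it shows is that the exhaustion branch does something explicit instead of dropping the submission.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class SubmissionResult:
    delivered: bool
    attempts: int


def submit_with_retries(
    send: Callable[[], None],
    max_attempts: int,
    notify: Callable[[str], None],
) -> SubmissionResult:
    """Try send() up to max_attempts times; call notify() if every attempt fails."""
    last_error: Optional[Exception] = None
    for attempt in range(1, max_attempts + 1):
        try:
            send()
            return SubmissionResult(delivered=True, attempts=attempt)
        except Exception as exc:  # a real system would catch narrower errors and back off
            last_error = exc
    # Retries are exhausted: surface the failure instead of failing silently.
    notify(f"submission failed after {max_attempts} attempts: {last_error}")
    return SubmissionResult(delivered=False, attempts=max_attempts)


def always_down() -> None:
    # Hypothetical downstream system that is completely down.
    raise ConnectionError("downstream system unavailable")


sent: list[str] = []
result = submit_with_retries(always_down, max_attempts=3, notify=sent.append)
print(result, sent)
```

Each system in the submission path that retries has a branch like the one after the loop; the diagram should show where that branch leads.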

User experience

  • Do you capture all of the potential points of failure and make those errors known to the user via email notification and/or through the application on VA.gov or the mobile application?
    • If not, don't worry. Few teams are doing this and we'll be providing resources to help you do this in your application. Proceed to create a user data flow diagram. That diagram will help us to help you and your team to create this user experience.
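One way to work through the question above is to list each point of failure in the submission path and record whether it is monitored and whether the user is told about it. The sketch below assumes a simple model of that list; the system names and flag values are hypothetical examples, not findings about any real application.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FailurePoint:
    system: str
    monitored: bool       # an alert fires when this step fails
    user_notified: bool   # the user learns of the failure (UI or email)


def silent_failure_points(path: list[FailurePoint]) -> list[str]:
    """Return systems whose failures never reach the user."""
    return [point.system for point in path if not point.user_notified]


def unmonitored_points(path: list[FailurePoint]) -> list[str]:
    """Return systems whose failures raise no alert for the team."""
    return [point.system for point in path if not point.monitored]


# Hypothetical submission path, for illustration only.
path = [
    FailurePoint("form submission API", monitored=True, user_notified=True),
    FailurePoint("background upload job", monitored=True, user_notified=False),
    FailurePoint("system of record", monitored=False, user_notified=False),
]

print(silent_failure_points(path))
print(unmonitored_points(path))
```

Every entry returned by `silent_failure_points` is a candidate for a follow-up ticket, and the same list of systems can seed the user data flow diagram.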

We don't have any silent errors!

Great! Please let us know in our Slack channel, #zero-silent-failures, that you went through the checklist above as a team and did not find any silent failures, and send us a link to a copy of this completed checklist. If you don't connect to a backend system, you don't need to fill out the checklist, but let us know in your message. You don't have to hang out in there once you have notified us. Just pop in, tell us who you are (which team and in which portfolio) and that no failures were found. Thanks!

Additional details

Set up monitoring in Datadog

Follow this guidance on endpoint monitoring to get going. Then follow the guidance on monitoring performance to get up to speed with Datadog.

Examples

Additional examples

File silent error issues in GitHub

We want to know about your silent errors so that we can help you to fix them. To do this, follow the process in the Managing Errors document.

How to create a user data flow diagram

Creating a user data flow diagram is a requirement of the Zero silent errors initiative and will be a required asset at the Architecture Intent touchpoint of the Engineering and Security track of Collaboration Cycle.

Learn how to create a user data flow diagram

@pacerwow pacerwow added DBEX-TREX Disability Benefits Experience - Team T-Rex disability-experience DBEX-Carbs Disability Benefits Experience - Team Carbs zero-silent-failures Work related to eliminating silent failures labels Sep 24, 2024
@pacerwow pacerwow changed the title Complete Checklist for Zero Silent Failures Complete Zero Silent Failures Checklist Sep 24, 2024
@nnicholas7 nnicholas7 self-assigned this Sep 25, 2024
@lisacapaccioli
Contributor

lisacapaccioli commented Sep 25, 2024

In this thread, please link me to any document or artifact you've worked on related to submission flow or failure/error cases. Even if they're out of date or partial, they'll help me as I take a first pass at the checklist for zero silent failures. I have a few already but am just casting a wide net. More is better! I'm looking for your pile of links before EOD today.

Sam Stuckey

features that support / power the safety net:

Remediation era docs

Kyle Soskin
doc from a while ago of stuff I thought should be monitored, and how, not sure if any of it got made
I feel like I probably had a lot of other stuff, but will have to dig around for it

Aurora Hampton

  1. high-level domain model
  2. technical overview for submit migration to lighthouse
  3. diagram used for our 2022 form version PSIRR

non-diagram things

  1. what happens if pdf polling fails in the primary path?
  2. troubleshooting 2022 form version problems
  3. supplemental documentation to the remediation process

Nathan Burgess
I believe all I had were the two diagrams in the bottom left of this page, which include my initial discovery of where I believed the silent failures were, plus a draft of a "state machine" that could track the status of all of the ancillary documents. Our understanding of the failure points may have changed in the meantime; this was back at the beginning of the year, when I first noticed where things could be slipping through the cracks.
mural.co: 526 Claim Submission Migration - Polling (32 kB)
So that would be "(Current) Ancillary Jobs Flow, Retry and Fail State Diagram" and "(Future) Ancillary Actions State Machine"

Scott
Veteran Document Upload Silent Failure Discovery (original draft; numbers and process do not reflect the most recent findings and methodology, to be documented this sprint)

@emilytheis
Contributor

I moved this checklist into a markdown file in the sensitive repo since it will need to live on beyond the lifecycle of this ticket:

https://github.com/department-of-veterans-affairs/va.gov-team-sensitive/blob/master/teams/benefits/investigations/Disability%20526-%20Zero-Silent-Failures-Checklist-.md

@sanjabaj2 sanjabaj2 added the ZSF:Documentation Tickets related to discovery, documenting flows label Sep 30, 2024

10 participants