Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: rds error should fail loudly in sfn #6255

Merged
merged 11 commits into from
Nov 22, 2023
Merged

Conversation

danieljhegeman
Copy link
Contributor

@danieljhegeman danieljhegeman commented Nov 21, 2023

Reason for Change

Changes

  • exit(1) when cxg failure or seurat failure is present. Logging is already occurring locally as well as notification to slack.

Testing steps

  • Tests will come in follow-up but this should be incorporated asap for schema migration testing purposes.

Checklist 🛎️

  • Add product, design, and eng as reviewers for rdev review
  • For UI changes, add screenshots/videos, so the reviewers know what you expect them to see
  • For UI changes, add e2e tests to prevent regressions

Notes for Reviewer

Copy link
Contributor

Deployment Summary

@danieljhegeman danieljhegeman marked this pull request as ready for review November 21, 2023 09:03
Copy link

codecov bot commented Nov 21, 2023

Codecov Report

All modified and coverable lines are covered by tests ✅

Comparison is base (77e0eb7) 91.76% compared to head (143c604) 91.76%.

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #6255   +/-   ##
=======================================
  Coverage   91.76%   91.76%           
=======================================
  Files         175      175           
  Lines       14129    14129           
=======================================
  Hits        12965    12965           
  Misses       1164     1164           
Flag Coverage Δ
unittests 91.76% <ø> (ø)

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Copy link
Contributor

@nayib-jose-gloria nayib-jose-gloria left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change works for cxg + seurat, but the original ticket mentions marking the sfn as failed upon errors in the validate step also. I'm not 100% sure if we should, since some validate failures will be intended failures (i.e. the curator submits an invalid dataset), but it would be helpful for debugging cases where the failure is unexpected. I'm not sure how simple it'd be to determine whether a validation error is expected or unexpected and handle accordingly. @Bento007 your thoughts?

We can decouple that change to unblock this, if we make a separate ticket for it

Copy link
Contributor

@nayib-jose-gloria nayib-jose-gloria left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Bento007, I believe, mentioned that it might be problematic to fail before we deregister the Job Definition. Am I recalling correctly? This step runs before then, so that'd be a problem if so

@danieljhegeman
Copy link
Contributor Author

@Bento007, I believe, mentioned that it might be problematic to fail before we deregister the Job Definition. Am I recalling correctly? This step runs before then, so that'd be a problem if so

@nayib-jose-gloria that's right, I'll fix that.

@Bento007
Copy link
Contributor

Bento007 commented Nov 21, 2023

When I was imaging the solution for this, i was assuming we could do it within in the SFN definition. When a step fails in the SFN it puts the error in the result passed to the next step. All we'd need to do is check for the existence of this error field in the resulting JSON and set the state to Failed. This check could then be done right after we deregister. Like this
Screenshot 2023-11-21 at 2 05 09 PM

    "DeregisterJobDefinition": {
      "Type": "Task",
      "Parameters": {
        "JobDefinition.$": "$.batch.JobDefinitionName"
      },
      "Resource": "arn:aws:states:::aws-sdk:batch:deregisterJobDefinition",
      "Next": "Detect Errors"
    },
    "Detect Errors": {
      "Type": "Choice",
      "Choices": [
        {
          "Variable": "$.error",
          "IsPresent": true,
          "Next": "Fail"
        }
      ],
      "Default": "Pass"
    },
    "Fail": {
      "Type": "Fail"
    },
    "Pass": {
      "Type": "Pass",
      "End": true
    }

Copy link
Contributor

@Bento007 Bento007 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will you also be catching error from download and validate in future work?

Copy link
Contributor

@Bento007 Bento007 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔥

@danieljhegeman danieljhegeman merged commit 8923090 into main Nov 22, 2023
@danieljhegeman danieljhegeman deleted the dan/6234-sfn-error branch November 22, 2023 05:54
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants