-
Notifications
You must be signed in to change notification settings - Fork 164
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add option to return SUCCEED when training is completed with some failed job tasks #421
Conversation
be28773
to
a91c858
Compare
Triggered another run of the job. |
CI complaints I don’t have permission to rebuild it. @morenn520 can you push an empty commit to trigger another build ? |
@@ -246,6 +246,10 @@ public static String getContainerDockerKey() { | |||
// Job types that we will short circuit when it failed | |||
public static final String STOP_ON_FAILURE_JOBTYPES = TONY_APPLICATION_PREFIX + "stop.on.failure.jobtypes"; | |||
|
|||
// Whether to return FAILED when training is completed with some failed job tasks | |||
public static final String FAILED_ON_COMPLETED_WITH_FAILURE_ENABLED = TONY_APPLICATION_PREFIX + "failed_on_complete_with_failure.enabled"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Merge with the logic to handle STOP_ON_FAILURE_JOBTYPES?
Also use a-b-c instead of a_b_c
And add a test
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
They could config STOP_ON_FAILURE_JOBTYPES if they needs fail fast, while return SUCCEED when training is completed if they needs training to continue by FAILED_ON_COMPLETED_WITH_FAILURE_ENABLED.
So I don't think FAILED_ON_COMPLETED_WITH_FAILURE_ENABLED could be merged with STOP_ON_FAILURE_JOBTYPES.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For test, I should run a job where one worker will be failed and others continue to complete the training finally. And I don't find any test case like this in TonY. @oliverhu Any thought or suggestion on UT?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
on a second thought, we could refactor and follow the convention of tony.x.stop_on_failure
tony.x.fail_app_on_failure
, basically use this to form the table we discussed in the issue, thoughts?
Check the E2E tests, they mimic real life tony jobs.
moved to discuss in #421 |
Description: #420