Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add option to return SUCCEED when training is completed with some failed job tasks #421

Closed

Conversation

charliechen211
Copy link
Contributor

Description: #420

@oliverhu
Copy link
Member

Triggered another run of the job.

@oliverhu
Copy link
Member

CI complaints I don’t have permission to rebuild it. @morenn520 can you push an empty commit to trigger another build ?

@@ -246,6 +246,10 @@ public static String getContainerDockerKey() {
// Job types that we will short circuit when it failed
public static final String STOP_ON_FAILURE_JOBTYPES = TONY_APPLICATION_PREFIX + "stop.on.failure.jobtypes";

// Whether to return FAILED when training is completed with some failed job tasks
public static final String FAILED_ON_COMPLETED_WITH_FAILURE_ENABLED = TONY_APPLICATION_PREFIX + "failed_on_complete_with_failure.enabled";
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Merge with the logic to handle STOP_ON_FAILURE_JOBTYPES?
Also use a-b-c instead of a_b_c
And add a test

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

They could config STOP_ON_FAILURE_JOBTYPES if they needs fail fast, while return SUCCEED when training is completed if they needs training to continue by FAILED_ON_COMPLETED_WITH_FAILURE_ENABLED.

So I don't think FAILED_ON_COMPLETED_WITH_FAILURE_ENABLED could be merged with STOP_ON_FAILURE_JOBTYPES.

Copy link
Contributor Author

@charliechen211 charliechen211 Jan 23, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For test, I should run a job where one worker will be failed and others continue to complete the training finally. And I don't find any test case like this in TonY. @oliverhu Any thought or suggestion on UT?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

on a second thought, we could refactor and follow the convention of tony.x.stop_on_failure tony.x.fail_app_on_failure, basically use this to form the table we discussed in the issue, thoughts?
Check the E2E tests, they mimic real life tony jobs.

@oliverhu
Copy link
Member

moved to discuss in #421

@oliverhu oliverhu closed this Feb 14, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants