New precheck procedure to enhance stability. #1453

Open
wants to merge 21 commits into
base: master

BalaBalaYi
Collaborator

What changes were proposed in this pull request?

  1. Pre-check operator API definition.
  2. Design document.
  3. Basic implementation of the pre-check procedure, covering both the master and the worker.
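
As a rough illustration of what a pre-check operator API could look like, here is a minimal sketch. Every name in it (`PreCheckResult`, `PreCheckOperator`, `run_pre_check`, and the two example operators) is hypothetical and chosen for illustration only; the actual interface is defined in this PR's `precheck_operator.py` and design document and may differ.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import List


@dataclass
class PreCheckResult:
    """Hypothetical result type: code 0 means the check passed."""

    code: int = 0
    message: str = ""

    def passed(self) -> bool:
        return self.code == 0


class PreCheckOperator(ABC):
    """Hypothetical base class; the PR's real interface may differ."""

    @abstractmethod
    def check(self) -> PreCheckResult:
        """Run one check and report the outcome."""


class AlwaysPassOperator(PreCheckOperator):
    """Example operator that never fails."""

    def check(self) -> PreCheckResult:
        return PreCheckResult()


class FailingOperator(PreCheckOperator):
    """Example operator that always reports a failure."""

    def check(self) -> PreCheckResult:
        return PreCheckResult(code=1, message="node not ready")


def run_pre_check(operators: List[PreCheckOperator]) -> PreCheckResult:
    """Run operators in order and stop at the first failure."""
    for operator in operators:
        result = operator.check()
        if not result.passed():
            return result
    return PreCheckResult()
```

In this sketch the master would call `run_pre_check` before training starts and hold the workers back until the returned result passes. That ordering is an assumption about the flow, not something this description states.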

Why are the changes needed?

For details, please see the design document in the current PR.

Does this PR introduce any user-facing change?

Users can enable or disable the pre-check function through job arguments. For details, please see the development document in the current PR.
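
A minimal sketch of how such a switch might be exposed as a job argument, using only the standard `argparse` module. The flag name `--pre_check`, its default, and the helper names are assumptions made for illustration; the real argument is documented in `docs/deployment/argument.md` in this PR.

```python
import argparse


def str2bool(value: str) -> bool:
    """Parse common textual booleans; reject anything else."""
    lowered = value.strip().lower()
    if lowered in ("true", "1", "yes"):
        return True
    if lowered in ("false", "0", "no"):
        return False
    raise argparse.ArgumentTypeError(f"expected a boolean, got {value!r}")


def build_parser() -> argparse.ArgumentParser:
    """Build a parser with a hypothetical pre-check switch."""
    parser = argparse.ArgumentParser(description="job master arguments (sketch)")
    parser.add_argument(
        "--pre_check",  # hypothetical flag name, not taken from the PR
        type=str2bool,
        default=False,  # assumed default; the PR may choose otherwise
        help="Enable or disable the pre-check procedure.",
    )
    return parser
```

For example, `build_parser().parse_args(["--pre_check", "true"]).pre_check` evaluates to `True`, and omitting the flag leaves it at the assumed default of `False`.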

How was this patch tested?

Unit tests and a simple training job.

@BalaBalaYi BalaBalaYi added this to the v0.5.0 milestone Jan 26, 2025
@BalaBalaYi BalaBalaYi self-assigned this Jan 26, 2025

codecov bot commented Jan 26, 2025

Codecov Report

Attention: Patch coverage is 91.32420% with 19 lines in your changes missing coverage. Please review.

Project coverage is 81.69%. Comparing base (1c1ac83) to head (54d489c).

| Files with missing lines | Patch % | Lines |
|---|---|---|
| ...rover/python/master/diagnosis/diagnosis_manager.py | 89.18% | 4 Missing ⚠️ |
| dlrover/python/elastic_agent/master_client.py | 25.00% | 3 Missing ⚠️ |
| ...rover/python/master/diagnosis/precheck_operator.py | 91.17% | 3 Missing ⚠️ |
| dlrover/python/master/servicer.py | 60.00% | 2 Missing ⚠️ |
| dlrover/trainer/torch/elastic_run.py | 83.33% | 2 Missing ⚠️ |
| dlrover/python/master/args.py | 88.88% | 1 Missing ⚠️ |
| dlrover/python/master/diagnosis/diagnosis.py | 50.00% | 1 Missing ⚠️ |
| dlrover/python/master/main.py | 0.00% | 1 Missing ⚠️ |
| dlrover/python/master/node/job_context.py | 90.00% | 1 Missing ⚠️ |
| dlrover/python/tests/test_pre_check_operator.py | 94.11% | 1 Missing ⚠️ |

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #1453      +/-   ##
==========================================
+ Coverage   81.60%   81.69%   +0.08%     
==========================================
  Files         240      242       +2     
  Lines       24060    24270     +210     
==========================================
+ Hits        19635    19827     +192     
- Misses       4425     4443      +18     


BalaBalaYi and others added 11 commits January 26, 2025 18:54
# Conflicts:
#	dlrover/python/common/global_context.py
#	dlrover/python/master/diagnosis/diagnosis_manager.py
#	dlrover/python/tests/test_args.py
#	dlrover/python/tests/test_diagnosis_manager.py
#	docs/deployment/argument.md
# Conflicts:
#	dlrover/python/common/constants.py