Enhance torch240 rendezvous to improve fault tolerance ability. #1454

BalaBalaYi
Collaborator

What changes were proposed in this pull request?

Add "wait" logic in two places:

  1. Where rank 0 retrieves all role info.
  2. Where the other ranks retrieve their rank info.

Why are the changes needed?

This optimizes the rendezvous logic for torch versions greater than 2.4.0.

If a worker exits while ranks are being assigned during rendezvous, the metadata retrieved by the remaining workers can be empty. With this change, a process that reads empty metadata no longer exits immediately on the resulting deserialization failure. Instead, it waits, which prevents every worker taking part in the rendezvous from hitting the error and exiting at once. If non-empty data still cannot be obtained, the process terminates with an exception once the pending timeout expires.
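The wait behavior described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `wait_for_metadata`, `fetch`, `timeout`, and `interval` are hypothetical names, and JSON stands in for whatever serialization the rendezvous store actually uses.

```python
import json
import time
from typing import Callable


def wait_for_metadata(
    fetch: Callable[[], bytes],
    timeout: float = 30.0,
    interval: float = 0.5,
) -> dict:
    """Poll `fetch` until it returns non-empty data, then deserialize it.

    `fetch` is a hypothetical stand-in for the call that reads role or
    rank info from the rendezvous store.
    """
    deadline = time.monotonic() + timeout
    while True:
        raw = fetch()
        if raw:
            # Non-empty payload: deserializing is now safe.
            return json.loads(raw)
        if time.monotonic() >= deadline:
            # Still empty after the wait window: give up with an exception.
            raise TimeoutError("metadata still empty after waiting")
        # Empty payload: wait and retry instead of failing right away.
        time.sleep(interval)


# Usage: the first two reads are empty, the third one succeeds.
responses = iter([b"", b"", b'{"rank": 1}'])
info = wait_for_metadata(lambda: next(responses), timeout=5.0, interval=0.01)
```

The point of the design is that an empty read is treated as "not ready yet" rather than as a fatal deserialization error, so a single exiting worker does not cascade into every peer failing at once.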

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Unit tests. Validation in a training run is still TODO.

@BalaBalaYi BalaBalaYi added the enhancement New feature or request label Jan 27, 2025
@BalaBalaYi BalaBalaYi added this to the v0.5.0 milestone Jan 27, 2025
@BalaBalaYi BalaBalaYi self-assigned this Jan 27, 2025
codecov bot commented Feb 5, 2025

Codecov Report

Attention: Patch coverage is 90.00000% with 7 lines in your changes missing coverage. Please review.

Project coverage is 81.74%. Comparing base (6740f60) to head (e942ea3).

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| dlrover/python/elastic_agent/torch/training.py | 73.33% | 4 Missing ⚠️ |
| ...lrover/python/tests/test_elastic_training_agent.py | 94.33% | 3 Missing ⚠️ |

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #1454      +/-   ##
==========================================
+ Coverage   81.61%   81.74%   +0.13%     
==========================================
  Files         240      240              
  Lines       24045    24097      +52     
==========================================
+ Hits        19624    19699      +75     
+ Misses       4421     4398      -23     


@BalaBalaYi BalaBalaYi closed this Feb 6, 2025