fix (#1463)
BalaBalaYi authored Feb 7, 2025
1 parent 1c1ac83 commit 89c310e
Showing 1 changed file with 2 additions and 0 deletions.
2 changes: 2 additions & 0 deletions dlrover/python/elastic_agent/torch/training.py
@@ -885,6 +885,8 @@ def _initialize_workers(self, worker_group, max_errors=3):
                     time.sleep(JobConstant.TRAINING_AGENT_LOOP_DEFAULT_INTERVAL)
                     if time.time() - start_pending > pend_timeout:
                         raise TimeoutError("Timeout to wait for new nodes.")
+            except NodeCheckFailedError as node_check_error:
+                raise node_check_error
             except Exception as e:
                 err_cnt += 1
                 if err_cnt < max_errors:
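
The two added lines appear to keep a node-check failure out of the generic retry path: Python tries `except` clauses in order, so the more specific handler has to come before `except Exception`. The sketch below illustrates that ordering in isolation. It is not the dlrover implementation. `NodeCheckFailedError` is redefined locally as a stand-in, `initialize_with_retries` and `init_fn` are hypothetical names, and the retry-then-raise behaviour is an assumption, because the diff cuts off after `if err_cnt < max_errors:`.

```python
class NodeCheckFailedError(Exception):
    """Local stand-in for the exception named in the diff (hypothetical)."""


def initialize_with_retries(init_fn, max_errors=3):
    """Call init_fn, retrying generic failures but not node-check failures.

    Hypothetical helper; the retry policy here is assumed, not taken
    from the original file.
    """
    err_cnt = 0
    while True:
        try:
            return init_fn()
        except NodeCheckFailedError:
            # Listed first, so the generic handler below never sees it
            # and it is never counted as a retryable error.
            raise
        except Exception:
            err_cnt += 1
            if err_cnt >= max_errors:
                # Retry budget exhausted: surface the last error.
                raise
            # Otherwise fall through and loop again.
```

With this ordering, a callable that raises `NodeCheckFailedError` runs once and the error propagates to the caller, while any other exception is retried until `max_errors` failures have accumulated.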