What happened + What you expected to happen
I'm trying to train an LLM across multiple nodes (4 nodes, 8 GPUs per node in my setting), and I got this error.
2025-03-02 20:46:26,498 WARNING tune_controller.py:700 -- Trial controller checkpointing failed: [Errno 2] No such file or directory: '/tmp/ray/session_2025-03-02_20-45-59_560640_41/artifacts/2025-03-02_20-46-26/TorchTrainer_2025-03-02_20-46-26/driver_artifacts/.tmp_generator' -> '/tmp/ray/session_2025-03-02_20-45-59_560640_41/artifacts/2025-03-02_20-46-26/TorchTrainer_2025-03-02_20-46-26/driver_artifacts/basic-variant-state-2025-03-02_20-46-26.json'
Traceback (most recent call last):
File "<frozen runpy>", line 198, in _run_module_as_main
File "<frozen runpy>", line 88, in _run_code
File "/root/miniconda3/envs/axolotl/lib/python3.11/site-packages/axolotl/cli/train.py", line 113, in <module>
fire.Fire(do_cli)
File "/root/miniconda3/envs/axolotl/lib/python3.11/site-packages/fire/core.py", line 135, in Fire
component_trace = _Fire(component, args, parsed_flag_args, context, name)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/miniconda3/envs/axolotl/lib/python3.11/site-packages/fire/core.py", line 468, in _Fire
component, remaining_args = _CallAndUpdateTrace(
^^^^^^^^^^^^^^^^^^^^
File "/root/miniconda3/envs/axolotl/lib/python3.11/site-packages/fire/core.py", line 684, in _CallAndUpdateTrace
component = fn(*varargs, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^
File "/root/miniconda3/envs/axolotl/lib/python3.11/site-packages/axolotl/cli/train.py", line 86, in do_cli
return trainer.fit()
^^^^^^^^^^^^^
File "/root/miniconda3/envs/axolotl/lib/python3.11/site-packages/ray/train/base_trainer.py", line 705, in fit
result_grid = tuner.fit()
^^^^^^^^^^^
File "/root/miniconda3/envs/axolotl/lib/python3.11/site-packages/ray/tune/tuner.py", line 345, in fit
return self._local_tuner.fit()
^^^^^^^^^^^^^^^^^^^^^^^
File "/root/miniconda3/envs/axolotl/lib/python3.11/site-packages/ray/tune/impl/tuner_internal.py", line 504, in fit
analysis = self._fit_internal(trainable, param_space)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/miniconda3/envs/axolotl/lib/python3.11/site-packages/ray/tune/impl/tuner_internal.py", line 620, in _fit_internal
analysis = run(
^^^^
File "/root/miniconda3/envs/axolotl/lib/python3.11/site-packages/ray/tune/tune.py", line 994, in run
runner.step()
File "/root/miniconda3/envs/axolotl/lib/python3.11/site-packages/ray/tune/execution/tune_controller.py", line 701, in step
raise e
File "/root/miniconda3/envs/axolotl/lib/python3.11/site-packages/ray/tune/execution/tune_controller.py", line 698, in step
self.checkpoint()
File "/root/miniconda3/envs/axolotl/lib/python3.11/site-packages/ray/tune/execution/tune_controller.py", line 352, in checkpoint
self._checkpoint_manager.sync_up_experiment_state(
File "/root/miniconda3/envs/axolotl/lib/python3.11/site-packages/ray/tune/execution/experiment_state.py", line 167, in sync_up_experiment_state
save_fn()
File "/root/miniconda3/envs/axolotl/lib/python3.11/site-packages/ray/tune/execution/tune_controller.py", line 348, in save_to_dir
self._search_alg.save_to_dir(driver_staging_path, session_str=self._session_str)
File "/root/miniconda3/envs/axolotl/lib/python3.11/site-packages/ray/tune/search/basic_variant.py", line 405, in save_to_dir
_atomic_save(
File "/root/miniconda3/envs/axolotl/lib/python3.11/site-packages/ray/tune/utils/util.py", line 415, in _atomic_save
os.replace(tmp_search_ckpt_path, os.path.join(checkpoint_dir, file_name))
FileNotFoundError: [Errno 2] No such file or directory: '/tmp/ray/session_2025-03-02_20-45-59_560640_41/artifacts/2025-03-02_20-46-26/TorchTrainer_2025-03-02_20-46-26/driver_artifacts/.tmp_generator' -> '/tmp/ray/session_2025-03-02_20-45-59_560640_41/artifacts/2025-03-02_20-46-26/TorchTrainer_2025-03-02_20-46-26/driver_artifacts/basic-variant-state-2025-03-02_20-46-26.json'
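For anyone triaging, the last frame of the traceback is an `os.replace` from a temp file (`.tmp_generator`) to the final state file, which looks like the usual write-then-rename pattern. Below is a minimal sketch of that pattern and of how the same `FileNotFoundError` shape appears when the temp file is missing at rename time. This is not Ray's actual code: `atomic_save_sketch` and `demo_missing_tmp_file` are hypothetical names made up for illustration, and the sketch does not explain why the temp file is missing in the multi-node run.

```python
import os
import tempfile
from typing import Optional


def atomic_save_sketch(state: str, checkpoint_dir: str, file_name: str,
                       tmp_file_name: str = ".tmp_generator") -> str:
    """Hypothetical helper: write to a temp file, then rename it into place."""
    tmp_path = os.path.join(checkpoint_dir, tmp_file_name)
    with open(tmp_path, "w") as f:
        f.write(state)
    final_path = os.path.join(checkpoint_dir, file_name)
    # The rename step: fails with FileNotFoundError if tmp_path
    # (or its parent directory) no longer exists at this point.
    os.replace(tmp_path, final_path)
    return final_path


def demo_missing_tmp_file() -> Optional[FileNotFoundError]:
    """Call os.replace on a temp file that was never written."""
    with tempfile.TemporaryDirectory() as d:
        src = os.path.join(d, ".tmp_generator")
        dst = os.path.join(d, "basic-variant-state.json")
        try:
            os.replace(src, dst)
        except FileNotFoundError as e:
            return e
    return None


if __name__ == "__main__":
    # Prints an error naming both the source and destination paths,
    # the same form as the message in the traceback above.
    print(demo_missing_tmp_file())
```

The point of the sketch is only that the error names both paths because the rename itself failed, so the thing to check is whether `.tmp_generator` and the `driver_artifacts` directory exist on the node running the driver when `checkpoint()` is called.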
Versions / Dependencies
ray==2.43.0
Reproduction script
To reproduce, install axolotl and run the following command. I think this is a bug in Ray, though.
Hambaobao added the bug and triage labels on Mar 2, 2025
Issue Severity
None