[Ray Train] FileNotFoundError '/tmp/ray/session_xxxx/xxxx/.tmp_generator' #51020

Open
Hambaobao opened this issue Mar 2, 2025 · 0 comments
Labels
bug Something that is supposed to be working; but isn't train Ray Train Related Issue triage Needs triage (eg: priority, bug/not-bug, and owning component)

What happened + What you expected to happen

I'm trying to train an LLM across multiple nodes (4 nodes with 8 GPUs each). Training fails with the following error.

2025-03-02 20:46:26,498	WARNING tune_controller.py:700 -- Trial controller checkpointing failed: [Errno 2] No such file or directory: '/tmp/ray/session_2025-03-02_20-45-59_560640_41/artifacts/2025-03-02_20-46-26/TorchTrainer_2025-03-02_20-46-26/driver_artifacts/.tmp_generator' -> '/tmp/ray/session_2025-03-02_20-45-59_560640_41/artifacts/2025-03-02_20-46-26/TorchTrainer_2025-03-02_20-46-26/driver_artifacts/basic-variant-state-2025-03-02_20-46-26.json'

Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/root/miniconda3/envs/axolotl/lib/python3.11/site-packages/axolotl/cli/train.py", line 113, in <module>
    fire.Fire(do_cli)
  File "/root/miniconda3/envs/axolotl/lib/python3.11/site-packages/fire/core.py", line 135, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/envs/axolotl/lib/python3.11/site-packages/fire/core.py", line 468, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
                                ^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/envs/axolotl/lib/python3.11/site-packages/fire/core.py", line 684, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
                ^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/envs/axolotl/lib/python3.11/site-packages/axolotl/cli/train.py", line 86, in do_cli
    return trainer.fit()
           ^^^^^^^^^^^^^
  File "/root/miniconda3/envs/axolotl/lib/python3.11/site-packages/ray/train/base_trainer.py", line 705, in fit
    result_grid = tuner.fit()
                  ^^^^^^^^^^^
  File "/root/miniconda3/envs/axolotl/lib/python3.11/site-packages/ray/tune/tuner.py", line 345, in fit
    return self._local_tuner.fit()
           ^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/envs/axolotl/lib/python3.11/site-packages/ray/tune/impl/tuner_internal.py", line 504, in fit
    analysis = self._fit_internal(trainable, param_space)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/envs/axolotl/lib/python3.11/site-packages/ray/tune/impl/tuner_internal.py", line 620, in _fit_internal
    analysis = run(
               ^^^^
  File "/root/miniconda3/envs/axolotl/lib/python3.11/site-packages/ray/tune/tune.py", line 994, in run
    runner.step()
  File "/root/miniconda3/envs/axolotl/lib/python3.11/site-packages/ray/tune/execution/tune_controller.py", line 701, in step
    raise e
  File "/root/miniconda3/envs/axolotl/lib/python3.11/site-packages/ray/tune/execution/tune_controller.py", line 698, in step
    self.checkpoint()
  File "/root/miniconda3/envs/axolotl/lib/python3.11/site-packages/ray/tune/execution/tune_controller.py", line 352, in checkpoint
    self._checkpoint_manager.sync_up_experiment_state(
  File "/root/miniconda3/envs/axolotl/lib/python3.11/site-packages/ray/tune/execution/experiment_state.py", line 167, in sync_up_experiment_state
    save_fn()
  File "/root/miniconda3/envs/axolotl/lib/python3.11/site-packages/ray/tune/execution/tune_controller.py", line 348, in save_to_dir
    self._search_alg.save_to_dir(driver_staging_path, session_str=self._session_str)
  File "/root/miniconda3/envs/axolotl/lib/python3.11/site-packages/ray/tune/search/basic_variant.py", line 405, in save_to_dir
    _atomic_save(
  File "/root/miniconda3/envs/axolotl/lib/python3.11/site-packages/ray/tune/utils/util.py", line 415, in _atomic_save
    os.replace(tmp_search_ckpt_path, os.path.join(checkpoint_dir, file_name))
FileNotFoundError: [Errno 2] No such file or directory: '/tmp/ray/session_2025-03-02_20-45-59_560640_41/artifacts/2025-03-02_20-46-26/TorchTrainer_2025-03-02_20-46-26/driver_artifacts/.tmp_generator' -> '/tmp/ray/session_2025-03-02_20-45-59_560640_41/artifacts/2025-03-02_20-46-26/TorchTrainer_2025-03-02_20-46-26/driver_artifacts/basic-variant-state-2025-03-02_20-46-26.json'
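
For context, the call that fails is the final rename of a write-then-rename save: `os.replace` raises `FileNotFoundError` when the temp file is no longer there at rename time. The sketch below is not Ray's code. `atomic_save` and `replace_fails_when_tmp_missing` are hypothetical helpers, and the assumption that the temp file is written under a fixed name right before the rename is mine. It shows only one way this error can arise (a second rename of an already-moved temp file), which may or may not be what happens in the run above.

```python
import json
import os
import tempfile


def atomic_save(state, checkpoint_dir, file_name, tmp_name=".tmp_generator"):
    """Hypothetical helper: write state to a temp file, then rename it into place."""
    tmp_path = os.path.join(checkpoint_dir, tmp_name)
    with open(tmp_path, "w") as f:
        json.dump(state, f)
    # The rename consumes the temp file; it raises FileNotFoundError
    # if tmp_path has disappeared between the write and this call.
    os.replace(tmp_path, os.path.join(checkpoint_dir, file_name))


def replace_fails_when_tmp_missing(checkpoint_dir, file_name, tmp_name=".tmp_generator"):
    """Hypothetical helper: True if renaming a missing temp file raises FileNotFoundError."""
    tmp_path = os.path.join(checkpoint_dir, tmp_name)
    try:
        os.replace(tmp_path, os.path.join(checkpoint_dir, file_name))
    except FileNotFoundError:
        return True
    return False


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        # First save succeeds and moves the temp file to its final name.
        atomic_save({"step": 1}, d, "state.json")
        # A second rename with the same temp name finds nothing to move.
        print(replace_fails_when_tmp_missing(d, "state.json"))
```

If two writers share the same directory and temp file name, the second rename can hit this case, which is why the sketch uses a fixed `tmp_name`.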

Versions / Dependencies

ray==2.43.0

Reproduction script

To reproduce, install axolotl and run the following command. However, I think this is a bug in Ray.

axolotl train configs/finetune/qwen2.5-coder/fft-fsdp-32b-ray.yaml

Issue Severity

None

@Hambaobao Hambaobao added bug Something that is supposed to be working; but isn't triage Needs triage (eg: priority, bug/not-bug, and owning component) labels Mar 2, 2025
@jcotant1 jcotant1 added the train Ray Train Related Issue label Mar 4, 2025