Fix tensor parallel distributed tests by requesting PyTorch not to destroy PG upon exit #1472

Closed
wants to merge 2 commits into from

Conversation

IvanYashchuk (Collaborator) commented Nov 26, 2024

Before this PR, the tensor parallel tests were failing. Example:

pytest thunder/tests/distributed/test_tensor_parallel.py::TensorParallelTest::test_embedding_name_column

ValueError: Default process group has not been initialized, please make sure to call init_process_group.
Process process 1:
Traceback (most recent call last):
  File "/usr/lib/python3.12/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/usr/lib/python3.12/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/opt/pytorch/lightning-thunder/thunder/tests/distributed/helper.py", line 144, in _run
    torch.distributed.barrier()
  File "/usr/local/lib/python3.12/dist-packages/torch/distributed/c10d_logger.py", line 81, in wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/distributed/distributed_c10d.py", line 4462, in barrier
    opts.device = torch.device(_get_object_coll_device(group))
                               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/distributed/distributed_c10d.py", line 774, in _get_object_coll_device
    group = group or _get_default_group()
                     ^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/distributed/distributed_c10d.py", line 1276, in _get_default_group
    raise ValueError(
ValueError: Default process group has not been initialized, please make sure to call init_process_group.

And now it works.
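
For context, a minimal sketch of the kind of change the title describes, assuming the distributed test helper derives from PyTorch's torch.testing._internal.common_distributed.MultiProcessTestCase and that its destroy_pg_upon_exit property (True by default in recent PyTorch) is what tears the process group down early; the class name below is illustrative and this is not the actual diff:

    from torch.testing._internal.common_distributed import MultiProcessTestCase

    class TensorParallelTestBase(MultiProcessTestCase):  # illustrative name, assumed base class
        @property
        def destroy_pg_upon_exit(self) -> bool:
            # Ask PyTorch's harness not to destroy the default process group
            # on its own after the test body finishes, so the later
            # torch.distributed.barrier() in the helper's _run still sees an
            # initialized process group and the helper can clean up itself.
            return False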

cc @Borda @crcrpar

IvanYashchuk and others added 2 commits November 26, 2024 13:45
Co-authored-by: Masaki Kozuki <mkozuki@nvidia.com>
t-vi (Collaborator) commented Nov 26, 2024

Thank you Ivan, this is already fixed in #1470 along with the other one.

@t-vi t-vi closed this Nov 26, 2024
@IvanYashchuk IvanYashchuk deleted the fix-tp-distributed-destroy branch November 26, 2024 12:51