Parallel training hangs in reduce_tensor method #51

Open
ayushi-3536 opened this issue Jan 8, 2023 · 1 comment

ayushi-3536 commented Jan 8, 2023

Hi,

I was training the network with validation on the COCO dataset using 8 GPUs on a single node. The training appears to hang in the reduce_tensor method inside validate_cls(). Is there a known solution for this issue?
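
For context, a minimal sketch of how reduce_tensor helpers usually work in distributed PyTorch code (assuming a standard torch.distributed setup; this may not match this repo's exact implementation):

```python
import torch
import torch.distributed as dist

def reduce_tensor(tensor: torch.Tensor) -> torch.Tensor:
    """Average a tensor across all ranks."""
    rt = tensor.clone()
    # all_reduce is a collective call: it blocks until every rank reaches it.
    dist.all_reduce(rt, op=dist.ReduceOp.SUM)
    rt /= dist.get_world_size()
    return rt
```

Because all_reduce is collective, a hang at this point usually means one rank never reaches the call (for example, an uneven validation shard or a branch executed only on rank 0), so the remaining ranks wait indefinitely.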

xvjiarui (Contributor) commented Jan 9, 2023

Hi @ayushi-3536

I haven't encountered this issue.

Could you please provide your PyTorch version?

Maybe you could try running validation first.
