Train with multiple GPUs - errors #7

Open
jesusbft opened this issue Jan 31, 2025 · 0 comments

Trying to train with multiple GPUs did not work:

party -d cuda:0,cuda:1,cuda:2,cuda:3 train -q early --lag 20 --augment -t dataset/train.lst -e dataset/val.lst -B 20 --workers 16 --threads 16 -o models/Portuguese

GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs
Trainer(val_check_interval=1.0) was configured so validation will run at the end of the training epoch..
Loading from huggingface hub 10.5281/zenodo.14616981.
Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/4
Loading from huggingface hub 10.5281/zenodo.14616981.
Loading from huggingface hub 10.5281/zenodo.14616981.
Loading from huggingface hub 10.5281/zenodo.14616981.
Initializing distributed: GLOBAL_RANK: 2, MEMBER: 3/4
Initializing distributed: GLOBAL_RANK: 1, MEMBER: 2/4
Initializing distributed: GLOBAL_RANK: 3, MEMBER: 4/4

distributed_backend=nccl
All distributed processes registered. Starting with 4 processes

[rank1]:[E131 16:15:16.686699493 ProcessGroupNCCL.cpp:616] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=BROADCAST, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800004 milliseconds before timing out.
[rank3]:[E131 16:15:16.688074713 ProcessGroupNCCL.cpp:616] [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=BROADCAST, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800005 milliseconds before timing out.
[rank3]:[E131 16:15:16.698386699 ProcessGroupNCCL.cpp:1785] [PG ID 0 PG GUID 0(default_pg) Rank 3] Exception (either an error or timeout) detected by watchdog at work: 1, last enqueued NCCL work: 1, last completed NCCL work: -1.
[rank3]:[E131 16:15:16.698412854 ProcessGroupNCCL.cpp:1834] [PG ID 0 PG GUID 0(default_pg) Rank 3] Timeout at NCCL work: 1, last enqueued NCCL work: 1, last completed NCCL work: -1.
[rank3]:[E131 16:15:16.698420406 ProcessGroupNCCL.cpp:630] [Rank 3] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank3]:[E131 16:15:16.698428398 ProcessGroupNCCL.cpp:636] [Rank 3] To avoid data inconsistency, we are taking the entire process down.
[rank1]:[E131 16:15:16.698436714 ProcessGroupNCCL.cpp:1785] [PG ID 0 PG GUID 0(default_pg) Rank 1] Exception (either an error or timeout) detected by watchdog at work: 1, last enqueued NCCL work: 1, last completed NCCL work: -1.
[rank1]:[E131 16:15:16.698474862 ProcessGroupNCCL.cpp:1834] [PG ID 0 PG GUID 0(default_pg) Rank 1] Timeout at NCCL work: 1, last enqueued NCCL work: 1, last completed NCCL work: -1.
[rank1]:[E131 16:15:16.698493613 ProcessGroupNCCL.cpp:630] [Rank 1] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank1]:[E131 16:15:16.698508402 ProcessGroupNCCL.cpp:636] [Rank 1] To avoid data inconsistency, we are taking the entire process down.
[rank3]:[E131 16:15:16.705394686 ProcessGroupNCCL.cpp:1595] [PG ID 0 PG GUID 0(default_pg) Rank 3] Process group watchdog thread terminated with exception: [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=BROADCAST, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800005 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7c9529a8a446 in /workspace/venv/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x282 (0x7c94dedcc772 in /workspace/venv/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7c94dedd3bb3 in /workspace/venv/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7c94dedd561d in /workspace/venv/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #4: + 0x145c0 (0x7c952b29a5c0 in /workspace/venv/lib/python3.11/site-packages/torch/lib/libtorch.so)
frame #5: + 0x94ac3 (0x7c952bf4aac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #6: clone + 0x44 (0x7c952bfdba04 in /lib/x86_64-linux-gnu/libc.so.6)

terminate called after throwing an instance of 'c10::DistBackendError'
what(): [PG ID 0 PG GUID 0(default_pg) Rank 3] Process group watchdog thread terminated with exception: [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=BROADCAST, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800005 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7c9529a8a446 in /workspace/venv/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x282 (0x7c94dedcc772 in /workspace/venv/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7c94dedd3bb3 in /workspace/venv/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7c94dedd561d in /workspace/venv/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #4: + 0x145c0 (0x7c952b29a5c0 in /workspace/venv/lib/python3.11/site-packages/torch/lib/libtorch.so)
frame #5: + 0x94ac3 (0x7c952bf4aac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #6: clone + 0x44 (0x7c952bfdba04 in /lib/x86_64-linux-gnu/libc.so.6)

Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1601 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7c9529a8a446 in /workspace/venv/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #1: + 0xe4271b (0x7c94dea4271b in /workspace/venv/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #2: + 0x145c0 (0x7c952b29a5c0 in /workspace/venv/lib/python3.11/site-packages/torch/lib/libtorch.so)
frame #3: + 0x94ac3 (0x7c952bf4aac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #4: clone + 0x44 (0x7c952bfdba04 in /lib/x86_64-linux-gnu/libc.so.6)

[rank1]:[E131 16:15:16.707598872 ProcessGroupNCCL.cpp:1595] [PG ID 0 PG GUID 0(default_pg) Rank 1] Process group watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=BROADCAST, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800004 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x70be4e97e446 in /workspace/venv/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x282 (0x70be03dcc772 in /workspace/venv/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x70be03dd3bb3 in /workspace/venv/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x70be03dd561d in /workspace/venv/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #4: + 0x145c0 (0x70be501695c0 in /workspace/venv/lib/python3.11/site-packages/torch/lib/libtorch.so)
frame #5: + 0x94ac3 (0x70be50e19ac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #6: clone + 0x44 (0x70be50eaaa04 in /lib/x86_64-linux-gnu/libc.so.6)

terminate called after throwing an instance of 'c10::DistBackendError'
what(): [PG ID 0 PG GUID 0(default_pg) Rank 1] Process group watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=BROADCAST, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800004 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x70be4e97e446 in /workspace/venv/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x282 (0x70be03dcc772 in /workspace/venv/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x70be03dd3bb3 in /workspace/venv/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x70be03dd561d in /workspace/venv/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #4: + 0x145c0 (0x70be501695c0 in /workspace/venv/lib/python3.11/site-packages/torch/lib/libtorch.so)
frame #5: + 0x94ac3 (0x70be50e19ac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #6: clone + 0x44 (0x70be50eaaa04 in /lib/x86_64-linux-gnu/libc.so.6)

Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1601 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x70be4e97e446 in /workspace/venv/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #1: + 0xe4271b (0x70be03a4271b in /workspace/venv/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #2: + 0x145c0 (0x70be501695c0 in /workspace/venv/lib/python3.11/site-packages/torch/lib/libtorch.so)
frame #3: + 0x94ac3 (0x70be50e19ac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #4: clone + 0x44 (0x70be50eaaa04 in /lib/x86_64-linux-gnu/libc.so.6)

[rank2]:[E131 16:15:16.729492418 ProcessGroupNCCL.cpp:616] [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=BROADCAST, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800047 milliseconds before timing out.
[rank2]:[E131 16:15:16.729856038 ProcessGroupNCCL.cpp:1785] [PG ID 0 PG GUID 0(default_pg) Rank 2] Exception (either an error or timeout) detected by watchdog at work: 1, last enqueued NCCL work: 1, last completed NCCL work: -1.
[rank2]:[E131 16:15:16.729864759 ProcessGroupNCCL.cpp:1834] [PG ID 0 PG GUID 0(default_pg) Rank 2] Timeout at NCCL work: 1, last enqueued NCCL work: 1, last completed NCCL work: -1.
[rank2]:[E131 16:15:16.729869534 ProcessGroupNCCL.cpp:630] [Rank 2] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank2]:[E131 16:15:16.729873967 ProcessGroupNCCL.cpp:636] [Rank 2] To avoid data inconsistency, we are taking the entire process down.
[rank2]:[E131 16:15:16.733099300 ProcessGroupNCCL.cpp:1595] [PG ID 0 PG GUID 0(default_pg) Rank 2] Process group watchdog thread terminated with exception: [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=BROADCAST, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800047 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7029f4ee6446 in /workspace/venv/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x282 (0x7029aa1cc772 in /workspace/venv/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7029aa1d3bb3 in /workspace/venv/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7029aa1d561d in /workspace/venv/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #4: + 0x145c0 (0x7029f67355c0 in /workspace/venv/lib/python3.11/site-packages/torch/lib/libtorch.so)
frame #5: + 0x94ac3 (0x7029f73e5ac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #6: clone + 0x44 (0x7029f7476a04 in /lib/x86_64-linux-gnu/libc.so.6)

terminate called after throwing an instance of 'c10::DistBackendError'
what(): [PG ID 0 PG GUID 0(default_pg) Rank 2] Process group watchdog thread terminated with exception: [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=BROADCAST, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800047 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7029f4ee6446 in /workspace/venv/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x282 (0x7029aa1cc772 in /workspace/venv/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7029aa1d3bb3 in /workspace/venv/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x7029aa1d561d in /workspace/venv/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #4: + 0x145c0 (0x7029f67355c0 in /workspace/venv/lib/python3.11/site-packages/torch/lib/libtorch.so)
frame #5: + 0x94ac3 (0x7029f73e5ac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #6: clone + 0x44 (0x7029f7476a04 in /lib/x86_64-linux-gnu/libc.so.6)

Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1601 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7029f4ee6446 in /workspace/venv/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #1: + 0xe4271b (0x7029a9e4271b in /workspace/venv/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #2: + 0x145c0 (0x7029f67355c0 in /workspace/venv/lib/python3.11/site-packages/torch/lib/libtorch.so)
frame #3: + 0x94ac3 (0x7029f73e5ac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #4: clone + 0x44 (0x7029f7476a04 in /lib/x86_64-linux-gnu/libc.so.6)

[rank0]:[E131 16:15:16.794976876 ProcessGroupNCCL.cpp:616] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=3, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800093 milliseconds before timing out.
[rank0]:[E131 16:15:16.796087449 ProcessGroupNCCL.cpp:1785] [PG ID 0 PG GUID 0(default_pg) Rank 0] Exception (either an error or timeout) detected by watchdog at work: 3, last enqueued NCCL work: 3, last completed NCCL work: 2.
[rank0]:[E131 16:15:16.796125481 ProcessGroupNCCL.cpp:1834] [PG ID 0 PG GUID 0(default_pg) Rank 0] Timeout at NCCL work: 3, last enqueued NCCL work: 3, last completed NCCL work: 2.
[rank0]:[E131 16:15:16.796148045 ProcessGroupNCCL.cpp:630] [Rank 0] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank0]:[E131 16:15:16.796161145 ProcessGroupNCCL.cpp:636] [Rank 0] To avoid data inconsistency, we are taking the entire process down.
[rank0]:[E131 16:15:16.801508681 ProcessGroupNCCL.cpp:1595] [PG ID 0 PG GUID 0(default_pg) Rank 0] Process group watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=3, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800093 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x76bb3f18a446 in /workspace/venv/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x282 (0x76baf45cc772 in /workspace/venv/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x76baf45d3bb3 in /workspace/venv/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x76baf45d561d in /workspace/venv/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #4: + 0x145c0 (0x76bb409f45c0 in /workspace/venv/lib/python3.11/site-packages/torch/lib/libtorch.so)
frame #5: + 0x94ac3 (0x76bb416a4ac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #6: clone + 0x44 (0x76bb41735a04 in /lib/x86_64-linux-gnu/libc.so.6)

terminate called after throwing an instance of 'c10::DistBackendError'
what(): [PG ID 0 PG GUID 0(default_pg) Rank 0] Process group watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=3, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800093 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:618 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x76bb3f18a446 in /workspace/venv/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x282 (0x76baf45cc772 in /workspace/venv/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x76baf45d3bb3 in /workspace/venv/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x14d (0x76baf45d561d in /workspace/venv/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #4: + 0x145c0 (0x76bb409f45c0 in /workspace/venv/lib/python3.11/site-packages/torch/lib/libtorch.so)
frame #5: + 0x94ac3 (0x76bb416a4ac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #6: clone + 0x44 (0x76bb41735a04 in /lib/x86_64-linux-gnu/libc.so.6)

Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1601 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x76bb3f18a446 in /workspace/venv/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #1: + 0xe4271b (0x76baf424271b in /workspace/venv/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #2: + 0x145c0 (0x76bb409f45c0 in /workspace/venv/lib/python3.11/site-packages/torch/lib/libtorch.so)
frame #3: + 0x94ac3 (0x76bb416a4ac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #4: clone + 0x44 (0x76bb41735a04 in /lib/x86_64-linux-gnu/libc.so.6)

Aborted (core dumped)
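For reference, here is a minimal NCCL sanity check in plain PyTorch, independent of party, that exercises the same broadcast/all-reduce collectives that time out above. This is only a debugging sketch, assuming torchrun is available in the same environment; if it also hangs, the problem is likely in the NCCL/driver setup on the machine rather than in party itself.

# nccl_check.py -- run with: torchrun --nproc_per_node=4 nccl_check.py
import os

import torch
import torch.distributed as dist


def main():
    local_rank = int(os.environ["LOCAL_RANK"])   # set by torchrun
    torch.cuda.set_device(local_rank)
    dist.init_process_group(backend="nccl")      # same backend as the failing run
    t = torch.ones(1, device=f"cuda:{local_rank}")
    dist.broadcast(t, src=0)                     # the collective that times out on ranks 1-3
    dist.all_reduce(t)                           # the collective that times out on rank 0
    print(f"rank {dist.get_rank()}: ok, value={t.item()}")
    dist.destroy_process_group()


if __name__ == "__main__":
    main()

Running this check (or the original party command) with NCCL_DEBUG=INFO set in the environment usually shows where the broadcast gets stuck, and on machines whose GPUs cannot do peer-to-peer transfers, NCCL_P2P_DISABLE=1 is a common workaround for exactly this kind of hang.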
