Skip to content

CUDA Error on second epoch #28

@nbardy

Description

@nbardy

Seeing an unknown CUDA error on the second epoch. Will try to debug more tomorrow.

Traceback (most recent call last):
  File "/home/paperspace/git/DRLX/train_aesthetics.py", line 12, in <module>
    trainer.train(pipe, Aesthetics())
  File "/home/paperspace/git/DRLX/src/drlx/trainer/ddpo_trainer.py", line 313, in train
    if self.config.train.total_samples is not None:
  File "/home/paperspace/git/DRLX/src/drlx/trainer/ddpo_trainer.py", line 313, in <listcomp>
    if self.config.train.total_samples is not None:
  File "/home/paperspace/.pyenv/versions/3.9.17/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/paperspace/git/DRLX/src/drlx/denoisers/ldm_unet.py", line 125, in postprocess
    images = images.detach().cpu().permute(0,2,3,1).numpy()
RuntimeError: CUDA error: unknown error
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Traceback (most recent call last):
File "/home/paperspace/git/DRLX/train_aesthetics.py", line 12, in
trainer.train(pipe, Aesthetics())
File "/home/paperspace/git/DRLX/src/drlx/trainer/ddpo_trainer.py", line 313, in train
if self.config.train.total_samples is not None:
File "/home/paperspace/git/DRLX/src/drlx/trainer/ddpo_trainer.py", line 313, in
if self.config.train.total_samples is not None:
File "/home/paperspace/.pyenv/versions/3.9.17/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/home/paperspace/git/DRLX/src/drlx/denoisers/ldm_unet.py", line 125, in postprocess
images = images.detach().cpu().permute(0,2,3,1).numpy()
RuntimeError: CUDA error: unknown error
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions