pawel-kotowski
Contributor

Purpose

  • Improve error handling in vision agents

Proposed Changes

  • Refactored path handling: it was a bit messy, and weights were redownloaded to the wrong path, as described in "Grounddino weight downloaded to wrong location on error during init" (#678)
  • An error caused by a broken weights file is now detected in the BaseVisionAgent class by the "PytorchStreamReader" message in the exception (the exception type alone is not enough, because we always get a RuntimeError regardless of the underlying cause)
  • The weights are redownloaded only in this case
  • Other errors are reraised, e.g. CUDA-related errors, which also prevent model init even if the weights file is correct
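
For readers skimming the description, the decision logic above can be sketched roughly as follows. This is a simplified illustration, not the code from this PR: `load_model_with_recovery`, `load_model`, and `download_weights` are hypothetical names, and the only detail taken from the description is the "PytorchStreamReader" substring check on a RuntimeError.

```python
from pathlib import Path
from typing import Callable, TypeVar

T = TypeVar("T")

# Substring expected in the exception message when the weights file is broken
# (taken from the PR description).
CORRUPTED_WEIGHTS_MARKER = "PytorchStreamReader"


def load_model_with_recovery(
    weights_path: Path,
    load_model: Callable[[Path], T],          # hypothetical: builds the model from a weights file
    download_weights: Callable[[Path], None], # hypothetical: writes the weights to the given path
) -> T:
    """Load a model, redownloading weights only if the file looks corrupted."""
    if not weights_path.exists():
        weights_path.parent.mkdir(parents=True, exist_ok=True)
        download_weights(weights_path)
    try:
        return load_model(weights_path)
    except RuntimeError as e:
        # Every init failure surfaces as RuntimeError, so inspect the message.
        if CORRUPTED_WEIGHTS_MARKER not in str(e):
            # e.g. CUDA-related errors: the weights are fine, so do not redownload.
            raise
        # Broken weights file: remove it, fetch it again to the same path, retry once.
        weights_path.unlink(missing_ok=True)
        download_weights(weights_path)
        return load_model(weights_path)
```

Retrying only once keeps a persistently failing download from looping forever; a second failure simply propagates to the caller.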

Issues

Testing

  • Tested with broken weights files and with an error related to CUDA_VISIBLE_DEVICES

@maciejmajek
Member

Refactored paths handling - it was a bit messy and weights were redownloaded to wrong path as described in #678

I think this is still a problem. Interrupting on the first start causes error ".... is a directory"

@pawel-kotowski
Contributor Author

pawel-kotowski commented Sep 17, 2025

@maciejmajek I added one minor naming fix (it didn't affect anything). It works on my side; maybe @jmatejcz could check.

@Juliaj
Collaborator

Juliaj commented Sep 18, 2025

@pawel-kotowski, we looked at #678 during Wednesday's dev sync meeting. I was added as a reviewer since I was seeing the CUDA error somewhat frequently when running the RAI manipulation demo a few months ago. Your changes look solid overall, good work! In the PR description, you mentioned still experiencing CUDA errors even with your fixes - is that correct? I haven't been able to reproduce the error with your changes. If you're still seeing the issue, could you run the attached cuda_error_debug.py to help gather diagnostic information?

@Juliaj
Collaborator

Juliaj commented Sep 18, 2025

Refactored paths handling - it was a bit messy and weights were redownloaded to wrong path as described in #678

I think this is still a problem. Interrupting on the first start causes error ".... is a directory"

@maciejmajek, how can we reproduce the issue you're experiencing? I tried interrupting the RAIManipulationDemo.GameLauncher startup with Ctrl-C several times but couldn't trigger any error.

Member

@maciejmajek maciejmajek left a comment

Tested again and works! Thank you!

@maciejmajek maciejmajek force-pushed the feat/fix_weights_redownload_in_vision_agents branch from dd1f53e to fa8a153 Compare October 1, 2025 19:28
@maciejmajek maciejmajek merged commit 8ec2eb3 into main Oct 1, 2025
6 checks passed
@maciejmajek maciejmajek deleted the feat/fix_weights_redownload_in_vision_agents branch October 1, 2025 20:10