You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
The following snippet hangs forever at the process exit. Apparently Aim's finalizer internally crash with an UnauthorizedRequestError, which seem to cause the process to get stuck indefinitely.
A crucial part for re-producing the error is that the process itself must have duration of at least half an hour (that's why the long sleep is needed in the example) -- which made this a rather nasty bug to narrow down and reproduce.
We encountered this bug in the real world when working with Ray, where such patterns can happen easily on the head node, which by design can idle for longer times, while workers do their job. This is quite inconvenient in that scenario, because the stuck jobs mean that the resources allocated on a Ray cluster will never be released.
The example should terminate, and not get stuck forever.
Actual behavior
The process will never terminate. The output is (slightly truncated):
2025-01-17 22:50:09,151 aim.sdk.run DEBUG [19847] Opening Run 220b2d77ddd94a3399dd8e7d in write mode
2025-01-17 22:50:09,153 aim.sdk.reporter INFO [19847] creating RunStatusReporter for 220b2d77ddd94a3399dd8e7d
2025-01-17 22:50:09,153 aim.sdk.reporter DEBUG [19847] polling for check-ins in <aim.sdk.remote_run_reporter.RemoteFileManager object at 0x104763470>
2025-01-17 22:50:09,154 aim.sdk.reporter DEBUG [19847] no check-in found for 220b2d77ddd94a3399dd8e7d; returning zero-check-in
2025-01-17 22:50:09,154 aim.sdk.reporter DEBUG [19847] no leftover check-in found. starting from zero
2025-01-17 22:50:09,154 aim.sdk.reporter INFO [19847] starting from: {}
2025-01-17 22:50:09,155 aim.sdk.reporter DEBUG [19847] incrementing starting idx -> 1
2025-01-17 22:50:09,155 aim.sdk.reporter DEBUG [19847] scheduled TimedTask(when=0, flag_name='starting', overwritten=False) ASAP because no physical check-in was found
2025-01-17 22:50:09,155 aim.sdk.reporter INFO [19847] starting writer thread for <aim.sdk.reporter.RunStatusReporter object at 0x10c006cc0>
2025-01-17 22:50:09,155 aim.sdk.reporter DEBUG [19847] no interesting things to do, sleeping for 0
2025-01-17 22:50:09,155 aim.sdk.reporter DEBUG [19847] notifying <aim.sdk.reporter.RunStatusReporter object at 0x10c006cc0>
2025-01-17 22:50:09,155 aim.sdk.reporter DEBUG [19847] until woken up
2025-01-17 22:50:09,155 aim.sdk.reporter DEBUG [19847] flushing starting
2025-01-17 22:50:09,155 aim.sdk.reporter DEBUG [19847] unfinished tasks: 1
2025-01-17 22:50:09,155 aim.sdk.reporter DEBUG [19847] time remaining: -25171.476753625
2025-01-17 22:50:09,155 aim.sdk.reporter DEBUG [19847] blocking until the writer finishes...
2025-01-17 22:50:09,155 aim.sdk.reporter DEBUG [19847] only -25171.476753625 remaining... flushing one task
2025-01-17 22:50:09,155 aim.sdk.reporter DEBUG [19847] no interesting things to do, sleeping for 0
2025-01-17 22:50:09,155 aim.sdk.reporter DEBUG [19847] until woken up
2025-01-17 22:50:09,155 aim.sdk.reporter DEBUG [19847] unfinished tasks: 2
2025-01-17 22:50:09,155 aim.sdk.reporter DEBUG [19847] time remaining: -25171.476975583
2025-01-17 22:50:09,155 aim.sdk.reporter DEBUG [19847] only -25171.476975583 remaining... flushing one task
2025-01-17 22:50:09,155 aim.sdk.reporter DEBUG [19847] Skipping touching `starting` as it is already up to date
2025-01-17 22:50:09,155 aim.sdk.reporter DEBUG [19847] starting is not scheduled anymore. It will be invoked as soon as next one appears
2025-01-17 22:50:09,155 aim.sdk.reporter DEBUG [19847] no interesting things to do, sleeping for 0
2025-01-17 22:50:09,155 aim.sdk.reporter DEBUG [19847] until woken up
2025-01-17 22:50:09,155 aim.sdk.reporter DEBUG [19847] unfinished tasks: 1
2025-01-17 22:50:09,155 aim.sdk.reporter DEBUG [19847] time remaining: 19.999827291998372
2025-01-17 22:50:09,155 aim.sdk.reporter DEBUG [19847] too soon, 19.999827291998372 remaining
2025-01-17 22:50:09,155 aim.sdk.reporter DEBUG [19847] putting back for the future: TimedTask(when=25191.476874375, flag_name='starting', overwritten=False)
2025-01-17 22:50:09,155 aim.sdk.reporter DEBUG [19847] now: 25161.477080583... scheduled for: 25191.476874375
2025-01-17 22:50:09,155 aim.sdk.reporter DEBUG [19847] no interesting things to do, sleeping for 30
2025-01-17 22:50:09,155 aim.sdk.reporter DEBUG [19847] done
2025-01-17 22:50:09,155 aim.sdk.reporter DEBUG [19847] until woken up
2025-01-17 22:50:09,155 aim.sdk.reporter DEBUG [19847] unfinished tasks: 1
2025-01-17 22:50:09,210 root INFO [19847] Initialized Aim run with hash: 220b2d77ddd94a3399dd8e7d
2025-01-17 22:50:10,113 root INFO [19847] Simulating other work...
2025-01-17 22:50:39,160 aim.sdk.reporter DEBUG [19847] time remaining: -10.005470750002132
2025-01-17 22:50:39,161 aim.sdk.reporter DEBUG [19847] only -10.005470750002132 remaining... flushing one task
2025-01-17 22:50:39,161 aim.sdk.reporter DEBUG [19847] Skipping touching `starting` as it is already up to date
2025-01-17 22:50:39,161 aim.sdk.reporter DEBUG [19847] starting is not scheduled anymore. It will be invoked as soon as next one appears
2025-01-17 22:50:39,161 aim.sdk.reporter DEBUG [19847] no interesting things to do, sleeping for 0
2025-01-17 22:50:39,161 aim.sdk.reporter DEBUG [19847] until woken up
2025-01-17 22:50:39,161 aim.sdk.reporter DEBUG [19847] unfinished tasks: 0
2025-01-17 22:50:39,161 aim.sdk.reporter DEBUG [19847] no interesting things to do, sleeping for 30
2025-01-17 22:50:39,161 aim.sdk.reporter DEBUG [19847] until woken up
2025-01-17 22:50:39,161 aim.sdk.reporter DEBUG [19847] unfinished tasks: 0
2025-01-17 22:51:09,166 aim.sdk.reporter DEBUG [19847] no interesting things to do, sleeping for 30
2025-01-17 22:51:09,166 aim.sdk.reporter DEBUG [19847] until woken up
2025-01-17 22:51:09,166 aim.sdk.reporter DEBUG [19847] unfinished tasks: 0
2025-01-17 22:51:39,172 aim.sdk.reporter DEBUG [19847] no interesting things to do, sleeping for 30
2025-01-17 22:51:39,172 aim.sdk.reporter DEBUG [19847] until woken up
[...]
2025-01-17 23:21:09,486 aim.sdk.reporter DEBUG [19847] unfinished tasks: 0
2025-01-17 23:21:39,492 aim.sdk.reporter DEBUG [19847] no interesting things to do, sleeping for 30
2025-01-17 23:21:39,492 aim.sdk.reporter DEBUG [19847] until woken up
2025-01-17 23:21:39,492 aim.sdk.reporter DEBUG [19847] unfinished tasks: 0
2025-01-17 23:22:09,498 aim.sdk.reporter DEBUG [19847] no interesting things to do, sleeping for 30
2025-01-17 23:22:09,498 aim.sdk.reporter DEBUG [19847] until woken up
2025-01-17 23:22:09,498 aim.sdk.reporter DEBUG [19847] unfinished tasks: 0
2025-01-17 23:22:10,143 root INFO [19847] Shutting down
2025-01-17 23:22:10,144 aim.ext.cleanup DEBUG [19847] Cleaning up... Blocking KeyboardInterrupts
2025-01-17 23:22:10,144 aim.ext.cleanup DEBUG [19847] Cleaning up... Found 9 finalizers
2025-01-17 23:22:10,145 aim.ext.cleanup DEBUG [19847] Cleaning up... Iterating over instances in order
2025-01-17 23:22:10,145 aim.ext.cleanup DEBUG [19847] Cleaning up... with priority=90 instance <Run#-6885649275010414954 name=220b2d77ddd94a3399dd8e7d repo=<Repo#-3317689182825341487 path=localhost:53800/.aim read_only=None>>
2025-01-17 23:22:10,145 aim.sdk.run DEBUG [19847] Closing resource <aim.ext.resource.tracker.ResourceTracker object at 0x10c0329f0>
Exception in thread Thread-6 (_cleanup):
Traceback (most recent call last):
File "/opt/homebrew/Cellar/python@3.12/3.12.7_1/Frameworks/Python.framework/Versions/3.12/lib/python3.12/threading.py", line 1075, in _bootstrap_inner
self.run()
File "/opt/homebrew/Cellar/python@3.12/3.12.7_1/Frameworks/Python.framework/Versions/3.12/lib/python3.12/threading.py", line 1012, in run
self._target(*self._args, **self._kwargs)
File "/Users/user/Library/Caches/pypoetry/virtualenvs/finetuning-3xad2Jof-py3.12/lib/python3.12/site-packages/aim/ext/cleanup/__init__.py", line 86, in _cleanup
finalizer()
File "/opt/homebrew/Cellar/python@3.12/3.12.7_1/Frameworks/Python.framework/Versions/3.12/lib/python3.12/weakref.py", line 590, in __call__
return info.func(*info.args, **(info.kwargs or {}))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/user/Library/Caches/pypoetry/virtualenvs/finetuning-3xad2Jof-py3.12/lib/python3.12/site-packages/aim/sdk/run.py", line 109, in _close
self._heartbeat.stop()
File "/Users/user/Library/Caches/pypoetry/virtualenvs/finetuning-3xad2Jof-py3.12/lib/python3.12/site-packages/aim/sdk/remote_run_reporter.py", line 37, in stop
self._rpc_client.run_instruction(self._hash, self._handler, 'stop')
File "/Users/user/Library/Caches/pypoetry/virtualenvs/finetuning-3xad2Jof-py3.12/lib/python3.12/site-packages/aim/ext/transport/client.py", line 253, in run_instruction
return self._run_read_instructions(queue_id, resource, method, args)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/user/Library/Caches/pypoetry/virtualenvs/finetuning-3xad2Jof-py3.12/lib/python3.12/site-packages/aim/ext/transport/client.py", line 272, in _run_read_instructions
raise_exception(response.json().get('exception'))
File "/Users/user/Library/Caches/pypoetry/virtualenvs/finetuning-3xad2Jof-py3.12/lib/python3.12/site-packages/aim/ext/transport/message_utils.py", line 70, in raise_exception
raise exception(*args) if args else exception()
aim.ext.transport.tracking.UnauthorizedRequestError: 107e4d55-4d5b-4aa6-8028-9778e368a2db
2025-01-17 23:22:39,504 aim.sdk.reporter DEBUG [19847] no interesting things to do, sleeping for 30
2025-01-17 23:22:39,504 aim.sdk.reporter DEBUG [19847] until woken up
2025-01-17 23:22:39,504 aim.sdk.reporter DEBUG [19847] unfinished tasks: 0
2025-01-17 23:23:09,515 aim.sdk.reporter DEBUG [19847] no interesting things to do, sleeping for 30
2025-01-17 23:23:09,515 aim.sdk.reporter DEBUG [19847] until woken up
2025-01-17 23:23:09,515 aim.sdk.reporter DEBUG [19847] unfinished tasks: 0
2025-01-17 23:23:39,517 aim.sdk.reporter DEBUG [19847] no interesting things to do, sleeping for 30
2025-01-17 23:23:39,517 aim.sdk.reporter DEBUG [19847] until woken up
2025-01-17 23:23:39,517 aim.sdk.reporter DEBUG [19847] unfinished tasks: 0
2025-01-17 23:24:09,523 aim.sdk.reporter DEBUG [19847] no interesting things to do, sleeping for 30
2025-01-17 23:24:09,523 aim.sdk.reporter DEBUG [19847] until woken up
[...]
The process will continue to log these last messages forever.
Environment
Aim Version: 3.27.0
Python version: 3.12.7
pip version: none, using poetry 1.8.4
OS (e.g., Linux): Reproduced on both MaxOs and Linux
The text was updated successfully, but these errors were encountered:
🐛 Bug
The following snippet hangs forever at the process exit. Apparently Aim's finalizer internally crash with an
UnauthorizedRequestError
, which seem to cause the process to get stuck indefinitely.A crucial part for re-producing the error is that the process itself must have duration of at least half an hour (that's why the long sleep is needed in the example) -- which made this a rather nasty bug to narrow down and reproduce.
We encountered this bug in the real world when working with Ray, where such patterns can happen easily on the head node, which by design can idle for longer times, while workers do their job. This is quite inconvenient in that scenario, because the stuck jobs mean that the resources allocated on a Ray cluster will never be released.
To reproduce
Start a local Aim server with e.g.
mkdir -p /tmp/test_repo && aim server --yes --repo /tmp/test_repo --log-level debug
Then run:
Expected behavior
The example should terminate, and not get stuck forever.
Actual behavior
The process will never terminate. The output is (slightly truncated):
The process will continue to log these last messages forever.
Environment
The text was updated successfully, but these errors were encountered: