Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Process stuck forever / finalizer crash with UnauthorizedRequestError #3280

Open
bluenote10 opened this issue Jan 17, 2025 · 0 comments
Open
Labels
help wanted Extra attention is needed type / bug Issue type: something isn't working

Comments

@bluenote10
Copy link
Contributor

🐛 Bug

The following snippet hangs forever at the process exit. Apparently Aim's finalizer internally crash with an UnauthorizedRequestError, which seem to cause the process to get stuck indefinitely.

A crucial part for re-producing the error is that the process itself must have duration of at least half an hour (that's why the long sleep is needed in the example) -- which made this a rather nasty bug to narrow down and reproduce.

We encountered this bug in the real world when working with Ray, where such patterns can happen easily on the head node, which by design can idle for longer times, while workers do their job. This is quite inconvenient in that scenario, because the stuck jobs mean that the resources allocated on a Ray cluster will never be released.

To reproduce

Start a local Aim server with e.g.

mkdir -p /tmp/test_repo && aim server --yes --repo /tmp/test_repo --log-level debug

Then run:

import logging
import time

from aim import Repo, Run

if __name__ == "__main__":
    logging.basicConfig(
        level=logging.INFO,
        format="%(asctime)s %(name)-16.16s %(levelname)8s [%(process)d] %(message)s",
    )
    logging.getLogger("aim.ext.cleanup").setLevel(logging.DEBUG)
    logging.getLogger("aim.sdk.repo").setLevel(logging.DEBUG)
    logging.getLogger("aim.sdk.reporter").setLevel(logging.DEBUG)
    logging.getLogger("aim.sdk.run").setLevel(logging.DEBUG)

    # Create Aim run
    repo = Repo(f"aim://localhost:53800")
    run = Run(repo=repo)
    logging.info(f"Initialized Aim run with hash: {run.hash}")
    repo.close()

    logging.info("Simulating other work...")
    time.sleep(32 * 60)
    logging.info("Shutting down")

Expected behavior

The example should terminate, and not get stuck forever.

Actual behavior

The process will never terminate. The output is (slightly truncated):

2025-01-17 22:50:09,151 aim.sdk.run         DEBUG [19847] Opening Run 220b2d77ddd94a3399dd8e7d in write mode
2025-01-17 22:50:09,153 aim.sdk.reporter     INFO [19847] creating RunStatusReporter for 220b2d77ddd94a3399dd8e7d
2025-01-17 22:50:09,153 aim.sdk.reporter    DEBUG [19847] polling for check-ins in <aim.sdk.remote_run_reporter.RemoteFileManager object at 0x104763470>
2025-01-17 22:50:09,154 aim.sdk.reporter    DEBUG [19847] no check-in found for 220b2d77ddd94a3399dd8e7d; returning zero-check-in
2025-01-17 22:50:09,154 aim.sdk.reporter    DEBUG [19847] no leftover check-in found. starting from zero
2025-01-17 22:50:09,154 aim.sdk.reporter     INFO [19847] starting from: {}
2025-01-17 22:50:09,155 aim.sdk.reporter    DEBUG [19847] incrementing starting idx -> 1
2025-01-17 22:50:09,155 aim.sdk.reporter    DEBUG [19847] scheduled TimedTask(when=0, flag_name='starting', overwritten=False) ASAP because no physical check-in was found
2025-01-17 22:50:09,155 aim.sdk.reporter     INFO [19847] starting writer thread for <aim.sdk.reporter.RunStatusReporter object at 0x10c006cc0>
2025-01-17 22:50:09,155 aim.sdk.reporter    DEBUG [19847] no interesting things to do, sleeping for 0
2025-01-17 22:50:09,155 aim.sdk.reporter    DEBUG [19847] notifying <aim.sdk.reporter.RunStatusReporter object at 0x10c006cc0>
2025-01-17 22:50:09,155 aim.sdk.reporter    DEBUG [19847] until woken up
2025-01-17 22:50:09,155 aim.sdk.reporter    DEBUG [19847] flushing starting
2025-01-17 22:50:09,155 aim.sdk.reporter    DEBUG [19847] unfinished tasks: 1
2025-01-17 22:50:09,155 aim.sdk.reporter    DEBUG [19847] time remaining: -25171.476753625
2025-01-17 22:50:09,155 aim.sdk.reporter    DEBUG [19847] blocking until the writer finishes...
2025-01-17 22:50:09,155 aim.sdk.reporter    DEBUG [19847] only -25171.476753625 remaining... flushing one task
2025-01-17 22:50:09,155 aim.sdk.reporter    DEBUG [19847] no interesting things to do, sleeping for 0
2025-01-17 22:50:09,155 aim.sdk.reporter    DEBUG [19847] until woken up
2025-01-17 22:50:09,155 aim.sdk.reporter    DEBUG [19847] unfinished tasks: 2
2025-01-17 22:50:09,155 aim.sdk.reporter    DEBUG [19847] time remaining: -25171.476975583
2025-01-17 22:50:09,155 aim.sdk.reporter    DEBUG [19847] only -25171.476975583 remaining... flushing one task
2025-01-17 22:50:09,155 aim.sdk.reporter    DEBUG [19847] Skipping touching `starting` as it is already up to date
2025-01-17 22:50:09,155 aim.sdk.reporter    DEBUG [19847] starting is not scheduled anymore. It will be invoked as soon as next one appears
2025-01-17 22:50:09,155 aim.sdk.reporter    DEBUG [19847] no interesting things to do, sleeping for 0
2025-01-17 22:50:09,155 aim.sdk.reporter    DEBUG [19847] until woken up
2025-01-17 22:50:09,155 aim.sdk.reporter    DEBUG [19847] unfinished tasks: 1
2025-01-17 22:50:09,155 aim.sdk.reporter    DEBUG [19847] time remaining: 19.999827291998372
2025-01-17 22:50:09,155 aim.sdk.reporter    DEBUG [19847] too soon, 19.999827291998372 remaining
2025-01-17 22:50:09,155 aim.sdk.reporter    DEBUG [19847] putting back for the future: TimedTask(when=25191.476874375, flag_name='starting', overwritten=False)
2025-01-17 22:50:09,155 aim.sdk.reporter    DEBUG [19847] now: 25161.477080583... scheduled for: 25191.476874375
2025-01-17 22:50:09,155 aim.sdk.reporter    DEBUG [19847] no interesting things to do, sleeping for 30
2025-01-17 22:50:09,155 aim.sdk.reporter    DEBUG [19847] done
2025-01-17 22:50:09,155 aim.sdk.reporter    DEBUG [19847] until woken up
2025-01-17 22:50:09,155 aim.sdk.reporter    DEBUG [19847] unfinished tasks: 1
2025-01-17 22:50:09,210 root                 INFO [19847] Initialized Aim run with hash: 220b2d77ddd94a3399dd8e7d
2025-01-17 22:50:10,113 root                 INFO [19847] Simulating other work...
2025-01-17 22:50:39,160 aim.sdk.reporter    DEBUG [19847] time remaining: -10.005470750002132
2025-01-17 22:50:39,161 aim.sdk.reporter    DEBUG [19847] only -10.005470750002132 remaining... flushing one task
2025-01-17 22:50:39,161 aim.sdk.reporter    DEBUG [19847] Skipping touching `starting` as it is already up to date
2025-01-17 22:50:39,161 aim.sdk.reporter    DEBUG [19847] starting is not scheduled anymore. It will be invoked as soon as next one appears
2025-01-17 22:50:39,161 aim.sdk.reporter    DEBUG [19847] no interesting things to do, sleeping for 0
2025-01-17 22:50:39,161 aim.sdk.reporter    DEBUG [19847] until woken up
2025-01-17 22:50:39,161 aim.sdk.reporter    DEBUG [19847] unfinished tasks: 0
2025-01-17 22:50:39,161 aim.sdk.reporter    DEBUG [19847] no interesting things to do, sleeping for 30
2025-01-17 22:50:39,161 aim.sdk.reporter    DEBUG [19847] until woken up
2025-01-17 22:50:39,161 aim.sdk.reporter    DEBUG [19847] unfinished tasks: 0
2025-01-17 22:51:09,166 aim.sdk.reporter    DEBUG [19847] no interesting things to do, sleeping for 30
2025-01-17 22:51:09,166 aim.sdk.reporter    DEBUG [19847] until woken up
2025-01-17 22:51:09,166 aim.sdk.reporter    DEBUG [19847] unfinished tasks: 0
2025-01-17 22:51:39,172 aim.sdk.reporter    DEBUG [19847] no interesting things to do, sleeping for 30
2025-01-17 22:51:39,172 aim.sdk.reporter    DEBUG [19847] until woken up
[...]
2025-01-17 23:21:09,486 aim.sdk.reporter    DEBUG [19847] unfinished tasks: 0
2025-01-17 23:21:39,492 aim.sdk.reporter    DEBUG [19847] no interesting things to do, sleeping for 30
2025-01-17 23:21:39,492 aim.sdk.reporter    DEBUG [19847] until woken up
2025-01-17 23:21:39,492 aim.sdk.reporter    DEBUG [19847] unfinished tasks: 0
2025-01-17 23:22:09,498 aim.sdk.reporter    DEBUG [19847] no interesting things to do, sleeping for 30
2025-01-17 23:22:09,498 aim.sdk.reporter    DEBUG [19847] until woken up
2025-01-17 23:22:09,498 aim.sdk.reporter    DEBUG [19847] unfinished tasks: 0
2025-01-17 23:22:10,143 root                 INFO [19847] Shutting down
2025-01-17 23:22:10,144 aim.ext.cleanup     DEBUG [19847] Cleaning up...  Blocking KeyboardInterrupts
2025-01-17 23:22:10,144 aim.ext.cleanup     DEBUG [19847] Cleaning up...  Found 9 finalizers
2025-01-17 23:22:10,145 aim.ext.cleanup     DEBUG [19847] Cleaning up...  Iterating over instances in order
2025-01-17 23:22:10,145 aim.ext.cleanup     DEBUG [19847] Cleaning up...  with priority=90 instance <Run#-6885649275010414954 name=220b2d77ddd94a3399dd8e7d repo=<Repo#-3317689182825341487 path=localhost:53800/.aim read_only=None>>
2025-01-17 23:22:10,145 aim.sdk.run         DEBUG [19847] Closing resource <aim.ext.resource.tracker.ResourceTracker object at 0x10c0329f0>
Exception in thread Thread-6 (_cleanup):
Traceback (most recent call last):
  File "/opt/homebrew/Cellar/python@3.12/3.12.7_1/Frameworks/Python.framework/Versions/3.12/lib/python3.12/threading.py", line 1075, in _bootstrap_inner
    self.run()
  File "/opt/homebrew/Cellar/python@3.12/3.12.7_1/Frameworks/Python.framework/Versions/3.12/lib/python3.12/threading.py", line 1012, in run
    self._target(*self._args, **self._kwargs)
  File "/Users/user/Library/Caches/pypoetry/virtualenvs/finetuning-3xad2Jof-py3.12/lib/python3.12/site-packages/aim/ext/cleanup/__init__.py", line 86, in _cleanup
    finalizer()
  File "/opt/homebrew/Cellar/python@3.12/3.12.7_1/Frameworks/Python.framework/Versions/3.12/lib/python3.12/weakref.py", line 590, in __call__
    return info.func(*info.args, **(info.kwargs or {}))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/user/Library/Caches/pypoetry/virtualenvs/finetuning-3xad2Jof-py3.12/lib/python3.12/site-packages/aim/sdk/run.py", line 109, in _close
    self._heartbeat.stop()
  File "/Users/user/Library/Caches/pypoetry/virtualenvs/finetuning-3xad2Jof-py3.12/lib/python3.12/site-packages/aim/sdk/remote_run_reporter.py", line 37, in stop
    self._rpc_client.run_instruction(self._hash, self._handler, 'stop')
  File "/Users/user/Library/Caches/pypoetry/virtualenvs/finetuning-3xad2Jof-py3.12/lib/python3.12/site-packages/aim/ext/transport/client.py", line 253, in run_instruction
    return self._run_read_instructions(queue_id, resource, method, args)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/user/Library/Caches/pypoetry/virtualenvs/finetuning-3xad2Jof-py3.12/lib/python3.12/site-packages/aim/ext/transport/client.py", line 272, in _run_read_instructions
    raise_exception(response.json().get('exception'))
  File "/Users/user/Library/Caches/pypoetry/virtualenvs/finetuning-3xad2Jof-py3.12/lib/python3.12/site-packages/aim/ext/transport/message_utils.py", line 70, in raise_exception
    raise exception(*args) if args else exception()
aim.ext.transport.tracking.UnauthorizedRequestError: 107e4d55-4d5b-4aa6-8028-9778e368a2db
2025-01-17 23:22:39,504 aim.sdk.reporter    DEBUG [19847] no interesting things to do, sleeping for 30
2025-01-17 23:22:39,504 aim.sdk.reporter    DEBUG [19847] until woken up
2025-01-17 23:22:39,504 aim.sdk.reporter    DEBUG [19847] unfinished tasks: 0
2025-01-17 23:23:09,515 aim.sdk.reporter    DEBUG [19847] no interesting things to do, sleeping for 30
2025-01-17 23:23:09,515 aim.sdk.reporter    DEBUG [19847] until woken up
2025-01-17 23:23:09,515 aim.sdk.reporter    DEBUG [19847] unfinished tasks: 0
2025-01-17 23:23:39,517 aim.sdk.reporter    DEBUG [19847] no interesting things to do, sleeping for 30
2025-01-17 23:23:39,517 aim.sdk.reporter    DEBUG [19847] until woken up
2025-01-17 23:23:39,517 aim.sdk.reporter    DEBUG [19847] unfinished tasks: 0
2025-01-17 23:24:09,523 aim.sdk.reporter    DEBUG [19847] no interesting things to do, sleeping for 30
2025-01-17 23:24:09,523 aim.sdk.reporter    DEBUG [19847] until woken up
[...]

The process will continue to log these last messages forever.

Environment

  • Aim Version: 3.27.0
  • Python version: 3.12.7
  • pip version: none, using poetry 1.8.4
  • OS (e.g., Linux): Reproduced on both MaxOs and Linux
@bluenote10 bluenote10 added help wanted Extra attention is needed type / bug Issue type: something isn't working labels Jan 17, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
help wanted Extra attention is needed type / bug Issue type: something isn't working
Projects
None yet
Development

No branches or pull requests

1 participant