[Bug] Issue with dask on branch sopa2 #129

Closed
lguerard opened this issue Sep 25, 2024 · 4 comments
Comments

@lguerard
Contributor

lguerard commented Sep 25, 2024

Description

When running the Cellpose segmentation with the dask backend, the cell crashes after a while.

Multiple workers showed the error `exceeded 95% memory budget. Restarting...`. Then, after a while, it says that a task will be `marked as failed because 4 workers died while trying to run it`.
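
For readers hitting the same messages: they are consistent with dask's default settings, where a worker is restarted once it passes 95% of its memory limit (`distributed.worker.memory.terminate`) and a task is marked as failed once more than 3 workers have died running it (`distributed.scheduler.allowed-failures`). The sketch below is only a rough estimate of the per-worker budget. It assumes RAM is split evenly across workers, which approximates but does not reproduce dask's `memory_limit="auto"` logic, and the helper name `worker_memory_budget` and the worker counts are hypothetical.

```python
def worker_memory_budget(total_ram_gb, n_workers, terminate_fraction=0.95):
    """Return (per-worker limit, restart threshold) in GB.

    Assumption: RAM is divided evenly across workers. This only
    approximates what dask does with memory_limit="auto".
    """
    if n_workers < 1:
        raise ValueError("n_workers must be at least 1")
    limit_gb = total_ram_gb / n_workers
    return limit_gb, limit_gb * terminate_fraction


# 256 GB is the RAM reported in this issue; the worker counts are illustrative.
for n_workers in (64, 48, 8):
    limit_gb, restart_at_gb = worker_memory_budget(256, n_workers)
    print(f"{n_workers:>2} workers: limit {limit_gb:.1f} GB, restart at {restart_at_gb:.1f} GB")
```

Under this assumption, one worker per core on a 64-core machine leaves about 4 GB per worker, which a single large Cellpose patch could plausibly exceed.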

Then it completely crashes with these errors:

2024-09-25 16:02:04,452 - distributed.scheduler - WARNING - Removing worker 'tcp://127.0.0.1:58018' caused the cluster to lose already computed task(s), which will be recomputed elsewhere: {('array-0173291e659995cef21d3d1e6515a34d', 0)} (stimulus_id='handle-worker-cleanup-1727272924.4456077')
2024-09-25 16:02:04,458 - distributed.scheduler - WARNING - Removing worker 'tcp://127.0.0.1:57843' caused the cluster to lose already computed task(s), which will be recomputed elsewhere: {'shuffle-taker-1981a4f154033ba88983f1452daf58f3', ('block-info-_map_read_frame-b518e369790450b6bf2ef0f396523719', 0, 0, 0)} (stimulus_id='handle-worker-cleanup-1727272924.4513524')
2024-09-25 16:02:07,609 - distributed.nanny - WARNING - Worker process still alive after 4.0 seconds, killing
2024-09-25 16:02:07,610 - distributed.nanny - WARNING - Worker process still alive after 4.0 seconds, killing
2024-09-25 16:02:07,612 - distributed.nanny - WARNING - Worker process still alive after 4.0 seconds, killing
2024-09-25 16:02:07,614 - distributed.nanny - WARNING - Worker process still alive after 4.0 seconds, killing
2024-09-25 16:02:07,615 - distributed.nanny - WARNING - Worker process still alive after 4.0 seconds, killing
2024-09-25 16:02:08,612 - distributed.client - ERROR - 
Traceback (most recent call last):
  File "S:\anaconda_envs\sopa\lib\asyncio\tasks.py", line 456, in wait_for
    return fut.result()
asyncio.exceptions.CancelledError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "S:\anaconda_envs\sopa\lib\site-packages\distributed\utils.py", line 806, in wrapper
    return await func(*args, **kwargs)
  File "S:\anaconda_envs\sopa\lib\site-packages\distributed\client.py", line 1938, in _close
    await self.cluster.close()
  File "S:\anaconda_envs\sopa\lib\site-packages\distributed\deploy\spec.py", line 448, in _close
    await self._correct_state()
  File "S:\anaconda_envs\sopa\lib\site-packages\distributed\deploy\spec.py", line 359, in _correct_state_internal
    await asyncio.gather(*tasks)
  File "S:\anaconda_envs\sopa\lib\site-packages\distributed\nanny.py", line 619, in close
    await self.kill(timeout=timeout, reason=reason)
  File "S:\anaconda_envs\sopa\lib\site-packages\distributed\nanny.py", line 400, in kill
    await self.process.kill(reason=reason, timeout=timeout)
  File "S:\anaconda_envs\sopa\lib\site-packages\distributed\nanny.py", line 882, in kill
    await process.join(max(0, deadline - time()))
  File "S:\anaconda_envs\sopa\lib\site-packages\distributed\process.py", line 330, in join
    await wait_for(asyncio.shield(self._exit_future), timeout)
  File "S:\anaconda_envs\sopa\lib\site-packages\distributed\utils.py", line 1926, in wait_for
    return await asyncio.wait_for(fut, timeout)
  File "S:\anaconda_envs\sopa\lib\asyncio\tasks.py", line 458, in wait_for
    raise exceptions.TimeoutError() from exc
asyncio.exceptions.TimeoutError
2024-09-25 16:02:08,614 - tornado.application - ERROR - Exception in callback functools.partial(<bound method IOLoop._discard_future_result of <tornado.platform.asyncio.AsyncIOMainLoop object at 0x000001478C780190>>, <Task finished name='Task-63060' coro=<SpecCluster._correct_state_internal() done, defined at S:\anaconda_envs\sopa\lib\site-packages\distributed\deploy\spec.py:346> exception=TimeoutError()>)
Traceback (most recent call last):
  File "S:\anaconda_envs\sopa\lib\asyncio\tasks.py", line 456, in wait_for
    return fut.result()
asyncio.exceptions.CancelledError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "S:\anaconda_envs\sopa\lib\site-packages\tornado\ioloop.py", line 750, in _run_callback
    ret = callback()
  File "S:\anaconda_envs\sopa\lib\site-packages\tornado\ioloop.py", line 774, in _discard_future_result
    future.result()
asyncio.exceptions.TimeoutError
Future exception was never retrieved
future: <Future finished exception=PermissionError(13, 'Access is denied', None, 5, None)>
Traceback (most recent call last):
  File "S:\anaconda_envs\sopa\lib\site-packages\distributed\process.py", line 55, in _call_and_set_future
    res = func(*args, **kwargs)
  File "S:\anaconda_envs\sopa\lib\multiprocessing\process.py", line 140, in kill
    self._popen.kill()
  File "S:\anaconda_envs\sopa\lib\multiprocessing\popen_spawn_win32.py", line 123, in terminate
    _winapi.TerminateProcess(int(self._handle), TERMINATE)
PermissionError: [WinError 5] Access is denied
Future exception was never retrieved
future: <Future finished exception=PermissionError(13, 'Access is denied', None, 5, None)>
Traceback (most recent call last):
  File "S:\anaconda_envs\sopa\lib\site-packages\distributed\process.py", line 55, in _call_and_set_future
    res = func(*args, **kwargs)
  File "S:\anaconda_envs\sopa\lib\multiprocessing\process.py", line 140, in kill
    self._popen.kill()
  File "S:\anaconda_envs\sopa\lib\multiprocessing\popen_spawn_win32.py", line 123, in terminate
    _winapi.TerminateProcess(int(self._handle), TERMINATE)
PermissionError: [WinError 5] Access is denied
Future exception was never retrieved
future: <Future finished exception=PermissionError(13, 'Access is denied', None, 5, None)>
Traceback (most recent call last):
  File "S:\anaconda_envs\sopa\lib\site-packages\distributed\process.py", line 55, in _call_and_set_future
    res = func(*args, **kwargs)
  File "S:\anaconda_envs\sopa\lib\multiprocessing\process.py", line 140, in kill
    self._popen.kill()
  File "S:\anaconda_envs\sopa\lib\multiprocessing\popen_spawn_win32.py", line 123, in terminate
    _winapi.TerminateProcess(int(self._handle), TERMINATE)
PermissionError: [WinError 5] Access is denied

Expected behavior

Cellpose patches created and processed

System

  • OS: Windows 10
  • Python version: 3.10.15
  • RAM: 256GB
@quentinblampey
Collaborator

Thanks @lguerard for detailing the issue. How much RAM and how many CPU cores do you have?

@lguerard
Contributor Author

I updated the post with the RAM amount.

As for the CPU, we actually run a virtualized environment that shares resources between different VMs, but each VM should have somewhere between 48 and 64 cores.

@quentinblampey
Collaborator

Alright, thanks for the details. I'm still experimenting with the dask Client, so I'll try to improve it over time to have a stable release in sopa 2.0.0.
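
In the meantime, one possible workaround (a sketch, not sopa's own configuration) is to start the dask client yourself with fewer workers and an explicit per-worker memory limit. `LocalCluster` and `Client` are real `distributed` classes, but the helper names, the worker count of 8, and the 80% headroom factor below are assumptions chosen for illustration.

```python
def local_cluster_kwargs(total_ram_gb, n_workers, headroom=0.8):
    """Build keyword arguments for distributed.LocalCluster (hypothetical helper).

    Only a fraction (headroom) of the RAM is handed to dask, shared
    evenly between the workers.
    """
    per_worker_gb = total_ram_gb * headroom / n_workers
    return {
        "n_workers": n_workers,
        "threads_per_worker": 1,
        "memory_limit": f"{per_worker_gb:.1f}GB",
    }


def start_client(total_ram_gb=256, n_workers=8):
    """Start a local cluster with the limits above and return a client."""
    # Imported here so the helper above can be used without dask installed.
    from distributed import Client, LocalCluster

    cluster = LocalCluster(**local_cluster_kwargs(total_ram_gb, n_workers))
    return Client(cluster)
```

On Windows, `start_client()` should be called under an `if __name__ == "__main__":` guard when run from a script, because worker processes are spawned rather than forked.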

@quentinblampey
Collaborator

quentinblampey commented Oct 31, 2024

Tagging issue #145 so that we can discuss all the issues with the dask backend in one place.
I'm closing this issue to continue on the other one!
