Dask clusters getting stuck during pangeo-forge workloads #88
Comments
Do you have profiler data on what's happening during store_chunk? Do you have typical times on a local process/thread? Do you think this might be hitting a cloud object store bottleneck or throttle?
I'm running it with a dask profile now. Also having better logging on the workers (#84) would help a lot.
Dask performance report is here: https://gist.githack.com/rabernat/60e9dd299032ff3d4234239c257fa49d/raw/c04c47989d11c0c31efc05c95e8f3cdb1fe347b1/dask-report.html
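For anyone who wants to generate a similar report, distributed ships a `performance_report` context manager. A minimal sketch, with `result` standing in for whatever delayed object the recipe produces:

```python
from dask.distributed import performance_report

# Capture scheduler and worker activity for the wrapped computation
# into a standalone HTML report.
with performance_report(filename="dask-report.html"):
    result.compute()  # `result` is a placeholder for the recipe's delayed graph
```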
For a recent test, I was able to eliminate the problem of high variance among workers by setting
What kind of local storage do they have?
Some kind of Kubernetes volume mount; I don't know the details. It's however our dask-gateway workers are configured.
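If it helps narrow that down, the workers can be asked directly what they are using for local scratch space. A sketch, assuming `client` is the distributed Client connected to the Gateway cluster:

```python
# client.run injects the Worker instance when the function accepts a
# `dask_worker` keyword argument; local_directory is where spilled data goes.
local_dirs = client.run(lambda dask_worker: dask_worker.local_directory)
print(local_dirs)  # worker address -> path of that worker's local directory
```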
I have been running full size Pangeo Forge recipes on Pangeo Cloud, using Dask Gateway clusters of 10-50 workers. This has been a good opportunity to see how things perform at scale.
In general, I'm pretty happy. I've processed several large datasets (pangeo-forge/staged-recipes#23, pangeo-forge/staged-recipes#24) more-or-less successfully.
Example Recipe Workflow (sketched in code below):
1. Set up Dask Cluster
2. Create Recipe and Configure Storage
3. Run Recipe on Cluster
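Roughly, those steps look like the sketch below. The Dask Gateway calls are the standard ones; the recipe construction and the `to_dask()` call are placeholders for however the recipe object is actually built and executed:

```python
from dask_gateway import Gateway

# 1. Set up Dask cluster on Pangeo Cloud via Dask Gateway.
gateway = Gateway()
cluster = gateway.new_cluster()
cluster.scale(20)                # the runs described here used 10-50 workers
client = cluster.get_client()

# 2. Create recipe and configure storage.
#    `make_recipe_with_storage` is a hypothetical helper standing in for the
#    pangeo-forge-recipes code that builds the recipe and points it at the
#    cloud cache/target buckets.
recipe = make_recipe_with_storage()

# 3. Run recipe on cluster, assuming the recipe exposes a Dask graph.
delayed = recipe.to_dask()       # assumed interface; adjust to the real API
delayed.compute()
```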
The problem
However, with pangeo-forge/staged-recipes#23, which involves nearly 2000 input files, I started to see some weird behavior from Dask. Specifically, it looks like one worker is getting stuck with all the tasks and then becoming unresponsive. This grinds the whole flow to a halt.
Here's what the dashboard looks like in this state:

Here's a similar example with fewer workers:

(Note that most workers have zero tasks to process, while one has all the rest. However, this worker has stopped working.)
Here's what the worker page looks like:

In the worker log for the stuck worker, I see messages about "Comm closed".
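In case it helps whoever picks this up, both observations (one worker holding nearly all the tasks, and the "Comm closed" messages) can be cross-checked from the client side without the dashboard. This is only a sketch; it assumes `client` is the distributed Client connected to the stuck cluster:

```python
# Count how many tasks are currently assigned to each worker
# (should show one worker with nearly everything if the imbalance persists).
processing = client.processing()  # worker address -> list of task keys
for worker, keys in sorted(processing.items(), key=lambda kv: -len(kv[1])):
    print(f"{worker}: {len(keys)} tasks")

# Pull recent log lines from every worker and look for the comm errors.
logs = client.get_worker_logs(n=200)  # worker address -> [(level, message), ...]
for worker, lines in logs.items():
    for level, message in lines:
        if "Comm closed" in message:
            print(worker, level, message)
```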
It would be great if a Dask expert could help us get to the bottom of this.