Dask clusters getting stuck during pangeo-forge workloads #88

Open · rabernat opened this issue Mar 29, 2021 · 8 comments

@rabernat (Contributor)

I have been running full size Pangeo Forge recipes on Pangeo Cloud, using Dask Gateway clusters of 10-50 workers. This has been a good opportunity to see how things perform at scale.

In general, I'm pretty happy. I've processed several large datasets (pangeo-forge/staged-recipes#23, pangeo-forge/staged-recipes#24) more-or-less successfully.

Example Recipe Workflow

Set up Dask Cluster

```python
import subprocess
import logging
from distributed import WorkerPlugin

class PipPlugin(WorkerPlugin):
    """
    Install packages on a worker as it starts up.

    Parameters
    ----------
    packages : List[str]
        A list of packages to install with pip on startup.
    """
    def __init__(self, packages):
        self.packages = packages

    def setup(self, worker):
        # Called once on each worker as it starts up; installs the listed packages with pip.
        logger = logging.getLogger("distributed.worker")
        subprocess.call(['python', '-m', 'pip', 'install', '--no-deps', '--upgrade'] + self.packages)
        logger.info("Installed %s", self.packages)

from dask_gateway import Gateway

gateway = Gateway()
options = gateway.cluster_options()
options.worker_memory = 4  # memory limit per worker, in GB
options.worker_cores = 1   # one thread per worker
cluster = gateway.new_cluster(options)
client = cluster.get_client()

plugin = PipPlugin(
    ['git+https://github.com/pangeo-data/rechunker.git',
     'git+https://github.com/rabernat/xarray.git@zarr-chunk-fixes',
     'git+https://github.com/rabernat/pangeo-forge.git@write-with-zarr-and-local-cache'
    ]
)
client.register_worker_plugin(plugin)
cluster.scale(50)

from prefect.executors import DaskExecutor
prefect_executor = DaskExecutor(
    address=cluster.scheduler_address,
    client_kwargs={"security": cluster.security}
)
```

Create Recipe and Configure Storage

```python
import pandas as pd
from pangeo_forge.recipe import NetCDFtoZarrSequentialRecipe

url_base = 'https://dsrs.atmos.umd.edu/DATA/soda3.4.2/ORIGINAL/ocean/'

dates = pd.date_range(start='1993-01-04', end='2019-12-19', freq='5D')
date_string = [d.strftime('%Y_%m_%d') for d in dates]
urls = [url_base + f'soda3.4.2_5dy_ocean_or_{dstring}.nc' for dstring in date_string]

recipe = NetCDFtoZarrSequentialRecipe(
    input_urls=urls,
    sequence_dim="time",
    inputs_per_chunk=1,
    cache_inputs=True
)

# configure storage

import os
from pangeo_forge.storage import FSSpecTarget, CacheFSSpecTarget
import json
import gcsfs

with open('pangeo-181919-cc01a4fbb1ef.json') as fp:
    token = json.load(fp)
fs_target = gcsfs.GCSFileSystem(token=token, cache_timeout=-1, cache_type="none")
fs_scratch = gcsfs.GCSFileSystem(cache_timeout=-1, cache_type="none")
recipe_name = 'soda3.4.2_5dy_ocean_or'
target_path = f'gs://pangeo-forge-us-central1/pangeo-forge/soda/{recipe_name}'
cache_path = os.environ['PANGEO_SCRATCH'] + f'pangeo-forge-cache/{recipe_name}'
cache_target = CacheFSSpecTarget(fs_scratch, cache_path)
target = FSSpecTarget(fs_target, target_path)
recipe.input_cache = cache_target
recipe.target = target
```

Run Recipe on Cluster

```python
from pangeo_forge.executors import PrefectPipelineExecutor
pipelines = recipe.to_pipelines()
executor = PrefectPipelineExecutor()
flow = executor.pipelines_to_plan(pipelines)
state = flow.run(executor=prefect_executor)  # triggers many hours of work
```
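
As a side note, the returned `state` object can be inspected afterwards to check whether everything finished cleanly. A minimal sketch, assuming Prefect 0.x semantics where the flow state's `.result` maps tasks to their states:

```python
# Sketch: inspect the Prefect flow state after the run.
# Assumes Prefect 0.x, where `state.result` is a dict mapping tasks to their states.
if state.is_successful():
    print("flow finished successfully")
else:
    for task, task_state in state.result.items():
        if task_state.is_failed():
            print(task, task_state.message)
```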

The problem

However, with pangeo-forge/staged-recipes#23, which involves nearly 2000 input files, I started to see some strange behavior from Dask. Specifically, it looks like one worker ends up holding all of the remaining tasks and then becomes unresponsive, which grinds the whole flow to a halt.

Here's what the dashboard looks like in this state:
[screenshot: frozen_cluster]

Here's a similar example with fewer workers:
[screenshot: stuck_workers]

(Note that most workers have zero tasks to process, while one has all the rest. However, that worker has stopped working.)

Here's what the worker page looks like:
[screenshot: frozen_worker]

In the worker log for the stuck worker, I see messages about "Comm closed".

```
distributed.worker - INFO - Start worker at: tls://10.36.27.69:43361
distributed.worker - INFO - Listening to: tls://10.36.27.69:43361
distributed.worker - INFO - dashboard at: 10.36.27.69:8787
distributed.worker - INFO - Waiting to connect to: tls://dask-098a0a4542fa4de0a7cbc3e00b8d3212.prod:8786
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO - Threads: 1
distributed.worker - INFO - Memory: 4.29 GB
distributed.worker - INFO - Local Directory: /home/jovyan/dask-worker-space/dask-worker-space/worker-mzjd2u0g
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO - Starting Worker plugin <__main__.PipPlugin object at 0x7fc2ba85ce80>-294a0cab-7425-4b2d-860f-a4897f5e363c
distributed.worker - INFO - Installed ['git+https://github.com/pangeo-data/rechunker.git', 'git+https://github.com/rabernat/xarray.git@zarr-chunk-fixes', 'git+https://github.com/rabernat/pangeo-forge.git@write-with-zarr-and-local-cache']
distributed.worker - INFO - Registered to: tls://dask-098a0a4542fa4de0a7cbc3e00b8d3212.prod:8786
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO - Connection to scheduler broken. Reconnecting...
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO - Starting Worker plugin <__main__.PipPlugin object at 0x7fc2ba85cb50>-77274dc9-8886-4c12-84ad-0dab61ee97f6
distributed.worker - INFO - Installed ['git+https://github.com/pangeo-data/rechunker.git', 'git+https://github.com/rabernat/xarray.git@zarr-chunk-fixes', 'git+https://github.com/rabernat/pangeo-forge.git@write-with-zarr-and-local-cache']
distributed.worker - INFO - Registered to: tls://dask-098a0a4542fa4de0a7cbc3e00b8d3212.prod:8786
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO - Connection to scheduler broken. Reconnecting...
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO - Comm closed
```
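
For anyone reproducing this, here is a rough sketch (not part of the original workflow above) of how the per-worker task backlog and worker logs can be pulled from the client while the cluster is in this state:

```python
# Sketch: inspect the stuck cluster from the notebook (assumes `client` from the setup above).

# Tasks currently assigned to each worker; the stuck worker should show a large backlog
# while the others are empty.
for worker, tasks in client.processing().items():
    print(worker, len(tasks))

# Pull recent log lines from every worker (or pass workers=[...] to target the stuck one)
# to look for "Comm closed" / "Connection to scheduler broken" messages.
logs = client.get_worker_logs()
```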

It would be great if a Dask expert could help us get to the bottom of this.

@rabernat (Contributor, Author) commented Mar 29, 2021

A possibly related issue is that there is a huge amount of variability in the time taken by `store_chunk` tasks: sometimes 90 s, sometimes >800 s.

[screenshot: store_chunk task durations]
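
One way to quantify this is to pull the task stream from the client and compute `store_chunk` durations per worker. A rough sketch, assuming `client` from the setup above and that the dask task keys contain the stage name, as they appear to in the dashboard:

```python
from collections import defaultdict

# Sketch: measure store_chunk durations from the task stream.
durations_by_worker = defaultdict(list)
for record in client.get_task_stream():
    if "store_chunk" in record["key"]:
        for ss in record["startstops"]:
            # "compute" entries cover the actual task execution time
            if ss["action"] == "compute":
                durations_by_worker[record["worker"]].append(ss["stop"] - ss["start"])

for worker, durations in sorted(durations_by_worker.items()):
    print(worker, f"n={len(durations)}", f"min={min(durations):.0f}s", f"max={max(durations):.0f}s")
```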

@martindurant (Contributor)

Do you have profiler data on what's happening during store_chunk? Do you have typical times on a local process/thread? Do you think this might be hitting a cloud object store bottleneck or throttle?
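One low-effort way to capture that (a sketch, not something already in the workflow above) would be to wrap the flow run in Dask's `performance_report` context manager, which writes the task stream, worker profiles, and bandwidth information to a standalone HTML file:

```python
from dask.distributed import performance_report

# Sketch: capture scheduler/worker profiling for the whole run
# (assumes `flow` and `prefect_executor` from the setup above; the filename is arbitrary).
with performance_report(filename="soda-recipe-report.html"):
    state = flow.run(executor=prefect_executor)
```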

@rabernat (Contributor, Author)

I'm running it with a Dask profile now. Also, having better logging on the workers (#84) would help a lot.
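
For reference, a sketch of how the aggregated worker-side profile for just the `store_chunk` tasks could be dumped to an HTML file (the key prefix and the filename are assumptions on my part):

```python
# Sketch: save the aggregated worker profile for tasks whose keys start with "store_chunk"
# (assumes `client` from above; the filename is arbitrary).
client.profile(key="store_chunk", filename="store_chunk_profile.html")
```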

@rabernat (Contributor, Author)

Some workers are consistently ~10x faster than others. What could explain that?

[screenshot: per-worker task timings]

@rabernat (Contributor, Author)

For a recent test, I was able to eliminate the problem of high variance among workers by setting `copy_input_to_local_file=False` (see #87).
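
For context, a sketch of what that looks like, assuming `copy_input_to_local_file` is accepted as a recipe constructor keyword like the other options (see #87 for the discussion):

```python
# Sketch: same recipe as above, but without copying cached inputs to worker-local files.
# Assumes copy_input_to_local_file is a constructor keyword of this recipe class (see #87).
recipe = NetCDFtoZarrSequentialRecipe(
    input_urls=urls,
    sequence_dim="time",
    inputs_per_chunk=1,
    cache_inputs=True,
    copy_input_to_local_file=False,
)
```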

@martindurant (Contributor)

What kind of local storage do they have?

@rabernat (Contributor, Author)

Some kind of Kubernetes volume mount; I don't know the details. It's whatever our dask-gateway workers are configured with.
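
If it helps, here is a rough sketch (not verified against our deployment) of how each worker's local scratch directory and free space could be checked from the client:

```python
import shutil

def local_storage_info(dask_worker):
    # dask_worker.local_directory is the scratch path (e.g. .../dask-worker-space/... above)
    usage = shutil.disk_usage(dask_worker.local_directory)
    return {
        "local_directory": dask_worker.local_directory,
        "free_gb": usage.free / 1e9,
        "total_gb": usage.total / 1e9,
    }

# client.run executes the function on every worker; a parameter named `dask_worker`
# is automatically filled with the worker object.
client.run(local_storage_info)
```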
