CMLDask

A utility wrapper for interacting with the Computational Memory Lab's computing resources using the Dask distributed computing framework.

Installation

Clone this repository (using SSH):

git clone git@github.com:pennmem/cmldask.git

Install using pip:

pip install -e cmldask -r cmldask/requirements.txt

Usage

See the included notebooks for more detailed instructions and examples. In short:

  1. Initialize a new Dask client: client = CMLDask.new_dask_client_sge("job_name", "1GB") or client = CMLDask.new_dask_client_slurm("job_name", "1GB"), depending on whether you're connecting to SGE or SLURM for job scheduling.
  2. Optional: Open dashboard using instructions printed upon running (1)
  3. Define some function func that takes an argument arg (or many arguments) and returns a result.
  4. Define an iterable of arguments for func that you want to compute in parallel: arguments
  5. Use the client to map these arguments to the function on different parallel workers: futures = client.map(func, arguments). The call returns a list of futures immediately. The workers start computing right away, but the results stay in distributed memory until you collect them.
  6. Gather results into memory of your current process: result = client.gather(futures)
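Steps 3 to 6 can be sketched as follows. This is a minimal illustration, not the lab's actual setup: it assumes `dask.distributed` is installed and swaps in a local, in-process `Client` so it runs without a cluster. On the cluster, `client` would come from step 1 instead. The function `square` and the list `arguments` are hypothetical placeholders.

```python
from dask.distributed import Client


def square(x):
    """Hypothetical stand-in for an expensive per-item computation."""
    return x ** 2


# Assumption: a local, thread-based client used only so this sketch runs
# anywhere. On the cluster, use the client created in step 1 instead.
client = Client(
    processes=False,         # run workers as threads in this process
    n_workers=1,
    threads_per_worker=2,
    dashboard_address=None,  # skip the dashboard for this small example
)

# Step 4: an iterable of arguments, one task per element.
arguments = list(range(10))

# Step 5: map returns a list of futures right away; work starts on the workers.
futures = client.map(square, arguments)

# Step 6: gather blocks until every future is done and pulls the results
# into the memory of the current process, in the same order as `arguments`.
result = client.gather(futures)
print(result)

client.close()
```

Because `gather` preserves the order of the futures, `result[i]` corresponds to `arguments[i]`.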

Dask Documentation

Refer to the Dask Futures API documentation for the preferred approach to parallel computing.

Some cases might warrant using Dask Delayed, which evaluates functions lazily.
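As a rough illustration of lazy evaluation with Dask Delayed, the sketch below builds a small task graph and runs it only when `compute` is called. The functions `load` and `total` are hypothetical, and the `"synchronous"` scheduler is chosen here only so the example runs without a cluster. With a client active in your kernel, a plain `compute()` would normally use that client.

```python
from dask import delayed


@delayed
def load(n):
    """Hypothetical loader: pretend to read n values from disk."""
    return list(range(n))


@delayed
def total(values):
    """Hypothetical reduction over one loaded chunk."""
    return sum(values)


# Nothing runs yet: each call only records a node in the task graph.
lazy_totals = [total(load(n)) for n in range(1, 5)]
grand_total = delayed(sum)(lazy_totals)

# The whole graph is evaluated here, in one go.
value = grand_total.compute(scheduler="synchronous")
print(value)
```

Unlike `client.map`, which starts work as soon as it is called, nothing is computed here until the final `compute`.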

Once a Dask client is initialized in your kernel, operations on a Dask DataFrame or a Dask Array are automatically parallelized through it. It is straightforward to convert between standard pandas, numpy, and xarray objects and these Dask objects for distributed computing. As long as your data object fits easily in memory, though, you are unlikely to benefit much from the Dask versions, because distributing the work carries significant overhead.
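The conversion in one direction and back might look like the sketch below, which assumes `numpy` and `dask[array]` are installed. The array contents and the chunk size are arbitrary choices for illustration. For an array this small, the plain numpy call would be faster.

```python
import dask.array as da
import numpy as np

# An ordinary in-memory numpy array (arbitrary example data).
x = np.arange(1000, dtype=float)

# numpy -> Dask: split the array into chunks that can be processed in parallel.
dx = da.from_array(x, chunks=250)

# Operations on the Dask array are lazy until compute() is called.
lazy_mean = dx.mean()
mean = lazy_mean.compute()

# Dask -> numpy: compute() on the array itself returns a numpy ndarray.
back = dx.compute()

print(mean, type(back).__name__)
```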

Here is a good YouTube walkthrough of the Dask interactive monitoring dashboard.