Skip to content

Utility/Wrapper for interacting with the Computational Memory Lab's computing resources using the Dask distributed computing framework

Notifications You must be signed in to change notification settings

pennmem/cmldask

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

52 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

CMLDask

Utility/Wrapper for interacting with the Computational Memory Lab's computing resources using the Dask distributed computing framework

Installation

Clone this repository:

git clone git@github.com:pennmem/cmldask.git (Using SSH)

Install using pip:

pip install -e cmldask -r cmldask/requirements.txt

Usage

See included notebooks for more detailed instructions and examples, but in short:

  1. Initialize a new client with dask: client = CMLDask.new_dask_client_{sge,slurm}("job_name", "1GB") (depending on whether you're connecting to SGE or SLURM for job scheduling)
  2. Optional: Open dashboard using instructions printed upon running (1)
  3. Define some function func that takes an argument arg (or many arguments) and returns a result.
  4. Define an iterable of arguments for func that you want to compute in parallel: arguments
  5. Use the client to map these arguments to the function on different parallel workers: futures = client.map(func, arguments). Dask's Futures API launches parallel jobs and computes results immediately, but delays collecting the results from distributed memory.
  6. Gather results into memory of your current process: result = client.gather(futures)

Dask Documentation

Refer to the Dask Futures API for great documentation on the preferred approach for parallel computing.

Some cases might warrant using Dask Delayed, which evaluates functions lazily.

Dask clients, once initialized in your kernel, will also automatically parallelize operations on a Dask DataFrame or a Dask Array. It is straightforward to convert between standard pandas, numpy, and xarray objects and these dask objects for distributed computing. As long as your data object fits easily in memory, though, it's unlikely that you will stand to benefit much from using these implementations (there is significant overhead).

Here is a good YouTube walkthrough of the Dask interactive monitoring dashboard.

About

Utility/Wrapper for interacting with the Computational Memory Lab's computing resources using the Dask distributed computing framework

Resources

Stars

Watchers

Forks

Packages

No packages published

Contributors 3

  •  
  •  
  •