
Roadmap #1

Open · 2 of 8 tasks
jpsamaroo opened this issue Oct 24, 2024 · 5 comments

jpsamaroo (Member) commented Oct 24, 2024

Required:

Enhancements (speculative, these are up for discussion):

  • @always_everywhere and other pre-loading support (see the sketch after this list)
  • Worker state change notifications (worker added, worker exiting, etc.)
  • Experiment with multi-cluster setups
  • Type-stable interfaces and serialization
  • Observability tools
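
To make the first two items concrete, here is a minimal sketch of how pre-loading could interact with a worker-added notification. None of these names (`always_everywhere`, `ALWAYS_EXPRS`, `on_worker_added`) exist in Distributed today; this is only an illustration of the idea, written against the current `remotecall` API.

```julia
# Hypothetical sketch only: none of these names are part of Distributed.
using Distributed

# Expressions that should run on every current *and* future worker.
const ALWAYS_EXPRS = Expr[]

function always_everywhere(ex::Expr)
    push!(ALWAYS_EXPRS, ex)                       # remember for workers added later
    for p in procs()                              # run on every current process
        remotecall_wait(Core.eval, p, Main, ex)
    end
end

# A "worker added" notification (the second item above) would replay the stored
# expressions on the new process:
on_worker_added(pid) =
    foreach(ex -> remotecall_wait(Core.eval, pid, Main, ex), ALWAYS_EXPRS)

# Usage, quoting the setup code by hand where a macro form would do it for you:
# always_everywhere(:(using LinearAlgebra))
```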

Bug Fixes:


mofeing commented Oct 25, 2024

One problem @clasqui and @exaexa had with Distributed.jl and Extrae.jl is that they needed to trace the communications, which is doable with MPI thanks to dependency injection, but impossible with Distributed. We had to create our own ClusterManager that overdubs start_worker (https://github.com/bsc-quantic/Extrae.jl/blob/86e8c1a4ecd1cc0853df47443d486a1fb7767bd8/src/Instrumentation/ExtraeLocalManager.jl#L23-L33) and emits custom Extrae events in different parts of Distributed (https://github.com/bsc-quantic/Extrae.jl/blob/86e8c1a4ecd1cc0853df47443d486a1fb7767bd8/src/Instrumentation/Distributed.jl#L68-L112).
Of course, performance was horrible, and you have to do it for each cluster manager...

So, a hook system covering different parts of Distributed, independent of the cluster manager, would be super nice.
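
For illustration, a manager-independent hook registry could look roughly like the sketch below. `register_hook!`, `fire`, and the event names are all made up; the point is only that Distributed internals would call `fire(...)` at instrumentation points, and tracers like Extrae.jl would subscribe without touching the ClusterManager.

```julia
# Hypothetical sketch of a manager-independent hook registry; not an existing API.
module TracingHooks

const HOOKS = Dict{Symbol,Vector{Function}}()

"Register `f` to run whenever `event` fires (e.g. :worker_started, :message_sent)."
register_hook!(event::Symbol, f) = push!(get!(Vector{Function}, HOOKS, event), f)

"Called from instrumentation points inside Distributed internals."
function fire(event::Symbol, args...)
    for f in get(HOOKS, event, Function[])
        f(args...)
    end
end

end # module

# A tracer would then subscribe without subclassing any ClusterManager, e.g.:
# TracingHooks.register_hook!(:message_sent, (pid, nbytes) -> @info "send" pid nbytes)
```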


mofeing commented Oct 25, 2024

Furthermore, a minor feature that could be nice to experiment with is custom worker ids, like multi-dimensional ids (it is quite common to represent your problem on a 2D/3D lattice of workers).

Also, some "worker grouping" functionality akin to MPI groups and communicators would be interesting too, like "broadcast to a group". What would be killer is if a worker could be in multiple groups.
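
As a rough illustration of both ideas (assuming nothing beyond the current Distributed API; `WorkerGroup`, `group_call`, and `grid_ids` are invented names):

```julia
# Hypothetical sketch: worker groups and Cartesian "ids" layered on top of pids.
using Distributed

struct WorkerGroup
    pids::Vector{Int}
end

# "Broadcast to a group": start a call on every member; a worker may belong to
# several groups, since a group is just a set of pids.
group_call(f, g::WorkerGroup, args...) =
    [remotecall(f, pid, args...) for pid in g.pids]

# Map linear worker pids onto a multi-dimensional lattice of indices.
grid_ids(pids, dims) = Dict(zip(CartesianIndices(dims), pids))

# Usage with 4 workers:
# rows = WorkerGroup(workers()[1:2]); cols = WorkerGroup(workers()[2:3])  # overlapping
# fetch.(group_call(myid, rows))
# grid = grid_ids(workers(), (2, 2)); grid[CartesianIndex(1, 2)]
```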

> Experiment with multi-cluster setups

Do I understand it correctly that this means that it will stop being a master-worker model? Or at least multi-master? That would be nice.

exaexa (Contributor) commented Oct 25, 2024

@mofeing thanks for the ping, this would indeed be relevant. I guess there should be some slightly more general mechanism to control worker spawning; I recently had similar "fun" with just a custom JLL loaded.

@jpsamaroo one question for the "next" version: would there be any improved support for managing worker-local data? We previously did this to "just do it reliably in the simplest way": https://github.com/LCSB-BioCore/DistributedData.jl . In another iteration (with a simpler use-case) we managed to simplify it to this: https://github.com/COBREXA/COBREXA.jl/blob/master/src/worker_data.jl , used with a CachingPool.
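
For readers who have not seen those packages, the underlying pattern is roughly the sketch below (a simplification written against plain Distributed, not the DistributedData.jl or COBREXA API): each process holds its own store, and later calls refer to data by key instead of re-sending it.

```julia
# Minimal sketch of the worker-local data pattern using only stock Distributed.
using Distributed
addprocs(2)

@everywhere begin
    const WORKER_DATA = Dict{Symbol,Any}()         # one independent copy per process
    set_local!(k::Symbol, v) = (WORKER_DATA[k] = v; nothing)
    get_local(k::Symbol) = WORKER_DATA[k]
end

# Park a big array on worker 2 once...
remotecall_wait(set_local!, 2, :X, rand(10_000))

# ...then operate on it by key in later calls, without shipping it again.
remotecall_fetch(k -> sum(get_local(k)), 2, :X)
```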

+1 for @mofeing's other question -- having the worker locality somewhat more exposed (so that people can hopefully improve the scheduling of work that depends on latency & volume) would be great.

JamesWrigley (Collaborator) commented:

> Experiment with multi-cluster setups
>
> Do I understand it correctly that this means that it will stop being a master-worker model? Or at least multi-master? That would be nice.

From discussing it with Julian, I think the idea is more about better support for workers created under different cluster managers, e.g. some from SSH, some from Slurm, etc. Another possibility is having "private clusters" that are not visible to workers() and such; the use-case here would be a library spawning some workers for a specific purpose and not wanting them to be visible to other code that might be doing rmprocs() etc.
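
To illustrate the "private cluster" use-case (hypothetical names, and note that today the spawned pids would still appear in workers(), which is exactly what this roadmap item would change):

```julia
# Sketch of how a library might fence off its own workers; illustrative only.
using Distributed

struct PrivateCluster
    pids::Vector{Int}
end

# Spawn workers for library-internal use and track them separately.
private_addprocs(n; kwargs...) = PrivateCluster(addprocs(n; kwargs...))

# Library code only ever touches its own pids...
run_private(f, c::PrivateCluster, args...) =
    [remotecall_fetch(f, pid, args...) for pid in c.pids]

# ...and tears them down itself, so nothing else should rmprocs() them.
shutdown!(c::PrivateCluster) = rmprocs(c.pids)
```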

jpsamaroo (Member, Author) commented:

> Experiment with multi-cluster setups

I think there are a few possibilities here: doing multi-cluster at the level of a single Distributed logical cluster is the option I originally had in mind, but there could also be the possibility of a "multi-master" cluster. We'd probably need to discuss the pros and cons of each approach, or find a more general framework that allows both to exist.
