
Roadmap #1

Open · 2 of 8 tasks
jpsamaroo opened this issue Oct 24, 2024 · 5 comments

jpsamaroo (Member) commented Oct 24, 2024

Required:

Enhancements (speculative, these are up for discussion):

  • @always_everywhere and other pre-loading support (see the sketch after this list)
  • Worker state change notifications (worker added, worker exiting, etc.)
  • Experiment with multi-cluster setups
  • Type-stable interfaces and serialization
  • Observability tools
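
To make the first two items concrete, here is a minimal sketch of how pre-loading could interact with a worker-added notification. None of these names (`always_everywhere`, `ALWAYS_EXPRS`, `on_worker_added`) exist in Distributed today; this is only an illustration of the idea, written against the current `remotecall` API.

```julia
# Hypothetical sketch only: none of these names are part of Distributed.
using Distributed

# Expressions that should run on every current *and* future worker.
const ALWAYS_EXPRS = Expr[]

function always_everywhere(ex::Expr)
    push!(ALWAYS_EXPRS, ex)                       # remember for workers added later
    for p in procs()                              # run on every current process
        remotecall_wait(Core.eval, p, Main, ex)
    end
end

# A "worker added" notification (the second item above) would replay the stored
# expressions on the new process:
on_worker_added(pid) =
    foreach(ex -> remotecall_wait(Core.eval, pid, Main, ex), ALWAYS_EXPRS)

# Usage, quoting the setup code by hand where a macro form would do it for you:
# always_everywhere(:(using LinearAlgebra))
```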

Bug Fixes:


mofeing commented Oct 25, 2024

One problem @clasqui and @exaexa had with Distributed.jl and Extrae.jl is that they needed to trace the communications, which is doable with MPI thanks to dependency injection, but impossible with Distributed. We had to create our own ClusterManager that overdubs start_worker (https://github.com/bsc-quantic/Extrae.jl/blob/86e8c1a4ecd1cc0853df47443d486a1fb7767bd8/src/Instrumentation/ExtraeLocalManager.jl#L23-L33) and emits custom Extrae events in different parts of Distributed (https://github.com/bsc-quantic/Extrae.jl/blob/86e8c1a4ecd1cc0853df47443d486a1fb7767bd8/src/Instrumentation/Distributed.jl#L68-L112).
Of course, performance was horrible, and you have to do it for each cluster manager...

So, a hook system covering different parts of Distributed, independent of the cluster manager, would be super nice.
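
For illustration, a manager-independent hook registry could look roughly like the sketch below. `register_hook!`, `fire`, and the event names are all made up; the point is only that Distributed internals would call `fire(...)` at instrumentation points, and tracers like Extrae.jl would subscribe without touching the ClusterManager.

```julia
# Hypothetical sketch of a manager-independent hook registry; not an existing API.
module TracingHooks

const HOOKS = Dict{Symbol,Vector{Function}}()

"Register `f` to run whenever `event` fires (e.g. :worker_started, :message_sent)."
register_hook!(event::Symbol, f) = push!(get!(Vector{Function}, HOOKS, event), f)

"Called from instrumentation points inside Distributed internals."
function fire(event::Symbol, args...)
    for f in get(HOOKS, event, Function[])
        f(args...)
    end
end

end # module

# A tracer would then subscribe without subclassing any ClusterManager, e.g.:
# TracingHooks.register_hook!(:message_sent, (pid, nbytes) -> @info "send" pid nbytes)
```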


mofeing commented Oct 25, 2024

Furthermore, a minor feature that could be nice to experiment with is custom worker ids, like multi-dimensional ids (it is quite common to represent your problem on a 2D/3D lattice of workers).

Also, some "worker grouping" functionality akin to MPI groups and communicators would be interesting too, like "broadcast to a group". What would be killer is if a worker could be in multiple groups.
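
As a rough illustration of both ideas (assuming nothing beyond the current Distributed API; `WorkerGroup`, `group_call`, and `grid_ids` are invented names):

```julia
# Hypothetical sketch: worker groups and Cartesian "ids" layered on top of pids.
using Distributed

struct WorkerGroup
    pids::Vector{Int}
end

# "Broadcast to a group": start a call on every member; a worker may belong to
# several groups, since a group is just a set of pids.
group_call(f, g::WorkerGroup, args...) =
    [remotecall(f, pid, args...) for pid in g.pids]

# Map linear worker pids onto a multi-dimensional lattice of indices.
grid_ids(pids, dims) = Dict(zip(CartesianIndices(dims), pids))

# Usage with 4 workers:
# rows = WorkerGroup(workers()[1:2]); cols = WorkerGroup(workers()[2:3])  # overlapping
# fetch.(group_call(myid, rows))
# grid = grid_ids(workers(), (2, 2)); grid[CartesianIndex(1, 2)]
```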

> Experiment with multi-cluster setups

Do I understand it correctly that this means that it will stop being a master-worker model? Or at least multi-master? That would be nice.

exaexa (Contributor) commented Oct 25, 2024

@mofeing thanks for the ping, this would indeed be relevant. I guess there should be some slightly more general mechanism to control worker spawning; I recently had similar "fun" with just a custom JLL loaded.

@jpsamaroo one question for the "next" version: would there be any improved support for managing worker-local data? We previously did this to "just do it reliably in the simplest way": https://github.com/LCSB-BioCore/DistributedData.jl . In another iteration (with a simpler use-case) we managed to simplify it to this: https://github.com/COBREXA/COBREXA.jl/blob/master/src/worker_data.jl , used with a CachingPool.
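
For readers who have not seen those packages, the underlying pattern is roughly the sketch below (a simplification written against plain Distributed, not the DistributedData.jl or COBREXA API): each process holds its own store, and later calls refer to data by key instead of re-sending it.

```julia
# Minimal sketch of the worker-local data pattern using only stock Distributed.
using Distributed
addprocs(2)

@everywhere begin
    const WORKER_DATA = Dict{Symbol,Any}()         # one independent copy per process
    set_local!(k::Symbol, v) = (WORKER_DATA[k] = v; nothing)
    get_local(k::Symbol) = WORKER_DATA[k]
end

# Park a big array on worker 2 once...
remotecall_wait(set_local!, 2, :X, rand(10_000))

# ...then operate on it by key in later calls, without shipping it again.
remotecall_fetch(k -> sum(get_local(k)), 2, :X)
```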

+1 for @mofeing's other question -- having the worker locality somewhat more exposed (so that people can hopefully improve the scheduling of work that depends on latency & volume) would be great.

JamesWrigley (Collaborator) commented:

> Experiment with multi-cluster setups
>
> Do I understand it correctly that this means that it will stop being a master-worker model? Or at least multi-master? That would be nice.

From discussing it with Julian, I think the idea is more about better support for workers created under different cluster managers, e.g. some from SSH, some from Slurm, etc. Another possibility is having "private clusters" that are not visible to workers() and such; the use-case here would be a library spawning some workers for a specific purpose and not wanting them to be visible to other code that might be doing rmprocs() etc.
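
To illustrate the "private cluster" use-case (hypothetical names, and note that today the spawned pids would still appear in workers(), which is exactly what this roadmap item would change):

```julia
# Sketch of how a library might fence off its own workers; illustrative only.
using Distributed

struct PrivateCluster
    pids::Vector{Int}
end

# Spawn workers for library-internal use and track them separately.
private_addprocs(n; kwargs...) = PrivateCluster(addprocs(n; kwargs...))

# Library code only ever touches its own pids...
run_private(f, c::PrivateCluster, args...) =
    [remotecall_fetch(f, pid, args...) for pid in c.pids]

# ...and tears them down itself, so nothing else should rmprocs() them.
shutdown!(c::PrivateCluster) = rmprocs(c.pids)
```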

jpsamaroo (Member, Author) commented:

> Experiment with multi-cluster setups

I think there are a few possibilities here: doing multi-cluster at the level of a single Distributed logical cluster is the option I originally had in mind, but there could also be the possibility of a "multi-master" cluster. We'd probably need to discuss the pros and cons of each approach, or find a more general framework that allows both to exist.
