Ray Executor #488

Open
alxmrs opened this issue Jun 24, 2024 · 5 comments · May be fixed by #687

@alxmrs
Contributor

alxmrs commented Jun 24, 2024

In addition to accelerator support (e.g. via #304), Cubed could benefit ML users by providing a Ray executor: https://docs.ray.io/en/latest/ray-core/walkthrough.html

Since Cubed is a serverless model, I bet it could get away with only using Tasks/remote functions.
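
As a rough illustration of the tasks-only idea, here is a minimal sketch that fans independent work items out as plain Ray remote functions, with no actors. This is not Cubed's executor API: `batched`, `run_tasks_on_ray`, and `batch_size` are hypothetical names chosen for the example, and a real executor would also need retries and callbacks.

```python
from typing import Any, Callable, Iterable, Iterator, List


def batched(items: Iterable[Any], n: int) -> Iterator[List[Any]]:
    """Yield successive lists of at most n items (hypothetical helper)."""
    batch: List[Any] = []
    for item in items:
        batch.append(item)
        if len(batch) == n:
            yield batch
            batch = []
    if batch:
        yield batch


def run_tasks_on_ray(
    func: Callable[[Any], Any],
    inputs: Iterable[Any],
    batch_size: int = 100,
) -> Iterator[Any]:
    """Run func over inputs as stateless Ray tasks (hypothetical sketch).

    Submitting in batches caps the number of in-flight tasks, so the driver
    never holds more than batch_size pending object refs at once.
    """
    import ray  # imported lazily so the helper above works without Ray

    if not ray.is_initialized():
        ray.init()

    # Wrap the plain function as a Ray remote function (a Task).
    remote_func = ray.remote(func)

    for batch in batched(inputs, batch_size):
        refs = [remote_func.remote(x) for x in batch]
        # ray.get on a list of refs blocks until every task in the batch is done.
        yield from ray.get(refs)
```

With Ray installed, `list(run_tasks_on_ray(some_function, range(10), batch_size=4))` would run ten tasks in three batches. Waiting on a whole batch is the simplest choice; `ray.wait` could be used instead to keep the cluster busier.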

From talking with @cromwellian a bit, my hope is that Cubed could provide memory bounds when trying to saturate GPUs during model training. I'm not sure exactly what a training loop with Cubed would look like. Here's how Ray integrates with PyTorch, for example: https://docs.ray.io/en/latest/train/api/doc/ray.train.torch.TorchTrainer.html#ray.train.torch.TorchTrainer

@shoyer once pointed out to me that GPU OOM errors occur while taking the gradient of a function graph, not necessarily on the forward pass. I'm not sure yet whether Cubed is a good fit for tackling this problem, only that the potential is exciting.

@tomwhite
Member

Thanks for opening this issue @alxmrs! I think Ray would be a great runtime for Cubed, and it should be relatively straightforward to write an executor for (maybe a bit like the Modal one?). Do you know what people generally run Ray on in production/at scale?

@alxmrs
Contributor Author

alxmrs commented Jun 25, 2024

Hey Tom! Do you mean what does the userbase look like, or do I know specific people? On the former: Ray is the engine that OpenAI uses to train its GPT models; it's really popular in the ML world. On the latter: Ray, the person (cromwellian), uses Ray, the framework, at Roblox for model training. :)

> should be relatively straightforward to write an executor for (maybe a bit like the Modal one?).

I agree, and it does look like it will be similar to Modal.

@tomwhite
Member

I meant usage of Anyscale vs KubeRay vs ?? I was wondering if there was a choice that most people use, or whether it's a bit of everything.

> Ray, the person (cromwellian), uses Ray, the framework, at Roblox for model training. :)

Got it!

@tomwhite
Member

I added some notes on how to write a new executor in #498.

@rbavery
Contributor

rbavery commented Jan 25, 2025

> I meant usage of Anyscale vs KubeRay vs ?? I was wondering if there was a choice that most people use, or whether it's a bit of everything.

From attending Ray Summit last year, I think both Anyscale and the KubeRay operator (open source, self-managed Ray) are popular. If I had to guess based on the conversations and talks I attended, there are more users of self-managed KubeRay.

tomwhite linked a pull request on Feb 3, 2025 that will close this issue