Ray Executor #488
Comments
Hey Tom! Do you mean what does the userbase look like, or do I know specific people? On the former: Ray is the engine that OpenAI uses to train its GPT models; it's really popular in the ML world. On the latter: Ray, the person (cromwellian), uses Ray, the framework, at Roblox for model training. :)
I agree, and it does look like it will be similar to Modal.
I meant usage of Anyscale vs KubeRay vs ?? I was wondering if there was a choice that most people use, or whether it's a bit of everything.
Got it!
I added some notes on how to write a new executor in #498.
From attending Ray Summit last year, I think both Anyscale and the KubeRay operator (open source, self-managed Ray) are popular, but if I had to guess based on the conversations and talks I attended, there are more users of self-managed KubeRay.
In addition to accelerator support (e.g. via #304), Cubed could benefit ML users by providing a Ray executor: https://docs.ray.io/en/latest/ray-core/walkthrough.html
Since Cubed is a serverless model, I bet it could get away with only using Tasks/remote functions.
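To make the "Tasks only" idea concrete, here is a rough sketch of mapping a stateless function over chunk inputs with Ray remote functions and collecting results as they finish. This is not Cubed's executor interface: `process_chunk` and `map_unordered` are hypothetical names used only for illustration, and the thread-pool fallback is there just so the snippet runs where Ray is not installed.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

try:
    import ray  # optional; only the Ray branch below needs it
except ImportError:
    ray = None


def process_chunk(chunk_id: int) -> int:
    """Hypothetical stand-in for one stateless task (e.g. one chunk of work)."""
    return chunk_id * chunk_id


def map_unordered(func, inputs):
    """Run func over inputs, yielding (input, result) pairs as tasks finish."""
    if ray is None:
        # Fallback (assumption for illustration): local threads instead of Ray.
        with ThreadPoolExecutor() as pool:
            futures = {pool.submit(func, x): x for x in inputs}
            for fut in as_completed(futures):
                yield futures[fut], fut.result()
        return

    ray.init(ignore_reinit_error=True)
    remote_func = ray.remote(func)  # wrap the plain function as a Ray task
    # Submit one task per input; each call returns an ObjectRef immediately.
    pending = {remote_func.remote(x): x for x in inputs}
    while pending:
        # Block until at least one task is done, then hand back its result.
        ready, _ = ray.wait(list(pending), num_returns=1)
        for ref in ready:
            yield pending.pop(ref), ray.get(ref)


results = dict(map_unordered(process_chunk, range(4)))
print(results)
```

No actors or shared state are involved: each task takes an input and returns a value, which is the property that would let a serverless-style executor get by with remote functions alone.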
From talking with @cromwellian a bit, my hope is that Cubed could provide memory bounds when trying to saturate GPUs during model training. I'm not sure exactly what a training loop with Cubed would look like. Here's how Ray integrates with PyTorch, for example: https://docs.ray.io/en/latest/train/api/doc/ray.train.torch.TorchTrainer.html#ray.train.torch.TorchTrainer
@shoyer once pointed out to me that GPU OOM errors tend to occur while taking the gradient of a function graph, not necessarily on the forward pass. I'm not sure right now whether Cubed is in fact a good fit for tackling this problem, only that the potential is exciting.