Add Alpa multi-node backend for OPT 175B #48
Installed locally on COE grid. Getting various errors like this:
Similar errors occur when running the OPT Alpa benchmark script as specified here: https://alpa.ai/tutorials/opt_serving.html#launch-a-web-server-to-serve-the-opt-models
The OPT JAX benchmark script succeeds. As far as I can tell, the GPUs are available and the ray worker is aware of them.
Using the CUDA Docker image with CUDA compatibility on BRTX fails because the GPU is not supported by CUDA compatibility:
Building Alpa from source with CUDA 11.0 (which is not officially supported) produces these warnings and a core dump due to a bus error:
Cherry-picking the ptxas binary from the corresponding CUDA 11.1 image also produces a bus error.
It seems these errors are coming from JAX. The stack trace ends at a call to the
And running this part of the JAX quickstart in ipython produces similar errors: https://github.com/google/jax#automatic-differentiation-with-grad
At first I thought the issue might be that the GPUs are in use. When I try to reserve a GPU with qlogin, the reservation doesn't seem to hold, as other processes sometimes end up using those GPUs. On one attempt, however, the test install script seemed to get further than usual, so I am not entirely sure this isn't just a matter of actually getting an unused GPU.
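One way to rule out the "GPU already in use" theory before launching a test is to ask nvidia-smi which GPUs currently have compute processes attached. The following is a rough sketch, not part of the original setup: the function names are hypothetical, and it assumes nvidia-smi is on the PATH and supports the `--query-gpu=index,uuid` and `--query-compute-apps=gpu_uuid,pid` CSV queries. It only reports a snapshot, so another job on the grid could still grab a GPU right afterwards.

```python
import subprocess
from typing import Dict, List, Set


def parse_gpu_inventory(gpu_csv: str) -> Dict[str, int]:
    """Map GPU UUID -> GPU index from `--query-gpu=index,uuid` CSV output."""
    inventory = {}
    for line in gpu_csv.strip().splitlines():
        if not line.strip():
            continue
        index, uuid = (field.strip() for field in line.split(","))
        inventory[uuid] = int(index)
    return inventory


def parse_busy_uuids(apps_csv: str) -> Set[str]:
    """Collect UUIDs of GPUs that have at least one compute process."""
    busy = set()
    for line in apps_csv.strip().splitlines():
        if not line.strip():
            continue
        # First CSV field is the GPU UUID; the rest (pid, ...) is ignored here.
        busy.add(line.split(",")[0].strip())
    return busy


def idle_gpu_indices(gpu_csv: str, apps_csv: str) -> List[int]:
    """Return sorted indices of GPUs with no compute process attached."""
    inventory = parse_gpu_inventory(gpu_csv)
    busy = parse_busy_uuids(apps_csv)
    return sorted(idx for uuid, idx in inventory.items() if uuid not in busy)


def query_nvidia_smi(query_flag: str) -> str:
    """Run nvidia-smi with one query flag and return headerless CSV text."""
    result = subprocess.run(
        ["nvidia-smi", query_flag, "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def main() -> None:
    # Requires a machine with the NVIDIA driver installed.
    gpus = query_nvidia_smi("--query-gpu=index,uuid")
    apps = query_nvidia_smi("--query-compute-apps=gpu_uuid,pid")
    print("Idle GPU indices:", idle_gpu_indices(gpus, apps))
```

Calling `main()` on the compute node prints the idle indices, which could then be passed to the job through `CUDA_VISIBLE_DEVICES`. If the list is empty while the scheduler claims a GPU was reserved, that would point at the reservation rather than at JAX.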
Here is my approximate setup procedure:
I overlooked this from the wiki:
Never mind, using qrsh doesn't fix the issue.
Finally got the
Using that same EC2 instance, the textgen test succeeded in producing output (generating text completions) using the alpa/opt-125m model (on a single-node ray cluster), but produced the same warnings about picking the best algorithm, and also produced these errors at the end (hopefully just a cleanup issue??):
Investigating this warning message, I found the following in the algorithm picker code:
It looks like FB's license prohibits distribution of the OPT 175B weights, so any OPT 175B implementation is going to require a little extra work:
(good grief)