Add Alpa multi-node backend for OPT 175B #48

ccmaymay · 2022-08-09T16:21:52Z

It looks like FB's license prohibits distribution of the OPT 175B weights, so any OPT 175B implementation is going to require a little extra work:

https://alpa.ai/tutorials/opt_serving.html#launch-a-web-server-to-serve-the-opt-models

(good grief)

ccmaymay · 2022-08-25T20:10:26Z

Installed locally on COE grid. Getting various errors like this:

(alpa-r5n04) 15:59:41 cmay@r5n04 examples (main) $ python -m alpa.test_install
2022-08-25 16:00:23.951522: W external/org_tensorflow/tensorflow/compiler/xla/service/platform_util.cc:200] unable to create StreamExecutor for CUDA:2: failed initializing StreamExecutor for CUDA device ordinal 2: INTERNAL: failed call to cuDevicePrimaryCtxRetain: CUDA_ERROR_DEVICE_UNAVAILABLE: CUDA-capable device(s) is/are busy or unavailable
2022-08-25 16:00:25.307147: F external/org_tensorflow/tensorflow/core/platform/statusor.cc:33] Attempting to fetch value instead of handling error INVALID_ARGUMENT: device CUDA:2 not supported by XLA service

Similar errors when trying the OPT alpa benchmark script as specified here: https://alpa.ai/tutorials/opt_serving.html#launch-a-web-server-to-serve-the-opt-models

OPT jax benchmark script succeeds.

As far as I can tell, the GPUs are available and the ray worker is aware of them.

ccmaymay · 2022-08-31T16:53:24Z

Using the CUDA Docker image with the CUDA forward-compatibility package on BRTX fails because the GPU does not support forward compatibility:

2022-08-31 16:42:03.611628: E external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_driver.cc:272] failed call to cuInit: CUDA_ERROR_COMPAT_NOT_SUPPORTED_ON_DEVICE: forward compatibility was attempted on non supported HW
2022-08-31 16:42:03.612251: E external/org_tensorflow/tensorflow/stream_executo$
/cuda/cuda_diagnostics.cc:313] kernel version 450.51.6 does not match DSO versi$
n 455.45.1 -- cannot find working devices in this configuration
WARNING:absl:No GPU/TPU found, falling back to CPU. (Set TF_CPP_MIN_LOG_LEVEL=0 and rerun for more info.)

ccmaymay · 2022-08-31T17:20:56Z

Building alpa from source with cuda 11.0 (which is not officially supported) produces these warnings and a core dump due to a bus error:

2022-08-31 17:19:42.554883: W external/org_tensorflow/tensorflow/stream_executor/gpu/asm_compiler.cc:111] *** WARNING *** You are using ptxas 11.0.221, which is older than 11.1. ptxas before 11.1 is known to miscompile XLA code, leading to incorrect results or invalid-address errors.

You may not need to update to CUDA 11.1; cherry-picking the ptxas binary is often sufficient.
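
To see which ptxas a build will pick up before hitting this warning, a small script can compare its version against the 11.1 threshold named in the message. This is a minimal sketch: the helper names are hypothetical, and it assumes `ptxas --version` prints a token of the form `V11.0.221`.

```python
import re
import shutil
import subprocess

# Threshold taken from the XLA warning quoted above.
MIN_PTXAS = (11, 1)


def parse_ptxas_version(version_output):
    """Return (major, minor, patch) from `ptxas --version` text, or None."""
    match = re.search(r"V(\d+)\.(\d+)\.(\d+)", version_output)
    if match is None:
        return None
    return tuple(int(part) for part in match.groups())


def ptxas_is_new_enough(version, minimum=MIN_PTXAS):
    """True when the parsed version is at least `minimum` (major, minor)."""
    return version is not None and version[:2] >= minimum


def check_local_ptxas():
    """Check the ptxas found on PATH; returns None if there is none."""
    path = shutil.which("ptxas")
    if path is None:
        return None
    result = subprocess.run(
        [path, "--version"], capture_output=True, text=True, check=False
    )
    return ptxas_is_new_enough(parse_ptxas_version(result.stdout))
```

Calling `check_local_ptxas()` inside the build environment reports whether the ptxas on PATH meets the threshold. It says nothing about the bus error, which may have a different cause.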

ccmaymay · 2022-08-31T17:25:28Z

Cherry-picking the ptxas binary from the corresponding cuda 11.1 image also produces a bus error.

ccmaymay · 2022-08-31T22:00:03Z

It seems these errors are coming from JAX. The stack trace ends at a call to the get_backend function exported from this module: https://github.com/google/jax/blob/main/jax/lib/xla_bridge.py

And running this part of the JAX quickstart in ipython produces similar errors: https://github.com/google/jax#automatic-differentiation-with-grad

At first I thought the issue might be that the GPUs are in use. Reserving a GPU with qlogin does not seem to work: other processes sometimes end up using the GPUs I requested. I think qstat -j JOB_ID shows the GPU indices that were requested, but the scheduler doesn't automatically set CUDA_VISIBLE_DEVICES or SGE_HGR_gpu. That said, I have looked for unused GPUs on the node and tried to use those by setting CUDA_VISIBLE_DEVICES.

On one attempt, however, the test install script seemed to get further than usual, so I am not entirely sure this isn't just a matter of actually getting an unused GPU. nvidia-smi -q also reports the compute mode as "exclusive process," which prohibits more than one context on a GPU.
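
The manual search for unused GPUs can be made repeatable by asking nvidia-smi which GPUs have no compute process attached and building a CUDA_VISIBLE_DEVICES value from the result. This is a rough sketch with hypothetical function names, assuming nvidia-smi's CSV query output. It is still racy, since another job can claim a GPU between the check and the launch.

```python
import csv
import io


def idle_gpu_indices(gpu_csv, apps_csv):
    """Return indices of GPUs that have no compute process attached.

    gpu_csv:  text printed by
        nvidia-smi --query-gpu=index,uuid --format=csv,noheader
    apps_csv: text printed by
        nvidia-smi --query-compute-apps=gpu_uuid,pid --format=csv,noheader
    """
    # UUIDs of every GPU that currently hosts at least one compute process.
    busy = {row[0].strip() for row in csv.reader(io.StringIO(apps_csv)) if row}
    idle = []
    for row in csv.reader(io.StringIO(gpu_csv)):
        if not row:
            continue  # skip blank lines
        index, uuid = (field.strip() for field in row[:2])
        if uuid not in busy:
            idle.append(int(index))
    return idle


def visible_devices_value(indices):
    """Format GPU indices for the CUDA_VISIBLE_DEVICES environment variable."""
    return ",".join(str(i) for i in indices)
```

The resulting value has to be exported before the process initializes CUDA (for example, before `ray start`), because the variable is only read at initialization.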

ccmaymay · 2022-08-31T22:12:21Z

Here is my approximate setup procedure:

module load cuda11.3/toolkit/11.3.1-1
module load nccl/2.9.9-1_cuda11.3 
module load cudnn/8.2.0.53_cuda11.x
conda create -n alpa-r10n06 python=3.8
conda activate alpa-r10n06
conda install -y pytorch torchvision torchaudio -c pytorch
pip install accelerate 'transformers>=4.20.1'
pip install cupy-cuda113
pip install alpa
pip install jaxlib==0.3.5+cuda113.cudnn820 -f https://alpa-projects.github.io/wheels.html
ray start --head
python -m alpa.test_install
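
Because this procedure names the CUDA version in three places (the toolkit module, the cupy package, and the jaxlib local version tag), a quick consistency check can catch a mismatch before anything is installed. A minimal sketch, assuming the naming patterns seen in the commands above; the helper names are hypothetical.

```python
import re


def cuda_digits_from_module(module_name):
    """'cuda11.3/toolkit/11.3.1-1' -> '113' (None if the name does not match)."""
    match = re.match(r"cuda(\d+)\.(\d+)", module_name)
    return match.group(1) + match.group(2) if match else None


def cuda_digits_from_cupy(package):
    """'cupy-cuda113' -> '113' (None if the name does not match)."""
    match = re.fullmatch(r"cupy-cuda(\d+)", package)
    return match.group(1) if match else None


def cuda_digits_from_jaxlib(version):
    """'0.3.5+cuda113.cudnn820' -> '113' (None if the tag does not match)."""
    match = re.search(r"\+cuda(\d+)\.cudnn\d+$", version)
    return match.group(1) if match else None


def cuda_tags_agree(module_name, cupy_package, jaxlib_version):
    """True when all three names encode the same CUDA version digits."""
    tags = {
        cuda_digits_from_module(module_name),
        cuda_digits_from_cupy(cupy_package),
        cuda_digits_from_jaxlib(jaxlib_version),
    }
    return None not in tags and len(tags) == 1
```

With the values from the procedure above, `cuda_tags_agree("cuda11.3/toolkit/11.3.1-1", "cupy-cuda113", "0.3.5+cuda113.cudnn820")` returns True. It only compares names and does not verify the installed driver.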

ccmaymay · 2022-08-31T22:15:02Z

I overlooked this from the wiki:

To use a GPU in an interactive session, use qrsh with /bin/bash (not qlogin) to get a session.

qrsh -q gpu.q -l num_proc=1,mem_free=10G,h_rt=8:00:00,gpu=1

ccmaymay · 2022-08-31T22:22:42Z

Never mind, using qrsh doesn't fix the issue.

ccmaymay · 2022-09-01T23:10:27Z

Finally got the test_install script working on an EC2 p2 instance. The driver was misconfigured on the DL AMIs, so I used a base Ubuntu AMI and installed CUDA myself. I still needed to install the legacy v470 drivers for the K80 and then use a different install script to install the toolkit (11.3). The test also produces these warnings:

(MeshHostWorker pid=2411) 2022-09-01 23:03:00.750916: W external/org_tensorflow/tensorflow/compiler/xla/service/gpu/gemm_algorithm_picker.cc:212] Failed to find best cuBLAS algorithm, GEMM performance might be suboptimal: INTERNAL: All algorithms tried for %cublas-gemm.5 = f32[64,128]{1,0} custom-call(f32[64,128]{1,0} %multiply.147, f32[128,128]{1,0} %param_19), custom_call_target="__cublas$gemm", metadata={op_type="dot_general" op_name="parallelize(train_step_pipeshard_parallel_mesh_0)/dot_general[dimension_numbers=(((1,), (1,)), ((), ())) precision=None preferred_element_type=None]" source_file="/home/ubuntu/anaconda3/lib/python3.9/site-packages/flax/linen/linear.py" source_line=188}, backend_config="{\"alpha_real\":1,\"alpha_imag\":0,\"beta\":0,\"dot_dimension_numbers\":{\"lhs_contracting_dimensions\":[\"1\"],\"rhs_contracting_dimensions\":[\"1\"],\"lhs_batch_dimensions\":[],\"rhs_batch_dimensions\":[]},\"batch_size\":\"1\",\"lhs_stride\":\"8192\",\"rhs_stride\":\"16384\"}" failed. Falling back to default algorithm.  Per-algorithm errors:

ccmaymay · 2022-09-01T23:42:56Z

Using that same EC2 instance, the textgen test succeeded in producing output (generating text completions) using the alpa/opt-125m model (on a single-node ray cluster), but produced the same warnings about picking the best algorithm, and also produced these errors at the end (hopefully just a cleanup issue??):

Exception ignored in: <function RemoteArrayRef.__del__ at 0x7fe4c5c43c10>
Traceback (most recent call last):
  File "/home/ubuntu/anaconda3/lib/python3.9/site-packages/alpa/device_mesh.py", line 1373, in __del__
  File "/home/ubuntu/anaconda3/lib/python3.9/site-packages/alpa/device_mesh.py", line 1140, in delete_remote_buffers
TypeError: 'NoneType' object is not callable
```
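
An "Exception ignored in ... __del__" message of this kind often appears when a destructor runs during interpreter shutdown, after the helpers it calls have been torn down. I have not confirmed that this is what happens in alpa. The sketch below uses a hypothetical stand-in class, not alpa's code, to show the usual defensive pattern.

```python
import sys


class RemoteRef:
    """Hypothetical stand-in for an object that frees a remote buffer."""

    def __init__(self, release):
        # `release` is a callable that frees the remote resource.
        self._release = release

    def __del__(self):
        # During interpreter shutdown, module globals and helpers may already
        # be gone, so skip cleanup instead of calling into them.
        if sys is None or sys.is_finalizing():
            return
        try:
            self._release()
        except Exception:
            # An exception raised here would only be reported as
            # "Exception ignored in ...", so swallow it.
            pass
```

If the cause is indeed shutdown ordering, the message is harmless for a short-lived test run, since the process exits anyway.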

ccmaymay · 2022-09-02T15:49:07Z

Investigating this warning message, I found the following in the algorithm picker code:

// We expect GemmWithAlgorithm to fail sometimes
// -- in fact, it will fail for all algorithms if
// we're targeting < sm_50

sm_50 refers to compute capability 5.0, and the K80 has compute capability 3.7. So I expect this warning to resolve itself on more recent GPUs.
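
The comparison behind that expectation is just an ordering on (major, minor) compute-capability pairs. A small sketch, with hypothetical helper names and the sm_50 threshold taken from the XLA comment above:

```python
# Threshold from the XLA comment: algorithm selection fails below sm_50.
MIN_CAPABILITY = (5, 0)


def parse_capability(text):
    """'3.7' -> (3, 7)."""
    major, minor = text.strip().split(".")
    return int(major), int(minor)


def sm_name(capability):
    """(3, 7) -> 'sm_37'."""
    return "sm_{}{}".format(*capability)


def meets_gemm_picker_threshold(capability, minimum=MIN_CAPABILITY):
    """True when the GPU is at or above the sm_50 threshold."""
    return capability >= minimum
```

For the K80, `meets_gemm_picker_threshold(parse_capability("3.7"))` is False, which is consistent with the warning appearing on the p2 instance.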

ccmaymay added the new-model label (Request to add support for a new model) Aug 9, 2022

ccmaymay mentioned this issue Aug 9, 2022

Plan for extra-large model support (OPT 175B, Bloom 176B) #46

Closed

ccmaymay self-assigned this Aug 17, 2022

ccmaymay removed their assignment Sep 8, 2022

ccmaymay mentioned this issue Jul 20, 2023

Alpa OPT service #64

Closed
