Error in Fully Sharded Data Parallelism (FSDP) setup #2931
If I do a full_shard, I get the following error:
Traceback (most recent call last):
File "/eph/nvme0/azureml/cr/j/e783517ada544c6e901713a9aef8f300/exe/wd/train.py", line 291, in <module>
main(
File "/eph/nvme0/azureml/cr/j/e783517ada544c6e901713a9aef8f300/exe/wd/train.py", line 213, in main
trainer.train()
File "/opt/conda/envs/ptca/lib/python3.10/site-packages/transformers/trainer.py", line 1938, in train
return inner_training_loop(
File "/opt/conda/envs/ptca/lib/python3.10/site-packages/transformers/trainer.py", line 2085, in _inner_training_loop
self.model = self.accelerator.prepare(self.model)
File "/opt/conda/envs/ptca/lib/python3.10/site-packages/accelerate/accelerator.py", line 1326, in prepare
result = tuple(
File "/opt/conda/envs/ptca/lib/python3.10/site-packages/accelerate/accelerator.py", line 1327, in <genexpr>
self._prepare_one(obj, first_pass=True, device_placement=d) for obj, d in zip(args, device_placement)
File "/opt/conda/envs/ptca/lib/python3.10/site-packages/accelerate/accelerator.py", line 1200, in _prepare_one
return self.prepare_model(obj, device_placement=device_placement)
File "/opt/conda/envs/ptca/lib/python3.10/site-packages/accelerate/accelerator.py", line 1484, in prepare_model
model = FSDP(model, **kwargs)
File "/opt/conda/envs/ptca/lib/python3.10/site-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py", line 503, in __init__
_init_param_handle_from_module(
File "/opt/conda/envs/ptca/lib/python3.10/site-packages/torch/distributed/fsdp/_init_utils.py", line 574, in _init_param_handle_from_module
state.compute_device = _get_compute_device(
File "/opt/conda/envs/ptca/lib/python3.10/site-packages/torch/distributed/fsdp/_init_utils.py", line 1037, in _get_compute_device
raise ValueError(
ValueError: Inconsistent compute device and `device_id` on rank 3: cuda:0 vs cuda:3 |
Hello!
As for the error, this is because FSDP is looking for which layers to shard across devices, and there is no BertLayer in BGE-M3 because it's based on XLM-RoBERTa instead. I suspect you have to use XLMRobertaLayer, so:
fsdp_config={"transformer_layer_cls_to_wrap": "XLMRobertaLayer"}
Let me know if that gets you a bit further. FSDP wasn't fully tested because DDP is faster with most models: the primary use case for FSDP is when the model itself is so big that sharding it across devices lets you reach a much higher batch size. At least, that is my understanding.
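For reference, a minimal sketch of how that suggestion might be passed to the trainer, assuming a SentenceTransformerTrainingArguments-based setup (the output_dir and the fsdp strategy string are placeholders, not taken from the issue):

```python
from sentence_transformers import SentenceTransformerTrainingArguments

# Placeholder arguments; only fsdp / fsdp_config are the point here.
args = SentenceTransformerTrainingArguments(
    output_dir="bge-m3-finetuned",
    fsdp="full_shard auto_wrap",
    fsdp_config={"transformer_layer_cls_to_wrap": "XLMRobertaLayer"},
)
```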
I just noticed that a couple of hours ago and made the fix; that resolved the layer issue. I should have been more careful. I didn't actually want to use FSDP, but for some reason I kept getting CUDA out-of-memory errors while training. My model is only 500M parameters and my dataset is a retrieval dataset with some long articles. My max_seq_length is 8192. If I truncate the articles to 300 words, I edge close to my memory limit of 80 GB (A100); when I increase that to 1000 words, I get the CUDA out-of-memory error. My suspicion is that this is because attention scales roughly quadratically with the sequence length, which is why I thought FSDP was the way to go. Is there anything else I could do other than FSDP in my case?
In any case, I tried to fix the inconsistent compute device issue with the following:
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)
model = model.to(f"cuda:{local_rank}")
However, I then received another error:
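For reference, a slightly fuller sketch of the device pinning described above, assuming torchrun sets LOCAL_RANK and that accelerate/FSDP is otherwise left to place the model:

```python
import os

import torch

# Sketch only: pin each process to its own GPU before the Trainer/accelerate
# wraps the model in FSDP, so the compute device on rank N matches FSDP's
# device_id cuda:N. LOCAL_RANK is set by torchrun.
local_rank = int(os.environ.get("LOCAL_RANK", 0))
torch.cuda.set_device(local_rank)

# Leaving the model for accelerate/FSDP to place (or moving it to
# cuda:{local_rank}) avoids the case where every rank's copy sits on cuda:0,
# which is what the "Inconsistent compute device and `device_id`" ValueError
# complains about.
```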
Just an update: I managed to fix my CUDA out-of-memory issue. Here's the fix. In the SentenceTransformer documentation:
This is true for most models, but not for XLMRobertaModels. For XLMRobertaModels, a PR in the transformers library was recently merged but has not made it into a release yet, so you need to install transformers from the GitHub repo. That reduces the memory requirements heavily.
We can close this issue for now, or I can test different configurations of FSDP just for the sake of getting it right, if needed.
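For reference, a sketch of what that fix could look like in practice; the specific PR isn't named above, so the assumption here is that the memory savings come from an efficient attention implementation (SDPA) for XLM-RoBERTa, requested via model_kwargs:

```python
# Assumption: the relevant transformers PR adds SDPA attention support for
# XLM-RoBERTa. Install transformers from source first:
#   pip install git+https://github.com/huggingface/transformers.git
from sentence_transformers import SentenceTransformer

model = SentenceTransformer(
    "BAAI/bge-m3",
    model_kwargs={"attn_implementation": "sdpa"},  # forwarded to AutoModel.from_pretrained
)
```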
I'm trying to fine-tune BAAI/bge-m3, a model whose max sequence length is 8k, on a retrieval task. Here's my trainer setup:
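Illustrative sketch only, not the actual script from the issue: a SentenceTransformerTrainer setup for BAAI/bge-m3 along these lines (dataset, loss, and hyperparameters are placeholders):

```python
from datasets import load_dataset
from sentence_transformers import (
    SentenceTransformer,
    SentenceTransformerTrainer,
    SentenceTransformerTrainingArguments,
)
from sentence_transformers.losses import MultipleNegativesRankingLoss

# Placeholder data: any (anchor, positive) retrieval dataset works here.
train_dataset = load_dataset("json", data_files="train.jsonl", split="train")

model = SentenceTransformer("BAAI/bge-m3")

args = SentenceTransformerTrainingArguments(
    output_dir="bge-m3-finetuned",
    per_device_train_batch_size=1,
    bf16=True,
    fsdp="full_shard",  # the setting referred to as full_shard above
)

trainer = SentenceTransformerTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    loss=MultipleNegativesRankingLoss(model),
)
trainer.train()
```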
I run it via
torchrun --nproc_per_node=8 train.py
When running it, though, I get the following error:
Any thoughts on what's wrong in my code?