Running on multiple GPUs #16
Comments
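Hello, could you share the script you are running, or the error message? |
You could try setting the environment variable when you launch the script: CUDA_VISIBLE_DEVICES=0,1,2,3 python script.py |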
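The same restriction can also be set from inside the script instead of on the command line. The following is a minimal sketch, assuming the variable is set before any CUDA library is initialized and that devices are listed as integer indices; `visible_gpu_ids` is a hypothetical helper added only for illustration.

```python
import os
from typing import List

# Assumption: this runs before torch (or any other CUDA library) initializes CUDA.
# The variable is read at initialization, so setting it afterwards has no effect.
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1,2,3"


def visible_gpu_ids() -> List[int]:
    # Hypothetical helper: parse the variable into a list of physical GPU indices.
    # Assumes integer indices; the variable may also hold device UUIDs.
    raw = os.environ.get("CUDA_VISIBLE_DEVICES", "")
    return [int(part) for part in raw.split(",") if part.strip()]


if __name__ == "__main__":
    print(visible_gpu_ids())  # [0, 1, 2, 3]
```

Inside the process, the listed GPUs are renumbered from 0, so a device map that refers to GPUs 0 through 3 matches this setting.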
For a multi-GPU Python script, you can follow the approach used by ChatGLM-6B:
# script.py
from transformers import AutoTokenizer, AutoModel
import os
from typing import Dict, Tuple, Union, Optional
from torch.nn import Module


def auto_configure_device_map(num_gpus: int) -> Dict[str, int]:
    # transformer.word_embeddings counts as 1 layer
    # transformer.final_layernorm and lm_head count as 1 layer
    # transformer.layers counts as 28 layers
    # 30 layers in total are distributed across num_gpus GPUs
    num_trans_layers = 28
    per_gpu_layers = 30 / num_gpus

    # bugfix: on Linux, the weight and input passed to torch.embedding end up on different devices, which raises a RuntimeError
    # on Windows, model.device is set to transformer.word_embeddings.device
    # on Linux, model.device is set to lm_head.device
    # when chat or stream_chat is called, input_ids is moved to model.device
    # if transformer.word_embeddings.device differs from model.device, a RuntimeError is raised
    # so transformer.word_embeddings, transformer.final_layernorm and lm_head are all placed on the first GPU
    device_map = {'transformer.word_embeddings': 0,
                  'transformer.final_layernorm': 0, 'lm_head': 0}

    used = 2
    gpu_target = 0
    for i in range(num_trans_layers):
        if used >= per_gpu_layers:
            gpu_target += 1
            used = 0
        assert gpu_target < num_gpus
        device_map[f'transformer.layers.{i}'] = gpu_target
        used += 1

    return device_map
def load_model_on_gpus(checkpoint_path: Union[str, os.PathLike], num_gpus: int = 2,
                       device_map: Optional[Dict[str, int]] = None, **kwargs) -> Module:
    if num_gpus < 2 and device_map is None:
        model = AutoModel.from_pretrained(checkpoint_path, trust_remote_code=True, **kwargs).half().cuda()
    else:
        from accelerate import dispatch_model

        model = AutoModel.from_pretrained(checkpoint_path, trust_remote_code=True, **kwargs).half()

        if device_map is None:
            device_map = auto_configure_device_map(num_gpus)

        model = dispatch_model(model, device_map=device_map)

    return model
if __name__ == '__main__':
    model_url = "/data/minio01/model_file/fuzi_model"
    tokenizer = AutoTokenizer.from_pretrained(model_url, trust_remote_code=True)
    # model = AutoModel.from_pretrained(model_url, device_map="auto", trust_remote_code=True).half().cuda()
    model = load_model_on_gpus(model_url, num_gpus=4)
    response, history = model.chat(tokenizer, "你好", history=[])
    print(response)
    response, history = model.chat(tokenizer, "你能做什么", history=history)
    print(response) |
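To see what the allocation in the script above produces without loading a model, the device map can be inspected on its own. This is a sketch that copies `auto_configure_device_map` so it runs standalone; `layers_per_gpu` is a hypothetical helper, not part of the original script.

```python
from collections import Counter
from typing import Dict


def auto_configure_device_map(num_gpus: int) -> Dict[str, int]:
    # Same allocation logic as in the script, copied here so the example is self-contained.
    num_trans_layers = 28
    per_gpu_layers = 30 / num_gpus
    device_map = {'transformer.word_embeddings': 0,
                  'transformer.final_layernorm': 0, 'lm_head': 0}
    used = 2  # the embedding and the final_layernorm/lm_head pair already sit on GPU 0
    gpu_target = 0
    for i in range(num_trans_layers):
        if used >= per_gpu_layers:
            gpu_target += 1
            used = 0
        assert gpu_target < num_gpus
        device_map[f'transformer.layers.{i}'] = gpu_target
        used += 1
    return device_map


def layers_per_gpu(device_map: Dict[str, int]) -> Dict[int, int]:
    # Hypothetical helper: count how many transformer blocks land on each GPU.
    counts = Counter(gpu for name, gpu in device_map.items()
                     if name.startswith('transformer.layers.'))
    return dict(sorted(counts.items()))


if __name__ == '__main__':
    print(layers_per_gpu(auto_configure_device_map(4)))  # {0: 6, 1: 8, 2: 8, 3: 6}
```

With four GPUs, the first GPU holds fewer transformer blocks because it also carries the embedding, the final layer norm, and `lm_head`. The last GPU simply receives the remaining blocks.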
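Hello, multi-GPU model parallelism (splitting the model across different GPUs) is mainly meant to solve the problem of insufficient memory on a single GPU, not to speed things up. Because it involves communication across devices, running on multiple GPUs is slower than running on a single GPU. When a single GPU has enough memory, multi-GPU execution is generally unnecessary. |
Thank you very much. |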
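A rough way to judge whether a single GPU is enough is to estimate the memory taken by the fp16 weights alone. The sketch below makes several assumptions: the parameter count of 6e9 is illustrative rather than measured from the model in this thread, the 0.8 headroom factor is an arbitrary allowance for activations and cache, and both function names are hypothetical.

```python
def fp16_weight_gib(num_params: float, bytes_per_param: int = 2) -> float:
    # Memory for the weights only; fp16 stores each parameter in 2 bytes.
    return num_params * bytes_per_param / 1024 ** 3


def needs_model_parallel(num_params: float, gpu_mem_gib: float,
                         headroom: float = 0.8) -> bool:
    # Assumption: only `headroom` of the GPU memory is available for weights;
    # the rest is left for activations, cache, and the CUDA context.
    return fp16_weight_gib(num_params) > gpu_mem_gib * headroom


if __name__ == '__main__':
    params = 6e9  # illustrative parameter count, not read from a checkpoint
    print(round(fp16_weight_gib(params), 2))
    print(needs_model_parallel(params, gpu_mem_gib=24))
    print(needs_model_parallel(params, gpu_mem_gib=8))
```

If the estimate fits comfortably on one card, loading with `.half().cuda()` as in the single-GPU branch of `load_model_on_gpus` avoids the cross-device overhead described above.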
Hello, I tried running on multiple GPUs but have not been able to get it to work. Is there an approach you would recommend?