
Running on multiple GPUs #16

Open
wenxinmomo opened this issue Mar 13, 2024 · 7 comments

Comments

@wenxinmomo

Hello, I have tried running on multiple GPUs, but I have not been able to get it to work. Do you have any suggestions?

@Furyton
Member

Furyton commented Mar 15, 2024

Hello, could you share the script you are running, or the error message?

@wenxinmomo
Author

Question: is there a solution for running on multiple GPUs?

My attempt:
Hello, the code is as follows:

from transformers import AutoTokenizer, AutoModel


if __name__ == '__main__':
    model_url = "/data/minio01/model_file/fuzi_model"
    tokenizer = AutoTokenizer.from_pretrained(model_url, trust_remote_code=True)
    # device_map="auto" should shard the model across all visible GPUs
    model = AutoModel.from_pretrained(model_url, device_map="auto", trust_remote_code=True).half().cuda()
    response, history = model.chat(tokenizer, "你好", history=[])
    print(response)
    response, history = model.chat(tokenizer, "你能做什么", history=history)
    print(response)

I added device_map="auto" when loading the model, but at runtime the first GPU still does almost all the work. Memory usage on the other GPUs rises from 2 MiB to 477 MiB, yet there is no noticeable speedup, and they do not appear to be doing any real computation.
[screenshot: GPU usage, Snipaste_2024-03-18_16-39-50]
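
One way to see where device_map="auto" actually placed each module is to print the model's hf_device_map attribute. The sketch below assumes a transformers install with accelerate available; it drops the trailing .cuda() call, since moving an already-dispatched model can pull every module back onto a single device.

# check_placement.py (illustrative helper, not part of the original script)
import torch
from transformers import AutoModel

model_url = "/data/minio01/model_file/fuzi_model"
# Load in fp16 via torch_dtype instead of calling .half().cuda() afterwards,
# so the device placement chosen by accelerate is left untouched.
model = AutoModel.from_pretrained(
    model_url,
    device_map="auto",
    trust_remote_code=True,
    torch_dtype=torch.float16,
)

# hf_device_map records which GPU each module was assigned to; if every entry
# maps to device 0, the model was not actually sharded.
print(model.hf_device_map)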

@Furyton
Member

Furyton commented Mar 21, 2024

You could try setting an environment variable when running the script:

CUDA_VISIBLE_DEVICES=0,1,2,3 python script.py
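
To confirm that the setting took effect, a quick check with plain PyTorch (for illustration) can be run before loading the model:

# check_gpus.py (illustrative)
import torch

# With CUDA_VISIBLE_DEVICES=0,1,2,3 this should report 4 visible devices.
print("visible GPUs:", torch.cuda.device_count())
for i in range(torch.cuda.device_count()):
    print(i, torch.cuda.get_device_name(i))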

@Furyton
Member

Furyton commented Mar 21, 2024

For a multi-GPU Python script, you can refer to ChatGLM-6B.

Running CUDA_VISIBLE_DEVICES=0,1,2,3 python script.py then performs inference across 4 cards.

# script.py
from transformers import AutoTokenizer, AutoModel

import os
from typing import Dict, Tuple, Union, Optional

from torch.nn import Module


def auto_configure_device_map(num_gpus: int) -> Dict[str, int]:
    # transformer.word_embeddings takes 1 slot
    # transformer.final_layernorm and lm_head together take 1 slot
    # transformer.layers takes 28 slots
    # 30 slots in total, distributed across num_gpus cards
    num_trans_layers = 28
    per_gpu_layers = 30 / num_gpus

    # bugfix: on Linux, torch.embedding could be called with weight and input on
    # different devices, raising a RuntimeError
    # on Windows, model.device is set to transformer.word_embeddings.device
    # on Linux, model.device is set to lm_head.device
    # when chat or stream_chat is called, input_ids is placed on model.device
    # if transformer.word_embeddings.device differs from model.device, this raises a RuntimeError
    # so transformer.word_embeddings, transformer.final_layernorm and lm_head are all kept on the first card
    device_map = {'transformer.word_embeddings': 0,
                  'transformer.final_layernorm': 0, 'lm_head': 0}

    used = 2
    gpu_target = 0
    for i in range(num_trans_layers):
        if used >= per_gpu_layers:
            gpu_target += 1
            used = 0
        assert gpu_target < num_gpus
        device_map[f'transformer.layers.{i}'] = gpu_target
        used += 1

    return device_map


def load_model_on_gpus(checkpoint_path: Union[str, os.PathLike], num_gpus: int = 2,
                       device_map: Optional[Dict[str, int]] = None, **kwargs) -> Module:
    if num_gpus < 2 and device_map is None:
        model = AutoModel.from_pretrained(checkpoint_path, trust_remote_code=True, **kwargs).half().cuda()
    else:
        from accelerate import dispatch_model

        model = AutoModel.from_pretrained(checkpoint_path, trust_remote_code=True, **kwargs).half()

        if device_map is None:
            device_map = auto_configure_device_map(num_gpus)

        model = dispatch_model(model, device_map=device_map)

    return model



if __name__ == '__main__':
    model_url = "/data/minio01/model_file/fuzi_model"
    tokenizer = AutoTokenizer.from_pretrained(model_url, trust_remote_code=True)
    # model = AutoModel.from_pretrained(model_url, device_map="auto", trust_remote_code=True).half().cuda()
    model = load_model_on_gpus(model_url, num_gpus=4)
    response, history = model.chat(tokenizer, "你好", history=[])
    print(response)
    response, history = model.chat(tokenizer, "你能做什么", history=history)
    print(response)

@wenxinmomo
Author

Thank you very much for your reply. Using the code you provided, I successfully ran the model on multiple GPUs.
However, multi-GPU runs are sometimes not even as fast as running on a single card, and each GPU only reaches around 20% utilization.
Do you have any suggestions for tuning this further? Looking forward to your reply.
For 夫子明察 (Fuzi-Mingcha) mode one, a run takes about 60 s.
[screenshot: GPU utilization, Snipaste_2024-03-26_10-15-10]
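
One way to make the comparison concrete is to time the chat call itself. This is a sketch that reuses load_model_on_gpus from the script.py above; the prompt is illustrative.

# timing.py (illustrative)
import time
from transformers import AutoTokenizer
from script import load_model_on_gpus  # script.py from the earlier comment

model_url = "/data/minio01/model_file/fuzi_model"
tokenizer = AutoTokenizer.from_pretrained(model_url, trust_remote_code=True)
model = load_model_on_gpus(model_url, num_gpus=4)  # set num_gpus=1 to compare

start = time.perf_counter()
response, _ = model.chat(tokenizer, "你好", history=[])
print(f"latency: {time.perf_counter() - start:.1f} s")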

@Furyton
Member

Furyton commented Mar 30, 2024

Hello, multi-GPU model parallelism (splitting the model across different GPUs) mainly addresses the case where a single card does not have enough memory; it is not meant to speed inference up. Because it involves communication between devices, multi-GPU execution is usually slower than running on a single card. When one card has enough memory, there is generally no need to run on multiple cards.
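
To decide whether a single card is enough, one rough check (a sketch, plain PyTorch) is to compare the card's total memory against the model's fp16 footprint:

# memory_check.py (illustrative)
import torch

props = torch.cuda.get_device_properties(0)
total_gib = props.total_memory / 1024 ** 3
print(f"GPU 0: {props.name}, {total_gib:.1f} GiB total")
# Rule of thumb: fp16 weights take ~2 bytes per parameter, so a
# ChatGLM-6B-sized model needs roughly 12 GiB for weights alone,
# plus headroom for activations and the KV cache.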

@wenxinmomo
Author

Thank you very much.
