Zero inference is too slow #5418
Unanswered
garyyang85 asked this question in Q&A
Replies: 1 comment · 4 replies
garyyang85:
I am using DeepSpeed ZeRO-Inference for inference. The model is 13B float16, running on a single V100 32GB GPU. With standard inference, once the input goes beyond about 2000 tokens (the model should support up to 4096), it fails with "CUDA out of memory", so I turned to the ZeRO-Inference solution in DeepSpeed. But the inference speed is too slow, and GPU memory usage is only about 8GB. Is there a way to speed up the inference and make better use of the GPU? Thanks.
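
For reference, a minimal sketch of what a single-GPU ZeRO-Inference setup along these lines typically looks like (the question does not include its script, so the model name, prompt, and generation settings below are placeholders, not taken from the post):

```python
# Minimal single-GPU ZeRO-Inference sketch: ZeRO-3 with CPU parameter offload.
import torch
import deepspeed
from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers.integrations import HfDeepSpeedConfig

model_name = "path/to/your-13b-model"  # placeholder

ds_config = {
    "fp16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,                       # ZeRO-3: parameters are partitioned
        "offload_param": {                # and kept in CPU memory, then
            "device": "cpu",              # streamed to the GPU on demand
            "pin_memory": True,
        },
    },
    "train_micro_batch_size_per_gpu": 1,  # required key; unused at inference
}

# Create this before from_pretrained so the weights load directly into the
# partitioned/offloaded form instead of materializing fully on one device.
dschf = HfDeepSpeedConfig(ds_config)  # keep a reference alive

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16)

# Wrap the model; parameters now live on the host and are fetched over PCIe
# per layer, which is why GPU memory stays low (~8GB) while latency rises.
engine = deepspeed.initialize(model=model, config=ds_config)[0]
engine.module.eval()

inputs = tokenizer("An example prompt", return_tensors="pt").to("cuda")
with torch.no_grad():
    output = engine.module.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```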
Reply:
@garyyang85, ZeRO-Inference is expected to be slower because it streams the weights over the slower PCIe link. Here are a couple of things to do.
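
The concrete suggestions are in the follow-up replies, which are not reproduced above. As general context only: ZeRO-Inference throughput is usually bound by weight streaming, so the knobs most often tuned are the batch size (the PCIe transfer cost of each forward pass is amortized over the batch) and the ZeRO-3 prefetch/persistence settings. The sketch below is illustrative; the values are examples, not recommendations from this thread.

```python
# Illustrative ZeRO-3 settings often adjusted for ZeRO-Inference throughput;
# the numbers are examples only, not values suggested in this discussion.
ds_config = {
    "fp16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,
        "offload_param": {"device": "cpu", "pin_memory": True},
        # Fetch more parameter elements ahead of use to overlap PCIe
        # transfers with GPU compute.
        "stage3_prefetch_bucket_size": 500000000,
        # Parameters smaller than this stay resident instead of being
        # re-fetched for every forward pass.
        "stage3_param_persistence_threshold": 100000,
        # Upper bound on parameter elements held on the GPU at once.
        "stage3_max_live_parameters": 1000000000,
    },
    "train_micro_batch_size_per_gpu": 1,
}

# Since every forward pass streams the full set of weights over PCIe,
# batching several prompts together amortizes that cost and raises throughput.
```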