How to disable DeepSpeed quantization during inference? #3567
Unanswered · yuchen2580 asked this question in Q&A

I followed the tutorial and used the following code for inference.
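Roughly, the setup follows the tutorial's `init_inference` pattern. A minimal sketch of it (the model name, `mp_size`, and `dtype` here are placeholders, not my exact code):

```python
import torch
import deepspeed
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder checkpoint -- I am testing OPT and GPT-Neo models from
# the Hugging Face Hub.
model_name = "facebook/opt-1.3b"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Wrap the model with DeepSpeed inference. mp_size=4 matches the four
# ranks visible in the log below and is what triggers the deprecation
# warning about tensor_parallel.tp_size.
model = deepspeed.init_inference(
    model,
    mp_size=4,
    dtype=torch.float16,
    replace_with_kernel_inject=True,
)
```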
However, when I examine the log output, I find:
```
[2023-05-18 11:50:58,071] [INFO] [logging.py:96:log_dist] [Rank -1] DeepSpeed info: version=0.9.1, git-hash=unknown, git-branch=unknown
[2023-05-18 11:50:58,073] [WARNING] [config_utils.py:70:_process_deprecated_field] Config parameter mp_size is deprecated use tensor_parallel.tp_size instead
05/18/2023 11:50:58 - WARNING - datasets.arrow_dataset - Loading cached processed dataset at /home/ethany/.cache/huggingface/datasets/wikitext/wikitext-2-raw-v1/1.0.0/bcb9b8b48fdeae767d48b3ce9341d5b691048450328db6e6f1a9583eb759599a/cache-9164b0c08bb4f7d7.arrow
[2023-05-18 11:50:58,074] [INFO] [logging.py:96:log_dist] [Rank -1] quantize_bits = 8 mlp_extra_grouping = False, quantize_groups = 1
[2023-05-18 11:50:58,084] [INFO] [logging.py:96:log_dist] [Rank -1] DeepSpeed info: version=0.9.1, git-hash=unknown, git-branch=unknown
[2023-05-18 11:50:58,086] [INFO] [logging.py:96:log_dist] [Rank -1] DeepSpeed info: version=0.9.1, git-hash=unknown, git-branch=unknown
[2023-05-18 11:50:58,086] [WARNING] [config_utils.py:70:_process_deprecated_field] Config parameter mp_size is deprecated use tensor_parallel.tp_size instead
[2023-05-18 11:50:58,087] [INFO] [logging.py:96:log_dist] [Rank -1] quantize_bits = 8 mlp_extra_grouping = False, quantize_groups = 1
[2023-05-18 11:50:58,087] [INFO] [logging.py:96:log_dist] [Rank -1] DeepSpeed info: version=0.9.1, git-hash=unknown, git-branch=unknown
[2023-05-18 11:50:58,088] [WARNING] [config_utils.py:70:_process_deprecated_field] Config parameter mp_size is deprecated use tensor_parallel.tp_size instead
[2023-05-18 11:50:58,088] [WARNING] [config_utils.py:70:_process_deprecated_field] Config parameter mp_size is deprecated use tensor_parallel.tp_size instead
[2023-05-18 11:50:58,089] [INFO] [logging.py:96:log_dist] [Rank -1] quantize_bits = 8 mlp_extra_grouping = False, quantize_groups = 1
[2023-05-18 11:50:58,089] [INFO] [logging.py:96:log_dist] [Rank -1] quantize_bits = 8 mlp_extra_grouping = False, quantize_groups = 1
AutoTP: [(<class 'transformers.models.opt.modeling_opt.OPTDecoderLayer'>, ['self_attn.out_proj', '.fc2'])]
AutoTP: [(<class 'transformers.models.opt.modeling_opt.OPTDecoderLayer'>, ['.fc2', 'self_attn.out_proj'])]
AutoTP: [(<class 'transformers.models.opt.modeling_opt.OPTDecoderLayer'>, ['.fc2', 'self_attn.out_proj'])]
AutoTP: [(<class 'transformers.models.opt.modeling_opt.OPTDecoderLayer'>, ['self_attn.out_proj', '.fc2'])]
```

Does this mean it is running with quantize_bits=8? How can I disable that? It causes a large accuracy drop for the OPT and GPT-Neo models I grabbed from Hugging Face.
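From skimming the docs, the int8 path in `init_inference` seems to be tied to the `dtype` argument. Would something like the following sketch (settings assumed on my part, not verified) be the right way to keep everything in fp16?

```python
import torch
import deepspeed

# model is the Hugging Face model loaded as above. Requesting fp16
# (rather than torch.int8) should, as far as I understand, keep the
# int8 quantization path from being selected; tensor_parallel replaces
# the deprecated mp_size. These settings are an assumption, not a
# confirmed fix.
model = deepspeed.init_inference(
    model,
    tensor_parallel={"tp_size": 4},
    dtype=torch.float16,
    replace_with_kernel_inject=True,
)
```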

Replies: 1 comment

@yuchen2580 Have you figured out this problem? I also have the same confusion.