inference_with_transformers_zh

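Inference with Transformers

We provide a command-line script for running inference with native Transformers. The steps below use the Chinese-Mixtral-Instruct model as an example.

Inference with the transformers library

If you downloaded the full weights, or have already run the merge_mixtral_with_chinese_lora_low_mem.py script to merge the LoRA weights with Mixtral-8x7B-v0.1, you can load the full model directly for inference.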

python scripts/inference/inference_hf.py \
    --base_model path_to_chinese_mixtral_instruct_hf_dir \
    --with_prompt \
    --interactive

Accelerating inference with vLLM

You can use vLLM as the LLM backend for inference. This requires installing the vLLM library separately.

pip install vllm

Then add the --use_vllm flag to the original command:

python scripts/inference/inference_hf.py \
    --base_model path_to_chinese_mixtral_instruct_hf_dir \
    --with_prompt \
    --interactive \
    --use_vllm

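Parameters

--base_model {base_model}: Directory containing the Mixtral model weights and configuration files in HF format. A 🤗 Model Hub model name can also be used.
--tokenizer_path {tokenizer_path}: Directory containing the corresponding tokenizer. If omitted, it defaults to the value of --base_model.
--with_prompt: Whether to merge the input with the prompt template. If you load a mixtral-instruct model, be sure to enable this option!
--interactive: Launch in interactive mode for multiple single-turn Q&A exchanges (this is not the contextual dialogue found in llama.cpp).
--data_file {file_name}: In non-interactive mode, read file_name line by line and generate a prediction for each line.
--predictions_file {file_name}: In non-interactive mode, write the predictions to file_name in JSON format.
--only_cpu: Use only the CPU for inference.
--gpus {gpu_ids}: GPU device IDs to use; the default is 0. For multiple GPUs, separate the IDs with commas, e.g. 0,1,2.
--load_in_8bit or --load_in_4bit: Load the model in 8-bit or 4-bit to reduce GPU memory usage; --load_in_4bit is recommended.
--use_vllm: Use vLLM as the LLM backend for inference.
--use_flash_attention_2: Use Flash-Attention to accelerate inference. If this flag is not specified, the code defaults to SDPA acceleration.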
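
The non-interactive flags above can be combined into a batch run. The following is a minimal Python sketch, not part of the project's scripts: it writes one prompt per line (the layout --data_file is described as reading) and assembles the matching command line. The helper names write_prompts and build_command, the file names, and the sample prompts are all hypothetical; only the flags themselves come from the list above.

```python
import shlex
import tempfile
from pathlib import Path


def write_prompts(path, prompts):
    """Write one prompt per line, the layout --data_file is described as reading."""
    # A newline inside a prompt would split it into several inputs, so collapse
    # all runs of whitespace (including newlines) into single spaces.
    lines = [" ".join(p.split()) for p in prompts]
    Path(path).write_text("\n".join(lines) + "\n", encoding="utf-8")
    return lines


def build_command(base_model, data_file, predictions_file, load_in_4bit=True):
    """Assemble the argument list for a non-interactive run of inference_hf.py."""
    cmd = [
        "python", "scripts/inference/inference_hf.py",
        "--base_model", base_model,
        "--with_prompt",  # the parameter list says to enable this for instruct models
        "--data_file", str(data_file),
        "--predictions_file", str(predictions_file),
    ]
    if load_in_4bit:
        cmd.append("--load_in_4bit")
    return cmd


if __name__ == "__main__":
    # Hypothetical working directory and file names, for illustration only.
    workdir = Path(tempfile.mkdtemp())
    data_file = workdir / "prompts.txt"
    write_prompts(data_file, ["Introduce Beijing in one sentence.", "What is LoRA?"])
    cmd = build_command(
        "path_to_chinese_mixtral_instruct_hf_dir",
        data_file,
        workdir / "predictions.json",
    )
    # Print the command instead of running it; pass cmd to subprocess.run to execute.
    print(shlex.join(cmd))
```

Building the command as a list rather than a single string avoids quoting problems when paths contain spaces. The structure of the JSON written to --predictions_file is not described here, so the sketch does not attempt to parse it.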

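Notes

This script is intended only for quick, convenient experimentation and has not been optimized for inference speed.
The Mixtral model weights are 87G in size. Loading in 4-bit is recommended for inference, which still requires 26G of memory (GPU memory). For interactive inference, llama.cpp is recommended.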
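
The sizes quoted above can be sanity-checked with a back-of-the-envelope estimate. The sketch below assumes roughly 46.7 billion total parameters for Mixtral-8x7B (a published figure that is not stated on this page) and counts only the raw weight storage at a given bit width.

```python
GIB = 1024 ** 3

# Assumption: approximate total parameter count of Mixtral-8x7B (not from this page).
MIXTRAL_PARAMS = 46.7e9


def weight_size_gib(num_params, bits_per_param):
    """Raw storage for the weights alone, in GiB, at the given precision."""
    return num_params * bits_per_param / 8 / GIB


if __name__ == "__main__":
    for bits in (16, 8, 4):
        size = weight_size_gib(MIXTRAL_PARAMS, bits)
        print(f"{bits:>2}-bit weights: {size:.1f} GiB")
```

Under this assumption the 16-bit estimate comes out at about 87 GiB, in line with the weight size quoted above, and the 4-bit estimate at about 22 GiB. The gap between that and the reported 26G is plausibly due to quantization metadata, layers kept at higher precision, and runtime buffers such as activations and the KV cache, none of which this estimate models.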
