185 commits
f5e2b80
Update README.md
yudi3ooo Nov 14, 2025
dd646f0
Update 01-gpu.md
yudi3ooo Nov 17, 2025
631b2c4
Update 02-cpu.md
yudi3ooo Nov 17, 2025
940f142
Update 03-ai-accelerator.md
yudi3ooo Nov 17, 2025
2bca565
Update README.md
yudi3ooo Nov 17, 2025
7af772f
Update 02-quickstart.md
yudi3ooo Nov 17, 2025
4c74b20
Update 04-troubleshooting.md
yudi3ooo Nov 17, 2025
672d14a
Update 05-faq.md
yudi3ooo Nov 17, 2025
370bd60
Update 06-v1-user-guide.md
yudi3ooo Nov 17, 2025
00f2f3c
Update tutorial link for audio language example
kqyang-77 Nov 27, 2025
554aace
Update 02-basic.md
kqyang-77 Nov 27, 2025
216457d
Update 03-chat_with_tools.md
kqyang-77 Nov 27, 2025
9d1fd8d
Update 04-cpu_offload_lmcache.md
kqyang-77 Nov 27, 2025
5ad7efc
Update 05-data_parallel.md
kqyang-77 Nov 27, 2025
9c694a7
Update 06-disaggregated_prefill_lmcache.md
kqyang-77 Nov 27, 2025
82eedad
Update 07-disaggregated_prefill.md
kqyang-77 Nov 27, 2025
2efa086
Update 08-distributed.md
kqyang-77 Nov 27, 2025
1411d47
Update 09-eagle.md
kqyang-77 Nov 27, 2025
9759e55
Update tutorial link for Encoder Decoder Multimodal
kqyang-77 Nov 27, 2025
1afe7c7
Update 11-encoder_decoder.md
kqyang-77 Nov 27, 2025
da0c010
Update 12-llm_engine_example.md
kqyang-77 Nov 27, 2025
91218c2
Update 13-load_sharded_state.md
kqyang-77 Nov 27, 2025
7d36b59
Update 14-lora_with_quantization_inference.md
kqyang-77 Nov 27, 2025
e66f43c
Update 15-mistral-small.md
kqyang-77 Nov 27, 2025
eb91366
Update 16-mlpspeculator.md
kqyang-77 Nov 27, 2025
8f69061
Update 17-multilora_inference.md
kqyang-77 Nov 27, 2025
816ccbc
Update 18-neuron_int8_quantization.md
kqyang-77 Nov 27, 2025
305f508
Update 19-neuron.md
kqyang-77 Nov 27, 2025
bafb3f2
Update 20-openai_batch.md
kqyang-77 Nov 27, 2025
2e5b05f
Update 21-prefix_caching.md
kqyang-77 Nov 27, 2025
a3424e5
Update 21-prefix_caching.md
kqyang-77 Nov 27, 2025
463c4fa
Update 23-profiling_tpu.md
kqyang-77 Nov 27, 2025
f051b70
Update 24-profiling.md
kqyang-77 Nov 27, 2025
350903c
Update 25-reproduciblity.md
kqyang-77 Nov 27, 2025
6e5d140
Update 26-rlhf.md
kqyang-77 Nov 27, 2025
c40c125
Update 27-rlhf_colocate.md
kqyang-77 Nov 27, 2025
f8c6a53
Update tutorial link for Rlhf Utils
kqyang-77 Nov 27, 2025
fb6779c
Update link for online vLLM tutorial
kqyang-77 Nov 27, 2025
595953e
Update tutorial link for Simple Profiling
kqyang-77 Nov 27, 2025
52c7efc
Update 31-structured_outputs.md
kqyang-77 Nov 27, 2025
3c14191
Update link for online vLLM tutorial
kqyang-77 Nov 27, 2025
d5bbbcf
Update tutorial link for TPU example
kqyang-77 Nov 27, 2025
0da2581
Update 34-vision_language.md
kqyang-77 Nov 27, 2025
d5ad571
Update 35-vision_language_embedding.md
kqyang-77 Nov 27, 2025
52d96df
Update 36-vision_language_multi_image.md
kqyang-77 Nov 27, 2025
f0162a6
Update README.md
kqyang-77 Nov 27, 2025
9d063dd
Update 01-api_client.md
kqyang-77 Nov 27, 2025
31ee94c
Update 02-chart-helm.md
kqyang-77 Nov 27, 2025
ea5cba5
Update 03-cohere_rerank_client.md
kqyang-77 Nov 27, 2025
b252e5c
Update 04-disaggregated_prefill.md
kqyang-77 Nov 27, 2025
2362af8
Update tutorial link for Gradio OpenAI chatbot
kqyang-77 Nov 27, 2025
9a13b4a
Update 06-gradio_webserver.md
kqyang-77 Nov 27, 2025
13911df
Update 07-jinaai_rerank_client.md
kqyang-77 Nov 27, 2025
4aa0ce1
Update tutorial link for multi-node serving
kqyang-77 Nov 27, 2025
f3e7a02
Update tutorial link for OpenAI Chat Completion Client
kqyang-77 Nov 27, 2025
1830189
Update 10-openai_chat_completion_client_for_multimodal.md
kqyang-77 Nov 27, 2025
647c40f
Update 11-openai_chat_completion_client_with_tools.md
kqyang-77 Nov 27, 2025
c544cf2
Update tutorial link for OpenAI Chat Completion Client
kqyang-77 Nov 27, 2025
4a7c752
Update 13-openai_chat_completion_structured_outputs.md
kqyang-77 Nov 27, 2025
9db7a84
Update 14-openai_chat_completion_structured_outputs_with_reasoning.md
kqyang-77 Nov 27, 2025
60310f3
Update 15-openai_chat_completion_tool_calls_with_reasoning.md
kqyang-77 Nov 27, 2025
0a8d89c
Update 16-openai_chat_completion_with_reasoning.md
kqyang-77 Nov 27, 2025
318e2db
Update 17-openai_chat_completion_with_reasoning_streaming.md
kqyang-77 Nov 27, 2025
2cf93d6
Update 18-openai_chat_embedding_client_for_multimodal.md
kqyang-77 Nov 27, 2025
89a3f11
Update 19-openai_completion_client.md
kqyang-77 Nov 27, 2025
075b612
Update 20-openai_cross_encoder_score.md
kqyang-77 Nov 27, 2025
61d074a
Update 21-openai_embedding_client.md
kqyang-77 Nov 27, 2025
fa30c11
Update 22-openai_pooling_client.md
kqyang-77 Nov 27, 2025
194bcb3
Update 23-openai_transcription_client.md
kqyang-77 Nov 27, 2025
8568f8b
Update 24-opentelemetry.md
kqyang-77 Nov 27, 2025
3718f5b
Update 25-prometheus_grafana.md
kqyang-77 Nov 27, 2025
06e50ad
Update 26-run_cluster.md
kqyang-77 Nov 27, 2025
e68b744
Update 27-sagemaker-entrypoint.md
kqyang-77 Nov 27, 2025
6c35652
Update README.md
kqyang-77 Nov 27, 2025
8a0bd07
Update 01-logging_configuration.md
kqyang-77 Nov 27, 2025
e6ef38f
Update 02-tensorize_vllm_model.md
kqyang-77 Nov 27, 2025
8940bf9
Update README.md
kqyang-77 Nov 27, 2025
46d9bd5
Update 01-runai_model_streamer.md
kqyang-77 Nov 27, 2025
b4ae0f7
Update 02-tensorizer.md
kqyang-77 Nov 27, 2025
4523895
Update 03-fastsafetensor.md
kqyang-77 Nov 27, 2025
b2447dc
Update README.md
kqyang-77 Nov 27, 2025
27448b8
Update 01-supported_models.md
kqyang-77 Nov 27, 2025
488c543
Update 02-generative_models.md
kqyang-77 Nov 27, 2025
95af61a
Update 03-Pooling Models.md
kqyang-77 Nov 27, 2025
c5f376a
Update 03-bnb.md
kqyang-77 Nov 27, 2025
8cd290a
Update 04-gguf.md
kqyang-77 Nov 27, 2025
cdc753e
Update 05-gptqmodel.md
kqyang-77 Nov 27, 2025
6da2c28
Update 06-int4.md
kqyang-77 Nov 27, 2025
955233f
Update 07-int8.md
kqyang-77 Nov 27, 2025
b1a0442
Update 08-fp8.md
kqyang-77 Nov 27, 2025
89c72f1
Update 09-quantized_kvcache.md
kqyang-77 Nov 27, 2025
e9f1bcd
Update 10-TorchAO.md
kqyang-77 Nov 27, 2025
2e21be9
Update README.md
kqyang-77 Nov 27, 2025
aef48a4
Update 02-lora.md
kqyang-77 Nov 27, 2025
693f493
Update 03-tool_calling.md
kqyang-77 Nov 27, 2025
2d75129
Update 04-reasoning_outputs.md
kqyang-77 Nov 27, 2025
87e47f8
Update 05-structured_outputs.md
kqyang-77 Nov 27, 2025
4e20f3b
Update 06-automatic_prefix_caching.md
kqyang-77 Nov 27, 2025
19bd6ff
Update 07-disagg_prefill.md
kqyang-77 Nov 27, 2025
d6b42fc
Update 08-spec_decode.md
kqyang-77 Nov 27, 2025
13d7ba9
Update 09-compatibility_matrix.md
kqyang-77 Nov 27, 2025
6f05777
Update 01-trl.md
kqyang-77 Nov 28, 2025
7f1878e
Update 02-rlhf.md
kqyang-77 Nov 28, 2025
1d1c5e2
Update 01-offline_inference.md
kqyang-77 Nov 28, 2025
b4b67d9
Update 02-openai_compatible_server.md
kqyang-77 Nov 28, 2025
2649ed3
Update 03-multimodal_inputs.md
kqyang-77 Nov 28, 2025
319a451
Update 04-distributed_serving_new.md
kqyang-77 Nov 28, 2025
385ddc1
Update tutorial link in metrics documentation
kqyang-77 Nov 28, 2025
e80730d
Update 06-engine_args.md
kqyang-77 Nov 28, 2025
768c531
Update tutorial link for vLLM introduction
kqyang-77 Nov 28, 2025
860e60f
Update usage stats tutorial link
kqyang-77 Nov 28, 2025
0c2517c
Update tutorial link for LangChain integration
kqyang-77 Nov 28, 2025
82b341e
Update tutorial link for LlamaIndex
kqyang-77 Nov 28, 2025
c1a0158
Update 01-docker.md
kqyang-77 Nov 28, 2025
6b1f567
Update tutorial link for vLLM deployment guide
kqyang-77 Nov 28, 2025
3abf4c4
Update 03-nginx.md
kqyang-77 Nov 28, 2025
42c8197
Update 01-bentoml.md
kqyang-77 Nov 28, 2025
4271bb1
Update 02-cerebrium.md
kqyang-77 Nov 28, 2025
077b034
Update 03-dstack.md
kqyang-77 Nov 28, 2025
4798d6f
Update 04-helm.md
kqyang-77 Nov 28, 2025
1d04e2e
Update 05-lws.md
kqyang-77 Nov 28, 2025
2a9247a
Update 06-modal.md
kqyang-77 Nov 28, 2025
cb6d785
Update 07-skypilot.md
kqyang-77 Nov 28, 2025
d44ea7c
Update 08-triton.md
kqyang-77 Nov 28, 2025
87ec48f
Update README.md
kqyang-77 Nov 28, 2025
b20ceee
Update 01-kserve.md
kqyang-77 Nov 28, 2025
5a82d96
Update 02-kubeai.md
kqyang-77 Nov 28, 2025
f679f31
Update 03-llamastack.md
kqyang-77 Nov 28, 2025
e18f33f
Update 04-llmaz.md
kqyang-77 Nov 28, 2025
3b7daaa
Update 05-production-stack.md
kqyang-77 Nov 28, 2025
8b689a6
Update README.md
kqyang-77 Nov 28, 2025
443a8fd
Update 01-optimization.md
kqyang-77 Nov 28, 2025
bc957b2
Update tutorial link in benchmarks documentation
kqyang-77 Nov 28, 2025
c95f17e
Update 01-arch_overview.md
kqyang-77 Nov 28, 2025
95ff1b0
Update 02-huggingface_integration.md
kqyang-77 Nov 28, 2025
a6d6f76
Update 03-plugin_system.md
kqyang-77 Nov 28, 2025
314ad3c
Update 04-paged_attention.md
kqyang-77 Nov 28, 2025
be25eb8
Update 05-mm_processing.md
kqyang-77 Nov 28, 2025
04ca058
Update 06-automatic_prefix_caching.md
kqyang-77 Nov 28, 2025
dbf8dcc
Update 07-multiprocessing.md
kqyang-77 Nov 28, 2025
e579501
Update 01-torch_compile.md
kqyang-77 Nov 28, 2025
8003224
Update 02-prefix_caching.md
kqyang-77 Nov 28, 2025
9ce9e07
Update 03-metrics.md
kqyang-77 Nov 28, 2025
39c7532
Update 01-overview.md
kqyang-77 Nov 28, 2025
a1de1d5
Update 02-profiling_index.md
kqyang-77 Nov 28, 2025
483bc46
Update 03-dockerfile.md
kqyang-77 Nov 28, 2025
e031e1f
Update 01-basic.md
kqyang-77 Nov 28, 2025
6691311
Update 02-registration.md
kqyang-77 Nov 28, 2025
208c5f8
Update 03-tests.md
kqyang-77 Nov 28, 2025
a06adf1
Update tutorial link for vLLM introduction
kqyang-77 Nov 28, 2025
31c78e3
Update README.md
kqyang-77 Nov 28, 2025
5601baf
Update 03-inference_params.md
kqyang-77 Nov 28, 2025
46311aa
Update 01-llm.md
kqyang-77 Nov 28, 2025
1ad3d79
Update 02-llm_inputs.md
kqyang-77 Nov 28, 2025
7e1bf82
Update README.md
kqyang-77 Nov 28, 2025
a2dc357
Update 01-llm_engine.md
kqyang-77 Nov 28, 2025
e431d8b
Update 02-async_llm_engine.md
kqyang-77 Nov 28, 2025
076dc2d
Update README.md
kqyang-77 Nov 28, 2025
5b82cc1
Update 01-inputs.md
kqyang-77 Nov 28, 2025
f42adf6
Update 02-parse.md
kqyang-77 Nov 28, 2025
df2896a
Update 01-inputs.md
kqyang-77 Nov 28, 2025
ddc0d3b
Update 03-processing.md
kqyang-77 Nov 28, 2025
fca9713
Update 04-profiling.md
kqyang-77 Nov 28, 2025
c278e97
Update 05-registry.md
kqyang-77 Nov 28, 2025
ea40033
Update README.md
kqyang-77 Nov 28, 2025
f429cd2
Update 01-interfaces_base.md
kqyang-77 Nov 28, 2025
80fbb61
Update 02-interfaces.md
kqyang-77 Nov 28, 2025
0569782
Update 03-adapters.md
kqyang-77 Nov 28, 2025
eebbc20
Update README.md
kqyang-77 Dec 1, 2025
4def7ee
Update links
kqyang-77 Dec 5, 2025
065de0a
Update README.md
kqyang-77 Dec 5, 2025
f64f5b0
Update README.md
kqyang-77 Dec 5, 2025
e34864b
Update README.md
kqyang-77 Dec 5, 2025
7420417
Update README.md
kqyang-77 Dec 5, 2025
253c930
Update README.md
kqyang-77 Dec 5, 2025
c92ad88
Update README.md
kqyang-77 Dec 5, 2025
bb33a55
Update README.md
kqyang-77 Dec 5, 2025
5e49482
Update 03-nginx.md
kqyang-77 Dec 5, 2025
e445b47
Update README.md
kqyang-77 Dec 5, 2025
ebd338c
Update README.md
kqyang-77 Dec 5, 2025
f5db294
Update README.md
kqyang-77 Dec 5, 2025
bce2d16
Update README.md
kqyang-77 Dec 5, 2025
1b5eb01
Update README.md
kqyang-77 Dec 5, 2025
e2664ae
Update README.md
kqyang-77 Dec 5, 2025
baabf7c
Merge pull request #1 from kqyang-77/master
yudi3ooo Dec 5, 2025
2 changes: 1 addition & 1 deletion docs/01-getting-started/01-installation/01-gpu.md
Original file line number Diff line number Diff line change
@@ -2,7 +2,7 @@
title: GPU
---

[\*在线运行 vLLM 入门教程:零基础分步指南](https://openbayes.com/console/public/tutorials/rXxb5fZFr29?utm_source=vLLM-CNdoc&utm_medium=vLLM-CNdoc-V1&utm_campaign=vLLM-CNdoc-V1-25ap)
[\*在线运行 vLLM 入门教程:零基础分步指南](https://app.hyper.ai/console/public/tutorials/rUwYsyhAIt3?utm_source=vLLM-CNdoc&utm_medium=vLLM-CNdoc-V1&utm_campaign=vLLM-CNdoc-V1-25ap)

vLLM 是一个支持如下 GPU 类型的 Python 库,根据您的 GPU 型号查看相应的说明。

25 changes: 11 additions & 14 deletions docs/01-getting-started/01-installation/02-cpu.md
@@ -2,7 +2,7 @@
title: CPU
---

[\*在线运行 vLLM 入门教程:零基础分步指南](https://openbayes.com/console/public/tutorials/rXxb5fZFr29?utm_source=vLLM-CNdoc&utm_medium=vLLM-CNdoc-V1&utm_campaign=vLLM-CNdoc-V1-25ap)
[\*在线运行 vLLM 入门教程:零基础分步指南](https://app.hyper.ai/console/public/tutorials/rUwYsyhAIt3?utm_source=vLLM-CNdoc&utm_medium=vLLM-CNdoc-V1&utm_campaign=vLLM-CNdoc-V1-25ap)

vLLM 是一个支持以下 CPU 变体的 Python 库。根据您的 CPU 类型查看厂商特定的说明:

@@ -11,7 +11,6 @@ vLLM 是一个支持以下 CPU 变体的 Python 库。根据您的 CPU 类型查
vLLM 初步支持在 x86 CPU 平台进行基础模型推理和服务,支持 FP32、FP16 和 BF16 数据类型。

> **注意**
>
> 此设备没有预编译的 wheel 包或镜像,您必须从源码构建 vLLM。

#### ARM AArch64
@@ -21,7 +20,6 @@ vLLM 已适配支持具备 NEON 指令集的 ARM64 CPU,基于最初为 x86 平
ARM CPU 后端当前支持 Float32、FP16 和 BFloat16 数据类型。

> **注意**
>
> 此设备没有预编译的 wheel 包或镜像,您必须从源码构建 vLLM。

#### Apple silicon
@@ -31,7 +29,6 @@ vLLM 对 macOS 上的 Apple 芯片提供实验性支持。目前用户需从源
macOS 的 CPU 实现当前支持 FP32 和 FP16 数据类型。

> **注意**
>
> 此设备没有预编译的 wheel 包或镜像,您必须从源码构建 vLLM。

#### IBM Z (S390X)
@@ -41,7 +38,6 @@ vLLM 对 IBM Z 平台上的 s390x 架构提供实验性支持。目前用户需
s390x 架构的 CPU 实现当前仅支持 FP32 数据类型。

> **注意**
>
> 此设备没有预编译的 wheel 包或镜像,您必须从源码构建 vLLM。

## 系统要求
@@ -54,7 +50,8 @@ s390x 架构的 CPU 实现当前仅支持 FP32 数据类型。
- 编译器:`gcc/g++ >= 12.3.0`(可选,推荐)
- 指令集架构 (ISA):AVX512(可选,推荐)

> **提示** >[Intel Extension for PyTorch (IPEX)](https://github.com/intel/intel-extension-for-pytorch)  通过最新特性优化扩展 PyTorch,可在 Intel 硬件上获得额外性能提升。
> **提示**
>[Intel Extension for PyTorch (IPEX)](https://github.com/intel/intel-extension-for-pytorch)  通过最新特性优化扩展 PyTorch,可在 Intel 硬件上获得额外性能提升。

#### ARM AArch64

@@ -270,10 +267,10 @@
$ docker run -it \

vLLM CPU 后端支持以下特性:

- 张量并行 (Tensor Parallel)
- 模型量化 (`INT8 W8A8`、`AWQ`、`GPTQ`)
- 分块预填充 (Chunked-prefill
- 前缀缓存 (Prefix-caching)
- 张量并行(Tensor Parallel)
- 模型量化(`INT8 W8A8`、`AWQ`、`GPTQ`)
- 分块预填充(Chunked-prefill)
- 前缀缓存(Prefix-caching)
- FP8-E5M2 KV 缓存

## 相关运行时环境变量
@@ -285,7 +282,7 @@

`VLLM_CPU_OMP_THREADS_BIND=0-31|32-63` 表示启用 2 个张量并行进程,rank0 的 32 个 OpenMP 线程绑定到 0-31 号核心,rank1 的线程绑定到 32-63 号核心。

- `VLLM_CPU_MOE_PREPACK` : 是否为 MoE 层使用预打包功能。该参数会传递给  `ipex.llm.modules.GatedMLPMOE` 。默认值为  `1` (启用)。在不支持的 CPU 上可能需要设置为  `0` (禁用)。
- `VLLM_CPU_MOE_PREPACK` : 是否为 MoE 层使用预打包功能。该参数会传递给 `ipex.llm.modules.GatedMLPMOE`。默认值为 `1`(启用)。在不支持的 CPU 上可能需要设置为 `0`(禁用)。
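
上述环境变量可以组合使用。下面是一个示意性的单实例配置(核心数量、缓存大小与模型名均为演示用的假设值,请按实际硬件调整):

```shell
# 示意性配置:假设机器有 32 个物理核心
export VLLM_CPU_KVCACHE_SPACE=40        # 为 KV 缓存预留 40 GiB(假设值)
export VLLM_CPU_OMP_THREADS_BIND=0-31   # OpenMP 线程绑定到 0-31 号核心
export VLLM_CPU_MOE_PREPACK=0           # 在不支持预打包的 CPU 上禁用 MoE 预打包
vllm serve facebook/opt-125m
```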

## 性能优化建议

@@ -306,7 +303,7 @@
export VLLM_CPU_OMP_THREADS_BIND=0-29
vllm serve facebook/opt-125m
```

- 在支持超线程的机器上使用 vLLM CPU 后端时,建议通过  `VLLM_CPU_OMP_THREADS_BIND`  将每个物理 CPU 核心只绑定一个 OpenMP 线程。在 16 逻辑核心 / 8 物理核心的超线程平台上:
- 在支持超线程的机器上使用 vLLM CPU 后端时,建议通过 `VLLM_CPU_OMP_THREADS_BIND` 将每个物理 CPU 核心只绑定一个 OpenMP 线程。在 16 逻辑核心/8 物理核心的超线程平台上:

```plain
$ lscpu -e # check the mapping between logical CPU cores and physical CPU cores
@@ -337,13 +334,13 @@
$ export VLLM_CPU_OMP_THREADS_BIND=0-7
$ python examples/offline_inference/basic/basic.py
```

- 在多插槽 NUMA 机器上使用 vLLM CPU 后端时,应注意通过  `VLLM_CPU_OMP_THREADS_BIND`  设置 CPU 核心,避免跨 NUMA 节点的内存访问。
- 在多插槽 NUMA 机器上使用 vLLM CPU 后端时,应注意通过 `VLLM_CPU_OMP_THREADS_BIND` 设置 CPU 核心,避免跨 NUMA 节点的内存访问。

## 其他注意事项

- CPU 后端与 GPU 后端有显著差异,因为 vLLM 架构最初是为 GPU 优化的。需要多项优化来提升其性能。
- 建议将 HTTP 服务组件与推理组件解耦。在 GPU 后端配置中,HTTP 服务和分词任务运行在 CPU 上,而推理运行在 GPU 上,这通常不会造成问题。但在基于 CPU 的环境中,HTTP 服务和分词可能导致显著的上下文切换和缓存效率降低。因此强烈建议分离这两个组件以获得更好的性能。
- 在启用 NUMA 的 CPU 环境中,内存访问性能可能受  [拓扑结构](https://github.com/intel/intel-extension-for-pytorch/blob/main/docs/tutorials/performance_tuning/tuning_guide.inc.md#non-uniform-memory-access-numa)  影响较大。对于 NUMA 架构,推荐两种优化方案:张量并行或数据并行。
- 在启用 NUMA 的 CPU 环境中,内存访问性能可能受[拓扑结构](https://github.com/intel/intel-extension-for-pytorch/blob/main/docs/tutorials/performance_tuning/tuning_guide.inc.md#non-uniform-memory-access-numa)影响较大。对于 NUMA 架构,推荐两种优化方案:张量并行或数据并行。

- 延迟敏感场景使用张量并行:遵循 GPU 后端设计,基于 NUMA 节点数量(例如双 NUMA 节点系统 TP=2)使用 Megatron-LM 的并行算法切分模型。随着  [CPU 上的 TP 功能](https://github.com/vllm-project/vllm/pull/6125#)  合并,张量并行已支持服务和离线推理。通常每个 NUMA 节点被视为一个 GPU 卡。以下是启用张量并行度为 2 的服务示例:
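
上述张量并行方案可以写成如下示意性命令(核心编号与模型名为演示用的假设值;注意绑定串中的 `|` 需要加引号,否则会被 shell 解释为管道):

```shell
# 示意性示例:双 NUMA 节点、每节点 32 个物理核心,TP=2
export VLLM_CPU_KVCACHE_SPACE=40
export VLLM_CPU_OMP_THREADS_BIND="0-31|32-63"   # rank0 与 rank1 各绑定一个 NUMA 节点
vllm serve facebook/opt-125m --tensor-parallel-size 2
```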

6 changes: 1 addition & 5 deletions docs/01-getting-started/01-installation/03-ai-accelerator.md
@@ -2,7 +2,7 @@
title: 其他 AI 加速器
---

[\*在线运行 vLLM 入门教程:零基础分步指南](https://openbayes.com/console/public/tutorials/rXxb5fZFr29?utm_source=vLLM-CNdoc&utm_medium=vLLM-CNdoc-V1&utm_campaign=vLLM-CNdoc-V1-25ap)
[\*在线运行 vLLM 入门教程:零基础分步指南](https://app.hyper.ai/console/public/tutorials/rUwYsyhAIt3?utm_source=vLLM-CNdoc&utm_medium=vLLM-CNdoc-V1&utm_campaign=vLLM-CNdoc-V1-25ap)

vLLM 是一个 Python 库,支持以下 AI 加速器。根据您的 AI 加速器类型查看供应商特定说明:

@@ -29,23 +29,20 @@ vLLM 是一个 Python 库,支持以下 AI 加速
您可能需要为 TPU 虚拟机提供额外的持久存储。更多信息请参阅 [Cloud TPU 数据存储选项](https://cloud.google.com/tpu/docs/storage-options)。

> **注意**
>
> 此设备没有预构建的 wheels,因此您必须使用预构建的 Docker 镜像或从源代码构建 vLLM。

#### Intel Gaudi

此节提供了在 Intel Gaudi 设备上运行 vLLM 的说明。

> **注意**
>
> 此设备没有预构建的 wheels 或镜像,因此您必须从源代码构建 vLLM。

#### AWS Neuron

vLLM 0.3.3 及以上版本支持通过 Neuron SDK 在 AWS Trainium/Inferentia 上进行模型推理和服务,并支持连续批处理。分页注意力 (Paged Attention) 和分块预填充 (Chunked Prefill) 功能目前正在开发中,即将推出。Neuron SDK 当前支持的数据类型为 FP16 和 BF16。

> **注意**
>
> 此设备没有预构建的 wheels 或镜像,因此您必须从源代码构建 vLLM。

## 环境要求
@@ -61,7 +58,6 @@ vLLM 0.3.3 及以上版本支持通过 Neuron SDK 在 AWS Trainium/Inferentia
您可以使用 [Cloud TPU API](https://cloud.google.com/tpu/docs/reference/rest) 或 [队列资源](https://cloud.google.com/tpu/docs/queued-resources) API 配置 Cloud TPU。本节展示如何使用队列资源 API 创建 TPU。有关使用 Cloud TPU API 的更多信息,请参阅 [使用 Create Node API 创建 Cloud TPU](https://cloud.google.com/tpu/docs/managing-tpus-tpu-vm#create-node-api)。队列资源允许您以队列方式请求 Cloud TPU 资源。当您请求队列资源时,请求会被添加到 Cloud TPU 服务维护的队列中。当请求的资源可用时,它将分配给您的 Google Cloud 项目供您独占使用。

> **注意**
>
> 在以下所有命令中,请将全大写的参数名称替换为适当的值。有关参数描述,请参阅参数描述表。
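
作为参考,一个通过队列资源 API 创建 TPU 的示意性命令如下(加速器类型与运行时版本为演示用的假设值,全大写参数需替换为实际值):

```shell
# 示意性命令:使用队列资源 API 申请 Cloud TPU
gcloud compute tpus queued-resources create QUEUED_RESOURCE_ID \
  --node-id TPU_NAME \
  --project PROJECT_ID \
  --zone ZONE \
  --accelerator-type v5litepod-4 \
  --runtime-version v2-alpha-tpuv5-lite
```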

#### 使用 GKE 配置 Cloud TPU
26 changes: 13 additions & 13 deletions docs/01-getting-started/01-installation/README.md
@@ -2,25 +2,25 @@
title: 安装
---

[\*在线运行 vLLM 入门教程:零基础分步指南](https://openbayes.com/console/public/tutorials/rXxb5fZFr29?utm_source=vLLM-CNdoc&utm_medium=vLLM-CNdoc-V1&utm_campaign=vLLM-CNdoc-V1-25ap)
[\*在线运行 vLLM 入门教程:零基础分步指南](https://app.hyper.ai/console/public/tutorials/rUwYsyhAIt3?utm_source=vLLM-CNdoc&utm_medium=vLLM-CNdoc-V1&utm_campaign=vLLM-CNdoc-V1-25ap)

vLLM 支持以下硬件平台:

## GPU
## [GPU](/docs/getting-started/installation/gpu)

- NVIDIA CUDA
- AMD ROCm
- Intel XPU
- [NVIDIA CUDA](/docs/getting-started/installation/gpu#nvidia-cuda)
- [AMD ROCm](/docs/getting-started/installation/gpu#amd-rocm)
- [Intel XPU](/docs/getting-started/installation/gpu#inter-xpu-1)

## CPU
## [CPU](/docs/getting-started/installation/cpu)

- Intel/AMD x86
- ARM AArch64
- Apple silicon
- [Intel/AMD x86](/docs/getting-started/installation/cpu#intelamd-x86)
- [ARM AArch64](/docs/getting-started/installation/cpu#arm-aarch64)
- [Apple silicon](/docs/getting-started/installation/cpu#apple-silicon)

## 其他 AI 加速器
## [其他 AI 加速器](/docs/getting-started/installation/ai-accelerator)

- Google TPU
- Intel Gaudi
- AWS Neuron
- [Google TPU](/docs/getting-started/installation/ai-accelerator#google-tpu-1)
- [Intel Gaudi](/docs/getting-started/installation/ai-accelerator#intel-gaudi-1)
- [AWS Neuron](/docs/getting-started/installation/ai-accelerator#aws-neuron-1)
- OpenVINO
2 changes: 1 addition & 1 deletion docs/01-getting-started/02-quickstart.md
@@ -2,7 +2,7 @@
title: 快速开始
---

[\*在线运行 vLLM 入门教程:零基础分步指南](https://openbayes.com/console/public/tutorials/rXxb5fZFr29?utm_source=vLLM-CNdoc&utm_medium=vLLM-CNdoc-V1&utm_campaign=vLLM-CNdoc-V1-25ap)
[\*在线运行 vLLM 入门教程:零基础分步指南](https://app.hyper.ai/console/public/tutorials/rUwYsyhAIt3?utm_source=vLLM-CNdoc&utm_medium=vLLM-CNdoc-V1&utm_campaign=vLLM-CNdoc-V1-25ap)

本指南将帮助您快速开始使用 vLLM 进行以下操作:

@@ -2,7 +2,7 @@
title: Audio Language
---

[\*在线运行 vLLM 入门教程:零基础分步指南](https://openbayes.com/console/public/tutorials/rXxb5fZFr29?utm_source=vLLM-CNdoc&utm_medium=vLLM-CNdoc-V1&utm_campaign=vLLM-CNdoc-V1-25ap)
[\*在线运行 vLLM 入门教程:零基础分步指南](https://app.hyper.ai/console/public/tutorials/rUwYsyhAIt3?utm_source=vLLM-CNdoc&utm_medium=vLLM-CNdoc-V1&utm_campaign=vLLM-CNdoc-V1-25ap)

源码 [examples/offline_inference/audio_language.py](https://github.com/vllm-project/vllm/blob/main/examples/offline_inference/audio_language.py)

@@ -2,7 +2,7 @@
title: 基础指南
---

[\*在线运行 vLLM 入门教程:零基础分步指南](https://openbayes.com/console/public/tutorials/rXxb5fZFr29?utm_source=vLLM-CNdoc&utm_medium=vLLM-CNdoc-V1&utm_campaign=vLLM-CNdoc-V1-25ap)
[\*在线运行 vLLM 入门教程:零基础分步指南](https://app.hyper.ai/console/public/tutorials/rUwYsyhAIt3?utm_source=vLLM-CNdoc&utm_medium=vLLM-CNdoc-V1&utm_campaign=vLLM-CNdoc-V1-25ap)

源码 [examples/offline_inference/basic](https://github.com/vllm-project/vllm/blob/main/examples/offline_inference/basic)

@@ -2,7 +2,7 @@
title: Chat With Tools
---

[\*在线运行 vLLM 入门教程:零基础分步指南](https://openbayes.com/console/public/tutorials/rXxb5fZFr29?utm_source=vLLM-CNdoc&utm_medium=vLLM-CNdoc-V1&utm_campaign=vLLM-CNdoc-V1-25ap)
[\*在线运行 vLLM 入门教程:零基础分步指南](https://app.hyper.ai/console/public/tutorials/rUwYsyhAIt3?utm_source=vLLM-CNdoc&utm_medium=vLLM-CNdoc-V1&utm_campaign=vLLM-CNdoc-V1-25ap)

源码 [examples/offline_inference/chat_with_tools.py](https://github.com/vllm-project/vllm/blob/main/examples/offline_inference/chat_with_tools.py)

@@ -2,7 +2,7 @@
title: Cpu Offload Lmcache
---

[\*在线运行 vLLM 入门教程:零基础分步指南](https://openbayes.com/console/public/tutorials/rXxb5fZFr29?utm_source=vLLM-CNdoc&utm_medium=vLLM-CNdoc-V1&utm_campaign=vLLM-CNdoc-V1-25ap)
[\*在线运行 vLLM 入门教程:零基础分步指南](https://app.hyper.ai/console/public/tutorials/rUwYsyhAIt3?utm_source=vLLM-CNdoc&utm_medium=vLLM-CNdoc-V1&utm_campaign=vLLM-CNdoc-V1-25ap)

源码 [examples/offline_inference/cpu_offload_lmcache.py](https://github.com/vllm-project/vllm/blob/main/examples/offline_inference/cpu_offload_lmcache.py)

@@ -2,7 +2,7 @@
title: Data Parallel
---

[\*在线运行 vLLM 入门教程:零基础分步指南](https://openbayes.com/console/public/tutorials/rXxb5fZFr29?utm_source=vLLM-CNdoc&utm_medium=vLLM-CNdoc-V1&utm_campaign=vLLM-CNdoc-V1-25ap)
[\*在线运行 vLLM 入门教程:零基础分步指南](https://app.hyper.ai/console/public/tutorials/rUwYsyhAIt3?utm_source=vLLM-CNdoc&utm_medium=vLLM-CNdoc-V1&utm_campaign=vLLM-CNdoc-V1-25ap)

源码 [examples/offline_inference/data_parallel.py](https://github.com/vllm-project/vllm/blob/main/examples/offline_inference/data_parallel.py)

@@ -2,7 +2,7 @@
title: Disaggregated Prefill Lmcache
---

[\*在线运行 vLLM 入门教程:零基础分步指南](https://openbayes.com/console/public/tutorials/rXxb5fZFr29?utm_source=vLLM-CNdoc&utm_medium=vLLM-CNdoc-V1&utm_campaign=vLLM-CNdoc-V1-25ap)
[\*在线运行 vLLM 入门教程:零基础分步指南](https://app.hyper.ai/console/public/tutorials/rUwYsyhAIt3?utm_source=vLLM-CNdoc&utm_medium=vLLM-CNdoc-V1&utm_campaign=vLLM-CNdoc-V1-25ap)

源码 [examples/offline_inference/disaggregated_prefill_lmcache.py](https://github.com/vllm-project/vllm/blob/main/examples/offline_inference/disaggregated_prefill_lmcache.py)

@@ -2,7 +2,7 @@
title: Disaggregated Prefill
---

[\*在线运行 vLLM 入门教程:零基础分步指南](https://openbayes.com/console/public/tutorials/rXxb5fZFr29?utm_source=vLLM-CNdoc&utm_medium=vLLM-CNdoc-V1&utm_campaign=vLLM-CNdoc-V1-25ap)
[\*在线运行 vLLM 入门教程:零基础分步指南](https://app.hyper.ai/console/public/tutorials/rUwYsyhAIt3?utm_source=vLLM-CNdoc&utm_medium=vLLM-CNdoc-V1&utm_campaign=vLLM-CNdoc-V1-25ap)

源码 [examples/offline_inference/disaggregated_prefill.py](https://github.com/vllm-project/vllm/blob/main/examples/offline_inference/disaggregated_prefill.py)

@@ -2,7 +2,7 @@
title: Distributed
---

[\*在线运行 vLLM 入门教程:零基础分步指南](https://openbayes.com/console/public/tutorials/rXxb5fZFr29?utm_source=vLLM-CNdoc&utm_medium=vLLM-CNdoc-V1&utm_campaign=vLLM-CNdoc-V1-25ap)
[\*在线运行 vLLM 入门教程:零基础分步指南](https://app.hyper.ai/console/public/tutorials/rUwYsyhAIt3?utm_source=vLLM-CNdoc&utm_medium=vLLM-CNdoc-V1&utm_campaign=vLLM-CNdoc-V1-25ap)

源码 [examples/offline_inference/distributed.py](https://github.com/vllm-project/vllm/blob/main/examples/offline_inference/distributed.py)

@@ -2,7 +2,7 @@
title: Eagle
---

[\*在线运行 vLLM 入门教程:零基础分步指南](https://openbayes.com/console/public/tutorials/rXxb5fZFr29?utm_source=vLLM-CNdoc&utm_medium=vLLM-CNdoc-V1&utm_campaign=vLLM-CNdoc-V1-25ap)
[\*在线运行 vLLM 入门教程:零基础分步指南](https://app.hyper.ai/console/public/tutorials/rUwYsyhAIt3?utm_source=vLLM-CNdoc&utm_medium=vLLM-CNdoc-V1&utm_campaign=vLLM-CNdoc-V1-25ap)

源码 [examples/offline_inference/eagle.py](https://github.com/vllm-project/vllm/blob/main/examples/offline_inference/eagle.py)

@@ -2,7 +2,7 @@
title: Encoder Decoder Multimodal
---

[\*在线运行 vLLM 入门教程:零基础分步指南](https://openbayes.com/console/public/tutorials/rXxb5fZFr29?utm_source=vLLM-CNdoc&utm_medium=vLLM-CNdoc-V1&utm_campaign=vLLM-CNdoc-V1-25ap)
[\*在线运行 vLLM 入门教程:零基础分步指南](https://app.hyper.ai/console/public/tutorials/rUwYsyhAIt3?utm_source=vLLM-CNdoc&utm_medium=vLLM-CNdoc-V1&utm_campaign=vLLM-CNdoc-V1-25ap)

源码 [examples/offline_inference/encoder_decoder_multimodal.py](https://github.com/vllm-project/vllm/blob/main/examples/offline_inference/encoder_decoder_multimodal.py)

@@ -2,7 +2,7 @@
title: Encoder Decoder
---

[\*在线运行 vLLM 入门教程:零基础分步指南](https://openbayes.com/console/public/tutorials/rXxb5fZFr29?utm_source=vLLM-CNdoc&utm_medium=vLLM-CNdoc-V1&utm_campaign=vLLM-CNdoc-V1-25ap)
[\*在线运行 vLLM 入门教程:零基础分步指南](https://app.hyper.ai/console/public/tutorials/rUwYsyhAIt3?utm_source=vLLM-CNdoc&utm_medium=vLLM-CNdoc-V1&utm_campaign=vLLM-CNdoc-V1-25ap)

源码 [examples/offline_inference/encoder_decoder.py](https://github.com/vllm-project/vllm/blob/main/examples/offline_inference/encoder_decoder.py)

@@ -2,7 +2,7 @@
title: Llm Engine Example
---

[\*在线运行 vLLM 入门教程:零基础分步指南](https://openbayes.com/console/public/tutorials/rXxb5fZFr29?utm_source=vLLM-CNdoc&utm_medium=vLLM-CNdoc-V1&utm_campaign=vLLM-CNdoc-V1-25ap)
[\*在线运行 vLLM 入门教程:零基础分步指南](https://app.hyper.ai/console/public/tutorials/rUwYsyhAIt3?utm_source=vLLM-CNdoc&utm_medium=vLLM-CNdoc-V1&utm_campaign=vLLM-CNdoc-V1-25ap)

源码 [examples/offline_inference/llm_engine_example.py](https://github.com/vllm-project/vllm/blob/main/examples/offline_inference/llm_engine_example.py)

@@ -2,7 +2,7 @@
title: Load Sharded State
---

[\*在线运行 vLLM 入门教程:零基础分步指南](https://openbayes.com/console/public/tutorials/rXxb5fZFr29?utm_source=vLLM-CNdoc&utm_medium=vLLM-CNdoc-V1&utm_campaign=vLLM-CNdoc-V1-25ap)
[\*在线运行 vLLM 入门教程:零基础分步指南](https://app.hyper.ai/console/public/tutorials/rUwYsyhAIt3?utm_source=vLLM-CNdoc&utm_medium=vLLM-CNdoc-V1&utm_campaign=vLLM-CNdoc-V1-25ap)

源码 [examples/offline_inference/load_sharded_state.py](https://github.com/vllm-project/vllm/blob/main/examples/offline_inference/load_sharded_state.py)

@@ -2,7 +2,7 @@
title: Lora With Quantization Inference
---

[\*在线运行 vLLM 入门教程:零基础分步指南](https://openbayes.com/console/public/tutorials/rXxb5fZFr29?utm_source=vLLM-CNdoc&utm_medium=vLLM-CNdoc-V1&utm_campaign=vLLM-CNdoc-V1-25ap)
[\*在线运行 vLLM 入门教程:零基础分步指南](https://app.hyper.ai/console/public/tutorials/rUwYsyhAIt3?utm_source=vLLM-CNdoc&utm_medium=vLLM-CNdoc-V1&utm_campaign=vLLM-CNdoc-V1-25ap)

源码 [examples/offline_inference/lora_with_quantization_inference.py](https://github.com/vllm-project/vllm/blob/main/examples/offline_inference/lora_with_quantization_inference.py)

@@ -2,7 +2,7 @@
title: Mistral-small
---

[\*在线运行 vLLM 入门教程:零基础分步指南](https://openbayes.com/console/public/tutorials/rXxb5fZFr29?utm_source=vLLM-CNdoc&utm_medium=vLLM-CNdoc-V1&utm_campaign=vLLM-CNdoc-V1-25ap)
[\*在线运行 vLLM 入门教程:零基础分步指南](https://app.hyper.ai/console/public/tutorials/rUwYsyhAIt3?utm_source=vLLM-CNdoc&utm_medium=vLLM-CNdoc-V1&utm_campaign=vLLM-CNdoc-V1-25ap)

源码 [examples/offline_inference/mistral-small.py](https://github.com/vllm-project/vllm/blob/main/examples/offline_inference/mistral-small.py)
