From b6a72236abdd9dfb4be801231c97719edf26d556 Mon Sep 17 00:00:00 2001
From: t-parry <146764540+t-parry@users.noreply.github.com>
Date: Fri, 6 Dec 2024 10:16:02 -0800
Subject: [PATCH 1/2] Update README.md

Updates to model name and BKC version
---
 docs/dev-docker/README.md | 45 ++++++++++++++++++---------------------
 1 file changed, 21 insertions(+), 24 deletions(-)

diff --git a/docs/dev-docker/README.md b/docs/dev-docker/README.md
index ad08078936c8d..b44f989147adc 100644
--- a/docs/dev-docker/README.md
+++ b/docs/dev-docker/README.md
@@ -22,7 +22,7 @@ The performance data below was measured on a server with MI300X accelerators wit

 | System | MI300X with 8 GPUs |
 |---|---|
-| BKC | 24.11 |
+| BKC | 24.13 |
 | ROCm | version ROCm 6.2.2 |
 | amdgpu | build 2009461 |
 | OS | Ubuntu 22.04 |
@@ -41,7 +41,7 @@ The performance data below was measured on a server with MI300X accelerators wit

 ## Pull latest

-You can pull the image with `docker pull rocm/vllm-dev:20241114-tuned`
+You can pull the image with `docker pull rocm/vllm-dev:main`

 ### What is New

@@ -58,9 +58,9 @@ The gemms are automatically enabled in the docker image, and all stored gemm co

 To make it easier to run fp8 Llama 3.1 models on MI300X, the quantized checkpoints are available on AMD Huggingface space as follows

-- https://huggingface.co/amd/Meta-Llama-3.1-8B-Instruct-FP8-KV
-- https://huggingface.co/amd/Meta-Llama-3.1-70B-Instruct-FP8-KV
-- https://huggingface.co/amd/Meta-Llama-3.1-405B-Instruct-FP8-KV
+- https://huggingface.co/amd/Llama-3.1-8B-Instruct-FP8-KV
+- https://huggingface.co/amd/Llama-3.1-70B-Instruct-FP8-KV
+- https://huggingface.co/amd/Llama-3.1-405B-Instruct-FP8-KV
 - https://huggingface.co/amd/grok-1-FP8-KV

 Currently these models are private. Please join https://huggingface.co/amd to access.
@@ -72,7 +72,7 @@ These FP8 quantized checkpoints were generated with AMD’s Quark Quantizer. For

 ### Quantize your own models

 This step is optional for you to use quantized models on your own. Take Llama 3.1 405B as an example.
-Download the Model View the Meta-Llama-3.1-405B model at https://huggingface.co/meta-llama/Meta-Llama-3.1-405B. Ensure that you have been granted access, and apply for it if you do not have access.
+Download the Model View the Llama-3.1-405B model at https://huggingface.co/meta-llama/Llama-3.1-405B. Ensure that you have been granted access, and apply for it if you do not have access.

 If you do not already have a HuggingFace token, open your user profile (https://huggingface.co/settings/profile), select "Access Tokens", press "+ Create New Token", and create a new Read token.
@@ -92,18 +92,18 @@ Create the directory for Llama 3.1 models (if it doesn't already exist)

 Download the model

-    huggingface-cli download meta-llama/Meta-Llama-3.1-405B-Instruct --exclude "original/*" --local-dir /data/llama-3.1/Meta-Llama-3.1-405B-Instruct
+    huggingface-cli download meta-llama/Llama-3.1-405B-Instruct --exclude "original/*" --local-dir /data/llama-3.1/Llama-3.1-405B-Instruct

-Similarly, you can download Meta-Llama-3.1-70B and Meta-Llama-3.1-8B.
+Similarly, you can download Llama-3.1-70B and Llama-3.1-8B.
 [Download and install Quark](https://quark.docs.amd.com/latest/install.html)

 Run the quantization script in the example folder using the following command line:

-export MODEL_DIR = [local model checkpoint folder] or meta-llama/Meta-Llama-3.1-405B-Instruct
+export MODEL_DIR = [local model checkpoint folder] or meta-llama/Llama-3.1-405B-Instruct

 #### single GPU
 python3 quantize_quark.py \
     --model_dir $MODEL_DIR \
-    --output_dir Meta-Llama-3.1-405B-Instruct-FP8-KV \
+    --output_dir Llama-3.1-405B-Instruct-FP8-KV \
     --quant_scheme w_fp8_a_fp8 \
     --kv_cache_dtype fp8 \
     --num_calib_data 128 \
@@ -113,7 +113,7 @@ export MODEL_DIR = [local model checkpoint folder] or meta-llama/Meta-Llama-3.1-
 #### If model size is too large for single GPU, please use multi GPU instead.
 python3 quantize_quark.py \
     --model_dir $MODEL_DIR \
-    --output_dir Meta-Llama-3.1-405B-Instruct-FP8-KV \
+    --output_dir Llama-3.1-405B-Instruct-FP8-KV \
     --quant_scheme w_fp8_a_fp8 \
     --kv_cache_dtype fp8 \
     --num_calib_data 128 \
@@ -131,7 +131,7 @@ Download and launch the docker,
     --cap-add=CAP_SYS_ADMIN --cap-add=SYS_PTRACE \
     --device=/dev/kfd --device=/dev/dri --device=/dev/mem \
     -v /data/llama-3.1:/data/llm \
-    docker pull rocm/vllm-dev:20241114-tuned
+    rocm/vllm-dev:main

 ### Benchmark with AMD vLLM Docker

@@ -176,7 +176,7 @@ Below is a list of options which are useful:
 - **--max-seq-len-to-capture** : Maximum sequence length for which Hip-graphs are captured and utilized. It's recommended to use Hip-graphs for the best decode performance. The default value of this parameter is 8K, which is lower than the large context lengths supported by recent models such as LLama. Set this parameter to max-model-len or maximum context length supported by the model for best performance.
 - **--gpu-memory-utilization** : The ratio of GPU memory reserved by a vLLM instance. Default value is 0.9. It's recommended to set this to 0.99 to increase KV cache space.

-Note: vLLM's server creation command line (vllm serve) supports the above parameters as command line arguments. However, vLLM's benchmark_latency and benchmark_throughput command lines may not include all of these flags as command line arguments. In that case, it might be necessary to add these parameters to the LLMEngine instance constructor inside the benchmark script.
+Note: vLLM's server creation command line (vllm serve) supports the above parameters as command line arguments.

 ##### Online Gemm Tuning
 Online Gemm tuning for small decode batch sizes can improve performance in some cases. e.g. Llama 70B upto Batch size 8
@@ -268,16 +268,18 @@ If you want to run Meta-Llama-3.1-405B FP16, please run

     python /app/vllm/benchmarks/benchmark_throughput.py \
     --model /data/llm/Meta-Llama-3.1-405B-Instruct \
     --dtype float16 \
-    --gpu-memory-utilization 0.99 \
+    --gpu-memory-utilization 0.9 \
     --num-prompts 2000 \
     --distributed-executor-backend mp \
     --num-scheduler-steps 10 \
     --tensor-parallel-size 8 \
     --input-len 128 \
     --output-len 128 \
-    --swapspace 16 \
-    --max-model-length 8192 \
+    --swap-space 16 \
+    --max-model-len 8192 \
     --max-num-batched-tokens 65536 \
+    --swap-space
+    --max-model-len
     --gpu-memory-utilization 0.99

 For fp8 quantized Llama3.18B/70B models:
@@ -292,7 +294,7 @@ For more information about the parameters, please run

     /app/vllm/benchmarks/benchmark_throughput.py -h

-Tensor parallelism (TP) parameters depends on the model size. For Llama 3.1 70B and 8B model, TP 1 can be used as well for MI300X. In general, TP 8 and 1 is recommended to achieve the optimum performance.
+Tensor parallism (TP) parameters depends on the model size. For Llama 3.1 70B and 8B model, TP 1 can be used as well for MI300X. In general, TP 8 and 1 is recommended to achieve the optimum performance.

 ##### Online Server Benchmark
@@ -410,7 +412,7 @@ Please refer to the MLPerf instructions for recreating the MLPerf numbers.

 Updated:

-vLLM: https://github.com/ROCm/vllm/commit/5362727ec366c1542b2be7a520e7c44e5cc3ce30
+vLLM: https://github.com/ROCm/vllm/commit/2c60adc83981ada77a77b2adda78ef109d2e2e2b

 ### Docker Manifest
 To reproduce the release docker:
@@ -418,11 +420,6 @@ To reproduce the release docker:
 ```
 git clone https://github.com/ROCm/vllm.git
 cd vllm
-git checkout 5362727ec366c1542b2be7a520e7c44e5cc3ce30
+git checkout 2c60adc83981ada77a77b2adda78ef109d2e2e2b
 docker build -f Dockerfile.rocm -t --build-arg BUILD_HIPBLASLT=1 --build-arg USE_CYTHON=1 .
 ```
-
-For details on all the dependencies, please refer to: https://github.com/ROCm/vllm/blob/5362727ec366c1542b2be7a520e7c44e5cc3ce30/Dockerfile.rocm
-
-
-

From c264d993bbd7bdee8ab1c2c064aebacb4bc9f457 Mon Sep 17 00:00:00 2001
From: t-parry <146764540+t-parry@users.noreply.github.com>
Date: Fri, 6 Dec 2024 10:51:52 -0800
Subject: [PATCH 2/2] Update README.md

Fixed spelling error. Added llama 3.3 support under What is New section
---
 docs/dev-docker/README.md | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/docs/dev-docker/README.md b/docs/dev-docker/README.md
index b44f989147adc..9bc7e1f86f508 100644
--- a/docs/dev-docker/README.md
+++ b/docs/dev-docker/README.md
@@ -47,6 +47,7 @@ You can pull the image with `docker pull rocm/vllm-dev:main`

 - MoE optimizations for Mixtral 8x22B, FP16
 - Llama 3.2 stability improvements
+ - Llama 3.3 support

 Gemms are tuned using PyTorch's Tunable Ops feature (https://github.com/pytorch/pytorch/blob/main/aten/src/ATen/cuda/tunable/README.md)

@@ -294,7 +295,7 @@ For more information about the parameters, please run

     /app/vllm/benchmarks/benchmark_throughput.py -h

-Tensor parallism (TP) parameters depends on the model size. For Llama 3.1 70B and 8B model, TP 1 can be used as well for MI300X. In general, TP 8 and 1 is recommended to achieve the optimum performance.
+Tensor parallelism (TP) parameters depends on the model size. For Llama 3.1 70B and 8B model, TP 1 can be used as well for MI300X. In general, TP 8 and 1 is recommended to achieve the optimum performance.

 ##### Online Server Benchmark
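
For quick reference, the serving options discussed in these patches (vllm serve accepting the listed flags, --gpu-memory-utilization raised to 0.99, --max-seq-len-to-capture set to the model's context length, the FP8-KV checkpoints on the AMD Hugging Face space) can be combined into a single command inside the container. This is a minimal sketch only: the checkpoint name is taken from the list in the first patch, and the flag values are illustrative assumptions rather than tuned recommendations.

```
# Sketch: serve one of the FP8-KV checkpoints listed above on 8 MI300X GPUs.
# Model name and flag values are illustrative; adjust them per the README guidance.
vllm serve amd/Llama-3.1-70B-Instruct-FP8-KV \
    --tensor-parallel-size 8 \
    --quantization fp8 \
    --kv-cache-dtype fp8 \
    --gpu-memory-utilization 0.99 \
    --max-model-len 8192 \
    --max-seq-len-to-capture 8192 \
    --num-scheduler-steps 10 \
    --swap-space 16
```

Setting --max-seq-len-to-capture equal to --max-model-len follows the note in the first patch about keeping HIP graphs in use for decode.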
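
The "Online Gemm Tuning" note in the first patch relies on PyTorch's Tunable Ops feature, which is driven by environment variables described in the PyTorch README linked above. A minimal sketch of a tuning run, assuming those documented variable names and an arbitrary example path for the results file:

```
# Assumption: variable names follow the PyTorch TunableOp README linked in the patch;
# the CSV path below is an arbitrary example location.
export PYTORCH_TUNABLEOP_ENABLED=1                         # turn TunableOp on
export PYTORCH_TUNABLEOP_TUNING=1                          # search for GEMM solutions during this run
export PYTORCH_TUNABLEOP_FILENAME=/data/llm/tunableop.csv  # where tuned results are stored

# Run one of the benchmark_throughput.py or vllm serve commands shown above here.

# On later runs, reuse the stored solutions without re-tuning:
export PYTORCH_TUNABLEOP_TUNING=0
```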