Features integration #5

Closed
wants to merge 190 commits into from
190 commits
9c2367c
[ROCm] Fixup arch checks for ROCM
dllehr-amd Jan 27, 2024
a9d752c
yapf cleanup
dllehr-amd Jan 27, 2024
ad53f74
Add hip_fp8 datatype and conversions
gshtras Feb 5, 2024
9b1577a
Add 3rdparty quantizer utility and usage to quantize models (HF default)
HaiShaw Feb 6, 2024
644b165
Update 3rdparty quantizer utility and usage with ammo updates
HaiShaw Feb 7, 2024
0ed1d98
Use e4m3 and e5m2 interchangeably
gshtras Feb 7, 2024
18b2516
Using fp8 in any cache tests that could support it
gshtras Feb 7, 2024
81a6859
Integrate e4m3 alongside e5m2 and adapt cache tests
gshtras Feb 8, 2024
83089d0
Add gfx942 to the arch list
gshtras Feb 8, 2024
926e2b8
Less forgiving atol in fp8 tests
gshtras Feb 8, 2024
475a2ef
Merge pull request #1 from ROCm/greg/fp8_tests
HaiShaw Feb 9, 2024
1b8bc9f
enable fp8-e4m3 kv cache on rocm
Feb 9, 2024
777cc35
Rename remaining fp8_e5m2 to general fp8
gshtras Feb 9, 2024
a2897d6
Fix or comparisons
mawong-amd Feb 9, 2024
a06ddac
Add e4m3 to attention kernels
gshtras Feb 9, 2024
f358dcd
Remove remaining mentions of e5m2 where it refers to general fp8
gshtras Feb 9, 2024
4db0038
Address naming conventions
Feb 9, 2024
2e85bc7
Merge branch 'fp8-e4m3-kvcache-rocm' of https://github.com/ROCm/vllm-…
Feb 9, 2024
17a91a0
More verbose help message for fp8 cache type
gshtras Feb 9, 2024
4fbc915
Updated fp8 help text in additional files similar to arg_utils
gshtras Feb 9, 2024
7f5623d
Merge pull request #3 from ROCm/greg/tweaks
HaiShaw Feb 9, 2024
0da44bc
Merge branch 'fp8_kv' of https://github.com/ROCm/vllm-fp8 into fp8-e4…
Feb 9, 2024
2c525de
Fix merge conflict
Feb 9, 2024
54d1d4d
generalize fp8 convention
Feb 9, 2024
4a0d880
Merge pull request #2 from ROCm/fp8-e4m3-kvcache-rocm
HaiShaw Feb 9, 2024
7a9db00
Update log info and args description w.r.t. FP8 KV cache.
HaiShaw Feb 10, 2024
20b5f10
Initial port of gradlib gemm tuner
dllehr-amd Feb 10, 2024
6f28107
Enable torchrun vs Ray
dllehr-amd Feb 11, 2024
184806e
Add custom matvec kernels and sampler matmul call tuned_gemm
dllehr-amd Feb 11, 2024
0d59cfb
Initial skeleton; scaling factors in CacheEngine and PagedAttention
mawong-amd Feb 12, 2024
af9e9d1
Add silu gemm fusion when batch and seq_len = 1
dllehr-amd Feb 14, 2024
5f8eac3
Add tunable flags to VLLM
dllehr-amd Feb 14, 2024
22766b4
Allow benchmark_latency to take a list of input/output/batches for fa…
dllehr-amd Feb 14, 2024
87b4c1b
Add dynamic tuning feature to vllm
dllehr-amd Feb 15, 2024
694ae1d
Add rpd tracer controls to benchmark_latency.py
dllehr-amd Feb 15, 2024
eaf08ff
Initial conversion back to KV cache scales in model, using float scal…
mawong-amd Feb 15, 2024
0a80226
Completing KV cache scaling factors ingest (TP>1 todo), clean up code…
HaiShaw Feb 20, 2024
7b26ec9
Fix typos, add a few more sanity checks to the KV cache scales loader…
mawong-amd Feb 20, 2024
936821e
Add additional checks to the scaling factor loader and fail gracefull…
mawong-amd Feb 20, 2024
74b2b3f
Remove lingering PT fallback in extraction utility
mawong-amd Feb 20, 2024
2c7ce96
Add ROCm clarification to extract scales script
mawong-amd Feb 20, 2024
5985e25
Merge pull request #4 from ROCm/fp8_ingest_stage1_model
AdrianAbeyta Feb 20, 2024
763b283
Preliminary TP rank > 1 extraction and loading support
mawong-amd Feb 21, 2024
ef26716
Ensure loaded dictionary has same TP size as currently running engine
mawong-amd Feb 21, 2024
0e73aed
Fix Dockerfile errors
dllehr-amd Feb 15, 2024
90df0c9
Add llama2 run script
dllehr-amd Feb 16, 2024
ab67280
Increase Partition and Num threads for attention blocks
dllehr-amd Feb 16, 2024
1d53722
Fix WORKDIR
dllehr-amd Feb 22, 2024
5148aa5
Add accuracy flag to benchmark_latency.py
dllehr-amd Feb 22, 2024
c7e2587
Add tp_size argument for user to specify TP size to expect in quantiz…
mawong-amd Feb 23, 2024
61f2046
Add specific FP8 E4M3 and ROCm flavor text to the --quantized_model a…
mawong-amd Feb 23, 2024
dc71088
Small tweak on expected TP size flavor text for clarity
mawong-amd Feb 23, 2024
4cd76e9
Add output filename argument, rename output_path to output_dir, and c…
mawong-amd Feb 23, 2024
ad8b841
Fix up remaining 'output_path's from the rename
mawong-amd Feb 23, 2024
fec2232
Add scaling factor correction for ROCm FP8
mawong-amd Feb 23, 2024
7fdcf10
Add example output for extract_scales
Feb 23, 2024
6a6bbcd
Strip out download functionality in scale extraction utility
mawong-amd Feb 23, 2024
9336cdb
Merge pull request #5 from ROCm/fp8_ingest_stage1_model
AdrianAbeyta Feb 23, 2024
8e108d3
Correcting a stray type hint
mawong-amd Feb 23, 2024
31ebfa6
Merge branch 'fp8_kv' into fp8_ingest_scales_correction
mawong-amd Feb 23, 2024
4dd7d1e
Correct a stray type hint
mawong-amd Feb 23, 2024
9a03b96
Create README.md and add usage example
AdrianAbeyta Feb 23, 2024
4064973
Added benchmark description
AdrianAbeyta Feb 23, 2024
988ffc3
Clean up readme
AdrianAbeyta Feb 24, 2024
0650ae4
Merge pull request #6 from ROCm/fp8_ingest_scales_correction
AdrianAbeyta Feb 26, 2024
534dcff
Don't broadcast when using torchrun
dllehr-amd Feb 26, 2024
ee6ba29
Change convention: Initialize scaling factors always if KV cache is F…
mawong-amd Feb 26, 2024
4007656
Updated example descriptions
AdrianAbeyta Feb 26, 2024
6f2b248
Merge pull request #8 from ROCm/fp8_ingest_stage1_model
AdrianAbeyta Feb 26, 2024
0a45612
Merge pull request #7 from ROCm/fp8_doc
AdrianAbeyta Feb 26, 2024
c8059c2
Kernel and Device functions to enable FP8 KV cache scaling factors
HaiShaw Feb 27, 2024
fc2cdaf
Make KV cache scaling factors default to 1.0 instead of None
HaiShaw Feb 27, 2024
76c6058
Update KV cache scales loader name to clarify that we are not using a…
mawong-amd Feb 27, 2024
c825bb3
Fix test cases from the introduction of KV cache scaling factors, usi…
HaiShaw Feb 28, 2024
4f574cd
Cleanup comments according to reviews
HaiShaw Feb 29, 2024
86f06ca
Merge pull request #9 from ROCm/fp8_kv_cache
HaiShaw Feb 29, 2024
f325cb0
Add hip_fp8 datatype and conversions
gshtras Feb 5, 2024
4bb8dac
Add 3rdparty quantizer utility and usage to quantize models (HF default)
HaiShaw Feb 6, 2024
8b1279b
Update 3rdparty quantizer utility and usage with ammo updates
HaiShaw Feb 7, 2024
9c4226e
Use e4m3 and e5m2 interchangeably
gshtras Feb 7, 2024
30bba1c
Using fp8 in any cache tests that could support it
gshtras Feb 7, 2024
692f5ad
Integrate e4m3 alongside e5m2 and adapt cache tests
gshtras Feb 8, 2024
c9321a0
Add gfx942 to the arch list
gshtras Feb 8, 2024
4e1f89a
Less forgiving atol in fp8 tests
gshtras Feb 8, 2024
7bc2574
enable fp8-e4m3 kv cache on rocm
Feb 9, 2024
ebf7542
Address naming conventions
Feb 9, 2024
ad44055
Fix or comparisons
mawong-amd Feb 9, 2024
e5e0e7c
Rename remaining fp8_e5m2 to general fp8
gshtras Feb 9, 2024
c86b2ec
Add e4m3 to attention kernels
gshtras Feb 9, 2024
a432815
Remove remaining mentions of e5m2 where it refers to general fp8
gshtras Feb 9, 2024
4b77126
Updated fp8 help text in additional files similar to arg_utils
gshtras Feb 9, 2024
bbf6d49
generalize fp8 convention
Feb 9, 2024
4dfb26d
Update log info and args description w.r.t. FP8 KV cache.
HaiShaw Feb 10, 2024
0f492f5
Initial skeleton; scaling factors in CacheEngine and PagedAttention
mawong-amd Feb 12, 2024
030c9eb
Initial conversion back to KV cache scales in model, using float scal…
mawong-amd Feb 15, 2024
e00a86d
Completing KV cache scaling factors ingest (TP>1 todo), clean up code…
HaiShaw Feb 20, 2024
d3e98f3
Fix typos, add a few more sanity checks to the KV cache scales loader…
mawong-amd Feb 20, 2024
714e42c
Add additional checks to the scaling factor loader and fail gracefull…
mawong-amd Feb 20, 2024
3ff51b1
Remove lingering PT fallback in extraction utility
mawong-amd Feb 20, 2024
e97a31e
Add ROCm clarification to extract scales script
mawong-amd Feb 20, 2024
0ba975d
Add scaling factor correction for ROCm FP8
mawong-amd Feb 23, 2024
6991c59
Correcting a stray type hint
mawong-amd Feb 23, 2024
2292776
Preliminary TP rank > 1 extraction and loading support
mawong-amd Feb 21, 2024
221699b
Ensure loaded dictionary has same TP size as currently running engine
mawong-amd Feb 21, 2024
96a7546
Add tp_size argument for user to specify TP size to expect in quantiz…
mawong-amd Feb 23, 2024
9d08a92
Add specific FP8 E4M3 and ROCm flavor text to the --quantized_model a…
mawong-amd Feb 23, 2024
553209b
Small tweak on expected TP size flavor text for clarity
mawong-amd Feb 23, 2024
c852549
Add output filename argument, rename output_path to output_dir, and c…
mawong-amd Feb 23, 2024
e23379e
Fix up remaining 'output_path's from the rename
mawong-amd Feb 23, 2024
40171c9
Strip out download functionality in scale extraction utility
mawong-amd Feb 23, 2024
7666587
Correct a stray type hint
mawong-amd Feb 23, 2024
7c0bf6e
Change convention: Initialize scaling factors always if KV cache is F…
mawong-amd Feb 26, 2024
42e2aef
Add example output for extract_scales
Feb 23, 2024
d3dbb1a
Create README.md and add usage example
AdrianAbeyta Feb 23, 2024
f39839b
Added benchmark description
AdrianAbeyta Feb 23, 2024
2cfea65
Clean up readme
AdrianAbeyta Feb 24, 2024
257a7da
Updated example descriptions
AdrianAbeyta Feb 26, 2024
730562a
Update KV cache scales loader name to clarify that we are not using a…
mawong-amd Feb 27, 2024
f20eceb
Kernel and Device functions to enable FP8 KV cache scaling factors
HaiShaw Feb 27, 2024
8834917
Make KV cache scaling factors default to 1.0 instead of None
HaiShaw Feb 27, 2024
3187582
Fix test cases from the introduction of KV cache scaling factors, usi…
HaiShaw Feb 28, 2024
49df502
Cleanup comments according to reviews
HaiShaw Feb 29, 2024
12f7650
Remove load_dummy_kv_cache_scales as convention change in PR#9 render…
mawong-amd Mar 4, 2024
4a8d06c
Add back removal of gather cached kv kernel for use with FP8
Mar 5, 2024
b87aec1
Clean up IFU
Mar 5, 2024
fa6fbce
Clean up IFU
Mar 5, 2024
65f70d7
Schema change: preliminary changes to extract script, TODO: loading l…
mawong-amd Mar 6, 2024
f5c0236
Fix runtime issues with upstream rebase
Mar 6, 2024
d2a42f9
[Minor fix] The domain dns.google may cause a socket.gaierror excepti…
ttbachyinsda Mar 4, 2024
4b3e4b0
Preliminary refactoring: KV cache scales JSON into general scales JSO…
mawong-amd Mar 7, 2024
00f5113
Merge branch 'fp8_kv' into fp8_ingest_stage1_model
mawong-amd Mar 7, 2024
a7e6e81
Fixing stray syntax errors and typos, refactoring rank_keyword detection
mawong-amd Mar 7, 2024
ef85f98
Address reviewer comments
mawong-amd Mar 7, 2024
d8b2843
Address Greg's strong type checking :)
mawong-amd Mar 7, 2024
52df603
Add an additional TODO
mawong-amd Mar 7, 2024
b484112
Merge remote-tracking branch 'upstream/main' into IFU-2024-03-01-fp8-kv
Mar 7, 2024
7b72159
Merge pull request #16 from ROCm/fp8_ingest_stage1_model
AdrianAbeyta Mar 7, 2024
18c55d2
Fix OOM bug in quantize script, remove extraneous model_export
mawong-amd Mar 7, 2024
7d0fa2f
Fix rocm build conditions
Mar 7, 2024
e7db6af
Keep previous build flow for neuron
Mar 7, 2024
660dbb3
Merge remote-tracking branch 'origin/fp8_kv' into IFU-2024-03-01-fp8-kv
Mar 7, 2024
ca1b39c
Measure model memory usage (#3120)
mgoin Mar 7, 2024
fd6e57e
Possible fix for conflict between Automated Prefix Caching (#2762) an…
jacobthebanana Mar 7, 2024
90c2cd4
Update fp8 examples
Mar 7, 2024
2d520f2
Merge remote-tracking branch 'upstream/main' into IFU-2024-03-01-fp8-kv
Mar 8, 2024
9e6144a
[FIX] Make `flash_attn` optional (#3269)
WoosukKwon Mar 8, 2024
fd01e9a
Fix setup.py up to where it should be before the excitement of the la…
mawong-amd Mar 8, 2024
6edfbf1
Fix missing enable FP8_E4M3 flag and cherry pick newest load convention
mawong-amd Mar 8, 2024
be92918
Merge branch 'IFU-2024-03-01-fp8-kv' of https://github.com/ROCm/vllm-…
Mar 8, 2024
dd469df
Add model flag as example option
Mar 8, 2024
b3d81e0
Merge pull request #17 from ROCm/IFU-2024-03-01-fp8-kv
AdrianAbeyta Mar 8, 2024
5a2d747
Merge branch 'main' into fp8_kv
AdrianAbeyta Mar 11, 2024
267c847
Merge remote-tracking branch 'upstream/main' into fp8_kv
Mar 11, 2024
2f60ad7
Fix ruff syntax errors
Mar 13, 2024
31d96dd
Merge remote-tracking branch 'upstream/main' into fp8_kv
Mar 13, 2024
94c2e7c
Update model config for scales path
Mar 13, 2024
a350641
Add .rst for fp8_e4m3_kvcache and rename fp8_kvcache to fp8_e5m2
Mar 13, 2024
eb8e3d8
Skip fp8 UT test on CUDA for e4m3
Mar 14, 2024
f9eba0c
Fix device id formatting
gshtras Mar 14, 2024
db8f29c
Fix scales_path location
mawong-amd Mar 15, 2024
49d1593
Fix yapf formatting
mawong-amd Mar 15, 2024
e569133
Adding new rocm triton flash attention kernel
jpvillam-amd Mar 15, 2024
b5ebb41
Skipping certain cache tests when using fp8 cache with e5m2 type. The…
gshtras Mar 15, 2024
45d5912
Fix yapf ci error
Mar 15, 2024
8cb05bc
Merge remote-tracking branch 'origin/main' into v0.3.3_greg
gshtras Mar 18, 2024
be708d0
Removed gradlib and its tuned gemm in favor of tunable ops
gshtras Mar 18, 2024
a2311fd
Merge remote-tracking branch 'origin/fp8_kv' into v0.3.3_greg
gshtras Mar 18, 2024
8f266e8
Merge branch 'jpvillam/v0.3.3_triton' into v0.3.3_greg
gshtras Mar 18, 2024
1510843
Initializing scaling factors for kv cache in flash attention backend
gshtras Mar 18, 2024
3d82eea
Removed obsolete parts. Made rocm attention defaults define guarded
gshtras Mar 19, 2024
86062a5
Adding functionality with benchmarks/measure_ppl_MC_small.py. The scr…
Alexei-V-Ivanov-AMD Mar 19, 2024
6ce8e8f
Merge pull request #6 from ROCm/integration_alexei
gshtras Mar 19, 2024
ce7078d
Fix triton build condition
gshtras Mar 19, 2024
54d82e6
Merge branch 'integration' of github.com:ROCm/vllm into integration
gshtras Mar 19, 2024
0e63661
Small fix on dockerfile
jpvillam-amd Mar 19, 2024
c45547b
Update description of measure_ppl_MC_small.py
Alexei-V-Ivanov-AMD Mar 19, 2024
c89c0e3
Merge remote-tracking branch 'origin/main' into jpvillam/v0.3.3_triton
jpvillam-amd Mar 19, 2024
d4cb905
Rebase updates and PR review changes
jpvillam-amd Mar 19, 2024
9cb2bbd
Merge remote-tracking branch 'origin/jpvillam/v0.3.3_triton' into int…
gshtras Mar 20, 2024
9d96fdb
Introducing torchrun multi GPU support
gshtras Mar 20, 2024
51ce9f5
add use case for custom kernel for matvec operation
charlifu Mar 21, 2024
ec9a8c0
Merge branch 'greg/torchrun' into integration
gshtras Mar 21, 2024
bea2883
Merge branch 'integration' of github.com:ROCm/vllm into integration
gshtras Mar 21, 2024
47c560e
Remove ignored file
gshtras Mar 21, 2024
97e1978
limit the custom kernel under is_hip
charlifu Mar 21, 2024
ed96036
Fix parameter
gshtras Mar 21, 2024
42324b6
Unused import
gshtras Mar 21, 2024
9b1388c
fix custom kernel
charlifu Mar 21, 2024
6b186bb
Refactor torchrun executor to reuse single gpu executor code
gshtras Mar 21, 2024
6ff0272
Fixed mixed up values
gshtras Mar 22, 2024
1 change: 1 addition & 0 deletions .gitignore
@@ -181,6 +181,7 @@ _build/
# hip files generated by PyTorch
*.hip
*_hip*
hip_compat.h

# Benchmark dataset
*.json
32 changes: 32 additions & 0 deletions 3rdparty/README.md
@@ -0,0 +1,32 @@
### Quantizer Utilities
`quantizer/quantize.py`: NVIDIA quantization utilities using AMMO, ported from TensorRT-LLM:
`https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/quantization/quantize.py`

### Prerequisite

#### AMMO (Algorithmic Model Optimization) Installation: nvidia-ammo 0.7.1 or later
`pip install --no-cache-dir --extra-index-url https://pypi.nvidia.com nvidia-ammo`

#### AMMO Download (code and docs)
`https://developer.nvidia.com/downloads/assets/cuda/files/nvidia-ammo/nvidia_ammo-0.5.0.tar.gz`
`https://developer.nvidia.com/downloads/assets/cuda/files/nvidia-ammo/nvidia_ammo-0.7.1.tar.gz`

### Usage

#### Run on an H100 system for speed when quantizing to FP8; the number of GPUs needed depends on the model size

#### Example: quantize Llama2-7b model from HF to FP8 with FP8 KV Cache:
`python quantize.py --model_dir ./ll2-7b --dtype float16 --qformat fp8 --kv_cache_dtype fp8 --output_dir ./ll2_7b_fp8 --calib_size 512 --tp_size 1`

Outputs: the model structure, quantized model, and parameters (with scaling factors) are written as JSON and Safetensors (an npz file is generated only for reference)
```
# ll ./ll2_7b_fp8/
total 19998244
drwxr-xr-x 2 root root 4096 Feb 7 01:08 ./
drwxrwxr-x 8 1060 1061 4096 Feb 7 01:08 ../
-rw-r--r-- 1 root root 176411 Feb 7 01:08 llama_tp1.json
-rw-r--r-- 1 root root 13477087480 Feb 7 01:09 llama_tp1_rank0.npz
-rw-r--r-- 1 root root 7000893272 Feb 7 01:08 rank0.safetensors
#
```
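
The scaling factors captured in this output can then be gathered into a standalone JSON for vLLM's FP8 KV cache using the `extract_scales.py` utility added below. A minimal sketch of such an invocation follows; the argument names (`--quantized_model`, `--output_dir`, `--tp_size`) are inferred from the commit messages in this PR and should be checked against the script itself before use.
```
# Hypothetical follow-up step (flag names assumed, not verified against the script):
python 3rdparty/quantizer/extract_scales.py --quantized_model ./ll2_7b_fp8 --output_dir ./ll2_7b_fp8 --tp_size 1
```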

362 changes: 362 additions & 0 deletions 3rdparty/quantizer/extract_scales.py

Large diffs are not rendered by default.
