Features integration #5
Closed
The ROCm stack with PyTorch supports a wide set of gfx architectures; the set can be displayed by printing the PYTORCH_ROCM_ARCH env var. In the absence of PYTORCH_ROCM_ARCH, PyTorch uses the output of rocm_agent_enumerator to choose what to compile for. vllm supports a subset of these architectures (gfx908, gfx90a, ...). Because a single build may need to support multiple architectures at once (e.g. a Docker image), it's important that vllm is compiled for all of them unless specified otherwise. We now gather either the PYTORCH_ROCM_ARCH env var or the rocm_agent_enumerator output and cross-reference it with ROCM_SUPPORTED_ARCHS from vllm to generate the list of architectures to build for.
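For context, the selection logic roughly amounts to the sketch below; the exact contents of ROCM_SUPPORTED_ARCHS and the fallback details are assumptions, not the literal vllm build code.

```python
# Minimal sketch of the described arch selection; the supported set and
# fallback behavior here are assumptions, not the exact vllm code.
import os
import subprocess

ROCM_SUPPORTED_ARCHS = {"gfx908", "gfx90a", "gfx940", "gfx941", "gfx942"}

def get_rocm_arch_list() -> list[str]:
    # Prefer the user-provided env var; otherwise probe the local GPUs.
    env_arch = os.environ.get("PYTORCH_ROCM_ARCH")
    if env_arch:
        requested = set(env_arch.split(";"))  # semicolon-separated list
    else:
        out = subprocess.check_output(["rocm_agent_enumerator"], text=True)
        requested = set(out.split())
    # Build only for architectures vllm actually supports.
    return sorted(requested & ROCM_SUPPORTED_ARCHS)
```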
Add non-MI300 compatible alternative for bulk conversions
Removed bf8 (e5m2) and renamed f8 to fp8 to explicitly specify that it is e4m3
Removed stochastic rounding for simplicity
Put bulk fp8 conversion hip intrinsics behind a define, disabled by default
Using types from the proper vllm headers; added namespace
Move AMD-specific headers under amd_detail
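As a rough illustration of how such a compile-time switch might be wired up from the build script (both the env var and the define name here are hypothetical, chosen only for the example):

```python
# Hypothetical build-script snippet: the bulk fp8 HIP intrinsics compile in
# only when the define is passed explicitly; by default they stay disabled.
import os

extra_compile_args = []
if os.environ.get("VLLM_FP8_BULK_CONVERT", "0") == "1":
    # Hypothetical define guarding the bulk fp8 conversion intrinsics.
    extra_compile_args.append("-DENABLE_FP8_BULK_CONVERSION")
```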
Greg/fp8 tests
Reduce fp8 range in the conversion test to match e4m3
Add other MI300 architectures to the list
Simplify device guard use in the conversion kernel
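For reference, e4m3 has a much narrower dynamic range than e5m2 (max finite value 448 vs. 57344), so test inputs must stay inside the e4m3 representable range. A minimal sketch of such a range reduction (the test structure is an assumption; requires a torch build with float8 dtypes):

```python
import torch

# e4m3 tops out at 448.0, far below e5m2's 57344.0, so conversion-test
# inputs must be generated or clamped inside the e4m3 representable range.
E4M3_MAX = torch.finfo(torch.float8_e4m3fn).max  # 448.0

x = torch.randn(1024, dtype=torch.float32) * 1000.0
x = x.clamp(-E4M3_MAX, E4M3_MAX)  # keep values representable in e4m3
```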
Rename remaining fp8_e5m2 to general fp8
…m3-kvcache-rocm
Enable FP8 E4M3 KV Cache
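A hedged usage example, assuming the kv_cache_dtype knob matches upstream vllm's engine argument (the exact accepted value string may differ at this revision):

```python
from vllm import LLM

# Assumes EngineArgs exposes kv_cache_dtype and accepts an fp8 e4m3 value;
# the string "fp8_e4m3" is an assumption for this revision.
llm = LLM(model="facebook/opt-125m", kv_cache_dtype="fp8_e4m3")
print(llm.generate("Hello, my name is")[0].outputs[0].text)
```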
This is a bit of a hack, but Ray seems to have serious perf degradation when running multi-GPU latency benchmarks. Allow torch.distributed to be used when Ray is disabled, and make sure we connect via env-based ranking instead of a TCP/port-based rendezvous.
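Concretely, env-based rendezvous means initializing torch.distributed with init_method="env://" and per-process RANK/WORLD_SIZE variables rather than a tcp://host:port URL. A minimal sketch (launcher details assumed):

```python
import os
import torch.distributed as dist

# Each worker is launched with RANK, WORLD_SIZE, MASTER_ADDR and MASTER_PORT
# set in its environment (e.g. by torchrun), so no tcp://host:port URL is
# needed and Ray stays out of the loop entirely.
dist.init_process_group(
    backend="nccl",  # RCCL is exposed through the nccl backend on ROCm
    init_method="env://",
    rank=int(os.environ["RANK"]),
    world_size=int(os.environ["WORLD_SIZE"]),
)
```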
…ipt is supposed to be a quick correctness benchmark.
Add perplexity measurement functionality via benchmarks/measure_ppl_MC_small.py.
Added invocation examples to the description.
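The script itself isn't shown here, but a perplexity measurement of this kind generally reduces to exponentiating the mean negative log-likelihood over the evaluated tokens. A minimal sketch (not the actual measure_ppl_MC_small.py internals):

```python
import math

def perplexity(token_logprobs: list[float]) -> float:
    # PPL = exp(-(1/N) * sum(log p(token_i | context))); lower is better.
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

# Example: three tokens with log-probabilities from the model under test.
print(perplexity([-0.5, -1.2, -0.8]))  # ~2.30
```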
Added a flag for controlling the Triton vs. default flow. More small changes to the Dockerfile.
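A sketch of how such a toggle is typically read; the env var name VLLM_USE_TRITON_FLASH_ATTN matches later ROCm builds but is an assumption for this revision:

```python
import os

# Hypothetical toggle: choose the Triton attention kernel unless disabled.
use_triton = os.environ.get("VLLM_USE_TRITON_FLASH_ATTN", "1").lower() in ("1", "true")

def select_attention_backend() -> str:
    # In the real code these would dispatch to the Triton kernel or the
    # default attention path; here they are placeholder labels.
    return "triton" if use_triton else "default"

print(select_attention_backend())
```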
Putting together:
Custom ops and kernels from private v0.2.7_dllehr
fp8 kv_cache from fp8_kv (upstream PR vllm-project#3290)
Triton attention kernel from jpvillam/v0.3.3_triton