Features integration without fp8 #7
Conversation
The ROCm stack with PyTorch supports a wide set of gfx architectures. This can be displayed by printing the PYTORCH_ROCM_ARCH env var. In the absence of PYTORCH_ROCM_ARCH, PyTorch uses the output of rocm_agent_enumerator to choose what to compile for. vLLM supports a subset of these (gfx908, gfx90a, ...). Because we may need to support multiple architectures at once (e.g. a Docker image), it's important to make sure vLLM is compiled for all of them unless specified otherwise. We now gather either the PYTORCH_ROCM_ARCH env var or the rocm_agent_enumerator output and cross-reference it with ROCM_SUPPORTED_ARCHS from vLLM to generate the list of arches to build for.
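A minimal sketch of what that selection logic could look like at build time (the contents of ROCM_SUPPORTED_ARCHS below are assumed, and the exact lookup used in this PR's setup code may differ):

```python
# Sketch only: cross-reference requested ROCm arches against what vLLM supports.
# PYTORCH_ROCM_ARCH and rocm_agent_enumerator are the two sources described above.
import os
import subprocess

ROCM_SUPPORTED_ARCHS = {"gfx908", "gfx90a"}  # assumed; see vLLM's own constant

def get_rocm_arch_list():
    env_arch = os.environ.get("PYTORCH_ROCM_ARCH")
    if env_arch:
        # PyTorch uses a semicolon-separated list, e.g. "gfx908;gfx90a".
        requested = set(env_arch.split(";"))
    else:
        # Fall back to the arches visible on this machine.
        out = subprocess.check_output(["rocm_agent_enumerator"], text=True)
        requested = set(out.split())
    arches = requested & ROCM_SUPPORTED_ARCHS
    if not arches:
        raise RuntimeError(f"No supported ROCm arch found in {sorted(requested)}")
    return sorted(arches)
```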
This is a bit of a hack, but Ray seems to have some serious perf degradation when running multi-GPU latency benchmarks. Allow distributed execution when Ray is disabled, and make sure we connect via env-based ranking instead of a tcp/port address.
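A rough sketch of the env-ranking path being described (the env var names are the standard ones torchrun exports; the function itself is only illustrative, not the actual worker code in this PR):

```python
# Sketch: initialize torch.distributed from torchrun-provided env vars
# (RANK, WORLD_SIZE, LOCAL_RANK, MASTER_ADDR, MASTER_PORT) instead of a
# Ray-assigned tcp://host:port address.
import os
import torch
import torch.distributed as dist

def init_distributed_from_env():
    rank = int(os.environ["RANK"])
    world_size = int(os.environ["WORLD_SIZE"])
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)  # maps to a ROCm device under HIP builds
    dist.init_process_group(
        backend="nccl",            # RCCL on ROCm, exposed through the nccl backend
        init_method="env://",      # read MASTER_ADDR/MASTER_PORT from the environment
        rank=rank,
        world_size=world_size,
    )
```

Launched with something like `torchrun --standalone --nproc_per_node=8 benchmark_latency.py ...` rather than relying on Ray to place the workers.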
…ster execution. Also add reporting functionality for easy display
Co-authored-by: Vinayak Gokhale <vinayak.gokhale@amd.com>
Added flag for controlling Triton vs default flow. More small changes to the Dockerfile
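Purely as an illustration of the kind of gate this commit describes (the flag name below is hypothetical, not necessarily what the commit uses):

```python
# Hypothetical flag gate: pick the Triton attention kernel when the env var is
# set, otherwise fall back to the default flow. The actual flag name and call
# sites in this PR may differ.
import os

USE_TRITON_ATTN = os.environ.get("VLLM_USE_TRITON_FLASH_ATTN", "0") == "1"

def run_attention(query, key, value, *, triton_impl, default_impl):
    impl = triton_impl if USE_TRITON_ATTN else default_impl
    return impl(query, key, value)
```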
- Option to run multi-GPU using torchrun instead of Ray - should be the default in our internal branch, at least for isHip()
- Let's add Matt's layernorm changes
Once we have these in place, unless there are objections, we can ship it.
…den by --worker-use-ray
One more small ask: can we have a markdown doc under docs/ with all the flags we can configure / turn on and off? Most importantly tunable ops and running with it enabled (but also which attention implementation to use, Ray, etc.)
Ship it! We can write the markdown next week
Putting together (similar to #5 except without fp8 kv_cache):
- Custom ops and kernels from private v0.2.7_dllehr
- Triton attention kernel from jpvillam/v0.3.3_triton
- Option to run multi-GPU using torchrun instead of Ray