
Features integration without fp8 #7

Merged: 39 commits into main from integration_no_fp8 on Mar 22, 2024
Conversation

@gshtras (Collaborator) commented Mar 21, 2024

Putting together (similar to #5, except without fp8 kv_cache):

• Custom ops and kernels from the private v0.2.7_dllehr branch
• Triton attention kernel from jpvillam/v0.3.3_triton
• Option to run multi-GPU using torchrun instead of Ray

dllehr-amd and others added 30 commits January 27, 2024 12:36
The ROCm stack with PyTorch supports a wide set of gfx architectures. These can be
displayed by printing the PYTORCH_ROCM_ARCH env variable. In the absence of PYTORCH_ROCM_ARCH,
PyTorch uses the output of rocm_agent_enumerator to choose what to compile for.

vLLM supports a subset of these (gfx908, gfx90a, ...).

Because we may need to support multiple architectures at once (e.g., in a Docker image),
it's important to make sure vLLM is compiled for all of them unless specified otherwise.

We now gather either the PYTORCH_ROCM_ARCH env variable or the rocm_agent_enumerator output,
cross-reference it with ROCM_SUPPORTED_ARCHS from vLLM, and generate the list of
arches to build for.
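
A minimal sketch of the arch-selection logic described above, assuming a build-time Python helper; the supported-arch set is an illustrative subset (the commit only names gfx908 and gfx90a) and the function name is not from this PR:

```python
import os
import subprocess

# Illustrative subset; the authoritative list is ROCM_SUPPORTED_ARCHS in vLLM's build files.
ROCM_SUPPORTED_ARCHS = {"gfx908", "gfx90a"}

def get_rocm_arch_list() -> list:
    """Return the gfx arch list to compile for, per the logic described above."""
    env_arch = os.environ.get("PYTORCH_ROCM_ARCH")
    if env_arch:
        # PYTORCH_ROCM_ARCH is typically a semicolon-separated list of gfx targets.
        requested = env_arch.replace(";", " ").split()
    else:
        # Fall back to the GPU agents rocm_agent_enumerator reports on this machine.
        out = subprocess.check_output(["rocm_agent_enumerator"], text=True)
        requested = out.split()
    # Cross-reference with what vLLM can actually build.
    return sorted(set(requested) & ROCM_SUPPORTED_ARCHS)
```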
This is a bit of a hack, but Ray seems to have some serious perf degradation when
running multi-GPU latency benchmarks.

Allow distributed execution when Ray is disabled, and make sure we connect
via env ranking instead of a TCP/port-based rendezvous.
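
A minimal sketch of what the env-ranking connection typically looks like with torch.distributed, not the PR's actual code: torchrun sets RANK, LOCAL_RANK, WORLD_SIZE, MASTER_ADDR, and MASTER_PORT for each worker, so the process group can be created with the env:// init method instead of an explicit tcp://host:port address.

```python
import os

import torch
import torch.distributed as dist

def init_distributed_from_env() -> None:
    """Join the process group using the rank/world-size variables set by torchrun."""
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)  # maps to a HIP device on ROCm builds
    dist.init_process_group(
        backend="nccl",        # RCCL under ROCm
        init_method="env://",  # read rank, world size, and master address from the env
    )
```

Launched, for example, with `torchrun --standalone --nproc_per_node=8 script.py`, which populates those variables for every local worker.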
…ster execution

Also add reporting functionality for easy display
Co-authored-by: Vinayak Gokhale <vinayak.gokhale@amd.com>
Added a flag for controlling the Triton vs default flow.
More small changes to the Dockerfile
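
The commit above adds a flag to switch between the Triton attention kernel and the default flow but does not name it; a minimal sketch of such a toggle, with a purely hypothetical environment-variable name:

```python
import os

def use_triton_attention() -> bool:
    """Return True when the Triton attention kernel path should be taken."""
    # Hypothetical variable name for illustration; the flag added in this PR may differ.
    return os.environ.get("VLLM_USE_TRITON_ATTENTION", "0").lower() in ("1", "true")
```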
@shajrawi (Collaborator) left a comment

• The option to run multi-GPU using torchrun instead of Ray should be the default in our internal branch, at least for is_hip() (see the sketch after this comment)
• Let's add Matt's layernorm changes

Once we have these in place, unless there are objections, we can ship it.
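
A minimal sketch of the default being asked for above, assuming vLLM's is_hip() helper and a worker_use_ray-style setting; both names come from context, not from this PR's diff:

```python
from typing import Optional

from vllm.utils import is_hip  # helper referenced in the comment above

def resolve_worker_use_ray(worker_use_ray: Optional[bool]) -> bool:
    """Pick the multi-GPU backend when the user did not choose explicitly."""
    if worker_use_ray is None:
        # Default to the torchrun/torch.distributed path on ROCm (HIP) builds.
        return not is_hip()
    return worker_use_ray
```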

@shajrawi (Collaborator) commented:

One more small ask: can we have a markdown doc under docs/ listing all the flags we can configure and turn on and off? Most importantly, tunable ops and how to run with them enabled, but also which attention implementation to use, Ray, etc.

@gshtras gshtras marked this pull request as ready for review March 22, 2024 21:01
@shajrawi (Collaborator) left a comment

Ship it! We can write the markdown next week

@gshtras gshtras merged commit 629f74b into main Mar 22, 2024
0 of 2 checks passed
@gshtras gshtras deleted the integration_no_fp8 branch March 25, 2024 20:45
gshtras pushed a commit that referenced this pull request Sep 27, 2024

5 participants