Features integration #5

Closed
wants to merge 190 commits into from

Conversation

@gshtras (Collaborator) commented Mar 19, 2024

Putting together:
- Custom ops and kernels from the private v0.2.7_dllehr branch
- fp8 kv_cache from fp8_kv (upstream PR vllm-project#3290)
- Triton attention kernel from jpvillam/v0.3.3_triton

dllehr-amd and others added 30 commits January 27, 2024 12:36
The ROCm stack with PyTorch supports a wide set of gfx architectures. These can be
inspected by printing the PYTORCH_ROCM_ARCH env variable. In the absence of
PYTORCH_ROCM_ARCH, PyTorch uses the output of rocm_agent_enumerator to choose what
to compile for.

vLLM supports a subset of these (gfx908, gfx90a, ...).

Since we may need to support multiple architectures at once (e.g. in a docker image),
it is important that vLLM is compiled for all of them unless specified otherwise.

We now gather either the PYTORCH_ROCM_ARCH env variable or the rocm_agent_enumerator
output and cross-reference it with vLLM's ROCM_SUPPORTED_ARCHS to generate the list
of architectures to build for.
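A minimal sketch of that cross-referencing step, assuming PYTORCH_ROCM_ARCH is a semicolon-separated list and that rocm_agent_enumerator prints one gfx target per line; the ROCM_SUPPORTED_ARCHS contents and the helper name are illustrative, not vLLM's actual build code:

```python
# Hypothetical helper mirroring the logic described in the commit above.
# ROCM_SUPPORTED_ARCHS here is an illustrative subset, not vLLM's actual list.
import os
import subprocess

ROCM_SUPPORTED_ARCHS = {"gfx908", "gfx90a", "gfx940", "gfx941", "gfx942"}

def rocm_arches_to_build():
    """Intersect the requested ROCm arches with the ones vLLM supports."""
    env_val = os.environ.get("PYTORCH_ROCM_ARCH")
    if env_val:
        # PyTorch expects a semicolon-separated list, e.g. "gfx90a;gfx942".
        requested = {a for a in env_val.split(";") if a}
    else:
        # rocm_agent_enumerator prints one gfx target per line.
        out = subprocess.check_output(["rocm_agent_enumerator"], text=True)
        requested = {line.strip() for line in out.splitlines() if line.strip()}
    arches = sorted(requested & ROCM_SUPPORTED_ARCHS)
    if not arches:
        raise RuntimeError(
            f"None of the requested arches {sorted(requested)} are supported")
    return arches
```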
Add non-MI300 compatible alternative for bulk conversions
Removed bf8 (e5m2) and renamed f8 to fp8 to explicitly specify that it is e4m3
Removed stochastic rounding for simplicity
Put bulk fp8 conversion HIP intrinsics behind a define. Disabled by default
Using types from the proper vllm headers. Added namespace
Move amd specific headers under amd_detail
Reduce fp8 range in the conversion test to match e4m3
Add other MI300 architectures to the list
Simplify device guard use in conversion kernel
Rename remaining fp8_e5m2 to general fp8
This is a bit of a hack, but Ray seems to show serious perf degradation when
running multi-GPU latency benchmarks.

Allow distributed to be used when Ray is disabled, and make sure we connect
via env-based ranking instead of a tcp/port-based init.
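A minimal sketch of env-based initialization with torch.distributed, assuming the standard MASTER_ADDR/MASTER_PORT/RANK/WORLD_SIZE variables; this is illustrative, not the PR's actual change:

```python
# Illustrative sketch only: bring up torch.distributed without Ray, using the
# standard env:// rendezvous (MASTER_ADDR/MASTER_PORT/RANK/WORLD_SIZE) instead
# of an explicit tcp://host:port init_method.
import os
import torch.distributed as dist

def init_process_group_from_env():
    # Defaults here are placeholders for a single-node run.
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    dist.init_process_group(
        backend="nccl",        # ROCm builds of PyTorch route "nccl" to RCCL
        init_method="env://",  # env ranking, not tcp/port based
        rank=int(os.environ.get("RANK", "0")),
        world_size=int(os.environ.get("WORLD_SIZE", "1")),
    )
```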
AdrianAbeyta and others added 26 commits March 15, 2024 19:20
…ipt is supposed to be a quick correctness benchmark.
Adding functionality with benchmarks/measure_ppl_MC_small.py.
Added invocation examples into the description.
Added flag for controlling Triton vs default flow.
More small changes to the Dockerfile
@gshtras gshtras closed this Mar 25, 2024
@gshtras gshtras deleted the integration branch March 25, 2024 20:44