Features integration #5
Closed
The ROCm stack with PyTorch supports a wide set of gfx architectures; the set can be displayed by printing the PYTORCH_ROCM_ARCH env var. In the absence of PYTORCH_ROCM_ARCH, PyTorch uses the output of rocm_agent_enumerator to choose what to compile for. vllm supports a subset of these architectures (gfx908, gfx90a, ...). Because a single build may need to support multiple architectures at once (e.g. a Docker image), it's important that vllm is compiled for all of them unless specified otherwise. We now gather either the PYTORCH_ROCM_ARCH env var or the rocm_agent_enumerator output and cross-reference it with ROCM_SUPPORTED_ARCHS from vllm to generate the list of architectures to build for.
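For context, the selection logic roughly amounts to the sketch below; the exact contents of ROCM_SUPPORTED_ARCHS and the fallback details are assumptions, not the literal vllm build code.

```python
# Minimal sketch of the described arch selection; the supported set and
# fallback behavior here are assumptions, not the exact vllm code.
import os
import subprocess

ROCM_SUPPORTED_ARCHS = {"gfx908", "gfx90a", "gfx940", "gfx941", "gfx942"}

def get_rocm_arch_list() -> list[str]:
    # Prefer the user-provided env var; otherwise probe the local GPUs.
    env_arch = os.environ.get("PYTORCH_ROCM_ARCH")
    if env_arch:
        requested = set(env_arch.split(";"))  # semicolon-separated list
    else:
        out = subprocess.check_output(["rocm_agent_enumerator"], text=True)
        requested = set(out.split())
    # Build only for architectures vllm actually supports.
    return sorted(requested & ROCM_SUPPORTED_ARCHS)
```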
Add non-MI300 compatible alternative for bulk conversions
Removed bf8 (e5m2) and renamed f8 to fp8 to explicitly specify that it is e4m3
Removed stochastic rounding for simplicity
Put bulk fp8 conversion hip intrinsics behind a define, disabled by default
Using types from the proper vllm headers; added namespace
Move AMD-specific headers under amd_detail
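As a rough illustration of how such a compile-time switch might be wired up from the build script (both the env var and the define name here are hypothetical, chosen only for the example):

```python
# Hypothetical build-script snippet: the bulk fp8 HIP intrinsics compile in
# only when the define is passed explicitly; by default they stay disabled.
import os

extra_compile_args = []
if os.environ.get("VLLM_FP8_BULK_CONVERT", "0") == "1":
    # Hypothetical define guarding the bulk fp8 conversion intrinsics.
    extra_compile_args.append("-DENABLE_FP8_BULK_CONVERSION")
```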
Greg/fp8 tests
Reduce fp8 range in the conversion test to match e4m3
Add other MI300 architectures to the list
Simplify device guard use in the conversion kernel
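For reference, e4m3 has a much narrower dynamic range than e5m2 (max finite value 448 vs. 57344), so test inputs must stay inside the e4m3 representable range. A minimal sketch of such a range reduction (the test structure is an assumption; requires a torch build with float8 dtypes):

```python
import torch

# e4m3 tops out at 448.0, far below e5m2's 57344.0, so conversion-test
# inputs must be generated or clamped inside the e4m3 representable range.
E4M3_MAX = torch.finfo(torch.float8_e4m3fn).max  # 448.0

x = torch.randn(1024, dtype=torch.float32) * 1000.0
x = x.clamp(-E4M3_MAX, E4M3_MAX)  # keep values representable in e4m3
```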
Rename remaining fp8_e5m2 to general fp8
…m3-kvcache-rocm
Enable FP8 E4M3 KV Cache
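A hedged usage example, assuming the kv_cache_dtype knob matches upstream vllm's engine argument (the exact accepted value string may differ at this revision):

```python
from vllm import LLM

# Assumes EngineArgs exposes kv_cache_dtype and accepts an fp8 e4m3 value;
# the string "fp8_e4m3" is an assumption for this revision.
llm = LLM(model="facebook/opt-125m", kv_cache_dtype="fp8_e4m3")
print(llm.generate("Hello, my name is")[0].outputs[0].text)
```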
This is a bit of a hack, but Ray seems to have serious perf degradation when running multi-GPU latency benchmarks. Allow torch.distributed to be used when Ray is disabled, and make sure we connect via env-based ranking instead of a TCP/port-based rendezvous.
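Concretely, env-based rendezvous means initializing torch.distributed with init_method="env://" and per-process RANK/WORLD_SIZE variables rather than a tcp://host:port URL. A minimal sketch (launcher details assumed):

```python
import os
import torch.distributed as dist

# Each worker is launched with RANK, WORLD_SIZE, MASTER_ADDR and MASTER_PORT
# set in its environment (e.g. by torchrun), so no tcp://host:port URL is
# needed and Ray stays out of the loop entirely.
dist.init_process_group(
    backend="nccl",  # RCCL is exposed through the nccl backend on ROCm
    init_method="env://",
    rank=int(os.environ["RANK"]),
    world_size=int(os.environ["WORLD_SIZE"]),
)
```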
…ipt is supposed to be a quick correctness benchmark.
Add perplexity measurement functionality via benchmarks/measure_ppl_MC_small.py.
Added invocation examples to the description.
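The script itself isn't shown here, but a perplexity measurement of this kind generally reduces to exponentiating the mean negative log-likelihood over the evaluated tokens. A minimal sketch (not the actual measure_ppl_MC_small.py internals):

```python
import math

def perplexity(token_logprobs: list[float]) -> float:
    # PPL = exp(-(1/N) * sum(log p(token_i | context))); lower is better.
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

# Example: three tokens with log-probabilities from the model under test.
print(perplexity([-0.5, -1.2, -0.8]))  # ~2.30
```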
Added a flag for controlling the Triton vs. default flow. More small changes to the Dockerfile.
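A sketch of how such a toggle is typically read; the env var name VLLM_USE_TRITON_FLASH_ATTN matches later ROCm builds but is an assumption for this revision:

```python
import os

# Hypothetical toggle: choose the Triton attention kernel unless disabled.
use_triton = os.environ.get("VLLM_USE_TRITON_FLASH_ATTN", "1").lower() in ("1", "true")

def select_attention_backend() -> str:
    # In the real code these would dispatch to the Triton kernel or the
    # default attention path; here they are placeholder labels.
    return "triton" if use_triton else "default"

print(select_attention_backend())
```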
Putting together:
Custom ops and kernels from private v0.2.7_dllehr
fp8 kv_cache from fp8_kv (upstream PR vllm-project#3290)
Triton attention kernel from jpvillam/v0.3.3_triton