Features integration without fp8 #7
Conversation
The ROCm stack with PyTorch supports a wide set of gfx architectures. This can be displayed by printing the PYTORCH_ROCM_ARCH env var. In the absence of PYTORCH_ROCM_ARCH, PyTorch uses the output of rocm_agent_enumerator to choose what to compile for. vLLM supports a subset of these (gfx908, gfx90a, ...). Because we may need to support multiple architectures at once (e.g. a Docker image), it's important to make sure vLLM is compiled for all of them unless specified otherwise. We now gather either the PYTORCH_ROCM_ARCH env var or the rocm_agent_enumerator output and cross-reference it with ROCM_SUPPORTED_ARCHS from vLLM to generate the list of arches to build for.
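A minimal sketch of what that selection logic could look like at build time (the contents of ROCM_SUPPORTED_ARCHS below are assumed, and the exact lookup used in this PR's setup code may differ):

```python
# Sketch only: cross-reference requested ROCm arches against what vLLM supports.
# PYTORCH_ROCM_ARCH and rocm_agent_enumerator are the two sources described above.
import os
import subprocess

ROCM_SUPPORTED_ARCHS = {"gfx908", "gfx90a"}  # assumed; see vLLM's own constant

def get_rocm_arch_list():
    env_arch = os.environ.get("PYTORCH_ROCM_ARCH")
    if env_arch:
        # PyTorch uses a semicolon-separated list, e.g. "gfx908;gfx90a".
        requested = set(env_arch.split(";"))
    else:
        # Fall back to the arches visible on this machine.
        out = subprocess.check_output(["rocm_agent_enumerator"], text=True)
        requested = set(out.split())
    arches = requested & ROCM_SUPPORTED_ARCHS
    if not arches:
        raise RuntimeError(f"No supported ROCm arch found in {sorted(requested)}")
    return sorted(arches)
```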
This is a bit of a hack, but Ray seems to have some serious perf degradation when running multi-GPU latency benchmarks. Allow distributed execution when Ray is disabled, and make sure we connect via env-based ranking instead of a tcp/port address.
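A rough sketch of the env-ranking path being described (the env var names are the standard ones torchrun exports; the function itself is only illustrative, not the actual worker code in this PR):

```python
# Sketch: initialize torch.distributed from torchrun-provided env vars
# (RANK, WORLD_SIZE, LOCAL_RANK, MASTER_ADDR, MASTER_PORT) instead of a
# Ray-assigned tcp://host:port address.
import os
import torch
import torch.distributed as dist

def init_distributed_from_env():
    rank = int(os.environ["RANK"])
    world_size = int(os.environ["WORLD_SIZE"])
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)  # maps to a ROCm device under HIP builds
    dist.init_process_group(
        backend="nccl",            # RCCL on ROCm, exposed through the nccl backend
        init_method="env://",      # read MASTER_ADDR/MASTER_PORT from the environment
        rank=rank,
        world_size=world_size,
    )
```

Launched with something like `torchrun --standalone --nproc_per_node=8 benchmark_latency.py ...` rather than relying on Ray to place the workers.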
…ster execution. Also add reporting functionality for easy display
Co-authored-by: Vinayak Gokhale <vinayak.gokhale@amd.com>
Added flag for controlling Triton vs default flow. More small changes to the Dockerfile
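Purely as an illustration of the kind of gate this commit describes (the flag name below is hypothetical, not necessarily what the commit uses):

```python
# Hypothetical flag gate: pick the Triton attention kernel when the env var is
# set, otherwise fall back to the default flow. The actual flag name and call
# sites in this PR may differ.
import os

USE_TRITON_ATTN = os.environ.get("VLLM_USE_TRITON_FLASH_ATTN", "0") == "1"

def run_attention(query, key, value, *, triton_impl, default_impl):
    impl = triton_impl if USE_TRITON_ATTN else default_impl
    return impl(query, key, value)
```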
- Option to run multi-GPU using torchrun instead of Ray - should be the default in our internal branch, at least for isHip()
- Let's add Matt's layernorm changes
Once we have these in place, unless there are objections, we can ship it.
…den by --worker-use-ray
One more small ask: can we have a markdown doc under docs/ with all the flags we can configure / turn on and off? Most importantly tunable ops and running with it enabled (but also which attention implementation to use, Ray, etc.)
Ship it! We can write the markdown next week
Putting together (similar to #5 except without fp8 kv_cache):
- Custom ops and kernels from private v0.2.7_dllehr
- Triton attention kernel from jpvillam/v0.3.3_triton
- Option to run multi-GPU using torchrun instead of Ray