[Bugfix] Fix bitwise determinism after vLLM SiluAndMul change #2358
Conversation
```diff
 # Since these are parameter free we instantiate default config
 with set_current_vllm_config(VllmConfig()):
-    vllm_silu_and_mul = VLLMSiluAndMul()
+    vllm_silu_and_mul = VLLMSiluAndMul(compile_native=False)
```
Does `compile_native=False` mean we don't use `torch.compile` inside the custom op? Can you add a line of comment to explain why this field is needed? Thanks!
Updated :)
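For readers following along, a minimal sketch of the assumed semantics of `compile_native`: it toggles whether the op's native PyTorch definition is wrapped in `torch.compile` inside the custom op. This is an illustration, not vLLM's actual `SiluAndMul` implementation:

```python
import torch
import torch.nn.functional as F

class SiluAndMulSketch(torch.nn.Module):
    # Hedged sketch; the class name and structure are illustrative.
    def __init__(self, compile_native: bool = True):
        super().__init__()

        def native(x: torch.Tensor) -> torch.Tensor:
            # The op's reference math: silu(x[..., :d]) * x[..., d:]
            d = x.shape[-1] // 2
            return F.silu(x[..., :d]) * x[..., d:]

        # compile_native=False keeps the eager definition, so outputs stay
        # bitwise-identical to an uncompiled reference implementation.
        self.native = torch.compile(native) if compile_native else native

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.native(x)
```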
n00b q: Is the vllm_compat path using vLLM's compile mechanism, or does it apply compile manually? Or is compile not yet enabled?

> NOTE: if we enable compile in vLLM in the future, we need to compile this op as well.

I'm a little confused here, and want to understand at a high level how we should enable compile properly. IIUC there should be two ways of applying compile (sketched below):

- apply compile ourselves, e.g. with an `apply_compile()` function, and send the compiled model to vLLM (set `vllm.compilation_config.level = 0`);
- let the vLLM engine enable compile (`vllm.compilation_config.level = 3`) and decorate our model with the `@support_torch_compile` decorator.
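A minimal sketch of the two paths, under the assumption that this is how the pieces fit together; `apply_compile` and `MyModel` are hypothetical names, while `support_torch_compile` and `compilation_config.level` come from vLLM:

```python
import torch
from vllm.compilation.decorators import support_torch_compile

# Path 1 (hypothetical helper): run torch.compile ourselves and hand the
# compiled model to vLLM, with compilation_config.level = 0 so the engine
# does not compile it again.
def apply_compile(model: torch.nn.Module) -> torch.nn.Module:
    return torch.compile(model)

# Path 2: let the vLLM engine drive compilation (compilation_config.level
# = 3) and mark the model as compilable with vLLM's decorator.
@support_torch_compile
class MyModel(torch.nn.Module):  # hypothetical model
    def __init__(self, *, vllm_config, prefix: str = ""):
        super().__init__()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return torch.relu(x)
```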
Summary
vllm-project/vllm#32806 changed the behavior of `SiluAndMul` to use `torch.compile` inside the custom op. This causes a divergence between the vLLM definition and the torchtitan definition, causing the RL script to fail. We fix this by changing the torchtitan implementation to call through to the kernel used in vLLM, so the two are numerically equivalent.
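A minimal sketch of what "call through to the kernel used in vLLM" can look like, assuming the fused kernel is exposed as `vllm._custom_ops.silu_and_mul(out, x)`; the wrapper function here is illustrative, not the PR's actual code:

```python
import torch
from vllm import _custom_ops as ops

def silu_and_mul(x: torch.Tensor) -> torch.Tensor:
    # x has shape (..., 2 * d); the op computes silu(x[..., :d]) * x[..., d:].
    d = x.shape[-1] // 2
    out = torch.empty(x.shape[:-1] + (d,), dtype=x.dtype, device=x.device)
    # Invoking the identical kernel on both sides keeps vLLM and torchtitan
    # bitwise-identical, regardless of what torch.compile does to the native path.
    ops.silu_and_mul(out, x)
    return out
```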
Test Plan
Before (Main)
```
⚠ vLLM-TorchTitan logprobs differ: 59/100 tokens
Max delta: 3.650573e-01, Avg delta: 1.829216e-02
vLLM logprobs:       ['-0.0642131940', '-0.1617958397', '-0.0011243457', '-0.0051660384', '-0.2546415329']
TorchTitan logprobs: ['-0.0609663166', '-0.1456209123', '-0.0010027625', '-0.0048254938', '-0.2543271184']
```
After