[AMDGPU] Add hot block register renaming pass #371
base: amd-staging
This patch introduces a post-allocation register renaming optimization pass that reduces value density in hot basic blocks. The pass helps the post-RA scheduler avoid false WAW dependencies by moving local values to unused physical registers.

The pass operates after greedy register allocation but before VirtRegRewriter. It identifies hot blocks (above frequency threshold), calculates value density per physical register, and selectively moves local live ranges to free registers. Only 32-bit VGPR values that live entirely within a single basic block are moved, ensuring conservative behavior.

Key features:
- Respects tied operands and register allocation constraints
- Honors occupancy-based VGPR limits to avoid spilling
- Disabled by default (enable with -amdgpu-enable-hot-block-reg-renaming)
- Includes comprehensive lit tests

Performance results show up to 2% improvement on register-intensive kernels such as rocRAND MTGP32.
- Rename canMoveValue to isVirtRegMovable for clarity
- Add assertions to verify single-value precondition
- Restore VRM->getPhys check: NOT redundant due to register aliasing (register units are shared between aliased registers like VGPR0 and VGPR0_VGPR1)
- Improve tied operand check to verify tied source register compatibility
This flips the default of -amdgpu-enable-hot-block-reg-renaming to true to exercise the pass across large CI/CT builds. This is a temporary enablement to flush out issues; users can still disable with -mllvm -amdgpu-enable-hot-block-reg-renaming=false.
[AMDGPU] Add hot block register renaming pass
Problem
A performance regression was observed in register-intensive kernels (e.g., rocRAND MTGP32) due to high register pressure in hot basic blocks. The greedy register allocator tends to reuse the same physical registers for multiple short-lived values within a basic block, which creates false WAW (Write-After-Write) dependencies. These false dependencies prevent the Post-RA scheduler from reordering instructions effectively, leading to suboptimal scheduling around barriers and memory operations.
Solution
This patch introduces a new post-allocation optimization pass (
Key Features
Algorithm
Technical Details
Pass Placement
The pass runs in the pre-rewrite phase, after greedy register allocation but before VirtRegRewriter:
Implementation
API Usage
The pass uses standard LLVM register allocation infrastructure:
Testing
Lit Tests
Three comprehensive test cases in
All tests pass with both legacy and new pass managers.
Regression Testing
Performance Results
Tested on rocRAND MTGP32 kernel (register-intensive workload):
Statistics (on MTGP32 kernel)
18 hot blocks processed
The most critical block (BB#31) had 34 values moved from 8 high-density registers to free registers, which allowed the Post-RA scheduler to better reorder instructions around barriers.
Future Work
Potential enhancements (not included in this patch):
Reviewers
Please review with focus on:
NB: The pass is enabled by default for testing purposes. DO NOT MERGE!
This patch introduces a post-allocation register renaming optimization pass that reduces value density in hot basic blocks. The pass helps the post-RA scheduler avoid false dependencies by moving local values to unused physical registers.
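To illustrate what a false dependency means here, the following is a minimal sketch in a toy model, not the code from this patch. `ToyInst` and `countWAWPairs` are hypothetical names introduced only for illustration; registers are plain integers rather than LLVM register objects. Each repeated write to a register inside a block adds a write-after-write ordering constraint that the scheduler has to honor.

```cpp
#include <cassert>
#include <cstddef>
#include <unordered_set>
#include <vector>

// Hypothetical toy instruction: it writes exactly one physical register,
// identified by a plain integer. This is NOT LLVM's MachineInstr.
struct ToyInst {
  int DefReg; // physical register written by this instruction
};

// Counts write-after-write (WAW) constraints in one block: every write to a
// register that was already written earlier in the block adds one.
std::size_t countWAWPairs(const std::vector<ToyInst> &Block) {
  std::unordered_set<int> Written;
  std::size_t Pairs = 0;
  for (const ToyInst &I : Block) {
    assert(I.DefReg >= 0 && "toy register ids are non-negative");
    // insert() reports whether the register was seen for the first time.
    if (!Written.insert(I.DefReg).second)
      ++Pairs;
  }
  return Pairs;
}
```

In this model, a block that writes register 0 three times and register 1 once carries two WAW constraints; if the second and third writes to register 0 are renamed to otherwise unused registers, the count drops to zero.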
The pass operates after greedy register allocation but before VirtRegRewriter. It identifies hot blocks (above frequency threshold), calculates value density per physical register, and selectively moves local live ranges to free registers. Only 32-bit VGPR values that live entirely within a single basic block are moved, ensuring conservative behavior.
Key features:
Performance results show up to 2% improvement on register-intensive kernels such as rocRAND MTGP32 on top of fixing the 5% regression.
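The selection step described above (value density per physical register, moving block-local values to free registers) can be sketched in a toy model as follows. This is an illustration under stated assumptions, not the implementation in this patch: `ToyValue`, `valueDensity`, `spreadValues`, and `RegLimit` are hypothetical names, registers are plain integers, `RegLimit` merely stands in for the occupancy-based VGPR limit, and the "more than one value" cutoff is an assumed density criterion. Live range interference, tied operands, and block frequency are deliberately ignored.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical toy value, not LLVM's LiveInterval/VirtRegMap.
struct ToyValue {
  int Reg;         // physical register currently assigned
  bool BlockLocal; // true if the value lives entirely inside the hot block
};

// Number of values assigned to each register in [0, NumRegs).
std::vector<std::size_t> valueDensity(const std::vector<ToyValue> &Values,
                                      int NumRegs) {
  std::vector<std::size_t> Density(NumRegs, 0);
  for (const ToyValue &V : Values) {
    assert(V.Reg >= 0 && V.Reg < NumRegs && "register outside the limit");
    ++Density[V.Reg];
  }
  return Density;
}

// Moves block-local values off registers holding more than one value onto
// registers holding none, never using a register at or above RegLimit.
// Returns the number of values moved.
std::size_t spreadValues(std::vector<ToyValue> &Values, int RegLimit) {
  std::vector<std::size_t> Density = valueDensity(Values, RegLimit);
  std::size_t Moved = 0;
  int NextFree = 0;
  for (ToyValue &V : Values) {
    if (!V.BlockLocal || Density[V.Reg] <= 1)
      continue; // only local values on crowded registers are candidates
    while (NextFree < RegLimit && Density[NextFree] != 0)
      ++NextFree;
    if (NextFree == RegLimit)
      break; // no free register left within the limit
    --Density[V.Reg];
    V.Reg = NextFree;
    ++Density[NextFree];
    ++Moved;
  }
  return Moved;
}
```

Stopping as soon as no free register remains below the limit mirrors the conservative intent stated in the description: the toy never takes a register beyond the budget, so a crowded register may stay crowded.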