[AMDGPU] Add hot block register renaming pass #371

michaelselehov · 2025-10-24T13:59:48Z

NB: The pass is enabled by default for testing purposes. DO NOT MERGE!

This patch introduces a post-allocation register renaming optimization pass that reduces value density in hot basic blocks. The pass helps the post-RA scheduler avoid false dependencies by moving local values to unused physical registers.

The pass operates after greedy register allocation but before VirtRegRewriter. It identifies hot blocks (above frequency threshold), calculates value density per physical register, and selectively moves local live ranges to free registers. Only 32-bit VGPR values that live entirely within a single basic block are moved, ensuring conservative behavior.

Key features:

- Respects tied operands and register allocation constraints
- Honors occupancy-based VGPR limits to avoid spilling
- Disabled by default in the intended final form (enable with -amdgpu-enable-hot-block-reg-renaming); this PR temporarily flips the default to on for testing
- Includes comprehensive lit tests

Performance results show up to 2% improvement on register-intensive kernels such as rocRAND MTGP32 on top of fixing the 5% regression.

This patch introduces a post-allocation register renaming optimization pass that reduces value density in hot basic blocks. The pass helps the post-RA scheduler avoid false WAW dependencies by moving local values to unused physical registers.

The pass operates after greedy register allocation but before VirtRegRewriter. It identifies hot blocks (above frequency threshold), calculates value density per physical register, and selectively moves local live ranges to free registers. Only 32-bit VGPR values that live entirely within a single basic block are moved, ensuring conservative behavior.

Key features:

- Respects tied operands and register allocation constraints
- Honors occupancy-based VGPR limits to avoid spilling
- Disabled by default (enable with -amdgpu-enable-hot-block-reg-renaming)
- Includes comprehensive lit tests

Performance results show up to 2% improvement on register-intensive kernels such as rocRAND MTGP32.

- Rename canMoveValue to isVirtRegMovable for clarity
- Add assertions to verify single-value precondition
- Restore VRM->getPhys check: NOT redundant due to register aliasing (register units are shared between aliased registers like VGPR0 and VGPR0_VGPR1)
- Improve tied operand check to verify tied source register compatibility
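The aliasing point in the commit note above can be illustrated with a toy model. This is only a sketch of one plausible reading, not the code from the patch: the `Model` struct, `assign`, `countValues`, and `makeExampleModel` are hypothetical names, and the real pass works with LLVM's VirtRegMap and LiveRegMatrix rather than string-keyed maps. The assumption is that values are discovered by walking the register units of a physical register, so values assigned to an aliased wider register show up on the same unit.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>
#include <vector>

// Hypothetical toy model (not the LLVM data structures).
using RegUnit = int;
struct Model {
  std::map<std::string, std::vector<RegUnit>> UnitsOf; // phys reg -> its units
  std::map<int, std::string> AssignedPhys;             // vreg -> assigned phys reg
  std::map<RegUnit, std::vector<int>> VRegsOnUnit;     // unit -> vregs occupying it
};

// Record an assignment and mark every unit of the physical register as used.
void assign(Model &M, int VReg, const std::string &Phys) {
  M.AssignedPhys[VReg] = Phys;
  for (RegUnit U : M.UnitsOf.at(Phys))
    M.VRegsOnUnit[U].push_back(VReg);
}

// Count distinct values found on the units of Phys. With FilterByAssignedPhys,
// only values whose assigned register is exactly Phys are counted.
int countValues(const Model &M, const std::string &Phys,
                bool FilterByAssignedPhys) {
  std::set<int> Seen;
  for (RegUnit U : M.UnitsOf.at(Phys)) {
    auto It = M.VRegsOnUnit.find(U);
    if (It == M.VRegsOnUnit.end())
      continue;
    for (int VReg : It->second)
      if (!FilterByAssignedPhys || M.AssignedPhys.at(VReg) == Phys)
        Seen.insert(VReg);
  }
  return static_cast<int>(Seen.size());
}

// VGPR0 and the pair VGPR0_VGPR1 share unit 0.
Model makeExampleModel() {
  Model M;
  M.UnitsOf = {{"VGPR0", {0}}, {"VGPR1", {1}}, {"VGPR0_VGPR1", {0, 1}}};
  assign(M, /*VReg=*/1, "VGPR0");
  assign(M, /*VReg=*/2, "VGPR0_VGPR1");
  return M;
}
```

In this model, walking the units of VGPR0 without the filter also picks up the value assigned to VGPR0_VGPR1, which is why a check against the exact assigned register is not redundant.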

This flips the default of -amdgpu-enable-hot-block-reg-renaming to true to exercise the pass across large CI/CT builds. This is a temporary enablement to flush out issues; users can still disable with -mllvm -amdgpu-enable-hot-block-reg-renaming=false.

z1-cciauto · 2025-10-24T14:01:03Z

PSDB Link: https://compiler-ci.amd.com/job/compiler-psdb-amd-staging/2425

michaelselehov · 2025-10-24T14:09:53Z

[AMDGPU] Add hot block register renaming pass

Problem

A performance regression was observed in register-intensive kernels (e.g., rocRAND MTGP32) due to high register pressure in hot basic blocks. The greedy register allocator tends to reuse the same physical registers for multiple short-lived values within a basic block, which creates false WAW (Write-After-Write) dependencies. These false dependencies prevent the Post-RA scheduler from reordering instructions effectively, leading to suboptimal scheduling around barriers and memory operations.
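To make the false-dependency problem concrete, here is a minimal self-contained sketch. It is not taken from the patch: `Inst`, `Dep`, `classify`, and `twoChains` are hypothetical names, and registers are plain integers. It models two independent computation chains and shows how the choice of register for the second chain's temporary decides whether the scheduler sees an ordering constraint between them.

```cpp
#include <cassert>
#include <vector>

// Hypothetical simplified instruction: one written register, some read registers.
struct Inst {
  int Def;
  std::vector<int> Uses;
};

enum class Dep { None, RAW, WAR, WAW };

// Classify the ordering constraint that forces Later to stay after Earlier.
Dep classify(const Inst &Earlier, const Inst &Later) {
  for (int U : Later.Uses)
    if (U == Earlier.Def)
      return Dep::RAW; // true dependency: Later reads what Earlier wrote
  if (Later.Def == Earlier.Def)
    return Dep::WAW;   // both write the same register
  for (int U : Earlier.Uses)
    if (U == Later.Def)
      return Dep::WAR; // Later overwrites a register Earlier still reads
  return Dep::None;
}

// Two independent chains. SecondReg is the register picked for the second
// chain's temporary: 0 models reuse of v0, any other value models renaming.
std::vector<Inst> twoChains(int SecondReg) {
  return {
      {0, {10}},         // I0: v0 = op v10
      {5, {0}},          // I1: v5 = op v0
      {SecondReg, {11}}, // I2: vN = op v11
      {6, {SecondReg}},  // I3: v6 = op vN
  };
}
```

With `twoChains(0)` the second chain is pinned behind the first by a WAW edge (I0 to I2) and a WAR edge (I1 to I2), although no data flows between them. With `twoChains(1)` only the true RAW edges inside each chain remain, so the two chains can be interleaved freely.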

Solution

This patch introduces a new post-allocation optimization pass (AMDGPUHotBlockRegisterRenaming) that reduces value density in hot basic blocks by remapping local live ranges to unused physical registers.

Key Features

Conservative approach: Only moves values that:
- Live entirely within a single basic block (local values)
- Are 32-bit VGPR values (no register pairs or wide registers)
- Have no register allocation hints
- Have no tied operands (def-use constraints)
Respects constraints:
- Honors occupancy-based VGPR limits to avoid spilling
- Checks for tied operands to prevent breaking instruction constraints
- Preserves all register allocation decisions for cross-block values
Disabled by default: Enabled only with -amdgpu-enable-hot-block-reg-renaming flag
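The eligibility rules listed under "Conservative approach" amount to a simple conjunction. The sketch below is illustrative only: `ValueInfo` and `isMovableCandidate` are hypothetical names, and the actual pass derives these facts from LLVM's register allocation data structures rather than from a flat struct.

```cpp
#include <cassert>

// Hypothetical summary of the allocation facts for one virtual register.
struct ValueInfo {
  bool IsVGPR;          // allocated to a vector general-purpose register
  unsigned SizeInBits;  // register width
  bool LocalToBlock;    // whole live range sits inside one basic block
  bool HasAllocHint;    // an allocation hint was recorded for this value
  bool HasTiedOperand;  // participates in a tied def-use constraint
};

// A value is a candidate only if every conservative condition holds.
bool isMovableCandidate(const ValueInfo &V) {
  return V.IsVGPR && V.SizeInBits == 32 && V.LocalToBlock &&
         !V.HasAllocHint && !V.HasTiedOperand;
}
```

Any single failing condition rejects the value, which matches the stated intent of leaving anything unusual where the allocator put it.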

Algorithm

Sort basic blocks by execution frequency (process hottest first)
For each hot block:
- Calculate value density (number of distinct values per physical register)
- Identify completely free physical registers in this block
- Move local values from high-density registers to free registers
Stop when no more moves are profitable or no free registers remain
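The steps above can be sketched as a small standalone model. This is a simplification under stated assumptions, not the implementation in the patch: `LocalValue`, `Block`, and `renameHotBlocks` are hypothetical names, every value in a block is treated as block-local, a register counts as free when no value in the block uses it, and `VGPRLimit` stands in for the occupancy-based register limit. A move is considered profitable here only while the source register still holds more than one value.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <map>
#include <vector>

// Hypothetical model types (registers are plain integer indices).
struct LocalValue { int Id; int PhysReg; bool Movable; };
struct Block { int Id; double Freq; std::vector<LocalValue> Values; };

// Returns the number of values moved. Blocks are reordered hottest first.
int renameHotBlocks(std::vector<Block> &Blocks, double HotThreshold,
                    int VGPRLimit) {
  std::sort(Blocks.begin(), Blocks.end(),
            [](const Block &A, const Block &B) { return A.Freq > B.Freq; });
  int Moved = 0;
  for (Block &B : Blocks) {
    if (B.Freq <= HotThreshold)
      break; // sorted, so every remaining block is cold as well

    // Value density: number of distinct values per physical register.
    std::map<int, int> Density;
    for (const LocalValue &V : B.Values)
      ++Density[V.PhysReg];

    // Registers below the limit that no value in this block uses.
    std::vector<int> Free;
    for (int R = 0; R < VGPRLimit; ++R)
      if (!Density.count(R))
        Free.push_back(R);

    std::size_t NextFree = 0;
    while (NextFree < Free.size()) {
      // Pick a movable value from the densest register holding > 1 value.
      LocalValue *Best = nullptr;
      for (LocalValue &V : B.Values)
        if (V.Movable && Density[V.PhysReg] > 1 &&
            (!Best || Density[V.PhysReg] > Density[Best->PhysReg]))
          Best = &V;
      if (!Best)
        break; // no profitable move left

      --Density[Best->PhysReg];
      Best->PhysReg = Free[NextFree++];
      Density[Best->PhysReg] = 1;
      ++Moved;
    }
  }
  return Moved;
}
```

Capping the free-register search at `VGPRLimit` is how this model reflects the constraint that renaming must not raise register usage beyond what the target occupancy allows.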

Technical Details

Pass Placement

The pass runs in the pre-rewrite phase, after greedy register allocation but before VirtRegRewriter:

Legacy PM: GCNPassConfig::addPreRewrite()
New PM: AMDGPUCodeGenPassBuilder::addPreRewrite()

Implementation

Files added:
- llvm/lib/Target/AMDGPU/AMDGPUHotBlockRegisterRenaming.cpp (516 lines)
- llvm/lib/Target/AMDGPU/AMDGPUHotBlockRegisterRenaming.h (34 lines)
- llvm/test/CodeGen/AMDGPU/hot-block-register-renaming.mir (149 lines)
Files modified:
- llvm/lib/Target/AMDGPU/AMDGPU.h (pass declaration)
- llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.cpp (flag and pipeline integration)
- llvm/lib/Target/AMDGPU/AMDGPUPassRegistry.def (NPM registration)
- llvm/lib/Target/AMDGPU/CMakeLists.txt (build system)

API Usage

The pass uses standard LLVM register allocation infrastructure:

VirtRegMap - for querying and updating virtual-to-physical register mapping
LiveRegMatrix - for tracking physical register interference
LiveIntervals - for live range analysis
MachineBlockFrequencyInfo - for identifying hot blocks

Testing

Lit Tests

Three comprehensive test cases in hot-block-register-renaming.mir:

test_basic_move: Verifies that local values are correctly moved from high-density registers to free registers
test_tied_operand: Verifies that values with tied def-use constraints are NOT moved (e.g., V_MAC_F32)
test_no_free_registers: Verifies that the pass skips blocks when all registers are occupied (conservative behavior)

All tests pass with both legacy and new pass managers.

Regression Testing

Full LLVM test suite: ninja check-llvm - PASSED (42,461/42,461 tests)
Pipeline structure tests: All existing tests continue to pass (pass not visible without flag)
No changes to code generation when pass is disabled (default)

Performance Results

Tested on rocRAND MTGP32 kernel (register-intensive workload):

Baseline (without pass): 570 Gi/s
With pass enabled: 615 Gi/s
Improvement: +8% throughput, +2% vs. previous best result

Statistics (on MTGP32 kernel)

18 hot blocks processed
117 values remapped to reduce density
39 blocks skipped (no optimization needed)

The most critical block (BB#31) had 34 values moved from 8 high-density registers to free registers, which allowed the Post-RA scheduler to better reorder instructions around barriers.

Future Work

Potential enhancements (not included in this patch):

Support for wider VGPR values (register pairs, 96-bit, 128-bit)
Cross-block value remapping (more aggressive optimization)
Integration with occupancy tuning heuristics
Metrics-based decision making (e.g., estimated scheduling benefit)

Reviewers

Please review with focus on:

Correctness of tied operand handling
Conservative nature of the optimization (should never cause spilling)
Integration with existing register allocation infrastructure
Test coverage

michaelselehov added 3 commits October 24, 2025 08:51

michaelselehov added the testing only label Oct 24, 2025

michaelselehov requested review from dhruvachak, hidekisaito and mhalk October 24, 2025 14:08
