Skip to content

Conversation

@michaelselehov
Copy link

@michaelselehov michaelselehov commented Oct 24, 2025

NB: Pass is enabled by default for the testing purposes. DO NOT MERGE!

This patch introduces a post-allocation register renaming optimization pass that reduces value density in hot basic blocks. The pass helps the post-RA scheduler avoid false dependencies by moving local values to unused physical registers.

The pass operates after greedy register allocation but before VirtRegRewriter. It identifies hot blocks (above frequency threshold), calculates value density per physical register, and selectively moves local live ranges to free registers. Only 32-bit VGPR values that live entirely within a single basic block are moved, ensuring conservative behavior.

Key features:

  • Respects tied operands and register allocation constraints
  • Honors occupancy-based VGPR limits to avoid spilling
  • Disabled by default (enable with -amdgpu-enable-hot-block-reg-renaming)
  • Includes comprehensive lit tests

Performance results show up to 2% improvement on register-intensive kernels such as rocRAND MTGP32 on top of fixing the 5% regression.

This patch introduces a post-allocation register renaming optimization
pass that reduces value density in hot basic blocks. The pass helps
the post-RA scheduler avoid false WAW dependencies by moving local
values to unused physical registers.

The pass operates after greedy register allocation but before
VirtRegRewriter. It identifies hot blocks (above frequency threshold),
calculates value density per physical register, and selectively moves
local live ranges to free registers. Only 32-bit VGPR values that live
entirely within a single basic block are moved, ensuring conservative
behavior.

Key features:
- Respects tied operands and register allocation constraints
- Honors occupancy-based VGPR limits to avoid spilling
- Disabled by default (enable with -amdgpu-enable-hot-block-reg-renaming)
- Includes comprehensive lit tests

Performance results show up to 2% improvement on register-intensive
kernels such as rocRAND MTGP32.
- Rename canMoveValue to isVirtRegMovable for clarity
- Add assertions to verify single-value precondition
- Restore VRM->getPhys check: NOT redundant due to register aliasing
  (register units are shared between aliased registers like VGPR0 and VGPR0_VGPR1)
- Improve tied operand check to verify tied source register compatibility
This flips the default of -amdgpu-enable-hot-block-reg-renaming to true
to exercise the pass across large CI/CT builds. This is a temporary
enablement to flush out issues; users can still disable with
-mllvm -amdgpu-enable-hot-block-reg-renaming=false.
@z1-cciauto
Copy link
Collaborator

@michaelselehov
Copy link
Author

[AMDGPU] Add hot block register renaming pass

Problem

Performance regression was observed in register-intensive kernels (e.g., rocRAND MTGP32) due to high register pressure in hot basic blocks. The greedy register allocator tends to reuse the same physical registers for multiple short-lived values within a basic block, which creates false WAW (Write-After-Write) dependencies. These false dependencies prevent the Post-RA scheduler from reordering instructions effectively, leading to suboptimal scheduling around barriers and memory operations.

Solution

This patch introduces a new post-allocation optimization pass (AMDGPUHotBlockRegisterRenaming) that reduces value density in hot basic blocks by remapping local live ranges to unused physical registers.

Key Features

  • Conservative approach: Only moves values that:

    • Live entirely within a single basic block (local values)
    • Are 32-bit VGPR values (no register pairs or wide registers)
    • Have no register allocation hints
    • Have no tied operands (def-use constraints)
  • Respects constraints:

    • Honors occupancy-based VGPR limits to avoid spilling
    • Checks for tied operands to prevent breaking instruction constraints
    • Preserves all register allocation decisions for cross-block values
  • Disabled by default: Enabled only with -amdgpu-enable-hot-block-reg-renaming flag

Algorithm

  1. Sort basic blocks by execution frequency (process hottest first)
  2. For each hot block:
    • Calculate value density (number of distinct values per physical register)
    • Identify completely free physical registers in this block
    • Move local values from high-density registers to free registers
  3. Stop when no more moves are profitable or no free registers remain

Technical Details

Pass Placement

The pass runs in the pre-rewrite phase, after greedy register allocation but before VirtRegRewriter:

  • Legacy PM: GCNPassConfig::addPreRewrite()
  • New PM: AMDGPUCodeGenPassBuilder::addPreRewrite()

Implementation

  • Files added:

    • llvm/lib/Target/AMDGPU/AMDGPUHotBlockRegisterRenaming.cpp (516 lines)
    • llvm/lib/Target/AMDGPU/AMDGPUHotBlockRegisterRenaming.h (34 lines)
    • llvm/test/CodeGen/AMDGPU/hot-block-register-renaming.mir (149 lines)
  • Files modified:

    • llvm/lib/Target/AMDGPU/AMDGPU.h (pass declaration)
    • llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.cpp (flag and pipeline integration)
    • llvm/lib/Target/AMDGPU/AMDGPUPassRegistry.def (NPM registration)
    • llvm/lib/Target/AMDGPU/CMakeLists.txt (build system)

API Usage

The pass uses standard LLVM register allocation infrastructure:

  • VirtRegMap - for querying and updating virtual-to-physical register mapping
  • LiveRegMatrix - for tracking physical register interference
  • LiveIntervals - for live range analysis
  • MachineBlockFrequencyInfo - for identifying hot blocks

Testing

Lit Tests

Three comprehensive test cases in hot-block-register-renaming.mir:

  1. test_basic_move: Verifies that local values are correctly moved from high-density registers to free registers
  2. test_tied_operand: Verifies that values with tied def-use constraints are NOT moved (e.g., V_MAC_F32)
  3. test_no_free_registers: Verifies that the pass skips blocks when all registers are occupied (conservative behavior)

All tests pass with both legacy and new pass managers.

Regression Testing

  • Full LLVM test suite: ninja check-llvm - PASSED (42,461/42,461 tests)
  • Pipeline structure tests: All existing tests continue to pass (pass not visible without flag)
  • No changes to code generation when pass is disabled (default)

Performance Results

Tested on rocRAND MTGP32 kernel (register-intensive workload):

  • Baseline (without pass): 570 Gi/s
  • With pass enabled: 615 Gi/s
  • Improvement: +8% throughput, +2% vs. previous best result

Statistics (on MTGP32 kernel)

18 hot blocks processed
117 values remapped to reduce density
39 blocks skipped (no optimization needed)

The most critical block (BB#31) had 34 values moved from 8 high-density registers to free registers, which allowed the Post-RA scheduler to better reorder instructions around barriers.

Future Work

Potential enhancements (not included in this patch):

  • Support for wider VGPR values (register pairs, 96-bit, 128-bit)
  • Cross-block value remapping (more aggressive optimization)
  • Integration with occupancy tuning heuristics
  • Metrics-based decision making (e.g., estimated scheduling benefit)

Reviewers

Please review with focus on:

  1. Correctness of tied operand handling
  2. Conservative nature of the optimization (should never cause spilling)
  3. Integration with existing register allocation infrastructure
  4. Test coverage

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants