Open
Description
All memory operations (loads, stores, atomics) are currently threaded through a single global token chain. This is correct but conservative: operations on independent arrays are serialized even though no ordering between them is required.
Proposed improvement
Implement alias-aware token threading:
- Alias analysis: Compute which pointers may refer to the same memory region (alias sets)
- Per-set token chains: Thread tokens only between operations that may alias
- Loop parallel stores: Identify stores in loops whose indices do not overlap across iterations; these stores can skip token dependencies entirely
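
To illustrate the difference between a global chain and per-set chains, here is a minimal sketch in Python. It is not taken from any existing implementation: `thread_tokens`, the `(name, alias_set)` tuples, and the integer alias-set ids are all hypothetical, and it assumes each operation belongs to exactly one alias set (an operation that may alias several sets would need to join the tokens of all of them). It also orders every pair of operations within a set, including load after load, which is more conservative than strictly necessary.

```python
def thread_tokens(ops, per_set=True):
    """Return {op name: name of the op it must wait on, or None}.

    ops is a list of (name, alias_set) pairs in program order.
    alias_set is a hypothetical id assigned by an earlier alias analysis.
    """
    last = {}  # chain key -> most recent op on that chain
    deps = {}
    for name, alias_set in ops:
        # One chain per alias set, or a single shared chain for everything.
        key = alias_set if per_set else "global"
        deps[name] = last.get(key)
        last[key] = name
    return deps


# Arrays a and b are assumed to be in different alias sets (0 and 1).
ops = [("load a", 0), ("store b", 1), ("store a", 0), ("load b", 1)]

global_deps = thread_tokens(ops, per_set=False)
per_set_deps = thread_tokens(ops, per_set=True)
```

With the global chain every operation waits on its predecessor, so the four operations form one serial chain. With per-set chains, `store b` has no dependency on `load a`, and only the operations on the same array remain ordered.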
Why
The current sequential approach prevents parallelism between independent memory operations. For example, a load from array a and a store to array b need no ordering constraint if the two arrays are provably disjoint. Alias-aware threading preserves correctness while enabling the hardware to execute independent operations concurrently.
Reference implementation
cuTile Python implements this in:
- https://github.com/NVIDIA/cutile-python/blob/main/src/cuda/tile/_passes/alias_analysis.py: dataflow analysis that propagates alias sets until a fixed point is reached
- https://github.com/NVIDIA/cutile-python/blob/main/src/cuda/tile/_passes/token_order.py: maps tokens to alias sets via TokenKey, with special handling for loop-parallel stores
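
For readers unfamiliar with fixed-point alias propagation, the following toy sketch shows the general idea. It is not the cuTile Python code and does not reflect its data structures: `alias_sets`, `may_alias`, and the `(dst, src)` copy pairs are hypothetical names chosen for illustration, and the model assumes pointers are either base allocations or derived from other pointers.

```python
def alias_sets(allocs, copies):
    """Map each pointer to the set of base allocations it may refer to.

    allocs: names of base pointers (each starts in its own set).
    copies: (dst, src) pairs meaning dst may point wherever src points.
    """
    points_to = {p: {p} for p in allocs}
    changed = True
    while changed:  # iterate until no set grows: the fixed point
        changed = False
        for dst, src in copies:
            before = points_to.setdefault(dst, set())
            new = before | points_to.get(src, set())
            if new != before:
                points_to[dst] = new
                changed = True
    return points_to


def may_alias(points_to, p, q):
    # Two pointers may alias if they share at least one base allocation.
    return bool(points_to.get(p, set()) & points_to.get(q, set()))


# q is derived from p before p's origin is known, so one pass is not enough.
pts = alias_sets(["a", "b"], [("q", "p"), ("p", "a"), ("r", "b")])
```

Here `q` and `a` may alias, while `q` and `r` may not, so memory operations through `q` and `r` could be placed on separate token chains.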