20 changes: 12 additions & 8 deletions examples/utils.py
@@ -94,11 +94,15 @@ def verbose_allclose(
return []


def clear_l2_cache():
# import cupy as cp
# cp.cuda.runtime.deviceSetLimit(cp.cuda.runtime.cudaLimitPersistingL2CacheSize, 0)
# create a large dummy tensor
dummy = torch.empty((32, 1024, 1024), dtype=torch.int64, device="cuda")
# write stuff to it
dummy.fill_(42)
del dummy
def clear_l2_cache(device='cuda'):
"""
Clears GPU L2 cache by allocating and zeroing a buffer.

GB200 has 126 MB L2 cache. Using 512 MB (4x buffer).
See: https://docs.nvidia.com/cuda/blackwell-tuning-guide/
Copilot AI commented on Nov 8, 2025:

The URL in the documentation appears to be incomplete or generic. The link should point to a specific section of the Blackwell tuning guide that discusses L2 cache specifications, if available, to help readers verify the 126 MB L2 cache claim.

Suggested change:
-    See: https://docs.nvidia.com/cuda/blackwell-tuning-guide/
+    See: https://docs.nvidia.com/cuda/blackwell-tuning-guide/index.html#l2-cache
+    # Section: "L2 Cache" in the Blackwell tuning guide.
Adapted from triton.testing.do_bench.
"""
cache_size = 512 * 1024 * 1024
cache = torch.empty(int(cache_size // 4), dtype=torch.int32, device=device)
Copilot AI commented on Nov 8, 2025:

The magic number 4 in the division cache_size // 4 should be explained with a comment or replaced with a named constant. The division by 4 likely converts bytes to int32 elements (4 bytes per int32), but this is not immediately clear to readers.

Suggested change:
-    cache = torch.empty(int(cache_size // 4), dtype=torch.int32, device=device)
+    BYTES_PER_INT32 = 4  # Number of bytes in a 32-bit integer
+    cache = torch.empty(int(cache_size // BYTES_PER_INT32), dtype=torch.int32, device=device)
cache.zero_()