
Add triton inspired l2 cache clearing #370

Closed
msaroufim wants to merge 1 commit into main from l2_cache_clear

Conversation

@msaroufim
Member

No description provided.

Copilot AI review requested due to automatic review settings November 8, 2025 23:56
@github-actions

github-actions bot commented Nov 8, 2025

Coverage report

This PR does not seem to contain any modification to coverable code.

Contributor

Copilot AI left a comment


Pull Request Overview

This PR refactors the clear_l2_cache function to improve L2 cache clearing for GPU benchmarking. The implementation is updated to follow the Triton library's approach and sized for newer GPU architectures such as GB200.

Key changes:

  • Added a configurable device parameter with a default value of 'cuda'
  • Updated the cache-clearing strategy from allocating 32 MB and filling it with the value 42 to allocating 512 MB and zeroing it
  • Added comprehensive documentation explaining the rationale and hardware-specific considerations
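
Taken together with the diff excerpts quoted below, these changes suggest the following shape for the refactored function. This is a minimal sketch assembled from those fragments, not the PR's exact code; in particular, the final zero_() call is an assumption, since only the docstring and the allocation appear in the excerpts (Triton's do_bench zeroes its scrub buffer the same way).

    import torch

    def clear_l2_cache(device: str = "cuda") -> None:
        """Clears GPU L2 cache by allocating and zeroing a buffer.

        GB200 has 126 MB L2 cache. Using 512 MB (4x buffer).
        Adapted from triton.testing.do_bench.
        """
        cache_size = 512 * 1024 * 1024  # buffer size in bytes
        # int32 elements are 4 bytes each, so divide the byte count by 4
        cache = torch.empty(int(cache_size // 4), dtype=torch.int32, device=device)
        cache.zero_()  # assumed zeroing step; not shown in the quoted excerpts

In a benchmark harness this would be called between timed iterations so that each run starts from a cold L2, which is how triton.testing.do_bench uses its own scrub buffer.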


    Adapted from triton.testing.do_bench.
    """
    cache_size = 512 * 1024 * 1024
    cache = torch.empty(int(cache_size // 4), dtype=torch.int32, device=device)

Copilot AI Nov 8, 2025


The magic number 4 in the division cache_size // 4 should be explained with a comment or replaced with a named constant. The division by 4 likely converts bytes to int32 elements (4 bytes per int32), but this is not immediately clear to readers.

Suggested change
    cache = torch.empty(int(cache_size // 4), dtype=torch.int32, device=device)
    BYTES_PER_INT32 = 4  # Number of bytes in a 32-bit integer
    cache = torch.empty(int(cache_size // BYTES_PER_INT32), dtype=torch.int32, device=device)
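
A hypothetical alternative to a hand-written constant is to derive the element size from the dtype itself; torch.dtype.itemsize is available in recent PyTorch releases (2.1+). This variant is illustrative, not part of the PR or of the suggestion above:

    import torch

    # Derive bytes-per-element from the dtype instead of hardcoding 4.
    cache_size = 512 * 1024 * 1024  # bytes
    dtype = torch.int32
    cache = torch.empty(cache_size // dtype.itemsize, dtype=dtype, device="cuda")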

    Clears GPU L2 cache by allocating and zeroing a buffer.

    GB200 has 126 MB L2 cache. Using 512 MB (4x buffer).
    See: https://docs.nvidia.com/cuda/blackwell-tuning-guide/

Copilot AI Nov 8, 2025


The URL in the documentation appears to be incomplete or generic. The link should point to a specific section of the Blackwell tuning guide that discusses L2 cache specifications, if available, to help readers verify the 126 MB L2 cache claim.

Suggested change
    See: https://docs.nvidia.com/cuda/blackwell-tuning-guide/
    See: https://docs.nvidia.com/cuda/blackwell-tuning-guide/index.html#l2-cache
    # Section: "L2 Cache" in the Blackwell tuning guide.
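
Rather than hardcoding the 512 MB figure, one could also size the buffer from the device's reported L2 capacity. This is a hypothetical variant, not code from the PR; it assumes the L2_cache_size field that recent PyTorch builds expose on the CUDA device properties:

    import torch

    props = torch.cuda.get_device_properties(0)
    l2_bytes = props.L2_cache_size  # ~126 MB on GB200 per the tuning guide
    cache_size = 4 * l2_bytes       # keep the PR's roughly 4x headroom
    # torch.zeros allocates and zeroes in one step (int32 = 4 bytes/element)
    cache = torch.zeros(cache_size // 4, dtype=torch.int32, device="cuda")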

@msaroufim msaroufim closed this Nov 9, 2025