
Conversation

@wwwjn (Contributor) commented Sep 24, 2025

Continuing development on top of #1559. Thanks @KhoomeiK for the initial contribution!

All runs are initialized from the same seed checkpoint, with seed=0 and deterministic=True (a seeding sketch follows the run list below).
[screenshot]

Run 1: dp_shard = 2
[screenshot]

Run 2: dp_shard = 2, TP degree = 2 (NGPU=4)
[screenshot]

Run 3: dp_shard = 2, TP degree = 2, EP degree = 2 (NGPU=4)
[screenshot]

Run 4: dp_shard = 2, TP degree = 2, EP degree = 2, ETP degree = 2 (NGPU=4)
[screenshot]

Run 5: dp_shard = 2, EP degree = 2 (NGPU=2)
[screenshot]
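
For reference, here is a minimal sketch of the seed=0 / deterministic=True setup the runs above rely on. torchtitan drives this through its training config; the helper below is only illustrative, not the repo's actual code path.

```python
import os
import torch

def set_deterministic(seed: int = 0) -> None:
    # Illustrative helper, not torchtitan's implementation.
    torch.manual_seed(seed)               # seed CPU (and current-device CUDA) RNGs
    torch.cuda.manual_seed_all(seed)      # seed all visible CUDA devices
    torch.use_deterministic_algorithms(True)  # raise if an op has no deterministic impl
    # cuBLAS requires this setting for deterministic GEMMs on CUDA >= 10.2.
    os.environ.setdefault("CUBLAS_WORKSPACE_CONFIG", ":4096:8")
```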

block_mask = FlexAttention.block_masks[self.mask_key]
return FlexAttention.flex_attn(q, k, v, block_mask=block_mask, scale=scale)

def _forward_with_sink(
@wwwjn (Contributor, Author):
Looking for some early comments / suggestions @fegin @tianyu-l
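
To make the lse discussion concrete, here is a rough sketch (not the PR's actual `_forward_with_sink`; the `sink_logits` shape and all names are assumptions) of how a learned per-head attention sink can be folded in using the log-sum-exp that flex_attention can return:

```python
import torch
from torch.nn.attention.flex_attention import flex_attention

def forward_with_sink_sketch(q, k, v, sink_logits, block_mask=None, scale=None):
    # return_lse=True makes flex_attention also return the log-sum-exp of the
    # attention scores (assumed natural-log), which the sink correction needs.
    out, lse = flex_attention(q, k, v, block_mask=block_mask, scale=scale,
                              return_lse=True)
    # With a per-head sink logit s, the softmax denominator becomes exp(lse) + exp(s),
    # so the output is rescaled by exp(lse) / (exp(lse) + exp(s)) = sigmoid(lse - s).
    # lse: (B, H, Q); broadcast the (H,)-shaped sink over batch and query dims.
    rescale = torch.sigmoid(lse - sink_logits.view(1, -1, 1))
    return out * rescale.unsqueeze(-1).to(out.dtype)
```

If returning lse turns out to be essentially free, this could live in the main forward; otherwise it stays behind a separate, optional path.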

Contributor:
LGTM.

I'm curious how expensive it is to always return lse. If it actually has no cost, we can merge the FlexAttention call into the original forward.

cc @drisspg

@wwwjn (Contributor, Author) commented Sep 30, 2025

Need to rebase onto #1776

@wwwjn marked this pull request as ready for review on September 30, 2025 at 23:01
@wwwjn changed the title from "[WIP] gpt-oss model enablement" to "gpt-oss model enablement" on Sep 30, 2025
] # (mask_type, fixed_block_size, sliding_window)


class FlexAttention(torch.nn.Module):
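
For context on the `(mask_type, fixed_block_size, sliding_window)` tuple above, a hedged sketch of how one such mask entry could be built with FlexAttention's create_block_mask; the helper name and exact window semantics are assumptions, not the PR's code:

```python
from torch.nn.attention.flex_attention import create_block_mask

def make_causal_sliding_window_mask(seq_len: int, window: int, device: str = "cuda"):
    # Causal attention restricted to the most recent `window` key positions.
    def mask_mod(b, h, q_idx, kv_idx):
        causal = q_idx >= kv_idx
        in_window = (q_idx - kv_idx) < window
        return causal & in_window

    # B=None / H=None shares the mask across batch and heads.
    return create_block_mask(mask_mod, B=None, H=None,
                             Q_LEN=seq_len, KV_LEN=seq_len, device=device)
```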
@wwwjn (Contributor, Author):
@wwwjn will rebase onto #1776

Contributor:
Sorry for the disruption. I should have done this earlier.

As for FlexAttention, @drisspg confirmed that, while it is probably only a minor overhead, the AuxOutput does incur some extra memory and an extra memory write. So let's keep it optional.
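
Keeping it optional could look roughly like the following (a sketch under the assumption that only sink-augmented layers ask for lse; `need_lse` is an illustrative parameter, not the PR's API):

```python
from torch.nn.attention.flex_attention import flex_attention

def flex_attn_forward(q, k, v, block_mask=None, scale=None, need_lse: bool = False):
    if need_lse:
        # Opt-in path: the auxiliary output costs some memory and a memory write.
        return flex_attention(q, k, v, block_mask=block_mask,
                              scale=scale, return_lse=True)  # returns (out, lse)
    return flex_attention(q, k, v, block_mask=block_mask, scale=scale)
```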
