
Conversation

@shawntan (Collaborator)

No description provided.


@gemini-code-assist (bot) left a comment


Summary of Changes

Hello @shawntan, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a novel 'Mixture of Attention' (MoA) mechanism, internally referred to as 'momha', into the lm_engine framework. The primary goal is to enhance the model's attention capabilities by leveraging a routing system that directs different parts of the input to specialized experts for query and output transformations. This integration involves adding a new attention module, updating the model's configuration system to recognize and utilize 'momha', and ensuring compatibility across various framework components, including FLOPs calculation utilities. The change aims to provide a more flexible and potentially more efficient attention mechanism for language models.

Highlights

  • Introduction of Mixture of Attention (MoA) Module: A new MixtureOfAttention (MoA) module has been introduced, designed to enhance attention mechanisms by incorporating expert routing. This module intelligently routes parts of the input to specialized 'experts' for query and output projections, aiming for more efficient and effective processing.
  • Seamless Configuration and Integration: The necessary configuration classes (_MixtureOfAttentionArgs) and mappings have been added to integrate MoA seamlessly into the existing model configuration system. This allows for easy definition and instantiation of models utilizing the new 'momha' sequence mixer type.
  • Advanced Attention Features and Expert Management: The core MixtureOfAttention implementation includes advanced features such as support for rotary position embeddings, efficient attention computation via Flash Attention (v2/v3) or scaled dot-product attention, and mechanisms to manage past key-values for generation. Crucially, it incorporates auxiliary losses (switch loss and z-loss) to promote balanced utilization of the experts during training, preventing 'expert collapse' (a minimal sketch of this routing-and-loss pattern follows this list).
  • Updated Framework Utilities for MoA Compatibility: The framework's utility functions, including those for calculating model FLOPs, have been updated to correctly account for the computational cost of the new 'momha' sequence mixer, ensuring accurate performance estimations.
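
For intuition, here is a minimal, self-contained sketch of the routing-and-auxiliary-loss pattern described above. The names used (ToyRouter, num_experts, top_k, z_loss_coef) are illustrative assumptions, not the actual MixtureOfAttention API introduced in this pull request.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ToyRouter(nn.Module):
    """Illustrative top-k token router with switch- and z-auxiliary losses."""

    def __init__(self, hidden_size: int, num_experts: int, top_k: int = 2, z_loss_coef: float = 0.1) -> None:
        super().__init__()
        self.gate = nn.Linear(hidden_size, num_experts, bias=False)
        self.num_experts = num_experts
        self.top_k = top_k
        self.z_loss_coef = z_loss_coef

    def forward(self, hidden_states: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor, torch.Tensor]:
        # hidden_states: (num_tokens, hidden_size)
        logits = self.gate(hidden_states)                  # (num_tokens, num_experts)
        probs = logits.softmax(dim=-1)
        weights, indices = probs.topk(self.top_k, dim=-1)  # per-token expert choices

        # Switch-style load-balancing loss: mean router probability per expert
        # times the empirical assignment frequency, summed over experts.
        freq = F.one_hot(indices[:, 0], self.num_experts).float().mean(dim=0)
        acc_probs = probs.mean(dim=0)
        switch_loss = self.num_experts * (acc_probs * freq).sum()

        # z-loss discourages the router logits from growing unboundedly.
        z_loss = (torch.logsumexp(logits, dim=-1) ** 2).mean()
        aux_loss = switch_loss + self.z_loss_coef * z_loss

        return weights, indices, aux_loss


# Usage: route 16 tokens of width 64 across 4 experts, keeping the top 2 per token.
router = ToyRouter(hidden_size=64, num_experts=4, top_k=2)
weights, indices, aux_loss = router(torch.randn(16, 64))
```

In the real module, weights and indices would presumably gate which experts handle each token's query and output projections, while aux_loss would be added to the training objective to keep expert utilization balanced.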
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point in your pull request by creating an issue comment (i.e. a comment on the pull request page) using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands.

Feature | Command | Description
Code Review | /gemini review | Performs a code review for the current pull request in its current state.
Pull Request Summary | /gemini summary | Provides a summary of the current pull request in its current state.
Comment | @gemini-code-assist | Responds in comments when explicitly tagged, both in issue comments and review comments.
Help | /gemini help | Displays a list of available commands.

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by adding files to a .gemini/ folder in the root of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

You can also get AI-powered code generation, chat, and code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.


@gemini-code-assist (bot) left a comment


Code Review

This pull request introduces a MixtureOfAttention layer, a novel approach combining attention with mixture-of-experts principles. The implementation is largely sound and integrates well with the existing model architecture. My review focuses on a few key areas for improvement. I've identified a critical issue in the TFLOPs calculation for the new layer type. Additionally, there are some medium-severity issues related to incorrect type hints and a hardcoded hyperparameter that should be made configurable for better flexibility and maintainability.

b * s, block.out_channels, h, gradient_checkpointing=gradient_checkpointing_enabled
)
elif sequence_mixer_type in ["softmax_attention", "stickbreaking_attention"]:
elif sequence_mixer_type in ["softmax_attention", "stickbreaking_attention", "momha"]:

critical

The TFLOPs calculation for momha is incorrect: it is treated the same as standard softmax_attention, which significantly underestimates the computational cost because it doesn't account for the gating network and the expert-based projections. A separate calculation block should be added for momha to correctly account for the FLOPs of all its components (gate, expert-based Q-projection, dense KV-projection, attention, and expert-based output projection).
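
For illustration, a rough sketch of what a dedicated momha FLOPs estimate could look like; the variable names (b, s, h) and config fields (num_experts, top_k, kv_dim) are assumptions and should be mapped onto the framework's actual utilities.

```python
def momha_forward_flops(b: int, s: int, h: int, num_experts: int, top_k: int, kv_dim: int) -> int:
    """Rough forward-pass FLOPs for one momha block, counting 2 FLOPs per multiply-add.

    Illustrative estimate only; the real utility should use the framework's own
    config fields and conventions.
    """
    gate = 2 * b * s * h * num_experts      # routing logits
    q_proj = 2 * b * s * top_k * h * h      # expert-based Q projection (top-k experts per token)
    kv_proj = 2 * b * s * h * (2 * kv_dim)  # dense K and V projections
    attention = 2 * 2 * b * s * s * h       # QK^T scores plus attention-weighted sum over V
    out_proj = 2 * b * s * top_k * h * h    # expert-based output projection
    return gate + q_proj + kv_proj + attention + out_proj
```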

causal: bool,
layer_idx: int,
use_padding_free_transformer: bool,
) -> MixtureOfAttention:

medium

The __init__ method should have a return type hint of None, not the class name.

Suggested change:
- ) -> MixtureOfAttention:
+ ) -> None:


return x, indices

def _compute_routing_weights(self, hidden_states: torch.Tensor) -> tuple[torch.Tensor]:

medium

The return type hint for this function is incorrect. It returns a tuple of three tensors, but the hint suggests a tuple with a single tensor. Please update it for correctness and clarity.

Suggested change:
- def _compute_routing_weights(self, hidden_states: torch.Tensor) -> tuple[torch.Tensor]:
+ def _compute_routing_weights(self, hidden_states: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor, torch.Tensor]:

switch_loss = num_experts * (F.normalize(acc_probs, p=1, dim=0) * F.normalize(freq, p=1, dim=0)).sum()
z_loss = (torch.logsumexp(logits, dim=-1) ** 2).mean()

loss = switch_loss + 0.1 * z_loss

medium

The coefficient 0.1 for z_loss is hardcoded. This hyperparameter can significantly impact training. It's recommended to make this configurable, for instance, by adding a z_loss_coef field to the _MixtureOfAttentionArgs config class and using it here.
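
A sketch of how the configurable coefficient might look; the dataclass below is a stand-in, since the actual fields and base class of _MixtureOfAttentionArgs are not shown here.

```python
from dataclasses import dataclass


@dataclass
class _MixtureOfAttentionArgs:
    # existing fields omitted; the values below are stand-ins for illustration
    num_experts: int = 8
    top_k: int = 2
    # new field: defaulting to 0.1 preserves the current behaviour
    z_loss_coef: float = 0.1


# In the auxiliary-loss computation, the hardcoded coefficient would then become:
# loss = switch_loss + self.z_loss_coef * z_loss
```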
