
Conversation

@Limerances
Contributor

Motivation

This PR adds support for native GPT-OSS-120B and GPT-OSS-20B models using MXFP4 offline quantization, leveraging FlashInfer as the backend for MoE components.

Modifications

Usage or Command

Accuracy Tests

Checklist

  • Add at least one tag in the PR title.
    • Tag list: [[FDConfig],[APIServer],[Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]]
    • You can add new tags based on the PR content, but the semantics must be clear.
  • Format your code and run pre-commit before committing.
  • Add unit tests. If no unit tests are added, please explain the reason in this PR.
  • Provide accuracy results.
  • If the current PR targets a release branch, make sure it has first been submitted to the develop branch, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.

Copilot AI review requested due to automatic review settings December 5, 2025 14:18
@paddle-bot

paddle-bot bot commented Dec 5, 2025

Thanks for your contribution!

@paddle-bot added the contributor (External developers) label on Dec 5, 2025
Copilot finished reviewing on behalf of Limerances December 5, 2025 14:22
Contributor

Copilot AI left a comment


Pull request overview

This PR adds support for MXFP4 quantization for GPT-OSS models (120B and 20B variants) with FlashInfer as the MoE backend. The implementation includes a new quantization method, model-specific weight loading logic, and integration with the existing quantization framework.

Key Changes:

  • Adds MXFP4 quantization implementation with FlashInfer and Triton backend support
  • Updates GPT-OSS model to handle MXFP4 weight loading with specialized logic for MoE expert weights
  • Introduces module filtering capability to exclude certain layers (e.g., normalization) from quantization (a generic sketch of this filtering pattern follows this list)
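
For readers new to this pattern, here is a minimal, self-contained sketch of exclusion-based module filtering. It is illustrative only: the QuantSettings class, the modules_to_not_convert field, and the substring-matching rule are assumptions for this sketch, not the PR's actual API.

# Illustrative only: a generic exclusion-based filter, not the PR's implementation.
# `QuantSettings` and `modules_to_not_convert` are made-up names for this sketch.
from dataclasses import dataclass, field
from typing import List


@dataclass
class QuantSettings:
    # Substring patterns of module prefixes that should stay unquantized.
    modules_to_not_convert: List[str] = field(default_factory=list)


def should_quantize(prefix: str, settings: QuantSettings) -> bool:
    """Return True unless the module prefix matches an exclusion pattern."""
    return not any(pattern in prefix for pattern in settings.modules_to_not_convert)


settings = QuantSettings(modules_to_not_convert=["norm"])
print(should_quantize("model.layers.0.mlp.experts.gate_up_proj", settings))  # True
print(should_quantize("model.layers.0.post_attention_layernorm", settings))  # False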

Reviewed changes

Copilot reviewed 10 out of 10 changed files in this pull request and generated 18 comments.

Summary per file:

  • fastdeploy/model_executor/layers/quantization/mxfp4.py: New file implementing the MXFP4 quantization method with weight creation, loading, and inference logic for MoE layers
  • fastdeploy/model_executor/layers/quantization/__init__.py: Registers MXFP4 as a supported quantization method in the framework
  • fastdeploy/model_executor/models/gpt_oss.py: Updates weight loading logic to handle MXFP4 quantized weights and adds norm layer exclusion from quantization
  • fastdeploy/model_executor/layers/moe/moe.py: Adds MXFP4-specific weight transformation logic in the fused experts weight loader
  • fastdeploy/model_executor/layers/linear.py: Integrates the module filtering check before applying quantization
  • fastdeploy/model_executor/layers/normalization.py: Adds module filtering to conditionally apply quantization parameters based on exclusion patterns
  • fastdeploy/model_executor/layers/utils.py: Introduces the modules_to_convert utility function to check whether a module should be quantized
  • fastdeploy/utils.py: Modifies unified checkpoint file counting logic (appears to be a bug fix for an off-by-one error)
  • fastdeploy/envs.py: Adds an environment variable to control FlashInfer MXFP4 backend usage
  • fastdeploy/config.py: Adds a special case for GPT-OSS model format detection
Comments suppressed due to low confidence (1)

fastdeploy/model_executor/layers/quantization/mxfp4.py:250

  • This statement is unreachable.
        block_size = 32

return vocab_range_from_per_partition_vocab_size(per_partition_vocab_size, rank, offset=offset)


def modules_to_convert(prefix: str, fd_config: FDConfig):

Copilot AI Dec 5, 2025


The function modules_to_convert lacks a docstring. Please add documentation explaining the purpose, parameters, and return value. For example: "Check if a module should be converted for quantization based on exclusion patterns in the configuration."

Suggested change
def modules_to_convert(prefix: str, fd_config: FDConfig):
def modules_to_convert(prefix: str, fd_config: FDConfig):
    """
    Determine whether a module should be converted for quantization based on exclusion patterns in the configuration.

    Args:
        prefix (str): The prefix or name of the module to check.
        fd_config (FDConfig): The FastDeploy configuration object containing model and quantization settings.

    Returns:
        bool: True if the module should be converted for quantization, False if it should be excluded.
    """

Comment on lines +574 to 577
safetensors_num = int(model_files[0].strip(".safetensors").split("-")[-1]) + 1
flags = [0] * safetensors_num
for x in model_files:
    current_index = int(x.strip(".safetensors").split("-")[1])

Copilot AI Dec 5, 2025


Multiple issues with this line:

  1. The .strip(".safetensors") is incorrect - it removes any of those characters from both ends, not the suffix. Use .removesuffix(".safetensors") or .replace(".safetensors", "") instead.
  2. For a file like "model-00001-of-00008.safetensors", after stripping the suffix, splitting by "-" gives ["model", "00001", "of", "00008"], so [-1] gets "00008". Converting to int gives 8, and adding 1 gives 9, which is incorrect if there are 8 files total (indexed 1-8).
  3. The logic assumes the last part is the total count, but it should extract from the "of-XXXXX" part using split("-")[-1] without the +1, or verify the intended file naming convention.
Suggested change
safetensors_num = int(model_files[0].strip(".safetensors").split("-")[-1]) + 1
flags = [0] * safetensors_num
for x in model_files:
    current_index = int(x.strip(".safetensors").split("-")[1])
# Extract total number of files from filename like "model-00001-of-00008.safetensors"
match = re.match(r"model-(\d+)-of-(\d+)\.safetensors$", model_files[0])
if not match:
    raise ValueError(f"Filename {model_files[0]} does not match expected pattern.")
safetensors_num = int(match.group(2))
flags = [0] * safetensors_num
for x in model_files:
    match = re.match(r"model-(\d+)-of-(\d+)\.safetensors$", x)
    if not match:
        raise ValueError(f"Filename {x} does not match expected pattern.")
    current_index = int(match.group(1))

**extra_kwargs,
)

return output[..., : layer.hidden_size]

Copilot AI Dec 5, 2025


The apply method only handles the SM90_FI_MXFP4_BF16 backend case. When self.mxfp4_backend is TRITON, the function will implicitly return None, which is likely incorrect behavior. Add an else clause to either handle the TRITON backend or raise a NotImplementedError with a clear message.

Suggested change
return output[..., : layer.hidden_size]
    return output[..., : layer.hidden_size]
else:
    raise NotImplementedError(
        f"MXFP4 backend '{self.mxfp4_backend}' is not supported in MXFP4MoeMethod.apply()."
    )

Comment on lines +390 to +393
# tp_size=self.moe.tp_size,
# tp_rank=self.moe.tp_rank,
# ep_size=self.moe.ep_size,
# ep_rank=self.moe.ep_rank,

Copilot AI Dec 5, 2025


Commented-out code should be removed. If these parameters (tp_size, tp_rank, ep_size, ep_rank) might be needed in the future, document this with a TODO comment explaining why they're currently disabled.

Suggested change
# tp_size=self.moe.tp_size,
# tp_rank=self.moe.tp_rank,
# ep_size=self.moe.ep_size,
# ep_rank=self.moe.ep_rank,
# TODO: The parameters tp_size, tp_rank, ep_size, and ep_rank are not currently passed to flashinfer_cutlass_fused_moe.
# Enable these when distributed MoE support is implemented and tested in the backend.



class MXFP4Config(QuantConfigBase):
    """Base class for quantization configs."""

Copilot AI Dec 5, 2025


The docstring "Base class for quantization configs." is incorrect. This is not a base class but rather a specific implementation for MXFP4 quantization. Update to something like "Configuration for MXFP4 quantization."

Suggested change
"""Base class for quantization configs."""
"""Configuration for MXFP4 quantization."""

# Set moe backend."cutlass","marlin" and "triton" can be set currently.
"FD_MOE_BACKEND": lambda: os.getenv("FD_MOE_BACKEND", "cutlass"),
# Whether to use FLASHINFER as MXFP4 backend for MoE.
"FD_USE_FLASHINFER_MOE_MXFP4_BF16": lambda: os.getenv("FD_USE_FLASHINFER_MOE_MXFP4_BF16", "0"),

Copilot AI Dec 5, 2025


For consistency with other boolean environment variables in this file (e.g., FD_USE_DEEP_GEMM, FD_USE_HF_TOKENIZER), the value should be converted to a boolean using bool(int(os.getenv(...))) pattern instead of returning a string "0". This ensures consistent handling of boolean environment variables throughout the codebase.

Suggested change
"FD_USE_FLASHINFER_MOE_MXFP4_BF16": lambda: os.getenv("FD_USE_FLASHINFER_MOE_MXFP4_BF16", "0"),
"FD_USE_FLASHINFER_MOE_MXFP4_BF16": lambda: bool(int(os.getenv("FD_USE_FLASHINFER_MOE_MXFP4_BF16", "0"))),

Comment on lines 332 to 333
model_sublayer_name = re.sub(r"\.(up_gate_proj_weight|down_proj_weight|weight)$", "", model_param_name)
process_weights_after_loading_fn(model_sublayer_name, param)

Copilot AI Dec 5, 2025


The variable model_param_name may be undefined if the loop exits without matching any parameters. This can happen when matched is False after all checks and model_param_name not in params_dict evaluates to True (line 325 continues the loop). This would cause a NameError at line 332. Consider initializing model_param_name before the loops or restructuring the logic to ensure it's always defined when used.
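
To illustrate the hazard in isolation, here is a toy version of the pattern with the initialization the comment suggests; all names here (resolve_name, mappings, and so on) are hypothetical and not taken from the PR:

# Illustrative only: a variable assigned solely inside a loop branch can be
# unbound after the loop; initializing it first makes the "no match" case explicit.
def resolve_name(loaded_name, mappings, params_dict):
    model_param_name = None  # initialize so post-loop code can detect "no match"
    for param_name, weight_name in mappings:
        if weight_name not in loaded_name:
            continue
        candidate = loaded_name.replace(weight_name, param_name)
        if candidate not in params_dict:
            continue
        model_param_name = candidate
        break
    if model_param_name is None:
        # Without the initialization above, touching model_param_name here would
        # raise NameError whenever no mapping matched.
        return None
    return model_param_name


print(resolve_name("layers.0.gate_up_proj", [("up_gate_proj_weight", "gate_up_proj")], {}))
# -> None (no params_dict entry), handled explicitly instead of crashing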

Comment on lines +277 to +282
("up_gate_proj_weight", "gate_up_proj", None, None),
("down_proj_weight", "down_proj", None, None),
("up_gate_proj_weight", "gate_up_proj_blocks", None, None),
("up_gate_proj_scale", "gate_up_proj_scales", None, None),
("down_proj_weight", "down_proj_blocks", None, None),
("down_proj_scale", "down_proj_scales", None, None),

Copilot AI Dec 5, 2025


The expert_params_mapping list contains duplicate entries for up_gate_proj_weight and down_proj_weight. Lines 277-278 map to "gate_up_proj" and "down_proj", while lines 279-282 map to the quantized variants ("gate_up_proj_blocks", "gate_up_proj_scales", etc.). During weight loading, the first match will always win due to the break statement at line 321, meaning the quantized weight mappings (lines 279-282) may never be matched. Consider using a more specific matching condition or restructuring the logic to handle both quantized and non-quantized weights properly.

Suggested change
("up_gate_proj_weight", "gate_up_proj", None, None),
("down_proj_weight", "down_proj", None, None),
("up_gate_proj_weight", "gate_up_proj_blocks", None, None),
("up_gate_proj_scale", "gate_up_proj_scales", None, None),
("down_proj_weight", "down_proj_blocks", None, None),
("down_proj_scale", "down_proj_scales", None, None),
# Quantized variants first
("up_gate_proj_weight", "gate_up_proj_blocks", None, None),
("up_gate_proj_scale", "gate_up_proj_scales", None, None),
("down_proj_weight", "down_proj_blocks", None, None),
("down_proj_scale", "down_proj_scales", None, None),
# Non-quantized variants after
("up_gate_proj_weight", "gate_up_proj", None, None),
("down_proj_weight", "down_proj", None, None),

)
h2d_copy(dst=param, src=loaded_weight)

if self.moe_quant_config.name() == "mxfp4":

Copilot AI Dec 5, 2025


The code accesses self.moe_quant_config.name() without checking if moe_quant_config exists or is not None. This could raise an AttributeError if moe_quant_config is None. Consider adding a null check: if self.moe_quant_config and self.moe_quant_config.name() == "mxfp4":

Suggested change
if self.moe_quant_config.name() == "mxfp4":
if self.moe_quant_config and self.moe_quant_config.name() == "mxfp4":

is_checkpoint_bf16 = not config.get("is_quantized", False)
return cls(is_checkpoint_bf16)

def get_quant_method(self, layer) -> Optional[QuantMethodBase]:

Copilot AI Dec 5, 2025


This method takes 2 positional arguments, whereas the QuantConfigBase.get_quant_method that it overrides requires 3.

Suggested change
def get_quant_method(self, layer) -> Optional[QuantMethodBase]:
def get_quant_method(self, layer, weight_attr=None) -> Optional[QuantMethodBase]:
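
For illustration, a minimal standalone sketch (class names are hypothetical) of how such a signature mismatch surfaces when the method is called through the base-class interface:

# Illustrative only: an override with fewer positional parameters breaks callers
# that invoke the method polymorphically with the base-class signature.
from typing import Optional


class QuantMethod: ...


class BaseConfig:
    def get_quant_method(self, layer, weight_attr) -> Optional[QuantMethod]:
        return None


class TwoArgConfig(BaseConfig):
    # Override drops `weight_attr`; any caller passing it will fail.
    def get_quant_method(self, layer) -> Optional[QuantMethod]:
        return None


config: BaseConfig = TwoArgConfig()
try:
    config.get_quant_method(object(), "weight")
except TypeError as exc:
    print(exc)  # roughly: get_quant_method() takes 2 positional arguments but 3 were given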

@Limerances closed this on Dec 8, 2025