Support MXFP4 for GPT-OSS #5409
Conversation
Thanks for your contribution!
Pull request overview
This PR adds support for MXFP4 quantization for GPT-OSS models (120B and 20B variants) with FlashInfer as the MoE backend. The implementation includes a new quantization method, model-specific weight loading logic, and integration with the existing quantization framework.
Key Changes:
- Adds MXFP4 quantization implementation with FlashInfer and Triton backend support
- Updates GPT-OSS model to handle MXFP4 weight loading with specialized logic for MoE expert weights
- Introduces module filtering capability to exclude certain layers (e.g., normalization) from quantization
Reviewed changes
Copilot reviewed 10 out of 10 changed files in this pull request and generated 18 comments.
| File | Description |
|---|---|
| `fastdeploy/model_executor/layers/quantization/mxfp4.py` | New file implementing the MXFP4 quantization method with weight creation, loading, and inference logic for MoE layers |
| `fastdeploy/model_executor/layers/quantization/__init__.py` | Registers MXFP4 as a supported quantization method in the framework |
| `fastdeploy/model_executor/models/gpt_oss.py` | Updates weight loading logic to handle MXFP4 quantized weights and adds norm layer exclusion from quantization |
| `fastdeploy/model_executor/layers/moe/moe.py` | Adds MXFP4-specific weight transformation logic in the fused experts weight loader |
| `fastdeploy/model_executor/layers/linear.py` | Integrates the module filtering check before applying quantization |
| `fastdeploy/model_executor/layers/normalization.py` | Adds module filtering to conditionally apply quantization parameters based on exclusion patterns |
| `fastdeploy/model_executor/layers/utils.py` | Introduces the `modules_to_convert` utility function to check whether a module should be quantized |
| `fastdeploy/utils.py` | Modifies the unified checkpoint file counting logic (appears to be a fix for an off-by-one error) |
| `fastdeploy/envs.py` | Adds an environment variable to control FlashInfer MXFP4 backend usage |
| `fastdeploy/config.py` | Adds a special case for GPT-OSS model format detection |
Comments suppressed due to low confidence (1)
fastdeploy/model_executor/layers/quantization/mxfp4.py:250 (`block_size = 32`): This statement is unreachable.
```python
    return vocab_range_from_per_partition_vocab_size(per_partition_vocab_size, rank, offset=offset)


def modules_to_convert(prefix: str, fd_config: FDConfig):
```
Copilot AI · Dec 5, 2025
The function modules_to_convert lacks a docstring. Please add documentation explaining the purpose, parameters, and return value. For example: "Check if a module should be converted for quantization based on exclusion patterns in the configuration."
Suggested change:

```python
def modules_to_convert(prefix: str, fd_config: FDConfig):
    """
    Determine whether a module should be converted for quantization based on exclusion patterns in the configuration.

    Args:
        prefix (str): The prefix or name of the module to check.
        fd_config (FDConfig): The FastDeploy configuration object containing model and quantization settings.

    Returns:
        bool: True if the module should be converted for quantization, False if it should be excluded.
    """
```
```python
    safetensors_num = int(model_files[0].strip(".safetensors").split("-")[-1]) + 1
    flags = [0] * safetensors_num
    for x in model_files:
        current_index = int(x.strip(".safetensors").split("-")[1])
```
Copilot AI · Dec 5, 2025
Multiple issues with this line:

- `.strip(".safetensors")` is incorrect: it removes any of those characters from both ends, not the suffix. Use `.removesuffix(".safetensors")` or `.replace(".safetensors", "")` instead.
- For a file like "model-00001-of-00008.safetensors", after stripping the suffix, splitting by "-" gives ["model", "00001", "of", "00008"], so `[-1]` gets "00008". Converting to int gives 8, and adding 1 gives 9, which is incorrect if there are 8 files total (indexed 1-8).
- The logic assumes the last part is the total count, but it should extract from the "of-XXXXX" part using `split("-")[-1]` without the `+ 1`, or verify the intended file naming convention.
Suggested change:

```python
    # Extract total number of files from filename like "model-00001-of-00008.safetensors"
    match = re.match(r"model-(\d+)-of-(\d+)\.safetensors$", model_files[0])
    if not match:
        raise ValueError(f"Filename {model_files[0]} does not match expected pattern.")
    safetensors_num = int(match.group(2))
    flags = [0] * safetensors_num
    for x in model_files:
        match = re.match(r"model-(\d+)-of-(\d+)\.safetensors$", x)
        if not match:
            raise ValueError(f"Filename {x} does not match expected pattern.")
        current_index = int(match.group(1))
```
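A quick, self-contained illustration of the `str.strip` pitfall called out above (plain Python, independent of the PR's code):

```python
# str.strip treats its argument as a *set of characters*, not a suffix,
# so it keeps removing matching characters from both ends of the string.
print("weights.safetensors".strip(".safetensors"))               # "weigh" (trailing "ts" also removed)
print("model-00001-of-00008.safetensors".strip(".safetensors"))  # "model-00001-of-00008" (works only by luck)

# str.removesuffix (Python 3.9+) removes exactly the given suffix, once.
print("weights.safetensors".removesuffix(".safetensors"))        # "weights"
```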
```python
                **extra_kwargs,
            )

            return output[..., : layer.hidden_size]
```
Copilot AI · Dec 5, 2025
The apply method only handles the SM90_FI_MXFP4_BF16 backend case. When self.mxfp4_backend is TRITON, the function will implicitly return None, which is likely incorrect behavior. Add an else clause to either handle the TRITON backend or raise a NotImplementedError with a clear message.
Suggested change:

```python
            return output[..., : layer.hidden_size]
        else:
            raise NotImplementedError(
                f"MXFP4 backend '{self.mxfp4_backend}' is not supported in MXFP4MoeMethod.apply()."
            )
```
```python
                # tp_size=self.moe.tp_size,
                # tp_rank=self.moe.tp_rank,
                # ep_size=self.moe.ep_size,
                # ep_rank=self.moe.ep_rank,
```
Copilot AI · Dec 5, 2025
Commented-out code should be removed. If these parameters (tp_size, tp_rank, ep_size, ep_rank) might be needed in the future, document this with a TODO comment explaining why they're currently disabled.
Suggested change:

```python
                # TODO: The parameters tp_size, tp_rank, ep_size, and ep_rank are not currently passed to flashinfer_cutlass_fused_moe.
                # Enable these when distributed MoE support is implemented and tested in the backend.
```
```python
class MXFP4Config(QuantConfigBase):
    """Base class for quantization configs."""
```
Copilot AI · Dec 5, 2025
The docstring "Base class for quantization configs." is incorrect. This is not a base class but rather a specific implementation for MXFP4 quantization. Update to something like "Configuration for MXFP4 quantization."
| """Base class for quantization configs.""" | |
| """Configuration for MXFP4 quantization.""" |
```python
    # Set moe backend."cutlass","marlin" and "triton" can be set currently.
    "FD_MOE_BACKEND": lambda: os.getenv("FD_MOE_BACKEND", "cutlass"),
    # Whether to use FLASHINFER as MXFP4 backend for MoE.
    "FD_USE_FLASHINFER_MOE_MXFP4_BF16": lambda: os.getenv("FD_USE_FLASHINFER_MOE_MXFP4_BF16", "0"),
```
Copilot AI · Dec 5, 2025
For consistency with other boolean environment variables in this file (e.g., FD_USE_DEEP_GEMM, FD_USE_HF_TOKENIZER), the value should be converted to a boolean using the `bool(int(os.getenv(...)))` pattern instead of returning the string "0". This ensures consistent handling of boolean environment variables throughout the codebase.
| "FD_USE_FLASHINFER_MOE_MXFP4_BF16": lambda: os.getenv("FD_USE_FLASHINFER_MOE_MXFP4_BF16", "0"), | |
| "FD_USE_FLASHINFER_MOE_MXFP4_BF16": lambda: bool(int(os.getenv("FD_USE_FLASHINFER_MOE_MXFP4_BF16", "0"))), |
```python
                model_sublayer_name = re.sub(r"\.(up_gate_proj_weight|down_proj_weight|weight)$", "", model_param_name)
                process_weights_after_loading_fn(model_sublayer_name, param)
```
Copilot AI · Dec 5, 2025
The variable model_param_name may be undefined if the loop exits without matching any parameters. This can happen when matched is False after all checks and model_param_name not in params_dict evaluates to True (line 325 continues the loop). This would cause a NameError at line 332. Consider initializing model_param_name before the loops or restructuring the logic to ensure it's always defined when used.
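One possible restructuring, sketched against a simplified version of such a loading loop (the names `weights_iterator`, `name_mappings`, `params_dict`, and `process_weights_after_loading_fn` stand in for the PR's actual variables and are assumptions here): define `model_param_name` before the inner matching loop and skip the post-processing whenever nothing matched.

```python
import re

for loaded_name, loaded_weight in weights_iterator:
    model_param_name = None  # always defined, even if no mapping matches below
    for weight_name, param_name in name_mappings:
        if weight_name not in loaded_name:
            continue
        model_param_name = loaded_name.replace(weight_name, param_name)
        break
    if model_param_name is None or model_param_name not in params_dict:
        continue  # nothing matched for this checkpoint entry; skip post-processing too
    param = params_dict[model_param_name]
    param.weight_loader(param, loaded_weight)
    model_sublayer_name = re.sub(r"\.(up_gate_proj_weight|down_proj_weight|weight)$", "", model_param_name)
    process_weights_after_loading_fn(model_sublayer_name, param)
```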
| ("up_gate_proj_weight", "gate_up_proj", None, None), | ||
| ("down_proj_weight", "down_proj", None, None), | ||
| ("up_gate_proj_weight", "gate_up_proj_blocks", None, None), | ||
| ("up_gate_proj_scale", "gate_up_proj_scales", None, None), | ||
| ("down_proj_weight", "down_proj_blocks", None, None), | ||
| ("down_proj_scale", "down_proj_scales", None, None), |
Copilot AI · Dec 5, 2025
The expert_params_mapping list contains duplicate entries for up_gate_proj_weight and down_proj_weight. Lines 277-278 map to "gate_up_proj" and "down_proj", while lines 279-282 map to the quantized variants ("gate_up_proj_blocks", "gate_up_proj_scales", etc.). During weight loading, the first match will always win due to the break statement at line 321, meaning the quantized weight mappings (lines 279-282) may never be matched. Consider using a more specific matching condition or restructuring the logic to handle both quantized and non-quantized weights properly.
| ("up_gate_proj_weight", "gate_up_proj", None, None), | |
| ("down_proj_weight", "down_proj", None, None), | |
| ("up_gate_proj_weight", "gate_up_proj_blocks", None, None), | |
| ("up_gate_proj_scale", "gate_up_proj_scales", None, None), | |
| ("down_proj_weight", "down_proj_blocks", None, None), | |
| ("down_proj_scale", "down_proj_scales", None, None), | |
| # Quantized variants first | |
| ("up_gate_proj_weight", "gate_up_proj_blocks", None, None), | |
| ("up_gate_proj_scale", "gate_up_proj_scales", None, None), | |
| ("down_proj_weight", "down_proj_blocks", None, None), | |
| ("down_proj_scale", "down_proj_scales", None, None), | |
| # Non-quantized variants after | |
| ("up_gate_proj_weight", "gate_up_proj", None, None), | |
| ("down_proj_weight", "down_proj", None, None), |
```python
                )
                h2d_copy(dst=param, src=loaded_weight)

        if self.moe_quant_config.name() == "mxfp4":
```
Copilot AI · Dec 5, 2025
The code accesses self.moe_quant_config.name() without checking if moe_quant_config exists or is not None. This could raise an AttributeError if moe_quant_config is None. Consider adding a null check: if self.moe_quant_config and self.moe_quant_config.name() == "mxfp4":
Suggested change:

```python
        if self.moe_quant_config and self.moe_quant_config.name() == "mxfp4":
```
```python
        is_checkpoint_bf16 = not config.get("is_quantized", False)
        return cls(is_checkpoint_bf16)

    def get_quant_method(self, layer) -> Optional[QuantMethodBase]:
```
Copilot AI · Dec 5, 2025
This method requires 2 positional arguments, whereas overridden QuantConfigBase.get_quant_method requires 3.
Suggested change:

```python
    def get_quant_method(self, layer, weight_attr=None) -> Optional[QuantMethodBase]:
```
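A standalone illustration of why the narrower override is a problem (generic Python classes, not the PR's actual ones): any caller written against the base-class signature passes both arguments and fails on the subclass.

```python
class Base:
    def get_quant_method(self, layer, weight_attr):
        raise NotImplementedError


class Narrowed(Base):
    def get_quant_method(self, layer):  # drops a positional parameter from the base signature
        return "quant-method"


cfg = Narrowed()
try:
    cfg.get_quant_method(object(), "weight_attr")  # call site written against Base
except TypeError as exc:
    print(exc)  # get_quant_method() takes 2 positional arguments but 3 were given
```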
Motivation
This PR adds support for native GPT-OSS-120B and GPT-OSS-20B models using MXFP4 offline quantization, leveraging FlashInfer as the backend for MoE components.
Modifications
Usage or Command
Accuracy Tests
Checklist
- Choose an appropriate PR title tag from: [FDConfig], [APIServer], [Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]
- Run `pre-commit` before commit.
- For a PR targeting the `release` branch, make sure the PR has been submitted to the `develop` branch, then cherry-pick it to the `release` branch with the `[Cherry-Pick]` PR tag.