Support MXFP4 for GPT-OSS #5409
Conversation
Thanks for your contribution!
Pull request overview
This PR adds support for MXFP4 quantization for GPT-OSS models (120B and 20B variants) with FlashInfer as the MoE backend. The implementation includes a new quantization method, model-specific weight loading logic, and integration with the existing quantization framework.
Key Changes:
- Adds MXFP4 quantization implementation with FlashInfer and Triton backend support
- Updates GPT-OSS model to handle MXFP4 weight loading with specialized logic for MoE expert weights
- Introduces module filtering capability to exclude certain layers (e.g., normalization) from quantization
Reviewed changes
Copilot reviewed 10 out of 10 changed files in this pull request and generated 18 comments.
| File | Description |
|---|---|
| `fastdeploy/model_executor/layers/quantization/mxfp4.py` | New file implementing the MXFP4 quantization method with weight creation, loading, and inference logic for MoE layers |
| `fastdeploy/model_executor/layers/quantization/__init__.py` | Registers MXFP4 as a supported quantization method in the framework |
| `fastdeploy/model_executor/models/gpt_oss.py` | Updates weight loading logic to handle MXFP4 quantized weights and adds norm layer exclusion from quantization |
| `fastdeploy/model_executor/layers/moe/moe.py` | Adds MXFP4-specific weight transformation logic in the fused experts weight loader |
| `fastdeploy/model_executor/layers/linear.py` | Integrates the module filtering check before applying quantization |
| `fastdeploy/model_executor/layers/normalization.py` | Adds module filtering to conditionally apply quantization parameters based on exclusion patterns |
| `fastdeploy/model_executor/layers/utils.py` | Introduces the `modules_to_convert` utility function to check whether a module should be quantized |
| `fastdeploy/utils.py` | Modifies the unified checkpoint file counting logic (appears to be a fix for an off-by-one error) |
| `fastdeploy/envs.py` | Adds an environment variable to control FlashInfer MXFP4 backend usage |
| `fastdeploy/config.py` | Adds a special case for GPT-OSS model format detection |
Comments suppressed due to low confidence (1)
fastdeploy/model_executor/layers/quantization/mxfp4.py:250 (`block_size = 32`): This statement is unreachable.
```python
    return vocab_range_from_per_partition_vocab_size(per_partition_vocab_size, rank, offset=offset)


def modules_to_convert(prefix: str, fd_config: FDConfig):
```
Copilot AI · Dec 5, 2025
The function modules_to_convert lacks a docstring. Please add documentation explaining the purpose, parameters, and return value. For example: "Check if a module should be converted for quantization based on exclusion patterns in the configuration."
Suggested change:

```python
def modules_to_convert(prefix: str, fd_config: FDConfig):
    """
    Determine whether a module should be converted for quantization based on exclusion patterns in the configuration.

    Args:
        prefix (str): The prefix or name of the module to check.
        fd_config (FDConfig): The FastDeploy configuration object containing model and quantization settings.

    Returns:
        bool: True if the module should be converted for quantization, False if it should be excluded.
    """
```
```python
    safetensors_num = int(model_files[0].strip(".safetensors").split("-")[-1]) + 1
    flags = [0] * safetensors_num
    for x in model_files:
        current_index = int(x.strip(".safetensors").split("-")[1])
```
Copilot AI · Dec 5, 2025
Multiple issues with this line:

- `.strip(".safetensors")` is incorrect: it removes any of those characters from both ends, not the suffix. Use `.removesuffix(".safetensors")` or `.replace(".safetensors", "")` instead.
- For a file like "model-00001-of-00008.safetensors", after stripping the suffix, splitting by "-" gives ["model", "00001", "of", "00008"], so `[-1]` gets "00008". Converting to int gives 8, and adding 1 gives 9, which is incorrect if there are 8 files total (indexed 1-8).
- The logic assumes the last part is the total count, but it should extract from the "of-XXXXX" part using `split("-")[-1]` without the `+ 1`, or verify the intended file naming convention.
Suggested change:

```python
    # Extract total number of files from filename like "model-00001-of-00008.safetensors"
    match = re.match(r"model-(\d+)-of-(\d+)\.safetensors$", model_files[0])
    if not match:
        raise ValueError(f"Filename {model_files[0]} does not match expected pattern.")
    safetensors_num = int(match.group(2))
    flags = [0] * safetensors_num
    for x in model_files:
        match = re.match(r"model-(\d+)-of-(\d+)\.safetensors$", x)
        if not match:
            raise ValueError(f"Filename {x} does not match expected pattern.")
        current_index = int(match.group(1))
```
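A quick, self-contained illustration of the `str.strip` pitfall called out above (plain Python, independent of the PR's code):

```python
# str.strip treats its argument as a *set of characters*, not a suffix,
# so it keeps removing matching characters from both ends of the string.
print("weights.safetensors".strip(".safetensors"))               # "weigh" (trailing "ts" also removed)
print("model-00001-of-00008.safetensors".strip(".safetensors"))  # "model-00001-of-00008" (works only by luck)

# str.removesuffix (Python 3.9+) removes exactly the given suffix, once.
print("weights.safetensors".removesuffix(".safetensors"))        # "weights"
```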
```python
                **extra_kwargs,
            )

            return output[..., : layer.hidden_size]
```
Copilot AI · Dec 5, 2025
The apply method only handles the SM90_FI_MXFP4_BF16 backend case. When self.mxfp4_backend is TRITON, the function will implicitly return None, which is likely incorrect behavior. Add an else clause to either handle the TRITON backend or raise a NotImplementedError with a clear message.
Suggested change:

```python
            return output[..., : layer.hidden_size]
        else:
            raise NotImplementedError(
                f"MXFP4 backend '{self.mxfp4_backend}' is not supported in MXFP4MoeMethod.apply()."
            )
```
```python
                # tp_size=self.moe.tp_size,
                # tp_rank=self.moe.tp_rank,
                # ep_size=self.moe.ep_size,
                # ep_rank=self.moe.ep_rank,
```
Copilot AI · Dec 5, 2025
Commented-out code should be removed. If these parameters (tp_size, tp_rank, ep_size, ep_rank) might be needed in the future, document this with a TODO comment explaining why they're currently disabled.
Suggested change:

```python
                # TODO: The parameters tp_size, tp_rank, ep_size, and ep_rank are not currently passed to flashinfer_cutlass_fused_moe.
                # Enable these when distributed MoE support is implemented and tested in the backend.
```
```python
class MXFP4Config(QuantConfigBase):
    """Base class for quantization configs."""
```
Copilot AI · Dec 5, 2025
The docstring "Base class for quantization configs." is incorrect. This is not a base class but rather a specific implementation for MXFP4 quantization. Update to something like "Configuration for MXFP4 quantization."
| """Base class for quantization configs.""" | |
| """Configuration for MXFP4 quantization.""" |
```python
    # Set moe backend."cutlass","marlin" and "triton" can be set currently.
    "FD_MOE_BACKEND": lambda: os.getenv("FD_MOE_BACKEND", "cutlass"),
    # Whether to use FLASHINFER as MXFP4 backend for MoE.
    "FD_USE_FLASHINFER_MOE_MXFP4_BF16": lambda: os.getenv("FD_USE_FLASHINFER_MOE_MXFP4_BF16", "0"),
```
Copilot AI · Dec 5, 2025
For consistency with other boolean environment variables in this file (e.g., FD_USE_DEEP_GEMM, FD_USE_HF_TOKENIZER), the value should be converted to a boolean using the `bool(int(os.getenv(...)))` pattern instead of returning the string "0". This ensures consistent handling of boolean environment variables throughout the codebase.
| "FD_USE_FLASHINFER_MOE_MXFP4_BF16": lambda: os.getenv("FD_USE_FLASHINFER_MOE_MXFP4_BF16", "0"), | |
| "FD_USE_FLASHINFER_MOE_MXFP4_BF16": lambda: bool(int(os.getenv("FD_USE_FLASHINFER_MOE_MXFP4_BF16", "0"))), |
```python
                model_sublayer_name = re.sub(r"\.(up_gate_proj_weight|down_proj_weight|weight)$", "", model_param_name)
                process_weights_after_loading_fn(model_sublayer_name, param)
```
Copilot AI · Dec 5, 2025
The variable model_param_name may be undefined if the loop exits without matching any parameters. This can happen when matched is False after all checks and model_param_name not in params_dict evaluates to True (line 325 continues the loop). This would cause a NameError at line 332. Consider initializing model_param_name before the loops or restructuring the logic to ensure it's always defined when used.
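One possible restructuring, sketched against a simplified version of such a loading loop (the names `weights_iterator`, `name_mappings`, `params_dict`, and `process_weights_after_loading_fn` stand in for the PR's actual variables and are assumptions here): define `model_param_name` before the inner matching loop and skip the post-processing whenever nothing matched.

```python
import re

for loaded_name, loaded_weight in weights_iterator:
    model_param_name = None  # always defined, even if no mapping matches below
    for weight_name, param_name in name_mappings:
        if weight_name not in loaded_name:
            continue
        model_param_name = loaded_name.replace(weight_name, param_name)
        break
    if model_param_name is None or model_param_name not in params_dict:
        continue  # nothing matched for this checkpoint entry; skip post-processing too
    param = params_dict[model_param_name]
    param.weight_loader(param, loaded_weight)
    model_sublayer_name = re.sub(r"\.(up_gate_proj_weight|down_proj_weight|weight)$", "", model_param_name)
    process_weights_after_loading_fn(model_sublayer_name, param)
```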
| ("up_gate_proj_weight", "gate_up_proj", None, None), | ||
| ("down_proj_weight", "down_proj", None, None), | ||
| ("up_gate_proj_weight", "gate_up_proj_blocks", None, None), | ||
| ("up_gate_proj_scale", "gate_up_proj_scales", None, None), | ||
| ("down_proj_weight", "down_proj_blocks", None, None), | ||
| ("down_proj_scale", "down_proj_scales", None, None), |
Copilot AI · Dec 5, 2025
The expert_params_mapping list contains duplicate entries for up_gate_proj_weight and down_proj_weight. Lines 277-278 map to "gate_up_proj" and "down_proj", while lines 279-282 map to the quantized variants ("gate_up_proj_blocks", "gate_up_proj_scales", etc.). During weight loading, the first match will always win due to the break statement at line 321, meaning the quantized weight mappings (lines 279-282) may never be matched. Consider using a more specific matching condition or restructuring the logic to handle both quantized and non-quantized weights properly.
| ("up_gate_proj_weight", "gate_up_proj", None, None), | |
| ("down_proj_weight", "down_proj", None, None), | |
| ("up_gate_proj_weight", "gate_up_proj_blocks", None, None), | |
| ("up_gate_proj_scale", "gate_up_proj_scales", None, None), | |
| ("down_proj_weight", "down_proj_blocks", None, None), | |
| ("down_proj_scale", "down_proj_scales", None, None), | |
| # Quantized variants first | |
| ("up_gate_proj_weight", "gate_up_proj_blocks", None, None), | |
| ("up_gate_proj_scale", "gate_up_proj_scales", None, None), | |
| ("down_proj_weight", "down_proj_blocks", None, None), | |
| ("down_proj_scale", "down_proj_scales", None, None), | |
| # Non-quantized variants after | |
| ("up_gate_proj_weight", "gate_up_proj", None, None), | |
| ("down_proj_weight", "down_proj", None, None), |
```python
                )
                h2d_copy(dst=param, src=loaded_weight)

        if self.moe_quant_config.name() == "mxfp4":
```
Copilot AI · Dec 5, 2025
The code accesses self.moe_quant_config.name() without checking if moe_quant_config exists or is not None. This could raise an AttributeError if moe_quant_config is None. Consider adding a null check: if self.moe_quant_config and self.moe_quant_config.name() == "mxfp4":
Suggested change:

```python
        if self.moe_quant_config and self.moe_quant_config.name() == "mxfp4":
```
```python
        is_checkpoint_bf16 = not config.get("is_quantized", False)
        return cls(is_checkpoint_bf16)

    def get_quant_method(self, layer) -> Optional[QuantMethodBase]:
```
Copilot AI · Dec 5, 2025
This method requires 2 positional arguments, whereas overridden QuantConfigBase.get_quant_method requires 3.
Suggested change:

```python
    def get_quant_method(self, layer, weight_attr=None) -> Optional[QuantMethodBase]:
```
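A standalone illustration of why the narrower override is a problem (generic Python classes, not the PR's actual ones): any caller written against the base-class signature passes both arguments and fails on the subclass.

```python
class Base:
    def get_quant_method(self, layer, weight_attr):
        raise NotImplementedError


class Narrowed(Base):
    def get_quant_method(self, layer):  # drops a positional parameter from the base signature
        return "quant-method"


cfg = Narrowed()
try:
    cfg.get_quant_method(object(), "weight_attr")  # call site written against Base
except TypeError as exc:
    print(exc)  # get_quant_method() takes 2 positional arguments but 3 were given
```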
Motivation
This PR adds support for native GPT-OSS-120B and GPT-OSS-20B models using MXFP4 offline quantization, leveraging FlashInfer as the backend for MoE components.
Modifications
Usage or Command
Accuracy Tests
Checklist
- Choose an appropriate PR title tag from: [FDConfig], [APIServer], [Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]
- Run `pre-commit` before commit.
- For a PR targeting the `release` branch, make sure the PR has been submitted to the `develop` branch, then cherry-pick it to the `release` branch with the `[Cherry-Pick]` PR tag.