wrong loss_mask processing in _process_chat func of pretrain/onerec_llm/data/qwen3_dataset.py? #38


Hi, thanks for the great work and contribution to the open-source community!

While reviewing the data processing logic in `qwen3_dataset.py`, I noticed a potential issue in the `_process_chat` function used for SFT data.

- Specifically, when `_get_assistant_mask` is called, the `start_pattern` and `end_pattern` arguments are explicitly passed as `[151644, 872, 198]` and `[151645, 198, 151645]`. These override the defaults defined in `_get_assistant_mask` (`start_pattern=[151644, 77091, 198]` and `end_pattern=[151645, 198]`). [code that sets loss_mask](https://github.com/Kuaishou-OneRec/OpenOneRec/blob/a0d35470c6a03d1b8f9174fec53a9aed0b38bc00/pretrain/onerec_llm/data/qwen3_dataset.py#L333-L337)

  ```python
  inputs["loss_mask"] = self._get_assistant_mask(
      input_ids,
      start_pattern=[self.im_start_token_id, 872, 198],  # <|im_start|>assistant
      end_pattern=[self.im_end_token_id, 198, self.im_end_token_id]  # <|im_end|>
  )
  ```
- Although the comment indicates that the intended `start_pattern` corresponds to `"<|im_start|>assistant\n"`, decoding with the Qwen3-1.7B/8B tokenizer shows that token 872 is "user", not "assistant" (the default pattern uses 77091 in that position). As a result, the effective `start_pattern` is `"<|im_start|>user\n"` instead of `"<|im_start|>assistant\n"`.
- This leads to the `loss_mask` being set to 1 for user-related content. Since `loss_mask` is later used in `train_qwen3.py` to construct the training labels, user tokens are also included in the loss computation.
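
To make the concern concrete, here is a minimal, self-contained sketch of a pattern-based mask. It is not the repository's `_get_assistant_mask`: the function `span_mask`, the toy token IDs 1001-2003, and the choice to include the end-pattern tokens in the mask are all assumptions for illustration. Only the special-token IDs are taken from the patterns quoted above.

```python
from typing import List

# IDs quoted in this issue; the roles (user/assistant/newline) follow the
# tokenizer check described above.
IM_START, IM_END = 151644, 151645
USER, ASSISTANT, NEWLINE = 872, 77091, 198


def _matches(ids: List[int], pos: int, pattern: List[int]) -> bool:
    """True if `pattern` occurs in `ids` starting at index `pos`."""
    return ids[pos:pos + len(pattern)] == pattern


def span_mask(ids: List[int], start_pattern: List[int],
              end_pattern: List[int]) -> List[int]:
    """Hypothetical stand-in for _get_assistant_mask.

    Marks with 1 every token after a start_pattern match, up to and
    including the next end_pattern match (assumed behavior). If no end
    pattern is found, the span runs to the end of the sequence.
    """
    mask = [0] * len(ids)
    i = 0
    while i < len(ids):
        if _matches(ids, i, start_pattern):
            j = i + len(start_pattern)
            while j < len(ids) and not _matches(ids, j, end_pattern):
                mask[j] = 1
                j += 1
            end = min(j + len(end_pattern), len(ids))
            for k in range(j, end):
                mask[k] = 1
            i = end
        else:
            i += 1
    return mask


# Toy conversation: one user turn (1001, 1002), one assistant turn (2001-2003).
ids = [
    IM_START, USER, NEWLINE, 1001, 1002, IM_END, NEWLINE,
    IM_START, ASSISTANT, NEWLINE, 2001, 2002, 2003, IM_END, NEWLINE,
]

# Defaults from _get_assistant_mask as quoted above.
default_mask = span_mask(ids, [IM_START, ASSISTANT, NEWLINE], [IM_END, NEWLINE])

# Patterns passed in _process_chat as quoted above.
issue_mask = span_mask(ids, [IM_START, USER, NEWLINE], [IM_END, NEWLINE, IM_END])

print("default:", default_mask)
print("issue:  ", issue_mask)
```

In this sketch the default patterns mark only the assistant reply and its closing `<|im_end|>\n`, while the patterns from `_process_chat` mark the user content and, because `[151645, 198, 151645]` never occurs in the toy sequence, everything after it as well. The real function may handle a missing end pattern differently, so this only illustrates the start-pattern concern.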

Could you please clarify whether this behavior is intended? What is the rationale for setting `start_pattern` to `[151644, 872, 198]` and `end_pattern` to `[151645, 198, 151645]`? Additionally, were the experimental results reported in the paper obtained using this same configuration?

I'd really appreciate your clarifications! Thanks in advance!


