Hi, thanks for the great work and contribution to the open-source community!
While reviewing the data processing logic in qwen3_dataset.py, I noticed a potential issue in the _process_chat function used for SFT data.
- Specifically, when `_get_assistant_mask` is called, the `start_pattern` and `end_pattern` arguments are explicitly passed as `[151644, 872, 198]` and `[151645, 198, 151645]`, which override the default values defined in `_get_assistant_mask` (`start_pattern=[151644, 77091, 198]` and `end_pattern=[151645, 198]`). The code that sets `loss_mask`:

      inputs["loss_mask"] = self._get_assistant_mask(
          input_ids,
          start_pattern=[self.im_start_token_id, 872, 198],  # <|im_start|>assistant
          end_pattern=[self.im_end_token_id, 198, self.im_end_token_id]  # <|im_end|>
      )

- Although the comment indicates that the intended `start_pattern` corresponds to `"<|im_start|>assistant\n"`, using the tokenizer of qwen3-1.7b/8b shows that token 872 corresponds to "user" rather than "assistant". As a result, the effective `start_pattern` becomes `"<|im_start|>user\n"` instead of `"<|im_start|>assistant\n"`.
- This leads to `loss_mask` being set to 1 for user-related content. Since `loss_mask` is later used in `train_qwen3.py` to construct the training labels, user tokens are also included in the loss computation.
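To make the concern concrete, here is a minimal, self-contained sketch of a pattern-based mask. It is not the repository's `_get_assistant_mask`; the helper names `find_pattern` and `span_mask` are hypothetical, and two behaviours are assumptions on my part: the mask covers the tokens between the start and end patterns, and a span with no matching end pattern runs to the end of the sequence. The content token IDs (11, 12, 21, 22, 23) are placeholders.

```python
from typing import List

# Token IDs quoted in this issue for the Qwen3 tokenizer.
IM_START, IM_END, NEWLINE = 151644, 151645, 198
USER, ASSISTANT = 872, 77091


def find_pattern(seq: List[int], pattern: List[int], start: int) -> int:
    """Return the first index >= start where pattern occurs in seq, or -1."""
    n = len(pattern)
    for i in range(start, len(seq) - n + 1):
        if seq[i:i + n] == pattern:
            return i
    return -1


def span_mask(input_ids: List[int],
              start_pattern: List[int],
              end_pattern: List[int]) -> List[int]:
    """Hypothetical stand-in: mark tokens between start_pattern and end_pattern."""
    mask = [0] * len(input_ids)
    pos = 0
    while True:
        s = find_pattern(input_ids, start_pattern, pos)
        if s < 0:
            break
        content_start = s + len(start_pattern)
        e = find_pattern(input_ids, end_pattern, content_start)
        # Assumption: an unterminated span extends to the end of the sequence.
        content_end = len(input_ids) if e < 0 else e
        for i in range(content_start, content_end):
            mask[i] = 1
        pos = content_end
    return mask


# <|im_start|>user\n 11 12 <|im_end|>\n <|im_start|>assistant\n 21 22 23 <|im_end|>\n
ids = [IM_START, USER, NEWLINE, 11, 12, IM_END, NEWLINE,
       IM_START, ASSISTANT, NEWLINE, 21, 22, 23, IM_END, NEWLINE]

# Defaults described in the issue: only assistant content is selected.
default_mask = span_mask(ids, [IM_START, ASSISTANT, NEWLINE], [IM_END, NEWLINE])

# Patterns passed in _process_chat: the span starts after "<|im_start|>user\n".
issue_mask = span_mask(ids, [IM_START, USER, NEWLINE], [IM_END, NEWLINE, IM_END])

print(default_mask)
print(issue_mask)
```

Under these assumptions, the default patterns select only the assistant tokens, while the patterns passed in `_process_chat` select the user tokens at positions 3 and 4. The actual implementation may handle boundaries differently, so this only illustrates the reasoning above.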
Could you please clarify whether this behavior is intended? What is the rationale for setting start_pattern to [151644, 872, 198] and end_pattern to [151645, 198, 151645]? Additionally, were the experimental results reported in the paper obtained using this same configuration?
I'd really appreciate your clarifications! Thanks in advance!