Dear LLaDA2.0 developers,
I noticed that at the SFT stage you pad all sequences with the <|endoftext|> token to a fixed length (2048 or 4096), which can be much larger than the original sequence length (https://github.com/inclusionAI/dFactory/blob/main/tasks/dataset/data_transform.py#L136). Moreover, you include these padding tokens in the attention_mask and compute the loss on them as well. In your earlier paper on LLaDA-MoE (https://arxiv.org/abs/2509.24389) you even mention that if max_length is too large (8192), the model learns to predict too many <|endoftext|> tokens, which hurts the metrics.
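For context, here is my simplified reading of what the current transform effectively does (function and variable names are illustrative, not the actual dFactory code):

```python
def pad_to_fixed_length(input_ids, max_length, eos_token_id):
    """Pad with <|endoftext|> to max_length; padding stays in attention and loss."""
    pad_len = max_length - len(input_ids)
    padded_input_ids = input_ids + [eos_token_id] * pad_len
    attention_mask = [1] * max_length      # padding positions are still attended to
    labels = list(padded_input_ids)        # loss is also computed on the padding tokens
    return {"input_ids": padded_input_ids, "attention_mask": attention_mask, "labels": labels}
```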
My question is: why not pad sequences with a dedicated <|pad|> token (distinct from <|endoftext|>) and exclude it from the attention mask and the loss computation? A rough sketch of what I mean is below.
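A minimal sketch of the alternative I have in mind (again with hypothetical names, assuming a dedicated pad_token_id for <|pad|> and the usual -100 ignore index for the loss):

```python
def pad_with_pad_token(input_ids, labels, max_length, pad_token_id):
    """Pad with <|pad|> and mask the padding out of both attention and loss."""
    pad_len = max_length - len(input_ids)
    padded_input_ids = input_ids + [pad_token_id] * pad_len
    attention_mask = [1] * len(input_ids) + [0] * pad_len  # padding excluded from attention
    padded_labels = labels + [-100] * pad_len              # -100 => excluded from the loss
    return {"input_ids": padded_input_ids, "attention_mask": attention_mask, "labels": padded_labels}
```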
Best regards,
Mikhail