
Why not pad with a separate special token and ignore these pad tokens in loss? #14

@mishgon

Dear LLaDA2.0 developers,

I noticed that at the SFT stage you pad all sequences with the <|endoftext|> token up to a fixed length (2048 or 4096), which can be much larger than the original sequence length (https://github.com/inclusionAI/dFactory/blob/main/tasks/dataset/data_transform.py#L136). Moreover, you keep these padded positions in attention_mask and compute the loss on them as well. In your earlier paper on LLaDA-MoE (https://arxiv.org/abs/2509.24389) you even mention that when max_length is too large (8192), the model learns to predict too many <|endoftext|> tokens, which hurts the metrics.
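If I read data_transform.py correctly, the padding is roughly equivalent to the following simplified sketch (the constants here are placeholders, not the actual ids or values used in the repo):

```python
# Simplified sketch of the current padding, as I understand it.
# EOT_ID and MAX_LENGTH are placeholders, not the repo's actual values.
EOT_ID = 0          # stands for the id of <|endoftext|>
MAX_LENGTH = 4096   # fixed sequence length used at SFT

def pad_with_endoftext(input_ids: list[int]) -> dict:
    """Pad with <|endoftext|>; padded positions stay in attention_mask and in the loss."""
    pad_len = MAX_LENGTH - len(input_ids)
    padded = input_ids + [EOT_ID] * pad_len
    attention_mask = [1] * MAX_LENGTH   # padded positions are not masked out
    labels = list(padded)               # loss is computed on the padded positions as well
    return {"input_ids": padded, "attention_mask": attention_mask, "labels": labels}
```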

My question is: why not pad sequences with a <|pad|> token (distinct from <|endoftext|>) and exclude these positions from both attention and the loss computation, as in the sketch below?
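Here is the alternative I have in mind, again only a simplified sketch: <|pad|> / PAD_ID is a hypothetical dedicated padding token, and IGNORE_INDEX follows the common convention of marking labels with -100 so that cross-entropy implementations skip them.

```python
# Sketch of the alternative: a dedicated <|pad|> token excluded from attention and loss.
# PAD_ID, MAX_LENGTH and IGNORE_INDEX are placeholders / conventions, not values from the repo.
PAD_ID = 1            # hypothetical id of a dedicated <|pad|> token
MAX_LENGTH = 4096     # fixed sequence length used at SFT
IGNORE_INDEX = -100   # label value commonly ignored by cross-entropy implementations

def pad_with_pad_token(input_ids: list[int]) -> dict:
    """Pad with <|pad|>; padded positions are excluded from attention and from the loss."""
    pad_len = MAX_LENGTH - len(input_ids)
    padded = input_ids + [PAD_ID] * pad_len
    attention_mask = [1] * len(input_ids) + [0] * pad_len   # mask out padded positions
    labels = input_ids + [IGNORE_INDEX] * pad_len           # no loss on padded positions
    return {"input_ids": padded, "attention_mask": attention_mask, "labels": labels}
```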

Best regards,
Mikhail
