
Why not pad with a separate special token and ignore these pad tokens in loss? #14

@mishgon

Dear LLaDA2.0 developers,

I noticed that at the SFT stage you pad all sequences with the <|endoftext|> token up to a fixed length (2048 or 4096), which can be much larger than the original sequence length (https://github.com/inclusionAI/dFactory/blob/main/tasks/dataset/data_transform.py#L136). Moreover, you keep these padded positions in attention_mask and compute the loss on them as well. In your earlier paper on LLaDA-MoE (https://arxiv.org/abs/2509.24389) you even mention that when max_length is too large (8192), the model learns to predict too many <|endoftext|> tokens, which hurts the metrics.
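If I read data_transform.py correctly, the padding is roughly equivalent to the following simplified sketch (the constants here are placeholders, not the actual ids or values used in the repo):

```python
# Simplified sketch of the current padding, as I understand it.
# EOT_ID and MAX_LENGTH are placeholders, not the repo's actual values.
EOT_ID = 0          # stands for the id of <|endoftext|>
MAX_LENGTH = 4096   # fixed sequence length used at SFT

def pad_with_endoftext(input_ids: list[int]) -> dict:
    """Pad with <|endoftext|>; padded positions stay in attention_mask and in the loss."""
    pad_len = MAX_LENGTH - len(input_ids)
    padded = input_ids + [EOT_ID] * pad_len
    attention_mask = [1] * MAX_LENGTH   # padded positions are not masked out
    labels = list(padded)               # loss is computed on the padded positions as well
    return {"input_ids": padded, "attention_mask": attention_mask, "labels": labels}
```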

My question is: why not pad sequences with a <|pad|> token (distinct from <|endoftext|>) and exclude these positions from both attention and the loss computation, as in the sketch below?
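Here is the alternative I have in mind, again only a simplified sketch: <|pad|> / PAD_ID is a hypothetical dedicated padding token, and IGNORE_INDEX follows the common convention of marking labels with -100 so that cross-entropy implementations skip them.

```python
# Sketch of the alternative: a dedicated <|pad|> token excluded from attention and loss.
# PAD_ID, MAX_LENGTH and IGNORE_INDEX are placeholders / conventions, not values from the repo.
PAD_ID = 1            # hypothetical id of a dedicated <|pad|> token
MAX_LENGTH = 4096     # fixed sequence length used at SFT
IGNORE_INDEX = -100   # label value commonly ignored by cross-entropy implementations

def pad_with_pad_token(input_ids: list[int]) -> dict:
    """Pad with <|pad|>; padded positions are excluded from attention and from the loss."""
    pad_len = MAX_LENGTH - len(input_ids)
    padded = input_ids + [PAD_ID] * pad_len
    attention_mask = [1] * len(input_ids) + [0] * pad_len   # mask out padded positions
    labels = input_ids + [IGNORE_INDEX] * pad_len           # no loss on padded positions
    return {"input_ids": padded, "attention_mask": attention_mask, "labels": labels}
```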

Best regards,
Mikhail
