Thank you for publishing the code for this interesting paper!
I have one question regarding autoregressive generation.
Mamba models are causal by design, so it should be possible to train the decoder with teacher forcing. However, I see that this code uses cross-attention in the decoder. Why did you opt for this approach?
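To make my question concrete, here is a toy sketch of what I mean by teacher forcing with a causal model. The `causal_model` below is purely hypothetical (a running mean standing in for Mamba or any causal sequence model); the point is only that predictions at step t depend exclusively on inputs up to t, so the whole ground-truth sequence can be fed in one shifted pass:

```python
import numpy as np

# Hypothetical toy causal "model": its output at step t depends only on
# inputs 0..t (here, a running mean). Any causal architecture — Mamba,
# a causally masked Transformer — shares this property.
def causal_model(x):
    return np.cumsum(x) / np.arange(1, len(x) + 1)

# Teacher forcing: feed the ground-truth sequence, shifted right by one,
# in a single forward pass; the prediction for position t is then
# conditioned on ground truth up to t-1.
targets = np.array([1.0, 2.0, 3.0, 4.0])
inputs = np.concatenate(([0.0], targets[:-1]))  # shift right by one
preds = causal_model(inputs)

# Causality check: a prediction is unchanged if any *later* input changes.
perturbed = inputs.copy()
perturbed[3] = 99.0
assert np.allclose(causal_model(perturbed)[:3], preds[:3])
```

With this property, no separate cross-attention mechanism seems necessary for conditioning on previously generated tokens, which is what prompted the question above.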