Thank you for publishing the code for this interesting paper!
I have one question regarding autoregressive generation.
Mamba models are causal by design, so it should be possible to train the decoder with teacher forcing. However, I see that this code uses cross-attention in the decoder. Why did you opt for this approach?
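To make my question concrete, here is a toy sketch of what I mean by teacher forcing with a causal model. The `causal_model` below is purely hypothetical (a running mean standing in for Mamba or any causal sequence model); the point is only that predictions at step t depend exclusively on inputs up to t, so the whole ground-truth sequence can be fed in one shifted pass:

```python
import numpy as np

# Hypothetical toy causal "model": its output at step t depends only on
# inputs 0..t (here, a running mean). Any causal architecture — Mamba,
# a causally masked Transformer — shares this property.
def causal_model(x):
    return np.cumsum(x) / np.arange(1, len(x) + 1)

# Teacher forcing: feed the ground-truth sequence, shifted right by one,
# in a single forward pass; the prediction for position t is then
# conditioned on ground truth up to t-1.
targets = np.array([1.0, 2.0, 3.0, 4.0])
inputs = np.concatenate(([0.0], targets[:-1]))  # shift right by one
preds = causal_model(inputs)

# Causality check: a prediction is unchanged if any *later* input changes.
perturbed = inputs.copy()
perturbed[3] = 99.0
assert np.allclose(causal_model(perturbed)[:3], preds[:3])
```

With this property, no separate cross-attention mechanism seems necessary for conditioning on previously generated tokens, which is what prompted the question above.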