"In the LLaDA paper, it is clearly stated that the model is a diffusion model rather than an autoregressive model. However, I found that your code uses a lower triangular matrix mask, which introduces causal inference relationships and turns the model into an autoregressive one. Does this conflict with the core argument of the paper? Additionally, when I tried to remove this lower triangular matrix from the source code, the loss decreased very slowly, and the test accuracy after 5 epochs was 0.