Can we configure custom training for LLaDA-MoE variants, e.g. by adding MoE-specific YAML parameters (such as expert-routing settings) and integrating with VeOmni?
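To make the question concrete, here is roughly the kind of config I have in mind. All key names below are placeholders I made up for illustration, not actual VeOmni or LLaDA-MoE options:

```yaml
# Hypothetical sketch — key names are guesses, not confirmed VeOmni config keys.
model:
  name: llada-moe
  moe:
    num_experts: 64            # total experts per MoE layer
    num_experts_per_tok: 8     # top-k expert routing
    router_aux_loss_coef: 0.01 # load-balancing auxiliary loss weight
    router_z_loss_coef: 0.001  # router z-loss weight
training:
  micro_batch_size: 4
  learning_rate: 1.0e-4
```

Is there a supported way to pass MoE routing options like these through the training config, or does it require code changes?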
On a related note, in my experiments with LLaDA-MoE using the paper's exact settings, the z-loss (the noise-prediction component) rises steadily as training iterations progress.
