I noticed that in the code implementation, the expert network INRNet is just an FC layer with positional embedding, and no separate layers are added for each expert sub-network. However, the article states: "To downsize the whole MoE layer, we share the positional embedding and the first 4 layers among all expert networks. Then we append two independent layers for each expert. We note this design can make two experts share the early-stage features and adjust their coherence."

How can this be explained? Code-wise it is hardly an MoE; it looks more like an MLP layer with sparse coding.
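For context, my reading of the quoted passage is roughly the sketch below (the class name, layer widths, activation, and the linear positional embedding are my own placeholders, not the repository's actual INRNet code):

```python
import torch
import torch.nn as nn

class SharedTrunkMoE(nn.Module):
    """Sketch of the layout described in the paper: a shared positional
    embedding and 4 shared FC layers, plus 2 independent layers per expert."""

    def __init__(self, in_dim=2, embed_dim=64, hidden=256, out_dim=3, num_experts=2):
        super().__init__()
        # shared positional embedding (placeholder: a learned linear lift)
        self.pos_embed = nn.Linear(in_dim, embed_dim)
        # first 4 layers shared among all experts
        self.shared = nn.Sequential(
            nn.Linear(embed_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        # two independent layers appended for each expert
        self.experts = nn.ModuleList([
            nn.Sequential(
                nn.Linear(hidden, hidden), nn.ReLU(),
                nn.Linear(hidden, out_dim),
            )
            for _ in range(num_experts)
        ])

    def forward(self, coords, expert_idx):
        feats = self.shared(self.pos_embed(coords))  # early-stage features shared by all experts
        return self.experts[expert_idx](feats)       # expert-specific head
```

What I see in the repo instead looks like a single shared MLP with no per-expert branches, which is why I am asking how the released code corresponds to the design described in the paper.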