I noticed that in the code implementation, the expert network INRNet is just an FC layer with positional embedding, and no separate layers are added for each expert sub-network. However, the article states: "To downsize the whole MoE layer, we share the positional embedding and the first 4 layers among all expert networks. Then we append two independent layers for each expert. We note this design can make two experts share the early-stage features and adjust their coherence."

How can this be explained? Code-wise it is hardly an MoE; it looks more like an MLP layer with sparse coding.
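For context, my reading of the quoted passage is roughly the sketch below (the class name, layer widths, activation, and the linear positional embedding are my own placeholders, not the repository's actual INRNet code):

```python
import torch
import torch.nn as nn

class SharedTrunkMoE(nn.Module):
    """Sketch of the layout described in the paper: a shared positional
    embedding and 4 shared FC layers, plus 2 independent layers per expert."""

    def __init__(self, in_dim=2, embed_dim=64, hidden=256, out_dim=3, num_experts=2):
        super().__init__()
        # shared positional embedding (placeholder: a learned linear lift)
        self.pos_embed = nn.Linear(in_dim, embed_dim)
        # first 4 layers shared among all experts
        self.shared = nn.Sequential(
            nn.Linear(embed_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        # two independent layers appended for each expert
        self.experts = nn.ModuleList([
            nn.Sequential(
                nn.Linear(hidden, hidden), nn.ReLU(),
                nn.Linear(hidden, out_dim),
            )
            for _ in range(num_experts)
        ])

    def forward(self, coords, expert_idx):
        feats = self.shared(self.pos_embed(coords))  # early-stage features shared by all experts
        return self.experts[expert_idx](feats)       # expert-specific head
```

What I see in the repo instead looks like a single shared MLP with no per-expert branches, which is why I am asking how the released code corresponds to the design described in the paper.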