--num-gpus is implemented by sharding each expert layer across GPUs, i.e. expert parallelism (EP).
This is probably not advisable for local experimentation, especially at batch size 1, where EP only adds communication overhead and gives no speed benefit over naive model/pipeline parallelism.
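
To make the overhead argument concrete, here is a rough toy model in plain Python. It is a sketch, not the actual implementation: the contiguous expert placement, the function names, and all the numbers (32 experts, 4 GPUs, top-4 routing) are hypothetical and chosen only for illustration. It counts how many times a single token's activations cross a GPU boundary under each scheme.

```python
def expert_to_gpu(expert_id, num_experts, num_gpus):
    """Assumed placement: contiguous blocks of experts per GPU."""
    assert num_experts % num_gpus == 0, "toy model needs an even split"
    per_gpu = num_experts // num_gpus
    return expert_id // per_gpu


def ep_transfers_per_layer(routed_experts, num_experts, num_gpus, home_gpu=0):
    """Cross-GPU hops for one token in one expert layer under EP.

    The token is dispatched to every remote GPU that hosts one of its
    routed experts, and each result is sent back to be combined.
    """
    gpus_hit = {expert_to_gpu(e, num_experts, num_gpus) for e in routed_experts}
    remote = gpus_hit - {home_gpu}
    return 2 * len(remote)  # one dispatch + one combine per remote GPU


def pipeline_transfers_total(num_gpus):
    """Cross-GPU hops for one token through the whole model when layers
    are split into contiguous stages: one hop per stage boundary."""
    return num_gpus - 1


# Hypothetical example: a token routed to one expert on each of 4 GPUs.
routed = [1, 9, 17, 30]
per_layer = ep_transfers_per_layer(routed, num_experts=32, num_gpus=4)
whole_model = pipeline_transfers_total(num_gpus=4)
print(per_layer, whole_model)
```

With these made-up numbers, EP pays 6 hops in every expert layer, while the pipeline split pays 3 hops for the entire forward pass. At batch size 1 there is no extra work to spread across experts, so in this simplified picture the additional hops are pure cost.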