What's an example where two different tensors would have different values of width_mult? #5
It seems to me that for the learning-rate scaling etc. to work out properly, all infinite dimensions should scale linearly with n, so for each d_i you have something like d_i = k_i * n. The width_mult value seems to be calculated by dividing this quantity for two different values of n, so shouldn't that give n_model / n_base for every tensor, regardless of its initial size?
Hi @davisyoshida, an example would be if you double the d_model of a Transformer but quadruple the d_ffn (where the MLP dimensions are d_model -> d_ffn -> d_model). Because we calculate `width_mult` using the fan-in dimension, the first `nn.Linear.weight` in the MLP layer would have `width_mult=2`, but the second `nn.Linear.weight` would have `width_mult=4`. Nevertheless, as we demonstrate in our paper, we should expect hyperparameters to transfer even in this case.
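To make the example concrete, here is a minimal sketch of that calculation (not the actual mup API; the parameter names and base shapes are illustrative assumptions). It computes each weight's width_mult as the ratio of its fan-in in the scaled model to its fan-in in the base model:

```python
# Hypothetical sketch: width_mult = fan-in of the target model's weight
# divided by the fan-in of the corresponding base-model weight.
d_model_base, d_ffn_base = 128, 512   # base shapes (illustrative)
d_model, d_ffn = 256, 2048            # d_model doubled, d_ffn quadrupled

# The MLP is d_model -> d_ffn -> d_model, so the fan-in of the first
# linear layer is d_model and the fan-in of the second is d_ffn.
fan_in_base = {"mlp.fc1.weight": d_model_base, "mlp.fc2.weight": d_ffn_base}
fan_in = {"mlp.fc1.weight": d_model, "mlp.fc2.weight": d_ffn}

width_mult = {name: fan_in[name] / fan_in_base[name] for name in fan_in}
print(width_mult)  # {'mlp.fc1.weight': 2.0, 'mlp.fc2.weight': 4.0}
```

Because the two layers scale their fan-in by different factors, they end up with different width_mult values even though both dimensions are "infinite" in the μP sense.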