What's an example where two different tensors would have different values of width_mult? #5
It seems to me that for the learning-rate scaling etc. to work out properly, all infinite dimensions should scale linearly with n, so for each d_i you have something like d_i = k_i * n. The width_mult value seems to be calculated by dividing this quantity for two different values of n, so shouldn't that give n_model / n_base for every tensor, regardless of its initial size?
Hi @davisyoshida, an example would be if you double the d_model of a Transformer but quadruple the d_ffn (where the MLP dimensions are d_model -> d_ffn -> d_model). Because we calculate `width_mult` using the fan-in dimension, the first `nn.Linear.weight` in the MLP layer would have `width_mult=2`, but the second `nn.Linear.weight` would have `width_mult=4`. Nevertheless, as we demonstrate in our paper, we should expect hyperparameters to transfer even in this case.
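To make the example concrete, here is a minimal sketch of that calculation (not the actual mup API; the parameter names and base shapes are illustrative assumptions). It computes each weight's width_mult as the ratio of its fan-in in the scaled model to its fan-in in the base model:

```python
# Hypothetical sketch: width_mult = fan-in of the target model's weight
# divided by the fan-in of the corresponding base-model weight.
d_model_base, d_ffn_base = 128, 512   # base shapes (illustrative)
d_model, d_ffn = 256, 2048            # d_model doubled, d_ffn quadrupled

# The MLP is d_model -> d_ffn -> d_model, so the fan-in of the first
# linear layer is d_model and the fan-in of the second is d_ffn.
fan_in_base = {"mlp.fc1.weight": d_model_base, "mlp.fc2.weight": d_ffn_base}
fan_in = {"mlp.fc1.weight": d_model, "mlp.fc2.weight": d_ffn}

width_mult = {name: fan_in[name] / fan_in_base[name] for name in fan_in}
print(width_mult)  # {'mlp.fc1.weight': 2.0, 'mlp.fc2.weight': 4.0}
```

Because the two layers scale their fan-in by different factors, they end up with different width_mult values even though both dimensions are "infinite" in the μP sense.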