Internal covariate shift is defined as the change in the distribution of network activations due to the change in network parameters during training.
It is a problem because the layers need to continuously adapt to the new distribution.
The idea is to whiten the activations of each layer so that they always follow a fixed (normal) distribution, so the next layer does not need to keep adapting to a new input distribution.
- Training: normalize each activation using the mini-batch mean and variance, then scale and shift with the learned parameters γ(k) and β(k) (the BN transform below).
- Testing: use the mean and variance statistics collected during training, so that each layer's output depends only on its input (a sketch of this training/inference difference is given at the end of this section).
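For reference, these are the four steps of the BN transform from Algorithm 1 of the paper, applied to each activation x over a mini-batch B = {x_1, ..., x_m}; the bullets below refer to the third and fourth equations:

$$\mu_\mathcal{B} = \frac{1}{m}\sum_{i=1}^{m} x_i$$

$$\sigma_\mathcal{B}^2 = \frac{1}{m}\sum_{i=1}^{m}\left(x_i - \mu_\mathcal{B}\right)^2$$

$$\hat{x}_i = \frac{x_i - \mu_\mathcal{B}}{\sqrt{\sigma_\mathcal{B}^2 + \epsilon}}$$

$$y_i = \gamma \hat{x}_i + \beta \equiv \mathrm{BN}_{\gamma,\beta}(x_i)$$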
- The normalize step (third equation) ensures that the normalized activations keep a fixed mean and variance (zero mean, unit variance) throughout training.
- The scale and shift step (fourth equation) gives Batch Normalization the ability to undo the normalization: for instance, setting γ(k) = √Var[x(k)] and β(k) = E[x(k)] recovers the original activations.
- For fully-connected layers, normalization is done per dimension of x, while in convolutional layers it is done per feature map, with statistics computed jointly over the mini-batch and all spatial locations. That means there is one pair of γ(k) and β(k) per feature map, so activations within the same feature map are normalized in the same way, consistent with how convolution shares parameters across locations (as sketched below).
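To make the fully-connected vs. convolutional distinction concrete, here is a minimal NumPy sketch (my own illustration, not code from the paper; function names and shapes are assumptions): for a fully-connected layer the statistics and the γ/β pair are per dimension, while for a convolutional layer they are per feature map, computed over the batch and all spatial positions.

```python
import numpy as np

def batchnorm_fc(x, gamma, beta, eps=1e-5):
    """BN for a fully-connected layer: x is (N, D), one (gamma, beta) per dimension."""
    mean = x.mean(axis=0)                  # per-dimension mean over the mini-batch
    var = x.var(axis=0)                    # per-dimension variance over the mini-batch
    x_hat = (x - mean) / np.sqrt(var + eps)
    return gamma * x_hat + beta

def batchnorm_conv(x, gamma, beta, eps=1e-5):
    """BN for a conv layer: x is (N, C, H, W), one (gamma, beta) per feature map,
    statistics taken over the batch *and* spatial locations."""
    mean = x.mean(axis=(0, 2, 3), keepdims=True)
    var = x.var(axis=(0, 2, 3), keepdims=True)
    x_hat = (x - mean) / np.sqrt(var + eps)
    return gamma.reshape(1, -1, 1, 1) * x_hat + beta.reshape(1, -1, 1, 1)

# Toy usage: 8 images, 3 feature maps of size 4x4
x = np.random.randn(8, 3, 4, 4)
y = batchnorm_conv(x, gamma=np.ones(3), beta=np.zeros(3))
print(y.mean(axis=(0, 2, 3)))  # ~0 per feature map
print(y.var(axis=(0, 2, 3)))   # ~1 per feature map
```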
- Enables higher learning rates: BN prevents small parameter changes from amplifying into larger, suboptimal changes in activations and gradients, and it stabilizes parameter growth.
- Makes the network less sensitive to weight initialization.
- Eliminates or reduces the need for Dropout.
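Below is a minimal sketch of the training vs. inference behaviour described above. The class name, the `training` flag, and the use of an exponential moving average are my own choices; the paper instead averages the mini-batch statistics over many batches (with an unbiased variance correction), but a running average as used by common frameworks gives the same picture.

```python
import numpy as np

class BatchNorm1d:
    """Minimal BN layer sketch: mini-batch statistics during training,
    fixed population estimates (running averages) at inference."""

    def __init__(self, dim, eps=1e-5, momentum=0.1):
        self.gamma = np.ones(dim)
        self.beta = np.zeros(dim)
        self.running_mean = np.zeros(dim)
        self.running_var = np.ones(dim)
        self.eps, self.momentum = eps, momentum

    def __call__(self, x, training):
        if training:
            # Normalize with statistics of the current mini-batch.
            mean, var = x.mean(axis=0), x.var(axis=0)
            # Keep exponential moving averages as population estimates.
            self.running_mean = (1 - self.momentum) * self.running_mean + self.momentum * mean
            self.running_var = (1 - self.momentum) * self.running_var + self.momentum * var
        else:
            # Inference: use the fixed estimates, so the output is a
            # deterministic function of the input alone.
            mean, var = self.running_mean, self.running_var
        x_hat = (x - mean) / np.sqrt(var + self.eps)
        return self.gamma * x_hat + self.beta

bn = BatchNorm1d(4)
for _ in range(100):                       # "train" on random mini-batches
    bn(np.random.randn(32, 4) * 2 + 1, training=True)
print(bn(np.random.randn(8, 4) * 2 + 1, training=False).mean(axis=0))  # roughly 0
```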
Ioffe, S., & Szegedy, C. (2015). Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. arXiv preprint arXiv:1502.03167.