From b68ab023991f737f5f4f6a522164cd03f2be668e Mon Sep 17 00:00:00 2001 From: Titus Tzeng Date: Fri, 20 Mar 2020 16:25:50 +0800 Subject: [PATCH 1/6] Update 02-2.md --- docs/zh/week2/02-2.md | 208 ++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 208 insertions(+) create mode 100644 docs/zh/week2/02-2.md diff --git a/docs/zh/week2/02-2.md b/docs/zh/week2/02-2.md new file mode 100644 index 000000000..f5a2e1dcc --- /dev/null +++ b/docs/zh/week2/02-2.md @@ -0,0 +1,208 @@ +--- +lang-ref: ch.02-2 +title: 为神经网络的模组计算梯度,与反向传播的实用技巧 +authors: Micaela Flores, Sheetal Laad, Brina Seidel, Aishwarya Rajan +date: 3 February 2020 +translator: Titus Tzeng +--- + + +## [一个反向传播的具体例子还有介绍基础的神经网络模组](https://www.youtube.com/watch?v=d9vdh3b787Y&t=3022s) + + +### 范例 + +接下来我们会考虑一个反向传播的例子,并使用图像来辅助。任意的函数 $G(w)$ 输入到损失函数 $C$ 中,这可以用一个图来表示。经由雅可比矩阵的乘法操作,我们能将这个图转换成一个反向计算梯度的图。(注意 Pytorch 和 Tensorflow 已经自动地为使用者完成这件事了,也就是说,向前的图自动的被「倒反」来创造导函数的图形以反向传播梯度。) + +
Gradient diagram
+ +在这个范例中,右方的绿色图代表梯度的图。 +In this example, the green graph on the right represents the gradient graph. Following the graph from the topmost node, it follows that + +$$ +\frac{\partial C(y,\bar{y})}{\partial w}=1 \cdot \frac{\partial C(y,\bar{y})}{\partial\bar{y}}\cdot\frac{\partial G(x,w)}{\partial w} +$$ + +In terms of dimensions, $\frac{\partial C(y,\bar{y})}{\partial w}$ is a row vector of size $1\times N$ where $N$ is the number of components of $w$; $\frac{\partial C(y,\bar{y})}{\partial \bar{y}}$ is a row vector of size $1\times M$, where $M$ is the dimension of the output; $\frac{\partial \bar{y}}{\partial w}=\frac{\partial G(x,w)}{\partial w}$ is a matrix of size $M\times N$, where $M$ is the number of outputs of $G$ and $N$ is the dimension of $w$. + +Note that complications might arise when the architecture of the graph is not fixed, but is data-dependent. For example, we could choose neural net module depending on the length of input vector. Though this is possible, it becomes increasingly difficult to manage this variation when the number of loops exceeds a reasonable amount. + + + +### Basic neural net modules + +There exist different types of pre-built modules besides the familiar Linear and ReLU modules. These are useful because they are uniquely optimized to perform their respective functions (as opposed to being built by a combination of other, elementary modules). + +- Linear: $Y=W\cdot X$ + +$$ +\begin{aligned} +\frac{dC}{dX} &= W^\top \cdot \frac{dC}{dY} \\ +\frac{dC}{dW} &= \frac{dC}{dY} \cdot X^\top +\end{aligned} +$$ + +- ReLU: $y=(x)^+$ + + $$ + \frac{dC}{dX} = + \begin{cases} + 0 & x<0\\ + \frac{dC}{dY} & \text{otherwise} + \end{cases} + $$ + +- Duplicate: $Y_1=X$, $Y_2=X$ + + - Akin to a "Y - splitter" where both outputs are equal to the input. + + - When backpropagating, the gradients get summed + + - Can be split into n branches similarly + + $$ + \frac{dC}{dX}=\frac{dC}{dY_1}+\frac{dC}{dY_2} + $$ + + +- Add: $Y=X_1+X_2$ + + - With two variables being summed, when one is perturbed, the output will be perturbed by the same quantity, i.e., + + $$ + \frac{dC}{dX_1}=\frac{dC}{dY}\cdot1 \quad \text{and}\quad \frac{dC}{dX_2}=\frac{dC}{dY}\cdot1 + $$ + + +- Max: $Y=\max(X_1,X_2)$ + + - Since this function can also be represented as + +$$ +Y=\max(X_1,X_2)=\begin{cases} + X_1 & X_1 > X_2 \\ + X_2 & \text{else} + \end{cases} +\Rightarrow +\frac{dY}{dX_1}=\begin{cases} + 1 & X_1 > X_2 \\ + 0 & \text{else} + \end{cases} +$$ + + - - Therefore, by the chain rule, + +$$ +\frac{dC}{dX_1}=\begin{cases} + \frac{dC}{dY}\cdot1 & X_1 > X_2 \\ + 0 & \text{else} + \end{cases} +$$ + + +## [LogSoftMax vs SoftMax](https://www.youtube.com/watch?v=d9vdh3b787Y&t=3985s) + +*SoftMax*, which is also a PyTorch module, is a convenient way of transforming a group of numbers into a group of positive numbers between 0 and 1 that sum to one. These numbers can be interpreted as a probability distribution. As a result, it is commonly used in classification problems. $y_i$ in the equation below is a vector of probabilities for all the categories. + +$$ +y_i = \frac{\exp(x_i)}{\sum_j \exp(x_j)} +$$ + +However, the use of softmax leaves the network susceptible to vanishing gradients. Vanishing gradient is a problem, as it prevents weights downstream from being modified by the neural network, which may completely stop the neural network from further training. 
The logistic sigmoid function, which is the softmax function for one value, shows that when s is large, $h(s)$ is 1, and when s is small, $h(s)$ is 0. Because the sigmoid function is flat at $h(s) = 0 $ and $h(s) = 1$, the gradient is 0, which results in a vanishing gradient. + +
Sigmoid function to illustrate vanishing gradient
+ +$$ +h(s) = \frac{1}{1 + \exp(-s)} +$$ + +Mathematicians came up with the idea of logsoftmax in order to solve for the issue of the vanishing gradient created by softmax. *LogSoftMax* is another basic module in PyTorch. As can be seen in the equation below, *LogSoftMax* is a combination of softmax and log. + +$$ +\log(y_i )= \log\left(\frac{\exp(x_i)}{\Sigma_j \exp(x_j)}\right) = x_i - \log(\Sigma_j \exp(x_j) +$$ + +The equation below demonstrates another way to look at the same equation. The figure below shows the $\log(1 + \exp(s))$ part of the function. When s is very small, the value is 0, and when s is very large, the value is s. As a result it doesn’t saturate, and the vanishing gradient problem is avoided. + +$$ +\log\left(\frac{\exp(s)}{\exp(s) + 1}\right)= s - \log(1 + \exp(s)) +$$ + +
Plot of logarithmic part of the functions
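
As a brief, self-contained illustration of the two points above (our own sketch, not part of the original notes; the numbers are arbitrary), the saturation of the sigmoid and the stability of log softmax can both be checked with a few lines of PyTorch:

```python
import torch
import torch.nn.functional as F

# Sketch only. 1) The logistic sigmoid saturates: its gradient dh/ds = h(1 - h)
# peaks at 0.25 for s = 0 and is ~4.5e-05 at |s| = 10, i.e. effectively zero.
s = torch.tensor([-10.0, -1.0, 0.0, 1.0, 10.0], requires_grad=True)
h = torch.sigmoid(s)
h.sum().backward()
print(s.grad)                            # tiny at the extremes: the vanishing gradient

# 2) Taking the log of an already-saturated softmax underflows to -inf, while
# log_softmax evaluates x_i - log(sum_j exp(x_j)) directly and stays finite.
x = torch.tensor([[0.0, 100.0, 200.0]])
print(torch.log(F.softmax(x, dim=1)))    # roughly [-inf, -100., 0.]
print(F.log_softmax(x, dim=1))           # roughly [-200., -100., 0.]
```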
+ + +## [Practical tricks for backpropagation](https://www.youtube.com/watch?v=d9vdh3b787Y&t=4924s) + + +### Use ReLU as the non-linear activation function + +ReLU works best for networks with many layers, which has caused alternatives like the sigmoid function and hyperbolic tangent $\tanh(\cdot)$ function to fall out of favour. The reason ReLU works best is likely due to its single kink which makes it scale equivariant. + + +### Use cross-entropy loss as the objective function for classification problems + +Log softmax, which we discussed earlier in the lecture, is a special case of cross-entropy loss. In PyTorch, be sure to provide the cross-entropy loss function with *log* softmax as input (as opposed to normal softmax). + + +### Use stochastic gradient descent on minibatches during training + +As discussed previously, minibatches let you train more efficiently because there is redundancy in the data; you shouldn't need to make a prediction and calculate the loss on every single observation at every single step to estimate the gradient. + + +### Shuffle the order of the training examples when using stochastic gradient descent + +Order matters. If the model sees only examples from a single class during each training step, then it will learn to predict that class without learning why it ought to be predicting that class. For example, if you were trying to classify digits from the MNIST dataset and the data was unshuffled, the bias parameters in the last layer would simply always predict zero, then adapt to always predict one, then two, etc. Ideally, you should have samples from every class in every minibatch. + +However, there's ongoing debate over whether you need to change the order of the samples in every pass (epoch). + + +### Normalize the inputs to have zero mean and unit variance + +Before training, it's useful to normalize each input feature so that it has a mean of zero and a standard deviation of one. When using RGB image data, it is common to take mean and standard deviation of each channel individually and normalize the image channel-wise. For example, take the mean $m_b$ and standard deviation $\sigma_b$ of all the blue values in the dataset, then normalize the blue values for each individual image as. + +$$ +b_{[i,j]}^{'} = \frac{b_{[i,j]} - m_b}{\max(\sigma_b, \epsilon)} +$$ + +where $\epsilon$ is an arbitrarily small number that we use to avoid division by zero. Repeat the same for green and red channels. This is necessary to get a meaningful signal out of images taken in different lighting; for example, day lit pictures have a lot of red while underwater pictures have almost none. + + +### Use a schedule to decrease the learning rate + +The learning rate should fall as training goes on. In practice, most advanced models are trained by using algorithms like Adam/Momentum which adapt the learning rate instead of simple SGD with a constant learning rate. + + +### Use L1 and/or L2 regularization for weight decay + +You can add a cost for large weights to the cost function. For example, using L2 regularization, we would define the loss $L$ and update the weights $w$ as follows: + +$$ +L(S, w) = C(S, w) + \alpha \Vert w \Vert^2\\ +\frac{\partial R}{\partial w_i} = 2w_i\\ +w_i = w_i - \eta\frac{\partial C}{\partial w_i} = w_i - \eta(\frac{\partial C}{\partial w_i} + 2 \alpha w_i) +$$ + +To understand why this is called weight decay, note that we can rewrite the above formula to show that we multiply $w_i$ by a constant less than one during the update. 
+ +$$ +w_i = (1 - 2 \eta \alpha) w_i - \eta\frac{\partial C}{\partial w_i} +$$ + +L1 regularization (Lasso) is similar, except that we use $\sum_i \vert w_i\vert$ instead of $\Vert w \Vert^2$. + +Essentially, regularization tries to tell the system to minimize the cost function with the shortest weight vector possible. With L1 regularization, weights that are not useful are shrunk to 0. + + +### Weight initialisation + +The weights need to be initialised at random, however, they shouldn't be too large or too small such that output is roughly of the same variance as that of input. There are various weight initialisation tricks built into PyTorch. One of the tricks that works well for deep models is Kaiming initialisation where the variance of the weights is inversely proportional to square root of number of inputs. + + +### Use dropout + +Dropout is another form of regularization. It can be thought of as another layer of the neural net: it takes inputs, randomly sets $n/2$ of the inputs to zero, and returns the result as output. This forces the system to take information from all input units rather than becoming overly reliant on a small number of input units thus distributing the information across all of the units in a layer. This method was initially proposed by Hinton et al (2012). + +For more tricks, see LeCun et al 1998. + +Finally, note that backpropagation doesn't just work for stacked models; it can work for any directed acyclic graph (DAG) as long as there is a partial order on the modules. + From 3b61726f0390d0001151f9cd4cd8d4e7a5e26eaa Mon Sep 17 00:00:00 2001 From: Titus Tzeng Date: Fri, 20 Mar 2020 16:31:38 +0800 Subject: [PATCH 2/6] Move file to correct directory --- docs/zh/week02/02-2.md | 217 +++++++++++++++++++++++++++++++++++++++-- docs/zh/week2/02-2.md | 208 --------------------------------------- 2 files changed, 208 insertions(+), 217 deletions(-) delete mode 100644 docs/zh/week2/02-2.md diff --git a/docs/zh/week02/02-2.md b/docs/zh/week02/02-2.md index fb6822258..f5a2e1dcc 100644 --- a/docs/zh/week02/02-2.md +++ b/docs/zh/week02/02-2.md @@ -1,9 +1,208 @@ ---- -lang: zh -lang-ref: ch.02-2 -title: Computing gradients for NN modules and Practical tricks for Back Propagation -authors: Micaela Flores, Sheetal Laad, Brina Seidel, Aishwarya Rajan -date: 3 February 2020 ---- - -This will be the translated version of [Computing gradients for NN modules and Practical tricks for Back Propagation]({{site.baseurl}}{% link en/week02/02-2.md %}). +--- +lang-ref: ch.02-2 +title: 为神经网络的模组计算梯度,与反向传播的实用技巧 +authors: Micaela Flores, Sheetal Laad, Brina Seidel, Aishwarya Rajan +date: 3 February 2020 +translator: Titus Tzeng +--- + + +## [一个反向传播的具体例子还有介绍基础的神经网络模组](https://www.youtube.com/watch?v=d9vdh3b787Y&t=3022s) + + +### 范例 + +接下来我们会考虑一个反向传播的例子,并使用图像来辅助。任意的函数 $G(w)$ 输入到损失函数 $C$ 中,这可以用一个图来表示。经由雅可比矩阵的乘法操作,我们能将这个图转换成一个反向计算梯度的图。(注意 Pytorch 和 Tensorflow 已经自动地为使用者完成这件事了,也就是说,向前的图自动的被「倒反」来创造导函数的图形以反向传播梯度。) + +
Gradient diagram
+ +在这个范例中,右方的绿色图代表梯度的图。 +In this example, the green graph on the right represents the gradient graph. Following the graph from the topmost node, it follows that + +$$ +\frac{\partial C(y,\bar{y})}{\partial w}=1 \cdot \frac{\partial C(y,\bar{y})}{\partial\bar{y}}\cdot\frac{\partial G(x,w)}{\partial w} +$$ + +In terms of dimensions, $\frac{\partial C(y,\bar{y})}{\partial w}$ is a row vector of size $1\times N$ where $N$ is the number of components of $w$; $\frac{\partial C(y,\bar{y})}{\partial \bar{y}}$ is a row vector of size $1\times M$, where $M$ is the dimension of the output; $\frac{\partial \bar{y}}{\partial w}=\frac{\partial G(x,w)}{\partial w}$ is a matrix of size $M\times N$, where $M$ is the number of outputs of $G$ and $N$ is the dimension of $w$. + +Note that complications might arise when the architecture of the graph is not fixed, but is data-dependent. For example, we could choose neural net module depending on the length of input vector. Though this is possible, it becomes increasingly difficult to manage this variation when the number of loops exceeds a reasonable amount. + + + +### Basic neural net modules + +There exist different types of pre-built modules besides the familiar Linear and ReLU modules. These are useful because they are uniquely optimized to perform their respective functions (as opposed to being built by a combination of other, elementary modules). + +- Linear: $Y=W\cdot X$ + +$$ +\begin{aligned} +\frac{dC}{dX} &= W^\top \cdot \frac{dC}{dY} \\ +\frac{dC}{dW} &= \frac{dC}{dY} \cdot X^\top +\end{aligned} +$$ + +- ReLU: $y=(x)^+$ + + $$ + \frac{dC}{dX} = + \begin{cases} + 0 & x<0\\ + \frac{dC}{dY} & \text{otherwise} + \end{cases} + $$ + +- Duplicate: $Y_1=X$, $Y_2=X$ + + - Akin to a "Y - splitter" where both outputs are equal to the input. + + - When backpropagating, the gradients get summed + + - Can be split into n branches similarly + + $$ + \frac{dC}{dX}=\frac{dC}{dY_1}+\frac{dC}{dY_2} + $$ + + +- Add: $Y=X_1+X_2$ + + - With two variables being summed, when one is perturbed, the output will be perturbed by the same quantity, i.e., + + $$ + \frac{dC}{dX_1}=\frac{dC}{dY}\cdot1 \quad \text{and}\quad \frac{dC}{dX_2}=\frac{dC}{dY}\cdot1 + $$ + + +- Max: $Y=\max(X_1,X_2)$ + + - Since this function can also be represented as + +$$ +Y=\max(X_1,X_2)=\begin{cases} + X_1 & X_1 > X_2 \\ + X_2 & \text{else} + \end{cases} +\Rightarrow +\frac{dY}{dX_1}=\begin{cases} + 1 & X_1 > X_2 \\ + 0 & \text{else} + \end{cases} +$$ + + - - Therefore, by the chain rule, + +$$ +\frac{dC}{dX_1}=\begin{cases} + \frac{dC}{dY}\cdot1 & X_1 > X_2 \\ + 0 & \text{else} + \end{cases} +$$ + + +## [LogSoftMax vs SoftMax](https://www.youtube.com/watch?v=d9vdh3b787Y&t=3985s) + +*SoftMax*, which is also a PyTorch module, is a convenient way of transforming a group of numbers into a group of positive numbers between 0 and 1 that sum to one. These numbers can be interpreted as a probability distribution. As a result, it is commonly used in classification problems. $y_i$ in the equation below is a vector of probabilities for all the categories. + +$$ +y_i = \frac{\exp(x_i)}{\sum_j \exp(x_j)} +$$ + +However, the use of softmax leaves the network susceptible to vanishing gradients. Vanishing gradient is a problem, as it prevents weights downstream from being modified by the neural network, which may completely stop the neural network from further training. 
The logistic sigmoid function, which is the softmax function for one value, shows that when s is large, $h(s)$ is 1, and when s is small, $h(s)$ is 0. Because the sigmoid function is flat at $h(s) = 0 $ and $h(s) = 1$, the gradient is 0, which results in a vanishing gradient. + +
Sigmoid function to illustrate vanishing gradient
+ +$$ +h(s) = \frac{1}{1 + \exp(-s)} +$$ + +Mathematicians came up with the idea of logsoftmax in order to solve for the issue of the vanishing gradient created by softmax. *LogSoftMax* is another basic module in PyTorch. As can be seen in the equation below, *LogSoftMax* is a combination of softmax and log. + +$$ +\log(y_i )= \log\left(\frac{\exp(x_i)}{\Sigma_j \exp(x_j)}\right) = x_i - \log(\Sigma_j \exp(x_j) +$$ + +The equation below demonstrates another way to look at the same equation. The figure below shows the $\log(1 + \exp(s))$ part of the function. When s is very small, the value is 0, and when s is very large, the value is s. As a result it doesn’t saturate, and the vanishing gradient problem is avoided. + +$$ +\log\left(\frac{\exp(s)}{\exp(s) + 1}\right)= s - \log(1 + \exp(s)) +$$ + +
Plot of logarithmic part of the functions
+ + +## [Practical tricks for backpropagation](https://www.youtube.com/watch?v=d9vdh3b787Y&t=4924s) + + +### Use ReLU as the non-linear activation function + +ReLU works best for networks with many layers, which has caused alternatives like the sigmoid function and hyperbolic tangent $\tanh(\cdot)$ function to fall out of favour. The reason ReLU works best is likely due to its single kink which makes it scale equivariant. + + +### Use cross-entropy loss as the objective function for classification problems + +Log softmax, which we discussed earlier in the lecture, is a special case of cross-entropy loss. In PyTorch, be sure to provide the cross-entropy loss function with *log* softmax as input (as opposed to normal softmax). + + +### Use stochastic gradient descent on minibatches during training + +As discussed previously, minibatches let you train more efficiently because there is redundancy in the data; you shouldn't need to make a prediction and calculate the loss on every single observation at every single step to estimate the gradient. + + +### Shuffle the order of the training examples when using stochastic gradient descent + +Order matters. If the model sees only examples from a single class during each training step, then it will learn to predict that class without learning why it ought to be predicting that class. For example, if you were trying to classify digits from the MNIST dataset and the data was unshuffled, the bias parameters in the last layer would simply always predict zero, then adapt to always predict one, then two, etc. Ideally, you should have samples from every class in every minibatch. + +However, there's ongoing debate over whether you need to change the order of the samples in every pass (epoch). + + +### Normalize the inputs to have zero mean and unit variance + +Before training, it's useful to normalize each input feature so that it has a mean of zero and a standard deviation of one. When using RGB image data, it is common to take mean and standard deviation of each channel individually and normalize the image channel-wise. For example, take the mean $m_b$ and standard deviation $\sigma_b$ of all the blue values in the dataset, then normalize the blue values for each individual image as. + +$$ +b_{[i,j]}^{'} = \frac{b_{[i,j]} - m_b}{\max(\sigma_b, \epsilon)} +$$ + +where $\epsilon$ is an arbitrarily small number that we use to avoid division by zero. Repeat the same for green and red channels. This is necessary to get a meaningful signal out of images taken in different lighting; for example, day lit pictures have a lot of red while underwater pictures have almost none. + + +### Use a schedule to decrease the learning rate + +The learning rate should fall as training goes on. In practice, most advanced models are trained by using algorithms like Adam/Momentum which adapt the learning rate instead of simple SGD with a constant learning rate. + + +### Use L1 and/or L2 regularization for weight decay + +You can add a cost for large weights to the cost function. For example, using L2 regularization, we would define the loss $L$ and update the weights $w$ as follows: + +$$ +L(S, w) = C(S, w) + \alpha \Vert w \Vert^2\\ +\frac{\partial R}{\partial w_i} = 2w_i\\ +w_i = w_i - \eta\frac{\partial C}{\partial w_i} = w_i - \eta(\frac{\partial C}{\partial w_i} + 2 \alpha w_i) +$$ + +To understand why this is called weight decay, note that we can rewrite the above formula to show that we multiply $w_i$ by a constant less than one during the update. 
+ +$$ +w_i = (1 - 2 \eta \alpha) w_i - \eta\frac{\partial C}{\partial w_i} +$$ + +L1 regularization (Lasso) is similar, except that we use $\sum_i \vert w_i\vert$ instead of $\Vert w \Vert^2$. + +Essentially, regularization tries to tell the system to minimize the cost function with the shortest weight vector possible. With L1 regularization, weights that are not useful are shrunk to 0. + + +### Weight initialisation + +The weights need to be initialised at random, however, they shouldn't be too large or too small such that output is roughly of the same variance as that of input. There are various weight initialisation tricks built into PyTorch. One of the tricks that works well for deep models is Kaiming initialisation where the variance of the weights is inversely proportional to square root of number of inputs. + + +### Use dropout + +Dropout is another form of regularization. It can be thought of as another layer of the neural net: it takes inputs, randomly sets $n/2$ of the inputs to zero, and returns the result as output. This forces the system to take information from all input units rather than becoming overly reliant on a small number of input units thus distributing the information across all of the units in a layer. This method was initially proposed by Hinton et al (2012). + +For more tricks, see LeCun et al 1998. + +Finally, note that backpropagation doesn't just work for stacked models; it can work for any directed acyclic graph (DAG) as long as there is a partial order on the modules. + diff --git a/docs/zh/week2/02-2.md b/docs/zh/week2/02-2.md deleted file mode 100644 index f5a2e1dcc..000000000 --- a/docs/zh/week2/02-2.md +++ /dev/null @@ -1,208 +0,0 @@ ---- -lang-ref: ch.02-2 -title: 为神经网络的模组计算梯度,与反向传播的实用技巧 -authors: Micaela Flores, Sheetal Laad, Brina Seidel, Aishwarya Rajan -date: 3 February 2020 -translator: Titus Tzeng ---- - - -## [一个反向传播的具体例子还有介绍基础的神经网络模组](https://www.youtube.com/watch?v=d9vdh3b787Y&t=3022s) - - -### 范例 - -接下来我们会考虑一个反向传播的例子,并使用图像来辅助。任意的函数 $G(w)$ 输入到损失函数 $C$ 中,这可以用一个图来表示。经由雅可比矩阵的乘法操作,我们能将这个图转换成一个反向计算梯度的图。(注意 Pytorch 和 Tensorflow 已经自动地为使用者完成这件事了,也就是说,向前的图自动的被「倒反」来创造导函数的图形以反向传播梯度。) - -
Gradient diagram
- -在这个范例中,右方的绿色图代表梯度的图。 -In this example, the green graph on the right represents the gradient graph. Following the graph from the topmost node, it follows that - -$$ -\frac{\partial C(y,\bar{y})}{\partial w}=1 \cdot \frac{\partial C(y,\bar{y})}{\partial\bar{y}}\cdot\frac{\partial G(x,w)}{\partial w} -$$ - -In terms of dimensions, $\frac{\partial C(y,\bar{y})}{\partial w}$ is a row vector of size $1\times N$ where $N$ is the number of components of $w$; $\frac{\partial C(y,\bar{y})}{\partial \bar{y}}$ is a row vector of size $1\times M$, where $M$ is the dimension of the output; $\frac{\partial \bar{y}}{\partial w}=\frac{\partial G(x,w)}{\partial w}$ is a matrix of size $M\times N$, where $M$ is the number of outputs of $G$ and $N$ is the dimension of $w$. - -Note that complications might arise when the architecture of the graph is not fixed, but is data-dependent. For example, we could choose neural net module depending on the length of input vector. Though this is possible, it becomes increasingly difficult to manage this variation when the number of loops exceeds a reasonable amount. - - - -### Basic neural net modules - -There exist different types of pre-built modules besides the familiar Linear and ReLU modules. These are useful because they are uniquely optimized to perform their respective functions (as opposed to being built by a combination of other, elementary modules). - -- Linear: $Y=W\cdot X$ - -$$ -\begin{aligned} -\frac{dC}{dX} &= W^\top \cdot \frac{dC}{dY} \\ -\frac{dC}{dW} &= \frac{dC}{dY} \cdot X^\top -\end{aligned} -$$ - -- ReLU: $y=(x)^+$ - - $$ - \frac{dC}{dX} = - \begin{cases} - 0 & x<0\\ - \frac{dC}{dY} & \text{otherwise} - \end{cases} - $$ - -- Duplicate: $Y_1=X$, $Y_2=X$ - - - Akin to a "Y - splitter" where both outputs are equal to the input. - - - When backpropagating, the gradients get summed - - - Can be split into n branches similarly - - $$ - \frac{dC}{dX}=\frac{dC}{dY_1}+\frac{dC}{dY_2} - $$ - - -- Add: $Y=X_1+X_2$ - - - With two variables being summed, when one is perturbed, the output will be perturbed by the same quantity, i.e., - - $$ - \frac{dC}{dX_1}=\frac{dC}{dY}\cdot1 \quad \text{and}\quad \frac{dC}{dX_2}=\frac{dC}{dY}\cdot1 - $$ - - -- Max: $Y=\max(X_1,X_2)$ - - - Since this function can also be represented as - -$$ -Y=\max(X_1,X_2)=\begin{cases} - X_1 & X_1 > X_2 \\ - X_2 & \text{else} - \end{cases} -\Rightarrow -\frac{dY}{dX_1}=\begin{cases} - 1 & X_1 > X_2 \\ - 0 & \text{else} - \end{cases} -$$ - - - - Therefore, by the chain rule, - -$$ -\frac{dC}{dX_1}=\begin{cases} - \frac{dC}{dY}\cdot1 & X_1 > X_2 \\ - 0 & \text{else} - \end{cases} -$$ - - -## [LogSoftMax vs SoftMax](https://www.youtube.com/watch?v=d9vdh3b787Y&t=3985s) - -*SoftMax*, which is also a PyTorch module, is a convenient way of transforming a group of numbers into a group of positive numbers between 0 and 1 that sum to one. These numbers can be interpreted as a probability distribution. As a result, it is commonly used in classification problems. $y_i$ in the equation below is a vector of probabilities for all the categories. - -$$ -y_i = \frac{\exp(x_i)}{\sum_j \exp(x_j)} -$$ - -However, the use of softmax leaves the network susceptible to vanishing gradients. Vanishing gradient is a problem, as it prevents weights downstream from being modified by the neural network, which may completely stop the neural network from further training. 
The logistic sigmoid function, which is the softmax function for one value, shows that when s is large, $h(s)$ is 1, and when s is small, $h(s)$ is 0. Because the sigmoid function is flat at $h(s) = 0 $ and $h(s) = 1$, the gradient is 0, which results in a vanishing gradient. - -
Sigmoid function to illustrate vanishing gradient
- -$$ -h(s) = \frac{1}{1 + \exp(-s)} -$$ - -Mathematicians came up with the idea of logsoftmax in order to solve for the issue of the vanishing gradient created by softmax. *LogSoftMax* is another basic module in PyTorch. As can be seen in the equation below, *LogSoftMax* is a combination of softmax and log. - -$$ -\log(y_i )= \log\left(\frac{\exp(x_i)}{\Sigma_j \exp(x_j)}\right) = x_i - \log(\Sigma_j \exp(x_j) -$$ - -The equation below demonstrates another way to look at the same equation. The figure below shows the $\log(1 + \exp(s))$ part of the function. When s is very small, the value is 0, and when s is very large, the value is s. As a result it doesn’t saturate, and the vanishing gradient problem is avoided. - -$$ -\log\left(\frac{\exp(s)}{\exp(s) + 1}\right)= s - \log(1 + \exp(s)) -$$ - -
Plot of logarithmic part of the functions
- - -## [Practical tricks for backpropagation](https://www.youtube.com/watch?v=d9vdh3b787Y&t=4924s) - - -### Use ReLU as the non-linear activation function - -ReLU works best for networks with many layers, which has caused alternatives like the sigmoid function and hyperbolic tangent $\tanh(\cdot)$ function to fall out of favour. The reason ReLU works best is likely due to its single kink which makes it scale equivariant. - - -### Use cross-entropy loss as the objective function for classification problems - -Log softmax, which we discussed earlier in the lecture, is a special case of cross-entropy loss. In PyTorch, be sure to provide the cross-entropy loss function with *log* softmax as input (as opposed to normal softmax). - - -### Use stochastic gradient descent on minibatches during training - -As discussed previously, minibatches let you train more efficiently because there is redundancy in the data; you shouldn't need to make a prediction and calculate the loss on every single observation at every single step to estimate the gradient. - - -### Shuffle the order of the training examples when using stochastic gradient descent - -Order matters. If the model sees only examples from a single class during each training step, then it will learn to predict that class without learning why it ought to be predicting that class. For example, if you were trying to classify digits from the MNIST dataset and the data was unshuffled, the bias parameters in the last layer would simply always predict zero, then adapt to always predict one, then two, etc. Ideally, you should have samples from every class in every minibatch. - -However, there's ongoing debate over whether you need to change the order of the samples in every pass (epoch). - - -### Normalize the inputs to have zero mean and unit variance - -Before training, it's useful to normalize each input feature so that it has a mean of zero and a standard deviation of one. When using RGB image data, it is common to take mean and standard deviation of each channel individually and normalize the image channel-wise. For example, take the mean $m_b$ and standard deviation $\sigma_b$ of all the blue values in the dataset, then normalize the blue values for each individual image as. - -$$ -b_{[i,j]}^{'} = \frac{b_{[i,j]} - m_b}{\max(\sigma_b, \epsilon)} -$$ - -where $\epsilon$ is an arbitrarily small number that we use to avoid division by zero. Repeat the same for green and red channels. This is necessary to get a meaningful signal out of images taken in different lighting; for example, day lit pictures have a lot of red while underwater pictures have almost none. - - -### Use a schedule to decrease the learning rate - -The learning rate should fall as training goes on. In practice, most advanced models are trained by using algorithms like Adam/Momentum which adapt the learning rate instead of simple SGD with a constant learning rate. - - -### Use L1 and/or L2 regularization for weight decay - -You can add a cost for large weights to the cost function. For example, using L2 regularization, we would define the loss $L$ and update the weights $w$ as follows: - -$$ -L(S, w) = C(S, w) + \alpha \Vert w \Vert^2\\ -\frac{\partial R}{\partial w_i} = 2w_i\\ -w_i = w_i - \eta\frac{\partial C}{\partial w_i} = w_i - \eta(\frac{\partial C}{\partial w_i} + 2 \alpha w_i) -$$ - -To understand why this is called weight decay, note that we can rewrite the above formula to show that we multiply $w_i$ by a constant less than one during the update. 
- -$$ -w_i = (1 - 2 \eta \alpha) w_i - \eta\frac{\partial C}{\partial w_i} -$$ - -L1 regularization (Lasso) is similar, except that we use $\sum_i \vert w_i\vert$ instead of $\Vert w \Vert^2$. - -Essentially, regularization tries to tell the system to minimize the cost function with the shortest weight vector possible. With L1 regularization, weights that are not useful are shrunk to 0. - - -### Weight initialisation - -The weights need to be initialised at random, however, they shouldn't be too large or too small such that output is roughly of the same variance as that of input. There are various weight initialisation tricks built into PyTorch. One of the tricks that works well for deep models is Kaiming initialisation where the variance of the weights is inversely proportional to square root of number of inputs. - - -### Use dropout - -Dropout is another form of regularization. It can be thought of as another layer of the neural net: it takes inputs, randomly sets $n/2$ of the inputs to zero, and returns the result as output. This forces the system to take information from all input units rather than becoming overly reliant on a small number of input units thus distributing the information across all of the units in a layer. This method was initially proposed by Hinton et al (2012). - -For more tricks, see LeCun et al 1998. - -Finally, note that backpropagation doesn't just work for stacked models; it can work for any directed acyclic graph (DAG) as long as there is a partial order on the modules. - From 79559452e9d6867f0cf2147354e39db3578416f5 Mon Sep 17 00:00:00 2001 From: Titus Tzeng Date: Fri, 20 Mar 2020 20:57:37 +0800 Subject: [PATCH 3/6] Update 02-2.md --- docs/zh/week02/02-2.md | 62 ++++++++++++++++++++---------------------- 1 file changed, 30 insertions(+), 32 deletions(-) diff --git a/docs/zh/week02/02-2.md b/docs/zh/week02/02-2.md index f5a2e1dcc..5796ec7e1 100644 --- a/docs/zh/week02/02-2.md +++ b/docs/zh/week02/02-2.md @@ -14,26 +14,25 @@ translator: Titus Tzeng 接下来我们会考虑一个反向传播的例子,并使用图像来辅助。任意的函数 $G(w)$ 输入到损失函数 $C$ 中,这可以用一个图来表示。经由雅可比矩阵的乘法操作,我们能将这个图转换成一个反向计算梯度的图。(注意 Pytorch 和 Tensorflow 已经自动地为使用者完成这件事了,也就是说,向前的图自动的被「倒反」来创造导函数的图形以反向传播梯度。) -
Gradient diagram
+
梯度的图像
-在这个范例中,右方的绿色图代表梯度的图。 -In this example, the green graph on the right represents the gradient graph. Following the graph from the topmost node, it follows that +在这个范例中,右方的绿色图代表梯度的图。跟着图从最上方的节点开始的是: $$ \frac{\partial C(y,\bar{y})}{\partial w}=1 \cdot \frac{\partial C(y,\bar{y})}{\partial\bar{y}}\cdot\frac{\partial G(x,w)}{\partial w} $$ -In terms of dimensions, $\frac{\partial C(y,\bar{y})}{\partial w}$ is a row vector of size $1\times N$ where $N$ is the number of components of $w$; $\frac{\partial C(y,\bar{y})}{\partial \bar{y}}$ is a row vector of size $1\times M$, where $M$ is the dimension of the output; $\frac{\partial \bar{y}}{\partial w}=\frac{\partial G(x,w)}{\partial w}$ is a matrix of size $M\times N$, where $M$ is the number of outputs of $G$ and $N$ is the dimension of $w$. +从维度来说,$\frac{\partial C(y,\bar{y})}{\partial w}$ 是一个行向量,大小为 $1\times N$,其中 $N$ 是 $w$ 中成员的数量;$\frac{\partial C(y,\bar{y})}{\partial \bar{y}}$ 是个大小 $1\times M$ 的行向量,其中 $M$ 是输出的维度;$\frac{\partial \bar{y}}{\partial w}=\frac{\partial G(x,w)}{\partial w}$ 是个大小 $M\times N$ 的矩阵,其中 $M$ 是 $G$ 输出的数量,而 $N$ 是 $w$ 的维度。 -Note that complications might arise when the architecture of the graph is not fixed, but is data-dependent. For example, we could choose neural net module depending on the length of input vector. Though this is possible, it becomes increasingly difficult to manage this variation when the number of loops exceeds a reasonable amount. +当图的结构不固定而是对应于资料时,情况可能更为复杂。比如,我们可以根据输入向量的长度来选择神经网络的模组。虽然这是可行的,当回圈数量过度增加,处理这个变化的难度会增加。 -### Basic neural net modules +### 基本的神经网络模组 -There exist different types of pre-built modules besides the familiar Linear and ReLU modules. These are useful because they are uniquely optimized to perform their respective functions (as opposed to being built by a combination of other, elementary modules). +除了习惯的线性和 ReLU 模组,还有其他预先建立的模组。他们十分有用因为他们为了各自的功能被特别的优化过(而非只是用其他初阶模组拼凑而成)。 -- Linear: $Y=W\cdot X$ +- 线性: $Y=W\cdot X$ $$ \begin{aligned} @@ -42,7 +41,7 @@ $$ \end{aligned} $$ -- ReLU: $y=(x)^+$ +- ReLU: $y=(x)^+$ $$ \frac{dC}{dX} = @@ -52,31 +51,31 @@ $$ \end{cases} $$ -- Duplicate: $Y_1=X$, $Y_2=X$ +- 重复: $Y_1=X$, $Y_2=X$ - - Akin to a "Y - splitter" where both outputs are equal to the input. + - 如同一个「分接线」,两个输出都与输入相同 - - When backpropagating, the gradients get summed + - 反向传播时,梯度相加 - - Can be split into n branches similarly + - 可以类似的分配成 n 个分支 $$ \frac{dC}{dX}=\frac{dC}{dY_1}+\frac{dC}{dY_2} $$ -- Add: $Y=X_1+X_2$ +- 相加: $Y=X_1+X_2$ - - With two variables being summed, when one is perturbed, the output will be perturbed by the same quantity, i.e., + - 当两个变数相加,其中一个若被改变,输出也会以相同幅度改变,即 $$ \frac{dC}{dX_1}=\frac{dC}{dY}\cdot1 \quad \text{and}\quad \frac{dC}{dX_2}=\frac{dC}{dY}\cdot1 $$ -- Max: $Y=\max(X_1,X_2)$ +- 最大值: $Y=\max(X_1,X_2)$ - - Since this function can also be represented as + - 因为这个函数也可以写作: $$ Y=\max(X_1,X_2)=\begin{cases} @@ -90,7 +89,7 @@ Y=\max(X_1,X_2)=\begin{cases} \end{cases} $$ - - - Therefore, by the chain rule, + - 因此,根据链式法则 $$ \frac{dC}{dX_1}=\begin{cases} @@ -102,51 +101,50 @@ $$ ## [LogSoftMax vs SoftMax](https://www.youtube.com/watch?v=d9vdh3b787Y&t=3985s) -*SoftMax*, which is also a PyTorch module, is a convenient way of transforming a group of numbers into a group of positive numbers between 0 and 1 that sum to one. These numbers can be interpreted as a probability distribution. As a result, it is commonly used in classification problems. $y_i$ in the equation below is a vector of probabilities for all the categories. 
+*SoftMax*,另一个 Pytorch 模组,是一种方便的方式,可以将一组数字转换为 0 到 1 之间的数值,并使它们和为 1。这些数字可以理解为几率分布。因此,它经常用于分类问题。下方等式中的 $y_i$ 是一个向量记录每个类别的几率。 $$ y_i = \frac{\exp(x_i)}{\sum_j \exp(x_j)} $$ -However, the use of softmax leaves the network susceptible to vanishing gradients. Vanishing gradient is a problem, as it prevents weights downstream from being modified by the neural network, which may completely stop the neural network from further training. The logistic sigmoid function, which is the softmax function for one value, shows that when s is large, $h(s)$ is 1, and when s is small, $h(s)$ is 0. Because the sigmoid function is flat at $h(s) = 0 $ and $h(s) = 1$, the gradient is 0, which results in a vanishing gradient. +然而,使用 softmax 使网络容易面临梯度消失。梯度消失是一个问题,因为它会使得随后的权重无法被神经网络改动,进而停止神经网络进一步训练。Logistic sigmoid 函数,就是单一数值的 Softmax 函数,展现出当 s 很大时,$h(s)$ 是 1,而当 s 很小时,$h(s)$ 是 0。因为 sigmoid 函数在 $h(s) = 0$ 和 $h(s) = 1$ 处是平坦的,其梯度为 0,造成消失的梯度。 -
Sigmoid function to illustrate vanishing gradient
+
描绘了梯度消失的 sigmoid 函数
$$ h(s) = \frac{1}{1 + \exp(-s)} $$ -Mathematicians came up with the idea of logsoftmax in order to solve for the issue of the vanishing gradient created by softmax. *LogSoftMax* is another basic module in PyTorch. As can be seen in the equation below, *LogSoftMax* is a combination of softmax and log. +数学家想到可以用 logsoftmax 来解决 softmax 造成的梯度消失问题。*LogSoftMax* 是 Pytorch 当中的另一个基本模组。正如下方等式所示,*LogSoftMax* 组合了 softmax 和对数。 $$ \log(y_i )= \log\left(\frac{\exp(x_i)}{\Sigma_j \exp(x_j)}\right) = x_i - \log(\Sigma_j \exp(x_j) $$ -The equation below demonstrates another way to look at the same equation. The figure below shows the $\log(1 + \exp(s))$ part of the function. When s is very small, the value is 0, and when s is very large, the value is s. As a result it doesn’t saturate, and the vanishing gradient problem is avoided. +下方的等式提供同一个等式的另一种观点。下图显示函数中 $\log(1 + \exp(s))$ 的部份。当 s 非常小,其值为 0,至于 s 很大时,值是 s。如此一来,它不会造成饱和,就避免了梯度消失。 $$ \log\left(\frac{\exp(s)}{\exp(s) + 1}\right)= s - \log(1 + \exp(s)) $$ -
Plot of logarithmic part of the functions
+
函数的对数部份
-## [Practical tricks for backpropagation](https://www.youtube.com/watch?v=d9vdh3b787Y&t=4924s) +## [反向传播的实用技巧](https://www.youtube.com/watch?v=d9vdh3b787Y&t=4924s) -### Use ReLU as the non-linear activation function +### 用 ReLU 作为非线性函数 -ReLU works best for networks with many layers, which has caused alternatives like the sigmoid function and hyperbolic tangent $\tanh(\cdot)$ function to fall out of favour. The reason ReLU works best is likely due to its single kink which makes it scale equivariant. +对于有很多层的网络,ReLU 的效果最好,甚至使其他函数如 sigmoid、hyperbolic tangent $\tanh(\cdot)$ 相形之下过时了。ReLU 很有效的原因可能是因为它具有的一个尖点使它具有缩放的等变性。 +### 用交叉熵作为分类问题的损失函数 -### Use cross-entropy loss as the objective function for classification problems +在讲座里前面提到的 Log softmax是交叉熵损失的特例。Pytorch 里,请确认传给交叉熵损失函数时要使用 *Log* softmax 为输入(而非一般 softmax)。 -Log softmax, which we discussed earlier in the lecture, is a special case of cross-entropy loss. In PyTorch, be sure to provide the cross-entropy loss function with *log* softmax as input (as opposed to normal softmax). +### 训练时使用小批量(minibatch)的随机梯度下降 -### Use stochastic gradient descent on minibatches during training - -As discussed previously, minibatches let you train more efficiently because there is redundancy in the data; you shouldn't need to make a prediction and calculate the loss on every single observation at every single step to estimate the gradient. +如同之前所讨论的,小批量使你能更有效率的训练,因为资料中有重复;你不需要每一步对每个观察进行预测、计算损失以估计梯度。 ### Shuffle the order of the training examples when using stochastic gradient descent From 948d166bc6288ecb4fda4d6dedf2302e0a2e66cf Mon Sep 17 00:00:00 2001 From: Titus Tzeng Date: Sun, 22 Mar 2020 16:25:48 +0800 Subject: [PATCH 4/6] Update 02-2.md --- docs/zh/week02/02-2.md | 38 +++++++++++++++++++------------------- 1 file changed, 19 insertions(+), 19 deletions(-) diff --git a/docs/zh/week02/02-2.md b/docs/zh/week02/02-2.md index 5796ec7e1..a88269cf3 100644 --- a/docs/zh/week02/02-2.md +++ b/docs/zh/week02/02-2.md @@ -147,32 +147,32 @@ $$ 如同之前所讨论的,小批量使你能更有效率的训练,因为资料中有重复;你不需要每一步对每个观察进行预测、计算损失以估计梯度。 -### Shuffle the order of the training examples when using stochastic gradient descent +### 训练时打乱样本顺序 -Order matters. If the model sees only examples from a single class during each training step, then it will learn to predict that class without learning why it ought to be predicting that class. For example, if you were trying to classify digits from the MNIST dataset and the data was unshuffled, the bias parameters in the last layer would simply always predict zero, then adapt to always predict one, then two, etc. Ideally, you should have samples from every class in every minibatch. +顺序会造成影响。如果模型在一回的训练中只有看见来自同一类别的样本,它会去直接预测该类别而非学习为何要预测该类别。例如,如果你试着分类 MNIST 资料集的数字而没有打乱资料,那最后一层的偏置易开始总会预测零,接着改成预测一、二,依此类推。理想上,每个批量中都应该要有来自每个类别的样本。 -However, there's ongoing debate over whether you need to change the order of the samples in every pass (epoch). +不过是否要在每回(epoch)的训练都改变次序仍然存在争论。 -### Normalize the inputs to have zero mean and unit variance +### 将输入归一化使其具有零平均值与单位方差 -Before training, it's useful to normalize each input feature so that it has a mean of zero and a standard deviation of one. When using RGB image data, it is common to take mean and standard deviation of each channel individually and normalize the image channel-wise. For example, take the mean $m_b$ and standard deviation $\sigma_b$ of all the blue values in the dataset, then normalize the blue values for each individual image as. 
+训练之前,建议先归一化每个输入特征,让均值为零、标准差为一。使用 RGB 图像资料时,经常会单独取每个通道的均值和标准差,以通道为单位进行归一化。举例而言,取资料集中所有蓝色的数值的均值 $m_b$ 和标准差 $\sigma_b$,接着归一化每个图像的蓝色数值如下: $$ b_{[i,j]}^{'} = \frac{b_{[i,j]} - m_b}{\max(\sigma_b, \epsilon)} $$ -where $\epsilon$ is an arbitrarily small number that we use to avoid division by zero. Repeat the same for green and red channels. This is necessary to get a meaningful signal out of images taken in different lighting; for example, day lit pictures have a lot of red while underwater pictures have almost none. +其中 $\epsilon$ 是个任意小的数字,用于避免除以零。对绿色和红色进行同样操作。这个必要的动作使我们能从不同光线下的图像取得有用的信号;例如日光中的相片有很多红色,但水下的图片则几乎没有。 -### Use a schedule to decrease the learning rate +### 按照进度递减学习率 -The learning rate should fall as training goes on. In practice, most advanced models are trained by using algorithms like Adam/Momentum which adapt the learning rate instead of simple SGD with a constant learning rate. +随着训练持续,学习率应该下降。实际上,大多进阶的模型是用 Adam/Momentum 这些能自我调整学习率的算法训练的,而非学习率固定的单纯 SGD。 -### Use L1 and/or L2 regularization for weight decay +### 使用 L1 和(或)L2 正则化进行权重衰减 -You can add a cost for large weights to the cost function. For example, using L2 regularization, we would define the loss $L$ and update the weights $w$ as follows: +你可以在损失函数中附上对巨大权重的损失。例如,使用 L2 正则化,我们定义损失为 $L$ 并且如下更新权重 $w$: $$ L(S, w) = C(S, w) + \alpha \Vert w \Vert^2\\ @@ -180,27 +180,27 @@ L(S, w) = C(S, w) + \alpha \Vert w \Vert^2\\ w_i = w_i - \eta\frac{\partial C}{\partial w_i} = w_i - \eta(\frac{\partial C}{\partial w_i} + 2 \alpha w_i) $$ -To understand why this is called weight decay, note that we can rewrite the above formula to show that we multiply $w_i$ by a constant less than one during the update. +为了理解为何这称作权重衰减,我们可以将上方的方程式重写来展现我们在更新时把 $w_i$ 乘以一个小于一的常数。 $$ w_i = (1 - 2 \eta \alpha) w_i - \eta\frac{\partial C}{\partial w_i} $$ -L1 regularization (Lasso) is similar, except that we use $\sum_i \vert w_i\vert$ instead of $\Vert w \Vert^2$. +L1 正则化(Lasso)是类似的,只不过我们使用 $\sum_i \vert w_i\vert$ 而不是 $\Vert w \Vert^2$。 -Essentially, regularization tries to tell the system to minimize the cost function with the shortest weight vector possible. With L1 regularization, weights that are not useful are shrunk to 0. +本质上,正则化尝试告诉系统要以最短的权重向量来最小化损失函数。L1 正则化会将无用的权重缩减至 0。 -### Weight initialisation +### 权重初始化 -The weights need to be initialised at random, however, they shouldn't be too large or too small such that output is roughly of the same variance as that of input. There are various weight initialisation tricks built into PyTorch. One of the tricks that works well for deep models is Kaiming initialisation where the variance of the weights is inversely proportional to square root of number of inputs. +权重要被随机的初始化,但它们不能太大或太小,因为输出得要有与输入差不多的方差。Pytorch 有诸多内建的初始化技巧。其中一个适合深层模型的是 Kaiming 初始化:权重的方差与输入数量的平方根成反比。 -### Use dropout +### 使用 dropout -Dropout is another form of regularization. It can be thought of as another layer of the neural net: it takes inputs, randomly sets $n/2$ of the inputs to zero, and returns the result as output. This forces the system to take information from all input units rather than becoming overly reliant on a small number of input units thus distributing the information across all of the units in a layer. This method was initially proposed by Hinton et al (2012). +Dropout 是另一种正则化。它可以当做神经网络的另一层:它接受输入,随机将 $n/2$ 的输入设为零,并且回传这个结果为输出。这迫使系统从所有输入单元取得资讯而不是过度倚赖少数的输入单元,从而能将资讯分配于一层中的所有单元。这个方法最初是由Hinton et al (2012)提出。 -For more tricks, see LeCun et al 1998. +更多技巧参见 LeCun et al 1998. 
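
（补充示意：以下简短的 PyTorch 代码并非原讲义内容,网络结构与超参数皆为假设,只是把上面提到的几个技巧,如 ReLU、dropout、Kaiming 初始化、交叉熵损失、带 L2 权重衰减与学习率递减的 SGD,放在一起示范。）

```python
import torch
import torch.nn as nn

# 示意用的小网络:层数与大小皆为假设
model = nn.Sequential(
    nn.Linear(784, 256),
    nn.ReLU(),                # 非线性使用 ReLU
    nn.Dropout(p=0.5),        # dropout:随机将一半单元置零
    nn.Linear(256, 10),
)

# Kaiming 初始化,适合使用 ReLU 的深层模型
for m in model.modules():
    if isinstance(m, nn.Linear):
        nn.init.kaiming_normal_(m.weight, nonlinearity='relu')
        nn.init.zeros_(m.bias)

criterion = nn.CrossEntropyLoss()   # 交叉熵损失,内部已包含 log softmax

# SGD 搭配 L2 权重衰减(weight_decay),并按进度递减学习率
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9, weight_decay=1e-4)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.5)
# 训练回圈中,每个 epoch 结束后呼叫 scheduler.step() 来降低学习率
```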
-Finally, note that backpropagation doesn't just work for stacked models; it can work for any directed acyclic graph (DAG) as long as there is a partial order on the modules. +最后,注意反向传播不只适用于层层堆叠的模型;它可用于任何有向无环图(DAG)只要模组间具有偏序关系, From aec68db9cf0a05d6b44cf75246f97ebc13657aea Mon Sep 17 00:00:00 2001 From: Titus Tzeng Date: Sat, 28 Mar 2020 12:11:33 +0800 Subject: [PATCH 5/6] Fix content --- docs/zh/week02/02-2.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/zh/week02/02-2.md b/docs/zh/week02/02-2.md index a88269cf3..f79a5af2f 100644 --- a/docs/zh/week02/02-2.md +++ b/docs/zh/week02/02-2.md @@ -167,7 +167,7 @@ $$ ### 按照进度递减学习率 -随着训练持续,学习率应该下降。实际上,大多进阶的模型是用 Adam/Momentum 这些能自我调整学习率的算法训练的,而非学习率固定的单纯 SGD。 +随着训练持续,学习率应该下降。实际上,大多进阶的模型是用 Adam 这些能自我调整学习率的算法训练的,而非学习率固定的单纯 SGD。 ### 使用 L1 和(或)L2 正则化进行权重衰减 From 3809ee53f1597cf7fe32e44995979f6e93d3fdaa Mon Sep 17 00:00:00 2001 From: Alfredo Canziani Date: Sat, 28 Mar 2020 15:09:01 -0400 Subject: [PATCH 6/6] Fix bad permissions of _config.yml --- docs/_config.yml | 0 1 file changed, 0 insertions(+), 0 deletions(-) mode change 100755 => 100644 docs/_config.yml diff --git a/docs/_config.yml b/docs/_config.yml old mode 100755 new mode 100644