Unfortunately, we can't directly use this formulation as our objective function – it's not easy to mathematically optimize over a constraint. Instead, we will apply the magic of the [Lagrangian Duality](https://en.wikipedia.org/wiki/Duality_(optimization)). The details of this are out of scope (take EECS 127 if you're interested in learning more), but the end result is very useful. It turns out that minimizing the following *augmented* objective function is *equivalent* to our minimization goal above.
$$\frac{1}{n} \sum_{i=1}^n (y_i - (\theta_0 + \theta_1 x_{i, 1} + \theta_2 x_{i, 2} + \ldots + \theta_p x_{i, p}))^2 + \lambda \sum_{i=1}^p |\theta_i|$$

$$= \frac{1}{n}||\mathbb{Y} - \mathbb{X}\theta||_2^2 + \lambda \sum_{i=1}^p |\theta_i|$$

$$= \frac{1}{n}||\mathbb{Y} - \mathbb{X}\theta||_2^2 + \lambda || \theta ||_1$$

The last two expressions include the MSE expressed using vector notation, and the last expression writes $\sum_{i=1}^p |\theta_i|$ in its **L1 norm** equivalent form, $|| \theta ||_1$.
Notice that we've replaced the constraint with a second term in our objective function. We're now minimizing a function with an additional regularization term that *penalizes large coefficients*. In order to minimize this new objective function, we'll end up balancing two components (see the short sketch after the list):
1. Keeping the model's error on the training data low, represented by the term $\frac{1}{n} \sum_{i=1}^n (y_i - (\theta_0 + \theta_1 x_{i, 1} + \theta_2 x_{i, 2} + \ldots + \theta_p x_{i, p}))^2$
2. Keeping the magnitudes of model parameters low, represented by the term $\lambda \sum_{i=1}^p |\theta_i|$
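As a quick illustration (not part of the original notes), the sketch below evaluates these two components directly for a candidate $\theta$. It assumes a design matrix whose first column is the all-ones bias column, so that $\theta_0$ is left unpenalized as in the sum above; all names here are illustrative placeholders.

```python
import numpy as np

def l1_objective(X, y, theta, lam):
    """(1/n) * ||y - X @ theta||_2^2  +  lam * sum_{j>=1} |theta_j|."""
    n = len(y)
    training_error = np.sum((y - X @ theta) ** 2) / n   # component 1: MSE on the training data
    penalty = lam * np.sum(np.abs(theta[1:]))           # component 2: parameter magnitudes (theta_0 excluded)
    return training_error + penalty

# Tiny synthetic example.
rng = np.random.default_rng(0)
X = np.hstack([np.ones((50, 1)), rng.normal(size=(50, 3))])   # bias column + 3 features
theta = np.array([1.0, 2.0, 0.0, -1.0])
y = X @ theta + rng.normal(scale=0.1, size=50)
print(l1_objective(X, y, theta, lam=0.1))
```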
The $\lambda$ factor controls the degree of regularization. Roughly speaking, $\lambda$ is related to our $Q$ constraint from before by the rule $\lambda \approx \frac{1}{Q}$. To understand why, let's consider two extreme examples. Recall that our goal is to minimize the cost function: $\frac{1}{n}||\mathbb{Y} - \mathbb{X}\theta||_2^2 + \lambda || \theta ||_1$.
- Assume $\lambda \rightarrow \infty$. Then, $\lambda || \theta ||_1$ dominates the cost function. Since any nonzero parameter would make this term infinite, minimizing it forces $\theta_j = 0$ for all $j \ge 1$. This is a very constrained model that is mathematically equivalent to the constant model <!--, which also arises when $Q$ approaches $0$. -->
- Assume $\lambda \rightarrow 0$. Then, $\lambda || \theta ||_1 = 0$. Minimizing the cost function is equivalent to minimizing $\frac{1}{n} || \mathbb{Y} - \mathbb{X}\theta ||_2^2$, our usual MSE loss function. The act of minimizing MSE loss is just our familiar OLS, and the optimal solution is the global minimum $\hat{\theta} = \hat{\theta}_{\text{No Reg.}}$. Both extremes are illustrated in the short sketch below. <!-- We showed that the global optimum is achieved when the L2 norm ball radius $Q \rightarrow \infty$. -->
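Here is a minimal sketch (not from the original notes) of these two extremes using scikit-learn's `Lasso`, whose `alpha` parameter plays the role of $\lambda$ (up to a constant scaling); the data is synthetic and purely illustrative.

```python
import numpy as np
from sklearn.linear_model import Lasso

# Synthetic data: 100 observations, 5 features, two true coefficients equal to 0.
rng = np.random.default_rng(42)
X = rng.normal(size=(100, 5))
y = X @ np.array([3.0, -2.0, 0.0, 1.5, 0.0]) + rng.normal(scale=0.1, size=100)

# alpha near 0 behaves like OLS; a huge alpha drives every theta_j (j >= 1) to 0,
# leaving only the intercept -- effectively the constant model.
for alpha in [1e-4, 1e4]:
    model = Lasso(alpha=alpha).fit(X, y)
    print(f"alpha={alpha:g}: coef = {model.coef_.round(3)}, intercept = {model.intercept_:.3f}")
```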
With L2 regularization, we instead constrain the sum of the *squared* parameter magnitudes. If we modify our objective function like before, we find that our new goal is to minimize the function:

$$\frac{1}{n} \sum_{i=1}^n (y_i - (\theta_0 + \theta_1 \phi_{i, 1} + \theta_2 \phi_{i, 2} + \ldots + \theta_p \phi_{i, p}))^2\:\text{such that}\:\sum_{i=1}^p \theta_i^2 \leq Q$$

Notice that all we have done is change the constraint on the model parameters. The first term in the expression, the MSE, has not changed.
Using Lagrangian Duality (again, out of scope for Data 100), we can re-express our objective function as:

$$\frac{1}{n} \sum_{i=1}^n (y_i - (\theta_0 + \theta_1 \phi_{i, 1} + \theta_2 \phi_{i, 2} + \ldots + \theta_p \phi_{i, p}))^2 + \lambda \sum_{i=1}^p \theta_i^2$$

$$= \frac{1}{n}||\mathbb{Y} - \mathbb{X}\theta||_2^2 + \lambda \sum_{i=1}^p \theta_i^2$$

$$= \frac{1}{n}||\mathbb{Y} - \mathbb{X}\theta||_2^2 + \lambda || \theta ||_2^2$$

The last two expressions include the MSE expressed using vector notation, and the last expression writes $\sum_{i=1}^p \theta_i^2$ in its **L2 norm** equivalent form, $|| \theta ||_2^2$.

When applying L2 regularization, our goal is to minimize this updated objective function.

Unlike L1 regularization, L2 regularization *does* have a closed-form solution for the best parameter vector when regularization is applied:

$$\hat{\theta}_{\text{ridge}} = (\mathbb{X}^T\mathbb{X} + n\lambda I)^{-1}\mathbb{X}^T\mathbb{Y}$$
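As a sanity check (not part of the original notes), the sketch below computes this closed form with NumPy and compares it against scikit-learn's `Ridge`. The data is synthetic; `fit_intercept=False` is used so that every parameter is regularized as in the formula above, and `alpha` is set to $n\lambda$ because scikit-learn's objective omits the $\frac{1}{n}$ factor on the squared error.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
n, p = 100, 4
X = rng.normal(size=(n, p))
y = X @ np.array([2.0, -1.0, 0.5, 3.0]) + rng.normal(scale=0.1, size=n)

lam = 0.1
# Closed-form ridge estimate: (X^T X + n*lam*I)^{-1} X^T y
theta_closed_form = np.linalg.solve(X.T @ X + n * lam * np.eye(p), X.T @ y)

# scikit-learn minimizes ||y - X theta||^2 + alpha * ||theta||^2, so alpha = n * lam.
model = Ridge(alpha=n * lam, fit_intercept=False).fit(X, y)
print(np.allclose(theta_closed_form, model.coef_))   # expected: True
```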