
Commit 2785d77

note 16 fix
1 parent 1b35b8f commit 2785d77

File tree

2 files changed: +8 -8 lines changed


cv_regularization/cv_reg.qmd

Lines changed: 5 additions & 5 deletions
@@ -279,8 +279,8 @@ $$\frac{1}{n} \sum_{i=1}^n (y_i - (\theta_0 + \theta_1 \phi_{i, 1} + \theta_2 \p
 Unfortunately, we can't directly use this formulation as our objective function – it's not easy to mathematically optimize over a constraint. Instead, we will apply the magic of the [Lagrangian Duality](https://en.wikipedia.org/wiki/Duality_(optimization)). The details of this are out of scope (take EECS 127 if you're interested in learning more), but the end result is very useful. It turns out that minimizing the following *augmented* objective function is *equivalent* to our minimization goal above.

 $$\frac{1}{n} \sum_{i=1}^n (y_i - (\theta_0 + \theta_1 \phi_{i, 1} + \theta_2 \phi_{i, 2} + \ldots + \theta_p \phi_{i, p}))^2 + \lambda \sum_{i=1}^p \vert \theta_i \vert$$
-$$ = ||\mathbb{Y} - \mathbb{X}\theta||_2^2 + \lambda \sum_{i=1}^p |\theta_i|$$
-$$ = ||\mathbb{Y} - \mathbb{X}\theta||_2^2 + \lambda || \theta ||_1$$
+$$ = \frac{1}{n}||\mathbb{Y} - \mathbb{X}\theta||_2^2 + \lambda \sum_{i=1}^p |\theta_i|$$
+$$ = \frac{1}{n}||\mathbb{Y} - \mathbb{X}\theta||_2^2 + \lambda || \theta ||_1$$


 The last two expressions include the MSE expressed using vector notation, and the last expression writes $\sum_{i=1}^p |\theta_i|$ as its **L1 norm** equivalent form, $|| \theta ||_1$.
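
The change in the hunk above adds the missing $\frac{1}{n}$ factor to the vector form of the MSE. As a quick numerical check of the corrected identity, here is a minimal NumPy sketch (synthetic data with hypothetical values of $n$, $p$, $\theta$, and $\lambda$) that evaluates the L1-regularized objective in both the summation form and the vector-norm form; the intercept $\theta_0$ is left out of the penalty to match the sums that start at $i = 1$.

```python
import numpy as np

# Hypothetical synthetic data and parameters, purely to check that
# (1/n) * sum_i (y_i - x_i^T theta)^2 + lambda * sum_{i>=1} |theta_i|
# equals (1/n) * ||Y - X theta||_2^2 + lambda * ||theta_{1:}||_1.
rng = np.random.default_rng(0)
n, p = 100, 4
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])  # intercept column + p features
Y = rng.normal(size=n)
theta = rng.normal(size=p + 1)
lam = 0.1

# Summation form: mean squared residual plus lambda times sum of |theta_i|, i >= 1
obj_sum = np.mean((Y - X @ theta) ** 2) + lam * np.sum(np.abs(theta[1:]))

# Vector-norm form: note the 1/n on the squared L2 norm of the residual
obj_vec = (1 / n) * np.linalg.norm(Y - X @ theta) ** 2 + lam * np.linalg.norm(theta[1:], ord=1)

assert np.isclose(obj_sum, obj_vec)
print(obj_sum, obj_vec)
```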
@@ -290,7 +290,7 @@ Notice that we've replaced the constraint with a second term in our objective fu
 1. Keeping the model's error on the training data low, represented by the term $\frac{1}{n} \sum_{i=1}^n (y_i - (\theta_0 + \theta_1 x_{i, 1} + \theta_2 x_{i, 2} + \ldots + \theta_p x_{i, p}))^2$
 2. Keeping the magnitudes of model parameters low, represented by the term $\lambda \sum_{i=1}^p |\theta_i|$

-The $\lambda$ factor controls the degree of regularization. Roughly speaking, $\lambda$ is related to our $Q$ constraint from before by the rule $\lambda \approx \frac{1}{Q}$. To understand why, let's consider two extreme examples. Recall that our goal is to minimize the cost function: $||\mathbb{Y} - \mathbb{X}\theta||_2^2 + \lambda || \theta ||_1$.
+The $\lambda$ factor controls the degree of regularization. Roughly speaking, $\lambda$ is related to our $Q$ constraint from before by the rule $\lambda \approx \frac{1}{Q}$. To understand why, let's consider two extreme examples. Recall that our goal is to minimize the cost function: $\frac{1}{n}||\mathbb{Y} - \mathbb{X}\theta||_2^2 + \lambda || \theta ||_1$.

 - Assume $\lambda \rightarrow \infty$. Then, $\lambda || \theta ||_1$ dominates the cost function. In order to neutralize the $\infty$ and minimize this term, we set $\theta_j = 0$ for all $j \ge 1$. This is a very constrained model that is mathematically equivalent to the constant model <!--, which also arises when $Q$ approaches $0$. -->
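
As an aside on the two $\lambda$ extremes described in the hunk above, the sketch below uses scikit-learn's `Lasso` on synthetic data. This is an illustration under assumptions: scikit-learn's `alpha` plays the role of $\lambda$, and its objective scales the squared-error term by $\frac{1}{2n}$ rather than $\frac{1}{n}$, so the constants differ, but the qualitative behavior is the same: a tiny `alpha` recovers roughly the OLS fit, while a huge `alpha` drives every coefficient to zero, leaving only the unpenalized intercept (the constant model).

```python
import numpy as np
from sklearn.linear_model import Lasso

# Hypothetical synthetic data: two informative features and one irrelevant one.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([2.0, -1.0, 0.0]) + rng.normal(scale=0.5, size=200)

# Small alpha (lambda -> 0): coefficients close to the unregularized fit.
# Large alpha (lambda -> infinity): all coefficients shrink to exactly 0.
for alpha in [1e-4, 0.1, 100.0]:
    model = Lasso(alpha=alpha, max_iter=10_000).fit(X, y)
    print(f"alpha={alpha:>6}: coef={np.round(model.coef_, 3)}, intercept={model.intercept_:.3f}")
```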

@@ -360,8 +360,8 @@ Notice that all we have done is change the constraint on the model parameters. T

 Using Lagrangian Duality (again, out of scope for Data 100), we can re-express our objective function as:
 $$\frac{1}{n} \sum_{i=1}^n (y_i - (\theta_0 + \theta_1 \phi_{i, 1} + \theta_2 \phi_{i, 2} + \ldots + \theta_p \phi_{i, p}))^2 + \lambda \sum_{i=1}^p \theta_i^2$$
-$$= ||\mathbb{Y} - \mathbb{X}\theta||_2^2 + \lambda \sum_{i=1}^p \theta_i^2$$
-$$= ||\mathbb{Y} - \mathbb{X}\theta||_2^2 + \lambda || \theta ||_2^2$$
+$$= \frac{1}{n}||\mathbb{Y} - \mathbb{X}\theta||_2^2 + \lambda \sum_{i=1}^p \theta_i^2$$
+$$= \frac{1}{n}||\mathbb{Y} - \mathbb{X}\theta||_2^2 + \lambda || \theta ||_2^2$$


 The last two expressions include the MSE expressed using vector notation, and the last expression writes $\sum_{i=1}^p \theta_i^2$ as its **L2 norm** equivalent form, $|| \theta ||_2^2$.

docs/cv_regularization/cv_reg.html

Lines changed: 3 additions & 3 deletions
@@ -613,14 +613,14 @@ <h3 data-number="16.2.2" class="anchored" data-anchor-id="l1-lasso-regularizatio
 <p>To apply our constraint, we need to rephrase our minimization goal as:</p>
 <p><span class="math display">\[\frac{1}{n} \sum_{i=1}^n (y_i - (\theta_0 + \theta_1 \phi_{i, 1} + \theta_2 \phi_{i, 2} + \ldots + \theta_p \phi_{i, p}))^2\:\text{such that} \sum_{i=1}^p |\theta_i| \leq Q\]</span></p>
 <p>Unfortunately, we can’t directly use this formulation as our objective function – it’s not easy to mathematically optimize over a constraint. Instead, we will apply the magic of the <a href="https://en.wikipedia.org/wiki/Duality_(optimization)">Lagrangian Duality</a>. The details of this are out of scope (take EECS 127 if you’re interested in learning more), but the end result is very useful. It turns out that minimizing the following <em>augmented</em> objective function is <em>equivalent</em> to our minimization goal above.</p>
-<p><span class="math display">\[\frac{1}{n} \sum_{i=1}^n (y_i - (\theta_0 + \theta_1 \phi_{i, 1} + \theta_2 \phi_{i, 2} + \ldots + \theta_p \phi_{i, p}))^2 + \lambda \sum_{i=1}^p \vert \theta_i \vert\]</span> <span class="math display">\[ = ||\mathbb{Y} - \mathbb{X}\theta||_2^2 + \lambda \sum_{i=1}^p |\theta_i|\]</span> <span class="math display">\[ = ||\mathbb{Y} - \mathbb{X}\theta||_2^2 + \lambda || \theta ||_1\]</span></p>
+<p><span class="math display">\[\frac{1}{n} \sum_{i=1}^n (y_i - (\theta_0 + \theta_1 \phi_{i, 1} + \theta_2 \phi_{i, 2} + \ldots + \theta_p \phi_{i, p}))^2 + \lambda \sum_{i=1}^p \vert \theta_i \vert\]</span> <span class="math display">\[ = \frac{1}{n}||\mathbb{Y} - \mathbb{X}\theta||_2^2 + \lambda \sum_{i=1}^p |\theta_i|\]</span> <span class="math display">\[ = \frac{1}{n}||\mathbb{Y} - \mathbb{X}\theta||_2^2 + \lambda || \theta ||_1\]</span></p>
 <p>The last two expressions include the MSE expressed using vector notation, and the last expression writes <span class="math inline">\(\sum_{i=1}^p |\theta_i|\)</span> as its <strong>L1 norm</strong> equivalent form, <span class="math inline">\(|| \theta ||_1\)</span>.</p>
 <p>Notice that we’ve replaced the constraint with a second term in our objective function. We’re now minimizing a function with an additional regularization term that <em>penalizes large coefficients</em>. In order to minimize this new objective function, we’ll end up balancing two components:</p>
 <ol type="1">
 <li>Keeping the model’s error on the training data low, represented by the term <span class="math inline">\(\frac{1}{n} \sum_{i=1}^n (y_i - (\theta_0 + \theta_1 x_{i, 1} + \theta_2 x_{i, 2} + \ldots + \theta_p x_{i, p}))^2\)</span></li>
 <li>Keeping the magnitudes of model parameters low, represented by the term <span class="math inline">\(\lambda \sum_{i=1}^p |\theta_i|\)</span></li>
 </ol>
-<p>The <span class="math inline">\(\lambda\)</span> factor controls the degree of regularization. Roughly speaking, <span class="math inline">\(\lambda\)</span> is related to our <span class="math inline">\(Q\)</span> constraint from before by the rule <span class="math inline">\(\lambda \approx \frac{1}{Q}\)</span>. To understand why, let’s consider two extreme examples. Recall that our goal is to minimize the cost function: <span class="math inline">\(||\mathbb{Y} - \mathbb{X}\theta||_2^2 + \lambda || \theta ||_1\)</span>.</p>
+<p>The <span class="math inline">\(\lambda\)</span> factor controls the degree of regularization. Roughly speaking, <span class="math inline">\(\lambda\)</span> is related to our <span class="math inline">\(Q\)</span> constraint from before by the rule <span class="math inline">\(\lambda \approx \frac{1}{Q}\)</span>. To understand why, let’s consider two extreme examples. Recall that our goal is to minimize the cost function: <span class="math inline">\(\frac{1}{n}||\mathbb{Y} - \mathbb{X}\theta||_2^2 + \lambda || \theta ||_1\)</span>.</p>
 <ul>
 <li><p>Assume <span class="math inline">\(\lambda \rightarrow \infty\)</span>. Then, <span class="math inline">\(\lambda || \theta ||_1\)</span> dominates the cost function. In order to neutralize the <span class="math inline">\(\infty\)</span> and minimize this term, we set <span class="math inline">\(\theta_j = 0\)</span> for all <span class="math inline">\(j \ge 1\)</span>. This is a very constrained model that is mathematically equivalent to the constant model <!--, which also arises when $Q$ approaches $0$. --></p></li>
 <li><p>Assume <span class="math inline">\(\lambda \rightarrow 0\)</span>. Then, <span class="math inline">\(\lambda || \theta ||_1=0\)</span>. Minimizing the cost function is equivalent to minimizing <span class="math inline">\(\frac{1}{n} || Y - X\theta ||_2^2\)</span>, our usual MSE loss function. The act of minimizing MSE loss is just our familiar OLS, and the optimal solution is the global minimum <span class="math inline">\(\hat{\theta} = \hat\theta_{No Reg.}\)</span>. <!-- We showed that the global optimum is achieved when the L2 norm ball radius $Q \rightarrow \infty$. --></p></li>
@@ -766,7 +766,7 @@ <h3 data-number="16.2.4" class="anchored" data-anchor-id="l2-ridge-regularizatio
 </center>
 <p>If we modify our objective function like before, we find that our new goal is to minimize the function: <span class="math display">\[\frac{1}{n} \sum_{i=1}^n (y_i - (\theta_0 + \theta_1 \phi_{i, 1} + \theta_2 \phi_{i, 2} + \ldots + \theta_p \phi_{i, p}))^2\:\text{such that} \sum_{i=1}^p \theta_i^2 \leq Q\]</span></p>
 <p>Notice that all we have done is change the constraint on the model parameters. The first term in the expression, the MSE, has not changed.</p>
-<p>Using Lagrangian Duality (again, out of scope for Data 100), we can re-express our objective function as: <span class="math display">\[\frac{1}{n} \sum_{i=1}^n (y_i - (\theta_0 + \theta_1 \phi_{i, 1} + \theta_2 \phi_{i, 2} + \ldots + \theta_p \phi_{i, p}))^2 + \lambda \sum_{i=1}^p \theta_i^2\]</span> <span class="math display">\[= ||\mathbb{Y} - \mathbb{X}\theta||_2^2 + \lambda \sum_{i=1}^p \theta_i^2\]</span> <span class="math display">\[= ||\mathbb{Y} - \mathbb{X}\theta||_2^2 + \lambda || \theta ||_2^2\]</span></p>
+<p>Using Lagrangian Duality (again, out of scope for Data 100), we can re-express our objective function as: <span class="math display">\[\frac{1}{n} \sum_{i=1}^n (y_i - (\theta_0 + \theta_1 \phi_{i, 1} + \theta_2 \phi_{i, 2} + \ldots + \theta_p \phi_{i, p}))^2 + \lambda \sum_{i=1}^p \theta_i^2\]</span> <span class="math display">\[= \frac{1}{n}||\mathbb{Y} - \mathbb{X}\theta||_2^2 + \lambda \sum_{i=1}^p \theta_i^2\]</span> <span class="math display">\[= \frac{1}{n}||\mathbb{Y} - \mathbb{X}\theta||_2^2 + \lambda || \theta ||_2^2\]</span></p>
 <p>The last two expressions include the MSE expressed using vector notation, and the last expression writes <span class="math inline">\(\sum_{i=1}^p \theta_i^2\)</span> as its <strong>L2 norm</strong> equivalent form, <span class="math inline">\(|| \theta ||_2^2\)</span>.</p>
 <p>When applying L2 regularization, our goal is to minimize this updated objective function.</p>
 <p>Unlike L1 regularization, L2 regularization <em>does</em> have a closed-form solution for the best parameter vector when regularization is applied:</p>
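
The closed-form ridge solution referenced in the final context line above is not shown in this diff. Under the $\frac{1}{n}$ scaling introduced by this commit, minimizing $\frac{1}{n}||\mathbb{Y} - \mathbb{X}\theta||_2^2 + \lambda||\theta||_2^2$ gives $\hat{\theta} = (\mathbb{X}^{\top}\mathbb{X} + n\lambda I)^{-1}\mathbb{X}^{\top}\mathbb{Y}$; conventions that fold the $n$ into $\lambda$ or leave the intercept unpenalized change the exact constant. A small NumPy sketch (synthetic data, arbitrary $\lambda$, no intercept column, all hypothetical) that verifies the gradient vanishes at this solution:

```python
import numpy as np

# Hypothetical synthetic data; no intercept column, so every coefficient is
# penalized (leaving theta_0 unpenalized, as in the course text, would change
# the formula slightly).
rng = np.random.default_rng(42)
n, p = 100, 3
X = rng.normal(size=(n, p))
Y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=n)
lam = 0.5

# Closed-form minimizer of (1/n)||Y - X theta||_2^2 + lam * ||theta||_2^2
theta_hat = np.linalg.solve(X.T @ X + n * lam * np.eye(p), X.T @ Y)

# The gradient (2/n) X^T (X theta - Y) + 2 * lam * theta should vanish at theta_hat
grad = (2 / n) * X.T @ (X @ theta_hat - Y) + 2 * lam * theta_hat
assert np.allclose(grad, 0)
print(theta_hat)
```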
