Unfortunately, we can't directly use this formulation as our objective function – it's not easy to mathematically optimize over a constraint. Instead, we will apply the magic of the [Lagrangian Duality](https://en.wikipedia.org/wiki/Duality_(optimization)). The details of this are out of scope (take EECS 127 if you're interested in learning more), but the end result is very useful. It turns out that minimizing the following *augmented* objective function is *equivalent* to our minimization goal above.
$$\frac{1}{n} \sum_{i=1}^n (y_i - (\theta_0 + \theta_1 x_{i, 1} + \theta_2 x_{i, 2} + \ldots + \theta_p x_{i, p}))^2 + \lambda \sum_{i=1}^p |\theta_i|$$

$$= \frac{1}{n}||\mathbb{Y} - \mathbb{X}\theta||_2^2 + \lambda \sum_{i=1}^p |\theta_i|$$

$$= \frac{1}{n}||\mathbb{Y} - \mathbb{X}\theta||_2^2 + \lambda || \theta ||_1$$

The last two expressions include the MSE expressed using vector notation, and the last expression writes $\sum_{i=1}^p |\theta_i|$ in its **L1 norm** equivalent form, $|| \theta ||_1$.
Notice that we've replaced the constraint with a second term in our objective function. We're now minimizing a function with an additional regularization term that *penalizes large coefficients*. In order to minimize this new objective function, we'll end up balancing two components (see the short sketch after the list):
1. Keeping the model's error on the training data low, represented by the term $\frac{1}{n} \sum_{i=1}^n (y_i - (\theta_0 + \theta_1 x_{i, 1} + \theta_2 x_{i, 2} + \ldots + \theta_p x_{i, p}))^2$
2. Keeping the magnitudes of model parameters low, represented by the term $\lambda \sum_{i=1}^p |\theta_i|$
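As a quick illustration (not part of the original notes), the sketch below evaluates these two components directly for a candidate $\theta$. It assumes a design matrix whose first column is the all-ones bias column, so that $\theta_0$ is left unpenalized as in the sum above; all names here are illustrative placeholders.

```python
import numpy as np

def l1_objective(X, y, theta, lam):
    """(1/n) * ||y - X @ theta||_2^2  +  lam * sum_{j>=1} |theta_j|."""
    n = len(y)
    training_error = np.sum((y - X @ theta) ** 2) / n   # component 1: MSE on the training data
    penalty = lam * np.sum(np.abs(theta[1:]))           # component 2: parameter magnitudes (theta_0 excluded)
    return training_error + penalty

# Tiny synthetic example.
rng = np.random.default_rng(0)
X = np.hstack([np.ones((50, 1)), rng.normal(size=(50, 3))])   # bias column + 3 features
theta = np.array([1.0, 2.0, 0.0, -1.0])
y = X @ theta + rng.normal(scale=0.1, size=50)
print(l1_objective(X, y, theta, lam=0.1))
```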
The $\lambda$ factor controls the degree of regularization. Roughly speaking, $\lambda$ is related to our $Q$ constraint from before by the rule $\lambda \approx \frac{1}{Q}$. To understand why, let's consider two extreme examples. Recall that our goal is to minimize the cost function: $\frac{1}{n}||\mathbb{Y} - \mathbb{X}\theta||_2^2 + \lambda || \theta ||_1$.
- Assume $\lambda \rightarrow \infty$. Then, $\lambda || \theta ||_1$ dominates the cost function. Since any nonzero parameter would make this term infinite, minimizing it forces $\theta_j = 0$ for all $j \ge 1$. This is a very constrained model that is mathematically equivalent to the constant model <!--, which also arises when $Q$ approaches $0$. -->
- Assume $\lambda \rightarrow 0$. Then, $\lambda || \theta ||_1 = 0$. Minimizing the cost function is equivalent to minimizing $\frac{1}{n} || \mathbb{Y} - \mathbb{X}\theta ||_2^2$, our usual MSE loss function. The act of minimizing MSE loss is just our familiar OLS, and the optimal solution is the global minimum $\hat{\theta} = \hat{\theta}_{\text{No Reg.}}$. Both extremes are illustrated in the short sketch below. <!-- We showed that the global optimum is achieved when the L2 norm ball radius $Q \rightarrow \infty$. -->
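Here is a minimal sketch (not from the original notes) of these two extremes using scikit-learn's `Lasso`, whose `alpha` parameter plays the role of $\lambda$ (up to a constant scaling); the data is synthetic and purely illustrative.

```python
import numpy as np
from sklearn.linear_model import Lasso

# Synthetic data: 100 observations, 5 features, two true coefficients equal to 0.
rng = np.random.default_rng(42)
X = rng.normal(size=(100, 5))
y = X @ np.array([3.0, -2.0, 0.0, 1.5, 0.0]) + rng.normal(scale=0.1, size=100)

# alpha near 0 behaves like OLS; a huge alpha drives every theta_j (j >= 1) to 0,
# leaving only the intercept -- effectively the constant model.
for alpha in [1e-4, 1e4]:
    model = Lasso(alpha=alpha).fit(X, y)
    print(f"alpha={alpha:g}: coef = {model.coef_.round(3)}, intercept = {model.intercept_:.3f}")
```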
With L2 regularization, we instead constrain the sum of the *squared* parameter magnitudes. If we modify our objective function like before, we find that our new goal is to minimize the function:

$$\frac{1}{n} \sum_{i=1}^n (y_i - (\theta_0 + \theta_1 \phi_{i, 1} + \theta_2 \phi_{i, 2} + \ldots + \theta_p \phi_{i, p}))^2\:\text{such that}\:\sum_{i=1}^p \theta_i^2 \leq Q$$

Notice that all we have done is change the constraint on the model parameters. The first term in the expression, the MSE, has not changed.
Using Lagrangian Duality (again, out of scope for Data 100), we can re-express our objective function as:

$$\frac{1}{n} \sum_{i=1}^n (y_i - (\theta_0 + \theta_1 \phi_{i, 1} + \theta_2 \phi_{i, 2} + \ldots + \theta_p \phi_{i, p}))^2 + \lambda \sum_{i=1}^p \theta_i^2$$

$$= \frac{1}{n}||\mathbb{Y} - \mathbb{X}\theta||_2^2 + \lambda \sum_{i=1}^p \theta_i^2$$

$$= \frac{1}{n}||\mathbb{Y} - \mathbb{X}\theta||_2^2 + \lambda || \theta ||_2^2$$

The last two expressions include the MSE expressed using vector notation, and the last expression writes $\sum_{i=1}^p \theta_i^2$ in its **L2 norm** equivalent form, $|| \theta ||_2^2$.

When applying L2 regularization, our goal is to minimize this updated objective function.

Unlike L1 regularization, L2 regularization *does* have a closed-form solution for the best parameter vector when regularization is applied:

$$\hat{\theta}_{\text{ridge}} = (\mathbb{X}^T\mathbb{X} + n\lambda I)^{-1}\mathbb{X}^T\mathbb{Y}$$
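As a sanity check (not part of the original notes), the sketch below computes this closed form with NumPy and compares it against scikit-learn's `Ridge`. The data is synthetic; `fit_intercept=False` is used so that every parameter is regularized as in the formula above, and `alpha` is set to $n\lambda$ because scikit-learn's objective omits the $\frac{1}{n}$ factor on the squared error.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
n, p = 100, 4
X = rng.normal(size=(n, p))
y = X @ np.array([2.0, -1.0, 0.5, 3.0]) + rng.normal(scale=0.1, size=n)

lam = 0.1
# Closed-form ridge estimate: (X^T X + n*lam*I)^{-1} X^T y
theta_closed_form = np.linalg.solve(X.T @ X + n * lam * np.eye(p), X.T @ y)

# scikit-learn minimizes ||y - X theta||^2 + alpha * ||theta||^2, so alpha = n * lam.
model = Ridge(alpha=n * lam, fit_intercept=False).fit(X, y)
print(np.allclose(theta_closed_form, model.coef_))   # expected: True
```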