Tiny fixes to logistic regression 2

DS-100 · Nov 10, 2023 · 90dc4fd
1 parent 4f2d883
Showing 4 changed files with 587 additions and 23 deletions.
diff --git a/...ssion_2/images/linear_seperability_1D.png → ...ssion_2/images/linear_separability_1D.png b/...ssion_2/images/linear_seperability_1D.png → ...ssion_2/images/linear_separability_1D.png
diff --git a/...ssion_2/images/linear_seperability_2D.png → ...ssion_2/images/linear_separability_2D.png b/...ssion_2/images/linear_seperability_2D.png → ...ssion_2/images/linear_separability_2D.png
diff --git a/logistic_regression_2/logistic_reg_2.ipynb b/logistic_regression_2/logistic_reg_2.ipynb
@@ -30,7 +30,7 @@
         "* Introduce new metrics for model performance\n",
         "::: \n",
         "\n",
-        "Today, we will continue studying the Logistic Regression model. We'll discussion decision boundaries that help inform the classification of a particular prediction. Then, we'll pick up from last lecture's discussion of cross-entropy loss, study a few of its pitfalls, and learn potential remedies. We will also provide an implementation of `sklearn`'s logistic regression model. Lastly, we'll return to decision rules and discus metrics that allow us to determine our model's performance in different scenarios. \n",
+        "Today, we will continue studying the Logistic Regression model. We'll discuss decision boundaries that help inform the classification of a particular prediction. Then, we'll pick up from last lecture's discussion of cross-entropy loss, study a few of its pitfalls, and learn potential remedies. We will also provide an implementation of `sklearn`'s logistic regression model. Lastly, we'll return to decision rules and discus metrics that allow us to determine our model's performance in different scenarios. \n",
         "\n",
         "This will introduce us to the process of **thresholding** -- a technique used to *classify* data from our model's predicted probabilities, or $P(Y=1|x)$. In doing so, we'll focus on how these thresholding decisions affect the behavior of our model. We will learn various evaluation metrics useful for binary classification, and apply them to our study of logistic regression.\n",
         "\n",
@@ -55,14 +55,14 @@
         "    \n",
         "The threshold is often set to $T = 0.5$, but *not always*. We'll discuss why we might want to use other thresholds  $T \\neq 0.5$ later in this lecture.\n",
         "\n",
-        "Using our decision rule, we can define a **decision boundary** as the “line” the splits the data into classes based on its features. For logistic regression, the decision boundary is a **hyperplane** -- a linear combination of the features in p-dimensions -- and we can recover it from the final logistic regression model. For example, if we have a model with 2 features (2D), we have $\\theta = [\\theta_0 \\theta_1 \\theta_2]$ (including the intercept term), and we can solve for the decision boundary like so: \n",
+        "Using our decision rule, we can define a **decision boundary** as the “line” the splits the data into classes based on its features. For logistic regression, the decision boundary is a **hyperplane** -- a linear combination of the features in p-dimensions -- and we can recover it from the final logistic regression model. For example, if we have a model with 2 features (2D), we have $\\theta = [\\theta_0, \\theta_1, \\theta_2]$ (including the intercept term), and we can solve for the decision boundary like so: \n",
         "\n",
         "$$\n",
         "\\begin{align}\n",
         "T &= \\frac{1}{1 + e^{\\theta_0 + \\theta_1 * \\text{feature1} +  \\theta_2 * \\text{feature2}}} \\\\\n",
-        "1 + e^{\\theta_0 + \\theta_1 * \\text{feature1} +  \\theta_2 * \\text{feature2}} &= \\frac{1}{T} \\\\\n",
-        "e^{\\theta_0 + \\theta_1 * \\text{feature1} +  \\theta_2 * \\text{feature2}} &= \\frac{1}{T} - 1 \\\\\n",
-        "\\theta_0 + \\theta_1 * \\text{feature1} +  \\theta_2 * \\text{feature2} &= log(\\frac{1}{T} - 1)\n",
+        "1 + e^{\\theta_0 + \\theta_1 \\cdot \\text{feature1} +  \\theta_2  \\cdot  \\text{feature2}} &= \\frac{1}{T} \\\\\n",
+        "e^{\\theta_0 + \\theta_1  \\cdot  \\text{feature1} +  \\theta_2  \\cdot  \\text{feature2}} &= \\frac{1}{T} - 1 \\\\\n",
+        "\\theta_0 + \\theta_1  \\cdot  \\text{feature1} +  \\theta_2  \\cdot  \\text{feature2} &= \\log(\\frac{1}{T} - 1)\n",
         "\\end{align} \n",
         "$$\n",
         "\n",
@@ -81,21 +81,21 @@
       "cell_type": "markdown",
       "metadata": {},
       "source": [
-        "## Linear Seperability and Regularization\n",
+        "## Linear Separability and Regularization\n",
         "\n",
         "A classification dataset is said to be **linearly separable** if there exists a hyperplane among input features $x$ that separates the two classes $y$. \n",
         "\n",
-        "Linear seperability in 1D can be found with a rugplot of a single feature. For example, notice how the plot on the bottom left is linearly seperable along the vertical line $x=0$. However, no such line perfectly seperates the two classes on the bottom right.\n",
+        "Linear separability in 1D can be found with a rugplot of a single feature. For example, notice how the plot on the bottom left is linearly separable along the vertical line $x=0$. However, no such line perfectly separates the two classes on the bottom right.\n",
         "\n",
-        "<center><img src=\"images/linear_seperability_1D.png\" alt='linear_seperability_1D' width='800'></center>\n",
+        "<center><img src=\"images/linear_separability_1D.png\" alt='linear_separability_1D' width='800'></center>\n",
         "\n",
-        "This same definition holds in higher dimensions. If there are two features, the seperating hyperplane must exist in two dimensions (any line of the form $y=mx+b$). We can visualize this using a scatter plot.\n",
+        "This same definition holds in higher dimensions. If there are two features, the separating hyperplane must exist in two dimensions (any line of the form $y=mx+b$). We can visualize this using a scatter plot.\n",
         "\n",
-        "<center><img src=\"images/linear_seperability_2D.png\" alt='linear_seperability_1D' width='800'></center>\n",
+        "<center><img src=\"images/linear_separability_2D.png\" alt='linear_separability_1D' width='800'></center>\n",
         "\n",
         "This sounds great! When the dataset is linearly separable, a logistic regression classifier can perfectly assign datapoints into classes. However, (unexpected) complications may arise when data is linearly separable. Consider the toy dataset with 2 points and only a single feature $x$:\n",
         "\n",
-        "<center><img src=\"images/toy_2_point.png\" alt='toy_linear_seperability' width='500'></center>\n",
+        "<center><img src=\"images/toy_2_point.png\" alt='toy_linear_separability' width='500'></center>\n",
         "\n",
         "The optimal $\\theta$ value that minimizes loss pushes the predicted probabilities of the data points to their true class.\n",
         "\n",
@@ -111,7 +111,7 @@
         "\n",
         "The diverging weights cause the model to be overconfident. For example, consider the new point $(x, y) = (0.5, 1)$. Following the behavior above, our model will incorrectly predict $p=0$, and a thus, $\\hat y = 0$.\n",
         "\n",
-        "<center><img src=\"images/toy_3_point.png\" alt='toy_linear_seperability' width='500'></center>\n",
+        "<center><img src=\"images/toy_3_point.png\" alt='toy_linear_separability' width='500'></center>\n",
         "\n",
         "The loss incurred by this misclassified point is infinite.\n",
         "\n",
@@ -169,7 +169,7 @@
         "## Performance Metrics\n",
         "You might be thinking, if we've already introduced cross-entropy loss, why do we need additional ways of assessing how well our models perform? In linear regression, we made numerical predictions and used a loss function to determine how “good” these predictions were. In logistic regression, our ultimate goal is to classify data – we are much more concerned with whether or not each datapoint was assigned the correct class using the decision rule. As such, we are interested in the *quality* of classifications, not the predicted probabilities.\n",
         "\n",
-        "The most basic evaluation metric is **accuracy** -- the proportion of correctly classified points.\n",
+        "The most basic evaluation metric is **accuracy**, that is, the proportion of correctly classified points.\n",
         "\n",
         "$$\\text{accuracy} = \\frac{\\# \\text{ of points classified correctly}}{\\# \\text{ of total points}}$$\n",
         "\n",
@@ -185,7 +185,7 @@
         "- **Model 1**: Our first model classifies every email as non-spam. The model's accuracy is high ($\\frac{95}{100} = 0.95$), but it doesn't detect any spam emails. Despite the high accuracy, this is a bad model.\n",
         "- **Model 2**: The second model classifies every email as spam. The accuracy is low ($\\frac{5}{100} = 0.05$), but the model correctly labels every spam email. Unfortunately, it also misclassifies every non-spam email.\n",
         "\n",
-        "As this example illustrates, accuracy is not always a good metric for classification, particularly when your data have class imbalance (e.g., very few 1’s compared to 0’s).\n",
+        "As this example illustrates, accuracy is not always a good metric for classification, particularly when your data could exhibit class imbalance (e.g., very few 1’s compared to 0’s).\n",
         "\n",
         "### Types of Classification\n",
         "There are 4 different different classifications that our model might make:\n",
@@ -405,7 +405,7 @@
         "\n",
         "\n",
         "#### [Extra] What is the “worst” AUC and why is it 0.5? \n",
-        "On the other hand, a terrible model will have an AUC closer to 0.5. Random predictors randomly predicts P(Y = 1 | x) to be uniformly between 0 and 1. This indicates the classifier is not able to distinguish between positive and negative classes, and thus, randomly predicts one of the two.\n",
+        "On the other hand, a terrible model will have an AUC closer to 0.5. Random predictors randomly predicts $P(Y = 1 | x)$ to be uniformly between 0 and 1. This indicates the classifier is not able to distinguish between positive and negative classes, and thus, randomly predicts one of the two.\n",
         "\n",
         "<center><img src=\"images/roc_curve_worst_predictor.png\" alt='roc_curve_worst_predictor' width='900'></center>"
       ]
@@ -419,29 +419,29 @@
         "$$\n",
         "t_i = \\phi(x_i)^T \\theta \\\\\n",
         "p_i = \\sigma(t_i) \\\\\n",
-        "t_i = log(\\frac{p_i}{1 - p_i}) \\\\\n",
+        "t_i = \\log(\\frac{p_i}{1 - p_i}) \\\\\n",
         "1 - \\sigma(t_i) = \\sigma(-t_i) \\\\\n",
         "\\frac{d}{dt}  \\sigma(t) =  \\sigma(t) \\sigma(-t)\n",
         "$$\n",
         "\n",
         "Now, we can simplify the cross-entropy loss\n",
         "$$\n",
         "\\begin{align}\n",
-        "y_i log(p_i) + (1 - y_i)log(1 - p_i) &= y_i log(\\frac{p_i}{1 - p_i}) + log(1 - p_i) \\\\\n",
-        "&= y_i \\phi(x_i)^T + log(\\sigma(-\\phi(x_i)^T \\theta))\n",
+        "y_i \\log(p_i) + (1 - y_i) \\log(1 - p_i) &= y_i \\log(\\frac{p_i}{1 - p_i}) + \\log(1 - p_i) \\\\\n",
+        "&= y_i \\phi(x_i)^T + \\log(\\sigma(-\\phi(x_i)^T \\theta))\n",
         "\\end{align}\n",
         "$$\n",
         "\n",
         "Hence, the optimal $\\hat{\\theta}$ is \n",
-        "$$\\argmin_{\\theta} - \\frac{1}{n} \\sum_{i=1}^n (y_i \\phi(x_i)^T + log(\\sigma(-\\phi(x_i)^T \\theta)))$$ \n",
+        "$$\\argmin_{\\theta} - \\frac{1}{n} \\sum_{i=1}^n (y_i \\phi(x_i)^T + \\log(\\sigma(-\\phi(x_i)^T \\theta)))$$ \n",
         "\n",
-        "We want to minimize $$L(\\theta) = - \\frac{1}{n} \\sum_{i=1}^n (y_i \\phi(x_i)^T + log(\\sigma(-\\phi(x_i)^T \\theta)))$$\n",
+        "We want to minimize $$L(\\theta) = - \\frac{1}{n} \\sum_{i=1}^n (y_i \\phi(x_i)^T + \\log(\\sigma(-\\phi(x_i)^T \\theta)))$$\n",
         "\n",
         "So we take the derivative \n",
         "$$ \n",
         "\\begin{align}\n",
-        "\\triangledown_{\\theta} L(\\theta) &= - \\frac{1}{n} \\sum_{i=1}^n \\triangledown_{\\theta} y_i \\phi(x_i)^T + \\triangledown_{\\theta} log(\\sigma(-\\phi(x_i)^T \\theta)) \\\\\n",
-        "&= - \\frac{1}{n} \\sum_{i=1}^n y_i \\phi(x_i) + \\triangledown_{\\theta} log(\\sigma(-\\phi(x_i)^T \\theta)) \\\\\n",
+        "\\triangledown_{\\theta} L(\\theta) &= - \\frac{1}{n} \\sum_{i=1}^n \\triangledown_{\\theta} y_i \\phi(x_i)^T + \\triangledown_{\\theta} \\log(\\sigma(-\\phi(x_i)^T \\theta)) \\\\\n",
+        "&= - \\frac{1}{n} \\sum_{i=1}^n y_i \\phi(x_i) + \\triangledown_{\\theta} \\log(\\sigma(-\\phi(x_i)^T \\theta)) \\\\\n",
         "&= - \\frac{1}{n} \\sum_{i=1}^n y_i \\phi(x_i) + \\frac{1}{\\sigma(-\\phi(x_i)^T \\theta)} \\triangledown_{\\theta} \\sigma(-\\phi(x_i)^T \\theta) \\\\\n",
         "&= - \\frac{1}{n} \\sum_{i=1}^n y_i \\phi(x_i) + \\frac{\\sigma(-\\phi(x_i)^T \\theta)}{\\sigma(-\\phi(x_i)^T \\theta)} \\sigma(\\phi(x_i)^T \\theta)\\triangledown_{\\theta} \\sigma(-\\phi(x_i)^T \\theta) \\\\\n",
         "&= - \\frac{1}{n} \\sum_{i=1}^n (y_i - \\sigma(\\phi(x_i)^T \\theta)\\phi(x_i))\n",
@@ -459,7 +459,7 @@
         "### Stochastic Gradient Descent Update Rule\n",
         "$$\\theta^{(0)} \\leftarrow \\text{initial vector (random, zeros, ...)} $$\n",
         "\n",
-        "For $\\tau$ from 0 to convergence, let $B ~ \\text{Random subset of indices}$. \n",
+        "For $\\tau$ from 0 to convergence, let $B$ ~ $\\text{Random subset of indices}$. \n",
         "$$ \\theta^{(\\tau + 1)} \\leftarrow \\theta^{(\\tau)} + \\rho(\\tau)\\left( \\frac{1}{|B|} \\sum_{i \\in B} \\triangledown_{\\theta} L_i(\\theta) \\mid_{\\theta = \\theta^{(\\tau)}}\\right) $$"
       ]
     }