diff --git a/logistic_regression_1/logistic_reg_1.ipynb b/logistic_regression_1/logistic_reg_1.ipynb
new file mode 100644
index 00000000..87999b90
--- /dev/null
+++ b/logistic_regression_1/logistic_reg_1.ipynb
@@ -0,0 +1,707 @@
+{
+ "cells": [
+ {
+ "cell_type": "raw",
+ "metadata": {},
+ "source": [
+ "---\n",
+ "title: Logistic Regression I\n",
+ "format:\n",
+ " html:\n",
+ " toc: true\n",
+ " toc-depth: 5\n",
+ " toc-location: right\n",
+ " code-fold: false\n",
+ " theme:\n",
+ " - cosmo\n",
+ " - cerulean\n",
+ " callout-icon: false\n",
+ "---"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "::: {.callout-note collapse=\"false\"}\n",
+ "## Learning Outcomes\n",
+ "* Understand the difference between regression and classification\n",
+ "* Derive the logistic regression model for classifying data\n",
+ "* Quantify the error of our logistic regression model with cross-entropy loss\n",
+ ":::\n",
+ "\n",
+ "\n",
+ "Up until this point in the class, we've focused on **regression** tasks - that is, predicting a *numerical* quantity from a given dataset. We discussed optimization, feature engineering, and regularization all in the context of performing regression to predict some quantity. \n",
+ "\n",
+ "Now that we have this deep understanding of the modeling process, let's expand our knowledge of possible modeling tasks. \n",
+ "\n",
+ "\n",
+ "## Classification\n",
+ "\n",
+ "In the next two lectures, we'll tackle the task of **classification**. A classification problem aims to classify data into *categories*. Unlike in regression, where we predicted a numeric output, classification involves predicting some **categorical variable**, or **response**, $y$. Examples of classification tasks include:\n",
+ "\n",
+ "* Predicting which team won from its turnover percentage \n",
+ "* Predicting the day of the week of a meal from the total restaurant bill \n",
+ "* Predicting the model of car from its horsepower\n",
+ "\n",
+ "There are a few different types of classification:\n",
+ "\n",
+ "* Binary classification: classify data into two classes, and responses $y$ are either 0 or 1\n",
+ "* Multiclass classification: classify data into multiple classes (e.g., image labeling, next word in a sentence, etc.)\n",
+ "* Structured prediction tasks: conduct multiple related classification predictions (e.g., translation, voice recognition, etc.)\n",
+ "\n",
+ "In Data 100, we will mostly deal with **binary classification**, where we are attempting to classify data into one of two classes. \n",
+ "\n",
+ "To build a classification model, we need to modify our modeling workflow slightly. Recall that in regression we:\n",
+ "\n",
+ "1. Created a design matrix of numeric features\n",
+ "2. Defined our model as a linear combination of these numeric features\n",
+ "3. Used the model to output numeric predictions\n",
+ "\n",
+ "In classification, however, we no longer want to output numeric predictions; instead, we want to predict the class to which a datapoint belongs. This means that we need to update our workflow. To build a classification model, we will:\n",
+ "\n",
+ "1. Create a design matrix of numeric features.\n",
+ "2. Define our model as a linear combination of these numeric features, transformed by a non-linear **sigmoid function**. This outputs a numeric quantity.\n",
+ "3. Apply a **decision rule** to interpret the outputted quantity and decide a classification.\n",
+ "4. 
Output a predicted class.\n",
+ "\n",
+ "There are two key differences. First, as we'll soon see, we need to incorporate a non-linear transformation to capture non-linear relationships hidden in our data. We do so by applying the sigmoid function to a linear combination of the features. Second, we must apply a decision rule to convert the numeric quantities computed by our model into an actual class prediction. This can be as simple as saying that any datapoint with a feature greater than some number $x$ belongs to Class 1.\n",
+ "\n",
+ "
reg
\n", + "\n", + "
class
\n", + "\n", + "This was a very high-level overview. Let's walk through the process in detail to clarify what we mean.\n", + "\n", + "## Deriving the Logistic Regression Model\n", + "\n", + "Throughout this lecture, we will work with the `games` dataset, which contains information about games played in the NBA basketball league. Our goal will be to use a basketball team's `\"GOAL_DIFF\"` to predict whether or not a given team won their game (`\"WON\"`). If a team wins their game, we'll say they belong to Class 1. If they lose, they belong to Class 0.\n", + "\n", + "For those who are curious, `\"GOAL_DIFF\"` represents the difference in successful field goal percentages between the two competing teams. " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "#| code-fold: true\n", + "import warnings\n", + "warnings.filterwarnings(\"ignore\")\n", + "\n", + "import pandas as pd\n", + "import numpy as np\n", + "np.seterr(divide='ignore')\n", + "\n", + "games = pd.read_csv(\"data/games\").dropna()\n", + "games.head()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Let's visualize the relationship between `\"GOAL_DIFF\"` and `\"WON\"` using the Seaborn function `sns.stripplot`. A strip plot automatically introduces a small amount of random noise to jitter the data. Recall that all values in the `\"WON\"` column are either 1 (won) or 0 (lost) – if we were to directly plot them without jittering, we would see severe overplotting." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "#| code-fold: true\n", + "import seaborn as sns\n", + "import matplotlib.pyplot as plt\n", + "\n", + "sns.stripplot(data=games, x=\"GOAL_DIFF\", y=\"WON\", orient=\"h\")\n", + "# By default, sns.stripplot plots 0, then 1. We invert the y axis to reverse this behavior\n", + "plt.gca().invert_yaxis();" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "This dataset is unlike anything we've seen before – our target variable contains only two unique values! Remember that each y value is either 0 or 1; the plot above jitters the y data slightly for ease of reading.\n", + "\n", + "The regression models we have worked with always assumed that we were attempting to predict a continuous target. If we apply a linear regression model to this dataset, something strange happens." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "#| code-fold: true\n", + "import sklearn.linear_model as lm\n", + "\n", + "X, Y = games[[\"GOAL_DIFF\"]], games[\"WON\"]\n", + "regression_model = lm.LinearRegression()\n", + "regression_model.fit(X, Y)\n", + "\n", + "plt.plot(X.squeeze(), regression_model.predict(X), \"k\")\n", + "sns.stripplot(data=games, x=\"GOAL_DIFF\", y=\"WON\", orient=\"h\")\n", + "plt.gca().invert_yaxis();" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The linear regression fit follows the data as closely as it can. However, there are a few key flaws with this approach:\n", + "\n", + "* The predicted output, $\\hat{y}$, can be outside the range of possible classes (there are predictions above 1 and below 0)\n", + "* This means that the output can't always be interpreted (what does it mean to predict a class of -2.3?)\n", + "\n", + "Our usual linear regression framework won't work here. 
Instead, we'll need to get more creative.\n", + "\n", + "Back in [Data 8](https://inferentialthinking.com/chapters/08/1/Applying_a_Function_to_a_Column.html#example-prediction), you gradually built up to the concept of linear regression by using the **graph of averages**. Before you knew the mathematical underpinnings of the regression line, you took a more intuitive approach: you bucketed the $x$ data into bins of common values, then computed the average $y$ for all datapoints in the same bin. The result gave you the insight needed to derive the regression fit.\n", + "\n", + "Let's take the same approach as we grapple with our new classification task. In the cell below, we 1) bucket the `\"GOAL_DIFF\"` data into bins of similar values and 2) compute the average `\"WON\"` value of all datapoints in a bin." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "bins = pd.cut(games[\"GOAL_DIFF\"], 20)\n", + "games[\"bin\"] = [(b.left + b.right) / 2 for b in bins]\n", + "win_rates_by_bin = games.groupby(\"bin\")[\"WON\"].mean()\n", + "\n", + "# alpha makes the points transparent so we can see the line more clearly\n", + "sns.stripplot(data=games, x=\"GOAL_DIFF\", y=\"WON\", orient=\"h\", alpha=0.3)\n", + "plt.plot(win_rates_by_bin.index, win_rates_by_bin, c=\"tab:red\")\n", + "plt.gca().invert_yaxis();" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Interesting: our result is certainly not like the straight line produced by finding the graph of averages for a linear relationship. We can make two observations:\n", + "\n", + "* All predictions on our line are between 0 and 1\n", + "* The predictions are **non-linear**, following a rough \"S\" shape\n", + "\n", + "Let's think more about what we've just done.\n", + "\n", + "To find the average $y$ value for each bin, we computed:\n", + "\n", + "$$\\frac{1 \\text{(\\# Y = 1 in bin)} + 0 \\text{(\\# Y = 0 in bin)}}{\\text{\\# datapoints in bin}} = \\frac{\\text{\\# Y = 1 in bin}}{\\text{\\# datapoints in bin}} = P(\\text{Y = 1} | \\text{bin})$$\n", + "\n", + "This is simply the probability of a datapoint in that bin belonging to Class 1! This aligns with our observation from earlier: all of our predictions lie between 0 and 1, just as we would expect for a probability.\n", + "\n", + "Our graph of averages was really modeling the probability, $p$, that a datapoint belongs to Class 1, or essentially that $\\text{Y = 1}$ for a particular value of $\\text{x}$.\n", + "\n", + "$$ p = P(Y = 1 | \\text{ x} )$$\n", + "\n", + "In logistic regression, we have a new modeling goal. We want to model the **probability that a particular datapoint belongs to Class 1**. To do so, we'll need to create a model that can approximate the S-shaped curve we plotted above.\n", + "\n", + "Fortunately for us, we're already well-versed with a technique to model non-linear relationships – applying non-linear transformations to linearize the relationship. Recall the steps we've applied previously:\n", + "\n", + "* Transform the variables until we linearize their relationship\n", + "* Fit a linear model to the transformed variables\n", + "* \"Undo\" our transformations to identify the underlying relationship between the original variables\n", + "\n", + "In past examples, we used the bulge diagram to help us decide what transformations may be useful. The S-shaped curve we saw above, however, looks nothing like any relationship we've seen in the past. 
We'll need to think carefully about what transformations will linearize this curve.\n", + "\n", + "Let's consider our eventual goal: determining if we should predict a Class of 0 or 1 for each datapoint. Rephrased, we want to decide if it seems more \"likely\" that the datapoint belongs to Class 0 or to Class 1. One way of deciding this is to see which class has the higher predicted probability for a given datapoint. The **odds** is defined as the probability of a datapoint belonging to Class 1 divided by the probability of it belonging to Class 0. \n", + "\n", + "$$\\text{odds} = \\frac{P(Y=1|x)}{P(Y=0|x)} = \\frac{p}{1-p}$$\n", + "\n", + "If we plot the odds for each input `\"GOAL_DIFF\"` ($x$), we see something that looks more promising." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "p = win_rates_by_bin\n", + "odds = p/(1-p) \n", + "\n", + "plt.plot(odds.index, odds)\n", + "plt.xlabel(\"x\")\n", + "plt.ylabel(r\"Odds $= \\frac{p}{1-p}$\");" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "It turns out that the relationship between our input `\"GOAL_DIFF\"` and the odds is roughly exponential! Let's linearize the exponential by taking the logarithm. As a reminder, you should assume that any logarithm in Data 100 is the base $e$ natural logarithm unless told otherwise." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import numpy as np\n", + "log_odds = np.log(odds)\n", + "plt.plot(odds.index, log_odds, c=\"tab:green\")\n", + "plt.xlabel(\"x\")\n", + "plt.ylabel(r\"Log-Odds $= \\log{\\frac{p}{1-p}}$\");" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We see something promising – the relationship between the log-odds and our input feature is approximately linear. This means that we can use a linear model to describe the relationship between the log-odds and $x$. In other words:\n", + "\n", + "\\begin{align}\n", + "\\log{(\\frac{p}{1-p})} &= \\theta_0 + \\theta_1 x_i\\\\\n", + "&= x^{\\top} \\theta\n", + "\\end{align}\n", + "\n", + "Here, we use $x^{\\top}$ to represent an observation in our dataset, stored as a row vector. You can imagine it as a single row in our design matrix. $x^{\\top} \\theta$ indicates a linear combination of the features for this observation (just as we used in multiple linear regression). \n", + "\n", + "We're in good shape! We have now derived the following relationship:\n", + "\n", + "$$\\log{(\\frac{p}{1-p})} = x^{\\top} \\theta$$\n", + "\n", + "Remember that our goal is to predict the probability of a datapoint belonging to Class 1, $p$. Let's rearrange this relationship to uncover the original relationship between $p$ and our input data, $x^{\\top}$.\n", + "\n", + "\\begin{align}\n", + "\\log{(\\frac{p}{1-p})} &= x^T \\theta\\\\\n", + "\\frac{p}{1-p} &= e^{x^T \\theta}\\\\\n", + "p &= (1-p)e^{x^T \\theta}\\\\\n", + "p &= e^{x^T \\theta}- p e^{x^T \\theta}\\\\\n", + "p(1 + e^{x^T \\theta}) &= e^{x^T \\theta} \\\\\n", + "p &= \\frac{e^{x^T \\theta}}{1+e^{x^T \\theta}}\\\\\n", + "p &= \\frac{1}{1+e^{-x^T \\theta}}\\\\\n", + "\\end{align}\n", + "\n", + "Phew, that was a lot of algebra. What we've uncovered is the **logistic regression model** to predict the probability of a datapoint $x^{\\top}$ belonging to Class 1. If we plot this relationship for our data, we see the S-shaped curve from earlier!" 
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "#| code-fold: true\n",
+ "# We'll discuss the `LogisticRegression` class next time\n",
+ "xs = np.linspace(-0.3, 0.3)\n",
+ "\n",
+ "logistic_model = lm.LogisticRegression(C=20)\n",
+ "logistic_model.fit(X, Y)\n",
+ "predicted_prob = logistic_model.predict_proba(xs[:, np.newaxis])[:, 1]\n",
+ "\n",
+ "sns.stripplot(data=games, x=\"GOAL_DIFF\", y=\"WON\", orient=\"h\", alpha=0.5)\n",
+ "plt.plot(xs, predicted_prob, c=\"k\", lw=3, label=\"Logistic regression model\")\n",
+ "plt.plot(win_rates_by_bin.index, win_rates_by_bin, lw=2, c=\"tab:red\", label=\"Graph of averages\")\n",
+ "plt.legend(loc=\"upper left\")\n",
+ "plt.gca().invert_yaxis();"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "To predict a probability using the logistic regression model, we:\n",
+ "\n",
+ "1. Compute a linear combination of the features, $x^{\\top}\\theta$.\n",
+ "2. Apply the sigmoid activation function, $\\sigma(x^{\\top} \\theta)$.\n",
+ "\n",
+ "Our predicted probabilities are of the form $P(Y=1|x) = p = \\frac{1}{1+e^{-(\\theta_0 + \\theta_1 x_1 + \\theta_2 x_2 + \\ldots + \\theta_p x_p)}}$\n",
+ "\n",
+ "An important note: despite its name, logistic regression is used for *classification* tasks, not regression tasks. In Data 100, we always apply logistic regression with the goal of classifying data.\n",
+ "\n",
+ "The S-shaped curve is formally known as the **sigmoid function** and is typically denoted by $\\sigma$. \n",
+ "\n",
+ "$$\\sigma(t) = \\frac{1}{1+e^{-t}}$$\n",
+ "\n",
+ "::: {.callout-tip}\n",
+ "## Properties of the Sigmoid\n",
+ "* Reflection/Symmetry: $1-\\sigma(t) = \\frac{e^{-t}}{1+e^{-t}}=\\sigma(-t)$\n",
+ "* Inverse: $t=\\sigma^{-1}(p)=\\log{(\\frac{p}{1-p})}$\n",
+ "* Derivative: $\\frac{d}{dt} \\sigma(t) = \\sigma(t) (1-\\sigma(t))=\\sigma(t)\\sigma(-t)$\n",
+ "* Domain: $-\\infty < t < \\infty$\n",
+ "* Range: $0 < \\sigma(t) < 1$\n",
+ ":::\n",
+ "\n",
+ "In the context of our modeling process, the sigmoid is considered an **activation function**. It takes in a linear combination of the features and applies a non-linear transformation.\n",
+ "\n",
+ "Let's summarize our logistic regression modeling workflow.\n",
+ "\n",
+ "<div align=\"middle\">
log_reg
\n", + "\n", + "Our main takeaways from this section are:\n", + "\n", + "* Fit the \"S\" curve as best as possible\n", + "* The curve models the probability: $P = (Y=1 | x)$\n", + "* Assume log-odds is a linear combination of $x$ and $\\theta$\n", + "\n", + "Putting this together, we know that the estimated probability that response is 1 given the features $x$ is equal to the logistic function $\\sigma()$ at the value $x^{\\top}\\theta$:\n", + "\n", + "\\begin{align}\n", + "\\hat{P}_{\\theta}(Y = 1 | x) = \\frac{1}{1 + e^{-x^{\\top}\\theta}}\n", + "\\end{align}\n", + "\n", + "More commonly, the logistic regression model is written as:\n", + "\n", + "\\begin{align}\n", + "\\hat{P}_{\\theta}(Y = 1 | x) = \\sigma(x^{\\top}\\theta)\n", + "\\end{align}\n", + "\n", + "\n", + "::: {.callout-tip}\n", + "## Properties of the Logistic Model\n", + "Consider a logistic regression model with one feature and an intercept term:\n", + "\n", + "\\begin{align}\n", + "p = P(Y = 1 | x) = \\frac{1}{1+e^{-(\\theta_0 + \\theta_1 x)}}\n", + "\\end{align}\n", + "\n", + "Properties:\n", + "\n", + "* $\\theta_0$ controls the position of the curve along the horizontal axis\n", + "* The magnitude of $\\theta_1$ controls the \"steepness\" of the sigmoid\n", + "* The sign of $\\theta_1$ controls the orientation of the curve\n", + "\n", + ":::\n", + "\n", + "::: {.callout collapse=\"true\"}\n", + "## Example Calculation\n", + "Suppose we want to predict the probability that a team wins a game, given `\"GOAL_DIFF\"` (first feature) and the number of free throws (second feature). Let's say we fit a logistic regression model (with no intercept) using the training data and estimate the optimal parameters. Now we want to predict the probability that a new team will win their game.\n", + "\n", + "\\begin{align}\n", + "\\hat{\\theta}^{\\top} = \\begin{matrix}[0.1 & -0.5]\\end{matrix}\n", + "\\\\x^{\\top} = \\begin{matrix}[15 & 1]\\end{matrix}\n", + "\\end{align}\n", + "\n", + "\\begin{align}\n", + "\\hat{P}_{\\hat{\\theta}}(Y = 1 | x) = \\sigma(x^{\\top}\\hat{\\theta}) = \\sigma(0.1 \\cdot 15 + (-0.5) \\cdot 1) = \\sigma(1) = \\frac{1}{1+e^{-1}} \\approx 0.7311\n", + "\\end{align}\n", + "\n", + "We see that the response is more likely to be 1 than 0, indicating that a reasonable prediction is $\\hat{y} = 1$. We'll dive deeper into this in the next lecture.\n", + "\n", + ":::\n", + "\n", + "\n", + "## Cross-Entropy Loss\n", + "\n", + "To quantify the error of our logistic regression model, we'll need to define a loss function. \n", + "\n", + "### Why Not MSE?\n", + "You may wonder: why not use our familiar mean squared error? It turns out that the MSE is not well suited for logistic regression. To see why, let's consider a simple, artificially generated `toy` dataset (this will be easier to work with than the more complicated `games` data)." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "#| code-fold: true\n", + "toy_df = pd.DataFrame({\n", + " \"x\": [-4, -2, -0.5, 1, 3, 5],\n", + " \"y\": [0, 0, 1, 0, 1, 1]})\n", + "toy_df.head()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We'll construct a basic logistic regression model with only one feature and no intercept term. Our predicted probabilities take the form:\n", + "\n", + "$$p=P(Y=1|x)=\\frac{1}{1+e^{-\\theta_1 x}}$$\n", + "\n", + "In the cell below, we plot the MSE for our model on the data." 
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "#| code-fold: true\n",
+ "def sigmoid(z):\n",
+ " return 1/(1+np.e**(-z))\n",
+ " \n",
+ "def mse_on_toy_data(theta):\n",
+ " p_hat = sigmoid(toy_df['x'] * theta)\n",
+ " return np.mean((toy_df['y'] - p_hat)**2)\n",
+ "\n",
+ "thetas = np.linspace(-15, 5, 100)\n",
+ "plt.plot(thetas, [mse_on_toy_data(theta) for theta in thetas])\n",
+ "plt.title(\"MSE on toy classification data\")\n",
+ "plt.xlabel(r'$\\theta_1$')\n",
+ "plt.ylabel('MSE');"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "This looks nothing like the parabola we found when plotting the MSE of a linear regression model! In particular, we can identify two flaws with using the MSE for logistic regression:\n",
+ "\n",
+ "1. The MSE loss surface is *non-convex*. There is both a global minimum and a (barely perceptible) local minimum in the loss surface above. This means that there is the risk of gradient descent converging on the local minimum of the loss surface, missing the true optimum parameter $\\theta_1$.\n",
+ "2. Squared loss is *bounded* for a classification task. Recall that each true $y$ has a value of either 0 or 1. This means that even if our model makes the worst possible prediction (e.g., predicting $p=0$ for $y=1$), the squared loss for an observation will be no greater than 1: $$(y-p)^2=(1-0)^2=1$$ The MSE does not strongly penalize poor predictions.\n",
+ "\n",
+ "### Motivating Cross-Entropy Loss\n",
+ "Suffice it to say, we don't want to use the MSE when working with logistic regression. Instead, we'll consider what kind of behavior we would *like* to see in a loss function.\n",
+ "\n",
+ "Let $y \\in \\{0, 1\\}$ be the binary label, and $p$ be the model's predicted probability of the label being 1. \n",
+ "\n",
+ "* When the true $y$ is 1, we should incur *low* loss when the model predicts large $p$\n",
+ "* When the true $y$ is 0, we should incur *high* loss when the model predicts large $p$\n",
+ "\n",
+ "In other words, our loss function should behave differently depending on the value of the true class, $y$. \n",
+ "\n",
+ "The **cross-entropy loss** incorporates this changing behavior. We will use it throughout our work on logistic regression. Below, we write out the cross-entropy loss for a *single* datapoint (no averages just yet).\n",
+ "\n",
+ "$$\\text{Cross-Entropy Loss} = \\begin{cases}\n",
+ " -\\log{(p)} & \\text{if } y=1 \\\\\n",
+ " -\\log{(1-p)} & \\text{if } y=0\n",
+ "\\end{cases}$$\n",
+ "\n",
+ "Why does this (seemingly convoluted) loss function \"work\"? Let's break it down.\n",
+ "\n",
+ ":::: {.columns}\n",
+ "\n",
+ "::: {.column width=\"35%\"}\n",
+ "When $y=1$\n",
+ "<div align=\"middle\">
cross-entropy loss when Y=1
\n", + "\n", + "* As $p \\rightarrow 0$, loss approches $\\infty$\n", + "* As $p \\rightarrow 1$, loss approaches 0\n", + " \n", + ":::\n", + "\n", + "::: {.column width=\"20%\"}\n", + ":::\n", + "\n", + "::: {.column width=\"35%\"}\n", + "When $y=0$\n", + "
cross-entropy loss when Y=0
\n", + "\n", + "* As $p \\rightarrow 0$, loss approches 0\n", + "* As $p \\rightarrow 1$, loss approaches $\\infty$\n", + " \n", + ":::\n", + "\n", + "::::\n", + "\n", + "All good – we are seeing the behavior we want for our logistic regression model.\n", + "\n", + "The piecewise function we outlined above is difficult to optimize: we don't want to constantly \"check\" which form of the loss function we should be using at each step of choosing the optimal model parameters. We can re-express cross-entropy loss in a more convenient way:\n", + "\n", + "$$\\text{Cross-Entropy Loss} = -\\left(y\\log{(p)}-(1-y)\\log{(1-p)}\\right)$$\n", + "\n", + "By setting $y$ to 0 or 1, we see that this new form of cross-entropy loss gives us the same behavior as the original formulation.\n", + "\n", + ":::: {.columns}\n", + "\n", + "::: {.column width=\"35%\"}\n", + "When $y=1$:\n", + "\n", + "\\begin{align}\n", + "\\text{CE} &= -\\left((1)\\log{(p)}-(1-1)\\log{(1-p)}\\right)\\\\\n", + "&= -\\log{(p)}\n", + "\\end{align}\n", + ":::\n", + "\n", + "::: {.column width=\"20%\"}\n", + ":::\n", + "\n", + "::: {.column width=\"35%\"}\n", + "When $y=0$:\n", + "\n", + "\\begin{align}\n", + "\\text{CE} &= -\\left((0)\\log{(p)}-(1-0)\\log{(1-p)}\\right)\\\\\n", + "&= -\\log{(1-p)}\n", + "\\end{align}\n", + ":::\n", + "\n", + "::::\n", + "\n", + "The empirical risk of the logistic regression model is then the mean cross-entropy loss across all datapoints in the dataset. When fitting the model, we want to determine the model parameter $\\theta$ that leads to the lowest mean cross-entropy loss possible.\n", + "\n", + "$$R(\\theta) = - \\frac{1}{n} \\sum_{i=1}^n \\left(y_i\\log{(p_i)}-(1-y_i)\\log{(1-p_i)}\\right)$$\n", + "$$R(\\theta) = - \\frac{1}{n} \\sum_{i=1}^n \\left(y_i\\log{(\\sigma(X_i^{\\top}\\theta)}-(1-y_i)\\log{(1-\\sigma(X_i^{\\top}\\theta)}\\right)$$\n", + "\n", + "The optimization problem is therefore to find the estimate $\\hat{\\theta}$ that minimizes $R(\\theta)$:\n", + "\n", + "\\begin{align}\n", + "\\hat{\\theta} = \\underset{\\theta}{\\arg\\min} = - \\frac{1}{n} \\sum_{i=1}^n \\left(y_i\\log{(\\sigma(X_i^{\\top}\\theta)}-(1-y_i)\\log{(1-\\sigma(X_i^{\\top}\\theta)}\\right)\n", + "\\end{align}\n", + "\n", + "Plotting the cross-entropy loss surface for our `toy` dataset gives us a more encouraging result – our loss function is now convex. This means we can optimize it using gradient descent. Computing the gradient of the logistic model is fairly challenging, so we'll let `sklearn` take care of this for us. You won't need to compute the gradient of the logistic model in Data 100." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "#| code-fold: true\n", + "def cross_entropy(y, p_hat):\n", + " return - y * np.log(p_hat) - (1 - y) * np.log(1 - p_hat)\n", + "\n", + "def mean_cross_entropy_on_toy_data(theta):\n", + " p_hat = sigmoid(toy_df['x'] * theta)\n", + " return np.mean(cross_entropy(toy_df['y'], p_hat))\n", + "\n", + "plt.plot(thetas, [mean_cross_entropy_on_toy_data(theta) for theta in thetas], color = 'green')\n", + "plt.ylabel(r'Mean Cross-Entropy Loss($\\theta$)')\n", + "plt.xlabel(r'$\\theta$');" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## (Bonus) Maximum Likelihood Estimation\n", + "\n", + "It may have seemed like we pulled cross-entropy loss out of thin air. How did we know that taking the negative logarithms of our probabilities would work so well? 
It turns out that cross-entropy loss is justified by probability theory.\n",
+ "\n",
+ "The following section is out of scope, but is certainly an interesting read!\n",
+ "\n",
+ "### Building Intuition: The Coin Flip\n",
+ "To build some intuition for logistic regression, let's look at an introductory classification example: the coin flip. Suppose we observe some outcomes of a coin flip (1 = Heads, 0 = Tails)."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "flips = [0, 0, 1, 1, 1, 1, 0, 0, 0, 0]\n",
+ "flips"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "A reasonable model is to assume all flips are IID (independent and identically distributed). In other words, each flip has the same probability of returning a 1 (or heads). Let's define a parameter $\\theta$, the probability that the next flip is a heads. We will use this parameter to inform our decision for $\\hat y$ (predicting either 0 or 1) of the next flip. If $\\theta \\ge 0.5$, then $\\hat y = 1$; otherwise, $\\hat y = 0$.\n",
+ "\n",
+ "You may be inclined to say $0.5$ is the best choice for $\\theta$. However, notice that we made no assumption about the coin itself. The coin may be biased, so we should make our decision based only on the data. We know that exactly $\\frac{4}{10}$ of the flips were heads, so we might guess $\\hat \\theta = 0.4$. In the next section, we will mathematically prove why this is the best possible estimate.\n",
+ "\n",
+ "### Likelihood of Data\n",
+ "\n",
+ "Let's call the result of the coin flip a random variable $Y$. This is a Bernoulli random variable with two outcomes. $Y$ has the following distribution: \n",
+ "\n",
+ "$$P(Y = y) = \\begin{cases}\n",
+ " p, \\text{if } y=1\\\\\n",
+ " 1 - p, \\text{if } y=0\n",
+ " \\end{cases} $$\n",
+ "\n",
+ "$p$ is unknown to us. But we can find the $p$ that makes the data we observed the most *likely*.\n",
+ "\n",
+ "The probability of observing 4 heads and 6 tails follows the binomial distribution.\n",
+ "\n",
+ "$$\\binom{10}{4} (p)^4 (1-p)^6$$ \n",
+ "\n",
+ "We define the **likelihood** of obtaining our observed data as a quantity *proportional* to the probability above. To find it, simply multiply the probabilities of obtaining each coin flip.\n",
+ "\n",
+ "$$(p)^{4} (1-p)^6$$ \n",
+ "\n",
+ "The technique known as **maximum likelihood estimation** finds the $p$ that maximizes the above likelihood. You can find this maximum by taking the derivative of the likelihood, but we'll provide a more intuitive graphical solution."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "thetas = np.linspace(0, 1)\n",
+ "plt.plot(thetas, (thetas**4)*(1-thetas)**6)\n",
+ "plt.xlabel(r\"$\\theta$\")\n",
+ "plt.ylabel(\"Likelihood\");"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "More generally, a Bernoulli($p$) random variable $Y$ takes on a value of:\n",
+ "\n",
+ "$$Y = \\begin{cases}\n",
+ " 1, \\text{with probability } p\\\\\n",
+ " 0, \\text{with probability } 1 - p\n",
+ " \\end{cases} $$\n",
+ " \n",
+ "Equivalently, its distribution can be written in a compact way:\n",
+ "\n",
+ "$$P(Y=y) = p^y(1-p)^{1-y}$$\n",
+ "\n",
+ "- When $y = 1$, this reads $P(Y=y) = p$\n",
+ "- When $y = 0$, this reads $P(Y=y) = (1-p)$\n",
+ "\n",
+ "In our example, a Bernoulli random variable is analogous to a single data point (e.g., one instance of a basketball team winning or losing a game). 
Altogether, our `games` data consists of many IID Bernoulli($p$) random variables. To find the likelihood of independent events in succession, simply multiply their likelihoods.\n",
+ "\n",
+ "$$\\prod_{i=1}^{n} p^{y_i} (1-p)^{1-y_i}$$\n",
+ "\n",
+ "As with the coin example, we want to find the parameter $p$ that maximizes this likelihood. Earlier, we gave an intuitive graphical solution, but let's take the derivative of the likelihood to find this maximum.\n",
+ "\n",
+ "At a first glance, this derivative will be complicated! We will have to use the product rule, followed by the chain rule. Instead, we can make an observation that simplifies the problem. \n",
+ "\n",
+ "Finding the $p$ that maximizes $$\\prod_{i=1}^{n} p^{y_i} (1-p)^{1-y_i}$$ is equivalent to finding the $p$ that maximizes $$\\text{log}(\\prod_{i=1}^{n} p^{y_i} (1-p)^{1-y_i})$$\n",
+ "\n",
+ "This is because $\\text{log}$ is a strictly *increasing* function. It won't change the maximum or minimum of the function it was applied to. From $\\text{log}$ properties, $\\text{log}(ab)$ = $\\text{log}(a) + \\text{log}(b)$. We can apply this to our equation above to get:\n",
+ "\n",
+ "$$\\underset{p}{\\text{argmax}} \\sum_{i=1}^{n} \\text{log}(p^{y_i} (1-p)^{1-y_i})$$\n",
+ "\n",
+ "$$= \\underset{p}{\\text{argmax}} \\sum_{i=1}^{n} (\\text{log}(p^{y_i}) + \\text{log}((1-p)^{1-y_i}))$$\n",
+ "\n",
+ "$$= \\underset{p}{\\text{argmax}} \\sum_{i=1}^{n} (y_i\\text{log}(p) + (1-y_i)\\text{log}(1-p))$$\n",
+ "\n",
+ "We can add a constant factor of $\\frac{1}{n}$ out front. It won't affect the $p$ that maximizes our likelihood.\n",
+ "\n",
+ "$$=\\underset{p}{\\text{argmax}} \\frac{1}{n} \\sum_{i=1}^{n} y_i\\text{log}(p) + (1-y_i)\\text{log}(1-p)$$\n",
+ "\n",
+ "One last \"trick\" we can do is change this to a minimization problem by negating the result. This works because we are dealing with a *concave* function, which can be made *convex*.\n",
+ "\n",
+ "$$= \\underset{p}{\\text{argmin}} -\\frac{1}{n} \\sum_{i=1}^{n} y_i\\text{log}(p) + (1-y_i)\\text{log}(1-p)$$\n",
+ "\n",
+ "Now let's say that we have data that are independent, each with a different probability $p_i$. Then, we would want to find the $p_1, p_2, \\dots, p_n$ that maximize $$\\prod_{i=1}^{n} p_i^{y_i} (1-p_i)^{1-y_i}$$\n",
+ "\n",
+ "Setting up and simplifying the optimization problems as we did above, we ultimately want to find:\n",
+ "\n",
+ "$$= \\underset{p}{\\text{argmin}} -\\frac{1}{n} \\sum_{i=1}^{n} y_i\\text{log}(p_i) + (1-y_i)\\text{log}(1-p_i)$$\n",
+ "\n",
+ "For logistic regression, $p_i = \\sigma(x_i^{\\top}\\theta)$. Plugging that in, we get: \n",
+ "\n",
+ "$$= \\underset{\\theta}{\\text{argmin}} -\\frac{1}{n} \\sum_{i=1}^{n} y_i\\text{log}(\\sigma(x_i^{\\top}\\theta)) + (1-y_i)\\text{log}(1-\\sigma(x_i^{\\top}\\theta))$$\n",
+ "\n",
+ "This is exactly our average cross-entropy loss minimization problem from before! \n",
+ "\n",
+ "Why did we do all this complicated math? We have shown that *minimizing* cross-entropy loss is equivalent to *maximizing* the likelihood of the training data.\n",
+ "\n",
+ "- By minimizing cross-entropy loss, we are choosing the model parameters that are \"most likely\" for the data we observed.\n",
+ "\n",
+ "Note that this is under the assumption that all data is drawn independently from the same logistic regression model with parameter $\\theta$. In fact, many of the model + loss combinations we've seen can be motivated using MLE (e.g., OLS, Ridge Regression, etc.). In probability and ML classes, you'll get the chance to explore MLE further. 
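+ "\n",
+ "As a final sanity check, we can confirm numerically that the coin-flip likelihood from earlier peaks at $\\hat \\theta = 0.4$. This sketch reuses the grid-search idea from the likelihood plot above:\n",
+ "\n",
+ "    thetas = np.linspace(0.01, 0.99, 99)\n",
+ "    likelihood = thetas**4 * (1 - thetas)**6\n",
+ "    thetas[np.argmax(likelihood)]  # 0.4, matching the frequency of heads in the data\n",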
\n" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "name": "python", + "version": "3.10.9" + } + }, + "nbformat": 4, + "nbformat_minor": 4 +} diff --git a/logistic_regression_2/images/decision_boundary.png b/logistic_regression_2/images/decision_boundary.png new file mode 100644 index 00000000..df94c58e Binary files /dev/null and b/logistic_regression_2/images/decision_boundary.png differ diff --git a/logistic_regression_2/images/decision_boundary_true.png b/logistic_regression_2/images/decision_boundary_true.png new file mode 100644 index 00000000..d3b39b6d Binary files /dev/null and b/logistic_regression_2/images/decision_boundary_true.png differ diff --git a/logistic_regression_2/images/log_reg_summary.png b/logistic_regression_2/images/log_reg_summary.png new file mode 100644 index 00000000..7c0ec940 Binary files /dev/null and b/logistic_regression_2/images/log_reg_summary.png differ diff --git a/logistic_regression_2/images/varying_threshold.png b/logistic_regression_2/images/varying_threshold.png index 885bd9da..e90113a4 100644 Binary files a/logistic_regression_2/images/varying_threshold.png and b/logistic_regression_2/images/varying_threshold.png differ diff --git a/logistic_regression_2/logistic_reg_2.ipynb b/logistic_regression_2/logistic_reg_2.ipynb new file mode 100644 index 00000000..d364ef4b --- /dev/null +++ b/logistic_regression_2/logistic_reg_2.ipynb @@ -0,0 +1,476 @@ +{ + "cells": [ + { + "cell_type": "raw", + "metadata": {}, + "source": [ + "---\n", + "title: Logistic Regression II\n", + "format:\n", + " html:\n", + " toc: true\n", + " toc-depth: 5\n", + " toc-location: right\n", + " code-fold: false\n", + " theme:\n", + " - cosmo\n", + " - cerulean\n", + " callout-icon: false\n", + "---" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "::: {.callout-note collapse=\\\"false\\\"}\n", + "## Learning Outcomes\n", + "* Apply decision rules to make a classification\n", + "* Learn when logistic regression works well and when it does not\n", + "* Introduce new metrics for model performance\n", + "::: \n", + "\n", + "Today, we will continue studying the Logistic Regression model. We'll discussion decision boundaries that help inform the classification of a particular prediction. Then, we'll pick up from last lecture's discussion of cross-entropy loss, study a few of its pitfalls, and learn potential remedies. We will also provide an implementation of `sklearn`'s logistic regression model. Lastly, we'll return to decision rules and discus metrics that allow us to determine our model's performance in different scenarios. \n", + "\n", + "This will introduce us to the process of **thresholding** -- a technique used to *classify* data from our model's predicted probabilities, or $P(Y=1|x)$. In doing so, we'll focus on how these thresholding decisions affect the behavior of our model. We will learn various evaluation metrics useful for binary classification, and apply them to our study of logistic regression.\n", + "\n", + "
tpr_fpr
" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Decision Boundaries\n", + "In logistic regression, we model the *probability* that a datapoint belongs to Class 1. Last week, we developed the logistic regression model to predict that probability, but we never actually made any *classifications* for whether our prediction $y$ belongs in Class 0 or Class 1. \n", + "\n", + "$$ p = P(Y=1 | x) = \\frac{1}{1 + e^{-x^T\\theta}}$$\n", + "\n", + "A **decision rule** tells us how to interpret the output of the model to make a decision on how to classify a datapoint. We commonly make decision rules by specifying a **threshold**, $T$. If the predicted probability is greater than or equal to $T$, predict Class 1. Otherwise, predict Class 0. \n", + "\n", + "$$\\hat y = \\text{classify}(x) = \\begin{cases}\n", + " 1, & P(Y=1|x) \\ge T\\\\\n", + " 0, & \\text{otherwise }\n", + " \\end{cases}$$\n", + " \n", + "The threshold is often set to $T = 0.5$, but *not always*. We'll discuss why we might want to use other thresholds $T \\neq 0.5$ later in this lecture.\n", + "\n", + "Using our decision rule, we can define a **decision boundary** as the “line” the splits the data into classes based on its features. For logistic regression, the decision boundary is a **hyperplane** -- a linear combination of the features in p-dimensions -- and we can recover it from the final logistic regression model. For example, if we have a model with 2 features (2D), we have $\\theta = [\\theta_0 \\theta_1 \\theta_2]$ (including the intercept term), and we can solve for the decision boundary like so: \n", + "\n", + "$$\n", + "\\begin{align}\n", + "T &= \\frac{1}{1 + e^{\\theta_0 + \\theta_1 * \\text{feature1} + \\theta_2 * \\text{feature2}}} \\\\\n", + "1 + e^{\\theta_0 + \\theta_1 * \\text{feature1} + \\theta_2 * \\text{feature2}} &= \\frac{1}{T} \\\\\n", + "e^{\\theta_0 + \\theta_1 * \\text{feature1} + \\theta_2 * \\text{feature2}} &= \\frac{1}{T} - 1 \\\\\n", + "\\theta_0 + \\theta_1 * \\text{feature1} + \\theta_2 * \\text{feature2} &= log(\\frac{1}{T} - 1)\n", + "\\end{align} \n", + "$$\n", + "\n", + "For a model with 2 features, the decision boundary is a line in terms of it's features. To make it easier to visualize, we've included an example of a 1-dimensional and a 2-dimensional decision boundary below. Notice how the decision boundary predicted by our logistic regression model perfectly separates the points into two classes. \n", + "\n", + "
varying_threshold
\n", + "\n", + "In real life, however, that is often not the case, and we often see some overlap between points of different classes across the decision boundary. The *true* classes of the 2D data is shown below: \n", + "\n", + "
varying_threshold
\n", + "\n", + "As you can see, the decision boundary predicted by our logistic regression does not perfectly separate the two classes. There's a “muddled” region near the decision boundary where our classifier predicts the wrong class. What would the data have to look like for the classifier to make perfect predictions?" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Linear Seperability and Regularization\n", + "\n", + "A classification dataset is said to be **linearly separable** if there exists a hyperplane among input features $x$ that separates the two classes $y$. \n", + "\n", + "Linear seperability in 1D can be found with a rugplot of a single feature. For example, notice how the plot on the bottom left is linearly seperable along the vertical line $x=0$. However, no such line perfectly seperates the two classes on the bottom right.\n", + "\n", + "
linear_seperability_1D
\n", + "\n", + "This same definition holds in higher dimensions. If there are two features, the seperating hyperplane must exist in two dimensions (any line of the form $y=mx+b$). We can visualize this using a scatter plot.\n", + "\n", + "
linear_seperability_1D
\n", + "\n", + "This sounds great! When the dataset is linearly separable, a logistic regression classifier can perfectly assign datapoints into classes. However, (unexpected) complications may arise when data is linearly separable. Consider the toy dataset with 2 points and only a single feature $x$:\n", + "\n", + "
toy_linear_seperability
\n", + "\n", + "The optimal $\\theta$ value that minimizes loss pushes the predicted probabilities of the data points to their true class.\n", + "\n", + "- $P(Y = 1|x = -1) = \\frac{1}{1 + e^\\theta} \\rightarrow 1$\n", + "- $P(Y = 1|x = 1) = \\frac{1}{1 + e^{-\\theta}} \\rightarrow 0$\n", + "\n", + "This happens when $\\theta = -\\infty$. When $\\theta = -\\infty$, we observe the following behavior for any input $x$.\n", + "\n", + "$$P(Y=1|x) = \\sigma(\\theta x) \\rightarrow \\begin{cases}\n", + " 1, \\text{if } x < 0\\\\\n", + " 0, \\text{if } x \\ge 0\n", + " \\end{cases}$$\n", + "\n", + "The diverging weights cause the model to be overconfident. For example, consider the new point $(x, y) = (0.5, 1)$. Following the behavior above, our model will incorrectly predict $p=0$, and a thus, $\\hat y = 0$.\n", + "\n", + "
toy_linear_seperability
\n", + "\n", + "The loss incurred by this misclassified point is infinite.\n", + "\n", + "$$-(y\\text{ log}(p) + (1-y)\\text{ log}(1-p))=1\\text{log}(0)$$\n", + "\n", + "Thus, diverging weights ($|\\theta| \\rightarrow \\infty$) occur with **lineary separable** data. \"Overconfidence\" is a particularly dangerous version of overfitting.\n", + "\n", + "Consider the loss function with respect to the parameter $\\theta$.\n", + "\n", + "
unreg_loss
\n", + "\n", + "Though it's very difficult to see, the plateau for negative values of $\\theta$ is slightly tilted downwards, meaning the loss approaches $0$ as $\\theta$ decreases and approaches $-\\infty$.\n", + "\n", + "### Regularized Logistic Regression\n", + "\n", + "To avoid large weights and infinite loss (particularly on linearly separable data), we use regularizaiton. The same principles apply as with linear regression - make sure to standardize your features first.\n", + "\n", + "For example, L2 (Ridge) Logistic Regression takes on the form\n", + "\n", + "$$\\min_{\\theta} -\\frac{1}{n} \\sum_{i=1}^{n} (y_i \\text{log}(\\sigma(x_i^T\\theta)) + (1-y_i)\\text{log}(1-\\sigma(x_i^T\\theta))) + \\lambda \\sum_{i=1}^{d} \\theta_j^2$$\n", + "\n", + "Now, let us compare the loss functions of un-regularized and regularized logistic regression.\n", + "\n", + "
unreg_loss
\n", + "\n", + "
reg_loss
\n", + "\n", + "As we can see, $L2$ regularization helps us prevent diverging weights and deters against \"overconfidence.\"\n", + "\n", + "`sklearn`'s logistic regression defaults to L2 regularization and `C=1.0`; `C` is the inverse of $\\lambda$: $C = \\frac{1}{\\lambda}$. Setting `C` to a large value, for example `C=300.0` results in minimal regularization.\n", + "\n", + " # sklearn defaults\n", + " model = LogisticRegression(penalty='l2', C=1.0, …)\n", + " model.fit()\n", + "\n", + "Note that in Data 100, we only use `sklearn` to fit logistic regression models. There is no closed-form solution to the optimal theta vector, and the gradient is a little messy (see the bonus section below for details).\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "From here, the `.predict` function returns the predicted class $\\hat y$ of the point. In the simple binary case, \n", + "\n", + "$$\\hat y = \\begin{cases}\n", + " 1, & P(Y=1|x) \\ge 0.5\\\\\n", + " 0, & \\text{otherwise }\n", + " \\end{cases}$$" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Performance Metrics\n", + "You might be thinking, if we've already introduced cross-entropy loss, why do we need additional ways of assessing how well our models perform? In linear regression, we made numerical predictions and used a loss function to determine how “good” these predictions were. In logistic regression, our ultimate goal is to classify data – we are much more concerned with whether or not each datapoint was assigned the correct class using the decision rule. As such, we are interested in the *quality* of classifications, not the predicted probabilities.\n", + "\n", + "The most basic evaluation metric is **accuracy** -- the proportion of correctly classified points.\n", + "\n", + "$$\\text{accuracy} = \\frac{\\# \\text{ of points classified correctly}}{\\# \\text{ of total points}}$$\n", + "\n", + "Translated to code: \n", + "\n", + " def accuracy(X, Y):\n", + " return np.mean(model.predict(X) == Y)\n", + " \n", + " model.score(X, y) # built-in accuracy function\n", + "\n", + "However, accuracy is not always a great metric for classification. To understand why, let's consider a classification problem with 100 emails where only 5 are truly spam, and the remaining 95 are truly ham. We'll investigate two models where accuracy is a poor metric. \n", + "\n", + "- **Model 1**: Our first model classifies every email as non-spam. The model's accuracy is high ($\\frac{95}{100} = 0.95$), but it doesn't detect any spam emails. Despite the high accuracy, this is a bad model.\n", + "- **Model 2**: The second model classifies every email as spam. The accuracy is low ($\\frac{5}{100} = 0.05$), but the model correctly labels every spam email. Unfortunately, it also misclassifies every non-spam email.\n", + "\n", + "As this example illustrates, accuracy is not always a good metric for classification, particularly when your data have class imbalance (e.g., very few 1’s compared to 0’s).\n", + "\n", + "### Types of Classification\n", + "There are 4 different different classifications that our model might make:\n", + "\n", + "1. **True positive**: correctly classify a positive point as being positive ($y=1$ and $\\hat{y}=1$)\n", + "2. **True negative**: correctly classify a negative point as being negative ($y=0$ and $\\hat{y}=0$)\n", + "3. **False positive**: incorrectly classify a negative point as being positive ($y=0$ and $\\hat{y}=1$)\n", + "4. 
**False negative**: incorrectly classify a positive point as being negative ($y=1$ and $\\hat{y}=0$)\n", + "\n", + "These classifications can be concisely summarized in a **confusion matrix**. \n", + "\n", + "
confusion_matrix
\n", + "\n", + "An easy way to remember this terminology is as follows:\n", + "\n", + "1. Look at the second word in the phrase. *Positive* means a prediction of 1. *Negative* means a prediction of 0.\n", + "2. Look at the first word in the phrase. *True* means our prediction was correct. *False* means it was incorrect.\n", + "\n", + "We can now write the accuracy calculation as \n", + "$$\\text{accuracy} = \\frac{TP + TN}{n}$$\n", + "\n", + "In `sklearn`, we use the following syntax\n", + "\n", + " from sklearn.metrics import confusion_matrix\n", + " cm = confusion_matrix(Y_true, Y_pred)\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "vscode": { + "languageId": "python" + } + }, + "outputs": [], + "source": [ + "from sklearn.metrics import confusion_matrix\n", + "\n", + "y_pred = model.predict(X)\n", + "confusion_matrix(y, y_pred)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Accuracy, Precision, and Recall\n", + "\n", + "The purpose of our discussion of the confusion matrix was to motivate better performance metrics for classification problems with class imbalance - namely, precision and recall.\n", + "\n", + "**Precision** is defined as\n", + "\n", + "$$\\text{precision} = \\frac{\\text{TP}}{\\text{TP + FP}}$$\n", + "\n", + "Precision answers the question: \"of all observations that were predicted to be $1$, what proportion were actually $1$?\" It measures how accurate the classifier is when its predictions are positive.\n", + "\n", + "**Recall** (or **sensitivity**) is defined as \n", + "\n", + "$$\\text{recall} = \\frac{\\text{TP}}{\\text{TP + FN}}$$\n", + "\n", + "Recall aims to answer: \"of all observations that were actually $1$, what proportion were predicted to be $1$?\" It measures how many positive predictions were missed.\n", + "\n", + "Here's a helpful graphic that summarizes our discussion above.\n", + "\n", + "\n", + "
confusion_matrix
\n", + "\n", + "### Example Calculation\n", + "\n", + "In this section, we will calculate the accuracy, precision, and recall performance metrics for our earlier spam classification example. As a reminder, we had a 100 emails, 5 of which were spam. We designed two models:\n", + "\n", + "- Model 1: Predict that every email is *non-spam*\n", + "- Model 2: Predict that every email is *spam*\n", + "\n", + "#### Model 1\n", + "\n", + "First, let's begin by creating the confusion matrix.\n", + "\n", + "+-------------------+-------------------+---------------------------+\n", + "| | 0 | 1 |\n", + "+===================+===================+===========================+\n", + "| 0 | True Negative: 95 | False Positive: 0 |\n", + "+-------------------+-------------------+---------------------------+\n", + "| 1 | False Negative: 5 | True Positive: 0 |\n", + "+-------------------+-------------------+---------------------------+\n", + "\n", + "Convince yourself of why our confusion matrix looks like so.\n", + "\n", + "$$\\text{accuracy} = \\frac{95}{100} = 0.95$$\n", + "$$\\text{precision} = \\frac{0}{0 + 0} = \\text{undefined}$$\n", + "$$\\text{recall} = \\frac{0}{0 + 5} = 0$$\n", + "\n", + "Notice how our precision is undefined because we never predicted class $1$. Our recall is 0 for the same reason -- the numerator is 0 (we had no positive predictions).\n", + "\n", + "#### Model 2\n", + "\n", + "Our confusion matrix for Model 2 looks like so.\n", + "\n", + "+-------------------+-------------------+---------------------------+\n", + "| | 0 | 1 |\n", + "+===================+===================+===========================+\n", + "| 0 | True Negative: 0 | False Positive: 95 |\n", + "+-------------------+-------------------+---------------------------+\n", + "| 1 | False Negative: 0 | True Positive: 5 |\n", + "+-------------------+-------------------+---------------------------+\n", + "\n", + "$$\\text{accuracy} = \\frac{5}{100} = 0.05$$\n", + "$$\\text{precision} = \\frac{5}{5 + 95} = 0.05$$\n", + "$$\\text{recall} = \\frac{5}{5 + 0} = 1$$\n", + "\n", + "Our precision is low because we have many false positives, and our recall is perfect - we correctly classified all spam emails (we never predicted class $0$).\n", + "\n", + "#### Precision vs. Recall\n", + "\n", + "Precision ($\\frac{\\text{TP}}{\\text{TP} + \\textbf{ FP}}$) penalizes false positives, while recall ($\\frac{\\text{TP}}{\\text{TP} + \\textbf{ FN}}$) penalizes false negatives.\n", + "\n", + "In fact, precision and recall are *inversely related*. This is evident in our second model -- we observed a high recall and low precision. Usually, there is a tradeoff in these two (most models can either minimize the number of FP or FN; and in rare cases, both). \n", + "\n", + "The specific performance metric(s) to prioritize depends on the context. In many medical settings, there might be a much higher cost to missing positive cases. For instance, in our breast cancer example, it is more costly to misclassify malignant tumors (false negatives) than it is to incorrectly classify a benign tumor as malignant (false positives). In the case of the latter, pathologists can conduct further study to verify malignant tumors. As such, we should minimize the number of false negatives. This is equivalent to maximizing recall.\n", + "\n", + "\n", + "\n", + "### Two More Metrics\n", + "\n", + "The **True Positive Rate (TPR)** is defined as\n", + "\n", + "$$\\frac{\\text{TP}}{\\text{TP + FN}}$$\n", + "\n", + "You'll notice this is equivalent to *recall*. 
In the context of our spam email classifier, it answers the question: \"what proportion of spam did I mark correctly?\". We'd like this to be close to $1$.\n",
+ "\n",
+ "The **False Positive Rate (FPR)** is defined as\n",
+ "\n",
+ "$$\\frac{\\text{FP}}{\\text{FP + TN}}$$\n",
+ "\n",
+ "The FPR is closely related to *specificity* (the true negative rate): $\\text{FPR} = 1 - \\text{specificity}$. This answers the question: \"what proportion of regular email did I mark as spam?\". We'd like this to be close to $0$.\n",
+ "\n",
+ "As we increase threshold $T$, both TPR and FPR decrease. We've plotted this relationship below for some model on a toy dataset.\n",
+ "\n",
+ "<div align=\"middle\">
tpr_fpr
\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Adjusting the Classification Threshold\n", + "\n", + "One way to minimize the number of FP vs. FN (equivalently, maximizing precision vs. recall) is by adjusting the classification threshold $T$.\n", + "\n", + "$$\\hat y = \\begin{cases}\n", + " 1, & P(Y=1|x) \\ge T\\\\\n", + " 0, & \\text{otherwise }\n", + " \\end{cases}$$\n", + " \n", + "The default threshold in `sklearn` is $T = 0.5$. As we increase the threshold $T$, we “raise the standard” of how confident our classifier needs to be to predict 1 (i.e., “positive”).\n", + "\n", + "
varying_threshold
\n", + "\n", + "As you may notice, the choice of threshold $T$ impacts our classifier's performance.\n", + "\n", + "- High $T$: Most predictions are $0$. \n", + " - Lots of false negatives\n", + " - Fewer false positives\n", + "- Low $T$: Most predictions are $1$. \n", + " - Lots of false positives \n", + " - Fewer false negatives\n", + "\n", + "In fact, we can choose a threshold $T$ based on our desired number, or proportion, of false positives and false negatives. We can do so using a few different tools. We'll touch on two of the most important ones in Data 100.\n", + "\n", + "1. Precision-Recall Curve (PR Curve)\n", + "2. \"Receiver Operating Characteristic\" Curve (ROC Curve)\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "### Precision-Recall Curves\n", + "\n", + "A **Precision-Recall Curve (PR Curve)** is an alternative to the ROC curve that displays the relationship between precision and recall for various threshold values. It is constructed in a similar way as with the ROC curve.\n", + "\n", + "Let's first consider how precision and recall change as a function of the threshold $T$. We know this quite well from earlier -- precision will generally increase, and recall will decrease.\n", + "\n", + "
precision-recall-thresh
\n", + "\n", + "Displayed below is the PR-Curve for the same toy dataset. Notice how threshold values increase as we move to the left.\n", + "\n", + "
pr_curve_thresholds
\n", + "\n", + "Once again, the perfect classifier will resemble the orange curve, this time, facing the opposite direction.\n", + "\n", + "
pr_curve_perfect
\n", + "\n", + "We want our PR-Curve to be as close to the “top right” of this graph as possible. Again, we use the AUC to determine \"closeness\", with the perfect classifier exhibiting an AUC = 1 (and the worst with an AUC = 0.5)." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### The ROC Curve\n", + "\n", + "The “Receiver Operating Characteristic” Curve (**ROC Curve**) plots the tradeoff between FPR and TPR. Notice how the far-left of the curve corresponds to higher threshold $T$ values.\n", + "\n", + "
roc_curve
\n", + "\n", + "The “perfect” classifier is the one that has a TPR of 1, and FPR of 0. This is achieved at the top-left of the plot below. More generally, it's ROC curve resembles the curve in orange.\n", + "\n", + "
roc_curve_perfect
\n", + "\n", + "We want our model to be as close to this orange curve as possible. How do we quantify \"closeness\"?\n", + "\n", + "We can compute the **area under curve (AUC)** of the ROC curve. Notice how the perfect classifier has an AUC = 1. The closer our model's AUC is to 1, the better it is. \n", + "\n", + "\n", + "#### [Extra] What is the “worst” AUC and why is it 0.5? \n", + "On the other hand, a terrible model will have an AUC closer to 0.5. Random predictors randomly predicts P(Y = 1 | x) to be uniformly between 0 and 1. This indicates the classifier is not able to distinguish between positive and negative classes, and thus, randomly predicts one of the two.\n", + "\n", + "
roc_curve_worst_predictor
" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## [Extra] Gradient Descent for Logistic Regression\n", + "Let's define the following: \n", + "$$\n", + "t_i = \\phi(x_i)^T \\theta \\\\\n", + "p_i = \\sigma(t_i) \\\\\n", + "t_i = log(\\frac{p_i}{1 - p_i}) \\\\\n", + "1 - \\sigma(t_i) = \\sigma(-t_i) \\\\\n", + "\\frac{d}{dt} \\sigma(t) = \\sigma(t) \\sigma(-t)\n", + "$$\n", + "\n", + "Now, we can simplify the cross-entropy loss\n", + "$$\n", + "\\begin{align}\n", + "y_i log(p_i) + (1 - y_i)log(1 - p_i) &= y_i log(\\frac{p_i}{1 - p_i}) + log(1 - p_i) \\\\\n", + "&= y_i \\phi(x_i)^T + log(\\sigma(-\\phi(x_i)^T \\theta))\n", + "\\end{align}\n", + "$$\n", + "\n", + "Hence, the optimal $\\hat{\\theta}$ is \n", + "$$\\argmin_{\\theta} - \\frac{1}{n} \\sum_{i=1}^n (y_i \\phi(x_i)^T + log(\\sigma(-\\phi(x_i)^T \\theta)))$$ \n", + "\n", + "We want to minimize $$L(\\theta) = - \\frac{1}{n} \\sum_{i=1}^n (y_i \\phi(x_i)^T + log(\\sigma(-\\phi(x_i)^T \\theta)))$$\n", + "\n", + "So we take the derivative \n", + "$$ \n", + "\\begin{align}\n", + "\\triangledown_{\\theta} L(\\theta) &= - \\frac{1}{n} \\sum_{i=1}^n \\triangledown_{\\theta} y_i \\phi(x_i)^T + \\triangledown_{\\theta} log(\\sigma(-\\phi(x_i)^T \\theta)) \\\\\n", + "&= - \\frac{1}{n} \\sum_{i=1}^n y_i \\phi(x_i) + \\triangledown_{\\theta} log(\\sigma(-\\phi(x_i)^T \\theta)) \\\\\n", + "&= - \\frac{1}{n} \\sum_{i=1}^n y_i \\phi(x_i) + \\frac{1}{\\sigma(-\\phi(x_i)^T \\theta)} \\triangledown_{\\theta} \\sigma(-\\phi(x_i)^T \\theta) \\\\\n", + "&= - \\frac{1}{n} \\sum_{i=1}^n y_i \\phi(x_i) + \\frac{\\sigma(-\\phi(x_i)^T \\theta)}{\\sigma(-\\phi(x_i)^T \\theta)} \\sigma(\\phi(x_i)^T \\theta)\\triangledown_{\\theta} \\sigma(-\\phi(x_i)^T \\theta) \\\\\n", + "&= - \\frac{1}{n} \\sum_{i=1}^n (y_i - \\sigma(\\phi(x_i)^T \\theta)\\phi(x_i))\n", + "\\end{align}\n", + "$$\n", + "\n", + "Setting the derivative to 0 and solving for $\\hat{\\theta}$, we find that there's no general analytic solution. Therefore, we must solve using numeric methods. \n", + "\n", + "### Gradient Descent Update Rule\n", + "$$\\theta^{(0)} \\leftarrow \\text{initial vector (random, zeros, ...)} $$\n", + "\n", + "For $\\tau$ from 0 to convergence: \n", + "$$ \\theta^{(\\tau + 1)} \\leftarrow \\theta^{(\\tau)} + \\rho(\\tau)\\left( \\frac{1}{n} \\sum_{i=1}^n \\triangledown_{\\theta} L_i(\\theta) \\mid_{\\theta = \\theta^{(\\tau)}}\\right) $$\n", + "\n", + "### Stochastic Gradient Descent Update Rule\n", + "$$\\theta^{(0)} \\leftarrow \\text{initial vector (random, zeros, ...)} $$\n", + "\n", + "For $\\tau$ from 0 to convergence, let $B ~ \\text{Random subset of indices}$. \n", + "$$ \\theta^{(\\tau + 1)} \\leftarrow \\theta^{(\\tau)} + \\rho(\\tau)\\left( \\frac{1}{|B|} \\sum_{i \\in B} \\triangledown_{\\theta} L_i(\\theta) \\mid_{\\theta = \\theta^{(\\tau)}}\\right) $$" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + } + }, + "nbformat": 4, + "nbformat_minor": 4 +} diff --git a/logistic_regression_2/logistic_reg_2.qmd b/logistic_regression_2/logistic_reg_2.qmd index 3737ffa5..e2b84a61 100644 --- a/logistic_regression_2/logistic_reg_2.qmd +++ b/logistic_regression_2/logistic_reg_2.qmd @@ -13,117 +13,69 @@ format: jupyter: python3 --- -Today, we will continue studying the Logistic Regression model. First, we'll examine logisitc regression through the lens of maxmium likelihood estimation and understand its connection to cross-entropy loss. 
Then, we'll pick up from last lecture's discussion of cross-entropy loss, study a few of its pitfalls, and learn potential remedies. Finally, we will provide an implementation of `sklearn`'s logistic regression model. 

+::: {.callout-note collapse="false"}
+## Learning Outcomes
+* Apply decision rules to make a classification
+* Learn when logistic regression works well and when it does not
+* Introduce new metrics for model performance
+:::

-This will introduce us to the process of **thresholding** -- a technique used to *classify* data from our model's predicted probabilities, or $P(Y=1|x)$. In doing so, we'll focus on how these thresholding decisions affect the behavior of our model. We will learn various evaluation metrics useful for binary classification, and apply them to our study of logistic regression.
-
-## Logistic Regression Model (continued)
-
-
-### Maximum Likelihood Estimation
-
-
-In our earlier coin toss example, we had data on 10 coin flips, and wanted to estimate $\hat \theta$, the probability of a heads.
-
-```{python}
-flips = [0, 0, 1, 1, 1, 1, 0, 0, 0, 0]
-flips
-```

+Today, we will continue studying the Logistic Regression model. We'll discuss decision boundaries that help inform the classification of a particular prediction. Then, we'll pick up from last lecture's discussion of cross-entropy loss, study a few of its pitfalls, and learn potential remedies. We will also provide an implementation of `sklearn`'s logistic regression model. Lastly, we'll return to decision rules and discuss metrics that allow us to determine our model's performance in different scenarios.

-$\hat \theta = 0.4$ is the most intuitive two reasons:
-
-1. It is the frequency of heads in our data
-2. It maximizes the **likelihood** of our data 

+This will introduce us to the process of **thresholding** -- a technique used to *classify* data from our model's predicted probabilities, or $P(Y=1|x)$. In doing so, we'll focus on how these thresholding decisions affect the behavior of our model. We will learn various evaluation metrics useful for binary classification, and apply them to our study of logistic regression.

-$$\hat \theta = \text{argmax}_\theta (\theta^4(1-\theta)^6)$$

+
tpr_fpr
-More generally, we can apply this notion of likelihood to any random binary sample. For example, we can find the likelihood of the data observed in our breast cancer study. We will show how the likelihood is intrinsically related to cross-entropy loss. +## Decision Boundaries +In logistic regression, we model the *probability* that a datapoint belongs to Class 1. Last week, we developed the logistic regression model to predict that probability, but we never actually made any *classifications* for whether our prediction $y$ belongs in Class 0 or Class 1. -As a quick refresher on likelihood: +$$ p = P(Y=1 | x) = \frac{1}{1 + e^{-x^T\theta}}$$ -For some Bernoulli($p$) random variable $Y$ the probability distribution, or likelihood is: +A **decision rule** tells us how to interpret the output of the model to make a decision on how to classify a datapoint. We commonly make decision rules by specifying a **threshold**, $T$. If the predicted probability is greater than or equal to $T$, predict Class 1. Otherwise, predict Class 0. -$$P(Y = y) = \begin{cases} - 1, \text{with probability } p\\ - 0, \text{with probability } 1 - p - \end{cases} $$ +$$\hat y = \text{classify}(x) = \begin{cases} + 1, & P(Y=1|x) \ge T\\ + 0, & \text{otherwise } + \end{cases}$$ -Equivalently, this can be written in a compact way: - -$$P(Y=y) = p^y(1-p)^{1-y}$$ - -- When $y = 1$, this reads $P(Y=y) = p$ -- When $y = 0$, this reads $P(Y=y) = (1-p)$ - -In our example, a Bernoulli random variable is analagous to a single data point, or tumor (from the previous chapter). All together, our breast cancer study consist of multiple IID Bernoulli($p$) random variables. To find the likelihood of independent events in succession, simply multiply their likelihoods. - -$$\prod_{i=1}^{n} p^{y_i} (1-p)^{1-y_i}$$ - -As with the coin example, we want to find the parameter $p$ that maximizes this likelihood - this technique is known as **maximum likelihood estimation**. Earlier, we gave an intuitive graphical solution, but let's take the derivative of the likelihood to find this maximum. - -From a first glance, this derivative will be complicated! We will have to use the product rule, followed by the chain rule. Instead, we can make an observation that simplifies the problem. - -Finding the $p$ that maximizes $$\prod_{i=1}^{n} p^{y_i} (1-p)^{1-y_i}$$ is equivalent to the $p$ that maximizes $$\text{log}(\prod_{i=1}^{n} p^{y_i} (1-p)^{1-y_i})$$ - -This is because $\text{log}$ is a strictly *increasing* function. It won't change the maximum or minimum of the function it was applied to. From $\text{log}$ properties, $\text{log}(a*b)$ = $\text{log}(a) + \text{log}(b)$. We can apply this to our equation above to get: - -$$\text{argmax}_p \sum_{i=1}^{n} \text{log}(p^{y_i} (1-p)^{1-y_i})$$ - -$$= \text{argmax}_p \sum_{i=1}^{n} \text{log}(p^{y_i}) + \text{log}((1-p)^{1-y_i})$$ - -$$= \text{argmax}_p \sum_{i=1}^{n} y_i\text{log}(p) + (1-y_i)\text{log}(1-p)$$ - -We can add a constant factor of $\frac{1}{n}$ out front. It won't affect the $p$ that maximizes our likelihood. - -$$=\text{argmax}_p \frac{1}{n} \sum_{i=1}^{n} y_i\text{log}(p) + (1-y_i)\text{log}(1-p)$$ +The threshold is often set to $T = 0.5$, but *not always*. We'll discuss why we might want to use other thresholds $T \neq 0.5$ later in this lecture. -One last "trick" we can do is change this to a minimization problem by negating the result. This works because we are dealing with a *concave* function, which can be made *convex*. 
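+As an aside, applying a decision rule in code is a one-liner. Here is a minimal sketch (the names `model` and `X` are assumptions for illustration: a fitted `sklearn` logistic regression model and its feature matrix):
+
+    # predicted probabilities P(Y = 1 | x) for every datapoint
+    probs = model.predict_proba(X)[:, 1]
+
+    # apply the decision rule with a custom threshold T
+    T = 0.8
+    y_hat = (probs >= T).astype(int)   # 1 if P(Y = 1 | x) >= T, else 0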
+Using our decision rule, we can define a **decision boundary** as the “line” that splits the data into classes based on its features. For logistic regression, the decision boundary is a **hyperplane** -- a linear combination of the features in $p$ dimensions -- and we can recover it from the final logistic regression model. For example, if we have a model with 2 features (2D), we have $\theta = [\theta_0, \theta_1, \theta_2]$ (including the intercept term), and we can solve for the decision boundary like so:

-$$= \text{argmin}_p -\frac{1}{n} \sum_{i=1}^{n} y_i\text{log}(p) + (1-y_i)\text{log}(1-p)$$

+$$
+\begin{align}
+T &= \frac{1}{1 + e^{-(\theta_0 + \theta_1 \cdot \text{feature1} + \theta_2 \cdot \text{feature2})}} \\
+1 + e^{-(\theta_0 + \theta_1 \cdot \text{feature1} + \theta_2 \cdot \text{feature2})} &= \frac{1}{T} \\
+e^{-(\theta_0 + \theta_1 \cdot \text{feature1} + \theta_2 \cdot \text{feature2})} &= \frac{1}{T} - 1 \\
+\theta_0 + \theta_1 \cdot \text{feature1} + \theta_2 \cdot \text{feature2} &= -\log(\frac{1}{T} - 1)
+\end{align}
+$$

-This is exactly our average cross-entropy loss minimization problem from before! 

+For the default threshold $T = 0.5$, the right-hand side is $-\log(1) = 0$, so the decision boundary is simply the line $\theta_0 + \theta_1 \cdot \text{feature1} + \theta_2 \cdot \text{feature2} = 0$.

+For a model with 2 features, the decision boundary is a line in terms of its features. To make it easier to visualize, we've included an example of a 1-dimensional and a 2-dimensional decision boundary below. Notice how the decision boundary predicted by our logistic regression model perfectly separates the points into two classes.

-Why did we do all this complicated math? We have shown that *minimizing* cross-entropy loss is equivalent to *maximizing* the likelihood of the training data.

+
varying_threshold
-- By minimizing cross-entropy loss, we are choosing the model parameters that are "most likely" for the data we observed. +In real life, however, that is often not the case, and we often see some overlap between points of different classes across the decision boundary. The *true* classes of the 2D data is shown below: -## A (Tangent) Note +
varying_threshold
-You will study MLE further in probability and ML classes. But now you know it exists. It turns out that many of the model + loss combinations we’ve seen can be motivated using MLE (OLS, Ridge Regression, etc.) +As you can see, the decision boundary predicted by our logistic regression does not perfectly separate the two classes. There's a “muddled” region near the decision boundary where our classifier predicts the wrong class. What would the data have to look like for the classifier to make perfect predictions? -The two important takeways from this section are +## Linear Seperability and Regularization -1. Formulating the Logistic Regression Model -2. Motivating the Cross-Entropy Loss +A classification dataset is said to be **linearly separable** if there exists a hyperplane among input features $x$ that separates the two classes $y$. -We will now continue to learn how to evaluate the strength of logistic regression models. +Linear seperability in 1D can be found with a rugplot of a single feature. For example, notice how the plot on the bottom left is linearly seperable along the vertical line $x=0$. However, no such line perfectly seperates the two classes on the bottom right. -Above, we proved that *minimizing* cross-entropy loss is equivalent to *maximizing* likelihood of the training data. +
linear_seperability_1D
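+To make the 1D case concrete, here is a small sketch that checks separability by testing whether the two classes' ranges overlap (the arrays `x` and `y` here are hypothetical, not from the lecture's datasets):
+
+    import numpy as np
+
+    def is_separable_1d(x, y):
+        # a separating threshold exists iff the two classes' ranges do not overlap
+        return x[y == 0].max() < x[y == 1].min() or x[y == 1].max() < x[y == 0].min()
+
+    x = np.array([-2.0, -1.0, 1.0, 2.0])
+    y = np.array([0, 0, 1, 1])
+    is_separable_1d(x, y)   # True: the vertical line x = 0 separates the classes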
-Intuitively, this means that the optimal $\hat \theta$ that minimizes cross-entropy loss "pushes" all the probabilities in the direction of their true, underlying class. +This same definition holds in higher dimensions. If there are two features, the seperating hyperplane must exist in two dimensions (any line of the form $y=mx+b$). We can visualize this using a scatter plot. -- For points that belong to the $0$ class, $\sigma(x^T\theta) \rightarrow 0$ -- For points that belong to the $1$ class, $\sigma(x^T\theta) \rightarrow 1$ +
linear_seperability_1D
-However, something interesting happens when our data is perfectly classifiable; in other words, linearly seperable. +This sounds great! When the dataset is linearly separable, a logistic regression classifier can perfectly assign datapoints into classes. However, (unexpected) complications may arise when data is linearly separable. Consider the toy dataset with 2 points and only a single feature $x$: -### Linear Seperability and Regularization - -A classification dataset is said to be **linearly separable** if there exists a hyperplane among input features $x$ that separates the two classes $y$. For example, notice how the plot on the bottom left is linearly seperable along the vertical line $x=0$. No such line perfectly seperates the two classes on the bottom right. - -- Linear seperability in 1D can be found with a rugplot of a single feature. - -linear_seperability_1D - -This same definition holds in higher dimensions. If there are two features, the seperating hyperplane must exist in two dimensions (any line of the form $y=mx+b$) - -- Linear seperability among 2 features is evident from a two-dimensional visualization, or scatter plot. - -linear_seperability_1D - -Complications may arise when data is linearly seperable. Consider the toy dataset with 2 points and only a single feature $x$: - -toy_linear_seperability +
toy_linear_seperability
The optimal $\theta$ value that minimizes loss pushes the predicted probabilities of the data points to their true class.

@@ -139,25 +91,23 @@
$$P(Y=1|x) = \sigma(\theta x) \rightarrow \begin{cases}
    1, \text{if } x < 0\\
    0, \text{if } x \ge 0
  \end{cases}$$

The diverging weights cause the model to be overconfident. For example, consider the new point $(x, y) = (0.5, 1)$. Following the behavior above, our model will incorrectly predict $p=0$ and, thus, $\hat y = 0$.

-toy_linear_seperability
+
toy_linear_seperability
The loss incurred by this misclassified point is infinite.

-$$-(y\text{ log}(p) + (1-y)\text{ log}(1-p))$$

-$$=1\text{log}(0)$$

+$$-(y\text{ log}(p) + (1-y)\text{ log}(1-p)) = -1 \cdot \text{log}(0) = \infty$$

Thus, diverging weights ($|\theta| \rightarrow \infty$) occur with **linearly separable** data. "Overconfidence" is a particularly dangerous version of overfitting.

Consider the loss function with respect to the parameter $\theta$.

-unreg_loss
+
unreg_loss
-Although impossible to see, the plateau for negative values of $\theta$ is slightly tilted downwards, meaning the loss approaches $0$ as $\theta$ decreases and approaches $-\infty$.
+Though it's very difficult to see, the plateau for negative values of $\theta$ is slightly tilted downwards, meaning the loss approaches $0$ as $\theta$ decreases and approaches $-\infty$.

-#### Regularized Logistic Regression
+### Regularized Logistic Regression

-To avoid large weights, and thus, infinite loss (particularly on linearly seperable data), regularization is used.
+To avoid large weights and infinite loss (particularly on linearly separable data), we use regularization. The same principles apply as with linear regression - make sure to standardize your features first.

For example, L2 (Ridge) Logistic Regression takes on the form

$$\min_{\theta} -\frac{1}{n} \sum_{i=1}^{n} (y_i \text{log}(\sigma(x_i^T\theta)) + (1-y_i)\text{log}(1-\sigma(x_i^T\theta))) + \lambda \sum_{j=1}^{d} \theta_j^2$$

Now, let us compare the loss functions of un-regularized and regularized logistic regression.

-unreg_loss
+
unreg_loss
-reg_loss +
reg_loss
As we can see, $L2$ regularization helps us prevent diverging weights and deters against "overconfidence." -### Logistic Regression Model Implementation - -The implementation of logistic regression in `sklearn` is simple. We'll begin by fitting a model on the breast cancer dataset from last lecture. - -```{python} -#| code-fold: true - -import pandas as pd -import sklearn.datasets +`sklearn`'s logistic regression defaults to L2 regularization and `C=1.0`; `C` is the inverse of $\lambda$: $C = \frac{1}{\lambda}$. Setting `C` to a large value, for example `C=300.0` results in minimal regularization. -data_dict = sklearn.datasets.load_breast_cancer() -data = pd.DataFrame(data_dict['data'], columns=data_dict['feature_names']) -data['malignant'] = (data_dict['target'] == 0).astype(int) + # sklearn defaults + model = LogisticRegression(penalty='l2', C=1.0, …) + model.fit() -X = data[['mean radius']] -y = data['malignant'] -``` - -```{python} -from sklearn.linear_model import LogisticRegression -model = LogisticRegression() -model.fit(X, y); -``` - -By default, `sklearn` applies regularization to the logistic regression class. This is to avoid diverging weights with seperable data. The code above can be written more expliclty as follows. - -```{python} -# sklearn defaults -model = LogisticRegression(penalty='l2', C=1.0, fit_intercept=True) -model.fit(X, y); -``` - -The parameter `C` controls the amount of regularization -- `C` is the inverse of the regularization hyperparameter $\lambda$. Set `C` big for minimal regularization, and vice versa. - -The `.predict_proba` method returns the predicted probabilities of belonging to each class. The first element corresponds to class $0$, the second to class $1$. - -```{python} -# Here are the first 5 predicted probabilities -model.predict_proba(X)[:5] -``` +Note that in Data 100, we only use `sklearn` to fit logistic regression models. There is no closed-form solution to the optimal theta vector, and the gradient is a little messy (see the bonus section below for details). From here, the `.predict` function returns the predicted class $\hat y$ of the point. In the simple binary case, @@ -219,44 +136,54 @@ $$\hat y = \begin{cases} 0, & \text{otherwise } \end{cases}$$ -```{python} -# Here are the first 5 predicted classes -model.predict(X)[:5] -``` - ## Performance Metrics +You might be thinking, if we've already introduced cross-entropy loss, why do we need additional ways of assessing how well our models perform? In linear regression, we made numerical predictions and used a loss function to determine how “good” these predictions were. In logistic regression, our ultimate goal is to classify data – we are much more concerned with whether or not each datapoint was assigned the correct class using the decision rule. As such, we are interested in the *quality* of classifications, not the predicted probabilities. -Now that we have our classifier, let's quantify how well it performs. The most basic evaluation metric is **accuracy** -- the proportion of correctly classified points. +The most basic evaluation metric is **accuracy** -- the proportion of correctly classified points. $$\text{accuracy} = \frac{\# \text{ of points classified correctly}}{\# \text{ of total points}}$$ -```{python} -model.score(X, y) # built-in accuracy function -``` +Translated to code: -However, accuracy is not always a great metric for classification, particularily when the data has class imbalance. 
+
-To understand why, let's consider a classification problem with 100 emails, 5 of which are spam. We'll investigate two models where accuracy is a poor metric. 
+However, accuracy is not always a great metric for classification. To understand why, let's consider a classification problem with 100 emails where only 5 are truly spam, and the remaining 95 are truly ham. We'll investigate two models where accuracy is a poor metric.

- **Model 1**: Our first model classifies every email as non-spam. The model's accuracy is high ($\frac{95}{100} = 0.95$), but it doesn't detect any spam emails. Despite the high accuracy, this is a bad model.
- **Model 2**: The second model classifies every email as spam. The accuracy is low ($\frac{5}{100} = 0.05$), but the model correctly labels every spam email. Unfortunately, it also misclassifies every non-spam email.

-### The Confusion Matrix
+As this example illustrates, accuracy is not always a good metric for classification, particularly when your data have class imbalance (e.g., very few 1’s compared to 0’s).

+### Types of Classification
+There are 4 different classifications that our model might make:

-Model 1 from above has 5 **false negatives (FN)** -- data points which were predicted to belong to class $0$ (non-spam), but their true class was $1$ (spam). In a similar vein, Model 2 has 95 **false positives (FP)** -- that is, "false alarms" where we predict class $1$, but the true class was $0$. **True positives (TP)** and **true negatives (TN)** are when we correctly classify observations as being positive or negative, respectively.
+1. **True positive**: correctly classify a positive point as being positive ($y=1$ and $\hat{y}=1$)
+2. **True negative**: correctly classify a negative point as being negative ($y=0$ and $\hat{y}=0$)
+3. **False positive**: incorrectly classify a negative point as being positive ($y=0$ and $\hat{y}=1$)
+4. **False negative**: incorrectly classify a positive point as being negative ($y=1$ and $\hat{y}=0$)

These classifications can be concisely summarized in a **confusion matrix**.

-confusion_matrix
+
confusion_matrix
An easy way to remember this terminology is as follows: 1. Look at the second word in the phrase. *Positive* means a prediction of 1. *Negative* means a prediction of 0. 2. Look at the first word in the phrase. *True* means our prediction was correct. *False* means it was incorrect. - -A confusion matrix for a particular classifier may be found programatically. For our breast cancer data, it looks like this: + +We can now write the accuracy calculation as +$$\text{accuracy} = \frac{TP + TN}{n}$$ + +In `sklearn`, we use the following syntax + + from sklearn.metrics import confusion_matrix + cm = confusion_matrix(Y_true, Y_pred) ```{python} +#| vscode: {languageId: python} from sklearn.metrics import confusion_matrix y_pred = model.predict(X) @@ -269,32 +196,32 @@ The purpose of our discussion of the confusion matrix was to motivate better per **Precision** is defined as -$$\frac{\text{TP}}{\text{TP + FP}}$$ +$$\text{precision} = \frac{\text{TP}}{\text{TP + FP}}$$ Precision answers the question: "of all observations that were predicted to be $1$, what proportion were actually $1$?" It measures how accurate the classifier is when its predictions are positive. **Recall** (or **sensitivity**) is defined as -$$\frac{\text{TP}}{\text{TP + FN}}$$ +$$\text{recall} = \frac{\text{TP}}{\text{TP + FN}}$$ Recall aims to answer: "of all observations that were actually $1$, what proportion were predicted to be $1$?" It measures how many positive predictions were missed. Here's a helpful graphic that summarizes our discussion above. -confusion_matrix -#### Example Calculation +
confusion_matrix
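+`sklearn` can compute both metrics directly. A short sketch, reusing the `Y_true`/`Y_pred` naming from the confusion matrix snippet above:
+
+    from sklearn.metrics import precision_score, recall_score
+
+    precision = precision_score(Y_true, Y_pred)   # TP / (TP + FP)
+    recall = recall_score(Y_true, Y_pred)         # TP / (TP + FN)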
+### Example Calculation

In this section, we will calculate the accuracy, precision, and recall performance metrics for our earlier spam classification example. As a reminder, we had 100 emails, 5 of which were spam. We designed two models:

- Model 1: Predict that every email is *non-spam*
- Model 2: Predict that every email is *spam*

-##### Model 1
+#### Model 1

First, let's begin by creating the confusion matrix.

+-------------------+-------------------+---------------------------+
|                   | 0                 | 1                         |
+===================+===================+===========================+
| 0                 | True Negative: 95 | False Positive: 0         |
+-------------------+-------------------+---------------------------+
| 1                 | False Negative: 5 | True Positive: 0          |
+-------------------+-------------------+---------------------------+

Convince yourself of why the confusion matrix looks the way it does.

$$\text{accuracy} = \frac{95}{100} = 0.95$$
$$\text{precision} = \frac{0}{0 + 0} = \text{undefined}$$
$$\text{recall} = \frac{0}{0 + 5} = 0$$

-- Notice how our precision is undefined because we never predicted class $1$
-- Our recall is 0 for the same reason -- the numerator is 0 (we had no positive predictions)
+Notice how our precision is undefined because we never predicted class $1$. Our recall is 0 for the same reason -- the numerator is 0 (we had no positive predictions).

-##### Model 2
+#### Model 2

Our confusion matrix for Model 2 looks like so.

+-------------------+-------------------+---------------------------+
|                   | 0                 | 1                         |
+===================+===================+===========================+
| 0                 | True Negative: 0  | False Positive: 95        |
+-------------------+-------------------+---------------------------+
| 1                 | False Negative: 0 | True Positive: 5          |
+-------------------+-------------------+---------------------------+

$$\text{accuracy} = \frac{5}{100} = 0.05$$
$$\text{precision} = \frac{5}{5 + 95} = 0.05$$
$$\text{recall} = \frac{5}{5 + 0} = 1$$

-- Our precision is low because we have many false positives
-- Our recall is perfect - we correctly classified all spam emails (we never predicted class $0$)
+Our precision is low because we have many false positives, and our recall is perfect - we correctly classified all spam emails (we never predicted class $0$).

-#### Precision vs Recall
+#### Precision vs. Recall

Precision ($\frac{\text{TP}}{\text{TP} + \textbf{ FP}}$) penalizes false positives, while recall ($\frac{\text{TP}}{\text{TP} + \textbf{ FN}}$) penalizes false negatives.

@@ -339,6 +264,27 @@
In fact, precision and recall are *inversely related*. This is evident in our second model -- we observed a high recall and low precision. Usually, there is a tradeoff between the two (most models can either minimize the number of FP or FN; and in rare cases, both).

The specific performance metric(s) to prioritize depends on the context. In many medical settings, there might be a much higher cost to missing positive cases. For instance, in our breast cancer example, it is more costly to misclassify malignant tumors (false negatives) than it is to incorrectly classify a benign tumor as malignant (false positives). In the case of the latter, pathologists can conduct further study to verify malignant tumors. As such, we should minimize the number of false negatives. This is equivalent to maximizing recall.

+
+
+### Two More Metrics

+The **True Positive Rate (TPR)** is defined as

+$$\frac{\text{TP}}{\text{TP + FN}}$$

+You'll notice this is equivalent to *recall*. In the context of our spam email classifier, it answers the question: "what proportion of spam did I mark correctly?" We'd like this to be close to $1$.

+The **False Positive Rate (FPR)** is defined as

+$$\frac{\text{FP}}{\text{FP + TN}}$$

+The FPR is the complement of *specificity* (specificity $= 1 - \text{FPR}$, the true negative rate). FPR answers the question: "what proportion of regular email did I mark as spam?" We'd like this to be close to $0$.

+As we increase threshold $T$, both TPR and FPR decrease. We've plotted this relationship below for some model on a toy dataset.

+
tpr_fpr
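+One way to see this behavior numerically is to sweep over thresholds by hand. A rough sketch, assuming the fitted `model`, features `X`, and labels `y` from the earlier snippets (and that both classes are present, so the denominators are nonzero):
+
+    import numpy as np
+
+    probs = model.predict_proba(X)[:, 1]       # P(Y = 1 | x)
+    for T in [0.1, 0.3, 0.5, 0.7, 0.9]:
+        y_hat = (probs >= T).astype(int)
+        tpr = np.mean(y_hat[y == 1] == 1)      # TP / (TP + FN)
+        fpr = np.mean(y_hat[y == 0] == 1)      # FP / (FP + TN)
+        print(f"T={T:.1f}  TPR={tpr:.2f}  FPR={fpr:.2f}")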
## Adjusting the Classification Threshold

One way to trade off the number of false positives vs. false negatives (equivalently, precision vs. recall) is by adjusting the classification threshold $T$.

@@ -350,7 +296,7 @@
$$\hat y = \begin{cases}
        1, & P(Y=1|x) \ge T\\
        0, & \text{otherwise }
    \end{cases}$$
 
The default threshold in `sklearn` is $T = 0.5$. As we increase the threshold $T$, we “raise the standard” of how confident our classifier needs to be to predict 1 (i.e., “positive”).

-varying_threshold
+
varying_threshold
As you may notice, the choice of threshold $T$ impacts our classifier's performance. @@ -363,68 +309,93 @@ As you may notice, the choice of threshold $T$ impacts our classifier's performa In fact, we can choose a threshold $T$ based on our desired number, or proportion, of false positives and false negatives. We can do so using a few different tools. We'll touch on two of the most important ones in Data 100. -1. Precision-Recall Curve (PR Curve). [Covered in Extra Content] +1. Precision-Recall Curve (PR Curve) 2. "Receiver Operating Characteristic" Curve (ROC Curve) -To motivate the ROC Curve, let's first consider two more metrics - true positive rate (TPR) and false positive rate (FPR). -### Two More Metrics - -The **True Positive Rate (TPR)** is defined as - -$$\frac{\text{TP}}{\text{TP + FN}}$$ - -You'll notice this is equivalent to *recall*. In the context of our spam email classifier, it answers the question: "what proportion of spam did I mark correctly?". +### Precision-Recall Curves -- We'd like this to be close to $1$ +A **Precision-Recall Curve (PR Curve)** is an alternative to the ROC curve that displays the relationship between precision and recall for various threshold values. It is constructed in a similar way as with the ROC curve. -The **False Positive Rate (FPR)** is defined as +Let's first consider how precision and recall change as a function of the threshold $T$. We know this quite well from earlier -- precision will generally increase, and recall will decrease. -$$\frac{\text{FP}}{\text{FP + TN}}$$ +
precision-recall-thresh
-Another word for FPR is *specificity*. This answers the question: "what proportion of regular email did I mark as spam?" +Displayed below is the PR-Curve for the same toy dataset. Notice how threshold values increase as we move to the left. -- We'd like this to be close to $0$ +
pr_curve_thresholds
-As we increase threshold $T$, both TPR and FPR decrease. We've plotted this relationship below for some model on a toy dataset. +Once again, the perfect classifier will resemble the orange curve, this time, facing the opposite direction. -tpr_fpr +
pr_curve_perfect
+We want our PR Curve to be as close to the “top right” of this graph as possible. Again, we use the AUC to determine "closeness", with the perfect classifier exhibiting an AUC = 1. (Unlike for the ROC curve, the baseline AUC of a random classifier here is not 0.5, but rather the proportion of positive points in the data.)

### The ROC Curve

The “Receiver Operating Characteristic” Curve (**ROC Curve**) plots the tradeoff between FPR and TPR. Notice how the far-left of the curve corresponds to higher threshold $T$ values.

-roc_curve
+
roc_curve
The “perfect” classifier is the one that has a TPR of 1 and an FPR of 0. This is achieved at the top-left of the plot below. More generally, its ROC curve resembles the curve in orange.

-roc_curve_perfect
+
roc_curve_perfect
We want our model to be as close to this orange curve as possible. How do we quantify "closeness"?

-We can compute the **area under curve (AUC)** of the ROC curve. Notice how the perfect classifier has an AUC = 1. The closer our model's AUC is to 1, the better it is. On the other hand, a terrible model will have an AUC closer to 0.5. This indicates the classifier is not able to distinguish between positive and negative classes, and thus, randomly predicts one of the two.
+We can compute the **area under curve (AUC)** of the ROC curve. Notice how the perfect classifier has an AUC = 1. The closer our model's AUC is to 1, the better it is. 

-roc_curve_worst_predictor
-## Extra Content
+#### [Extra] What is the “worst” AUC and why is it 0.5? 
+On the other hand, a terrible model will have an AUC closer to 0.5. A random predictor assigns $P(Y = 1 | x)$ uniformly at random between 0 and 1; such a classifier is not able to distinguish between the positive and negative classes and, thus, effectively guesses one of the two at random.

-### Precision-Recall Curves
+
roc_curve_worst_predictor
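+In practice, we rarely trace these curves by hand. A sketch using `sklearn` (again assuming the fitted `model`, `X`, and `y` from the earlier snippets):
+
+    from sklearn.metrics import roc_curve, roc_auc_score, precision_recall_curve
+
+    probs = model.predict_proba(X)[:, 1]
+    fpr, tpr, thresholds = roc_curve(y, probs)                        # points along the ROC curve
+    precision, recall, pr_thresh = precision_recall_curve(y, probs)   # points along the PR curve
+    auc = roc_auc_score(y, probs)                                     # area under the ROC curve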
-A **Precision-Recall Curve (PR Curve)** is an alternative to the ROC curve that displays the relationship between precision and recall for various threshold values. It is constructed in a similar way as with the ROC curve.
+## [Extra] Gradient Descent for Logistic Regression
+Let's define the following:
+$$
+t_i = \phi(x_i)^T \theta \\
+p_i = \sigma(t_i) \\
+t_i = \log(\frac{p_i}{1 - p_i}) \\
+1 - \sigma(t_i) = \sigma(-t_i) \\
+\frac{d}{dt} \sigma(t) = \sigma(t) \sigma(-t)
+$$

-Let's first consider how precision and recall change as a function of the threshold $T$. We know this quite well from earlier -- precision will generally increase, and recall will decrease.
+Now, we can simplify the log-likelihood term inside the cross-entropy loss:
+$$
+\begin{align}
+y_i \log(p_i) + (1 - y_i)\log(1 - p_i) &= y_i \log(\frac{p_i}{1 - p_i}) + \log(1 - p_i) \\
+&= y_i \phi(x_i)^T \theta + \log(\sigma(-\phi(x_i)^T \theta))
+\end{align}
+$$

-precision-recall-thresh
+Hence, the optimal $\hat{\theta}$ is
+$$\hat{\theta} = \text{argmin}_{\theta} - \frac{1}{n} \sum_{i=1}^n (y_i \phi(x_i)^T \theta + \log(\sigma(-\phi(x_i)^T \theta)))$$

-Displayed below is the PR-Curve for the same toy dataset. Notice how threshold values increase as we move to the left.
+In other words, we want to minimize $$L(\theta) = - \frac{1}{n} \sum_{i=1}^n (y_i \phi(x_i)^T \theta + \log(\sigma(-\phi(x_i)^T \theta)))$$

-pr_curve_thresholds
+So we take the gradient, using the sigmoid derivative identity above together with the chain rule:
+$$
+\begin{align}
+\nabla_{\theta} L(\theta) &= - \frac{1}{n} \sum_{i=1}^n \left( \nabla_{\theta} \, y_i \phi(x_i)^T \theta + \nabla_{\theta} \log(\sigma(-\phi(x_i)^T \theta)) \right) \\
+&= - \frac{1}{n} \sum_{i=1}^n \left( y_i \phi(x_i) + \frac{1}{\sigma(-\phi(x_i)^T \theta)} \nabla_{\theta} \sigma(-\phi(x_i)^T \theta) \right) \\
+&= - \frac{1}{n} \sum_{i=1}^n \left( y_i \phi(x_i) + \frac{\sigma(-\phi(x_i)^T \theta) \sigma(\phi(x_i)^T \theta)}{\sigma(-\phi(x_i)^T \theta)} \left( -\phi(x_i) \right) \right) \\
+&= - \frac{1}{n} \sum_{i=1}^n \left( y_i - \sigma(\phi(x_i)^T \theta) \right) \phi(x_i)
+\end{align}
+$$

-Once again, the perfect classifier will resemble the orange curve, this time, facing the opposite direction.
+Setting the gradient to 0 and solving for $\hat{\theta}$, we find that there's no general analytic solution. Therefore, we must solve using numeric methods.

-pr_curve_perfect
+### Gradient Descent Update Rule
+$$\theta^{(0)} \leftarrow \text{initial vector (random, zeros, ...)} $$

-We want our PR-Curve to be as close to the “top right” of this graph as possible. Again, we use the AUC to determine "closeness", with the perfect classifier exhibiting an AUC = 1 (and the worst with an AUC = 0.5).
+For $\tau$ from 0 to convergence:
+$$ \theta^{(\tau + 1)} \leftarrow \theta^{(\tau)} - \rho(\tau)\left( \frac{1}{n} \sum_{i=1}^n \nabla_{\theta} L_i(\theta) \Big|_{\theta = \theta^{(\tau)}}\right) $$

+### Stochastic Gradient Descent Update Rule
+$$\theta^{(0)} \leftarrow \text{initial vector (random, zeros, ...)} $$

+For $\tau$ from 0 to convergence, let $B$ be a random subset of indices drawn at that iteration.
+$$ \theta^{(\tau + 1)} \leftarrow \theta^{(\tau)} + \rho(\tau)\left( \frac{1}{|B|} \sum_{i \in B} \triangledown_{\theta} L_i(\theta) \mid_{\theta = \theta^{(\tau)}}\right) $$ diff --git a/logistic_regression_2/logistic_reg_2_old.ipynb b/logistic_regression_2/logistic_reg_2_old.ipynb new file mode 100644 index 00000000..4b52ed06 --- /dev/null +++ b/logistic_regression_2/logistic_reg_2_old.ipynb @@ -0,0 +1,526 @@ +{ + "cells": [ + { + "cell_type": "raw", + "metadata": {}, + "source": [ + "---\n", + "title: Logistic Regression II\n", + "format:\n", + " html:\n", + " toc: true\n", + " toc-depth: 5\n", + " toc-location: right\n", + " code-fold: false\n", + " theme:\n", + " - cosmo\n", + " - cerulean\n", + " callout-icon: false\n", + "---" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Today, we will continue studying the Logistic Regression model. First, we'll examine logisitc regression through the lens of maxmium likelihood estimation and understand its connection to cross-entropy loss. Then, we'll pick up from last lecture's discussion of cross-entropy loss, study a few of its pitfalls, and learn potential remedies. Finally, we will provide an implementation of `sklearn`'s logistic regression model. \n", + "\n", + "This will introduce us to the process of **thresholding** -- a technique used to *classify* data from our model's predicted probabilities, or $P(Y=1|x)$. In doing so, we'll focus on how these thresholding decisions affect the behavior of our model. We will learn various evaluation metrics useful for binary classification, and apply them to our study of logistic regression.\n", + "\n", + "## Logistic Regression Model (continued)\n", + "\n", + "\n", + "### Maximum Likelihood Estimation\n", + "\n", + "\n", + "In our earlier coin toss example, we had data on 10 coin flips, and wanted to estimate $\\hat \\theta$, the probability of a heads." + ] + }, + { + "cell_type": "code", + "metadata": {}, + "source": [ + "flips = [0, 0, 1, 1, 1, 1, 0, 0, 0, 0]\n", + "flips" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "$\\hat \\theta = 0.4$ is the most intuitive two reasons:\n", + "\n", + "1. It is the frequency of heads in our data\n", + "2. It maximizes the **likelihood** of our data \n", + "\n", + "$$\\hat \\theta = \\text{argmax}_\\theta (\\theta^4(1-\\theta)^6)$$\n", + "\n", + "More generally, we can apply this notion of likelihood to any random binary sample. For example, we can find the likelihood of the data observed in our breast cancer study. We will show how the likelihood is intrinsically related to cross-entropy loss.\n", + "\n", + "As a quick refresher on likelihood:\n", + "\n", + "For some Bernoulli($p$) random variable $Y$ the probability distribution, or likelihood is:\n", + "\n", + "$$P(Y = y) = \\begin{cases}\n", + " 1, \\text{with probability } p\\\\\n", + " 0, \\text{with probability } 1 - p\n", + " \\end{cases} $$\n", + " \n", + "Equivalently, this can be written in a compact way:\n", + "\n", + "$$P(Y=y) = p^y(1-p)^{1-y}$$\n", + "\n", + "- When $y = 1$, this reads $P(Y=y) = p$\n", + "- When $y = 0$, this reads $P(Y=y) = (1-p)$\n", + "\n", + "In our example, a Bernoulli random variable is analagous to a single data point, or tumor (from the previous chapter). All together, our breast cancer study consist of multiple IID Bernoulli($p$) random variables. 
To find the likelihood of independent events in succession, simply multiply their likelihoods.\n", + "\n", + "$$\\prod_{i=1}^{n} p^{y_i} (1-p)^{1-y_i}$$\n", + "\n", + "As with the coin example, we want to find the parameter $p$ that maximizes this likelihood - this technique is known as **maximum likelihood estimation**. Earlier, we gave an intuitive graphical solution, but let's take the derivative of the likelihood to find this maximum.\n", + "\n", + "From a first glance, this derivative will be complicated! We will have to use the product rule, followed by the chain rule. Instead, we can make an observation that simplifies the problem. \n", + "\n", + "Finding the $p$ that maximizes $$\\prod_{i=1}^{n} p^{y_i} (1-p)^{1-y_i}$$ is equivalent to the $p$ that maximizes $$\\text{log}(\\prod_{i=1}^{n} p^{y_i} (1-p)^{1-y_i})$$\n", + "\n", + "This is because $\\text{log}$ is a strictly *increasing* function. It won't change the maximum or minimum of the function it was applied to. From $\\text{log}$ properties, $\\text{log}(a*b)$ = $\\text{log}(a) + \\text{log}(b)$. We can apply this to our equation above to get:\n", + "\n", + "$$\\text{argmax}_p \\sum_{i=1}^{n} \\text{log}(p^{y_i} (1-p)^{1-y_i})$$\n", + "\n", + "$$= \\text{argmax}_p \\sum_{i=1}^{n} \\text{log}(p^{y_i}) + \\text{log}((1-p)^{1-y_i})$$\n", + "\n", + "$$= \\text{argmax}_p \\sum_{i=1}^{n} y_i\\text{log}(p) + (1-y_i)\\text{log}(1-p)$$\n", + "\n", + "We can add a constant factor of $\\frac{1}{n}$ out front. It won't affect the $p$ that maximizes our likelihood.\n", + "\n", + "$$=\\text{argmax}_p \\frac{1}{n} \\sum_{i=1}^{n} y_i\\text{log}(p) + (1-y_i)\\text{log}(1-p)$$\n", + "\n", + "One last \"trick\" we can do is change this to a minimization problem by negating the result. This works because we are dealing with a *concave* function, which can be made *convex*.\n", + "\n", + "$$= \\text{argmin}_p -\\frac{1}{n} \\sum_{i=1}^{n} y_i\\text{log}(p) + (1-y_i)\\text{log}(1-p)$$\n", + "\n", + "This is exactly our average cross-entropy loss minimization problem from before! \n", + "\n", + "Why did we do all this complicated math? We have shown that *minimizing* cross-entropy loss is equivalent to *maximizing* the likelihood of the training data.\n", + "\n", + "- By minimizing cross-entropy loss, we are choosing the model parameters that are \"most likely\" for the data we observed.\n", + "\n", + "## A (Tangent) Note\n", + "\n", + "You will study MLE further in probability and ML classes. But now you know it exists. It turns out that many of the model + loss combinations we’ve seen can be motivated using MLE (OLS, Ridge Regression, etc.)\n", + "\n", + "The two important takeways from this section are \n", + "\n", + "1. Formulating the Logistic Regression Model\n", + "2. Motivating the Cross-Entropy Loss\n", + "\n", + "We will now continue to learn how to evaluate the strength of logistic regression models.\n", + "\n", + "Above, we proved that *minimizing* cross-entropy loss is equivalent to *maximizing* likelihood of the training data.\n", + "\n", + "Intuitively, this means that the optimal $\\hat \\theta$ that minimizes cross-entropy loss \"pushes\" all the probabilities in the direction of their true, underlying class. 
\n", + "\n", + "- For points that belong to the $0$ class, $\\sigma(x^T\\theta) \\rightarrow 0$\n", + "- For points that belong to the $1$ class, $\\sigma(x^T\\theta) \\rightarrow 1$\n", + "\n", + "However, something interesting happens when our data is perfectly classifiable; in other words, linearly seperable.\n", + "\n", + "### Linear Seperability and Regularization\n", + "\n", + "A classification dataset is said to be **linearly separable** if there exists a hyperplane among input features $x$ that separates the two classes $y$. For example, notice how the plot on the bottom left is linearly seperable along the vertical line $x=0$. No such line perfectly seperates the two classes on the bottom right.\n", + "\n", + "- Linear seperability in 1D can be found with a rugplot of a single feature.\n", + "\n", + "linear_seperability_1D\n", + "\n", + "This same definition holds in higher dimensions. If there are two features, the seperating hyperplane must exist in two dimensions (any line of the form $y=mx+b$)\n", + "\n", + "- Linear seperability among 2 features is evident from a two-dimensional visualization, or scatter plot.\n", + "\n", + "linear_seperability_1D\n", + "\n", + "Complications may arise when data is linearly seperable. Consider the toy dataset with 2 points and only a single feature $x$:\n", + "\n", + "toy_linear_seperability\n", + "\n", + "The optimal $\\theta$ value that minimizes loss pushes the predicted probabilities of the data points to their true class.\n", + "\n", + "- $P(Y = 1|x = -1) = \\frac{1}{1 + e^\\theta} \\rightarrow 1$\n", + "- $P(Y = 1|x = 1) = \\frac{1}{1 + e^{-\\theta}} \\rightarrow 0$\n", + "\n", + "This happens when $\\theta = -\\infty$. When $\\theta = -\\infty$, we observe the following behavior for any input $x$.\n", + "\n", + "$$P(Y=1|x) = \\sigma(\\theta x) \\rightarrow \\begin{cases}\n", + " 1, \\text{if } x < 0\\\\\n", + " 0, \\text{if } x \\ge 0\n", + " \\end{cases}$$\n", + "\n", + "The diverging weights cause the model to be overconfident. For example, consider the new point $(x, y) = (0.5, 1)$. Following the behavior above, our model will incorrectly predict $p=0$, and a thus, $\\hat y = 0$.\n", + "\n", + "toy_linear_seperability\n", + "\n", + "The loss incurred by this misclassified point is infinite.\n", + "\n", + "$$-(y\\text{ log}(p) + (1-y)\\text{ log}(1-p))$$\n", + "\n", + "$$=1\\text{log}(0)$$\n", + "\n", + "Thus, diverging weights ($|\\theta| \\rightarrow \\infty$) occur with **lineary separable** data. \"Overconfidence\" is a particularly dangerous version of overfitting.\n", + "\n", + "Consider the loss function with respect to the parameter $\\theta$.\n", + "\n", + "unreg_loss\n", + "\n", + "Although impossible to see, the plateau for negative values of $\\theta$ is slightly tilted downwards, meaning the loss approaches $0$ as $\\theta$ decreases and approaches $-\\infty$.\n", + "\n", + "#### Regularized Logistic Regression\n", + "\n", + "To avoid large weights, and thus, infinite loss (particularly on linearly seperable data), regularization is used. 
The same principles apply as with linear regression - make sure to standardize your features first.\n", + "\n", + "For example, L2 (Ridge) Logistic Regression takes on the form\n", + "\n", + "$$\\min_{\\theta} -\\frac{1}{n} \\sum_{i=1}^{n} (y_i \\text{log}(\\sigma(x_i^T\\theta)) + (1-y_i)\\text{log}(1-\\sigma(x_i^T\\theta))) + \\lambda \\sum_{i=1}^{d} \\theta_j^2$$\n", + "\n", + "Now, let us compare the loss functions of un-regularized and regularized logistic regression.\n", + "\n", + "unreg_loss\n", + "\n", + "reg_loss\n", + "\n", + "As we can see, $L2$ regularization helps us prevent diverging weights and deters against \"overconfidence.\"\n", + "\n", + "### Logistic Regression Model Implementation\n", + "\n", + "The implementation of logistic regression in `sklearn` is simple. We'll begin by fitting a model on the breast cancer dataset from last lecture." + ] + }, + { + "cell_type": "code", + "metadata": {}, + "source": [ + "#| code-fold: true\n", + "\n", + "import pandas as pd\n", + "import sklearn.datasets\n", + "\n", + "data_dict = sklearn.datasets.load_breast_cancer()\n", + "data = pd.DataFrame(data_dict['data'], columns=data_dict['feature_names'])\n", + "data['malignant'] = (data_dict['target'] == 0).astype(int)\n", + "\n", + "X = data[['mean radius']]\n", + "y = data['malignant']" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": {}, + "source": [ + "from sklearn.linear_model import LogisticRegression\n", + "model = LogisticRegression()\n", + "model.fit(X, y);" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "By default, `sklearn` applies regularization to the logistic regression class. This is to avoid diverging weights with seperable data. The code above can be written more expliclty as follows. " + ] + }, + { + "cell_type": "code", + "metadata": {}, + "source": [ + "# sklearn defaults\n", + "model = LogisticRegression(penalty='l2', C=1.0, fit_intercept=True)\n", + "model.fit(X, y);" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The parameter `C` controls the amount of regularization -- `C` is the inverse of the regularization hyperparameter $\\lambda$. Set `C` big for minimal regularization, and vice versa.\n", + "\n", + "The `.predict_proba` method returns the predicted probabilities of belonging to each class. The first element corresponds to class $0$, the second to class $1$." + ] + }, + { + "cell_type": "code", + "metadata": {}, + "source": [ + "# Here are the first 5 predicted probabilities\n", + "model.predict_proba(X)[:5]" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "From here, the `.predict` function returns the predicted class $\\hat y$ of the point. In the simple binary case, \n", + "\n", + "$$\\hat y = \\begin{cases}\n", + " 1, & P(Y=1|x) \\ge 0.5\\\\\n", + " 0, & \\text{otherwise }\n", + " \\end{cases}$$" + ] + }, + { + "cell_type": "code", + "metadata": {}, + "source": [ + "# Here are the first 5 predicted classes\n", + "model.predict(X)[:5]" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Performance Metrics\n", + "\n", + "Now that we have our classifier, let's quantify how well it performs. 
The most basic evaluation metric is **accuracy** -- the proportion of correctly classified points.\n", + "\n", + "$$\\text{accuracy} = \\frac{\\# \\text{ of points classified correctly}}{\\# \\text{ of total points}}$$" + ] + }, + { + "cell_type": "code", + "metadata": {}, + "source": [ + "model.score(X, y) # built-in accuracy function" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "However, accuracy is not always a great metric for classification, particularily when the data has class imbalance. \n", + "\n", + "To understand why, let's consider a classification problem with 100 emails, 5 of which are spam. We'll investigate two models where accuracy is a poor metric. \n", + "\n", + "- **Model 1**: Our first model classifies every email as non-spam. The model's accuracy is high ($\\frac{95}{100} = 0.95$), but it doesn't detect any spam emails. Despite the high accuracy, this is a bad model.\n", + "- **Model 2**: The second model classifies every email as spam. The accuracy is low ($\\frac{5}{100} = 0.05$), but the model correctly labels every spam email. Unfortunately, it also misclassifies every non-spam email.\n", + "\n", + "### The Confusion Matrix\n", + "\n", + "Model 1 from above has 5 **false negatives (FN)** -- data points which were predicted to belong to class $0$ (non-spam), but their true class was $1$ (spam). In a similar vein, Model 2 has 95 **false positives (FP)** -- that is, \"false alarms\" where we predict class $1$, but the true class was $0$. **True positives (TP)** and **true negatives (TN)** are when we correctly classify observations as being positive or negative, respectively.\n", + "\n", + "These classifications can be concisely summarized in a **confusion matrix**. \n", + "\n", + "confusion_matrix\n", + "\n", + "An easy way to remember this terminology is as follows:\n", + "\n", + "1. Look at the second word in the phrase. *Positive* means a prediction of 1. *Negative* means a prediction of 0.\n", + "2. Look at the first word in the phrase. *True* means our prediction was correct. *False* means it was incorrect.\n", + " \n", + "A confusion matrix for a particular classifier may be found programatically. 
For our breast cancer data, it looks like this:" + ] + }, + { + "cell_type": "code", + "metadata": {}, + "source": [ + "from sklearn.metrics import confusion_matrix\n", + "\n", + "y_pred = model.predict(X)\n", + "confusion_matrix(y, y_pred)" + ], + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Accuracy, Precision, and Recall\n", + "\n", + "The purpose of our discussion of the confusion matrix was to motivate better performance metrics for classification problems with class imbalance - namely, precision and recall.\n", + "\n", + "**Precision** is defined as\n", + "\n", + "$$\\frac{\\text{TP}}{\\text{TP + FP}}$$\n", + "\n", + "Precision answers the question: \"of all observations that were predicted to be $1$, what proportion were actually $1$?\" It measures how accurate the classifier is when its predictions are positive.\n", + "\n", + "**Recall** (or **sensitivity**) is defined as \n", + "\n", + "$$\\frac{\\text{TP}}{\\text{TP + FN}}$$\n", + "\n", + "Recall aims to answer: \"of all observations that were actually $1$, what proportion were predicted to be $1$?\" It measures how many positive predictions were missed.\n", + "\n", + "Here's a helpful graphic that summarizes our discussion above.\n", + "\n", + "confusion_matrix\n", + "\n", + "#### Example Calculation\n", + "\n", + "In this section, we will calculate the accuracy, precision, and recall performance metrics for our earlier spam classification example. As a reminder, we had a 100 emails, 5 of which were spam. We designed two models:\n", + "\n", + "- Model 1: Predict that every email is *non-spam*\n", + "- Model 2: Predict that every email is *spam*\n", + "\n", + "##### Model 1\n", + "\n", + "First, let's begin by creating the confusion matrix.\n", + "\n", + "\n", + "+-------------------+-------------------+---------------------------+\n", + "| | 0 | 1 |\n", + "+===================+===================+===========================+\n", + "| 0 | True Negative: 95 | False Positive: 0 |\n", + "+-------------------+-------------------+---------------------------+\n", + "| 1 | False Negative: 5 | True Positive: 0 |\n", + "+-------------------+-------------------+---------------------------+\n", + "\n", + "Convince yourself of why our confusion matrix looks like so.\n", + "\n", + "$$\\text{accuracy} = \\frac{95}{100} = 0.95$$\n", + "$$\\text{precision} = \\frac{0}{0 + 0} = \\text{undefined}$$\n", + "$$\\text{recall} = \\frac{0}{0 + 5} = 0$$\n", + "\n", + "- Notice how our precision is undefined because we never predicted class $1$\n", + "- Our recall is 0 for the same reason -- the numerator is 0 (we had no positive predictions)\n", + "\n", + "##### Model 2\n", + "\n", + "Our confusion matrix for Model 2 looks like so.\n", + "\n", + "+-------------------+-------------------+---------------------------+\n", + "| | 0 | 1 |\n", + "+===================+===================+===========================+\n", + "| 0 | True Negative: 0 | False Positive: 95 |\n", + "+-------------------+-------------------+---------------------------+\n", + "| 1 | False Negative: 0 | True Positive: 5 |\n", + "+-------------------+-------------------+---------------------------+\n", + "\n", + "$$\\text{accuracy} = \\frac{5}{100} = 0.05$$\n", + "$$\\text{precision} = \\frac{5}{5 + 95} = 0.05$$\n", + "$$\\text{recall} = \\frac{5}{5 + 0} = 1$$\n", + "\n", + "- Our precision is low because we have many false positives\n", + "- Our recall is perfect - we correctly classified all spam emails (we never 
predicted class $0$)\n", + "\n", + "#### Precision vs Recall\n", + "\n", + "Precision ($\\frac{\\text{TP}}{\\text{TP} + \\textbf{ FP}}$) penalizes false positives, while recall ($\\frac{\\text{TP}}{\\text{TP} + \\textbf{ FN}}$) penalizes false negatives.\n", + "\n", + "In fact, precision and recall are *inversely related*. This is evident in our second model -- we observed a high recall and low precision. Usually, there is a tradeoff in these two (most models can either minimize the number of FP or FN; and in rare cases, both). \n", + "\n", + "The specific performance metric(s) to prioritize depends on the context. In many medical settings, there might be a much higher cost to missing positive cases. For instance, in our breast cancer example, it is more costly to misclassify malignant tumors (false negatives) than it is to incorrectly classify a benign tumor as malignant (false positives). In the case of the latter, pathologists can conduct further study to verify malignant tumors. As such, we should minimize the number of false negatives. This is equivalent to maximizing recall.\n", + "\n", + "## Adjusting the Classification Threshold\n", + "\n", + "One way to minimize the number of FP vs. FN (equivalently, maximizing precision vs. recall) is by adjusting the classification threshold $T$.\n", + "\n", + "$$\\hat y = \\begin{cases}\n", + " 1, & P(Y=1|x) \\ge T\\\\\n", + " 0, & \\text{otherwise }\n", + " \\end{cases}$$\n", + " \n", + "The default threshold in `sklearn` is $T = 0.5$. As we increase the threshold $T$, we “raise the standard” of how confident our classifier needs to be to predict 1 (i.e., “positive”).\n", + "\n", + "varying_threshold\n", + "\n", + "As you may notice, the choice of threshold $T$ impacts our classifier's performance.\n", + "\n", + "- High $T$: Most predictions are $0$. \n", + " - Lots of false negatives\n", + " - Fewer false positives\n", + "- Low $T$: Most predictions are $1$. \n", + " - Lots of false positives \n", + " - Fewer false negatives\n", + "\n", + "In fact, we can choose a threshold $T$ based on our desired number, or proportion, of false positives and false negatives. We can do so using a few different tools. We'll touch on two of the most important ones in Data 100.\n", + "\n", + "1. Precision-Recall Curve (PR Curve). [Covered in Extra Content]\n", + "2. \"Receiver Operating Characteristic\" Curve (ROC Curve)\n", + "\n", + "To motivate the ROC Curve, let's first consider two more metrics - true positive rate (TPR) and false positive rate (FPR).\n", + "\n", + "### Two More Metrics\n", + "\n", + "The **True Positive Rate (TPR)** is defined as\n", + "\n", + "$$\\frac{\\text{TP}}{\\text{TP + FN}}$$\n", + "\n", + "You'll notice this is equivalent to *recall*. In the context of our spam email classifier, it answers the question: \"what proportion of spam did I mark correctly?\". \n", + "\n", + "- We'd like this to be close to $1$\n", + "\n", + "The **False Positive Rate (FPR)** is defined as\n", + "\n", + "$$\\frac{\\text{FP}}{\\text{FP + TN}}$$\n", + "\n", + "Another word for FPR is *specificity*. This answers the question: \"what proportion of regular email did I mark as spam?\" \n", + "\n", + "- We'd like this to be close to $0$\n", + "\n", + "As we increase threshold $T$, both TPR and FPR decrease. We've plotted this relationship below for some model on a toy dataset.\n", + "\n", + "tpr_fpr\n", + "\n", + "\n", + "### The ROC Curve\n", + "\n", + "The “Receiver Operating Characteristic” Curve (**ROC Curve**) plots the tradeoff between FPR and TPR. 
Notice how the far-left of the curve corresponds to higher threshold $T$ values.\n", + "\n", + "roc_curve\n", + "\n", + "The “perfect” classifier is the one that has a TPR of 1, and FPR of 0. This is achieved at the top-left of the plot below. More generally, it's ROC curve resembles the curve in orange.\n", + "\n", + "roc_curve_perfect\n", + "\n", + "We want our model to be as close to this orange curve as possible. How do we quantify \"closeness\"?\n", + "\n", + "We can compute the **area under curve (AUC)** of the ROC curve. Notice how the perfect classifier has an AUC = 1. The closer our model's AUC is to 1, the better it is. On the other hand, a terrible model will have an AUC closer to 0.5. This indicates the classifier is not able to distinguish between positive and negative classes, and thus, randomly predicts one of the two.\n", + "\n", + "roc_curve_worst_predictor\n", + "\n", + "## Extra Content\n", + "\n", + "### Precision-Recall Curves\n", + "\n", + "A **Precision-Recall Curve (PR Curve)** is an alternative to the ROC curve that displays the relationship between precision and recall for various threshold values. It is constructed in a similar way as with the ROC curve.\n", + "\n", + "Let's first consider how precision and recall change as a function of the threshold $T$. We know this quite well from earlier -- precision will generally increase, and recall will decrease.\n", + "\n", + "precision-recall-thresh\n", + "\n", + "Displayed below is the PR-Curve for the same toy dataset. Notice how threshold values increase as we move to the left.\n", + "\n", + "pr_curve_thresholds\n", + "\n", + "Once again, the perfect classifier will resemble the orange curve, this time, facing the opposite direction.\n", + "\n", + "pr_curve_perfect\n", + "\n", + "We want our PR-Curve to be as close to the “top right” of this graph as possible. Again, we use the AUC to determine \"closeness\", with the perfect classifier exhibiting an AUC = 1 (and the worst with an AUC = 0.5).\n" + ] + } + ], + "metadata": { + "kernelspec": { + "name": "python3", + "language": "python", + "display_name": "Python 3 (ipykernel)" + } + }, + "nbformat": 4, + "nbformat_minor": 4 +} \ No newline at end of file