diff --git a/docs/search.json b/docs/search.json index 1a2de9c5..2ffd1bc7 100644 --- a/docs/search.json +++ b/docs/search.json @@ -54,5 +54,40 @@ "title": "20  SQL I", "section": "20.6 Aggregating with GROUP BY", "text": "20.6 Aggregating with GROUP BY\nAt this point, we’ve seen that SQL offers much of the same functionality that was given to us by pandas. We can extract data from a table, filter it, and reorder it to suit our needs.\nIn pandas, much of our analysis work relied heavily on being able to use .groupby() to aggregate across the rows of our dataset. SQL’s answer to this task is the (very conveniently named) GROUP BY clause. While the outputs of GROUP BY are similar to those of .groupby() – in both cases, we obtain an output table where some column has been used for grouping – the syntax and logic used to group data in SQL are fairly different to the pandas implementation.\nTo illustrate GROUP BY, we will consider the Dish table from the basic_examples.db database.\n\n%%sql\nSELECT * \nFROM Dish\n\n * sqlite:///data/basic_examples.db\nDone.\n\n\n\n\n\nname\ntype\ncost\n\n\n\n\nravioli\nentree\n10\n\n\nramen\nentree\n13\n\n\ntaco\nentree\n7\n\n\nedamame\nappetizer\n4\n\n\nfries\nappetizer\n4\n\n\npotsticker\nappetizer\n4\n\n\nice cream\ndessert\n5\n\n\n\n\n\nSay we wanted to find the total costs of dishes of a certain type. To accomplish this, we would write the following code.\n\n%%sql\nSELECT type, SUM(cost)\nFROM Dish\nGROUP BY type\n\n * sqlite:///data/basic_examples.db\nDone.\n\n\n\n\n\ntype\nSUM(cost)\n\n\n\n\nappetizer\n12\n\n\ndessert\n5\n\n\nentree\n30\n\n\n\n\n\nWhat is going on here? The statement GROUP BY type tells SQL to group the data based on the value contained in the type column (whether a record is an appetizer, entree, or dessert). SUM(cost) sums up the costs of dishes in each type and displays the result in the output table.\nYou may be wondering: why does SUM(cost) come before the command to GROUP BY type? Don’t we need to form groups before we can count the number of entries in each?\nRemember that SQL is a declarative programming language – a SQL programmer simply states what end result they would like to see, and leaves the task of figuring out how to obtain this result to SQL itself. This means that SQL queries sometimes don’t follow what a reader sees as a “logical” sequence of thought. Instead, SQL requires that we follow its set order of operations when constructing queries. So long as we follow this ordering, SQL will handle the underlying logic.\nIn practical terms: our goal with this query was to output the total costs of each type. To communicate this to SQL, we say that we want to SELECT the SUMmed cost values for each type group.\nThere are many aggregation functions that can be used to aggregate the data contained in each group. Some common examples are:\n\nCOUNT: count the number of rows associated with each group\nMIN: find the minimum value of each group\nMAX: find the maximum value of each group\nSUM: sum across all records in each group\nAVG: find the average value of each group\n\nWe can easily compute multiple aggregations, all at once (a task that was very tricky in pandas).\n\n%%sql\nSELECT type, SUM(cost), MIN(cost), MAX(name)\nFROM Dish\nGROUP BY type\n\n * sqlite:///data/basic_examples.db\nDone.\n\n\n\n\n\ntype\nSUM(cost)\nMIN(cost)\nMAX(name)\n\n\n\n\nappetizer\n12\n4\npotsticker\n\n\ndessert\n5\n5\nice cream\n\n\nentree\n30\n7\ntaco\n\n\n\n\n\nTo count the number of rows associated with each group, we use the COUNT keyword. 
Calling COUNT(*) will compute the total number of rows in each group, including rows with null values. Its pandas equivalent is .groupby().size().\n\n%%sql\nSELECT type, COUNT(*)\nFROM Dish\nGROUP BY type\n\n * sqlite:///data/basic_examples.db\nDone.\n\n\n\n\n\ntype\nCOUNT(*)\n\n\n\n\nappetizer\n3\n\n\ndessert\n1\n\n\nentree\n3\n\n\n\n\n\nTo exclude NULL values when counting the rows in each group, we explicitly call COUNT on a column in the table. This is similar to calling .groupby().count() in pandas.\n\n%%sql\nSELECT year, COUNT(cute)\nFROM Dragon\nGROUP BY year\n\n * sqlite:///data/basic_examples.db\nDone.\n\n\n\n\n\nyear\nCOUNT(cute)\n\n\n\n\n2010\n1\n\n\n2011\n1\n\n\n2019\n1\n\n\n\n\n\nWith this definition of GROUP BY in hand, let’s update our SQL order of operations. Remember: every SQL query must list clauses in this order.\nSELECT <column expression list>\nFROM <table>\n[WHERE <predicate>]\n[GROUP BY <column list>]\n[ORDER BY <column list>]\n[LIMIT <number of rows>]\n[OFFSET <number of rows>];\nNote that we can use the AS keyword to rename columns during the selection process and that column expressions may include aggregation functions (MAX, MIN, etc.)." + }, + { + "objectID": "logistic_regression_2/logistic_reg_2.html#decision-boundaries", + "href": "logistic_regression_2/logistic_reg_2.html#decision-boundaries", + "title": "23  Logistic Regression II", + "section": "23.1 Decision Boundaries", + "text": "23.1 Decision Boundaries\nIn logistic regression, we model the probability that a datapoint belongs to Class 1. Last week, we developed the logistic regression model to predict that probability, but we never actually made any classifications for whether our prediction \\(y\\) belongs in Class 0 or Class 1.\n\\[ p = P(Y=1 | x) = \\frac{1}{1 + e^{-x^T\\theta}}\\]\nA decision rule tells us how to interpret the output of the model to make a decision on how to classify a datapoint. We commonly make decision rules by specifying a threshold, \\(T\\). If the predicted probability is greater than or equal to \\(T\\), predict Class 1. Otherwise, predict Class 0.\n\\[\\hat y = \\text{classify}(x) = \\begin{cases}\n 1, & P(Y=1|x) \\ge T\\\\\n 0, & \\text{otherwise }\n \\end{cases}\\]\nThe threshold is often set to \\(T = 0.5\\), but not always. We’ll discuss why we might want to use other thresholds \\(T \\neq 0.5\\) later in this lecture.\nUsing our decision rule, we can define a decision boundary as the “line” that splits the data into classes based on its features. For logistic regression, the decision boundary is a hyperplane – a linear combination of the features in p-dimensions – and we can recover it from the final logistic regression model. For example, if we have a model with 2 features (2D), we have \\(\\theta = [\\theta_0, \\theta_1, \\theta_2]\\) including the intercept term, and we can solve for the decision boundary like so:\n\\[\n\\begin{align}\nT &= \\frac{1}{1 + e^{-(\\theta_0 + \\theta_1 \\cdot \\text{feature1} + \\theta_2 \\cdot \\text{feature2})}} \\\\\n1 + e^{-(\\theta_0 + \\theta_1 \\cdot \\text{feature1} + \\theta_2 \\cdot \\text{feature2})} &= \\frac{1}{T} \\\\\ne^{-(\\theta_0 + \\theta_1 \\cdot \\text{feature1} + \\theta_2 \\cdot \\text{feature2})} &= \\frac{1}{T} - 1 \\\\\n\\theta_0 + \\theta_1 \\cdot \\text{feature1} + \\theta_2 \\cdot \\text{feature2} &= -\\log(\\frac{1}{T} - 1)\n\\end{align}\n\\]\nFor a model with 2 features, the decision boundary is a line in terms of its features; when \\(T = 0.5\\), the right-hand side above is \\(-\\log(1) = 0\\), so the boundary is simply the set of points where \\(\\theta_0 + \\theta_1 \\cdot \\text{feature1} + \\theta_2 \\cdot \\text{feature2} = 0\\). 
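To make this concrete, here is a minimal sketch (not from the original lecture) of how the boundary could be recovered from a model fit with sklearn; the two-column feature array X and the label array y are assumed to already exist.
import numpy as np
from sklearn.linear_model import LogisticRegression

model = LogisticRegression()
model.fit(X, y)                      # X holds the columns feature1 and feature2

theta_0 = model.intercept_[0]        # intercept term
theta_1, theta_2 = model.coef_[0]    # weights on feature1 and feature2

T = 0.5                              # classification threshold
rhs = -np.log(1 / T - 1)             # right-hand side of the boundary equation (0 when T = 0.5)

# Points on the boundary satisfy theta_0 + theta_1*f1 + theta_2*f2 = rhs,
# so (assuming theta_2 != 0) we can trace feature2 as a function of feature1:
f1 = np.linspace(X[:, 0].min(), X[:, 0].max(), 100)
f2 = (rhs - theta_0 - theta_1 * f1) / theta_2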
To make it easier to visualize, we’ve included an example of a 1-dimensional and a 2-dimensional decision boundary below. Notice how the decision boundary predicted by our logistic regression model perfectly separates the points into two classes.\n\n\n\nIn real life, however, that is often not the case, and we often see some overlap between points of different classes across the decision boundary. The true classes of the 2D data are shown below:\n\n\n\nAs you can see, the decision boundary predicted by our logistic regression does not perfectly separate the two classes. There’s a “muddled” region near the decision boundary where our classifier predicts the wrong class. What would the data have to look like for the classifier to make perfect predictions?" + }, + { + "objectID": "logistic_regression_2/logistic_reg_2.html#linear-separability-and-regularization", + "href": "logistic_regression_2/logistic_reg_2.html#linear-separability-and-regularization", + "title": "23  Logistic Regression II", + "section": "23.2 Linear Separability and Regularization", + "text": "23.2 Linear Separability and Regularization\nA classification dataset is said to be linearly separable if there exists a hyperplane among input features \\(x\\) that separates the two classes \\(y\\).\nLinear separability in 1D can be visualized with a rugplot of a single feature. For example, notice how the plot on the bottom left is linearly separable along the vertical line \\(x=0\\). However, no such line perfectly separates the two classes on the bottom right.\n\n\n\nThis same definition holds in higher dimensions. If there are two features, the separating hyperplane must exist in two dimensions (any line of the form \\(y=mx+b\\)). We can visualize this using a scatter plot.\n\n\n\nThis sounds great! When the dataset is linearly separable, a logistic regression classifier can perfectly assign datapoints into classes. However, (unexpected) complications may arise. Consider the toy dataset with 2 points and only a single feature \\(x\\):\n\n\n\nThe optimal \\(\\theta\\) value that minimizes loss pushes the predicted probabilities of the data points to their true class.\n\n\\(P(Y = 1|x = -1) = \\frac{1}{1 + e^\\theta} \\rightarrow 1\\)\n\\(P(Y = 1|x = 1) = \\frac{1}{1 + e^{-\\theta}} \\rightarrow 0\\)\n\nThis happens when \\(\\theta = -\\infty\\). When \\(\\theta = -\\infty\\), we observe the following behavior for any input \\(x\\).\n\\[P(Y=1|x) = \\sigma(\\theta x) \\rightarrow \\begin{cases}\n 1, \\text{if } x < 0\\\\\n 0, \\text{if } x > 0\n \\end{cases}\\]\nThe diverging weights cause the model to be overconfident. For example, consider the new point \\((x, y) = (0.5, 1)\\). Following the behavior above, our model will incorrectly predict \\(p=0\\), and thus, \\(\\hat y = 0\\).\n\n\n\nThe loss incurred by this misclassified point is infinite, since \\(y = 1\\) while the predicted probability is \\(p = 0\\).\n\\[-(y\\text{ log}(p) + (1-y)\\text{ log}(1-p)) = -\\text{log}(0) = \\infty\\]\nThus, diverging weights (\\(|\\theta| \\rightarrow \\infty\\)) occur with linearly separable data. “Overconfidence” is a particularly dangerous version of overfitting.\nConsider the loss function with respect to the parameter \\(\\theta\\).\n\n\n\nThough it’s very difficult to see, the plateau for negative values of \\(\\theta\\) is slightly tilted downwards, meaning the loss approaches \\(0\\) as \\(\\theta\\) decreases toward \\(-\\infty\\).\n\n23.2.1 Regularized Logistic Regression\nTo avoid large weights and infinite loss (particularly on linearly separable data), we use regularization. 
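As a quick illustration (a sketch with an assumed two-point setup, not code from the original lecture), we can watch the fitted coefficient blow up when regularization is nearly turned off:
import numpy as np
from sklearn.linear_model import LogisticRegression

X_toy = np.array([[-1.0], [1.0]])   # single feature x
y_toy = np.array([1, 0])            # x = -1 belongs to Class 1, x = 1 to Class 0

weak_reg = LogisticRegression(C=1e10).fit(X_toy, y_toy)     # C is the inverse of lambda, so this is nearly unregularized
default_reg = LogisticRegression(C=1.0).fit(X_toy, y_toy)   # sklearn's default amount of regularization

print(weak_reg.coef_[0][0], default_reg.coef_[0][0])
# The nearly-unregularized coefficient should be far larger in magnitude (and negative),
# mirroring the divergence toward -infinity described above.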
The same principles apply as with linear regression - make sure to standardize your features first.\nFor example, \\(L2\\) (Ridge) Logistic Regression takes on the form:\n\\[\\min_{\\theta} -\\frac{1}{n} \\sum_{i=1}^{n} (y_i \\text{log}(\\sigma(x_i^T\\theta)) + (1-y_i)\\text{log}(1-\\sigma(x_i^T\\theta))) + \\lambda \\sum_{j=1}^{d} \\theta_j^2\\]\nNow, let us compare the loss functions of un-regularized and regularized logistic regression.\n\n\n\n\n\n\nAs we can see, \\(L2\\) regularization helps us prevent diverging weights and deters “overconfidence.”\nsklearn’s logistic regression defaults to L2 regularization and C=1.0; C is the inverse of \\(\\lambda\\): \\(C = \\frac{1}{\\lambda}\\). Setting C to a large value, for example, C=300.0, results in minimal regularization.\n# sklearn defaults\nmodel = LogisticRegression(penalty='l2', C=1.0, …)\nmodel.fit(X, y)\nNote that in Data 100, we only use sklearn to fit logistic regression models. There is no closed-form solution to the optimal theta vector, and the gradient is a little messy (see the bonus section below for details).\nFrom here, the .predict function returns the predicted class \\(\\hat y\\) of the point. In the simple binary case,\n\\[\\hat y = \\begin{cases}\n 1, & P(Y=1|x) \\ge 0.5\\\\\n 0, & \\text{otherwise }\n \\end{cases}\\]" + }, + { + "objectID": "logistic_regression_2/logistic_reg_2.html#performance-metrics", + "href": "logistic_regression_2/logistic_reg_2.html#performance-metrics", + "title": "23  Logistic Regression II", + "section": "23.3 Performance Metrics", + "text": "23.3 Performance Metrics\nYou might be thinking, if we’ve already introduced cross-entropy loss, why do we need additional ways of assessing how well our models perform? In linear regression, we made numerical predictions and used a loss function to determine how “good” these predictions were. In logistic regression, our ultimate goal is to classify data – we are much more concerned with whether or not each datapoint was assigned the correct class using the decision rule. As such, we are interested in the quality of classifications, not the predicted probabilities.\nThe most basic evaluation metric is accuracy, that is, the proportion of correctly classified points.\n\\[\\text{accuracy} = \\frac{\\# \\text{ of points classified correctly}}{\\# \\text{ of total points}}\\]\nTranslated to code:\ndef accuracy(X, Y):\n    return np.mean(model.predict(X) == Y)\n\nmodel.score(X, y) # built-in accuracy function\nHowever, accuracy is not always a great metric for classification. To understand why, let’s consider a classification problem with 100 emails where only 5 are truly spam, and the remaining 95 are truly ham. We’ll investigate two models where accuracy is a poor metric.\n\nModel 1: Our first model classifies every email as non-spam. The model’s accuracy is high (\\(\\frac{95}{100} = 0.95\\)), but it doesn’t detect any spam emails. Despite the high accuracy, this is a bad model.\nModel 2: The second model classifies every email as spam. The accuracy is low (\\(\\frac{5}{100} = 0.05\\)), but the model correctly labels every spam email. 
Unfortunately, it also misclassifies every non-spam email.\n\nAs this example illustrates, accuracy is not always a good metric for classification, particularly when your data exhibits class imbalance (e.g., very few 1’s compared to 0’s).\n\n23.3.1 Types of Classification\nThere are 4 different classifications that our model might make:\n\nTrue positive: correctly classify a positive point as being positive (\\(y=1\\) and \\(\\hat{y}=1\\))\nTrue negative: correctly classify a negative point as being negative (\\(y=0\\) and \\(\\hat{y}=0\\))\nFalse positive: incorrectly classify a negative point as being positive (\\(y=0\\) and \\(\\hat{y}=1\\))\nFalse negative: incorrectly classify a positive point as being negative (\\(y=1\\) and \\(\\hat{y}=0\\))\n\nThese classifications can be concisely summarized in a confusion matrix.\n\n\n\nAn easy way to remember this terminology is as follows:\n\nLook at the second word in the phrase. Positive means a prediction of 1. Negative means a prediction of 0.\nLook at the first word in the phrase. True means our prediction was correct. False means it was incorrect.\n\nWe can now write the accuracy calculation as \\[\\text{accuracy} = \\frac{TP + TN}{n}\\]\nIn sklearn, we use the following syntax:\nfrom sklearn.metrics import confusion_matrix\ncm = confusion_matrix(Y_true, Y_pred)\n\n\n\n\n\n23.3.2 Accuracy, Precision, and Recall\nThe purpose of our discussion of the confusion matrix was to motivate better performance metrics for classification problems with class imbalance - namely, precision and recall.\nPrecision is defined as\n\\[\\text{precision} = \\frac{\\text{TP}}{\\text{TP + FP}}\\]\nPrecision answers the question: “Of all observations that were predicted to be \\(1\\), what proportion was actually \\(1\\)?” It measures how accurate the classifier is when its predictions are positive.\nRecall (or sensitivity) is defined as\n\\[\\text{recall} = \\frac{\\text{TP}}{\\text{TP + FN}}\\]\nRecall aims to answer: “Of all observations that were actually \\(1\\), what proportion was predicted to be \\(1\\)?” It measures how well the classifier finds all of the actual positive cases – a low recall means many positives were missed.\nHere’s a helpful graphic that summarizes our discussion above.\n\n\n\n\n\n23.3.3 Example Calculation\nIn this section, we will calculate the accuracy, precision, and recall performance metrics for our earlier spam classification example. As a reminder, we had 100 emails, 5 of which were spam. We designed two models:\n\nModel 1: Predict that every email is non-spam\nModel 2: Predict that every email is spam\n\n\n23.3.3.1 Model 1\nFirst, let’s begin by creating the confusion matrix.\n\n\n\n\n\n\n\n\n\n0\n1\n\n\n\n\n0\nTrue Negative: 95\nFalse Positive: 0\n\n\n1\nFalse Negative: 5\nTrue Positive: 0\n\n\n\nConvince yourself of why our confusion matrix looks the way it does.\n\\[\\text{accuracy} = \\frac{95}{100} = 0.95\\] \\[\\text{precision} = \\frac{0}{0 + 0} = \\text{undefined}\\] \\[\\text{recall} = \\frac{0}{0 + 5} = 0\\]\nNotice how our precision is undefined because we never predicted class \\(1\\). 
Our recall is 0 for the same reason – the numerator is 0 (we had no positive predictions).\n\n\n23.3.3.2 Model 2\nOur confusion matrix for Model 2 looks as follows.\n\n\n\n\n\n\n\n\n\n0\n1\n\n\n\n\n0\nTrue Negative: 0\nFalse Positive: 95\n\n\n1\nFalse Negative: 0\nTrue Positive: 5\n\n\n\n\\[\\text{accuracy} = \\frac{5}{100} = 0.05\\] \\[\\text{precision} = \\frac{5}{5 + 95} = 0.05\\] \\[\\text{recall} = \\frac{5}{5 + 0} = 1\\]\nOur precision is low because we have many false positives, and our recall is perfect - we correctly classified all spam emails (we never predicted class \\(0\\)).\n\n\n\n23.3.4 Precision vs. Recall\nPrecision (\\(\\frac{\\text{TP}}{\\text{TP} + \\textbf{ FP}}\\)) penalizes false positives, while recall (\\(\\frac{\\text{TP}}{\\text{TP} + \\textbf{ FN}}\\)) penalizes false negatives.\nIn practice, precision and recall often trade off against one another. This is evident in our second model – we observed a high recall and low precision. Usually, improving one of these metrics comes at the expense of the other (most models can minimize either the number of FP or the number of FN; only rarely both).\nThe specific performance metric(s) to prioritize depends on the context. In many medical settings, there might be a much higher cost to missing positive cases. For instance, in our breast cancer example, it is more costly to misclassify malignant tumors (false negatives) than it is to incorrectly classify a benign tumor as malignant (false positives). In the case of the latter, pathologists can conduct further studies to verify malignant tumors. As such, we should minimize the number of false negatives. This is equivalent to maximizing recall.\n\n\n23.3.5 Two More Metrics\nThe True Positive Rate (TPR) is defined as\n\\[\\text{true positive rate} = \\frac{\\text{TP}}{\\text{TP + FN}}\\]\nYou’ll notice this is equivalent to recall. In the context of our spam email classifier, it answers the question: “What proportion of spam did I mark correctly?”. We’d like this to be close to \\(1\\).\nThe False Positive Rate (FPR) is defined as\n\\[\\text{false positive rate} = \\frac{\\text{FP}}{\\text{FP + TN}}\\]\nFPR is sometimes called the fall-out; note that it equals \\(1 - \\text{specificity}\\), not specificity itself. It answers the question: “What proportion of regular email did I mark as spam?”. We’d like this to be close to \\(0\\).\nAs we increase threshold \\(T\\), both TPR and FPR decrease. We’ve plotted this relationship below for some model on a toy dataset." + }, + { + "objectID": "logistic_regression_2/logistic_reg_2.html#adjusting-the-classification-threshold", + "href": "logistic_regression_2/logistic_reg_2.html#adjusting-the-classification-threshold", + "title": "23  Logistic Regression II", + "section": "23.4 Adjusting the Classification Threshold", + "text": "23.4 Adjusting the Classification Threshold\nOne way to trade off the number of FP against the number of FN (equivalently, precision against recall) is by adjusting the classification threshold \\(T\\).\n\\[\\hat y = \\begin{cases}\n 1, & P(Y=1|x) \\ge T\\\\\n 0, & \\text{otherwise }\n \\end{cases}\\]\nThe default threshold in sklearn is \\(T = 0.5\\). 
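If we want a different threshold, a minimal sketch (assuming a fitted model and a feature matrix X; these names are not from the original example) is to threshold the predicted probabilities ourselves rather than calling .predict:
import numpy as np

T = 0.7                                # a stricter threshold than the 0.5 default
p_hat = model.predict_proba(X)[:, 1]   # column 1 holds P(Y = 1 | x) for each row of X
y_hat = (p_hat >= T).astype(int)       # predict Class 1 only when the probability clears T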
As we increase the threshold \\(T\\), we “raise the standard” of how confident our classifier needs to be to predict 1 (i.e., “positive”).\n\n\n\nAs you may notice, the choice of threshold \\(T\\) impacts our classifier’s performance.\n\nHigh \\(T\\): Most predictions are \\(0\\).\n\nLots of false negatives\nFewer false positives\n\nLow \\(T\\): Most predictions are \\(1\\).\n\nLots of false positives\nFewer false negatives\n\n\nIn fact, we can choose a threshold \\(T\\) based on our desired number, or proportion, of false positives and false negatives. We can do so using a few different tools. We’ll touch on two of the most important ones in Data 100.\n\nPrecision-Recall Curve (PR Curve)\n“Receiver Operating Characteristic” Curve (ROC Curve)\n\n\n23.4.1 Precision-Recall Curves\nA Precision-Recall Curve (PR Curve) is an alternative to the ROC curve (covered in the next section) that displays the relationship between precision and recall for various threshold values. It is constructed in a similar way to the ROC curve.\nLet’s first consider how precision and recall change as a function of the threshold \\(T\\). We know this quite well from earlier – as \\(T\\) increases, precision will generally increase, and recall will decrease.\n\n\n\nDisplayed below is the PR Curve for the same toy dataset. Notice how threshold values increase as we move to the left.\n\n\n\nOnce again, the perfect classifier will resemble the orange curve, this time, facing the opposite direction.\n\n\n\nWe want our PR curve to be as close to the “top right” of this graph as possible. Again, we use the AUC to determine “closeness”, with the perfect classifier exhibiting an AUC = 1 (and weaker classifiers having lower AUCs).\n\n\n23.4.2 The ROC Curve\nThe “Receiver Operating Characteristic” Curve (ROC Curve) plots the tradeoff between FPR and TPR. Notice how the far-left of the curve corresponds to higher threshold \\(T\\) values.\n\n\n\nThe “perfect” classifier is the one that has a TPR of 1 and an FPR of 0. This is achieved at the top-left of the plot below. More generally, its ROC curve resembles the curve in orange.\n\n\n\nWe want our model to be as close to this orange curve as possible. How do we quantify “closeness”?\nWe can compute the area under the curve (AUC) of the ROC curve. Notice how the perfect classifier has an AUC = 1. The closer our model’s AUC is to 1, the better it is.\n\n23.4.2.1 [Extra] What is the “worst” AUC, and why is it 0.5?\nOn the other hand, a terrible model will have an AUC closer to 0.5. A random predictor assigns \\(P(Y = 1 | x)\\) uniformly at random between 0 and 1, so at any threshold its TPR and FPR are equal; its ROC curve is the diagonal line, giving an AUC of 0.5. Such a classifier is not able to distinguish between positive and negative classes, and thus, randomly predicts one of the two."
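As a brief sketch of how these curves can be computed in practice (assuming a fitted model and a held-out X_test, y_test, which are not part of the original example), sklearn.metrics provides the relevant helpers:
from sklearn.metrics import roc_curve, precision_recall_curve, roc_auc_score

p_hat = model.predict_proba(X_test)[:, 1]             # predicted P(Y = 1 | x)

fpr, tpr, roc_thresholds = roc_curve(y_test, p_hat)   # one (FPR, TPR) point per threshold
precision, recall, pr_thresholds = precision_recall_curve(y_test, p_hat)
print(roc_auc_score(y_test, p_hat))                   # area under the ROC curve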
+ }, + { + "objectID": "logistic_regression_2/logistic_reg_2.html#extra-gradient-descent-for-logistic-regression", + "href": "logistic_regression_2/logistic_reg_2.html#extra-gradient-descent-for-logistic-regression", + "title": "23  Logistic Regression II", + "section": "23.5 [Extra] Gradient Descent for Logistic Regression", + "text": "23.5 [Extra] Gradient Descent for Logistic Regression\nLet’s define the following: \\[\nt_i = \\phi(x_i)^T \\theta \\\\\np_i = \\sigma(t_i) \\\\\nt_i = \\log(\\frac{p_i}{1 - p_i}) \\\\\n1 - \\sigma(t_i) = \\sigma(-t_i) \\\\\n\\frac{d}{dt} \\sigma(t) = \\sigma(t) \\sigma(-t)\n\\]\nNow, we can simplify the cross-entropy loss \\[\n\\begin{align}\ny_i \\log(p_i) + (1 - y_i) \\log(1 - p_i) &= y_i \\log(\\frac{p_i}{1 - p_i}) + \\log(1 - p_i) \\\\\n&= y_i \\phi(x_i)^T \\theta + \\log(\\sigma(-\\phi(x_i)^T \\theta))\n\\end{align}\n\\]\nHence, the optimal \\(\\hat{\\theta}\\) is \\[\\text{argmin}_{\\theta} - \\frac{1}{n} \\sum_{i=1}^n (y_i \\phi(x_i)^T \\theta + \\log(\\sigma(-\\phi(x_i)^T \\theta)))\\]\nWe want to minimize \\[L(\\theta) = - \\frac{1}{n} \\sum_{i=1}^n (y_i \\phi(x_i)^T \\theta + \\log(\\sigma(-\\phi(x_i)^T \\theta)))\\]\nSo we take the derivative \\[\n\\begin{align}\n\\triangledown_{\\theta} L(\\theta) &= - \\frac{1}{n} \\sum_{i=1}^n \\left(\\triangledown_{\\theta} \\left(y_i \\phi(x_i)^T \\theta\\right) + \\triangledown_{\\theta} \\log(\\sigma(-\\phi(x_i)^T \\theta))\\right) \\\\\n&= - \\frac{1}{n} \\sum_{i=1}^n \\left(y_i \\phi(x_i) + \\frac{1}{\\sigma(-\\phi(x_i)^T \\theta)} \\triangledown_{\\theta} \\sigma(-\\phi(x_i)^T \\theta)\\right) \\\\\n&= - \\frac{1}{n} \\sum_{i=1}^n \\left(y_i \\phi(x_i) + \\frac{1}{\\sigma(-\\phi(x_i)^T \\theta)} \\sigma(-\\phi(x_i)^T \\theta) \\sigma(\\phi(x_i)^T \\theta) (-\\phi(x_i))\\right) \\\\\n&= - \\frac{1}{n} \\sum_{i=1}^n \\left(y_i \\phi(x_i) - \\sigma(\\phi(x_i)^T \\theta) \\phi(x_i)\\right) \\\\\n&= - \\frac{1}{n} \\sum_{i=1}^n \\left(y_i - \\sigma(\\phi(x_i)^T \\theta)\\right) \\phi(x_i)\n\\end{align}\n\\]\nSetting the derivative equal to 0 and solving for \\(\\hat{\\theta}\\), we find that there’s no general analytic solution. Therefore, we must solve using numerical methods.\n\n23.5.1 Gradient Descent Update Rule\n\\[\\theta^{(0)} \\leftarrow \\text{initial vector (random, zeros, ...)} \\]\nFor \\(\\tau\\) from 0 to convergence: \\[ \\theta^{(\\tau + 1)} \\leftarrow \\theta^{(\\tau)} - \\rho(\\tau)\\left( \\frac{1}{n} \\sum_{i=1}^n \\triangledown_{\\theta} L_i(\\theta) \\mid_{\\theta = \\theta^{(\\tau)}}\\right) \\]\n\n\n23.5.2 Stochastic Gradient Descent Update Rule\n\\[\\theta^{(0)} \\leftarrow \\text{initial vector (random, zeros, ...)} \\]\nFor \\(\\tau\\) from 0 to convergence, let \\(B\\) ~ \\(\\text{Random subset of indices}\\). \\[ \\theta^{(\\tau + 1)} \\leftarrow \\theta^{(\\tau)} - \\rho(\\tau)\\left( \\frac{1}{|B|} \\sum_{i \\in B} \\triangledown_{\\theta} L_i(\\theta) \\mid_{\\theta = \\theta^{(\\tau)}}\\right) \\]" } ] \ No newline at end of file
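To tie the update rule to code, here is a compact numpy sketch (not part of the original notes) of the batch gradient descent update derived above; Phi is assumed to be an n x d design matrix and y a vector of 0/1 labels, with a fixed learning rate rho for simplicity.
import numpy as np

def sigma(t):
    return 1 / (1 + np.exp(-t))

def gradient(theta, Phi, y):
    # Gradient of the average cross-entropy loss derived above:
    # -(1/n) * sum_i (y_i - sigma(phi_i^T theta)) * phi_i
    return -Phi.T @ (y - sigma(Phi @ theta)) / len(y)

def gradient_descent(Phi, y, rho=0.1, num_steps=1000):
    theta = np.zeros(Phi.shape[1])                       # initial vector of zeros
    for _ in range(num_steps):
        theta = theta - rho * gradient(theta, Phi, y)    # step opposite the gradient
    return theta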