diff --git a/constant_model_loss_transformations/loss_transformations.qmd b/constant_model_loss_transformations/loss_transformations.qmd
index b9170f76..ac921167 100644
--- a/constant_model_loss_transformations/loss_transformations.qmd
+++ b/constant_model_loss_transformations/loss_transformations.qmd
@@ -21,7 +21,7 @@ jupyter:
     format_version: '1.0'
     jupytext_version: 1.16.1
   kernelspec:
-    display_name: Python 3 (ipykernel)
+    display_name: ds100env
     language: python
     name: python3
 ---
@@ -116,7 +116,7 @@ $$
 \begin{align}
 0 &= {\frac{-2}{n}}\sum^{n}_{i=1} (y_i - \hat{\theta_0})
 \\ &= \sum^{n}_{i=1} (y_i - \hat{\theta_0}) \quad \quad \text{divide both sides by} \frac{-2}{n}
-\\ &= \left(\sum^{n}_{i=1} y_i\right) - \left(\sum^{n}_{i=1} \theta_0\right) \quad \quad \text{separate sums}
+\\ &= \left(\sum^{n}_{i=1} y_i\right) - \left(\sum^{n}_{i=1} \hat{\theta_0}\right) \quad \quad \text{separate sums}
 \\ &= \left(\sum^{n}_{i=1} y_i\right) - (n \cdot \hat{\theta_0}) \quad \quad \text{c + c + … + c = nc}
 \\ n \cdot \hat{\theta_0} &= \sum^{n}_{i=1} y_i
 \\ \hat{\theta_0} &= \frac{1}{n} \sum^{n}_{i=1} y_i
@@ -159,7 +159,6 @@ The code for generating the graphs and models is included below, but we won't go
 
 ```{python}
 #| code-fold: true
-#| vscode: {languageId: python}
 import numpy as np
 import pandas as pd
 import matplotlib.pyplot as plt
@@ -174,7 +173,6 @@ data_linear = dugongs[["Length", "Age"]]
 
 ```{python}
 #| code-fold: true
-#| vscode: {languageId: python}
 # Big font helper
 def adjust_fontsize(size=None):
     SMALL_SIZE = 8
@@ -196,7 +194,6 @@ plt.style.use("default") # Revert style to default mpl
 
 ```{python}
 #| code-fold: true
-#| vscode: {languageId: python}
 # Constant Model + MSE
 plt.style.use('default') # Revert style to default mpl
 adjust_fontsize(size=16)
@@ -222,7 +219,6 @@ plt.legend();
 
 ```{python}
 #| code-fold: true
-#| vscode: {languageId: python}
 # SLR + MSE
 def mse_linear(theta_0, theta_1, data_linear):
     data_x, data_y = data_linear.iloc[:, 0], data_linear.iloc[:, 1]
@@ -278,7 +274,6 @@ ax.set_zlabel("MSE");
 
 ```{python}
 #| code-fold: true
-#| vscode: {languageId: python}
 # Predictions
 yobs = data_linear["Age"]  # The true observations y
 xs = data_linear["Length"]  # Needed for linear predictions
@@ -290,7 +285,6 @@ yhats_linear = [theta_0_hat + theta_1_hat * x for x in xs]
 
 ```{python}
 #| code-fold: true
-#| vscode: {languageId: python}
 # Constant Model Rug Plot
 # In case we're in a weird style state
 sns.set_theme()
@@ -307,7 +301,6 @@ plt.yticks([]);
 
 ```{python}
 #| code-fold: true
-#| vscode: {languageId: python}
 # SLR model scatter plot
 # In case we're in a weird style state
 sns.set_theme()
@@ -421,7 +414,6 @@ Let's consider a dataset where each entry represents the number of drinks sold a
 
 ```{python}
 #| code-fold: false
-#| vscode: {languageId: python}
 drinks = np.array([20, 21, 22, 29, 33])
 drinks
 ```
@@ -430,7 +422,6 @@ From our derivations above, we know that the optimal model parameter under MSE c
 
 ```{python}
 #| code-fold: false
-#| vscode: {languageId: python}
 np.mean(drinks), np.median(drinks)
 ```
@@ -444,7 +435,6 @@ How do outliers affect each cost function? Imagine we replace the largest value
 
 ```{python}
 #| code-fold: false
-#| vscode: {languageId: python}
 drinks_with_outlier = np.append(drinks, 1033)
 display(drinks_with_outlier)
 np.mean(drinks_with_outlier), np.median(drinks_with_outlier)
 ```
@@ -458,7 +448,6 @@ Let's try another experiment.
 This time, we'll add an additional, non-outlying datapoint to the data.
 
 ```{python}
 #| code-fold: false
-#| vscode: {languageId: python}
 drinks_with_additional_observation = np.append(drinks, 35)
 drinks_with_additional_observation
 ```
@@ -502,7 +491,6 @@ Let's revisit our dugongs example. The lengths and ages are plotted below:
 
 ```{python}
 #| code-fold: true
-#| vscode: {languageId: python}
 # `corrcoef` computes the correlation coefficient between two variables
 # `std` finds the standard deviation
 x = dugongs["Length"]
@@ -530,7 +518,6 @@ An important word on $\log$: in Data 100 (and most upper-division STEM courses), $\log$ denotes the natural logarithm with base $e$. The base-10 logarithm, where relevant, is indicated by $\log_{10}$.
 
 ```{python}
 #| code-fold: true
-#| vscode: {languageId: python}
 z = np.log(y)
 
 r = np.corrcoef(x, z)[0, 1]
@@ -568,7 +555,6 @@ $y$ is an *exponential* function of $x$. Applying an exponential fit to the untr
 
 ```{python}
 #| code-fold: true
-#| vscode: {languageId: python}
 plt.figure(dpi=120, figsize=(4, 3))
 
 plt.scatter(x, y)
diff --git a/docs/constant_model_loss_transformations/loss_transformations.html b/docs/constant_model_loss_transformations/loss_transformations.html
index 18bb4607..6e00683b 100644
--- a/docs/constant_model_loss_transformations/loss_transformations.html
+++ b/docs/constant_model_loss_transformations/loss_transformations.html
@@ -1620,484 +1620,470 @@
```{python}
#| code-fold: true
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import itertools
from mpl_toolkits.mplot3d import Axes3D

dugongs = pd.read_csv("data/dugongs.csv")
data_constant = dugongs["Age"]
data_linear = dugongs[["Length", "Age"]]
```

```{python}
#| code-fold: true
# Big font helper
def adjust_fontsize(size=None):
    SMALL_SIZE = 8
    MEDIUM_SIZE = 10
    BIGGER_SIZE = 12
    if size != None:
        SMALL_SIZE = MEDIUM_SIZE = BIGGER_SIZE = size

    plt.rc("font", size=SMALL_SIZE)  # controls default text sizes
    plt.rc("axes", titlesize=SMALL_SIZE)  # fontsize of the axes title
    plt.rc("axes", labelsize=MEDIUM_SIZE)  # fontsize of the x and y labels
    plt.rc("xtick", labelsize=SMALL_SIZE)  # fontsize of the tick labels
    plt.rc("ytick", labelsize=SMALL_SIZE)  # fontsize of the tick labels
    plt.rc("legend", fontsize=SMALL_SIZE)  # legend fontsize
    plt.rc("figure", titlesize=BIGGER_SIZE)  # fontsize of the figure title


plt.style.use("default")  # Revert style to default mpl
```

```{python}
#| code-fold: true
# Constant Model + MSE
plt.style.use('default')  # Revert style to default mpl
adjust_fontsize(size=16)
%matplotlib inline

def mse_constant(theta, data):
    return np.mean(np.array([(y_obs - theta) ** 2 for y_obs in data]), axis=0)

thetas = np.linspace(-20, 42, 1000)
l2_loss_thetas = mse_constant(thetas, data_constant)

# Plotting the loss surface
plt.plot(thetas, l2_loss_thetas)
plt.xlabel(r'$\theta_0$')
plt.ylabel(r'MSE')

# Optimal point
thetahat = np.mean(data_constant)
plt.scatter([thetahat], [mse_constant(thetahat, data_constant)], s=50, label=r"$\hat{\theta}_0$")
plt.legend();
# plt.show()
```

```{python}
#| code-fold: true
# SLR + MSE
def mse_linear(theta_0, theta_1, data_linear):
    data_x, data_y = data_linear.iloc[:, 0], data_linear.iloc[:, 1]
    return np.mean(
        np.array([(y - (theta_0 + theta_1 * x)) ** 2 for x, y in zip(data_x, data_y)]),
        axis=0,
    )


# plotting the loss surface
theta_0_values = np.linspace(-80, 20, 80)
theta_1_values = np.linspace(-10, 30, 80)
mse_values = np.array(
    [[mse_linear(x, y, data_linear) for x in theta_0_values] for y in theta_1_values]
)

# Optimal point
data_x, data_y = data_linear.iloc[:, 0], data_linear.iloc[:, 1]
theta_1_hat = np.corrcoef(data_x, data_y)[0, 1] * np.std(data_y) / np.std(data_x)
theta_0_hat = np.mean(data_y) - theta_1_hat * np.mean(data_x)

# Create the 3D plot
fig = plt.figure(figsize=(7, 5))
ax = fig.add_subplot(111, projection="3d")

X, Y = np.meshgrid(theta_0_values, theta_1_values)
surf = ax.plot_surface(
    X, Y, mse_values, cmap="viridis", alpha=0.6
)  # Use alpha to make it slightly transparent

# Scatter point using matplotlib
sc = ax.scatter(
    [theta_0_hat],
    [theta_1_hat],
    [mse_linear(theta_0_hat, theta_1_hat, data_linear)],
    marker="o",
    color="red",
    s=100,
    label="theta hat",
)

# Create a colorbar
cbar = fig.colorbar(surf, ax=ax, shrink=0.5, aspect=10)
cbar.set_label("Cost Value")

ax.set_title("MSE for different $\\theta_0, \\theta_1$")
ax.set_xlabel("$\\theta_0$")
ax.set_ylabel("$\\theta_1$")
ax.set_zlabel("MSE");

# plt.show()
```

```{python}
#| code-fold: true
# Predictions
yobs = data_linear["Age"]  # The true observations y
xs = data_linear["Length"]  # Needed for linear predictions
n = len(yobs)  # Predictions

yhats_constant = [thetahat for i in range(n)]  # Not used, but food for thought
yhats_linear = [theta_0_hat + theta_1_hat * x for x in xs]
```

```{python}
#| code-fold: true
# Constant Model Rug Plot
# In case we're in a weird style state
sns.set_theme()
adjust_fontsize(size=16)
%matplotlib inline

fig = plt.figure(figsize=(8, 1.5))
sns.rugplot(yobs, height=0.25, lw=2);
plt.axvline(thetahat, color='red', lw=4, label=r"$\hat{\theta}_0$");
plt.legend();
plt.yticks([]);
# plt.show()
```

```{python}
#| code-fold: true
# SLR model scatter plot
# In case we're in a weird style state
sns.set_theme()
adjust_fontsize(size=16)
%matplotlib inline

sns.scatterplot(x=xs, y=yobs)
plt.plot(xs, yhats_linear, color='red', lw=4);
# plt.savefig('dugong_line.png', bbox_inches = 'tight');
# plt.show()
```

Interpreting the RMSE (Root Mean Squared Error):

- Because the constant error is **HIGHER** than the linear error,
- The constant model is **WORSE** than the linear model (at least for this metric).
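To put numbers on those bullets, here is a minimal illustrative sketch of the comparison (not the original notebook cell), using the `yobs`, `yhats_constant`, and `yhats_linear` variables defined in the cells above:

```{python}
#| code-fold: false
# Sketch: RMSE of the constant and SLR predictions computed above
rmse_constant = np.sqrt(np.mean((yobs - np.array(yhats_constant)) ** 2))
rmse_linear = np.sqrt(np.mean((yobs - np.array(yhats_linear)) ** 2))
rmse_constant, rmse_linear
```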
## Constant Model + MAE

We see now that changing the model used for prediction leads to a wildly different result for the optimal model parameter. What happens if we instead change the loss function used in model evaluation?

This time, we will consider the constant model with L1 (absolute loss) as the loss function. This means that the average loss will be expressed as the **Mean Absolute Error (MAE)**.

1. Choose a model: constant model
2. Choose a loss function: L1 loss
3. Fit the model
4. Evaluate model performance

### Deriving the optimal $\theta_0$
Recall that the MAE is average **absolute** loss (L1 loss) over the data $D = \{y_1, y_2, ..., y_n\}$.

$$\hat{R}(\theta_0) = \frac{1}{n}\sum^{n}_{i=1} |y_i - \hat{y_i}| $$

Given the constant model $\hat{y} = \theta_0$, we can write the MAE as:

$$\hat{R}(\theta_0) = \frac{1}{n}\sum^{n}_{i=1} |y_i - \theta_0| $$

To fit the model, we find the optimal parameter value $\hat{\theta_0}$ that minimizes the MAE by differentiating using a calculus approach:

1. Differentiate with respect to $\hat{\theta_0}$:

$$
\begin{align}
\hat{R}(\theta_0) &= \frac{1}{n}\sum^{n}_{i=1} |y_i - \theta_0| \\
\frac{d}{d\theta_0} R(\theta_0) &= \frac{d}{d\theta_0} \left(\frac{1}{n} \sum^{n}_{i=1} |y_i - \theta_0| \right) \\
&= \frac{1}{n} \sum^{n}_{i=1} \frac{d}{d\theta_0} |y_i - \theta_0|
\end{align}
$$

- Here, we seem to have run into a problem: the derivative of an absolute value is undefined when the argument is 0 (i.e. when $y_i = \theta_0$). For now, we'll ignore this issue. It turns out that disregarding this case doesn't influence our final result.
- To perform the derivative, consider two cases. When $\theta_0$ is *less than or equal to* $y_i$, the term $y_i - \theta_0$ will be positive and the absolute value has no impact. When $\theta_0$ is *greater than* $y_i$, the term $y_i - \theta_0$ will be negative. Applying the absolute value will convert this to a positive value, which we can express by saying $-(y_i - \theta_0) = \theta_0 - y_i$.

$$|y_i - \theta_0| = \begin{cases} y_i - \theta_0 \quad \text{ if } \theta_0 \le y_i \\ \theta_0 - y_i \quad \text{if }\theta_0 > y_i \end{cases}$$

- Taking derivatives:

$$\frac{d}{d\theta_0} |y_i - \theta_0| = \begin{cases} \frac{d}{d\theta_0} (y_i - \theta_0) = -1 \quad \text{if }\theta_0 < y_i \\ \frac{d}{d\theta_0} (\theta_0 - y_i) = 1 \quad \text{if }\theta_0 > y_i \end{cases}$$

- This means that we obtain a different value for the derivative for data points where $\theta_0 < y_i$ and where $\theta_0 > y_i$. We can summarize this by saying:

$$
\frac{d}{d\theta_0} R(\theta_0) = \frac{1}{n} \sum^{n}_{i=1} \frac{d}{d\theta_0} |y_i - \theta_0| = \frac{1}{n} \left[\sum_{\theta_0 < y_i} (-1) + \sum_{\theta_0 > y_i} (+1) \right]
$$

- In other words, we take the sum of values for $i = 1, 2, ..., n$:
    - $-1$ if our observation $y_i$ is *greater than* our prediction $\hat{\theta_0}$
    - $+1$ if our observation $y_i$ is *smaller than* our prediction $\hat{\theta_0}$

2. Set the derivative equation equal to 0:

$$ 0 = \frac{1}{n}\sum_{\hat{\theta_0} < y_i} (-1) + \frac{1}{n}\sum_{\hat{\theta_0} > y_i} (+1) $$

3. Solve for $\hat{\theta_0}$:

$$ 0 = -\frac{1}{n}\sum_{\hat{\theta_0} < y_i} (1) + \frac{1}{n}\sum_{\hat{\theta_0} > y_i} (1)$$

$$\sum_{\hat{\theta_0} < y_i} (1) = \sum_{\hat{\theta_0} > y_i} (1) $$

Thus, the constant model parameter $\theta = \hat{\theta_0}$ that minimizes MAE must satisfy:

$$ \sum_{\hat{\theta_0} < y_i} (1) = \sum_{\hat{\theta_0} > y_i} (1) $$

In other words, the number of observations greater than $\theta_0$ must be equal to the number of observations less than $\theta_0$; there must be an equal number of points on the left and right sides of the equation. This is the definition of median, so our optimal value is

$$ \hat{\theta}_0 = median(y) $$
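A quick numerical check of this result, as a sketch on a small made-up sample: sweeping $\theta_0$ over a grid, the smallest MAE occurs at the sample median.

```{python}
#| code-fold: false
# Sketch: on a small made-up sample, the MAE-minimizing constant is the median
y_sample = np.array([20, 21, 22, 29, 33])
theta_grid = np.linspace(15, 40, 501)
mae = np.array([np.mean(np.abs(y_sample - t)) for t in theta_grid])
theta_grid[np.argmin(mae)], np.median(y_sample)   # both are (approximately) 22
```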
## Summary: Loss Optimization, Calculus, and Critical Points

First, define the **objective function** as average loss.

- Plug in L1 or L2 loss.
- Plug in the model so that the resulting expression is a function of $\theta$.

Then, find the minimum of the objective function:

1. Differentiate with respect to $\theta$.
2. Set equal to 0.
3. Solve for $\hat{\theta}$.
4. (If we have multiple parameters) repeat steps 1-3 with partial derivatives.

Recall critical points from calculus: $R(\hat{\theta})$ could be a minimum, maximum, or saddle point!

- We should technically also perform the second derivative test, i.e., show $R''(\hat{\theta}) > 0$.
- MSE has a property—**convexity**—that guarantees that $R(\hat{\theta})$ is a global minimum.
- The proof of convexity for MAE is beyond this course.
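As an illustration of this recipe, the sketch below runs steps 1-3 (plus the second derivative test) symbolically for the constant model with MSE on a small made-up sample. It assumes the `sympy` package, which is not used elsewhere in these notes.

```{python}
#| code-fold: false
# Sketch: the calculus recipe, done symbolically for the constant model + MSE
import sympy as sp

y_sample = [20, 21, 22, 29, 33]                       # small made-up sample
theta = sp.symbols("theta")
R = sp.Rational(1, len(y_sample)) * sum((y_i - theta) ** 2 for y_i in y_sample)

critical_points = sp.solve(sp.diff(R, theta), theta)  # [25], the sample mean
second_derivative = sp.diff(R, theta, 2)              # 2, which is > 0, so a minimum
critical_points, second_derivative
```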
## Comparing Loss Functions

We've now tried our hand at fitting a model under both MSE and MAE cost functions. How do the two results compare?

Let's consider a dataset where each entry represents the number of drinks sold at a bubble tea store each day. We'll fit a constant model to predict the number of drinks that will be sold tomorrow.

```{python}
#| code-fold: false
drinks = np.array([20, 21, 22, 29, 33])
drinks
```

From our derivations above, we know that the optimal model parameter under MSE cost is the mean of the dataset. Under MAE cost, the optimal parameter is the median of the dataset.

```{python}
#| code-fold: false
np.mean(drinks), np.median(drinks)
```

If we plot each empirical risk function across several possible values of $\theta$, we find that each $\hat{\theta}$ does indeed correspond to the lowest value of error:

<img src="images/error.png" alt='error' width='600'>

Notice that the MSE above is a **smooth** function – it is differentiable at all points, making it easy to minimize using numerical methods. The MAE, in contrast, is not differentiable at each of its "kinks." We'll explore how the smoothness of the cost function can impact our ability to apply numerical optimization in a few weeks.

How do outliers affect each cost function? Imagine we replace the largest value in the dataset with 1000. The mean of the data increases substantially, while the median is nearly unaffected.

```{python}
#| code-fold: false
drinks_with_outlier = np.append(drinks, 1033)
display(drinks_with_outlier)
np.mean(drinks_with_outlier), np.median(drinks_with_outlier)
```

<img src="images/outliers.png" alt='outliers' width='700'>

This means that under the MSE, the optimal model parameter $\hat{\theta}$ is strongly affected by the presence of outliers. Under the MAE, the optimal parameter is not as influenced by outlying data. We can generalize this by saying that the MSE is **sensitive** to outliers, while the MAE is **robust** to outliers.
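A compact way to see this sensitivity, as a sketch: compare how far each optimal constant moves when the outlying value is added.

```{python}
#| code-fold: false
# Sketch: shift in each optimal constant when the outlying value is added
shift_mse_optimal = np.mean(drinks_with_outlier) - np.mean(drinks)      # large shift
shift_mae_optimal = np.median(drinks_with_outlier) - np.median(drinks)  # small shift
shift_mse_optimal, shift_mae_optimal
```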
Let's try another experiment. This time, we'll add an additional, non-outlying datapoint to the data.

```{python}
#| code-fold: false
drinks_with_additional_observation = np.append(drinks, 35)
drinks_with_additional_observation
```

When we again visualize the cost functions, we find that the MAE now plots a horizontal line between 22 and 29. This means that there are _infinitely_ many optimal values for the model parameter: any value $\hat{\theta} \in [22, 29]$ will minimize the MAE. In contrast, the MSE still has a single best value for $\hat{\theta}$. In other words, the MSE has a **unique** solution for $\hat{\theta}$; the MAE is not guaranteed to have a single unique solution.

<img src="images/mse_loss_26.png" width='350'> <img src="images/mae_loss_infinite.png" width='350'>

To summarize our example,

| | MSE (Mean Squared Loss) | MAE (Mean Absolute Loss) |
| ---------------------- | ---------------------------------------------------------------- | ------------------------------------------------------------------------------- |
| Loss Function | $\hat{R}(\theta) = \frac{1}{n}\sum^{n}_{i=1} (y_i - \theta_0)^2$ | $\hat{R}(\theta) = \frac{1}{n}\sum^{n}_{i=1} |y_i - \theta_0|$ |
| Optimal $\hat{\theta_0}$ | $\hat{\theta_0} = mean(y) = \bar{y}$ | $\hat{\theta_0} = median(y)$ |
| Loss Surface | <img src="images/mse_loss_26.png" width='250'> | <img src="images/mae_loss_infinite.png" width='250'> |
| Shape | **Smooth** - easy to minimize using numerical methods (in a few weeks) | **Piecewise** - at each of the "kinks," it's not differentiable. Harder to minimize. |
| Outliers | **Sensitive** to outliers (since they change mean substantially). Sensitivity also depends on the dataset size. | **More robust** to outliers. |
| $\hat{\theta_0}$ Uniqueness | **Unique** $\hat{\theta_0}$ | **Infinitely many** $\hat{\theta_0}$s |
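A short sketch that checks these claims numerically on the six-point dataset above: the MAE is minimized on the whole interval $[22, 29]$, while the MSE keeps a single minimizer near the mean.

```{python}
#| code-fold: false
# Sketch: evaluate both cost functions on a grid of candidate thetas
theta_grid = np.linspace(18, 36, 181)
data = drinks_with_additional_observation
mse = np.array([np.mean((data - t) ** 2) for t in theta_grid])
mae = np.array([np.mean(np.abs(data - t)) for t in theta_grid])

flat = theta_grid[np.isclose(mae, mae.min())]        # grid thetas achieving the minimal MAE
flat.min(), flat.max(), theta_grid[np.argmin(mse)]   # ~22, ~29, and approximately mean(data)
```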
## Transformations to Fit Linear Models

At this point, we have an effective method of fitting models to predict linear relationships. Given a feature variable and target, we can apply our four-step process to find the optimal model parameters.

A key word above is _linear_. When we computed parameter estimates earlier, we assumed that $x_i$ and $y_i$ shared a roughly linear relationship. Data in the real world isn't always so straightforward, but we can transform the data to try and obtain linearity.

The **Tukey-Mosteller Bulge Diagram** is a useful tool for summarizing what transformations can linearize the relationship between two variables. To determine what transformations might be appropriate, trace the shape of the "bulge" made by your data. Find the quadrant of the diagram that matches this bulge. The transformations shown on the vertical and horizontal axes of this quadrant can help improve the fit between the variables.

<img src="images/bulge.png" alt='bulge' width='600'>

Note that:

- There are multiple solutions. Some will fit better than others.
- sqrt and log make a value "smaller."
- Raising to a power makes a value "bigger."
- Each of these transformations equates to increasing or decreasing the scale of an axis.

Other goals in addition to linearity are possible, for example, making data appear more symmetric. Linearity allows us to fit lines to the transformed data.
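One rough way to compare candidate transformations from the bulge diagram is to check how linear each transformed relationship looks, for example via the correlation coefficient. A small sketch of that heuristic for the dugongs data (an illustration, not a rule from these notes):

```{python}
#| code-fold: false
# Sketch: how linear does "Age" vs. "Length" look under a few candidate transformations?
length, age = dugongs["Length"], dugongs["Age"]
candidates = {"Age": age, "sqrt(Age)": np.sqrt(age), "log(Age)": np.log(age)}
{name: np.corrcoef(length, values)[0, 1] for name, values in candidates.items()}
```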
Let's revisit our dugongs example. The lengths and ages are plotted below:

```{python}
#| code-fold: true
# `corrcoef` computes the correlation coefficient between two variables
# `std` finds the standard deviation
x = dugongs["Length"]
y = dugongs["Age"]
r = np.corrcoef(x, y)[0, 1]
theta_1 = r * np.std(y) / np.std(x)
theta_0 = np.mean(y) - theta_1 * np.mean(x)

fig, ax = plt.subplots(1, 2, dpi=200, figsize=(8, 3))
ax[0].scatter(x, y)
ax[0].set_xlabel("Length")
ax[0].set_ylabel("Age")

ax[1].scatter(x, y)
ax[1].plot(x, theta_0 + theta_1 * x, "tab:red")
ax[1].set_xlabel("Length")
ax[1].set_ylabel("Age");
```

Looking at the plot on the left, we see that there is a slight curvature to the data points. Plotting the SLR curve on the right results in a poor fit.

For SLR to perform well, we'd like there to be a rough linear trend relating `"Age"` and `"Length"`. What is making the raw data deviate from a linear relationship? Notice that the data points with `"Length"` greater than 2.6 have disproportionately high values of `"Age"` relative to the rest of the data. If we could manipulate these data points to have lower `"Age"` values, we'd "shift" these points downwards and reduce the curvature in the data. Applying a logarithmic transformation to $y_i$ (that is, taking $\log($ `"Age"` $)$ ) would achieve just that.

An important word on $\log$: in Data 100 (and most upper-division STEM courses), $\log$ denotes the natural logarithm with base $e$. The base-10 logarithm, where relevant, is indicated by $\log_{10}$.

```{python}
#| code-fold: true
z = np.log(y)

r = np.corrcoef(x, z)[0, 1]
theta_1 = r * np.std(z) / np.std(x)
theta_0 = np.mean(z) - theta_1 * np.mean(x)

fig, ax = plt.subplots(1, 2, dpi=200, figsize=(8, 3))
ax[0].scatter(x, z)
ax[0].set_xlabel("Length")
ax[0].set_ylabel(r"$\log{(Age)}$")

ax[1].scatter(x, z)
ax[1].plot(x, theta_0 + theta_1 * x, "tab:red")
ax[1].set_xlabel("Length")
ax[1].set_ylabel(r"$\log{(Age)}$")

plt.subplots_adjust(wspace=0.3)
```

Our SLR fit looks a lot better! We now have a new target variable: the SLR model is now trying to predict the _log_ of `"Age"`, rather than the untransformed `"Age"`. In other words, we are applying the transformation $z_i = \log{(y_i)}$. Notice that the resulting model is still **linear in the parameters** $\theta = [\theta_0, \theta_1]$. The SLR model becomes:

$$\hat{\log{y}} = \theta_0 + \theta_1 x$$
$$\hat{z} = \theta_0 + \theta_1 x$$

It turns out that this linearized relationship can help us understand the underlying relationship between $x$ and $y$. If we rearrange the relationship above, we find:

$$\log{(y)} = \theta_0 + \theta_1 x$$
$$y = e^{\theta_0 + \theta_1 x}$$
$$y = (e^{\theta_0})e^{\theta_1 x}$$
$$y_i = C e^{k x}$$

For some constants $C$ and $k$.

$y$ is an *exponential* function of $x$. Applying an exponential fit to the untransformed variables corroborates this finding.
```{python}
#| code-fold: true
plt.figure(dpi=120, figsize=(4, 3))

plt.scatter(x, y)
plt.plot(x, np.exp(theta_0) * np.exp(theta_1 * x), "tab:red")
plt.xlabel("Length")
plt.ylabel("Age");
```

You may wonder: why did we choose to apply a log transformation specifically? Why not some other function to linearize the data?

Practically, many other mathematical operations that modify the relative scales of `"Age"` and `"Length"` could have worked here.
## Multiple Linear Regression

Multiple linear regression is an extension of simple linear regression that adds additional features to the model. The multiple linear regression model takes the form:

$$\hat{y} = \theta_0\:+\:\theta_1x_{1}\:+\:\theta_2 x_{2}\:+\:...\:+\:\theta_p x_{p}$$

Our predicted value of $y$, $\hat{y}$, is a linear combination of the single **observations** (features), $x_i$, and the parameters, $\theta_i$.

We'll dive deeper into Multiple Linear Regression in the next lecture.
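As a preview, here is a minimal sketch of fitting such a model by ordinary least squares with `numpy`; the features and target below are synthetic stand-ins, not the dugongs data.

```{python}
#| code-fold: false
# Sketch: ordinary least squares for a multiple linear regression on synthetic data
rng = np.random.default_rng(42)
n_samples = 100
x1 = rng.uniform(0, 10, n_samples)                      # hypothetical feature 1
x2 = rng.uniform(0, 5, n_samples)                       # hypothetical feature 2
y_obs = 3 + 2 * x1 - 1.5 * x2 + rng.normal(0, 0.5, n_samples)

X = np.column_stack([np.ones(n_samples), x1, x2])       # intercept column plus features
theta_hat, *_ = np.linalg.lstsq(X, y_obs, rcond=None)   # [theta_0, theta_1, theta_2]
theta_hat
```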
## Bonus: Calculating Constant Model MSE Using an Algebraic Trick

Earlier, we calculated the constant model MSE using calculus. It turns out that there is a much more elegant way of performing this same minimization algebraically, without using calculus at all.

In this calculation, we use the fact that the **sum of deviations from the mean is $0$**, or that $\sum_{i=1}^{n} (y_i - \bar{y}) = 0$.

Let's quickly walk through the proof for this:

$$
\begin{align}
\sum_{i=1}^{n} (y_i - \bar{y}) &= \sum_{i=1}^{n} y_i - \sum_{i=1}^{n} \bar{y} \\
&= \sum_{i=1}^{n} y_i - n\bar{y} \\
&= \sum_{i=1}^{n} y_i - n\frac{1}{n}\sum_{i=1}^{n}y_i \\
&= \sum_{i=1}^{n} y_i - \sum_{i=1}^{n}y_i \\
&= 0
\end{align}
$$

In our calculations, we'll also be using the definition of the variance as a sample. As a refresher:

$$\sigma_y^2 = \frac{1}{n}\sum_{i=1}^{n} (y_i - \bar{y})^2$$

Getting into our calculation for MSE minimization:

$$
\begin{align}
R(\theta) &= \frac{1}{n}\sum^{n}_{i=1} (y_i - \theta)^2 \\
&= \frac{1}{n}\sum^{n}_{i=1} [(y_i - \bar{y}) + (\bar{y} - \theta)]^2 \quad \quad \text{using the trick that } a - b \text{ can be written as } (a-c) + (c-b), \\
&\quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \space \space \text{where } a, b, \text{ and } c \text{ are any numbers} \\
&= \frac{1}{n}\sum^{n}_{i=1} [(y_i - \bar{y})^2 + 2(y_i - \bar{y})(\bar{y} - \theta) + (\bar{y} - \theta)^2] \\
&= \frac{1}{n}\left[\sum^{n}_{i=1}(y_i - \bar{y})^2 + 2(\bar{y} - \theta)\sum^{n}_{i=1}(y_i - \bar{y}) + n(\bar{y} - \theta)^2\right] \quad \quad \text{distribute sum to individual terms} \\
&= \frac{1}{n}\sum^{n}_{i=1}(y_i - \bar{y})^2 + \frac{2}{n}(\bar{y} - \theta)\cdot 0 + (\bar{y} - \theta)^2 \quad \quad \text{sum of deviations from mean is 0} \\
&= \sigma_y^2 + (\bar{y} - \theta)^2
\end{align}
$$

Since variance can't be negative, we know that our first term, $\sigma_y^2$, is greater than or equal to $0$. Also note that **the first term doesn't involve $\theta$ at all**, meaning changing our model won't change this value. For the purposes of determining $\hat{\theta}$, we can then essentially ignore this term.

Looking at the second term, $(\bar{y} - \theta)^2$, since it is squared, we know it must be greater than or equal to $0$. As this term does involve $\theta$, picking the value of $\theta$ that minimizes this term will allow us to minimize our average loss. For the second term to equal $0$, $\theta = \bar{y}$, or in other words, $\hat{\theta} = \bar{y} = mean(y)$.
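A one-cell numerical check of this decomposition (a sketch, reusing the `drinks` array from earlier):

```{python}
#| code-fold: false
# Sketch: R(theta) equals var(y) + (mean(y) - theta)^2 for any theta
theta = 27.0                                      # arbitrary choice of theta
lhs = np.mean((drinks - theta) ** 2)              # constant model MSE at theta
rhs = np.var(drinks) + (np.mean(drinks) - theta) ** 2   # np.var uses the 1/n definition, matching sigma_y^2
lhs, rhs, np.isclose(lhs, rhs)
```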
##### Note

In the derivation above, we decompose the expected loss, $R(\theta)$, into two key components: the variance of the data, $\sigma_y^2$, and the square of the bias, $(\bar{y} - \theta)^2$. This decomposition is insightful for understanding the behavior of estimators in statistical models.

- **Variance, $\sigma_y^2$**: This term represents the spread of the data points around their mean, $\bar{y}$, and is a measure of the data's inherent variability. Importantly, it does not depend on the choice of $\theta$, meaning it's a fixed property of the data. Variance serves as an indicator of the data's dispersion and is crucial in understanding the dataset's structure, but it remains constant regardless of how we adjust our model parameter $\theta$.

- **Bias Squared, $(\bar{y} - \theta)^2$**: This term captures the bias of the estimator, defined as the square of the difference between the mean of the data points, $\bar{y}$, and the parameter $\theta$. The bias quantifies the systematic error introduced when estimating $\theta$. Minimizing this term is essential for improving the accuracy of the estimator. When $\theta = \bar{y}$, the bias is $0$, indicating that the estimator is unbiased for the parameter it estimates. This highlights a critical principle in statistical estimation: choosing $\theta$ to be the sample mean, $\bar{y}$, minimizes the average loss, rendering the estimator both efficient and unbiased for the population mean.
diff --git a/docs/constant_model_loss_transformations/loss_transformations_files/figure-pdf/cell-13-output-1.pdf b/docs/constant_model_loss_transformations/loss_transformations_files/figure-pdf/cell-13-output-1.pdf
index 807818cd..7b66b962 100644
Binary files a/docs/constant_model_loss_transformations/loss_transformations_files/figure-pdf/cell-13-output-1.pdf and b/docs/constant_model_loss_transformations/loss_transformations_files/figure-pdf/cell-13-output-1.pdf differ
diff --git a/docs/constant_model_loss_transformations/loss_transformations_files/figure-pdf/cell-14-output-1.pdf b/docs/constant_model_loss_transformations/loss_transformations_files/figure-pdf/cell-14-output-1.pdf
index a8d7f108..02a64577 100644
Binary files a/docs/constant_model_loss_transformations/loss_transformations_files/figure-pdf/cell-14-output-1.pdf and b/docs/constant_model_loss_transformations/loss_transformations_files/figure-pdf/cell-14-output-1.pdf differ
diff --git a/docs/constant_model_loss_transformations/loss_transformations_files/figure-pdf/cell-15-output-1.pdf b/docs/constant_model_loss_transformations/loss_transformations_files/figure-pdf/cell-15-output-1.pdf
index 93d961ca..85c5c531 100644
Binary files a/docs/constant_model_loss_transformations/loss_transformations_files/figure-pdf/cell-15-output-1.pdf and b/docs/constant_model_loss_transformations/loss_transformations_files/figure-pdf/cell-15-output-1.pdf differ
diff --git a/docs/constant_model_loss_transformations/loss_transformations_files/figure-pdf/cell-4-output-1.pdf b/docs/constant_model_loss_transformations/loss_transformations_files/figure-pdf/cell-4-output-1.pdf
index 8f68f0b2..24402808 100644
Binary files a/docs/constant_model_loss_transformations/loss_transformations_files/figure-pdf/cell-4-output-1.pdf and b/docs/constant_model_loss_transformations/loss_transformations_files/figure-pdf/cell-4-output-1.pdf differ
diff --git a/docs/constant_model_loss_transformations/loss_transformations_files/figure-pdf/cell-5-output-1.pdf b/docs/constant_model_loss_transformations/loss_transformations_files/figure-pdf/cell-5-output-1.pdf
index 8bb2661a..05b0bd69 100644
Binary files a/docs/constant_model_loss_transformations/loss_transformations_files/figure-pdf/cell-5-output-1.pdf and b/docs/constant_model_loss_transformations/loss_transformations_files/figure-pdf/cell-5-output-1.pdf differ
diff --git a/docs/constant_model_loss_transformations/loss_transformations_files/figure-pdf/cell-7-output-2.pdf b/docs/constant_model_loss_transformations/loss_transformations_files/figure-pdf/cell-7-output-2.pdf
index 0d9a275f..f9e94bad 100644
Binary files a/docs/constant_model_loss_transformations/loss_transformations_files/figure-pdf/cell-7-output-2.pdf and b/docs/constant_model_loss_transformations/loss_transformations_files/figure-pdf/cell-7-output-2.pdf differ
diff --git a/docs/constant_model_loss_transformations/loss_transformations_files/figure-pdf/cell-8-output-1.pdf b/docs/constant_model_loss_transformations/loss_transformations_files/figure-pdf/cell-8-output-1.pdf
index 42acfd6d..211eb515 100644
Binary files a/docs/constant_model_loss_transformations/loss_transformations_files/figure-pdf/cell-8-output-1.pdf and b/docs/constant_model_loss_transformations/loss_transformations_files/figure-pdf/cell-8-output-1.pdf differ
diff --git a/docs/eda/eda.html b/docs/eda/eda.html
index 31fc8fec..c79eb732 100644
--- a/docs/eda/eda.html
+++ b/docs/eda/eda.html
@@ -361,7 +361,7 @@ Data Cleaning and EDA
-
+
Code
import numpy as np
@@ -426,7 +426,7 @@
5.1.1.1 CSV
CSVs, which stand for Comma-Separated Values, are a common tabular data format. In the past two pandas
lectures, we briefly touched on the idea of file format: the way data is encoded in a file for storage. Specifically, our elections
and babynames
datasets were stored and loaded as CSVs:
-
+
"data/elections.csv").head(5) pd.read_csv(
@@ -497,7 +497,7 @@
+
with open("data/elections.csv", "r") as table:
= 0
i for row in table:
@@ -518,7 +518,7 @@ 5.1.1.2 TSV
Another common file type is TSV (Tab-Separated Values). In a TSV, records are still delimited by a newline \n
, while fields are delimited by the \t
tab character.
Let’s check out the first few rows of the raw TSV file. Again, we’ll use the repr()
function so that print
shows the special characters.
-
+
with open("data/elections.txt", "r") as table:
= 0
i for row in table:
@@ -534,7 +534,7 @@ (documentation).
-
+
"data/elections.txt", sep='\t').head(3) pd.read_csv(
@@ -591,7 +591,7 @@
5.1.1.3 JSON
JSON (JavaScript Object Notation) files behave similarly to Python dictionaries. A raw JSON is shown below.
-
+
with open("data/elections.json", "r") as table:
= 0
i for row in table:
@@ -621,7 +621,7 @@
+
pd.read_json('data/elections.json').head(3)
@@ -676,7 +676,7 @@
5.1.1.3.1 EDA with JSON: Berkeley COVID-19 Data
The City of Berkeley Open Data website has a dataset with COVID-19 Confirmed Cases among Berkeley residents by date. Let's download the file and save it as a JSON (note the source URL file type is also a JSON). In the interest of reproducible data science, we will download the data programmatically. We have defined some helper functions in the ds100_utils.py
file so that we can reuse them in many different notebooks.
-
+
from ds100_utils import fetch_and_cache

covid_file = fetch_and_cache(
@@ -695,7 +695,7 @@ 5.1.1.3.1.1 File Size
Let’s start our analysis by getting a rough estimate of the size of the dataset to inform the tools we use to view the data. For relatively small datasets, we can use a text editor or spreadsheet. For larger datasets, more programmatic exploration or distributed computing tools may be more fitting. Here we will use Python
tools to probe the file.
Since these seem to be text files, let's investigate the number of lines, which often corresponds to the number of records.
-
+
import os
print(covid_file, "is", os.path.getsize(covid_file) / 1e6, "MB")
@@ -713,7 +713,7 @@ As part of the EDA workflow, Unix commands can come in very handy. In fact, there’s an entire book called “Data Science at the Command Line” that explores this idea in depth! In Jupyter/IPython, you can prefix lines with !
to execute arbitrary Unix commands, and within those lines, you can refer to Python variables and expressions with the syntax {expr}
.
Here, we use the ls
command to list files, using the -lh
flags, which request “long format with information in human-readable form.” We also use the wc
command for “word count,” but with the -l
flag, which asks for line counts instead of words.
These two give us the same information as the code above, albeit in a slightly different form:
-
+
!ls -lh {covid_file}
!wc -l {covid_file}
@@ -725,7 +725,7 @@
5.1.1.3.1.3 File Contents
Let’s explore the data format using Python
.
-
+
with open(covid_file, "r") as f:
for i, row in enumerate(f):
print(repr(row)) # print raw strings
@@ -739,7 +739,7 @@
We can use the head
Unix command (which is where pandas
’ head
method comes from!) to see the first few lines of the file:
-
+
!head -5 {covid_file}
{
@@ -750,21 +750,21 @@
In order to load the JSON file into pandas
, let's first do some EDA with Python's json
package to understand the particular structure of this JSON file so that we can decide what (if anything) to load into pandas
. Python has relatively good support for JSON data since it closely matches the internal Python object model. In the following cell we import the entire JSON datafile into a Python dictionary using the json
package.
-
+
import json

with open(covid_file, "rb") as f:
    covid_json = json.load(f)
The covid_json
variable is now a dictionary encoding the data in the file:
-
+
type(covid_json)
dict
We can examine what keys are in the top level JSON object by listing out the keys.
-
+
covid_json.keys()
dict_keys(['meta', 'data'])
@@ -772,14 +772,14 @@
Observation: The JSON dictionary contains a meta
key which likely refers to metadata (data about the data). Metadata is often maintained with the data and can be a good source of additional information.
We can investigate the metadata further by examining the keys associated with the metadata.
-
+
covid_json['meta'].keys()
dict_keys(['view'])
The meta
key contains another dictionary called view
. This likely refers to metadata about a particular “view” of some underlying database. We will learn more about views when we study SQL later in the class.
-
+
covid_json['meta']['view'].keys()
dict_keys(['id', 'name', 'assetType', 'attribution', 'averageRating', 'category', 'createdAt', 'description', 'displayType', 'downloadCount', 'hideFromCatalog', 'hideFromDataJson', 'newBackend', 'numberOfComments', 'oid', 'provenance', 'publicationAppendEnabled', 'publicationDate', 'publicationGroup', 'publicationStage', 'rowsUpdatedAt', 'rowsUpdatedBy', 'tableId', 'totalTimesRated', 'viewCount', 'viewLastModified', 'viewType', 'approvals', 'columns', 'grants', 'metadata', 'owner', 'query', 'rights', 'tableAuthor', 'tags', 'flags'])
@@ -799,7 +799,7 @@
There is a key called description in the view sub-dictionary. This likely contains a description of the data:
-
+
print(covid_json['meta']['view']['description'])
Counts of confirmed COVID-19 cases among Berkeley residents by date.
@@ -809,7 +809,7 @@
5.1.1.3.1.4 Examining the Data Field for Records
We can look at a few entries in the data
field. This is what we’ll load into pandas
.
-
+
for i in range(3):
    print(f"{i:03} | {covid_json['data'][i]}")
@@ -820,7 +820,7 @@
+
type(covid_json['meta']['view']['columns'])
list
@@ -847,7 +847,7 @@
+
# Load the data from JSON and assign column titles
covid = pd.DataFrame(
    covid_json['data'],
@@ -960,7 +960,7 @@ 5.1.2 Primary and Foreign Keys
Last time, we introduced .merge
as the pandas
method for joining multiple DataFrame
s together. In our discussion of joins, we touched on the idea of using a “key” to determine what rows should be merged from each table. Let’s take a moment to examine this idea more closely.
The primary key is the column or set of columns in a table that uniquely determines the values of the remaining columns. It can be thought of as the unique identifier for each individual row in the table. For example, a table of Data 100 students might use each student's Cal ID as the primary key; a small join sketch built on such keys appears after this block.
-
+
@@ -1006,7 +1006,7 @@ is a foreign key referencing the previous table.
-
+
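To make the primary key / foreign key idea concrete, here is a minimal standalone sketch (toy tables invented for illustration, not from the lecture data) of joining on such keys with pandas:

```python
import pandas as pd

# "cal_id" is the primary key of the students table: one row per student.
students = pd.DataFrame({"cal_id": [101, 102, 103],
                         "name": ["Ana", "Ben", "Cy"]})

# In the grades table, "cal_id" is a foreign key referencing the students table.
grades = pd.DataFrame({"cal_id": [101, 101, 103],
                       "assignment": ["hw1", "hw2", "hw1"],
                       "score": [95, 88, 91]})

# Merging on the shared key attaches each grade to the matching student.
grades.merge(right=students, on="cal_id")
```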
@@ -1099,7 +1099,7 @@
5.2.3.1 Temporality with pandas
’ dt
accessors
Let’s briefly look at how we can use pandas
’ dt
accessors to work with dates/times in a dataset using the dataset you’ll see in Lab 3: the Berkeley PD Calls for Service dataset.
-
+
Code
= pd.read_csv("data/Berkeley_PD_-_Calls_for_Service.csv")
@@ -1206,11 +1206,11 @@ calls
+
"EVENTDT"] = pd.to_datetime(calls["EVENTDT"])
calls[ calls.head()
-/var/folders/ks/dgd81q6j5b7ghm1zc_4483vr0000gn/T/ipykernel_98202/874729699.py:1: UserWarning:
+/var/folders/ks/dgd81q6j5b7ghm1zc_4483vr0000gn/T/ipykernel_32669/874729699.py:1: UserWarning:
Could not infer format, so each element will be parsed individually, falling back to `dateutil`. To ensure parsing is consistent and as-expected, please specify a format.
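As the warning suggests, passing an explicit format avoids the per-element fallback. A minimal sketch follows; the format string is an assumption about how EVENTDT is written, and the real file may differ:

```python
import pandas as pd

# Hypothetical EVENTDT strings; the actual Calls for Service format may differ.
calls = pd.DataFrame({"EVENTDT": ["04/01/2021 12:00:00 AM", "04/02/2021 12:00:00 AM"]})
calls["EVENTDT"] = pd.to_datetime(calls["EVENTDT"], format="%m/%d/%Y %I:%M:%S %p")
calls["EVENTDT"].dt.dayofweek
```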
@@ -1315,7 +1315,7 @@
+
"EVENTDT"].dt.month.head() calls[
0 4
@@ -1327,7 +1327,7 @@
+
"EVENTDT"].dt.dayofweek.head() calls[
0 3
@@ -1339,7 +1339,7 @@
+
"EVENTDT").head() calls.sort_values(
@@ -1486,7 +1486,7 @@ <
We can then explore the CSV (which is a text file, and does not contain binary-encoded data) in many ways:
1. Using a text editor like emacs, vim, VSCode, etc.
2. Opening the CSV directly in DataHub (read-only), Excel, Google Sheets, etc.
3. The Python file object
4. pandas, using pd.read_csv()
To try out options 1 and 2, you can view or download the Tuberculosis dataset from the lecture demo notebook under the data folder in the left-hand menu. Notice how the CSV file is a type of rectangular data (i.e., tabular data) stored as comma-separated values.
Next, let's try out option 3 using the Python file object. We'll look at the first four lines:
-
+
Code
with open("data/cdc_tuberculosis.csv", "r") as f:
@@ -1511,7 +1511,7 @@ <
Whoa, why are there blank lines interspersed between the lines of the CSV?
You may recall that all line breaks in text files are encoded as the special newline character \n
. Python’s print()
prints each string (including the newline), and an additional newline on top of that.
If you’re curious, we can use the repr()
function to return the raw string with all special characters:
-
+
Code
with open("data/cdc_tuberculosis.csv", "r") as f:
@@ -1530,7 +1530,7 @@ <
Finally, let’s try option 4 and use the tried-and-true Data 100 approach: pandas
.
-
+
= pd.read_csv("data/cdc_tuberculosis.csv")
tb_df tb_df.head()
@@ -1610,7 +1610,7 @@ <
You may notice some strange things about this table: what’s up with the “Unnamed” column names and the first row?
Congratulations — you’re ready to wrangle your data! Because of how things are stored, we’ll need to clean the data a bit to name our columns better.
A reasonable first step is to identify the row with the right header. The pd.read_csv()
function (documentation) has the convenient header
parameter that we can set to use the elements in row 1 as the appropriate columns:
-
+
= pd.read_csv("data/cdc_tuberculosis.csv", header=1) # row index
tb_df 5) tb_df.head(
@@ -1689,7 +1689,7 @@ <
Wait…but now we can't differentiate between the "Number of TB cases" and "TB incidence" year columns. pandas
has tried to make our lives easier by automatically adding “.1” to the latter columns, but this doesn’t help us, as humans, understand the data.
We can do this manually with df.rename()
(documentation):
-
+
rename_dict = {'2019': 'TB cases 2019',
               '2020': 'TB cases 2020',
               '2021': 'TB cases 2021',
@@ -1779,7 +1779,7 @@ Row 0 is what we call a rollup record, or summary record. It’s often useful when displaying tables to humans. The granularity of record 0 (Totals) vs the rest of the records (States) is different.
Okay, EDA step two. How was the rollup record aggregated?
Let’s check if Total TB cases is the sum of all state TB cases. If we sum over all rows, we should get 2x the total cases in each of our TB cases by year (why do you think this is?).
-
+
Code
tb_df.sum(axis=0)
@@ -1796,7 +1796,7 @@
Whoa, what’s going on with the TB cases in 2019, 2020, and 2021? Check out the column types:
-
+
Code
tb_df.dtypes
@@ -1814,7 +1814,7 @@
Since there are commas in the values for TB cases, the numbers are read as the object
datatype, or storage type (close to the Python
string datatype), so pandas
is concatenating strings instead of adding integers (recall that Python can “sum”, or concatenate, strings together: "data" + "100"
evaluates to "data100"
).
Fortunately read_csv
also has a thousands
parameter (documentation):
-
+
# improve readability: chaining method calls with outer parentheses/line breaks
tb_df = (
    pd.read_csv("data/cdc_tuberculosis.csv", header=1, thousands=',')
@@ -1895,7 +1895,7 @@
-
+
tb_df.sum()
U.S. jurisdiction TotalAlabamaAlaskaArizonaArkansasCaliforniaCol...
@@ -1910,7 +1910,7 @@
The total TB cases look right. Phew!
Let’s just look at the records with state-level granularity:
-
+
Code
state_tb_df = tb_df[1:]
@@ -1995,7 +1995,7 @@ 5.4.3 Gather Census Data
U.S. Census population estimates source (2019), source (2020-2021).
Running the below cells cleans the data. There are a few new methods here:
* df.convert_dtypes() (documentation) conveniently converts all float dtypes into ints; it is out of scope for this class.
* df.dropna() (documentation) will be explained in more detail next time (see the sketch below for a quick preview of both methods).
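A minimal sketch of these two methods on an invented mini-table (not the census data itself):

```python
import numpy as np
import pandas as pd

# Invented mini-table standing in for the census data.
df = pd.DataFrame({"Geographic Area": ["Alabama", "Alaska", None],
                   "2020": [5024279.0, 733391.0, np.nan]})

# Drop rows with missing values, then convert whole-number floats to integer dtypes.
df.dropna().convert_dtypes()
```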
-
+
Code
# 2010s census data
@@ -2119,7 +2119,7 @@ or use iPython
magic which will intelligently import code when files change:
%load_ext autoreload
%autoreload 2
-
+
Code
# census 2020s data
@@ -2196,7 +2196,7 @@
5.4.4 Joining Data (Merging DataFrame
s)
Time to merge
! Here we use the DataFrame
method df1.merge(right=df2, ...)
on DataFrame
df1
(documentation). Contrast this with the function pd.merge(left=df1, right=df2, ...)
(documentation). Feel free to use either.
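For instance, the two spellings below are equivalent (a toy sketch with invented tables, not the TB/census data):

```python
import pandas as pd

df1 = pd.DataFrame({"key": [1, 2], "a": ["x", "y"]})
df2 = pd.DataFrame({"key": [1, 2], "b": [10, 20]})

# DataFrame method form vs. top-level function form of the same join.
method_form = df1.merge(right=df2, on="key")
function_form = pd.merge(left=df1, right=df2, on="key")
method_form.equals(function_form)  # True
```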
-
+
# merge TB DataFrame with two US census DataFrames
tb_census_df = (
    tb_df
@@ -2371,7 +2371,7 @@
+
# try merging again, but cleaner this time
tb_census_df = (
    tb_df
@@ -2484,7 +2484,7 @@
\[\text{TB incidence} = \frac{\text{TB cases in population}}{\text{groups in population}} = \frac{\text{TB cases in population}}{\text{population}/100000} \]
\[= \frac{\text{TB cases in population}}{\text{population}} \times 100000\]
Let’s try this for 2019:
-
+
"recompute incidence 2019"] = tb_census_df["TB cases 2019"]/tb_census_df["2019"]*100000
tb_census_df[5) tb_census_df.head(
@@ -2587,7 +2587,7 @@ documentation).
-
+
# recompute incidence for all years
for year in [2019, 2020, 2021]:
    tb_census_df[f"recompute incidence {year}"] = tb_census_df[f"TB cases {year}"]/tb_census_df[f"{year}"]*100000
@@ -2703,7 +2703,7 @@
+
tb_census_df.describe()
@@ -2864,7 +2864,7 @@
+
Code
= tb_df.set_index("U.S. jurisdiction")
@@ -2947,7 +2947,7 @@ tb_df
+
census_2010s_df = census_2010s_df.set_index("Geographic Area")
census_2010s_df.head(5)
@@ -3055,7 +3055,7 @@
+
census_2020s_df = census_2020s_df.set_index("Geographic Area")
census_2020s_df.head(5)
@@ -3115,7 +3115,7 @@
+