diff --git a/_quarto.yml b/_quarto.yml
index f42d47ce..2ae1449f 100644
--- a/_quarto.yml
+++ b/_quarto.yml
@@ -32,7 +32,7 @@ book:
 - feature_engineering/feature_engineering.qmd
 - case_study_HCE/case_study_HCE.qmd
 - cv_regularization/cv_reg.qmd
- # - probability_1/probability_1.qmd
+ - probability_1/probability_1.qmd
 # - probability_2/probability_2.qmd
 # - inference_causality/inference_causality.qmd
 # - case_study_climate/case_study_climate.qmd
diff --git a/docs/case_study_HCE/case_study_HCE.html b/docs/case_study_HCE/case_study_HCE.html
index a1105df1..0b16fd2e 100644
--- a/docs/case_study_HCE/case_study_HCE.html
+++ b/docs/case_study_HCE/case_study_HCE.html
@@ -247,6 +247,12 @@
+
+
import numpy as np
@@ -498,7 +504,7 @@ = dugongs[["Length", "Age"]]
data_linear
# Big font helper
@@ -520,7 +526,7 @@ "default") # Revert style to default mpl
plt.style.use(
# Constant Model + MSE
@@ -553,7 +559,7 @@
+
Code
# SLR + MSE
@@ -616,7 +622,7 @@
+
Code
# Predictions
@@ -628,7 +634,7 @@ = [theta_0_hat + theta_1_hat * x for x in xs]
yhats_linear
-
+
Code
# Constant Model Rug Plot
@@ -658,7 +664,7 @@
+
Code
# SLR model scatter plot
@@ -772,7 +778,7 @@ 11.4 Comparing Loss Functions
We’ve now tried our hand at fitting a model under both MSE and MAE cost functions. How do the two results compare?
Let’s consider a dataset where each entry represents the number of drinks sold at a bubble tea store each day. We’ll fit a constant model to predict the number of drinks that will be sold tomorrow.
-
+
drinks = np.array([20, 21, 22, 29, 33])
drinks
@@ -780,7 +786,7 @@
+
np.mean(drinks), np.median(drinks)
(np.float64(25.0), np.float64(22.0))
@@ -790,7 +796,7 @@
Notice that the MSE above is a smooth function – it is differentiable at all points, making it easy to minimize using numerical methods. The MAE, in contrast, is not differentiable at each of its “kinks.” We’ll explore how the smoothness of the cost function can impact our ability to apply numerical optimization in a few weeks.
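To make the comparison concrete, here is a minimal sketch (not part of the original notebook) that evaluates both cost functions over a grid of candidate values of \(\theta\) for the bubble tea data. The helper names `mse` and `mae` and the grid bounds are assumptions chosen for illustration.

```python
import numpy as np

drinks = np.array([20, 21, 22, 29, 33])

def mse(theta, y):
    """Mean squared error of the constant prediction theta."""
    return np.mean((y - theta) ** 2)

def mae(theta, y):
    """Mean absolute error of the constant prediction theta."""
    return np.mean(np.abs(y - theta))

# Candidate constant predictions on a grid with step 0.01 (assumed range)
thetas = np.linspace(15, 40, 2501)
mse_vals = np.array([mse(t, drinks) for t in thetas])
mae_vals = np.array([mae(t, drinks) for t in thetas])

# The grid point with the smallest cost under each loss
theta_mse = thetas[np.argmin(mse_vals)]
theta_mae = thetas[np.argmin(mae_vals)]
print(theta_mse, theta_mae)
```

On this grid, the MSE minimizer should coincide with the mean of the data and the MAE minimizer with the median, up to the grid resolution.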
How do outliers affect each cost function? Imagine we add an outlying value of 1033 to the dataset. The mean of the data increases substantially, while the median changes far less.
-
+
drinks_with_outlier = np.append(drinks, 1033)
display(drinks_with_outlier)
np.mean(drinks_with_outlier), np.median(drinks_with_outlier)
@@ -804,7 +810,7 @@
This means that under the MSE, the optimal model parameter \(\hat{\theta}\) is strongly affected by the presence of outliers. Under the MAE, the optimal parameter is not as influenced by outlying data. We can generalize this by saying that the MSE is sensitive to outliers, while the MAE is robust to outliers.
Let’s try another experiment. This time, we’ll add an additional, non-outlying datapoint to the data.
-
+
drinks_with_additional_observation = np.append(drinks, 35)
drinks_with_additional_observation
@@ -876,7 +882,7 @@
+
Code
# `corrcoef` computes the correlation coefficient between two variables
@@ -908,7 +914,7 @@ and "Length"
. What is making the raw data deviate from a linear relationship? Notice that the data points with "Length"
greater than 2.6 have disproportionately high values of "Age"
relative to the rest of the data. If we could manipulate these data points to have lower "Age"
values, we’d “shift” these points downwards and reduce the curvature in the data. Applying a logarithmic transformation to \(y_i\) (that is, taking \(\log(\) "Age"
\()\) ) would achieve just that.
An important word on \(\log\): in Data 100 (and most upper-division STEM courses), \(\log\) denotes the natural logarithm with base \(e\). The base-10 logarithm, where relevant, is indicated by \(\log_{10}\).
-
+
Code
= np.log(y)
@@ -943,7 +949,7 @@ z \[\log{(y)} = \theta_0 + \theta_1 x\] \[y = e^{\theta_0 + \theta_1 x}\] \[y = (e^{\theta_0})e^{\theta_1 x}\] \[y_i = C e^{k x}\]
For some constants \(C\) and \(k\).
\(y\) is an exponential function of \(x\). Applying an exponential fit to the untransformed variables corroborates this finding.
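As a rough check of the derivation above, the sketch below generates noise-free synthetic data from \(y = C e^{k x}\), fits a line to \(\log(y)\), and recovers \(C\) and \(k\). The values `C_true = 2.0` and `k_true = 1.5` are arbitrary assumptions; this is not the dugong data.

```python
import numpy as np

# Hypothetical constants for a synthetic exponential relationship
C_true, k_true = 2.0, 1.5
x = np.linspace(0, 3, 50)
y = C_true * np.exp(k_true * x)

# Fit log(y) = theta_0 + theta_1 * x with a degree-1 least squares fit.
# np.polyfit returns coefficients from highest degree to lowest.
theta_1, theta_0 = np.polyfit(x, np.log(y), 1)

# Undo the transformation: y = (e^{theta_0}) * e^{theta_1 * x}
C_hat = np.exp(theta_0)
k_hat = theta_1
print(C_hat, k_hat)
```

With real, noisy data the recovered constants would only approximate the underlying ones.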
-
+
Code
=120, figsize=(4, 3))
diff --git a/docs/constant_model_loss_transformations/loss_transformations_files/figure-pdf/cell-13-output-1.pdf b/docs/constant_model_loss_transformations/loss_transformations_files/figure-pdf/cell-13-output-1.pdf
index eaadd111..4fa6b5cb 100644
Binary files a/docs/constant_model_loss_transformations/loss_transformations_files/figure-pdf/cell-13-output-1.pdf and b/docs/constant_model_loss_transformations/loss_transformations_files/figure-pdf/cell-13-output-1.pdf differ
diff --git a/docs/constant_model_loss_transformations/loss_transformations_files/figure-pdf/cell-14-output-1.pdf b/docs/constant_model_loss_transformations/loss_transformations_files/figure-pdf/cell-14-output-1.pdf
index bc01894d..0662fdcd 100644
Binary files a/docs/constant_model_loss_transformations/loss_transformations_files/figure-pdf/cell-14-output-1.pdf and b/docs/constant_model_loss_transformations/loss_transformations_files/figure-pdf/cell-14-output-1.pdf differ
diff --git a/docs/constant_model_loss_transformations/loss_transformations_files/figure-pdf/cell-15-output-1.pdf b/docs/constant_model_loss_transformations/loss_transformations_files/figure-pdf/cell-15-output-1.pdf
index 38a94100..5e6420b9 100644
Binary files a/docs/constant_model_loss_transformations/loss_transformations_files/figure-pdf/cell-15-output-1.pdf and b/docs/constant_model_loss_transformations/loss_transformations_files/figure-pdf/cell-15-output-1.pdf differ
diff --git a/docs/constant_model_loss_transformations/loss_transformations_files/figure-pdf/cell-4-output-1.pdf b/docs/constant_model_loss_transformations/loss_transformations_files/figure-pdf/cell-4-output-1.pdf
index c6bcde1a..ee6142a9 100644
Binary files a/docs/constant_model_loss_transformations/loss_transformations_files/figure-pdf/cell-4-output-1.pdf and b/docs/constant_model_loss_transformations/loss_transformations_files/figure-pdf/cell-4-output-1.pdf differ
diff --git a/docs/constant_model_loss_transformations/loss_transformations_files/figure-pdf/cell-5-output-1.pdf b/docs/constant_model_loss_transformations/loss_transformations_files/figure-pdf/cell-5-output-1.pdf
index 81c1e0cd..d9f65cb3 100644
Binary files a/docs/constant_model_loss_transformations/loss_transformations_files/figure-pdf/cell-5-output-1.pdf and b/docs/constant_model_loss_transformations/loss_transformations_files/figure-pdf/cell-5-output-1.pdf differ
diff --git a/docs/constant_model_loss_transformations/loss_transformations_files/figure-pdf/cell-7-output-2.pdf b/docs/constant_model_loss_transformations/loss_transformations_files/figure-pdf/cell-7-output-2.pdf
index 9f49e0a6..0c71ed10 100644
Binary files a/docs/constant_model_loss_transformations/loss_transformations_files/figure-pdf/cell-7-output-2.pdf and b/docs/constant_model_loss_transformations/loss_transformations_files/figure-pdf/cell-7-output-2.pdf differ
diff --git a/docs/constant_model_loss_transformations/loss_transformations_files/figure-pdf/cell-8-output-1.pdf b/docs/constant_model_loss_transformations/loss_transformations_files/figure-pdf/cell-8-output-1.pdf
index c0df76b2..059ef4c2 100644
Binary files a/docs/constant_model_loss_transformations/loss_transformations_files/figure-pdf/cell-8-output-1.pdf and b/docs/constant_model_loss_transformations/loss_transformations_files/figure-pdf/cell-8-output-1.pdf differ
diff --git a/docs/cv_regularization/cv_reg.html b/docs/cv_regularization/cv_reg.html
index 556ff2c0..d48ceefa 100644
--- a/docs/cv_regularization/cv_reg.html
+++ b/docs/cv_regularization/cv_reg.html
@@ -64,6 +64,7 @@
+
@@ -278,6 +279,12 @@
plt.figure(dpi
+
+
+
@@ -381,7 +388,7 @@
In sklearn
, the train_test_split
function (documentation) of the model_selection
module allows us to automatically generate train-test splits.
We will work with the vehicles
dataset from previous lectures. As before, we will attempt to predict the mpg
of a vehicle from transformations of its hp
. In the cell below, we allocate 20% of the full dataset to testing, and the remaining 80% to training.
-
+
Code
import pandas as pd
@@ -400,7 +407,7 @@ = vehicles["mpg"] Y
-
+
from sklearn.model_selection import train_test_split
# `test_size` specifies the proportion of the full dataset that should be allocated to testing
@@ -422,7 +429,7 @@
After performing our train-test split, we fit a model to the training set and assess its performance on the test set.
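The fit-on-train, evaluate-on-test workflow can also be sketched by hand with `numpy` alone. This is an illustration on synthetic data under assumed names (`x`, `y`, `train_idx`, `test_idx`); it is not how `train_test_split` is implemented and it does not use the vehicles dataset.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins for a single feature and a target (assumed values)
x = rng.uniform(50, 200, size=100)
y = 40 - 0.1 * x + rng.normal(0, 2, size=100)

# Shuffle the row indices, then hold out 20% of them for testing
indices = rng.permutation(len(x))
n_test = int(0.2 * len(x))
test_idx, train_idx = indices[:n_test], indices[n_test:]

# Fit a simple linear model using only the training rows
slope, intercept = np.polyfit(x[train_idx], y[train_idx], 1)

# Assess the fitted model on rows it has never seen
test_pred = intercept + slope * x[test_idx]
test_mse = np.mean((y[test_idx] - test_pred) ** 2)
print(len(train_idx), len(test_idx), test_mse)
```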
-
+
import sklearn.linear_model as lm
from sklearn.metrics import mean_squared_error
@@ -602,7 +609,7 @@ \(\lambda\) the regularization penalty hyperparameter; it needs to be determined prior to training the model, so we must find the best value via cross-validation.
The process of finding the optimal \(\hat{\theta}\) to minimize our new objective function is called L1 regularization. It is also sometimes known by the acronym “LASSO”, which stands for “Least Absolute Shrinkage and Selection Operator.”
Unlike ordinary least squares, which can be solved via the closed-form solution \(\hat{\theta}_{OLS} = (\mathbb{X}^{\top}\mathbb{X})^{-1}\mathbb{X}^{\top}\mathbb{Y}\), there is no closed-form solution for the optimal parameter vector under L1 regularization. Instead, we use the Lasso
model class of sklearn
.
-
+
import sklearn.linear_model as lm
# The alpha parameter represents our lambda term
@@ -620,7 +627,7 @@
16.2.3 Scaling Features for Regularization
The regularization procedure we just performed had one subtle issue. To see what it is, let’s take a look at the design matrix for our lasso_model
.
-
+
Code
X_train.head()
@@ -683,7 +690,7 @@ \(\hat{y}\) because it is so much greater than the values of the other features. For hp
to have much of an impact at all on the prediction, it must be scaled by a large model parameter.
By inspecting the fitted parameters of our model, we see that this is the case – the parameter for hp
is much larger in magnitude than the parameter for hp^4
.
-
+
pd.DataFrame({"Feature":X_train.columns, "Parameter":lasso_model.coef_})
@@ -747,7 +754,7 @@ \[\hat\theta_{\text{ridge}} = (\mathbb{X}^{\top}\mathbb{X} + n\lambda I)^{-1}\mathbb{X}^{\top}\mathbb{Y}\]
This solution exists even if \(\mathbb{X}\) is not full column rank. This is a major reason why L2 regularization is often used – it can produce a solution even when there is colinearity in the features. We will discuss the concept of colinearity in a future lecture, but we will not derive this result in Data 100, as it involves a fair bit of matrix calculus.
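To illustrate the claim about rank-deficient design matrices, the sketch below evaluates the closed-form ridge expression on a tiny made-up matrix whose two columns are identical. The data and the value `lam = 0.1` are assumptions for illustration only.

```python
import numpy as np

# A design matrix with two identical columns, so it is not full column rank
X = np.array([[1.0, 1.0],
              [2.0, 2.0],
              [3.0, 3.0]])
Y = np.array([2.0, 4.0, 6.0])
n, p = X.shape
lam = 0.1  # hypothetical regularization strength

rank = np.linalg.matrix_rank(X)  # 1, so X^T X alone is singular

# Closed-form ridge solution: solve (X^T X + n * lambda * I) theta = X^T Y
A = X.T @ X + n * lam * np.eye(p)
theta_ridge = np.linalg.solve(A, X.T @ Y)
print(rank, theta_ridge)
```

Because the added \(n\lambda I\) term makes the matrix invertible, the solve succeeds here, and the weight is shared equally between the two duplicated columns.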
In sklearn
, we perform L2 regularization using the Ridge
class, which minimizes the L2 objective function for us. Notice that we scale the data before regularizing.
-
+
ridge_model = lm.Ridge(alpha=1) # alpha represents the hyperparameter lambda
ridge_model.fit(X_train, Y_train)
@@ -1206,6 +1213,9 @@
diff --git a/docs/eda/eda.html b/docs/eda/eda.html
index 2d4ef647..9d8f8c33 100644
--- a/docs/eda/eda.html
+++ b/docs/eda/eda.html
@@ -279,6 +279,12 @@
+
+
+
@@ -367,7 +373,7 @@ Data Cleaning and EDA
-
+
Code
import numpy as np
@@ -432,7 +438,7 @@
5.1.1.1 CSV
CSVs, which stand for Comma-Separated Values, are a common tabular data format. In the past two pandas
lectures, we briefly touched on the idea of file format: the way data is encoded in a file for storage. Specifically, our elections
and babynames
datasets were stored and loaded as CSVs:
-
+
pd.read_csv("data/elections.csv").head(5)
@@ -503,7 +509,7 @@
+
with open("data/elections.csv", "r") as table:
    i = 0
    for row in table:
@@ -524,7 +530,7 @@ 5.1.1.2 TSV
Another common file type is TSV (Tab-Separated Values). In a TSV, records are still delimited by a newline \n
, while fields are delimited by the \t
tab character.
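As a small illustration of these delimiters, the sketch below parses an in-memory TSV string with Python's built-in `csv` module. The rows are made up and are not taken from the elections file.

```python
import csv
import io

# A tiny, made-up TSV: fields separated by \t, records separated by \n
raw_tsv = "Year\tCandidate\tParty\n2000\tCandidate A\tParty X\n2004\tCandidate B\tParty Y\n"

# repr() makes the tab and newline characters visible
print(repr(raw_tsv))

# csv.reader splits each record on the delimiter we specify
rows = list(csv.reader(io.StringIO(raw_tsv), delimiter="\t"))
for row in rows:
    print(row)
```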
Let’s check out the first few rows of the raw TSV file. Again, we’ll use the repr()
function so that print
shows the special characters.
-
+
with open("data/elections.txt", "r") as table:
    i = 0
    for row in table:
@@ -540,7 +546,7 @@ (documentation).
-
+
pd.read_csv("data/elections.txt", sep='\t').head(3)
@@ -597,7 +603,7 @@
5.1.1.3 JSON
JSON (JavaScript Object Notation) files behave similarly to Python dictionaries. A raw JSON is shown below.
-
+
with open("data/elections.json", "r") as table:
    i = 0
    for row in table:
@@ -627,7 +633,7 @@
+
pd.read_json('data/elections.json').head(3)
@@ -682,7 +688,7 @@
5.1.1.3.1 EDA with JSON: Berkeley COVID-19 Data
The City of Berkeley Open Data website has a dataset with COVID-19 Confirmed Cases among Berkeley residents by date. Let’s download the file and save it as a JSON (note the source URL file type is also a JSON). In the interest of reproducible data science, we will download the data programmatically. We have defined some helper functions in the ds100_utils.py
file so that we can reuse them in many different notebooks.
-
+
from ds100_utils import fetch_and_cache
= fetch_and_cache(
@@ -701,7 +707,7 @@ covid_file 5.1.1.3.1.1 File Size
Let’s start our analysis by getting a rough estimate of the size of the dataset to inform the tools we use to view the data. For relatively small datasets, we can use a text editor or spreadsheet. For larger datasets, more programmatic exploration or distributed computing tools may be more fitting. Here we will use Python
tools to probe the file.
Since this appears to be a text file, let’s investigate the number of lines, which often corresponds to the number of records.
-
+
import os
print(covid_file, "is", os.path.getsize(covid_file) / 1e6, "MB")
@@ -719,7 +725,7 @@ As part of the EDA workflow, Unix commands can come in very handy. In fact, there’s an entire book called “Data Science at the Command Line” that explores this idea in depth! In Jupyter/IPython, you can prefix lines with !
to execute arbitrary Unix commands, and within those lines, you can refer to Python variables and expressions with the syntax {expr}
.
Here, we use the ls
command to list files, using the -lh
flags, which request “long format with information in human-readable form.” We also use the wc
command for “word count,” but with the -l
flag, which asks for line counts instead of words.
These two give us the same information as the code above, albeit in a slightly different form:
-
+
!ls -lh {covid_file}
!wc -l {covid_file}
@@ -731,7 +737,7 @@
5.1.1.3.1.3 File Contents
Let’s explore the data format using Python
.
-
+
with open(covid_file, "r") as f:
for i, row in enumerate(f):
print(repr(row)) # print raw strings
@@ -745,7 +751,7 @@
We can use the head
Unix command (which is where pandas
’ head
method comes from!) to see the first few lines of the file:
-
+
!head -5 {covid_file}
{
@@ -756,21 +762,21 @@
In order to load the JSON file into pandas
, let’s first do some EDA with Python’s json
package to understand the particular structure of this JSON file so that we can decide what (if anything) to load into pandas
. Python has relatively good support for JSON data since it closely matches the internal Python object model. In the following cell we import the entire JSON data file into a Python dictionary using the json
package.
-
+
import json
with open(covid_file, "rb") as f:
    covid_json = json.load(f)
The covid_json
variable is now a dictionary encoding the data in the file:
-
+
type(covid_json)
dict
We can examine what keys are in the top level JSON object by listing out the keys.
-
+
covid_json.keys()
dict_keys(['meta', 'data'])
@@ -778,14 +784,14 @@
Observation: The JSON dictionary contains a meta
key which likely refers to metadata (data about the data). Metadata is often maintained with the data and can be a good source of additional information.
We can investigate the metadata further by examining the keys associated with the metadata.
-
+
covid_json['meta'].keys()
dict_keys(['view'])
The meta
key contains another dictionary called view
. This likely refers to metadata about a particular “view” of some underlying database. We will learn more about views when we study SQL later in the class.
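The same style of nested lookup can be tried on a small, made-up JSON string. The structure below only imitates the `meta` / `view` / `data` layout seen so far; the assumption that each column entry carries a `"name"` field is an illustration, not something confirmed from the real file.

```python
import json

# A tiny, hypothetical document that mimics the nesting of the real file
raw = (
    '{"meta": {"view": {"name": "example", '
    '"columns": [{"name": "date"}, {"name": "cases"}]}}, '
    '"data": [["2020-03-01", 1], ["2020-03-02", 3]]}'
)
doc = json.loads(raw)

# Top-level keys, then one level down at a time
top_keys = list(doc.keys())
view_keys = list(doc["meta"]["view"].keys())

# Pair the assumed column names with each row of the data field
column_names = [c["name"] for c in doc["meta"]["view"]["columns"]]
records = [dict(zip(column_names, row)) for row in doc["data"]]
print(top_keys, view_keys, records)
```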
-
+
covid_json['meta']['view'].keys()
dict_keys(['id', 'name', 'assetType', 'attribution', 'averageRating', 'category', 'createdAt', 'description', 'displayType', 'downloadCount', 'hideFromCatalog', 'hideFromDataJson', 'newBackend', 'numberOfComments', 'oid', 'provenance', 'publicationAppendEnabled', 'publicationDate', 'publicationGroup', 'publicationStage', 'rowsUpdatedAt', 'rowsUpdatedBy', 'tableId', 'totalTimesRated', 'viewCount', 'viewLastModified', 'viewType', 'approvals', 'columns', 'grants', 'metadata', 'owner', 'query', 'rights', 'tableAuthor', 'tags', 'flags'])
@@ -805,7 +811,7 @@
There is a key called description in the view sub dictionary. This likely contains a description of the data:
-
+
print(covid_json['meta']['view']['description'])
Counts of confirmed COVID-19 cases among Berkeley residents by date.
@@ -815,7 +821,7 @@
5.1.1.3.1.4 Examining the Data Field for Records
We can look at a few entries in the data
field. This is what we’ll load into pandas
.
-
+
for i in range(3):
print(f"{i:03} | {covid_json['data'][i]}")
@@ -826,7 +832,7 @@
+
type(covid_json['meta']['view']['columns'])
list
@@ -853,7 +859,7 @@
+
# Load the data from JSON and assign column titles
= pd.DataFrame(
covid 'data'],
@@ -966,7 +972,7 @@ covid_json[5.1.2 Primary and Foreign Keys
Last time, we introduced .merge
as the pandas
method for joining multiple DataFrame
s together. In our discussion of joins, we touched on the idea of using a “key” to determine what rows should be merged from each table. Let’s take a moment to examine this idea more closely.
The primary key is the column or set of columns in a table that uniquely determine the values of the remaining columns. It can be thought of as the unique identifier for each individual row in the table. For example, a table of Data 100 students might use each student’s Cal ID as the primary key.
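One quick way to sanity-check a candidate primary key in `pandas` is to test whether its values are unique. The table, column names, and values below are hypothetical.

```python
import pandas as pd

# A made-up student table; cal_id is the intended primary key
students = pd.DataFrame({
    "cal_id": [101, 102, 103],
    "name": ["Ada", "Ben", "Ada"],
    "major": ["DS", "CS", "Stat"],
})

# A primary key must take a distinct value on every row
cal_id_is_key = students["cal_id"].is_unique
name_is_key = students["name"].is_unique

# For a multi-column key, check that no combination of values repeats
pair_is_key = not students.duplicated(subset=["name", "major"]).any()
print(cal_id_is_key, name_is_key, pair_is_key)
```

Uniqueness in one snapshot of the data is necessary but not sufficient: a column such as `name` could happen to be unique today and still be a poor key.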
-
+
@@ -1012,7 +1018,7 @@ is a foreign key referencing the previous table.
-
+
@@ -1105,7 +1111,7 @@
5.2.3.1 Temporality with pandas
’ dt
accessors
Let’s briefly look at how we can use pandas
’ dt
accessors to work with dates/times in a dataset using the dataset you’ll see in Lab 3: the Berkeley PD Calls for Service dataset.
-
+
Code
= pd.read_csv("data/Berkeley_PD_-_Calls_for_Service.csv")
@@ -1212,11 +1218,11 @@ calls
+
calls["EVENTDT"] = pd.to_datetime(calls["EVENTDT"])
calls.head()
-/var/folders/ks/dgd81q6j5b7ghm1zc_4483vr0000gn/T/ipykernel_11563/874729699.py:1: UserWarning:
+/var/folders/ks/dgd81q6j5b7ghm1zc_4483vr0000gn/T/ipykernel_16139/874729699.py:1: UserWarning:
Could not infer format, so each element will be parsed individually, falling back to `dateutil`. To ensure parsing is consistent and as-expected, please specify a format.
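A warning like the one above can usually be avoided by passing an explicit `format` string to `pd.to_datetime`. The sketch below uses made-up timestamps and an assumed format; the actual layout of the `EVENTDT` column may differ.

```python
import pandas as pd

# Hypothetical timestamp strings; the real EVENTDT layout is not shown here
raw_dates = pd.Series(["04/01/2021 12:00:00 AM", "04/03/2021 09:30:00 PM"])

# An explicit format avoids element-by-element inference
parsed = pd.to_datetime(raw_dates, format="%m/%d/%Y %I:%M:%S %p")

# The dt accessor exposes datetime components for the whole Series
months = parsed.dt.month
weekdays = parsed.dt.dayofweek  # Monday is 0, Sunday is 6
print(months.tolist(), weekdays.tolist())
```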
@@ -1321,7 +1327,7 @@
+
calls["EVENTDT"].dt.month.head()
0 4
@@ -1333,7 +1339,7 @@
+
calls["EVENTDT"].dt.dayofweek.head()
0 3
@@ -1345,7 +1351,7 @@
+
calls.sort_values("EVENTDT").head()
@@ -1492,7 +1498,7 @@ <
We can then explore the CSV (which is a text file, and does not contain binary-encoded data) in many ways: 1. Using a text editor like emacs, vim, VSCode, etc. 2. Opening the CSV directly in DataHub (read-only), Excel, Google Sheets, etc. 3. The Python
file object 4. pandas
, using pd.read_csv()
To try out options 1 and 2, you can view or download the Tuberculosis data from the lecture demo notebook under the data
folder in the left hand menu. Notice how the CSV file is a type of rectangular data (i.e., tabular data) stored as comma-separated values.
Next, let’s try out option 3 using the Python
file object. We’ll look at the first four lines:
-
+
Code
with open("data/cdc_tuberculosis.csv", "r") as f:
@@ -1517,7 +1523,7 @@ <
Whoa, why are there blank lines interspersed between the lines of the CSV?
You may recall that all line breaks in text files are encoded as the special newline character \n
. Python’s print()
prints each string (including the newline), and an additional newline on top of that.
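The doubled newline is easy to reproduce without any file on disk. The sketch below uses an in-memory stand-in for a CSV (the contents are made up) and shows two common ways to avoid the extra blank lines.

```python
import io

# An in-memory stand-in for a small CSV file
fake_file = io.StringIO("a,b\n1,2\n")
lines = list(fake_file)  # each element still ends with "\n"

# print() appends its own newline, so a blank line follows each row
for line in lines:
    print(line)

# Option 1: tell print() not to add a newline
for line in lines:
    print(line, end="")

# Option 2: strip the trailing newline before printing
stripped = [line.rstrip("\n") for line in lines]
for line in stripped:
    print(line)
```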
If you’re curious, we can use the repr()
function to return the raw string with all special characters:
-
+
Code
with open("data/cdc_tuberculosis.csv", "r") as f:
@@ -1536,7 +1542,7 @@ <
Finally, let’s try option 4 and use the tried-and-true Data 100 approach: pandas
.
-
+
tb_df = pd.read_csv("data/cdc_tuberculosis.csv")
tb_df.head()
@@ -1616,7 +1622,7 @@ <
You may notice some strange things about this table: what’s up with the “Unnamed” column names and the first row?
Congratulations — you’re ready to wrangle your data! Because of how things are stored, we’ll need to clean the data a bit to name our columns better.
A reasonable first step is to identify the row with the right header. The pd.read_csv()
function (documentation) has the convenient header
parameter that we can set to use the elements in row 1 as the appropriate columns:
-
+
tb_df = pd.read_csv("data/cdc_tuberculosis.csv", header=1) # row index
tb_df.head(5)
@@ -1695,7 +1701,7 @@ <
Wait…but now we can’t differentiate between the “Number of TB cases” and “TB incidence” year columns. pandas
has tried to make our lives easier by automatically adding “.1” to the latter columns, but this doesn’t help us, as humans, understand the data.
We can do this manually with df.rename()
(documentation):
-
+
rename_dict = {'2019': 'TB cases 2019',
               '2020': 'TB cases 2020',
               '2021': 'TB cases 2021',
@@ -1785,7 +1791,7 @@ Row 0 is what we call a rollup record, or summary record. It’s often useful when displaying tables to humans. The granularity of record 0 (Totals) vs the rest of the records (States) is different.
Okay, EDA step two. How was the rollup record aggregated?
Let’s check if Total TB cases is the sum of all state TB cases. If we sum over all rows, we should get 2x the total cases in each of our TB cases by year (why do you think this is?).
-
+
Code
tb_df.sum(axis=0)
@@ -1802,7 +1808,7 @@
Whoa, what’s going on with the TB cases in 2019, 2020, and 2021? Check out the column types:
-
+
Code
tb_df.dtypes
@@ -1820,7 +1826,7 @@
Since there are commas in the values for TB cases, the numbers are read as the object
datatype, or storage type (close to the Python
string datatype), so pandas
is concatenating strings instead of adding integers (recall that Python can “sum”, or concatenate, strings together: "data" + "100"
evaluates to "data100"
).
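The concatenation behavior can be reproduced on a two-element Series. The values below are made up, and stripping the commas by hand is just one possible workaround shown for contrast.

```python
import pandas as pd

# Case counts stored as text with thousands separators (made-up values)
cases_as_text = pd.Series(["1,000", "2,500"], dtype=object)

# Summing an object column of strings concatenates them
concatenated = cases_as_text.sum()

# Removing the commas and casting to int gives the arithmetic sum
cases_as_int = cases_as_text.str.replace(",", "", regex=False).astype(int)
total = cases_as_int.sum()
print(repr(concatenated), total)
```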
Fortunately read_csv
also has a thousands
parameter (documentation):
-
+
# improve readability: chaining method calls with outer parentheses/line breaks
= (
tb_df "data/cdc_tuberculosis.csv", header=1, thousands=',')
@@ -1901,7 +1907,7 @@ pd.read_csv(
-
+
tb_df.sum()
U.S. jurisdiction TotalAlabamaAlaskaArizonaArkansasCaliforniaCol...
@@ -1916,7 +1922,7 @@
The total TB cases look right. Phew!
Let’s just look at the records with state-level granularity:
-
+
Code
= tb_df[1:]
@@ -2001,7 +2007,7 @@ state_tb_df 5.4.3 Gather Census Data
U.S. Census population estimates source (2019), source (2020-2021).
Running the below cells cleans the data. There are a few new methods here: * df.convert_dtypes()
(documentation) conveniently converts all float dtypes into ints and is out of scope for the class. * df.dropna()
(documentation) will be explained in more detail next time.
-
+
Code
# 2010s census data
@@ -2125,7 +2131,7 @@ or use iPython
magic which will intelligently import code when files change:
%load_ext autoreload
%autoreload 2
-
+
Code
# census 2020s data
@@ -2202,7 +2208,7 @@
5.4.4 Joining Data (Merging DataFrame
s)
Time to merge
! Here we use the DataFrame
method df1.merge(right=df2, ...)
on DataFrame
df1
(documentation). Contrast this with the function pd.merge(left=df1, right=df2, ...)
(documentation). Feel free to use either.
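For a quick side-by-side, the sketch below merges two small made-up tables both ways. The column names `state`, `cases`, and `population` are illustrative and are not the actual TB or census columns.

```python
import pandas as pd

# Two tiny, hypothetical tables that share a "state" key
df1 = pd.DataFrame({"state": ["A", "B"], "cases": [10, 20]})
df2 = pd.DataFrame({"state": ["A", "B"], "population": [1000, 4000]})

# DataFrame method form
via_method = df1.merge(right=df2, on="state")

# Top-level function form
via_function = pd.merge(left=df1, right=df2, on="state")

# Both forms should produce the same joined table
print(via_method.equals(via_function))
print(via_method)
```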
-
+
# merge TB DataFrame with two US census DataFrames
= (
tb_census_df
@@ -2377,7 +2383,7 @@ tb_df
+