Note: Given the nuanced nature of some of the arguments made in the lecture, it is highly recommended that you view the lecture recording given by Professor Ari Edmundson to fully engage and understand the material. The course notes will have the same broader structure but are by no means comprehensive.

> Disclaimer: The following note discusses issues of structural racism. Some of the items in this note may be sensitive and may or may not be the opinions, ideas, and beliefs of the students who collected the materials. The Data 100 course staff tries its best to only present information that is relevant for teaching the lessons at hand.
As data scientists, our goal is to wrangle data, recognize patterns and use them to make predictions within a certain context. However, it is often easy to abstract data away from its original context. In previous lectures, we’ve explored datasets like `elections`, `babynames`, and `world_bank` to learn fundamental techniques for working with data, but rarely do we stop to ask questions like “How/when was this data collected?” or “Are there any inherent biases in the data that could affect results?”. It turns out that inquiries like these profoundly affect how data scientists approach a task and convey their findings. This lecture explores these ethical dilemmas through the lens of a case study.
Let’s immerse ourselves in the real-world story of data scientists working for an organization called the Cook County Assessor’s Office (CCAO) located in Chicago, Illinois. Their job is to estimate the values of houses in order to assign property taxes. This is because the tax burden in this area is determined by the estimated value of a house rather than its price. Since value changes over time and has no obvious indicators, the CCAO created a model to estimate the values of houses. In this note, we will dig deep into biases that arose in the model, the consequences to human lives, and what we can learn from this example to avoid the same mistakes in the future.
+What prompted the formation of the CCAO and led to the development of this model? In 2017, an investigative report by the Chicago Tribune uncovered a major scandal in the property assessment system managed by the CCAO under the watch of former County Assessor Joseph Berrios. Working with experts from the University of Chicago, the Chicago Tribune journalists found that the CCAO’s model for estimating house value perpetuated a highly regressive tax system that disproportionately burdened African-American and Latinx homeowners in Cook County. How did the journalists demonstrate this disparity?
+The image above shows two standard metrics to estimate the fairness of assessments: the coefficient of dispersion and price-related differential. How they’re calculated is out of scope for this class, but you can assume that these metrics have been rigorously tested by experts in the field and are a good indication of fairness. As we see above, calculating these metrics for the Cook County prices revealed that the pricing created by the CCAO did not fall in acceptable ranges. While this on its own is not the entire story, it was a good indicator that something fishy was going on.
+This prompted journalists to investigate if the CCAO’s model itself was producing fair tax rates. When accounting for the homeowner’s income, they found that the model actually produced a regressive tax rate (see figure above). A tax rate is regressive if the percentage tax rate is higher for individuals with lower net income; it is progressive if the percentage tax rate is higher for individuals with higher net income.
+Digging further, journalists found that the model was not only regressive and unfair to lower-income individuals, but it was also unfair to non-white homeowners (see figure above). The likelihood of a property being under- or over-assessed was highly dependent on the owner’s race, and that did not sit well with many homeowners.
+What was the cause of such a major issue? It might be easy to simply blame “biased” algorithms, but the main issue was not a faulty model. Instead, it was largely due to the appeals system which enabled the wealthy and privileged to more easily and successfully challenge their assessments. Once given the CCAO model’s initial assessment of their home’s value, homeowners could choose to appeal to a board of elected officials to try and change the listed value of their home and, consequently, how much they are taxed. In theory, this sounds like a very fair system: a human being oversees the final pricing of houses rather than a computer algorithm. In reality, this ended up exacerbating the problem.
> “Appeals are a good thing,” Thomas Jaconetty, deputy assessor for valuation and appeals, said in an interview. “The goal here is fairness. We made the numbers. We can change them.”
We can borrow lessons from Critical Race Theory —— on the surface, everyone has the legal right to try and appeal the value of their home. However, not everyone has an equal ability to do so. Those who have the money to hire tax lawyers to appeal for them have a drastically higher chance of trying and succeeding in their appeal (see above figure). Many homeowners who appealed were generally under-assessed compared to homeowners who did not (see figure below). Clearly, the model is part of a deeper institutional pattern rife with potential corruption.
+In fact, Chicago boasts a large and thriving tax attorney industry dedicated precisely to appealing property assessments, reflected in the growing number of appeals in Cook County in the 21st century. Given wealthier, whiter neighborhoods typically have greater access to lawyers, they often appealed more and won reductions far more often than their less wealthy neighbors. In other words, those with higher incomes pay less in property tax, tax lawyers can grow their business due to their role in appeals, and politicians are socially connected to the aforementioned tax lawyers and wealthy homeowners. All these stakeholders have reasons to advertise the appeals system as an integral part of a fair system; after all, it serves to benefit them. Here lies the value in asking questions: a system that seems fair on the surface may, in reality, be unfair upon taking a closer look.
+What happened as a result of this corrupt system? As the Chicago Tribune reported, many African American and Latino homeowners purchased homes only to find their houses were later appraised at levels far higher than what they paid. As a result, homeowners were now responsible for paying significantly more in taxes every year than initially budgeted, putting them at risk of not being able to afford their homes and losing them.
+The impact of the housing model extends beyond the realm of home ownership and taxation —— the issues of justice go much deeper. This model perpetrated much older patterns of racially discriminatory practices in Chicago and across the United States. Unfortunately, it is no accident that this happened in Chicago, one of the most segregated cities in the United States (source). These factors are central to informing us, as data scientists, about what is at stake.
+Before we dive into how the CCAO used data science to “solve” this problem, let’s briefly go through the history of discriminatory housing practices in the United States to give more context on the gravity and urgency of this situation.
+Housing and real estate, among other factors, have been one of the most significant and enduring drivers of structural racism and racial inequality in the United States since the Civil War. It is one of the main areas where inequalities are created and reproduced. In the early 20th century, Jim Crow laws were explicit in forbidding people of color from utilizing the same facilities —— such as buses, bathrooms, and pools —— as white individuals. This set of practices by government actors in combination with overlapping practices driven by the private real estate industry further served to make neighborhoods increasingly segregated.
+Although advancements in civil rights have been made, the spirit of the laws is alive in many parts of the US. In the 1920s and 1930s, it was illegal for governments to actively segregate neighborhoods according to race, but other methods were available for achieving the same ends. One of the most notorious practices was redlining: the federal housing agencies’ process of distinguishing neighborhoods in a city in terms of relative risk. The goal was to increase access to homeownership for low-income Americans. In practice, however, it allowed real estate professionals to legally perpetuate segregation. The federal housing agencies deemed predominantly African American neighborhoods as high risk and colored them in red —— hence the name redlining —— making it nearly impossible for African Americans to own a home.
+The origins of the data that made these maps possible lay in a kind of “racial data revolution” in the private real estate industry beginning in the 1920s. Segregation was established and reinforced in part through the work of real estate agents who were also very concerned with establishing reliable methods for predicting the value of a home. The effects of these practices continue to resonate today.
+The response to this problem started in politics. A new assessor, Fritz Kaegi, was elected and created a new mandate with two goals:
+He wanted to not only create a more accurate algorithmic model but also to design a new system to address the problems with the CCAO.
+Let’s frame this problem through the lens of the data science lifecycle.
+The old system was unfair because it was systemically inaccurate; it made one kind of error for one group, and another kind of error for another. Its goal was to “create a robust pipeline that accurately assesses property values at scale and is fair”, and in turn, they defined fairness as accuracy: “the ability of our pipeline to accurately assess all residential property values, accounting for disparities in geography, information, etc.” Thus, the plan —— make the system more fair —— was already framed in terms of a task appropriate to a data scientist: make the assessments more accurate (or more precisely, minimize errors in a particular way).
+The idea here is that if the model is more accurate it will also (perhaps necessarily) become more fair, which is a big assumption. There are, in a sense, two different problems —— make accurate assessments, and make a fair system. Treating these two problems as one makes it a more straightforward issue that can be solved technically (with a good model) but does raise the question of if fairness and accuracy are one and the same.
+For now, let’s just talk about the technical part of this —— accuracy. For you, the data scientist, this part might feel more comfortable. We can determine some metrics of success and frame a social problem as a data science problem.
+ +The new Office of Data Science started by framing the problem and redefining their goals. They determined that they needed to:
+The goals defined above lead us to ask the question: what does it actually mean to accurately assess property values, and what role does “scale” play?
+Each of the above questions leads to a slew of more questions. Considering just the first question, one answer could be that an assessment is an estimate of the value of a home. This leads to more inquiries: what is the value of a home? What determines it? How do we know? For this class, we take it to be the house’s market value, or how much it would sell for.
+Unfortunately, if you are the county assessor, it becomes hard to determine property values with this definition. After all, you can’t make everyone sell their house every year. And as many properties haven’t been sold in decades, every year that passes makes that previous sale less reliable as an indicator.
+So how would one generate reliable estimates? You’re probably thinking, well, with data about homes and their sale prices you can probably predict the value of a property reliably. Even if you’re not a data scientist, you might know there are websites like Zillow and RedFin that estimate what properties would sell for and constantly update them. They don’t know the value, but they estimate them. How do you think they do this? Let’s start with the data —— which is the next step in the lifecycle.
+To generate estimates, the data scientists used two datasets. The first contained all recorded sales data from 2013 to 2019. The second contained property characteristics, including a property identification number and physical characteristics (e.g., age, bedroom, baths, square feet, neighborhood, site desirability, etc.).
+As they examined the datasets, they asked the questions:
+With so much data available, data scientists worked to see how all the different data points correlated with each other and with the sales prices. By discovering patterns in datasets containing known sale prices and characteristics of similar and nearby properties, training a model on this data, and applying it to all the properties without sales data, it was now possible to create a linear model that could predict the sale price (“fair market value”) of unsold properties.
+Some other key questions data scientists asked about the data were:
+Attributes can have different likelihoods of appearing in the data. For example, housing data in the floodplain geographic region of Chicago were less represented than other regions.
+Features can also be reported at different rates. Improvements in homes, which tend to increase property value, were unlikely to be reported by the homeowners.
+Additionally, they found that there was simply more missing data in lower-income neighborhoods.
+Before the modeling step, they investigated a multitude of crucial questions:
+They found that certain features, such as bedroom number, were much more useful in determining house value for certain neighborhoods than for others. This informed them that different models should be used depending on the neighborhood.
+They also noticed that low-income neighborhoods had disproportionately spottier data. This informed them that they needed to develop new data collection practices - including finding new sources of data.
+Rather than using a singular model to predict sale prices (“fair market value”) of unsold properties, the CCAO predicts sale prices using machine learning models that discover patterns in data sets containing known sale prices and characteristics of similar and nearby properties. It uses different model weights for each neighborhood.
+Compared to traditional mass appraisal, the CCAO’s new approach is more granular and more sensitive to neighborhood variations.
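To make the idea of neighborhood-specific models concrete, here is a heavily simplified, hypothetical sketch (this is not the CCAO’s actual pipeline, and the column names, features, and use of `sklearn.linear_model.LinearRegression` are illustrative assumptions): fit a separate linear model to the sold properties in each neighborhood, then use that neighborhood’s model to estimate the values of its unsold properties.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

def fit_neighborhood_models(sales: pd.DataFrame, features: list):
    """Fit one linear model per neighborhood using properties with known sale prices."""
    models = {}
    for neighborhood, group in sales.groupby("neighborhood"):
        # Each neighborhood gets its own coefficients ("weights")
        models[neighborhood] = LinearRegression().fit(group[features], group["sale_price"])
    return models

# Hypothetical usage (placeholder column names, not the CCAO's actual schema):
# models = fit_neighborhood_models(sales_df, ["age", "bedrooms", "baths", "sqft"])
# estimate = models["Some Neighborhood"].predict(unsold_df[["age", "bedrooms", "baths", "sqft"]])
```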
+But how do we know if an assessment is accurate? We can see how our model performs when predicting the sales prices of properties it wasn’t trained on! We can then evaluate how “close” our estimate was to the actual sales price, using Root Mean Square Error (RMSE). However, is RMSE a good proxy for fairness in this context?
+Broad metrics of error like RMSE can be limiting when evaluating the “fairness” of a property appraisal system. RMSE does not tell us anything about the distribution of errors, whether the errors are positive or negative, and the relative size of the errors. It does not tell us anything about the regressivity of the model, instead just giving a rough measure of our model’s overall error.
+Even with a low RMSE, we can’t guarantee a fair model. The error we see (no matter how small) may be a result of our model overvaluing less expensive homes and undervaluing more expensive homes.
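A tiny, made-up numerical illustration of this point: the two sets of assessments below have identical RMSE, yet one of them systematically overvalues the cheaper homes and undervalues the more expensive ones, exactly the regressive pattern described above.

```python
import numpy as np

true_values = np.array([100_000, 200_000, 300_000, 400_000])

balanced   = true_values + np.array([20_000, -20_000, 20_000, -20_000])
regressive = true_values + np.array([20_000, 20_000, -20_000, -20_000])  # overvalues cheap, undervalues expensive

rmse = lambda pred: np.sqrt(np.mean((pred - true_values) ** 2))
print(rmse(balanced), rmse(regressive))  # identical RMSE, very different fairness implications
```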
+Regarding accuracy, it’s important to ask what makes a batch of assessments better or more accurate than another batch of assessments. The value of a home that a model predicts is relational. It’s a product of the interaction of social and technical elements so property assessment involves social trust.
+Why should any particular individual believe that the model is accurate for their property? Why should any individual trust the model?
+To foster public trust, the CCAO focuses on “transparency”, putting data, models, and the pipeline onto GitLab. By doing so, they can better equate the production of “accurate assessments” with “fairness”.
+There’s a lot more to be said here on the relationship between accuracy, fairness, and metrics we tend to use when evaluating our models. Given the nuanced nature of the argument, it is recommended you view the corresponding lecture as the course notes are not as comprehensive for this portion of the lecture.
+Unfortunately, it may be naive to hope that a more accurate and transparent algorithm will translate into more fair outcomes in practice. Even if our model is perfectly optimized according to the standards of fairness we’ve set, there is no guarantee that people will actually pay their expected share of taxes as determined by the model. While it is a good step in the right direction, maintaining a level of social trust is key to ensuring people pay their fair share.
+Despite all their best efforts, the CCAO is still struggling to create fair assessments and engender trust.
Stories like this one show that total taxes for residential properties went up overall (because commercial taxes went down). But looking at the distribution, we can see that the biggest increases occurred in wealthy neighborhoods, and the biggest decreases occurred in poorer, predominantly Black neighborhoods. So maybe there was some success after all?
+However, it’ll ultimately be hard to overcome the propensity of the board of review to reduce the tax burden of the rich, preventing the CCAO from creating a truly fair system. This is in part because there are many cases where the model makes big, frustrating mistakes. In some cases like this one, it is due to spotty data.
- Question/Problem Formulation
- Data Acquisition and Cleaning
- Exploratory Data Analysis & Visualization
- Prediction and Inference
+Last time, we began our journey into unsupervised learning by discussing Principal Component Analysis (PCA).
+In this lecture, we will explore another very popular unsupervised learning concept: clustering. Clustering allows us to “group” similar datapoints together without being given labels of what “class” or where each point explicitly comes from. We will discuss two clustering algorithms: K-Means clustering and hierarchical agglomerative clustering, and we’ll examine the assumptions, strengths, and drawbacks of each one.
+In supervised learning, our goal is to create a function that maps inputs to outputs. Each model is learned from example input/output pairs (training set), validated using input/output pairs, and eventually tested on more input/output pairs. Each pair consists of:
+In regression, our output value is quantitative, and in classification, our output value is categorical.
+In unsupervised learning, our goal is to identify patterns in unlabeled data. In this type of learning, we do not have input/output pairs. Sometimes, we may have labels but choose to ignore them (e.g. PCA on labeled data). Instead, we are more interested in the inherent structure of the data we have rather than trying to simply predict a label using that structure of data. For example, if we are interested in dimensionality reduction, we can use PCA to reduce our data to a lower dimension.
+Now, let’s consider a new problem: clustering.
+Consider this figure from Fall 2019 Midterm 2. The original dataset had 8 dimensions, but we have used PCA to reduce our data down to 2 dimensions.
+Each point represents the 1st and 2nd principal component of how much time patrons spent at 8 different zoo exhibits. Visually and intuitively, we could potentially guess that this data belongs to 3 groups: one for each cluster. The goal of clustering is now to assign each point (in the 2 dimensional PCA representation) to a cluster.
+This is an unsupervised task, as:
+Now suppose you’re Netflix and are looking at information on customer viewing habits. Clustering can come in handy here. We can assign each person or show to a “cluster.” (Note: while we don’t know for sure that Netflix actually uses ML clustering to identify these categories, they could, in principle.)
+Keep in mind that with clustering, we don’t need to define clusters in advance; it discovers groups automatically. On the other hand, with classification, we have to decide labels in advance. This marks one of the key differences between clustering and classification.
+Let’s say we’re working with student-generated materials and pass them into the S-BERT module to extract sentence embeddings. Features from clusters are extracted to:
+Here we can see the outline of the anomaly detection module. It consists of:
+Looking more closely at our clustering, we can better understand the different components, which are represented by the centers. Below we have two examples.
+Note that the details for this example are not in scope.
+Now, consider the plot below:
+The rows of this plot are conditions (e.g., a row might be: “poured acid on the cells”), and the columns are genes. The green coloration indicates that the gene was “off” whereas red indicates the gene was “on”. For example, the ~9 genes in the top left corner of the plot were all turned off by the 6 experiments (rows) at the top.
+In a clustering lens, we might be interested in clustering similar observations together based on the reactions (on/off) to certain experiments.
+For example, here is a look at our data before and after clustering.
+Note: apologies if you can’t differentiate red from green by eye! Historical visualizations are not always the best.
+There are many types of clustering algorithms, and they all have strengths, inherent weaknesses, and different use cases. We will first focus on a partitional approach: K-Means clustering.
+The most popular clustering approach is K-Means. The algorithm itself entails the following:
+Pick an arbitrary \(k\), and randomly place \(k\) “centers”, each a different color.
Repeat until convergence:

- Color each data point according to its closest center.
- Move each center to the center (average position) of the data points sharing its color.
+Consider the following data with an arbitrary \(k = 2\) and randomly placed “centers” denoted by the different colors (blue, orange):
+Now, we will follow the rest of the algorithm. First, let us color each point according to the closest center:
+Next, we will move the center for each color to the center of points with that color. Notice how the centers are generally well-centered amongst the data that shares its color.
+Assume this process (re-color and re-set centers) repeats for a few more iterations. We eventually reach this state.
+After this iteration, the center stays still and does not move at all. Thus, we have converged, and the clustering is complete!
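To make the procedure concrete, below is a minimal NumPy sketch of the K-Means loop described above. It is written for illustration only: the data array `X`, the choice of `k`, and the stopping rule are assumptions, and in practice you would typically reach for a library implementation such as `sklearn.cluster.KMeans`.

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=25):
    """Bare-bones K-Means sketch: X is an (n, d) array of points, k is the number of clusters."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]  # random initial "centers"
    labels = np.zeros(len(X), dtype=int)
    for _ in range(n_iters):
        # Step 1: "color" each point by its closest center
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)  # shape (n, k)
        labels = dists.argmin(axis=1)
        # Step 2: move each center to the mean of the points assigned to it
        # (assumes no cluster ever ends up empty)
        new_centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centers, centers):  # converged: centers stopped moving
            break
        centers = new_centers
    return centers, labels
```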
+K-Means is a completely different algorithm than K-Nearest Neighbors. K-means is used for clustering, where each point is assigned to one of \(K\) clusters. On the other hand, K-Nearest Neighbors is used for classification (or, less often, regression), and the predicted value is typically the most common class among the \(K\)-nearest data points in the training set. The names may be similar, but there isn’t really anything in common.
+Consider the following example where \(K = 4\):
+Due to the randomness of where the \(K\) centers initialize/start, you will get a different output/clustering every time you run K-Means. Consider three possible K-Means outputs; the algorithm has converged, and the colors denote the final cluster they are clustered as.
+
Which clustering output is the best? To evaluate different clustering results, we need a loss function.
The two common loss functions are:
+In the example above:
+Switching back to the four-cluster example at the beginning of this section, random.seed(25)
had an inertia of 44.96
, random.seed(29)
had an inertia of 45.95
, and random.seed(40)
had an inertia of 54.35
. It seems that the best clustering output was random.seed(25)
with an inertia of 44.96
!
It turns out that the function K-Means is trying to minimize is inertia, but often fails to find global optimum. Why does this happen? We can think of K-means as a pair of optimizers that take turns. The first optimizer holds center positions constant and optimizes data colors. The second optimizer holds data colors constant and optimizes center positions. Neither optimizer gets full control!
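As a quick illustration of this sensitivity to initialization, one might compare the inertia reached from several different random starts. This is a sketch using scikit-learn and an assumed feature array `X`; the exact inertia values depend on the data and seeds, so the numbers quoted above are not reproduced here.

```python
from sklearn.cluster import KMeans

# n_init=1 forces a single initialization per run, so the effect of the seed is visible.
for seed in [25, 29, 40]:
    km = KMeans(n_clusters=4, n_init=1, random_state=seed).fit(X)
    print(f"seed {seed}: inertia = {km.inertia_:.2f}")
```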
+This is a hard problem: give an algorithm that optimizes inertia FOR A GIVEN \(K\); \(K\) is picked in advance. Your algorithm should return the EXACT best centers and colors, but you don’t need to worry about runtime.
+Note: This is a bit of a CS61B/CS70/CS170 problem, so do not worry about completely understanding the tricky predicament we are in too much!
A potential algorithm: for every possible assignment of the \(n\) data points to \(K\) clusters, compute the resulting cluster centers and the corresponding inertia; keep track of the assignment and centers that achieve the lowest inertia seen so far, and return them once all assignments have been tried.
+No better algorithm has been found for solving the problem of minimizing inertia exactly.
+Now, let us consider hierarchical agglomerative clustering.
+
Consider the following results of two K-Means clustering outputs:
Which clustering result do you like better? It seems K-Means likes the one on the right better because it has lower inertia (the sum of squared distances from each data point to its center), but this raises some questions:
Now, let us introduce Hierarchical Agglomerative Clustering! We start with every data point in a separate cluster, and we’ll keep merging the most similar pairs of data points/clusters until we have one big cluster left. This is called a bottom-up or agglomerative method.
There are various ways to decide the order of combining clusters, called the linkage criterion:

- Single linkage: the “distance” between two clusters is the minimum distance between any point in the first cluster and any point in the second.
- Complete linkage: the “distance” is the maximum distance between any point in the first cluster and any point in the second.
- Average linkage: the “distance” is the average distance between points in the first cluster and points in the second.
+The linkage criterion decides how we measure the “distance” between two clusters. Regardless of the criterion we choose, the aim is to combine the two clusters that have the minimum “distance” between them, with the distance computed as per that criterion. In the case of complete linkage, for example, that means picking the two clusters that minimize the maximum distance between a point in the first cluster and a point in the second.
+When the algorithm starts, every data point is in its own cluster. In the plot below, there are 12 data points, so the algorithm starts with 12 clusters. As the clustering begins, it assesses which clusters are the closest together.
+The closest clusters are 10 and 11, so they are merged together.
+Next, points 0 and 4 are merged together because they are closest.
+At this point, we have 10 clusters: 8 with a single point (clusters 1, 2, 3, 4, 5, 6, 7, 8, and 9) and 2 with 2 points (clusters 0 and 10).
+Although clusters 0 and 3 are not the closest, let us consider if we were trying to merge them. A tricky question arises: what is the “distance” between clusters 0 and 3? We can use the Complete-Link approach that uses the max distance among all pairs of points between groups to decide which group has smaller “distance”.
+Let us assume the algorithm runs a little longer, and we have reached the following state. Clusters 0 and 7 are up next, but why? The max line between any member of 0 and 6 is longer than the max line between any member of 0 and 7.
+Thus, 0 and 7 are merged into 0 as they are closer under the complete linkage criterion.
+After more iterations, we finally converge to the plot on the left. There are two clusters (0, 1), and the agglomerative algorithm has converged.
+
Notice that on the full dataset, our agglomerative clustering algorithm achieves the more “correct” output.
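For reference, here is a minimal sketch of how one might run agglomerative clustering with complete linkage using scikit-learn. The feature array `X` and the parameter values are assumptions for illustration, not the exact code behind the plots above.

```python
from sklearn.cluster import AgglomerativeClustering

# Bottom-up clustering: start with every point in its own cluster and repeatedly
# merge the two clusters with the smallest complete-linkage "distance".
agg = AgglomerativeClustering(n_clusters=2, linkage="complete")
labels = agg.fit_predict(X)  # final cluster label (0 or 1) for each row of X
```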
+
+
Some professors use agglomerative clustering for grading bins; if there is a big gap between two people, draw a grading threshold there. The idea is that grade clustering should be more like the figure below on the left, not the right.
+The algorithms we’ve discussed require us to pick a \(K\) before we start. But how do we pick \(K\)? Often, the best \(K\) is subjective. For example, consider the state plot below.
+How many clusters are there here? For K-Means, one approach to determine this is to plot inertia versus many different \(K\) values. We’d pick the \(K\) in the elbow, where we get diminishing returns afterward. Note that big, complicated data often lacks an elbow, so this method is not foolproof. Here, we would likely select \(K = 2\).
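A sketch of how such an elbow plot might be produced with scikit-learn, assuming a feature array `X` (the range of \(K\) values tried is arbitrary):

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# Fit K-Means for several values of K and record the inertia of each fit.
ks = range(1, 11)
inertias = [KMeans(n_clusters=k, n_init=10).fit(X).inertia_ for k in ks]

plt.plot(ks, inertias, marker="o")
plt.xlabel("K")
plt.ylabel("Inertia")
plt.title("Pick K near the 'elbow' of the curve");
```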
+To evaluate how “well-clustered” a specific data point is, we can use the silhouette score, also termed the silhouette width. A high silhouette score indicates that a point is near the other points in its cluster; a low score means that it’s far from the other points in its cluster.
+For a data point \(X\), score \(S\) is: \[S =\frac{B - A}{\max(A, B)}\] where \(A\) is the average distance to other points in the cluster, and \(B\) is the average distance to points in the closest cluster.
Consider what the highest possible value of \(S\) is and how that value can occur. The highest possible value of \(S\) is 1, which happens if every point in \(X\)’s cluster is right on top of \(X\); the average distance to other points in \(X\)’s cluster is \(0\), so \(A = 0\). Thus, \(S = \frac{B}{\max(0, B)} = \frac{B}{B} = 1\). \(S\) also approaches 1 whenever \(B\) is much greater than \(A\) (we denote this as \(B >> A\)).
+Can \(S\) be negative? The answer is yes. If the average distance to X’s clustermates is larger than the distance to the closest cluster, then this is possible. For example, the “low score” point on the right of the image above has \(S = -0.13\).
+We can plot the silhouette scores for all of our datapoints. The x-axis represents the silhouette coefficient value or silhouette score. The y-axis tells us which cluster label the points belong to, as well as the number of points within a particular cluster. Points with large silhouette widths are deeply embedded in their cluster; the red dotted line shows the average. Below, we plot the silhouette score for our plot with \(K=2\).
+
+
+
Similarly, we can plot the silhouette score for the same dataset but with \(K=3\):
+
+
+
The average silhouette score is lower with 3 clusters, so \(K=2\) is a better choice. This aligns with our visual intuition as well.
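A sketch of how these silhouette summaries might be computed, again assuming a feature array `X`: `silhouette_score` returns the average width (the red dotted line in the plots above), while `silhouette_samples` returns the per-point widths.

```python
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_samples, silhouette_score

for k in [2, 3]:
    labels = KMeans(n_clusters=k, n_init=10).fit_predict(X)
    avg_width = silhouette_score(X, labels)   # average silhouette width
    widths = silhouette_samples(X, labels)    # one width per data point
    print(f"K={k}: average width = {avg_width:.3f}, lowest width = {widths.min():.3f}")
```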
Sometimes you can rely on real-world metrics to guide your choice of \(K\). For t-shirts, for example, we could cluster customers’ body measurements into \(K = 3\) sizes (S, M, L) or \(K = 5\) sizes (XS, S, M, L, XL).
+To choose \(K\), consider projected costs and sales for the 2 different \(K\)s and select the one that maximizes profit.
We’ve now discussed a new machine learning goal, clustering, and explored two solutions: K-Means clustering and hierarchical agglomerative clustering.

Our version of these algorithms required a hyperparameter \(K\). There are 3 ways to pick \(K\): the elbow method, silhouette scores, and by harnessing real-world metrics.
+There are many machine learning problems. Each can be addressed by many different solution techniques. Each has many metrics for evaluating success / loss. Many techniques can be used to solve different problem types. For example, linear models can be used for regression and classification.
+We’ve only scratched the surface and haven’t discussed many important ideas, such as neural networks and deep learning. In the last lecture, we’ll provide some specific course recommendations on how to explore these topics further.
Last time, we introduced the modeling process. We set up a framework to predict target variables as functions of our features, following a set workflow:
+To illustrate this process, we derived the optimal model parameters under simple linear regression (SLR) with mean squared error (MSE) as the cost function. A summary of the SLR modeling process is shown below:
In this lecture, we’ll dive deeper into step 4 - evaluating model performance - using SLR as an example. We’ll also continue familiarizing ourselves with the modeling process by finding the best model parameters under a new model, the constant model, and we’ll test out two different loss functions to understand how our choice of loss influences model design. Later on, we’ll consider what happens when a linear model isn’t the best choice to capture trends in our data and what solutions there are to create better models.
+Before we get into Step 4, let’s quickly review some important terminology.
+The terms prediction and estimation are often used somewhat interchangeably, but there is a subtle difference between them. Estimation is the task of using data to calculate model parameters. Prediction is the task of using a model to predict outputs for unseen data. In our simple linear regression model,
\[\hat{y} = \hat{\theta_0} + \hat{\theta_1} x\]
+we estimate the parameters by minimizing average loss; then, we predict using these estimations. Least Squares Estimation is when we choose the parameters that minimize MSE.
+Now that we’ve explored the mathematics behind (1) choosing a model, (2) choosing a loss function, and (3) fitting the model, we’re left with one final question – how “good” are the predictions made by this “best” fitted model? To determine this, we can:
+Visualize data and compute statistics:
+Performance metrics:
+\[\text{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^n (y_i - \hat{y}_i)^2}\]
Visualization:
+To illustrate this process, let’s take a look at Anscombe’s quartet.
+Let’s take a look at four different datasets.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import itertools
from mpl_toolkits.mplot3d import Axes3D
# Big font helper
def adjust_fontsize(size=None):
    SMALL_SIZE = 8
    MEDIUM_SIZE = 10
    BIGGER_SIZE = 12
    if size != None:
        SMALL_SIZE = MEDIUM_SIZE = BIGGER_SIZE = size

    plt.rc("font", size=SMALL_SIZE)  # controls default text sizes
    plt.rc("axes", titlesize=SMALL_SIZE)  # fontsize of the axes title
    plt.rc("axes", labelsize=MEDIUM_SIZE)  # fontsize of the x and y labels
    plt.rc("xtick", labelsize=SMALL_SIZE)  # fontsize of the tick labels
    plt.rc("ytick", labelsize=SMALL_SIZE)  # fontsize of the tick labels
    plt.rc("legend", fontsize=SMALL_SIZE)  # legend fontsize
    plt.rc("figure", titlesize=BIGGER_SIZE)  # fontsize of the figure title


# Helper functions
def standard_units(x):
    return (x - np.mean(x)) / np.std(x)


def correlation(x, y):
    return np.mean(standard_units(x) * standard_units(y))


def slope(x, y):
    return correlation(x, y) * np.std(y) / np.std(x)


def intercept(x, y):
    return np.mean(y) - slope(x, y) * np.mean(x)


def fit_least_squares(x, y):
    theta_0 = intercept(x, y)
    theta_1 = slope(x, y)
    return theta_0, theta_1


def predict(x, theta_0, theta_1):
    return theta_0 + theta_1 * x


def compute_mse(y, yhat):
    return np.mean((y - yhat) ** 2)


plt.style.use("default")  # Revert style to default mpl
"default") # Revert style to default mpl
+ plt.style.use(= range(3)
+ NO_VIZ, RESID, RESID_SCATTER
+
+def least_squares_evaluation(x, y, visualize=NO_VIZ):
+# statistics
+ print(f"x_mean : {np.mean(x):.2f}, y_mean : {np.mean(y):.2f}")
+ print(f"x_stdev: {np.std(x):.2f}, y_stdev: {np.std(y):.2f}")
+ print(f"r = Correlation(x, y): {correlation(x, y):.3f}")
+
+# Performance metrics
+ = fit_least_squares(x, y)
+ ahat, bhat = predict(x, ahat, bhat)
+ yhat print(f"\theta_0: {ahat:.2f}, \theta_1: {bhat:.2f}")
+ print(f"RMSE: {np.sqrt(compute_mse(y, yhat)):.3f}")
+
+# visualization
+ = None, None
+ fig, ax_resid if visualize == RESID_SCATTER:
+ = plt.subplots(1, 2, figsize=(8, 3))
+ fig, axs 0].scatter(x, y)
+ axs[0].plot(x, yhat)
+ axs[0].set_title("LS fit")
+ axs[= axs[1]
+ ax_resid elif visualize == RESID:
+ = plt.figure(figsize=(4, 3))
+ fig = plt.gca()
+ ax_resid
+if ax_resid is not None:
+ - yhat, color="red")
+ ax_resid.scatter(x, y 4, 14], [0, 0], color="black")
+ ax_resid.plot(["Residuals")
+ ax_resid.set_title(
+return fig
# Load in four different datasets: I, II, III, IV
x = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y1 = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]
y2 = [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]
y3 = [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]
x4 = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]
y4 = [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]

anscombe = {
    "I": pd.DataFrame(list(zip(x, y1)), columns=["x", "y"]),
    "II": pd.DataFrame(list(zip(x, y2)), columns=["x", "y"]),
    "III": pd.DataFrame(list(zip(x, y3)), columns=["x", "y"]),
    "IV": pd.DataFrame(list(zip(x4, y4)), columns=["x", "y"]),
}
# Plot the scatter plot and line of best fit
fig, axs = plt.subplots(2, 2, figsize=(10, 10))

for i, dataset in enumerate(["I", "II", "III", "IV"]):
    ans = anscombe[dataset]
    x, y = ans["x"], ans["y"]
    ahat, bhat = fit_least_squares(x, y)
    yhat = predict(x, ahat, bhat)
    axs[i // 2, i % 2].scatter(x, y, alpha=0.6, color="red")  # plot the x, y points
    axs[i // 2, i % 2].plot(x, yhat)  # plot the line of best fit
    axs[i // 2, i % 2].set_xlabel(f"$x_{i+1}$")
    axs[i // 2, i % 2].set_ylabel(f"$y_{i+1}$")
    axs[i // 2, i % 2].set_title(f"Dataset {dataset}")

plt.show()
While these four sets of datapoints look very different, they actually all have identical means \(\bar x\), \(\bar y\), standard deviations \(\sigma_x\), \(\sigma_y\), correlation \(r\), and RMSE! If we only look at these statistics, we would probably be inclined to say that these datasets are similar.
for dataset in ["I", "II", "III", "IV"]:
    print(f">>> Dataset {dataset}:")
    ans = anscombe[dataset]
    fig = least_squares_evaluation(ans["x"], ans["y"], visualize=NO_VIZ)
    print()
    print()
>>> Dataset I:
x_mean : 9.00, y_mean : 7.50
x_stdev: 3.16, y_stdev: 1.94
r = Correlation(x, y): 0.816
theta_0: 3.00, theta_1: 0.50
RMSE: 1.119


>>> Dataset II:
x_mean : 9.00, y_mean : 7.50
x_stdev: 3.16, y_stdev: 1.94
r = Correlation(x, y): 0.816
theta_0: 3.00, theta_1: 0.50
RMSE: 1.119


>>> Dataset III:
x_mean : 9.00, y_mean : 7.50
x_stdev: 3.16, y_stdev: 1.94
r = Correlation(x, y): 0.816
theta_0: 3.00, theta_1: 0.50
RMSE: 1.118


>>> Dataset IV:
x_mean : 9.00, y_mean : 7.50
x_stdev: 3.16, y_stdev: 1.94
r = Correlation(x, y): 0.817
theta_0: 3.00, theta_1: 0.50
RMSE: 1.118
+
+
+We may also wish to visualize the model’s residuals, defined as the difference between the observed and predicted \(y_i\) value (\(e_i = y_i - \hat{y}_i\)). This gives a high-level view of how “off” each prediction is from the true observed value. Recall that you explored this concept in Data 8: a good regression fit should display no clear pattern in its plot of residuals. The residual plots for Anscombe’s quartet are displayed below. Note how only the first plot shows no clear pattern to the magnitude of residuals. This is an indication that SLR is not the best choice of model for the remaining three sets of points.
+ +# Residual visualization
+= plt.subplots(2, 2, figsize=(10, 10))
+ fig, axs
+for i, dataset in enumerate(["I", "II", "III", "IV"]):
+= anscombe[dataset]
+ ans = ans["x"], ans["y"]
+ x, y = fit_least_squares(x, y)
+ ahat, bhat = predict(x, ahat, bhat)
+ yhat // 2, i % 2].scatter(
+ axs[i - yhat, alpha=0.6, color="red"
+ x, y # plot the x, y points
+ ) // 2, i % 2].plot(
+ axs[i ="black"
+ x, np.zeros_like(x), color# plot the residual line
+ ) // 2, i % 2].set_xlabel(f"$x_{i+1}$")
+ axs[i // 2, i % 2].set_ylabel(f"$e_{i+1}$")
+ axs[i // 2, i % 2].set_title(f"Dataset {dataset} Residuals")
+ axs[i
+ plt.show()
Now, we’ll shift from the SLR model to the constant model, also known as a summary statistic. The constant model is slightly different from the simple linear regression model we’ve explored previously. Rather than generating predictions from an inputted feature variable, the constant model always predicts the same constant number. This ignores any relationships between variables. For example, let’s say we want to predict the number of drinks a boba shop sells in a day. Boba tea sales likely depend on the time of year, the weather, how the customers feel, whether school is in session, etc., but the constant model ignores these factors in favor of a simpler model. In other words, the constant model employs a simplifying assumption.
+It is also a parametric, statistical model:
+\[\hat{y} = \theta_0\]
+\(\theta_0\) is the parameter of the constant model, just as \(\theta_0\) and \(\theta_1\) were the parameters in SLR. Since our parameter \(\theta_0\) is 1-dimensional (\(\theta_0 \in \mathbb{R}\)), we now have no input to our model and will always predict \(\hat{y} = \theta_0\).
+Our task now is to determine what value of \(\theta_0\) best represents the optimal model – in other words, what number should we guess each time to have the lowest possible average loss on our data?
+Like before, we’ll use Mean Squared Error (MSE). Recall that the MSE is average squared loss (L2 loss) over the data \(D = \{y_1, y_2, ..., y_n\}\).
+\[\hat{R}(\theta) = \frac{1}{n}\sum^{n}_{i=1} (y_i - \hat{y_i})^2 \]
+Our modeling process now looks like this:
+Given the constant model \(\hat{y} = \theta_0\), we can rewrite the MSE equation as
+\[\hat{R}(\theta) = \frac{1}{n}\sum^{n}_{i=1} (y_i - \theta_0)^2 \]
+We can fit the model by finding the optimal \(\hat{\theta_0}\) that minimizes the MSE using a calculus approach.
+\[ +\begin{align} +\frac{d}{d\theta_0}\text{R}(\theta) & = \frac{d}{d\theta_0}(\frac{1}{n}\sum^{n}_{i=1} (y_i - \theta_0)^2) +\\ &= \frac{1}{n}\sum^{n}_{i=1} \frac{d}{d\theta_0} (y_i - \theta_0)^2 \quad \quad \text{a derivative of sums is a sum of derivatives} +\\ &= \frac{1}{n}\sum^{n}_{i=1} 2 (y_i - \theta_0) (-1) \quad \quad \text{chain rule} +\\ &= {\frac{-2}{n}}\sum^{n}_{i=1} (y_i - \theta_0) \quad \quad \text{simply constants} +\end{align} +\]
+Set the derivative equation equal to 0:
+\[ +0 = {\frac{-2}{n}}\sum^{n}_{i=1} (y_i - \hat{\theta_0}) +\]
Solve for \(\hat{\theta_0}\)
\[ +\begin{align} +0 &= {\frac{-2}{n}}\sum^{n}_{i=1} (y_i - \hat{\theta_0}) +\\ &= \sum^{n}_{i=1} (y_i - \hat{\theta_0}) \quad \quad \text{divide both sides by} \frac{-2}{n} +\\ &= \left(\sum^{n}_{i=1} y_i\right) - \left(\sum^{n}_{i=1} \theta_0\right) \quad \quad \text{separate sums} +\\ &= \left(\sum^{n}_{i=1} y_i\right) - (n \cdot \hat{\theta_0}) \quad \quad \text{c + c + … + c = nc} +\\ n \cdot \hat{\theta_0} &= \sum^{n}_{i=1} y_i +\\ \hat{\theta_0} &= \frac{1}{n} \sum^{n}_{i=1} y_i +\\ \hat{\theta_0} &= \bar{y} +\end{align} +\]
+Let’s take a moment to interpret this result. \(\hat{\theta_0} = \bar{y}\) is the optimal parameter for constant model + MSE. It holds true regardless of what data sample you have, and it provides some formal reasoning as to why the mean is such a common summary statistic.
+Our optimal model parameter is the value of the parameter that minimizes the cost function. This minimum value of the cost function can be expressed:
+\[R(\hat{\theta_0}) = \min_{\theta_0} R(\theta_0)\]
+To restate the above in plain English: we are looking at the value of the cost function when it takes the best parameter as input. This optimal model parameter, \(\hat{\theta_0}\), is the value of \(\theta_0\) that minimizes the cost \(R\).
+For modeling purposes, we care less about the minimum value of cost, \(R(\hat{\theta_0})\), and more about the value of \(\theta\) that results in this lowest average loss. In other words, we concern ourselves with finding the best parameter value such that:
+\[\hat{\theta} = \underset{\theta}{\operatorname{\arg\min}}\:R(\theta)\]
+That is, we want to find the argument \(\theta\) that minimizes the cost function.
+Now that we’ve explored the constant model with an L2 loss, we can compare it to the SLR model that we learned last lecture. Consider the dataset below, which contains information about the ages and lengths of dugongs. Supposed we wanted to predict dugong ages:
++ | Constant Model | +Simple Linear Regression | +
---|---|---|
model | +\(\hat{y} = \theta_0\) | +\(\hat{y} = \theta_0 + \theta_1 x\) | +
data | +sample of ages \(D = \{y_1, y_2, ..., y_n\}\) | +sample of ages \(D = \{(x_1, y_1), (x_2, y_2), ..., (x_n, y_n)\}\) | +
dimensions | +\(\hat{\theta_0}\) is 1-D | +\(\hat{\theta} = [\hat{\theta_0}, \hat{\theta_1}]\) is 2-D | +
loss surface | +2-D ![]() |
+3-D ![]() |
+
loss model | +\(\hat{R}(\theta) = \frac{1}{n}\sum^{n}_{i=1} (y_i - \theta_0)^2\) | +\(\hat{R}(\theta_0, \theta_1) = \frac{1}{n}\sum^{n}_{i=1} (y_i - (\theta_0 + \theta_1 x))^2\) | +
RMSE | +7.72 | +4.31 | +
predictions visualized | +rug plot ![]() |
+scatter plot ![]() |
+
(Notice how the points for our SLR scatter plot are visually not a great linear fit. We’ll come back to this).
+The code for generating the graphs and models is included below, but we won’t go over it in too much depth.
+= pd.read_csv("data/dugongs.csv")
+ dugongs = dugongs["Age"]
+ data_constant = dugongs[["Length", "Age"]] data_linear
# Constant Model + MSE
+'default') # Revert style to default mpl
+ plt.style.use(=16)
+ adjust_fontsize(size%matplotlib inline
+
+def mse_constant(theta, data):
+return np.mean(np.array([(y_obs - theta) ** 2 for y_obs in data]), axis=0)
+
+= np.linspace(-20, 42, 1000)
+ thetas = mse_constant(thetas, data_constant)
+ l2_loss_thetas
+# Plotting the loss surface
+
+ plt.plot(thetas, l2_loss_thetas)r'$\theta_0$')
+ plt.xlabel(r'MSE')
+ plt.ylabel(
+# Optimal point
+= np.mean(data_constant)
+ thetahat =50, label = r"$\hat{\theta}_0$")
+ plt.scatter([thetahat], [mse_constant(thetahat, data_constant)], s;
+ plt.legend()# plt.show()
# SLR + MSE
+def mse_linear(theta_0, theta_1, data_linear):
+= data_linear.iloc[:, 0], data_linear.iloc[:, 1]
+ data_x, data_y return np.mean(
+ - (theta_0 + theta_1 * x)) ** 2 for x, y in zip(data_x, data_y)]),
+ np.array([(y =0,
+ axis
+ )
+
+# plotting the loss surface
+= np.linspace(-80, 20, 80)
+ theta_0_values = np.linspace(-10, 30, 80)
+ theta_1_values = np.array(
+ mse_values for x in theta_0_values] for y in theta_1_values]
+ [[mse_linear(x, y, data_linear)
+ )
+# Optimal point
+= data_linear.iloc[:, 0], data_linear.iloc[:, 1]
+ data_x, data_y = np.corrcoef(data_x, data_y)[0, 1] * np.std(data_y) / np.std(data_x)
+ theta_1_hat = np.mean(data_y) - theta_1_hat * np.mean(data_x)
+ theta_0_hat
+# Create the 3D plot
+= plt.figure(figsize=(7, 5))
+ fig = fig.add_subplot(111, projection="3d")
+ ax
+= np.meshgrid(theta_0_values, theta_1_values)
+ X, Y = ax.plot_surface(
+ surf ="viridis", alpha=0.6
+ X, Y, mse_values, cmap# Use alpha to make it slightly transparent
+ )
+# Scatter point using matplotlib
+= ax.scatter(
+ sc
+ [theta_0_hat],
+ [theta_1_hat],
+ [mse_linear(theta_0_hat, theta_1_hat, data_linear)],="o",
+ marker="red",
+ color=100,
+ s="theta hat",
+ label
+ )
+# Create a colorbar
+= fig.colorbar(surf, ax=ax, shrink=0.5, aspect=10)
+ cbar "Cost Value")
+ cbar.set_label(
+"MSE for different $\\theta_0, \\theta_1$")
+ ax.set_title("$\\theta_0$")
+ ax.set_xlabel("$\\theta_1$")
+ ax.set_ylabel("MSE")
+ ax.set_zlabel(
+# plt.show()
Text(0.5, 0, 'MSE')
+# Predictions
+= data_linear["Age"] # The true observations y
+ yobs = data_linear["Length"] # Needed for linear predictions
+ xs = len(yobs) # Predictions
+ n
+= [thetahat for i in range(n)] # Not used, but food for thought
+ yhats_constant = [theta_0_hat + theta_1_hat * x for x in xs] yhats_linear
# Constant Model Rug Plot
+# In case we're in a weird style state
+
+ sns.set_theme()=16)
+ adjust_fontsize(size%matplotlib inline
+
+= plt.figure(figsize=(8, 1.5))
+ fig =0.25, lw=2) ;
+ sns.rugplot(yobs, height='red', lw=4, label=r"$\hat{\theta}_0$");
+ plt.axvline(thetahat, color
+ plt.legend();
+ plt.yticks([])# plt.show()
# SLR model scatter plot
+# In case we're in a weird style state
+
+ sns.set_theme()=16)
+ adjust_fontsize(size%matplotlib inline
+
+=xs, y=yobs)
+ sns.scatterplot(x='red', lw=4);
+ plt.plot(xs, yhats_linear, color# plt.savefig('dugong_line.png', bbox_inches = 'tight');
+# plt.show()
Interpreting the RMSE (Root Mean Squared Error):
+We see now that changing the model used for prediction leads to a wildly different result for the optimal model parameter. What happens if we instead change the loss function used in model evaluation?
+This time, we will consider the constant model with L1 (absolute loss) as the loss function. This means that the average loss will be expressed as the Mean Absolute Error (MAE).
+Recall that the MAE is average absolute loss (L1 loss) over the data \(D = \{y_1, y_2, ..., y_n\}\).
+\[\hat{R}(\theta_0) = \frac{1}{n}\sum^{n}_{i=1} |y_i - \hat{y_i}| \]
+Given the constant model \(\hat{y} = \theta_0\), we can write the MAE as:
+\[\hat{R}(\theta_0) = \frac{1}{n}\sum^{n}_{i=1} |y_i - \theta_0| \]
+To fit the model, we find the optimal parameter value \(\hat{\theta_0}\) that minimizes the MAE by differentiating using a calculus approach:
+\[ +\begin{align} +\hat{R}(\theta_0) &= \frac{1}{n}\sum^{n}_{i=1} |y_i - \theta_0| \\ +\frac{d}{d\theta_0} R(\theta_0) &= \frac{d}{d\theta_0} \left(\frac{1}{n} \sum^{n}_{i=1} |y_i - \theta_0| \right) \\ +&= \frac{1}{n} \sum^{n}_{i=1} \frac{d}{d\theta_0} |y_i - \theta_0| +\end{align} +\]
+\[|y_i - \theta_0| = \begin{cases} y_i - \theta_0 \quad \text{ if } \theta_0 \le y_i \\ \theta_0 - y_i \quad \text{if }\theta_0 > y_i \end{cases}\]
+\[\frac{d}{d\theta_0} |y_i - \theta_0| = \begin{cases} \frac{d}{d\theta_0} (y_i - \theta_0) = -1 \quad \text{if }\theta_0 < y_i \\ \frac{d}{d\theta_0} (\theta_0 - y_i) = 1 \quad \text{if }\theta_0 > y_i \end{cases}\]
+\[ +\frac{d}{d\theta_0} R(\theta_0) = \frac{1}{n} \sum^{n}_{i=1} \frac{d}{d\theta_0} |y_i - \theta_0| \\ += \frac{1}{n} \left[\sum_{\theta_0 < y_i} (-1) + \sum_{\theta_0 > y_i} (+1) \right] +\]
+Set the derivative equation equal to 0: \[ 0 = \frac{1}{n}\sum_{\hat{\theta_0} < y_i} (-1) + \frac{1}{n}\sum_{\hat{\theta_0} > y_i} (+1) \]
Solve for \(\hat{\theta_0}\): \[ 0 = -\frac{1}{n}\sum_{\hat{\theta_0} < y_i} (1) + \frac{1}{n}\sum_{\hat{\theta_0} > y_i} (1)\]
\[\sum_{\hat{\theta_0} < y_i} (1) = \sum_{\hat{\theta_0} > y_i} (1) \]
+Thus, the constant model parameter \(\theta = \hat{\theta_0}\) that minimizes MAE must satisfy:
+\[ \sum_{\hat{\theta_0} < y_i} (1) = \sum_{\hat{\theta_0} > y_i} (1) \]
+In other words, the number of observations greater than \(\theta_0\) must be equal to the number of observations less than \(\theta_0\); there must be an equal number of points on the left and right sides of the equation. This is the definition of median, so our optimal value is \[ \hat{\theta_0} = median(y) \]
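A quick numerical check of this result, using the small drink-sales sample that is revisited below when comparing MSE and MAE:

```python
import numpy as np

y = np.array([20, 21, 22, 29, 33])  # drink sales sample used later in this note
mae = lambda theta0: np.mean(np.abs(y - theta0))

theta_hat = np.median(y)
print(mae(theta_hat))                          # MAE at the median (theta_0 = 22)
print(mae(theta_hat - 1), mae(theta_hat + 1))  # both larger
```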
+First, define the objective function as average loss.
+Then, find the minimum of the objective function:
+Recall critical points from calculus: \(R(\hat{\theta})\) could be a minimum, maximum, or saddle point!
+We’ve now tried our hand at fitting a model under both MSE and MAE cost functions. How do the two results compare?
+Let’s consider a dataset where each entry represents the number of drinks sold at a bubble tea store each day. We’ll fit a constant model to predict the number of drinks that will be sold tomorrow.
+= np.array([20, 21, 22, 29, 33])
+ drinks drinks
array([20, 21, 22, 29, 33])
+From our derivations above, we know that the optimal model parameter under MSE cost is the mean of the dataset. Under MAE cost, the optimal parameter is the median of the dataset.
+ np.mean(drinks), np.median(drinks)
(np.float64(25.0), np.float64(22.0))
+If we plot each empirical risk function across several possible values of \(\theta\), we find that each \(\hat{\theta}\) does indeed correspond to the lowest value of error:
+Notice that the MSE above is a smooth function – it is differentiable at all points, making it easy to minimize using numerical methods. The MAE, in contrast, is not differentiable at each of its “kinks.” We’ll explore how the smoothness of the cost function can impact our ability to apply numerical optimization in a few weeks.
+How do outliers affect each cost function? Imagine we replace the largest value in the dataset with 1000. The mean of the data increases substantially, while the median is nearly unaffected.
+= np.append(drinks, 1033)
+ drinks_with_outlier
+ display(drinks_with_outlier) np.mean(drinks_with_outlier), np.median(drinks_with_outlier)
array([ 20, 21, 22, 29, 33, 1033])
+(np.float64(193.0), np.float64(25.5))
+This means that under the MSE, the optimal model parameter \(\hat{\theta}\) is strongly affected by the presence of outliers. Under the MAE, the optimal parameter is not as influenced by outlying data. We can generalize this by saying that the MSE is sensitive to outliers, while the MAE is robust to outliers.
+Let’s try another experiment. This time, we’ll add an additional, non-outlying datapoint to the data.
+= np.append(drinks, 35)
+ drinks_with_additional_observation drinks_with_additional_observation
array([20, 21, 22, 29, 33, 35])
+When we again visualize the cost functions, we find that the MAE now plots a horizontal line between 22 and 29. This means that there are infinitely many optimal values for the model parameter: any value \(\hat{\theta} \in [22, 29]\) will minimize the MAE. In contrast, the MSE still has a single best value for \(\hat{\theta}\). In other words, the MSE has a unique solution for \(\hat{\theta}\); the MAE is not guaranteed to have a single unique solution.
+
To summarize our example,
++ | MSE (Mean Squared Loss) | +MAE (Mean Absolute Loss) | +
---|---|---|
Loss Function | +\(\hat{R}(\theta) = \frac{1}{n}\sum^{n}_{i=1} (y_i - \theta_0)^2\) | +\(\hat{R}(\theta) = \frac{1}{n}\sum^{n}_{i=1} |y_i - \theta_0|\) | +
Optimal \(\hat{\theta_0}\) | +\(\hat{\theta_0} = mean(y) = \bar{y}\) | +\(\hat{\theta_0} = median(y)\) | +
Loss Surface | +![]() |
+![]() |
+
Shape | +Smooth - easy to minimize using numerical methods (in a few weeks) | +Piecewise - at each of the “kinks,” it’s not differentiable. Harder to minimize. | +
Outliers | +Sensitive to outliers (since they change mean substantially). Sensitivity also depends on the dataset size. | +More robust to outliers. | +
\(\hat{\theta_0}\) Uniqueness | +Unique \(\hat{\theta_0}\) | +Infinitely many \(\hat{\theta_0}\)s | +
At this point, we have an effective method of fitting models to predict linear relationships. Given a feature variable and target, we can apply our four-step process to find the optimal model parameters.
+A key word above is linear. When we computed parameter estimates earlier, we assumed that \(x_i\) and \(y_i\) shared a roughly linear relationship. Data in the real world isn’t always so straightforward, but we can transform the data to try and obtain linearity.
+The Tukey-Mosteller Bulge Diagram is a useful tool for summarizing what transformations can linearize the relationship between two variables. To determine what transformations might be appropriate, trace the shape of the “bulge” made by your data. Find the quadrant of the diagram that matches this bulge. The transformations shown on the vertical and horizontal axes of this quadrant can help improve the fit between the variables.
+Note that:
+Other goals in addition to linearity are possible, for example, making data appear more symmetric. Linearity allows us to fit lines to the transformed data.
+Let’s revisit our dugongs example. The lengths and ages are plotted below:
# `corrcoef` computes the correlation coefficient between two variables
# `std` finds the standard deviation
x = dugongs["Length"]
y = dugongs["Age"]
r = np.corrcoef(x, y)[0, 1]
theta_1 = r * np.std(y) / np.std(x)
theta_0 = np.mean(y) - theta_1 * np.mean(x)

fig, ax = plt.subplots(1, 2, dpi=200, figsize=(8, 3))
ax[0].scatter(x, y)
ax[0].set_xlabel("Length")
ax[0].set_ylabel("Age")

ax[1].scatter(x, y)
ax[1].plot(x, theta_0 + theta_1 * x, "tab:red")
ax[1].set_xlabel("Length")
ax[1].set_ylabel("Age")
Text(0, 0.5, 'Age')
+Looking at the plot on the left, we see that there is a slight curvature to the data points. Plotting the SLR curve on the right results in a poor fit.
+For SLR to perform well, we’d like there to be a rough linear trend relating "Age"
and "Length"
. What is making the raw data deviate from a linear relationship? Notice that the data points with "Length"
greater than 2.6 have disproportionately high values of "Age"
relative to the rest of the data. If we could manipulate these data points to have lower "Age"
values, we’d “shift” these points downwards and reduce the curvature in the data. Applying a logarithmic transformation to \(y_i\) (that is, taking \(\log(\) "Age"
\()\) ) would achieve just that.
An important word on \(\log\): in Data 100 (and most upper-division STEM courses), \(\log\) denotes the natural logarithm with base \(e\). The base-10 logarithm, where relevant, is indicated by \(\log_{10}\).
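As a quick sanity check of this convention in code (np.log is the natural logarithm, np.log10 is base 10):

np.log(np.e), np.log10(100)  # evaluates to (1.0, 2.0)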
z = np.log(y)

r = np.corrcoef(x, z)[0, 1]
theta_1 = r * np.std(z) / np.std(x)
theta_0 = np.mean(z) - theta_1 * np.mean(x)

fig, ax = plt.subplots(1, 2, dpi=200, figsize=(8, 3))
ax[0].scatter(x, z)
ax[0].set_xlabel("Length")
ax[0].set_ylabel(r"$\log{(Age)}$")

ax[1].scatter(x, z)
ax[1].plot(x, theta_0 + theta_1 * x, "tab:red")
ax[1].set_xlabel("Length")
ax[1].set_ylabel(r"$\log{(Age)}$")

plt.subplots_adjust(wspace=0.3)
Our SLR fit looks a lot better! We now have a new target variable: the SLR model is now trying to predict the log of "Age"
, rather than the untransformed "Age"
. In other words, we are applying the transformation \(z_i = \log{(y_i)}\). Notice that the resulting model is still linear in the parameters \(\theta = [\theta_0, \theta_1]\). The SLR model becomes:
\[\hat{\log{y}} = \theta_0 + \theta_1 x\] \[\hat{z} = \theta_0 + \theta_1 x\]
+It turns out that this linearized relationship can help us understand the underlying relationship between \(x\) and \(y\). If we rearrange the relationship above, we find:
\[\log{(y)} = \theta_0 + \theta_1 x\] \[y = e^{\theta_0 + \theta_1 x}\] \[y = (e^{\theta_0})e^{\theta_1 x}\] \[y = C e^{k x}\]
for some constants \(C\) and \(k\).
+\(y\) is an exponential function of \(x\). Applying an exponential fit to the untransformed variables corroborates this finding.
plt.figure(dpi=120, figsize=(4, 3))

plt.scatter(x, y)
plt.plot(x, np.exp(theta_0) * np.exp(theta_1 * x), "tab:red")
plt.xlabel("Length")
plt.ylabel("Age")
Text(0, 0.5, 'Age')
+You may wonder: why did we choose to apply a log transformation specifically? Why not some other function to linearize the data?
+Practically, many other mathematical operations that modify the relative scales of "Age"
and "Length"
could have worked here.
Multiple linear regression is an extension of simple linear regression that adds additional features to the model. The multiple linear regression model takes the form:
+\[\hat{y} = \theta_0\:+\:\theta_1x_{1}\:+\:\theta_2 x_{2}\:+\:...\:+\:\theta_p x_{p}\]
+Our predicted value of \(y\), \(\hat{y}\), is a linear combination of the single observations (features), \(x_i\), and the parameters, \(\theta_i\).
+We’ll dive deeper into Multiple Linear Regression in the next lecture.
+Earlier, we calculated the constant model MSE using calculus. It turns out that there is a much more elegant way of performing this same minimization algebraically, without using calculus at all.
+In this calculation, we use the fact that the sum of deviations from the mean is \(0\) or that \(\sum_{i=1}^{n} (y_i - \bar{y}) = 0\).
+Let’s quickly walk through the proof for this: \[ +\begin{align} +\sum_{i=1}^{n} (y_i - \bar{y}) &= \sum_{i=1}^{n} y_i - \sum_{i=1}^{n} \bar{y} \\ +&= \sum_{i=1}^{n} y_i - n\bar{y} \\ +&= \sum_{i=1}^{n} y_i - n\frac{1}{n}\sum_{i=1}^{n}y_i \\ +&= \sum_{i=1}^{n} y_i - \sum_{i=1}^{n}y_i \\ +& = 0 +\end{align} +\]
+In our calculations, we’ll also be using the definition of the variance as a sample. As a refresher:
+\[\sigma_y^2 = \frac{1}{n}\sum_{i=1}^{n} (y_i - \bar{y})^2\]
+Getting into our calculation for MSE minimization:
+\[ +\begin{align} +R(\theta) &= {\frac{1}{n}}\sum^{n}_{i=1} (y_i - \theta)^2 +\\ &= \frac{1}{n}\sum^{n}_{i=1} [(y_i - \bar{y}) + (\bar{y} - \theta)]^2\quad \quad \text{using trick that a-b can be written as (a-c) + (c-b) } \\ +&\quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \space \space \text{where a, b, and c are any numbers} +\\ &= \frac{1}{n}\sum^{n}_{i=1} [(y_i - \bar{y})^2 + 2(y_i - \bar{y})(\bar{y} - \theta) + (\bar{y} - \theta)^2] +\\ &= \frac{1}{n}[\sum^{n}_{i=1}(y_i - \bar{y})^2 + 2(\bar{y} - \theta)\sum^{n}_{i=1}(y_i - \bar{y}) + n(\bar{y} - \theta)^2] \quad \quad \text{distribute sum to individual terms} +\\ &= \frac{1}{n}\sum^{n}_{i=1}(y_i - \bar{y})^2 + \frac{2}{n}(\bar{y} - \theta)\cdot0 + (\bar{y} - \theta)^2 \quad \quad \text{sum of deviations from mean is 0} +\\ &= \sigma_y^2 + (\bar{y} - \theta)^2 +\end{align} +\]
Since variance can’t be negative, we know that our first term, \(\sigma_y^2\), is greater than or equal to \(0\). Also note that the first term doesn’t involve \(\theta\) at all, meaning changing our model won’t change this value. For the purposes of determining \(\hat{\theta}\), we can then essentially ignore this term.
+Looking at the second term, \((\bar{y} - \theta)^2\), since it is squared, we know it must be greater than or equal to \(0\). As this term does involve \(\theta\), picking the value of \(\theta\) that minimizes this term will allow us to minimize our average loss. For the second term to equal \(0\), \(\theta = \bar{y}\), or in other words, \(\hat{\theta} = \bar{y} = mean(y)\).
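As a quick numerical sanity check of this decomposition (reusing the drinks array from earlier in this note; the choice of \(\theta = 30\) is arbitrary):

theta = 30
empirical_risk = np.mean((drinks - theta) ** 2)
decomposition = np.var(drinks) + (np.mean(drinks) - theta) ** 2
empirical_risk, decomposition  # the two values agree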
+In the derivation above, we decompose the expected loss, \(R(\theta)\), into two key components: the variance of the data, \(\sigma_y^2\), and the square of the bias, \((\bar{y} - \theta)^2\). This decomposition is insightful for understanding the behavior of estimators in statistical models.
+Variance, \(\sigma_y^2\): This term represents the spread of the data points around their mean, \(\bar{y}\), and is a measure of the data’s inherent variability. Importantly, it does not depend on the choice of \(\theta\), meaning it’s a fixed property of the data. Variance serves as an indicator of the data’s dispersion and is crucial in understanding the dataset’s structure, but it remains constant regardless of how we adjust our model parameter \(\theta\).
Bias Squared, \((\bar{y} - \theta)^2\): This term captures the bias of the estimator, defined as the square of the difference between the mean of the data points, \(\bar{y}\), and the parameter \(\theta\). The bias quantifies the systematic error introduced when estimating \(\theta\). Minimizing this term is essential for improving the accuracy of the estimator. When \(\theta = \bar{y}\), the bias is \(0\), indicating that the estimator is unbiased for the parameter it estimates. This highlights a critical principle in statistical estimation: choosing \(\theta\) to be the sample mean, \(\bar{y}\), minimizes the average loss, rendering the estimator both efficient and unbiased for the population mean.
At the end of the Feature Engineering lecture (Lecture 14), we arrived at the issue of fine-tuning model complexity. We identified that a model that’s too complex can lead to overfitting while a model that’s too simple can lead to underfitting. This brings us to a natural question: how do we control model complexity to avoid under- and overfitting?
+To answer this question, we will need to address two things: first, we need to understand when our model begins to overfit by assessing its performance on unseen data. We can achieve this through cross-validation. Secondly, we need to introduce a technique to adjust the complexity of our models ourselves – to do so, we will apply regularization.
+From the last lecture, we learned that increasing model complexity decreased our model’s training error but increased its variance. This makes intuitive sense: adding more features causes our model to fit more closely to data it encountered during training, but it generalizes worse to new data that hasn’t been seen before. For this reason, a low training error is not always representative of our model’s underlying performance – we need to also assess how well it performs on unseen data to ensure that it is not overfitting.
+Truly, the only way to know when our model overfits is by evaluating it on unseen data. Unfortunately, that means we need to wait for more data. This may be very expensive and time-consuming.
+How should we proceed? In this section, we will build up a viable solution to this problem.
The simplest approach to avoid overfitting is to keep some of our data “secret” from ourselves. We can set aside a random portion of our full dataset to use only for testing purposes. The datapoints in this test set will not be used to fit the model. Instead, we will:

1. Use the training set to fit the model and find its optimal parameters.
2. Use the test set to evaluate the fitted model’s performance on unseen data.
Importantly, the optimal model parameters are found by considering only the data in the training set. After the model has been fitted to the training data, we do not change any parameters before making predictions on the test set. We treat the test set performance as the final measure of how well the model does, so the test set is only ever touched once: to compute the model’s performance after all fine-tuning and model design have been completely finalized.
+The process of sub-dividing our dataset into training and test sets is known as a train-test split. Typically, between 10% and 20% of the data is allocated to the test set.
+In sklearn
, the train_test_split
function (documentation) of the model_selection
module allows us to automatically generate train-test splits.
We will work with the vehicles
dataset from previous lectures. As before, we will attempt to predict the mpg
of a vehicle from transformations of its hp
. In the cell below, we allocate 20% of the full dataset to testing, and the remaining 80% to training.
import pandas as pd
import numpy as np
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

# Load the dataset and construct the design matrix
vehicles = sns.load_dataset("mpg").rename(columns={"horsepower": "hp"}).dropna()
X = vehicles[["hp"]]
X["hp^2"] = vehicles["hp"]**2
X["hp^3"] = vehicles["hp"]**3
X["hp^4"] = vehicles["hp"]**4

Y = vehicles["mpg"]
from sklearn.model_selection import train_test_split
+
+# `test_size` specifies the proportion of the full dataset that should be allocated to testing
+# `random_state` makes our results reproducible for educational purposes
X_train, X_test, Y_train, Y_test = train_test_split(
    X,
    Y,
    test_size=0.2,
    random_state=220
)
+print(f"Size of full dataset: {X.shape[0]} points")
+print(f"Size of training set: {X_train.shape[0]} points")
+print(f"Size of test set: {X_test.shape[0]} points")
Size of full dataset: 392 points
+Size of training set: 313 points
+Size of test set: 79 points
+After performing our train-test split, we fit a model to the training set and assess its performance on the test set.
+import sklearn.linear_model as lm
+from sklearn.metrics import mean_squared_error
+
model = lm.LinearRegression()

# Fit to the training set
model.fit(X_train, Y_train)

# Calculate errors
train_error = mean_squared_error(Y_train, model.predict(X_train))
test_error = mean_squared_error(Y_test, model.predict(X_test))
+print(f"Training error: {train_error}")
+print(f"Test error: {test_error}")
Training error: 17.858516841012097
+Test error: 23.19240562932651
+Now, what if we were dissatisfied with our test set performance? With our current framework, we’d be stuck. As outlined previously, assessing model performance on the test set is the final stage of the model design process; we can’t go back and adjust our model based on the new discovery that it is overfitting. If we did, then we would be factoring in information from the test set to design our model. The test error would no longer be a true representation of the model’s performance on unseen data!
Our solution is to introduce a validation set. A validation set is a random portion of the training set that is set aside for assessing model performance while the model is still being developed. The process for using a validation set is:

1. Perform a train-test split and set the test set aside.
2. Set aside a portion of the training set to be used for validation.
3. Fit the model parameters to the datapoints contained in the remaining portion of the training set.
4. Assess the model’s performance on the validation set. Adjust the model as needed, re-fit it to the remaining training data, and re-evaluate it on the validation set, repeating as necessary.
5. After all model development is complete, assess the final model’s performance on the test set.
+The process of creating a validation set is called a validation split.
+Note that the validation error behaves quite differently from the training error explored previously. As the model becomes more complex, it makes better predictions on the training data; the variance of the model typically increases as model complexity increases. Validation error, on the other hand, decreases then increases as we increase model complexity. This reflects the transition from under- to overfitting: at low model complexity, the model underfits because it is not complex enough to capture the main trends in the data; at high model complexity, the model overfits because it “memorizes” the training data too closely.
+We can update our understanding of the relationships between error, complexity, and model variance:
+Our goal is to train a model with complexity near the orange dotted line – this is where our model minimizes the validation error. Note that this relationship is a simplification of the real-world, but it’s a good enough approximation for the purposes of Data 100.
+Introducing a validation set gave us an “extra” chance to assess model performance on another set of unseen data. We are able to finetune the model design based on its performance on this one set of validation data.
+But what if, by random chance, our validation set just happened to contain many outliers? It is possible that the validation datapoints we set aside do not actually represent other unseen data that the model might encounter. Ideally, we would like to validate our model’s performance on several different unseen datasets. This would give us greater confidence in our understanding of how the model behaves on new data.
+Let’s think back to our validation framework. Earlier, we set aside \(x\)% of our training data (say, 20%) to use for validation.
+In the example above, we set aside the first 20% of training datapoints for the validation set. This was an arbitrary choice. We could have set aside any 20% portion of the training data for validation. In fact, there are 5 non-overlapping “chunks” of training points that we could have designated as the validation set.
+The common term for one of these chunks is a fold. In the example above, we had 5 folds, each containing 20% of the training data. This gives us a new perspective: we really have 5 validation sets “hidden” in our training set.
In cross-validation, we perform validation splits for each fold in the training set. For a dataset with \(K\) folds, we:

1. Set aside one fold as the validation fold.
2. Fit the model to the training data from the remaining \(K-1\) folds.
3. Compute the model’s error on the validation fold.
4. Repeat until every fold has served as the validation fold once, then average the \(K\) validation errors. This average is the cross-validation error (a rough sketch of this loop in code is given below).
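The sketch below reuses X_train, Y_train, lm, np, and mean_squared_error from the cells above; the 5-fold setting and random_state=100 are choices made purely for illustration.

from sklearn.model_selection import KFold

kf = KFold(n_splits=5, shuffle=True, random_state=100)
fold_errors = []
for train_idx, val_idx in kf.split(X_train):
    # Fit to everything except the held-out fold
    fold_model = lm.LinearRegression()
    fold_model.fit(X_train.iloc[train_idx], Y_train.iloc[train_idx])
    # Evaluate on the held-out (validation) fold
    val_predictions = fold_model.predict(X_train.iloc[val_idx])
    fold_errors.append(mean_squared_error(Y_train.iloc[val_idx], val_predictions))

np.mean(fold_errors)  # the cross-validation error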
+At this stage, we have refined our model selection workflow. We begin by performing a train-test split to set aside a test set for the final evaluation of model performance. Then, we alternate between adjusting our design matrix and computing the cross-validation error to finetune the model’s design. In the example below, we illustrate the use of 4-fold cross-validation to help inform model design.
+An important use of cross-validation is for hyperparameter selection. A hyperparameter is some value in a model that is chosen before the model is fit to any data. This means that it is distinct from the model parameters, \(\theta_i\), because its value is selected before the training process begins. We cannot use our usual techniques – calculus, ordinary least squares, or gradient descent – to choose its value. Instead, we must decide it ourselves.
Some examples of hyperparameters in Data 100 are:

- the degree of our polynomial model (recall that we select the degree before creating our design matrix and calling .fit),
- the learning rate, \(\alpha\), in gradient descent, and
- the regularization penalty, \(\lambda\), which we will introduce later in this lecture.

To select a hyperparameter value via cross-validation, we first list out several “guesses” for what the best hyperparameter may be. For each guess, we then run cross-validation to compute the cross-validation error incurred by the model when using that choice of hyperparameter value. We then select the value of the hyperparameter that resulted in the lowest cross-validation error.
+For example, we may wish to use cross-validation to decide what value we should use for \(\alpha\), which controls the step size of each gradient descent update. To do so, we list out some possible guesses for the best \(\alpha\), like 0.1, 1, and 10. For each possible value, we perform cross-validation to see what error the model has when we use that value of \(\alpha\) to train it.
+We’ve now addressed the first of our two goals for today: creating a framework to assess model performance on unseen data. Now, we’ll discuss our second objective: developing a technique to adjust model complexity. This will allow us to directly tackle the issues of under- and overfitting.
+Earlier, we adjusted the complexity of our polynomial model by tuning a hyperparameter – the degree of the polynomial. We tested out several different polynomial degrees, computed the validation error for each, and selected the value that minimized the validation error. Tweaking the “complexity” was simple; it was only a matter of adjusting the polynomial degree.
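A sketch of that tuning loop with scikit-learn is shown below (it reuses vehicles, Y, and lm from earlier; the candidate degrees and the 5-fold setting are assumptions made for illustration).

from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

for degree in [1, 2, 3, 4]:
    pipeline = make_pipeline(PolynomialFeatures(degree), lm.LinearRegression())
    # `neg_mean_squared_error` returns negated MSEs, so flip the sign back
    cv_error = -cross_val_score(pipeline, vehicles[["hp"]], Y,
                                cv=5, scoring="neg_mean_squared_error").mean()
    print(f"Degree {degree}: cross-validation error {cv_error:.2f}")

We would then keep whichever degree produced the lowest cross-validation error.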
+In most machine learning problems, complexity is defined differently from what we have seen so far. Today, we’ll explore two different definitions of complexity: the squared and absolute magnitude of \(\theta_i\) coefficients.
Think back to our work using gradient descent to descend down a loss surface. You may find it helpful to refer back to the Gradient Descent note to refresh your memory. Our aim was to find the combination of model parameters that gives the smallest possible loss. We visualized this using a contour map by plotting possible parameter values on the horizontal and vertical axes, which allows us to take a bird’s eye view above the loss surface. Notice that the contour map has \(p=2\) parameters for ease of visualization. We want to find the model parameters corresponding to the lowest point on the loss surface.
+Let’s review our current modeling framework.
+\[\hat{\mathbb{Y}} = \theta_0 + \theta_1 \phi_1 + \theta_2 \phi_2 + \ldots + \theta_p \phi_p\]
+Recall that we represent our features with \(\phi_i\) to reflect the fact that we have performed feature engineering.
+Previously, we restricted model complexity by limiting the total number of features present in the model. We only included a limited number of polynomial features at a time; all other polynomials were excluded from the model.
+What if, instead of fully removing particular features, we kept all features and used each one only a “little bit”? If we put a limit on how much each feature can contribute to the predictions, we can still control the model’s complexity without the need to manually determine how many features should be removed.
+What do we mean by a “little bit”? Consider the case where some parameter \(\theta_i\) is close to or equal to 0. Then, feature \(\phi_i\) barely impacts the prediction – the feature is weighted by such a small value that its presence doesn’t significantly change the value of \(\hat{\mathbb{Y}}\). If we restrict how large each parameter \(\theta_i\) can be, we restrict how much feature \(\phi_i\) contributes to the model. This has the effect of reducing model complexity.
+In regularization, we restrict model complexity by putting a limit on the magnitudes of the model parameters \(\theta_i\).
+What do these limits look like? Suppose we specify that the sum of all absolute parameter values can be no greater than some number \(Q\). In other words:
+\[\sum_{i=1}^p |\theta_i| \leq Q\]
+where \(p\) is the total number of parameters in the model. You can think of this as us giving our model a “budget” for how it distributes the magnitudes of each parameter. If the model assigns a large value to some \(\theta_i\), it may have to assign a small value to some other \(\theta_j\). This has the effect of increasing feature \(\phi_i\)’s influence on the predictions while decreasing the influence of feature \(\phi_j\). The model will need to be strategic about how the parameter weights are distributed – ideally, more “important” features will receive greater weighting.
+Notice that the intercept term, \(\theta_0\), is excluded from this constraint. We typically do not regularize the intercept term.
+Now, let’s think back to gradient descent and visualize the loss surface as a contour map. As a refresher, a loss surface means that each point represents the model’s loss for a particular combination of \(\theta_1\), \(\theta_2\). Let’s say our goal is to find the combination of parameters that gives us the lowest loss.
+
With no constraint, the optimal \(\hat{\theta}\) is in the center. We denote this as \(\hat{\theta}_\text{No Reg}\).
Applying this constraint limits what combinations of model parameters are valid. We can now only consider parameter combinations with a total absolute sum less than or equal to our number \(Q\). For our 2D example, the constraint \(\sum_{i=1}^p |\theta_i| \leq Q\) can be rewritten as \(|\theta_1| + |\theta_2| \leq Q\). This means that we can only assign our regularized parameter vector \(\hat{\theta}_{\text{Reg}}\) to positions in the green diamond below.
+
We can no longer select the parameter vector that truly minimizes the loss surface, \(\hat{\theta}_{\text{No Reg}}\), because this combination of parameters does not lie within our allowed region. Instead, we select whatever allowable combination brings us closest to the true minimum loss, which is depicted by the red point below.
Notice that, under regularization, our optimized \(\theta_1\) and \(\theta_2\) values are much smaller than they were without regularization (indeed, \(\theta_1\) has decreased to 0). The model has decreased in complexity because we have limited how much our features contribute to the model. In fact, by setting its parameter to 0, we have effectively removed the influence of feature \(\phi_1\) from the model altogether.
If we change the value of \(Q\), we change the region of allowed parameter combinations. The model will still choose the combination of parameters that produces the lowest loss – the closest point in the constrained region to the true minimizer, \(\hat{\theta}_{\text{No Reg}}\).
+When \(Q\) is small, we severely restrict the size of our parameters. \(\theta_i\)s are small in value, and features \(\phi_i\) only contribute a little to the model. The allowed region of model parameters contracts, and the model becomes much simpler:
+When \(Q\) is large, we do not restrict our parameter sizes by much. \(\theta_i\)s are large in value, and features \(\phi_i\) contribute more to the model. The allowed region of model parameters expands, and the model becomes more complex:
+Consider the extreme case of when \(Q\) is extremely large. In this situation, our restriction has essentially no effect, and the allowed region includes the OLS solution!
+Now what if \(Q\) was extremely small? Most parameters are then set to (essentially) 0.
+Let’s summarize what we have seen.
+How do we actually apply our constraint \(\sum_{i=1}^p |\theta_i| \leq Q\)? We will do so by modifying the objective function that we seek to minimize when fitting a model.
+Recall our ordinary least squares objective function: our goal was to find parameters that minimize the model’s mean squared error:
+\[\frac{1}{n} \sum_{i=1}^n (y_i - \hat{y}_i)^2 = \frac{1}{n} \sum_{i=1}^n (y_i - (\theta_0 + \theta_1 \phi_{i, 1} + \theta_2 \phi_{i, 2} + \ldots + \theta_p \phi_{i, p}))^2\]
+To apply our constraint, we need to rephrase our minimization goal as:
+\[\frac{1}{n} \sum_{i=1}^n (y_i - (\theta_0 + \theta_1 \phi_{i, 1} + \theta_2 \phi_{i, 2} + \ldots + \theta_p \phi_{i, p}))^2\:\text{such that} \sum_{i=1}^p |\theta_i| \leq Q\]
+Unfortunately, we can’t directly use this formulation as our objective function – it’s not easy to mathematically optimize over a constraint. Instead, we will apply the magic of the Lagrangian Duality. The details of this are out of scope (take EECS 127 if you’re interested in learning more), but the end result is very useful. It turns out that minimizing the following augmented objective function is equivalent to our minimization goal above.
+\[\frac{1}{n} \sum_{i=1}^n (y_i - (\theta_0 + \theta_1 \phi_{i, 1} + \theta_2 \phi_{i, 2} + \ldots + \theta_p \phi_{i, p}))^2 + \lambda \sum_{i=1}^p \vert \theta_i \vert\] \[ = \frac{1}{n}||\mathbb{Y} - \mathbb{X}\theta||_2^2 + \lambda \sum_{i=1}^p |\theta_i|\] \[ = \frac{1}{n}||\mathbb{Y} - \mathbb{X}\theta||_2^2 + \lambda || \theta ||_1\]
The last two expressions include the MSE expressed using vector notation, and the last expression writes \(\sum_{i=1}^p |\theta_i|\) as its L1 norm equivalent form, \(|| \theta ||_1\).
+Notice that we’ve replaced the constraint with a second term in our objective function. We’re now minimizing a function with an additional regularization term that penalizes large coefficients. In order to minimize this new objective function, we’ll end up balancing two components:
+The \(\lambda\) factor controls the degree of regularization. Roughly speaking, \(\lambda\) is related to our \(Q\) constraint from before by the rule \(\lambda \approx \frac{1}{Q}\). To understand why, let’s consider two extreme examples. Recall that our goal is to minimize the cost function: \(\frac{1}{n}||\mathbb{Y} - \mathbb{X}\theta||_2^2 + \lambda || \theta ||_1\).
Assume \(\lambda \rightarrow \infty\). Then, \(\lambda || \theta ||_1\) dominates the cost function. In order to neutralize the \(\infty\) and minimize this term, we set \(\theta_j = 0\) for all \(j \ge 1\). This is a very constrained model that is mathematically equivalent to the constant model.
Assume \(\lambda \rightarrow 0\). Then, \(\lambda || \theta ||_1=0\). Minimizing the cost function is equivalent to minimizing \(\frac{1}{n} || Y - X\theta ||_2^2\), our usual MSE loss function. The act of minimizing MSE loss is just our familiar OLS, and the optimal solution is the global minimum \(\hat{\theta} = \hat\theta_{No Reg.}\).
We call \(\lambda\) the regularization penalty hyperparameter; it needs to be determined prior to training the model, so we must find the best value via cross-validation.
+The process of finding the optimal \(\hat{\theta}\) to minimize our new objective function is called L1 regularization. It is also sometimes known by the acronym “LASSO”, which stands for “Least Absolute Shrinkage and Selection Operator.”
+Unlike ordinary least squares, which can be solved via the closed-form solution \(\hat{\theta}_{OLS} = (\mathbb{X}^{\top}\mathbb{X})^{-1}\mathbb{X}^{\top}\mathbb{Y}\), there is no closed-form solution for the optimal parameter vector under L1 regularization. Instead, we use the Lasso
model class of sklearn
.
import sklearn.linear_model as lm
+
+# The alpha parameter represents our lambda term
lasso_model = lm.Lasso(alpha=2)
lasso_model.fit(X_train, Y_train)
lasso_model.coef_
array([-2.54932056e-01, -9.48597165e-04, 8.91976284e-06, -1.22872290e-08])
+Notice that all model coefficients are very small in magnitude. In fact, some of them are so small that they are essentially 0. An important characteristic of L1 regularization is that many model parameters are set to 0. In other words, LASSO effectively selects only a subset of the features. The reason for this comes back to our loss surface and allowed “diamond” regions from earlier – we can often get closer to the lowest loss contour at a corner of the diamond than along an edge.
+When a model parameter is set to 0 or close to 0, its corresponding feature is essentially removed from the model. We say that L1 regularization performs feature selection because, by setting the parameters of unimportant features to 0, LASSO “selects” which features are more useful for modeling. L1 regularization indicates that the features with non-zero parameters are more informative for modeling than those with parameters set to zero.
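A quick sketch of inspecting this effect (reusing lasso_model and X_train from above; the 1e-3 cutoff is an arbitrary threshold chosen for illustration):

# Features whose fitted LASSO coefficients are not essentially zero
X_train.columns[np.abs(lasso_model.coef_) > 1e-3]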
+The regularization procedure we just performed had one subtle issue. To see what it is, let’s take a look at the design matrix for our lasso_model
.
X_train.head()
+ | hp | +hp^2 | +hp^3 | +hp^4 | +
---|---|---|---|---|
259 | +85.0 | +7225.0 | +614125.0 | +52200625.0 | +
129 | +67.0 | +4489.0 | +300763.0 | +20151121.0 | +
207 | +102.0 | +10404.0 | +1061208.0 | +108243216.0 | +
302 | +70.0 | +4900.0 | +343000.0 | +24010000.0 | +
71 | +97.0 | +9409.0 | +912673.0 | +88529281.0 | +
Our features – hp
, hp^2
, hp^3
, and hp^4
– are on drastically different numeric scales! The values contained in hp^4
are orders of magnitude larger than those contained in hp
. This can be a problem because the value of hp^4
will naturally contribute more to each predicted \(\hat{y}\) because it is so much greater than the values of the other features. For hp
to have much of an impact at all on the prediction, it must be scaled by a large model parameter.
By inspecting the fitted parameters of our model, we see that this is the case – the parameter for hp
is much larger in magnitude than the parameter for hp^4
.
"Feature":X_train.columns, "Parameter":lasso_model.coef_}) pd.DataFrame({
+ | Feature | +Parameter | +
---|---|---|
0 | +hp | +-2.549321e-01 | +
1 | +hp^2 | +-9.485972e-04 | +
2 | +hp^3 | +8.919763e-06 | +
3 | +hp^4 | +-1.228723e-08 | +
Recall that by applying regularization, we give our model a “budget” for how it can allocate the values of model parameters. For hp
to have much of an impact on each prediction, LASSO is forced to “spend” more of this budget on the parameter for hp
.
We can avoid this issue by scaling the data before regularizing. This is a process where we convert all features to the same numeric scale. A common way to scale data is to perform standardization such that all features have mean 0 and standard deviation 1; essentially, we replace everything with its Z-score.
+\[z_i = \frac{x_i - \mu}{\sigma}\]
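A minimal sketch of standardizing the design matrix before fitting LASSO (using sklearn's StandardScaler together with the X_train, Y_train, and lm objects from earlier; alpha=2 simply mirrors the previous cell):

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
# Each column of the scaled design matrix now has mean 0 and standard deviation 1
X_train_scaled = scaler.fit_transform(X_train)

lasso_model_scaled = lm.Lasso(alpha=2)
lasso_model_scaled.fit(X_train_scaled, Y_train)
lasso_model_scaled.coef_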
+In all of our work above, we considered the constraint \(\sum_{i=1}^p |\theta_i| \leq Q\) to limit the complexity of the model. What if we had applied a different constraint?
+In L2 regularization, also known as ridge regression, we constrain the model such that the sum of the squared parameters must be less than some number \(Q\). This constraint takes the form:
+\[\sum_{i=1}^p \theta_i^2 \leq Q\]
+As before, we typically do not regularize the intercept term.
+In our 2D example, the constraint becomes \(\theta_1^2 + \theta_2^2 \leq Q\). Can you see how this is similar to the equation for a circle, \(x^2 + y^2 = r^2\)? The allowed region of parameters for a given value of \(Q\) is now shaped like a ball.
+If we modify our objective function like before, we find that our new goal is to minimize the function: \[\frac{1}{n} \sum_{i=1}^n (y_i - (\theta_0 + \theta_1 \phi_{i, 1} + \theta_2 \phi_{i, 2} + \ldots + \theta_p \phi_{i, p}))^2\:\text{such that} \sum_{i=1}^p \theta_i^2 \leq Q\]
+Notice that all we have done is change the constraint on the model parameters. The first term in the expression, the MSE, has not changed.
+Using Lagrangian Duality (again, out of scope for Data 100), we can re-express our objective function as: \[\frac{1}{n} \sum_{i=1}^n (y_i - (\theta_0 + \theta_1 \phi_{i, 1} + \theta_2 \phi_{i, 2} + \ldots + \theta_p \phi_{i, p}))^2 + \lambda \sum_{i=1}^p \theta_i^2\] \[= \frac{1}{n}||\mathbb{Y} - \mathbb{X}\theta||_2^2 + \lambda \sum_{i=1}^p \theta_i^2\] \[= \frac{1}{n}||\mathbb{Y} - \mathbb{X}\theta||_2^2 + \lambda || \theta ||_2^2\]
The last two expressions include the MSE expressed using vector notation, and the last expression writes \(\sum_{i=1}^p \theta_i^2\) as its L2 norm equivalent form, \(|| \theta ||_2^2\).
+When applying L2 regularization, our goal is to minimize this updated objective function.
+Unlike L1 regularization, L2 regularization does have a closed-form solution for the best parameter vector when regularization is applied:
+\[\hat\theta_{\text{ridge}} = (\mathbb{X}^{\top}\mathbb{X} + n\lambda I)^{-1}\mathbb{X}^{\top}\mathbb{Y}\]
This solution exists even if \(\mathbb{X}\) is not full column rank. This is a major reason why L2 regularization is often used – it can produce a solution even when there is collinearity in the features. We will discuss the concept of collinearity in a future lecture, but we will not derive this result in Data 100, as it involves a fair bit of matrix calculus.
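For concreteness, here is a sketch of this closed form in numpy. It is not a reference implementation: it assumes X is a plain design-matrix array that already includes an intercept column and, for simplicity, ignores the convention of leaving the intercept unregularized.

def ridge_closed_form(X, Y, lam):
    """Sketch of the ridge solution (X^T X + n*lam*I)^{-1} X^T Y."""
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=float)
    n, p = X.shape
    return np.linalg.inv(X.T @ X + n * lam * np.eye(p)) @ X.T @ Y

In practice, np.linalg.solve would be preferred over explicitly inverting the matrix; the inverse is used here only to mirror the formula above.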
+In sklearn
, we perform L2 regularization using the Ridge
class, which minimizes the L2-regularized objective function. As discussed above, in practice we should scale the features before regularizing.
ridge_model = lm.Ridge(alpha=1)  # alpha represents the hyperparameter lambda
ridge_model.fit(X_train, Y_train)
ridge_model.coef_
array([ 5.89130559e-02, -6.42445915e-03, 4.44468157e-05, -8.83981945e-08])
+Our regression models are summarized below. Note the objective function is what the gradient descent optimizer minimizes.
Type | Model | Loss | Regularization | Objective Function | Solution |
---|---|---|---|---|---|
OLS | \(\hat{\mathbb{Y}} = \mathbb{X}\theta\) | MSE | None | \(\frac{1}{n} \|\mathbb{Y}-\mathbb{X} \theta\|^2_2\) | \(\hat{\theta}_{OLS} = (\mathbb{X}^{\top}\mathbb{X})^{-1}\mathbb{X}^{\top}\mathbb{Y}\) if \(\mathbb{X}\) is full column rank |
Ridge | \(\hat{\mathbb{Y}} = \mathbb{X} \theta\) | MSE | L2 | \(\frac{1}{n} \|\mathbb{Y}-\mathbb{X}\theta\|^2_2 + \lambda \sum_{i=1}^p \theta_i^2\) | \(\hat{\theta}_{ridge} = (\mathbb{X}^{\top}\mathbb{X} + n \lambda I)^{-1}\mathbb{X}^{\top}\mathbb{Y}\) |
LASSO | \(\hat{\mathbb{Y}} = \mathbb{X} \theta\) | MSE | L1 | \(\frac{1}{n} \|\mathbb{Y}-\mathbb{X}\theta\|^2_2 + \lambda \sum_{i=1}^p \vert \theta_i \vert\) | No closed form solution |
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns
#%matplotlib inline

plt.rcParams['figure.figsize'] = (12, 9)

sns.set()
sns.set_context('talk')
np.set_printoptions(threshold=20, precision=2, suppress=True)
pd.set_option('display.max_rows', 30)
pd.set_option('display.max_columns', None)
pd.set_option('display.precision', 2)
# This option stops scientific notation for pandas
pd.set_option('display.float_format', '{:.2f}'.format)

# Silence some spurious seaborn warnings
import warnings
warnings.filterwarnings("ignore", category=FutureWarning)
In the past few lectures, we’ve learned that pandas
is a toolkit to restructure, modify, and explore a dataset. What we haven’t yet touched on is how to make these data transformation decisions. When we receive a new set of data from the “real world,” how do we know what processing we should do to convert this data into a usable form?
Data cleaning, also called data wrangling, is the process of transforming raw data to facilitate subsequent analysis. It is often used to address issues like:
+Exploratory Data Analysis (EDA) is the process of understanding a new dataset. It is an open-ended, informal analysis that involves familiarizing ourselves with the variables present in the data, discovering potential hypotheses, and identifying possible issues with the data. This last point can often motivate further data cleaning to address any problems with the dataset’s format; because of this, EDA and data cleaning are often thought of as an “infinite loop,” with each process driving the other.
In this lecture, we will go over the key properties of data to consider when performing data cleaning and EDA. In doing so, we’ll develop a “checklist” of sorts for you to consult when approaching a new dataset. Throughout this process, we’ll build a deeper understanding of this early (but very important!) stage of the data science lifecycle.
We often prefer rectangular data for data analysis. Rectangular structures are easy to manipulate and analyze. A key element of data cleaning is transforming data to be more rectangular.
+There are two kinds of rectangular data: tables and matrices. Tables have named columns with different data types and are manipulated using data transformation languages. Matrices contain numeric data of the same type and are manipulated using linear algebra.
+There are many file types for storing structured data: TSV, JSON, XML, ASCII, SAS, etc. We’ll only cover CSV, TSV, and JSON in lecture, but you’ll likely encounter other formats as you work with different datasets. Reading documentation is your best bet for understanding how to process the multitude of different file types.
+CSVs, which stand for Comma-Separated Values, are a common tabular data format. In the past two pandas
lectures, we briefly touched on the idea of file format: the way data is encoded in a file for storage. Specifically, our elections
and babynames
datasets were stored and loaded as CSVs:
"data/elections.csv").head(5) pd.read_csv(
+ | Year | +Candidate | +Party | +Popular vote | +Result | +% | +
---|---|---|---|---|---|---|
0 | +1824 | +Andrew Jackson | +Democratic-Republican | +151271 | +loss | +57.21 | +
1 | +1824 | +John Quincy Adams | +Democratic-Republican | +113142 | +win | +42.79 | +
2 | +1828 | +Andrew Jackson | +Democratic | +642806 | +win | +56.20 | +
3 | +1828 | +John Quincy Adams | +National Republican | +500897 | +loss | +43.80 | +
4 | +1832 | +Andrew Jackson | +Democratic | +702735 | +win | +54.57 | +
To better understand the properties of a CSV, let’s take a look at the first few rows of the raw data file to see what it looks like before being loaded into a DataFrame
. We’ll use the repr()
function to return the raw string with its special characters:
with open("data/elections.csv", "r") as table:
+= 0
+ i for row in table:
+ print(repr(row))
+ += 1
+ i if i > 3:
+ break
'Year,Candidate,Party,Popular vote,Result,%\n'
+'1824,Andrew Jackson,Democratic-Republican,151271,loss,57.21012204\n'
+'1824,John Quincy Adams,Democratic-Republican,113142,win,42.78987796\n'
+'1828,Andrew Jackson,Democratic,642806,win,56.20392707\n'
+Each row, or record, in the data is delimited by a newline \n
. Each column, or field, in the data is delimited by a comma ,
(hence, comma-separated!).
Another common file type is TSV (Tab-Separated Values). In a TSV, records are still delimited by a newline \n
, while fields are delimited by \t
tab character.
Let’s check out the first few rows of the raw TSV file. Again, we’ll use the repr()
function so that print
shows the special characters.
with open("data/elections.txt", "r") as table:
+= 0
+ i for row in table:
+ print(repr(row))
+ += 1
+ i if i > 3:
+ break
'\ufeffYear\tCandidate\tParty\tPopular vote\tResult\t%\n'
+'1824\tAndrew Jackson\tDemocratic-Republican\t151271\tloss\t57.21012204\n'
+'1824\tJohn Quincy Adams\tDemocratic-Republican\t113142\twin\t42.78987796\n'
+'1828\tAndrew Jackson\tDemocratic\t642806\twin\t56.20392707\n'
+TSVs can be loaded into pandas
using pd.read_csv
. We’ll need to specify the delimiter with the parameter sep='\t'
(documentation).
"data/elections.txt", sep='\t').head(3) pd.read_csv(
+ | Year | +Candidate | +Party | +Popular vote | +Result | +% | +
---|---|---|---|---|---|---|
0 | +1824 | +Andrew Jackson | +Democratic-Republican | +151271 | +loss | +57.21 | +
1 | +1824 | +John Quincy Adams | +Democratic-Republican | +113142 | +win | +42.79 | +
2 | +1828 | +Andrew Jackson | +Democratic | +642806 | +win | +56.20 | +
An issue with CSVs and TSVs comes up whenever there are commas or tabs within the records. How does pandas
differentiate between a comma delimiter vs. a comma within the field itself, for example 8,900
? To remedy this, check out the quotechar
parameter.
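As a small sketch of this behavior (the two-line CSV string below is made up for illustration; pd refers to pandas as imported above):

from io import StringIO

raw = 'City,Population\n"Berkeley, CA","121,485"\n'
# The quoted fields keep the embedded commas from being treated as delimiters
pd.read_csv(StringIO(raw), quotechar='"')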
JSON (JavaScript Object Notation) files behave similarly to Python dictionaries. A raw JSON is shown below.
+with open("data/elections.json", "r") as table:
+= 0
+ i for row in table:
+ print(row)
+ += 1
+ i if i > 8:
+ break
[
+
+ {
+
+ "Year": 1824,
+
+ "Candidate": "Andrew Jackson",
+
+ "Party": "Democratic-Republican",
+
+ "Popular vote": 151271,
+
+ "Result": "loss",
+
+ "%": 57.21012204
+
+ },
+
+JSON files can be loaded into pandas
using pd.read_json
.
pd.read_json('data/elections.json').head(3)
+ | Year | +Candidate | +Party | +Popular vote | +Result | +% | +
---|---|---|---|---|---|---|
0 | +1824 | +Andrew Jackson | +Democratic-Republican | +151271 | +loss | +57.21 | +
1 | +1824 | +John Quincy Adams | +Democratic-Republican | +113142 | +win | +42.79 | +
2 | +1828 | +Andrew Jackson | +Democratic | +642806 | +win | +56.20 | +
The City of Berkeley Open Data website has a dataset with COVID-19 Confirmed Cases among Berkeley residents by date. Let’s download the file and save it as a JSON (note the source URL file type is also a JSON). In the interest of reproducible data science, we will download the data programmatically. We have defined some helper functions in the ds100_utils.py file so that we can reuse them in many different notebooks.
from ds100_utils import fetch_and_cache
+
covid_file = fetch_and_cache(
    "https://data.cityofberkeley.info/api/views/xn6j-b766/rows.json?accessType=DOWNLOAD",
    "confirmed-cases.json",
    force=False)
covid_file  # a file path wrapper object
Using cached version that was downloaded (UTC): Tue Aug 27 03:33:01 2024
+PosixPath('data/confirmed-cases.json')
+Let’s start our analysis by getting a rough estimate of the size of the dataset to inform the tools we use to view the data. For relatively small datasets, we can use a text editor or spreadsheet. For larger datasets, more programmatic exploration or distributed computing tools may be more fitting. Here we will use Python
tools to probe the file.
Since this appears to be a text file, let’s investigate the number of lines, which often corresponds to the number of records.
+import os
+
+print(covid_file, "is", os.path.getsize(covid_file) / 1e6, "MB")
+
with open(covid_file, "r") as f:
    print(covid_file, "is", sum(1 for l in f), "lines.")
data/confirmed-cases.json is 0.116367 MB
+data/confirmed-cases.json is 1110 lines.
+As part of the EDA workflow, Unix commands can come in very handy. In fact, there’s an entire book called “Data Science at the Command Line” that explores this idea in depth! In Jupyter/IPython, you can prefix lines with !
to execute arbitrary Unix commands, and within those lines, you can refer to Python variables and expressions with the syntax {expr}
.
Here, we use the ls
command to list files, using the -lh
flags, which request “long format with information in human-readable form.” We also use the wc
command for “word count,” but with the -l
flag, which asks for line counts instead of words.
These two give us the same information as the code above, albeit in a slightly different form:
+!ls -lh {covid_file}
+!wc -l {covid_file}
-rw-r--r-- 1 jianingding21 staff 114K Aug 27 03:33 data/confirmed-cases.json
+ 1109 data/confirmed-cases.json
+Let’s explore the data format using Python
.
with open(covid_file, "r") as f:
    for i, row in enumerate(f):
        print(repr(row))  # print raw strings
        if i >= 4: break
'{\n'
+' "meta" : {\n'
+' "view" : {\n'
+' "id" : "xn6j-b766",\n'
+' "name" : "COVID-19 Confirmed Cases",\n'
+We can use the head
Unix command (which is where pandas
’ head
method comes from!) to see the first few lines of the file:
!head -5 {covid_file}
{
+ "meta" : {
+ "view" : {
+ "id" : "xn6j-b766",
+ "name" : "COVID-19 Confirmed Cases",
+In order to load the JSON file into pandas
, let’s first do some EDA with Python’s json
package to understand the particular structure of this JSON file so that we can decide what (if anything) to load into pandas
. Python has relatively good support for JSON data since it closely matches the internal python object model. In the following cell we import the entire JSON datafile into a python dictionary using the json
package.
import json
+
+with open(covid_file, "rb") as f:
    covid_json = json.load(f)
The covid_json
variable is now a dictionary encoding the data in the file:
type(covid_json)
dict
+We can examine what keys are in the top level JSON object by listing out the keys.
+ covid_json.keys()
dict_keys(['meta', 'data'])
+Observation: The JSON dictionary contains a meta
key which likely refers to metadata (data about the data). Metadata is often maintained with the data and can be a good source of additional information.
We can investigate the metadata further by examining the keys associated with the metadata.
covid_json['meta'].keys()
dict_keys(['view'])
+The meta
key contains another dictionary called view
. This likely refers to metadata about a particular “view” of some underlying database. We will learn more about views when we study SQL later in the class.
covid_json['meta']['view'].keys()
dict_keys(['id', 'name', 'assetType', 'attribution', 'averageRating', 'category', 'createdAt', 'description', 'displayType', 'downloadCount', 'hideFromCatalog', 'hideFromDataJson', 'newBackend', 'numberOfComments', 'oid', 'provenance', 'publicationAppendEnabled', 'publicationDate', 'publicationGroup', 'publicationStage', 'rowsUpdatedAt', 'rowsUpdatedBy', 'tableId', 'totalTimesRated', 'viewCount', 'viewLastModified', 'viewType', 'approvals', 'columns', 'grants', 'metadata', 'owner', 'query', 'rights', 'tableAuthor', 'tags', 'flags'])
Notice that this is a nested/recursive data structure. As we dig deeper, we reveal more and more keys and the corresponding data:
+meta
+|-> data
+ | ... (haven't explored yet)
+|-> view
+ | -> id
+ | -> name
+ | -> attribution
+ ...
+ | -> description
+ ...
+ | -> columns
+ ...
+There is a key called description in the view sub dictionary. This likely contains a description of the data:
+print(covid_json['meta']['view']['description'])
Counts of confirmed COVID-19 cases among Berkeley residents by date.
+We can look at a few entries in the data
field. This is what we’ll load into pandas
.
for i in range(3):
    print(f"{i:03} | {covid_json['data'][i]}")
000 | ['row-kzbg.v7my-c3y2', '00000000-0000-0000-0405-CB14DE51DAA7', 0, 1643733903, None, 1643733903, None, '{ }', '2020-02-28T00:00:00', '1', '1']
+001 | ['row-jkyx_9u4r-h2yw', '00000000-0000-0000-F806-86D0DBE0E17F', 0, 1643733903, None, 1643733903, None, '{ }', '2020-02-29T00:00:00', '0', '1']
+002 | ['row-qifg_4aug-y3ym', '00000000-0000-0000-2DCE-4D1872F9B216', 0, 1643733903, None, 1643733903, None, '{ }', '2020-03-01T00:00:00', '0', '1']
Observations:

- These look like equal-length records, so maybe data is a table!
- But what does each value in the record mean? Where can we find the column headers?
For that, we’ll need the columns
key in the metadata dictionary. This returns a list:
type(covid_json['meta']['view']['columns'])
list
+pandas
Finally, let’s load the data (not the metadata) into a pandas
DataFrame
. In the following block of code we:
1. Translate the JSON records into a DataFrame, using covid_json['meta']['view']['columns'] to supply the column names and covid_json['data'] to supply the records.
2. Remove columns that have no metadata description. This would be a bad idea in general, but here we remove these columns since the above analysis suggests they are unlikely to contain useful information.
3. Examine the tail of the table.
# Load the data from JSON and assign column titles
covid = pd.DataFrame(
    covid_json['data'],
    columns=[c['name'] for c in covid_json['meta']['view']['columns']])

covid.tail()
+ | sid | +id | +position | +created_at | +created_meta | +updated_at | +updated_meta | +meta | +Date | +New Cases | +Cumulative Cases | +
---|---|---|---|---|---|---|---|---|---|---|---|
699 | +row-49b6_x8zv.gyum | +00000000-0000-0000-A18C-9174A6D05774 | +0 | +1643733903 | +None | +1643733903 | +None | +{ } | +2022-01-27T00:00:00 | +106 | +10694 | +
700 | +row-gs55-p5em.y4v9 | +00000000-0000-0000-F41D-5724AEABB4D6 | +0 | +1643733903 | +None | +1643733903 | +None | +{ } | +2022-01-28T00:00:00 | +223 | +10917 | +
701 | +row-3pyj.tf95-qu67 | +00000000-0000-0000-BEE3-B0188D2518BD | +0 | +1643733903 | +None | +1643733903 | +None | +{ } | +2022-01-29T00:00:00 | +139 | +11056 | +
702 | +row-cgnd.8syv.jvjn | +00000000-0000-0000-C318-63CF75F7F740 | +0 | +1643733903 | +None | +1643733903 | +None | +{ } | +2022-01-30T00:00:00 | +33 | +11089 | +
703 | +row-qywv_24x6-237y | +00000000-0000-0000-FE92-9789FED3AA20 | +0 | +1643733903 | +None | +1643733903 | +None | +{ } | +2022-01-31T00:00:00 | +42 | +11131 | +
Last time, we introduced .merge
as the pandas
method for joining multiple DataFrame
s together. In our discussion of joins, we touched on the idea of using a “key” to determine what rows should be merged from each table. Let’s take a moment to examine this idea more closely.
The primary key is the column or set of columns in a table that uniquely determine the values of the remaining columns. It can be thought of as the unique identifier for each individual row in the table. For example, a table of Data 100 students might use each student’s Cal ID as the primary key.
 | Cal ID | Name | Major |
---|---|---|---|
0 | 3034619471 | Oski | Data Science |
1 | 3035619472 | Ollie | Computer Science |
2 | 3025619473 | Orrie | Data Science |
3 | 3046789372 | Ollie | Economics |
The foreign key is the column or set of columns in a table that reference primary keys in other tables. Knowing a dataset’s foreign keys can be useful when assigning the left_on
and right_on
parameters of .merge
. In the table of office hour tickets below, "Cal ID"
is a foreign key referencing the previous table.
 | OH Request | Cal ID | Question |
---|---|---|---|
0 | 1 | 3034619471 | HW 2 Q1 |
1 | 2 | 3035619472 | HW 2 Q3 |
2 | 3 | 3025619473 | Lab 3 Q4 |
3 | 4 | 3035619472 | HW 2 Q7 |
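A minimal sketch of how these keys come into play when merging (the two DataFrames below are constructed by hand to mirror the tables above):

students = pd.DataFrame({
    "Cal ID": [3034619471, 3035619472, 3025619473, 3046789372],
    "Name": ["Oski", "Ollie", "Orrie", "Ollie"],
    "Major": ["Data Science", "Computer Science", "Data Science", "Economics"],
})
tickets = pd.DataFrame({
    "OH Request": [1, 2, 3, 4],
    "Cal ID": [3034619471, 3035619472, 3025619473, 3035619472],
    "Question": ["HW 2 Q1", "HW 2 Q3", "Lab 3 Q4", "HW 2 Q7"],
})

# The foreign key in `tickets` references the primary key in `students`
tickets.merge(students, on="Cal ID")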
Variables are columns. A variable is a measurement of a particular concept. Variables have two common properties: data type/storage type and variable type/feature type. The data type of a variable indicates how each variable value is stored in memory (integer, floating point, boolean, etc.) and affects which pandas
functions are used. The variable type is a conceptualized measurement of information (and therefore indicates what values a variable can take on). Variable type is identified through expert knowledge, exploring the data itself, or consulting the data codebook. The variable type affects how one visualizes and interprets the data. In this class, “variable types” are conceptual.
After loading data into a file, it’s a good idea to take the time to understand what pieces of information are encoded in the dataset. In particular, we want to identify what variable types are present in our data. Broadly speaking, we can categorize variables into one of two overarching types.
Quantitative variables describe some numeric quantity or amount. We can divide quantitative data further into:

- Continuous quantitative variables, which can be measured on a continuous numeric scale to arbitrary precision (for example, a price or a temperature).
- Discrete quantitative variables, which can only take on a finite set of possible numeric values (for example, a number of siblings).

Qualitative variables, also known as categorical variables, describe data that isn’t measuring some quantity or amount. The sub-categories of categorical data are:

- Ordinal qualitative variables, whose categories have an ordering (for example, an education level or a star rating).
- Nominal qualitative variables, whose categories have no particular ordering (for example, a Cal ID number or a political affiliation).
+Note that many variables don’t sit neatly in just one of these categories. Qualitative variables could have numeric levels, and conversely, quantitative variables could be stored as strings.
+After understanding the structure of the dataset, the next task is to determine what exactly the data represents. We’ll do so by considering the data’s granularity, scope, and temporality.
+The granularity of a dataset is what a single row represents. You can also think of it as the level of detail included in the data. To determine the data’s granularity, ask: what does each row in the dataset represent? Fine-grained data contains a high level of detail, with a single row representing a small individual unit. For example, each record may represent one person. Coarse-grained data is encoded such that a single row represents a large individual unit – for example, each record may represent a group of people.
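As a small sketch of the idea (using a hypothetical people DataFrame constructed here), aggregating fine-grained, one-row-per-person data produces a coarser, one-row-per-city table:

people = pd.DataFrame({
    "city": ["Berkeley", "Berkeley", "Oakland"],
    "age": [20, 22, 35],
})

# Fine-grained: one row per person. Coarse-grained: one row per city.
people.groupby("city").mean()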
+The scope of a dataset is the subset of the population covered by the data. If we were investigating student performance in Data Science courses, a dataset with a narrow scope might encompass all students enrolled in Data 100 whereas a dataset with an expansive scope might encompass all students in California.
+The temporality of a dataset describes the periodicity over which the data was collected as well as when the data was most recently collected or updated.
Time and date fields of a dataset could represent a few things:

- when the “event” took place,
- when the data was collected (or when it was entered into the system), or
- when the data was copied into the database.
To fully understand the temporality of the data, it also may be necessary to standardize time zones or inspect recurring time-based trends in the data (do patterns recur in 24-hour periods? Over the course of a month? Seasonally?). The convention for standardizing time is Coordinated Universal Time (UTC), an international time standard measured at 0 degrees longitude that stays consistent throughout the year (no daylight saving). We can represent Berkeley’s time zone as UTC-8 during Pacific Standard Time (PST) and UTC-7 during Pacific Daylight Time (PDT).
+pandas
’ dt
accessorsLet’s briefly look at how we can use pandas
’ dt
accessors to work with dates/times in a dataset using the dataset you’ll see in Lab 3: the Berkeley PD Calls for Service dataset.
= pd.read_csv("data/Berkeley_PD_-_Calls_for_Service.csv")
+ calls calls.head()
+ | CASENO | +OFFENSE | +EVENTDT | +EVENTTM | +CVLEGEND | +CVDOW | +InDbDate | +Block_Location | +BLKADDR | +City | +State | +
---|---|---|---|---|---|---|---|---|---|---|---|
0 | +21014296 | +THEFT MISD. (UNDER $950) | +04/01/2021 12:00:00 AM | +10:58 | +LARCENY | +4 | +06/15/2021 12:00:00 AM | +Berkeley, CA\n(37.869058, -122.270455) | +NaN | +Berkeley | +CA | +
1 | +21014391 | +THEFT MISD. (UNDER $950) | +04/01/2021 12:00:00 AM | +10:38 | +LARCENY | +4 | +06/15/2021 12:00:00 AM | +Berkeley, CA\n(37.869058, -122.270455) | +NaN | +Berkeley | +CA | +
2 | +21090494 | +THEFT MISD. (UNDER $950) | +04/19/2021 12:00:00 AM | +12:15 | +LARCENY | +1 | +06/15/2021 12:00:00 AM | +2100 BLOCK HASTE ST\nBerkeley, CA\n(37.864908,... | +2100 BLOCK HASTE ST | +Berkeley | +CA | +
3 | +21090204 | +THEFT FELONY (OVER $950) | +02/13/2021 12:00:00 AM | +17:00 | +LARCENY | +6 | +06/15/2021 12:00:00 AM | +2600 BLOCK WARRING ST\nBerkeley, CA\n(37.86393... | +2600 BLOCK WARRING ST | +Berkeley | +CA | +
4 | +21090179 | +BURGLARY AUTO | +02/08/2021 12:00:00 AM | +6:20 | +BURGLARY - VEHICLE | +1 | +06/15/2021 12:00:00 AM | +2700 BLOCK GARBER ST\nBerkeley, CA\n(37.86066,... | +2700 BLOCK GARBER ST | +Berkeley | +CA | +
Looks like there are three columns with dates/times: EVENTDT
, EVENTTM
, and InDbDate
.
Most likely, EVENTDT
stands for the date when the event took place, EVENTTM
stands for the time of day the event took place (in 24-hr format), and InDbDate
is the date this call is recorded onto the database.
If we check the data type of these columns, we will see they are stored as strings. We can convert them to datetime
objects using pandas to_datetime
function.
"EVENTDT"] = pd.to_datetime(calls["EVENTDT"])
+ calls[ calls.head()
/var/folders/m7/89sj44pj21ddhplt2bn4qjcm0000gr/T/ipykernel_57895/874729699.py:1: UserWarning:
+
+Could not infer format, so each element will be parsed individually, falling back to `dateutil`. To ensure parsing is consistent and as-expected, please specify a format.
+
++ | CASENO | +OFFENSE | +EVENTDT | +EVENTTM | +CVLEGEND | +CVDOW | +InDbDate | +Block_Location | +BLKADDR | +City | +State | +
---|---|---|---|---|---|---|---|---|---|---|---|
0 | +21014296 | +THEFT MISD. (UNDER $950) | +2021-04-01 | +10:58 | +LARCENY | +4 | +06/15/2021 12:00:00 AM | +Berkeley, CA\n(37.869058, -122.270455) | +NaN | +Berkeley | +CA | +
1 | +21014391 | +THEFT MISD. (UNDER $950) | +2021-04-01 | +10:38 | +LARCENY | +4 | +06/15/2021 12:00:00 AM | +Berkeley, CA\n(37.869058, -122.270455) | +NaN | +Berkeley | +CA | +
2 | +21090494 | +THEFT MISD. (UNDER $950) | +2021-04-19 | +12:15 | +LARCENY | +1 | +06/15/2021 12:00:00 AM | +2100 BLOCK HASTE ST\nBerkeley, CA\n(37.864908,... | +2100 BLOCK HASTE ST | +Berkeley | +CA | +
3 | +21090204 | +THEFT FELONY (OVER $950) | +2021-02-13 | +17:00 | +LARCENY | +6 | +06/15/2021 12:00:00 AM | +2600 BLOCK WARRING ST\nBerkeley, CA\n(37.86393... | +2600 BLOCK WARRING ST | +Berkeley | +CA | +
4 | +21090179 | +BURGLARY AUTO | +2021-02-08 | +6:20 | +BURGLARY - VEHICLE | +1 | +06/15/2021 12:00:00 AM | +2700 BLOCK GARBER ST\nBerkeley, CA\n(37.86066,... | +2700 BLOCK GARBER ST | +Berkeley | +CA | +
Now, we can use the dt
accessor on this column.
We can get the month:
+"EVENTDT"].dt.month.head() calls[
0 4
+1 4
+2 4
+3 2
+4 2
+Name: EVENTDT, dtype: int32
+Which day of the week the date is on:
+"EVENTDT"].dt.dayofweek.head() calls[
0 3
+1 3
+2 0
+3 5
+4 0
+Name: EVENTDT, dtype: int32
Check the minimum values to see if there are any suspicious-looking 70s dates:
+"EVENTDT").head() calls.sort_values(
+ | CASENO | +OFFENSE | +EVENTDT | +EVENTTM | +CVLEGEND | +CVDOW | +InDbDate | +Block_Location | +BLKADDR | +City | +State | +
---|---|---|---|---|---|---|---|---|---|---|---|
2513 | +20057398 | +BURGLARY COMMERCIAL | +2020-12-17 | +16:05 | +BURGLARY - COMMERCIAL | +4 | +06/15/2021 12:00:00 AM | +600 BLOCK GILMAN ST\nBerkeley, CA\n(37.878405,... | +600 BLOCK GILMAN ST | +Berkeley | +CA | +
624 | +20057207 | +ASSAULT/BATTERY MISD. | +2020-12-17 | +16:50 | +ASSAULT | +4 | +06/15/2021 12:00:00 AM | +2100 BLOCK SHATTUCK AVE\nBerkeley, CA\n(37.871... | +2100 BLOCK SHATTUCK AVE | +Berkeley | +CA | +
154 | +20092214 | +THEFT FROM AUTO | +2020-12-17 | +18:30 | +LARCENY - FROM VEHICLE | +4 | +06/15/2021 12:00:00 AM | +800 BLOCK SHATTUCK AVE\nBerkeley, CA\n(37.8918... | +800 BLOCK SHATTUCK AVE | +Berkeley | +CA | +
659 | +20057324 | +THEFT MISD. (UNDER $950) | +2020-12-17 | +15:44 | +LARCENY | +4 | +06/15/2021 12:00:00 AM | +1800 BLOCK 4TH ST\nBerkeley, CA\n(37.869888, -... | +1800 BLOCK 4TH ST | +Berkeley | +CA | +
993 | +20057573 | +BURGLARY RESIDENTIAL | +2020-12-17 | +22:15 | +BURGLARY - RESIDENTIAL | +4 | +06/15/2021 12:00:00 AM | +1700 BLOCK STUART ST\nBerkeley, CA\n(37.857495... | +1700 BLOCK STUART ST | +Berkeley | +CA | +
Doesn’t look like it! We are good!
+We can also do many things with the dt
accessor like switching time zones and converting time back to UNIX/POSIX time. Check out the documentation on .dt
accessor and time series/date functionality.
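As a brief illustration (not part of the lab), here is one reasonable way to localize the EVENTDT timestamps from the calls DataFrame above to the Pacific time zone, convert them to UTC, and express them in UNIX time; the time zone string and conversion chain are illustrative assumptions, not steps from the lecture.

# Localize the naive timestamps to the Pacific time zone,
# then convert to UTC and to UNIX time (seconds since 1970-01-01 UTC)
event_pacific = calls["EVENTDT"].dt.tz_localize("US/Pacific")
event_utc = event_pacific.dt.tz_convert("UTC")
unix_seconds = event_utc.astype("int64") // 10**9  # nanoseconds -> seconds
unix_seconds.head()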
At this stage in our data cleaning and EDA workflow, we’ve achieved quite a lot: we’ve identified how our data is structured, come to terms with what information it encodes, and gained insight as to how it was generated. Throughout this process, we should always recall the original intent of our work in Data Science – to use data to better understand and model the real world. To achieve this goal, we need to ensure that the data we use is faithful to reality; that is, that our data accurately captures the “real world.”
Data used in research or industry is often “messy” – there may be errors or inaccuracies that impact the faithfulness of the dataset. Signs that data may not be faithful include unrealistic or “incorrect” values, violations of obvious dependencies (for example, an age that does not match a birthdate), clearly hand-entered or falsified entries, and duplicated records or fields.
We often address these more common issues by spot-checking values against expectations, consulting any available documentation or data codebook, and cleaning or removing problematic records.
+Another common issue encountered with real-world datasets is that of missing data. One strategy to resolve this is to simply drop any records with missing values from the dataset. This does, however, introduce the risk of inducing biases – it is possible that the missing or corrupt records may be systemically related to some feature of interest in the data. Another solution is to keep the data as NaN
values.
A third method to address missing data is to perform imputation: infer the missing values using other data available in the dataset. There is a wide variety of imputation techniques; common choices include average imputation (replacing a missing value with the average value of that field) and hot deck imputation (replacing a missing value with a randomly chosen observed value).
+Regardless of the strategy used to deal with missing data, we should think carefully about why particular records or fields may be missing – this can help inform whether or not the absence of these values is significant or meaningful.
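As a minimal sketch of these three broad strategies in pandas, assume a hypothetical DataFrame df with a numeric column "value" that contains missing entries; the data and column name below are stand-ins for illustration only.

import numpy as np
import pandas as pd

# hypothetical toy data with a missing entry
df = pd.DataFrame({"value": [1.0, 2.0, np.nan, 4.0]})

dropped = df.dropna(subset=["value"])                               # 1. drop records with missing values
kept = df                                                           # 2. keep the missing values as NaN
imputed = df.assign(value=df["value"].fillna(df["value"].mean()))   # 3. impute with the field's average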
+Now, let’s walk through the data-cleaning and EDA workflow to see what can we learn about the presence of Tuberculosis in the United States!
+We will examine the data included in the original CDC article published in 2021.
+Suppose Table 1 was saved as a CSV file located in data/cdc_tuberculosis.csv
.
We can then explore the CSV (which is a text file, and does not contain binary-encoded data) in many ways:
1. Using a text editor like emacs, vim, VSCode, etc.
2. Opening the CSV directly in DataHub (read-only), Excel, Google Sheets, etc.
3. The Python file object
4. pandas, using pd.read_csv()
To try out options 1 and 2, you can view or download the Tuberculosis dataset from the lecture demo notebook under the data folder in the left-hand menu. Notice how the CSV file is a type of rectangular data (i.e., tabular data) stored as comma-separated values.
Next, let’s try out option 3 using the Python
file object. We’ll look at the first four lines:
with open("data/cdc_tuberculosis.csv", "r") as f:
+= 0
+ i for row in f:
+ print(row)
+ += 1
+ i if i > 3:
+ break
,No. of TB cases,,,TB incidence,,
+
+U.S. jurisdiction,2019,2020,2021,2019,2020,2021
+
+Total,"8,900","7,173","7,860",2.71,2.16,2.37
+
+Alabama,87,72,92,1.77,1.43,1.83
+
Whoa, why are there blank lines interspersed between the lines of the CSV?
+You may recall that all line breaks in text files are encoded as the special newline character \n
. Python’s print()
prints each string (including the newline), and an additional newline on top of that.
If you’re curious, we can use the repr()
function to return the raw string with all special characters:
with open("data/cdc_tuberculosis.csv", "r") as f:
+= 0
+ i for row in f:
+ print(repr(row)) # print raw strings
+ += 1
+ i if i > 3:
+ break
',No. of TB cases,,,TB incidence,,\n'
+'U.S. jurisdiction,2019,2020,2021,2019,2020,2021\n'
+'Total,"8,900","7,173","7,860",2.71,2.16,2.37\n'
+'Alabama,87,72,92,1.77,1.43,1.83\n'
+Finally, let’s try option 4 and use the tried-and-true Data 100 approach: pandas
.
= pd.read_csv("data/cdc_tuberculosis.csv")
+ tb_df tb_df.head()
+ | Unnamed: 0 | +No. of TB cases | +Unnamed: 2 | +Unnamed: 3 | +TB incidence | +Unnamed: 5 | +Unnamed: 6 | +
---|---|---|---|---|---|---|---|
0 | +U.S. jurisdiction | +2019 | +2020 | +2021 | +2019.00 | +2020.00 | +2021.00 | +
1 | +Total | +8,900 | +7,173 | +7,860 | +2.71 | +2.16 | +2.37 | +
2 | +Alabama | +87 | +72 | +92 | +1.77 | +1.43 | +1.83 | +
3 | +Alaska | +58 | +58 | +58 | +7.91 | +7.92 | +7.92 | +
4 | +Arizona | +183 | +136 | +129 | +2.51 | +1.89 | +1.77 | +
You may notice some strange things about this table: what’s up with the “Unnamed” column names and the first row?
+Congratulations — you’re ready to wrangle your data! Because of how things are stored, we’ll need to clean the data a bit to name our columns better.
+A reasonable first step is to identify the row with the right header. The pd.read_csv()
function (documentation) has the convenient header
parameter that we can set to use the elements in row 1 as the appropriate columns:
= pd.read_csv("data/cdc_tuberculosis.csv", header=1) # row index
+ tb_df 5) tb_df.head(
+ | U.S. jurisdiction | +2019 | +2020 | +2021 | +2019.1 | +2020.1 | +2021.1 | +
---|---|---|---|---|---|---|---|
0 | +Total | +8,900 | +7,173 | +7,860 | +2.71 | +2.16 | +2.37 | +
1 | +Alabama | +87 | +72 | +92 | +1.77 | +1.43 | +1.83 | +
2 | +Alaska | +58 | +58 | +58 | +7.91 | +7.92 | +7.92 | +
3 | +Arizona | +183 | +136 | +129 | +2.51 | +1.89 | +1.77 | +
4 | +Arkansas | +64 | +59 | +69 | +2.12 | +1.96 | +2.28 | +
Wait…but now we can’t differentiate between the “Number of TB cases” and “TB incidence” year columns. pandas
has tried to make our lives easier by automatically adding “.1” to the latter columns, but this doesn’t help us, as humans, understand the data.
We can rename the columns manually with df.rename() (documentation):
rename_dict = {'2019': 'TB cases 2019',
               '2020': 'TB cases 2020',
               '2021': 'TB cases 2021',
               '2019.1': 'TB incidence 2019',
               '2020.1': 'TB incidence 2020',
               '2021.1': 'TB incidence 2021'}
tb_df = tb_df.rename(columns=rename_dict)
tb_df.head(5)
+ | U.S. jurisdiction | +TB cases 2019 | +TB cases 2020 | +TB cases 2021 | +TB incidence 2019 | +TB incidence 2020 | +TB incidence 2021 | +
---|---|---|---|---|---|---|---|
0 | +Total | +8,900 | +7,173 | +7,860 | +2.71 | +2.16 | +2.37 | +
1 | +Alabama | +87 | +72 | +92 | +1.77 | +1.43 | +1.83 | +
2 | +Alaska | +58 | +58 | +58 | +7.91 | +7.92 | +7.92 | +
3 | +Arizona | +183 | +136 | +129 | +2.51 | +1.89 | +1.77 | +
4 | +Arkansas | +64 | +59 | +69 | +2.12 | +1.96 | +2.28 | +
You might already be wondering: what’s up with that first record?
+Row 0 is what we call a rollup record, or summary record. It’s often useful when displaying tables to humans. The granularity of record 0 (Totals) vs the rest of the records (States) is different.
+Okay, EDA step two. How was the rollup record aggregated?
+Let’s check if Total TB cases is the sum of all state TB cases. If we sum over all rows, we should get 2x the total cases in each of our TB cases by year (why do you think this is?).
tb_df.sum(axis=0)
U.S. jurisdiction TotalAlabamaAlaskaArizonaArkansasCaliforniaCol...
+TB cases 2019 8,9008758183642,111666718245583029973261085237...
+TB cases 2020 7,1737258136591,706525417194122219282169239376...
+TB cases 2021 7,8609258129691,750585443194992281064255127494...
+TB incidence 2019 109.94
+TB incidence 2020 93.09
+TB incidence 2021 102.94
+dtype: object
+Whoa, what’s going on with the TB cases in 2019, 2020, and 2021? Check out the column types:
+ tb_df.dtypes
U.S. jurisdiction object
+TB cases 2019 object
+TB cases 2020 object
+TB cases 2021 object
+TB incidence 2019 float64
+TB incidence 2020 float64
+TB incidence 2021 float64
+dtype: object
+Since there are commas in the values for TB cases, the numbers are read as the object
datatype, or storage type (close to the Python
string datatype), so pandas
is concatenating strings instead of adding integers (recall that Python can “sum”, or concatenate, strings together: "data" + "100"
evaluates to "data100"
).
Fortunately read_csv
also has a thousands
parameter (documentation):
# improve readability: chaining method calls with outer parentheses/line breaks
tb_df = (
    pd.read_csv("data/cdc_tuberculosis.csv", header=1, thousands=',')
    .rename(columns=rename_dict)
)
tb_df.head(5)
+ | U.S. jurisdiction | +TB cases 2019 | +TB cases 2020 | +TB cases 2021 | +TB incidence 2019 | +TB incidence 2020 | +TB incidence 2021 | +
---|---|---|---|---|---|---|---|
0 | +Total | +8900 | +7173 | +7860 | +2.71 | +2.16 | +2.37 | +
1 | +Alabama | +87 | +72 | +92 | +1.77 | +1.43 | +1.83 | +
2 | +Alaska | +58 | +58 | +58 | +7.91 | +7.92 | +7.92 | +
3 | +Arizona | +183 | +136 | +129 | +2.51 | +1.89 | +1.77 | +
4 | +Arkansas | +64 | +59 | +69 | +2.12 | +1.96 | +2.28 | +
tb_df.sum()
U.S. jurisdiction TotalAlabamaAlaskaArizonaArkansasCaliforniaCol...
+TB cases 2019 17800
+TB cases 2020 14346
+TB cases 2021 15720
+TB incidence 2019 109.94
+TB incidence 2020 93.09
+TB incidence 2021 102.94
+dtype: object
+The total TB cases look right. Phew!
+Let’s just look at the records with state-level granularity:
state_tb_df = tb_df[1:]
state_tb_df.head(5)
+ | U.S. jurisdiction | +TB cases 2019 | +TB cases 2020 | +TB cases 2021 | +TB incidence 2019 | +TB incidence 2020 | +TB incidence 2021 | +
---|---|---|---|---|---|---|---|
1 | +Alabama | +87 | +72 | +92 | +1.77 | +1.43 | +1.83 | +
2 | +Alaska | +58 | +58 | +58 | +7.91 | +7.92 | +7.92 | +
3 | +Arizona | +183 | +136 | +129 | +2.51 | +1.89 | +1.77 | +
4 | +Arkansas | +64 | +59 | +69 | +2.12 | +1.96 | +2.28 | +
5 | +California | +2111 | +1706 | +1750 | +5.35 | +4.32 | +4.46 | +
U.S. Census population estimates source (2019), source (2020-2021).
Running the below cells cleans the data. There are a few new methods here:
* df.convert_dtypes() (documentation) conveniently converts all float dtypes into ints and is out of scope for the class.
* df.dropna() (documentation) will be explained in more detail next time.
# 2010s census data
census_2010s_df = pd.read_csv("data/nst-est2019-01.csv", header=3, thousands=",")
census_2010s_df = (
    census_2010s_df
    .reset_index()
    .drop(columns=["index", "Census", "Estimates Base"])
    .rename(columns={"Unnamed: 0": "Geographic Area"})
    .convert_dtypes()  # "smart" converting of columns, use at your own risk
    .dropna()  # we'll introduce this next time
)
census_2010s_df['Geographic Area'] = census_2010s_df['Geographic Area'].str.strip('.')

# with pd.option_context('display.min_rows', 30):  # shows more rows
#     display(census_2010s_df)

census_2010s_df.head(5)
+ | Geographic Area | +2010 | +2011 | +2012 | +2013 | +2014 | +2015 | +2016 | +2017 | +2018 | +2019 | +
---|---|---|---|---|---|---|---|---|---|---|---|
0 | +United States | +309321666 | +311556874 | +313830990 | +315993715 | +318301008 | +320635163 | +322941311 | +324985539 | +326687501 | +328239523 | +
1 | +Northeast | +55380134 | +55604223 | +55775216 | +55901806 | +56006011 | +56034684 | +56042330 | +56059240 | +56046620 | +55982803 | +
2 | +Midwest | +66974416 | +67157800 | +67336743 | +67560379 | +67745167 | +67860583 | +67987540 | +68126781 | +68236628 | +68329004 | +
3 | +South | +114866680 | +116006522 | +117241208 | +118364400 | +119624037 | +120997341 | +122351760 | +123542189 | +124569433 | +125580448 | +
4 | +West | +72100436 | +72788329 | +73477823 | +74167130 | +74925793 | +75742555 | +76559681 | +77257329 | +77834820 | +78347268 | +
Occasionally, you will want to modify code that you have imported. To reimport those modifications you can either use python
’s importlib
library:
from importlib import reload
+reload(utils)
or use iPython
magic which will intelligently import code when files change:
%load_ext autoreload
+%autoreload 2
# census 2020s data
census_2020s_df = pd.read_csv("data/NST-EST2022-POP.csv", header=3, thousands=",")
census_2020s_df = (
    census_2020s_df
    .reset_index()
    .drop(columns=["index", "Unnamed: 1"])
    .rename(columns={"Unnamed: 0": "Geographic Area"})
    .convert_dtypes()  # "smart" converting of columns, use at your own risk
    .dropna()  # we'll introduce this next time
)
census_2020s_df['Geographic Area'] = census_2020s_df['Geographic Area'].str.strip('.')

census_2020s_df.head(5)
+ | Geographic Area | +2020 | +2021 | +2022 | +
---|---|---|---|---|
0 | +United States | +331511512 | +332031554 | +333287557 | +
1 | +Northeast | +57448898 | +57259257 | +57040406 | +
2 | +Midwest | +68961043 | +68836505 | +68787595 | +
3 | +South | +126450613 | +127346029 | +128716192 | +
4 | +West | +78650958 | +78589763 | +78743364 | +
Merging DataFrames
Time to merge! Here we use the DataFrame method df1.merge(right=df2, ...) on DataFrame df1 (documentation). Contrast this with the function pd.merge(left=df1, right=df2, ...) (documentation). Feel free to use either.
# merge TB DataFrame with two US census DataFrames
tb_census_df = (
    tb_df
    .merge(right=census_2010s_df,
           left_on="U.S. jurisdiction", right_on="Geographic Area")
    .merge(right=census_2020s_df,
           left_on="U.S. jurisdiction", right_on="Geographic Area")
)
tb_census_df.head(5)
+ | U.S. jurisdiction | +TB cases 2019 | +TB cases 2020 | +TB cases 2021 | +TB incidence 2019 | +TB incidence 2020 | +TB incidence 2021 | +Geographic Area_x | +2010 | +2011 | +2012 | +2013 | +2014 | +2015 | +2016 | +2017 | +2018 | +2019 | +Geographic Area_y | +2020 | +2021 | +2022 | +
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | +Alabama | +87 | +72 | +92 | +1.77 | +1.43 | +1.83 | +Alabama | +4785437 | +4799069 | +4815588 | +4830081 | +4841799 | +4852347 | +4863525 | +4874486 | +4887681 | +4903185 | +Alabama | +5031362 | +5049846 | +5074296 | +
1 | +Alaska | +58 | +58 | +58 | +7.91 | +7.92 | +7.92 | +Alaska | +713910 | +722128 | +730443 | +737068 | +736283 | +737498 | +741456 | +739700 | +735139 | +731545 | +Alaska | +732923 | +734182 | +733583 | +
2 | +Arizona | +183 | +136 | +129 | +2.51 | +1.89 | +1.77 | +Arizona | +6407172 | +6472643 | +6554978 | +6632764 | +6730413 | +6829676 | +6941072 | +7044008 | +7158024 | +7278717 | +Arizona | +7179943 | +7264877 | +7359197 | +
3 | +Arkansas | +64 | +59 | +69 | +2.12 | +1.96 | +2.28 | +Arkansas | +2921964 | +2940667 | +2952164 | +2959400 | +2967392 | +2978048 | +2989918 | +3001345 | +3009733 | +3017804 | +Arkansas | +3014195 | +3028122 | +3045637 | +
4 | +California | +2111 | +1706 | +1750 | +5.35 | +4.32 | +4.46 | +California | +37319502 | +37638369 | +37948800 | +38260787 | +38596972 | +38918045 | +39167117 | +39358497 | +39461588 | +39512223 | +California | +39501653 | +39142991 | +39029342 | +
Having all of these columns is a little unwieldy. We could either drop the unneeded columns now, or just merge on smaller census DataFrame
s. Let’s do the latter.
# try merging again, but cleaner this time
tb_census_df = (
    tb_df
    .merge(right=census_2010s_df[["Geographic Area", "2019"]],
           left_on="U.S. jurisdiction", right_on="Geographic Area")
    .drop(columns="Geographic Area")
    .merge(right=census_2020s_df[["Geographic Area", "2020", "2021"]],
           left_on="U.S. jurisdiction", right_on="Geographic Area")
    .drop(columns="Geographic Area")
)
tb_census_df.head(5)
+ | U.S. jurisdiction | +TB cases 2019 | +TB cases 2020 | +TB cases 2021 | +TB incidence 2019 | +TB incidence 2020 | +TB incidence 2021 | +2019 | +2020 | +2021 | +
---|---|---|---|---|---|---|---|---|---|---|
0 | +Alabama | +87 | +72 | +92 | +1.77 | +1.43 | +1.83 | +4903185 | +5031362 | +5049846 | +
1 | +Alaska | +58 | +58 | +58 | +7.91 | +7.92 | +7.92 | +731545 | +732923 | +734182 | +
2 | +Arizona | +183 | +136 | +129 | +2.51 | +1.89 | +1.77 | +7278717 | +7179943 | +7264877 | +
3 | +Arkansas | +64 | +59 | +69 | +2.12 | +1.96 | +2.28 | +3017804 | +3014195 | +3028122 | +
4 | +California | +2111 | +1706 | +1750 | +5.35 | +4.32 | +4.46 | +39512223 | +39501653 | +39142991 | +
Let’s recompute incidence to make sure we know where the original CDC numbers came from.
+From the CDC report: TB incidence is computed as “Cases per 100,000 persons using mid-year population estimates from the U.S. Census Bureau.”
+If we define a group as 100,000 people, then we can compute the TB incidence for a given state population as
+\[\text{TB incidence} = \frac{\text{TB cases in population}}{\text{groups in population}} = \frac{\text{TB cases in population}}{\text{population}/100000} \]
+\[= \frac{\text{TB cases in population}}{\text{population}} \times 100000\]
+Let’s try this for 2019:
+"recompute incidence 2019"] = tb_census_df["TB cases 2019"]/tb_census_df["2019"]*100000
+ tb_census_df[5) tb_census_df.head(
+ | U.S. jurisdiction | +TB cases 2019 | +TB cases 2020 | +TB cases 2021 | +TB incidence 2019 | +TB incidence 2020 | +TB incidence 2021 | +2019 | +2020 | +2021 | +recompute incidence 2019 | +
---|---|---|---|---|---|---|---|---|---|---|---|
0 | +Alabama | +87 | +72 | +92 | +1.77 | +1.43 | +1.83 | +4903185 | +5031362 | +5049846 | +1.77 | +
1 | +Alaska | +58 | +58 | +58 | +7.91 | +7.92 | +7.92 | +731545 | +732923 | +734182 | +7.93 | +
2 | +Arizona | +183 | +136 | +129 | +2.51 | +1.89 | +1.77 | +7278717 | +7179943 | +7264877 | +2.51 | +
3 | +Arkansas | +64 | +59 | +69 | +2.12 | +1.96 | +2.28 | +3017804 | +3014195 | +3028122 | +2.12 | +
4 | +California | +2111 | +1706 | +1750 | +5.35 | +4.32 | +4.46 | +39512223 | +39501653 | +39142991 | +5.34 | +
Awesome!!!
+Let’s use a for-loop and Python format strings to compute TB incidence for all years. Python f-strings are just used for the purposes of this demo, but they’re handy to know when you explore data beyond this course (documentation).
# recompute incidence for all years
for year in [2019, 2020, 2021]:
    tb_census_df[f"recompute incidence {year}"] = tb_census_df[f"TB cases {year}"] / tb_census_df[f"{year}"] * 100000
tb_census_df.head(5)
+ | U.S. jurisdiction | +TB cases 2019 | +TB cases 2020 | +TB cases 2021 | +TB incidence 2019 | +TB incidence 2020 | +TB incidence 2021 | +2019 | +2020 | +2021 | +recompute incidence 2019 | +recompute incidence 2020 | +recompute incidence 2021 | +
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | +Alabama | +87 | +72 | +92 | +1.77 | +1.43 | +1.83 | +4903185 | +5031362 | +5049846 | +1.77 | +1.43 | +1.82 | +
1 | +Alaska | +58 | +58 | +58 | +7.91 | +7.92 | +7.92 | +731545 | +732923 | +734182 | +7.93 | +7.91 | +7.90 | +
2 | +Arizona | +183 | +136 | +129 | +2.51 | +1.89 | +1.77 | +7278717 | +7179943 | +7264877 | +2.51 | +1.89 | +1.78 | +
3 | +Arkansas | +64 | +59 | +69 | +2.12 | +1.96 | +2.28 | +3017804 | +3014195 | +3028122 | +2.12 | +1.96 | +2.28 | +
4 | +California | +2111 | +1706 | +1750 | +5.35 | +4.32 | +4.46 | +39512223 | +39501653 | +39142991 | +5.34 | +4.32 | +4.47 | +
These numbers look pretty close!!! There are a few errors in the hundredths place, particularly in 2021. It may be useful to further explore reasons behind this discrepancy.
+ tb_census_df.describe()
+ | TB cases 2019 | +TB cases 2020 | +TB cases 2021 | +TB incidence 2019 | +TB incidence 2020 | +TB incidence 2021 | +2019 | +2020 | +2021 | +recompute incidence 2019 | +recompute incidence 2020 | +recompute incidence 2021 | +
---|---|---|---|---|---|---|---|---|---|---|---|---|
count | +51.00 | +51.00 | +51.00 | +51.00 | +51.00 | +51.00 | +51.00 | +51.00 | +51.00 | +51.00 | +51.00 | +51.00 | +
mean | +174.51 | +140.65 | +154.12 | +2.10 | +1.78 | +1.97 | +6436069.08 | +6500225.73 | +6510422.63 | +2.10 | +1.78 | +1.97 | +
std | +341.74 | +271.06 | +286.78 | +1.50 | +1.34 | +1.48 | +7360660.47 | +7408168.46 | +7394300.08 | +1.50 | +1.34 | +1.47 | +
min | +1.00 | +0.00 | +2.00 | +0.17 | +0.00 | +0.21 | +578759.00 | +577605.00 | +579483.00 | +0.17 | +0.00 | +0.21 | +
25% | +25.50 | +29.00 | +23.00 | +1.29 | +1.21 | +1.23 | +1789606.00 | +1820311.00 | +1844920.00 | +1.30 | +1.21 | +1.23 | +
50% | +70.00 | +67.00 | +69.00 | +1.80 | +1.52 | +1.70 | +4467673.00 | +4507445.00 | +4506589.00 | +1.81 | +1.52 | +1.69 | +
75% | +180.50 | +139.00 | +150.00 | +2.58 | +1.99 | +2.22 | +7446805.00 | +7451987.00 | +7502811.00 | +2.58 | +1.99 | +2.22 | +
max | +2111.00 | +1706.00 | +1750.00 | +7.91 | +7.92 | +7.92 | +39512223.00 | +39501653.00 | +39142991.00 | +7.93 | +7.91 | +7.90 | +
How do we reproduce that reported statistic in the original CDC report?
+++Reported TB incidence (cases per 100,000 persons) increased 9.4%, from 2.2 during 2020 to 2.4 during 2021 but was lower than incidence during 2019 (2.7). Increases occurred among both U.S.-born and non–U.S.-born persons.
+
This is TB incidence computed across the entire U.S. population! How do we reproduce this?
* We need to reproduce the “Total” TB incidences in our rolled record.
* But our current tb_census_df only has 51 entries (50 states plus Washington, D.C.). There is no rolled record.
* What happened…?
Let’s get exploring!
+Before we keep exploring, we’ll set all indexes to more meaningful values, instead of just numbers that pertain to some row at some point. This will make our cleaning slightly easier.
tb_df = tb_df.set_index("U.S. jurisdiction")
tb_df.head(5)
+ | TB cases 2019 | +TB cases 2020 | +TB cases 2021 | +TB incidence 2019 | +TB incidence 2020 | +TB incidence 2021 | +
---|---|---|---|---|---|---|
U.S. jurisdiction | ++ | + | + | + | + | + |
Total | +8900 | +7173 | +7860 | +2.71 | +2.16 | +2.37 | +
Alabama | +87 | +72 | +92 | +1.77 | +1.43 | +1.83 | +
Alaska | +58 | +58 | +58 | +7.91 | +7.92 | +7.92 | +
Arizona | +183 | +136 | +129 | +2.51 | +1.89 | +1.77 | +
Arkansas | +64 | +59 | +69 | +2.12 | +1.96 | +2.28 | +
census_2010s_df = census_2010s_df.set_index("Geographic Area")
census_2010s_df.head(5)
+ | 2010 | +2011 | +2012 | +2013 | +2014 | +2015 | +2016 | +2017 | +2018 | +2019 | +
---|---|---|---|---|---|---|---|---|---|---|
Geographic Area | ++ | + | + | + | + | + | + | + | + | + |
United States | +309321666 | +311556874 | +313830990 | +315993715 | +318301008 | +320635163 | +322941311 | +324985539 | +326687501 | +328239523 | +
Northeast | +55380134 | +55604223 | +55775216 | +55901806 | +56006011 | +56034684 | +56042330 | +56059240 | +56046620 | +55982803 | +
Midwest | +66974416 | +67157800 | +67336743 | +67560379 | +67745167 | +67860583 | +67987540 | +68126781 | +68236628 | +68329004 | +
South | +114866680 | +116006522 | +117241208 | +118364400 | +119624037 | +120997341 | +122351760 | +123542189 | +124569433 | +125580448 | +
West | +72100436 | +72788329 | +73477823 | +74167130 | +74925793 | +75742555 | +76559681 | +77257329 | +77834820 | +78347268 | +
census_2020s_df = census_2020s_df.set_index("Geographic Area")
census_2020s_df.head(5)
+ | 2020 | +2021 | +2022 | +
---|---|---|---|
Geographic Area | ++ | + | + |
United States | +331511512 | +332031554 | +333287557 | +
Northeast | +57448898 | +57259257 | +57040406 | +
Midwest | +68961043 | +68836505 | +68787595 | +
South | +126450613 | +127346029 | +128716192 | +
West | +78650958 | +78589763 | +78743364 | +
It turns out that our merge above only kept state records, even though our original tb_df
had the “Total” rolled record:
tb_df.head()
+ | TB cases 2019 | +TB cases 2020 | +TB cases 2021 | +TB incidence 2019 | +TB incidence 2020 | +TB incidence 2021 | +
---|---|---|---|---|---|---|
U.S. jurisdiction | ++ | + | + | + | + | + |
Total | +8900 | +7173 | +7860 | +2.71 | +2.16 | +2.37 | +
Alabama | +87 | +72 | +92 | +1.77 | +1.43 | +1.83 | +
Alaska | +58 | +58 | +58 | +7.91 | +7.92 | +7.92 | +
Arizona | +183 | +136 | +129 | +2.51 | +1.89 | +1.77 | +
Arkansas | +64 | +59 | +69 | +2.12 | +1.96 | +2.28 | +
Recall that merge performs an inner merge by default, meaning that it only preserves keys that are present in both DataFrames.
The rolled records in our census DataFrame
have different Geographic Area
fields, which was the key we merged on:
census_2010s_df.head(5)
+ | 2010 | +2011 | +2012 | +2013 | +2014 | +2015 | +2016 | +2017 | +2018 | +2019 | +
---|---|---|---|---|---|---|---|---|---|---|
Geographic Area | ++ | + | + | + | + | + | + | + | + | + |
United States | +309321666 | +311556874 | +313830990 | +315993715 | +318301008 | +320635163 | +322941311 | +324985539 | +326687501 | +328239523 | +
Northeast | +55380134 | +55604223 | +55775216 | +55901806 | +56006011 | +56034684 | +56042330 | +56059240 | +56046620 | +55982803 | +
Midwest | +66974416 | +67157800 | +67336743 | +67560379 | +67745167 | +67860583 | +67987540 | +68126781 | +68236628 | +68329004 | +
South | +114866680 | +116006522 | +117241208 | +118364400 | +119624037 | +120997341 | +122351760 | +123542189 | +124569433 | +125580448 | +
West | +72100436 | +72788329 | +73477823 | +74167130 | +74925793 | +75742555 | +76559681 | +77257329 | +77834820 | +78347268 | +
The Census DataFrame
has several rolled records. The aggregate record we are looking for actually has the Geographic Area named “United States”.
One straightforward way to get the right merge is to rename the value itself. Because we now have the Geographic Area index, we’ll use df.rename()
(documentation):
# rename rolled record for 2010s
census_2010s_df.rename(index={'United States': 'Total'}, inplace=True)
census_2010s_df.head(5)
+ | 2010 | +2011 | +2012 | +2013 | +2014 | +2015 | +2016 | +2017 | +2018 | +2019 | +
---|---|---|---|---|---|---|---|---|---|---|
Geographic Area | ++ | + | + | + | + | + | + | + | + | + |
Total | +309321666 | +311556874 | +313830990 | +315993715 | +318301008 | +320635163 | +322941311 | +324985539 | +326687501 | +328239523 | +
Northeast | +55380134 | +55604223 | +55775216 | +55901806 | +56006011 | +56034684 | +56042330 | +56059240 | +56046620 | +55982803 | +
Midwest | +66974416 | +67157800 | +67336743 | +67560379 | +67745167 | +67860583 | +67987540 | +68126781 | +68236628 | +68329004 | +
South | +114866680 | +116006522 | +117241208 | +118364400 | +119624037 | +120997341 | +122351760 | +123542189 | +124569433 | +125580448 | +
West | +72100436 | +72788329 | +73477823 | +74167130 | +74925793 | +75742555 | +76559681 | +77257329 | +77834820 | +78347268 | +
# same, but for 2020s rename rolled record
census_2020s_df.rename(index={'United States': 'Total'}, inplace=True)
census_2020s_df.head(5)
+ | 2020 | +2021 | +2022 | +
---|---|---|---|
Geographic Area | ++ | + | + |
Total | +331511512 | +332031554 | +333287557 | +
Northeast | +57448898 | +57259257 | +57040406 | +
Midwest | +68961043 | +68836505 | +68787595 | +
South | +126450613 | +127346029 | +128716192 | +
West | +78650958 | +78589763 | +78743364 | +
Next let’s rerun our merge. Note the different chaining, because we are now merging on indexes (df.merge()
documentation).
tb_census_df = (
    tb_df
    .merge(right=census_2010s_df[["2019"]],
           left_index=True, right_index=True)
    .merge(right=census_2020s_df[["2020", "2021"]],
           left_index=True, right_index=True)
)
tb_census_df.head(5)
+ | TB cases 2019 | +TB cases 2020 | +TB cases 2021 | +TB incidence 2019 | +TB incidence 2020 | +TB incidence 2021 | +2019 | +2020 | +2021 | +
---|---|---|---|---|---|---|---|---|---|
Total | +8900 | +7173 | +7860 | +2.71 | +2.16 | +2.37 | +328239523 | +331511512 | +332031554 | +
Alabama | +87 | +72 | +92 | +1.77 | +1.43 | +1.83 | +4903185 | +5031362 | +5049846 | +
Alaska | +58 | +58 | +58 | +7.91 | +7.92 | +7.92 | +731545 | +732923 | +734182 | +
Arizona | +183 | +136 | +129 | +2.51 | +1.89 | +1.77 | +7278717 | +7179943 | +7264877 | +
Arkansas | +64 | +59 | +69 | +2.12 | +1.96 | +2.28 | +3017804 | +3014195 | +3028122 | +
Finally, let’s recompute our incidences:
# recompute incidence for all years
for year in [2019, 2020, 2021]:
    tb_census_df[f"recompute incidence {year}"] = tb_census_df[f"TB cases {year}"] / tb_census_df[f"{year}"] * 100000
tb_census_df.head(5)
+ | TB cases 2019 | +TB cases 2020 | +TB cases 2021 | +TB incidence 2019 | +TB incidence 2020 | +TB incidence 2021 | +2019 | +2020 | +2021 | +recompute incidence 2019 | +recompute incidence 2020 | +recompute incidence 2021 | +
---|---|---|---|---|---|---|---|---|---|---|---|---|
Total | +8900 | +7173 | +7860 | +2.71 | +2.16 | +2.37 | +328239523 | +331511512 | +332031554 | +2.71 | +2.16 | +2.37 | +
Alabama | +87 | +72 | +92 | +1.77 | +1.43 | +1.83 | +4903185 | +5031362 | +5049846 | +1.77 | +1.43 | +1.82 | +
Alaska | +58 | +58 | +58 | +7.91 | +7.92 | +7.92 | +731545 | +732923 | +734182 | +7.93 | +7.91 | +7.90 | +
Arizona | +183 | +136 | +129 | +2.51 | +1.89 | +1.77 | +7278717 | +7179943 | +7264877 | +2.51 | +1.89 | +1.78 | +
Arkansas | +64 | +59 | +69 | +2.12 | +1.96 | +2.28 | +3017804 | +3014195 | +3028122 | +2.12 | +1.96 | +2.28 | +
We reproduced the total U.S. incidences correctly!
+We’re almost there. Let’s revisit the quote:
+++Reported TB incidence (cases per 100,000 persons) increased 9.4%, from 2.2 during 2020 to 2.4 during 2021 but was lower than incidence during 2019 (2.7). Increases occurred among both U.S.-born and non–U.S.-born persons.
+
Recall that percent change from \(A\) to \(B\) is computed as \(\text{percent change} = \frac{B - A}{A} \times 100\).
incidence_2020 = tb_census_df.loc['Total', 'recompute incidence 2020']
incidence_2020
np.float64(2.1637257652759883)
incidence_2021 = tb_census_df.loc['Total', 'recompute incidence 2021']
incidence_2021
np.float64(2.3672448914298068)
difference = (incidence_2021 - incidence_2020) / incidence_2020 * 100
difference
np.float64(9.405957511804143)
+Mauna Loa Observatory has been monitoring CO2 concentrations since 1958.
+= "data/co2_mm_mlo.txt" co2_file
Let’s do some EDA!!
Reading it into Pandas?
Let’s instead check out this .txt file. Some questions to keep in mind: Do we trust this file extension? What structure is it?
Lines 71-78 (inclusive) are shown below:
+line number | file contents
+
+71 | # decimal average interpolated trend #days
+72 | # date (season corr)
+73 | 1958 3 1958.208 315.71 315.71 314.62 -1
+74 | 1958 4 1958.292 317.45 317.45 315.29 -1
+75 | 1958 5 1958.375 317.50 317.50 314.71 -1
+76 | 1958 6 1958.458 -99.99 317.10 314.85 -1
+77 | 1958 7 1958.542 315.86 315.86 314.98 -1
+78 | 1958 8 1958.625 314.93 314.93 315.94 -1
Notice how the values are separated by white space rather than commas, and how the column headings are split across the two comment lines (71 and 72).
+We can use read_csv
to read the data into a pandas
DataFrame
, and we provide several arguments to specify that the separators are white space, there is no header (we will set our own column names), and to skip the first 72 rows of the file.
co2 = pd.read_csv(
    co2_file, header=None, skiprows=72,
    sep=r'\s+'  # delimiter for continuous whitespace (stay tuned for regex next lecture)
)
co2.head()
+ | 0 | +1 | +2 | +3 | +4 | +5 | +6 | +
---|---|---|---|---|---|---|---|
0 | +1958 | +3 | +1958.21 | +315.71 | +315.71 | +314.62 | +-1 | +
1 | +1958 | +4 | +1958.29 | +317.45 | +317.45 | +315.29 | +-1 | +
2 | +1958 | +5 | +1958.38 | +317.50 | +317.50 | +314.71 | +-1 | +
3 | +1958 | +6 | +1958.46 | +-99.99 | +317.10 | +314.85 | +-1 | +
4 | +1958 | +7 | +1958.54 | +315.86 | +315.86 | +314.98 | +-1 | +
Congratulations! You’ve wrangled the data!
+…But our columns aren’t named. We need to do more EDA.
+The NOAA webpage might have some useful tidbits (in this case it doesn’t).
+Using this information, we’ll rerun pd.read_csv
, but this time with some custom column names.
co2 = pd.read_csv(
    co2_file, header=None, skiprows=72,
    sep=r'\s+',  # regex for continuous whitespace (next lecture)
    names=['Yr', 'Mo', 'DecDate', 'Avg', 'Int', 'Trend', 'Days']
)
co2.head()
+ | Yr | +Mo | +DecDate | +Avg | +Int | +Trend | +Days | +
---|---|---|---|---|---|---|---|
0 | +1958 | +3 | +1958.21 | +315.71 | +315.71 | +314.62 | +-1 | +
1 | +1958 | +4 | +1958.29 | +317.45 | +317.45 | +315.29 | +-1 | +
2 | +1958 | +5 | +1958.38 | +317.50 | +317.50 | +314.71 | +-1 | +
3 | +1958 | +6 | +1958.46 | +-99.99 | +317.10 | +314.85 | +-1 | +
4 | +1958 | +7 | +1958.54 | +315.86 | +315.86 | +314.98 | +-1 | +
Scientific studies tend to have very clean data, right…? Let’s jump right in and make a time series plot of CO2 monthly averages.
sns.lineplot(x='DecDate', y='Avg', data=co2);
The code above uses the seaborn
plotting library (abbreviated sns
). We will cover this in the Visualization lecture, but now you don’t need to worry about how it works!
Yikes! Plotting the data uncovered a problem. The sharp vertical lines suggest that we have some missing values. What happened here?
+ co2.head()
+ | Yr | +Mo | +DecDate | +Avg | +Int | +Trend | +Days | +
---|---|---|---|---|---|---|---|
0 | +1958 | +3 | +1958.21 | +315.71 | +315.71 | +314.62 | +-1 | +
1 | +1958 | +4 | +1958.29 | +317.45 | +317.45 | +315.29 | +-1 | +
2 | +1958 | +5 | +1958.38 | +317.50 | +317.50 | +314.71 | +-1 | +
3 | +1958 | +6 | +1958.46 | +-99.99 | +317.10 | +314.85 | +-1 | +
4 | +1958 | +7 | +1958.54 | +315.86 | +315.86 | +314.98 | +-1 | +
co2.tail()
+ | Yr | +Mo | +DecDate | +Avg | +Int | +Trend | +Days | +
---|---|---|---|---|---|---|---|
733 | +2019 | +4 | +2019.29 | +413.32 | +413.32 | +410.49 | +26 | +
734 | +2019 | +5 | +2019.38 | +414.66 | +414.66 | +411.20 | +28 | +
735 | +2019 | +6 | +2019.46 | +413.92 | +413.92 | +411.58 | +27 | +
736 | +2019 | +7 | +2019.54 | +411.77 | +411.77 | +411.43 | +23 | +
737 | +2019 | +8 | +2019.62 | +409.95 | +409.95 | +411.84 | +29 | +
Some data have unusual values like -1 and -99.99.
+Let’s check the description at the top of the file again.
* -1 signifies a missing value for the number of days (Days) the equipment was in operation that month.
* -99.99 denotes a missing monthly average (Avg).
How can we fix this? First, let’s explore other aspects of our data. Understanding our data will help us decide what to do with the missing values.
+First, we consider the shape of the data. How many rows should we have?
+ co2.shape
(738, 7)
Nice!! The number of rows (i.e., records) matches our expectations.
+Let’s now check the quality of each feature.
+Days
Days
is a time field, so let’s analyze other time fields to see if there is an explanation for missing values of days of operation.
Let’s start with months, Mo
.
Are we missing any records? Each month should appear 61 or 62 times (March 1958-August 2019).
+"Mo"].value_counts().sort_index() co2[
Mo
+1 61
+2 61
+3 62
+4 62
+5 62
+6 62
+7 62
+8 62
+9 61
+10 61
+11 61
+12 61
+Name: count, dtype: int64
+As expected Jan, Feb, Sep, Oct, Nov, and Dec have 61 occurrences and the rest 62.
+Next let’s explore days Days
itself, which is the number of days that the measurement equipment worked.
sns.displot(co2['Days']);
plt.title("Distribution of days feature");  # the ; suppresses unneeded plotting output
In terms of data quality, a handful of months have averages based on measurements taken on fewer than half the days. In addition, there are nearly 200 missing values–that’s about 27% of the data!
+Finally, let’s check the last time feature, year Yr
.
Let’s check to see if there is any connection between missing-ness and the year of the recording.
+="Yr", y="Days", data=co2);
+ sns.scatterplot(x"Day field by Year"); # the ; suppresses output plt.title(
Observations:
+Potential Next Steps:
+Avg
Next, let’s return to the -99.99 values in Avg
to analyze the overall quality of the CO2 measurements. We’ll plot a histogram of the average CO2 measurements
# Histograms of average CO2 measurements
sns.displot(co2['Avg']);
The non-missing values are in the 300-400 range (a regular range of CO2 levels).
+We also see that there are only a few missing Avg
values (<1% of values). Let’s examine all of them:
"Avg"] < 0] co2[co2[
+ | Yr | +Mo | +DecDate | +Avg | +Int | +Trend | +Days | +
---|---|---|---|---|---|---|---|
3 | +1958 | +6 | +1958.46 | +-99.99 | +317.10 | +314.85 | +-1 | +
7 | +1958 | +10 | +1958.79 | +-99.99 | +312.66 | +315.61 | +-1 | +
71 | +1964 | +2 | +1964.12 | +-99.99 | +320.07 | +319.61 | +-1 | +
72 | +1964 | +3 | +1964.21 | +-99.99 | +320.73 | +319.55 | +-1 | +
73 | +1964 | +4 | +1964.29 | +-99.99 | +321.77 | +319.48 | +-1 | +
213 | +1975 | +12 | +1975.96 | +-99.99 | +330.59 | +331.60 | +0 | +
313 | +1984 | +4 | +1984.29 | +-99.99 | +346.84 | +344.27 | +2 | +
There doesn’t seem to be a pattern to these values, other than that most records also were missing Days
data.
Drop, NaN, or Impute Missing Avg Data?
How should we address the invalid Avg data?
Remember we want to fix the following plot:
sns.lineplot(x='DecDate', y='Avg', data=co2)
plt.title("CO2 Average By Month");
Since we are plotting Avg
vs DecDate
, we should just focus on dealing with missing values for Avg
.
Let’s consider a few options: 1. Drop those records 2. Replace -99.99 with NaN 3. Substitute it with a likely value for the average CO2?
+What do you think are the pros and cons of each possible action?
+Let’s examine each of these three options.
# 1. Drop missing values
co2_drop = co2[co2['Avg'] > 0]
co2_drop.head()
+ | Yr | +Mo | +DecDate | +Avg | +Int | +Trend | +Days | +
---|---|---|---|---|---|---|---|
0 | +1958 | +3 | +1958.21 | +315.71 | +315.71 | +314.62 | +-1 | +
1 | +1958 | +4 | +1958.29 | +317.45 | +317.45 | +315.29 | +-1 | +
2 | +1958 | +5 | +1958.38 | +317.50 | +317.50 | +314.71 | +-1 | +
4 | +1958 | +7 | +1958.54 | +315.86 | +315.86 | +314.98 | +-1 | +
5 | +1958 | +8 | +1958.62 | +314.93 | +314.93 | +315.94 | +-1 | +
# 2. Replace -99.99 with NaN
co2_NA = co2.replace(-99.99, np.nan)
co2_NA.head()
+ | Yr | +Mo | +DecDate | +Avg | +Int | +Trend | +Days | +
---|---|---|---|---|---|---|---|
0 | +1958 | +3 | +1958.21 | +315.71 | +315.71 | +314.62 | +-1 | +
1 | +1958 | +4 | +1958.29 | +317.45 | +317.45 | +315.29 | +-1 | +
2 | +1958 | +5 | +1958.38 | +317.50 | +317.50 | +314.71 | +-1 | +
3 | +1958 | +6 | +1958.46 | +NaN | +317.10 | +314.85 | +-1 | +
4 | +1958 | +7 | +1958.54 | +315.86 | +315.86 | +314.98 | +-1 | +
We’ll also use a third version of the data.
+First, we note that the dataset already comes with a substitute value for the -99.99.
+From the file description:
+++The
+interpolated
column includes average values from the preceding column (average
) and interpolated values where data are missing. Interpolated values are computed in two steps…
The Int
feature has values that exactly match those in Avg
, except when Avg
is -99.99, and then a reasonable estimate is used instead.
So, the third version of our data will use the Int
feature instead of Avg
.
# 3. Use interpolated column which estimates missing Avg values
co2_impute = co2.copy()
co2_impute['Avg'] = co2['Int']
co2_impute.head()
+ | Yr | +Mo | +DecDate | +Avg | +Int | +Trend | +Days | +
---|---|---|---|---|---|---|---|
0 | +1958 | +3 | +1958.21 | +315.71 | +315.71 | +314.62 | +-1 | +
1 | +1958 | +4 | +1958.29 | +317.45 | +317.45 | +315.29 | +-1 | +
2 | +1958 | +5 | +1958.38 | +317.50 | +317.50 | +314.71 | +-1 | +
3 | +1958 | +6 | +1958.46 | +317.10 | +317.10 | +314.85 | +-1 | +
4 | +1958 | +7 | +1958.54 | +315.86 | +315.86 | +314.98 | +-1 | +
What’s a reasonable estimate?
+To answer this question, let’s zoom in on a short time period, say the measurements in 1958 (where we know we have two missing values).
# results of plotting data in 1958

def line_and_points(data, ax, title):
    # assumes single year, hence Mo
    ax.plot('Mo', 'Avg', data=data)
    ax.scatter('Mo', 'Avg', data=data)
    ax.set_xlim(2, 13)
    ax.set_title(title)
    ax.set_xticks(np.arange(3, 13))

def data_year(data, year):
    return data[data["Yr"] == year]

# uses matplotlib subplots
# you may see more next week; focus on output for now
fig, axes = plt.subplots(ncols=3, figsize=(12, 4), sharey=True)

year = 1958
line_and_points(data_year(co2_drop, year), axes[0], title="1. Drop Missing")
line_and_points(data_year(co2_NA, year), axes[1], title="2. Missing Set to NaN")
line_and_points(data_year(co2_impute, year), axes[2], title="3. Missing Interpolated")

fig.suptitle(f"Monthly Averages for {year}")
plt.tight_layout()
In the big picture since there are only 7 Avg
values missing (<1% of 738 months), any of these approaches would work.
However, there is some appeal to option 3, imputing:
+Let’s replot our original figure with option 3:
sns.lineplot(x='DecDate', y='Avg', data=co2_impute)
plt.title("CO2 Average By Month, Imputed");
Looks pretty close to what we see on the NOAA website!
+From the description:
+The data you present depends on your research question.
+How do CO2 levels vary by season?
+Are CO2 levels rising over the past 50+ years, consistent with global warming predictions?
co2_year = co2_impute.groupby('Yr').mean()
sns.lineplot(x='Yr', y='Avg', data=co2_year)
plt.title("CO2 Average By Year");
Indeed, we see a rise by nearly 100 ppm of CO2 since Mauna Loa began recording in 1958.
+We went over a lot of content this lecture; let’s summarize the most important points:
There are a few options we can take to deal with missing data:
* Drop records with missing values
* Keep the missing values as NaN
* Impute using an interpolated or estimated value
There are several ways to approach EDA and Data Wrangling:
+At this point, we’ve grown quite familiar with the modeling process. We’ve introduced the concept of loss, used it to fit several types of models, and, most recently, extended our analysis to multiple regression. Along the way, we’ve forged our way through the mathematics of deriving the optimal model parameters in all its gory detail. It’s time to make our lives a little easier – let’s implement the modeling process in code!
+In this lecture, we’ll explore two techniques for model fitting:
* Implementing the formulas we derived for the optimal model parameters directly in python
* Using python’s sklearn package
With our new programming frameworks in hand, we will also add sophistication to our models by introducing more complex features to enhance model performance.
+Before we dive into feature engineering, let’s quickly review gradient descent, which we covered in the last lecture. Recall that gradient descent is a powerful technique for choosing the model parameters that minimize the loss function.
+As we learned earlier, we set the derivative of the loss function to zero and solve to determine the optimal parameters \(\theta\) that minimize loss. For a loss surface in 2D (or higher), the best way to minimize loss is to “walk” down the loss surface until we reach our optimal parameters \(\vec{\theta}\). The gradient vector tells us which direction to “walk” in.
+For example, the vector of parameter values \(\vec{\theta} = \begin{bmatrix} + \theta_{0} \\ + \theta_{1} \\ + \end{bmatrix}\) gives us a two parameter model (d = 2). To calculate our gradient vector, we can take the partial derivative of loss with respect to each parameter: \(\frac{\partial L}{\partial \theta_0}\) and \(\frac{\partial L}{\partial \theta_1}\).
+Its gradient vector would then be the 2D vector: \[\nabla_{\vec{\theta}} L = \begin{bmatrix} \frac{\partial L}{\partial \theta_0} \\ \frac{\partial L}{\partial \theta_1} \end{bmatrix}\]
+Note that \(-\nabla_{\vec{\theta}} L\) always points in the downhill direction of the surface.
+Recall that we also discussed the gradient descent update rule, where we nudge \(\theta\) in a negative gradient direction until \(\theta\) converges.
+As a refresher, the rule is as follows: \[\vec{\theta}^{(t+1)} = \vec{\theta}^{(t)} - \alpha \nabla_{\vec{\theta}} L(\vec{\theta}^{(t)}) \]
+Let’s now walk through an example of calculating and updating the gradient vector. Say our model and loss are: \[\begin{align} +f_{\vec{\theta}}(\vec{x}) &= \vec{x}^T\vec{\theta} = \theta_0x_0 + \theta_1x_1 +\\l(y, \hat{y}) &= (y - \hat{y})^2 +\end{align} +\]
+Plugging in \(f_{\vec{\theta}}(\vec{x})\) for \(\hat{y}\), our loss function becomes \(l(\vec{\theta}, \vec{x}, y_i) = (y_i - \theta_0x_0 - \theta_1x_1)^2\).
+To calculate our gradient vector, we can start by computing the partial derivative of the loss function with respect to \(\theta_0\): \[\frac{\partial}{\partial \theta_{0}} l(\vec{\theta}, \vec{x}, y_i) = 2(y_i - \theta_0x_0 - \theta_1x_1)(-x_0)\]
+Let’s now do the same but with respect to \(\theta_1\): \[\frac{\partial}{\partial \theta_{1}} l(\vec{\theta}, \vec{x}, y_i) = 2(y_i - \theta_0x_0 - \theta_1x_1)(-x_1)\]
+Putting this together, our gradient vector is: \[\nabla_{\theta} l(\vec{\theta}, \vec{x}, y_i) = \begin{bmatrix} -2(y_i - \theta_0x_0 - \theta_1x_1)(x_0) \\ -2(y_i - \theta_0x_0 - \theta_1x_1)(x_1) \end{bmatrix}\]
+Remember that we need to keep updating \(\theta\) until the algorithm converges to a solution and stops updating significantly (or at all). When updating \(\theta\), we’ll have a fixed number of updates and subsequent updates will be quite small (we won’t change \(\theta\) by much).
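To make the update rule concrete, here is a minimal NumPy sketch of batch gradient descent for the two-parameter model and squared loss above; the synthetic data, learning rate, and iteration count are illustrative choices, not values from the lecture.

import numpy as np

# Synthetic data: x0 is a constant column, x1 is random; y comes from "true" parameters [2, -3]
rng = np.random.default_rng(42)
X = np.column_stack([np.ones(100), rng.normal(size=100)])
y = X @ np.array([2.0, -3.0]) + rng.normal(scale=0.1, size=100)

theta = np.zeros(2)   # initial guess for [theta_0, theta_1]
alpha = 0.1           # learning rate

for _ in range(500):
    residuals = y - X @ theta                  # y_i - theta_0*x_0 - theta_1*x_1 for every point
    gradient = -2 * X.T @ residuals / len(y)   # average of the per-point gradient vectors above
    theta = theta - alpha * gradient           # the gradient descent update rule

print(theta)  # should land close to [2, -3]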
Let’s now dive deeper into gradient and stochastic gradient descent. In the previous lecture, we discussed how finding the gradient across all the data is extremely computationally taxing and takes a lot of resources to calculate.
+We know that the solution to the normal equation is \(\hat{\theta} = (\mathbb{X}^T\mathbb{X})^{-1}\mathbb{X}^T\mathbb{Y}\). Let’s break this down and determine the computational complexity for this solution.
+Let \(n\) be the number of samples (rows) and \(d\) be the number of features (columns).
+In total, calculating the solution to the normal equation takes \(O(nd^2) + O(d^3) + O(nd) + O(d^2)\) time. We can see that \(O(nd^2) + O(d^3)\) dominates the complexity — this can be problematic for high-dimensional models and very large datasets.
+On the other hand, the time complexity of a single gradient descent step takes only \(O(nd)\) time.
+Suppose we run \(T\) iterations. The final complexity would then be \(O(Tnd)\). Typically, \(n\) is much larger than \(T\) and \(d\). How can we reduce the cost of this algorithm using a technique from Data 100? Do we really need to use \(n\) data points? We don’t! Instead, we can use stochastic gradient descent.
+We know that our true gradient of \(\nabla_{\vec{\theta}} L (\vec{\theta^{(t)}}) = \frac{1}{n}\sum_{i=1}^{n}\nabla_{\vec{\theta}} l(y_i, f_{\vec{\theta}^{(t)}}(X_i))\) has a time complexity of \(O(nd)\). Instead of using all \(n\) samples to calculate the true gradient of the loss surface, let’s use a sample of our data to approximate. Say we sample \(b\) records (\(s_1, \cdots, s_b\)) from our \(n\) datapoints. Our new (stochastic) gradient descent function will be \(\nabla_{\vec{\theta}} L (\vec{\theta^{(t)}}) = \frac{1}{b}\sum_{i=1}^{b}\nabla_{\vec{\theta}} l(y_{s_i}, f_{\vec{\theta}^{(t)}}(X_{s_i}))\) and will now have a time complexity of \(O(bd)\), which is much faster!
Stochastic gradient descent helps us approximate the gradient while also reducing the time complexity and computational cost. The time complexity scales with the number of datapoints selected in the sample. To sample data, there are two approaches we can use: we can shuffle the data and iterate through it in mini-batches, or we can draw a simple random sample of records for each gradient computation.
+But how do we decide our mini-batch size (\(b\)), or the number of datapoints in our sample? The original stochastic gradient descent algorithm uses \(b=1\) so that only one sample is used to approximate the gradient at a time. Although we don’t use such a small mini-batch size often, \(b\) typically is small. When choosing \(b\), there are several factors to consider: a larger batch size results in a better gradient estimate, parallelism, and other systems factors. On the other hand, a smaller batch size will be faster and have more frequent updates. It is up to data scientists to balance the tradeoff between batch size and time complexity.
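To see how sampling changes the update, here is a minimal sketch of mini-batch stochastic gradient descent for the same kind of linear model; the batch size b, learning rate, and synthetic data are again illustrative assumptions rather than the lecture's own code.

import numpy as np

rng = np.random.default_rng(0)
n, d = 1000, 2
X = rng.normal(size=(n, d))
y = X @ np.array([1.5, -0.5]) + rng.normal(scale=0.1, size=n)

theta = np.zeros(d)
alpha, b = 0.05, 32   # learning rate and mini-batch size (illustrative choices)

for _ in range(2000):
    idx = rng.choice(n, size=b, replace=False)   # sample b of the n records
    X_b, y_b = X[idx], y[idx]
    residuals = y_b - X_b @ theta
    gradient = -2 * X_b.T @ residuals / b        # gradient estimate from the mini-batch
    theta = theta - alpha * gradient             # same update rule, but O(bd) per step

print(theta)  # approximately [1.5, -0.5]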
Summarizing our two gradient descent techniques:
* (Batch) gradient descent computes the true gradient using all n data points, costing O(nd) per update.
* Mini-batch (stochastic) gradient descent approximates the gradient with a random sample of b data points, costing O(bd) per update.
+At this point in the course, we’ve equipped ourselves with some powerful techniques to build and optimize models. We’ve explored how to develop models of multiple variables, as well as how to transform variables to help linearize a dataset and fit these models to maximize their performance.
+All of this was done with one major caveat: the regression models we’ve worked with so far are all linear in the input variables. We’ve assumed that our predictions should be some combination of linear variables. While this works well in some cases, the real world isn’t always so straightforward. We’ll learn an important method to address this issue – feature engineering – and consider some new problems that can arise when we do so.
+Feature engineering is the process of transforming raw features into more informative features that can be used in modeling or EDA tasks and improve model performance.
+Feature engineering allows you to:
A feature function describes the transformations we apply to raw features in a dataset to create a design matrix of transformed features. We typically denote the feature function as \(\Phi\) (the Greek letter “phi”). When we apply the feature function to our original dataset \(\mathbb{X}\), the result, \(\Phi(\mathbb{X})\), is a transformed design matrix ready to be used in modeling.
+For example, we might design a feature function that computes the square of an existing feature and adds it to the design matrix. In this case, our existing matrix \([x]\) is transformed to \([x, x^2]\). Its dimension increases from 1 to 2. Often, the dimension of the featurized dataset increases as seen here.
+The new features introduced by the feature function can then be used in modeling. Often, we use the symbol \(\phi_i\) to represent transformed features after feature engineering.
+\[ +\begin{align} +\hat{y} &= \theta_0 + \theta_1 x + \theta_2 x^2 \\ +\hat{y} &= \theta_0 + \theta_1 \phi_1 + \theta_2 \phi_2 +\end{align} +\]
+In matrix notation, the symbol \(\Phi\) is sometimes used to denote the design matrix after feature engineering has been performed. Note that in the usage below, \(\Phi\) is now a feature-engineered matrix, rather than a function.
+\[\hat{\mathbb{Y}} = \Phi \theta\]
+More formally, we describe a feature function as transforming the original \(\mathbb{R}^{n \times p}\) dataset \(\mathbb{X}\) to a featurized \(\mathbb{R}^{n \times p'}\) dataset \(\mathbb{\Phi}\), where \(p'\) is typically greater than \(p\).
+\[\mathbb{X} \in \mathbb{R}^{n \times p} \longrightarrow \Phi \in \mathbb{R}^{n \times p'}\]
+Feature engineering opens up a whole new set of possibilities for designing better-performing models. As you will see in lab and homework, feature engineering is one of the most important parts of the entire modeling process.
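As a small, concrete illustration of this notation, the sketch below defines a feature function that maps a one-column design matrix [x] to [x, x^2], increasing p from 1 to 2; the name phi and the toy matrix are just stand-ins.

import numpy as np

def phi(X):
    """Feature function: append the square of each existing column."""
    return np.hstack([X, X**2])

X = np.array([[1.0], [2.0], [3.0]])   # original n x 1 design matrix
Phi = phi(X)                          # featurized n x 2 matrix [x, x^2]
print(Phi)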
+A particularly powerful use of feature engineering is to allow us to perform regression on non-numeric features. One hot encoding is a feature engineering technique that generates numeric features from categorical data, allowing us to use our usual methods to fit a regression model on the data.
+To illustrate how this works, we’ll refer back to the tips
dataset from previous lectures. Consider the "day"
column of the dataset:
import numpy as np
+import seaborn as sns
+import pandas as pd
+import sklearn.linear_model as lm
+= sns.load_dataset("tips")
+ tips tips.head()
+ | total_bill | +tip | +sex | +smoker | +day | +time | +size | +
---|---|---|---|---|---|---|---|
0 | +16.99 | +1.01 | +Female | +No | +Sun | +Dinner | +2 | +
1 | +10.34 | +1.66 | +Male | +No | +Sun | +Dinner | +3 | +
2 | +21.01 | +3.50 | +Male | +No | +Sun | +Dinner | +3 | +
3 | +23.68 | +3.31 | +Male | +No | +Sun | +Dinner | +2 | +
4 | +24.59 | +3.61 | +Female | +No | +Sun | +Dinner | +4 | +
At first glance, it doesn’t seem possible to fit a regression model to this data – we can’t directly perform any mathematical operations on the entry “Sun”.
+To resolve this, we instead create a new table with a feature for each unique value in the original "day"
column. We then iterate through the "day"
column. For each entry in "day"
we fill the corresponding feature in the new table with 1. All other features are set to 0.
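One quick way to see the table this paragraph describes, sketched below, is pandas’ get_dummies applied to the "day" column of the tips DataFrame loaded above; this is only an aside, and the lecture’s own approach with sklearn follows next.

# One row per record, one indicator column per unique day value
pd.get_dummies(tips["day"]).head()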
The OneHotEncoder
class of sklearn
(documentation) offers a quick way to perform this one-hot encoding. You will explore its use in detail in the lab. For now, recognize that we follow a very similar workflow to when we were working with the LinearRegression
class: we initialize a OneHotEncoder
object, fit it to our data, and finally use .transform
to apply the fitted encoder.
from sklearn.preprocessing import OneHotEncoder

# Initialize a OneHotEncoder object
ohe = OneHotEncoder()

# Fit the encoder
ohe.fit(tips[["day"]])

# Use the encoder to transform the raw "day" feature
encoded_day = ohe.transform(tips[["day"]]).toarray()
encoded_day_df = pd.DataFrame(encoded_day, columns=ohe.get_feature_names_out())

encoded_day_df.head()
+ | day_Fri | +day_Sat | +day_Sun | +day_Thur | +
---|---|---|---|---|
0 | +0.0 | +0.0 | +1.0 | +0.0 | +
1 | +0.0 | +0.0 | +1.0 | +0.0 | +
2 | +0.0 | +0.0 | +1.0 | +0.0 | +
3 | +0.0 | +0.0 | +1.0 | +0.0 | +
4 | +0.0 | +0.0 | +1.0 | +0.0 | +
The one-hot encoded features can then be used in the design matrix to train a model:
+\[\hat{y} = \theta_1 (\text{total}\_\text{bill}) + \theta_2 (\text{size}) + \theta_3 (\text{day}\_\text{Fri}) + \theta_4 (\text{day}\_\text{Sat}) + \theta_5 (\text{day}\_\text{Sun}) + \theta_6 (\text{day}\_\text{Thur})\]
+Or in shorthand:
+\[\hat{y} = \theta_{1}\phi_{1} + \theta_{2}\phi_{2} + \theta_{3}\phi_{3} + \theta_{4}\phi_{4} + \theta_{5}\phi_{5} + \theta_{6}\phi_{6}\]
+Now, the day
feature (or rather, the four new boolean features that represent day) can be used to fit a model.
Using sklearn
to fit the new model, we can determine the model coefficients, allowing us to understand how each feature impacts the predicted tip.
from sklearn.linear_model import LinearRegression

data_w_ohe = tips[["total_bill", "size", "day"]].join(encoded_day_df).drop(columns="day")
ohe_model = lm.LinearRegression(fit_intercept=False)  # Tell sklearn to not add an additional bias column. Why?
ohe_model.fit(data_w_ohe, tips["tip"])

pd.DataFrame({"Feature": data_w_ohe.columns, "Model Coefficient": ohe_model.coef_})
+ | Feature | +Model Coefficient | +
---|---|---|
0 | +total_bill | +0.092994 | +
1 | +size | +0.187132 | +
2 | +day_Fri | +0.745787 | +
3 | +day_Sat | +0.621129 | +
4 | +day_Sun | +0.732289 | +
5 | +day_Thur | +0.668294 | +
For example, when looking at the coefficient for day_Fri
, we can now understand the impact of it being Friday on the predicted tip — if it is a Friday, the predicted tip increases by approximately $0.75.
When one-hot encoding, keep in mind that any set of one-hot encoded columns will always sum to a column of all ones, representing the bias column. More formally, the bias column is a linear combination of the OHE columns.
+We must be careful not to include this bias column in our design matrix. Otherwise, there will be linear dependence in the model, meaning \(\mathbb{X}^{\top}\mathbb{X}\) would no longer be invertible, and our OLS estimate \(\hat{\theta} = (\mathbb{X}^{\top}\mathbb{X})^{-1}\mathbb{X}^{\top}\mathbb{Y}\) fails.
+To resolve this issue, we simply omit one of the one-hot encoded columns or do not include an intercept term. The adjusted design matrices are shown below.
Either approach works: no information is lost, since the omitted column is simply a linear combination of the remaining columns and can always be recovered from them.
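In code, one reasonable way to omit a redundant column, sketched below, is to pass drop="first" to the encoder, so that each categorical feature contributes one fewer column than in the encoding above; this assumes the tips DataFrame and imports from the earlier cells.

from sklearn.preprocessing import OneHotEncoder

# Drop one category per feature so the encoded columns no longer sum to the bias column
ohe_drop = OneHotEncoder(drop="first")
encoded_day_dropped = ohe_drop.fit_transform(tips[["day"]]).toarray()
pd.DataFrame(encoded_day_dropped, columns=ohe_drop.get_feature_names_out()).head()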
+We have encountered a few cases now where models with linear features have performed poorly on datasets that show clear non-linear curvature.
+As an example, consider the vehicles
dataset, which contains information about cars. Suppose we want to use the hp
(horsepower) of a car to predict its "mpg"
(gas mileage in miles per gallon). If we visualize the relationship between these two variables, we see clear non-linear curvature. Fitting a linear model to these variables results in a high (poor) value of MSE.
\[\hat{y} = \theta_0 + \theta_1 (\text{hp})\]
import sklearn.linear_model as lm  # import added so the lm alias used below is defined

pd.options.mode.chained_assignment = None
vehicles = sns.load_dataset("mpg").dropna().rename(columns={"horsepower": "hp"}).sort_values("hp")

X = vehicles[["hp"]]
Y = vehicles["mpg"]

hp_model = lm.LinearRegression()
hp_model.fit(X, Y)
hp_model_predictions = hp_model.predict(X)

import matplotlib.pyplot as plt

sns.scatterplot(data=vehicles, x="hp", y="mpg")
plt.plot(vehicles["hp"], hp_model_predictions, c="tab:red");

print(f"MSE of model with (hp) feature: {np.mean((Y-hp_model_predictions)**2)}")
MSE of model with (hp) feature: 23.943662938603108
+As we can see from the plot, the data follows a curved line rather than a straight one. To capture this non-linearity, we can incorporate non-linear features. Let’s introduce a polynomial term, \(\text{hp}^2\), into our regression model. The model now takes the form:
+\[\hat{y} = \theta_0 + \theta_1 (\text{hp}) + \theta_2 (\text{hp}^2)\] \[\hat{y} = \theta_0 + \theta_1 \phi_1 + \theta_2 \phi_2\]
+How can we fit a model with non-linear features? We can use the exact same techniques as before: ordinary least squares, gradient descent, or sklearn
. This is because our new model is still a linear model. Although it contains non-linear features, it is linear with respect to the model parameters. All of our previous work on fitting models was done under the assumption that we were working with linear models. Because our new model is still linear, we can apply our existing methods to determine the optimal parameters.
# Add a hp^2 feature to the design matrix
X = vehicles[["hp"]]
X["hp^2"] = vehicles["hp"]**2

# Use sklearn to fit the model
hp2_model = lm.LinearRegression()
hp2_model.fit(X, Y)
hp2_model_predictions = hp2_model.predict(X)

sns.scatterplot(data=vehicles, x="hp", y="mpg")
plt.plot(vehicles["hp"], hp2_model_predictions, c="tab:red");

print(f"MSE of model with (hp^2) feature: {np.mean((Y-hp2_model_predictions)**2)}")
MSE of model with (hp^2) feature: 18.98476890761722
+Looking a lot better! By incorporating a squared feature, we are able to capture the curvature of the dataset. Our model is now a parabola centered on our data. Notice that our new model’s error has decreased relative to the original model with linear features.
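As an aside, rather than adding the squared column by hand, sklearn's PolynomialFeatures transformer can generate polynomial features automatically. A minimal sketch (the degree here is just for illustration, and the lm alias from above is assumed):

```python
from sklearn.preprocessing import PolynomialFeatures

# Generate [hp, hp^2]; include_bias=False leaves out the constant column,
# since LinearRegression adds its own intercept
poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(vehicles[["hp"]])

hp2_model_alt = lm.LinearRegression()
hp2_model_alt.fit(X_poly, Y)
```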
+We’ve seen now that feature engineering allows us to build all sorts of features to improve the performance of the model. In particular, we saw that designing a more complex feature (squaring hp
in the vehicles
data previously) substantially improved the model’s ability to capture non-linear relationships. To take full advantage of this, we might be inclined to design increasingly complex features. Consider the following three models, each of different order (the maximum exponent power of each model):
As we can see in the plots above, MSE continues to decrease with each additional polynomial term. To visualize it further, let’s plot models as the complexity increases from 0 to 7:
+When we use our model to make predictions on the same data that was used to fit the model, we find that the MSE decreases with each additional polynomial term (as our model gets more complex). The training error is the model’s error when generating predictions from the same data that was used for training purposes. We can conclude that the training error goes down as the complexity of the model increases.
+This seems like good news – when working on the training data, we can improve model performance by designing increasingly complex models.
+ +However, high model complexity comes with its own set of issues. When building the vehicles
models above, we trained the models on the entire dataset and then evaluated their performance on this same dataset. In reality, we are likely to instead train the model on a sample from the population, then use it to make predictions on data it didn’t encounter during training.
Let’s walk through a more realistic example. Say we are given a training dataset of just 6 datapoints and want to train a model to then make predictions on a different set of points. We may be tempted to make a highly complex model (e.g., degree 5), especially given it makes perfect predictions on the training data as clear on the left. However, as shown in the graph on the right, this model would perform horribly on the rest of the population!
This phenomenon is called overfitting. The model effectively just memorized the training data it encountered when it was fitted, leaving it unable to generalize well to data it didn’t encounter during training. This is a problem: we want models that are generalizable to “unseen” data.
+Additionally, since complex models are sensitive to the specific dataset used to train them, they have high variance. A model with high variance tends to vary more dramatically when trained on different datasets. Going back to our example above, we can see our degree-5 model varies erratically when we fit it to different samples of 6 points from vehicles
.
We now face a dilemma: we know that we can decrease training error by increasing model complexity, but models that are too complex start to overfit and can’t be reapplied to new datasets due to high variance.
+We can see that there is a clear trade-off that comes from the complexity of our model. As model complexity increases, the model’s error on the training data decreases. At the same time, the model’s variance tends to increase.
+The takeaway here: we need to strike a balance in the complexity of our models; we want models that are generalizable to “unseen” data. A model that is too simple won’t be able to capture the key relationships between our variables of interest; a model that is too complex runs the risk of overfitting.
+This begs the question: how do we control the complexity of a model? Stay tuned for Lecture 17 on Cross-Validation and Regularization!
+PyTorch
While this material is out of scope for Data 100, it is useful if you plan to enter a career in data science!
+In practice, you will use software packages such as PyTorch
when computing gradients and implementing gradient descent. You’ll often follow three main steps:
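The original list of steps isn't reproduced here, but as a rough, hedged sketch of the usual PyTorch pattern (compute a loss, call .backward() to get gradients, then take an optimizer step), with made-up data and learning rate:

```python
import torch

# Toy data standing in for total_bill (x) and tip (y); values are made up
x = torch.tensor([10.0, 20.0, 30.0])
y = torch.tensor([1.5, 3.0, 4.5])

theta = torch.zeros(1, requires_grad=True)   # parameter to learn
optimizer = torch.optim.SGD([theta], lr=0.001)

for _ in range(100):
    optimizer.zero_grad()                    # clear old gradients
    loss = torch.mean((x * theta - y) ** 2)  # MSE for the model y_hat = theta * x
    loss.backward()                          # compute d(loss)/d(theta)
    optimizer.step()                         # theta <- theta - lr * gradient
```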
If you want to learn more, this Intro to PyTorch tutorial is a great resource to get started!
import pandas as pd
import seaborn as sns
import plotly.express as px
import matplotlib.pyplot as plt
import numpy as np
from sklearn.linear_model import LinearRegression
pd.options.mode.chained_assignment = None  # default='warn'
sklearn
Throughout this lecture, we’ll refer to the penguins
dataset.
import pandas as pd
import seaborn as sns
import numpy as np

penguins = sns.load_dataset("penguins")
penguins = penguins[penguins["species"] == "Adelie"].dropna()
penguins.head()

|   | species | island    | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex    |
|---|---------|-----------|----------------|---------------|-------------------|-------------|--------|
| 0 | Adelie  | Torgersen | 39.1           | 18.7          | 181.0             | 3750.0      | Male   |
| 1 | Adelie  | Torgersen | 39.5           | 17.4          | 186.0             | 3800.0      | Female |
| 2 | Adelie  | Torgersen | 40.3           | 18.0          | 195.0             | 3250.0      | Female |
| 4 | Adelie  | Torgersen | 36.7           | 19.3          | 193.0             | 3450.0      | Female |
| 5 | Adelie  | Torgersen | 39.3           | 20.6          | 190.0             | 3650.0      | Male   |
Our goal will be to predict the value of the "bill_depth_mm"
for a particular penguin given its "flipper_length_mm"
and "body_mass_g"
. We’ll also add a bias column of all ones to represent the intercept term of our models.
# Add a bias column of all ones to `penguins`
penguins["bias"] = np.ones(len(penguins), dtype=int)

# Define the design matrix, X...
# Note that we use .to_numpy() to convert our DataFrame into a NumPy array so it is in matrix form
X = penguins[["bias", "flipper_length_mm", "body_mass_g"]].to_numpy()

# ...as well as the target variable, Y
# Again, we use .to_numpy() to convert our DataFrame into a NumPy array so it is in matrix form
Y = penguins[["bill_depth_mm"]].to_numpy()
In the lecture on ordinary least squares, we expressed multiple linear regression using matrix notation.
+\[\hat{\mathbb{Y}} = \mathbb{X}\theta\]
+We used a geometric approach to derive the following expression for the optimal model parameters:
+\[\hat{\theta} = (\mathbb{X}^T \mathbb{X})^{-1}\mathbb{X}^T \mathbb{Y}\]
+That’s a whole lot of matrix manipulation. How do we implement it in python
?
There are three operations we need to perform here: multiplying matrices, taking transposes, and finding inverses.

- To multiply two matrices, we use the @ operator
- To take a transpose, we call the .T attribute of a NumPy array or DataFrame
- To find an inverse, we use NumPy's in-built method np.linalg.inv
Putting this all together, we can compute the OLS estimate for the optimal model parameters, stored in the array theta_hat
.
theta_hat = np.linalg.inv(X.T @ X) @ X.T @ Y
theta_hat

array([[1.10029953e+01],
       [9.82848689e-03],
       [1.47749591e-03]])
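As an aside (a sketch, not part of the original derivation), a more numerically stable way to obtain the same least-squares estimate is to avoid forming the explicit inverse:

```python
# np.linalg.lstsq solves the least-squares problem directly
# rather than computing (X^T X)^(-1) explicitly
theta_hat_lstsq, *_ = np.linalg.lstsq(X, Y, rcond=None)
theta_hat_lstsq
```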
+To make predictions using our optimized parameter values, we matrix-multiply the design matrix with the parameter vector:
+\[\hat{\mathbb{Y}} = \mathbb{X}\theta\]
Y_hat = X @ theta_hat
pd.DataFrame(Y_hat).head()

|   | 0         |
|---|-----------|
| 0 | 18.322561 |
| 1 | 18.445578 |
| 2 | 17.721412 |
| 3 | 17.997254 |
| 4 | 18.263268 |
sklearn
Workflow
We’ve already saved a lot of time (and avoided tedious calculations) by translating our derived formulas into code. However, we still had to go through the process of writing out the linear algebra ourselves.
+To make life even easier, we can turn to the sklearn
python
library. sklearn
is a robust library of machine learning tools used extensively in research and industry. It is the standard for simple machine learning tasks and gives us a wide variety of in-built modeling frameworks and methods, so we’ll keep returning to sklearn
techniques as we progress through Data 100.
Regardless of the specific type of model being implemented, sklearn
follows a standard set of steps for creating a model:
Import the LinearRegression
model from sklearn
from sklearn.linear_model import LinearRegression
Create a model object. This generates a new instance of the model class. You can think of it as making a new “copy” of a standard “template” for a model. In code, this looks like:
+my_model = LinearRegression()
Fit the model to the X
design matrix and Y
target vector. This calculates the optimal model parameters “behind the scenes” without us explicitly working through the calculations ourselves. The fitted parameters are then stored within the model for use in future predictions:
my_model.fit(X, Y)
Use the fitted model to make predictions on the X
input data using .predict
.
my_model.predict(X)
To extract the fitted parameters, we can use:
+my_model.coef_
+
+my_model.intercept_
+Let’s put this into action with our multiple regression task!
+1. Initialize an instance of the model class
+sklearn
stores “templates” of useful models for machine learning. We begin the modeling process by making a “copy” of one of these templates for our own use. Model initialization looks like ModelClass()
, where ModelClass
is the type of model we wish to create.
For now, let’s create a linear regression model using LinearRegression
.
my_model
is now an instance of the LinearRegression
class. You can think of it as the “idea” of a linear regression model. We haven’t trained it yet, so it doesn’t know any model parameters and cannot be used to make predictions. In fact, we haven’t even told it what data to use for modeling! It simply waits for further instructions.
my_model = LinearRegression()
2. Train the model using .fit
Before the model can make predictions, we will need to fit it to our training data. When we fit the model, sklearn
will determine the optimal model parameters behind the scenes. It will then save these model parameters to our model instance for future use.
All sklearn
model classes include a .fit
method, which is used to fit the model. It takes in two inputs: the design matrix, X
, and the target variable, Y
.
Let’s start by fitting a model with just one feature: the flipper length. We create a design matrix X
by pulling out the "flipper_length_mm"
column from the DataFrame
.
# .fit expects a 2D data design matrix, so we use double brackets to extract a DataFrame
X = penguins[["flipper_length_mm"]]
Y = penguins["bill_depth_mm"]

my_model.fit(X, Y)
LinearRegression()
Notice that we use double brackets to extract this column. Why double brackets instead of just single brackets? The .fit
method, by default, expects to receive 2-dimensional data – some kind of data that includes both rows and columns. Writing penguins["flipper_length_mm"]
would return a 1D Series
, causing sklearn
to error. We avoid this by writing penguins[["flipper_length_mm"]]
to produce a 2D DataFrame
.
And in just three lines of code, our model has determined the optimal model parameters! Our single-feature model takes the form:
+\[\text{bill depth} = \theta_0 + \theta_1 \text{flipper length}\]
+Note that LinearRegression
will automatically include an intercept term.
The fitted model parameters are stored as attributes of the model instance. my_model.intercept_
will return the value of \(\hat{\theta}_0\) as a scalar. my_model.coef_
will return all values \(\hat{\theta}_1, \hat{\theta}_2, \ldots\) in an array. Because our model only contains one feature, we see just the value of \(\hat{\theta}_1\) in the cell below.
# The intercept term, theta_0
my_model.intercept_

np.float64(7.297305899612313)

# All parameters theta_1, ..., theta_p
my_model.coef_

array([0.05812622])
+3. Use the fitted model to make predictions
+Now that the model has been trained, we can use it to make predictions! To do so, we use the .predict
method. .predict
takes in one argument: the design matrix that should be used to generate predictions. To understand how the model performs on the training set, we would pass in the training data. Alternatively, to make predictions on unseen data, we would pass in a new dataset that wasn’t used to train the model.
Below, we call .predict
to generate model predictions on the original training data. As before, we use double brackets to ensure that we extract 2-dimensional data.
Y_hat_one_feature = my_model.predict(penguins[["flipper_length_mm"]])

print(f"The RMSE of the model is {np.sqrt(np.mean((Y-Y_hat_one_feature)**2))}")
The RMSE of the model is 1.154936309923901
+What if we wanted a model with two features?
+\[\text{bill depth} = \theta_0 + \theta_1 \text{flipper length} + \theta_2 \text{body mass}\]
We repeat this three-step process by initializing a new model object, then calling .fit
and .predict
as before.
# Step 1: initialize LinearRegression model
two_feature_model = LinearRegression()

# Step 2: fit the model
X_two_features = penguins[["flipper_length_mm", "body_mass_g"]]
Y = penguins["bill_depth_mm"]

two_feature_model.fit(X_two_features, Y)

# Step 3: make predictions
Y_hat_two_features = two_feature_model.predict(X_two_features)

print(f"The RMSE of the model is {np.sqrt(np.mean((Y-Y_hat_two_features)**2))}")
The RMSE of the model is 0.9881331104079043
+We can also see that we obtain the same predictions using sklearn
as we did when applying the ordinary least squares formula before!
"Y_hat from OLS":np.squeeze(Y_hat), "Y_hat from sklearn":Y_hat_two_features}).head() pd.DataFrame({
+ | Y_hat from OLS | +Y_hat from sklearn | +
---|---|---|
0 | +18.322561 | +18.322561 | +
1 | +18.445578 | +18.445578 | +
2 | +17.721412 | +17.721412 | +
3 | +17.997254 | +17.997254 | +
4 | +18.263268 | +18.263268 | +
At this point, we’ve grown quite familiar with the process of choosing a model and a corresponding loss function and optimizing parameters by choosing the values of \(\theta\) that minimize the loss function. So far, we’ve optimized \(\theta\) in two ways: using calculus (take the derivative of the loss with respect to \(\theta\), set it equal to zero, and solve) and using the geometric argument of orthogonality (which yields the OLS solution).
+One thing to note, however, is that the techniques we used above can only be applied if we make some big assumptions. For the calculus approach, we assumed that the loss function was differentiable at all points and that we could algebraically solve for the zero points of the derivative; for the geometric approach, OLS only applies when using a linear model with MSE loss. What happens when we have more complex models with different, more complex loss functions? The techniques we’ve learned so far will not work, so we need a new optimization technique: gradient descent.
+++BIG IDEA: use an iterative algorithm to numerically compute the minimum of the loss.
+
Let’s consider an arbitrary function. Our goal is to find the value of \(x\) that minimizes this function.
def arbitrary(x):
    return (x**4 - 15*x**3 + 80*x**2 - 180*x + 144)/10
Above, we saw that the minimum is somewhere around 5.3. Let’s see if we can figure out how to find the exact minimum algorithmically from scratch. One very slow (and terrible) way would be manual guess-and-check.
arbitrary(6)

0.0
+A somewhat better (but still slow) approach is to use brute force to try out a bunch of x values and return the one that yields the lowest loss.
def simple_minimize(f, xs):
    # Takes in a function f and a set of values xs.
    # Calculates the value of the function f at all values x in xs.
    # Takes the minimum value of f(x) and returns the corresponding value x.
    y = [f(x) for x in xs]
    return xs[np.argmin(y)]

guesses = [5.3, 5.31, 5.32, 5.33, 5.34, 5.35]
simple_minimize(arbitrary, guesses)
5.33
+This process is essentially the same as before where we made a graphical plot, it’s just that we’re only looking at 20 selected points.
xs = np.linspace(1, 7, 200)
sparse_xs = np.linspace(1, 7, 5)

ys = arbitrary(xs)
sparse_ys = arbitrary(sparse_xs)

fig = px.line(x=xs, y=arbitrary(xs))
fig.add_scatter(x=sparse_xs, y=arbitrary(sparse_xs), mode="markers")
fig.update_layout(showlegend=False)
fig.update_layout(autosize=False, width=800, height=600)
fig.show()
This basic approach suffers from three major flaws:
+Scipy.optimize.minimize
One way to minimize this mathematical function is to use the scipy.optimize.minimize
function. It takes a function and a starting guess and tries to find the minimum.
from scipy.optimize import minimize

# takes a function f and a starting point x0 and returns a readout
# with the optimal input value of x which minimizes f
minimize(arbitrary, x0=3.5)

  message: Optimization terminated successfully.
  success: True
   status: 0
      fun: -0.13827491292966557
        x: [ 2.393e+00]
      nit: 3
      jac: [ 6.486e-06]
 hess_inv: [[ 7.385e-01]]
     nfev: 20
     njev: 10
+scipy.optimize.minimize
is great. It may also seem a bit magical. How could you write a function that can find the minimum of any mathematical function? There are a number of ways to do this, which we’ll explore in today’s lecture, eventually arriving at the important idea of gradient descent, which is the principle that scipy.optimize.minimize
uses.
It turns out that, under the hood, fitting a model comes down to numerical optimization: sklearn’s LinearRegression uses a least-squares solver, while many other models are fit with gradient-based methods. Gradient descent in particular is how much of machine learning works, including even advanced neural network models.
In Data 100, the gradient descent process will usually be invisible to us, hidden beneath an abstraction layer. However, to be good data scientists, it’s important that we know the underlying principles that optimization functions harness to find optimal parameters.
+Looking at the function across this domain, it is clear that the function’s minimum value occurs around \(\theta = 5.3\). Let’s pretend for a moment that we couldn’t see the full view of the cost function. How would we guess the value of \(\theta\) that minimizes the function?
+It turns out that the first derivative of the function can give us a clue. In the plots below, the line indicates the value of the derivative of each value of \(\theta\). The derivative is negative where it is red and positive where it is green.
+Say we make a guess for the minimizing value of \(\theta\). Remember that we read plots from left to right, and assume that our starting \(\theta\) value is to the left of the optimal \(\hat{\theta}\). If the guess “undershoots” the true minimizing value – our guess for \(\theta\) is lower than the value of the \(\hat{\theta}\) that minimizes the function – the derivative will be negative. This means that if we increase \(\theta\) (move further to the right), then we can decrease our loss function further. If this guess “overshoots” the true minimizing value, the derivative will be positive, implying the converse.
+
+
We can use this pattern to help formulate our next guess for the optimal \(\hat{\theta}\). Consider the case where we’ve undershot \(\theta\) by guessing too low of a value. We’ll want our next guess to be greater in value than our previous guess – that is, we want to shift our guess to the right. You can think of this as following the slope “downhill” to the function’s minimum value.
+
+
If we’ve overshot \(\hat{\theta}\) by guessing too high of a value, we’ll want our next guess to be lower in value – we want to shift our guess for \(\hat{\theta}\) to the left.
+
+
In other words, the derivative of the function at each point tells us the direction of our next guess.
+Armed with this knowledge, let’s try to see if we can use the derivative to optimize the function.
+We start by making some guess for the minimizing value of \(x\). Then, we look at the derivative of the function at this value of \(x\), and step downhill in the opposite direction. We can express our new rule as a recurrence relation:
+\[x^{(t+1)} = x^{(t)} - \frac{d}{dx} f(x^{(t)})\]
+Translating this statement into English: we obtain our next guess for the minimizing value of \(x\) at timestep \(t+1\) (\(x^{(t+1)}\)) by taking our last guess (\(x^{(t)}\)) and subtracting the derivative of the function at that point (\(\frac{d}{dx} f(x^{(t)})\)).
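To make the rule concrete, here is a minimal sketch of applying it for a few steps to the arbitrary function from earlier; it assumes a helper derivative_arbitrary(x) that returns \(\frac{d}{dx}f(x)\) (one such helper is defined further below in these notes), and the starting guess is arbitrary:

```python
x = 4.0                                  # an arbitrary starting guess
for t in range(5):
    x = x - derivative_arbitrary(x)      # x^(t+1) = x^(t) - f'(x^(t))
    print(f"step {t+1}: x = {x:.3f}, f(x) = {arbitrary(x):.3f}")
```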
+A few steps are shown below, where the old step is shown as a transparent point, and the next step taken is the green-filled dot.
+
+
Looking pretty good! We do have a problem though – once we arrive close to the minimum value of the function, our guesses “bounce” back and forth past the minimum without ever reaching it.
+
+
In other words, each step we take when updating our guess moves us too far. We can address this by decreasing the size of each step.
+Let’s update our algorithm to use a learning rate (also sometimes called the step size), which controls how far we move with each update. We represent the learning rate with \(\alpha\).
+\[x^{(t+1)} = x^{(t)} - \alpha \frac{d}{dx} f(x^{(t)})\]
+A small \(\alpha\) means that we will take small steps; a large \(\alpha\) means we will take large steps. When do we stop updating? We stop updating either after a fixed number of updates or after a subsequent update doesn’t change much.
+Updating our function to use \(\alpha=0.3\), our algorithm successfully converges (settles on a solution and stops updating significantly, or at all) on the minimum value.
+
+
In our analysis above, we focused our attention on the global minimum of the loss function. You may be wondering: what about the local minimum that’s just to the left?
+If we had chosen a different starting guess for \(\theta\), or a different value for the learning rate \(\alpha\), our algorithm may have gotten “stuck” and converged on the local minimum, rather than on the true optimum value of loss.
+
+
If the loss function is convex, gradient descent is guaranteed to converge and find the global minimum of the objective function. Formally, a function \(f\) is convex if: \[tf(a) + (1-t)f(b) \geq f(ta + (1-t)b)\] for all \(a, b\) in the domain of \(f\) and \(t \in [0, 1]\).
+To put this into words: if you drew a line between any two points on the curve, all values on the curve must be on or below the line. Importantly, any local minimum of a convex function is also its global minimum so we avoid the situation where the algorithm converges on some critical point that is not the minimum of the function.
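As a quick worked check of this definition, take \(f(x) = x^2\): for any \(a, b\) and \(t \in [0, 1]\),
\[t a^2 + (1-t) b^2 - \big(ta + (1-t)b\big)^2 = t(1-t)(a-b)^2 \geq 0,\]
so every chord lies on or above the parabola, and \(f\) is convex.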
+
+
In summary, non-convex loss functions can cause problems with optimization. This means that our choice of loss function is a key factor in our modeling process. It turns out that MSE is convex, which is a major reason why it is such a popular choice of loss function. Gradient descent is only guaranteed to converge (given enough iterations and an appropriate step size) for convex functions.
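For instance, for the one-parameter model \(\hat{y} = \theta_1 x\) used later in this lecture, MSE is convex in \(\theta_1\): its second derivative is
\[\frac{d^2}{d\theta_1^2}\left[\frac{1}{n}\sum_{i=1}^{n}(y_i - \theta_1 x_i)^2\right] = \frac{2}{n}\sum_{i=1}^{n} x_i^2 \geq 0,\]
so the loss curve is an upward-opening parabola with a single global minimum.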
+++Terminology clarification: In past lectures, we have used “loss” to refer to the error incurred on a single datapoint. In applications, we usually care more about the average error across all datapoints. Going forward, we will take the “model’s loss” to mean the model’s average error across the dataset. This is sometimes also known as the empirical risk, cost function, or objective function. \[L(\theta) = R(\theta) = \frac{1}{n} \sum_{i=1}^{n} L(y, \hat{y})\]
+
In our discussion above, we worked with some arbitrary function \(f\). As data scientists, we will almost always work with gradient descent in the context of optimizing models – specifically, we want to apply gradient descent to find the minimum of a loss function. In a modeling context, our goal is to minimize a loss function by choosing the minimizing model parameters.
Recall our modeling workflow from the past few lectures: we define a model with some parameters \(\theta_i\), choose a loss function, and select the values of \(\theta_i\) that minimize the loss on our data.
+Gradient descent is a powerful technique for completing this last task. By applying the gradient descent algorithm, we can select values for our parameters \(\theta_i\) that will lead to the model having minimal loss on the training data.
+When using gradient descent in a modeling context, we:
+We can “translate” our gradient descent rule from before by replacing \(x\) with \(\theta\) and \(f\) with \(L\):
+\[\theta^{(t+1)} = \theta^{(t)} - \alpha \frac{d}{d\theta} L(\theta^{(t)})\]
+tips
Dataset
To see this in action, let’s consider a case where we have a linear model with no offset. We want to predict the tip (\(y\)) given the price of a meal (\(x\)). To do this, we choose the model \(\hat{y} = \theta_1 x\) and fit it by minimizing the mean squared error.
+Let’s apply our gradient_descent
function from before to optimize our model on the tips
dataset. We will try to select the best parameter \(\theta_i\) to predict the tip
\(y\) from the total_bill
\(x\).
df = sns.load_dataset("tips")
df.head()

|   | total_bill | tip  | sex    | smoker | day | time   | size |
|---|------------|------|--------|--------|-----|--------|------|
| 0 | 16.99      | 1.01 | Female | No     | Sun | Dinner | 2    |
| 1 | 10.34      | 1.66 | Male   | No     | Sun | Dinner | 3    |
| 2 | 21.01      | 3.50 | Male   | No     | Sun | Dinner | 3    |
| 3 | 23.68      | 3.31 | Male   | No     | Sun | Dinner | 2    |
| 4 | 24.59      | 3.61 | Female | No     | Sun | Dinner | 4    |
We can visualize the value of the MSE on our dataset for different possible choices of \(\theta_1\). To optimize our model, we want to select the value of \(\theta_1\) that leads to the lowest MSE.
import plotly.graph_objects as go

def derivative_arbitrary(x):
    return (4*x**3 - 45*x**2 + 160*x - 180)/10

fig = go.Figure()
roots = np.array([2.3927, 3.5309, 5.3263])

fig.add_trace(go.Scatter(x=xs, y=arbitrary(xs),
                         mode="lines", name="f"))
fig.add_trace(go.Scatter(x=xs, y=derivative_arbitrary(xs),
                         mode="lines", name="df", line={"dash": "dash"}))
fig.add_trace(go.Scatter(x=np.array(roots), y=0*roots,
                         mode="markers", name="df = zero", marker_size=12))
fig.update_layout(font_size=20, yaxis_range=[-1, 3])
fig.update_layout(autosize=False, width=800, height=600)
fig.show()
To apply gradient descent, we need to compute the derivative of the loss function with respect to our parameter \(\theta_1\). For the model \(\hat{y} = \theta_1 x\) with MSE loss,
\[L(\theta_1) = \frac{1}{n}\sum_{i=1}^{n}(y_i - \theta_1 x_i)^2, \qquad \frac{d}{d\theta_1} L(\theta_1) = \frac{2}{n}\sum_{i=1}^{n}(\theta_1 x_i - y_i)\,x_i,\]
so each update takes the form
\[\theta_1^{(t+1)} = \theta_1^{(t)} - \alpha \frac{d}{d\theta_1} L(\theta_1^{(t)})\]
for some learning rate \(\alpha\).
+Implementing this in code, we can visualize the MSE loss on the tips
data. MSE is convex, so there is one global minimum.
def gradient_descent(df, initial_guess, alpha, n):
    """Performs n steps of gradient descent on df using learning rate alpha starting
    from initial_guess. Returns a numpy array of all guesses over time."""
    guesses = [initial_guess]
    current_guess = initial_guess
    while len(guesses) < n:
        current_guess = current_guess - alpha * df(current_guess)
        guesses.append(current_guess)

    return np.array(guesses)

def mse_single_arg(theta_1):
    """Returns the MSE on our data for the given theta1"""
    x = df["total_bill"]
    y_obs = df["tip"]
    y_hat = theta_1 * x
    return np.mean((y_hat - y_obs) ** 2)

def mse_loss_derivative_single_arg(theta_1):
    """Returns the derivative of the MSE on our data for the given theta1"""
    x = df["total_bill"]
    y_obs = df["tip"]
    y_hat = theta_1 * x

    return np.mean(2 * (y_hat - y_obs) * x)

loss_df = pd.DataFrame({"theta_1": np.linspace(-1.5, 1),
                        "MSE": [mse_single_arg(theta_1) for theta_1 in np.linspace(-1.5, 1)]})

trajectory = gradient_descent(mse_loss_derivative_single_arg, -0.5, 0.0001, 100)

plt.plot(loss_df["theta_1"], loss_df["MSE"])
plt.scatter(trajectory, [mse_single_arg(guess) for guess in trajectory], c="white", edgecolor="firebrick")
plt.scatter(trajectory[-1], mse_single_arg(trajectory[-1]), c="firebrick")
plt.xlabel(r"$\theta_1$")
plt.ylabel(r"$L(\theta_1)$");

print(f"Final guess for theta_1: {trajectory[-1]}")
Final guess for theta_1: 0.14369554654231262
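As a quick sanity check (not part of the original notes), this one-parameter problem also has a closed-form minimizer, \(\theta_1^* = \frac{\sum_i x_i y_i}{\sum_i x_i^2}\), which the gradient descent estimate above should approach:

```python
# Closed-form minimizer of MSE for the no-intercept model y_hat = theta_1 * x
x, y_obs = df["total_bill"], df["tip"]
theta_1_closed_form = np.sum(x * y_obs) / np.sum(x**2)
theta_1_closed_form
```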
The function we worked with above was one-dimensional – we were only minimizing the function with respect to a single parameter, \(\theta\). However, models usually have a cost function with multiple parameters that need to be optimized. For example, simple linear regression has 2 parameters: \[\hat{y} = \theta_0 + \theta_1 x\] and multiple linear regression has \(p+1\) parameters: \[\hat{\mathbb{Y}} = \theta_0 + \theta_1 \mathbb{X}_{:,1} + \theta_2 \mathbb{X}_{:,2} + \cdots + \theta_p \mathbb{X}_{:,p}\]
+We’ll need to expand gradient descent so we can update our guesses for all model parameters all in one go.
+With multiple parameters to optimize, we consider a loss surface, or the model’s loss for a particular combination of possible parameter values.
import plotly.graph_objects as go


def mse_loss(theta, X, y_obs):
    y_hat = X @ theta
    return np.mean((y_hat - y_obs) ** 2)

tips_with_bias = df.copy()
tips_with_bias["bias"] = 1
tips_with_bias = tips_with_bias[["bias", "total_bill"]]

uvalues = np.linspace(0, 2, 10)
vvalues = np.linspace(-0.1, 0.35, 10)
(u, v) = np.meshgrid(uvalues, vvalues)
thetas = np.vstack((u.flatten(), v.flatten()))

def mse_loss_single_arg(theta):
    return mse_loss(theta, tips_with_bias, df["tip"])

MSE = np.array([mse_loss_single_arg(t) for t in thetas.T])

loss_surface = go.Surface(x=u, y=v, z=np.reshape(MSE, u.shape))

ind = np.argmin(MSE)
optimal_point = go.Scatter3d(name="Optimal Point",
                             x=[thetas.T[ind, 0]], y=[thetas.T[ind, 1]],
                             z=[MSE[ind]],
                             marker=dict(size=10, color="red"))

fig = go.Figure(data=[loss_surface, optimal_point])
fig.update_layout(scene=dict(
    xaxis_title="theta0",
    yaxis_title="theta1",
    zaxis_title="MSE"), autosize=False, width=800, height=600)

fig.show()
We can also visualize a bird’s-eye view of the loss surface from above using a contour plot:
contour = go.Contour(x=u[0], y=v[:, 0], z=np.reshape(MSE, u.shape))
fig = go.Figure(contour)
fig.update_layout(
    xaxis_title="theta0",
    yaxis_title="theta1", autosize=False, width=800, height=600)

fig.show()
As before, the derivative of the loss function tells us the best way towards the minimum value.
On a 2D (or higher-dimensional) surface, the direction of steepest descent is described by a vector: the negative of the gradient.
+
+
++Math Aside: Partial Derivatives
+
+++
+- For an equation with multiple variables, we take a partial derivative by differentiating with respect to just one variable at a time. The partial derivative is denoted with a \(\partial\). Intuitively, we want to see how the function changes if we only vary one variable while holding other variables constant.
+- Using \(f(x, y) = 3x^2 + y\) as an example, +
++
- taking the partial derivative with respect to x and treating y as a constant gives us \(\frac{\partial f}{\partial x} = 6x\)
+- taking the partial derivative with respect to y and treating x as a constant gives us \(\frac{\partial f}{\partial y} = 1\)
+
For the vector of parameter values \(\vec{\theta} = \begin{bmatrix} + \theta_{0} \\ + \theta_{1} \\ + \end{bmatrix}\), we take the partial derivative of loss with respect to each parameter: \(\frac{\partial L}{\partial \theta_0}\) and \(\frac{\partial L}{\partial \theta_1}\).
+++For example, consider the 2D function: \[f(\theta_0, \theta_1) = 8 \theta_0^2 + 3\theta_0\theta_1\] For a function of 2 variables \(f(\theta_0, \theta_1)\), we define the gradient \[ +\begin{align} +\frac{\partial f}{\partial \theta_{0}} &= 16\theta_0 + 3\theta_1 \\ +\frac{\partial f}{\partial \theta_{1}} &= 3\theta_0 \\ +\nabla_{\vec{\theta}} f(\vec{\theta}) &= \begin{bmatrix} 16\theta_0 + 3\theta_1 \\ 3\theta_0 \\ \end{bmatrix} +\end{align} +\]
+
The gradient vector of a generic function of \(p+1\) variables is therefore \[\nabla_{\vec{\theta}} L = \begin{bmatrix} \frac{\partial L}{\partial \theta_0} \\ \frac{\partial L}{\partial \theta_1} \\ \vdots \end{bmatrix}\] Note that \(\nabla_{\vec{\theta}} L\) points in the direction of steepest ascent of the surface, so we step in the opposite direction, \(-\nabla_{\vec{\theta}} L\), to go downhill. We can interpret each component of the gradient as: “If I nudge the \(i\)th model weight, what happens to loss?”
+We can use this to update our 1D gradient rule for models with multiple parameters.
+Recall our 1D update rule: \[\theta^{(t+1)} = \theta^{(t)} - \alpha \frac{d}{d\theta}L(\theta^{(t)})\]
For models with multiple parameters, we work in terms of vectors: \[\begin{bmatrix} + \theta_{0}^{(t+1)} \\ + \theta_{1}^{(t+1)} \\ + \vdots + \end{bmatrix} = \begin{bmatrix} + \theta_{0}^{(t)} \\ + \theta_{1}^{(t)} \\ + \vdots + \end{bmatrix} - \alpha \begin{bmatrix} + \frac{\partial L}{\partial \theta_{0}} \\ + \frac{\partial L}{\partial \theta_{1}} \\ + \vdots \\ + \end{bmatrix}\]
Written in a more compact form, \[\vec{\theta}^{(t+1)} = \vec{\theta}^{(t)} - \alpha \nabla_{\vec{\theta}} L(\theta^{(t)}) \]
+Formally, the algorithm we derived above is called batch gradient descent. For each iteration of the algorithm, the derivative of loss is computed across the entire batch of all \(n\) datapoints. While this update rule works well in theory, it is not practical in most circumstances. For large datasets (with perhaps billions of datapoints), finding the gradient across all the data is incredibly computationally taxing; gradient descent will converge slowly because each individual update is slow.
+Stochastic (mini-batch) gradient descent tries to address this issue. In stochastic descent, only a sample of the full dataset is used at each update. We estimate the true gradient of the loss surface using just that sample of data. The batch size is the number of data points used in each sample. The sampling strategy is generally without replacement (data is shuffled and batch size examples are selected one at a time.)
Each complete “pass” through the data is known as a training epoch. After shuffling the data, in a single training epoch of stochastic gradient descent, we repeatedly take the next batch-size-many points, estimate the gradient of the loss on just that mini-batch, and update \(\vec{\theta}\), continuing until every point has been used.
+Every data point appears once in a single training epoch. We then perform several training epochs until we’re satisfied.
Batch gradient descent is a deterministic technique – because the entire dataset is used at each update iteration, the algorithm will always advance towards the minimum of the loss surface. In contrast, stochastic gradient descent involves an element of randomness. Since only a subset of the full data is used to update the guess for \(\vec{\theta}\) at each iteration, there’s a chance the algorithm will not progress towards the true minimum of loss with each update. Over the longer term, these stochastic techniques should still converge towards the optimal solution.
+The diagrams below represent a “bird’s eye view” of a loss surface from above. Notice that batch gradient descent takes a direct path towards the optimal \(\hat{\theta}\). Stochastic gradient descent, in contrast, “hops around” on its path to the minimum point on the loss surface. This reflects the randomness of the sampling process at each update step.
+
+
To summarize the tradeoffs of batch size:

|      | Smaller Batch Size | Larger Batch Size |
|------|--------------------|-------------------|
| Pros | More frequent gradient updates | Leverages hardware acceleration to improve overall system performance; higher-quality gradient updates |
| Cons | More variability in the gradient estimates | Less frequent gradient updates |
The typical solution is to set batch size to ensure sufficient hardware utilization.
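To tie the batch and mini-batch ideas together, here is a minimal sketch (an illustration, not the notes' own code) applied to the two-parameter tips model from the loss-surface example above; the learning rate, iteration count, and batch size are assumed values:

```python
import numpy as np

def mse_gradient(theta, X, y_obs):
    """Gradient of MSE with respect to theta for the linear model X @ theta."""
    n = len(X)
    return (2 / n) * (X.T @ (X @ theta - y_obs))

# Reuse the two-column design matrix (bias, total_bill) built earlier
X_gd = tips_with_bias.to_numpy()
y_gd = df["tip"].to_numpy()
alpha = 0.0001              # assumed learning rate (small because total_bill is unscaled)

# Batch gradient descent: every update uses all n points
theta = np.zeros(2)         # initial guess for [theta_0, theta_1]
for _ in range(10_000):
    theta = theta - alpha * mse_gradient(theta, X_gd, y_gd)

# One epoch of mini-batch stochastic gradient descent: shuffle once,
# then update on successive batches drawn without replacement
batch_size = 32             # assumed batch size
indices = np.random.permutation(len(X_gd))
for start in range(0, len(X_gd), batch_size):
    batch = indices[start:start + batch_size]
    theta = theta - alpha * mse_gradient(theta, X_gd[batch], y_gd[batch])

theta
```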
+ + + + +Last time, we introduced the idea of random variables and how they affect the data and model we construct. We also demonstrated the decomposition of model risk from a fitted model and dived into the bias-variance tradeoff.
In this lecture, we will explore regression inference via hypothesis testing, understand how to use bootstrapping under the right assumptions, and consider the challenges of understanding causality in theory and in practice.
+There are two main reasons why we build models:
+Recall the framework we established in the last lecture. The relationship between datapoints is given by \(Y = g(x) + \epsilon\), where \(g(x)\) is the true underlying relationship, and \(\epsilon\) represents randomness. If we assume \(g(x)\) is linear, we can express this relationship in terms of the unknown, true model parameters \(\theta\).
+\[f_{\theta}(x) = g(x) + \epsilon = \theta_0 + \theta_1 x_1 + \ldots + \theta_p x_p + \epsilon\]
+Our model attempts to estimate each true population parameter \(\theta_i\) using the sample estimates \(\hat{\theta}_i\) calculated from the design matrix \(\Bbb{X}\) and response vector \(\Bbb{Y}\).
+\[f_{\hat{\theta}}(x) = \hat{\theta}_0 + \hat{\theta}_1 x_1 + \ldots + \hat{\theta}_p x_p\]
+Let’s pause for a moment. At this point, we’re very used to working with the idea of a model parameter. But what exactly does each coefficient \(\theta_i\) actually mean? We can think of each \(\theta_i\) as a slope of the linear model. If all other variables are held constant, a unit change in \(x_i\) will result in a \(\theta_i\) change in \(f_{\theta}(x)\). Broadly speaking, a large value of \(\theta_i\) means that the feature \(x_i\) has a large effect on the response; conversely, a small value of \(\theta_i\) means that \(x_i\) has little effect on the response. In the extreme case, if the true parameter \(\theta_i\) is 0, then the feature \(x_i\) has no effect on \(Y(x)\).
+If the true parameter \(\theta_i\) for a particular feature is 0, this tells us something pretty significant about the world: there is no underlying relationship between \(x_i\) and \(Y(x)\)! But how can we test if a parameter is actually 0? As a baseline, we go through our usual process of drawing a sample, using this data to fit a model, and computing an estimate \(\hat{\theta}_i\). However, we also need to consider that if our random sample comes out differently, we may find a different result for \(\hat{\theta}_i\). To infer if the true parameter \(\theta_i\) is 0, we want to draw our conclusion from the distribution of \(\hat{\theta}_i\) estimates we could have drawn across all other random samples. This is where hypothesis testing comes in handy!
+To test if the true parameter \(\theta_i\) is 0, we construct a hypothesis test where our null hypothesis states that the true parameter \(\theta_i\) is 0, and the alternative hypothesis states that the true parameter \(\theta_i\) is not 0. If our p-value is smaller than our cutoff value (usually p = 0.05), we reject the null hypothesis in favor of the alternative hypothesis.
+To determine the properties (e.g., variance) of the sampling distribution of an estimator, we’d need access to the population. Ideally, we’d want to consider all possible samples in the population, compute an estimate for each sample, and study the distribution of those estimates.
+
+
+
However, this can be quite expensive and time-consuming. Even more importantly, we don’t have access to the population; we only have one random sample from the population. How can we consider all possible samples if we only have one?
Bootstrapping comes in handy here! With bootstrapping, we treat our random sample as a “population” and resample from it with replacement. Intuitively, a random sample resembles the population (if it is big enough), so a random resample also resembles a random sample of the population. When resampling, there are a couple of things to keep in mind: each resample must be drawn with replacement (otherwise we would simply reproduce the original sample), and it must be the same size as the original sample.
+
+
+
Bootstrap resampling is a technique for estimating the sampling distribution of an estimator. To execute it, we can follow the pseudocode below:
+collect a random sample of size n (called the bootstrap population)
+
+initiate a list of estimates
+
+repeat 10,000 times:
+ resample with replacement from the bootstrap population
+ apply estimator f to the resample
+ store in list
+
+list of estimates is the bootstrapped sampling distribution of f
+How well does bootstrapping actually represent our population? The bootstrapped sampling distribution of an estimator does not exactly match the sampling distribution of that estimator, but it is often close. Similarly, the variance of the bootstrapped distribution is often close to the true variance of the estimator. The example below displays the results of different bootstraps from a known population using a sample size of \(n=50\).
+
+
+
In the real world, we don’t know the population distribution. The center of the bootstrapped distribution is the estimator applied to our original sample, so we have no way of understanding the estimator’s true expected value; the center and spread of our bootstrap are approximations. The quality of our bootstrapped distribution also depends on the quality of our original sample. If our original sample was not representative of the population (like Sample 5 in the image above), then the bootstrap is next to useless. In general, bootstrapping works better for large samples, when the population distribution is not heavily skewed (no outliers), and when the estimator is “low variance” (insensitive to extreme values).
Although our bootstrapped sample distribution does not exactly match the sampling distribution of the population, we can see that it is relatively close. This demonstrates the benefit of bootstrapping: without knowing the actual population distribution, we can still roughly approximate the true slope for the model by using only a single random sample of 20 cars.
+ +We can conduct the hypothesis testing described earlier through bootstrapping (this equivalence can be proven through the duality argument, which is out of scope for this class). We use bootstrapping to compute approximate 95% confidence intervals for each \(\theta_i\). If the interval doesn’t contain 0, we reject the null hypothesis at the p=5% level. Otherwise, the data is consistent with the null, as the true parameter could possibly be 0.
+To show an example of this hypothesis testing process, we’ll work with the snowy plover dataset throughout this section. The data are about the eggs and newly hatched chicks of the Snowy Plover. The data were collected at the Point Reyes National Seashore by a former student at Berkeley. Here’s a parent bird and some eggs.
+
+
+
Note that Egg Length
and Egg Breadth
(widest diameter) are measured in millimeters, and Egg Weight
and Bird Weight
are measured in grams. For reference, a standard paper clip weighs about one gram.
import pandas as pd
eggs = pd.read_csv("data/snowy_plover.csv")
eggs.head(5)

|   | egg_weight | egg_length | egg_breadth | bird_weight |
|---|------------|------------|-------------|-------------|
| 0 | 7.4        | 28.80      | 21.84       | 5.2         |
| 1 | 7.7        | 29.04      | 22.45       | 5.4         |
| 2 | 7.9        | 29.36      | 22.48       | 5.6         |
| 3 | 7.5        | 30.10      | 21.71       | 5.3         |
| 4 | 8.3        | 30.17      | 22.75       | 5.9         |
Our goal will be to predict the weight of a newborn plover chick, which we assume follows the true relationship \(Y = f_{\theta}(x)\) below.
+\[\text{bird\_weight} = \theta_0 + \theta_1 \text{egg\_weight} + \theta_2 \text{egg\_length} + \theta_3 \text{egg\_breadth} + \epsilon\]
+Note that for each \(i\), the parameter \(\theta_i\) is a fixed number, but it is unobservable. We can only estimate it. The random error \(\epsilon\) is also unobservable, but it is assumed to have expectation 0 and be independent and identically distributed across eggs.
+Say we wish to determine if the egg_weight
impacts the bird_weight
of a chick – we want to infer if \(\theta_1\) is equal to 0.
First, we define our hypotheses:
+Next, we use our data to fit a model \(\hat{Y} = f_{\hat{\theta}}(x)\) that approximates the relationship above. This gives us the observed value of \(\hat{\theta}_1\) from our data.
from sklearn.linear_model import LinearRegression
import numpy as np

X = eggs[["egg_weight", "egg_length", "egg_breadth"]]
Y = eggs["bird_weight"]

model = LinearRegression()
model.fit(X, Y)

# This gives an array containing the fitted model parameter estimates
thetas = model.coef_

# Put the parameter estimates in a nice table for viewing
display(pd.DataFrame(
    [model.intercept_] + list(model.coef_),
    columns=['theta_hat'],
    index=['intercept', 'egg_weight', 'egg_length', 'egg_breadth']
))

print("MSE", np.mean((Y - model.predict(X)) ** 2))

|             | theta_hat |
|-------------|-----------|
| intercept   | -4.605670 |
| egg_weight  | 0.431229  |
| egg_length  | 0.066570  |
| egg_breadth | 0.215914  |

MSE 0.045470853802757547
Our single sample of data gives us the value of \(\hat{\theta}_1=0.431\). To get a sense of how this estimate might vary if we were to draw different random samples, we will use bootstrapping. As a refresher, to construct a bootstrap sample, we draw a resample from the collected data that is the same size as the original sample and is drawn with replacement.
+We draw a bootstrap sample, use this sample to fit a model, and record the result for \(\hat{\theta}_1\) on this bootstrapped sample. We then repeat this process many times to generate a bootstrapped empirical distribution of \(\hat{\theta}_1\). This gives us an estimate of what the true distribution of \(\hat{\theta}_1\) across all possible samples might look like.
# Set a random seed so you generate the same random sample as staff
# In the "real world", we wouldn't do this
import numpy as np
np.random.seed(1337)

# Set the sample size of each bootstrap sample
n = len(eggs)

# Create a list to store all the bootstrapped estimates
estimates = []

# Generate a bootstrap resample from `eggs` and find an estimate for theta_1 using this sample.
# Repeat 10000 times.
for i in range(10000):
    # draw a bootstrap sample
    bootstrap_resample = eggs.sample(n, replace=True)
    X_bootstrap = bootstrap_resample[["egg_weight", "egg_length", "egg_breadth"]]
    Y_bootstrap = bootstrap_resample["bird_weight"]

    # use the bootstrapped sample to fit a model
    bootstrap_model = LinearRegression()
    bootstrap_model.fit(X_bootstrap, Y_bootstrap)
    bootstrap_thetas = bootstrap_model.coef_

    # record the result for theta_1
    estimates.append(bootstrap_thetas[0])

# calculate the 95% confidence interval
lower = np.percentile(estimates, 2.5, axis=0)
upper = np.percentile(estimates, 97.5, axis=0)
conf_interval = (lower, upper)
conf_interval
(np.float64(-0.2586481195684874), np.float64(1.103424385420405))
+Our bootstrapped 95% confidence interval for \(\theta_1\) is \([-0.259, 1.103]\). Immediately, we can see that 0 is indeed contained in this interval – this means that we cannot conclude that \(\theta_1\) is non-zero! More formally, we fail to reject the null hypothesis (that \(\theta_1\) is 0) under a 5% p-value cutoff.
+We can repeat this process to construct 95% confidence intervals for the other parameters of the model.
+1337)
+ np.random.seed(
+= []
+ theta_0_estimates = []
+ theta_1_estimates = []
+ theta_2_estimates = []
+ theta_3_estimates
+
+for i in range(10000):
+= eggs.sample(n, replace=True)
+ bootstrap_resample = bootstrap_resample[["egg_weight", "egg_length", "egg_breadth"]]
+ X_bootstrap = bootstrap_resample["bird_weight"]
+ Y_bootstrap
+ = LinearRegression()
+ bootstrap_model
+ bootstrap_model.fit(X_bootstrap, Y_bootstrap)= bootstrap_model.intercept_
+ bootstrap_theta_0 = bootstrap_model.coef_
+ bootstrap_theta_1, bootstrap_theta_2, bootstrap_theta_3
+
+ theta_0_estimates.append(bootstrap_theta_0)
+ theta_1_estimates.append(bootstrap_theta_1)
+ theta_2_estimates.append(bootstrap_theta_2)
+ theta_3_estimates.append(bootstrap_theta_3)
+ = np.percentile(theta_0_estimates, 2.5), np.percentile(theta_0_estimates, 97.5)
+ theta_0_lower, theta_0_upper = np.percentile(theta_1_estimates, 2.5), np.percentile(theta_1_estimates, 97.5)
+ theta_1_lower, theta_1_upper = np.percentile(theta_2_estimates, 2.5), np.percentile(theta_2_estimates, 97.5)
+ theta_2_lower, theta_2_upper = np.percentile(theta_3_estimates, 2.5), np.percentile(theta_3_estimates, 97.5)
+ theta_3_lower, theta_3_upper
+# Make a nice table to view results
+"lower":[theta_0_lower, theta_1_lower, theta_2_lower, theta_3_lower], "upper":[theta_0_upper, \
+ pd.DataFrame({=["theta_0", "theta_1", "theta_2", "theta_3"]) theta_1_upper, theta_2_upper, theta_3_upper]}, index
+ | lower | +upper | +
---|---|---|
theta_0 | +-15.278542 | +5.161473 | +
theta_1 | +-0.258648 | +1.103424 | +
theta_2 | +-0.099138 | +0.208557 | +
theta_3 | +-0.257141 | +0.758155 | +
Something’s off here. Notice that 0 is included in the 95% confidence interval for every parameter of the model. Using the interpretation we outlined above, this would suggest that we can’t say for certain that any of the input variables impact the response variable! This makes it seem like our model can’t make any predictions – and yet, each model we fit in our bootstrap experiment above could very much make predictions of \(Y\).
+How can we explain this result? Think back to how we first interpreted the parameters of a linear model. We treated each \(\theta_i\) as a slope, where a unit increase in \(x_i\) leads to a \(\theta_i\) increase in \(Y\), if all other variables are held constant. It turns out that this last assumption is very important. If variables in our model are somehow related to one another, then it might not be possible to have a change in one of them while holding the others constant. This means that our interpretation framework is no longer valid! In the models we fit above, we incorporated egg_length
, egg_breadth
, and egg_weight
as input variables. These variables are very likely related to one another – an egg with large egg_length
and egg_breadth
will likely be heavy in egg_weight
. This means that the model parameters cannot be meaningfully interpreted as slopes.
To support this conclusion, we can visualize the relationships between our feature variables. Notice the strong positive association between the features.
import seaborn as sns
sns.pairplot(eggs[["egg_length", "egg_breadth", "egg_weight", 'bird_weight']]);
This issue is known as collinearity, sometimes also called multicollinearity. Collinearity occurs when one feature can be predicted fairly accurately by a linear combination of the other features, which happens when one feature is highly correlated with the others.
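One simple way to check for this (a sketch, not part of the original analysis) is to look at the pairwise correlations between the features:

```python
# Pairwise correlations close to 1 signal collinearity among the features
eggs[["egg_weight", "egg_length", "egg_breadth"]].corr()
```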
+Why is collinearity a problem? Its consequences span several aspects of the modeling process:
+The take-home point is that we need to be careful with what features we select for modeling. If two features likely encode similar information, it is often a good idea to choose only one of them as an input variable.
+Let us now consider a more interpretable model: we instead assume a true relationship using only egg weight:
+\[f_\theta(x) = \theta_0 + \theta_1 \text{egg\_weight} + \epsilon\]
from sklearn.linear_model import LinearRegression

X_int = eggs[["egg_weight"]]
Y_int = eggs["bird_weight"]

model_int = LinearRegression()
model_int.fit(X_int, Y_int)

# This gives an array containing the fitted model parameter estimates
thetas_int = model_int.coef_

# Put the parameter estimates in a nice table for viewing
pd.DataFrame({"theta_hat": [model_int.intercept_, thetas_int[0]]}, index=["theta_0", "theta_1"])

|         | theta_hat |
|---------|-----------|
| theta_0 | -0.058272 |
| theta_1 | 0.718515  |
import matplotlib.pyplot as plt

# Set a random seed so you generate the same random sample as staff
# In the "real world", we wouldn't do this
np.random.seed(1337)

# Set the sample size of each bootstrap sample
n = len(eggs)

# Create a list to store all the bootstrapped estimates
estimates_int = []

# Generate a bootstrap resample from `eggs` and find an estimate for theta_1 using this sample.
# Repeat 10000 times.
for i in range(10000):
    bootstrap_resample_int = eggs.sample(n, replace=True)
    X_bootstrap_int = bootstrap_resample_int[["egg_weight"]]
    Y_bootstrap_int = bootstrap_resample_int["bird_weight"]

    bootstrap_model_int = LinearRegression()
    bootstrap_model_int.fit(X_bootstrap_int, Y_bootstrap_int)
    bootstrap_thetas_int = bootstrap_model_int.coef_

    estimates_int.append(bootstrap_thetas_int[0])

plt.figure(dpi=120)
sns.histplot(estimates_int, stat="density")
plt.xlabel(r"$\hat{\theta}_1$")
plt.title(r"Bootstrapped estimates $\hat{\theta}_1$ Under the Interpretable Model");
Notice how the interpretable model performs almost as well as our other model:
from sklearn.metrics import mean_squared_error

mse = mean_squared_error(Y, model.predict(X))
mse_int = mean_squared_error(Y_int, model_int.predict(X_int))
print(f'MSE of Original Model: {mse}')
print(f'MSE of Interpretable Model: {mse_int}')

MSE of Original Model: 0.045470853802757547
MSE of Interpretable Model: 0.04649394137555684
+Yet, the confidence interval for the true parameter \(\theta_{1}\) does not contain zero.
lower_int = np.percentile(estimates_int, 2.5)
upper_int = np.percentile(estimates_int, 97.5)

conf_interval_int = (lower_int, upper_int)
conf_interval_int
(np.float64(0.6029335250209632), np.float64(0.8208401738546208))
+In retrospect, it’s no surprise that the weight of an egg best predicts the weight of a newly-hatched chick.
+A model with highly correlated variables prevents us from interpreting how the variables are related to the prediction.
+Keep the following in mind: All inference assumes that the regression model holds.
+Note: the content in this section is out of scope.
+ +The difference between correlation/prediction vs. causation is best illustrated through examples.
+Some questions about correlation / prediction include:
+While these may sound like causal questions, they are not! Questions about causality are about the effects of interventions (not just passive observation). For example:
+Note, however, that regression coefficients are sometimes called “effects”, which can be deceptive!
+When using data alone, predictive questions (i.e., are breastfed babies healthier?) can be answered, but causal questions (i.e., does breastfeeding improve babies’ health?) cannot. The reason for this is that there are many possible causes for our predictive question. For example, possible explanations for why breastfed babies are healthier on average include:
+We cannot tell which explanations are true (or to what extent) just by observing (\(x\),\(y\)) pairs. Additionally, causal questions implicitly involve counterfactuals, events that didn’t happen. For example, we could ask, would the same breastfed babies have been less healthy if they hadn’t been breastfed? Explanation 1 from above implies they would be, but explanations 2 and 3 do not.
+Let T represent a treatment (for example, alcohol use) and Y represent an outcome (for example, lung cancer).
+A confounder is a variable that affects both T and Y, distorting the correlation between them. Using the example above, rich parents could be a confounder for breastfeeding and a baby’s health. Confounders can be a measured covariate (a feature) or an unmeasured variable we don’t know about, and they generally cause problems, as the relationship between T and Y is affected by data we cannot see. We commonly assume that all confounders are observed (this is also called ignorability).
+In a randomized experiment, participants are randomly assigned into two groups: treatment and control. A treatment is applied only to the treatment group. We assume ignorability and gather as many measurements as possible so that we can compare them between the control and treatment groups to determine whether or not the treatment has a true effect or is just a confounding factor.
+However, often, randomly assigning treatments is impractical or unethical. For example, assigning a treatment of cigarettes to test the effect of smoking on the lungs would not only be impractical but also unethical.
+An alternative to bypass this issue is to utilize observational studies. This can be done by obtaining two participant groups separated based on some identified treatment variable. Unlike randomized experiments, however, we cannot assume ignorability here: the participants could have separated into two groups based on other covariates! In addition, there could also be unmeasured confounders.
+Up until this point in the semester, we’ve focused on analyzing datasets. We’ve looked into the early stages of the data science lifecycle, focusing on the programming tools, visualization techniques, and data cleaning methods needed for data analysis.
+This lecture marks a shift in focus. We will move away from examining datasets to actually using our data to better understand the world. Specifically, the next sequence of lectures will explore predictive modeling: generating models to make some predictions about the world around us. In this lecture, we’ll introduce the conceptual framework for setting up a modeling task. In the next few lectures, we’ll put this framework into practice by implementing various kinds of models.
+A model is an idealized representation of a system. A system is a set of principles or procedures according to which something functions. We live in a world full of systems: the procedure of turning on a light happens according to a specific set of rules dictating the flow of electricity. The truth behind how any event occurs is usually complex, and many times the specifics are unknown. The workings of the world can be viewed as its own giant procedure. Models seek to simplify the world and distill it into workable pieces.
+Example: We model the fall of an object on Earth as subject to a constant acceleration of \(9.81 m/s^2\) due to gravity.
+Why do we want to build models? As far as data scientists and statisticians are concerned, there are three reasons, and each implies a different focus on modeling.
+To explain complex phenomena occurring in the world we live in. Examples of this might be:
+In these cases, we care about creating models that are simple and interpretable, allowing us to understand what the relationships between our variables are.
To make accurate predictions about unseen data. Some examples include:
+When making predictions, we care more about making extremely accurate predictions, at the cost of having an uninterpretable model. These are sometimes called black-box models and are common in fields like deep learning.
To measure the causal effects of one event on some other event. For example,
+This is a much harder question because most statistical tools are designed to infer association, not causation. We will not focus on this task in Data 100, but you can take other advanced classes on causal inference (e.g., Stat 156, Data 102) if you are intrigued!
Most of the time, we aim to strike a balance between building interpretable models and building accurate models.
+In general, models can be split into two categories:
+Deterministic physical (mechanistic) models: Laws that govern how the world works.
+Probabilistic models: Models that attempt to understand how random processes evolve. These are more general and can be used to describe many phenomena in the real world. These models commonly make simplifying assumptions about the nature of the world.
+Note: These specific models are not in the scope of Data 100 and exist to serve as motivation.
+The regression line is the unique straight line that minimizes the mean squared error of estimation among all straight lines. As with any straight line, it can be defined by a slope and a y-intercept:
+import pandas as pd
+import numpy as np
+import matplotlib.pyplot as plt
+import seaborn as sns
+# Set random seed for consistency
+np.random.seed(43)
+plt.style.use('default')
+
+# Generate random noise for plotting
+x = np.linspace(-3, 3, 100)
+y = x * 0.5 - 1 + np.random.randn(100) * 0.3
+
+# plot regression line
+sns.regplot(x=x, y=y);
For a pair of variables \(x\) and \(y\) representing our data \(\mathcal{D} = \{(x_1, y_1), (x_2, y_2), \dots, (x_n, y_n)\}\), we denote their means/averages as \(\bar x\) and \(\bar y\) and standard deviations as \(\sigma_x\) and \(\sigma_y\).
+A variable is represented in standard units if the following are true:
+To convert a variable \(x_i\) into standard units, we subtract its mean from it and divide it by its standard deviation. For example, \(x_i\) in standard units is \(\frac{x_i - \bar x}{\sigma_x}\).
+The correlation (\(r\)) is the average of the product of \(x\) and \(y\), both measured in standard units.
+\[r = \frac{1}{n} \sum_{i=1}^n (\frac{x_i - \bar{x}}{\sigma_x})(\frac{y_i - \bar{y}}{\sigma_y})\]
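+To make the formula concrete, here is a small check (not from the lecture; the arrays are made-up values) that computes \(r\) by converting both variables to standard units and averaging their product. It should agree with `np.corrcoef`.
+# Illustrative check of the correlation formula (assumes numpy is imported as np)
+x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
+y = np.array([2.0, 1.9, 3.5, 3.7, 5.1])
+
+x_su = (x - np.mean(x)) / np.std(x)   # convert x to standard units
+y_su = (y - np.mean(y)) / np.std(y)   # convert y to standard units
+r = np.mean(x_su * y_su)              # average of the product of standard units
+
+print(r, np.corrcoef(x, y)[0, 1])     # the two values should match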
+def plot_and_get_corr(ax, x, y, title):
+    ax.set_xlim(-3, 3)
+    ax.set_ylim(-3, 3)
+    ax.set_xticks([])
+    ax.set_yticks([])
+    ax.scatter(x, y, alpha = 0.73)
+    r = np.corrcoef(x, y)[0, 1]
+    ax.set_title(title + " (corr: {})".format(r.round(2)))
+    return r
+
+fig, axs = plt.subplots(2, 2, figsize = (10, 10))
+
+# Just noise
+x1, y1 = np.random.randn(2, 100)
+corr1 = plot_and_get_corr(axs[0, 0], x1, y1, title = "noise")
+
+# Strong linear
+x2 = np.linspace(-3, 3, 100)
+y2 = x2 * 0.5 - 1 + np.random.randn(100) * 0.3
+corr2 = plot_and_get_corr(axs[0, 1], x2, y2, title = "strong linear")
+
+# Unequal spread
+x3 = np.linspace(-3, 3, 100)
+y3 = - x3/3 + np.random.randn(100)*(x3)/2.5
+corr3 = plot_and_get_corr(axs[1, 0], x3, y3, title = "unequal spread")
+extent = axs[1, 0].get_window_extent().transformed(fig.dpi_scale_trans.inverted())
+
+# Strong non-linear
+x4 = np.linspace(-3, 3, 100)
+y4 = 2*np.sin(x4 - 1.5) + np.random.randn(100) * 0.3
+corr4 = plot_and_get_corr(axs[1, 1], x4, y4, title = "strong non-linear")
+
+plt.show()
When the variables \(y\) and \(x\) are measured in standard units, the regression line for predicting \(y\) based on \(x\) has slope \(r\) and passes through the origin.
+\[\hat{y}_{su} = r \cdot x_{su}\]
+\[\frac{\hat{y} - \bar{y}}{\sigma_y} = r \cdot \frac{x - \bar{x}}{\sigma_x}\]
+Starting from the top, we have our claimed form of the regression line, and we want to show that it is equivalent to the optimal linear regression line: \(\hat{y} = \hat{a} + \hat{b}x\).
+Recall:
+At a high level, a model is a way of representing a system. In Data 100, we’ll treat a model as some mathematical rule we use to describe the relationship between variables.
+What variables are we modeling? Typically, we use a subset of the variables in our sample of collected data to model another variable in this data. To put this more formally, say we have the following dataset \(\mathcal{D}\):
+\[\mathcal{D} = \{(x_1, y_1), (x_2, y_2), ..., (x_n, y_n)\}\]
+Each pair of values \((x_i, y_i)\) represents a datapoint. In a modeling setting, we call these observations. \(y_i\) is the dependent variable we are trying to model, also called an output or response. \(x_i\) is the independent variable inputted into the model to make predictions, also known as a feature.
+Our goal in modeling is to use the observed data \(\mathcal{D}\) to predict the output variable \(y_i\). We denote each prediction as \(\hat{y}_i\) (read: “y hat sub i”).
+How do we generate these predictions? Some examples of models we’ll encounter in the next few lectures are given below:
+\[\hat{y}_i = \theta\] \[\hat{y}_i = \theta_0 + \theta_1 x_i\]
+The examples above are known as parametric models. They relate the collected data, \(x_i\), to the prediction we make, \(\hat{y}_i\). A few parameters (\(\theta\), \(\theta_0\), \(\theta_1\)) are used to describe the relationship between \(x_i\) and \(\hat{y}_i\).
+Notice that we don’t immediately know the values of these parameters. While the features, \(x_i\), are taken from our observed data, we need to decide what values to give \(\theta\), \(\theta_0\), and \(\theta_1\) ourselves. This is the heart of parametric modeling: what parameter values should we choose so our model makes the best possible predictions?
+To choose our model parameters, we’ll work through the modeling process.
+Our first step is choosing a model: defining the mathematical rule that describes the relationship between the features, \(x_i\), and predictions \(\hat{y}_i\).
+In Data 8, you learned about the Simple Linear Regression (SLR) model. You learned that the model takes the form: \[\hat{y}_i = a + bx_i\]
+In Data 100, we’ll use slightly different notation: we will replace \(a\) with \(\theta_0\) and \(b\) with \(\theta_1\). This will allow us to use the same notation when we explore more complex models later on in the course.
+\[\hat{y}_i = \theta_0 + \theta_1 x_i\]
+The parameters of the SLR model are \(\theta_0\), also called the intercept term, and \(\theta_1\), also called the slope term. To create an effective model, we want to choose values for \(\theta_0\) and \(\theta_1\) that most accurately predict the output variable. The “best” fitting model parameters are given the special names: \(\hat{\theta}_0\) and \(\hat{\theta}_1\); they are the specific parameter values that allow our model to generate the best possible predictions.
+In Data 8, you learned that the best SLR model parameters are: \[\hat{\theta}_0 = \bar{y} - \hat{\theta}_1\bar{x} \qquad \qquad \hat{\theta}_1 = r \frac{\sigma_y}{\sigma_x}\]
+A quick reminder on notation:
+In Data 100, we want to understand how to derive these best model coefficients. To do so, we’ll introduce the concept of a loss function.
+We’ve talked about the idea of creating the “best” possible predictions. This begs the question: how do we decide how “good” or “bad” our model’s predictions are?
+A loss function characterizes the cost, error, or fit resulting from a particular choice of model or model parameters. This function, \(L(y, \hat{y})\), quantifies how “bad” or “far off” a single prediction by our model is from a true, observed value in our collected data.
+The choice of loss function for a particular model will affect the accuracy and computational cost of estimation, and it’ll also depend on the estimation task at hand. For example,
+Regardless of the specific function used, a loss function should follow two basic principles:
+Two common choices of loss function are squared loss and absolute loss.
+Squared loss, also known as L2 loss, computes loss as the square of the difference between the observed \(y_i\) and predicted \(\hat{y}_i\): \[L(y_i, \hat{y}_i) = (y_i - \hat{y}_i)^2\]
+Absolute loss, also known as L1 loss, computes loss as the absolute difference between the observed \(y_i\) and predicted \(\hat{y}_i\): \[L(y_i, \hat{y}_i) = |y_i - \hat{y}_i|\]
+L1 and L2 loss give us a tool for quantifying our model’s performance on a single data point. This is a good start, but ideally, we want to understand how our model performs across our entire dataset. A natural way to do this is to compute the average loss across all data points in the dataset. This is known as the cost function, \(\hat{R}(\theta)\): \[\hat{R}(\theta) = \frac{1}{n} \sum^n_{i=1} L(y_i, \hat{y}_i)\]
+The cost function has many names in the statistics literature. You may also encounter the terms:
+We can substitute our L1 and L2 loss into the cost function definition. The Mean Squared Error (MSE) is the average squared loss across a dataset: \[\text{MSE} = \frac{1}{n} \sum_{i=1}^n (y_i - \hat{y}_i)^2\]
+The Mean Absolute Error (MAE) is the average absolute loss across a dataset: \[\text{MAE}= \frac{1}{n} \sum_{i=1}^n |y_i - \hat{y}_i|\]
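+As a quick illustration (not part of the original notes; the numbers below are made up), both cost functions can be written directly with numpy.
+# Toy observed values and predictions (hypothetical numbers, for illustration only)
+y_obs = np.array([2.0, 3.5, 4.0, 5.5])
+y_hat = np.array([2.5, 3.0, 4.5, 5.0])
+
+mse = np.mean((y_obs - y_hat) ** 2)   # average squared (L2) loss
+mae = np.mean(np.abs(y_obs - y_hat))  # average absolute (L1) loss
+print(mse, mae)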
+Now that we’ve established the concept of a loss function, we can return to our original goal of choosing model parameters. Specifically, we want to choose the best set of model parameters that will minimize the model’s cost on our dataset. This process is called fitting the model.
+We know from calculus that a function is minimized when (1) its first derivative is equal to zero and (2) its second derivative is positive. We often call the function being minimized the objective function (our objective is to find its minimum).
+To find the optimal model parameter, we:
+We repeat this process for each parameter present in the model. For now, we’ll disregard the second derivative condition.
+To help us make sense of this process, let’s put it into action by deriving the optimal model parameters for simple linear regression using the mean squared error as our cost function. Remember: although the notation may look tricky, all we are doing is following the three steps above!
+Step 1: take the derivative of the cost function with respect to each model parameter. We substitute the SLR model, \(\hat{y}_i = \theta_0+\theta_1 x_i\), into the definition of MSE above and differentiate with respect to \(\theta_0\) and \(\theta_1\). \[\text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \frac{1}{n} \sum_{i=1}^{n} (y_i - \theta_0 - \theta_1 x_i)^2\]
+\[\frac{\partial}{\partial \theta_0} \text{MSE} = \frac{-2}{n} \sum_{i=1}^{n} (y_i - \theta_0 - \theta_1 x_i)\]
+\[\frac{\partial}{\partial \theta_1} \text{MSE} = \frac{-2}{n} \sum_{i=1}^{n} (y_i - \theta_0 - \theta_1 x_i)x_i\]
+Let’s walk through these derivations in more depth, starting with the derivative of MSE with respect to \(\theta_0\).
+Given our MSE above, we know that: \[\frac{\partial}{\partial \theta_0} \text{MSE} = \frac{\partial}{\partial \theta_0} \frac{1}{n} \sum_{i=1}^{n} {(y_i - \theta_0 - \theta_1 x_i)}^{2}\]
+Noting that the derivative of sum is equivalent to the sum of derivatives, this then becomes: \[ = \frac{1}{n} \sum_{i=1}^{n} \frac{\partial}{\partial \theta_0} {(y_i - \theta_0 - \theta_1 x_i)}^{2}\]
+We can then apply the chain rule.
+\[ = \frac{1}{n} \sum_{i=1}^{n} 2 \cdot (y_i - \theta_0 - \theta_1 x_i) \cdot (-1)\]
+Finally, we can simplify the constants, leaving us with our answer.
+\[\frac{\partial}{\partial \theta_0} \text{MSE} = \frac{-2}{n} \sum_{i=1}^{n}{(y_i - \theta_0 - \theta_1 x_i)}\]
+Following the same procedure, we can take the derivative of MSE with respect to \(\theta_1\).
+\[\frac{\partial}{\partial \theta_1} \text{MSE} = \frac{\partial}{\partial \theta_1} \frac{1}{n} \sum_{i=1}^{n} {(y_i - \theta_0 - \theta_1 x_i)}^{2}\]
+\[ = \frac{1}{n} \sum_{i=1}^{n} \frac{\partial}{\partial \theta_1} {(y_i - \theta_0 - \theta_1 x_i)}^{2}\]
+\[ = \frac{1}{n} \sum_{i=1}^{n} 2 \cdot (y_i - \theta_0 - \theta_1 x_i) \cdot (-x_i)\]
+\[= \frac{-2}{n} \sum_{i=1}^{n} {(y_i - \theta_0 - \theta_1 x_i)}x_i\]
+Step 2: set the derivatives equal to 0. After simplifying terms, this produces two estimating equations. The best set of model parameters \((\hat{\theta}_0, \hat{\theta}_1)\) must satisfy these two optimality conditions. \[0 = \frac{-2}{n} \sum_{i=1}^{n} (y_i - \hat{\theta}_0 - \hat{\theta}_1 x_i) \Longleftrightarrow \frac{1}{n}\sum_{i=1}^{n} (y_i - \hat{y}_i) = 0\] \[0 = \frac{-2}{n} \sum_{i=1}^{n} (y_i - \hat{\theta}_0 - \hat{\theta}_1 x_i)x_i \Longleftrightarrow \frac{1}{n}\sum_{i=1}^{n} (y_i - \hat{y}_i)x_i = 0\]
+Step 3: solve the estimating equations to compute estimates for \(\hat{\theta}_0\) and \(\hat{\theta}_1\).
+Taking the first equation gives the estimate of \(\hat{\theta}_0\): \[\frac{1}{n} \sum_{i=1}^n (y_i - \hat{\theta}_0 - \hat{\theta}_1 x_i) = 0 \]
+\[\left(\frac{1}{n} \sum_{i=1}^n y_i \right) - \hat{\theta}_0 - \hat{\theta}_1\left(\frac{1}{n} \sum_{i=1}^n x_i \right) = 0\]
+\[ \hat{\theta}_0 = \bar{y} - \hat{\theta}_1 \bar{x}\]
+With a bit more maneuvering, the second equation gives the estimate of \(\hat{\theta}_1\). Start by multiplying the first estimating equation by \(\bar{x}\), then subtracting the result from the second estimating equation.
+\[\frac{1}{n} \sum_{i=1}^n (y_i - \hat{y}_i)x_i - \frac{1}{n} \sum_{i=1}^n (y_i - \hat{y}_i)\bar{x} = 0 \]
+\[\frac{1}{n} \sum_{i=1}^n (y_i - \hat{y}_i)(x_i - \bar{x}) = 0 \]
+Next, plug in \(\hat{y}_i = \hat{\theta}_0 + \hat{\theta}_1 x_i = \bar{y} + \hat{\theta}_1(x_i - \bar{x})\):
+\[\frac{1}{n} \sum_{i=1}^n (y_i - \bar{y} - \hat{\theta}_1(x_i - \bar{x}))(x_i - \bar{x}) = 0 \]
+\[\frac{1}{n} \sum_{i=1}^n (y_i - \bar{y})(x_i - \bar{x}) = \hat{\theta}_1 \times \frac{1}{n} \sum_{i=1}^n (x_i - \bar{x})^2\]
+By using the definition of correlation \(\left(r = \frac{1}{n} \sum_{i=1}^n (\frac{x_i-\bar{x}}{\sigma_x})(\frac{y_i-\bar{y}}{\sigma_y}) \right)\) and standard deviation \(\left(\sigma_x = \sqrt{\frac{1}{n} \sum_{i=1}^n (x_i - \bar{x})^2} \right)\), we can conclude: \[r \sigma_x \sigma_y = \hat{\theta}_1 \times \sigma_x^2\] \[\hat{\theta}_1 = r \frac{\sigma_y}{\sigma_x}\]
+Just as was given in Data 8!
+Remember, this derivation found the optimal model parameters for SLR when using the MSE cost function. If we had used a different model or different loss function, we likely would have found different values for the best model parameters. However, regardless of the model and loss used, we can always follow these three steps to fit the model.
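+As a sanity check (not in the original lecture; the data below is simulated), we can verify the closed-form estimates numerically: the formulas \(\hat{\theta}_1 = r\frac{\sigma_y}{\sigma_x}\) and \(\hat{\theta}_0 = \bar{y} - \hat{\theta}_1\bar{x}\) should agree with `np.polyfit`, which also minimizes MSE for a line.
+# Verify the closed-form SLR estimates on simulated data
+rng = np.random.default_rng(42)
+x = rng.normal(size=100)
+y = 2 + 3 * x + rng.normal(scale=0.5, size=100)
+
+r = np.corrcoef(x, y)[0, 1]
+theta1_hat = r * np.std(y) / np.std(x)
+theta0_hat = np.mean(y) - theta1_hat * np.mean(x)
+
+# np.polyfit fits a degree-1 polynomial by least squares, so it should match
+slope, intercept = np.polyfit(x, y, 1)
+print(theta0_hat, theta1_hat)
+print(intercept, slope)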
+Up until this point in the class, we’ve focused on regression tasks - that is, predicting an unbounded numerical quantity from a given dataset. We discussed optimization, feature engineering, and regularization all in the context of performing regression to predict some quantity.
+Now that we have this deep understanding of the modeling process, let’s expand our knowledge of possible modeling tasks.
+In the next two lectures, we’ll tackle the task of classification. A classification problem aims to classify data into categories. Unlike in regression, where we predicted a numeric output, classification involves predicting some categorical variable, or response, \(y\). Examples of classification tasks include:
+There are a couple of different types of classification:
+We can further combine multiple related classification predictions (e.g., translation, voice recognition, etc.) to tackle complex problems through structured prediction tasks.
+In Data 100, we will mostly deal with binary classification, where we are attempting to classify data into one of two classes.
+To build a classification model, we need to modify our modeling workflow slightly. Recall that in regression we:
+In classification, however, we no longer want to output numeric predictions; instead, we want to predict the class to which a datapoint belongs. This means that we need to update our workflow. To build a classification model, we will:
+There are two key differences: as we’ll soon see, we need to incorporate a non-linear transformation to capture the non-linear relationships hidden in our data. We do so by applying the sigmoid function to a linear combination of the features. Secondly, we must apply a decision rule to convert the numeric quantities computed by our model into an actual class prediction. This can be as simple as saying that any datapoint with a feature greater than some number \(x\) belongs to Class 1.
+Regression:
+Classification:
+This was a very high-level overview. Let’s walk through the process in detail to clarify what we mean.
+Throughout this lecture, we will work with the games
dataset, which contains information about games played in the NBA basketball league. Our goal will be to use a basketball team’s "GOAL_DIFF"
to predict whether or not a given team won their game ("WON"
). If a team wins their game, we’ll say they belong to Class 1. If they lose, they belong to Class 0.
For those who are curious, "GOAL_DIFF"
represents the difference in successful field goal percentages between the two competing teams.
import warnings
+"ignore")
+ warnings.filterwarnings(
+import pandas as pd
+import numpy as np
+='ignore')
+ np.seterr(divide
+= pd.read_csv("data/games").dropna()
+ games games.head()
+ | GAME_ID | +TEAM_NAME | +MATCHUP | +WON | +GOAL_DIFF | +AST | +
---|---|---|---|---|---|---|
0 | +21701216 | +Dallas Mavericks | +DAL vs. PHX | +0 | +-0.251 | +20 | +
1 | +21700846 | +Phoenix Suns | +PHX @ GSW | +0 | +-0.237 | +13 | +
2 | +21700071 | +San Antonio Spurs | +SAS @ ORL | +0 | +-0.234 | +19 | +
3 | +21700221 | +New York Knicks | +NYK @ TOR | +0 | +-0.234 | +17 | +
4 | +21700306 | +Miami Heat | +MIA @ NYK | +0 | +-0.222 | +21 | +
Let’s visualize the relationship between "GOAL_DIFF"
and "WON"
using the Seaborn function sns.stripplot
. A strip plot automatically introduces a small amount of random noise to jitter the data. Recall that all values in the "WON"
column are either 1 (won) or 0 (lost) – if we were to directly plot them without jittering, we would see severe overplotting.
import seaborn as sns
+import matplotlib.pyplot as plt
+
+=games, x="GOAL_DIFF", y="WON", orient="h", hue='WON', alpha=0.7)
+ sns.stripplot(data# By default, sns.stripplot plots 0, then 1. We invert the y axis to reverse this behavior
+; plt.gca().invert_yaxis()
This dataset is unlike anything we’ve seen before – our target variable contains only two unique values! (Remember that each y value is either 0 or 1; the plot above jitters the y data slightly for ease of reading.)
+The regression models we have worked with always assumed that we were attempting to predict a continuous target. If we apply a linear regression model to this dataset, something strange happens.
+import sklearn.linear_model as lm
+
+= games[["GOAL_DIFF"]], games["WON"]
+ X, Y = lm.LinearRegression()
+ regression_model
+ regression_model.fit(X, Y)
+"k")
+ plt.plot(X.squeeze(), regression_model.predict(X), =games, x="GOAL_DIFF", y="WON", orient="h", hue='WON', alpha=0.7)
+ sns.stripplot(data; plt.gca().invert_yaxis()
The linear regression fit follows the data as closely as it can. However, this approach has a key flaw - the predicted output, \(\hat{y}\), can be outside the range of possible classes (there are predictions above 1 and below 0). This means that the output can’t always be interpreted (what does it mean to predict a class of -2.3?).
+Our usual linear regression framework won’t work here. Instead, we’ll need to get more creative.
+Back in Data 8, you gradually built up to the concept of linear regression by using the graph of averages. Before you knew the mathematical underpinnings of the regression line, you took a more intuitive approach: you bucketed the \(x\) data into bins of common values, then computed the average \(y\) for all datapoints in the same bin. The result gave you the insight needed to derive the regression fit.
+Let’s take the same approach as we grapple with our new classification task. In the cell below, we 1) bucket the "GOAL_DIFF"
data into bins of similar values and 2) compute the average "WON"
value of all datapoints in a bin.
# bucket the GOAL_DIFF data into 20 bins
+= pd.cut(games["GOAL_DIFF"], 20)
+ bins "bin"] = [(b.left + b.right) / 2 for b in bins]
+ games[= games.groupby("bin")["WON"].mean()
+ win_rates_by_bin
+# plot the graph of averages
+=games, x="GOAL_DIFF", y="WON", orient="h", alpha=0.5, hue='WON') # alpha makes the points transparent
+ sns.stripplot(data="tab:red")
+ plt.plot(win_rates_by_bin.index, win_rates_by_bin, c; plt.gca().invert_yaxis()
Interesting: our result is certainly not like the straight line produced by finding the graph of averages for a linear relationship. We can make two observations:
+Let’s think more about what we’ve just done.
+To find the average \(y\) value for each bin, we computed:
+\[\frac{1 \text{(\# Y = 1 in bin)} + 0 \text{(\# Y = 0 in bin)}}{\text{\# datapoints in bin}} = \frac{\text{\# Y = 1 in bin}}{\text{\# datapoints in bin}} = P(\text{Y = 1} | \text{bin})\]
+This is simply the probability of a datapoint in that bin belonging to Class 1! This aligns with our observation from earlier: all of our predictions lie between 0 and 1, just as we would expect for a probability.
+Our graph of averages was really modeling the probability, \(p\), that a datapoint belongs to Class 1, or essentially that \(\text{Y = 1}\) for a particular value of \(\text{x}\).
+\[ p = P(Y = 1 | \text{ x} )\]
+In logistic regression, we have a new modeling goal. We want to model the probability that a particular datapoint belongs to Class 1 by approximating the S-shaped curve we plotted above. However, we’ve only learned about linear modeling techniques like Linear Regression and OLS.
+Fortunately for us, we’re already well-versed with a technique to model non-linear relationships – we can apply non-linear transformations like log or exponents to make a non-linear relationship more linear. Recall the steps we’ve applied previously:
+In past examples, we used the bulge diagram to help us decide what transformations may be useful. The S-shaped curve we saw above, however, looks nothing like any relationship we’ve seen in the past. We’ll need to think carefully about what transformations will linearize this curve.
+Let’s consider our eventual goal: determining if we should predict a Class of 0 or 1 for each datapoint. Rephrased, we want to decide if it seems more “likely” that the datapoint belongs to Class 0 or to Class 1. One way of deciding this is to see which class has the higher predicted probability for a given datapoint. The odds is defined as the probability of a datapoint belonging to Class 1 divided by the probability of it belonging to Class 0.
+\[\text{odds} = \frac{P(Y=1|x)}{P(Y=0|x)} = \frac{p}{1-p}\]
+If we plot the odds for each input "GOAL_DIFF"
(\(x\)), we see something that looks more promising.
+p = win_rates_by_bin
+odds = p/(1-p)
+
+plt.plot(odds.index, odds)
+plt.xlabel("x")
+plt.ylabel(r"Odds $= \frac{p}{1-p}$");
It turns out that the relationship between our input "GOAL_DIFF"
and the odds is roughly exponential! Let’s linearize the exponential by taking the logarithm (as suggested by the Tukey-Mosteller Bulge Diagram). As a reminder, you should assume that any logarithm in Data 100 is the base \(e\) natural logarithm unless told otherwise.
import numpy as np
+log_odds = np.log(odds)
+plt.plot(odds.index, log_odds, c="tab:green")
+plt.xlabel("x")
+plt.ylabel(r"Log-Odds $= \log{\frac{p}{1-p}}$");
We see something promising – the relationship between the log-odds and our input feature is approximately linear. This means that we can use a linear model to describe the relationship between the log-odds and \(x\). In other words:
+\[\begin{align} +\log{(\frac{p}{1-p})} &= \theta_0 + \theta_1 x_1 + ... + \theta_p x_p\\ +&= x^{\top} \theta +\end{align}\]
+Here, we use \(x^{\top}\) to represent an observation in our dataset, stored as a row vector. You can imagine it as a single row in our design matrix. \(x^{\top} \theta\) indicates a linear combination of the features for this observation (just as we used in multiple linear regression).
+We’re in good shape! We have now derived the following relationship:
+\[\log{(\frac{p}{1-p})} = x^{\top} \theta\]
+Remember that our goal is to predict the probability of a datapoint belonging to Class 1, \(p\). Let’s rearrange this relationship to uncover the original relationship between \(p\) and our input data, \(x^{\top}\).
+\[\begin{align} +\log{(\frac{p}{1-p})} &= x^T \theta\\ +\frac{p}{1-p} &= e^{x^T \theta}\\ +p &= (1-p)e^{x^T \theta}\\ +p &= e^{x^T \theta}- p e^{x^T \theta}\\ +p(1 + e^{x^T \theta}) &= e^{x^T \theta} \\ +p &= \frac{e^{x^T \theta}}{1+e^{x^T \theta}}\\ +p &= \frac{1}{1+e^{-x^T \theta}}\\ +\end{align}\]
+Phew, that was a lot of algebra. What we’ve uncovered is the logistic regression model to predict the probability of a datapoint \(x^{\top}\) belonging to Class 1. If we plot this relationship for our data, we see the S-shaped curve from earlier!
+# We'll discuss the `LogisticRegression` class next time
+xs = np.linspace(-0.3, 0.3)
+
+logistic_model = lm.LogisticRegression(C=20)
+logistic_model.fit(X, Y)
+predicted_prob = logistic_model.predict_proba(xs[:, np.newaxis])[:, 1]
+
+sns.stripplot(data=games, x="GOAL_DIFF", y="WON", orient="h", alpha=0.5)
+plt.plot(xs, predicted_prob, c="k", lw=3, label="Logistic regression model")
+plt.plot(win_rates_by_bin.index, win_rates_by_bin, lw=2, c="tab:red", label="Graph of averages")
+plt.legend(loc="upper left")
+plt.gca().invert_yaxis();
The S-shaped curve is formally known as the sigmoid function and is typically denoted by \(\sigma\).
+\[\sigma(t) = \frac{1}{1+e^{-t}}\]
+ +In the context of our modeling process, the sigmoid is considered an activation function. It takes in a linear combination of the features and applies a non-linear transformation.
+To predict a probability using the logistic regression model, we:
+Our predicted probabilities are of the form \(P(Y=1|x) = p = \frac{1}{1+e^{-x^T \theta}} = \frac{1}{1+e^{-(\theta_0 + \theta_1 x_1 + \theta_2 x_2 + \ldots + \theta_p x_p)}}\)
+An important note: despite its name, logistic regression is used for classification tasks, not regression tasks. In Data 100, we always apply logistic regression with the goal of classifying data.
+Let’s summarize our logistic regression modeling workflow:
+Our main takeaways from this section are:
+Putting this together, we know that the estimated probability that response is 1 given the features \(x\) is equal to the logistic function \(\sigma()\) at the value \(x^{\top}\theta\):
+\[\begin{align} +\hat{P}_{\theta}(Y = 1 | x) = \frac{1}{1 + e^{-x^{\top}\theta}} +\end{align}\]
+More commonly, the logistic regression model is written as:
+\[\begin{align} +\hat{P}_{\theta}(Y = 1 | x) = \sigma(x^{\top}\theta) +\end{align}\]
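+As a small illustration (the parameter and feature values below are made up, not fitted), the predicted probability is just the sigmoid applied to the linear combination \(x^{\top}\theta\):
+def sigmoid(t):
+    return 1 / (1 + np.exp(-t))
+
+# Hypothetical fitted parameters [intercept, coefficient] and one observation
+theta = np.array([0.1, 30.0])
+x = np.array([1.0, 0.05])       # leading 1 pairs with the intercept term
+
+p_hat = sigmoid(x @ theta)      # P(Y = 1 | x)
+print(p_hat)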
+ + +To quantify the error of our logistic regression model, we’ll need to define a new loss function.
+You may wonder: why not use our familiar mean squared error? It turns out that the MSE is not well suited for logistic regression. To see why, let’s consider a simple, artificially generated toy
dataset with just one feature (this will be easier to work with than the more complicated games
data).
+toy_df = pd.DataFrame({
+        "x": [-4, -2, -0.5, 1, 3, 5],
+        "y": [0, 0, 1, 0, 1, 1]})
+toy_df.head()
+ | x | +y | +
---|---|---|
0 | +-4.0 | +0 | +
1 | +-2.0 | +0 | +
2 | +-0.5 | +1 | +
3 | +1.0 | +0 | +
4 | +3.0 | +1 | +
We’ll construct a basic logistic regression model with only one feature and no intercept term. Our predicted probabilities take the form:
+\[p=P(Y=1|x)=\frac{1}{1+e^{-\theta_1 x}}\]
+In the cell below, we plot the MSE for our model on the data.
+def sigmoid(z):
+    return 1/(1+np.e**(-z))
+
+def mse_on_toy_data(theta):
+    p_hat = sigmoid(toy_df['x'] * theta)
+    return np.mean((toy_df['y'] - p_hat)**2)
+
+thetas = np.linspace(-15, 5, 100)
+plt.plot(thetas, [mse_on_toy_data(theta) for theta in thetas])
+plt.title("MSE on toy classification data")
+plt.xlabel(r'$\theta_1$')
+plt.ylabel('MSE');
This looks nothing like the parabola we found when plotting the MSE of a linear regression model! In particular, we can identify two flaws with using the MSE for logistic regression:
+Suffice to say, we don’t want to use the MSE when working with logistic regression. Instead, we’ll consider what kind of behavior we would like to see in a loss function.
+Let \(y\) be the binary label (it can either be 0 or 1), and \(p\) be the model’s predicted probability of the label \(y\) being 1.
+In other words, our loss function should behave differently depending on the value of the true class, \(y\).
+The cross-entropy loss incorporates this changing behavior. We will use it throughout our work on logistic regression. Below, we write out the cross-entropy loss for a single datapoint (no averages just yet).
+\[\text{Cross-Entropy Loss} = \begin{cases} + -\log{(p)} & \text{if } y=1 \\ + -\log{(1-p)} & \text{if } y=0 +\end{cases}\]
+Why does this (seemingly convoluted) loss function “work”? Let’s break it down.
+When \(y=1\) | +When \(y=0\) | +
---|---|
![]() |
+![]() |
+
As \(p \rightarrow 0\), loss approaches \(\infty\) | +As \(p \rightarrow 0\), loss approaches 0 |
As \(p \rightarrow 1\), loss approaches 0 | +As \(p \rightarrow 1\), loss approaches \(\infty\) | +
All good – we are seeing the behavior we want for our logistic regression model.
+The piecewise function we outlined above is difficult to optimize: we don’t want to constantly “check” which form of the loss function we should be using at each step of choosing the optimal model parameters. We can re-express cross-entropy loss in a more convenient way:
+\[\text{Cross-Entropy Loss} = -\left(y\log{(p)}+(1-y)\log{(1-p)}\right)\]
+By setting \(y\) to 0 or 1, we see that this new form of cross-entropy loss gives us the same behavior as the original formulation. Another way to think about this is that in either scenario (y being equal to 0 or 1), only one of the cross-entropy loss terms is activated, which gives us a convenient way to combine the two independent loss functions.
+When \(y=1\):
+\[\begin{align} +\text{CE} &= -\left((1)\log{(p)}+(1-1)\log{(1-p)}\right)\\ +&= -\log{(p)} +\end{align}\]
+When \(y=0\):
+\[\begin{align} +\text{CE} &= -\left((0)\log{(p)}+(1-0)\log{(1-p)}\right)\\ +&= -\log{(1-p)} +\end{align}\]
+The empirical risk of the logistic regression model is then the mean cross-entropy loss across all datapoints in the dataset. When fitting the model, we want to determine the model parameter \(\theta\) that leads to the lowest mean cross-entropy loss possible.
+\[ +\begin{align} +R(\theta) &= - \frac{1}{n} \sum_{i=1}^n \left(y_i\log{(p_i)}+(1-y_i)\log{(1-p_i)}\right) \\ +&= - \frac{1}{n} \sum_{i=1}^n \left(y_i\log{\sigma(X_i^{\top}\theta)}+(1-y_i)\log{(1-\sigma(X_i^{\top}\theta))}\right) +\end{align} +\]
+The optimization problem is therefore to find the estimate \(\hat{\theta}\) that minimizes \(R(\theta)\):
+\[ +\hat{\theta} = \underset{\theta}{\arg\min} - \frac{1}{n} \sum_{i=1}^n \left(y_i\log{(\sigma(X_i^{\top}\theta))}+(1-y_i)\log{(1-\sigma(X_i^{\top}\theta))}\right) +\]
+Plotting the cross-entropy loss surface for our toy
dataset gives us a more encouraging result – our loss function is now convex. This means we can optimize it using gradient descent. Computing the gradient of the logistic model is fairly challenging, so we’ll let sklearn
take care of this for us. You won’t need to compute the gradient of the logistic model in Data 100.
def cross_entropy(y, p_hat):
+    return - y * np.log(p_hat) - (1 - y) * np.log(1 - p_hat)
+
+def mean_cross_entropy_on_toy_data(theta):
+    p_hat = sigmoid(toy_df['x'] * theta)
+    return np.mean(cross_entropy(toy_df['y'], p_hat))
+
+plt.plot(thetas, [mean_cross_entropy_on_toy_data(theta) for theta in thetas], color = 'green')
+plt.ylabel(r'Mean Cross-Entropy Loss($\theta$)')
+plt.xlabel(r'$\theta$');
It may have seemed like we pulled cross-entropy loss out of thin air. How did we know that taking the negative logarithms of our probabilities would work so well? It turns out that cross-entropy loss is justified by probability theory.
+The following section is out of scope, but is certainly an interesting read!
+To build some intuition for logistic regression, let’s look at an introductory example to classification: the coin flip. Suppose we observe some outcomes of a coin flip (1 = Heads, 0 = Tails).
+flips = [0, 0, 1, 1, 1, 1, 0, 0, 0, 0]
+flips
[0, 0, 1, 1, 1, 1, 0, 0, 0, 0]
+A reasonable model is to assume all flips are IID (independent and identically distributed). In other words, each flip has the same probability of returning a 1 (or heads). Let’s define a parameter \(\theta\), the probability that the next flip is a heads. We will use this parameter to inform our decision for \(\hat y\) (predicting either 0 or 1) of the next flip. If \(\theta \ge 0.5, \hat y = 1, \text{else } \hat y = 0\).
+You may be inclined to say \(0.5\) is the best choice for \(\theta\). However, notice that we made no assumption about the coin itself. The coin may be biased, so we should make our decision based only on the data. We know that exactly \(\frac{4}{10}\) of the flips were heads, so we might guess \(\hat \theta = 0.4\). In the next section, we will mathematically prove why this is the best possible estimate.
+Let’s call the result of the coin flip a random variable \(Y\). This is a Bernoulli random variable with two outcomes. \(Y\) has the following distribution:
+\[P(Y = y) = \begin{cases} + p, \text{if } y=1\\ + 1 - p, \text{if } y=0 + \end{cases} \]
+\(p\) is unknown to us. But we can find the \(p\) that makes the data we observed the most likely.
+The probability of observing 4 heads and 6 tails follows the binomial distribution.
+\[\binom{10}{4} (p)^4 (1-p)^6\]
+We define the likelihood of obtaining our observed data as a quantity proportional to the probability above. To find it, simply multiply the probabilities of obtaining each coin flip.
+\[(p)^{4} (1-p)^6\]
+The technique known as maximum likelihood estimation finds the \(p\) that maximizes the above likelihood. You can find this maximum by taking the derivative of the likelihood, but we’ll provide a more intuitive graphical solution.
+thetas = np.linspace(0, 1)
+plt.plot(thetas, (thetas**4)*(1-thetas)**6)
+plt.xlabel(r"$\theta$")
+plt.ylabel("Likelihood");
More generally, the likelihood for some Bernoulli(\(p\)) random variable \(Y\) is:
+\[P(Y = y) = \begin{cases}
    p, \text{if } y=1\\
    1 - p, \text{if } y=0
  \end{cases} \]
+Equivalently, this can be written in a compact way:
+\[P(Y=y) = p^y(1-p)^{1-y}\]
+In our example, a Bernoulli random variable is analogous to a single data point (e.g., one instance of a basketball team winning or losing a game). All together, our games
data consists of many IID Bernoulli(\(p\)) random variables. To find the likelihood of independent events in succession, simply multiply their likelihoods.
\[\prod_{i=1}^{n} p^{y_i} (1-p)^{1-y_i}\]
+As with the coin example, we want to find the parameter \(p\) that maximizes this likelihood. Earlier, we gave an intuitive graphical solution, but let’s take the derivative of the likelihood to find this maximum.
+At a first glance, this derivative will be complicated! We will have to use the product rule, followed by the chain rule. Instead, we can make an observation that simplifies the problem.
+Finding the \(p\) that maximizes \[\prod_{i=1}^{n} p^{y_i} (1-p)^{1-y_i}\] is equivalent to the \(p\) that maximizes \[\text{log}(\prod_{i=1}^{n} p^{y_i} (1-p)^{1-y_i})\]
+This is because \(\text{log}\) is a strictly increasing function. It won’t change the maximum or minimum of the function it was applied to. From \(\text{log}\) properties, \(\text{log}(a*b)\) = \(\text{log}(a) + \text{log}(b)\). We can apply this to our equation above to get:
+\[\underset{p}{\text{argmax}} \sum_{i=1}^{n} \text{log}(p^{y_i} (1-p)^{1-y_i})\]
+\[= \underset{p}{\text{argmax}} \sum_{i=1}^{n} (\text{log}(p^{y_i}) + \text{log}((1-p)^{1-y_i}))\]
+\[= \underset{p}{\text{argmax}} \sum_{i=1}^{n} (y_i\text{log}(p) + (1-y_i)\text{log}(1-p))\]
+We can add a constant factor of \(\frac{1}{n}\) out front. It won’t affect the \(p\) that maximizes our likelihood.
+\[=\underset{p}{\text{argmax}} \frac{1}{n} \sum_{i=1}^{n} y_i\text{log}(p) + (1-y_i)\text{log}(1-p)\]
+One last “trick” we can do is change this to a minimization problem by negating the result. This works because negating a concave function yields a convex function, so maximizing the log-likelihood is equivalent to minimizing the negative log-likelihood.
+\[= \underset{p}{\text{argmin}} -\frac{1}{n} \sum_{i=1}^{n} y_i\text{log}(p) + (1-y_i)\text{log}(1-p)\]
+Now let’s say that we have data that are independent with different probability \(p_i\). Then, we would want to find the \(p_1, p_2, \dots, p_n\) that maximize \[\prod_{i=1}^{n} p_i^{y_i} (1-p_i)^{1-y_i}\]
+Setting up and simplifying the optimization problems as we did above, we ultimately want to find:
+\[= \underset{p}{\text{argmin}} -\frac{1}{n} \sum_{i=1}^{n} y_i\text{log}(p_i) + (1-y_i)\text{log}(1-p_i)\]
+For logistic regression, \(p_i = \sigma(x_i^{\top}\theta)\). Plugging that in, we get:
+\[= \underset{\theta}{\text{argmin}} -\frac{1}{n} \sum_{i=1}^{n} y_i\text{log}(\sigma(x_i^{\top}\theta)) + (1-y_i)\text{log}(1-\sigma(x_i^{\top}\theta))\]
+This is exactly our average cross-entropy loss minimization problem from before!
+Why did we do all this complicated math? We have shown that minimizing cross-entropy loss is equivalent to maximizing the likelihood of the training data.
+Note that this is under the assumption that all data is drawn independently from the same logistic regression model with parameter \(\theta\). In fact, many of the model + loss combinations we’ve seen can be motivated using MLE (e.g., OLS, Ridge Regression, etc.). In probability and ML classes, you’ll get the chance to explore MLE further.
+Today, we will continue studying the Logistic Regression model. We will discuss decision boundaries, which inform how a particular prediction is classified, and learn about linear separability. Picking up from last lecture’s discussion of cross-entropy loss, we’ll study a few of its pitfalls and potential remedies. We will also provide an implementation of sklearn
’s logistic regression model. Lastly, we’ll return to decision rules and discuss metrics that allow us to determine our model’s performance in different scenarios.
This will introduce us to the process of thresholding – a technique used to classify data from our model’s predicted probabilities, or \(P(Y=1|x)\). In doing so, we’ll focus on how these thresholding decisions affect the behavior of our model and learn various evaluation metrics useful for binary classification, and apply them to our study of logistic regression.
+In logistic regression, we model the probability that a datapoint belongs to Class 1.
+
Last week, we developed the logistic regression model to predict that probability, but we never actually made any classifications for whether our prediction \(y\) belongs in Class 0 or Class 1.
\[ p = P(Y=1 | x) = \frac{1}{1 + e^{-x^{\top}\theta}}\]
+A decision rule tells us how to interpret the output of the model to make a decision on how to classify a datapoint. We commonly make decision rules by specifying a threshold, \(T\). If the predicted probability is greater than or equal to \(T\), predict Class 1. Otherwise, predict Class 0.
+\[\hat y = \text{classify}(x) = \begin{cases} + 1, & P(Y=1|x) \ge T\\ + 0, & \text{otherwise } + \end{cases}\]
+The threshold is often set to \(T = 0.5\), but not always. We’ll discuss why we might want to use other thresholds \(T \neq 0.5\) later in this lecture.
+Using our decision rule, we can define a decision boundary as the “line” that splits the data into classes based on its features. For logistic regression, since we are working in \(p\) dimensions, the decision boundary is a hyperplane – a linear combination of the features in \(p\)-dimensions – and we can recover it from the final logistic regression model. For example, if we have a model with 2 features (2D), we have \(\theta = [\theta_0, \theta_1, \theta_2]\) including the intercept term, and we can solve for the decision boundary like so:
+\[ +\begin{align} +T &= \frac{1}{1 + e^{-(\theta_0 + \theta_1 * \text{feature1} + \theta_2 * \text{feature2})}} \\ +1 + e^{-(\theta_0 + \theta_1 \cdot \text{feature1} + \theta_2 \cdot \text{feature2})} &= \frac{1}{T} \\ +e^{-(\theta_0 + \theta_1 \cdot \text{feature1} + \theta_2 \cdot \text{feature2})} &= \frac{1}{T} - 1 \\ +\theta_0 + \theta_1 \cdot \text{feature1} + \theta_2 \cdot \text{feature2} &= -\log(\frac{1}{T} - 1) +\end{align} +\]
+For a model with 2 features, the decision boundary is a line in terms of its features. To make it easier to visualize, we’ve included an example of a 1-dimensional and a 2-dimensional decision boundary below. Notice how the decision boundary predicted by our logistic regression model perfectly separates the points into two classes. Here the color is the predicted class, rather than the true class.
+In real life, however, that is often not the case, and we often see some overlap between points of different classes across the decision boundary. The true classes of the 2D data are shown below:
+As you can see, the decision boundary predicted by our logistic regression does not perfectly separate the two classes. There’s a “muddled” region near the decision boundary where our classifier predicts the wrong class. What would the data have to look like for the classifier to make perfect predictions?
+A classification dataset is said to be linearly separable if there exists a hyperplane among input features \(x\) that separates the two classes \(y\).
+Linear separability in 1D can be found with a rugplot of a single feature where a point perfectly separates the classes (Remember that in 1D, our decision boundary is just a point). For example, notice how the plot on the bottom left is linearly separable along the vertical line \(x=0\). However, no such line perfectly separates the two classes on the bottom right.
+This same definition holds in higher dimensions. If there are two features, the separating hyperplane must exist in two dimensions (any line of the form \(y=mx+b\)). We can visualize this using a scatter plot.
+This sounds great! When the dataset is linearly separable, a logistic regression classifier can perfectly assign datapoints into classes. Can it achieve 0 cross-entropy loss?
+\[-(y \log(p) + (1 - y) \log(1 - p))\]
+Cross-entropy loss is 0 if \(p = 1\) when \(y = 1\), and \(p = 0\) when \(y = 0\). Consider a simple model with one feature and no intercept.
+\[P_{\theta}(Y = 1|x) = \sigma(\theta x) = \frac{1}{1 + e^{-\theta x}}\]
+What \(\theta\) will achieve 0 loss if we train on the datapoint \(x = 1, y = 1\)? We would want \(p = 1\) which occurs when \(\theta \rightarrow \infty\).
+However, (unexpected) complications may arise. When data is linearly separable, the optimal model parameters diverge to \(\pm \infty\). The sigmoid can never output exactly 0 or 1, so no finite optimal \(\theta\) exists. This can be a problem when using gradient descent to fit the model. Consider a simple, linearly separable “toy” dataset with two datapoints.
+Let’s also visualize the mean cross entropy loss along with the direction of the gradient (how this loss surface is calculated is out of scope).
+It’s nearly impossible to see, but the plateau to the right is slightly tilted. Because gradient descent follows the tilted loss surface downwards, it never converges.
+The diverging weights cause the model to be overconfident. Say we add a new point \((x, y) = (-0.5, 1)\). Following the behavior above, our model will incorrectly predict \(p=0\), and thus, \(\hat y = 0\).
+
The loss incurred by this misclassified point is infinite.
\[-(y\text{ log}(p) + (1-y)\text{ log}(1-p)) = -1 \cdot \text{log}(0) = \infty\]
+Thus, diverging weights (\(|\theta| \rightarrow \infty\)) occur with linearly separable data. “Overconfidence”, as shown here, is a particularly dangerous version of overfitting.
+To avoid large weights and infinite loss (particularly on linearly separable data), we use regularization. The same principles apply as with linear regression - make sure to standardize your features first.
+For example, \(L2\) (Ridge) Logistic Regression takes on the form:
+\[\min_{\theta} -\frac{1}{n} \sum_{i=1}^{n} (y_i \text{log}(\sigma(X_i^T\theta)) + (1-y_i)\text{log}(1-\sigma(X_i^T\theta))) + \lambda \sum_{j=1}^{d} \theta_j^2\]
+Now, let us compare the loss functions of un-regularized and regularized logistic regression.
+As we can see, \(L2\) regularization helps us prevent diverging weights and deters against “overconfidence.”
+sklearn
’s logistic regression defaults to \(L2\) regularization and C=1.0
; C
is the inverse of \(\lambda\): \[C = \frac{1}{\lambda}\] Setting C
to a large value, for example, C=300.0
, results in minimal regularization.
# sklearn defaults
+model = LogisticRegression(penalty = 'l2', C = 1.0, ...)
+model.fit()
+Note that in Data 100, we only use sklearn
to fit logistic regression models. There is no closed-form solution to the optimal theta vector, and the gradient is a little messy (see the bonus section below for details).
From here, the .predict
function returns the predicted class \(\hat y\) of the point. In the simple binary case where the threshold is 0.5,
\[\hat y = \begin{cases} + 1, & P(Y=1|x) \ge 0.5\\ + 0, & \text{otherwise } + \end{cases}\]
+You might be thinking, if we’ve already introduced cross-entropy loss, why do we need additional ways of assessing how well our models perform? In linear regression, we made numerical predictions and used a loss function to determine how “good” these predictions were. In logistic regression, our ultimate goal is to classify data – we are much more concerned with whether or not each datapoint was assigned the correct class using the decision rule. As such, we are interested in the quality of classifications, not the predicted probabilities.
+The most basic evaluation metric is accuracy, that is, the proportion of correctly classified points.
+\[\text{accuracy} = \frac{\# \text{ of points classified correctly}}{\# \text{ of total points}}\]
+Translated to code:
+def accuracy(X, Y):
+ return np.mean(model.predict(X) == Y)
+
+model.score(X, y) # built-in accuracy function
+You can find the sklearn
documentation here.
However, accuracy is not always a great metric for classification. To understand why, let’s consider a classification problem with 100 emails where only 5 are truly spam, and the remaining 95 are truly ham. We’ll investigate two models where accuracy is a poor metric.
+As this example illustrates, accuracy is not always a good metric for classification, particularly when your data could exhibit class imbalance (e.g., very few 1’s compared to 0’s).
+There are 4 different classifications that our model might make:
+These classifications can be concisely summarized in a confusion matrix.
+An easy way to remember this terminology is as follows:
+We can now write the accuracy calculation as \[\text{accuracy} = \frac{TP + TN}{n}\]
+In sklearn
, we use the following syntax to plot a confusion matrix:
from sklearn.metrics import confusion_matrix
+cm = confusion_matrix(Y_true, Y_pred)
+The purpose of our discussion of the confusion matrix was to motivate better performance metrics for classification problems with class imbalance - namely, precision and recall.
+Precision is defined as
+\[\text{precision} = \frac{\text{TP}}{\text{TP + FP}}\]
+Precision answers the question: “Of all observations that were predicted to be \(1\), what proportion was actually \(1\)?” It measures how accurate the classifier is when its predictions are positive.
+Recall (or sensitivity) is defined as
+\[\text{recall} = \frac{\text{TP}}{\text{TP + FN}}\]
+Recall aims to answer: “Of all observations that were actually \(1\), what proportion was predicted to be \(1\)?” It measures how many positive predictions were missed.
+Here’s a helpful graphic that summarizes our discussion above.
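+As a small sketch (the counts are made up), both metrics are simple ratios of confusion-matrix entries:
+# Hypothetical confusion-matrix counts
+TP, FP, FN, TN = 40, 10, 5, 45
+
+precision = TP / (TP + FP)   # of predicted positives, how many were truly positive
+recall = TP / (TP + FN)      # of true positives, how many did we catch
+print(precision, recall)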
+In this section, we will calculate the accuracy, precision, and recall performance metrics for our earlier spam classification example. As a reminder, we had 100 emails, 5 of which were spam. We designed two models:
+First, let’s begin by creating the confusion matrix.
++ | 0 | +1 | +
---|---|---|
0 | +True Negative: 95 | +False Positive: 0 | +
1 | +False Negative: 5 | +True Positive: 0 | +
\[\text{accuracy} = \frac{95}{100} = 0.95\] \[\text{precision} = \frac{0}{0 + 0} = \text{undefined}\] \[\text{recall} = \frac{0}{0 + 5} = 0\]
+Notice how our precision is undefined because we never predicted class \(1\). Our recall is 0 for the same reason – the numerator is 0 (we had no positive predictions).
+The confusion matrix for Model 2 is:
++ | 0 | +1 | +
---|---|---|
0 | +True Negative: 0 | +False Positive: 95 | +
1 | +False Negative: 0 | +True Positive: 5 | +
\[\text{accuracy} = \frac{5}{100} = 0.05\] \[\text{precision} = \frac{5}{5 + 95} = 0.05\] \[\text{recall} = \frac{5}{5 + 0} = 1\]
+Our precision is low because we have many false positives, and our recall is perfect - we correctly classified all spam emails (we never predicted class \(0\)).
+Precision (\(\frac{\text{TP}}{\text{TP} + \textbf{ FP}}\)) penalizes false positives, while recall (\(\frac{\text{TP}}{\text{TP} + \textbf{ FN}}\)) penalizes false negatives. In fact, precision and recall are inversely related. This is evident in our second model – we observed a high recall and low precision. Usually, there is a tradeoff in these two (most models can either minimize the number of FP or FN; and in rare cases, both).
+The specific performance metric(s) to prioritize depends on the context. In many medical settings, there might be a much higher cost to missing positive cases. For instance, in our breast cancer example, it is more costly to misclassify malignant tumors (false negatives) than it is to incorrectly classify a benign tumor as malignant (false positives). In the case of the latter, pathologists can conduct further studies to verify malignant tumors. As such, we should minimize the number of false negatives. This is equivalent to maximizing recall.
+The True Positive Rate (TPR) is defined as
+\[\text{true positive rate} = \frac{\text{TP}}{\text{TP + FN}}\]
+You’ll notice this is equivalent to recall. In the context of our spam email classifier, it answers the question: “What proportion of spam did I mark correctly?”. We’d like this to be close to \(1\).
+The True Negative Rate (TNR) is defined as
+\[\text{true negative rate} = \frac{\text{TN}}{\text{TN + FP}}\]
+Another word for TNR is specificity. This answers the question: “What proportion of ham did I mark correctly?”. We’d like this to be close to \(1\).
+The False Positive Rate (FPR) is defined as
+\[\text{false positive rate} = \frac{\text{FP}}{\text{FP + TN}}\]
+FPR is equal to 1 - specificity, or 1 - TNR. This answers the question: “What proportion of regular email did I mark as spam?”. We’d like this to be close to \(0\).
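+With made-up counts, these rates are computed analogously (an illustrative sketch, not lecture code):
+TP, FP, FN, TN = 40, 10, 5, 45   # hypothetical counts
+tpr = TP / (TP + FN)   # true positive rate (recall / sensitivity)
+tnr = TN / (TN + FP)   # true negative rate (specificity)
+fpr = FP / (FP + TN)   # false positive rate = 1 - specificity
+print(tpr, tnr, fpr)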
+As we increase threshold \(T\), both TPR and FPR decrease. We’ve plotted this relationship below for some model on a toy
dataset.
One way to minimize the number of FP vs. FN (equivalently, maximizing precision vs. recall) is by adjusting the classification threshold \(T\).
+\[\hat y = \begin{cases} + 1, & P(Y=1|x) \ge T\\ + 0, & \text{otherwise } + \end{cases}\]
+The default threshold in sklearn
is \(T = 0.5\). As we increase the threshold \(T\), we “raise the standard” of how confident our classifier needs to be to predict 1 (i.e., “positive”).
As you may notice, the choice of threshold \(T\) impacts our classifier’s performance.
+In fact, we can choose a threshold \(T\) based on our desired number, or proportion, of false positives and false negatives. We can do so using a few different tools. We’ll touch on two of the most important ones in Data 100.
+A Precision-Recall Curve (PR Curve) is an alternative to the ROC curve that displays the relationship between precision and recall for various threshold values. In this curve, we test out many different possible thresholds and for each one we compute the precision and recall of the classifier.
+Let’s first consider how precision and recall change as a function of the threshold \(T\). We know this quite well from earlier – precision will generally increase, and recall will decrease.
+Displayed below is the PR Curve for the same toy
dataset. Notice how threshold values increase as we move to the left.
Once again, the perfect classifier will resemble the orange curve, this time, facing the opposite direction.
+We want our PR curve to be as close to the “top right” of this graph as possible. Again, we use the AUC to determine “closeness”, with the perfect classifier exhibiting an AUC = 1 (and the worst with an AUC = 0.5).
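+In practice, sklearn can compute the precision and recall at every candidate threshold for us. Below is a minimal sketch with made-up labels and predicted probabilities:
+from sklearn.metrics import precision_recall_curve
+import numpy as np
+import matplotlib.pyplot as plt
+
+# Made-up labels and predicted probabilities, for illustration only
+y_true = np.array([0, 0, 1, 1, 0, 1, 1, 0])
+p_hats = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.65, 0.55])
+
+precision, recall, thresholds = precision_recall_curve(y_true, p_hats)
+plt.plot(recall, precision)
+plt.xlabel("Recall")
+plt.ylabel("Precision");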
+The “Receiver Operating Characteristic” Curve (ROC Curve) plots the tradeoff between FPR and TPR. Notice how the far-left of the curve corresponds to higher threshold \(T\) values. At lower thresholds, the FPR and TPR are both high as there are many positive predictions while at higher thresholds the FPR and TPR are both low as there are fewer positive predictions.
+The “perfect” classifier is the one that has a TPR of 1 and an FPR of 0. This is achieved at the top-left of the plot below. More generally, its ROC curve resembles the curve in orange.
+We want our model to be as close to this orange curve as possible. How do we quantify “closeness”?
+We can compute the area under curve (AUC) of the ROC curve. Notice how the perfect classifier has an AUC = 1. The closer our model’s AUC is to 1, the better it is.
+On the other hand, a terrible model will have an AUC closer to 0.5. Random predictors randomly predict \(P(Y = 1 | x)\) to be uniformly between 0 and 1. This indicates the classifier is not able to distinguish between positive and negative classes, and thus, randomly predicts one of the two.
+We can also illustrate this by comparing different thresholds and seeing their points on the ROC curve.
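+Similarly, sklearn provides the ROC curve and its AUC directly. A minimal sketch with the same kind of made-up labels and probabilities:
+from sklearn.metrics import roc_curve, roc_auc_score
+
+y_true = np.array([0, 0, 1, 1, 0, 1, 1, 0])
+p_hats = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.65, 0.55])
+
+fpr, tpr, thresholds = roc_curve(y_true, p_hats)
+auc = roc_auc_score(y_true, p_hats)
+plt.plot(fpr, tpr, label=f"AUC = {auc:.2f}")
+plt.xlabel("False Positive Rate")
+plt.ylabel("True Positive Rate")
+plt.legend();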
+Let’s define the following terms: \[ +\begin{align} +t_i &= \phi(x_i)^T \theta \\ +p_i &= \sigma(t_i) \\ +t_i &= \log(\frac{p_i}{1 - p_i}) \\ +1 - \sigma(t_i) &= \sigma(-t_i) \\ +\frac{d}{dt} \sigma(t) &= \sigma(t) \sigma(-t) +\end{align} +\]
Now, we can simplify the per-observation cross-entropy term \[
\begin{align}
y_i \log(p_i) + (1 - y_i) \log(1 - p_i) &= y_i \log(\frac{p_i}{1 - p_i}) + \log(1 - p_i) \\
&= y_i \phi(x_i)^T \theta + \log(\sigma(-\phi(x_i)^T \theta))
\end{align}
\]
Hence, the optimal \(\hat{\theta}\) is \[\text{argmin}_{\theta} - \frac{1}{n} \sum_{i=1}^n \left(y_i \phi(x_i)^T \theta + \log(\sigma(-\phi(x_i)^T \theta))\right)\]
We want to minimize \[L(\theta) = - \frac{1}{n} \sum_{i=1}^n \left(y_i \phi(x_i)^T \theta + \log(\sigma(-\phi(x_i)^T \theta))\right)\]
So we take the gradient, using the facts that \(\triangledown_{\theta} \, \phi(x_i)^T \theta = \phi(x_i)\) and \(\triangledown_{\theta} \, \sigma(-\phi(x_i)^T \theta) = -\sigma(-\phi(x_i)^T \theta)\, \sigma(\phi(x_i)^T \theta)\, \phi(x_i)\): \[
\begin{align}
\triangledown_{\theta} L(\theta) &= - \frac{1}{n} \sum_{i=1}^n \left( \triangledown_{\theta} \, y_i \phi(x_i)^T \theta + \triangledown_{\theta} \log(\sigma(-\phi(x_i)^T \theta)) \right) \\
&= - \frac{1}{n} \sum_{i=1}^n \left( y_i \phi(x_i) + \frac{1}{\sigma(-\phi(x_i)^T \theta)} \triangledown_{\theta} \sigma(-\phi(x_i)^T \theta) \right) \\
&= - \frac{1}{n} \sum_{i=1}^n \left( y_i \phi(x_i) - \frac{\sigma(-\phi(x_i)^T \theta)}{\sigma(-\phi(x_i)^T \theta)} \sigma(\phi(x_i)^T \theta) \phi(x_i) \right) \\
&= - \frac{1}{n} \sum_{i=1}^n \left( y_i - \sigma(\phi(x_i)^T \theta) \right) \phi(x_i)
\end{align}
\]
+Setting the derivative equal to 0 and solving for \(\hat{\theta}\), we find that there’s no general analytic solution. Therefore, we must solve using numeric methods.
+\[\theta^{(0)} \leftarrow \text{initial vector (random, zeros, ...)} \]
+For \(\tau\) from 0 to convergence: \[ \theta^{(\tau + 1)} \leftarrow \theta^{(\tau)} - \rho(\tau)\left( \frac{1}{n} \sum_{i=1}^n \triangledown_{\theta} L_i(\theta) \mid_{\theta = \theta^{(\tau)}}\right) \]
+\[\theta^{(0)} \leftarrow \text{initial vector (random, zeros, ...)} \]
+For \(\tau\) from 0 to convergence, let \(B\) ~ \(\text{Random subset of indices}\). \[ \theta^{(\tau + 1)} \leftarrow \theta^{(\tau)} - \rho(\tau)\left( \frac{1}{|B|} \sum_{i \in B} \triangledown_{\theta} L_i(\theta) \mid_{\theta = \theta^{(\tau)}}\right) \]
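To make the update rule concrete, here is a minimal sketch of batch gradient descent using the logistic regression gradient derived above. The toy design matrix, zero initialization, constant learning rate, and fixed iteration count are illustrative assumptions, not prescriptions from the lecture.

import numpy as np

def sigmoid(t):
    return 1 / (1 + np.exp(-t))

# Toy data: 4 observations, design matrix with an intercept column (made up for illustration)
Phi = np.array([[1.0, 0.5], [1.0, 1.5], [1.0, 2.5], [1.0, 3.5]])
y = np.array([0, 0, 1, 1])

theta = np.zeros(Phi.shape[1])  # initial vector of zeros
rho = 0.5                       # constant learning rate, a simplification

for _ in range(1000):
    # Gradient of the average cross-entropy loss: -(1/n) * sum of (y_i - sigma(phi^T theta)) * phi(x_i)
    grad = -(1 / len(y)) * Phi.T @ (y - sigmoid(Phi @ theta))
    theta = theta - rho * grad

print(theta)

Replacing the full sum with a random mini-batch of rows at each step turns this sketch into the stochastic version shown above.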
+ + +We’ve now spent a number of lectures exploring how to build effective models – we introduced the SLR and constant models, selected cost functions to suit our modeling task, and applied transformations to improve the linear fit.
+Throughout all of this, we considered models of one feature (\(\hat{y}_i = \theta_0 + \theta_1 x_i\)) or zero features (\(\hat{y}_i = \theta_0\)). As data scientists, we usually have access to datasets containing many features. To make the best models we can, it will be beneficial to consider all of the variables available to us as inputs to a model, rather than just one. In today’s lecture, we’ll introduce multiple linear regression as a framework to incorporate multiple features into a model. We will also learn how to accelerate the modeling process – specifically, we’ll see how linear algebra offers us a powerful set of tools for understanding model performance.
+Multiple linear regression is an extension of simple linear regression that adds additional features to the model. The multiple linear regression model takes the form:
+\[\hat{y} = \theta_0\:+\:\theta_1x_{1}\:+\:\theta_2 x_{2}\:+\:...\:+\:\theta_p x_{p}\]
+Our predicted value of \(y\), \(\hat{y}\), is a linear combination of the single observations (features), \(x_i\), and the parameters, \(\theta_i\).
+We can explore this idea further by looking at a dataset containing aggregate per-player data from the 2018-19 NBA season, downloaded from Kaggle.
import pandas as pd
nba = pd.read_csv('data/nba18-19.csv', index_col=0)
nba.index.name = None # Drops name of index (players are ordered by rank)
nba.head(5)
+ | Player | +Pos | +Age | +Tm | +G | +GS | +MP | +FG | +FGA | +FG% | +... | +FT% | +ORB | +DRB | +TRB | +AST | +STL | +BLK | +TOV | +PF | +PTS | +
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | +Álex Abrines\abrinal01 | +SG | +25 | +OKC | +31 | +2 | +19.0 | +1.8 | +5.1 | +0.357 | +... | +0.923 | +0.2 | +1.4 | +1.5 | +0.6 | +0.5 | +0.2 | +0.5 | +1.7 | +5.3 | +
2 | +Quincy Acy\acyqu01 | +PF | +28 | +PHO | +10 | +0 | +12.3 | +0.4 | +1.8 | +0.222 | +... | +0.700 | +0.3 | +2.2 | +2.5 | +0.8 | +0.1 | +0.4 | +0.4 | +2.4 | +1.7 | +
3 | +Jaylen Adams\adamsja01 | +PG | +22 | +ATL | +34 | +1 | +12.6 | +1.1 | +3.2 | +0.345 | +... | +0.778 | +0.3 | +1.4 | +1.8 | +1.9 | +0.4 | +0.1 | +0.8 | +1.3 | +3.2 | +
4 | +Steven Adams\adamsst01 | +C | +25 | +OKC | +80 | +80 | +33.4 | +6.0 | +10.1 | +0.595 | +... | +0.500 | +4.9 | +4.6 | +9.5 | +1.6 | +1.5 | +1.0 | +1.7 | +2.6 | +13.9 | +
5 | +Bam Adebayo\adebaba01 | +C | +21 | +MIA | +82 | +28 | +23.3 | +3.4 | +5.9 | +0.576 | +... | +0.735 | +2.0 | +5.3 | +7.3 | +2.2 | +0.9 | +0.8 | +1.5 | +2.5 | +8.9 | +
5 rows × 29 columns
+Let’s say we are interested in predicting the number of points (PTS
) an athlete will score in a basketball game this season.
Suppose we want to fit a linear model by using some characteristics, or features of a player. Specifically, we’ll focus on field goals, assists, and 3-point attempts.
- FG, the average number of (2-point) field goals per game
- AST, the average number of assists per game
- 3PA, the average number of 3-point field goals attempted per game

nba[['FG', 'AST', '3PA', 'PTS']].head()
+ | FG | +AST | +3PA | +PTS | +
---|---|---|---|---|
1 | +1.8 | +0.6 | +4.1 | +5.3 | +
2 | +0.4 | +0.8 | +1.5 | +1.7 | +
3 | +1.1 | +1.9 | +2.2 | +3.2 | +
4 | +6.0 | +1.6 | +0.0 | +13.9 | +
5 | +3.4 | +2.2 | +0.2 | +8.9 | +
Because we are now dealing with many parameter values, we’ve collected them all into a parameter vector with dimensions \((p+1) \times 1\) to keep things tidy. Remember that \(p\) represents the number of features we have (in this case, 3).
+\[\theta = \begin{bmatrix} + \theta_{0} \\ + \theta_{1} \\ + \vdots \\ + \theta_{p} + \end{bmatrix}\]
+We are working with two vectors here: a row vector representing the observed data, and a column vector containing the model parameters. The multiple linear regression model is equivalent to the dot (scalar) product of the observation vector and parameter vector.
+\[[1,\:x_{1},\:x_{2},\:x_{3},\:...,\:x_{p}] \theta = [1,\:x_{1},\:x_{2},\:x_{3},\:...,\:x_{p}] \begin{bmatrix} + \theta_{0} \\ + \theta_{1} \\ + \vdots \\ + \theta_{p} + \end{bmatrix} = \theta_0\:+\:\theta_1x_{1}\:+\:\theta_2 x_{2}\:+\:...\:+\:\theta_p x_{p}\]
+Notice that we have inserted 1 as the first value in the observation vector. When the dot product is computed, this 1 will be multiplied with \(\theta_0\) to give the intercept of the regression model. We call this 1 entry the intercept or bias term.
+Given that we have three features here, we can express this model as: \[\hat{y} = \theta_0\:+\:\theta_1x_{1}\:+\:\theta_2 x_{2}\:+\:\theta_3 x_{3}\]
+Our features are represented by \(x_1\) (FG
), \(x_2\) (AST
), and \(x_3\) (3PA
) with each having correpsonding parameters, \(\theta_1\), \(\theta_2\), and \(\theta_3\).
In statistics, this model + loss is called Ordinary Least Squares (OLS). The solution to OLS is the minimizing loss for parameters \(\hat{\theta}\), also called the least squares estimate.
+We now know how to generate a single prediction from multiple observed features. Data scientists usually work at scale – that is, they want to build models that can produce many predictions, all at once. The vector notation we introduced above gives us a hint on how we can expedite multiple linear regression. We want to use the tools of linear algebra.
+Let’s think about how we can apply what we did above. To accommodate for the fact that we’re considering several feature variables, we’ll adjust our notation slightly. Each observation can now be thought of as a row vector with an entry for each of \(p\) features.
+
+
To make a prediction from the first observation in the data, we take the dot product of the parameter vector and first observation vector. To make a prediction from the second observation, we would repeat this process to find the dot product of the parameter vector and the second observation vector. If we wanted to find the model predictions for each observation in the dataset, we’d repeat this process for all \(n\) observations in the data.
+\[\hat{y}_1 = \theta_0 + \theta_1 x_{11} + \theta_2 x_{12} + ... + \theta_p x_{1p} = [1,\:x_{11},\:x_{12},\:x_{13},\:...,\:x_{1p}] \theta\] \[\hat{y}_2 = \theta_0 + \theta_1 x_{21} + \theta_2 x_{22} + ... + \theta_p x_{2p} = [1,\:x_{21},\:x_{22},\:x_{23},\:...,\:x_{2p}] \theta\] \[\vdots\] \[\hat{y}_n = \theta_0 + \theta_1 x_{n1} + \theta_2 x_{n2} + ... + \theta_p x_{np} = [1,\:x_{n1},\:x_{n2},\:x_{n3},\:...,\:x_{np}] \theta\]
+Our observed data is represented by \(n\) row vectors, each with dimension \((p+1)\). We can collect them all into a single matrix, which we call \(\mathbb{X}\).
+
+
The matrix \(\mathbb{X}\) is known as the design matrix. It contains all observed data for each of our \(p\) features, where each row corresponds to one observation, and each column corresponds to a feature. It often (but not always) contains an additional column of all ones to represent the intercept or bias column.
+To review what is happening in the design matrix: each row represents a single observation. For example, a student in Data 100. Each column represents a feature. For example, the ages of students in Data 100. This convention allows us to easily transfer our previous work in DataFrames over to this new linear algebra perspective.
+
+
The multiple linear regression model can then be restated in terms of matrices: \[ +\Large +\mathbb{\hat{Y}} = \mathbb{X} \theta +\]
+Here, \(\mathbb{\hat{Y}}\) is the prediction vector with \(n\) elements (\(\mathbb{\hat{Y}} \in \mathbb{R}^{n}\)); it contains the prediction made by the model for each of the \(n\) input observations. \(\mathbb{X}\) is the design matrix with dimensions \(\mathbb{X} \in \mathbb{R}^{n \times (p + 1)}\), and \(\theta\) is the parameter vector with dimensions \(\theta \in \mathbb{R}^{(p + 1)}\). Note that our true output \(\mathbb{Y}\) is also a vector with \(n\) elements (\(\mathbb{Y} \in \mathbb{R}^{n}\)).
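As a sketch of this in code (assuming the nba CSV loaded earlier is available at the same path, and using arbitrary placeholder parameters rather than fitted ones), we can stack a column of ones with the three feature columns to form \(\mathbb{X}\); a single matrix-vector product then yields all \(n\) predictions at once.

import numpy as np
import pandas as pd

nba = pd.read_csv('data/nba18-19.csv', index_col=0)

# Design matrix: a bias column of ones followed by the p = 3 feature columns
X = np.hstack([np.ones((len(nba), 1)), nba[['FG', 'AST', '3PA']].to_numpy()])

theta = np.array([2.0, 1.5, 0.5, 1.2])  # placeholder parameters, not fitted values
Y_hat = X @ theta                       # one prediction per player
print(Y_hat.shape)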
+ +We now have a new approach to understanding models in terms of vectors and matrices. To accompany this new convention, we should update our understanding of risk functions and model fitting.
+Recall our definition of MSE: \[R(\theta) = \frac{1}{n} \sum_{i=1}^n (y_i - \hat{y}_i)^2\]
+At its heart, the MSE is a measure of distance – it gives an indication of how “far away” the predictions are from the true values, on average.
+ +We can express the MSE as a squared L2 norm if we rewrite it in terms of the prediction vector, \(\hat{\mathbb{Y}}\), and true target vector, \(\mathbb{Y}\):
+\[R(\theta) = \frac{1}{n} \sum_{i=1}^n (y_i - \hat{y}_i)^2 = \frac{1}{n} (||\mathbb{Y} - \hat{\mathbb{Y}}||_2)^2\]
+Here, the superscript 2 outside of the parentheses means that we are squaring the norm. If we plug in our linear model \(\hat{\mathbb{Y}} = \mathbb{X} \theta\), we find the MSE cost function in vector notation:
+\[R(\theta) = \frac{1}{n} (||\mathbb{Y} - \mathbb{X} \theta||_2)^2\]
+Under the linear algebra perspective, our new task is to fit the optimal parameter vector \(\theta\) such that the cost function is minimized. Equivalently, we wish to minimize the norm \[||\mathbb{Y} - \mathbb{X} \theta||_2 = ||\mathbb{Y} - \hat{\mathbb{Y}}||_2.\]
+We can restate this goal in two ways:
+There are several equivalent terms in the context of regression. The ones we use most often for this course are bolded.
+Up until now, we’ve mostly thought of our model as a scalar product between horizontally stacked observations and the parameter vector. We can also think of \(\hat{\mathbb{Y}}\) as a linear combination of feature vectors, scaled by the parameters. We use the notation \(\mathbb{X}_{:, i}\) to denote the \(i\)th column of the design matrix. You can think of this as following the same convention as used when calling .iloc
and .loc
. “:” means that we are taking all entries in the \(i\)th column.
+
\[ +\hat{\mathbb{Y}} = +\theta_0 \begin{bmatrix} + 1 \\ + 1 \\ + \vdots \\ + 1 + \end{bmatrix} + \theta_1 \begin{bmatrix} + x_{11} \\ + x_{21} \\ + \vdots \\ + x_{n1} + \end{bmatrix} + \ldots + \theta_p \begin{bmatrix} + x_{1p} \\ + x_{2p} \\ + \vdots \\ + x_{np} + \end{bmatrix} + = \theta_0 \mathbb{X}_{:,\:1} + \theta_1 \mathbb{X}_{:,\:2} + \ldots + \theta_p \mathbb{X}_{:,\:p+1}\]
+This new approach is useful because it allows us to take advantage of the properties of linear combinations.
+Because the prediction vector, \(\hat{\mathbb{Y}} = \mathbb{X} \theta\), is a linear combination of the columns of \(\mathbb{X}\), we know that the predictions are contained in the span of \(\mathbb{X}\). That is, we know that \(\mathbb{\hat{Y}} \in \text{Span}(\mathbb{X})\).
+The diagram below is a simplified view of \(\text{Span}(\mathbb{X})\), assuming that each column of \(\mathbb{X}\) has length \(n\). Notice that the columns of \(\mathbb{X}\) define a subspace of \(\mathbb{R}^n\), where each point in the subspace can be reached by a linear combination of \(\mathbb{X}\)’s columns. The prediction vector \(\mathbb{\hat{Y}}\) lies somewhere in this subspace.
+
+
Examining this diagram, we find a problem. The vector of true values, \(\mathbb{Y}\), could theoretically lie anywhere in \(\mathbb{R}^n\) space – its exact location depends on the data we collect out in the real world. However, our multiple linear regression model can only make predictions in the subspace of \(\mathbb{R}^n\) spanned by \(\mathbb{X}\). Remember the model fitting goal we established in the previous section: we want to generate predictions such that the distance between the vector of true values, \(\mathbb{Y}\), and the vector of predicted values, \(\mathbb{\hat{Y}}\), is minimized. This means that we want \(\mathbb{\hat{Y}}\) to be the vector in \(\text{Span}(\mathbb{X})\) that is closest to \(\mathbb{Y}\).
+Another way of rephrasing this goal is to say that we wish to minimize the length of the residual vector \(e\), as measured by its \(L_2\) norm.
+
+
The vector in \(\text{Span}(\mathbb{X})\) that is closest to \(\mathbb{Y}\) is always the orthogonal projection of \(\mathbb{Y}\) onto \(\text{Span}(\mathbb{X}).\) Thus, we should choose the parameter vector \(\theta\) that makes the residual vector orthogonal to any vector in \(\text{Span}(\mathbb{X})\). You can visualize this as the vector created by dropping a perpendicular line from \(\mathbb{Y}\) onto the span of \(\mathbb{X}\).
+ +Remember our goal is to find \(\hat{\theta}\) such that we minimize the objective function \(R(\theta)\). Equivalently, this is the \(\hat{\theta}\) such that the residual vector \(e = \mathbb{Y} - \mathbb{X} \hat{\theta}\) is orthogonal to \(\text{Span}(\mathbb{X})\).
+Looking at the definition of orthogonality of \(\mathbb{Y} - \mathbb{X}\hat{\theta}\) to \(span(\mathbb{X})\), we can write: \[\mathbb{X}^T (\mathbb{Y} - \mathbb{X}\hat{\theta}) = \vec{0}\]
+Let’s then rearrange the terms: \[\mathbb{X}^T \mathbb{Y} - \mathbb{X}^T \mathbb{X} \hat{\theta} = \vec{0}\]
+And finally, we end up with the normal equation: \[\mathbb{X}^T \mathbb{X} \hat{\theta} = \mathbb{X}^T \mathbb{Y}\]
+Any vector \(\theta\) that minimizes MSE on a dataset must satisfy this equation.
+If \(\mathbb{X}^T \mathbb{X}\) is invertible, we can conclude: \[\hat{\theta} = (\mathbb{X}^T \mathbb{X})^{-1} \mathbb{X}^T \mathbb{Y}\]
+This is called the least squares estimate of \(\theta\): it is the value of \(\theta\) that minimizes the squared loss.
+Note that the least squares estimate was derived under the assumption that \(\mathbb{X}^T \mathbb{X}\) is invertible. This condition holds true when \(\mathbb{X}^T \mathbb{X}\) is full column rank, which, in turn, happens when \(\mathbb{X}\) is full column rank. The proof for why \(\mathbb{X}\) needs to be full column rank is optional and in the Bonus section at the end.
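A minimal sketch of computing the least squares estimate directly from the normal equation, continuing with the nba features from earlier (np.linalg.solve is used instead of forming an explicit inverse, which is generally better numerically; a library routine such as sklearn's LinearRegression would recover the same coefficients):

import numpy as np
import pandas as pd

nba = pd.read_csv('data/nba18-19.csv', index_col=0)
X = np.hstack([np.ones((len(nba), 1)), nba[['FG', 'AST', '3PA']].to_numpy()])
Y = nba['PTS'].to_numpy()

# Solve (X^T X) theta_hat = X^T Y rather than inverting X^T X explicitly
theta_hat = np.linalg.solve(X.T @ X, X.T @ Y)
Y_hat = X @ theta_hat
print(theta_hat)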
+Our geometric view of multiple linear regression has taken us far! We have identified the optimal set of parameter values to minimize MSE in a model of multiple features. Now, we want to understand how well our fitted model performs.
+One measure of model performance is the Root Mean Squared Error, or RMSE. The RMSE is simply the square root of MSE. Taking the square root converts the value back into the original, non-squared units of \(y_i\), which is useful for understanding the model’s performance. A low RMSE indicates more “accurate” predictions – that there is a lower average loss across the dataset.
+\[\text{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^n (y_i - \hat{y}_i)^2}\]
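A sketch of the corresponding computation, using made-up true values and predictions for illustration:

import numpy as np

y = np.array([5.3, 1.7, 3.2, 13.9, 8.9])      # observed values (made up)
y_hat = np.array([5.0, 2.0, 3.5, 13.0, 9.5])  # predictions (made up)

rmse = np.sqrt(np.mean((y - y_hat) ** 2))  # square root of the MSE
print(rmse)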
+When working with SLR, we generated plots of the residuals against a single feature to understand the behavior of residuals. When working with several features in multiple linear regression, it no longer makes sense to consider a single feature in our residual plots. Instead, multiple linear regression is evaluated by making plots of the residuals against the predicted values. As was the case with SLR, a multiple linear model performs well if its residual plot shows no patterns.
+
+
For SLR, we used the correlation coefficient to capture the association between the target variable and a single feature variable. In a multiple linear model setting, we will need a performance metric that can account for multiple features at once. Multiple \(R^2\), also called the coefficient of determination, is the proportion of variance of our fitted values (predictions) \(\hat{y}_i\) to our true values \(y_i\). It ranges from 0 to 1 and is effectively the proportion of variance in the observations that the model explains.
+\[R^2 = \frac{\text{variance of } \hat{y}_i}{\text{variance of } y_i} = \frac{\sigma^2_{\hat{y}}}{\sigma^2_y}\]
+Note that for OLS with an intercept term, for example \(\hat{y} = \theta_0 + \theta_1x_1 + \theta_2x_2 + \cdots + \theta_px_p\), \(R^2\) is equal to the square of the correlation between \(y\) and \(\hat{y}\). On the other hand for SLR, \(R^2\) is equal to \(r^2\), the correlation between \(x\) and \(y\). The proof of these last two properties is out of scope for this course.
+Additionally, as we add more features, our fitted values tend to become closer and closer to our actual values. Thus, \(R^2\) increases.
+Adding more features doesn’t always mean our model is better though! We’ll see why later in the course.
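A sketch of the variance-ratio computation with made-up arrays (for predictions that genuinely come from OLS with an intercept, this ratio also equals the squared correlation between \(y\) and \(\hat{y}\)):

import numpy as np

y = np.array([5.3, 1.7, 3.2, 13.9, 8.9])      # observed values (made up)
y_hat = np.array([5.0, 2.0, 3.5, 13.0, 9.5])  # fitted values (made up)

r2 = np.var(y_hat) / np.var(y)  # variance of the predictions over variance of the observations
print(r2)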
A consequence of this orthogonality is that the residual vector is orthogonal to every column of the design matrix: \[\mathbb{X}^Te = 0 \]
When the design matrix includes the all-ones intercept column, this also implies that the residuals sum to zero: \[\sum_i^n e_i = 0\]
+ +To summarize:
++ | Model | +Estimate | +Unique? | +
---|---|---|---|
Constant Model + MSE | +\(\hat{y} = \theta_0\) | +\(\hat{\theta}_0 = mean(y) = \bar{y}\) | +Yes. Any set of values has a unique mean. | +
Constant Model + MAE | +\(\hat{y} = \theta_0\) | +\(\hat{\theta}_0 = median(y)\) | +Yes, if odd. No, if even. Return the average of the middle 2 values. | +
Simple Linear Regression + MSE | +\(\hat{y} = \theta_0 + \theta_1x\) | +\(\hat{\theta}_0 = \bar{y} - \hat{\theta}_1\bar{x}\) \(\hat{\theta}_1 = r\frac{\sigma_y}{\sigma_x}\) | +Yes. Any set of non-constant* values has a unique mean, SD, and correlation coefficient. | +
OLS (Linear Model + MSE) | +\(\mathbb{\hat{Y}} = \mathbb{X}\mathbb{\theta}\) | +\(\hat{\theta} = (\mathbb{X}^T\mathbb{X})^{-1}\mathbb{X}^T\mathbb{Y}\) | +Yes, if \(\mathbb{X}\) is full column rank (all columns are linearly independent, # of datapoints >>> # of features). | +
The Least Squares estimate \(\hat{\theta}\) is unique if and only if \(\mathbb{X}\) is full column rank.
Therefore, if \(\mathbb{X}\) is not full column rank, we will not have unique estimates. This can happen for two major reasons: if one feature can be written as a linear combination of the others (redundant features), or if there are fewer data points than there are columns of \(\mathbb{X}\).
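A small sketch of the first failure mode, with made-up numbers: when one column of \(\mathbb{X}\) is an exact copy (or any linear combination) of others, \(\mathbb{X}^T \mathbb{X}\) becomes singular and the normal equation no longer pins down a unique \(\hat{\theta}\).

import numpy as np

# The second and third columns are identical, so X is not full column rank
X = np.array([[1.0, 2.0, 2.0],
              [1.0, 3.0, 3.0],
              [1.0, 5.0, 5.0],
              [1.0, 7.0, 7.0]])

print(np.linalg.matrix_rank(X))  # 2, even though X has 3 columns
print(np.linalg.det(X.T @ X))    # (numerically) 0: X^T X is singular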
+In this sequence of lectures, we will dive right into things by having you explore and manipulate real-world data. We’ll first introduce pandas
, a popular Python library for interacting with tabular data.
Data scientists work with data stored in a variety of formats. This class focuses primarily on tabular data — data that is stored in a table.
+Tabular data is one of the most common systems that data scientists use to organize data. This is in large part due to the simplicity and flexibility of tables. Tables allow us to represent each observation, or instance of collecting data from an individual, as its own row. We can record each observation’s distinct characteristics, or features, in separate columns.
+To see this in action, we’ll explore the elections
dataset, which stores information about political candidates who ran for president of the United States in previous years.
In the elections
dataset, each row (blue box) represents one instance of a candidate running for president in a particular year. For example, the first row represents Andrew Jackson running for president in the year 1824. Each column (yellow box) represents one characteristic piece of information about each presidential candidate. For example, the column named “Result” stores whether or not the candidate won the election.
Your work in Data 8 helped you grow very familiar with using and interpreting data stored in a tabular format. Back then, you used the Table
class of the datascience
library, a special programming library created specifically for Data 8 students.
In Data 100, we will be working with the programming library pandas
, which is generally accepted in the data science community as the industry- and academia-standard tool for manipulating tabular data (as well as the inspiration for Petey, our panda bear mascot).
Using pandas
, we can
NumPy
functions to our data (our friends from Data 8).Series
, DataFrame
s, and IndicesTo begin our work in pandas
, we must first import the library into our Python environment. This will allow us to use pandas
data structures and methods in our code.
# `pd` is the conventional alias for Pandas, as `np` is for NumPy
+import pandas as pd
There are three fundamental data structures in pandas
:
Series
: 1D labeled array data; best thought of as columnar data.DataFrame
: 2D tabular data with rows and columns.Index
: A sequence of row/column labels.DataFrame
s, Series
, and Indices can be represented visually in the following diagram, which considers the first few rows of the elections
dataset.
Notice how the DataFrame is a two-dimensional object — it contains both rows and columns. The Series above is a singular column of this DataFrame
, namely the Result
column. Both contain an Index, or a shared list of row labels (the integers from 0 to 4, inclusive).
A Series
represents a column of a DataFrame
; more generally, it can be any 1-dimensional array-like object. It contains both:
In the cell below, we create a Series
named s
.
s = pd.Series(["welcome", "to", "data 100"])
s
0 welcome
+1 to
+2 data 100
+dtype: object
+# Accessing data values within the Series
+ s.values
array(['welcome', 'to', 'data 100'], dtype=object)
+# Accessing the Index of the Series
+ s.index
RangeIndex(start=0, stop=3, step=1)
+By default, the index
of a Series
is a sequential list of integers beginning from 0. Optionally, a manually specified list of desired indices can be passed to the index
argument.
s = pd.Series([-1, 10, 2], index = ["a", "b", "c"])
s
a -1
+b 10
+c 2
+dtype: int64
+ s.index
Index(['a', 'b', 'c'], dtype='object')
+Indices can also be changed after initialization.
+= ["first", "second", "third"]
+ s.index s
first -1
+second 10
+third 2
+dtype: int64
+ s.index
Index(['first', 'second', 'third'], dtype='object')
+Series
Much like when working with NumPy
arrays, we can select a single value or a set of values from a Series
. To do so, there are three primary methods:
To demonstrate this, let’s define the Series ser
.
ser = pd.Series([4, -2, 0, 6], index = ["a", "b", "c", "d"])
ser
a 4
+b -2
+c 0
+d 6
+dtype: int64
+# We return the value stored at the index label "a"
ser["a"]
np.int64(4)
+# We return a Series of the values stored at the index labels "a" and "c"
ser[["a", "c"]]
a 4
+c 0
+dtype: int64
+Perhaps the most interesting (and useful) method of selecting data from a Series
is by using a filtering condition.
First, we apply a boolean operation to the Series
. This creates a new Series
of boolean values.
# Filter condition: select all elements greater than 0
ser > 0
a True
+b False
+c False
+d True
+dtype: bool
+We then use this boolean condition to index into our original Series
. pandas
will select only the entries in the original Series
that satisfy the condition.
ser[ser > 0]
a 4
+d 6
+dtype: int64
+DataFrames
Typically, we will work with Series
using the perspective that they are columns in a DataFrame
. We can think of a DataFrame
as a collection of Series
that all share the same Index
.
In Data 8, you encountered the Table
class of the datascience
library, which represented tabular data. In Data 100, we’ll be using the DataFrame
class of the pandas
library.
DataFrame
There are many ways to create a DataFrame
. Here, we will cover the most popular approaches:
Series
.More generally, the syntax for creating a DataFrame
is:
pandas.DataFrame(data, index, columns)
+In Data 100, our data are typically stored in a CSV (comma-separated values) file format. We can import a CSV file into a DataFrame
by passing the data path as an argument to the following pandas
function.
pd.read_csv("filename.csv")
With our new understanding of pandas
in hand, let’s return to the elections
dataset from before. Now, we can recognize that it is represented as a pandas
DataFrame
.
elections = pd.read_csv("data/elections.csv")
elections
+ | Year | +Candidate | +Party | +Popular vote | +Result | +% | +
---|---|---|---|---|---|---|
0 | +1824 | +Andrew Jackson | +Democratic-Republican | +151271 | +loss | +57.210122 | +
1 | +1824 | +John Quincy Adams | +Democratic-Republican | +113142 | +win | +42.789878 | +
2 | +1828 | +Andrew Jackson | +Democratic | +642806 | +win | +56.203927 | +
3 | +1828 | +John Quincy Adams | +National Republican | +500897 | +loss | +43.796073 | +
4 | +1832 | +Andrew Jackson | +Democratic | +702735 | +win | +54.574789 | +
... | +... | +... | +... | +... | +... | +... | +
177 | +2016 | +Jill Stein | +Green | +1457226 | +loss | +1.073699 | +
178 | +2020 | +Joseph Biden | +Democratic | +81268924 | +win | +51.311515 | +
179 | +2020 | +Donald Trump | +Republican | +74216154 | +loss | +46.858542 | +
180 | +2020 | +Jo Jorgensen | +Libertarian | +1865724 | +loss | +1.177979 | +
181 | +2020 | +Howard Hawkins | +Green | +405035 | +loss | +0.255731 | +
182 rows × 6 columns
+This code stores our DataFrame
object in the elections
variable. Upon inspection, our elections
DataFrame
has 182 rows and 6 columns (Year
, Candidate
, Party
, Popular Vote
, Result
, %
). Each row represents a single record — in our example, a presidential candidate from some particular year. Each column represents a single attribute or feature of the record.
We’ll now explore creating a DataFrame
with data of our own.
Consider the following examples. The first code cell creates a DataFrame
with a single column Numbers
.
df_list = pd.DataFrame([1, 2, 3], columns=["Numbers"])
df_list
+ | Numbers | +
---|---|
0 | +1 | +
1 | +2 | +
2 | +3 | +
The second creates a DataFrame
with the columns Numbers
and Description
. Notice how a 2D list of values is required to initialize the second DataFrame
— each nested list represents a single row of data.
df_list = pd.DataFrame([[1, "one"], [2, "two"]], columns = ["Number", "Description"])
df_list
+ | Number | +Description | +
---|---|---|
0 | +1 | +one | +
1 | +2 | +two | +
A third (and more common) way to create a DataFrame
is with a dictionary. The dictionary keys represent the column names, and the dictionary values represent the column values.
Below are two ways of implementing this approach. The first is based on specifying the columns of the DataFrame
, whereas the second is based on specifying the rows of the DataFrame
.
df_dict = pd.DataFrame({
    "Fruit": ["Strawberry", "Orange"],
    "Price": [5.49, 3.99]
})
df_dict
+ | Fruit | +Price | +
---|---|---|
0 | +Strawberry | +5.49 | +
1 | +Orange | +3.99 | +
df_dict = pd.DataFrame(
    [
        {"Fruit": "Strawberry", "Price": 5.49},
        {"Fruit": "Orange", "Price": 3.99}
    ]
)
df_dict
+ | Fruit | +Price | +
---|---|---|
0 | +Strawberry | +5.49 | +
1 | +Orange | +3.99 | +
Series
Earlier, we explained how a Series
was synonymous to a column in a DataFrame
. It follows, then, that a DataFrame
is equivalent to a collection of Series
, which all share the same Index
.
In fact, we can initialize a DataFrame
by merging two or more Series
. Consider the Series
s_a
and s_b
.
# Notice how our indices, or row labels, are the same
+
+= pd.Series(["a1", "a2", "a3"], index = ["r1", "r2", "r3"])
+ s_a = pd.Series(["b1", "b2", "b3"], index = ["r1", "r2", "r3"]) s_b
We can turn individual Series
into a DataFrame
using two common methods (shown below):
pd.DataFrame(s_a)
+ | 0 | +
---|---|
r1 | +a1 | +
r2 | +a2 | +
r3 | +a3 | +
s_b.to_frame()
+ | 0 | +
---|---|
r1 | +b1 | +
r2 | +b2 | +
r3 | +b3 | +
To merge the two Series
and specify their column names, we use the following syntax:
+ pd.DataFrame({"A-column": s_a,
+ "B-column": s_b
+ })
+ | A-column | +B-column | +
---|---|---|
r1 | +a1 | +b1 | +
r2 | +a2 | +b2 | +
r3 | +a3 | +b3 | +
On a more technical note, an index doesn’t have to be an integer, nor does it have to be unique. For example, we can set the index of the elections
DataFrame
to be the name of presidential candidates.
# Creating a DataFrame from a CSV file and specifying the index column
elections = pd.read_csv("data/elections.csv", index_col = "Candidate")
elections
+ | Year | +Party | +Popular vote | +Result | +% | +
---|---|---|---|---|---|
Candidate | ++ | + | + | + | + |
Andrew Jackson | +1824 | +Democratic-Republican | +151271 | +loss | +57.210122 | +
John Quincy Adams | +1824 | +Democratic-Republican | +113142 | +win | +42.789878 | +
Andrew Jackson | +1828 | +Democratic | +642806 | +win | +56.203927 | +
John Quincy Adams | +1828 | +National Republican | +500897 | +loss | +43.796073 | +
Andrew Jackson | +1832 | +Democratic | +702735 | +win | +54.574789 | +
... | +... | +... | +... | +... | +... | +
Jill Stein | +2016 | +Green | +1457226 | +loss | +1.073699 | +
Joseph Biden | +2020 | +Democratic | +81268924 | +win | +51.311515 | +
Donald Trump | +2020 | +Republican | +74216154 | +loss | +46.858542 | +
Jo Jorgensen | +2020 | +Libertarian | +1865724 | +loss | +1.177979 | +
Howard Hawkins | +2020 | +Green | +405035 | +loss | +0.255731 | +
182 rows × 5 columns
+We can also select a new column and set it as the index of the DataFrame
. For example, we can set the index of the elections
DataFrame
to represent the candidate’s party.
= True) # Resetting the index so we can set it again
+ elections.reset_index(inplace # This sets the index to the "Party" column
+"Party") elections.set_index(
+ | Candidate | +Year | +Popular vote | +Result | +% | +
---|---|---|---|---|---|
Party | ++ | + | + | + | + |
Democratic-Republican | +Andrew Jackson | +1824 | +151271 | +loss | +57.210122 | +
Democratic-Republican | +John Quincy Adams | +1824 | +113142 | +win | +42.789878 | +
Democratic | +Andrew Jackson | +1828 | +642806 | +win | +56.203927 | +
National Republican | +John Quincy Adams | +1828 | +500897 | +loss | +43.796073 | +
Democratic | +Andrew Jackson | +1832 | +702735 | +win | +54.574789 | +
... | +... | +... | +... | +... | +... | +
Green | +Jill Stein | +2016 | +1457226 | +loss | +1.073699 | +
Democratic | +Joseph Biden | +2020 | +81268924 | +win | +51.311515 | +
Republican | +Donald Trump | +2020 | +74216154 | +loss | +46.858542 | +
Libertarian | +Jo Jorgensen | +2020 | +1865724 | +loss | +1.177979 | +
Green | +Howard Hawkins | +2020 | +405035 | +loss | +0.255731 | +
182 rows × 5 columns
+And, if we’d like, we can revert the index back to the default list of integers.
+# This resets the index to be the default list of integer
+=True)
+ elections.reset_index(inplace elections.index
RangeIndex(start=0, stop=182, step=1)
+It is also important to note that the row labels that constitute an index don’t have to be unique. While index values can be unique and numeric, acting as a row number, they can also be named and non-unique.
+Here we see unique and numeric index values.
+However, here the index values are not unique.
+DataFrame
Attributes: Index, Columns, and ShapeOn the other hand, column names in a DataFrame
are almost always unique. Looking back to the elections
dataset, it wouldn’t make sense to have two columns named "Candidate"
. Sometimes, you’ll want to extract these different values, in particular, the list of row and column labels.
For index/row labels, use DataFrame.index
:
"Party", inplace = True)
+ elections.set_index( elections.index
Index(['Democratic-Republican', 'Democratic-Republican', 'Democratic',
+ 'National Republican', 'Democratic', 'National Republican',
+ 'Anti-Masonic', 'Whig', 'Democratic', 'Whig',
+ ...
+ 'Constitution', 'Republican', 'Independent', 'Libertarian',
+ 'Democratic', 'Green', 'Democratic', 'Republican', 'Libertarian',
+ 'Green'],
+ dtype='object', name='Party', length=182)
+For column labels, use DataFrame.columns
:
elections.columns
Index(['index', 'Candidate', 'Year', 'Popular vote', 'Result', '%'], dtype='object')
+And for the shape of the DataFrame
, we can use DataFrame.shape
to get the number of rows followed by the number of columns:
elections.shape
(182, 6)
+DataFrame
sNow that we’ve learned more about DataFrame
s, let’s dive deeper into their capabilities.
The API (Application Programming Interface) for the DataFrame
class is enormous. In this section, we’ll discuss several methods of the DataFrame
API that allow us to extract subsets of data.
The simplest way to manipulate a DataFrame
is to extract a subset of rows and columns, known as slicing.
Common ways we may want to extract data are grabbing:
+n
rows in the DataFrame
.We will do so with four primary methods of the DataFrame
class:
.head
and .tail
.loc
.iloc
[]
.head
and .tail
The simplest scenario in which we want to extract data is when we simply want to select the first or last few rows of the DataFrame
.
To extract the first n
rows of a DataFrame
df
, we use the syntax df.head(n)
.
elections = pd.read_csv("data/elections.csv")
# Extract the first 5 rows of the DataFrame
elections.head(5)
+ | Year | +Candidate | +Party | +Popular vote | +Result | +% | +
---|---|---|---|---|---|---|
0 | +1824 | +Andrew Jackson | +Democratic-Republican | +151271 | +loss | +57.210122 | +
1 | +1824 | +John Quincy Adams | +Democratic-Republican | +113142 | +win | +42.789878 | +
2 | +1828 | +Andrew Jackson | +Democratic | +642806 | +win | +56.203927 | +
3 | +1828 | +John Quincy Adams | +National Republican | +500897 | +loss | +43.796073 | +
4 | +1832 | +Andrew Jackson | +Democratic | +702735 | +win | +54.574789 | +
Similarly, calling df.tail(n)
allows us to extract the last n
rows of the DataFrame
.
# Extract the last 5 rows of the DataFrame
elections.tail(5)
+ | Year | +Candidate | +Party | +Popular vote | +Result | +% | +
---|---|---|---|---|---|---|
177 | +2016 | +Jill Stein | +Green | +1457226 | +loss | +1.073699 | +
178 | +2020 | +Joseph Biden | +Democratic | +81268924 | +win | +51.311515 | +
179 | +2020 | +Donald Trump | +Republican | +74216154 | +loss | +46.858542 | +
180 | +2020 | +Jo Jorgensen | +Libertarian | +1865724 | +loss | +1.177979 | +
181 | +2020 | +Howard Hawkins | +Green | +405035 | +loss | +0.255731 | +
.loc
For the more complex task of extracting data with specific column or index labels, we can use .loc
. The .loc
accessor allows us to specify the labels of rows and columns we wish to extract. The labels (commonly referred to as the indices) are the bold text on the far left of a DataFrame
, while the column labels are the column names found at the top of a DataFrame
.
To grab data with .loc
, we must specify the row and column label(s) where the data exists. The row labels are the first argument to the .loc
function; the column labels are the second.
Arguments to .loc
can be:
For example, to select a single value, we can select the row labeled 0
and the column labeled Candidate
from the elections
DataFrame
.
elections.loc[0, 'Candidate']
'Andrew Jackson'
+Keep in mind that passing in just one argument as a single value will produce a Series
. Below, we’ve extracted a subset of the "Popular vote"
column as a Series
.
elections.loc[[87, 25, 179], "Popular vote"]
87 15761254
+25 848019
+179 74216154
+Name: Popular vote, dtype: int64
+To select multiple rows and columns, we can use Python slice notation. Here, we select the rows from labels 0
to 3
and the columns from labels "Year"
to "Popular vote"
. Notice that unlike Python slicing, .loc
is inclusive of the right upper bound.
elections.loc[0:3, 'Year':'Popular vote']
+ | Year | +Candidate | +Party | +Popular vote | +
---|---|---|---|---|
0 | +1824 | +Andrew Jackson | +Democratic-Republican | +151271 | +
1 | +1824 | +John Quincy Adams | +Democratic-Republican | +113142 | +
2 | +1828 | +Andrew Jackson | +Democratic | +642806 | +
3 | +1828 | +John Quincy Adams | +National Republican | +500897 | +
Suppose that instead, we want to extract all column values for the first four rows in the elections
DataFrame
. The shorthand :
is useful for this.
elections.loc[0:3, :]
+ | Year | +Candidate | +Party | +Popular vote | +Result | +% | +
---|---|---|---|---|---|---|
0 | +1824 | +Andrew Jackson | +Democratic-Republican | +151271 | +loss | +57.210122 | +
1 | +1824 | +John Quincy Adams | +Democratic-Republican | +113142 | +win | +42.789878 | +
2 | +1828 | +Andrew Jackson | +Democratic | +642806 | +win | +56.203927 | +
3 | +1828 | +John Quincy Adams | +National Republican | +500897 | +loss | +43.796073 | +
We can use the same shorthand to extract all rows.
+"Year", "Candidate", "Result"]] elections.loc[:, [
+ | Year | +Candidate | +Result | +
---|---|---|---|
0 | +1824 | +Andrew Jackson | +loss | +
1 | +1824 | +John Quincy Adams | +win | +
2 | +1828 | +Andrew Jackson | +win | +
3 | +1828 | +John Quincy Adams | +loss | +
4 | +1832 | +Andrew Jackson | +win | +
... | +... | +... | +... | +
177 | +2016 | +Jill Stein | +loss | +
178 | +2020 | +Joseph Biden | +win | +
179 | +2020 | +Donald Trump | +loss | +
180 | +2020 | +Jo Jorgensen | +loss | +
181 | +2020 | +Howard Hawkins | +loss | +
182 rows × 3 columns
+There are a couple of things we should note. Firstly, unlike conventional Python, pandas
allows us to slice string values (in our example, the column labels). Secondly, slicing with .loc
is inclusive. Notice how our resulting DataFrame
includes every row and column between and including the slice labels we specified.
Equivalently, we can use a list to obtain multiple rows and columns in our elections
DataFrame
.
elections.loc[[0, 1, 2, 3], ['Year', 'Candidate', 'Party', 'Popular vote']]
+ | Year | +Candidate | +Party | +Popular vote | +
---|---|---|---|---|
0 | +1824 | +Andrew Jackson | +Democratic-Republican | +151271 | +
1 | +1824 | +John Quincy Adams | +Democratic-Republican | +113142 | +
2 | +1828 | +Andrew Jackson | +Democratic | +642806 | +
3 | +1828 | +John Quincy Adams | +National Republican | +500897 | +
Lastly, we can interchange list and slicing notation.
elections.loc[[0, 1, 2, 3], :]
+ | Year | +Candidate | +Party | +Popular vote | +Result | +% | +
---|---|---|---|---|---|---|
0 | +1824 | +Andrew Jackson | +Democratic-Republican | +151271 | +loss | +57.210122 | +
1 | +1824 | +John Quincy Adams | +Democratic-Republican | +113142 | +win | +42.789878 | +
2 | +1828 | +Andrew Jackson | +Democratic | +642806 | +win | +56.203927 | +
3 | +1828 | +John Quincy Adams | +National Republican | +500897 | +loss | +43.796073 | +
.iloc
Slicing with .iloc
works similarly to .loc
. However, .iloc
uses the integer positions of rows and columns rather than the labels (think to yourself: .loc uses labels; .iloc uses integer positions). The arguments to the .iloc
function also behave similarly — single values, lists, indices, and any combination of these are permitted.
Let’s begin reproducing our results from above. We’ll begin by selecting the first presidential candidate in our elections
DataFrame
:
# elections.loc[0, "Candidate"] - Previous approach
elections.iloc[0, 1]
'Andrew Jackson'
+Notice how the first argument to both .loc
and .iloc
are the same. This is because the row with a label of 0
is conveniently in the \(0^{\text{th}}\) (equivalently, the first position) of the elections
DataFrame
. Generally, this is true of any DataFrame
where the row labels are incremented in ascending order from 0.
And, as before, if we were to pass in only one single value argument, our result would be a Series
.
elections.iloc[[1, 2, 3], 1]
1 John Quincy Adams
+2 Andrew Jackson
+3 John Quincy Adams
+Name: Candidate, dtype: object
+However, when we select the first four rows and columns using .iloc
, we notice something.
# elections.loc[0:3, 'Year':'Popular vote'] - Previous approach
elections.iloc[0:4, 0:4]
+ | Year | +Candidate | +Party | +Popular vote | +
---|---|---|---|---|
0 | +1824 | +Andrew Jackson | +Democratic-Republican | +151271 | +
1 | +1824 | +John Quincy Adams | +Democratic-Republican | +113142 | +
2 | +1828 | +Andrew Jackson | +Democratic | +642806 | +
3 | +1828 | +John Quincy Adams | +National Republican | +500897 | +
Slicing is no longer inclusive in .iloc
— it’s exclusive. In other words, the right end of a slice is not included when using .iloc
. This is one of the subtleties of pandas
syntax; you will get used to it with practice.
List behavior works just as expected.
+#elections.loc[[0, 1, 2, 3], ['Year', 'Candidate', 'Party', 'Popular vote']] - Previous Approach
elections.iloc[[0, 1, 2, 3], [0, 1, 2, 3]]
+ | Year | +Candidate | +Party | +Popular vote | +
---|---|---|---|---|
0 | +1824 | +Andrew Jackson | +Democratic-Republican | +151271 | +
1 | +1824 | +John Quincy Adams | +Democratic-Republican | +113142 | +
2 | +1828 | +Andrew Jackson | +Democratic | +642806 | +
3 | +1828 | +John Quincy Adams | +National Republican | +500897 | +
And just like with .loc
, we can use a colon with .iloc
to extract all rows or columns.
elections.iloc[:, 0:3]
+ | Year | +Candidate | +Party | +
---|---|---|---|
0 | +1824 | +Andrew Jackson | +Democratic-Republican | +
1 | +1824 | +John Quincy Adams | +Democratic-Republican | +
2 | +1828 | +Andrew Jackson | +Democratic | +
3 | +1828 | +John Quincy Adams | +National Republican | +
4 | +1832 | +Andrew Jackson | +Democratic | +
... | +... | +... | +... | +
177 | +2016 | +Jill Stein | +Green | +
178 | +2020 | +Joseph Biden | +Democratic | +
179 | +2020 | +Donald Trump | +Republican | +
180 | +2020 | +Jo Jorgensen | +Libertarian | +
181 | +2020 | +Howard Hawkins | +Green | +
182 rows × 3 columns
+This discussion begs the question: when should we use .loc
vs. .iloc
? In most cases, .loc
is generally safer to use. You can imagine .iloc
may return incorrect values when applied to a dataset where the ordering of data can change. However, .iloc
can still be useful — for example, if you are looking at a DataFrame
of sorted movie earnings and want to get the median earnings for a given year, you can use .iloc
to index into the middle.
Overall, it is important to remember that:
+.loc
performances label-based extraction..iloc
performs integer-based extraction.[]
The []
selection operator is the most baffling of all, yet the most commonly used. It only takes a single argument, which may be one of the following:
That is, []
is context-dependent. Let’s see some examples.
Say we wanted the first four rows of our elections
DataFrame
.
elections[0:4]
+ | Year | +Candidate | +Party | +Popular vote | +Result | +% | +
---|---|---|---|---|---|---|
0 | +1824 | +Andrew Jackson | +Democratic-Republican | +151271 | +loss | +57.210122 | +
1 | +1824 | +John Quincy Adams | +Democratic-Republican | +113142 | +win | +42.789878 | +
2 | +1828 | +Andrew Jackson | +Democratic | +642806 | +win | +56.203927 | +
3 | +1828 | +John Quincy Adams | +National Republican | +500897 | +loss | +43.796073 | +
Suppose we now want the first four columns.
+"Year", "Candidate", "Party", "Popular vote"]] elections[[
+ | Year | +Candidate | +Party | +Popular vote | +
---|---|---|---|---|
0 | +1824 | +Andrew Jackson | +Democratic-Republican | +151271 | +
1 | +1824 | +John Quincy Adams | +Democratic-Republican | +113142 | +
2 | +1828 | +Andrew Jackson | +Democratic | +642806 | +
3 | +1828 | +John Quincy Adams | +National Republican | +500897 | +
4 | +1832 | +Andrew Jackson | +Democratic | +702735 | +
... | +... | +... | +... | +... | +
177 | +2016 | +Jill Stein | +Green | +1457226 | +
178 | +2020 | +Joseph Biden | +Democratic | +81268924 | +
179 | +2020 | +Donald Trump | +Republican | +74216154 | +
180 | +2020 | +Jo Jorgensen | +Libertarian | +1865724 | +
181 | +2020 | +Howard Hawkins | +Green | +405035 | +
182 rows × 4 columns
+Lastly, []
allows us to extract only the "Candidate"
column.
"Candidate"] elections[
0 Andrew Jackson
+1 John Quincy Adams
+2 Andrew Jackson
+3 John Quincy Adams
+4 Andrew Jackson
+ ...
+177 Jill Stein
+178 Joseph Biden
+179 Donald Trump
+180 Jo Jorgensen
+181 Howard Hawkins
+Name: Candidate, Length: 182, dtype: object
+The output is a Series
! In this course, we’ll become very comfortable with []
, especially for selecting columns. In practice, []
is much more common than .loc
, especially since it is far more concise.
The pandas
library is enormous and contains many useful functions. Here is a link to its documentation. We certainly don’t expect you to memorize each and every method of the library, and we will give you a reference sheet for exams.
The introductory Data 100 pandas
lectures will provide a high-level view of the key data structures and methods that will form the foundation of your pandas
knowledge. A goal of this course is to help you build your familiarity with the real-world programming practice of … Googling! Answers to your questions can be found in documentation, Stack Overflow, etc. Being able to search for, read, and implement documentation is an important life skill for any data scientist.
With that, we will move on to Pandas II!
+ + +Last time, we introduced the pandas
library as a toolkit for processing data. We learned the DataFrame
and Series
data structures, familiarized ourselves with the basic syntax for manipulating tabular data, and began writing our first lines of pandas
code.
In this lecture, we’ll start to dive into some advanced pandas
syntax. You may find it helpful to follow along with a notebook of your own as we walk through these new pieces of code.
We’ll start by loading the babynames
dataset.
# This code pulls census data and loads it into a DataFrame
+# We won't cover it explicitly in this class, but you are welcome to explore it on your own
+import pandas as pd
+import numpy as np
+import urllib.request
+import os.path
+import zipfile
+
+= "https://www.ssa.gov/oact/babynames/state/namesbystate.zip"
+ data_url = "data/babynamesbystate.zip"
+ local_filename if not os.path.exists(local_filename): # If the data exists don't download again
+with urllib.request.urlopen(data_url) as resp, open(local_filename, 'wb') as f:
+
+ f.write(resp.read())
+= zipfile.ZipFile(local_filename, 'r')
+ zf
+= 'STATE.CA.TXT'
+ ca_name = ['State', 'Sex', 'Year', 'Name', 'Count']
+ field_names with zf.open(ca_name) as fh:
+= pd.read_csv(fh, header=None, names=field_names)
+ babynames
+ babynames.head()
+ | State | +Sex | +Year | +Name | +Count | +
---|---|---|---|---|---|
0 | +CA | +F | +1910 | +Mary | +295 | +
1 | +CA | +F | +1910 | +Helen | +239 | +
2 | +CA | +F | +1910 | +Dorothy | +220 | +
3 | +CA | +F | +1910 | +Margaret | +163 | +
4 | +CA | +F | +1910 | +Frances | +134 | +
Conditional selection allows us to select a subset of rows in a DataFrame
that satisfy some specified condition.
To understand how to use conditional selection, we must look at another possible input of the .loc
and []
methods – a boolean array, which is simply an array or Series
where each element is either True
or False
. This boolean array must have a length equal to the number of rows in the DataFrame
. It will return all rows that correspond to a value of True
in the array. We used a very similar technique when performing conditional extraction from a Series
in the last lecture.
To see this in action, let’s select all even-indexed rows in the first 10 rows of our DataFrame
.
# Ask yourself: why is :9 is the correct slice to select the first 10 rows?
+= babynames.loc[:9, :]
+ babynames_first_10_rows
+# Notice how we have exactly 10 elements in our boolean array argument
+True, False, True, False, True, False, True, False, True, False]] babynames_first_10_rows[[
+ | State | +Sex | +Year | +Name | +Count | +
---|---|---|---|---|---|
0 | +CA | +F | +1910 | +Mary | +295 | +
2 | +CA | +F | +1910 | +Dorothy | +220 | +
4 | +CA | +F | +1910 | +Frances | +134 | +
6 | +CA | +F | +1910 | +Evelyn | +126 | +
8 | +CA | +F | +1910 | +Virginia | +101 | +
We can perform a similar operation using .loc
.
babynames_first_10_rows.loc[[True, False, True, False, True, False, True, False, True, False], :]
+ | State | +Sex | +Year | +Name | +Count | +
---|---|---|---|---|---|
0 | +CA | +F | +1910 | +Mary | +295 | +
2 | +CA | +F | +1910 | +Dorothy | +220 | +
4 | +CA | +F | +1910 | +Frances | +134 | +
6 | +CA | +F | +1910 | +Evelyn | +126 | +
8 | +CA | +F | +1910 | +Virginia | +101 | +
These techniques worked well in this example, but you can imagine how tedious it might be to list out True
and False
for every row in a larger DataFrame
. To make things easier, we can instead provide a logical condition as an input to .loc
or []
that returns a boolean array with the necessary length.
For example, to return all names associated with F
sex:
# First, use a logical condition to generate a boolean array
+= (babynames["Sex"] == "F")
+ logical_operator
+# Then, use this boolean array to filter the DataFrame
+ babynames[logical_operator].head()
+ | State | +Sex | +Year | +Name | +Count | +
---|---|---|---|---|---|
0 | +CA | +F | +1910 | +Mary | +295 | +
1 | +CA | +F | +1910 | +Helen | +239 | +
2 | +CA | +F | +1910 | +Dorothy | +220 | +
3 | +CA | +F | +1910 | +Margaret | +163 | +
4 | +CA | +F | +1910 | +Frances | +134 | +
Recall from the previous lecture that .head()
will return only the first few rows in the DataFrame
. In reality, babynames[logical operator]
contains as many rows as there are entries in the original babynames
DataFrame
with sex "F"
.
Here, logical_operator
evaluates to a Series
of boolean values with length 407428.
print("There are a total of {} values in 'logical_operator'".format(len(logical_operator)))
There are a total of 407428 values in 'logical_operator'
+Rows starting at row 0 and ending at row 239536 evaluate to True
and are thus returned in the DataFrame
. Rows from 239537 onwards evaluate to False
and are omitted from the output.
print("The 0th item in this 'logical_operator' is: {}".format(logical_operator.iloc[0]))
+print("The 239536th item in this 'logical_operator' is: {}".format(logical_operator.iloc[239536]))
+print("The 239537th item in this 'logical_operator' is: {}".format(logical_operator.iloc[239537]))
The 0th item in this 'logical_operator' is: True
+The 239536th item in this 'logical_operator' is: True
+The 239537th item in this 'logical_operator' is: False
+Passing a Series
as an argument to babynames[]
has the same effect as using a boolean array. In fact, the []
selection operator can take a boolean Series
, array, and list as arguments. These three are used interchangeably throughout the course.
We can also use .loc
to achieve similar results.
"Sex"] == "F"].head() babynames.loc[babynames[
+ | State | +Sex | +Year | +Name | +Count | +
---|---|---|---|---|---|
0 | +CA | +F | +1910 | +Mary | +295 | +
1 | +CA | +F | +1910 | +Helen | +239 | +
2 | +CA | +F | +1910 | +Dorothy | +220 | +
3 | +CA | +F | +1910 | +Margaret | +163 | +
4 | +CA | +F | +1910 | +Frances | +134 | +
Boolean conditions can be combined using various bitwise operators, allowing us to filter results by multiple conditions. In the table below, p and q are boolean arrays or Series
.
Symbol | +Usage | +Meaning | +
---|---|---|
~ | +~p | +Returns negation of p | +
| | +p | q | +p OR q | +
& | +p & q | +p AND q | +
^ | +p ^ q | +p XOR q (exclusive or) | +
When combining multiple conditions with logical operators, we surround each individual condition with a set of parentheses (). This imposes an order of operations on how pandas evaluates your logic and helps avoid errors.
For example, if we want to return data on all names with sex "F"
born before the year 2000, we can write:
"Sex"] == "F") & (babynames["Year"] < 2000)].head() babynames[(babynames[
+ | State | +Sex | +Year | +Name | +Count | +
---|---|---|---|---|---|
0 | +CA | +F | +1910 | +Mary | +295 | +
1 | +CA | +F | +1910 | +Helen | +239 | +
2 | +CA | +F | +1910 | +Dorothy | +220 | +
3 | +CA | +F | +1910 | +Margaret | +163 | +
4 | +CA | +F | +1910 | +Frances | +134 | +
Note that we’re working with Series
, so using and
in place of &
, or or
in place |
will error.
# This line of code will raise a ValueError
+# babynames[(babynames["Sex"] == "F") and (babynames["Year"] < 2000)].head()
If we want to return data on all names with sex "F"
or all born before the year 2000, we can write:
"Sex"] == "F") | (babynames["Year"] < 2000)].head() babynames[(babynames[
+ | State | +Sex | +Year | +Name | +Count | +
---|---|---|---|---|---|
0 | +CA | +F | +1910 | +Mary | +295 | +
1 | +CA | +F | +1910 | +Helen | +239 | +
2 | +CA | +F | +1910 | +Dorothy | +220 | +
3 | +CA | +F | +1910 | +Margaret | +163 | +
4 | +CA | +F | +1910 | +Frances | +134 | +
Boolean array selection is a useful tool, but can lead to overly verbose code for complex conditions. In the example below, our boolean condition is long enough to extend for several lines of code.
+# Note: The parentheses surrounding the code make it possible to break the code on to multiple lines for readability
+
+ ("Name"] == "Bella") |
+ babynames[(babynames["Name"] == "Alex") |
+ (babynames["Name"] == "Ani") |
+ (babynames["Name"] == "Lisa")]
+ (babynames[ ).head()
+ | State | +Sex | +Year | +Name | +Count | +
---|---|---|---|---|---|
6289 | +CA | +F | +1923 | +Bella | +5 | +
7512 | +CA | +F | +1925 | +Bella | +8 | +
12368 | +CA | +F | +1932 | +Lisa | +5 | +
14741 | +CA | +F | +1936 | +Lisa | +8 | +
17084 | +CA | +F | +1939 | +Lisa | +5 | +
Fortunately, pandas
provides many alternative methods for constructing boolean filters.
The .isin
function is one such example. This method evaluates if the values in a Series
are contained in a different sequence (list, array, or Series
) of values. In the cell below, we achieve equivalent results to the DataFrame
above with far more concise code.
= ["Bella", "Alex", "Narges", "Lisa"]
+ names "Name"].isin(names).head() babynames[
0 False
+1 False
+2 False
+3 False
+4 False
+Name: Name, dtype: bool
+"Name"].isin(names)].head() babynames[babynames[
+ | State | +Sex | +Year | +Name | +Count | +
---|---|---|---|---|---|
6289 | +CA | +F | +1923 | +Bella | +5 | +
7512 | +CA | +F | +1925 | +Bella | +8 | +
12368 | +CA | +F | +1932 | +Lisa | +5 | +
14741 | +CA | +F | +1936 | +Lisa | +8 | +
17084 | +CA | +F | +1939 | +Lisa | +5 | +
The function str.startswith
can be used to define a filter based on string values in a Series
object. It checks to see if string values in a Series
start with a particular character.
# Identify whether names begin with the letter "N"
babynames["Name"].str.startswith("N").head()
0 False
+1 False
+2 False
+3 False
+4 False
+Name: Name, dtype: bool
+# Extracting names that begin with the letter "N"
babynames[babynames["Name"].str.startswith("N")].head()
+ | State | +Sex | +Year | +Name | +Count | +
---|---|---|---|---|---|
76 | +CA | +F | +1910 | +Norma | +23 | +
83 | +CA | +F | +1910 | +Nellie | +20 | +
127 | +CA | +F | +1910 | +Nina | +11 | +
198 | +CA | +F | +1910 | +Nora | +6 | +
310 | +CA | +F | +1911 | +Nellie | +23 | +
In many data science tasks, we may need to change the columns contained in our DataFrame
in some way. Fortunately, the syntax to do so is fairly straightforward.
To add a new column to a DataFrame
, we use a syntax similar to that used when accessing an existing column. Specify the name of the new column by writing df["column"]
, then assign this to a Series
or array containing the values that will populate this column.
# Create a Series of the length of each name.
+= babynames["Name"].str.len()
+ babyname_lengths
+# Add a column named "name_lengths" that includes the length of each name
+"name_lengths"] = babyname_lengths
+ babynames[5) babynames.head(
+ | State | +Sex | +Year | +Name | +Count | +name_lengths | +
---|---|---|---|---|---|---|
0 | +CA | +F | +1910 | +Mary | +295 | +4 | +
1 | +CA | +F | +1910 | +Helen | +239 | +5 | +
2 | +CA | +F | +1910 | +Dorothy | +220 | +7 | +
3 | +CA | +F | +1910 | +Margaret | +163 | +8 | +
4 | +CA | +F | +1910 | +Frances | +134 | +7 | +
If we need to later modify an existing column, we can do so by referencing this column again with the syntax df["column"]
, then re-assigning it to a new Series
or array of the appropriate length.
# Modify the “name_lengths” column to be one less than its original value
babynames["name_lengths"] = babynames["name_lengths"] - 1
babynames.head()
+ | State | +Sex | +Year | +Name | +Count | +name_lengths | +
---|---|---|---|---|---|---|
0 | +CA | +F | +1910 | +Mary | +295 | +3 | +
1 | +CA | +F | +1910 | +Helen | +239 | +4 | +
2 | +CA | +F | +1910 | +Dorothy | +220 | +6 | +
3 | +CA | +F | +1910 | +Margaret | +163 | +7 | +
4 | +CA | +F | +1910 | +Frances | +134 | +6 | +
We can rename a column using the .rename()
method. It takes in a dictionary that maps old column names to their new ones.
# Rename “name_lengths” to “Length”
babynames = babynames.rename(columns={"name_lengths": "Length"})
babynames.head()
+ | State | +Sex | +Year | +Name | +Count | +Length | +
---|---|---|---|---|---|---|
0 | +CA | +F | +1910 | +Mary | +295 | +3 | +
1 | +CA | +F | +1910 | +Helen | +239 | +4 | +
2 | +CA | +F | +1910 | +Dorothy | +220 | +6 | +
3 | +CA | +F | +1910 | +Margaret | +163 | +7 | +
4 | +CA | +F | +1910 | +Frances | +134 | +6 | +
If we want to remove a column or row of a DataFrame
, we can call the .drop
(documentation) method. Use the axis
parameter to specify whether a column or row should be dropped. Unless otherwise specified, pandas
will assume that we are dropping a row by default.
# Drop our new "Length" column from the DataFrame
babynames = babynames.drop("Length", axis="columns")
babynames.head(5)
+ | State | +Sex | +Year | +Name | +Count | +
---|---|---|---|---|---|
0 | +CA | +F | +1910 | +Mary | +295 | +
1 | +CA | +F | +1910 | +Helen | +239 | +
2 | +CA | +F | +1910 | +Dorothy | +220 | +
3 | +CA | +F | +1910 | +Margaret | +163 | +
4 | +CA | +F | +1910 | +Frances | +134 | +
Notice that we re-assigned babynames
to the result of babynames.drop(...)
. This is a subtle but important point: pandas
table operations do not occur in-place. Calling df.drop(...)
will output a copy of df
with the row/column of interest removed without modifying the original df
table.
In other words, if we simply call:
# This creates a copy of `babynames` and removes the column "Name"...
babynames.drop("Name", axis="columns")

# ...but the original `babynames` is unchanged!
# Notice that the "Name" column is still present
babynames.head(5)
+ | State | +Sex | +Year | +Name | +Count | +
---|---|---|---|---|---|
0 | +CA | +F | +1910 | +Mary | +295 | +
1 | +CA | +F | +1910 | +Helen | +239 | +
2 | +CA | +F | +1910 | +Dorothy | +220 | +
3 | +CA | +F | +1910 | +Margaret | +163 | +
4 | +CA | +F | +1910 | +Frances | +134 | +
pandas
contains an extensive library of functions that can help shorten the process of setting and getting information from its data structures. In the following section, we will give overviews of each of the main utility functions that will help us in Data 100.
Discussing all functionality offered by pandas
could take an entire semester! We will walk you through the most commonly-used functions and encourage you to explore and experiment on your own.
NumPy
and built-in function support.shape
.size
.describe()
.sample()
.value_counts()
.unique()
.sort_values()
The pandas
documentation will be a valuable resource in Data 100 and beyond.
NumPy
pandas
is designed to work well with NumPy
, the framework for array computations you encountered in Data 8. Just about any NumPy
function can be applied to pandas
DataFrame
s and Series
.
# Pull out the number of babies named Yash each year
yash_count = babynames[babynames["Name"] == "Yash"]["Count"]
yash_count.head()
331824 8
+334114 9
+336390 11
+338773 12
+341387 10
+Name: Count, dtype: int64
+# Average number of babies named Yash each year
+ np.mean(yash_count)
np.float64(17.142857142857142)
+# Max number of babies named Yash born in any one year
np.max(yash_count)
np.int64(29)
+.shape
and .size
.shape
and .size
are attributes of Series
and DataFrame
s that measure the “amount” of data stored in the structure. Calling .shape
returns a tuple containing the number of rows and columns present in the DataFrame
or Series
. .size
is used to find the total number of elements in a structure, equivalent to the number of rows times the number of columns.
Many functions strictly require the dimensions of the arguments along certain axes to match. Calling these dimension-finding functions is much faster than counting all of the items by hand.
+# Return the shape of the DataFrame, in the format (num_rows, num_columns)
+ babynames.shape
(407428, 5)
+# Return the size of the DataFrame, equal to num_rows * num_columns
+ babynames.size
2037140
+.describe()
If many statistics are required from a DataFrame
(minimum value, maximum value, mean value, etc.), then .describe()
(documentation) can be used to compute all of them at once.
babynames.describe()
+ | Year | +Count | +
---|---|---|
count | +407428.000000 | +407428.000000 | +
mean | +1985.733609 | +79.543456 | +
std | +27.007660 | +293.698654 | +
min | +1910.000000 | +5.000000 | +
25% | +1969.000000 | +7.000000 | +
50% | +1992.000000 | +13.000000 | +
75% | +2008.000000 | +38.000000 | +
max | +2022.000000 | +8260.000000 | +
A different set of statistics will be reported if .describe()
is called on a Series
.
"Sex"].describe() babynames[
count 407428
+unique 2
+top F
+freq 239537
+Name: Sex, dtype: object
+.sample()
As we will see later in the semester, random processes are at the heart of many data science techniques (for example, train-test splits, bootstrapping, and cross-validation). .sample()
(documentation) lets us quickly select random entries (a row if called from a DataFrame
, or a value if called from a Series
).
By default, .sample()
selects entries without replacement. Pass in the argument replace=True
to sample with replacement.
# Sample a single row
+ babynames.sample()
+ | State | +Sex | +Year | +Name | +Count | +
---|---|---|---|---|---|
28158 | +CA | +F | +1950 | +Vikki | +14 | +
Naturally, this can be chained with other methods and operators (iloc
, etc.).
# Sample 5 random rows, and select all columns after column 2
babynames.sample(5).iloc[:, 2:]
+ | Year | +Name | +Count | +
---|---|---|---|
82058 | +1979 | +Lakesha | +11 | +
387687 | +2016 | +Zayn | +101 | +
105977 | +1988 | +Cecilia | +213 | +
75257 | +1976 | +Clarice | +7 | +
7685 | +1925 | +Elia | +5 | +
# Randomly sample 4 names from the year 2000, with replacement, and select all columns after column 2
+"Year"] == 2000].sample(4, replace = True).iloc[:, 2:] babynames[babynames[
+ | Year | +Name | +Count | +
---|---|---|---|
342973 | +2000 | +Grayson | +46 | +
151608 | +2000 | +Roshni | +8 | +
343172 | +2000 | +Dwayne | +27 | +
343039 | +2000 | +Jair | +38 | +
.value_counts()
The Series.value_counts()
(documentation) method counts the number of occurrence of each unique value in a Series
. In other words, it counts the number of times each unique value appears. This is often useful for determining the most or least common entries in a Series
.
In the example below, we can determine the name with the most years in which at least one person has taken that name by counting the number of times each name appears in the "Name"
column of babynames
. Note that the return value is also a Series
.
"Name"].value_counts().head() babynames[
Name
+Jean 223
+Francis 221
+Guadalupe 218
+Jessie 217
+Marion 214
+Name: count, dtype: int64
+.unique()
If we have a Series
with many repeated values, then .unique()
(documentation) can be used to identify only the unique values. Here we return an array of all the names in babynames
.
"Name"].unique() babynames[
array(['Mary', 'Helen', 'Dorothy', ..., 'Zae', 'Zai', 'Zayvier'],
+ dtype=object)
+.sort_values()
Ordering a DataFrame
can be useful for isolating extreme values. For example, the first 5 entries of a column sorted in descending order (that is, from highest to lowest) are its 5 largest values. .sort_values
(documentation) allows us to order a DataFrame
or Series
by a specified column. We can choose to either receive the rows in ascending
order (default) or descending
order.
# Sort the "Count" column from highest to lowest
+="Count", ascending=False).head() babynames.sort_values(by
+ | State | +Sex | +Year | +Name | +Count | +
---|---|---|---|---|---|
268041 | +CA | +M | +1957 | +Michael | +8260 | +
267017 | +CA | +M | +1956 | +Michael | +8258 | +
317387 | +CA | +M | +1990 | +Michael | +8246 | +
281850 | +CA | +M | +1969 | +Michael | +8245 | +
283146 | +CA | +M | +1970 | +Michael | +8196 | +
Unlike when calling .sort_values() on a DataFrame, we do not need to explicitly specify the column used for sorting when calling .sort_values() on a Series. We can still specify the ordering paradigm – that is, whether values are sorted in ascending or descending order.
# Sort the "Name" Series alphabetically
+"Name"].sort_values(ascending=True).head() babynames[
366001 Aadan
+384005 Aadan
+369120 Aadan
+398211 Aadarsh
+370306 Aaden
+Name: Name, dtype: object
+Manipulating DataFrames
is not a skill that is mastered in just one day. Due to the flexibility of pandas
, there are many different ways to get from point A to point B. We recommend trying multiple different ways to solve the same problem to gain even more practice and reach that point of mastery sooner.
Next, we will start digging deeper into the mechanics behind grouping data.
We will introduce the concept of aggregating data – we will familiarize ourselves with GroupBy objects and use them as tools to consolidate and summarize a DataFrame
. In this lecture, we will explore working with the different aggregation functions and dive into some advanced .groupby
methods to show just how powerful of a resource they can be for understanding our data. We will also introduce other techniques for data aggregation to provide flexibility in how we manipulate our tables.
First, let’s finish our discussion about sorting. Let’s try to solve a sorting problem using different approaches. Assume we want to find the longest baby names and sort our data accordingly.
+We’ll start by loading the babynames
dataset. Note that this dataset is filtered to only contain data from California.
# This code pulls census data and loads it into a DataFrame
# We won't cover it explicitly in this class, but you are welcome to explore it on your own
import pandas as pd
import numpy as np
import urllib.request
import os.path
import zipfile

data_url = "https://www.ssa.gov/oact/babynames/state/namesbystate.zip"
local_filename = "data/babynamesbystate.zip"
if not os.path.exists(local_filename): # If the data exists don't download again
    with urllib.request.urlopen(data_url) as resp, open(local_filename, 'wb') as f:
        f.write(resp.read())

zf = zipfile.ZipFile(local_filename, 'r')

ca_name = 'STATE.CA.TXT'
field_names = ['State', 'Sex', 'Year', 'Name', 'Count']
with zf.open(ca_name) as fh:
    babynames = pd.read_csv(fh, header=None, names=field_names)

babynames.tail(10)
+ | State | +Sex | +Year | +Name | +Count | +
---|---|---|---|---|---|
407418 | +CA | +M | +2022 | +Zach | +5 | +
407419 | +CA | +M | +2022 | +Zadkiel | +5 | +
407420 | +CA | +M | +2022 | +Zae | +5 | +
407421 | +CA | +M | +2022 | +Zai | +5 | +
407422 | +CA | +M | +2022 | +Zay | +5 | +
407423 | +CA | +M | +2022 | +Zayvier | +5 | +
407424 | +CA | +M | +2022 | +Zia | +5 | +
407425 | +CA | +M | +2022 | +Zora | +5 | +
407426 | +CA | +M | +2022 | +Zuriel | +5 | +
407427 | +CA | +M | +2022 | +Zylo | +5 | +
One method to do this is to first start by creating a column that contains the lengths of the names.
# Create a Series of the length of each name
babyname_lengths = babynames["Name"].str.len()

# Add a column named "name_lengths" that includes the length of each name
babynames["name_lengths"] = babyname_lengths
babynames.head(5)
+ | State | +Sex | +Year | +Name | +Count | +name_lengths | +
---|---|---|---|---|---|---|
0 | +CA | +F | +1910 | +Mary | +295 | +4 | +
1 | +CA | +F | +1910 | +Helen | +239 | +5 | +
2 | +CA | +F | +1910 | +Dorothy | +220 | +7 | +
3 | +CA | +F | +1910 | +Margaret | +163 | +8 | +
4 | +CA | +F | +1910 | +Frances | +134 | +7 | +
We can then sort the DataFrame
by that column using .sort_values()
:
# Sort by the temporary column
babynames = babynames.sort_values(by="name_lengths", ascending=False)
babynames.head(5)
+ | State | +Sex | +Year | +Name | +Count | +name_lengths | +
---|---|---|---|---|---|---|
334166 | +CA | +M | +1996 | +Franciscojavier | +8 | +15 | +
337301 | +CA | +M | +1997 | +Franciscojavier | +5 | +15 | +
339472 | +CA | +M | +1998 | +Franciscojavier | +6 | +15 | +
321792 | +CA | +M | +1991 | +Ryanchristopher | +7 | +15 | +
327358 | +CA | +M | +1993 | +Johnchristopher | +5 | +15 | +
Finally, we can drop the name_lengths
column from babynames
to prevent our table from getting cluttered.
# Drop the 'name_lengths' column
babynames = babynames.drop("name_lengths", axis='columns')
babynames.head(5)
+ | State | +Sex | +Year | +Name | +Count | +
---|---|---|---|---|---|
334166 | +CA | +M | +1996 | +Franciscojavier | +8 | +
337301 | +CA | +M | +1997 | +Franciscojavier | +5 | +
339472 | +CA | +M | +1998 | +Franciscojavier | +6 | +
321792 | +CA | +M | +1991 | +Ryanchristopher | +7 | +
327358 | +CA | +M | +1993 | +Johnchristopher | +5 | +
key
ArgumentAnother way to approach this is to use the key
argument of .sort_values()
. Here we can specify that we want to sort "Name"
values by their length.
"Name", key=lambda x: x.str.len(), ascending=False).head() babynames.sort_values(
+ | State | +Sex | +Year | +Name | +Count | +
---|---|---|---|---|---|
334166 | +CA | +M | +1996 | +Franciscojavier | +8 | +
327472 | +CA | +M | +1993 | +Ryanchristopher | +5 | +
337301 | +CA | +M | +1997 | +Franciscojavier | +5 | +
337477 | +CA | +M | +1997 | +Ryanchristopher | +5 | +
312543 | +CA | +M | +1987 | +Franciscojavier | +5 | +
map
FunctionWe can also use the map
function on a Series
to solve this. Say we want to sort the babynames
table by the number of "dr"
’s and "ea"
’s in each "Name"
. We’ll define the function dr_ea_count
to help us out.
# First, define a function to count the number of times "dr" or "ea" appear in each name
def dr_ea_count(string):
    return string.count('dr') + string.count('ea')

# Then, use `map` to apply `dr_ea_count` to each name in the "Name" column
babynames["dr_ea_count"] = babynames["Name"].map(dr_ea_count)

# Sort the DataFrame by the new "dr_ea_count" column so we can see our handiwork
babynames = babynames.sort_values(by="dr_ea_count", ascending=False)
babynames.head()
+ | State | +Sex | +Year | +Name | +Count | +dr_ea_count | +
---|---|---|---|---|---|---|
115957 | +CA | +F | +1990 | +Deandrea | +5 | +3 | +
101976 | +CA | +F | +1986 | +Deandrea | +6 | +3 | +
131029 | +CA | +F | +1994 | +Leandrea | +5 | +3 | +
108731 | +CA | +F | +1988 | +Deandrea | +5 | +3 | +
308131 | +CA | +M | +1985 | +Deandrea | +6 | +3 | +
We can drop the dr_ea_count
once we’re done using it to maintain a neat table.
# Drop the `dr_ea_count` column
babynames = babynames.drop("dr_ea_count", axis='columns')
babynames.head(5)
+ | State | +Sex | +Year | +Name | +Count | +
---|---|---|---|---|---|
115957 | +CA | +F | +1990 | +Deandrea | +5 | +
101976 | +CA | +F | +1986 | +Deandrea | +6 | +
131029 | +CA | +F | +1994 | +Leandrea | +5 | +
108731 | +CA | +F | +1988 | +Deandrea | +5 | +
308131 | +CA | +M | +1985 | +Deandrea | +6 | +
.groupby
Up until this point, we have been working with individual rows of DataFrame
s. As data scientists, we often wish to investigate trends across a larger subset of our data. For example, we may want to compute some summary statistic (the mean, median, sum, etc.) for a group of rows in our DataFrame
. To do this, we’ll use pandas
GroupBy
objects. Our goal is to group together rows that fall under the same category and perform an operation that aggregates across all rows in the category.
Let’s say we wanted to aggregate all rows in babynames
for a given year.
"Year") babynames.groupby(
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x1037c3d90>
+What does this strange output mean? Calling .groupby
(documentation) has generated a GroupBy
object. You can imagine this as a set of “mini” sub-DataFrame
s, where each subframe contains all of the rows from babynames
that correspond to a particular year.
The diagram below shows a simplified view of babynames
to help illustrate this idea.
We can’t work with a GroupBy
object directly – that is why you saw that strange output earlier rather than a standard view of a DataFrame
. To actually manipulate values within these “mini” DataFrame
s, we’ll need to call an aggregation method. This is a method that tells pandas
how to aggregate the values within the GroupBy
object. Once the aggregation is applied, pandas
will return a normal (now grouped) DataFrame
.
The first aggregation method we’ll consider is .agg
. The .agg
method takes in a function as its argument; this function is then applied to each column of a “mini” grouped DataFrame. We end up with a new DataFrame
with one aggregated row per subframe. Let’s see this in action by finding the sum
of all counts for each year in babynames
– this is equivalent to finding the number of babies born in each year.
"Year", "Count"]].groupby("Year").agg(sum).head(5) babynames[[
/var/folders/m7/89sj44pj21ddhplt2bn4qjcm0000gr/T/ipykernel_57856/2718070104.py:1: FutureWarning:
+
+The provided callable <built-in function sum> is currently using DataFrameGroupBy.sum. In a future version of pandas, the provided callable will be used directly. To keep current behavior pass the string "sum" instead.
+
++ | Count | +
---|---|
Year | ++ |
1910 | +9163 | +
1911 | +9983 | +
1912 | +17946 | +
1913 | +22094 | +
1914 | +26926 | +
We can relate this back to the diagram we used above. Remember that the diagram uses a simplified version of babynames
, which is why we see smaller values for the summed counts.
Calling .agg
has condensed each subframe back into a single row. This gives us our final output: a DataFrame
that is now indexed by "Year"
, with a single row for each unique year in the original babynames
DataFrame.
There are many different aggregation functions we can use, all of which are useful in different applications.
+"Year", "Count"]].groupby("Year").agg(min).head(5) babynames[[
/var/folders/m7/89sj44pj21ddhplt2bn4qjcm0000gr/T/ipykernel_57856/86785752.py:1: FutureWarning:
+
+The provided callable <built-in function min> is currently using DataFrameGroupBy.min. In a future version of pandas, the provided callable will be used directly. To keep current behavior pass the string "min" instead.
+
++ | Count | +
---|---|
Year | ++ |
1910 | +5 | +
1911 | +5 | +
1912 | +5 | +
1913 | +5 | +
1914 | +5 | +
"Year", "Count"]].groupby("Year").agg(max).head(5) babynames[[
/var/folders/m7/89sj44pj21ddhplt2bn4qjcm0000gr/T/ipykernel_57856/3032256904.py:1: FutureWarning:
+
+The provided callable <built-in function max> is currently using DataFrameGroupBy.max. In a future version of pandas, the provided callable will be used directly. To keep current behavior pass the string "max" instead.
+
++ | Count | +
---|---|
Year | ++ |
1910 | +295 | +
1911 | +390 | +
1912 | +534 | +
1913 | +614 | +
1914 | +773 | +
# Same result, but now we explicitly tell pandas to only consider the "Count" column when summing
+"Year")[["Count"]].agg(sum).head(5) babynames.groupby(
/var/folders/m7/89sj44pj21ddhplt2bn4qjcm0000gr/T/ipykernel_57856/1958904241.py:2: FutureWarning:
+
+The provided callable <built-in function sum> is currently using DataFrameGroupBy.sum. In a future version of pandas, the provided callable will be used directly. To keep current behavior pass the string "sum" instead.
+
++ | Count | +
---|---|
Year | ++ |
1910 | +9163 | +
1911 | +9983 | +
1912 | +17946 | +
1913 | +22094 | +
1914 | +26926 | +
There are many different aggregations that can be applied to the grouped data. The primary requirement is that an aggregation function must:
+Series
of data (a single column of the grouped subframe).Series
.Because of this fairly broad requirement, pandas
offers many ways of computing an aggregation.
In-built Python operations – such as sum
, max
, and min
– are automatically recognized by pandas
.
# What is the minimum count for each name in any year?
+"Name")[["Count"]].agg(min).head() babynames.groupby(
/var/folders/m7/89sj44pj21ddhplt2bn4qjcm0000gr/T/ipykernel_57856/3244314896.py:2: FutureWarning:
+
+The provided callable <built-in function min> is currently using DataFrameGroupBy.min. In a future version of pandas, the provided callable will be used directly. To keep current behavior pass the string "min" instead.
+
++ | Count | +
---|---|
Name | ++ |
Aadan | +5 | +
Aadarsh | +6 | +
Aaden | +10 | +
Aadhav | +6 | +
Aadhini | +6 | +
# What is the largest single-year count of each name?
+"Name")[["Count"]].agg(max).head() babynames.groupby(
/var/folders/m7/89sj44pj21ddhplt2bn4qjcm0000gr/T/ipykernel_57856/3805876622.py:2: FutureWarning:
+
+The provided callable <built-in function max> is currently using DataFrameGroupBy.max. In a future version of pandas, the provided callable will be used directly. To keep current behavior pass the string "max" instead.
+
++ | Count | +
---|---|
Name | ++ |
Aadan | +7 | +
Aadarsh | +6 | +
Aaden | +158 | +
Aadhav | +8 | +
Aadhini | +6 | +
As mentioned previously, functions from the NumPy
library, such as np.mean
, np.max
, np.min
, and np.sum
, are also fair game in pandas
.
# What is the average count for each name across all years?
+"Name")[["Count"]].agg(np.mean).head() babynames.groupby(
/var/folders/m7/89sj44pj21ddhplt2bn4qjcm0000gr/T/ipykernel_57856/308986604.py:2: FutureWarning:
+
+The provided callable <function mean at 0x103985360> is currently using DataFrameGroupBy.mean. In a future version of pandas, the provided callable will be used directly. To keep current behavior pass the string "mean" instead.
+
++ | Count | +
---|---|
Name | ++ |
Aadan | +6.000000 | +
Aadarsh | +6.000000 | +
Aaden | +46.214286 | +
Aadhav | +6.750000 | +
Aadhini | +6.000000 | +
pandas
also offers a number of in-built functions. Functions that are native to pandas
can be referenced using their string name within a call to .agg
. Some examples include:
.agg("sum")
.agg("max")
.agg("min")
.agg("mean")
.agg("first")
.agg("last")
The latter two entries in this list – "first"
and "last"
– are unique to pandas
. They return the first or last entry in a subframe column. Why might this be useful? Consider a case where every row in a group shares an identical value in some column. To represent this information in the grouped output, we can simply grab the first or last entry, which we know will be identical to all other entries.
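As a brief sketch, the string-named aggregations just listed can be passed directly to .agg; this also sidesteps the FutureWarning (visible in several outputs in this note) about passing built-in callables such as sum.

# Equivalent to .agg(sum), using the pandas-native string name instead of the built-in callable
babynames.groupby("Year")[["Count"]].agg("sum").head()

# "first" grabs the first entry of each column within each group
babynames.groupby("Name")[["Count"]].agg("first").head()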
Let’s illustrate this with an example. Say we add a new column to babynames
that contains the first letter of each name.
# Imagine we had an additional column, "First Letter". We'll explain this code next week
babynames["First Letter"] = babynames["Name"].str[0]

# We construct a simplified DataFrame containing just a subset of columns
babynames_new = babynames[["Name", "First Letter", "Year"]]
babynames_new.head()
+ | Name | +First Letter | +Year | +
---|---|---|---|
115957 | +Deandrea | +D | +1990 | +
101976 | +Deandrea | +D | +1986 | +
131029 | +Leandrea | +L | +1994 | +
108731 | +Deandrea | +D | +1988 | +
308131 | +Deandrea | +D | +1985 | +
If we form groups for each name in the dataset, "First Letter"
will be the same for all members of the group. This means that if we simply select the first entry for "First Letter"
in the group, we’ll represent all data in that group.
We can use a dictionary to apply different aggregation functions to each column during grouping.
+"Name").agg({"First Letter":"first", "Year":"max"}).head() babynames_new.groupby(
+ | First Letter | +Year | +
---|---|---|
Name | ++ | + |
Aadan | +A | +2014 | +
Aadarsh | +A | +2019 | +
Aaden | +A | +2020 | +
Aadhav | +A | +2019 | +
Aadhini | +A | +2022 | +
Let’s use .agg
to find the total number of babies born in each year. Recall that using .agg
with .groupby()
follows the format: df.groupby(column_name).agg(aggregation_function)
. The line of code below gives us the total number of babies born in each year.
"Year")[["Count"]].agg(sum).head(5)
+ babynames.groupby(# Alternative 1
+# babynames.groupby("Year")[["Count"]].sum()
+# Alternative 2
+# babynames.groupby("Year").sum(numeric_only=True)
/var/folders/m7/89sj44pj21ddhplt2bn4qjcm0000gr/T/ipykernel_57856/390646742.py:1: FutureWarning:
+
+The provided callable <built-in function sum> is currently using DataFrameGroupBy.sum. In a future version of pandas, the provided callable will be used directly. To keep current behavior pass the string "sum" instead.
+
++ | Count | +
---|---|
Year | ++ |
1910 | +9163 | +
1911 | +9983 | +
1912 | +17946 | +
1913 | +22094 | +
1914 | +26926 | +
Here’s an illustration of the process:
Plotting the DataFrame
we obtain tells an interesting story.
import plotly.express as px
puzzle2 = babynames.groupby("Year")[["Count"]].agg(sum)
px.line(puzzle2, y="Count")
/var/folders/m7/89sj44pj21ddhplt2bn4qjcm0000gr/T/ipykernel_57856/4066413905.py:2: FutureWarning:
+
+The provided callable <built-in function sum> is currently using DataFrameGroupBy.sum. In a future version of pandas, the provided callable will be used directly. To keep current behavior pass the string "sum" instead.
+
A word of warning: we made an enormous assumption when we decided to use this dataset to estimate birth rate. According to this article from the Legislative Analyst's Office, the true number of babies born in California in 2020 was 421,275. However, our plot shows 362,882 babies. What happened?
+.groupby()
FunctionA groupby
operation involves some combination of splitting a DataFrame
into grouped subframes, applying a function, and combining the results.
For some arbitrary DataFrame
df
below, the code df.groupby("year").agg(sum)
does the following:
DataFrame
into sub-DataFrame
s with rows belonging to the same year.sum
function to each column of each sub-DataFrame
.sum
into a single DataFrame
, indexed by year
..agg()
Function.agg()
can take in any function that aggregates several values into one summary value. Some commonly-used aggregation functions can even be called directly, without explicit use of .agg()
. For example, we can call .mean()
on .groupby()
:
babynames.groupby("Year").mean().head()
+We can now put this all into practice. Say we want to find the baby name with sex “F” that has fallen in popularity the most in California. To calculate this, we can first create a metric: “Ratio to Peak” (RTP). The RTP is the ratio of babies born with a given name in 2022 to the maximum number of babies born with the name in any year.
+Let’s start with calculating this for one baby, “Jennifer”.
# We filter by babies with sex "F" and sort by "Year"
f_babynames = babynames[babynames["Sex"] == "F"]
f_babynames = f_babynames.sort_values(["Year"])

# Determine how many Jennifers were born in CA per year
jenn_counts_series = f_babynames[f_babynames["Name"] == "Jennifer"]["Count"]

# Determine the max number of Jennifers born in a year and the number born in 2022
# to calculate RTP
max_jenn = max(f_babynames[f_babynames["Name"] == "Jennifer"]["Count"])
curr_jenn = f_babynames[f_babynames["Name"] == "Jennifer"]["Count"].iloc[-1]
rtp = curr_jenn / max_jenn
rtp
np.float64(0.018796372629843364)
+By creating a function to calculate RTP and applying it to our DataFrame
by using .groupby()
, we can easily compute the RTP for all names at once!
def ratio_to_peak(series):
    return series.iloc[-1] / max(series)

# Using .groupby() to apply the function
rtp_table = f_babynames.groupby("Name")[["Year", "Count"]].agg(ratio_to_peak)
rtp_table.head()
+ | Year | +Count | +
---|---|---|
Name | ++ | + |
Aadhini | +1.0 | +1.000000 | +
Aadhira | +1.0 | +0.500000 | +
Aadhya | +1.0 | +0.660000 | +
Aadya | +1.0 | +0.586207 | +
Aahana | +1.0 | +0.269231 | +
In the rows shown above, we can see that every row has a Year value of 1.0. This makes sense: because f_babynames is sorted by Year, the last Year entry in each group is also that group's maximum, so ratio_to_peak returns 1 for the Year column.
This is the “pandas
-ification” of logic you saw in Data 8. Much of the logic you’ve learned in Data 8 will serve you well in Data 100.
Note that you must be careful with which columns you apply the .agg()
function to. If we were to apply our function to the table as a whole by doing f_babynames.groupby("Name").agg(ratio_to_peak)
, executing our .agg()
call would result in a TypeError
.
We can avoid this issue (and prevent unintentional loss of data) by explicitly selecting the column(s) we want to apply our aggregation function to before calling .agg().
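A minimal sketch of this contrast, reusing the ratio_to_peak function and the f_babynames table defined above (the failing call is left commented out so the cell runs):

# Applying ratio_to_peak to every remaining column (including string columns
# such as "State" and "Sex") raises a TypeError:
# f_babynames.groupby("Name").agg(ratio_to_peak)

# Selecting only the relevant columns first works as intended
f_babynames.groupby("Name")[["Year", "Count"]].agg(ratio_to_peak).head()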
By default, .groupby
will not rename any aggregated columns. As we can see in the table above, the aggregated column is still named Count
even though it now represents the RTP. For better readability, we can rename Count
to Count RTP.
rtp_table = rtp_table.rename(columns={"Count": "Count RTP"})
rtp_table
+ | Year | +Count RTP | +
---|---|---|
Name | ++ | + |
Aadhini | +1.0 | +1.000000 | +
Aadhira | +1.0 | +0.500000 | +
Aadhya | +1.0 | +0.660000 | +
Aadya | +1.0 | +0.586207 | +
Aahana | +1.0 | +0.269231 | +
... | +... | +... | +
Zyanya | +1.0 | +0.466667 | +
Zyla | +1.0 | +1.000000 | +
Zylah | +1.0 | +1.000000 | +
Zyra | +1.0 | +1.000000 | +
Zyrah | +1.0 | +0.833333 | +
13782 rows × 2 columns
+By sorting rtp_table
, we can see the names whose popularity has decreased the most.
rtp_table = rtp_table.rename(columns={"Count": "Count RTP"})
rtp_table.sort_values("Count RTP").head()
+ | Year | +Count RTP | +
---|---|---|
Name | ++ | + |
Debra | +1.0 | +0.001260 | +
Debbie | +1.0 | +0.002815 | +
Carol | +1.0 | +0.003180 | +
Tammy | +1.0 | +0.003249 | +
Susan | +1.0 | +0.003305 | +
To visualize the above DataFrame
, let’s look at the line plot below:
import plotly.express as px
px.line(f_babynames[f_babynames["Name"] == "Debra"], x="Year", y="Count")
We can get the list of the top 10 names and then plot popularity with the following code:
top10 = rtp_table.sort_values("Count RTP").head(10).index

px.line(
    f_babynames[f_babynames["Name"].isin(top10)],
    x="Year",
    y="Count",
    color="Name"
)
As a quick exercise, consider what code would compute the total number of babies with each name.
+"Name")[["Count"]].agg(sum).head()
+ babynames.groupby(# alternative solution:
+# babynames.groupby("Name")[["Count"]].sum()
/var/folders/m7/89sj44pj21ddhplt2bn4qjcm0000gr/T/ipykernel_57856/1912269730.py:1: FutureWarning:
+
+The provided callable <built-in function sum> is currently using DataFrameGroupBy.sum. In a future version of pandas, the provided callable will be used directly. To keep current behavior pass the string "sum" instead.
+
++ | Count | +
---|---|
Name | ++ |
Aadan | +18 | +
Aadarsh | +6 | +
Aaden | +647 | +
Aadhav | +27 | +
Aadhini | +6 | +
.groupby()
, ContinuedWe’ll work with the elections
DataFrame
again.
import pandas as pd
import numpy as np

elections = pd.read_csv("data/elections.csv")
elections.head(5)
+ | Year | +Candidate | +Party | +Popular vote | +Result | +% | +
---|---|---|---|---|---|---|
0 | +1824 | +Andrew Jackson | +Democratic-Republican | +151271 | +loss | +57.210122 | +
1 | +1824 | +John Quincy Adams | +Democratic-Republican | +113142 | +win | +42.789878 | +
2 | +1828 | +Andrew Jackson | +Democratic | +642806 | +win | +56.203927 | +
3 | +1828 | +John Quincy Adams | +National Republican | +500897 | +loss | +43.796073 | +
4 | +1832 | +Andrew Jackson | +Democratic | +702735 | +win | +54.574789 | +
GroupBy
ObjectsThe result of groupby
applied to a DataFrame
is a DataFrameGroupBy
object, not a DataFrame
.
= elections.groupby("Year")
+ grouped_by_year type(grouped_by_year)
pandas.core.groupby.generic.DataFrameGroupBy
+There are several ways to look into DataFrameGroupBy
objects:
= elections.groupby("Party")
+ grouped_by_party grouped_by_party.groups
{'American': [22, 126], 'American Independent': [115, 119, 124], 'Anti-Masonic': [6], 'Anti-Monopoly': [38], 'Citizens': [127], 'Communist': [89], 'Constitution': [160, 164, 172], 'Constitutional Union': [24], 'Democratic': [2, 4, 8, 10, 13, 14, 17, 20, 28, 29, 34, 37, 39, 45, 47, 52, 55, 57, 64, 70, 74, 77, 81, 83, 86, 91, 94, 97, 100, 105, 108, 111, 114, 116, 118, 123, 129, 134, 137, 140, 144, 151, 158, 162, 168, 176, 178], 'Democratic-Republican': [0, 1], 'Dixiecrat': [103], 'Farmer–Labor': [78], 'Free Soil': [15, 18], 'Green': [149, 155, 156, 165, 170, 177, 181], 'Greenback': [35], 'Independent': [121, 130, 143, 161, 167, 174], 'Liberal Republican': [31], 'Libertarian': [125, 128, 132, 138, 139, 146, 153, 159, 163, 169, 175, 180], 'National Democratic': [50], 'National Republican': [3, 5], 'National Union': [27], 'Natural Law': [148], 'New Alliance': [136], 'Northern Democratic': [26], 'Populist': [48, 61, 141], 'Progressive': [68, 82, 101, 107], 'Prohibition': [41, 44, 49, 51, 54, 59, 63, 67, 73, 75, 99], 'Reform': [150, 154], 'Republican': [21, 23, 30, 32, 33, 36, 40, 43, 46, 53, 56, 60, 65, 69, 72, 79, 80, 84, 87, 90, 96, 98, 104, 106, 109, 112, 113, 117, 120, 122, 131, 133, 135, 142, 145, 152, 157, 166, 171, 173, 179], 'Socialist': [58, 62, 66, 71, 76, 85, 88, 92, 95, 102], 'Southern Democratic': [25], 'States' Rights': [110], 'Taxpayers': [147], 'Union': [93], 'Union Labor': [42], 'Whig': [7, 9, 11, 12, 16, 19]}
+"Socialist") grouped_by_party.get_group(
+ | Year | +Candidate | +Party | +Popular vote | +Result | +% | +
---|---|---|---|---|---|---|
58 | +1904 | +Eugene V. Debs | +Socialist | +402810 | +loss | +2.985897 | +
62 | +1908 | +Eugene V. Debs | +Socialist | +420852 | +loss | +2.850866 | +
66 | +1912 | +Eugene V. Debs | +Socialist | +901551 | +loss | +6.004354 | +
71 | +1916 | +Allan L. Benson | +Socialist | +590524 | +loss | +3.194193 | +
76 | +1920 | +Eugene V. Debs | +Socialist | +913693 | +loss | +3.428282 | +
85 | +1928 | +Norman Thomas | +Socialist | +267478 | +loss | +0.728623 | +
88 | +1932 | +Norman Thomas | +Socialist | +884885 | +loss | +2.236211 | +
92 | +1936 | +Norman Thomas | +Socialist | +187910 | +loss | +0.412876 | +
95 | +1940 | +Norman Thomas | +Socialist | +116599 | +loss | +0.234237 | +
102 | +1948 | +Norman Thomas | +Socialist | +139569 | +loss | +0.286312 | +
GroupBy
MethodsThere are many aggregation methods we can use with .agg
. Some useful options are:
.mean
: creates a new DataFrame
with the mean value of each group.sum
: creates a new DataFrame
with the sum of each group.max
and .min
: creates a new DataFrame
with the maximum/minimum value of each group.first
and .last
: creates a new DataFrame
with the first/last row in each group.size
: creates a new Series
with the number of entries in each group.count
: creates a new DataFrame
with the number of entries, excluding missing values.Let’s illustrate some examples by creating a DataFrame
called df
.
df = pd.DataFrame({'letter': ['A', 'A', 'B', 'C', 'C', 'C'],
                   'num': [1, 2, 3, 4, np.nan, 4],
                   'state': [np.nan, 'tx', 'fl', 'hi', np.nan, 'ak']})
df
+ | letter | +num | +state | +
---|---|---|---|
0 | +A | +1.0 | +NaN | +
1 | +A | +2.0 | +tx | +
2 | +B | +3.0 | +fl | +
3 | +C | +4.0 | +hi | +
4 | +C | +NaN | +NaN | +
5 | +C | +4.0 | +ak | +
Note the slight difference between .size()
and .count()
: while .size()
returns a Series
and counts the number of entries including the missing values, .count()
returns a DataFrame
and counts the number of entries in each column excluding missing values.
"letter").size() df.groupby(
letter
+A 2
+B 1
+C 3
+dtype: int64
+"letter").count() df.groupby(
+ | num | +state | +
---|---|---|
letter | ++ | + |
A | +2 | +1 | +
B | +1 | +1 | +
C | +2 | +2 | +
You might recall that the value_counts()
function in the previous note does something similar. It turns out value_counts()
and groupby.size()
are the same, except value_counts()
sorts the resulting Series
in descending order automatically.
"letter"].value_counts() df[
letter
+C 3
+A 2
+B 1
+Name: count, dtype: int64
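To check the equivalence claimed above, a quick sketch: sorting the .size() result in descending order reproduces the value_counts() counts (up to the name attached to the resulting Series).

# Same counts as df["letter"].value_counts(), just sorted manually
df.groupby("letter").size().sort_values(ascending=False)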
+These (and other) aggregation functions are so common that pandas
allows for writing shorthand. Instead of explicitly stating the use of .agg
, we can call the function directly on the GroupBy
object.
For example, the following are equivalent:
+elections.groupby("Candidate").agg(mean)
elections.groupby("Candidate").mean()
There are many other methods that pandas
supports. You can check them out on the pandas
documentation.
Another common use for GroupBy
objects is to filter data by group.
groupby.filter
takes an argument func
, where func
is a function that:
DataFrame
object as inputTrue
or False
.groupby.filter
applies func
to each group/sub-DataFrame
:
func
returns True
for a group, then all rows belonging to the group are preserved.func
returns False
for a group, then all rows belonging to that group are filtered out.In other words, sub-DataFrame
s that correspond to True
are returned in the final result, whereas those with a False
value are not. Importantly, groupby.filter
is different from groupby.agg
in that an entire sub-DataFrame
is returned in the final DataFrame
, not just a single row. As a result, groupby.filter
preserves the original indices and the column we grouped on does NOT become the index!
To illustrate how this happens, let’s go back to the elections
dataset. Say we want to identify “tight” election years – that is, we want to find all rows that correspond to election years where all candidates in that year won a similar portion of the total vote. Specifically, let’s find all rows corresponding to a year where no candidate won more than 45% of the total vote.
In other words, we want to:
+%
in that year is less than 45%DataFrame
rows that correspond to these yearsFor each year, we need to find the maximum %
among all rows for that year. If this maximum %
is lower than 45%, we will tell pandas
to keep all rows corresponding to that year.
"Year").filter(lambda sf: sf["%"].max() < 45).head(9) elections.groupby(
+ | Year | +Candidate | +Party | +Popular vote | +Result | +% | +
---|---|---|---|---|---|---|
23 | +1860 | +Abraham Lincoln | +Republican | +1855993 | +win | +39.699408 | +
24 | +1860 | +John Bell | +Constitutional Union | +590901 | +loss | +12.639283 | +
25 | +1860 | +John C. Breckinridge | +Southern Democratic | +848019 | +loss | +18.138998 | +
26 | +1860 | +Stephen A. Douglas | +Northern Democratic | +1380202 | +loss | +29.522311 | +
66 | +1912 | +Eugene V. Debs | +Socialist | +901551 | +loss | +6.004354 | +
67 | +1912 | +Eugene W. Chafin | +Prohibition | +208156 | +loss | +1.386325 | +
68 | +1912 | +Theodore Roosevelt | +Progressive | +4122721 | +loss | +27.457433 | +
69 | +1912 | +William Taft | +Republican | +3486242 | +loss | +23.218466 | +
70 | +1912 | +Woodrow Wilson | +Democratic | +6296284 | +win | +41.933422 | +
What’s going on here? In this example, we’ve defined our filtering function, func
, to be lambda sf: sf["%"].max() < 45
. This filtering function will find the maximum "%"
value among all entries in the grouped sub-DataFrame
, which we call sf
. If the maximum value is less than 45, then the filter function will return True
and all rows in that grouped sub-DataFrame
will appear in the final output DataFrame
.
Examine the DataFrame
above. Notice how, in this preview of the first 9 rows, all entries from the years 1860 and 1912 appear. This means that in 1860 and 1912, no candidate in that year won more than 45% of the total vote.
You may ask: how is the groupby.filter
procedure different to the boolean filtering we’ve seen previously? Boolean filtering considers individual rows when applying a boolean condition. For example, the code elections[elections["%"] < 45]
will check the "%"
value of every single row in elections
; if it is less than 45, then that row will be kept in the output. groupby.filter
, in contrast, applies a boolean condition across all rows in a group. If not all rows in that group satisfy the condition specified by the filter, the entire group will be discarded in the output.
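A short sketch of this contrast, assuming elections is loaded as above (the row_filtered and group_filtered names are introduced here just for illustration):

# Row-level boolean filtering: keeps any individual row with % < 45,
# even if another candidate in the same year won a majority
row_filtered = elections[elections["%"] < 45]

# Group-level filtering: keeps a year only if *every* candidate in that
# year received less than 45% of the vote
group_filtered = elections.groupby("Year").filter(lambda sf: sf["%"].max() < 45)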
lambda
FunctionsWhat if we wish to aggregate our DataFrame
using a non-standard function – for example, a function of our own design? We can do so by combining .agg
with lambda
expressions.
Let’s first consider a puzzle to jog our memory. We will attempt to find the Candidate
from each Party
with the highest %
of votes.
A naive approach may be to group by the Party
column and aggregate by the maximum.
"Party").agg(max).head(10) elections.groupby(
/var/folders/m7/89sj44pj21ddhplt2bn4qjcm0000gr/T/ipykernel_57856/4278286395.py:1: FutureWarning:
+
+The provided callable <built-in function max> is currently using DataFrameGroupBy.max. In a future version of pandas, the provided callable will be used directly. To keep current behavior pass the string "max" instead.
+
++ | Year | +Candidate | +Popular vote | +Result | +% | +
---|---|---|---|---|---|
Party | ++ | + | + | + | + |
American | +1976 | +Thomas J. Anderson | +873053 | +loss | +21.554001 | +
American Independent | +1976 | +Lester Maddox | +9901118 | +loss | +13.571218 | +
Anti-Masonic | +1832 | +William Wirt | +100715 | +loss | +7.821583 | +
Anti-Monopoly | +1884 | +Benjamin Butler | +134294 | +loss | +1.335838 | +
Citizens | +1980 | +Barry Commoner | +233052 | +loss | +0.270182 | +
Communist | +1932 | +William Z. Foster | +103307 | +loss | +0.261069 | +
Constitution | +2016 | +Michael Peroutka | +203091 | +loss | +0.152398 | +
Constitutional Union | +1860 | +John Bell | +590901 | +loss | +12.639283 | +
Democratic | +2020 | +Woodrow Wilson | +81268924 | +win | +61.344703 | +
Democratic-Republican | +1824 | +John Quincy Adams | +151271 | +win | +57.210122 | +
This approach is clearly wrong – the DataFrame
claims that Woodrow Wilson won the presidency in 2020.
Why is this happening? Here, the max
aggregation function is taken over every column independently. Among Democrats, max
is computing:
Year
a Democratic candidate ran for president (2020)Candidate
with the alphabetically “largest” name (“Woodrow Wilson”)Result
with the alphabetically “largest” outcome (“win”)Instead, let’s try a different approach. We will:
+DataFrame
so that rows are in descending order of %
Party
and select the first row of each sub-DataFrame
While it may seem unintuitive, sorting elections
by descending order of %
is extremely helpful. If we then group by Party
, the first row of each GroupBy
object will contain information about the Candidate
with the highest voter %
.
= elections.sort_values("%", ascending=False)
+ elections_sorted_by_percent 5) elections_sorted_by_percent.head(
+ | Year | +Candidate | +Party | +Popular vote | +Result | +% | +
---|---|---|---|---|---|---|
114 | +1964 | +Lyndon Johnson | +Democratic | +43127041 | +win | +61.344703 | +
91 | +1936 | +Franklin Roosevelt | +Democratic | +27752648 | +win | +60.978107 | +
120 | +1972 | +Richard Nixon | +Republican | +47168710 | +win | +60.907806 | +
79 | +1920 | +Warren Harding | +Republican | +16144093 | +win | +60.574501 | +
133 | +1984 | +Ronald Reagan | +Republican | +54455472 | +win | +59.023326 | +
"Party").agg(lambda x : x.iloc[0]).head(10)
+ elections_sorted_by_percent.groupby(
+# Equivalent to the below code
+# elections_sorted_by_percent.groupby("Party").agg('first').head(10)
+ | Year | +Candidate | +Popular vote | +Result | +% | +
---|---|---|---|---|---|
Party | ++ | + | + | + | + |
American | +1856 | +Millard Fillmore | +873053 | +loss | +21.554001 | +
American Independent | +1968 | +George Wallace | +9901118 | +loss | +13.571218 | +
Anti-Masonic | +1832 | +William Wirt | +100715 | +loss | +7.821583 | +
Anti-Monopoly | +1884 | +Benjamin Butler | +134294 | +loss | +1.335838 | +
Citizens | +1980 | +Barry Commoner | +233052 | +loss | +0.270182 | +
Communist | +1932 | +William Z. Foster | +103307 | +loss | +0.261069 | +
Constitution | +2008 | +Chuck Baldwin | +199750 | +loss | +0.152398 | +
Constitutional Union | +1860 | +John Bell | +590901 | +loss | +12.639283 | +
Democratic | +1964 | +Lyndon Johnson | +43127041 | +win | +61.344703 | +
Democratic-Republican | +1824 | +Andrew Jackson | +151271 | +loss | +57.210122 | +
Here’s an illustration of the process:
+Notice how our code correctly determines that Lyndon Johnson from the Democratic Party has the highest voter %
.
More generally, lambda
functions are used to design custom aggregation functions that aren’t pre-defined by Python. The input parameter x
to the lambda
function is a GroupBy
object. Therefore, it should make sense why lambda x : x.iloc[0]
selects the first row in each groupby object.
In fact, there’s a few different ways to approach this problem. Each approach has different tradeoffs in terms of readability, performance, memory consumption, complexity, etc. We’ve given a few examples below.
+Note: Understanding these alternative solutions is not required. They are given to demonstrate the vast number of problem-solving approaches in pandas
.
# Using the idxmax function
best_per_party = elections.loc[elections.groupby('Party')['%'].idxmax()]
best_per_party.head(5)
+ | Year | +Candidate | +Party | +Popular vote | +Result | +% | +
---|---|---|---|---|---|---|
22 | +1856 | +Millard Fillmore | +American | +873053 | +loss | +21.554001 | +
115 | +1968 | +George Wallace | +American Independent | +9901118 | +loss | +13.571218 | +
6 | +1832 | +William Wirt | +Anti-Masonic | +100715 | +loss | +7.821583 | +
38 | +1884 | +Benjamin Butler | +Anti-Monopoly | +134294 | +loss | +1.335838 | +
127 | +1980 | +Barry Commoner | +Citizens | +233052 | +loss | +0.270182 | +
# Using the .drop_duplicates function
best_per_party2 = elections.sort_values('%').drop_duplicates(['Party'], keep='last')
best_per_party2.head(5)
+ | Year | +Candidate | +Party | +Popular vote | +Result | +% | +
---|---|---|---|---|---|---|
148 | +1996 | +John Hagelin | +Natural Law | +113670 | +loss | +0.118219 | +
164 | +2008 | +Chuck Baldwin | +Constitution | +199750 | +loss | +0.152398 | +
110 | +1956 | +T. Coleman Andrews | +States' Rights | +107929 | +loss | +0.174883 | +
147 | +1996 | +Howard Phillips | +Taxpayers | +184656 | +loss | +0.192045 | +
136 | +1988 | +Lenora Fulani | +New Alliance | +217221 | +loss | +0.237804 | +
We know now that .groupby
gives us the ability to group and aggregate data across our DataFrame
. The examples above formed groups using just one column in the DataFrame
. It’s possible to group by multiple columns at once by passing in a list of column names to .groupby
.
Let’s consider the babynames
dataset again. In this problem, we will find the total number of baby names associated with each sex for each year. To do this, we’ll group by both the "Year"
and "Sex"
columns.
babynames.head()
+ | State | +Sex | +Year | +Name | +Count | +First Letter | +
---|---|---|---|---|---|---|
115957 | +CA | +F | +1990 | +Deandrea | +5 | +D | +
101976 | +CA | +F | +1986 | +Deandrea | +6 | +D | +
131029 | +CA | +F | +1994 | +Leandrea | +5 | +L | +
108731 | +CA | +F | +1988 | +Deandrea | +5 | +D | +
308131 | +CA | +M | +1985 | +Deandrea | +6 | +D | +
# Find the total number of baby names associated with each sex for each
# year in the data
babynames.groupby(["Year", "Sex"])[["Count"]].agg(sum).head(6)
/var/folders/m7/89sj44pj21ddhplt2bn4qjcm0000gr/T/ipykernel_57856/3186035650.py:3: FutureWarning:
+
+The provided callable <built-in function sum> is currently using DataFrameGroupBy.sum. In a future version of pandas, the provided callable will be used directly. To keep current behavior pass the string "sum" instead.
+
++ | + | Count | +
---|---|---|
Year | +Sex | ++ |
1910 | +F | +5950 | +
M | +3213 | +|
1911 | +F | +6602 | +
M | +3381 | +|
1912 | +F | +9804 | +
M | +8142 | +
Notice that both "Year"
and "Sex"
serve as the index of the DataFrame
(they are both rendered in bold). We’ve created a multi-index DataFrame
where two different index values, the year and sex, are used to uniquely identify each row.
This isn’t the most intuitive way of representing this data – and, because multi-indexed DataFrames have multiple dimensions in their index, they can often be difficult to use.
+Another strategy to aggregate across two columns is to create a pivot table. You saw these back in Data 8. One set of values is used to create the index of the pivot table; another set is used to define the column names. The values contained in each cell of the table correspond to the aggregated data for each index-column pair.
+Here’s an illustration of the process:
+The best way to understand pivot tables is to see one in action. Let’s return to our original goal of summing the total number of names associated with each combination of year and sex. We’ll call the pandas
.pivot_table
method to create a new table.
# The `pivot_table` method is used to generate a Pandas pivot table
import numpy as np

babynames.pivot_table(
    index="Year",
    columns="Sex",
    values="Count",
    aggfunc=np.sum,
).head(5)
/var/folders/m7/89sj44pj21ddhplt2bn4qjcm0000gr/T/ipykernel_57856/2548053048.py:3: FutureWarning:
+
+The provided callable <function sum at 0x103984160> is currently using DataFrameGroupBy.sum. In a future version of pandas, the provided callable will be used directly. To keep current behavior pass the string "sum" instead.
+
+Sex | +F | +M | +
---|---|---|
Year | ++ | + |
1910 | +5950 | +3213 | +
1911 | +6602 | +3381 | +
1912 | +9804 | +8142 | +
1913 | +11860 | +10234 | +
1914 | +13815 | +13111 | +
Looks a lot better! Now, our DataFrame
is structured with clear index-column combinations. Each entry in the pivot table represents the summed count of names for a given combination of "Year"
and "Sex"
.
Let’s take a closer look at the code implemented above.
+index = "Year"
specifies the column name in the original DataFrame
that should be used as the index of the pivot tablecolumns = "Sex"
specifies the column name in the original DataFrame
that should be used to generate the columns of the pivot tablevalues = "Count"
indicates what values from the original DataFrame
should be used to populate the entry for each index-column combinationaggfunc = np.sum
tells pandas
what function to use when aggregating the data specified by values
. Here, we are summing the name counts for each pair of "Year"
and "Sex"
We can even include multiple values in the index or columns of our pivot tables.
babynames_pivot = babynames.pivot_table(
    index="Year",      # the rows (turned into index)
    columns="Sex",     # the column values
    values=["Count", "Name"],
    aggfunc=max,       # group operation
)
babynames_pivot.head(6)
/var/folders/m7/89sj44pj21ddhplt2bn4qjcm0000gr/T/ipykernel_57856/970182367.py:1: FutureWarning:
+
+The provided callable <built-in function max> is currently using DataFrameGroupBy.max. In a future version of pandas, the provided callable will be used directly. To keep current behavior pass the string "max" instead.
+
++ | Count | +Name | +||
---|---|---|---|---|
Sex | +F | +M | +F | +M | +
Year | ++ | + | + | + |
1910 | +295 | +237 | +Yvonne | +William | +
1911 | +390 | +214 | +Zelma | +Willis | +
1912 | +534 | +501 | +Yvonne | +Woodrow | +
1913 | +584 | +614 | +Zelma | +Yoshio | +
1914 | +773 | +769 | +Zelma | +Yoshio | +
1915 | +998 | +1033 | +Zita | +Yukio | +
Note that each row provides the number of girls and number of boys having that year’s most common name, and also lists the alphabetically largest girl name and boy name. The counts for number of girls/boys in the resulting DataFrame
do not correspond to the names listed. For example, in 1910, the most popular girl name is given to 295 girls, but that name was likely not Yvonne.
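Because two values were passed, babynames_pivot carries a two-level column index; a short sketch of pulling out individual pieces of it:

# Select just the maximum-count table (one column per sex)
babynames_pivot["Count"].head(6)

# Or pull out a single (value, sex) combination as a Series
babynames_pivot[("Count", "F")].head(6)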
When working on data science projects, we’re unlikely to have absolutely all the data we want contained in a single DataFrame
– a real-world data scientist needs to grapple with data coming from multiple sources. If we have access to multiple datasets with related information, we can join two or more tables into a single DataFrame
.
To put this into practice, we’ll revisit the elections
dataset.
elections.head(5)
+ | Year | +Candidate | +Party | +Popular vote | +Result | +% | +
---|---|---|---|---|---|---|
0 | +1824 | +Andrew Jackson | +Democratic-Republican | +151271 | +loss | +57.210122 | +
1 | +1824 | +John Quincy Adams | +Democratic-Republican | +113142 | +win | +42.789878 | +
2 | +1828 | +Andrew Jackson | +Democratic | +642806 | +win | +56.203927 | +
3 | +1828 | +John Quincy Adams | +National Republican | +500897 | +loss | +43.796073 | +
4 | +1832 | +Andrew Jackson | +Democratic | +702735 | +win | +54.574789 | +
Say we want to understand the popularity of the names of each presidential candidate in 2022. To do this, we’ll need the combined data of babynames
and elections
.
We’ll start by creating a new column containing the first name of each presidential candidate. This will help us join each name in elections
to the corresponding name data in babynames
.
# This `str` operation splits each candidate's full name at each
# blank space, then takes just the candidate's first name
elections["First Name"] = elections["Candidate"].str.split().str[0]
elections.head(5)
+ | Year | +Candidate | +Party | +Popular vote | +Result | +% | +First Name | +
---|---|---|---|---|---|---|---|
0 | +1824 | +Andrew Jackson | +Democratic-Republican | +151271 | +loss | +57.210122 | +Andrew | +
1 | +1824 | +John Quincy Adams | +Democratic-Republican | +113142 | +win | +42.789878 | +John | +
2 | +1828 | +Andrew Jackson | +Democratic | +642806 | +win | +56.203927 | +Andrew | +
3 | +1828 | +John Quincy Adams | +National Republican | +500897 | +loss | +43.796073 | +John | +
4 | +1832 | +Andrew Jackson | +Democratic | +702735 | +win | +54.574789 | +Andrew | +
# Here, we'll only consider `babynames` data from 2022
babynames_2022 = babynames[babynames["Year"] == 2022]
babynames_2022.head()
+ | State | +Sex | +Year | +Name | +Count | +First Letter | +
---|---|---|---|---|---|---|
237964 | +CA | +F | +2022 | +Leandra | +10 | +L | +
404916 | +CA | +M | +2022 | +Leandro | +99 | +L | +
405892 | +CA | +M | +2022 | +Andreas | +14 | +A | +
235927 | +CA | +F | +2022 | +Andrea | +322 | +A | +
405695 | +CA | +M | +2022 | +Deandre | +18 | +D | +
Now, we’re ready to join the two tables. pd.merge
is the pandas
method used to join DataFrame
s together.
merged = pd.merge(left=elections, right=babynames_2022,
                  left_on="First Name", right_on="Name")
merged.head()
# Notice that pandas automatically specifies `Year_x` and `Year_y`
# when both merged DataFrames have the same column name to avoid confusion

# Second option
# merged = elections.merge(right=babynames_2022,
#                          left_on="First Name", right_on="Name")
+ | Year_x | +Candidate | +Party | +Popular vote | +Result | +% | +First Name | +State | +Sex | +Year_y | +Name | +Count | +First Letter | +
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | +1824 | +Andrew Jackson | +Democratic-Republican | +151271 | +loss | +57.210122 | +Andrew | +CA | +M | +2022 | +Andrew | +741 | +A | +
1 | +1824 | +John Quincy Adams | +Democratic-Republican | +113142 | +win | +42.789878 | +John | +CA | +M | +2022 | +John | +490 | +J | +
2 | +1828 | +Andrew Jackson | +Democratic | +642806 | +win | +56.203927 | +Andrew | +CA | +M | +2022 | +Andrew | +741 | +A | +
3 | +1828 | +John Quincy Adams | +National Republican | +500897 | +loss | +43.796073 | +John | +CA | +M | +2022 | +John | +490 | +J | +
4 | +1832 | +Andrew Jackson | +Democratic | +702735 | +win | +54.574789 | +Andrew | +CA | +M | +2022 | +Andrew | +741 | +A | +
Let’s take a closer look at the parameters:
+left
and right
parameters are used to specify the DataFrame
s to be joined.left_on
and right_on
parameters are assigned to the string names of the columns to be used when performing the join. These two on
parameters tell pandas
what values should act as pairing keys to determine which rows to merge across the DataFrame
s. We’ll talk more about this idea of a pairing key next lecture.Congratulations! We finally tackled pandas
. Don’t worry if you are still not feeling very comfortable with it—you will have plenty of chances to practice over the next few weeks.
Next, we will get our hands dirty with some real-world datasets and use our pandas
knowledge to conduct some exploratory data analysis.
So far in this course, we’ve focused on supervised learning techniques that create a function to map inputs (features) to labelled outputs. Regression and classification are two main examples, where the output value of regression is quantitative while the output value of classification is categorical.
+Today, we’ll introduce an unsupervised learning technique called PCA. Unlike supervised learning, unsupervised learning is applied to unlabeled data. Because we have features but no labels, we aim to identify patterns in those features.
+Visualization can help us identify clusters or patterns in our dataset, and it can give us an intuition about our data and how to clean it for the model. For this demo, we’ll return to the MPG dataset from Lecture 19 and see how far we can push visualization for multiple features.
+import pandas as pd
+import numpy as np
+import scipy as sp
+import plotly.express as px
+import seaborn as sns
mpg = sns.load_dataset("mpg").dropna()
mpg.head()
+ | mpg | +cylinders | +displacement | +horsepower | +weight | +acceleration | +model_year | +origin | +name | +
---|---|---|---|---|---|---|---|---|---|
0 | +18.0 | +8 | +307.0 | +130.0 | +3504 | +12.0 | +70 | +usa | +chevrolet chevelle malibu | +
1 | +15.0 | +8 | +350.0 | +165.0 | +3693 | +11.5 | +70 | +usa | +buick skylark 320 | +
2 | +18.0 | +8 | +318.0 | +150.0 | +3436 | +11.0 | +70 | +usa | +plymouth satellite | +
3 | +16.0 | +8 | +304.0 | +150.0 | +3433 | +12.0 | +70 | +usa | +amc rebel sst | +
4 | +17.0 | +8 | +302.0 | +140.0 | +3449 | +10.5 | +70 | +usa | +ford torino | +
We can plot one feature as a histogram to see its distribution. Since we only plot one feature, we consider this a 1-dimensional plot.
+="displacement") px.histogram(mpg, x
We can also visualize two features (2-dimensional scatter plot):
+="displacement", y="horsepower") px.scatter(mpg, x
Three features (3-dimensional scatter plot):
+= px.scatter_3d(mpg, x="displacement", y="horsepower", z="weight",
+ fig =800, height=800)
+ width=dict(size=3)) fig.update_traces(marker
We can even push to 4 features using a 3D scatter plot and a colorbar:
+= px.scatter_3d(mpg, x="displacement",
+ fig ="horsepower",
+ y="weight",
+ z="model_year",
+ color=800, height=800,
+ width=.7)
+ opacity=dict(size=5)) fig.update_traces(marker
Visualizing 5 features is also possible if we make the scatter dots unique to the datapoint’s origin.
+= px.scatter_3d(mpg, x="displacement",
+ fig ="horsepower",
+ y="weight",
+ z="model_year",
+ color="mpg",
+ size="origin",
+ symbol=900, height=800,
+ width=.7)
+ opacity# hide color scale legend on the plotly fig
+=False) fig.update_layout(coloraxis_showscale
However, adding more features to our visualization can make our plot look messy and uninformative, and it can also be near impossible if we have a large number of features. The problem is that many datasets come with more than 5 features —— hundreds, even. Is it still possible to visualize all those features?
+Suppose we have a dataset of:
+Let’s “rename” this in terms of linear algebra so that we can be more clear with our wording. Using linear algebra, we can view our matrix as:
The intrinsic dimension of a dataset is the minimum number of dimensions needed to approximately represent the data. In linear algebra terms, it is the dimension of the column space of a matrix, or the number of linearly independent columns in a matrix; this is equivalently called the rank of a matrix.
+In the examples below, Dataset 1 has 2 dimensions because it has 2 linearly independent columns. Similarly, Dataset 2 has 3 dimensions because it has 3 linearly independent columns.
+What about Dataset 4 below?
It may be tempting to say that it has 4 dimensions, but the Weight (lbs) column is actually just a linear transformation of the Weight (kg) column. Thus, no new information is captured, and the matrix of our dataset has a (column) rank of 3! Therefore, despite having 4 columns, we still say that this data is 3-dimensional.
Plotting the weight columns together reveals the key visual intuition. While the two columns visually span a 2D space as a line, the data does not deviate at all from that singular line. This means that one of the weight columns is redundant! Even given the option to cover the whole 2D space, the data below does not. It might as well not have this dimension, which is why we still do not consider the data below to span more than 1 dimension.
+What happens when there are outliers? Below, we’ve added one outlier point to the dataset above, and just that one point is enough to change the rank of the matrix from 1 to 2 dimensions. However, the data is still approximately 1-dimensional.
+Dimensionality reduction is generally an approximation of the original data that’s achieved by projecting the data onto a desired dimension. In the example below, our original datapoints (blue dots) are 2-dimensional. We have a few choices if we want to project them down to 1-dimension: project them onto the \(x\)-axis (left), project them onto the \(y\)-axis (middle), or project them to a line \(mx + b\) (right). The resulting datapoints after the projection is shown in red. Which projection do you think is better? How can we calculate that?
+In general, we want the projection which is the best approximation for the original data (the graph on the right). In other words, we want the projection that captures the most variance of the original data. In the next section, we’ll see how this is calculated.
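To make “capturing variance” concrete, here is a minimal NumPy sketch of my own (the toy data, direction choices, and helper function are invented for illustration and are not part of the lecture demo) that compares how much of a 2D dataset’s total variance survives a projection onto the x-axis, the y-axis, and the best-fit direction:

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=100)
points = np.column_stack([x, 2 * x + rng.normal(scale=0.5, size=100)])  # 2D data close to a line
points = points - points.mean(axis=0)                                   # center the data

def variance_captured(data, direction):
    # variance of the data after projecting onto a unit-length direction
    unit = direction / np.linalg.norm(direction)
    return np.var(data @ unit)

total_variance = np.var(points, axis=0).sum()
print(variance_captured(points, np.array([1.0, 0.0])) / total_variance)  # project onto the x-axis
print(variance_captured(points, np.array([0.0, 1.0])) / total_variance)  # project onto the y-axis

# The best 1D direction (found with SVD, previewed in the sections below) captures the most variance.
_, _, Vt = np.linalg.svd(points, full_matrices=False)
print(variance_captured(points, Vt[0]) / total_variance)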
+One linear technique for dimensionality reduction is matrix decomposition, which is closely tied to matrix multiplication. In this section, we will decompose our data matrix \(X\) into a lower-dimensional matrix \(Z\) that approximately recovers the original data when multiplied by \(W\).
+First, consider the matrix multiplication example below:
+Matrix decomposition (a.k.a matrix factorization) is the opposite of matrix multiplication. Instead of multiplying two matrices, we want to decompose a single matrix into 2 separate matrices. Just like with real numbers, there are infinite ways to decompose a matrix into a product of two matrices. For example, \(9.9\) can be decomposed as \(1.1 * 9\), \(3.3 * 3.3\), \(1 * 9.9\), etc. Additionally, the sizes of the 2 decomposed matrices can vary drastically. In the example below, the first factorization (top) multiplies a \(3x2\) matrix by a \(2x3\) matrix while the second factorization (bottom) multiplies a \(3x3\) matrix by a \(3x3\) matrix; both result in the original matrix on the right.
+We can even expand the \(3x3\) matrices to \(3x4\) and \(4x3\) (shown below as the factorization on top), but this defeats the point of dimensionality reduction since we’re adding more “useless” dimensions. On the flip side, we also can’t reduce the dimension to \(3x1\) and \(1x3\) (shown below as the factorization on the bottom); since the rank of the original matrix is greater than 1, this decomposition will not result in the original matrix.
+
In practice, we often work with datasets containing many features, so we usually want to construct decompositions where the dimensionality is below the rank of the original matrix. While this does not recover the data exactly, we can still provide approximate reconstructions of the matrix.
In the next section, we will discuss a method to automatically and approximately factorize data. This avoids redundant features and makes computation easier because we can train on less data. Since some approximations are better than others, we will also discuss how the method helps us capture a lot of information in a low number of dimensions.
+In PCA, our goal is to transform observations from high-dimensional data down to low dimensions (often 2, as most visualizations are 2D) through linear transformations. In other words, we want to find a linear transformation that creates a low-dimension representation that captures as much of the original data’s total variance as possible.
+We often perform PCA during the Exploratory Data Analysis (EDA) stage of our data science lifecycle when we don’t know what model to use. It helps us with:
+There are two equivalent ways of framing PCA:
To execute the first approach of variance maximization framing (more common), we can find the variances of each attribute with np.var and then keep the \(k\) attributes with the highest variance. However, this approach limits us to work with attributes individually; it cannot resolve collinearity, and we cannot combine features.
The second approach uses PCA to construct principal components with the most variance in the data (even higher than the first approach) using linear combinations of features. We’ll describe the procedure in the next section.
To perform PCA on a matrix:

1. Center the data matrix by subtracting the mean of each attribute column.
2. To find the \(i\)-th principal component \(v_i\): \(v\) is a unit vector that linearly combines the attributes; \(v\) gives a one-dimensional projection of the data; \(v\) is chosen to maximize the variance along the projection onto \(v\); and \(v\) must be orthogonal to all previous principal components.

The \(k\) principal components capture the most variance of any \(k\)-dimensional reduction of the data matrix.
+In practice, however, we don’t carry out the procedures in step 2 because they take too long to compute. Instead, we use singular value decomposition (SVD) to find all principal components efficiently.
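To preview why the SVD shortcut is valid, the following brief sketch (a toy example of my own, not part of the course demo) checks that the eigenvectors of the covariance matrix and the rows of \(V^T\) returned by SVD agree up to sign:

import numpy as np

rng = np.random.default_rng(42)
X = rng.normal(size=(100, 3))            # toy data: 100 rows, 3 features
X = X - X.mean(axis=0)                   # step 1: center each column

# Procedure from step 2: eigenvectors of the covariance matrix, sorted by eigenvalue
cov = X.T @ X / len(X)
eigenvalues, eigenvectors = np.linalg.eigh(cov)
order = np.argsort(eigenvalues)[::-1]
pcs_from_cov = eigenvectors[:, order]

# What we use in practice: SVD of the centered data
U, S, Vt = np.linalg.svd(X, full_matrices=False)

print(np.allclose(np.abs(pcs_from_cov), np.abs(Vt.T)))  # True: same directions up to sign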
In this section, we will derive PCA keeping the following goal in mind: minimize the reconstruction loss for our matrix factorization model. You are not expected to be able to redo this derivation, but understanding it may help with future assignments.
+Given a matrix \(X\) with \(n\) rows and \(d\) columns, our goal is to find its best decomposition such that \[X \approx Z W\] Z has \(n\) rows and \(k\) columns; W has \(k\) rows and \(d\) columns.
+To measure the accuracy of our reconstruction, we define the reconstruction loss below, where \(X_i\) is the row vector of \(X\), and \(Z_i\) is the row vector of \(Z\):
+There are many solutions to the above, so let’s constrain our model such that \(W\) is a row-orthonormal matrix (i.e. \(WW^T=I\)) where the rows of \(W\) are our principal components.
+In our derivation, let’s first work with the case where \(k=1\). Here Z will be an \(n \times 1\) vector and W will be a \(1 \times d\) vector.
\[\begin{aligned}
L(z,w) &= \frac{1}{n}\sum_{i=1}^{n}(X_i - z_{i}w)(X_i - z_{i}w)^T \\
&= \frac{1}{n}\sum_{i=1}^{n}(X_{i}X_{i}^T - 2z_{i}X_{i}w^T + z_{i}^{2}ww^T) & \text{(expand the loss)} \\
&= \frac{1}{n}\sum_{i=1}^{n}(-2z_{i}X_{i}w^T + z_{i}^{2}) & \text{(first term is constant and }ww^T=1\text{ by orthonormality)} \\
\end{aligned}\]
+Now, we can take the derivative with respect to \(Z_i\). \[\begin{aligned} +\frac{\partial{L(Z,W)}}{\partial{z_i}} &= \frac{1}{n}(-2X_{i}w^T + 2z_{i}) \\ +z_i &= X_iw^T & \text{(Setting derivative equal to 0 and solving for }z_i\text{)}\end{aligned}\]
+We can now substitute our solution for \(z_i\) in our loss function:
+\[\begin{aligned} +L(z,w) &= \frac{1}{n}\sum_{i=1}^{n}(-2z_{i}X_{i}w^T + z_{i}^{2}) \\ +L(z=X_iw^T,w) &= \frac{1}{n}\sum_{i=1}^{n}(-2X_iw^TX_{i}w^T + (X_iw^T)^{2}) \\ +&= \frac{1}{n}\sum_{i=1}^{n}(-X_iw^TX_{i}w^T) \\ +&= \frac{1}{n}\sum_{i=1}^{n}(-wX_{i}^TX_{i}w^T) \\ +&= -w\frac{1}{n}\sum_{i=1}^{n}(X_i^TX_{i})w^T \\ +&= -w\Sigma w^T +\end{aligned}\]
Now, we need to minimize our loss with respect to \(w\). Since we have a negative sign, one way we could do this is by making \(w\) really big. However, we also have the orthonormality constraint \(ww^T=1\). To incorporate this constraint into the equation, we can add a Lagrange multiplier, \(\lambda\). Note that Lagrange multipliers are out of scope for Data 100.
+\[ +L(w,\lambda) = -w\Sigma w^T + \lambda(ww^T-1) +\]
Taking the derivative with respect to \(w\), \[\begin{aligned}
\frac{\partial{L(w,\lambda)}}{\partial{w}} &= -2\Sigma w^T + 2\lambda w^T \\
2\Sigma w^T - 2\lambda w^T &= 0 & \text{(setting the derivative equal to 0)} \\
\Sigma w^T &= \lambda w^T \\
\end{aligned}\]
This result implies that \(w\) is a unit eigenvector of the covariance matrix \(\Sigma\) and that \(\lambda\) is its eigenvalue. Since the loss equals \(-w\Sigma w^T = -\lambda\), the loss is minimized by choosing the eigenvector with the largest eigenvalue \(\lambda\).
+This derivation can inductively be used for the next (second) principal component (not shown).
+The final takeaway from this derivation is that the principal components are the eigenvectors with the largest eigenvalues of the covariance matrix. These are the directions of the maximum variance of the data. We can construct the latent factors (the Z matrix) by projecting the centered data X onto the principal component vectors:
+We often work with high-dimensional data that contain many columns/features. Given all these dimensions, this data can be difficult to visualize and model. However, not all the data in this high-dimensional space is useful —— there could be repeated features or outliers that make the data seem more complex than it really is. The most concise representation of high-dimensional data is its intrinsic dimension. Our goal with this lecture is to use dimensionality reduction to find the intrinsic dimension of a high-dimensional dataset. In other words, we want to find a smaller set of new features/columns that approximates the original data well without losing that much information. This is especially useful because this smaller set of features allows us to better visualize the data and do EDA to understand which modeling techniques would fit the data well.
+In order to find the intrinsic dimension of a high-dimensional dataset, we’ll use techniques from linear algebra. Suppose we have a high-dimensional dataset, \(X\), that has \(n\) rows and \(d\) columns. We want to factor (split) \(X\) into two matrices, \(Z\) and \(W\). \(Z\) has \(n\) rows and \(k\) columns; \(W\) has \(k\) rows and \(d\) columns.
+\[ X \approx ZW\]
+We can reframe this problem as a loss function: in other words, if we want \(X\) to roughly equal \(ZW\), their difference should be as small as possible, ideally 0. This difference becomes our loss function, \(L(Z, W)\):
+\[L(Z, W) = \frac{1}{n}\sum_{i=1}^{n}||X_i - Z_iW||^2\]
+Breaking down the variables in this formula:
Using calculus and optimization techniques (take EECS 127 if you’re interested!), we find that this loss is minimized when \[Z = XW^T\] The proof for this is out of scope for Data 100, but for those who are interested, the argument mirrors the in-depth derivation above: substitute \(Z = XW^T\) back into the loss, add the orthonormality constraint with a Lagrange multiplier, and set the derivative with respect to \(W\) equal to 0.

This gives us a very cool result of
+\[\Sigma w^T = \lambda w^T\]
\(\Sigma\) is the covariance matrix of \(X\). The equation above implies that the optimal \(w\) is a unit eigenvector of \(\Sigma\), and that the loss is minimized when we pick the eigenvector with the largest eigenvalue \(\lambda\).
+This tells us that the principal components (rows of \(W\)) are the eigenvectors with the largest eigenvalues of the covariance matrix \(\Sigma\). They represent the directions of maximum variance in the data. We can construct the latent factors, or the \(Z\) matrix, by projecting the centered data \(X\) onto the principal component vectors, \(W^T\).
+But how do we compute the eigenvectors of \(\Sigma\)? Let’s dive into SVD to answer this question.
+Singular value decomposition (SVD) is an important concept in linear algebra. Since this class requires a linear algebra course (MATH 54, MATH 56, or EECS 16A) as a pre/co-requisite, we assume you have taken or are taking a linear algebra course, so we won’t explain SVD in its entirety. In particular, we will go over:
+We will not dive deep into the theory and details of SVD. Instead, we will only cover what is needed for a data science interpretation. If you’d like more information, check out EECS 16B Note 14 or EECS 16B Note 15.
+ + +Singular value decomposition (SVD) describes a matrix \(X\)’s decomposition into three matrices: \[ X = U S V^T \]
+Let’s break down each of these terms one by one.
+NumPy
For this demo, we’ll work with a rectangular dataset containing \(n=100\) rows and \(d=4\) columns.
+import pandas as pd
+import seaborn as sns
+import matplotlib.pyplot as plt
+import numpy as np
+
np.random.seed(23)  # kallisti

plt.rcParams["figure.figsize"] = (4, 4)
plt.rcParams["figure.dpi"] = 150
sns.set()

rectangle = pd.read_csv("data/rectangle_data.csv")
rectangle.head(5)
+ | width | +height | +area | +perimeter | +
---|---|---|---|---|
0 | +8 | +6 | +48 | +28 | +
1 | +2 | +4 | +8 | +12 | +
2 | +1 | +3 | +3 | +8 | +
3 | +9 | +3 | +27 | +24 | +
4 | +9 | +8 | +72 | +34 | +
In NumPy, the SVD decomposition function can be called with np.linalg.svd (documentation). There are multiple versions of SVD; to get the version that we will follow, we need to set the full_matrices parameter to False.
U, S, Vt = np.linalg.svd(rectangle, full_matrices=False)
First, let’s examine U. As we can see, its dimensions are \(n \times d\).
U.shape
(100, 4)
The first 5 rows of U are shown below.

pd.DataFrame(U).head(5)
+ | 0 | +1 | +2 | +3 | +
---|---|---|---|---|
0 | +-0.155151 | +0.064830 | +-0.029935 | +0.934418 | +
1 | +-0.038370 | +-0.089155 | +0.062019 | +-0.299462 | +
2 | +-0.020357 | +-0.081138 | +0.058997 | +0.006852 | +
3 | +-0.101519 | +-0.076203 | +-0.148160 | +-0.011848 | +
4 | +-0.218973 | +0.206423 | +0.007274 | +-0.056580 | +
\(S\) is a little different in NumPy. Since the only useful values in the diagonal matrix \(S\) are the singular values on the diagonal axis, only those values are returned and they are stored in an array.

Our rectangle_data has a rank of \(3\), so we should have 3 non-zero singular values, sorted from largest to smallest.
S
array([3.62932568e+02, 6.29904732e+01, 2.56544651e+01, 2.56364534e-14])
It seems like we have 4 non-zero values instead of 3, but notice that the last value is so small (on the order of \(10^{-14}\)) that it’s practically \(0\). Hence, we can round the values to get 3 singular values.
np.round(S)
array([363., 63., 26., 0.])
To get S in matrix format, we use np.diag.

Sm = np.diag(S)
Sm
array([[3.62932568e+02, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00],
+ [0.00000000e+00, 6.29904732e+01, 0.00000000e+00, 0.00000000e+00],
+ [0.00000000e+00, 0.00000000e+00, 2.56544651e+01, 0.00000000e+00],
+ [0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 2.56364534e-14]])
Finally, we can see that Vt is indeed a \(d \times d\) matrix.
Vt.shape
(4, 4)
+ pd.DataFrame(Vt)
+ | 0 | +1 | +2 | +3 | +
---|---|---|---|---|
0 | +-0.146436 | +-0.129942 | +-8.100201e-01 | +-0.552756 | +
1 | +-0.192736 | +-0.189128 | +5.863482e-01 | +-0.763727 | +
2 | +-0.704957 | +0.709155 | +7.951614e-03 | +0.008396 | +
3 | +-0.666667 | +-0.666667 | +9.775109e-17 | +0.333333 | +
To check that this SVD is a valid decomposition, we can reverse it and see if it matches our original table (it does, yay!).
pd.DataFrame(U @ Sm @ Vt).head(5)
+ | 0 | +1 | +2 | +3 | +
---|---|---|---|---|
0 | +8.0 | +6.0 | +48.0 | +28.0 | +
1 | +2.0 | +4.0 | +8.0 | +12.0 | +
2 | +1.0 | +3.0 | +3.0 | +8.0 | +
3 | +9.0 | +3.0 | +27.0 | +24.0 | +
4 | +9.0 | +8.0 | +72.0 | +34.0 | +
Principal Component Analysis (PCA) and Singular Value Decomposition (SVD) can be easily mixed up, especially when you have to keep track of so many acronyms. Here is a quick summary:
+After centering the original data matrix \(X\) so that each column has a mean of 0, we find its SVD: \[ X = U S V^T \]
Because \(X\) is centered, the covariance matrix of \(X\), \(\Sigma\), is equal to \(X^T X\) up to the constant \(\frac{1}{n}\) factor, which we omit here since it does not change the eigenvectors. Rearranging this equation, we get
+\[ +\begin{align} +\Sigma &= X^T X \\ +&= (U S V^T)^T U S V^T \\ +&= V S^T U^T U S V^T & \text{U is orthonormal, so $U^T U = I$} \\ +&= V S^2 V^T +\end{align} +\]
+Multiplying both sides by \(V\), we get
+\[ +\begin{align} +\Sigma V &= VS^2 V^T V \\ +&= V S^2 +\end{align} +\]
+This shows that the columns of \(V\) are the eigenvectors of the covariance matrix \(\Sigma\) and, therefore, the principal components. Additionally, the squared singular values \(S^2\) are the eigenvalues of \(\Sigma\).
+ +We’ve now shown that the first \(k\) columns of \(V\) (equivalently, the first \(k\) rows of \(V^{T}\)) are the first k principal components of \(X\). We can use them to construct the latent vector representation of \(X\), \(Z\), by projecting \(X\) onto the principal components.
+ +We can then instead compute \(Z\) as follows:
+\[ +\begin{align} +Z &= X V \\ +&= USV^T V \\ +&= U S +\end{align} +\]
+\[Z = XV = US\]
In other words, we can construct \(X\)’s latent vector representation \(Z\) either by projecting \(X\) onto the first \(k\) principal components (\(XV\)) or by multiplying the first \(k\) columns of \(U\) by the corresponding singular values (\(US\)).
+Using \(Z\), we can approximately recover the centered \(X\) matrix by multiplying \(Z\) by \(V^T\): \[ Z V^T = XV V^T = USV^T = X\]
+Note that to recover the original (uncentered) \(X\) matrix, we would also need to add back the mean.
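As a concrete illustration, here is a minimal sketch of my own that reuses the rectangle data and the NumPy/pandas imports from the cells above (the variable names with the _r suffix are mine); it keeps the first \(k\) components, builds \(Z\), projects back, and adds the column means:

k = 2
centered = rectangle - rectangle.mean(axis=0)            # center each column
U_r, S_r, Vt_r = np.linalg.svd(centered, full_matrices=False)

Z = U_r[:, :k] * S_r[:k]                                 # latent representation (equivalently centered @ Vt_r[:k].T)
X_approx = Z @ Vt_r[:k] + rectangle.mean(axis=0).values  # project back, then un-center

pd.DataFrame(X_approx, columns=rectangle.columns).head()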
+ +As we discussed above, when conducting PCA, we first center the data matrix \(X\) and then rotate it such that the direction with the most variation (e.g., the direction that is most spread out) aligns with the x-axis.
+In particular, the elements of each column of \(V\) (row of \(V^{T}\)) rotate the original feature vectors, projecting \(X\) onto the principal components.
+The first column of \(V\) indicates how each feature contributes (e.g. positive, negative, etc.) to principal component 1; it essentially assigns “weights” to each feature.
+Coupled together, this interpretation also allows us to understand that:
+Let’s summarize the steps to obtain Principal Components via SVD:
+Center the data matrix \(X\) by subtracting the mean of each attribute column.
To find the \(k\) principal components:
Let’s now walk through an example where we compute PCA using SVD. In order to get the first \(k\) principal components from an \(n \times d\) matrix \(X\), we first center \(X\) by subtracting the mean of each attribute column; in the code below, we pass axis=0 so that the mean is computed per column. We then take the SVD of the centered data and keep the first \(k\) columns of \(V\).

centered_df = rectangle - np.mean(rectangle, axis=0)
centered_df.head(5)
+ | width | +height | +area | +perimeter | +
---|---|---|---|---|
0 | +2.97 | +1.35 | +24.78 | +8.64 | +
1 | +-3.03 | +-0.65 | +-15.22 | +-7.36 | +
2 | +-4.03 | +-1.65 | +-20.22 | +-11.36 | +
3 | +3.97 | +-1.65 | +3.78 | +4.64 | +
4 | +3.97 | +3.35 | +48.78 | +14.64 | +
U, S, Vt = np.linalg.svd(centered_df, full_matrices=False)
Sm = pd.DataFrame(np.diag(np.round(S, 1)))

two_PCs = Vt.T[:, :2]
pd.DataFrame(two_PCs).head()
+ | 0 | +1 | +
---|---|---|
0 | +-0.098631 | +0.668460 | +
1 | +-0.072956 | +-0.374186 | +
2 | +-0.931226 | +-0.258375 | +
3 | +-0.343173 | +0.588548 | +
We define the total variance of a data matrix as the sum of variances of attributes. The principal components are a low-dimension representation that capture as much of the original data’s total variance as possible. Formally, the \(i\)-th singular value tells us the component score, or how much of the data variance is captured by the \(i\)-th principal component. Assuming the number of datapoints is \(n\):
\[\text{i-th component score} = \frac{(\text{i-th singular value})^2}{n}\]
+Summing up the component scores is equivalent to computing the total variance if we center our data.
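As a quick sanity check (a minimal sketch of my own, reusing centered_df and S from the cells above), the component scores computed from the singular values should sum to the total variance of the centered data:

component_scores = S**2 / len(centered_df)      # one score per principal component
total_variance = centered_df.var(ddof=0).sum()  # population variance of each column, summed
print(component_scores.sum(), total_variance)   # the two sums match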
+Data Centering: PCA has a data-centering step that precedes any singular value decomposition, where, if implemented, the component score is defined as above.
+If you want to dive deeper into PCA, Steve Brunton’s SVD Video Series is a great resource.
+We often plot the first two principal components using a scatter plot, with PC1 on the \(x\)-axis and PC2 on the \(y\)-axis. This is often called a PCA plot.
+If the first two singular values are large and all others are small, then two dimensions are enough to describe most of what distinguishes one observation from another. If not, a PCA plot omits a lot of information.
+PCA plots help us assess similarities between our data points and if there are any clusters in our dataset. In the case study before, for example, we could create the following PCA plot:
+A scree plot shows the variance ratio captured by each principal component, with the largest variance ratio first. They help us visually determine the number of dimensions needed to describe the data reasonably. The singular values that fall in the region of the plot after a large drop-off correspond to principal components that are not needed to describe the data since they explain a relatively low proportion of the total variance of the data. This point where adding more principal components results in diminishing returns is called the “elbow” and is the point just before the line flattens out. Using this “elbow method”, we can see that the elbow is at the second principal component.
+Biplots superimpose the directions onto the plot of PC1 vs. PC2, where vector \(j\) corresponds to the direction for feature \(j\) (e.g., \(v_{1j}, v_{2j}\)). There are several ways to scale biplot vectors —— in this course, we plot the direction itself. For other scalings, which can lead to more interpretable directions/loadings, see SAS biplots.
+Through biplots, we can interpret how features correlate with the principal components shown: positively, negatively, or not much at all.
+The directions of the arrow are (\(v_1\), \(v_2\)), where \(v_1\) and \(v_2\) are how that specific feature column contributes to PC1 and PC2, respectively. \(v_1\) and \(v_2\) are elements of the first and second columns of \(V\), respectively (i.e., the first two rows of \(V^T\)).
+Say we were considering feature 3, and say that was the purple arrow labeled “520” here (pointing bottom right).
+Let’s examine how the House of Representatives (of the 116th Congress, 1st session) voted in the month of September 2019.
+Specifically, we’ll look at the records of Roll call votes. From the U.S. Senate (link): roll call votes occur when a representative or senator votes “yea” or “nay” so that the names of members voting on each side are recorded. A voice vote is a vote in which those in favor or against a measure say “yea” or “nay,” respectively, without the names or tallies of members voting on each side being recorded.
+import pandas as pd
+import seaborn as sns
+import matplotlib.pyplot as plt
+import numpy as np
+import yaml
+from datetime import datetime
+import plotly.express as px
+import plotly.graph_objects as go
+
+
+= pd.read_csv("data/votes.csv")
+ votes = votes.astype({"roll call": str})
+ votes votes.head()
+ | chamber | +session | +roll call | +member | +vote | +
---|---|---|---|---|---|
0 | +House | +1 | +555 | +A000374 | +Not Voting | +
1 | +House | +1 | +555 | +A000370 | +Yes | +
2 | +House | +1 | +555 | +A000055 | +No | +
3 | +House | +1 | +555 | +A000371 | +Yes | +
4 | +House | +1 | +555 | +A000372 | +No | +
Suppose we pivot this table to group each legislator and their voting pattern across every (roll call) vote in this month. We mark 1 if the legislator voted Yes (“yea”), and 0 otherwise (“No”, “nay”, no vote, speaker, etc.).
def was_yes(s):
    return 1 if s.iloc[0] == "Yes" else 0


vote_pivot = votes.pivot_table(
    index="member", columns="roll call", values="vote", aggfunc=was_yes, fill_value=0
)
print(vote_pivot.shape)
vote_pivot.head()
(441, 41)
+roll call | +515 | +516 | +517 | +518 | +519 | +520 | +521 | +522 | +523 | +524 | +... | +546 | +547 | +548 | +549 | +550 | +551 | +552 | +553 | +554 | +555 | +
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
member | ++ | + | + | + | + | + | + | + | + | + | + | + | + | + | + | + | + | + | + | + | + |
A000055 | +1 | +0 | +0 | +0 | +1 | +1 | +0 | +1 | +1 | +1 | +... | +0 | +0 | +1 | +0 | +0 | +1 | +0 | +0 | +1 | +0 | +
A000367 | +0 | +0 | +0 | +0 | +0 | +0 | +0 | +0 | +0 | +0 | +... | +0 | +1 | +1 | +1 | +1 | +0 | +1 | +1 | +0 | +1 | +
A000369 | +1 | +1 | +0 | +0 | +1 | +1 | +0 | +1 | +1 | +1 | +... | +0 | +0 | +1 | +0 | +0 | +1 | +0 | +0 | +1 | +0 | +
A000370 | +1 | +1 | +1 | +1 | +1 | +0 | +1 | +0 | +0 | +0 | +... | +1 | +1 | +1 | +1 | +1 | +0 | +1 | +1 | +1 | +1 | +
A000371 | +1 | +1 | +1 | +1 | +1 | +0 | +1 | +0 | +0 | +0 | +... | +1 | +1 | +1 | +1 | +1 | +0 | +1 | +1 | +1 | +1 | +
5 rows × 41 columns
+Do legislators’ roll call votes show a relationship with their political party?
+While we could consider loading information about the legislator, such as their party, and see how this relates to their voting pattern, it turns out that we can do a lot with PCA to cluster legislators by how they vote. Let’s calculate the principal components using the SVD method.
vote_pivot_centered = vote_pivot - np.mean(vote_pivot, axis=0)
u, s, vt = np.linalg.svd(vote_pivot_centered, full_matrices=False)  # SVD
We can use the singular values in s
to construct a scree plot:
fig = px.line(y=s**2 / sum(s**2), title='Variance Explained', width=700, height=600, markers=True)
fig.update_xaxes(title_text='Principal Component i')
fig.update_yaxes(title_text='Proportion of Variance Explained')
It looks like this graph plateaus after the third principal component, so our “elbow” is at PC3, and most of the variance is captured by just the first three principal components. Let’s use these PCs to visualize the latent vector representation of \(X\)!
# Calculate the latent vector representation (US or XV)
# using the first 3 principal components
vote_2d = pd.DataFrame(index=vote_pivot_centered.index)
vote_2d[["z1", "z2", "z3"]] = (u * s)[:, :3]

# Plot the latent vector representation
fig = px.scatter_3d(vote_2d, x='z1', y='z2', z='z3', title='Vote Data', width=800, height=600)
fig.update_traces(marker=dict(size=5))
Based on the plot above, it looks like there are two clusters of datapoints. What do you think this corresponds to?
+By incorporating member information (source), we can augment our graph with biographic data like each member’s party and gender.
legislators_data = yaml.safe_load(open("data/legislators-2019.yaml"))


def to_date(s):
    return datetime.strptime(s, "%Y-%m-%d")


legs = pd.DataFrame(
    columns=[
        "leg_id",
        "first",
        "last",
        "gender",
        "state",
        "chamber",
        "party",
        "birthday",
    ],
    data=[
        [
            x["id"]["bioguide"],
            x["name"]["first"],
            x["name"]["last"],
            x["bio"]["gender"],
            x["terms"][-1]["state"],
            x["terms"][-1]["type"],
            x["terms"][-1]["party"],
            to_date(x["bio"]["birthday"]),
        ]
        for x in legislators_data
    ],
)
legs["age"] = 2024 - legs["birthday"].dt.year
legs.set_index("leg_id")
legs.sort_index()

vote_2d = vote_2d.join(legs.set_index("leg_id")).dropna()

np.random.seed(42)
vote_2d["z1_jittered"] = vote_2d["z1"] + np.random.normal(0, 0.1, len(vote_2d))
vote_2d["z2_jittered"] = vote_2d["z2"] + np.random.normal(0, 0.1, len(vote_2d))
vote_2d["z3_jittered"] = vote_2d["z3"] + np.random.normal(0, 0.1, len(vote_2d))

px.scatter_3d(vote_2d, x='z1_jittered', y='z2_jittered', z='z3_jittered', color='party', symbol="gender", size='age',
              title='Vote Data', width=800, height=600, size_max=10,
              opacity=0.7,
              color_discrete_map={'Democrat': 'blue', 'Republican': 'red', "Independent": "green"},
              hover_data=['first', 'last', 'state', 'party', 'gender', 'age'])
Using SVD and PCA, we can clearly see a separation between the red dots (Republican) and blue dots (Democrat).
+ +We can also look at \(V^T\) directly to try to gain insight into why each component is as it is.
fig_eig = px.bar(x=vote_pivot_centered.columns, y=vt[0, :])
# extract the trace from the figure
fig_eig.show()
We have the party affiliation labels so we can see if this eigenvector aligns with one of the parties.
party_line_votes = (
    vote_pivot_centered.join(legs.set_index("leg_id")["party"])
    .groupby("party")
    .mean()
    .T.reset_index()
    .rename(columns={"index": "call"})
    .melt("call")
)
fig = px.bar(
    party_line_votes,
    x="call", y="value", facet_row="party", color="party",
    color_discrete_map={'Democrat': 'blue', 'Republican': 'red', "Independent": "green"})
fig.for_each_annotation(lambda a: a.update(text=a.text.split("=")[-1]))

loadings = pd.DataFrame(
    {"pc1": np.sqrt(s[0]) * vt[0, :], "pc2": np.sqrt(s[1]) * vt[1, :]},
    index=vote_pivot_centered.columns,
)

vote_2d["num votes"] = votes[votes["vote"].isin(["Yes", "No"])].groupby("member").size()
vote_2d.dropna(inplace=True)

fig = px.scatter(
    vote_2d,
    x='z1_jittered',
    y='z2_jittered',
    color='party',
    symbol="gender",
    size='num votes',
    title='Biplot',
    width=800,
    height=600,
    size_max=10,
    opacity=0.7,
    color_discrete_map={'Democrat': 'blue', 'Republican': 'red', "Independent": "green"},
    hover_data=['first', 'last', 'state', 'party', 'gender', 'age'])

for (call, pc1, pc2) in loadings.head(20).itertuples():
    fig.add_scatter(x=[0, pc1], y=[0, pc2], name=call,
                    mode='lines+markers', textposition='top right',
                    marker=dict(size=10, symbol="arrow-bar-up", angleref="previous"))
fig
Each roll call from the 116th Congress - 1st Session: https://clerk.house.gov/evs/2019/ROLL_500.asp
+As shown in the demo, the primary goal of PCA is to transform observations from high-dimensional data down to low dimensions through linear transformations.
+In machine learning, PCA is often used as a preprocessing step prior to training a supervised model.
+Let’s explore how PCA is useful for building an image classification model based on the Fashion-MNIST dataset, a dataset containing images of articles of clothing; these images are gray scale with a size of 28 by 28 pixels. The copyright for Fashion-MNIST is held by Zalando SE. Fashion-MNIST is licensed under the MIT license.
+First, we’ll load in the data.
+import requests
+from pathlib import Path
+import time
+import gzip
+import os
+import numpy as np
+import plotly.express as px
+
+def fetch_and_cache(data_url, file, data_dir="data", force=False):
+"""
+ Download and cache a url and return the file object.
+
+ data_url: the web address to download
+ file: the file in which to save the results.
+ data_dir: (default="data") the location to save the data
+ force: if true the file is always re-downloaded
+
+ return: The pathlib.Path object representing the file.
+ """
+
+= Path(data_dir)
+ data_dir =True)
+ data_dir.mkdir(exist_ok= data_dir / Path(file)
+ file_path # If the file already exists and we want to force a download then
+ # delete the file first so that the creation date is correct.
+ if force and file_path.exists():
+
+ file_path.unlink()if force or not file_path.exists():
+ print("Downloading...", end=" ")
+ = requests.get(data_url)
+ resp with file_path.open("wb") as f:
+
+ f.write(resp.content)print("Done!")
+ = time.ctime(file_path.stat().st_mtime)
+ last_modified_time else:
+ = time.ctime(file_path.stat().st_mtime)
+ last_modified_time print("Using cached version that was downloaded (UTC):", last_modified_time)
+ return file_path
+
+
+def head(filename, lines=5):
+"""
+ Returns the first few lines of a file.
+
+ filename: the name of the file to open
+ lines: the number of lines to include
+
+ return: A list of the first few lines from the file.
+ """
+from itertools import islice
+
+with open(filename, "r") as f:
+ return list(islice(f, lines))
+
+
+def load_data():
+"""
+ Loads the Fashion-MNIST dataset.
+
+ This is a dataset of 60,000 28x28 grayscale images of 10 fashion categories,
+ along with a test set of 10,000 images. This dataset can be used as
+ a drop-in replacement for MNIST.
+
+ The classes are:
+
+ | Label | Description |
+ |:-----:|-------------|
+ | 0 | T-shirt/top |
+ | 1 | Trouser |
+ | 2 | Pullover |
+ | 3 | Dress |
+ | 4 | Coat |
+ | 5 | Sandal |
+ | 6 | Shirt |
+ | 7 | Sneaker |
+ | 8 | Bag |
+ | 9 | Ankle boot |
+
+ Returns:
+ Tuple of NumPy arrays: `(x_train, y_train), (x_test, y_test)`.
+
+ **x_train**: uint8 NumPy array of grayscale image data with shapes
+ `(60000, 28, 28)`, containing the training data.
+
+ **y_train**: uint8 NumPy array of labels (integers in range 0-9)
+ with shape `(60000,)` for the training data.
+
+ **x_test**: uint8 NumPy array of grayscale image data with shapes
+ (10000, 28, 28), containing the test data.
+
+ **y_test**: uint8 NumPy array of labels (integers in range 0-9)
+ with shape `(10000,)` for the test data.
+
+ Example:
+
+ (x_train, y_train), (x_test, y_test) = fashion_mnist.load_data()
+ assert x_train.shape == (60000, 28, 28)
+ assert x_test.shape == (10000, 28, 28)
+ assert y_train.shape == (60000,)
+ assert y_test.shape == (10000,)
+
+ License:
+ The copyright for Fashion-MNIST is held by Zalando SE.
+ Fashion-MNIST is licensed under the [MIT license](
+ https://github.com/zalandoresearch/fashion-mnist/blob/master/LICENSE).
+
+ """
+= os.path.join("datasets", "fashion-mnist")
+ dirname = "https://storage.googleapis.com/tensorflow/tf-keras-datasets/"
+ base = [
+ files "train-labels-idx1-ubyte.gz",
+ "train-images-idx3-ubyte.gz",
+ "t10k-labels-idx1-ubyte.gz",
+ "t10k-images-idx3-ubyte.gz",
+
+ ]
+= []
+ paths for fname in files:
+ + fname, fname))
+ paths.append(fetch_and_cache(base # paths.append(get_file(fname, origin=base + fname, cache_subdir=dirname))
+
+with gzip.open(paths[0], "rb") as lbpath:
+ = np.frombuffer(lbpath.read(), np.uint8, offset=8)
+ y_train
+with gzip.open(paths[1], "rb") as imgpath:
+ = np.frombuffer(imgpath.read(), np.uint8, offset=16).reshape(
+ x_train len(y_train), 28, 28
+
+ )
+with gzip.open(paths[2], "rb") as lbpath:
+ = np.frombuffer(lbpath.read(), np.uint8, offset=8)
+ y_test
+with gzip.open(paths[3], "rb") as imgpath:
+ = np.frombuffer(imgpath.read(), np.uint8, offset=16).reshape(
+ x_test len(y_test), 28, 28
+
+ )
+return (x_train, y_train), (x_test, y_test)
class_names = [
    "T-shirt/top",
    "Trouser",
    "Pullover",
    "Dress",
    "Coat",
    "Sandal",
    "Shirt",
    "Sneaker",
    "Bag",
    "Ankle boot",
]
class_dict = {i: class_name for i, class_name in enumerate(class_names)}

(train_images, train_labels), (test_images, test_labels) = load_data()
print("Training images", train_images.shape)
print("Test images", test_images.shape)

rng = np.random.default_rng(42)
n = 5000
sample_idx = rng.choice(np.arange(len(train_images)), size=n, replace=False)

# Invert and normalize the images so they look better
img_mat = -1 * train_images[sample_idx].astype(np.int16)
img_mat = (img_mat - img_mat.min()) / (img_mat.max() - img_mat.min())

images = pd.DataFrame(
    {
        "images": img_mat.tolist(),
        "labels": train_labels[sample_idx],
        "class": [class_dict[x] for x in train_labels[sample_idx]],
    }
)
Using cached version that was downloaded (UTC): Tue Aug 27 03:33:08 2024
+Using cached version that was downloaded (UTC): Tue Aug 27 03:33:08 2024
+Using cached version that was downloaded (UTC): Tue Aug 27 03:33:08 2024
+Using cached version that was downloaded (UTC): Tue Aug 27 03:33:08 2024
+Training images (60000, 28, 28)
+Test images (10000, 28, 28)
+Let’s see what some of the images contained in this dataset look like.
def show_images(images, ncols=5, max_images=30):
    # convert the subset of images into a n,28,28 matrix for facet visualization
    img_mat = np.array(images.head(max_images)["images"].to_list())
    fig = px.imshow(
        img_mat,
        color_continuous_scale="gray",
        facet_col=0,
        facet_col_wrap=ncols,
        height=220 * int(np.ceil(len(images) / ncols)),
    )
    fig.update_layout(coloraxis_showscale=False)
    # Extract the facet number and convert it back to the class label.
    fig.for_each_annotation(
        lambda a: a.update(text=images.iloc[int(a.text.split("=")[-1])]["class"])
    )
    return fig


fig = show_images(images.groupby("class", as_index=False).sample(2), ncols=6)
fig.show()
Let’s break this down further and look at it by class, or the category of clothing:
print(class_dict)

show_images(images.groupby('class', as_index=False).sample(2), ncols=6)
{0: 'T-shirt/top', 1: 'Trouser', 2: 'Pullover', 3: 'Dress', 4: 'Coat', 5: 'Sandal', 6: 'Shirt', 7: 'Sneaker', 8: 'Bag', 9: 'Ankle boot'}
As we can see, each 28x28 pixel image is labelled by the category of clothing it belongs to. We humans can very easily look at these images and identify the type of clothing being displayed, even if the image is a little blurry. However, this task is less intuitive for machine learning models. To illustrate this, let’s take a small sample of the training data to see how the images above are represented in their raw format:
+ images.head()
+ | images | +labels | +class | +
---|---|---|---|
0 | +[[1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0,... | +3 | +Dress | +
1 | +[[1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0,... | +4 | +Coat | +
2 | +[[1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0,... | +0 | +T-shirt/top | +
3 | +[[1.0, 1.0, 1.0, 1.0, 1.0, 0.996078431372549, ... | +2 | +Pullover | +
4 | +[[1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0,... | +1 | +Trouser | +
Each row represents one image. Every image belongs to a "class" of clothing with its enumerated "label". In place of a typically displayed image, the raw data contains a 28x28 2D array of pixel values; each pixel value is a float between 0 and 1. If we just focus on the images, we get a 3D matrix. You can think of this as a matrix containing 2D images.
= np.array(images["images"].to_list())
X.shape
(5000, 28, 28)
However, we’re not used to working with 3D matrices for our training data X. Typical training data expects a vector of features for each datapoint, not a matrix per datapoint. We can reshape our 3D matrix so that it fits our typical training data by “unrolling” the 28x28 pixels into a single row vector containing 28*28 = 784 dimensions.
X = X.reshape(X.shape[0], -1)
X.shape
(5000, 784)
What we have now is 5000 datapoints that each have 784 features. That’s a lot of features! Not only would training a model on this data take a very long time, it’s also very likely that many of these features are redundant, meaning the columns of our matrix are unlikely to all be linearly independent. PCA is a very good strategy to use in situations like these when there are lots of features, but we want to remove redundant information.
+sklearn
To perform PCA, let’s begin by centering our data.
X = X - X.mean(axis=0)
We can run PCA using sklearn’s PCA package.
from sklearn.decomposition import PCA
+
n_comps = 50
pca = PCA(n_components=n_comps)
pca.fit(X)
PCA(n_components=50)
Now that sklearn helped us find the principal components, let’s visualize a scree plot.
# Make a line plot and show markers
fig = px.line(y=pca.explained_variance_ratio_ * 100, markers=True)
fig.show()
We can see that the line starts flattening out around 2 or 3, which suggests that most of the data is explained by just the first two or three dimensions. To illustrate this, let’s plot the first three principal components and the datapoints’ corresponding classes. Can you identify any patterns?
images[['z1', 'z2', 'z3']] = pca.transform(X)[:, :3]
fig = px.scatter_3d(images, x='z1', y='z2', z='z3', color='class', hover_data=['labels'],
                    width=1000, height=800)
# set marker size to 5
fig.update_traces(marker=dict(size=5))
As we saw in the demos, we often perform PCA during the Exploratory Data Analysis (EDA) stage of our data science lifecycle (if we already know what to model, we probably don’t need PCA!). It helps us with:
+PCA is commonly used in biomedical contexts, which have many named variables! It can be used to:
+ +Suppose we know the child mortality rate of a given country. Linear regression tries to predict the fertility rate from the mortality rate; for example, if the mortality is 6, we might guess the fertility is near 4. The regression line tells us the “best” prediction of fertility given all possible mortality values by minimizing the root mean squared error. See the vertical red lines (note that only some are shown).
+
We can also perform a regression in the reverse direction. That is, given fertility, we try to predict mortality. In this case, we get a different regression line that minimizes the root mean squared length of the horizontal lines.
The rank-1 approximation is close but not the same as the mortality regression line. Instead of minimizing horizontal or vertical error, our rank-1 approximation minimizes the error perpendicular to the subspace onto which we’re projecting. That is, SVD finds the line such that if we project our data onto that line, the error between the projection and our original data is minimized. The similarity of the rank-1 approximation and the fertility was just a coincidence. Looking at adiposity and bicep size from our body measurements dataset, we see the 1D subspace onto which we are projecting is between the two regression lines.
+Even in higher dimensions, the idea behind principal components is the same! Suppose we have 30-dimensional data and decide to use the first 5 principal components. Our procedure minimizes the error between the original 30-dimensional data and the projection of that 30-dimensional data onto the “best” 5-dimensional subspace. See CS 189 Note 10 for more details.
+One key fact to remember is that the decomposition is not arbitrary. The rank of a matrix limits how small our inner dimensions can be if we want to perfectly recreate our matrix. The proof for this is out of scope.
+Even if we know we have to factorize our matrix using an inner dimension of \(R\), that still leaves a large space of solutions to traverse. What if we have a procedure to automatically factorize a rank \(R\) matrix into an \(R\)-dimensional representation with some transformation matrix?
+What if we wanted a 2D representation? It’s valuable to compress all of the data that is relevant into as few dimensions as possible in order to plot it efficiently. Some 2D matrices yield better approximations than others. How well can we do?
+The proof defining component score is out of scope for this class, but it is included below for your convenience.
+Setup: Consider the design matrix \(X \in \mathbb{R}^{n \times d}\), where the \(j\)-th column (corresponding to the \(j\)-th feature) is \(x_j \in \mathbb{R}^n\) and the element in row \(i\), column \(j\) is \(x_{ij}\). Further, define \(\tilde{X}\) as the centered design matrix. The \(j\)-th column is \(\tilde{x}_j \in \mathbb{R}^n\) and the element in row \(i\), column \(j\) is \(\tilde{x}_{ij} = x_{ij} - \bar{x_j}\), where \(\bar{x_j}\) is the mean of the \(x_j\) column vector from the original \(X\).
+Variance: Construct the covariance matrix: \(\frac{1}{n} \tilde{X}^T \tilde{X} \in \mathbb{R}^{d \times d}\). The \(j\)-th element along the diagonal is the variance of the \(j\)-th column of the original design matrix \(X\):
\[\left( \frac{1}{n} \tilde{X}^T \tilde{X} \right)_{jj} = \frac{1}{n} \tilde{x}_j ^T \tilde{x}_j = \frac{1}{n} \sum_{i=1}^n (\tilde{x}_{ij} )^2 = \frac{1}{n} \sum_{i=1}^n (x_{ij} - \bar{x_j})^2\]
+SVD: Suppose singular value decomposition of the centered design matrix \(\tilde{X}\) yields \(\tilde{X} = U S V^T\), where \(U \in \mathbb{R}^{n \times d}\) and \(V \in \mathbb{R}^{d \times d}\) are matrices with orthonormal columns, and \(S \in \mathbb{R}^{d \times d}\) is a diagonal matrix with singular values of \(\tilde{X}\).
+\[ +\begin{aligned} +\tilde{X}^T \tilde{X} &= (U S V^T )^T (U S V^T) \\ +&= V S U^T U S V^T & (S^T = S) \\ +&= V S^2 V^T & (U^T U = I) \\ +\frac{1}{n} \tilde{X}^T \tilde{X} &= \frac{1}{n} V S V^T =V \left( \frac{1}{n} S \right) V^T \\ +\frac{1}{n} \tilde{X}^T \tilde{X} V &= V \left( \frac{1}{n} S \right) V^T V = V \left( \frac{1}{n} S \right) & \text{(right multiply by }V \rightarrow V^T V = I \text{)} \\ +V^T \frac{1}{n} \tilde{X}^T \tilde{X} V &= V^T V \left( \frac{1}{n} S \right) = \frac{1}{n} S & \text{(left multiply by }V^T \rightarrow V^T V = I \text{)} \\ +\left( \frac{1}{n} \tilde{X}^T \tilde{X} \right)_{jj} &= \frac{1}{n}S_j^2 & \text{(Define }S_j\text{ as the} j\text{-th singular value)} \\ +\frac{1}{n} S_j^2 &= \frac{1}{n} \sum_{i=i}^n (x_{ij} - \bar{x_j})^2 +\end{aligned} +\]
+The last line defines the \(j\)-th component score.
+ + +In the past few lectures, we’ve examined the role of complexity in influencing model performance. We’ve considered model complexity in the context of a tradeoff between two competing factors: model variance and training error.
+So far, our analysis has been mostly qualitative. We’ve acknowledged that our choice of model complexity needs to strike a balance between model variance and training error, but we haven’t yet discussed why exactly this tradeoff exists.
To better understand the origin of this tradeoff, we will need to dive into random variables. The next two course notes on probability will be a brief digression from our work on modeling so we can build up the concepts needed to understand this so-called bias-variance tradeoff. Specifically, we will cover:
+We’ll go over just enough probability to help you understand its implications for modeling, but if you want to go a step further, take Data 140, CS 70, and/or EECS 126.
+ +In Data 100, we want to understand the broader relationship between the following:
+Suppose we generate a set of random data, like a random sample from some population. A random variable is a function from the outcome of a random event to a number.
+It is random since our sample was drawn at random; it is variable because its exact value depends on how this random sample came out. As such, the domain or input of our random variable is all possible outcomes for some random event in a sample space, and its range or output is the real number line. We typically denote random variables with uppercase letters, such as \(X\) or \(Y\). In contrast, note that regular variables tend to be denoted using lowercase letters. Sometimes we also use uppercase letters to refer to matrices (such as your design matrix \(\mathbb{X}\)), but we will do our best to be clear with the notation.
+To motivate what this (rather abstract) definition means, let’s consider the following examples:
+Let’s formally define a fair coin toss. A fair coin can land on heads (\(H\)) or tails (\(T\)), each with a probability of 0.5. With these possible outcomes, we can define a random variable \(X\) as: \[X = \begin{cases} + 1, \text{if the coin lands heads} \\ + 0, \text{if the coin lands tails} + \end{cases}\]
+\(X\) is a function with a domain, or input, of \(\{H, T\}\) and a range, or output, of \(\{1, 0\}\). In practice, while we don’t use the following function notation, you could write the above as \[X = \begin{cases} X(H) = 1 \\ X(T) = 0 \end{cases}\]
+Suppose we draw a random sample \(s\) of size 3 from all students enrolled in Data 100.
+We can define \(Y\) as the number of data science students in our sample. Its domain is all possible samples of size 3, and its range is \(\{0, 1, 2, 3\}\).
+
+
+
Note that we can use random variables in mathematical expressions to create new random variables.
+For example, let’s say we sample 3 students at random from lecture and look at their midterm scores. Let \(X_1\), \(X_2\), and \(X_3\) represent each student’s midterm grade.
+We can use these random variables to create a new random variable, \(Y\), which represents the average of the 3 scores: \(Y = (X_1 + X_2 + X_3)/3\).
+As we’re creating this random variable, a few questions arise:
+But, what exactly is a distribution? Let’s dive into this!
+To define any random variable \(X\), we need to be able to specify 2 things:
+If \(X\) is discrete (has a finite number of possible values), the probability that a random variable \(X\) takes on the value \(x\) is given by \(P(X=x)\), and probabilities must sum to 1: \(\sum_{\text{all } x} P(X=x) = 1\),
+We can often display this using a probability distribution table. In the coin toss example, the probability distribution table of \(X\) is given by.
+\(x\) | +\(P(X=x)\) | +
---|---|
0 | +\(\frac{1}{2}\) | +
1 | +\(\frac{1}{2}\) | +
The distribution of a random variable \(X\) describes how the total probability of 100% is split across all the possible values of \(X\), and it fully defines a random variable. If you know the distribution of a random variable you can:
+np.random.choice
, df.sample
, or scipy.stats.<dist>.rvs(...)
The distribution of a discrete random variable can also be represented using a histogram. If a variable is continuous, meaning it can take on infinitely many values, we can illustrate its distribution using a density curve.
+
+
+
We often don’t know the (true) distribution and instead compute an empirical distribution. If you flip a coin 3 times and get {H, H, T}, you may ask —— what is the probability that the coin will land heads? We can come up with an empirical estimate of \(\frac{2}{3}\), though the true probability might be \(\frac{1}{2}\).
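A tiny simulation sketch (my own, assuming only NumPy) makes this concrete: with a handful of flips the empirical distribution can be far from the true one, but it settles down as the number of flips grows.

import numpy as np

rng = np.random.default_rng(100)
for n_flips in [3, 100, 10_000]:
    flips = rng.choice(["H", "T"], size=n_flips)   # simulate n_flips tosses of a fair coin
    print(n_flips, (flips == "H").mean())          # empirical P(H); the true value is 0.5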
Probabilities are areas. For discrete random variables, the area of the red bars represents the probability that a discrete random variable \(X\) falls within those values. For continuous random variables, the area under the curve represents the probability that a continuous random variable \(Y\) falls within those values.
+
+
+
If we sum up the total area of the bars/under the density curve, we should get 100%, or 1.
+We can show the distribution of \(Y\) in the following tables. The table on the left lists all possible samples of \(s\) and the number of times they can appear (\(Y(s)\)). We can use this to calculate the values for the table on the right, a probability distribution table.
+
+
+
Rather than fully write out a probability distribution or show a histogram, there are some common distributions that come up frequently when doing data science. These distributions are specified by some parameters, which are constants that specify the shape of the distribution. In terms of notation, the ‘~’ means “has the probability distribution of”.
+These common distributions are listed below:
+There are several ways to describe a random variable. The methods shown above —— a table of all samples \(s, X(s)\), distribution table \(P(X=x)\), and histograms —— are all definitions that fully describe a random variable. Often, it is easier to describe a random variable using some numerical summary rather than fully defining its distribution. These numerical summaries are numbers that characterize some properties of the random variable. Because they give a “summary” of how the variable tends to behave, they are not random. Instead, think of them as a static number that describes a certain property of the random variable. In Data 100, we will focus our attention on the expectation and variance of a random variable.
The expectation of a random variable \(X\) is the weighted average of the values of \(X\), where the weights are the probabilities of each value occurring. There are two equivalent ways to compute the expectation: apply the weights one sample at a time, \(\mathbb{E}[X] = \sum_{\text{all possible } s} X(s) P(s)\), or apply the weights one possible value at a time, \(\mathbb{E}[X] = \sum_{x} x P(X=x)\).
+The latter is more commonly used as we are usually just given the distribution, not all possible samples.
+We want to emphasize that the expectation is a number, not a random variable. Expectation is a generalization of the average, and it has the same units as the random variable. It is also the center of gravity of the probability distribution histogram, meaning if we simulate the variable many times, it is the long-run average of the simulated values.
+Going back to our coin toss example, we define a random variable \(X\) as: \[X = \begin{cases} + 1, \text{if the coin lands heads} \\ + 0, \text{if the coin lands tails} + \end{cases}\]
+We can calculate its expectation \(\mathbb{E}[X]\) using the second method of applying the weights one possible value at a time: \[\begin{align} +\mathbb{E}[X] &= \sum_{x} x P(X=x) \\ +&= 1 * 0.5 + 0 * 0.5 \\ +&= 0.5 +\end{align}\]
+Note that \(\mathbb{E}[X] = 0.5\) is not a possible value of \(X\); it’s an average. The expectation of X does not need to be a possible value of X.
+Consider the random variable \(X\):
+\(x\) | +\(P(X=x)\) | +
---|---|
3 | +0.1 | +
4 | +0.2 | +
6 | +0.4 | +
8 | +0.3 | +
To calculate it’s expectation, \[\begin{align} +\mathbb{E}[X] &= \sum_{x} x P(X=x) \\ +&= 3 * 0.1 + 4 * 0.2 + 6 * 0.4 + 8 * 0.3 \\ +&= 0.3 + 0.8 + 2.4 + 2.4 \\ +&= 5.9 +\end{align}\]
+Again, note that \(\mathbb{E}[X] = 5.9\) is not a possible value of \(X\); it’s an average. The expectation of X does not need to be a possible value of X.
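We can also check this kind of calculation by simulation (a minimal sketch of my own, assuming only NumPy): the long-run average of simulated draws approaches the expectation.

import numpy as np

rng = np.random.default_rng(0)
values, probs = [3, 4, 6, 8], [0.1, 0.2, 0.4, 0.3]
draws = rng.choice(values, p=probs, size=100_000)
print(draws.mean())                                # close to 5.9
print(sum(v * p for v, p in zip(values, probs)))   # exactly 5.9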
+The variance of a random variable is a measure of its chance error. It is defined as the expected squared deviation from the expectation of \(X\). Put more simply, variance asks: how far does \(X\) typically vary from its average value, just by chance? What is the spread of \(X\)’s distribution?
+\[\text{Var}(X) = \mathbb{E}[(X-\mathbb{E}[X])^2]\]
+The units of variance are the square of the units of \(X\). To get it back to the right scale, use the standard deviation of \(X\): \[\text{SD}(X) = \sqrt{\text{Var}(X)}\]
+Like with expectation, variance and standard deviation are numbers, not random variables! Variance helps us describe the variability of a random variable. It is the expected squared error between the random variable and its expected value. As you will see shortly, we can use variance to help us quantify the chance error that arises when using a sample \(X\) to estimate the population mean.
+By Chebyshev’s inequality, which you saw in Data 8, no matter what the shape of the distribution of \(X\) is, the vast majority of the probability lies in the interval “expectation plus or minus a few SDs.”
+If we expand the square and use properties of expectation, we can re-express variance as the computational formula for variance.
+\[\text{Var}(X) = \mathbb{E}[X^2] - (\mathbb{E}[X])^2\]
This form is often more convenient to use when computing the variance of a variable by hand, and it is also useful in Mean Squared Error calculations, since \(\mathbb{E}[X^2] = \text{Var}(X)\) when \(X\) is centered, that is, when \(\mathbb{E}[X]=0\).
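For completeness, the re-expression follows from expanding the square and using the fact that \(\mathbb{E}[X]\) is a constant together with the linearity of expectation (introduced below):

\[\begin{align}
\text{Var}(X) &= \mathbb{E}[(X-\mathbb{E}[X])^2] \\
&= \mathbb{E}[X^2 - 2X\mathbb{E}[X] + (\mathbb{E}[X])^2] \\
&= \mathbb{E}[X^2] - 2\mathbb{E}[X]\mathbb{E}[X] + (\mathbb{E}[X])^2 \\
&= \mathbb{E}[X^2] - (\mathbb{E}[X])^2
\end{align}\]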
+ +How do we compute \(\mathbb{E}[X^2]\)? Any function of a random variable is also a random variable. That means that by squaring \(X\), we’ve created a new random variable. To compute \(\mathbb{E}[X^2]\), we can simply apply our definition of expectation to the random variable \(X^2\).
+\[\mathbb{E}[X^2] = \sum_{x} x^2 P(X = x)\]
Let \(X\) be the outcome of a single fair die roll. \(X\) is a random variable with distribution \[P(X=x) = \begin{cases}
    \frac{1}{6}, \text{if } x \in \{1,2,3,4,5,6\} \\
    0, \text{otherwise}
  \end{cases}\]
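Working through the die example with the formulas above:

\[\begin{align}
\mathbb{E}[X] &= \sum_{x} x P(X=x) = \frac{1+2+3+4+5+6}{6} = 3.5 \\
\mathbb{E}[X^2] &= \sum_{x} x^2 P(X=x) = \frac{1+4+9+16+25+36}{6} = \frac{91}{6} \\
\text{Var}(X) &= \mathbb{E}[X^2] - (\mathbb{E}[X])^2 = \frac{91}{6} - (3.5)^2 = \frac{35}{12} \approx 2.92
\end{align}\]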
+ + +We can summarize our discussion so far in the following diagram:
+
+
+
Often, we will work with multiple random variables at the same time. A function of a random variable is also a random variable. If you create multiple random variables based on your sample, then functions of those random variables are also random variables.
+For example, if \(X_1, X_2, ..., X_n\) are random variables, then so are functions of them such as the sum \(\sum_{i=1}^n X_i\), the sample mean \(\frac{1}{n}\sum_{i=1}^n X_i\), and the maximum \(\max(X_1, \ldots, X_n)\).
+Many functions of random variables that we are interested in (e.g., counts, means) involve sums of random variables, so let’s dive deeper into the properties of sums of random variables.
+Instead of simulating full distributions, we often just compute expectation and variance directly. Recall the definition of expectation: \[\mathbb{E}[X] = \sum_{x} x P(X=x)\]
+From it, we can derive some useful properties:
+\[\mathbb{E}[aX+b] = a\mathbb{E}[X] + b\]
+ +\[\mathbb{E}[X+Y] = \mathbb{E}[X] + \mathbb{E}[Y]\]
+ +Let’s now get into the properties of variance. Recall the definition of variance: \[\text{Var}(X) = \mathbb{E}[(X-\mathbb{E}[X])^2]\]
+Combining it with the properties of expectation, we can derive some useful properties, such as \(\text{Var}(aX+b) = a^2 \text{Var}(X)\) and \(\text{Var}(X+Y) = \text{Var}(X) + \text{Var}(Y) + 2\text{Cov}(X, Y)\):
+
+
+
We define the covariance of two random variables as the expected product of their deviations from their expectations. Put more simply, covariance generalizes variance to a pair of random variables; in fact, the covariance of a random variable with itself is just its variance:
+\[\text{Cov}(X, X) = \mathbb{E}[(X - \mathbb{E}[X])^2] = \text{Var}(X)\]
+\[\text{Cov}(X, Y) = \mathbb{E}[(X - \mathbb{E}[X])(Y - \mathbb{E}[Y])]\]
+We can treat the covariance as a measure of association. Remember the definition of correlation given when we first established SLR?
+\[r(X, Y) = \mathbb{E}\left[\left(\frac{X-\mathbb{E}[X]}{\text{SD}(X)}\right)\left(\frac{Y-\mathbb{E}[Y]}{\text{SD}(Y)}\right)\right] = \frac{\text{Cov}(X, Y)}{\text{SD}(X)\text{SD}(Y)}\]
+It turns out we’ve been quietly using covariance for some time now! If \(X\) and \(Y\) are independent, then \(\text{Cov}(X, Y) =0\) and \(r(X, Y) = 0\). Note, however, that the converse is not always true: \(X\) and \(Y\) could have \(\text{Cov}(X, Y) = r(X, Y) = 0\) but not be independent.
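As an illustration of the relationship between covariance and correlation, here is a small simulation sketch (the data-generating choices below are arbitrary, just for demonstration):

import numpy as np

rng = np.random.default_rng(42)
x = rng.normal(size=100_000)
y = 0.5 * x + rng.normal(size=100_000)   # y depends on x, so Cov(x, y) > 0

cov_xy = np.mean((x - x.mean()) * (y - y.mean()))
r_xy = cov_xy / (x.std() * y.std())

# r_xy should match numpy's built-in correlation estimate
cov_xy, r_xy, np.corrcoef(x, y)[0, 1]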
+Suppose that we have two random variables \(X\) and \(Y\). We say they are independent and identically distributed (i.i.d.) if they are independent of each other and share the same distribution.
+Note that in Data 100, you’ll never be expected to prove that random variables are i.i.d.
+Now let’s walk through an example. Let \(X_1\) and \(X_2\) be the numbers on two rolls of a fair die. \(X_1\) and \(X_2\) are i.i.d., so they have the same distribution. However, the sums \(Y = X_1 + X_1 = 2X_1\) and \(Z = X_1 + X_2\) have different distributions, even though they have the same expectation.
+
+
+
However, \(Y = 2X_1\) has a larger variance: \(\text{Var}(Y) = 4\text{Var}(X_1)\), while \(\text{Var}(Z) = 2\text{Var}(X_1)\) since \(X_1\) and \(X_2\) are independent.
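A quick simulation sketch makes the difference concrete (die rolls generated with numpy):

import numpy as np

rng = np.random.default_rng(100)
x1 = rng.integers(1, 7, size=100_000)  # first die
x2 = rng.integers(1, 7, size=100_000)  # second die

y = 2 * x1       # Y = X1 + X1
z = x1 + x2      # Z = X1 + X2

# Same expectation (both close to 7), but Var(Y) = 4 Var(X1) while Var(Z) = 2 Var(X1)
y.mean(), z.mean(), y.var(), z.var()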
+
+
+
To get some practice with the formulas discussed so far, let’s derive the expectation and variance for a Bernoulli(\(p\)) random variable. If \(X\) ~ Bernoulli(\(p\)),
+\(\mathbb{E}[X] = 1 \cdot p + 0 \cdot (1 - p) = p\)
+To compute the variance, we will use the computational formula. We first find that: \(\mathbb{E}[X^2] = 1^2 \cdot p + 0^2 \cdot (1 - p) = p\)
+From there, let’s calculate our variance: \(\text{Var}(X) = \mathbb{E}[X^2] - \mathbb{E}[X]^2 = p - p^2 = p(1-p)\)
+Let \(Y\) ~ Binomial(\(n\), \(p\)). We can think of \(Y\) as being the sum of \(n\) i.i.d. Bernoulli(\(p\)) random variables. Mathematically, this translates to
+\[Y = \sum_{i=1}^n X_i\]
+where \(X_i\) is the indicator of a success on trial \(i\).
+Using linearity of expectation,
+\[\mathbb{E}[Y] = \sum_{i=1}^n \mathbb{E}[X_i] = np\]
+For the variance, since each \(X_i\) is independent of the other, \(\text{Cov}(X_i, X_j) = 0\),
+\[\text{Var}(Y) = \sum_{i=1}^n \text{Var}[X_i] = np(1-p)\]
+Note that \(\text{Cov}(X,Y)\) would equal 0 if \(X\) and \(Y\) are independent.
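We can check these formulas empirically; the sketch below simulates a Binomial(\(n\), \(p\)) variable as a sum of i.i.d. Bernoulli(\(p\)) indicators (the particular \(n\) and \(p\) are arbitrary):

import numpy as np

rng = np.random.default_rng(0)
n, p = 20, 0.3

# each row is one realization of Y = X_1 + ... + X_n
bernoullis = rng.binomial(1, p, size=(100_000, n))
y = bernoullis.sum(axis=1)

print(y.mean(), n * p)             # expectation: close to np = 6.0
print(y.var(), n * p * (1 - p))    # variance: close to np(1-p) = 4.2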
+ + + + +Last time, we introduced the idea of random variables: numerical functions of a sample. Most of our work in the last lecture was done to build a background in probability and statistics. Now that we’ve established some key ideas, we’re in a good place to apply what we’ve learned to our original goal – understanding how the randomness of a sample impacts the model design process.
+In this lecture, we will delve more deeply into the idea of fitting a model to a sample. We’ll explore how to re-express our modeling process in terms of random variables and use this new understanding to steer model complexity.
+There are several cases of random variables that appear often and have useful properties. Below are the ones we will explore further in this course. The numbers in parentheses are the parameters of a random variable, which are constants. Parameters define a random variable’s shape (i.e., distribution) and its values. For this lecture, we’ll focus more heavily on the bolded random variables and their special properties, but you should familiarize yourself with all the ones listed below:
+Suppose you win cash based on the number of heads you get in a series of 20 coin flips. Let \(X_i = 1\) if the \(i\)-th coin is heads, \(0\) otherwise. Which payout strategy would you choose?
+A. \(Y_A = 10 * X_1 + 10 * X_2\)
+B. \(Y_B = \sum_{i=1}^{20} X_i\)
+C. \(Y_C = 20 * X_1\)
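One way to reason about the three options is to note that, for a fair coin, all of them have the same expectation but very different spreads; a short simulation sketch:

import numpy as np

rng = np.random.default_rng(1)
flips = rng.binomial(1, 0.5, size=(100_000, 20))  # each row: 20 fair coin flips

y_a = 10 * flips[:, 0] + 10 * flips[:, 1]  # strategy A
y_b = flips.sum(axis=1)                    # strategy B
y_c = 20 * flips[:, 0]                     # strategy C

# all three have expectation 10, but B has the smallest variance and C the largest
[(y.mean(), y.var()) for y in (y_a, y_b, y_c)]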
+ +Today, we’ve talked extensively about populations; if we know the distribution of a random variable, we can reliably compute expectation, variance, functions of the random variable, etc. Note that:
+In Data Science, however, we often do not have access to the whole population, so we don’t know its distribution. As such, we need to collect a sample and use its distribution to estimate or infer properties of the population. In cases like these, we can take several samples of size \(n\) from the population (an easy way to do this is using df.sample(n, replace=True)
), and compute the mean of each sample. When sampling, we make the (big) assumption that we sample uniformly at random with replacement from the population; each observation in our sample is a random variable drawn i.i.d from our population distribution. Remember that our sample mean is a random variable since it depends on our randomly drawn sample! On the other hand, our population mean is simply a number (a fixed value).
Consider an i.i.d. sample \(X_1, X_2, ..., X_n\) drawn from a population with mean \(\mu\) and SD \(\sigma\). We define the sample mean as \[\bar{X}_n = \frac{1}{n} \sum_{i=1}^n X_i\]
+The expectation of the sample mean is given by: \[\begin{align} + \mathbb{E}[\bar{X}_n] &= \frac{1}{n} \sum_{i=1}^n \mathbb{E}[X_i] \\ + &= \frac{1}{n} (n \mu) \\ + &= \mu +\end{align}\]
+The variance is given by: \[\begin{align} + \text{Var}(\bar{X}_n) &= \frac{1}{n^2} \text{Var}( \sum_{i=1}^n X_i) \\ + &= \frac{1}{n^2} \left( \sum_{i=1}^n \text{Var}(X_i) \right) \\ + &= \frac{1}{n^2} (n \sigma^2) = \frac{\sigma^2}{n} +\end{align}\]
+\(\bar{X}_n\) is approximately normally distributed by the Central Limit Theorem (CLT).
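The \(\frac{\sigma^2}{n}\) scaling is easy to see in simulation; the sketch below uses an arbitrary skewed population chosen only for illustration:

import numpy as np

rng = np.random.default_rng(2)
population = rng.exponential(scale=2, size=1_000_000)  # population with mean 2 and SD 2

for n in [10, 100, 1000]:
    # many samples of size n, one sample mean per row
    sample_means = rng.choice(population, size=(10_000, n)).mean(axis=1)
    # SD of the sample means shrinks like sigma / sqrt(n)
    print(n, sample_means.mean(), sample_means.std(), population.std() / np.sqrt(n))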
+In Data 8 and in the previous lecture, you encountered the Central Limit Theorem (CLT). This is a powerful theorem for estimating the distribution of a population with mean \(\mu\) and standard deviation \(\sigma\) from a collection of smaller samples. The CLT tells us that if an i.i.d sample of size \(n\) is large, then the probability distribution of the sample mean is roughly normal with mean \(\mu\) and SD of \(\frac{\sigma}{\sqrt{n}}\). More generally, any theorem that provides the rough distribution of a statistic and doesn’t need the distribution of the population is valuable to data scientists! This is because we rarely know a lot about the population.
++
Importantly, the CLT assumes that each observation in our samples is drawn i.i.d from the distribution of the population. In addition, the CLT is accurate only when \(n\) is “large”, but what counts as a “large” sample size depends on the specific distribution. If a population is highly symmetric and unimodal, we could need as few as \(n=20\); if a population is very skewed, we need a larger \(n\). If in doubt, you can bootstrap the sample mean and see if the bootstrapped distribution is bell-shaped. Classes like Data 140 investigate this idea in great detail.
+For a more in-depth demo, check out onlinestatbook.
+Now let’s say we want to use the sample mean to estimate the population mean, for example, the average height of Cal undergraduates. We can typically collect a single sample, which has just one average. However, what if we happened, by random chance, to draw a sample with a different mean or spread than that of the population? We might get a skewed view of how the population behaves (consider the extreme case where we happen to sample the exact same value \(n\) times!).
+
+
+
For example, notice the difference in variation between these two distributions of sample means computed from different sample sizes. The distribution with a bigger sample size (\(n=800\)) is tighter around the mean than the distribution with a smaller sample size (\(n=200\)). Try plugging these values into the standard deviation equation for the sample mean to make sense of this!
+Applying the CLT allows us to make sense of all of this and resolve this issue. By drawing many samples, we can consider how the sample distribution varies across multiple subsets of the data. This allows us to approximate the properties of the population without the need to survey every single member.
+Given this potential variance, it is also important that we consider the average value and spread of all possible sample means, and what this means for how big \(n\) should be. For every sample size, the expected value of the sample mean is the population mean: \[\mathbb{E}[\bar{X}_n] = \mu\] We call the sample mean an unbiased estimator of the population mean and will explore this idea more in the next lecture.
+ +At this point in the course, we’ve spent a great deal of time working with models. When we first introduced the idea of modeling a few weeks ago, we did so in the context of prediction: using models to make accurate predictions about unseen data. Another reason we might build models is to better understand complex phenomena in the world around us. Inference is the task of using a model to infer the true underlying relationships between the feature and response variables. For example, if we are working with a set of housing data, prediction might ask: given the attributes of a house, how much is it worth? Inference might ask: how much does having a local park impact the value of a house?
+A major goal of inference is to draw conclusions about the full population of data given only a random sample. To do this, we aim to estimate the value of a parameter, which is a numerical function of the population (for example, the population mean \(\mu\)). We use a collected sample to construct a statistic, which is a numerical function of the random sample (for example, the sample mean \(\bar{X}_n\)). It’s helpful to think “p” for “parameter” and “population,” and “s” for “sample” and “statistic.”
+Since the sample represents a random subset of the population, any statistic we generate will likely deviate from the true population parameter, and it could have been different. We say that the sample statistic is an estimator of the true population parameter. Notationally, the population parameter is typically called \(\theta\), while its estimator is denoted by \(\hat{\theta}\).
+To address our inference question, we aim to construct estimators that closely estimate the value of the population parameter. We evaluate how “good” an estimator is by answering three questions:
+This relationship can be illustrated with an archery analogy. Imagine that the center of the target is \(\theta\) and that each arrow corresponds to a separate parameter estimate \(\hat{\theta}\).
+
+
+
Ideally, we want our estimator to have low bias and low variance, but how can we mathematically quantify that? See the Bias-Variance Tradeoff section later in this note for more detail.
+Now that we’ve established the idea of an estimator, let’s see how we can apply this learning to the modeling process. To do so, we’ll take a moment to formalize our data collection and models in the language of random variables.
+Say we are working with an input variable, \(x\), and a response variable, \(Y\). We assume that \(Y\) and \(x\) are linked by some relationship \(g\); in other words, \(Y = g(x)\) where \(g\) represents some “universal truth” or “law of nature” that defines the underlying relationship between \(x\) and \(Y\). In the image below, \(g\) is denoted by the red line.
+As data scientists, however, we have no way of directly “seeing” the underlying relationship \(g\). The best we can do is collect observed data out in the real world to try to understand this relationship. Unfortunately, the data collection process will always have some inherent error (think of the randomness you might encounter when taking measurements in a scientific experiment). We say that each observation comes with some random error or noise term, \(\epsilon\) (read: “epsilon”). This error is assumed to be a random variable with expectation \(\mathbb{E}(\epsilon)=0\), variance \(\text{Var}(\epsilon) = \sigma^2\), and be i.i.d. across each observation. The existence of this random noise means that our observations, \(Y(x)\), are random variables.
+
+
+
We can only observe our random sample of data, represented by the blue points above. From this sample, we want to estimate the true relationship \(g\). We do this by constructing the model \(\hat{Y}(x)\) to estimate \(g\).
+\[\text{True relationship: } g(x)\]
+\[\text{Observed relationship: }Y = g(x) + \epsilon\]
+\[\text{Prediction: }\hat{Y}(x)\]
+
+
+
When building models, it is also important to note that our choice of features will also significantly impact our estimation. In the plot below, you can see how the different models (green and purple) can lead to different estimates.
+
+
+
If we assume that the true relationship \(g\) is linear, we can express the response as \(Y = f_{\theta}(x)\), where our true relationship is modeled by \[Y = g(x) + \epsilon\] \[ Y = f_{\theta}(x) = \theta_0 + \sum_{j=1}^p \theta_j x_j + \epsilon\]
+ +This true relationship has true, unobservable parameters \(\theta\), and it has random noise \(\epsilon\), so we can never observe the true relationship. Instead, the next best thing we can do is obtain a sample \(\Bbb{X}\), \(\Bbb{Y}\) of \(n\) observed relationships, \((x, Y)\) and use it to train a model and obtain an estimate of \(\hat{\theta}\) \[\hat{Y}(x) = f_{\hat{\theta}}(x) = \hat{\theta_0} + \sum_{j=1}^p \hat{\theta_j} x_j\]
+ +Now taking a look at our original equations, we can see that they both have differing sources of randomness. For our observed relationship, \(Y = g(x) + \epsilon\), \(\epsilon\) represents errors which occur during or after the observation or measurement process. For the estimation model, the data we have is a random sample collected from the population, which was constructed from decisions made before the measurement process.
+Recall the model and the data we generated from that model in the last section:
+\[\text{True relationship: } g(x)\]
+\[\text{Observed relationship: }Y = g(x) + \epsilon\]
+\[\text{Prediction: }\hat{Y}(x)\]
+With this reformulated modeling goal, we can now revisit the Bias-Variance Tradeoff from two lectures ago (shown below):
+
+
+
In today’s lecture, we’ll explore a more mathematical version of the graph you see above by introducing the terms model risk, observation variance, model bias, and model variance. Eventually, we’ll work our way up to an updated version of the Bias-Variance Tradeoff graph that you see below
+
+
+
Model risk is defined as the mean square prediction error of the random variable \(\hat{Y}\). It is an expectation across all samples we could have possibly gotten when fitting the model, which we can denote as random variables \(X_1, X_2, \ldots, X_n, Y\). Model risk considers the model’s performance on any sample that is theoretically possible, rather than the specific data that we have collected.
+\[\text{model risk} = E\left[(Y-\hat{Y}(x))^2\right]\]
+What is the origin of the error encoded by model risk? Note that there are two types of errors: chance errors, which arise from randomness alone (the random noise in each observation and the randomness of the sample used to fit the model), and (model) bias, a non-random, systematic gap between our model and the truth.
+Recall the data-generating process we established earlier. There is a true underlying relationship \(g\), observed data (with random noise) \(Y\), and model \(\hat{Y}\).
+
+
+
To better understand model risk, we’ll zoom in on a single data point in the plot above.
+
+
+
Remember that \(\hat{Y}(x)\) is a random variable – it is the prediction made for \(x\) after being fit on the specific sample used for training. If we had used a different sample for training, a different prediction might have been made for this value of \(x\). To capture this, the diagram above considers both the prediction \(\hat{Y}(x)\) made for a particular random training sample, and the expected prediction across all possible training samples, \(E[\hat{Y}(x)]\).
+We can use this simplified diagram to break down the prediction error into smaller components. First, start by considering the error on a single prediction, \(Y(x)-\hat{Y}(x)\).
+
+
+
We can identify three components of this error.
+
+
+
That is, the error can be written as:
+\[Y(x)-\hat{Y}(x) = \epsilon + \left(g(x)-E\left[\hat{Y}(x)\right]\right) + \left(E\left[\hat{Y}(x)\right] - \hat{Y}(x)\right)\] \[\newline \]
+The model risk is the expected square of the expression above, \(E\left[(Y(x)-\hat{Y}(x))^2\right]\). If we square both sides and then take the expectation, we will get the following decomposition of model risk:
+\[E\left[(Y(x)-\hat{Y}(x))^2\right] = E[\epsilon^2] + \left(g(x)-E\left[\hat{Y}(x)\right]\right)^2 + E\left[\left(E\left[\hat{Y}(x)\right] - \hat{Y}(x)\right)^2\right]\]
+It looks like we are missing some cross-product terms when squaring the right-hand side, but it turns out that all of those cross-product terms are zero. The detailed derivation is out of scope for this class, but a proof is included at the end of this note for your reference.
+This expression may look complicated at first glance, but we’ve actually already defined each term earlier in this lecture! Let’s look at them term by term.
+The first term in the above decomposition is \(E[\epsilon^2]\). Remember \(\epsilon\) is the random noise when observing \(Y\), with expectation \(\mathbb{E}(\epsilon)=0\) and variance \(\text{Var}(\epsilon) = \sigma^2\). We can show that \(E[\epsilon^2]\) is the variance of \(\epsilon\): \[
\begin{align*}
\text{Var}(\epsilon) &= E[\epsilon^2] - \left(E[\epsilon]\right)^2\\
&= E[\epsilon^2] - 0^2\\
&= \sigma^2.
\end{align*}
\]
+This term describes how variable the random error \(\epsilon\) (and \(Y\)) is for each observation. This is called the observation variance. It exists due to the randomness in our observations \(Y\). It is a form of chance error we talked about in the Sampling lecture.
+\[\text{observation variance} = \text{Var}(\epsilon) = \sigma^2.\]
+The observation variance results from measurement errors when observing data or missing information that acts like noise. To reduce this observation variance, we could try to get more precise measurements, but it is often beyond the control of data scientists. Because of this, the observation variance \(\sigma^2\) is sometimes called “irreducible error.”
+We will then look at the last term: \(E\left[\left(E\left[\hat{Y}(x)\right] - \hat{Y}(x)\right)^2\right]\). If you recall the definition of variance from the last lecture, this is precisely \(\text{Var}(\hat{Y}(x))\). We call this the model variance.
+It describes how much the prediction \(\hat{Y}(x)\) tends to vary when we fit the model on different samples. Remember the sample we collect can come out very differently, thus the prediction \(\hat{Y}(x)\) will also be different. The model variance describes this variability due to the randomness in our sampling process. Like observation variance, it is also a form of chance error—even though the sources of randomness are different.
+\[\text{model variance} = \text{Var}(\hat{Y}(x)) = E\left[\left(\hat{Y}(x) - E\left[\hat{Y}(x)\right]\right)^2\right]\]
+The main reason for large model variance is overfitting: we pay so much attention to the details of our particular sample that small differences in the random sample lead to large differences in the fitted model. To remedy this, we try to reduce model complexity (e.g., take out some features and limit the magnitude of estimated model coefficients) so that we do not fit our model to the noise.
+Finally, the second term is \(\left(g(x)-E\left[\hat{Y}(x)\right]\right)^2\). What is this? The term \(E\left[\hat{Y}(x)\right] - g(x)\) is called the model bias.
+Remember that \(g(x)\) is the fixed underlying truth and \(\hat{Y}(x)\) is our fitted model, which is random. Model bias therefore measures how far off \(g(x)\) and \(\hat{Y}(x)\) are on average over all possible samples.
+\[\text{model bias} = E\left[\hat{Y}(x) - g(x)\right] = E\left[\hat{Y}(x)\right] - g(x)\]
+The model bias is not random; it’s an average measure for a specific individual \(x\). If bias is positive, our model tends to overestimate \(g(x)\); if it’s negative, our model tends to underestimate \(g(x)\). And if it’s 0, we can say that our model is unbiased.
+There are two main reasons for large model bias: the model is too simple to capture the underlying relationship (underfitting), or we lack the domain knowledge to choose features that reflect how the response is actually generated.
+To fix this, we increase model complexity (but we don’t want to overfit!) or consult domain experts to see which models make sense. You can start to see a tradeoff here: if we increase model complexity, we decrease the model bias, but we also risk increasing the model variance.
+To summarize:
+The above definitions enable us to simplify the decomposition of model risk before as:
+\[ E[(Y(x) - \hat{Y}(x))^2] = \sigma^2 + (E[\hat{Y}(x)] - g(x))^2 + \text{Var}(\hat{Y}(x)) \] \[\text{model risk} = \text{observation variance} + (\text{model bias})^2 + \text{model variance}\]
+This is known as the bias-variance tradeoff. What does it mean? Remember that the model risk is a measure of the model’s performance. Our goal in building models is to keep model risk low; this means that we will want to ensure that each component of model risk is kept at a small value.
+Observation variance is an inherent, random part of the data collection process. We aren’t able to reduce the observation variance, so we’ll focus our attention on the model bias and model variance.
+In the Feature Engineering lecture, we considered the issue of overfitting. We saw that the model’s error or bias tends to decrease as model complexity increases — if we design a highly complex model, it will tend to make predictions that are closer to the true relationship \(g\). At the same time, model variance tends to increase as model complexity increases; a complex model may overfit to the training data, meaning that small differences in the random samples used for training lead to large differences in the fitted model. We have a problem. To decrease model bias, we could increase the model’s complexity, which would lead to overfitting and an increase in model variance. Alternatively, we could decrease model variance by decreasing the model’s complexity at the cost of increased model bias due to underfitting.
+
+
+
We need to strike a balance. Our goal in model creation is to use a complexity level that is high enough to keep bias low, but not so high that model variance is large.
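To make the tradeoff concrete, here is a small simulation sketch: we repeatedly draw noisy samples from a known \(g\), fit a too-simple model (degree 1) and a more complex model (degree 5) with np.polyfit, and compare the estimated squared bias and variance of their predictions at a single point \(x_0\). All of the specific choices below (the form of \(g\), the noise level, the degrees) are arbitrary and chosen only for illustration:

import numpy as np

rng = np.random.default_rng(3)
g = lambda x: np.sin(2 * x)        # "true" relationship (unknown to us in practice)
x = np.linspace(0, 3, 30)          # fixed design points
x0, sigma = 1.5, 0.5               # prediction point and observation SD

preds = {1: [], 5: []}
for _ in range(2000):
    y = g(x) + rng.normal(0, sigma, size=x.size)   # one observed sample
    for degree in preds:
        coeffs = np.polyfit(x, y, degree)          # fit a polynomial model on this sample
        preds[degree].append(np.polyval(coeffs, x0))

for degree, p in preds.items():
    p = np.array(p)
    bias = p.mean() - g(x0)   # model bias at x0
    variance = p.var()        # model variance at x0
    print(degree, bias**2, variance)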
+This section walks through the detailed derivation of the Bias-Variance Decomposition in the Bias-Variance Tradeoff section above, and this content is out of scope.
+ + + + + +Last lecture, we learned of the difference between quantitative and qualitative variable types. The latter includes string data — the primary focus of lecture 6. In this note, we’ll discuss the necessary tools to manipulate text: Python string manipulation and regular expressions.
+There are two main reasons for working with text: canonicalization, converting data that can appear in many different formats into a single standard form, and extraction, pulling out specific pieces of information from text.
+First, we’ll introduce a few methods useful for string manipulation. The following table includes a number of string operations supported by Python and pandas
. The Python functions operate on a single string, while their equivalent in pandas
are vectorized — they operate on a Series
of string data.
Operation | +Python | +Pandas (Series ) |
+
---|---|---|
Transformation | +
|
+
|
+
Replacement + Deletion | +
|
+
|
+
Split | +
|
+
|
+
Substring | +
|
+
|
+
Membership | +
|
+
|
+
Length | +
|
+
|
+
We’ll discuss the differences between Python string functions and pandas
Series
methods in the following section on canonicalization.
Assume we want to merge the given tables.
+import pandas as pd
+
+with open('data/county_and_state.csv') as f:
+= pd.read_csv(f)
+ county_and_state
+ with open('data/county_and_population.csv') as f:
+= pd.read_csv(f) county_and_pop
; display(county_and_state), display(county_and_pop)
+ | County | +State | +
---|---|---|
0 | +De Witt County | +IL | +
1 | +Lac qui Parle County | +MN | +
2 | +Lewis and Clark County | +MT | +
3 | +St John the Baptist Parish | +LS | +
+ | County | +Population | +
---|---|---|
0 | +DeWitt | +16798 | +
1 | +Lac Qui Parle | +8067 | +
2 | +Lewis & Clark | +55716 | +
3 | +St. John the Baptist | +43044 | +
Last time, we used a primary key and foreign key to join two tables. While neither of these keys exist in our DataFrame
s, the "County"
columns look similar enough. Can we convert these columns into one standard, canonical form to merge the two tables?
The following function uses Python string manipulation to convert a single county name into canonical form. It does so by eliminating whitespace, punctuation, and unnecessary text.
def canonicalize_county(county_name):
    return (
        county_name
        .lower()
        .replace(' ', '')
        .replace('&', 'and')
        .replace('.', '')
        .replace('county', '')
        .replace('parish', '')
    )

canonicalize_county("St. John the Baptist")
'stjohnthebaptist'
+We will use the pandas
map
function to apply the canonicalize_county
function to every row in both DataFrame
s. In doing so, we’ll create a new column in each called clean_county_python
with the canonical form.
county_and_pop['clean_county_python'] = county_and_pop['County'].map(canonicalize_county)
county_and_state['clean_county_python'] = county_and_state['County'].map(canonicalize_county)
display(county_and_state), display(county_and_pop);
+ | County | +State | +clean_county_python | +
---|---|---|---|
0 | +De Witt County | +IL | +dewitt | +
1 | +Lac qui Parle County | +MN | +lacquiparle | +
2 | +Lewis and Clark County | +MT | +lewisandclark | +
3 | +St John the Baptist Parish | +LS | +stjohnthebaptist | +
+ | County | +Population | +clean_county_python | +
---|---|---|---|
0 | +DeWitt | +16798 | +dewitt | +
1 | +Lac Qui Parle | +8067 | +lacquiparle | +
2 | +Lewis & Clark | +55716 | +lewisandclark | +
3 | +St. John the Baptist | +43044 | +stjohnthebaptist | +
Alternatively, we can use pandas
Series
methods to create this standardized column. To do so, we must call the .str
attribute of our Series
object prior to calling any methods, like .lower
and .replace
. Notice how these method names match their equivalent built-in Python string functions.
Chaining multiple Series
methods in this manner eliminates the need to use the map
function (as this code is vectorized).
def canonicalize_county_series(county_series):
    return (
        county_series
        .str.lower()
        .str.replace(' ', '')
        .str.replace('&', 'and')
        .str.replace('.', '')
        .str.replace('county', '')
        .str.replace('parish', '')
    )
county_and_pop['clean_county_pandas'] = canonicalize_county_series(county_and_pop['County'])
county_and_state['clean_county_pandas'] = canonicalize_county_series(county_and_state['County'])
display(county_and_pop), display(county_and_state);
+ | County | +Population | +clean_county_python | +clean_county_pandas | +
---|---|---|---|---|
0 | +DeWitt | +16798 | +dewitt | +dewitt | +
1 | +Lac Qui Parle | +8067 | +lacquiparle | +lacquiparle | +
2 | +Lewis & Clark | +55716 | +lewisandclark | +lewisandclark | +
3 | +St. John the Baptist | +43044 | +stjohnthebaptist | +stjohnthebaptist | +
+ | County | +State | +clean_county_python | +clean_county_pandas | +
---|---|---|---|---|
0 | +De Witt County | +IL | +dewitt | +dewitt | +
1 | +Lac qui Parle County | +MN | +lacquiparle | +lacquiparle | +
2 | +Lewis and Clark County | +MT | +lewisandclark | +lewisandclark | +
3 | +St John the Baptist Parish | +LS | +stjohnthebaptist | +stjohnthebaptist | +
Extraction explores the idea of obtaining useful information from text data. This will be particularly important in model building, which we’ll study in a few weeks.
+Say we want to read some data from a .txt
file.
with open('data/log.txt', 'r') as f:
    log_lines = f.readlines()

log_lines
['169.237.46.168 - - [26/Jan/2014:10:47:58 -0800] "GET /stat141/Winter04/ HTTP/1.1" 200 2585 "http://anson.ucdavis.edu/courses/"\n',
+ '193.205.203.3 - - [2/Feb/2005:17:23:6 -0800] "GET /stat141/Notes/dim.html HTTP/1.0" 404 302 "http://eeyore.ucdavis.edu/stat141/Notes/session.html"\n',
+ '169.237.46.240 - "" [3/Feb/2006:10:18:37 -0800] "GET /stat141/homework/Solutions/hw1Sol.pdf HTTP/1.1"\n']
+Suppose we want to extract the day, month, year, hour, minutes, seconds, and time zone. Unfortunately, these items are not in a fixed position from the beginning of the string, so slicing by some fixed offset won’t work.
+Instead, we can use some clever thinking. Notice how the relevant information is contained within a set of brackets, further separated by /
and :
. We can hone in on this region of text, and split the data on these characters. Python’s built-in .split
function makes this easy.
first = log_lines[0]  # Only considering the first row of data

pertinent = first.split("[")[1].split(']')[0]
day, month, rest = pertinent.split('/')
year, hour, minute, rest = rest.split(':')
seconds, time_zone = rest.split(' ')
day, month, year, hour, minute, seconds, time_zone
+There are two problems with this code:
+map
function or pandas
Series
methods.In the next section, we’ll introduce regular expressions - a tool that solves problem 2.
+A regular expression (“RegEx”) is a sequence of characters that specifies a search pattern. They are written to extract specific information from text. Regular expressions are essentially part of a smaller programming language embedded in Python, made available through the re
module. As such, they have a stand-alone syntax and methods for various capabilities.
Regular expressions are useful in many applications beyond data science. For example, Social Security Numbers (SSNs) are often validated with regular expressions.
+r"[0-9]{3}-[0-9]{2}-[0-9]{4}" # Regular Expression Syntax
+
+# 3 of any digit, then a dash,
+# then 2 of any digit, then a dash,
+# then 4 of any digit
'[0-9]{3}-[0-9]{2}-[0-9]{4}'
+There are a ton of resources to learn and experiment with regular expressions. A few are provided below:
+There are four basic operations with regular expressions.
+Operation | +Order | +Syntax Example | +Matches | +Doesn’t Match | +
---|---|---|---|---|
Or : | |
+4 | +AA|BAAB | +AA BAAB |
+every other string | +
Concatenation |
+3 | +AABAAB | +AABAAB | +every other string | +
Closure : * (zero or more) |
+2 | +AB*A | +AA ABBBBBBA | +AB ABABA |
+
Group : () (parenthesis) |
+1 | +A(A|B)AAB (AB)*A |
+AAAAB ABAAB A ABABABABA |
+every other string AA ABBA |
+
Notice how these metacharacter operations are ordered. Rather than being literal characters, these metacharacters manipulate adjacent characters. ()
takes precedence, followed by *
, and finally |
. This allows us to differentiate between very different regex commands like AB*
and (AB)*
. The former reads “A
then zero or more copies of B
”, while the latter specifies “zero or more copies of AB
”.
Question 1: Give a regular expression that matches moon
, moooon
, etc. Your expression should match any even number of o
s except zero (i.e. don’t match mn
).
Answer 1: moo(oo)*n
oo
before the capture group ensures that mn
is not matched.(oo)*
ensures the number of o
’s is even.Question 2: Using only basic operations, formulate a regex that matches muun
, muuuun
, moon
, moooon
, etc. Your expression should match any even number of u
s or o
s except zero (i.e. don’t match mn
).
Answer 2: m(uu(uu)*|oo(oo)*)n
m
and trailing n
ensures that only strings beginning with m
and ending with n
are matched.|
.
+m(uu(uu)*)|(oo(oo)*)n
. This incorrectly matches muu
and oooon
.
+|
. The incorrect solution matches only half of the string, and ignores either the beginning m
or trailing n
.|
. That way, each OR clause is everything to the left and right of |
within the group. This ensures both the beginning m
and trailing n
are matched.Provided below are more complex regular expression functions.
+Operation | +Syntax Example | +Matches | +Doesn’t Match | +
---|---|---|---|
Any Character : . (except newline) |
+.U.U.U. | +CUMULUS JUGULUM |
+SUCCUBUS TUMULTUOUS | +
Character Class : [] (match one character in [] ) |
+[A-Za-z][a-z]* | +word Capitalized |
+camelCase 4illegal | +
Repeated "a" Times : {a} |
+j[aeiou]{3}hn | +jaoehn jooohn |
+jhn jaeiouhn |
+
Repeated "from a to b" Times : {a, b} |
+j[ou]{1,2}hn | +john juohn |
+jhn jooohn |
+
At Least One : + |
+jo+hn | +john joooooohn |
+jhn jjohn |
+
Zero or One : ? |
+joh?n | +jon john |
+any other string | +
A character class matches a single character in its class. These characters can be hardcoded —— in the case of [aeiou]
—— or shorthand can be specified to mean a range of characters. Examples include:
[A-Z]
: Any capitalized letter[a-z]
: Any lowercase letter[0-9]
: Any single digit[A-Za-z]
: Any capitalized of lowercase letter[A-Za-z0-9]
: Any capitalized or lowercase letter or single digitLet’s analyze a few examples of complex regular expressions.
+Matches | +Does Not Match | +
---|---|
|
++ |
RASPBERRY SPBOO |
+SUBSPACE SUBSPECIES |
+
|
++ |
231-41-5121 573-57-1821 |
+231415121 57-3571821 |
+
|
++ |
horse@pizza.com horse@pizza.food.com |
+frank_99@yahoo.com hug@cs |
+
Explanations
+.*SPB.*
only matches strings that contain the substring SPB
.
+.*
metacharacter matches any amount of non-negative characters. Newlines do not count.com
or edu
domain, where all characters of the email are letters.
+.
must precede the domain name. Including a backslash \
before any metacharacter (in this case, the .
) tells RegEx to match that character exactly.Here are a few more convenient regular expressions.
+Operation | +Syntax Example | +Matches | +Doesn’t Match | +
---|---|---|---|
built in character class |
+\w+ \d+ \s+ |
+Fawef_03 231123 whitespace |
+this person 423 people non-whitespace |
+
character class negation : [^] (everything except the given characters) |
+[^a-z]+. | +PEPPERS3982 17211!↑å | +porch CLAmS |
+
escape character : \ (match the literal next character) |
+cow\.com | +cow.com | +cowscom | +
beginning of line : ^ |
+^ark | +ark two ark o ark | +dark | +
end of line : $ |
+ark$ | +dark ark o ark |
+ark two | +
lazy version of zero or more : *? |
+5.*?5 | +5005 55 |
+5005005 | +
In order to fully understand the last operation in the table, we have to discuss greediness. RegEx is greedy – it will look for the longest possible match in a string. To motivate this with an example, consider the pattern <div>.*</div>
. In the sentence below, we would hope that the bolded portions would be matched:
“This is a <div>example</div> of greediness <div>in</div> regular expressions.”
+However, in reality, RegEx captures far more of the sentence. The way RegEx processes the text given that pattern is as follows:
+“Look for the exact string <>”
Then, “look for any character 0 or more times”
Then, “look for the exact string </div>”
The result would be all the characters starting from the leftmost <div> and the rightmost </div> (inclusive):
+“This is a <div>example</div> of greediness <div>in</div> regular expressions.”
+We can fix this by making our pattern non-greedy, <div>.*?</div>
. You can read up more in the documentation here.
Let’s revisit our earlier problem of extracting date/time data from the given .txt
files. Here is how the data looked.
0] log_lines[
'169.237.46.168 - - [26/Jan/2014:10:47:58 -0800] "GET /stat141/Winter04/ HTTP/1.1" 200 2585 "http://anson.ucdavis.edu/courses/"\n'
+Question: Give a regular expression that matches everything contained within and including the brackets - the day, month, year, hour, minutes, seconds, and time zone.
+Answer: \[.*\]
[
and ]
is necessary. Therefore, an escape character \
is required before both [
and ]
— otherwise these metacharacters will match character classes.[
and ]
. For this example, .*
will suffice.Alternative Solution: \[\w+/\w+/\w+:\w+:\w+:\w+\s-\w+\]
[
and ]
was garbage - .*
will still match that.Earlier in this note, we examined the process of canonicalization using python
string manipulation and pandas
Series
methods. However, we mentioned this approach had a major flaw: our code was unnecessarily verbose. Equipped with our knowledge of regular expressions, let’s fix this.
To do so, we need to understand a few functions in the re
module. The first of these is the substitute function: re.sub(pattern, rep1, text)
. It behaves similarly to python
’s built-in .replace
function, and returns text with all instances of pattern
replaced by rep1
.
The regular expression here removes text surrounded by <>
(also known as HTML tags).
In order, the pattern matches … 1. a single <
2. any character that is not a >
: div, td valign…, /td, /div 3. a single >
Any substring in text
that fulfills all three conditions will be replaced by ''
.
import re
+
+= "<div><td valign='top'>Moo</td></div>"
+ text = r"<[^>]+>"
+ pattern '', text) re.sub(pattern,
'Moo'
+Notice the r
preceding the regular expression pattern; this specifies the regular expression is a raw string. Raw strings do not recognize escape sequences (i.e., the Python newline metacharacter \n
). This makes them useful for regular expressions, which often contain literal \
characters.
In other words, don’t forget to tag your RegEx with an r
.
pandas
We can also use regular expressions with pandas
Series
methods. This gives us the benefit of operating on an entire column of data as opposed to a single value. The code is simple:
ser.str.replace(pattern, repl, regex=True
).
Consider the following DataFrame
html_data
with a single column.
= {"HTML": ["<div><td valign='top'>Moo</td></div>", \
+ data "<a href='http://ds100.org'>Link</a>", \
+ "<b>Bold text</b>"]}
+ = pd.DataFrame(data) html_data
html_data
+ | HTML | +
---|---|
0 | +<div><td valign='top'>Moo</td></div> | +
1 | +<a href='http://ds100.org'>Link</a> | +
2 | +<b>Bold text</b> | +
= r"<[^>]+>"
+ pattern 'HTML'].str.replace(pattern, '', regex=True) html_data[
0 Moo
+1 Link
+2 Bold text
+Name: HTML, dtype: object
+Just like with canonicalization, the re
module provides capability to extract relevant text from a string:
re.findall(pattern, text)
. This function returns a list of all matches to pattern
.
Using the familiar regular expression for Social Security Numbers:
+= "My social security number is 123-45-6789 bro, or maybe it’s 321-45-6789."
+ text = r"[0-9]{3}-[0-9]{2}-[0-9]{4}"
+ pattern re.findall(pattern, text)
['123-45-6789', '321-45-6789']
+pandas
pandas
similarily provides extraction functionality on a Series
of data: ser.str.findall(pattern)
Consider the following DataFrame
ssn_data
.
= {"SSN": ["987-65-4321", "forty", \
+ data "123-45-6789 bro or 321-45-6789",
+ "999-99-9999"]}
+ = pd.DataFrame(data) ssn_data
ssn_data
+ | SSN | +
---|---|
0 | +987-65-4321 | +
1 | +forty | +
2 | +123-45-6789 bro or 321-45-6789 | +
3 | +999-99-9999 | +
"SSN"].str.findall(pattern) ssn_data[
0 [987-65-4321]
+1 []
+2 [123-45-6789, 321-45-6789]
+3 [999-99-9999]
+Name: SSN, dtype: object
+This function returns a list for every row containing the pattern matches in a given string.
+As you may expect, there are similar pandas
equivalents for other re
functions as well. Series.str.extract
takes in a pattern and returns a DataFrame
of each capture group’s first match in the string. In contrast, Series.str.extractall
returns a multi-indexed DataFrame
of all matches for each capture group. You can see the difference in the outputs below:
= r"([0-9]{3})-([0-9]{2})-([0-9]{4})"
+ pattern_cg "SSN"].str.extract(pattern_cg) ssn_data[
+ | 0 | +1 | +2 | +
---|---|---|---|
0 | +987 | +65 | +4321 | +
1 | +NaN | +NaN | +NaN | +
2 | +123 | +45 | +6789 | +
3 | +999 | +99 | +9999 | +
"SSN"].str.extractall(pattern_cg) ssn_data[
+ | + | 0 | +1 | +2 | +
---|---|---|---|---|
+ | match | ++ | + | + |
0 | +0 | +987 | +65 | +4321 | +
2 | +0 | +123 | +45 | +6789 | +
1 | +321 | +45 | +6789 | +|
3 | +0 | +999 | +99 | +9999 | +
Earlier we used parentheses (
)
to specify the highest order of operation in regular expressions. However, they have another meaning; parentheses are often used to represent capture groups. Capture groups are essentially, a set of smaller regular expressions that match multiple substrings in text data.
Let’s take a look at an example.
+= "Observations: 03:04:53 - Horse awakens. \
+ text 03:05:14 - Horse goes back to sleep."
Say we want to capture all occurences of time data (hour, minute, and second) as separate entities.
+= r"(\d\d):(\d\d):(\d\d)"
+ pattern_1 re.findall(pattern_1, text)
[('03', '04', '53'), ('03', '05', '14')]
+Notice how the given pattern has 3 capture groups, each specified by the regular expression (\d\d)
. We then use re.findall
to return these capture groups, each as tuples containing 3 matches.
These regular expression capture groups can be different. We can use the (\d{2})
shorthand to extract the same data.
= r"(\d\d):(\d\d):(\d{2})"
+ pattern_2 re.findall(pattern_2, text)
[('03', '04', '53'), ('03', '05', '14')]
+With the notion of capture groups, convince yourself how the following regular expression works.
+= log_lines[0]
+ first first
'169.237.46.168 - - [26/Jan/2014:10:47:58 -0800] "GET /stat141/Winter04/ HTTP/1.1" 200 2585 "http://anson.ucdavis.edu/courses/"\n'
+= r'\[(\d+)\/(\w+)\/(\d+):(\d+):(\d+):(\d+) (.+)\]'
+ pattern = re.findall(pattern, first)[0]
+ day, month, year, hour, minute, second, time_zone print(day, month, year, hour, minute, second, time_zone)
26 Jan 2014 10 47 58 -0800
+Today, we explored the capabilities of regular expressions in data wrangling with text data. However, there are a few things to be wary of.
+Writing regular expressions is like writing a program.
+Regular expressions are terrible at certain types of problems:
+json.load()
parser, not RegEx!Ultimately, the goal is not to memorize all regular expressions. Rather, the aim is to:
+python
and pandas
RegEx methods.In data science, understanding characteristics of a population starts with having quality data to investigate. While it is often impossible to collect all the data describing a population, we can overcome this by properly sampling from the population. In this note, we will discuss appropriate techniques for sampling from populations.
+In general: a census is “a complete count or survey of a population, typically recording various details of individuals.” An example is the U.S. Decennial Census which was held in April 2020. It counts every person living in all 50 states, DC, and US territories, not just citizens. Participation is required by law (it is mandated by the U.S. Constitution). Important uses include the allocation of Federal funds, congressional representation, and drawing congressional and state legislative districts. The census is composed of a survey mailed to different housing addresses in the United States.
+A survey is a set of questions. An example is workers sampling individuals and households. What is asked and how it is asked can affect how the respondent answers or even whether or not they answer in the first place.
+While censuses are great, it is often very difficult and expensive to survey everyone in a population. Imagine the amount of resources, money, time, and energy the U.S. spent on the 2020 Census. While this does give us more accurate information about the population, it’s often infeasible to execute. Thus, we usually survey a subset of the population instead.
+A sample is (usually) a subset of the population that is often used to make inferences about the population. If our sample is a good representation of our population, then we can use it to glean useful information at a lower cost. That being said, how the sample is drawn will affect the reliability of such inferences. Two common sources of error in sampling are chance error, where random samples can vary from what is expected in any direction, and bias, which is a systematic error in one direction. Biases can be the result of many things, for example, our sampling scheme or survey methods.
+Let’s define some useful vocabulary:
+While ideally, these three sets would be exactly the same, they usually aren’t in practice. For example, there may be individuals in your sampling frame (and hence, your sample) that are not in your population. And generally, sample sizes are much smaller than population sizes.
+The following case study is adapted from Statistics by Freedman, Pisani, and Purves, W.W. Norton NY, 1978.
+In 1936, President Franklin D. Roosevelt (Democratic) went up for re-election against Alf Landon (Republican). As is usual, polls were conducted in the months leading up to the election to try and predict the outcome. The Literary Digest was a magazine that had successfully predicted the outcome of 5 general elections coming into 1936. In their polling for the 1936 election, they sent out their survey to 10 million individuals whom they found from phone books, lists of magazine subscribers, and lists of country club members. Of the roughly 2.4 million people who filled out the survey, only 43% reported they would vote for Roosevelt; thus, the Digest predicted that Landon would win.
+On election day, Roosevelt won in a landslide, winning 61% of the popular vote of about 45 million voters. How could the Digest have been so wrong with their polling?
+It turns out that the Literary Digest sample was not representative of the population. Their sampling frame of people found in phone books, lists of magazine subscribers, and lists of country club members were more affluent and tended to vote Republican. As such, their sampling frame was inherently skewed in Landon’s favor. The Literary Digest completely overlooked the lion’s share of voters who were still suffering through the Great Depression. Furthermore, they had a dismal response rate (about 24%); who knows how the other non-respondents would have polled? The Digest folded just 18 months after this disaster.
+At the same time, George Gallup, a rising statistician, also made predictions about the 1936 elections. Despite having a smaller sample size of “only” 50,000 (this is still more than necessary; more when we cover the Central Limit Theorem), his estimate that 56% of voters would choose Roosevelt was much closer to the actual result (61%). Gallup also predicted the Digest’s prediction within 1% with a sample size of only 3000 people by anticipating the Digest’s affluent sampling frame and subsampling those individuals.
+So what’s the moral of the story? Samples, while convenient, are subject to chance error and bias. Election polling, in particular, can involve many sources of bias. To name a few:
+Randomized Response
+Suppose you want to ask someone a sensitive question: “Have you ever cheated on an exam?” An individual may be embarrassed or afraid to answer truthfully and might lie or not answer the question. One solution is to leverage a randomized response:
+First, you can ask the individual to secretly flip a fair coin; you (the surveyor) don’t know the outcome of the coin flip.
+Then, you ask them to answer “Yes” if the coin landed heads and to answer truthfully if the coin landed tails.
+The surveyor doesn’t know if the “Yes” means that the person cheated or if it means that the coin landed heads. The individual’s sensitive information remains secret. However, if the response is “No”, then the surveyor knows the individual didn’t cheat. We assume the individual is comfortable revealing this information.
+Generally, we can assume that the coin lands heads 50% of the time, masking the remaining 50% of the “No” answers. We can therefore double the proportion of “No” answers to estimate the true fraction of “No” answers.
+Election Polls
+Today, the Gallup Poll is one of the leading polls for election results. The many sources of biases – who responds to polls? Do voters tell the truth? How can we predict turnout? – still remain, but the Gallup Poll uses several tactics to mitigate them. Within their sampling frame of “civilian, non-institutionalized population” of adults in telephone households in continental U.S., they use random digit dialing to include both listed/unlisted phone numbers and to avoid selection bias. Additionally, they use a within-household selection process to randomly select households with one or more adults. If no one answers, re-call multiple times to avoid non-response bias.
+When sampling, it is essential to focus on the quality of the sample rather than the quantity of the sample. A huge sample size does not fix a bad sampling method. Our main goal is to gather a sample that is representative of the population it came from. In this section, we’ll explore the different types of sampling and their pros and cons.
+A convenience sample is whatever you can get ahold of; this type of sampling is non-random. Note that haphazard sampling is not necessarily random sampling; there are many potential sources of bias.
+In a probability sample, we provide the chance that any specified set of individuals will be in the sample (individuals in the population can have different chances of being selected; they don’t all have to be uniform), and we sample at random based off this known chance. For this reason, probability samples are also called random samples. The randomness provides a few benefits:
+The real world is usually more complicated, and we often don’t know the initial probabilities. For example, we do not generally know the probability that a given bacterium is in a microbiome sample or whether people will answer when Gallup calls landlines. That being said, still we try to model probability sampling to the best of our ability even when the sampling or measurement process is not fully under our control.
+A few common random sampling schemes:
+Suppose we have 3 TA’s (Arman, Boyu, Charlie): I decide to sample 2 of them as follows:
+We can list all the possible outcomes and their respective probabilities in a table:
+Outcome | +Probability | +
---|---|
{A, B} | +0.5 | +
{A, C} | +0.5 | +
{B, C} | +0 | +
This is a probability sample (though not a great one). Of the 3 people in my population, I know the chance of getting each subset. Suppose I’m measuring the average distance TAs live from campus.
+Consider the following sampling scheme:
+Yes. For a sample [n, n + 10, n + 20, …, n + 1090], where 1 <= n <= 10, the probability of that sample is 1/10. Otherwise, the probability is 0.
+Only 10 possible samples! +We are trying to collect a sample from Berkeley residents to predict the which one of Barbie and Oppenheimer would perform better on their opening day, July 21st.
+First, let’s grab a dataset that has every single resident in Berkeley (this is a fake dataset) and which movie they actually watched on July 21st.
+Let’s load in the movie.csv
table. We can assume that:
is_male
is a boolean that indicates if a resident identifies as male.import matplotlib.pyplot as plt
+import numpy as np
+import pandas as pd
+import seaborn as sns
+
+='darkgrid', font_scale = 1.5,
+ sns.set_theme(style={'figure.figsize':(7,5)})
+ rc
+= np.random.default_rng() rng
= pd.read_csv("data/movie.csv")
+ movie
+# create a 1/0 int that indicates Barbie vote
+'barbie'] = (movie['movie'] == 'Barbie').astype(int)
+ movie[ movie.head()
+ | age | +is_male | +movie | +barbie | +
---|---|---|---|---|
0 | +35 | +False | +Barbie | +1 | +
1 | +42 | +True | +Oppenheimer | +0 | +
2 | +55 | +False | +Barbie | +1 | +
3 | +77 | +True | +Oppenheimer | +0 | +
4 | +31 | +False | +Barbie | +1 | +
What fraction of Berkeley residents chose Barbie?
+= np.mean(movie["barbie"])
+ actual_barbie actual_barbie
np.float64(0.5302792307692308)
+This is the actual outcome of the competition. Based on this result, Barbie would win. How did our sample of retirees do?
+Let’s take a convenience sample of people who have retired (>= 65 years old). What proportion of them went to see Barbie instead of Oppenheimer?
+= movie[movie['age'] >= 65] # take a convenience sample of retirees
+ convenience_sample "barbie"]) # what proportion of them saw Barbie? np.mean(convenience_sample[
np.float64(0.3744755089093924)
+Based on this result, we would have predicted that Oppenheimer would win! What happened? Is it possible that our sample is too small or noisy?
+# what's the size of our sample?
+len(convenience_sample)
359396
+# what proportion of our data is in the convenience sample?
+len(convenience_sample)/len(movie)
0.27645846153846154
+Seems like our sample is rather large (roughly 360,000 people), so the error is likely not due to solely to chance.
+Let us aggregate all choices by age and visualize the fraction of Barbie views, split by gender.
+= movie.groupby(["age","is_male"]).agg("mean", numeric_only=True).reset_index()
+ votes_by_barbie votes_by_barbie.head()
+ | age | +is_male | +barbie | +
---|---|---|---|
0 | +18 | +False | +0.819594 | +
1 | +18 | +True | +0.667001 | +
2 | +19 | +False | +0.812214 | +
3 | +19 | +True | +0.661252 | +
4 | +20 | +False | +0.805281 | +
# A common matplotlib/seaborn pattern: create the figure and axes object, pass ax
+# to seaborn for drawing into, and later fine-tune the figure via ax.
+= plt.subplots();
+ fig, ax
+= ["#bf1518", "#397eb7"]
+ red_blue with sns.color_palette(red_blue):
+=votes_by_barbie, x = "age", y = "barbie", hue = "is_male", ax=ax)
+ sns.pointplot(data
+= [i.get_text() for i in ax.get_xticklabels()]
+ new_ticks range(0, len(new_ticks), 10), new_ticks[::10])
+ ax.set_xticks("Preferences by Demographics"); ax.set_title(
Suppose we took a simple random sample (SRS) of the same size as our retiree sample:
+= len(convenience_sample)
+ n = movie.sample(n, replace = False) ## By default, replace = False
+ random_sample "barbie"]) np.mean(random_sample[
np.float64(0.5306514262818729)
+This is very close to the actual vote of 0.5302792307692308!
+It turns out that we can get similar results with a much smaller sample size, say, 800:
+= 800
+ n = movie.sample(n, replace = False)
+ random_sample
+# Compute the sample average and the resulting relative error
+= np.mean(random_sample["barbie"])
+ sample_barbie = abs(sample_barbie-actual_barbie)/actual_barbie
+ err
+# We can print output with Markdown formatting too...
+from IPython.display import Markdown
+f"**Actual** = {actual_barbie:.4f}, **Sample** = {sample_barbie:.4f}, "
+ Markdown(f"**Err** = {100*err:.2f}%.")
Actual = 0.5303, Sample = 0.5300, Err = 0.05%.
+We’ll learn how to choose this number when we (re)learn the Central Limit Theorem later in the semester.
+In our SRS of size 800, what would be our chance error?
+Let’s simulate 1000 versions of taking the 800-sized SRS from before:
+= 1000 # number of simulations
+ nrep = 800 # size of our sample
+ n = []
+ poll_result for i in range(0, nrep):
+= movie.sample(n, replace = False)
+ random_sample "barbie"])) poll_result.append(np.mean(random_sample[
= plt.subplots()
+ fig, ax ='density', ax=ax)
+ sns.histplot(poll_result, stat="orange", lw=4); ax.axvline(actual_barbie, color
What fraction of these simulated samples would have predicted Barbie?
+= pd.Series(poll_result)
+ poll_result sum(poll_result > 0.5)/1000 np.
np.float64(0.949)
+You can see the curve looks roughly Gaussian/normal. Using KDE:
+='density', kde=True); sns.histplot(poll_result, stat
Understanding the sampling process is what lets us go from describing the data to understanding the world. Without knowing / assuming something about how the data were collected, there is no connection between the sample and the population. Ultimately, the dataset doesn’t tell us about the world behind the data.
+ + +So far in the course, we have made our way through the entire data science lifecycle: we learned how to load and explore a dataset, formulate questions, and use the tools of prediction and inference to come up with answers. For the remaining weeks of the semester, we are going to make a second pass through the lifecycle, this time with a different set of tools, ideas, and abstractions.
+With this goal in mind, let’s go back to the very beginning of the lifecycle. We first started our work in data analysis by looking at the pandas
library, which offered us powerful tools to manipulate tabular data stored in (primarily) CSV files. CSVs work well when analyzing relatively small datasets (less than 10GB) that don’t need to be shared across many users. In research and industry, however, data scientists often need to access enormous bodies of data that cannot be easily stored in a CSV format. Collaborating with others when working with CSVs can also be tricky —— a real-world data scientist may run into problems when multiple users try to make modifications or more dire security issues arise regarding who should and should not have access to the data.
A database is a large, organized collection of data. Databases are administered by Database Management Systems (DBMS), which are software systems that store, manage, and facilitate access to one or more databases. Databases help mitigate many of the issues that come with using CSVs for data storage: they provide reliable storage that can survive system crashes or disk failures, are optimized to compute on data that does not fit into memory, and contain special data structures to improve performance. Using databases rather than CSVs offers further benefits from the standpoint of data management. A DBMS can apply settings that configure how data is organized, block certain data anomalies (for example, enforcing non-negative weights or ages), and determine who is allowed access to the data. It can also ensure safe concurrent operations where multiple users reading and writing to the database will not lead to fatal errors. Below, you can see the functionality of the different types of data storage and management architectures. In data science, common large-scale DBMS systems used are Google BigQuery, Amazon Redshift, Snowflake, Databricks, Microsoft SQL Server, and more. To learn more about these, consider taking Data 101!
+
+
+
As you may have guessed, we can’t use our usual pandas
methods to work with data in a database. Instead, we’ll turn to Structured Query Language.
Structured Query Language, or SQL (commonly pronounced “sequel,” though this is the subject of fierce debate), is a special programming language designed to communicate with databases, and it is the dominant language/technology for working with data. You may have encountered it in classes like CS 61A or Data C88C before, and you likely will encounter it in the future. It is a language of tables: all inputs and outputs are tables. Unlike Python, it is a declarative programming language – this means that rather than writing the exact logic needed to complete a task, a piece of SQL code “declares” what the desired final output should be and leaves the program to determine what logic should be implemented. This logic differs depending on the SQL code itself or on the system it’s running on (ie. MongoDB, SQLite, DuckDB, etc.). Most systems don’t follow the standards, and every system you work with will be a little different.
+For the purposes of Data 100, we use SQLite or DuckDB. SQLite is an easy-to-use library that allows users to directly manipulate a database file or an in-memory database with a simplified version of SQL. It’s commonly used to store data for small apps on mobile devices and is optimized for simplicity and speed of simple data tasks. DuckDB is an easy-to-use library that lets you directly manipulate a database file, collection of table formatted files (e.g., CSV), or in-memory pandas
DataFrame
s using a more complete version of SQL. It’s optimized for simplicity and speed of advanced data analysis tasks and is becoming increasingly popular for data analysis tasks on large datasets.
It is important to reiterate that SQL is an entirely different language from Python. However, Python does have special engines that allow us to run SQL code in a Jupyter notebook. While this is typically not how SQL is used outside of an educational setting, we will use this workflow to illustrate how SQL queries are constructed using the tools we’ve already worked with this semester. You will learn more about how to run SQL queries in Jupyter in an upcoming lab and homework.
+The syntax below will seem unfamiliar to you; for now, just focus on understanding the output displayed. We will clarify the SQL code in a bit.
+To start, we’ll look at a database called example_duck.db
and connect to it using DuckDB.
# Load the SQL Alchemy Python library and DuckDB
+import sqlalchemy
+import duckdb
# Load %%sql cell magic
+%load_ext sql
# Connect to the database
+%sql duckdb:///data/example_duck.db --alias duck
Now that we’re connected, let’s make some queries!
+%%sql
+* FROM Dragon; SELECT
* duckdb:///data/example_duck.db
+Done.
+name | +year | +cute | +
---|
Thanks to the pandas
magic, the resulting return data is displayed in a format almost identical to our pandas
tables but without an index.
+
+
Looking at the Dragon
table above, we can see that it contains three columns. The first of these, "name"
, contains text data. The "year"
column contains integer data, with the constraint that year values must be greater than or equal to 2000. The final column, "cute"
, contains integer data with no restrictions on allowable values.
Now, let’s look at the schema of our database. A schema describes the logical structure of a table. Whenever a new table is created, the creator must declare its schema.
+%%sql
SELECT *
FROM sqlite_master
WHERE type='table'
* duckdb:///data/example_duck.db
+Done.
+type | +name | +tbl_name | +rootpage | +sql | +
---|
The summary above displays information about the database; it contains four tables named sqlite_sequence
, Dragon
, Dish
, and Scene
. The rightmost column above lists the command that was used to construct each table.
Let’s look more closely at the command used to create the Dragon
table (the second entry above).
CREATE TABLE Dragon (name TEXT PRIMARY KEY,
+ year INTEGER CHECK (year >= 2000),
+ cute INTEGER)
+The statement CREATE TABLE
is used to specify the schema of the table – a description of what logic is used to organize the table. Schema follows a set format:
ColName
: the name of a column
DataType
: the type of data to be stored in a column. Some of the most common SQL data types are:
INT
(integers)FLOAT
(floating point numbers)TEXT
(strings)BLOB
(arbitrary data, such as audio/video files)DATETIME
(a date and time)Constraint
: some restriction on the data to be stored in the column. Common constraints are:
CHECK
(data must obey a certain condition)PRIMARY KEY
(designate a column as the table’s primary key)NOT NULL
(data cannot be null)DEFAULT
(a default fill value if no specific entry is given)Note that different implementations of SQL (e.g., DuckDB, SQLite, MySQL) will support different types. In Data 100, we’ll primarily use DuckDB.
+Database tables (also referred to as relations) are structured much like DataFrame
s in pandas
. Each row, sometimes called a tuple, represents a single record in the dataset. Each column, sometimes called an attribute or field, describes some feature of the record.
The primary key is a set of column(s) that uniquely identify each record in the table. In the Dragon
table, the "name"
column is its primary key that uniquely identifies each entry in the table. Because "name"
is the primary key of the table, no two entries in the table can have the same name – a given value of "name"
is unique to each dragon. Primary keys are used to ensure data integrity and to optimize data access.
A foreign key is a column or set of columns that references a primary key in another table. A foreign key constraint ensures that a primary key exists in the referenced table. For example, let’s say we have 2 tables, student
and assignment
, with the following schemas:
CREATE TABLE student (
+ student_id INTEGER PRIMARY KEY,
+ name VARCHAR,
+ email VARCHAR
+);
+
+CREATE TABLE assignment (
+ assignment_id INTEGER PRIMARY KEY,
+ description VARCHAR
+);
+Note that each table has a primary key that uniquely identifies each student and assignment.
+Say we want to create the table grade
to store the score each student got on each assignment. Naturally, this will depend on the information in student
and assignment
; we should not be saving the grade for a nonexistent student nor a nonexistent assignment. Hence, we can create the columns student_id
and assignment_id
that reference foreign tables student
and assignment
, respectively. This way, we ensure that the data in grade
is always up-to-date with the other tables.
CREATE TABLE grade (
+ student_id INTEGER,
+ assignment_id INTEGER,
+ score REAL,
+ FOREIGN KEY (student_id) REFERENCES student(student_id),
+ FOREIGN KEY (assignment_id) REFERENCES assignment(assignment_id)
+);
+To extract and manipulate data stored in a SQL table, we will need to familiarize ourselves with the syntax to write pieces of SQL code, which we call queries.
+SELECT
ing From TablesThe basic unit of a SQL query is the SELECT
statement. SELECT
specifies what columns we would like to extract from a given table. We use FROM
to tell SQL the table from which we want to SELECT
our data.
%%sql
SELECT *
FROM Dragon;
* duckdb:///data/example_duck.db
+Done.
+name | +year | +cute | +
---|
In SQL, *
means “everything.” The query above grabs all the columns in Dragon
and displays them in the outputted table. We can also specify a specific subset of columns to be SELECT
ed. Notice that the outputted columns appear in the order they were SELECT
ed.
%%sql
SELECT cute, year
FROM Dragon;
* duckdb:///data/example_duck.db
+Done.
+cute | +year | +
---|
Every SQL query must include both a SELECT
and FROM
statement. Intuitively, this makes sense —— we know that we’ll want to extract some piece of information from the table; to do so, we also need to indicate what table we want to consider.
It is important to note that SQL enforces a strict “order of operations” —— SQL clauses must always follow the same sequence. For example, the SELECT
statement must always precede FROM
. This means that any SQL query will follow the same structure.
SELECT <column list>
+FROM <table>
+[additional clauses]
+The additional clauses we use depend on the specific task we’re trying to achieve. We may refine our query to filter on a certain condition, aggregate a particular column, or join several tables together. We will spend the rest of this note outlining some useful clauses to build up our understanding of the order of operations.
+And just like that, we’ve already written two SQL queries. There are a few things to note in the queries above. Firstly, notice that every “verb” is written in uppercase. It is convention to write SQL operations in capital letters, but your code will run just fine even if you choose to keep things in lowercase. Second, the query above separates each statement with a new line. SQL queries are not impacted by whitespace within the query; this means that SQL code is typically written with a new line after each statement to make things more readable. The semicolon (;
) indicates the end of a query. There are some “flavors” of SQL in which a query will not run if no semicolon is present; however, in Data 100, the SQL version we will use works with or without an ending semicolon. Queries in these notes will end with semicolons to build up good habits.
AS
The AS
keyword allows us to give a column a new name (called an alias) after it has been SELECT
ed. The general syntax is:
SELECT column_in_input_table AS new_name_in_output_table
+%%sql
SELECT cute AS cuteness, year AS birth
FROM Dragon;
* duckdb:///data/example_duck.db
+Done.
+cuteness | +birth | +
---|
DISTINCT
To SELECT
only the unique values in a column, we use the DISTINCT
keyword. This will cause any duplicate entries in a column to be removed. If we want to find only the unique years in Dragon
, without any repeats, we would write:
%%sql
SELECT DISTINCT year
FROM Dragon;
* duckdb:///data/example_duck.db
+Done.
+year | +
---|
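As a rough point of comparison (not one of the queries above), pandas offers .unique() and .drop_duplicates() for the same idea; here is a minimal sketch on a small invented DataFrame:

import pandas as pd

toy = pd.DataFrame({"year": [2010, 2011, 2011, 2019]})   # invented values for illustration
toy["year"].unique()      # distinct values of one column
toy.drop_duplicates()     # distinct rows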
WHERE
ConditionsThe WHERE
keyword is used to select only some rows of a table, filtered on a given Boolean condition.
%%sql
SELECT name, year
FROM Dragon
WHERE cute > 0;
* duckdb:///data/example_duck.db
+Done.
+name | +year | +
---|
We can add complexity to the WHERE
condition using the keywords AND
, OR
, and NOT
, much like we would in Python.
%%sql
SELECT name, year
FROM Dragon
WHERE cute > 0 OR year > 2013;
* duckdb:///data/example_duck.db
+Done.
+name | +year | +
---|
To spare ourselves needing to write complicated logical expressions by combining several conditions, we can also filter for entries that are IN
a specified list of values. This is similar to the use of in
or .isin
in Python.
%%sql
SELECT name, year
FROM Dragon
WHERE name IN ('hiccup', 'puff');
* duckdb:///data/example_duck.db
+Done.
+name | +year | +
---|
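For intuition, the same filters can be written with boolean indexing and .isin in pandas. Below is a minimal sketch on an invented stand-in for the Dragon table; the rows are made up for illustration and are not the real data.

import pandas as pd

dragon_df = pd.DataFrame({
    "name": ["hiccup", "drogon", "puff"],   # invented rows for illustration
    "year": [2010, 2011, 2015],
    "cute": [10, -100, 100],
})

dragon_df[dragon_df["cute"] > 0]                                   # WHERE cute > 0
dragon_df[(dragon_df["cute"] > 0) | (dragon_df["year"] > 2013)]    # WHERE cute > 0 OR year > 2013
dragon_df[dragon_df["name"].isin(["hiccup", "puff"])]              # WHERE name IN ('hiccup', 'puff')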
In Python
, there is no distinction between double ""
and single quotes ''
. SQL, on the other hand, distinguishes double quotes ""
as column names and single quotes ''
as strings. For example, we can make the call
SELECT "birth weight"
+FROM patient
+WHERE "first name" = 'Joey'
+to select the column "birth weight"
from the patient
table and only select rows where the column "first name"
is equal to 'Joey'
.
WHERE
WITH NULL
ValuesYou may have noticed earlier that our table actually has a missing value. In SQL, missing data is given the special value NULL
. NULL
behaves in a fundamentally different way to other data types. We can’t use the typical operators (=, >, and <) on NULL
values (in fact, NULL == NULL
returns False
!). Instead, we check to see if a value IS
or IS NOT
NULL
.
%%sql
SELECT name, cute
FROM Dragon
WHERE cute IS NOT NULL;
* duckdb:///data/example_duck.db
+Done.
+name | +cute | +
---|
ORDER BY
What if we want the output table to appear in a certain order? The ORDER BY
keyword behaves similarly to .sort_values()
in pandas
.
%%sql
SELECT *
FROM Dragon
ORDER BY cute;
* duckdb:///data/example_duck.db
+Done.
+name | +year | +cute | +
---|
By default, ORDER BY
will display results in ascending order (ASC
) with the lowest values at the top of the table. To sort in descending order, we use the DESC
keyword after specifying the column to be used for ordering.
%%sql
SELECT *
FROM Dragon
ORDER BY cute DESC;
* duckdb:///data/example_duck.db
+Done.
+name | +year | +cute | +
---|
We can also tell SQL to ORDER BY
two columns at once. This will sort the table by the first listed column, then use the values in the second listed column to break any ties.
%%sql
SELECT *
FROM Dragon
ORDER BY year, cute DESC;
* duckdb:///data/example_duck.db
+Done.
+name | +year | +cute | +
---|
Note that in this example, year
is sorted in ascending order and cute
in descending order. If you want year
to be ordered in descending order as well, you need to specify year DESC, cute DESC;
.
LIMIT
vs. OFFSET
In many instances, we are only concerned with a certain number of rows in the output table (for example, wanting to find the first two dragons in the table). The LIMIT
keyword restricts the output to a specified number of rows. It serves a function similar to that of .head()
in pandas
.
%%sql
SELECT *
FROM Dragon
LIMIT 2;
* duckdb:///data/example_duck.db
+Done.
+name | +year | +cute | +
---|
The OFFSET
keyword indicates the index at which LIMIT
should start. In other words, we can use OFFSET
to shift where the LIMIT
ing begins by a specified number of rows. For example, we might care about the dragons that are at positions 2 and 3 in the table.
%%sql
SELECT *
FROM Dragon
LIMIT 2
OFFSET 1;
* duckdb:///data/example_duck.db
+Done.
+name | +year | +cute | +
---|
With these keywords in hand, let’s update our SQL order of operations. Remember: every SQL query must list clauses in this order.
+SELECT <column expression list>
+FROM <table>
+[WHERE <predicate>]
+[ORDER BY <column list>]
+[LIMIT <number of rows>]
+[OFFSET <number of rows>];
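To see every optional clause in one place, here is a minimal sketch that runs a complete query with DuckDB's Python API. It relies on DuckDB's ability to query a pandas DataFrame that is in scope by name; the dragon_df values below are invented for illustration and are not the course's Dragon table.

import duckdb
import pandas as pd

dragon_df = pd.DataFrame({
    "name": ["hiccup", "drogon", "puff"],   # invented rows for illustration
    "year": [2010, 2011, 2015],
    "cute": [10, -100, 100],
})

# DuckDB can scan the in-scope DataFrame; the clauses appear in the required order
duckdb.sql("""
    SELECT name, year AS birth
    FROM dragon_df
    WHERE cute > 0
    ORDER BY year DESC
    LIMIT 1
    OFFSET 1;
""")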
+Let’s summarize what we’ve learned so far. We know that SELECT
and FROM
are the fundamental building blocks of any SQL query. We can augment these two keywords with additional clauses to refine the data in our output table.
Any clauses that we include must follow a strict ordering within the query:
+SELECT <column list>
+FROM <table>
+[WHERE <predicate>]
+[ORDER BY <column list>]
+[LIMIT <number of rows>]
+[OFFSET <number of rows>]
+Here, any clause contained in square brackets [ ]
is optional —— we only need to use the keyword if it is relevant to the table operation we want to perform. Also note that by convention, we use all caps for keywords in SQL statements and use newlines to make code more readable.
In this lecture, we’ll continue our work from last time to introduce some advanced SQL syntax.
+First, let’s load in the basic_examples.db
database.
# Load the SQL Alchemy Python library and DuckDB
+import sqlalchemy
+import duckdb
# Load %%sql cell magic
+%load_ext sql
# Connect to the database
+%sql duckdb:///data/basic_examples.db --alias basic
GROUP BY
At this point, we’ve seen that SQL offers much of the same functionality that was given to us by pandas
. We can extract data from a table, filter it, and reorder it to suit our needs.
In pandas
, much of our analysis work relied heavily on being able to use .groupby()
to aggregate across the rows of our dataset. SQL’s answer to this task is the (very conveniently named) GROUP BY
clause. While the outputs of GROUP BY
are similar to those of .groupby()
—— in both cases, we obtain an output table where some column has been used for grouping —— the syntax and logic used to group data in SQL are fairly different to the pandas
implementation.
To illustrate GROUP BY
, we will consider the Dish
table from our database.
%%sql
SELECT *
FROM Dish;
* duckdb:///data/basic_examples.db
+Done.
+name | +type | +cost | +
---|
Notice that there are multiple dishes of the same type
. What if we wanted to find the total costs of dishes of a certain type
? To accomplish this, we would write the following code.
%%sql
SELECT type, SUM(cost)
FROM Dish
GROUP BY type;
* duckdb:///data/basic_examples.db
+Done.
+type | +sum("cost") | +
---|
What is going on here? The statement GROUP BY type
tells SQL to group the data based on the value contained in the type
column (whether a record is an appetizer, entree, or dessert). SUM(cost)
sums up the costs of dishes in each type
and displays the result in the output table.
You may be wondering: why does SUM(cost)
come before the command to GROUP BY type
? Don’t we need to form groups before we can count the number of entries in each? Remember that SQL is a declarative programming language —— a SQL programmer simply states what end result they would like to see, and leaves the task of figuring out how to obtain this result to SQL itself. This means that SQL queries sometimes don’t follow what a reader sees as a “logical” sequence of thought. Instead, SQL requires that we follow its set order of operations when constructing queries. So long as we follow this order, SQL will handle the underlying logic.
In practical terms: our goal with this query was to output the total cost
s of each type
. To communicate this to SQL, we say that we want to SELECT
the SUM
med cost
values for each type
group.
There are many aggregation functions that can be used to aggregate the data contained in each group. Some common examples are:
+COUNT
: count the number of rows associated with each groupMIN
: find the minimum value of each groupMAX
: find the maximum value of each groupSUM
: sum across all records in each groupAVG
: find the average value of each groupWe can easily compute multiple aggregations all at once (a task that was very tricky in pandas
).
%%sql
SELECT type, SUM(cost), MIN(cost), MAX(name)
FROM Dish
GROUP BY type;
* duckdb:///data/basic_examples.db
+Done.
+type | +sum("cost") | +min("cost") | +max("name") | +
---|
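For comparison, a rough pandas counterpart of the query above uses .groupby() with .agg(). The dish_df rows below are invented stand-ins for the Dish table, used only to make the sketch runnable.

import pandas as pd

dish_df = pd.DataFrame({
    "name": ["ravioli", "pork bun", "taco", "mochi"],   # invented rows for illustration
    "type": ["entree", "entree", "entree", "dessert"],
    "cost": [9, 7, 7, 3],
})

dish_df.groupby("type").agg({"cost": ["sum", "min"], "name": "max"})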
To count the number of rows associated with each group, we use the COUNT
keyword. Calling COUNT(*)
will compute the total number of rows in each group, including rows with null values. Its pandas
equivalent is .groupby().size()
.
Recall the Dragon
table from the previous lecture:
%%sql
SELECT * FROM Dragon;
* duckdb:///data/basic_examples.db
+Done.
+name | +year | +cute | +
---|
Notice that COUNT(*)
and COUNT(cute)
result in different outputs.
%%sql
SELECT year, COUNT(*)
FROM Dragon
GROUP BY year;
* duckdb:///data/basic_examples.db
+Done.
+year | +count_star() | +
---|
%%sql
SELECT year, COUNT(cute)
FROM Dragon
GROUP BY year;
* duckdb:///data/basic_examples.db
+Done.
+year | +count(cute) | +
---|
With this definition of GROUP BY
in hand, let’s update our SQL order of operations. Remember: every SQL query must list clauses in this order.
SELECT <column expression list>
+FROM <table>
+[WHERE <predicate>]
+[GROUP BY <column list>]
+[ORDER BY <column list>]
+[LIMIT <number of rows>]
+[OFFSET <number of rows>];
+Note that we can use the AS
keyword to rename columns during the selection process and that column expressions may include aggregation functions (MAX
, MIN
, etc.).
Now, what if we only want groups that meet a certain condition? HAVING
filters groups by applying some condition across all rows in each group. We interpret it as a way to keep only the groups HAVING
some condition. Note the difference between WHERE
and HAVING
: we use WHERE
to filter rows, whereas we use HAVING
to filter groups. WHERE
precedes HAVING
in terms of how SQL executes a query.
Let’s take a look at the Dish
table to see how we can use HAVING
. Say we want to group dishes with a cost greater than 4 by type
and only keep groups where the max cost is less than 10.
%%sql
SELECT type, COUNT(*)
FROM Dish
WHERE cost > 4
GROUP BY type
HAVING MAX(cost) < 10;
* duckdb:///data/basic_examples.db
+Done.
+type | +count_star() | +
---|
Here, we first use WHERE
to filter for rows with a cost greater than 4. We then group our values by type
before applying the HAVING
operator. With HAVING
, we can filter our groups based on if the max cost is less than 10.
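One way to see the row-filter versus group-filter distinction is a rough pandas sketch on invented Dish-like data: the WHERE step is boolean indexing, and the HAVING step is .filter() on the grouped data. This is only an analogy, not the course's own example.

import pandas as pd

dish_df = pd.DataFrame({
    "name": ["ravioli", "pork bun", "taco", "mochi"],   # invented rows for illustration
    "type": ["entree", "entree", "entree", "dessert"],
    "cost": [9, 7, 7, 3],
})

subset = dish_df[dish_df["cost"] > 4]                                 # WHERE cost > 4
kept = subset.groupby("type").filter(lambda g: g["cost"].max() < 10)  # HAVING MAX(cost) < 10
kept.groupby("type").size()                                           # COUNT(*) per remaining group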
With this definition of GROUP BY
and HAVING
in hand, let’s update our SQL order of operations. Remember: every SQL query must list clauses in this order.
SELECT <column expression list>
+FROM <table>
+[WHERE <predicate>]
+[GROUP BY <column list>]
+[ORDER BY <column list>]
+[LIMIT <number of rows>]
+[OFFSET <number of rows>];
+Note that we can use the AS
keyword to rename columns during the selection process and that column expressions may include aggregation functions (MAX
, MIN
, etc.).
In the last lecture, we mostly worked under the assumption that our data had already been cleaned. However, as we saw in our first pass through the data science lifecycle, we’re very unlikely to be given data that is free of formatting issues. With this in mind, we’ll want to learn how to clean and transform data in SQL.
Our typical workflow when working with "big data" is:
First, use SQL to query the database and obtain a smaller, more manageable subset of the data. Then, use a Python library (such as pandas
) to analyze this data in detailWe can, however, still perform simple data cleaning and re-structuring using SQL directly. To do so, we’ll use the Title
table from the imdb_duck
database, which contains information about movies and actors.
Let’s load in the imdb_duck
database.
import os
os.environ["TQDM_DISABLE"] = "1"
if os.path.exists("/home/jovyan/shared/sql/imdb_duck.db"):
    imdbpath = "duckdb:////home/jovyan/shared/sql/imdb_duck.db"
elif os.path.exists("data/imdb_duck.db"):
    imdbpath = "duckdb:///data/imdb_duck.db"
else:
    import gdown
    url = 'https://drive.google.com/uc?id=10tKOHGLt9QoOgq5Ii-FhxpB9lDSQgl1O'
    output_path = 'data/imdb_duck.db'
    gdown.download(url, output_path, quiet=False)
    imdbpath = "duckdb:///data/imdb_duck.db"
from sqlalchemy import create_engine
imdb_engine = create_engine(imdbpath, connect_args={'read_only': True})
%sql imdb_engine --alias imdb
* duckdb:///data/basic_examples.db
+(duckdb.duckdb.ParserException) Parser Error: syntax error at or near "imdb_engine"
+[SQL: imdb_engine]
+(Background on this error at: https://sqlalche.me/e/20/f405)
+Since we’ll be working with the Title
table, let’s take a quick look at what it contains.
%%sql imdb
SELECT *
FROM Title
WHERE primaryTitle IN ('Ginny & Georgia', 'What If...?', 'Succession', 'Veep', 'Tenet')
LIMIT 10;
* duckdb:///data/basic_examples.db
+(duckdb.duckdb.ParserException) Parser Error: syntax error at or near "imdb"
+[SQL: imdb
+
+SELECT *
+FROM Title
+WHERE primaryTitle IN ('Ginny & Georgia', 'What If...?', 'Succession', 'Veep', 'Tenet')
+LIMIT 10;]
+(Background on this error at: https://sqlalche.me/e/20/f405)
+LIKE
One common task we encountered in our first look at EDA was needing to match string data. For example, we might want to remove entries beginning with the same prefix as part of the data cleaning process.
+In SQL, we use the LIKE
operator to (you guessed it) look for strings that are like a given string pattern.
%%sql
SELECT titleType, primaryTitle
FROM Title
WHERE primaryTitle LIKE 'Star Wars: Episode I - The Phantom Menace'
* duckdb:///data/basic_examples.db
+(duckdb.duckdb.CatalogException) Catalog Error: Table with name Title does not exist!
+Did you mean "system.information_schema.tables"?
+LINE 2: FROM Title
+ ^
+[SQL: SELECT titleType, primaryTitle
+FROM Title
+WHERE primaryTitle LIKE 'Star Wars: Episode I - The Phantom Menace']
+(Background on this error at: https://sqlalche.me/e/20/f405)
+What if we wanted to find all Star Wars movies? %
is the wildcard operator; it means "look for any character, any number of times." This makes it helpful for identifying strings that are similar to our desired pattern, even when we don't know the full text of what we aim to extract.
%%sql
SELECT titleType, primaryTitle
FROM Title
WHERE primaryTitle LIKE '%Star Wars%'
LIMIT 10;
* duckdb:///data/basic_examples.db
+(duckdb.duckdb.CatalogException) Catalog Error: Table with name Title does not exist!
+Did you mean "system.information_schema.tables"?
+LINE 2: FROM Title
+ ^
+[SQL: SELECT titleType, primaryTitle
+FROM Title
+WHERE primaryTitle LIKE '%Star Wars%'
+LIMIT 10;]
+(Background on this error at: https://sqlalche.me/e/20/f405)
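A loose pandas analogue of this kind of substring match is .str.contains(); the titles Series below is invented for illustration only.

import pandas as pd

titles = pd.Series(["Star Wars: Episode I - The Phantom Menace", "Tenet", "Solo: A Star Wars Story"])
titles[titles.str.contains("Star Wars")]   # roughly: WHERE primaryTitle LIKE '%Star Wars%'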
+Alternatively, we can use RegEx! DuckDB and most real DBMSs allow for this. Note that here, we have to use the SIMILAR TO
operator rather than LIKE
.
%%sql
SELECT titleType, primaryTitle
FROM Title
WHERE primaryTitle SIMILAR TO '.*Star Wars*.'
LIMIT 10;
* duckdb:///data/basic_examples.db
+(duckdb.duckdb.CatalogException) Catalog Error: Table with name Title does not exist!
+Did you mean "system.information_schema.tables"?
+LINE 2: FROM Title
+ ^
+[SQL: SELECT titleType, primaryTitle
+FROM Title
+WHERE primaryTitle SIMILAR TO '.*Star Wars*.'
+LIMIT 10;]
+(Background on this error at: https://sqlalche.me/e/20/f405)
+CAST
ing Data TypesA common data cleaning task is converting data to the correct variable type. The CAST
keyword is used to generate a new output column. Each entry in this output column is the result of converting the data in an existing column to a new data type. For example, we may wish to convert numeric data stored as a string to an integer.
%%sql
SELECT primaryTitle, CAST(runtimeMinutes AS INT)
FROM Title;
* duckdb:///data/basic_examples.db
+(duckdb.duckdb.CatalogException) Catalog Error: Table with name Title does not exist!
+Did you mean "system.information_schema.tables"?
+LINE 2: FROM Title;
+ ^
+[SQL: SELECT primaryTitle, CAST(runtimeMinutes AS INT)
+FROM Title;]
+(Background on this error at: https://sqlalche.me/e/20/f405)
+We use CAST
when SELECT
ing columns for our output table. In the example above, we want to SELECT
the columns of integer year and runtime data that is created by the CAST
.
SQL will automatically name a new column according to the command used to SELECT
it, which can lead to unwieldy column names. We can rename the CAST
ed column using the AS
keyword.
%%sql
SELECT primaryTitle AS title, CAST(runtimeMinutes AS INT) AS minutes, CAST(startYear AS INT) AS year
FROM Title
LIMIT 5;
* duckdb:///data/basic_examples.db
+(duckdb.duckdb.CatalogException) Catalog Error: Table with name Title does not exist!
+Did you mean "system.information_schema.tables"?
+LINE 2: FROM Title
+ ^
+[SQL: SELECT primaryTitle AS title, CAST(runtimeMinutes AS INT) AS minutes, CAST(startYear AS INT) AS year
+FROM Title
+LIMIT 5;]
+(Background on this error at: https://sqlalche.me/e/20/f405)
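For intuition, the pandas counterpart of CAST ... AS ... is converting a column's dtype and renaming it. The single-row title_df below is an invented stand-in for the Title table, used only so the sketch runs on its own.

import pandas as pd

title_df = pd.DataFrame({"primaryTitle": ["Tenet"], "runtimeMinutes": ["150"], "startYear": ["2020"]})

out = pd.DataFrame({
    "title": title_df["primaryTitle"],
    "minutes": title_df["runtimeMinutes"].astype(int),   # CAST(runtimeMinutes AS INT) AS minutes
    "year": title_df["startYear"].astype(int),           # CAST(startYear AS INT) AS year
})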
+CASE
When working with pandas
, we often ran into situations where we wanted to generate new columns using some form of conditional statement. For example, say we wanted to describe a film title as “old,” “mid-aged,” or “new,” depending on the year of its release.
In SQL, conditional operations are performed using a CASE
clause. Conceptually, CASE
behaves much like the CAST
operation: it creates a new column that we can then SELECT
to appear in the output. The syntax for a CASE
clause is as follows:
CASE WHEN <condition> THEN <value>
+ WHEN <other condition> THEN <other value>
+ ...
+ ELSE <yet another value>
+ END
+Scanning through the skeleton code above, you can see that the logic is similar to that of an if
statement in Python. The conditional statement is first opened by calling CASE
. Each new condition is specified by WHEN
, with THEN
indicating what value should be filled if the condition is met. ELSE
specifies the value that should be filled if no other conditions are met. Lastly, END
indicates the end of the conditional statement; once END
has been called, SQL will continue evaluating the query as usual.
Let’s see this in action. In the example below, we give the new column created by the CASE
statement the name movie_age
.
%%sql
/* If a movie was filmed before 1950, it is "old"
Otherwise, if a movie was filmed before 2000, it is "mid-aged"
Else, a movie is "new" */

SELECT titleType, startYear,
CASE WHEN startYear < 1950 THEN 'old'
     WHEN startYear < 2000 THEN 'mid-aged'
     ELSE 'new'
     END AS movie_age
FROM Title;
* duckdb:///data/basic_examples.db
+(duckdb.duckdb.CatalogException) Catalog Error: Table with name Title does not exist!
+Did you mean "system.information_schema.tables"?
+LINE 10: FROM Title;
+ ^
+[SQL: /* If a movie was filmed before 1950, it is "old"
+Otherwise, if a movie was filmed before 2000, it is "mid-aged"
+Else, a movie is "new" */
+
+SELECT titleType, startYear,
+CASE WHEN startYear < 1950 THEN 'old'
+ WHEN startYear < 2000 THEN 'mid-aged'
+ ELSE 'new'
+ END AS movie_age
+FROM Title;]
+(Background on this error at: https://sqlalche.me/e/20/f405)
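In pandas, this kind of conditional column is often built with numpy.select; the startYear values below are invented so the sketch is self-contained.

import numpy as np
import pandas as pd

movies_df = pd.DataFrame({"startYear": [1940, 1985, 2012]})   # invented values for illustration
movies_df["movie_age"] = np.select(
    [movies_df["startYear"] < 1950, movies_df["startYear"] < 2000],   # WHEN ... THEN ...
    ["old", "mid-aged"],
    default="new",                                                     # ELSE 'new'
)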
+JOIN
ing TablesAt this point, we’re well-versed in using SQL as a tool to clean, manipulate, and transform data in a table. Notice that this sentence referred to one table, specifically. What happens if the data we need is distributed across multiple tables? This is an important consideration when using SQL —— recall that we first introduced SQL as a language to query from databases. Databases often store data in a multidimensional structure. In other words, information is stored across several tables, with each table containing a small subset of all the data housed by the database.
+A common way of organizing a database is by using a star schema. A star schema is composed of two types of tables. A fact table is the central table of the database —— it contains the information needed to link entries across several dimension tables, which contain more detailed information about the data.
+Say we were working with a database about boba offerings in Berkeley. The dimension tables of the database might contain information about tea varieties and boba toppings. The fact table would be used to link this information across the various dimension tables.
+If we explicitly mark the relationships between tables, we start to see the star-like structure of the star schema.
+To join data across multiple tables, we’ll use the (creatively named) JOIN
keyword. We’ll make things easier for now by first considering the simpler cats
dataset, which consists of the tables s
and t
.
To perform a join, we amend the FROM
clause. You can think of this as saying, “SELECT
my data FROM
tables that have been JOIN
ed together.”
Remember: SQL does not consider newlines or whitespace when interpreting queries. The indentation given in the example below is to help improve readability. If you wish, you can write code that does not follow this formatting.
+SELECT <column list>
+FROM table_1
+ JOIN table_2
+ ON key_1 = key_2;
+We also need to specify what column from each table should be used to determine matching entries. By defining these keys, we provide SQL with the information it needs to pair rows of data together.
+The most commonly used type of SQL JOIN
is the inner join. It turns out you’re already familiar with what an inner join does, and how it works – this is the type of join we’ve been using in pandas
all along! In an inner join, we combine every row in our first table with its matching entry in the second table. If a row from either table does not have a match in the other table, it is omitted from the output.
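Since this is the join we have been using in pandas all along, a minimal .merge() sketch makes the correspondence explicit. The s and t frames below are invented stand-ins with made-up column names, not the actual cats tables.

import pandas as pd

s = pd.DataFrame({"id": [0, 1, 2], "name": ["apricot", "boots", "cally"]})
t = pd.DataFrame({"id": [1, 2, 3], "breed": ["tabby", "tuxedo", "calico"]})

s.merge(t, on="id", how="inner")   # same idea as: FROM s JOIN t ON s.id = t.id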
In a cross join, all possible combinations of rows appear in the output table, regardless of whether or not rows share a matching key. Because all rows are joined, even if there is no matching key, it is not necessary to specify what keys to consider in an ON
statement. A cross join is also known as a cartesian product.
Conceptually, we can interpret an inner join as a cross join, followed by removing all rows that do not share a matching key. Notice that the output of the inner join above contains all rows of the cross join example that contain a single color across the entire row.
+In a left outer join, all rows in the left table are kept in the output table. If a row in the right table shares a match with the left table, this row will be kept; otherwise, the rows in the right table are omitted from the output. We can fill in any missing values with NULL
.
A right outer join keeps all rows in the right table. Rows in the left table are only kept if they share a match in the right table. Again, we can fill in any missing values with NULL
.
In a full outer join, all rows that have a match between the two tables are joined together. If a row has no match in the second table, then the values of the columns for that second table are filled with NULL
. In other words, a full outer join performs an inner join while still keeping rows that have no match in the other table. This is best understood visually:
We have kept the same output achieved using an inner join, with the addition of partially null rows for entries in s
and t
that had no match in the second table.
JOIN
sWhen joining tables, we often create aliases for table names (similarly to what we did with column names in the last lecture). We do this as it is typically easier to refer to aliases, especially when we are working with long table names. We can even reference columns using aliased table names!
+Let’s say we want to determine the average rating of various movies. We’ll need to JOIN
the Title
and Rating
tables and can create aliases for both tables.
%%sql
SELECT primaryTitle, averageRating
FROM Title AS T INNER JOIN Rating AS R
ON T.tconst = R.tconst;
* duckdb:///data/basic_examples.db
+(duckdb.duckdb.CatalogException) Catalog Error: Table with name Title does not exist!
+Did you mean "system.information_schema.tables"?
+LINE 2: FROM Title AS T INNER JOIN Rating AS R
+ ^
+[SQL: SELECT primaryTitle, averageRating
+FROM Title AS T INNER JOIN Rating AS R
+ON T.tconst = R.tconst;]
+(Background on this error at: https://sqlalche.me/e/20/f405)
+Note that the AS
is actually optional! We can create aliases for our tables even without it, but we usually include it for clarity.
%%sql
SELECT primaryTitle, averageRating
FROM Title T INNER JOIN Rating R
ON T.tconst = R.tconst;
* duckdb:///data/basic_examples.db
+(duckdb.duckdb.CatalogException) Catalog Error: Table with name Title does not exist!
+Did you mean "system.information_schema.tables"?
+LINE 2: FROM Title T INNER JOIN Rating R
+ ^
+[SQL: SELECT primaryTitle, averageRating
+FROM Title T INNER JOIN Rating R
+ON T.tconst = R.tconst;]
+(Background on this error at: https://sqlalche.me/e/20/f405)
+For more sophisticated data problems, the queries can become very complex. Common table expressions (CTEs) allow us to break down these complex queries into more manageable parts. To do so, we create temporary tables corresponding to different aspects of the problem and then reference them in the final query:
+WITH
+table_name1 AS (
+ SELECT ...
+),
+table_name2 AS (
+ SELECT ...
+)
+SELECT ...
+FROM
+table_name1,
+table_name2, ...
+Let’s say we want to identify the top 10 action movies that are highly rated (with an average rating greater than 7) and popular (having more than 5000 votes), along with the primary actors who are the most popular. We can use CTEs to break this query down into separate problems. Initially, we can filter to find good action movies and prolific actors separately. This way, in our final join, we only need to change the order.
+%%sql
WITH
good_action_movies AS (
    SELECT *
    FROM Title T JOIN Rating R ON T.tconst = R.tconst
    WHERE genres LIKE '%Action%' AND averageRating > 7 AND numVotes > 5000
),
prolific_actors AS (
    SELECT N.nconst, primaryName, COUNT(*) as numRoles
    FROM Name N JOIN Principal P ON N.nconst = P.nconst
    WHERE category = 'actor'
    GROUP BY N.nconst, primaryName
)
SELECT primaryTitle, primaryName, numRoles, ROUND(averageRating) AS rating
FROM good_action_movies m, prolific_actors a, principal p
WHERE p.tconst = m.tconst AND p.nconst = a.nconst
ORDER BY rating DESC, numRoles DESC
LIMIT 10;
* duckdb:///data/basic_examples.db
+(duckdb.duckdb.CatalogException) Catalog Error: Table with name Title does not exist!
+Did you mean "system.information_schema.tables"?
+LINE 4: F...
+ ^
+[SQL: WITH
+good_action_movies AS (
+ SELECT *
+ FROM Title T JOIN Rating R ON T.tconst = R.tconst
+ WHERE genres LIKE '%Action%' AND averageRating > 7 AND numVotes > 5000
+),
+prolific_actors AS (
+ SELECT N.nconst, primaryName, COUNT(*) as numRoles
+ FROM Name N JOIN Principal P ON N.nconst = P.nconst
+ WHERE category = 'actor'
+ GROUP BY N.nconst, primaryName
+)
+SELECT primaryTitle, primaryName, numRoles, ROUND(averageRating) AS rating
+FROM good_action_movies m, prolific_actors a, principal p
+WHERE p.tconst = m.tconst AND p.nconst = a.nconst
+ORDER BY rating DESC, numRoles DESC
+LIMIT 10;]
+(Background on this error at: https://sqlalche.me/e/20/f405)
+In our journey of the data science lifecycle, we have begun to explore the vast world of exploratory data analysis. More recently, we learned how to pre-process data using various data manipulation techniques. As we work towards understanding our data, there is one key component missing in our arsenal — the ability to visualize and discern relationships in existing data.
+These next two lectures will introduce you to various examples of data visualizations and their underlying theory. In doing so, we’ll motivate their importance in real-world examples with the use of plotting libraries.
You've likely encountered several forms of data visualizations in your studies. You may remember a few such examples from Data 8: line plots, scatter plots, and histograms. Each of these served a unique purpose. For example, line plots displayed how numerical quantities changed over time, while histograms were useful in understanding a variable's distribution.
[Figure: example visualizations, including a line chart, a scatter plot, and a histogram.]
Visualizations are useful for a number of reasons. In Data 100, we consider two areas in particular:
+Altogether, these goals emphasize the fact that visualizations aren’t a matter of making “pretty” pictures; we need to do a lot of thinking about what stylistic choices communicate ideas most effectively.
+This course note will focus on the first half of visualization topics in Data 100. The goal here is to understand how to choose the “right” plot depending on different variable types and, secondly, how to generate these plots using code.
A distribution describes both the set of values that a single variable can take and the frequency of unique values in a single variable. For example, if we're interested in the distribution of students across Data 100 discussion sections, the set of possible values is a list of discussion sections (10-11am, 11-12pm, etc.), and the frequency that each of those values occurs is the number of students enrolled in each section. In other words, we're interested in how a variable is distributed across its possible values. Therefore, distributions must satisfy two properties:
Not a Valid Distribution: individuals can be associated with more than one category, and the bar values are in minutes rather than proportions, so this is not a valid distribution.
Valid Distribution: this example satisfies the two properties of distributions, so it is a valid distribution.
Different plots are more or less suited for displaying particular types of variables, laid out in the diagram below:
+The first step of any visualization is to identify the type(s) of variables we’re working with. From here, we can select an appropriate plot type:
+A bar plot is one of the most common ways of displaying the distribution of a qualitative (categorical) variable. The length of a bar plot encodes the frequency of a category; the width encodes no useful information. The color could indicate a sub-category, but this is not necessarily the case.
+Let’s contextualize this in an example. We will use the World Bank dataset (wb
) in our analysis.
import pandas as pd
+import numpy as np
+
wb = pd.read_csv("data/world_bank.csv", index_col=0)
wb.head()
+ | Continent | +Country | +Primary completion rate: Male: % of relevant age group: 2015 | +Primary completion rate: Female: % of relevant age group: 2015 | +Lower secondary completion rate: Male: % of relevant age group: 2015 | +Lower secondary completion rate: Female: % of relevant age group: 2015 | +Youth literacy rate: Male: % of ages 15-24: 2005-14 | +Youth literacy rate: Female: % of ages 15-24: 2005-14 | +Adult literacy rate: Male: % ages 15 and older: 2005-14 | +Adult literacy rate: Female: % ages 15 and older: 2005-14 | +... | +Access to improved sanitation facilities: % of population: 1990 | +Access to improved sanitation facilities: % of population: 2015 | +Child immunization rate: Measles: % of children ages 12-23 months: 2015 | +Child immunization rate: DTP3: % of children ages 12-23 months: 2015 | +Children with acute respiratory infection taken to health provider: % of children under age 5 with ARI: 2009-2016 | +Children with diarrhea who received oral rehydration and continuous feeding: % of children under age 5 with diarrhea: 2009-2016 | +Children sleeping under treated bed nets: % of children under age 5: 2009-2016 | +Children with fever receiving antimalarial drugs: % of children under age 5 with fever: 2009-2016 | +Tuberculosis: Treatment success rate: % of new cases: 2014 | +Tuberculosis: Cases detection rate: % of new estimated cases: 2015 | +
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | +Africa | +Algeria | +106.0 | +105.0 | +68.0 | +85.0 | +96.0 | +92.0 | +83.0 | +68.0 | +... | +80.0 | +88.0 | +95.0 | +95.0 | +66.0 | +42.0 | +NaN | +NaN | +88.0 | +80.0 | +
1 | +Africa | +Angola | +NaN | +NaN | +NaN | +NaN | +79.0 | +67.0 | +82.0 | +60.0 | +... | +22.0 | +52.0 | +55.0 | +64.0 | +NaN | +NaN | +25.9 | +28.3 | +34.0 | +64.0 | +
2 | +Africa | +Benin | +83.0 | +73.0 | +50.0 | +37.0 | +55.0 | +31.0 | +41.0 | +18.0 | +... | +7.0 | +20.0 | +75.0 | +79.0 | +23.0 | +33.0 | +72.7 | +25.9 | +89.0 | +61.0 | +
3 | +Africa | +Botswana | +98.0 | +101.0 | +86.0 | +87.0 | +96.0 | +99.0 | +87.0 | +89.0 | +... | +39.0 | +63.0 | +97.0 | +95.0 | +NaN | +NaN | +NaN | +NaN | +77.0 | +62.0 | +
5 | +Africa | +Burundi | +58.0 | +66.0 | +35.0 | +30.0 | +90.0 | +88.0 | +89.0 | +85.0 | +... | +42.0 | +48.0 | +93.0 | +94.0 | +55.0 | +43.0 | +53.8 | +25.4 | +91.0 | +51.0 | +
5 rows × 47 columns
+We can visualize the distribution of the Continent
column using a bar plot. There are a few ways to do this.
wb['Continent'].value_counts().plot(kind='bar');
Recall that .value_counts()
returns a Series
with the total count of each unique value. We call .plot(kind='bar')
on this result to visualize these counts as a bar plot.
Plotting methods in pandas
are the least preferred and not supported in Data 100, as their functionality is limited. Instead, future examples will focus on other libraries built specifically for visualizing data. The most well-known library here is matplotlib
.
import matplotlib.pyplot as plt # matplotlib is typically given the alias plt

continent = wb['Continent'].value_counts()
plt.bar(continent.index, continent)
plt.xlabel('Continent')
plt.ylabel('Count');
While more code is required to achieve the same result, matplotlib
is often used over pandas
for its ability to plot more complex visualizations, some of which are discussed shortly.
However, note how we needed to label the axes with plt.xlabel
and plt.ylabel
, as matplotlib
does not support automatic axis labeling. To get around these inconveniences, we can use a more efficient plotting library: seaborn
.
Seaborn
import seaborn as sns # seaborn is typically given the alias sns
sns.countplot(data = wb, x = 'Continent');
In contrast to matplotlib
, the general structure of a seaborn
call involves passing in an entire DataFrame
, and then specifying what column(s) to plot. seaborn.countplot
both counts and visualizes the number of unique values in a given column. This column is specified by the x
argument to sns.countplot
, while the DataFrame
is specified by the data
argument.
For the vast majority of visualizations, seaborn
is far more concise and aesthetically pleasing than matplotlib
. However, the color scheme of this particular bar plot is arbitrary - it encodes no additional information about the categories themselves. This is not always true; color may signify meaningful detail in other visualizations. We’ll explore this more in-depth during the next lecture.
By now, you’ll have noticed that each of these plotting libraries have a very different syntax. As with pandas
, we’ll teach you the important methods in matplotlib
and seaborn
, but you’ll learn more through documentation.
Revisiting our example with the wb
DataFrame, let’s plot the distribution of Gross national income per capita
.
wb.head(5)
+ | Continent | +Country | +Primary completion rate: Male: % of relevant age group: 2015 | +Primary completion rate: Female: % of relevant age group: 2015 | +Lower secondary completion rate: Male: % of relevant age group: 2015 | +Lower secondary completion rate: Female: % of relevant age group: 2015 | +Youth literacy rate: Male: % of ages 15-24: 2005-14 | +Youth literacy rate: Female: % of ages 15-24: 2005-14 | +Adult literacy rate: Male: % ages 15 and older: 2005-14 | +Adult literacy rate: Female: % ages 15 and older: 2005-14 | +... | +Access to improved sanitation facilities: % of population: 1990 | +Access to improved sanitation facilities: % of population: 2015 | +Child immunization rate: Measles: % of children ages 12-23 months: 2015 | +Child immunization rate: DTP3: % of children ages 12-23 months: 2015 | +Children with acute respiratory infection taken to health provider: % of children under age 5 with ARI: 2009-2016 | +Children with diarrhea who received oral rehydration and continuous feeding: % of children under age 5 with diarrhea: 2009-2016 | +Children sleeping under treated bed nets: % of children under age 5: 2009-2016 | +Children with fever receiving antimalarial drugs: % of children under age 5 with fever: 2009-2016 | +Tuberculosis: Treatment success rate: % of new cases: 2014 | +Tuberculosis: Cases detection rate: % of new estimated cases: 2015 | +
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | +Africa | +Algeria | +106.0 | +105.0 | +68.0 | +85.0 | +96.0 | +92.0 | +83.0 | +68.0 | +... | +80.0 | +88.0 | +95.0 | +95.0 | +66.0 | +42.0 | +NaN | +NaN | +88.0 | +80.0 | +
1 | +Africa | +Angola | +NaN | +NaN | +NaN | +NaN | +79.0 | +67.0 | +82.0 | +60.0 | +... | +22.0 | +52.0 | +55.0 | +64.0 | +NaN | +NaN | +25.9 | +28.3 | +34.0 | +64.0 | +
2 | +Africa | +Benin | +83.0 | +73.0 | +50.0 | +37.0 | +55.0 | +31.0 | +41.0 | +18.0 | +... | +7.0 | +20.0 | +75.0 | +79.0 | +23.0 | +33.0 | +72.7 | +25.9 | +89.0 | +61.0 | +
3 | +Africa | +Botswana | +98.0 | +101.0 | +86.0 | +87.0 | +96.0 | +99.0 | +87.0 | +89.0 | +... | +39.0 | +63.0 | +97.0 | +95.0 | +NaN | +NaN | +NaN | +NaN | +77.0 | +62.0 | +
5 | +Africa | +Burundi | +58.0 | +66.0 | +35.0 | +30.0 | +90.0 | +88.0 | +89.0 | +85.0 | +... | +42.0 | +48.0 | +93.0 | +94.0 | +55.0 | +43.0 | +53.8 | +25.4 | +91.0 | +51.0 | +
5 rows × 47 columns
+How should we define our categories for this variable? In the previous example, these were a few unique values of the Continent
column. If we use similar logic here, our categories are the different numerical values contained in the Gross national income per capita
column.
Under this assumption, let’s plot this distribution using the seaborn.countplot
function.
sns.countplot(data = wb, x = 'Gross national income per capita, Atlas method: $: 2016');
What happened? A bar plot (either plt.bar
or sns.countplot
) will create a separate bar for each unique value of a variable. With a continuous variable, we may not have a finite number of possible values, which can lead to situations like above where we would need many, many bars to display each unique value.
Specifically, we can say this plot suffers from overplotting as we are unable to interpret the plot and gain any meaningful insight.
+Rather than bar plots, to visualize the distribution of a continuous variable, we use one of the following types of plots:
+Box plots and violin plots are two very similar kinds of visualizations. Both display the distribution of a variable using information about quartiles.
+In a box plot, the width of the box at any point does not encode meaning. In a violin plot, the width of the plot indicates the density of the distribution at each possible value.
sns.boxplot(data=wb, y='Gross national income per capita, Atlas method: $: 2016');
=wb, y="Gross national income per capita, Atlas method: $: 2016"); sns.violinplot(data
A quartile represents a 25% portion of the data. We say that:
+This means that the middle 50% of the data lies between the first and third quartiles. This is demonstrated in the histogram below. The three quartiles are marked with red vertical bars.
gdp = wb['Gross domestic product: % growth : 2016']
gdp = gdp[~gdp.isna()]

q1, q2, q3 = np.percentile(gdp, [25, 50, 75])

wb_quartiles = wb.copy()
wb_quartiles['category'] = None
wb_quartiles.loc[(wb_quartiles['Gross domestic product: % growth : 2016'] < q1) | (wb_quartiles['Gross domestic product: % growth : 2016'] > q3), 'category'] = 'Outside of the middle 50%'
wb_quartiles.loc[(wb_quartiles['Gross domestic product: % growth : 2016'] > q1) & (wb_quartiles['Gross domestic product: % growth : 2016'] < q3), 'category'] = 'In the middle 50%'

sns.histplot(wb_quartiles, x="Gross domestic product: % growth : 2016", hue="category")
sns.rugplot([q1, q2, q3], c="firebrick", lw=6, height=0.1);
In a box plot, the lower extent of the box lies at Q1, while the upper extent of the box lies at Q3. The horizontal line in the middle of the box corresponds to Q2 (equivalently, the median).
sns.boxplot(data=wb, y='Gross domestic product: % growth : 2016');
The whiskers of a box-plot are the two points that lie at the [\(1^{st}\) Quartile \(-\) (\(1.5\times\) IQR)], and the [\(3^{rd}\) Quartile \(+\) (\(1.5\times\) IQR)]. They are the lower and upper ranges of “normal” data (the points excluding outliers).
+The different forms of information contained in a box plot can be summarised as follows:
+A violin plot displays quartile information, albeit a bit more subtly through smoothed density curves. Look closely at the center vertical bar of the violin plot below; the three quartiles and “whiskers” are still present!
sns.violinplot(data=wb, y='Gross domestic product: % growth : 2016');
Plotting side-by-side box or violin plots allows us to compare distributions across different categories. In other words, they enable us to plot both a qualitative variable and a quantitative continuous variable in one visualization.
+With seaborn
, we can easily create side-by-side plots by specifying both an x and y column.
=wb, x="Continent", y='Gross domestic product: % growth : 2016'); sns.boxplot(data
You are likely familiar with histograms from Data 8. A histogram collects continuous data into bins, then plots this binned data. Each bin reflects the density of datapoints with values that lie between the left and right ends of the bin; in other words, the area of each bin is proportional to the percentage of datapoints it contains.
+Below, we plot a histogram using matplotlib and seaborn. Which graph do you prefer?
+# The `edgecolor` argument controls the color of the bin edges
+= wb["Gross national income per capita, Atlas method: $: 2016"]
+ gni =True, edgecolor="white")
+ plt.hist(gni, density
+# Add labels
+"Gross national income per capita")
+ plt.xlabel("Density")
+ plt.ylabel("Distribution of gross national income per capita"); plt.title(
=wb, x="Gross national income per capita, Atlas method: $: 2016", stat="density")
+ sns.histplot(data"Distribution of gross national income per capita"); plt.title(
We can overlay histograms (or density curves) to compare distributions across qualitative categories.
+The hue
parameter of sns.histplot
specifies the column that should be used to determine the color of each category. hue
can be used in many seaborn
plotting functions.
Notice that the resulting plot includes a legend describing which color corresponds to each hemisphere – a legend should always be included if color is used to encode information in a visualization!
+# Create a new variable to store the hemisphere in which each country is located
+= ["Asia", "Europe", "N. America"]
+ north = ["Africa", "Oceania", "S. America"]
+ south "Continent"].isin(north), "Hemisphere"] = "Northern"
+ wb.loc[wb["Continent"].isin(south), "Hemisphere"] = "Southern" wb.loc[wb[
=wb, x="Gross national income per capita, Atlas method: $: 2016", hue="Hemisphere", stat="density")
+ sns.histplot(data"Distribution of gross national income per capita"); plt.title(
Again, each bin of a histogram is scaled such that its area is proportional to the percentage of all datapoints that it contains.
densities, bins, _ = plt.hist(gni, density=True, edgecolor="white", bins=5)
plt.xlabel("Gross national income per capita")
plt.ylabel("Density")

print(f"First bin has width {bins[1]-bins[0]} and height {densities[0]}")
print(f"This corresponds to {bins[1]-bins[0]} * {densities[0]} = {(bins[1]-bins[0])*densities[0]*100}% of the data")
First bin has width 16410.0 and height 4.7741589911386953e-05
+This corresponds to 16410.0 * 4.7741589911386953e-05 = 78.343949044586% of the data
+Histograms allow us to assess a distribution by their shape. There are a few properties of histograms we can analyze:
+The skew of a histogram describes the direction in which its “tail” extends. - A distribution with a long right tail is skewed right (such as Gross national income per capita
). In a right-skewed distribution, the few large outliers “pull” the mean to the right of the median.
sns.histplot(data = wb, x = 'Gross national income per capita, Atlas method: $: 2016', stat = 'density');
plt.title('Distribution with a long right tail')
Text(0.5, 1.0, 'Distribution with a long right tail')
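As a quick numeric check of this claim, we can compare the mean and median directly; when the mean sits well above the median, the long right tail is doing the pulling. A small sketch using the wb DataFrame loaded above:

gni = wb["Gross national income per capita, Atlas method: $: 2016"].dropna()
print(f"mean = {gni.mean():.1f}, median = {gni.median():.1f}")   # mean > median suggests a right skew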
A distribution with a long left tail is skewed left (such as Access to an improved water source
). In a left-skewed distribution, the few small outliers "pull" the mean to the left of the median. In the case where a distribution has equal-sized right and left tails, it is symmetric. The mean is approximately equal to the median. Think of the mean as the balancing point of the distribution.
sns.histplot(data = wb, x = 'Access to an improved water source: % of population: 2015', stat = 'density');
plt.title('Distribution with a long left tail')
Text(0.5, 1.0, 'Distribution with a long left tail')
+Loosely speaking, an outlier is defined as a data point that lies an abnormally large distance away from other values. Let’s make this more concrete. As you may have observed in the box plot infographic earlier, we define outliers to be the data points that fall beyond the whiskers. Specifically, values that are less than the [\(1^{st}\) Quartile \(-\) (\(1.5\times\) IQR)], or greater than [\(3^{rd}\) Quartile \(+\) (\(1.5\times\) IQR).]
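As a concrete sketch of this rule, the bounds can be computed directly from the quartiles. The snippet below uses the GDP growth column and the wb DataFrame loaded earlier in this note; it is only an illustration of the 1.5 x IQR cutoff, not part of the plotting examples.

import numpy as np

gdp = wb['Gross domestic product: % growth : 2016'].dropna()
q1, q3 = np.percentile(gdp, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = gdp[(gdp < lower) | (gdp > upper)]   # the points that fall beyond the whiskers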
+In Data 100, we describe a “mode” of a histogram as a peak in the distribution. Often, however, it is difficult to determine what counts as its own “peak.” For example, the number of peaks in the distribution of HIV rates across different countries varies depending on the number of histogram bins we plot.
+If we set the number of bins to 5, the distribution appears unimodal.
+# Rename the very long column name for convenience
wb = wb.rename(columns={'Antiretroviral therapy coverage: % of people living with HIV: 2015':"HIV rate"})

# With 5 bins, it seems that there is only one peak
sns.histplot(data=wb, x="HIV rate", stat="density", bins=5)
plt.title("5 histogram bins");
# With 10 bins, there seem to be two peaks
sns.histplot(data=wb, x="HIV rate", stat="density", bins=10)
plt.title("10 histogram bins");
# And with 20 bins, it becomes hard to say what counts as a "peak"!
sns.histplot(data=wb, x="HIV rate", stat="density", bins=20)
plt.title("20 histogram bins");
In part, it is these ambiguities that motivate us to consider using Kernel Density Estimation (KDE), which we will explore more in the next lecture.
Often, we want to identify general trends across a distribution, rather than focus on detail. Smoothing a distribution helps generalize the structure of the data and eliminate noise.
+A kernel density estimate (KDE) is a smooth, continuous function that approximates a curve. It allows us to represent general trends in a distribution without focusing on the details, which is useful for analyzing the broad structure of a dataset.
+More formally, a KDE attempts to approximate the underlying probability distribution from which our dataset was drawn. You may have encountered the idea of a probability distribution in your other classes; if not, we’ll discuss it at length in the next lecture. For now, you can think of a probability distribution as a description of how likely it is for us to sample a particular value in our dataset.
A KDE curve estimates the probability density function of a random variable. Consider the example below, where we have used `sns.displot` to plot both a histogram (containing the data points we actually collected) and a KDE curve (representing the approximated probability distribution from which this data was drawn) using data from the World Bank dataset (`wb`).
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

wb = pd.read_csv("data/world_bank.csv", index_col=0)
wb = wb.rename(columns={'Antiretroviral therapy coverage: % of people living with HIV: 2015':"HIV rate",
                        'Gross national income per capita, Atlas method: $: 2016':'gni'})
wb.head()
+ | Continent | +Country | +Primary completion rate: Male: % of relevant age group: 2015 | +Primary completion rate: Female: % of relevant age group: 2015 | +Lower secondary completion rate: Male: % of relevant age group: 2015 | +Lower secondary completion rate: Female: % of relevant age group: 2015 | +Youth literacy rate: Male: % of ages 15-24: 2005-14 | +Youth literacy rate: Female: % of ages 15-24: 2005-14 | +Adult literacy rate: Male: % ages 15 and older: 2005-14 | +Adult literacy rate: Female: % ages 15 and older: 2005-14 | +... | +Access to improved sanitation facilities: % of population: 1990 | +Access to improved sanitation facilities: % of population: 2015 | +Child immunization rate: Measles: % of children ages 12-23 months: 2015 | +Child immunization rate: DTP3: % of children ages 12-23 months: 2015 | +Children with acute respiratory infection taken to health provider: % of children under age 5 with ARI: 2009-2016 | +Children with diarrhea who received oral rehydration and continuous feeding: % of children under age 5 with diarrhea: 2009-2016 | +Children sleeping under treated bed nets: % of children under age 5: 2009-2016 | +Children with fever receiving antimalarial drugs: % of children under age 5 with fever: 2009-2016 | +Tuberculosis: Treatment success rate: % of new cases: 2014 | +Tuberculosis: Cases detection rate: % of new estimated cases: 2015 | +
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | +Africa | +Algeria | +106.0 | +105.0 | +68.0 | +85.0 | +96.0 | +92.0 | +83.0 | +68.0 | +... | +80.0 | +88.0 | +95.0 | +95.0 | +66.0 | +42.0 | +NaN | +NaN | +88.0 | +80.0 | +
1 | +Africa | +Angola | +NaN | +NaN | +NaN | +NaN | +79.0 | +67.0 | +82.0 | +60.0 | +... | +22.0 | +52.0 | +55.0 | +64.0 | +NaN | +NaN | +25.9 | +28.3 | +34.0 | +64.0 | +
2 | +Africa | +Benin | +83.0 | +73.0 | +50.0 | +37.0 | +55.0 | +31.0 | +41.0 | +18.0 | +... | +7.0 | +20.0 | +75.0 | +79.0 | +23.0 | +33.0 | +72.7 | +25.9 | +89.0 | +61.0 | +
3 | +Africa | +Botswana | +98.0 | +101.0 | +86.0 | +87.0 | +96.0 | +99.0 | +87.0 | +89.0 | +... | +39.0 | +63.0 | +97.0 | +95.0 | +NaN | +NaN | +NaN | +NaN | +77.0 | +62.0 | +
5 | +Africa | +Burundi | +58.0 | +66.0 | +35.0 | +30.0 | +90.0 | +88.0 | +89.0 | +85.0 | +... | +42.0 | +48.0 | +93.0 | +94.0 | +55.0 | +43.0 | +53.8 | +25.4 | +91.0 | +51.0 | +
5 rows × 47 columns
import seaborn as sns
import matplotlib.pyplot as plt

sns.displot(data = wb, x = 'HIV rate',
            kde = True, stat = "density")

plt.title("Distribution of HIV rates");
Notice that the smooth KDE curve is higher when the histogram bins are taller. You can think of the height of the KDE curve as representing how “probable” it is that we randomly sample a datapoint with the corresponding value. This intuitively makes sense – if we have already collected more datapoints with a particular value (resulting in a tall histogram bin), it is more likely that, if we randomly sample another datapoint, we will sample one with a similar value (resulting in a high KDE curve).
+The area under a probability density function should always integrate to 1, representing the fact that the total probability of a distribution should always sum to 100%. Hence, a KDE curve will always have an area under the curve of 1.
We perform kernel density estimation using three steps:

1. Place a kernel at each datapoint.
2. Normalize the kernels so that their total area is 1.
3. Sum the normalized kernels.

We'll explain what a "kernel" is momentarily.
+To make things simpler, let’s construct a KDE for a small, artificially generated dataset of 5 datapoints: \([2.2, 2.8, 3.7, 5.3, 5.7]\). In the plot below, each vertical bar represents one data point.
data = [2.2, 2.8, 3.7, 5.3, 5.7]

sns.rugplot(data, height=0.3)

plt.xlabel("Data")
plt.ylabel("Density")
plt.xlim(-3, 10)
plt.ylim(0, 0.5);
Our goal is to create the following KDE curve, which was generated automatically by `sns.kdeplot`.
sns.kdeplot(data)

plt.xlabel("Data")
plt.xlim(-3, 10)
plt.ylim(0, 0.5);
To begin generating a density curve, we need to choose a kernel and bandwidth value (\(\alpha\)). What are these exactly?
+A kernel is a density curve. It is the mathematical function that attempts to capture the randomness of each data point in our sampled data. To explain what this means, consider just one of the datapoints in our dataset: \(2.2\). We obtained this datapoint by randomly sampling some information out in the real world (you can imagine \(2.2\) as representing a single measurement taken in an experiment, for example). If we were to sample a new datapoint, we may obtain a slightly different value. It could be higher than \(2.2\); it could also be lower than \(2.2\). We make the assumption that any future sampled datapoints will likely be similar in value to the data we’ve already drawn. This means that our kernel – our description of the probability of randomly sampling any new value – will be greatest at the datapoint we’ve already drawn but still have non-zero probability above and below it. The area under any kernel should integrate to 1, representing the total probability of drawing a new datapoint.
A bandwidth value, usually denoted by \(\alpha\), represents the width of the kernel. A large value of \(\alpha\) will result in a wide, short kernel function, while a small value will result in a narrow, tall kernel.
+Below, we place a Gaussian kernel, plotted in orange, over the datapoint \(2.2\). A Gaussian kernel is simply the normal distribution, which you may have called a bell curve in Data 8.
def gaussian_kernel(x, z, a):
    # We'll discuss where this mathematical formulation came from later
    return (1/np.sqrt(2*np.pi*a**2)) * np.exp((-(x - z)**2 / (2 * a**2)))

# Plot our datapoint
sns.rugplot([2.2], height=0.3)

# Plot the kernel
x = np.linspace(-3, 10, 1000)
plt.plot(x, gaussian_kernel(x, 2.2, 1))

plt.xlabel("Data")
plt.ylabel("Density")
plt.xlim(-3, 10)
plt.ylim(0, 0.5);
To begin creating our KDE, we place a kernel on each datapoint in our dataset. For our dataset of 5 points, we will have 5 kernels.
# You will work with the functions below in Lab 4
def create_kde(kernel, pts, a):
    # Takes in a kernel, set of points, and alpha
    # Returns the KDE as a function
    def f(x):
        output = 0
        for pt in pts:
            output += kernel(x, pt, a)
        return output / len(pts) # Normalization factor
    return f

def plot_kde(kernel, pts, a):
    # Calls create_kde and plots the corresponding KDE
    f = create_kde(kernel, pts, a)
    x = np.linspace(min(pts) - 5, max(pts) + 5, 1000)
    y = [f(xi) for xi in x]
    plt.plot(x, y);

def plot_separate_kernels(kernel, pts, a, norm=False):
    # Plots individual kernels, which are then summed to create the KDE
    x = np.linspace(min(pts) - 5, max(pts) + 5, 1000)
    for pt in pts:
        y = kernel(x, pt, a)
        if norm:
            y /= len(pts)
        plt.plot(x, y)

    plt.show()

plt.xlim(-3, 10)
plt.ylim(0, 0.5)
plt.xlabel("Data")
plt.ylabel("Density")

plot_separate_kernels(gaussian_kernel, data, a = 1)
Above, we said that each kernel has an area of 1. Earlier, we also said that our goal is to construct a KDE curve using these kernels with a total area of 1. If we were to directly sum the kernels as they are, we would produce a KDE curve with an integrated area of (5 kernels) \(\times\) (area of 1 each) = 5. To avoid this, we will normalize each of our kernels. This involves multiplying each kernel by \(\frac{1}{\#\:\text{datapoints}}\).
+In the cell below, we multiply each of our 5 kernels by \(\frac{1}{5}\) to apply normalization.
+-3, 10)
+ plt.xlim(0, 0.5)
+ plt.ylim("Data")
+ plt.xlabel("Density")
+ plt.ylabel(
+# The `norm` argument specifies whether or not to normalize the kernels
+= 1, norm = True) plot_separate_kernels(gaussian_kernel, data, a
Our KDE curve is the sum of the normalized kernels. Notice that the final curve is identical to the plot generated by `sns.kdeplot` we saw earlier!
plt.xlim(-3, 10)
plt.ylim(0, 0.5)
plt.xlabel("Data")
plt.ylabel("Density")

plot_kde(gaussian_kernel, data, a = 1)
A general "KDE formula" function is given above. Written mathematically, the KDE evaluated at an input \(x\) is the normalized sum of a kernel \(K_a\) centered at each observed datapoint \(x_i\):

\[f_a(x) = \frac{1}{n}\sum_{i=1}^{n}K_a(x, x_i)\]

Here, \(x\) is any value on the number line, \(x_1, x_2, \dots, x_n\) are the \(n\) observed datapoints, and the factor of \(\frac{1}{n}\) ensures that the total area under the KDE remains 1.
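As a quick sanity check (a sketch reusing the `create_kde`, `gaussian_kernel`, and `data` objects defined above), evaluating this formula by hand matches the function returned by `create_kde`, and the resulting KDE integrates to approximately 1:

```python
import numpy as np

f = create_kde(gaussian_kernel, data, 1)

# Evaluate (1/n) * sum of kernels manually at one input value
manual = sum(gaussian_kernel(3.0, pt, 1) for pt in data) / len(data)
print(np.isclose(f(3.0), manual))   # True

# Numerically integrate the KDE curve; the area should be close to 1
xs = np.linspace(-20, 30, 10_000)
print(np.trapz(f(xs), xs))          # approximately 1.0
```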
A kernel (for our purposes) is a valid density function. This means it:

- must be non-negative for all inputs, and
- must integrate to 1.
The most common kernel is the Gaussian kernel. The Gaussian kernel is equivalent to the Gaussian probability density function (the Normal distribution), centered at the observed value \(x_i\) with a standard deviation of \(\alpha\) (this is known as the bandwidth parameter).
+\[K_a(x, x_i) = \frac{1}{\sqrt{2\pi\alpha^{2}}}e^{-\frac{(x-x_i)^{2}}{2\alpha^{2}}}\]
In this formula:

- \(x\) is the input value at which we evaluate the kernel,
- \(x_i\) is the observed datapoint on which the kernel is centered, and
- \(\alpha\) is the bandwidth parameter.
+The details of this (admittedly intimidating) formula are less important than understanding its role in kernel density estimation – this equation gives us the shape of each kernel.
(Figures: the Gaussian kernel plotted with bandwidths \(\alpha\) = 0.1, 1, 2, and 10. Smaller bandwidths produce narrow, tall kernels; larger bandwidths produce wide, short kernels.)
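To recreate plots like these yourself, a short sketch (reusing the `gaussian_kernel` function defined earlier) overlays the same kernel, centered at the datapoint \(2.2\), for several bandwidth values:

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(-3, 10, 1000)
for a in [0.1, 1, 2, 10]:
    # One Gaussian kernel centered at 2.2 for each bandwidth value
    plt.plot(x, gaussian_kernel(x, 2.2, a), label=f"alpha = {a}")

plt.xlabel("x")
plt.ylabel("Density")
plt.legend();
```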
Another example of a kernel is the boxcar kernel. The boxcar kernel assigns a uniform density to points within a "window" around the observation, and a density of 0 elsewhere. The equation below is a boxcar kernel centered at \(x_i\) with a bandwidth of \(\alpha\).
+\[K_a(x, x_i) = \begin{cases} + \frac{1}{\alpha}, & |x - x_i| \le \frac{\alpha}{2}\\ + 0, & \text{else } + \end{cases}\]
+The boxcar kernel is seldom used in practice – we include it here to demonstrate that a kernel function can take whatever form you would like, provided it integrates to 1 and does not output negative values.
def boxcar_kernel(alpha, x, z):
    return (((x-z)>=-alpha/2)&((x-z)<=alpha/2))/alpha

xs = np.linspace(-5, 5, 200)
alpha = 1
kde_curve = [boxcar_kernel(alpha, x, 0) for x in xs]
plt.plot(xs, kde_curve);
The righthand diagram below shows how the density curve for our 5-point dataset would have looked had we used the boxcar kernel with bandwidth \(\alpha = 1\).
(Figures: the KDE of the 5-point dataset built with the Gaussian kernel (left, labeled "KDE") and with the boxcar kernel (right, labeled "Boxcar"), both with \(\alpha = 1\).)
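To reproduce a comparison like this, here is a sketch that builds both KDEs with the `create_kde` helper from earlier. Note that `boxcar_kernel` above takes its arguments in a different order than `gaussian_kernel`, so we wrap it first; the wrapper name is just for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt

def boxcar_as_kernel(x, z, a):
    # Adapter so the boxcar kernel matches the (x, z, a) signature expected by create_kde
    return boxcar_kernel(a, x, z)

x = np.linspace(-3, 10, 1000)
plt.plot(x, create_kde(gaussian_kernel, data, 1)(x), label="Gaussian kernel")
plt.plot(x, create_kde(boxcar_as_kernel, data, 1)(x), label="Boxcar kernel")

plt.xlabel("Data")
plt.ylabel("Density")
plt.legend();
```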
displot
As we saw earlier, we can use `seaborn`'s `displot` function to plot various distributions. In particular, `displot` allows you to specify the `kind` of plot and is a wrapper for `histplot`, `kdeplot`, and `ecdfplot`.

Below, we can see a couple of examples of how `sns.displot` can be used to plot various distributions.

First, we can plot a histogram by setting `kind` to `"hist"`. Note that here we've specified `stat="density"` to normalize the histogram such that the area under the histogram is equal to 1.
sns.displot(data=wb,
            x="gni",
            kind="hist",
            stat="density") # default: stat=count and density integrates to 1

plt.title("Distribution of gross national income per capita");
Now, what if we want to generate a KDE plot? We can set `kind` to `"kde"`!
sns.displot(data=wb,
            x="gni",
            kind='kde')

plt.title("Distribution of gross national income per capita");
And finally, if we want to generate an Empirical Cumulative Distribution Function (ECDF), we can specify `kind = "ecdf"`.
sns.displot(data=wb,
            x="gni",
            kind='ecdf')

plt.title("Cumulative Distribution of gross national income per capita");
Up until now, we’ve discussed how to visualize single-variable distributions. Going beyond this, we want to understand the relationship between pairs of numerical variables.
+Scatter plots are one of the most useful tools in representing the relationship between pairs of quantitative variables. They are particularly important in gauging the strength, or correlation, of the relationship between variables. Knowledge of these relationships can then motivate decisions in our modeling process.
In `matplotlib`, we use the function `plt.scatter` to generate a scatter plot. Notice that, unlike our examples of plotting single-variable distributions, now we specify sequences of values to be plotted along the x-axis and the y-axis.
"per capita: % growth: 2016"], \
+ plt.scatter(wb['Adult literacy rate: Female: % ages 15 and older: 2005-14'])
+ wb[
+"% growth per capita")
+ plt.xlabel("Female adult literacy rate")
+ plt.ylabel("Female adult literacy against % growth"); plt.title(
In `seaborn`, we call the function `sns.scatterplot`. We use the `x` and `y` parameters to indicate the values to be plotted along the x and y axes, respectively. By using the `hue` parameter, we can specify a third variable to be used for coloring each scatter point.
= wb, x = "per capita: % growth: 2016", \
+ sns.scatterplot(data = "Adult literacy rate: Female: % ages 15 and older: 2005-14",
+ y = "Continent")
+ hue
+"Female adult literacy against % growth"); plt.title(
Although the plots above communicate the general relationship between the two plotted variables, they both suffer a major limitation – overplotting. Overplotting occurs when scatter points with similar values are stacked on top of one another, making it difficult to see the number of scatter points actually plotted in the visualization. Notice how in the upper righthand region of the plots, we cannot easily tell just how many points have been plotted. This makes our visualizations difficult to interpret.
+We have a few methods to help reduce overplotting:
- Decreasing the size of the scatter point markers to improve readability. We can do this by setting a new value for the size parameter, `s`, of `plt.scatter` or `sns.scatterplot`.
- Jittering: adding a small amount of random noise to the x and y values of all datapoints so that overlapping points are shifted slightly apart.

In the cell below, we first jitter the data using `np.random.uniform`, then re-plot it with smaller markers. The resulting plot is much easier to interpret.
# Setting a seed ensures that we produce the same plot each time
# This means that the course notes will not change each time you access them
np.random.seed(150)

# This call to np.random.uniform generates random numbers between -1 and 1
# We add these random numbers to the original x data to jitter it slightly
x_noise = np.random.uniform(-1, 1, len(wb))
jittered_x = wb["per capita: % growth: 2016"] + x_noise

# Repeat for y data
y_noise = np.random.uniform(-5, 5, len(wb))
jittered_y = wb["Adult literacy rate: Female: % ages 15 and older: 2005-14"] + y_noise

# Setting the size parameter `s` changes the size of each point
plt.scatter(jittered_x, jittered_y, s=15)

plt.xlabel("% growth per capita (jittered)")
plt.ylabel("Female adult literacy rate (jittered)")
plt.title("Female adult literacy against % growth");
`lmplot` and `jointplot`

`seaborn` also includes several built-in functions for creating more sophisticated scatter plots. Two of the most commonly used examples are `sns.lmplot` and `sns.jointplot`.

`sns.lmplot` plots both a scatter plot and a linear regression line, all in one function call. We'll discuss linear regression in a few lectures.
= wb, x = "per capita: % growth: 2016", \
+ sns.lmplot(data = "Adult literacy rate: Female: % ages 15 and older: 2005-14")
+ y
+"Female adult literacy against % growth"); plt.title(
`sns.jointplot` creates a visualization with three components: a scatter plot, a histogram of the distribution of x values, and a histogram of the distribution of y values.
= wb, x = "per capita: % growth: 2016", \
+ sns.jointplot(data = "Adult literacy rate: Female: % ages 15 and older: 2005-14")
+ y
+# plt.suptitle allows us to shift the title up so it does not overlap with the histogram
+"Female adult literacy against % growth")
+ plt.suptitle(=0.9); plt.subplots_adjust(top
For datasets with a very large number of datapoints, jittering is unlikely to fully resolve the issue of overplotting. In these cases, we can attempt to visualize our data by its density, rather than displaying each individual datapoint.
+Hex plots can be thought of as two-dimensional histograms that show the joint distribution between two variables. This is particularly useful when working with very dense data. In a hex plot, the x-y plane is binned into hexagons. Hexagons that are darker in color indicate a greater density of data – that is, there are more data points that lie in the region enclosed by the hexagon.
We can generate a hex plot using `sns.jointplot` modified with the `kind` parameter.
= wb, x = "per capita: % growth: 2016", \
+ sns.jointplot(data = "Adult literacy rate: Female: % ages 15 and older: 2005-14", \
+ y = "hex")
+ kind
+# plt.suptitle allows us to shift the title up so it does not overlap with the histogram
+"Female adult literacy against % growth")
+ plt.suptitle(=0.9); plt.subplots_adjust(top
Contour plots are an alternative way of plotting the joint distribution of two variables. You can think of them as the 2-dimensional versions of KDE plots. A contour plot can be interpreted in a similar way to a topographic map. Each contour line represents an area that has the same density of datapoints throughout the region. Contours marked with darker colors contain more datapoints (a higher density) in that region.
`sns.kdeplot` will generate a contour plot if we specify both x and y data.
= wb, x = "per capita: % growth: 2016", \
+ sns.kdeplot(data = "Adult literacy rate: Female: % ages 15 and older: 2005-14", \
+ y = True)
+ fill
+"Female adult literacy against % growth"); plt.title(
We have now covered visualizations in great depth, looking into various forms of visualizations, plotting libraries, and high-level theory.
+Much of this was done to uncover insights in data, which will prove necessary when we begin building models of data later in the course. A strong graphical correlation between two variables hints at an underlying relationship that we may want to study in greater detail. However, relying on visual relationships alone is limiting - not all plots show association. The presence of outliers and other statistical anomalies makes it hard to interpret data.
+Transformations are the process of manipulating data to find significant relationships between variables. These are often found by applying mathematical functions to variables that “transform” their range of possible values and highlight some previously hidden associations between data.
+To see why we may want to transform data, consider the following plot of adult literacy rates against gross national income.
# Some data cleaning to help with the next example
df = pd.DataFrame(index=wb.index)
df['lit'] = wb['Adult literacy rate: Female: % ages 15 and older: 2005-14'] \
            + wb["Adult literacy rate: Male: % ages 15 and older: 2005-14"]
df['inc'] = wb['gni']
df.dropna(inplace=True)

plt.scatter(df["inc"], df["lit"])
plt.xlabel("Gross national income per capita")
plt.ylabel("Adult literacy rate")
plt.title("Adult literacy rate against GNI per capita");
This plot is difficult to interpret for two reasons:

- Most of the datapoints are concentrated in the upper lefthand region of the plot: the large literacy values (y) are clumped together, compressing the vertical axis.
- The few countries with very large gross national incomes (x) stretch out the horizontal axis, squishing the remaining data to the left.
+A transformation would allow us to visualize this data more clearly, which, in turn, would enable us to describe the underlying relationship between our variables of interest.
+We will most commonly apply a transformation to linearize a relationship between variables. If we find a transformation to make a scatter plot of two variables linear, we can “backtrack” to find the exact relationship between the variables. This helps us in two major ways. Firstly, linear relationships are particularly simple to interpret – we have an intuitive sense of what the slope and intercept of a linear trend represent, and how they can help us understand the relationship between two variables. Secondly, linear relationships are the backbone of linear models. We will begin exploring linear modeling in great detail next week. As we’ll soon see, linear models become much more effective when we are working with linearized data.
+In the remainder of this note, we will discuss how to linearize a dataset to produce the result below. Notice that the resulting plot displays a rough linear relationship between the values plotted on the x and y axes.
+To linearize a relationship, begin by asking yourself: what makes the data non-linear? It is helpful to repeat this question for each variable in your visualization.
Let's start by considering the gross national income variable, plotted on the horizontal axis. The scale of this axis is being distorted by the few large outlying x values on the right, while the bulk of the datapoints are clumped together at much smaller incomes.
If we decreased the size of these outliers relative to the bulk of the data, we could reduce the distortion of the horizontal axis. How can we do this? We need a transformation that will:

- decrease the magnitude of large x values by a significant amount, and
- leave the magnitude of small x values relatively unchanged.
+One function that produces this result is the log transformation. When we take the logarithm of a large number, the original number will decrease in magnitude dramatically. Conversely, when we take the logarithm of a small number, the original number does not change its value by as significant of an amount (to illustrate this, consider the difference between \(\log{(100)} = 4.61\) and \(\log{(10)} = 2.3\)).
+In Data 100 (and most upper-division STEM classes), \(\log\) is used to refer to the natural logarithm with base \(e\).
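A quick check of this convention in `numpy`, where `np.log` is the natural logarithm and `np.log10` is the base-10 logarithm:

```python
import numpy as np

print(np.log(np.e))    # 1.0, since log here means the natural logarithm
print(np.log(100))     # roughly 4.61
print(np.log10(100))   # 2.0
```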
# np.log takes the logarithm of an array or Series
plt.scatter(np.log(df["inc"]), df["lit"])

plt.xlabel("Log(gross national income per capita)")
plt.ylabel("Adult literacy rate")
plt.title("Adult literacy rate against Log(GNI per capita)");
After taking the logarithm of our x values, our plot appears much more balanced in its horizontal scale. We no longer have many datapoints clumped on one end and a few outliers out at extreme values.
+Let’s repeat this reasoning for the y values. Considering only the vertical axis of the plot, notice how there are many datapoints concentrated at large y values. Only a few datapoints lie at smaller values of y.
If we were to "spread out" these large values of y more, we would no longer see the dense concentration in one region of the y-axis. We need a transformation that will:

- increase the magnitude of large values of y so that they are more spread out, and
- leave the magnitude of small values of y relatively unchanged.
+In this case, it is helpful to apply a power transformation – that is, raise our y values to a power. Let’s try raising our adult literacy rate values to the power of 4. Large values raised to the power of 4 will increase in magnitude proportionally much more than small values raised to the power of 4 (consider the difference between \(2^4 = 16\) and \(200^4 = 1600000000\)).
# Apply a log transformation to the x values and a power transformation to the y values
plt.scatter(np.log(df["inc"]), df["lit"]**4)

plt.xlabel("Log(gross national income per capita)")
plt.ylabel("Adult literacy rate (4th power)")
plt.suptitle("Adult literacy rate (4th power) against Log(GNI per capita)")
plt.subplots_adjust(top=0.9);
Our scatter plot is looking a lot better! Now, we are plotting the log of our original x values on the horizontal axis, and the 4th power of our original y values on the vertical axis. We start to see an approximate linear relationship between our transformed variables.
+What can we take away from this? We now know that the log of gross national income and adult literacy to the power of 4 are roughly linearly related. If we denote the original, untransformed gross national income values as \(x\) and the original adult literacy rate values as \(y\), we can use the standard form of a linear fit to express this relationship:
+\[y^4 = m(\log{x}) + b\]
where \(m\) represents the slope of the linear fit and \(b\) represents the intercept.
+The cell below computes \(m\) and \(b\) for our transformed data. We’ll discuss how this code was generated in a future lecture.
# The code below fits a linear regression model. We'll discuss it at length in a future lecture
from sklearn.linear_model import LinearRegression

model = LinearRegression()
model.fit(np.log(df[["inc"]]), df["lit"]**4)
m, b = model.coef_[0], model.intercept_

print(f"The slope, m, of the transformed data is: {m}")
print(f"The intercept, b, of the transformed data is: {b}")

df = df.sort_values("inc")
plt.scatter(np.log(df["inc"]), df["lit"]**4, label="Transformed data")
plt.plot(np.log(df["inc"]), m*np.log(df["inc"])+b, c="red", label="Linear regression")
plt.xlabel("Log(gross national income per capita)")
plt.ylabel("Adult literacy rate (4th power)")
plt.legend();

The slope, m, of the transformed data is: 336400693.43172705
The intercept, b, of the transformed data is: -1802204836.0479987
+What if we want to understand the underlying relationship between our original variables, before they were transformed? We can simply rearrange our linear expression above!
+Recall our linear relationship between the transformed variables \(\log{x}\) and \(y^4\).
+\[y^4 = m(\log{x}) + b\]
+By rearranging the equation, we find a relationship between the untransformed variables \(x\) and \(y\).
+\[y = [m(\log{x}) + b]^{(1/4)}\]
+When we plug in the values for \(m\) and \(b\) computed above, something interesting happens.
# Now, plug the values for m and b into the relationship between the untransformed x and y
plt.scatter(df["inc"], df["lit"], label="Untransformed data")
plt.plot(df["inc"], (m*np.log(df["inc"])+b)**(1/4), c="red", label="Modeled relationship")

plt.xlabel("Gross national income per capita")
plt.ylabel("Adult literacy rate")
plt.legend();
We have found a relationship between our original variables – gross national income and adult literacy rate!
Transformations are powerful tools for understanding our data in greater detail. To summarize what we just achieved:

- We identified transformations (the log of x and the 4th power of y) that linearize the original data.
- We fit a line to the transformed data to obtain a slope \(m\) and intercept \(b\).
- We "backtracked" using this slope and intercept to describe the relationship between the original, untransformed variables.
+Linearization will be an important tool as we begin our work on linear modeling next week.
+The Tukey-Mosteller Bulge Diagram is a good guide when determining possible transformations to achieve linearity. It is a visual summary of the reasoning we just worked through above.
+How does it work? Each curved “bulge” represents a possible shape of non-linear data. To use the diagram, find which of the four bulges resembles your dataset the most closely. Then, look at the axes of the quadrant for this bulge. The horizontal axis will list possible transformations that could be applied to your x data for linearization. Similarly, the vertical axis will list possible transformations that could be applied to your y data. Note that each axis lists two possible transformations. While either of these transformations has the potential to linearize your dataset, note that this is an iterative process. It’s important to try out these transformations and look at the results to see whether you’ve actually achieved linearity. If not, you’ll need to continue testing other possible transformations.
Generally:

- \(\sqrt{x}\) and \(\log{x}\) will reduce the magnitude of large values, pulling them closer to the rest of the data.
- Powers (\(x^2\) and \(x^3\)) will increase the magnitude of large values, spreading them further apart. (The same logic applies to transformations of \(y\).)
+Important: You should still understand the logic we worked through to determine how best to transform the data. The bulge diagram is just a summary of this same reasoning. You will be expected to be able to explain why a given transformation is or is not appropriate for linearization.
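In that spirit, here is a small sketch (reusing the `df` constructed above; the particular candidates are just for illustration) that tries a few transformations suggested by the bulge diagram side by side, so we can judge by eye which one comes closest to a linear pattern:

```python
import numpy as np
import matplotlib.pyplot as plt

# Candidate transformations to compare visually
candidates = {
    "log(x) vs y":   (np.log(df["inc"]), df["lit"]),
    "log(x) vs y^2": (np.log(df["inc"]), df["lit"]**2),
    "log(x) vs y^4": (np.log(df["inc"]), df["lit"]**4),
}

fig, axes = plt.subplots(1, 3, figsize=(12, 3))
for ax, (title, (x, y)) in zip(axes, candidates.items()):
    ax.scatter(x, y, s=10)
    ax.set_title(title)
plt.tight_layout()
```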
+Visualization requires a lot of thought!
This class primarily uses `seaborn` and `matplotlib`, but `pandas` also has basic built-in plotting methods. Many other visualization libraries exist, and `plotly` is one of them.

- `plotly` can create interactive plots very easily.
- `plotly` will occasionally appear in lecture code, labs, and assignments!
will occasionally appear in lecture code, labs, and assignments!Next, we’ll go deeper into the theory behind visualization.
+This section marks a pivot to the second major topic of this lecture - visualization theory. We’ll discuss the abstract nature of visualizations and analyze how they convey information.
Remember, we had two goals for visualizing data. This section is particularly important in:

- helping us understand our data and results, and
- communicating our results and conclusions to others.
+Visualizations are able to convey information through various encodings. In the remainder of this lecture, we’ll look at the use of color, scale, and depth, to name a few.
+One detail that we may have overlooked in our earlier discussion of rugplots is the importance of encodings. Rugplots are effective visuals because they utilize line thickness to encode frequency. Consider the following diagram:
+Encodings are also useful for representing multi-dimensional data. Notice how the following visual highlights four distinct “dimensions” of data:
The human visual system can only directly perceive three spatial dimensions, but as you've seen, we can encode many more channels of information.
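For instance, a single `sns.scatterplot` call can encode four variables at once (a sketch using the `wb` DataFrame from earlier; the choice of columns here is just for illustration):

```python
import seaborn as sns
import matplotlib.pyplot as plt

# x-position, y-position, color (hue), and marker size each encode a different variable
sns.scatterplot(data=wb,
                x="per capita: % growth: 2016",
                y="Adult literacy rate: Female: % ages 15 and older: 2005-14",
                hue="Continent",
                size="gni")
plt.title("Four encoded dimensions: x, y, color, and size");
```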
+However, we should be careful to not misrepresent relationships in our data by manipulating the scale or axes. The visualization below improperly portrays two seemingly independent relationships on the same plot. The authors have clearly changed the scale of the y-axis to mislead their audience.
+Notice how the downwards-facing line segment contains values in the millions, while the upwards-trending segment only contains values near three hundred thousand. These lines should not be intersecting.
+When there is a large difference in the magnitude of the data, it’s advised to analyze percentages instead of counts. The following diagrams correctly display the trends in cancer screening and abortion rates.
+Great visualizations not only consider the scale of the data but also utilize the axes in a way that best conveys information. For example, data scientists commonly set certain axes limits to highlight parts of the visualization they are most interested in.
The visualization on the right captures the trend in coronavirus cases during March of 2020. From only looking at the visualization on the left, a viewer may incorrectly believe that coronavirus began to skyrocket on March 4th, 2020. However, the second illustration tells a different story - cases rose closer to March 21st, 2020.
+Color is another important feature in visualizations that does more than what meets the eye.
+We already explored using color to encode a categorical variable in our scatter plot. Let’s now discuss the uses of color in novel visualizations like colormaps and heatmaps.
Roughly 5-8% of the world is red-green color blind, so we have to be very particular about our color scheme. We want to make our visualizations as accessible as possible. Choosing a set of colors that work together is evidently a challenging task!
+Colormaps are mappings from pixel data to color values, and they’re often used to highlight distinct parts of an image. Let’s investigate a few properties of colormaps.
+Jet Colormap
Viridis Colormap
The jet colormap is infamous for being misleading. While it seems more vibrant than viridis, the aggressive colors poorly encode numerical data. To understand why, let’s analyze the following images.
+The diagram on the left compares how a variety of colormaps represent pixel data that transitions from a high to low intensity. These include the jet colormap (row a) and grayscale (row b). Notice how the grayscale images do the best job in smoothly transitioning between pixel data. The jet colormap is the worst at this - the four images in row (a) look like a conglomeration of individual colors.
+The difference is also evident in the images labeled (a) and (b) on the left side. The grayscale image is better at preserving finer detail in the vertical line strokes. Additionally, grayscale is preferred in X-ray scans for being more neutral. The intensity of the dark red color in the jet colormap is frightening and indicates something is wrong.
Why is the jet colormap so much worse? The answer lies in how its color composition is perceived by the human eye.
+Jet Colormap Perception
Viridis Colormap Perception
The jet colormap is largely misleading because it is not perceptually uniform. Perceptually uniform colormaps have the property that if the pixel data goes from 0.1 to 0.2, the perceptual change is the same as when the data goes from 0.8 to 0.9.
Notice how this uniformity shows up as the linear trend in the viridis colormap's perceptual plot. The jet colormap, on the other hand, is largely non-linear - this is precisely why it's considered a worse colormap.
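One way to see this for yourself is to render the same smooth gradient with both colormaps (a minimal sketch; the synthetic gradient stands in for real pixel data):

```python
import numpy as np
import matplotlib.pyplot as plt

# A smooth ramp from 0 to 1, repeated across rows so it renders as a horizontal gradient
gradient = np.tile(np.linspace(0, 1, 256), (20, 1))

fig, axes = plt.subplots(2, 1, figsize=(6, 2.5))
axes[0].imshow(gradient, cmap="jet", aspect="auto")
axes[0].set_title("jet")
axes[1].imshow(gradient, cmap="viridis", aspect="auto")
axes[1].set_title("viridis")
for ax in axes:
    ax.set_xticks([])
    ax.set_yticks([])
plt.tight_layout()
```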
+In our earlier discussion of multi-dimensional encodings, we analyzed a scatter plot with four pseudo-dimensions: the two axes, area, and color. Were these appropriate to use? The following diagram analyzes how well the human eye can distinguish between these “markings”.
There are a few key takeaways from this diagram: markings based on position and length are the easiest for the human eye to compare precisely, while markings based on area, volume, or color are much harder to judge accurately and should be used sparingly for quantitative comparisons.
+Conditioning is the process of comparing data that belong to separate groups. We’ve seen this before in overlayed distributions, side-by-side box plots, and scatter plots with categorical encodings. Here, we’ll introduce terminology that formalizes these examples.
+Consider an example where we want to analyze income earnings for males and females with varying levels of education. There are multiple ways to compare this data.
+The barplot is an example of juxtaposition: placing multiple plots side by side, with the same scale. The scatter plot is an example of superposition: placing multiple density curves and scatter plots on top of each other.
+Which is better depends on the problem at hand. Here, superposition makes the precise wage difference very clear from a quick glance. However, many sophisticated plots convey information that favors the use of juxtaposition. Below is one example.
The last component of a great visualization is perhaps the most critical - the use of context. Adding informative titles, axis labels, and descriptive captions is a best practice that we've heard repeatedly in Data 8.
A publication-ready plot (and every Data 100 plot) needs:

- an informative title,
- labeled axes (with units where relevant),
- a legend, if color or style is used to encode information, and
- a descriptive caption.

Captions should:

- be self-contained,
- describe what has been plotted, and
- draw attention to the most important features and conclusions.