diff --git a/_quarto.yml b/_quarto.yml index 06c90a10..dded0daf 100644 --- a/_quarto.yml +++ b/_quarto.yml @@ -30,10 +30,10 @@ book: - ols/ols.qmd - gradient_descent/gradient_descent.qmd - feature_engineering/feature_engineering.qmd - # - cv_regularization/cv_reg.qmd + - case_study_HCE/case_study_HCE.qmd + - cv_regularization/cv_reg.qmd # - probability_1/probability_1.qmd # - probability_2/probability_2.qmd - # - case_study_HCE/case_study_HCE.qmd # - case_study_climate/case_study_climate.qmd # - inference_causality/inference_causality.qmd # - sql_I/sql_I.qmd diff --git a/case_study_HCE/case_study_HCE.html b/case_study_HCE/case_study_HCE.html deleted file mode 100644 index 00f627a7..00000000 --- a/case_study_HCE/case_study_HCE.html +++ /dev/null @@ -1,706 +0,0 @@ - - - - - - - - - -Case Study in Human Contexts and Ethics - - - - - - - - - - - - - - - - - - - -
- -
- -
-
-

Case Study in Human Contexts and Ethics

-
- - - -
- - - - -
- - -
- -
-
-
- -
-
-Learning Outcomes -
-
-
-
-
-
    -
  • Learn about the ethical dilemmas that data scientists face.
  • Know how to critique models using contextual knowledge about data.
-
-
-
-
-

Disclaimer: The following chapter discusses issues of structural racism. Some of the items in the chapter may be sensitive and may or may not be the opinions, ideas, and beliefs of the students who collected the materials. The Data 100 course staff tries its best to only present information that is relevant for teaching the lessons at hand.

-
-

Note: Given the nuanced nature of some of the arguments made in the lecture, it is highly recommended that you view the lecture recording in order to fully engage and understand the material. The course notes will have the same broader structure but are by no means comprehensive.

-

Let’s immerse ourselves in the real-world story of data scientists working for an organization called the Cook County Assessor’s Office (CCAO). Their job is to estimate the values of houses in order to assign property taxes: the tax burden in Cook County is determined by a home’s estimated value, which is different from its price. Since values change over time and there are no obvious indicators of value, the CCAO created a model to estimate the values of houses. In this chapter, we will dig deep into what biased the model, the consequences for human lives, and how we can learn from this example to do better.

-
-

The Problem

-

A report by the Chicago Tribune uncovered a major scandal: its reporters showed that the CCAO’s model perpetuated a highly regressive tax system that disproportionately burdened African-American and Latinx homeowners in Cook County. How did they know?

-
- -
-

In the field of housing assessment, there are standard metrics that assessors use around the world to estimate the fairness of assessments: the coefficient of dispersion and the price-related differential. These metrics have been rigorously tested by experts in the field, and their details are out of scope for our class. Calculating these metrics for Cook County revealed that the assessments produced by the CCAO did not fall within acceptable ranges (see figure above). This by itself is not the end of the story, but it is a good indicator that something fishy was going on.
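To make these metrics concrete, here is a minimal sketch of how such ratio statistics might be computed, assuming a hypothetical table of sold homes with their assessed values and sale prices. The DataFrame and its column names are invented for illustration (not the CCAO's actual data or tooling), and the formulas follow the standard textbook definitions: COD is the average absolute deviation from the median assessment ratio, as a percentage of that median, and PRD is the mean ratio divided by the sale-price-weighted mean ratio.

```python
import pandas as pd

# Hypothetical sold homes: assessed value vs. observed sale price
sales = pd.DataFrame({
    "assessed_value": [210_000, 105_000, 95_000, 480_000, 60_000],
    "sale_price":     [300_000, 120_000, 100_000, 650_000, 55_000],
})

# Assessment ratio for each sale
ratios = sales["assessed_value"] / sales["sale_price"]

# Coefficient of dispersion: average absolute deviation from the median ratio,
# expressed as a percentage of the median ratio
median_ratio = ratios.median()
cod = 100 * (ratios - median_ratio).abs().mean() / median_ratio

# Price-related differential: mean ratio divided by the sale-price-weighted
# mean ratio; values well above 1 suggest regressive assessments
prd = ratios.mean() / (sales["assessed_value"].sum() / sales["sale_price"].sum())

print(f"COD: {cod:.1f}   PRD: {prd:.3f}")
```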

-
- -
-

This prompted them to investigate whether the model itself was producing fair tax rates. Sure enough, when accounting for the homeowner’s income, they found that the model actually produced a regressive tax rate (see figure above). A tax rate is regressive if the percentage tax rate is higher for individuals with lower net income; it is progressive if the percentage tax rate is higher for individuals with higher net income. For example, if a household earning $40,000 pays $2,000 in property tax (5%) while a household earning $200,000 pays $6,000 (3%), the tax is regressive even though the wealthier household pays more in absolute terms.

-
- -
-


Further digging suggested that the system was unfair not only across the axis of income but also across the axis of race (see figure above). The likelihood of a property being under- or over-assessed was highly dependent on the owner’s race, and that did not sit well with many homeowners.

-
-

Spotlight: Appeals

-

What actually caused this to come about? A comprehensive answer goes beyond just models. At the end of the day, these are real systems with a lot of moving parts, one of which was the appeals system. Homeowners are mailed the assessed value of their home by the CCAO, and a homeowner can choose to appeal to a board of elected officials to try to change the listed value of their home and thus how much they are taxed. In theory, this sounds like a very fair system: a person, rather than just an algorithm, oversees the final pricing of houses. However, it ended up exacerbating the problems.

-
-

“Appeals are a good thing,” Thomas Jaconetty, deputy assessor for valuation and appeals, said in an interview. “The goal here is fairness. We made the numbers. We can change them.”

-
-
- -
-


-

Here we can borrow lessons from Critical Race Theory. On the surface, giving everyone the legal right to appeal seems undeniably fair. However, not everyone has an equal ability to exercise that right. Those who can afford to hire tax lawyers to appeal for them have a drastically higher chance of trying and succeeding (see above figure). The model is part of a deeper institutional pattern rife with potential corruption.

-
- -
-


-

Homeowners who appealed were generally under-assessed relative to homeowners who did not (see above figure). As a result, those with higher incomes pay less in property tax, tax lawyers grow their business through their role in appeals, and politicians are commonly socially connected to those same tax lawyers and wealthy homeowners. All of these stakeholders have reasons to advertise the model as an integral part of a fair system. Here lies the value of asking questions: a system that seems fair on the surface may, upon closer inspection, turn out to be unfair.

-
-
-

Human Impacts

-
- -
-


-

The impact of the housing model extends beyond the realm of home ownership and taxation. Discriminatory practices have a long history within the United States, and the model served to perpetuate that history. To this day, Chicago is one of the most segregated cities in the United States (source). These factors are central to informing us, as data scientists, about what is at stake.

-
-
-

Spotlight: Intersection of Real Estate and Race

-

Housing has been a persistent source of racial inequality throughout US history; it is one of the main areas where inequalities are created and reproduced. Early on, Jim Crow laws explicitly barred people of color from schools, public utilities, and more.

-
- -
-


-

Today, while advancements in Civil Rights have been made, the spirit of those laws is alive in many parts of the US. The real estate industry was “professionalized” in the 1920s and 1930s by aspiring to become a science guided by strict methods and principles, outlined below:

-
    -
  • Redlining: making it difficult or impossible to get a federally-backed mortgage to buy a house in specific neighborhoods coded as “risky” (red).
    • What made a neighborhood “risky,” according to the makers of these maps, was its racial composition.
    • Segregation was not only a result of federal policy, but was also actively developed by real estate professionals.
  • The methods centered on creating objective rating systems (information technologies) for the appraisal of property values, which encoded race as a factor of valuation (see figure below).
    • This, in turn, influenced federal policy and practice.
-
- -
Source: Colin Koopman, How We Became Our Data (2019), p. 137
-
-


-
-
-
-

The Response: Cook County Open Data Initiative

-

The response started in politics. A new assessor, Fritz Kaegi, was elected and created a new mandate with two goals:

-
    -
  1. Distributional equity in property taxation, meaning that properties of the same value are treated alike during assessments.
  2. Creating a new Office of Data Science.
-
- -
-


-
-

Question/Problem Formulation

-
-
-
- -
-
-Driving Questions -
-
-
-
    -
  • What do we want to know?
  • What problems are we trying to solve?
  • What are the hypotheses we want to test?
  • What are our metrics for success?
-
-
-

The new Office of Data Science started by redefining its goals:

-
    -
  1. Accurately, uniformly, and impartially assess the value of a home by
    • Following international standards (coefficient of dispersion)
    • Predicting the value of all homes with as little total error as possible
  2. Create a robust pipeline that accurately assesses property values at scale and is fair by
    • Disrupting the circuit of corruption (the Board of Review appeals process)
    • Eliminating regressivity
    • Engendering trust in the system among all stakeholders
-
-
-
- -
-
-Definitions: Fairness and Transparency -
-
-
-

The definitions, as stated by the Cook County Assessor’s Office, are given below:

-
    -
  • Fairness: The ability of our pipeline to accurately assess property values, accounting for disparities in geography, information, etc.
  • Transparency: The ability of the data science department to share and explain pipeline results and decisions to both internal and external stakeholders
-

Note how the Office defines “fairness” in terms of accuracy. Thus, the problem - make the system more fair - was already framed in terms amenable to a data scientist: make the assessments more accurate.
The idea here is that if the model is more accurate it will also (perhaps necessarily) become more fair, which is a big assumption. There are, in a sense, two different problems - make accurate assessments, and make a fair system.

-
-
-

The way the goals are defined leads us to ask: what does it actually mean to accurately assess property values, and what role does “scale” play?

-
    -
  1. What is an assessment of a home’s value?
  2. What makes one assessment more accurate than another?
  3. What makes one batch of assessments more accurate than another batch?
-

Each of the above questions leads to a slew of more questions. Considering just the first question, one answer could be that an assessment is an estimate of the value of a home. This leads to more inquiries: what is the value of a home? What determines it? How do we know? For this class, we take it to be the house’s market value.

-
-
-

Data Acquisition and Cleaning

-
-
-
- -
-
-Driving Questions -
-
-
-
    -
  • What data do we have, and what data do we need?
  • How will we sample more data?
  • Is our data representative of the population we want to study?
-
-
-

The data scientists also critically examined their original sales data:

-
- -
-


-

and asked the questions:

-
    -
  1. How was this data collected?
  2. When was this data collected?
  3. Who collected this data?
  4. For what purposes was the data collected?
  5. How and why were particular categories created?
-

For example, attributes can have different likelihoods of appearing in the data; housing data from the floodplain regions of Chicago were less represented than data from other regions.

-

Features can even be reported at different rates. Improvements to homes, which tend to increase property value, were unlikely to be reported by homeowners.

-

Additionally, they found that there was simply more missing data in lower-income neighborhoods.
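As a rough illustration of how such gaps might be surfaced during data cleaning, the sketch below computes per-feature missingness rates grouped by a neighborhood income bracket. The DataFrame and its columns are hypothetical, invented purely for this example.

```python
import pandas as pd

# Hypothetical assessment roll with an income bracket attached to each neighborhood
homes = pd.DataFrame({
    "income_bracket":      ["low", "low", "low", "high", "high", "high"],
    "last_sale_price":     [None, 210_000, None, 650_000, 540_000, 700_000],
    "reported_renovation": [None, None, None, "kitchen", None, "addition"],
})

# Fraction of missing values per feature, split by neighborhood income bracket
missing_rates = (
    homes.drop(columns="income_bracket")
         .isna()
         .groupby(homes["income_bracket"])
         .mean()
)
print(missing_rates)
```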

-
-
-

Exploratory Data Analysis

-
-
-
- -
-
-Driving Questions -
-
-
-
    -
  • How is our data organized, and what does it contain?
  • Do we already have relevant data?
  • What are the biases, anomalies, or other issues with the data?
  • How do we transform the data to enable effective analysis?
-
-
-

Before the modeling step, they investigated a multitude of crucial questions:

-
    -
  1. Which attributes are most predictive of sales price?
  2. Is the data uniformly distributed?
  3. Do all neighborhoods have up-to-date data? Do all neighborhoods have the same granularity?
  4. Do some neighborhoods have missing or outdated data?
-

First, they found that certain features, such as the number of bedrooms, were much more impactful in determining house value in some neighborhoods than in others. This informed them that different models should be used depending on the neighborhood.

-

They also noticed that low-income neighborhoods had disproportionately spottier data. This informed them that they needed to develop new data collection practices, including finding new sources of data.
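To illustrate the kind of check that can reveal such neighborhood-dependent effects, here is a minimal sketch that fits a separate regression per neighborhood and compares how strongly an additional bedroom is associated with sale price. All data, column names, and neighborhood labels are hypothetical.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical sales data with a neighborhood label
sales = pd.DataFrame({
    "neighborhood": ["A", "A", "A", "A", "B", "B", "B", "B"],
    "bedrooms":     [2, 3, 4, 5, 2, 3, 4, 5],
    "sale_price":   [150_000, 210_000, 285_000, 355_000,
                     400_000, 425_000, 455_000, 480_000],
})

# Fit one regression per neighborhood and compare the bedroom effect
for name, group in sales.groupby("neighborhood"):
    model = LinearRegression().fit(group[["bedrooms"]], group["sale_price"])
    print(f"Neighborhood {name}: ~${model.coef_[0]:,.0f} per additional bedroom")
```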

-
-
-

Prediction and Inference

-
-
-
- -
-
-Driving Questions -
-
-
-
    -
  • What does the data say about the world?
  • Does it answer our questions or accurately solve the problem?
  • How robust are our conclusions and can we trust the predictions?
-
-
-

Rather than using a single model to predict the sale prices (“fair market value”) of unsold properties, the CCAO fits machine learning models that discover patterns using the known sale prices and characteristics of similar and nearby properties, with different model weights for each township.

-

Compared to traditional mass appraisal, the CCAO’s new approach is more granular and more sensitive to neighborhood variations.
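As a toy illustration of the underlying idea (estimating an unsold home's value from similar, nearby sold homes), the sketch below uses a nearest-neighbors regressor on invented data. The CCAO's production pipeline is far more sophisticated; every column and value here is hypothetical.

```python
import pandas as pd
from sklearn.neighbors import KNeighborsRegressor

# Hypothetical sold properties: location, size, and observed sale price
sold = pd.DataFrame({
    "lat":   [41.88, 41.89, 41.87, 41.75, 41.76, 41.74],
    "lon":   [-87.63, -87.62, -87.64, -87.70, -87.71, -87.69],
    "sqft":  [1200, 1500, 1100, 1800, 1600, 1400],
    "price": [310_000, 365_000, 295_000, 240_000, 225_000, 215_000],
})

# Estimate an unsold property's value from its most similar, nearby comparables
# (in practice, features would be scaled so location and size are comparable)
knn = KNeighborsRegressor(n_neighbors=3)
knn.fit(sold[["lat", "lon", "sqft"]], sold["price"])

unsold = pd.DataFrame({"lat": [41.88], "lon": [-87.63], "sqft": [1300]})
print(knn.predict(unsold))
```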

-

Here, we might ask: why should any particular individual believe that the model is accurate for their property?

-

This leads us to recognize that the CCAO counts on its performance of “transparency” (putting data, models, and the pipeline onto GitLab) to foster public trust, which would help it equate the production of “accurate assessments” with “fairness”.

-

There’s a lot more to be said here on the relationship between accuracy, fairness, and metrics we tend to use when evaluating our models. Given the nuanced nature of the argument, it is recommended you view the corresponding lecture as the course notes are not as comprehensive for this portion of lecture.

-
-
-

Reports, Decisions, and Conclusions

-
-
-
- -
-
-Driving Questions -
-
-
-
    -
  • How successful is the system for each goal?
    • Accuracy/uniformity of the model
    • Fairness and transparency that eliminate regressivity and engender trust
  • How do you know?
-
-
-

The model is not the end of the road. The new Office still sends homeowners their house evaluations, but now the data that homeowners send back is taken into account. More detailed reports are written by the Office itself to democratize the information. Town halls and other public-facing outreach help involve the whole community in the process of housing evaluations, rather than limiting participation to a select few.

-
-
-
-

Key Takeaways

-
    -
  1. Accuracy is a necessary, but not sufficient, condition of a fair system.
  2. Fairness and transparency are context-dependent and sociotechnical concepts.
  3. Learn to work with contexts, and consider how your data analysis will reshape them.
  4. Keep in mind the power, and limits, of data analysis.
-
-
-

Lessons for Data Science Practice

-
    -
  1. Question/Problem Formulation
    • Who is responsible for framing the problem?
    • Who are the stakeholders? How are they involved in the problem framing?
    • What do you bring to the table? How does your positionality affect your understanding of the problem?
    • What are the narratives that you’re tapping into?
  2. Data Acquisition and Cleaning
    • Where does the data come from?
    • Who collected it? For what purpose?
    • What kinds of collecting and recording systems and techniques were used?
    • How has this data been used in the past?
    • What restrictions are there on access to the data, and what enables you to have access?
  3. Exploratory Data Analysis & Visualization
    • What kinds of personal or group identities have become salient in this data?
    • Which variables became salient, and what kinds of relationships obtain between them?
    • Do any of the relationships made visible lend themselves to arguments that might be potentially harmful to a particular community?
  4. Prediction and Inference
    • What does the prediction or inference do in the world?
    • Are the results useful for the intended purposes?
    • Are there benchmarks to compare the results against?
    • How are your predictions and inferences dependent upon the larger system in which your model works?
  5. Reports, Decisions, and Solutions
    • How do we know if we have accomplished our goals?
    • How does your work fit in the broader literature?
    • Where does your work agree or disagree with the status quo?
    • Do your conclusions make sense?
-
- -
- - -
- - - - \ No newline at end of file diff --git a/case_study_HCE/case_study_HCE.ipynb b/case_study_HCE/case_study_HCE.ipynb deleted file mode 100644 index 3ade52d1..00000000 --- a/case_study_HCE/case_study_HCE.ipynb +++ /dev/null @@ -1,329 +0,0 @@ -{ - "cells": [ - { - "cell_type": "raw", - "metadata": {}, - "source": [ - "---\n", - "title: Case Study in Human Contexts and Ethics\n", - "execute:\n", - " echo: true\n", - "format:\n", - " html:\n", - " code-fold: true\n", - " code-tools: true\n", - " toc: true\n", - " toc-title: Case Study in Human Contexts and Ethics\n", - " page-layout: full\n", - " theme:\n", - " - cosmo\n", - " - cerulean\n", - " callout-icon: false\n", - "---" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "::: {.callout-note collapse=\"false\"}\n", - "## Learning Outcomes\n", - "* Learn about the ethical dilemmas that data scientists face.\n", - "* Know how critique models using contextual knowledge about data. \n", - ":::" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "> **Disclaimer**: The following chapter discusses issues of structural racism. Some of the items in the chapter may be sensitive and may or may not be the opinions, ideas, and beliefs of the students who collected the materials. The Data 100 course staff tries its best to only present information that is relevant for teaching the lessons at hand.\n", - "\n", - "**Note:** Given the nuanced nature of some of the arguments made in the lecture, it is highly recommended that you view the lecture recording in order to fully engage and understand the material. The course notes will have the same broader structure but are by no means comprehensive.\n", - "\n", - "\n", - "Let's immerse ourselves in the real-world story of data scientists working for an organization called the Cook County Assessor's Office (CCAO). Their job is to **estimate the values of houses** in order to **assign property taxes**. This is because the tax burden in this area is determined by the estimated **value** of a house, which is different from its price. Since values change over time and there are no obvious indicators of value, they created a **model** to estimate the values of houses. In this chapter, we will dig deep into what problems biased the models, the consequences to human lives, and how we can learn from this example to do better. \n", - "\n", - "\n", - "## The Problem\n", - "\n", - "A [report](https://apps.chicagotribune.com/news/watchdog/cook-county-property-tax-divide/assessments.html) by the Chicago Tribune uncovered a major scandal: the team showed that the model perpetuated a highly regressive tax system that disproportionately burdened African-American and Latinx homeowners in Cook County. How did they know? \n", - "\n", - "
\n", - "
\n", - "\n", - "In the field of housing assessment, there are standard metrics that assessors use across the world to estimate the fairness of assessments: [coefficient of dispersion](https://www.realestateagent.com/real-estate-glossary/real-estate/coefficient-of-dispersion.html) and [price-related differential](https://leg.wa.gov/House/Committees/FIN/Documents/2009/RatioText.pdf). These metrics have been rigorously tested by experts in the field and are out of scope for our class. Calculating these metrics for the Cook County prices revealed that the pricing created by the CCAO did not fall in acceptable ranges (see figure above). This by itself is **not the end** of the story, but a good indicator that **something fishy was going on**.\n", - "\n", - "
\n", - "
\n", - "\n", - "This prompted them to investigate if the model itself was producing fair tax rates. Evidently, when accounting for the home owner's income, they found that the model actually produced a **regressive** tax rate (see figure above). A tax rate is **regressive** if the percentage tax rate is higher for individuals with lower net income. A tax rate is **progressive** if the percentage tax rate is higher for individuals with higher net income. \n", - "\n", - "
\n", - "\n", - "
\n", - "
\n", - "Further digging suggests that not only was the system unfair to people across the axis of income, it was also unfair across the axis of race (see figure above). The likelihood of a property being under- or over-assessed was highly dependent on the owner's race, and that did not sit well with many homeowners.\n", - "\n", - "\n", - "### Spotlight: Appeals\n", - "\n", - "What actually caused this to come about? A comprehensive answer goes beyond just models. At the end of the day, these are real systems that have a lot of moving parts. One of which was the **appeals system**. Homeowners are mailed the value their home assessed by CCAO, and the homeowner can choose to appeal to a board of elected officials to try and change the listed value of their home and thus how much they are taxed. In theory, this sounds like a very fair system: someone oversees the final pricing of houses rather than just an algorithm. However, it ended up exacerbating the problems. \n", - "\n", - "> “Appeals are a good thing,” Thomas Jaconetty, deputy assessor for valuation and appeals, said in an interview. “The goal here is fairness. We made the numbers. We can change them.”\n", - "\n", - "
\n", - "
\n", - "\n", - "
\n", - "\n", - "Here we can borrow lessons from [Critical Race Theory](https://www.britannica.com/topic/critical-race-theory). On the surface, everyone having the legal right to try and appeal is undeniable. However, not everyone has an equal ability to do so. Those who have the money to hire tax lawyers to appeal for them have a drastically higher chance of trying and succeeding (see above figure). This model is part of a deeper institutional pattern rife with potential corruption.\n", - "\n", - "\n", - "
\n", - "
\n", - "
\n", - "\n", - "Homeowners who appealed were generally under-assessed relative to homeowners who did not (see above figure). Those with higher incomes pay less in property tax, tax lawyers are able to grow their business due to their role in appeals, and politicians are commonly socially connected to the aforementioned tax lawyers and wealthy homeowners. All these stakeholders have reasons to advertise the model as an integral part of a fair system. Here lies the value in asking questions: a system that seems fair on the surface may in actuality be unfair upon taking a closer look. \n", - "\n", - "### Human Impacts\n", - "\n", - "
\n", - "
\n", - "
\n", - "\n", - "The impact of the housing model extends beyond the realm of home ownership and taxation. Discriminatory practices have a long history within the United States, and the model served to perpetuate this fact. To this day, Chicago is one of the most segregated cities in the United States ([source](https://fivethirtyeight.com/features/the-most-diverse-cities-are-often-the-most-segregated/)). These factors are central to informing us, as data scientists, about what is at stake.\n", - "\n", - "\n", - "### Spotlight: Intersection of Real Estate and Race\n", - "\n", - "Housing has been a persistent source of racial inequality throughout US history, amongst other factors. It is one of the main areas where inequalities are created and reproduced. In the beginning, [Jim Crow](https://www.history.com/topics/early-20th-century-us/jim-crow-laws) laws were explicit in forbidding people of color from schools, public utilities, etc. \n", - "\n", - "
\n", - "
\n", - "\n", - "Today, while advancements in Civil Rights have been made, the spirit of the laws are alive in many parts of the US. The real estate industry was “professionalized” in the 1920’s and 1930’s by aspiring to become a science guided by strict methods and principles outlined below:\n", - "\n", - "- Redlining: making it difficult or impossible to get a federally-backed mortgage to buy a house in specific neighborhoods coded as “risky” (red).\n", - " - What made them “risky” according to the makers of these was racial composition.\n", - " - Segregation was not only a result of federal policy, but developed by real estate professionals.\n", - "- The methods centered on creating objective rating systems (information technologies) for the appraisal of property values which encoded **race** as a factor of valuation (see figure below),\n", - " - This, in turn, influenced federal policy and practice.\n", - "\n", - "
Source: Colin Koopman, How We Became Our Data (2019) p. 137
\n", - "
\n", - "\n", - "\n", - "## The Response: Cook County Open Data Initiative\n", - "\n", - "The response started in politics. A new assessor, Fritz Kaegi, was elected and created a new mandate with two goals: \n", - "\n", - "1. Distributional equity in property taxation, meaning that properties of same value treated alike during assessments.\n", - "2. Creating a new Office of Data Science.\n", - "\n", - "
\n", - "
\n", - "\n", - "### Question/Problem Formulation\n", - "::: {.callout-note}\n", - "## Driving Questions\n", - "\n", - "- What do we want to know?\n", - "- What problems are we trying to solve?\n", - "- What are the hypotheses we want to test?\n", - "- What are our metrics for success?\n", - ":::\n", - "\n", - "The new Office of Data Science started by redefining their goals. \n", - "\n", - "1. Accurately, uniformly, and impartially assess the value of a home by\n", - " - Following international standards (coefficient of dispersion)\n", - " - Predicting value of all homes with as little total error as possible\n", - "\n", - "2. Create a robust pipeline that accurately assesses property values at scale and is fair by\n", - " - Disrupts the circuit of corruption (Board of Review appeals process)\n", - " - Eliminates regressivity\n", - " - Engenders trust in the system among all stakeholders \n", - "\n", - "\n", - "::: {.callout-tip}\n", - "## Definitions: Fairness and Transparency\n", - "The definitions, as given by the Cook County Assessor's Office, are given below:
\n", - "\n", - "* Fairness: The ability of our pipeline to accurately assess property values, accounting for disparities in geography, information, etc.
\n", - "* Transparency: The ability of the data science department to share and explain pipeline results and decisions to both internal and external stakeholders
\n", - "\n", - "Note how the Office defines \"fairness\" in terms of accuracy. Thus, the problem - make the system more fair - was already framed in terms amenable to a data scientist: make the assessments more accurate.
\n", - "The idea here is that if the model is more accurate it will also (perhaps necessarily) become more fair, which is a big assumption. There are, in a sense, two different problems - make accurate assessments, and make a fair system. \n", - ":::\n", - "\n", - "The way the goals are defined lead us to ask the question: what does it actually mean to accurately assess property values, and what role does “scale” play?\n", - "\n", - "1. What is an assessment of a home’s value?\n", - "2. What makes one assessment more accurate than another?\n", - "3. What makes one batch of assessments more accurate than another batch?\n", - "\n", - "Each of the above questions leads to a slew of more questions. Considering just the first question, one answer could be that an assessment is an estimate of the value of a home. This leads to more inquiries: what is the value of a home? What determines it? How do we know? For this class, we take it to be the house's market value.\n", - "\n", - "### Data Acquisition and Cleaning\n", - "::: {.callout-note}\n", - "## Driving Questions\n", - "\n", - "- What data do we have, and what data do we need?\n", - "- How will we sample more data?\n", - "- Is our data representative of the population we want to study?\n", - ":::\n", - "\n", - "The data scientists also critically examined their original sales data: \n", - "\n", - "
\n", - "
\n", - "\n", - "and asked the questions:\n", - "\n", - "1. How was this data collected?\n", - "2. When was this data collected? \n", - "3. Who collected this data?\n", - "4. For what purposes was the data collected?\n", - "5. How and why were particular categories created? \n", - "\n", - "For example, attributes can have different likelihoods of appearing in the data, and housing data in the floodplains geographic region of Chicago were less represented than other regions.\n", - "\n", - "The features can even be reported at different rates. Improvements in homes, which tend to increase property value, were unlikely to be reported by the homeowners.\n", - "\n", - "Additionally, they found that there was simply more missing data in lower income neighborhoods. \n", - "\n", - "### Exploratory Data Analysis\n", - "::: {.callout-note}\n", - "## Driving Questions\n", - "\n", - "- How is our data organized, and what does it contain?\n", - "- Do we already have relevant data?\n", - "- What are the biases, anomalies, or other issues with the data?\n", - "- How do we transform the data to enable effective analysis?\n", - ":::\n", - "\n", - "Before the modeling step, they investigated a multitude of crucial questions: \n", - "\n", - "1. Which attributes are most predictive of sales price?\n", - "2. Is the data uniformly distributed? \n", - "3. Do all neighborhoods have up to date data? Do all neighborhoods have the same granularity? \n", - "4. Do some neighborhoods have missing or outdated data? \n", - "\n", - "Firstly, they found that the impact of certain features, such as bedroom number, were much more impactful in determining house value inside certain neighborhoods more than others. This informed them that different models should be used depending on the neighborhood.\n", - "\n", - "They also noticed that low income neighborhoods had disproportionately spottier data. This informed them that they needed to develop new data collection practices - including finding new sources of data. \n", - "\n", - "\n", - "\n", - "### Prediction and Inference\n", - "::: {.callout-note}\n", - "## Driving Questions\n", - "\n", - "- What does the data say about the world?\n", - "- Does it answer our questions or accurately solve the problem?\n", - "- How robust are our conclusions and can we trust the predictions? \n", - ":::\n", - "\n", - "Rather than using a singular model to predict sale prices (“fair market value”) of unsold properties, the CCAO fit machine learning models that discover patterns using known sale prices and characteristics of **similar and nearby properties**. It uses different model weights for each township.\n", - "\n", - "Compared to traditional mass appraisal, the CCAO’s new approach is more granular and more sensitive to neighborhood variations. \n", - "\n", - "Here, we might ask why should any particular individual believe that the model is accurate for their property?\n", - "\n", - "This leads us to recognize that the CCAO counts on its performance of “transparency” (putting data, models, pipeline onto GitLab) to foster public trust, which would help it equate the production of “accurate assessments” with “fairness”.\n", - "\n", - "There's a lot more to be said here on the relationship between accuracy, fairness, and metrics we tend to use when evaluating our models. 
Given the nuanced nature of the argument, it is recommended you view the corresponding lecture as the course notes are not as comprehensive for this portion of lecture.\n", - "\n", - "### Reports Decisions, and Conclusions\n", - "::: {.callout-note}\n", - "## Driving Questions\n", - "\n", - "- How successful is the system for each goal?\n", - " - Accuracy/uniformity of the model\n", - " - Fairness and transparency that eliminates regressivity and engenders trust\n", - "- How do you know? \n", - ":::\n", - "\n", - "The model is not the end of the road. The new Office still sends homeowners their house evaluations, but now the data that they get sent back from the homeowners is taken into account. More detailed reports are being written by the Office itself to democratize the information. Town halls and other public facing outreach helps involves the whole community in the process of housing evaluations, rather than limiting participation to a select few.\n", - "\n", - "## Key Takeaways\n", - "\n", - "1. Accuracy is a necessary, but not sufficient, condition of a fair system.\n", - "\n", - "2. Fairness and transparency are context-dependent and sociotechnical concepts.\n", - "\n", - "3. Learn to work with contexts, and consider how your data analysis will reshape them.\n", - "\n", - "4. Keep in mind the power, and limits, of data analysis.\n", - "\n", - "\n", - "\n", - "## Lessons for Data Science Practice\n", - "\n", - "1. Question/Problem Formulation\n", - "\n", - " - Who is responsible for framing the problem?\n", - " - Who are the stakeholders? How are they involved in the problem framing?\n", - " - What do you bring to the table? How does your positionality affect your understanding of the problem?\n", - " - What are the narratives that you're tapping into? \n", - "\n", - "2. Data Acquisition and Cleaning\n", - "\n", - " - Where does the data come from?\n", - " - Who collected it? For what purpose?\n", - " - What kinds of collecting and recording systems and techniques were used? \n", - " - How has this data been used in the past?\n", - " - What restrictions are there on access to the data, and what enables you to have access?\n", - "\n", - "3. Exploratory Data Analysis & Visualization\n", - "\n", - " - What kind of personal or group identities have become salient in this data? \n", - " - Which variables became salient, and what kinds of relationship obtain between them? \n", - " - Do any of the relationships made visible lend themselves to arguments that might be potentially harmful to a particular community?\n", - "\n", - "4. Prediction and Inference\n", - "\n", - " - What does the prediction or inference do in the world?\n", - " - Are the results useful for the intended purposes?\n", - " - Are there benchmarks to compare the results?\n", - " - How are your predictions and inferences dependent upon the larger system in which your model works?\n", - "\n", - "5. Reports, Decisions, and Solutions\n", - "\n", - " - How do we know if we have accomplished our goals?\n", - " - How does your work fit in the broader literature? 
\n", - " - Where does your work agree or disagree with the status quo?\n", - " - Do your conclusions make sense?\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [] - } - ], - "metadata": { - "kernelspec": { - "display_name": "Python 3 (ipykernel)", - "language": "python", - "name": "python3" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.9.16" - } - }, - "nbformat": 4, - "nbformat_minor": 4 -} diff --git a/cv_regularization/cv_reg.ipynb b/cv_regularization/cv_reg.ipynb deleted file mode 100644 index f9be1e28..00000000 --- a/cv_regularization/cv_reg.ipynb +++ /dev/null @@ -1,635 +0,0 @@ -{ - "cells": [ - { - "cell_type": "raw", - "metadata": {}, - "source": [ - "---\n", - "title: Cross Validation and Regularization\n", - "format:\n", - " html:\n", - " toc: true\n", - " toc-depth: 5\n", - " toc-location: right\n", - " code-fold: false\n", - " theme:\n", - " - cosmo\n", - " - cerulean\n", - " callout-icon: false\n", - "---" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "::: {.callout-note collapse=\"false\"}\n", - "## Learning Outcomes\n", - "* Recognize the need for validation and test sets to preview model performance on unseen data\n", - "* Apply cross-validation to select model hyperparameters\n", - "* Understand the conceptual basis for L1 and L2 regularization\n", - ":::\n", - "\n", - "At the end of the Feature Engineering lecture (Lecture 14), we arrived at the issue of fine-tuning model complexity. We identified that a model that's too complex can lead to overfitting, while a model that's too simple can lead to underfitting. This brings us to a natural question: how do we control model complexity to avoid under- and overfitting? \n", - "\n", - "To answer this question, we will need to address two things: first, we need to understand *when* our model begins to overfit by assessing its performance on unseen data. We can achieve this through **cross-validation**. Secondly, we need to introduce a technique to adjust the complexity of our models ourselves – to do so, we will apply **regularization**.\n", - "\n", - "## Training, Test, and Validation Sets\n", - "\n", - "From the last lecture, we learned that *increasing* model complexity *decreased* our model's training error but *increased* its variance. This makes intuitive sense: adding more features causes our model to fit more closely to data it encountered during training, but generalize worse to new data it hasn't seen before. For this reason, a low training error is not always representative of our model's underlying performance - we need to also assess how well it performs on unseen data to ensure that it is not overfitting.\n", - "\n", - "Truly, the only way to know when our model overfits is by evaluating it on unseen data. Unfortunately, that means we need to wait for more data. This may be very expensive and time-consuming.\n", - "\n", - "How should we proceed? In this section, we will build up a viable solution to this problem.\n", - "\n", - "### Test Sets\n", - "\n", - "The simplest approach to avoid overfitting is to keep some of our data \"secret\" from ourselves. We can set aside a random portion of our full dataset to use only for testing purposes. The datapoints in this **test set** will *not* be used in the model fitting process. 
Instead, we will:\n", - "\n", - "* Use the remaining portion of our dataset – now called the **training set** – to run ordinary least squares, gradient descent, or some other technique to fit model parameters\n", - "* Take the fitted model and use it to make predictions on datapoints in the test set. The model's performance on the test set (expressed as the MSE, RMSE, etc.) is now indicative of how well it can make predictions on unseen data\n", - "\n", - "Importantly, the optimal model parameters were found by *only* considering the data in the training set. After the model has been fitted to the training data, we do not change any parameters before making predictions on the test set. Importantly, we only ever make predictions on the test set **once** after all model design has been completely finalized. We treat the test set performance as the final test of how well a model does.\n", - "\n", - "The process of sub-dividing our dataset into training and test sets is known as a **train-test split**. Typically, between 10% and 20% of the data is allocated to the test set.\n", - "\n", - "
train-test-split
\n", - "\n", - "In `sklearn`, the `train_test_split` function of the `model_selection` module allows us to automatically generate train-test splits. \n", - "\n", - "Throughout today's work, we will work with the `vehicles` dataset from previous lectures. As before, we will attempt to predict the `mpg` of a vehicle from transformations of its `hp`. In the cell below, we allocate 20% of the full dataset to testing, and the remaining 80% to training." - ] - }, - { - "cell_type": "code", - "execution_count": 4, - "metadata": {}, - "outputs": [], - "source": [ - "#| code-fold: true\n", - "import pandas as pd\n", - "import numpy as np\n", - "import seaborn as sns\n", - "import warnings\n", - "warnings.filterwarnings('ignore')\n", - "\n", - "# Load the dataset and construct the design matrix\n", - "vehicles = sns.load_dataset(\"mpg\").rename(columns={\"horsepower\":\"hp\"}).dropna()\n", - "X = vehicles[[\"hp\"]]\n", - "X[\"hp^2\"] = vehicles[\"hp\"]**2\n", - "X[\"hp^3\"] = vehicles[\"hp\"]**3\n", - "X[\"hp^4\"] = vehicles[\"hp\"]**4\n", - "\n", - "Y = vehicles[\"mpg\"]" - ] - }, - { - "cell_type": "code", - "execution_count": 5, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Size of full dataset: 392 points\n", - "Size of training set: 313 points\n", - "Size of test set: 79 points\n" - ] - } - ], - "source": [ - "from sklearn.model_selection import train_test_split\n", - "\n", - "# `test_size` specifies the proportion of the full dataset that should be allocated to testing\n", - "# `random_state` makes our results reproducible for educational purposes\n", - "X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=220)\n", - "\n", - "print(f\"Size of full dataset: {X.shape[0]} points\")\n", - "print(f\"Size of training set: {X_train.shape[0]} points\")\n", - "print(f\"Size of test set: {X_test.shape[0]} points\")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "After performing our train-test split, we fit a model to the training set and assess its performance on the test set." - ] - }, - { - "cell_type": "code", - "execution_count": 6, - "metadata": {}, - "outputs": [], - "source": [ - "import sklearn.linear_model as lm\n", - "from sklearn.metrics import mean_squared_error\n", - "\n", - "model = lm.LinearRegression()\n", - "\n", - "# Fit to the training set\n", - "model.fit(X_train, Y_train)\n", - "\n", - "# Make predictions on the test set\n", - "test_predictions = model.predict(X_test)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Validation Sets\n", - "\n", - "Now, what if we were dissatisfied with our test set performance? With our current framework, we'd be stuck. As outlined previously, assessing model performance on the test set is the *final* stage of the model design process. We can't go back and adjust our model based on the new discovery that it is overfitting – if we did, then we would be *factoring in information from the test set* to design our model. The test error would no longer be a true representation of the model's performance on unseen data! \n", - "\n", - "Our solution is to introduce a **validation set**. A validation set is a random portion of the *training set* that is set aside for assessing model performance while the model is *still being developed*. The process for using a validation set is:\n", - "\n", - "* Perform a train-test split. 
Set the test set aside; we will not touch it until the very end of the model design process.\n", - "* Set aside a portion of the training set to be used for validation.\n", - "* Fit the model parameters to the datapoints contained in the remaining portion of the training set.\n", - "* Assess the model's performance on the validation set. Adjust the model as needed, re-fit it to the remaining portion of the training set, then re-evaluate it on the validation set. Repeat as necessary until you are satisfied.\n", - "* After *all* model development is complete, assess the model's performance on the test set. This is the final test of how well the model performs on unseen data. No further modifications should be made to the model.\n", - "\n", - "The process of creating a validation set is called a **validation split**.\n", - "\n", - "
validation-split
\n", - "\n", - "Note that the validation error behaves quite differently from the training error explored previously. Recall that the training error decreased monotonically with increasing model degree – as the model became more complex, it made better and better predictions on the training data. The validation error, in contrast, decreases *then increases* as we increase model complexity. This reflects the transition from under- to overfitting. At low model complexity, the model underfits because it is not complex enough to capture the main trends in the data. At high model complexity, the model overfits because it \"memorizes\" the training data too closely.\n", - "\n", - "We can update our understanding of the relationships between error, complexity, and model variance:\n", - "\n", - "
training_validation_curve
\n", - "\n", - "Our goal is to train a model with complexity near the orange dotted line – this is where our model achieves minimum error on the validation set. Note that this relationship is a simplification of the real-world. But for the purposes of Data 100, this is good enough.\n", - "\n", - "## K-Fold Cross-Validation\n", - "\n", - "Introducing a validation set gave us an \"extra\" chance to assess model performance on another set of unseen data. We are able to finetune the model design based on its performance on this one set of validation data.\n", - "\n", - "But what if, by random chance, our validation set just happened to contain many outliers? It is possible that the validation datapoints we set aside do not actually represent other unseen data that the model might encounter. Ideally, we would like to validate our model's performance on several different unseen datasets. This would give us greater confidence in our understanding of how the model behaves on new data.\n", - "\n", - "Let's think back to our validation framework. Earlier, we set aside x% of our training data (say, 20%) to use for validation. \n", - "\n", - "
validation_set
\n", - "\n", - "In the example above, we set aside the first 20% of training datapoints for the validation set. This was an arbitrary choice. We could have set aside *any* 20% portion of the training data for validation. In fact, there are 5 non-overlapping \"chunks\" of training points that we could have designated as the validation set.\n", - "\n", - "
possible_validation_sets
\n", - "\n", - "The common term for one of these chunks is a **fold**. In the example above, we had 5 folds, each containing 20% of the training data. This gives us a new perspective: we really have *5* validation sets \"hidden\" in our training set. \n", - "\n", - "In **cross-validation**, we perform validation splits for each fold in the training set. For a dataset with $K$ folds, we:\n", - "\n", - "* Pick one fold to be the validation fold\n", - "* Fit the model to training data from every fold *other* than the validation fold\n", - "* Compute the model's error on the validation fold and record it\n", - "* Repeat for all $K$ folds\n", - "\n", - "The **cross-validation error** is then the *average* error across all $K$ validation folds. \n", - "
cross_validation
\n", - "\n", - "### Model Selection Workflow\n", - "At this stage, we have refined our model selection workflow. We begin by performing a train-test split to set aside a test set for the final evaluation of model performance. Then, we alternate between adjusting our design matrix and computing the cross-validation error to finetune the model's design. In the example below, we illustrate the use of 4-fold cross-validation to help inform model design.\n", - "\n", - "
model_selection
\n", - "\n", - "### Hyperparameters\n", - "An important use of cross-validation is for **hyperparameter** selection. A hyperparameter is some value in a model that is chosen *before* the model is fit to any data. This means that it is distinct from the model *parameters* $\\theta_i$ because its value is selected before the training process begins. We cannot use our usual techniques – calculus, ordinary least squares, or gradient descent – to choose its value. Instead, we must decide it ourselves. \n", - "\n", - "Some examples of hyperparameters in Data 100 are:\n", - "\n", - "* The degree of our polynomial model (recall that we selected the degree before creating our design matrix and calling `.fit`)\n", - "* The learning rate, $\\alpha$, in gradient descent\n", - "* The regularization penalty, $\\lambda$ (to be introduced later this lecture)\n", - "\n", - "To select a hyperparameter value via cross-validation, we first list out several \"guesses\" for what the best hyperparameter may be. For each guess, we then run cross-validation to compute the cross-validation error incurred by the model when using that choice of hyperparameter value. We then select the value of the hyperparameter that resulted in the lowest cross-validation error. \n", - "\n", - "For example, we may wish to use cross-validation to decide what value we should use for $\\alpha$, which controls the step size of each gradient descent update. To do so, we list out some possible guesses for the best $\\alpha$: 0.1, 1, and 10. For each possible value, we perform cross-validation to see what error the model has *when we use that value of $\\alpha$ to train it*.\n", - "\n", - "
hyperparameter_tuning
\n", - "\n", - "## Regularization\n", - "\n", - "We've now addressed the first of our two goals for today: creating a framework to assess model performance on unseen data. Now, we'll discuss our second objective: developing a technique to adjust model complexity. This will allow us to directly tackle the issues of under- and overfitting.\n", - "\n", - "Earlier, we adjusted the complexity of our polynomial model by tuning a hyperparameter – the degree of the polynomial. We trialed several different polynomial degrees, computed the validation error for each, and selected the value that minimized the validation error. Tweaking the \"complexity\" was simple; it was only a matter of adjusting the polynomial degree.\n", - "\n", - "In most machine learning problems, complexity is defined differently from what we have seen so far. Today, we'll explore two different definitions of complexity: the *squared* and *absolute* magnitude of $\\theta_i$ coefficients.\n", - "\n", - "### Constraining Model Parameters\n", - "\n", - "Think back to our work using gradient descent to descend down a loss surface. You may find it helpful to refer back to the Gradient Descent note to refresh your memory. Our aim was to find the combination of model parameters that led to the model having minimum loss. We visualized this using a contour map by plotting possible parameter values on the horizontal and vertical axes, which allows us to take a bird's eye view above the loss surface. We want to find the model parameters corresponding to the lowest point on the loss surface.\n", - "\n", - "
unconstrained
\n", - "\n", - "Let's review our current modeling framework.\n", - "\n", - "$$\\hat{\\mathbb{Y}} = \\theta_0 + \\theta_1 \\phi_1 + \\theta_2 \\phi_2 + \\ldots + \\theta_p \\phi_p$$\n", - "\n", - "Recall that we represent our features with $\\phi_i$ to reflect the fact that we have performed feature engineering. \n", - "\n", - "Previously, we restricted model complexity by limiting the total number of features present in the model. We only included a limited number of polynomial features at a time; all other polynomials were excluded from the model.\n", - "\n", - "What if, instead of fully removing particular features, we kept all features and used each one only a \"little bit\"? If we put a limit on how *much* each feature can contribute to the predictions, we can still control the model's complexity without the need to manually determine how many features should be removed. \n", - "\n", - "What do we mean by a \"little bit\"? Consider the case where some parameter $\\theta_i$ is close to or equal to 0. Then, feature $\\phi_i$ barely impacts the prediction – the feature is weighted by such a small value that its presence doesn't significantly change the value of $\\hat{\\mathbb{Y}}$. If we restrict how large each parameter $\\theta_i$ can be, we restrict how much feature $\\phi_i$ contributes to the model. This has the effect of *reducing* model complexity.\n", - "\n", - "In **regularization**, we restrict model complexity by *putting a limit* on the magnitudes of the model parameters $\\theta_i$. \n", - "\n", - "What do these limits look like? Suppose we specify that the sum of all absolute parameter values can be no greater than some number $Q$. In other words:\n", - "\n", - "$$\\sum_{i=1}^p |\\theta_i| \\leq Q$$\n", - "\n", - "where $p$ is the total number of parameters in the model. You can think of this as us giving our model a \"budget\" for how it distributes the magnitudes of each parameter. If the model assigns a large value to some $\\theta_i$, it may have to assign a small value to some other $\\theta_j$. This has the effect of increasing feature $\\phi_i$'s influence on the predictions while decreasing the influence of feature $\\phi_j$. The model will need to be strategic about how the parameter weights are distributed – ideally, more \"important\" features will receive greater weighting. \n", - "\n", - "Notice that the intercept term, $\\theta_0$, is excluded from this constraint. **We typically do not regularize the intercept term**.\n", - "\n", - "Now, let's think back to gradient descent and visualize the loss surface as a contour map. As a refresher, a loss surface means that each point represents the model's loss for a particular combination of $\\theta_1$, $\\theta_2$. Let's say our goal is to find the combination of parameters that gives us the lowest loss. \n", - "\n", - "
constrained_gd
\n", - "
\n", - "With no constraint, the optimal $\\hat{\\theta}$ is in the center. \n", - "\n", - "Applying this constraint limits what combinations of model parameters are valid. We can now only consider parameter combinations with a total absolute sum less than or equal to our number $Q$. This means that we can only assign our *regularized* parameter vector $\\hat{\\theta}_{\\text{Reg}}$ to positions in the green diamond below.\n", - "\n", - "
diamondreg
\n", - "
\n", - "We can no longer select the parameter vector that *truly* minimizes the loss surface, $\\hat{\\theta}_{\\text{No Reg}}$, because this combination of parameters does not lie within our allowed region. Instead, we select whatever allowable combination brings us *closest* to the true minimum loss.\n", - "\n", - "
diamond
\n", - "
\n", - "Notice that, under regularization, our optimized $\\theta_1$ and $\\theta_2$ values are much smaller than they were without regularization (indeed, $\\theta_1$ has decreased to 0). The model has *decreased in complexity* because we have limited how much our features contribute to the model. In fact, by setting its parameter to 0, we have effectively removed the influence of feature $\\phi_1$ from the model altogether. \n", - "\n", - "If we change the value of $Q$, we change the region of allowed parameter combinations. The model will still choose the combination of parameters that produces the lowest loss – the closest point in the constrained region to the true minimizer, $\\hat{\\theta}_{\\text{No Reg}}$.\n", - "\n", - "If we make $Q$ smaller:\n", - "
diamondpoint
\n", - "\n", - "If we make $Q$ larger: \n", - "
largerq
\n", - "\n", - "* When $Q$ is small, we severely restrict the size of our parameters. $\\theta_i$s are small in value, and features $\\phi_i$ only contribute a little to the model. The allowed region of model parameters contracts, and the model becomes much simpler.\n", - "* When $Q$ is large, we do not restrict our parameter sizes by much. $\\theta_i$s are large in value, and features $\\phi_i$ contribute more to the model. The allowed region of model parameters expands, and the model becomes more complex.\n", - "\n", - "Consider the extreme case of when $Q$ is extremely large. In this situation, our restriction has essentially no effect, and the allowed region includes the OLS solution!\n", - "\n", - "
verylarge
\n", - "
\n", - "\n", - "Now what if $Q$ were very small? Our parameters are then set to (essentially 0). If the model has no intercept term: $\\hat{\\mathbb{Y}} = (0)\\phi_1 + (0)\\phi_2 + \\ldots = 0$. And if the model has an intercept term: $\\hat{\\mathbb{Y}} = (0)\\phi_1 + (0)\\phi_2 + \\ldots = \\theta_0$. Remember that the intercept term is excluded from the constraint - this is so we avoid the situation where we always predict 0.\n", - "\n", - "Let's summarize what we have seen. \n", - "\n", - "
summary
\n", - "\n", - "## L1 (LASSO) Regularization\n", - "\n", - "How do we actually apply our constraint $\\sum_{i=1}^p |\\theta_i| \\leq Q$? We will do so by modifying the *objective function* that we seek to minimize when fitting a model.\n", - "\n", - "Recall our ordinary least squares objective function: our goal was to find parameters that minimize the model's mean squared error.\n", - "\n", - "$$\\frac{1}{n} \\sum_{i=1}^n (y_i - \\hat{y}_i)^2 = \\frac{1}{n} \\sum_{i=1}^n (y_i - (\\theta_0 + \\theta_1 \\phi_{i, 1} + \\theta_2 \\phi_{i, 2} + \\ldots + \\theta_p \\phi_{i, p}))^2$$\n", - "\n", - "To apply our constraint, we need to rephrase our minimization goal. \n", - "\n", - "$$\\frac{1}{n} \\sum_{i=1}^n (y_i - (\\theta_0 + \\theta_1 \\phi_{i, 1} + \\theta_2 \\phi_{i, 2} + \\ldots + \\theta_p \\phi_{i, p}))^2\\:\\text{such that} \\sum_{i=1}^p |\\theta_i| \\leq Q$$\n", - "\n", - "Unfortunately, we can't directly use this formulation as our objective function – it's not easy to mathematically optimize over a constraint. Instead, we will apply the magic of the [Lagrangian Duality](https://en.wikipedia.org/wiki/Duality_(optimization)). The details of this are out of scope (take EECS 127 if you're interested in learning more), but the end result is very useful. It turns out that minimizing the following *augmented* objective function is *equivalent* to our minimization goal above.\n", - "\n", - "$$\\frac{1}{n} \\sum_{i=1}^n (y_i - (\\theta_0 + \\theta_1 \\phi_{i, 1} + \\theta_2 \\phi_{i, 2} + \\ldots + \\theta_p \\phi_{i, p}))^2 + \\lambda \\sum_{i=1}^p \\vert \\theta_i \\vert = ||\\mathbb{Y} - \\mathbb{X}\\theta||_2^2 + \\lambda \\sum_{i=1}^p |\\theta_i|$$\n", - "\n", - "The second of these two expressions includes the MSE expressed using vector notation.\n", - "\n", - "Notice that we've replaced the constraint with a second term in our objective function. We're now minimizing a function with an additional regularization term that *penalizes large coefficients*. In order to minimize this new objective function, we'll end up balancing two components:\n", - "\n", - "* Keep the model's error on the training data low, represented by the term $\\frac{1}{n} \\sum_{i=1}^n (y_i - (\\theta_0 + \\theta_1 x_{i, 1} + \\theta_2 x_{i, 2} + \\ldots + \\theta_p x_{i, p}))^2$\n", - "* At the same time, keep the magnitudes of model parameters low, represented by the term $\\lambda \\sum_{i=1}^p |\\theta_i|$\n", - "\n", - "The $\\lambda$ factor controls the degree of regularization. Roughly speaking, $\\lambda$ is related to our $Q$ constraint from before by the rule $\\lambda \\approx \\frac{1}{Q}$.
To understand why, let's consider two extreme examples:\n", - "\n", - "- Assume $\\lambda \\rightarrow \\infty$. Then, $\\lambda \\sum_{j=1}^{d} \\vert \\theta_j \\vert$ dominates the cost function. To minimize this term, we set $\\theta_j = 0$ for all $j \\ge 1$. This is a very constrained model that is mathematically equivalent to the constant model. Earlier, we explained the constant model also arises when the L2 norm ball radius $Q \\rightarrow 0$.\n", - "\n", - "- Assume $\\lambda \\rightarrow 0$. Then, $\\lambda \\sum_{j=1}^{d} \\vert \\theta_j \\vert$ is 0. Minimizing the cost function is equivalent to $\\min_{\\theta} \\frac{1}{n} || Y - X\\theta ||_2^2$, our usual MSE loss function. The act of minimizing MSE loss is just our familiar OLS, and the optimal solution is the global minimum $\\hat{\\theta} = \\hat\\theta_{No Reg.}$. We showed that the global optimum is achieved when the L2 norm ball radius $Q \\rightarrow \\infty$.\n", - "\n", - "We call $\\lambda$ the **regularization penalty hyperparameter** and select its value via cross-validation.\n", - "\n", - "The process of finding the optimal $\\hat{\\theta}$ to minimize our new objective function is called **L1 regularization**. It is also sometimes known by the acronym \"LASSO\", which stands for \"Least Absolute Shrinkage and Selection Operator.\"\n", - "\n", - "Unlike ordinary least squares, which can be solved via the closed-form solution $\\hat{\\theta}_{OLS} = (\\mathbb{X}^{\\top}\\mathbb{X})^{-1}\\mathbb{X}^{\\top}\\mathbb{Y}$, there is no closed-form solution for the optimal parameter vector under L1 regularization. Instead, we use the `Lasso` model class of `sklearn`." - ] - }, - { - "cell_type": "code", - "execution_count": 7, - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "array([-2.54932056e-01, -9.48597165e-04, 8.91976284e-06, -1.22872290e-08])" - ] - }, - "execution_count": 7, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "import sklearn.linear_model as lm\n", - "\n", - "# The alpha parameter represents our lambda term\n", - "lasso_model = lm.Lasso(alpha=2)\n", - "lasso_model.fit(X_train, Y_train)\n", - "\n", - "lasso_model.coef_" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Notice that all model coefficients are very small in magnitude. In fact, some of them are so small that they are essentially 0. An important characteristic of L1 regularization is that many model parameters are set to 0. In other words, LASSO effectively **selects only a subset** of the features. The reason for this comes back to our loss surface and allowed \"diamond\" regions from earlier – we can often get closer to the lowest loss contour at a corner of the diamond than along an edge. \n", - "\n", - "When a model parameter is set to 0 or close to 0, its corresponding feature is essentially removed from the model. We say that L1 regularization performs **feature selection** because, by setting the parameters of unimportant features to 0, LASSO \"selects\" which features are more useful for modeling. \n", - "\n", - "## Scaling Features for Regularization\n", - "\n", - "The regularization procedure we just performed had one subtle issue. To see what it is, let's take a look at the design matrix for our `lasso_model`." - ] - }, - { - "cell_type": "code", - "execution_count": 8, - "metadata": {}, - "outputs": [ - { - "data": { - "text/html": [ - "
\n", - "\n", - "\n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - "
hphp^2hp^3hp^4
25985.07225.0614125.052200625.0
12967.04489.0300763.020151121.0
207102.010404.01061208.0108243216.0
30270.04900.0343000.024010000.0
7197.09409.0912673.088529281.0
\n", - "
" - ], - "text/plain": [ - " hp hp^2 hp^3 hp^4\n", - "259 85.0 7225.0 614125.0 52200625.0\n", - "129 67.0 4489.0 300763.0 20151121.0\n", - "207 102.0 10404.0 1061208.0 108243216.0\n", - "302 70.0 4900.0 343000.0 24010000.0\n", - "71 97.0 9409.0 912673.0 88529281.0" - ] - }, - "execution_count": 8, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "X_train.head()" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Our features – `hp`, `hp^2`, `hp^3`, and `hp^4` – are on drastically different numeric scales! The values contained in `hp^4` are orders of magnitude larger than those contained in `hp`. This can be a problem because the value of `hp^4` will naturally contribute more to each predicted $\\hat{y}$ because it is so much greater than the values of the other features. For `hp` to have much of an impact at all on the prediction, it must be scaled by a large model parameter. \n", - "\n", - "By inspecting the fitted parameters of our model, we see that this is the case – the parameter for `hp` is much larger in magnitude than the parameter for `hp^4`." - ] - }, - { - "cell_type": "code", - "execution_count": 9, - "metadata": {}, - "outputs": [ - { - "data": { - "text/html": [ - "
\n", - "\n", - "\n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - "
FeatureParameter
0hp-2.549321e-01
1hp^2-9.485972e-04
2hp^38.919763e-06
3hp^4-1.228723e-08
\n", - "
" - ], - "text/plain": [ - " Feature Parameter\n", - "0 hp -2.549321e-01\n", - "1 hp^2 -9.485972e-04\n", - "2 hp^3 8.919763e-06\n", - "3 hp^4 -1.228723e-08" - ] - }, - "execution_count": 9, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "pd.DataFrame({\"Feature\":X_train.columns, \"Parameter\":lasso_model.coef_})" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Recall that by applying regularization, we give our a model a \"budget\" for how it can allocate the values of model parameters. For `hp` to have much of an impact on each prediction, LASSO is forced to \"spend\" more of this budget on the parameter for `hp`.\n", - "\n", - "We can avoid this issue by **scaling** the data before regularizing. This is a process where we convert all features to the same numeric scale. A common way to scale data is to perform **standardization** such that all features have mean 0 and standard deviation 1; essentially, we replace everything with its Z-score.\n", - "\n", - "$$z_k = \\frac{x_k - \\mu_k}{\\sigma_k}$$\n", - "\n", - "## L2 (Ridge) Regularization\n", - "\n", - "In all of our work above, we considered the constraint $\\sum_{i=1}^p |\\theta_i| \\leq Q$ to limit the complexity of the model. What if we had applied a different constraint?\n", - "\n", - "In **L2 regularization**, also known as **ridge regression**, we constrain the model such that the sum of the *squared* parameters must be less than some number $Q$. This constraint takes the form:\n", - "\n", - "$$\\sum_{i=1}^p \\theta_i^2 \\leq Q$$\n", - "\n", - "As before, we typically do not regularize the intercept term. \n", - "\n", - "The allowed region of parameters for a given value of $Q$ is now shaped like a ball.\n", - "\n", - "
green_constrained_gd_sol
\n", - "\n", - "If we modify our objective function like before, we find that our new goal is to minimize the function:\n", - "$$\\frac{1}{n} \\sum_{i=1}^n (y_i - (\\theta_0 + \\theta_1 \\phi_{i, 1} + \\theta_2 \\phi_{i, 2} + \\ldots + \\theta_p \\phi_{i, p}))^2\\:\\text{such that} \\sum_{i=1}^p \\theta_i^2 \\leq Q$$\n", - "\n", - "Notice that all we have done is change the constraint on the model parameters. The first term in the expression, the MSE, has not changed.\n", - "\n", - "Using Lagrangian Duality, we can re-express our objective function as:\n", - "$$\\frac{1}{n} \\sum_{i=1}^n (y_i - (\\theta_0 + \\theta_1 \\phi_{i, 1} + \\theta_2 \\phi_{i, 2} + \\ldots + \\theta_p \\phi_{i, p}))^2 + \\lambda \\sum_{i=1}^p \\theta_i^2 = ||\\mathbb{Y} - \\mathbb{X}\\theta||_2^2 + \\lambda \\sum_{i=1}^p \\theta_i^2$$\n", - "\n", - "When applying L2 regularization, our goal is to minimize this updated objective function.\n", - "\n", - "Unlike L1 regularization, L2 regularization *does* have a closed-form solution for the best parameter vector when regularization is applied:\n", - "\n", - "$$\\hat\\theta_{\\text{ridge}} = (\\mathbb{X}^{\\top}\\mathbb{X} + n\\lambda I)^{-1}\\mathbb{X}^{\\top}\\mathbb{Y}$$\n", - "\n", - "This solution exists **even if $\\mathbb{X}$ is not full column rank**. This is a major reason why L2 regularization is often used – it can produce a solution even when there is colinearity in the features. We will discuss the concept of colinearity in a future lecture. We will not derive this result in Data 100, as it involves a fair bit of matrix calculus.\n", - "\n", - "In `sklearn`, we perform L2 regularization using the `Ridge` class. Notice that we scale the data before regularizing." - ] - }, - { - "cell_type": "code", - "execution_count": 10, - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "array([ 5.89130559e-02, -6.42445915e-03, 4.44468157e-05, -8.83981945e-08])" - ] - }, - "execution_count": 10, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "ridge_model = lm.Ridge(alpha=1) # alpha represents the hyperparameter lambda\n", - "ridge_model.fit(X_train, Y_train)\n", - "\n", - "ridge_model.coef_" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Regression Summary\n", - "\n", - "Our regression models are summarized below. Note the objective function is what the gradient descent optimizer minimizes. 
\n", - "\n", - "| Type | Model | Loss | Regularization | Objective Function | Solution |\n", - "|-----------------|----------------------------------------|---------------|----------------|-------------------------------------------------------------------------------------------|------------------------------------------------------------------------------------------------------|\n", - "| OLS | $\\hat{\\mathbb{Y}} = \\mathbb{X}\\theta$ | MSE | None | $\\frac{1}{n} \\|\\mathbb{Y}-\\mathbb{X} \\theta\\|^2_2$ | $\\hat{\\theta}_{OLS} = (\\mathbb{X}^{\\top}\\mathbb{X})^{-1}\\mathbb{X}^{\\top}\\mathbb{Y}$ if $\\mathbb{X}$ is full column rank |\n", - "| Ridge | $\\hat{\\mathbb{Y}} = \\mathbb{X} \\theta$ | MSE | L2 | $\\frac{1}{n} \\|\\mathbb{Y}-\\mathbb{X}\\theta\\|^2_2 + \\lambda \\sum_{i=1}^p \\theta_i^2$ | $\\hat{\\theta}_{ridge} = (\\mathbb{X}^{\\top}\\mathbb{X} + n \\lambda I)^{-1}\\mathbb{X}^{\\top}\\mathbb{Y}$ |\n", - "| LASSO | $\\hat{\\mathbb{Y}} = \\mathbb{X} \\theta$ | MSE | L1 | $\\frac{1}{n} \\|\\mathbb{Y}-\\mathbb{X}\\theta\\|^2_2 + \\lambda \\sum_{i=1}^p \\vert \\theta_i \\vert$ | No closed form | |\n" - ] - } - ], - "metadata": { - "kernelspec": { - "display_name": "Python 3 (ipykernel)", - "language": "python", - "name": "python3" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.9.16" - } - }, - "nbformat": 4, - "nbformat_minor": 4 -} diff --git a/docs/case_study_HCE/case_study_HCE.html b/docs/case_study_HCE/case_study_HCE.html new file mode 100644 index 00000000..0d93c77a --- /dev/null +++ b/docs/case_study_HCE/case_study_HCE.html @@ -0,0 +1,1214 @@ + + + + + + + + + +Principles and Techniques of Data Science - 15  Case Study in Human Contexts and Ethics + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
15  Case Study in Human Contexts and Ethics

+
+ + + +
+ + + + +
+ + +
+ +
+
+
+ +
+
+Learning Outcomes +
+
+
+
+
+
    +
  • Learn about the ethical dilemmas that data scientists face.
  • +
  • Know how to critique models using contextual knowledge about data.
  • +
+
+
+
+
+

Disclaimer: The following chapter discusses issues of structural racism. Some of the items in the chapter may be sensitive and may or may not be the opinions, ideas, and beliefs of the students who collected the materials. The Data 100 course staff tries its best to only present information that is relevant for teaching the lessons at hand.

+
+

Note: Given the nuanced nature of some of the arguments made in the lecture, it is highly recommended that you view the lecture recording in order to fully engage and understand the material. The course notes will have the same broader structure but are by no means comprehensive.

+

Let’s immerse ourselves in the real-world story of data scientists working for an organization called the Cook County Assessor’s Office (CCAO). Their job is to estimate the values of houses in order to assign property taxes. This is because the tax burden in this area is determined by the estimated value of a house, which is different from its price. Since values change over time and there are no obvious indicators of value, they created a model to estimate the values of houses. In this chapter, we will dig deep into what problems biased the models, the consequences to human lives, and how we can learn from this example to do better.

+
+

15.1 The Problem

+

A report by the Chicago Tribune uncovered a major scandal: the team showed that the model perpetuated a highly regressive tax system that disproportionately burdened African-American and Latinx homeowners in Cook County. How did they know?

+
+ +
+

In the field of housing assessment, there are standard metrics that assessors use across the world to estimate the fairness of assessments: coefficient of dispersion and price-related differential. These metrics have been rigorously tested by experts in the field and are out of scope for our class. Calculating these metrics for the Cook County prices revealed that the pricing created by the CCAO did not fall in acceptable ranges (see figure above). This by itself is not the end of the story, but a good indicator that something fishy was going on.

+
+ +
+

This prompted them to investigate if the model itself was producing fair tax rates. Evidently, when accounting for the home owner’s income, they found that the model actually produced a regressive tax rate (see figure above). A tax rate is regressive if the percentage tax rate is higher for individuals with lower net income. A tax rate is progressive if the percentage tax rate is higher for individuals with higher net income.

+
+ +
+


Further digging suggests that not only was the system unfair to people across the axis of income, it was also unfair across the axis of race (see figure above). The likelihood of a property being under- or over-assessed was highly dependent on the owner’s race, and that did not sit well with many homeowners.

+
+

15.1.1 Spotlight: Appeals

+

What actually caused this to come about? A comprehensive answer goes beyond just models. At the end of the day, these are real systems with many moving parts, one of which was the appeals system. Homeowners are mailed the value of their home as assessed by the CCAO, and they can choose to appeal to a board of elected officials to try to change the listed value of their home, and thus how much they are taxed. In theory, this sounds like a very fair system: a person, rather than just an algorithm, oversees the final pricing of houses. In practice, however, it ended up exacerbating the problems.

+
+

“Appeals are a good thing,” Thomas Jaconetty, deputy assessor for valuation and appeals, said in an interview. “The goal here is fairness. We made the numbers. We can change them.”

+
+
+ +
+


+

Here we can borrow lessons from Critical Race Theory. On the surface, the fact that everyone has the legal right to appeal seems undeniably fair. However, not everyone has an equal ability to exercise that right. Those who have the money to hire tax lawyers to appeal for them have a drastically higher chance of both trying and succeeding (see above figure). This model is part of a deeper institutional pattern rife with potential corruption.

+
+ +
+


+

Homeowners who appealed were generally under-assessed relative to homeowners who did not (see above figure). Those with higher incomes pay less in property tax, tax lawyers are able to grow their business due to their role in appeals, and politicians are commonly socially connected to the aforementioned tax lawyers and wealthy homeowners. All these stakeholders have reasons to advertise the model as an integral part of a fair system. Here lies the value in asking questions: a system that seems fair on the surface may in actuality be unfair upon taking a closer look.

+
+
+

15.1.2 Human Impacts

+
+ +
+


+

The impact of the housing model extends beyond the realm of home ownership and taxation. Discriminatory practices have a long history within the United States, and the model served to perpetuate them. To this day, Chicago is one of the most segregated cities in the United States (source). These factors are central to informing us, as data scientists, about what is at stake.

+
+
+

15.1.3 Spotlight: Intersection of Real Estate and Race

+

Housing, among other factors, has been a persistent source of racial inequality throughout US history. It is one of the main areas where inequalities are created and reproduced. In the beginning, Jim Crow laws were explicit in barring people of color from schools, public utilities, and more.

+
+ +
+


+

Today, while advancements in Civil Rights have been made, the spirit of those laws is still alive in many parts of the US. The real estate industry was “professionalized” in the 1920s and 1930s by aspiring to become a science guided by strict methods and principles, outlined below:

+
    +
  • Redlining: making it difficult or impossible to get a federally-backed mortgage to buy a house in specific neighborhoods coded as “risky” (red).
     • What made a neighborhood “risky”, according to the makers of these maps, was its racial composition.
     • Segregation was not only a result of federal policy, but was also developed by real estate professionals.
  • The methods centered on creating objective rating systems (information technologies) for the appraisal of property values, which encoded race as a factor of valuation (see figure below).
     • This, in turn, influenced federal policy and practice.
+
+ +
+Source: Colin Koopman, How We Became Our Data (2019) p. 137 +
+
+


+
+
+
+

15.2 The Response: Cook County Open Data Initiative

+

The response started in politics. A new assessor, Fritz Kaegi, was elected and created a new mandate with two goals:

+
    +
  1. Distributional equity in property taxation, meaning that properties of the same value are treated alike during assessments.
  2. Creating a new Office of Data Science.
+
+ +
+


+
+

15.2.1 Question/Problem Formulation

+
+
+
+ +
+
+Driving Questions +
+
+
+
    +
  • What do we want to know?
  • +
  • What problems are we trying to solve?
  • +
  • What are the hypotheses we want to test?
  • +
  • What are our metrics for success?
  • +
+
+
+

The new Office of Data Science started by redefining their goals.

+
    +
  1. Accurately, uniformly, and impartially assess the value of a home by
     • Following international standards (coefficient of dispersion)
     • Predicting the value of all homes with as little total error as possible
  2. Create a robust pipeline that accurately assesses property values at scale and is fair by
     • Disrupting the circuit of corruption (Board of Review appeals process)
     • Eliminating regressivity
     • Engendering trust in the system among all stakeholders
+
+
+
+ +
+
+Definitions: Fairness and Transparency +
+
+
+

The definitions, as given by the Cook County Assessor’s Office, are given below:

+
    +
  • Fairness: The ability of our pipeline to accurately assess property values, accounting for disparities in geography, information, etc.
  • +
  • Transparency: The ability of the data science department to share and explain pipeline results and decisions to both internal and external stakeholders
  • +
+

Note how the Office defines “fairness” in terms of accuracy. Thus, the problem - make the system more fair - was already framed in terms amenable to a data scientist: make the assessments more accurate.
The idea here is that if the model is more accurate it will also (perhaps necessarily) become more fair, which is a big assumption. There are, in a sense, two different problems - make accurate assessments, and make a fair system.

+
+
+

The way the goals are defined leads us to ask: what does it actually mean to accurately assess property values, and what role does “scale” play?

+
    +
  1. What is an assessment of a home’s value?
  2. What makes one assessment more accurate than another?
  3. What makes one batch of assessments more accurate than another batch?
+

Each of the above questions leads to a slew of more questions. Considering just the first question, one answer could be that an assessment is an estimate of the value of a home. This leads to more inquiries: what is the value of a home? What determines it? How do we know? For this class, we take it to be the house’s market value.

+
+
+

15.2.2 Data Acquisition and Cleaning

+
+
+
+ +
+
+Driving Questions +
+
+
+
    +
  • What data do we have, and what data do we need?
  • +
  • How will we sample more data?
  • +
  • Is our data representative of the population we want to study?
  • +
+
+
+

The data scientists also critically examined their original sales data:

+
+ +
+


+

and asked the questions:

+
    +
  1. How was this data collected?
  2. When was this data collected?
  3. Who collected this data?
  4. For what purposes was the data collected?
  5. How and why were particular categories created?
+

For example, attributes can have different likelihoods of appearing in the data: housing data from the floodplain region of Chicago was less represented than data from other regions.

+

The features can even be reported at different rates. Improvements in homes, which tend to increase property value, were unlikely to be reported by the homeowners.

+

Additionally, they found that there was simply more missing data in lower income neighborhoods.

+
+
+

15.2.3 Exploratory Data Analysis

+
+
+
+ +
+
+Driving Questions +
+
+
+
    +
  • How is our data organized, and what does it contain?
  • +
  • Do we already have relevant data?
  • +
  • What are the biases, anomalies, or other issues with the data?
  • +
  • How do we transform the data to enable effective analysis?
  • +
+
+
+

Before the modeling step, they investigated a multitude of crucial questions:

+
    +
  1. Which attributes are most predictive of sales price?
  2. Is the data uniformly distributed?
  3. Do all neighborhoods have up to date data? Do all neighborhoods have the same granularity?
  4. Do some neighborhoods have missing or outdated data?
+

First, they found that certain features, such as the number of bedrooms, were much more important in determining house value in some neighborhoods than in others. This informed them that different models should be used depending on the neighborhood.

+

They also noticed that low-income neighborhoods had disproportionately spotty data. This informed them that they needed to develop new data collection practices, including finding new sources of data.

+
+
+

15.2.4 Prediction and Inference

+
+
+
+ +
+
+Driving Questions +
+
+
+
    +
  • What does the data say about the world?
  • +
  • Does it answer our questions or accurately solve the problem?
  • +
  • How robust are our conclusions and can we trust the predictions?
  • +
+
+
+

Rather than using a single model to predict the sale prices (“fair market value”) of unsold properties, the CCAO fit machine learning models that discover patterns using known sale prices and the characteristics of similar and nearby properties, with different model weights for each township.

+

Compared to traditional mass appraisal, the CCAO’s new approach is more granular and more sensitive to neighborhood variations.

+

Here, we might ask why should any particular individual believe that the model is accurate for their property?

+

This leads us to recognize that the CCAO counts on its performance of “transparency” (putting data, models, pipeline onto GitLab) to foster public trust, which would help it equate the production of “accurate assessments” with “fairness”.

+

There’s a lot more to be said here on the relationship between accuracy, fairness, and metrics we tend to use when evaluating our models. Given the nuanced nature of the argument, it is recommended you view the corresponding lecture as the course notes are not as comprehensive for this portion of lecture.

+
+
+

15.2.5 Reports, Decisions, and Conclusions

+
+
+
+ +
+
+Driving Questions +
+
+
+
    +
  • How successful is the system for each goal? +
      +
    • Accuracy/uniformity of the model
    • +
    • Fairness and transparency that eliminates regressivity and engenders trust
    • +
  • +
  • How do you know?
  • +
+
+
+

The model is not the end of the road. The new Office still sends homeowners their house evaluations, but now the data that homeowners send back is taken into account. More detailed reports are being written by the Office itself to democratize the information. Town halls and other public-facing outreach help involve the whole community in the process of housing evaluations, rather than limiting participation to a select few.

+
+
+
+

15.3 Key Takeaways

+
    +
  1. Accuracy is a necessary, but not sufficient, condition of a fair system.
  2. Fairness and transparency are context-dependent and sociotechnical concepts.
  3. Learn to work with contexts, and consider how your data analysis will reshape them.
  4. Keep in mind the power, and limits, of data analysis.
+
+
+

15.4 Lessons for Data Science Practice

+
    +
  1. Question/Problem Formulation
     • Who is responsible for framing the problem?
     • Who are the stakeholders? How are they involved in the problem framing?
     • What do you bring to the table? How does your positionality affect your understanding of the problem?
     • What are the narratives that you’re tapping into?
  2. Data Acquisition and Cleaning
     • Where does the data come from?
     • Who collected it? For what purpose?
     • What kinds of collecting and recording systems and techniques were used?
     • How has this data been used in the past?
     • What restrictions are there on access to the data, and what enables you to have access?
  3. Exploratory Data Analysis & Visualization
     • What kind of personal or group identities have become salient in this data?
     • Which variables became salient, and what kinds of relationships obtain between them?
     • Do any of the relationships made visible lend themselves to arguments that might be potentially harmful to a particular community?
  4. Prediction and Inference
     • What does the prediction or inference do in the world?
     • Are the results useful for the intended purposes?
     • Are there benchmarks to compare the results?
     • How are your predictions and inferences dependent upon the larger system in which your model works?
  5. Reports, Decisions, and Solutions
     • How do we know if we have accomplished our goals?
     • How does your work fit in the broader literature?
     • Where does your work agree or disagree with the status quo?
     • Do your conclusions make sense?
+ + + + +
+ +
+ + +
+ + + + \ No newline at end of file diff --git a/docs/case_study_HCE/images/vis_1.png b/docs/case_study_HCE/images/vis_1.png new file mode 100644 index 00000000..a9ecac7b Binary files /dev/null and b/docs/case_study_HCE/images/vis_1.png differ diff --git a/docs/case_study_HCE/images/vis_10.png b/docs/case_study_HCE/images/vis_10.png new file mode 100644 index 00000000..61daefb9 Binary files /dev/null and b/docs/case_study_HCE/images/vis_10.png differ diff --git a/docs/case_study_HCE/images/vis_2.png b/docs/case_study_HCE/images/vis_2.png new file mode 100644 index 00000000..db39da9e Binary files /dev/null and b/docs/case_study_HCE/images/vis_2.png differ diff --git a/docs/case_study_HCE/images/vis_3.jpg b/docs/case_study_HCE/images/vis_3.jpg new file mode 100644 index 00000000..72e64539 Binary files /dev/null and b/docs/case_study_HCE/images/vis_3.jpg differ diff --git a/docs/case_study_HCE/images/vis_4.png b/docs/case_study_HCE/images/vis_4.png new file mode 100644 index 00000000..472809df Binary files /dev/null and b/docs/case_study_HCE/images/vis_4.png differ diff --git a/docs/case_study_HCE/images/vis_5.png b/docs/case_study_HCE/images/vis_5.png new file mode 100644 index 00000000..74853eb2 Binary files /dev/null and b/docs/case_study_HCE/images/vis_5.png differ diff --git a/docs/case_study_HCE/images/vis_6.png b/docs/case_study_HCE/images/vis_6.png new file mode 100644 index 00000000..60d63cfb Binary files /dev/null and b/docs/case_study_HCE/images/vis_6.png differ diff --git a/docs/case_study_HCE/images/vis_7.png b/docs/case_study_HCE/images/vis_7.png new file mode 100644 index 00000000..ed490433 Binary files /dev/null and b/docs/case_study_HCE/images/vis_7.png differ diff --git a/docs/case_study_HCE/images/vis_8.png b/docs/case_study_HCE/images/vis_8.png new file mode 100644 index 00000000..e2ebc46b Binary files /dev/null and b/docs/case_study_HCE/images/vis_8.png differ diff --git a/docs/case_study_HCE/images/vis_9.png b/docs/case_study_HCE/images/vis_9.png new file mode 100644 index 00000000..aab37580 Binary files /dev/null and b/docs/case_study_HCE/images/vis_9.png differ diff --git a/docs/constant_model_loss_transformations/loss_transformations.html b/docs/constant_model_loss_transformations/loss_transformations.html index 29371ea8..2d7f17e1 100644 --- a/docs/constant_model_loss_transformations/loss_transformations.html +++ b/docs/constant_model_loss_transformations/loss_transformations.html @@ -232,6 +232,18 @@ 14  Sklearn and Feature Engineering + + + diff --git a/docs/constant_model_loss_transformations/loss_transformations_files/figure-html/cell-10-output-2.png b/docs/constant_model_loss_transformations/loss_transformations_files/figure-html/cell-10-output-2.png index 2c87cfcc..d768d848 100644 Binary files a/docs/constant_model_loss_transformations/loss_transformations_files/figure-html/cell-10-output-2.png and b/docs/constant_model_loss_transformations/loss_transformations_files/figure-html/cell-10-output-2.png differ diff --git a/docs/constant_model_loss_transformations/loss_transformations_files/figure-html/cell-12-output-1.png b/docs/constant_model_loss_transformations/loss_transformations_files/figure-html/cell-12-output-1.png index 2d1655c2..e351a778 100644 Binary files a/docs/constant_model_loss_transformations/loss_transformations_files/figure-html/cell-12-output-1.png and b/docs/constant_model_loss_transformations/loss_transformations_files/figure-html/cell-12-output-1.png differ diff --git 
a/docs/constant_model_loss_transformations/loss_transformations_files/figure-html/cell-13-output-1.png b/docs/constant_model_loss_transformations/loss_transformations_files/figure-html/cell-13-output-1.png index d75aaf3b..d6e0808a 100644 Binary files a/docs/constant_model_loss_transformations/loss_transformations_files/figure-html/cell-13-output-1.png and b/docs/constant_model_loss_transformations/loss_transformations_files/figure-html/cell-13-output-1.png differ diff --git a/docs/constant_model_loss_transformations/loss_transformations_files/figure-html/cell-18-output-1.png b/docs/constant_model_loss_transformations/loss_transformations_files/figure-html/cell-18-output-1.png index f258bbd4..a5513d21 100644 Binary files a/docs/constant_model_loss_transformations/loss_transformations_files/figure-html/cell-18-output-1.png and b/docs/constant_model_loss_transformations/loss_transformations_files/figure-html/cell-18-output-1.png differ diff --git a/docs/constant_model_loss_transformations/loss_transformations_files/figure-html/cell-19-output-1.png b/docs/constant_model_loss_transformations/loss_transformations_files/figure-html/cell-19-output-1.png index e36a6dde..c9b146de 100644 Binary files a/docs/constant_model_loss_transformations/loss_transformations_files/figure-html/cell-19-output-1.png and b/docs/constant_model_loss_transformations/loss_transformations_files/figure-html/cell-19-output-1.png differ diff --git a/docs/constant_model_loss_transformations/loss_transformations_files/figure-html/cell-20-output-1.png b/docs/constant_model_loss_transformations/loss_transformations_files/figure-html/cell-20-output-1.png index 7530590c..ae3ccf06 100644 Binary files a/docs/constant_model_loss_transformations/loss_transformations_files/figure-html/cell-20-output-1.png and b/docs/constant_model_loss_transformations/loss_transformations_files/figure-html/cell-20-output-1.png differ diff --git a/docs/constant_model_loss_transformations/loss_transformations_files/figure-html/cell-5-output-1.png b/docs/constant_model_loss_transformations/loss_transformations_files/figure-html/cell-5-output-1.png index 3bc172d7..8495665a 100644 Binary files a/docs/constant_model_loss_transformations/loss_transformations_files/figure-html/cell-5-output-1.png and b/docs/constant_model_loss_transformations/loss_transformations_files/figure-html/cell-5-output-1.png differ diff --git a/docs/constant_model_loss_transformations/loss_transformations_files/figure-html/cell-7-output-1.png b/docs/constant_model_loss_transformations/loss_transformations_files/figure-html/cell-7-output-1.png index a5b366fc..3e274fc7 100644 Binary files a/docs/constant_model_loss_transformations/loss_transformations_files/figure-html/cell-7-output-1.png and b/docs/constant_model_loss_transformations/loss_transformations_files/figure-html/cell-7-output-1.png differ diff --git a/docs/constant_model_loss_transformations/loss_transformations_files/figure-html/cell-9-output-1.png b/docs/constant_model_loss_transformations/loss_transformations_files/figure-html/cell-9-output-1.png index c60bfd6f..5615c4f7 100644 Binary files a/docs/constant_model_loss_transformations/loss_transformations_files/figure-html/cell-9-output-1.png and b/docs/constant_model_loss_transformations/loss_transformations_files/figure-html/cell-9-output-1.png differ diff --git a/docs/cv_regularization/cv_reg.html b/docs/cv_regularization/cv_reg.html new file mode 100644 index 00000000..60e04d46 --- /dev/null +++ b/docs/cv_regularization/cv_reg.html @@ -0,0 +1,996 @@ + + + + + + + + + 
+Principles and Techniques of Data Science - 16  Cross Validation and Regularization + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
16  Cross Validation and Regularization

+
+ + + +
+ + + + +
+ + +
+ +
+
+
+ +
+
+Learning Outcomes +
+
+
+
+
+
    +
  • Recognize the need for validation and test sets to preview model performance on unseen data
  • +
  • Apply cross-validation to select model hyperparameters
  • +
  • Understand the conceptual basis for L1 and L2 regularization
  • +
+
+
+
+

At the end of the Feature Engineering lecture (Lecture 14), we arrived at the issue of fine-tuning model complexity. We identified that a model that’s too complex can lead to overfitting, while a model that’s too simple can lead to underfitting. This brings us to a natural question: how do we control model complexity to avoid under- and overfitting?

+

To answer this question, we will need to address two things. First, we need to understand when our model begins to overfit by assessing its performance on unseen data; we can achieve this through cross-validation. Second, we need a technique for adjusting the complexity of our models ourselves – to do so, we will apply regularization.

+
+

16.1 Training, Test, and Validation Sets

+

From the last lecture, we learned that increasing model complexity decreased our model’s training error but increased its variance. This makes intuitive sense: adding more features causes our model to fit more closely to data it encountered during training, but generalize worse to new data it hasn’t seen before. For this reason, a low training error is not always representative of our model’s underlying performance - we need to also assess how well it performs on unseen data to ensure that it is not overfitting.

+

Truly, the only way to know when our model overfits is by evaluating it on unseen data. Unfortunately, that means we need to wait for more data. This may be very expensive and time-consuming.

+

How should we proceed? In this section, we will build up a viable solution to this problem.

+
+

16.1.1 Test Sets

+

The simplest approach to avoid overfitting is to keep some of our data “secret” from ourselves. We can set aside a random portion of our full dataset to use only for testing purposes. The datapoints in this test set will not be used in the model fitting process. Instead, we will:

+
    +
  • Use the remaining portion of our dataset – now called the training set – to run ordinary least squares, gradient descent, or some other technique to fit model parameters
  • +
  • Take the fitted model and use it to make predictions on datapoints in the test set. The model’s performance on the test set (expressed as the MSE, RMSE, etc.) is now indicative of how well it can make predictions on unseen data
  • +
+

Importantly, the optimal model parameters were found by only considering the data in the training set. After the model has been fitted to the training data, we do not change any parameters before making predictions on the test set. Moreover, we only ever make predictions on the test set once, after all model design has been completely finalized. We treat the test set performance as the final test of how well a model does.

+

The process of sub-dividing our dataset into training and test sets is known as a train-test split. Typically, between 10% and 20% of the data is allocated to the test set.

+
+train-test-split +
+

In sklearn, the train_test_split function of the model_selection module allows us to automatically generate train-test splits.

+

Throughout this lecture, we will work with the vehicles dataset from previous lectures. As before, we will attempt to predict the mpg of a vehicle from transformations of its hp. In the cell below, we allocate 20% of the full dataset to testing and the remaining 80% to training.

+
+
+Code +
import pandas as pd
+import numpy as np
+import seaborn as sns
+import warnings
+warnings.filterwarnings('ignore')
+
+# Load the dataset and construct the design matrix
+vehicles = sns.load_dataset("mpg").rename(columns={"horsepower":"hp"}).dropna()
+X = vehicles[["hp"]]
+X["hp^2"] = vehicles["hp"]**2
+X["hp^3"] = vehicles["hp"]**3
+X["hp^4"] = vehicles["hp"]**4
+
+Y = vehicles["mpg"]
+
+
+
+
from sklearn.model_selection import train_test_split
+
+# `test_size` specifies the proportion of the full dataset that should be allocated to testing
+# `random_state` makes our results reproducible for educational purposes
+X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=220)
+
+print(f"Size of full dataset: {X.shape[0]} points")
+print(f"Size of training set: {X_train.shape[0]} points")
+print(f"Size of test set: {X_test.shape[0]} points")
+
+
Size of full dataset: 392 points
+Size of training set: 313 points
+Size of test set: 79 points
+
+
+

After performing our train-test split, we fit a model to the training set and assess its performance on the test set.

+
+
import sklearn.linear_model as lm
+from sklearn.metrics import mean_squared_error
+
+model = lm.LinearRegression()
+
+# Fit to the training set
+model.fit(X_train, Y_train)
+
+# Make predictions on the test set
+test_predictions = model.predict(X_test)
+
+
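To put a number on the test-set performance, we can compare these predictions against the true mpg values. The following is a minimal sketch (not part of the original notebook output) that reuses the mean_squared_error import and the test_predictions computed above:

# Sketch: evaluate the fitted model on the held-out test set
# (mean_squared_error was imported above from sklearn.metrics)
test_mse = mean_squared_error(Y_test, test_predictions)
print(f"Test MSE: {test_mse}")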
+
+

16.1.2 Validation Sets

+

Now, what if we were dissatisfied with our test set performance? With our current framework, we’d be stuck. As outlined previously, assessing model performance on the test set is the final stage of the model design process. We can’t go back and adjust our model based on the new discovery that it is overfitting – if we did, then we would be factoring in information from the test set to design our model. The test error would no longer be a true representation of the model’s performance on unseen data!

+

Our solution is to introduce a validation set. A validation set is a random portion of the training set that is set aside for assessing model performance while the model is still being developed. The process for using a validation set is:

+
    +
  • Perform a train-test split. Set the test set aside; we will not touch it until the very end of the model design process.
  • +
  • Set aside a portion of the training set to be used for validation.
  • +
  • Fit the model parameters to the datapoints contained in the remaining portion of the training set.
  • +
  • Assess the model’s performance on the validation set. Adjust the model as needed, re-fit it to the remaining portion of the training set, then re-evaluate it on the validation set. Repeat as necessary until you are satisfied.
  • +
  • After all model development is complete, assess the model’s performance on the test set. This is the final test of how well the model performs on unseen data. No further modifications should be made to the model.
  • +
+

The process of creating a validation set is called a validation split.

+
+validation-split +
+
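As a rough sketch of this workflow (assuming the X_train and Y_train created earlier; the variable names X_train_mini and X_val and the 25% split fraction are illustrative, not from the original notebook), we could carve a validation set out of the training set with another call to train_test_split:

# Sketch: set aside a portion of the training data for validation
X_train_mini, X_val, Y_train_mini, Y_val = train_test_split(
    X_train, Y_train, test_size=0.25, random_state=100)

# Fit only on the reduced training set, then assess on the validation set
val_model = lm.LinearRegression()
val_model.fit(X_train_mini, Y_train_mini)
val_mse = mean_squared_error(Y_val, val_model.predict(X_val))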

Note that the validation error behaves quite differently from the training error explored previously. Recall that the training error decreased monotonically with increasing model degree – as the model became more complex, it made better and better predictions on the training data. The validation error, in contrast, decreases then increases as we increase model complexity. This reflects the transition from under- to overfitting. At low model complexity, the model underfits because it is not complex enough to capture the main trends in the data. At high model complexity, the model overfits because it “memorizes” the training data too closely.

+

We can update our understanding of the relationships between error, complexity, and model variance:

+
+training_validation_curve +
+

Our goal is to train a model with complexity near the orange dotted line – this is where our model achieves minimum error on the validation set. Note that this relationship is a simplification of the real world, but for the purposes of Data 100, it is good enough.

+
+
+
+

16.2 K-Fold Cross-Validation

+

Introducing a validation set gave us an “extra” chance to assess model performance on another set of unseen data. We are able to finetune the model design based on its performance on this one set of validation data.

+

But what if, by random chance, our validation set just happened to contain many outliers? It is possible that the validation datapoints we set aside do not actually represent other unseen data that the model might encounter. Ideally, we would like to validate our model’s performance on several different unseen datasets. This would give us greater confidence in our understanding of how the model behaves on new data.

+

Let’s think back to our validation framework. Earlier, we set aside x% of our training data (say, 20%) to use for validation.

+
+validation_set +
+

In the example above, we set aside the first 20% of training datapoints for the validation set. This was an arbitrary choice. We could have set aside any 20% portion of the training data for validation. In fact, there are 5 non-overlapping “chunks” of training points that we could have designated as the validation set.

+
+possible_validation_sets +
+

The common term for one of these chunks is a fold. In the example above, we had 5 folds, each containing 20% of the training data. This gives us a new perspective: we really have 5 validation sets “hidden” in our training set.

+

In cross-validation, we perform validation splits for each fold in the training set. For a dataset with \(K\) folds, we:

+
    +
  • Pick one fold to be the validation fold
  • +
  • Fit the model to training data from every fold other than the validation fold
  • +
  • Compute the model’s error on the validation fold and record it
  • +
  • Repeat for all \(K\) folds
  • +
+The cross-validation error is then the average error across all \(K\) validation folds. +
+cross_validation +
+
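To make this concrete, here is a minimal sketch of K-fold cross-validation written out by hand with sklearn's KFold, assuming the X_train and Y_train from earlier (the choice of 5 folds and the random_state are illustrative):

from sklearn.model_selection import KFold

kf = KFold(n_splits=5, shuffle=True, random_state=100)
fold_errors = []

for train_idx, val_idx in kf.split(X_train):
    # Fit to the training data from every fold other than the validation fold
    fold_model = lm.LinearRegression()
    fold_model.fit(X_train.iloc[train_idx], Y_train.iloc[train_idx])
    # Compute and record the error on the validation fold
    val_preds = fold_model.predict(X_train.iloc[val_idx])
    fold_errors.append(mean_squared_error(Y_train.iloc[val_idx], val_preds))

# The cross-validation error is the average error across all folds
cv_error = np.mean(fold_errors)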
+

16.2.1 Model Selection Workflow

+

At this stage, we have refined our model selection workflow. We begin by performing a train-test split to set aside a test set for the final evaluation of model performance. Then, we alternate between adjusting our design matrix and computing the cross-validation error to finetune the model’s design. In the example below, we illustrate the use of 4-fold cross-validation to help inform model design.

+
+model_selection +
+
+
+

16.2.2 Hyperparameters

+

An important use of cross-validation is for hyperparameter selection. A hyperparameter is some value in a model that is chosen before the model is fit to any data. This means that it is distinct from the model parameters \(\theta_i\) because its value is selected before the training process begins. We cannot use our usual techniques – calculus, ordinary least squares, or gradient descent – to choose its value. Instead, we must decide it ourselves.

+

Some examples of hyperparameters in Data 100 are:

+
    +
  • The degree of our polynomial model (recall that we selected the degree before creating our design matrix and calling .fit)
  • +
  • The learning rate, \(\alpha\), in gradient descent
  • +
  • The regularization penalty, \(\lambda\) (to be introduced later this lecture)
  • +
+

To select a hyperparameter value via cross-validation, we first list out several “guesses” for what the best hyperparameter may be. For each guess, we then run cross-validation to compute the cross-validation error incurred by the model when using that choice of hyperparameter value. We then select the value of the hyperparameter that resulted in the lowest cross-validation error.

+

For example, we may wish to use cross-validation to decide what value we should use for \(\alpha\), which controls the step size of each gradient descent update. To do so, we list out some possible guesses for the best \(\alpha\): 0.1, 1, and 10. For each possible value, we perform cross-validation to see what error the model has when we use that value of \(\alpha\) to train it.

+
+hyperparameter_tuning +
+
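As an illustrative sketch (not code from the original lecture), we could use cross-validation to choose the degree of our polynomial model for the vehicles data. The candidate degrees and the use of sklearn's cross_val_score below are assumptions made for demonstration; cross_val_score reports negative MSE, so we negate it to recover the cross-validation error:

from sklearn.model_selection import cross_val_score

candidate_degrees = [1, 2, 3, 4]
cv_errors = {}

for degree in candidate_degrees:
    # Design matrix containing hp, hp^2, ..., up to the candidate degree
    cols = ["hp"] + [f"hp^{d}" for d in range(2, degree + 1)]
    scores = cross_val_score(lm.LinearRegression(), X_train[cols], Y_train,
                             cv=5, scoring="neg_mean_squared_error")
    cv_errors[degree] = -np.mean(scores)

# Select the hyperparameter value with the lowest cross-validation error
best_degree = min(cv_errors, key=cv_errors.get)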
+
+
+

16.3 Regularization

+

We’ve now addressed the first of our two goals for today: creating a framework to assess model performance on unseen data. Now, we’ll discuss our second objective: developing a technique to adjust model complexity. This will allow us to directly tackle the issues of under- and overfitting.

+

Earlier, we adjusted the complexity of our polynomial model by tuning a hyperparameter – the degree of the polynomial. We trialed several different polynomial degrees, computed the validation error for each, and selected the value that minimized the validation error. Tweaking the “complexity” was simple; it was only a matter of adjusting the polynomial degree.

+

In most machine learning problems, complexity is defined differently from what we have seen so far. Today, we’ll explore two different definitions of complexity: the squared and absolute magnitude of \(\theta_i\) coefficients.

+
+

16.3.1 Constraining Model Parameters

+

Think back to our work using gradient descent to descend down a loss surface. You may find it helpful to refer back to the Gradient Descent note to refresh your memory. Our aim was to find the combination of model parameters that led to the model having minimum loss. We visualized this using a contour map by plotting possible parameter values on the horizontal and vertical axes, which allows us to take a bird’s eye view above the loss surface. We want to find the model parameters corresponding to the lowest point on the loss surface.

+
+unconstrained +
+

Let’s review our current modeling framework.

+

\[\hat{\mathbb{Y}} = \theta_0 + \theta_1 \phi_1 + \theta_2 \phi_2 + \ldots + \theta_p \phi_p\]

+

Recall that we represent our features with \(\phi_i\) to reflect the fact that we have performed feature engineering.

+

Previously, we restricted model complexity by limiting the total number of features present in the model. We only included a limited number of polynomial features at a time; all other polynomials were excluded from the model.

+

What if, instead of fully removing particular features, we kept all features and used each one only a “little bit”? If we put a limit on how much each feature can contribute to the predictions, we can still control the model’s complexity without the need to manually determine how many features should be removed.

+

What do we mean by a “little bit”? Consider the case where some parameter \(\theta_i\) is close to or equal to 0. Then, feature \(\phi_i\) barely impacts the prediction – the feature is weighted by such a small value that its presence doesn’t significantly change the value of \(\hat{\mathbb{Y}}\). If we restrict how large each parameter \(\theta_i\) can be, we restrict how much feature \(\phi_i\) contributes to the model. This has the effect of reducing model complexity.

+

In regularization, we restrict model complexity by putting a limit on the magnitudes of the model parameters \(\theta_i\).

+

What do these limits look like? Suppose we specify that the sum of all absolute parameter values can be no greater than some number \(Q\). In other words:

+

\[\sum_{i=1}^p |\theta_i| \leq Q\]

+

where \(p\) is the total number of parameters in the model. You can think of this as us giving our model a “budget” for how it distributes the magnitudes of each parameter. If the model assigns a large value to some \(\theta_i\), it may have to assign a small value to some other \(\theta_j\). This has the effect of increasing feature \(\phi_i\)’s influence on the predictions while decreasing the influence of feature \(\phi_j\). The model will need to be strategic about how the parameter weights are distributed – ideally, more “important” features will receive greater weighting.

+

Notice that the intercept term, \(\theta_0\), is excluded from this constraint. We typically do not regularize the intercept term.

+

Now, let’s think back to gradient descent and visualize the loss surface as a contour map. As a refresher, a loss surface means that each point represents the model’s loss for a particular combination of \(\theta_1\), \(\theta_2\). Let’s say our goal is to find the combination of parameters that gives us the lowest loss.

+
+constrained_gd +
+


With no constraint, the optimal \(\hat{\theta}\) is in the center.

+

Applying this constraint limits what combinations of model parameters are valid. We can now only consider parameter combinations with a total absolute sum less than or equal to our number \(Q\). This means that we can only assign our regularized parameter vector \(\hat{\theta}_{\text{Reg}}\) to positions in the green diamond below.

+
+diamondreg +
+


We can no longer select the parameter vector that truly minimizes the loss surface, \(\hat{\theta}_{\text{No Reg}}\), because this combination of parameters does not lie within our allowed region. Instead, we select whatever allowable combination brings us closest to the true minimum loss.

+
+diamond +
+


Notice that, under regularization, our optimized \(\theta_1\) and \(\theta_2\) values are much smaller than they were without regularization (indeed, \(\theta_1\) has decreased to 0). The model has decreased in complexity because we have limited how much our features contribute to the model. In fact, by setting its parameter to 0, we have effectively removed the influence of feature \(\phi_1\) from the model altogether.

+

If we change the value of \(Q\), we change the region of allowed parameter combinations. The model will still choose the combination of parameters that produces the lowest loss – the closest point in the constrained region to the true minimizer, \(\hat{\theta}_{\text{No Reg}}\).

+If we make \(Q\) smaller: +
+diamondpoint +
+If we make \(Q\) larger: +
+largerq +
+
    +
  • When \(Q\) is small, we severely restrict the size of our parameters. \(\theta_i\)s are small in value, and features \(\phi_i\) only contribute a little to the model. The allowed region of model parameters contracts, and the model becomes much simpler.
  • +
  • When \(Q\) is large, we do not restrict our parameter sizes by much. \(\theta_i\)s are large in value, and features \(\phi_i\) contribute more to the model. The allowed region of model parameters expands, and the model becomes more complex.
  • +
+

Consider the extreme case of when \(Q\) is extremely large. In this situation, our restriction has essentially no effect, and the allowed region includes the OLS solution!

+
+verylarge +
+


+

Now, what if \(Q\) were very small? Our parameters would then all be set to essentially 0. If the model has no intercept term: \(\hat{\mathbb{Y}} = (0)\phi_1 + (0)\phi_2 + \ldots = 0\). If the model has an intercept term: \(\hat{\mathbb{Y}} = (0)\phi_1 + (0)\phi_2 + \ldots = \theta_0\). Remember that the intercept term is excluded from the constraint; this is how we avoid the situation where we always predict 0.

+

Let’s summarize what we have seen.

[Figure: summary]

16.4 L1 (LASSO) Regularization


How do we actually apply our constraint \(\sum_{i=1}^p |\theta_i| \leq Q\)? We will do so by modifying the objective function that we seek to minimize when fitting a model.


Recall our ordinary least squares objective function: our goal was to find parameters that minimize the model’s mean squared error.


\[\frac{1}{n} \sum_{i=1}^n (y_i - \hat{y}_i)^2 = \frac{1}{n} \sum_{i=1}^n (y_i - (\theta_0 + \theta_1 \phi_{i, 1} + \theta_2 \phi_{i, 2} + \ldots + \theta_p \phi_{i, p}))^2\]


To apply our constraint, we need to rephrase our minimization goal.


\[\frac{1}{n} \sum_{i=1}^n (y_i - (\theta_0 + \theta_1 \phi_{i, 1} + \theta_2 \phi_{i, 2} + \ldots + \theta_p \phi_{i, p}))^2\:\:\text{such that}\:\: \sum_{i=1}^p |\theta_i| \leq Q\]


Unfortunately, we can’t directly use this formulation as our objective function – it’s not easy to mathematically optimize over a constraint. Instead, we will apply the magic of Lagrangian duality. The details are out of scope (take EECS 127 if you’re interested in learning more), but the end result is very useful: it turns out that minimizing the following augmented objective function is equivalent to our constrained minimization goal above.


\[\frac{1}{n} \sum_{i=1}^n (y_i - (\theta_0 + \theta_1 \phi_{i, 1} + \theta_2 \phi_{i, 2} + \ldots + \theta_p \phi_{i, p}))^2 + \lambda \sum_{i=1}^p \vert \theta_i \vert = \frac{1}{n}||\mathbb{Y} - \mathbb{X}\theta||_2^2 + \lambda \sum_{i=1}^p |\theta_i|\]


The right-hand side of this equation rewrites the MSE using vector notation.


Notice that we’ve replaced the constraint with a second term in our objective function. We’re now minimizing a function with an additional regularization term that penalizes large coefficients. In order to minimize this new objective function, we’ll end up balancing two components:

  • Keep the model’s error on the training data low, represented by the term \(\frac{1}{n} \sum_{i=1}^n (y_i - (\theta_0 + \theta_1 \phi_{i, 1} + \theta_2 \phi_{i, 2} + \ldots + \theta_p \phi_{i, p}))^2\)
  • At the same time, keep the magnitudes of model parameters low, represented by the term \(\lambda \sum_{i=1}^p |\theta_i|\)
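To make these two competing components concrete, here is a minimal sketch of how the augmented objective could be computed by hand. The function lasso_objective is a hypothetical helper written only for illustration (it is not part of sklearn); it assumes X and y are numpy arrays and that theta holds only the non-intercept parameters.

import numpy as np

def lasso_objective(X, y, theta_0, theta, lam):
    # First component: mean squared error of the predictions
    predictions = theta_0 + X @ theta          # theta_0 is the (unpenalized) intercept
    mse = np.mean((y - predictions) ** 2)
    # Second component: L1 penalty on the non-intercept parameters
    penalty = lam * np.sum(np.abs(theta))
    return mse + penalty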

The \(\lambda\) factor controls the degree of regularization. Roughly speaking, \(\lambda\) is related to our \(Q\) constraint from before by the rule \(\lambda \approx \frac{1}{Q}\).
To understand why, let’s consider two extreme examples:

  • Assume \(\lambda \rightarrow \infty\). Then, \(\lambda \sum_{i=1}^{p} \vert \theta_i \vert\) dominates the objective function. To minimize this term, we set \(\theta_i = 0\) for all \(i \ge 1\). This is a very constrained model that is mathematically equivalent to the constant model. Earlier, we saw that the constant model also arises when the constraint radius \(Q \rightarrow 0\).
  • Assume \(\lambda \rightarrow 0\). Then, \(\lambda \sum_{i=1}^{p} \vert \theta_i \vert\) is 0. Minimizing the objective function is then equivalent to minimizing \(\frac{1}{n} \|\mathbb{Y} - \mathbb{X}\theta\|_2^2\), our usual MSE – that is, ordinary least squares – and the optimal solution is the global minimum \(\hat{\theta} = \hat{\theta}_{\text{No Reg}}\). We saw that the global optimum is achieved when the constraint radius \(Q \rightarrow \infty\).
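The following is a small sanity-check sketch of these two extremes on synthetic data (not the chapter’s dataset); the particular alpha values are arbitrary and only meant to approximate the limits.

import numpy as np
import sklearn.linear_model as lm

rng = np.random.default_rng(42)
X = rng.normal(size=(100, 3))                              # three standardized features
y = 4 + 2 * X[:, 0] - 3 * X[:, 1] + rng.normal(scale=0.1, size=100)

# Very large lambda: every theta_i is driven to 0, leaving only the intercept (the constant model)
big_penalty = lm.Lasso(alpha=1e6).fit(X, y)
print(big_penalty.coef_)                                   # approximately [0, 0, 0]
print(big_penalty.intercept_, y.mean())                    # the intercept falls back to the mean of y

# Very small lambda: the penalty vanishes and we approximately recover the OLS fit
tiny_penalty = lm.Lasso(alpha=1e-6).fit(X, y)
print(tiny_penalty.coef_, lm.LinearRegression().fit(X, y).coef_)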

We call \(\lambda\) the regularization penalty hyperparameter and select its value via cross-validation.
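As one possible workflow (a sketch; it assumes the X_train and Y_train used in the code cell below are already defined), sklearn’s LassoCV class can search over candidate alpha values with k-fold cross-validation:

import numpy as np
import sklearn.linear_model as lm

candidate_alphas = np.logspace(-3, 3, 13)              # candidate regularization strengths
cv_lasso = lm.LassoCV(alphas=candidate_alphas, cv=5)   # 5-fold cross-validation
cv_lasso.fit(X_train, Y_train)

cv_lasso.alpha_                                        # the alpha with the best average validation performance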


The process of finding the optimal \(\hat{\theta}\) to minimize our new objective function is called L1 regularization. It is also sometimes known by the acronym “LASSO”, which stands for “Least Absolute Shrinkage and Selection Operator.”


Unlike ordinary least squares, which can be solved via the closed-form solution \(\hat{\theta}_{OLS} = (\mathbb{X}^{\top}\mathbb{X})^{-1}\mathbb{X}^{\top}\mathbb{Y}\), there is no closed-form solution for the optimal parameter vector under L1 regularization. Instead, we use the Lasso model class of sklearn.

import sklearn.linear_model as lm

# The alpha parameter represents our lambda term
lasso_model = lm.Lasso(alpha=2)
lasso_model.fit(X_train, Y_train)

lasso_model.coef_

array([-2.54932056e-01, -9.48597165e-04,  8.91976284e-06, -1.22872290e-08])

Notice that all model coefficients are very small in magnitude. In fact, some of them are so small that they are essentially 0. An important characteristic of L1 regularization is that many model parameters are set to 0. In other words, LASSO effectively selects only a subset of the features. The reason for this comes back to our loss surface and allowed “diamond” regions from earlier – we can often get closer to the lowest loss contour at a corner of the diamond than along an edge.


When a model parameter is set to 0 or close to 0, its corresponding feature is essentially removed from the model. We say that L1 regularization performs feature selection because, by setting the parameters of unimportant features to 0, LASSO “selects” which features are more useful for modeling.
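As an illustrative sketch (reusing the X_train and Y_train from above; the alpha values are arbitrary, and the exact counts depend on the data and on convergence), we can watch LASSO keep fewer and fewer features as the penalty grows. On unscaled polynomial features, the solver may also warn about convergence – one more reason for the scaling discussed in the next section.

import numpy as np
import sklearn.linear_model as lm

for alpha in [0.01, 1, 100, 10_000]:
    model = lm.Lasso(alpha=alpha).fit(X_train, Y_train)
    n_kept = int(np.sum(model.coef_ != 0))             # features whose parameters were not zeroed out
    print(f"alpha={alpha}: kept {n_kept} of {len(model.coef_)} features")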


16.5 Scaling Features for Regularization


The regularization procedure we just performed had one subtle issue. To see what it is, let’s take a look at the design matrix for our lasso_model.

X_train.head()

        hp     hp^2       hp^3         hp^4
259   85.0   7225.0   614125.0   52200625.0
129   67.0   4489.0   300763.0   20151121.0
207  102.0  10404.0  1061208.0  108243216.0
302   70.0   4900.0   343000.0   24010000.0
71    97.0   9409.0   912673.0   88529281.0

Our features – hp, hp^2, hp^3, and hp^4 – are on drastically different numeric scales! The values contained in hp^4 are orders of magnitude larger than those contained in hp. This can be a problem because the value of hp^4 will naturally contribute more to each predicted \(\hat{y}\) because it is so much greater than the values of the other features. For hp to have much of an impact at all on the prediction, it must be scaled by a large model parameter.


By inspecting the fitted parameters of our model, we see that this is the case – the parameter for hp is much larger in magnitude than the parameter for hp^4.

pd.DataFrame({"Feature":X_train.columns, "Parameter":lasso_model.coef_})

  Feature     Parameter
0      hp -2.549321e-01
1    hp^2 -9.485972e-04
2    hp^3  8.919763e-06
3    hp^4 -1.228723e-08

Recall that by applying regularization, we give our model a “budget” for how it can allocate the values of model parameters. For hp to have much of an impact on each prediction, LASSO is forced to “spend” more of this budget on the parameter for hp.


We can avoid this issue by scaling the data before regularizing. This is a process where we convert all features to the same numeric scale. A common way to scale data is to perform standardization such that all features have mean 0 and standard deviation 1; essentially, we replace each value with its Z-score.


\[z_k = \frac{x_k - \mu_k}{\sigma_k}\]
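As a sketch of what this looks like in code (StandardScaler is sklearn’s built-in standardization utility; X_train and Y_train are assumed to be the same objects used earlier):

from sklearn.preprocessing import StandardScaler
import sklearn.linear_model as lm

scaler = StandardScaler()                         # learns each column's mean and standard deviation
X_train_scaled = scaler.fit_transform(X_train)    # every feature now has mean 0 and standard deviation 1

lasso_scaled = lm.Lasso(alpha=2).fit(X_train_scaled, Y_train)
lasso_scaled.coef_                                # parameter magnitudes are now comparable across features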


16.6 L2 (Ridge) Regularization


In all of our work above, we considered the constraint \(\sum_{i=1}^p |\theta_i| \leq Q\) to limit the complexity of the model. What if we had applied a different constraint?


In L2 regularization, also known as ridge regression, we constrain the model such that the sum of the squared parameters must be less than some number \(Q\). This constraint takes the form:


\[\sum_{i=1}^p \theta_i^2 \leq Q\]


As before, we typically do not regularize the intercept term.


The allowed region of parameters for a given value of \(Q\) is now shaped like a ball.

[Figure: green_constrained_gd_sol – the ball-shaped allowed region under the L2 constraint]

If we modify our objective function like before, we find that our new goal is to minimize the function: \[\frac{1}{n} \sum_{i=1}^n (y_i - (\theta_0 + \theta_1 \phi_{i, 1} + \theta_2 \phi_{i, 2} + \ldots + \theta_p \phi_{i, p}))^2\:\:\text{such that}\:\: \sum_{i=1}^p \theta_i^2 \leq Q\]


Notice that all we have done is change the constraint on the model parameters. The first term in the expression, the MSE, has not changed.


Using Lagrangian duality, we can re-express our objective function as: \[\frac{1}{n} \sum_{i=1}^n (y_i - (\theta_0 + \theta_1 \phi_{i, 1} + \theta_2 \phi_{i, 2} + \ldots + \theta_p \phi_{i, p}))^2 + \lambda \sum_{i=1}^p \theta_i^2 = \frac{1}{n}||\mathbb{Y} - \mathbb{X}\theta||_2^2 + \lambda \sum_{i=1}^p \theta_i^2\]


When applying L2 regularization, our goal is to minimize this updated objective function.


Unlike L1 regularization, L2 regularization does have a closed-form solution for the best parameter vector when regularization is applied:


\[\hat\theta_{\text{ridge}} = (\mathbb{X}^{\top}\mathbb{X} + n\lambda I)^{-1}\mathbb{X}^{\top}\mathbb{Y}\]


This solution exists even if \(\mathbb{X}\) is not full column rank. This is a major reason why L2 regularization is often used – it can produce a solution even when there is collinearity in the features. We will discuss the concept of collinearity in a future lecture. We will not derive this result in Data 100, as it involves a fair bit of matrix calculus.
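To make the formula concrete, here is a minimal numpy sketch of the closed-form ridge solution under this chapter’s \(n\lambda\) convention (it ignores any special handling of the intercept); ridge_closed_form is an illustrative helper, not a library function.

import numpy as np

def ridge_closed_form(X, Y, lam):
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=float)
    n, p = X.shape
    # (X^T X + n*lambda*I) is invertible for any lambda > 0, even if X is not full column rank
    return np.linalg.solve(X.T @ X + n * lam * np.eye(p), X.T @ Y)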


In sklearn, we perform L2 regularization using the Ridge class. Note that, just as with LASSO, the data should ideally be scaled before regularizing.

ridge_model = lm.Ridge(alpha=1) # alpha represents the hyperparameter lambda
ridge_model.fit(X_train, Y_train)

ridge_model.coef_

array([ 5.89130559e-02, -6.42445915e-03,  4.44468157e-05, -8.83981945e-08])

16.7 Regression Summary


Our regression models are summarized below. Note that the objective function is what the optimizer (for example, gradient descent) minimizes.

Type  | Model | Loss | Regularization | Objective Function | Solution
OLS   | \(\hat{\mathbb{Y}} = \mathbb{X}\theta\) | MSE | None | \(\frac{1}{n} \|\mathbb{Y}-\mathbb{X}\theta\|^2_2\) | \(\hat{\theta}_{OLS} = (\mathbb{X}^{\top}\mathbb{X})^{-1}\mathbb{X}^{\top}\mathbb{Y}\) if \(\mathbb{X}\) is full column rank
Ridge | \(\hat{\mathbb{Y}} = \mathbb{X}\theta\) | MSE | L2 | \(\frac{1}{n} \|\mathbb{Y}-\mathbb{X}\theta\|^2_2 + \lambda \sum_{i=1}^p \theta_i^2\) | \(\hat{\theta}_{ridge} = (\mathbb{X}^{\top}\mathbb{X} + n \lambda I)^{-1}\mathbb{X}^{\top}\mathbb{Y}\)
LASSO | \(\hat{\mathbb{Y}} = \mathbb{X}\theta\) | MSE | L1 | \(\frac{1}{n} \|\mathbb{Y}-\mathbb{X}\theta\|^2_2 + \lambda \sum_{i=1}^p \vert \theta_i \vert\) | No closed form
+ + + + \ No newline at end of file diff --git a/docs/cv_regularization/images/constrained_gd.png b/docs/cv_regularization/images/constrained_gd.png new file mode 100644 index 00000000..4eda732b Binary files /dev/null and b/docs/cv_regularization/images/constrained_gd.png differ diff --git a/docs/cv_regularization/images/cross_validation.png b/docs/cv_regularization/images/cross_validation.png new file mode 100644 index 00000000..9faee18b Binary files /dev/null and b/docs/cv_regularization/images/cross_validation.png differ diff --git a/docs/cv_regularization/images/diamond.png b/docs/cv_regularization/images/diamond.png new file mode 100644 index 00000000..cdb03a3b Binary files /dev/null and b/docs/cv_regularization/images/diamond.png differ diff --git a/docs/cv_regularization/images/diamondpoint.png b/docs/cv_regularization/images/diamondpoint.png new file mode 100644 index 00000000..2d56ec3f Binary files /dev/null and b/docs/cv_regularization/images/diamondpoint.png differ diff --git a/docs/cv_regularization/images/diamondreg.png b/docs/cv_regularization/images/diamondreg.png new file mode 100644 index 00000000..6bd70348 Binary files /dev/null and b/docs/cv_regularization/images/diamondreg.png differ diff --git a/docs/cv_regularization/images/green_constrained_gd_sol.png b/docs/cv_regularization/images/green_constrained_gd_sol.png new file mode 100644 index 00000000..aa481a6f Binary files /dev/null and b/docs/cv_regularization/images/green_constrained_gd_sol.png differ diff --git a/docs/cv_regularization/images/hyperparameter_tuning.png b/docs/cv_regularization/images/hyperparameter_tuning.png new file mode 100644 index 00000000..fce75441 Binary files /dev/null and b/docs/cv_regularization/images/hyperparameter_tuning.png differ diff --git a/docs/cv_regularization/images/largerq.png b/docs/cv_regularization/images/largerq.png new file mode 100644 index 00000000..b0d2b797 Binary files /dev/null and b/docs/cv_regularization/images/largerq.png differ diff --git a/docs/cv_regularization/images/model_selection.png b/docs/cv_regularization/images/model_selection.png new file mode 100644 index 00000000..21927386 Binary files /dev/null and b/docs/cv_regularization/images/model_selection.png differ diff --git a/docs/cv_regularization/images/possible_validation_sets.png b/docs/cv_regularization/images/possible_validation_sets.png new file mode 100644 index 00000000..f41f7d36 Binary files /dev/null and b/docs/cv_regularization/images/possible_validation_sets.png differ diff --git a/docs/cv_regularization/images/summary.png b/docs/cv_regularization/images/summary.png new file mode 100644 index 00000000..59a4ccaf Binary files /dev/null and b/docs/cv_regularization/images/summary.png differ diff --git a/docs/cv_regularization/images/train-test-split.png b/docs/cv_regularization/images/train-test-split.png new file mode 100644 index 00000000..6c9bfd0b Binary files /dev/null and b/docs/cv_regularization/images/train-test-split.png differ diff --git a/docs/cv_regularization/images/training_validation_curve.png b/docs/cv_regularization/images/training_validation_curve.png new file mode 100644 index 00000000..0f6fd9aa Binary files /dev/null and b/docs/cv_regularization/images/training_validation_curve.png differ diff --git a/docs/cv_regularization/images/unconstrained.png b/docs/cv_regularization/images/unconstrained.png new file mode 100644 index 00000000..20ad9e44 Binary files /dev/null and b/docs/cv_regularization/images/unconstrained.png differ diff --git 
a/docs/cv_regularization/images/validation-split.png b/docs/cv_regularization/images/validation-split.png new file mode 100644 index 00000000..5c8aaa3b Binary files /dev/null and b/docs/cv_regularization/images/validation-split.png differ diff --git a/docs/cv_regularization/images/validation_set.png b/docs/cv_regularization/images/validation_set.png new file mode 100644 index 00000000..7d816e7d Binary files /dev/null and b/docs/cv_regularization/images/validation_set.png differ diff --git a/docs/cv_regularization/images/verylarge.png b/docs/cv_regularization/images/verylarge.png new file mode 100644 index 00000000..b08a41ef Binary files /dev/null and b/docs/cv_regularization/images/verylarge.png differ diff --git a/docs/eda/eda.html b/docs/eda/eda.html index 26218bd3..8e443a92 100644 --- a/docs/eda/eda.html +++ b/docs/eda/eda.html @@ -235,6 +235,18 @@ 14  Sklearn and Feature Engineering + + + @@ -644,7 +656,7 @@
force=False) covid_file # a file path wrapper object
-
Using cached version that was downloaded (UTC): Fri Aug 25 09:57:25 2023
+
Using cached version that was downloaded (UTC): Fri Aug 18 22:19:42 2023
PosixPath('data/confirmed-cases.json')
@@ -676,7 +688,7 @@
!ls -lh {covid_file}
 !wc -l {covid_file}
-
-rw-r--r--  1 lillianweng  staff   114K Aug 25 09:57 data/confirmed-cases.json
+
-rw-r--r--  1 Ishani  staff   114K Aug 18 22:19 data/confirmed-cases.json
    1109 data/confirmed-cases.json
@@ -4089,8 +4101,14 @@

sns.displot(co2['Days']);
 plt.title("Distribution of days feature"); # suppresses unneeded plotting output

+
+
/Users/Ishani/micromamba/lib/python3.9/site-packages/seaborn/axisgrid.py:118: UserWarning:
+
+The figure layout has changed to tight
+
+
-

+

In terms of data quality, a handful of months have averages based on measurements taken on fewer than half the days. In addition, there are nearly 200 missing values–that’s about 27% of the data!

@@ -4100,8 +4118,8 @@

Code -
sns.scatterplot(x="Yr", y="Days", data=co2);
-plt.title("Day field by Year"); # the ; suppresses output
+
sns.scatterplot(x="Yr", y="Days", data=co2);
+plt.title("Day field by Year"); # the ; suppresses output

@@ -4125,17 +4143,23 @@

Code -
# Histograms of average CO2 measurements
-sns.displot(co2['Avg']);
+
# Histograms of average CO2 measurements
+sns.displot(co2['Avg']);
+
+
/Users/Ishani/micromamba/lib/python3.9/site-packages/seaborn/axisgrid.py:118: UserWarning:
+
+The figure layout has changed to tight
+
+
-

+

The non-missing values are in the 300-400 range (a regular range of CO2 levels).

We also see that there are only a few missing Avg values (<1% of values). Let’s examine all of them:

-
co2[co2["Avg"] < 0]
+
co2[co2["Avg"] < 0]
@@ -4244,8 +4268,8 @@

Code -
sns.lineplot(x='DecDate', y='Avg', data=co2)
-plt.title("CO2 Average By Month");
+
sns.lineplot(x='DecDate', y='Avg', data=co2)
+plt.title("CO2 Average By Month");

@@ -4257,9 +4281,9 @@

-
# 1. Drop missing values
-co2_drop = co2[co2['Avg'] > 0]
-co2_drop.head()
+
# 1. Drop missing values
+co2_drop = co2[co2['Avg'] > 0]
+co2_drop.head()
@@ -4335,9 +4359,9 @@

-
# 2. Replace NaN with -99.99
-co2_NA = co2.replace(-99.99, np.NaN)
-co2_NA.head()
+
# 2. Replace NaN with -99.99
+co2_NA = co2.replace(-99.99, np.NaN)
+co2_NA.head()
@@ -4421,10 +4445,10 @@

-
# 3. Use interpolated column which estimates missing Avg values
-co2_impute = co2.copy()
-co2_impute['Avg'] = co2['Int']
-co2_impute.head()
+
# 3. Use interpolated column which estimates missing Avg values
+co2_impute = co2.copy()
+co2_impute['Avg'] = co2['Int']
+co2_impute.head()
@@ -4504,30 +4528,30 @@

Code -
# results of plotting data in 1958
-
-def line_and_points(data, ax, title):
-    # assumes single year, hence Mo
-    ax.plot('Mo', 'Avg', data=data)
-    ax.scatter('Mo', 'Avg', data=data)
-    ax.set_xlim(2, 13)
-    ax.set_title(title)
-    ax.set_xticks(np.arange(3, 13))
-
-def data_year(data, year):
-    return data[data["Yr"] == 1958]
-    
-# uses matplotlib subplots
-# you may see more next week; focus on output for now
-fig, axes = plt.subplots(ncols = 3, figsize=(12, 4), sharey=True)
-
-year = 1958
-line_and_points(data_year(co2_drop, year), axes[0], title="1. Drop Missing")
-line_and_points(data_year(co2_NA, year), axes[1], title="2. Missing Set to NaN")
-line_and_points(data_year(co2_impute, year), axes[2], title="3. Missing Interpolated")
-
-fig.suptitle(f"Monthly Averages for {year}")
-plt.tight_layout()
+
# results of plotting data in 1958
+
+def line_and_points(data, ax, title):
+    # assumes single year, hence Mo
+    ax.plot('Mo', 'Avg', data=data)
+    ax.scatter('Mo', 'Avg', data=data)
+    ax.set_xlim(2, 13)
+    ax.set_title(title)
+    ax.set_xticks(np.arange(3, 13))
+
+def data_year(data, year):
+    return data[data["Yr"] == 1958]
+    
+# uses matplotlib subplots
+# you may see more next week; focus on output for now
+fig, axes = plt.subplots(ncols = 3, figsize=(12, 4), sharey=True)
+
+year = 1958
+line_and_points(data_year(co2_drop, year), axes[0], title="1. Drop Missing")
+line_and_points(data_year(co2_NA, year), axes[1], title="2. Missing Set to NaN")
+line_and_points(data_year(co2_impute, year), axes[2], title="3. Missing Interpolated")
+
+fig.suptitle(f"Monthly Averages for {year}")
+plt.tight_layout()

@@ -4544,8 +4568,8 @@

Code -
sns.lineplot(x='DecDate', y='Avg', data=co2_impute)
-plt.title("CO2 Average By Month, Imputed");
+
sns.lineplot(x='DecDate', y='Avg', data=co2_impute)
+plt.title("CO2 Average By Month, Imputed");

@@ -4572,9 +4596,9 @@

Code -
co2_year = co2_impute.groupby('Yr').mean()
-sns.lineplot(x='Yr', y='Avg', data=co2_year)
-plt.title("CO2 Average By Year");
+
co2_year = co2_impute.groupby('Yr').mean()
+sns.lineplot(x='Yr', y='Avg', data=co2_year)
+plt.title("CO2 Average By Year");

@@ -4915,1218 +4939,1218 @@

<

diff --git a/docs/eda/eda_files/figure-html/cell-62-output-1.png b/docs/eda/eda_files/figure-html/cell-62-output-1.png index f392d5f9..a04218cf 100644 Binary files a/docs/eda/eda_files/figure-html/cell-62-output-1.png and b/docs/eda/eda_files/figure-html/cell-62-output-1.png differ diff --git a/docs/eda/eda_files/figure-html/cell-67-output-1.png b/docs/eda/eda_files/figure-html/cell-67-output-1.png deleted file mode 100644 index be96b8c9..00000000 Binary files a/docs/eda/eda_files/figure-html/cell-67-output-1.png and /dev/null differ diff --git a/docs/eda/eda_files/figure-html/cell-67-output-2.png b/docs/eda/eda_files/figure-html/cell-67-output-2.png new file mode 100644 index 00000000..31857f62 Binary files /dev/null and b/docs/eda/eda_files/figure-html/cell-67-output-2.png differ diff --git a/docs/eda/eda_files/figure-html/cell-68-output-1.png b/docs/eda/eda_files/figure-html/cell-68-output-1.png index ffd29ff8..67c3959d 100644 Binary files a/docs/eda/eda_files/figure-html/cell-68-output-1.png and b/docs/eda/eda_files/figure-html/cell-68-output-1.png differ diff --git a/docs/eda/eda_files/figure-html/cell-69-output-1.png b/docs/eda/eda_files/figure-html/cell-69-output-1.png deleted file mode 100644 index 29088928..00000000 Binary files a/docs/eda/eda_files/figure-html/cell-69-output-1.png and /dev/null differ diff --git a/docs/eda/eda_files/figure-html/cell-69-output-2.png b/docs/eda/eda_files/figure-html/cell-69-output-2.png new file mode 100644 index 00000000..fb28f5d5 Binary files /dev/null and b/docs/eda/eda_files/figure-html/cell-69-output-2.png differ diff --git a/docs/eda/eda_files/figure-html/cell-71-output-1.png b/docs/eda/eda_files/figure-html/cell-71-output-1.png index 49ef3d6a..39cac822 100644 Binary files a/docs/eda/eda_files/figure-html/cell-71-output-1.png and b/docs/eda/eda_files/figure-html/cell-71-output-1.png differ diff --git a/docs/eda/eda_files/figure-html/cell-75-output-1.png b/docs/eda/eda_files/figure-html/cell-75-output-1.png index 15a5fe82..6382e58a 100644 Binary files a/docs/eda/eda_files/figure-html/cell-75-output-1.png and b/docs/eda/eda_files/figure-html/cell-75-output-1.png differ diff --git a/docs/eda/eda_files/figure-html/cell-76-output-1.png b/docs/eda/eda_files/figure-html/cell-76-output-1.png index 40b1fc71..db2b0dee 100644 Binary files a/docs/eda/eda_files/figure-html/cell-76-output-1.png and b/docs/eda/eda_files/figure-html/cell-76-output-1.png differ diff --git a/docs/eda/eda_files/figure-html/cell-77-output-1.png b/docs/eda/eda_files/figure-html/cell-77-output-1.png index 99b6c2d1..897b8b39 100644 Binary files a/docs/eda/eda_files/figure-html/cell-77-output-1.png and b/docs/eda/eda_files/figure-html/cell-77-output-1.png differ diff --git a/docs/feature_engineering/feature_engineering.html b/docs/feature_engineering/feature_engineering.html index fe821b57..9cfedab8 100644 --- a/docs/feature_engineering/feature_engineering.html +++ b/docs/feature_engineering/feature_engineering.html @@ -64,6 +64,7 @@ + @@ -234,6 +235,18 @@ 14  Sklearn and Feature Engineering

+ + +

@@ -507,7 +520,7 @@

my_model.fit(X, Y)

-
LinearRegression()
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
+
LinearRegression()
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.

Notice that we use double brackets to extract this column. Why double brackets instead of just single brackets? The .fit method, by default, expects to receive 2-dimensional data – some kind of data that includes both rows and columns. Writing penguins["flipper_length_mm"] would return a 1D Series, causing sklearn to error. We avoid this by writing penguins[["flipper_length_mm"]] to produce a 2D DataFrame.

@@ -558,7 +571,7 @@

print(f"The RMSE of the model is {np.sqrt(np.mean((Y-Y_hat_two_features)**2))}")

-
The RMSE of the model is 0.9881331104079045
+
The RMSE of the model is 0.9881331104079044

We can also see that we obtain the same predictions using sklearn as we did when applying the ordinary least squares formula before!

@@ -928,7 +941,7 @@

print(f"MSE of model with (hp^2) feature: {np.mean((Y-hp2_model_predictions)**2)}")

-
MSE of model with (hp^2) feature: 18.984768907617216
+
MSE of model with (hp^2) feature: 18.984768907617223