@@ -4304,8 +4292,8 @@
Code
-
+
@@ -4317,9 +4305,9 @@
-
+
@@ -4395,9 +4383,9 @@
-
+
@@ -4481,10 +4469,10 @@
-# 3. Use interpolated column which estimates missing Avg values
-co2_impute = co2.copy()
-co2_impute['Avg'] = co2['Int']
-co2_impute.head()
+# 3. Use interpolated column which estimates missing Avg values
+co2_impute = co2.copy()
+co2_impute['Avg'] = co2['Int']
+co2_impute.head()
@@ -4564,30 +4552,30 @@
Code
-# results of plotting data in 1958
-
-def line_and_points(data, ax, title):
- # assumes single year, hence Mo
- ax.plot('Mo', 'Avg', data=data)
- ax.scatter('Mo', 'Avg', data=data)
- ax.set_xlim(2, 13)
- ax.set_title(title)
- ax.set_xticks(np.arange(3, 13))
-
-def data_year(data, year):
- return data[data["Yr"] == year]
-
-# uses matplotlib subplots
-# you may see more next week; focus on output for now
-fig, axes = plt.subplots(ncols = 3, figsize=(12, 4), sharey=True)
-
-year = 1958
-line_and_points(data_year(co2_drop, year), axes[0], title="1. Drop Missing")
-line_and_points(data_year(co2_NA, year), axes[1], title="2. Missing Set to NaN")
-line_and_points(data_year(co2_impute, year), axes[2], title="3. Missing Interpolated")
-
-fig.suptitle(f"Monthly Averages for {year}")
-plt.tight_layout()
+# results of plotting data in 1958
+
+def line_and_points(data, ax, title):
+ # assumes single year, hence Mo
+ ax.plot('Mo', 'Avg', data=data)
+ ax.scatter('Mo', 'Avg', data=data)
+ ax.set_xlim(2, 13)
+ ax.set_title(title)
+ ax.set_xticks(np.arange(3, 13))
+
+def data_year(data, year):
+ return data[data["Yr"] == year]
+
+# uses matplotlib subplots
+# you may see more next week; focus on output for now
+fig, axes = plt.subplots(ncols = 3, figsize=(12, 4), sharey=True)
+
+year = 1958
+line_and_points(data_year(co2_drop, year), axes[0], title="1. Drop Missing")
+line_and_points(data_year(co2_NA, year), axes[1], title="2. Missing Set to NaN")
+line_and_points(data_year(co2_impute, year), axes[2], title="3. Missing Interpolated")
+
+fig.suptitle(f"Monthly Averages for {year}")
+plt.tight_layout()
@@ -4604,8 +4592,8 @@
Code
-
+
@@ -4632,9 +4620,9 @@
Code
-
+
@@ -4975,1218 +4963,1218 @@ <
Source Code
----
-title: Data Cleaning and EDA
-execute:
- echo: true
-format:
- html:
- code-fold: true
- code-tools: true
- toc: true
- toc-title: Data Cleaning and EDA
- page-layout: full
- theme:
- - cosmo
- - cerulean
- callout-icon: false
-jupyter: python3
----
-
-```{python}
-#| code-fold: true
-import numpy as np
-import pandas as pd
-
-import matplotlib.pyplot as plt
-import seaborn as sns
-#%matplotlib inline
-plt.rcParams['figure.figsize'] = (12, 9)
-
-sns.set()
-sns.set_context('talk')
-np.set_printoptions(threshold=20, precision=2, suppress=True)
-pd.set_option('display.max_rows', 30)
-pd.set_option('display.max_columns', None)
-pd.set_option('display.precision', 2)
-# This option stops scientific notation for pandas
-pd.set_option('display.float_format', '{:.2f}'.format)
-
-# Silence some spurious seaborn warnings
-import warnings
-warnings.filterwarnings("ignore", category=FutureWarning)
-```
-
-::: {.callout-note collapse="false"}
-## Learning Outcomes
-* Recognize common file formats
-* Categorize data by its variable type
-* Build awareness of issues with data faithfulness and develop targeted solutions
-:::
-
-**This content is covered in lectures 4, 5, and 6.**
-
-In the past few lectures, we've learned that `pandas` is a toolkit to restructure, modify, and explore a dataset. What we haven't yet touched on is *how* to make these data transformation decisions. When we receive a new set of data from the "real world," how do we know what processing we should do to convert this data into a usable form?
-
-**Data cleaning**, also called **data wrangling**, is the process of transforming raw data to facilitate subsequent analysis. It is often used to address issues like:
-
-* Unclear structure or formatting
-* Missing or corrupted values
-* Unit conversions
-* ...and so on
-
-**Exploratory Data Analysis (EDA)** is the process of understanding a new dataset. It is an open-ended, informal analysis that involves familiarizing ourselves with the variables present in the data, discovering potential hypotheses, and identifying possible issues with the data. This last point can often motivate further data cleaning to address any problems with the dataset's format; because of this, EDA and data cleaning are often thought of as an "infinite loop," with each process driving the other.
-
-In this lecture, we will consider the key properties of data to consider when performing data cleaning and EDA. In doing so, we'll develop a "checklist" of sorts for you to consider when approaching a new dataset. Throughout this process, we'll build a deeper understanding of this early (but very important!) stage of the data science lifecycle.
-
-## Structure
-
-### File Formats
-There are many file types for storing structured data: TSV, JSON, XML, ASCII, SAS, etc. We'll only cover CSV, TSV, and JSON in lecture, but you'll likely encounter other formats as you work with different datasets. Reading documentation is your best bet for understanding how to process the multitude of different file types.
-
-#### CSV
-CSVs, which stand for **Comma-Separated Values**, are a common tabular data format.
-In the past two `pandas` lectures, we briefly touched on the idea of file format: the way data is encoded in a file for storage. Specifically, our `elections` and `babynames` datasets were stored and loaded as CSVs:
-
-```{python}
-#| code-fold: false
-pd.read_csv("data/elections.csv").head(5)
-```
-
-To better understand the properties of a CSV, let's take a look at the first few rows of the raw data file to see what it looks like before being loaded into a `DataFrame`. We'll use the `repr()` function to return the raw string with its special characters:
-
-```{python}
-#| code-fold: false
-with open("data/elections.csv", "r") as table:
- i = 0
- for row in table:
- print(repr(row))
- i += 1
- if i > 3:
- break
-```
-
-Each row, or **record**, in the data is delimited by a newline `\n`. Each column, or **field**, in the data is delimited by a comma `,` (hence, comma-separated!).
-
-#### TSV
-
-Another common file type is **TSV (Tab-Separated Values)**. In a TSV, records are still delimited by a newline `\n`, while fields are delimited by the tab character `\t`.
-
-Let's check out the first few rows of the raw TSV file. Again, we'll use the `repr()` function so that `print` shows the special characters.
-
-```{python}
-#| code-fold: false
-with open("data/elections.txt", "r") as table:
- i = 0
- for row in table:
- print(repr(row))
- i += 1
- if i > 3:
- break
-```
-
-TSVs can be loaded into `pandas` using `pd.read_csv`. We'll need to specify the **delimiter** with the parameter `sep='\t'` [(documentation)](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html).
-
-```{python}
-#| code-fold: false
-pd.read_csv("data/elections.txt", sep='\t').head(3)
-```
-
-An issue with CSVs and TSVs comes up whenever there are commas or tabs within the records. How does `pandas` differentiate between a comma delimiter vs. a comma within the field itself, for example `8,900`? To remedy this, check out the [`quotechar` parameter](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html).
-
-#### JSON
-**JSON (JavaScript Object Notation)** files behave similarly to Python dictionaries. A raw JSON is shown below.
-
-```{python}
-#| code-fold: false
-with open("data/elections.json", "r") as table:
- i = 0
- for row in table:
- print(row)
- i += 1
- if i > 8:
- break
-```
-
-JSON files can be loaded into `pandas` using `pd.read_json`.
-
-```{python}
-#| code-fold: false
-pd.read_json('data/elections.json').head(3)
-```
-
-##### EDA with JSON: Berkeley COVID-19 Data
-The City of Berkeley Open Data [website](https://data.cityofberkeley.info/Health/COVID-19-Confirmed-Cases/xn6j-b766) has a dataset with COVID-19 Confirmed Cases among Berkeley residents by date. Let's download the file and save it as a JSON (note that the source URL file type is also a JSON). In the interest of reproducible data science, we will download the data programmatically. We have defined some helper functions in the [`ds100_utils.py`](https://ds100.org/fa23/resources/assets/lectures/lec05/lec05-eda.html) file so that we can reuse them in many different notebooks.
-
-```{python}
-#| code-fold: false
-from ds100_utils import fetch_and_cache
-
-covid_file = fetch_and_cache(
- "https://data.cityofberkeley.info/api/views/xn6j-b766/rows.json?accessType=DOWNLOAD",
- "confirmed-cases.json",
- force=False)
-covid_file # a file path wrapper object
-```
-
-###### File Size
-Let's start our analysis by getting a rough estimate of the size of the dataset to inform the tools we use to view the data. For relatively small datasets, we can use a text editor or spreadsheet. For larger datasets, more programmatic exploration or distributed computing tools may be more fitting. Here we will use `Python` tools to probe the file.
-
-Since this seems to be a text file, let's investigate the number of lines, which often corresponds to the number of records.
-
-```{python}
-#| code-fold: false
-import os
-
-print(covid_file, "is", os.path.getsize(covid_file) / 1e6, "MB")
-
-with open(covid_file, "r") as f:
- print(covid_file, "is", sum(1 for l in f), "lines.")
-```
-
-###### Unix Commands
-As part of the EDA workflow, Unix commands can come in very handy. In fact, there's an entire book called ["Data Science at the Command Line"](https://datascienceatthecommandline.com/) that explores this idea in depth!
-In Jupyter/IPython, you can prefix lines with `!` to execute arbitrary Unix commands, and within those lines, you can refer to `Python` variables and expressions with the syntax `{expr}`.
-
-Here, we use the `ls` command to list files, using the `-lh` flags, which request "long format with information in human-readable form." We also use the `wc` command for "word count," but with the `-l` flag, which asks for line counts instead of words.
-
-These two give us the same information as the code above, albeit in a slightly different form:
-
-```{python}
-#| code-fold: false
-!ls -lh {covid_file}
-!wc -l {covid_file}
-```
-
-###### File Contents
-Let's explore the data format using `Python`.
-
-```{python}
-#| code-fold: false
-with open(covid_file, "r") as f:
- for i, row in enumerate(f):
- print(repr(row)) # print raw strings
- if i >= 4: break
-```
-
-We can use the `head` Unix command (which is where `pandas`' `head` method comes from!) to see the first few lines of the file:
-
-```{python}
-#| code-fold: false
-!head -5 {covid_file}
-```
-
-In order to load the JSON file into `pandas`, let's first do some EDA with `Python`'s `json` package to understand the particular structure of this JSON file so that we can decide what (if anything) to load into `pandas`. `Python` has relatively good support for JSON data since it closely matches the internal `Python` object model. In the following cell, we import the entire JSON datafile into a `Python` dictionary using the `json` package.
-
-```{python}
-#| code-fold: false
-import json
-
-with open(covid_file, "rb") as f:
- covid_json = json.load(f)
-```
-
-The `covid_json` variable is now a dictionary encoding the data in the file:
-
-```{python}
-#| code-fold: false
-type(covid_json)
-```
-
-We can examine what keys are in the top-level JSON object by listing them out.
-
-```{python}
-#| code-fold: false
-covid_json.keys()
-```
-
-**Observation**: The JSON dictionary contains a `meta` key, which likely refers to metadata (data about the data). Metadata is often maintained with the data and can be a good source of additional information.
-
-
-We can investigate the metadata further by examining its keys.
-
-```{python}
-#| code-fold: false
-covid_json['meta'].keys()
-```
-
-The `meta` key contains another dictionary called `view`. This likely refers to metadata about a particular "view" of some underlying database. We will learn more about views when we study SQL later in the class.
-
-```{python}
-#| code-fold: false
-covid_json['meta']['view'].keys()
-```
-
-Notice that this is a nested/recursive data structure. As we dig deeper, we reveal more and more keys and the corresponding data:
-
-```
-meta
-|-> data
- | ... (haven't explored yet)
-|-> view
- | -> id
- | -> name
- | -> attribution
- ...
- | -> description
- ...
- | -> columns
- ...
-```
-
-
-There is a key called `description` in the `view` sub-dictionary. This likely contains a description of the data:
-
-```{python}
-#| code-fold: false
-print(covid_json['meta']['view']['description'])
-```
-
-###### Examining the Data Field for Records
-
-We can look at a few entries in the `data` field. This is what we'll load into `pandas`.
-
-```{python}
-#| code-fold: false
-for i in range(3):
- print(f"{i:03} | {covid_json['data'][i]}")
-```
-
-Observations:
-* These look like equal-length records, so maybe `data` is a table!
-* But what does each value in the record mean? Where can we find the column headers?
-
-For that, we'll need the `columns` key in the metadata dictionary. This returns a list:
-
-```{python}
-#| code-fold: false
-type(covid_json['meta']['view']['columns'])
-```
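-
-Each element of this list describes one column. As a quick peek (a sketch: the exact fields depend on the data portal's export format, so we only rely on the `name` key, which we also use below, and merely check whether a `description` is present):
-
-```{python}
-#| code-fold: true
-# peek at the first few column descriptors in the metadata
-for c in covid_json['meta']['view']['columns'][:3]:
-    print(c.get('name'), '| has description:', 'description' in c)
-```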
-
-###### Summary of exploring the JSON file
-
-1. The above **metadata** tells us a lot about the columns in the data including column names, potential data anomalies, and a basic statistic.
-1. Because of its non-tabular structure, JSON makes it easier (than CSV) to create **self-documenting data**, meaning that information about the data is stored in the same file as the data.
-1. Self-documenting data can be helpful since it maintains its own description and these descriptions are more likely to be updated as data changes.
-
-###### Loading COVID Data into `pandas`
-Finally, let's load the data (not the metadata) into a `pandas` `DataFrame`. In the following block of code we:
-
-1. Translate the JSON records into a `DataFrame`:
-
- * fields: `covid_json['meta']['view']['columns']`
- * records: `covid_json['data']`
-
-
-1. Remove columns that have no metadata description. This would be a bad idea in general, but here we remove these columns since the above analysis suggests they are unlikely to contain useful information.
-
-1. Examine the `tail` of the table.
-
-```{python}
-#| code-fold: false
-# Load the data from JSON and assign column titles
-covid = pd.DataFrame(
- covid_json['data'],
- columns=[c['name'] for c in covid_json['meta']['view']['columns']])
-
-covid.tail()
-```
-
-### Variable Types
-
-After loading data from a file, it's a good idea to take the time to understand what pieces of information are encoded in the dataset. In particular, we want to identify what variable types are present in our data. Broadly speaking, we can categorize variables into one of two overarching types.
-
-**Quantitative variables** describe some numeric quantity or amount. We can divide quantitative data further into:
-
-* **Continuous quantitative variables**: numeric data that can be measured on a continuous scale to arbitrary precision. Continuous variables do not have a strict set of possible values – they can be recorded to any number of decimal places. For example, weights, GPA, or CO<sub>2</sub> concentrations.
-* **Discrete quantitative variables**: numeric data that can only take on a finite set of possible values. For example, someone's age or the number of siblings they have.
-
-**Qualitative variables**, also known as **categorical variables**, describe data that isn't measuring some quantity or amount. The sub-categories of categorical data are:
-
-* **Ordinal qualitative variables**: categories with ordered levels. Specifically, ordinal variables are those where the difference between levels has no consistent, quantifiable meaning. Some examples include levels of education (high school, undergrad, grad, etc.), income bracket (low, medium, high), or Yelp rating.
-* **Nominal qualitative variables**: categories with no specific order. For example, someone's political affiliation or Cal ID number.
-
-![Classification of variable types](images/variable.png)
-
-Note that many variables don't sit neatly in just one of these categories. Qualitative variables could have numeric levels, and conversely, quantitative variables could be stored as strings.
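-
-As a small illustrative sketch (the values below are made up), the storage type that `pandas` reports need not match the variable type, and we can recast columns so the two agree:
-
-```{python}
-#| code-fold: true
-# A toy example: Cal ID looks numeric but is nominal; GPA is stored as strings but is quantitative
-toy = pd.DataFrame({"Cal ID": [3034619471, 3035619472], "GPA": ["3.70", "3.90"]})
-toy["Cal ID"] = toy["Cal ID"].astype("string")  # treat as a nominal identifier
-toy["GPA"] = toy["GPA"].astype("float")         # treat as continuous quantitative
-toy.dtypes
-```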
-
-### Primary and Foreign Keys
-
-Last time, we introduced `.merge` as the `pandas` method for joining multiple `DataFrame`s together. In our discussion of joins, we touched on the idea of using a "key" to determine what rows should be merged from each table. Let's take a moment to examine this idea more closely.
-
-The **primary key** is the column or set of columns in a table that *uniquely* determine the values of the remaining columns. It can be thought of as the unique identifier for each individual row in the table. For example, a table of Data 100 students might use each student's Cal ID as the primary key.
-
-```{python}
-#| echo: false
-pd.DataFrame({"Cal ID":[3034619471, 3035619472, 3025619473, 3046789372], \
- "Name":["Oski", "Ollie", "Orrie", "Ollie"], \
- "Major":["Data Science", "Computer Science", "Data Science", "Economics"]})
-```
-
-The **foreign key** is the column or set of columns in a table that reference primary keys in other tables. Knowing a dataset's foreign keys can be useful when assigning the `left_on` and `right_on` parameters of `.merge`. In the table of office hour tickets below, `"Cal ID"` is a foreign key referencing the previous table.
-
-```{python}
-#| echo: false
-pd.DataFrame({"OH Request":[1, 2, 3, 4], \
- "Cal ID":[3034619471, 3035619472, 3025619473, 3035619472], \
- "Question":["HW 2 Q1", "HW 2 Q3", "Lab 3 Q4", "HW 2 Q7"]})
-```
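-
-As a sketch of how these keys drive a join, we can recreate the two example tables above (the variable names `students` and `tickets` are our own; the display-only cells don't assign them) and merge the office hour tickets with student information on the shared `"Cal ID"` key:
-
-```{python}
-#| code-fold: true
-# recreate the two example tables, then join tickets to students on the key
-students = pd.DataFrame({"Cal ID": [3034619471, 3035619472, 3025619473, 3046789372],
-                         "Name": ["Oski", "Ollie", "Orrie", "Ollie"],
-                         "Major": ["Data Science", "Computer Science", "Data Science", "Economics"]})
-tickets = pd.DataFrame({"OH Request": [1, 2, 3, 4],
-                        "Cal ID": [3034619471, 3035619472, 3025619473, 3035619472],
-                        "Question": ["HW 2 Q1", "HW 2 Q3", "Lab 3 Q4", "HW 2 Q7"]})
-tickets.merge(right=students, on="Cal ID")
-```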
-
-## Granularity, Scope, and Temporality
-
-After understanding the structure of the dataset, the next task is to determine what exactly the data represents. We'll do so by considering the data's granularity, scope, and temporality.
-
-### Granularity
-The **granularity** of a dataset is what a single row represents. You can also think of it as the level of detail included in the data. To determine the data's granularity, ask: what does each row in the dataset represent? Fine-grained data contains a high level of detail, with a single row representing a small individual unit. For example, each record may represent one person. Coarse-grained data is encoded such that a single row represents a large individual unit – for example, each record may represent a group of people.
-
-### Scope
-The **scope** of a dataset is the subset of the population covered by the data. If we were investigating student performance in Data Science courses, a dataset with a narrow scope might encompass all students enrolled in Data 100 whereas a dataset with an expansive scope might encompass all students in California.
-
-### Temporality
-The **temporality** of a dataset describes the periodicity over which the data was collected as well as when the data was most recently collected or updated.
-
-Time and date fields of a dataset could represent a few things:
-
-1. when the "event" happened
-2. when the data was collected, or when it was entered into the system
-3. when the data was copied into the database
-
-To fully understand the temporality of the data, it also may be necessary to standardize time zones or inspect recurring time-based trends in the data (do patterns recur in 24-hour periods? Over the course of a month? Seasonally?). The convention for standardizing time is Coordinated Universal Time (UTC), an international time standard measured at 0 degrees longitude that stays consistent throughout the year (no daylight saving time). Berkeley's time zone, Pacific Standard Time (PST), is UTC-8; during daylight saving time, clocks shift to Pacific Daylight Time (PDT), which is UTC-7.
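-
-As a quick sketch of what this standardization looks like in `pandas` (the timestamp below is arbitrary):
-
-```{python}
-#| code-fold: true
-# an arbitrary UTC timestamp, viewed in Berkeley's (Pacific) time zone
-ts_utc = pd.Timestamp("2023-09-01 19:00", tz="UTC")
-ts_utc.tz_convert("US/Pacific")
-```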
-
-#### Temporality with `pandas`' `dt` accessors
-Let's briefly look at how we can use `pandas`' `dt` accessors to work with dates/times in a dataset using the dataset you'll see in Lab 3: the Berkeley PD Calls for Service dataset.
-
-```{python}
-#| code-fold: true
-calls = pd.read_csv("data/Berkeley_PD_-_Calls_for_Service.csv")
-calls.head()
-```
-
-Looks like there are three columns with dates/times: `EVENTDT`, `EVENTTM`, and `InDbDate`.
-
-Most likely, `EVENTDT` stands for the date when the event took place, `EVENTTM` stands for the time of day the event took place (in 24-hr format), and `InDbDate` is the date this call is recorded onto the database.
-
-If we check the data type of these columns, we will see they are stored as strings. We can convert them to `datetime` objects using pandas `to_datetime` function.
-
-```{python}
-#| code-fold: false
-calls["EVENTDT"] = pd.to_datetime(calls["EVENTDT"])
-calls.head()
-```
-
-Now, we can use the `dt` accessor on this column.
-
-We can get the month:
-
-```{python}
-#| code-fold: false
-calls["EVENTDT"].dt.month.head()
-```
-
-Which day of the week the date is on:
-
-```{python}
-#| code-fold: false
-calls["EVENTDT"].dt.dayofweek.head()
-```
-
-Check the minimum values to see if there are any suspicious-looking 70s dates:
-
-```{python}
-#| code-fold: false
-calls.sort_values("EVENTDT").head()
-```
-
-Doesn't look like it! We are good!
-
-
-We can also do many things with the `dt` accessor like switching time zones and converting time back to UNIX/POSIX time. Check out the documentation on [`.dt` accessor](https://pandas.pydata.org/docs/user_guide/basics.html#basics-dt-accessors) and [time series/date functionality](https://pandas.pydata.org/docs/user_guide/timeseries.html#).
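-
-For instance, here is a minimal sketch of both operations on the `EVENTDT` column (assuming these date-only timestamps localize cleanly to Pacific time):
-
-```{python}
-#| code-fold: true
-# localize the naive timestamps to Pacific time, convert to UTC,
-# then express them as UNIX/POSIX seconds since 1970-01-01
-event_utc = calls["EVENTDT"].dt.tz_localize("US/Pacific").dt.tz_convert("UTC")
-unix_seconds = (event_utc - pd.Timestamp("1970-01-01", tz="UTC")) // pd.Timedelta("1s")
-unix_seconds.head()
-```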
-
-## Faithfulness
-
-At this stage in our data cleaning and EDA workflow, we've achieved quite a lot: we've identified how our data is structured, come to terms with what information it encodes, and gained insight as to how it was generated. Throughout this process, we should always recall the original intent of our work in Data Science – to use data to better understand and model the real world. To achieve this goal, we need to ensure that the data we use is faithful to reality; that is, that our data accurately captures the "real world."
-
-Data used in research or industry is often "messy" – there may be errors or inaccuracies that impact the faithfulness of the dataset. Signs that data may not be faithful include:
-
-* Unrealistic or "incorrect" values, such as negative counts, locations that don't exist, or dates set in the future
-* Violations of obvious dependencies, like an age that does not match a birthday
-* Clear signs that data was entered by hand, which can lead to spelling errors or fields that are incorrectly shifted
-* Signs of data falsification, such as fake email addresses or repeated use of the same names
-* Duplicated records or fields containing the same information
-* Truncated data, e.g. older versions of Microsoft Excel limited spreadsheets to 65,536 rows and 256 columns
-
-We often solve some of these more common issues in the following ways:
-
-* Spelling errors: apply corrections or drop records that aren't in a dictionary
-* Time zone inconsistencies: convert to a common time zone (e.g. UTC)
-* Duplicated records or fields: identify and eliminate duplicates (using primary keys)
-* Unspecified or inconsistent units: infer the units and check that values are in reasonable ranges in the data
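-
-As a minimal sketch of a couple of these checks in `pandas` (on a tiny made-up table):
-
-```{python}
-#| code-fold: true
-# toy table with a duplicated record and an out-of-range age
-records = pd.DataFrame({"Cal ID": [3034619471, 3034619471, 3025619473],
-                        "Age": [20, 20, -3]})
-deduped = records.drop_duplicates(subset="Cal ID")  # eliminate duplicates via the primary key
-deduped[deduped["Age"].between(0, 120)]             # keep only values in a reasonable range
-```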
-
-### Missing Values
-Another common issue encountered with real-world datasets is that of missing data. One strategy to resolve this is to simply drop any records with missing values from the dataset. This does, however, introduce the risk of inducing biases – it is possible that the missing or corrupt records may be systematically related to some feature of interest in the data. Another solution is to keep the data as `NaN` values.
-
-A third method to address missing data is to perform **imputation**: infer the missing values using other data available in the dataset. There is a wide variety of imputation techniques that can be implemented; some of the most common are listed below.
-
-* Average imputation: replace missing values with the average value for that field
-* Hot deck imputation: replace missing values with some random value
-* Regression imputation: develop a model to predict missing values
-* Multiple imputation: replace missing values with multiple random values
-
-Regardless of the strategy used to deal with missing data, we should think carefully about *why* particular records or fields may be missing – this can help inform whether or not the absence of these values is significant or meaningful.
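-
-As a one-line sketch of the simplest of these strategies, average imputation can be done with `fillna` on a toy series:
-
-```{python}
-#| code-fold: true
-# replace missing values with the mean of the observed values
-s = pd.Series([1.0, np.nan, 3.0, 4.0, np.nan])
-s.fillna(s.mean())
-```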
-
-# EDA Demo 1: Tuberculosis in the United States
-
-Now, let's walk through the data-cleaning and EDA workflow to see what can we learn about the presence of Tuberculosis in the United States!
-
-We will examine the data included in the [original CDC article](https://www.cdc.gov/mmwr/volumes/71/wr/mm7112a1.htm?s_cid=mm7112a1_w#T1_down) published in 2021.
-
-
-## CSVs and Field Names
-Suppose Table 1 was saved as a CSV file located in `data/cdc_tuberculosis.csv`.
-
-We can then explore the CSV (which is a text file, and does not contain binary-encoded data) in many ways:
-
-1. Using a text editor like emacs, vim, VSCode, etc.
-2. Opening the CSV directly in DataHub (read-only), Excel, Google Sheets, etc.
-3. The `Python` file object
-4. `pandas`, using `pd.read_csv()`
-
-To try out options 1 and 2, you can view or download the Tuberculosis dataset from the [lecture demo notebook](https://data100.datahub.berkeley.edu/hub/user-redirect/git-pull?repo=https%3A%2F%2Fgithub.com%2FDS-100%2Ffa23-student&urlpath=lab%2Ftree%2Ffa23-student%2Flecture%2Flec05%2Flec04-eda.ipynb&branch=main) under the `data` folder in the left-hand menu. Notice how the CSV file is a type of **rectangular data (i.e., tabular data) stored as comma-separated values**.
-
-Next, let's try out option 3 using the `Python` file object. We'll look at the first four lines:
-
-```{python}
-#| code-fold: true
-with open("data/cdc_tuberculosis.csv", "r") as f:
- i = 0
- for row in f:
- print(row)
- i += 1
- if i > 3:
- break
-```
-
-Whoa, why are there blank lines interspersed between the lines of the CSV?
-
-You may recall that all line breaks in text files are encoded as the special newline character `\n`. Python's `print()` prints each string (which already ends in a newline) and then adds an additional newline of its own.
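-
-If we just want to see each line of the file exactly once, one quick fix is to suppress `print`'s own trailing newline:
-
-```{python}
-#| code-fold: true
-# pass end="" so print does not add a second newline
-with open("data/cdc_tuberculosis.csv", "r") as f:
-    for i, row in enumerate(f):
-        print(row, end="")
-        if i >= 3:
-            break
-```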
-
-If you're curious, we can use the `repr()` function to return the raw string with all special characters:
-
-```{python}
-#| code-fold: true
-with open("data/cdc_tuberculosis.csv", "r") as f:
- i = 0
- for row in f:
- print(repr(row)) # print raw strings
- i += 1
- if i > 3:
- break
-```
-
-Finally, let's try option 4 and use the tried-and-true Data 100 approach: `pandas`.
-
-```{python}
-#| code-fold: false
-tb_df = pd.read_csv("data/cdc_tuberculosis.csv")
-tb_df.head()
-```
-
-You may notice some strange things about this table: what's up with the "Unnamed" column names and the first row?
-
-Congratulations — you're ready to wrangle your data! Because of how things are stored, we'll need to clean the data a bit to name our columns better.
-
-A reasonable first step is to identify the row with the right header. The `pd.read_csv()` function ([documentation](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html)) has the convenient `header` parameter that we can set to use the elements in row 1 as the appropriate columns:
-
-```{python}
-#| code-fold: false
-tb_df = pd.read_csv("data/cdc_tuberculosis.csv", header=1) # row index
-tb_df.head(5)
-```
-
-Wait...but now we can't differentiate between the "Number of TB cases" and "TB incidence" year columns. `pandas` has tried to make our lives easier by automatically adding ".1" to the latter columns, but this doesn't help us, as humans, understand the data.
-
-We can do this manually with `df.rename()` ([documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.rename.html?highlight=rename#pandas.DataFrame.rename)):
-
-```{python}
-#| code-fold: false
-rename_dict = {'2019': 'TB cases 2019',
- '2020': 'TB cases 2020',
- '2021': 'TB cases 2021',
- '2019.1': 'TB incidence 2019',
- '2020.1': 'TB incidence 2020',
- '2021.1': 'TB incidence 2021'}
-tb_df = tb_df.rename(columns=rename_dict)
-tb_df.head(5)
-```
-
-## Record Granularity
-
-You might already be wondering: what's up with that first record?
-
-Row 0 is what we call a **rollup record**, or summary record. It's often useful when displaying tables to humans. The **granularity** of record 0 (Totals) vs the rest of the records (States) is different.
-
-Okay, EDA step two. How was the rollup record aggregated?
-
-Let's check if Total TB cases is the sum of all state TB cases. If we sum over all rows, we should get **2x** the total cases in each of our TB cases by year (why do you think this is?).
-
-```{python}
-#| code-fold: true
-tb_df.sum(axis=0)
-```
-
-Whoa, what's going on with the TB cases in 2019, 2020, and 2021? Check out the column types:
-
-```{python}
-#| code-fold: true
-tb_df.dtypes
-```
-
-Since there are commas in the values for TB cases, the numbers are read as the `object` datatype, or **storage type** (close to the `Python` string datatype), so `pandas` is concatenating strings instead of adding integers (recall that `Python` can "sum", or concatenate, strings together: `"data" + "100"` evaluates to `"data100"`).
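-
-One manual workaround (just a sketch; the cleaner `read_csv` fix follows below) is to strip the commas from the string-typed columns and cast them to integers:
-
-```{python}
-#| code-fold: true
-# strip thousands separators and cast the case counts to integers
-tb_manual = tb_df.copy()
-for col in ["TB cases 2019", "TB cases 2020", "TB cases 2021"]:
-    tb_manual[col] = tb_manual[col].str.replace(",", "").astype(int)
-tb_manual[["TB cases 2019", "TB cases 2020", "TB cases 2021"]].dtypes
-```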
-
-
-Fortunately `read_csv` also has a `thousands` parameter ([documentation](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html)):
-
-```{python}
-#| code-fold: false
-# improve readability: chaining method calls with outer parentheses/line breaks
-tb_df = (
- pd.read_csv("data/cdc_tuberculosis.csv", header=1, thousands=',')
- .rename(columns=rename_dict)
-)
-tb_df.head(5)
-```
-
-```{python}
-#| code-fold: false
-tb_df.sum()
-```
-
-The Total TB cases look right. Phew!
-
-Let's just look at the records with **state-level granularity**:
-
-```{python}
-#| code-fold: true
-state_tb_df = tb_df[1:]
-state_tb_df.head(5)
-```
-
-## Gather Census Data
-
-U.S. Census population estimates [source](https://www.census.gov/data/tables/time-series/demo/popest/2010s-state-total.html) (2019), [source](https://www.census.gov/data/tables/time-series/demo/popest/2020s-state-total.html) (2020-2021).
-
-Running the below cells cleans the data.
-There are a few new methods here:
-* `df.convert_dtypes()` ([documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.convert_dtypes.html)) conveniently converts columns to the best possible dtypes (for example, floats that hold whole numbers become integers); the details are out of scope for the class.
-* `df.dropna()` ([documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.dropna.html)) will be explained in more detail next time.
-
-```{python}
-#| code-fold: true
-# 2010s census data
-census_2010s_df = pd.read_csv("data/nst-est2019-01.csv", header=3, thousands=",")
-census_2010s_df = (
- census_2010s_df
- .reset_index()
- .drop(columns=["index", "Census", "Estimates Base"])
- .rename(columns={"Unnamed: 0": "Geographic Area"})
- .convert_dtypes() # "smart" converting of columns, use at your own risk
- .dropna() # we'll introduce this next time
-)
-census_2010s_df['Geographic Area'] = census_2010s_df['Geographic Area'].str.strip('.')
-
-# with pd.option_context('display.min_rows', 30): # shows more rows
-# display(census_2010s_df)
-
-census_2010s_df.head(5)
-```
-
-Occasionally, you will want to modify code that you have imported. To reimport those modifications, you can either use `Python`'s `importlib` library:
-
-```python
-from importlib import reload
-reload(utils)
-```
-
-or use `IPython` magic, which will intelligently reimport code when files change:
-
-```python
-%load_ext autoreload
-%autoreload 2
-```
-
-```{python}
-#| code-fold: true
-# census 2020s data
-census_2020s_df = pd.read_csv("data/NST-EST2022-POP.csv", header=3, thousands=",")
-census_2020s_df = (
- census_2020s_df
- .reset_index()
- .drop(columns=["index", "Unnamed: 1"])
- .rename(columns={"Unnamed: 0": "Geographic Area"})
- .convert_dtypes() # "smart" converting of columns, use at your own risk
- .dropna() # we'll introduce this next time
-)
-census_2020s_df['Geographic Area'] = census_2020s_df['Geographic Area'].str.strip('.')
-
-census_2020s_df.head(5)
-```
-
-## Joining Data (Merging `DataFrame`s)
-
-Time to `merge`! Here we use the `DataFrame` method `df1.merge(right=df2, ...)` on `DataFrame` `df1` ([documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.merge.html)). Contrast this with the function `pd.merge(left=df1, right=df2, ...)` ([documentation](https://pandas.pydata.org/docs/reference/api/pandas.merge.html?highlight=pandas%20merge#pandas.merge)). Feel free to use either.
-
-```{python}
-#| code-fold: false
-# merge TB DataFrame with two US census DataFrames
-tb_census_df = (
- tb_df
- .merge(right=census_2010s_df,
- left_on="U.S. jurisdiction", right_on="Geographic Area")
- .merge(right=census_2020s_df,
- left_on="U.S. jurisdiction", right_on="Geographic Area")
-)
-tb_census_df.head(5)
-```
-
-Having all of these columns is a little unwieldy. We could either drop the unneeded columns now, or just merge on smaller census `DataFrame`s. Let's do the latter.
-
-```{python}
-#| code-fold: false
-# try merging again, but cleaner this time
-tb_census_df = (
- tb_df
- .merge(right=census_2010s_df[["Geographic Area", "2019"]],
- left_on="U.S. jurisdiction", right_on="Geographic Area")
- .drop(columns="Geographic Area")
- .merge(right=census_2020s_df[["Geographic Area", "2020", "2021"]],
- left_on="U.S. jurisdiction", right_on="Geographic Area")
- .drop(columns="Geographic Area")
-)
-tb_census_df.head(5)
-```
-
-## Reproducing Data: Compute Incidence
-
-Let's recompute incidence to make sure we know where the original CDC numbers came from.
-
-From the [CDC report](https://www.cdc.gov/mmwr/volumes/71/wr/mm7112a1.htm?s_cid=mm7112a1_w#T1_down): TB incidence is computed as “Cases per 100,000 persons using mid-year population estimates from the U.S. Census Bureau.”
-
-If we define a group as 100,000 people, then we can compute the TB incidence for a given state population as
-
-$$\text{TB incidence} = \frac{\text{TB cases in population}}{\text{groups in population}} = \frac{\text{TB cases in population}}{\text{population}/100000} $$
-
-$$= \frac{\text{TB cases in population}}{\text{population}} \times 100000$$
-
-Let's try this for 2019:
-
-```{python}
-#| code-fold: false
-tb_census_df["recompute incidence 2019"] = tb_census_df["TB cases 2019"]/tb_census_df["2019"]*100000
-tb_census_df.head(5)
-```
-
-Awesome!!!
-
-Let's use a for-loop and `Python` format strings to compute TB incidence for all years. `Python` f-strings are just used for the purposes of this demo, but they're handy to know when you explore data beyond this course ([documentation](https://docs.python.org/3/tutorial/inputoutput.html)).
-
-```{python}
-#| code-fold: false
-# recompute incidence for all years
-for year in [2019, 2020, 2021]:
- tb_census_df[f"recompute incidence {year}"] = tb_census_df[f"TB cases {year}"]/tb_census_df[f"{year}"]*100000
-tb_census_df.head(5)
-```
-
-These numbers look pretty close!!! There are a few errors in the hundredths place, particularly in 2021. It may be useful to further explore reasons behind this discrepancy.
-
-```{python}
-#| code-fold: false
-tb_census_df.describe()
-```
-
-## Bonus EDA: Reproducing the Reported Statistic
-
-
-**How do we reproduce that reported statistic in the original [CDC report](https://www.cdc.gov/mmwr/volumes/71/wr/mm7112a1.htm?s_cid=mm7112a1_w)?**
-
-> Reported TB incidence (cases per 100,000 persons) increased **9.4%**, from **2.2** during 2020 to **2.4** during 2021 but was lower than incidence during 2019 (2.7). Increases occurred among both U.S.-born and non–U.S.-born persons.
-
-This is TB incidence computed across the entire U.S. population! How do we reproduce this?
-* We need to reproduce the "Total" TB incidences in our rolled record.
-* But our current `tb_census_df` only has 51 entries (50 states plus Washington, D.C.). There is no rolled record.
-* What happened...?
-
-Let's get exploring!
-
-Before we keep exploring, we'll set all indexes to more meaningful values, instead of just numbers that pertain to some row at some point. This will make our cleaning slightly easier.
-
-```{python}
-#| code-fold: true
-tb_df = tb_df.set_index("U.S. jurisdiction")
-tb_df.head(5)
-```
-
-```{python}
-#| code-fold: false
-census_2010s_df = census_2010s_df.set_index("Geographic Area")
-census_2010s_df.head(5)
-```
-
-```{python}
-#| code-fold: false
-census_2020s_df = census_2020s_df.set_index("Geographic Area")
-census_2020s_df.head(5)
-```
-
-It turns out that our merge above only kept state records, even though our original `tb_df` had the "Total" rolled record:
-
-```{python}
-#| code-fold: false
-tb_df.head()
-```
-
-Recall that `merge` does an **inner** merge by default, meaning that it only preserves keys that are present in **both** `DataFrame`s.
-
-The rolled records in our census `DataFrame` have different `Geographic Area` fields, which was the key we merged on:
-
-```{python}
-#| code-fold: false
-census_2010s_df.head(5)
-```
-
-The Census `DataFrame` has several rolled records. The aggregate record we are looking for actually has the Geographic Area named "United States".
-
-One straightforward way to get the right merge is to rename the value itself. Because we now have the Geographic Area index, we'll use `df.rename()` ([documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.rename.html)):
-
-```{python}
-#| code-fold: false
-# rename rolled record for 2010s
-census_2010s_df.rename(index={'United States':'Total'}, inplace=True)
-census_2010s_df.head(5)
-```
-
-```{python}
-#| code-fold: false
-# same, but for 2020s rename rolled record
-census_2020s_df.rename(index={'United States':'Total'}, inplace=True)
-census_2020s_df.head(5)
-```
-
-<br/>
-
-Next let's rerun our merge. Note the different chaining, because we are now merging on indexes (`df.merge()` [documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.merge.html)).
-
-```{python}
-#| code-fold: false
-tb_census_df = (
- tb_df
- .merge(right=census_2010s_df[["2019"]],
- left_index=True, right_index=True)
- .merge(right=census_2020s_df[["2020", "2021"]],
- left_index=True, right_index=True)
-)
-tb_census_df.head(5)
-```
-
-<br/>
-
-Finally, let's recompute our incidences:
-
-```{python}
-#| code-fold: false
-# recompute incidence for all years
-for year in [2019, 2020, 2021]:
- tb_census_df[f"recompute incidence {year}"] = tb_census_df[f"TB cases {year}"]/tb_census_df[f"{year}"]*100000
-tb_census_df.head(5)
-```
-
-We reproduced the total U.S. incidences correctly!
-
-We're almost there. Let's revisit the quote:
-
-> Reported TB incidence (cases per 100,000 persons) increased **9.4%**, from **2.2** during 2020 to **2.4** during 2021 but was lower than incidence during 2019 (2.7). Increases occurred among both U.S.-born and non–U.S.-born persons.
-
-Recall that percent change from $A$ to $B$ is computed as
-$\text{percent change} = \frac{B - A}{A} \times 100$.
-
-```{python}
-#| code-fold: false
-#| tags: []
-incidence_2020 = tb_census_df.loc['Total', 'recompute incidence 2020']
-incidence_2020
-```
-
-```{python}
-#| code-fold: false
-#| tags: []
-incidence_2021 = tb_census_df.loc['Total', 'recompute incidence 2021']
-incidence_2021
-```
-
-```{python}
-#| code-fold: false
-#| tags: []
-difference = (incidence_2021 - incidence_2020)/incidence_2020 * 100
-difference
-```
-
-# EDA Demo 2: Mauna Loa CO<sub>2</sub> Data -- A Lesson in Data Faithfulness
-
-[Mauna Loa Observatory](https://gml.noaa.gov/ccgg/trends/data.html) has been monitoring CO<sub>2</sub> concentrations since 1958.
-
-```{python}
-#| code-fold: false
-co2_file = "data/co2_mm_mlo.txt"
-```
-
-Let's do some **EDA**!!
-
-## Reading this file into Pandas?
-Let's instead check out this `.txt` file. Some questions to keep in mind: Do we trust this file extension? What structure does it have?
-
-Lines 71-78 (inclusive) are shown below:
-
- line number | file contents
-
- 71 | # decimal average interpolated trend #days
- 72 | # date (season corr)
- 73 | 1958 3 1958.208 315.71 315.71 314.62 -1
- 74 | 1958 4 1958.292 317.45 317.45 315.29 -1
- 75 | 1958 5 1958.375 317.50 317.50 314.71 -1
- 76 | 1958 6 1958.458 -99.99 317.10 314.85 -1
- 77 | 1958 7 1958.542 315.86 315.86 314.98 -1
- 78 | 1958 8 1958.625 314.93 314.93 315.94 -1
-
-
-Notice how:
-
-- The values are separated by white space, possibly tabs.
-- The data line up down the rows. For example, the month appears in the 7th to 8th position of each line.
-- The 71st and 72nd lines in the file contain column headings split over two lines.
-
-We can use `read_csv` to read the data into a `pandas` `DataFrame`, and we provide several arguments to specify that the separator is whitespace, there is no header (**we will set our own column names**), and the first 72 rows of the file should be skipped.
-
-```{python}
-#| code-fold: false
-co2 = pd.read_csv(
- co2_file, header = None, skiprows = 72,
- sep = r'\s+' # delimiter for continuous whitespace (stay tuned for regex next lecture)
-)
-co2.head()
-```
-
-Congratulations! You've wrangled the data!
-
-<br/>
-
-...But our columns aren't named.
-**We need to do more EDA.**
-
-## Exploring Variable Feature Types
-
-The NOAA [webpage](https://gml.noaa.gov/ccgg/trends/) might have some useful tidbits (in this case it doesn't).
-
-Using this information, we'll rerun `pd.read_csv`, but this time with some **custom column names.**
-
-```{python}
-#| code-fold: false
-co2 = pd.read_csv(
- co2_file, header = None, skiprows = 72,
- sep = r'\s+', # regex for continuous whitespace (next lecture)
- names = ['Yr', 'Mo', 'DecDate', 'Avg', 'Int', 'Trend', 'Days']
-)
-co2.head()
-```
-
-## Visualizing CO<sub>2</sub>
-Scientific studies tend to have very clean data, right...? Let's jump right in and make a time series plot of CO2 monthly averages.
-
-```{python}
-#| code-fold: true
-sns.lineplot(x='DecDate', y='Avg', data=co2);
-```
-
-The code above uses the `seaborn` plotting library (abbreviated `sns`). We will cover it in the Visualization lecture; for now, you don't need to worry about how it works!
-
-Yikes! Plotting the data uncovered a problem. The sharp vertical lines suggest that we have some **missing values**. What happened here?
-
-```{python}
-#| code-fold: false
-co2.head()
-```
-
-```{python}
-#| code-fold: false
-co2.tail()
-```
-
-Some data have unusual values like -1 and -99.99.
-
-Let's check the description at the top of the file again.
-
-* -1 signifies a missing value for the number of days `Days` the equipment was in operation that month.
-* -99.99 denotes a missing monthly average `Avg`
-
-How can we fix this? First, let's explore other aspects of our data. Understanding our data will help us decide what to do with the missing values.
-
-<br/>
-
-
-## Sanity Checks: Reasoning about the data
-First, we consider the shape of the data. How many rows should we have?
-
-* If the data are in chronological order, we should have one record per month.
-* Data from March 1958 to August 2019.
-* We should have $ 12 \times (2019-1957) - 2 - 4 = 738 $ records.
-
-```{python}
-#| code-fold: false
-co2.shape
-```
-
-Nice!! The number of rows (i.e., records) matches our expectations.
-
-<br/>
-
-
-Let's now check the quality of each feature.
-
-## Understanding Missing Value 1: `Days`
-`Days` is a time field, so let's analyze other time fields to see if there is an explanation for missing values of days of operation.
-
-Let's start with **months**, `Mo`.
-
-Are we missing any records? Each month should appear 61 or 62 times (March 1958 to August 2019).
-
-```{python}
-#| code-fold: false
-co2["Mo"].value_counts().sort_index()
-```
-
-As expected, Jan, Feb, Sep, Oct, Nov, and Dec have 61 occurrences, and the rest have 62.
-
-<br/>
-
-Next let's explore **days** `Days` itself, which is the number of days that the measurement equipment worked.
-
-```{python}
-#| code-fold: true
-sns.displot(co2['Days']);
-plt.title("Distribution of days feature"); # suppresses unneeded plotting output
-```
-
-In terms of data quality, a handful of months have averages based on measurements taken on fewer than half the days. In addition, there are nearly 200 missing values--**that's about 27% of the data**!
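-
-We can verify that fraction directly (recall that -1 is the sentinel for a missing `Days` value):
-
-```{python}
-#| code-fold: true
-# count and fraction of records where Days is recorded as missing (-1)
-missing_days = (co2["Days"] == -1)
-missing_days.sum(), missing_days.mean()
-```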
-
-<br/>
-
-Finally, let's check the last time feature, **year** `Yr`.
-
-Let's check to see if there is any connection between missing-ness and the year of the recording.
-
-```{python}
-#| code-fold: true
-sns.scatterplot(x="Yr", y="Days", data=co2);
-plt.title("Day field by Year"); # the ; suppresses output
-```
-
-**Observations**:
-
-* All of the missing data are in the early years of operation.
-* It appears there may have been problems with equipment in the mid to late 80s.
-
-**Potential Next Steps**:
-
-* Confirm these explanations through documentation about the historical readings.
-* Maybe drop earliest recordings? However, we would want to delay such action until after we have examined the time trends and assess whether there are any potential problems.
-
-<br/>
-
-## Understanding Missing Value 2: `Avg`
-Next, let's return to the -99.99 values in `Avg` to analyze the overall quality of the CO2 measurements. We'll plot a histogram of the average CO<sub>2</sub> measurements:
-
-```{python}
-#| code-fold: true
-# Histograms of average CO2 measurements
-sns.displot(co2['Avg']);
-```
-
-The non-missing values are in the 300-400 range (a regular range of CO2 levels).
-
-We also see that there are only a few missing `Avg` values (**<1% of values**). Let's examine all of them:
-
-```{python}
-#| code-fold: false
-co2[co2["Avg"] < 0]
-```
-
-There doesn't seem to be a pattern to these values, other than that most records also were missing `Days` data.
-
-## Drop, `NaN`, or Impute Missing `Avg` Data?
-
-How should we address the invalid `Avg` data?
-
-1. Drop records
-2. Set to NaN
-3. Impute using some strategy
-
-Remember we want to fix the following plot:
-
-```{python}
-#| code-fold: true
-sns.lineplot(x='DecDate', y='Avg', data=co2)
-plt.title("CO2 Average By Month");
-```
-
-Since we are plotting `Avg` vs `DecDate`, we should just focus on dealing with missing values for `Avg`.
-
-
-Let's consider a few options:
-
-1. Drop those records
-2. Replace -99.99 with NaN
-3. Substitute -99.99 with a likely value for the average CO2
-
-What do you think are the pros and cons of each possible action?
-
-<br/>
-
-
-Let's examine each of these three options.
-
-```{python}
-#| code-fold: false
-# 1. Drop missing values
-co2_drop = co2[co2['Avg'] > 0]
-co2_drop.head()
-```
-
-```{python}
-#| code-fold: false
-# 2. Replace -99.99 with NaN
-co2_NA = co2.replace(-99.99, np.nan)
-co2_NA.head()
-```
-
-We'll also use a third version of the data.
-
-First, we note that the dataset already comes with a **substitute value** for the -99.99.
-
-From the file description:
-
-> The `interpolated` column includes average values from the preceding column (`average`)
-and **interpolated values** where data are missing. Interpolated values are
-computed in two steps...
-
-The `Int` feature has values that exactly match those in `Avg`, except when `Avg` is -99.99, and then a **reasonable** estimate is used instead.
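-
-We can quickly sanity-check that claim: wherever `Avg` is not the -99.99 sentinel, `Avg` and `Int` should agree exactly.
-
-```{python}
-#| code-fold: true
-# check that Avg and Int match on all non-missing rows
-valid = co2["Avg"] > 0
-(co2.loc[valid, "Avg"] == co2.loc[valid, "Int"]).all()
-```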
-
-So, the third version of our data will use the `Int` feature instead of `Avg`.
-
-```{python}
-#| code-fold: false
-# 3. Use interpolated column which estimates missing Avg values
-co2_impute = co2.copy()
-co2_impute['Avg'] = co2['Int']
-co2_impute.head()
-```
-
-What's a **reasonable** estimate?
-
-To answer this question, let's zoom in on a short time period, say the measurements in 1958 (where we know we have two missing values).
-
-```{python}
-#| code-fold: true
-# results of plotting data in 1958
-
-def line_and_points(data, ax, title):
- # assumes single year, hence Mo
- ax.plot('Mo', 'Avg', data=data)
- ax.scatter('Mo', 'Avg', data=data)
- ax.set_xlim(2, 13)
- ax.set_title(title)
- ax.set_xticks(np.arange(3, 13))
-
-def data_year(data, year):
- return data[data["Yr"] == year]
-
-# uses matplotlib subplots
-# you may see more next week; focus on output for now
-fig, axes = plt.subplots(ncols = 3, figsize=(12, 4), sharey=True)
-
-year = 1958
-line_and_points(data_year(co2_drop, year), axes[0], title="1. Drop Missing")
-line_and_points(data_year(co2_NA, year), axes[1], title="2. Missing Set to NaN")
-line_and_points(data_year(co2_impute, year), axes[2], title="3. Missing Interpolated")
-
-fig.suptitle(f"Monthly Averages for {year}")
-plt.tight_layout()
-```
-
-In the big picture, since there are only 7 `Avg` values missing (**<1%** of 738 months), any of these approaches would work.
-
-However, there is some appeal to **option 3, imputing**:
-
-* It preserves the seasonal trends for CO2.
-* We are plotting all months in our data as a line plot, so interpolated values keep the line unbroken.
-
-<br/>
-
-
-Let's replot our original figure with option 3:
-
-```{python}
-#| code-fold: true
-sns.lineplot(x='DecDate', y='Avg', data=co2_impute)
-plt.title("CO2 Average By Month, Imputed");
-```
-
-Looks pretty close to what we see on the NOAA [website](https://gml.noaa.gov/ccgg/trends/)!
-
-## Presenting the data: A Discussion on Data Granularity
-
-From the description:
-
-* Monthly measurements are averages of daily average measurements.
-* The NOAA GML website has datasets for daily/hourly measurements too.
-
-The data you present depends on your research question.
-
-**How do CO2 levels vary by season?**
-
-* You might want to keep average monthly data.
-
-**Are CO2 levels rising over the past 50+ years, consistent with global warming predictions?**
-
-* You might be happier with a **coarser granularity** of average year data!
-
-```{python}
-#| code-fold: true
-co2_year = co2_impute.groupby('Yr').mean()
-sns.lineplot(x='Yr', y='Avg', data=co2_year)
-plt.title("CO2 Average By Year");
-```
-
-Indeed, we see a rise by nearly 100 ppm of CO2 since Mauna Loa began recording in 1958.
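-
-As a rough check of that figure, we can compare the first and last yearly averages in our aggregated data:
-
-```{python}
-#| code-fold: true
-# approximate rise: last yearly average minus the first (1958) yearly average
-co2_year["Avg"].iloc[-1] - co2_year["Avg"].iloc[0]
-```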
-
-# Summary
-We went over a lot of content this lecture; let's summarize the most important points:
-
-## Dealing with Missing Values
-There are a few options we can take to deal with missing data:
-
-* Drop missing records
-* Keep `NaN` missing values
-* Impute using an interpolated column
-
-## EDA and Data Wrangling
-There are several ways to approach EDA and Data Wrangling:
-
-* Examine the **data and metadata**: what is the date, size, organization, and structure of the data?
-* Examine each **field/attribute/dimension** individually.
-* Examine pairs of related dimensions (e.g. breaking down grades by major).
-* Along the way, we can:
- * **Visualize** or summarize the data.
- * **Validate assumptions** about data and its collection process. Pay particular attention to when the data was collected.
- * Identify and **address anomalies**.
- * Apply data transformations and corrections (we'll cover this in the upcoming lecture).
- * **Record everything you do!** Developing in Jupyter Notebook promotes *reproducibility* of your own work!
+---
+title: Data Cleaning and EDA
+execute:
+ echo: true
+format:
+ html:
+ code-fold: true
+ code-tools: true
+ toc: true
+ toc-title: Data Cleaning and EDA
+ page-layout: full
+ theme:
+ - cosmo
+ - cerulean
+ callout-icon: false
+jupyter: python3
+---
+
+```{python}
+#| code-fold: true
+import numpy as np
+import pandas as pd
+
+import matplotlib.pyplot as plt
+import seaborn as sns
+#%matplotlib inline
+plt.rcParams['figure.figsize'] = (12, 9)
+
+sns.set()
+sns.set_context('talk')
+np.set_printoptions(threshold=20, precision=2, suppress=True)
+pd.set_option('display.max_rows', 30)
+pd.set_option('display.max_columns', None)
+pd.set_option('display.precision', 2)
+# This option stops scientific notation for pandas
+pd.set_option('display.float_format', '{:.2f}'.format)
+
+# Silence some spurious seaborn warnings
+import warnings
+warnings.filterwarnings("ignore", category=FutureWarning)
+```
+
+::: {.callout-note collapse="false"}
+## Learning Outcomes
+* Recognize common file formats
+* Categorize data by its variable type
+* Build awareness of issues with data faithfulness and develop targeted solutions
+:::
+
+**This content is covered in lectures 4, 5, and 6.**
+
+In the past few lectures, we've learned that `pandas` is a toolkit to restructure, modify, and explore a dataset. What we haven't yet touched on is *how* to make these data transformation decisions. When we receive a new set of data from the "real world," how do we know what processing we should do to convert this data into a usable form?
+
+**Data cleaning**, also called **data wrangling**, is the process of transforming raw data to facilitate subsequent analysis. It is often used to address issues like:
+
+* Unclear structure or formatting
+* Missing or corrupted values
+* Unit conversions
+* ...and so on
+
+**Exploratory Data Analysis (EDA)** is the process of understanding a new dataset. It is an open-ended, informal analysis that involves familiarizing ourselves with the variables present in the data, discovering potential hypotheses, and identifying possible issues with the data. This last point can often motivate further data cleaning to address any problems with the dataset's format; because of this, EDA and data cleaning are often thought of as an "infinite loop," with each process driving the other.
+
+In this lecture, we will consider the key properties of data to consider when performing data cleaning and EDA. In doing so, we'll develop a "checklist" of sorts for you to consider when approaching a new dataset. Throughout this process, we'll build a deeper understanding of this early (but very important!) stage of the data science lifecycle.
+
+## Structure
+
+### File Formats
+There are many file types for storing structured data: TSV, JSON, XML, ASCII, SAS, etc. We'll only cover CSV, TSV, and JSON in lecture, but you'll likely encounter other formats as you work with different datasets. Reading documentation is your best bet for understanding how to process the multitude of different file types.
+
+#### CSV
+CSVs, which stand for **Comma-Separated Values**, are a common tabular data format.
+In the past two `pandas` lectures, we briefly touched on the idea of file format: the way data is encoded in a file for storage. Specifically, our `elections` and `babynames` datasets were stored and loaded as CSVs:
+
+```{python}
+#| code-fold: false
+pd.read_csv("data/elections.csv").head(5)
+```
+
+To better understand the properties of a CSV, let's take a look at the first few rows of the raw data file to see what it looks like before being loaded into a `DataFrame`. We'll use the `repr()` function to return the raw string with its special characters:
+
+```{python}
+#| code-fold: false
+with open("data/elections.csv", "r") as table:
+ i = 0
+ for row in table:
+ print(repr(row))
+ i += 1
+ if i > 3:
+ break
+```
+
+Each row, or **record**, in the data is delimited by a newline `\n`. Each column, or **field**, in the data is delimited by a comma `,` (hence, comma-separated!).
+
+#### TSV
+
+Another common file type is **TSV (Tab-Separated Values)**. In a TSV, records are still delimited by a newline `\n`, while fields are delimited by the tab character `\t`.
+
+Let's check out the first few rows of the raw TSV file. Again, we'll use the `repr()` function so that `print` shows the special characters.
+
+```{python}
+#| code-fold: false
+with open("data/elections.txt", "r") as table:
+ i = 0
+ for row in table:
+ print(repr(row))
+ i += 1
+ if i > 3:
+ break
+```
+
+TSVs can be loaded into `pandas` using `pd.read_csv`. We'll need to specify the **delimiter** with the parameter `sep='\t'` [(documentation)](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html).
+
+```{python}
+#| code-fold: false
+pd.read_csv("data/elections.txt", sep='\t').head(3)
+```
+
+An issue with CSVs and TSVs comes up whenever there are commas or tabs within the records. How does `pandas` differentiate between a comma delimiter vs. a comma within the field itself, for example `8,900`? To remedy this, check out the [`quotechar` parameter](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html).
+
+#### JSON
+**JSON (JavaScript Object Notation)** files behave similarly to Python dictionaries. A raw JSON is shown below.
+
+```{python}
+#| code-fold: false
+with open("data/elections.json", "r") as table:
+ i = 0
+ for row in table:
+ print(row)
+ i += 1
+ if i > 8:
+ break
+```
+
+JSON files can be loaded into `pandas` using `pd.read_json`.
+
+```{python}
+#| code-fold: false
+pd.read_json('data/elections.json').head(3)
+```
+
+##### EDA with JSON: Berkeley COVID-19 Data
+The City of Berkeley Open Data [website](https://data.cityofberkeley.info/Health/COVID-19-Confirmed-Cases/xn6j-b766) has a dataset with COVID-19 Confirmed Cases among Berkeley residents by date. Let's download the file and save it as a JSON (note that the source URL file type is also a JSON). In the interest of reproducible data science, we will download the data programmatically. We have defined some helper functions in the [`ds100_utils.py`](https://ds100.org/fa23/resources/assets/lectures/lec05/lec05-eda.html) file so that we can reuse them in many different notebooks.
+
+```{python}
+#| code-fold: false
+from ds100_utils import fetch_and_cache
+
+covid_file = fetch_and_cache(
+ "https://data.cityofberkeley.info/api/views/xn6j-b766/rows.json?accessType=DOWNLOAD",
+ "confirmed-cases.json",
+ force=False)
+covid_file # a file path wrapper object
+```
+
+###### File Size
+Let's start our analysis by getting a rough estimate of the size of the dataset to inform the tools we use to view the data. For relatively small datasets, we can use a text editor or spreadsheet. For larger datasets, more programmatic exploration or distributed computing tools may be more fitting. Here we will use `Python` tools to probe the file.
+
+Since this appears to be a text file, let's investigate the number of lines, which often corresponds to the number of records.
+
+```{python}
+#| code-fold: false
+import os
+
+print(covid_file, "is", os.path.getsize(covid_file) / 1e6, "MB")
+
+with open(covid_file, "r") as f:
+ print(covid_file, "is", sum(1 for l in f), "lines.")
+```
+
+###### Unix Commands
+As part of the EDA workflow, Unix commands can come in very handy. In fact, there's an entire book called ["Data Science at the Command Line"](https://datascienceatthecommandline.com/) that explores this idea in depth!
+In Jupyter/IPython, you can prefix lines with `!` to execute arbitrary Unix commands, and within those lines, you can refer to `Python` variables and expressions with the syntax `{expr}`.
+
+Here, we use the `ls` command to list files, using the `-lh` flags, which request "long format with information in human-readable form." We also use the `wc` command for "word count," but with the `-l` flag, which asks for line counts instead of words.
+
+These two give us the same information as the code above, albeit in a slightly different form:
+
+```{python}
+#| code-fold: false
+!ls -lh {covid_file}
+!wc -l {covid_file}
+```
+
+###### File Contents
+Let's explore the data format using `Python`.
+
+```{python}
+#| code-fold: false
+with open(covid_file, "r") as f:
+ for i, row in enumerate(f):
+ print(repr(row)) # print raw strings
+ if i >= 4: break
+```
+
+We can use the `head` Unix command (which is where `pandas`' `head` method comes from!) to see the first few lines of the file:
+
+```{python}
+#| code-fold: false
+!head -5 {covid_file}
+```
+
+In order to load the JSON file into `pandas`, let's first do some EDA with `Python`'s `json` package to understand the particular structure of this JSON file so that we can decide what (if anything) to load into `pandas`. `Python` has relatively good support for JSON data since it closely matches the internal `Python` object model. In the following cell, we import the entire JSON datafile into a `Python` dictionary using the `json` package.
+
+```{python}
+#| code-fold: false
+import json
+
+with open(covid_file, "rb") as f:
+ covid_json = json.load(f)
+```
+
+The `covid_json` variable is now a dictionary encoding the data in the file:
+
+```{python}
+#| code-fold: false
+type(covid_json)
+```
+
+We can examine the keys of the top-level JSON object by listing them out.
+
+```{python}
+#| code-fold: false
+covid_json.keys()
+```
+
+**Observation**: The JSON dictionary contains a `meta` key, which likely refers to metadata (data about the data). Metadata is often maintained with the data and can be a good source of additional information.
+
+
+We can investigate the metadata further by examining its keys.
+
+```{python}
+#| code-fold: false
+covid_json['meta'].keys()
+```
+
+The `meta` key contains another dictionary called `view`. This likely refers to metadata about a particular "view" of some underlying database. We will learn more about views when we study SQL later in the class.
+
+```{python}
+#| code-fold: false
+covid_json['meta']['view'].keys()
+```
+
+Notice that this is a nested/recursive data structure. As we dig deeper, we reveal more and more keys and the corresponding data (a short sketch after the diagram below walks these keys programmatically):
+
+```
+covid_json
+|-> data
+|     | ... (haven't explored yet)
+|-> meta
+      |-> view
+            | -> id
+            | -> name
+            | -> attribution
+            ...
+            | -> description
+            ...
+            | -> columns
+            ...
+```
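+
+If you want to dig through this nesting without printing the whole file, a small recursive helper can walk the dictionary and print only the keys. This is just an exploratory sketch of ours, applied to the `covid_json` dictionary loaded above:
+
+```python
+def print_keys(obj, depth=0, max_depth=2):
+    """Recursively print the keys of nested dictionaries, up to max_depth levels."""
+    if not isinstance(obj, dict) or depth > max_depth:
+        return
+    for key, value in obj.items():
+        print("    " * depth + f"|-> {key}")
+        print_keys(value, depth + 1, max_depth)
+
+print_keys(covid_json, max_depth=1)
+```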
+
+
+There is a key called `description` in the `view` sub-dictionary. This likely contains a description of the data:
+
+```{python}
+#| code-fold: false
+print(covid_json['meta']['view']['description'])
+```
+
+###### Examining the Data Field for Records
+
+We can look at a few entries in the `data` field. This is what we'll load into `pandas`.
+
+```{python}
+#| code-fold: false
+for i in range(3):
+ print(f"{i:03} | {covid_json['data'][i]}")
+```
+
+Observations:
+
+* These look like equal-length records, so maybe `data` is a table!
+* But what does each value in the record mean? Where can we find the column headers?
+
+For that, we'll need the `columns` key in the metadata dictionary. This returns a list:
+
+```{python}
+#| code-fold: false
+type(covid_json['meta']['view']['columns'])
+```
+
+###### Summary of exploring the JSON file
+
+1. The above **metadata** tells us a lot about the columns in the data including column names, potential data anomalies, and a basic statistic.
+1. Because of its non-tabular structure, JSON makes it easier (than CSV) to create **self-documenting data**, meaning that information about the data is stored in the same file as the data.
+1. Self-documenting data can be helpful since it maintains its own description and these descriptions are more likely to be updated as data changes.
+
+###### Loading COVID Data into `pandas`
+Finally, let's load the data (not the metadata) into a `pandas` `DataFrame`. In the following block of code we:
+
+1. Translate the JSON records into a `DataFrame`:
+
+ * fields: `covid_json['meta']['view']['columns']`
+ * records: `covid_json['data']`
+
+
+1. Remove columns that have no metadata description. This would be a bad idea in general, but here we remove these columns since the above analysis suggests they are unlikely to contain useful information.
+
+1. Examine the `tail` of the table.
+
+```{python}
+#| code-fold: false
+# Load the data from JSON and assign column titles
+covid = pd.DataFrame(
+ covid_json['data'],
+ columns=[c['name'] for c in covid_json['meta']['view']['columns']])
+
+covid.tail()
+```
+
+### Variable Types
+
+After loading data from a file, it's a good idea to take the time to understand what pieces of information are encoded in the dataset. In particular, we want to identify what variable types are present in our data. Broadly speaking, we can categorize variables into one of two overarching types.
+
+**Quantitative variables** describe some numeric quantity or amount. We can divide quantitative data further into:
+
+* **Continuous quantitative variables**: numeric data that can be measured on a continuous scale to arbitrary precision. Continuous variables do not have a strict set of possible values – they can be recorded to any number of decimal places. For example, weights, GPA, or CO<sub>2</sub> concentrations.
+* **Discrete quantitative variables**: numeric data that can only take on a finite set of possible values. For example, someone's age or the number of siblings they have.
+
+**Qualitative variables**, also known as **categorical variables**, describe data that isn't measuring some quantity or amount. The sub-categories of categorical data are:
+
+* **Ordinal qualitative variables**: categories with ordered levels, where the differences between levels have no consistent, quantifiable meaning. Some examples include levels of education (high school, undergrad, grad, etc.), income bracket (low, medium, high), or Yelp rating.
+* **Nominal qualitative variables**: categories with no specific order. For example, someone's political affiliation or Cal ID number.
+
+![Classification of variable types](images/variable.png)
+
+Note that many variables don't sit neatly in just one of these categories. Qualitative variables could have numeric levels, and conversely, quantitative variables could be stored as strings.
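+
+As a quick, made-up illustration of this point: below, a nominal variable (zip code) arrives stored as numbers and a quantitative variable (weight) arrives stored as strings, and part of EDA is recasting each column to a type that matches its variable type.
+
+```python
+import pandas as pd
+
+df = pd.DataFrame({
+    "zip_code": [94720, 94704, 94709],     # nominal, despite being numeric
+    "weight":   ["61.2", "73.5", "68.0"],  # continuous quantitative, despite being strings
+})
+print(df.dtypes)
+
+# Recast each column to match its variable type
+df["zip_code"] = df["zip_code"].astype(str)   # treat zip codes as labels, not numbers
+df["weight"] = df["weight"].astype(float)     # treat weights as continuous quantities
+print(df.dtypes)
+```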
+
+### Primary and Foreign Keys
+
+Last time, we introduced `.merge` as the `pandas` method for joining multiple `DataFrame`s together. In our discussion of joins, we touched on the idea of using a "key" to determine what rows should be merged from each table. Let's take a moment to examine this idea more closely.
+
+The **primary key** is the column or set of columns in a table that *uniquely* determine the values of the remaining columns. It can be thought of as the unique identifier for each individual row in the table. For example, a table of Data 100 students might use each student's Cal ID as the primary key.
+
+```{python}
+#| echo: false
+pd.DataFrame({"Cal ID":[3034619471, 3035619472, 3025619473, 3046789372], \
+ "Name":["Oski", "Ollie", "Orrie", "Ollie"], \
+ "Major":["Data Science", "Computer Science", "Data Science", "Economics"]})
+```
+
+The **foreign key** is the column or set of columns in a table that reference primary keys in other tables. Knowing a dataset's foreign keys can be useful when assigning the `left_on` and `right_on` parameters of `.merge`. In the table of office hour tickets below, `"Cal ID"` is a foreign key referencing the previous table.
+
+```{python}
+#| echo: false
+pd.DataFrame({"OH Request":[1, 2, 3, 4], \
+ "Cal ID":[3034619471, 3035619472, 3025619473, 3035619472], \
+ "Question":["HW 2 Q1", "HW 2 Q3", "Lab 3 Q4", "HW 2 Q7"]})
+```
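+
+Here is a minimal sketch of how these keys are used in practice: the foreign key in the office hours table lines up with the primary key in the student table, so we merge on `"Cal ID"`. The toy `DataFrame`s below simply mirror the two tables shown above.
+
+```python
+import pandas as pd
+
+students = pd.DataFrame({
+    "Cal ID": [3034619471, 3035619472, 3025619473, 3046789372],
+    "Name": ["Oski", "Ollie", "Orrie", "Ollie"],
+    "Major": ["Data Science", "Computer Science", "Data Science", "Economics"],
+})
+
+tickets = pd.DataFrame({
+    "OH Request": [1, 2, 3, 4],
+    "Cal ID": [3034619471, 3035619472, 3025619473, 3035619472],
+    "Question": ["HW 2 Q1", "HW 2 Q3", "Lab 3 Q4", "HW 2 Q7"],
+})
+
+# The foreign key in `tickets` references the primary key in `students`;
+# since both columns share a name, on="Cal ID" works in place of left_on/right_on
+print(tickets.merge(students, on="Cal ID"))
+```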
+
+## Granularity, Scope, and Temporality
+
+After understanding the structure of the dataset, the next task is to determine what exactly the data represents. We'll do so by considering the data's granularity, scope, and temporality.
+
+### Granularity
+The **granularity** of a dataset is what a single row represents. You can also think of it as the level of detail included in the data. To determine the data's granularity, ask: what does each row in the dataset represent? Fine-grained data contains a high level of detail, with a single row representing a small individual unit. For example, each record may represent one person. Coarse-grained data is encoded such that a single row represents a large individual unit – for example, each record may represent a group of people.
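+
+To make this concrete, here is a small sketch on a made-up table: fine-grained data with one row per person is coarsened into one row per city using a `groupby` aggregation.
+
+```python
+import pandas as pd
+
+# Fine-grained: each row represents one (hypothetical) person
+people = pd.DataFrame({
+    "city": ["Berkeley", "Berkeley", "Oakland", "Oakland", "Oakland"],
+    "age":  [20, 35, 42, 29, 51],
+})
+
+# Coarse-grained: each row now represents a group of people (a city)
+by_city = people.groupby("city").agg(residents=("age", "size"),
+                                     mean_age=("age", "mean"))
+print(by_city)
+```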
+
+### Scope
+The **scope** of a dataset is the subset of the population covered by the data. If we were investigating student performance in Data Science courses, a dataset with a narrow scope might encompass all students enrolled in Data 100 whereas a dataset with an expansive scope might encompass all students in California.
+
+### Temporality
+The **temporality** of a dataset describes the periodicity over which the data was collected as well as when the data was most recently collected or updated.
+
+Time and date fields of a dataset could represent a few things:
+
+1. when the "event" happened
+2. when the data was collected, or when it was entered into the system
+3. when the data was copied into the database
+
+To fully understand the temporality of the data, it also may be necessary to standardize time zones or inspect recurring time-based trends in the data (do patterns recur in 24-hour periods? Over the course of a month? Seasonally?). The convention for standardizing time is Coordinated Universal Time (UTC), an international time standard measured at 0 degrees longitude that stays consistent throughout the year (no daylight savings). Berkeley's time zone, Pacific Standard Time (PST), is UTC-8; during daylight saving time, Pacific Daylight Time (PDT) is UTC-7.
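+
+As a small sketch (with a made-up timestamp), `pandas` can convert between UTC and Berkeley's local time directly; the `"America/Los_Angeles"` time zone handles the PST/PDT switch for us:
+
+```python
+import pandas as pd
+
+# A hypothetical event time recorded in UTC
+t_utc = pd.Timestamp("2023-09-01 17:30", tz="UTC")
+
+# Convert to Berkeley's local time; the tz database applies UTC-8 or UTC-7 as appropriate
+t_berkeley = t_utc.tz_convert("America/Los_Angeles")
+print(t_utc, "->", t_berkeley)
+```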
+
+#### Temporality with `pandas`' `dt` accessors
+Let's briefly look at how we can use `pandas`' `dt` accessors to work with dates/times, using the dataset you'll see in Lab 3: the Berkeley PD Calls for Service dataset.
+
+```{python}
+#| code-fold: true
+calls = pd.read_csv("data/Berkeley_PD_-_Calls_for_Service.csv")
+calls.head()
+```
+
+Looks like there are three columns with dates/times: `EVENTDT`, `EVENTTM`, and `InDbDate`.
+
+Most likely, `EVENTDT` stands for the date when the event took place, `EVENTTM` stands for the time of day the event took place (in 24-hr format), and `InDbDate` is the date this call was recorded in the database.
+
+If we check the data type of these columns, we will see they are stored as strings. We can convert them to `datetime` objects using the `pandas` function `to_datetime`.
+
+```{python}
+#| code-fold: false
+calls["EVENTDT"] = pd.to_datetime(calls["EVENTDT"])
+calls.head()
+```
+
+Now, we can use the `dt` accessor on this column.
+
+We can get the month:
+
+```{python}
+#| code-fold: false
+calls["EVENTDT"].dt.month.head()
+```
+
+Which day of the week the date is on:
+
+```{python}
+#| code-fold: false
+calls["EVENTDT"].dt.dayofweek.head()
+```
+
+Check the minimum values to see if there are any suspicious-looking dates from the 1970s (which could indicate missing dates encoded as default epoch values):
+
+```{python}
+#| code-fold: false
+calls.sort_values("EVENTDT").head()
+```
+
+Doesn't look like it! We are good!
+
+
+We can also do many things with the `dt` accessor like switching time zones and converting time back to UNIX/POSIX time. Check out the documentation on [`.dt` accessor](https://pandas.pydata.org/docs/user_guide/basics.html#basics-dt-accessors) and [time series/date functionality](https://pandas.pydata.org/docs/user_guide/timeseries.html#).
+
+## Faithfulness
+
+At this stage in our data cleaning and EDA workflow, we've achieved quite a lot: we've identified how our data is structured, come to terms with what information it encodes, and gained insight as to how it was generated. Throughout this process, we should always recall the original intent of our work in Data Science – to use data to better understand and model the real world. To achieve this goal, we need to ensure that the data we use is faithful to reality; that is, that our data accurately captures the "real world."
+
+Data used in research or industry is often "messy" – there may be errors or inaccuracies that impact the faithfulness of the dataset. Signs that data may not be faithful include:
+
+* Unrealistic or "incorrect" values, such as negative counts, locations that don't exist, or dates set in the future
+* Violations of obvious dependencies, like an age that does not match a birthday
+* Clear signs that data was entered by hand, which can lead to spelling errors or fields that are incorrectly shifted
+* Signs of data falsification, such as fake email addresses or repeated use of the same names
+* Duplicated records or fields containing the same information
+* Truncated data, e.g. older versions of Microsoft Excel limited spreadsheets to 65,536 rows and 256 columns
+
+We often solve some of these more common issues in the following ways:
+
+* Spelling errors: apply corrections or drop records that aren't in a dictionary
+* Time zone inconsistencies: convert to a common time zone (e.g. UTC)
+* Duplicated records or fields: identify and eliminate duplicates (using primary keys); see the sketch after this list
+* Unspecified or inconsistent units: infer the units and check that values are in reasonable ranges in the data
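+
+Here is a minimal sketch, on a made-up table, of two of the fixes above: eliminating duplicated records and checking that values fall in a reasonable range.
+
+```python
+import pandas as pd
+
+records = pd.DataFrame({
+    "patient_id": [1, 2, 2, 3],   # patient 2 appears twice: a duplicated record
+    "age": [34, 29, 29, -5],      # -5 is an unrealistic value
+})
+
+# Duplicated records: keep one row per primary key
+deduped = records.drop_duplicates(subset="patient_id")
+
+# Range check: flag values outside a plausible range for follow-up
+suspicious = deduped[(deduped["age"] < 0) | (deduped["age"] > 120)]
+print(suspicious)
+```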
+
+### Missing Values
+Another common issue encountered with real-world datasets is that of missing data. One strategy to resolve this is to simply drop any records with missing values from the dataset. This does, however, introduce the risk of inducing biases – it is possible that the missing or corrupt records may be systematically related to some feature of interest in the data. Another solution is to keep the missing values as `NaN` entries in the dataset.
+
+A third method to address missing data is to perform **imputation**: infer the missing values using other data available in the dataset. There is a wide variety of imputation techniques that can be implemented; some of the most common are listed below.
+
+* Average imputation: replace missing values with the average value for that field (see the sketch after this list)
+* Hot deck imputation: replace missing values with a random value (typically drawn from similar, observed records)
+* Regression imputation: develop a model to predict missing values
+* Multiple imputation: replace missing values with multiple random values
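+
+As a rough sketch of the first strategy (average imputation) on a toy `Series`:
+
+```python
+import numpy as np
+import pandas as pd
+
+s = pd.Series([12.0, np.nan, 15.0, 14.0, np.nan])
+
+# Average imputation: fill missing entries with the mean of the observed values
+print(s.fillna(s.mean()))
+```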
+
+Regardless of the strategy used to deal with missing data, we should think carefully about *why* particular records or fields may be missing – this can help inform whether or not the absence of these values is significant or meaningful.
+
+# EDA Demo 1: Tuberculosis in the United States
+
+Now, let's walk through the data-cleaning and EDA workflow to see what we can learn about the presence of Tuberculosis in the United States!
+
+We will examine the data included in the [original CDC article](https://www.cdc.gov/mmwr/volumes/71/wr/mm7112a1.htm?s_cid=mm7112a1_w#T1_down) published in 2022.
+
+
+## CSVs and Field Names
+Suppose Table 1 was saved as a CSV file located in `data/cdc_tuberculosis.csv`.
+
+We can then explore the CSV (which is a text file, and does not contain binary-encoded data) in many ways:
+
+1. Using a text editor like emacs, vim, VSCode, etc.
+2. Opening the CSV directly in DataHub (read-only), Excel, Google Sheets, etc.
+3. The `Python` file object
+4. `pandas`, using `pd.read_csv()`
+
+To try out options 1 and 2, you can view or download the Tuberculosis CSV file from the [lecture demo notebook](https://data100.datahub.berkeley.edu/hub/user-redirect/git-pull?repo=https%3A%2F%2Fgithub.com%2FDS-100%2Ffa23-student&urlpath=lab%2Ftree%2Ffa23-student%2Flecture%2Flec05%2Flec04-eda.ipynb&branch=main) under the `data` folder in the left-hand menu. Notice how the CSV file is a type of **rectangular data (i.e., tabular data) stored as comma-separated values**.
+
+Next, let's try out option 3 using the `Python` file object. We'll look at the first four lines:
+
+```{python}
+#| code-fold: true
+with open("data/cdc_tuberculosis.csv", "r") as f:
+ i = 0
+ for row in f:
+ print(row)
+ i += 1
+ if i > 3:
+ break
+```
+
+Whoa, why are there blank lines interspersed between the lines of the CSV?
+
+You may recall that all line breaks in text files are encoded as the special newline character `\n`. Python's `print()` prints each string (which already ends in a newline) and then adds an additional newline of its own.
+
+If you're curious, we can use the `repr()` function to return the raw string with all special characters:
+
+```{python}
+#| code-fold: true
+with open("data/cdc_tuberculosis.csv", "r") as f:
+ i = 0
+ for row in f:
+ print(repr(row)) # print raw strings
+ i += 1
+ if i > 3:
+ break
+```
+
+Finally, let's try option 4 and use the tried-and-true Data 100 approach: `pandas`.
+
+```{python}
+#| code-fold: false
+tb_df = pd.read_csv("data/cdc_tuberculosis.csv")
+tb_df.head()
+```
+
+You may notice some strange things about this table: what's up with the "Unnamed" column names and the first row?
+
+Congratulations — you're ready to wrangle your data! Because of how things are stored, we'll need to clean the data a bit to name our columns better.
+
+A reasonable first step is to identify the row with the right header. The `pd.read_csv()` function ([documentation](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html)) has the convenient `header` parameter that we can set to use the elements in row 1 as the appropriate columns:
+
+```{python}
+#| code-fold: false
+tb_df = pd.read_csv("data/cdc_tuberculosis.csv", header=1) # row index
+tb_df.head(5)
+```
+
+Wait...but now we can't differentiate between the "Number of TB cases" and "TB incidence" year columns. `pandas` has tried to make our lives easier by automatically adding ".1" to the latter columns, but this doesn't help us, as humans, understand the data.
+
+We can fix this by renaming the columns manually with `df.rename()` ([documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.rename.html?highlight=rename#pandas.DataFrame.rename)):
+
+```{python}
+#| code-fold: false
+rename_dict = {'2019': 'TB cases 2019',
+ '2020': 'TB cases 2020',
+ '2021': 'TB cases 2021',
+ '2019.1': 'TB incidence 2019',
+ '2020.1': 'TB incidence 2020',
+ '2021.1': 'TB incidence 2021'}
+tb_df = tb_df.rename(columns=rename_dict)
+tb_df.head(5)
+```
+
+## Record Granularity
+
+You might already be wondering: what's up with that first record?
+
+Row 0 is what we call a **rollup record**, or summary record. It's often useful when displaying tables to humans. The **granularity** of record 0 (Totals) vs the rest of the records (States) is different.
+
+Okay, EDA step two. How was the rollup record aggregated?
+
+Let's check if the Total TB cases value is the sum of all state TB cases. If we sum over all rows (including the rollup record), we should get **2x** the total cases in each of our TB cases by year columns (why do you think this is?).
+
+```{python}
+#| code-fold: true
+tb_df.sum(axis=0)
+```
+
+Whoa, what's going on with the TB cases in 2019, 2020, and 2021? Check out the column types:
+
+```{python}
+#| code-fold: true
+tb_df.dtypes
+```
+
+Since there are commas in the values for TB cases, the numbers are read as the `object` datatype, or **storage type** (close to the `Python` string datatype), so `pandas` is concatenating strings instead of adding integers (recall that `Python` can "sum", or concatenate, strings together: `"data" + "100"` evaluates to `"data100"`).
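+
+A quick sketch of this behavior on a toy `Series` of comma-formatted strings:
+
+```python
+import pandas as pd
+
+cases = pd.Series(["8,900", "1,550"])                  # object (string) dtype
+print(cases.sum())                                     # '8,9001,550' -- concatenation!
+
+# Stripping the commas and casting to int gives the numeric sum instead
+print(cases.str.replace(",", "").astype(int).sum())    # 10450
+```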
+
+
+Fortunately `read_csv` also has a `thousands` parameter ([documentation](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html)):
+
+```{python}
+#| code-fold: false
+# improve readability: chaining method calls with outer parentheses/line breaks
+tb_df = (
+ pd.read_csv("data/cdc_tuberculosis.csv", header=1, thousands=',')
+ .rename(columns=rename_dict)
+)
+tb_df.head(5)
+```
+
+```{python}
+#| code-fold: false
+tb_df.sum()
+```
+
+The Total TB cases look right. Phew!
+
+Let's just look at the records with **state-level granularity**:
+
+```{python}
+#| code-fold: true
+state_tb_df = tb_df[1:]
+state_tb_df.head(5)
+```
+
+## Gather Census Data
+
+U.S. Census population estimates [source](https://www.census.gov/data/tables/time-series/demo/popest/2010s-state-total.html) (2019), [source](https://www.census.gov/data/tables/time-series/demo/popest/2020s-state-total.html) (2020-2021).
+
+Running the below cells cleans the data.
+There are a few new methods here:
+
+* `df.convert_dtypes()` ([documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.convert_dtypes.html)) conveniently converts each column to the best possible dtype (for example, float columns holding whole numbers become ints); the details are out of scope for the class.
+* `df.dropna()` ([documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.dropna.html)) will be explained in more detail next time.
+
+```{python}
+#| code-fold: true
+# 2010s census data
+census_2010s_df = pd.read_csv("data/nst-est2019-01.csv", header=3, thousands=",")
+census_2010s_df = (
+ census_2010s_df
+ .reset_index()
+ .drop(columns=["index", "Census", "Estimates Base"])
+ .rename(columns={"Unnamed: 0": "Geographic Area"})
+ .convert_dtypes() # "smart" converting of columns, use at your own risk
+ .dropna() # we'll introduce this next time
+)
+census_2010s_df['Geographic Area'] = census_2010s_df['Geographic Area'].str.strip('.')
+
+# with pd.option_context('display.min_rows', 30): # shows more rows
+# display(census_2010s_df)
+
+census_2010s_df.head(5)
+```
+
+Occasionally, you will want to modify code that you have imported. To reimport those modifications you can either use `python`'s `importlib` library:
+
+```python
+from importlib import reload
+reload(utils)
+```
+
+or use `iPython` magic which will intelligently import code when files change:
+
+```python
+%load_ext autoreload
+%autoreload 2
+```
+
+```{python}
+#| code-fold: true
+# census 2020s data
+census_2020s_df = pd.read_csv("data/NST-EST2022-POP.csv", header=3, thousands=",")
+census_2020s_df = (
+ census_2020s_df
+ .reset_index()
+ .drop(columns=["index", "Unnamed: 1"])
+ .rename(columns={"Unnamed: 0": "Geographic Area"})
+ .convert_dtypes() # "smart" converting of columns, use at your own risk
+ .dropna() # we'll introduce this next time
+)
+census_2020s_df['Geographic Area'] = census_2020s_df['Geographic Area'].str.strip('.')
+
+census_2020s_df.head(5)
+```
+
+## Joining Data (Merging `DataFrame`s)
+
+Time to `merge`! Here we use the `DataFrame` method `df1.merge(right=df2, ...)` on `DataFrame` `df1` ([documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.merge.html)). Contrast this with the function `pd.merge(left=df1, right=df2, ...)` ([documentation](https://pandas.pydata.org/docs/reference/api/pandas.merge.html?highlight=pandas%20merge#pandas.merge)). Feel free to use either.
+
+```{python}
+#| code-fold: false
+# merge TB DataFrame with two US census DataFrames
+tb_census_df = (
+ tb_df
+ .merge(right=census_2010s_df,
+ left_on="U.S. jurisdiction", right_on="Geographic Area")
+ .merge(right=census_2020s_df,
+ left_on="U.S. jurisdiction", right_on="Geographic Area")
+)
+tb_census_df.head(5)
+```
+
+Having all of these columns is a little unwieldy. We could either drop the unneeded columns now, or just merge on smaller census `DataFrame`s. Let's do the latter.
+
+```{python}
+#| code-fold: false
+# try merging again, but cleaner this time
+tb_census_df = (
+ tb_df
+ .merge(right=census_2010s_df[["Geographic Area", "2019"]],
+ left_on="U.S. jurisdiction", right_on="Geographic Area")
+ .drop(columns="Geographic Area")
+ .merge(right=census_2020s_df[["Geographic Area", "2020", "2021"]],
+ left_on="U.S. jurisdiction", right_on="Geographic Area")
+ .drop(columns="Geographic Area")
+)
+tb_census_df.head(5)
+```
+
+## Reproducing Data: Compute Incidence
+
+Let's recompute incidence to make sure we know where the original CDC numbers came from.
+
+From the [CDC report](https://www.cdc.gov/mmwr/volumes/71/wr/mm7112a1.htm?s_cid=mm7112a1_w#T1_down): TB incidence is computed as “Cases per 100,000 persons using mid-year population estimates from the U.S. Census Bureau.”
+
+If we define a group as 100,000 people, then we can compute the TB incidence for a given state population as
+
+$$\text{TB incidence} = \frac{\text{TB cases in population}}{\text{groups in population}} = \frac{\text{TB cases in population}}{\text{population}/100000} $$
+
+$$= \frac{\text{TB cases in population}}{\text{population}} \times 100000$$
+
+Let's try this for 2019:
+
+```{python}
+#| code-fold: false
+tb_census_df["recompute incidence 2019"] = tb_census_df["TB cases 2019"]/tb_census_df["2019"]*100000
+tb_census_df.head(5)
+```
+
+Awesome!!!
+
+Let's use a for-loop and `Python` format strings to compute TB incidence for all years. `Python` f-strings are just used for the purposes of this demo, but they're handy to know when you explore data beyond this course ([documentation](https://docs.python.org/3/tutorial/inputoutput.html)).
+
+```{python}
+#| code-fold: false
+# recompute incidence for all years
+for year in [2019, 2020, 2021]:
+ tb_census_df[f"recompute incidence {year}"] = tb_census_df[f"TB cases {year}"]/tb_census_df[f"{year}"]*100000
+tb_census_df.head(5)
+```
+
+These numbers look pretty close!!! There are a few errors in the hundredths place, particularly in 2021. It may be useful to further explore reasons behind this discrepancy.
+
+```{python}
+#| code-fold: false
+tb_census_df.describe()
+```
+
+## Bonus EDA: Reproducing the Reported Statistic
+
+
+**How do we reproduce that reported statistic in the original [CDC report](https://www.cdc.gov/mmwr/volumes/71/wr/mm7112a1.htm?s_cid=mm7112a1_w)?**
+
+> Reported TB incidence (cases per 100,000 persons) increased **9.4%**, from **2.2** during 2020 to **2.4** during 2021 but was lower than incidence during 2019 (2.7). Increases occurred among both U.S.-born and non–U.S.-born persons.
+
+This is TB incidence computed across the entire U.S. population! How do we reproduce this?
+
+* We need to reproduce the "Total" TB incidences in our rolled record.
+* But our current `tb_census_df` only has 51 entries (50 states plus Washington, D.C.). There is no rolled record.
+* What happened...?
+
+Let's get exploring!
+
+Before we keep exploring, we'll set all indexes to more meaningful values, instead of just numbers that pertain to some row at some point. This will make our cleaning slightly easier.
+
+```{python}
+#| code-fold: true
+tb_df = tb_df.set_index("U.S. jurisdiction")
+tb_df.head(5)
+```
+
+```{python}
+#| code-fold: false
+census_2010s_df = census_2010s_df.set_index("Geographic Area")
+census_2010s_df.head(5)
+```
+
+```{python}
+#| code-fold: false
+census_2020s_df = census_2020s_df.set_index("Geographic Area")
+census_2020s_df.head(5)
+```
+
+It turns out that our merge above only kept state records, even though our original `tb_df` had the "Total" rolled record:
+
+```{python}
+#| code-fold: false
+tb_df.head()
+```
+
+Recall that `merge` performs an **inner** merge by default, meaning that it only preserves keys that are present in **both** `DataFrame`s.
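+
+To see why rows can silently disappear, here is a tiny sketch with made-up tables contrasting the default inner merge with an outer merge. The rows with `NaN`s in the outer merge show exactly which keys failed to match.
+
+```python
+import pandas as pd
+
+left = pd.DataFrame({"key": ["CA", "WA", "Total"], "cases": [1, 2, 3]})
+right = pd.DataFrame({"key": ["CA", "WA", "United States"], "pop": [10, 20, 30]})
+
+# Inner merge (the default): only keys present in BOTH tables survive,
+# so the "Total" and "United States" rows are dropped
+print(left.merge(right, on="key"))
+
+# Outer merge keeps everything, filling the gaps with NaN
+print(left.merge(right, on="key", how="outer"))
+```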
+
+The rolled records in our census `DataFrame` have different `Geographic Area` fields, which was the key we merged on:
+
+```{python}
+#| code-fold: false
+census_2010s_df.head(5)
+```
+
+The Census `DataFrame` has several rolled records. The aggregate record we are looking for actually has the Geographic Area named "United States".
+
+One straightforward way to get the right merge is to rename the value itself. Because we now have the Geographic Area index, we'll use `df.rename()` ([documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.rename.html)):
+
+```{python}
+#| code-fold: false
+# rename rolled record for 2010s
+census_2010s_df.rename(index={'United States':'Total'}, inplace=True)
+census_2010s_df.head(5)
+```
+
+```{python}
+#| code-fold: false
+# same, but for 2020s rename rolled record
+census_2020s_df.rename(index={'United States':'Total'}, inplace=True)
+census_2020s_df.head(5)
+```
+
+<br/>
+
+Next let's rerun our merge. Note the different chaining, because we are now merging on indexes (`df.merge()` [documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.merge.html)).
+
+```{python}
+#| code-fold: false
+tb_census_df = (
+ tb_df
+ .merge(right=census_2010s_df[["2019"]],
+ left_index=True, right_index=True)
+ .merge(right=census_2020s_df[["2020", "2021"]],
+ left_index=True, right_index=True)
+)
+tb_census_df.head(5)
+```
+
+<br/>
+
+Finally, let's recompute our incidences:
+
+```{python}
+#| code-fold: false
+# recompute incidence for all years
+for year in [2019, 2020, 2021]:
+ tb_census_df[f"recompute incidence {year}"] = tb_census_df[f"TB cases {year}"]/tb_census_df[f"{year}"]*100000
+tb_census_df.head(5)
+```
+
+We reproduced the total U.S. incidences correctly!
+
+We're almost there. Let's revisit the quote:
+
+> Reported TB incidence (cases per 100,000 persons) increased **9.4%**, from **2.2** during 2020 to **2.4** during 2021 but was lower than incidence during 2019 (2.7). Increases occurred among both U.S.-born and non–U.S.-born persons.
+
+Recall that percent change from $A$ to $B$ is computed as
+$\text{percent change} = \frac{B - A}{A} \times 100$.
+
+```{python}
+#| code-fold: false
+#| tags: []
+incidence_2020 = tb_census_df.loc['Total', 'recompute incidence 2020']
+incidence_2020
+```
+
+```{python}
+#| code-fold: false
+#| tags: []
+incidence_2021 = tb_census_df.loc['Total', 'recompute incidence 2021']
+incidence_2021
+```
+
+```{python}
+#| code-fold: false
+#| tags: []
+difference = (incidence_2021 - incidence_2020)/incidence_2020 * 100
+difference
+```
+
+# EDA Demo 2: Mauna Loa CO<sub>2</sub> Data -- A Lesson in Data Faithfulness
+
+[Mauna Loa Observatory](https://gml.noaa.gov/ccgg/trends/data.html) has been monitoring CO<sub>2</sub> concentrations since 1958.
+
+```{python}
+#| code-fold: false
+co2_file = "data/co2_mm_mlo.txt"
+```
+
+Let's do some **EDA**!!
+
+## Reading this file into Pandas?
+Rather than loading the file into `pandas` right away, let's check out this `.txt` file. Some questions to keep in mind: Do we trust this file extension? How is the file structured?
+
+Lines 71-78 (inclusive) are shown below:
+
+ line number | file contents
+
+ 71 | # decimal average interpolated trend #days
+ 72 | # date (season corr)
+ 73 | 1958 3 1958.208 315.71 315.71 314.62 -1
+ 74 | 1958 4 1958.292 317.45 317.45 315.29 -1
+ 75 | 1958 5 1958.375 317.50 317.50 314.71 -1
+ 76 | 1958 6 1958.458 -99.99 317.10 314.85 -1
+ 77 | 1958 7 1958.542 315.86 315.86 314.98 -1
+ 78 | 1958 8 1958.625 314.93 314.93 315.94 -1
+
+
+Notice how:
+
+- The values are separated by white space, possibly tabs.
+- The values line up in consistent positions down the rows. For example, the month appears in the 7th to 8th position of each line.
+- The 71st and 72nd lines in the file contain column headings split over two lines.
+
+We can use `read_csv` to read the data into a `pandas` `DataFrame`, and we provide several arguments to specify that the separators are white space, there is no header (**we will set our own column names**), and to skip the first 72 rows of the file.
+
+```{python}
+#| code-fold: false
+co2 = pd.read_csv(
+ co2_file, header = None, skiprows = 72,
+    sep = r'\s+' # delimiter for continuous whitespace (stay tuned for regex next lecture)
+)
+co2.head()
+```
+
+Congratulations! You've wrangled the data!
+
+<br/>
+
+...But our columns aren't named.
+**We need to do more EDA.**
+
+## Exploring Variable Feature Types
+
+The NOAA [webpage](https://gml.noaa.gov/ccgg/trends/) might have some useful tidbits (in this case it doesn't).
+
+Using the column headings we saw on lines 71 and 72 of the file, we'll rerun `pd.read_csv`, but this time with some **custom column names.**
+
+```{python}
+#| code-fold: false
+co2 = pd.read_csv(
+ co2_file, header = None, skiprows = 72,
+    sep = r'\s+', # regex for continuous whitespace (next lecture)
+ names = ['Yr', 'Mo', 'DecDate', 'Avg', 'Int', 'Trend', 'Days']
+)
+co2.head()
+```
+
+## Visualizing CO<sub>2</sub>
+Scientific studies tend to have very clean data, right...? Let's jump right in and make a time series plot of CO2 monthly averages.
+
+```{python}
+#| code-fold: true
+sns.lineplot(x='DecDate', y='Avg', data=co2);
+```
+
+The code above uses the `seaborn` plotting library (abbreviated `sns`). We will cover this in the Visualization lecture; for now, you don't need to worry about how it works!
+
+Yikes! Plotting the data uncovered a problem. The sharp vertical lines suggest that we have some **missing values**. What happened here?
+
+```{python}
+#| code-fold: false
+co2.head()
+```
+
+```{python}
+#| code-fold: false
+co2.tail()
+```
+
+Some data have unusual values like -1 and -99.99.
+
+Let's check the description at the top of the file again.
+
+* -1 signifies a missing value for the number of days `Days` the equipment was in operation that month.
+* -99.99 denotes a missing monthly average `Avg`.
+
+How can we fix this? First, let's explore other aspects of our data. Understanding our data will help us decide what to do with the missing values.
+
+<br/>
+
+
+## Sanity Checks: Reasoning about the data
+First, we consider the shape of the data. How many rows should we have?
+
+* If the data is in chronological order, we should have one record per month.
+* The data runs from March 1958 to August 2019.
+* So we should have $ 12 \times (2019 - 1958 + 1) - 2 - 4 = 738 $ records: 62 years of months, minus January and February of 1958, minus September through December of 2019 (checked in the quick calculation below).
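+
+As a quick sanity check of that arithmetic:
+
+```python
+# 62 years of months, minus Jan/Feb 1958 and Sep-Dec 2019
+expected = 12 * (2019 - 1958 + 1) - 2 - 4
+print(expected)  # 738
+```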
+
+```{python}
+#| code-fold: false
+co2.shape
+```
+
+Nice!! The number of rows (i.e. records) matches our expectations.
+
+<br/>
+
+
+Let's now check the quality of each feature.
+
+## Understanding Missing Value 1: `Days`
+`Days` is a time field, so let's analyze other time fields to see if there is an explanation for missing values of days of operation.
+
+Let's start with **months**, `Mo`.
+
+Are we missing any records? Each month should appear 61 or 62 times (March 1958-August 2019).
+
+```{python}
+#| code-fold: false
+co2["Mo"].value_counts().sort_index()
+```
+
+As expected, Jan, Feb, Sep, Oct, Nov, and Dec have 61 occurrences, and the rest have 62.
+
+<br/>
+
+Next let's explore **days** `Days` itself, which is the number of days that the measurement equipment worked.
+
+```{python}
+#| code-fold: true
+sns.displot(co2['Days']);
+plt.title("Distribution of days feature"); # suppresses unneeded plotting output
+```
+
+In terms of data quality, a handful of months have averages based on measurements taken on fewer than half the days. In addition, there are nearly 200 missing values--**that's about 27% of the data**!
+
+<br/>
+
+Finally, let's check the last time feature, **year** `Yr`.
+
+Let's check to see if there is any connection between missing-ness and the year of the recording.
+
+```{python}
+#| code-fold: true
+sns.scatterplot(x="Yr", y="Days", data=co2);
+plt.title("Day field by Year"); # the ; suppresses output
+```
+
+**Observations**:
+
+* All of the missing data are in the early years of operation.
+* It appears there may have been problems with equipment in the mid to late 80s.
+
+**Potential Next Steps**:
+
+* Confirm these explanations through documentation about the historical readings.
+* Maybe drop the earliest recordings? However, we would want to delay such action until after we have examined the time trends and assessed whether there are any potential problems.
+
+<br/>
+
+## Understanding Missing Value 2: `Avg`
+Next, let's return to the -99.99 values in `Avg` to analyze the overall quality of the CO<sub>2</sub> measurements. We'll plot a histogram of the average CO<sub>2</sub> measurements.
+
+```{python}
+#| code-fold: true
+# Histograms of average CO2 measurements
+sns.displot(co2['Avg']);
+```
+
+The non-missing values are in the 300-400 range (a regular range of CO2 levels).
+
+We also see that there are only a few missing `Avg` values (**<1% of values**). Let's examine all of them:
+
+```{python}
+#| code-fold: false
+co2[co2["Avg"] < 0]
+```
+
+There doesn't seem to be a pattern to these values, other than that most records also were missing `Days` data.
+
+## Drop, `NaN`, or Impute Missing `Avg` Data?
+
+How should we address the invalid `Avg` data?
+
+1. Drop records
+2. Set to NaN
+3. Impute using some strategy
+
+Remember we want to fix the following plot:
+
+```{python}
+#| code-fold: true
+sns.lineplot(x='DecDate', y='Avg', data=co2)
+plt.title("CO2 Average By Month");
+```
+
+Since we are plotting `Avg` vs `DecDate`, we should just focus on dealing with missing values for `Avg`.
+
+
+Let's consider a few options:
+
+1. Drop those records
+2. Replace -99.99 with NaN
+3. Substitute -99.99 with a likely value for the average CO<sub>2</sub>
+
+What do you think are the pros and cons of each possible action?
+
+<br/>
+
+
+Let's examine each of these three options.
+
+```{python}
+#| code-fold: false
+# 1. Drop missing values
+co2_drop = co2[co2['Avg'] > 0]
+co2_drop.head()
+```
+
+```{python}
+#| code-fold: false
+# 2. Replace -99.99 with NaN
+co2_NA = co2.replace(-99.99, np.nan)
+co2_NA.head()
+```
+
+We'll also use a third version of the data.
+
+First, we note that the dataset already comes with a **substitute value** for the -99.99.
+
+From the file description:
+
+> The `interpolated` column includes average values from the preceding column (`average`)
+and **interpolated values** where data are missing. Interpolated values are
+computed in two steps...
+
+The `Int` feature has values that exactly match those in `Avg`, except when `Avg` is -99.99, and then a **reasonable** estimate is used instead.
+
+So, the third version of our data will use the `Int` feature instead of `Avg`.
+
+```{python}
+#| code-fold: false
+# 3. Use interpolated column which estimates missing Avg values
+co2_impute = co2.copy()
+co2_impute['Avg'] = co2['Int']
+co2_impute.head()
+```
+
+What's a **reasonable** estimate?
+
+To answer this question, let's zoom in on a short time period, say the measurements in 1958 (where we know we have two missing values).
+
+```{python}
+#| code-fold: true
+# results of plotting data in 1958
+
+def line_and_points(data, ax, title):
+ # assumes single year, hence Mo
+ ax.plot('Mo', 'Avg', data=data)
+ ax.scatter('Mo', 'Avg', data=data)
+ ax.set_xlim(2, 13)
+ ax.set_title(title)
+ ax.set_xticks(np.arange(3, 13))
+
+def data_year(data, year):
+    return data[data["Yr"] == year]
+
+# uses matplotlib subplots
+# you may see more next week; focus on output for now
+fig, axes = plt.subplots(ncols = 3, figsize=(12, 4), sharey=True)
+
+year = 1958
+line_and_points(data_year(co2_drop, year), axes[0], title="1. Drop Missing")
+line_and_points(data_year(co2_NA, year), axes[1], title="2. Missing Set to NaN")
+line_and_points(data_year(co2_impute, year), axes[2], title="3. Missing Interpolated")
+
+fig.suptitle(f"Monthly Averages for {year}")
+plt.tight_layout()
+```
+
+In the big picture, since there are only 7 `Avg` values missing (**<1%** of 738 months), any of these approaches would work.
+
+However, there is some appeal to **option 3: imputing**:
+
+* Shows seasonal trends for CO2
+* We are plotting all months in our data as a line plot
+
+<br/>
+
+
+Let's replot our original figure with option 3:
+
+```{python}
+#| code-fold: true
+sns.lineplot(x='DecDate', y='Avg', data=co2_impute)
+plt.title("CO2 Average By Month, Imputed");
+```
+
+Looks pretty close to what we see on the NOAA [website](https://gml.noaa.gov/ccgg/trends/)!
+
+## Presenting the data: A Discussion on Data Granularity
+
+From the description:
+
+* Monthly measurements are averages of daily average measurements.
+* The NOAA GML website has datasets for daily/hourly measurements too.
+
+The data you present depends on your research question.
+
+**How do CO2 levels vary by season?**
+
+* You might want to keep average monthly data.
+
+**Are CO2 levels rising over the past 50+ years, consistent with global warming predictions?**
+
+* You might be happier with a **coarser granularity** of average year data!
+
+```{python}
+#| code-fold: true
+co2_year = co2_impute.groupby('Yr').mean()
+sns.lineplot(x='Yr', y='Avg', data=co2_year)
+plt.title("CO2 Average By Year");
+```
+
+Indeed, we see a rise by nearly 100 ppm of CO2 since Mauna Loa began recording in 1958.
+
+# Summary
+We went over a lot of content this lecture; let's summarize the most important points:
+
+## Dealing with Missing Values
+There are a few options we can take to deal with missing data:
+
+* Drop missing records
+* Keep `NaN` missing values
+* Impute using an interpolated column
+
+## EDA and Data Wrangling
+There are several ways to approach EDA and Data Wrangling:
+
+* Examine the **data and metadata**: what is the date, size, organization, and structure of the data?
+* Examine each **field/attribute/dimension** individually.
+* Examine pairs of related dimensions (e.g. breaking down grades by major).
+* Along the way, we can:
+ * **Visualize** or summarize the data.
+ * **Validate assumptions** about data and its collection process. Pay particular attention to when the data was collected.
+ * Identify and **address anomalies**.
+ * Apply data transformations and corrections (we'll cover this in the upcoming lecture).
+ * **Record everything you do!** Developing in Jupyter Notebook promotes *reproducibility* of your own work!
diff --git a/docs/eda/eda_files/figure-html/cell-62-output-1.png b/docs/eda/eda_files/figure-html/cell-62-output-1.png
index a04218cf..f392d5f9 100644
Binary files a/docs/eda/eda_files/figure-html/cell-62-output-1.png and b/docs/eda/eda_files/figure-html/cell-62-output-1.png differ
diff --git a/docs/eda/eda_files/figure-html/cell-67-output-1.png b/docs/eda/eda_files/figure-html/cell-67-output-1.png
new file mode 100644
index 00000000..be96b8c9
Binary files /dev/null and b/docs/eda/eda_files/figure-html/cell-67-output-1.png differ
diff --git a/docs/eda/eda_files/figure-html/cell-67-output-2.png b/docs/eda/eda_files/figure-html/cell-67-output-2.png
deleted file mode 100644
index 31857f62..00000000
Binary files a/docs/eda/eda_files/figure-html/cell-67-output-2.png and /dev/null differ
diff --git a/docs/eda/eda_files/figure-html/cell-68-output-1.png b/docs/eda/eda_files/figure-html/cell-68-output-1.png
index 67c3959d..ffd29ff8 100644
Binary files a/docs/eda/eda_files/figure-html/cell-68-output-1.png and b/docs/eda/eda_files/figure-html/cell-68-output-1.png differ
diff --git a/docs/eda/eda_files/figure-html/cell-69-output-1.png b/docs/eda/eda_files/figure-html/cell-69-output-1.png
new file mode 100644
index 00000000..29088928
Binary files /dev/null and b/docs/eda/eda_files/figure-html/cell-69-output-1.png differ
diff --git a/docs/eda/eda_files/figure-html/cell-69-output-2.png b/docs/eda/eda_files/figure-html/cell-69-output-2.png
deleted file mode 100644
index fb28f5d5..00000000
Binary files a/docs/eda/eda_files/figure-html/cell-69-output-2.png and /dev/null differ
diff --git a/docs/eda/eda_files/figure-html/cell-71-output-1.png b/docs/eda/eda_files/figure-html/cell-71-output-1.png
index 39cac822..49ef3d6a 100644
Binary files a/docs/eda/eda_files/figure-html/cell-71-output-1.png and b/docs/eda/eda_files/figure-html/cell-71-output-1.png differ
diff --git a/docs/eda/eda_files/figure-html/cell-75-output-1.png b/docs/eda/eda_files/figure-html/cell-75-output-1.png
index 6382e58a..15a5fe82 100644
Binary files a/docs/eda/eda_files/figure-html/cell-75-output-1.png and b/docs/eda/eda_files/figure-html/cell-75-output-1.png differ
diff --git a/docs/eda/eda_files/figure-html/cell-76-output-1.png b/docs/eda/eda_files/figure-html/cell-76-output-1.png
index db2b0dee..40b1fc71 100644
Binary files a/docs/eda/eda_files/figure-html/cell-76-output-1.png and b/docs/eda/eda_files/figure-html/cell-76-output-1.png differ
diff --git a/docs/eda/eda_files/figure-html/cell-77-output-1.png b/docs/eda/eda_files/figure-html/cell-77-output-1.png
index 897b8b39..99b6c2d1 100644
Binary files a/docs/eda/eda_files/figure-html/cell-77-output-1.png and b/docs/eda/eda_files/figure-html/cell-77-output-1.png differ
diff --git a/docs/feature_engineering/feature_engineering.html b/docs/feature_engineering/feature_engineering.html
index ea770e7f..22d26788 100644
--- a/docs/feature_engineering/feature_engineering.html
+++ b/docs/feature_engineering/feature_engineering.html
@@ -556,7 +556,7 @@
my_model.fit(X, Y)
-LinearRegression()
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.LinearRegression()
+LinearRegression()
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.LinearRegression()
Notice that we use double brackets to extract this column. Why double brackets instead of just single brackets? The .fit
method, by default, expects to receive 2-dimensional data – some kind of data that includes both rows and columns. Writing penguins["flipper_length_mm"]
would return a 1D Series
, causing sklearn
to error. We avoid this by writing penguins[["flipper_length_mm"]]
to produce a 2D DataFrame
.
@@ -607,7 +607,7 @@
print(f"The RMSE of the model is {np.sqrt(np.mean((Y-Y_hat_two_features)**2))}")
-The RMSE of the model is 0.9881331104079044
+The RMSE of the model is 0.9881331104079045
We can also see that we obtain the same predictions using sklearn
as we did when applying the ordinary least squares formula before!
@@ -977,7 +977,7 @@
print(f"MSE of model with (hp^2) feature: {np.mean((Y-hp2_model_predictions)**2)}")
-MSE of model with (hp^2) feature: 18.984768907617223
+MSE of model with (hp^2) feature: 18.984768907617216
diff --git a/docs/feature_engineering/feature_engineering_files/figure-html/cell-16-output-2.png b/docs/feature_engineering/feature_engineering_files/figure-html/cell-16-output-2.png
index 92cb01c9..f8396667 100644
Binary files a/docs/feature_engineering/feature_engineering_files/figure-html/cell-16-output-2.png and b/docs/feature_engineering/feature_engineering_files/figure-html/cell-16-output-2.png differ
diff --git a/docs/feature_engineering/feature_engineering_files/figure-html/cell-17-output-2.png b/docs/feature_engineering/feature_engineering_files/figure-html/cell-17-output-2.png
index f4ae4ea0..ceecd30f 100644
Binary files a/docs/feature_engineering/feature_engineering_files/figure-html/cell-17-output-2.png and b/docs/feature_engineering/feature_engineering_files/figure-html/cell-17-output-2.png differ
diff --git a/docs/gradient_descent/gradient_descent.html b/docs/gradient_descent/gradient_descent.html
index 467ee5fb..ed238d2c 100644
--- a/docs/gradient_descent/gradient_descent.html
+++ b/docs/gradient_descent/gradient_descent.html
@@ -106,7 +106,7 @@
require.undef("plotly");
requirejs.config({
paths: {
- 'plotly': ['https://cdn.plot.ly/plotly-2.25.2.min']
+ 'plotly': ['https://cdn.plot.ly/plotly-2.12.1.min']
}
});
require(['plotly'], function(Plotly) {
@@ -439,9 +439,9 @@
-
Code
- +
-
+
@@ -4395,9 +4383,9 @@
-
+
@@ -4481,10 +4469,10 @@
-# 3. Use interpolated column which estimates missing Avg values
-co2_impute = co2.copy()
-co2_impute['Avg'] = co2['Int']
-co2_impute.head()
+# 3. Use interpolated column which estimates missing Avg values
+co2_impute = co2.copy()
+co2_impute['Avg'] = co2['Int']
+co2_impute.head()
@@ -4564,30 +4552,30 @@
Code
-# results of plotting data in 1958
-
-def line_and_points(data, ax, title):
- # assumes single year, hence Mo
- ax.plot('Mo', 'Avg', data=data)
- ax.scatter('Mo', 'Avg', data=data)
- ax.set_xlim(2, 13)
- ax.set_title(title)
- ax.set_xticks(np.arange(3, 13))
-
-def data_year(data, year):
- return data[data["Yr"] == 1958]
-
-# uses matplotlib subplots
-# you may see more next week; focus on output for now
-fig, axes = plt.subplots(ncols = 3, figsize=(12, 4), sharey=True)
-
-year = 1958
-line_and_points(data_year(co2_drop, year), axes[0], title="1. Drop Missing")
-line_and_points(data_year(co2_NA, year), axes[1], title="2. Missing Set to NaN")
-line_and_points(data_year(co2_impute, year), axes[2], title="3. Missing Interpolated")
-
-fig.suptitle(f"Monthly Averages for {year}")
-plt.tight_layout()
+# results of plotting data in 1958
+
+def line_and_points(data, ax, title):
+ # assumes single year, hence Mo
+ ax.plot('Mo', 'Avg', data=data)
+ ax.scatter('Mo', 'Avg', data=data)
+ ax.set_xlim(2, 13)
+ ax.set_title(title)
+ ax.set_xticks(np.arange(3, 13))
+
+def data_year(data, year):
+ return data[data["Yr"] == 1958]
+
+# uses matplotlib subplots
+# you may see more next week; focus on output for now
+fig, axes = plt.subplots(ncols = 3, figsize=(12, 4), sharey=True)
+
+year = 1958
+line_and_points(data_year(co2_drop, year), axes[0], title="1. Drop Missing")
+line_and_points(data_year(co2_NA, year), axes[1], title="2. Missing Set to NaN")
+line_and_points(data_year(co2_impute, year), axes[2], title="3. Missing Interpolated")
+
+fig.suptitle(f"Monthly Averages for {year}")
+plt.tight_layout()
@@ -4604,8 +4592,8 @@
Code
-
+
@@ -4632,9 +4620,9 @@
Code
-
+
@@ -4975,1218 +4963,1218 @@ <
Source Code
----
-title: Data Cleaning and EDA
-execute:
- echo: true
-format:
- html:
- code-fold: true
- code-tools: true
- toc: true
- toc-title: Data Cleaning and EDA
- page-layout: full
- theme:
- - cosmo
- - cerulean
- callout-icon: false
-jupyter: python3
----
-
-```{python}
-#| code-fold: true
-import numpy as np
-import pandas as pd
-
-import matplotlib.pyplot as plt
-import seaborn as sns
-#%matplotlib inline
-plt.rcParams['figure.figsize'] = (12, 9)
-
-sns.set()
-sns.set_context('talk')
-np.set_printoptions(threshold=20, precision=2, suppress=True)
-pd.set_option('display.max_rows', 30)
-pd.set_option('display.max_columns', None)
-pd.set_option('display.precision', 2)
-# This option stops scientific notation for pandas
-pd.set_option('display.float_format', '{:.2f}'.format)
-
-# Silence some spurious seaborn warnings
-import warnings
-warnings.filterwarnings("ignore", category=FutureWarning)
-```
-
-::: {.callout-note collapse="false"}
-## Learning Outcomes
-* Recognize common file formats
-* Categorize data by its variable type
-* Build awareness of issues with data faithfulness and develop targeted solutions
-:::
-
-**This content is covered in lectures 4, 5, and 6.**
-
-In the past few lectures, we've learned that `pandas` is a toolkit to restructure, modify, and explore a dataset. What we haven't yet touched on is *how* to make these data transformation decisions. When we receive a new set of data from the "real world," how do we know what processing we should do to convert this data into a usable form?
-
-**Data cleaning**, also called **data wrangling**, is the process of transforming raw data to facilitate subsequent analysis. It is often used to address issues like:
-
-* Unclear structure or formatting
-* Missing or corrupted values
-* Unit conversions
-* ...and so on
-
-**Exploratory Data Analysis (EDA)** is the process of understanding a new dataset. It is an open-ended, informal analysis that involves familiarizing ourselves with the variables present in the data, discovering potential hypotheses, and identifying possible issues with the data. This last point can often motivate further data cleaning to address any problems with the dataset's format; because of this, EDA and data cleaning are often thought of as an "infinite loop," with each process driving the other.
-
-In this lecture, we will consider the key properties of data to consider when performing data cleaning and EDA. In doing so, we'll develop a "checklist" of sorts for you to consider when approaching a new dataset. Throughout this process, we'll build a deeper understanding of this early (but very important!) stage of the data science lifecycle.
-
-## Structure
-
-### File Formats
-There are many file types for storing structured data: TSV, JSON, XML, ASCII, SAS, etc. We'll only cover CSV, TSV, and JSON in lecture, but you'll likely encounter other formats as you work with different datasets. Reading documentation is your best bet for understanding how to process the multitude of different file types.
-
-#### CSV
-CSVs, which stand for **Comma-Separated Values**, are a common tabular data format.
-In the past two `pandas` lectures, we briefly touched on the idea of file format: the way data is encoded in a file for storage. Specifically, our `elections` and `babynames` datasets were stored and loaded as CSVs:
-
-```{python}
-#| code-fold: false
-pd.read_csv("data/elections.csv").head(5)
-```
-
-To better understand the properties of a CSV, let's take a look at the first few rows of the raw data file to see what it looks like before being loaded into a `DataFrame`. We'll use the `repr()` function to return the raw string with its special characters:
-
-```{python}
-#| code-fold: false
-with open("data/elections.csv", "r") as table:
- i = 0
- for row in table:
- print(repr(row))
- i += 1
- if i > 3:
- break
-```
-
-Each row, or **record**, in the data is delimited by a newline `\n`. Each column, or **field**, in the data is delimited by a comma `,` (hence, comma-separated!).
-
-#### TSV
-
-Another common file type is **TSV (Tab-Separated Values)**. In a TSV, records are still delimited by a newline `\n`, while fields are delimited by `\t` tab character.
-
-Let's check out the first few rows of the raw TSV file. Again, we'll use the `repr()` function so that `print` shows the special characters.
-
-```{python}
-#| code-fold: false
-with open("data/elections.txt", "r") as table:
- i = 0
- for row in table:
- print(repr(row))
- i += 1
- if i > 3:
- break
-```
-
-TSVs can be loaded into `pandas` using `pd.read_csv`. We'll need to specify the **delimiter** with parameter` sep='\t'` [(documentation)](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html).
-
-```{python}
-#| code-fold: false
-pd.read_csv("data/elections.txt", sep='\t').head(3)
-```
-
-An issue with CSVs and TSVs comes up whenever there are commas or tabs within the records. How does `pandas` differentiate between a comma delimiter vs. a comma within the field itself, for example `8,900`? To remedy this, check out the [`quotechar` parameter](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html).
-
-#### JSON
-**JSON (JavaScript Object Notation)** files behave similarly to Python dictionaries. A raw JSON is shown below.
-
-```{python}
-#| code-fold: false
-with open("data/elections.json", "r") as table:
- i = 0
- for row in table:
- print(row)
- i += 1
- if i > 8:
- break
-```
-
-JSON files can be loaded into `pandas` using `pd.read_json`.
-
-```{python}
-#| code-fold: false
-pd.read_json('data/elections.json').head(3)
-```
-
-##### EDA with JSON: Berkeley COVID-19 Data
-The City of Berkeley Open Data [website](https://data.cityofberkeley.info/Health/COVID-19-Confirmed-Cases/xn6j-b766) has a dataset with COVID-19 Confirmed Cases among Berkeley residents by date. Let's download the file and save it as a JSON (note the source URL file type is also a JSON). In the interest of reproducible data science, we will download the data programatically. We have defined some helper functions in the [`ds100_utils.py`](https://ds100.org/fa23/resources/assets/lectures/lec05/lec05-eda.html) file that we can reuse these helper functions in many different notebooks.
-
-```{python}
-#| code-fold: false
-from ds100_utils import fetch_and_cache
-
-covid_file = fetch_and_cache(
- "https://data.cityofberkeley.info/api/views/xn6j-b766/rows.json?accessType=DOWNLOAD",
- "confirmed-cases.json",
- force=False)
-covid_file # a file path wrapper object
-```
-
-###### File Size
-Let's start our analysis by getting a rough estimate of the size of the dataset to inform the tools we use to view the data. For relatively small datasets, we can use a text editor or spreadsheet. For larger datasets, more programmatic exploration or distributed computing tools may be more fitting. Here we will use `Python` tools to probe the file.
-
-Since the file appears to be plain text, let's investigate the number of lines, which often corresponds to the number of records.
-
-```{python}
-#| code-fold: false
-import os
-
-print(covid_file, "is", os.path.getsize(covid_file) / 1e6, "MB")
-
-with open(covid_file, "r") as f:
- print(covid_file, "is", sum(1 for l in f), "lines.")
-```
-
-###### Unix Commands
-As part of the EDA workflow, Unix commands can come in very handy. In fact, there's an entire book called ["Data Science at the Command Line"](https://datascienceatthecommandline.com/) that explores this idea in depth!
-In Jupyter/IPython, you can prefix lines with `!` to execute arbitrary Unix commands, and within those lines, you can refer to `Python` variables and expressions with the syntax `{expr}`.
-
-Here, we use the `ls` command to list files, using the `-lh` flags, which request "long format with information in human-readable form." We also use the `wc` command for "word count," but with the `-l` flag, which asks for line counts instead of words.
-
-These two give us the same information as the code above, albeit in a slightly different form:
-
-```{python}
-#| code-fold: false
-!ls -lh {covid_file}
-!wc -l {covid_file}
-```
-
-###### File Contents
-Let's explore the data format using `Python`.
-
-```{python}
-#| code-fold: false
-with open(covid_file, "r") as f:
- for i, row in enumerate(f):
- print(repr(row)) # print raw strings
- if i >= 4: break
-```
-
-We can use the `head` Unix command (which is where `pandas`' `head` method comes from!) to see the first few lines of the file:
-
-```{python}
-#| code-fold: false
-!head -5 {covid_file}
-```
-
-In order to load the JSON file into `pandas`, let's first do some EDA with `Python`'s `json` package to understand the particular structure of this JSON file so that we can decide what (if anything) to load into `pandas`. `Python` has relatively good support for JSON data since it closely matches its internal object model. In the following cell, we import the entire JSON datafile into a `Python` dictionary using the `json` package.
-
-```{python}
-#| code-fold: false
-import json
-
-with open(covid_file, "rb") as f:
- covid_json = json.load(f)
-```
-
-The `covid_json` variable is now a dictionary encoding the data in the file:
-
-```{python}
-#| code-fold: false
-type(covid_json)
-```
-
-We can examine what keys are in the top level json object by listing out the keys.
-
-```{python}
-#| code-fold: false
-covid_json.keys()
-```
-
-**Observation**: The JSON dictionary contains a `meta` key, which likely refers to metadata (data about the data). Metadata is often maintained with the data and can be a good source of additional information.
-
-
-We can investigate the metadata further by examining its keys.
-
-```{python}
-#| code-fold: false
-covid_json['meta'].keys()
-```
-
-The `meta` key contains another dictionary called `view`. This likely refers to meta-data about a particular "view" of some underlying database. We will learn more about views when we study SQL later in the class.
-
-```{python}
-#| code-fold: false
-covid_json['meta']['view'].keys()
-```
-
-Notice that this is a nested/recursive data structure. As we dig deeper, we reveal more and more keys and the corresponding data:
-
-```
-meta
-|-> data
- | ... (haven't explored yet)
-|-> view
- | -> id
- | -> name
- | -> attribution
- ...
- | -> description
- ...
- | -> columns
- ...
-```
-
-
-There is a key called `description` in the `view` sub-dictionary. This likely contains a description of the data:
-
-```{python}
-#| code-fold: false
-print(covid_json['meta']['view']['description'])
-```
-
-###### Examining the Data Field for Records
-
-We can look at a few entries in the `data` field. This is what we'll load into `pandas`.
-
-```{python}
-#| code-fold: false
-for i in range(3):
- print(f"{i:03} | {covid_json['data'][i]}")
-```
-
-Observations:
-
-* These look like equal-length records, so maybe `data` is a table!
-* But what does each value in the record mean? Where can we find the column headers?
-
-For that, we'll need the `columns` key in the metadata dictionary. This returns a list:
-
-```{python}
-#| code-fold: false
-type(covid_json['meta']['view']['columns'])
-```
-
-###### Summary of exploring the JSON file
-
-1. The above **metadata** tells us a lot about the columns in the data including column names, potential data anomalies, and a basic statistic.
-1. Because of its non-tabular structure, JSON makes it easier (than CSV) to create **self-documenting data**, meaning that information about the data is stored in the same file as the data.
-1. Self-documenting data can be helpful since it maintains its own description and these descriptions are more likely to be updated as data changes.
-
-###### Loading COVID Data into `pandas`
-Finally, let's load the data (not the metadata) into a `pandas` `DataFrame`. In the following block of code we:
-
-1. Translate the JSON records into a `DataFrame`:
-
- * fields: `covid_json['meta']['view']['columns']`
- * records: `covid_json['data']`
-
-
-1. Remove columns that have no metadata description. This would be a bad idea in general, but here we remove these columns since the above analysis suggests they are unlikely to contain useful information.
-
-1. Examine the `tail` of the table.
-
-```{python}
-#| code-fold: false
-# Load the data from JSON and assign column titles
-covid = pd.DataFrame(
- covid_json['data'],
- columns=[c['name'] for c in covid_json['meta']['view']['columns']])
-
-covid.tail()
-```
-
-### Variable Types
-
-After loading data from a file, it's a good idea to take the time to understand what pieces of information are encoded in the dataset. In particular, we want to identify what variable types are present in our data. Broadly speaking, we can categorize variables into one of two overarching types.
-
-**Quantitative variables** describe some numeric quantity or amount. We can divide quantitative data further into:
-
-* **Continuous quantitative variables**: numeric data that can be measured on a continuous scale to arbitrary precision. Continuous variables do not have a strict set of possible values – they can be recorded to any number of decimal places. For example, weights, GPA, or CO<sub>2</sub> concentrations.
-* **Discrete quantitative variables**: numeric data that can only take on a finite set of possible values. For example, someone's age or the number of siblings they have.
-
-**Qualitative variables**, also known as **categorical variables**, describe data that isn't measuring some quantity or amount. The sub-categories of categorical data are:
-
-* **Ordinal qualitative variables**: categories with ordered levels. Specifically, ordinal variables are those where the difference between levels has no consistent, quantifiable meaning. Some examples include levels of education (high school, undergrad, grad, etc.), income bracket (low, medium, high), or Yelp rating.
-* **Nominal qualitative variables**: categories with no specific order. For example, someone's political affiliation or Cal ID number.
-
-![Classification of variable types](images/variable.png)
-
-Note that many variables don't sit neatly in just one of these categories. Qualitative variables could have numeric levels, and conversely, quantitative variables could be stored as strings.
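-
-As a small, hypothetical example (the column names here are made up for illustration), we can coerce such columns into the types we actually intend:
-
-```python
-import pandas as pd
-
-df = pd.DataFrame({
-    "GPA": ["3.70", "3.20", "4.00"],                         # quantitative, but stored as strings
-    "Major": ["Data Science", "Economics", "Data Science"],  # nominal qualitative
-})
-df["GPA"] = pd.to_numeric(df["GPA"])           # now a float (continuous quantitative) column
-df["Major"] = df["Major"].astype("category")   # an explicitly categorical column
-df.dtypes
-```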
-
-### Primary and Foreign Keys
-
-Last time, we introduced `.merge` as the `pandas` method for joining multiple `DataFrame`s together. In our discussion of joins, we touched on the idea of using a "key" to determine what rows should be merged from each table. Let's take a moment to examine this idea more closely.
-
-The **primary key** is the column or set of columns in a table that *uniquely* determine the values of the remaining columns. It can be thought of as the unique identifier for each individual row in the table. For example, a table of Data 100 students might use each student's Cal ID as the primary key.
-
-```{python}
-#| echo: false
-pd.DataFrame({"Cal ID":[3034619471, 3035619472, 3025619473, 3046789372], \
- "Name":["Oski", "Ollie", "Orrie", "Ollie"], \
- "Major":["Data Science", "Computer Science", "Data Science", "Economics"]})
-```
-
-The **foreign key** is the column or set of columns in a table that reference primary keys in other tables. Knowing a dataset's foreign keys can be useful when assigning the `left_on` and `right_on` parameters of `.merge`. In the table of office hour tickets below, `"Cal ID"` is a foreign key referencing the previous table.
-
-```{python}
-#| echo: false
-pd.DataFrame({"OH Request":[1, 2, 3, 4], \
- "Cal ID":[3034619471, 3035619472, 3025619473, 3035619472], \
- "Question":["HW 2 Q1", "HW 2 Q3", "Lab 3 Q4", "HW 2 Q7"]})
-```
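-
-To make the connection to `.merge` concrete, here is a small sketch (recreating the two toy tables above by hand) that joins the office hour tickets to the student table on the shared key:
-
-```python
-import pandas as pd
-
-students = pd.DataFrame({"Cal ID": [3034619471, 3035619472, 3025619473, 3046789372],
-                         "Name": ["Oski", "Ollie", "Orrie", "Ollie"],
-                         "Major": ["Data Science", "Computer Science", "Data Science", "Economics"]})
-tickets = pd.DataFrame({"OH Request": [1, 2, 3, 4],
-                        "Cal ID": [3034619471, 3035619472, 3025619473, 3035619472],
-                        "Question": ["HW 2 Q1", "HW 2 Q3", "Lab 3 Q4", "HW 2 Q7"]})
-
-# "Cal ID" is the primary key of `students` and a foreign key in `tickets`.
-tickets.merge(students, on="Cal ID", how="left")
-```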
-
-## Granularity, Scope, and Temporality
-
-After understanding the structure of the dataset, the next task is to determine what exactly the data represents. We'll do so by considering the data's granularity, scope, and temporality.
-
-### Granularity
-The **granularity** of a dataset is what a single row represents. You can also think of it as the level of detail included in the data. To determine the data's granularity, ask: what does each row in the dataset represent? Fine-grained data contains a high level of detail, with a single row representing a small individual unit. For example, each record may represent one person. Coarse-grained data is encoded such that a single row represents a large individual unit – for example, each record may represent a group of people.
-
-### Scope
-The **scope** of a dataset is the subset of the population covered by the data. If we were investigating student performance in Data Science courses, a dataset with a narrow scope might encompass all students enrolled in Data 100 whereas a dataset with an expansive scope might encompass all students in California.
-
-### Temporality
-The **temporality** of a dataset describes the periodicity over which the data was collected as well as when the data was most recently collected or updated.
-
-Time and date fields of a dataset could represent a few things:
-
-1. when the "event" happened
-2. when the data was collected, or when it was entered into the system
-3. when the data was copied into the database
-
-To fully understand the temporality of the data, it may also be necessary to standardize time zones or inspect recurring time-based trends in the data (do patterns recur in 24-hour periods? Over the course of a month? Seasonally?). The convention for standardizing time is Coordinated Universal Time (UTC), an international time standard measured at 0 degrees longitude (the prime meridian) that stays consistent throughout the year (no daylight saving time). Berkeley's time zone, Pacific Standard Time (PST), is UTC-8; during daylight saving time it becomes Pacific Daylight Time (PDT), or UTC-7.
-
-#### Temporality with `pandas`' `dt` accessors
-Let's briefly look at how we can use `pandas`' `dt` accessors to work with dates/times in a dataset using the dataset you'll see in Lab 3: the Berkeley PD Calls for Service dataset.
-
-```{python}
-#| code-fold: true
-calls = pd.read_csv("data/Berkeley_PD_-_Calls_for_Service.csv")
-calls.head()
-```
-
-Looks like there are three columns with dates/times: `EVENTDT`, `EVENTTM`, and `InDbDate`.
-
-Most likely, `EVENTDT` stands for the date when the event took place, `EVENTTM` stands for the time of day the event took place (in 24-hour format), and `InDbDate` is the date this call was recorded in the database.
-
-If we check the data type of these columns, we will see they are stored as strings. We can convert them to `datetime` objects using the `pandas` `to_datetime` function.
-
-```{python}
-#| code-fold: false
-calls["EVENTDT"] = pd.to_datetime(calls["EVENTDT"])
-calls.head()
-```
-
-Now, we can use the `dt` accessor on this column.
-
-We can get the month:
-
-```{python}
-#| code-fold: false
-calls["EVENTDT"].dt.month.head()
-```
-
-Which day of the week the date is on:
-
-```{python}
-#| code-fold: false
-calls["EVENTDT"].dt.dayofweek.head()
-```
-
-Check the minimum values to see if there are any suspicious-looking dates from the 1970s (which would hint at default Unix epoch timestamps):
-
-```{python}
-#| code-fold: false
-calls.sort_values("EVENTDT").head()
-```
-
-Doesn't look like it! We are good!
-
-
-We can also do many things with the `dt` accessor like switching time zones and converting time back to UNIX/POSIX time. Check out the documentation on [`.dt` accessor](https://pandas.pydata.org/docs/user_guide/basics.html#basics-dt-accessors) and [time series/date functionality](https://pandas.pydata.org/docs/user_guide/timeseries.html#).
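-
-For instance, here is a short sketch of the conversions mentioned above. It assumes `EVENTDT` has already been parsed with `pd.to_datetime` (as in the cell earlier) and that the timestamps are local Berkeley times:
-
-```python
-# Attach a time zone, convert to UTC, then express the timestamps as POSIX (UNIX) seconds.
-localized = calls["EVENTDT"].dt.tz_localize("US/Pacific")
-as_utc = localized.dt.tz_convert("UTC")
-unix_seconds = (as_utc - pd.Timestamp("1970-01-01", tz="UTC")) // pd.Timedelta("1s")
-unix_seconds.head()
-```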
-
-## Faithfulness
-
-At this stage in our data cleaning and EDA workflow, we've achieved quite a lot: we've identified how our data is structured, come to terms with what information it encodes, and gained insight as to how it was generated. Throughout this process, we should always recall the original intent of our work in Data Science – to use data to better understand and model the real world. To achieve this goal, we need to ensure that the data we use is faithful to reality; that is, that our data accurately captures the "real world."
-
-Data used in research or industry is often "messy" – there may be errors or inaccuracies that impact the faithfulness of the dataset. Signs that data may not be faithful include (a few of these checks are sketched in code after this list):
-
-* Unrealistic or "incorrect" values, such as negative counts, locations that don't exist, or dates set in the future
-* Violations of obvious dependencies, like an age that does not match a birthday
-* Clear signs that data was entered by hand, which can lead to spelling errors or fields that are incorrectly shifted
-* Signs of data falsification, such as fake email addresses or repeated use of the same names
-* Duplicated records or fields containing the same information
-* Truncated data, e.g. older versions of Microsoft Excel limited spreadsheets to 65,536 rows and 256 columns
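-
-A few of these signs can be checked directly in `pandas`. Here is a minimal sketch on a made-up table (the column names are hypothetical):
-
-```python
-import pandas as pd
-
-df = pd.DataFrame({"count": [5, -1, 12],
-                   "date": pd.to_datetime(["2021-01-01", "2030-05-05", "2020-07-04"])})
-
-print((df["count"] < 0).sum())                    # unrealistic negative counts
-print((df["date"] > pd.Timestamp.today()).sum())  # dates set in the future
-print(df.duplicated().sum())                      # exact duplicate records
-```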
-
-We often solve some of these more common issues in the following ways:
-
-* Spelling errors: apply corrections or drop records that aren't in a dictionary
-* Time zone inconsistencies: convert to a common time zone (e.g. UTC)
-* Duplicated records or fields: identify and eliminate duplicates (using primary keys)
-* Unspecified or inconsistent units: infer the units and check that values are in reasonable ranges in the data
-
-### Missing Values
-Another common issue encountered with real-world datasets is that of missing data. One strategy to resolve this is to simply drop any records with missing values from the dataset. This does, however, introduce the risk of inducing biases – it is possible that the missing or corrupt records may be systematically related to some feature of interest in the data. Another solution is to keep the data as `NaN` values.
-
-A third method to address missing data is to perform **imputation**: infer the missing values using other data available in the dataset. There is a wide variety of imputation techniques that can be implemented; some of the most common are listed below.
-
-* Average imputation: replace missing values with the average value for that field
-* Hot deck imputation: replace missing values with a value drawn at random from similar observed records
-* Regression imputation: develop a model to predict missing values
-* Multiple imputation: replace missing values with multiple random values
-
-Regardless of the strategy used to deal with missing data, we should think carefully about *why* particular records or fields may be missing – this can help inform whether or not the absence of these values is significant or meaningful.
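-
-As a minimal sketch of the first strategy, average imputation, on a toy `Series` (made-up values, not one of the course datasets):
-
-```python
-import numpy as np
-import pandas as pd
-
-s = pd.Series([2.0, np.nan, 4.0, 6.0])
-s_imputed = s.fillna(s.mean())   # the NaN is replaced by 4.0, the mean of the observed values
-s_imputed
-```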
-
-# EDA Demo 1: Tuberculosis in the United States
-
-Now, let's walk through the data-cleaning and EDA workflow to see what we can learn about the presence of Tuberculosis in the United States!
-
-We will examine the data included in the [original CDC article](https://www.cdc.gov/mmwr/volumes/71/wr/mm7112a1.htm?s_cid=mm7112a1_w#T1_down) published in 2021.
-
-
-## CSVs and Field Names
-Suppose Table 1 was saved as a CSV file located in `data/cdc_tuberculosis.csv`.
-
-We can then explore the CSV (which is a text file, and does not contain binary-encoded data) in many ways:
-
-1. Using a text editor like emacs, vim, VSCode, etc.
-2. Opening the CSV directly in DataHub (read-only), Excel, Google Sheets, etc.
-3. The `Python` file object
-4. `pandas`, using `pd.read_csv()`
-
-To try out options 1 and 2, you can view or download the Tuberculosis dataset from the [lecture demo notebook](https://data100.datahub.berkeley.edu/hub/user-redirect/git-pull?repo=https%3A%2F%2Fgithub.com%2FDS-100%2Ffa23-student&urlpath=lab%2Ftree%2Ffa23-student%2Flecture%2Flec05%2Flec04-eda.ipynb&branch=main) under the `data` folder in the left-hand menu. Notice how the CSV file is a type of **rectangular data (i.e., tabular data) stored as comma-separated values**.
-
-Next, let's try out option 3 using the `Python` file object. We'll look at the first four lines:
-
-```{python}
-#| code-fold: true
-with open("data/cdc_tuberculosis.csv", "r") as f:
- i = 0
- for row in f:
- print(row)
- i += 1
- if i > 3:
- break
-```
-
-Whoa, why are there blank lines interspersed between the lines of the CSV?
-
-You may recall that all line breaks in text files are encoded as the special newline character `\n`. Python's `print()` prints each string (including the newline), and an additional newline on top of that.
-
-If you're curious, we can use the `repr()` function to return the raw string with all special characters:
-
-```{python}
-#| code-fold: true
-with open("data/cdc_tuberculosis.csv", "r") as f:
- i = 0
- for row in f:
- print(repr(row)) # print raw strings
- i += 1
- if i > 3:
- break
-```
-
-Finally, let's try option 4 and use the tried-and-true Data 100 approach: `pandas`.
-
-```{python}
-#| code-fold: false
-tb_df = pd.read_csv("data/cdc_tuberculosis.csv")
-tb_df.head()
-```
-
-You may notice some strange things about this table: what's up with the "Unnamed" column names and the first row?
-
-Congratulations — you're ready to wrangle your data! Because of how things are stored, we'll need to clean the data a bit to name our columns better.
-
-A reasonable first step is to identify the row with the right header. The `pd.read_csv()` function ([documentation](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html)) has the convenient `header` parameter that we can set to use the elements in row 1 as the appropriate columns:
-
-```{python}
-#| code-fold: false
-tb_df = pd.read_csv("data/cdc_tuberculosis.csv", header=1) # row index
-tb_df.head(5)
-```
-
-Wait...but now we can't differentiate between the "Number of TB cases" and "TB incidence" year columns. `pandas` has tried to make our lives easier by automatically adding ".1" to the latter columns, but this doesn't help us, as humans, understand the data.
-
-We can rename the columns manually with `df.rename()` ([documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.rename.html?highlight=rename#pandas.DataFrame.rename)):
-
-```{python}
-#| code-fold: false
-rename_dict = {'2019': 'TB cases 2019',
- '2020': 'TB cases 2020',
- '2021': 'TB cases 2021',
- '2019.1': 'TB incidence 2019',
- '2020.1': 'TB incidence 2020',
- '2021.1': 'TB incidence 2021'}
-tb_df = tb_df.rename(columns=rename_dict)
-tb_df.head(5)
-```
-
-## Record Granularity
-
-You might already be wondering: what's up with that first record?
-
-Row 0 is what we call a **rollup record**, or summary record. It's often useful when displaying tables to humans. The **granularity** of record 0 (Totals) vs the rest of the records (States) is different.
-
-Okay, EDA step two. How was the rollup record aggregated?
-
-Let's check if the Total TB cases equal the sum of all state TB cases. If we sum over all rows, each TB cases column should come out to **2x** the total cases for that year (why do you think this is?).
-
-```{python}
-#| code-fold: true
-tb_df.sum(axis=0)
-```
-
-Whoa, what's going on with the TB cases in 2019, 2020, and 2021? Check out the column types:
-
-```{python}
-#| code-fold: true
-tb_df.dtypes
-```
-
-Since there are commas in the values for TB cases, the numbers are read as the `object` datatype, or **storage type** (close to the `Python` string datatype), so `pandas` is concatenating strings instead of adding integers (recall that `Python` can "sum", or concatenate, strings together: `"data" + "100"` evaluates to `"data100"`).
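-
-A tiny demonstration of this concatenation behavior on a toy `Series`:
-
-```python
-import pandas as pd
-
-pd.Series(["1,000", "2,000"]).sum()   # evaluates to '1,0002,000', not 3000
-```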
-
-
-Fortunately `read_csv` also has a `thousands` parameter ([documentation](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html)):
-
-```{python}
-#| code-fold: false
-# improve readability: chaining method calls with outer parentheses/line breaks
-tb_df = (
- pd.read_csv("data/cdc_tuberculosis.csv", header=1, thousands=',')
- .rename(columns=rename_dict)
-)
-tb_df.head(5)
-```
-
-```{python}
-#| code-fold: false
-tb_df.sum()
-```
-
-The Total TB cases look right. Phew!
-
-Let's just look at the records with **state-level granularity**:
-
-```{python}
-#| code-fold: true
-state_tb_df = tb_df[1:]
-state_tb_df.head(5)
-```
-
-## Gather Census Data
-
-U.S. Census population estimates [source](https://www.census.gov/data/tables/time-series/demo/popest/2010s-state-total.html) (2019), [source](https://www.census.gov/data/tables/time-series/demo/popest/2020s-state-total.html) (2020-2021).
-
-Running the below cells cleans the data.
-There are a few new methods here:
-* `df.convert_dtypes()` ([documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.convert_dtypes.html)) conveniently converts all float dtypes into ints and is out of scope for the class.
-* `df.dropna()` ([documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.dropna.html)) will be explained in more detail next time.
-
-```{python}
-#| code-fold: true
-# 2010s census data
-census_2010s_df = pd.read_csv("data/nst-est2019-01.csv", header=3, thousands=",")
-census_2010s_df = (
- census_2010s_df
- .reset_index()
- .drop(columns=["index", "Census", "Estimates Base"])
- .rename(columns={"Unnamed: 0": "Geographic Area"})
- .convert_dtypes() # "smart" converting of columns, use at your own risk
- .dropna() # we'll introduce this next time
-)
-census_2010s_df['Geographic Area'] = census_2010s_df['Geographic Area'].str.strip('.')
-
-# with pd.option_context('display.min_rows', 30): # shows more rows
-# display(census_2010s_df)
-
-census_2010s_df.head(5)
-```
-
-Occasionally, you will want to modify code that you have imported. To reimport those modifications you can either use `python`'s `importlib` library:
-
-```python
-from importlib import reload
-reload(utils)
-```
-
-or use `iPython` magic which will intelligently import code when files change:
-
-```python
-%load_ext autoreload
-%autoreload 2
-```
-
-```{python}
-#| code-fold: true
-# census 2020s data
-census_2020s_df = pd.read_csv("data/NST-EST2022-POP.csv", header=3, thousands=",")
-census_2020s_df = (
- census_2020s_df
- .reset_index()
- .drop(columns=["index", "Unnamed: 1"])
- .rename(columns={"Unnamed: 0": "Geographic Area"})
- .convert_dtypes() # "smart" converting of columns, use at your own risk
- .dropna() # we'll introduce this next time
-)
-census_2020s_df['Geographic Area'] = census_2020s_df['Geographic Area'].str.strip('.')
-
-census_2020s_df.head(5)
-```
-
-## Joining Data (Merging `DataFrame`s)
-
-Time to `merge`! Here we use the `DataFrame` method `df1.merge(right=df2, ...)` on `DataFrame` `df1` ([documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.merge.html)). Contrast this with the function `pd.merge(left=df1, right=df2, ...)` ([documentation](https://pandas.pydata.org/docs/reference/api/pandas.merge.html?highlight=pandas%20merge#pandas.merge)). Feel free to use either.
-
-```{python}
-#| code-fold: false
-# merge TB DataFrame with two US census DataFrames
-tb_census_df = (
- tb_df
- .merge(right=census_2010s_df,
- left_on="U.S. jurisdiction", right_on="Geographic Area")
- .merge(right=census_2020s_df,
- left_on="U.S. jurisdiction", right_on="Geographic Area")
-)
-tb_census_df.head(5)
-```
-
-Having all of these columns is a little unwieldy. We could either drop the unneeded columns now, or just merge on smaller census `DataFrame`s. Let's do the latter.
-
-```{python}
-#| code-fold: false
-# try merging again, but cleaner this time
-tb_census_df = (
- tb_df
- .merge(right=census_2010s_df[["Geographic Area", "2019"]],
- left_on="U.S. jurisdiction", right_on="Geographic Area")
- .drop(columns="Geographic Area")
- .merge(right=census_2020s_df[["Geographic Area", "2020", "2021"]],
- left_on="U.S. jurisdiction", right_on="Geographic Area")
- .drop(columns="Geographic Area")
-)
-tb_census_df.head(5)
-```
-
-## Reproducing Data: Compute Incidence
-
-Let's recompute incidence to make sure we know where the original CDC numbers came from.
-
-From the [CDC report](https://www.cdc.gov/mmwr/volumes/71/wr/mm7112a1.htm?s_cid=mm7112a1_w#T1_down): TB incidence is computed as “Cases per 100,000 persons using mid-year population estimates from the U.S. Census Bureau.”
-
-If we define a group as 100,000 people, then we can compute the TB incidence for a given state population as
-
-$$\text{TB incidence} = \frac{\text{TB cases in population}}{\text{groups in population}} = \frac{\text{TB cases in population}}{\text{population}/100000} $$
-
-$$= \frac{\text{TB cases in population}}{\text{population}} \times 100000$$
-
-Let's try this for 2019:
-
-```{python}
-#| code-fold: false
-tb_census_df["recompute incidence 2019"] = tb_census_df["TB cases 2019"]/tb_census_df["2019"]*100000
-tb_census_df.head(5)
-```
-
-Awesome!!!
-
-Let's use a for-loop and `Python` format strings to compute TB incidence for all years. `Python` f-strings are just used for the purposes of this demo, but they're handy to know when you explore data beyond this course ([documentation](https://docs.python.org/3/tutorial/inputoutput.html)).
-
-```{python}
-#| code-fold: false
-# recompute incidence for all years
-for year in [2019, 2020, 2021]:
- tb_census_df[f"recompute incidence {year}"] = tb_census_df[f"TB cases {year}"]/tb_census_df[f"{year}"]*100000
-tb_census_df.head(5)
-```
-
-These numbers look pretty close!!! There are a few errors in the hundredths place, particularly in 2021. It may be useful to further explore reasons behind this discrepancy.
-
-```{python}
-#| code-fold: false
-tb_census_df.describe()
-```
-
-## Bonus EDA: Reproducing the Reported Statistic
-
-
-**How do we reproduce that reported statistic in the original [CDC report](https://www.cdc.gov/mmwr/volumes/71/wr/mm7112a1.htm?s_cid=mm7112a1_w)?**
-
-> Reported TB incidence (cases per 100,000 persons) increased **9.4%**, from **2.2** during 2020 to **2.4** during 2021 but was lower than incidence during 2019 (2.7). Increases occurred among both U.S.-born and non–U.S.-born persons.
-
-This is TB incidence computed across the entire U.S. population! How do we reproduce this?
-* We need to reproduce the "Total" TB incidences in our rolled record.
-* But our current `tb_census_df` only has 51 entries (50 states plus Washington, D.C.). There is no rolled record.
-* What happened...?
-
-Let's get exploring!
-
-Before we keep exploring, we'll set all indexes to more meaningful values, instead of just numbers that pertain to some row at some point. This will make our cleaning slightly easier.
-
-```{python}
-#| code-fold: true
-tb_df = tb_df.set_index("U.S. jurisdiction")
-tb_df.head(5)
-```
-
-```{python}
-#| code-fold: false
-census_2010s_df = census_2010s_df.set_index("Geographic Area")
-census_2010s_df.head(5)
-```
-
-```{python}
-#| code-fold: false
-census_2020s_df = census_2020s_df.set_index("Geographic Area")
-census_2020s_df.head(5)
-```
-
-It turns out that our merge above only kept state records, even though our original `tb_df` had the "Total" rolled record:
-
-```{python}
-#| code-fold: false
-tb_df.head()
-```
-
-Recall that `merge` performs an **inner** merge by default, meaning that it only preserves keys that are present in **both** `DataFrame`s.
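-
-One way to see exactly which keys an inner merge drops is the `indicator` argument of `merge`. This quick diagnostic (not part of the original analysis) labels each key by whether it appears in the left table, the right table, or both:
-
-```python
-diagnostic = tb_df.merge(right=census_2010s_df[["2019"]],
-                         left_index=True, right_index=True,
-                         how="outer", indicator=True)
-diagnostic["_merge"].value_counts()
-```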
-
-The rolled records in our census `DataFrame` have different `Geographic Area` fields, which was the key we merged on:
-
-```{python}
-#| code-fold: false
-census_2010s_df.head(5)
-```
-
-The Census `DataFrame` has several rolled records. The aggregate record we are looking for actually has the Geographic Area named "United States".
-
-One straightforward way to get the right merge is to rename the value itself. Because we now have the Geographic Area index, we'll use `df.rename()` ([documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.rename.html)):
-
-```{python}
-#| code-fold: false
-# rename rolled record for 2010s
-census_2010s_df.rename(index={'United States':'Total'}, inplace=True)
-census_2010s_df.head(5)
-```
-
-```{python}
-#| code-fold: false
-# same, but for 2020s rename rolled record
-census_2020s_df.rename(index={'United States':'Total'}, inplace=True)
-census_2020s_df.head(5)
-```
-
-<br/>
-
-Next let's rerun our merge. Note the different chaining, because we are now merging on indexes (`df.merge()` [documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.merge.html)).
-
-```{python}
-#| code-fold: false
-tb_census_df = (
- tb_df
- .merge(right=census_2010s_df[["2019"]],
- left_index=True, right_index=True)
- .merge(right=census_2020s_df[["2020", "2021"]],
- left_index=True, right_index=True)
-)
-tb_census_df.head(5)
-```
-
-<br/>
-
-Finally, let's recompute our incidences:
-
-```{python}
-#| code-fold: false
-# recompute incidence for all years
-for year in [2019, 2020, 2021]:
- tb_census_df[f"recompute incidence {year}"] = tb_census_df[f"TB cases {year}"]/tb_census_df[f"{year}"]*100000
-tb_census_df.head(5)
-```
-
-We reproduced the total U.S. incidences correctly!
-
-We're almost there. Let's revisit the quote:
-
-> Reported TB incidence (cases per 100,000 persons) increased **9.4%**, from **2.2** during 2020 to **2.4** during 2021 but was lower than incidence during 2019 (2.7). Increases occurred among both U.S.-born and non–U.S.-born persons.
-
-Recall that percent change from $A$ to $B$ is computed as
-$\text{percent change} = \frac{B - A}{A} \times 100$.
-
-```{python}
-#| code-fold: false
-#| tags: []
-incidence_2020 = tb_census_df.loc['Total', 'recompute incidence 2020']
-incidence_2020
-```
-
-```{python}
-#| code-fold: false
-#| tags: []
-incidence_2021 = tb_census_df.loc['Total', 'recompute incidence 2021']
-incidence_2021
-```
-
-```{python}
-#| code-fold: false
-#| tags: []
-difference = (incidence_2021 - incidence_2020)/incidence_2020 * 100
-difference
-```
-
-# EDA Demo 2: Mauna Loa CO<sub>2</sub> Data -- A Lesson in Data Faithfulness
-
-[Mauna Loa Observatory](https://gml.noaa.gov/ccgg/trends/data.html) has been monitoring CO<sub>2</sub> concentrations since 1958
-
-```{python}
-#| code-fold: false
-co2_file = "data/co2_mm_mlo.txt"
-```
-
-Let's do some **EDA**!!
-
-## Reading this file into Pandas?
-Let's instead check out this `.txt` file. Some questions to keep in mind: Do we trust this file extension? What structure is it?
-
-Lines 71-78 (inclusive) are shown below:
-
- line number | file contents
-
- 71 | # decimal average interpolated trend #days
- 72 | # date (season corr)
- 73 | 1958 3 1958.208 315.71 315.71 314.62 -1
- 74 | 1958 4 1958.292 317.45 317.45 315.29 -1
- 75 | 1958 5 1958.375 317.50 317.50 314.71 -1
- 76 | 1958 6 1958.458 -99.99 317.10 314.85 -1
- 77 | 1958 7 1958.542 315.86 315.86 314.98 -1
- 78 | 1958 8 1958.625 314.93 314.93 315.94 -1
-
-
-Notice how:
-
-- The values are separated by white space, possibly tabs.
-- The data values line up in fixed positions down the rows. For example, the month appears in the 7th to 8th position of each line.
-- The 71st and 72nd lines in the file contain column headings split over two lines.
-
-We can use `read_csv` to read the data into a `pandas` `DataFrame`, and we provide several arguments to specify that the separators are white space, there is no header (**we will set our own column names**), and to skip the first 72 rows of the file.
-
-```{python}
-#| code-fold: false
-co2 = pd.read_csv(
- co2_file, header = None, skiprows = 72,
-    sep = r'\s+'  # delimiter for continuous whitespace (stay tuned for regex next lecture)
-)
-co2.head()
-```
-
-Congratulations! You've wrangled the data!
-
-<br/>
-
-...But our columns aren't named.
-**We need to do more EDA.**
-
-## Exploring Variable Feature Types
-
-The NOAA [webpage](https://gml.noaa.gov/ccgg/trends/) might have some useful tidbits (in this case it doesn't).
-
-Using this information, we'll rerun `pd.read_csv`, but this time with some **custom column names.**
-
-```{python}
-#| code-fold: false
-co2 = pd.read_csv(
- co2_file, header = None, skiprows = 72,
-    sep = r'\s+',  # regex for continuous whitespace (next lecture)
- names = ['Yr', 'Mo', 'DecDate', 'Avg', 'Int', 'Trend', 'Days']
-)
-co2.head()
-```
-
-## Visualizing CO<sub>2</sub>
-Scientific studies tend to have very clean data, right...? Let's jump right in and make a time series plot of CO2 monthly averages.
-
-```{python}
-#| code-fold: true
-sns.lineplot(x='DecDate', y='Avg', data=co2);
-```
-
-The code above uses the `seaborn` plotting library (abbreviated `sns`). We will cover this in the Visualization lecture, but now you don't need to worry about how it works!
-
-Yikes! Plotting the data uncovered a problem. The sharp vertical lines suggest that we have some **missing values**. What happened here?
-
-```{python}
-#| code-fold: false
-co2.head()
-```
-
-```{python}
-#| code-fold: false
-co2.tail()
-```
-
-Some data have unusual values like -1 and -99.99.
-
-Let's check the description at the top of the file again.
-
-* -1 signifies a missing value for the number of days `Days` the equipment was in operation that month.
-* -99.99 denotes a missing monthly average `Avg`
-
-How can we fix this? First, let's explore other aspects of our data. Understanding our data will help us decide what to do with the missing values.
-
-<br/>
-
-
-## Sanity Checks: Reasoning about the data
-First, we consider the shape of the data. How many rows should we have?
-
-* If the data are in chronological order, we should have one record per month.
-* Data from March 1958 to August 2019.
-* We should have $ 12 \times (2019-1957) - 2 - 4 = 738 $ records.
-
-```{python}
-#| code-fold: false
-co2.shape
-```
-
-Nice!! The number of rows (i.e. records) matches our expectations.
-
-<br/>
-
-
-Let's now check the quality of each feature.
-
-## Understanding Missing Value 1: `Days`
-`Days` is a time field, so let's analyze other time fields to see if there is an explanation for missing values of days of operation.
-
-Let's start with **months**, `Mo`.
-
-Are we missing any records? Each month should appear 61 or 62 times (March 1958-August 2019).
-
-```{python}
-#| code-fold: false
-co2["Mo"].value_counts().sort_index()
-```
-
-As expected, Jan, Feb, Sep, Oct, Nov, and Dec have 61 occurrences, and the rest have 62.
-
-<br/>
-
-Next let's explore **days** `Days` itself, which is the number of days that the measurement equipment worked.
-
-```{python}
-#| code-fold: true
-sns.displot(co2['Days']);
-plt.title("Distribution of days feature"); # suppresses unneeded plotting output
-```
-
-In terms of data quality, a handful of months have averages based on measurements taken on fewer than half the days. In addition, there are nearly 200 missing values--**that's about 27% of the data**!
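-
-We can verify that figure with a quick check (recall that -1 marks a missing `Days` value):
-
-```python
-missing_days = (co2["Days"] == -1)
-print(missing_days.sum(), "months missing,", f"{missing_days.mean():.0%} of the data")
-```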
-
-<br/>
-
-Finally, let's check the last time feature, **year** `Yr`.
-
-Let's check to see if there is any connection between missing-ness and the year of the recording.
-
-```{python}
-#| code-fold: true
-sns.scatterplot(x="Yr", y="Days", data=co2);
-plt.title("Day field by Year"); # the ; suppresses output
-```
-
-**Observations**:
-
-* All of the missing data are in the early years of operation.
-* It appears there may have been problems with equipment in the mid to late 80s.
-
-**Potential Next Steps**:
-
-* Confirm these explanations through documentation about the historical readings.
-* Maybe drop the earliest recordings? However, we would want to delay such action until after we have examined the time trends and assessed whether there are any potential problems.
-
-<br/>
-
-## Understanding Missing Value 2: `Avg`
-Next, let's return to the -99.99 values in `Avg` to analyze the overall quality of the CO2 measurements. We'll plot a histogram of the average CO<sub>2</sub> measurements.
-
-```{python}
-#| code-fold: true
-# Histograms of average CO2 measurements
-sns.displot(co2['Avg']);
-```
-
-The non-missing values are in the 300-400 range (a regular range of CO2 levels).
-
-We also see that there are only a few missing `Avg` values (**<1% of values**). Let's examine all of them:
-
-```{python}
-#| code-fold: false
-co2[co2["Avg"] < 0]
-```
-
-There doesn't seem to be a pattern to these values, other than that most records also were missing `Days` data.
-
-## Drop, `NaN`, or Impute Missing `Avg` Data?
-
-How should we address the invalid `Avg` data?
-
-1. Drop records
-2. Set to NaN
-3. Impute using some strategy
-
-Remember we want to fix the following plot:
-
-```{python}
-#| code-fold: true
-sns.lineplot(x='DecDate', y='Avg', data=co2)
-plt.title("CO2 Average By Month");
-```
-
-Since we are plotting `Avg` vs `DecDate`, we should just focus on dealing with missing values for `Avg`.
-
-
-Let's consider a few options:
-
-1. Drop those records
-2. Replace -99.99 with NaN
-3. Substitute a likely value for the average CO2
-
-What do you think are the pros and cons of each possible action?
-
-<br/>
-
-
-Let's examine each of these three options.
-
-```{python}
-#| code-fold: false
-# 1. Drop missing values
-co2_drop = co2[co2['Avg'] > 0]
-co2_drop.head()
-```
-
-```{python}
-#| code-fold: false
-# 2. Replace -99.99 with NaN
-co2_NA = co2.replace(-99.99, np.NaN)
-co2_NA.head()
-```
-
-We'll also use a third version of the data.
-
-First, we note that the dataset already comes with a **substitute value** for the -99.99.
-
-From the file description:
-
-> The `interpolated` column includes average values from the preceding column (`average`)
-and **interpolated values** where data are missing. Interpolated values are
-computed in two steps...
-
-The `Int` feature has values that exactly match those in `Avg`, except when `Avg` is -99.99, and then a **reasonable** estimate is used instead.
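-
-We can sanity-check that claim directly (a quick check, not in the original notebook): wherever `Avg` is not the -99.99 sentinel, `Int` should equal `Avg`.
-
-```python
-valid = co2["Avg"] > 0
-(co2.loc[valid, "Avg"] == co2.loc[valid, "Int"]).all()
-```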
-
-So, the third version of our data will use the `Int` feature instead of `Avg`.
-
-```{python}
-#| code-fold: false
-# 3. Use interpolated column which estimates missing Avg values
-co2_impute = co2.copy()
-co2_impute['Avg'] = co2['Int']
-co2_impute.head()
-```
-
-What's a **reasonable** estimate?
-
-To answer this question, let's zoom in on a short time period, say the measurements in 1958 (where we know we have two missing values).
-
-```{python}
-#| code-fold: true
-# results of plotting data in 1958
-
-def line_and_points(data, ax, title):
- # assumes single year, hence Mo
- ax.plot('Mo', 'Avg', data=data)
- ax.scatter('Mo', 'Avg', data=data)
- ax.set_xlim(2, 13)
- ax.set_title(title)
- ax.set_xticks(np.arange(3, 13))
-
-def data_year(data, year):
-    return data[data["Yr"] == year]
-
-# uses matplotlib subplots
-# you may see more next week; focus on output for now
-fig, axes = plt.subplots(ncols = 3, figsize=(12, 4), sharey=True)
-
-year = 1958
-line_and_points(data_year(co2_drop, year), axes[0], title="1. Drop Missing")
-line_and_points(data_year(co2_NA, year), axes[1], title="2. Missing Set to NaN")
-line_and_points(data_year(co2_impute, year), axes[2], title="3. Missing Interpolated")
-
-fig.suptitle(f"Monthly Averages for {year}")
-plt.tight_layout()
-```
-
-In the big picture, since there are only 7 `Avg` values missing (**<1%** of 738 months), any of these approaches would work.
-
-However, there is some appeal to **option 3: imputing**:
-
-* Shows seasonal trends for CO2
-* We are plotting all months in our data as a line plot
-
-<br/>
-
-
-Let's replot our original figure with option 3:
-
-```{python}
-#| code-fold: true
-sns.lineplot(x='DecDate', y='Avg', data=co2_impute)
-plt.title("CO2 Average By Month, Imputed");
-```
-
-Looks pretty close to what we see on the NOAA [website](https://gml.noaa.gov/ccgg/trends/)!
-
-## Presenting the data: A Discussion on Data Granularity
-
-From the description:
-
-* Monthly measurements are averages of the daily average measurements.
-* The NOAA GML website has datasets for daily/hourly measurements too.
-
-The data you present depends on your research question.
-
-**How do CO2 levels vary by season?**
-
-* You might want to keep average monthly data.
-
-**Are CO2 levels rising over the past 50+ years, consistent with global warming predictions?**
-
-* You might be happier with a **coarser granularity** of average year data!
-
-```{python}
-#| code-fold: true
-co2_year = co2_impute.groupby('Yr').mean()
-sns.lineplot(x='Yr', y='Avg', data=co2_year)
-plt.title("CO2 Average By Year");
-```
-
-Indeed, we see a rise by nearly 100 ppm of CO2 since Mauna Loa began recording in 1958.
-
-# Summary
-We went over a lot of content this lecture; let's summarize the most important points:
-
-## Dealing with Missing Values
-There are a few options we can take to deal with missing data:
-
-* Drop missing records
-* Keep `NaN` missing values
-* Impute using an interpolated column
-
-## EDA and Data Wrangling
-There are several ways to approach EDA and Data Wrangling:
-
-* Examine the **data and metadata**: what is the date, size, organization, and structure of the data?
-* Examine each **field/attribute/dimension** individually.
-* Examine pairs of related dimensions (e.g. breaking down grades by major).
-* Along the way, we can:
- * **Visualize** or summarize the data.
- * **Validate assumptions** about data and its collection process. Pay particular attention to when the data was collected.
- * Identify and **address anomalies**.
- * Apply data transformations and corrections (we'll cover this in the upcoming lecture).
- * **Record everything you do!** Developing in Jupyter Notebook promotes *reproducibility* of your own work!
+---
+title: Data Cleaning and EDA
+execute:
+ echo: true
+format:
+ html:
+ code-fold: true
+ code-tools: true
+ toc: true
+ toc-title: Data Cleaning and EDA
+ page-layout: full
+ theme:
+ - cosmo
+ - cerulean
+ callout-icon: false
+jupyter: python3
+---
+
+```{python}
+#| code-fold: true
+import numpy as np
+import pandas as pd
+
+import matplotlib.pyplot as plt
+import seaborn as sns
+#%matplotlib inline
+plt.rcParams['figure.figsize'] = (12, 9)
+
+sns.set()
+sns.set_context('talk')
+np.set_printoptions(threshold=20, precision=2, suppress=True)
+pd.set_option('display.max_rows', 30)
+pd.set_option('display.max_columns', None)
+pd.set_option('display.precision', 2)
+# This option stops scientific notation for pandas
+pd.set_option('display.float_format', '{:.2f}'.format)
+
+# Silence some spurious seaborn warnings
+import warnings
+warnings.filterwarnings("ignore", category=FutureWarning)
+```
+
+::: {.callout-note collapse="false"}
+## Learning Outcomes
+* Recognize common file formats
+* Categorize data by its variable type
+* Build awareness of issues with data faithfulness and develop targeted solutions
+:::
+
+**This content is covered in lectures 4, 5, and 6.**
+
+In the past few lectures, we've learned that `pandas` is a toolkit to restructure, modify, and explore a dataset. What we haven't yet touched on is *how* to make these data transformation decisions. When we receive a new set of data from the "real world," how do we know what processing we should do to convert this data into a usable form?
+
+**Data cleaning**, also called **data wrangling**, is the process of transforming raw data to facilitate subsequent analysis. It is often used to address issues like:
+
+* Unclear structure or formatting
+* Missing or corrupted values
+* Unit conversions
+* ...and so on
+
+**Exploratory Data Analysis (EDA)** is the process of understanding a new dataset. It is an open-ended, informal analysis that involves familiarizing ourselves with the variables present in the data, discovering potential hypotheses, and identifying possible issues with the data. This last point can often motivate further data cleaning to address any problems with the dataset's format; because of this, EDA and data cleaning are often thought of as an "infinite loop," with each process driving the other.
+
+In this lecture, we will consider the key properties of data to consider when performing data cleaning and EDA. In doing so, we'll develop a "checklist" of sorts for you to consider when approaching a new dataset. Throughout this process, we'll build a deeper understanding of this early (but very important!) stage of the data science lifecycle.
+
+## Structure
+
+### File Formats
+There are many file types for storing structured data: TSV, JSON, XML, ASCII, SAS, etc. We'll only cover CSV, TSV, and JSON in lecture, but you'll likely encounter other formats as you work with different datasets. Reading documentation is your best bet for understanding how to process the multitude of different file types.
+
+#### CSV
+CSVs, which stand for **Comma-Separated Values**, are a common tabular data format.
+In the past two `pandas` lectures, we briefly touched on the idea of file format: the way data is encoded in a file for storage. Specifically, our `elections` and `babynames` datasets were stored and loaded as CSVs:
+
+```{python}
+#| code-fold: false
+pd.read_csv("data/elections.csv").head(5)
+```
+
+To better understand the properties of a CSV, let's take a look at the first few rows of the raw data file to see what it looks like before being loaded into a `DataFrame`. We'll use the `repr()` function to return the raw string with its special characters:
+
+```{python}
+#| code-fold: false
+with open("data/elections.csv", "r") as table:
+ i = 0
+ for row in table:
+ print(repr(row))
+ i += 1
+ if i > 3:
+ break
+```
+
+Each row, or **record**, in the data is delimited by a newline `\n`. Each column, or **field**, in the data is delimited by a comma `,` (hence, comma-separated!).
+
+#### TSV
+
+Another common file type is **TSV (Tab-Separated Values)**. In a TSV, records are still delimited by a newline `\n`, while fields are delimited by the tab character `\t`.
+
+Let's check out the first few rows of the raw TSV file. Again, we'll use the `repr()` function so that `print` shows the special characters.
+
+```{python}
+#| code-fold: false
+with open("data/elections.txt", "r") as table:
+ i = 0
+ for row in table:
+ print(repr(row))
+ i += 1
+ if i > 3:
+ break
+```
+
+TSVs can be loaded into `pandas` using `pd.read_csv`. We'll need to specify the **delimiter** with the parameter `sep='\t'` [(documentation)](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html).
+
+```{python}
+#| code-fold: false
+pd.read_csv("data/elections.txt", sep='\t').head(3)
+```
+
+An issue with CSVs and TSVs comes up whenever there are commas or tabs within the records. How does `pandas` differentiate between a comma delimiter vs. a comma within the field itself, for example `8,900`? To remedy this, check out the [`quotechar` parameter](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html).
+
+#### JSON
+**JSON (JavaScript Object Notation)** files behave similarly to Python dictionaries. A raw JSON is shown below.
+
+```{python}
+#| code-fold: false
+with open("data/elections.json", "r") as table:
+ i = 0
+ for row in table:
+ print(row)
+ i += 1
+ if i > 8:
+ break
+```
+
+JSON files can be loaded into `pandas` using `pd.read_json`.
+
+```{python}
+#| code-fold: false
+pd.read_json('data/elections.json').head(3)
+```
+
+##### EDA with JSON: Berkeley COVID-19 Data
+The City of Berkeley Open Data [website](https://data.cityofberkeley.info/Health/COVID-19-Confirmed-Cases/xn6j-b766) has a dataset with COVID-19 Confirmed Cases among Berkeley residents by date. Let's download the file and save it as a JSON (note the source URL file type is also a JSON). In the interest of reproducible data science, we will download the data programmatically. We have defined some helper functions in the [`ds100_utils.py`](https://ds100.org/fa23/resources/assets/lectures/lec05/lec05-eda.html) file so that we can reuse them in many different notebooks.
+
+```{python}
+#| code-fold: false
+from ds100_utils import fetch_and_cache
+
+covid_file = fetch_and_cache(
+ "https://data.cityofberkeley.info/api/views/xn6j-b766/rows.json?accessType=DOWNLOAD",
+ "confirmed-cases.json",
+ force=False)
+covid_file # a file path wrapper object
+```
+
+###### File Size
+Let's start our analysis by getting a rough estimate of the size of the dataset to inform the tools we use to view the data. For relatively small datasets, we can use a text editor or spreadsheet. For larger datasets, more programmatic exploration or distributed computing tools may be more fitting. Here we will use `Python` tools to probe the file.
+
+Since the file appears to be plain text, let's investigate the number of lines, which often corresponds to the number of records.
+
+```{python}
+#| code-fold: false
+import os
+
+print(covid_file, "is", os.path.getsize(covid_file) / 1e6, "MB")
+
+with open(covid_file, "r") as f:
+ print(covid_file, "is", sum(1 for l in f), "lines.")
+```
+
+###### Unix Commands
+As part of the EDA workflow, Unix commands can come in very handy. In fact, there's an entire book called ["Data Science at the Command Line"](https://datascienceatthecommandline.com/) that explores this idea in depth!
+In Jupyter/IPython, you can prefix lines with `!` to execute arbitrary Unix commands, and within those lines, you can refer to `Python` variables and expressions with the syntax `{expr}`.
+
+Here, we use the `ls` command to list files, using the `-lh` flags, which request "long format with information in human-readable form." We also use the `wc` command for "word count," but with the `-l` flag, which asks for line counts instead of words.
+
+These two give us the same information as the code above, albeit in a slightly different form:
+
+```{python}
+#| code-fold: false
+!ls -lh {covid_file}
+!wc -l {covid_file}
+```
+
+###### File Contents
+Let's explore the data format using `Python`.
+
+```{python}
+#| code-fold: false
+with open(covid_file, "r") as f:
+ for i, row in enumerate(f):
+ print(repr(row)) # print raw strings
+ if i >= 4: break
+```
+
+We can use the `head` Unix command (which is where `pandas`' `head` method comes from!) to see the first few lines of the file:
+
+```{python}
+#| code-fold: false
+!head -5 {covid_file}
+```
+
+In order to load the JSON file into `pandas`, let's first do some EDA with `Python`'s `json` package to understand the particular structure of this JSON file so that we can decide what (if anything) to load into `pandas`. `Python` has relatively good support for JSON data since it closely matches its internal object model. In the following cell, we import the entire JSON datafile into a `Python` dictionary using the `json` package.
+
+```{python}
+#| code-fold: false
+import json
+
+with open(covid_file, "rb") as f:
+ covid_json = json.load(f)
+```
+
+The `covid_json` variable is now a dictionary encoding the data in the file:
+
+```{python}
+#| code-fold: false
+type(covid_json)
+```
+
+We can examine what keys are in the top level json object by listing out the keys.
+
+```{python}
+#| code-fold: false
+covid_json.keys()
+```
+
+**Observation**: The JSON dictionary contains a `meta` key, which likely refers to metadata (data about the data). Metadata is often maintained with the data and can be a good source of additional information.
+
+
+We can investigate the metadata further by examining its keys.
+
+```{python}
+#| code-fold: false
+covid_json['meta'].keys()
+```
+
+The `meta` key contains another dictionary called `view`. This likely refers to meta-data about a particular "view" of some underlying database. We will learn more about views when we study SQL later in the class.
+
+```{python}
+#| code-fold: false
+covid_json['meta']['view'].keys()
+```
+
+Notice that this is a nested/recursive data structure. As we dig deeper, we reveal more and more keys and the corresponding data:
+
+```
+meta
+|-> data
+ | ... (haven't explored yet)
+|-> view
+ | -> id
+ | -> name
+ | -> attribution
+ ...
+ | -> description
+ ...
+ | -> columns
+ ...
+```
+
+
+There is a key called `description` in the `view` sub-dictionary. This likely contains a description of the data:
+
+```{python}
+#| code-fold: false
+print(covid_json['meta']['view']['description'])
+```
+
+###### Examining the Data Field for Records
+
+We can look at a few entries in the `data` field. This is what we'll load into `pandas`.
+
+```{python}
+#| code-fold: false
+for i in range(3):
+ print(f"{i:03} | {covid_json['data'][i]}")
+```
+
+Observations:
+
+* These look like equal-length records, so maybe `data` is a table!
+* But what does each value in the record mean? Where can we find the column headers?
+
+For that, we'll need the `columns` key in the metadata dictionary. This returns a list:
+
+```{python}
+#| code-fold: false
+type(covid_json['meta']['view']['columns'])
+```
+
+###### Summary of exploring the JSON file
+
+1. The above **metadata** tells us a lot about the columns in the data including column names, potential data anomalies, and a basic statistic.
+1. Because of its non-tabular structure, JSON makes it easier (than CSV) to create **self-documenting data**, meaning that information about the data is stored in the same file as the data.
+1. Self-documenting data can be helpful since it maintains its own description and these descriptions are more likely to be updated as data changes.
+
+###### Loading COVID Data into `pandas`
+Finally, let's load the data (not the metadata) into a `pandas` `DataFrame`. In the following block of code we:
+
+1. Translate the JSON records into a `DataFrame`:
+
+ * fields: `covid_json['meta']['view']['columns']`
+ * records: `covid_json['data']`
+
+
+1. Remove columns that have no metadata description. This would be a bad idea in general, but here we remove these columns since the above analysis suggests they are unlikely to contain useful information.
+
+1. Examine the `tail` of the table.
+
+```{python}
+#| code-fold: false
+# Load the data from JSON and assign column titles
+covid = pd.DataFrame(
+ covid_json['data'],
+ columns=[c['name'] for c in covid_json['meta']['view']['columns']])
+
+covid.tail()
+```
+
+### Variable Types
+
+After loading data from a file, it's a good idea to take the time to understand what pieces of information are encoded in the dataset. In particular, we want to identify what variable types are present in our data. Broadly speaking, we can categorize variables into one of two overarching types.
+
+**Quantitative variables** describe some numeric quantity or amount. We can divide quantitative data further into:
+
+* **Continuous quantitative variables**: numeric data that can be measured on a continuous scale to arbitrary precision. Continuous variables do not have a strict set of possible values – they can be recorded to any number of decimal places. For example, weights, GPA, or CO<sub>2</sub> concentrations.
+* **Discrete quantitative variables**: numeric data that can only take on a finite set of possible values. For example, someone's age or the number of siblings they have.
+
+**Qualitative variables**, also known as **categorical variables**, describe data that isn't measuring some quantity or amount. The sub-categories of categorical data are:
+
+* **Ordinal qualitative variables**: categories with ordered levels. Specifically, ordinal variables are those where the difference between levels has no consistent, quantifiable meaning. Some examples include levels of education (high school, undergrad, grad, etc.), income bracket (low, medium, high), or Yelp rating.
+* **Nominal qualitative variables**: categories with no specific order. For example, someone's political affiliation or Cal ID number.
+
+![Classification of variable types](images/variable.png)
+
+Note that many variables don't sit neatly in just one of these categories. Qualitative variables could have numeric levels, and conversely, quantitative variables could be stored as strings.
+
+### Primary and Foreign Keys
+
+Last time, we introduced `.merge` as the `pandas` method for joining multiple `DataFrame`s together. In our discussion of joins, we touched on the idea of using a "key" to determine what rows should be merged from each table. Let's take a moment to examine this idea more closely.
+
+The **primary key** is the column or set of columns in a table that *uniquely* determine the values of the remaining columns. It can be thought of as the unique identifier for each individual row in the table. For example, a table of Data 100 students might use each student's Cal ID as the primary key.
+
+```{python}
+#| echo: false
+pd.DataFrame({"Cal ID":[3034619471, 3035619472, 3025619473, 3046789372], \
+ "Name":["Oski", "Ollie", "Orrie", "Ollie"], \
+ "Major":["Data Science", "Computer Science", "Data Science", "Economics"]})
+```
+
+The **foreign key** is the column or set of columns in a table that reference primary keys in other tables. Knowing a dataset's foreign keys can be useful when assigning the `left_on` and `right_on` parameters of `.merge`. In the table of office hour tickets below, `"Cal ID"` is a foreign key referencing the previous table.
+
+```{python}
+#| echo: false
+pd.DataFrame({"OH Request":[1, 2, 3, 4], \
+ "Cal ID":[3034619471, 3035619472, 3025619473, 3035619472], \
+ "Question":["HW 2 Q1", "HW 2 Q3", "Lab 3 Q4", "HW 2 Q7"]})
+```
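+
+To see a foreign key in action, here is a small sketch that joins hypothetical stand-ins (`students` and `tickets`) for the two tables above:
+
+```python
+import pandas as pd
+
+# hypothetical stand-ins for the two tables shown above
+students = pd.DataFrame({"Cal ID": [3034619471, 3035619472],
+                         "Name": ["Oski", "Ollie"]})
+tickets = pd.DataFrame({"OH Request": [1, 2, 3],
+                        "Cal ID": [3034619471, 3035619472, 3035619472]})
+
+# "Cal ID" is the primary key of `students` and a foreign key in `tickets`
+tickets.merge(right=students, left_on="Cal ID", right_on="Cal ID")
+```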
+
+## Granularity, Scope, and Temporality
+
+After understanding the structure of the dataset, the next task is to determine what exactly the data represents. We'll do so by considering the data's granularity, scope, and temporality.
+
+### Granularity
+The **granularity** of a dataset is what a single row represents. You can also think of it as the level of detail included in the data. To determine the data's granularity, ask: what does each row in the dataset represent? Fine-grained data contains a high level of detail, with a single row representing a small individual unit. For example, each record may represent one person. Coarse-grained data is encoded such that a single row represents a large individual unit – for example, each record may represent a group of people.
+
+### Scope
+The **scope** of a dataset is the subset of the population covered by the data. If we were investigating student performance in Data Science courses, a dataset with a narrow scope might encompass all students enrolled in Data 100 whereas a dataset with an expansive scope might encompass all students in California.
+
+### Temporality
+The **temporality** of a dataset describes the periodicity over which the data was collected as well as when the data was most recently collected or updated.
+
+Time and date fields of a dataset could represent a few things:
+
+1. when the "event" happened
+2. when the data was collected, or when it was entered into the system
+3. when the data was copied into the database
+
+To fully understand the temporality of the data, it also may be necessary to standardize time zones or inspect recurring time-based trends in the data (do patterns recur in 24-hour periods? Over the course of a month? Seasonally?). The convention for standardizing time is Coordinated Universal Time (UTC), an international time standard measured at 0 degrees longitude that stays consistent throughout the year (no daylight saving time). Berkeley's time zone, Pacific Standard Time (PST), corresponds to UTC-8; during daylight saving time, Pacific Daylight Time (PDT) corresponds to UTC-7.
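+
+We can sanity check these offsets with `pandas` timestamps (a small sketch; the dates are arbitrary examples):
+
+```python
+import pandas as pd
+
+# winter date: Pacific Standard Time, UTC-8
+print(pd.Timestamp("2023-01-15 12:00", tz="America/Los_Angeles").strftime("%z"))
+
+# summer date: Pacific Daylight Time, UTC-7
+print(pd.Timestamp("2023-07-15 12:00", tz="America/Los_Angeles").strftime("%z"))
+```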
+
+#### Temporality with `pandas`' `dt` accessors
+Let's briefly look at how we can use `pandas`' `dt` accessors to work with dates/times in a dataset using the dataset you'll see in Lab 3: the Berkeley PD Calls for Service dataset.
+
+```{python}
+#| code-fold: true
+calls = pd.read_csv("data/Berkeley_PD_-_Calls_for_Service.csv")
+calls.head()
+```
+
+Looks like there are three columns with dates/times: `EVENTDT`, `EVENTTM`, and `InDbDate`.
+
+Most likely, `EVENTDT` stands for the date when the event took place, `EVENTTM` stands for the time of day the event took place (in 24-hour format), and `InDbDate` is the date the call was recorded into the database.
+
+If we check the data type of these columns, we will see they are stored as strings. We can convert them to `datetime` objects using pandas `to_datetime` function.
+
+```{python}
+#| code-fold: false
+calls["EVENTDT"] = pd.to_datetime(calls["EVENTDT"])
+calls.head()
+```
+
+Now, we can use the `dt` accessor on this column.
+
+We can get the month:
+
+```{python}
+#| code-fold: false
+calls["EVENTDT"].dt.month.head()
+```
+
+Which day of the week the date is on:
+
+```{python}
+#| code-fold: false
+calls["EVENTDT"].dt.dayofweek.head()
+```
+
+Check the minimum values to see if there are any suspicious-looking dates from the 1970s (a common sign of timestamps defaulting to the Unix epoch):
+
+```{python}
+#| code-fold: false
+calls.sort_values("EVENTDT").head()
+```
+
+Doesn't look like it! We are good!
+
+
+We can also do many things with the `dt` accessor like switching time zones and converting time back to UNIX/POSIX time. Check out the documentation on [`.dt` accessor](https://pandas.pydata.org/docs/user_guide/basics.html#basics-dt-accessors) and [time series/date functionality](https://pandas.pydata.org/docs/user_guide/timeseries.html#).
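+
+For example, here is a sketch building on the `calls` `DataFrame` above (whether localizing to Pacific time is appropriate depends on how the timestamps were recorded):
+
+```python
+# localize the naive event dates to Pacific time, then convert to UTC
+calls_utc = (calls["EVENTDT"]
+             .dt.tz_localize("America/Los_Angeles")
+             .dt.tz_convert("UTC"))
+print(calls_utc.head())
+
+# UNIX/POSIX time: seconds elapsed since 1970-01-01 00:00 UTC
+unix_seconds = (calls["EVENTDT"] - pd.Timestamp("1970-01-01")) // pd.Timedelta("1s")
+print(unix_seconds.head())
+```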
+
+## Faithfulness
+
+At this stage in our data cleaning and EDA workflow, we've achieved quite a lot: we've identified how our data is structured, come to terms with what information it encodes, and gained insight as to how it was generated. Throughout this process, we should always recall the original intent of our work in Data Science – to use data to better understand and model the real world. To achieve this goal, we need to ensure that the data we use is faithful to reality; that is, that our data accurately captures the "real world."
+
+Data used in research or industry is often "messy" – there may be errors or inaccuracies that impact the faithfulness of the dataset. Signs that data may not be faithful include:
+
+* Unrealistic or "incorrect" values, such as negative counts, locations that don't exist, or dates set in the future
+* Violations of obvious dependencies, like an age that does not match a birthday
+* Clear signs that data was entered by hand, which can lead to spelling errors or fields that are incorrectly shifted
+* Signs of data falsification, such as fake email addresses or repeated use of the same names
+* Duplicated records or fields containing the same information
+* Truncated data, e.g. older versions of Microsoft Excel limited spreadsheets to 65,536 rows and 256 columns
+
+We often solve some of these more common issues in the following ways (a quick programmatic check is sketched after this list):
+
+* Spelling errors: apply corrections or drop records that aren't in a dictionary
+* Time zone inconsistencies: convert to a common time zone (e.g. UTC)
+* Duplicated records or fields: identify and eliminate duplicates (using primary keys)
+* Unspecified or inconsistent units: infer the units and check that values are in reasonable ranges in the data
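+
+For example, a few of these checks can be written as one-liners in `pandas`. The sketch below uses a small, hypothetical `df` with `age`, `birth_year`, and `email` columns and assumes the data was collected in 2024:
+
+```python
+import pandas as pd
+
+# hypothetical records, assumed to have been collected in 2024
+df = pd.DataFrame({
+    "age": [21, -3, 35],
+    "birth_year": [2003, 2001, 1989],
+    "email": ["oski@berkeley.edu", "test@test.com", "oski@berkeley.edu"],
+})
+collection_year = 2024
+
+# unrealistic values: negative ages
+print(df[df["age"] < 0])
+
+# violated dependencies: age inconsistent with birth year (allowing a one-year slack)
+print(df[(collection_year - df["birth_year"] - df["age"]).abs() > 1])
+
+# duplicated records
+print(df.duplicated().sum())
+```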
+
+### Missing Values
+Another common issue encountered with real-world datasets is that of missing data. One strategy to resolve this is to simply drop any records with missing values from the dataset. This does, however, introduce the risk of inducing biases – it is possible that the missing or corrupt records may be systemically related to some feature of interest in the data. Another solution is to keep the data as `NaN` values.
+
+A third method to address missing data is to perform **imputation**: infer the missing values using other data available in the dataset. There is a wide variety of imputation techniques that can be implemented; some of the most common are listed below.
+
+* Average imputation: replace missing values with the average value for that field
+* Hot deck imputation: replace missing values with a value drawn at random from a similar record
+* Regression imputation: develop a model to predict missing values
+* Multiple imputation: replace missing values with multiple random values
+
+Regardless of the strategy used to deal with missing data, we should think carefully about *why* particular records or fields may be missing – this can help inform whether or not the absence of these values is significant or meaningful.
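+
+For instance, average imputation and simple interpolation are both one-liners in `pandas` (a minimal sketch with a hypothetical series of measurements):
+
+```python
+import numpy as np
+import pandas as pd
+
+# hypothetical measurements with one missing value
+temps = pd.Series([14.0, 15.5, np.nan, 17.0])
+
+# average imputation: fill the gap with the mean of the observed values
+print(temps.fillna(temps.mean()))
+
+# interpolation: estimate the gap from its neighbors
+print(temps.interpolate())
+```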
+
+# EDA Demo 1: Tuberculosis in the United States
+
+Now, let's walk through the data-cleaning and EDA workflow to see what we can learn about the presence of Tuberculosis in the United States!
+
+We will examine the data included in the [original CDC article](https://www.cdc.gov/mmwr/volumes/71/wr/mm7112a1.htm?s_cid=mm7112a1_w#T1_down) published in 2022.
+
+
+## CSVs and Field Names
+Suppose Table 1 was saved as a CSV file located in `data/cdc_tuberculosis.csv`.
+
+We can then explore the CSV (which is a text file, and does not contain binary-encoded data) in many ways:
+
+1. Using a text editor like emacs, vim, VSCode, etc.
+2. Opening the CSV directly in DataHub (read-only), Excel, Google Sheets, etc.
+3. The `Python` file object
+4. `pandas`, using `pd.read_csv()`
+
+To try out options 1 and 2, you can view or download the Tuberculosis CSV file from the [lecture demo notebook](https://data100.datahub.berkeley.edu/hub/user-redirect/git-pull?repo=https%3A%2F%2Fgithub.com%2FDS-100%2Ffa23-student&urlpath=lab%2Ftree%2Ffa23-student%2Flecture%2Flec05%2Flec04-eda.ipynb&branch=main) under the `data` folder in the left-hand menu. Notice how the CSV file is a type of **rectangular data (i.e., tabular data) stored as comma-separated values**.
+
+Next, let's try out option 3 using the `Python` file object. We'll look at the first four lines:
+
+```{python}
+#| code-fold: true
+with open("data/cdc_tuberculosis.csv", "r") as f:
+ i = 0
+ for row in f:
+ print(row)
+ i += 1
+ if i > 3:
+ break
+```
+
+Whoa, why are there blank lines interspersed between the lines of the CSV?
+
+You may recall that all line breaks in text files are encoded as the special newline character `\n`. Each `row` we read already ends in a newline, and `Python`'s `print()` adds an additional newline on top of that, which produces the blank lines.
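+
+One quick way to confirm this is to suppress `print`'s own newline (a small sketch):
+
+```python
+with open("data/cdc_tuberculosis.csv", "r") as f:
+    for i, row in enumerate(f):
+        print(row, end="")  # keep the file's newline; skip print's extra one
+        if i >= 3:
+            break
+```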
+
+If you're curious, we can use the `repr()` function to return the raw string with all special characters:
+
+```{python}
+#| code-fold: true
+with open("data/cdc_tuberculosis.csv", "r") as f:
+ i = 0
+ for row in f:
+ print(repr(row)) # print raw strings
+ i += 1
+ if i > 3:
+ break
+```
+
+Finally, let's try option 4 and use the tried-and-true Data 100 approach: `pandas`.
+
+```{python}
+#| code-fold: false
+tb_df = pd.read_csv("data/cdc_tuberculosis.csv")
+tb_df.head()
+```
+
+You may notice some strange things about this table: what's up with the "Unnamed" column names and the first row?
+
+Congratulations — you're ready to wrangle your data! Because of how things are stored, we'll need to clean the data a bit to name our columns better.
+
+A reasonable first step is to identify the row with the right header. The `pd.read_csv()` function ([documentation](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html)) has the convenient `header` parameter that we can set to use the row at index 1 as the column names:
+
+```{python}
+#| code-fold: false
+tb_df = pd.read_csv("data/cdc_tuberculosis.csv", header=1) # row index
+tb_df.head(5)
+```
+
+Wait...but now we can't differentiate between the "Number of TB cases" and "TB incidence" year columns. `pandas` has tried to make our lives easier by automatically adding ".1" to the latter columns, but this doesn't help us, as humans, understand the data.
+
+We can do this manually with `df.rename()` ([documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.rename.html?highlight=rename#pandas.DataFrame.rename)):
+
+```{python}
+#| code-fold: false
+rename_dict = {'2019': 'TB cases 2019',
+ '2020': 'TB cases 2020',
+ '2021': 'TB cases 2021',
+ '2019.1': 'TB incidence 2019',
+ '2020.1': 'TB incidence 2020',
+ '2021.1': 'TB incidence 2021'}
+tb_df = tb_df.rename(columns=rename_dict)
+tb_df.head(5)
+```
+
+## Record Granularity
+
+You might already be wondering: what's up with that first record?
+
+Row 0 is what we call a **rollup record**, or summary record. It's often useful when displaying tables to humans. The **granularity** of record 0 (Totals) vs the rest of the records (States) is different.
+
+Okay, EDA step two. How was the rollup record aggregated?
+
+Let's check if Total TB cases is the sum of all state TB cases. If we sum over all rows, we should get **2x** the total cases in each of our TB cases by year (why do you think this is?).
+
+```{python}
+#| code-fold: true
+tb_df.sum(axis=0)
+```
+
+Whoa, what's going on with the TB cases in 2019, 2020, and 2021? Check out the column types:
+
+```{python}
+#| code-fold: true
+tb_df.dtypes
+```
+
+Since there are commas in the values for TB cases, the numbers are read as the `object` datatype, or **storage type** (close to the `Python` string datatype), so `pandas` is concatenating strings instead of adding integers (recall that `Python` can "sum", or concatenate, strings together: `"data" + "100"` evaluates to `"data100"`).
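+
+One manual fix would be to strip the separators and cast the columns ourselves. A sketch (operating on a copy so we don't disturb `tb_df`):
+
+```python
+# manual fix: strip the thousands separators and cast to integers
+case_cols = ["TB cases 2019", "TB cases 2020", "TB cases 2021"]
+tb_fixed = tb_df.copy()
+for col in case_cols:
+    tb_fixed[col] = tb_fixed[col].str.replace(",", "", regex=False).astype(int)
+tb_fixed[case_cols].sum()
+```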
+
+
+Fortunately `read_csv` also has a `thousands` parameter ([documentation](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html)):
+
+```{python}
+#| code-fold: false
+# improve readability: chaining method calls with outer parentheses/line breaks
+tb_df = (
+ pd.read_csv("data/cdc_tuberculosis.csv", header=1, thousands=',')
+ .rename(columns=rename_dict)
+)
+tb_df.head(5)
+```
+
+```{python}
+#| code-fold: false
+tb_df.sum()
+```
+
+The Total TB cases look right. Phew!
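+
+We can also compare the rollup record against the state records directly (a quick sketch):
+
+```python
+# the rollup (Total) record should equal the sum of the state records
+case_cols = ["TB cases 2019", "TB cases 2020", "TB cases 2021"]
+print(tb_df.loc[0, case_cols])
+print(tb_df.loc[1:, case_cols].sum())
+```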
+
+Let's just look at the records with **state-level granularity**:
+
+```{python}
+#| code-fold: true
+state_tb_df = tb_df[1:]
+state_tb_df.head(5)
+```
+
+## Gather Census Data
+
+U.S. Census population estimates [source](https://www.census.gov/data/tables/time-series/demo/popest/2010s-state-total.html) (2019), [source](https://www.census.gov/data/tables/time-series/demo/popest/2020s-state-total.html) (2020-2021).
+
+Running the cells below cleans the data.
+There are a few new methods here:
+
+* `df.convert_dtypes()` ([documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.convert_dtypes.html)) conveniently converts all float dtypes into ints; the details are out of scope for this class.
+* `df.dropna()` ([documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.dropna.html)) will be explained in more detail next time.
+
+```{python}
+#| code-fold: true
+# 2010s census data
+census_2010s_df = pd.read_csv("data/nst-est2019-01.csv", header=3, thousands=",")
+census_2010s_df = (
+ census_2010s_df
+ .reset_index()
+ .drop(columns=["index", "Census", "Estimates Base"])
+ .rename(columns={"Unnamed: 0": "Geographic Area"})
+ .convert_dtypes() # "smart" converting of columns, use at your own risk
+ .dropna() # we'll introduce this next time
+)
+census_2010s_df['Geographic Area'] = census_2010s_df['Geographic Area'].str.strip('.')
+
+# with pd.option_context('display.min_rows', 30): # shows more rows
+# display(census_2010s_df)
+
+census_2010s_df.head(5)
+```
+
+Occasionally, you will want to modify code that you have imported. To reimport those modifications you can either use `python`'s `importlib` library:
+
+```python
+from importlib import reload
+reload(utils)
+```
+
+or use `IPython` magic, which will intelligently reimport code when files change:
+
+```python
+%load_ext autoreload
+%autoreload 2
+```
+
+```{python}
+#| code-fold: true
+# census 2020s data
+census_2020s_df = pd.read_csv("data/NST-EST2022-POP.csv", header=3, thousands=",")
+census_2020s_df = (
+ census_2020s_df
+ .reset_index()
+ .drop(columns=["index", "Unnamed: 1"])
+ .rename(columns={"Unnamed: 0": "Geographic Area"})
+ .convert_dtypes() # "smart" converting of columns, use at your own risk
+ .dropna() # we'll introduce this next time
+)
+census_2020s_df['Geographic Area'] = census_2020s_df['Geographic Area'].str.strip('.')
+
+census_2020s_df.head(5)
+```
+
+## Joining Data (Merging `DataFrame`s)
+
+Time to `merge`! Here we use the `DataFrame` method `df1.merge(right=df2, ...)` on `DataFrame` `df1` ([documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.merge.html)). Contrast this with the function `pd.merge(left=df1, right=df2, ...)` ([documentation](https://pandas.pydata.org/docs/reference/api/pandas.merge.html?highlight=pandas%20merge#pandas.merge)). Feel free to use either.
+
+```{python}
+#| code-fold: false
+# merge TB DataFrame with two US census DataFrames
+tb_census_df = (
+ tb_df
+ .merge(right=census_2010s_df,
+ left_on="U.S. jurisdiction", right_on="Geographic Area")
+ .merge(right=census_2020s_df,
+ left_on="U.S. jurisdiction", right_on="Geographic Area")
+)
+tb_census_df.head(5)
+```
+
+Having all of these columns is a little unwieldy. We could either drop the unneeded columns now, or just merge on smaller census `DataFrame`s. Let's do the latter.
+
+```{python}
+#| code-fold: false
+# try merging again, but cleaner this time
+tb_census_df = (
+ tb_df
+ .merge(right=census_2010s_df[["Geographic Area", "2019"]],
+ left_on="U.S. jurisdiction", right_on="Geographic Area")
+ .drop(columns="Geographic Area")
+ .merge(right=census_2020s_df[["Geographic Area", "2020", "2021"]],
+ left_on="U.S. jurisdiction", right_on="Geographic Area")
+ .drop(columns="Geographic Area")
+)
+tb_census_df.head(5)
+```
+
+## Reproducing Data: Compute Incidence
+
+Let's recompute incidence to make sure we know where the original CDC numbers came from.
+
+From the [CDC report](https://www.cdc.gov/mmwr/volumes/71/wr/mm7112a1.htm?s_cid=mm7112a1_w#T1_down): TB incidence is computed as “Cases per 100,000 persons using mid-year population estimates from the U.S. Census Bureau.”
+
+If we define a group as 100,000 people, then we can compute the TB incidence for a given state population as
+
+$$\text{TB incidence} = \frac{\text{TB cases in population}}{\text{groups in population}} = \frac{\text{TB cases in population}}{\text{population}/100000} $$
+
+$$= \frac{\text{TB cases in population}}{\text{population}} \times 100000$$
+
+Let's try this for 2019:
+
+```{python}
+#| code-fold: false
+tb_census_df["recompute incidence 2019"] = tb_census_df["TB cases 2019"]/tb_census_df["2019"]*100000
+tb_census_df.head(5)
+```
+
+Awesome!!!
+
+Let's use a for-loop and `Python` format strings to compute TB incidence for all years. `Python` f-strings are just used for the purposes of this demo, but they're handy to know when you explore data beyond this course ([documentation](https://docs.python.org/3/tutorial/inputoutput.html)).
+
+```{python}
+#| code-fold: false
+# recompute incidence for all years
+for year in [2019, 2020, 2021]:
+ tb_census_df[f"recompute incidence {year}"] = tb_census_df[f"TB cases {year}"]/tb_census_df[f"{year}"]*100000
+tb_census_df.head(5)
+```
+
+These numbers look pretty close!!! There are a few discrepancies in the hundredths place, particularly in 2021. It may be useful to explore the reasons behind these discrepancies further.
+
+```{python}
+#| code-fold: false
+tb_census_df.describe()
+```
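+
+To quantify how close, we can compare the reported and recomputed incidence rates directly (a quick sketch):
+
+```python
+# largest absolute gap between reported and recomputed incidence, per year
+for year in [2019, 2020, 2021]:
+    gap = (tb_census_df[f"TB incidence {year}"]
+           - tb_census_df[f"recompute incidence {year}"]).abs().max()
+    print(year, round(gap, 4))
+```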
+
+## Bonus EDA: Reproducing the Reported Statistic
+
+
+**How do we reproduce that reported statistic in the original [CDC report](https://www.cdc.gov/mmwr/volumes/71/wr/mm7112a1.htm?s_cid=mm7112a1_w)?**
+
+> Reported TB incidence (cases per 100,000 persons) increased **9.4%**, from **2.2** during 2020 to **2.4** during 2021 but was lower than incidence during 2019 (2.7). Increases occurred among both U.S.-born and non–U.S.-born persons.
+
+This is TB incidence computed across the entire U.S. population! How do we reproduce this?
+
+* We need to reproduce the "Total" TB incidences in our rolled record.
+* But our current `tb_census_df` only has 51 entries (50 states plus Washington, D.C.). There is no rolled record.
+* What happened...?
+
+Let's get exploring!
+
+Before we keep exploring, we'll set all indexes to more meaningful values, instead of just numbers that pertain to some row at some point. This will make our cleaning slightly easier.
+
+```{python}
+#| code-fold: true
+tb_df = tb_df.set_index("U.S. jurisdiction")
+tb_df.head(5)
+```
+
+```{python}
+#| code-fold: false
+census_2010s_df = census_2010s_df.set_index("Geographic Area")
+census_2010s_df.head(5)
+```
+
+```{python}
+#| code-fold: false
+census_2020s_df = census_2020s_df.set_index("Geographic Area")
+census_2020s_df.head(5)
+```
+
+It turns out that our merge above only kept state records, even though our original `tb_df` had the "Total" rolled record:
+
+```{python}
+#| code-fold: false
+tb_df.head()
+```
+
+Recall that `merge` performs an **inner** merge by default, meaning that it only preserves keys that are present in **both** `DataFrame`s.
+
+The rolled records in our census `DataFrame` have different `Geographic Area` fields, which was the key we merged on:
+
+```{python}
+#| code-fold: false
+census_2010s_df.head(5)
+```
+
+The Census `DataFrame` has several rolled records. The aggregate record we are looking for actually has the Geographic Area named "United States".
+
+One straightforward way to get the right merge is to rename the value itself. Because we now have the Geographic Area index, we'll use `df.rename()` ([documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.rename.html)):
+
+```{python}
+#| code-fold: false
+# rename rolled record for 2010s
+census_2010s_df.rename(index={'United States':'Total'}, inplace=True)
+census_2010s_df.head(5)
+```
+
+```{python}
+#| code-fold: false
+# same, but for 2020s rename rolled record
+census_2020s_df.rename(index={'United States':'Total'}, inplace=True)
+census_2020s_df.head(5)
+```
+
+<br/>
+
+Next let's rerun our merge. Note the different chaining, because we are now merging on indexes (`df.merge()` [documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.merge.html)).
+
+```{python}
+#| code-fold: false
+tb_census_df = (
+ tb_df
+ .merge(right=census_2010s_df[["2019"]],
+ left_index=True, right_index=True)
+ .merge(right=census_2020s_df[["2020", "2021"]],
+ left_index=True, right_index=True)
+)
+tb_census_df.head(5)
+```
+
+<br/>
+
+Finally, let's recompute our incidences:
+
+```{python}
+#| code-fold: false
+# recompute incidence for all years
+for year in [2019, 2020, 2021]:
+ tb_census_df[f"recompute incidence {year}"] = tb_census_df[f"TB cases {year}"]/tb_census_df[f"{year}"]*100000
+tb_census_df.head(5)
+```
+
+We reproduced the total U.S. incidences correctly!
+
+We're almost there. Let's revisit the quote:
+
+> Reported TB incidence (cases per 100,000 persons) increased **9.4%**, from **2.2** during 2020 to **2.4** during 2021 but was lower than incidence during 2019 (2.7). Increases occurred among both U.S.-born and non–U.S.-born persons.
+
+Recall that percent change from $A$ to $B$ is computed as
+$\text{percent change} = \frac{B - A}{A} \times 100$.
+
+```{python}
+#| code-fold: false
+#| tags: []
+incidence_2020 = tb_census_df.loc['Total', 'recompute incidence 2020']
+incidence_2020
+```
+
+```{python}
+#| code-fold: false
+#| tags: []
+incidence_2021 = tb_census_df.loc['Total', 'recompute incidence 2021']
+incidence_2021
+```
+
+```{python}
+#| code-fold: false
+#| tags: []
+difference = (incidence_2021 - incidence_2020)/incidence_2020 * 100
+difference
+```
+
+# EDA Demo 2: Mauna Loa CO<sub>2</sub> Data -- A Lesson in Data Faithfulness
+
+[Mauna Loa Observatory](https://gml.noaa.gov/ccgg/trends/data.html) has been monitoring CO<sub>2</sub> concentrations since 1958.
+
+```{python}
+#| code-fold: false
+co2_file = "data/co2_mm_mlo.txt"
+```
+
+Let's do some **EDA**!!
+
+## Reading this file into Pandas?
+Let's instead check out this `.txt` file. Some questions to keep in mind: Do we trust this file extension? What structure is it?
+
+Lines 71-78 (inclusive) are shown below:
+
+ line number | file contents
+
+ 71 | # decimal average interpolated trend #days
+ 72 | # date (season corr)
+ 73 | 1958 3 1958.208 315.71 315.71 314.62 -1
+ 74 | 1958 4 1958.292 317.45 317.45 315.29 -1
+ 75 | 1958 5 1958.375 317.50 317.50 314.71 -1
+ 76 | 1958 6 1958.458 -99.99 317.10 314.85 -1
+ 77 | 1958 7 1958.542 315.86 315.86 314.98 -1
+ 78 | 1958 8 1958.625 314.93 314.93 315.94 -1
+
+
+Notice how:
+
+- The values are separated by white space, possibly tabs.
+- The data values line up in columns down the rows. For example, the month always appears in the 7th to 8th position of each line.
+- The 71st and 72nd lines in the file contain column headings split over two lines.
+
+We can use `read_csv` to read the data into a `pandas` `DataFrame`, and we provide several arguments to specify that the separators are white space, there is no header (**we will set our own column names**), and to skip the first 72 rows of the file.
+
+```{python}
+#| code-fold: false
+co2 = pd.read_csv(
+ co2_file, header = None, skiprows = 72,
+    sep = r'\s+' # delimiter for continuous whitespace (stay tuned for regex next lecture)
+)
+co2.head()
+```
+
+Congratulations! You've wrangled the data!
+
+<br/>
+
+...But our columns aren't named.
+**We need to do more EDA.**
+
+## Exploring Variable Feature Types
+
+The NOAA [webpage](https://gml.noaa.gov/ccgg/trends/) might have some useful tidbits (in this case it doesn't).
+
+Using this information, we'll rerun `pd.read_csv`, but this time with some **custom column names.**
+
+```{python}
+#| code-fold: false
+co2 = pd.read_csv(
+ co2_file, header = None, skiprows = 72,
+    sep = r'\s+', # regex for continuous whitespace (next lecture)
+ names = ['Yr', 'Mo', 'DecDate', 'Avg', 'Int', 'Trend', 'Days']
+)
+co2.head()
+```
+
+## Visualizing CO<sub>2</sub>
+Scientific studies tend to have very clean data, right...? Let's jump right in and make a time series plot of CO2 monthly averages.
+
+```{python}
+#| code-fold: true
+sns.lineplot(x='DecDate', y='Avg', data=co2);
+```
+
+The code above uses the `seaborn` plotting library (abbreviated `sns`). We will cover it in the Visualization lecture; for now, you don't need to worry about how it works!
+
+Yikes! Plotting the data uncovered a problem. The sharp vertical lines suggest that we have some **missing values**. What happened here?
+
+```{python}
+#| code-fold: false
+co2.head()
+```
+
+```{python}
+#| code-fold: false
+co2.tail()
+```
+
+Some data have unusual values like -1 and -99.99.
+
+Let's check the description at the top of the file again.
+
+* -1 signifies a missing value for the number of days `Days` the equipment was in operation that month.
+* -99.99 denotes a missing monthly average `Avg`
+
+How can we fix this? First, let's explore other aspects of our data. Understanding our data will help us decide what to do with the missing values.
+
+<br/>
+
+
+## Sanity Checks: Reasoning about the data
+First, we consider the shape of the data. How many rows should we have?
+
+* If the data is in chronological order, we should have one record per month.
+* The data run from March 1958 to August 2019.
+* We should have $ 12 \times (2019-1957) - 2 - 4 = 738 $ records: 12 months for each of the 62 years from 1958 through 2019, minus the 2 missing months at the start of 1958 and the 4 missing months at the end of 2019.
+
+```{python}
+#| code-fold: false
+co2.shape
+```
+
+Nice!! The number of rows (i.e., records) matches our expectations.
+
+<br/>
+
+
+Let's now check the quality of each feature.
+
+## Understanding Missing Value 1: `Days`
+`Days` is a time field, so let's analyze other time fields to see if there is an explanation for missing values of days of operation.
+
+Let's start with **months**, `Mo`.
+
+Are we missing any records? Each month should appear 61 or 62 times (March 1958-August 2019).
+
+```{python}
+#| code-fold: false
+co2["Mo"].value_counts().sort_index()
+```
+
+As expected, Jan, Feb, Sep, Oct, Nov, and Dec have 61 occurrences, and the rest have 62.
+
+<br/>
+
+Next let's explore **days** `Days` itself, which is the number of days that the measurement equipment worked.
+
+```{python}
+#| code-fold: true
+sns.displot(co2['Days']);
+plt.title("Distribution of days feature"); # suppresses unneeded plotting output
+```
+
+In terms of data quality, a handful of months have averages based on measurements taken on fewer than half the days. In addition, there are nearly 200 missing values--**that's about 27% of the data**!
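+
+We can verify that figure directly (a quick sketch):
+
+```python
+# fraction of months where Days is missing (encoded as -1)
+missing_days = co2["Days"] == -1
+print(missing_days.sum(), f"{missing_days.mean():.0%}")
+```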
+
+<br/>
+
+Finally, let's check the last time feature, **year** `Yr`.
+
+Let's check to see if there is any connection between missing-ness and the year of the recording.
+
+```{python}
+#| code-fold: true
+sns.scatterplot(x="Yr", y="Days", data=co2);
+plt.title("Day field by Year"); # the ; suppresses output
+```
+
+**Observations**:
+
+* All of the missing data are in the early years of operation.
+* It appears there may have been problems with equipment in the mid to late 80s.
+
+**Potential Next Steps**:
+
+* Confirm these explanations through documentation about the historical readings.
+* Maybe drop the earliest recordings? However, we would want to delay such action until after we have examined the time trends and assessed whether there are any potential problems.
+
+<br/>
+
+## Understanding Missing Value 2: `Avg`
+Next, let's return to the -99.99 values in `Avg` to analyze the overall quality of the CO<sub>2</sub> measurements. We'll plot a histogram of the average CO<sub>2</sub> measurements.
+
+```{python}
+#| code-fold: true
+# Histograms of average CO2 measurements
+sns.displot(co2['Avg']);
+```
+
+The non-missing values are in the 300-400 range (a regular range of CO2 levels).
+
+We also see that there are only a few missing `Avg` values (**<1% of values**). Let's examine all of them:
+
+```{python}
+#| code-fold: false
+co2[co2["Avg"] < 0]
+```
+
+There doesn't seem to be a pattern to these values, other than that most records also were missing `Days` data.
+
+## Drop, `NaN`, or Impute Missing `Avg` Data?
+
+How should we address the invalid `Avg` data?
+
+1. Drop records
+2. Set to NaN
+3. Impute using some strategy
+
+Remember we want to fix the following plot:
+
+```{python}
+#| code-fold: true
+sns.lineplot(x='DecDate', y='Avg', data=co2)
+plt.title("CO2 Average By Month");
+```
+
+Since we are plotting `Avg` vs `DecDate`, we should just focus on dealing with missing values for `Avg`.
+
+
+Let's consider a few options:
+
+1. Drop those records
+2. Replace -99.99 with NaN
+3. Substitute -99.99 with a likely value for the average CO2
+
+What do you think are the pros and cons of each possible action?
+
+<br/>
+
+
+Let's examine each of these three options.
+
+```{python}
+#| code-fold: false
+# 1. Drop missing values
+co2_drop = co2[co2['Avg'] > 0]
+co2_drop.head()
+```
+
+```{python}
+#| code-fold: false
+# 2. Replace -99.99 with NaN
+co2_NA = co2.replace(-99.99, np.nan)
+co2_NA.head()
+```
+
+We'll also use a third version of the data.
+
+First, we note that the dataset already comes with a **substitute value** for the -99.99.
+
+From the file description:
+
+> The `interpolated` column includes average values from the preceding column (`average`)
+and **interpolated values** where data are missing. Interpolated values are
+computed in two steps...
+
+The `Int` feature has values that exactly match those in `Avg`, except when `Avg` is -99.99, and then a **reasonable** estimate is used instead.
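+
+We can verify this claim with a quick check (a sketch):
+
+```python
+# Avg and Int should only disagree where Avg is the -99.99 sentinel
+mismatch = co2[co2["Avg"] != co2["Int"]]
+print(len(mismatch), (mismatch["Avg"] == -99.99).all())
+```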
+
+So, the third version of our data will use the `Int` feature instead of `Avg`.
+
+```{python}
+#| code-fold: false
+# 3. Use interpolated column which estimates missing Avg values
+co2_impute = co2.copy()
+co2_impute['Avg'] = co2['Int']
+co2_impute.head()
+```
+
+What's a **reasonable** estimate?
+
+To answer this question, let's zoom in on a short time period, say the measurements in 1958 (where we know we have two missing values).
+
+```{python}
+#| code-fold: true
+# results of plotting data in 1958
+
+def line_and_points(data, ax, title):
+ # assumes single year, hence Mo
+ ax.plot('Mo', 'Avg', data=data)
+ ax.scatter('Mo', 'Avg', data=data)
+ ax.set_xlim(2, 13)
+ ax.set_title(title)
+ ax.set_xticks(np.arange(3, 13))
+
+def data_year(data, year):
+    return data[data["Yr"] == year]
+
+# uses matplotlib subplots
+# you may see more next week; focus on output for now
+fig, axes = plt.subplots(ncols = 3, figsize=(12, 4), sharey=True)
+
+year = 1958
+line_and_points(data_year(co2_drop, year), axes[0], title="1. Drop Missing")
+line_and_points(data_year(co2_NA, year), axes[1], title="2. Missing Set to NaN")
+line_and_points(data_year(co2_impute, year), axes[2], title="3. Missing Interpolated")
+
+fig.suptitle(f"Monthly Averages for {year}")
+plt.tight_layout()
+```
+
+In the big picture, since there are only 7 `Avg` values missing (**<1%** of 738 months), any of these approaches would work.
+
+However, there is some appeal to **option 3: imputing**:
+
+* Shows seasonal trends for CO2
+* We are plotting all months in our data as a line plot
+
+<br/>
+
+
+Let's replot our original figure with option 3:
+
+```{python}
+#| code-fold: true
+sns.lineplot(x='DecDate', y='Avg', data=co2_impute)
+plt.title("CO2 Average By Month, Imputed");
+```
+
+Looks pretty close to what we see on the NOAA [website](https://gml.noaa.gov/ccgg/trends/)!
+
+## Presenting the data: A Discussion on Data Granularity
+
+From the description:
+
+* Monthly measurements are averages of daily average measurements.
+* The NOAA GML website has datasets for daily/hourly measurements too.
+
+The data you present depends on your research question.
+
+**How do CO2 levels vary by season?**
+
+* You might want to keep average monthly data.
+
+**Are CO2 levels rising over the past 50+ years, consistent with global warming predictions?**
+
+* You might be happier with a **coarser granularity** of average year data!
+
+```{python}
+#| code-fold: true
+co2_year = co2_impute.groupby('Yr').mean()
+sns.lineplot(x='Yr', y='Avg', data=co2_year)
+plt.title("CO2 Average By Year");
+```
+
+Indeed, we see a rise by nearly 100 ppm of CO2 since Mauna Loa began recording in 1958.
+
+# Summary
+We went over a lot of content this lecture; let's summarize the most important points:
+
+## Dealing with Missing Values
+There are a few options we can take to deal with missing data:
+
+* Drop missing records
+* Keep `NaN` missing values
+* Impute using an interpolated column
+
+## EDA and Data Wrangling
+There are several ways to approach EDA and Data Wrangling:
+
+* Examine the **data and metadata**: what is the date, size, organization, and structure of the data?
+* Examine each **field/attribute/dimension** individually.
+* Examine pairs of related dimensions (e.g. breaking down grades by major).
+* Along the way, we can:
+ * **Visualize** or summarize the data.
+ * **Validate assumptions** about data and its collection process. Pay particular attention to when the data was collected.
+ * Identify and **address anomalies**.
+ * Apply data transformations and corrections (we'll cover this in the upcoming lecture).
+ * **Record everything you do!** Developing in Jupyter Notebook promotes *reproducibility* of your own work!
diff --git a/docs/eda/eda_files/figure-html/cell-62-output-1.png b/docs/eda/eda_files/figure-html/cell-62-output-1.png
index a04218cf..f392d5f9 100644
Binary files a/docs/eda/eda_files/figure-html/cell-62-output-1.png and b/docs/eda/eda_files/figure-html/cell-62-output-1.png differ
diff --git a/docs/eda/eda_files/figure-html/cell-67-output-1.png b/docs/eda/eda_files/figure-html/cell-67-output-1.png
new file mode 100644
index 00000000..be96b8c9
Binary files /dev/null and b/docs/eda/eda_files/figure-html/cell-67-output-1.png differ
diff --git a/docs/eda/eda_files/figure-html/cell-67-output-2.png b/docs/eda/eda_files/figure-html/cell-67-output-2.png
deleted file mode 100644
index 31857f62..00000000
Binary files a/docs/eda/eda_files/figure-html/cell-67-output-2.png and /dev/null differ
diff --git a/docs/eda/eda_files/figure-html/cell-68-output-1.png b/docs/eda/eda_files/figure-html/cell-68-output-1.png
index 67c3959d..ffd29ff8 100644
Binary files a/docs/eda/eda_files/figure-html/cell-68-output-1.png and b/docs/eda/eda_files/figure-html/cell-68-output-1.png differ
diff --git a/docs/eda/eda_files/figure-html/cell-69-output-1.png b/docs/eda/eda_files/figure-html/cell-69-output-1.png
new file mode 100644
index 00000000..29088928
Binary files /dev/null and b/docs/eda/eda_files/figure-html/cell-69-output-1.png differ
diff --git a/docs/eda/eda_files/figure-html/cell-69-output-2.png b/docs/eda/eda_files/figure-html/cell-69-output-2.png
deleted file mode 100644
index fb28f5d5..00000000
Binary files a/docs/eda/eda_files/figure-html/cell-69-output-2.png and /dev/null differ
diff --git a/docs/eda/eda_files/figure-html/cell-71-output-1.png b/docs/eda/eda_files/figure-html/cell-71-output-1.png
index 39cac822..49ef3d6a 100644
Binary files a/docs/eda/eda_files/figure-html/cell-71-output-1.png and b/docs/eda/eda_files/figure-html/cell-71-output-1.png differ
diff --git a/docs/eda/eda_files/figure-html/cell-75-output-1.png b/docs/eda/eda_files/figure-html/cell-75-output-1.png
index 6382e58a..15a5fe82 100644
Binary files a/docs/eda/eda_files/figure-html/cell-75-output-1.png and b/docs/eda/eda_files/figure-html/cell-75-output-1.png differ
diff --git a/docs/eda/eda_files/figure-html/cell-76-output-1.png b/docs/eda/eda_files/figure-html/cell-76-output-1.png
index db2b0dee..40b1fc71 100644
Binary files a/docs/eda/eda_files/figure-html/cell-76-output-1.png and b/docs/eda/eda_files/figure-html/cell-76-output-1.png differ
diff --git a/docs/eda/eda_files/figure-html/cell-77-output-1.png b/docs/eda/eda_files/figure-html/cell-77-output-1.png
index 897b8b39..99b6c2d1 100644
Binary files a/docs/eda/eda_files/figure-html/cell-77-output-1.png and b/docs/eda/eda_files/figure-html/cell-77-output-1.png differ
diff --git a/docs/feature_engineering/feature_engineering.html b/docs/feature_engineering/feature_engineering.html
index ea770e7f..22d26788 100644
--- a/docs/feature_engineering/feature_engineering.html
+++ b/docs/feature_engineering/feature_engineering.html
@@ -556,7 +556,7 @@
my_model.fit(X, Y)
-LinearRegression()
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.LinearRegression()
+LinearRegression()
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.LinearRegression()
Notice that we use double brackets to extract this column. Why double brackets instead of just single brackets? The .fit
method, by default, expects to receive 2-dimensional data – some kind of data that includes both rows and columns. Writing penguins["flipper_length_mm"]
would return a 1D Series
, causing sklearn
to error. We avoid this by writing penguins[["flipper_length_mm"]]
to produce a 2D DataFrame
.
@@ -607,7 +607,7 @@
print(f"The RMSE of the model is {np.sqrt(np.mean((Y-Y_hat_two_features)**2))}")
-The RMSE of the model is 0.9881331104079044
+The RMSE of the model is 0.9881331104079045
We can also see that we obtain the same predictions using sklearn
as we did when applying the ordinary least squares formula before!
@@ -977,7 +977,7 @@
print(f"MSE of model with (hp^2) feature: {np.mean((Y-hp2_model_predictions)**2)}")
-MSE of model with (hp^2) feature: 18.984768907617223
+MSE of model with (hp^2) feature: 18.984768907617216
diff --git a/docs/feature_engineering/feature_engineering_files/figure-html/cell-16-output-2.png b/docs/feature_engineering/feature_engineering_files/figure-html/cell-16-output-2.png
index 92cb01c9..f8396667 100644
Binary files a/docs/feature_engineering/feature_engineering_files/figure-html/cell-16-output-2.png and b/docs/feature_engineering/feature_engineering_files/figure-html/cell-16-output-2.png differ
diff --git a/docs/feature_engineering/feature_engineering_files/figure-html/cell-17-output-2.png b/docs/feature_engineering/feature_engineering_files/figure-html/cell-17-output-2.png
index f4ae4ea0..ceecd30f 100644
Binary files a/docs/feature_engineering/feature_engineering_files/figure-html/cell-17-output-2.png and b/docs/feature_engineering/feature_engineering_files/figure-html/cell-17-output-2.png differ
diff --git a/docs/gradient_descent/gradient_descent.html b/docs/gradient_descent/gradient_descent.html
index 467ee5fb..ed238d2c 100644
--- a/docs/gradient_descent/gradient_descent.html
+++ b/docs/gradient_descent/gradient_descent.html
@@ -106,7 +106,7 @@
require.undef("plotly");
requirejs.config({
paths: {
- 'plotly': ['https://cdn.plot.ly/plotly-2.25.2.min']
+ 'plotly': ['https://cdn.plot.ly/plotly-2.12.1.min']
}
});
require(['plotly'], function(Plotly) {
@@ -439,9 +439,9 @@
-
@@ -4395,9 +4383,9 @@
-
+
-
+
@@ -4481,10 +4469,10 @@
-# 3. Use interpolated column which estimates missing Avg values
-co2_impute = co2.copy()
-co2_impute['Avg'] = co2['Int']
-co2_impute.head()
+# 3. Use interpolated column which estimates missing Avg values
+co2_impute = co2.copy()
+co2_impute['Avg'] = co2['Int']
+co2_impute.head()
@@ -4564,30 +4552,30 @@
Code
-# results of plotting data in 1958
-
-def line_and_points(data, ax, title):
- # assumes single year, hence Mo
- ax.plot('Mo', 'Avg', data=data)
- ax.scatter('Mo', 'Avg', data=data)
- ax.set_xlim(2, 13)
- ax.set_title(title)
- ax.set_xticks(np.arange(3, 13))
-
-def data_year(data, year):
- return data[data["Yr"] == 1958]
-
-# uses matplotlib subplots
-# you may see more next week; focus on output for now
-fig, axes = plt.subplots(ncols = 3, figsize=(12, 4), sharey=True)
-
-year = 1958
-line_and_points(data_year(co2_drop, year), axes[0], title="1. Drop Missing")
-line_and_points(data_year(co2_NA, year), axes[1], title="2. Missing Set to NaN")
-line_and_points(data_year(co2_impute, year), axes[2], title="3. Missing Interpolated")
-
-fig.suptitle(f"Monthly Averages for {year}")
-plt.tight_layout()
+# results of plotting data in 1958
+
+def line_and_points(data, ax, title):
+ # assumes single year, hence Mo
+ ax.plot('Mo', 'Avg', data=data)
+ ax.scatter('Mo', 'Avg', data=data)
+ ax.set_xlim(2, 13)
+ ax.set_title(title)
+ ax.set_xticks(np.arange(3, 13))
+
+def data_year(data, year):
+ return data[data["Yr"] == 1958]
+
+# uses matplotlib subplots
+# you may see more next week; focus on output for now
+fig, axes = plt.subplots(ncols = 3, figsize=(12, 4), sharey=True)
+
+year = 1958
+line_and_points(data_year(co2_drop, year), axes[0], title="1. Drop Missing")
+line_and_points(data_year(co2_NA, year), axes[1], title="2. Missing Set to NaN")
+line_and_points(data_year(co2_impute, year), axes[2], title="3. Missing Interpolated")
+
+fig.suptitle(f"Monthly Averages for {year}")
+plt.tight_layout()
@@ -4604,8 +4592,8 @@
Code
-
+
@@ -4632,9 +4620,9 @@
Code
-
+
@@ -4975,1218 +4963,1218 @@ <
Source Code
----
-title: Data Cleaning and EDA
-execute:
- echo: true
-format:
- html:
- code-fold: true
- code-tools: true
- toc: true
- toc-title: Data Cleaning and EDA
- page-layout: full
- theme:
- - cosmo
- - cerulean
- callout-icon: false
-jupyter: python3
----
-
-```{python}
-#| code-fold: true
-import numpy as np
-import pandas as pd
-
-import matplotlib.pyplot as plt
-import seaborn as sns
-#%matplotlib inline
-plt.rcParams['figure.figsize'] = (12, 9)
-
-sns.set()
-sns.set_context('talk')
-np.set_printoptions(threshold=20, precision=2, suppress=True)
-pd.set_option('display.max_rows', 30)
-pd.set_option('display.max_columns', None)
-pd.set_option('display.precision', 2)
-# This option stops scientific notation for pandas
-pd.set_option('display.float_format', '{:.2f}'.format)
-
-# Silence some spurious seaborn warnings
-import warnings
-warnings.filterwarnings("ignore", category=FutureWarning)
-```
-
-::: {.callout-note collapse="false"}
-## Learning Outcomes
-* Recognize common file formats
-* Categorize data by its variable type
-* Build awareness of issues with data faithfulness and develop targeted solutions
-:::
-
-**This content is covered in lectures 4, 5, and 6.**
-
-In the past few lectures, we've learned that `pandas` is a toolkit to restructure, modify, and explore a dataset. What we haven't yet touched on is *how* to make these data transformation decisions. When we receive a new set of data from the "real world," how do we know what processing we should do to convert this data into a usable form?
-
-**Data cleaning**, also called **data wrangling**, is the process of transforming raw data to facilitate subsequent analysis. It is often used to address issues like:
-
-* Unclear structure or formatting
-* Missing or corrupted values
-* Unit conversions
-* ...and so on
-
-**Exploratory Data Analysis (EDA)** is the process of understanding a new dataset. It is an open-ended, informal analysis that involves familiarizing ourselves with the variables present in the data, discovering potential hypotheses, and identifying possible issues with the data. This last point can often motivate further data cleaning to address any problems with the dataset's format; because of this, EDA and data cleaning are often thought of as an "infinite loop," with each process driving the other.
-
-In this lecture, we will consider the key properties of data to consider when performing data cleaning and EDA. In doing so, we'll develop a "checklist" of sorts for you to consider when approaching a new dataset. Throughout this process, we'll build a deeper understanding of this early (but very important!) stage of the data science lifecycle.
-
-## Structure
-
-### File Formats
-There are many file types for storing structured data: TSV, JSON, XML, ASCII, SAS, etc. We'll only cover CSV, TSV, and JSON in lecture, but you'll likely encounter other formats as you work with different datasets. Reading documentation is your best bet for understanding how to process the multitude of different file types.
-
-#### CSV
-CSVs, which stand for **Comma-Separated Values**, are a common tabular data format.
-In the past two `pandas` lectures, we briefly touched on the idea of file format: the way data is encoded in a file for storage. Specifically, our `elections` and `babynames` datasets were stored and loaded as CSVs:
-
-```{python}
-#| code-fold: false
-pd.read_csv("data/elections.csv").head(5)
-```
-
-To better understand the properties of a CSV, let's take a look at the first few rows of the raw data file to see what it looks like before being loaded into a `DataFrame`. We'll use the `repr()` function to return the raw string with its special characters:
-
-```{python}
-#| code-fold: false
-with open("data/elections.csv", "r") as table:
- i = 0
- for row in table:
- print(repr(row))
- i += 1
- if i > 3:
- break
-```
-
-Each row, or **record**, in the data is delimited by a newline `\n`. Each column, or **field**, in the data is delimited by a comma `,` (hence, comma-separated!).
-
-#### TSV
-
-Another common file type is **TSV (Tab-Separated Values)**. In a TSV, records are still delimited by a newline `\n`, while fields are delimited by `\t` tab character.
-
-Let's check out the first few rows of the raw TSV file. Again, we'll use the `repr()` function so that `print` shows the special characters.
-
-```{python}
-#| code-fold: false
-with open("data/elections.txt", "r") as table:
- i = 0
- for row in table:
- print(repr(row))
- i += 1
- if i > 3:
- break
-```
-
-TSVs can be loaded into `pandas` using `pd.read_csv`. We'll need to specify the **delimiter** with parameter` sep='\t'` [(documentation)](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html).
-
-```{python}
-#| code-fold: false
-pd.read_csv("data/elections.txt", sep='\t').head(3)
-```
-
-An issue with CSVs and TSVs comes up whenever there are commas or tabs within the records. How does `pandas` differentiate between a comma delimiter vs. a comma within the field itself, for example `8,900`? To remedy this, check out the [`quotechar` parameter](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html).
-
-#### JSON
-**JSON (JavaScript Object Notation)** files behave similarly to Python dictionaries. A raw JSON is shown below.
-
-```{python}
-#| code-fold: false
-with open("data/elections.json", "r") as table:
- i = 0
- for row in table:
- print(row)
- i += 1
- if i > 8:
- break
-```
-
-JSON files can be loaded into `pandas` using `pd.read_json`.
-
-```{python}
-#| code-fold: false
-pd.read_json('data/elections.json').head(3)
-```
-
-##### EDA with JSON: Berkeley COVID-19 Data
-The City of Berkeley Open Data [website](https://data.cityofberkeley.info/Health/COVID-19-Confirmed-Cases/xn6j-b766) has a dataset with COVID-19 Confirmed Cases among Berkeley residents by date. Let's download the file and save it as a JSON (note the source URL file type is also a JSON). In the interest of reproducible data science, we will download the data programatically. We have defined some helper functions in the [`ds100_utils.py`](https://ds100.org/fa23/resources/assets/lectures/lec05/lec05-eda.html) file that we can reuse these helper functions in many different notebooks.
-
-```{python}
-#| code-fold: false
-from ds100_utils import fetch_and_cache
-
-covid_file = fetch_and_cache(
- "https://data.cityofberkeley.info/api/views/xn6j-b766/rows.json?accessType=DOWNLOAD",
- "confirmed-cases.json",
- force=False)
-covid_file # a file path wrapper object
-```
-
-###### File Size
-Let's start our analysis by getting a rough estimate of the size of the dataset to inform the tools we use to view the data. For relatively small datasets, we can use a text editor or spreadsheet. For larger datasets, more programmatic exploration or distributed computing tools may be more fitting. Here we will use `Python` tools to probe the file.
-
-Since there seem to be text files, let's investigate the number of lines, which often corresponds to the number of records
-
-```{python}
-#| code-fold: false
-import os
-
-print(covid_file, "is", os.path.getsize(covid_file) / 1e6, "MB")
-
-with open(covid_file, "r") as f:
- print(covid_file, "is", sum(1 for l in f), "lines.")
-```
-
-###### Unix Commands
-As part of the EDA workflow, Unix commands can come in very handy. In fact, there's an entire book called ["Data Science at the Command Line"](https://datascienceatthecommandline.com/) that explores this idea in depth!
-In Jupyter/IPython, you can prefix lines with `!` to execute arbitrary Unix commands, and within those lines, you can refer to `Python` variables and expressions with the syntax `{expr}`.
-
-Here, we use the `ls` command to list files, using the `-lh` flags, which request "long format with information in human-readable form." We also use the `wc` command for "word count," but with the `-l` flag, which asks for line counts instead of words.
-
-These two give us the same information as the code above, albeit in a slightly different form:
-
-```{python}
-#| code-fold: false
-!ls -lh {covid_file}
-!wc -l {covid_file}
-```
-
-###### File Contents
-Let's explore the data format using `Python`.
-
-```{python}
-#| code-fold: false
-with open(covid_file, "r") as f:
- for i, row in enumerate(f):
- print(repr(row)) # print raw strings
- if i >= 4: break
-```
-
-We can use the `head` Unix command (which is where `pandas`' `head` method comes from!) to see the first few lines of the file:
-
-```{python}
-#| code-fold: false
-!head -5 {covid_file}
-```
-
-In order to load the JSON file into `pandas`, Let's first do some EDA with `Python`'s `json` package to understand the particular structure of this JSON file so that we can decide what (if anything) to load into `pandas`. `Python` has relatively good support for JSON data since it closely matches the internal python object model. In the following cell we import the entire JSON datafile into a python dictionary using the `json` package.
-
-```{python}
-#| code-fold: false
-import json
-
-with open(covid_file, "rb") as f:
- covid_json = json.load(f)
-```
-
-The `covid_json` variable is now a dictionary encoding the data in the file:
-
-```{python}
-#| code-fold: false
-type(covid_json)
-```
-
-We can examine what keys are in the top level json object by listing out the keys.
-
-```{python}
-#| code-fold: false
-covid_json.keys()
-```
-
-**Observation**: The JSON dictionary contains a `meta` key which likely refers to meta data (data about the data). Meta data often maintained with the data and can be a good source of additional information.
-
-
-We can investigate the meta data further by examining the keys associated with the metadata.
-
-```{python}
-#| code-fold: false
-covid_json['meta'].keys()
-```
-
-The `meta` key contains another dictionary called `view`. This likely refers to meta-data about a particular "view" of some underlying database. We will learn more about views when we study SQL later in the class.
-
-```{python}
-#| code-fold: false
-covid_json['meta']['view'].keys()
-```
-
-Notice that this a nested/recursive data structure. As we dig deeper we reveal more and more keys and the corresponding data:
-
-```
-meta
-|-> data
- | ... (haven't explored yet)
-|-> view
- | -> id
- | -> name
- | -> attribution
- ...
- | -> description
- ...
- | -> columns
- ...
-```
-
-
-There is a key called description in the view sub dictionary. This likely contains a description of the data:
-
-```{python}
-#| code-fold: false
-print(covid_json['meta']['view']['description'])
-```
-
-###### Examining the Data Field for Records
-
-We can look at a few entries in the `data` field. This is what we'll load into `pandas`.
-
-```{python}
-#| code-fold: false
-for i in range(3):
- print(f"{i:03} | {covid_json['data'][i]}")
-```
-
-Observations:
-* These look like equal-length records, so maybe `data` is a table!
-* But what do each of values in the record mean? Where can we find column headers?
-
-For that, we'll need the `columns` key in the metadata dictionary. This returns a list:
-
-```{python}
-#| code-fold: false
-type(covid_json['meta']['view']['columns'])
-```
-
-###### Summary of exploring the JSON file
-
-1. The above **metadata** tells us a lot about the columns in the data including column names, potential data anomalies, and a basic statistic.
-1. Because of its non-tabular structure, JSON makes it easier (than CSV) to create **self-documenting data**, meaning that information about the data is stored in the same file as the data.
-1. Self-documenting data can be helpful since it maintains its own description and these descriptions are more likely to be updated as data changes.
-
-###### Loading COVID Data into `pandas`
-Finally, let's load the data (not the metadata) into a `pandas` `DataFrame`. In the following block of code we:
-
-1. Translate the JSON records into a `DataFrame`:
-
- * fields: `covid_json['meta']['view']['columns']`
- * records: `covid_json['data']`
-
-
-1. Remove columns that have no metadata description. This would be a bad idea in general, but here we remove these columns since the above analysis suggests they are unlikely to contain useful information.
-
-1. Examine the `tail` of the table.
-
-```{python}
-#| code-fold: false
-# Load the data from JSON and assign column titles
-covid = pd.DataFrame(
- covid_json['data'],
- columns=[c['name'] for c in covid_json['meta']['view']['columns']])
-
-covid.tail()
-```
-
-### Variable Types
-
-After loading data into a file, it's a good idea to take the time to understand what pieces of information are encoded in the dataset. In particular, we want to identify what variable types are present in our data. Broadly speaking, we can categorize variables into one of two overarching types.
-
-**Quantitative variables** describe some numeric quantity or amount. We can divide quantitative data further into:
-
-* **Continuous quantitative variables**: numeric data that can be measured on a continuous scale to arbitrary precision. Continuous variables do not have a strict set of possible values – they can be recorded to any number of decimal places. For example, weights, GPA, or CO<sub>2</sub> concentrations.
-* **Discrete quantitative variables**: numeric data that can only take on a finite set of possible values. For example, someone's age or the number of siblings they have.
-
-**Qualitative variables**, also known as **categorical variables**, describe data that isn't measuring some quantity or amount. The sub-categories of categorical data are:
-
-* **Ordinal qualitative variables**: categories with ordered levels. Specifically, ordinal variables are those where the difference between levels has no consistent, quantifiable meaning. Some examples include levels of education (high school, undergrad, grad, etc.), income bracket (low, medium, high), or Yelp rating.
-* **Nominal qualitative variables**: categories with no specific order. For example, someone's political affiliation or Cal ID number.
-
-![Classification of variable types](images/variable.png)
-
-Note that many variables don't sit neatly in just one of these categories. Qualitative variables could have numeric levels, and conversely, quantitative variables could be stored as strings.
-
-### Primary and Foreign Keys
-
-Last time, we introduced `.merge` as the `pandas` method for joining multiple `DataFrame`s together. In our discussion of joins, we touched on the idea of using a "key" to determine what rows should be merged from each table. Let's take a moment to examine this idea more closely.
-
-The **primary key** is the column or set of columns in a table that *uniquely* determine the values of the remaining columns. It can be thought of as the unique identifier for each individual row in the table. For example, a table of Data 100 students might use each student's Cal ID as the primary key.
-
-```{python}
-#| echo: false
-pd.DataFrame({"Cal ID":[3034619471, 3035619472, 3025619473, 3046789372], \
- "Name":["Oski", "Ollie", "Orrie", "Ollie"], \
- "Major":["Data Science", "Computer Science", "Data Science", "Economics"]})
-```
-
-The **foreign key** is the column or set of columns in a table that reference primary keys in other tables. Knowing a dataset's foreign keys can be useful when assigning the `left_on` and `right_on` parameters of `.merge`. In the table of office hour tickets below, `"Cal ID"` is a foreign key referencing the previous table.
-
-```{python}
-#| echo: false
-pd.DataFrame({"OH Request":[1, 2, 3, 4], \
- "Cal ID":[3034619471, 3035619472, 3025619473, 3035619472], \
- "Question":["HW 2 Q1", "HW 2 Q3", "Lab 3 Q4", "HW 2 Q7"]})
-```
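-
-As a quick illustration of how these keys drive a join (using hypothetical variable names, since the display cells above don't assign any), merging the office hours tickets with the student table might look like the following sketch:
-
-```python
-# Hypothetical names for the two tables displayed above.
-students = pd.DataFrame({"Cal ID": [3034619471, 3035619472, 3025619473, 3046789372],
-                         "Name": ["Oski", "Ollie", "Orrie", "Ollie"],
-                         "Major": ["Data Science", "Computer Science", "Data Science", "Economics"]})
-requests = pd.DataFrame({"OH Request": [1, 2, 3, 4],
-                         "Cal ID": [3034619471, 3035619472, 3025619473, 3035619472],
-                         "Question": ["HW 2 Q1", "HW 2 Q3", "Lab 3 Q4", "HW 2 Q7"]})
-
-# The foreign key in `requests` matches the primary key in `students`.
-requests.merge(students, left_on="Cal ID", right_on="Cal ID")
-```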
-
-## Granularity, Scope, and Temporality
-
-After understanding the structure of the dataset, the next task is to determine what exactly the data represents. We'll do so by considering the data's granularity, scope, and temporality.
-
-### Granularity
-The **granularity** of a dataset is what a single row represents. You can also think of it as the level of detail included in the data. To determine the data's granularity, ask: what does each row in the dataset represent? Fine-grained data contains a high level of detail, with a single row representing a small individual unit. For example, each record may represent one person. Coarse-grained data is encoded such that a single row represents a large individual unit – for example, each record may represent a group of people.
-
-### Scope
-The **scope** of a dataset is the subset of the population covered by the data. If we were investigating student performance in Data Science courses, a dataset with a narrow scope might encompass all students enrolled in Data 100 whereas a dataset with an expansive scope might encompass all students in California.
-
-### Temporality
-The **temporality** of a dataset describes the periodicity over which the data was collected as well as when the data was most recently collected or updated.
-
-Time and date fields of a dataset could represent a few things:
-
-1. when the "event" happened
-2. when the data was collected, or when it was entered into the system
-3. when the data was copied into the database
-
-To fully understand the temporality of the data, it also may be necessary to standardize time zones or inspect recurring time-based trends in the data (do patterns recur in 24-hour periods? Over the course of a month? Seasonally?). The convention for standardizing time is Coordinated Universal Time (UTC), an international time standard measured at 0 degrees longitude that stays consistent throughout the year (no daylight saving time). Berkeley's time zone, Pacific Standard Time (PST), is UTC-8; during daylight saving time, Pacific Daylight Time (PDT) is UTC-7.
-
-#### Temporality with `pandas`' `dt` accessors
-Let's briefly look at how we can use `pandas`' `dt` accessors to work with dates/times in a dataset using the dataset you'll see in Lab 3: the Berkeley PD Calls for Service dataset.
-
-```{python}
-#| code-fold: true
-calls = pd.read_csv("data/Berkeley_PD_-_Calls_for_Service.csv")
-calls.head()
-```
-
-Looks like there are three columns with dates/times: `EVENTDT`, `EVENTTM`, and `InDbDate`.
-
-Most likely, `EVENTDT` stands for the date when the event took place, `EVENTTM` stands for the time of day the event took place (in 24-hr format), and `InDbDate` is the date this call is recorded onto the database.
-
-If we check the data type of these columns, we will see they are stored as strings. We can convert them to `datetime` objects using pandas `to_datetime` function.
-
-```{python}
-#| code-fold: false
-calls["EVENTDT"] = pd.to_datetime(calls["EVENTDT"])
-calls.head()
-```
-
-Now, we can use the `dt` accessor on this column.
-
-We can get the month:
-
-```{python}
-#| code-fold: false
-calls["EVENTDT"].dt.month.head()
-```
-
-Which day of the week the date is on:
-
-```{python}
-#| code-fold: false
-calls["EVENTDT"].dt.dayofweek.head()
-```
-
-Check the minimum values to see if there are any suspicious-looking dates from the 1970s:
-
-```{python}
-#| code-fold: false
-calls.sort_values("EVENTDT").head()
-```
-
-Doesn't look like it! We are good!
-
-
-We can also do many things with the `dt` accessor like switching time zones and converting time back to UNIX/POSIX time. Check out the documentation on [`.dt` accessor](https://pandas.pydata.org/docs/user_guide/basics.html#basics-dt-accessors) and [time series/date functionality](https://pandas.pydata.org/docs/user_guide/timeseries.html#).
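-
-For instance, a minimal sketch of time zone handling and UNIX timestamps with this dataset (assuming `EVENTDT` carries no time zone information yet) might look like:
-
-```python
-# Attach a time zone, convert it, and view the underlying UNIX timestamps.
-berkeley_time = calls["EVENTDT"].dt.tz_localize("US/Pacific")
-utc_time = berkeley_time.dt.tz_convert("UTC")
-unix_seconds = calls["EVENTDT"].astype("int64") // 10**9  # nanoseconds since epoch -> seconds
-```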
-
-## Faithfulness
-
-At this stage in our data cleaning and EDA workflow, we've achieved quite a lot: we've identified how our data is structured, come to terms with what information it encodes, and gained insight as to how it was generated. Throughout this process, we should always recall the original intent of our work in Data Science – to use data to better understand and model the real world. To achieve this goal, we need to ensure that the data we use is faithful to reality; that is, that our data accurately captures the "real world."
-
-Data used in research or industry is often "messy" – there may be errors or inaccuracies that impact the faithfulness of the dataset. Signs that data may not be faithful include:
-
-* Unrealistic or "incorrect" values, such as negative counts, locations that don't exist, or dates set in the future
-* Violations of obvious dependencies, like an age that does not match a birthday
-* Clear signs that data was entered by hand, which can lead to spelling errors or fields that are incorrectly shifted
-* Signs of data falsification, such as fake email addresses or repeated use of the same names
-* Duplicated records or fields containing the same information
-* Truncated data, e.g. older versions of Microsoft Excel limited spreadsheets to 65,536 rows and 256 columns
-
-We often solve some of these more common issues in the following ways:
-
-* Spelling errors: apply corrections or drop records that aren't in a dictionary
-* Time zone inconsistencies: convert to a common time zone (e.g. UTC)
-* Duplicated records or fields: identify and eliminate duplicates (using primary keys)
-* Unspecified or inconsistent units: infer the units and check that values are in reasonable ranges in the data
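-
-For instance, the de-duplication fix above might look like the following sketch, assuming a `DataFrame` `df` whose primary key is a `"Cal ID"` column:
-
-```python
-# Count rows that repeat an existing primary key, then keep only the first record per key.
-num_dupes = df.duplicated(subset=["Cal ID"]).sum()
-df_unique = df.drop_duplicates(subset=["Cal ID"])
-```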
-
-### Missing Values
-Another common issue encountered with real-world datasets is that of missing data. One strategy to resolve this is to simply drop any records with missing values from the dataset. This does, however, introduce the risk of inducing biases – it is possible that the missing or corrupt records may be systemically related to some feature of interest in the data. Another solution is to keep the data as `NaN` values.
-
-A third method to address missing data is to perform **imputation**: infer the missing values using other data available in the dataset. There is a wide variety of imputation techniques that can be implemented; some of the most common are listed below.
-
-* Average imputation: replace missing values with the average value for that field
-* Hot deck imputation: replace missing values with some random value
-* Regression imputation: develop a model to predict missing values
-* Multiple imputation: replace missing values with multiple random values
-
-Regardless of the strategy used to deal with missing data, we should think carefully about *why* particular records or fields may be missing – this can help inform whether or not the absence of these values is significant or meaningful.
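-
-As a rough sketch of the first two techniques listed above (on a hypothetical numeric column `"weight"` of a `DataFrame` `df`):
-
-```python
-# Average imputation: fill missing values with the column mean.
-df["weight_mean_imputed"] = df["weight"].fillna(df["weight"].mean())
-
-# Hot deck imputation: fill each missing value with a randomly sampled observed value.
-observed = df["weight"].dropna()
-df["weight_hot_deck"] = df["weight"].apply(
-    lambda v: observed.sample(1).iloc[0] if pd.isna(v) else v
-)
-```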
-
-# EDA Demo 1: Tuberculosis in the United States
-
-Now, let's walk through the data-cleaning and EDA workflow to see what we can learn about the presence of tuberculosis in the United States!
-
-We will examine the data included in the [original CDC article](https://www.cdc.gov/mmwr/volumes/71/wr/mm7112a1.htm?s_cid=mm7112a1_w#T1_down) published in 2021.
-
-
-## CSVs and Field Names
-Suppose Table 1 was saved as a CSV file located in `data/cdc_tuberculosis.csv`.
-
-We can then explore the CSV (which is a text file, and does not contain binary-encoded data) in many ways:
-1. Using a text editor like emacs, vim, VSCode, etc.
-2. Opening the CSV directly in DataHub (read-only), Excel, Google Sheets, etc.
-3. The `Python` file object
-4. `pandas`, using `pd.read_csv()`
-
-To try out options 1 and 2, you can view or download the tuberculosis dataset from the [lecture demo notebook](https://data100.datahub.berkeley.edu/hub/user-redirect/git-pull?repo=https%3A%2F%2Fgithub.com%2FDS-100%2Ffa23-student&urlpath=lab%2Ftree%2Ffa23-student%2Flecture%2Flec05%2Flec04-eda.ipynb&branch=main) under the `data` folder in the left-hand menu. Notice how the CSV file is a type of **rectangular data (i.e., tabular data) stored as comma-separated values**.
-
-Next, let's try out option 3 using the `Python` file object. We'll look at the first four lines:
-
-```{python}
-#| code-fold: true
-with open("data/cdc_tuberculosis.csv", "r") as f:
- i = 0
- for row in f:
- print(row)
- i += 1
- if i > 3:
- break
-```
-
-Whoa, why are there blank lines interspersed between the lines of the CSV?
-
-You may recall that all line breaks in text files are encoded as the special newline character `\n`. Python's `print()` prints each string (including the newline), and an additional newline on top of that.
-
-If you're curious, we can use the `repr()` function to return the raw string with all special characters:
-
-```{python}
-#| code-fold: true
-with open("data/cdc_tuberculosis.csv", "r") as f:
- i = 0
- for row in f:
- print(repr(row)) # print raw strings
- i += 1
- if i > 3:
- break
-```
-
-Finally, let's try option 4 and use the tried-and-true Data 100 approach: `pandas`.
-
-```{python}
-#| code-fold: false
-tb_df = pd.read_csv("data/cdc_tuberculosis.csv")
-tb_df.head()
-```
-
-You may notice some strange things about this table: what's up with the "Unnamed" column names and the first row?
-
-Congratulations — you're ready to wrangle your data! Because of how things are stored, we'll need to clean the data a bit to name our columns better.
-
-A reasonable first step is to identify the row with the right header. The `pd.read_csv()` function ([documentation](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html)) has the convenient `header` parameter that we can set to use the elements in row 1 as the appropriate columns:
-
-```{python}
-#| code-fold: false
-tb_df = pd.read_csv("data/cdc_tuberculosis.csv", header=1) # row index
-tb_df.head(5)
-```
-
-Wait...but now we can't differentiate between the "Number of TB cases" and "TB incidence" year columns. `pandas` has tried to make our lives easier by automatically adding ".1" to the latter columns, but this doesn't help us, as humans, understand the data.
-
-We can rename the columns manually with `df.rename()` ([documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.rename.html?highlight=rename#pandas.DataFrame.rename)):
-
-```{python}
-#| code-fold: false
-rename_dict = {'2019': 'TB cases 2019',
- '2020': 'TB cases 2020',
- '2021': 'TB cases 2021',
- '2019.1': 'TB incidence 2019',
- '2020.1': 'TB incidence 2020',
- '2021.1': 'TB incidence 2021'}
-tb_df = tb_df.rename(columns=rename_dict)
-tb_df.head(5)
-```
-
-## Record Granularity
-
-You might already be wondering: what's up with that first record?
-
-Row 0 is what we call a **rollup record**, or summary record. It's often useful when displaying tables to humans. The **granularity** of record 0 (Totals) vs the rest of the records (States) is different.
-
-Okay, EDA step two. How was the rollup record aggregated?
-
-Let's check if Total TB cases is the sum of all state TB cases. If we sum over all rows, we should get **2x** the total cases in each of our TB cases by year (why do you think this is?).
-
-```{python}
-#| code-fold: true
-tb_df.sum(axis=0)
-```
-
-Whoa, what's going on with the TB cases in 2019, 2020, and 2021? Check out the column types:
-
-```{python}
-#| code-fold: true
-tb_df.dtypes
-```
-
-Since there are commas in the values for TB cases, the numbers are read as the `object` datatype, or **storage type** (close to the `Python` string datatype), so `pandas` is concatenating strings instead of adding integers (recall that `Python` can "sum", or concatenate, strings together: `"data" + "100"` evaluates to `"data100"`).
-
-
-Fortunately `read_csv` also has a `thousands` parameter ([documentation](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html)):
-
-```{python}
-#| code-fold: false
-# improve readability: chaining method calls with outer parentheses/line breaks
-tb_df = (
- pd.read_csv("data/cdc_tuberculosis.csv", header=1, thousands=',')
- .rename(columns=rename_dict)
-)
-tb_df.head(5)
-```
-
-```{python}
-#| code-fold: false
-tb_df.sum()
-```
-
-The Total TB cases look right. Phew!
-
-Let's just look at the records with **state-level granularity**:
-
-```{python}
-#| code-fold: true
-state_tb_df = tb_df[1:]
-state_tb_df.head(5)
-```
-
-## Gather Census Data
-
-U.S. Census population estimates [source](https://www.census.gov/data/tables/time-series/demo/popest/2010s-state-total.html) (2019), [source](https://www.census.gov/data/tables/time-series/demo/popest/2020s-state-total.html) (2020-2021).
-
-Running the cells below cleans the data.
-There are a few new methods here:
-* `df.convert_dtypes()` ([documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.convert_dtypes.html)) conveniently converts all float dtypes into ints and is out of scope for the class.
-* `df.dropna()` ([documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.dropna.html)) will be explained in more detail next time.
-
-```{python}
-#| code-fold: true
-# 2010s census data
-census_2010s_df = pd.read_csv("data/nst-est2019-01.csv", header=3, thousands=",")
-census_2010s_df = (
- census_2010s_df
- .reset_index()
- .drop(columns=["index", "Census", "Estimates Base"])
- .rename(columns={"Unnamed: 0": "Geographic Area"})
- .convert_dtypes() # "smart" converting of columns, use at your own risk
- .dropna() # we'll introduce this next time
-)
-census_2010s_df['Geographic Area'] = census_2010s_df['Geographic Area'].str.strip('.')
-
-# with pd.option_context('display.min_rows', 30): # shows more rows
-# display(census_2010s_df)
-
-census_2010s_df.head(5)
-```
-
-Occasionally, you will want to modify code that you have imported. To reimport those modifications you can either use `python`'s `importlib` library:
-
-```python
-from importlib import reload
-reload(utils)
-```
-
-or use `iPython` magic which will intelligently import code when files change:
-
-```python
-%load_ext autoreload
-%autoreload 2
-```
-
-```{python}
-#| code-fold: true
-# census 2020s data
-census_2020s_df = pd.read_csv("data/NST-EST2022-POP.csv", header=3, thousands=",")
-census_2020s_df = (
- census_2020s_df
- .reset_index()
- .drop(columns=["index", "Unnamed: 1"])
- .rename(columns={"Unnamed: 0": "Geographic Area"})
- .convert_dtypes() # "smart" converting of columns, use at your own risk
- .dropna() # we'll introduce this next time
-)
-census_2020s_df['Geographic Area'] = census_2020s_df['Geographic Area'].str.strip('.')
-
-census_2020s_df.head(5)
-```
-
-## Joining Data (Merging `DataFrame`s)
-
-Time to `merge`! Here we use the `DataFrame` method `df1.merge(right=df2, ...)` on `DataFrame` `df1` ([documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.merge.html)). Contrast this with the function `pd.merge(left=df1, right=df2, ...)` ([documentation](https://pandas.pydata.org/docs/reference/api/pandas.merge.html?highlight=pandas%20merge#pandas.merge)). Feel free to use either.
-
-```{python}
-#| code-fold: false
-# merge TB DataFrame with two US census DataFrames
-tb_census_df = (
- tb_df
- .merge(right=census_2010s_df,
- left_on="U.S. jurisdiction", right_on="Geographic Area")
- .merge(right=census_2020s_df,
- left_on="U.S. jurisdiction", right_on="Geographic Area")
-)
-tb_census_df.head(5)
-```
-
-Having all of these columns is a little unwieldy. We could either drop the unneeded columns now, or just merge on smaller census `DataFrame`s. Let's do the latter.
-
-```{python}
-#| code-fold: false
-# try merging again, but cleaner this time
-tb_census_df = (
- tb_df
- .merge(right=census_2010s_df[["Geographic Area", "2019"]],
- left_on="U.S. jurisdiction", right_on="Geographic Area")
- .drop(columns="Geographic Area")
- .merge(right=census_2020s_df[["Geographic Area", "2020", "2021"]],
- left_on="U.S. jurisdiction", right_on="Geographic Area")
- .drop(columns="Geographic Area")
-)
-tb_census_df.head(5)
-```
-
-## Reproducing Data: Compute Incidence
-
-Let's recompute incidence to make sure we know where the original CDC numbers came from.
-
-From the [CDC report](https://www.cdc.gov/mmwr/volumes/71/wr/mm7112a1.htm?s_cid=mm7112a1_w#T1_down): TB incidence is computed as “Cases per 100,000 persons using mid-year population estimates from the U.S. Census Bureau.”
-
-If we define a group as 100,000 people, then we can compute the TB incidence for a given state population as
-
-$$\text{TB incidence} = \frac{\text{TB cases in population}}{\text{groups in population}} = \frac{\text{TB cases in population}}{\text{population}/100000} $$
-
-$$= \frac{\text{TB cases in population}}{\text{population}} \times 100000$$
-
-Let's try this for 2019:
-
-```{python}
-#| code-fold: false
-tb_census_df["recompute incidence 2019"] = tb_census_df["TB cases 2019"]/tb_census_df["2019"]*100000
-tb_census_df.head(5)
-```
-
-Awesome!!!
-
-Let's use a for-loop and `Python` format strings to compute TB incidence for all years. `Python` f-strings are just used for the purposes of this demo, but they're handy to know when you explore data beyond this course ([documentation](https://docs.python.org/3/tutorial/inputoutput.html)).
-
-```{python}
-#| code-fold: false
-# recompute incidence for all years
-for year in [2019, 2020, 2021]:
- tb_census_df[f"recompute incidence {year}"] = tb_census_df[f"TB cases {year}"]/tb_census_df[f"{year}"]*100000
-tb_census_df.head(5)
-```
-
-These numbers look pretty close!!! There are a few errors in the hundredths place, particularly in 2021. It may be useful to further explore reasons behind this discrepancy.
-
-```{python}
-#| code-fold: false
-tb_census_df.describe()
-```
-
-## Bonus EDA: Reproducing the Reported Statistic
-
-
-**How do we reproduce that reported statistic in the original [CDC report](https://www.cdc.gov/mmwr/volumes/71/wr/mm7112a1.htm?s_cid=mm7112a1_w)?**
-
-> Reported TB incidence (cases per 100,000 persons) increased **9.4%**, from **2.2** during 2020 to **2.4** during 2021 but was lower than incidence during 2019 (2.7). Increases occurred among both U.S.-born and non–U.S.-born persons.
-
-This is TB incidence computed across the entire U.S. population! How do we reproduce this?
-* We need to reproduce the "Total" TB incidences in our rolled record.
-* But our current `tb_census_df` only has 51 entries (50 states plus Washington, D.C.). There is no rolled record.
-* What happened...?
-
-Let's get exploring!
-
-Before we keep exploring, we'll set all indexes to more meaningful values, instead of just numbers that pertain to some row at some point. This will make our cleaning slightly easier.
-
-```{python}
-#| code-fold: true
-tb_df = tb_df.set_index("U.S. jurisdiction")
-tb_df.head(5)
-```
-
-```{python}
-#| code-fold: false
-census_2010s_df = census_2010s_df.set_index("Geographic Area")
-census_2010s_df.head(5)
-```
-
-```{python}
-#| code-fold: false
-census_2020s_df = census_2020s_df.set_index("Geographic Area")
-census_2020s_df.head(5)
-```
-
-It turns out that our merge above only kept state records, even though our original `tb_df` had the "Total" rolled record:
-
-```{python}
-#| code-fold: false
-tb_df.head()
-```
-
-Recall that `merge` performs an **inner** merge by default, meaning that it only preserves keys that are present in **both** `DataFrame`s.
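-
-One quick way to see which keys an inner merge will silently drop is to compare the two indexes directly; a small sketch:
-
-```python
-# Index values in the TB table with no exact match in the 2010s census table
-tb_df.index.difference(census_2010s_df.index)
-```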
-
-The rolled records in our census `DataFrame` have different `Geographic Area` fields, which was the key we merged on:
-
-```{python}
-#| code-fold: false
-census_2010s_df.head(5)
-```
-
-The Census `DataFrame` has several rolled records. The aggregate record we are looking for actually has the Geographic Area named "United States".
-
-One straightforward way to get the right merge is to rename the value itself. Because we now have the Geographic Area index, we'll use `df.rename()` ([documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.rename.html)):
-
-```{python}
-#| code-fold: false
-# rename rolled record for 2010s
-census_2010s_df.rename(index={'United States':'Total'}, inplace=True)
-census_2010s_df.head(5)
-```
-
-```{python}
-#| code-fold: false
-# same, but for 2020s rename rolled record
-census_2020s_df.rename(index={'United States':'Total'}, inplace=True)
-census_2020s_df.head(5)
-```
-
-<br/>
-
-Next let's rerun our merge. Note the different chaining, because we are now merging on indexes (`df.merge()` [documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.merge.html)).
-
-```{python}
-#| code-fold: false
-tb_census_df = (
- tb_df
- .merge(right=census_2010s_df[["2019"]],
- left_index=True, right_index=True)
- .merge(right=census_2020s_df[["2020", "2021"]],
- left_index=True, right_index=True)
-)
-tb_census_df.head(5)
-```
-
-<br/>
-
-Finally, let's recompute our incidences:
-
-```{python}
-#| code-fold: false
-# recompute incidence for all years
-for year in [2019, 2020, 2021]:
- tb_census_df[f"recompute incidence {year}"] = tb_census_df[f"TB cases {year}"]/tb_census_df[f"{year}"]*100000
-tb_census_df.head(5)
-```
-
-We reproduced the total U.S. incidences correctly!
-
-We're almost there. Let's revisit the quote:
-
-> Reported TB incidence (cases per 100,000 persons) increased **9.4%**, from **2.2** during 2020 to **2.4** during 2021 but was lower than incidence during 2019 (2.7). Increases occurred among both U.S.-born and non–U.S.-born persons.
-
-Recall that percent change from $A$ to $B$ is computed as
-$\text{percent change} = \frac{B - A}{A} \times 100$.
-
-```{python}
-#| code-fold: false
-#| tags: []
-incidence_2020 = tb_census_df.loc['Total', 'recompute incidence 2020']
-incidence_2020
-```
-
-```{python}
-#| code-fold: false
-#| tags: []
-incidence_2021 = tb_census_df.loc['Total', 'recompute incidence 2021']
-incidence_2021
-```
-
-```{python}
-#| code-fold: false
-#| tags: []
-difference = (incidence_2021 - incidence_2020)/incidence_2020 * 100
-difference
-```
-
-# EDA Demo 2: Mauna Loa CO<sub>2</sub> Data -- A Lesson in Data Faithfulness
-
-[Mauna Loa Observatory](https://gml.noaa.gov/ccgg/trends/data.html) has been monitoring CO<sub>2</sub> concentrations since 1958.
-
-```{python}
-#| code-fold: false
-co2_file = "data/co2_mm_mlo.txt"
-```
-
-Let's do some **EDA**!!
-
-## Reading this file into Pandas?
-Let's instead check out this `.txt` file. Some questions to keep in mind: Do we trust this file extension? What structure is it?
-
-Lines 71-78 (inclusive) are shown below:
-
- line number | file contents
-
- 71 | # decimal average interpolated trend #days
- 72 | # date (season corr)
- 73 | 1958 3 1958.208 315.71 315.71 314.62 -1
- 74 | 1958 4 1958.292 317.45 317.45 315.29 -1
- 75 | 1958 5 1958.375 317.50 317.50 314.71 -1
- 76 | 1958 6 1958.458 -99.99 317.10 314.85 -1
- 77 | 1958 7 1958.542 315.86 315.86 314.98 -1
- 78 | 1958 8 1958.625 314.93 314.93 315.94 -1
-
-
-Notice how:
-
-- The values are separated by white space, possibly tabs.
-- The data values are aligned in fixed-width columns down the rows. For example, the month appears in the 7th to 8th position of each line.
-- The 71st and 72nd lines in the file contain column headings split over two lines.
-
-We can use `read_csv` to read the data into a `pandas` `DataFrame`, and we provide several arguments to specify that the separators are white space, there is no header (**we will set our own column names later**), and to skip the first 72 rows of the file.
-
-```{python}
-#| code-fold: false
-co2 = pd.read_csv(
- co2_file, header = None, skiprows = 72,
-    sep = r'\s+'  # delimiter for continuous whitespace (stay tuned for regex next lecture)
-)
-co2.head()
-```
-
-Congratulations! You've wrangled the data!
-
-<br/>
-
-...But our columns aren't named.
-**We need to do more EDA.**
-
-## Exploring Variable Feature Types
-
-The NOAA [webpage](https://gml.noaa.gov/ccgg/trends/) might have some useful tidbits (in this case it doesn't).
-
-Using this information, we'll rerun `pd.read_csv`, but this time with some **custom column names.**
-
-```{python}
-#| code-fold: false
-co2 = pd.read_csv(
- co2_file, header = None, skiprows = 72,
-    sep = r'\s+',  # regex for continuous whitespace (next lecture)
- names = ['Yr', 'Mo', 'DecDate', 'Avg', 'Int', 'Trend', 'Days']
-)
-co2.head()
-```
-
-## Visualizing CO<sub>2</sub>
-Scientific studies tend to have very clean data, right...? Let's jump right in and make a time series plot of CO2 monthly averages.
-
-```{python}
-#| code-fold: true
-sns.lineplot(x='DecDate', y='Avg', data=co2);
-```
-
-The code above uses the `seaborn` plotting library (abbreviated `sns`). We will cover this in the Visualization lecture; for now, you don't need to worry about how it works!
-
-Yikes! Plotting the data uncovered a problem. The sharp vertical lines suggest that we have some **missing values**. What happened here?
-
-```{python}
-#| code-fold: false
-co2.head()
-```
-
-```{python}
-#| code-fold: false
-co2.tail()
-```
-
-Some data have unusual values like -1 and -99.99.
-
-Let's check the description at the top of the file again.
-
-* -1 signifies a missing value for the number of days `Days` the equipment was in operation that month.
-* -99.99 denotes a missing monthly average `Avg`
-
-How can we fix this? First, let's explore other aspects of our data. Understanding our data will help us decide what to do with the missing values.
-
-<br/>
-
-
-## Sanity Checks: Reasoning about the data
-First, we consider the shape of the data. How many rows should we have?
-
-* If the data are in chronological order, we should have one record per month.
-* Data from March 1958 to August 2019.
-* We should have $ 12 \times (2019-1957) - 2 - 4 = 738 $ records.
-
-```{python}
-#| code-fold: false
-co2.shape
-```
-
-Nice!! The number of rows (i.e., records) matches our expectations.
-
-<br/>
-
-
-Let's now check the quality of each feature.
-
-## Understanding Missing Value 1: `Days`
-`Days` is a time field, so let's analyze other time fields to see if there is an explanation for missing values of days of operation.
-
-Let's start with **months**, `Mo`.
-
-Are we missing any records? Each month should appear 61 or 62 times (the data span March 1958 to August 2019).
-
-```{python}
-#| code-fold: false
-co2["Mo"].value_counts().sort_index()
-```
-
-As expected, Jan, Feb, Sep, Oct, Nov, and Dec have 61 occurrences, and the rest have 62.
-
-<br/>
-
-Next let's explore **days** `Days` itself, which is the number of days that the measurement equipment worked.
-
-```{python}
-#| code-fold: true
-sns.displot(co2['Days']);
-plt.title("Distribution of days feature"); # suppresses unneeded plotting output
-```
-
-In terms of data quality, a handful of months have averages based on measurements taken on fewer than half the days. In addition, there are nearly 200 missing values--**that's about 27% of the data**!
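-
-(As a quick sketch, we could verify that figure directly by computing the fraction of months whose `Days` value is the missing-value code:)
-
-```python
-# Fraction of records where Days is coded as missing (-1)
-(co2["Days"] == -1).mean()
-```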
-
-<br/>
-
-Finally, let's check the last time feature, **year** `Yr`.
-
-Let's check to see if there is any connection between missing-ness and the year of the recording.
-
-```{python}
-#| code-fold: true
-sns.scatterplot(x="Yr", y="Days", data=co2);
-plt.title("Day field by Year"); # the ; suppresses output
-```
-
-**Observations**:
-
-* All of the missing data are in the early years of operation.
-* It appears there may have been problems with equipment in the mid to late 80s.
-
-**Potential Next Steps**:
-
-* Confirm these explanations through documentation about the historical readings.
-* Maybe drop earliest recordings? However, we would want to delay such action until after we have examined the time trends and assess whether there are any potential problems.
-
-<br/>
-
-## Understanding Missing Value 2: `Avg`
-Next, let's return to the -99.99 values in `Avg` to analyze the overall quality of the CO<sub>2</sub> measurements. We'll plot a histogram of the average CO<sub>2</sub> measurements.
-
-```{python}
-#| code-fold: true
-# Histograms of average CO2 measurements
-sns.displot(co2['Avg']);
-```
-
-The non-missing values are in the 300-400 range (a regular range of CO2 levels).
-
-We also see that there are only a few missing `Avg` values (**<1% of values**). Let's examine all of them:
-
-```{python}
-#| code-fold: false
-co2[co2["Avg"] < 0]
-```
-
-There doesn't seem to be a pattern to these values, other than that most records also were missing `Days` data.
-
-## Drop, `NaN`, or Impute Missing `Avg` Data?
-
-How should we address the invalid `Avg` data?
-
-1. Drop records
-2. Set to NaN
-3. Impute using some strategy
-
-Remember we want to fix the following plot:
-
-```{python}
-#| code-fold: true
-sns.lineplot(x='DecDate', y='Avg', data=co2)
-plt.title("CO2 Average By Month");
-```
-
-Since we are plotting `Avg` vs `DecDate`, we should just focus on dealing with missing values for `Avg`.
-
-
-Let's consider a few options:
-1. Drop those records
-2. Replace -99.99 with NaN
-3. Substitute it with a likely value for the average CO2?
-
-What do you think are the pros and cons of each possible action?
-
-<br/>
-
-
-Let's examine each of these three options.
-
-```{python}
-#| code-fold: false
-# 1. Drop missing values
-co2_drop = co2[co2['Avg'] > 0]
-co2_drop.head()
-```
-
-```{python}
-#| code-fold: false
-# 2. Replace -99.99 with NaN
-co2_NA = co2.replace(-99.99, np.nan)
-co2_NA.head()
-```
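-
-(Equivalently, these sentinel codes could have been marked as missing at load time. A sketch using `read_csv`'s `na_values` parameter:)
-
-```python
-# Sketch: treat the sentinel codes as NaN while reading the file.
-co2_NA_alt = pd.read_csv(
-    co2_file, header=None, skiprows=72, sep=r'\s+',
-    names=['Yr', 'Mo', 'DecDate', 'Avg', 'Int', 'Trend', 'Days'],
-    na_values={'Avg': [-99.99], 'Days': [-1]}
-)
-```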
-
-We'll also use a third version of the data.
-
-First, we note that the dataset already comes with a **substitute value** for the -99.99.
-
-From the file description:
-
-> The `interpolated` column includes average values from the preceding column (`average`)
-and **interpolated values** where data are missing. Interpolated values are
-computed in two steps...
-
-The `Int` feature has values that exactly match those in `Avg`, except when `Avg` is -99.99, and then a **reasonable** estimate is used instead.
-
-So, the third version of our data will use the `Int` feature instead of `Avg`.
-
-```{python}
-#| code-fold: false
-# 3. Use interpolated column which estimates missing Avg values
-co2_impute = co2.copy()
-co2_impute['Avg'] = co2['Int']
-co2_impute.head()
-```
-
-What's a **reasonable** estimate?
-
-To answer this question, let's zoom in on a short time period, say the measurements in 1958 (where we know we have two missing values).
-
-```{python}
-#| code-fold: true
-# results of plotting data in 1958
-
-def line_and_points(data, ax, title):
- # assumes single year, hence Mo
- ax.plot('Mo', 'Avg', data=data)
- ax.scatter('Mo', 'Avg', data=data)
- ax.set_xlim(2, 13)
- ax.set_title(title)
- ax.set_xticks(np.arange(3, 13))
-
-def data_year(data, year):
-    return data[data["Yr"] == year]
-
-# uses matplotlib subplots
-# you may see more next week; focus on output for now
-fig, axes = plt.subplots(ncols = 3, figsize=(12, 4), sharey=True)
-
-year = 1958
-line_and_points(data_year(co2_drop, year), axes[0], title="1. Drop Missing")
-line_and_points(data_year(co2_NA, year), axes[1], title="2. Missing Set to NaN")
-line_and_points(data_year(co2_impute, year), axes[2], title="3. Missing Interpolated")
-
-fig.suptitle(f"Monthly Averages for {year}")
-plt.tight_layout()
-```
-
-In the big picture, since there are only 7 `Avg` values missing (**<1%** of 738 months), any of these approaches would work.
-
-However, there is some appeal to **option 3, imputing**:
-
-* It shows the seasonal trends in CO<sub>2</sub>.
-* We are plotting all months in our data as a line plot.
-
-<br/>
-
-
-Let's replot our original figure with option 3:
-
-```{python}
-#| code-fold: true
-sns.lineplot(x='DecDate', y='Avg', data=co2_impute)
-plt.title("CO2 Average By Month, Imputed");
-```
-
-Looks pretty close to what we see on the NOAA [website](https://gml.noaa.gov/ccgg/trends/)!
-
-## Presenting the data: A Discussion on Data Granularity
-
-From the description:
-
-* Monthly measurements are averages of daily average measurements.
-* The NOAA GML website has datasets for daily/hourly measurements too.
-
-The data you present depends on your research question.
-
-**How do CO2 levels vary by season?**
-
-* You might want to keep average monthly data.
-
-**Are CO2 levels rising over the past 50+ years, consistent with global warming predictions?**
-
-* You might be happier with a **coarser granularity** of average year data!
-
-```{python}
-#| code-fold: true
-co2_year = co2_impute.groupby('Yr').mean()
-sns.lineplot(x='Yr', y='Avg', data=co2_year)
-plt.title("CO2 Average By Year");
-```
-
-Indeed, we see a rise of nearly 100 ppm in CO<sub>2</sub> since Mauna Loa began recording in 1958.
-
-# Summary
-We went over a lot of content this lecture; let's summarize the most important points:
-
-## Dealing with Missing Values
-There are a few options we can take to deal with missing data:
-
-* Drop missing records
-* Keep `NaN` missing values
-* Impute using an interpolated column
-
-## EDA and Data Wrangling
-There are several ways to approach EDA and Data Wrangling:
-
-* Examine the **data and metadata**: what is the date, size, organization, and structure of the data?
-* Examine each **field/attribute/dimension** individually.
-* Examine pairs of related dimensions (e.g. breaking down grades by major).
-* Along the way, we can:
- * **Visualize** or summarize the data.
- * **Validate assumptions** about data and its collection process. Pay particular attention to when the data was collected.
- * Identify and **address anomalies**.
- * Apply data transformations and corrections (we'll cover this in the upcoming lecture).
- * **Record everything you do!** Developing in Jupyter Notebook promotes *reproducibility* of your own work!
+---
+title: Data Cleaning and EDA
+execute:
+ echo: true
+format:
+ html:
+ code-fold: true
+ code-tools: true
+ toc: true
+ toc-title: Data Cleaning and EDA
+ page-layout: full
+ theme:
+ - cosmo
+ - cerulean
+ callout-icon: false
+jupyter: python3
+---
+
+```{python}
+#| code-fold: true
+import numpy as np
+import pandas as pd
+
+import matplotlib.pyplot as plt
+import seaborn as sns
+#%matplotlib inline
+plt.rcParams['figure.figsize'] = (12, 9)
+
+sns.set()
+sns.set_context('talk')
+np.set_printoptions(threshold=20, precision=2, suppress=True)
+pd.set_option('display.max_rows', 30)
+pd.set_option('display.max_columns', None)
+pd.set_option('display.precision', 2)
+# This option stops scientific notation for pandas
+pd.set_option('display.float_format', '{:.2f}'.format)
+
+# Silence some spurious seaborn warnings
+import warnings
+warnings.filterwarnings("ignore", category=FutureWarning)
+```
+
+::: {.callout-note collapse="false"}
+## Learning Outcomes
+* Recognize common file formats
+* Categorize data by its variable type
+* Build awareness of issues with data faithfulness and develop targeted solutions
+:::
+
+**This content is covered in lectures 4, 5, and 6.**
+
+In the past few lectures, we've learned that `pandas` is a toolkit to restructure, modify, and explore a dataset. What we haven't yet touched on is *how* to make these data transformation decisions. When we receive a new set of data from the "real world," how do we know what processing we should do to convert this data into a usable form?
+
+**Data cleaning**, also called **data wrangling**, is the process of transforming raw data to facilitate subsequent analysis. It is often used to address issues like:
+
+* Unclear structure or formatting
+* Missing or corrupted values
+* Unit conversions
+* ...and so on
+
+**Exploratory Data Analysis (EDA)** is the process of understanding a new dataset. It is an open-ended, informal analysis that involves familiarizing ourselves with the variables present in the data, discovering potential hypotheses, and identifying possible issues with the data. This last point can often motivate further data cleaning to address any problems with the dataset's format; because of this, EDA and data cleaning are often thought of as an "infinite loop," with each process driving the other.
+
+In this lecture, we will consider the key properties of data to consider when performing data cleaning and EDA. In doing so, we'll develop a "checklist" of sorts for you to consider when approaching a new dataset. Throughout this process, we'll build a deeper understanding of this early (but very important!) stage of the data science lifecycle.
+
+## Structure
+
+### File Formats
+There are many file types for storing structured data: TSV, JSON, XML, ASCII, SAS, etc. We'll only cover CSV, TSV, and JSON in lecture, but you'll likely encounter other formats as you work with different datasets. Reading documentation is your best bet for understanding how to process the multitude of different file types.
+
+#### CSV
+CSVs, which stand for **Comma-Separated Values**, are a common tabular data format.
+In the past two `pandas` lectures, we briefly touched on the idea of file format: the way data is encoded in a file for storage. Specifically, our `elections` and `babynames` datasets were stored and loaded as CSVs:
+
+```{python}
+#| code-fold: false
+pd.read_csv("data/elections.csv").head(5)
+```
+
+To better understand the properties of a CSV, let's take a look at the first few rows of the raw data file to see what it looks like before being loaded into a `DataFrame`. We'll use the `repr()` function to return the raw string with its special characters:
+
+```{python}
+#| code-fold: false
+with open("data/elections.csv", "r") as table:
+ i = 0
+ for row in table:
+ print(repr(row))
+ i += 1
+ if i > 3:
+ break
+```
+
+Each row, or **record**, in the data is delimited by a newline `\n`. Each column, or **field**, in the data is delimited by a comma `,` (hence, comma-separated!).
+
+#### TSV
+
+Another common file type is **TSV (Tab-Separated Values)**. In a TSV, records are still delimited by a newline `\n`, while fields are delimited by the tab character `\t`.
+
+Let's check out the first few rows of the raw TSV file. Again, we'll use the `repr()` function so that `print` shows the special characters.
+
+```{python}
+#| code-fold: false
+with open("data/elections.txt", "r") as table:
+ i = 0
+ for row in table:
+ print(repr(row))
+ i += 1
+ if i > 3:
+ break
+```
+
+TSVs can be loaded into `pandas` using `pd.read_csv`. We'll need to specify the **delimiter** with the parameter `sep='\t'` [(documentation)](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html).
+
+```{python}
+#| code-fold: false
+pd.read_csv("data/elections.txt", sep='\t').head(3)
+```
+
+An issue with CSVs and TSVs comes up whenever there are commas or tabs within the records. How does `pandas` differentiate between a comma delimiter vs. a comma within the field itself, for example `8,900`? To remedy this, check out the [`quotechar` parameter](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html).
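+
+For example, a small sketch of how quoting protects an embedded comma (`'"'` is already the default `quotechar`):
+
+```python
+import io
+
+raw = 'Candidate,Votes\n"Smith, John","8,900"\n'
+pd.read_csv(io.StringIO(raw), quotechar='"', thousands=',')
+```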
+
+#### JSON
+**JSON (JavaScript Object Notation)** files behave similarly to Python dictionaries. A raw JSON is shown below.
+
+```{python}
+#| code-fold: false
+with open("data/elections.json", "r") as table:
+ i = 0
+ for row in table:
+ print(row)
+ i += 1
+ if i > 8:
+ break
+```
+
+JSON files can be loaded into `pandas` using `pd.read_json`.
+
+```{python}
+#| code-fold: false
+pd.read_json('data/elections.json').head(3)
+```
+
+##### EDA with JSON: Berkeley COVID-19 Data
+The City of Berkeley Open Data [website](https://data.cityofberkeley.info/Health/COVID-19-Confirmed-Cases/xn6j-b766) has a dataset with COVID-19 Confirmed Cases among Berkeley residents by date. Let's download the file and save it as a JSON (note that the source URL file type is also JSON). In the interest of reproducible data science, we will download the data programmatically. We have defined some helper functions in the [`ds100_utils.py`](https://ds100.org/fa23/resources/assets/lectures/lec05/lec05-eda.html) file that we can reuse in many different notebooks.
+
+```{python}
+#| code-fold: false
+from ds100_utils import fetch_and_cache
+
+covid_file = fetch_and_cache(
+ "https://data.cityofberkeley.info/api/views/xn6j-b766/rows.json?accessType=DOWNLOAD",
+ "confirmed-cases.json",
+ force=False)
+covid_file # a file path wrapper object
+```
+
+###### File Size
+Let's start our analysis by getting a rough estimate of the size of the dataset to inform the tools we use to view the data. For relatively small datasets, we can use a text editor or spreadsheet. For larger datasets, more programmatic exploration or distributed computing tools may be more fitting. Here we will use `Python` tools to probe the file.
+
+Since this appears to be a text file, let's investigate the number of lines, which often corresponds to the number of records:
+
+```{python}
+#| code-fold: false
+import os
+
+print(covid_file, "is", os.path.getsize(covid_file) / 1e6, "MB")
+
+with open(covid_file, "r") as f:
+ print(covid_file, "is", sum(1 for l in f), "lines.")
+```
+
+###### Unix Commands
+As part of the EDA workflow, Unix commands can come in very handy. In fact, there's an entire book called ["Data Science at the Command Line"](https://datascienceatthecommandline.com/) that explores this idea in depth!
+In Jupyter/IPython, you can prefix lines with `!` to execute arbitrary Unix commands, and within those lines, you can refer to `Python` variables and expressions with the syntax `{expr}`.
+
+Here, we use the `ls` command to list files, using the `-lh` flags, which request "long format with information in human-readable form." We also use the `wc` command for "word count," but with the `-l` flag, which asks for line counts instead of words.
+
+These two give us the same information as the code above, albeit in a slightly different form:
+
+```{python}
+#| code-fold: false
+!ls -lh {covid_file}
+!wc -l {covid_file}
+```
+
+###### File Contents
+Let's explore the data format using `Python`.
+
+```{python}
+#| code-fold: false
+with open(covid_file, "r") as f:
+ for i, row in enumerate(f):
+ print(repr(row)) # print raw strings
+ if i >= 4: break
+```
+
+We can use the `head` Unix command (which is where `pandas`' `head` method comes from!) to see the first few lines of the file:
+
+```{python}
+#| code-fold: false
+!head -5 {covid_file}
+```
+
+In order to load the JSON file into `pandas`, let's first do some EDA with `Python`'s `json` package to understand the particular structure of this JSON file so that we can decide what (if anything) to load into `pandas`. `Python` has relatively good support for JSON data since it closely matches the internal `Python` object model. In the following cell, we import the entire JSON datafile into a `Python` dictionary using the `json` package.
+
+```{python}
+#| code-fold: false
+import json
+
+with open(covid_file, "rb") as f:
+ covid_json = json.load(f)
+```
+
+The `covid_json` variable is now a dictionary encoding the data in the file:
+
+```{python}
+#| code-fold: false
+type(covid_json)
+```
+
+We can examine the keys in the top-level JSON object by listing them out.
+
+```{python}
+#| code-fold: false
+covid_json.keys()
+```
+
+**Observation**: The JSON dictionary contains a `meta` key, which likely refers to metadata (data about the data). Metadata is often maintained with the data and can be a good source of additional information.
+
+
+We can investigate the meta data further by examining the keys associated with the metadata.
+
+```{python}
+#| code-fold: false
+covid_json['meta'].keys()
+```
+
+The `meta` key contains another dictionary called `view`. This likely refers to meta-data about a particular "view" of some underlying database. We will learn more about views when we study SQL later in the class.
+
+```{python}
+#| code-fold: false
+covid_json['meta']['view'].keys()
+```
+
+Notice that this is a nested/recursive data structure. As we dig deeper, we reveal more and more keys and the corresponding data:
+
+```
+meta
+|-> data
+ | ... (haven't explored yet)
+|-> view
+ | -> id
+ | -> name
+ | -> attribution
+ ...
+ | -> description
+ ...
+ | -> columns
+ ...
+```
+
+
+There is a key called `description` in the `view` sub-dictionary. This likely contains a description of the data:
+
+```{python}
+#| code-fold: false
+print(covid_json['meta']['view']['description'])
+```
+
+###### Examining the Data Field for Records
+
+We can look at a few entries in the `data` field. This is what we'll load into `pandas`.
+
+```{python}
+#| code-fold: false
+for i in range(3):
+ print(f"{i:03} | {covid_json['data'][i]}")
+```
+
+Observations:
+* These look like equal-length records, so maybe `data` is a table!
+* But what does each value in the record mean? Where can we find the column headers?
+
+For that, we'll need the `columns` key in the metadata dictionary. This returns a list:
+
+```{python}
+#| code-fold: false
+type(covid_json['meta']['view']['columns'])
+```
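+
+Each element of that list describes one column. A quick sketch to peek at the first few column names (we already know each descriptor has a `name` field, since we use it when loading the data below):
+
+```python
+# Names of the first five column descriptors
+[c['name'] for c in covid_json['meta']['view']['columns'][:5]]
+```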
+
+###### Summary of exploring the JSON file
+
+1. The above **metadata** tells us a lot about the columns in the data including column names, potential data anomalies, and a basic statistic.
+1. Because of its non-tabular structure, JSON makes it easier (than CSV) to create **self-documenting data**, meaning that information about the data is stored in the same file as the data.
+1. Self-documenting data can be helpful since it maintains its own description and these descriptions are more likely to be updated as data changes.
+
+###### Loading COVID Data into `pandas`
+Finally, let's load the data (not the metadata) into a `pandas` `DataFrame`. In the following block of code we:
+
+1. Translate the JSON records into a `DataFrame`:
+
+ * fields: `covid_json['meta']['view']['columns']`
+ * records: `covid_json['data']`
+
+
+1. Remove columns that have no metadata description. This would be a bad idea in general, but here we remove these columns since the above analysis suggests they are unlikely to contain useful information.
+
+1. Examine the `tail` of the table.
+
+```{python}
+#| code-fold: false
+# Load the data from JSON and assign column titles
+covid = pd.DataFrame(
+ covid_json['data'],
+ columns=[c['name'] for c in covid_json['meta']['view']['columns']])
+
+covid.tail()
+```
+
+### Variable Types
+
+After loading data from a file, it's a good idea to take the time to understand what pieces of information are encoded in the dataset. In particular, we want to identify what variable types are present in our data. Broadly speaking, we can categorize variables into one of two overarching types.
+
+**Quantitative variables** describe some numeric quantity or amount. We can divide quantitative data further into:
+
+* **Continuous quantitative variables**: numeric data that can be measured on a continuous scale to arbitrary precision. Continuous variables do not have a strict set of possible values – they can be recorded to any number of decimal places. For example, weights, GPA, or CO<sub>2</sub> concentrations.
+* **Discrete quantitative variables**: numeric data that can only take on a finite set of possible values. For example, someone's age or the number of siblings they have.
+
+**Qualitative variables**, also known as **categorical variables**, describe data that isn't measuring some quantity or amount. The sub-categories of categorical data are:
+
+* **Ordinal qualitative variables**: categories with ordered levels. Specifically, ordinal variables are those where the difference between levels has no consistent, quantifiable meaning. Some examples include levels of education (high school, undergrad, grad, etc.), income bracket (low, medium, high), or Yelp rating.
+* **Nominal qualitative variables**: categories with no specific order. For example, someone's political affiliation or Cal ID number.
+
+![Classification of variable types](images/variable.png)
+
+Note that many variables don't sit neatly in just one of these categories. Qualitative variables could have numeric levels, and conversely, quantitative variables could be stored as strings.
+
+### Primary and Foreign Keys
+
+Last time, we introduced `.merge` as the `pandas` method for joining multiple `DataFrame`s together. In our discussion of joins, we touched on the idea of using a "key" to determine what rows should be merged from each table. Let's take a moment to examine this idea more closely.
+
+The **primary key** is the column or set of columns in a table that *uniquely* determine the values of the remaining columns. It can be thought of as the unique identifier for each individual row in the table. For example, a table of Data 100 students might use each student's Cal ID as the primary key.
+
+```{python}
+#| echo: false
+pd.DataFrame({"Cal ID":[3034619471, 3035619472, 3025619473, 3046789372], \
+ "Name":["Oski", "Ollie", "Orrie", "Ollie"], \
+ "Major":["Data Science", "Computer Science", "Data Science", "Economics"]})
+```
+
+The **foreign key** is the column or set of columns in a table that reference primary keys in other tables. Knowing a dataset's foreign keys can be useful when assigning the `left_on` and `right_on` parameters of `.merge`. In the table of office hour tickets below, `"Cal ID"` is a foreign key referencing the previous table.
+
+```{python}
+#| echo: false
+pd.DataFrame({"OH Request":[1, 2, 3, 4], \
+ "Cal ID":[3034619471, 3035619472, 3025619473, 3035619472], \
+ "Question":["HW 2 Q1", "HW 2 Q3", "Lab 3 Q4", "HW 2 Q7"]})
+```
+
+## Granularity, Scope, and Temporality
+
+After understanding the structure of the dataset, the next task is to determine what exactly the data represents. We'll do so by considering the data's granularity, scope, and temporality.
+
+### Granularity
+The **granularity** of a dataset is what a single row represents. You can also think of it as the level of detail included in the data. To determine the data's granularity, ask: what does each row in the dataset represent? Fine-grained data contains a high level of detail, with a single row representing a small individual unit. For example, each record may represent one person. Coarse-grained data is encoded such that a single row represents a large individual unit – for example, each record may represent a group of people.
+
+### Scope
+The **scope** of a dataset is the subset of the population covered by the data. If we were investigating student performance in Data Science courses, a dataset with a narrow scope might encompass all students enrolled in Data 100 whereas a dataset with an expansive scope might encompass all students in California.
+
+### Temporality
+The **temporality** of a dataset describes the periodicity over which the data was collected as well as when the data was most recently collected or updated.
+
+Time and date fields of a dataset could represent a few things:
+
+1. when the "event" happened
+2. when the data was collected, or when it was entered into the system
+3. when the data was copied into the database
+
+To fully understand the temporality of the data, it also may be necessary to standardize time zones or inspect recurring time-based trends in the data (do patterns recur in 24-hour periods? Over the course of a month? Seasonally?). The convention for standardizing time is Coordinated Universal Time (UTC), an international time standard measured at 0 degrees longitude that stays consistent throughout the year (no daylight saving time). Berkeley's time zone, Pacific Standard Time (PST), is UTC-8; during daylight saving time, Pacific Daylight Time (PDT) is UTC-7.
+
+#### Temporality with `pandas`' `dt` accessors
+Let's briefly look at how we can use `pandas`' `dt` accessors to work with dates/times in a dataset using the dataset you'll see in Lab 3: the Berkeley PD Calls for Service dataset.
+
+```{python}
+#| code-fold: true
+calls = pd.read_csv("data/Berkeley_PD_-_Calls_for_Service.csv")
+calls.head()
+```
+
+Looks like there are three columns with dates/times: `EVENTDT`, `EVENTTM`, and `InDbDate`.
+
+Most likely, `EVENTDT` stands for the date when the event took place, `EVENTTM` stands for the time of day the event took place (in 24-hr format), and `InDbDate` is the date this call is recorded onto the database.
+
+If we check the data type of these columns, we will see they are stored as strings. We can convert them to `datetime` objects using pandas `to_datetime` function.
+
+```{python}
+#| code-fold: false
+calls["EVENTDT"] = pd.to_datetime(calls["EVENTDT"])
+calls.head()
+```
+
+Now, we can use the `dt` accessor on this column.
+
+We can get the month:
+
+```{python}
+#| code-fold: false
+calls["EVENTDT"].dt.month.head()
+```
+
+Which day of the week the date is on:
+
+```{python}
+#| code-fold: false
+calls["EVENTDT"].dt.dayofweek.head()
+```
+
+Check the minimum values to see if there are any suspicious-looking dates from the 1970s:
+
+```{python}
+#| code-fold: false
+calls.sort_values("EVENTDT").head()
+```
+
+Doesn't look like it! We are good!
+
+
+We can also do many things with the `dt` accessor like switching time zones and converting time back to UNIX/POSIX time. Check out the documentation on [`.dt` accessor](https://pandas.pydata.org/docs/user_guide/basics.html#basics-dt-accessors) and [time series/date functionality](https://pandas.pydata.org/docs/user_guide/timeseries.html#).
+
+## Faithfulness
+
+At this stage in our data cleaning and EDA workflow, we've achieved quite a lot: we've identified how our data is structured, come to terms with what information it encodes, and gained insight as to how it was generated. Throughout this process, we should always recall the original intent of our work in Data Science – to use data to better understand and model the real world. To achieve this goal, we need to ensure that the data we use is faithful to reality; that is, that our data accurately captures the "real world."
+
+Data used in research or industry is often "messy" – there may be errors or inaccuracies that impact the faithfulness of the dataset. Signs that data may not be faithful include:
+
+* Unrealistic or "incorrect" values, such as negative counts, locations that don't exist, or dates set in the future
+* Violations of obvious dependencies, like an age that does not match a birthday
+* Clear signs that data was entered by hand, which can lead to spelling errors or fields that are incorrectly shifted
+* Signs of data falsification, such as fake email addresses or repeated use of the same names
+* Duplicated records or fields containing the same information
+* Truncated data, e.g. older versions of Microsoft Excel limited spreadsheets to 65,536 rows and 256 columns
+
+We often solve some of these more common issues in the following ways:
+
+* Spelling errors: apply corrections or drop records that aren't in a dictionary
+* Time zone inconsistencies: convert to a common time zone (e.g. UTC)
+* Duplicated records or fields: identify and eliminate duplicates (using primary keys)
+* Unspecified or inconsistent units: infer the units and check that values are in reasonable ranges in the data
+
+### Missing Values
+Another common issue encountered with real-world datasets is that of missing data. One strategy to resolve this is to simply drop any records with missing values from the dataset. This does, however, introduce the risk of inducing biases – it is possible that the missing or corrupt records may be systemically related to some feature of interest in the data. Another solution is to keep the data as `NaN` values.
+
+A third method to address missing data is to perform **imputation**: infer the missing values using other data available in the dataset. There is a wide variety of imputation techniques that can be implemented; some of the most common are listed below.
+
+* Average imputation: replace missing values with the average value for that field
+* Hot deck imputation: replace missing values with some random value
+* Regression imputation: develop a model to predict missing values
+* Multiple imputation: replace missing values with multiple random values
+
+Regardless of the strategy used to deal with missing data, we should think carefully about *why* particular records or fields may be missing – this can help inform whether or not the absence of these values is significant or meaningful.
+
+# EDA Demo 1: Tuberculosis in the United States
+
+Now, let's walk through the data-cleaning and EDA workflow to see what we can learn about the presence of tuberculosis in the United States!
+
+We will examine the data included in the [original CDC article](https://www.cdc.gov/mmwr/volumes/71/wr/mm7112a1.htm?s_cid=mm7112a1_w#T1_down) published in 2021.
+
+
+## CSVs and Field Names
+Suppose Table 1 was saved as a CSV file located in `data/cdc_tuberculosis.csv`.
+
+We can then explore the CSV (which is a text file, and does not contain binary-encoded data) in many ways:
+1. Using a text editor like emacs, vim, VSCode, etc.
+2. Opening the CSV directly in DataHub (read-only), Excel, Google Sheets, etc.
+3. The `Python` file object
+4. `pandas`, using `pd.read_csv()`
+
+To try out options 1 and 2, you can view or download the Tuberculosis dataset from the [lecture demo notebook](https://data100.datahub.berkeley.edu/hub/user-redirect/git-pull?repo=https%3A%2F%2Fgithub.com%2FDS-100%2Ffa23-student&urlpath=lab%2Ftree%2Ffa23-student%2Flecture%2Flec05%2Flec04-eda.ipynb&branch=main) under the `data` folder in the left hand menu. Notice how the CSV file is a type of **rectangular data (i.e., tabular data) stored as comma-separated values**.
+
+Next, let's try out option 3 using the `Python` file object. We'll look at the first four lines:
+
+```{python}
+#| code-fold: true
+with open("data/cdc_tuberculosis.csv", "r") as f:
+ i = 0
+ for row in f:
+ print(row)
+ i += 1
+ if i > 3:
+ break
+```
+
+Whoa, why are there blank lines interspersed between the lines of the CSV?
+
+You may recall that all line breaks in text files are encoded as the special newline character `\n`. Python's `print()` prints each string (including its newline character) and then adds another newline of its own.
+
+If you're curious, we can use the `repr()` function to return the raw string with all special characters:
+
+```{python}
+#| code-fold: true
+with open("data/cdc_tuberculosis.csv", "r") as f:
+ i = 0
+ for row in f:
+ print(repr(row)) # print raw strings
+ i += 1
+ if i > 3:
+ break
+```
+
+Finally, let's try option 4 and use the tried-and-true Data 100 approach: `pandas`.
+
+```{python}
+#| code-fold: false
+tb_df = pd.read_csv("data/cdc_tuberculosis.csv")
+tb_df.head()
+```
+
+You may notice some strange things about this table: what's up with the "Unnamed" column names and the first row?
+
+Congratulations — you're ready to wrangle your data! Because of how things are stored, we'll need to clean the data a bit to name our columns better.
+
+A reasonable first step is to identify the row with the right header. The `pd.read_csv()` function ([documentation](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html)) has the convenient `header` parameter that we can set to use the elements in row 1 as the appropriate columns:
+
+```{python}
+#| code-fold: false
+tb_df = pd.read_csv("data/cdc_tuberculosis.csv", header=1) # row index
+tb_df.head(5)
+```
+
+Wait...but now we can't differentiate between the "Number of TB cases" and "TB incidence" year columns. `pandas` has tried to make our lives easier by automatically adding ".1" to the latter columns, but this doesn't help us, as humans, understand the data.
+
+We can do this manually with `df.rename()` ([documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.rename.html?highlight=rename#pandas.DataFrame.rename)):
+
+```{python}
+#| code-fold: false
+rename_dict = {'2019': 'TB cases 2019',
+ '2020': 'TB cases 2020',
+ '2021': 'TB cases 2021',
+ '2019.1': 'TB incidence 2019',
+ '2020.1': 'TB incidence 2020',
+ '2021.1': 'TB incidence 2021'}
+tb_df = tb_df.rename(columns=rename_dict)
+tb_df.head(5)
+```
+
+## Record Granularity
+
+You might already be wondering: what's up with that first record?
+
+Row 0 is what we call a **rollup record**, or summary record. It's often useful when displaying tables to humans. The **granularity** of record 0 (Totals) vs the rest of the records (States) is different.
+
+Okay, EDA step two. How was the rollup record aggregated?
+
+Let's check if Total TB cases is the sum of all state TB cases. If we sum over all rows, we should get **2x** the total cases in each of the "TB cases" columns for each year (why do you think this is?).
+
+```{python}
+#| code-fold: true
+tb_df.sum(axis=0)
+```
+
+Whoa, what's going on with the TB cases in 2019, 2020, and 2021? Check out the column types:
+
+```{python}
+#| code-fold: true
+tb_df.dtypes
+```
+
+Since there are commas in the values for TB cases, the numbers are read as the `object` datatype, or **storage type** (close to the `Python` string datatype), so `pandas` is concatenating strings instead of adding integers (recall that `Python` can "sum", or concatenate, strings together: `"data" + "100"` evaluates to `"data100"`).
+
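+Before using the built-in fix below, note that we could also convert an already-loaded column by hand. A minimal sketch (not the approach we take; shown only for illustration):
+
+```python
+# strip the thousands separators and cast to integers by hand
+tb_df["TB cases 2019"].str.replace(",", "").astype(int).head()
+```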
+
+Fortunately `read_csv` also has a `thousands` parameter ([documentation](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html)):
+
+```{python}
+#| code-fold: false
+# improve readability: chaining method calls with outer parentheses/line breaks
+tb_df = (
+ pd.read_csv("data/cdc_tuberculosis.csv", header=1, thousands=',')
+ .rename(columns=rename_dict)
+)
+tb_df.head(5)
+```
+
+```{python}
+#| code-fold: false
+tb_df.sum()
+```
+
+The Total TB cases look right. Phew!
+
+Let's just look at the records with **state-level granularity**:
+
+```{python}
+#| code-fold: true
+state_tb_df = tb_df[1:]
+state_tb_df.head(5)
+```
+
+## Gather Census Data
+
+U.S. Census population estimates [source](https://www.census.gov/data/tables/time-series/demo/popest/2010s-state-total.html) (2019), [source](https://www.census.gov/data/tables/time-series/demo/popest/2020s-state-total.html) (2020-2021).
+
+Running the below cells cleans the data.
+There are a few new methods here:
+* `df.convert_dtypes()` ([documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.convert_dtypes.html)) conveniently converts all float dtypes into ints and is out of scope for the class.
+* `df.dropna()` ([documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.dropna.html)) will be explained in more detail next time.
+
+```{python}
+#| code-fold: true
+# 2010s census data
+census_2010s_df = pd.read_csv("data/nst-est2019-01.csv", header=3, thousands=",")
+census_2010s_df = (
+ census_2010s_df
+ .reset_index()
+ .drop(columns=["index", "Census", "Estimates Base"])
+ .rename(columns={"Unnamed: 0": "Geographic Area"})
+ .convert_dtypes() # "smart" converting of columns, use at your own risk
+ .dropna() # we'll introduce this next time
+)
+census_2010s_df['Geographic Area'] = census_2010s_df['Geographic Area'].str.strip('.')
+
+# with pd.option_context('display.min_rows', 30): # shows more rows
+# display(census_2010s_df)
+
+census_2010s_df.head(5)
+```
+
+Occasionally, you will want to modify code that you have imported. To reimport those modifications you can either use `python`'s `importlib` library:
+
+```python
+from importlib import reload
+reload(utils)
+```
+
+or use `IPython` magic, which will intelligently reimport code when files change:
+
+```python
+%load_ext autoreload
+%autoreload 2
+```
+
+```{python}
+#| code-fold: true
+# census 2020s data
+census_2020s_df = pd.read_csv("data/NST-EST2022-POP.csv", header=3, thousands=",")
+census_2020s_df = (
+ census_2020s_df
+ .reset_index()
+ .drop(columns=["index", "Unnamed: 1"])
+ .rename(columns={"Unnamed: 0": "Geographic Area"})
+ .convert_dtypes() # "smart" converting of columns, use at your own risk
+ .dropna() # we'll introduce this next time
+)
+census_2020s_df['Geographic Area'] = census_2020s_df['Geographic Area'].str.strip('.')
+
+census_2020s_df.head(5)
+```
+
+## Joining Data (Merging `DataFrame`s)
+
+Time to `merge`! Here we use the `DataFrame` method `df1.merge(right=df2, ...)` on `DataFrame` `df1` ([documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.merge.html)). Contrast this with the function `pd.merge(left=df1, right=df2, ...)` ([documentation](https://pandas.pydata.org/docs/reference/api/pandas.merge.html?highlight=pandas%20merge#pandas.merge)). Feel free to use either.
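+
+On toy `DataFrame`s of our own (hypothetical data, just to show the two call styles are interchangeable):
+
+```python
+left  = pd.DataFrame({"key": ["a", "b"], "x": [1, 2]})
+right = pd.DataFrame({"key": ["a", "b"], "y": [3, 4]})
+
+# method form and function form produce the same result
+method_form   = left.merge(right=right, on="key")
+function_form = pd.merge(left=left, right=right, on="key")
+print(method_form.equals(function_form))   # True
+```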
+
+```{python}
+#| code-fold: false
+# merge TB DataFrame with two US census DataFrames
+tb_census_df = (
+ tb_df
+ .merge(right=census_2010s_df,
+ left_on="U.S. jurisdiction", right_on="Geographic Area")
+ .merge(right=census_2020s_df,
+ left_on="U.S. jurisdiction", right_on="Geographic Area")
+)
+tb_census_df.head(5)
+```
+
+Having all of these columns is a little unwieldy. We could either drop the unneeded columns now, or just merge on smaller census `DataFrame`s. Let's do the latter.
+
+```{python}
+#| code-fold: false
+# try merging again, but cleaner this time
+tb_census_df = (
+ tb_df
+ .merge(right=census_2010s_df[["Geographic Area", "2019"]],
+ left_on="U.S. jurisdiction", right_on="Geographic Area")
+ .drop(columns="Geographic Area")
+ .merge(right=census_2020s_df[["Geographic Area", "2020", "2021"]],
+ left_on="U.S. jurisdiction", right_on="Geographic Area")
+ .drop(columns="Geographic Area")
+)
+tb_census_df.head(5)
+```
+
+## Reproducing Data: Compute Incidence
+
+Let's recompute incidence to make sure we know where the original CDC numbers came from.
+
+From the [CDC report](https://www.cdc.gov/mmwr/volumes/71/wr/mm7112a1.htm?s_cid=mm7112a1_w#T1_down): TB incidence is computed as “Cases per 100,000 persons using mid-year population estimates from the U.S. Census Bureau.”
+
+If we define a group as 100,000 people, then we can compute the TB incidence for a given state population as
+
+$$\text{TB incidence} = \frac{\text{TB cases in population}}{\text{groups in population}} = \frac{\text{TB cases in population}}{\text{population}/100000} $$
+
+$$= \frac{\text{TB cases in population}}{\text{population}} \times 100000$$
+
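+For instance, a hypothetical state with 500 TB cases and a population of 10,000,000 contains $10{,}000{,}000 / 100{,}000 = 100$ groups, so its TB incidence is $500 / 100 = 5$ cases per 100,000 persons.
+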
+Let's try this for 2019:
+
+```{python}
+#| code-fold: false
+tb_census_df["recompute incidence 2019"] = tb_census_df["TB cases 2019"]/tb_census_df["2019"]*100000
+tb_census_df.head(5)
+```
+
+Awesome!!!
+
+Let's use a for-loop and `Python` format strings to compute TB incidence for all years. `Python` f-strings are just used for the purposes of this demo, but they're handy to know when you explore data beyond this course ([documentation](https://docs.python.org/3/tutorial/inputoutput.html)).
+
+```{python}
+#| code-fold: false
+# recompute incidence for all years
+for year in [2019, 2020, 2021]:
+ tb_census_df[f"recompute incidence {year}"] = tb_census_df[f"TB cases {year}"]/tb_census_df[f"{year}"]*100000
+tb_census_df.head(5)
+```
+
+These numbers look pretty close!!! There are a few errors in the hundredths place, particularly in 2021. It may be useful to further explore reasons behind this discrepancy.
+
+```{python}
+#| code-fold: false
+tb_census_df.describe()
+```
+
+## Bonus EDA: Reproducing the Reported Statistic
+
+
+**How do we reproduce that reported statistic in the original [CDC report](https://www.cdc.gov/mmwr/volumes/71/wr/mm7112a1.htm?s_cid=mm7112a1_w)?**
+
+> Reported TB incidence (cases per 100,000 persons) increased **9.4%**, from **2.2** during 2020 to **2.4** during 2021 but was lower than incidence during 2019 (2.7). Increases occurred among both U.S.-born and non–U.S.-born persons.
+
+This is TB incidence computed across the entire U.S. population! How do we reproduce this?
+* We need to reproduce the "Total" TB incidences in our rolled record.
+* But our current `tb_census_df` only has 51 entries (50 states plus Washington, D.C.). There is no rolled record.
+* What happened...?
+
+Let's get exploring!
+
+Before we keep exploring, we'll set all indexes to more meaningful values, instead of just numbers that pertain to some row at some point. This will make our cleaning slightly easier.
+
+```{python}
+#| code-fold: true
+tb_df = tb_df.set_index("U.S. jurisdiction")
+tb_df.head(5)
+```
+
+```{python}
+#| code-fold: false
+census_2010s_df = census_2010s_df.set_index("Geographic Area")
+census_2010s_df.head(5)
+```
+
+```{python}
+#| code-fold: false
+census_2020s_df = census_2020s_df.set_index("Geographic Area")
+census_2020s_df.head(5)
+```
+
+It turns out that our merge above only kept state records, even though our original `tb_df` had the "Total" rolled record:
+
+```{python}
+#| code-fold: false
+tb_df.head()
+```
+
+Recall that `merge` performs an **inner** merge by default, meaning that it only preserves keys that are present in **both** `DataFrame`s.
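+
+One way to see exactly which keys fail to match (a sketch for diagnosis only; we take a different route below) is an outer merge with `indicator=True`, which adds a `_merge` column flagging whether each row was found in the left table, the right table, or both:
+
+```python
+diagnostic = tb_df.merge(
+    right=census_2010s_df[["2019"]],
+    left_index=True, right_index=True,
+    how="outer", indicator=True
+)
+diagnostic["_merge"].value_counts()
+```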
+
+The rolled records in our census `DataFrame` have different `Geographic Area` fields, which was the key we merged on:
+
+```{python}
+#| code-fold: false
+census_2010s_df.head(5)
+```
+
+The Census `DataFrame` has several rolled records. The aggregate record we are looking for actually has the Geographic Area named "United States".
+
+One straightforward way to get the right merge is to rename the value itself. Because we now have the Geographic Area index, we'll use `df.rename()` ([documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.rename.html)):
+
+```{python}
+#| code-fold: false
+# rename rolled record for 2010s
+census_2010s_df.rename(index={'United States':'Total'}, inplace=True)
+census_2010s_df.head(5)
+```
+
+```{python}
+#| code-fold: false
+# same, but for 2020s rename rolled record
+census_2020s_df.rename(index={'United States':'Total'}, inplace=True)
+census_2020s_df.head(5)
+```
+
+<br/>
+
+Next let's rerun our merge. Note the different chaining, because we are now merging on indexes (`df.merge()` [documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.merge.html)).
+
+```{python}
+#| code-fold: false
+tb_census_df = (
+ tb_df
+ .merge(right=census_2010s_df[["2019"]],
+ left_index=True, right_index=True)
+ .merge(right=census_2020s_df[["2020", "2021"]],
+ left_index=True, right_index=True)
+)
+tb_census_df.head(5)
+```
+
+<br/>
+
+Finally, let's recompute our incidences:
+
+```{python}
+#| code-fold: false
+# recompute incidence for all years
+for year in [2019, 2020, 2021]:
+ tb_census_df[f"recompute incidence {year}"] = tb_census_df[f"TB cases {year}"]/tb_census_df[f"{year}"]*100000
+tb_census_df.head(5)
+```
+
+We reproduced the total U.S. incidences correctly!
+
+We're almost there. Let's revisit the quote:
+
+> Reported TB incidence (cases per 100,000 persons) increased **9.4%**, from **2.2** during 2020 to **2.4** during 2021 but was lower than incidence during 2019 (2.7). Increases occurred among both U.S.-born and non–U.S.-born persons.
+
+Recall that percent change from $A$ to $B$ is computed as
+$\text{percent change} = \frac{B - A}{A} \times 100$.
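+
+For example, plugging the rounded values into this formula gives $\frac{2.4 - 2.2}{2.2} \times 100 \approx 9.1\%$; the reported **9.4%** presumably comes from the unrounded incidences, which is what we compute below.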
+
+```{python}
+#| code-fold: false
+#| tags: []
+incidence_2020 = tb_census_df.loc['Total', 'recompute incidence 2020']
+incidence_2020
+```
+
+```{python}
+#| code-fold: false
+#| tags: []
+incidence_2021 = tb_census_df.loc['Total', 'recompute incidence 2021']
+incidence_2021
+```
+
+```{python}
+#| code-fold: false
+#| tags: []
+difference = (incidence_2021 - incidence_2020)/incidence_2020 * 100
+difference
+```
+
+# EDA Demo 2: Mauna Loa CO<sub>2</sub> Data -- A Lesson in Data Faithfulness
+
+[Mauna Loa Observatory](https://gml.noaa.gov/ccgg/trends/data.html) has been monitoring CO<sub>2</sub> concentrations since 1958.
+
+```{python}
+#| code-fold: false
+co2_file = "data/co2_mm_mlo.txt"
+```
+
+Let's do some **EDA**!!
+
+## Reading this file into Pandas?
+Before reading the file into `pandas`, let's check out this `.txt` file itself. Some questions to keep in mind: Do we trust this file extension? What structure does the file have?
+
+Lines 71-78 (inclusive) are shown below:
+
+ line number | file contents
+
+ 71 | # decimal average interpolated trend #days
+ 72 | # date (season corr)
+ 73 | 1958 3 1958.208 315.71 315.71 314.62 -1
+ 74 | 1958 4 1958.292 317.45 317.45 315.29 -1
+ 75 | 1958 5 1958.375 317.50 317.50 314.71 -1
+ 76 | 1958 6 1958.458 -99.99 317.10 314.85 -1
+ 77 | 1958 7 1958.542 315.86 315.86 314.98 -1
+ 78 | 1958 8 1958.625 314.93 314.93 315.94 -1
+
+
+Notice how:
+
+- The values are separated by white space, possibly tabs.
+- The data line up down the rows. For example, the month appears in the 7th to 8th position of each line.
+- The 71st and 72nd lines in the file contain column headings split over two lines.
+
+We can use `read_csv` to read the data into a `pandas` `DataFrame`, and we provide several arguments to specify that the separators are white space, there is no header (**we will set our own column names**), and to skip the first 72 rows of the file.
+
+```{python}
+#| code-fold: false
+co2 = pd.read_csv(
+ co2_file, header = None, skiprows = 72,
+    sep = r'\s+'  # delimiter for continuous whitespace (stay tuned for regex next lecture)
+)
+co2.head()
+```
+
+Congratulations! You've wrangled the data!
+
+<br/>
+
+...But our columns aren't named.
+**We need to do more EDA.**
+
+## Exploring Variable Feature Types
+
+The NOAA [webpage](https://gml.noaa.gov/ccgg/trends/) might have some useful tidbits (in this case it doesn't).
+
+Using the column descriptions given in the file header, we'll rerun `pd.read_csv`, but this time with some **custom column names.**
+
+```{python}
+#| code-fold: false
+co2 = pd.read_csv(
+ co2_file, header = None, skiprows = 72,
+    sep = r'\s+',  # regex for continuous whitespace (next lecture)
+ names = ['Yr', 'Mo', 'DecDate', 'Avg', 'Int', 'Trend', 'Days']
+)
+co2.head()
+```
+
+## Visualizing CO<sub>2</sub>
+Scientific studies tend to have very clean data, right...? Let's jump right in and make a time series plot of CO2 monthly averages.
+
+```{python}
+#| code-fold: true
+sns.lineplot(x='DecDate', y='Avg', data=co2);
+```
+
+The code above uses the `seaborn` plotting library (abbreviated `sns`). We will cover it in the Visualization lecture; for now, you don't need to worry about how it works!
+
+Yikes! Plotting the data uncovered a problem. The sharp vertical lines suggest that we have some **missing values**. What happened here?
+
+```{python}
+#| code-fold: false
+co2.head()
+```
+
+```{python}
+#| code-fold: false
+co2.tail()
+```
+
+Some data have unusual values like -1 and -99.99.
+
+Let's check the description at the top of the file again.
+
+* -1 signifies a missing value for the number of days `Days` the equipment was in operation that month.
+* -99.99 denotes a missing monthly average `Avg`.
+
+How can we fix this? First, let's explore other aspects of our data. Understanding our data will help us decide what to do with the missing values.
+
+<br/>
+
+
+## Sanity Checks: Reasoning about the data
+First, we consider the shape of the data. How many rows should we have?
+
+* If the data are in chronological order, we should have one record per month.
+* Data from March 1958 to August 2019.
+* We should have $ 12 \times (2019-1957) - 2 - 4 = 738 $ records.
+
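+As a quick cross-check of this arithmetic (a sketch, counting month starts from March 1958 through August 2019 inclusive):
+
+```python
+import pandas as pd
+
+len(pd.date_range("1958-03", "2019-08", freq="MS"))   # 738
+```
+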
+```{python}
+#| code-fold: false
+co2.shape
+```
+
+Nice!! The number of rows (i.e., records) matches our expectations.
+
+<br/>
+
+
+Let's now check the quality of each feature.
+
+## Understanding Missing Value 1: `Days`
+`Days` is a time field, so let's analyze other time fields to see if there is an explanation for missing values of days of operation.
+
+Let's start with **months**, `Mo`.
+
+Are we missing any records? Each month should have 61 or 62 instances (March 1958-August 2019).
+
+```{python}
+#| code-fold: false
+co2["Mo"].value_counts().sort_index()
+```
+
+As expected, Jan, Feb, Sep, Oct, Nov, and Dec have 61 occurrences, and the rest have 62.
+
+<br/>
+
+Next let's explore **days** `Days` itself, which is the number of days that the measurement equipment worked.
+
+```{python}
+#| code-fold: true
+sns.displot(co2['Days']);
+plt.title("Distribution of days feature"); # suppresses unneeded plotting output
+```
+
+In terms of data quality, a handful of months have averages based on measurements taken on fewer than half the days. In addition, there are nearly 200 missing values--**that's about 27% of the data**!
+
+<br/>
+
+Finally, let's check the last time feature, **year** `Yr`.
+
+Let's check to see if there is any connection between missing-ness and the year of the recording.
+
+```{python}
+#| code-fold: true
+sns.scatterplot(x="Yr", y="Days", data=co2);
+plt.title("Day field by Year"); # the ; suppresses output
+```
+
+**Observations**:
+
+* All of the missing data are in the early years of operation.
+* It appears there may have been problems with equipment in the mid to late 80s.
+
+**Potential Next Steps**:
+
+* Confirm these explanations through documentation about the historical readings.
+* Maybe drop the earliest recordings? However, we would want to delay such action until after we have examined the time trends and assessed whether there are any potential problems.
+
+<br/>
+
+## Understanding Missing Value 2: `Avg`
+Next, let's return to the -99.99 values in `Avg` to analyze the overall quality of the CO2 measurements. We'll plot a histogram of the average CO<sub>2</sub> measurements.
+
+```{python}
+#| code-fold: true
+# Histograms of average CO2 measurements
+sns.displot(co2['Avg']);
+```
+
+The non-missing values are in the 300-400 range (a regular range of CO2 levels).
+
+We also see that there are only a few missing `Avg` values (**<1% of values**). Let's examine all of them:
+
+```{python}
+#| code-fold: false
+co2[co2["Avg"] < 0]
+```
+
+There doesn't seem to be a pattern to these values, other than that most records also were missing `Days` data.
+
+## Drop, `NaN`, or Impute Missing `Avg` Data?
+
+How should we address the invalid `Avg` data?
+
+1. Drop records
+2. Set to NaN
+3. Impute using some strategy
+
+Remember we want to fix the following plot:
+
+```{python}
+#| code-fold: true
+sns.lineplot(x='DecDate', y='Avg', data=co2)
+plt.title("CO2 Average By Month");
+```
+
+Since we are plotting `Avg` vs `DecDate`, we should just focus on dealing with missing values for `Avg`.
+
+
+Let's consider a few options:
+1. Drop those records
+2. Replace -99.99 with NaN
+3. Substitute the missing values with a likely value for the average CO2
+
+What do you think are the pros and cons of each possible action?
+
+<br/>
+
+
+Let's examine each of these three options.
+
+```{python}
+#| code-fold: false
+# 1. Drop missing values
+co2_drop = co2[co2['Avg'] > 0]
+co2_drop.head()
+```
+
+```{python}
+#| code-fold: false
+# 2. Replace -99.99 with NaN
+co2_NA = co2.replace(-99.99, np.nan)
+co2_NA.head()
+```
+
+We'll also use a third version of the data.
+
+First, we note that the dataset already comes with a **substitute value** for the -99.99.
+
+From the file description:
+
+> The `interpolated` column includes average values from the preceding column (`average`)
+and **interpolated values** where data are missing. Interpolated values are
+computed in two steps...
+
+The `Int` feature has values that exactly match those in `Avg`, except when `Avg` is -99.99, and then a **reasonable** estimate is used instead.
+
+So, the third version of our data will use the `Int` feature instead of `Avg`.
+
+```{python}
+#| code-fold: false
+# 3. Use interpolated column which estimates missing Avg values
+co2_impute = co2.copy()
+co2_impute['Avg'] = co2['Int']
+co2_impute.head()
+```
+
+What's a **reasonable** estimate?
+
+To answer this question, let's zoom in on a short time period, say the measurements in 1958 (where we know we have two missing values).
+
+```{python}
+#| code-fold: true
+# results of plotting data in 1958
+
+def line_and_points(data, ax, title):
+ # assumes single year, hence Mo
+ ax.plot('Mo', 'Avg', data=data)
+ ax.scatter('Mo', 'Avg', data=data)
+ ax.set_xlim(2, 13)
+ ax.set_title(title)
+ ax.set_xticks(np.arange(3, 13))
+
+def data_year(data, year):
+    return data[data["Yr"] == year]
+
+# uses matplotlib subplots
+# you may see more next week; focus on output for now
+fig, axes = plt.subplots(ncols = 3, figsize=(12, 4), sharey=True)
+
+year = 1958
+line_and_points(data_year(co2_drop, year), axes[0], title="1. Drop Missing")
+line_and_points(data_year(co2_NA, year), axes[1], title="2. Missing Set to NaN")
+line_and_points(data_year(co2_impute, year), axes[2], title="3. Missing Interpolated")
+
+fig.suptitle(f"Monthly Averages for {year}")
+plt.tight_layout()
+```
+
+In the big picture, since there are only 7 `Avg` values missing (**<1%** of 738 months), any of these approaches would work.
+
+However, there is some appeal to **option 3: imputing**:
+
+* Shows seasonal trends for CO2
+* We are plotting all months in our data as a line plot
+
+<br/>
+
+
+Let's replot our original figure with option 3:
+
+```{python}
+#| code-fold: true
+sns.lineplot(x='DecDate', y='Avg', data=co2_impute)
+plt.title("CO2 Average By Month, Imputed");
+```
+
+Looks pretty close to what we see on the NOAA [website](https://gml.noaa.gov/ccgg/trends/)!
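+
+As an aside: if the dataset had not shipped with an interpolated `Int` column, `pandas` could produce a rough substitute from the NaN version in option 2 (a sketch; simple linear interpolation, which is not identical to the two-step procedure described in the file):
+
+```python
+# fill the NaNs from option 2 with pandas' linear interpolation
+co2_manual_interp = co2_NA.copy()
+co2_manual_interp["Avg"] = co2_manual_interp["Avg"].interpolate()
+co2_manual_interp.head()
+```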
+
+## Presenting the data: A Discussion on Data Granularity
+
+From the description:
+
+* Monthly measurements are averages of daily average measurements.
+* The NOAA GML website has datasets for daily/hourly measurements too.
+
+The data you present depends on your research question.
+
+**How do CO2 levels vary by season?**
+
+* You might want to keep average monthly data.
+
+**Are CO2 levels rising over the past 50+ years, consistent with global warming predictions?**
+
+* You might be happier with a **coarser granularity** of average year data!
+
+```{python}
+#| code-fold: true
+co2_year = co2_impute.groupby('Yr').mean()
+sns.lineplot(x='Yr', y='Avg', data=co2_year)
+plt.title("CO2 Average By Year");
+```
+
+Indeed, we see a rise by nearly 100 ppm of CO2 since Mauna Loa began recording in 1958.
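+
+Another way to present a coarser trend, without collapsing to one point per year, is a 12-month rolling mean (a sketch, assuming the `co2_impute` `DataFrame` from above; we stick with the yearly averages in this lecture):
+
+```python
+# smooth out the seasonal cycle while keeping one value per month
+co2_rolling = co2_impute.copy()
+co2_rolling["Avg_smooth"] = co2_rolling["Avg"].rolling(12, center=True).mean()
+sns.lineplot(x="DecDate", y="Avg_smooth", data=co2_rolling);
+```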
+
+# Summary
+We went over a lot of content this lecture; let's summarize the most important points:
+
+## Dealing with Missing Values
+There are a few options we can take to deal with missing data:
+
+* Drop missing records
+* Keep `NaN` missing values
+* Impute using an interpolated column
+
+## EDA and Data Wrangling
+There are several ways to approach EDA and Data Wrangling:
+
+* Examine the **data and metadata**: what is the date, size, organization, and structure of the data?
+* Examine each **field/attribute/dimension** individually.
+* Examine pairs of related dimensions (e.g. breaking down grades by major).
+* Along the way, we can:
+ * **Visualize** or summarize the data.
+ * **Validate assumptions** about data and its collection process. Pay particular attention to when the data was collected.
+ * Identify and **address anomalies**.
+ * Apply data transformations and corrections (we'll cover this in the upcoming lecture).
+ * **Record everything you do!** Developing in Jupyter Notebook promotes *reproducibility* of your own work!
diff --git a/docs/eda/eda_files/figure-html/cell-62-output-1.png b/docs/eda/eda_files/figure-html/cell-62-output-1.png
index a04218cf..f392d5f9 100644
Binary files a/docs/eda/eda_files/figure-html/cell-62-output-1.png and b/docs/eda/eda_files/figure-html/cell-62-output-1.png differ
diff --git a/docs/eda/eda_files/figure-html/cell-67-output-1.png b/docs/eda/eda_files/figure-html/cell-67-output-1.png
new file mode 100644
index 00000000..be96b8c9
Binary files /dev/null and b/docs/eda/eda_files/figure-html/cell-67-output-1.png differ
diff --git a/docs/eda/eda_files/figure-html/cell-67-output-2.png b/docs/eda/eda_files/figure-html/cell-67-output-2.png
deleted file mode 100644
index 31857f62..00000000
Binary files a/docs/eda/eda_files/figure-html/cell-67-output-2.png and /dev/null differ
diff --git a/docs/eda/eda_files/figure-html/cell-68-output-1.png b/docs/eda/eda_files/figure-html/cell-68-output-1.png
index 67c3959d..ffd29ff8 100644
Binary files a/docs/eda/eda_files/figure-html/cell-68-output-1.png and b/docs/eda/eda_files/figure-html/cell-68-output-1.png differ
diff --git a/docs/eda/eda_files/figure-html/cell-69-output-1.png b/docs/eda/eda_files/figure-html/cell-69-output-1.png
new file mode 100644
index 00000000..29088928
Binary files /dev/null and b/docs/eda/eda_files/figure-html/cell-69-output-1.png differ
diff --git a/docs/eda/eda_files/figure-html/cell-69-output-2.png b/docs/eda/eda_files/figure-html/cell-69-output-2.png
deleted file mode 100644
index fb28f5d5..00000000
Binary files a/docs/eda/eda_files/figure-html/cell-69-output-2.png and /dev/null differ
diff --git a/docs/eda/eda_files/figure-html/cell-71-output-1.png b/docs/eda/eda_files/figure-html/cell-71-output-1.png
index 39cac822..49ef3d6a 100644
Binary files a/docs/eda/eda_files/figure-html/cell-71-output-1.png and b/docs/eda/eda_files/figure-html/cell-71-output-1.png differ
diff --git a/docs/eda/eda_files/figure-html/cell-75-output-1.png b/docs/eda/eda_files/figure-html/cell-75-output-1.png
index 6382e58a..15a5fe82 100644
Binary files a/docs/eda/eda_files/figure-html/cell-75-output-1.png and b/docs/eda/eda_files/figure-html/cell-75-output-1.png differ
diff --git a/docs/eda/eda_files/figure-html/cell-76-output-1.png b/docs/eda/eda_files/figure-html/cell-76-output-1.png
index db2b0dee..40b1fc71 100644
Binary files a/docs/eda/eda_files/figure-html/cell-76-output-1.png and b/docs/eda/eda_files/figure-html/cell-76-output-1.png differ
diff --git a/docs/eda/eda_files/figure-html/cell-77-output-1.png b/docs/eda/eda_files/figure-html/cell-77-output-1.png
index 897b8b39..99b6c2d1 100644
Binary files a/docs/eda/eda_files/figure-html/cell-77-output-1.png and b/docs/eda/eda_files/figure-html/cell-77-output-1.png differ
diff --git a/docs/feature_engineering/feature_engineering.html b/docs/feature_engineering/feature_engineering.html
index ea770e7f..22d26788 100644
--- a/docs/feature_engineering/feature_engineering.html
+++ b/docs/feature_engineering/feature_engineering.html
@@ -556,7 +556,7 @@
my_model.fit(X, Y)
-LinearRegression()
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.LinearRegression()
+LinearRegression()
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.LinearRegression()
Notice that we use double brackets to extract this column. Why double brackets instead of just single brackets? The .fit
method, by default, expects to receive 2-dimensional data – some kind of data that includes both rows and columns. Writing penguins["flipper_length_mm"]
would return a 1D Series
, causing sklearn
to error. We avoid this by writing penguins[["flipper_length_mm"]]
to produce a 2D DataFrame
.
@@ -607,7 +607,7 @@
print(f"The RMSE of the model is {np.sqrt(np.mean((Y-Y_hat_two_features)**2))}")
-The RMSE of the model is 0.9881331104079044
+The RMSE of the model is 0.9881331104079045
We can also see that we obtain the same predictions using sklearn
as we did when applying the ordinary least squares formula before!
@@ -977,7 +977,7 @@
print(f"MSE of model with (hp^2) feature: {np.mean((Y-hp2_model_predictions)**2)}")
-MSE of model with (hp^2) feature: 18.984768907617223
+MSE of model with (hp^2) feature: 18.984768907617216
diff --git a/docs/feature_engineering/feature_engineering_files/figure-html/cell-16-output-2.png b/docs/feature_engineering/feature_engineering_files/figure-html/cell-16-output-2.png
index 92cb01c9..f8396667 100644
Binary files a/docs/feature_engineering/feature_engineering_files/figure-html/cell-16-output-2.png and b/docs/feature_engineering/feature_engineering_files/figure-html/cell-16-output-2.png differ
diff --git a/docs/feature_engineering/feature_engineering_files/figure-html/cell-17-output-2.png b/docs/feature_engineering/feature_engineering_files/figure-html/cell-17-output-2.png
index f4ae4ea0..ceecd30f 100644
Binary files a/docs/feature_engineering/feature_engineering_files/figure-html/cell-17-output-2.png and b/docs/feature_engineering/feature_engineering_files/figure-html/cell-17-output-2.png differ
diff --git a/docs/gradient_descent/gradient_descent.html b/docs/gradient_descent/gradient_descent.html
index 467ee5fb..ed238d2c 100644
--- a/docs/gradient_descent/gradient_descent.html
+++ b/docs/gradient_descent/gradient_descent.html
@@ -106,7 +106,7 @@
require.undef("plotly");
requirejs.config({
paths: {
- 'plotly': ['https://cdn.plot.ly/plotly-2.25.2.min']
+ 'plotly': ['https://cdn.plot.ly/plotly-2.12.1.min']
}
});
require(['plotly'], function(Plotly) {
@@ -439,9 +439,9 @@
-
@@ -4481,10 +4469,10 @@
-
-# 3. Use interpolated column which estimates missing Avg values
-co2_impute = co2.copy()
-co2_impute['Avg'] = co2['Int']
-co2_impute.head()
+# 3. Use interpolated column which estimates missing Avg values
+co2_impute = co2.copy()
+co2_impute['Avg'] = co2['Int']
+co2_impute.head()
@@ -4564,30 +4552,30 @@
Code
-# results of plotting data in 1958
-
-def line_and_points(data, ax, title):
- # assumes single year, hence Mo
- ax.plot('Mo', 'Avg', data=data)
- ax.scatter('Mo', 'Avg', data=data)
- ax.set_xlim(2, 13)
- ax.set_title(title)
- ax.set_xticks(np.arange(3, 13))
-
-def data_year(data, year):
- return data[data["Yr"] == 1958]
-
-# uses matplotlib subplots
-# you may see more next week; focus on output for now
-fig, axes = plt.subplots(ncols = 3, figsize=(12, 4), sharey=True)
-
-year = 1958
-line_and_points(data_year(co2_drop, year), axes[0], title="1. Drop Missing")
-line_and_points(data_year(co2_NA, year), axes[1], title="2. Missing Set to NaN")
-line_and_points(data_year(co2_impute, year), axes[2], title="3. Missing Interpolated")
-
-fig.suptitle(f"Monthly Averages for {year}")
-plt.tight_layout()
+# results of plotting data in 1958
+
+def line_and_points(data, ax, title):
+ # assumes single year, hence Mo
+ ax.plot('Mo', 'Avg', data=data)
+ ax.scatter('Mo', 'Avg', data=data)
+ ax.set_xlim(2, 13)
+ ax.set_title(title)
+ ax.set_xticks(np.arange(3, 13))
+
+def data_year(data, year):
+ return data[data["Yr"] == 1958]
+
+# uses matplotlib subplots
+# you may see more next week; focus on output for now
+fig, axes = plt.subplots(ncols = 3, figsize=(12, 4), sharey=True)
+
+year = 1958
+line_and_points(data_year(co2_drop, year), axes[0], title="1. Drop Missing")
+line_and_points(data_year(co2_NA, year), axes[1], title="2. Missing Set to NaN")
+line_and_points(data_year(co2_impute, year), axes[2], title="3. Missing Interpolated")
+
+fig.suptitle(f"Monthly Averages for {year}")
+plt.tight_layout()
@@ -4604,8 +4592,8 @@
Code
-
+
@@ -4632,9 +4620,9 @@
Code
-
+
@@ -4975,1218 +4963,1218 @@ <
Source Code
----
-title: Data Cleaning and EDA
-execute:
- echo: true
-format:
- html:
- code-fold: true
- code-tools: true
- toc: true
- toc-title: Data Cleaning and EDA
- page-layout: full
- theme:
- - cosmo
- - cerulean
- callout-icon: false
-jupyter: python3
----
-
-```{python}
-#| code-fold: true
-import numpy as np
-import pandas as pd
-
-import matplotlib.pyplot as plt
-import seaborn as sns
-#%matplotlib inline
-plt.rcParams['figure.figsize'] = (12, 9)
-
-sns.set()
-sns.set_context('talk')
-np.set_printoptions(threshold=20, precision=2, suppress=True)
-pd.set_option('display.max_rows', 30)
-pd.set_option('display.max_columns', None)
-pd.set_option('display.precision', 2)
-# This option stops scientific notation for pandas
-pd.set_option('display.float_format', '{:.2f}'.format)
-
-# Silence some spurious seaborn warnings
-import warnings
-warnings.filterwarnings("ignore", category=FutureWarning)
-```
-
-::: {.callout-note collapse="false"}
-## Learning Outcomes
-* Recognize common file formats
-* Categorize data by its variable type
-* Build awareness of issues with data faithfulness and develop targeted solutions
-:::
-
-**This content is covered in lectures 4, 5, and 6.**
-
-In the past few lectures, we've learned that `pandas` is a toolkit to restructure, modify, and explore a dataset. What we haven't yet touched on is *how* to make these data transformation decisions. When we receive a new set of data from the "real world," how do we know what processing we should do to convert this data into a usable form?
-
-**Data cleaning**, also called **data wrangling**, is the process of transforming raw data to facilitate subsequent analysis. It is often used to address issues like:
-
-* Unclear structure or formatting
-* Missing or corrupted values
-* Unit conversions
-* ...and so on
-
-**Exploratory Data Analysis (EDA)** is the process of understanding a new dataset. It is an open-ended, informal analysis that involves familiarizing ourselves with the variables present in the data, discovering potential hypotheses, and identifying possible issues with the data. This last point can often motivate further data cleaning to address any problems with the dataset's format; because of this, EDA and data cleaning are often thought of as an "infinite loop," with each process driving the other.
-
-In this lecture, we will consider the key properties of data to consider when performing data cleaning and EDA. In doing so, we'll develop a "checklist" of sorts for you to consider when approaching a new dataset. Throughout this process, we'll build a deeper understanding of this early (but very important!) stage of the data science lifecycle.
-
-## Structure
-
-### File Formats
-There are many file types for storing structured data: TSV, JSON, XML, ASCII, SAS, etc. We'll only cover CSV, TSV, and JSON in lecture, but you'll likely encounter other formats as you work with different datasets. Reading documentation is your best bet for understanding how to process the multitude of different file types.
-
-#### CSV
-CSVs, which stand for **Comma-Separated Values**, are a common tabular data format.
-In the past two `pandas` lectures, we briefly touched on the idea of file format: the way data is encoded in a file for storage. Specifically, our `elections` and `babynames` datasets were stored and loaded as CSVs:
-
-```{python}
-#| code-fold: false
-pd.read_csv("data/elections.csv").head(5)
-```
-
-To better understand the properties of a CSV, let's take a look at the first few rows of the raw data file to see what it looks like before being loaded into a `DataFrame`. We'll use the `repr()` function to return the raw string with its special characters:
-
-```{python}
-#| code-fold: false
-with open("data/elections.csv", "r") as table:
- i = 0
- for row in table:
- print(repr(row))
- i += 1
- if i > 3:
- break
-```
-
-Each row, or **record**, in the data is delimited by a newline `\n`. Each column, or **field**, in the data is delimited by a comma `,` (hence, comma-separated!).
-
-#### TSV
-
-Another common file type is **TSV (Tab-Separated Values)**. In a TSV, records are still delimited by a newline `\n`, while fields are delimited by `\t` tab character.
-
-Let's check out the first few rows of the raw TSV file. Again, we'll use the `repr()` function so that `print` shows the special characters.
-
-```{python}
-#| code-fold: false
-with open("data/elections.txt", "r") as table:
- i = 0
- for row in table:
- print(repr(row))
- i += 1
- if i > 3:
- break
-```
-
-TSVs can be loaded into `pandas` using `pd.read_csv`. We'll need to specify the **delimiter** with parameter` sep='\t'` [(documentation)](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html).
-
-```{python}
-#| code-fold: false
-pd.read_csv("data/elections.txt", sep='\t').head(3)
-```
-
-An issue with CSVs and TSVs comes up whenever there are commas or tabs within the records. How does `pandas` differentiate between a comma delimiter vs. a comma within the field itself, for example `8,900`? To remedy this, check out the [`quotechar` parameter](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html).
-
-#### JSON
-**JSON (JavaScript Object Notation)** files behave similarly to Python dictionaries. A raw JSON is shown below.
-
-```{python}
-#| code-fold: false
-with open("data/elections.json", "r") as table:
- i = 0
- for row in table:
- print(row)
- i += 1
- if i > 8:
- break
-```
-
-JSON files can be loaded into `pandas` using `pd.read_json`.
-
-```{python}
-#| code-fold: false
-pd.read_json('data/elections.json').head(3)
-```
-
-##### EDA with JSON: Berkeley COVID-19 Data
-The City of Berkeley Open Data [website](https://data.cityofberkeley.info/Health/COVID-19-Confirmed-Cases/xn6j-b766) has a dataset with COVID-19 Confirmed Cases among Berkeley residents by date. Let's download the file and save it as a JSON (note the source URL file type is also a JSON). In the interest of reproducible data science, we will download the data programatically. We have defined some helper functions in the [`ds100_utils.py`](https://ds100.org/fa23/resources/assets/lectures/lec05/lec05-eda.html) file that we can reuse these helper functions in many different notebooks.
-
-```{python}
-#| code-fold: false
-from ds100_utils import fetch_and_cache
-
-covid_file = fetch_and_cache(
- "https://data.cityofberkeley.info/api/views/xn6j-b766/rows.json?accessType=DOWNLOAD",
- "confirmed-cases.json",
- force=False)
-covid_file # a file path wrapper object
-```
-
-###### File Size
-Let's start our analysis by getting a rough estimate of the size of the dataset to inform the tools we use to view the data. For relatively small datasets, we can use a text editor or spreadsheet. For larger datasets, more programmatic exploration or distributed computing tools may be more fitting. Here we will use `Python` tools to probe the file.
-
-Since there seem to be text files, let's investigate the number of lines, which often corresponds to the number of records
-
-```{python}
-#| code-fold: false
-import os
-
-print(covid_file, "is", os.path.getsize(covid_file) / 1e6, "MB")
-
-with open(covid_file, "r") as f:
- print(covid_file, "is", sum(1 for l in f), "lines.")
-```
-
-###### Unix Commands
-As part of the EDA workflow, Unix commands can come in very handy. In fact, there's an entire book called ["Data Science at the Command Line"](https://datascienceatthecommandline.com/) that explores this idea in depth!
-In Jupyter/IPython, you can prefix lines with `!` to execute arbitrary Unix commands, and within those lines, you can refer to `Python` variables and expressions with the syntax `{expr}`.
-
-Here, we use the `ls` command to list files, using the `-lh` flags, which request "long format with information in human-readable form." We also use the `wc` command for "word count," but with the `-l` flag, which asks for line counts instead of words.
-
-These two give us the same information as the code above, albeit in a slightly different form:
-
-```{python}
-#| code-fold: false
-!ls -lh {covid_file}
-!wc -l {covid_file}
-```
-
-###### File Contents
-Let's explore the data format using `Python`.
-
-```{python}
-#| code-fold: false
-with open(covid_file, "r") as f:
- for i, row in enumerate(f):
- print(repr(row)) # print raw strings
- if i >= 4: break
-```
-
-We can use the `head` Unix command (which is where `pandas`' `head` method comes from!) to see the first few lines of the file:
-
-```{python}
-#| code-fold: false
-!head -5 {covid_file}
-```
-
-In order to load the JSON file into `pandas`, Let's first do some EDA with `Python`'s `json` package to understand the particular structure of this JSON file so that we can decide what (if anything) to load into `pandas`. `Python` has relatively good support for JSON data since it closely matches the internal python object model. In the following cell we import the entire JSON datafile into a python dictionary using the `json` package.
-
-```{python}
-#| code-fold: false
-import json
-
-with open(covid_file, "rb") as f:
- covid_json = json.load(f)
-```
-
-The `covid_json` variable is now a dictionary encoding the data in the file:
-
-```{python}
-#| code-fold: false
-type(covid_json)
-```
-
-We can examine what keys are in the top level json object by listing out the keys.
-
-```{python}
-#| code-fold: false
-covid_json.keys()
-```
-
-**Observation**: The JSON dictionary contains a `meta` key which likely refers to meta data (data about the data). Meta data often maintained with the data and can be a good source of additional information.
-
-
-We can investigate the meta data further by examining the keys associated with the metadata.
-
-```{python}
-#| code-fold: false
-covid_json['meta'].keys()
-```
-
-The `meta` key contains another dictionary called `view`. This likely refers to meta-data about a particular "view" of some underlying database. We will learn more about views when we study SQL later in the class.
-
-```{python}
-#| code-fold: false
-covid_json['meta']['view'].keys()
-```
-
-Notice that this a nested/recursive data structure. As we dig deeper we reveal more and more keys and the corresponding data:
-
-```
-meta
-|-> data
- | ... (haven't explored yet)
-|-> view
- | -> id
- | -> name
- | -> attribution
- ...
- | -> description
- ...
- | -> columns
- ...
-```
-
-
-There is a key called description in the view sub dictionary. This likely contains a description of the data:
-
-```{python}
-#| code-fold: false
-print(covid_json['meta']['view']['description'])
-```
-
-###### Examining the Data Field for Records
-
-We can look at a few entries in the `data` field. This is what we'll load into `pandas`.
-
-```{python}
-#| code-fold: false
-for i in range(3):
- print(f"{i:03} | {covid_json['data'][i]}")
-```
-
-Observations:
-* These look like equal-length records, so maybe `data` is a table!
-* But what do each of values in the record mean? Where can we find column headers?
-
-For that, we'll need the `columns` key in the metadata dictionary. This returns a list:
-
-```{python}
-#| code-fold: false
-type(covid_json['meta']['view']['columns'])
-```
-
-###### Summary of exploring the JSON file
-
-1. The above **metadata** tells us a lot about the columns in the data including column names, potential data anomalies, and a basic statistic.
-1. Because of its non-tabular structure, JSON makes it easier (than CSV) to create **self-documenting data**, meaning that information about the data is stored in the same file as the data.
-1. Self-documenting data can be helpful since it maintains its own description and these descriptions are more likely to be updated as data changes.
-
-###### Loading COVID Data into `pandas`
-Finally, let's load the data (not the metadata) into a `pandas` `DataFrame`. In the following block of code we:
-
-1. Translate the JSON records into a `DataFrame`:
-
- * fields: `covid_json['meta']['view']['columns']`
- * records: `covid_json['data']`
-
-
-1. Remove columns that have no metadata description. This would be a bad idea in general, but here we remove these columns since the above analysis suggests they are unlikely to contain useful information.
-
-1. Examine the `tail` of the table.
-
-```{python}
-#| code-fold: false
-# Load the data from JSON and assign column titles
-covid = pd.DataFrame(
- covid_json['data'],
- columns=[c['name'] for c in covid_json['meta']['view']['columns']])
-
-covid.tail()
-```
-
-### Variable Types
-
-After loading data into a file, it's a good idea to take the time to understand what pieces of information are encoded in the dataset. In particular, we want to identify what variable types are present in our data. Broadly speaking, we can categorize variables into one of two overarching types.
-
-**Quantitative variables** describe some numeric quantity or amount. We can divide quantitative data further into:
-
-* **Continuous quantitative variables**: numeric data that can be measured on a continuous scale to arbitrary precision. Continuous variables do not have a strict set of possible values – they can be recorded to any number of decimal places. For example, weights, GPA, or CO<sub>2</sub> concentrations.
-* **Discrete quantitative variables**: numeric data that can only take on a finite set of possible values. For example, someone's age or the number of siblings they have.
-
-**Qualitative variables**, also known as **categorical variables**, describe data that isn't measuring some quantity or amount. The sub-categories of categorical data are:
-
-* **Ordinal qualitative variables**: categories with ordered levels. Specifically, ordinal variables are those where the difference between levels has no consistent, quantifiable meaning. Some examples include levels of education (high school, undergrad, grad, etc.), income bracket (low, medium, high), or Yelp rating.
-* **Nominal qualitative variables**: categories with no specific order. For example, someone's political affiliation or Cal ID number.
-
-![Classification of variable types](images/variable.png)
-
-Note that many variables don't sit neatly in just one of these categories. Qualitative variables could have numeric levels, and conversely, quantitative variables could be stored as strings.
-
-### Primary and Foreign Keys
-
-Last time, we introduced `.merge` as the `pandas` method for joining multiple `DataFrame`s together. In our discussion of joins, we touched on the idea of using a "key" to determine what rows should be merged from each table. Let's take a moment to examine this idea more closely.
-
-The **primary key** is the column or set of columns in a table that *uniquely* determine the values of the remaining columns. It can be thought of as the unique identifier for each individual row in the table. For example, a table of Data 100 students might use each student's Cal ID as the primary key.
-
-```{python}
-#| echo: false
-pd.DataFrame({"Cal ID":[3034619471, 3035619472, 3025619473, 3046789372], \
- "Name":["Oski", "Ollie", "Orrie", "Ollie"], \
- "Major":["Data Science", "Computer Science", "Data Science", "Economics"]})
-```
-
-The **foreign key** is the column or set of columns in a table that reference primary keys in other tables. Knowing a dataset's foreign keys can be useful when assigning the `left_on` and `right_on` parameters of `.merge`. In the table of office hour tickets below, `"Cal ID"` is a foreign key referencing the previous table.
-
-```{python}
-#| echo: false
-pd.DataFrame({"OH Request":[1, 2, 3, 4], \
- "Cal ID":[3034619471, 3035619472, 3025619473, 3035619472], \
- "Question":["HW 2 Q1", "HW 2 Q3", "Lab 3 Q4", "HW 2 Q7"]})
-```
-
-## Granularity, Scope, and Temporality
-
-After understanding the structure of the dataset, the next task is to determine what exactly the data represents. We'll do so by considering the data's granularity, scope, and temporality.
-
-### Granularity
-The **granularity** of a dataset is what a single row represents. You can also think of it as the level of detail included in the data. To determine the data's granularity, ask: what does each row in the dataset represent? Fine-grained data contains a high level of detail, with a single row representing a small individual unit. For example, each record may represent one person. Coarse-grained data is encoded such that a single row represents a large individual unit – for example, each record may represent a group of people.
-
-### Scope
-The **scope** of a dataset is the subset of the population covered by the data. If we were investigating student performance in Data Science courses, a dataset with a narrow scope might encompass all students enrolled in Data 100 whereas a dataset with an expansive scope might encompass all students in California.
-
-### Temporality
-The **temporality** of a dataset describes the periodicity over which the data was collected as well as when the data was most recently collected or updated.
-
-Time and date fields of a dataset could represent a few things:
-
-1. when the "event" happened
-2. when the data was collected, or when it was entered into the system
-3. when the data was copied into the database
-
-To fully understand the temporality of the data, it also may be necessary to standardize time zones or inspect recurring time-based trends in the data (do patterns recur in 24-hour periods? Over the course of a month? Seasonally?). The convention for standardizing time is the Coordinated Universal Time (UTC), an international time standard measured at 0 degrees latitude that stays consistent throughout the year (no daylight savings). We can represent Berkeley's time zone, Pacific Standard Time (PST), as UTC-7 (with daylight savings).
-
-#### Temporality with `pandas`' `dt` accessors
-Let's briefly look at how we can use `pandas`' `dt` accessors to work with dates/times in a dataset using the dataset you'll see in Lab 3: the Berkeley PD Calls for Service dataset.
-
-```{python}
-#| code-fold: true
-calls = pd.read_csv("data/Berkeley_PD_-_Calls_for_Service.csv")
-calls.head()
-```
-
-Looks like there are three columns with dates/times: `EVENTDT`, `EVENTTM`, and `InDbDate`.
-
-Most likely, `EVENTDT` stands for the date when the event took place, `EVENTTM` stands for the time of day the event took place (in 24-hr format), and `InDbDate` is the date this call is recorded onto the database.
-
-If we check the data type of these columns, we will see they are stored as strings. We can convert them to `datetime` objects using pandas `to_datetime` function.
-
-```{python}
-#| code-fold: false
-calls["EVENTDT"] = pd.to_datetime(calls["EVENTDT"])
-calls.head()
-```
-
-Now, we can use the `dt` accessor on this column.
-
-We can get the month:
-
-```{python}
-#| code-fold: false
-calls["EVENTDT"].dt.month.head()
-```
-
-Which day of the week the date is on:
-
-```{python}
-#| code-fold: false
-calls["EVENTDT"].dt.dayofweek.head()
-```
-
-Check the mimimum values to see if there are any suspicious-looking, 70s dates:
-
-```{python}
-#| code-fold: false
-calls.sort_values("EVENTDT").head()
-```
-
-Doesn't look like it! We are good!
-
-
-We can also do many things with the `dt` accessor like switching time zones and converting time back to UNIX/POSIX time. Check out the documentation on [`.dt` accessor](https://pandas.pydata.org/docs/user_guide/basics.html#basics-dt-accessors) and [time series/date functionality](https://pandas.pydata.org/docs/user_guide/timeseries.html#).
-
-## Faithfulness
-
-At this stage in our data cleaning and EDA workflow, we've achieved quite a lot: we've identified how our data is structured, come to terms with what information it encodes, and gained insight as to how it was generated. Throughout this process, we should always recall the original intent of our work in Data Science – to use data to better understand and model the real world. To achieve this goal, we need to ensure that the data we use is faithful to reality; that is, that our data accurately captures the "real world."
-
-Data used in research or industry is often "messy" – there may be errors or inaccuracies that impact the faithfulness of the dataset. Signs that data may not be faithful include:
-
-* Unrealistic or "incorrect" values, such as negative counts, locations that don't exist, or dates set in the future
-* Violations of obvious dependencies, like an age that does not match a birthday
-* Clear signs that data was entered by hand, which can lead to spelling errors or fields that are incorrectly shifted
-* Signs of data falsification, such as fake email addresses or repeated use of the same names
-* Duplicated records or fields containing the same information
-* Truncated data, e.g. Microsoft Excel would limit the number of rows to 655536 and the number of columns to 255
-
-We often solve some of these more common issues in the following ways:
-
-* Spelling errors: apply corrections or drop records that aren't in a dictionary
-* Time zone inconsistencies: convert to a common time zone (e.g. UTC)
-* Duplicated records or fields: identify and eliminate duplicates (using primary keys)
-* Unspecified or inconsistent units: infer the units and check that values are in reasonable ranges in the data
-
-### Missing Values
-Another common issue encountered with real-world datasets is that of missing data. One strategy to resolve this is to simply drop any records with missing values from the dataset. This does, however, introduce the risk of inducing biases – it is possible that the missing or corrupt records may be systemically related to some feature of interest in the data. Another solution is to keep the data as `NaN` values.
-
-A third method to address missing data is to perform **imputation**: infer the missing values using other data available in the dataset. There is a wide variety of imputation techniques that can be implemented; some of the most common are listed below.
-
-* Average imputation: replace missing values with the average value for that field
-* Hot deck imputation: replace missing values with some random value
-* Regression imputation: develop a model to predict missing values
-* Multiple imputation: replace missing values with multiple random values
-
-Regardless of the strategy used to deal with missing data, we should think carefully about *why* particular records or fields may be missing – this can help inform whether or not the absence of these values is significant or meaningful.
-
-# EDA Demo 1: Tuberculosis in the United States
-
-Now, let's walk through the data-cleaning and EDA workflow to see what can we learn about the presence of Tuberculosis in the United States!
-
-We will examine the data included in the [original CDC article](https://www.cdc.gov/mmwr/volumes/71/wr/mm7112a1.htm?s_cid=mm7112a1_w#T1_down) published in 2021.
-
-
-## CSVs and Field Names
-Suppose Table 1 was saved as a CSV file located in `data/cdc_tuberculosis.csv`.
-
-We can then explore the CSV (which is a text file, and does not contain binary-encoded data) in many ways:
-1. Using a text editor like emacs, vim, VSCode, etc.
-2. Opening the CSV directly in DataHub (read-only), Excel, Google Sheets, etc.
-3. The `Python` file object
-4. `pandas`, using `pd.read_csv()`
-
-To try out options 1 and 2, you can view or download the Tuberculosis dataset from the [lecture demo notebook](https://data100.datahub.berkeley.edu/hub/user-redirect/git-pull?repo=https%3A%2F%2Fgithub.com%2FDS-100%2Ffa23-student&urlpath=lab%2Ftree%2Ffa23-student%2Flecture%2Flec05%2Flec04-eda.ipynb&branch=main) under the `data` folder in the left hand menu. Notice how the CSV file is a type of **rectangular data (i.e., tabular data) stored as comma-separated values**.
-
-Next, let's try out option 3 using the `Python` file object. We'll look at the first four lines:
-
-```{python}
-#| code-fold: true
-with open("data/cdc_tuberculosis.csv", "r") as f:
- i = 0
- for row in f:
- print(row)
- i += 1
- if i > 3:
- break
-```
-
-Whoa, why are there blank lines interspaced between the lines of the CSV?
-
-You may recall that all line breaks in text files are encoded as the special newline character `\n`. `Python`'s `print()` prints each string (which already ends in a newline) and then adds an extra newline of its own, which is why we see the blank lines.
-
-If you're curious, we can use the `repr()` function to return the raw string with all special characters:
-
-```{python}
-#| code-fold: true
-with open("data/cdc_tuberculosis.csv", "r") as f:
- i = 0
- for row in f:
- print(repr(row)) # print raw strings
- i += 1
- if i > 3:
- break
-```
-
-Finally, let's try option 4 and use the tried-and-true Data 100 approach: `pandas`.
-
-```{python}
-#| code-fold: false
-tb_df = pd.read_csv("data/cdc_tuberculosis.csv")
-tb_df.head()
-```
-
-You may notice some strange things about this table: what's up with the "Unnamed" column names and the first row?
-
-Congratulations — you're ready to wrangle your data! Because of how things are stored, we'll need to clean the data a bit to name our columns better.
-
-A reasonable first step is to identify the row with the right header. The `pd.read_csv()` function ([documentation](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html)) has the convenient `header` parameter that we can set to use the elements in row 1 as the appropriate columns:
-
-```{python}
-#| code-fold: false
-tb_df = pd.read_csv("data/cdc_tuberculosis.csv", header=1) # row index
-tb_df.head(5)
-```
-
-Wait...but now we can't differentiate between the "Number of TB cases" and "TB incidence" year columns. `pandas` has tried to make our lives easier by automatically adding ".1" to the latter columns, but this doesn't help us, as humans, understand the data.
-
-We can do this manually with `df.rename()` ([documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.rename.html?highlight=rename#pandas.DataFrame.rename)):
-
-```{python}
-#| code-fold: false
-rename_dict = {'2019': 'TB cases 2019',
- '2020': 'TB cases 2020',
- '2021': 'TB cases 2021',
- '2019.1': 'TB incidence 2019',
- '2020.1': 'TB incidence 2020',
- '2021.1': 'TB incidence 2021'}
-tb_df = tb_df.rename(columns=rename_dict)
-tb_df.head(5)
-```
-
-## Record Granularity
-
-You might already be wondering: what's up with that first record?
-
-Row 0 is what we call a **rollup record**, or summary record. It's often useful when displaying tables to humans. The **granularity** of record 0 (Totals) vs the rest of the records (States) is different.
-
-Okay, EDA step two. How was the rollup record aggregated?
-
-Let's check if the Total TB cases are the sum of all state TB cases. If we sum over all rows, we should get **2x** the total cases in each of the TB cases columns (why do you think this is?).
-
-```{python}
-#| code-fold: true
-tb_df.sum(axis=0)
-```
-
-Whoa, what's going on with the TB cases in 2019, 2020, and 2021? Check out the column types:
-
-```{python}
-#| code-fold: true
-tb_df.dtypes
-```
-
-Since there are commas in the values for TB cases, the numbers are read as the `object` datatype, or **storage type** (close to the `Python` string datatype), so `pandas` is concatenating strings instead of adding integers (recall that `Python` can "sum", or concatenate, strings together: `"data" + "100"` evaluates to `"data100"`).
-
-
-Fortunately `read_csv` also has a `thousands` parameter ([documentation](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html)):
-
-```{python}
-#| code-fold: false
-# improve readability: chaining method calls with outer parentheses/line breaks
-tb_df = (
- pd.read_csv("data/cdc_tuberculosis.csv", header=1, thousands=',')
- .rename(columns=rename_dict)
-)
-tb_df.head(5)
-```
-
-```{python}
-#| code-fold: false
-tb_df.sum()
-```
-
-The Total TB cases look right. Phew!
-
-Let's just look at the records with **state-level granularity**:
-
-```{python}
-#| code-fold: true
-state_tb_df = tb_df[1:]
-state_tb_df.head(5)
-```
-
-## Gather Census Data
-
-U.S. Census population estimates [source](https://www.census.gov/data/tables/time-series/demo/popest/2010s-state-total.html) (2019), [source](https://www.census.gov/data/tables/time-series/demo/popest/2020s-state-total.html) (2020-2021).
-
-Running the below cells cleans the data.
-There are a few new methods here:
-* `df.convert_dtypes()` ([documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.convert_dtypes.html)) conveniently converts all float dtypes into ints; it is out of scope for this class.
-* `df.dropna()` ([documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.dropna.html)) will be explained in more detail next time.
-
-```{python}
-#| code-fold: true
-# 2010s census data
-census_2010s_df = pd.read_csv("data/nst-est2019-01.csv", header=3, thousands=",")
-census_2010s_df = (
- census_2010s_df
- .reset_index()
- .drop(columns=["index", "Census", "Estimates Base"])
- .rename(columns={"Unnamed: 0": "Geographic Area"})
- .convert_dtypes() # "smart" converting of columns, use at your own risk
- .dropna() # we'll introduce this next time
-)
-census_2010s_df['Geographic Area'] = census_2010s_df['Geographic Area'].str.strip('.')
-
-# with pd.option_context('display.min_rows', 30): # shows more rows
-# display(census_2010s_df)
-
-census_2010s_df.head(5)
-```
-
-Occasionally, you will want to modify code that you have imported. To reimport those modifications you can either use `python`'s `importlib` library:
-
-```python
-from importlib import reload
-reload(utils)
-```
-
-or use `iPython` magic which will intelligently import code when files change:
-
-```python
-%load_ext autoreload
-%autoreload 2
-```
-
-```{python}
-#| code-fold: true
-# census 2020s data
-census_2020s_df = pd.read_csv("data/NST-EST2022-POP.csv", header=3, thousands=",")
-census_2020s_df = (
- census_2020s_df
- .reset_index()
- .drop(columns=["index", "Unnamed: 1"])
- .rename(columns={"Unnamed: 0": "Geographic Area"})
- .convert_dtypes() # "smart" converting of columns, use at your own risk
- .dropna() # we'll introduce this next time
-)
-census_2020s_df['Geographic Area'] = census_2020s_df['Geographic Area'].str.strip('.')
-
-census_2020s_df.head(5)
-```
-
-## Joining Data (Merging `DataFrame`s)
-
-Time to `merge`! Here we use the `DataFrame` method `df1.merge(right=df2, ...)` on `DataFrame` `df1` ([documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.merge.html)). Contrast this with the function `pd.merge(left=df1, right=df2, ...)` ([documentation](https://pandas.pydata.org/docs/reference/api/pandas.merge.html?highlight=pandas%20merge#pandas.merge)). Feel free to use either.
-
-```{python}
-#| code-fold: false
-# merge TB DataFrame with two US census DataFrames
-tb_census_df = (
- tb_df
- .merge(right=census_2010s_df,
- left_on="U.S. jurisdiction", right_on="Geographic Area")
- .merge(right=census_2020s_df,
- left_on="U.S. jurisdiction", right_on="Geographic Area")
-)
-tb_census_df.head(5)
-```
-
-Having all of these columns is a little unwieldy. We could either drop the unneeded columns now, or just merge on smaller census `DataFrame`s. Let's do the latter.
-
-```{python}
-#| code-fold: false
-# try merging again, but cleaner this time
-tb_census_df = (
- tb_df
- .merge(right=census_2010s_df[["Geographic Area", "2019"]],
- left_on="U.S. jurisdiction", right_on="Geographic Area")
- .drop(columns="Geographic Area")
- .merge(right=census_2020s_df[["Geographic Area", "2020", "2021"]],
- left_on="U.S. jurisdiction", right_on="Geographic Area")
- .drop(columns="Geographic Area")
-)
-tb_census_df.head(5)
-```
-
-## Reproducing Data: Compute Incidence
-
-Let's recompute incidence to make sure we know where the original CDC numbers came from.
-
-From the [CDC report](https://www.cdc.gov/mmwr/volumes/71/wr/mm7112a1.htm?s_cid=mm7112a1_w#T1_down): TB incidence is computed as “Cases per 100,000 persons using mid-year population estimates from the U.S. Census Bureau.”
-
-If we define a group as 100,000 people, then we can compute the TB incidence for a given state population as
-
-$$\text{TB incidence} = \frac{\text{TB cases in population}}{\text{groups in population}} = \frac{\text{TB cases in population}}{\text{population}/100000} $$
-
-$$= \frac{\text{TB cases in population}}{\text{population}} \times 100000$$
-
-Let's try this for 2019:
-
-```{python}
-#| code-fold: false
-tb_census_df["recompute incidence 2019"] = tb_census_df["TB cases 2019"]/tb_census_df["2019"]*100000
-tb_census_df.head(5)
-```
-
-Awesome!!!
-
-Let's use a for-loop and `Python` format strings to compute TB incidence for all years. `Python` f-strings are just used for the purposes of this demo, but they're handy to know when you explore data beyond this course ([documentation](https://docs.python.org/3/tutorial/inputoutput.html)).
-
-```{python}
-#| code-fold: false
-# recompute incidence for all years
-for year in [2019, 2020, 2021]:
- tb_census_df[f"recompute incidence {year}"] = tb_census_df[f"TB cases {year}"]/tb_census_df[f"{year}"]*100000
-tb_census_df.head(5)
-```
-
-These numbers look pretty close!!! There are a few errors in the hundredths place, particularly in 2021. It may be useful to further explore reasons behind this discrepancy.
-
-```{python}
-#| code-fold: false
-tb_census_df.describe()
-```
-
-## Bonus EDA: Reproducing the Reported Statistic
-
-
-**How do we reproduce that reported statistic in the original [CDC report](https://www.cdc.gov/mmwr/volumes/71/wr/mm7112a1.htm?s_cid=mm7112a1_w)?**
-
-> Reported TB incidence (cases per 100,000 persons) increased **9.4%**, from **2.2** during 2020 to **2.4** during 2021 but was lower than incidence during 2019 (2.7). Increases occurred among both U.S.-born and non–U.S.-born persons.
-
-This is TB incidence computed across the entire U.S. population! How do we reproduce this?
-* We need to reproduce the "Total" TB incidences in our rolled record.
-* But our current `tb_census_df` only has 51 entries (50 states plus Washington, D.C.). There is no rolled record.
-* What happened...?
-
-Let's get exploring!
-
-Before we keep exploring, we'll set all indexes to more meaningful values, instead of just numbers that pertain to some row at some point. This will make our cleaning slightly easier.
-
-```{python}
-#| code-fold: true
-tb_df = tb_df.set_index("U.S. jurisdiction")
-tb_df.head(5)
-```
-
-```{python}
-#| code-fold: false
-census_2010s_df = census_2010s_df.set_index("Geographic Area")
-census_2010s_df.head(5)
-```
-
-```{python}
-#| code-fold: false
-census_2020s_df = census_2020s_df.set_index("Geographic Area")
-census_2020s_df.head(5)
-```
-
-It turns out that our merge above only kept state records, even though our original `tb_df` had the "Total" rolled record:
-
-```{python}
-#| code-fold: false
-tb_df.head()
-```
-
-Recall that `merge` does an **inner** merge by default, meaning that it only preserves keys that are present in **both** `DataFrame`s.
-
-The rolled records in our census `DataFrame` have different `Geographic Area` fields, which was the key we merged on:
-
-```{python}
-#| code-fold: false
-census_2010s_df.head(5)
-```
-
-The Census `DataFrame` has several rolled records. The aggregate record we are looking for actually has the Geographic Area named "United States".
-
-One straightforward way to get the right merge is to rename the value itself. Because we now have the Geographic Area index, we'll use `df.rename()` ([documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.rename.html)):
-
-```{python}
-#| code-fold: false
-# rename rolled record for 2010s
-census_2010s_df.rename(index={'United States':'Total'}, inplace=True)
-census_2010s_df.head(5)
-```
-
-```{python}
-#| code-fold: false
-# same, but for 2020s rename rolled record
-census_2020s_df.rename(index={'United States':'Total'}, inplace=True)
-census_2020s_df.head(5)
-```
-
-<br/>
-
-Next let's rerun our merge. Note the different chaining, because we are now merging on indexes (`df.merge()` [documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.merge.html)).
-
-```{python}
-#| code-fold: false
-tb_census_df = (
- tb_df
- .merge(right=census_2010s_df[["2019"]],
- left_index=True, right_index=True)
- .merge(right=census_2020s_df[["2020", "2021"]],
- left_index=True, right_index=True)
-)
-tb_census_df.head(5)
-```
-
-<br/>
-
-Finally, let's recompute our incidences:
-
-```{python}
-#| code-fold: false
-# recompute incidence for all years
-for year in [2019, 2020, 2021]:
- tb_census_df[f"recompute incidence {year}"] = tb_census_df[f"TB cases {year}"]/tb_census_df[f"{year}"]*100000
-tb_census_df.head(5)
-```
-
-We reproduced the total U.S. incidences correctly!
-
-We're almost there. Let's revisit the quote:
-
-> Reported TB incidence (cases per 100,000 persons) increased **9.4%**, from **2.2** during 2020 to **2.4** during 2021 but was lower than incidence during 2019 (2.7). Increases occurred among both U.S.-born and non–U.S.-born persons.
-
-Recall that percent change from $A$ to $B$ is computed as
-$\text{percent change} = \frac{B - A}{A} \times 100$.
-
-```{python}
-#| code-fold: false
-#| tags: []
-incidence_2020 = tb_census_df.loc['Total', 'recompute incidence 2020']
-incidence_2020
-```
-
-```{python}
-#| code-fold: false
-#| tags: []
-incidence_2021 = tb_census_df.loc['Total', 'recompute incidence 2021']
-incidence_2021
-```
-
-```{python}
-#| code-fold: false
-#| tags: []
-difference = (incidence_2021 - incidence_2020)/incidence_2020 * 100
-difference
-```
-
-# EDA Demo 2: Mauna Loa CO<sub>2</sub> Data -- A Lesson in Data Faithfulness
-
-[Mauna Loa Observatory](https://gml.noaa.gov/ccgg/trends/data.html) has been monitoring CO<sub>2</sub> concentrations since 1958.
-
-```{python}
-#| code-fold: false
-co2_file = "data/co2_mm_mlo.txt"
-```
-
-Let's do some **EDA**!!
-
-## Reading this file into Pandas?
-Let's instead check out this `.txt` file. Some questions to keep in mind: Do we trust this file extension? What is its structure?
-
-Lines 71-78 (inclusive) are shown below:
-
- line number | file contents
-
- 71 | # decimal average interpolated trend #days
- 72 | # date (season corr)
- 73 | 1958 3 1958.208 315.71 315.71 314.62 -1
- 74 | 1958 4 1958.292 317.45 317.45 315.29 -1
- 75 | 1958 5 1958.375 317.50 317.50 314.71 -1
- 76 | 1958 6 1958.458 -99.99 317.10 314.85 -1
- 77 | 1958 7 1958.542 315.86 315.86 314.98 -1
- 78 | 1958 8 1958.625 314.93 314.93 315.94 -1
-
-
-Notice how:
-
-- The values are separated by white space, possibly tabs.
-- The data line up down the rows; for example, the month appears in the 7th to 8th position of each line.
-- The 71st and 72nd lines in the file contain column headings split over two lines.
-
-We can use `read_csv` to read the data into a `pandas` `DataFrame`. We provide several arguments to specify that the separator is white space, that there is no header (**we will set our own column names**), and that the first 72 rows of the file should be skipped.
-
-```{python}
-#| code-fold: false
-co2 = pd.read_csv(
-    co2_file, header=None, skiprows=72,
-    sep=r'\s+'  # delimiter for continuous whitespace (stay tuned for regex next lecture)
-)
-co2.head()
-```
-
-Congratulations! You've wrangled the data!
-
-<br/>
-
-...But our columns aren't named.
-**We need to do more EDA.**
-
-## Exploring Variable Feature Types
-
-The NOAA [webpage](https://gml.noaa.gov/ccgg/trends/) might have some useful tidbits (in this case it doesn't).
-
-Using this information, we'll rerun `pd.read_csv`, but this time with some **custom column names.**
-
-```{python}
-#| code-fold: false
-co2 = pd.read_csv(
-    co2_file, header=None, skiprows=72,
-    sep=r'\s+',  # regex for continuous whitespace (next lecture)
-    names=['Yr', 'Mo', 'DecDate', 'Avg', 'Int', 'Trend', 'Days']
-)
-co2.head()
-```
-
-## Visualizing CO<sub>2</sub>
-Scientific studies tend to have very clean data, right...? Let's jump right in and make a time series plot of CO2 monthly averages.
-
-```{python}
-#| code-fold: true
-sns.lineplot(x='DecDate', y='Avg', data=co2);
-```
-
-The code above uses the `seaborn` plotting library (abbreviated `sns`). We will cover it in the Visualization lecture; for now, you don't need to worry about how it works!
-
-Yikes! Plotting the data uncovered a problem. The sharp vertical lines suggest that we have some **missing values**. What happened here?
-
-```{python}
-#| code-fold: false
-co2.head()
-```
-
-```{python}
-#| code-fold: false
-co2.tail()
-```
-
-Some data have unusual values like -1 and -99.99.
-
-Let's check the description at the top of the file again.
-
-* -1 signifies a missing value for the number of days `Days` the equipment was in operation that month.
-* -99.99 denotes a missing monthly average `Avg`
-
-How can we fix this? First, let's explore other aspects of our data. Understanding our data will help us decide what to do with the missing values.
-
-<br/>
-
-
-## Sanity Checks: Reasoning about the data
-First, we consider the shape of the data. How many rows should we have?
-
-* If the data are in chronological order, we should have one record per month.
-* The data run from March 1958 to August 2019.
-* We should have $ 12 \times (2019-1957) - 2 - 4 = 738 $ records: 62 calendar years (1958 through 2019) of 12 months each, minus the 2 missing months at the start of 1958 (January and February) and the 4 missing months at the end of 2019 (September through December).
-
-```{python}
-#| code-fold: false
-co2.shape
-```
-
-Nice!! The number of rows (i.e., records) matches our expectations.
-
-<br/>
-
-
-Let's now check the quality of each feature.
-
-## Understanding Missing Value 1: `Days`
-`Days` is a time field, so let's analyze other time fields to see if there is an explanation for missing values of days of operation.
-
-Let's start with **months**, `Mo`.
-
-Are we missing any records? Each month should appear 61 or 62 times (March 1958-August 2019).
-
-```{python}
-#| code-fold: false
-co2["Mo"].value_counts().sort_index()
-```
-
-As expected, Jan, Feb, Sep, Oct, Nov, and Dec have 61 occurrences, and the rest have 62.
-
-<br/>
-
-Next let's explore **days** `Days` itself, which is the number of days that the measurement equipment worked.
-
-```{python}
-#| code-fold: true
-sns.displot(co2['Days']);
-plt.title("Distribution of days feature"); # suppresses unneeded plotting output
-```
-
-In terms of data quality, a handful of months have averages based on measurements taken on fewer than half the days. In addition, there are nearly 200 missing values--**that's about 27% of the data**!
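-
-We can verify this claim directly. Here is a quick sketch that counts the -1 sentinel values in `Days` and their share of all records:
-
-```{python}
-#| code-fold: false
-# count the -1 sentinel values in Days and compute their share of all records
-n_missing_days = (co2['Days'] == -1).sum()
-n_missing_days, n_missing_days / len(co2)
-```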
-
-<br/>
-
-Finally, let's check the last time feature, **year** `Yr`.
-
-Let's check to see if there is any connection between missing-ness and the year of the recording.
-
-```{python}
-#| code-fold: true
-sns.scatterplot(x="Yr", y="Days", data=co2);
-plt.title("Day field by Year"); # the ; suppresses output
-```
-
-**Observations**:
-
-* All of the missing data are in the early years of operation.
-* It appears there may have been problems with equipment in the mid to late 80s.
-
-**Potential Next Steps**:
-
-* Confirm these explanations through documentation about the historical readings.
-* Maybe drop the earliest recordings? However, we would want to delay such action until after we have examined the time trends and assessed whether there are any potential problems.
-
-<br/>
-
-## Understanding Missing Value 2: `Avg`
-Next, let's return to the -99.99 values in `Avg` to analyze the overall quality of the CO<sub>2</sub> measurements. We'll plot a histogram of the average CO<sub>2</sub> measurements.
-
-```{python}
-#| code-fold: true
-# Histograms of average CO2 measurements
-sns.displot(co2['Avg']);
-```
-
-The non-missing values are in the 300-400 range (a regular range of CO2 levels).
-
-We also see that there are only a few missing `Avg` values (**<1% of values**). Let's examine all of them:
-
-```{python}
-#| code-fold: false
-co2[co2["Avg"] < 0]
-```
-
-There doesn't seem to be a pattern to these values, other than that most records also were missing `Days` data.
-
-## Drop, `NaN`, or Impute Missing `Avg` Data?
-
-How should we address the invalid `Avg` data?
-
-1. Drop records
-2. Set to NaN
-3. Impute using some strategy
-
-Remember we want to fix the following plot:
-
-```{python}
-#| code-fold: true
-sns.lineplot(x='DecDate', y='Avg', data=co2)
-plt.title("CO2 Average By Month");
-```
-
-Since we are plotting `Avg` vs `DecDate`, we should just focus on dealing with missing values for `Avg`.
-
-
-Let's consider a few options:
-1. Drop those records
-2. Replace -99.99 with NaN
-3. Substitute a likely value for the average CO<sub>2</sub>
-
-What do you think are the pros and cons of each possible action?
-
-<br/>
-
-
-Let's examine each of these three options.
-
-```{python}
-#| code-fold: false
-# 1. Drop missing values
-co2_drop = co2[co2['Avg'] > 0]
-co2_drop.head()
-```
-
-```{python}
-#| code-fold: false
-# 2. Replace -99.99 with NaN
-co2_NA = co2.replace(-99.99, np.nan)
-co2_NA.head()
-```
-
-We'll also use a third version of the data.
-
-First, we note that the dataset already comes with a **substitute value** for the -99.99.
-
-From the file description:
-
-> The `interpolated` column includes average values from the preceding column (`average`)
-and **interpolated values** where data are missing. Interpolated values are
-computed in two steps...
-
-The `Int` feature has values that exactly match those in `Avg`, except when `Avg` is -99.99, and then a **reasonable** estimate is used instead.
-
-So, the third version of our data will use the `Int` feature instead of `Avg`.
-
-```{python}
-#| code-fold: false
-# 3. Use interpolated column which estimates missing Avg values
-co2_impute = co2.copy()
-co2_impute['Avg'] = co2['Int']
-co2_impute.head()
-```
-
-What's a **reasonable** estimate?
-
-To answer this question, let's zoom in on a short time period, say the measurements in 1958 (where we know we have two missing values).
-
-```{python}
-#| code-fold: true
-# results of plotting data in 1958
-
-def line_and_points(data, ax, title):
- # assumes single year, hence Mo
- ax.plot('Mo', 'Avg', data=data)
- ax.scatter('Mo', 'Avg', data=data)
- ax.set_xlim(2, 13)
- ax.set_title(title)
- ax.set_xticks(np.arange(3, 13))
-
-def data_year(data, year):
-    return data[data["Yr"] == year]
-
-# uses matplotlib subplots
-# you may see more next week; focus on output for now
-fig, axes = plt.subplots(ncols = 3, figsize=(12, 4), sharey=True)
-
-year = 1958
-line_and_points(data_year(co2_drop, year), axes[0], title="1. Drop Missing")
-line_and_points(data_year(co2_NA, year), axes[1], title="2. Missing Set to NaN")
-line_and_points(data_year(co2_impute, year), axes[2], title="3. Missing Interpolated")
-
-fig.suptitle(f"Monthly Averages for {year}")
-plt.tight_layout()
-```
-
-In the big picture, since there are only 7 `Avg` values missing (**<1%** of 738 months), any of these approaches would work.
-
-However, there is some appeal to **option 3: Imputing**:
-
-* Shows seasonal trends for CO2
-* We are plotting all months in our data as a line plot
-
-<br/>
-
-
-Let's replot our original figure with option 3:
-
-```{python}
-#| code-fold: true
-sns.lineplot(x='DecDate', y='Avg', data=co2_impute)
-plt.title("CO2 Average By Month, Imputed");
-```
-
-Looks pretty close to what we see on the NOAA [website](https://gml.noaa.gov/ccgg/trends/)!
-
-## Presenting the data: A Discussion on Data Granularity
-
-From the description:
-
-* Monthly measurements are averages of daily average measurements.
-* The NOAA GML website has datasets for daily/hourly measurements too.
-
-The data you present depends on your research question.
-
-**How do CO2 levels vary by season?**
-
-* You might want to keep average monthly data.
-
-**Are CO2 levels rising over the past 50+ years, consistent with global warming predictions?**
-
-* You might be happier with a **coarser granularity** of average year data!
-
-```{python}
-#| code-fold: true
-co2_year = co2_impute.groupby('Yr').mean()
-sns.lineplot(x='Yr', y='Avg', data=co2_year)
-plt.title("CO2 Average By Year");
-```
-
-Indeed, we see a rise by nearly 100 ppm of CO2 since Mauna Loa began recording in 1958.
-
-# Summary
-We went over a lot of content in this lecture; let's summarize the most important points:
-
-## Dealing with Missing Values
-There are a few options we can take to deal with missing data:
-
-* Drop missing records
-* Keep `NaN` missing values
-* Impute using an interpolated column
-
-## EDA and Data Wrangling
-There are several ways to approach EDA and Data Wrangling:
-
-* Examine the **data and metadata**: what is the date, size, organization, and structure of the data?
-* Examine each **field/attribute/dimension** individually.
-* Examine pairs of related dimensions (e.g. breaking down grades by major).
-* Along the way, we can:
- * **Visualize** or summarize the data.
- * **Validate assumptions** about data and its collection process. Pay particular attention to when the data was collected.
- * Identify and **address anomalies**.
- * Apply data transformations and corrections (we'll cover this in the upcoming lecture).
- * **Record everything you do!** Developing in Jupyter Notebook promotes *reproducibility* of your own work!
+---
+title: Data Cleaning and EDA
+execute:
+ echo: true
+format:
+ html:
+ code-fold: true
+ code-tools: true
+ toc: true
+ toc-title: Data Cleaning and EDA
+ page-layout: full
+ theme:
+ - cosmo
+ - cerulean
+ callout-icon: false
+jupyter: python3
+---
+
+```{python}
+#| code-fold: true
+import numpy as np
+import pandas as pd
+
+import matplotlib.pyplot as plt
+import seaborn as sns
+#%matplotlib inline
+plt.rcParams['figure.figsize'] = (12, 9)
+
+sns.set()
+sns.set_context('talk')
+np.set_printoptions(threshold=20, precision=2, suppress=True)
+pd.set_option('display.max_rows', 30)
+pd.set_option('display.max_columns', None)
+pd.set_option('display.precision', 2)
+# This option stops scientific notation for pandas
+pd.set_option('display.float_format', '{:.2f}'.format)
+
+# Silence some spurious seaborn warnings
+import warnings
+warnings.filterwarnings("ignore", category=FutureWarning)
+```
+
+::: {.callout-note collapse="false"}
+## Learning Outcomes
+* Recognize common file formats
+* Categorize data by its variable type
+* Build awareness of issues with data faithfulness and develop targeted solutions
+:::
+
+**This content is covered in lectures 4, 5, and 6.**
+
+In the past few lectures, we've learned that `pandas` is a toolkit to restructure, modify, and explore a dataset. What we haven't yet touched on is *how* to make these data transformation decisions. When we receive a new set of data from the "real world," how do we know what processing we should do to convert this data into a usable form?
+
+**Data cleaning**, also called **data wrangling**, is the process of transforming raw data to facilitate subsequent analysis. It is often used to address issues like:
+
+* Unclear structure or formatting
+* Missing or corrupted values
+* Unit conversions
+* ...and so on
+
+**Exploratory Data Analysis (EDA)** is the process of understanding a new dataset. It is an open-ended, informal analysis that involves familiarizing ourselves with the variables present in the data, discovering potential hypotheses, and identifying possible issues with the data. This last point can often motivate further data cleaning to address any problems with the dataset's format; because of this, EDA and data cleaning are often thought of as an "infinite loop," with each process driving the other.
+
+In this lecture, we will examine the key properties of data to consider when performing data cleaning and EDA. In doing so, we'll develop a "checklist" of sorts for you to consider when approaching a new dataset. Throughout this process, we'll build a deeper understanding of this early (but very important!) stage of the data science lifecycle.
+
+## Structure
+
+### File Formats
+There are many file types for storing structured data: TSV, JSON, XML, ASCII, SAS, etc. We'll only cover CSV, TSV, and JSON in lecture, but you'll likely encounter other formats as you work with different datasets. Reading documentation is your best bet for understanding how to process the multitude of different file types.
+
+#### CSV
+CSVs, which stand for **Comma-Separated Values**, are a common tabular data format.
+In the past two `pandas` lectures, we briefly touched on the idea of file format: the way data is encoded in a file for storage. Specifically, our `elections` and `babynames` datasets were stored and loaded as CSVs:
+
+```{python}
+#| code-fold: false
+pd.read_csv("data/elections.csv").head(5)
+```
+
+To better understand the properties of a CSV, let's take a look at the first few rows of the raw data file to see what it looks like before being loaded into a `DataFrame`. We'll use the `repr()` function to return the raw string with its special characters:
+
+```{python}
+#| code-fold: false
+with open("data/elections.csv", "r") as table:
+ i = 0
+ for row in table:
+ print(repr(row))
+ i += 1
+ if i > 3:
+ break
+```
+
+Each row, or **record**, in the data is delimited by a newline `\n`. Each column, or **field**, in the data is delimited by a comma `,` (hence, comma-separated!).
+
+#### TSV
+
+Another common file type is **TSV (Tab-Separated Values)**. In a TSV, records are still delimited by a newline `\n`, while fields are delimited by the tab character `\t`.
+
+Let's check out the first few rows of the raw TSV file. Again, we'll use the `repr()` function so that `print` shows the special characters.
+
+```{python}
+#| code-fold: false
+with open("data/elections.txt", "r") as table:
+ i = 0
+ for row in table:
+ print(repr(row))
+ i += 1
+ if i > 3:
+ break
+```
+
+TSVs can be loaded into `pandas` using `pd.read_csv`. We'll need to specify the **delimiter** with the parameter `sep='\t'` [(documentation)](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html).
+
+```{python}
+#| code-fold: false
+pd.read_csv("data/elections.txt", sep='\t').head(3)
+```
+
+An issue with CSVs and TSVs comes up whenever there are commas or tabs within the records. How does `pandas` differentiate between a comma that is a delimiter and a comma within the field itself, for example `8,900`? To remedy this, check out the [`quotechar` parameter](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html).
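+
+As a quick illustration, here is a minimal sketch using a small in-memory string (the candidate rows are made up for this example, not taken from `elections.csv`): the quote characters keep the comma inside each name, and `thousands=','` parses `"8,900"` as a number.
+
+```{python}
+#| code-fold: false
+from io import StringIO
+
+# made-up two-row CSV: quoted fields protect the commas inside names,
+# and thousands=',' converts "8,900" into the integer 8900
+raw = 'Candidate,Votes\n"Washington, George","8,900"\n"Adams, John","4,500"\n'
+pd.read_csv(StringIO(raw), quotechar='"', thousands=',')
+```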
+
+#### JSON
+**JSON (JavaScript Object Notation)** files behave similarly to Python dictionaries. A raw JSON is shown below.
+
+```{python}
+#| code-fold: false
+with open("data/elections.json", "r") as table:
+ i = 0
+ for row in table:
+ print(row)
+ i += 1
+ if i > 8:
+ break
+```
+
+JSON files can be loaded into `pandas` using `pd.read_json`.
+
+```{python}
+#| code-fold: false
+pd.read_json('data/elections.json').head(3)
+```
+
+##### EDA with JSON: Berkeley COVID-19 Data
+The City of Berkeley Open Data [website](https://data.cityofberkeley.info/Health/COVID-19-Confirmed-Cases/xn6j-b766) has a dataset with COVID-19 Confirmed Cases among Berkeley residents by date. Let's download the file and save it as a JSON (note the source URL file type is also a JSON). In the interest of reproducible data science, we will download the data programmatically. We have defined some helper functions in the [`ds100_utils.py`](https://ds100.org/fa23/resources/assets/lectures/lec05/lec05-eda.html) file so that we can reuse these helper functions in many different notebooks.
+
+```{python}
+#| code-fold: false
+from ds100_utils import fetch_and_cache
+
+covid_file = fetch_and_cache(
+ "https://data.cityofberkeley.info/api/views/xn6j-b766/rows.json?accessType=DOWNLOAD",
+ "confirmed-cases.json",
+ force=False)
+covid_file # a file path wrapper object
+```
+
+###### File Size
+Let's start our analysis by getting a rough estimate of the size of the dataset to inform the tools we use to view the data. For relatively small datasets, we can use a text editor or spreadsheet. For larger datasets, more programmatic exploration or distributed computing tools may be more fitting. Here we will use `Python` tools to probe the file.
+
+Since this appears to be a text file, let's investigate the number of lines, which often corresponds to the number of records.
+
+```{python}
+#| code-fold: false
+import os
+
+print(covid_file, "is", os.path.getsize(covid_file) / 1e6, "MB")
+
+with open(covid_file, "r") as f:
+ print(covid_file, "is", sum(1 for l in f), "lines.")
+```
+
+###### Unix Commands
+As part of the EDA workflow, Unix commands can come in very handy. In fact, there's an entire book called ["Data Science at the Command Line"](https://datascienceatthecommandline.com/) that explores this idea in depth!
+In Jupyter/IPython, you can prefix lines with `!` to execute arbitrary Unix commands, and within those lines, you can refer to `Python` variables and expressions with the syntax `{expr}`.
+
+Here, we use the `ls` command to list files, using the `-lh` flags, which request "long format with information in human-readable form." We also use the `wc` command for "word count," but with the `-l` flag, which asks for line counts instead of words.
+
+These two give us the same information as the code above, albeit in a slightly different form:
+
+```{python}
+#| code-fold: false
+!ls -lh {covid_file}
+!wc -l {covid_file}
+```
+
+###### File Contents
+Let's explore the data format using `Python`.
+
+```{python}
+#| code-fold: false
+with open(covid_file, "r") as f:
+ for i, row in enumerate(f):
+ print(repr(row)) # print raw strings
+ if i >= 4: break
+```
+
+We can use the `head` Unix command (which is where `pandas`' `head` method comes from!) to see the first few lines of the file:
+
+```{python}
+#| code-fold: false
+!head -5 {covid_file}
+```
+
+In order to load the JSON file into `pandas`, let's first do some EDA with `Python`'s `json` package to understand the particular structure of this JSON file so that we can decide what (if anything) to load into `pandas`. `Python` has relatively good support for JSON data since it closely matches the internal `Python` object model. In the following cell we import the entire JSON datafile into a `Python` dictionary using the `json` package.
+
+```{python}
+#| code-fold: false
+import json
+
+with open(covid_file, "rb") as f:
+ covid_json = json.load(f)
+```
+
+The `covid_json` variable is now a dictionary encoding the data in the file:
+
+```{python}
+#| code-fold: false
+type(covid_json)
+```
+
+We can examine the keys in the top-level JSON object by listing them out.
+
+```{python}
+#| code-fold: false
+covid_json.keys()
+```
+
+**Observation**: The JSON dictionary contains a `meta` key, which likely refers to metadata (data about the data). Metadata is often maintained with the data and can be a good source of additional information.
+
+
+We can investigate the metadata further by examining its keys.
+
+```{python}
+#| code-fold: false
+covid_json['meta'].keys()
+```
+
+The `meta` key contains another dictionary called `view`. This likely refers to meta-data about a particular "view" of some underlying database. We will learn more about views when we study SQL later in the class.
+
+```{python}
+#| code-fold: false
+covid_json['meta']['view'].keys()
+```
+
+Notice that this is a nested/recursive data structure. As we dig deeper, we reveal more and more keys and the corresponding data:
+
+```
+meta
+|-> data
+ | ... (haven't explored yet)
+|-> view
+ | -> id
+ | -> name
+ | -> attribution
+ ...
+ | -> description
+ ...
+ | -> columns
+ ...
+```
+
+
+There is a key called `description` in the `view` sub-dictionary. This likely contains a description of the data:
+
+```{python}
+#| code-fold: false
+print(covid_json['meta']['view']['description'])
+```
+
+###### Examining the Data Field for Records
+
+We can look at a few entries in the `data` field. This is what we'll load into `pandas`.
+
+```{python}
+#| code-fold: false
+for i in range(3):
+ print(f"{i:03} | {covid_json['data'][i]}")
+```
+
+Observations:
+* These look like equal-length records, so maybe `data` is a table!
+* But what does each of the values in the record mean? Where can we find the column headers?
+
+For that, we'll need the `columns` key in the metadata dictionary. This returns a list:
+
+```{python}
+#| code-fold: false
+type(covid_json['meta']['view']['columns'])
+```
+
+###### Summary of exploring the JSON file
+
+1. The above **metadata** tells us a lot about the columns in the data, including column names, potential data anomalies, and a basic statistic.
+1. Because of its non-tabular structure, JSON makes it easier (than CSV) to create **self-documenting data**, meaning that information about the data is stored in the same file as the data.
+1. Self-documenting data can be helpful since it maintains its own description and these descriptions are more likely to be updated as data changes.
+
+###### Loading COVID Data into `pandas`
+Finally, let's load the data (not the metadata) into a `pandas` `DataFrame`. In the following block of code we:
+
+1. Translate the JSON records into a `DataFrame`:
+
+ * fields: `covid_json['meta']['view']['columns']`
+ * records: `covid_json['data']`
+
+
+1. Remove columns that have no metadata description. This would be a bad idea in general, but here we remove these columns since the above analysis suggests they are unlikely to contain useful information.
+
+1. Examine the `tail` of the table.
+
+```{python}
+#| code-fold: false
+# Load the data from JSON and assign column titles
+covid = pd.DataFrame(
+ covid_json['data'],
+ columns=[c['name'] for c in covid_json['meta']['view']['columns']])
+
+covid.tail()
+```
+
+### Variable Types
+
+After loading data into a `DataFrame`, it's a good idea to take the time to understand what pieces of information are encoded in the dataset. In particular, we want to identify what variable types are present in our data. Broadly speaking, we can categorize variables into one of two overarching types.
+
+**Quantitative variables** describe some numeric quantity or amount. We can divide quantitative data further into:
+
+* **Continuous quantitative variables**: numeric data that can be measured on a continuous scale to arbitrary precision. Continuous variables do not have a strict set of possible values – they can be recorded to any number of decimal places. For example, weights, GPA, or CO<sub>2</sub> concentrations.
+* **Discrete quantitative variables**: numeric data that can only take on a finite set of possible values. For example, someone's age or the number of siblings they have.
+
+**Qualitative variables**, also known as **categorical variables**, describe data that isn't measuring some quantity or amount. The sub-categories of categorical data are:
+
+* **Ordinal qualitative variables**: categories with ordered levels. Specifically, ordinal variables are those where the difference between levels has no consistent, quantifiable meaning. Some examples include levels of education (high school, undergrad, grad, etc.), income bracket (low, medium, high), or Yelp rating.
+* **Nominal qualitative variables**: categories with no specific order. For example, someone's political affiliation or Cal ID number.
+
+![Classification of variable types](images/variable.png)
+
+Note that many variables don't sit neatly in just one of these categories. Qualitative variables could have numeric levels, and conversely, quantitative variables could be stored as strings.
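+
+To make this concrete, here is a small sketch with made-up values: the `pandas` storage type (dtype) tells us how the data is stored, not what kind of variable it is.
+
+```{python}
+#| code-fold: false
+# made-up illustration: dtype (storage type) vs. variable type
+pd.DataFrame({
+    "zip_code": [94720, 94704],   # int64 dtype, but a nominal qualitative variable
+    "gpa": ["3.70", "3.95"],      # object (string) dtype, but continuous quantitative
+    "yelp_rating": [4, 5],        # int64 dtype, but ordinal qualitative
+}).dtypes
+```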
+
+### Primary and Foreign Keys
+
+Last time, we introduced `.merge` as the `pandas` method for joining multiple `DataFrame`s together. In our discussion of joins, we touched on the idea of using a "key" to determine what rows should be merged from each table. Let's take a moment to examine this idea more closely.
+
+The **primary key** is the column or set of columns in a table that *uniquely* determine the values of the remaining columns. It can be thought of as the unique identifier for each individual row in the table. For example, a table of Data 100 students might use each student's Cal ID as the primary key.
+
+```{python}
+#| echo: false
+pd.DataFrame({"Cal ID":[3034619471, 3035619472, 3025619473, 3046789372], \
+ "Name":["Oski", "Ollie", "Orrie", "Ollie"], \
+ "Major":["Data Science", "Computer Science", "Data Science", "Economics"]})
+```
+
+The **foreign key** is the column or set of columns in a table that reference primary keys in other tables. Knowing a dataset's foreign keys can be useful when assigning the `left_on` and `right_on` parameters of `.merge`. In the table of office hour tickets below, `"Cal ID"` is a foreign key referencing the previous table.
+
+```{python}
+#| echo: false
+pd.DataFrame({"OH Request":[1, 2, 3, 4], \
+ "Cal ID":[3034619471, 3035619472, 3025619473, 3035619472], \
+ "Question":["HW 2 Q1", "HW 2 Q3", "Lab 3 Q4", "HW 2 Q7"]})
+```
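+
+As a sketch of how these keys drive a join, we can recreate the two toy tables above and merge the office hour tickets with the student table on `"Cal ID"` (the primary key of the first table and a foreign key in the second):
+
+```{python}
+#| code-fold: false
+# recreate the two toy tables from above
+students = pd.DataFrame({"Cal ID": [3034619471, 3035619472, 3025619473, 3046789372],
+                         "Name": ["Oski", "Ollie", "Orrie", "Ollie"],
+                         "Major": ["Data Science", "Computer Science", "Data Science", "Economics"]})
+tickets = pd.DataFrame({"OH Request": [1, 2, 3, 4],
+                        "Cal ID": [3034619471, 3035619472, 3025619473, 3035619472],
+                        "Question": ["HW 2 Q1", "HW 2 Q3", "Lab 3 Q4", "HW 2 Q7"]})
+
+# the foreign key in `tickets` matches the primary key in `students`
+tickets.merge(right=students, left_on="Cal ID", right_on="Cal ID")
+```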
+
+## Granularity, Scope, and Temporality
+
+After understanding the structure of the dataset, the next task is to determine what exactly the data represents. We'll do so by considering the data's granularity, scope, and temporality.
+
+### Granularity
+The **granularity** of a dataset is what a single row represents. You can also think of it as the level of detail included in the data. To determine the data's granularity, ask: what does each row in the dataset represent? Fine-grained data contains a high level of detail, with a single row representing a small individual unit. For example, each record may represent one person. Coarse-grained data is encoded such that a single row represents a large individual unit – for example, each record may represent a group of people.
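+
+For instance, here is a minimal sketch using the `elections` data loaded earlier (assuming it has a `"Year"` column, as the earlier preview suggests): each row is one candidate in one election year, and grouping by year coarsens the granularity to one row per election.
+
+```{python}
+#| code-fold: false
+# fine-grained: one row per candidate per election year
+elections = pd.read_csv("data/elections.csv")
+
+# coarser-grained: one row per election year (here, just counting candidates)
+elections.groupby("Year").size().head(5)
+```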
+
+### Scope
+The **scope** of a dataset is the subset of the population covered by the data. If we were investigating student performance in Data Science courses, a dataset with a narrow scope might encompass all students enrolled in Data 100, whereas a dataset with an expansive scope might encompass all students in California.
+
+### Temporality
+The **temporality** of a dataset describes the periodicity over which the data was collected as well as when the data was most recently collected or updated.
+
+Time and date fields of a dataset could represent a few things:
+
+1. when the "event" happened
+2. when the data was collected, or when it was entered into the system
+3. when the data was copied into the database
+
+To fully understand the temporality of the data, it also may be necessary to standardize time zones or inspect recurring time-based trends in the data (do patterns recur in 24-hour periods? Over the course of a month? Seasonally?). The convention for standardizing time is Coordinated Universal Time (UTC), an international time standard measured at 0 degrees longitude that stays consistent throughout the year (no daylight savings). Berkeley's time zone, Pacific Standard Time (PST), is UTC-8; during daylight saving time (PDT), it is UTC-7.
+
+#### Temporality with `pandas`' `dt` accessors
+Let's briefly look at how we can use `pandas`' `dt` accessors to work with dates/times, using the dataset you'll see in Lab 3: the Berkeley PD Calls for Service dataset.
+
+```{python}
+#| code-fold: true
+calls = pd.read_csv("data/Berkeley_PD_-_Calls_for_Service.csv")
+calls.head()
+```
+
+Looks like there are three columns with dates/times: `EVENTDT`, `EVENTTM`, and `InDbDate`.
+
+Most likely, `EVENTDT` stands for the date when the event took place, `EVENTTM` stands for the time of day the event took place (in 24-hr format), and `InDbDate` is the date this call was recorded in the database.
+
+If we check the data type of these columns, we will see they are stored as strings. We can convert them to `datetime` objects using the `pandas` `to_datetime` function.
+
+```{python}
+#| code-fold: false
+calls["EVENTDT"] = pd.to_datetime(calls["EVENTDT"])
+calls.head()
+```
+
+Now, we can use the `dt` accessor on this column.
+
+We can get the month:
+
+```{python}
+#| code-fold: false
+calls["EVENTDT"].dt.month.head()
+```
+
+Which day of the week the date is on:
+
+```{python}
+#| code-fold: false
+calls["EVENTDT"].dt.dayofweek.head()
+```
+
+Check the minimum values to see if there are any suspicious-looking dates from the 1970s:
+
+```{python}
+#| code-fold: false
+calls.sort_values("EVENTDT").head()
+```
+
+Doesn't look like it! We are good!
+
+
+We can also do many things with the `dt` accessor like switching time zones and converting time back to UNIX/POSIX time. Check out the documentation on [`.dt` accessor](https://pandas.pydata.org/docs/user_guide/basics.html#basics-dt-accessors) and [time series/date functionality](https://pandas.pydata.org/docs/user_guide/timeseries.html#).
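+
+As a minimal sketch of both ideas (assuming the `EVENTDT` timestamps are time-zone-naive after `to_datetime`, which is the case above):
+
+```{python}
+#| code-fold: false
+# localize the naive timestamps to Berkeley's time zone, then convert to UTC
+event_utc = (calls["EVENTDT"]
+             .dt.tz_localize("America/Los_Angeles")
+             .dt.tz_convert("UTC"))
+
+# convert to UNIX/POSIX time (seconds since 1970-01-01 UTC)
+unix_seconds = (event_utc - pd.Timestamp("1970-01-01", tz="UTC")) // pd.Timedelta("1s")
+unix_seconds.head()
+```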
+
+## Faithfulness
+
+At this stage in our data cleaning and EDA workflow, we've achieved quite a lot: we've identified how our data is structured, come to terms with what information it encodes, and gained insight as to how it was generated. Throughout this process, we should always recall the original intent of our work in Data Science – to use data to better understand and model the real world. To achieve this goal, we need to ensure that the data we use is faithful to reality; that is, that our data accurately captures the "real world."
+
+Data used in research or industry is often "messy" – there may be errors or inaccuracies that impact the faithfulness of the dataset. Signs that data may not be faithful include:
+
+* Unrealistic or "incorrect" values, such as negative counts, locations that don't exist, or dates set in the future
+* Violations of obvious dependencies, like an age that does not match a birthday
+* Clear signs that data was entered by hand, which can lead to spelling errors or fields that are incorrectly shifted
+* Signs of data falsification, such as fake email addresses or repeated use of the same names
+* Duplicated records or fields containing the same information
+* Truncated data, e.g. older versions of Microsoft Excel limited spreadsheets to 65,536 rows and 256 columns
+
+We often solve some of these more common issues in the following ways:
+
+* Spelling errors: apply corrections or drop records that aren't in a dictionary
+* Time zone inconsistencies: convert to a common time zone (e.g. UTC)
+* Duplicated records or fields: identify and eliminate duplicates (using primary keys)
+* Unspecified or inconsistent units: infer the units and check that values are in reasonable ranges in the data
+
+### Missing Values
+Another common issue encountered with real-world datasets is that of missing data. One strategy to resolve this is to simply drop any records with missing values from the dataset. This does, however, introduce the risk of inducing biases – it is possible that the missing or corrupt records may be systemically related to some feature of interest in the data. Another solution is to keep the data as `NaN` values.
+
+A third method to address missing data is to perform **imputation**: infer the missing values using other data available in the dataset. There is a wide variety of imputation techniques that can be implemented; some of the most common are listed below.
+
+* Average imputation: replace missing values with the average value for that field
+* Hot deck imputation: replace missing values with some random value
+* Regression imputation: develop a model to predict missing values
+* Multiple imputation: replace missing values with multiple random values
+
+Regardless of the strategy used to deal with missing data, we should think carefully about *why* particular records or fields may be missing – this can help inform whether or not the absence of these values is significant or meaningful.
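+
+As a minimal sketch of the first two strategies and of average imputation (on a made-up `Series`, not one of this lecture's datasets):
+
+```{python}
+#| code-fold: false
+# made-up data with one missing value
+s = pd.Series([2.0, np.nan, 4.0, 6.0])
+
+dropped = s.dropna()           # drop the missing record
+imputed = s.fillna(s.mean())   # average imputation: fill with the mean of the observed values
+imputed
+```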
+
+# EDA Demo 1: Tuberculosis in the United States
+
+Now, let's walk through the data-cleaning and EDA workflow to see what we can learn about the presence of Tuberculosis in the United States!
+
+We will examine the data included in the [original CDC article](https://www.cdc.gov/mmwr/volumes/71/wr/mm7112a1.htm?s_cid=mm7112a1_w#T1_down) published in 2021.
+
+
+## CSVs and Field Names
+Suppose Table 1 was saved as a CSV file located in `data/cdc_tuberculosis.csv`.
+
+We can then explore the CSV (which is a text file, and does not contain binary-encoded data) in many ways:
+1. Using a text editor like emacs, vim, VSCode, etc.
+2. Opening the CSV directly in DataHub (read-only), Excel, Google Sheets, etc.
+3. The `Python` file object
+4. `pandas`, using `pd.read_csv()`
+
+To try out options 1 and 2, you can view or download the Tuberculosis dataset from the [lecture demo notebook](https://data100.datahub.berkeley.edu/hub/user-redirect/git-pull?repo=https%3A%2F%2Fgithub.com%2FDS-100%2Ffa23-student&urlpath=lab%2Ftree%2Ffa23-student%2Flecture%2Flec05%2Flec04-eda.ipynb&branch=main) under the `data` folder in the left hand menu. Notice how the CSV file is a type of **rectangular data (i.e., tabular data) stored as comma-separated values**.
+
+Next, let's try out option 3 using the `Python` file object. We'll look at the first four lines:
+
+```{python}
+#| code-fold: true
+with open("data/cdc_tuberculosis.csv", "r") as f:
+ i = 0
+ for row in f:
+ print(row)
+ i += 1
+ if i > 3:
+ break
+```
+
+Whoa, why are there blank lines interspaced between the lines of the CSV?
+
+You may recall that all line breaks in text files are encoded as the special newline character `\n`. `Python`'s `print()` prints each string (which already ends in a newline) and then adds an extra newline of its own, which is why we see the blank lines.
+
+If you're curious, we can use the `repr()` function to return the raw string with all special characters:
+
+```{python}
+#| code-fold: true
+with open("data/cdc_tuberculosis.csv", "r") as f:
+ i = 0
+ for row in f:
+ print(repr(row)) # print raw strings
+ i += 1
+ if i > 3:
+ break
+```
+
+Finally, let's try option 4 and use the tried-and-true Data 100 approach: `pandas`.
+
+```{python}
+#| code-fold: false
+tb_df = pd.read_csv("data/cdc_tuberculosis.csv")
+tb_df.head()
+```
+
+You may notice some strange things about this table: what's up with the "Unnamed" column names and the first row?
+
+Congratulations — you're ready to wrangle your data! Because of how things are stored, we'll need to clean the data a bit to name our columns better.
+
+A reasonable first step is to identify the row with the right header. The `pd.read_csv()` function ([documentation](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html)) has the convenient `header` parameter that we can set to use the elements in row 1 as the appropriate columns:
+
+```{python}
+#| code-fold: false
+tb_df = pd.read_csv("data/cdc_tuberculosis.csv", header=1) # row index
+tb_df.head(5)
+```
+
+Wait...but now we can't differentiate between the "Number of TB cases" and "TB incidence" year columns. `pandas` has tried to make our lives easier by automatically adding ".1" to the latter columns, but this doesn't help us, as humans, understand the data.
+
+We can do this manually with `df.rename()` ([documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.rename.html?highlight=rename#pandas.DataFrame.rename)):
+
+```{python}
+#| code-fold: false
+rename_dict = {'2019': 'TB cases 2019',
+ '2020': 'TB cases 2020',
+ '2021': 'TB cases 2021',
+ '2019.1': 'TB incidence 2019',
+ '2020.1': 'TB incidence 2020',
+ '2021.1': 'TB incidence 2021'}
+tb_df = tb_df.rename(columns=rename_dict)
+tb_df.head(5)
+```
+
+## Record Granularity
+
+You might already be wondering: what's up with that first record?
+
+Row 0 is what we call a **rollup record**, or summary record. It's often useful when displaying tables to humans. The **granularity** of record 0 (Totals) vs the rest of the records (States) is different.
+
+Okay, EDA step two. How was the rollup record aggregated?
+
+Let's check if the Total TB cases are the sum of all state TB cases. If we sum over all rows, we should get **2x** the total cases in each of the TB cases columns (why do you think this is?).
+
+```{python}
+#| code-fold: true
+tb_df.sum(axis=0)
+```
+
+Whoa, what's going on with the TB cases in 2019, 2020, and 2021? Check out the column types:
+
+```{python}
+#| code-fold: true
+tb_df.dtypes
+```
+
+Since there are commas in the values for TB cases, the numbers are read as the `object` datatype, or **storage type** (close to the `Python` string datatype), so `pandas` is concatenating strings instead of adding integers (recall that `Python` can "sum", or concatenate, strings together: `"data" + "100"` evaluates to `"data100"`).
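+
+To see this concretely, here is a tiny sketch with made-up values: "summing" an `object` (string) `Series` concatenates the strings instead of adding numbers.
+
+```{python}
+#| code-fold: false
+# made-up values: summing strings concatenates them
+pd.Series(["1,234", "567"]).sum()
+```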
+
+
+Fortunately `read_csv` also has a `thousands` parameter ([documentation](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html)):
+
+```{python}
+#| code-fold: false
+# improve readability: chaining method calls with outer parentheses/line breaks
+tb_df = (
+ pd.read_csv("data/cdc_tuberculosis.csv", header=1, thousands=',')
+ .rename(columns=rename_dict)
+)
+tb_df.head(5)
+```
+
+```{python}
+#| code-fold: false
+tb_df.sum()
+```
+
+The Total TB cases look right. Phew!
+
+Let's just look at the records with **state-level granularity**:
+
+```{python}
+#| code-fold: true
+state_tb_df = tb_df[1:]
+state_tb_df.head(5)
+```
+
+## Gather Census Data
+
+U.S. Census population estimates [source](https://www.census.gov/data/tables/time-series/demo/popest/2010s-state-total.html) (2019), [source](https://www.census.gov/data/tables/time-series/demo/popest/2020s-state-total.html) (2020-2021).
+
+Running the below cells cleans the data.
+There are a few new methods here:
+* `df.convert_dtypes()` ([documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.convert_dtypes.html)) conveniently converts all float dtypes into ints; it is out of scope for this class.
+* `df.dropna()` ([documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.dropna.html)) will be explained in more detail next time.
+
+```{python}
+#| code-fold: true
+# 2010s census data
+census_2010s_df = pd.read_csv("data/nst-est2019-01.csv", header=3, thousands=",")
+census_2010s_df = (
+ census_2010s_df
+ .reset_index()
+ .drop(columns=["index", "Census", "Estimates Base"])
+ .rename(columns={"Unnamed: 0": "Geographic Area"})
+ .convert_dtypes() # "smart" converting of columns, use at your own risk
+ .dropna() # we'll introduce this next time
+)
+census_2010s_df['Geographic Area'] = census_2010s_df['Geographic Area'].str.strip('.')
+
+# with pd.option_context('display.min_rows', 30): # shows more rows
+# display(census_2010s_df)
+
+census_2010s_df.head(5)
+```
+
+Occasionally, you will want to modify code that you have imported. To reimport those modifications you can either use `python`'s `importlib` library:
+
+```python
+from importlib import reload
+reload(utils)
+```
+
+or use `iPython` magic which will intelligently import code when files change:
+
+```python
+%load_ext autoreload
+%autoreload 2
+```
+
+```{python}
+#| code-fold: true
+# census 2020s data
+census_2020s_df = pd.read_csv("data/NST-EST2022-POP.csv", header=3, thousands=",")
+census_2020s_df = (
+ census_2020s_df
+ .reset_index()
+ .drop(columns=["index", "Unnamed: 1"])
+ .rename(columns={"Unnamed: 0": "Geographic Area"})
+ .convert_dtypes() # "smart" converting of columns, use at your own risk
+ .dropna() # we'll introduce this next time
+)
+census_2020s_df['Geographic Area'] = census_2020s_df['Geographic Area'].str.strip('.')
+
+census_2020s_df.head(5)
+```
+
+## Joining Data (Merging `DataFrame`s)
+
+Time to `merge`! Here we use the `DataFrame` method `df1.merge(right=df2, ...)` on `DataFrame` `df1` ([documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.merge.html)). Contrast this with the function `pd.merge(left=df1, right=df2, ...)` ([documentation](https://pandas.pydata.org/docs/reference/api/pandas.merge.html?highlight=pandas%20merge#pandas.merge)). Feel free to use either.
+
+```{python}
+#| code-fold: false
+# merge TB DataFrame with two US census DataFrames
+tb_census_df = (
+ tb_df
+ .merge(right=census_2010s_df,
+ left_on="U.S. jurisdiction", right_on="Geographic Area")
+ .merge(right=census_2020s_df,
+ left_on="U.S. jurisdiction", right_on="Geographic Area")
+)
+tb_census_df.head(5)
+```
+
+Having all of these columns is a little unwieldy. We could either drop the unneeded columns now, or just merge on smaller census `DataFrame`s. Let's do the latter.
+
+```{python}
+#| code-fold: false
+# try merging again, but cleaner this time
+tb_census_df = (
+ tb_df
+ .merge(right=census_2010s_df[["Geographic Area", "2019"]],
+ left_on="U.S. jurisdiction", right_on="Geographic Area")
+ .drop(columns="Geographic Area")
+ .merge(right=census_2020s_df[["Geographic Area", "2020", "2021"]],
+ left_on="U.S. jurisdiction", right_on="Geographic Area")
+ .drop(columns="Geographic Area")
+)
+tb_census_df.head(5)
+```
+
+## Reproducing Data: Compute Incidence
+
+Let's recompute incidence to make sure we know where the original CDC numbers came from.
+
+From the [CDC report](https://www.cdc.gov/mmwr/volumes/71/wr/mm7112a1.htm?s_cid=mm7112a1_w#T1_down): TB incidence is computed as “Cases per 100,000 persons using mid-year population estimates from the U.S. Census Bureau.”
+
+If we define a group as 100,000 people, then we can compute the TB incidence for a given state population as
+
+$$\text{TB incidence} = \frac{\text{TB cases in population}}{\text{groups in population}} = \frac{\text{TB cases in population}}{\text{population}/100000} $$
+
+$$= \frac{\text{TB cases in population}}{\text{population}} \times 100000$$
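+
+For example, a hypothetical state with 600 TB cases and a population of 3,000,000 (i.e., 30 groups of 100,000 people) would have a TB incidence of $600 / 30 = 20$ cases per 100,000.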
+
+Let's try this for 2019:
+
+```{python}
+#| code-fold: false
+tb_census_df["recompute incidence 2019"] = tb_census_df["TB cases 2019"]/tb_census_df["2019"]*100000
+tb_census_df.head(5)
+```
+
+Awesome!!!
+
+Let's use a for-loop and `Python` format strings to compute TB incidence for all years. `Python` f-strings are just used for the purposes of this demo, but they're handy to know when you explore data beyond this course ([documentation](https://docs.python.org/3/tutorial/inputoutput.html)).
+
+```{python}
+#| code-fold: false
+# recompute incidence for all years
+for year in [2019, 2020, 2021]:
+ tb_census_df[f"recompute incidence {year}"] = tb_census_df[f"TB cases {year}"]/tb_census_df[f"{year}"]*100000
+tb_census_df.head(5)
+```
+
+These numbers look pretty close!!! There are a few errors in the hundredths place, particularly in 2021. It may be useful to further explore reasons behind this discrepancy.
+
+```{python}
+#| code-fold: false
+tb_census_df.describe()
+```
+
+## Bonus EDA: Reproducing the Reported Statistic
+
+
+**How do we reproduce that reported statistic in the original [CDC report](https://www.cdc.gov/mmwr/volumes/71/wr/mm7112a1.htm?s_cid=mm7112a1_w)?**
+
+> Reported TB incidence (cases per 100,000 persons) increased **9.4%**, from **2.2** during 2020 to **2.4** during 2021 but was lower than incidence during 2019 (2.7). Increases occurred among both U.S.-born and non–U.S.-born persons.
+
+This is TB incidence computed across the entire U.S. population! How do we reproduce this?
+* We need to reproduce the "Total" TB incidences in our rolled record.
+* But our current `tb_census_df` only has 51 entries (50 states plus Washington, D.C.). There is no rolled record.
+* What happened...?
+
+Let's get exploring!
+
+Before we keep exploring, we'll set the index of each `DataFrame` to a more meaningful value than the default row numbers. This will make our cleaning slightly easier.
+
+```{python}
+#| code-fold: true
+tb_df = tb_df.set_index("U.S. jurisdiction")
+tb_df.head(5)
+```
+
+```{python}
+#| code-fold: false
+census_2010s_df = census_2010s_df.set_index("Geographic Area")
+census_2010s_df.head(5)
+```
+
+```{python}
+#| code-fold: false
+census_2020s_df = census_2020s_df.set_index("Geographic Area")
+census_2020s_df.head(5)
+```
+
+It turns out that our merge above only kept state records, even though our original `tb_df` had the "Total" rolled record:
+
+```{python}
+#| code-fold: false
+tb_df.head()
+```
+
+Recall that `merge` does an **inner** merge by default, meaning that it only preserves keys that are present in **both** `DataFrame`s.
+
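+One way to see exactly which keys an inner merge would drop is to temporarily switch to an outer merge with `indicator=True`. Here is a diagnostic sketch, not part of our cleaning (the `debug` name is just for illustration):
+
+```python
+# Sketch: rows tagged left_only / right_only are the ones an inner merge discards
+debug = tb_df.merge(right=census_2010s_df[["2019"]],
+                    left_index=True, right_index=True,
+                    how="outer", indicator=True)
+debug[debug["_merge"] != "both"]
+```
+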
+The rolled records in our census `DataFrame` have different `Geographic Area` fields, which was the key we merged on:
+
+```{python}
+#| code-fold: false
+census_2010s_df.head(5)
+```
+
+The Census `DataFrame` has several rolled records. The aggregate record we are looking for actually has the Geographic Area named "United States".
+
+One straightforward way to get the right merge is to rename the value itself. Because we now have the Geographic Area index, we'll use `df.rename()` ([documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.rename.html)):
+
+```{python}
+#| code-fold: false
+# rename rolled record for 2010s
+census_2010s_df.rename(index={'United States':'Total'}, inplace=True)
+census_2010s_df.head(5)
+```
+
+```{python}
+#| code-fold: false
+# same, but for 2020s rename rolled record
+census_2020s_df.rename(index={'United States':'Total'}, inplace=True)
+census_2020s_df.head(5)
+```
+
+<br/>
+
+Next let's rerun our merge. Note the different chaining, because we are now merging on indexes (`df.merge()` [documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.merge.html)).
+
+```{python}
+#| code-fold: false
+tb_census_df = (
+ tb_df
+ .merge(right=census_2010s_df[["2019"]],
+ left_index=True, right_index=True)
+ .merge(right=census_2020s_df[["2020", "2021"]],
+ left_index=True, right_index=True)
+)
+tb_census_df.head(5)
+```
+
+<br/>
+
+Finally, let's recompute our incidences:
+
+```{python}
+#| code-fold: false
+# recompute incidence for all years
+for year in [2019, 2020, 2021]:
+ tb_census_df[f"recompute incidence {year}"] = tb_census_df[f"TB cases {year}"]/tb_census_df[f"{year}"]*100000
+tb_census_df.head(5)
+```
+
+We reproduced the total U.S. incidences correctly!
+
+We're almost there. Let's revisit the quote:
+
+> Reported TB incidence (cases per 100,000 persons) increased **9.4%**, from **2.2** during 2020 to **2.4** during 2021 but was lower than incidence during 2019 (2.7). Increases occurred among both U.S.-born and non–U.S.-born persons.
+
+Recall that percent change from $A$ to $B$ is computed as
+$\text{percent change} = \frac{B - A}{A} \times 100$.
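+
+Plugging in the rounded values from the quote gives $(2.4 - 2.2)/2.2 \times 100 \approx 9.1\%$; the reported **9.4%** presumably comes from the unrounded incidences, which is what we compute next.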
+
+```{python}
+#| code-fold: false
+#| tags: []
+incidence_2020 = tb_census_df.loc['Total', 'recompute incidence 2020']
+incidence_2020
+```
+
+```{python}
+#| code-fold: false
+#| tags: []
+incidence_2021 = tb_census_df.loc['Total', 'recompute incidence 2021']
+incidence_2021
+```
+
+```{python}
+#| code-fold: false
+#| tags: []
+difference = (incidence_2021 - incidence_2020)/incidence_2020 * 100
+difference
+```
+
+# EDA Demo 2: Mauna Loa CO<sub>2</sub> Data -- A Lesson in Data Faithfulness
+
+[Mauna Loa Observatory](https://gml.noaa.gov/ccgg/trends/data.html) has been monitoring CO<sub>2</sub> concentrations since 1958.
+
+```{python}
+#| code-fold: false
+co2_file = "data/co2_mm_mlo.txt"
+```
+
+Let's do some **EDA**!!
+
+## Reading this file into Pandas?
+Let's instead check out this `.txt` file directly. Some questions to keep in mind: Do we trust this file extension? What structure does the file have?
+
+Lines 71-78 (inclusive) are shown below:
+
+ line number | file contents
+
+ 71 | # decimal average interpolated trend #days
+ 72 | # date (season corr)
+ 73 | 1958 3 1958.208 315.71 315.71 314.62 -1
+ 74 | 1958 4 1958.292 317.45 317.45 315.29 -1
+ 75 | 1958 5 1958.375 317.50 317.50 314.71 -1
+ 76 | 1958 6 1958.458 -99.99 317.10 314.85 -1
+ 77 | 1958 7 1958.542 315.86 315.86 314.98 -1
+ 78 | 1958 8 1958.625 314.93 314.93 315.94 -1
+
+
+Notice how:
+
+- The values are separated by white space, possibly tabs.
+- The data line up in fixed columns down the rows. For example, the month appears in the 7th to 8th position of each line.
+- The 71st and 72nd lines in the file contain column headings split over two lines.
+
+We can use `read_csv` to read the data into a `pandas` `DataFrame`, and we provide several arguments to specify that the separators are white space, there is no header (**we will set our own column names**), and to skip the first 72 rows of the file.
+
+```{python}
+#| code-fold: false
+co2 = pd.read_csv(
+ co2_file, header = None, skiprows = 72,
+    sep = r'\s+'  # delimiter for continuous whitespace (stay tuned for regex next lecture)
+)
+co2.head()
+```
+
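+As an aside, since the values also line up positionally, `pandas`' fixed-width reader would likely parse this file too. A sketch we won't use further (the `co2_fwf` name is just for illustration):
+
+```python
+# Sketch: an alternative read using the fixed-width-file reader
+co2_fwf = pd.read_fwf(co2_file, skiprows=72, header=None)
+```
+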
+Congratulations! You've wrangled the data!
+
+<br/>
+
+...But our columns aren't named.
+**We need to do more EDA.**
+
+## Exploring Variable Feature Types
+
+The NOAA [webpage](https://gml.noaa.gov/ccgg/trends/) might have some useful tidbits (in this case it doesn't, but the header lines of the file itself describe each column).
+
+Using that information, we'll rerun `pd.read_csv`, but this time with some **custom column names.**
+
+```{python}
+#| code-fold: false
+co2 = pd.read_csv(
+ co2_file, header = None, skiprows = 72,
+    sep = r'\s+', # regex for continuous whitespace (next lecture)
+ names = ['Yr', 'Mo', 'DecDate', 'Avg', 'Int', 'Trend', 'Days']
+)
+co2.head()
+```
+
+## Visualizing CO<sub>2</sub>
+Scientific studies tend to have very clean data, right...? Let's jump right in and make a time series plot of CO2 monthly averages.
+
+```{python}
+#| code-fold: true
+sns.lineplot(x='DecDate', y='Avg', data=co2);
+```
+
+The code above uses the `seaborn` plotting library (abbreviated `sns`). We will cover it in the Visualization lecture; for now, you don't need to worry about how it works!
+
+Yikes! Plotting the data uncovered a problem. The sharp vertical lines suggest that we have some **missing values**. What happened here?
+
+```{python}
+#| code-fold: false
+co2.head()
+```
+
+```{python}
+#| code-fold: false
+co2.tail()
+```
+
+Some data have unusual values like -1 and -99.99.
+
+Let's check the description at the top of the file again.
+
+* -1 signifies a missing value for the number of days `Days` the equipment was in operation that month.
+* -99.99 denotes a missing monthly average `Avg`.
+
+How can we fix this? First, let's explore other aspects of our data. Understanding our data will help us decide what to do with the missing values.
+
+<br/>
+
+
+## Sanity Checks: Reasoning about the data
+First, we consider the shape of the data. How many rows should we have?
+
+* If the data are in chronological order, we should have one record per month.
+* The data run from March 1958 to August 2019.
+* We should have $12 \times (2019-1957) - 2 - 4 = 738$ records: 12 months for each of the 62 years from 1958 through 2019, minus the absent Jan-Feb 1958 and Sep-Dec 2019.
+
+```{python}
+#| code-fold: false
+co2.shape
+```
+
+Nice!! The number of rows (i.e., records) matches our expectations.
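+
+As a quick sketch, we can spell out that arithmetic and compare it against the shape we just printed (variable names here are just for illustration):
+
+```python
+# Sketch: expected number of monthly records
+n_years = 2019 - 1957              # 1958 through 2019, inclusive
+n_absent = 2 + 4                   # Jan-Feb 1958 and Sep-Dec 2019 are not in the file
+print(12 * n_years - n_absent)     # 738
+print(co2.shape[0] == 12 * n_years - n_absent)
+```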
+
+<br/>
+
+
+Let's now check the quality of each feature.
+
+## Understanding Missing Value 1: `Days`
+`Days` is a time field, so let's analyze other time fields to see if there is an explanation for missing values of days of operation.
+
+Let's start with **months**, `Mo`.
+
+Are we missing any records? Each month should appear 61 or 62 times (March 1958 to August 2019).
+
+```{python}
+#| code-fold: false
+co2["Mo"].value_counts().sort_index()
+```
+
+As expected, Jan, Feb, Sep, Oct, Nov, and Dec have 61 occurrences, and the rest have 62.
+
+<br/>
+
+Next let's explore **days** `Days` itself, which is the number of days that the measurement equipment worked.
+
+```{python}
+#| code-fold: true
+sns.displot(co2['Days']);
+plt.title("Distribution of days feature"); # suppresses unneeded plotting output
+```
+
+In terms of data quality, a handful of months have averages based on measurements taken on fewer than half the days. In addition, there are nearly 200 missing values, **about 27% of the data**!
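+
+We can confirm those figures directly from the -1 sentinel values; a quick sketch:
+
+```python
+# Sketch: count the -1 sentinel values that mark missing Days
+n_missing = (co2['Days'] == -1).sum()
+print(n_missing, f"({n_missing / len(co2):.0%} of records)")
+```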
+
+<br/>
+
+Finally, let's check the last time feature, **year** `Yr`.
+
+Let's check to see if there is any connection between missing-ness and the year of the recording.
+
+```{python}
+#| code-fold: true
+sns.scatterplot(x="Yr", y="Days", data=co2);
+plt.title("Day field by Year"); # the ; suppresses output
+```
+
+**Observations**:
+
+* All of the missing data are in the early years of operation.
+* It appears there may have been problems with equipment in the mid to late 80s.
+
+**Potential Next Steps**:
+
+* Confirm these explanations through documentation about the historical readings.
+* Maybe drop the earliest recordings? However, we would want to delay such action until after we have examined the time trends and assessed whether there are any potential problems.
+
+<br/>
+
+## Understanding Missing Value 2: `Avg`
+Next, let's return to the -99.99 values in `Avg` to analyze the overall quality of the CO2 measurements. We'll plot a histogram of the average CO<sub>2</sub> measurements.
+
+```{python}
+#| code-fold: true
+# Histograms of average CO2 measurements
+sns.displot(co2['Avg']);
+```
+
+The non-missing values are in the 300-400 range (a regular range of CO2 levels).
+
+We also see that there are only a few missing `Avg` values (**<1% of values**). Let's examine all of them:
+
+```{python}
+#| code-fold: false
+co2[co2["Avg"] < 0]
+```
+
+There doesn't seem to be a pattern to these values, other than that most records also were missing `Days` data.
+
+## Drop, `NaN`, or Impute Missing `Avg` Data?
+
+How should we address the invalid `Avg` data?
+
+1. Drop records
+2. Set to NaN
+3. Impute using some strategy
+
+Remember we want to fix the following plot:
+
+```{python}
+#| code-fold: true
+sns.lineplot(x='DecDate', y='Avg', data=co2)
+plt.title("CO2 Average By Month");
+```
+
+Since we are plotting `Avg` vs `DecDate`, we should just focus on dealing with missing values for `Avg`.
+
+
+Let's consider a few options:
+1. Drop those records
+2. Replace -99.99 with NaN
+3. Substitute a likely value for the missing average CO2
+
+What do you think are the pros and cons of each possible action?
+
+<br/>
+
+
+Let's examine each of these three options.
+
+```{python}
+#| code-fold: false
+# 1. Drop missing values
+co2_drop = co2[co2['Avg'] > 0]
+co2_drop.head()
+```
+
+```{python}
+#| code-fold: false
+# 2. Replace -99.99 with NaN
+co2_NA = co2.replace(-99.99, np.nan)
+co2_NA.head()
+```
+
+We'll also use a third version of the data.
+
+First, we note that the dataset already comes with a **substitute value** for the -99.99.
+
+From the file description:
+
+> The `interpolated` column includes average values from the preceding column (`average`)
+and **interpolated values** where data are missing. Interpolated values are
+computed in two steps...
+
+The `Int` feature has values that exactly match those in `Avg`, except when `Avg` is -99.99, in which case a **reasonable** estimate is used instead.
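+
+A quick sketch of a check that is consistent with that description (the `valid` name is just for illustration):
+
+```python
+# Sketch: Int agrees with Avg everywhere except the -99.99 sentinel rows
+valid = co2['Avg'] > 0
+print((co2.loc[valid, 'Avg'] == co2.loc[valid, 'Int']).all())   # expect True
+print((~valid).sum())                                           # the handful of sentinel rows
+```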
+
+So, the third version of our data will use the `Int` feature instead of `Avg`.
+
+```{python}
+#| code-fold: false
+# 3. Use interpolated column which estimates missing Avg values
+co2_impute = co2.copy()
+co2_impute['Avg'] = co2['Int']
+co2_impute.head()
+```
+
+What's a **reasonable** estimate?
+
+To answer this question, let's zoom in on a short time period, say the measurements in 1958 (where we know we have two missing values).
+
+```{python}
+#| code-fold: true
+# results of plotting data in 1958
+
+def line_and_points(data, ax, title):
+ # assumes single year, hence Mo
+ ax.plot('Mo', 'Avg', data=data)
+ ax.scatter('Mo', 'Avg', data=data)
+ ax.set_xlim(2, 13)
+ ax.set_title(title)
+ ax.set_xticks(np.arange(3, 13))
+
+def data_year(data, year):
+    return data[data["Yr"] == year]
+
+# uses matplotlib subplots
+# you may see more next week; focus on output for now
+fig, axes = plt.subplots(ncols = 3, figsize=(12, 4), sharey=True)
+
+year = 1958
+line_and_points(data_year(co2_drop, year), axes[0], title="1. Drop Missing")
+line_and_points(data_year(co2_NA, year), axes[1], title="2. Missing Set to NaN")
+line_and_points(data_year(co2_impute, year), axes[2], title="3. Missing Interpolated")
+
+fig.suptitle(f"Monthly Averages for {year}")
+plt.tight_layout()
+```
+
+In the big picture, since there are only 7 `Avg` values missing (**<1%** of 738 months), any of these approaches would work.
+
+However, there is some appeal to **option 3: imputing**:
+
+* It shows the seasonal trends for CO2.
+* We are plotting all months in our data as a line plot, so a value for every month keeps the line unbroken.
+
+<br/>
+
+
+Let's replot our original figure with option 3:
+
+```{python}
+#| code-fold: true
+sns.lineplot(x='DecDate', y='Avg', data=co2_impute)
+plt.title("CO2 Average By Month, Imputed");
+```
+
+Looks pretty close to what we see on the NOAA [website](https://gml.noaa.gov/ccgg/trends/)!
+
+## Presenting the data: A Discussion on Data Granularity
+
+From the description:
+
+* Monthly measurements are averages of daily average measurements.
+* The NOAA GML website has datasets for daily/hourly measurements too.
+
+The data you present depends on your research question.
+
+**How do CO2 levels vary by season?**
+
+* You might want to keep average monthly data.
+
+**Are CO2 levels rising over the past 50+ years, consistent with global warming predictions?**
+
+* You might be happier with a **coarser granularity** of average year data!
+
+```{python}
+#| code-fold: true
+co2_year = co2_impute.groupby('Yr').mean()
+sns.lineplot(x='Yr', y='Avg', data=co2_year)
+plt.title("CO2 Average By Year");
+```
+
+Indeed, we see a rise by nearly 100 ppm of CO2 since Mauna Loa began recording in 1958.
+
+# Summary
+We went over a lot of content this lecture; let's summarize the most important points:
+
+## Dealing with Missing Values
+There are a few options we can take to deal with missing data:
+
+* Drop missing records
+* Keep `NaN` missing values
+* Impute using an interpolated column
+
+## EDA and Data Wrangling
+There are several ways to approach EDA and Data Wrangling:
+
+* Examine the **data and metadata**: what is the date, size, organization, and structure of the data?
+* Examine each **field/attribute/dimension** individually.
+* Examine pairs of related dimensions (e.g. breaking down grades by major).
+* Along the way, we can:
+ * **Visualize** or summarize the data.
+ * **Validate assumptions** about data and its collection process. Pay particular attention to when the data was collected.
+ * Identify and **address anomalies**.
+ * Apply data transformations and corrections (we'll cover this in the upcoming lecture).
+ * **Record everything you do!** Developing in Jupyter Notebook promotes *reproducibility* of your own work!
diff --git a/docs/eda/eda_files/figure-html/cell-62-output-1.png b/docs/eda/eda_files/figure-html/cell-62-output-1.png
index a04218cf..f392d5f9 100644
Binary files a/docs/eda/eda_files/figure-html/cell-62-output-1.png and b/docs/eda/eda_files/figure-html/cell-62-output-1.png differ
diff --git a/docs/eda/eda_files/figure-html/cell-67-output-1.png b/docs/eda/eda_files/figure-html/cell-67-output-1.png
new file mode 100644
index 00000000..be96b8c9
Binary files /dev/null and b/docs/eda/eda_files/figure-html/cell-67-output-1.png differ
diff --git a/docs/eda/eda_files/figure-html/cell-67-output-2.png b/docs/eda/eda_files/figure-html/cell-67-output-2.png
deleted file mode 100644
index 31857f62..00000000
Binary files a/docs/eda/eda_files/figure-html/cell-67-output-2.png and /dev/null differ
diff --git a/docs/eda/eda_files/figure-html/cell-68-output-1.png b/docs/eda/eda_files/figure-html/cell-68-output-1.png
index 67c3959d..ffd29ff8 100644
Binary files a/docs/eda/eda_files/figure-html/cell-68-output-1.png and b/docs/eda/eda_files/figure-html/cell-68-output-1.png differ
diff --git a/docs/eda/eda_files/figure-html/cell-69-output-1.png b/docs/eda/eda_files/figure-html/cell-69-output-1.png
new file mode 100644
index 00000000..29088928
Binary files /dev/null and b/docs/eda/eda_files/figure-html/cell-69-output-1.png differ
diff --git a/docs/eda/eda_files/figure-html/cell-69-output-2.png b/docs/eda/eda_files/figure-html/cell-69-output-2.png
deleted file mode 100644
index fb28f5d5..00000000
Binary files a/docs/eda/eda_files/figure-html/cell-69-output-2.png and /dev/null differ
diff --git a/docs/eda/eda_files/figure-html/cell-71-output-1.png b/docs/eda/eda_files/figure-html/cell-71-output-1.png
index 39cac822..49ef3d6a 100644
Binary files a/docs/eda/eda_files/figure-html/cell-71-output-1.png and b/docs/eda/eda_files/figure-html/cell-71-output-1.png differ
diff --git a/docs/eda/eda_files/figure-html/cell-75-output-1.png b/docs/eda/eda_files/figure-html/cell-75-output-1.png
index 6382e58a..15a5fe82 100644
Binary files a/docs/eda/eda_files/figure-html/cell-75-output-1.png and b/docs/eda/eda_files/figure-html/cell-75-output-1.png differ
diff --git a/docs/eda/eda_files/figure-html/cell-76-output-1.png b/docs/eda/eda_files/figure-html/cell-76-output-1.png
index db2b0dee..40b1fc71 100644
Binary files a/docs/eda/eda_files/figure-html/cell-76-output-1.png and b/docs/eda/eda_files/figure-html/cell-76-output-1.png differ
diff --git a/docs/eda/eda_files/figure-html/cell-77-output-1.png b/docs/eda/eda_files/figure-html/cell-77-output-1.png
index 897b8b39..99b6c2d1 100644
Binary files a/docs/eda/eda_files/figure-html/cell-77-output-1.png and b/docs/eda/eda_files/figure-html/cell-77-output-1.png differ
diff --git a/docs/feature_engineering/feature_engineering.html b/docs/feature_engineering/feature_engineering.html
index ea770e7f..22d26788 100644
--- a/docs/feature_engineering/feature_engineering.html
+++ b/docs/feature_engineering/feature_engineering.html
@@ -556,7 +556,7 @@
my_model.fit(X, Y)
-LinearRegression()
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.LinearRegression()
+LinearRegression()
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.LinearRegression()
Notice that we use double brackets to extract this column. Why double brackets instead of just single brackets? The .fit
method, by default, expects to receive 2-dimensional data – some kind of data that includes both rows and columns. Writing penguins["flipper_length_mm"]
would return a 1D Series
, causing sklearn
to error. We avoid this by writing penguins[["flipper_length_mm"]]
to produce a 2D DataFrame
.
@@ -607,7 +607,7 @@
print(f"The RMSE of the model is {np.sqrt(np.mean((Y-Y_hat_two_features)**2))}")
-The RMSE of the model is 0.9881331104079044
+The RMSE of the model is 0.9881331104079045
# 3. Use interpolated column which estimates missing Avg values
-co2_impute = co2.copy()
-co2_impute['Avg'] = co2['Int']
-co2_impute.head()
# 3. Use interpolated column which estimates missing Avg values
+co2_impute = co2.copy()
+co2_impute['Avg'] = co2['Int']
+co2_impute.head()
@@ -4564,30 +4552,30 @@
Code
-# results of plotting data in 1958
-
-def line_and_points(data, ax, title):
- # assumes single year, hence Mo
- ax.plot('Mo', 'Avg', data=data)
- ax.scatter('Mo', 'Avg', data=data)
- ax.set_xlim(2, 13)
- ax.set_title(title)
- ax.set_xticks(np.arange(3, 13))
-
-def data_year(data, year):
- return data[data["Yr"] == 1958]
-
-# uses matplotlib subplots
-# you may see more next week; focus on output for now
-fig, axes = plt.subplots(ncols = 3, figsize=(12, 4), sharey=True)
-
-year = 1958
-line_and_points(data_year(co2_drop, year), axes[0], title="1. Drop Missing")
-line_and_points(data_year(co2_NA, year), axes[1], title="2. Missing Set to NaN")
-line_and_points(data_year(co2_impute, year), axes[2], title="3. Missing Interpolated")
-
-fig.suptitle(f"Monthly Averages for {year}")
-plt.tight_layout()
+# results of plotting data in 1958
+
+def line_and_points(data, ax, title):
+ # assumes single year, hence Mo
+ ax.plot('Mo', 'Avg', data=data)
+ ax.scatter('Mo', 'Avg', data=data)
+ ax.set_xlim(2, 13)
+ ax.set_title(title)
+ ax.set_xticks(np.arange(3, 13))
+
+def data_year(data, year):
+ return data[data["Yr"] == 1958]
+
+# uses matplotlib subplots
+# you may see more next week; focus on output for now
+fig, axes = plt.subplots(ncols = 3, figsize=(12, 4), sharey=True)
+
+year = 1958
+line_and_points(data_year(co2_drop, year), axes[0], title="1. Drop Missing")
+line_and_points(data_year(co2_NA, year), axes[1], title="2. Missing Set to NaN")
+line_and_points(data_year(co2_impute, year), axes[2], title="3. Missing Interpolated")
+
+fig.suptitle(f"Monthly Averages for {year}")
+plt.tight_layout()
@@ -4604,8 +4592,8 @@
Code
-
+
@@ -4632,9 +4620,9 @@
Code
-
+
@@ -4975,1218 +4963,1218 @@ <
Source Code
----
-title: Data Cleaning and EDA
-execute:
- echo: true
-format:
- html:
- code-fold: true
- code-tools: true
- toc: true
- toc-title: Data Cleaning and EDA
- page-layout: full
- theme:
- - cosmo
- - cerulean
- callout-icon: false
-jupyter: python3
----
-
-```{python}
-#| code-fold: true
-import numpy as np
-import pandas as pd
-
-import matplotlib.pyplot as plt
-import seaborn as sns
-#%matplotlib inline
-plt.rcParams['figure.figsize'] = (12, 9)
-
-sns.set()
-sns.set_context('talk')
-np.set_printoptions(threshold=20, precision=2, suppress=True)
-pd.set_option('display.max_rows', 30)
-pd.set_option('display.max_columns', None)
-pd.set_option('display.precision', 2)
-# This option stops scientific notation for pandas
-pd.set_option('display.float_format', '{:.2f}'.format)
-
-# Silence some spurious seaborn warnings
-import warnings
-warnings.filterwarnings("ignore", category=FutureWarning)
-```
-
-::: {.callout-note collapse="false"}
-## Learning Outcomes
-* Recognize common file formats
-* Categorize data by its variable type
-* Build awareness of issues with data faithfulness and develop targeted solutions
-:::
-
-**This content is covered in lectures 4, 5, and 6.**
-
-In the past few lectures, we've learned that `pandas` is a toolkit to restructure, modify, and explore a dataset. What we haven't yet touched on is *how* to make these data transformation decisions. When we receive a new set of data from the "real world," how do we know what processing we should do to convert this data into a usable form?
-
-**Data cleaning**, also called **data wrangling**, is the process of transforming raw data to facilitate subsequent analysis. It is often used to address issues like:
-
-* Unclear structure or formatting
-* Missing or corrupted values
-* Unit conversions
-* ...and so on
-
-**Exploratory Data Analysis (EDA)** is the process of understanding a new dataset. It is an open-ended, informal analysis that involves familiarizing ourselves with the variables present in the data, discovering potential hypotheses, and identifying possible issues with the data. This last point can often motivate further data cleaning to address any problems with the dataset's format; because of this, EDA and data cleaning are often thought of as an "infinite loop," with each process driving the other.
-
-In this lecture, we will consider the key properties of data to consider when performing data cleaning and EDA. In doing so, we'll develop a "checklist" of sorts for you to consider when approaching a new dataset. Throughout this process, we'll build a deeper understanding of this early (but very important!) stage of the data science lifecycle.
-
-## Structure
-
-### File Formats
-There are many file types for storing structured data: TSV, JSON, XML, ASCII, SAS, etc. We'll only cover CSV, TSV, and JSON in lecture, but you'll likely encounter other formats as you work with different datasets. Reading documentation is your best bet for understanding how to process the multitude of different file types.
-
-#### CSV
-CSVs, which stand for **Comma-Separated Values**, are a common tabular data format.
-In the past two `pandas` lectures, we briefly touched on the idea of file format: the way data is encoded in a file for storage. Specifically, our `elections` and `babynames` datasets were stored and loaded as CSVs:
-
-```{python}
-#| code-fold: false
-pd.read_csv("data/elections.csv").head(5)
-```
-
-To better understand the properties of a CSV, let's take a look at the first few rows of the raw data file to see what it looks like before being loaded into a `DataFrame`. We'll use the `repr()` function to return the raw string with its special characters:
-
-```{python}
-#| code-fold: false
-with open("data/elections.csv", "r") as table:
- i = 0
- for row in table:
- print(repr(row))
- i += 1
- if i > 3:
- break
-```
-
-Each row, or **record**, in the data is delimited by a newline `\n`. Each column, or **field**, in the data is delimited by a comma `,` (hence, comma-separated!).
-
-#### TSV
-
-Another common file type is **TSV (Tab-Separated Values)**. In a TSV, records are still delimited by a newline `\n`, while fields are delimited by `\t` tab character.
-
-Let's check out the first few rows of the raw TSV file. Again, we'll use the `repr()` function so that `print` shows the special characters.
-
-```{python}
-#| code-fold: false
-with open("data/elections.txt", "r") as table:
- i = 0
- for row in table:
- print(repr(row))
- i += 1
- if i > 3:
- break
-```
-
-TSVs can be loaded into `pandas` using `pd.read_csv`. We'll need to specify the **delimiter** with parameter` sep='\t'` [(documentation)](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html).
-
-```{python}
-#| code-fold: false
-pd.read_csv("data/elections.txt", sep='\t').head(3)
-```
-
-An issue with CSVs and TSVs comes up whenever there are commas or tabs within the records. How does `pandas` differentiate between a comma delimiter vs. a comma within the field itself, for example `8,900`? To remedy this, check out the [`quotechar` parameter](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html).
-
-#### JSON
-**JSON (JavaScript Object Notation)** files behave similarly to Python dictionaries. A raw JSON is shown below.
-
-```{python}
-#| code-fold: false
-with open("data/elections.json", "r") as table:
- i = 0
- for row in table:
- print(row)
- i += 1
- if i > 8:
- break
-```
-
-JSON files can be loaded into `pandas` using `pd.read_json`.
-
-```{python}
-#| code-fold: false
-pd.read_json('data/elections.json').head(3)
-```
-
-##### EDA with JSON: Berkeley COVID-19 Data
-The City of Berkeley Open Data [website](https://data.cityofberkeley.info/Health/COVID-19-Confirmed-Cases/xn6j-b766) has a dataset with COVID-19 Confirmed Cases among Berkeley residents by date. Let's download the file and save it as a JSON (note the source URL file type is also a JSON). In the interest of reproducible data science, we will download the data programatically. We have defined some helper functions in the [`ds100_utils.py`](https://ds100.org/fa23/resources/assets/lectures/lec05/lec05-eda.html) file that we can reuse these helper functions in many different notebooks.
-
-```{python}
-#| code-fold: false
-from ds100_utils import fetch_and_cache
-
-covid_file = fetch_and_cache(
- "https://data.cityofberkeley.info/api/views/xn6j-b766/rows.json?accessType=DOWNLOAD",
- "confirmed-cases.json",
- force=False)
-covid_file # a file path wrapper object
-```
-
-###### File Size
-Let's start our analysis by getting a rough estimate of the size of the dataset to inform the tools we use to view the data. For relatively small datasets, we can use a text editor or spreadsheet. For larger datasets, more programmatic exploration or distributed computing tools may be more fitting. Here we will use `Python` tools to probe the file.
-
-Since there seem to be text files, let's investigate the number of lines, which often corresponds to the number of records
-
-```{python}
-#| code-fold: false
-import os
-
-print(covid_file, "is", os.path.getsize(covid_file) / 1e6, "MB")
-
-with open(covid_file, "r") as f:
- print(covid_file, "is", sum(1 for l in f), "lines.")
-```
-
-###### Unix Commands
-As part of the EDA workflow, Unix commands can come in very handy. In fact, there's an entire book called ["Data Science at the Command Line"](https://datascienceatthecommandline.com/) that explores this idea in depth!
-In Jupyter/IPython, you can prefix lines with `!` to execute arbitrary Unix commands, and within those lines, you can refer to `Python` variables and expressions with the syntax `{expr}`.
-
-Here, we use the `ls` command to list files, using the `-lh` flags, which request "long format with information in human-readable form." We also use the `wc` command for "word count," but with the `-l` flag, which asks for line counts instead of words.
-
-These two give us the same information as the code above, albeit in a slightly different form:
-
-```{python}
-#| code-fold: false
-!ls -lh {covid_file}
-!wc -l {covid_file}
-```
-
-###### File Contents
-Let's explore the data format using `Python`.
-
-```{python}
-#| code-fold: false
-with open(covid_file, "r") as f:
- for i, row in enumerate(f):
- print(repr(row)) # print raw strings
- if i >= 4: break
-```
-
-We can use the `head` Unix command (which is where `pandas`' `head` method comes from!) to see the first few lines of the file:
-
-```{python}
-#| code-fold: false
-!head -5 {covid_file}
-```
-
-In order to load the JSON file into `pandas`, Let's first do some EDA with `Python`'s `json` package to understand the particular structure of this JSON file so that we can decide what (if anything) to load into `pandas`. `Python` has relatively good support for JSON data since it closely matches the internal python object model. In the following cell we import the entire JSON datafile into a python dictionary using the `json` package.
-
-```{python}
-#| code-fold: false
-import json
-
-with open(covid_file, "rb") as f:
- covid_json = json.load(f)
-```
-
-The `covid_json` variable is now a dictionary encoding the data in the file:
-
-```{python}
-#| code-fold: false
-type(covid_json)
-```
-
-We can examine what keys are in the top level json object by listing out the keys.
-
-```{python}
-#| code-fold: false
-covid_json.keys()
-```
-
-**Observation**: The JSON dictionary contains a `meta` key which likely refers to meta data (data about the data). Meta data often maintained with the data and can be a good source of additional information.
-
-
-We can investigate the meta data further by examining the keys associated with the metadata.
-
-```{python}
-#| code-fold: false
-covid_json['meta'].keys()
-```
-
-The `meta` key contains another dictionary called `view`. This likely refers to meta-data about a particular "view" of some underlying database. We will learn more about views when we study SQL later in the class.
-
-```{python}
-#| code-fold: false
-covid_json['meta']['view'].keys()
-```
-
-Notice that this a nested/recursive data structure. As we dig deeper we reveal more and more keys and the corresponding data:
-
-```
-meta
-|-> data
- | ... (haven't explored yet)
-|-> view
- | -> id
- | -> name
- | -> attribution
- ...
- | -> description
- ...
- | -> columns
- ...
-```
-
-
-There is a key called description in the view sub dictionary. This likely contains a description of the data:
-
-```{python}
-#| code-fold: false
-print(covid_json['meta']['view']['description'])
-```
-
-###### Examining the Data Field for Records
-
-We can look at a few entries in the `data` field. This is what we'll load into `pandas`.
-
-```{python}
-#| code-fold: false
-for i in range(3):
- print(f"{i:03} | {covid_json['data'][i]}")
-```
-
-Observations:
-* These look like equal-length records, so maybe `data` is a table!
-* But what do each of values in the record mean? Where can we find column headers?
-
-For that, we'll need the `columns` key in the metadata dictionary. This returns a list:
-
-```{python}
-#| code-fold: false
-type(covid_json['meta']['view']['columns'])
-```
-
-###### Summary of exploring the JSON file
-
-1. The above **metadata** tells us a lot about the columns in the data including column names, potential data anomalies, and a basic statistic.
-1. Because of its non-tabular structure, JSON makes it easier (than CSV) to create **self-documenting data**, meaning that information about the data is stored in the same file as the data.
-1. Self-documenting data can be helpful since it maintains its own description and these descriptions are more likely to be updated as data changes.
-
-###### Loading COVID Data into `pandas`
-Finally, let's load the data (not the metadata) into a `pandas` `DataFrame`. In the following block of code we:
-
-1. Translate the JSON records into a `DataFrame`:
-
- * fields: `covid_json['meta']['view']['columns']`
- * records: `covid_json['data']`
-
-
-1. Remove columns that have no metadata description. This would be a bad idea in general, but here we remove these columns since the above analysis suggests they are unlikely to contain useful information.
-
-1. Examine the `tail` of the table.
-
-```{python}
-#| code-fold: false
-# Load the data from JSON and assign column titles
-covid = pd.DataFrame(
- covid_json['data'],
- columns=[c['name'] for c in covid_json['meta']['view']['columns']])
-
-covid.tail()
-```
-
-### Variable Types
-
-After loading data into a file, it's a good idea to take the time to understand what pieces of information are encoded in the dataset. In particular, we want to identify what variable types are present in our data. Broadly speaking, we can categorize variables into one of two overarching types.
-
-**Quantitative variables** describe some numeric quantity or amount. We can divide quantitative data further into:
-
-* **Continuous quantitative variables**: numeric data that can be measured on a continuous scale to arbitrary precision. Continuous variables do not have a strict set of possible values – they can be recorded to any number of decimal places. For example, weights, GPA, or CO<sub>2</sub> concentrations.
-* **Discrete quantitative variables**: numeric data that can only take on a finite set of possible values. For example, someone's age or the number of siblings they have.
-
-**Qualitative variables**, also known as **categorical variables**, describe data that isn't measuring some quantity or amount. The sub-categories of categorical data are:
-
-* **Ordinal qualitative variables**: categories with ordered levels. Specifically, ordinal variables are those where the difference between levels has no consistent, quantifiable meaning. Some examples include levels of education (high school, undergrad, grad, etc.), income bracket (low, medium, high), or Yelp rating.
-* **Nominal qualitative variables**: categories with no specific order. For example, someone's political affiliation or Cal ID number.
-
-![Classification of variable types](images/variable.png)
-
-Note that many variables don't sit neatly in just one of these categories. Qualitative variables could have numeric levels, and conversely, quantitative variables could be stored as strings.
-
-### Primary and Foreign Keys
-
-Last time, we introduced `.merge` as the `pandas` method for joining multiple `DataFrame`s together. In our discussion of joins, we touched on the idea of using a "key" to determine what rows should be merged from each table. Let's take a moment to examine this idea more closely.
-
-The **primary key** is the column or set of columns in a table that *uniquely* determine the values of the remaining columns. It can be thought of as the unique identifier for each individual row in the table. For example, a table of Data 100 students might use each student's Cal ID as the primary key.
-
-```{python}
-#| echo: false
-pd.DataFrame({"Cal ID":[3034619471, 3035619472, 3025619473, 3046789372], \
- "Name":["Oski", "Ollie", "Orrie", "Ollie"], \
- "Major":["Data Science", "Computer Science", "Data Science", "Economics"]})
-```
-
-The **foreign key** is the column or set of columns in a table that reference primary keys in other tables. Knowing a dataset's foreign keys can be useful when assigning the `left_on` and `right_on` parameters of `.merge`. In the table of office hour tickets below, `"Cal ID"` is a foreign key referencing the previous table.
-
-```{python}
-#| echo: false
-pd.DataFrame({"OH Request":[1, 2, 3, 4], \
- "Cal ID":[3034619471, 3035619472, 3025619473, 3035619472], \
- "Question":["HW 2 Q1", "HW 2 Q3", "Lab 3 Q4", "HW 2 Q7"]})
-```
-
-## Granularity, Scope, and Temporality
-
-After understanding the structure of the dataset, the next task is to determine what exactly the data represents. We'll do so by considering the data's granularity, scope, and temporality.
-
-### Granularity
-The **granularity** of a dataset is what a single row represents. You can also think of it as the level of detail included in the data. To determine the data's granularity, ask: what does each row in the dataset represent? Fine-grained data contains a high level of detail, with a single row representing a small individual unit. For example, each record may represent one person. Coarse-grained data is encoded such that a single row represents a large individual unit – for example, each record may represent a group of people.
-
-### Scope
-The **scope** of a dataset is the subset of the population covered by the data. If we were investigating student performance in Data Science courses, a dataset with a narrow scope might encompass all students enrolled in Data 100 whereas a dataset with an expansive scope might encompass all students in California.
-
-### Temporality
-The **temporality** of a dataset describes the periodicity over which the data was collected as well as when the data was most recently collected or updated.
-
-Time and date fields of a dataset could represent a few things:
-
-1. when the "event" happened
-2. when the data was collected, or when it was entered into the system
-3. when the data was copied into the database
-
-To fully understand the temporality of the data, it also may be necessary to standardize time zones or inspect recurring time-based trends in the data (do patterns recur in 24-hour periods? Over the course of a month? Seasonally?). The convention for standardizing time is the Coordinated Universal Time (UTC), an international time standard measured at 0 degrees latitude that stays consistent throughout the year (no daylight savings). We can represent Berkeley's time zone, Pacific Standard Time (PST), as UTC-7 (with daylight savings).
-
-#### Temporality with `pandas`' `dt` accessors
-Let's briefly look at how we can use `pandas`' `dt` accessors to work with dates/times in a dataset using the dataset you'll see in Lab 3: the Berkeley PD Calls for Service dataset.
-
-```{python}
-#| code-fold: true
-calls = pd.read_csv("data/Berkeley_PD_-_Calls_for_Service.csv")
-calls.head()
-```
-
-Looks like there are three columns with dates/times: `EVENTDT`, `EVENTTM`, and `InDbDate`.
-
-Most likely, `EVENTDT` stands for the date when the event took place, `EVENTTM` stands for the time of day the event took place (in 24-hr format), and `InDbDate` is the date this call is recorded onto the database.
-
-If we check the data type of these columns, we will see they are stored as strings. We can convert them to `datetime` objects using pandas `to_datetime` function.
-
-```{python}
-#| code-fold: false
-calls["EVENTDT"] = pd.to_datetime(calls["EVENTDT"])
-calls.head()
-```
-
-Now, we can use the `dt` accessor on this column.
-
-We can get the month:
-
-```{python}
-#| code-fold: false
-calls["EVENTDT"].dt.month.head()
-```
-
-Which day of the week the date is on:
-
-```{python}
-#| code-fold: false
-calls["EVENTDT"].dt.dayofweek.head()
-```
-
-Check the mimimum values to see if there are any suspicious-looking, 70s dates:
-
-```{python}
-#| code-fold: false
-calls.sort_values("EVENTDT").head()
-```
-
-Doesn't look like it! We are good!
-
-
-We can also do many things with the `dt` accessor like switching time zones and converting time back to UNIX/POSIX time. Check out the documentation on [`.dt` accessor](https://pandas.pydata.org/docs/user_guide/basics.html#basics-dt-accessors) and [time series/date functionality](https://pandas.pydata.org/docs/user_guide/timeseries.html#).
-
-## Faithfulness
-
-At this stage in our data cleaning and EDA workflow, we've achieved quite a lot: we've identified how our data is structured, come to terms with what information it encodes, and gained insight as to how it was generated. Throughout this process, we should always recall the original intent of our work in Data Science – to use data to better understand and model the real world. To achieve this goal, we need to ensure that the data we use is faithful to reality; that is, that our data accurately captures the "real world."
-
-Data used in research or industry is often "messy" – there may be errors or inaccuracies that impact the faithfulness of the dataset. Signs that data may not be faithful include:
-
-* Unrealistic or "incorrect" values, such as negative counts, locations that don't exist, or dates set in the future
-* Violations of obvious dependencies, like an age that does not match a birthday
-* Clear signs that data was entered by hand, which can lead to spelling errors or fields that are incorrectly shifted
-* Signs of data falsification, such as fake email addresses or repeated use of the same names
-* Duplicated records or fields containing the same information
-* Truncated data, e.g. Microsoft Excel would limit the number of rows to 655536 and the number of columns to 255
-
-We often solve some of these more common issues in the following ways:
-
-* Spelling errors: apply corrections or drop records that aren't in a dictionary
-* Time zone inconsistencies: convert to a common time zone (e.g. UTC)
-* Duplicated records or fields: identify and eliminate duplicates (using primary keys)
-* Unspecified or inconsistent units: infer the units and check that values are in reasonable ranges in the data
-
-### Missing Values
-Another common issue encountered with real-world datasets is that of missing data. One strategy to resolve this is to simply drop any records with missing values from the dataset. This does, however, introduce the risk of inducing biases – it is possible that the missing or corrupt records may be systemically related to some feature of interest in the data. Another solution is to keep the data as `NaN` values.
-
-A third method to address missing data is to perform **imputation**: infer the missing values using other data available in the dataset. There is a wide variety of imputation techniques that can be implemented; some of the most common are listed below.
-
-* Average imputation: replace missing values with the average value for that field
-* Hot deck imputation: replace missing values with some random value
-* Regression imputation: develop a model to predict missing values
-* Multiple imputation: replace missing values with multiple random values
-
-Regardless of the strategy used to deal with missing data, we should think carefully about *why* particular records or fields may be missing – this can help inform whether or not the absence of these values is significant or meaningful.
-
-# EDA Demo 1: Tuberculosis in the United States
-
-Now, let's walk through the data-cleaning and EDA workflow to see what can we learn about the presence of Tuberculosis in the United States!
-
-We will examine the data included in the [original CDC article](https://www.cdc.gov/mmwr/volumes/71/wr/mm7112a1.htm?s_cid=mm7112a1_w#T1_down) published in 2021.
-
-
-## CSVs and Field Names
-Suppose Table 1 was saved as a CSV file located in `data/cdc_tuberculosis.csv`.
-
-We can then explore the CSV (which is a text file, and does not contain binary-encoded data) in many ways:
-1. Using a text editor like emacs, vim, VSCode, etc.
-2. Opening the CSV directly in DataHub (read-only), Excel, Google Sheets, etc.
-3. The `Python` file object
-4. `pandas`, using `pd.read_csv()`
-
-To try out options 1 and 2, you can view or download the Tuberculosis from the [lecture demo notebook](https://data100.datahub.berkeley.edu/hub/user-redirect/git-pull?repo=https%3A%2F%2Fgithub.com%2FDS-100%2Ffa23-student&urlpath=lab%2Ftree%2Ffa23-student%2Flecture%2Flec05%2Flec04-eda.ipynb&branch=main) under the `data` folder in the left hand menu. Notice how the CSV file is a type of **rectangular data (i.e., tabular data) stored as comma-separated values**.
-
-Next, let's try out option 3 using the `Python` file object. We'll look at the first four lines:
-
-```{python}
-#| code-fold: true
-with open("data/cdc_tuberculosis.csv", "r") as f:
- i = 0
- for row in f:
- print(row)
- i += 1
- if i > 3:
- break
-```
-
-Whoa, why are there blank lines interspaced between the lines of the CSV?
-
-You may recall that all line breaks in text files are encoded as the special newline character `\n`. Python's `print()` prints each string (including the newline), and an additional newline on top of that.
-
-If you're curious, we can use the `repr()` function to return the raw string with all special characters:
-
-```{python}
-#| code-fold: true
-with open("data/cdc_tuberculosis.csv", "r") as f:
- i = 0
- for row in f:
- print(repr(row)) # print raw strings
- i += 1
- if i > 3:
- break
-```
-
-Finally, let's try option 4 and use the tried-and-true Data 100 approach: `pandas`.
-
-```{python}
-#| code-fold: false
-tb_df = pd.read_csv("data/cdc_tuberculosis.csv")
-tb_df.head()
-```
-
-You may notice some strange things about this table: what's up with the "Unnamed" column names and the first row?
-
-Congratulations — you're ready to wrangle your data! Because of how things are stored, we'll need to clean the data a bit to name our columns better.
-
-A reasonable first step is to identify the row with the right header. The `pd.read_csv()` function ([documentation](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html)) has the convenient `header` parameter that we can set to use the elements in row 1 as the appropriate columns:
-
-```{python}
-#| code-fold: false
-tb_df = pd.read_csv("data/cdc_tuberculosis.csv", header=1) # row index
-tb_df.head(5)
-```
-
-Wait...but now we can't differentiate betwen the "Number of TB cases" and "TB incidence" year columns. `pandas` has tried to make our lives easier by automatically adding ".1" to the latter columns, but this doesn't help us, as humans, understand the data.
-
-We can do this manually with `df.rename()` ([documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.rename.html?highlight=rename#pandas.DataFrame.rename)):
-
-```{python}
-#| code-fold: false
-rename_dict = {'2019': 'TB cases 2019',
- '2020': 'TB cases 2020',
- '2021': 'TB cases 2021',
- '2019.1': 'TB incidence 2019',
- '2020.1': 'TB incidence 2020',
- '2021.1': 'TB incidence 2021'}
-tb_df = tb_df.rename(columns=rename_dict)
-tb_df.head(5)
-```
-
-## Record Granularity
-
-You might already be wondering: what's up with that first record?
-
-Row 0 is what we call a **rollup record**, or summary record. It's often useful when displaying tables to humans. The **granularity** of record 0 (Totals) vs the rest of the records (States) is different.
-
-Okay, EDA step two. How was the rollup record aggregated?
-
-Let's check if Total TB cases is the sum of all state TB cases. If we sum over all rows, we should get **2x** the total cases in each of our TB cases by year (why do you think this is?).
-
-```{python}
-#| code-fold: true
-tb_df.sum(axis=0)
-```
-
-Whoa, what's going on with the TB cases in 2019, 2020, and 2021? Check out the column types:
-
-```{python}
-#| code-fold: true
-tb_df.dtypes
-```
-
-Since there are commas in the values for TB cases, the numbers are read as the `object` datatype, or **storage type** (close to the `Python` string datatype), so `pandas` is concatenating strings instead of adding integers (recall that `Python` can "sum", or concatenate, strings together: `"data" + "100"` evaluates to `"data100"`).
-
-
-Fortunately `read_csv` also has a `thousands` parameter ([documentation](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html)):
-
-```{python}
-#| code-fold: false
-# improve readability: chaining method calls with outer parentheses/line breaks
-tb_df = (
- pd.read_csv("data/cdc_tuberculosis.csv", header=1, thousands=',')
- .rename(columns=rename_dict)
-)
-tb_df.head(5)
-```
-
-```{python}
-#| code-fold: false
-tb_df.sum()
-```
-
-The Total TB cases look right. Phew!
-
-Let's just look at the records with **state-level granularity**:
-
-```{python}
-#| code-fold: true
-state_tb_df = tb_df[1:]
-state_tb_df.head(5)
-```
-
-## Gather Census Data
-
-U.S. Census population estimates [source](https://www.census.gov/data/tables/time-series/demo/popest/2010s-state-total.html) (2019), [source](https://www.census.gov/data/tables/time-series/demo/popest/2020s-state-total.html) (2020-2021).
-
-Running the below cells cleans the data.
-There are a few new methods here:
-* `df.convert_dtypes()` ([documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.convert_dtypes.html)) conveniently converts all float dtypes into ints and is out of scope for the class.
-* `df.drop_na()` ([documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.dropna.html)) will be explained in more detail next time.
-
-```{python}
-#| code-fold: true
-# 2010s census data
-census_2010s_df = pd.read_csv("data/nst-est2019-01.csv", header=3, thousands=",")
-census_2010s_df = (
- census_2010s_df
- .reset_index()
- .drop(columns=["index", "Census", "Estimates Base"])
- .rename(columns={"Unnamed: 0": "Geographic Area"})
- .convert_dtypes() # "smart" converting of columns, use at your own risk
- .dropna() # we'll introduce this next time
-)
-census_2010s_df['Geographic Area'] = census_2010s_df['Geographic Area'].str.strip('.')
-
-# with pd.option_context('display.min_rows', 30): # shows more rows
-# display(census_2010s_df)
-
-census_2010s_df.head(5)
-```
-
-Occasionally, you will want to modify code that you have imported. To reimport those modifications you can either use `python`'s `importlib` library:
-
-```python
-from importlib import reload
-reload(utils)
-```
-
-or use `iPython` magic which will intelligently import code when files change:
-
-```python
-%load_ext autoreload
-%autoreload 2
-```
-
-```{python}
-#| code-fold: true
-# census 2020s data
-census_2020s_df = pd.read_csv("data/NST-EST2022-POP.csv", header=3, thousands=",")
-census_2020s_df = (
- census_2020s_df
- .reset_index()
- .drop(columns=["index", "Unnamed: 1"])
- .rename(columns={"Unnamed: 0": "Geographic Area"})
- .convert_dtypes() # "smart" converting of columns, use at your own risk
- .dropna() # we'll introduce this next time
-)
-census_2020s_df['Geographic Area'] = census_2020s_df['Geographic Area'].str.strip('.')
-
-census_2020s_df.head(5)
-```
-
-## Joining Data (Merging `DataFrame`s)
-
-Time to `merge`! Here we use the `DataFrame` method `df1.merge(right=df2, ...)` on `DataFrame` `df1` ([documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.merge.html)). Contrast this with the function `pd.merge(left=df1, right=df2, ...)` ([documentation](https://pandas.pydata.org/docs/reference/api/pandas.merge.html?highlight=pandas%20merge#pandas.merge)). Feel free to use either.
-
-```{python}
-#| code-fold: false
-# merge TB DataFrame with two US census DataFrames
-tb_census_df = (
- tb_df
- .merge(right=census_2010s_df,
- left_on="U.S. jurisdiction", right_on="Geographic Area")
- .merge(right=census_2020s_df,
- left_on="U.S. jurisdiction", right_on="Geographic Area")
-)
-tb_census_df.head(5)
-```
-
-Having all of these columns is a little unwieldy. We could either drop the unneeded columns now, or just merge on smaller census `DataFrame`s. Let's do the latter.
-
-```{python}
-#| code-fold: false
-# try merging again, but cleaner this time
-tb_census_df = (
- tb_df
- .merge(right=census_2010s_df[["Geographic Area", "2019"]],
- left_on="U.S. jurisdiction", right_on="Geographic Area")
- .drop(columns="Geographic Area")
- .merge(right=census_2020s_df[["Geographic Area", "2020", "2021"]],
- left_on="U.S. jurisdiction", right_on="Geographic Area")
- .drop(columns="Geographic Area")
-)
-tb_census_df.head(5)
-```
-
-## Reproducing Data: Compute Incidence
-
-Let's recompute incidence to make sure we know where the original CDC numbers came from.
-
-From the [CDC report](https://www.cdc.gov/mmwr/volumes/71/wr/mm7112a1.htm?s_cid=mm7112a1_w#T1_down): TB incidence is computed as “Cases per 100,000 persons using mid-year population estimates from the U.S. Census Bureau.”
-
-If we define a group as 100,000 people, then we can compute the TB incidence for a given state population as
-
-$$\text{TB incidence} = \frac{\text{TB cases in population}}{\text{groups in population}} = \frac{\text{TB cases in population}}{\text{population}/100000} $$
-
-$$= \frac{\text{TB cases in population}}{\text{population}} \times 100000$$
-
-Let's try this for 2019:
-
-```{python}
-#| code-fold: false
-tb_census_df["recompute incidence 2019"] = tb_census_df["TB cases 2019"]/tb_census_df["2019"]*100000
-tb_census_df.head(5)
-```
-
-Awesome!!!
-
-Let's use a for-loop and `Python` format strings to compute TB incidence for all years. `Python` f-strings are just used for the purposes of this demo, but they're handy to know when you explore data beyond this course ([documentation](https://docs.python.org/3/tutorial/inputoutput.html)).
-
-```{python}
-#| code-fold: false
-# recompute incidence for all years
-for year in [2019, 2020, 2021]:
- tb_census_df[f"recompute incidence {year}"] = tb_census_df[f"TB cases {year}"]/tb_census_df[f"{year}"]*100000
-tb_census_df.head(5)
-```
-
-These numbers look pretty close!!! There are a few errors in the hundredths place, particularly in 2021. It may be useful to further explore reasons behind this discrepancy.
-
-```{python}
-#| code-fold: false
-tb_census_df.describe()
-```
-
-## Bonus EDA: Reproducing the Reported Statistic
-
-
-**How do we reproduce that reported statistic in the original [CDC report](https://www.cdc.gov/mmwr/volumes/71/wr/mm7112a1.htm?s_cid=mm7112a1_w)?**
-
-> Reported TB incidence (cases per 100,000 persons) increased **9.4%**, from **2.2** during 2020 to **2.4** during 2021 but was lower than incidence during 2019 (2.7). Increases occurred among both U.S.-born and non–U.S.-born persons.
-
-This is TB incidence computed across the entire U.S. population! How do we reproduce this?
-
-* We need to reproduce the "Total" TB incidences in our rolled record.
-* But our current `tb_census_df` only has 51 entries (50 states plus Washington, D.C.). There is no rolled record.
-* What happened...?
-
-Let's get exploring!
-
-Before we keep exploring, we'll set all indexes to more meaningful values, instead of just numbers that pertain to some row at some point. This will make our cleaning slightly easier.
-
-```{python}
-#| code-fold: true
-tb_df = tb_df.set_index("U.S. jurisdiction")
-tb_df.head(5)
-```
-
-```{python}
-#| code-fold: false
-census_2010s_df = census_2010s_df.set_index("Geographic Area")
-census_2010s_df.head(5)
-```
-
-```{python}
-#| code-fold: false
-census_2020s_df = census_2020s_df.set_index("Geographic Area")
-census_2020s_df.head(5)
-```
-
-It turns out that our merge above only kept state records, even though our original `tb_df` had the "Total" rolled record:
-
-```{python}
-#| code-fold: false
-tb_df.head()
-```
-
-Recall that `merge` does an **inner** merge by default, meaning that it only preserves keys that are present in **both** `DataFrame`s.
-
-The rolled records in our census `DataFrame` have different `Geographic Area` fields, which was the key we merged on:
-
-```{python}
-#| code-fold: false
-census_2010s_df.head(5)
-```
-
-The Census `DataFrame` has several rolled records. The aggregate record we are looking for actually has the Geographic Area named "United States".
-
-One straightforward way to get the right merge is to rename the value itself. Because we now have the Geographic Area index, we'll use `df.rename()` ([documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.rename.html)):
-
-```{python}
-#| code-fold: false
-# rename rolled record for 2010s
-census_2010s_df.rename(index={'United States':'Total'}, inplace=True)
-census_2010s_df.head(5)
-```
-
-```{python}
-#| code-fold: false
-# same, but for 2020s rename rolled record
-census_2020s_df.rename(index={'United States':'Total'}, inplace=True)
-census_2020s_df.head(5)
-```
-
-<br/>
-
-Next let's rerun our merge. Note the different chaining, because we are now merging on indexes (`df.merge()` [documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.merge.html)).
-
-```{python}
-#| code-fold: false
-tb_census_df = (
- tb_df
- .merge(right=census_2010s_df[["2019"]],
- left_index=True, right_index=True)
- .merge(right=census_2020s_df[["2020", "2021"]],
- left_index=True, right_index=True)
-)
-tb_census_df.head(5)
-```
-
-<br/>
-
-Finally, let's recompute our incidences:
-
-```{python}
-#| code-fold: false
-# recompute incidence for all years
-for year in [2019, 2020, 2021]:
- tb_census_df[f"recompute incidence {year}"] = tb_census_df[f"TB cases {year}"]/tb_census_df[f"{year}"]*100000
-tb_census_df.head(5)
-```
-
-We reproduced the total U.S. incidences correctly!
-
-We're almost there. Let's revisit the quote:
-
-> Reported TB incidence (cases per 100,000 persons) increased **9.4%**, from **2.2** during 2020 to **2.4** during 2021 but was lower than incidence during 2019 (2.7). Increases occurred among both U.S.-born and non–U.S.-born persons.
-
-Recall that percent change from $A$ to $B$ is computed as
-$\text{percent change} = \frac{B - A}{A} \times 100$.
-
-```{python}
-#| code-fold: false
-#| tags: []
-incidence_2020 = tb_census_df.loc['Total', 'recompute incidence 2020']
-incidence_2020
-```
-
-```{python}
-#| code-fold: false
-#| tags: []
-incidence_2021 = tb_census_df.loc['Total', 'recompute incidence 2021']
-incidence_2021
-```
-
-```{python}
-#| code-fold: false
-#| tags: []
-difference = (incidence_2021 - incidence_2020)/incidence_2020 * 100
-difference
-```
-
-# EDA Demo 2: Mauna Loa CO<sub>2</sub> Data -- A Lesson in Data Faithfulness
-
-[Mauna Loa Observatory](https://gml.noaa.gov/ccgg/trends/data.html) has been monitoring CO<sub>2</sub> concentrations since 1958
-
-```{python}
-#| code-fold: false
-co2_file = "data/co2_mm_mlo.txt"
-```
-
-Let's do some **EDA**!!
-
-## Reading this file into Pandas?
-Let's instead check out this `.txt` file. Some questions to keep in mind: Do we trust this file extension? What structure is it?
-
-Lines 71-78 (inclusive) are shown below:
-
- line number | file contents
-
- 71 | # decimal average interpolated trend #days
- 72 | # date (season corr)
- 73 | 1958 3 1958.208 315.71 315.71 314.62 -1
- 74 | 1958 4 1958.292 317.45 317.45 315.29 -1
- 75 | 1958 5 1958.375 317.50 317.50 314.71 -1
- 76 | 1958 6 1958.458 -99.99 317.10 314.85 -1
- 77 | 1958 7 1958.542 315.86 315.86 314.98 -1
- 78 | 1958 8 1958.625 314.93 314.93 315.94 -1
-
-
-Notice how:
-
-- The values are separated by white space, possibly tabs.
-- The values line up down the rows. For example, the month appears in the 7th to 8th position of each line.
-- The 71st and 72nd lines in the file contain column headings split over two lines.
-
-We can use `read_csv` to read the data into a `pandas` `DataFrame`, and we provide several arguments to specify that the separators are white space, there is no header (**we will set our own column names**), and to skip the first 72 rows of the file.
-
-```{python}
-#| code-fold: false
-co2 = pd.read_csv(
- co2_file, header = None, skiprows = 72,
-    sep = r'\s+' # delimiter for continuous whitespace (stay tuned for regex next lecture)
-)
-co2.head()
-```
-
-Congratulations! You've wrangled the data!
-
-<br/>
-
-...But our columns aren't named.
-**We need to do more EDA.**
-
-## Exploring Variable Feature Types
-
-The NOAA [webpage](https://gml.noaa.gov/ccgg/trends/) might have some useful tidbits (in this case it doesn't).
-
-Using this information, we'll rerun `pd.read_csv`, but this time with some **custom column names.**
-
-```{python}
-#| code-fold: false
-co2 = pd.read_csv(
- co2_file, header = None, skiprows = 72,
-    sep = r'\s+', # regex for continuous whitespace (next lecture)
- names = ['Yr', 'Mo', 'DecDate', 'Avg', 'Int', 'Trend', 'Days']
-)
-co2.head()
-```
-
-## Visualizing CO<sub>2</sub>
-Scientific studies tend to have very clean data, right...? Let's jump right in and make a time series plot of CO2 monthly averages.
-
-```{python}
-#| code-fold: true
-sns.lineplot(x='DecDate', y='Avg', data=co2);
-```
-
-The code above uses the `seaborn` plotting library (abbreviated `sns`). We will cover it in the Visualization lecture; for now, you don't need to worry about how it works!
-
-Yikes! Plotting the data uncovered a problem. The sharp vertical lines suggest that we have some **missing values**. What happened here?
-
-```{python}
-#| code-fold: false
-co2.head()
-```
-
-```{python}
-#| code-fold: false
-co2.tail()
-```
-
-Some data have unusual values like -1 and -99.99.
-
-Let's check the description at the top of the file again.
-
-* -1 signifies a missing value for the number of days `Days` the equipment was in operation that month.
-* -99.99 denotes a missing monthly average `Avg`
-
-How can we fix this? First, let's explore other aspects of our data. Understanding our data will help us decide what to do with the missing values.
-
-<br/>
-
-
-## Sanity Checks: Reasoning about the data
-First, we consider the shape of the data. How many rows should we have?
-
-* If the data are in chronological order, we should have one record per month.
-* Data from March 1958 to August 2019.
-* We should have $ 12 \times (2019-1957) - 2 - 4 = 738 $ records.
-
-```{python}
-#| code-fold: false
-co2.shape
-```
-
-Nice!! The number of rows (i.e., records) matches our expectations.
-
-<br/>
-
-
-Let's now check the quality of each feature.
-
-## Understanding Missing Value 1: `Days`
-`Days` is a time field, so let's analyze other time fields to see if there is an explanation for missing values of days of operation.
-
-Let's start with **months**, `Mo`.
-
-Are we missing any records? Each month should appear 61 or 62 times (March 1958-August 2019).
-
-```{python}
-#| code-fold: false
-co2["Mo"].value_counts().sort_index()
-```
-
-As expected Jan, Feb, Sep, Oct, Nov, and Dec have 61 occurrences and the rest 62.
-
-<br/>
-
-Next let's explore **days** `Days` itself, which is the number of days that the measurement equipment worked.
-
-```{python}
-#| code-fold: true
-sns.displot(co2['Days']);
-plt.title("Distribution of days feature"); # suppresses unneeded plotting output
-```
-
-In terms of data quality, a handful of months have averages based on measurements taken on fewer than half the days. In addition, there are nearly 200 missing values--**that's about 27% of the data**!
-
-<br/>
-
-Finally, let's check the last time feature, **year** `Yr`.
-
-Let's check to see if there is any connection between missing-ness and the year of the recording.
-
-```{python}
-#| code-fold: true
-sns.scatterplot(x="Yr", y="Days", data=co2);
-plt.title("Day field by Year"); # the ; suppresses output
-```
-
-**Observations**:
-
-* All of the missing data are in the early years of operation.
-* It appears there may have been problems with equipment in the mid to late 80s.
-
-**Potential Next Steps**:
-
-* Confirm these explanations through documentation about the historical readings.
-* Maybe drop earliest recordings? However, we would want to delay such action until after we have examined the time trends and assess whether there are any potential problems.
-
-<br/>
-
-## Understanding Missing Value 2: `Avg`
-Next, let's return to the -99.99 values in `Avg` to analyze the overall quality of the CO2 measurements. We'll plot a histogram of the average CO<sub>2</sub> measurements
-
-```{python}
-#| code-fold: true
-# Histograms of average CO2 measurements
-sns.displot(co2['Avg']);
-```
-
-The non-missing values are in the 300-400 range (a regular range of CO2 levels).
-
-We also see that there are only a few missing `Avg` values (**<1% of values**). Let's examine all of them:
-
-```{python}
-#| code-fold: false
-co2[co2["Avg"] < 0]
-```
-
-There doesn't seem to be a pattern to these values, other than that most records also were missing `Days` data.
-
-## Drop, `NaN`, or Impute Missing `Avg` Data?
-
-How should we address the invalid `Avg` data?
-
-1. Drop records
-2. Set to NaN
-3. Impute using some strategy
-
-Remember we want to fix the following plot:
-
-```{python}
-#| code-fold: true
-sns.lineplot(x='DecDate', y='Avg', data=co2)
-plt.title("CO2 Average By Month");
-```
-
-Since we are plotting `Avg` vs `DecDate`, we should just focus on dealing with missing values for `Avg`.
-
-
-Let's consider a few options:
-
-1. Drop those records
-2. Replace -99.99 with NaN
-3. Substitute it with a likely value for the average CO2?
-
-What do you think are the pros and cons of each possible action?
-
-<br/>
-
-
-Let's examine each of these three options.
-
-```{python}
-#| code-fold: false
-# 1. Drop missing values
-co2_drop = co2[co2['Avg'] > 0]
-co2_drop.head()
-```
-
-```{python}
-#| code-fold: false
-# 2. Replace -99.99 with NaN
-co2_NA = co2.replace(-99.99, np.NaN)
-co2_NA.head()
-```
-
-We'll also use a third version of the data.
-
-First, we note that the dataset already comes with a **substitute value** for the -99.99.
-
-From the file description:
-
-> The `interpolated` column includes average values from the preceding column (`average`)
-and **interpolated values** where data are missing. Interpolated values are
-computed in two steps...
-
-The `Int` feature has values that exactly match those in `Avg`, except when `Avg` is -99.99, and then a **reasonable** estimate is used instead.
-
-So, the third version of our data will use the `Int` feature instead of `Avg`.
-
-```{python}
-#| code-fold: false
-# 3. Use interpolated column which estimates missing Avg values
-co2_impute = co2.copy()
-co2_impute['Avg'] = co2['Int']
-co2_impute.head()
-```
-
-What's a **reasonable** estimate?
-
-To answer this question, let's zoom in on a short time period, say the measurements in 1958 (where we know we have two missing values).
-
-```{python}
-#| code-fold: true
-# results of plotting data in 1958
-
-def line_and_points(data, ax, title):
- # assumes single year, hence Mo
- ax.plot('Mo', 'Avg', data=data)
- ax.scatter('Mo', 'Avg', data=data)
- ax.set_xlim(2, 13)
- ax.set_title(title)
- ax.set_xticks(np.arange(3, 13))
-
-def data_year(data, year):
-    return data[data["Yr"] == year]
-
-# uses matplotlib subplots
-# you may see more next week; focus on output for now
-fig, axes = plt.subplots(ncols = 3, figsize=(12, 4), sharey=True)
-
-year = 1958
-line_and_points(data_year(co2_drop, year), axes[0], title="1. Drop Missing")
-line_and_points(data_year(co2_NA, year), axes[1], title="2. Missing Set to NaN")
-line_and_points(data_year(co2_impute, year), axes[2], title="3. Missing Interpolated")
-
-fig.suptitle(f"Monthly Averages for {year}")
-plt.tight_layout()
-```
-
-In the big picture, since there are only 7 `Avg` values missing (**<1%** of 738 months), any of these approaches would work.
-
-However, there is some appeal to **option 3: imputing**:
-
-* Shows seasonal trends for CO2
-* We are plotting all months in our data as a line plot
-
-<br/>
-
-
-Let's replot our original figure with option 3:
-
-```{python}
-#| code-fold: true
-sns.lineplot(x='DecDate', y='Avg', data=co2_impute)
-plt.title("CO2 Average By Month, Imputed");
-```
-
-Looks pretty close to what we see on the NOAA [website](https://gml.noaa.gov/ccgg/trends/)!
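-
-If we preferred to impute the values ourselves rather than rely on the file's `Int` column, `pandas` has a built-in `Series.interpolate` method. The cell below is a minimal sketch (not part of the original demo) that applies simple linear interpolation to the NaN version of the data from option 2; NOAA's own interpolation procedure is more sophisticated, so treat this only as an approximation.
-
-```python
-# Sketch: linearly interpolate the missing monthly averages ourselves.
-co2_lin = co2_NA.copy()
-co2_lin['Avg'] = co2_lin['Avg'].interpolate()  # fills NaN gaps linearly
-co2_lin.head()
-```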
-
-## Presenting the data: A Discussion on Data Granularity
-
-From the description:
-
-* monthly measurements are averages of average day measurements.
-* The NOAA GML website has datasets for daily/hourly measurements too.
-
-The data you present depends on your research question.
-
-**How do CO2 levels vary by season?**
-
-* You might want to keep average monthly data.
-
-**Are CO2 levels rising over the past 50+ years, consistent with global warming predictions?**
-
-* You might be happier with a **coarser granularity** of average year data!
-
-```{python}
-#| code-fold: true
-co2_year = co2_impute.groupby('Yr').mean()
-sns.lineplot(x='Yr', y='Avg', data=co2_year)
-plt.title("CO2 Average By Year");
-```
-
-Indeed, we see a rise by nearly 100 ppm of CO2 since Mauna Loa began recording in 1958.
-
-# Summary
-We went over a lot of content this lecture; let's summarize the most important points:
-
-## Dealing with Missing Values
-There are a few options we can take to deal with missing data:
-
-* Drop missing records
-* Keep `NaN` missing values
-* Impute using an interpolated column
-
-## EDA and Data Wrangling
-There are several ways to approach EDA and Data Wrangling:
-
-* Examine the **data and metadata**: what is the date, size, organization, and structure of the data?
-* Examine each **field/attribute/dimension** individually.
-* Examine pairs of related dimensions (e.g. breaking down grades by major).
-* Along the way, we can:
- * **Visualize** or summarize the data.
- * **Validate assumptions** about data and its collection process. Pay particular attention to when the data was collected.
- * Identify and **address anomalies**.
- * Apply data transformations and corrections (we'll cover this in the upcoming lecture).
- * **Record everything you do!** Developing in Jupyter Notebook promotes *reproducibility* of your own work!
+---
+title: Data Cleaning and EDA
+execute:
+ echo: true
+format:
+ html:
+ code-fold: true
+ code-tools: true
+ toc: true
+ toc-title: Data Cleaning and EDA
+ page-layout: full
+ theme:
+ - cosmo
+ - cerulean
+ callout-icon: false
+jupyter: python3
+---
+
+```{python}
+#| code-fold: true
+import numpy as np
+import pandas as pd
+
+import matplotlib.pyplot as plt
+import seaborn as sns
+#%matplotlib inline
+plt.rcParams['figure.figsize'] = (12, 9)
+
+sns.set()
+sns.set_context('talk')
+np.set_printoptions(threshold=20, precision=2, suppress=True)
+pd.set_option('display.max_rows', 30)
+pd.set_option('display.max_columns', None)
+pd.set_option('display.precision', 2)
+# This option stops scientific notation for pandas
+pd.set_option('display.float_format', '{:.2f}'.format)
+
+# Silence some spurious seaborn warnings
+import warnings
+warnings.filterwarnings("ignore", category=FutureWarning)
+```
+
+::: {.callout-note collapse="false"}
+## Learning Outcomes
+* Recognize common file formats
+* Categorize data by its variable type
+* Build awareness of issues with data faithfulness and develop targeted solutions
+:::
+
+**This content is covered in lectures 4, 5, and 6.**
+
+In the past few lectures, we've learned that `pandas` is a toolkit to restructure, modify, and explore a dataset. What we haven't yet touched on is *how* to make these data transformation decisions. When we receive a new set of data from the "real world," how do we know what processing we should do to convert this data into a usable form?
+
+**Data cleaning**, also called **data wrangling**, is the process of transforming raw data to facilitate subsequent analysis. It is often used to address issues like:
+
+* Unclear structure or formatting
+* Missing or corrupted values
+* Unit conversions
+* ...and so on
+
+**Exploratory Data Analysis (EDA)** is the process of understanding a new dataset. It is an open-ended, informal analysis that involves familiarizing ourselves with the variables present in the data, discovering potential hypotheses, and identifying possible issues with the data. This last point can often motivate further data cleaning to address any problems with the dataset's format; because of this, EDA and data cleaning are often thought of as an "infinite loop," with each process driving the other.
+
+In this lecture, we will consider the key properties of data to consider when performing data cleaning and EDA. In doing so, we'll develop a "checklist" of sorts for you to consider when approaching a new dataset. Throughout this process, we'll build a deeper understanding of this early (but very important!) stage of the data science lifecycle.
+
+## Structure
+
+### File Formats
+There are many file types for storing structured data: TSV, JSON, XML, ASCII, SAS, etc. We'll only cover CSV, TSV, and JSON in lecture, but you'll likely encounter other formats as you work with different datasets. Reading documentation is your best bet for understanding how to process the multitude of different file types.
+
+#### CSV
+CSVs, which stand for **Comma-Separated Values**, are a common tabular data format.
+In the past two `pandas` lectures, we briefly touched on the idea of file format: the way data is encoded in a file for storage. Specifically, our `elections` and `babynames` datasets were stored and loaded as CSVs:
+
+```{python}
+#| code-fold: false
+pd.read_csv("data/elections.csv").head(5)
+```
+
+To better understand the properties of a CSV, let's take a look at the first few rows of the raw data file to see what it looks like before being loaded into a `DataFrame`. We'll use the `repr()` function to return the raw string with its special characters:
+
+```{python}
+#| code-fold: false
+with open("data/elections.csv", "r") as table:
+ i = 0
+ for row in table:
+ print(repr(row))
+ i += 1
+ if i > 3:
+ break
+```
+
+Each row, or **record**, in the data is delimited by a newline `\n`. Each column, or **field**, in the data is delimited by a comma `,` (hence, comma-separated!).
+
+#### TSV
+
+Another common file type is **TSV (Tab-Separated Values)**. In a TSV, records are still delimited by a newline `\n`, while fields are delimited by the tab character `\t`.
+
+Let's check out the first few rows of the raw TSV file. Again, we'll use the `repr()` function so that `print` shows the special characters.
+
+```{python}
+#| code-fold: false
+with open("data/elections.txt", "r") as table:
+ i = 0
+ for row in table:
+ print(repr(row))
+ i += 1
+ if i > 3:
+ break
+```
+
+TSVs can be loaded into `pandas` using `pd.read_csv`. We'll need to specify the **delimiter** with the parameter `sep='\t'` [(documentation)](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html).
+
+```{python}
+#| code-fold: false
+pd.read_csv("data/elections.txt", sep='\t').head(3)
+```
+
+An issue with CSVs and TSVs comes up whenever there are commas or tabs within the records. How does `pandas` differentiate between a comma delimiter vs. a comma within the field itself, for example `8,900`? To remedy this, check out the [`quotechar` parameter](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html).
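+
+As a small illustration, the sketch below builds a made-up in-memory CSV in which both fields contain embedded commas; wrapping the fields in quotes keeps them intact (`quotechar='"'` is already the default and is passed explicitly only for emphasis).
+
+```python
+import io
+import pandas as pd
+
+# Made-up CSV string: both fields contain embedded commas, protected by quotes.
+raw = 'Candidate,Votes\n"Smith, Jr.","8,900"\n'
+pd.read_csv(io.StringIO(raw), quotechar='"', thousands=',')
+```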
+
+#### JSON
+**JSON (JavaScript Object Notation)** files behave similarly to Python dictionaries. A raw JSON is shown below.
+
+```{python}
+#| code-fold: false
+with open("data/elections.json", "r") as table:
+ i = 0
+ for row in table:
+ print(row)
+ i += 1
+ if i > 8:
+ break
+```
+
+JSON files can be loaded into `pandas` using `pd.read_json`.
+
+```{python}
+#| code-fold: false
+pd.read_json('data/elections.json').head(3)
+```
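+
+Real-world JSON is often nested. When that happens, `pd.json_normalize` can flatten records into a table; the records below are made up purely for illustration.
+
+```python
+import pandas as pd
+
+# Made-up nested records (for illustration only).
+records = [
+    {"name": "Oski", "contact": {"email": "oski@berkeley.edu", "city": "Berkeley"}},
+    {"name": "Ollie", "contact": {"email": "ollie@berkeley.edu", "city": "Albany"}},
+]
+pd.json_normalize(records)  # columns: name, contact.email, contact.city
+```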
+
+##### EDA with JSON: Berkeley COVID-19 Data
+The City of Berkeley Open Data [website](https://data.cityofberkeley.info/Health/COVID-19-Confirmed-Cases/xn6j-b766) has a dataset with COVID-19 Confirmed Cases among Berkeley residents by date. Let's download the file and save it as a JSON (note the source URL file type is also a JSON). In the interest of reproducible data science, we will download the data programmatically. We have defined some helper functions in the [`ds100_utils.py`](https://ds100.org/fa23/resources/assets/lectures/lec05/lec05-eda.html) file so that we can reuse them in many different notebooks.
+
+```{python}
+#| code-fold: false
+from ds100_utils import fetch_and_cache
+
+covid_file = fetch_and_cache(
+ "https://data.cityofberkeley.info/api/views/xn6j-b766/rows.json?accessType=DOWNLOAD",
+ "confirmed-cases.json",
+ force=False)
+covid_file # a file path wrapper object
+```
+
+###### File Size
+Let's start our analysis by getting a rough estimate of the size of the dataset to inform the tools we use to view the data. For relatively small datasets, we can use a text editor or spreadsheet. For larger datasets, more programmatic exploration or distributed computing tools may be more fitting. Here we will use `Python` tools to probe the file.
+
+Since the file appears to be text, let's investigate the number of lines, which often corresponds to the number of records.
+
+```{python}
+#| code-fold: false
+import os
+
+print(covid_file, "is", os.path.getsize(covid_file) / 1e6, "MB")
+
+with open(covid_file, "r") as f:
+ print(covid_file, "is", sum(1 for l in f), "lines.")
+```
+
+###### Unix Commands
+As part of the EDA workflow, Unix commands can come in very handy. In fact, there's an entire book called ["Data Science at the Command Line"](https://datascienceatthecommandline.com/) that explores this idea in depth!
+In Jupyter/IPython, you can prefix lines with `!` to execute arbitrary Unix commands, and within those lines, you can refer to `Python` variables and expressions with the syntax `{expr}`.
+
+Here, we use the `ls` command to list files, using the `-lh` flags, which request "long format with information in human-readable form." We also use the `wc` command for "word count," but with the `-l` flag, which asks for line counts instead of words.
+
+These two give us the same information as the code above, albeit in a slightly different form:
+
+```{python}
+#| code-fold: false
+!ls -lh {covid_file}
+!wc -l {covid_file}
+```
+
+###### File Contents
+Let's explore the data format using `Python`.
+
+```{python}
+#| code-fold: false
+with open(covid_file, "r") as f:
+ for i, row in enumerate(f):
+ print(repr(row)) # print raw strings
+ if i >= 4: break
+```
+
+We can use the `head` Unix command (which is where `pandas`' `head` method comes from!) to see the first few lines of the file:
+
+```{python}
+#| code-fold: false
+!head -5 {covid_file}
+```
+
+In order to load the JSON file into `pandas`, let's first do some EDA with `Python`'s `json` package to understand the particular structure of this JSON file so that we can decide what (if anything) to load into `pandas`. `Python` has relatively good support for JSON data since it closely matches the internal Python object model. In the following cell, we import the entire JSON datafile into a Python dictionary using the `json` package.
+
+```{python}
+#| code-fold: false
+import json
+
+with open(covid_file, "rb") as f:
+ covid_json = json.load(f)
+```
+
+The `covid_json` variable is now a dictionary encoding the data in the file:
+
+```{python}
+#| code-fold: false
+type(covid_json)
+```
+
+We can examine what keys are in the top-level JSON object by listing them out.
+
+```{python}
+#| code-fold: false
+covid_json.keys()
+```
+
+**Observation**: The JSON dictionary contains a `meta` key, which likely refers to metadata (data about the data). Metadata is often maintained with the data and can be a good source of additional information.
+
+
+We can investigate the metadata further by examining its keys.
+
+```{python}
+#| code-fold: false
+covid_json['meta'].keys()
+```
+
+The `meta` key contains another dictionary called `view`. This likely refers to metadata about a particular "view" of some underlying database. We will learn more about views when we study SQL later in the class.
+
+```{python}
+#| code-fold: false
+covid_json['meta']['view'].keys()
+```
+
+Notice that this is a nested/recursive data structure. As we dig deeper, we reveal more and more keys and the corresponding data:
+
+```
+meta
+|-> data
+ | ... (haven't explored yet)
+|-> view
+ | -> id
+ | -> name
+ | -> attribution
+ ...
+ | -> description
+ ...
+ | -> columns
+ ...
+```
+
+
+There is a key called `description` in the `view` sub-dictionary. This likely contains a description of the data:
+
+```{python}
+#| code-fold: false
+print(covid_json['meta']['view']['description'])
+```
+
+###### Examining the Data Field for Records
+
+We can look at a few entries in the `data` field. This is what we'll load into `pandas`.
+
+```{python}
+#| code-fold: false
+for i in range(3):
+ print(f"{i:03} | {covid_json['data'][i]}")
+```
+
+Observations:
+* These look like equal-length records, so maybe `data` is a table!
+* But what does each of the values in a record mean? Where can we find the column headers?
+
+For that, we'll need the `columns` key in the metadata dictionary. This returns a list:
+
+```{python}
+#| code-fold: false
+type(covid_json['meta']['view']['columns'])
+```
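+
+Each element of that list describes one column. As a quick peek (the exact field names come from this particular Socrata export and may differ for other datasets), we can inspect the keys of a single entry:
+
+```python
+# Inspect the metadata entry for the first column.
+first_col = covid_json['meta']['view']['columns'][0]
+first_col.keys()
+```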
+
+###### Summary of exploring the JSON file
+
+1. The above **metadata** tells us a lot about the columns in the data including column names, potential data anomalies, and a basic statistic.
+1. Because of its non-tabular structure, JSON makes it easier (than CSV) to create **self-documenting data**, meaning that information about the data is stored in the same file as the data.
+1. Self-documenting data can be helpful since it maintains its own description and these descriptions are more likely to be updated as data changes.
+
+###### Loading COVID Data into `pandas`
+Finally, let's load the data (not the metadata) into a `pandas` `DataFrame`. In the following block of code we:
+
+1. Translate the JSON records into a `DataFrame`:
+
+ * fields: `covid_json['meta']['view']['columns']`
+ * records: `covid_json['data']`
+
+
+1. Remove columns that have no metadata description. This would be a bad idea in general, but here we remove these columns since the above analysis suggests they are unlikely to contain useful information.
+
+1. Examine the `tail` of the table.
+
+```{python}
+#| code-fold: false
+# Load the data from JSON and assign column titles
+covid = pd.DataFrame(
+ covid_json['data'],
+ columns=[c['name'] for c in covid_json['meta']['view']['columns']])
+
+covid.tail()
+```
+
+### Variable Types
+
+After loading data from a file, it's a good idea to take the time to understand what pieces of information are encoded in the dataset. In particular, we want to identify what variable types are present in our data. Broadly speaking, we can categorize variables into one of two overarching types.
+
+**Quantitative variables** describe some numeric quantity or amount. We can divide quantitative data further into:
+
+* **Continuous quantitative variables**: numeric data that can be measured on a continuous scale to arbitrary precision. Continuous variables do not have a strict set of possible values – they can be recorded to any number of decimal places. For example, weights, GPA, or CO<sub>2</sub> concentrations.
+* **Discrete quantitative variables**: numeric data that can only take on a finite set of possible values. For example, someone's age or the number of siblings they have.
+
+**Qualitative variables**, also known as **categorical variables**, describe data that isn't measuring some quantity or amount. The sub-categories of categorical data are:
+
+* **Ordinal qualitative variables**: categories with ordered levels. Specifically, ordinal variables are those where the difference between levels has no consistent, quantifiable meaning. Some examples include levels of education (high school, undergrad, grad, etc.), income bracket (low, medium, high), or Yelp rating.
+* **Nominal qualitative variables**: categories with no specific order. For example, someone's political affiliation or Cal ID number.
+
+![Classification of variable types](images/variable.png)
+
+Note that many variables don't sit neatly in just one of these categories. Qualitative variables could have numeric levels, and conversely, quantitative variables could be stored as strings.
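+
+A storage dtype alone doesn't determine the variable type. The toy example below (made up for illustration) stores a nominal variable as an integer; computing its mean would be meaningless even though `pandas` happily allows it.
+
+```python
+import pandas as pd
+
+df = pd.DataFrame({
+    "Zip Code": [94720, 94709, 94704],  # nominal qualitative, despite numeric storage
+    "GPA": [3.2, 3.7, 3.5],             # continuous quantitative
+})
+df.dtypes  # both columns show numeric dtypes -- dtypes don't reveal the variable type
+```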
+
+### Primary and Foreign Keys
+
+Last time, we introduced `.merge` as the `pandas` method for joining multiple `DataFrame`s together. In our discussion of joins, we touched on the idea of using a "key" to determine what rows should be merged from each table. Let's take a moment to examine this idea more closely.
+
+The **primary key** is the column or set of columns in a table that *uniquely* determine the values of the remaining columns. It can be thought of as the unique identifier for each individual row in the table. For example, a table of Data 100 students might use each student's Cal ID as the primary key.
+
+```{python}
+#| echo: false
+pd.DataFrame({"Cal ID":[3034619471, 3035619472, 3025619473, 3046789372], \
+ "Name":["Oski", "Ollie", "Orrie", "Ollie"], \
+ "Major":["Data Science", "Computer Science", "Data Science", "Economics"]})
+```
+
+The **foreign key** is the column or set of columns in a table that reference primary keys in other tables. Knowing a dataset's foreign keys can be useful when assigning the `left_on` and `right_on` parameters of `.merge`. In the table of office hour tickets below, `"Cal ID"` is a foreign key referencing the previous table.
+
+```{python}
+#| echo: false
+pd.DataFrame({"OH Request":[1, 2, 3, 4], \
+ "Cal ID":[3034619471, 3035619472, 3025619473, 3035619472], \
+ "Question":["HW 2 Q1", "HW 2 Q3", "Lab 3 Q4", "HW 2 Q7"]})
+```
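+
+The foreign key is exactly what we would hand to `.merge`. Below is a minimal sketch that rebuilds the two toy tables above under our own variable names (`students` and `tickets` are not defined elsewhere in this lecture):
+
+```python
+import pandas as pd
+
+students = pd.DataFrame({
+    "Cal ID": [3034619471, 3035619472, 3025619473, 3046789372],
+    "Name": ["Oski", "Ollie", "Orrie", "Ollie"],
+    "Major": ["Data Science", "Computer Science", "Data Science", "Economics"],
+})
+tickets = pd.DataFrame({
+    "OH Request": [1, 2, 3, 4],
+    "Cal ID": [3034619471, 3035619472, 3025619473, 3035619472],
+    "Question": ["HW 2 Q1", "HW 2 Q3", "Lab 3 Q4", "HW 2 Q7"],
+})
+
+# Join each ticket to the student it references via the foreign key.
+tickets.merge(students, on="Cal ID", how="left")
+```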
+
+## Granularity, Scope, and Temporality
+
+After understanding the structure of the dataset, the next task is to determine what exactly the data represents. We'll do so by considering the data's granularity, scope, and temporality.
+
+### Granularity
+The **granularity** of a dataset is what a single row represents. You can also think of it as the level of detail included in the data. To determine the data's granularity, ask: what does each row in the dataset represent? Fine-grained data contains a high level of detail, with a single row representing a small individual unit. For example, each record may represent one person. Coarse-grained data is encoded such that a single row represents a large individual unit – for example, each record may represent a group of people.
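+
+Aggregation is how we move from fine-grained to coarse-grained data. A small made-up sketch: person-level records are coarsened into one row per city.
+
+```python
+import pandas as pd
+
+# Fine-grained: one row per person (made-up data).
+people = pd.DataFrame({
+    "City": ["Berkeley", "Berkeley", "Oakland", "Oakland", "Oakland"],
+    "Age": [22, 35, 41, 29, 53],
+})
+
+# Coarse-grained: one row per city.
+people.groupby("City").agg(residents=("Age", "size"), mean_age=("Age", "mean"))
+```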
+
+### Scope
+The **scope** of a dataset is the subset of the population covered by the data. If we were investigating student performance in Data Science courses, a dataset with a narrow scope might encompass all students enrolled in Data 100 whereas a dataset with an expansive scope might encompass all students in California.
+
+### Temporality
+The **temporality** of a dataset describes the periodicity over which the data was collected as well as when the data was most recently collected or updated.
+
+Time and date fields of a dataset could represent a few things:
+
+1. when the "event" happened
+2. when the data was collected, or when it was entered into the system
+3. when the data was copied into the database
+
+To fully understand the temporality of the data, it also may be necessary to standardize time zones or inspect recurring time-based trends in the data (do patterns recur in 24-hour periods? Over the course of a month? Seasonally?). The convention for standardizing time is Coordinated Universal Time (UTC), an international time standard measured at 0 degrees longitude that stays consistent throughout the year (no daylight saving time). Berkeley's time zone, Pacific Standard Time (PST), is UTC-8; during daylight saving time (PDT), it is UTC-7.
+
+#### Temporality with `pandas`' `dt` accessors
+Let's briefly look at how we can use `pandas`' `dt` accessors to work with dates/times in a dataset using the dataset you'll see in Lab 3: the Berkeley PD Calls for Service dataset.
+
+```{python}
+#| code-fold: true
+calls = pd.read_csv("data/Berkeley_PD_-_Calls_for_Service.csv")
+calls.head()
+```
+
+Looks like there are three columns with dates/times: `EVENTDT`, `EVENTTM`, and `InDbDate`.
+
+Most likely, `EVENTDT` stands for the date when the event took place, `EVENTTM` stands for the time of day the event took place (in 24-hr format), and `InDbDate` is the date this call was recorded into the database.
+
+If we check the data type of these columns, we will see they are stored as strings. We can convert them to `datetime` objects using the `pandas` `to_datetime` function.
+
+```{python}
+#| code-fold: false
+calls["EVENTDT"] = pd.to_datetime(calls["EVENTDT"])
+calls.head()
+```
+
+Now, we can use the `dt` accessor on this column.
+
+We can get the month:
+
+```{python}
+#| code-fold: false
+calls["EVENTDT"].dt.month.head()
+```
+
+Which day of the week the date is on:
+
+```{python}
+#| code-fold: false
+calls["EVENTDT"].dt.dayofweek.head()
+```
+
+Check the minimum values to see if there are any suspicious-looking dates from the 1970s:
+
+```{python}
+#| code-fold: false
+calls.sort_values("EVENTDT").head()
+```
+
+Doesn't look like it! We are good!
+
+
+We can also do many things with the `dt` accessor like switching time zones and converting time back to UNIX/POSIX time. Check out the documentation on [`.dt` accessor](https://pandas.pydata.org/docs/user_guide/basics.html#basics-dt-accessors) and [time series/date functionality](https://pandas.pydata.org/docs/user_guide/timeseries.html#).
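+
+As a sketch of what that might look like on `EVENTDT` (how to handle daylight-saving edge cases is a judgment call; here ambiguous or nonexistent times simply become `NaT`):
+
+```python
+# Treat the naive timestamps as Pacific time, convert to UTC,
+# then express them as UNIX seconds.
+event_utc = (
+    calls["EVENTDT"]
+    .dt.tz_localize("US/Pacific", ambiguous="NaT", nonexistent="NaT")
+    .dt.tz_convert("UTC")
+)
+unix_seconds = (event_utc - pd.Timestamp("1970-01-01", tz="UTC")) // pd.Timedelta("1s")
+unix_seconds.head()
+```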
+
+## Faithfulness
+
+At this stage in our data cleaning and EDA workflow, we've achieved quite a lot: we've identified how our data is structured, come to terms with what information it encodes, and gained insight as to how it was generated. Throughout this process, we should always recall the original intent of our work in Data Science – to use data to better understand and model the real world. To achieve this goal, we need to ensure that the data we use is faithful to reality; that is, that our data accurately captures the "real world."
+
+Data used in research or industry is often "messy" – there may be errors or inaccuracies that impact the faithfulness of the dataset. Signs that data may not be faithful include:
+
+* Unrealistic or "incorrect" values, such as negative counts, locations that don't exist, or dates set in the future
+* Violations of obvious dependencies, like an age that does not match a birthday
+* Clear signs that data was entered by hand, which can lead to spelling errors or fields that are incorrectly shifted
+* Signs of data falsification, such as fake email addresses or repeated use of the same names
+* Duplicated records or fields containing the same information
+* Truncated data, e.g., older versions of Microsoft Excel limited the number of rows to 65,536 and the number of columns to 256
+
+We often solve some of these more common issues in the following ways:
+
+* Spelling errors: apply corrections or drop records that aren't in a dictionary
+* Time zone inconsistencies: convert to a common time zone (e.g. UTC)
+* Duplicated records or fields: identify and eliminate duplicates (using primary keys)
+* Unspecified or inconsistent units: infer the units and check that values are in reasonable ranges in the data
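+
+A few of these checks take only a line or two in `pandas`. A minimal sketch on a toy table (the column names are placeholders, not from a real dataset):
+
+```python
+import pandas as pd
+
+df = pd.DataFrame({
+    "age": [21, -3, 21, 47],
+    "email": ["a@x.com", "b@x.com", "a@x.com", "c@x.com"],
+})
+
+df[df["age"] < 0]                      # flag unrealistic values (e.g., negative ages)
+df.duplicated(subset=["email"]).sum()  # count potential duplicate records by key
+df.drop_duplicates(subset=["email"])   # drop duplicates on the chosen key
+```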
+
+### Missing Values
+Another common issue encountered with real-world datasets is that of missing data. One strategy to resolve this is to simply drop any records with missing values from the dataset. This does, however, introduce the risk of inducing biases – it is possible that the missing or corrupt records may be systematically related to some feature of interest in the data. Another solution is to keep the data as `NaN` values.
+
+A third method to address missing data is to perform **imputation**: infer the missing values using other data available in the dataset. There is a wide variety of imputation techniques that can be implemented; some of the most common are listed below.
+
+* Average imputation: replace missing values with the average value for that field
+* Hot deck imputation: replace missing values with a value drawn at random from observed records (often from similar records)
+* Regression imputation: develop a model to predict missing values
+* Multiple imputation: replace missing values with multiple random values
+
+Regardless of the strategy used to deal with missing data, we should think carefully about *why* particular records or fields may be missing – this can help inform whether or not the absence of these values is significant or meaningful.
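+
+As a tiny sketch of the first technique above, average imputation with `fillna` on made-up data:
+
+```python
+import numpy as np
+import pandas as pd
+
+# Made-up series with one missing value.
+temps = pd.Series([68.0, 70.0, np.nan, 75.0])
+
+# Average imputation: fill the missing entry with the mean of the observed values.
+temps.fillna(temps.mean())
+```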
+
+# EDA Demo 1: Tuberculosis in the United States
+
+Now, let's walk through the data-cleaning and EDA workflow to see what we can learn about the presence of Tuberculosis in the United States!
+
+We will examine the data included in the [original CDC article](https://www.cdc.gov/mmwr/volumes/71/wr/mm7112a1.htm?s_cid=mm7112a1_w#T1_down) published in 2021.
+
+
+## CSVs and Field Names
+Suppose Table 1 was saved as a CSV file located in `data/cdc_tuberculosis.csv`.
+
+We can then explore the CSV (which is a text file, and does not contain binary-encoded data) in many ways:
+
+1. Using a text editor like emacs, vim, VSCode, etc.
+2. Opening the CSV directly in DataHub (read-only), Excel, Google Sheets, etc.
+3. The `Python` file object
+4. `pandas`, using `pd.read_csv()`
+
+To try out options 1 and 2, you can view or download the Tuberculosis data from the [lecture demo notebook](https://data100.datahub.berkeley.edu/hub/user-redirect/git-pull?repo=https%3A%2F%2Fgithub.com%2FDS-100%2Ffa23-student&urlpath=lab%2Ftree%2Ffa23-student%2Flecture%2Flec05%2Flec04-eda.ipynb&branch=main) under the `data` folder in the left-hand menu. Notice how the CSV file is a type of **rectangular data (i.e., tabular data) stored as comma-separated values**.
+
+Next, let's try out option 3 using the `Python` file object. We'll look at the first four lines:
+
+```{python}
+#| code-fold: true
+with open("data/cdc_tuberculosis.csv", "r") as f:
+ i = 0
+ for row in f:
+ print(row)
+ i += 1
+ if i > 3:
+ break
+```
+
+Whoa, why are there blank lines interspersed between the lines of the CSV?
+
+You may recall that all line breaks in text files are encoded as the special newline character `\n`. Python's `print()` prints each string (which already ends in a newline) and then adds an additional newline on top of that.
+
+If you're curious, we can use the `repr()` function to return the raw string with all special characters:
+
+```{python}
+#| code-fold: true
+with open("data/cdc_tuberculosis.csv", "r") as f:
+ i = 0
+ for row in f:
+ print(repr(row)) # print raw strings
+ i += 1
+ if i > 3:
+ break
+```
+
+Finally, let's try option 4 and use the tried-and-true Data 100 approach: `pandas`.
+
+```{python}
+#| code-fold: false
+tb_df = pd.read_csv("data/cdc_tuberculosis.csv")
+tb_df.head()
+```
+
+You may notice some strange things about this table: what's up with the "Unnamed" column names and the first row?
+
+Congratulations — you're ready to wrangle your data! Because of how things are stored, we'll need to clean the data a bit to name our columns better.
+
+A reasonable first step is to identify the row with the right header. The `pd.read_csv()` function ([documentation](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html)) has the convenient `header` parameter that we can set to use the elements in row 1 as the appropriate columns:
+
+```{python}
+#| code-fold: false
+tb_df = pd.read_csv("data/cdc_tuberculosis.csv", header=1) # row index
+tb_df.head(5)
+```
+
+Wait...but now we can't differentiate between the "Number of TB cases" and "TB incidence" year columns. `pandas` has tried to make our lives easier by automatically adding ".1" to the latter columns, but this doesn't help us, as humans, understand the data.
+
+We can do this manually with `df.rename()` ([documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.rename.html?highlight=rename#pandas.DataFrame.rename)):
+
+```{python}
+#| code-fold: false
+rename_dict = {'2019': 'TB cases 2019',
+ '2020': 'TB cases 2020',
+ '2021': 'TB cases 2021',
+ '2019.1': 'TB incidence 2019',
+ '2020.1': 'TB incidence 2020',
+ '2021.1': 'TB incidence 2021'}
+tb_df = tb_df.rename(columns=rename_dict)
+tb_df.head(5)
+```
+
+## Record Granularity
+
+You might already be wondering: what's up with that first record?
+
+Row 0 is what we call a **rollup record**, or summary record. It's often useful when displaying tables to humans. The **granularity** of record 0 (Totals) vs the rest of the records (States) is different.
+
+Okay, EDA step two. How was the rollup record aggregated?
+
+Let's check if Total TB cases is the sum of all state TB cases. If we sum over all rows, we should get **2x** the total cases in each of the TB cases year columns (why do you think this is?).
+
+```{python}
+#| code-fold: true
+tb_df.sum(axis=0)
+```
+
+Whoa, what's going on with the TB cases in 2019, 2020, and 2021? Check out the column types:
+
+```{python}
+#| code-fold: true
+tb_df.dtypes
+```
+
+Since there are commas in the values for TB cases, the numbers are read as the `object` datatype, or **storage type** (close to the `Python` string datatype), so `pandas` is concatenating strings instead of adding integers (recall that `Python` can "sum", or concatenate, strings together: `"data" + "100"` evaluates to `"data100"`).
+
+
+Fortunately `read_csv` also has a `thousands` parameter ([documentation](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html)):
+
+```{python}
+#| code-fold: false
+# improve readability: chaining method calls with outer parentheses/line breaks
+tb_df = (
+ pd.read_csv("data/cdc_tuberculosis.csv", header=1, thousands=',')
+ .rename(columns=rename_dict)
+)
+tb_df.head(5)
+```
+
+```{python}
+#| code-fold: false
+tb_df.sum()
+```
+
+The Total TB cases look right. Phew!
+
+Let's just look at the records with **state-level granularity**:
+
+```{python}
+#| code-fold: true
+state_tb_df = tb_df[1:]
+state_tb_df.head(5)
+```
+
+## Gather Census Data
+
+U.S. Census population estimates [source](https://www.census.gov/data/tables/time-series/demo/popest/2010s-state-total.html) (2019), [source](https://www.census.gov/data/tables/time-series/demo/popest/2020s-state-total.html) (2020-2021).
+
+Running the cells below cleans the data.
+
+There are a few new methods here:
+
+* `df.convert_dtypes()` ([documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.convert_dtypes.html)) conveniently converts all float dtypes into ints; the details are out of scope for this class.
+* `df.dropna()` ([documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.dropna.html)) will be explained in more detail next time.
+
+```{python}
+#| code-fold: true
+# 2010s census data
+census_2010s_df = pd.read_csv("data/nst-est2019-01.csv", header=3, thousands=",")
+census_2010s_df = (
+ census_2010s_df
+ .reset_index()
+ .drop(columns=["index", "Census", "Estimates Base"])
+ .rename(columns={"Unnamed: 0": "Geographic Area"})
+ .convert_dtypes() # "smart" converting of columns, use at your own risk
+ .dropna() # we'll introduce this next time
+)
+census_2010s_df['Geographic Area'] = census_2010s_df['Geographic Area'].str.strip('.')
+
+# with pd.option_context('display.min_rows', 30): # shows more rows
+# display(census_2010s_df)
+
+census_2010s_df.head(5)
+```
+
+Occasionally, you will want to modify code that you have imported. To reimport those modifications, you can either use `Python`'s `importlib` library:
+
+```python
+from importlib import reload
+reload(utils)
+```
+
+or use `iPython` magic which will intelligently import code when files change:
+
+```python
+%load_ext autoreload
+%autoreload 2
+```
+
+```{python}
+#| code-fold: true
+# census 2020s data
+census_2020s_df = pd.read_csv("data/NST-EST2022-POP.csv", header=3, thousands=",")
+census_2020s_df = (
+ census_2020s_df
+ .reset_index()
+ .drop(columns=["index", "Unnamed: 1"])
+ .rename(columns={"Unnamed: 0": "Geographic Area"})
+ .convert_dtypes() # "smart" converting of columns, use at your own risk
+ .dropna() # we'll introduce this next time
+)
+census_2020s_df['Geographic Area'] = census_2020s_df['Geographic Area'].str.strip('.')
+
+census_2020s_df.head(5)
+```
+
+## Joining Data (Merging `DataFrame`s)
+
+Time to `merge`! Here we use the `DataFrame` method `df1.merge(right=df2, ...)` on `DataFrame` `df1` ([documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.merge.html)). Contrast this with the function `pd.merge(left=df1, right=df2, ...)` ([documentation](https://pandas.pydata.org/docs/reference/api/pandas.merge.html?highlight=pandas%20merge#pandas.merge)). Feel free to use either.
+
+```{python}
+#| code-fold: false
+# merge TB DataFrame with two US census DataFrames
+tb_census_df = (
+ tb_df
+ .merge(right=census_2010s_df,
+ left_on="U.S. jurisdiction", right_on="Geographic Area")
+ .merge(right=census_2020s_df,
+ left_on="U.S. jurisdiction", right_on="Geographic Area")
+)
+tb_census_df.head(5)
+```
+
+Having all of these columns is a little unwieldy. We could either drop the unneeded columns now, or just merge on smaller census `DataFrame`s. Let's do the latter.
+
+```{python}
+#| code-fold: false
+# try merging again, but cleaner this time
+tb_census_df = (
+ tb_df
+ .merge(right=census_2010s_df[["Geographic Area", "2019"]],
+ left_on="U.S. jurisdiction", right_on="Geographic Area")
+ .drop(columns="Geographic Area")
+ .merge(right=census_2020s_df[["Geographic Area", "2020", "2021"]],
+ left_on="U.S. jurisdiction", right_on="Geographic Area")
+ .drop(columns="Geographic Area")
+)
+tb_census_df.head(5)
+```
+
+## Reproducing Data: Compute Incidence
+
+Let's recompute incidence to make sure we know where the original CDC numbers came from.
+
+From the [CDC report](https://www.cdc.gov/mmwr/volumes/71/wr/mm7112a1.htm?s_cid=mm7112a1_w#T1_down): TB incidence is computed as “Cases per 100,000 persons using mid-year population estimates from the U.S. Census Bureau.”
+
+If we define a group as 100,000 people, then we can compute the TB incidence for a given state population as
+
+$$\text{TB incidence} = \frac{\text{TB cases in population}}{\text{groups in population}} = \frac{\text{TB cases in population}}{\text{population}/100000} $$
+
+$$= \frac{\text{TB cases in population}}{\text{population}} \times 100000$$
+
+Let's try this for 2019:
+
+```{python}
+#| code-fold: false
+tb_census_df["recompute incidence 2019"] = tb_census_df["TB cases 2019"]/tb_census_df["2019"]*100000
+tb_census_df.head(5)
+```
+
+Awesome!!!
+
+Let's use a for-loop and `Python` format strings to compute TB incidence for all years. `Python` f-strings are just used for the purposes of this demo, but they're handy to know when you explore data beyond this course ([documentation](https://docs.python.org/3/tutorial/inputoutput.html)).
+
+```{python}
+#| code-fold: false
+# recompute incidence for all years
+for year in [2019, 2020, 2021]:
+ tb_census_df[f"recompute incidence {year}"] = tb_census_df[f"TB cases {year}"]/tb_census_df[f"{year}"]*100000
+tb_census_df.head(5)
+```
+
+These numbers look pretty close!!! There are a few errors in the hundredths place, particularly in 2021. It may be useful to further explore reasons behind this discrepancy.
+
+```{python}
+#| code-fold: false
+tb_census_df.describe()
+```
+
+## Bonus EDA: Reproducing the Reported Statistic
+
+
+**How do we reproduce that reported statistic in the original [CDC report](https://www.cdc.gov/mmwr/volumes/71/wr/mm7112a1.htm?s_cid=mm7112a1_w)?**
+
+> Reported TB incidence (cases per 100,000 persons) increased **9.4%**, from **2.2** during 2020 to **2.4** during 2021 but was lower than incidence during 2019 (2.7). Increases occurred among both U.S.-born and non–U.S.-born persons.
+
+This is TB incidence computed across the entire U.S. population! How do we reproduce this?
+
+* We need to reproduce the "Total" TB incidences in our rolled record.
+* But our current `tb_census_df` only has 51 entries (50 states plus Washington, D.C.). There is no rolled record.
+* What happened...?
+
+Let's get exploring!
+
+Before we keep exploring, we'll set all indexes to more meaningful values, instead of just numbers that pertain to some row at some point. This will make our cleaning slightly easier.
+
+```{python}
+#| code-fold: true
+tb_df = tb_df.set_index("U.S. jurisdiction")
+tb_df.head(5)
+```
+
+```{python}
+#| code-fold: false
+census_2010s_df = census_2010s_df.set_index("Geographic Area")
+census_2010s_df.head(5)
+```
+
+```{python}
+#| code-fold: false
+census_2020s_df = census_2020s_df.set_index("Geographic Area")
+census_2020s_df.head(5)
+```
+
+It turns out that our merge above only kept state records, even though our original `tb_df` had the "Total" rolled record:
+
+```{python}
+#| code-fold: false
+tb_df.head()
+```
+
+Recall that `merge` does an **inner** merge by default, meaning that it only preserves keys that are present in **both** `DataFrame`s.
+
+The rolled records in our census `DataFrame` have different `Geographic Area` fields, which was the key we merged on:
+
+```{python}
+#| code-fold: false
+census_2010s_df.head(5)
+```
+
+The Census `DataFrame` has several rolled records. The aggregate record we are looking for actually has the Geographic Area named "United States".
+
+One straightforward way to get the right merge is to rename the value itself. Because we now have the Geographic Area index, we'll use `df.rename()` ([documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.rename.html)):
+
+```{python}
+#| code-fold: false
+# rename rolled record for 2010s
+census_2010s_df.rename(index={'United States':'Total'}, inplace=True)
+census_2010s_df.head(5)
+```
+
+```{python}
+#| code-fold: false
+# same, but for 2020s rename rolled record
+census_2020s_df.rename(index={'United States':'Total'}, inplace=True)
+census_2020s_df.head(5)
+```
+
+<br/>
+
+Next let's rerun our merge. Note the different chaining, because we are now merging on indexes (`df.merge()` [documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.merge.html)).
+
+```{python}
+#| code-fold: false
+tb_census_df = (
+ tb_df
+ .merge(right=census_2010s_df[["2019"]],
+ left_index=True, right_index=True)
+ .merge(right=census_2020s_df[["2020", "2021"]],
+ left_index=True, right_index=True)
+)
+tb_census_df.head(5)
+```
+
+<br/>
+
+Finally, let's recompute our incidences:
+
+```{python}
+#| code-fold: false
+# recompute incidence for all years
+for year in [2019, 2020, 2021]:
+ tb_census_df[f"recompute incidence {year}"] = tb_census_df[f"TB cases {year}"]/tb_census_df[f"{year}"]*100000
+tb_census_df.head(5)
+```
+
+We reproduced the total U.S. incidences correctly!
+
+We're almost there. Let's revisit the quote:
+
+> Reported TB incidence (cases per 100,000 persons) increased **9.4%**, from **2.2** during 2020 to **2.4** during 2021 but was lower than incidence during 2019 (2.7). Increases occurred among both U.S.-born and non–U.S.-born persons.
+
+Recall that percent change from $A$ to $B$ is computed as
+$\text{percent change} = \frac{B - A}{A} \times 100$.
+
+```{python}
+#| code-fold: false
+#| tags: []
+incidence_2020 = tb_census_df.loc['Total', 'recompute incidence 2020']
+incidence_2020
+```
+
+```{python}
+#| code-fold: false
+#| tags: []
+incidence_2021 = tb_census_df.loc['Total', 'recompute incidence 2021']
+incidence_2021
+```
+
+```{python}
+#| code-fold: false
+#| tags: []
+difference = (incidence_2021 - incidence_2020)/incidence_2020 * 100
+difference
+```
+
+# EDA Demo 2: Mauna Loa CO<sub>2</sub> Data -- A Lesson in Data Faithfulness
+
+[Mauna Loa Observatory](https://gml.noaa.gov/ccgg/trends/data.html) has been monitoring CO<sub>2</sub> concentrations since 1958
+
+```{python}
+#| code-fold: false
+co2_file = "data/co2_mm_mlo.txt"
+```
+
+Let's do some **EDA**!!
+
+## Reading this file into Pandas?
+Let's instead check out this `.txt` file. Some questions to keep in mind: Do we trust this file extension? What structure is it?
+
+Lines 71-78 (inclusive) are shown below:
+
+ line number | file contents
+
+ 71 | # decimal average interpolated trend #days
+ 72 | # date (season corr)
+ 73 | 1958 3 1958.208 315.71 315.71 314.62 -1
+ 74 | 1958 4 1958.292 317.45 317.45 315.29 -1
+ 75 | 1958 5 1958.375 317.50 317.50 314.71 -1
+ 76 | 1958 6 1958.458 -99.99 317.10 314.85 -1
+ 77 | 1958 7 1958.542 315.86 315.86 314.98 -1
+ 78 | 1958 8 1958.625 314.93 314.93 315.94 -1
+
+
+Notice how:
+
+- The values are separated by white space, possibly tabs.
+- The values line up down the rows. For example, the month appears in the 7th to 8th position of each line.
+- The 71st and 72nd lines in the file contain column headings split over two lines.
+
+We can use `read_csv` to read the data into a `pandas` `DataFrame`, and we provide several arguments to specify that the separators are white space, there is no header (**we will set our own column names**), and to skip the first 72 rows of the file.
+
+```{python}
+#| code-fold: false
+co2 = pd.read_csv(
+ co2_file, header = None, skiprows = 72,
+    sep = r'\s+' # delimiter for continuous whitespace (stay tuned for regex next lecture)
+)
+co2.head()
+```
+
+Congratulations! You've wrangled the data!
+
+<br/>
+
+...But our columns aren't named.
+**We need to do more EDA.**
+
+## Exploring Variable Feature Types
+
+The NOAA [webpage](https://gml.noaa.gov/ccgg/trends/) might have some useful tidbits (in this case it doesn't).
+
+Using this information, we'll rerun `pd.read_csv`, but this time with some **custom column names.**
+
+```{python}
+#| code-fold: false
+co2 = pd.read_csv(
+ co2_file, header = None, skiprows = 72,
+    sep = r'\s+', # regex for continuous whitespace (next lecture)
+ names = ['Yr', 'Mo', 'DecDate', 'Avg', 'Int', 'Trend', 'Days']
+)
+co2.head()
+```
+
+## Visualizing CO<sub>2</sub>
+Scientific studies tend to have very clean data, right...? Let's jump right in and make a time series plot of CO2 monthly averages.
+
+```{python}
+#| code-fold: true
+sns.lineplot(x='DecDate', y='Avg', data=co2);
+```
+
+The code above uses the `seaborn` plotting library (abbreviated `sns`). We will cover it in the Visualization lecture; for now, you don't need to worry about how it works!
+
+Yikes! Plotting the data uncovered a problem. The sharp vertical lines suggest that we have some **missing values**. What happened here?
+
+```{python}
+#| code-fold: false
+co2.head()
+```
+
+```{python}
+#| code-fold: false
+co2.tail()
+```
+
+Some data have unusual values like -1 and -99.99.
+
+Let's check the description at the top of the file again.
+
+* -1 signifies a missing value for the number of days `Days` the equipment was in operation that month.
+* -99.99 denotes a missing monthly average `Avg`
+
+How can we fix this? First, let's explore other aspects of our data. Understanding our data will help us decide what to do with the missing values.
+
+<br/>
+
+
+## Sanity Checks: Reasoning about the data
+First, we consider the shape of the data. How many rows should we have?
+
+* If the data are in chronological order, we should have exactly one record per month.
+* The data run from March 1958 to August 2019.
+* We should have $12 \times (2019-1957) - 2 - 4 = 738$ records: 12 months for each of the 62 years from 1958 through 2019, minus January and February of 1958 and September through December of 2019 (checked below).
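+Checking that arithmetic in plain Python (nothing here depends on the dataset itself):
+
+```python
+full_years = 2019 - 1957                     # the 62 calendar years 1958 through 2019
+expected_records = 12 * full_years - 2 - 4   # drop Jan-Feb 1958 and Sep-Dec 2019
+expected_records                             # 738
+```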
+
+```{python}
+#| code-fold: false
+co2.shape
+```
+
+Nice!! The number of rows (i.e., records) matches our expectations.
+
+<br/>
+
+
+Let's now check the quality of each feature.
+
+## Understanding Missing Value 1: `Days`
+`Days` is a time field, so let's analyze other time fields to see if there is an explanation for missing values of days of operation.
+
+Let's start with **months**, `Mo`.
+
+Are we missing any records? Each month should appear 61 or 62 times (the data run from March 1958 to August 2019).
+
+```{python}
+#| code-fold: false
+co2["Mo"].value_counts().sort_index()
+```
+
+As expected, Jan, Feb, Sep, Oct, Nov, and Dec have 61 occurrences each, and the rest have 62.
+
+<br/>
+
+Next, let's explore **days** (`Days`) itself, which is the number of days that the measurement equipment worked that month.
+
+```{python}
+#| code-fold: true
+sns.displot(co2['Days']);
+plt.title("Distribution of days feature"); # suppresses unneeded plotting output
+```
+
+In terms of data quality, a handful of months have averages based on measurements taken on fewer than half the days. In addition, there are nearly 200 missing values--**that's about 27% of the data**!
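+We can verify that count directly (a quick sketch; recall that -1 is the sentinel for a missing `Days` value):
+
+```python
+n_missing_days = (co2["Days"] == -1).sum()
+n_missing_days, round(n_missing_days / len(co2) * 100, 1)  # count and percentage of all 738 records
+```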
+
+<br/>
+
+Finally, let's check the last time feature, **year** `Yr`.
+
+Let's check to see if there is any connection between missing-ness and the year of the recording.
+
+```{python}
+#| code-fold: true
+sns.scatterplot(x="Yr", y="Days", data=co2);
+plt.title("Day field by Year"); # the ; suppresses output
+```
+
+**Observations**:
+
+* All of the missing data are in the early years of operation.
+* It appears there may have been problems with equipment in the mid to late 80s.
+
+**Potential Next Steps**:
+
+* Confirm these explanations through documentation about the historical readings.
+* Maybe drop the earliest recordings? However, we would want to delay such action until after we have examined the time trends and assessed whether there are any potential problems.
+
+<br/>
+
+## Understanding Missing Value 2: `Avg`
+Next, let's return to the -99.99 values in `Avg` to analyze the overall quality of the CO2 measurements. We'll plot a histogram of the average CO<sub>2</sub> measurements:
+
+```{python}
+#| code-fold: true
+# Histograms of average CO2 measurements
+sns.displot(co2['Avg']);
+```
+
+The non-missing values fall in the 300-400 ppm range, a typical range for atmospheric CO2 levels.
+
+We also see that there are only a few missing `Avg` values (**<1% of values**). Let's examine all of them:
+
+```{python}
+#| code-fold: false
+co2[co2["Avg"] < 0]
+```
+
+There doesn't seem to be a pattern to these records, other than that most of them were also missing `Days` data.
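+We can quantify that overlap (a quick sketch):
+
+```python
+missing_avg = co2[co2["Avg"] < 0]
+(missing_avg["Days"] == -1).mean()  # fraction of the missing-Avg records that are also missing Days
+```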
+
+## Drop, `NaN`, or Impute Missing `Avg` Data?
+
+How should we address the invalid `Avg` data?
+
+1. Drop records
+2. Set to NaN
+3. Impute using some strategy
+
+Remember we want to fix the following plot:
+
+```{python}
+#| code-fold: true
+sns.lineplot(x='DecDate', y='Avg', data=co2)
+plt.title("CO2 Average By Month");
+```
+
+Since we are plotting `Avg` vs `DecDate`, we should just focus on dealing with missing values for `Avg`.
+
+
+Let's consider a few options:
+
+1. Drop those records
+2. Replace -99.99 with NaN
+3. Substitute the -99.99 values with a likely value for the average CO2
+
+What do you think are the pros and cons of each possible action?
+
+<br/>
+
+
+Let's examine each of these three options.
+
+```{python}
+#| code-fold: false
+# 1. Drop missing values
+co2_drop = co2[co2['Avg'] > 0]
+co2_drop.head()
+```
+
+```{python}
+#| code-fold: false
+# 2. Replace -99.99 with NaN
+co2_NA = co2.replace(-99.99, np.nan)
+co2_NA.head()
+```
+
+We'll also use a third version of the data.
+
+First, we note that the dataset already comes with a **substitute value** for the -99.99.
+
+From the file description:
+
+> The `interpolated` column includes average values from the preceding column (`average`)
+and **interpolated values** where data are missing. Interpolated values are
+computed in two steps...
+
+The `Int` feature has values that exactly match those in `Avg`, except when `Avg` is -99.99, in which case a **reasonable** estimate is used instead.
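+We can sanity-check that description by comparing the two columns on the rows where `Avg` is not the -99.99 sentinel (a quick sketch):
+
+```python
+valid = co2["Avg"] > 0
+(co2.loc[valid, "Avg"] == co2.loc[valid, "Int"]).all()  # expected to be True, per the file description
+```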
+
+So, the third version of our data will use the `Int` feature instead of `Avg`.
+
+```{python}
+#| code-fold: false
+# 3. Use interpolated column which estimates missing Avg values
+co2_impute = co2.copy()
+co2_impute['Avg'] = co2['Int']
+co2_impute.head()
+```
+
+What's a **reasonable** estimate?
+
+To answer this question, let's zoom in on a short time period, say the measurements in 1958 (where we know we have two missing values).
+
+```{python}
+#| code-fold: true
+# results of plotting data in 1958
+
+def line_and_points(data, ax, title):
+ # assumes single year, hence Mo
+ ax.plot('Mo', 'Avg', data=data)
+ ax.scatter('Mo', 'Avg', data=data)
+ ax.set_xlim(2, 13)
+ ax.set_title(title)
+ ax.set_xticks(np.arange(3, 13))
+
+def data_year(data, year):
+    return data[data["Yr"] == year]
+
+# uses matplotlib subplots
+# you may see more next week; focus on output for now
+fig, axes = plt.subplots(ncols = 3, figsize=(12, 4), sharey=True)
+
+year = 1958
+line_and_points(data_year(co2_drop, year), axes[0], title="1. Drop Missing")
+line_and_points(data_year(co2_NA, year), axes[1], title="2. Missing Set to NaN")
+line_and_points(data_year(co2_impute, year), axes[2], title="3. Missing Interpolated")
+
+fig.suptitle(f"Monthly Averages for {year}")
+plt.tight_layout()
+```
+
+In the big picture, since there are only 7 `Avg` values missing (**<1%** of the 738 months), any of these approaches would work.
+
+However, there is some appeal to **option 3, imputing**:
+
+* It shows the seasonal trends for CO2.
+* We are plotting all months in our data as a line plot, and imputed values keep that line unbroken.
+
+<br/>
+
+
+Let's replot our original figure with option 3:
+
+```{python}
+#| code-fold: true
+sns.lineplot(x='DecDate', y='Avg', data=co2_impute)
+plt.title("CO2 Average By Month, Imputed");
+```
+
+Looks pretty close to what we see on the NOAA [website](https://gml.noaa.gov/ccgg/trends/)!
+
+## Presenting the data: A Discussion on Data Granularity
+
+From the description:
+
+* Monthly measurements are averages of daily average measurements.
+* The NOAA GML website has datasets for daily/hourly measurements too.
+
+The granularity of the data you present depends on your research question.
+
+**How do CO2 levels vary by season?**
+
+* You might want to keep average monthly data.
+
+**Are CO2 levels rising over the past 50+ years, consistent with global warming predictions?**
+
+* You might be happier with a **coarser granularity** of average year data!
+
+```{python}
+#| code-fold: true
+co2_year = co2_impute.groupby('Yr').mean()
+sns.lineplot(x='Yr', y='Avg', data=co2_year)
+plt.title("CO2 Average By Year");
+```
+
+Indeed, we see a rise by nearly 100 ppm of CO2 since Mauna Loa began recording in 1958.
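+We can put an approximate number on that rise using the yearly table we just computed (a quick sketch; the exact figure depends on the vintage of the data file):
+
+```python
+rise = co2_year["Avg"].iloc[-1] - co2_year["Avg"].iloc[0]
+rise  # difference between the last and first yearly averages, on the order of 100 ppm
+```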
+
+# Summary
+We went over a lot of content this lecture; let's summarize the most important points:
+
+## Dealing with Missing Values
+There are a few options we can take to deal with missing data (a generic sketch in `pandas` follows this list):
+
+* Drop missing records
+* Keep `NaN` missing values
+* Impute using an interpolated column
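+These options map onto standard `pandas` methods. Here is a minimal, generic sketch on a toy `DataFrame` (an illustration only, not the demo code above; it assumes missing entries are already coded as `NaN`):
+
+```python
+import numpy as np
+import pandas as pd
+
+toy = pd.DataFrame({"month": [1, 2, 3, 4], "avg": [315.7, np.nan, 317.5, 317.1]})
+
+dropped = toy.dropna()                              # 1. drop records with missing values
+kept = toy.copy()                                   # 2. keep the NaN values and handle them downstream
+imputed = toy.assign(avg=toy["avg"].interpolate())  # 3. impute, here by linear interpolation
+imputed
+```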
+
+## EDA and Data Wrangling
+There are several ways to approach EDA and Data Wrangling:
+
+* Examine the **data and metadata**: what is the date, size, organization, and structure of the data?
+* Examine each **field/attribute/dimension** individually.
+* Examine pairs of related dimensions (e.g. breaking down grades by major).
+* Along the way, we can:
+ * **Visualize** or summarize the data.
+ * **Validate assumptions** about data and its collection process. Pay particular attention to when the data was collected.
+ * Identify and **address anomalies**.
+ * Apply data transformations and corrections (we'll cover this in the upcoming lecture).
+ * **Record everything you do!** Developing in Jupyter Notebook promotes *reproducibility* of your own work!
diff --git a/docs/eda/eda_files/figure-html/cell-62-output-1.png b/docs/eda/eda_files/figure-html/cell-62-output-1.png
index a04218cf..f392d5f9 100644
Binary files a/docs/eda/eda_files/figure-html/cell-62-output-1.png and b/docs/eda/eda_files/figure-html/cell-62-output-1.png differ
diff --git a/docs/eda/eda_files/figure-html/cell-67-output-1.png b/docs/eda/eda_files/figure-html/cell-67-output-1.png
new file mode 100644
index 00000000..be96b8c9
Binary files /dev/null and b/docs/eda/eda_files/figure-html/cell-67-output-1.png differ
diff --git a/docs/eda/eda_files/figure-html/cell-67-output-2.png b/docs/eda/eda_files/figure-html/cell-67-output-2.png
deleted file mode 100644
index 31857f62..00000000
Binary files a/docs/eda/eda_files/figure-html/cell-67-output-2.png and /dev/null differ
diff --git a/docs/eda/eda_files/figure-html/cell-68-output-1.png b/docs/eda/eda_files/figure-html/cell-68-output-1.png
index 67c3959d..ffd29ff8 100644
Binary files a/docs/eda/eda_files/figure-html/cell-68-output-1.png and b/docs/eda/eda_files/figure-html/cell-68-output-1.png differ
diff --git a/docs/eda/eda_files/figure-html/cell-69-output-1.png b/docs/eda/eda_files/figure-html/cell-69-output-1.png
new file mode 100644
index 00000000..29088928
Binary files /dev/null and b/docs/eda/eda_files/figure-html/cell-69-output-1.png differ
diff --git a/docs/eda/eda_files/figure-html/cell-69-output-2.png b/docs/eda/eda_files/figure-html/cell-69-output-2.png
deleted file mode 100644
index fb28f5d5..00000000
Binary files a/docs/eda/eda_files/figure-html/cell-69-output-2.png and /dev/null differ
diff --git a/docs/eda/eda_files/figure-html/cell-71-output-1.png b/docs/eda/eda_files/figure-html/cell-71-output-1.png
index 39cac822..49ef3d6a 100644
Binary files a/docs/eda/eda_files/figure-html/cell-71-output-1.png and b/docs/eda/eda_files/figure-html/cell-71-output-1.png differ
diff --git a/docs/eda/eda_files/figure-html/cell-75-output-1.png b/docs/eda/eda_files/figure-html/cell-75-output-1.png
index 6382e58a..15a5fe82 100644
Binary files a/docs/eda/eda_files/figure-html/cell-75-output-1.png and b/docs/eda/eda_files/figure-html/cell-75-output-1.png differ
diff --git a/docs/eda/eda_files/figure-html/cell-76-output-1.png b/docs/eda/eda_files/figure-html/cell-76-output-1.png
index db2b0dee..40b1fc71 100644
Binary files a/docs/eda/eda_files/figure-html/cell-76-output-1.png and b/docs/eda/eda_files/figure-html/cell-76-output-1.png differ
diff --git a/docs/eda/eda_files/figure-html/cell-77-output-1.png b/docs/eda/eda_files/figure-html/cell-77-output-1.png
index 897b8b39..99b6c2d1 100644
Binary files a/docs/eda/eda_files/figure-html/cell-77-output-1.png and b/docs/eda/eda_files/figure-html/cell-77-output-1.png differ
diff --git a/docs/feature_engineering/feature_engineering.html b/docs/feature_engineering/feature_engineering.html
index ea770e7f..22d26788 100644
--- a/docs/feature_engineering/feature_engineering.html
+++ b/docs/feature_engineering/feature_engineering.html
@@ -556,7 +556,7 @@
my_model.fit(X, Y)
-LinearRegression()
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.LinearRegression()
+LinearRegression()
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.LinearRegression()
Code
-# results of plotting data in 1958
-
-def line_and_points(data, ax, title):
- # assumes single year, hence Mo
- ax.plot('Mo', 'Avg', data=data)
- ax.scatter('Mo', 'Avg', data=data)
- ax.set_xlim(2, 13)
- ax.set_title(title)
- ax.set_xticks(np.arange(3, 13))
-
-def data_year(data, year):
- return data[data["Yr"] == 1958]
-
-# uses matplotlib subplots
-# you may see more next week; focus on output for now
-fig, axes = plt.subplots(ncols = 3, figsize=(12, 4), sharey=True)
-
-year = 1958
-line_and_points(data_year(co2_drop, year), axes[0], title="1. Drop Missing")
-line_and_points(data_year(co2_NA, year), axes[1], title="2. Missing Set to NaN")
-line_and_points(data_year(co2_impute, year), axes[2], title="3. Missing Interpolated")
-
-fig.suptitle(f"Monthly Averages for {year}")
-plt.tight_layout()
+# results of plotting data in 1958
+
+def line_and_points(data, ax, title):
+ # assumes single year, hence Mo
+ ax.plot('Mo', 'Avg', data=data)
+ ax.scatter('Mo', 'Avg', data=data)
+ ax.set_xlim(2, 13)
+ ax.set_title(title)
+ ax.set_xticks(np.arange(3, 13))
+
+def data_year(data, year):
+ return data[data["Yr"] == 1958]
+
+# uses matplotlib subplots
+# you may see more next week; focus on output for now
+fig, axes = plt.subplots(ncols = 3, figsize=(12, 4), sharey=True)
+
+year = 1958
+line_and_points(data_year(co2_drop, year), axes[0], title="1. Drop Missing")
+line_and_points(data_year(co2_NA, year), axes[1], title="2. Missing Set to NaN")
+line_and_points(data_year(co2_impute, year), axes[2], title="3. Missing Interpolated")
+
+fig.suptitle(f"Monthly Averages for {year}")
+plt.tight_layout()
Code
-
+
@@ -4632,9 +4620,9 @@
Code
-
+
@@ -4975,1218 +4963,1218 @@ <
Source Code
----
-title: Data Cleaning and EDA
-execute:
- echo: true
-format:
- html:
- code-fold: true
- code-tools: true
- toc: true
- toc-title: Data Cleaning and EDA
- page-layout: full
- theme:
- - cosmo
- - cerulean
- callout-icon: false
-jupyter: python3
----
-
-```{python}
-#| code-fold: true
-import numpy as np
-import pandas as pd
-
-import matplotlib.pyplot as plt
-import seaborn as sns
-#%matplotlib inline
-plt.rcParams['figure.figsize'] = (12, 9)
-
-sns.set()
-sns.set_context('talk')
-np.set_printoptions(threshold=20, precision=2, suppress=True)
-pd.set_option('display.max_rows', 30)
-pd.set_option('display.max_columns', None)
-pd.set_option('display.precision', 2)
-# This option stops scientific notation for pandas
-pd.set_option('display.float_format', '{:.2f}'.format)
-
-# Silence some spurious seaborn warnings
-import warnings
-warnings.filterwarnings("ignore", category=FutureWarning)
-```
-
-::: {.callout-note collapse="false"}
-## Learning Outcomes
-* Recognize common file formats
-* Categorize data by its variable type
-* Build awareness of issues with data faithfulness and develop targeted solutions
-:::
-
-**This content is covered in lectures 4, 5, and 6.**
-
-In the past few lectures, we've learned that `pandas` is a toolkit to restructure, modify, and explore a dataset. What we haven't yet touched on is *how* to make these data transformation decisions. When we receive a new set of data from the "real world," how do we know what processing we should do to convert this data into a usable form?
-
-**Data cleaning**, also called **data wrangling**, is the process of transforming raw data to facilitate subsequent analysis. It is often used to address issues like:
-
-* Unclear structure or formatting
-* Missing or corrupted values
-* Unit conversions
-* ...and so on
-
-**Exploratory Data Analysis (EDA)** is the process of understanding a new dataset. It is an open-ended, informal analysis that involves familiarizing ourselves with the variables present in the data, discovering potential hypotheses, and identifying possible issues with the data. This last point can often motivate further data cleaning to address any problems with the dataset's format; because of this, EDA and data cleaning are often thought of as an "infinite loop," with each process driving the other.
-
-In this lecture, we will consider the key properties of data to consider when performing data cleaning and EDA. In doing so, we'll develop a "checklist" of sorts for you to consider when approaching a new dataset. Throughout this process, we'll build a deeper understanding of this early (but very important!) stage of the data science lifecycle.
-
-## Structure
-
-### File Formats
-There are many file types for storing structured data: TSV, JSON, XML, ASCII, SAS, etc. We'll only cover CSV, TSV, and JSON in lecture, but you'll likely encounter other formats as you work with different datasets. Reading documentation is your best bet for understanding how to process the multitude of different file types.
-
-#### CSV
-CSVs, which stand for **Comma-Separated Values**, are a common tabular data format.
-In the past two `pandas` lectures, we briefly touched on the idea of file format: the way data is encoded in a file for storage. Specifically, our `elections` and `babynames` datasets were stored and loaded as CSVs:
-
-```{python}
-#| code-fold: false
-pd.read_csv("data/elections.csv").head(5)
-```
-
-To better understand the properties of a CSV, let's take a look at the first few rows of the raw data file to see what it looks like before being loaded into a `DataFrame`. We'll use the `repr()` function to return the raw string with its special characters:
-
-```{python}
-#| code-fold: false
-with open("data/elections.csv", "r") as table:
- i = 0
- for row in table:
- print(repr(row))
- i += 1
- if i > 3:
- break
-```
-
-Each row, or **record**, in the data is delimited by a newline `\n`. Each column, or **field**, in the data is delimited by a comma `,` (hence, comma-separated!).
-
-#### TSV
-
-Another common file type is **TSV (Tab-Separated Values)**. In a TSV, records are still delimited by a newline `\n`, while fields are delimited by `\t` tab character.
-
-Let's check out the first few rows of the raw TSV file. Again, we'll use the `repr()` function so that `print` shows the special characters.
-
-```{python}
-#| code-fold: false
-with open("data/elections.txt", "r") as table:
- i = 0
- for row in table:
- print(repr(row))
- i += 1
- if i > 3:
- break
-```
-
-TSVs can be loaded into `pandas` using `pd.read_csv`. We'll need to specify the **delimiter** with parameter` sep='\t'` [(documentation)](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html).
-
-```{python}
-#| code-fold: false
-pd.read_csv("data/elections.txt", sep='\t').head(3)
-```
-
-An issue with CSVs and TSVs comes up whenever there are commas or tabs within the records. How does `pandas` differentiate between a comma delimiter vs. a comma within the field itself, for example `8,900`? To remedy this, check out the [`quotechar` parameter](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html).
-
-#### JSON
-**JSON (JavaScript Object Notation)** files behave similarly to Python dictionaries. A raw JSON is shown below.
-
-```{python}
-#| code-fold: false
-with open("data/elections.json", "r") as table:
- i = 0
- for row in table:
- print(row)
- i += 1
- if i > 8:
- break
-```
-
-JSON files can be loaded into `pandas` using `pd.read_json`.
-
-```{python}
-#| code-fold: false
-pd.read_json('data/elections.json').head(3)
-```
-
-##### EDA with JSON: Berkeley COVID-19 Data
-The City of Berkeley Open Data [website](https://data.cityofberkeley.info/Health/COVID-19-Confirmed-Cases/xn6j-b766) has a dataset with COVID-19 Confirmed Cases among Berkeley residents by date. Let's download the file and save it as a JSON (note the source URL file type is also a JSON). In the interest of reproducible data science, we will download the data programatically. We have defined some helper functions in the [`ds100_utils.py`](https://ds100.org/fa23/resources/assets/lectures/lec05/lec05-eda.html) file that we can reuse these helper functions in many different notebooks.
-
-```{python}
-#| code-fold: false
-from ds100_utils import fetch_and_cache
-
-covid_file = fetch_and_cache(
- "https://data.cityofberkeley.info/api/views/xn6j-b766/rows.json?accessType=DOWNLOAD",
- "confirmed-cases.json",
- force=False)
-covid_file # a file path wrapper object
-```
-
-###### File Size
-Let's start our analysis by getting a rough estimate of the size of the dataset to inform the tools we use to view the data. For relatively small datasets, we can use a text editor or spreadsheet. For larger datasets, more programmatic exploration or distributed computing tools may be more fitting. Here we will use `Python` tools to probe the file.
-
-Since there seem to be text files, let's investigate the number of lines, which often corresponds to the number of records
-
-```{python}
-#| code-fold: false
-import os
-
-print(covid_file, "is", os.path.getsize(covid_file) / 1e6, "MB")
-
-with open(covid_file, "r") as f:
- print(covid_file, "is", sum(1 for l in f), "lines.")
-```
-
-###### Unix Commands
-As part of the EDA workflow, Unix commands can come in very handy. In fact, there's an entire book called ["Data Science at the Command Line"](https://datascienceatthecommandline.com/) that explores this idea in depth!
-In Jupyter/IPython, you can prefix lines with `!` to execute arbitrary Unix commands, and within those lines, you can refer to `Python` variables and expressions with the syntax `{expr}`.
-
-Here, we use the `ls` command to list files, using the `-lh` flags, which request "long format with information in human-readable form." We also use the `wc` command for "word count," but with the `-l` flag, which asks for line counts instead of words.
-
-These two give us the same information as the code above, albeit in a slightly different form:
-
-```{python}
-#| code-fold: false
-!ls -lh {covid_file}
-!wc -l {covid_file}
-```
-
-###### File Contents
-Let's explore the data format using `Python`.
-
-```{python}
-#| code-fold: false
-with open(covid_file, "r") as f:
- for i, row in enumerate(f):
- print(repr(row)) # print raw strings
- if i >= 4: break
-```
-
-We can use the `head` Unix command (which is where `pandas`' `head` method comes from!) to see the first few lines of the file:
-
-```{python}
-#| code-fold: false
-!head -5 {covid_file}
-```
-
-In order to load the JSON file into `pandas`, Let's first do some EDA with `Python`'s `json` package to understand the particular structure of this JSON file so that we can decide what (if anything) to load into `pandas`. `Python` has relatively good support for JSON data since it closely matches the internal python object model. In the following cell we import the entire JSON datafile into a python dictionary using the `json` package.
-
-```{python}
-#| code-fold: false
-import json
-
-with open(covid_file, "rb") as f:
- covid_json = json.load(f)
-```
-
-The `covid_json` variable is now a dictionary encoding the data in the file:
-
-```{python}
-#| code-fold: false
-type(covid_json)
-```
-
-We can examine what keys are in the top level json object by listing out the keys.
-
-```{python}
-#| code-fold: false
-covid_json.keys()
-```
-
-**Observation**: The JSON dictionary contains a `meta` key which likely refers to meta data (data about the data). Meta data often maintained with the data and can be a good source of additional information.
-
-
-We can investigate the meta data further by examining the keys associated with the metadata.
-
-```{python}
-#| code-fold: false
-covid_json['meta'].keys()
-```
-
-The `meta` key contains another dictionary called `view`. This likely refers to meta-data about a particular "view" of some underlying database. We will learn more about views when we study SQL later in the class.
-
-```{python}
-#| code-fold: false
-covid_json['meta']['view'].keys()
-```
-
-Notice that this a nested/recursive data structure. As we dig deeper we reveal more and more keys and the corresponding data:
-
-```
-meta
-|-> data
- | ... (haven't explored yet)
-|-> view
- | -> id
- | -> name
- | -> attribution
- ...
- | -> description
- ...
- | -> columns
- ...
-```
-
-
-There is a key called description in the view sub dictionary. This likely contains a description of the data:
-
-```{python}
-#| code-fold: false
-print(covid_json['meta']['view']['description'])
-```
-
-###### Examining the Data Field for Records
-
-We can look at a few entries in the `data` field. This is what we'll load into `pandas`.
-
-```{python}
-#| code-fold: false
-for i in range(3):
- print(f"{i:03} | {covid_json['data'][i]}")
-```
-
-Observations:
-* These look like equal-length records, so maybe `data` is a table!
-* But what do each of values in the record mean? Where can we find column headers?
-
-For that, we'll need the `columns` key in the metadata dictionary. This returns a list:
-
-```{python}
-#| code-fold: false
-type(covid_json['meta']['view']['columns'])
-```
-
-###### Summary of exploring the JSON file
-
-1. The above **metadata** tells us a lot about the columns in the data including column names, potential data anomalies, and a basic statistic.
-1. Because of its non-tabular structure, JSON makes it easier (than CSV) to create **self-documenting data**, meaning that information about the data is stored in the same file as the data.
-1. Self-documenting data can be helpful since it maintains its own description and these descriptions are more likely to be updated as data changes.
-
-###### Loading COVID Data into `pandas`
-Finally, let's load the data (not the metadata) into a `pandas` `DataFrame`. In the following block of code we:
-
-1. Translate the JSON records into a `DataFrame`:
-
- * fields: `covid_json['meta']['view']['columns']`
- * records: `covid_json['data']`
-
-
-1. Remove columns that have no metadata description. This would be a bad idea in general, but here we remove these columns since the above analysis suggests they are unlikely to contain useful information.
-
-1. Examine the `tail` of the table.
-
-```{python}
-#| code-fold: false
-# Load the data from JSON and assign column titles
-covid = pd.DataFrame(
- covid_json['data'],
- columns=[c['name'] for c in covid_json['meta']['view']['columns']])
-
-covid.tail()
-```
-
-### Variable Types
-
-After loading data into a file, it's a good idea to take the time to understand what pieces of information are encoded in the dataset. In particular, we want to identify what variable types are present in our data. Broadly speaking, we can categorize variables into one of two overarching types.
-
-**Quantitative variables** describe some numeric quantity or amount. We can divide quantitative data further into:
-
-* **Continuous quantitative variables**: numeric data that can be measured on a continuous scale to arbitrary precision. Continuous variables do not have a strict set of possible values – they can be recorded to any number of decimal places. For example, weights, GPA, or CO<sub>2</sub> concentrations.
-* **Discrete quantitative variables**: numeric data that can only take on a finite set of possible values. For example, someone's age or the number of siblings they have.
-
-**Qualitative variables**, also known as **categorical variables**, describe data that isn't measuring some quantity or amount. The sub-categories of categorical data are:
-
-* **Ordinal qualitative variables**: categories with ordered levels. Specifically, ordinal variables are those where the difference between levels has no consistent, quantifiable meaning. Some examples include levels of education (high school, undergrad, grad, etc.), income bracket (low, medium, high), or Yelp rating.
-* **Nominal qualitative variables**: categories with no specific order. For example, someone's political affiliation or Cal ID number.
-
-![Classification of variable types](images/variable.png)
-
-Note that many variables don't sit neatly in just one of these categories. Qualitative variables could have numeric levels, and conversely, quantitative variables could be stored as strings.
-
-### Primary and Foreign Keys
-
-Last time, we introduced `.merge` as the `pandas` method for joining multiple `DataFrame`s together. In our discussion of joins, we touched on the idea of using a "key" to determine what rows should be merged from each table. Let's take a moment to examine this idea more closely.
-
-The **primary key** is the column or set of columns in a table that *uniquely* determine the values of the remaining columns. It can be thought of as the unique identifier for each individual row in the table. For example, a table of Data 100 students might use each student's Cal ID as the primary key.
-
-```{python}
-#| echo: false
-pd.DataFrame({"Cal ID":[3034619471, 3035619472, 3025619473, 3046789372], \
- "Name":["Oski", "Ollie", "Orrie", "Ollie"], \
- "Major":["Data Science", "Computer Science", "Data Science", "Economics"]})
-```
-
-The **foreign key** is the column or set of columns in a table that reference primary keys in other tables. Knowing a dataset's foreign keys can be useful when assigning the `left_on` and `right_on` parameters of `.merge`. In the table of office hour tickets below, `"Cal ID"` is a foreign key referencing the previous table.
-
-```{python}
-#| echo: false
-pd.DataFrame({"OH Request":[1, 2, 3, 4], \
- "Cal ID":[3034619471, 3035619472, 3025619473, 3035619472], \
- "Question":["HW 2 Q1", "HW 2 Q3", "Lab 3 Q4", "HW 2 Q7"]})
-```
-
-## Granularity, Scope, and Temporality
-
-After understanding the structure of the dataset, the next task is to determine what exactly the data represents. We'll do so by considering the data's granularity, scope, and temporality.
-
-### Granularity
-The **granularity** of a dataset is what a single row represents. You can also think of it as the level of detail included in the data. To determine the data's granularity, ask: what does each row in the dataset represent? Fine-grained data contains a high level of detail, with a single row representing a small individual unit. For example, each record may represent one person. Coarse-grained data is encoded such that a single row represents a large individual unit – for example, each record may represent a group of people.
-
-### Scope
-The **scope** of a dataset is the subset of the population covered by the data. If we were investigating student performance in Data Science courses, a dataset with a narrow scope might encompass all students enrolled in Data 100 whereas a dataset with an expansive scope might encompass all students in California.
-
-### Temporality
-The **temporality** of a dataset describes the periodicity over which the data was collected as well as when the data was most recently collected or updated.
-
-Time and date fields of a dataset could represent a few things:
-
-1. when the "event" happened
-2. when the data was collected, or when it was entered into the system
-3. when the data was copied into the database
-
-To fully understand the temporality of the data, it also may be necessary to standardize time zones or inspect recurring time-based trends in the data (do patterns recur in 24-hour periods? Over the course of a month? Seasonally?). The convention for standardizing time is the Coordinated Universal Time (UTC), an international time standard measured at 0 degrees latitude that stays consistent throughout the year (no daylight savings). We can represent Berkeley's time zone, Pacific Standard Time (PST), as UTC-7 (with daylight savings).
-
-#### Temporality with `pandas`' `dt` accessors
-Let's briefly look at how we can use `pandas`' `dt` accessors to work with dates/times in a dataset using the dataset you'll see in Lab 3: the Berkeley PD Calls for Service dataset.
-
-```{python}
-#| code-fold: true
-calls = pd.read_csv("data/Berkeley_PD_-_Calls_for_Service.csv")
-calls.head()
-```
-
-Looks like there are three columns with dates/times: `EVENTDT`, `EVENTTM`, and `InDbDate`.
-
-Most likely, `EVENTDT` stands for the date when the event took place, `EVENTTM` stands for the time of day the event took place (in 24-hr format), and `InDbDate` is the date this call is recorded onto the database.
-
-If we check the data type of these columns, we will see they are stored as strings. We can convert them to `datetime` objects using pandas `to_datetime` function.
-
-```{python}
-#| code-fold: false
-calls["EVENTDT"] = pd.to_datetime(calls["EVENTDT"])
-calls.head()
-```
-
-Now, we can use the `dt` accessor on this column.
-
-We can get the month:
-
-```{python}
-#| code-fold: false
-calls["EVENTDT"].dt.month.head()
-```
-
-Which day of the week the date is on:
-
-```{python}
-#| code-fold: false
-calls["EVENTDT"].dt.dayofweek.head()
-```
-
-Check the mimimum values to see if there are any suspicious-looking, 70s dates:
-
-```{python}
-#| code-fold: false
-calls.sort_values("EVENTDT").head()
-```
-
-Doesn't look like it! We are good!
-
-
-We can also do many things with the `dt` accessor like switching time zones and converting time back to UNIX/POSIX time. Check out the documentation on [`.dt` accessor](https://pandas.pydata.org/docs/user_guide/basics.html#basics-dt-accessors) and [time series/date functionality](https://pandas.pydata.org/docs/user_guide/timeseries.html#).
-
-## Faithfulness
-
-At this stage in our data cleaning and EDA workflow, we've achieved quite a lot: we've identified how our data is structured, come to terms with what information it encodes, and gained insight as to how it was generated. Throughout this process, we should always recall the original intent of our work in Data Science – to use data to better understand and model the real world. To achieve this goal, we need to ensure that the data we use is faithful to reality; that is, that our data accurately captures the "real world."
-
-Data used in research or industry is often "messy" – there may be errors or inaccuracies that impact the faithfulness of the dataset. Signs that data may not be faithful include:
-
-* Unrealistic or "incorrect" values, such as negative counts, locations that don't exist, or dates set in the future
-* Violations of obvious dependencies, like an age that does not match a birthday
-* Clear signs that data was entered by hand, which can lead to spelling errors or fields that are incorrectly shifted
-* Signs of data falsification, such as fake email addresses or repeated use of the same names
-* Duplicated records or fields containing the same information
-* Truncated data, e.g. Microsoft Excel would limit the number of rows to 655536 and the number of columns to 255
-
-We often solve some of these more common issues in the following ways:
-
-* Spelling errors: apply corrections or drop records that aren't in a dictionary
-* Time zone inconsistencies: convert to a common time zone (e.g. UTC)
-* Duplicated records or fields: identify and eliminate duplicates (using primary keys)
-* Unspecified or inconsistent units: infer the units and check that values are in reasonable ranges in the data
-
-### Missing Values
-Another common issue encountered with real-world datasets is that of missing data. One strategy to resolve this is to simply drop any records with missing values from the dataset. This does, however, introduce the risk of inducing biases – it is possible that the missing or corrupt records may be systemically related to some feature of interest in the data. Another solution is to keep the data as `NaN` values.
-
-A third method to address missing data is to perform **imputation**: infer the missing values using other data available in the dataset. There is a wide variety of imputation techniques that can be implemented; some of the most common are listed below.
-
-* Average imputation: replace missing values with the average value for that field
-* Hot deck imputation: replace missing values with some random value
-* Regression imputation: develop a model to predict missing values
-* Multiple imputation: replace missing values with multiple random values
-
-Regardless of the strategy used to deal with missing data, we should think carefully about *why* particular records or fields may be missing – this can help inform whether or not the absence of these values is significant or meaningful.
-
-# EDA Demo 1: Tuberculosis in the United States
-
-Now, let's walk through the data-cleaning and EDA workflow to see what can we learn about the presence of Tuberculosis in the United States!
-
-We will examine the data included in the [original CDC article](https://www.cdc.gov/mmwr/volumes/71/wr/mm7112a1.htm?s_cid=mm7112a1_w#T1_down) published in 2021.
-
-
-## CSVs and Field Names
-Suppose Table 1 was saved as a CSV file located in `data/cdc_tuberculosis.csv`.
-
-We can then explore the CSV (which is a text file, and does not contain binary-encoded data) in many ways:
-1. Using a text editor like emacs, vim, VSCode, etc.
-2. Opening the CSV directly in DataHub (read-only), Excel, Google Sheets, etc.
-3. The `Python` file object
-4. `pandas`, using `pd.read_csv()`
-
-To try out options 1 and 2, you can view or download the Tuberculosis from the [lecture demo notebook](https://data100.datahub.berkeley.edu/hub/user-redirect/git-pull?repo=https%3A%2F%2Fgithub.com%2FDS-100%2Ffa23-student&urlpath=lab%2Ftree%2Ffa23-student%2Flecture%2Flec05%2Flec04-eda.ipynb&branch=main) under the `data` folder in the left hand menu. Notice how the CSV file is a type of **rectangular data (i.e., tabular data) stored as comma-separated values**.
-
-Next, let's try out option 3 using the `Python` file object. We'll look at the first four lines:
-
-```{python}
-#| code-fold: true
-with open("data/cdc_tuberculosis.csv", "r") as f:
- i = 0
- for row in f:
- print(row)
- i += 1
- if i > 3:
- break
-```
-
-Whoa, why are there blank lines interspaced between the lines of the CSV?
-
-You may recall that all line breaks in text files are encoded as the special newline character `\n`. Python's `print()` prints each string (including the newline), and an additional newline on top of that.
-
-If you're curious, we can use the `repr()` function to return the raw string with all special characters:
-
-```{python}
-#| code-fold: true
-with open("data/cdc_tuberculosis.csv", "r") as f:
- i = 0
- for row in f:
- print(repr(row)) # print raw strings
- i += 1
- if i > 3:
- break
-```
-
-Finally, let's try option 4 and use the tried-and-true Data 100 approach: `pandas`.
-
-```{python}
-#| code-fold: false
-tb_df = pd.read_csv("data/cdc_tuberculosis.csv")
-tb_df.head()
-```
-
-You may notice some strange things about this table: what's up with the "Unnamed" column names and the first row?
-
-Congratulations — you're ready to wrangle your data! Because of how things are stored, we'll need to clean the data a bit to name our columns better.
-
-A reasonable first step is to identify the row with the right header. The `pd.read_csv()` function ([documentation](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html)) has the convenient `header` parameter that we can set to use the elements in row 1 as the appropriate columns:
-
-```{python}
-#| code-fold: false
-tb_df = pd.read_csv("data/cdc_tuberculosis.csv", header=1) # row index
-tb_df.head(5)
-```
-
-Wait...but now we can't differentiate betwen the "Number of TB cases" and "TB incidence" year columns. `pandas` has tried to make our lives easier by automatically adding ".1" to the latter columns, but this doesn't help us, as humans, understand the data.
-
-We can do this manually with `df.rename()` ([documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.rename.html?highlight=rename#pandas.DataFrame.rename)):
-
-```{python}
-#| code-fold: false
-rename_dict = {'2019': 'TB cases 2019',
- '2020': 'TB cases 2020',
- '2021': 'TB cases 2021',
- '2019.1': 'TB incidence 2019',
- '2020.1': 'TB incidence 2020',
- '2021.1': 'TB incidence 2021'}
-tb_df = tb_df.rename(columns=rename_dict)
-tb_df.head(5)
-```
-
-## Record Granularity
-
-You might already be wondering: what's up with that first record?
-
-Row 0 is what we call a **rollup record**, or summary record. It's often useful when displaying tables to humans. The **granularity** of record 0 (Totals) vs the rest of the records (States) is different.
-
-Okay, EDA step two. How was the rollup record aggregated?
-
-Let's check if Total TB cases is the sum of all state TB cases. If we sum over all rows, we should get **2x** the total cases in each of our TB cases by year (why do you think this is?).
-
-```{python}
-#| code-fold: true
-tb_df.sum(axis=0)
-```
-
-Whoa, what's going on with the TB cases in 2019, 2020, and 2021? Check out the column types:
-
-```{python}
-#| code-fold: true
-tb_df.dtypes
-```
-
-Since there are commas in the values for TB cases, the numbers are read as the `object` datatype, or **storage type** (close to the `Python` string datatype), so `pandas` is concatenating strings instead of adding integers (recall that `Python` can "sum", or concatenate, strings together: `"data" + "100"` evaluates to `"data100"`).
-
-
-Fortunately `read_csv` also has a `thousands` parameter ([documentation](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html)):
-
-```{python}
-#| code-fold: false
-# improve readability: chaining method calls with outer parentheses/line breaks
-tb_df = (
- pd.read_csv("data/cdc_tuberculosis.csv", header=1, thousands=',')
- .rename(columns=rename_dict)
-)
-tb_df.head(5)
-```
-
-```{python}
-#| code-fold: false
-tb_df.sum()
-```
-
-The Total TB cases look right. Phew!
-
-Let's just look at the records with **state-level granularity**:
-
-```{python}
-#| code-fold: true
-state_tb_df = tb_df[1:]
-state_tb_df.head(5)
-```
-
-## Gather Census Data
-
-U.S. Census population estimates [source](https://www.census.gov/data/tables/time-series/demo/popest/2010s-state-total.html) (2019), [source](https://www.census.gov/data/tables/time-series/demo/popest/2020s-state-total.html) (2020-2021).
-
-Running the below cells cleans the data.
-There are a few new methods here:
-* `df.convert_dtypes()` ([documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.convert_dtypes.html)) conveniently converts all float dtypes into ints and is out of scope for the class.
-* `df.drop_na()` ([documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.dropna.html)) will be explained in more detail next time.
-
-```{python}
-#| code-fold: true
-# 2010s census data
-census_2010s_df = pd.read_csv("data/nst-est2019-01.csv", header=3, thousands=",")
-census_2010s_df = (
- census_2010s_df
- .reset_index()
- .drop(columns=["index", "Census", "Estimates Base"])
- .rename(columns={"Unnamed: 0": "Geographic Area"})
- .convert_dtypes() # "smart" converting of columns, use at your own risk
- .dropna() # we'll introduce this next time
-)
-census_2010s_df['Geographic Area'] = census_2010s_df['Geographic Area'].str.strip('.')
-
-# with pd.option_context('display.min_rows', 30): # shows more rows
-# display(census_2010s_df)
-
-census_2010s_df.head(5)
-```
-
-Occasionally, you will want to modify code that you have imported. To reimport those modifications you can either use `python`'s `importlib` library:
-
-```python
-from importlib import reload
-reload(utils)
-```
-
-or use `iPython` magic which will intelligently import code when files change:
-
-```python
-%load_ext autoreload
-%autoreload 2
-```
-
-```{python}
-#| code-fold: true
-# census 2020s data
-census_2020s_df = pd.read_csv("data/NST-EST2022-POP.csv", header=3, thousands=",")
-census_2020s_df = (
- census_2020s_df
- .reset_index()
- .drop(columns=["index", "Unnamed: 1"])
- .rename(columns={"Unnamed: 0": "Geographic Area"})
- .convert_dtypes() # "smart" converting of columns, use at your own risk
- .dropna() # we'll introduce this next time
-)
-census_2020s_df['Geographic Area'] = census_2020s_df['Geographic Area'].str.strip('.')
-
-census_2020s_df.head(5)
-```
-
-## Joining Data (Merging `DataFrame`s)
-
-Time to `merge`! Here we use the `DataFrame` method `df1.merge(right=df2, ...)` on `DataFrame` `df1` ([documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.merge.html)). Contrast this with the function `pd.merge(left=df1, right=df2, ...)` ([documentation](https://pandas.pydata.org/docs/reference/api/pandas.merge.html?highlight=pandas%20merge#pandas.merge)). Feel free to use either.
-
-```{python}
-#| code-fold: false
-# merge TB DataFrame with two US census DataFrames
-tb_census_df = (
- tb_df
- .merge(right=census_2010s_df,
- left_on="U.S. jurisdiction", right_on="Geographic Area")
- .merge(right=census_2020s_df,
- left_on="U.S. jurisdiction", right_on="Geographic Area")
-)
-tb_census_df.head(5)
-```
-
-Having all of these columns is a little unwieldy. We could either drop the unneeded columns now, or just merge on smaller census `DataFrame`s. Let's do the latter.
-
-```{python}
-#| code-fold: false
-# try merging again, but cleaner this time
-tb_census_df = (
- tb_df
- .merge(right=census_2010s_df[["Geographic Area", "2019"]],
- left_on="U.S. jurisdiction", right_on="Geographic Area")
- .drop(columns="Geographic Area")
- .merge(right=census_2020s_df[["Geographic Area", "2020", "2021"]],
- left_on="U.S. jurisdiction", right_on="Geographic Area")
- .drop(columns="Geographic Area")
-)
-tb_census_df.head(5)
-```
-
-## Reproducing Data: Compute Incidence
-
-Let's recompute incidence to make sure we know where the original CDC numbers came from.
-
-From the [CDC report](https://www.cdc.gov/mmwr/volumes/71/wr/mm7112a1.htm?s_cid=mm7112a1_w#T1_down): TB incidence is computed as “Cases per 100,000 persons using mid-year population estimates from the U.S. Census Bureau.”
-
-If we define a group as 100,000 people, then we can compute the TB incidence for a given state population as
-
-$$\text{TB incidence} = \frac{\text{TB cases in population}}{\text{groups in population}} = \frac{\text{TB cases in population}}{\text{population}/100000} $$
-
-$$= \frac{\text{TB cases in population}}{\text{population}} \times 100000$$
-
-Let's try this for 2019:
-
-```{python}
-#| code-fold: false
-tb_census_df["recompute incidence 2019"] = tb_census_df["TB cases 2019"]/tb_census_df["2019"]*100000
-tb_census_df.head(5)
-```
-
-Awesome!!!
-
-Let's use a for-loop and `Python` format strings to compute TB incidence for all years. `Python` f-strings are just used for the purposes of this demo, but they're handy to know when you explore data beyond this course ([documentation](https://docs.python.org/3/tutorial/inputoutput.html)).
-
-```{python}
-#| code-fold: false
-# recompute incidence for all years
-for year in [2019, 2020, 2021]:
- tb_census_df[f"recompute incidence {year}"] = tb_census_df[f"TB cases {year}"]/tb_census_df[f"{year}"]*100000
-tb_census_df.head(5)
-```
-
-These numbers look pretty close!!! There are a few errors in the hundredths place, particularly in 2021. It may be useful to further explore reasons behind this discrepancy.
-
-```{python}
-#| code-fold: false
-tb_census_df.describe()
-```
-
-## Bonus EDA: Reproducing the Reported Statistic
-
-
-**How do we reproduce that reported statistic in the original [CDC report](https://www.cdc.gov/mmwr/volumes/71/wr/mm7112a1.htm?s_cid=mm7112a1_w)?**
-
-> Reported TB incidence (cases per 100,000 persons) increased **9.4%**, from **2.2** during 2020 to **2.4** during 2021 but was lower than incidence during 2019 (2.7). Increases occurred among both U.S.-born and non–U.S.-born persons.
-
-This is TB incidence computed across the entire U.S. population! How do we reproduce this?
-* We need to reproduce the "Total" TB incidences in our rolled record.
-* But our current `tb_census_df` only has 51 entries (50 states plus Washington, D.C.). There is no rolled record.
-* What happened...?
-
-Let's get exploring!
-
-Before we keep exploring, we'll set all indexes to more meaningful values, instead of just numbers that pertain to some row at some point. This will make our cleaning slightly easier.
-
-```{python}
-#| code-fold: true
-tb_df = tb_df.set_index("U.S. jurisdiction")
-tb_df.head(5)
-```
-
-```{python}
-#| code-fold: false
-census_2010s_df = census_2010s_df.set_index("Geographic Area")
-census_2010s_df.head(5)
-```
-
-```{python}
-#| code-fold: false
-census_2020s_df = census_2020s_df.set_index("Geographic Area")
-census_2020s_df.head(5)
-```
-
-It turns out that our merge above only kept state records, even though our original `tb_df` had the "Total" rolled record:
-
-```{python}
-#| code-fold: false
-tb_df.head()
-```
-
-Recall that `merge` by default does an **inner** merge by default, meaning that it only preserves keys that are present in **both** `DataFrame`s.
-
-The rolled records in our census `DataFrame` have different `Geographic Area` fields, which was the key we merged on:
-
-```{python}
-#| code-fold: false
-census_2010s_df.head(5)
-```
-
-The Census `DataFrame` has several rolled records. The aggregate record we are looking for actually has the Geographic Area named "United States".
-
-One straightforward way to get the right merge is to rename the value itself. Because we now have the Geographic Area index, we'll use `df.rename()` ([documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.rename.html)):
-
-```{python}
-#| code-fold: false
-# rename rolled record for 2010s
-census_2010s_df.rename(index={'United States':'Total'}, inplace=True)
-census_2010s_df.head(5)
-```
-
-```{python}
-#| code-fold: false
-# same, but for 2020s rename rolled record
-census_2020s_df.rename(index={'United States':'Total'}, inplace=True)
-census_2020s_df.head(5)
-```
-
-<br/>
-
-Next let's rerun our merge. Note the different chaining, because we are now merging on indexes (`df.merge()` [documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.merge.html)).
-
-```{python}
-#| code-fold: false
-tb_census_df = (
- tb_df
- .merge(right=census_2010s_df[["2019"]],
- left_index=True, right_index=True)
- .merge(right=census_2020s_df[["2020", "2021"]],
- left_index=True, right_index=True)
-)
-tb_census_df.head(5)
-```
-
-<br/>
-
-Finally, let's recompute our incidences:
-
-```{python}
-#| code-fold: false
-# recompute incidence for all years
-for year in [2019, 2020, 2021]:
- tb_census_df[f"recompute incidence {year}"] = tb_census_df[f"TB cases {year}"]/tb_census_df[f"{year}"]*100000
-tb_census_df.head(5)
-```
-
-We reproduced the total U.S. incidences correctly!
-
-We're almost there. Let's revisit the quote:
-
-> Reported TB incidence (cases per 100,000 persons) increased **9.4%**, from **2.2** during 2020 to **2.4** during 2021 but was lower than incidence during 2019 (2.7). Increases occurred among both U.S.-born and non–U.S.-born persons.
-
-Recall that percent change from $A$ to $B$ is computed as
-$\text{percent change} = \frac{B - A}{A} \times 100$.
-
-```{python}
-#| code-fold: false
-#| tags: []
-incidence_2020 = tb_census_df.loc['Total', 'recompute incidence 2020']
-incidence_2020
-```
-
-```{python}
-#| code-fold: false
-#| tags: []
-incidence_2021 = tb_census_df.loc['Total', 'recompute incidence 2021']
-incidence_2021
-```
-
-```{python}
-#| code-fold: false
-#| tags: []
-difference = (incidence_2021 - incidence_2020)/incidence_2020 * 100
-difference
-```
-
-# EDA Demo 2: Mauna Loa CO<sub>2</sub> Data -- A Lesson in Data Faithfulness
-
-[Mauna Loa Observatory](https://gml.noaa.gov/ccgg/trends/data.html) has been monitoring CO<sub>2</sub> concentrations since 1958
-
-```{python}
-#| code-fold: false
-co2_file = "data/co2_mm_mlo.txt"
-```
-
-Let's do some **EDA**!!
-
-## Reading this file into Pandas?
-Let's instead check out this `.txt` file. Some questions to keep in mind: Do we trust this file extension? What structure is it?
-
-Lines 71-78 (inclusive) are shown below:
-
- line number | file contents
-
- 71 | # decimal average interpolated trend #days
- 72 | # date (season corr)
- 73 | 1958 3 1958.208 315.71 315.71 314.62 -1
- 74 | 1958 4 1958.292 317.45 317.45 315.29 -1
- 75 | 1958 5 1958.375 317.50 317.50 314.71 -1
- 76 | 1958 6 1958.458 -99.99 317.10 314.85 -1
- 77 | 1958 7 1958.542 315.86 315.86 314.98 -1
- 78 | 1958 8 1958.625 314.93 314.93 315.94 -1
-
-
-Notice how:
-
-- The values are separated by white space, possibly tabs.
-- The data line up down the rows. For example, the month appears in 7th to 8th position of each line.
-- The 71st and 72nd lines in the file contain column headings split over two lines.
-
-We can use `read_csv` to read the data into a `pandas` `DataFrame`, and we provide several arguments to specify that the separators are white space, there is no header (**we will set our own column names**), and to skip the first 72 rows of the file.
-
-```{python}
-#| code-fold: false
-co2 = pd.read_csv(
- co2_file, header = None, skiprows = 72,
- sep = r'\s+' #delimiter for continuous whitespace (stay tuned for regex next lecture))
-)
-co2.head()
-```
-
-Congratulations! You've wrangled the data!
-
-<br/>
-
-...But our columns aren't named.
-**We need to do more EDA.**
-
-## Exploring Variable Feature Types
-
-The NOAA [webpage](https://gml.noaa.gov/ccgg/trends/) might have some useful tidbits (in this case it doesn't).
-
-Using this information, we'll rerun `pd.read_csv`, but this time with some **custom column names.**
-
-```{python}
-#| code-fold: false
-co2 = pd.read_csv(
- co2_file, header = None, skiprows = 72,
- sep = '\s+', #regex for continuous whitespace (next lecture)
- names = ['Yr', 'Mo', 'DecDate', 'Avg', 'Int', 'Trend', 'Days']
-)
-co2.head()
-```
-
-## Visualizing CO<sub>2</sub>
-Scientific studies tend to have very clean data, right...? Let's jump right in and make a time series plot of CO2 monthly averages.
-
-```{python}
-#| code-fold: true
-sns.lineplot(x='DecDate', y='Avg', data=co2);
-```
-
-The code above uses the `seaborn` plotting library (abbreviated `sns`). We will cover this in the Visualization lecture, but now you don't need to worry about how it works!
-
-Yikes! Plotting the data uncovered a problem. The sharp vertical lines suggest that we have some **missing values**. What happened here?
-
-```{python}
-#| code-fold: false
-co2.head()
-```
-
-```{python}
-#| code-fold: false
-co2.tail()
-```
-
-Some data have unusual values like -1 and -99.99.
-
-Let's check the description at the top of the file again.
-
-* -1 signifies a missing value for the number of days `Days` the equipment was in operation that month.
-* -99.99 denotes a missing monthly average `Avg`
-
-How can we fix this? First, let's explore other aspects of our data. Understanding our data will help us decide what to do with the missing values.
-
-<br/>
-
-
-## Sanity Checks: Reasoning about the data
-First, we consider the shape of the data. How many rows should we have?
-
-* If chronological order, we should have one record per month.
-* Data from March 1958 to August 2019.
-* We should have $ 12 \times (2019-1957) - 2 - 4 = 738 $ records.
-
-```{python}
-#| code-fold: false
-co2.shape
-```
-
-Nice!! The number of rows (i.e. records) match our expectations.\
-
-<br/>
-
-
-Let's now check the quality of each feature.
-
-## Understanding Missing Value 1: `Days`
-`Days` is a time field, so let's analyze other time fields to see if there is an explanation for missing values of days of operation.
-
-Let's start with **months**, `Mo`.
-
-Are we missing any records? The number of months should have 62 or 61 instances (March 1957-August 2019).
-
-```{python}
-#| code-fold: false
-co2["Mo"].value_counts().sort_index()
-```
-
-As expected Jan, Feb, Sep, Oct, Nov, and Dec have 61 occurrences and the rest 62.
-
-<br/>
-
-Next let's explore **days** `Days` itself, which is the number of days that the measurement equipment worked.
-
-```{python}
-#| code-fold: true
-sns.displot(co2['Days']);
-plt.title("Distribution of days feature"); # suppresses unneeded plotting output
-```
-
-In terms of data quality, a handful of months have averages based on measurements taken on fewer than half the days. In addition, there are nearly 200 missing values--**that's about 27% of the data**!
-
-<br/>
-
-Finally, let's check the last time feature, **year** `Yr`.
-
-Let's check to see if there is any connection between missing-ness and the year of the recording.
-
-```{python}
-#| code-fold: true
-sns.scatterplot(x="Yr", y="Days", data=co2);
-plt.title("Day field by Year"); # the ; suppresses output
-```
-
-**Observations**:
-
-* All of the missing data are in the early years of operation.
-* It appears there may have been problems with equipment in the mid to late 80s.
-
-**Potential Next Steps**:
-
-* Confirm these explanations through documentation about the historical readings.
-* Maybe drop earliest recordings? However, we would want to delay such action until after we have examined the time trends and assess whether there are any potential problems.
-
-<br/>
-
-## Understanding Missing Value 2: `Avg`
-Next, let's return to the -99.99 values in `Avg` to analyze the overall quality of the CO2 measurements. We'll plot a histogram of the average CO<sub>2</sub> measurements
-
-```{python}
-#| code-fold: true
-# Histograms of average CO2 measurements
-sns.displot(co2['Avg']);
-```
-
-The non-missing values are in the 300-400 range (a regular range of CO2 levels).
-
-We also see that there are only a few missing `Avg` values (**<1% of values**). Let's examine all of them:
-
-```{python}
-#| code-fold: false
-co2[co2["Avg"] < 0]
-```
-
-There doesn't seem to be a pattern to these values, other than that most records also were missing `Days` data.
-
-## Drop, `NaN`, or Impute Missing `Avg` Data?
-
-How should we address the invalid `Avg` data?
-
-1. Drop records
-2. Set to NaN
-3. Impute using some strategy
-
-Remember we want to fix the following plot:
-
-```{python}
-#| code-fold: true
-sns.lineplot(x='DecDate', y='Avg', data=co2)
-plt.title("CO2 Average By Month");
-```
-
-Since we are plotting `Avg` vs `DecDate`, we should just focus on dealing with missing values for `Avg`.
-
-
-Let's consider a few options:
-1. Drop those records
-2. Replace -99.99 with NaN
-3. Substitute it with a likely value for the average CO2?
-
-What do you think are the pros and cons of each possible action?
-
-<br/>
-
-
-Let's examine each of these three options.
-
-```{python}
-#| code-fold: false
-# 1. Drop missing values
-co2_drop = co2[co2['Avg'] > 0]
-co2_drop.head()
-```
-
-```{python}
-#| code-fold: false
-# 2. Replace NaN with -99.99
-co2_NA = co2.replace(-99.99, np.NaN)
-co2_NA.head()
-```
-
-We'll also use a third version of the data.
-
-First, we note that the dataset already comes with a **substitute value** for the -99.99.
-
-From the file description:
-
-> The `interpolated` column includes average values from the preceding column (`average`)
-and **interpolated values** where data are missing. Interpolated values are
-computed in two steps...
-
-The `Int` feature has values that exactly match those in `Avg`, except when `Avg` is -99.99, and then a **reasonable** estimate is used instead.
-
-So, the third version of our data will use the `Int` feature instead of `Avg`.
-
-```{python}
-#| code-fold: false
-# 3. Use interpolated column which estimates missing Avg values
-co2_impute = co2.copy()
-co2_impute['Avg'] = co2['Int']
-co2_impute.head()
-```
-
-What's a **reasonable** estimate?
-
-To answer this question, let's zoom in on a short time period, say the measurements in 1958 (where we know we have two missing values).
-
-```{python}
-#| code-fold: true
-# results of plotting data in 1958
-
-def line_and_points(data, ax, title):
- # assumes single year, hence Mo
- ax.plot('Mo', 'Avg', data=data)
- ax.scatter('Mo', 'Avg', data=data)
- ax.set_xlim(2, 13)
- ax.set_title(title)
- ax.set_xticks(np.arange(3, 13))
-
-def data_year(data, year):
- return data[data["Yr"] == 1958]
-
-# uses matplotlib subplots
-# you may see more next week; focus on output for now
-fig, axes = plt.subplots(ncols = 3, figsize=(12, 4), sharey=True)
-
-year = 1958
-line_and_points(data_year(co2_drop, year), axes[0], title="1. Drop Missing")
-line_and_points(data_year(co2_NA, year), axes[1], title="2. Missing Set to NaN")
-line_and_points(data_year(co2_impute, year), axes[2], title="3. Missing Interpolated")
-
-fig.suptitle(f"Monthly Averages for {year}")
-plt.tight_layout()
-```
-
-In the big picture since there are only 7 `Avg` values missing (**<1%** of 738 months), any of these approaches would work.
-
-However there is some appeal to **option C: Imputing**:
-
-* Shows seasonal trends for CO2
-* We are plotting all months in our data as a line plot
-
-<br/>
-
-
-Let's replot our original figure with option 3:
-
-```{python}
-#| code-fold: true
-sns.lineplot(x='DecDate', y='Avg', data=co2_impute)
-plt.title("CO2 Average By Month, Imputed");
-```
-
-Looks pretty close to what we see on the NOAA [website](https://gml.noaa.gov/ccgg/trends/)!
-
-## Presenting the data: A Discussion on Data Granularity
-
-From the description:
-
-* monthly measurements are averages of average day measurements.
-* The NOAA GML website has datasets for daily/hourly measurements too.
-
-The data you present depends on your research question.
-
-**How do CO2 levels vary by season?**
-
-* You might want to keep average monthly data.
-
-**Are CO2 levels rising over the past 50+ years, consistent with global warming predictions?**
-
-* You might be happier with a **coarser granularity** of average year data!
-
-```{python}
-#| code-fold: true
-co2_year = co2_impute.groupby('Yr').mean()
-sns.lineplot(x='Yr', y='Avg', data=co2_year)
-plt.title("CO2 Average By Year");
-```
-
-Indeed, we see a rise by nearly 100 ppm of CO2 since Mauna Loa began recording in 1958.
-
-# Summary
-We went over a lot of content this lecture; let's summarize the most important points:
-
-## Dealing with Missing Values
-There are a few options we can take to deal with missing data:
-
-* Drop missing records
-* Keep `NaN` missing values
-* Impute using an interpolated column
-
-## EDA and Data Wrangling
-There are several ways to approach EDA and Data Wrangling:
-
-* Examine the **data and metadata**: what is the date, size, organization, and structure of the data?
-* Examine each **field/attribute/dimension** individually.
-* Examine pairs of related dimensions (e.g. breaking down grades by major).
-* Along the way, we can:
- * **Visualize** or summarize the data.
- * **Validate assumptions** about data and its collection process. Pay particular attention to when the data was collected.
- * Identify and **address anomalies**.
- * Apply data transformations and corrections (we'll cover this in the upcoming lecture).
- * **Record everything you do!** Developing in Jupyter Notebook promotes *reproducibility* of your own work!
+---
+title: Data Cleaning and EDA
+execute:
+ echo: true
+format:
+ html:
+ code-fold: true
+ code-tools: true
+ toc: true
+ toc-title: Data Cleaning and EDA
+ page-layout: full
+ theme:
+ - cosmo
+ - cerulean
+ callout-icon: false
+jupyter: python3
+---
+
+```{python}
+#| code-fold: true
+import numpy as np
+import pandas as pd
+
+import matplotlib.pyplot as plt
+import seaborn as sns
+#%matplotlib inline
+plt.rcParams['figure.figsize'] = (12, 9)
+
+sns.set()
+sns.set_context('talk')
+np.set_printoptions(threshold=20, precision=2, suppress=True)
+pd.set_option('display.max_rows', 30)
+pd.set_option('display.max_columns', None)
+pd.set_option('display.precision', 2)
+# This option stops scientific notation for pandas
+pd.set_option('display.float_format', '{:.2f}'.format)
+
+# Silence some spurious seaborn warnings
+import warnings
+warnings.filterwarnings("ignore", category=FutureWarning)
+```
+
+::: {.callout-note collapse="false"}
+## Learning Outcomes
+* Recognize common file formats
+* Categorize data by its variable type
+* Build awareness of issues with data faithfulness and develop targeted solutions
+:::
+
+**This content is covered in lectures 4, 5, and 6.**
+
+In the past few lectures, we've learned that `pandas` is a toolkit to restructure, modify, and explore a dataset. What we haven't yet touched on is *how* to make these data transformation decisions. When we receive a new set of data from the "real world," how do we know what processing we should do to convert this data into a usable form?
+
+**Data cleaning**, also called **data wrangling**, is the process of transforming raw data to facilitate subsequent analysis. It is often used to address issues like:
+
+* Unclear structure or formatting
+* Missing or corrupted values
+* Unit conversions
+* ...and so on
+
+**Exploratory Data Analysis (EDA)** is the process of understanding a new dataset. It is an open-ended, informal analysis that involves familiarizing ourselves with the variables present in the data, discovering potential hypotheses, and identifying possible issues with the data. This last point can often motivate further data cleaning to address any problems with the dataset's format; because of this, EDA and data cleaning are often thought of as an "infinite loop," with each process driving the other.
+
+In this lecture, we will consider the key properties of data to consider when performing data cleaning and EDA. In doing so, we'll develop a "checklist" of sorts for you to consider when approaching a new dataset. Throughout this process, we'll build a deeper understanding of this early (but very important!) stage of the data science lifecycle.
+
+## Structure
+
+### File Formats
+There are many file types for storing structured data: TSV, JSON, XML, ASCII, SAS, etc. We'll only cover CSV, TSV, and JSON in lecture, but you'll likely encounter other formats as you work with different datasets. Reading documentation is your best bet for understanding how to process the multitude of different file types.
+
+#### CSV
+CSVs, which stand for **Comma-Separated Values**, are a common tabular data format.
+In the past two `pandas` lectures, we briefly touched on the idea of file format: the way data is encoded in a file for storage. Specifically, our `elections` and `babynames` datasets were stored and loaded as CSVs:
+
+```{python}
+#| code-fold: false
+pd.read_csv("data/elections.csv").head(5)
+```
+
+To better understand the properties of a CSV, let's take a look at the first few rows of the raw data file to see what it looks like before being loaded into a `DataFrame`. We'll use the `repr()` function to return the raw string with its special characters:
+
+```{python}
+#| code-fold: false
+with open("data/elections.csv", "r") as table:
+ i = 0
+ for row in table:
+ print(repr(row))
+ i += 1
+ if i > 3:
+ break
+```
+
+Each row, or **record**, in the data is delimited by a newline `\n`. Each column, or **field**, in the data is delimited by a comma `,` (hence, comma-separated!).
+
+#### TSV
+
+Another common file type is **TSV (Tab-Separated Values)**. In a TSV, records are still delimited by a newline `\n`, while fields are delimited by the tab character `\t`.
+
+Let's check out the first few rows of the raw TSV file. Again, we'll use the `repr()` function so that `print` shows the special characters.
+
+```{python}
+#| code-fold: false
+with open("data/elections.txt", "r") as table:
+ i = 0
+ for row in table:
+ print(repr(row))
+ i += 1
+ if i > 3:
+ break
+```
+
+TSVs can be loaded into `pandas` using `pd.read_csv`. We'll need to specify the **delimiter** with the parameter `sep='\t'` [(documentation)](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html).
+
+```{python}
+#| code-fold: false
+pd.read_csv("data/elections.txt", sep='\t').head(3)
+```
+
+An issue with CSVs and TSVs comes up whenever there are commas or tabs within the records. How does `pandas` differentiate between a comma delimiter vs. a comma within the field itself, for example `8,900`? To remedy this, check out the [`quotechar` parameter](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html).
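+
+As a small, hypothetical illustration (the two-row table below is made up), quoting keeps an embedded comma inside a single field, and `thousands=','` then parses the quoted number:
+
+```python
+import io
+import pandas as pd
+
+# Toy CSV: "Berkeley, CA" and "8,900" each contain a comma, but the quotes
+# keep them as single fields; thousands=',' then parses 8,900 as a number.
+raw = 'city,population\n"Berkeley, CA","8,900"\n'
+pd.read_csv(io.StringIO(raw), quotechar='"', thousands=',')
+```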
+
+#### JSON
+**JSON (JavaScript Object Notation)** files behave similarly to Python dictionaries. A raw JSON is shown below.
+
+```{python}
+#| code-fold: false
+with open("data/elections.json", "r") as table:
+ i = 0
+ for row in table:
+ print(row)
+ i += 1
+ if i > 8:
+ break
+```
+
+JSON files can be loaded into `pandas` using `pd.read_json`.
+
+```{python}
+#| code-fold: false
+pd.read_json('data/elections.json').head(3)
+```
+
+##### EDA with JSON: Berkeley COVID-19 Data
+The City of Berkeley Open Data [website](https://data.cityofberkeley.info/Health/COVID-19-Confirmed-Cases/xn6j-b766) has a dataset with COVID-19 Confirmed Cases among Berkeley residents by date. Let's download the file and save it as a JSON (note the source URL file type is also a JSON). In the interest of reproducible data science, we will download the data programmatically. We have defined some helper functions in the [`ds100_utils.py`](https://ds100.org/fa23/resources/assets/lectures/lec05/lec05-eda.html) file so that we can reuse them in many different notebooks.
+
+```{python}
+#| code-fold: false
+from ds100_utils import fetch_and_cache
+
+covid_file = fetch_and_cache(
+ "https://data.cityofberkeley.info/api/views/xn6j-b766/rows.json?accessType=DOWNLOAD",
+ "confirmed-cases.json",
+ force=False)
+covid_file # a file path wrapper object
+```
+
+###### File Size
+Let's start our analysis by getting a rough estimate of the size of the dataset to inform the tools we use to view the data. For relatively small datasets, we can use a text editor or spreadsheet. For larger datasets, more programmatic exploration or distributed computing tools may be more fitting. Here we will use `Python` tools to probe the file.
+
+Since this seems to be a text file, let's investigate the number of lines, which often corresponds to the number of records.
+
+```{python}
+#| code-fold: false
+import os
+
+print(covid_file, "is", os.path.getsize(covid_file) / 1e6, "MB")
+
+with open(covid_file, "r") as f:
+ print(covid_file, "is", sum(1 for l in f), "lines.")
+```
+
+###### Unix Commands
+As part of the EDA workflow, Unix commands can come in very handy. In fact, there's an entire book called ["Data Science at the Command Line"](https://datascienceatthecommandline.com/) that explores this idea in depth!
+In Jupyter/IPython, you can prefix lines with `!` to execute arbitrary Unix commands, and within those lines, you can refer to `Python` variables and expressions with the syntax `{expr}`.
+
+Here, we use the `ls` command to list files, using the `-lh` flags, which request "long format with information in human-readable form." We also use the `wc` command for "word count," but with the `-l` flag, which asks for line counts instead of words.
+
+These two give us the same information as the code above, albeit in a slightly different form:
+
+```{python}
+#| code-fold: false
+!ls -lh {covid_file}
+!wc -l {covid_file}
+```
+
+###### File Contents
+Let's explore the data format using `Python`.
+
+```{python}
+#| code-fold: false
+with open(covid_file, "r") as f:
+ for i, row in enumerate(f):
+ print(repr(row)) # print raw strings
+ if i >= 4: break
+```
+
+We can use the `head` Unix command (which is where `pandas`' `head` method comes from!) to see the first few lines of the file:
+
+```{python}
+#| code-fold: false
+!head -5 {covid_file}
+```
+
+In order to load the JSON file into `pandas`, let's first do some EDA with `Python`'s `json` package to understand the particular structure of this JSON file so that we can decide what (if anything) to load into `pandas`. `Python` has relatively good support for JSON data since it closely matches the internal `Python` object model. In the following cell we import the entire JSON datafile into a `Python` dictionary using the `json` package.
+
+```{python}
+#| code-fold: false
+import json
+
+with open(covid_file, "rb") as f:
+ covid_json = json.load(f)
+```
+
+The `covid_json` variable is now a dictionary encoding the data in the file:
+
+```{python}
+#| code-fold: false
+type(covid_json)
+```
+
+We can examine what keys are in the top level json object by listing out the keys.
+
+```{python}
+#| code-fold: false
+covid_json.keys()
+```
+
+**Observation**: The JSON dictionary contains a `meta` key, which likely refers to metadata (data about the data). Metadata is often maintained with the data and can be a good source of additional information.
+
+
+We can investigate the meta data further by examining the keys associated with the metadata.
+
+```{python}
+#| code-fold: false
+covid_json['meta'].keys()
+```
+
+The `meta` key contains another dictionary called `view`. This likely refers to meta-data about a particular "view" of some underlying database. We will learn more about views when we study SQL later in the class.
+
+```{python}
+#| code-fold: false
+covid_json['meta']['view'].keys()
+```
+
+Notice that this is a nested/recursive data structure. As we dig deeper we reveal more and more keys and the corresponding data:
+
+```
+meta
+|-> data
+ | ... (haven't explored yet)
+|-> view
+ | -> id
+ | -> name
+ | -> attribution
+ ...
+ | -> description
+ ...
+ | -> columns
+ ...
+```
+
+
+There is a key called `description` in the `view` sub-dictionary. This likely contains a description of the data:
+
+```{python}
+#| code-fold: false
+print(covid_json['meta']['view']['description'])
+```
+
+###### Examining the Data Field for Records
+
+We can look at a few entries in the `data` field. This is what we'll load into `pandas`.
+
+```{python}
+#| code-fold: false
+for i in range(3):
+ print(f"{i:03} | {covid_json['data'][i]}")
+```
+
+Observations:
+
+* These look like equal-length records, so maybe `data` is a table!
+* But what does each of the values in the record mean? Where can we find the column headers?
+
+For that, we'll need the `columns` key in the metadata dictionary. This returns a list:
+
+```{python}
+#| code-fold: false
+type(covid_json['meta']['view']['columns'])
+```
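+
+If you're curious, you can peek at one of these column descriptors. The exact fields present depend on this particular export, but the `name` field is the one we rely on later when building the `DataFrame`:
+
+```python
+# Inspect the first column descriptor (a dictionary of column metadata).
+covid_json['meta']['view']['columns'][0]
+```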
+
+###### Summary of exploring the JSON file
+
+1. The above **metadata** tells us a lot about the columns in the data including column names, potential data anomalies, and a basic statistic.
+1. Because of its non-tabular structure, JSON makes it easier (than CSV) to create **self-documenting data**, meaning that information about the data is stored in the same file as the data.
+1. Self-documenting data can be helpful since it maintains its own description and these descriptions are more likely to be updated as data changes.
+
+###### Loading COVID Data into `pandas`
+Finally, let's load the data (not the metadata) into a `pandas` `DataFrame`. In the following block of code we:
+
+1. Translate the JSON records into a `DataFrame`:
+
+ * fields: `covid_json['meta']['view']['columns']`
+ * records: `covid_json['data']`
+
+
+1. Remove columns that have no metadata description. This would be a bad idea in general, but here we remove these columns since the above analysis suggests they are unlikely to contain useful information.
+
+1. Examine the `tail` of the table.
+
+```{python}
+#| code-fold: false
+# Load the data from JSON and assign column titles
+covid = pd.DataFrame(
+ covid_json['data'],
+ columns=[c['name'] for c in covid_json['meta']['view']['columns']])
+
+covid.tail()
+```
+
+### Variable Types
+
+After loading data from a file, it's a good idea to take the time to understand what pieces of information are encoded in the dataset. In particular, we want to identify what variable types are present in our data. Broadly speaking, we can categorize variables into one of two overarching types.
+
+**Quantitative variables** describe some numeric quantity or amount. We can divide quantitative data further into:
+
+* **Continuous quantitative variables**: numeric data that can be measured on a continuous scale to arbitrary precision. Continuous variables do not have a strict set of possible values – they can be recorded to any number of decimal places. For example, weights, GPA, or CO<sub>2</sub> concentrations.
+* **Discrete quantitative variables**: numeric data that can only take on a finite set of possible values. For example, someone's age or the number of siblings they have.
+
+**Qualitative variables**, also known as **categorical variables**, describe data that isn't measuring some quantity or amount. The sub-categories of categorical data are:
+
+* **Ordinal qualitative variables**: categories with ordered levels. Specifically, ordinal variables are those where the difference between levels has no consistent, quantifiable meaning. Some examples include levels of education (high school, undergrad, grad, etc.), income bracket (low, medium, high), or Yelp rating.
+* **Nominal qualitative variables**: categories with no specific order. For example, someone's political affiliation or Cal ID number.
+
+![Classification of variable types](images/variable.png)
+
+Note that many variables don't sit neatly in just one of these categories. Qualitative variables could have numeric levels, and conversely, quantitative variables could be stored as strings.
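+
+Here is a minimal, made-up example of that mismatch between storage type and variable type:
+
+```python
+import pandas as pd
+
+# "Cal ID" is stored as an integer but is nominal qualitative (IDs aren't quantities);
+# "GPA" is stored as strings (object dtype) but is quantitative.
+example = pd.DataFrame({"Cal ID": [3034619471, 3035619472],
+                        "GPA": ["3.70", "3.95"]})
+example.dtypes
+```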
+
+### Primary and Foreign Keys
+
+Last time, we introduced `.merge` as the `pandas` method for joining multiple `DataFrame`s together. In our discussion of joins, we touched on the idea of using a "key" to determine what rows should be merged from each table. Let's take a moment to examine this idea more closely.
+
+The **primary key** is the column or set of columns in a table that *uniquely* determine the values of the remaining columns. It can be thought of as the unique identifier for each individual row in the table. For example, a table of Data 100 students might use each student's Cal ID as the primary key.
+
+```{python}
+#| echo: false
+pd.DataFrame({"Cal ID":[3034619471, 3035619472, 3025619473, 3046789372], \
+ "Name":["Oski", "Ollie", "Orrie", "Ollie"], \
+ "Major":["Data Science", "Computer Science", "Data Science", "Economics"]})
+```
+
+The **foreign key** is the column or set of columns in a table that reference primary keys in other tables. Knowing a dataset's foreign keys can be useful when assigning the `left_on` and `right_on` parameters of `.merge`. In the table of office hour tickets below, `"Cal ID"` is a foreign key referencing the previous table.
+
+```{python}
+#| echo: false
+pd.DataFrame({"OH Request":[1, 2, 3, 4], \
+ "Cal ID":[3034619471, 3035619472, 3025619473, 3035619472], \
+ "Question":["HW 2 Q1", "HW 2 Q3", "Lab 3 Q4", "HW 2 Q7"]})
+```
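+
+As a sketch (re-creating the two toy tables above under hypothetical names), the foreign key is what we pass to `on` when merging the tickets back onto the student table:
+
+```python
+import pandas as pd
+
+students = pd.DataFrame({
+    "Cal ID": [3034619471, 3035619472, 3025619473, 3046789372],
+    "Name": ["Oski", "Ollie", "Orrie", "Ollie"],
+    "Major": ["Data Science", "Computer Science", "Data Science", "Economics"]})
+
+tickets = pd.DataFrame({
+    "OH Request": [1, 2, 3, 4],
+    "Cal ID": [3034619471, 3035619472, 3025619473, 3035619472],
+    "Question": ["HW 2 Q1", "HW 2 Q3", "Lab 3 Q4", "HW 2 Q7"]})
+
+# Each ticket's foreign key ("Cal ID") matches the students table's primary key.
+tickets.merge(right=students, on="Cal ID")
+```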
+
+## Granularity, Scope, and Temporality
+
+After understanding the structure of the dataset, the next task is to determine what exactly the data represents. We'll do so by considering the data's granularity, scope, and temporality.
+
+### Granularity
+The **granularity** of a dataset is what a single row represents. You can also think of it as the level of detail included in the data. To determine the data's granularity, ask: what does each row in the dataset represent? Fine-grained data contains a high level of detail, with a single row representing a small individual unit. For example, each record may represent one person. Coarse-grained data is encoded such that a single row represents a large individual unit – for example, each record may represent a group of people.
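+
+As a quick, made-up sketch, coarsening granularity usually means aggregating fine-grained rows, for example with `groupby`:
+
+```python
+import pandas as pd
+
+# Fine-grained: one row per person.
+people = pd.DataFrame({"city": ["Berkeley", "Berkeley", "Oakland"],
+                       "age": [20, 22, 30]})
+
+# Coarser-grained: one row per city.
+people.groupby("city").agg(residents=("age", "size"), mean_age=("age", "mean"))
+```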
+
+### Scope
+The **scope** of a dataset is the subset of the population covered by the data. If we were investigating student performance in Data Science courses, a dataset with a narrow scope might encompass all students enrolled in Data 100 whereas a dataset with an expansive scope might encompass all students in California.
+
+### Temporality
+The **temporality** of a dataset describes the periodicity over which the data was collected as well as when the data was most recently collected or updated.
+
+Time and date fields of a dataset could represent a few things:
+
+1. when the "event" happened
+2. when the data was collected, or when it was entered into the system
+3. when the data was copied into the database
+
+To fully understand the temporality of the data, it also may be necessary to standardize time zones or inspect recurring time-based trends in the data (do patterns recur in 24-hour periods? Over the course of a month? Seasonally?). The convention for standardizing time is Coordinated Universal Time (UTC), an international time standard measured at 0 degrees longitude that stays consistent throughout the year (no daylight saving time). Berkeley's local time is Pacific Time: UTC-8 during Pacific Standard Time (PST) and UTC-7 during daylight saving time (PDT).
+
+#### Temporality with `pandas`' `dt` accessors
+Let's briefly look at how we can use `pandas`' `dt` accessors to work with dates/times in a dataset using the dataset you'll see in Lab 3: the Berkeley PD Calls for Service dataset.
+
+```{python}
+#| code-fold: true
+calls = pd.read_csv("data/Berkeley_PD_-_Calls_for_Service.csv")
+calls.head()
+```
+
+Looks like there are three columns with dates/times: `EVENTDT`, `EVENTTM`, and `InDbDate`.
+
+Most likely, `EVENTDT` stands for the date when the event took place, `EVENTTM` stands for the time of day the event took place (in 24-hr format), and `InDbDate` is the date this call is recorded onto the database.
+
+If we check the data type of these columns, we will see they are stored as strings. We can convert them to `datetime` objects using pandas `to_datetime` function.
+
+```{python}
+#| code-fold: false
+calls["EVENTDT"] = pd.to_datetime(calls["EVENTDT"])
+calls.head()
+```
+
+Now, we can use the `dt` accessor on this column.
+
+We can get the month:
+
+```{python}
+#| code-fold: false
+calls["EVENTDT"].dt.month.head()
+```
+
+Which day of the week the date is on:
+
+```{python}
+#| code-fold: false
+calls["EVENTDT"].dt.dayofweek.head()
+```
+
+Check the minimum values to see if there are any suspicious-looking dates from the 1970s:
+
+```{python}
+#| code-fold: false
+calls.sort_values("EVENTDT").head()
+```
+
+Doesn't look like it! We are good!
+
+
+We can also do many things with the `dt` accessor like switching time zones and converting time back to UNIX/POSIX time. Check out the documentation on [`.dt` accessor](https://pandas.pydata.org/docs/user_guide/basics.html#basics-dt-accessors) and [time series/date functionality](https://pandas.pydata.org/docs/user_guide/timeseries.html#).
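+
+For instance, here is a hedged sketch (assuming the `calls["EVENTDT"]` column from above, and that its timestamps are naive local times) of localizing to US/Pacific, converting to UTC, and expressing the result as UNIX epoch seconds:
+
+```python
+import pandas as pd
+
+# Attach the US/Pacific time zone, then convert to UTC.
+event_utc = (calls["EVENTDT"]
+             .dt.tz_localize("US/Pacific")
+             .dt.tz_convert("UTC"))
+
+# Seconds since the UNIX epoch (1970-01-01 00:00:00 UTC).
+unix_seconds = (event_utc - pd.Timestamp("1970-01-01", tz="UTC")) // pd.Timedelta("1s")
+unix_seconds.head()
+```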
+
+## Faithfulness
+
+At this stage in our data cleaning and EDA workflow, we've achieved quite a lot: we've identified how our data is structured, come to terms with what information it encodes, and gained insight as to how it was generated. Throughout this process, we should always recall the original intent of our work in Data Science – to use data to better understand and model the real world. To achieve this goal, we need to ensure that the data we use is faithful to reality; that is, that our data accurately captures the "real world."
+
+Data used in research or industry is often "messy" – there may be errors or inaccuracies that impact the faithfulness of the dataset. Signs that data may not be faithful include:
+
+* Unrealistic or "incorrect" values, such as negative counts, locations that don't exist, or dates set in the future
+* Violations of obvious dependencies, like an age that does not match a birthday
+* Clear signs that data was entered by hand, which can lead to spelling errors or fields that are incorrectly shifted
+* Signs of data falsification, such as fake email addresses or repeated use of the same names
+* Duplicated records or fields containing the same information
+* Truncated data, e.g. older versions of Microsoft Excel limited spreadsheets to 65,536 rows and 256 columns
+
+We often solve some of these more common issues in the following ways:
+
+* Spelling errors: apply corrections or drop records that aren't in a dictionary
+* Time zone inconsistencies: convert to a common time zone (e.g. UTC)
+* Duplicated records or fields: identify and eliminate duplicates (using primary keys); see the sketch below
+* Unspecified or inconsistent units: infer the units and check that values are in reasonable ranges in the data
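+
+For example, dropping duplicated records often comes down to `drop_duplicates` keyed on the primary key (toy, hypothetical data below):
+
+```python
+import pandas as pd
+
+log = pd.DataFrame({"record_id": [1, 2, 2, 3],
+                    "reading": [10.1, 12.3, 12.3, 9.8]})
+
+# Keep only the first row for each primary key value.
+log.drop_duplicates(subset="record_id", keep="first")
+```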
+
+### Missing Values
+Another common issue encountered with real-world datasets is that of missing data. One strategy to resolve this is to simply drop any records with missing values from the dataset. This does, however, introduce the risk of inducing biases – it is possible that the missing or corrupt records may be systemically related to some feature of interest in the data. Another solution is to keep the data as `NaN` values.
+
+A third method to address missing data is to perform **imputation**: infer the missing values using other data available in the dataset. There is a wide variety of imputation techniques that can be implemented; some of the most common are listed below.
+
+* Average imputation: replace missing values with the average value for that field
+* Hot deck imputation: replace missing values with a randomly chosen observed value (often from a similar record)
+* Regression imputation: develop a model to predict missing values
+* Multiple imputation: fill in the missing values several times to create multiple plausible datasets, then combine the results
+
+Regardless of the strategy used to deal with missing data, we should think carefully about *why* particular records or fields may be missing – this can help inform whether or not the absence of these values is significant or meaningful.
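+
+As a minimal sketch (on a made-up `Series` with two missing entries), average imputation and the interpolation idea we will use later in this lecture look like this in `pandas`:
+
+```python
+import numpy as np
+import pandas as pd
+
+s = pd.Series([2.0, np.nan, 4.0, np.nan, 8.0])
+
+s.fillna(s.mean())   # average imputation: fill gaps with the observed mean
+s.interpolate()      # linear interpolation between neighboring observed values
+```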
+
+# EDA Demo 1: Tuberculosis in the United States
+
+Now, let's walk through the data-cleaning and EDA workflow to see what can we learn about the presence of Tuberculosis in the United States!
+
+We will examine the data included in the [original CDC article](https://www.cdc.gov/mmwr/volumes/71/wr/mm7112a1.htm?s_cid=mm7112a1_w#T1_down) published in 2021.
+
+
+## CSVs and Field Names
+Suppose Table 1 was saved as a CSV file located in `data/cdc_tuberculosis.csv`.
+
+We can then explore the CSV (which is a text file, and does not contain binary-encoded data) in many ways:
+
+1. Using a text editor like emacs, vim, VSCode, etc.
+2. Opening the CSV directly in DataHub (read-only), Excel, Google Sheets, etc.
+3. The `Python` file object
+4. `pandas`, using `pd.read_csv()`
+
+To try out options 1 and 2, you can view or download the Tuberculosis dataset from the [lecture demo notebook](https://data100.datahub.berkeley.edu/hub/user-redirect/git-pull?repo=https%3A%2F%2Fgithub.com%2FDS-100%2Ffa23-student&urlpath=lab%2Ftree%2Ffa23-student%2Flecture%2Flec05%2Flec04-eda.ipynb&branch=main) under the `data` folder in the left hand menu. Notice how the CSV file is a type of **rectangular data (i.e., tabular data) stored as comma-separated values**.
+
+Next, let's try out option 3 using the `Python` file object. We'll look at the first four lines:
+
+```{python}
+#| code-fold: true
+with open("data/cdc_tuberculosis.csv", "r") as f:
+ i = 0
+ for row in f:
+ print(row)
+ i += 1
+ if i > 3:
+ break
+```
+
+Whoa, why are there blank lines interspersed between the lines of the CSV?
+
+You may recall that all line breaks in text files are encoded as the special newline character `\n`. Python's `print()` prints each string (including the newline), and an additional newline on top of that.
+
+If you're curious, we can use the `repr()` function to return the raw string with all special characters:
+
+```{python}
+#| code-fold: true
+with open("data/cdc_tuberculosis.csv", "r") as f:
+ i = 0
+ for row in f:
+ print(repr(row)) # print raw strings
+ i += 1
+ if i > 3:
+ break
+```
+
+Finally, let's try option 4 and use the tried-and-true Data 100 approach: `pandas`.
+
+```{python}
+#| code-fold: false
+tb_df = pd.read_csv("data/cdc_tuberculosis.csv")
+tb_df.head()
+```
+
+You may notice some strange things about this table: what's up with the "Unnamed" column names and the first row?
+
+Congratulations — you're ready to wrangle your data! Because of how things are stored, we'll need to clean the data a bit to name our columns better.
+
+A reasonable first step is to identify the row with the right header. The `pd.read_csv()` function ([documentation](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html)) has the convenient `header` parameter that we can set to use the elements in row 1 as the appropriate columns:
+
+```{python}
+#| code-fold: false
+tb_df = pd.read_csv("data/cdc_tuberculosis.csv", header=1) # row index
+tb_df.head(5)
+```
+
+Wait...but now we can't differentiate between the "Number of TB cases" and "TB incidence" year columns. `pandas` has tried to make our lives easier by automatically adding ".1" to the latter columns, but this doesn't help us, as humans, understand the data.
+
+We can do this manually with `df.rename()` ([documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.rename.html?highlight=rename#pandas.DataFrame.rename)):
+
+```{python}
+#| code-fold: false
+rename_dict = {'2019': 'TB cases 2019',
+ '2020': 'TB cases 2020',
+ '2021': 'TB cases 2021',
+ '2019.1': 'TB incidence 2019',
+ '2020.1': 'TB incidence 2020',
+ '2021.1': 'TB incidence 2021'}
+tb_df = tb_df.rename(columns=rename_dict)
+tb_df.head(5)
+```
+
+## Record Granularity
+
+You might already be wondering: what's up with that first record?
+
+Row 0 is what we call a **rollup record**, or summary record. It's often useful when displaying tables to humans. The **granularity** of record 0 (Totals) vs the rest of the records (States) is different.
+
+Okay, EDA step two. How was the rollup record aggregated?
+
+Let's check if the Total TB cases are the sum of all state TB cases. If we sum over all rows, we should get **2x** the total cases in each of the TB cases columns (why do you think this is?).
+
+```{python}
+#| code-fold: true
+tb_df.sum(axis=0)
+```
+
+Whoa, what's going on with the TB cases in 2019, 2020, and 2021? Check out the column types:
+
+```{python}
+#| code-fold: true
+tb_df.dtypes
+```
+
+Since there are commas in the values for TB cases, the numbers are read as the `object` datatype, or **storage type** (close to the `Python` string datatype), so `pandas` is concatenating strings instead of adding integers (recall that `Python` can "sum", or concatenate, strings together: `"data" + "100"` evaluates to `"data100"`).
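+
+A tiny demo of the problem (made-up values): with the `object` dtype, "summing" concatenates, while numeric dtypes add as expected.
+
+```python
+import pandas as pd
+
+pd.Series(["1,234", "5,678"]).sum()   # -> '1,2345,678' (string concatenation)
+pd.Series([1234, 5678]).sum()         # -> 6912 (numeric addition)
+```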
+
+
+Fortunately `read_csv` also has a `thousands` parameter ([documentation](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html)):
+
+```{python}
+#| code-fold: false
+# improve readability: chaining method calls with outer parentheses/line breaks
+tb_df = (
+ pd.read_csv("data/cdc_tuberculosis.csv", header=1, thousands=',')
+ .rename(columns=rename_dict)
+)
+tb_df.head(5)
+```
+
+```{python}
+#| code-fold: false
+tb_df.sum()
+```
+
+The Total TB cases look right. Phew!
+
+Let's just look at the records with **state-level granularity**:
+
+```{python}
+#| code-fold: true
+state_tb_df = tb_df[1:]
+state_tb_df.head(5)
+```
+
+## Gather Census Data
+
+U.S. Census population estimates [source](https://www.census.gov/data/tables/time-series/demo/popest/2010s-state-total.html) (2019), [source](https://www.census.gov/data/tables/time-series/demo/popest/2020s-state-total.html) (2020-2021).
+
+Running the below cells cleans the data.
+There are a few new methods here:
+
+* `df.convert_dtypes()` ([documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.convert_dtypes.html)) conveniently infers better dtypes for each column (e.g., whole-number floats become integers); the details are out of scope for the class.
+* `df.dropna()` ([documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.dropna.html)) will be explained in more detail next time.
+
+```{python}
+#| code-fold: true
+# 2010s census data
+census_2010s_df = pd.read_csv("data/nst-est2019-01.csv", header=3, thousands=",")
+census_2010s_df = (
+ census_2010s_df
+ .reset_index()
+ .drop(columns=["index", "Census", "Estimates Base"])
+ .rename(columns={"Unnamed: 0": "Geographic Area"})
+ .convert_dtypes() # "smart" converting of columns, use at your own risk
+ .dropna() # we'll introduce this next time
+)
+census_2010s_df['Geographic Area'] = census_2010s_df['Geographic Area'].str.strip('.')
+
+# with pd.option_context('display.min_rows', 30): # shows more rows
+# display(census_2010s_df)
+
+census_2010s_df.head(5)
+```
+
+Occasionally, you will want to modify code that you have imported. To reimport those modifications, you can either use `Python`'s `importlib` library:
+
+```python
+from importlib import reload
+reload(utils)
+```
+
+or use `IPython` magic, which will intelligently reimport code when files change:
+
+```python
+%load_ext autoreload
+%autoreload 2
+```
+
+```{python}
+#| code-fold: true
+# census 2020s data
+census_2020s_df = pd.read_csv("data/NST-EST2022-POP.csv", header=3, thousands=",")
+census_2020s_df = (
+ census_2020s_df
+ .reset_index()
+ .drop(columns=["index", "Unnamed: 1"])
+ .rename(columns={"Unnamed: 0": "Geographic Area"})
+ .convert_dtypes() # "smart" converting of columns, use at your own risk
+ .dropna() # we'll introduce this next time
+)
+census_2020s_df['Geographic Area'] = census_2020s_df['Geographic Area'].str.strip('.')
+
+census_2020s_df.head(5)
+```
+
+## Joining Data (Merging `DataFrame`s)
+
+Time to `merge`! Here we use the `DataFrame` method `df1.merge(right=df2, ...)` on `DataFrame` `df1` ([documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.merge.html)). Contrast this with the function `pd.merge(left=df1, right=df2, ...)` ([documentation](https://pandas.pydata.org/docs/reference/api/pandas.merge.html?highlight=pandas%20merge#pandas.merge)). Feel free to use either.
+
+```{python}
+#| code-fold: false
+# merge TB DataFrame with two US census DataFrames
+tb_census_df = (
+ tb_df
+ .merge(right=census_2010s_df,
+ left_on="U.S. jurisdiction", right_on="Geographic Area")
+ .merge(right=census_2020s_df,
+ left_on="U.S. jurisdiction", right_on="Geographic Area")
+)
+tb_census_df.head(5)
+```
+
+Having all of these columns is a little unwieldy. We could either drop the unneeded columns now, or just merge on smaller census `DataFrame`s. Let's do the latter.
+
+```{python}
+#| code-fold: false
+# try merging again, but cleaner this time
+tb_census_df = (
+ tb_df
+ .merge(right=census_2010s_df[["Geographic Area", "2019"]],
+ left_on="U.S. jurisdiction", right_on="Geographic Area")
+ .drop(columns="Geographic Area")
+ .merge(right=census_2020s_df[["Geographic Area", "2020", "2021"]],
+ left_on="U.S. jurisdiction", right_on="Geographic Area")
+ .drop(columns="Geographic Area")
+)
+tb_census_df.head(5)
+```
+
+## Reproducing Data: Compute Incidence
+
+Let's recompute incidence to make sure we know where the original CDC numbers came from.
+
+From the [CDC report](https://www.cdc.gov/mmwr/volumes/71/wr/mm7112a1.htm?s_cid=mm7112a1_w#T1_down): TB incidence is computed as “Cases per 100,000 persons using mid-year population estimates from the U.S. Census Bureau.”
+
+If we define a group as 100,000 people, then we can compute the TB incidence for a given state population as
+
+$$\text{TB incidence} = \frac{\text{TB cases in population}}{\text{groups in population}} = \frac{\text{TB cases in population}}{\text{population}/100000} $$
+
+$$= \frac{\text{TB cases in population}}{\text{population}} \times 100000$$
+
+Let's try this for 2019:
+
+```{python}
+#| code-fold: false
+tb_census_df["recompute incidence 2019"] = tb_census_df["TB cases 2019"]/tb_census_df["2019"]*100000
+tb_census_df.head(5)
+```
+
+Awesome!!!
+
+Let's use a for-loop and `Python` format strings to compute TB incidence for all years. `Python` f-strings are just used for the purposes of this demo, but they're handy to know when you explore data beyond this course ([documentation](https://docs.python.org/3/tutorial/inputoutput.html)).
+
+```{python}
+#| code-fold: false
+# recompute incidence for all years
+for year in [2019, 2020, 2021]:
+ tb_census_df[f"recompute incidence {year}"] = tb_census_df[f"TB cases {year}"]/tb_census_df[f"{year}"]*100000
+tb_census_df.head(5)
+```
+
+These numbers look pretty close!!! There are a few errors in the hundredths place, particularly in 2021. It may be useful to further explore reasons behind this discrepancy.
+
+```{python}
+#| code-fold: false
+tb_census_df.describe()
+```
+
+## Bonus EDA: Reproducing the Reported Statistic
+
+
+**How do we reproduce that reported statistic in the original [CDC report](https://www.cdc.gov/mmwr/volumes/71/wr/mm7112a1.htm?s_cid=mm7112a1_w)?**
+
+> Reported TB incidence (cases per 100,000 persons) increased **9.4%**, from **2.2** during 2020 to **2.4** during 2021 but was lower than incidence during 2019 (2.7). Increases occurred among both U.S.-born and non–U.S.-born persons.
+
+This is TB incidence computed across the entire U.S. population! How do we reproduce this?
+
+* We need to reproduce the "Total" TB incidences in our rolled record.
+* But our current `tb_census_df` only has 51 entries (50 states plus Washington, D.C.). There is no rolled record.
+* What happened...?
+
+Let's get exploring!
+
+Before we keep exploring, we'll set all indexes to more meaningful values, instead of just numbers that pertain to some row at some point. This will make our cleaning slightly easier.
+
+```{python}
+#| code-fold: true
+tb_df = tb_df.set_index("U.S. jurisdiction")
+tb_df.head(5)
+```
+
+```{python}
+#| code-fold: false
+census_2010s_df = census_2010s_df.set_index("Geographic Area")
+census_2010s_df.head(5)
+```
+
+```{python}
+#| code-fold: false
+census_2020s_df = census_2020s_df.set_index("Geographic Area")
+census_2020s_df.head(5)
+```
+
+It turns out that our merge above only kept state records, even though our original `tb_df` had the "Total" rolled record:
+
+```{python}
+#| code-fold: false
+tb_df.head()
+```
+
+Recall that `merge` performs an **inner** merge by default, meaning that it only preserves keys that are present in **both** `DataFrame`s.
+
+The rolled records in our census `DataFrame` have different `Geographic Area` fields, which was the key we merged on:
+
+```{python}
+#| code-fold: false
+census_2010s_df.head(5)
+```
+
+The Census `DataFrame` has several rolled records. The aggregate record we are looking for actually has the Geographic Area named "United States".
+
+One straightforward way to get the right merge is to rename the value itself. Because we now have the Geographic Area index, we'll use `df.rename()` ([documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.rename.html)):
+
+```{python}
+#| code-fold: false
+# rename rolled record for 2010s
+census_2010s_df.rename(index={'United States':'Total'}, inplace=True)
+census_2010s_df.head(5)
+```
+
+```{python}
+#| code-fold: false
+# same, but for 2020s rename rolled record
+census_2020s_df.rename(index={'United States':'Total'}, inplace=True)
+census_2020s_df.head(5)
+```
+
+<br/>
+
+Next let's rerun our merge. Note the different chaining, because we are now merging on indexes (`df.merge()` [documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.merge.html)).
+
+```{python}
+#| code-fold: false
+tb_census_df = (
+ tb_df
+ .merge(right=census_2010s_df[["2019"]],
+ left_index=True, right_index=True)
+ .merge(right=census_2020s_df[["2020", "2021"]],
+ left_index=True, right_index=True)
+)
+tb_census_df.head(5)
+```
+
+<br/>
+
+Finally, let's recompute our incidences:
+
+```{python}
+#| code-fold: false
+# recompute incidence for all years
+for year in [2019, 2020, 2021]:
+ tb_census_df[f"recompute incidence {year}"] = tb_census_df[f"TB cases {year}"]/tb_census_df[f"{year}"]*100000
+tb_census_df.head(5)
+```
+
+We reproduced the total U.S. incidences correctly!
+
+We're almost there. Let's revisit the quote:
+
+> Reported TB incidence (cases per 100,000 persons) increased **9.4%**, from **2.2** during 2020 to **2.4** during 2021 but was lower than incidence during 2019 (2.7). Increases occurred among both U.S.-born and non–U.S.-born persons.
+
+Recall that percent change from $A$ to $B$ is computed as
+$\text{percent change} = \frac{B - A}{A} \times 100$.
+
+```{python}
+#| code-fold: false
+#| tags: []
+incidence_2020 = tb_census_df.loc['Total', 'recompute incidence 2020']
+incidence_2020
+```
+
+```{python}
+#| code-fold: false
+#| tags: []
+incidence_2021 = tb_census_df.loc['Total', 'recompute incidence 2021']
+incidence_2021
+```
+
+```{python}
+#| code-fold: false
+#| tags: []
+difference = (incidence_2021 - incidence_2020)/incidence_2020 * 100
+difference
+```
+
+# EDA Demo 2: Mauna Loa CO<sub>2</sub> Data -- A Lesson in Data Faithfulness
+
+[Mauna Loa Observatory](https://gml.noaa.gov/ccgg/trends/data.html) has been monitoring CO<sub>2</sub> concentrations since 1958
+
+```{python}
+#| code-fold: false
+co2_file = "data/co2_mm_mlo.txt"
+```
+
+Let's do some **EDA**!!
+
+## Reading this file into Pandas?
+Let's instead check out this `.txt` file. Some questions to keep in mind: Do we trust this file extension? What structure is it?
+
+Lines 71-78 (inclusive) are shown below:
+
+ line number | file contents
+
+ 71 | # decimal average interpolated trend #days
+ 72 | # date (season corr)
+ 73 | 1958 3 1958.208 315.71 315.71 314.62 -1
+ 74 | 1958 4 1958.292 317.45 317.45 315.29 -1
+ 75 | 1958 5 1958.375 317.50 317.50 314.71 -1
+ 76 | 1958 6 1958.458 -99.99 317.10 314.85 -1
+ 77 | 1958 7 1958.542 315.86 315.86 314.98 -1
+ 78 | 1958 8 1958.625 314.93 314.93 315.94 -1
+
+
+Notice how:
+
+- The values are separated by white space, possibly tabs.
+- The values line up in fixed-width columns down the rows. For example, the month always appears in the 7th to 8th character positions of each line.
+- The 71st and 72nd lines in the file contain column headings split over two lines.
+
+We can use `read_csv` to read the data into a `pandas` `DataFrame`, and we provide several arguments to specify that the separators are white space, there is no header (**we will set our own column names**), and to skip the first 72 rows of the file.
+
+```{python}
+#| code-fold: false
+co2 = pd.read_csv(
+    co2_file, header = None, skiprows = 72,
+    sep = r'\s+'  # delimiter for continuous whitespace (stay tuned for regex next lecture)
+)
+co2.head()
+```
+
+Congratulations! You've wrangled the data!
+
+<br/>
+
+...But our columns aren't named.
+**We need to do more EDA.**
+
+## Exploring Variable Feature Types
+
+The NOAA [webpage](https://gml.noaa.gov/ccgg/trends/) might have some useful tidbits (in this case it doesn't).
+
+Using this information, we'll rerun `pd.read_csv`, but this time with some **custom column names.**
+
+```{python}
+#| code-fold: false
+co2 = pd.read_csv(
+    co2_file, header = None, skiprows = 72,
+    sep = r'\s+',  # regex for continuous whitespace (next lecture)
+    names = ['Yr', 'Mo', 'DecDate', 'Avg', 'Int', 'Trend', 'Days']
+)
+co2.head()
+```
+
+## Visualizing CO<sub>2</sub>
+Scientific studies tend to have very clean data, right...? Let's jump right in and make a time series plot of CO2 monthly averages.
+
+```{python}
+#| code-fold: true
+sns.lineplot(x='DecDate', y='Avg', data=co2);
+```
+
+The code above uses the `seaborn` plotting library (abbreviated `sns`). We will cover this in the Visualization lecture, but for now you don't need to worry about how it works!
+
+Yikes! Plotting the data uncovered a problem. The sharp vertical lines suggest that we have some **missing values**. What happened here?
+
+```{python}
+#| code-fold: false
+co2.head()
+```
+
+```{python}
+#| code-fold: false
+co2.tail()
+```
+
+Some data have unusual values like -1 and -99.99.
+
+Let's check the description at the top of the file again.
+
+* -1 signifies a missing value for the number of days `Days` the equipment was in operation that month.
+* -99.99 denotes a missing monthly average `Avg`
+
+How can we fix this? First, let's explore other aspects of our data. Understanding our data will help us decide what to do with the missing values.
+
+<br/>
+
+
+## Sanity Checks: Reasoning about the data
+First, we consider the shape of the data. How many rows should we have?
+
+* If chronological order, we should have one record per month.
+* Data from March 1958 to August 2019.
+* We should have $ 12 \times (2019-1957) - 2 - 4 = 738 $ records.
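+
+A quick check of that arithmetic:
+
+```python
+# 62 calendar years (1958-2019), minus Jan-Feb 1958 and Sep-Dec 2019
+12 * (2019 - 1957) - 2 - 4   # 738
+```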
+
+```{python}
+#| code-fold: false
+co2.shape
+```
+
+Nice!! The number of rows (i.e., records) matches our expectations.
+
+<br/>
+
+
+Let's now check the quality of each feature.
+
+## Understanding Missing Value 1: `Days`
+`Days` is a time field, so let's analyze other time fields to see if there is an explanation for missing values of days of operation.
+
+Let's start with **months**, `Mo`.
+
+Are we missing any records? Each month should appear either 61 or 62 times (March 1958-August 2019).
+
+```{python}
+#| code-fold: false
+co2["Mo"].value_counts().sort_index()
+```
+
+As expected, Jan, Feb, Sep, Oct, Nov, and Dec have 61 occurrences, and the rest have 62.
+
+<br/>
+
+Next let's explore **days** `Days` itself, which is the number of days that the measurement equipment worked.
+
+```{python}
+#| code-fold: true
+sns.displot(co2['Days']);
+plt.title("Distribution of days feature"); # suppresses unneeded plotting output
+```
+
+In terms of data quality, a handful of months have averages based on measurements taken on fewer than half the days. In addition, there are nearly 200 missing values--**that's about 27% of the data**!
+
+<br/>
+
+Finally, let's check the last time feature, **year** `Yr`.
+
+Let's check whether there is any connection between missingness and the year of the recording.
+
+```{python}
+#| code-fold: true
+sns.scatterplot(x="Yr", y="Days", data=co2);
+plt.title("Day field by Year"); # the ; suppresses output
+```
+
+**Observations**:
+
+* All of the missing data are in the early years of operation.
+* It appears there may have been problems with equipment in the mid to late 80s.
+
+**Potential Next Steps**:
+
+* Confirm these explanations through documentation about the historical readings.
+* Maybe drop the earliest recordings? However, we would want to delay such action until after we have examined the time trends and assessed whether there are any potential problems.
+
+<br/>
+
+## Understanding Missing Value 2: `Avg`
+Next, let's return to the -99.99 values in `Avg` to analyze the overall quality of the CO2 measurements. We'll plot a histogram of the average CO<sub>2</sub> measurements
+
+```{python}
+#| code-fold: true
+# Histograms of average CO2 measurements
+sns.displot(co2['Avg']);
+```
+
+The non-missing values are in the 300-400 range (a regular range of CO2 levels).
+
+We also see that there are only a few missing `Avg` values (**<1% of values**). Let's examine all of them:
+
+```{python}
+#| code-fold: false
+co2[co2["Avg"] < 0]
+```
+
+There doesn't seem to be a pattern to these values, other than that most records also were missing `Days` data.
+
+## Drop, `NaN`, or Impute Missing `Avg` Data?
+
+How should we address the invalid `Avg` data?
+
+1. Drop records
+2. Set to NaN
+3. Impute using some strategy
+
+Remember we want to fix the following plot:
+
+```{python}
+#| code-fold: true
+sns.lineplot(x='DecDate', y='Avg', data=co2)
+plt.title("CO2 Average By Month");
+```
+
+Since we are plotting `Avg` vs `DecDate`, we should just focus on dealing with missing values for `Avg`.
+
+
+Let's consider a few options:
+
+1. Drop those records
+2. Replace -99.99 with NaN
+3. Substitute it with a likely value for the average CO2?
+
+What do you think are the pros and cons of each possible action?
+
+<br/>
+
+
+Let's examine each of these three options.
+
+```{python}
+#| code-fold: false
+# 1. Drop missing values
+co2_drop = co2[co2['Avg'] > 0]
+co2_drop.head()
+```
+
+```{python}
+#| code-fold: false
+# 2. Replace -99.99 with NaN
+co2_NA = co2.replace(-99.99, np.nan)
+co2_NA.head()
+```
+
+We'll also use a third version of the data.
+
+First, we note that the dataset already comes with a **substitute value** for the -99.99.
+
+From the file description:
+
+> The `interpolated` column includes average values from the preceding column (`average`)
+and **interpolated values** where data are missing. Interpolated values are
+computed in two steps...
+
+The `Int` feature has values that exactly match those in `Avg`, except when `Avg` is -99.99, in which case a **reasonable** estimate is used instead.
+
+So, the third version of our data will use the `Int` feature instead of `Avg`.
+
+```{python}
+#| code-fold: false
+# 3. Use interpolated column which estimates missing Avg values
+co2_impute = co2.copy()
+co2_impute['Avg'] = co2['Int']
+co2_impute.head()
+```
+
+What's a **reasonable** estimate?
+
+To answer this question, let's zoom in on a short time period, say the measurements in 1958 (where we know we have two missing values).
+
+```{python}
+#| code-fold: true
+# results of plotting data in 1958
+
+def line_and_points(data, ax, title):
+ # assumes single year, hence Mo
+ ax.plot('Mo', 'Avg', data=data)
+ ax.scatter('Mo', 'Avg', data=data)
+ ax.set_xlim(2, 13)
+ ax.set_title(title)
+ ax.set_xticks(np.arange(3, 13))
+
+def data_year(data, year):
+    # select the records for the given year
+    return data[data["Yr"] == year]
+
+# uses matplotlib subplots
+# you may see more next week; focus on output for now
+fig, axes = plt.subplots(ncols = 3, figsize=(12, 4), sharey=True)
+
+year = 1958
+line_and_points(data_year(co2_drop, year), axes[0], title="1. Drop Missing")
+line_and_points(data_year(co2_NA, year), axes[1], title="2. Missing Set to NaN")
+line_and_points(data_year(co2_impute, year), axes[2], title="3. Missing Interpolated")
+
+fig.suptitle(f"Monthly Averages for {year}")
+plt.tight_layout()
+```
+
+In the big picture, since only 7 `Avg` values are missing (**<1%** of 738 months), any of these approaches would work.
+
+However, there is some appeal to **option 3: imputing**:
+
+* Shows seasonal trends for CO2
+* We are plotting all months in our data as a line plot
+
+<br/>
+
+
+Let's replot our original figure with option 3:
+
+```{python}
+#| code-fold: true
+sns.lineplot(x='DecDate', y='Avg', data=co2_impute)
+plt.title("CO2 Average By Month, Imputed");
+```
+
+Looks pretty close to what we see on the NOAA [website](https://gml.noaa.gov/ccgg/trends/)!
+
+## Presenting the data: A Discussion on Data Granularity
+
+From the description:
+
+* Monthly measurements are averages of daily average measurements.
+* The NOAA GML website has datasets for daily/hourly measurements too.
+
+The data you present depends on your research question.
+
+**How do CO2 levels vary by season?**
+
+* You might want to keep average monthly data.
+
+**Are CO2 levels rising over the past 50+ years, consistent with global warming predictions?**
+
+* You might be happier with a **coarser granularity** of average year data!
+
+```{python}
+#| code-fold: true
+co2_year = co2_impute.groupby('Yr').mean()
+sns.lineplot(x='Yr', y='Avg', data=co2_year)
+plt.title("CO2 Average By Year");
+```
+
+Indeed, we see a rise by nearly 100 ppm of CO2 since Mauna Loa began recording in 1958.
+
+# Summary
+We went over a lot of content this lecture; let's summarize the most important points:
+
+## Dealing with Missing Values
+There are a few options we can take to deal with missing data:
+
+* Drop missing records
+* Keep `NaN` missing values
+* Impute using an interpolated column
+
+## EDA and Data Wrangling
+There are several ways to approach EDA and Data Wrangling:
+
+* Examine the **data and metadata**: what is the date, size, organization, and structure of the data?
+* Examine each **field/attribute/dimension** individually.
+* Examine pairs of related dimensions (e.g. breaking down grades by major).
+* Along the way, we can:
+ * **Visualize** or summarize the data.
+ * **Validate assumptions** about data and its collection process. Pay particular attention to when the data was collected.
+ * Identify and **address anomalies**.
+ * Apply data transformations and corrections (we'll cover this in the upcoming lecture).
+ * **Record everything you do!** Developing in Jupyter Notebook promotes *reproducibility* of your own work!
diff --git a/docs/eda/eda_files/figure-html/cell-62-output-1.png b/docs/eda/eda_files/figure-html/cell-62-output-1.png
index a04218cf..f392d5f9 100644
Binary files a/docs/eda/eda_files/figure-html/cell-62-output-1.png and b/docs/eda/eda_files/figure-html/cell-62-output-1.png differ
diff --git a/docs/eda/eda_files/figure-html/cell-67-output-1.png b/docs/eda/eda_files/figure-html/cell-67-output-1.png
new file mode 100644
index 00000000..be96b8c9
Binary files /dev/null and b/docs/eda/eda_files/figure-html/cell-67-output-1.png differ
diff --git a/docs/eda/eda_files/figure-html/cell-67-output-2.png b/docs/eda/eda_files/figure-html/cell-67-output-2.png
deleted file mode 100644
index 31857f62..00000000
Binary files a/docs/eda/eda_files/figure-html/cell-67-output-2.png and /dev/null differ
diff --git a/docs/eda/eda_files/figure-html/cell-68-output-1.png b/docs/eda/eda_files/figure-html/cell-68-output-1.png
index 67c3959d..ffd29ff8 100644
Binary files a/docs/eda/eda_files/figure-html/cell-68-output-1.png and b/docs/eda/eda_files/figure-html/cell-68-output-1.png differ
diff --git a/docs/eda/eda_files/figure-html/cell-69-output-1.png b/docs/eda/eda_files/figure-html/cell-69-output-1.png
new file mode 100644
index 00000000..29088928
Binary files /dev/null and b/docs/eda/eda_files/figure-html/cell-69-output-1.png differ
diff --git a/docs/eda/eda_files/figure-html/cell-69-output-2.png b/docs/eda/eda_files/figure-html/cell-69-output-2.png
deleted file mode 100644
index fb28f5d5..00000000
Binary files a/docs/eda/eda_files/figure-html/cell-69-output-2.png and /dev/null differ
diff --git a/docs/eda/eda_files/figure-html/cell-71-output-1.png b/docs/eda/eda_files/figure-html/cell-71-output-1.png
index 39cac822..49ef3d6a 100644
Binary files a/docs/eda/eda_files/figure-html/cell-71-output-1.png and b/docs/eda/eda_files/figure-html/cell-71-output-1.png differ
diff --git a/docs/eda/eda_files/figure-html/cell-75-output-1.png b/docs/eda/eda_files/figure-html/cell-75-output-1.png
index 6382e58a..15a5fe82 100644
Binary files a/docs/eda/eda_files/figure-html/cell-75-output-1.png and b/docs/eda/eda_files/figure-html/cell-75-output-1.png differ
diff --git a/docs/eda/eda_files/figure-html/cell-76-output-1.png b/docs/eda/eda_files/figure-html/cell-76-output-1.png
index db2b0dee..40b1fc71 100644
Binary files a/docs/eda/eda_files/figure-html/cell-76-output-1.png and b/docs/eda/eda_files/figure-html/cell-76-output-1.png differ
diff --git a/docs/eda/eda_files/figure-html/cell-77-output-1.png b/docs/eda/eda_files/figure-html/cell-77-output-1.png
index 897b8b39..99b6c2d1 100644
Binary files a/docs/eda/eda_files/figure-html/cell-77-output-1.png and b/docs/eda/eda_files/figure-html/cell-77-output-1.png differ
diff --git a/docs/feature_engineering/feature_engineering.html b/docs/feature_engineering/feature_engineering.html
index ea770e7f..22d26788 100644
--- a/docs/feature_engineering/feature_engineering.html
+++ b/docs/feature_engineering/feature_engineering.html
@@ -556,7 +556,7 @@
my_model.fit(X, Y)
-LinearRegression()
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.LinearRegression()
+LinearRegression()
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.LinearRegression()
Code
- +
Code
-
+
@@ -4975,1218 +4963,1218 @@ <
Source Code
----
-title: Data Cleaning and EDA
-execute:
- echo: true
-format:
- html:
- code-fold: true
- code-tools: true
- toc: true
- toc-title: Data Cleaning and EDA
- page-layout: full
- theme:
- - cosmo
- - cerulean
- callout-icon: false
-jupyter: python3
----
-
-```{python}
-#| code-fold: true
-import numpy as np
-import pandas as pd
-
-import matplotlib.pyplot as plt
-import seaborn as sns
-#%matplotlib inline
-plt.rcParams['figure.figsize'] = (12, 9)
-
-sns.set()
-sns.set_context('talk')
-np.set_printoptions(threshold=20, precision=2, suppress=True)
-pd.set_option('display.max_rows', 30)
-pd.set_option('display.max_columns', None)
-pd.set_option('display.precision', 2)
-# This option stops scientific notation for pandas
-pd.set_option('display.float_format', '{:.2f}'.format)
-
-# Silence some spurious seaborn warnings
-import warnings
-warnings.filterwarnings("ignore", category=FutureWarning)
-```
-
-::: {.callout-note collapse="false"}
-## Learning Outcomes
-* Recognize common file formats
-* Categorize data by its variable type
-* Build awareness of issues with data faithfulness and develop targeted solutions
-:::
-
-**This content is covered in lectures 4, 5, and 6.**
-
-In the past few lectures, we've learned that `pandas` is a toolkit to restructure, modify, and explore a dataset. What we haven't yet touched on is *how* to make these data transformation decisions. When we receive a new set of data from the "real world," how do we know what processing we should do to convert this data into a usable form?
-
-**Data cleaning**, also called **data wrangling**, is the process of transforming raw data to facilitate subsequent analysis. It is often used to address issues like:
-
-* Unclear structure or formatting
-* Missing or corrupted values
-* Unit conversions
-* ...and so on
-
-**Exploratory Data Analysis (EDA)** is the process of understanding a new dataset. It is an open-ended, informal analysis that involves familiarizing ourselves with the variables present in the data, discovering potential hypotheses, and identifying possible issues with the data. This last point can often motivate further data cleaning to address any problems with the dataset's format; because of this, EDA and data cleaning are often thought of as an "infinite loop," with each process driving the other.
-
-In this lecture, we will consider the key properties of data to consider when performing data cleaning and EDA. In doing so, we'll develop a "checklist" of sorts for you to consider when approaching a new dataset. Throughout this process, we'll build a deeper understanding of this early (but very important!) stage of the data science lifecycle.
-
-## Structure
-
-### File Formats
-There are many file types for storing structured data: TSV, JSON, XML, ASCII, SAS, etc. We'll only cover CSV, TSV, and JSON in lecture, but you'll likely encounter other formats as you work with different datasets. Reading documentation is your best bet for understanding how to process the multitude of different file types.
-
-#### CSV
-CSVs, which stand for **Comma-Separated Values**, are a common tabular data format.
-In the past two `pandas` lectures, we briefly touched on the idea of file format: the way data is encoded in a file for storage. Specifically, our `elections` and `babynames` datasets were stored and loaded as CSVs:
-
-```{python}
-#| code-fold: false
-pd.read_csv("data/elections.csv").head(5)
-```
-
-To better understand the properties of a CSV, let's take a look at the first few rows of the raw data file to see what it looks like before being loaded into a `DataFrame`. We'll use the `repr()` function to return the raw string with its special characters:
-
-```{python}
-#| code-fold: false
-with open("data/elections.csv", "r") as table:
- i = 0
- for row in table:
- print(repr(row))
- i += 1
- if i > 3:
- break
-```
-
-Each row, or **record**, in the data is delimited by a newline `\n`. Each column, or **field**, in the data is delimited by a comma `,` (hence, comma-separated!).
-
-#### TSV
-
-Another common file type is **TSV (Tab-Separated Values)**. In a TSV, records are still delimited by a newline `\n`, while fields are delimited by `\t` tab character.
-
-Let's check out the first few rows of the raw TSV file. Again, we'll use the `repr()` function so that `print` shows the special characters.
-
-```{python}
-#| code-fold: false
-with open("data/elections.txt", "r") as table:
- i = 0
- for row in table:
- print(repr(row))
- i += 1
- if i > 3:
- break
-```
-
-TSVs can be loaded into `pandas` using `pd.read_csv`. We'll need to specify the **delimiter** with the parameter `sep='\t'` [(documentation)](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html).
-
-```{python}
-#| code-fold: false
-pd.read_csv("data/elections.txt", sep='\t').head(3)
-```
-
-An issue with CSVs and TSVs comes up whenever there are commas or tabs within the records. How does `pandas` differentiate between a comma delimiter vs. a comma within the field itself, for example `8,900`? To remedy this, check out the [`quotechar` parameter](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html).
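-
-As a quick aside, here is a minimal sketch (using a tiny made-up CSV string rather than one of the lecture's files) of how quoting lets `pandas` keep an embedded comma inside a single field:
-
-```{python}
-#| code-fold: false
-from io import StringIO
-
-# the quoted field "Smith, Jr." survives parsing as one value, and thousands=',' parses "8,900" as a number
-tiny_csv = StringIO('Candidate,Votes\n"Smith, Jr.","8,900"\n')
-pd.read_csv(tiny_csv, quotechar='"', thousands=',')
-```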
-
-#### JSON
-**JSON (JavaScript Object Notation)** files behave similarly to Python dictionaries. A raw JSON is shown below.
-
-```{python}
-#| code-fold: false
-with open("data/elections.json", "r") as table:
- i = 0
- for row in table:
- print(row)
- i += 1
- if i > 8:
- break
-```
-
-JSON files can be loaded into `pandas` using `pd.read_json`.
-
-```{python}
-#| code-fold: false
-pd.read_json('data/elections.json').head(3)
-```
-
-##### EDA with JSON: Berkeley COVID-19 Data
-The City of Berkeley Open Data [website](https://data.cityofberkeley.info/Health/COVID-19-Confirmed-Cases/xn6j-b766) has a dataset with COVID-19 Confirmed Cases among Berkeley residents by date. Let's download the file and save it as a JSON (note the source URL file type is also a JSON). In the interest of reproducible data science, we will download the data programmatically. We have defined some helper functions in the [`ds100_utils.py`](https://ds100.org/fa23/resources/assets/lectures/lec05/lec05-eda.html) file so that we can reuse them in many different notebooks.
-
-```{python}
-#| code-fold: false
-from ds100_utils import fetch_and_cache
-
-covid_file = fetch_and_cache(
- "https://data.cityofberkeley.info/api/views/xn6j-b766/rows.json?accessType=DOWNLOAD",
- "confirmed-cases.json",
- force=False)
-covid_file # a file path wrapper object
-```
-
-###### File Size
-Let's start our analysis by getting a rough estimate of the size of the dataset to inform the tools we use to view the data. For relatively small datasets, we can use a text editor or spreadsheet. For larger datasets, more programmatic exploration or distributed computing tools may be more fitting. Here we will use `Python` tools to probe the file.
-
-Since this appears to be a text file, let's investigate the number of lines, which often corresponds to the number of records.
-
-```{python}
-#| code-fold: false
-import os
-
-print(covid_file, "is", os.path.getsize(covid_file) / 1e6, "MB")
-
-with open(covid_file, "r") as f:
- print(covid_file, "is", sum(1 for l in f), "lines.")
-```
-
-###### Unix Commands
-As part of the EDA workflow, Unix commands can come in very handy. In fact, there's an entire book called ["Data Science at the Command Line"](https://datascienceatthecommandline.com/) that explores this idea in depth!
-In Jupyter/IPython, you can prefix lines with `!` to execute arbitrary Unix commands, and within those lines, you can refer to `Python` variables and expressions with the syntax `{expr}`.
-
-Here, we use the `ls` command to list files, using the `-lh` flags, which request "long format with information in human-readable form." We also use the `wc` command for "word count," but with the `-l` flag, which asks for line counts instead of words.
-
-These two give us the same information as the code above, albeit in a slightly different form:
-
-```{python}
-#| code-fold: false
-!ls -lh {covid_file}
-!wc -l {covid_file}
-```
-
-###### File Contents
-Let's explore the data format using `Python`.
-
-```{python}
-#| code-fold: false
-with open(covid_file, "r") as f:
- for i, row in enumerate(f):
- print(repr(row)) # print raw strings
- if i >= 4: break
-```
-
-We can use the `head` Unix command (which is where `pandas`' `head` method comes from!) to see the first few lines of the file:
-
-```{python}
-#| code-fold: false
-!head -5 {covid_file}
-```
-
-In order to load the JSON file into `pandas`, let's first do some EDA with `Python`'s `json` package to understand the particular structure of this JSON file so that we can decide what (if anything) to load into `pandas`. `Python` has relatively good support for JSON data since it closely matches the internal `Python` object model. In the following cell, we import the entire JSON datafile into a `Python` dictionary using the `json` package.
-
-```{python}
-#| code-fold: false
-import json
-
-with open(covid_file, "rb") as f:
- covid_json = json.load(f)
-```
-
-The `covid_json` variable is now a dictionary encoding the data in the file:
-
-```{python}
-#| code-fold: false
-type(covid_json)
-```
-
-We can examine what keys are in the top level json object by listing out the keys.
-
-```{python}
-#| code-fold: false
-covid_json.keys()
-```
-
-**Observation**: The JSON dictionary contains a `meta` key, which likely refers to metadata (data about the data). Metadata is often maintained alongside the data and can be a good source of additional information.
-
-
-We can investigate the metadata further by examining its keys.
-
-```{python}
-#| code-fold: false
-covid_json['meta'].keys()
-```
-
-The `meta` key contains another dictionary called `view`. This likely refers to meta-data about a particular "view" of some underlying database. We will learn more about views when we study SQL later in the class.
-
-```{python}
-#| code-fold: false
-covid_json['meta']['view'].keys()
-```
-
-Notice that this is a nested/recursive data structure. As we dig deeper, we reveal more and more keys and the corresponding data:
-
-```
-meta
-|-> data
- | ... (haven't explored yet)
-|-> view
- | -> id
- | -> name
- | -> attribution
- ...
- | -> description
- ...
- | -> columns
- ...
-```
-
-
-There is a key called description in the view sub dictionary. This likely contains a description of the data:
-
-```{python}
-#| code-fold: false
-print(covid_json['meta']['view']['description'])
-```
-
-###### Examining the Data Field for Records
-
-We can look at a few entries in the `data` field. This is what we'll load into `pandas`.
-
-```{python}
-#| code-fold: false
-for i in range(3):
- print(f"{i:03} | {covid_json['data'][i]}")
-```
-
-Observations:
-* These look like equal-length records, so maybe `data` is a table!
-* But what does each value in the record mean? Where can we find the column headers?
-
-For that, we'll need the `columns` key in the metadata dictionary. This returns a list:
-
-```{python}
-#| code-fold: false
-type(covid_json['meta']['view']['columns'])
-```
-
-###### Summary of exploring the JSON file
-
-1. The above **metadata** tells us a lot about the columns in the data including column names, potential data anomalies, and a basic statistic.
-1. Because of its non-tabular structure, JSON makes it easier (than CSV) to create **self-documenting data**, meaning that information about the data is stored in the same file as the data.
-1. Self-documenting data can be helpful since it maintains its own description and these descriptions are more likely to be updated as data changes.
-
-###### Loading COVID Data into `pandas`
-Finally, let's load the data (not the metadata) into a `pandas` `DataFrame`. In the following block of code we:
-
-1. Translate the JSON records into a `DataFrame`:
-
- * fields: `covid_json['meta']['view']['columns']`
- * records: `covid_json['data']`
-
-
-1. Remove columns that have no metadata description. This would be a bad idea in general, but here we remove these columns since the above analysis suggests they are unlikely to contain useful information.
-
-1. Examine the `tail` of the table.
-
-```{python}
-#| code-fold: false
-# Load the data from JSON and assign column titles
-covid = pd.DataFrame(
- covid_json['data'],
- columns=[c['name'] for c in covid_json['meta']['view']['columns']])
-
-covid.tail()
-```
-
-### Variable Types
-
-After loading data into a file, it's a good idea to take the time to understand what pieces of information are encoded in the dataset. In particular, we want to identify what variable types are present in our data. Broadly speaking, we can categorize variables into one of two overarching types.
-
-**Quantitative variables** describe some numeric quantity or amount. We can divide quantitative data further into:
-
-* **Continuous quantitative variables**: numeric data that can be measured on a continuous scale to arbitrary precision. Continuous variables do not have a strict set of possible values – they can be recorded to any number of decimal places. For example, weights, GPA, or CO<sub>2</sub> concentrations.
-* **Discrete quantitative variables**: numeric data that can only take on a finite set of possible values. For example, someone's age or the number of siblings they have.
-
-**Qualitative variables**, also known as **categorical variables**, describe data that isn't measuring some quantity or amount. The sub-categories of categorical data are:
-
-* **Ordinal qualitative variables**: categories with ordered levels. Specifically, ordinal variables are those where the difference between levels has no consistent, quantifiable meaning. Some examples include levels of education (high school, undergrad, grad, etc.), income bracket (low, medium, high), or Yelp rating.
-* **Nominal qualitative variables**: categories with no specific order. For example, someone's political affiliation or Cal ID number.
-
-![Classification of variable types](images/variable.png)
-
-Note that many variables don't sit neatly in just one of these categories. Qualitative variables could have numeric levels, and conversely, quantitative variables could be stored as strings.
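-
-For instance, here is a minimal sketch (with made-up values and a hypothetical `DataFrame`) showing that the storage dtype alone doesn't determine the variable type:
-
-```{python}
-#| code-fold: false
-# "Cal ID" is stored as integers but is nominal; "Income" is stored as strings but is quantitative
-mixed_types = pd.DataFrame({
-    "Cal ID": [3034619471, 3035619472],
-    "Income": ["1,000", "2,500"],
-})
-mixed_types.dtypes
-```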
-
-### Primary and Foreign Keys
-
-Last time, we introduced `.merge` as the `pandas` method for joining multiple `DataFrame`s together. In our discussion of joins, we touched on the idea of using a "key" to determine what rows should be merged from each table. Let's take a moment to examine this idea more closely.
-
-The **primary key** is the column or set of columns in a table that *uniquely* determine the values of the remaining columns. It can be thought of as the unique identifier for each individual row in the table. For example, a table of Data 100 students might use each student's Cal ID as the primary key.
-
-```{python}
-#| echo: false
-pd.DataFrame({"Cal ID":[3034619471, 3035619472, 3025619473, 3046789372], \
- "Name":["Oski", "Ollie", "Orrie", "Ollie"], \
- "Major":["Data Science", "Computer Science", "Data Science", "Economics"]})
-```
-
-The **foreign key** is the column or set of columns in a table that reference primary keys in other tables. Knowing a dataset's foreign keys can be useful when assigning the `left_on` and `right_on` parameters of `.merge`. In the table of office hour tickets below, `"Cal ID"` is a foreign key referencing the previous table.
-
-```{python}
-#| echo: false
-pd.DataFrame({"OH Request":[1, 2, 3, 4], \
- "Cal ID":[3034619471, 3035619472, 3025619473, 3035619472], \
- "Question":["HW 2 Q1", "HW 2 Q3", "Lab 3 Q4", "HW 2 Q7"]})
-```
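-
-As a minimal sketch (assuming the two tables above were assigned to the hypothetical names `students` and `requests`), merging on the foreign key might look like this:
-
-```{python}
-#| code-fold: false
-# hypothetical names for the two tables displayed above
-students = pd.DataFrame({"Cal ID": [3034619471, 3035619472, 3025619473, 3046789372],
-                         "Name": ["Oski", "Ollie", "Orrie", "Ollie"],
-                         "Major": ["Data Science", "Computer Science", "Data Science", "Economics"]})
-requests = pd.DataFrame({"OH Request": [1, 2, 3, 4],
-                         "Cal ID": [3034619471, 3035619472, 3025619473, 3035619472],
-                         "Question": ["HW 2 Q1", "HW 2 Q3", "Lab 3 Q4", "HW 2 Q7"]})
-
-# each office hours ticket picks up the matching student's Name and Major via the "Cal ID" key
-requests.merge(right=students, on="Cal ID")
-```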
-
-## Granularity, Scope, and Temporality
-
-After understanding the structure of the dataset, the next task is to determine what exactly the data represents. We'll do so by considering the data's granularity, scope, and temporality.
-
-### Granularity
-The **granularity** of a dataset is what a single row represents. You can also think of it as the level of detail included in the data. To determine the data's granularity, ask: what does each row in the dataset represent? Fine-grained data contains a high level of detail, with a single row representing a small individual unit. For example, each record may represent one person. Coarse-grained data is encoded such that a single row represents a large individual unit – for example, each record may represent a group of people.
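-
-As a minimal sketch (on made-up data with hypothetical names), aggregating fine-grained rows produces a coarser-grained table:
-
-```{python}
-#| code-fold: false
-# fine-grained: one row per person
-people = pd.DataFrame({"City": ["Berkeley", "Berkeley", "Oakland"],
-                       "Age": [20, 30, 40]})
-
-# coarse-grained: one row per city
-people.groupby("City").agg(residents=("Age", "size"), mean_age=("Age", "mean"))
-```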
-
-### Scope
-The **scope** of a dataset is the subset of the population covered by the data. If we were investigating student performance in Data Science courses, a dataset with a narrow scope might encompass all students enrolled in Data 100 whereas a dataset with an expansive scope might encompass all students in California.
-
-### Temporality
-The **temporality** of a dataset describes the periodicity over which the data was collected as well as when the data was most recently collected or updated.
-
-Time and date fields of a dataset could represent a few things:
-
-1. when the "event" happened
-2. when the data was collected, or when it was entered into the system
-3. when the data was copied into the database
-
-To fully understand the temporality of the data, it also may be necessary to standardize time zones or inspect recurring time-based trends in the data (do patterns recur in 24-hour periods? Over the course of a month? Seasonally?). The convention for standardizing time is Coordinated Universal Time (UTC), an international time standard measured at 0 degrees longitude that stays consistent throughout the year (no daylight saving time). Berkeley's time zone, Pacific Standard Time (PST), corresponds to UTC-8; during daylight saving time, Pacific Daylight Time (PDT) corresponds to UTC-7.
-
-#### Temporality with `pandas`' `dt` accessors
-Let's briefly look at how we can use `pandas`' `dt` accessors to work with dates/times in a dataset using the dataset you'll see in Lab 3: the Berkeley PD Calls for Service dataset.
-
-```{python}
-#| code-fold: true
-calls = pd.read_csv("data/Berkeley_PD_-_Calls_for_Service.csv")
-calls.head()
-```
-
-Looks like there are three columns with dates/times: `EVENTDT`, `EVENTTM`, and `InDbDate`.
-
-Most likely, `EVENTDT` stands for the date when the event took place, `EVENTTM` stands for the time of day the event took place (in 24-hr format), and `InDbDate` is the date this call was recorded in the database.
-
-If we check the data type of these columns, we will see they are stored as strings. We can convert them to `datetime` objects using the `pandas` `to_datetime` function.
-
-```{python}
-#| code-fold: false
-calls["EVENTDT"] = pd.to_datetime(calls["EVENTDT"])
-calls.head()
-```
-
-Now, we can use the `dt` accessor on this column.
-
-We can get the month:
-
-```{python}
-#| code-fold: false
-calls["EVENTDT"].dt.month.head()
-```
-
-Which day of the week the date is on:
-
-```{python}
-#| code-fold: false
-calls["EVENTDT"].dt.dayofweek.head()
-```
-
-Check the minimum values to see if there are any suspicious-looking dates from the 1970s:
-
-```{python}
-#| code-fold: false
-calls.sort_values("EVENTDT").head()
-```
-
-Doesn't look like it! We are good!
-
-
-We can also do many things with the `dt` accessor like switching time zones and converting time back to UNIX/POSIX time. Check out the documentation on [`.dt` accessor](https://pandas.pydata.org/docs/user_guide/basics.html#basics-dt-accessors) and [time series/date functionality](https://pandas.pydata.org/docs/user_guide/timeseries.html#).
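-
-For example, here is a minimal sketch of the time zone conversion mentioned above (assuming the `EVENTDT` timestamps were recorded in Pacific time):
-
-```{python}
-#| code-fold: false
-# attach a Pacific time zone to EVENTDT, then convert the timestamps to UTC
-calls["EVENTDT"].dt.tz_localize("US/Pacific").dt.tz_convert("UTC").head()
-```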
-
-## Faithfulness
-
-At this stage in our data cleaning and EDA workflow, we've achieved quite a lot: we've identified how our data is structured, come to terms with what information it encodes, and gained insight as to how it was generated. Throughout this process, we should always recall the original intent of our work in Data Science – to use data to better understand and model the real world. To achieve this goal, we need to ensure that the data we use is faithful to reality; that is, that our data accurately captures the "real world."
-
-Data used in research or industry is often "messy" – there may be errors or inaccuracies that impact the faithfulness of the dataset. Signs that data may not be faithful include:
-
-* Unrealistic or "incorrect" values, such as negative counts, locations that don't exist, or dates set in the future
-* Violations of obvious dependencies, like an age that does not match a birthday
-* Clear signs that data was entered by hand, which can lead to spelling errors or fields that are incorrectly shifted
-* Signs of data falsification, such as fake email addresses or repeated use of the same names
-* Duplicated records or fields containing the same information
-* Truncated data, e.g. older versions of Microsoft Excel limited spreadsheets to 65,536 rows and 256 columns
-
-We often solve some of these more common issues in the following ways (a small sketch follows the list):
-
-* Spelling errors: apply corrections or drop records that aren't in a dictionary
-* Time zone inconsistencies: convert to a common time zone (e.g. UTC)
-* Duplicated records or fields: identify and eliminate duplicates (using primary keys)
-* Unspecified or inconsistent units: infer the units and check that values are in reasonable ranges in the data
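-
-Here is a minimal sketch (on made-up records with a hypothetical `Cal ID` primary key) of two of these fixes: de-duplicating on a primary key and range-checking a field:
-
-```{python}
-#| code-fold: false
-records = pd.DataFrame({"Cal ID": [1, 1, 2, 3],
-                        "Age":    [20, 20, 21, -3]})
-
-# drop repeated primary keys, then keep only plausible ages
-deduped = records.drop_duplicates(subset="Cal ID")
-deduped[deduped["Age"].between(0, 120)]
-```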
-
-### Missing Values
-Another common issue encountered with real-world datasets is that of missing data. One strategy to resolve this is to simply drop any records with missing values from the dataset. This does, however, introduce the risk of inducing biases – it is possible that the missing or corrupt records may be systemically related to some feature of interest in the data. Another solution is to keep the data as `NaN` values.
-
-A third method to address missing data is to perform **imputation**: infer the missing values using other data available in the dataset. There is a wide variety of imputation techniques that can be implemented; some of the most common are listed below.
-
-* Average imputation: replace missing values with the average value for that field
-* Hot deck imputation: replace missing values with a value drawn at random from similar records
-* Regression imputation: develop a model to predict missing values
-* Multiple imputation: replace missing values with multiple random values
-
-Regardless of the strategy used to deal with missing data, we should think carefully about *why* particular records or fields may be missing – this can help inform whether or not the absence of these values is significant or meaningful.
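-
-As a minimal sketch (on a made-up `Series`, not a recommendation for any particular dataset), average imputation and interpolation can each be expressed in one line of `pandas`:
-
-```{python}
-#| code-fold: false
-vals = pd.Series([1.0, np.nan, 3.0, np.nan, 5.0])
-
-# average imputation: fill every gap with the overall mean
-# interpolation: fill each gap based on its neighboring values
-pd.DataFrame({"original": vals,
-              "mean imputed": vals.fillna(vals.mean()),
-              "interpolated": vals.interpolate()})
-```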
-
-# EDA Demo 1: Tuberculosis in the United States
-
-Now, let's walk through the data-cleaning and EDA workflow to see what can we learn about the presence of Tuberculosis in the United States!
-
-We will examine the data included in the [original CDC article](https://www.cdc.gov/mmwr/volumes/71/wr/mm7112a1.htm?s_cid=mm7112a1_w#T1_down) published in 2021.
-
-
-## CSVs and Field Names
-Suppose Table 1 was saved as a CSV file located in `data/cdc_tuberculosis.csv`.
-
-We can then explore the CSV (which is a text file, and does not contain binary-encoded data) in many ways:
-
-1. Using a text editor like emacs, vim, VSCode, etc.
-2. Opening the CSV directly in DataHub (read-only), Excel, Google Sheets, etc.
-3. The `Python` file object
-4. `pandas`, using `pd.read_csv()`
-
-To try out options 1 and 2, you can view or download the Tuberculosis dataset from the [lecture demo notebook](https://data100.datahub.berkeley.edu/hub/user-redirect/git-pull?repo=https%3A%2F%2Fgithub.com%2FDS-100%2Ffa23-student&urlpath=lab%2Ftree%2Ffa23-student%2Flecture%2Flec05%2Flec04-eda.ipynb&branch=main) under the `data` folder in the left-hand menu. Notice how the CSV file is a type of **rectangular data (i.e., tabular data) stored as comma-separated values**.
-
-Next, let's try out option 3 using the `Python` file object. We'll look at the first four lines:
-
-```{python}
-#| code-fold: true
-with open("data/cdc_tuberculosis.csv", "r") as f:
- i = 0
- for row in f:
- print(row)
- i += 1
- if i > 3:
- break
-```
-
-Whoa, why are there blank lines interspaced between the lines of the CSV?
-
-You may recall that all line breaks in text files are encoded as the special newline character `\n`. Python's `print()` prints each string (including the newline), and an additional newline on top of that.
-
-If you're curious, we can use the `repr()` function to return the raw string with all special characters:
-
-```{python}
-#| code-fold: true
-with open("data/cdc_tuberculosis.csv", "r") as f:
- i = 0
- for row in f:
- print(repr(row)) # print raw strings
- i += 1
- if i > 3:
- break
-```
-
-Finally, let's try option 4 and use the tried-and-true Data 100 approach: `pandas`.
-
-```{python}
-#| code-fold: false
-tb_df = pd.read_csv("data/cdc_tuberculosis.csv")
-tb_df.head()
-```
-
-You may notice some strange things about this table: what's up with the "Unnamed" column names and the first row?
-
-Congratulations — you're ready to wrangle your data! Because of how things are stored, we'll need to clean the data a bit to name our columns better.
-
-A reasonable first step is to identify the row with the right header. The `pd.read_csv()` function ([documentation](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html)) has the convenient `header` parameter that we can set to use the elements in row 1 as the appropriate columns:
-
-```{python}
-#| code-fold: false
-tb_df = pd.read_csv("data/cdc_tuberculosis.csv", header=1) # row index
-tb_df.head(5)
-```
-
-Wait...but now we can't differentiate between the "Number of TB cases" and "TB incidence" year columns. `pandas` has tried to make our lives easier by automatically adding ".1" to the latter columns, but this doesn't help us, as humans, understand the data.
-
-We can do this manually with `df.rename()` ([documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.rename.html?highlight=rename#pandas.DataFrame.rename)):
-
-```{python}
-#| code-fold: false
-rename_dict = {'2019': 'TB cases 2019',
- '2020': 'TB cases 2020',
- '2021': 'TB cases 2021',
- '2019.1': 'TB incidence 2019',
- '2020.1': 'TB incidence 2020',
- '2021.1': 'TB incidence 2021'}
-tb_df = tb_df.rename(columns=rename_dict)
-tb_df.head(5)
-```
-
-## Record Granularity
-
-You might already be wondering: what's up with that first record?
-
-Row 0 is what we call a **rollup record**, or summary record. It's often useful when displaying tables to humans. The **granularity** of record 0 (Totals) vs the rest of the records (States) is different.
-
-Okay, EDA step two. How was the rollup record aggregated?
-
-Let's check if Total TB cases is the sum of all state TB cases. If we sum over all rows, we should get **2x** the total cases in each of our TB cases by year (why do you think this is?).
-
-```{python}
-#| code-fold: true
-tb_df.sum(axis=0)
-```
-
-Whoa, what's going on with the TB cases in 2019, 2020, and 2021? Check out the column types:
-
-```{python}
-#| code-fold: true
-tb_df.dtypes
-```
-
-Since there are commas in the values for TB cases, the numbers are read as the `object` datatype, or **storage type** (close to the `Python` string datatype), so `pandas` is concatenating strings instead of adding integers (recall that `Python` can "sum", or concatenate, strings together: `"data" + "100"` evaluates to `"data100"`).
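-
-One manual fix (a quick sketch of the idea, though not the route we take below) would be to strip the commas ourselves and cast the strings to integers:
-
-```{python}
-#| code-fold: false
-# remove the thousands separators, then convert the strings to integers
-tb_df["TB cases 2019"].str.replace(",", "").astype(int).head()
-```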
-
-
-Fortunately `read_csv` also has a `thousands` parameter ([documentation](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html)):
-
-```{python}
-#| code-fold: false
-# improve readability: chaining method calls with outer parentheses/line breaks
-tb_df = (
- pd.read_csv("data/cdc_tuberculosis.csv", header=1, thousands=',')
- .rename(columns=rename_dict)
-)
-tb_df.head(5)
-```
-
-```{python}
-#| code-fold: false
-tb_df.sum()
-```
-
-The Total TB cases look right. Phew!
-
-Let's just look at the records with **state-level granularity**:
-
-```{python}
-#| code-fold: true
-state_tb_df = tb_df[1:]
-state_tb_df.head(5)
-```
-
-## Gather Census Data
-
-U.S. Census population estimates [source](https://www.census.gov/data/tables/time-series/demo/popest/2010s-state-total.html) (2019), [source](https://www.census.gov/data/tables/time-series/demo/popest/2020s-state-total.html) (2020-2021).
-
-Running the below cells cleans the data.
-There are a few new methods here:
-
-* `df.convert_dtypes()` ([documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.convert_dtypes.html)) conveniently converts each column to the best possible dtype (e.g., whole-number floats become integers) and is out of scope for the class.
-* `df.dropna()` ([documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.dropna.html)) will be explained in more detail next time.
-
-```{python}
-#| code-fold: true
-# 2010s census data
-census_2010s_df = pd.read_csv("data/nst-est2019-01.csv", header=3, thousands=",")
-census_2010s_df = (
- census_2010s_df
- .reset_index()
- .drop(columns=["index", "Census", "Estimates Base"])
- .rename(columns={"Unnamed: 0": "Geographic Area"})
- .convert_dtypes() # "smart" converting of columns, use at your own risk
- .dropna() # we'll introduce this next time
-)
-census_2010s_df['Geographic Area'] = census_2010s_df['Geographic Area'].str.strip('.')
-
-# with pd.option_context('display.min_rows', 30): # shows more rows
-# display(census_2010s_df)
-
-census_2010s_df.head(5)
-```
-
-Occasionally, you will want to modify code that you have imported. To reimport those modifications, you can either use `Python`'s `importlib` library:
-
-```python
-from importlib import reload
-reload(utils)
-```
-
-or use `iPython` magic which will intelligently import code when files change:
-
-```python
-%load_ext autoreload
-%autoreload 2
-```
-
-```{python}
-#| code-fold: true
-# census 2020s data
-census_2020s_df = pd.read_csv("data/NST-EST2022-POP.csv", header=3, thousands=",")
-census_2020s_df = (
- census_2020s_df
- .reset_index()
- .drop(columns=["index", "Unnamed: 1"])
- .rename(columns={"Unnamed: 0": "Geographic Area"})
- .convert_dtypes() # "smart" converting of columns, use at your own risk
- .dropna() # we'll introduce this next time
-)
-census_2020s_df['Geographic Area'] = census_2020s_df['Geographic Area'].str.strip('.')
-
-census_2020s_df.head(5)
-```
-
-## Joining Data (Merging `DataFrame`s)
-
-Time to `merge`! Here we use the `DataFrame` method `df1.merge(right=df2, ...)` on `DataFrame` `df1` ([documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.merge.html)). Contrast this with the function `pd.merge(left=df1, right=df2, ...)` ([documentation](https://pandas.pydata.org/docs/reference/api/pandas.merge.html?highlight=pandas%20merge#pandas.merge)). Feel free to use either.
-
-```{python}
-#| code-fold: false
-# merge TB DataFrame with two US census DataFrames
-tb_census_df = (
- tb_df
- .merge(right=census_2010s_df,
- left_on="U.S. jurisdiction", right_on="Geographic Area")
- .merge(right=census_2020s_df,
- left_on="U.S. jurisdiction", right_on="Geographic Area")
-)
-tb_census_df.head(5)
-```
-
-Having all of these columns is a little unwieldy. We could either drop the unneeded columns now, or just merge on smaller census `DataFrame`s. Let's do the latter.
-
-```{python}
-#| code-fold: false
-# try merging again, but cleaner this time
-tb_census_df = (
- tb_df
- .merge(right=census_2010s_df[["Geographic Area", "2019"]],
- left_on="U.S. jurisdiction", right_on="Geographic Area")
- .drop(columns="Geographic Area")
- .merge(right=census_2020s_df[["Geographic Area", "2020", "2021"]],
- left_on="U.S. jurisdiction", right_on="Geographic Area")
- .drop(columns="Geographic Area")
-)
-tb_census_df.head(5)
-```
-
-## Reproducing Data: Compute Incidence
-
-Let's recompute incidence to make sure we know where the original CDC numbers came from.
-
-From the [CDC report](https://www.cdc.gov/mmwr/volumes/71/wr/mm7112a1.htm?s_cid=mm7112a1_w#T1_down): TB incidence is computed as “Cases per 100,000 persons using mid-year population estimates from the U.S. Census Bureau.”
-
-If we define a group as 100,000 people, then we can compute the TB incidence for a given state population as
-
-$$\text{TB incidence} = \frac{\text{TB cases in population}}{\text{groups in population}} = \frac{\text{TB cases in population}}{\text{population}/100000} $$
-
-$$= \frac{\text{TB cases in population}}{\text{population}} \times 100000$$
-
-Let's try this for 2019:
-
-```{python}
-#| code-fold: false
-tb_census_df["recompute incidence 2019"] = tb_census_df["TB cases 2019"]/tb_census_df["2019"]*100000
-tb_census_df.head(5)
-```
-
-Awesome!!!
-
-Let's use a for-loop and `Python` format strings to compute TB incidence for all years. `Python` f-strings are just used for the purposes of this demo, but they're handy to know when you explore data beyond this course ([documentation](https://docs.python.org/3/tutorial/inputoutput.html)).
-
-```{python}
-#| code-fold: false
-# recompute incidence for all years
-for year in [2019, 2020, 2021]:
- tb_census_df[f"recompute incidence {year}"] = tb_census_df[f"TB cases {year}"]/tb_census_df[f"{year}"]*100000
-tb_census_df.head(5)
-```
-
-These numbers look pretty close!!! There are a few errors in the hundredths place, particularly in 2021. It may be useful to further explore reasons behind this discrepancy.
-
-```{python}
-#| code-fold: false
-tb_census_df.describe()
-```
-
-## Bonus EDA: Reproducing the Reported Statistic
-
-
-**How do we reproduce that reported statistic in the original [CDC report](https://www.cdc.gov/mmwr/volumes/71/wr/mm7112a1.htm?s_cid=mm7112a1_w)?**
-
-> Reported TB incidence (cases per 100,000 persons) increased **9.4%**, from **2.2** during 2020 to **2.4** during 2021 but was lower than incidence during 2019 (2.7). Increases occurred among both U.S.-born and non–U.S.-born persons.
-
-This is TB incidence computed across the entire U.S. population! How do we reproduce this?
-* We need to reproduce the "Total" TB incidences in our rolled record.
-* But our current `tb_census_df` only has 51 entries (50 states plus Washington, D.C.). There is no rolled record.
-* What happened...?
-
-Let's get exploring!
-
-Before we keep exploring, we'll set all indexes to more meaningful values, instead of just numbers that pertain to some row at some point. This will make our cleaning slightly easier.
-
-```{python}
-#| code-fold: true
-tb_df = tb_df.set_index("U.S. jurisdiction")
-tb_df.head(5)
-```
-
-```{python}
-#| code-fold: false
-census_2010s_df = census_2010s_df.set_index("Geographic Area")
-census_2010s_df.head(5)
-```
-
-```{python}
-#| code-fold: false
-census_2020s_df = census_2020s_df.set_index("Geographic Area")
-census_2020s_df.head(5)
-```
-
-It turns out that our merge above only kept state records, even though our original `tb_df` had the "Total" rolled record:
-
-```{python}
-#| code-fold: false
-tb_df.head()
-```
-
-Recall that `merge` performs an **inner** merge by default, meaning that it only preserves keys that are present in **both** `DataFrame`s.
-
-The rolled records in our census `DataFrame` have different `Geographic Area` fields, which was the key we merged on:
-
-```{python}
-#| code-fold: false
-census_2010s_df.head(5)
-```
-
-The Census `DataFrame` has several rolled records. The aggregate record we are looking for actually has the Geographic Area named "United States".
-
-One straightforward way to get the right merge is to rename the value itself. Because we now have the Geographic Area index, we'll use `df.rename()` ([documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.rename.html)):
-
-```{python}
-#| code-fold: false
-# rename rolled record for 2010s
-census_2010s_df.rename(index={'United States':'Total'}, inplace=True)
-census_2010s_df.head(5)
-```
-
-```{python}
-#| code-fold: false
-# same, but for 2020s rename rolled record
-census_2020s_df.rename(index={'United States':'Total'}, inplace=True)
-census_2020s_df.head(5)
-```
-
-<br/>
-
-Next let's rerun our merge. Note the different chaining, because we are now merging on indexes (`df.merge()` [documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.merge.html)).
-
-```{python}
-#| code-fold: false
-tb_census_df = (
- tb_df
- .merge(right=census_2010s_df[["2019"]],
- left_index=True, right_index=True)
- .merge(right=census_2020s_df[["2020", "2021"]],
- left_index=True, right_index=True)
-)
-tb_census_df.head(5)
-```
-
-<br/>
-
-Finally, let's recompute our incidences:
-
-```{python}
-#| code-fold: false
-# recompute incidence for all years
-for year in [2019, 2020, 2021]:
- tb_census_df[f"recompute incidence {year}"] = tb_census_df[f"TB cases {year}"]/tb_census_df[f"{year}"]*100000
-tb_census_df.head(5)
-```
-
-We reproduced the total U.S. incidences correctly!
-
-We're almost there. Let's revisit the quote:
-
-> Reported TB incidence (cases per 100,000 persons) increased **9.4%**, from **2.2** during 2020 to **2.4** during 2021 but was lower than incidence during 2019 (2.7). Increases occurred among both U.S.-born and non–U.S.-born persons.
-
-Recall that percent change from $A$ to $B$ is computed as
-$\text{percent change} = \frac{B - A}{A} \times 100$.
-
-```{python}
-#| code-fold: false
-#| tags: []
-incidence_2020 = tb_census_df.loc['Total', 'recompute incidence 2020']
-incidence_2020
-```
-
-```{python}
-#| code-fold: false
-#| tags: []
-incidence_2021 = tb_census_df.loc['Total', 'recompute incidence 2021']
-incidence_2021
-```
-
-```{python}
-#| code-fold: false
-#| tags: []
-difference = (incidence_2021 - incidence_2020)/incidence_2020 * 100
-difference
-```
-
-# EDA Demo 2: Mauna Loa CO<sub>2</sub> Data -- A Lesson in Data Faithfulness
-
-[Mauna Loa Observatory](https://gml.noaa.gov/ccgg/trends/data.html) has been monitoring CO<sub>2</sub> concentrations since 1958.
-
-```{python}
-#| code-fold: false
-co2_file = "data/co2_mm_mlo.txt"
-```
-
-Let's do some **EDA**!!
-
-## Reading this file into Pandas?
-Let's instead check out this `.txt` file. Some questions to keep in mind: Do we trust this file extension? What structure does the file have?
-
-Lines 71-78 (inclusive) are shown below:
-
- line number | file contents
-
- 71 | # decimal average interpolated trend #days
- 72 | # date (season corr)
- 73 | 1958 3 1958.208 315.71 315.71 314.62 -1
- 74 | 1958 4 1958.292 317.45 317.45 315.29 -1
- 75 | 1958 5 1958.375 317.50 317.50 314.71 -1
- 76 | 1958 6 1958.458 -99.99 317.10 314.85 -1
- 77 | 1958 7 1958.542 315.86 315.86 314.98 -1
- 78 | 1958 8 1958.625 314.93 314.93 315.94 -1
-
-
-Notice how:
-
-- The values are separated by white space, possibly tabs.
-- The values line up down the rows. For example, the month appears in the 7th to 8th position of each line.
-- The 71st and 72nd lines in the file contain column headings split over two lines.
-
-We can use `read_csv` to read the data into a `pandas` `DataFrame`, and we provide several arguments to specify that the separators are white space, there is no header (**we will set our own column names**), and to skip the first 72 rows of the file.
-
-```{python}
-#| code-fold: false
-co2 = pd.read_csv(
- co2_file, header = None, skiprows = 72,
-    sep = r'\s+' # delimiter for continuous whitespace (stay tuned for regex next lecture)
-)
-co2.head()
-```
-
-Congratulations! You've wrangled the data!
-
-<br/>
-
-...But our columns aren't named.
-**We need to do more EDA.**
-
-## Exploring Variable Feature Types
-
-The NOAA [webpage](https://gml.noaa.gov/ccgg/trends/) might have some useful tidbits (in this case it doesn't).
-
-Using this information, we'll rerun `pd.read_csv`, but this time with some **custom column names.**
-
-```{python}
-#| code-fold: false
-co2 = pd.read_csv(
- co2_file, header = None, skiprows = 72,
-    sep = r'\s+', # regex for continuous whitespace (next lecture)
- names = ['Yr', 'Mo', 'DecDate', 'Avg', 'Int', 'Trend', 'Days']
-)
-co2.head()
-```
-
-## Visualizing CO<sub>2</sub>
-Scientific studies tend to have very clean data, right...? Let's jump right in and make a time series plot of CO2 monthly averages.
-
-```{python}
-#| code-fold: true
-sns.lineplot(x='DecDate', y='Avg', data=co2);
-```
-
-The code above uses the `seaborn` plotting library (abbreviated `sns`). We will cover this in the Visualization lecture, but for now you don't need to worry about how it works!
-
-Yikes! Plotting the data uncovered a problem. The sharp vertical lines suggest that we have some **missing values**. What happened here?
-
-```{python}
-#| code-fold: false
-co2.head()
-```
-
-```{python}
-#| code-fold: false
-co2.tail()
-```
-
-Some data have unusual values like -1 and -99.99.
-
-Let's check the description at the top of the file again.
-
-* -1 signifies a missing value for the number of days `Days` the equipment was in operation that month.
-* -99.99 denotes a missing monthly average `Avg`
-
-How can we fix this? First, let's explore other aspects of our data. Understanding our data will help us decide what to do with the missing values.
-
-<br/>
-
-
-## Sanity Checks: Reasoning about the data
-First, we consider the shape of the data. How many rows should we have?
-
-* If the data is in chronological order, we should have one record per month.
-* The data spans March 1958 to August 2019.
-* That is 62 years of months, minus January and February of 1958 and September through December of 2019, so we should have $ 12 \times (2019-1957) - 2 - 4 = 738 $ records.
-
-```{python}
-#| code-fold: false
-co2.shape
-```
-
-Nice!! The number of rows (i.e. records) matches our expectations.
-
-<br/>
-
-
-Let's now check the quality of each feature.
-
-## Understanding Missing Value 1: `Days`
-`Days` is a time field, so let's analyze other time fields to see if there is an explanation for missing values of days of operation.
-
-Let's start with **months**, `Mo`.
-
-Are we missing any records? Each month should appear 61 or 62 times (March 1958-August 2019).
-
-```{python}
-#| code-fold: false
-co2["Mo"].value_counts().sort_index()
-```
-
-As expected, Jan, Feb, Sep, Oct, Nov, and Dec have 61 occurrences, and the rest have 62.
-
-<br/>
-
-Next let's explore **days** `Days` itself, which is the number of days that the measurement equipment worked.
-
-```{python}
-#| code-fold: true
-sns.displot(co2['Days']);
-plt.title("Distribution of days feature"); # suppresses unneeded plotting output
-```
-
-In terms of data quality, a handful of months have averages based on measurements taken on fewer than half the days. In addition, there are nearly 200 missing values--**that's about 27% of the data**!
-
-<br/>
-
-Finally, let's check the last time feature, **year** `Yr`.
-
-Let's check to see if there is any connection between missing-ness and the year of the recording.
-
-```{python}
-#| code-fold: true
-sns.scatterplot(x="Yr", y="Days", data=co2);
-plt.title("Day field by Year"); # the ; suppresses output
-```
-
-**Observations**:
-
-* All of the missing data are in the early years of operation.
-* It appears there may have been problems with equipment in the mid to late 80s.
-
-**Potential Next Steps**:
-
-* Confirm these explanations through documentation about the historical readings.
-* Maybe drop the earliest recordings? However, we would want to delay such action until after we have examined the time trends and assessed whether there are any potential problems.
-
-<br/>
-
-## Understanding Missing Value 2: `Avg`
-Next, let's return to the -99.99 values in `Avg` to analyze the overall quality of the CO2 measurements. We'll plot a histogram of the average CO<sub>2</sub> measurements.
-
-```{python}
-#| code-fold: true
-# Histograms of average CO2 measurements
-sns.displot(co2['Avg']);
-```
-
-The non-missing values are in the 300-400 range (a regular range of CO2 levels).
-
-We also see that there are only a few missing `Avg` values (**<1% of values**). Let's examine all of them:
-
-```{python}
-#| code-fold: false
-co2[co2["Avg"] < 0]
-```
-
-There doesn't seem to be a pattern to these values, other than that most records also were missing `Days` data.
-
-## Drop, `NaN`, or Impute Missing `Avg` Data?
-
-How should we address the invalid `Avg` data?
-
-1. Drop records
-2. Set to NaN
-3. Impute using some strategy
-
-Remember we want to fix the following plot:
-
-```{python}
-#| code-fold: true
-sns.lineplot(x='DecDate', y='Avg', data=co2)
-plt.title("CO2 Average By Month");
-```
-
-Since we are plotting `Avg` vs `DecDate`, we should just focus on dealing with missing values for `Avg`.
-
-
-Let's consider a few options:
-
-1. Drop those records
-2. Replace -99.99 with NaN
-3. Substitute a likely value for the average CO2
-
-What do you think are the pros and cons of each possible action?
-
-<br/>
-
-
-Let's examine each of these three options.
-
-```{python}
-#| code-fold: false
-# 1. Drop missing values
-co2_drop = co2[co2['Avg'] > 0]
-co2_drop.head()
-```
-
-```{python}
-#| code-fold: false
-# 2. Replace -99.99 with NaN
-co2_NA = co2.replace(-99.99, np.nan)
-co2_NA.head()
-```
-
-We'll also use a third version of the data.
-
-First, we note that the dataset already comes with a **substitute value** for the -99.99.
-
-From the file description:
-
-> The `interpolated` column includes average values from the preceding column (`average`)
-and **interpolated values** where data are missing. Interpolated values are
-computed in two steps...
-
-The `Int` feature has values that exactly match those in `Avg`, except when `Avg` is -99.99, and then a **reasonable** estimate is used instead.
-
-So, the third version of our data will use the `Int` feature instead of `Avg`.
-
-```{python}
-#| code-fold: false
-# 3. Use interpolated column which estimates missing Avg values
-co2_impute = co2.copy()
-co2_impute['Avg'] = co2['Int']
-co2_impute.head()
-```
-
-What's a **reasonable** estimate?
-
-To answer this question, let's zoom in on a short time period, say the measurements in 1958 (where we know we have two missing values).
-
-```{python}
-#| code-fold: true
-# results of plotting data in 1958
-
-def line_and_points(data, ax, title):
- # assumes single year, hence Mo
- ax.plot('Mo', 'Avg', data=data)
- ax.scatter('Mo', 'Avg', data=data)
- ax.set_xlim(2, 13)
- ax.set_title(title)
- ax.set_xticks(np.arange(3, 13))
-
-def data_year(data, year):
-    # filter the data down to a single year
-    return data[data["Yr"] == year]
-
-# uses matplotlib subplots
-# you may see more next week; focus on output for now
-fig, axes = plt.subplots(ncols = 3, figsize=(12, 4), sharey=True)
-
-year = 1958
-line_and_points(data_year(co2_drop, year), axes[0], title="1. Drop Missing")
-line_and_points(data_year(co2_NA, year), axes[1], title="2. Missing Set to NaN")
-line_and_points(data_year(co2_impute, year), axes[2], title="3. Missing Interpolated")
-
-fig.suptitle(f"Monthly Averages for {year}")
-plt.tight_layout()
-```
-
-In the big picture, since only 7 `Avg` values are missing (**<1%** of 738 months), any of these approaches would work.
-
-However, there is some appeal to **option 3: imputing**:
-
-* It preserves the seasonal trends in CO2.
-* Since we are plotting all months in our data as a line plot, imputation avoids breaks in the line.
-
-<br/>
-
-
-Let's replot our original figure with option 3:
-
-```{python}
-#| code-fold: true
-sns.lineplot(x='DecDate', y='Avg', data=co2_impute)
-plt.title("CO2 Average By Month, Imputed");
-```
-
-Looks pretty close to what we see on the NOAA [website](https://gml.noaa.gov/ccgg/trends/)!
-
-## Presenting the data: A Discussion on Data Granularity
-
-From the description:
-
-* Monthly measurements are averages of daily average measurements.
-* The NOAA GML website has datasets for daily/hourly measurements too.
-
-The data you present depends on your research question.
-
-**How do CO2 levels vary by season?**
-
-* You might want to keep average monthly data.
-
-**Are CO2 levels rising over the past 50+ years, consistent with global warming predictions?**
-
-* You might be happier with a **coarser granularity** of average year data!
-
-```{python}
-#| code-fold: true
-co2_year = co2_impute.groupby('Yr').mean()
-sns.lineplot(x='Yr', y='Avg', data=co2_year)
-plt.title("CO2 Average By Year");
-```
-
-Indeed, we see a rise by nearly 100 ppm of CO2 since Mauna Loa began recording in 1958.
-
-# Summary
-We went over a lot of content this lecture; let's summarize the most important points:
-
-## Dealing with Missing Values
-There are a few options we can take to deal with missing data:
-
-* Drop missing records
-* Keep `NaN` missing values
-* Impute using an interpolated column
-
-## EDA and Data Wrangling
-There are several ways to approach EDA and Data Wrangling:
-
-* Examine the **data and metadata**: what is the date, size, organization, and structure of the data?
-* Examine each **field/attribute/dimension** individually.
-* Examine pairs of related dimensions (e.g. breaking down grades by major).
-* Along the way, we can:
- * **Visualize** or summarize the data.
- * **Validate assumptions** about data and its collection process. Pay particular attention to when the data was collected.
- * Identify and **address anomalies**.
- * Apply data transformations and corrections (we'll cover this in the upcoming lecture).
- * **Record everything you do!** Developing in Jupyter Notebook promotes *reproducibility* of your own work!
+---
+title: Data Cleaning and EDA
+execute:
+ echo: true
+format:
+ html:
+ code-fold: true
+ code-tools: true
+ toc: true
+ toc-title: Data Cleaning and EDA
+ page-layout: full
+ theme:
+ - cosmo
+ - cerulean
+ callout-icon: false
+jupyter: python3
+---
+
+```{python}
+#| code-fold: true
+import numpy as np
+import pandas as pd
+
+import matplotlib.pyplot as plt
+import seaborn as sns
+#%matplotlib inline
+plt.rcParams['figure.figsize'] = (12, 9)
+
+sns.set()
+sns.set_context('talk')
+np.set_printoptions(threshold=20, precision=2, suppress=True)
+pd.set_option('display.max_rows', 30)
+pd.set_option('display.max_columns', None)
+pd.set_option('display.precision', 2)
+# This option stops scientific notation for pandas
+pd.set_option('display.float_format', '{:.2f}'.format)
+
+# Silence some spurious seaborn warnings
+import warnings
+warnings.filterwarnings("ignore", category=FutureWarning)
+```
+
+::: {.callout-note collapse="false"}
+## Learning Outcomes
+* Recognize common file formats
+* Categorize data by its variable type
+* Build awareness of issues with data faithfulness and develop targeted solutions
+:::
+
+**This content is covered in lectures 4, 5, and 6.**
+
+In the past few lectures, we've learned that `pandas` is a toolkit to restructure, modify, and explore a dataset. What we haven't yet touched on is *how* to make these data transformation decisions. When we receive a new set of data from the "real world," how do we know what processing we should do to convert this data into a usable form?
+
+**Data cleaning**, also called **data wrangling**, is the process of transforming raw data to facilitate subsequent analysis. It is often used to address issues like:
+
+* Unclear structure or formatting
+* Missing or corrupted values
+* Unit conversions
+* ...and so on
+
+**Exploratory Data Analysis (EDA)** is the process of understanding a new dataset. It is an open-ended, informal analysis that involves familiarizing ourselves with the variables present in the data, discovering potential hypotheses, and identifying possible issues with the data. This last point can often motivate further data cleaning to address any problems with the dataset's format; because of this, EDA and data cleaning are often thought of as an "infinite loop," with each process driving the other.
+
+In this lecture, we will consider the key properties of data to consider when performing data cleaning and EDA. In doing so, we'll develop a "checklist" of sorts for you to consider when approaching a new dataset. Throughout this process, we'll build a deeper understanding of this early (but very important!) stage of the data science lifecycle.
+
+## Structure
+
+### File Formats
+There are many file types for storing structured data: TSV, JSON, XML, ASCII, SAS, etc. We'll only cover CSV, TSV, and JSON in lecture, but you'll likely encounter other formats as you work with different datasets. Reading documentation is your best bet for understanding how to process the multitude of different file types.
+
+#### CSV
+CSVs, which stand for **Comma-Separated Values**, are a common tabular data format.
+In the past two `pandas` lectures, we briefly touched on the idea of file format: the way data is encoded in a file for storage. Specifically, our `elections` and `babynames` datasets were stored and loaded as CSVs:
+
+```{python}
+#| code-fold: false
+pd.read_csv("data/elections.csv").head(5)
+```
+
+To better understand the properties of a CSV, let's take a look at the first few rows of the raw data file to see what it looks like before being loaded into a `DataFrame`. We'll use the `repr()` function to return the raw string with its special characters:
+
+```{python}
+#| code-fold: false
+with open("data/elections.csv", "r") as table:
+ i = 0
+ for row in table:
+ print(repr(row))
+ i += 1
+ if i > 3:
+ break
+```
+
+Each row, or **record**, in the data is delimited by a newline `\n`. Each column, or **field**, in the data is delimited by a comma `,` (hence, comma-separated!).
+
+#### TSV
+
+Another common file type is **TSV (Tab-Separated Values)**. In a TSV, records are still delimited by a newline `\n`, while fields are delimited by `\t` tab character.
+
+Let's check out the first few rows of the raw TSV file. Again, we'll use the `repr()` function so that `print` shows the special characters.
+
+```{python}
+#| code-fold: false
+with open("data/elections.txt", "r") as table:
+ i = 0
+ for row in table:
+ print(repr(row))
+ i += 1
+ if i > 3:
+ break
+```
+
+TSVs can be loaded into `pandas` using `pd.read_csv`. We'll need to specify the **delimiter** with the parameter `sep='\t'` [(documentation)](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html).
+
+```{python}
+#| code-fold: false
+pd.read_csv("data/elections.txt", sep='\t').head(3)
+```
+
+An issue with CSVs and TSVs comes up whenever there are commas or tabs within the records. How does `pandas` differentiate between a comma delimiter vs. a comma within the field itself, for example `8,900`? To remedy this, check out the [`quotechar` parameter](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html).
+
+#### JSON
+**JSON (JavaScript Object Notation)** files behave similarly to Python dictionaries. A raw JSON is shown below.
+
+```{python}
+#| code-fold: false
+with open("data/elections.json", "r") as table:
+ i = 0
+ for row in table:
+ print(row)
+ i += 1
+ if i > 8:
+ break
+```
+
+JSON files can be loaded into `pandas` using `pd.read_json`.
+
+```{python}
+#| code-fold: false
+pd.read_json('data/elections.json').head(3)
+```
+
+##### EDA with JSON: Berkeley COVID-19 Data
+The City of Berkeley Open Data [website](https://data.cityofberkeley.info/Health/COVID-19-Confirmed-Cases/xn6j-b766) has a dataset with COVID-19 Confirmed Cases among Berkeley residents by date. Let's download the file and save it as a JSON (note the source URL file type is also a JSON). In the interest of reproducible data science, we will download the data programmatically. We have defined some helper functions in the [`ds100_utils.py`](https://ds100.org/fa23/resources/assets/lectures/lec05/lec05-eda.html) file so that we can reuse them in many different notebooks.
+
+```{python}
+#| code-fold: false
+from ds100_utils import fetch_and_cache
+
+covid_file = fetch_and_cache(
+ "https://data.cityofberkeley.info/api/views/xn6j-b766/rows.json?accessType=DOWNLOAD",
+ "confirmed-cases.json",
+ force=False)
+covid_file # a file path wrapper object
+```
+
+###### File Size
+Let's start our analysis by getting a rough estimate of the size of the dataset to inform the tools we use to view the data. For relatively small datasets, we can use a text editor or spreadsheet. For larger datasets, more programmatic exploration or distributed computing tools may be more fitting. Here we will use `Python` tools to probe the file.
+
+Since this appears to be a text file, let's investigate the number of lines, which often corresponds to the number of records.
+
+```{python}
+#| code-fold: false
+import os
+
+print(covid_file, "is", os.path.getsize(covid_file) / 1e6, "MB")
+
+with open(covid_file, "r") as f:
+ print(covid_file, "is", sum(1 for l in f), "lines.")
+```
+
+###### Unix Commands
+As part of the EDA workflow, Unix commands can come in very handy. In fact, there's an entire book called ["Data Science at the Command Line"](https://datascienceatthecommandline.com/) that explores this idea in depth!
+In Jupyter/IPython, you can prefix lines with `!` to execute arbitrary Unix commands, and within those lines, you can refer to `Python` variables and expressions with the syntax `{expr}`.
+
+Here, we use the `ls` command to list files, using the `-lh` flags, which request "long format with information in human-readable form." We also use the `wc` command for "word count," but with the `-l` flag, which asks for line counts instead of words.
+
+These two give us the same information as the code above, albeit in a slightly different form:
+
+```{python}
+#| code-fold: false
+!ls -lh {covid_file}
+!wc -l {covid_file}
+```
+
+###### File Contents
+Let's explore the data format using `Python`.
+
+```{python}
+#| code-fold: false
+with open(covid_file, "r") as f:
+ for i, row in enumerate(f):
+ print(repr(row)) # print raw strings
+ if i >= 4: break
+```
+
+We can use the `head` Unix command (which is where `pandas`' `head` method comes from!) to see the first few lines of the file:
+
+```{python}
+#| code-fold: false
+!head -5 {covid_file}
+```
+
+Before loading the JSON file into `pandas`, let's first do some EDA with `Python`'s `json` package to understand the particular structure of this JSON file so that we can decide what (if anything) to load into `pandas`. `Python` has relatively good support for JSON data since it closely matches the internal `Python` object model. In the following cell, we import the entire JSON datafile into a `Python` dictionary using the `json` package.
+
+```{python}
+#| code-fold: false
+import json
+
+with open(covid_file, "rb") as f:
+ covid_json = json.load(f)
+```
+
+The `covid_json` variable is now a dictionary encoding the data in the file:
+
+```{python}
+#| code-fold: false
+type(covid_json)
+```
+
+We can examine what keys are in the top level json object by listing out the keys.
+
+```{python}
+#| code-fold: false
+covid_json.keys()
+```
+
+**Observation**: The JSON dictionary contains a `meta` key, which likely refers to metadata (data about the data). Metadata is often maintained alongside the data and can be a good source of additional information.
+
+
+We can investigate the metadata further by examining its keys.
+
+```{python}
+#| code-fold: false
+covid_json['meta'].keys()
+```
+
+The `meta` key contains another dictionary called `view`. This likely refers to meta-data about a particular "view" of some underlying database. We will learn more about views when we study SQL later in the class.
+
+```{python}
+#| code-fold: false
+covid_json['meta']['view'].keys()
+```
+
+Notice that this is a nested/recursive data structure. As we dig deeper, we reveal more and more keys and the corresponding data:
+
+```
+meta
+|-> data
+ | ... (haven't explored yet)
+|-> view
+ | -> id
+ | -> name
+ | -> attribution
+ ...
+ | -> description
+ ...
+ | -> columns
+ ...
+```
+
+
+There is a key called `description` in the `view` sub-dictionary. This likely contains a description of the data:
+
+```{python}
+#| code-fold: false
+print(covid_json['meta']['view']['description'])
+```
+
+###### Examining the Data Field for Records
+
+We can look at a few entries in the `data` field. This is what we'll load into `pandas`.
+
+```{python}
+#| code-fold: false
+for i in range(3):
+ print(f"{i:03} | {covid_json['data'][i]}")
+```
+
+Observations:
+
+* These look like equal-length records, so maybe `data` is a table!
+* But what does each value in the record mean? Where can we find the column headers?
+
+For that, we'll need the `columns` key in the metadata dictionary. This returns a list:
+
+```{python}
+#| code-fold: false
+type(covid_json['meta']['view']['columns'])
+```
+
+###### Summary of exploring the JSON file
+
+1. The above **metadata** tells us a lot about the columns in the data including column names, potential data anomalies, and a basic statistic.
+1. Because of its non-tabular structure, JSON makes it easier (than CSV) to create **self-documenting data**, meaning that information about the data is stored in the same file as the data.
+1. Self-documenting data can be helpful since it maintains its own description and these descriptions are more likely to be updated as data changes.
+
+###### Loading COVID Data into `pandas`
+Finally, let's load the data (not the metadata) into a `pandas` `DataFrame`. In the following block of code we:
+
+1. Translate the JSON records into a `DataFrame`:
+
+ * fields: `covid_json['meta']['view']['columns']`
+ * records: `covid_json['data']`
+
+
+1. Remove columns that have no metadata description. This would be a bad idea in general, but here we remove these columns since the above analysis suggests they are unlikely to contain useful information.
+
+1. Examine the `tail` of the table.
+
+```{python}
+#| code-fold: false
+# Load the data from JSON and assign column titles
+covid = pd.DataFrame(
+ covid_json['data'],
+ columns=[c['name'] for c in covid_json['meta']['view']['columns']])
+
+covid.tail()
+```
+
+### Variable Types
+
+After loading data from a file, it's a good idea to take the time to understand what pieces of information are encoded in the dataset. In particular, we want to identify what variable types are present in our data. Broadly speaking, we can categorize variables into one of two overarching types.
+
+**Quantitative variables** describe some numeric quantity or amount. We can divide quantitative data further into:
+
+* **Continuous quantitative variables**: numeric data that can be measured on a continuous scale to arbitrary precision. Continuous variables do not have a strict set of possible values – they can be recorded to any number of decimal places. For example, weights, GPA, or CO<sub>2</sub> concentrations.
+* **Discrete quantitative variables**: numeric data that can only take on a finite set of possible values. For example, someone's age or the number of siblings they have.
+
+**Qualitative variables**, also known as **categorical variables**, describe data that isn't measuring some quantity or amount. The sub-categories of categorical data are:
+
+* **Ordinal qualitative variables**: categories with ordered levels. Specifically, ordinal variables are those where the difference between levels has no consistent, quantifiable meaning. Some examples include levels of education (high school, undergrad, grad, etc.), income bracket (low, medium, high), or Yelp rating.
+* **Nominal qualitative variables**: categories with no specific order. For example, someone's political affiliation or Cal ID number.
+
+![Classification of variable types](images/variable.png)
+
+Note that many variables don't sit neatly in just one of these categories. Qualitative variables could have numeric levels, and conversely, quantitative variables could be stored as strings.
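+
+As a small illustration of how these distinctions can show up in code, `pandas` lets us mark a qualitative column as ordered (ordinal) or unordered (nominal) with the categorical dtype. This is just a sketch with made-up values:
+
+```python
+import pandas as pd
+
+# Ordinal: the levels have an order, but differences between levels aren't quantifiable
+income = pd.Categorical(["low", "high", "medium"],
+                        categories=["low", "medium", "high"], ordered=True)
+
+# Nominal: no inherent order among the levels
+party = pd.Categorical(["Democratic", "Republican", "Independent"])
+
+income < "high"  # comparisons are only meaningful for ordered categoricals
+```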
+
+### Primary and Foreign Keys
+
+Last time, we introduced `.merge` as the `pandas` method for joining multiple `DataFrame`s together. In our discussion of joins, we touched on the idea of using a "key" to determine what rows should be merged from each table. Let's take a moment to examine this idea more closely.
+
+The **primary key** is the column or set of columns in a table that *uniquely* determine the values of the remaining columns. It can be thought of as the unique identifier for each individual row in the table. For example, a table of Data 100 students might use each student's Cal ID as the primary key.
+
+```{python}
+#| echo: false
+pd.DataFrame({"Cal ID":[3034619471, 3035619472, 3025619473, 3046789372], \
+ "Name":["Oski", "Ollie", "Orrie", "Ollie"], \
+ "Major":["Data Science", "Computer Science", "Data Science", "Economics"]})
+```
+
+The **foreign key** is the column or set of columns in a table that reference primary keys in other tables. Knowing a dataset's foreign keys can be useful when assigning the `left_on` and `right_on` parameters of `.merge`. In the table of office hour tickets below, `"Cal ID"` is a foreign key referencing the previous table.
+
+```{python}
+#| echo: false
+pd.DataFrame({"OH Request":[1, 2, 3, 4], \
+ "Cal ID":[3034619471, 3035619472, 3025619473, 3035619472], \
+ "Question":["HW 2 Q1", "HW 2 Q3", "Lab 3 Q4", "HW 2 Q7"]})
+```
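+
+To make the connection to `.merge` concrete, here is a sketch that recreates the two example tables above (with hypothetical names `students` and `tickets`) and joins them on the foreign key:
+
+```python
+import pandas as pd
+
+students = pd.DataFrame({"Cal ID": [3034619471, 3035619472, 3025619473, 3046789372],
+                         "Name": ["Oski", "Ollie", "Orrie", "Ollie"],
+                         "Major": ["Data Science", "Computer Science", "Data Science", "Economics"]})
+tickets = pd.DataFrame({"OH Request": [1, 2, 3, 4],
+                        "Cal ID": [3034619471, 3035619472, 3025619473, 3035619472],
+                        "Question": ["HW 2 Q1", "HW 2 Q3", "Lab 3 Q4", "HW 2 Q7"]})
+
+# "Cal ID" is the primary key of students and a foreign key in tickets
+tickets.merge(students, left_on="Cal ID", right_on="Cal ID")
+```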
+
+## Granularity, Scope, and Temporality
+
+After understanding the structure of the dataset, the next task is to determine what exactly the data represents. We'll do so by considering the data's granularity, scope, and temporality.
+
+### Granularity
+The **granularity** of a dataset is what a single row represents. You can also think of it as the level of detail included in the data. To determine the data's granularity, ask: what does each row in the dataset represent? Fine-grained data contains a high level of detail, with a single row representing a small individual unit. For example, each record may represent one person. Coarse-grained data is encoded such that a single row represents a large individual unit – for example, each record may represent a group of people.
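+
+For instance, fine-grained one-row-per-person data can be coarsened into one-row-per-group data with an aggregation. A sketch with a hypothetical `people` table:
+
+```python
+import pandas as pd
+
+# Fine granularity: one row per person
+people = pd.DataFrame({"city": ["Berkeley", "Berkeley", "Oakland"],
+                       "age": [20, 35, 41]})
+
+# Coarse granularity: one row per city
+people.groupby("city").agg(residents=("age", "size"), mean_age=("age", "mean"))
+```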
+
+### Scope
+The **scope** of a dataset is the subset of the population covered by the data. If we were investigating student performance in Data Science courses, a dataset with a narrow scope might encompass all students enrolled in Data 100 whereas a dataset with an expansive scope might encompass all students in California.
+
+### Temporality
+The **temporality** of a dataset describes the periodicity over which the data was collected as well as when the data was most recently collected or updated.
+
+Time and date fields of a dataset could represent a few things:
+
+1. when the "event" happened
+2. when the data was collected, or when it was entered into the system
+3. when the data was copied into the database
+
+To fully understand the temporality of the data, it may also be necessary to standardize time zones or inspect recurring time-based trends in the data (do patterns recur in 24-hour periods? Over the course of a month? Seasonally?). The convention for standardizing time is Coordinated Universal Time (UTC), an international time standard measured at 0 degrees longitude that stays consistent throughout the year (no daylight saving time). Berkeley observes Pacific Time, which is UTC-8 during standard time (PST) and UTC-7 during daylight saving time (PDT).
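+
+For example, a UTC timestamp can be re-expressed in Berkeley's local time with `pandas`' time zone tools (a minimal sketch):
+
+```python
+import pandas as pd
+
+ts_utc = pd.Timestamp("2024-01-15 12:00", tz="UTC")
+ts_utc.tz_convert("US/Pacific")  # same instant, shown in Pacific time (UTC-8 in January)
+```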
+
+#### Temporality with `pandas`' `dt` accessors
+Let's briefly look at how we can use `pandas`' `dt` accessors to work with dates/times in a dataset using the dataset you'll see in Lab 3: the Berkeley PD Calls for Service dataset.
+
+```{python}
+#| code-fold: true
+calls = pd.read_csv("data/Berkeley_PD_-_Calls_for_Service.csv")
+calls.head()
+```
+
+Looks like there are three columns with dates/times: `EVENTDT`, `EVENTTM`, and `InDbDate`.
+
+Most likely, `EVENTDT` stands for the date when the event took place, `EVENTTM` stands for the time of day the event took place (in 24-hour format), and `InDbDate` is the date this call was recorded in the database.
+
+If we check the data type of these columns, we will see they are stored as strings. We can convert them to `datetime` objects using the `pandas` function `to_datetime`.
+
+```{python}
+#| code-fold: false
+calls["EVENTDT"] = pd.to_datetime(calls["EVENTDT"])
+calls.head()
+```
+
+Now, we can use the `dt` accessor on this column.
+
+We can get the month:
+
+```{python}
+#| code-fold: false
+calls["EVENTDT"].dt.month.head()
+```
+
+Which day of the week the date is on:
+
+```{python}
+#| code-fold: false
+calls["EVENTDT"].dt.dayofweek.head()
+```
+
+Let's check the minimum values to see if there are any suspicious-looking dates from the 1970s (timestamps near the Unix epoch often indicate missing or corrupted dates):
+
+```{python}
+#| code-fold: false
+calls.sort_values("EVENTDT").head()
+```
+
+Doesn't look like it! We are good!
+
+
+We can also do many things with the `dt` accessor like switching time zones and converting time back to UNIX/POSIX time. Check out the documentation on [`.dt` accessor](https://pandas.pydata.org/docs/user_guide/basics.html#basics-dt-accessors) and [time series/date functionality](https://pandas.pydata.org/docs/user_guide/timeseries.html#).
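+
+As a hedged sketch of those two operations (assuming the `EVENTDT` timestamps represent local Pacific time):
+
+```python
+# Attach a time zone to the naive timestamps, then convert to UTC
+event_pacific = calls["EVENTDT"].dt.tz_localize("US/Pacific")
+event_utc = event_pacific.dt.tz_convert("UTC")
+
+# Seconds since the Unix epoch (POSIX time)
+posix_seconds = (event_utc - pd.Timestamp("1970-01-01", tz="UTC")) // pd.Timedelta("1s")
+posix_seconds.head()
+```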
+
+## Faithfulness
+
+At this stage in our data cleaning and EDA workflow, we've achieved quite a lot: we've identified how our data is structured, come to terms with what information it encodes, and gained insight as to how it was generated. Throughout this process, we should always recall the original intent of our work in Data Science – to use data to better understand and model the real world. To achieve this goal, we need to ensure that the data we use is faithful to reality; that is, that our data accurately captures the "real world."
+
+Data used in research or industry is often "messy" – there may be errors or inaccuracies that impact the faithfulness of the dataset. Signs that data may not be faithful include:
+
+* Unrealistic or "incorrect" values, such as negative counts, locations that don't exist, or dates set in the future
+* Violations of obvious dependencies, like an age that does not match a birthday
+* Clear signs that data was entered by hand, which can lead to spelling errors or fields that are incorrectly shifted
+* Signs of data falsification, such as fake email addresses or repeated use of the same names
+* Duplicated records or fields containing the same information
+* Truncated data, e.g. older versions of Microsoft Excel limited spreadsheets to 65,536 rows and 256 columns
+
+We often solve some of these more common issues in the following ways:
+
+* Spelling errors: apply corrections or drop records that aren't in a dictionary
+* Time zone inconsistencies: convert to a common time zone (e.g. UTC)
+* Duplicated records or fields: identify and eliminate duplicates (using primary keys), as sketched below
+* Unspecified or inconsistent units: infer the units and check that values are in reasonable ranges in the data
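+
+For instance, the duplicate-record fix often comes down to a single line once a primary key is identified. A sketch, assuming a hypothetical `df` whose primary key is `"Cal ID"`:
+
+```python
+# Keep only the first occurrence of each primary key value
+df_deduped = df.drop_duplicates(subset=["Cal ID"], keep="first")
+```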
+
+### Missing Values
+Another common issue encountered with real-world datasets is that of missing data. One strategy to resolve this is to simply drop any records with missing values from the dataset. This does, however, introduce the risk of inducing biases – it is possible that the missing or corrupt records may be systematically related to some feature of interest in the data. Another solution is to keep the data as `NaN` values.
+
+A third method to address missing data is to perform **imputation**: infer the missing values using other data available in the dataset. There is a wide variety of imputation techniques that can be implemented; some of the most common are listed below.
+
+* Average imputation: replace missing values with the average value for that field
+* Hot deck imputation: replace missing values with some random value
+* Regression imputation: develop a model to predict missing values
+* Multiple imputation: replace missing values with multiple random values
+
+Regardless of the strategy used to deal with missing data, we should think carefully about *why* particular records or fields may be missing – this can help inform whether or not the absence of these values is significant or meaningful.
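+
+As a quick sketch of the first technique, average imputation, on a hypothetical `DataFrame` with a numeric `"weight"` column:
+
+```python
+import numpy as np
+import pandas as pd
+
+df = pd.DataFrame({"weight": [120.5, np.nan, 135.0, np.nan, 150.2]})
+
+# Average imputation: fill missing entries with the column mean
+df["weight"] = df["weight"].fillna(df["weight"].mean())
+df
+```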
+
+# EDA Demo 1: Tuberculosis in the United States
+
+Now, let's walk through the data-cleaning and EDA workflow to see what we can learn about the presence of Tuberculosis in the United States!
+
+We will examine the data included in the [original CDC article](https://www.cdc.gov/mmwr/volumes/71/wr/mm7112a1.htm?s_cid=mm7112a1_w#T1_down), which reports tuberculosis cases through 2021.
+
+
+## CSVs and Field Names
+Suppose Table 1 was saved as a CSV file located in `data/cdc_tuberculosis.csv`.
+
+We can then explore the CSV (which is a text file, and does not contain binary-encoded data) in many ways:
+
+1. Using a text editor like emacs, vim, VSCode, etc.
+2. Opening the CSV directly in DataHub (read-only), Excel, Google Sheets, etc.
+3. The `Python` file object
+4. `pandas`, using `pd.read_csv()`
+
+To try out options 1 and 2, you can view or download the Tuberculosis CSV from the [lecture demo notebook](https://data100.datahub.berkeley.edu/hub/user-redirect/git-pull?repo=https%3A%2F%2Fgithub.com%2FDS-100%2Ffa23-student&urlpath=lab%2Ftree%2Ffa23-student%2Flecture%2Flec05%2Flec04-eda.ipynb&branch=main) under the `data` folder in the left-hand menu. Notice how the CSV file is a type of **rectangular data (i.e., tabular data) stored as comma-separated values**.
+
+Next, let's try out option 3 using the `Python` file object. We'll look at the first four lines:
+
+```{python}
+#| code-fold: true
+with open("data/cdc_tuberculosis.csv", "r") as f:
+ i = 0
+ for row in f:
+ print(row)
+ i += 1
+ if i > 3:
+ break
+```
+
+Whoa, why are there blank lines interspersed between the lines of the CSV?
+
+You may recall that all line breaks in text files are encoded as the special newline character `\n`. `Python`'s `print()` prints each string (including its trailing newline) and then adds an additional newline of its own.
+
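+As a quick aside, one way to avoid the doubled blank lines is to suppress `print`'s own trailing newline (a minimal sketch):
+
+```python
+with open("data/cdc_tuberculosis.csv", "r") as f:
+    for i, row in enumerate(f):
+        print(row, end="")  # row already ends with "\n", so don't add another
+        if i >= 3:
+            break
+```
+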
+If you're curious, we can use the `repr()` function to return the raw string with all special characters:
+
+```{python}
+#| code-fold: true
+with open("data/cdc_tuberculosis.csv", "r") as f:
+ i = 0
+ for row in f:
+ print(repr(row)) # print raw strings
+ i += 1
+ if i > 3:
+ break
+```
+
+Finally, let's try option 4 and use the tried-and-true Data 100 approach: `pandas`.
+
+```{python}
+#| code-fold: false
+tb_df = pd.read_csv("data/cdc_tuberculosis.csv")
+tb_df.head()
+```
+
+You may notice some strange things about this table: what's up with the "Unnamed" column names and the first row?
+
+Congratulations — you're ready to wrangle your data! Because of how things are stored, we'll need to clean the data a bit to name our columns better.
+
+A reasonable first step is to identify the row with the right header. The `pd.read_csv()` function ([documentation](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html)) has the convenient `header` parameter, which we can set so that the elements in row 1 (the second row, since `pandas` uses zero-indexing) become the column names:
+
+```{python}
+#| code-fold: false
+tb_df = pd.read_csv("data/cdc_tuberculosis.csv", header=1) # row index
+tb_df.head(5)
+```
+
+Wait...but now we can't differentiate between the "Number of TB cases" and "TB incidence" year columns. `pandas` has tried to make our lives easier by automatically adding ".1" to the latter columns, but this doesn't help us, as humans, understand the data.
+
+We can do this manually with `df.rename()` ([documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.rename.html?highlight=rename#pandas.DataFrame.rename)):
+
+```{python}
+#| code-fold: false
+rename_dict = {'2019': 'TB cases 2019',
+ '2020': 'TB cases 2020',
+ '2021': 'TB cases 2021',
+ '2019.1': 'TB incidence 2019',
+ '2020.1': 'TB incidence 2020',
+ '2021.1': 'TB incidence 2021'}
+tb_df = tb_df.rename(columns=rename_dict)
+tb_df.head(5)
+```
+
+## Record Granularity
+
+You might already be wondering: what's up with that first record?
+
+Row 0 is what we call a **rollup record**, or summary record. It's often useful when displaying tables to humans. The **granularity** of record 0 (Totals) vs the rest of the records (States) is different.
+
+Okay, EDA step two. How was the rollup record aggregated?
+
+Let's check if the Total TB cases are the sum of all state TB cases. If we sum over all rows, we should get **2x** the total cases in each of the TB cases columns (why do you think this is?).
+
+```{python}
+#| code-fold: true
+tb_df.sum(axis=0)
+```
+
+Whoa, what's going on with the TB cases in 2019, 2020, and 2021? Check out the column types:
+
+```{python}
+#| code-fold: true
+tb_df.dtypes
+```
+
+Since there are commas in the values for TB cases, the numbers are read as the `object` datatype, or **storage type** (close to the `Python` string datatype), so `pandas` is concatenating strings instead of adding integers (recall that `Python` can "sum", or concatenate, strings together: `"data" + "100"` evaluates to `"data100"`).
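+
+One manual fix would be to strip the commas and cast the column ourselves (a hedged sketch):
+
+```python
+# Remove the thousands separators, then convert to integers
+tb_df["TB cases 2019"].str.replace(",", "").astype(int).head()
+```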
+
+
+Fortunately `read_csv` also has a `thousands` parameter ([documentation](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html)):
+
+```{python}
+#| code-fold: false
+# improve readability: chaining method calls with outer parentheses/line breaks
+tb_df = (
+ pd.read_csv("data/cdc_tuberculosis.csv", header=1, thousands=',')
+ .rename(columns=rename_dict)
+)
+tb_df.head(5)
+```
+
+```{python}
+#| code-fold: false
+tb_df.sum()
+```
+
+The Total TB cases look right. Phew!
+
+Let's just look at the records with **state-level granularity**:
+
+```{python}
+#| code-fold: true
+state_tb_df = tb_df[1:]
+state_tb_df.head(5)
+```
+
+## Gather Census Data
+
+U.S. Census population estimates [source](https://www.census.gov/data/tables/time-series/demo/popest/2010s-state-total.html) (2019), [source](https://www.census.gov/data/tables/time-series/demo/popest/2020s-state-total.html) (2020-2021).
+
+Running the cells below cleans the data. There are a few new methods here:
+
+* `df.convert_dtypes()` ([documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.convert_dtypes.html)) conveniently converts each column to the best possible dtype (for example, whole-number floats become integers) and is out of scope for the class.
+* `df.dropna()` ([documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.dropna.html)) will be explained in more detail next time.
+
+```{python}
+#| code-fold: true
+# 2010s census data
+census_2010s_df = pd.read_csv("data/nst-est2019-01.csv", header=3, thousands=",")
+census_2010s_df = (
+ census_2010s_df
+ .reset_index()
+ .drop(columns=["index", "Census", "Estimates Base"])
+ .rename(columns={"Unnamed: 0": "Geographic Area"})
+ .convert_dtypes() # "smart" converting of columns, use at your own risk
+ .dropna() # we'll introduce this next time
+)
+census_2010s_df['Geographic Area'] = census_2010s_df['Geographic Area'].str.strip('.')
+
+# with pd.option_context('display.min_rows', 30): # shows more rows
+# display(census_2010s_df)
+
+census_2010s_df.head(5)
+```
+
+Occasionally, you will want to modify code that you have imported. To re-import those modifications, you can either use `Python`'s `importlib` library:
+
+```python
+from importlib import reload
+reload(utils)
+```
+
+or use `iPython` magic which will intelligently import code when files change:
+
+```python
+%load_ext autoreload
+%autoreload 2
+```
+
+```{python}
+#| code-fold: true
+# census 2020s data
+census_2020s_df = pd.read_csv("data/NST-EST2022-POP.csv", header=3, thousands=",")
+census_2020s_df = (
+ census_2020s_df
+ .reset_index()
+ .drop(columns=["index", "Unnamed: 1"])
+ .rename(columns={"Unnamed: 0": "Geographic Area"})
+ .convert_dtypes() # "smart" converting of columns, use at your own risk
+ .dropna() # we'll introduce this next time
+)
+census_2020s_df['Geographic Area'] = census_2020s_df['Geographic Area'].str.strip('.')
+
+census_2020s_df.head(5)
+```
+
+## Joining Data (Merging `DataFrame`s)
+
+Time to `merge`! Here we use the `DataFrame` method `df1.merge(right=df2, ...)` on `DataFrame` `df1` ([documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.merge.html)). Contrast this with the function `pd.merge(left=df1, right=df2, ...)` ([documentation](https://pandas.pydata.org/docs/reference/api/pandas.merge.html?highlight=pandas%20merge#pandas.merge)). Feel free to use either.
+
+```{python}
+#| code-fold: false
+# merge TB DataFrame with two US census DataFrames
+tb_census_df = (
+ tb_df
+ .merge(right=census_2010s_df,
+ left_on="U.S. jurisdiction", right_on="Geographic Area")
+ .merge(right=census_2020s_df,
+ left_on="U.S. jurisdiction", right_on="Geographic Area")
+)
+tb_census_df.head(5)
+```
+
+Having all of these columns is a little unwieldy. We could either drop the unneeded columns now, or just merge on smaller census `DataFrame`s. Let's do the latter.
+
+```{python}
+#| code-fold: false
+# try merging again, but cleaner this time
+tb_census_df = (
+ tb_df
+ .merge(right=census_2010s_df[["Geographic Area", "2019"]],
+ left_on="U.S. jurisdiction", right_on="Geographic Area")
+ .drop(columns="Geographic Area")
+ .merge(right=census_2020s_df[["Geographic Area", "2020", "2021"]],
+ left_on="U.S. jurisdiction", right_on="Geographic Area")
+ .drop(columns="Geographic Area")
+)
+tb_census_df.head(5)
+```
+
+## Reproducing Data: Compute Incidence
+
+Let's recompute incidence to make sure we know where the original CDC numbers came from.
+
+From the [CDC report](https://www.cdc.gov/mmwr/volumes/71/wr/mm7112a1.htm?s_cid=mm7112a1_w#T1_down): TB incidence is computed as “Cases per 100,000 persons using mid-year population estimates from the U.S. Census Bureau.”
+
+If we define a group as 100,000 people, then we can compute the TB incidence for a given state population as
+
+$$\text{TB incidence} = \frac{\text{TB cases in population}}{\text{groups in population}} = \frac{\text{TB cases in population}}{\text{population}/100000} $$
+
+$$= \frac{\text{TB cases in population}}{\text{population}} \times 100000$$
+
+Let's try this for 2019:
+
+```{python}
+#| code-fold: false
+tb_census_df["recompute incidence 2019"] = tb_census_df["TB cases 2019"]/tb_census_df["2019"]*100000
+tb_census_df.head(5)
+```
+
+Awesome!!!
+
+Let's use a for-loop and `Python` format strings to compute TB incidence for all years. `Python` f-strings are just used for the purposes of this demo, but they're handy to know when you explore data beyond this course ([documentation](https://docs.python.org/3/tutorial/inputoutput.html)).
+
+```{python}
+#| code-fold: false
+# recompute incidence for all years
+for year in [2019, 2020, 2021]:
+ tb_census_df[f"recompute incidence {year}"] = tb_census_df[f"TB cases {year}"]/tb_census_df[f"{year}"]*100000
+tb_census_df.head(5)
+```
+
+These numbers look pretty close!!! There are a few errors in the hundredths place, particularly in 2021. It may be useful to further explore reasons behind this discrepancy.
+
+```{python}
+#| code-fold: false
+tb_census_df.describe()
+```
+
+## Bonus EDA: Reproducing the Reported Statistic
+
+
+**How do we reproduce that reported statistic in the original [CDC report](https://www.cdc.gov/mmwr/volumes/71/wr/mm7112a1.htm?s_cid=mm7112a1_w)?**
+
+> Reported TB incidence (cases per 100,000 persons) increased **9.4%**, from **2.2** during 2020 to **2.4** during 2021 but was lower than incidence during 2019 (2.7). Increases occurred among both U.S.-born and non–U.S.-born persons.
+
+This is TB incidence computed across the entire U.S. population! How do we reproduce this?
+* We need to reproduce the "Total" TB incidences in our rolled record.
+* But our current `tb_census_df` only has 51 entries (50 states plus Washington, D.C.). There is no rolled record.
+* What happened...?
+
+Let's get exploring!
+
+Before we keep exploring, we'll set all indexes to more meaningful values, instead of just numbers that pertain to some row at some point. This will make our cleaning slightly easier.
+
+```{python}
+#| code-fold: true
+tb_df = tb_df.set_index("U.S. jurisdiction")
+tb_df.head(5)
+```
+
+```{python}
+#| code-fold: false
+census_2010s_df = census_2010s_df.set_index("Geographic Area")
+census_2010s_df.head(5)
+```
+
+```{python}
+#| code-fold: false
+census_2020s_df = census_2020s_df.set_index("Geographic Area")
+census_2020s_df.head(5)
+```
+
+It turns out that our merge above only kept state records, even though our original `tb_df` had the "Total" rolled record:
+
+```{python}
+#| code-fold: false
+tb_df.head()
+```
+
+Recall that `merge` performs an **inner** merge by default, meaning that it only preserves keys that are present in **both** `DataFrame`s.
+
+The rolled records in our census `DataFrame` have different `Geographic Area` fields, which was the key we merged on:
+
+```{python}
+#| code-fold: false
+census_2010s_df.head(5)
+```
+
+The Census `DataFrame` has several rolled records. The aggregate record we are looking for actually has the Geographic Area named "United States".
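+
+If you want to verify this programmatically, one option (a sketch, not the approach we take below) is an outer merge with `indicator=True`, which labels where each key came from:
+
+```python
+# how="outer" keeps unmatched keys; the _merge column records each row's origin
+check = tb_df.merge(right=census_2010s_df, how="outer",
+                    left_index=True, right_index=True, indicator=True)
+check[check["_merge"] != "both"].index
+```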
+
+One straightforward way to get the right merge is to rename the value itself. Because we now have the Geographic Area index, we'll use `df.rename()` ([documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.rename.html)):
+
+```{python}
+#| code-fold: false
+# rename rolled record for 2010s
+census_2010s_df.rename(index={'United States':'Total'}, inplace=True)
+census_2010s_df.head(5)
+```
+
+```{python}
+#| code-fold: false
+# same, but for 2020s rename rolled record
+census_2020s_df.rename(index={'United States':'Total'}, inplace=True)
+census_2020s_df.head(5)
+```
+
+<br/>
+
+Next let's rerun our merge. Note the different chaining, because we are now merging on indexes (`df.merge()` [documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.merge.html)).
+
+```{python}
+#| code-fold: false
+tb_census_df = (
+ tb_df
+ .merge(right=census_2010s_df[["2019"]],
+ left_index=True, right_index=True)
+ .merge(right=census_2020s_df[["2020", "2021"]],
+ left_index=True, right_index=True)
+)
+tb_census_df.head(5)
+```
+
+<br/>
+
+Finally, let's recompute our incidences:
+
+```{python}
+#| code-fold: false
+# recompute incidence for all years
+for year in [2019, 2020, 2021]:
+ tb_census_df[f"recompute incidence {year}"] = tb_census_df[f"TB cases {year}"]/tb_census_df[f"{year}"]*100000
+tb_census_df.head(5)
+```
+
+We reproduced the total U.S. incidences correctly!
+
+We're almost there. Let's revisit the quote:
+
+> Reported TB incidence (cases per 100,000 persons) increased **9.4%**, from **2.2** during 2020 to **2.4** during 2021 but was lower than incidence during 2019 (2.7). Increases occurred among both U.S.-born and non–U.S.-born persons.
+
+Recall that percent change from $A$ to $B$ is computed as
+$\text{percent change} = \frac{B - A}{A} \times 100$.
+
+```{python}
+#| code-fold: false
+#| tags: []
+incidence_2020 = tb_census_df.loc['Total', 'recompute incidence 2020']
+incidence_2020
+```
+
+```{python}
+#| code-fold: false
+#| tags: []
+incidence_2021 = tb_census_df.loc['Total', 'recompute incidence 2021']
+incidence_2021
+```
+
+```{python}
+#| code-fold: false
+#| tags: []
+difference = (incidence_2021 - incidence_2020)/incidence_2020 * 100
+difference
+```
+
+# EDA Demo 2: Mauna Loa CO<sub>2</sub> Data -- A Lesson in Data Faithfulness
+
+[Mauna Loa Observatory](https://gml.noaa.gov/ccgg/trends/data.html) has been monitoring CO<sub>2</sub> concentrations since 1958.
+
+```{python}
+#| code-fold: false
+co2_file = "data/co2_mm_mlo.txt"
+```
+
+Let's do some **EDA**!!
+
+## Reading this file into Pandas?
+Let's instead check out this `.txt` file. Some questions to keep in mind: Do we trust this file extension? What structure is it?
+
+Lines 71-78 (inclusive) are shown below:
+
+ line number | file contents
+
+ 71 | # decimal average interpolated trend #days
+ 72 | # date (season corr)
+ 73 | 1958 3 1958.208 315.71 315.71 314.62 -1
+ 74 | 1958 4 1958.292 317.45 317.45 315.29 -1
+ 75 | 1958 5 1958.375 317.50 317.50 314.71 -1
+ 76 | 1958 6 1958.458 -99.99 317.10 314.85 -1
+ 77 | 1958 7 1958.542 315.86 315.86 314.98 -1
+ 78 | 1958 8 1958.625 314.93 314.93 315.94 -1
+
+
+Notice how:
+
+- The values are separated by white space, possibly tabs.
+- The values line up in fixed positions down the rows. For example, the month appears in the 7th to 8th position of each line.
+- The 71st and 72nd lines in the file contain column headings split over two lines.
+
+We can use `read_csv` to read the data into a `pandas` `DataFrame`, and we provide several arguments to specify that the separators are white space, there is no header (**we will set our own column names**), and to skip the first 72 rows of the file.
+
+```{python}
+#| code-fold: false
+co2 = pd.read_csv(
+    co2_file, header = None, skiprows = 72,
+    sep = r'\s+'  # delimiter for continuous whitespace (stay tuned for regex next lecture)
+)
+co2.head()
+```
+
+Congratulations! You've wrangled the data!
+
+<br/>
+
+...But our columns aren't named.
+**We need to do more EDA.**
+
+## Exploring Variable Feature Types
+
+The NOAA [webpage](https://gml.noaa.gov/ccgg/trends/) might have some useful tidbits (in this case, it doesn't tell us much beyond what's already in the file's header comments).
+
+Using the column headings from lines 71-72 of the raw file, we'll rerun `pd.read_csv`, but this time with some **custom column names.**
+
+```{python}
+#| code-fold: false
+co2 = pd.read_csv(
+    co2_file, header = None, skiprows = 72,
+    sep = r'\s+',  # regex for continuous whitespace (next lecture)
+    names = ['Yr', 'Mo', 'DecDate', 'Avg', 'Int', 'Trend', 'Days']
+)
+co2.head()
+```
+
+## Visualizing CO<sub>2</sub>
+Scientific studies tend to have very clean data, right...? Let's jump right in and make a time series plot of CO2 monthly averages.
+
+```{python}
+#| code-fold: true
+sns.lineplot(x='DecDate', y='Avg', data=co2);
+```
+
+The code above uses the `seaborn` plotting library (abbreviated `sns`). We will cover this in the Visualization lecture; for now, you don't need to worry about how it works!
+
+Yikes! Plotting the data uncovered a problem. The sharp vertical lines suggest that we have some **missing values**. What happened here?
+
+```{python}
+#| code-fold: false
+co2.head()
+```
+
+```{python}
+#| code-fold: false
+co2.tail()
+```
+
+Some data have unusual values like -1 and -99.99.
+
+Let's check the description at the top of the file again.
+
+* -1 signifies a missing value for `Days`, the number of days the equipment was in operation that month.
+* -99.99 denotes a missing monthly average, `Avg`.
+
+How can we fix this? First, let's explore other aspects of our data. Understanding our data will help us decide what to do with the missing values.
+
+<br/>
+
+
+## Sanity Checks: Reasoning about the data
+First, we consider the shape of the data. How many rows should we have?
+
+* The data should have one record per month, in chronological order.
+* The data run from March 1958 to August 2019.
+* So we should have $12 \times (2019-1957) - 2 - 4 = 738$ records (12 months for each of the 62 years from 1958 through 2019, minus the missing Jan-Feb 1958 and Sep-Dec 2019).
+
+```{python}
+#| code-fold: false
+co2.shape
+```
+
+Nice!! The number of rows (i.e., records) matches our expectations.
+
+<br/>
+
+
+Let's now check the quality of each feature.
+
+## Understanding Missing Value 1: `Days`
+`Days` is a time field, so let's analyze other time fields to see if there is an explanation for missing values of days of operation.
+
+Let's start with **months**, `Mo`.
+
+Are we missing any records? Each month should appear 61 or 62 times (March 1958-August 2019).
+
+```{python}
+#| code-fold: false
+co2["Mo"].value_counts().sort_index()
+```
+
+As expected, Jan, Feb, Sep, Oct, Nov, and Dec have 61 occurrences, and the rest have 62.
+
+<br/>
+
+Next let's explore **days** `Days` itself, which is the number of days that the measurement equipment worked.
+
+```{python}
+#| code-fold: true
+sns.displot(co2['Days']);
+plt.title("Distribution of days feature"); # suppresses unneeded plotting output
+```
+
+In terms of data quality, a handful of months have averages based on measurements taken on fewer than half the days. In addition, there are nearly 200 missing values--**that's about 27% of the data**!
+
+<br/>
+
+Finally, let's check the last time feature, **year** `Yr`.
+
+Let's check to see if there is any connection between missingness and the year of the recording.
+
+```{python}
+#| code-fold: true
+sns.scatterplot(x="Yr", y="Days", data=co2);
+plt.title("Day field by Year"); # the ; suppresses output
+```
+
+**Observations**:
+
+* All of the missing data are in the early years of operation.
+* It appears there may have been problems with equipment in the mid to late 80s.
+
+**Potential Next Steps**:
+
+* Confirm these explanations through documentation about the historical readings.
+* Maybe drop the earliest recordings? However, we would want to delay such action until after we have examined the time trends and assessed whether there are any potential problems.
+
+<br/>
+
+## Understanding Missing Value 2: `Avg`
+Next, let's return to the -99.99 values in `Avg` to analyze the overall quality of the CO<sub>2</sub> measurements. We'll plot a histogram of the average CO<sub>2</sub> measurements:
+
+```{python}
+#| code-fold: true
+# Histograms of average CO2 measurements
+sns.displot(co2['Avg']);
+```
+
+The non-missing values are in the 300-400 range (a regular range of CO2 levels).
+
+We also see that there are only a few missing `Avg` values (**<1% of values**). Let's examine all of them:
+
+```{python}
+#| code-fold: false
+co2[co2["Avg"] < 0]
+```
+
+There doesn't seem to be a pattern to these values, other than that most records also were missing `Days` data.
+
+## Drop, `NaN`, or Impute Missing `Avg` Data?
+
+How should we address the invalid `Avg` data?
+
+1. Drop records
+2. Set to NaN
+3. Impute using some strategy
+
+Remember we want to fix the following plot:
+
+```{python}
+#| code-fold: true
+sns.lineplot(x='DecDate', y='Avg', data=co2)
+plt.title("CO2 Average By Month");
+```
+
+Since we are plotting `Avg` vs `DecDate`, we should just focus on dealing with missing values for `Avg`.
+
+
+Let's consider a few options:
+
+1. Drop those records
+2. Replace -99.99 with `NaN`
+3. Substitute a likely value for the average CO2
+
+What do you think are the pros and cons of each possible action?
+
+<br/>
+
+
+Let's examine each of these three options.
+
+```{python}
+#| code-fold: false
+# 1. Drop missing values
+co2_drop = co2[co2['Avg'] > 0]
+co2_drop.head()
+```
+
+```{python}
+#| code-fold: false
+# 2. Replace -99.99 with NaN
+co2_NA = co2.replace(-99.99, np.nan)
+co2_NA.head()
+```
+
+We'll also use a third version of the data.
+
+First, we note that the dataset already comes with a **substitute value** for the -99.99.
+
+From the file description:
+
+> The `interpolated` column includes average values from the preceding column (`average`)
+and **interpolated values** where data are missing. Interpolated values are
+computed in two steps...
+
+The `Int` feature has values that exactly match those in `Avg`, except when `Avg` is -99.99, and then a **reasonable** estimate is used instead.
+
+So, the third version of our data will use the `Int` feature instead of `Avg`.
+
+```{python}
+#| code-fold: false
+# 3. Use interpolated column which estimates missing Avg values
+co2_impute = co2.copy()
+co2_impute['Avg'] = co2['Int']
+co2_impute.head()
+```
+
+What's a **reasonable** estimate?
+
+To answer this question, let's zoom in on a short time period, say the measurements in 1958 (where we know we have two missing values).
+
+```{python}
+#| code-fold: true
+# results of plotting data in 1958
+
+def line_and_points(data, ax, title):
+ # assumes single year, hence Mo
+ ax.plot('Mo', 'Avg', data=data)
+ ax.scatter('Mo', 'Avg', data=data)
+ ax.set_xlim(2, 13)
+ ax.set_title(title)
+ ax.set_xticks(np.arange(3, 13))
+
+def data_year(data, year):
+    return data[data["Yr"] == year]
+
+# uses matplotlib subplots
+# you may see more next week; focus on output for now
+fig, axes = plt.subplots(ncols = 3, figsize=(12, 4), sharey=True)
+
+year = 1958
+line_and_points(data_year(co2_drop, year), axes[0], title="1. Drop Missing")
+line_and_points(data_year(co2_NA, year), axes[1], title="2. Missing Set to NaN")
+line_and_points(data_year(co2_impute, year), axes[2], title="3. Missing Interpolated")
+
+fig.suptitle(f"Monthly Averages for {year}")
+plt.tight_layout()
+```
+
+In the big picture, since there are only 7 missing `Avg` values (**<1%** of 738 months), any of these approaches would work.
+
+However, there is some appeal to **option 3, imputing**:
+
+* It shows seasonal trends for CO2.
+* We are plotting all months in our data as a line plot.
+
+<br/>
+
+
+Let's replot our original figure with option 3:
+
+```{python}
+#| code-fold: true
+sns.lineplot(x='DecDate', y='Avg', data=co2_impute)
+plt.title("CO2 Average By Month, Imputed");
+```
+
+Looks pretty close to what we see on the NOAA [website](https://gml.noaa.gov/ccgg/trends/)!
+
+## Presenting the data: A Discussion on Data Granularity
+
+From the description:
+
+* Monthly measurements are averages of daily average measurements.
+* The NOAA GML website has datasets for daily/hourly measurements too.
+
+The data you present depends on your research question.
+
+**How do CO2 levels vary by season?**
+
+* You might want to keep average monthly data.
+
+**Are CO2 levels rising over the past 50+ years, consistent with global warming predictions?**
+
+* You might be happier with a **coarser granularity** of yearly average data!
+
+```{python}
+#| code-fold: true
+co2_year = co2_impute.groupby('Yr').mean()
+sns.lineplot(x='Yr', y='Avg', data=co2_year)
+plt.title("CO2 Average By Year");
+```
+
+Indeed, we see a rise by nearly 100 ppm of CO2 since Mauna Loa began recording in 1958.
+
+# Summary
+We went over a lot of content in this lecture; let's summarize the most important points:
+
+## Dealing with Missing Values
+There are a few options we can take to deal with missing data:
+
+* Drop missing records
+* Keep `NaN` missing values
+* Impute using an interpolated column
+
+## EDA and Data Wrangling
+There are several ways to approach EDA and Data Wrangling:
+
+* Examine the **data and metadata**: what is the date, size, organization, and structure of the data?
+* Examine each **field/attribute/dimension** individually.
+* Examine pairs of related dimensions (e.g. breaking down grades by major).
+* Along the way, we can:
+ * **Visualize** or summarize the data.
+ * **Validate assumptions** about data and its collection process. Pay particular attention to when the data was collected.
+ * Identify and **address anomalies**.
+ * Apply data transformations and corrections (we'll cover this in the upcoming lecture).
+ * **Record everything you do!** Developing in Jupyter Notebook promotes *reproducibility* of your own work!
diff --git a/docs/eda/eda_files/figure-html/cell-62-output-1.png b/docs/eda/eda_files/figure-html/cell-62-output-1.png
index a04218cf..f392d5f9 100644
Binary files a/docs/eda/eda_files/figure-html/cell-62-output-1.png and b/docs/eda/eda_files/figure-html/cell-62-output-1.png differ
diff --git a/docs/eda/eda_files/figure-html/cell-67-output-1.png b/docs/eda/eda_files/figure-html/cell-67-output-1.png
new file mode 100644
index 00000000..be96b8c9
Binary files /dev/null and b/docs/eda/eda_files/figure-html/cell-67-output-1.png differ
diff --git a/docs/eda/eda_files/figure-html/cell-67-output-2.png b/docs/eda/eda_files/figure-html/cell-67-output-2.png
deleted file mode 100644
index 31857f62..00000000
Binary files a/docs/eda/eda_files/figure-html/cell-67-output-2.png and /dev/null differ
diff --git a/docs/eda/eda_files/figure-html/cell-68-output-1.png b/docs/eda/eda_files/figure-html/cell-68-output-1.png
index 67c3959d..ffd29ff8 100644
Binary files a/docs/eda/eda_files/figure-html/cell-68-output-1.png and b/docs/eda/eda_files/figure-html/cell-68-output-1.png differ
diff --git a/docs/eda/eda_files/figure-html/cell-69-output-1.png b/docs/eda/eda_files/figure-html/cell-69-output-1.png
new file mode 100644
index 00000000..29088928
Binary files /dev/null and b/docs/eda/eda_files/figure-html/cell-69-output-1.png differ
diff --git a/docs/eda/eda_files/figure-html/cell-69-output-2.png b/docs/eda/eda_files/figure-html/cell-69-output-2.png
deleted file mode 100644
index fb28f5d5..00000000
Binary files a/docs/eda/eda_files/figure-html/cell-69-output-2.png and /dev/null differ
diff --git a/docs/eda/eda_files/figure-html/cell-71-output-1.png b/docs/eda/eda_files/figure-html/cell-71-output-1.png
index 39cac822..49ef3d6a 100644
Binary files a/docs/eda/eda_files/figure-html/cell-71-output-1.png and b/docs/eda/eda_files/figure-html/cell-71-output-1.png differ
diff --git a/docs/eda/eda_files/figure-html/cell-75-output-1.png b/docs/eda/eda_files/figure-html/cell-75-output-1.png
index 6382e58a..15a5fe82 100644
Binary files a/docs/eda/eda_files/figure-html/cell-75-output-1.png and b/docs/eda/eda_files/figure-html/cell-75-output-1.png differ
diff --git a/docs/eda/eda_files/figure-html/cell-76-output-1.png b/docs/eda/eda_files/figure-html/cell-76-output-1.png
index db2b0dee..40b1fc71 100644
Binary files a/docs/eda/eda_files/figure-html/cell-76-output-1.png and b/docs/eda/eda_files/figure-html/cell-76-output-1.png differ
diff --git a/docs/eda/eda_files/figure-html/cell-77-output-1.png b/docs/eda/eda_files/figure-html/cell-77-output-1.png
index 897b8b39..99b6c2d1 100644
Binary files a/docs/eda/eda_files/figure-html/cell-77-output-1.png and b/docs/eda/eda_files/figure-html/cell-77-output-1.png differ
-
---
-title: Data Cleaning and EDA
-execute:
- echo: true
-format:
- html:
- code-fold: true
- code-tools: true
- toc: true
- toc-title: Data Cleaning and EDA
- page-layout: full
- theme:
- - cosmo
- - cerulean
- callout-icon: false
-jupyter: python3
----
-
-```{python}
-#| code-fold: true
-import numpy as np
-import pandas as pd
-
-import matplotlib.pyplot as plt
-import seaborn as sns
-#%matplotlib inline
-plt.rcParams['figure.figsize'] = (12, 9)
-
-sns.set()
-sns.set_context('talk')
-np.set_printoptions(threshold=20, precision=2, suppress=True)
-pd.set_option('display.max_rows', 30)
-pd.set_option('display.max_columns', None)
-pd.set_option('display.precision', 2)
-# This option stops scientific notation for pandas
-pd.set_option('display.float_format', '{:.2f}'.format)
-
-# Silence some spurious seaborn warnings
-import warnings
-warnings.filterwarnings("ignore", category=FutureWarning)
-```
-
-::: {.callout-note collapse="false"}
-## Learning Outcomes
-* Recognize common file formats
-* Categorize data by its variable type
-* Build awareness of issues with data faithfulness and develop targeted solutions
-:::
-
-**This content is covered in lectures 4, 5, and 6.**
-
-In the past few lectures, we've learned that `pandas` is a toolkit to restructure, modify, and explore a dataset. What we haven't yet touched on is *how* to make these data transformation decisions. When we receive a new set of data from the "real world," how do we know what processing we should do to convert this data into a usable form?
-
-**Data cleaning**, also called **data wrangling**, is the process of transforming raw data to facilitate subsequent analysis. It is often used to address issues like:
-
-* Unclear structure or formatting
-* Missing or corrupted values
-* Unit conversions
-* ...and so on
-
-**Exploratory Data Analysis (EDA)** is the process of understanding a new dataset. It is an open-ended, informal analysis that involves familiarizing ourselves with the variables present in the data, discovering potential hypotheses, and identifying possible issues with the data. This last point can often motivate further data cleaning to address any problems with the dataset's format; because of this, EDA and data cleaning are often thought of as an "infinite loop," with each process driving the other.
-
-In this lecture, we will consider the key properties of data to consider when performing data cleaning and EDA. In doing so, we'll develop a "checklist" of sorts for you to consider when approaching a new dataset. Throughout this process, we'll build a deeper understanding of this early (but very important!) stage of the data science lifecycle.
-
-## Structure
-
-### File Formats
-There are many file types for storing structured data: TSV, JSON, XML, ASCII, SAS, etc. We'll only cover CSV, TSV, and JSON in lecture, but you'll likely encounter other formats as you work with different datasets. Reading documentation is your best bet for understanding how to process the multitude of different file types.
-
-#### CSV
-CSVs, which stand for **Comma-Separated Values**, are a common tabular data format.
-In the past two `pandas` lectures, we briefly touched on the idea of file format: the way data is encoded in a file for storage. Specifically, our `elections` and `babynames` datasets were stored and loaded as CSVs:
-
-```{python}
-#| code-fold: false
-pd.read_csv("data/elections.csv").head(5)
-```
-
-To better understand the properties of a CSV, let's take a look at the first few rows of the raw data file to see what it looks like before being loaded into a `DataFrame`. We'll use the `repr()` function to return the raw string with its special characters:
-
-```{python}
-#| code-fold: false
-with open("data/elections.csv", "r") as table:
- i = 0
- for row in table:
- print(repr(row))
- i += 1
- if i > 3:
- break
-```
-
-Each row, or **record**, in the data is delimited by a newline `\n`. Each column, or **field**, in the data is delimited by a comma `,` (hence, comma-separated!).
-
-#### TSV
-
-Another common file type is **TSV (Tab-Separated Values)**. In a TSV, records are still delimited by a newline `\n`, while fields are delimited by `\t` tab character.
-
-Let's check out the first few rows of the raw TSV file. Again, we'll use the `repr()` function so that `print` shows the special characters.
-
-```{python}
-#| code-fold: false
-with open("data/elections.txt", "r") as table:
- i = 0
- for row in table:
- print(repr(row))
- i += 1
- if i > 3:
- break
-```
-
-TSVs can be loaded into `pandas` using `pd.read_csv`. We'll need to specify the **delimiter** with parameter` sep='\t'` [(documentation)](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html).
-
-```{python}
-#| code-fold: false
-pd.read_csv("data/elections.txt", sep='\t').head(3)
-```
-
-An issue with CSVs and TSVs comes up whenever there are commas or tabs within the records. How does `pandas` differentiate between a comma delimiter vs. a comma within the field itself, for example `8,900`? To remedy this, check out the [`quotechar` parameter](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html).
-
-#### JSON
-**JSON (JavaScript Object Notation)** files behave similarly to Python dictionaries. A raw JSON is shown below.
-
-```{python}
-#| code-fold: false
-with open("data/elections.json", "r") as table:
- i = 0
- for row in table:
- print(row)
- i += 1
- if i > 8:
- break
-```
-
-JSON files can be loaded into `pandas` using `pd.read_json`.
-
-```{python}
-#| code-fold: false
-pd.read_json('data/elections.json').head(3)
-```
-
-##### EDA with JSON: Berkeley COVID-19 Data
-The City of Berkeley Open Data [website](https://data.cityofberkeley.info/Health/COVID-19-Confirmed-Cases/xn6j-b766) has a dataset with COVID-19 Confirmed Cases among Berkeley residents by date. Let's download the file and save it as a JSON (note the source URL file type is also a JSON). In the interest of reproducible data science, we will download the data programatically. We have defined some helper functions in the [`ds100_utils.py`](https://ds100.org/fa23/resources/assets/lectures/lec05/lec05-eda.html) file that we can reuse these helper functions in many different notebooks.
-
-```{python}
-#| code-fold: false
-from ds100_utils import fetch_and_cache
-
-covid_file = fetch_and_cache(
- "https://data.cityofberkeley.info/api/views/xn6j-b766/rows.json?accessType=DOWNLOAD",
- "confirmed-cases.json",
- force=False)
-covid_file # a file path wrapper object
-```
-
-###### File Size
-Let's start our analysis by getting a rough estimate of the size of the dataset to inform the tools we use to view the data. For relatively small datasets, we can use a text editor or spreadsheet. For larger datasets, more programmatic exploration or distributed computing tools may be more fitting. Here we will use `Python` tools to probe the file.
-
-Since there seem to be text files, let's investigate the number of lines, which often corresponds to the number of records
-
-```{python}
-#| code-fold: false
-import os
-
-print(covid_file, "is", os.path.getsize(covid_file) / 1e6, "MB")
-
-with open(covid_file, "r") as f:
- print(covid_file, "is", sum(1 for l in f), "lines.")
-```
-
-###### Unix Commands
-As part of the EDA workflow, Unix commands can come in very handy. In fact, there's an entire book called ["Data Science at the Command Line"](https://datascienceatthecommandline.com/) that explores this idea in depth!
-In Jupyter/IPython, you can prefix lines with `!` to execute arbitrary Unix commands, and within those lines, you can refer to `Python` variables and expressions with the syntax `{expr}`.
-
-Here, we use the `ls` command to list files, using the `-lh` flags, which request "long format with information in human-readable form." We also use the `wc` command for "word count," but with the `-l` flag, which asks for line counts instead of words.
-
-These two give us the same information as the code above, albeit in a slightly different form:
-
-```{python}
-#| code-fold: false
-!ls -lh {covid_file}
-!wc -l {covid_file}
-```
-
-###### File Contents
-Let's explore the data format using `Python`.
-
-```{python}
-#| code-fold: false
-with open(covid_file, "r") as f:
- for i, row in enumerate(f):
- print(repr(row)) # print raw strings
- if i >= 4: break
-```
-
-We can use the `head` Unix command (which is where `pandas`' `head` method comes from!) to see the first few lines of the file:
-
-```{python}
-#| code-fold: false
-!head -5 {covid_file}
-```
-
-In order to load the JSON file into `pandas`, Let's first do some EDA with `Python`'s `json` package to understand the particular structure of this JSON file so that we can decide what (if anything) to load into `pandas`. `Python` has relatively good support for JSON data since it closely matches the internal python object model. In the following cell we import the entire JSON datafile into a python dictionary using the `json` package.
-
-```{python}
-#| code-fold: false
-import json
-
-with open(covid_file, "rb") as f:
- covid_json = json.load(f)
-```
-
-The `covid_json` variable is now a dictionary encoding the data in the file:
-
-```{python}
-#| code-fold: false
-type(covid_json)
-```
-
-We can examine what keys are in the top level json object by listing out the keys.
-
-```{python}
-#| code-fold: false
-covid_json.keys()
-```
-
-**Observation**: The JSON dictionary contains a `meta` key which likely refers to meta data (data about the data). Meta data often maintained with the data and can be a good source of additional information.
-
-
-We can investigate the meta data further by examining the keys associated with the metadata.
-
-```{python}
-#| code-fold: false
-covid_json['meta'].keys()
-```
-
-The `meta` key contains another dictionary called `view`. This likely refers to meta-data about a particular "view" of some underlying database. We will learn more about views when we study SQL later in the class.
-
-```{python}
-#| code-fold: false
-covid_json['meta']['view'].keys()
-```
-
-Notice that this a nested/recursive data structure. As we dig deeper we reveal more and more keys and the corresponding data:
-
-```
-meta
-|-> data
- | ... (haven't explored yet)
-|-> view
- | -> id
- | -> name
- | -> attribution
- ...
- | -> description
- ...
- | -> columns
- ...
-```
-
-
-There is a key called description in the view sub dictionary. This likely contains a description of the data:
-
-```{python}
-#| code-fold: false
-print(covid_json['meta']['view']['description'])
-```
-
-###### Examining the Data Field for Records
-
-We can look at a few entries in the `data` field. This is what we'll load into `pandas`.
-
-```{python}
-#| code-fold: false
-for i in range(3):
- print(f"{i:03} | {covid_json['data'][i]}")
-```
-
-Observations:
-* These look like equal-length records, so maybe `data` is a table!
-* But what do each of values in the record mean? Where can we find column headers?
-
-For that, we'll need the `columns` key in the metadata dictionary. This returns a list:
-
-```{python}
-#| code-fold: false
-type(covid_json['meta']['view']['columns'])
-```
-
-###### Summary of exploring the JSON file
-
-1. The above **metadata** tells us a lot about the columns in the data including column names, potential data anomalies, and a basic statistic.
-1. Because of its non-tabular structure, JSON makes it easier (than CSV) to create **self-documenting data**, meaning that information about the data is stored in the same file as the data.
-1. Self-documenting data can be helpful since it maintains its own description and these descriptions are more likely to be updated as data changes.
-
-###### Loading COVID Data into `pandas`
-Finally, let's load the data (not the metadata) into a `pandas` `DataFrame`. In the following block of code we:
-
-1. Translate the JSON records into a `DataFrame`:
-
- * fields: `covid_json['meta']['view']['columns']`
- * records: `covid_json['data']`
-
-
-1. Remove columns that have no metadata description. This would be a bad idea in general, but here we remove these columns since the above analysis suggests they are unlikely to contain useful information.
-
-1. Examine the `tail` of the table.
-
-```{python}
-#| code-fold: false
-# Load the data from JSON and assign column titles
-covid = pd.DataFrame(
- covid_json['data'],
- columns=[c['name'] for c in covid_json['meta']['view']['columns']])
-
-covid.tail()
-```
-
-### Variable Types
-
-After loading data from a file, it's a good idea to take the time to understand what pieces of information are encoded in the dataset. In particular, we want to identify what variable types are present in our data. Broadly speaking, we can categorize variables into one of two overarching types.
-
-**Quantitative variables** describe some numeric quantity or amount. We can divide quantitative data further into:
-
-* **Continuous quantitative variables**: numeric data that can be measured on a continuous scale to arbitrary precision. Continuous variables do not have a strict set of possible values – they can be recorded to any number of decimal places. For example, weights, GPA, or CO<sub>2</sub> concentrations.
-* **Discrete quantitative variables**: numeric data that can only take on a finite set of possible values. For example, someone's age or the number of siblings they have.
-
-**Qualitative variables**, also known as **categorical variables**, describe data that isn't measuring some quantity or amount. The sub-categories of categorical data are:
-
-* **Ordinal qualitative variables**: categories with ordered levels. Specifically, ordinal variables are those where the difference between levels has no consistent, quantifiable meaning. Some examples include levels of education (high school, undergrad, grad, etc.), income bracket (low, medium, high), or Yelp rating.
-* **Nominal qualitative variables**: categories with no specific order. For example, someone's political affiliation or Cal ID number.
-
-![Classification of variable types](images/variable.png)
-
-Note that many variables don't sit neatly in just one of these categories. Qualitative variables could have numeric levels, and conversely, quantitative variables could be stored as strings.
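-
-If you want to make an ordinal variable's ordering explicit in `pandas`, one option is an ordered `Categorical`. The values below are made up purely for illustration:
-
-```{python}
-#| code-fold: false
-# Sketch: encode an ordinal variable with an ordered Categorical (made-up data)
-income_bracket = pd.Categorical(["low", "high", "medium", "low"],
-                                categories=["low", "medium", "high"],
-                                ordered=True)
-pd.Series(income_bracket).sort_values()
-```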
-
-### Primary and Foreign Keys
-
-Last time, we introduced `.merge` as the `pandas` method for joining multiple `DataFrame`s together. In our discussion of joins, we touched on the idea of using a "key" to determine what rows should be merged from each table. Let's take a moment to examine this idea more closely.
-
-The **primary key** is the column or set of columns in a table that *uniquely* determine the values of the remaining columns. It can be thought of as the unique identifier for each individual row in the table. For example, a table of Data 100 students might use each student's Cal ID as the primary key.
-
-```{python}
-#| echo: false
-pd.DataFrame({"Cal ID":[3034619471, 3035619472, 3025619473, 3046789372], \
- "Name":["Oski", "Ollie", "Orrie", "Ollie"], \
- "Major":["Data Science", "Computer Science", "Data Science", "Economics"]})
-```
-
-The **foreign key** is the column or set of columns in a table that reference primary keys in other tables. Knowing a dataset's foreign keys can be useful when assigning the `left_on` and `right_on` parameters of `.merge`. In the table of office hour tickets below, `"Cal ID"` is a foreign key referencing the previous table.
-
-```{python}
-#| echo: false
-pd.DataFrame({"OH Request":[1, 2, 3, 4], \
- "Cal ID":[3034619471, 3035619472, 3025619473, 3035619472], \
- "Question":["HW 2 Q1", "HW 2 Q3", "Lab 3 Q4", "HW 2 Q7"]})
-```
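-
-To see how these keys guide a join, here is a small sketch that re-creates simplified versions of the two tables above and merges the tickets with the student roster on `"Cal ID"`:
-
-```{python}
-#| code-fold: false
-# Sketch: join office hour tickets to the student table via the foreign key
-students = pd.DataFrame({"Cal ID": [3034619471, 3035619472, 3025619473, 3046789372],
-                         "Name": ["Oski", "Ollie", "Orrie", "Ollie"]})
-tickets = pd.DataFrame({"OH Request": [1, 2, 3, 4],
-                        "Cal ID": [3034619471, 3035619472, 3025619473, 3035619472]})
-tickets.merge(students, on="Cal ID", how="left")
-```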
-
-## Granularity, Scope, and Temporality
-
-After understanding the structure of the dataset, the next task is to determine what exactly the data represents. We'll do so by considering the data's granularity, scope, and temporality.
-
-### Granularity
-The **granularity** of a dataset is what a single row represents. You can also think of it as the level of detail included in the data. To determine the data's granularity, ask: what does each row in the dataset represent? Fine-grained data contains a high level of detail, with a single row representing a small individual unit. For example, each record may represent one person. Coarse-grained data is encoded such that a single row represents a large individual unit – for example, each record may represent a group of people.
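-
-As a toy sketch of moving from fine-grained to coarse-grained data, we can aggregate person-level rows into one row per city (the data below is made up):
-
-```{python}
-#| code-fold: false
-# Sketch: coarsen granularity from one row per person to one row per city
-people = pd.DataFrame({"city": ["Berkeley", "Berkeley", "Oakland"],
-                       "age": [20, 22, 35]})
-people.groupby("city").agg(num_people=("age", "size"), mean_age=("age", "mean"))
-```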
-
-### Scope
-The **scope** of a dataset is the subset of the population covered by the data. If we were investigating student performance in Data Science courses, a dataset with a narrow scope might encompass all students enrolled in Data 100 whereas a dataset with an expansive scope might encompass all students in California.
-
-### Temporality
-The **temporality** of a dataset describes the periodicity over which the data was collected as well as when the data was most recently collected or updated.
-
-Time and date fields of a dataset could represent a few things:
-
-1. when the "event" happened
-2. when the data was collected, or when it was entered into the system
-3. when the data was copied into the database
-
-To fully understand the temporality of the data, it may also be necessary to standardize time zones or inspect recurring time-based trends in the data (do patterns recur in 24-hour periods? Over the course of a month? Seasonally?). The convention for standardizing time is Coordinated Universal Time (UTC), an international time standard measured at 0 degrees longitude that stays consistent throughout the year (no daylight savings). Berkeley's time zone, Pacific Standard Time (PST), is UTC-8; during daylight savings, Berkeley observes Pacific Daylight Time (PDT), which is UTC-7.
-
-#### Temporality with `pandas`' `dt` accessors
-Let's briefly look at how we can use `pandas`' `dt` accessors to work with dates/times in a dataset using the dataset you'll see in Lab 3: the Berkeley PD Calls for Service dataset.
-
-```{python}
-#| code-fold: true
-calls = pd.read_csv("data/Berkeley_PD_-_Calls_for_Service.csv")
-calls.head()
-```
-
-Looks like there are three columns with dates/times: `EVENTDT`, `EVENTTM`, and `InDbDate`.
-
-Most likely, `EVENTDT` stands for the date when the event took place, `EVENTTM` stands for the time of day the event took place (in 24-hr format), and `InDbDate` is the date this call is recorded onto the database.
-
-If we check the data type of these columns, we will see they are stored as strings. We can convert them to `datetime` objects using pandas `to_datetime` function.
-
-```{python}
-#| code-fold: false
-calls["EVENTDT"] = pd.to_datetime(calls["EVENTDT"])
-calls.head()
-```
-
-Now, we can use the `dt` accessor on this column.
-
-We can get the month:
-
-```{python}
-#| code-fold: false
-calls["EVENTDT"].dt.month.head()
-```
-
-Which day of the week the date is on:
-
-```{python}
-#| code-fold: false
-calls["EVENTDT"].dt.dayofweek.head()
-```
-
-Check the minimum values to see if there are any suspicious-looking 70s dates (which would suggest default UNIX-epoch timestamps):
-
-```{python}
-#| code-fold: false
-calls.sort_values("EVENTDT").head()
-```
-
-Doesn't look like it! We are good!
-
-
-We can also do many things with the `dt` accessor like switching time zones and converting time back to UNIX/POSIX time. Check out the documentation on [`.dt` accessor](https://pandas.pydata.org/docs/user_guide/basics.html#basics-dt-accessors) and [time series/date functionality](https://pandas.pydata.org/docs/user_guide/timeseries.html#).
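-
-For example, a minimal sketch of a time zone conversion and a UNIX-time computation on the `EVENTDT` column might look like the following (this assumes the dates are naive Pacific-time dates, which is an assumption, not something stated in the dataset):
-
-```{python}
-#| code-fold: false
-# Sketch: localize to Pacific time, convert to UTC, then compute seconds since the UNIX epoch
-event_utc = calls["EVENTDT"].dt.tz_localize("US/Pacific").dt.tz_convert("UTC")
-unix_seconds = (event_utc - pd.Timestamp("1970-01-01", tz="UTC")) // pd.Timedelta("1s")
-unix_seconds.head()
-```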
-
-## Faithfulness
-
-At this stage in our data cleaning and EDA workflow, we've achieved quite a lot: we've identified how our data is structured, come to terms with what information it encodes, and gained insight as to how it was generated. Throughout this process, we should always recall the original intent of our work in Data Science – to use data to better understand and model the real world. To achieve this goal, we need to ensure that the data we use is faithful to reality; that is, that our data accurately captures the "real world."
-
-Data used in research or industry is often "messy" – there may be errors or inaccuracies that impact the faithfulness of the dataset. Signs that data may not be faithful include:
-
-* Unrealistic or "incorrect" values, such as negative counts, locations that don't exist, or dates set in the future
-* Violations of obvious dependencies, like an age that does not match a birthday
-* Clear signs that data was entered by hand, which can lead to spelling errors or fields that are incorrectly shifted
-* Signs of data falsification, such as fake email addresses or repeated use of the same names
-* Duplicated records or fields containing the same information
-* Truncated data, e.g. older versions of Microsoft Excel limited spreadsheets to 65,536 rows and 256 columns
-
-We often solve some of these more common issues in the following ways:
-
-* Spelling errors: apply corrections or drop records that aren't in a dictionary
-* Time zone inconsistencies: convert to a common time zone (e.g. UTC)
-* Duplicated records or fields: identify and eliminate duplicates (using primary keys)
-* Unspecified or inconsistent units: infer the units and check that values are in reasonable ranges in the data
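-
-For the duplicated-records case above, `pandas` provides `drop_duplicates`. A minimal sketch on a toy table keyed by a primary key column:
-
-```{python}
-#| code-fold: false
-# Sketch: drop duplicate records, keeping the first occurrence of each primary key
-records = pd.DataFrame({"Cal ID": [3034619471, 3034619471, 3035619472],
-                        "Name": ["Oski", "Oski", "Ollie"]})
-records.drop_duplicates(subset="Cal ID")
-```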
-
-### Missing Values
-Another common issue encountered with real-world datasets is that of missing data. One strategy to resolve this is to simply drop any records with missing values from the dataset. This does, however, introduce the risk of inducing biases – it is possible that the missing or corrupt records may be systemically related to some feature of interest in the data. Another solution is to keep the data as `NaN` values.
-
-A third method to address missing data is to perform **imputation**: infer the missing values using other data available in the dataset. There is a wide variety of imputation techniques that can be implemented; some of the most common are listed below.
-
-* Average imputation: replace missing values with the average value for that field
-* Hot deck imputation: replace missing values with a value drawn from a randomly chosen similar record
-* Regression imputation: develop a model to predict missing values
-* Multiple imputation: replace missing values with multiple random values
-
-Regardless of the strategy used to deal with missing data, we should think carefully about *why* particular records or fields may be missing – this can help inform whether or not the absence of these values is significant or meaningful.
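-
-As a toy illustration of average imputation, we can fill the missing entries of a `Series` with its mean. (Later in this note we use a different strategy, interpolation, on real data.)
-
-```{python}
-#| code-fold: false
-# Sketch: mean imputation on a toy Series
-s = pd.Series([1.0, np.nan, 3.0, np.nan, 5.0])
-s.fillna(s.mean())
-```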
-
-# EDA Demo 1: Tuberculosis in the United States
-
-Now, let's walk through the data-cleaning and EDA workflow to see what we can learn about the presence of Tuberculosis in the United States!
-
-We will examine the data included in the [original CDC article](https://www.cdc.gov/mmwr/volumes/71/wr/mm7112a1.htm?s_cid=mm7112a1_w#T1_down) published in 2021.
-
-
-## CSVs and Field Names
-Suppose Table 1 was saved as a CSV file located in `data/cdc_tuberculosis.csv`.
-
-We can then explore the CSV (which is a text file, and does not contain binary-encoded data) in many ways:
-1. Using a text editor like emacs, vim, VSCode, etc.
-2. Opening the CSV directly in DataHub (read-only), Excel, Google Sheets, etc.
-3. The `Python` file object
-4. `pandas`, using `pd.read_csv()`
-
-To try out options 1 and 2, you can view or download the Tuberculosis CSV from the [lecture demo notebook](https://data100.datahub.berkeley.edu/hub/user-redirect/git-pull?repo=https%3A%2F%2Fgithub.com%2FDS-100%2Ffa23-student&urlpath=lab%2Ftree%2Ffa23-student%2Flecture%2Flec05%2Flec04-eda.ipynb&branch=main) under the `data` folder in the left-hand menu. Notice how the CSV file is a type of **rectangular data (i.e., tabular data) stored as comma-separated values**.
-
-Next, let's try out option 3 using the `Python` file object. We'll look at the first four lines:
-
-```{python}
-#| code-fold: true
-with open("data/cdc_tuberculosis.csv", "r") as f:
- i = 0
- for row in f:
- print(row)
- i += 1
- if i > 3:
- break
-```
-
-Whoa, why are there blank lines interspersed between the lines of the CSV?
-
-You may recall that all line breaks in text files are encoded as the special newline character `\n`. Python's `print()` prints each string (including the newline), and an additional newline on top of that.
-
-If you're curious, we can use the `repr()` function to return the raw string with all special characters:
-
-```{python}
-#| code-fold: true
-with open("data/cdc_tuberculosis.csv", "r") as f:
- i = 0
- for row in f:
- print(repr(row)) # print raw strings
- i += 1
- if i > 3:
- break
-```
-
-Finally, let's try option 4 and use the tried-and-true Data 100 approach: `pandas`.
-
-```{python}
-#| code-fold: false
-tb_df = pd.read_csv("data/cdc_tuberculosis.csv")
-tb_df.head()
-```
-
-You may notice some strange things about this table: what's up with the "Unnamed" column names and the first row?
-
-Congratulations — you're ready to wrangle your data! Because of how things are stored, we'll need to clean the data a bit to name our columns better.
-
-A reasonable first step is to identify the row with the right header. The `pd.read_csv()` function ([documentation](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html)) has the convenient `header` parameter that we can set to use the elements in row 1 as the appropriate columns:
-
-```{python}
-#| code-fold: false
-tb_df = pd.read_csv("data/cdc_tuberculosis.csv", header=1) # row index
-tb_df.head(5)
-```
-
-Wait...but now we can't differentiate between the "Number of TB cases" and "TB incidence" year columns. `pandas` has tried to make our lives easier by automatically adding ".1" to the latter columns, but this doesn't help us, as humans, understand the data.
-
-We can do this manually with `df.rename()` ([documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.rename.html?highlight=rename#pandas.DataFrame.rename)):
-
-```{python}
-#| code-fold: false
-rename_dict = {'2019': 'TB cases 2019',
- '2020': 'TB cases 2020',
- '2021': 'TB cases 2021',
- '2019.1': 'TB incidence 2019',
- '2020.1': 'TB incidence 2020',
- '2021.1': 'TB incidence 2021'}
-tb_df = tb_df.rename(columns=rename_dict)
-tb_df.head(5)
-```
-
-## Record Granularity
-
-You might already be wondering: what's up with that first record?
-
-Row 0 is what we call a **rollup record**, or summary record. It's often useful when displaying tables to humans. The **granularity** of record 0 (Totals) vs the rest of the records (States) is different.
-
-Okay, EDA step two. How was the rollup record aggregated?
-
-Let's check if the Total TB cases equal the sum of all state TB cases. If we sum over all rows, we should get **2x** the total cases in each of the TB cases columns by year (why do you think this is?).
-
-```{python}
-#| code-fold: true
-tb_df.sum(axis=0)
-```
-
-Whoa, what's going on with the TB cases in 2019, 2020, and 2021? Check out the column types:
-
-```{python}
-#| code-fold: true
-tb_df.dtypes
-```
-
-Since there are commas in the values for TB cases, the numbers are read as the `object` datatype, or **storage type** (close to the `Python` string datatype), so `pandas` is concatenating strings instead of adding integers (recall that `Python` can "sum", or concatenate, strings together: `"data" + "100"` evaluates to `"data100"`).
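-
-One manual fix (a sketch, assuming the column holds comma-formatted strings) is to strip the commas and cast the column ourselves, though the `thousands` parameter shown below is the cleaner approach:
-
-```{python}
-#| code-fold: false
-# Sketch: strip commas from one column and convert it to integers
-tb_df["TB cases 2019"].str.replace(",", "").astype(int).head()
-```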
-
-
-Fortunately `read_csv` also has a `thousands` parameter ([documentation](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html)):
-
-```{python}
-#| code-fold: false
-# improve readability: chaining method calls with outer parentheses/line breaks
-tb_df = (
- pd.read_csv("data/cdc_tuberculosis.csv", header=1, thousands=',')
- .rename(columns=rename_dict)
-)
-tb_df.head(5)
-```
-
-```{python}
-#| code-fold: false
-tb_df.sum()
-```
-
-The Total TB cases look right. Phew!
-
-Let's just look at the records with **state-level granularity**:
-
-```{python}
-#| code-fold: true
-state_tb_df = tb_df[1:]
-state_tb_df.head(5)
-```
-
-## Gather Census Data
-
-U.S. Census population estimates [source](https://www.census.gov/data/tables/time-series/demo/popest/2010s-state-total.html) (2019), [source](https://www.census.gov/data/tables/time-series/demo/popest/2020s-state-total.html) (2020-2021).
-
-Running the below cells cleans the data.
-There are a few new methods here:
-* `df.convert_dtypes()` ([documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.convert_dtypes.html)) conveniently converts each column to its best possible dtype (e.g. whole-number floats become integers) and is out of scope for the class.
-* `df.dropna()` ([documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.dropna.html)) will be explained in more detail next time.
-
-```{python}
-#| code-fold: true
-# 2010s census data
-census_2010s_df = pd.read_csv("data/nst-est2019-01.csv", header=3, thousands=",")
-census_2010s_df = (
- census_2010s_df
- .reset_index()
- .drop(columns=["index", "Census", "Estimates Base"])
- .rename(columns={"Unnamed: 0": "Geographic Area"})
- .convert_dtypes() # "smart" converting of columns, use at your own risk
- .dropna() # we'll introduce this next time
-)
-census_2010s_df['Geographic Area'] = census_2010s_df['Geographic Area'].str.strip('.')
-
-# with pd.option_context('display.min_rows', 30): # shows more rows
-# display(census_2010s_df)
-
-census_2010s_df.head(5)
-```
-
-Occasionally, you will want to modify code that you have imported. To reimport those modifications, you can either use `Python`'s `importlib` library:
-
-```python
-from importlib import reload
-reload(utils)
-```
-
-or use `IPython` magic, which will intelligently reimport code when files change:
-
-```python
-%load_ext autoreload
-%autoreload 2
-```
-
-```{python}
-#| code-fold: true
-# census 2020s data
-census_2020s_df = pd.read_csv("data/NST-EST2022-POP.csv", header=3, thousands=",")
-census_2020s_df = (
- census_2020s_df
- .reset_index()
- .drop(columns=["index", "Unnamed: 1"])
- .rename(columns={"Unnamed: 0": "Geographic Area"})
- .convert_dtypes() # "smart" converting of columns, use at your own risk
- .dropna() # we'll introduce this next time
-)
-census_2020s_df['Geographic Area'] = census_2020s_df['Geographic Area'].str.strip('.')
-
-census_2020s_df.head(5)
-```
-
-## Joining Data (Merging `DataFrame`s)
-
-Time to `merge`! Here we use the `DataFrame` method `df1.merge(right=df2, ...)` on `DataFrame` `df1` ([documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.merge.html)). Contrast this with the function `pd.merge(left=df1, right=df2, ...)` ([documentation](https://pandas.pydata.org/docs/reference/api/pandas.merge.html?highlight=pandas%20merge#pandas.merge)). Feel free to use either.
-
-```{python}
-#| code-fold: false
-# merge TB DataFrame with two US census DataFrames
-tb_census_df = (
- tb_df
- .merge(right=census_2010s_df,
- left_on="U.S. jurisdiction", right_on="Geographic Area")
- .merge(right=census_2020s_df,
- left_on="U.S. jurisdiction", right_on="Geographic Area")
-)
-tb_census_df.head(5)
-```
-
-Having all of these columns is a little unwieldy. We could either drop the unneeded columns now, or just merge on smaller census `DataFrame`s. Let's do the latter.
-
-```{python}
-#| code-fold: false
-# try merging again, but cleaner this time
-tb_census_df = (
- tb_df
- .merge(right=census_2010s_df[["Geographic Area", "2019"]],
- left_on="U.S. jurisdiction", right_on="Geographic Area")
- .drop(columns="Geographic Area")
- .merge(right=census_2020s_df[["Geographic Area", "2020", "2021"]],
- left_on="U.S. jurisdiction", right_on="Geographic Area")
- .drop(columns="Geographic Area")
-)
-tb_census_df.head(5)
-```
-
-## Reproducing Data: Compute Incidence
-
-Let's recompute incidence to make sure we know where the original CDC numbers came from.
-
-From the [CDC report](https://www.cdc.gov/mmwr/volumes/71/wr/mm7112a1.htm?s_cid=mm7112a1_w#T1_down): TB incidence is computed as “Cases per 100,000 persons using mid-year population estimates from the U.S. Census Bureau.”
-
-If we define a group as 100,000 people, then we can compute the TB incidence for a given state population as
-
-$$\text{TB incidence} = \frac{\text{TB cases in population}}{\text{groups in population}} = \frac{\text{TB cases in population}}{\text{population}/100000} $$
-
-$$= \frac{\text{TB cases in population}}{\text{population}} \times 100000$$
-
-Let's try this for 2019:
-
-```{python}
-#| code-fold: false
-tb_census_df["recompute incidence 2019"] = tb_census_df["TB cases 2019"]/tb_census_df["2019"]*100000
-tb_census_df.head(5)
-```
-
-Awesome!!!
-
-Let's use a for-loop and `Python` format strings to compute TB incidence for all years. `Python` f-strings are just used for the purposes of this demo, but they're handy to know when you explore data beyond this course ([documentation](https://docs.python.org/3/tutorial/inputoutput.html)).
-
-```{python}
-#| code-fold: false
-# recompute incidence for all years
-for year in [2019, 2020, 2021]:
- tb_census_df[f"recompute incidence {year}"] = tb_census_df[f"TB cases {year}"]/tb_census_df[f"{year}"]*100000
-tb_census_df.head(5)
-```
-
-These numbers look pretty close!!! There are a few errors in the hundredths place, particularly in 2021. It may be useful to further explore reasons behind this discrepancy.
-
-```{python}
-#| code-fold: false
-tb_census_df.describe()
-```
-
-## Bonus EDA: Reproducing the Reported Statistic
-
-
-**How do we reproduce that reported statistic in the original [CDC report](https://www.cdc.gov/mmwr/volumes/71/wr/mm7112a1.htm?s_cid=mm7112a1_w)?**
-
-> Reported TB incidence (cases per 100,000 persons) increased **9.4%**, from **2.2** during 2020 to **2.4** during 2021 but was lower than incidence during 2019 (2.7). Increases occurred among both U.S.-born and non–U.S.-born persons.
-
-This is TB incidence computed across the entire U.S. population! How do we reproduce this?
-* We need to reproduce the "Total" TB incidences in our rolled record.
-* But our current `tb_census_df` only has 51 entries (50 states plus Washington, D.C.). There is no rolled record.
-* What happened...?
-
-Let's get exploring!
-
-Before we keep exploring, we'll set all indexes to more meaningful values, instead of just numbers that pertain to some row at some point. This will make our cleaning slightly easier.
-
-```{python}
-#| code-fold: true
-tb_df = tb_df.set_index("U.S. jurisdiction")
-tb_df.head(5)
-```
-
-```{python}
-#| code-fold: false
-census_2010s_df = census_2010s_df.set_index("Geographic Area")
-census_2010s_df.head(5)
-```
-
-```{python}
-#| code-fold: false
-census_2020s_df = census_2020s_df.set_index("Geographic Area")
-census_2020s_df.head(5)
-```
-
-It turns out that our merge above only kept state records, even though our original `tb_df` had the "Total" rolled record:
-
-```{python}
-#| code-fold: false
-tb_df.head()
-```
-
-Recall that `merge` does an **inner** merge by default, meaning that it only preserves keys that are present in **both** `DataFrame`s.
-
-The rolled records in our census `DataFrame` have different `Geographic Area` fields, which was the key we merged on:
-
-```{python}
-#| code-fold: false
-census_2010s_df.head(5)
-```
-
-The Census `DataFrame` has several rolled records. The aggregate record we are looking for actually has the Geographic Area named "United States".
-
-One straightforward way to get the right merge is to rename the value itself. Because we now have the Geographic Area index, we'll use `df.rename()` ([documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.rename.html)):
-
-```{python}
-#| code-fold: false
-# rename rolled record for 2010s
-census_2010s_df.rename(index={'United States':'Total'}, inplace=True)
-census_2010s_df.head(5)
-```
-
-```{python}
-#| code-fold: false
-# same, but for 2020s rename rolled record
-census_2020s_df.rename(index={'United States':'Total'}, inplace=True)
-census_2020s_df.head(5)
-```
-
-<br/>
-
-Next let's rerun our merge. Note the different chaining, because we are now merging on indexes (`df.merge()` [documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.merge.html)).
-
-```{python}
-#| code-fold: false
-tb_census_df = (
- tb_df
- .merge(right=census_2010s_df[["2019"]],
- left_index=True, right_index=True)
- .merge(right=census_2020s_df[["2020", "2021"]],
- left_index=True, right_index=True)
-)
-tb_census_df.head(5)
-```
-
-<br/>
-
-Finally, let's recompute our incidences:
-
-```{python}
-#| code-fold: false
-# recompute incidence for all years
-for year in [2019, 2020, 2021]:
- tb_census_df[f"recompute incidence {year}"] = tb_census_df[f"TB cases {year}"]/tb_census_df[f"{year}"]*100000
-tb_census_df.head(5)
-```
-
-We reproduced the total U.S. incidences correctly!
-
-We're almost there. Let's revisit the quote:
-
-> Reported TB incidence (cases per 100,000 persons) increased **9.4%**, from **2.2** during 2020 to **2.4** during 2021 but was lower than incidence during 2019 (2.7). Increases occurred among both U.S.-born and non–U.S.-born persons.
-
-Recall that percent change from $A$ to $B$ is computed as
-$\text{percent change} = \frac{B - A}{A} \times 100$.
-
-```{python}
-#| code-fold: false
-#| tags: []
-incidence_2020 = tb_census_df.loc['Total', 'recompute incidence 2020']
-incidence_2020
-```
-
-```{python}
-#| code-fold: false
-#| tags: []
-incidence_2021 = tb_census_df.loc['Total', 'recompute incidence 2021']
-incidence_2021
-```
-
-```{python}
-#| code-fold: false
-#| tags: []
-difference = (incidence_2021 - incidence_2020)/incidence_2020 * 100
-difference
-```
-
-# EDA Demo 2: Mauna Loa CO<sub>2</sub> Data -- A Lesson in Data Faithfulness
-
-[Mauna Loa Observatory](https://gml.noaa.gov/ccgg/trends/data.html) has been monitoring CO<sub>2</sub> concentrations since 1958.
-
-```{python}
-#| code-fold: false
-co2_file = "data/co2_mm_mlo.txt"
-```
-
-Let's do some **EDA**!!
-
-## Reading this file into Pandas?
-Before loading anything, let's check out this `.txt` file directly. Some questions to keep in mind: Do we trust this file extension? What structure does the file have?
-
-Lines 71-78 (inclusive) are shown below:
-
- line number | file contents
-
- 71 | # decimal average interpolated trend #days
- 72 | # date (season corr)
- 73 | 1958 3 1958.208 315.71 315.71 314.62 -1
- 74 | 1958 4 1958.292 317.45 317.45 315.29 -1
- 75 | 1958 5 1958.375 317.50 317.50 314.71 -1
- 76 | 1958 6 1958.458 -99.99 317.10 314.85 -1
- 77 | 1958 7 1958.542 315.86 315.86 314.98 -1
- 78 | 1958 8 1958.625 314.93 314.93 315.94 -1
-
-
-Notice how:
-
-- The values are separated by white space, possibly tabs.
-- The values line up down the rows. For example, the month appears in the 7th to 8th position of each line.
-- The 71st and 72nd lines in the file contain column headings split over two lines.
-
-We can use `read_csv` to read the data into a `pandas` `DataFrame`, and we provide several arguments to specify that the separators are white space, there is no header (**we will set our own column names**), and to skip the first 72 rows of the file.
-
-```{python}
-#| code-fold: false
-co2 = pd.read_csv(
- co2_file, header = None, skiprows = 72,
- sep = r'\s+' # delimiter for continuous whitespace (stay tuned for regex next lecture)
-)
-co2.head()
-```
-
-Congratulations! You've wrangled the data!
-
-<br/>
-
-...But our columns aren't named.
-**We need to do more EDA.**
-
-## Exploring Variable Feature Types
-
-The NOAA [webpage](https://gml.noaa.gov/ccgg/trends/) might have some useful tidbits (in this case it doesn't).
-
-Instead, using the column headings we saw on lines 71-72 of the file, we'll rerun `pd.read_csv`, but this time with some **custom column names.**
-
-```{python}
-#| code-fold: false
-co2 = pd.read_csv(
- co2_file, header = None, skiprows = 72,
- sep = r'\s+', # regex for continuous whitespace (next lecture)
- names = ['Yr', 'Mo', 'DecDate', 'Avg', 'Int', 'Trend', 'Days']
-)
-co2.head()
-```
-
-## Visualizing CO<sub>2</sub>
-Scientific studies tend to have very clean data, right...? Let's jump right in and make a time series plot of CO2 monthly averages.
-
-```{python}
-#| code-fold: true
-sns.lineplot(x='DecDate', y='Avg', data=co2);
-```
-
-The code above uses the `seaborn` plotting library (abbreviated `sns`). We will cover this in the Visualization lecture, but for now you don't need to worry about how it works!
-
-Yikes! Plotting the data uncovered a problem. The sharp vertical lines suggest that we have some **missing values**. What happened here?
-
-```{python}
-#| code-fold: false
-co2.head()
-```
-
-```{python}
-#| code-fold: false
-co2.tail()
-```
-
-Some data have unusual values like -1 and -99.99.
-
-Let's check the description at the top of the file again.
-
-* -1 signifies a missing value for the number of days `Days` the equipment was in operation that month.
-* -99.99 denotes a missing monthly average `Avg`
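-
-A quick count of each sentinel value (a small sanity-check sketch) shows how much of the data is affected:
-
-```{python}
-#| code-fold: false
-# Sketch: count rows that use each missing-data sentinel
-(co2['Days'] == -1).sum(), (co2['Avg'] == -99.99).sum()
-```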
-
-How can we fix this? First, let's explore other aspects of our data. Understanding our data will help us decide what to do with the missing values.
-
-<br/>
-
-
-## Sanity Checks: Reasoning about the data
-First, we consider the shape of the data. How many rows should we have?
-
-* If the data are in chronological order, we should have one record per month.
-* Data from March 1958 to August 2019.
-* We should have $ 12 \times (2019-1957) - 2 - 4 = 738 $ records.
-
-```{python}
-#| code-fold: false
-co2.shape
-```
-
-Nice!! The number of rows (i.e. records) matches our expectations.
-
-<br/>
-
-
-Let's now check the quality of each feature.
-
-## Understanding Missing Value 1: `Days`
-`Days` is a time field, so let's analyze other time fields to see if there is an explanation for missing values of days of operation.
-
-Let's start with **months**, `Mo`.
-
-Are we missing any records? Each month should appear 61 or 62 times (March 1958-August 2019).
-
-```{python}
-#| code-fold: false
-co2["Mo"].value_counts().sort_index()
-```
-
-As expected, Jan, Feb, Sep, Oct, Nov, and Dec have 61 occurrences, and the rest have 62.
-
-<br/>
-
-Next let's explore **days** `Days` itself, which is the number of days that the measurement equipment worked.
-
-```{python}
-#| code-fold: true
-sns.displot(co2['Days']);
-plt.title("Distribution of days feature"); # suppresses unneeded plotting output
-```
-
-In terms of data quality, a handful of months have averages based on measurements taken on fewer than half the days. In addition, there are nearly 200 missing values--**that's about 27% of the data**!
-
-<br/>
-
-Finally, let's check the last time feature, **year** `Yr`.
-
-Let's check to see if there is any connection between missing-ness and the year of the recording.
-
-```{python}
-#| code-fold: true
-sns.scatterplot(x="Yr", y="Days", data=co2);
-plt.title("Day field by Year"); # the ; suppresses output
-```
-
-**Observations**:
-
-* All of the missing data are in the early years of operation.
-* It appears there may have been problems with equipment in the mid to late 80s.
-
-**Potential Next Steps**:
-
-* Confirm these explanations through documentation about the historical readings.
-* Maybe drop the earliest recordings? However, we would want to delay such action until after we have examined the time trends and assessed whether there are any potential problems.
-
-<br/>
-
-## Understanding Missing Value 2: `Avg`
-Next, let's return to the -99.99 values in `Avg` to analyze the overall quality of the CO2 measurements. We'll plot a histogram of the average CO<sub>2</sub> measurements
-
-```{python}
-#| code-fold: true
-# Histograms of average CO2 measurements
-sns.displot(co2['Avg']);
-```
-
-The non-missing values are in the 300-400 range (a regular range of CO2 levels).
-
-We also see that there are only a few missing `Avg` values (**<1% of values**). Let's examine all of them:
-
-```{python}
-#| code-fold: false
-co2[co2["Avg"] < 0]
-```
-
-There doesn't seem to be a pattern to these values, other than that most records also were missing `Days` data.
-
-## Drop, `NaN`, or Impute Missing `Avg` Data?
-
-How should we address the invalid `Avg` data?
-
-1. Drop records
-2. Set to NaN
-3. Impute using some strategy
-
-Remember we want to fix the following plot:
-
-```{python}
-#| code-fold: true
-sns.lineplot(x='DecDate', y='Avg', data=co2)
-plt.title("CO2 Average By Month");
-```
-
-Since we are plotting `Avg` vs `DecDate`, we should just focus on dealing with missing values for `Avg`.
-
-
-Let's consider a few options:
-1. Drop those records
-2. Replace -99.99 with NaN
-3. Substitute a likely value for the average CO2
-
-What do you think are the pros and cons of each possible action?
-
-<br/>
-
-
-Let's examine each of these three options.
-
-```{python}
-#| code-fold: false
-# 1. Drop missing values
-co2_drop = co2[co2['Avg'] > 0]
-co2_drop.head()
-```
-
-```{python}
-#| code-fold: false
-# 2. Replace the -99.99 sentinel with NaN
-co2_NA = co2.replace(-99.99, np.nan)
-co2_NA.head()
-```
-
-We'll also use a third version of the data.
-
-First, we note that the dataset already comes with a **substitute value** for the -99.99.
-
-From the file description:
-
-> The `interpolated` column includes average values from the preceding column (`average`)
-and **interpolated values** where data are missing. Interpolated values are
-computed in two steps...
-
-The `Int` feature has values that exactly match those in `Avg`, except when `Avg` is -99.99, and then a **reasonable** estimate is used instead.
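-
-We can sanity-check that claim directly (a small sketch): wherever `Avg` is not the -99.99 sentinel, it should agree with `Int`.
-
-```{python}
-#| code-fold: false
-# Sketch: verify that Int equals Avg on the non-missing rows
-valid = co2[co2['Avg'] > 0]
-(valid['Avg'] == valid['Int']).all()
-```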
-
-So, the third version of our data will use the `Int` feature instead of `Avg`.
-
-```{python}
-#| code-fold: false
-# 3. Use interpolated column which estimates missing Avg values
-co2_impute = co2.copy()
-co2_impute['Avg'] = co2['Int']
-co2_impute.head()
-```
-
-What's a **reasonable** estimate?
-
-To answer this question, let's zoom in on a short time period, say the measurements in 1958 (where we know we have two missing values).
-
-```{python}
-#| code-fold: true
-# results of plotting data in 1958
-
-def line_and_points(data, ax, title):
- # assumes single year, hence Mo
- ax.plot('Mo', 'Avg', data=data)
- ax.scatter('Mo', 'Avg', data=data)
- ax.set_xlim(2, 13)
- ax.set_title(title)
- ax.set_xticks(np.arange(3, 13))
-
-def data_year(data, year):
- return data[data["Yr"] == year]
-
-# uses matplotlib subplots
-# you may see more next week; focus on output for now
-fig, axes = plt.subplots(ncols = 3, figsize=(12, 4), sharey=True)
-
-year = 1958
-line_and_points(data_year(co2_drop, year), axes[0], title="1. Drop Missing")
-line_and_points(data_year(co2_NA, year), axes[1], title="2. Missing Set to NaN")
-line_and_points(data_year(co2_impute, year), axes[2], title="3. Missing Interpolated")
-
-fig.suptitle(f"Monthly Averages for {year}")
-plt.tight_layout()
-```
-
-In the big picture, since there are only 7 `Avg` values missing (**<1%** of 738 months), any of these approaches would work.
-
-However, there is some appeal to **option 3: imputing**:
-
-* It preserves the seasonal trends for CO2.
-* The line plot shows every month in our data, with no gaps.
-
-<br/>
-
-
-Let's replot our original figure with option 3:
-
-```{python}
-#| code-fold: true
-sns.lineplot(x='DecDate', y='Avg', data=co2_impute)
-plt.title("CO2 Average By Month, Imputed");
-```
-
-Looks pretty close to what we see on the NOAA [website](https://gml.noaa.gov/ccgg/trends/)!
-
-## Presenting the data: A Discussion on Data Granularity
-
-From the description:
-
-* Monthly measurements are averages of daily average measurements.
-* The NOAA GML website has datasets for daily/hourly measurements too.
-
-The data you present depends on your research question.
-
-**How do CO2 levels vary by season?**
-
-* You might want to keep average monthly data.
-
-**Are CO2 levels rising over the past 50+ years, consistent with global warming predictions?**
-
-* You might be happier with a **coarser granularity** of average year data!
-
-```{python}
-#| code-fold: true
-co2_year = co2_impute.groupby('Yr').mean()
-sns.lineplot(x='Yr', y='Avg', data=co2_year)
-plt.title("CO2 Average By Year");
-```
-
-Indeed, we see a rise by nearly 100 ppm of CO2 since Mauna Loa began recording in 1958.
-
-# Summary
-We went over a lot of content this lecture; let's summarize the most important points:
-
-## Dealing with Missing Values
-There are a few options we can take to deal with missing data:
-
-* Drop missing records
-* Keep `NaN` missing values
-* Impute using an interpolated column
-
-## EDA and Data Wrangling
-There are several ways to approach EDA and Data Wrangling:
-
-* Examine the **data and metadata**: what is the date, size, organization, and structure of the data?
-* Examine each **field/attribute/dimension** individually.
-* Examine pairs of related dimensions (e.g. breaking down grades by major).
-* Along the way, we can:
- * **Visualize** or summarize the data.
- * **Validate assumptions** about data and its collection process. Pay particular attention to when the data was collected.
- * Identify and **address anomalies**.
- * Apply data transformations and corrections (we'll cover this in the upcoming lecture).
- * **Record everything you do!** Developing in Jupyter Notebook promotes *reproducibility* of your own work!
+---
+title: Data Cleaning and EDA
+execute:
+ echo: true
+format:
+ html:
+ code-fold: true
+ code-tools: true
+ toc: true
+ toc-title: Data Cleaning and EDA
+ page-layout: full
+ theme:
+ - cosmo
+ - cerulean
+ callout-icon: false
+jupyter: python3
+---
+
+```{python}
+#| code-fold: true
+import numpy as np
+import pandas as pd
+
+import matplotlib.pyplot as plt
+import seaborn as sns
+#%matplotlib inline
+plt.rcParams['figure.figsize'] = (12, 9)
+
+sns.set()
+sns.set_context('talk')
+np.set_printoptions(threshold=20, precision=2, suppress=True)
+pd.set_option('display.max_rows', 30)
+pd.set_option('display.max_columns', None)
+pd.set_option('display.precision', 2)
+# This option stops scientific notation for pandas
+pd.set_option('display.float_format', '{:.2f}'.format)
+
+# Silence some spurious seaborn warnings
+import warnings
+warnings.filterwarnings("ignore", category=FutureWarning)
+```
+
+::: {.callout-note collapse="false"}
+## Learning Outcomes
+* Recognize common file formats
+* Categorize data by its variable type
+* Build awareness of issues with data faithfulness and develop targeted solutions
+:::
+
+**This content is covered in lectures 4, 5, and 6.**
+
+In the past few lectures, we've learned that `pandas` is a toolkit to restructure, modify, and explore a dataset. What we haven't yet touched on is *how* to make these data transformation decisions. When we receive a new set of data from the "real world," how do we know what processing we should do to convert this data into a usable form?
+
+**Data cleaning**, also called **data wrangling**, is the process of transforming raw data to facilitate subsequent analysis. It is often used to address issues like:
+
+* Unclear structure or formatting
+* Missing or corrupted values
+* Unit conversions
+* ...and so on
+
+**Exploratory Data Analysis (EDA)** is the process of understanding a new dataset. It is an open-ended, informal analysis that involves familiarizing ourselves with the variables present in the data, discovering potential hypotheses, and identifying possible issues with the data. This last point can often motivate further data cleaning to address any problems with the dataset's format; because of this, EDA and data cleaning are often thought of as an "infinite loop," with each process driving the other.
+
+In this lecture, we will consider the key properties of data to consider when performing data cleaning and EDA. In doing so, we'll develop a "checklist" of sorts for you to consider when approaching a new dataset. Throughout this process, we'll build a deeper understanding of this early (but very important!) stage of the data science lifecycle.
+
+## Structure
+
+### File Formats
+There are many file types for storing structured data: TSV, JSON, XML, ASCII, SAS, etc. We'll only cover CSV, TSV, and JSON in lecture, but you'll likely encounter other formats as you work with different datasets. Reading documentation is your best bet for understanding how to process the multitude of different file types.
+
+#### CSV
+CSVs, which stand for **Comma-Separated Values**, are a common tabular data format.
+In the past two `pandas` lectures, we briefly touched on the idea of file format: the way data is encoded in a file for storage. Specifically, our `elections` and `babynames` datasets were stored and loaded as CSVs:
+
+```{python}
+#| code-fold: false
+pd.read_csv("data/elections.csv").head(5)
+```
+
+To better understand the properties of a CSV, let's take a look at the first few rows of the raw data file to see what it looks like before being loaded into a `DataFrame`. We'll use the `repr()` function to return the raw string with its special characters:
+
+```{python}
+#| code-fold: false
+with open("data/elections.csv", "r") as table:
+ i = 0
+ for row in table:
+ print(repr(row))
+ i += 1
+ if i > 3:
+ break
+```
+
+Each row, or **record**, in the data is delimited by a newline `\n`. Each column, or **field**, in the data is delimited by a comma `,` (hence, comma-separated!).
+
+#### TSV
+
+Another common file type is **TSV (Tab-Separated Values)**. In a TSV, records are still delimited by a newline `\n`, while fields are delimited by the tab character `\t`.
+
+Let's check out the first few rows of the raw TSV file. Again, we'll use the `repr()` function so that `print` shows the special characters.
+
+```{python}
+#| code-fold: false
+with open("data/elections.txt", "r") as table:
+ i = 0
+ for row in table:
+ print(repr(row))
+ i += 1
+ if i > 3:
+ break
+```
+
+TSVs can be loaded into `pandas` using `pd.read_csv`. We'll need to specify the **delimiter** with the parameter `sep='\t'` [(documentation)](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html).
+
+```{python}
+#| code-fold: false
+pd.read_csv("data/elections.txt", sep='\t').head(3)
+```
+
+An issue with CSVs and TSVs comes up whenever there are commas or tabs within the records. How does `pandas` differentiate between a comma delimiter vs. a comma within the field itself, for example `8,900`? To remedy this, check out the [`quotechar` parameter](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html).
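+
+As a toy sketch of how quoting rescues embedded delimiters, consider a tiny CSV built in memory (the file contents here are made up):
+
+```{python}
+#| code-fold: false
+# Sketch: quoted fields keep the comma in "Berkeley, CA" from being treated as a delimiter
+from io import StringIO
+toy_csv = StringIO('City,Population\n"Berkeley, CA","121,643"\n')
+pd.read_csv(toy_csv, quotechar='"', thousands=',')
+```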
+
+#### JSON
+**JSON (JavaScript Object Notation)** files behave similarly to Python dictionaries. A raw JSON is shown below.
+
+```{python}
+#| code-fold: false
+with open("data/elections.json", "r") as table:
+ i = 0
+ for row in table:
+ print(row)
+ i += 1
+ if i > 8:
+ break
+```
+
+JSON files can be loaded into `pandas` using `pd.read_json`.
+
+```{python}
+#| code-fold: false
+pd.read_json('data/elections.json').head(3)
+```
+
+##### EDA with JSON: Berkeley COVID-19 Data
+The City of Berkeley Open Data [website](https://data.cityofberkeley.info/Health/COVID-19-Confirmed-Cases/xn6j-b766) has a dataset with COVID-19 Confirmed Cases among Berkeley residents by date. Let's download the file and save it as a JSON (note the source URL file type is also a JSON). In the interest of reproducible data science, we will download the data programmatically. We have defined some helper functions in the [`ds100_utils.py`](https://ds100.org/fa23/resources/assets/lectures/lec05/lec05-eda.html) file so that we can reuse these helper functions in many different notebooks.
+
+```{python}
+#| code-fold: false
+from ds100_utils import fetch_and_cache
+
+covid_file = fetch_and_cache(
+ "https://data.cityofberkeley.info/api/views/xn6j-b766/rows.json?accessType=DOWNLOAD",
+ "confirmed-cases.json",
+ force=False)
+covid_file # a file path wrapper object
+```
+
+###### File Size
+Let's start our analysis by getting a rough estimate of the size of the dataset to inform the tools we use to view the data. For relatively small datasets, we can use a text editor or spreadsheet. For larger datasets, more programmatic exploration or distributed computing tools may be more fitting. Here we will use `Python` tools to probe the file.
+
+Since this seems to be a text file, let's investigate the number of lines, which often corresponds to the number of records.
+
+```{python}
+#| code-fold: false
+import os
+
+print(covid_file, "is", os.path.getsize(covid_file) / 1e6, "MB")
+
+with open(covid_file, "r") as f:
+ print(covid_file, "is", sum(1 for l in f), "lines.")
+```
+
+###### Unix Commands
+As part of the EDA workflow, Unix commands can come in very handy. In fact, there's an entire book called ["Data Science at the Command Line"](https://datascienceatthecommandline.com/) that explores this idea in depth!
+In Jupyter/IPython, you can prefix lines with `!` to execute arbitrary Unix commands, and within those lines, you can refer to `Python` variables and expressions with the syntax `{expr}`.
+
+Here, we use the `ls` command to list files, using the `-lh` flags, which request "long format with information in human-readable form." We also use the `wc` command for "word count," but with the `-l` flag, which asks for line counts instead of words.
+
+These two give us the same information as the code above, albeit in a slightly different form:
+
+```{python}
+#| code-fold: false
+!ls -lh {covid_file}
+!wc -l {covid_file}
+```
+
+###### File Contents
+Let's explore the data format using `Python`.
+
+```{python}
+#| code-fold: false
+with open(covid_file, "r") as f:
+ for i, row in enumerate(f):
+ print(repr(row)) # print raw strings
+ if i >= 4: break
+```
+
+We can use the `head` Unix command (which is where `pandas`' `head` method comes from!) to see the first few lines of the file:
+
+```{python}
+#| code-fold: false
+!head -5 {covid_file}
+```
+
+In order to load the JSON file into `pandas`, let's first do some EDA with `Python`'s `json` package to understand the particular structure of this JSON file so that we can decide what (if anything) to load into `pandas`. `Python` has relatively good support for JSON data since it closely matches the internal `Python` object model. In the following cell, we import the entire JSON datafile into a `Python` dictionary using the `json` package.
+
+```{python}
+#| code-fold: false
+import json
+
+with open(covid_file, "rb") as f:
+ covid_json = json.load(f)
+```
+
+The `covid_json` variable is now a dictionary encoding the data in the file:
+
+```{python}
+#| code-fold: false
+type(covid_json)
+```
+
+We can examine what keys are in the top level json object by listing out the keys.
+
+```{python}
+#| code-fold: false
+covid_json.keys()
+```
+
+**Observation**: The JSON dictionary contains a `meta` key, which likely refers to metadata (data about the data). Metadata is often maintained with the data and can be a good source of additional information.
+
+
+We can investigate the meta data further by examining the keys associated with the metadata.
+
+```{python}
+#| code-fold: false
+covid_json['meta'].keys()
+```
+
+The `meta` key contains another dictionary called `view`. This likely refers to meta-data about a particular "view" of some underlying database. We will learn more about views when we study SQL later in the class.
+
+```{python}
+#| code-fold: false
+covid_json['meta']['view'].keys()
+```
+
+Notice that this is a nested/recursive data structure. As we dig deeper, we reveal more and more keys and the corresponding data:
+
+```
+meta
+|-> data
+ | ... (haven't explored yet)
+|-> view
+ | -> id
+ | -> name
+ | -> attribution
+ ...
+ | -> description
+ ...
+ | -> columns
+ ...
+```
+
+
+There is a key called description in the view sub dictionary. This likely contains a description of the data:
+
+```{python}
+#| code-fold: false
+print(covid_json['meta']['view']['description'])
+```
+
+###### Examining the Data Field for Records
+
+We can look at a few entries in the `data` field. This is what we'll load into `pandas`.
+
+```{python}
+#| code-fold: false
+for i in range(3):
+ print(f"{i:03} | {covid_json['data'][i]}")
+```
+
+Observations:
+* These look like equal-length records, so maybe `data` is a table!
+* But what does each value in the record mean? Where can we find the column headers?
+
+For that, we'll need the `columns` key in the metadata dictionary. This returns a list:
+
+```{python}
+#| code-fold: false
+type(covid_json['meta']['view']['columns'])
+```
+
+###### Summary of exploring the JSON file
+
+1. The above **metadata** tells us a lot about the columns in the data including column names, potential data anomalies, and a basic statistic.
+1. Because of its non-tabular structure, JSON makes it easier (than CSV) to create **self-documenting data**, meaning that information about the data is stored in the same file as the data.
+1. Self-documenting data can be helpful since it maintains its own description and these descriptions are more likely to be updated as data changes.
+
+###### Loading COVID Data into `pandas`
+Finally, let's load the data (not the metadata) into a `pandas` `DataFrame`. In the following block of code we:
+
+1. Translate the JSON records into a `DataFrame`:
+
+ * fields: `covid_json['meta']['view']['columns']`
+ * records: `covid_json['data']`
+
+
+1. Remove columns that have no metadata description. This would be a bad idea in general, but here we remove these columns since the above analysis suggests they are unlikely to contain useful information.
+
+1. Examine the `tail` of the table.
+
+```{python}
+#| code-fold: false
+# Load the data from JSON and assign column titles
+covid = pd.DataFrame(
+ covid_json['data'],
+ columns=[c['name'] for c in covid_json['meta']['view']['columns']])
+
+covid.tail()
+```
+
+### Variable Types
+
+After loading data from a file, it's a good idea to take the time to understand what pieces of information are encoded in the dataset. In particular, we want to identify what variable types are present in our data. Broadly speaking, we can categorize variables into one of two overarching types.
+
+**Quantitative variables** describe some numeric quantity or amount. We can divide quantitative data further into:
+
+* **Continuous quantitative variables**: numeric data that can be measured on a continuous scale to arbitrary precision. Continuous variables do not have a strict set of possible values – they can be recorded to any number of decimal places. For example, weights, GPA, or CO<sub>2</sub> concentrations.
+* **Discrete quantitative variables**: numeric data that can only take on a finite set of possible values. For example, someone's age or the number of siblings they have.
+
+**Qualitative variables**, also known as **categorical variables**, describe data that isn't measuring some quantity or amount. The sub-categories of categorical data are:
+
+* **Ordinal qualitative variables**: categories with ordered levels. Specifically, ordinal variables are those where the difference between levels has no consistent, quantifiable meaning. Some examples include levels of education (high school, undergrad, grad, etc.), income bracket (low, medium, high), or Yelp rating.
+* **Nominal qualitative variables**: categories with no specific order. For example, someone's political affiliation or Cal ID number.
+
+![Classification of variable types](images/variable.png)
+
+Note that many variables don't sit neatly in just one of these categories. Qualitative variables could have numeric levels, and conversely, quantitative variables could be stored as strings.
+
+### Primary and Foreign Keys
+
+Last time, we introduced `.merge` as the `pandas` method for joining multiple `DataFrame`s together. In our discussion of joins, we touched on the idea of using a "key" to determine what rows should be merged from each table. Let's take a moment to examine this idea more closely.
+
+The **primary key** is the column or set of columns in a table that *uniquely* determine the values of the remaining columns. It can be thought of as the unique identifier for each individual row in the table. For example, a table of Data 100 students might use each student's Cal ID as the primary key.
+
+```{python}
+#| echo: false
+pd.DataFrame({"Cal ID":[3034619471, 3035619472, 3025619473, 3046789372], \
+ "Name":["Oski", "Ollie", "Orrie", "Ollie"], \
+ "Major":["Data Science", "Computer Science", "Data Science", "Economics"]})
+```
+
+The **foreign key** is the column or set of columns in a table that reference primary keys in other tables. Knowing a dataset's foreign keys can be useful when assigning the `left_on` and `right_on` parameters of `.merge`. In the table of office hour tickets below, `"Cal ID"` is a foreign key referencing the previous table.
+
+```{python}
+#| echo: false
+pd.DataFrame({"OH Request":[1, 2, 3, 4], \
+ "Cal ID":[3034619471, 3035619472, 3025619473, 3035619472], \
+ "Question":["HW 2 Q1", "HW 2 Q3", "Lab 3 Q4", "HW 2 Q7"]})
+```
+
+## Granularity, Scope, and Temporality
+
+After understanding the structure of the dataset, the next task is to determine what exactly the data represents. We'll do so by considering the data's granularity, scope, and temporality.
+
+### Granularity
+The **granularity** of a dataset is what a single row represents. You can also think of it as the level of detail included in the data. To determine the data's granularity, ask: what does each row in the dataset represent? Fine-grained data contains a high level of detail, with a single row representing a small individual unit. For example, each record may represent one person. Coarse-grained data is encoded such that a single row represents a large individual unit – for example, each record may represent a group of people.
+
+### Scope
+The **scope** of a dataset is the subset of the population covered by the data. If we were investigating student performance in Data Science courses, a dataset with a narrow scope might encompass all students enrolled in Data 100 whereas a dataset with an expansive scope might encompass all students in California.
+
+### Temporality
+The **temporality** of a dataset describes the periodicity over which the data was collected as well as when the data was most recently collected or updated.
+
+Time and date fields of a dataset could represent a few things:
+
+1. when the "event" happened
+2. when the data was collected, or when it was entered into the system
+3. when the data was copied into the database
+
+To fully understand the temporality of the data, it may also be necessary to standardize time zones or inspect recurring time-based trends in the data (do patterns recur in 24-hour periods? Over the course of a month? Seasonally?). The convention for standardizing time is Coordinated Universal Time (UTC), an international time standard measured at 0 degrees longitude that stays consistent throughout the year (no daylight savings). Berkeley's time zone, Pacific Standard Time (PST), is UTC-8; during daylight savings, Berkeley observes Pacific Daylight Time (PDT), which is UTC-7.
+
+#### Temporality with `pandas`' `dt` accessors
+Let's briefly look at how we can use `pandas`' `dt` accessors to work with dates/times in a dataset using the dataset you'll see in Lab 3: the Berkeley PD Calls for Service dataset.
+
+```{python}
+#| code-fold: true
+calls = pd.read_csv("data/Berkeley_PD_-_Calls_for_Service.csv")
+calls.head()
+```
+
+Looks like there are three columns with dates/times: `EVENTDT`, `EVENTTM`, and `InDbDate`.
+
+Most likely, `EVENTDT` stands for the date when the event took place, `EVENTTM` stands for the time of day the event took place (in 24-hr format), and `InDbDate` is the date this call is recorded onto the database.
+
+If we check the data type of these columns, we will see they are stored as strings. We can convert them to `datetime` objects using pandas `to_datetime` function.
+
+```{python}
+#| code-fold: false
+calls["EVENTDT"] = pd.to_datetime(calls["EVENTDT"])
+calls.head()
+```
+
+Now, we can use the `dt` accessor on this column.
+
+We can get the month:
+
+```{python}
+#| code-fold: false
+calls["EVENTDT"].dt.month.head()
+```
+
+Which day of the week the date is on:
+
+```{python}
+#| code-fold: false
+calls["EVENTDT"].dt.dayofweek.head()
+```
+
+Check the minimum values to see if there are any suspicious-looking 70s dates (which would suggest default UNIX-epoch timestamps):
+
+```{python}
+#| code-fold: false
+calls.sort_values("EVENTDT").head()
+```
+
+Doesn't look like it! We are good!
+
+
+We can also do many things with the `dt` accessor like switching time zones and converting time back to UNIX/POSIX time. Check out the documentation on [`.dt` accessor](https://pandas.pydata.org/docs/user_guide/basics.html#basics-dt-accessors) and [time series/date functionality](https://pandas.pydata.org/docs/user_guide/timeseries.html#).
+
+## Faithfulness
+
+At this stage in our data cleaning and EDA workflow, we've achieved quite a lot: we've identified how our data is structured, come to terms with what information it encodes, and gained insight as to how it was generated. Throughout this process, we should always recall the original intent of our work in Data Science – to use data to better understand and model the real world. To achieve this goal, we need to ensure that the data we use is faithful to reality; that is, that our data accurately captures the "real world."
+
+Data used in research or industry is often "messy" – there may be errors or inaccuracies that impact the faithfulness of the dataset. Signs that data may not be faithful include:
+
+* Unrealistic or "incorrect" values, such as negative counts, locations that don't exist, or dates set in the future
+* Violations of obvious dependencies, like an age that does not match a birthday
+* Clear signs that data was entered by hand, which can lead to spelling errors or fields that are incorrectly shifted
+* Signs of data falsification, such as fake email addresses or repeated use of the same names
+* Duplicated records or fields containing the same information
+* Truncated data, e.g., older versions of Microsoft Excel limited spreadsheets to 65,536 rows and 256 columns
+
+We often solve some of these more common issues in the following ways:
+
+* Spelling errors: apply corrections or drop records that aren't in a dictionary
+* Time zone inconsistencies: convert to a common time zone (e.g. UTC)
+* Duplicated records or fields: identify and eliminate duplicates (using primary keys)
+* Unspecified or inconsistent units: infer the units and check that values fall within reasonable ranges (a small sketch of such checks follows this list)
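+
+For instance, a rough sketch of de-duplication and range checks in `pandas`, using a small hypothetical table (the column names here are made up for illustration):
+
+```python
+import pandas as pd
+
+# hypothetical table with a primary key ("id"), an age field, and a count field
+records = pd.DataFrame({
+    "id":    [1, 1, 2, 3],
+    "age":   [21, 21, -3, 40],
+    "count": [10, 10, 5, 7],
+})
+
+# duplicated records: drop rows that repeat the primary key
+deduped = records.drop_duplicates(subset="id")
+
+# unrealistic values: flag negative ages or counts for closer inspection
+suspicious = deduped[(deduped["age"] < 0) | (deduped["count"] < 0)]
+print(suspicious)
+```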
+
+### Missing Values
+Another common issue encountered with real-world datasets is that of missing data. One strategy to resolve this is to simply drop any records with missing values from the dataset. This does, however, introduce the risk of inducing biases – it is possible that the missing or corrupt records may be systemically related to some feature of interest in the data. Another solution is to keep the data as `NaN` values.
+
+A third method to address missing data is to perform **imputation**: infer the missing values using other data available in the dataset. There is a wide variety of imputation techniques that can be implemented; some of the most common are listed below.
+
+* Average imputation: replace missing values with the average value for that field
+* Hot deck imputation: replace missing values with a value drawn at random from a similar record
+* Regression imputation: develop a model to predict missing values
+* Multiple imputation: replace missing values with multiple plausible values, producing several completed versions of the dataset
+
+Regardless of the strategy used to deal with missing data, we should think carefully about *why* particular records or fields may be missing – this can help inform whether or not the absence of these values is significant or meaningful.
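+
+As a small sketch of average imputation alongside simple interpolation (which we'll see again in the CO2 demo later), using a short hypothetical `Series` with missing entries:
+
+```python
+import numpy as np
+import pandas as pd
+
+s = pd.Series([2.0, np.nan, 4.0, np.nan, 8.0])
+
+# average imputation: fill NaNs with the mean of the observed values
+s_mean = s.fillna(s.mean())
+
+# linear interpolation: fill NaNs using the neighboring observed values
+s_interp = s.interpolate()
+
+print(s_mean.tolist())    # [2.0, ~4.67, 4.0, ~4.67, 8.0]
+print(s_interp.tolist())  # [2.0, 3.0, 4.0, 6.0, 8.0]
+```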
+
+# EDA Demo 1: Tuberculosis in the United States
+
+Now, let's walk through the data-cleaning and EDA workflow to see what we can learn about the presence of Tuberculosis in the United States!
+
+We will examine the data included in the [original CDC article](https://www.cdc.gov/mmwr/volumes/71/wr/mm7112a1.htm?s_cid=mm7112a1_w#T1_down) published in 2022, which reports TB cases through 2021.
+
+
+## CSVs and Field Names
+Suppose Table 1 was saved as a CSV file located in `data/cdc_tuberculosis.csv`.
+
+We can then explore the CSV (which is a text file, and does not contain binary-encoded data) in many ways:
+1. Using a text editor like emacs, vim, VSCode, etc.
+2. Opening the CSV directly in DataHub (read-only), Excel, Google Sheets, etc.
+3. The `Python` file object
+4. `pandas`, using `pd.read_csv()`
+
+To try out options 1 and 2, you can view or download the Tuberculosis dataset from the [lecture demo notebook](https://data100.datahub.berkeley.edu/hub/user-redirect/git-pull?repo=https%3A%2F%2Fgithub.com%2FDS-100%2Ffa23-student&urlpath=lab%2Ftree%2Ffa23-student%2Flecture%2Flec05%2Flec04-eda.ipynb&branch=main) under the `data` folder in the left hand menu. Notice how the CSV file is a type of **rectangular data (i.e., tabular data) stored as comma-separated values**.
+
+Next, let's try out option 3 using the `Python` file object. We'll look at the first four lines:
+
+```{python}
+#| code-fold: true
+with open("data/cdc_tuberculosis.csv", "r") as f:
+ i = 0
+ for row in f:
+ print(row)
+ i += 1
+ if i > 3:
+ break
+```
+
+Whoa, why are there blank lines interspersed between the lines of the CSV?
+
+You may recall that all line breaks in text files are encoded as the special newline character `\n`. Python's `print()` prints each string (which already ends in a newline) and then adds an extra newline of its own, producing the blank lines.
+
+If you're curious, we can use the `repr()` function to return the raw string with all special characters:
+
+```{python}
+#| code-fold: true
+with open("data/cdc_tuberculosis.csv", "r") as f:
+ i = 0
+ for row in f:
+ print(repr(row)) # print raw strings
+ i += 1
+ if i > 3:
+ break
+```
+
+Finally, let's try option 4 and use the tried-and-true Data 100 approach: `pandas`.
+
+```{python}
+#| code-fold: false
+tb_df = pd.read_csv("data/cdc_tuberculosis.csv")
+tb_df.head()
+```
+
+You may notice some strange things about this table: what's up with the "Unnamed" column names and the first row?
+
+Congratulations — you're ready to wrangle your data! Because of how things are stored, we'll need to clean the data a bit to name our columns better.
+
+A reasonable first step is to identify the row with the right header. The `pd.read_csv()` function ([documentation](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html)) has the convenient `header` parameter that we can set to use the elements in row 1 (the second row, since `pandas` counts from zero) as the appropriate column names:
+
+```{python}
+#| code-fold: false
+tb_df = pd.read_csv("data/cdc_tuberculosis.csv", header=1) # row index
+tb_df.head(5)
+```
+
+Wait...but now we can't differentiate between the "Number of TB cases" and "TB incidence" year columns. `pandas` has tried to make our lives easier by automatically adding ".1" to the latter columns, but this doesn't help us, as humans, understand the data.
+
+We can do this manually with `df.rename()` ([documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.rename.html?highlight=rename#pandas.DataFrame.rename)):
+
+```{python}
+#| code-fold: false
+rename_dict = {'2019': 'TB cases 2019',
+ '2020': 'TB cases 2020',
+ '2021': 'TB cases 2021',
+ '2019.1': 'TB incidence 2019',
+ '2020.1': 'TB incidence 2020',
+ '2021.1': 'TB incidence 2021'}
+tb_df = tb_df.rename(columns=rename_dict)
+tb_df.head(5)
+```
+
+## Record Granularity
+
+You might already be wondering: what's up with that first record?
+
+Row 0 is what we call a **rollup record**, or summary record. It's often useful when displaying tables to humans. The **granularity** of record 0 (Totals) vs the rest of the records (States) is different.
+
+Okay, EDA step two. How was the rollup record aggregated?
+
+Let's check if Total TB cases is the sum of all state TB cases. If we sum across all rows, each of the TB cases columns should come out to **2x** the total cases for that year (why do you think this is?).
+
+```{python}
+#| code-fold: true
+tb_df.sum(axis=0)
+```
+
+Whoa, what's going on with the TB cases in 2019, 2020, and 2021? Check out the column types:
+
+```{python}
+#| code-fold: true
+tb_df.dtypes
+```
+
+Since there are commas in the values for TB cases, the numbers are read in with the `object` datatype, or **storage type** (close to the `Python` string datatype). As a result, `pandas` is concatenating strings instead of adding integers (recall that `Python` can "sum", or concatenate, strings together: `"data" + "100"` evaluates to `"data100"`).
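+
+One way to repair this after loading would be to strip the commas and cast the columns ourselves. Here is a rough sketch, operating on a copy and assuming the renamed columns from above:
+
+```python
+# rough sketch: strip thousands separators and cast the case counts to integers
+tb_fixed = tb_df.copy()
+for col in ["TB cases 2019", "TB cases 2020", "TB cases 2021"]:
+    tb_fixed[col] = tb_fixed[col].str.replace(",", "").astype(int)
+tb_fixed.dtypes
+```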
+
+
+Fortunately `read_csv` also has a `thousands` parameter ([documentation](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html)):
+
+```{python}
+#| code-fold: false
+# improve readability: chaining method calls with outer parentheses/line breaks
+tb_df = (
+ pd.read_csv("data/cdc_tuberculosis.csv", header=1, thousands=',')
+ .rename(columns=rename_dict)
+)
+tb_df.head(5)
+```
+
+```{python}
+#| code-fold: false
+tb_df.sum()
+```
+
+The Total TB cases look right. Phew!
+
+Let's just look at the records with **state-level granularity**:
+
+```{python}
+#| code-fold: true
+state_tb_df = tb_df[1:]
+state_tb_df.head(5)
+```
+
+## Gather Census Data
+
+U.S. Census population estimates [source](https://www.census.gov/data/tables/time-series/demo/popest/2010s-state-total.html) (2019), [source](https://www.census.gov/data/tables/time-series/demo/popest/2020s-state-total.html) (2020-2021).
+
+Running the below cells cleans the data.
+There are a few new methods here:
+* `df.convert_dtypes()` ([documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.convert_dtypes.html)) conveniently converts columns to the best possible dtypes (for example, whole-number floats become nullable integers); the details are out of scope for this class.
+* `df.dropna()` ([documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.dropna.html)) will be explained in more detail next time.
+
+```{python}
+#| code-fold: true
+# 2010s census data
+census_2010s_df = pd.read_csv("data/nst-est2019-01.csv", header=3, thousands=",")
+census_2010s_df = (
+ census_2010s_df
+ .reset_index()
+ .drop(columns=["index", "Census", "Estimates Base"])
+ .rename(columns={"Unnamed: 0": "Geographic Area"})
+ .convert_dtypes() # "smart" converting of columns, use at your own risk
+ .dropna() # we'll introduce this next time
+)
+census_2010s_df['Geographic Area'] = census_2010s_df['Geographic Area'].str.strip('.')
+
+# with pd.option_context('display.min_rows', 30): # shows more rows
+# display(census_2010s_df)
+
+census_2010s_df.head(5)
+```
+
+Occasionally, you will want to modify code that you have imported. To re-import those modifications, you can either use `Python`'s `importlib` library:
+
+```python
+from importlib import reload
+reload(utils)
+```
+
+or use the `IPython` `autoreload` magic, which will automatically re-import code when files change:
+
+```python
+%load_ext autoreload
+%autoreload 2
+```
+
+```{python}
+#| code-fold: true
+# census 2020s data
+census_2020s_df = pd.read_csv("data/NST-EST2022-POP.csv", header=3, thousands=",")
+census_2020s_df = (
+ census_2020s_df
+ .reset_index()
+ .drop(columns=["index", "Unnamed: 1"])
+ .rename(columns={"Unnamed: 0": "Geographic Area"})
+ .convert_dtypes() # "smart" converting of columns, use at your own risk
+ .dropna() # we'll introduce this next time
+)
+census_2020s_df['Geographic Area'] = census_2020s_df['Geographic Area'].str.strip('.')
+
+census_2020s_df.head(5)
+```
+
+## Joining Data (Merging `DataFrame`s)
+
+Time to `merge`! Here we use the `DataFrame` method `df1.merge(right=df2, ...)` on `DataFrame` `df1` ([documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.merge.html)). Contrast this with the function `pd.merge(left=df1, right=df2, ...)` ([documentation](https://pandas.pydata.org/docs/reference/api/pandas.merge.html?highlight=pandas%20merge#pandas.merge)). Feel free to use either.
+
+```{python}
+#| code-fold: false
+# merge TB DataFrame with two US census DataFrames
+tb_census_df = (
+ tb_df
+ .merge(right=census_2010s_df,
+ left_on="U.S. jurisdiction", right_on="Geographic Area")
+ .merge(right=census_2020s_df,
+ left_on="U.S. jurisdiction", right_on="Geographic Area")
+)
+tb_census_df.head(5)
+```
+
+Having all of these columns is a little unwieldy. We could either drop the unneeded columns now, or just merge on smaller census `DataFrame`s. Let's do the latter.
+
+```{python}
+#| code-fold: false
+# try merging again, but cleaner this time
+tb_census_df = (
+ tb_df
+ .merge(right=census_2010s_df[["Geographic Area", "2019"]],
+ left_on="U.S. jurisdiction", right_on="Geographic Area")
+ .drop(columns="Geographic Area")
+ .merge(right=census_2020s_df[["Geographic Area", "2020", "2021"]],
+ left_on="U.S. jurisdiction", right_on="Geographic Area")
+ .drop(columns="Geographic Area")
+)
+tb_census_df.head(5)
+```
+
+## Reproducing Data: Compute Incidence
+
+Let's recompute incidence to make sure we know where the original CDC numbers came from.
+
+From the [CDC report](https://www.cdc.gov/mmwr/volumes/71/wr/mm7112a1.htm?s_cid=mm7112a1_w#T1_down): TB incidence is computed as “Cases per 100,000 persons using mid-year population estimates from the U.S. Census Bureau.”
+
+If we define a group as 100,000 people, then we can compute the TB incidence for a given state population as
+
+$$\text{TB incidence} = \frac{\text{TB cases in population}}{\text{groups in population}} = \frac{\text{TB cases in population}}{\text{population}/100000} $$
+
+$$= \frac{\text{TB cases in population}}{\text{population}} \times 100000$$
+
+Let's try this for 2019:
+
+```{python}
+#| code-fold: false
+tb_census_df["recompute incidence 2019"] = tb_census_df["TB cases 2019"]/tb_census_df["2019"]*100000
+tb_census_df.head(5)
+```
+
+Awesome!!!
+
+Let's use a for-loop and `Python` format strings to compute TB incidence for all years. `Python` f-strings are just used for the purposes of this demo, but they're handy to know when you explore data beyond this course ([documentation](https://docs.python.org/3/tutorial/inputoutput.html)).
+
+```{python}
+#| code-fold: false
+# recompute incidence for all years
+for year in [2019, 2020, 2021]:
+ tb_census_df[f"recompute incidence {year}"] = tb_census_df[f"TB cases {year}"]/tb_census_df[f"{year}"]*100000
+tb_census_df.head(5)
+```
+
+These numbers look pretty close!!! There are a few errors in the hundredths place, particularly in 2021. It may be useful to further explore reasons behind this discrepancy.
+
+```{python}
+#| code-fold: false
+tb_census_df.describe()
+```
+
+## Bonus EDA: Reproducing the Reported Statistic
+
+
+**How do we reproduce that reported statistic in the original [CDC report](https://www.cdc.gov/mmwr/volumes/71/wr/mm7112a1.htm?s_cid=mm7112a1_w)?**
+
+> Reported TB incidence (cases per 100,000 persons) increased **9.4%**, from **2.2** during 2020 to **2.4** during 2021 but was lower than incidence during 2019 (2.7). Increases occurred among both U.S.-born and non–U.S.-born persons.
+
+This is TB incidence computed across the entire U.S. population! How do we reproduce this?
+* We need to reproduce the "Total" TB incidences in our rolled record.
+* But our current `tb_census_df` only has 51 entries (50 states plus Washington, D.C.). There is no rolled record.
+* What happened...?
+
+Let's get exploring!
+
+Before we keep exploring, we'll set all indexes to more meaningful values, instead of just numbers that pertain to some row at some point. This will make our cleaning slightly easier.
+
+```{python}
+#| code-fold: true
+tb_df = tb_df.set_index("U.S. jurisdiction")
+tb_df.head(5)
+```
+
+```{python}
+#| code-fold: false
+census_2010s_df = census_2010s_df.set_index("Geographic Area")
+census_2010s_df.head(5)
+```
+
+```{python}
+#| code-fold: false
+census_2020s_df = census_2020s_df.set_index("Geographic Area")
+census_2020s_df.head(5)
+```
+
+It turns out that our merge above only kept state records, even though our original `tb_df` had the "Total" rolled record:
+
+```{python}
+#| code-fold: false
+tb_df.head()
+```
+
+Recall that `merge` performs an **inner** merge by default, meaning that it only preserves keys that are present in **both** `DataFrame`s.
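+
+If we weren't sure which keys were being dropped, one way to diagnose it (a hedged sketch, not part of the original analysis) is an outer merge with `indicator=True`, which labels whether each key appears in the left table, the right table, or both:
+
+```python
+# sketch: see which index keys fail to match across the two tables
+diagnostic = tb_df.merge(
+    right=census_2010s_df[["2019"]],
+    left_index=True, right_index=True,
+    how="outer", indicator=True
+)
+diagnostic["_merge"].value_counts()
+```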
+
+The rolled records in our census `DataFrame` have different `Geographic Area` fields, which was the key we merged on:
+
+```{python}
+#| code-fold: false
+census_2010s_df.head(5)
+```
+
+The Census `DataFrame` has several rolled records. The aggregate record we are looking for actually has the Geographic Area named "United States".
+
+One straightforward way to get the right merge is to rename the value itself. Because we now have the Geographic Area index, we'll use `df.rename()` ([documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.rename.html)):
+
+```{python}
+#| code-fold: false
+# rename rolled record for 2010s
+census_2010s_df.rename(index={'United States':'Total'}, inplace=True)
+census_2010s_df.head(5)
+```
+
+```{python}
+#| code-fold: false
+# same, but for 2020s rename rolled record
+census_2020s_df.rename(index={'United States':'Total'}, inplace=True)
+census_2020s_df.head(5)
+```
+
+<br/>
+
+Next let's rerun our merge. Note the different chaining, because we are now merging on indexes (`df.merge()` [documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.merge.html)).
+
+```{python}
+#| code-fold: false
+tb_census_df = (
+ tb_df
+ .merge(right=census_2010s_df[["2019"]],
+ left_index=True, right_index=True)
+ .merge(right=census_2020s_df[["2020", "2021"]],
+ left_index=True, right_index=True)
+)
+tb_census_df.head(5)
+```
+
+<br/>
+
+Finally, let's recompute our incidences:
+
+```{python}
+#| code-fold: false
+# recompute incidence for all years
+for year in [2019, 2020, 2021]:
+ tb_census_df[f"recompute incidence {year}"] = tb_census_df[f"TB cases {year}"]/tb_census_df[f"{year}"]*100000
+tb_census_df.head(5)
+```
+
+We reproduced the total U.S. incidences correctly!
+
+We're almost there. Let's revisit the quote:
+
+> Reported TB incidence (cases per 100,000 persons) increased **9.4%**, from **2.2** during 2020 to **2.4** during 2021 but was lower than incidence during 2019 (2.7). Increases occurred among both U.S.-born and non–U.S.-born persons.
+
+Recall that percent change from $A$ to $B$ is computed as
+$\text{percent change} = \frac{B - A}{A} \times 100$.
+
+```{python}
+#| code-fold: false
+#| tags: []
+incidence_2020 = tb_census_df.loc['Total', 'recompute incidence 2020']
+incidence_2020
+```
+
+```{python}
+#| code-fold: false
+#| tags: []
+incidence_2021 = tb_census_df.loc['Total', 'recompute incidence 2021']
+incidence_2021
+```
+
+```{python}
+#| code-fold: false
+#| tags: []
+difference = (incidence_2021 - incidence_2020)/incidence_2020 * 100
+difference
+```
+
+# EDA Demo 2: Mauna Loa CO<sub>2</sub> Data -- A Lesson in Data Faithfulness
+
+[Mauna Loa Observatory](https://gml.noaa.gov/ccgg/trends/data.html) has been monitoring CO<sub>2</sub> concentrations since 1958.
+
+```{python}
+#| code-fold: false
+co2_file = "data/co2_mm_mlo.txt"
+```
+
+Let's do some **EDA**!!
+
+## Reading this file into Pandas?
+Before we blindly read it into `pandas`, let's check out this `.txt` file. Some questions to keep in mind: Do we trust this file extension? How is the file structured?
+
+Lines 71-78 (inclusive) are shown below:
+
+ line number | file contents
+
+ 71 | # decimal average interpolated trend #days
+ 72 | # date (season corr)
+ 73 | 1958 3 1958.208 315.71 315.71 314.62 -1
+ 74 | 1958 4 1958.292 317.45 317.45 315.29 -1
+ 75 | 1958 5 1958.375 317.50 317.50 314.71 -1
+ 76 | 1958 6 1958.458 -99.99 317.10 314.85 -1
+ 77 | 1958 7 1958.542 315.86 315.86 314.98 -1
+ 78 | 1958 8 1958.625 314.93 314.93 315.94 -1
+
+
+Notice how:
+
+- The values are separated by white space, possibly tabs.
+- The data are aligned down the rows. For example, the month appears in the 7th to 8th character position of each line.
+- The 71st and 72nd lines in the file contain column headings split over two lines.
+
+We can use `read_csv` to read the data into a `pandas` `DataFrame`, and we provide several arguments to specify that the separators are white space, there is no header (**we will set our own column names**), and to skip the first 72 rows of the file.
+
+```{python}
+#| code-fold: false
+co2 = pd.read_csv(
+ co2_file, header = None, skiprows = 72,
+    sep = r'\s+' # delimiter for continuous whitespace (stay tuned for regex next lecture)
+)
+co2.head()
+```
+
+Congratulations! You've wrangled the data!
+
+<br/>
+
+...But our columns aren't named.
+**We need to do more EDA.**
+
+## Exploring Variable Feature Types
+
+The NOAA [webpage](https://gml.noaa.gov/ccgg/trends/) might have some useful tidbits (in this case it doesn't).
+
+Using the column descriptions from the file's header, we'll rerun `pd.read_csv`, but this time with some **custom column names.**
+
+```{python}
+#| code-fold: false
+co2 = pd.read_csv(
+ co2_file, header = None, skiprows = 72,
+    sep = r'\s+', # regex for continuous whitespace (next lecture)
+ names = ['Yr', 'Mo', 'DecDate', 'Avg', 'Int', 'Trend', 'Days']
+)
+co2.head()
+```
+
+## Visualizing CO<sub>2</sub>
+Scientific studies tend to have very clean data, right...? Let's jump right in and make a time series plot of CO2 monthly averages.
+
+```{python}
+#| code-fold: true
+sns.lineplot(x='DecDate', y='Avg', data=co2);
+```
+
+The code above uses the `seaborn` plotting library (abbreviated `sns`). We will cover it in the Visualization lecture; for now, you don't need to worry about how it works!
+
+Yikes! Plotting the data uncovered a problem. The sharp vertical lines suggest that we have some **missing values**. What happened here?
+
+```{python}
+#| code-fold: false
+co2.head()
+```
+
+```{python}
+#| code-fold: false
+co2.tail()
+```
+
+Some data have unusual values like -1 and -99.99.
+
+Let's check the description at the top of the file again.
+
+* -1 signifies a missing value for the number of days `Days` the equipment was in operation that month.
+* -99.99 denotes a missing monthly average `Avg`.
+
+How can we fix this? First, let's explore other aspects of our data. Understanding our data will help us decide what to do with the missing values.
+
+<br/>
+
+
+## Sanity Checks: Reasoning about the data
+First, we consider the shape of the data. How many rows should we have?
+
+* If the data are in chronological order, we should have one record per month.
+* The data run from March 1958 to August 2019.
+* We should have $12 \times (2019-1957) - 2 - 4 = 738$ records (a quick cross-check appears below).
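+
+As a small sketch of that arithmetic, we can count the months in the range directly:
+
+```python
+# sketch: count the months from March 1958 through August 2019 (inclusive)
+len(pd.period_range("1958-03", "2019-08", freq="M"))  # expect 738
+```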
+
+```{python}
+#| code-fold: false
+co2.shape
+```
+
+Nice!! The number of rows (i.e., records) matches our expectations.
+
+<br/>
+
+
+Let's now check the quality of each feature.
+
+## Understanding Missing Value 1: `Days`
+`Days` is a time field, so let's analyze other time fields to see if there is an explanation for missing values of days of operation.
+
+Let's start with **months**, `Mo`.
+
+Are we missing any records? Each month should appear 61 or 62 times (March 1958-August 2019).
+
+```{python}
+#| code-fold: false
+co2["Mo"].value_counts().sort_index()
+```
+
+As expected, Jan, Feb, Sep, Oct, Nov, and Dec have 61 occurrences, and the rest have 62.
+
+<br/>
+
+Next let's explore **days** `Days` itself, which is the number of days that the measurement equipment worked.
+
+```{python}
+#| code-fold: true
+sns.displot(co2['Days']);
+plt.title("Distribution of days feature"); # suppresses unneeded plotting output
+```
+
+In terms of data quality, a handful of months have averages based on measurements taken on fewer than half the days. In addition, there are nearly 200 missing values--**that's about 27% of the data**!
+
+<br/>
+
+Finally, let's check the last time feature, **year** `Yr`.
+
+Let's check to see if there is any connection between missingness and the year of the recording.
+
+```{python}
+#| code-fold: true
+sns.scatterplot(x="Yr", y="Days", data=co2);
+plt.title("Day field by Year"); # the ; suppresses output
+```
+
+**Observations**:
+
+* All of the missing data are in the early years of operation.
+* It appears there may have been problems with equipment in the mid to late 80s.
+
+**Potential Next Steps**:
+
+* Confirm these explanations through documentation about the historical readings.
+* Maybe drop the earliest recordings? However, we would want to delay such action until after we have examined the time trends and assessed whether there are any potential problems.
+
+<br/>
+
+## Understanding Missing Value 2: `Avg`
+Next, let's return to the -99.99 values in `Avg` to analyze the overall quality of the CO2 measurements. We'll plot a histogram of the average CO<sub>2</sub> measurements.
+
+```{python}
+#| code-fold: true
+# Histograms of average CO2 measurements
+sns.displot(co2['Avg']);
+```
+
+The non-missing values are in the 300-400 range, which is a typical range for CO2 levels.
+
+We also see that there are only a few missing `Avg` values (**<1% of values**). Let's examine all of them:
+
+```{python}
+#| code-fold: false
+co2[co2["Avg"] < 0]
+```
+
+There doesn't seem to be a pattern to these values, other than that most of these records were also missing `Days` data.
+
+## Drop, `NaN`, or Impute Missing `Avg` Data?
+
+How should we address the invalid `Avg` data?
+
+1. Drop records
+2. Set to NaN
+3. Impute using some strategy
+
+Remember we want to fix the following plot:
+
+```{python}
+#| code-fold: true
+sns.lineplot(x='DecDate', y='Avg', data=co2)
+plt.title("CO2 Average By Month");
+```
+
+Since we are plotting `Avg` vs `DecDate`, we should just focus on dealing with missing values for `Avg`.
+
+
+Let's consider a few options:
+1. Drop those records
+2. Replace -99.99 with NaN
+3. Substitute the -99.99s with a likely value for the average CO2
+
+What do you think are the pros and cons of each possible action?
+
+<br/>
+
+
+Let's examine each of these three options.
+
+```{python}
+#| code-fold: false
+# 1. Drop missing values
+co2_drop = co2[co2['Avg'] > 0]
+co2_drop.head()
+```
+
+```{python}
+#| code-fold: false
+# 2. Replace -99.99 with NaN
+co2_NA = co2.replace(-99.99, np.nan)
+co2_NA.head()
+```
+
+We'll also use a third version of the data.
+
+First, we note that the dataset already comes with a **substitute value** for the -99.99.
+
+From the file description:
+
+> The `interpolated` column includes average values from the preceding column (`average`)
+and **interpolated values** where data are missing. Interpolated values are
+computed in two steps...
+
+The `Int` feature has values that exactly match those in `Avg`, except when `Avg` is -99.99, and then a **reasonable** estimate is used instead.
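+
+We can sanity-check that claim directly. A small sketch, comparing the two columns on the rows where `Avg` is not missing:
+
+```python
+# sketch: wherever Avg is present (not -99.99), Int should agree with it exactly;
+# this should print True if the file description holds
+print((co2.loc[co2["Avg"] > 0, "Avg"] == co2.loc[co2["Avg"] > 0, "Int"]).all())
+```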
+
+So, the third version of our data will use the `Int` feature instead of `Avg`.
+
+```{python}
+#| code-fold: false
+# 3. Use interpolated column which estimates missing Avg values
+co2_impute = co2.copy()
+co2_impute['Avg'] = co2['Int']
+co2_impute.head()
+```
+
+What's a **reasonable** estimate?
+
+To answer this question, let's zoom in on a short time period, say the measurements in 1958 (where we know we have two missing values).
+
+```{python}
+#| code-fold: true
+# results of plotting data in 1958
+
+def line_and_points(data, ax, title):
+ # assumes single year, hence Mo
+ ax.plot('Mo', 'Avg', data=data)
+ ax.scatter('Mo', 'Avg', data=data)
+ ax.set_xlim(2, 13)
+ ax.set_title(title)
+ ax.set_xticks(np.arange(3, 13))
+
+def data_year(data, year):
+    return data[data["Yr"] == year]
+
+# uses matplotlib subplots
+# you may see more next week; focus on output for now
+fig, axes = plt.subplots(ncols = 3, figsize=(12, 4), sharey=True)
+
+year = 1958
+line_and_points(data_year(co2_drop, year), axes[0], title="1. Drop Missing")
+line_and_points(data_year(co2_NA, year), axes[1], title="2. Missing Set to NaN")
+line_and_points(data_year(co2_impute, year), axes[2], title="3. Missing Interpolated")
+
+fig.suptitle(f"Monthly Averages for {year}")
+plt.tight_layout()
+```
+
+In the big picture, since there are only 7 `Avg` values missing (**<1%** of 738 months), any of these approaches would work.
+
+However, there is some appeal to **option 3: imputing**:
+
+* It preserves the seasonal trends in the CO2 data.
+* Since we are plotting all months in our data as a line plot, imputed values avoid misleading gaps in the line.
+
+<br/>
+
+
+Let's replot our original figure with option 3:
+
+```{python}
+#| code-fold: true
+sns.lineplot(x='DecDate', y='Avg', data=co2_impute)
+plt.title("CO2 Average By Month, Imputed");
+```
+
+Looks pretty close to what we see on the NOAA [website](https://gml.noaa.gov/ccgg/trends/)!
+
+## Presenting the data: A Discussion on Data Granularity
+
+From the description:
+
+* Monthly measurements are averages of daily average measurements.
+* The NOAA GML website has datasets for daily/hourly measurements too.
+
+The data you present depends on your research question.
+
+**How do CO2 levels vary by season?**
+
+* You might want to keep average monthly data.
+
+**Are CO2 levels rising over the past 50+ years, consistent with global warming predictions?**
+
+* You might be happier with a **coarser granularity** of average year data!
+
+```{python}
+#| code-fold: true
+co2_year = co2_impute.groupby('Yr').mean()
+sns.lineplot(x='Yr', y='Avg', data=co2_year)
+plt.title("CO2 Average By Year");
+```
+
+Indeed, we see a rise by nearly 100 ppm of CO2 since Mauna Loa began recording in 1958.
+
+# Summary
+We went over a lot of content this lecture; let's summarize the most important points:
+
+## Dealing with Missing Values
+There are a few options we can take to deal with missing data:
+
+* Drop missing records
+* Keep `NaN` missing values
+* Impute using an interpolated column
+
+## EDA and Data Wrangling
+There are several ways to approach EDA and Data Wrangling:
+
+* Examine the **data and metadata**: what is the date, size, organization, and structure of the data?
+* Examine each **field/attribute/dimension** individually.
+* Examine pairs of related dimensions (e.g. breaking down grades by major).
+* Along the way, we can:
+ * **Visualize** or summarize the data.
+ * **Validate assumptions** about data and its collection process. Pay particular attention to when the data was collected.
+ * Identify and **address anomalies**.
+ * Apply data transformations and corrections (we'll cover this in the upcoming lecture).
+ * **Record everything you do!** Developing in Jupyter Notebook promotes *reproducibility* of your own work!
Notice that we use double brackets to extract this column. Why double brackets instead of just single brackets? The `.fit` method, by default, expects to receive 2-dimensional data – some kind of data that includes both rows and columns. Writing `penguins["flipper_length_mm"]` would return a 1D `Series`, causing `sklearn` to error. We avoid this by writing `penguins[["flipper_length_mm"]]` to produce a 2D `DataFrame`.
print(f"The RMSE of the model is {np.sqrt(np.mean((Y-Y_hat_two_features)**2))}")
-
The RMSE of the model is 0.9881331104079044
+The RMSE of the model is 0.9881331104079045
We can also see that we obtain the same predictions using `sklearn` as we did when applying the ordinary least squares formula before!
print(f"MSE of model with (hp^2) feature: {np.mean((Y-hp2_model_predictions)**2)}")
-
MSE of model with (hp^2) feature: 18.984768907617223
+MSE of model with (hp^2) feature: 18.984768907617216
diff --git a/docs/feature_engineering/feature_engineering_files/figure-html/cell-16-output-2.png b/docs/feature_engineering/feature_engineering_files/figure-html/cell-16-output-2.png
index 92cb01c9..f8396667 100644
Binary files a/docs/feature_engineering/feature_engineering_files/figure-html/cell-16-output-2.png and b/docs/feature_engineering/feature_engineering_files/figure-html/cell-16-output-2.png differ
diff --git a/docs/feature_engineering/feature_engineering_files/figure-html/cell-17-output-2.png b/docs/feature_engineering/feature_engineering_files/figure-html/cell-17-output-2.png
index f4ae4ea0..ceecd30f 100644
Binary files a/docs/feature_engineering/feature_engineering_files/figure-html/cell-17-output-2.png and b/docs/feature_engineering/feature_engineering_files/figure-html/cell-17-output-2.png differ
diff --git a/docs/gradient_descent/gradient_descent.html b/docs/gradient_descent/gradient_descent.html
index 467ee5fb..ed238d2c 100644
--- a/docs/gradient_descent/gradient_descent.html
+++ b/docs/gradient_descent/gradient_descent.html
@@ -106,7 +106,7 @@
require.undef("plotly");
requirejs.config({
paths: {
- 'plotly': ['https://cdn.plot.ly/plotly-2.25.2.min']
+ 'plotly': ['https://cdn.plot.ly/plotly-2.12.1.min']
}
});
require(['plotly'], function(Plotly) {
@@ -439,9 +439,9 @@
-