@@ -4304,8 +4292,8 @@
Code
-
+
@@ -4317,9 +4305,9 @@
-
+
@@ -4395,9 +4383,9 @@
-
+
@@ -4481,10 +4469,10 @@
-# 3. Use interpolated column which estimates missing Avg values
-co2_impute = co2.copy()
-co2_impute['Avg'] = co2['Int']
-co2_impute.head()
+# 3. Use interpolated column which estimates missing Avg values
+co2_impute = co2.copy()
+co2_impute['Avg'] = co2['Int']
+co2_impute.head()
@@ -4564,30 +4552,30 @@
Code
-# results of plotting data in 1958
-
-def line_and_points(data, ax, title):
- # assumes single year, hence Mo
- ax.plot('Mo', 'Avg', data=data)
- ax.scatter('Mo', 'Avg', data=data)
- ax.set_xlim(2, 13)
- ax.set_title(title)
- ax.set_xticks(np.arange(3, 13))
-
-def data_year(data, year):
- return data[data["Yr"] == year]
-
-# uses matplotlib subplots
-# you may see more next week; focus on output for now
-fig, axes = plt.subplots(ncols = 3, figsize=(12, 4), sharey=True)
-
-year = 1958
-line_and_points(data_year(co2_drop, year), axes[0], title="1. Drop Missing")
-line_and_points(data_year(co2_NA, year), axes[1], title="2. Missing Set to NaN")
-line_and_points(data_year(co2_impute, year), axes[2], title="3. Missing Interpolated")
-
-fig.suptitle(f"Monthly Averages for {year}")
-plt.tight_layout()
+# results of plotting data in 1958
+
+def line_and_points(data, ax, title):
+ # assumes single year, hence Mo
+ ax.plot('Mo', 'Avg', data=data)
+ ax.scatter('Mo', 'Avg', data=data)
+ ax.set_xlim(2, 13)
+ ax.set_title(title)
+ ax.set_xticks(np.arange(3, 13))
+
+def data_year(data, year):
+ return data[data["Yr"] == year]
+
+# uses matplotlib subplots
+# you may see more next week; focus on output for now
+fig, axes = plt.subplots(ncols = 3, figsize=(12, 4), sharey=True)
+
+year = 1958
+line_and_points(data_year(co2_drop, year), axes[0], title="1. Drop Missing")
+line_and_points(data_year(co2_NA, year), axes[1], title="2. Missing Set to NaN")
+line_and_points(data_year(co2_impute, year), axes[2], title="3. Missing Interpolated")
+
+fig.suptitle(f"Monthly Averages for {year}")
+plt.tight_layout()
@@ -4604,8 +4592,8 @@
Code
-
+
@@ -4632,9 +4620,9 @@
Code
-
+
@@ -4975,1218 +4963,1218 @@ <
Source Code
----
-title: Data Cleaning and EDA
-execute:
- echo: true
-format:
- html:
- code-fold: true
- code-tools: true
- toc: true
- toc-title: Data Cleaning and EDA
- page-layout: full
- theme:
- - cosmo
- - cerulean
- callout-icon: false
-jupyter: python3
----
-
-```{python}
-#| code-fold: true
-import numpy as np
-import pandas as pd
-
-import matplotlib.pyplot as plt
-import seaborn as sns
-#%matplotlib inline
-plt.rcParams['figure.figsize'] = (12, 9)
-
-sns.set()
-sns.set_context('talk')
-np.set_printoptions(threshold=20, precision=2, suppress=True)
-pd.set_option('display.max_rows', 30)
-pd.set_option('display.max_columns', None)
-pd.set_option('display.precision', 2)
-# This option stops scientific notation for pandas
-pd.set_option('display.float_format', '{:.2f}'.format)
-
-# Silence some spurious seaborn warnings
-import warnings
-warnings.filterwarnings("ignore", category=FutureWarning)
-```
-
-::: {.callout-note collapse="false"}
-## Learning Outcomes
-* Recognize common file formats
-* Categorize data by its variable type
-* Build awareness of issues with data faithfulness and develop targeted solutions
-:::
-
-**This content is covered in lectures 4, 5, and 6.**
-
-In the past few lectures, we've learned that `pandas` is a toolkit to restructure, modify, and explore a dataset. What we haven't yet touched on is *how* to make these data transformation decisions. When we receive a new set of data from the "real world," how do we know what processing we should do to convert this data into a usable form?
-
-**Data cleaning**, also called **data wrangling**, is the process of transforming raw data to facilitate subsequent analysis. It is often used to address issues like:
-
-* Unclear structure or formatting
-* Missing or corrupted values
-* Unit conversions
-* ...and so on
-
-**Exploratory Data Analysis (EDA)** is the process of understanding a new dataset. It is an open-ended, informal analysis that involves familiarizing ourselves with the variables present in the data, discovering potential hypotheses, and identifying possible issues with the data. This last point can often motivate further data cleaning to address any problems with the dataset's format; because of this, EDA and data cleaning are often thought of as an "infinite loop," with each process driving the other.
-
-In this lecture, we will consider the key properties of data to consider when performing data cleaning and EDA. In doing so, we'll develop a "checklist" of sorts for you to consider when approaching a new dataset. Throughout this process, we'll build a deeper understanding of this early (but very important!) stage of the data science lifecycle.
-
-## Structure
-
-### File Formats
-There are many file types for storing structured data: TSV, JSON, XML, ASCII, SAS, etc. We'll only cover CSV, TSV, and JSON in lecture, but you'll likely encounter other formats as you work with different datasets. Reading documentation is your best bet for understanding how to process the multitude of different file types.
-
-#### CSV
-CSVs, which stand for **Comma-Separated Values**, are a common tabular data format.
-In the past two `pandas` lectures, we briefly touched on the idea of file format: the way data is encoded in a file for storage. Specifically, our `elections` and `babynames` datasets were stored and loaded as CSVs:
-
-```{python}
-#| code-fold: false
-pd.read_csv("data/elections.csv").head(5)
-```
-
-To better understand the properties of a CSV, let's take a look at the first few rows of the raw data file to see what it looks like before being loaded into a `DataFrame`. We'll use the `repr()` function to return the raw string with its special characters:
-
-```{python}
-#| code-fold: false
-with open("data/elections.csv", "r") as table:
- i = 0
- for row in table:
- print(repr(row))
- i += 1
- if i > 3:
- break
-```
-
-Each row, or **record**, in the data is delimited by a newline `\n`. Each column, or **field**, in the data is delimited by a comma `,` (hence, comma-separated!).
-
-#### TSV
-
-Another common file type is **TSV (Tab-Separated Values)**. In a TSV, records are still delimited by a newline `\n`, while fields are delimited by the tab character `\t`.
-
-Let's check out the first few rows of the raw TSV file. Again, we'll use the `repr()` function so that `print` shows the special characters.
-
-```{python}
-#| code-fold: false
-with open("data/elections.txt", "r") as table:
- i = 0
- for row in table:
- print(repr(row))
- i += 1
- if i > 3:
- break
-```
-
-TSVs can be loaded into `pandas` using `pd.read_csv`. We'll need to specify the **delimiter** with the parameter `sep='\t'` [(documentation)](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html).
-
-```{python}
-#| code-fold: false
-pd.read_csv("data/elections.txt", sep='\t').head(3)
-```
-
-An issue with CSVs and TSVs comes up whenever there are commas or tabs within the records. How does `pandas` differentiate between a comma delimiter vs. a comma within the field itself, for example `8,900`? To remedy this, check out the [`quotechar` parameter](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html).
-
-#### JSON
-**JSON (JavaScript Object Notation)** files behave similarly to Python dictionaries. A raw JSON is shown below.
-
-```{python}
-#| code-fold: false
-with open("data/elections.json", "r") as table:
- i = 0
- for row in table:
- print(row)
- i += 1
- if i > 8:
- break
-```
-
-JSON files can be loaded into `pandas` using `pd.read_json`.
-
-```{python}
-#| code-fold: false
-pd.read_json('data/elections.json').head(3)
-```
-
-##### EDA with JSON: Berkeley COVID-19 Data
-The City of Berkeley Open Data [website](https://data.cityofberkeley.info/Health/COVID-19-Confirmed-Cases/xn6j-b766) has a dataset with COVID-19 Confirmed Cases among Berkeley residents by date. Let's download the file and save it as a JSON (note that the source URL file type is also a JSON). In the interest of reproducible data science, we will download the data programmatically. We have defined some helper functions in the [`ds100_utils.py`](https://ds100.org/fa23/resources/assets/lectures/lec05/lec05-eda.html) file so that we can reuse them in many different notebooks.
-
-```{python}
-#| code-fold: false
-from ds100_utils import fetch_and_cache
-
-covid_file = fetch_and_cache(
- "https://data.cityofberkeley.info/api/views/xn6j-b766/rows.json?accessType=DOWNLOAD",
- "confirmed-cases.json",
- force=False)
-covid_file # a file path wrapper object
-```
-
-###### File Size
-Let's start our analysis by getting a rough estimate of the size of the dataset to inform the tools we use to view the data. For relatively small datasets, we can use a text editor or spreadsheet. For larger datasets, more programmatic exploration or distributed computing tools may be more fitting. Here we will use `Python` tools to probe the file.
-
-Since this seems to be a text file, let's investigate the number of lines, which often corresponds to the number of records.
-
-```{python}
-#| code-fold: false
-import os
-
-print(covid_file, "is", os.path.getsize(covid_file) / 1e6, "MB")
-
-with open(covid_file, "r") as f:
- print(covid_file, "is", sum(1 for l in f), "lines.")
-```
-
-###### Unix Commands
-As part of the EDA workflow, Unix commands can come in very handy. In fact, there's an entire book called ["Data Science at the Command Line"](https://datascienceatthecommandline.com/) that explores this idea in depth!
-In Jupyter/IPython, you can prefix lines with `!` to execute arbitrary Unix commands, and within those lines, you can refer to `Python` variables and expressions with the syntax `{expr}`.
-
-Here, we use the `ls` command to list files, using the `-lh` flags, which request "long format with information in human-readable form." We also use the `wc` command for "word count," but with the `-l` flag, which asks for line counts instead of words.
-
-These two give us the same information as the code above, albeit in a slightly different form:
-
-```{python}
-#| code-fold: false
-!ls -lh {covid_file}
-!wc -l {covid_file}
-```
-
-###### File Contents
-Let's explore the data format using `Python`.
-
-```{python}
-#| code-fold: false
-with open(covid_file, "r") as f:
- for i, row in enumerate(f):
- print(repr(row)) # print raw strings
- if i >= 4: break
-```
-
-We can use the `head` Unix command (which is where `pandas`' `head` method comes from!) to see the first few lines of the file:
-
-```{python}
-#| code-fold: false
-!head -5 {covid_file}
-```
-
-In order to load the JSON file into `pandas`, let's first do some EDA with `Python`'s `json` package to understand the particular structure of this JSON file so that we can decide what (if anything) to load into `pandas`. `Python` has relatively good support for JSON data since it closely matches the internal `Python` object model. In the following cell, we import the entire JSON datafile into a `Python` dictionary using the `json` package.
-
-```{python}
-#| code-fold: false
-import json
-
-with open(covid_file, "rb") as f:
- covid_json = json.load(f)
-```
-
-The `covid_json` variable is now a dictionary encoding the data in the file:
-
-```{python}
-#| code-fold: false
-type(covid_json)
-```
-
-We can examine what keys are in the top-level JSON object by listing them out.
-
-```{python}
-#| code-fold: false
-covid_json.keys()
-```
-
-**Observation**: The JSON dictionary contains a `meta` key, which likely refers to metadata (data about the data). Metadata is often maintained with the data and can be a good source of additional information.
-
-
-We can investigate the metadata further by examining its keys.
-
-```{python}
-#| code-fold: false
-covid_json['meta'].keys()
-```
-
-The `meta` key contains another dictionary called `view`. This likely refers to metadata about a particular "view" of some underlying database. We will learn more about views when we study SQL later in the class.
-
-```{python}
-#| code-fold: false
-covid_json['meta']['view'].keys()
-```
-
-Notice that this is a nested/recursive data structure. As we dig deeper, we reveal more and more keys and the corresponding data:
-
-```
-meta
-|-> data
- | ... (haven't explored yet)
-|-> view
- | -> id
- | -> name
- | -> attribution
- ...
- | -> description
- ...
- | -> columns
- ...
-```
-
-
-There is a key called `description` in the `view` sub-dictionary. This likely contains a description of the data:
-
-```{python}
-#| code-fold: false
-print(covid_json['meta']['view']['description'])
-```
-
-###### Examining the Data Field for Records
-
-We can look at a few entries in the `data` field. This is what we'll load into `pandas`.
-
-```{python}
-#| code-fold: false
-for i in range(3):
- print(f"{i:03} | {covid_json['data'][i]}")
-```
-
-Observations:
-* These look like equal-length records, so maybe `data` is a table!
-* But what does each value in the record mean? Where can we find the column headers?
-
-For that, we'll need the `columns` key in the metadata dictionary. This returns a list:
-
-```{python}
-#| code-fold: false
-type(covid_json['meta']['view']['columns'])
-```
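-
-Each element of this list describes one column. As a quick peek (a sketch: the exact fields depend on the data portal's export format, so we only rely on the `name` key, which we also use below, and merely check whether a `description` is present):
-
-```{python}
-#| code-fold: true
-# peek at the first few column descriptors in the metadata
-for c in covid_json['meta']['view']['columns'][:3]:
-    print(c.get('name'), '| has description:', 'description' in c)
-```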
-
-###### Summary of exploring the JSON file
-
-1. The above **metadata** tells us a lot about the columns in the data including column names, potential data anomalies, and a basic statistic.
-1. Because of its non-tabular structure, JSON makes it easier (than CSV) to create **self-documenting data**, meaning that information about the data is stored in the same file as the data.
-1. Self-documenting data can be helpful since it maintains its own description and these descriptions are more likely to be updated as data changes.
-
-###### Loading COVID Data into `pandas`
-Finally, let's load the data (not the metadata) into a `pandas` `DataFrame`. In the following block of code we:
-
-1. Translate the JSON records into a `DataFrame`:
-
- * fields: `covid_json['meta']['view']['columns']`
- * records: `covid_json['data']`
-
-
-1. Remove columns that have no metadata description. This would be a bad idea in general, but here we remove these columns since the above analysis suggests they are unlikely to contain useful information.
-
-1. Examine the `tail` of the table.
-
-```{python}
-#| code-fold: false
-# Load the data from JSON and assign column titles
-covid = pd.DataFrame(
- covid_json['data'],
- columns=[c['name'] for c in covid_json['meta']['view']['columns']])
-
-covid.tail()
-```
-
-### Variable Types
-
-After loading data from a file, it's a good idea to take the time to understand what pieces of information are encoded in the dataset. In particular, we want to identify what variable types are present in our data. Broadly speaking, we can categorize variables into one of two overarching types.
-
-**Quantitative variables** describe some numeric quantity or amount. We can divide quantitative data further into:
-
-* **Continuous quantitative variables**: numeric data that can be measured on a continuous scale to arbitrary precision. Continuous variables do not have a strict set of possible values – they can be recorded to any number of decimal places. For example, weights, GPA, or CO<sub>2</sub> concentrations.
-* **Discrete quantitative variables**: numeric data that can only take on a finite set of possible values. For example, someone's age or the number of siblings they have.
-
-**Qualitative variables**, also known as **categorical variables**, describe data that isn't measuring some quantity or amount. The sub-categories of categorical data are:
-
-* **Ordinal qualitative variables**: categories with ordered levels. Specifically, ordinal variables are those where the difference between levels has no consistent, quantifiable meaning. Some examples include levels of education (high school, undergrad, grad, etc.), income bracket (low, medium, high), or Yelp rating.
-* **Nominal qualitative variables**: categories with no specific order. For example, someone's political affiliation or Cal ID number.
-
-![Classification of variable types](images/variable.png)
-
-Note that many variables don't sit neatly in just one of these categories. Qualitative variables could have numeric levels, and conversely, quantitative variables could be stored as strings.
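-
-As a small illustrative sketch (the values below are made up), the storage type that `pandas` reports need not match the variable type, and we can recast columns so the two agree:
-
-```{python}
-#| code-fold: true
-# A toy example: Cal ID looks numeric but is nominal; GPA is stored as strings but is quantitative
-toy = pd.DataFrame({"Cal ID": [3034619471, 3035619472], "GPA": ["3.70", "3.90"]})
-toy["Cal ID"] = toy["Cal ID"].astype("string")  # treat as a nominal identifier
-toy["GPA"] = toy["GPA"].astype("float")         # treat as continuous quantitative
-toy.dtypes
-```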
-
-### Primary and Foreign Keys
-
-Last time, we introduced `.merge` as the `pandas` method for joining multiple `DataFrame`s together. In our discussion of joins, we touched on the idea of using a "key" to determine what rows should be merged from each table. Let's take a moment to examine this idea more closely.
-
-The **primary key** is the column or set of columns in a table that *uniquely* determine the values of the remaining columns. It can be thought of as the unique identifier for each individual row in the table. For example, a table of Data 100 students might use each student's Cal ID as the primary key.
-
-```{python}
-#| echo: false
-pd.DataFrame({"Cal ID":[3034619471, 3035619472, 3025619473, 3046789372], \
- "Name":["Oski", "Ollie", "Orrie", "Ollie"], \
- "Major":["Data Science", "Computer Science", "Data Science", "Economics"]})
-```
-
-The **foreign key** is the column or set of columns in a table that reference primary keys in other tables. Knowing a dataset's foreign keys can be useful when assigning the `left_on` and `right_on` parameters of `.merge`. In the table of office hour tickets below, `"Cal ID"` is a foreign key referencing the previous table.
-
-```{python}
-#| echo: false
-pd.DataFrame({"OH Request":[1, 2, 3, 4], \
- "Cal ID":[3034619471, 3035619472, 3025619473, 3035619472], \
- "Question":["HW 2 Q1", "HW 2 Q3", "Lab 3 Q4", "HW 2 Q7"]})
-```
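-
-As a sketch of how these keys drive a join, we can recreate the two example tables above (the variable names `students` and `tickets` are our own; the display-only cells don't assign them) and merge the office hour tickets with student information on the shared `"Cal ID"` key:
-
-```{python}
-#| code-fold: true
-# recreate the two example tables, then join tickets to students on the key
-students = pd.DataFrame({"Cal ID": [3034619471, 3035619472, 3025619473, 3046789372],
-                         "Name": ["Oski", "Ollie", "Orrie", "Ollie"],
-                         "Major": ["Data Science", "Computer Science", "Data Science", "Economics"]})
-tickets = pd.DataFrame({"OH Request": [1, 2, 3, 4],
-                        "Cal ID": [3034619471, 3035619472, 3025619473, 3035619472],
-                        "Question": ["HW 2 Q1", "HW 2 Q3", "Lab 3 Q4", "HW 2 Q7"]})
-tickets.merge(right=students, on="Cal ID")
-```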
-
-## Granularity, Scope, and Temporality
-
-After understanding the structure of the dataset, the next task is to determine what exactly the data represents. We'll do so by considering the data's granularity, scope, and temporality.
-
-### Granularity
-The **granularity** of a dataset is what a single row represents. You can also think of it as the level of detail included in the data. To determine the data's granularity, ask: what does each row in the dataset represent? Fine-grained data contains a high level of detail, with a single row representing a small individual unit. For example, each record may represent one person. Coarse-grained data is encoded such that a single row represents a large individual unit – for example, each record may represent a group of people.
-
-### Scope
-The **scope** of a dataset is the subset of the population covered by the data. If we were investigating student performance in Data Science courses, a dataset with a narrow scope might encompass all students enrolled in Data 100 whereas a dataset with an expansive scope might encompass all students in California.
-
-### Temporality
-The **temporality** of a dataset describes the periodicity over which the data was collected as well as when the data was most recently collected or updated.
-
-Time and date fields of a dataset could represent a few things:
-
-1. when the "event" happened
-2. when the data was collected, or when it was entered into the system
-3. when the data was copied into the database
-
-To fully understand the temporality of the data, it also may be necessary to standardize time zones or inspect recurring time-based trends in the data (do patterns recur in 24-hour periods? Over the course of a month? Seasonally?). The convention for standardizing time is Coordinated Universal Time (UTC), an international time standard measured at 0 degrees longitude that stays consistent throughout the year (no daylight saving time). Berkeley's time zone, Pacific Standard Time (PST), is UTC-8; during daylight saving time, clocks shift to Pacific Daylight Time (PDT), which is UTC-7.
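-
-As a quick sketch of what this standardization looks like in `pandas` (the timestamp below is arbitrary):
-
-```{python}
-#| code-fold: true
-# an arbitrary UTC timestamp, viewed in Berkeley's (Pacific) time zone
-ts_utc = pd.Timestamp("2023-09-01 19:00", tz="UTC")
-ts_utc.tz_convert("US/Pacific")
-```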
-
-#### Temporality with `pandas`' `dt` accessors
-Let's briefly look at how we can use `pandas`' `dt` accessors to work with dates/times in a dataset using the dataset you'll see in Lab 3: the Berkeley PD Calls for Service dataset.
-
-```{python}
-#| code-fold: true
-calls = pd.read_csv("data/Berkeley_PD_-_Calls_for_Service.csv")
-calls.head()
-```
-
-Looks like there are three columns with dates/times: `EVENTDT`, `EVENTTM`, and `InDbDate`.
-
-Most likely, `EVENTDT` stands for the date when the event took place, `EVENTTM` stands for the time of day the event took place (in 24-hr format), and `InDbDate` is the date this call is recorded onto the database.
-
-If we check the data type of these columns, we will see they are stored as strings. We can convert them to `datetime` objects using pandas `to_datetime` function.
-
-```{python}
-#| code-fold: false
-calls["EVENTDT"] = pd.to_datetime(calls["EVENTDT"])
-calls.head()
-```
-
-Now, we can use the `dt` accessor on this column.
-
-We can get the month:
-
-```{python}
-#| code-fold: false
-calls["EVENTDT"].dt.month.head()
-```
-
-Which day of the week the date is on:
-
-```{python}
-#| code-fold: false
-calls["EVENTDT"].dt.dayofweek.head()
-```
-
-Check the minimum values to see if there are any suspicious-looking 70s dates:
-
-```{python}
-#| code-fold: false
-calls.sort_values("EVENTDT").head()
-```
-
-Doesn't look like it! We are good!
-
-
-We can also do many things with the `dt` accessor like switching time zones and converting time back to UNIX/POSIX time. Check out the documentation on [`.dt` accessor](https://pandas.pydata.org/docs/user_guide/basics.html#basics-dt-accessors) and [time series/date functionality](https://pandas.pydata.org/docs/user_guide/timeseries.html#).
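-
-For instance, here is a minimal sketch of both operations on the `EVENTDT` column (assuming these date-only timestamps localize cleanly to Pacific time):
-
-```{python}
-#| code-fold: true
-# localize the naive timestamps to Pacific time, convert to UTC,
-# then express them as UNIX/POSIX seconds since 1970-01-01
-event_utc = calls["EVENTDT"].dt.tz_localize("US/Pacific").dt.tz_convert("UTC")
-unix_seconds = (event_utc - pd.Timestamp("1970-01-01", tz="UTC")) // pd.Timedelta("1s")
-unix_seconds.head()
-```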
-
-## Faithfulness
-
-At this stage in our data cleaning and EDA workflow, we've achieved quite a lot: we've identified how our data is structured, come to terms with what information it encodes, and gained insight as to how it was generated. Throughout this process, we should always recall the original intent of our work in Data Science – to use data to better understand and model the real world. To achieve this goal, we need to ensure that the data we use is faithful to reality; that is, that our data accurately captures the "real world."
-
-Data used in research or industry is often "messy" – there may be errors or inaccuracies that impact the faithfulness of the dataset. Signs that data may not be faithful include:
-
-* Unrealistic or "incorrect" values, such as negative counts, locations that don't exist, or dates set in the future
-* Violations of obvious dependencies, like an age that does not match a birthday
-* Clear signs that data was entered by hand, which can lead to spelling errors or fields that are incorrectly shifted
-* Signs of data falsification, such as fake email addresses or repeated use of the same names
-* Duplicated records or fields containing the same information
-* Truncated data, e.g. older versions of Microsoft Excel limited spreadsheets to 65,536 rows and 256 columns
-
-We often solve some of these more common issues in the following ways:
-
-* Spelling errors: apply corrections or drop records that aren't in a dictionary
-* Time zone inconsistencies: convert to a common time zone (e.g. UTC)
-* Duplicated records or fields: identify and eliminate duplicates (using primary keys)
-* Unspecified or inconsistent units: infer the units and check that values are in reasonable ranges in the data
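-
-As a minimal sketch of a couple of these checks in `pandas` (on a tiny made-up table):
-
-```{python}
-#| code-fold: true
-# toy table with a duplicated record and an out-of-range age
-records = pd.DataFrame({"Cal ID": [3034619471, 3034619471, 3025619473],
-                        "Age": [20, 20, -3]})
-deduped = records.drop_duplicates(subset="Cal ID")  # eliminate duplicates via the primary key
-deduped[deduped["Age"].between(0, 120)]             # keep only values in a reasonable range
-```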
-
-### Missing Values
-Another common issue encountered with real-world datasets is that of missing data. One strategy to resolve this is to simply drop any records with missing values from the dataset. This does, however, introduce the risk of inducing biases – it is possible that the missing or corrupt records may be systematically related to some feature of interest in the data. Another solution is to keep the data as `NaN` values.
-
-A third method to address missing data is to perform **imputation**: infer the missing values using other data available in the dataset. There is a wide variety of imputation techniques that can be implemented; some of the most common are listed below.
-
-* Average imputation: replace missing values with the average value for that field
-* Hot deck imputation: replace missing values with some random value
-* Regression imputation: develop a model to predict missing values
-* Multiple imputation: replace missing values with multiple random values
-
-Regardless of the strategy used to deal with missing data, we should think carefully about *why* particular records or fields may be missing – this can help inform whether or not the absence of these values is significant or meaningful.
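-
-As a one-line sketch of the simplest of these strategies, average imputation can be done with `fillna` on a toy series:
-
-```{python}
-#| code-fold: true
-# replace missing values with the mean of the observed values
-s = pd.Series([1.0, np.nan, 3.0, 4.0, np.nan])
-s.fillna(s.mean())
-```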
-
-# EDA Demo 1: Tuberculosis in the United States
-
-Now, let's walk through the data-cleaning and EDA workflow to see what can we learn about the presence of Tuberculosis in the United States!
-
-We will examine the data included in the [original CDC article](https://www.cdc.gov/mmwr/volumes/71/wr/mm7112a1.htm?s_cid=mm7112a1_w#T1_down) published in 2021.
-
-
-## CSVs and Field Names
-Suppose Table 1 was saved as a CSV file located in `data/cdc_tuberculosis.csv`.
-
-We can then explore the CSV (which is a text file, and does not contain binary-encoded data) in many ways:
-
-1. Using a text editor like emacs, vim, VSCode, etc.
-2. Opening the CSV directly in DataHub (read-only), Excel, Google Sheets, etc.
-3. The `Python` file object
-4. `pandas`, using `pd.read_csv()`
-
-To try out options 1 and 2, you can view or download the Tuberculosis dataset from the [lecture demo notebook](https://data100.datahub.berkeley.edu/hub/user-redirect/git-pull?repo=https%3A%2F%2Fgithub.com%2FDS-100%2Ffa23-student&urlpath=lab%2Ftree%2Ffa23-student%2Flecture%2Flec05%2Flec04-eda.ipynb&branch=main) under the `data` folder in the left-hand menu. Notice how the CSV file is a type of **rectangular data (i.e., tabular data) stored as comma-separated values**.
-
-Next, let's try out option 3 using the `Python` file object. We'll look at the first four lines:
-
-```{python}
-#| code-fold: true
-with open("data/cdc_tuberculosis.csv", "r") as f:
- i = 0
- for row in f:
- print(row)
- i += 1
- if i > 3:
- break
-```
-
-Whoa, why are there blank lines interspersed between the lines of the CSV?
-
-You may recall that all line breaks in text files are encoded as the special newline character `\n`. Python's `print()` prints each string (which already ends in a newline) and then adds an additional newline of its own.
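-
-If we just want to see each line of the file exactly once, one quick fix is to suppress `print`'s own trailing newline:
-
-```{python}
-#| code-fold: true
-# pass end="" so print does not add a second newline
-with open("data/cdc_tuberculosis.csv", "r") as f:
-    for i, row in enumerate(f):
-        print(row, end="")
-        if i >= 3:
-            break
-```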
-
-If you're curious, we can use the `repr()` function to return the raw string with all special characters:
-
-```{python}
-#| code-fold: true
-with open("data/cdc_tuberculosis.csv", "r") as f:
- i = 0
- for row in f:
- print(repr(row)) # print raw strings
- i += 1
- if i > 3:
- break
-```
-
-Finally, let's try option 4 and use the tried-and-true Data 100 approach: `pandas`.
-
-```{python}
-#| code-fold: false
-tb_df = pd.read_csv("data/cdc_tuberculosis.csv")
-tb_df.head()
-```
-
-You may notice some strange things about this table: what's up with the "Unnamed" column names and the first row?
-
-Congratulations — you're ready to wrangle your data! Because of how things are stored, we'll need to clean the data a bit to name our columns better.
-
-A reasonable first step is to identify the row with the right header. The `pd.read_csv()` function ([documentation](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html)) has the convenient `header` parameter that we can set to use the elements in row 1 as the appropriate columns:
-
-```{python}
-#| code-fold: false
-tb_df = pd.read_csv("data/cdc_tuberculosis.csv", header=1) # row index
-tb_df.head(5)
-```
-
-Wait...but now we can't differentiate between the "Number of TB cases" and "TB incidence" year columns. `pandas` has tried to make our lives easier by automatically adding ".1" to the latter columns, but this doesn't help us, as humans, understand the data.
-
-We can do this manually with `df.rename()` ([documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.rename.html?highlight=rename#pandas.DataFrame.rename)):
-
-```{python}
-#| code-fold: false
-rename_dict = {'2019': 'TB cases 2019',
- '2020': 'TB cases 2020',
- '2021': 'TB cases 2021',
- '2019.1': 'TB incidence 2019',
- '2020.1': 'TB incidence 2020',
- '2021.1': 'TB incidence 2021'}
-tb_df = tb_df.rename(columns=rename_dict)
-tb_df.head(5)
-```
-
-## Record Granularity
-
-You might already be wondering: what's up with that first record?
-
-Row 0 is what we call a **rollup record**, or summary record. It's often useful when displaying tables to humans. The **granularity** of record 0 (Totals) vs the rest of the records (States) is different.
-
-Okay, EDA step two. How was the rollup record aggregated?
-
-Let's check if Total TB cases is the sum of all state TB cases. If we sum over all rows, we should get **2x** the total cases in each of our TB cases by year (why do you think this is?).
-
-```{python}
-#| code-fold: true
-tb_df.sum(axis=0)
-```
-
-Whoa, what's going on with the TB cases in 2019, 2020, and 2021? Check out the column types:
-
-```{python}
-#| code-fold: true
-tb_df.dtypes
-```
-
-Since there are commas in the values for TB cases, the numbers are read as the `object` datatype, or **storage type** (close to the `Python` string datatype), so `pandas` is concatenating strings instead of adding integers (recall that `Python` can "sum", or concatenate, strings together: `"data" + "100"` evaluates to `"data100"`).
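-
-One manual workaround (just a sketch; the cleaner `read_csv` fix follows below) is to strip the commas from the string-typed columns and cast them to integers:
-
-```{python}
-#| code-fold: true
-# strip thousands separators and cast the case counts to integers
-tb_manual = tb_df.copy()
-for col in ["TB cases 2019", "TB cases 2020", "TB cases 2021"]:
-    tb_manual[col] = tb_manual[col].str.replace(",", "").astype(int)
-tb_manual[["TB cases 2019", "TB cases 2020", "TB cases 2021"]].dtypes
-```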
-
-
-Fortunately `read_csv` also has a `thousands` parameter ([documentation](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html)):
-
-```{python}
-#| code-fold: false
-# improve readability: chaining method calls with outer parentheses/line breaks
-tb_df = (
- pd.read_csv("data/cdc_tuberculosis.csv", header=1, thousands=',')
- .rename(columns=rename_dict)
-)
-tb_df.head(5)
-```
-
-```{python}
-#| code-fold: false
-tb_df.sum()
-```
-
-The Total TB cases look right. Phew!
-
-Let's just look at the records with **state-level granularity**:
-
-```{python}
-#| code-fold: true
-state_tb_df = tb_df[1:]
-state_tb_df.head(5)
-```
-
-## Gather Census Data
-
-U.S. Census population estimates [source](https://www.census.gov/data/tables/time-series/demo/popest/2010s-state-total.html) (2019), [source](https://www.census.gov/data/tables/time-series/demo/popest/2020s-state-total.html) (2020-2021).
-
-Running the below cells cleans the data.
-There are a few new methods here:
-* `df.convert_dtypes()` ([documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.convert_dtypes.html)) conveniently converts columns to the best possible dtypes (for example, floats that hold whole numbers become integers); the details are out of scope for the class.
-* `df.dropna()` ([documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.dropna.html)) will be explained in more detail next time.
-
-```{python}
-#| code-fold: true
-# 2010s census data
-census_2010s_df = pd.read_csv("data/nst-est2019-01.csv", header=3, thousands=",")
-census_2010s_df = (
- census_2010s_df
- .reset_index()
- .drop(columns=["index", "Census", "Estimates Base"])
- .rename(columns={"Unnamed: 0": "Geographic Area"})
- .convert_dtypes() # "smart" converting of columns, use at your own risk
- .dropna() # we'll introduce this next time
-)
-census_2010s_df['Geographic Area'] = census_2010s_df['Geographic Area'].str.strip('.')
-
-# with pd.option_context('display.min_rows', 30): # shows more rows
-# display(census_2010s_df)
-
-census_2010s_df.head(5)
-```
-
-Occasionally, you will want to modify code that you have imported. To reimport those modifications, you can either use `Python`'s `importlib` library:
-
-```python
-from importlib import reload
-reload(utils)
-```
-
-or use `IPython` magic, which will intelligently reimport code when files change:
-
-```python
-%load_ext autoreload
-%autoreload 2
-```
-
-```{python}
-#| code-fold: true
-# census 2020s data
-census_2020s_df = pd.read_csv("data/NST-EST2022-POP.csv", header=3, thousands=",")
-census_2020s_df = (
- census_2020s_df
- .reset_index()
- .drop(columns=["index", "Unnamed: 1"])
- .rename(columns={"Unnamed: 0": "Geographic Area"})
- .convert_dtypes() # "smart" converting of columns, use at your own risk
- .dropna() # we'll introduce this next time
-)
-census_2020s_df['Geographic Area'] = census_2020s_df['Geographic Area'].str.strip('.')
-
-census_2020s_df.head(5)
-```
-
-## Joining Data (Merging `DataFrame`s)
-
-Time to `merge`! Here we use the `DataFrame` method `df1.merge(right=df2, ...)` on `DataFrame` `df1` ([documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.merge.html)). Contrast this with the function `pd.merge(left=df1, right=df2, ...)` ([documentation](https://pandas.pydata.org/docs/reference/api/pandas.merge.html?highlight=pandas%20merge#pandas.merge)). Feel free to use either.
-
-```{python}
-#| code-fold: false
-# merge TB DataFrame with two US census DataFrames
-tb_census_df = (
- tb_df
- .merge(right=census_2010s_df,
- left_on="U.S. jurisdiction", right_on="Geographic Area")
- .merge(right=census_2020s_df,
- left_on="U.S. jurisdiction", right_on="Geographic Area")
-)
-tb_census_df.head(5)
-```
-
-Having all of these columns is a little unwieldy. We could either drop the unneeded columns now, or just merge on smaller census `DataFrame`s. Let's do the latter.
-
-```{python}
-#| code-fold: false
-# try merging again, but cleaner this time
-tb_census_df = (
- tb_df
- .merge(right=census_2010s_df[["Geographic Area", "2019"]],
- left_on="U.S. jurisdiction", right_on="Geographic Area")
- .drop(columns="Geographic Area")
- .merge(right=census_2020s_df[["Geographic Area", "2020", "2021"]],
- left_on="U.S. jurisdiction", right_on="Geographic Area")
- .drop(columns="Geographic Area")
-)
-tb_census_df.head(5)
-```
-
-## Reproducing Data: Compute Incidence
-
-Let's recompute incidence to make sure we know where the original CDC numbers came from.
-
-From the [CDC report](https://www.cdc.gov/mmwr/volumes/71/wr/mm7112a1.htm?s_cid=mm7112a1_w#T1_down): TB incidence is computed as “Cases per 100,000 persons using mid-year population estimates from the U.S. Census Bureau.”
-
-If we define a group as 100,000 people, then we can compute the TB incidence for a given state population as
-
-$$\text{TB incidence} = \frac{\text{TB cases in population}}{\text{groups in population}} = \frac{\text{TB cases in population}}{\text{population}/100000} $$
-
-$$= \frac{\text{TB cases in population}}{\text{population}} \times 100000$$
-
-Let's try this for 2019:
-
-```{python}
-#| code-fold: false
-tb_census_df["recompute incidence 2019"] = tb_census_df["TB cases 2019"]/tb_census_df["2019"]*100000
-tb_census_df.head(5)
-```
-
-Awesome!!!
-
-Let's use a for-loop and `Python` format strings to compute TB incidence for all years. `Python` f-strings are just used for the purposes of this demo, but they're handy to know when you explore data beyond this course ([documentation](https://docs.python.org/3/tutorial/inputoutput.html)).
-
-```{python}
-#| code-fold: false
-# recompute incidence for all years
-for year in [2019, 2020, 2021]:
- tb_census_df[f"recompute incidence {year}"] = tb_census_df[f"TB cases {year}"]/tb_census_df[f"{year}"]*100000
-tb_census_df.head(5)
-```
-
-These numbers look pretty close!!! There are a few errors in the hundredths place, particularly in 2021. It may be useful to further explore reasons behind this discrepancy.
-
-```{python}
-#| code-fold: false
-tb_census_df.describe()
-```
-
-## Bonus EDA: Reproducing the Reported Statistic
-
-
-**How do we reproduce that reported statistic in the original [CDC report](https://www.cdc.gov/mmwr/volumes/71/wr/mm7112a1.htm?s_cid=mm7112a1_w)?**
-
-> Reported TB incidence (cases per 100,000 persons) increased **9.4%**, from **2.2** during 2020 to **2.4** during 2021 but was lower than incidence during 2019 (2.7). Increases occurred among both U.S.-born and non–U.S.-born persons.
-
-This is TB incidence computed across the entire U.S. population! How do we reproduce this?
-* We need to reproduce the "Total" TB incidences in our rolled record.
-* But our current `tb_census_df` only has 51 entries (50 states plus Washington, D.C.). There is no rolled record.
-* What happened...?
-
-Let's get exploring!
-
-Before we keep exploring, we'll set all indexes to more meaningful values, instead of just numbers that pertain to some row at some point. This will make our cleaning slightly easier.
-
-```{python}
-#| code-fold: true
-tb_df = tb_df.set_index("U.S. jurisdiction")
-tb_df.head(5)
-```
-
-```{python}
-#| code-fold: false
-census_2010s_df = census_2010s_df.set_index("Geographic Area")
-census_2010s_df.head(5)
-```
-
-```{python}
-#| code-fold: false
-census_2020s_df = census_2020s_df.set_index("Geographic Area")
-census_2020s_df.head(5)
-```
-
-It turns out that our merge above only kept state records, even though our original `tb_df` had the "Total" rolled record:
-
-```{python}
-#| code-fold: false
-tb_df.head()
-```
-
-Recall that `merge` does an **inner** merge by default, meaning that it only preserves keys that are present in **both** `DataFrame`s.
-
-The rolled records in our census `DataFrame` have different `Geographic Area` fields, which was the key we merged on:
-
-```{python}
-#| code-fold: false
-census_2010s_df.head(5)
-```
-
-The Census `DataFrame` has several rolled records. The aggregate record we are looking for actually has the Geographic Area named "United States".
-
-One straightforward way to get the right merge is to rename the value itself. Because we now have the Geographic Area index, we'll use `df.rename()` ([documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.rename.html)):
-
-```{python}
-#| code-fold: false
-# rename rolled record for 2010s
-census_2010s_df.rename(index={'United States':'Total'}, inplace=True)
-census_2010s_df.head(5)
-```
-
-```{python}
-#| code-fold: false
-# same, but for 2020s rename rolled record
-census_2020s_df.rename(index={'United States':'Total'}, inplace=True)
-census_2020s_df.head(5)
-```
-
-<br/>
-
-Next let's rerun our merge. Note the different chaining, because we are now merging on indexes (`df.merge()` [documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.merge.html)).
-
-```{python}
-#| code-fold: false
-tb_census_df = (
- tb_df
- .merge(right=census_2010s_df[["2019"]],
- left_index=True, right_index=True)
- .merge(right=census_2020s_df[["2020", "2021"]],
- left_index=True, right_index=True)
-)
-tb_census_df.head(5)
-```
-
-<br/>
-
-Finally, let's recompute our incidences:
-
-```{python}
-#| code-fold: false
-# recompute incidence for all years
-for year in [2019, 2020, 2021]:
- tb_census_df[f"recompute incidence {year}"] = tb_census_df[f"TB cases {year}"]/tb_census_df[f"{year}"]*100000
-tb_census_df.head(5)
-```
-
-We reproduced the total U.S. incidences correctly!
-
-We're almost there. Let's revisit the quote:
-
-> Reported TB incidence (cases per 100,000 persons) increased **9.4%**, from **2.2** during 2020 to **2.4** during 2021 but was lower than incidence during 2019 (2.7). Increases occurred among both U.S.-born and non–U.S.-born persons.
-
-Recall that percent change from $A$ to $B$ is computed as
-$\text{percent change} = \frac{B - A}{A} \times 100$.
-
-```{python}
-#| code-fold: false
-#| tags: []
-incidence_2020 = tb_census_df.loc['Total', 'recompute incidence 2020']
-incidence_2020
-```
-
-```{python}
-#| code-fold: false
-#| tags: []
-incidence_2021 = tb_census_df.loc['Total', 'recompute incidence 2021']
-incidence_2021
-```
-
-```{python}
-#| code-fold: false
-#| tags: []
-difference = (incidence_2021 - incidence_2020)/incidence_2020 * 100
-difference
-```
-
-# EDA Demo 2: Mauna Loa CO<sub>2</sub> Data -- A Lesson in Data Faithfulness
-
-[Mauna Loa Observatory](https://gml.noaa.gov/ccgg/trends/data.html) has been monitoring CO<sub>2</sub> concentrations since 1958.
-
-```{python}
-#| code-fold: false
-co2_file = "data/co2_mm_mlo.txt"
-```
-
-Let's do some **EDA**!!
-
-## Reading this file into Pandas?
-Let's instead check out this `.txt` file. Some questions to keep in mind: Do we trust this file extension? What structure does it have?
-
-Lines 71-78 (inclusive) are shown below:
-
- line number | file contents
-
- 71 | # decimal average interpolated trend #days
- 72 | # date (season corr)
- 73 | 1958 3 1958.208 315.71 315.71 314.62 -1
- 74 | 1958 4 1958.292 317.45 317.45 315.29 -1
- 75 | 1958 5 1958.375 317.50 317.50 314.71 -1
- 76 | 1958 6 1958.458 -99.99 317.10 314.85 -1
- 77 | 1958 7 1958.542 315.86 315.86 314.98 -1
- 78 | 1958 8 1958.625 314.93 314.93 315.94 -1
-
-
-Notice how:
-
-- The values are separated by white space, possibly tabs.
-- The data line up down the rows. For example, the month appears in the 7th to 8th position of each line.
-- The 71st and 72nd lines in the file contain column headings split over two lines.
-
-We can use `read_csv` to read the data into a `pandas` `DataFrame`, and we provide several arguments to specify that the separator is whitespace, there is no header (**we will set our own column names**), and the first 72 rows of the file should be skipped.
-
-```{python}
-#| code-fold: false
-co2 = pd.read_csv(
- co2_file, header = None, skiprows = 72,
- sep = r'\s+' # delimiter for continuous whitespace (stay tuned for regex next lecture)
-)
-co2.head()
-```
-
-Congratulations! You've wrangled the data!
-
-<br/>
-
-...But our columns aren't named.
-**We need to do more EDA.**
-
-## Exploring Variable Feature Types
-
-The NOAA [webpage](https://gml.noaa.gov/ccgg/trends/) might have some useful tidbits (in this case it doesn't).
-
-Using this information, we'll rerun `pd.read_csv`, but this time with some **custom column names.**
-
-```{python}
-#| code-fold: false
-co2 = pd.read_csv(
- co2_file, header = None, skiprows = 72,
- sep = r'\s+', # regex for continuous whitespace (next lecture)
- names = ['Yr', 'Mo', 'DecDate', 'Avg', 'Int', 'Trend', 'Days']
-)
-co2.head()
-```
-
-## Visualizing CO<sub>2</sub>
-Scientific studies tend to have very clean data, right...? Let's jump right in and make a time series plot of CO2 monthly averages.
-
-```{python}
-#| code-fold: true
-sns.lineplot(x='DecDate', y='Avg', data=co2);
-```
-
-The code above uses the `seaborn` plotting library (abbreviated `sns`). We will cover it in the Visualization lecture; for now, you don't need to worry about how it works!
-
-Yikes! Plotting the data uncovered a problem. The sharp vertical lines suggest that we have some **missing values**. What happened here?
-
-```{python}
-#| code-fold: false
-co2.head()
-```
-
-```{python}
-#| code-fold: false
-co2.tail()
-```
-
-Some data have unusual values like -1 and -99.99.
-
-Let's check the description at the top of the file again.
-
-* -1 signifies a missing value for the number of days `Days` the equipment was in operation that month.
-* -99.99 denotes a missing monthly average `Avg`
-
-How can we fix this? First, let's explore other aspects of our data. Understanding our data will help us decide what to do with the missing values.
-
-<br/>
-
-
-## Sanity Checks: Reasoning about the data
-First, we consider the shape of the data. How many rows should we have?
-
-* If the data are in chronological order, we should have one record per month.
-* Data from March 1958 to August 2019.
-* We should have $ 12 \times (2019-1957) - 2 - 4 = 738 $ records.
-
-```{python}
-#| code-fold: false
-co2.shape
-```
-
-Nice!! The number of rows (i.e., records) matches our expectations.
-
-<br/>
-
-
-Let's now check the quality of each feature.
-
-## Understanding Missing Value 1: `Days`
-`Days` is a time field, so let's analyze other time fields to see if there is an explanation for missing values of days of operation.
-
-Let's start with **months**, `Mo`.
-
-Are we missing any records? Each month should appear 61 or 62 times (March 1958 to August 2019).
-
-```{python}
-#| code-fold: false
-co2["Mo"].value_counts().sort_index()
-```
-
-As expected, Jan, Feb, Sep, Oct, Nov, and Dec have 61 occurrences, and the rest have 62.
-
-<br/>
-
-Next let's explore **days** `Days` itself, which is the number of days that the measurement equipment worked.
-
-```{python}
-#| code-fold: true
-sns.displot(co2['Days']);
-plt.title("Distribution of days feature"); # suppresses unneeded plotting output
-```
-
-In terms of data quality, a handful of months have averages based on measurements taken on fewer than half the days. In addition, there are nearly 200 missing values--**that's about 27% of the data**!
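-
-We can verify that fraction directly (recall that -1 is the sentinel for a missing `Days` value):
-
-```{python}
-#| code-fold: true
-# count and fraction of records where Days is recorded as missing (-1)
-missing_days = (co2["Days"] == -1)
-missing_days.sum(), missing_days.mean()
-```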
-
-<br/>
-
-Finally, let's check the last time feature, **year** `Yr`.
-
-Let's check to see if there is any connection between missing-ness and the year of the recording.
-
-```{python}
-#| code-fold: true
-sns.scatterplot(x="Yr", y="Days", data=co2);
-plt.title("Day field by Year"); # the ; suppresses output
-```
-
-**Observations**:
-
-* All of the missing data are in the early years of operation.
-* It appears there may have been problems with equipment in the mid to late 80s.
-
-**Potential Next Steps**:
-
-* Confirm these explanations through documentation about the historical readings.
-* Maybe drop earliest recordings? However, we would want to delay such action until after we have examined the time trends and assess whether there are any potential problems.
-
-<br/>
-
-## Understanding Missing Value 2: `Avg`
-Next, let's return to the -99.99 values in `Avg` to analyze the overall quality of the CO2 measurements. We'll plot a histogram of the average CO<sub>2</sub> measurements:
-
-```{python}
-#| code-fold: true
-# Histograms of average CO2 measurements
-sns.displot(co2['Avg']);
-```
-
-The non-missing values are in the 300-400 range (a regular range of CO2 levels).
-
-We also see that there are only a few missing `Avg` values (**<1% of values**). Let's examine all of them:
-
-```{python}
-#| code-fold: false
-co2[co2["Avg"] < 0]
-```
-
-There doesn't seem to be a pattern to these values, other than that most records also were missing `Days` data.
-
-## Drop, `NaN`, or Impute Missing `Avg` Data?
-
-How should we address the invalid `Avg` data?
-
-1. Drop records
-2. Set to NaN
-3. Impute using some strategy
-
-Remember we want to fix the following plot:
-
-```{python}
-#| code-fold: true
-sns.lineplot(x='DecDate', y='Avg', data=co2)
-plt.title("CO2 Average By Month");
-```
-
-Since we are plotting `Avg` vs `DecDate`, we should just focus on dealing with missing values for `Avg`.
-
-
-Let's consider a few options:
-
-1. Drop those records
-2. Replace -99.99 with NaN
-3. Substitute -99.99 with a likely value for the average CO2
-
-What do you think are the pros and cons of each possible action?
-
-<br/>
-
-
-Let's examine each of these three options.
-
-```{python}
-#| code-fold: false
-# 1. Drop missing values
-co2_drop = co2[co2['Avg'] > 0]
-co2_drop.head()
-```
-
-```{python}
-#| code-fold: false
-# 2. Replace -99.99 with NaN
-co2_NA = co2.replace(-99.99, np.nan)
-co2_NA.head()
-```
-
-We'll also use a third version of the data.
-
-First, we note that the dataset already comes with a **substitute value** for the -99.99.
-
-From the file description:
-
-> The `interpolated` column includes average values from the preceding column (`average`)
-and **interpolated values** where data are missing. Interpolated values are
-computed in two steps...
-
-The `Int` feature has values that exactly match those in `Avg`, except when `Avg` is -99.99, and then a **reasonable** estimate is used instead.
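-
-We can quickly sanity-check that claim: wherever `Avg` is not the -99.99 sentinel, `Avg` and `Int` should agree exactly.
-
-```{python}
-#| code-fold: true
-# check that Avg and Int match on all non-missing rows
-valid = co2["Avg"] > 0
-(co2.loc[valid, "Avg"] == co2.loc[valid, "Int"]).all()
-```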
-
-So, the third version of our data will use the `Int` feature instead of `Avg`.
-
-```{python}
-#| code-fold: false
-# 3. Use interpolated column which estimates missing Avg values
-co2_impute = co2.copy()
-co2_impute['Avg'] = co2['Int']
-co2_impute.head()
-```
-
-What's a **reasonable** estimate?
-
-To answer this question, let's zoom in on a short time period, say the measurements in 1958 (where we know we have two missing values).
-
-```{python}
-#| code-fold: true
-# results of plotting data in 1958
-
-def line_and_points(data, ax, title):
- # assumes single year, hence Mo
- ax.plot('Mo', 'Avg', data=data)
- ax.scatter('Mo', 'Avg', data=data)
- ax.set_xlim(2, 13)
- ax.set_title(title)
- ax.set_xticks(np.arange(3, 13))
-
-def data_year(data, year):
- return data[data["Yr"] == year]
-
-# uses matplotlib subplots
-# you may see more next week; focus on output for now
-fig, axes = plt.subplots(ncols = 3, figsize=(12, 4), sharey=True)
-
-year = 1958
-line_and_points(data_year(co2_drop, year), axes[0], title="1. Drop Missing")
-line_and_points(data_year(co2_NA, year), axes[1], title="2. Missing Set to NaN")
-line_and_points(data_year(co2_impute, year), axes[2], title="3. Missing Interpolated")
-
-fig.suptitle(f"Monthly Averages for {year}")
-plt.tight_layout()
-```
-
-In the big picture, since there are only 7 `Avg` values missing (**<1%** of 738 months), any of these approaches would work.
-
-However, there is some appeal to **option 3, imputing**:
-
-* It preserves the seasonal trends for CO2.
-* We are plotting all months in our data as a line plot, so interpolated values keep the line unbroken.
-
-<br/>
-
-
-Let's replot our original figure with option 3:
-
-```{python}
-#| code-fold: true
-sns.lineplot(x='DecDate', y='Avg', data=co2_impute)
-plt.title("CO2 Average By Month, Imputed");
-```
-
-Looks pretty close to what we see on the NOAA [website](https://gml.noaa.gov/ccgg/trends/)!
-
-## Presenting the data: A Discussion on Data Granularity
-
-From the description:
-
-* Monthly measurements are averages of daily average measurements.
-* The NOAA GML website has datasets for daily/hourly measurements too.
-
-The data you present depends on your research question.
-
-**How do CO2 levels vary by season?**
-
-* You might want to keep average monthly data.
-
-**Are CO2 levels rising over the past 50+ years, consistent with global warming predictions?**
-
-* You might be happier with a **coarser granularity** of average year data!
-
-```{python}
-#| code-fold: true
-co2_year = co2_impute.groupby('Yr').mean()
-sns.lineplot(x='Yr', y='Avg', data=co2_year)
-plt.title("CO2 Average By Year");
-```
-
-Indeed, we see a rise by nearly 100 ppm of CO2 since Mauna Loa began recording in 1958.
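-
-As a rough check of that figure, we can compare the first and last yearly averages in our aggregated data:
-
-```{python}
-#| code-fold: true
-# approximate rise: last yearly average minus the first (1958) yearly average
-co2_year["Avg"].iloc[-1] - co2_year["Avg"].iloc[0]
-```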
-
-# Summary
-We went over a lot of content this lecture; let's summarize the most important points:
-
-## Dealing with Missing Values
-There are a few options we can take to deal with missing data:
-
-* Drop missing records
-* Keep `NaN` missing values
-* Impute using an interpolated column
-
-## EDA and Data Wrangling
-There are several ways to approach EDA and Data Wrangling:
-
-* Examine the **data and metadata**: what is the date, size, organization, and structure of the data?
-* Examine each **field/attribute/dimension** individually.
-* Examine pairs of related dimensions (e.g. breaking down grades by major).
-* Along the way, we can:
- * **Visualize** or summarize the data.
- * **Validate assumptions** about data and its collection process. Pay particular attention to when the data was collected.
- * Identify and **address anomalies**.
- * Apply data transformations and corrections (we'll cover this in the upcoming lecture).
- * **Record everything you do!** Developing in Jupyter Notebook promotes *reproducibility* of your own work!
+---
+title: Data Cleaning and EDA
+execute:
+ echo: true
+format:
+ html:
+ code-fold: true
+ code-tools: true
+ toc: true
+ toc-title: Data Cleaning and EDA
+ page-layout: full
+ theme:
+ - cosmo
+ - cerulean
+ callout-icon: false
+jupyter: python3
+---
+
+```{python}
+#| code-fold: true
+import numpy as np
+import pandas as pd
+
+import matplotlib.pyplot as plt
+import seaborn as sns
+#%matplotlib inline
+plt.rcParams['figure.figsize'] = (12, 9)
+
+sns.set()
+sns.set_context('talk')
+np.set_printoptions(threshold=20, precision=2, suppress=True)
+pd.set_option('display.max_rows', 30)
+pd.set_option('display.max_columns', None)
+pd.set_option('display.precision', 2)
+# This option stops scientific notation for pandas
+pd.set_option('display.float_format', '{:.2f}'.format)
+
+# Silence some spurious seaborn warnings
+import warnings
+warnings.filterwarnings("ignore", category=FutureWarning)
+```
+
+::: {.callout-note collapse="false"}
+## Learning Outcomes
+* Recognize common file formats
+* Categorize data by its variable type
+* Build awareness of issues with data faithfulness and develop targeted solutions
+:::
+
+**This content is covered in lectures 4, 5, and 6.**
+
+In the past few lectures, we've learned that `pandas` is a toolkit to restructure, modify, and explore a dataset. What we haven't yet touched on is *how* to make these data transformation decisions. When we receive a new set of data from the "real world," how do we know what processing we should do to convert this data into a usable form?
+
+**Data cleaning**, also called **data wrangling**, is the process of transforming raw data to facilitate subsequent analysis. It is often used to address issues like:
+
+* Unclear structure or formatting
+* Missing or corrupted values
+* Unit conversions
+* ...and so on
+
+**Exploratory Data Analysis (EDA)** is the process of understanding a new dataset. It is an open-ended, informal analysis that involves familiarizing ourselves with the variables present in the data, discovering potential hypotheses, and identifying possible issues with the data. This last point can often motivate further data cleaning to address any problems with the dataset's format; because of this, EDA and data cleaning are often thought of as an "infinite loop," with each process driving the other.
+
+In this lecture, we will consider the key properties of data to consider when performing data cleaning and EDA. In doing so, we'll develop a "checklist" of sorts for you to consider when approaching a new dataset. Throughout this process, we'll build a deeper understanding of this early (but very important!) stage of the data science lifecycle.
+
+## Structure
+
+### File Formats
+There are many file types for storing structured data: TSV, JSON, XML, ASCII, SAS, etc. We'll only cover CSV, TSV, and JSON in lecture, but you'll likely encounter other formats as you work with different datasets. Reading documentation is your best bet for understanding how to process the multitude of different file types.
+
+#### CSV
+CSVs, which stand for **Comma-Separated Values**, are a common tabular data format.
+In the past two `pandas` lectures, we briefly touched on the idea of file format: the way data is encoded in a file for storage. Specifically, our `elections` and `babynames` datasets were stored and loaded as CSVs:
+
+```{python}
+#| code-fold: false
+pd.read_csv("data/elections.csv").head(5)
+```
+
+To better understand the properties of a CSV, let's take a look at the first few rows of the raw data file to see what it looks like before being loaded into a `DataFrame`. We'll use the `repr()` function to return the raw string with its special characters:
+
+```{python}
+#| code-fold: false
+with open("data/elections.csv", "r") as table:
+ i = 0
+ for row in table:
+ print(repr(row))
+ i += 1
+ if i > 3:
+ break
+```
+
+Each row, or **record**, in the data is delimited by a newline `\n`. Each column, or **field**, in the data is delimited by a comma `,` (hence, comma-separated!).
+
+#### TSV
+
+Another common file type is **TSV (Tab-Separated Values)**. In a TSV, records are still delimited by a newline `\n`, while fields are delimited by the tab character `\t`.
+
+Let's check out the first few rows of the raw TSV file. Again, we'll use the `repr()` function so that `print` shows the special characters.
+
+```{python}
+#| code-fold: false
+with open("data/elections.txt", "r") as table:
+ i = 0
+ for row in table:
+ print(repr(row))
+ i += 1
+ if i > 3:
+ break
+```
+
+TSVs can be loaded into `pandas` using `pd.read_csv`. We'll need to specify the **delimiter** with the parameter `sep='\t'` [(documentation)](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html).
+
+```{python}
+#| code-fold: false
+pd.read_csv("data/elections.txt", sep='\t').head(3)
+```
+
+An issue with CSVs and TSVs comes up whenever there are commas or tabs within the records. How does `pandas` differentiate between a comma delimiter vs. a comma within the field itself, for example `8,900`? To remedy this, check out the [`quotechar` parameter](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html).
+
+#### JSON
+**JSON (JavaScript Object Notation)** files behave similarly to Python dictionaries. A raw JSON is shown below.
+
+```{python}
+#| code-fold: false
+with open("data/elections.json", "r") as table:
+ i = 0
+ for row in table:
+ print(row)
+ i += 1
+ if i > 8:
+ break
+```
+
+JSON files can be loaded into `pandas` using `pd.read_json`.
+
+```{python}
+#| code-fold: false
+pd.read_json('data/elections.json').head(3)
+```
+
+##### EDA with JSON: Berkeley COVID-19 Data
+The City of Berkeley Open Data [website](https://data.cityofberkeley.info/Health/COVID-19-Confirmed-Cases/xn6j-b766) has a dataset with COVID-19 Confirmed Cases among Berkeley residents by date. Let's download the file and save it as a JSON (note that the source URL file type is also a JSON). In the interest of reproducible data science, we will download the data programmatically. We have defined some helper functions in the [`ds100_utils.py`](https://ds100.org/fa23/resources/assets/lectures/lec05/lec05-eda.html) file so that we can reuse them in many different notebooks.
+
+```{python}
+#| code-fold: false
+from ds100_utils import fetch_and_cache
+
+covid_file = fetch_and_cache(
+ "https://data.cityofberkeley.info/api/views/xn6j-b766/rows.json?accessType=DOWNLOAD",
+ "confirmed-cases.json",
+ force=False)
+covid_file # a file path wrapper object
+```
+
+###### File Size
+Let's start our analysis by getting a rough estimate of the size of the dataset to inform the tools we use to view the data. For relatively small datasets, we can use a text editor or spreadsheet. For larger datasets, more programmatic exploration or distributed computing tools may be more fitting. Here we will use `Python` tools to probe the file.
+
+Since this appears to be a text file, let's investigate the number of lines, which often corresponds to the number of records.
+
+```{python}
+#| code-fold: false
+import os
+
+print(covid_file, "is", os.path.getsize(covid_file) / 1e6, "MB")
+
+with open(covid_file, "r") as f:
+ print(covid_file, "is", sum(1 for l in f), "lines.")
+```
+
+###### Unix Commands
+As part of the EDA workflow, Unix commands can come in very handy. In fact, there's an entire book called ["Data Science at the Command Line"](https://datascienceatthecommandline.com/) that explores this idea in depth!
+In Jupyter/IPython, you can prefix lines with `!` to execute arbitrary Unix commands, and within those lines, you can refer to `Python` variables and expressions with the syntax `{expr}`.
+
+Here, we use the `ls` command to list files, using the `-lh` flags, which request "long format with information in human-readable form." We also use the `wc` command for "word count," but with the `-l` flag, which asks for line counts instead of words.
+
+These two give us the same information as the code above, albeit in a slightly different form:
+
+```{python}
+#| code-fold: false
+!ls -lh {covid_file}
+!wc -l {covid_file}
+```
+
+###### File Contents
+Let's explore the data format using `Python`.
+
+```{python}
+#| code-fold: false
+with open(covid_file, "r") as f:
+ for i, row in enumerate(f):
+ print(repr(row)) # print raw strings
+ if i >= 4: break
+```
+
+We can use the `head` Unix command (which is where `pandas`' `head` method comes from!) to see the first few lines of the file:
+
+```{python}
+#| code-fold: false
+!head -5 {covid_file}
+```
+
+In order to load the JSON file into `pandas`, let's first do some EDA with `Python`'s `json` package to understand the particular structure of this JSON file so that we can decide what (if anything) to load into `pandas`. `Python` has relatively good support for JSON data since it closely matches the internal `Python` object model. In the following cell, we import the entire JSON datafile into a `Python` dictionary using the `json` package.
+
+```{python}
+#| code-fold: false
+import json
+
+with open(covid_file, "rb") as f:
+ covid_json = json.load(f)
+```
+
+The `covid_json` variable is now a dictionary encoding the data in the file:
+
+```{python}
+#| code-fold: false
+type(covid_json)
+```
+
+We can examine the keys of the top-level JSON object by listing them out.
+
+```{python}
+#| code-fold: false
+covid_json.keys()
+```
+
+**Observation**: The JSON dictionary contains a `meta` key, which likely refers to metadata (data about the data). Metadata is often maintained with the data and can be a good source of additional information.
+
+
+We can investigate the metadata further by examining its keys.
+
+```{python}
+#| code-fold: false
+covid_json['meta'].keys()
+```
+
+The `meta` key contains another dictionary called `view`. This likely refers to metadata about a particular "view" of some underlying database. We will learn more about views when we study SQL later in the class.
+
+```{python}
+#| code-fold: false
+covid_json['meta']['view'].keys()
+```
+
+Notice that this is a nested/recursive data structure. As we dig deeper, we reveal more and more keys and the corresponding data (a short sketch after the diagram below walks these keys programmatically):
+
+```
+covid_json
+|-> data
+|     | ... (haven't explored yet)
+|-> meta
+      |-> view
+            | -> id
+            | -> name
+            | -> attribution
+            ...
+            | -> description
+            ...
+            | -> columns
+            ...
+```
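+
+If you want to dig through this nesting without printing the whole file, a small recursive helper can walk the dictionary and print only the keys. This is just an exploratory sketch of ours, applied to the `covid_json` dictionary loaded above:
+
+```python
+def print_keys(obj, depth=0, max_depth=2):
+    """Recursively print the keys of nested dictionaries, up to max_depth levels."""
+    if not isinstance(obj, dict) or depth > max_depth:
+        return
+    for key, value in obj.items():
+        print("    " * depth + f"|-> {key}")
+        print_keys(value, depth + 1, max_depth)
+
+print_keys(covid_json, max_depth=1)
+```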
+
+
+There is a key called `description` in the `view` sub-dictionary. This likely contains a description of the data:
+
+```{python}
+#| code-fold: false
+print(covid_json['meta']['view']['description'])
+```
+
+###### Examining the Data Field for Records
+
+We can look at a few entries in the `data` field. This is what we'll load into `pandas`.
+
+```{python}
+#| code-fold: false
+for i in range(3):
+ print(f"{i:03} | {covid_json['data'][i]}")
+```
+
+Observations:
+
+* These look like equal-length records, so maybe `data` is a table!
+* But what does each value in the record mean? Where can we find the column headers?
+
+For that, we'll need the `columns` key in the metadata dictionary. This returns a list:
+
+```{python}
+#| code-fold: false
+type(covid_json['meta']['view']['columns'])
+```
+
+###### Summary of exploring the JSON file
+
+1. The above **metadata** tells us a lot about the columns in the data including column names, potential data anomalies, and a basic statistic.
+1. Because of its non-tabular structure, JSON makes it easier (than CSV) to create **self-documenting data**, meaning that information about the data is stored in the same file as the data.
+1. Self-documenting data can be helpful since it maintains its own description and these descriptions are more likely to be updated as data changes.
+
+###### Loading COVID Data into `pandas`
+Finally, let's load the data (not the metadata) into a `pandas` `DataFrame`. In the following block of code we:
+
+1. Translate the JSON records into a `DataFrame`:
+
+ * fields: `covid_json['meta']['view']['columns']`
+ * records: `covid_json['data']`
+
+
+1. Remove columns that have no metadata description. This would be a bad idea in general, but here we remove these columns since the above analysis suggests they are unlikely to contain useful information.
+
+1. Examine the `tail` of the table.
+
+```{python}
+#| code-fold: false
+# Load the data from JSON and assign column titles
+covid = pd.DataFrame(
+ covid_json['data'],
+ columns=[c['name'] for c in covid_json['meta']['view']['columns']])
+
+covid.tail()
+```
+
+### Variable Types
+
+After loading data from a file, it's a good idea to take the time to understand what pieces of information are encoded in the dataset. In particular, we want to identify what variable types are present in our data. Broadly speaking, we can categorize variables into one of two overarching types.
+
+**Quantitative variables** describe some numeric quantity or amount. We can divide quantitative data further into:
+
+* **Continuous quantitative variables**: numeric data that can be measured on a continuous scale to arbitrary precision. Continuous variables do not have a strict set of possible values – they can be recorded to any number of decimal places. For example, weights, GPA, or CO<sub>2</sub> concentrations.
+* **Discrete quantitative variables**: numeric data that can only take on a finite set of possible values. For example, someone's age or the number of siblings they have.
+
+**Qualitative variables**, also known as **categorical variables**, describe data that isn't measuring some quantity or amount. The sub-categories of categorical data are:
+
+* **Ordinal qualitative variables**: categories with ordered levels, where the differences between levels have no consistent, quantifiable meaning. Some examples include levels of education (high school, undergrad, grad, etc.), income bracket (low, medium, high), or Yelp rating.
+* **Nominal qualitative variables**: categories with no specific order. For example, someone's political affiliation or Cal ID number.
+
+![Classification of variable types](images/variable.png)
+
+Note that many variables don't sit neatly in just one of these categories. Qualitative variables could have numeric levels, and conversely, quantitative variables could be stored as strings.
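+
+As a quick, made-up illustration of this point: below, a nominal variable (zip code) arrives stored as numbers and a quantitative variable (weight) arrives stored as strings, and part of EDA is recasting each column to a type that matches its variable type.
+
+```python
+import pandas as pd
+
+df = pd.DataFrame({
+    "zip_code": [94720, 94704, 94709],     # nominal, despite being numeric
+    "weight":   ["61.2", "73.5", "68.0"],  # continuous quantitative, despite being strings
+})
+print(df.dtypes)
+
+# Recast each column to match its variable type
+df["zip_code"] = df["zip_code"].astype(str)   # treat zip codes as labels, not numbers
+df["weight"] = df["weight"].astype(float)     # treat weights as continuous quantities
+print(df.dtypes)
+```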
+
+### Primary and Foreign Keys
+
+Last time, we introduced `.merge` as the `pandas` method for joining multiple `DataFrame`s together. In our discussion of joins, we touched on the idea of using a "key" to determine what rows should be merged from each table. Let's take a moment to examine this idea more closely.
+
+The **primary key** is the column or set of columns in a table that *uniquely* determine the values of the remaining columns. It can be thought of as the unique identifier for each individual row in the table. For example, a table of Data 100 students might use each student's Cal ID as the primary key.
+
+```{python}
+#| echo: false
+pd.DataFrame({"Cal ID":[3034619471, 3035619472, 3025619473, 3046789372], \
+ "Name":["Oski", "Ollie", "Orrie", "Ollie"], \
+ "Major":["Data Science", "Computer Science", "Data Science", "Economics"]})
+```
+
+The **foreign key** is the column or set of columns in a table that reference primary keys in other tables. Knowing a dataset's foreign keys can be useful when assigning the `left_on` and `right_on` parameters of `.merge`. In the table of office hour tickets below, `"Cal ID"` is a foreign key referencing the previous table.
+
+```{python}
+#| echo: false
+pd.DataFrame({"OH Request":[1, 2, 3, 4], \
+ "Cal ID":[3034619471, 3035619472, 3025619473, 3035619472], \
+ "Question":["HW 2 Q1", "HW 2 Q3", "Lab 3 Q4", "HW 2 Q7"]})
+```
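+
+Here is a minimal sketch of how these keys are used in practice: the foreign key in the office hours table lines up with the primary key in the student table, so we merge on `"Cal ID"`. The toy `DataFrame`s below simply mirror the two tables shown above.
+
+```python
+import pandas as pd
+
+students = pd.DataFrame({
+    "Cal ID": [3034619471, 3035619472, 3025619473, 3046789372],
+    "Name": ["Oski", "Ollie", "Orrie", "Ollie"],
+    "Major": ["Data Science", "Computer Science", "Data Science", "Economics"],
+})
+
+tickets = pd.DataFrame({
+    "OH Request": [1, 2, 3, 4],
+    "Cal ID": [3034619471, 3035619472, 3025619473, 3035619472],
+    "Question": ["HW 2 Q1", "HW 2 Q3", "Lab 3 Q4", "HW 2 Q7"],
+})
+
+# The foreign key in `tickets` references the primary key in `students`;
+# since both columns share a name, on="Cal ID" works in place of left_on/right_on
+print(tickets.merge(students, on="Cal ID"))
+```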
+
+## Granularity, Scope, and Temporality
+
+After understanding the structure of the dataset, the next task is to determine what exactly the data represents. We'll do so by considering the data's granularity, scope, and temporality.
+
+### Granularity
+The **granularity** of a dataset is what a single row represents. You can also think of it as the level of detail included in the data. To determine the data's granularity, ask: what does each row in the dataset represent? Fine-grained data contains a high level of detail, with a single row representing a small individual unit. For example, each record may represent one person. Coarse-grained data is encoded such that a single row represents a large individual unit – for example, each record may represent a group of people.
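+
+To make this concrete, here is a small sketch on a made-up table: fine-grained data with one row per person is coarsened into one row per city using a `groupby` aggregation.
+
+```python
+import pandas as pd
+
+# Fine-grained: each row represents one (hypothetical) person
+people = pd.DataFrame({
+    "city": ["Berkeley", "Berkeley", "Oakland", "Oakland", "Oakland"],
+    "age":  [20, 35, 42, 29, 51],
+})
+
+# Coarse-grained: each row now represents a group of people (a city)
+by_city = people.groupby("city").agg(residents=("age", "size"),
+                                     mean_age=("age", "mean"))
+print(by_city)
+```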
+
+### Scope
+The **scope** of a dataset is the subset of the population covered by the data. If we were investigating student performance in Data Science courses, a dataset with a narrow scope might encompass all students enrolled in Data 100 whereas a dataset with an expansive scope might encompass all students in California.
+
+### Temporality
+The **temporality** of a dataset describes the periodicity over which the data was collected as well as when the data was most recently collected or updated.
+
+Time and date fields of a dataset could represent a few things:
+
+1. when the "event" happened
+2. when the data was collected, or when it was entered into the system
+3. when the data was copied into the database
+
+To fully understand the temporality of the data, it also may be necessary to standardize time zones or inspect recurring time-based trends in the data (do patterns recur in 24-hour periods? Over the course of a month? Seasonally?). The convention for standardizing time is Coordinated Universal Time (UTC), an international time standard measured at 0 degrees longitude that stays consistent throughout the year (no daylight savings). Berkeley's time zone, Pacific Standard Time (PST), is UTC-8; during daylight saving time, Pacific Daylight Time (PDT) is UTC-7.
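+
+As a small sketch (with a made-up timestamp), `pandas` can convert between UTC and Berkeley's local time directly; the `"America/Los_Angeles"` time zone handles the PST/PDT switch for us:
+
+```python
+import pandas as pd
+
+# A hypothetical event time recorded in UTC
+t_utc = pd.Timestamp("2023-09-01 17:30", tz="UTC")
+
+# Convert to Berkeley's local time; the tz database applies UTC-8 or UTC-7 as appropriate
+t_berkeley = t_utc.tz_convert("America/Los_Angeles")
+print(t_utc, "->", t_berkeley)
+```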
+
+#### Temporality with `pandas`' `dt` accessors
+Let's briefly look at how we can use `pandas`' `dt` accessors to work with dates/times, using the dataset you'll see in Lab 3: the Berkeley PD Calls for Service dataset.
+
+```{python}
+#| code-fold: true
+calls = pd.read_csv("data/Berkeley_PD_-_Calls_for_Service.csv")
+calls.head()
+```
+
+Looks like there are three columns with dates/times: `EVENTDT`, `EVENTTM`, and `InDbDate`.
+
+Most likely, `EVENTDT` stands for the date when the event took place, `EVENTTM` stands for the time of day the event took place (in 24-hr format), and `InDbDate` is the date this call was recorded in the database.
+
+If we check the data type of these columns, we will see they are stored as strings. We can convert them to `datetime` objects using the `pandas` function `to_datetime`.
+
+```{python}
+#| code-fold: false
+calls["EVENTDT"] = pd.to_datetime(calls["EVENTDT"])
+calls.head()
+```
+
+Now, we can use the `dt` accessor on this column.
+
+We can get the month:
+
+```{python}
+#| code-fold: false
+calls["EVENTDT"].dt.month.head()
+```
+
+Which day of the week the date is on:
+
+```{python}
+#| code-fold: false
+calls["EVENTDT"].dt.dayofweek.head()
+```
+
+Check the minimum values to see if there are any suspicious-looking dates from the 1970s (which could indicate missing dates encoded as default epoch values):
+
+```{python}
+#| code-fold: false
+calls.sort_values("EVENTDT").head()
+```
+
+Doesn't look like it! We are good!
+
+
+We can also do many things with the `dt` accessor like switching time zones and converting time back to UNIX/POSIX time. Check out the documentation on [`.dt` accessor](https://pandas.pydata.org/docs/user_guide/basics.html#basics-dt-accessors) and [time series/date functionality](https://pandas.pydata.org/docs/user_guide/timeseries.html#).
+
+## Faithfulness
+
+At this stage in our data cleaning and EDA workflow, we've achieved quite a lot: we've identified how our data is structured, come to terms with what information it encodes, and gained insight as to how it was generated. Throughout this process, we should always recall the original intent of our work in Data Science – to use data to better understand and model the real world. To achieve this goal, we need to ensure that the data we use is faithful to reality; that is, that our data accurately captures the "real world."
+
+Data used in research or industry is often "messy" – there may be errors or inaccuracies that impact the faithfulness of the dataset. Signs that data may not be faithful include:
+
+* Unrealistic or "incorrect" values, such as negative counts, locations that don't exist, or dates set in the future
+* Violations of obvious dependencies, like an age that does not match a birthday
+* Clear signs that data was entered by hand, which can lead to spelling errors or fields that are incorrectly shifted
+* Signs of data falsification, such as fake email addresses or repeated use of the same names
+* Duplicated records or fields containing the same information
+* Truncated data, e.g. older versions of Microsoft Excel limited spreadsheets to 65,536 rows and 256 columns
+
+We often solve some of these more common issues in the following ways:
+
+* Spelling errors: apply corrections or drop records that aren't in a dictionary
+* Time zone inconsistencies: convert to a common time zone (e.g. UTC)
+* Duplicated records or fields: identify and eliminate duplicates (using primary keys); see the sketch after this list
+* Unspecified or inconsistent units: infer the units and check that values are in reasonable ranges in the data
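+
+Here is a minimal sketch, on a made-up table, of two of the fixes above: eliminating duplicated records and checking that values fall in a reasonable range.
+
+```python
+import pandas as pd
+
+records = pd.DataFrame({
+    "patient_id": [1, 2, 2, 3],   # patient 2 appears twice: a duplicated record
+    "age": [34, 29, 29, -5],      # -5 is an unrealistic value
+})
+
+# Duplicated records: keep one row per primary key
+deduped = records.drop_duplicates(subset="patient_id")
+
+# Range check: flag values outside a plausible range for follow-up
+suspicious = deduped[(deduped["age"] < 0) | (deduped["age"] > 120)]
+print(suspicious)
+```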
+
+### Missing Values
+Another common issue encountered with real-world datasets is that of missing data. One strategy to resolve this is to simply drop any records with missing values from the dataset. This does, however, introduce the risk of inducing biases – it is possible that the missing or corrupt records may be systematically related to some feature of interest in the data. Another solution is to keep the missing values as `NaN` entries in the dataset.
+
+A third method to address missing data is to perform **imputation**: infer the missing values using other data available in the dataset. There is a wide variety of imputation techniques that can be implemented; some of the most common are listed below.
+
+* Average imputation: replace missing values with the average value for that field (see the sketch after this list)
+* Hot deck imputation: replace missing values with a random value (typically drawn from similar, observed records)
+* Regression imputation: develop a model to predict missing values
+* Multiple imputation: replace missing values with multiple random values
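+
+As a rough sketch of the first strategy (average imputation) on a toy `Series`:
+
+```python
+import numpy as np
+import pandas as pd
+
+s = pd.Series([12.0, np.nan, 15.0, 14.0, np.nan])
+
+# Average imputation: fill missing entries with the mean of the observed values
+print(s.fillna(s.mean()))
+```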
+
+Regardless of the strategy used to deal with missing data, we should think carefully about *why* particular records or fields may be missing – this can help inform whether or not the absence of these values is significant or meaningful.
+
+# EDA Demo 1: Tuberculosis in the United States
+
+Now, let's walk through the data-cleaning and EDA workflow to see what we can learn about the presence of Tuberculosis in the United States!
+
+We will examine the data included in the [original CDC article](https://www.cdc.gov/mmwr/volumes/71/wr/mm7112a1.htm?s_cid=mm7112a1_w#T1_down) published in 2022.
+
+
+## CSVs and Field Names
+Suppose Table 1 was saved as a CSV file located in `data/cdc_tuberculosis.csv`.
+
+We can then explore the CSV (which is a text file, and does not contain binary-encoded data) in many ways:
+
+1. Using a text editor like emacs, vim, VSCode, etc.
+2. Opening the CSV directly in DataHub (read-only), Excel, Google Sheets, etc.
+3. The `Python` file object
+4. `pandas`, using `pd.read_csv()`
+
+To try out options 1 and 2, you can view or download the Tuberculosis CSV file from the [lecture demo notebook](https://data100.datahub.berkeley.edu/hub/user-redirect/git-pull?repo=https%3A%2F%2Fgithub.com%2FDS-100%2Ffa23-student&urlpath=lab%2Ftree%2Ffa23-student%2Flecture%2Flec05%2Flec04-eda.ipynb&branch=main) under the `data` folder in the left-hand menu. Notice how the CSV file is a type of **rectangular data (i.e., tabular data) stored as comma-separated values**.
+
+Next, let's try out option 3 using the `Python` file object. We'll look at the first four lines:
+
+```{python}
+#| code-fold: true
+with open("data/cdc_tuberculosis.csv", "r") as f:
+ i = 0
+ for row in f:
+ print(row)
+ i += 1
+ if i > 3:
+ break
+```
+
+Whoa, why are there blank lines interspersed between the lines of the CSV?
+
+You may recall that all line breaks in text files are encoded as the special newline character `\n`. Python's `print()` prints each string (which already ends in a newline) and then adds an additional newline of its own.
+
+If you're curious, we can use the `repr()` function to return the raw string with all special characters:
+
+```{python}
+#| code-fold: true
+with open("data/cdc_tuberculosis.csv", "r") as f:
+ i = 0
+ for row in f:
+ print(repr(row)) # print raw strings
+ i += 1
+ if i > 3:
+ break
+```
+
+Finally, let's try option 4 and use the tried-and-true Data 100 approach: `pandas`.
+
+```{python}
+#| code-fold: false
+tb_df = pd.read_csv("data/cdc_tuberculosis.csv")
+tb_df.head()
+```
+
+You may notice some strange things about this table: what's up with the "Unnamed" column names and the first row?
+
+Congratulations — you're ready to wrangle your data! Because of how things are stored, we'll need to clean the data a bit to name our columns better.
+
+A reasonable first step is to identify the row with the right header. The `pd.read_csv()` function ([documentation](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html)) has the convenient `header` parameter that we can set to use the elements in row 1 as the appropriate columns:
+
+```{python}
+#| code-fold: false
+tb_df = pd.read_csv("data/cdc_tuberculosis.csv", header=1) # row index
+tb_df.head(5)
+```
+
+Wait...but now we can't differentiate between the "Number of TB cases" and "TB incidence" year columns. `pandas` has tried to make our lives easier by automatically adding ".1" to the latter columns, but this doesn't help us, as humans, understand the data.
+
+We can fix this by renaming the columns manually with `df.rename()` ([documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.rename.html?highlight=rename#pandas.DataFrame.rename)):
+
+```{python}
+#| code-fold: false
+rename_dict = {'2019': 'TB cases 2019',
+ '2020': 'TB cases 2020',
+ '2021': 'TB cases 2021',
+ '2019.1': 'TB incidence 2019',
+ '2020.1': 'TB incidence 2020',
+ '2021.1': 'TB incidence 2021'}
+tb_df = tb_df.rename(columns=rename_dict)
+tb_df.head(5)
+```
+
+## Record Granularity
+
+You might already be wondering: what's up with that first record?
+
+Row 0 is what we call a **rollup record**, or summary record. It's often useful when displaying tables to humans. The **granularity** of record 0 (Totals) vs the rest of the records (States) is different.
+
+Okay, EDA step two. How was the rollup record aggregated?
+
+Let's check if the Total TB cases value is the sum of all state TB cases. If we sum over all rows (including the rollup record), we should get **2x** the total cases in each of our TB cases by year columns (why do you think this is?).
+
+```{python}
+#| code-fold: true
+tb_df.sum(axis=0)
+```
+
+Whoa, what's going on with the TB cases in 2019, 2020, and 2021? Check out the column types:
+
+```{python}
+#| code-fold: true
+tb_df.dtypes
+```
+
+Since there are commas in the values for TB cases, the numbers are read as the `object` datatype, or **storage type** (close to the `Python` string datatype), so `pandas` is concatenating strings instead of adding integers (recall that `Python` can "sum", or concatenate, strings together: `"data" + "100"` evaluates to `"data100"`).
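+
+A quick sketch of this behavior on a toy `Series` of comma-formatted strings:
+
+```python
+import pandas as pd
+
+cases = pd.Series(["8,900", "1,550"])                  # object (string) dtype
+print(cases.sum())                                     # '8,9001,550' -- concatenation!
+
+# Stripping the commas and casting to int gives the numeric sum instead
+print(cases.str.replace(",", "").astype(int).sum())    # 10450
+```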
+
+
+Fortunately `read_csv` also has a `thousands` parameter ([documentation](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html)):
+
+```{python}
+#| code-fold: false
+# improve readability: chaining method calls with outer parentheses/line breaks
+tb_df = (
+ pd.read_csv("data/cdc_tuberculosis.csv", header=1, thousands=',')
+ .rename(columns=rename_dict)
+)
+tb_df.head(5)
+```
+
+```{python}
+#| code-fold: false
+tb_df.sum()
+```
+
+The Total TB cases look right. Phew!
+
+Let's just look at the records with **state-level granularity**:
+
+```{python}
+#| code-fold: true
+state_tb_df = tb_df[1:]
+state_tb_df.head(5)
+```
+
+## Gather Census Data
+
+U.S. Census population estimates [source](https://www.census.gov/data/tables/time-series/demo/popest/2010s-state-total.html) (2019), [source](https://www.census.gov/data/tables/time-series/demo/popest/2020s-state-total.html) (2020-2021).
+
+Running the below cells cleans the data.
+There are a few new methods here:
+
+* `df.convert_dtypes()` ([documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.convert_dtypes.html)) conveniently converts each column to the best possible dtype (for example, float columns holding whole numbers become ints); the details are out of scope for the class.
+* `df.dropna()` ([documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.dropna.html)) will be explained in more detail next time.
+
+```{python}
+#| code-fold: true
+# 2010s census data
+census_2010s_df = pd.read_csv("data/nst-est2019-01.csv", header=3, thousands=",")
+census_2010s_df = (
+ census_2010s_df
+ .reset_index()
+ .drop(columns=["index", "Census", "Estimates Base"])
+ .rename(columns={"Unnamed: 0": "Geographic Area"})
+ .convert_dtypes() # "smart" converting of columns, use at your own risk
+ .dropna() # we'll introduce this next time
+)
+census_2010s_df['Geographic Area'] = census_2010s_df['Geographic Area'].str.strip('.')
+
+# with pd.option_context('display.min_rows', 30): # shows more rows
+# display(census_2010s_df)
+
+census_2010s_df.head(5)
+```
+
+Occasionally, you will want to modify code that you have imported. To reimport those modifications you can either use `python`'s `importlib` library:
+
+```python
+from importlib import reload
+reload(utils)
+```
+
+or use `iPython` magic which will intelligently import code when files change:
+
+```python
+%load_ext autoreload
+%autoreload 2
+```
+
+```{python}
+#| code-fold: true
+# census 2020s data
+census_2020s_df = pd.read_csv("data/NST-EST2022-POP.csv", header=3, thousands=",")
+census_2020s_df = (
+ census_2020s_df
+ .reset_index()
+ .drop(columns=["index", "Unnamed: 1"])
+ .rename(columns={"Unnamed: 0": "Geographic Area"})
+ .convert_dtypes() # "smart" converting of columns, use at your own risk
+ .dropna() # we'll introduce this next time
+)
+census_2020s_df['Geographic Area'] = census_2020s_df['Geographic Area'].str.strip('.')
+
+census_2020s_df.head(5)
+```
+
+## Joining Data (Merging `DataFrame`s)
+
+Time to `merge`! Here we use the `DataFrame` method `df1.merge(right=df2, ...)` on `DataFrame` `df1` ([documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.merge.html)). Contrast this with the function `pd.merge(left=df1, right=df2, ...)` ([documentation](https://pandas.pydata.org/docs/reference/api/pandas.merge.html?highlight=pandas%20merge#pandas.merge)). Feel free to use either.
+
+```{python}
+#| code-fold: false
+# merge TB DataFrame with two US census DataFrames
+tb_census_df = (
+ tb_df
+ .merge(right=census_2010s_df,
+ left_on="U.S. jurisdiction", right_on="Geographic Area")
+ .merge(right=census_2020s_df,
+ left_on="U.S. jurisdiction", right_on="Geographic Area")
+)
+tb_census_df.head(5)
+```
+
+Having all of these columns is a little unwieldy. We could either drop the unneeded columns now, or just merge on smaller census `DataFrame`s. Let's do the latter.
+
+```{python}
+#| code-fold: false
+# try merging again, but cleaner this time
+tb_census_df = (
+ tb_df
+ .merge(right=census_2010s_df[["Geographic Area", "2019"]],
+ left_on="U.S. jurisdiction", right_on="Geographic Area")
+ .drop(columns="Geographic Area")
+ .merge(right=census_2020s_df[["Geographic Area", "2020", "2021"]],
+ left_on="U.S. jurisdiction", right_on="Geographic Area")
+ .drop(columns="Geographic Area")
+)
+tb_census_df.head(5)
+```
+
+## Reproducing Data: Compute Incidence
+
+Let's recompute incidence to make sure we know where the original CDC numbers came from.
+
+From the [CDC report](https://www.cdc.gov/mmwr/volumes/71/wr/mm7112a1.htm?s_cid=mm7112a1_w#T1_down): TB incidence is computed as “Cases per 100,000 persons using mid-year population estimates from the U.S. Census Bureau.”
+
+If we define a group as 100,000 people, then we can compute the TB incidence for a given state population as
+
+$$\text{TB incidence} = \frac{\text{TB cases in population}}{\text{groups in population}} = \frac{\text{TB cases in population}}{\text{population}/100000} $$
+
+$$= \frac{\text{TB cases in population}}{\text{population}} \times 100000$$
+
+Let's try this for 2019:
+
+```{python}
+#| code-fold: false
+tb_census_df["recompute incidence 2019"] = tb_census_df["TB cases 2019"]/tb_census_df["2019"]*100000
+tb_census_df.head(5)
+```
+
+Awesome!!!
+
+Let's use a for-loop and `Python` format strings to compute TB incidence for all years. `Python` f-strings are just used for the purposes of this demo, but they're handy to know when you explore data beyond this course ([documentation](https://docs.python.org/3/tutorial/inputoutput.html)).
+
+```{python}
+#| code-fold: false
+# recompute incidence for all years
+for year in [2019, 2020, 2021]:
+ tb_census_df[f"recompute incidence {year}"] = tb_census_df[f"TB cases {year}"]/tb_census_df[f"{year}"]*100000
+tb_census_df.head(5)
+```
+
+These numbers look pretty close!!! There are a few errors in the hundredths place, particularly in 2021. It may be useful to further explore reasons behind this discrepancy.
+
+```{python}
+#| code-fold: false
+tb_census_df.describe()
+```
+
+## Bonus EDA: Reproducing the Reported Statistic
+
+
+**How do we reproduce that reported statistic in the original [CDC report](https://www.cdc.gov/mmwr/volumes/71/wr/mm7112a1.htm?s_cid=mm7112a1_w)?**
+
+> Reported TB incidence (cases per 100,000 persons) increased **9.4%**, from **2.2** during 2020 to **2.4** during 2021 but was lower than incidence during 2019 (2.7). Increases occurred among both U.S.-born and non–U.S.-born persons.
+
+This is TB incidence computed across the entire U.S. population! How do we reproduce this?
+
+* We need to reproduce the "Total" TB incidences in our rolled record.
+* But our current `tb_census_df` only has 51 entries (50 states plus Washington, D.C.). There is no rolled record.
+* What happened...?
+
+Let's get exploring!
+
+Before we keep exploring, we'll set all indexes to more meaningful values, instead of just numbers that pertain to some row at some point. This will make our cleaning slightly easier.
+
+```{python}
+#| code-fold: true
+tb_df = tb_df.set_index("U.S. jurisdiction")
+tb_df.head(5)
+```
+
+```{python}
+#| code-fold: false
+census_2010s_df = census_2010s_df.set_index("Geographic Area")
+census_2010s_df.head(5)
+```
+
+```{python}
+#| code-fold: false
+census_2020s_df = census_2020s_df.set_index("Geographic Area")
+census_2020s_df.head(5)
+```
+
+It turns out that our merge above only kept state records, even though our original `tb_df` had the "Total" rolled record:
+
+```{python}
+#| code-fold: false
+tb_df.head()
+```
+
+Recall that `merge` performs an **inner** merge by default, meaning that it only preserves keys that are present in **both** `DataFrame`s.
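+
+To see why rows can silently disappear, here is a tiny sketch with made-up tables contrasting the default inner merge with an outer merge. The rows with `NaN`s in the outer merge show exactly which keys failed to match.
+
+```python
+import pandas as pd
+
+left = pd.DataFrame({"key": ["CA", "WA", "Total"], "cases": [1, 2, 3]})
+right = pd.DataFrame({"key": ["CA", "WA", "United States"], "pop": [10, 20, 30]})
+
+# Inner merge (the default): only keys present in BOTH tables survive,
+# so the "Total" and "United States" rows are dropped
+print(left.merge(right, on="key"))
+
+# Outer merge keeps everything, filling the gaps with NaN
+print(left.merge(right, on="key", how="outer"))
+```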
+
+The rolled records in our census `DataFrame` have different `Geographic Area` fields, which was the key we merged on:
+
+```{python}
+#| code-fold: false
+census_2010s_df.head(5)
+```
+
+The Census `DataFrame` has several rolled records. The aggregate record we are looking for actually has the Geographic Area named "United States".
+
+One straightforward way to get the right merge is to rename the value itself. Because we now have the Geographic Area index, we'll use `df.rename()` ([documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.rename.html)):
+
+```{python}
+#| code-fold: false
+# rename rolled record for 2010s
+census_2010s_df.rename(index={'United States':'Total'}, inplace=True)
+census_2010s_df.head(5)
+```
+
+```{python}
+#| code-fold: false
+# same, but for 2020s rename rolled record
+census_2020s_df.rename(index={'United States':'Total'}, inplace=True)
+census_2020s_df.head(5)
+```
+
+<br/>
+
+Next let's rerun our merge. Note the different chaining, because we are now merging on indexes (`df.merge()` [documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.merge.html)).
+
+```{python}
+#| code-fold: false
+tb_census_df = (
+ tb_df
+ .merge(right=census_2010s_df[["2019"]],
+ left_index=True, right_index=True)
+ .merge(right=census_2020s_df[["2020", "2021"]],
+ left_index=True, right_index=True)
+)
+tb_census_df.head(5)
+```
+
+<br/>
+
+Finally, let's recompute our incidences:
+
+```{python}
+#| code-fold: false
+# recompute incidence for all years
+for year in [2019, 2020, 2021]:
+ tb_census_df[f"recompute incidence {year}"] = tb_census_df[f"TB cases {year}"]/tb_census_df[f"{year}"]*100000
+tb_census_df.head(5)
+```
+
+We reproduced the total U.S. incidences correctly!
+
+We're almost there. Let's revisit the quote:
+
+> Reported TB incidence (cases per 100,000 persons) increased **9.4%**, from **2.2** during 2020 to **2.4** during 2021 but was lower than incidence during 2019 (2.7). Increases occurred among both U.S.-born and non–U.S.-born persons.
+
+Recall that percent change from $A$ to $B$ is computed as
+$\text{percent change} = \frac{B - A}{A} \times 100$.
+
+```{python}
+#| code-fold: false
+#| tags: []
+incidence_2020 = tb_census_df.loc['Total', 'recompute incidence 2020']
+incidence_2020
+```
+
+```{python}
+#| code-fold: false
+#| tags: []
+incidence_2021 = tb_census_df.loc['Total', 'recompute incidence 2021']
+incidence_2021
+```
+
+```{python}
+#| code-fold: false
+#| tags: []
+difference = (incidence_2021 - incidence_2020)/incidence_2020 * 100
+difference
+```
+
+# EDA Demo 2: Mauna Loa CO<sub>2</sub> Data -- A Lesson in Data Faithfulness
+
+[Mauna Loa Observatory](https://gml.noaa.gov/ccgg/trends/data.html) has been monitoring CO<sub>2</sub> concentrations since 1958.
+
+```{python}
+#| code-fold: false
+co2_file = "data/co2_mm_mlo.txt"
+```
+
+Let's do some **EDA**!!
+
+## Reading this file into Pandas?
+Rather than loading the file into `pandas` right away, let's check out this `.txt` file. Some questions to keep in mind: Do we trust this file extension? How is the file structured?
+
+Lines 71-78 (inclusive) are shown below:
+
+ line number | file contents
+
+ 71 | # decimal average interpolated trend #days
+ 72 | # date (season corr)
+ 73 | 1958 3 1958.208 315.71 315.71 314.62 -1
+ 74 | 1958 4 1958.292 317.45 317.45 315.29 -1
+ 75 | 1958 5 1958.375 317.50 317.50 314.71 -1
+ 76 | 1958 6 1958.458 -99.99 317.10 314.85 -1
+ 77 | 1958 7 1958.542 315.86 315.86 314.98 -1
+ 78 | 1958 8 1958.625 314.93 314.93 315.94 -1
+
+
+Notice how:
+
+- The values are separated by white space, possibly tabs.
+- The values line up in consistent positions down the rows. For example, the month appears in the 7th to 8th position of each line.
+- The 71st and 72nd lines in the file contain column headings split over two lines.
+
+We can use `read_csv` to read the data into a `pandas` `DataFrame`, and we provide several arguments to specify that the separators are white space, there is no header (**we will set our own column names**), and to skip the first 72 rows of the file.
+
+```{python}
+#| code-fold: false
+co2 = pd.read_csv(
+ co2_file, header = None, skiprows = 72,
+    sep = r'\s+' # delimiter for continuous whitespace (stay tuned for regex next lecture)
+)
+co2.head()
+```
+
+Congratulations! You've wrangled the data!
+
+<br/>
+
+...But our columns aren't named.
+**We need to do more EDA.**
+
+## Exploring Variable Feature Types
+
+The NOAA [webpage](https://gml.noaa.gov/ccgg/trends/) might have some useful tidbits (in this case it doesn't).
+
+Using the column headings we saw on lines 71 and 72 of the file, we'll rerun `pd.read_csv`, but this time with some **custom column names.**
+
+```{python}
+#| code-fold: false
+co2 = pd.read_csv(
+ co2_file, header = None, skiprows = 72,
+    sep = r'\s+', # regex for continuous whitespace (next lecture)
+ names = ['Yr', 'Mo', 'DecDate', 'Avg', 'Int', 'Trend', 'Days']
+)
+co2.head()
+```
+
+## Visualizing CO<sub>2</sub>
+Scientific studies tend to have very clean data, right...? Let's jump right in and make a time series plot of CO2 monthly averages.
+
+```{python}
+#| code-fold: true
+sns.lineplot(x='DecDate', y='Avg', data=co2);
+```
+
+The code above uses the `seaborn` plotting library (abbreviated `sns`). We will cover this in the Visualization lecture; for now, you don't need to worry about how it works!
+
+Yikes! Plotting the data uncovered a problem. The sharp vertical lines suggest that we have some **missing values**. What happened here?
+
+```{python}
+#| code-fold: false
+co2.head()
+```
+
+```{python}
+#| code-fold: false
+co2.tail()
+```
+
+Some data have unusual values like -1 and -99.99.
+
+Let's check the description at the top of the file again.
+
+* -1 signifies a missing value for the number of days `Days` the equipment was in operation that month.
+* -99.99 denotes a missing monthly average `Avg`.
+
+How can we fix this? First, let's explore other aspects of our data. Understanding our data will help us decide what to do with the missing values.
+
+<br/>
+
+
+## Sanity Checks: Reasoning about the data
+First, we consider the shape of the data. How many rows should we have?
+
+* If the data is in chronological order, we should have one record per month.
+* The data runs from March 1958 to August 2019.
+* So we should have $ 12 \times (2019 - 1958 + 1) - 2 - 4 = 738 $ records: 62 years of months, minus January and February of 1958, minus September through December of 2019 (checked in the quick calculation below).
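+
+As a quick sanity check of that arithmetic:
+
+```python
+# 62 years of months, minus Jan/Feb 1958 and Sep-Dec 2019
+expected = 12 * (2019 - 1958 + 1) - 2 - 4
+print(expected)  # 738
+```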
+
+```{python}
+#| code-fold: false
+co2.shape
+```
+
+Nice!! The number of rows (i.e. records) matches our expectations.
+
+<br/>
+
+
+Let's now check the quality of each feature.
+
+## Understanding Missing Value 1: `Days`
+`Days` is a time field, so let's analyze other time fields to see if there is an explanation for missing values of days of operation.
+
+Let's start with **months**, `Mo`.
+
+Are we missing any records? Each month should appear 61 or 62 times (March 1958-August 2019).
+
+```{python}
+#| code-fold: false
+co2["Mo"].value_counts().sort_index()
+```
+
+As expected, Jan, Feb, Sep, Oct, Nov, and Dec have 61 occurrences, and the rest have 62.
+
+<br/>
+
+Next let's explore **days** `Days` itself, which is the number of days that the measurement equipment worked.
+
+```{python}
+#| code-fold: true
+sns.displot(co2['Days']);
+plt.title("Distribution of days feature"); # suppresses unneeded plotting output
+```
+
+In terms of data quality, a handful of months have averages based on measurements taken on fewer than half the days. In addition, there are nearly 200 missing values--**that's about 27% of the data**!
+
+<br/>
+
+Finally, let's check the last time feature, **year** `Yr`.
+
+Let's check to see if there is any connection between missing-ness and the year of the recording.
+
+```{python}
+#| code-fold: true
+sns.scatterplot(x="Yr", y="Days", data=co2);
+plt.title("Day field by Year"); # the ; suppresses output
+```
+
+**Observations**:
+
+* All of the missing data are in the early years of operation.
+* It appears there may have been problems with equipment in the mid to late 80s.
+
+**Potential Next Steps**:
+
+* Confirm these explanations through documentation about the historical readings.
+* Maybe drop the earliest recordings? However, we would want to delay such action until after we have examined the time trends and assessed whether there are any potential problems.
+
+<br/>
+
+## Understanding Missing Value 2: `Avg`
+Next, let's return to the -99.99 values in `Avg` to analyze the overall quality of the CO<sub>2</sub> measurements. We'll plot a histogram of the average CO<sub>2</sub> measurements.
+
+```{python}
+#| code-fold: true
+# Histograms of average CO2 measurements
+sns.displot(co2['Avg']);
+```
+
+The non-missing values are in the 300-400 range (a regular range of CO2 levels).
+
+We also see that there are only a few missing `Avg` values (**<1% of values**). Let's examine all of them:
+
+```{python}
+#| code-fold: false
+co2[co2["Avg"] < 0]
+```
+
+There doesn't seem to be a pattern to these values, other than that most records also were missing `Days` data.
+
+## Drop, `NaN`, or Impute Missing `Avg` Data?
+
+How should we address the invalid `Avg` data?
+
+1. Drop records
+2. Set to NaN
+3. Impute using some strategy
+
+Remember we want to fix the following plot:
+
+```{python}
+#| code-fold: true
+sns.lineplot(x='DecDate', y='Avg', data=co2)
+plt.title("CO2 Average By Month");
+```
+
+Since we are plotting `Avg` vs `DecDate`, we should just focus on dealing with missing values for `Avg`.
+
+
+Let's consider a few options:
+
+1. Drop those records
+2. Replace -99.99 with NaN
+3. Substitute -99.99 with a likely value for the average CO<sub>2</sub>
+
+What do you think are the pros and cons of each possible action?
+
+<br/>
+
+
+Let's examine each of these three options.
+
+```{python}
+#| code-fold: false
+# 1. Drop missing values
+co2_drop = co2[co2['Avg'] > 0]
+co2_drop.head()
+```
+
+```{python}
+#| code-fold: false
+# 2. Replace -99.99 with NaN
+co2_NA = co2.replace(-99.99, np.nan)
+co2_NA.head()
+```
+
+We'll also use a third version of the data.
+
+First, we note that the dataset already comes with a **substitute value** for the -99.99.
+
+From the file description:
+
+> The `interpolated` column includes average values from the preceding column (`average`)
+and **interpolated values** where data are missing. Interpolated values are
+computed in two steps...
+
+The `Int` feature has values that exactly match those in `Avg`, except when `Avg` is -99.99, and then a **reasonable** estimate is used instead.
+
+So, the third version of our data will use the `Int` feature instead of `Avg`.
+
+```{python}
+#| code-fold: false
+# 3. Use interpolated column which estimates missing Avg values
+co2_impute = co2.copy()
+co2_impute['Avg'] = co2['Int']
+co2_impute.head()
+```
+
+What's a **reasonable** estimate?
+
+To answer this question, let's zoom in on a short time period, say the measurements in 1958 (where we know we have two missing values).
+
+```{python}
+#| code-fold: true
+# results of plotting data in 1958
+
+def line_and_points(data, ax, title):
+ # assumes single year, hence Mo
+ ax.plot('Mo', 'Avg', data=data)
+ ax.scatter('Mo', 'Avg', data=data)
+ ax.set_xlim(2, 13)
+ ax.set_title(title)
+ ax.set_xticks(np.arange(3, 13))
+
+def data_year(data, year):
+    return data[data["Yr"] == year]
+
+# uses matplotlib subplots
+# you may see more next week; focus on output for now
+fig, axes = plt.subplots(ncols = 3, figsize=(12, 4), sharey=True)
+
+year = 1958
+line_and_points(data_year(co2_drop, year), axes[0], title="1. Drop Missing")
+line_and_points(data_year(co2_NA, year), axes[1], title="2. Missing Set to NaN")
+line_and_points(data_year(co2_impute, year), axes[2], title="3. Missing Interpolated")
+
+fig.suptitle(f"Monthly Averages for {year}")
+plt.tight_layout()
+```
+
+In the big picture, since there are only 7 `Avg` values missing (**<1%** of 738 months), any of these approaches would work.
+
+However, there is some appeal to **option 3: imputing**:
+
+* Shows seasonal trends for CO2
+* We are plotting all months in our data as a line plot
+
+<br/>
+
+
+Let's replot our original figure with option 3:
+
+```{python}
+#| code-fold: true
+sns.lineplot(x='DecDate', y='Avg', data=co2_impute)
+plt.title("CO2 Average By Month, Imputed");
+```
+
+Looks pretty close to what we see on the NOAA [website](https://gml.noaa.gov/ccgg/trends/)!
+
+## Presenting the data: A Discussion on Data Granularity
+
+From the description:
+
+* Monthly measurements are averages of daily average measurements.
+* The NOAA GML website has datasets for daily/hourly measurements too.
+
+The data you present depends on your research question.
+
+**How do CO2 levels vary by season?**
+
+* You might want to keep average monthly data.
+
+**Are CO2 levels rising over the past 50+ years, consistent with global warming predictions?**
+
+* You might be happier with a **coarser granularity** of average year data!
+
+```{python}
+#| code-fold: true
+co2_year = co2_impute.groupby('Yr').mean()
+sns.lineplot(x='Yr', y='Avg', data=co2_year)
+plt.title("CO2 Average By Year");
+```
+
+Indeed, we see a rise by nearly 100 ppm of CO2 since Mauna Loa began recording in 1958.
+
+# Summary
+We went over a lot of content this lecture; let's summarize the most important points:
+
+## Dealing with Missing Values
+There are a few options we can take to deal with missing data:
+
+* Drop missing records
+* Keep `NaN` missing values
+* Impute using an interpolated column
+
+## EDA and Data Wrangling
+There are several ways to approach EDA and Data Wrangling:
+
+* Examine the **data and metadata**: what is the date, size, organization, and structure of the data?
+* Examine each **field/attribute/dimension** individually.
+* Examine pairs of related dimensions (e.g. breaking down grades by major).
+* Along the way, we can:
+ * **Visualize** or summarize the data.
+ * **Validate assumptions** about data and its collection process. Pay particular attention to when the data was collected.
+ * Identify and **address anomalies**.
+ * Apply data transformations and corrections (we'll cover this in the upcoming lecture).
+ * **Record everything you do!** Developing in Jupyter Notebook promotes *reproducibility* of your own work!
diff --git a/docs/eda/eda_files/figure-html/cell-62-output-1.png b/docs/eda/eda_files/figure-html/cell-62-output-1.png
index a04218cf..f392d5f9 100644
Binary files a/docs/eda/eda_files/figure-html/cell-62-output-1.png and b/docs/eda/eda_files/figure-html/cell-62-output-1.png differ
diff --git a/docs/eda/eda_files/figure-html/cell-67-output-1.png b/docs/eda/eda_files/figure-html/cell-67-output-1.png
new file mode 100644
index 00000000..be96b8c9
Binary files /dev/null and b/docs/eda/eda_files/figure-html/cell-67-output-1.png differ
diff --git a/docs/eda/eda_files/figure-html/cell-67-output-2.png b/docs/eda/eda_files/figure-html/cell-67-output-2.png
deleted file mode 100644
index 31857f62..00000000
Binary files a/docs/eda/eda_files/figure-html/cell-67-output-2.png and /dev/null differ
diff --git a/docs/eda/eda_files/figure-html/cell-68-output-1.png b/docs/eda/eda_files/figure-html/cell-68-output-1.png
index 67c3959d..ffd29ff8 100644
Binary files a/docs/eda/eda_files/figure-html/cell-68-output-1.png and b/docs/eda/eda_files/figure-html/cell-68-output-1.png differ
diff --git a/docs/eda/eda_files/figure-html/cell-69-output-1.png b/docs/eda/eda_files/figure-html/cell-69-output-1.png
new file mode 100644
index 00000000..29088928
Binary files /dev/null and b/docs/eda/eda_files/figure-html/cell-69-output-1.png differ
diff --git a/docs/eda/eda_files/figure-html/cell-69-output-2.png b/docs/eda/eda_files/figure-html/cell-69-output-2.png
deleted file mode 100644
index fb28f5d5..00000000
Binary files a/docs/eda/eda_files/figure-html/cell-69-output-2.png and /dev/null differ
diff --git a/docs/eda/eda_files/figure-html/cell-71-output-1.png b/docs/eda/eda_files/figure-html/cell-71-output-1.png
index 39cac822..49ef3d6a 100644
Binary files a/docs/eda/eda_files/figure-html/cell-71-output-1.png and b/docs/eda/eda_files/figure-html/cell-71-output-1.png differ
diff --git a/docs/eda/eda_files/figure-html/cell-75-output-1.png b/docs/eda/eda_files/figure-html/cell-75-output-1.png
index 6382e58a..15a5fe82 100644
Binary files a/docs/eda/eda_files/figure-html/cell-75-output-1.png and b/docs/eda/eda_files/figure-html/cell-75-output-1.png differ
diff --git a/docs/eda/eda_files/figure-html/cell-76-output-1.png b/docs/eda/eda_files/figure-html/cell-76-output-1.png
index db2b0dee..40b1fc71 100644
Binary files a/docs/eda/eda_files/figure-html/cell-76-output-1.png and b/docs/eda/eda_files/figure-html/cell-76-output-1.png differ
diff --git a/docs/eda/eda_files/figure-html/cell-77-output-1.png b/docs/eda/eda_files/figure-html/cell-77-output-1.png
index 897b8b39..99b6c2d1 100644
Binary files a/docs/eda/eda_files/figure-html/cell-77-output-1.png and b/docs/eda/eda_files/figure-html/cell-77-output-1.png differ
diff --git a/docs/feature_engineering/feature_engineering.html b/docs/feature_engineering/feature_engineering.html
index ea770e7f..22d26788 100644
--- a/docs/feature_engineering/feature_engineering.html
+++ b/docs/feature_engineering/feature_engineering.html
@@ -556,7 +556,7 @@
my_model.fit(X, Y)
-LinearRegression()
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.LinearRegression()
+LinearRegression()
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.LinearRegression()
Notice that we use double brackets to extract this column. Why double brackets instead of just single brackets? The .fit
method, by default, expects to receive 2-dimensional data – some kind of data that includes both rows and columns. Writing penguins["flipper_length_mm"]
would return a 1D Series
, causing sklearn
to error. We avoid this by writing penguins[["flipper_length_mm"]]
to produce a 2D DataFrame
.
@@ -607,7 +607,7 @@
print(f"The RMSE of the model is {np.sqrt(np.mean((Y-Y_hat_two_features)**2))}")
-The RMSE of the model is 0.9881331104079044
+The RMSE of the model is 0.9881331104079045
We can also see that we obtain the same predictions using sklearn
as we did when applying the ordinary least squares formula before!
@@ -977,7 +977,7 @@
print(f"MSE of model with (hp^2) feature: {np.mean((Y-hp2_model_predictions)**2)}")
-MSE of model with (hp^2) feature: 18.984768907617223
+MSE of model with (hp^2) feature: 18.984768907617216
diff --git a/docs/feature_engineering/feature_engineering_files/figure-html/cell-16-output-2.png b/docs/feature_engineering/feature_engineering_files/figure-html/cell-16-output-2.png
index 92cb01c9..f8396667 100644
Binary files a/docs/feature_engineering/feature_engineering_files/figure-html/cell-16-output-2.png and b/docs/feature_engineering/feature_engineering_files/figure-html/cell-16-output-2.png differ
diff --git a/docs/feature_engineering/feature_engineering_files/figure-html/cell-17-output-2.png b/docs/feature_engineering/feature_engineering_files/figure-html/cell-17-output-2.png
index f4ae4ea0..ceecd30f 100644
Binary files a/docs/feature_engineering/feature_engineering_files/figure-html/cell-17-output-2.png and b/docs/feature_engineering/feature_engineering_files/figure-html/cell-17-output-2.png differ
diff --git a/docs/gradient_descent/gradient_descent.html b/docs/gradient_descent/gradient_descent.html
index 467ee5fb..ed238d2c 100644
--- a/docs/gradient_descent/gradient_descent.html
+++ b/docs/gradient_descent/gradient_descent.html
@@ -106,7 +106,7 @@
require.undef("plotly");
requirejs.config({
paths: {
- 'plotly': ['https://cdn.plot.ly/plotly-2.25.2.min']
+ 'plotly': ['https://cdn.plot.ly/plotly-2.12.1.min']
}
});
require(['plotly'], function(Plotly) {
@@ -439,9 +439,9 @@
-
Code
- +
-
+
@@ -4395,9 +4383,9 @@
-
+
@@ -4481,10 +4469,10 @@
-# 3. Use interpolated column which estimates missing Avg values
-co2_impute = co2.copy()
-co2_impute['Avg'] = co2['Int']
-co2_impute.head()
+# 3. Use interpolated column which estimates missing Avg values
+co2_impute = co2.copy()
+co2_impute['Avg'] = co2['Int']
+co2_impute.head()
@@ -4564,30 +4552,30 @@
Code
-# results of plotting data in 1958
-
-def line_and_points(data, ax, title):
- # assumes single year, hence Mo
- ax.plot('Mo', 'Avg', data=data)
- ax.scatter('Mo', 'Avg', data=data)
- ax.set_xlim(2, 13)
- ax.set_title(title)
- ax.set_xticks(np.arange(3, 13))
-
-def data_year(data, year):
- return data[data["Yr"] == 1958]
-
-# uses matplotlib subplots
-# you may see more next week; focus on output for now
-fig, axes = plt.subplots(ncols = 3, figsize=(12, 4), sharey=True)
-
-year = 1958
-line_and_points(data_year(co2_drop, year), axes[0], title="1. Drop Missing")
-line_and_points(data_year(co2_NA, year), axes[1], title="2. Missing Set to NaN")
-line_and_points(data_year(co2_impute, year), axes[2], title="3. Missing Interpolated")
-
-fig.suptitle(f"Monthly Averages for {year}")
-plt.tight_layout()
+# results of plotting data in 1958
+
+def line_and_points(data, ax, title):
+ # assumes single year, hence Mo
+ ax.plot('Mo', 'Avg', data=data)
+ ax.scatter('Mo', 'Avg', data=data)
+ ax.set_xlim(2, 13)
+ ax.set_title(title)
+ ax.set_xticks(np.arange(3, 13))
+
+def data_year(data, year):
+ return data[data["Yr"] == 1958]
+
+# uses matplotlib subplots
+# you may see more next week; focus on output for now
+fig, axes = plt.subplots(ncols = 3, figsize=(12, 4), sharey=True)
+
+year = 1958
+line_and_points(data_year(co2_drop, year), axes[0], title="1. Drop Missing")
+line_and_points(data_year(co2_NA, year), axes[1], title="2. Missing Set to NaN")
+line_and_points(data_year(co2_impute, year), axes[2], title="3. Missing Interpolated")
+
+fig.suptitle(f"Monthly Averages for {year}")
+plt.tight_layout()
@@ -4604,8 +4592,8 @@
Code
-
+
@@ -4632,9 +4620,9 @@
Code
-
+
@@ -4975,1218 +4963,1218 @@ <
Source Code
----
-title: Data Cleaning and EDA
-execute:
- echo: true
-format:
- html:
- code-fold: true
- code-tools: true
- toc: true
- toc-title: Data Cleaning and EDA
- page-layout: full
- theme:
- - cosmo
- - cerulean
- callout-icon: false
-jupyter: python3
----
-
-```{python}
-#| code-fold: true
-import numpy as np
-import pandas as pd
-
-import matplotlib.pyplot as plt
-import seaborn as sns
-#%matplotlib inline
-plt.rcParams['figure.figsize'] = (12, 9)
-
-sns.set()
-sns.set_context('talk')
-np.set_printoptions(threshold=20, precision=2, suppress=True)
-pd.set_option('display.max_rows', 30)
-pd.set_option('display.max_columns', None)
-pd.set_option('display.precision', 2)
-# This option stops scientific notation for pandas
-pd.set_option('display.float_format', '{:.2f}'.format)
-
-# Silence some spurious seaborn warnings
-import warnings
-warnings.filterwarnings("ignore", category=FutureWarning)
-```
-
-::: {.callout-note collapse="false"}
-## Learning Outcomes
-* Recognize common file formats
-* Categorize data by its variable type
-* Build awareness of issues with data faithfulness and develop targeted solutions
-:::
-
-**This content is covered in lectures 4, 5, and 6.**
-
-In the past few lectures, we've learned that `pandas` is a toolkit to restructure, modify, and explore a dataset. What we haven't yet touched on is *how* to make these data transformation decisions. When we receive a new set of data from the "real world," how do we know what processing we should do to convert this data into a usable form?
-
-**Data cleaning**, also called **data wrangling**, is the process of transforming raw data to facilitate subsequent analysis. It is often used to address issues like:
-
-* Unclear structure or formatting
-* Missing or corrupted values
-* Unit conversions
-* ...and so on
-
-**Exploratory Data Analysis (EDA)** is the process of understanding a new dataset. It is an open-ended, informal analysis that involves familiarizing ourselves with the variables present in the data, discovering potential hypotheses, and identifying possible issues with the data. This last point can often motivate further data cleaning to address any problems with the dataset's format; because of this, EDA and data cleaning are often thought of as an "infinite loop," with each process driving the other.
-
-In this lecture, we will consider the key properties of data to consider when performing data cleaning and EDA. In doing so, we'll develop a "checklist" of sorts for you to consider when approaching a new dataset. Throughout this process, we'll build a deeper understanding of this early (but very important!) stage of the data science lifecycle.
-
-## Structure
-
-### File Formats
-There are many file types for storing structured data: TSV, JSON, XML, ASCII, SAS, etc. We'll only cover CSV, TSV, and JSON in lecture, but you'll likely encounter other formats as you work with different datasets. Reading documentation is your best bet for understanding how to process the multitude of different file types.
-
-#### CSV
-CSVs, which stand for **Comma-Separated Values**, are a common tabular data format.
-In the past two `pandas` lectures, we briefly touched on the idea of file format: the way data is encoded in a file for storage. Specifically, our `elections` and `babynames` datasets were stored and loaded as CSVs:
-
-```{python}
-#| code-fold: false
-pd.read_csv("data/elections.csv").head(5)
-```
-
-To better understand the properties of a CSV, let's take a look at the first few rows of the raw data file to see what it looks like before being loaded into a `DataFrame`. We'll use the `repr()` function to return the raw string with its special characters:
-
-```{python}
-#| code-fold: false
-with open("data/elections.csv", "r") as table:
- i = 0
- for row in table:
- print(repr(row))
- i += 1
- if i > 3:
- break
-```
-
-Each row, or **record**, in the data is delimited by a newline `\n`. Each column, or **field**, in the data is delimited by a comma `,` (hence, comma-separated!).
-
-#### TSV
-
-Another common file type is **TSV (Tab-Separated Values)**. In a TSV, records are still delimited by a newline `\n`, while fields are delimited by `\t` tab character.
-
-Let's check out the first few rows of the raw TSV file. Again, we'll use the `repr()` function so that `print` shows the special characters.
-
-```{python}
-#| code-fold: false
-with open("data/elections.txt", "r") as table:
- i = 0
- for row in table:
- print(repr(row))
- i += 1
- if i > 3:
- break
-```
-
-TSVs can be loaded into `pandas` using `pd.read_csv`. We'll need to specify the **delimiter** with parameter` sep='\t'` [(documentation)](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html).
-
-```{python}
-#| code-fold: false
-pd.read_csv("data/elections.txt", sep='\t').head(3)
-```
-
-An issue with CSVs and TSVs comes up whenever there are commas or tabs within the records. How does `pandas` differentiate between a comma delimiter vs. a comma within the field itself, for example `8,900`? To remedy this, check out the [`quotechar` parameter](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html).
-
-#### JSON
-**JSON (JavaScript Object Notation)** files behave similarly to Python dictionaries. A raw JSON is shown below.
-
-```{python}
-#| code-fold: false
-with open("data/elections.json", "r") as table:
- i = 0
- for row in table:
- print(row)
- i += 1
- if i > 8:
- break
-```
-
-JSON files can be loaded into `pandas` using `pd.read_json`.
-
-```{python}
-#| code-fold: false
-pd.read_json('data/elections.json').head(3)
-```
-
-##### EDA with JSON: Berkeley COVID-19 Data
-The City of Berkeley Open Data [website](https://data.cityofberkeley.info/Health/COVID-19-Confirmed-Cases/xn6j-b766) has a dataset with COVID-19 Confirmed Cases among Berkeley residents by date. Let's download the file and save it as a JSON (note the source URL file type is also a JSON). In the interest of reproducible data science, we will download the data programatically. We have defined some helper functions in the [`ds100_utils.py`](https://ds100.org/fa23/resources/assets/lectures/lec05/lec05-eda.html) file that we can reuse these helper functions in many different notebooks.
-
-```{python}
-#| code-fold: false
-from ds100_utils import fetch_and_cache
-
-covid_file = fetch_and_cache(
- "https://data.cityofberkeley.info/api/views/xn6j-b766/rows.json?accessType=DOWNLOAD",
- "confirmed-cases.json",
- force=False)
-covid_file # a file path wrapper object
-```
-
-###### File Size
-Let's start our analysis by getting a rough estimate of the size of the dataset to inform the tools we use to view the data. For relatively small datasets, we can use a text editor or spreadsheet. For larger datasets, more programmatic exploration or distributed computing tools may be more fitting. Here we will use `Python` tools to probe the file.
-
-Since the file appears to be plain text, let's investigate the number of lines, which often corresponds to the number of records.
-
-```{python}
-#| code-fold: false
-import os
-
-print(covid_file, "is", os.path.getsize(covid_file) / 1e6, "MB")
-
-with open(covid_file, "r") as f:
- print(covid_file, "is", sum(1 for l in f), "lines.")
-```
-
-###### Unix Commands
-As part of the EDA workflow, Unix commands can come in very handy. In fact, there's an entire book called ["Data Science at the Command Line"](https://datascienceatthecommandline.com/) that explores this idea in depth!
-In Jupyter/IPython, you can prefix lines with `!` to execute arbitrary Unix commands, and within those lines, you can refer to `Python` variables and expressions with the syntax `{expr}`.
-
-Here, we use the `ls` command to list files, using the `-lh` flags, which request "long format with information in human-readable form." We also use the `wc` command for "word count," but with the `-l` flag, which asks for line counts instead of words.
-
-These two give us the same information as the code above, albeit in a slightly different form:
-
-```{python}
-#| code-fold: false
-!ls -lh {covid_file}
-!wc -l {covid_file}
-```
-
-###### File Contents
-Let's explore the data format using `Python`.
-
-```{python}
-#| code-fold: false
-with open(covid_file, "r") as f:
- for i, row in enumerate(f):
- print(repr(row)) # print raw strings
- if i >= 4: break
-```
-
-We can use the `head` Unix command (which is where `pandas`' `head` method comes from!) to see the first few lines of the file:
-
-```{python}
-#| code-fold: false
-!head -5 {covid_file}
-```
-
-In order to load the JSON file into `pandas`, let's first do some EDA with `Python`'s `json` package to understand the particular structure of this JSON file so that we can decide what (if anything) to load into `pandas`. `Python` has relatively good support for JSON data since it closely matches its internal object model. In the following cell, we import the entire JSON datafile into a `Python` dictionary using the `json` package.
-
-```{python}
-#| code-fold: false
-import json
-
-with open(covid_file, "rb") as f:
- covid_json = json.load(f)
-```
-
-The `covid_json` variable is now a dictionary encoding the data in the file:
-
-```{python}
-#| code-fold: false
-type(covid_json)
-```
-
-We can examine what keys are in the top level json object by listing out the keys.
-
-```{python}
-#| code-fold: false
-covid_json.keys()
-```
-
-**Observation**: The JSON dictionary contains a `meta` key, which likely refers to metadata (data about the data). Metadata is often maintained with the data and can be a good source of additional information.
-
-
-We can investigate the metadata further by examining its keys.
-
-```{python}
-#| code-fold: false
-covid_json['meta'].keys()
-```
-
-The `meta` key contains another dictionary called `view`. This likely refers to meta-data about a particular "view" of some underlying database. We will learn more about views when we study SQL later in the class.
-
-```{python}
-#| code-fold: false
-covid_json['meta']['view'].keys()
-```
-
-Notice that this is a nested/recursive data structure. As we dig deeper, we reveal more and more keys and the corresponding data:
-
-```
-meta
-|-> data
- | ... (haven't explored yet)
-|-> view
- | -> id
- | -> name
- | -> attribution
- ...
- | -> description
- ...
- | -> columns
- ...
-```
-
-
-There is a key called `description` in the `view` sub-dictionary. This likely contains a description of the data:
-
-```{python}
-#| code-fold: false
-print(covid_json['meta']['view']['description'])
-```
-
-###### Examining the Data Field for Records
-
-We can look at a few entries in the `data` field. This is what we'll load into `pandas`.
-
-```{python}
-#| code-fold: false
-for i in range(3):
- print(f"{i:03} | {covid_json['data'][i]}")
-```
-
-Observations:
-
-* These look like equal-length records, so maybe `data` is a table!
-* But what does each value in the record mean? Where can we find the column headers?
-
-For that, we'll need the `columns` key in the metadata dictionary. This returns a list:
-
-```{python}
-#| code-fold: false
-type(covid_json['meta']['view']['columns'])
-```
-
-###### Summary of exploring the JSON file
-
-1. The above **metadata** tells us a lot about the columns in the data including column names, potential data anomalies, and a basic statistic.
-1. Because of its non-tabular structure, JSON makes it easier (than CSV) to create **self-documenting data**, meaning that information about the data is stored in the same file as the data.
-1. Self-documenting data can be helpful since it maintains its own description and these descriptions are more likely to be updated as data changes.
-
-###### Loading COVID Data into `pandas`
-Finally, let's load the data (not the metadata) into a `pandas` `DataFrame`. In the following block of code we:
-
-1. Translate the JSON records into a `DataFrame`:
-
- * fields: `covid_json['meta']['view']['columns']`
- * records: `covid_json['data']`
-
-
-1. Remove columns that have no metadata description. This would be a bad idea in general, but here we remove these columns since the above analysis suggests they are unlikely to contain useful information.
-
-1. Examine the `tail` of the table.
-
-```{python}
-#| code-fold: false
-# Load the data from JSON and assign column titles
-covid = pd.DataFrame(
- covid_json['data'],
- columns=[c['name'] for c in covid_json['meta']['view']['columns']])
-
-covid.tail()
-```
-
-### Variable Types
-
-After loading data from a file, it's a good idea to take the time to understand what pieces of information are encoded in the dataset. In particular, we want to identify what variable types are present in our data. Broadly speaking, we can categorize variables into one of two overarching types.
-
-**Quantitative variables** describe some numeric quantity or amount. We can divide quantitative data further into:
-
-* **Continuous quantitative variables**: numeric data that can be measured on a continuous scale to arbitrary precision. Continuous variables do not have a strict set of possible values – they can be recorded to any number of decimal places. For example, weights, GPA, or CO<sub>2</sub> concentrations.
-* **Discrete quantitative variables**: numeric data that can only take on a finite set of possible values. For example, someone's age or the number of siblings they have.
-
-**Qualitative variables**, also known as **categorical variables**, describe data that isn't measuring some quantity or amount. The sub-categories of categorical data are:
-
-* **Ordinal qualitative variables**: categories with ordered levels. Specifically, ordinal variables are those where the difference between levels has no consistent, quantifiable meaning. Some examples include levels of education (high school, undergrad, grad, etc.), income bracket (low, medium, high), or Yelp rating.
-* **Nominal qualitative variables**: categories with no specific order. For example, someone's political affiliation or Cal ID number.
-
-![Classification of variable types](images/variable.png)
-
-Note that many variables don't sit neatly in just one of these categories. Qualitative variables could have numeric levels, and conversely, quantitative variables could be stored as strings.
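-
-As a small, hypothetical example (the column names here are made up for illustration), we can coerce such columns into the types we actually intend:
-
-```python
-import pandas as pd
-
-df = pd.DataFrame({
-    "GPA": ["3.70", "3.20", "4.00"],                         # quantitative, but stored as strings
-    "Major": ["Data Science", "Economics", "Data Science"],  # nominal qualitative
-})
-df["GPA"] = pd.to_numeric(df["GPA"])           # now a float (continuous quantitative) column
-df["Major"] = df["Major"].astype("category")   # an explicitly categorical column
-df.dtypes
-```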
-
-### Primary and Foreign Keys
-
-Last time, we introduced `.merge` as the `pandas` method for joining multiple `DataFrame`s together. In our discussion of joins, we touched on the idea of using a "key" to determine what rows should be merged from each table. Let's take a moment to examine this idea more closely.
-
-The **primary key** is the column or set of columns in a table that *uniquely* determine the values of the remaining columns. It can be thought of as the unique identifier for each individual row in the table. For example, a table of Data 100 students might use each student's Cal ID as the primary key.
-
-```{python}
-#| echo: false
-pd.DataFrame({"Cal ID":[3034619471, 3035619472, 3025619473, 3046789372], \
- "Name":["Oski", "Ollie", "Orrie", "Ollie"], \
- "Major":["Data Science", "Computer Science", "Data Science", "Economics"]})
-```
-
-The **foreign key** is the column or set of columns in a table that reference primary keys in other tables. Knowing a dataset's foreign keys can be useful when assigning the `left_on` and `right_on` parameters of `.merge`. In the table of office hour tickets below, `"Cal ID"` is a foreign key referencing the previous table.
-
-```{python}
-#| echo: false
-pd.DataFrame({"OH Request":[1, 2, 3, 4], \
- "Cal ID":[3034619471, 3035619472, 3025619473, 3035619472], \
- "Question":["HW 2 Q1", "HW 2 Q3", "Lab 3 Q4", "HW 2 Q7"]})
-```
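-
-To make the connection to `.merge` concrete, here is a small sketch (recreating the two toy tables above by hand) that joins the office hour tickets to the student table on the shared key:
-
-```python
-import pandas as pd
-
-students = pd.DataFrame({"Cal ID": [3034619471, 3035619472, 3025619473, 3046789372],
-                         "Name": ["Oski", "Ollie", "Orrie", "Ollie"],
-                         "Major": ["Data Science", "Computer Science", "Data Science", "Economics"]})
-tickets = pd.DataFrame({"OH Request": [1, 2, 3, 4],
-                        "Cal ID": [3034619471, 3035619472, 3025619473, 3035619472],
-                        "Question": ["HW 2 Q1", "HW 2 Q3", "Lab 3 Q4", "HW 2 Q7"]})
-
-# "Cal ID" is the primary key of `students` and a foreign key in `tickets`.
-tickets.merge(students, on="Cal ID", how="left")
-```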
-
-## Granularity, Scope, and Temporality
-
-After understanding the structure of the dataset, the next task is to determine what exactly the data represents. We'll do so by considering the data's granularity, scope, and temporality.
-
-### Granularity
-The **granularity** of a dataset is what a single row represents. You can also think of it as the level of detail included in the data. To determine the data's granularity, ask: what does each row in the dataset represent? Fine-grained data contains a high level of detail, with a single row representing a small individual unit. For example, each record may represent one person. Coarse-grained data is encoded such that a single row represents a large individual unit – for example, each record may represent a group of people.
-
-### Scope
-The **scope** of a dataset is the subset of the population covered by the data. If we were investigating student performance in Data Science courses, a dataset with a narrow scope might encompass all students enrolled in Data 100 whereas a dataset with an expansive scope might encompass all students in California.
-
-### Temporality
-The **temporality** of a dataset describes the periodicity over which the data was collected as well as when the data was most recently collected or updated.
-
-Time and date fields of a dataset could represent a few things:
-
-1. when the "event" happened
-2. when the data was collected, or when it was entered into the system
-3. when the data was copied into the database
-
-To fully understand the temporality of the data, it may also be necessary to standardize time zones or inspect recurring time-based trends in the data (do patterns recur in 24-hour periods? Over the course of a month? Seasonally?). The convention for standardizing time is Coordinated Universal Time (UTC), an international time standard measured at 0 degrees longitude (the prime meridian) that stays consistent throughout the year (no daylight saving time). Berkeley's time zone, Pacific Standard Time (PST), is UTC-8; during daylight saving time it becomes Pacific Daylight Time (PDT), or UTC-7.
-
-#### Temporality with `pandas`' `dt` accessors
-Let's briefly look at how we can use `pandas`' `dt` accessors to work with dates/times in a dataset using the dataset you'll see in Lab 3: the Berkeley PD Calls for Service dataset.
-
-```{python}
-#| code-fold: true
-calls = pd.read_csv("data/Berkeley_PD_-_Calls_for_Service.csv")
-calls.head()
-```
-
-Looks like there are three columns with dates/times: `EVENTDT`, `EVENTTM`, and `InDbDate`.
-
-Most likely, `EVENTDT` stands for the date when the event took place, `EVENTTM` stands for the time of day the event took place (in 24-hour format), and `InDbDate` is the date this call was recorded in the database.
-
-If we check the data type of these columns, we will see they are stored as strings. We can convert them to `datetime` objects using the `pandas` `to_datetime` function.
-
-```{python}
-#| code-fold: false
-calls["EVENTDT"] = pd.to_datetime(calls["EVENTDT"])
-calls.head()
-```
-
-Now, we can use the `dt` accessor on this column.
-
-We can get the month:
-
-```{python}
-#| code-fold: false
-calls["EVENTDT"].dt.month.head()
-```
-
-Which day of the week the date is on:
-
-```{python}
-#| code-fold: false
-calls["EVENTDT"].dt.dayofweek.head()
-```
-
-Check the minimum values to see if there are any suspicious-looking dates from the 1970s (which would hint at default Unix epoch timestamps):
-
-```{python}
-#| code-fold: false
-calls.sort_values("EVENTDT").head()
-```
-
-Doesn't look like it! We are good!
-
-
-We can also do many things with the `dt` accessor like switching time zones and converting time back to UNIX/POSIX time. Check out the documentation on [`.dt` accessor](https://pandas.pydata.org/docs/user_guide/basics.html#basics-dt-accessors) and [time series/date functionality](https://pandas.pydata.org/docs/user_guide/timeseries.html#).
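-
-For instance, here is a short sketch of the conversions mentioned above. It assumes `EVENTDT` has already been parsed with `pd.to_datetime` (as in the cell earlier) and that the timestamps are local Berkeley times:
-
-```python
-# Attach a time zone, convert to UTC, then express the timestamps as POSIX (UNIX) seconds.
-localized = calls["EVENTDT"].dt.tz_localize("US/Pacific")
-as_utc = localized.dt.tz_convert("UTC")
-unix_seconds = (as_utc - pd.Timestamp("1970-01-01", tz="UTC")) // pd.Timedelta("1s")
-unix_seconds.head()
-```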
-
-## Faithfulness
-
-At this stage in our data cleaning and EDA workflow, we've achieved quite a lot: we've identified how our data is structured, come to terms with what information it encodes, and gained insight as to how it was generated. Throughout this process, we should always recall the original intent of our work in Data Science – to use data to better understand and model the real world. To achieve this goal, we need to ensure that the data we use is faithful to reality; that is, that our data accurately captures the "real world."
-
-Data used in research or industry is often "messy" – there may be errors or inaccuracies that impact the faithfulness of the dataset. Signs that data may not be faithful include (a few of these checks are sketched in code after this list):
-
-* Unrealistic or "incorrect" values, such as negative counts, locations that don't exist, or dates set in the future
-* Violations of obvious dependencies, like an age that does not match a birthday
-* Clear signs that data was entered by hand, which can lead to spelling errors or fields that are incorrectly shifted
-* Signs of data falsification, such as fake email addresses or repeated use of the same names
-* Duplicated records or fields containing the same information
-* Truncated data, e.g. older versions of Microsoft Excel limited spreadsheets to 65,536 rows and 256 columns
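-
-A few of these signs can be checked directly in `pandas`. Here is a minimal sketch on a made-up table (the column names are hypothetical):
-
-```python
-import pandas as pd
-
-df = pd.DataFrame({"count": [5, -1, 12],
-                   "date": pd.to_datetime(["2021-01-01", "2030-05-05", "2020-07-04"])})
-
-print((df["count"] < 0).sum())                    # unrealistic negative counts
-print((df["date"] > pd.Timestamp.today()).sum())  # dates set in the future
-print(df.duplicated().sum())                      # exact duplicate records
-```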
-
-We often solve some of these more common issues in the following ways:
-
-* Spelling errors: apply corrections or drop records that aren't in a dictionary
-* Time zone inconsistencies: convert to a common time zone (e.g. UTC)
-* Duplicated records or fields: identify and eliminate duplicates (using primary keys)
-* Unspecified or inconsistent units: infer the units and check that values are in reasonable ranges in the data
-
-### Missing Values
-Another common issue encountered with real-world datasets is that of missing data. One strategy to resolve this is to simply drop any records with missing values from the dataset. This does, however, introduce the risk of inducing biases – it is possible that the missing or corrupt records may be systematically related to some feature of interest in the data. Another solution is to keep the data as `NaN` values.
-
-A third method to address missing data is to perform **imputation**: infer the missing values using other data available in the dataset. There is a wide variety of imputation techniques that can be implemented; some of the most common are listed below.
-
-* Average imputation: replace missing values with the average value for that field
-* Hot deck imputation: replace missing values with a value drawn at random from similar observed records
-* Regression imputation: develop a model to predict missing values
-* Multiple imputation: replace missing values with multiple random values
-
-Regardless of the strategy used to deal with missing data, we should think carefully about *why* particular records or fields may be missing – this can help inform whether or not the absence of these values is significant or meaningful.
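-
-As a minimal sketch of the first strategy, average imputation, on a toy `Series` (made-up values, not one of the course datasets):
-
-```python
-import numpy as np
-import pandas as pd
-
-s = pd.Series([2.0, np.nan, 4.0, 6.0])
-s_imputed = s.fillna(s.mean())   # the NaN is replaced by 4.0, the mean of the observed values
-s_imputed
-```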
-
-# EDA Demo 1: Tuberculosis in the United States
-
-Now, let's walk through the data-cleaning and EDA workflow to see what we can learn about the presence of Tuberculosis in the United States!
-
-We will examine the data included in the [original CDC article](https://www.cdc.gov/mmwr/volumes/71/wr/mm7112a1.htm?s_cid=mm7112a1_w#T1_down) published in 2021.
-
-
-## CSVs and Field Names
-Suppose Table 1 was saved as a CSV file located in `data/cdc_tuberculosis.csv`.
-
-We can then explore the CSV (which is a text file, and does not contain binary-encoded data) in many ways:
-
-1. Using a text editor like emacs, vim, VSCode, etc.
-2. Opening the CSV directly in DataHub (read-only), Excel, Google Sheets, etc.
-3. The `Python` file object
-4. `pandas`, using `pd.read_csv()`
-
-To try out options 1 and 2, you can view or download the Tuberculosis dataset from the [lecture demo notebook](https://data100.datahub.berkeley.edu/hub/user-redirect/git-pull?repo=https%3A%2F%2Fgithub.com%2FDS-100%2Ffa23-student&urlpath=lab%2Ftree%2Ffa23-student%2Flecture%2Flec05%2Flec04-eda.ipynb&branch=main) under the `data` folder in the left-hand menu. Notice how the CSV file is a type of **rectangular data (i.e., tabular data) stored as comma-separated values**.
-
-Next, let's try out option 3 using the `Python` file object. We'll look at the first four lines:
-
-```{python}
-#| code-fold: true
-with open("data/cdc_tuberculosis.csv", "r") as f:
- i = 0
- for row in f:
- print(row)
- i += 1
- if i > 3:
- break
-```
-
-Whoa, why are there blank lines interspersed between the lines of the CSV?
-
-You may recall that all line breaks in text files are encoded as the special newline character `\n`. Python's `print()` prints each string (including the newline), and an additional newline on top of that.
-
-If you're curious, we can use the `repr()` function to return the raw string with all special characters:
-
-```{python}
-#| code-fold: true
-with open("data/cdc_tuberculosis.csv", "r") as f:
- i = 0
- for row in f:
- print(repr(row)) # print raw strings
- i += 1
- if i > 3:
- break
-```
-
-Finally, let's try option 4 and use the tried-and-true Data 100 approach: `pandas`.
-
-```{python}
-#| code-fold: false
-tb_df = pd.read_csv("data/cdc_tuberculosis.csv")
-tb_df.head()
-```
-
-You may notice some strange things about this table: what's up with the "Unnamed" column names and the first row?
-
-Congratulations — you're ready to wrangle your data! Because of how things are stored, we'll need to clean the data a bit to name our columns better.
-
-A reasonable first step is to identify the row with the right header. The `pd.read_csv()` function ([documentation](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html)) has the convenient `header` parameter that we can set to use the elements in row 1 as the appropriate columns:
-
-```{python}
-#| code-fold: false
-tb_df = pd.read_csv("data/cdc_tuberculosis.csv", header=1) # row index
-tb_df.head(5)
-```
-
-Wait...but now we can't differentiate between the "Number of TB cases" and "TB incidence" year columns. `pandas` has tried to make our lives easier by automatically adding ".1" to the latter columns, but this doesn't help us, as humans, understand the data.
-
-We can rename the columns manually with `df.rename()` ([documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.rename.html?highlight=rename#pandas.DataFrame.rename)):
-
-```{python}
-#| code-fold: false
-rename_dict = {'2019': 'TB cases 2019',
- '2020': 'TB cases 2020',
- '2021': 'TB cases 2021',
- '2019.1': 'TB incidence 2019',
- '2020.1': 'TB incidence 2020',
- '2021.1': 'TB incidence 2021'}
-tb_df = tb_df.rename(columns=rename_dict)
-tb_df.head(5)
-```
-
-## Record Granularity
-
-You might already be wondering: what's up with that first record?
-
-Row 0 is what we call a **rollup record**, or summary record. It's often useful when displaying tables to humans. The **granularity** of record 0 (Totals) vs the rest of the records (States) is different.
-
-Okay, EDA step two. How was the rollup record aggregated?
-
-Let's check if the Total TB cases equal the sum of all state TB cases. If we sum over all rows, each TB cases column should come out to **2x** the total cases for that year (why do you think this is?).
-
-```{python}
-#| code-fold: true
-tb_df.sum(axis=0)
-```
-
-Whoa, what's going on with the TB cases in 2019, 2020, and 2021? Check out the column types:
-
-```{python}
-#| code-fold: true
-tb_df.dtypes
-```
-
-Since there are commas in the values for TB cases, the numbers are read as the `object` datatype, or **storage type** (close to the `Python` string datatype), so `pandas` is concatenating strings instead of adding integers (recall that `Python` can "sum", or concatenate, strings together: `"data" + "100"` evaluates to `"data100"`).
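-
-A tiny demonstration of this concatenation behavior on a toy `Series`:
-
-```python
-import pandas as pd
-
-pd.Series(["1,000", "2,000"]).sum()   # evaluates to '1,0002,000', not 3000
-```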
-
-
-Fortunately `read_csv` also has a `thousands` parameter ([documentation](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html)):
-
-```{python}
-#| code-fold: false
-# improve readability: chaining method calls with outer parentheses/line breaks
-tb_df = (
- pd.read_csv("data/cdc_tuberculosis.csv", header=1, thousands=',')
- .rename(columns=rename_dict)
-)
-tb_df.head(5)
-```
-
-```{python}
-#| code-fold: false
-tb_df.sum()
-```
-
-The Total TB cases look right. Phew!
-
-Let's just look at the records with **state-level granularity**:
-
-```{python}
-#| code-fold: true
-state_tb_df = tb_df[1:]
-state_tb_df.head(5)
-```
-
-## Gather Census Data
-
-U.S. Census population estimates [source](https://www.census.gov/data/tables/time-series/demo/popest/2010s-state-total.html) (2019), [source](https://www.census.gov/data/tables/time-series/demo/popest/2020s-state-total.html) (2020-2021).
-
-Running the below cells cleans the data.
-There are a few new methods here:
-* `df.convert_dtypes()` ([documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.convert_dtypes.html)) conveniently converts all float dtypes into ints and is out of scope for the class.
-* `df.dropna()` ([documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.dropna.html)) will be explained in more detail next time.
-
-```{python}
-#| code-fold: true
-# 2010s census data
-census_2010s_df = pd.read_csv("data/nst-est2019-01.csv", header=3, thousands=",")
-census_2010s_df = (
- census_2010s_df
- .reset_index()
- .drop(columns=["index", "Census", "Estimates Base"])
- .rename(columns={"Unnamed: 0": "Geographic Area"})
- .convert_dtypes() # "smart" converting of columns, use at your own risk
- .dropna() # we'll introduce this next time
-)
-census_2010s_df['Geographic Area'] = census_2010s_df['Geographic Area'].str.strip('.')
-
-# with pd.option_context('display.min_rows', 30): # shows more rows
-# display(census_2010s_df)
-
-census_2010s_df.head(5)
-```
-
-Occasionally, you will want to modify code that you have imported. To reimport those modifications you can either use `python`'s `importlib` library:
-
-```python
-from importlib import reload
-reload(utils)
-```
-
-or use `iPython` magic which will intelligently import code when files change:
-
-```python
-%load_ext autoreload
-%autoreload 2
-```
-
-```{python}
-#| code-fold: true
-# census 2020s data
-census_2020s_df = pd.read_csv("data/NST-EST2022-POP.csv", header=3, thousands=",")
-census_2020s_df = (
- census_2020s_df
- .reset_index()
- .drop(columns=["index", "Unnamed: 1"])
- .rename(columns={"Unnamed: 0": "Geographic Area"})
- .convert_dtypes() # "smart" converting of columns, use at your own risk
- .dropna() # we'll introduce this next time
-)
-census_2020s_df['Geographic Area'] = census_2020s_df['Geographic Area'].str.strip('.')
-
-census_2020s_df.head(5)
-```
-
-## Joining Data (Merging `DataFrame`s)
-
-Time to `merge`! Here we use the `DataFrame` method `df1.merge(right=df2, ...)` on `DataFrame` `df1` ([documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.merge.html)). Contrast this with the function `pd.merge(left=df1, right=df2, ...)` ([documentation](https://pandas.pydata.org/docs/reference/api/pandas.merge.html?highlight=pandas%20merge#pandas.merge)). Feel free to use either.
-
-```{python}
-#| code-fold: false
-# merge TB DataFrame with two US census DataFrames
-tb_census_df = (
- tb_df
- .merge(right=census_2010s_df,
- left_on="U.S. jurisdiction", right_on="Geographic Area")
- .merge(right=census_2020s_df,
- left_on="U.S. jurisdiction", right_on="Geographic Area")
-)
-tb_census_df.head(5)
-```
-
-Having all of these columns is a little unwieldy. We could either drop the unneeded columns now, or just merge on smaller census `DataFrame`s. Let's do the latter.
-
-```{python}
-#| code-fold: false
-# try merging again, but cleaner this time
-tb_census_df = (
- tb_df
- .merge(right=census_2010s_df[["Geographic Area", "2019"]],
- left_on="U.S. jurisdiction", right_on="Geographic Area")
- .drop(columns="Geographic Area")
- .merge(right=census_2020s_df[["Geographic Area", "2020", "2021"]],
- left_on="U.S. jurisdiction", right_on="Geographic Area")
- .drop(columns="Geographic Area")
-)
-tb_census_df.head(5)
-```
-
-## Reproducing Data: Compute Incidence
-
-Let's recompute incidence to make sure we know where the original CDC numbers came from.
-
-From the [CDC report](https://www.cdc.gov/mmwr/volumes/71/wr/mm7112a1.htm?s_cid=mm7112a1_w#T1_down): TB incidence is computed as “Cases per 100,000 persons using mid-year population estimates from the U.S. Census Bureau.”
-
-If we define a group as 100,000 people, then we can compute the TB incidence for a given state population as
-
-$$\text{TB incidence} = \frac{\text{TB cases in population}}{\text{groups in population}} = \frac{\text{TB cases in population}}{\text{population}/100000} $$
-
-$$= \frac{\text{TB cases in population}}{\text{population}} \times 100000$$
-
-Let's try this for 2019:
-
-```{python}
-#| code-fold: false
-tb_census_df["recompute incidence 2019"] = tb_census_df["TB cases 2019"]/tb_census_df["2019"]*100000
-tb_census_df.head(5)
-```
-
-Awesome!!!
-
-Let's use a for-loop and `Python` format strings to compute TB incidence for all years. `Python` f-strings are just used for the purposes of this demo, but they're handy to know when you explore data beyond this course ([documentation](https://docs.python.org/3/tutorial/inputoutput.html)).
-
-```{python}
-#| code-fold: false
-# recompute incidence for all years
-for year in [2019, 2020, 2021]:
- tb_census_df[f"recompute incidence {year}"] = tb_census_df[f"TB cases {year}"]/tb_census_df[f"{year}"]*100000
-tb_census_df.head(5)
-```
-
-These numbers look pretty close!!! There are a few errors in the hundredths place, particularly in 2021. It may be useful to further explore reasons behind this discrepancy.
-
-```{python}
-#| code-fold: false
-tb_census_df.describe()
-```
-
-## Bonus EDA: Reproducing the Reported Statistic
-
-
-**How do we reproduce that reported statistic in the original [CDC report](https://www.cdc.gov/mmwr/volumes/71/wr/mm7112a1.htm?s_cid=mm7112a1_w)?**
-
-> Reported TB incidence (cases per 100,000 persons) increased **9.4%**, from **2.2** during 2020 to **2.4** during 2021 but was lower than incidence during 2019 (2.7). Increases occurred among both U.S.-born and non–U.S.-born persons.
-
-This is TB incidence computed across the entire U.S. population! How do we reproduce this?
-* We need to reproduce the "Total" TB incidences in our rolled record.
-* But our current `tb_census_df` only has 51 entries (50 states plus Washington, D.C.). There is no rolled record.
-* What happened...?
-
-Let's get exploring!
-
-Before we keep exploring, we'll set all indexes to more meaningful values, instead of just numbers that pertain to some row at some point. This will make our cleaning slightly easier.
-
-```{python}
-#| code-fold: true
-tb_df = tb_df.set_index("U.S. jurisdiction")
-tb_df.head(5)
-```
-
-```{python}
-#| code-fold: false
-census_2010s_df = census_2010s_df.set_index("Geographic Area")
-census_2010s_df.head(5)
-```
-
-```{python}
-#| code-fold: false
-census_2020s_df = census_2020s_df.set_index("Geographic Area")
-census_2020s_df.head(5)
-```
-
-It turns out that our merge above only kept state records, even though our original `tb_df` had the "Total" rolled record:
-
-```{python}
-#| code-fold: false
-tb_df.head()
-```
-
-Recall that `merge` performs an **inner** merge by default, meaning that it only preserves keys that are present in **both** `DataFrame`s.
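-
-One way to see exactly which keys an inner merge drops is the `indicator` argument of `merge`. This quick diagnostic (not part of the original analysis) labels each key by whether it appears in the left table, the right table, or both:
-
-```python
-diagnostic = tb_df.merge(right=census_2010s_df[["2019"]],
-                         left_index=True, right_index=True,
-                         how="outer", indicator=True)
-diagnostic["_merge"].value_counts()
-```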
-
-The rolled records in our census `DataFrame` have different `Geographic Area` fields, which was the key we merged on:
-
-```{python}
-#| code-fold: false
-census_2010s_df.head(5)
-```
-
-The Census `DataFrame` has several rolled records. The aggregate record we are looking for actually has the Geographic Area named "United States".
-
-One straightforward way to get the right merge is to rename the value itself. Because we now have the Geographic Area index, we'll use `df.rename()` ([documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.rename.html)):
-
-```{python}
-#| code-fold: false
-# rename rolled record for 2010s
-census_2010s_df.rename(index={'United States':'Total'}, inplace=True)
-census_2010s_df.head(5)
-```
-
-```{python}
-#| code-fold: false
-# same, but for 2020s rename rolled record
-census_2020s_df.rename(index={'United States':'Total'}, inplace=True)
-census_2020s_df.head(5)
-```
-
-<br/>
-
-Next let's rerun our merge. Note the different chaining, because we are now merging on indexes (`df.merge()` [documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.merge.html)).
-
-```{python}
-#| code-fold: false
-tb_census_df = (
- tb_df
- .merge(right=census_2010s_df[["2019"]],
- left_index=True, right_index=True)
- .merge(right=census_2020s_df[["2020", "2021"]],
- left_index=True, right_index=True)
-)
-tb_census_df.head(5)
-```
-
-<br/>
-
-Finally, let's recompute our incidences:
-
-```{python}
-#| code-fold: false
-# recompute incidence for all years
-for year in [2019, 2020, 2021]:
- tb_census_df[f"recompute incidence {year}"] = tb_census_df[f"TB cases {year}"]/tb_census_df[f"{year}"]*100000
-tb_census_df.head(5)
-```
-
-We reproduced the total U.S. incidences correctly!
-
-We're almost there. Let's revisit the quote:
-
-> Reported TB incidence (cases per 100,000 persons) increased **9.4%**, from **2.2** during 2020 to **2.4** during 2021 but was lower than incidence during 2019 (2.7). Increases occurred among both U.S.-born and non–U.S.-born persons.
-
-Recall that percent change from $A$ to $B$ is computed as
-$\text{percent change} = \frac{B - A}{A} \times 100$.
-
-```{python}
-#| code-fold: false
-#| tags: []
-incidence_2020 = tb_census_df.loc['Total', 'recompute incidence 2020']
-incidence_2020
-```
-
-```{python}
-#| code-fold: false
-#| tags: []
-incidence_2021 = tb_census_df.loc['Total', 'recompute incidence 2021']
-incidence_2021
-```
-
-```{python}
-#| code-fold: false
-#| tags: []
-difference = (incidence_2021 - incidence_2020)/incidence_2020 * 100
-difference
-```
-
-# EDA Demo 2: Mauna Loa CO<sub>2</sub> Data -- A Lesson in Data Faithfulness
-
-[Mauna Loa Observatory](https://gml.noaa.gov/ccgg/trends/data.html) has been monitoring CO<sub>2</sub> concentrations since 1958
-
-```{python}
-#| code-fold: false
-co2_file = "data/co2_mm_mlo.txt"
-```
-
-Let's do some **EDA**!!
-
-## Reading this file into Pandas?
-Let's instead check out this `.txt` file. Some questions to keep in mind: Do we trust this file extension? What structure is it?
-
-Lines 71-78 (inclusive) are shown below:
-
- line number | file contents
-
- 71 | # decimal average interpolated trend #days
- 72 | # date (season corr)
- 73 | 1958 3 1958.208 315.71 315.71 314.62 -1
- 74 | 1958 4 1958.292 317.45 317.45 315.29 -1
- 75 | 1958 5 1958.375 317.50 317.50 314.71 -1
- 76 | 1958 6 1958.458 -99.99 317.10 314.85 -1
- 77 | 1958 7 1958.542 315.86 315.86 314.98 -1
- 78 | 1958 8 1958.625 314.93 314.93 315.94 -1
-
-
-Notice how:
-
-- The values are separated by white space, possibly tabs.
-- The data values line up in fixed positions down the rows. For example, the month appears in the 7th to 8th position of each line.
-- The 71st and 72nd lines in the file contain column headings split over two lines.
-
-We can use `read_csv` to read the data into a `pandas` `DataFrame`, and we provide several arguments to specify that the separators are white space, there is no header (**we will set our own column names**), and to skip the first 72 rows of the file.
-
-```{python}
-#| code-fold: false
-co2 = pd.read_csv(
- co2_file, header = None, skiprows = 72,
-    sep = r'\s+'  # delimiter for continuous whitespace (stay tuned for regex next lecture)
-)
-co2.head()
-```
-
-Congratulations! You've wrangled the data!
-
-<br/>
-
-...But our columns aren't named.
-**We need to do more EDA.**
-
-## Exploring Variable Feature Types
-
-The NOAA [webpage](https://gml.noaa.gov/ccgg/trends/) might have some useful tidbits (in this case it doesn't).
-
-Using this information, we'll rerun `pd.read_csv`, but this time with some **custom column names.**
-
-```{python}
-#| code-fold: false
-co2 = pd.read_csv(
- co2_file, header = None, skiprows = 72,
-    sep = r'\s+',  # regex for continuous whitespace (next lecture)
- names = ['Yr', 'Mo', 'DecDate', 'Avg', 'Int', 'Trend', 'Days']
-)
-co2.head()
-```
-
-## Visualizing CO<sub>2</sub>
-Scientific studies tend to have very clean data, right...? Let's jump right in and make a time series plot of CO2 monthly averages.
-
-```{python}
-#| code-fold: true
-sns.lineplot(x='DecDate', y='Avg', data=co2);
-```
-
-The code above uses the `seaborn` plotting library (abbreviated `sns`). We will cover this in the Visualization lecture, but now you don't need to worry about how it works!
-
-Yikes! Plotting the data uncovered a problem. The sharp vertical lines suggest that we have some **missing values**. What happened here?
-
-```{python}
-#| code-fold: false
-co2.head()
-```
-
-```{python}
-#| code-fold: false
-co2.tail()
-```
-
-Some data have unusual values like -1 and -99.99.
-
-Let's check the description at the top of the file again.
-
-* -1 signifies a missing value for the number of days `Days` the equipment was in operation that month.
-* -99.99 denotes a missing monthly average `Avg`
-
-How can we fix this? First, let's explore other aspects of our data. Understanding our data will help us decide what to do with the missing values.
-
-<br/>
-
-
-## Sanity Checks: Reasoning about the data
-First, we consider the shape of the data. How many rows should we have?
-
-* If the data are in chronological order, we should have one record per month.
-* Data from March 1958 to August 2019.
-* We should have $ 12 \times (2019-1957) - 2 - 4 = 738 $ records.
-
-```{python}
-#| code-fold: false
-co2.shape
-```
-
-Nice!! The number of rows (i.e. records) matches our expectations.
-
-<br/>
-
-
-Let's now check the quality of each feature.
-
-## Understanding Missing Value 1: `Days`
-`Days` is a time field, so let's analyze other time fields to see if there is an explanation for missing values of days of operation.
-
-Let's start with **months**, `Mo`.
-
-Are we missing any records? Each month should appear 61 or 62 times (March 1958-August 2019).
-
-```{python}
-#| code-fold: false
-co2["Mo"].value_counts().sort_index()
-```
-
-As expected, Jan, Feb, Sep, Oct, Nov, and Dec have 61 occurrences, and the rest have 62.
-
-<br/>
-
-Next let's explore **days** `Days` itself, which is the number of days that the measurement equipment worked.
-
-```{python}
-#| code-fold: true
-sns.displot(co2['Days']);
-plt.title("Distribution of days feature"); # suppresses unneeded plotting output
-```
-
-In terms of data quality, a handful of months have averages based on measurements taken on fewer than half the days. In addition, there are nearly 200 missing values--**that's about 27% of the data**!
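-
-We can verify that figure with a quick check (recall that -1 marks a missing `Days` value):
-
-```python
-missing_days = (co2["Days"] == -1)
-print(missing_days.sum(), "months missing,", f"{missing_days.mean():.0%} of the data")
-```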
-
-<br/>
-
-Finally, let's check the last time feature, **year** `Yr`.
-
-Let's check to see if there is any connection between missing-ness and the year of the recording.
-
-```{python}
-#| code-fold: true
-sns.scatterplot(x="Yr", y="Days", data=co2);
-plt.title("Day field by Year"); # the ; suppresses output
-```
-
-**Observations**:
-
-* All of the missing data are in the early years of operation.
-* It appears there may have been problems with equipment in the mid to late 80s.
-
-**Potential Next Steps**:
-
-* Confirm these explanations through documentation about the historical readings.
-* Maybe drop the earliest recordings? However, we would want to delay such action until after we have examined the time trends and assessed whether there are any potential problems.
-
-<br/>
-
-## Understanding Missing Value 2: `Avg`
-Next, let's return to the -99.99 values in `Avg` to analyze the overall quality of the CO2 measurements. We'll plot a histogram of the average CO<sub>2</sub> measurements.
-
-```{python}
-#| code-fold: true
-# Histograms of average CO2 measurements
-sns.displot(co2['Avg']);
-```
-
-The non-missing values are in the 300-400 range (a regular range of CO2 levels).
-
-We also see that there are only a few missing `Avg` values (**<1% of values**). Let's examine all of them:
-
-```{python}
-#| code-fold: false
-co2[co2["Avg"] < 0]
-```
-
-There doesn't seem to be a pattern to these values, other than that most records also were missing `Days` data.
-
-## Drop, `NaN`, or Impute Missing `Avg` Data?
-
-How should we address the invalid `Avg` data?
-
-1. Drop records
-2. Set to NaN
-3. Impute using some strategy
-
-Remember we want to fix the following plot:
-
-```{python}
-#| code-fold: true
-sns.lineplot(x='DecDate', y='Avg', data=co2)
-plt.title("CO2 Average By Month");
-```
-
-Since we are plotting `Avg` vs `DecDate`, we should just focus on dealing with missing values for `Avg`.
-
-
-Let's consider a few options:
-
-1. Drop those records
-2. Replace -99.99 with NaN
-3. Substitute a likely value for the average CO2
-
-What do you think are the pros and cons of each possible action?
-
-<br/>
-
-
-Let's examine each of these three options.
-
-```{python}
-#| code-fold: false
-# 1. Drop missing values
-co2_drop = co2[co2['Avg'] > 0]
-co2_drop.head()
-```
-
-```{python}
-#| code-fold: false
-# 2. Replace -99.99 with NaN
-co2_NA = co2.replace(-99.99, np.NaN)
-co2_NA.head()
-```
-
-We'll also use a third version of the data.
-
-First, we note that the dataset already comes with a **substitute value** for the -99.99.
-
-From the file description:
-
-> The `interpolated` column includes average values from the preceding column (`average`)
-and **interpolated values** where data are missing. Interpolated values are
-computed in two steps...
-
-The `Int` feature has values that exactly match those in `Avg`, except when `Avg` is -99.99, and then a **reasonable** estimate is used instead.
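-
-We can sanity-check that claim directly (a quick check, not in the original notebook): wherever `Avg` is not the -99.99 sentinel, `Int` should equal `Avg`.
-
-```python
-valid = co2["Avg"] > 0
-(co2.loc[valid, "Avg"] == co2.loc[valid, "Int"]).all()
-```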
-
-So, the third version of our data will use the `Int` feature instead of `Avg`.
-
-```{python}
-#| code-fold: false
-# 3. Use interpolated column which estimates missing Avg values
-co2_impute = co2.copy()
-co2_impute['Avg'] = co2['Int']
-co2_impute.head()
-```
-
-What's a **reasonable** estimate?
-
-To answer this question, let's zoom in on a short time period, say the measurements in 1958 (where we know we have two missing values).
-
-```{python}
-#| code-fold: true
-# results of plotting data in 1958
-
-def line_and_points(data, ax, title):
- # assumes single year, hence Mo
- ax.plot('Mo', 'Avg', data=data)
- ax.scatter('Mo', 'Avg', data=data)
- ax.set_xlim(2, 13)
- ax.set_title(title)
- ax.set_xticks(np.arange(3, 13))
-
-def data_year(data, year):
-    return data[data["Yr"] == year]
-
-# uses matplotlib subplots
-# you may see more next week; focus on output for now
-fig, axes = plt.subplots(ncols = 3, figsize=(12, 4), sharey=True)
-
-year = 1958
-line_and_points(data_year(co2_drop, year), axes[0], title="1. Drop Missing")
-line_and_points(data_year(co2_NA, year), axes[1], title="2. Missing Set to NaN")
-line_and_points(data_year(co2_impute, year), axes[2], title="3. Missing Interpolated")
-
-fig.suptitle(f"Monthly Averages for {year}")
-plt.tight_layout()
-```
-
-In the big picture, since there are only 7 `Avg` values missing (**<1%** of 738 months), any of these approaches would work.
-
-However, there is some appeal to **option 3: imputing**:
-
-* Shows seasonal trends for CO2
-* We are plotting all months in our data as a line plot
-
-<br/>
-
-
-Let's replot our original figure with option 3:
-
-```{python}
-#| code-fold: true
-sns.lineplot(x='DecDate', y='Avg', data=co2_impute)
-plt.title("CO2 Average By Month, Imputed");
-```
-
-Looks pretty close to what we see on the NOAA [website](https://gml.noaa.gov/ccgg/trends/)!
-
-## Presenting the data: A Discussion on Data Granularity
-
-From the description:
-
-* Monthly measurements are averages of the daily average measurements.
-* The NOAA GML website has datasets for daily/hourly measurements too.
-
-The data you present depends on your research question.
-
-**How do CO2 levels vary by season?**
-
-* You might want to keep average monthly data.
-
-**Are CO2 levels rising over the past 50+ years, consistent with global warming predictions?**
-
-* You might be happier with a **coarser granularity** of average year data!
-
-```{python}
-#| code-fold: true
-co2_year = co2_impute.groupby('Yr').mean()
-sns.lineplot(x='Yr', y='Avg', data=co2_year)
-plt.title("CO2 Average By Year");
-```
-
-Indeed, we see a rise by nearly 100 ppm of CO2 since Mauna Loa began recording in 1958.
-
-# Summary
-We went over a lot of content this lecture; let's summarize the most important points:
-
-## Dealing with Missing Values
-There are a few options we can take to deal with missing data:
-
-* Drop missing records
-* Keep `NaN` missing values
-* Impute using an interpolated column
-
-## EDA and Data Wrangling
-There are several ways to approach EDA and Data Wrangling:
-
-* Examine the **data and metadata**: what is the date, size, organization, and structure of the data?
-* Examine each **field/attribute/dimension** individually.
-* Examine pairs of related dimensions (e.g. breaking down grades by major).
-* Along the way, we can:
- * **Visualize** or summarize the data.
- * **Validate assumptions** about data and its collection process. Pay particular attention to when the data was collected.
- * Identify and **address anomalies**.
- * Apply data transformations and corrections (we'll cover this in the upcoming lecture).
- * **Record everything you do!** Developing in Jupyter Notebook promotes *reproducibility* of your own work!
+---
+title: Data Cleaning and EDA
+execute:
+ echo: true
+format:
+ html:
+ code-fold: true
+ code-tools: true
+ toc: true
+ toc-title: Data Cleaning and EDA
+ page-layout: full
+ theme:
+ - cosmo
+ - cerulean
+ callout-icon: false
+jupyter: python3
+---
+
+```{python}
+#| code-fold: true
+import numpy as np
+import pandas as pd
+
+import matplotlib.pyplot as plt
+import seaborn as sns
+#%matplotlib inline
+plt.rcParams['figure.figsize'] = (12, 9)
+
+sns.set()
+sns.set_context('talk')
+np.set_printoptions(threshold=20, precision=2, suppress=True)
+pd.set_option('display.max_rows', 30)
+pd.set_option('display.max_columns', None)
+pd.set_option('display.precision', 2)
+# This option stops scientific notation for pandas
+pd.set_option('display.float_format', '{:.2f}'.format)
+
+# Silence some spurious seaborn warnings
+import warnings
+warnings.filterwarnings("ignore", category=FutureWarning)
+```
+
+::: {.callout-note collapse="false"}
+## Learning Outcomes
+* Recognize common file formats
+* Categorize data by its variable type
+* Build awareness of issues with data faithfulness and develop targeted solutions
+:::
+
+**This content is covered in lectures 4, 5, and 6.**
+
+In the past few lectures, we've learned that `pandas` is a toolkit to restructure, modify, and explore a dataset. What we haven't yet touched on is *how* to make these data transformation decisions. When we receive a new set of data from the "real world," how do we know what processing we should do to convert this data into a usable form?
+
+**Data cleaning**, also called **data wrangling**, is the process of transforming raw data to facilitate subsequent analysis. It is often used to address issues like:
+
+* Unclear structure or formatting
+* Missing or corrupted values
+* Unit conversions
+* ...and so on
+
+**Exploratory Data Analysis (EDA)** is the process of understanding a new dataset. It is an open-ended, informal analysis that involves familiarizing ourselves with the variables present in the data, discovering potential hypotheses, and identifying possible issues with the data. This last point can often motivate further data cleaning to address any problems with the dataset's format; because of this, EDA and data cleaning are often thought of as an "infinite loop," with each process driving the other.
+
+In this lecture, we will consider the key properties of data to consider when performing data cleaning and EDA. In doing so, we'll develop a "checklist" of sorts for you to consider when approaching a new dataset. Throughout this process, we'll build a deeper understanding of this early (but very important!) stage of the data science lifecycle.
+
+## Structure
+
+### File Formats
+There are many file types for storing structured data: TSV, JSON, XML, ASCII, SAS, etc. We'll only cover CSV, TSV, and JSON in lecture, but you'll likely encounter other formats as you work with different datasets. Reading documentation is your best bet for understanding how to process the multitude of different file types.
+
+#### CSV
+CSVs, which stand for **Comma-Separated Values**, are a common tabular data format.
+In the past two `pandas` lectures, we briefly touched on the idea of file format: the way data is encoded in a file for storage. Specifically, our `elections` and `babynames` datasets were stored and loaded as CSVs:
+
+```{python}
+#| code-fold: false
+pd.read_csv("data/elections.csv").head(5)
+```
+
+To better understand the properties of a CSV, let's take a look at the first few rows of the raw data file to see what it looks like before being loaded into a `DataFrame`. We'll use the `repr()` function to return the raw string with its special characters:
+
+```{python}
+#| code-fold: false
+with open("data/elections.csv", "r") as table:
+ i = 0
+ for row in table:
+ print(repr(row))
+ i += 1
+ if i > 3:
+ break
+```
+
+Each row, or **record**, in the data is delimited by a newline `\n`. Each column, or **field**, in the data is delimited by a comma `,` (hence, comma-separated!).
+
+#### TSV
+
+Another common file type is **TSV (Tab-Separated Values)**. In a TSV, records are still delimited by a newline `\n`, while fields are delimited by the tab character `\t`.
+
+Let's check out the first few rows of the raw TSV file. Again, we'll use the `repr()` function so that `print` shows the special characters.
+
+```{python}
+#| code-fold: false
+with open("data/elections.txt", "r") as table:
+ i = 0
+ for row in table:
+ print(repr(row))
+ i += 1
+ if i > 3:
+ break
+```
+
+TSVs can be loaded into `pandas` using `pd.read_csv`. We'll need to specify the **delimiter** with the parameter `sep='\t'` [(documentation)](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html).
+
+```{python}
+#| code-fold: false
+pd.read_csv("data/elections.txt", sep='\t').head(3)
+```
+
+An issue with CSVs and TSVs comes up whenever there are commas or tabs within the records. How does `pandas` differentiate between a comma delimiter vs. a comma within the field itself, for example `8,900`? To remedy this, check out the [`quotechar` parameter](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html).
+
+#### JSON
+**JSON (JavaScript Object Notation)** files behave similarly to Python dictionaries. A raw JSON is shown below.
+
+```{python}
+#| code-fold: false
+with open("data/elections.json", "r") as table:
+ i = 0
+ for row in table:
+ print(row)
+ i += 1
+ if i > 8:
+ break
+```
+
+JSON files can be loaded into `pandas` using `pd.read_json`.
+
+```{python}
+#| code-fold: false
+pd.read_json('data/elections.json').head(3)
+```
+
+##### EDA with JSON: Berkeley COVID-19 Data
+The City of Berkeley Open Data [website](https://data.cityofberkeley.info/Health/COVID-19-Confirmed-Cases/xn6j-b766) has a dataset with COVID-19 Confirmed Cases among Berkeley residents by date. Let's download the file and save it as a JSON (note the source URL file type is also a JSON). In the interest of reproducible data science, we will download the data programmatically. We have defined some helper functions in the [`ds100_utils.py`](https://ds100.org/fa23/resources/assets/lectures/lec05/lec05-eda.html) file so that we can reuse them in many different notebooks.
+
+```{python}
+#| code-fold: false
+from ds100_utils import fetch_and_cache
+
+covid_file = fetch_and_cache(
+ "https://data.cityofberkeley.info/api/views/xn6j-b766/rows.json?accessType=DOWNLOAD",
+ "confirmed-cases.json",
+ force=False)
+covid_file # a file path wrapper object
+```
+
+###### File Size
+Let's start our analysis by getting a rough estimate of the size of the dataset to inform the tools we use to view the data. For relatively small datasets, we can use a text editor or spreadsheet. For larger datasets, more programmatic exploration or distributed computing tools may be more fitting. Here we will use `Python` tools to probe the file.
+
+Since the file appears to be plain text, let's investigate the number of lines, which often corresponds to the number of records.
+
+```{python}
+#| code-fold: false
+import os
+
+print(covid_file, "is", os.path.getsize(covid_file) / 1e6, "MB")
+
+with open(covid_file, "r") as f:
+ print(covid_file, "is", sum(1 for l in f), "lines.")
+```
+
+###### Unix Commands
+As part of the EDA workflow, Unix commands can come in very handy. In fact, there's an entire book called ["Data Science at the Command Line"](https://datascienceatthecommandline.com/) that explores this idea in depth!
+In Jupyter/IPython, you can prefix lines with `!` to execute arbitrary Unix commands, and within those lines, you can refer to `Python` variables and expressions with the syntax `{expr}`.
+
+Here, we use the `ls` command to list files, using the `-lh` flags, which request "long format with information in human-readable form." We also use the `wc` command for "word count," but with the `-l` flag, which asks for line counts instead of words.
+
+These two give us the same information as the code above, albeit in a slightly different form:
+
+```{python}
+#| code-fold: false
+!ls -lh {covid_file}
+!wc -l {covid_file}
+```
+
+###### File Contents
+Let's explore the data format using `Python`.
+
+```{python}
+#| code-fold: false
+with open(covid_file, "r") as f:
+ for i, row in enumerate(f):
+ print(repr(row)) # print raw strings
+ if i >= 4: break
+```
+
+We can use the `head` Unix command (which is where `pandas`' `head` method comes from!) to see the first few lines of the file:
+
+```{python}
+#| code-fold: false
+!head -5 {covid_file}
+```
+
+In order to load the JSON file into `pandas`, let's first do some EDA with `Python`'s `json` package to understand the particular structure of this JSON file so that we can decide what (if anything) to load into `pandas`. `Python` has relatively good support for JSON data since it closely matches its internal object model. In the following cell, we import the entire JSON datafile into a `Python` dictionary using the `json` package.
+
+```{python}
+#| code-fold: false
+import json
+
+with open(covid_file, "rb") as f:
+ covid_json = json.load(f)
+```
+
+The `covid_json` variable is now a dictionary encoding the data in the file:
+
+```{python}
+#| code-fold: false
+type(covid_json)
+```
+
+We can examine what keys are in the top level json object by listing out the keys.
+
+```{python}
+#| code-fold: false
+covid_json.keys()
+```
+
+**Observation**: The JSON dictionary contains a `meta` key, which likely refers to metadata (data about the data). Metadata is often maintained with the data and can be a good source of additional information.
+
+
+We can investigate the metadata further by examining its keys.
+
+```{python}
+#| code-fold: false
+covid_json['meta'].keys()
+```
+
+The `meta` key contains another dictionary called `view`. This likely refers to meta-data about a particular "view" of some underlying database. We will learn more about views when we study SQL later in the class.
+
+```{python}
+#| code-fold: false
+covid_json['meta']['view'].keys()
+```
+
+Notice that this is a nested/recursive data structure. As we dig deeper, we reveal more and more keys and the corresponding data:
+
+```
+meta
+|-> data
+ | ... (haven't explored yet)
+|-> view
+ | -> id
+ | -> name
+ | -> attribution
+ ...
+ | -> description
+ ...
+ | -> columns
+ ...
+```
+
+
+There is a key called `description` in the `view` sub-dictionary. This likely contains a description of the data:
+
+```{python}
+#| code-fold: false
+print(covid_json['meta']['view']['description'])
+```
+
+###### Examining the Data Field for Records
+
+We can look at a few entries in the `data` field. This is what we'll load into `pandas`.
+
+```{python}
+#| code-fold: false
+for i in range(3):
+ print(f"{i:03} | {covid_json['data'][i]}")
+```
+
+Observations:
+
+* These look like equal-length records, so maybe `data` is a table!
+* But what does each value in the record mean? Where can we find the column headers?
+
+For that, we'll need the `columns` key in the metadata dictionary. This returns a list:
+
+```{python}
+#| code-fold: false
+type(covid_json['meta']['view']['columns'])
+```
+
+###### Summary of exploring the JSON file
+
+1. The above **metadata** tells us a lot about the columns in the data including column names, potential data anomalies, and a basic statistic.
+1. Because of its non-tabular structure, JSON makes it easier (than CSV) to create **self-documenting data**, meaning that information about the data is stored in the same file as the data.
+1. Self-documenting data can be helpful since it maintains its own description and these descriptions are more likely to be updated as data changes.
+
+###### Loading COVID Data into `pandas`
+Finally, let's load the data (not the metadata) into a `pandas` `DataFrame`. In the following block of code we:
+
+1. Translate the JSON records into a `DataFrame`:
+
+ * fields: `covid_json['meta']['view']['columns']`
+ * records: `covid_json['data']`
+
+
+1. Remove columns that have no metadata description. This would be a bad idea in general, but here we remove these columns since the above analysis suggests they are unlikely to contain useful information.
+
+1. Examine the `tail` of the table.
+
+```{python}
+#| code-fold: false
+# Load the data from JSON and assign column titles
+covid = pd.DataFrame(
+ covid_json['data'],
+ columns=[c['name'] for c in covid_json['meta']['view']['columns']])
+
+covid.tail()
+```
+
+### Variable Types
+
+After loading data from a file, it's a good idea to take the time to understand what pieces of information are encoded in the dataset. In particular, we want to identify what variable types are present in our data. Broadly speaking, we can categorize variables into one of two overarching types.
+
+**Quantitative variables** describe some numeric quantity or amount. We can divide quantitative data further into:
+
+* **Continuous quantitative variables**: numeric data that can be measured on a continuous scale to arbitrary precision. Continuous variables do not have a strict set of possible values – they can be recorded to any number of decimal places. For example, weights, GPA, or CO<sub>2</sub> concentrations.
+* **Discrete quantitative variables**: numeric data that can only take on a finite set of possible values. For example, someone's age or the number of siblings they have.
+
+**Qualitative variables**, also known as **categorical variables**, describe data that isn't measuring some quantity or amount. The sub-categories of categorical data are:
+
+* **Ordinal qualitative variables**: categories with ordered levels. Specifically, ordinal variables are those where the difference between levels has no consistent, quantifiable meaning. Some examples include levels of education (high school, undergrad, grad, etc.), income bracket (low, medium, high), or Yelp rating.
+* **Nominal qualitative variables**: categories with no specific order. For example, someone's political affiliation or Cal ID number.
+
+![Classification of variable types](images/variable.png)
+
+Note that many variables don't sit neatly in just one of these categories. Qualitative variables could have numeric levels, and conversely, quantitative variables could be stored as strings.
+
+### Primary and Foreign Keys
+
+Last time, we introduced `.merge` as the `pandas` method for joining multiple `DataFrame`s together. In our discussion of joins, we touched on the idea of using a "key" to determine what rows should be merged from each table. Let's take a moment to examine this idea more closely.
+
+The **primary key** is the column or set of columns in a table that *uniquely* determine the values of the remaining columns. It can be thought of as the unique identifier for each individual row in the table. For example, a table of Data 100 students might use each student's Cal ID as the primary key.
+
+```{python}
+#| echo: false
+pd.DataFrame({"Cal ID":[3034619471, 3035619472, 3025619473, 3046789372], \
+ "Name":["Oski", "Ollie", "Orrie", "Ollie"], \
+ "Major":["Data Science", "Computer Science", "Data Science", "Economics"]})
+```
+
+The **foreign key** is the column or set of columns in a table that reference primary keys in other tables. Knowing a dataset's foreign keys can be useful when assigning the `left_on` and `right_on` parameters of `.merge`. In the table of office hour tickets below, `"Cal ID"` is a foreign key referencing the previous table.
+
+```{python}
+#| echo: false
+pd.DataFrame({"OH Request":[1, 2, 3, 4], \
+ "Cal ID":[3034619471, 3035619472, 3025619473, 3035619472], \
+ "Question":["HW 2 Q1", "HW 2 Q3", "Lab 3 Q4", "HW 2 Q7"]})
+```
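+
+To see a foreign key in action, here is a small sketch that joins hypothetical stand-ins (`students` and `tickets`) for the two tables above:
+
+```python
+import pandas as pd
+
+# hypothetical stand-ins for the two tables shown above
+students = pd.DataFrame({"Cal ID": [3034619471, 3035619472],
+                         "Name": ["Oski", "Ollie"]})
+tickets = pd.DataFrame({"OH Request": [1, 2, 3],
+                        "Cal ID": [3034619471, 3035619472, 3035619472]})
+
+# "Cal ID" is the primary key of `students` and a foreign key in `tickets`
+tickets.merge(right=students, left_on="Cal ID", right_on="Cal ID")
+```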
+
+## Granularity, Scope, and Temporality
+
+After understanding the structure of the dataset, the next task is to determine what exactly the data represents. We'll do so by considering the data's granularity, scope, and temporality.
+
+### Granularity
+The **granularity** of a dataset is what a single row represents. You can also think of it as the level of detail included in the data. To determine the data's granularity, ask: what does each row in the dataset represent? Fine-grained data contains a high level of detail, with a single row representing a small individual unit. For example, each record may represent one person. Coarse-grained data is encoded such that a single row represents a large individual unit – for example, each record may represent a group of people.
+
+### Scope
+The **scope** of a dataset is the subset of the population covered by the data. If we were investigating student performance in Data Science courses, a dataset with a narrow scope might encompass all students enrolled in Data 100 whereas a dataset with an expansive scope might encompass all students in California.
+
+### Temporality
+The **temporality** of a dataset describes the periodicity over which the data was collected as well as when the data was most recently collected or updated.
+
+Time and date fields of a dataset could represent a few things:
+
+1. when the "event" happened
+2. when the data was collected, or when it was entered into the system
+3. when the data was copied into the database
+
+To fully understand the temporality of the data, it also may be necessary to standardize time zones or inspect recurring time-based trends in the data (do patterns recur in 24-hour periods? Over the course of a month? Seasonally?). The convention for standardizing time is Coordinated Universal Time (UTC), an international time standard measured at 0 degrees longitude that stays consistent throughout the year (no daylight saving time). Berkeley's time zone, Pacific Standard Time (PST), corresponds to UTC-8; during daylight saving time, Pacific Daylight Time (PDT) corresponds to UTC-7.
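+
+We can sanity check these offsets with `pandas` timestamps (a small sketch; the dates are arbitrary examples):
+
+```python
+import pandas as pd
+
+# winter date: Pacific Standard Time, UTC-8
+print(pd.Timestamp("2023-01-15 12:00", tz="America/Los_Angeles").strftime("%z"))
+
+# summer date: Pacific Daylight Time, UTC-7
+print(pd.Timestamp("2023-07-15 12:00", tz="America/Los_Angeles").strftime("%z"))
+```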
+
+#### Temporality with `pandas`' `dt` accessors
+Let's briefly look at how we can use `pandas`' `dt` accessors to work with dates/times in a dataset using the dataset you'll see in Lab 3: the Berkeley PD Calls for Service dataset.
+
+```{python}
+#| code-fold: true
+calls = pd.read_csv("data/Berkeley_PD_-_Calls_for_Service.csv")
+calls.head()
+```
+
+Looks like there are three columns with dates/times: `EVENTDT`, `EVENTTM`, and `InDbDate`.
+
+Most likely, `EVENTDT` stands for the date when the event took place, `EVENTTM` stands for the time of day the event took place (in 24-hour format), and `InDbDate` is the date the call was recorded into the database.
+
+If we check the data type of these columns, we will see they are stored as strings. We can convert them to `datetime` objects using pandas `to_datetime` function.
+
+```{python}
+#| code-fold: false
+calls["EVENTDT"] = pd.to_datetime(calls["EVENTDT"])
+calls.head()
+```
+
+Now, we can use the `dt` accessor on this column.
+
+We can get the month:
+
+```{python}
+#| code-fold: false
+calls["EVENTDT"].dt.month.head()
+```
+
+Which day of the week the date is on:
+
+```{python}
+#| code-fold: false
+calls["EVENTDT"].dt.dayofweek.head()
+```
+
+Check the minimum values to see if there are any suspicious-looking dates from the 1970s (a common sign of timestamps defaulting to the Unix epoch):
+
+```{python}
+#| code-fold: false
+calls.sort_values("EVENTDT").head()
+```
+
+Doesn't look like it! We are good!
+
+
+We can also do many things with the `dt` accessor like switching time zones and converting time back to UNIX/POSIX time. Check out the documentation on [`.dt` accessor](https://pandas.pydata.org/docs/user_guide/basics.html#basics-dt-accessors) and [time series/date functionality](https://pandas.pydata.org/docs/user_guide/timeseries.html#).
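+
+For example, here is a sketch building on the `calls` `DataFrame` above (whether localizing to Pacific time is appropriate depends on how the timestamps were recorded):
+
+```python
+# localize the naive event dates to Pacific time, then convert to UTC
+calls_utc = (calls["EVENTDT"]
+             .dt.tz_localize("America/Los_Angeles")
+             .dt.tz_convert("UTC"))
+print(calls_utc.head())
+
+# UNIX/POSIX time: seconds elapsed since 1970-01-01 00:00 UTC
+unix_seconds = (calls["EVENTDT"] - pd.Timestamp("1970-01-01")) // pd.Timedelta("1s")
+print(unix_seconds.head())
+```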
+
+## Faithfulness
+
+At this stage in our data cleaning and EDA workflow, we've achieved quite a lot: we've identified how our data is structured, come to terms with what information it encodes, and gained insight as to how it was generated. Throughout this process, we should always recall the original intent of our work in Data Science – to use data to better understand and model the real world. To achieve this goal, we need to ensure that the data we use is faithful to reality; that is, that our data accurately captures the "real world."
+
+Data used in research or industry is often "messy" – there may be errors or inaccuracies that impact the faithfulness of the dataset. Signs that data may not be faithful include:
+
+* Unrealistic or "incorrect" values, such as negative counts, locations that don't exist, or dates set in the future
+* Violations of obvious dependencies, like an age that does not match a birthday
+* Clear signs that data was entered by hand, which can lead to spelling errors or fields that are incorrectly shifted
+* Signs of data falsification, such as fake email addresses or repeated use of the same names
+* Duplicated records or fields containing the same information
+* Truncated data, e.g. older versions of Microsoft Excel limited spreadsheets to 65,536 rows and 256 columns
+
+We often solve some of these more common issues in the following ways (a quick programmatic check is sketched after this list):
+
+* Spelling errors: apply corrections or drop records that aren't in a dictionary
+* Time zone inconsistencies: convert to a common time zone (e.g. UTC)
+* Duplicated records or fields: identify and eliminate duplicates (using primary keys)
+* Unspecified or inconsistent units: infer the units and check that values are in reasonable ranges in the data
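+
+For example, a few of these checks can be written as one-liners in `pandas`. The sketch below uses a small, hypothetical `df` with `age`, `birth_year`, and `email` columns and assumes the data was collected in 2024:
+
+```python
+import pandas as pd
+
+# hypothetical records, assumed to have been collected in 2024
+df = pd.DataFrame({
+    "age": [21, -3, 35],
+    "birth_year": [2003, 2001, 1989],
+    "email": ["oski@berkeley.edu", "test@test.com", "oski@berkeley.edu"],
+})
+collection_year = 2024
+
+# unrealistic values: negative ages
+print(df[df["age"] < 0])
+
+# violated dependencies: age inconsistent with birth year (allowing a one-year slack)
+print(df[(collection_year - df["birth_year"] - df["age"]).abs() > 1])
+
+# duplicated records
+print(df.duplicated().sum())
+```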
+
+### Missing Values
+Another common issue encountered with real-world datasets is that of missing data. One strategy to resolve this is to simply drop any records with missing values from the dataset. This does, however, introduce the risk of inducing biases – it is possible that the missing or corrupt records may be systemically related to some feature of interest in the data. Another solution is to keep the data as `NaN` values.
+
+A third method to address missing data is to perform **imputation**: infer the missing values using other data available in the dataset. There is a wide variety of imputation techniques that can be implemented; some of the most common are listed below.
+
+* Average imputation: replace missing values with the average value for that field
+* Hot deck imputation: replace missing values with a value drawn at random from a similar record
+* Regression imputation: develop a model to predict missing values
+* Multiple imputation: replace missing values with multiple random values
+
+Regardless of the strategy used to deal with missing data, we should think carefully about *why* particular records or fields may be missing – this can help inform whether or not the absence of these values is significant or meaningful.
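+
+For instance, average imputation and simple interpolation are both one-liners in `pandas` (a minimal sketch with a hypothetical series of measurements):
+
+```python
+import numpy as np
+import pandas as pd
+
+# hypothetical measurements with one missing value
+temps = pd.Series([14.0, 15.5, np.nan, 17.0])
+
+# average imputation: fill the gap with the mean of the observed values
+print(temps.fillna(temps.mean()))
+
+# interpolation: estimate the gap from its neighbors
+print(temps.interpolate())
+```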
+
+# EDA Demo 1: Tuberculosis in the United States
+
+Now, let's walk through the data-cleaning and EDA workflow to see what we can learn about the presence of Tuberculosis in the United States!
+
+We will examine the data included in the [original CDC article](https://www.cdc.gov/mmwr/volumes/71/wr/mm7112a1.htm?s_cid=mm7112a1_w#T1_down) published in 2022.
+
+
+## CSVs and Field Names
+Suppose Table 1 was saved as a CSV file located in `data/cdc_tuberculosis.csv`.
+
+We can then explore the CSV (which is a text file, and does not contain binary-encoded data) in many ways:
+
+1. Using a text editor like emacs, vim, VSCode, etc.
+2. Opening the CSV directly in DataHub (read-only), Excel, Google Sheets, etc.
+3. The `Python` file object
+4. `pandas`, using `pd.read_csv()`
+
+To try out options 1 and 2, you can view or download the Tuberculosis CSV file from the [lecture demo notebook](https://data100.datahub.berkeley.edu/hub/user-redirect/git-pull?repo=https%3A%2F%2Fgithub.com%2FDS-100%2Ffa23-student&urlpath=lab%2Ftree%2Ffa23-student%2Flecture%2Flec05%2Flec04-eda.ipynb&branch=main) under the `data` folder in the left-hand menu. Notice how the CSV file is a type of **rectangular data (i.e., tabular data) stored as comma-separated values**.
+
+Next, let's try out option 3 using the `Python` file object. We'll look at the first four lines:
+
+```{python}
+#| code-fold: true
+with open("data/cdc_tuberculosis.csv", "r") as f:
+ i = 0
+ for row in f:
+ print(row)
+ i += 1
+ if i > 3:
+ break
+```
+
+Whoa, why are there blank lines interspersed between the lines of the CSV?
+
+You may recall that all line breaks in text files are encoded as the special newline character `\n`. Each `row` we read already ends in a newline, and `Python`'s `print()` adds an additional newline on top of that, which produces the blank lines.
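+
+One quick way to confirm this is to suppress `print`'s own newline (a small sketch):
+
+```python
+with open("data/cdc_tuberculosis.csv", "r") as f:
+    for i, row in enumerate(f):
+        print(row, end="")  # keep the file's newline; skip print's extra one
+        if i >= 3:
+            break
+```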
+
+If you're curious, we can use the `repr()` function to return the raw string with all special characters:
+
+```{python}
+#| code-fold: true
+with open("data/cdc_tuberculosis.csv", "r") as f:
+ i = 0
+ for row in f:
+ print(repr(row)) # print raw strings
+ i += 1
+ if i > 3:
+ break
+```
+
+Finally, let's try option 4 and use the tried-and-true Data 100 approach: `pandas`.
+
+```{python}
+#| code-fold: false
+tb_df = pd.read_csv("data/cdc_tuberculosis.csv")
+tb_df.head()
+```
+
+You may notice some strange things about this table: what's up with the "Unnamed" column names and the first row?
+
+Congratulations — you're ready to wrangle your data! Because of how things are stored, we'll need to clean the data a bit to name our columns better.
+
+A reasonable first step is to identify the row with the right header. The `pd.read_csv()` function ([documentation](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html)) has the convenient `header` parameter that we can set to use the row at index 1 as the column names:
+
+```{python}
+#| code-fold: false
+tb_df = pd.read_csv("data/cdc_tuberculosis.csv", header=1) # row index
+tb_df.head(5)
+```
+
+Wait...but now we can't differentiate between the "Number of TB cases" and "TB incidence" year columns. `pandas` has tried to make our lives easier by automatically adding ".1" to the latter columns, but this doesn't help us, as humans, understand the data.
+
+We can do this manually with `df.rename()` ([documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.rename.html?highlight=rename#pandas.DataFrame.rename)):
+
+```{python}
+#| code-fold: false
+rename_dict = {'2019': 'TB cases 2019',
+ '2020': 'TB cases 2020',
+ '2021': 'TB cases 2021',
+ '2019.1': 'TB incidence 2019',
+ '2020.1': 'TB incidence 2020',
+ '2021.1': 'TB incidence 2021'}
+tb_df = tb_df.rename(columns=rename_dict)
+tb_df.head(5)
+```
+
+## Record Granularity
+
+You might already be wondering: what's up with that first record?
+
+Row 0 is what we call a **rollup record**, or summary record. It's often useful when displaying tables to humans. The **granularity** of record 0 (Totals) vs the rest of the records (States) is different.
+
+Okay, EDA step two. How was the rollup record aggregated?
+
+Let's check if Total TB cases is the sum of all state TB cases. If we sum over all rows, we should get **2x** the total cases in each of our TB cases by year (why do you think this is?).
+
+```{python}
+#| code-fold: true
+tb_df.sum(axis=0)
+```
+
+Whoa, what's going on with the TB cases in 2019, 2020, and 2021? Check out the column types:
+
+```{python}
+#| code-fold: true
+tb_df.dtypes
+```
+
+Since there are commas in the values for TB cases, the numbers are read as the `object` datatype, or **storage type** (close to the `Python` string datatype), so `pandas` is concatenating strings instead of adding integers (recall that `Python` can "sum", or concatenate, strings together: `"data" + "100"` evaluates to `"data100"`).
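+
+One manual fix would be to strip the separators and cast the columns ourselves. A sketch (operating on a copy so we don't disturb `tb_df`):
+
+```python
+# manual fix: strip the thousands separators and cast to integers
+case_cols = ["TB cases 2019", "TB cases 2020", "TB cases 2021"]
+tb_fixed = tb_df.copy()
+for col in case_cols:
+    tb_fixed[col] = tb_fixed[col].str.replace(",", "", regex=False).astype(int)
+tb_fixed[case_cols].sum()
+```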
+
+
+Fortunately `read_csv` also has a `thousands` parameter ([documentation](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html)):
+
+```{python}
+#| code-fold: false
+# improve readability: chaining method calls with outer parentheses/line breaks
+tb_df = (
+ pd.read_csv("data/cdc_tuberculosis.csv", header=1, thousands=',')
+ .rename(columns=rename_dict)
+)
+tb_df.head(5)
+```
+
+```{python}
+#| code-fold: false
+tb_df.sum()
+```
+
+The Total TB cases look right. Phew!
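+
+We can also compare the rollup record against the state records directly (a quick sketch):
+
+```python
+# the rollup (Total) record should equal the sum of the state records
+case_cols = ["TB cases 2019", "TB cases 2020", "TB cases 2021"]
+print(tb_df.loc[0, case_cols])
+print(tb_df.loc[1:, case_cols].sum())
+```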
+
+Let's just look at the records with **state-level granularity**:
+
+```{python}
+#| code-fold: true
+state_tb_df = tb_df[1:]
+state_tb_df.head(5)
+```
+
+## Gather Census Data
+
+U.S. Census population estimates [source](https://www.census.gov/data/tables/time-series/demo/popest/2010s-state-total.html) (2019), [source](https://www.census.gov/data/tables/time-series/demo/popest/2020s-state-total.html) (2020-2021).
+
+Running the cells below cleans the data.
+There are a few new methods here:
+
+* `df.convert_dtypes()` ([documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.convert_dtypes.html)) conveniently converts all float dtypes into ints; the details are out of scope for this class.
+* `df.dropna()` ([documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.dropna.html)) will be explained in more detail next time.
+
+```{python}
+#| code-fold: true
+# 2010s census data
+census_2010s_df = pd.read_csv("data/nst-est2019-01.csv", header=3, thousands=",")
+census_2010s_df = (
+ census_2010s_df
+ .reset_index()
+ .drop(columns=["index", "Census", "Estimates Base"])
+ .rename(columns={"Unnamed: 0": "Geographic Area"})
+ .convert_dtypes() # "smart" converting of columns, use at your own risk
+ .dropna() # we'll introduce this next time
+)
+census_2010s_df['Geographic Area'] = census_2010s_df['Geographic Area'].str.strip('.')
+
+# with pd.option_context('display.min_rows', 30): # shows more rows
+# display(census_2010s_df)
+
+census_2010s_df.head(5)
+```
+
+Occasionally, you will want to modify code that you have imported. To reimport those modifications you can either use `python`'s `importlib` library:
+
+```python
+from importlib import reload
+reload(utils)
+```
+
+or use `IPython` magic, which will intelligently reimport code when files change:
+
+```python
+%load_ext autoreload
+%autoreload 2
+```
+
+```{python}
+#| code-fold: true
+# census 2020s data
+census_2020s_df = pd.read_csv("data/NST-EST2022-POP.csv", header=3, thousands=",")
+census_2020s_df = (
+ census_2020s_df
+ .reset_index()
+ .drop(columns=["index", "Unnamed: 1"])
+ .rename(columns={"Unnamed: 0": "Geographic Area"})
+ .convert_dtypes() # "smart" converting of columns, use at your own risk
+ .dropna() # we'll introduce this next time
+)
+census_2020s_df['Geographic Area'] = census_2020s_df['Geographic Area'].str.strip('.')
+
+census_2020s_df.head(5)
+```
+
+## Joining Data (Merging `DataFrame`s)
+
+Time to `merge`! Here we use the `DataFrame` method `df1.merge(right=df2, ...)` on `DataFrame` `df1` ([documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.merge.html)). Contrast this with the function `pd.merge(left=df1, right=df2, ...)` ([documentation](https://pandas.pydata.org/docs/reference/api/pandas.merge.html?highlight=pandas%20merge#pandas.merge)). Feel free to use either.
+
+```{python}
+#| code-fold: false
+# merge TB DataFrame with two US census DataFrames
+tb_census_df = (
+ tb_df
+ .merge(right=census_2010s_df,
+ left_on="U.S. jurisdiction", right_on="Geographic Area")
+ .merge(right=census_2020s_df,
+ left_on="U.S. jurisdiction", right_on="Geographic Area")
+)
+tb_census_df.head(5)
+```
+
+Having all of these columns is a little unwieldy. We could either drop the unneeded columns now, or just merge on smaller census `DataFrame`s. Let's do the latter.
+
+```{python}
+#| code-fold: false
+# try merging again, but cleaner this time
+tb_census_df = (
+ tb_df
+ .merge(right=census_2010s_df[["Geographic Area", "2019"]],
+ left_on="U.S. jurisdiction", right_on="Geographic Area")
+ .drop(columns="Geographic Area")
+ .merge(right=census_2020s_df[["Geographic Area", "2020", "2021"]],
+ left_on="U.S. jurisdiction", right_on="Geographic Area")
+ .drop(columns="Geographic Area")
+)
+tb_census_df.head(5)
+```
+
+## Reproducing Data: Compute Incidence
+
+Let's recompute incidence to make sure we know where the original CDC numbers came from.
+
+From the [CDC report](https://www.cdc.gov/mmwr/volumes/71/wr/mm7112a1.htm?s_cid=mm7112a1_w#T1_down): TB incidence is computed as “Cases per 100,000 persons using mid-year population estimates from the U.S. Census Bureau.”
+
+If we define a group as 100,000 people, then we can compute the TB incidence for a given state population as
+
+$$\text{TB incidence} = \frac{\text{TB cases in population}}{\text{groups in population}} = \frac{\text{TB cases in population}}{\text{population}/100000} $$
+
+$$= \frac{\text{TB cases in population}}{\text{population}} \times 100000$$
+
+Let's try this for 2019:
+
+```{python}
+#| code-fold: false
+tb_census_df["recompute incidence 2019"] = tb_census_df["TB cases 2019"]/tb_census_df["2019"]*100000
+tb_census_df.head(5)
+```
+
+Awesome!!!
+
+Let's use a for-loop and `Python` format strings to compute TB incidence for all years. `Python` f-strings are just used for the purposes of this demo, but they're handy to know when you explore data beyond this course ([documentation](https://docs.python.org/3/tutorial/inputoutput.html)).
+
+```{python}
+#| code-fold: false
+# recompute incidence for all years
+for year in [2019, 2020, 2021]:
+ tb_census_df[f"recompute incidence {year}"] = tb_census_df[f"TB cases {year}"]/tb_census_df[f"{year}"]*100000
+tb_census_df.head(5)
+```
+
+These numbers look pretty close!!! There are a few discrepancies in the hundredths place, particularly in 2021. It may be useful to explore the reasons behind these discrepancies further.
+
+```{python}
+#| code-fold: false
+tb_census_df.describe()
+```
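+
+To quantify how close, we can compare the reported and recomputed incidence rates directly (a quick sketch):
+
+```python
+# largest absolute gap between reported and recomputed incidence, per year
+for year in [2019, 2020, 2021]:
+    gap = (tb_census_df[f"TB incidence {year}"]
+           - tb_census_df[f"recompute incidence {year}"]).abs().max()
+    print(year, round(gap, 4))
+```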
+
+## Bonus EDA: Reproducing the Reported Statistic
+
+
+**How do we reproduce that reported statistic in the original [CDC report](https://www.cdc.gov/mmwr/volumes/71/wr/mm7112a1.htm?s_cid=mm7112a1_w)?**
+
+> Reported TB incidence (cases per 100,000 persons) increased **9.4%**, from **2.2** during 2020 to **2.4** during 2021 but was lower than incidence during 2019 (2.7). Increases occurred among both U.S.-born and non–U.S.-born persons.
+
+This is TB incidence computed across the entire U.S. population! How do we reproduce this?
+
+* We need to reproduce the "Total" TB incidences in our rolled record.
+* But our current `tb_census_df` only has 51 entries (50 states plus Washington, D.C.). There is no rolled record.
+* What happened...?
+
+Let's get exploring!
+
+Before we keep exploring, we'll set all indexes to more meaningful values, instead of just numbers that pertain to some row at some point. This will make our cleaning slightly easier.
+
+```{python}
+#| code-fold: true
+tb_df = tb_df.set_index("U.S. jurisdiction")
+tb_df.head(5)
+```
+
+```{python}
+#| code-fold: false
+census_2010s_df = census_2010s_df.set_index("Geographic Area")
+census_2010s_df.head(5)
+```
+
+```{python}
+#| code-fold: false
+census_2020s_df = census_2020s_df.set_index("Geographic Area")
+census_2020s_df.head(5)
+```
+
+It turns out that our merge above only kept state records, even though our original `tb_df` had the "Total" rolled record:
+
+```{python}
+#| code-fold: false
+tb_df.head()
+```
+
+Recall that `merge` performs an **inner** merge by default, meaning that it only preserves keys that are present in **both** `DataFrame`s.
+
+The rolled records in our census `DataFrame` have different `Geographic Area` fields, which was the key we merged on:
+
+```{python}
+#| code-fold: false
+census_2010s_df.head(5)
+```
+
+The Census `DataFrame` has several rolled records. The aggregate record we are looking for actually has the Geographic Area named "United States".
+
+One straightforward way to get the right merge is to rename the value itself. Because we now have the Geographic Area index, we'll use `df.rename()` ([documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.rename.html)):
+
+```{python}
+#| code-fold: false
+# rename rolled record for 2010s
+census_2010s_df.rename(index={'United States':'Total'}, inplace=True)
+census_2010s_df.head(5)
+```
+
+```{python}
+#| code-fold: false
+# same, but for 2020s rename rolled record
+census_2020s_df.rename(index={'United States':'Total'}, inplace=True)
+census_2020s_df.head(5)
+```
+
+<br/>
+
+Next let's rerun our merge. Note the different chaining, because we are now merging on indexes (`df.merge()` [documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.merge.html)).
+
+```{python}
+#| code-fold: false
+tb_census_df = (
+ tb_df
+ .merge(right=census_2010s_df[["2019"]],
+ left_index=True, right_index=True)
+ .merge(right=census_2020s_df[["2020", "2021"]],
+ left_index=True, right_index=True)
+)
+tb_census_df.head(5)
+```
+
+<br/>
+
+Finally, let's recompute our incidences:
+
+```{python}
+#| code-fold: false
+# recompute incidence for all years
+for year in [2019, 2020, 2021]:
+ tb_census_df[f"recompute incidence {year}"] = tb_census_df[f"TB cases {year}"]/tb_census_df[f"{year}"]*100000
+tb_census_df.head(5)
+```
+
+We reproduced the total U.S. incidences correctly!
+
+We're almost there. Let's revisit the quote:
+
+> Reported TB incidence (cases per 100,000 persons) increased **9.4%**, from **2.2** during 2020 to **2.4** during 2021 but was lower than incidence during 2019 (2.7). Increases occurred among both U.S.-born and non–U.S.-born persons.
+
+Recall that percent change from $A$ to $B$ is computed as
+$\text{percent change} = \frac{B - A}{A} \times 100$.
+
+```{python}
+#| code-fold: false
+#| tags: []
+incidence_2020 = tb_census_df.loc['Total', 'recompute incidence 2020']
+incidence_2020
+```
+
+```{python}
+#| code-fold: false
+#| tags: []
+incidence_2021 = tb_census_df.loc['Total', 'recompute incidence 2021']
+incidence_2021
+```
+
+```{python}
+#| code-fold: false
+#| tags: []
+difference = (incidence_2021 - incidence_2020)/incidence_2020 * 100
+difference
+```
+
+# EDA Demo 2: Mauna Loa CO<sub>2</sub> Data -- A Lesson in Data Faithfulness
+
+[Mauna Loa Observatory](https://gml.noaa.gov/ccgg/trends/data.html) has been monitoring CO<sub>2</sub> concentrations since 1958.
+
+```{python}
+#| code-fold: false
+co2_file = "data/co2_mm_mlo.txt"
+```
+
+Let's do some **EDA**!!
+
+## Reading this file into Pandas?
+Let's instead check out this `.txt` file. Some questions to keep in mind: Do we trust this file extension? What structure is it?
+
+Lines 71-78 (inclusive) are shown below:
+
+ line number | file contents
+
+ 71 | # decimal average interpolated trend #days
+ 72 | # date (season corr)
+ 73 | 1958 3 1958.208 315.71 315.71 314.62 -1
+ 74 | 1958 4 1958.292 317.45 317.45 315.29 -1
+ 75 | 1958 5 1958.375 317.50 317.50 314.71 -1
+ 76 | 1958 6 1958.458 -99.99 317.10 314.85 -1
+ 77 | 1958 7 1958.542 315.86 315.86 314.98 -1
+ 78 | 1958 8 1958.625 314.93 314.93 315.94 -1
+
+
+Notice how:
+
+- The values are separated by white space, possibly tabs.
+- The data values line up in columns down the rows. For example, the month always appears in the 7th to 8th position of each line.
+- The 71st and 72nd lines in the file contain column headings split over two lines.
+
+We can use `read_csv` to read the data into a `pandas` `DataFrame`, and we provide several arguments to specify that the separators are white space, there is no header (**we will set our own column names**), and to skip the first 72 rows of the file.
+
+```{python}
+#| code-fold: false
+co2 = pd.read_csv(
+ co2_file, header = None, skiprows = 72,
+    sep = r'\s+' # delimiter for continuous whitespace (stay tuned for regex next lecture)
+)
+co2.head()
+```
+
+Congratulations! You've wrangled the data!
+
+<br/>
+
+...But our columns aren't named.
+**We need to do more EDA.**
+
+## Exploring Variable Feature Types
+
+The NOAA [webpage](https://gml.noaa.gov/ccgg/trends/) might have some useful tidbits (in this case it doesn't).
+
+Using this information, we'll rerun `pd.read_csv`, but this time with some **custom column names.**
+
+```{python}
+#| code-fold: false
+co2 = pd.read_csv(
+ co2_file, header = None, skiprows = 72,
+    sep = r'\s+', # regex for continuous whitespace (next lecture)
+ names = ['Yr', 'Mo', 'DecDate', 'Avg', 'Int', 'Trend', 'Days']
+)
+co2.head()
+```
+
+## Visualizing CO<sub>2</sub>
+Scientific studies tend to have very clean data, right...? Let's jump right in and make a time series plot of CO2 monthly averages.
+
+```{python}
+#| code-fold: true
+sns.lineplot(x='DecDate', y='Avg', data=co2);
+```
+
+The code above uses the `seaborn` plotting library (abbreviated `sns`). We will cover it in the Visualization lecture; for now, you don't need to worry about how it works!
+
+Yikes! Plotting the data uncovered a problem. The sharp vertical lines suggest that we have some **missing values**. What happened here?
+
+```{python}
+#| code-fold: false
+co2.head()
+```
+
+```{python}
+#| code-fold: false
+co2.tail()
+```
+
+Some data have unusual values like -1 and -99.99.
+
+Let's check the description at the top of the file again.
+
+* -1 signifies a missing value for the number of days `Days` the equipment was in operation that month.
+* -99.99 denotes a missing monthly average `Avg`
+
+How can we fix this? First, let's explore other aspects of our data. Understanding our data will help us decide what to do with the missing values.
+
+<br/>
+
+
+## Sanity Checks: Reasoning about the data
+First, we consider the shape of the data. How many rows should we have?
+
+* If the data is in chronological order, we should have one record per month.
+* The data run from March 1958 to August 2019.
+* We should have $ 12 \times (2019-1957) - 2 - 4 = 738 $ records: 12 months for each of the 62 years from 1958 through 2019, minus the 2 missing months at the start of 1958 and the 4 missing months at the end of 2019.
+
+```{python}
+#| code-fold: false
+co2.shape
+```
+
+Nice!! The number of rows (i.e., records) matches our expectations.
+
+<br/>
+
+
+Let's now check the quality of each feature.
+
+## Understanding Missing Value 1: `Days`
+`Days` is a time field, so let's analyze other time fields to see if there is an explanation for missing values of days of operation.
+
+Let's start with **months**, `Mo`.
+
+Are we missing any records? Each month should appear 61 or 62 times (March 1958-August 2019).
+
+```{python}
+#| code-fold: false
+co2["Mo"].value_counts().sort_index()
+```
+
+As expected, Jan, Feb, Sep, Oct, Nov, and Dec have 61 occurrences, and the rest have 62.
+
+<br/>
+
+Next let's explore **days** `Days` itself, which is the number of days that the measurement equipment worked.
+
+```{python}
+#| code-fold: true
+sns.displot(co2['Days']);
+plt.title("Distribution of days feature"); # suppresses unneeded plotting output
+```
+
+In terms of data quality, a handful of months have averages based on measurements taken on fewer than half the days. In addition, there are nearly 200 missing values--**that's about 27% of the data**!
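+
+We can verify that figure directly (a quick sketch):
+
+```python
+# fraction of months where Days is missing (encoded as -1)
+missing_days = co2["Days"] == -1
+print(missing_days.sum(), f"{missing_days.mean():.0%}")
+```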
+
+<br/>
+
+Finally, let's check the last time feature, **year** `Yr`.
+
+Let's check to see if there is any connection between missing-ness and the year of the recording.
+
+```{python}
+#| code-fold: true
+sns.scatterplot(x="Yr", y="Days", data=co2);
+plt.title("Day field by Year"); # the ; suppresses output
+```
+
+**Observations**:
+
+* All of the missing data are in the early years of operation.
+* It appears there may have been problems with equipment in the mid to late 80s.
+
+**Potential Next Steps**:
+
+* Confirm these explanations through documentation about the historical readings.
+* Maybe drop the earliest recordings? However, we would want to delay such action until after we have examined the time trends and assessed whether there are any potential problems.
+
+<br/>
+
+## Understanding Missing Value 2: `Avg`
+Next, let's return to the -99.99 values in `Avg` to analyze the overall quality of the CO<sub>2</sub> measurements. We'll plot a histogram of the average CO<sub>2</sub> measurements.
+
+```{python}
+#| code-fold: true
+# Histograms of average CO2 measurements
+sns.displot(co2['Avg']);
+```
+
+The non-missing values are in the 300-400 range (a regular range of CO2 levels).
+
+We also see that there are only a few missing `Avg` values (**<1% of values**). Let's examine all of them:
+
+```{python}
+#| code-fold: false
+co2[co2["Avg"] < 0]
+```
+
+There doesn't seem to be a pattern to these values, other than that most records also were missing `Days` data.
+
+## Drop, `NaN`, or Impute Missing `Avg` Data?
+
+How should we address the invalid `Avg` data?
+
+1. Drop records
+2. Set to NaN
+3. Impute using some strategy
+
+Remember we want to fix the following plot:
+
+```{python}
+#| code-fold: true
+sns.lineplot(x='DecDate', y='Avg', data=co2)
+plt.title("CO2 Average By Month");
+```
+
+Since we are plotting `Avg` vs `DecDate`, we should just focus on dealing with missing values for `Avg`.
+
+
+Let's consider a few options:
+
+1. Drop those records
+2. Replace -99.99 with NaN
+3. Substitute -99.99 with a likely value for the average CO2
+
+What do you think are the pros and cons of each possible action?
+
+<br/>
+
+
+Let's examine each of these three options.
+
+```{python}
+#| code-fold: false
+# 1. Drop missing values
+co2_drop = co2[co2['Avg'] > 0]
+co2_drop.head()
+```
+
+```{python}
+#| code-fold: false
+# 2. Replace -99.99 with NaN
+co2_NA = co2.replace(-99.99, np.nan)
+co2_NA.head()
+```
+
+We'll also use a third version of the data.
+
+First, we note that the dataset already comes with a **substitute value** for the -99.99.
+
+From the file description:
+
+> The `interpolated` column includes average values from the preceding column (`average`)
+and **interpolated values** where data are missing. Interpolated values are
+computed in two steps...
+
+The `Int` feature has values that exactly match those in `Avg`, except when `Avg` is -99.99, and then a **reasonable** estimate is used instead.
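+
+We can verify this claim with a quick check (a sketch):
+
+```python
+# Avg and Int should only disagree where Avg is the -99.99 sentinel
+mismatch = co2[co2["Avg"] != co2["Int"]]
+print(len(mismatch), (mismatch["Avg"] == -99.99).all())
+```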
+
+So, the third version of our data will use the `Int` feature instead of `Avg`.
+
+```{python}
+#| code-fold: false
+# 3. Use interpolated column which estimates missing Avg values
+co2_impute = co2.copy()
+co2_impute['Avg'] = co2['Int']
+co2_impute.head()
+```
+
+What's a **reasonable** estimate?
+
+To answer this question, let's zoom in on a short time period, say the measurements in 1958 (where we know we have two missing values).
+
+```{python}
+#| code-fold: true
+# results of plotting data in 1958
+
+def line_and_points(data, ax, title):
+ # assumes single year, hence Mo
+ ax.plot('Mo', 'Avg', data=data)
+ ax.scatter('Mo', 'Avg', data=data)
+ ax.set_xlim(2, 13)
+ ax.set_title(title)
+ ax.set_xticks(np.arange(3, 13))
+
+def data_year(data, year):
+    return data[data["Yr"] == year]
+
+# uses matplotlib subplots
+# you may see more next week; focus on output for now
+fig, axes = plt.subplots(ncols = 3, figsize=(12, 4), sharey=True)
+
+year = 1958
+line_and_points(data_year(co2_drop, year), axes[0], title="1. Drop Missing")
+line_and_points(data_year(co2_NA, year), axes[1], title="2. Missing Set to NaN")
+line_and_points(data_year(co2_impute, year), axes[2], title="3. Missing Interpolated")
+
+fig.suptitle(f"Monthly Averages for {year}")
+plt.tight_layout()
+```
+
+In the big picture, since there are only 7 `Avg` values missing (**<1%** of 738 months), any of these approaches would work.
+
+However, there is some appeal to **option 3: imputing**:
+
+* Shows seasonal trends for CO2
+* We are plotting all months in our data as a line plot
+
+<br/>
+
+
+Let's replot our original figure with option 3:
+
+```{python}
+#| code-fold: true
+sns.lineplot(x='DecDate', y='Avg', data=co2_impute)
+plt.title("CO2 Average By Month, Imputed");
+```
+
+Looks pretty close to what we see on the NOAA [website](https://gml.noaa.gov/ccgg/trends/)!
+
+## Presenting the data: A Discussion on Data Granularity
+
+From the description:
+
+* Monthly measurements are averages of daily average measurements.
+* The NOAA GML website has datasets for daily/hourly measurements too.
+
+The data you present depends on your research question.
+
+**How do CO2 levels vary by season?**
+
+* You might want to keep average monthly data.
+
+**Are CO2 levels rising over the past 50+ years, consistent with global warming predictions?**
+
+* You might be happier with a **coarser granularity** of average year data!
+
+```{python}
+#| code-fold: true
+co2_year = co2_impute.groupby('Yr').mean()
+sns.lineplot(x='Yr', y='Avg', data=co2_year)
+plt.title("CO2 Average By Year");
+```
+
+Indeed, we see a rise by nearly 100 ppm of CO2 since Mauna Loa began recording in 1958.
+
+# Summary
+We went over a lot of content this lecture; let's summarize the most important points:
+
+## Dealing with Missing Values
+There are a few options we can take to deal with missing data:
+
+* Drop missing records
+* Keep `NaN` missing values
+* Impute using an interpolated column
+
+## EDA and Data Wrangling
+There are several ways to approach EDA and Data Wrangling:
+
+* Examine the **data and metadata**: what is the date, size, organization, and structure of the data?
+* Examine each **field/attribute/dimension** individually.
+* Examine pairs of related dimensions (e.g. breaking down grades by major).
+* Along the way, we can:
+ * **Visualize** or summarize the data.
+ * **Validate assumptions** about data and its collection process. Pay particular attention to when the data was collected.
+ * Identify and **address anomalies**.
+ * Apply data transformations and corrections (we'll cover this in the upcoming lecture).
+ * **Record everything you do!** Developing in Jupyter Notebook promotes *reproducibility* of your own work!
diff --git a/docs/eda/eda_files/figure-html/cell-62-output-1.png b/docs/eda/eda_files/figure-html/cell-62-output-1.png
index a04218cf..f392d5f9 100644
Binary files a/docs/eda/eda_files/figure-html/cell-62-output-1.png and b/docs/eda/eda_files/figure-html/cell-62-output-1.png differ
diff --git a/docs/eda/eda_files/figure-html/cell-67-output-1.png b/docs/eda/eda_files/figure-html/cell-67-output-1.png
new file mode 100644
index 00000000..be96b8c9
Binary files /dev/null and b/docs/eda/eda_files/figure-html/cell-67-output-1.png differ
diff --git a/docs/eda/eda_files/figure-html/cell-67-output-2.png b/docs/eda/eda_files/figure-html/cell-67-output-2.png
deleted file mode 100644
index 31857f62..00000000
Binary files a/docs/eda/eda_files/figure-html/cell-67-output-2.png and /dev/null differ
diff --git a/docs/eda/eda_files/figure-html/cell-68-output-1.png b/docs/eda/eda_files/figure-html/cell-68-output-1.png
index 67c3959d..ffd29ff8 100644
Binary files a/docs/eda/eda_files/figure-html/cell-68-output-1.png and b/docs/eda/eda_files/figure-html/cell-68-output-1.png differ
diff --git a/docs/eda/eda_files/figure-html/cell-69-output-1.png b/docs/eda/eda_files/figure-html/cell-69-output-1.png
new file mode 100644
index 00000000..29088928
Binary files /dev/null and b/docs/eda/eda_files/figure-html/cell-69-output-1.png differ
diff --git a/docs/eda/eda_files/figure-html/cell-69-output-2.png b/docs/eda/eda_files/figure-html/cell-69-output-2.png
deleted file mode 100644
index fb28f5d5..00000000
Binary files a/docs/eda/eda_files/figure-html/cell-69-output-2.png and /dev/null differ
diff --git a/docs/eda/eda_files/figure-html/cell-71-output-1.png b/docs/eda/eda_files/figure-html/cell-71-output-1.png
index 39cac822..49ef3d6a 100644
Binary files a/docs/eda/eda_files/figure-html/cell-71-output-1.png and b/docs/eda/eda_files/figure-html/cell-71-output-1.png differ
diff --git a/docs/eda/eda_files/figure-html/cell-75-output-1.png b/docs/eda/eda_files/figure-html/cell-75-output-1.png
index 6382e58a..15a5fe82 100644
Binary files a/docs/eda/eda_files/figure-html/cell-75-output-1.png and b/docs/eda/eda_files/figure-html/cell-75-output-1.png differ
diff --git a/docs/eda/eda_files/figure-html/cell-76-output-1.png b/docs/eda/eda_files/figure-html/cell-76-output-1.png
index db2b0dee..40b1fc71 100644
Binary files a/docs/eda/eda_files/figure-html/cell-76-output-1.png and b/docs/eda/eda_files/figure-html/cell-76-output-1.png differ
diff --git a/docs/eda/eda_files/figure-html/cell-77-output-1.png b/docs/eda/eda_files/figure-html/cell-77-output-1.png
index 897b8b39..99b6c2d1 100644
Binary files a/docs/eda/eda_files/figure-html/cell-77-output-1.png and b/docs/eda/eda_files/figure-html/cell-77-output-1.png differ
diff --git a/docs/feature_engineering/feature_engineering.html b/docs/feature_engineering/feature_engineering.html
index ea770e7f..22d26788 100644
--- a/docs/feature_engineering/feature_engineering.html
+++ b/docs/feature_engineering/feature_engineering.html
@@ -556,7 +556,7 @@
my_model.fit(X, Y)
-LinearRegression()
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.LinearRegression()
+LinearRegression()
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.LinearRegression()
Notice that we use double brackets to extract this column. Why double brackets instead of just single brackets? The .fit
method, by default, expects to receive 2-dimensional data – some kind of data that includes both rows and columns. Writing penguins["flipper_length_mm"]
would return a 1D Series
, causing sklearn
to error. We avoid this by writing penguins[["flipper_length_mm"]]
to produce a 2D DataFrame
.
@@ -607,7 +607,7 @@
print(f"The RMSE of the model is {np.sqrt(np.mean((Y-Y_hat_two_features)**2))}")
-The RMSE of the model is 0.9881331104079044
+The RMSE of the model is 0.9881331104079045
We can also see that we obtain the same predictions using sklearn
as we did when applying the ordinary least squares formula before!
@@ -977,7 +977,7 @@
print(f"MSE of model with (hp^2) feature: {np.mean((Y-hp2_model_predictions)**2)}")
-MSE of model with (hp^2) feature: 18.984768907617223
+MSE of model with (hp^2) feature: 18.984768907617216
diff --git a/docs/feature_engineering/feature_engineering_files/figure-html/cell-16-output-2.png b/docs/feature_engineering/feature_engineering_files/figure-html/cell-16-output-2.png
index 92cb01c9..f8396667 100644
Binary files a/docs/feature_engineering/feature_engineering_files/figure-html/cell-16-output-2.png and b/docs/feature_engineering/feature_engineering_files/figure-html/cell-16-output-2.png differ
diff --git a/docs/feature_engineering/feature_engineering_files/figure-html/cell-17-output-2.png b/docs/feature_engineering/feature_engineering_files/figure-html/cell-17-output-2.png
index f4ae4ea0..ceecd30f 100644
Binary files a/docs/feature_engineering/feature_engineering_files/figure-html/cell-17-output-2.png and b/docs/feature_engineering/feature_engineering_files/figure-html/cell-17-output-2.png differ
diff --git a/docs/gradient_descent/gradient_descent.html b/docs/gradient_descent/gradient_descent.html
index 467ee5fb..ed238d2c 100644
--- a/docs/gradient_descent/gradient_descent.html
+++ b/docs/gradient_descent/gradient_descent.html
@@ -106,7 +106,7 @@
require.undef("plotly");
requirejs.config({
paths: {
- 'plotly': ['https://cdn.plot.ly/plotly-2.25.2.min']
+ 'plotly': ['https://cdn.plot.ly/plotly-2.12.1.min']
}
});
require(['plotly'], function(Plotly) {
@@ -439,9 +439,9 @@
-
@@ -4395,9 +4383,9 @@
-
+
-
+
@@ -4481,10 +4469,10 @@
-# 3. Use interpolated column which estimates missing Avg values
-co2_impute = co2.copy()
-co2_impute['Avg'] = co2['Int']
-co2_impute.head()
+# 3. Use interpolated column which estimates missing Avg values
+co2_impute = co2.copy()
+co2_impute['Avg'] = co2['Int']
+co2_impute.head()
@@ -4564,30 +4552,30 @@
Code
-# results of plotting data in 1958
-
-def line_and_points(data, ax, title):
- # assumes single year, hence Mo
- ax.plot('Mo', 'Avg', data=data)
- ax.scatter('Mo', 'Avg', data=data)
- ax.set_xlim(2, 13)
- ax.set_title(title)
- ax.set_xticks(np.arange(3, 13))
-
-def data_year(data, year):
- return data[data["Yr"] == 1958]
-
-# uses matplotlib subplots
-# you may see more next week; focus on output for now
-fig, axes = plt.subplots(ncols = 3, figsize=(12, 4), sharey=True)
-
-year = 1958
-line_and_points(data_year(co2_drop, year), axes[0], title="1. Drop Missing")
-line_and_points(data_year(co2_NA, year), axes[1], title="2. Missing Set to NaN")
-line_and_points(data_year(co2_impute, year), axes[2], title="3. Missing Interpolated")
-
-fig.suptitle(f"Monthly Averages for {year}")
-plt.tight_layout()
+# results of plotting data in 1958
+
+def line_and_points(data, ax, title):
+ # assumes single year, hence Mo
+ ax.plot('Mo', 'Avg', data=data)
+ ax.scatter('Mo', 'Avg', data=data)
+ ax.set_xlim(2, 13)
+ ax.set_title(title)
+ ax.set_xticks(np.arange(3, 13))
+
+def data_year(data, year):
+ return data[data["Yr"] == 1958]
+
+# uses matplotlib subplots
+# you may see more next week; focus on output for now
+fig, axes = plt.subplots(ncols = 3, figsize=(12, 4), sharey=True)
+
+year = 1958
+line_and_points(data_year(co2_drop, year), axes[0], title="1. Drop Missing")
+line_and_points(data_year(co2_NA, year), axes[1], title="2. Missing Set to NaN")
+line_and_points(data_year(co2_impute, year), axes[2], title="3. Missing Interpolated")
+
+fig.suptitle(f"Monthly Averages for {year}")
+plt.tight_layout()
@@ -4604,8 +4592,8 @@
Code
-
+
@@ -4632,9 +4620,9 @@
Code
-
+
@@ -4975,1218 +4963,1218 @@ <
Source Code
----
-title: Data Cleaning and EDA
-execute:
- echo: true
-format:
- html:
- code-fold: true
- code-tools: true
- toc: true
- toc-title: Data Cleaning and EDA
- page-layout: full
- theme:
- - cosmo
- - cerulean
- callout-icon: false
-jupyter: python3
----
-
-```{python}
-#| code-fold: true
-import numpy as np
-import pandas as pd
-
-import matplotlib.pyplot as plt
-import seaborn as sns
-#%matplotlib inline
-plt.rcParams['figure.figsize'] = (12, 9)
-
-sns.set()
-sns.set_context('talk')
-np.set_printoptions(threshold=20, precision=2, suppress=True)
-pd.set_option('display.max_rows', 30)
-pd.set_option('display.max_columns', None)
-pd.set_option('display.precision', 2)
-# This option stops scientific notation for pandas
-pd.set_option('display.float_format', '{:.2f}'.format)
-
-# Silence some spurious seaborn warnings
-import warnings
-warnings.filterwarnings("ignore", category=FutureWarning)
-```
-
-::: {.callout-note collapse="false"}
-## Learning Outcomes
-* Recognize common file formats
-* Categorize data by its variable type
-* Build awareness of issues with data faithfulness and develop targeted solutions
-:::
-
-**This content is covered in lectures 4, 5, and 6.**
-
-In the past few lectures, we've learned that `pandas` is a toolkit to restructure, modify, and explore a dataset. What we haven't yet touched on is *how* to make these data transformation decisions. When we receive a new set of data from the "real world," how do we know what processing we should do to convert this data into a usable form?
-
-**Data cleaning**, also called **data wrangling**, is the process of transforming raw data to facilitate subsequent analysis. It is often used to address issues like:
-
-* Unclear structure or formatting
-* Missing or corrupted values
-* Unit conversions
-* ...and so on
-
-**Exploratory Data Analysis (EDA)** is the process of understanding a new dataset. It is an open-ended, informal analysis that involves familiarizing ourselves with the variables present in the data, discovering potential hypotheses, and identifying possible issues with the data. This last point can often motivate further data cleaning to address any problems with the dataset's format; because of this, EDA and data cleaning are often thought of as an "infinite loop," with each process driving the other.
-
-In this lecture, we will consider the key properties of data to consider when performing data cleaning and EDA. In doing so, we'll develop a "checklist" of sorts for you to consider when approaching a new dataset. Throughout this process, we'll build a deeper understanding of this early (but very important!) stage of the data science lifecycle.
-
-## Structure
-
-### File Formats
-There are many file types for storing structured data: TSV, JSON, XML, ASCII, SAS, etc. We'll only cover CSV, TSV, and JSON in lecture, but you'll likely encounter other formats as you work with different datasets. Reading documentation is your best bet for understanding how to process the multitude of different file types.
-
-#### CSV
-CSVs, which stand for **Comma-Separated Values**, are a common tabular data format.
-In the past two `pandas` lectures, we briefly touched on the idea of file format: the way data is encoded in a file for storage. Specifically, our `elections` and `babynames` datasets were stored and loaded as CSVs:
-
-```{python}
-#| code-fold: false
-pd.read_csv("data/elections.csv").head(5)
-```
-
-To better understand the properties of a CSV, let's take a look at the first few rows of the raw data file to see what it looks like before being loaded into a `DataFrame`. We'll use the `repr()` function to return the raw string with its special characters:
-
-```{python}
-#| code-fold: false
-with open("data/elections.csv", "r") as table:
- i = 0
- for row in table:
- print(repr(row))
- i += 1
- if i > 3:
- break
-```
-
-Each row, or **record**, in the data is delimited by a newline `\n`. Each column, or **field**, in the data is delimited by a comma `,` (hence, comma-separated!).
-
-#### TSV
-
-Another common file type is **TSV (Tab-Separated Values)**. In a TSV, records are still delimited by a newline `\n`, while fields are delimited by `\t` tab character.
-
-Let's check out the first few rows of the raw TSV file. Again, we'll use the `repr()` function so that `print` shows the special characters.
-
-```{python}
-#| code-fold: false
-with open("data/elections.txt", "r") as table:
- i = 0
- for row in table:
- print(repr(row))
- i += 1
- if i > 3:
- break
-```
-
-TSVs can be loaded into `pandas` using `pd.read_csv`. We'll need to specify the **delimiter** with parameter` sep='\t'` [(documentation)](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html).
-
-```{python}
-#| code-fold: false
-pd.read_csv("data/elections.txt", sep='\t').head(3)
-```
-
-An issue with CSVs and TSVs comes up whenever there are commas or tabs within the records. How does `pandas` differentiate between a comma delimiter vs. a comma within the field itself, for example `8,900`? To remedy this, check out the [`quotechar` parameter](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html).
-
-#### JSON
-**JSON (JavaScript Object Notation)** files behave similarly to Python dictionaries. A raw JSON is shown below.
-
-```{python}
-#| code-fold: false
-with open("data/elections.json", "r") as table:
- i = 0
- for row in table:
- print(row)
- i += 1
- if i > 8:
- break
-```
-
-JSON files can be loaded into `pandas` using `pd.read_json`.
-
-```{python}
-#| code-fold: false
-pd.read_json('data/elections.json').head(3)
-```
-
-##### EDA with JSON: Berkeley COVID-19 Data
-The City of Berkeley Open Data [website](https://data.cityofberkeley.info/Health/COVID-19-Confirmed-Cases/xn6j-b766) has a dataset with COVID-19 Confirmed Cases among Berkeley residents by date. Let's download the file and save it as a JSON (note the source URL file type is also a JSON). In the interest of reproducible data science, we will download the data programatically. We have defined some helper functions in the [`ds100_utils.py`](https://ds100.org/fa23/resources/assets/lectures/lec05/lec05-eda.html) file that we can reuse these helper functions in many different notebooks.
-
-```{python}
-#| code-fold: false
-from ds100_utils import fetch_and_cache
-
-covid_file = fetch_and_cache(
- "https://data.cityofberkeley.info/api/views/xn6j-b766/rows.json?accessType=DOWNLOAD",
- "confirmed-cases.json",
- force=False)
-covid_file # a file path wrapper object
-```
-
-###### File Size
-Let's start our analysis by getting a rough estimate of the size of the dataset to inform the tools we use to view the data. For relatively small datasets, we can use a text editor or spreadsheet. For larger datasets, more programmatic exploration or distributed computing tools may be more fitting. Here we will use `Python` tools to probe the file.
-
-Since there seem to be text files, let's investigate the number of lines, which often corresponds to the number of records
-
-```{python}
-#| code-fold: false
-import os
-
-print(covid_file, "is", os.path.getsize(covid_file) / 1e6, "MB")
-
-with open(covid_file, "r") as f:
- print(covid_file, "is", sum(1 for l in f), "lines.")
-```
-
-###### Unix Commands
-As part of the EDA workflow, Unix commands can come in very handy. In fact, there's an entire book called ["Data Science at the Command Line"](https://datascienceatthecommandline.com/) that explores this idea in depth!
-In Jupyter/IPython, you can prefix lines with `!` to execute arbitrary Unix commands, and within those lines, you can refer to `Python` variables and expressions with the syntax `{expr}`.
-
-Here, we use the `ls` command to list files, using the `-lh` flags, which request "long format with information in human-readable form." We also use the `wc` command for "word count," but with the `-l` flag, which asks for line counts instead of words.
-
-These two give us the same information as the code above, albeit in a slightly different form:
-
-```{python}
-#| code-fold: false
-!ls -lh {covid_file}
-!wc -l {covid_file}
-```
-
-###### File Contents
-Let's explore the data format using `Python`.
-
-```{python}
-#| code-fold: false
-with open(covid_file, "r") as f:
- for i, row in enumerate(f):
- print(repr(row)) # print raw strings
- if i >= 4: break
-```
-
-We can use the `head` Unix command (which is where `pandas`' `head` method comes from!) to see the first few lines of the file:
-
-```{python}
-#| code-fold: false
-!head -5 {covid_file}
-```
-
-In order to load the JSON file into `pandas`, Let's first do some EDA with `Python`'s `json` package to understand the particular structure of this JSON file so that we can decide what (if anything) to load into `pandas`. `Python` has relatively good support for JSON data since it closely matches the internal python object model. In the following cell we import the entire JSON datafile into a python dictionary using the `json` package.
-
-```{python}
-#| code-fold: false
-import json
-
-with open(covid_file, "rb") as f:
- covid_json = json.load(f)
-```
-
-The `covid_json` variable is now a dictionary encoding the data in the file:
-
-```{python}
-#| code-fold: false
-type(covid_json)
-```
-
-We can examine what keys are in the top level json object by listing out the keys.
-
-```{python}
-#| code-fold: false
-covid_json.keys()
-```
-
-**Observation**: The JSON dictionary contains a `meta` key which likely refers to meta data (data about the data). Meta data often maintained with the data and can be a good source of additional information.
-
-
-We can investigate the meta data further by examining the keys associated with the metadata.
-
-```{python}
-#| code-fold: false
-covid_json['meta'].keys()
-```
-
-The `meta` key contains another dictionary called `view`. This likely refers to meta-data about a particular "view" of some underlying database. We will learn more about views when we study SQL later in the class.
-
-```{python}
-#| code-fold: false
-covid_json['meta']['view'].keys()
-```
-
-Notice that this a nested/recursive data structure. As we dig deeper we reveal more and more keys and the corresponding data:
-
-```
-meta
-|-> data
- | ... (haven't explored yet)
-|-> view
- | -> id
- | -> name
- | -> attribution
- ...
- | -> description
- ...
- | -> columns
- ...
-```
-
-
-There is a key called description in the view sub dictionary. This likely contains a description of the data:
-
-```{python}
-#| code-fold: false
-print(covid_json['meta']['view']['description'])
-```
-
-###### Examining the Data Field for Records
-
-We can look at a few entries in the `data` field. This is what we'll load into `pandas`.
-
-```{python}
-#| code-fold: false
-for i in range(3):
- print(f"{i:03} | {covid_json['data'][i]}")
-```
-
-Observations:
-* These look like equal-length records, so maybe `data` is a table!
-* But what do each of values in the record mean? Where can we find column headers?
-
-For that, we'll need the `columns` key in the metadata dictionary. This returns a list:
-
-```{python}
-#| code-fold: false
-type(covid_json['meta']['view']['columns'])
-```
-
-###### Summary of exploring the JSON file
-
-1. The above **metadata** tells us a lot about the columns in the data including column names, potential data anomalies, and a basic statistic.
-1. Because of its non-tabular structure, JSON makes it easier (than CSV) to create **self-documenting data**, meaning that information about the data is stored in the same file as the data.
-1. Self-documenting data can be helpful since it maintains its own description and these descriptions are more likely to be updated as data changes.
-
-###### Loading COVID Data into `pandas`
-Finally, let's load the data (not the metadata) into a `pandas` `DataFrame`. In the following block of code we:
-
-1. Translate the JSON records into a `DataFrame`:
-
- * fields: `covid_json['meta']['view']['columns']`
- * records: `covid_json['data']`
-
-
-1. Remove columns that have no metadata description. This would be a bad idea in general, but here we remove these columns since the above analysis suggests they are unlikely to contain useful information.
-
-1. Examine the `tail` of the table.
-
-```{python}
-#| code-fold: false
-# Load the data from JSON and assign column titles
-covid = pd.DataFrame(
- covid_json['data'],
- columns=[c['name'] for c in covid_json['meta']['view']['columns']])
-
-covid.tail()
-```
-
-### Variable Types
-
-After loading data into a file, it's a good idea to take the time to understand what pieces of information are encoded in the dataset. In particular, we want to identify what variable types are present in our data. Broadly speaking, we can categorize variables into one of two overarching types.
-
-**Quantitative variables** describe some numeric quantity or amount. We can divide quantitative data further into:
-
-* **Continuous quantitative variables**: numeric data that can be measured on a continuous scale to arbitrary precision. Continuous variables do not have a strict set of possible values – they can be recorded to any number of decimal places. For example, weights, GPA, or CO<sub>2</sub> concentrations.
-* **Discrete quantitative variables**: numeric data that can only take on a finite set of possible values. For example, someone's age or the number of siblings they have.
-
-**Qualitative variables**, also known as **categorical variables**, describe data that isn't measuring some quantity or amount. The sub-categories of categorical data are:
-
-* **Ordinal qualitative variables**: categories with ordered levels. Specifically, ordinal variables are those where the difference between levels has no consistent, quantifiable meaning. Some examples include levels of education (high school, undergrad, grad, etc.), income bracket (low, medium, high), or Yelp rating.
-* **Nominal qualitative variables**: categories with no specific order. For example, someone's political affiliation or Cal ID number.
-
-![Classification of variable types](images/variable.png)
-
-Note that many variables don't sit neatly in just one of these categories. Qualitative variables could have numeric levels, and conversely, quantitative variables could be stored as strings.
-
-### Primary and Foreign Keys
-
-Last time, we introduced `.merge` as the `pandas` method for joining multiple `DataFrame`s together. In our discussion of joins, we touched on the idea of using a "key" to determine what rows should be merged from each table. Let's take a moment to examine this idea more closely.
-
-The **primary key** is the column or set of columns in a table that *uniquely* determine the values of the remaining columns. It can be thought of as the unique identifier for each individual row in the table. For example, a table of Data 100 students might use each student's Cal ID as the primary key.
-
-```{python}
-#| echo: false
-pd.DataFrame({"Cal ID":[3034619471, 3035619472, 3025619473, 3046789372], \
- "Name":["Oski", "Ollie", "Orrie", "Ollie"], \
- "Major":["Data Science", "Computer Science", "Data Science", "Economics"]})
-```
-
-The **foreign key** is the column or set of columns in a table that reference primary keys in other tables. Knowing a dataset's foreign keys can be useful when assigning the `left_on` and `right_on` parameters of `.merge`. In the table of office hour tickets below, `"Cal ID"` is a foreign key referencing the previous table.
-
-```{python}
-#| echo: false
-pd.DataFrame({"OH Request":[1, 2, 3, 4], \
- "Cal ID":[3034619471, 3035619472, 3025619473, 3035619472], \
- "Question":["HW 2 Q1", "HW 2 Q3", "Lab 3 Q4", "HW 2 Q7"]})
-```
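-
-As a quick illustration of how these keys drive a join (using hypothetical variable names, since the display cells above don't assign any), merging the office hours tickets with the student table might look like the following sketch:
-
-```python
-# Hypothetical names for the two tables displayed above.
-students = pd.DataFrame({"Cal ID": [3034619471, 3035619472, 3025619473, 3046789372],
-                         "Name": ["Oski", "Ollie", "Orrie", "Ollie"],
-                         "Major": ["Data Science", "Computer Science", "Data Science", "Economics"]})
-requests = pd.DataFrame({"OH Request": [1, 2, 3, 4],
-                         "Cal ID": [3034619471, 3035619472, 3025619473, 3035619472],
-                         "Question": ["HW 2 Q1", "HW 2 Q3", "Lab 3 Q4", "HW 2 Q7"]})
-
-# The foreign key in `requests` matches the primary key in `students`.
-requests.merge(students, left_on="Cal ID", right_on="Cal ID")
-```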
-
-## Granularity, Scope, and Temporality
-
-After understanding the structure of the dataset, the next task is to determine what exactly the data represents. We'll do so by considering the data's granularity, scope, and temporality.
-
-### Granularity
-The **granularity** of a dataset is what a single row represents. You can also think of it as the level of detail included in the data. To determine the data's granularity, ask: what does each row in the dataset represent? Fine-grained data contains a high level of detail, with a single row representing a small individual unit. For example, each record may represent one person. Coarse-grained data is encoded such that a single row represents a large individual unit – for example, each record may represent a group of people.
-
-### Scope
-The **scope** of a dataset is the subset of the population covered by the data. If we were investigating student performance in Data Science courses, a dataset with a narrow scope might encompass all students enrolled in Data 100 whereas a dataset with an expansive scope might encompass all students in California.
-
-### Temporality
-The **temporality** of a dataset describes the periodicity over which the data was collected as well as when the data was most recently collected or updated.
-
-Time and date fields of a dataset could represent a few things:
-
-1. when the "event" happened
-2. when the data was collected, or when it was entered into the system
-3. when the data was copied into the database
-
-To fully understand the temporality of the data, it also may be necessary to standardize time zones or inspect recurring time-based trends in the data (do patterns recur in 24-hour periods? Over the course of a month? Seasonally?). The convention for standardizing time is Coordinated Universal Time (UTC), an international time standard measured at 0 degrees longitude that stays consistent throughout the year (no daylight saving time). Berkeley's time zone, Pacific Standard Time (PST), is UTC-8; during daylight saving time, Pacific Daylight Time (PDT) is UTC-7.
-
-#### Temporality with `pandas`' `dt` accessors
-Let's briefly look at how we can use `pandas`' `dt` accessors to work with dates/times in a dataset using the dataset you'll see in Lab 3: the Berkeley PD Calls for Service dataset.
-
-```{python}
-#| code-fold: true
-calls = pd.read_csv("data/Berkeley_PD_-_Calls_for_Service.csv")
-calls.head()
-```
-
-Looks like there are three columns with dates/times: `EVENTDT`, `EVENTTM`, and `InDbDate`.
-
-Most likely, `EVENTDT` stands for the date when the event took place, `EVENTTM` stands for the time of day the event took place (in 24-hr format), and `InDbDate` is the date this call is recorded onto the database.
-
-If we check the data type of these columns, we will see they are stored as strings. We can convert them to `datetime` objects using pandas `to_datetime` function.
-
-```{python}
-#| code-fold: false
-calls["EVENTDT"] = pd.to_datetime(calls["EVENTDT"])
-calls.head()
-```
-
-Now, we can use the `dt` accessor on this column.
-
-We can get the month:
-
-```{python}
-#| code-fold: false
-calls["EVENTDT"].dt.month.head()
-```
-
-Which day of the week the date is on:
-
-```{python}
-#| code-fold: false
-calls["EVENTDT"].dt.dayofweek.head()
-```
-
-Check the minimum values to see if there are any suspicious-looking dates from the 1970s:
-
-```{python}
-#| code-fold: false
-calls.sort_values("EVENTDT").head()
-```
-
-Doesn't look like it! We are good!
-
-
-We can also do many things with the `dt` accessor like switching time zones and converting time back to UNIX/POSIX time. Check out the documentation on [`.dt` accessor](https://pandas.pydata.org/docs/user_guide/basics.html#basics-dt-accessors) and [time series/date functionality](https://pandas.pydata.org/docs/user_guide/timeseries.html#).
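-
-For instance, a minimal sketch of time zone handling and UNIX timestamps with this dataset (assuming `EVENTDT` carries no time zone information yet) might look like:
-
-```python
-# Attach a time zone, convert it, and view the underlying UNIX timestamps.
-berkeley_time = calls["EVENTDT"].dt.tz_localize("US/Pacific")
-utc_time = berkeley_time.dt.tz_convert("UTC")
-unix_seconds = calls["EVENTDT"].astype("int64") // 10**9  # nanoseconds since epoch -> seconds
-```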
-
-## Faithfulness
-
-At this stage in our data cleaning and EDA workflow, we've achieved quite a lot: we've identified how our data is structured, come to terms with what information it encodes, and gained insight as to how it was generated. Throughout this process, we should always recall the original intent of our work in Data Science – to use data to better understand and model the real world. To achieve this goal, we need to ensure that the data we use is faithful to reality; that is, that our data accurately captures the "real world."
-
-Data used in research or industry is often "messy" – there may be errors or inaccuracies that impact the faithfulness of the dataset. Signs that data may not be faithful include:
-
-* Unrealistic or "incorrect" values, such as negative counts, locations that don't exist, or dates set in the future
-* Violations of obvious dependencies, like an age that does not match a birthday
-* Clear signs that data was entered by hand, which can lead to spelling errors or fields that are incorrectly shifted
-* Signs of data falsification, such as fake email addresses or repeated use of the same names
-* Duplicated records or fields containing the same information
-* Truncated data, e.g. older versions of Microsoft Excel limited spreadsheets to 65,536 rows and 256 columns
-
-We often solve some of these more common issues in the following ways:
-
-* Spelling errors: apply corrections or drop records that aren't in a dictionary
-* Time zone inconsistencies: convert to a common time zone (e.g. UTC)
-* Duplicated records or fields: identify and eliminate duplicates (using primary keys)
-* Unspecified or inconsistent units: infer the units and check that values are in reasonable ranges in the data
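-
-For instance, the de-duplication fix above might look like the following sketch, assuming a `DataFrame` `df` whose primary key is a `"Cal ID"` column:
-
-```python
-# Count rows that repeat an existing primary key, then keep only the first record per key.
-num_dupes = df.duplicated(subset=["Cal ID"]).sum()
-df_unique = df.drop_duplicates(subset=["Cal ID"])
-```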
-
-### Missing Values
-Another common issue encountered with real-world datasets is that of missing data. One strategy to resolve this is to simply drop any records with missing values from the dataset. This does, however, introduce the risk of inducing biases – it is possible that the missing or corrupt records may be systemically related to some feature of interest in the data. Another solution is to keep the data as `NaN` values.
-
-A third method to address missing data is to perform **imputation**: infer the missing values using other data available in the dataset. There is a wide variety of imputation techniques that can be implemented; some of the most common are listed below.
-
-* Average imputation: replace missing values with the average value for that field
-* Hot deck imputation: replace missing values with some random value
-* Regression imputation: develop a model to predict missing values
-* Multiple imputation: replace missing values with multiple random values
-
-Regardless of the strategy used to deal with missing data, we should think carefully about *why* particular records or fields may be missing – this can help inform whether or not the absence of these values is significant or meaningful.
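-
-As a rough sketch of the first two techniques listed above (on a hypothetical numeric column `"weight"` of a `DataFrame` `df`):
-
-```python
-# Average imputation: fill missing values with the column mean.
-df["weight_mean_imputed"] = df["weight"].fillna(df["weight"].mean())
-
-# Hot deck imputation: fill each missing value with a randomly sampled observed value.
-observed = df["weight"].dropna()
-df["weight_hot_deck"] = df["weight"].apply(
-    lambda v: observed.sample(1).iloc[0] if pd.isna(v) else v
-)
-```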
-
-# EDA Demo 1: Tuberculosis in the United States
-
-Now, let's walk through the data-cleaning and EDA workflow to see what we can learn about the presence of tuberculosis in the United States!
-
-We will examine the data included in the [original CDC article](https://www.cdc.gov/mmwr/volumes/71/wr/mm7112a1.htm?s_cid=mm7112a1_w#T1_down) published in 2021.
-
-
-## CSVs and Field Names
-Suppose Table 1 was saved as a CSV file located in `data/cdc_tuberculosis.csv`.
-
-We can then explore the CSV (which is a text file, and does not contain binary-encoded data) in many ways:
-1. Using a text editor like emacs, vim, VSCode, etc.
-2. Opening the CSV directly in DataHub (read-only), Excel, Google Sheets, etc.
-3. The `Python` file object
-4. `pandas`, using `pd.read_csv()`
-
-To try out options 1 and 2, you can view or download the tuberculosis dataset from the [lecture demo notebook](https://data100.datahub.berkeley.edu/hub/user-redirect/git-pull?repo=https%3A%2F%2Fgithub.com%2FDS-100%2Ffa23-student&urlpath=lab%2Ftree%2Ffa23-student%2Flecture%2Flec05%2Flec04-eda.ipynb&branch=main) under the `data` folder in the left-hand menu. Notice how the CSV file is a type of **rectangular data (i.e., tabular data) stored as comma-separated values**.
-
-Next, let's try out option 3 using the `Python` file object. We'll look at the first four lines:
-
-```{python}
-#| code-fold: true
-with open("data/cdc_tuberculosis.csv", "r") as f:
- i = 0
- for row in f:
- print(row)
- i += 1
- if i > 3:
- break
-```
-
-Whoa, why are there blank lines interspersed between the lines of the CSV?
-
-You may recall that all line breaks in text files are encoded as the special newline character `\n`. Python's `print()` prints each string (including the newline), and an additional newline on top of that.
-
-If you're curious, we can use the `repr()` function to return the raw string with all special characters:
-
-```{python}
-#| code-fold: true
-with open("data/cdc_tuberculosis.csv", "r") as f:
- i = 0
- for row in f:
- print(repr(row)) # print raw strings
- i += 1
- if i > 3:
- break
-```
-
-Finally, let's try option 4 and use the tried-and-true Data 100 approach: `pandas`.
-
-```{python}
-#| code-fold: false
-tb_df = pd.read_csv("data/cdc_tuberculosis.csv")
-tb_df.head()
-```
-
-You may notice some strange things about this table: what's up with the "Unnamed" column names and the first row?
-
-Congratulations — you're ready to wrangle your data! Because of how things are stored, we'll need to clean the data a bit to name our columns better.
-
-A reasonable first step is to identify the row with the right header. The `pd.read_csv()` function ([documentation](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html)) has the convenient `header` parameter that we can set to use the elements in row 1 as the appropriate columns:
-
-```{python}
-#| code-fold: false
-tb_df = pd.read_csv("data/cdc_tuberculosis.csv", header=1) # row index
-tb_df.head(5)
-```
-
-Wait...but now we can't differentiate between the "Number of TB cases" and "TB incidence" year columns. `pandas` has tried to make our lives easier by automatically adding ".1" to the latter columns, but this doesn't help us, as humans, understand the data.
-
-We can rename the columns manually with `df.rename()` ([documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.rename.html?highlight=rename#pandas.DataFrame.rename)):
-
-```{python}
-#| code-fold: false
-rename_dict = {'2019': 'TB cases 2019',
- '2020': 'TB cases 2020',
- '2021': 'TB cases 2021',
- '2019.1': 'TB incidence 2019',
- '2020.1': 'TB incidence 2020',
- '2021.1': 'TB incidence 2021'}
-tb_df = tb_df.rename(columns=rename_dict)
-tb_df.head(5)
-```
-
-## Record Granularity
-
-You might already be wondering: what's up with that first record?
-
-Row 0 is what we call a **rollup record**, or summary record. It's often useful when displaying tables to humans. The **granularity** of record 0 (Totals) vs the rest of the records (States) is different.
-
-Okay, EDA step two. How was the rollup record aggregated?
-
-Let's check if Total TB cases is the sum of all state TB cases. If we sum over all rows, we should get **2x** the total cases in each of our TB cases by year (why do you think this is?).
-
-```{python}
-#| code-fold: true
-tb_df.sum(axis=0)
-```
-
-Whoa, what's going on with the TB cases in 2019, 2020, and 2021? Check out the column types:
-
-```{python}
-#| code-fold: true
-tb_df.dtypes
-```
-
-Since there are commas in the values for TB cases, the numbers are read as the `object` datatype, or **storage type** (close to the `Python` string datatype), so `pandas` is concatenating strings instead of adding integers (recall that `Python` can "sum", or concatenate, strings together: `"data" + "100"` evaluates to `"data100"`).
-
-
-Fortunately `read_csv` also has a `thousands` parameter ([documentation](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html)):
-
-```{python}
-#| code-fold: false
-# improve readability: chaining method calls with outer parentheses/line breaks
-tb_df = (
- pd.read_csv("data/cdc_tuberculosis.csv", header=1, thousands=',')
- .rename(columns=rename_dict)
-)
-tb_df.head(5)
-```
-
-```{python}
-#| code-fold: false
-tb_df.sum()
-```
-
-The Total TB cases look right. Phew!
-
-Let's just look at the records with **state-level granularity**:
-
-```{python}
-#| code-fold: true
-state_tb_df = tb_df[1:]
-state_tb_df.head(5)
-```
-
-## Gather Census Data
-
-U.S. Census population estimates [source](https://www.census.gov/data/tables/time-series/demo/popest/2010s-state-total.html) (2019), [source](https://www.census.gov/data/tables/time-series/demo/popest/2020s-state-total.html) (2020-2021).
-
-Running the cells below cleans the data.
-There are a few new methods here:
-* `df.convert_dtypes()` ([documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.convert_dtypes.html)) conveniently converts all float dtypes into ints and is out of scope for the class.
-* `df.dropna()` ([documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.dropna.html)) will be explained in more detail next time.
-
-```{python}
-#| code-fold: true
-# 2010s census data
-census_2010s_df = pd.read_csv("data/nst-est2019-01.csv", header=3, thousands=",")
-census_2010s_df = (
- census_2010s_df
- .reset_index()
- .drop(columns=["index", "Census", "Estimates Base"])
- .rename(columns={"Unnamed: 0": "Geographic Area"})
- .convert_dtypes() # "smart" converting of columns, use at your own risk
- .dropna() # we'll introduce this next time
-)
-census_2010s_df['Geographic Area'] = census_2010s_df['Geographic Area'].str.strip('.')
-
-# with pd.option_context('display.min_rows', 30): # shows more rows
-# display(census_2010s_df)
-
-census_2010s_df.head(5)
-```
-
-Occasionally, you will want to modify code that you have imported. To reimport those modifications you can either use `python`'s `importlib` library:
-
-```python
-from importlib import reload
-reload(utils)
-```
-
-or use `iPython` magic which will intelligently import code when files change:
-
-```python
-%load_ext autoreload
-%autoreload 2
-```
-
-```{python}
-#| code-fold: true
-# census 2020s data
-census_2020s_df = pd.read_csv("data/NST-EST2022-POP.csv", header=3, thousands=",")
-census_2020s_df = (
- census_2020s_df
- .reset_index()
- .drop(columns=["index", "Unnamed: 1"])
- .rename(columns={"Unnamed: 0": "Geographic Area"})
- .convert_dtypes() # "smart" converting of columns, use at your own risk
- .dropna() # we'll introduce this next time
-)
-census_2020s_df['Geographic Area'] = census_2020s_df['Geographic Area'].str.strip('.')
-
-census_2020s_df.head(5)
-```
-
-## Joining Data (Merging `DataFrame`s)
-
-Time to `merge`! Here we use the `DataFrame` method `df1.merge(right=df2, ...)` on `DataFrame` `df1` ([documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.merge.html)). Contrast this with the function `pd.merge(left=df1, right=df2, ...)` ([documentation](https://pandas.pydata.org/docs/reference/api/pandas.merge.html?highlight=pandas%20merge#pandas.merge)). Feel free to use either.
-
-```{python}
-#| code-fold: false
-# merge TB DataFrame with two US census DataFrames
-tb_census_df = (
- tb_df
- .merge(right=census_2010s_df,
- left_on="U.S. jurisdiction", right_on="Geographic Area")
- .merge(right=census_2020s_df,
- left_on="U.S. jurisdiction", right_on="Geographic Area")
-)
-tb_census_df.head(5)
-```
-
-Having all of these columns is a little unwieldy. We could either drop the unneeded columns now, or just merge on smaller census `DataFrame`s. Let's do the latter.
-
-```{python}
-#| code-fold: false
-# try merging again, but cleaner this time
-tb_census_df = (
- tb_df
- .merge(right=census_2010s_df[["Geographic Area", "2019"]],
- left_on="U.S. jurisdiction", right_on="Geographic Area")
- .drop(columns="Geographic Area")
- .merge(right=census_2020s_df[["Geographic Area", "2020", "2021"]],
- left_on="U.S. jurisdiction", right_on="Geographic Area")
- .drop(columns="Geographic Area")
-)
-tb_census_df.head(5)
-```
-
-## Reproducing Data: Compute Incidence
-
-Let's recompute incidence to make sure we know where the original CDC numbers came from.
-
-From the [CDC report](https://www.cdc.gov/mmwr/volumes/71/wr/mm7112a1.htm?s_cid=mm7112a1_w#T1_down): TB incidence is computed as “Cases per 100,000 persons using mid-year population estimates from the U.S. Census Bureau.”
-
-If we define a group as 100,000 people, then we can compute the TB incidence for a given state population as
-
-$$\text{TB incidence} = \frac{\text{TB cases in population}}{\text{groups in population}} = \frac{\text{TB cases in population}}{\text{population}/100000} $$
-
-$$= \frac{\text{TB cases in population}}{\text{population}} \times 100000$$
-
-Let's try this for 2019:
-
-```{python}
-#| code-fold: false
-tb_census_df["recompute incidence 2019"] = tb_census_df["TB cases 2019"]/tb_census_df["2019"]*100000
-tb_census_df.head(5)
-```
-
-Awesome!!!
-
-Let's use a for-loop and `Python` format strings to compute TB incidence for all years. `Python` f-strings are just used for the purposes of this demo, but they're handy to know when you explore data beyond this course ([documentation](https://docs.python.org/3/tutorial/inputoutput.html)).
-
-```{python}
-#| code-fold: false
-# recompute incidence for all years
-for year in [2019, 2020, 2021]:
- tb_census_df[f"recompute incidence {year}"] = tb_census_df[f"TB cases {year}"]/tb_census_df[f"{year}"]*100000
-tb_census_df.head(5)
-```
-
-These numbers look pretty close!!! There are a few errors in the hundredths place, particularly in 2021. It may be useful to further explore reasons behind this discrepancy.
-
-```{python}
-#| code-fold: false
-tb_census_df.describe()
-```
-
-## Bonus EDA: Reproducing the Reported Statistic
-
-
-**How do we reproduce that reported statistic in the original [CDC report](https://www.cdc.gov/mmwr/volumes/71/wr/mm7112a1.htm?s_cid=mm7112a1_w)?**
-
-> Reported TB incidence (cases per 100,000 persons) increased **9.4%**, from **2.2** during 2020 to **2.4** during 2021 but was lower than incidence during 2019 (2.7). Increases occurred among both U.S.-born and non–U.S.-born persons.
-
-This is TB incidence computed across the entire U.S. population! How do we reproduce this?
-* We need to reproduce the "Total" TB incidences in our rolled record.
-* But our current `tb_census_df` only has 51 entries (50 states plus Washington, D.C.). There is no rolled record.
-* What happened...?
-
-Let's get exploring!
-
-Before we keep exploring, we'll set all indexes to more meaningful values, instead of just numbers that pertain to some row at some point. This will make our cleaning slightly easier.
-
-```{python}
-#| code-fold: true
-tb_df = tb_df.set_index("U.S. jurisdiction")
-tb_df.head(5)
-```
-
-```{python}
-#| code-fold: false
-census_2010s_df = census_2010s_df.set_index("Geographic Area")
-census_2010s_df.head(5)
-```
-
-```{python}
-#| code-fold: false
-census_2020s_df = census_2020s_df.set_index("Geographic Area")
-census_2020s_df.head(5)
-```
-
-It turns out that our merge above only kept state records, even though our original `tb_df` had the "Total" rolled record:
-
-```{python}
-#| code-fold: false
-tb_df.head()
-```
-
-Recall that `merge` performs an **inner** merge by default, meaning that it only preserves keys that are present in **both** `DataFrame`s.
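-
-One quick way to see which keys an inner merge will silently drop is to compare the two indexes directly; a small sketch:
-
-```python
-# Index values in the TB table with no exact match in the 2010s census table
-tb_df.index.difference(census_2010s_df.index)
-```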
-
-The rolled records in our census `DataFrame` have different `Geographic Area` fields, which was the key we merged on:
-
-```{python}
-#| code-fold: false
-census_2010s_df.head(5)
-```
-
-The Census `DataFrame` has several rolled records. The aggregate record we are looking for actually has the Geographic Area named "United States".
-
-One straightforward way to get the right merge is to rename the value itself. Because we now have the Geographic Area index, we'll use `df.rename()` ([documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.rename.html)):
-
-```{python}
-#| code-fold: false
-# rename rolled record for 2010s
-census_2010s_df.rename(index={'United States':'Total'}, inplace=True)
-census_2010s_df.head(5)
-```
-
-```{python}
-#| code-fold: false
-# same, but for 2020s rename rolled record
-census_2020s_df.rename(index={'United States':'Total'}, inplace=True)
-census_2020s_df.head(5)
-```
-
-<br/>
-
-Next let's rerun our merge. Note the different chaining, because we are now merging on indexes (`df.merge()` [documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.merge.html)).
-
-```{python}
-#| code-fold: false
-tb_census_df = (
- tb_df
- .merge(right=census_2010s_df[["2019"]],
- left_index=True, right_index=True)
- .merge(right=census_2020s_df[["2020", "2021"]],
- left_index=True, right_index=True)
-)
-tb_census_df.head(5)
-```
-
-<br/>
-
-Finally, let's recompute our incidences:
-
-```{python}
-#| code-fold: false
-# recompute incidence for all years
-for year in [2019, 2020, 2021]:
- tb_census_df[f"recompute incidence {year}"] = tb_census_df[f"TB cases {year}"]/tb_census_df[f"{year}"]*100000
-tb_census_df.head(5)
-```
-
-We reproduced the total U.S. incidences correctly!
-
-We're almost there. Let's revisit the quote:
-
-> Reported TB incidence (cases per 100,000 persons) increased **9.4%**, from **2.2** during 2020 to **2.4** during 2021 but was lower than incidence during 2019 (2.7). Increases occurred among both U.S.-born and non–U.S.-born persons.
-
-Recall that percent change from $A$ to $B$ is computed as
-$\text{percent change} = \frac{B - A}{A} \times 100$.
-
-```{python}
-#| code-fold: false
-#| tags: []
-incidence_2020 = tb_census_df.loc['Total', 'recompute incidence 2020']
-incidence_2020
-```
-
-```{python}
-#| code-fold: false
-#| tags: []
-incidence_2021 = tb_census_df.loc['Total', 'recompute incidence 2021']
-incidence_2021
-```
-
-```{python}
-#| code-fold: false
-#| tags: []
-difference = (incidence_2021 - incidence_2020)/incidence_2020 * 100
-difference
-```
-
-# EDA Demo 2: Mauna Loa CO<sub>2</sub> Data -- A Lesson in Data Faithfulness
-
-[Mauna Loa Observatory](https://gml.noaa.gov/ccgg/trends/data.html) has been monitoring CO<sub>2</sub> concentrations since 1958.
-
-```{python}
-#| code-fold: false
-co2_file = "data/co2_mm_mlo.txt"
-```
-
-Let's do some **EDA**!!
-
-## Reading this file into Pandas?
-Let's instead check out this `.txt` file. Some questions to keep in mind: Do we trust this file extension? What structure is it?
-
-Lines 71-78 (inclusive) are shown below:
-
- line number | file contents
-
- 71 | # decimal average interpolated trend #days
- 72 | # date (season corr)
- 73 | 1958 3 1958.208 315.71 315.71 314.62 -1
- 74 | 1958 4 1958.292 317.45 317.45 315.29 -1
- 75 | 1958 5 1958.375 317.50 317.50 314.71 -1
- 76 | 1958 6 1958.458 -99.99 317.10 314.85 -1
- 77 | 1958 7 1958.542 315.86 315.86 314.98 -1
- 78 | 1958 8 1958.625 314.93 314.93 315.94 -1
-
-
-Notice how:
-
-- The values are separated by white space, possibly tabs.
-- The data values are aligned in fixed-width columns down the rows. For example, the month appears in the 7th to 8th position of each line.
-- The 71st and 72nd lines in the file contain column headings split over two lines.
-
-We can use `read_csv` to read the data into a `pandas` `DataFrame`, and we provide several arguments to specify that the separators are white space, there is no header (**we will set our own column names later**), and to skip the first 72 rows of the file.
-
-```{python}
-#| code-fold: false
-co2 = pd.read_csv(
- co2_file, header = None, skiprows = 72,
-    sep = r'\s+'  # delimiter for continuous whitespace (stay tuned for regex next lecture)
-)
-co2.head()
-```
-
-Congratulations! You've wrangled the data!
-
-<br/>
-
-...But our columns aren't named.
-**We need to do more EDA.**
-
-## Exploring Variable Feature Types
-
-The NOAA [webpage](https://gml.noaa.gov/ccgg/trends/) might have some useful tidbits (in this case it doesn't).
-
-Using this information, we'll rerun `pd.read_csv`, but this time with some **custom column names.**
-
-```{python}
-#| code-fold: false
-co2 = pd.read_csv(
- co2_file, header = None, skiprows = 72,
-    sep = r'\s+',  # regex for continuous whitespace (next lecture)
- names = ['Yr', 'Mo', 'DecDate', 'Avg', 'Int', 'Trend', 'Days']
-)
-co2.head()
-```
-
-## Visualizing CO<sub>2</sub>
-Scientific studies tend to have very clean data, right...? Let's jump right in and make a time series plot of CO2 monthly averages.
-
-```{python}
-#| code-fold: true
-sns.lineplot(x='DecDate', y='Avg', data=co2);
-```
-
-The code above uses the `seaborn` plotting library (abbreviated `sns`). We will cover this in the Visualization lecture; for now, you don't need to worry about how it works!
-
-Yikes! Plotting the data uncovered a problem. The sharp vertical lines suggest that we have some **missing values**. What happened here?
-
-```{python}
-#| code-fold: false
-co2.head()
-```
-
-```{python}
-#| code-fold: false
-co2.tail()
-```
-
-Some data have unusual values like -1 and -99.99.
-
-Let's check the description at the top of the file again.
-
-* -1 signifies a missing value for the number of days `Days` the equipment was in operation that month.
-* -99.99 denotes a missing monthly average `Avg`
-
-How can we fix this? First, let's explore other aspects of our data. Understanding our data will help us decide what to do with the missing values.
-
-<br/>
-
-
-## Sanity Checks: Reasoning about the data
-First, we consider the shape of the data. How many rows should we have?
-
-* If the data are in chronological order, we should have one record per month.
-* Data from March 1958 to August 2019.
-* We should have $ 12 \times (2019-1957) - 2 - 4 = 738 $ records.
-
-```{python}
-#| code-fold: false
-co2.shape
-```
-
-Nice!! The number of rows (i.e., records) matches our expectations.
-
-<br/>
-
-
-Let's now check the quality of each feature.
-
-## Understanding Missing Value 1: `Days`
-`Days` is a time field, so let's analyze other time fields to see if there is an explanation for missing values of days of operation.
-
-Let's start with **months**, `Mo`.
-
-Are we missing any records? Each month should appear 61 or 62 times (the data span March 1958 to August 2019).
-
-```{python}
-#| code-fold: false
-co2["Mo"].value_counts().sort_index()
-```
-
-As expected, Jan, Feb, Sep, Oct, Nov, and Dec have 61 occurrences, and the rest have 62.
-
-<br/>
-
-Next let's explore **days** `Days` itself, which is the number of days that the measurement equipment worked.
-
-```{python}
-#| code-fold: true
-sns.displot(co2['Days']);
-plt.title("Distribution of days feature"); # suppresses unneeded plotting output
-```
-
-In terms of data quality, a handful of months have averages based on measurements taken on fewer than half the days. In addition, there are nearly 200 missing values--**that's about 27% of the data**!
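-
-(As a quick sketch, we could verify that figure directly by computing the fraction of months whose `Days` value is the missing-value code:)
-
-```python
-# Fraction of records where Days is coded as missing (-1)
-(co2["Days"] == -1).mean()
-```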
-
-<br/>
-
-Finally, let's check the last time feature, **year** `Yr`.
-
-Let's check to see if there is any connection between missing-ness and the year of the recording.
-
-```{python}
-#| code-fold: true
-sns.scatterplot(x="Yr", y="Days", data=co2);
-plt.title("Day field by Year"); # the ; suppresses output
-```
-
-**Observations**:
-
-* All of the missing data are in the early years of operation.
-* It appears there may have been problems with equipment in the mid to late 80s.
-
-**Potential Next Steps**:
-
-* Confirm these explanations through documentation about the historical readings.
-* Maybe drop earliest recordings? However, we would want to delay such action until after we have examined the time trends and assess whether there are any potential problems.
-
-<br/>
-
-## Understanding Missing Value 2: `Avg`
-Next, let's return to the -99.99 values in `Avg` to analyze the overall quality of the CO<sub>2</sub> measurements. We'll plot a histogram of the average CO<sub>2</sub> measurements.
-
-```{python}
-#| code-fold: true
-# Histograms of average CO2 measurements
-sns.displot(co2['Avg']);
-```
-
-The non-missing values are in the 300-400 range (a regular range of CO2 levels).
-
-We also see that there are only a few missing `Avg` values (**<1% of values**). Let's examine all of them:
-
-```{python}
-#| code-fold: false
-co2[co2["Avg"] < 0]
-```
-
-There doesn't seem to be a pattern to these values, other than that most records also were missing `Days` data.
-
-## Drop, `NaN`, or Impute Missing `Avg` Data?
-
-How should we address the invalid `Avg` data?
-
-1. Drop records
-2. Set to NaN
-3. Impute using some strategy
-
-Remember we want to fix the following plot:
-
-```{python}
-#| code-fold: true
-sns.lineplot(x='DecDate', y='Avg', data=co2)
-plt.title("CO2 Average By Month");
-```
-
-Since we are plotting `Avg` vs `DecDate`, we should just focus on dealing with missing values for `Avg`.
-
-
-Let's consider a few options:
-1. Drop those records
-2. Replace -99.99 with NaN
-3. Substitute it with a likely value for the average CO2?
-
-What do you think are the pros and cons of each possible action?
-
-<br/>
-
-
-Let's examine each of these three options.
-
-```{python}
-#| code-fold: false
-# 1. Drop missing values
-co2_drop = co2[co2['Avg'] > 0]
-co2_drop.head()
-```
-
-```{python}
-#| code-fold: false
-# 2. Replace -99.99 with NaN
-co2_NA = co2.replace(-99.99, np.nan)
-co2_NA.head()
-```
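-
-(Equivalently, these sentinel codes could have been marked as missing at load time. A sketch using `read_csv`'s `na_values` parameter:)
-
-```python
-# Sketch: treat the sentinel codes as NaN while reading the file.
-co2_NA_alt = pd.read_csv(
-    co2_file, header=None, skiprows=72, sep=r'\s+',
-    names=['Yr', 'Mo', 'DecDate', 'Avg', 'Int', 'Trend', 'Days'],
-    na_values={'Avg': [-99.99], 'Days': [-1]}
-)
-```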
-
-We'll also use a third version of the data.
-
-First, we note that the dataset already comes with a **substitute value** for the -99.99.
-
-From the file description:
-
-> The `interpolated` column includes average values from the preceding column (`average`)
-and **interpolated values** where data are missing. Interpolated values are
-computed in two steps...
-
-The `Int` feature has values that exactly match those in `Avg`, except when `Avg` is -99.99, and then a **reasonable** estimate is used instead.
-
-So, the third version of our data will use the `Int` feature instead of `Avg`.
-
-```{python}
-#| code-fold: false
-# 3. Use interpolated column which estimates missing Avg values
-co2_impute = co2.copy()
-co2_impute['Avg'] = co2['Int']
-co2_impute.head()
-```
-
-What's a **reasonable** estimate?
-
-To answer this question, let's zoom in on a short time period, say the measurements in 1958 (where we know we have two missing values).
-
-```{python}
-#| code-fold: true
-# results of plotting data in 1958
-
-def line_and_points(data, ax, title):
- # assumes single year, hence Mo
- ax.plot('Mo', 'Avg', data=data)
- ax.scatter('Mo', 'Avg', data=data)
- ax.set_xlim(2, 13)
- ax.set_title(title)
- ax.set_xticks(np.arange(3, 13))
-
-def data_year(data, year):
-    return data[data["Yr"] == year]
-
-# uses matplotlib subplots
-# you may see more next week; focus on output for now
-fig, axes = plt.subplots(ncols = 3, figsize=(12, 4), sharey=True)
-
-year = 1958
-line_and_points(data_year(co2_drop, year), axes[0], title="1. Drop Missing")
-line_and_points(data_year(co2_NA, year), axes[1], title="2. Missing Set to NaN")
-line_and_points(data_year(co2_impute, year), axes[2], title="3. Missing Interpolated")
-
-fig.suptitle(f"Monthly Averages for {year}")
-plt.tight_layout()
-```
-
-In the big picture, since there are only 7 `Avg` values missing (**<1%** of 738 months), any of these approaches would work.
-
-However, there is some appeal to **option 3, imputing**:
-
-* It shows the seasonal trends in CO<sub>2</sub>.
-* We are plotting all months in our data as a line plot.
-
-<br/>
-
-
-Let's replot our original figure with option 3:
-
-```{python}
-#| code-fold: true
-sns.lineplot(x='DecDate', y='Avg', data=co2_impute)
-plt.title("CO2 Average By Month, Imputed");
-```
-
-Looks pretty close to what we see on the NOAA [website](https://gml.noaa.gov/ccgg/trends/)!
-
-## Presenting the data: A Discussion on Data Granularity
-
-From the description:
-
-* Monthly measurements are averages of daily average measurements.
-* The NOAA GML website has datasets for daily/hourly measurements too.
-
-The data you present depends on your research question.
-
-**How do CO2 levels vary by season?**
-
-* You might want to keep average monthly data.
-
-**Are CO2 levels rising over the past 50+ years, consistent with global warming predictions?**
-
-* You might be happier with a **coarser granularity** of average year data!
-
-```{python}
-#| code-fold: true
-co2_year = co2_impute.groupby('Yr').mean()
-sns.lineplot(x='Yr', y='Avg', data=co2_year)
-plt.title("CO2 Average By Year");
-```
-
-Indeed, we see a rise of nearly 100 ppm in CO<sub>2</sub> since Mauna Loa began recording in 1958.
-
-# Summary
-We went over a lot of content this lecture; let's summarize the most important points:
-
-## Dealing with Missing Values
-There are a few options we can take to deal with missing data:
-
-* Drop missing records
-* Keep `NaN` missing values
-* Impute using an interpolated column
-
-## EDA and Data Wrangling
-There are several ways to approach EDA and Data Wrangling:
-
-* Examine the **data and metadata**: what is the date, size, organization, and structure of the data?
-* Examine each **field/attribute/dimension** individually.
-* Examine pairs of related dimensions (e.g. breaking down grades by major).
-* Along the way, we can:
- * **Visualize** or summarize the data.
- * **Validate assumptions** about data and its collection process. Pay particular attention to when the data was collected.
- * Identify and **address anomalies**.
- * Apply data transformations and corrections (we'll cover this in the upcoming lecture).
- * **Record everything you do!** Developing in Jupyter Notebook promotes *reproducibility* of your own work!
+---
+title: Data Cleaning and EDA
+execute:
+ echo: true
+format:
+ html:
+ code-fold: true
+ code-tools: true
+ toc: true
+ toc-title: Data Cleaning and EDA
+ page-layout: full
+ theme:
+ - cosmo
+ - cerulean
+ callout-icon: false
+jupyter: python3
+---
+
+```{python}
+#| code-fold: true
+import numpy as np
+import pandas as pd
+
+import matplotlib.pyplot as plt
+import seaborn as sns
+#%matplotlib inline
+plt.rcParams['figure.figsize'] = (12, 9)
+
+sns.set()
+sns.set_context('talk')
+np.set_printoptions(threshold=20, precision=2, suppress=True)
+pd.set_option('display.max_rows', 30)
+pd.set_option('display.max_columns', None)
+pd.set_option('display.precision', 2)
+# This option stops scientific notation for pandas
+pd.set_option('display.float_format', '{:.2f}'.format)
+
+# Silence some spurious seaborn warnings
+import warnings
+warnings.filterwarnings("ignore", category=FutureWarning)
+```
+
+::: {.callout-note collapse="false"}
+## Learning Outcomes
+* Recognize common file formats
+* Categorize data by its variable type
+* Build awareness of issues with data faithfulness and develop targeted solutions
+:::
+
+**This content is covered in lectures 4, 5, and 6.**
+
+In the past few lectures, we've learned that `pandas` is a toolkit to restructure, modify, and explore a dataset. What we haven't yet touched on is *how* to make these data transformation decisions. When we receive a new set of data from the "real world," how do we know what processing we should do to convert this data into a usable form?
+
+**Data cleaning**, also called **data wrangling**, is the process of transforming raw data to facilitate subsequent analysis. It is often used to address issues like:
+
+* Unclear structure or formatting
+* Missing or corrupted values
+* Unit conversions
+* ...and so on
+
+**Exploratory Data Analysis (EDA)** is the process of understanding a new dataset. It is an open-ended, informal analysis that involves familiarizing ourselves with the variables present in the data, discovering potential hypotheses, and identifying possible issues with the data. This last point can often motivate further data cleaning to address any problems with the dataset's format; because of this, EDA and data cleaning are often thought of as an "infinite loop," with each process driving the other.
+
+In this lecture, we will consider the key properties of data to consider when performing data cleaning and EDA. In doing so, we'll develop a "checklist" of sorts for you to consider when approaching a new dataset. Throughout this process, we'll build a deeper understanding of this early (but very important!) stage of the data science lifecycle.
+
+## Structure
+
+### File Formats
+There are many file types for storing structured data: TSV, JSON, XML, ASCII, SAS, etc. We'll only cover CSV, TSV, and JSON in lecture, but you'll likely encounter other formats as you work with different datasets. Reading documentation is your best bet for understanding how to process the multitude of different file types.
+
+#### CSV
+CSVs, which stand for **Comma-Separated Values**, are a common tabular data format.
+In the past two `pandas` lectures, we briefly touched on the idea of file format: the way data is encoded in a file for storage. Specifically, our `elections` and `babynames` datasets were stored and loaded as CSVs:
+
+```{python}
+#| code-fold: false
+pd.read_csv("data/elections.csv").head(5)
+```
+
+To better understand the properties of a CSV, let's take a look at the first few rows of the raw data file to see what it looks like before being loaded into a `DataFrame`. We'll use the `repr()` function to return the raw string with its special characters:
+
+```{python}
+#| code-fold: false
+with open("data/elections.csv", "r") as table:
+ i = 0
+ for row in table:
+ print(repr(row))
+ i += 1
+ if i > 3:
+ break
+```
+
+Each row, or **record**, in the data is delimited by a newline `\n`. Each column, or **field**, in the data is delimited by a comma `,` (hence, comma-separated!).
+
+#### TSV
+
+Another common file type is **TSV (Tab-Separated Values)**. In a TSV, records are still delimited by a newline `\n`, while fields are delimited by the tab character `\t`.
+
+Let's check out the first few rows of the raw TSV file. Again, we'll use the `repr()` function so that `print` shows the special characters.
+
+```{python}
+#| code-fold: false
+with open("data/elections.txt", "r") as table:
+ i = 0
+ for row in table:
+ print(repr(row))
+ i += 1
+ if i > 3:
+ break
+```
+
+TSVs can be loaded into `pandas` using `pd.read_csv`. We'll need to specify the **delimiter** with the parameter `sep='\t'` [(documentation)](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html).
+
+```{python}
+#| code-fold: false
+pd.read_csv("data/elections.txt", sep='\t').head(3)
+```
+
+An issue with CSVs and TSVs comes up whenever there are commas or tabs within the records. How does `pandas` differentiate between a comma delimiter vs. a comma within the field itself, for example `8,900`? To remedy this, check out the [`quotechar` parameter](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html).
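+
+For example, a small sketch of how quoting protects an embedded comma (`'"'` is already the default `quotechar`):
+
+```python
+import io
+
+raw = 'Candidate,Votes\n"Smith, John","8,900"\n'
+pd.read_csv(io.StringIO(raw), quotechar='"', thousands=',')
+```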
+
+#### JSON
+**JSON (JavaScript Object Notation)** files behave similarly to Python dictionaries. A raw JSON is shown below.
+
+```{python}
+#| code-fold: false
+with open("data/elections.json", "r") as table:
+ i = 0
+ for row in table:
+ print(row)
+ i += 1
+ if i > 8:
+ break
+```
+
+JSON files can be loaded into `pandas` using `pd.read_json`.
+
+```{python}
+#| code-fold: false
+pd.read_json('data/elections.json').head(3)
+```
+
+##### EDA with JSON: Berkeley COVID-19 Data
+The City of Berkeley Open Data [website](https://data.cityofberkeley.info/Health/COVID-19-Confirmed-Cases/xn6j-b766) has a dataset with COVID-19 Confirmed Cases among Berkeley residents by date. Let's download the file and save it as a JSON (note that the source URL file type is also JSON). In the interest of reproducible data science, we will download the data programmatically. We have defined some helper functions in the [`ds100_utils.py`](https://ds100.org/fa23/resources/assets/lectures/lec05/lec05-eda.html) file that we can reuse in many different notebooks.
+
+```{python}
+#| code-fold: false
+from ds100_utils import fetch_and_cache
+
+covid_file = fetch_and_cache(
+ "https://data.cityofberkeley.info/api/views/xn6j-b766/rows.json?accessType=DOWNLOAD",
+ "confirmed-cases.json",
+ force=False)
+covid_file # a file path wrapper object
+```
+
+###### File Size
+Let's start our analysis by getting a rough estimate of the size of the dataset to inform the tools we use to view the data. For relatively small datasets, we can use a text editor or spreadsheet. For larger datasets, more programmatic exploration or distributed computing tools may be more fitting. Here we will use `Python` tools to probe the file.
+
+Since this appears to be a text file, let's investigate the number of lines, which often corresponds to the number of records:
+
+```{python}
+#| code-fold: false
+import os
+
+print(covid_file, "is", os.path.getsize(covid_file) / 1e6, "MB")
+
+with open(covid_file, "r") as f:
+ print(covid_file, "is", sum(1 for l in f), "lines.")
+```
+
+###### Unix Commands
+As part of the EDA workflow, Unix commands can come in very handy. In fact, there's an entire book called ["Data Science at the Command Line"](https://datascienceatthecommandline.com/) that explores this idea in depth!
+In Jupyter/IPython, you can prefix lines with `!` to execute arbitrary Unix commands, and within those lines, you can refer to `Python` variables and expressions with the syntax `{expr}`.
+
+Here, we use the `ls` command to list files, using the `-lh` flags, which request "long format with information in human-readable form." We also use the `wc` command for "word count," but with the `-l` flag, which asks for line counts instead of words.
+
+These two give us the same information as the code above, albeit in a slightly different form:
+
+```{python}
+#| code-fold: false
+!ls -lh {covid_file}
+!wc -l {covid_file}
+```
+
+###### File Contents
+Let's explore the data format using `Python`.
+
+```{python}
+#| code-fold: false
+with open(covid_file, "r") as f:
+ for i, row in enumerate(f):
+ print(repr(row)) # print raw strings
+ if i >= 4: break
+```
+
+We can use the `head` Unix command (which is where `pandas`' `head` method comes from!) to see the first few lines of the file:
+
+```{python}
+#| code-fold: false
+!head -5 {covid_file}
+```
+
+In order to load the JSON file into `pandas`, let's first do some EDA with `Python`'s `json` package to understand the particular structure of this JSON file so that we can decide what (if anything) to load into `pandas`. `Python` has relatively good support for JSON data since it closely matches the internal `Python` object model. In the following cell, we import the entire JSON datafile into a `Python` dictionary using the `json` package.
+
+```{python}
+#| code-fold: false
+import json
+
+with open(covid_file, "rb") as f:
+ covid_json = json.load(f)
+```
+
+The `covid_json` variable is now a dictionary encoding the data in the file:
+
+```{python}
+#| code-fold: false
+type(covid_json)
+```
+
+We can examine the keys in the top-level JSON object by listing them out.
+
+```{python}
+#| code-fold: false
+covid_json.keys()
+```
+
+**Observation**: The JSON dictionary contains a `meta` key, which likely refers to metadata (data about the data). Metadata is often maintained with the data and can be a good source of additional information.
+
+
+We can investigate the meta data further by examining the keys associated with the metadata.
+
+```{python}
+#| code-fold: false
+covid_json['meta'].keys()
+```
+
+The `meta` key contains another dictionary called `view`. This likely refers to meta-data about a particular "view" of some underlying database. We will learn more about views when we study SQL later in the class.
+
+```{python}
+#| code-fold: false
+covid_json['meta']['view'].keys()
+```
+
+Notice that this is a nested/recursive data structure. As we dig deeper, we reveal more and more keys and the corresponding data:
+
+```
+meta
+|-> data
+ | ... (haven't explored yet)
+|-> view
+ | -> id
+ | -> name
+ | -> attribution
+ ...
+ | -> description
+ ...
+ | -> columns
+ ...
+```
+
+
+There is a key called `description` in the `view` sub-dictionary. This likely contains a description of the data:
+
+```{python}
+#| code-fold: false
+print(covid_json['meta']['view']['description'])
+```
+
+###### Examining the Data Field for Records
+
+We can look at a few entries in the `data` field. This is what we'll load into `pandas`.
+
+```{python}
+#| code-fold: false
+for i in range(3):
+ print(f"{i:03} | {covid_json['data'][i]}")
+```
+
+Observations:
+* These look like equal-length records, so maybe `data` is a table!
+* But what does each value in the record mean? Where can we find the column headers?
+
+For that, we'll need the `columns` key in the metadata dictionary. This returns a list:
+
+```{python}
+#| code-fold: false
+type(covid_json['meta']['view']['columns'])
+```
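+
+Each element of that list describes one column. A quick sketch to peek at the first few column names (we already know each descriptor has a `name` field, since we use it when loading the data below):
+
+```python
+# Names of the first five column descriptors
+[c['name'] for c in covid_json['meta']['view']['columns'][:5]]
+```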
+
+###### Summary of exploring the JSON file
+
+1. The above **metadata** tells us a lot about the columns in the data including column names, potential data anomalies, and a basic statistic.
+1. Because of its non-tabular structure, JSON makes it easier (than CSV) to create **self-documenting data**, meaning that information about the data is stored in the same file as the data.
+1. Self-documenting data can be helpful since it maintains its own description and these descriptions are more likely to be updated as data changes.
+
+###### Loading COVID Data into `pandas`
+Finally, let's load the data (not the metadata) into a `pandas` `DataFrame`. In the following block of code we:
+
+1. Translate the JSON records into a `DataFrame`:
+
+ * fields: `covid_json['meta']['view']['columns']`
+ * records: `covid_json['data']`
+
+
+1. Remove columns that have no metadata description. This would be a bad idea in general, but here we remove these columns since the above analysis suggests they are unlikely to contain useful information.
+
+1. Examine the `tail` of the table.
+
+```{python}
+#| code-fold: false
+# Load the data from JSON and assign column titles
+covid = pd.DataFrame(
+ covid_json['data'],
+ columns=[c['name'] for c in covid_json['meta']['view']['columns']])
+
+covid.tail()
+```
+
+### Variable Types
+
+After loading data from a file, it's a good idea to take the time to understand what pieces of information are encoded in the dataset. In particular, we want to identify what variable types are present in our data. Broadly speaking, we can categorize variables into one of two overarching types.
+
+**Quantitative variables** describe some numeric quantity or amount. We can divide quantitative data further into:
+
+* **Continuous quantitative variables**: numeric data that can be measured on a continuous scale to arbitrary precision. Continuous variables do not have a strict set of possible values – they can be recorded to any number of decimal places. For example, weights, GPA, or CO<sub>2</sub> concentrations.
+* **Discrete quantitative variables**: numeric data that can only take on a finite set of possible values. For example, someone's age or the number of siblings they have.
+
+**Qualitative variables**, also known as **categorical variables**, describe data that isn't measuring some quantity or amount. The sub-categories of categorical data are:
+
+* **Ordinal qualitative variables**: categories with ordered levels. Specifically, ordinal variables are those where the difference between levels has no consistent, quantifiable meaning. Some examples include levels of education (high school, undergrad, grad, etc.), income bracket (low, medium, high), or Yelp rating.
+* **Nominal qualitative variables**: categories with no specific order. For example, someone's political affiliation or Cal ID number.
+
+![Classification of variable types](images/variable.png)
+
+Note that many variables don't sit neatly in just one of these categories. Qualitative variables could have numeric levels, and conversely, quantitative variables could be stored as strings.
+
+### Primary and Foreign Keys
+
+Last time, we introduced `.merge` as the `pandas` method for joining multiple `DataFrame`s together. In our discussion of joins, we touched on the idea of using a "key" to determine what rows should be merged from each table. Let's take a moment to examine this idea more closely.
+
+The **primary key** is the column or set of columns in a table that *uniquely* determine the values of the remaining columns. It can be thought of as the unique identifier for each individual row in the table. For example, a table of Data 100 students might use each student's Cal ID as the primary key.
+
+```{python}
+#| echo: false
+pd.DataFrame({"Cal ID":[3034619471, 3035619472, 3025619473, 3046789372], \
+ "Name":["Oski", "Ollie", "Orrie", "Ollie"], \
+ "Major":["Data Science", "Computer Science", "Data Science", "Economics"]})
+```
+
+The **foreign key** is the column or set of columns in a table that reference primary keys in other tables. Knowing a dataset's foreign keys can be useful when assigning the `left_on` and `right_on` parameters of `.merge`. In the table of office hour tickets below, `"Cal ID"` is a foreign key referencing the previous table.
+
+```{python}
+#| echo: false
+pd.DataFrame({"OH Request":[1, 2, 3, 4], \
+ "Cal ID":[3034619471, 3035619472, 3025619473, 3035619472], \
+ "Question":["HW 2 Q1", "HW 2 Q3", "Lab 3 Q4", "HW 2 Q7"]})
+```
+
+## Granularity, Scope, and Temporality
+
+After understanding the structure of the dataset, the next task is to determine what exactly the data represents. We'll do so by considering the data's granularity, scope, and temporality.
+
+### Granularity
+The **granularity** of a dataset is what a single row represents. You can also think of it as the level of detail included in the data. To determine the data's granularity, ask: what does each row in the dataset represent? Fine-grained data contains a high level of detail, with a single row representing a small individual unit. For example, each record may represent one person. Coarse-grained data is encoded such that a single row represents a large individual unit – for example, each record may represent a group of people.
+
+### Scope
+The **scope** of a dataset is the subset of the population covered by the data. If we were investigating student performance in Data Science courses, a dataset with a narrow scope might encompass all students enrolled in Data 100 whereas a dataset with an expansive scope might encompass all students in California.
+
+### Temporality
+The **temporality** of a dataset describes the periodicity over which the data was collected as well as when the data was most recently collected or updated.
+
+Time and date fields of a dataset could represent a few things:
+
+1. when the "event" happened
+2. when the data was collected, or when it was entered into the system
+3. when the data was copied into the database
+
+To fully understand the temporality of the data, it also may be necessary to standardize time zones or inspect recurring time-based trends in the data (do patterns recur in 24-hour periods? Over the course of a month? Seasonally?). The convention for standardizing time is Coordinated Universal Time (UTC), an international time standard measured at 0 degrees longitude that stays consistent throughout the year (no daylight saving time). Berkeley's time zone, Pacific Standard Time (PST), is UTC-8; during daylight saving time, Pacific Daylight Time (PDT) is UTC-7.
+
+#### Temporality with `pandas`' `dt` accessors
+Let's briefly look at how we can use `pandas`' `dt` accessors to work with dates/times in a dataset using the dataset you'll see in Lab 3: the Berkeley PD Calls for Service dataset.
+
+```{python}
+#| code-fold: true
+calls = pd.read_csv("data/Berkeley_PD_-_Calls_for_Service.csv")
+calls.head()
+```
+
+Looks like there are three columns with dates/times: `EVENTDT`, `EVENTTM`, and `InDbDate`.
+
+Most likely, `EVENTDT` stands for the date when the event took place, `EVENTTM` stands for the time of day the event took place (in 24-hr format), and `InDbDate` is the date this call is recorded onto the database.
+
+If we check the data type of these columns, we will see they are stored as strings. We can convert them to `datetime` objects using pandas `to_datetime` function.
+
+```{python}
+#| code-fold: false
+calls["EVENTDT"] = pd.to_datetime(calls["EVENTDT"])
+calls.head()
+```
+
+Now, we can use the `dt` accessor on this column.
+
+We can get the month:
+
+```{python}
+#| code-fold: false
+calls["EVENTDT"].dt.month.head()
+```
+
+Which day of the week the date is on:
+
+```{python}
+#| code-fold: false
+calls["EVENTDT"].dt.dayofweek.head()
+```
+
+Check the minimum values to see if there are any suspicious-looking dates from the 1970s:
+
+```{python}
+#| code-fold: false
+calls.sort_values("EVENTDT").head()
+```
+
+Doesn't look like it! We are good!
+
+
+We can also do many things with the `dt` accessor like switching time zones and converting time back to UNIX/POSIX time. Check out the documentation on [`.dt` accessor](https://pandas.pydata.org/docs/user_guide/basics.html#basics-dt-accessors) and [time series/date functionality](https://pandas.pydata.org/docs/user_guide/timeseries.html#).
+
+## Faithfulness
+
+At this stage in our data cleaning and EDA workflow, we've achieved quite a lot: we've identified how our data is structured, come to terms with what information it encodes, and gained insight as to how it was generated. Throughout this process, we should always recall the original intent of our work in Data Science – to use data to better understand and model the real world. To achieve this goal, we need to ensure that the data we use is faithful to reality; that is, that our data accurately captures the "real world."
+
+Data used in research or industry is often "messy" – there may be errors or inaccuracies that impact the faithfulness of the dataset. Signs that data may not be faithful include:
+
+* Unrealistic or "incorrect" values, such as negative counts, locations that don't exist, or dates set in the future
+* Violations of obvious dependencies, like an age that does not match a birthday
+* Clear signs that data was entered by hand, which can lead to spelling errors or fields that are incorrectly shifted
+* Signs of data falsification, such as fake email addresses or repeated use of the same names
+* Duplicated records or fields containing the same information
+* Truncated data, e.g. older versions of Microsoft Excel limited spreadsheets to 65,536 rows and 256 columns
+
+We often solve some of these more common issues in the following ways:
+
+* Spelling errors: apply corrections or drop records that aren't in a dictionary
+* Time zone inconsistencies: convert to a common time zone (e.g. UTC)
+* Duplicated records or fields: identify and eliminate duplicates (using primary keys)
+* Unspecified or inconsistent units: infer the units and check that values are in reasonable ranges in the data
+
+### Missing Values
+Another common issue encountered with real-world datasets is that of missing data. One strategy to resolve this is to simply drop any records with missing values from the dataset. This does, however, introduce the risk of inducing biases – it is possible that the missing or corrupt records may be systemically related to some feature of interest in the data. Another solution is to keep the data as `NaN` values.
+
+A third method to address missing data is to perform **imputation**: infer the missing values using other data available in the dataset. There is a wide variety of imputation techniques that can be implemented; some of the most common are listed below.
+
+* Average imputation: replace missing values with the average value for that field
+* Hot deck imputation: replace missing values with some random value
+* Regression imputation: develop a model to predict missing values
+* Multiple imputation: replace missing values with multiple random values
+
+Regardless of the strategy used to deal with missing data, we should think carefully about *why* particular records or fields may be missing – this can help inform whether or not the absence of these values is significant or meaningful.
+
+# EDA Demo 1: Tuberculosis in the United States
+
+Now, let's walk through the data-cleaning and EDA workflow to see what we can learn about the presence of tuberculosis in the United States!
+
+We will examine the data included in the [original CDC article](https://www.cdc.gov/mmwr/volumes/71/wr/mm7112a1.htm?s_cid=mm7112a1_w#T1_down) published in 2021.
+
+
+## CSVs and Field Names
+Suppose Table 1 was saved as a CSV file located in `data/cdc_tuberculosis.csv`.
+
+We can then explore the CSV (which is a text file, and does not contain binary-encoded data) in many ways:
+1. Using a text editor like emacs, vim, VSCode, etc.
+2. Opening the CSV directly in DataHub (read-only), Excel, Google Sheets, etc.
+3. The `Python` file object
+4. `pandas`, using `pd.read_csv()`
+
+To try out options 1 and 2, you can view or download the Tuberculosis dataset from the [lecture demo notebook](https://data100.datahub.berkeley.edu/hub/user-redirect/git-pull?repo=https%3A%2F%2Fgithub.com%2FDS-100%2Ffa23-student&urlpath=lab%2Ftree%2Ffa23-student%2Flecture%2Flec05%2Flec04-eda.ipynb&branch=main) under the `data` folder in the left hand menu. Notice how the CSV file is a type of **rectangular data (i.e., tabular data) stored as comma-separated values**.
+
+Next, let's try out option 3 using the `Python` file object. We'll look at the first four lines:
+
+```{python}
+#| code-fold: true
+with open("data/cdc_tuberculosis.csv", "r") as f:
+ i = 0
+ for row in f:
+ print(row)
+ i += 1
+ if i > 3:
+ break
+```
+
+Whoa, why are there blank lines interspersed between the lines of the CSV?
+
+You may recall that all line breaks in text files are encoded as the special newline character `\n`. Python's `print()` prints each string (including its newline character) and then adds another newline of its own.
+
+If you're curious, we can use the `repr()` function to return the raw string with all special characters:
+
+```{python}
+#| code-fold: true
+with open("data/cdc_tuberculosis.csv", "r") as f:
+ i = 0
+ for row in f:
+ print(repr(row)) # print raw strings
+ i += 1
+ if i > 3:
+ break
+```
+
+Finally, let's try option 4 and use the tried-and-true Data 100 approach: `pandas`.
+
+```{python}
+#| code-fold: false
+tb_df = pd.read_csv("data/cdc_tuberculosis.csv")
+tb_df.head()
+```
+
+You may notice some strange things about this table: what's up with the "Unnamed" column names and the first row?
+
+Congratulations — you're ready to wrangle your data! Because of how things are stored, we'll need to clean the data a bit to name our columns better.
+
+A reasonable first step is to identify the row with the right header. The `pd.read_csv()` function ([documentation](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html)) has the convenient `header` parameter that we can set to use the elements in row 1 as the appropriate columns:
+
+```{python}
+#| code-fold: false
+tb_df = pd.read_csv("data/cdc_tuberculosis.csv", header=1) # row index
+tb_df.head(5)
+```
+
+Wait...but now we can't differentiate between the "Number of TB cases" and "TB incidence" year columns. `pandas` has tried to make our lives easier by automatically adding ".1" to the latter columns, but this doesn't help us, as humans, understand the data.
+
+We can do this manually with `df.rename()` ([documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.rename.html?highlight=rename#pandas.DataFrame.rename)):
+
+```{python}
+#| code-fold: false
+rename_dict = {'2019': 'TB cases 2019',
+ '2020': 'TB cases 2020',
+ '2021': 'TB cases 2021',
+ '2019.1': 'TB incidence 2019',
+ '2020.1': 'TB incidence 2020',
+ '2021.1': 'TB incidence 2021'}
+tb_df = tb_df.rename(columns=rename_dict)
+tb_df.head(5)
+```
+
+## Record Granularity
+
+You might already be wondering: what's up with that first record?
+
+Row 0 is what we call a **rollup record**, or summary record. It's often useful when displaying tables to humans. The **granularity** of record 0 (Totals) vs the rest of the records (States) is different.
+
+Okay, EDA step two. How was the rollup record aggregated?
+
+Let's check if Total TB cases is the sum of all state TB cases. If we sum over all rows, we should get **2x** the total cases in each of the "TB cases" columns for each year (why do you think this is?).
+
+```{python}
+#| code-fold: true
+tb_df.sum(axis=0)
+```
+
+Whoa, what's going on with the TB cases in 2019, 2020, and 2021? Check out the column types:
+
+```{python}
+#| code-fold: true
+tb_df.dtypes
+```
+
+Since there are commas in the values for TB cases, the numbers are read as the `object` datatype, or **storage type** (close to the `Python` string datatype), so `pandas` is concatenating strings instead of adding integers (recall that `Python` can "sum", or concatenate, strings together: `"data" + "100"` evaluates to `"data100"`).
+
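+Before using the built-in fix below, note that we could also convert an already-loaded column by hand. A minimal sketch (not the approach we take; shown only for illustration):
+
+```python
+# strip the thousands separators and cast to integers by hand
+tb_df["TB cases 2019"].str.replace(",", "").astype(int).head()
+```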
+
+Fortunately `read_csv` also has a `thousands` parameter ([documentation](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html)):
+
+```{python}
+#| code-fold: false
+# improve readability: chaining method calls with outer parentheses/line breaks
+tb_df = (
+ pd.read_csv("data/cdc_tuberculosis.csv", header=1, thousands=',')
+ .rename(columns=rename_dict)
+)
+tb_df.head(5)
+```
+
+```{python}
+#| code-fold: false
+tb_df.sum()
+```
+
+The Total TB cases look right. Phew!
+
+Let's just look at the records with **state-level granularity**:
+
+```{python}
+#| code-fold: true
+state_tb_df = tb_df[1:]
+state_tb_df.head(5)
+```
+
+## Gather Census Data
+
+U.S. Census population estimates [source](https://www.census.gov/data/tables/time-series/demo/popest/2010s-state-total.html) (2019), [source](https://www.census.gov/data/tables/time-series/demo/popest/2020s-state-total.html) (2020-2021).
+
+Running the below cells cleans the data.
+There are a few new methods here:
+* `df.convert_dtypes()` ([documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.convert_dtypes.html)) conveniently converts all float dtypes into ints and is out of scope for the class.
+* `df.dropna()` ([documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.dropna.html)) will be explained in more detail next time.
+
+```{python}
+#| code-fold: true
+# 2010s census data
+census_2010s_df = pd.read_csv("data/nst-est2019-01.csv", header=3, thousands=",")
+census_2010s_df = (
+ census_2010s_df
+ .reset_index()
+ .drop(columns=["index", "Census", "Estimates Base"])
+ .rename(columns={"Unnamed: 0": "Geographic Area"})
+ .convert_dtypes() # "smart" converting of columns, use at your own risk
+ .dropna() # we'll introduce this next time
+)
+census_2010s_df['Geographic Area'] = census_2010s_df['Geographic Area'].str.strip('.')
+
+# with pd.option_context('display.min_rows', 30): # shows more rows
+# display(census_2010s_df)
+
+census_2010s_df.head(5)
+```
+
+Occasionally, you will want to modify code that you have imported. To reimport those modifications you can either use `python`'s `importlib` library:
+
+```python
+from importlib import reload
+reload(utils)
+```
+
+or use `IPython` magic, which will intelligently reimport code when files change:
+
+```python
+%load_ext autoreload
+%autoreload 2
+```
+
+```{python}
+#| code-fold: true
+# census 2020s data
+census_2020s_df = pd.read_csv("data/NST-EST2022-POP.csv", header=3, thousands=",")
+census_2020s_df = (
+ census_2020s_df
+ .reset_index()
+ .drop(columns=["index", "Unnamed: 1"])
+ .rename(columns={"Unnamed: 0": "Geographic Area"})
+ .convert_dtypes() # "smart" converting of columns, use at your own risk
+ .dropna() # we'll introduce this next time
+)
+census_2020s_df['Geographic Area'] = census_2020s_df['Geographic Area'].str.strip('.')
+
+census_2020s_df.head(5)
+```
+
+## Joining Data (Merging `DataFrame`s)
+
+Time to `merge`! Here we use the `DataFrame` method `df1.merge(right=df2, ...)` on `DataFrame` `df1` ([documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.merge.html)). Contrast this with the function `pd.merge(left=df1, right=df2, ...)` ([documentation](https://pandas.pydata.org/docs/reference/api/pandas.merge.html?highlight=pandas%20merge#pandas.merge)). Feel free to use either.
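+
+On toy `DataFrame`s of our own (hypothetical data, just to show the two call styles are interchangeable):
+
+```python
+left  = pd.DataFrame({"key": ["a", "b"], "x": [1, 2]})
+right = pd.DataFrame({"key": ["a", "b"], "y": [3, 4]})
+
+# method form and function form produce the same result
+method_form   = left.merge(right=right, on="key")
+function_form = pd.merge(left=left, right=right, on="key")
+print(method_form.equals(function_form))   # True
+```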
+
+```{python}
+#| code-fold: false
+# merge TB DataFrame with two US census DataFrames
+tb_census_df = (
+ tb_df
+ .merge(right=census_2010s_df,
+ left_on="U.S. jurisdiction", right_on="Geographic Area")
+ .merge(right=census_2020s_df,
+ left_on="U.S. jurisdiction", right_on="Geographic Area")
+)
+tb_census_df.head(5)
+```
+
+Having all of these columns is a little unwieldy. We could either drop the unneeded columns now, or just merge on smaller census `DataFrame`s. Let's do the latter.
+
+```{python}
+#| code-fold: false
+# try merging again, but cleaner this time
+tb_census_df = (
+ tb_df
+ .merge(right=census_2010s_df[["Geographic Area", "2019"]],
+ left_on="U.S. jurisdiction", right_on="Geographic Area")
+ .drop(columns="Geographic Area")
+ .merge(right=census_2020s_df[["Geographic Area", "2020", "2021"]],
+ left_on="U.S. jurisdiction", right_on="Geographic Area")
+ .drop(columns="Geographic Area")
+)
+tb_census_df.head(5)
+```
+
+## Reproducing Data: Compute Incidence
+
+Let's recompute incidence to make sure we know where the original CDC numbers came from.
+
+From the [CDC report](https://www.cdc.gov/mmwr/volumes/71/wr/mm7112a1.htm?s_cid=mm7112a1_w#T1_down): TB incidence is computed as “Cases per 100,000 persons using mid-year population estimates from the U.S. Census Bureau.”
+
+If we define a group as 100,000 people, then we can compute the TB incidence for a given state population as
+
+$$\text{TB incidence} = \frac{\text{TB cases in population}}{\text{groups in population}} = \frac{\text{TB cases in population}}{\text{population}/100000} $$
+
+$$= \frac{\text{TB cases in population}}{\text{population}} \times 100000$$
+
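+For instance, a hypothetical state with 500 TB cases and a population of 10,000,000 contains $10{,}000{,}000 / 100{,}000 = 100$ groups, so its TB incidence is $500 / 100 = 5$ cases per 100,000 persons.
+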
+Let's try this for 2019:
+
+```{python}
+#| code-fold: false
+tb_census_df["recompute incidence 2019"] = tb_census_df["TB cases 2019"]/tb_census_df["2019"]*100000
+tb_census_df.head(5)
+```
+
+Awesome!!!
+
+Let's use a for-loop and `Python` format strings to compute TB incidence for all years. `Python` f-strings are just used for the purposes of this demo, but they're handy to know when you explore data beyond this course ([documentation](https://docs.python.org/3/tutorial/inputoutput.html)).
+
+```{python}
+#| code-fold: false
+# recompute incidence for all years
+for year in [2019, 2020, 2021]:
+ tb_census_df[f"recompute incidence {year}"] = tb_census_df[f"TB cases {year}"]/tb_census_df[f"{year}"]*100000
+tb_census_df.head(5)
+```
+
+These numbers look pretty close!!! There are a few errors in the hundredths place, particularly in 2021. It may be useful to further explore reasons behind this discrepancy.
+
+```{python}
+#| code-fold: false
+tb_census_df.describe()
+```
+
+## Bonus EDA: Reproducing the Reported Statistic
+
+
+**How do we reproduce that reported statistic in the original [CDC report](https://www.cdc.gov/mmwr/volumes/71/wr/mm7112a1.htm?s_cid=mm7112a1_w)?**
+
+> Reported TB incidence (cases per 100,000 persons) increased **9.4%**, from **2.2** during 2020 to **2.4** during 2021 but was lower than incidence during 2019 (2.7). Increases occurred among both U.S.-born and non–U.S.-born persons.
+
+This is TB incidence computed across the entire U.S. population! How do we reproduce this?
+* We need to reproduce the "Total" TB incidences in our rolled record.
+* But our current `tb_census_df` only has 51 entries (50 states plus Washington, D.C.). There is no rolled record.
+* What happened...?
+
+Let's get exploring!
+
+Before we keep exploring, we'll set all indexes to more meaningful values, instead of just numbers that pertain to some row at some point. This will make our cleaning slightly easier.
+
+```{python}
+#| code-fold: true
+tb_df = tb_df.set_index("U.S. jurisdiction")
+tb_df.head(5)
+```
+
+```{python}
+#| code-fold: false
+census_2010s_df = census_2010s_df.set_index("Geographic Area")
+census_2010s_df.head(5)
+```
+
+```{python}
+#| code-fold: false
+census_2020s_df = census_2020s_df.set_index("Geographic Area")
+census_2020s_df.head(5)
+```
+
+It turns out that our merge above only kept state records, even though our original `tb_df` had the "Total" rolled record:
+
+```{python}
+#| code-fold: false
+tb_df.head()
+```
+
+Recall that `merge` performs an **inner** merge by default, meaning that it only preserves keys that are present in **both** `DataFrame`s.
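+
+One way to see exactly which keys fail to match (a sketch for diagnosis only; we take a different route below) is an outer merge with `indicator=True`, which adds a `_merge` column flagging whether each row was found in the left table, the right table, or both:
+
+```python
+diagnostic = tb_df.merge(
+    right=census_2010s_df[["2019"]],
+    left_index=True, right_index=True,
+    how="outer", indicator=True
+)
+diagnostic["_merge"].value_counts()
+```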
+
+The rolled records in our census `DataFrame` have different `Geographic Area` fields, which was the key we merged on:
+
+```{python}
+#| code-fold: false
+census_2010s_df.head(5)
+```
+
+The Census `DataFrame` has several rolled records. The aggregate record we are looking for actually has the Geographic Area named "United States".
+
+One straightforward way to get the right merge is to rename the value itself. Because we now have the Geographic Area index, we'll use `df.rename()` ([documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.rename.html)):
+
+```{python}
+#| code-fold: false
+# rename rolled record for 2010s
+census_2010s_df.rename(index={'United States':'Total'}, inplace=True)
+census_2010s_df.head(5)
+```
+
+```{python}
+#| code-fold: false
+# same, but for 2020s rename rolled record
+census_2020s_df.rename(index={'United States':'Total'}, inplace=True)
+census_2020s_df.head(5)
+```
+
+<br/>
+
+Next let's rerun our merge. Note the different chaining, because we are now merging on indexes (`df.merge()` [documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.merge.html)).
+
+```{python}
+#| code-fold: false
+tb_census_df = (
+ tb_df
+ .merge(right=census_2010s_df[["2019"]],
+ left_index=True, right_index=True)
+ .merge(right=census_2020s_df[["2020", "2021"]],
+ left_index=True, right_index=True)
+)
+tb_census_df.head(5)
+```
+
+<br/>
+
+Finally, let's recompute our incidences:
+
+```{python}
+#| code-fold: false
+# recompute incidence for all years
+for year in [2019, 2020, 2021]:
+ tb_census_df[f"recompute incidence {year}"] = tb_census_df[f"TB cases {year}"]/tb_census_df[f"{year}"]*100000
+tb_census_df.head(5)
+```
+
+We reproduced the total U.S. incidences correctly!
+
+We're almost there. Let's revisit the quote:
+
+> Reported TB incidence (cases per 100,000 persons) increased **9.4%**, from **2.2** during 2020 to **2.4** during 2021 but was lower than incidence during 2019 (2.7). Increases occurred among both U.S.-born and non–U.S.-born persons.
+
+Recall that percent change from $A$ to $B$ is computed as
+$\text{percent change} = \frac{B - A}{A} \times 100$.
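+
+For example, plugging the rounded values into this formula gives $\frac{2.4 - 2.2}{2.2} \times 100 \approx 9.1\%$; the reported **9.4%** presumably comes from the unrounded incidences, which is what we compute below.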
+
+```{python}
+#| code-fold: false
+#| tags: []
+incidence_2020 = tb_census_df.loc['Total', 'recompute incidence 2020']
+incidence_2020
+```
+
+```{python}
+#| code-fold: false
+#| tags: []
+incidence_2021 = tb_census_df.loc['Total', 'recompute incidence 2021']
+incidence_2021
+```
+
+```{python}
+#| code-fold: false
+#| tags: []
+difference = (incidence_2021 - incidence_2020)/incidence_2020 * 100
+difference
+```
+
+# EDA Demo 2: Mauna Loa CO<sub>2</sub> Data -- A Lesson in Data Faithfulness
+
+[Mauna Loa Observatory](https://gml.noaa.gov/ccgg/trends/data.html) has been monitoring CO<sub>2</sub> concentrations since 1958.
+
+```{python}
+#| code-fold: false
+co2_file = "data/co2_mm_mlo.txt"
+```
+
+Let's do some **EDA**!!
+
+## Reading this file into Pandas?
+Before reading the file into `pandas`, let's check out this `.txt` file itself. Some questions to keep in mind: Do we trust this file extension? What structure does the file have?
+
+Lines 71-78 (inclusive) are shown below:
+
+ line number | file contents
+
+ 71 | # decimal average interpolated trend #days
+ 72 | # date (season corr)
+ 73 | 1958 3 1958.208 315.71 315.71 314.62 -1
+ 74 | 1958 4 1958.292 317.45 317.45 315.29 -1
+ 75 | 1958 5 1958.375 317.50 317.50 314.71 -1
+ 76 | 1958 6 1958.458 -99.99 317.10 314.85 -1
+ 77 | 1958 7 1958.542 315.86 315.86 314.98 -1
+ 78 | 1958 8 1958.625 314.93 314.93 315.94 -1
+
+
+Notice how:
+
+- The values are separated by white space, possibly tabs.
+- The data line up down the rows. For example, the month appears in the 7th to 8th position of each line.
+- The 71st and 72nd lines in the file contain column headings split over two lines.
+
+We can use `read_csv` to read the data into a `pandas` `DataFrame`, and we provide several arguments to specify that the separators are white space, there is no header (**we will set our own column names**), and to skip the first 72 rows of the file.
+
+```{python}
+#| code-fold: false
+co2 = pd.read_csv(
+ co2_file, header = None, skiprows = 72,
+    sep = r'\s+'  # delimiter for continuous whitespace (stay tuned for regex next lecture)
+)
+co2.head()
+```
+
+Congratulations! You've wrangled the data!
+
+<br/>
+
+...But our columns aren't named.
+**We need to do more EDA.**
+
+## Exploring Variable Feature Types
+
+The NOAA [webpage](https://gml.noaa.gov/ccgg/trends/) might have some useful tidbits (in this case it doesn't).
+
+Using the column descriptions given in the file header, we'll rerun `pd.read_csv`, but this time with some **custom column names.**
+
+```{python}
+#| code-fold: false
+co2 = pd.read_csv(
+ co2_file, header = None, skiprows = 72,
+    sep = r'\s+',  # regex for continuous whitespace (next lecture)
+ names = ['Yr', 'Mo', 'DecDate', 'Avg', 'Int', 'Trend', 'Days']
+)
+co2.head()
+```
+
+## Visualizing CO<sub>2</sub>
+Scientific studies tend to have very clean data, right...? Let's jump right in and make a time series plot of CO2 monthly averages.
+
+```{python}
+#| code-fold: true
+sns.lineplot(x='DecDate', y='Avg', data=co2);
+```
+
+The code above uses the `seaborn` plotting library (abbreviated `sns`). We will cover it in the Visualization lecture; for now, you don't need to worry about how it works!
+
+Yikes! Plotting the data uncovered a problem. The sharp vertical lines suggest that we have some **missing values**. What happened here?
+
+```{python}
+#| code-fold: false
+co2.head()
+```
+
+```{python}
+#| code-fold: false
+co2.tail()
+```
+
+Some data have unusual values like -1 and -99.99.
+
+Let's check the description at the top of the file again.
+
+* -1 signifies a missing value for the number of days `Days` the equipment was in operation that month.
+* -99.99 denotes a missing monthly average `Avg`.
+
+How can we fix this? First, let's explore other aspects of our data. Understanding our data will help us decide what to do with the missing values.
+
+<br/>
+
+
+## Sanity Checks: Reasoning about the data
+First, we consider the shape of the data. How many rows should we have?
+
+* If the data are in chronological order, we should have one record per month.
+* Data from March 1958 to August 2019.
+* We should have $ 12 \times (2019-1957) - 2 - 4 = 738 $ records.
+
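+As a quick cross-check of this arithmetic (a sketch, counting month starts from March 1958 through August 2019 inclusive):
+
+```python
+import pandas as pd
+
+len(pd.date_range("1958-03", "2019-08", freq="MS"))   # 738
+```
+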
+```{python}
+#| code-fold: false
+co2.shape
+```
+
+Nice!! The number of rows (i.e., records) matches our expectations.
+
+<br/>
+
+
+Let's now check the quality of each feature.
+
+## Understanding Missing Value 1: `Days`
+`Days` is a time field, so let's analyze other time fields to see if there is an explanation for missing values of days of operation.
+
+Let's start with **months**, `Mo`.
+
+Are we missing any records? Each month should have 61 or 62 instances (March 1958-August 2019).
+
+```{python}
+#| code-fold: false
+co2["Mo"].value_counts().sort_index()
+```
+
+As expected, Jan, Feb, Sep, Oct, Nov, and Dec have 61 occurrences, and the rest have 62.
+
+<br/>
+
+Next let's explore **days** `Days` itself, which is the number of days that the measurement equipment worked.
+
+```{python}
+#| code-fold: true
+sns.displot(co2['Days']);
+plt.title("Distribution of days feature"); # suppresses unneeded plotting output
+```
+
+In terms of data quality, a handful of months have averages based on measurements taken on fewer than half the days. In addition, there are nearly 200 missing values--**that's about 27% of the data**!
+
+<br/>
+
+Finally, let's check the last time feature, **year** `Yr`.
+
+Let's check to see if there is any connection between missing-ness and the year of the recording.
+
+```{python}
+#| code-fold: true
+sns.scatterplot(x="Yr", y="Days", data=co2);
+plt.title("Day field by Year"); # the ; suppresses output
+```
+
+**Observations**:
+
+* All of the missing data are in the early years of operation.
+* It appears there may have been problems with equipment in the mid to late 80s.
+
+**Potential Next Steps**:
+
+* Confirm these explanations through documentation about the historical readings.
+* Maybe drop the earliest recordings? However, we would want to delay such action until after we have examined the time trends and assessed whether there are any potential problems.
+
+<br/>
+
+## Understanding Missing Value 2: `Avg`
+Next, let's return to the -99.99 values in `Avg` to analyze the overall quality of the CO2 measurements. We'll plot a histogram of the average CO<sub>2</sub> measurements.
+
+```{python}
+#| code-fold: true
+# Histograms of average CO2 measurements
+sns.displot(co2['Avg']);
+```
+
+The non-missing values are in the 300-400 range (a regular range of CO2 levels).
+
+We also see that there are only a few missing `Avg` values (**<1% of values**). Let's examine all of them:
+
+```{python}
+#| code-fold: false
+co2[co2["Avg"] < 0]
+```
+
+There doesn't seem to be a pattern to these values, other than that most records also were missing `Days` data.
+
+## Drop, `NaN`, or Impute Missing `Avg` Data?
+
+How should we address the invalid `Avg` data?
+
+1. Drop records
+2. Set to NaN
+3. Impute using some strategy
+
+Remember we want to fix the following plot:
+
+```{python}
+#| code-fold: true
+sns.lineplot(x='DecDate', y='Avg', data=co2)
+plt.title("CO2 Average By Month");
+```
+
+Since we are plotting `Avg` vs `DecDate`, we should just focus on dealing with missing values for `Avg`.
+
+
+Let's consider a few options:
+1. Drop those records
+2. Replace -99.99 with NaN
+3. Substitute the missing values with a likely value for the average CO2
+
+What do you think are the pros and cons of each possible action?
+
+<br/>
+
+
+Let's examine each of these three options.
+
+```{python}
+#| code-fold: false
+# 1. Drop missing values
+co2_drop = co2[co2['Avg'] > 0]
+co2_drop.head()
+```
+
+```{python}
+#| code-fold: false
+# 2. Replace -99.99 with NaN
+co2_NA = co2.replace(-99.99, np.nan)
+co2_NA.head()
+```
+
+We'll also use a third version of the data.
+
+First, we note that the dataset already comes with a **substitute value** for the -99.99.
+
+From the file description:
+
+> The `interpolated` column includes average values from the preceding column (`average`)
+and **interpolated values** where data are missing. Interpolated values are
+computed in two steps...
+
+The `Int` feature has values that exactly match those in `Avg`, except when `Avg` is -99.99, and then a **reasonable** estimate is used instead.
+
+So, the third version of our data will use the `Int` feature instead of `Avg`.
+
+```{python}
+#| code-fold: false
+# 3. Use interpolated column which estimates missing Avg values
+co2_impute = co2.copy()
+co2_impute['Avg'] = co2['Int']
+co2_impute.head()
+```
+
+What's a **reasonable** estimate?
+
+To answer this question, let's zoom in on a short time period, say the measurements in 1958 (where we know we have two missing values).
+
+```{python}
+#| code-fold: true
+# results of plotting data in 1958
+
+def line_and_points(data, ax, title):
+ # assumes single year, hence Mo
+ ax.plot('Mo', 'Avg', data=data)
+ ax.scatter('Mo', 'Avg', data=data)
+ ax.set_xlim(2, 13)
+ ax.set_title(title)
+ ax.set_xticks(np.arange(3, 13))
+
+def data_year(data, year):
+    return data[data["Yr"] == year]
+
+# uses matplotlib subplots
+# you may see more next week; focus on output for now
+fig, axes = plt.subplots(ncols = 3, figsize=(12, 4), sharey=True)
+
+year = 1958
+line_and_points(data_year(co2_drop, year), axes[0], title="1. Drop Missing")
+line_and_points(data_year(co2_NA, year), axes[1], title="2. Missing Set to NaN")
+line_and_points(data_year(co2_impute, year), axes[2], title="3. Missing Interpolated")
+
+fig.suptitle(f"Monthly Averages for {year}")
+plt.tight_layout()
+```
+
+In the big picture, since there are only 7 `Avg` values missing (**<1%** of 738 months), any of these approaches would work.
+
+However, there is some appeal to **option 3: imputing**:
+
+* Shows seasonal trends for CO2
+* We are plotting all months in our data as a line plot
+
+<br/>
+
+
+Let's replot our original figure with option 3:
+
+```{python}
+#| code-fold: true
+sns.lineplot(x='DecDate', y='Avg', data=co2_impute)
+plt.title("CO2 Average By Month, Imputed");
+```
+
+Looks pretty close to what we see on the NOAA [website](https://gml.noaa.gov/ccgg/trends/)!
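+
+As an aside: if the dataset had not shipped with an interpolated `Int` column, `pandas` could produce a rough substitute from the NaN version in option 2 (a sketch; simple linear interpolation, which is not identical to the two-step procedure described in the file):
+
+```python
+# fill the NaNs from option 2 with pandas' linear interpolation
+co2_manual_interp = co2_NA.copy()
+co2_manual_interp["Avg"] = co2_manual_interp["Avg"].interpolate()
+co2_manual_interp.head()
+```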
+
+## Presenting the data: A Discussion on Data Granularity
+
+From the description:
+
+* Monthly measurements are averages of daily average measurements.
+* The NOAA GML website has datasets for daily/hourly measurements too.
+
+The data you present depends on your research question.
+
+**How do CO2 levels vary by season?**
+
+* You might want to keep average monthly data.
+
+**Are CO2 levels rising over the past 50+ years, consistent with global warming predictions?**
+
+* You might be happier with a **coarser granularity** of average year data!
+
+```{python}
+#| code-fold: true
+co2_year = co2_impute.groupby('Yr').mean()
+sns.lineplot(x='Yr', y='Avg', data=co2_year)
+plt.title("CO2 Average By Year");
+```
+
+Indeed, we see a rise by nearly 100 ppm of CO2 since Mauna Loa began recording in 1958.
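+
+Another way to present a coarser trend, without collapsing to one point per year, is a 12-month rolling mean (a sketch, assuming the `co2_impute` `DataFrame` from above; we stick with the yearly averages in this lecture):
+
+```python
+# smooth out the seasonal cycle while keeping one value per month
+co2_rolling = co2_impute.copy()
+co2_rolling["Avg_smooth"] = co2_rolling["Avg"].rolling(12, center=True).mean()
+sns.lineplot(x="DecDate", y="Avg_smooth", data=co2_rolling);
+```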
+
+# Summary
+We went over a lot of content this lecture; let's summarize the most important points:
+
+## Dealing with Missing Values
+There are a few options we can take to deal with missing data:
+
+* Drop missing records
+* Keep `NaN` missing values
+* Impute using an interpolated column
+
+## EDA and Data Wrangling
+There are several ways to approach EDA and Data Wrangling:
+
+* Examine the **data and metadata**: what is the date, size, organization, and structure of the data?
+* Examine each **field/attribute/dimension** individually.
+* Examine pairs of related dimensions (e.g. breaking down grades by major).
+* Along the way, we can:
+ * **Visualize** or summarize the data.
+ * **Validate assumptions** about data and its collection process. Pay particular attention to when the data was collected.
+ * Identify and **address anomalies**.
+ * Apply data transformations and corrections (we'll cover this in the upcoming lecture).
+ * **Record everything you do!** Developing in Jupyter Notebook promotes *reproducibility* of your own work!
diff --git a/docs/eda/eda_files/figure-html/cell-62-output-1.png b/docs/eda/eda_files/figure-html/cell-62-output-1.png
index a04218cf..f392d5f9 100644
Binary files a/docs/eda/eda_files/figure-html/cell-62-output-1.png and b/docs/eda/eda_files/figure-html/cell-62-output-1.png differ
diff --git a/docs/eda/eda_files/figure-html/cell-67-output-1.png b/docs/eda/eda_files/figure-html/cell-67-output-1.png
new file mode 100644
index 00000000..be96b8c9
Binary files /dev/null and b/docs/eda/eda_files/figure-html/cell-67-output-1.png differ
diff --git a/docs/eda/eda_files/figure-html/cell-67-output-2.png b/docs/eda/eda_files/figure-html/cell-67-output-2.png
deleted file mode 100644
index 31857f62..00000000
Binary files a/docs/eda/eda_files/figure-html/cell-67-output-2.png and /dev/null differ
diff --git a/docs/eda/eda_files/figure-html/cell-68-output-1.png b/docs/eda/eda_files/figure-html/cell-68-output-1.png
index 67c3959d..ffd29ff8 100644
Binary files a/docs/eda/eda_files/figure-html/cell-68-output-1.png and b/docs/eda/eda_files/figure-html/cell-68-output-1.png differ
diff --git a/docs/eda/eda_files/figure-html/cell-69-output-1.png b/docs/eda/eda_files/figure-html/cell-69-output-1.png
new file mode 100644
index 00000000..29088928
Binary files /dev/null and b/docs/eda/eda_files/figure-html/cell-69-output-1.png differ
diff --git a/docs/eda/eda_files/figure-html/cell-69-output-2.png b/docs/eda/eda_files/figure-html/cell-69-output-2.png
deleted file mode 100644
index fb28f5d5..00000000
Binary files a/docs/eda/eda_files/figure-html/cell-69-output-2.png and /dev/null differ
diff --git a/docs/eda/eda_files/figure-html/cell-71-output-1.png b/docs/eda/eda_files/figure-html/cell-71-output-1.png
index 39cac822..49ef3d6a 100644
Binary files a/docs/eda/eda_files/figure-html/cell-71-output-1.png and b/docs/eda/eda_files/figure-html/cell-71-output-1.png differ
diff --git a/docs/eda/eda_files/figure-html/cell-75-output-1.png b/docs/eda/eda_files/figure-html/cell-75-output-1.png
index 6382e58a..15a5fe82 100644
Binary files a/docs/eda/eda_files/figure-html/cell-75-output-1.png and b/docs/eda/eda_files/figure-html/cell-75-output-1.png differ
diff --git a/docs/eda/eda_files/figure-html/cell-76-output-1.png b/docs/eda/eda_files/figure-html/cell-76-output-1.png
index db2b0dee..40b1fc71 100644
Binary files a/docs/eda/eda_files/figure-html/cell-76-output-1.png and b/docs/eda/eda_files/figure-html/cell-76-output-1.png differ
diff --git a/docs/eda/eda_files/figure-html/cell-77-output-1.png b/docs/eda/eda_files/figure-html/cell-77-output-1.png
index 897b8b39..99b6c2d1 100644
Binary files a/docs/eda/eda_files/figure-html/cell-77-output-1.png and b/docs/eda/eda_files/figure-html/cell-77-output-1.png differ
diff --git a/docs/feature_engineering/feature_engineering.html b/docs/feature_engineering/feature_engineering.html
index ea770e7f..22d26788 100644
--- a/docs/feature_engineering/feature_engineering.html
+++ b/docs/feature_engineering/feature_engineering.html
@@ -556,7 +556,7 @@
my_model.fit(X, Y)
-LinearRegression()
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.LinearRegression()
+LinearRegression()
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.LinearRegression()
Notice that we use double brackets to extract this column. Why double brackets instead of just single brackets? The .fit
method, by default, expects to receive 2-dimensional data – some kind of data that includes both rows and columns. Writing penguins["flipper_length_mm"]
would return a 1D Series
, causing sklearn
to error. We avoid this by writing penguins[["flipper_length_mm"]]
to produce a 2D DataFrame
.
@@ -607,7 +607,7 @@
print(f"The RMSE of the model is {np.sqrt(np.mean((Y-Y_hat_two_features)**2))}")
-The RMSE of the model is 0.9881331104079044
+The RMSE of the model is 0.9881331104079045
We can also see that we obtain the same predictions using sklearn
as we did when applying the ordinary least squares formula before!
@@ -977,7 +977,7 @@
print(f"MSE of model with (hp^2) feature: {np.mean((Y-hp2_model_predictions)**2)}")
-MSE of model with (hp^2) feature: 18.984768907617223
+MSE of model with (hp^2) feature: 18.984768907617216
diff --git a/docs/feature_engineering/feature_engineering_files/figure-html/cell-16-output-2.png b/docs/feature_engineering/feature_engineering_files/figure-html/cell-16-output-2.png
index 92cb01c9..f8396667 100644
Binary files a/docs/feature_engineering/feature_engineering_files/figure-html/cell-16-output-2.png and b/docs/feature_engineering/feature_engineering_files/figure-html/cell-16-output-2.png differ
diff --git a/docs/feature_engineering/feature_engineering_files/figure-html/cell-17-output-2.png b/docs/feature_engineering/feature_engineering_files/figure-html/cell-17-output-2.png
index f4ae4ea0..ceecd30f 100644
Binary files a/docs/feature_engineering/feature_engineering_files/figure-html/cell-17-output-2.png and b/docs/feature_engineering/feature_engineering_files/figure-html/cell-17-output-2.png differ
diff --git a/docs/gradient_descent/gradient_descent.html b/docs/gradient_descent/gradient_descent.html
index 467ee5fb..ed238d2c 100644
--- a/docs/gradient_descent/gradient_descent.html
+++ b/docs/gradient_descent/gradient_descent.html
@@ -106,7 +106,7 @@
require.undef("plotly");
requirejs.config({
paths: {
- 'plotly': ['https://cdn.plot.ly/plotly-2.25.2.min']
+ 'plotly': ['https://cdn.plot.ly/plotly-2.12.1.min']
}
});
require(['plotly'], function(Plotly) {
@@ -439,9 +439,9 @@
-
@@ -4481,10 +4469,10 @@
-
-# 3. Use interpolated column which estimates missing Avg values
-co2_impute = co2.copy()
-co2_impute['Avg'] = co2['Int']
-co2_impute.head()
+# 3. Use interpolated column which estimates missing Avg values
+co2_impute = co2.copy()
+co2_impute['Avg'] = co2['Int']
+co2_impute.head()
@@ -4564,30 +4552,30 @@
Code
-# results of plotting data in 1958
-
-def line_and_points(data, ax, title):
- # assumes single year, hence Mo
- ax.plot('Mo', 'Avg', data=data)
- ax.scatter('Mo', 'Avg', data=data)
- ax.set_xlim(2, 13)
- ax.set_title(title)
- ax.set_xticks(np.arange(3, 13))
-
-def data_year(data, year):
- return data[data["Yr"] == 1958]
-
-# uses matplotlib subplots
-# you may see more next week; focus on output for now
-fig, axes = plt.subplots(ncols = 3, figsize=(12, 4), sharey=True)
-
-year = 1958
-line_and_points(data_year(co2_drop, year), axes[0], title="1. Drop Missing")
-line_and_points(data_year(co2_NA, year), axes[1], title="2. Missing Set to NaN")
-line_and_points(data_year(co2_impute, year), axes[2], title="3. Missing Interpolated")
-
-fig.suptitle(f"Monthly Averages for {year}")
-plt.tight_layout()
+# results of plotting data in 1958
+
+def line_and_points(data, ax, title):
+ # assumes single year, hence Mo
+ ax.plot('Mo', 'Avg', data=data)
+ ax.scatter('Mo', 'Avg', data=data)
+ ax.set_xlim(2, 13)
+ ax.set_title(title)
+ ax.set_xticks(np.arange(3, 13))
+
+def data_year(data, year):
+ return data[data["Yr"] == 1958]
+
+# uses matplotlib subplots
+# you may see more next week; focus on output for now
+fig, axes = plt.subplots(ncols = 3, figsize=(12, 4), sharey=True)
+
+year = 1958
+line_and_points(data_year(co2_drop, year), axes[0], title="1. Drop Missing")
+line_and_points(data_year(co2_NA, year), axes[1], title="2. Missing Set to NaN")
+line_and_points(data_year(co2_impute, year), axes[2], title="3. Missing Interpolated")
+
+fig.suptitle(f"Monthly Averages for {year}")
+plt.tight_layout()
@@ -4604,8 +4592,8 @@
Code
-
+
@@ -4632,9 +4620,9 @@
Code
-
+
@@ -4975,1218 +4963,1218 @@ <
Source Code
----
-title: Data Cleaning and EDA
-execute:
- echo: true
-format:
- html:
- code-fold: true
- code-tools: true
- toc: true
- toc-title: Data Cleaning and EDA
- page-layout: full
- theme:
- - cosmo
- - cerulean
- callout-icon: false
-jupyter: python3
----
-
-```{python}
-#| code-fold: true
-import numpy as np
-import pandas as pd
-
-import matplotlib.pyplot as plt
-import seaborn as sns
-#%matplotlib inline
-plt.rcParams['figure.figsize'] = (12, 9)
-
-sns.set()
-sns.set_context('talk')
-np.set_printoptions(threshold=20, precision=2, suppress=True)
-pd.set_option('display.max_rows', 30)
-pd.set_option('display.max_columns', None)
-pd.set_option('display.precision', 2)
-# This option stops scientific notation for pandas
-pd.set_option('display.float_format', '{:.2f}'.format)
-
-# Silence some spurious seaborn warnings
-import warnings
-warnings.filterwarnings("ignore", category=FutureWarning)
-```
-
-::: {.callout-note collapse="false"}
-## Learning Outcomes
-* Recognize common file formats
-* Categorize data by its variable type
-* Build awareness of issues with data faithfulness and develop targeted solutions
-:::
-
-**This content is covered in lectures 4, 5, and 6.**
-
-In the past few lectures, we've learned that `pandas` is a toolkit to restructure, modify, and explore a dataset. What we haven't yet touched on is *how* to make these data transformation decisions. When we receive a new set of data from the "real world," how do we know what processing we should do to convert this data into a usable form?
-
-**Data cleaning**, also called **data wrangling**, is the process of transforming raw data to facilitate subsequent analysis. It is often used to address issues like:
-
-* Unclear structure or formatting
-* Missing or corrupted values
-* Unit conversions
-* ...and so on
-
-**Exploratory Data Analysis (EDA)** is the process of understanding a new dataset. It is an open-ended, informal analysis that involves familiarizing ourselves with the variables present in the data, discovering potential hypotheses, and identifying possible issues with the data. This last point can often motivate further data cleaning to address any problems with the dataset's format; because of this, EDA and data cleaning are often thought of as an "infinite loop," with each process driving the other.
-
-In this lecture, we will consider the key properties of data to consider when performing data cleaning and EDA. In doing so, we'll develop a "checklist" of sorts for you to consider when approaching a new dataset. Throughout this process, we'll build a deeper understanding of this early (but very important!) stage of the data science lifecycle.
-
-## Structure
-
-### File Formats
-There are many file types for storing structured data: TSV, JSON, XML, ASCII, SAS, etc. We'll only cover CSV, TSV, and JSON in lecture, but you'll likely encounter other formats as you work with different datasets. Reading documentation is your best bet for understanding how to process the multitude of different file types.
-
-#### CSV
-CSVs, which stand for **Comma-Separated Values**, are a common tabular data format.
-In the past two `pandas` lectures, we briefly touched on the idea of file format: the way data is encoded in a file for storage. Specifically, our `elections` and `babynames` datasets were stored and loaded as CSVs:
-
-```{python}
-#| code-fold: false
-pd.read_csv("data/elections.csv").head(5)
-```
-
-To better understand the properties of a CSV, let's take a look at the first few rows of the raw data file to see what it looks like before being loaded into a `DataFrame`. We'll use the `repr()` function to return the raw string with its special characters:
-
-```{python}
-#| code-fold: false
-with open("data/elections.csv", "r") as table:
- i = 0
- for row in table:
- print(repr(row))
- i += 1
- if i > 3:
- break
-```
-
-Each row, or **record**, in the data is delimited by a newline `\n`. Each column, or **field**, in the data is delimited by a comma `,` (hence, comma-separated!).
-
-#### TSV
-
-Another common file type is **TSV (Tab-Separated Values)**. In a TSV, records are still delimited by a newline `\n`, while fields are delimited by `\t` tab character.
-
-Let's check out the first few rows of the raw TSV file. Again, we'll use the `repr()` function so that `print` shows the special characters.
-
-```{python}
-#| code-fold: false
-with open("data/elections.txt", "r") as table:
- i = 0
- for row in table:
- print(repr(row))
- i += 1
- if i > 3:
- break
-```
-
-TSVs can be loaded into `pandas` using `pd.read_csv`. We'll need to specify the **delimiter** with parameter` sep='\t'` [(documentation)](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html).
-
-```{python}
-#| code-fold: false
-pd.read_csv("data/elections.txt", sep='\t').head(3)
-```
-
-An issue with CSVs and TSVs comes up whenever there are commas or tabs within the records. How does `pandas` differentiate between a comma delimiter vs. a comma within the field itself, for example `8,900`? To remedy this, check out the [`quotechar` parameter](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html).
-
-#### JSON
-**JSON (JavaScript Object Notation)** files behave similarly to Python dictionaries. A raw JSON is shown below.
-
-```{python}
-#| code-fold: false
-with open("data/elections.json", "r") as table:
- i = 0
- for row in table:
- print(row)
- i += 1
- if i > 8:
- break
-```
-
-JSON files can be loaded into `pandas` using `pd.read_json`.
-
-```{python}
-#| code-fold: false
-pd.read_json('data/elections.json').head(3)
-```
-
-##### EDA with JSON: Berkeley COVID-19 Data
-The City of Berkeley Open Data [website](https://data.cityofberkeley.info/Health/COVID-19-Confirmed-Cases/xn6j-b766) has a dataset with COVID-19 Confirmed Cases among Berkeley residents by date. Let's download the file and save it as a JSON (note the source URL file type is also a JSON). In the interest of reproducible data science, we will download the data programatically. We have defined some helper functions in the [`ds100_utils.py`](https://ds100.org/fa23/resources/assets/lectures/lec05/lec05-eda.html) file that we can reuse these helper functions in many different notebooks.
-
-```{python}
-#| code-fold: false
-from ds100_utils import fetch_and_cache
-
-covid_file = fetch_and_cache(
- "https://data.cityofberkeley.info/api/views/xn6j-b766/rows.json?accessType=DOWNLOAD",
- "confirmed-cases.json",
- force=False)
-covid_file # a file path wrapper object
-```
-
-###### File Size
-Let's start our analysis by getting a rough estimate of the size of the dataset to inform the tools we use to view the data. For relatively small datasets, we can use a text editor or spreadsheet. For larger datasets, more programmatic exploration or distributed computing tools may be more fitting. Here we will use `Python` tools to probe the file.
-
-Since there seem to be text files, let's investigate the number of lines, which often corresponds to the number of records
-
-```{python}
-#| code-fold: false
-import os
-
-print(covid_file, "is", os.path.getsize(covid_file) / 1e6, "MB")
-
-with open(covid_file, "r") as f:
- print(covid_file, "is", sum(1 for l in f), "lines.")
-```
-
-###### Unix Commands
-As part of the EDA workflow, Unix commands can come in very handy. In fact, there's an entire book called ["Data Science at the Command Line"](https://datascienceatthecommandline.com/) that explores this idea in depth!
-In Jupyter/IPython, you can prefix lines with `!` to execute arbitrary Unix commands, and within those lines, you can refer to `Python` variables and expressions with the syntax `{expr}`.
-
-Here, we use the `ls` command to list files, using the `-lh` flags, which request "long format with information in human-readable form." We also use the `wc` command for "word count," but with the `-l` flag, which asks for line counts instead of words.
-
-These two give us the same information as the code above, albeit in a slightly different form:
-
-```{python}
-#| code-fold: false
-!ls -lh {covid_file}
-!wc -l {covid_file}
-```
-
-###### File Contents
-Let's explore the data format using `Python`.
-
-```{python}
-#| code-fold: false
-with open(covid_file, "r") as f:
- for i, row in enumerate(f):
- print(repr(row)) # print raw strings
- if i >= 4: break
-```
-
-We can use the `head` Unix command (which is where `pandas`' `head` method comes from!) to see the first few lines of the file:
-
-```{python}
-#| code-fold: false
-!head -5 {covid_file}
-```
-
-In order to load the JSON file into `pandas`, Let's first do some EDA with `Python`'s `json` package to understand the particular structure of this JSON file so that we can decide what (if anything) to load into `pandas`. `Python` has relatively good support for JSON data since it closely matches the internal python object model. In the following cell we import the entire JSON datafile into a python dictionary using the `json` package.
-
-```{python}
-#| code-fold: false
-import json
-
-with open(covid_file, "rb") as f:
- covid_json = json.load(f)
-```
-
-The `covid_json` variable is now a dictionary encoding the data in the file:
-
-```{python}
-#| code-fold: false
-type(covid_json)
-```
-
-We can examine what keys are in the top level json object by listing out the keys.
-
-```{python}
-#| code-fold: false
-covid_json.keys()
-```
-
-**Observation**: The JSON dictionary contains a `meta` key which likely refers to meta data (data about the data). Meta data often maintained with the data and can be a good source of additional information.
-
-
-We can investigate the meta data further by examining the keys associated with the metadata.
-
-```{python}
-#| code-fold: false
-covid_json['meta'].keys()
-```
-
-The `meta` key contains another dictionary called `view`. This likely refers to meta-data about a particular "view" of some underlying database. We will learn more about views when we study SQL later in the class.
-
-```{python}
-#| code-fold: false
-covid_json['meta']['view'].keys()
-```
-
-Notice that this a nested/recursive data structure. As we dig deeper we reveal more and more keys and the corresponding data:
-
-```
-meta
-|-> data
- | ... (haven't explored yet)
-|-> view
- | -> id
- | -> name
- | -> attribution
- ...
- | -> description
- ...
- | -> columns
- ...
-```
-
-
-There is a key called description in the view sub dictionary. This likely contains a description of the data:
-
-```{python}
-#| code-fold: false
-print(covid_json['meta']['view']['description'])
-```
-
-###### Examining the Data Field for Records
-
-We can look at a few entries in the `data` field. This is what we'll load into `pandas`.
-
-```{python}
-#| code-fold: false
-for i in range(3):
- print(f"{i:03} | {covid_json['data'][i]}")
-```
-
-Observations:
-* These look like equal-length records, so maybe `data` is a table!
-* But what do each of values in the record mean? Where can we find column headers?
-
-For that, we'll need the `columns` key in the metadata dictionary. This returns a list:
-
-```{python}
-#| code-fold: false
-type(covid_json['meta']['view']['columns'])
-```
-
-###### Summary of exploring the JSON file
-
-1. The above **metadata** tells us a lot about the columns in the data including column names, potential data anomalies, and a basic statistic.
-1. Because of its non-tabular structure, JSON makes it easier (than CSV) to create **self-documenting data**, meaning that information about the data is stored in the same file as the data.
-1. Self-documenting data can be helpful since it maintains its own description and these descriptions are more likely to be updated as data changes.
-
-###### Loading COVID Data into `pandas`
-Finally, let's load the data (not the metadata) into a `pandas` `DataFrame`. In the following block of code we:
-
-1. Translate the JSON records into a `DataFrame`:
-
- * fields: `covid_json['meta']['view']['columns']`
- * records: `covid_json['data']`
-
-
-1. Remove columns that have no metadata description. This would be a bad idea in general, but here we remove these columns since the above analysis suggests they are unlikely to contain useful information.
-
-1. Examine the `tail` of the table.
-
-```{python}
-#| code-fold: false
-# Load the data from JSON and assign column titles
-covid = pd.DataFrame(
- covid_json['data'],
- columns=[c['name'] for c in covid_json['meta']['view']['columns']])
-
-covid.tail()
-```
-
-### Variable Types
-
-After loading data into a file, it's a good idea to take the time to understand what pieces of information are encoded in the dataset. In particular, we want to identify what variable types are present in our data. Broadly speaking, we can categorize variables into one of two overarching types.
-
-**Quantitative variables** describe some numeric quantity or amount. We can divide quantitative data further into:
-
-* **Continuous quantitative variables**: numeric data that can be measured on a continuous scale to arbitrary precision. Continuous variables do not have a strict set of possible values – they can be recorded to any number of decimal places. For example, weights, GPA, or CO<sub>2</sub> concentrations.
-* **Discrete quantitative variables**: numeric data that can only take on a finite set of possible values. For example, someone's age or the number of siblings they have.
-
-**Qualitative variables**, also known as **categorical variables**, describe data that isn't measuring some quantity or amount. The sub-categories of categorical data are:
-
-* **Ordinal qualitative variables**: categories with ordered levels. Specifically, ordinal variables are those where the difference between levels has no consistent, quantifiable meaning. Some examples include levels of education (high school, undergrad, grad, etc.), income bracket (low, medium, high), or Yelp rating.
-* **Nominal qualitative variables**: categories with no specific order. For example, someone's political affiliation or Cal ID number.
-
-![Classification of variable types](images/variable.png)
-
-Note that many variables don't sit neatly in just one of these categories. Qualitative variables could have numeric levels, and conversely, quantitative variables could be stored as strings.
-
-### Primary and Foreign Keys
-
-Last time, we introduced `.merge` as the `pandas` method for joining multiple `DataFrame`s together. In our discussion of joins, we touched on the idea of using a "key" to determine what rows should be merged from each table. Let's take a moment to examine this idea more closely.
-
-The **primary key** is the column or set of columns in a table that *uniquely* determine the values of the remaining columns. It can be thought of as the unique identifier for each individual row in the table. For example, a table of Data 100 students might use each student's Cal ID as the primary key.
-
-```{python}
-#| echo: false
-pd.DataFrame({"Cal ID":[3034619471, 3035619472, 3025619473, 3046789372], \
- "Name":["Oski", "Ollie", "Orrie", "Ollie"], \
- "Major":["Data Science", "Computer Science", "Data Science", "Economics"]})
-```
-
-The **foreign key** is the column or set of columns in a table that reference primary keys in other tables. Knowing a dataset's foreign keys can be useful when assigning the `left_on` and `right_on` parameters of `.merge`. In the table of office hour tickets below, `"Cal ID"` is a foreign key referencing the previous table.
-
-```{python}
-#| echo: false
-pd.DataFrame({"OH Request":[1, 2, 3, 4], \
- "Cal ID":[3034619471, 3035619472, 3025619473, 3035619472], \
- "Question":["HW 2 Q1", "HW 2 Q3", "Lab 3 Q4", "HW 2 Q7"]})
-```
-
-## Granularity, Scope, and Temporality
-
-After understanding the structure of the dataset, the next task is to determine what exactly the data represents. We'll do so by considering the data's granularity, scope, and temporality.
-
-### Granularity
-The **granularity** of a dataset is what a single row represents. You can also think of it as the level of detail included in the data. To determine the data's granularity, ask: what does each row in the dataset represent? Fine-grained data contains a high level of detail, with a single row representing a small individual unit. For example, each record may represent one person. Coarse-grained data is encoded such that a single row represents a large individual unit – for example, each record may represent a group of people.
-
-### Scope
-The **scope** of a dataset is the subset of the population covered by the data. If we were investigating student performance in Data Science courses, a dataset with a narrow scope might encompass all students enrolled in Data 100 whereas a dataset with an expansive scope might encompass all students in California.
-
-### Temporality
-The **temporality** of a dataset describes the periodicity over which the data was collected as well as when the data was most recently collected or updated.
-
-Time and date fields of a dataset could represent a few things:
-
-1. when the "event" happened
-2. when the data was collected, or when it was entered into the system
-3. when the data was copied into the database
-
-To fully understand the temporality of the data, it also may be necessary to standardize time zones or inspect recurring time-based trends in the data (do patterns recur in 24-hour periods? Over the course of a month? Seasonally?). The convention for standardizing time is the Coordinated Universal Time (UTC), an international time standard measured at 0 degrees latitude that stays consistent throughout the year (no daylight savings). We can represent Berkeley's time zone, Pacific Standard Time (PST), as UTC-7 (with daylight savings).
-
-#### Temporality with `pandas`' `dt` accessors
-Let's briefly look at how we can use `pandas`' `dt` accessors to work with dates/times in a dataset using the dataset you'll see in Lab 3: the Berkeley PD Calls for Service dataset.
-
-```{python}
-#| code-fold: true
-calls = pd.read_csv("data/Berkeley_PD_-_Calls_for_Service.csv")
-calls.head()
-```
-
-Looks like there are three columns with dates/times: `EVENTDT`, `EVENTTM`, and `InDbDate`.
-
-Most likely, `EVENTDT` stands for the date when the event took place, `EVENTTM` stands for the time of day the event took place (in 24-hr format), and `InDbDate` is the date this call is recorded onto the database.
-
-If we check the data type of these columns, we will see they are stored as strings. We can convert them to `datetime` objects using pandas `to_datetime` function.
-
-```{python}
-#| code-fold: false
-calls["EVENTDT"] = pd.to_datetime(calls["EVENTDT"])
-calls.head()
-```
-
-Now, we can use the `dt` accessor on this column.
-
-We can get the month:
-
-```{python}
-#| code-fold: false
-calls["EVENTDT"].dt.month.head()
-```
-
-Which day of the week the date is on:
-
-```{python}
-#| code-fold: false
-calls["EVENTDT"].dt.dayofweek.head()
-```
-
-Check the mimimum values to see if there are any suspicious-looking, 70s dates:
-
-```{python}
-#| code-fold: false
-calls.sort_values("EVENTDT").head()
-```
-
-Doesn't look like it! We are good!
-
-
-We can also do many things with the `dt` accessor like switching time zones and converting time back to UNIX/POSIX time. Check out the documentation on [`.dt` accessor](https://pandas.pydata.org/docs/user_guide/basics.html#basics-dt-accessors) and [time series/date functionality](https://pandas.pydata.org/docs/user_guide/timeseries.html#).
-
-## Faithfulness
-
-At this stage in our data cleaning and EDA workflow, we've achieved quite a lot: we've identified how our data is structured, come to terms with what information it encodes, and gained insight as to how it was generated. Throughout this process, we should always recall the original intent of our work in Data Science – to use data to better understand and model the real world. To achieve this goal, we need to ensure that the data we use is faithful to reality; that is, that our data accurately captures the "real world."
-
-Data used in research or industry is often "messy" – there may be errors or inaccuracies that impact the faithfulness of the dataset. Signs that data may not be faithful include:
-
-* Unrealistic or "incorrect" values, such as negative counts, locations that don't exist, or dates set in the future
-* Violations of obvious dependencies, like an age that does not match a birthday
-* Clear signs that data was entered by hand, which can lead to spelling errors or fields that are incorrectly shifted
-* Signs of data falsification, such as fake email addresses or repeated use of the same names
-* Duplicated records or fields containing the same information
-* Truncated data, e.g. Microsoft Excel would limit the number of rows to 655536 and the number of columns to 255
-
-We often solve some of these more common issues in the following ways:
-
-* Spelling errors: apply corrections or drop records that aren't in a dictionary
-* Time zone inconsistencies: convert to a common time zone (e.g. UTC)
-* Duplicated records or fields: identify and eliminate duplicates (using primary keys)
-* Unspecified or inconsistent units: infer the units and check that values are in reasonable ranges in the data
-
-### Missing Values
-Another common issue encountered with real-world datasets is that of missing data. One strategy to resolve this is to simply drop any records with missing values from the dataset. This does, however, introduce the risk of inducing biases – it is possible that the missing or corrupt records may be systemically related to some feature of interest in the data. Another solution is to keep the data as `NaN` values.
-
-A third method to address missing data is to perform **imputation**: infer the missing values using other data available in the dataset. There is a wide variety of imputation techniques that can be implemented; some of the most common are listed below.
-
-* Average imputation: replace missing values with the average value for that field
-* Hot deck imputation: replace missing values with some random value
-* Regression imputation: develop a model to predict missing values
-* Multiple imputation: replace missing values with multiple random values
-
-Regardless of the strategy used to deal with missing data, we should think carefully about *why* particular records or fields may be missing – this can help inform whether or not the absence of these values is significant or meaningful.
-
-# EDA Demo 1: Tuberculosis in the United States
-
-Now, let's walk through the data-cleaning and EDA workflow to see what can we learn about the presence of Tuberculosis in the United States!
-
-We will examine the data included in the [original CDC article](https://www.cdc.gov/mmwr/volumes/71/wr/mm7112a1.htm?s_cid=mm7112a1_w#T1_down) published in 2021.
-
-
-## CSVs and Field Names
-Suppose Table 1 was saved as a CSV file located in `data/cdc_tuberculosis.csv`.
-
-We can then explore the CSV (which is a text file, and does not contain binary-encoded data) in many ways:
-1. Using a text editor like emacs, vim, VSCode, etc.
-2. Opening the CSV directly in DataHub (read-only), Excel, Google Sheets, etc.
-3. The `Python` file object
-4. `pandas`, using `pd.read_csv()`
-
-To try out options 1 and 2, you can view or download the Tuberculosis dataset from the [lecture demo notebook](https://data100.datahub.berkeley.edu/hub/user-redirect/git-pull?repo=https%3A%2F%2Fgithub.com%2FDS-100%2Ffa23-student&urlpath=lab%2Ftree%2Ffa23-student%2Flecture%2Flec05%2Flec04-eda.ipynb&branch=main) under the `data` folder in the left hand menu. Notice how the CSV file is a type of **rectangular data (i.e., tabular data) stored as comma-separated values**.
-
-Next, let's try out option 3 using the `Python` file object. We'll look at the first four lines:
-
-```{python}
-#| code-fold: true
-with open("data/cdc_tuberculosis.csv", "r") as f:
- i = 0
- for row in f:
- print(row)
- i += 1
- if i > 3:
- break
-```
-
-Whoa, why are there blank lines interspaced between the lines of the CSV?
-
-You may recall that all line breaks in text files are encoded as the special newline character `\n`. `Python`'s `print()` prints each string (which already ends in a newline) and then adds an extra newline of its own, which is why we see the blank lines.
-
-If you're curious, we can use the `repr()` function to return the raw string with all special characters:
-
-```{python}
-#| code-fold: true
-with open("data/cdc_tuberculosis.csv", "r") as f:
- i = 0
- for row in f:
- print(repr(row)) # print raw strings
- i += 1
- if i > 3:
- break
-```
-
-Finally, let's try option 4 and use the tried-and-true Data 100 approach: `pandas`.
-
-```{python}
-#| code-fold: false
-tb_df = pd.read_csv("data/cdc_tuberculosis.csv")
-tb_df.head()
-```
-
-You may notice some strange things about this table: what's up with the "Unnamed" column names and the first row?
-
-Congratulations — you're ready to wrangle your data! Because of how things are stored, we'll need to clean the data a bit to name our columns better.
-
-A reasonable first step is to identify the row with the right header. The `pd.read_csv()` function ([documentation](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html)) has the convenient `header` parameter that we can set to use the elements in row 1 as the appropriate columns:
-
-```{python}
-#| code-fold: false
-tb_df = pd.read_csv("data/cdc_tuberculosis.csv", header=1) # row index
-tb_df.head(5)
-```
-
-Wait...but now we can't differentiate between the "Number of TB cases" and "TB incidence" year columns. `pandas` has tried to make our lives easier by automatically adding ".1" to the latter columns, but this doesn't help us, as humans, understand the data.
-
-We can do this manually with `df.rename()` ([documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.rename.html?highlight=rename#pandas.DataFrame.rename)):
-
-```{python}
-#| code-fold: false
-rename_dict = {'2019': 'TB cases 2019',
- '2020': 'TB cases 2020',
- '2021': 'TB cases 2021',
- '2019.1': 'TB incidence 2019',
- '2020.1': 'TB incidence 2020',
- '2021.1': 'TB incidence 2021'}
-tb_df = tb_df.rename(columns=rename_dict)
-tb_df.head(5)
-```
-
-## Record Granularity
-
-You might already be wondering: what's up with that first record?
-
-Row 0 is what we call a **rollup record**, or summary record. It's often useful when displaying tables to humans. The **granularity** of record 0 (Totals) vs the rest of the records (States) is different.
-
-Okay, EDA step two. How was the rollup record aggregated?
-
-Let's check if the Total TB cases are the sum of all state TB cases. If we sum over all rows, we should get **2x** the total cases in each of the TB cases columns (why do you think this is?).
-
-```{python}
-#| code-fold: true
-tb_df.sum(axis=0)
-```
-
-Whoa, what's going on with the TB cases in 2019, 2020, and 2021? Check out the column types:
-
-```{python}
-#| code-fold: true
-tb_df.dtypes
-```
-
-Since there are commas in the values for TB cases, the numbers are read as the `object` datatype, or **storage type** (close to the `Python` string datatype), so `pandas` is concatenating strings instead of adding integers (recall that `Python` can "sum", or concatenate, strings together: `"data" + "100"` evaluates to `"data100"`).
-
-
-Fortunately `read_csv` also has a `thousands` parameter ([documentation](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html)):
-
-```{python}
-#| code-fold: false
-# improve readability: chaining method calls with outer parentheses/line breaks
-tb_df = (
- pd.read_csv("data/cdc_tuberculosis.csv", header=1, thousands=',')
- .rename(columns=rename_dict)
-)
-tb_df.head(5)
-```
-
-```{python}
-#| code-fold: false
-tb_df.sum()
-```
-
-The Total TB cases look right. Phew!
-
-Let's just look at the records with **state-level granularity**:
-
-```{python}
-#| code-fold: true
-state_tb_df = tb_df[1:]
-state_tb_df.head(5)
-```
-
-## Gather Census Data
-
-U.S. Census population estimates [source](https://www.census.gov/data/tables/time-series/demo/popest/2010s-state-total.html) (2019), [source](https://www.census.gov/data/tables/time-series/demo/popest/2020s-state-total.html) (2020-2021).
-
-Running the below cells cleans the data.
-There are a few new methods here:
-* `df.convert_dtypes()` ([documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.convert_dtypes.html)) conveniently converts all float dtypes into ints; it is out of scope for this class.
-* `df.dropna()` ([documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.dropna.html)) will be explained in more detail next time.
-
-```{python}
-#| code-fold: true
-# 2010s census data
-census_2010s_df = pd.read_csv("data/nst-est2019-01.csv", header=3, thousands=",")
-census_2010s_df = (
- census_2010s_df
- .reset_index()
- .drop(columns=["index", "Census", "Estimates Base"])
- .rename(columns={"Unnamed: 0": "Geographic Area"})
- .convert_dtypes() # "smart" converting of columns, use at your own risk
- .dropna() # we'll introduce this next time
-)
-census_2010s_df['Geographic Area'] = census_2010s_df['Geographic Area'].str.strip('.')
-
-# with pd.option_context('display.min_rows', 30): # shows more rows
-# display(census_2010s_df)
-
-census_2010s_df.head(5)
-```
-
-Occasionally, you will want to modify code that you have imported. To reimport those modifications you can either use `python`'s `importlib` library:
-
-```python
-from importlib import reload
-reload(utils)
-```
-
-or use `iPython` magic which will intelligently import code when files change:
-
-```python
-%load_ext autoreload
-%autoreload 2
-```
-
-```{python}
-#| code-fold: true
-# census 2020s data
-census_2020s_df = pd.read_csv("data/NST-EST2022-POP.csv", header=3, thousands=",")
-census_2020s_df = (
- census_2020s_df
- .reset_index()
- .drop(columns=["index", "Unnamed: 1"])
- .rename(columns={"Unnamed: 0": "Geographic Area"})
- .convert_dtypes() # "smart" converting of columns, use at your own risk
- .dropna() # we'll introduce this next time
-)
-census_2020s_df['Geographic Area'] = census_2020s_df['Geographic Area'].str.strip('.')
-
-census_2020s_df.head(5)
-```
-
-## Joining Data (Merging `DataFrame`s)
-
-Time to `merge`! Here we use the `DataFrame` method `df1.merge(right=df2, ...)` on `DataFrame` `df1` ([documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.merge.html)). Contrast this with the function `pd.merge(left=df1, right=df2, ...)` ([documentation](https://pandas.pydata.org/docs/reference/api/pandas.merge.html?highlight=pandas%20merge#pandas.merge)). Feel free to use either.
-
-```{python}
-#| code-fold: false
-# merge TB DataFrame with two US census DataFrames
-tb_census_df = (
- tb_df
- .merge(right=census_2010s_df,
- left_on="U.S. jurisdiction", right_on="Geographic Area")
- .merge(right=census_2020s_df,
- left_on="U.S. jurisdiction", right_on="Geographic Area")
-)
-tb_census_df.head(5)
-```
-
-Having all of these columns is a little unwieldy. We could either drop the unneeded columns now, or just merge on smaller census `DataFrame`s. Let's do the latter.
-
-```{python}
-#| code-fold: false
-# try merging again, but cleaner this time
-tb_census_df = (
- tb_df
- .merge(right=census_2010s_df[["Geographic Area", "2019"]],
- left_on="U.S. jurisdiction", right_on="Geographic Area")
- .drop(columns="Geographic Area")
- .merge(right=census_2020s_df[["Geographic Area", "2020", "2021"]],
- left_on="U.S. jurisdiction", right_on="Geographic Area")
- .drop(columns="Geographic Area")
-)
-tb_census_df.head(5)
-```
-
-## Reproducing Data: Compute Incidence
-
-Let's recompute incidence to make sure we know where the original CDC numbers came from.
-
-From the [CDC report](https://www.cdc.gov/mmwr/volumes/71/wr/mm7112a1.htm?s_cid=mm7112a1_w#T1_down): TB incidence is computed as “Cases per 100,000 persons using mid-year population estimates from the U.S. Census Bureau.”
-
-If we define a group as 100,000 people, then we can compute the TB incidence for a given state population as
-
-$$\text{TB incidence} = \frac{\text{TB cases in population}}{\text{groups in population}} = \frac{\text{TB cases in population}}{\text{population}/100000} $$
-
-$$= \frac{\text{TB cases in population}}{\text{population}} \times 100000$$
-
-Let's try this for 2019:
-
-```{python}
-#| code-fold: false
-tb_census_df["recompute incidence 2019"] = tb_census_df["TB cases 2019"]/tb_census_df["2019"]*100000
-tb_census_df.head(5)
-```
-
-Awesome!!!
-
-Let's use a for-loop and `Python` format strings to compute TB incidence for all years. `Python` f-strings are just used for the purposes of this demo, but they're handy to know when you explore data beyond this course ([documentation](https://docs.python.org/3/tutorial/inputoutput.html)).
-
-```{python}
-#| code-fold: false
-# recompute incidence for all years
-for year in [2019, 2020, 2021]:
- tb_census_df[f"recompute incidence {year}"] = tb_census_df[f"TB cases {year}"]/tb_census_df[f"{year}"]*100000
-tb_census_df.head(5)
-```
-
-These numbers look pretty close!!! There are a few errors in the hundredths place, particularly in 2021. It may be useful to further explore reasons behind this discrepancy.
-
-```{python}
-#| code-fold: false
-tb_census_df.describe()
-```
-
-## Bonus EDA: Reproducing the Reported Statistic
-
-
-**How do we reproduce that reported statistic in the original [CDC report](https://www.cdc.gov/mmwr/volumes/71/wr/mm7112a1.htm?s_cid=mm7112a1_w)?**
-
-> Reported TB incidence (cases per 100,000 persons) increased **9.4%**, from **2.2** during 2020 to **2.4** during 2021 but was lower than incidence during 2019 (2.7). Increases occurred among both U.S.-born and non–U.S.-born persons.
-
-This is TB incidence computed across the entire U.S. population! How do we reproduce this?
-* We need to reproduce the "Total" TB incidences in our rolled record.
-* But our current `tb_census_df` only has 51 entries (50 states plus Washington, D.C.). There is no rolled record.
-* What happened...?
-
-Let's get exploring!
-
-Before we keep exploring, we'll set all indexes to more meaningful values, instead of just numbers that pertain to some row at some point. This will make our cleaning slightly easier.
-
-```{python}
-#| code-fold: true
-tb_df = tb_df.set_index("U.S. jurisdiction")
-tb_df.head(5)
-```
-
-```{python}
-#| code-fold: false
-census_2010s_df = census_2010s_df.set_index("Geographic Area")
-census_2010s_df.head(5)
-```
-
-```{python}
-#| code-fold: false
-census_2020s_df = census_2020s_df.set_index("Geographic Area")
-census_2020s_df.head(5)
-```
-
-It turns out that our merge above only kept state records, even though our original `tb_df` had the "Total" rolled record:
-
-```{python}
-#| code-fold: false
-tb_df.head()
-```
-
-Recall that `merge` does an **inner** merge by default, meaning that it only preserves keys that are present in **both** `DataFrame`s.
-
-The rolled records in our census `DataFrame` have different `Geographic Area` fields, which was the key we merged on:
-
-```{python}
-#| code-fold: false
-census_2010s_df.head(5)
-```
-
-The Census `DataFrame` has several rolled records. The aggregate record we are looking for actually has the Geographic Area named "United States".
-
-One straightforward way to get the right merge is to rename the value itself. Because we now have the Geographic Area index, we'll use `df.rename()` ([documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.rename.html)):
-
-```{python}
-#| code-fold: false
-# rename rolled record for 2010s
-census_2010s_df.rename(index={'United States':'Total'}, inplace=True)
-census_2010s_df.head(5)
-```
-
-```{python}
-#| code-fold: false
-# same, but for 2020s rename rolled record
-census_2020s_df.rename(index={'United States':'Total'}, inplace=True)
-census_2020s_df.head(5)
-```
-
-<br/>
-
-Next let's rerun our merge. Note the different chaining, because we are now merging on indexes (`df.merge()` [documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.merge.html)).
-
-```{python}
-#| code-fold: false
-tb_census_df = (
- tb_df
- .merge(right=census_2010s_df[["2019"]],
- left_index=True, right_index=True)
- .merge(right=census_2020s_df[["2020", "2021"]],
- left_index=True, right_index=True)
-)
-tb_census_df.head(5)
-```
-
-<br/>
-
-Finally, let's recompute our incidences:
-
-```{python}
-#| code-fold: false
-# recompute incidence for all years
-for year in [2019, 2020, 2021]:
- tb_census_df[f"recompute incidence {year}"] = tb_census_df[f"TB cases {year}"]/tb_census_df[f"{year}"]*100000
-tb_census_df.head(5)
-```
-
-We reproduced the total U.S. incidences correctly!
-
-We're almost there. Let's revisit the quote:
-
-> Reported TB incidence (cases per 100,000 persons) increased **9.4%**, from **2.2** during 2020 to **2.4** during 2021 but was lower than incidence during 2019 (2.7). Increases occurred among both U.S.-born and non–U.S.-born persons.
-
-Recall that percent change from $A$ to $B$ is computed as
-$\text{percent change} = \frac{B - A}{A} \times 100$.
-
-```{python}
-#| code-fold: false
-#| tags: []
-incidence_2020 = tb_census_df.loc['Total', 'recompute incidence 2020']
-incidence_2020
-```
-
-```{python}
-#| code-fold: false
-#| tags: []
-incidence_2021 = tb_census_df.loc['Total', 'recompute incidence 2021']
-incidence_2021
-```
-
-```{python}
-#| code-fold: false
-#| tags: []
-difference = (incidence_2021 - incidence_2020)/incidence_2020 * 100
-difference
-```
-
-# EDA Demo 2: Mauna Loa CO<sub>2</sub> Data -- A Lesson in Data Faithfulness
-
-[Mauna Loa Observatory](https://gml.noaa.gov/ccgg/trends/data.html) has been monitoring CO<sub>2</sub> concentrations since 1958.
-
-```{python}
-#| code-fold: false
-co2_file = "data/co2_mm_mlo.txt"
-```
-
-Let's do some **EDA**!!
-
-## Reading this file into Pandas?
-Let's instead check out this `.txt` file. Some questions to keep in mind: Do we trust this file extension? What is its structure?
-
-Lines 71-78 (inclusive) are shown below:
-
- line number | file contents
-
- 71 | # decimal average interpolated trend #days
- 72 | # date (season corr)
- 73 | 1958 3 1958.208 315.71 315.71 314.62 -1
- 74 | 1958 4 1958.292 317.45 317.45 315.29 -1
- 75 | 1958 5 1958.375 317.50 317.50 314.71 -1
- 76 | 1958 6 1958.458 -99.99 317.10 314.85 -1
- 77 | 1958 7 1958.542 315.86 315.86 314.98 -1
- 78 | 1958 8 1958.625 314.93 314.93 315.94 -1
-
-
-Notice how:
-
-- The values are separated by white space, possibly tabs.
-- The data line up down the rows; for example, the month appears in the 7th to 8th position of each line.
-- The 71st and 72nd lines in the file contain column headings split over two lines.
-
-We can use `read_csv` to read the data into a `pandas` `DataFrame`. We provide several arguments to specify that the separator is white space, that there is no header (**we will set our own column names**), and that the first 72 rows of the file should be skipped.
-
-```{python}
-#| code-fold: false
-co2 = pd.read_csv(
-    co2_file, header=None, skiprows=72,
-    sep=r'\s+'  # delimiter for continuous whitespace (stay tuned for regex next lecture)
-)
-co2.head()
-```
-
-Congratulations! You've wrangled the data!
-
-<br/>
-
-...But our columns aren't named.
-**We need to do more EDA.**
-
-## Exploring Variable Feature Types
-
-The NOAA [webpage](https://gml.noaa.gov/ccgg/trends/) might have some useful tidbits (in this case it doesn't).
-
-Using this information, we'll rerun `pd.read_csv`, but this time with some **custom column names.**
-
-```{python}
-#| code-fold: false
-co2 = pd.read_csv(
-    co2_file, header=None, skiprows=72,
-    sep=r'\s+',  # regex for continuous whitespace (next lecture)
-    names=['Yr', 'Mo', 'DecDate', 'Avg', 'Int', 'Trend', 'Days']
-)
-co2.head()
-```
-
-## Visualizing CO<sub>2</sub>
-Scientific studies tend to have very clean data, right...? Let's jump right in and make a time series plot of CO2 monthly averages.
-
-```{python}
-#| code-fold: true
-sns.lineplot(x='DecDate', y='Avg', data=co2);
-```
-
-The code above uses the `seaborn` plotting library (abbreviated `sns`). We will cover it in the Visualization lecture; for now, you don't need to worry about how it works!
-
-Yikes! Plotting the data uncovered a problem. The sharp vertical lines suggest that we have some **missing values**. What happened here?
-
-```{python}
-#| code-fold: false
-co2.head()
-```
-
-```{python}
-#| code-fold: false
-co2.tail()
-```
-
-Some data have unusual values like -1 and -99.99.
-
-Let's check the description at the top of the file again.
-
-* -1 signifies a missing value for the number of days `Days` the equipment was in operation that month.
-* -99.99 denotes a missing monthly average `Avg`
-
-How can we fix this? First, let's explore other aspects of our data. Understanding our data will help us decide what to do with the missing values.
-
-<br/>
-
-
-## Sanity Checks: Reasoning about the data
-First, we consider the shape of the data. How many rows should we have?
-
-* If the data are in chronological order, we should have one record per month.
-* The data run from March 1958 to August 2019.
-* We should have $ 12 \times (2019-1957) - 2 - 4 = 738 $ records: 62 calendar years (1958 through 2019) of 12 months each, minus the 2 missing months at the start of 1958 (January and February) and the 4 missing months at the end of 2019 (September through December).
-
-```{python}
-#| code-fold: false
-co2.shape
-```
-
-Nice!! The number of rows (i.e., records) matches our expectations.
-
-<br/>
-
-
-Let's now check the quality of each feature.
-
-## Understanding Missing Value 1: `Days`
-`Days` is a time field, so let's analyze other time fields to see if there is an explanation for missing values of days of operation.
-
-Let's start with **months**, `Mo`.
-
-Are we missing any records? Each month should appear 61 or 62 times (March 1958-August 2019).
-
-```{python}
-#| code-fold: false
-co2["Mo"].value_counts().sort_index()
-```
-
-As expected, Jan, Feb, Sep, Oct, Nov, and Dec have 61 occurrences, and the rest have 62.
-
-<br/>
-
-Next let's explore **days** `Days` itself, which is the number of days that the measurement equipment worked.
-
-```{python}
-#| code-fold: true
-sns.displot(co2['Days']);
-plt.title("Distribution of days feature"); # suppresses unneeded plotting output
-```
-
-In terms of data quality, a handful of months have averages based on measurements taken on fewer than half the days. In addition, there are nearly 200 missing values--**that's about 27% of the data**!
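-
-We can verify this claim directly. Here is a quick sketch that counts the -1 sentinel values in `Days` and their share of all records:
-
-```{python}
-#| code-fold: false
-# count the -1 sentinel values in Days and compute their share of all records
-n_missing_days = (co2['Days'] == -1).sum()
-n_missing_days, n_missing_days / len(co2)
-```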
-
-<br/>
-
-Finally, let's check the last time feature, **year** `Yr`.
-
-Let's check to see if there is any connection between missing-ness and the year of the recording.
-
-```{python}
-#| code-fold: true
-sns.scatterplot(x="Yr", y="Days", data=co2);
-plt.title("Day field by Year"); # the ; suppresses output
-```
-
-**Observations**:
-
-* All of the missing data are in the early years of operation.
-* It appears there may have been problems with equipment in the mid to late 80s.
-
-**Potential Next Steps**:
-
-* Confirm these explanations through documentation about the historical readings.
-* Maybe drop the earliest recordings? However, we would want to delay such action until after we have examined the time trends and assessed whether there are any potential problems.
-
-<br/>
-
-## Understanding Missing Value 2: `Avg`
-Next, let's return to the -99.99 values in `Avg` to analyze the overall quality of the CO<sub>2</sub> measurements. We'll plot a histogram of the average CO<sub>2</sub> measurements.
-
-```{python}
-#| code-fold: true
-# Histograms of average CO2 measurements
-sns.displot(co2['Avg']);
-```
-
-The non-missing values are in the 300-400 range (a regular range of CO2 levels).
-
-We also see that there are only a few missing `Avg` values (**<1% of values**). Let's examine all of them:
-
-```{python}
-#| code-fold: false
-co2[co2["Avg"] < 0]
-```
-
-There doesn't seem to be a pattern to these values, other than that most records also were missing `Days` data.
-
-## Drop, `NaN`, or Impute Missing `Avg` Data?
-
-How should we address the invalid `Avg` data?
-
-1. Drop records
-2. Set to NaN
-3. Impute using some strategy
-
-Remember we want to fix the following plot:
-
-```{python}
-#| code-fold: true
-sns.lineplot(x='DecDate', y='Avg', data=co2)
-plt.title("CO2 Average By Month");
-```
-
-Since we are plotting `Avg` vs `DecDate`, we should just focus on dealing with missing values for `Avg`.
-
-
-Let's consider a few options:
-1. Drop those records
-2. Replace -99.99 with NaN
-3. Substitute a likely value for the average CO<sub>2</sub>
-
-What do you think are the pros and cons of each possible action?
-
-<br/>
-
-
-Let's examine each of these three options.
-
-```{python}
-#| code-fold: false
-# 1. Drop missing values
-co2_drop = co2[co2['Avg'] > 0]
-co2_drop.head()
-```
-
-```{python}
-#| code-fold: false
-# 2. Replace -99.99 with NaN
-co2_NA = co2.replace(-99.99, np.nan)
-co2_NA.head()
-```
-
-We'll also use a third version of the data.
-
-First, we note that the dataset already comes with a **substitute value** for the -99.99.
-
-From the file description:
-
-> The `interpolated` column includes average values from the preceding column (`average`)
-and **interpolated values** where data are missing. Interpolated values are
-computed in two steps...
-
-The `Int` feature has values that exactly match those in `Avg`, except when `Avg` is -99.99, and then a **reasonable** estimate is used instead.
-
-So, the third version of our data will use the `Int` feature instead of `Avg`.
-
-```{python}
-#| code-fold: false
-# 3. Use interpolated column which estimates missing Avg values
-co2_impute = co2.copy()
-co2_impute['Avg'] = co2['Int']
-co2_impute.head()
-```
-
-What's a **reasonable** estimate?
-
-To answer this question, let's zoom in on a short time period, say the measurements in 1958 (where we know we have two missing values).
-
-```{python}
-#| code-fold: true
-# results of plotting data in 1958
-
-def line_and_points(data, ax, title):
- # assumes single year, hence Mo
- ax.plot('Mo', 'Avg', data=data)
- ax.scatter('Mo', 'Avg', data=data)
- ax.set_xlim(2, 13)
- ax.set_title(title)
- ax.set_xticks(np.arange(3, 13))
-
-def data_year(data, year):
-    return data[data["Yr"] == year]
-
-# uses matplotlib subplots
-# you may see more next week; focus on output for now
-fig, axes = plt.subplots(ncols = 3, figsize=(12, 4), sharey=True)
-
-year = 1958
-line_and_points(data_year(co2_drop, year), axes[0], title="1. Drop Missing")
-line_and_points(data_year(co2_NA, year), axes[1], title="2. Missing Set to NaN")
-line_and_points(data_year(co2_impute, year), axes[2], title="3. Missing Interpolated")
-
-fig.suptitle(f"Monthly Averages for {year}")
-plt.tight_layout()
-```
-
-In the big picture, since there are only 7 `Avg` values missing (**<1%** of 738 months), any of these approaches would work.
-
-However, there is some appeal to **option 3: Imputing**:
-
-* Shows seasonal trends for CO2
-* We are plotting all months in our data as a line plot
-
-<br/>
-
-
-Let's replot our original figure with option 3:
-
-```{python}
-#| code-fold: true
-sns.lineplot(x='DecDate', y='Avg', data=co2_impute)
-plt.title("CO2 Average By Month, Imputed");
-```
-
-Looks pretty close to what we see on the NOAA [website](https://gml.noaa.gov/ccgg/trends/)!
-
-## Presenting the data: A Discussion on Data Granularity
-
-From the description:
-
-* Monthly measurements are averages of daily average measurements.
-* The NOAA GML website has datasets for daily/hourly measurements too.
-
-The data you present depends on your research question.
-
-**How do CO2 levels vary by season?**
-
-* You might want to keep average monthly data.
-
-**Are CO2 levels rising over the past 50+ years, consistent with global warming predictions?**
-
-* You might be happier with a **coarser granularity** of average year data!
-
-```{python}
-#| code-fold: true
-co2_year = co2_impute.groupby('Yr').mean()
-sns.lineplot(x='Yr', y='Avg', data=co2_year)
-plt.title("CO2 Average By Year");
-```
-
-Indeed, we see a rise by nearly 100 ppm of CO2 since Mauna Loa began recording in 1958.
-
-# Summary
-We went over a lot of content in this lecture; let's summarize the most important points:
-
-## Dealing with Missing Values
-There are a few options we can take to deal with missing data:
-
-* Drop missing records
-* Keep `NaN` missing values
-* Impute using an interpolated column
-
-## EDA and Data Wrangling
-There are several ways to approach EDA and Data Wrangling:
-
-* Examine the **data and metadata**: what is the date, size, organization, and structure of the data?
-* Examine each **field/attribute/dimension** individually.
-* Examine pairs of related dimensions (e.g. breaking down grades by major).
-* Along the way, we can:
- * **Visualize** or summarize the data.
- * **Validate assumptions** about data and its collection process. Pay particular attention to when the data was collected.
- * Identify and **address anomalies**.
- * Apply data transformations and corrections (we'll cover this in the upcoming lecture).
- * **Record everything you do!** Developing in Jupyter Notebook promotes *reproducibility* of your own work!
+---
+title: Data Cleaning and EDA
+execute:
+ echo: true
+format:
+ html:
+ code-fold: true
+ code-tools: true
+ toc: true
+ toc-title: Data Cleaning and EDA
+ page-layout: full
+ theme:
+ - cosmo
+ - cerulean
+ callout-icon: false
+jupyter: python3
+---
+
+```{python}
+#| code-fold: true
+import numpy as np
+import pandas as pd
+
+import matplotlib.pyplot as plt
+import seaborn as sns
+#%matplotlib inline
+plt.rcParams['figure.figsize'] = (12, 9)
+
+sns.set()
+sns.set_context('talk')
+np.set_printoptions(threshold=20, precision=2, suppress=True)
+pd.set_option('display.max_rows', 30)
+pd.set_option('display.max_columns', None)
+pd.set_option('display.precision', 2)
+# This option stops scientific notation for pandas
+pd.set_option('display.float_format', '{:.2f}'.format)
+
+# Silence some spurious seaborn warnings
+import warnings
+warnings.filterwarnings("ignore", category=FutureWarning)
+```
+
+::: {.callout-note collapse="false"}
+## Learning Outcomes
+* Recognize common file formats
+* Categorize data by its variable type
+* Build awareness of issues with data faithfulness and develop targeted solutions
+:::
+
+**This content is covered in lectures 4, 5, and 6.**
+
+In the past few lectures, we've learned that `pandas` is a toolkit to restructure, modify, and explore a dataset. What we haven't yet touched on is *how* to make these data transformation decisions. When we receive a new set of data from the "real world," how do we know what processing we should do to convert this data into a usable form?
+
+**Data cleaning**, also called **data wrangling**, is the process of transforming raw data to facilitate subsequent analysis. It is often used to address issues like:
+
+* Unclear structure or formatting
+* Missing or corrupted values
+* Unit conversions
+* ...and so on
+
+**Exploratory Data Analysis (EDA)** is the process of understanding a new dataset. It is an open-ended, informal analysis that involves familiarizing ourselves with the variables present in the data, discovering potential hypotheses, and identifying possible issues with the data. This last point can often motivate further data cleaning to address any problems with the dataset's format; because of this, EDA and data cleaning are often thought of as an "infinite loop," with each process driving the other.
+
+In this lecture, we will examine the key properties of data to consider when performing data cleaning and EDA. In doing so, we'll develop a "checklist" of sorts for you to consider when approaching a new dataset. Throughout this process, we'll build a deeper understanding of this early (but very important!) stage of the data science lifecycle.
+
+## Structure
+
+### File Formats
+There are many file types for storing structured data: TSV, JSON, XML, ASCII, SAS, etc. We'll only cover CSV, TSV, and JSON in lecture, but you'll likely encounter other formats as you work with different datasets. Reading documentation is your best bet for understanding how to process the multitude of different file types.
+
+#### CSV
+CSVs, which stand for **Comma-Separated Values**, are a common tabular data format.
+In the past two `pandas` lectures, we briefly touched on the idea of file format: the way data is encoded in a file for storage. Specifically, our `elections` and `babynames` datasets were stored and loaded as CSVs:
+
+```{python}
+#| code-fold: false
+pd.read_csv("data/elections.csv").head(5)
+```
+
+To better understand the properties of a CSV, let's take a look at the first few rows of the raw data file to see what it looks like before being loaded into a `DataFrame`. We'll use the `repr()` function to return the raw string with its special characters:
+
+```{python}
+#| code-fold: false
+with open("data/elections.csv", "r") as table:
+ i = 0
+ for row in table:
+ print(repr(row))
+ i += 1
+ if i > 3:
+ break
+```
+
+Each row, or **record**, in the data is delimited by a newline `\n`. Each column, or **field**, in the data is delimited by a comma `,` (hence, comma-separated!).
+
+#### TSV
+
+Another common file type is **TSV (Tab-Separated Values)**. In a TSV, records are still delimited by a newline `\n`, while fields are delimited by the tab character `\t`.
+
+Let's check out the first few rows of the raw TSV file. Again, we'll use the `repr()` function so that `print` shows the special characters.
+
+```{python}
+#| code-fold: false
+with open("data/elections.txt", "r") as table:
+ i = 0
+ for row in table:
+ print(repr(row))
+ i += 1
+ if i > 3:
+ break
+```
+
+TSVs can be loaded into `pandas` using `pd.read_csv`. We'll need to specify the **delimiter** with the parameter `sep='\t'` [(documentation)](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html).
+
+```{python}
+#| code-fold: false
+pd.read_csv("data/elections.txt", sep='\t').head(3)
+```
+
+An issue with CSVs and TSVs comes up whenever there are commas or tabs within the records. How does `pandas` differentiate between a comma that is a delimiter and a comma within the field itself, for example `8,900`? To remedy this, check out the [`quotechar` parameter](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html).
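+
+As a quick illustration, here is a minimal sketch using a small in-memory string (the candidate rows are made up for this example, not taken from `elections.csv`): the quote characters keep the comma inside each name, and `thousands=','` parses `"8,900"` as a number.
+
+```{python}
+#| code-fold: false
+from io import StringIO
+
+# made-up two-row CSV: quoted fields protect the commas inside names,
+# and thousands=',' converts "8,900" into the integer 8900
+raw = 'Candidate,Votes\n"Washington, George","8,900"\n"Adams, John","4,500"\n'
+pd.read_csv(StringIO(raw), quotechar='"', thousands=',')
+```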
+
+#### JSON
+**JSON (JavaScript Object Notation)** files behave similarly to Python dictionaries. A raw JSON is shown below.
+
+```{python}
+#| code-fold: false
+with open("data/elections.json", "r") as table:
+ i = 0
+ for row in table:
+ print(row)
+ i += 1
+ if i > 8:
+ break
+```
+
+JSON files can be loaded into `pandas` using `pd.read_json`.
+
+```{python}
+#| code-fold: false
+pd.read_json('data/elections.json').head(3)
+```
+
+##### EDA with JSON: Berkeley COVID-19 Data
+The City of Berkeley Open Data [website](https://data.cityofberkeley.info/Health/COVID-19-Confirmed-Cases/xn6j-b766) has a dataset with COVID-19 Confirmed Cases among Berkeley residents by date. Let's download the file and save it as a JSON (note the source URL file type is also a JSON). In the interest of reproducible data science, we will download the data programmatically. We have defined some helper functions in the [`ds100_utils.py`](https://ds100.org/fa23/resources/assets/lectures/lec05/lec05-eda.html) file so that we can reuse these helper functions in many different notebooks.
+
+```{python}
+#| code-fold: false
+from ds100_utils import fetch_and_cache
+
+covid_file = fetch_and_cache(
+ "https://data.cityofberkeley.info/api/views/xn6j-b766/rows.json?accessType=DOWNLOAD",
+ "confirmed-cases.json",
+ force=False)
+covid_file # a file path wrapper object
+```
+
+###### File Size
+Let's start our analysis by getting a rough estimate of the size of the dataset to inform the tools we use to view the data. For relatively small datasets, we can use a text editor or spreadsheet. For larger datasets, more programmatic exploration or distributed computing tools may be more fitting. Here we will use `Python` tools to probe the file.
+
+Since this appears to be a text file, let's investigate the number of lines, which often corresponds to the number of records.
+
+```{python}
+#| code-fold: false
+import os
+
+print(covid_file, "is", os.path.getsize(covid_file) / 1e6, "MB")
+
+with open(covid_file, "r") as f:
+ print(covid_file, "is", sum(1 for l in f), "lines.")
+```
+
+###### Unix Commands
+As part of the EDA workflow, Unix commands can come in very handy. In fact, there's an entire book called ["Data Science at the Command Line"](https://datascienceatthecommandline.com/) that explores this idea in depth!
+In Jupyter/IPython, you can prefix lines with `!` to execute arbitrary Unix commands, and within those lines, you can refer to `Python` variables and expressions with the syntax `{expr}`.
+
+Here, we use the `ls` command to list files, using the `-lh` flags, which request "long format with information in human-readable form." We also use the `wc` command for "word count," but with the `-l` flag, which asks for line counts instead of words.
+
+These two give us the same information as the code above, albeit in a slightly different form:
+
+```{python}
+#| code-fold: false
+!ls -lh {covid_file}
+!wc -l {covid_file}
+```
+
+###### File Contents
+Let's explore the data format using `Python`.
+
+```{python}
+#| code-fold: false
+with open(covid_file, "r") as f:
+ for i, row in enumerate(f):
+ print(repr(row)) # print raw strings
+ if i >= 4: break
+```
+
+We can use the `head` Unix command (which is where `pandas`' `head` method comes from!) to see the first few lines of the file:
+
+```{python}
+#| code-fold: false
+!head -5 {covid_file}
+```
+
+In order to load the JSON file into `pandas`, let's first do some EDA with `Python`'s `json` package to understand the particular structure of this JSON file so that we can decide what (if anything) to load into `pandas`. `Python` has relatively good support for JSON data since it closely matches the internal `Python` object model. In the following cell we import the entire JSON datafile into a `Python` dictionary using the `json` package.
+
+```{python}
+#| code-fold: false
+import json
+
+with open(covid_file, "rb") as f:
+ covid_json = json.load(f)
+```
+
+The `covid_json` variable is now a dictionary encoding the data in the file:
+
+```{python}
+#| code-fold: false
+type(covid_json)
+```
+
+We can examine the keys in the top-level JSON object by listing them out.
+
+```{python}
+#| code-fold: false
+covid_json.keys()
+```
+
+**Observation**: The JSON dictionary contains a `meta` key, which likely refers to metadata (data about the data). Metadata is often maintained with the data and can be a good source of additional information.
+
+
+We can investigate the metadata further by examining its keys.
+
+```{python}
+#| code-fold: false
+covid_json['meta'].keys()
+```
+
+The `meta` key contains another dictionary called `view`. This likely refers to meta-data about a particular "view" of some underlying database. We will learn more about views when we study SQL later in the class.
+
+```{python}
+#| code-fold: false
+covid_json['meta']['view'].keys()
+```
+
+Notice that this is a nested/recursive data structure. As we dig deeper, we reveal more and more keys and the corresponding data:
+
+```
+meta
+|-> data
+ | ... (haven't explored yet)
+|-> view
+ | -> id
+ | -> name
+ | -> attribution
+ ...
+ | -> description
+ ...
+ | -> columns
+ ...
+```
+
+
+There is a key called `description` in the `view` sub-dictionary. This likely contains a description of the data:
+
+```{python}
+#| code-fold: false
+print(covid_json['meta']['view']['description'])
+```
+
+###### Examining the Data Field for Records
+
+We can look at a few entries in the `data` field. This is what we'll load into `pandas`.
+
+```{python}
+#| code-fold: false
+for i in range(3):
+ print(f"{i:03} | {covid_json['data'][i]}")
+```
+
+Observations:
+* These look like equal-length records, so maybe `data` is a table!
+* But what does each of the values in the record mean? Where can we find the column headers?
+
+For that, we'll need the `columns` key in the metadata dictionary. This returns a list:
+
+```{python}
+#| code-fold: false
+type(covid_json['meta']['view']['columns'])
+```
+
+###### Summary of exploring the JSON file
+
+1. The above **metadata** tells us a lot about the columns in the data, including column names, potential data anomalies, and a basic statistic.
+1. Because of its non-tabular structure, JSON makes it easier (than CSV) to create **self-documenting data**, meaning that information about the data is stored in the same file as the data.
+1. Self-documenting data can be helpful since it maintains its own description and these descriptions are more likely to be updated as data changes.
+
+###### Loading COVID Data into `pandas`
+Finally, let's load the data (not the metadata) into a `pandas` `DataFrame`. In the following block of code we:
+
+1. Translate the JSON records into a `DataFrame`:
+
+ * fields: `covid_json['meta']['view']['columns']`
+ * records: `covid_json['data']`
+
+
+1. Remove columns that have no metadata description. This would be a bad idea in general, but here we remove these columns since the above analysis suggests they are unlikely to contain useful information.
+
+1. Examine the `tail` of the table.
+
+```{python}
+#| code-fold: false
+# Load the data from JSON and assign column titles
+covid = pd.DataFrame(
+ covid_json['data'],
+ columns=[c['name'] for c in covid_json['meta']['view']['columns']])
+
+covid.tail()
+```
+
+### Variable Types
+
+After loading data into a `DataFrame`, it's a good idea to take the time to understand what pieces of information are encoded in the dataset. In particular, we want to identify what variable types are present in our data. Broadly speaking, we can categorize variables into one of two overarching types.
+
+**Quantitative variables** describe some numeric quantity or amount. We can divide quantitative data further into:
+
+* **Continuous quantitative variables**: numeric data that can be measured on a continuous scale to arbitrary precision. Continuous variables do not have a strict set of possible values – they can be recorded to any number of decimal places. For example, weights, GPA, or CO<sub>2</sub> concentrations.
+* **Discrete quantitative variables**: numeric data that can only take on a finite set of possible values. For example, someone's age or the number of siblings they have.
+
+**Qualitative variables**, also known as **categorical variables**, describe data that isn't measuring some quantity or amount. The sub-categories of categorical data are:
+
+* **Ordinal qualitative variables**: categories with ordered levels. Specifically, ordinal variables are those where the difference between levels has no consistent, quantifiable meaning. Some examples include levels of education (high school, undergrad, grad, etc.), income bracket (low, medium, high), or Yelp rating.
+* **Nominal qualitative variables**: categories with no specific order. For example, someone's political affiliation or Cal ID number.
+
+![Classification of variable types](images/variable.png)
+
+Note that many variables don't sit neatly in just one of these categories. Qualitative variables could have numeric levels, and conversely, quantitative variables could be stored as strings.
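+
+To make this concrete, here is a small sketch with made-up values: the `pandas` storage type (dtype) tells us how the data is stored, not what kind of variable it is.
+
+```{python}
+#| code-fold: false
+# made-up illustration: dtype (storage type) vs. variable type
+pd.DataFrame({
+    "zip_code": [94720, 94704],   # int64 dtype, but a nominal qualitative variable
+    "gpa": ["3.70", "3.95"],      # object (string) dtype, but continuous quantitative
+    "yelp_rating": [4, 5],        # int64 dtype, but ordinal qualitative
+}).dtypes
+```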
+
+### Primary and Foreign Keys
+
+Last time, we introduced `.merge` as the `pandas` method for joining multiple `DataFrame`s together. In our discussion of joins, we touched on the idea of using a "key" to determine what rows should be merged from each table. Let's take a moment to examine this idea more closely.
+
+The **primary key** is the column or set of columns in a table that *uniquely* determine the values of the remaining columns. It can be thought of as the unique identifier for each individual row in the table. For example, a table of Data 100 students might use each student's Cal ID as the primary key.
+
+```{python}
+#| echo: false
+pd.DataFrame({"Cal ID":[3034619471, 3035619472, 3025619473, 3046789372], \
+ "Name":["Oski", "Ollie", "Orrie", "Ollie"], \
+ "Major":["Data Science", "Computer Science", "Data Science", "Economics"]})
+```
+
+The **foreign key** is the column or set of columns in a table that reference primary keys in other tables. Knowing a dataset's foreign keys can be useful when assigning the `left_on` and `right_on` parameters of `.merge`. In the table of office hour tickets below, `"Cal ID"` is a foreign key referencing the previous table.
+
+```{python}
+#| echo: false
+pd.DataFrame({"OH Request":[1, 2, 3, 4], \
+ "Cal ID":[3034619471, 3035619472, 3025619473, 3035619472], \
+ "Question":["HW 2 Q1", "HW 2 Q3", "Lab 3 Q4", "HW 2 Q7"]})
+```
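+
+As a sketch of how these keys drive a join, we can recreate the two toy tables above and merge the office hour tickets with the student table on `"Cal ID"` (the primary key of the first table and a foreign key in the second):
+
+```{python}
+#| code-fold: false
+# recreate the two toy tables from above
+students = pd.DataFrame({"Cal ID": [3034619471, 3035619472, 3025619473, 3046789372],
+                         "Name": ["Oski", "Ollie", "Orrie", "Ollie"],
+                         "Major": ["Data Science", "Computer Science", "Data Science", "Economics"]})
+tickets = pd.DataFrame({"OH Request": [1, 2, 3, 4],
+                        "Cal ID": [3034619471, 3035619472, 3025619473, 3035619472],
+                        "Question": ["HW 2 Q1", "HW 2 Q3", "Lab 3 Q4", "HW 2 Q7"]})
+
+# the foreign key in `tickets` matches the primary key in `students`
+tickets.merge(right=students, left_on="Cal ID", right_on="Cal ID")
+```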
+
+## Granularity, Scope, and Temporality
+
+After understanding the structure of the dataset, the next task is to determine what exactly the data represents. We'll do so by considering the data's granularity, scope, and temporality.
+
+### Granularity
+The **granularity** of a dataset is what a single row represents. You can also think of it as the level of detail included in the data. To determine the data's granularity, ask: what does each row in the dataset represent? Fine-grained data contains a high level of detail, with a single row representing a small individual unit. For example, each record may represent one person. Coarse-grained data is encoded such that a single row represents a large individual unit – for example, each record may represent a group of people.
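+
+For instance, here is a minimal sketch using the `elections` data loaded earlier (assuming it has a `"Year"` column, as the earlier preview suggests): each row is one candidate in one election year, and grouping by year coarsens the granularity to one row per election.
+
+```{python}
+#| code-fold: false
+# fine-grained: one row per candidate per election year
+elections = pd.read_csv("data/elections.csv")
+
+# coarser-grained: one row per election year (here, just counting candidates)
+elections.groupby("Year").size().head(5)
+```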
+
+### Scope
+The **scope** of a dataset is the subset of the population covered by the data. If we were investigating student performance in Data Science courses, a dataset with a narrow scope might encompass all students enrolled in Data 100, whereas a dataset with an expansive scope might encompass all students in California.
+
+### Temporality
+The **temporality** of a dataset describes the periodicity over which the data was collected as well as when the data was most recently collected or updated.
+
+Time and date fields of a dataset could represent a few things:
+
+1. when the "event" happened
+2. when the data was collected, or when it was entered into the system
+3. when the data was copied into the database
+
+To fully understand the temporality of the data, it also may be necessary to standardize time zones or inspect recurring time-based trends in the data (do patterns recur in 24-hour periods? Over the course of a month? Seasonally?). The convention for standardizing time is Coordinated Universal Time (UTC), an international time standard measured at 0 degrees longitude that stays consistent throughout the year (no daylight savings). Berkeley's time zone, Pacific Standard Time (PST), is UTC-8; during daylight saving time (PDT), it is UTC-7.
+
+#### Temporality with `pandas`' `dt` accessors
+Let's briefly look at how we can use `pandas`' `dt` accessors to work with dates/times, using the dataset you'll see in Lab 3: the Berkeley PD Calls for Service dataset.
+
+```{python}
+#| code-fold: true
+calls = pd.read_csv("data/Berkeley_PD_-_Calls_for_Service.csv")
+calls.head()
+```
+
+Looks like there are three columns with dates/times: `EVENTDT`, `EVENTTM`, and `InDbDate`.
+
+Most likely, `EVENTDT` stands for the date when the event took place, `EVENTTM` stands for the time of day the event took place (in 24-hr format), and `InDbDate` is the date this call was recorded in the database.
+
+If we check the data type of these columns, we will see they are stored as strings. We can convert them to `datetime` objects using the `pandas` `to_datetime` function.
+
+```{python}
+#| code-fold: false
+calls["EVENTDT"] = pd.to_datetime(calls["EVENTDT"])
+calls.head()
+```
+
+Now, we can use the `dt` accessor on this column.
+
+We can get the month:
+
+```{python}
+#| code-fold: false
+calls["EVENTDT"].dt.month.head()
+```
+
+Which day of the week the date is on:
+
+```{python}
+#| code-fold: false
+calls["EVENTDT"].dt.dayofweek.head()
+```
+
+Check the minimum values to see if there are any suspicious-looking dates from the 1970s:
+
+```{python}
+#| code-fold: false
+calls.sort_values("EVENTDT").head()
+```
+
+Doesn't look like it! We are good!
+
+
+We can also do many things with the `dt` accessor like switching time zones and converting time back to UNIX/POSIX time. Check out the documentation on [`.dt` accessor](https://pandas.pydata.org/docs/user_guide/basics.html#basics-dt-accessors) and [time series/date functionality](https://pandas.pydata.org/docs/user_guide/timeseries.html#).
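+
+As a minimal sketch of both ideas (assuming the `EVENTDT` timestamps are time-zone-naive after `to_datetime`, which is the case above):
+
+```{python}
+#| code-fold: false
+# localize the naive timestamps to Berkeley's time zone, then convert to UTC
+event_utc = (calls["EVENTDT"]
+             .dt.tz_localize("America/Los_Angeles")
+             .dt.tz_convert("UTC"))
+
+# convert to UNIX/POSIX time (seconds since 1970-01-01 UTC)
+unix_seconds = (event_utc - pd.Timestamp("1970-01-01", tz="UTC")) // pd.Timedelta("1s")
+unix_seconds.head()
+```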
+
+## Faithfulness
+
+At this stage in our data cleaning and EDA workflow, we've achieved quite a lot: we've identified how our data is structured, come to terms with what information it encodes, and gained insight as to how it was generated. Throughout this process, we should always recall the original intent of our work in Data Science – to use data to better understand and model the real world. To achieve this goal, we need to ensure that the data we use is faithful to reality; that is, that our data accurately captures the "real world."
+
+Data used in research or industry is often "messy" – there may be errors or inaccuracies that impact the faithfulness of the dataset. Signs that data may not be faithful include:
+
+* Unrealistic or "incorrect" values, such as negative counts, locations that don't exist, or dates set in the future
+* Violations of obvious dependencies, like an age that does not match a birthday
+* Clear signs that data was entered by hand, which can lead to spelling errors or fields that are incorrectly shifted
+* Signs of data falsification, such as fake email addresses or repeated use of the same names
+* Duplicated records or fields containing the same information
+* Truncated data, e.g. older versions of Microsoft Excel limited spreadsheets to 65,536 rows and 256 columns
+
+We often solve some of these more common issues in the following ways:
+
+* Spelling errors: apply corrections or drop records that aren't in a dictionary
+* Time zone inconsistencies: convert to a common time zone (e.g. UTC)
+* Duplicated records or fields: identify and eliminate duplicates (using primary keys)
+* Unspecified or inconsistent units: infer the units and check that values are in reasonable ranges in the data
+
+### Missing Values
+Another common issue encountered with real-world datasets is that of missing data. One strategy to resolve this is to simply drop any records with missing values from the dataset. This does, however, introduce the risk of inducing biases – it is possible that the missing or corrupt records may be systemically related to some feature of interest in the data. Another solution is to keep the data as `NaN` values.
+
+A third method to address missing data is to perform **imputation**: infer the missing values using other data available in the dataset. There is a wide variety of imputation techniques that can be implemented; some of the most common are listed below.
+
+* Average imputation: replace missing values with the average value for that field
+* Hot deck imputation: replace missing values with some random value
+* Regression imputation: develop a model to predict missing values
+* Multiple imputation: replace missing values with multiple random values
+
+Regardless of the strategy used to deal with missing data, we should think carefully about *why* particular records or fields may be missing – this can help inform whether or not the absence of these values is significant or meaningful.
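+
+As a minimal sketch of the first two strategies and of average imputation (on a made-up `Series`, not one of this lecture's datasets):
+
+```{python}
+#| code-fold: false
+# made-up data with one missing value
+s = pd.Series([2.0, np.nan, 4.0, 6.0])
+
+dropped = s.dropna()           # drop the missing record
+imputed = s.fillna(s.mean())   # average imputation: fill with the mean of the observed values
+imputed
+```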
+
+# EDA Demo 1: Tuberculosis in the United States
+
+Now, let's walk through the data-cleaning and EDA workflow to see what we can learn about the presence of Tuberculosis in the United States!
+
+We will examine the data included in the [original CDC article](https://www.cdc.gov/mmwr/volumes/71/wr/mm7112a1.htm?s_cid=mm7112a1_w#T1_down) published in 2021.
+
+
+## CSVs and Field Names
+Suppose Table 1 was saved as a CSV file located in `data/cdc_tuberculosis.csv`.
+
+We can then explore the CSV (which is a text file, and does not contain binary-encoded data) in many ways:
+1. Using a text editor like emacs, vim, VSCode, etc.
+2. Opening the CSV directly in DataHub (read-only), Excel, Google Sheets, etc.
+3. The `Python` file object
+4. `pandas`, using `pd.read_csv()`
+
+To try out options 1 and 2, you can view or download the Tuberculosis dataset from the [lecture demo notebook](https://data100.datahub.berkeley.edu/hub/user-redirect/git-pull?repo=https%3A%2F%2Fgithub.com%2FDS-100%2Ffa23-student&urlpath=lab%2Ftree%2Ffa23-student%2Flecture%2Flec05%2Flec04-eda.ipynb&branch=main) under the `data` folder in the left hand menu. Notice how the CSV file is a type of **rectangular data (i.e., tabular data) stored as comma-separated values**.
+
+Next, let's try out option 3 using the `Python` file object. We'll look at the first four lines:
+
+```{python}
+#| code-fold: true
+with open("data/cdc_tuberculosis.csv", "r") as f:
+ i = 0
+ for row in f:
+ print(row)
+ i += 1
+ if i > 3:
+ break
+```
+
+Whoa, why are there blank lines interspaced between the lines of the CSV?
+
+You may recall that all line breaks in text files are encoded as the special newline character `\n`. `Python`'s `print()` prints each string (which already ends in a newline) and then adds an extra newline of its own, which is why we see the blank lines.
+
+If you're curious, we can use the `repr()` function to return the raw string with all special characters:
+
+```{python}
+#| code-fold: true
+with open("data/cdc_tuberculosis.csv", "r") as f:
+ i = 0
+ for row in f:
+ print(repr(row)) # print raw strings
+ i += 1
+ if i > 3:
+ break
+```
+
+Finally, let's try option 4 and use the tried-and-true Data 100 approach: `pandas`.
+
+```{python}
+#| code-fold: false
+tb_df = pd.read_csv("data/cdc_tuberculosis.csv")
+tb_df.head()
+```
+
+You may notice some strange things about this table: what's up with the "Unnamed" column names and the first row?
+
+Congratulations — you're ready to wrangle your data! Because of how things are stored, we'll need to clean the data a bit to name our columns better.
+
+A reasonable first step is to identify the row with the right header. The `pd.read_csv()` function ([documentation](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html)) has the convenient `header` parameter that we can set to use the elements in row 1 as the appropriate columns:
+
+```{python}
+#| code-fold: false
+tb_df = pd.read_csv("data/cdc_tuberculosis.csv", header=1) # row index
+tb_df.head(5)
+```
+
+Wait...but now we can't differentiate between the "Number of TB cases" and "TB incidence" year columns. `pandas` has tried to make our lives easier by automatically adding ".1" to the latter columns, but this doesn't help us, as humans, understand the data.
+
+We can do this manually with `df.rename()` ([documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.rename.html?highlight=rename#pandas.DataFrame.rename)):
+
+```{python}
+#| code-fold: false
+rename_dict = {'2019': 'TB cases 2019',
+ '2020': 'TB cases 2020',
+ '2021': 'TB cases 2021',
+ '2019.1': 'TB incidence 2019',
+ '2020.1': 'TB incidence 2020',
+ '2021.1': 'TB incidence 2021'}
+tb_df = tb_df.rename(columns=rename_dict)
+tb_df.head(5)
+```
+
+## Record Granularity
+
+You might already be wondering: what's up with that first record?
+
+Row 0 is what we call a **rollup record**, or summary record. It's often useful when displaying tables to humans. The **granularity** of record 0 (Totals) vs the rest of the records (States) is different.
+
+Okay, EDA step two. How was the rollup record aggregated?
+
+Let's check if the Total TB cases are the sum of all state TB cases. If we sum over all rows, we should get **2x** the total cases in each of the TB cases columns (why do you think this is?).
+
+```{python}
+#| code-fold: true
+tb_df.sum(axis=0)
+```
+
+Whoa, what's going on with the TB cases in 2019, 2020, and 2021? Check out the column types:
+
+```{python}
+#| code-fold: true
+tb_df.dtypes
+```
+
+Since there are commas in the values for TB cases, the numbers are read as the `object` datatype, or **storage type** (close to the `Python` string datatype), so `pandas` is concatenating strings instead of adding integers (recall that `Python` can "sum", or concatenate, strings together: `"data" + "100"` evaluates to `"data100"`).
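+
+To see this concretely, here is a tiny sketch with made-up values: "summing" an `object` (string) `Series` concatenates the strings instead of adding numbers.
+
+```{python}
+#| code-fold: false
+# made-up values: summing strings concatenates them
+pd.Series(["1,234", "567"]).sum()
+```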
+
+
+Fortunately `read_csv` also has a `thousands` parameter ([documentation](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html)):
+
+```{python}
+#| code-fold: false
+# improve readability: chaining method calls with outer parentheses/line breaks
+tb_df = (
+ pd.read_csv("data/cdc_tuberculosis.csv", header=1, thousands=',')
+ .rename(columns=rename_dict)
+)
+tb_df.head(5)
+```
+
+```{python}
+#| code-fold: false
+tb_df.sum()
+```
+
+The Total TB cases look right. Phew!
+
+Let's just look at the records with **state-level granularity**:
+
+```{python}
+#| code-fold: true
+state_tb_df = tb_df[1:]
+state_tb_df.head(5)
+```
+
+## Gather Census Data
+
+U.S. Census population estimates [source](https://www.census.gov/data/tables/time-series/demo/popest/2010s-state-total.html) (2019), [source](https://www.census.gov/data/tables/time-series/demo/popest/2020s-state-total.html) (2020-2021).
+
+Running the below cells cleans the data.
+There are a few new methods here:
+* `df.convert_dtypes()` ([documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.convert_dtypes.html)) conveniently converts all float dtypes into ints; it is out of scope for this class.
+* `df.dropna()` ([documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.dropna.html)) will be explained in more detail next time.
+
+```{python}
+#| code-fold: true
+# 2010s census data
+census_2010s_df = pd.read_csv("data/nst-est2019-01.csv", header=3, thousands=",")
+census_2010s_df = (
+ census_2010s_df
+ .reset_index()
+ .drop(columns=["index", "Census", "Estimates Base"])
+ .rename(columns={"Unnamed: 0": "Geographic Area"})
+ .convert_dtypes() # "smart" converting of columns, use at your own risk
+ .dropna() # we'll introduce this next time
+)
+census_2010s_df['Geographic Area'] = census_2010s_df['Geographic Area'].str.strip('.')
+
+# with pd.option_context('display.min_rows', 30): # shows more rows
+# display(census_2010s_df)
+
+census_2010s_df.head(5)
+```
+
+Occasionally, you will want to modify code that you have imported. To reimport those modifications you can either use `python`'s `importlib` library:
+
+```python
+from importlib import reload
+reload(utils)
+```
+
+or use `iPython` magic which will intelligently import code when files change:
+
+```python
+%load_ext autoreload
+%autoreload 2
+```
+
+```{python}
+#| code-fold: true
+# census 2020s data
+census_2020s_df = pd.read_csv("data/NST-EST2022-POP.csv", header=3, thousands=",")
+census_2020s_df = (
+ census_2020s_df
+ .reset_index()
+ .drop(columns=["index", "Unnamed: 1"])
+ .rename(columns={"Unnamed: 0": "Geographic Area"})
+ .convert_dtypes() # "smart" converting of columns, use at your own risk
+ .dropna() # we'll introduce this next time
+)
+census_2020s_df['Geographic Area'] = census_2020s_df['Geographic Area'].str.strip('.')
+
+census_2020s_df.head(5)
+```
+
+## Joining Data (Merging `DataFrame`s)
+
+Time to `merge`! Here we use the `DataFrame` method `df1.merge(right=df2, ...)` on `DataFrame` `df1` ([documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.merge.html)). Contrast this with the function `pd.merge(left=df1, right=df2, ...)` ([documentation](https://pandas.pydata.org/docs/reference/api/pandas.merge.html?highlight=pandas%20merge#pandas.merge)). Feel free to use either.
+
+```{python}
+#| code-fold: false
+# merge TB DataFrame with two US census DataFrames
+tb_census_df = (
+ tb_df
+ .merge(right=census_2010s_df,
+ left_on="U.S. jurisdiction", right_on="Geographic Area")
+ .merge(right=census_2020s_df,
+ left_on="U.S. jurisdiction", right_on="Geographic Area")
+)
+tb_census_df.head(5)
+```
+
+Having all of these columns is a little unwieldy. We could either drop the unneeded columns now, or just merge on smaller census `DataFrame`s. Let's do the latter.
+
+```{python}
+#| code-fold: false
+# try merging again, but cleaner this time
+tb_census_df = (
+ tb_df
+ .merge(right=census_2010s_df[["Geographic Area", "2019"]],
+ left_on="U.S. jurisdiction", right_on="Geographic Area")
+ .drop(columns="Geographic Area")
+ .merge(right=census_2020s_df[["Geographic Area", "2020", "2021"]],
+ left_on="U.S. jurisdiction", right_on="Geographic Area")
+ .drop(columns="Geographic Area")
+)
+tb_census_df.head(5)
+```
+
+## Reproducing Data: Compute Incidence
+
+Let's recompute incidence to make sure we know where the original CDC numbers came from.
+
+From the [CDC report](https://www.cdc.gov/mmwr/volumes/71/wr/mm7112a1.htm?s_cid=mm7112a1_w#T1_down): TB incidence is computed as “Cases per 100,000 persons using mid-year population estimates from the U.S. Census Bureau.”
+
+If we define a group as 100,000 people, then we can compute the TB incidence for a given state population as
+
+$$\text{TB incidence} = \frac{\text{TB cases in population}}{\text{groups in population}} = \frac{\text{TB cases in population}}{\text{population}/100000} $$
+
+$$= \frac{\text{TB cases in population}}{\text{population}} \times 100000$$
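+
+For example, a hypothetical state with 600 TB cases and a population of 3,000,000 (i.e., 30 groups of 100,000 people) would have a TB incidence of $600 / 30 = 20$ cases per 100,000.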
+
+Let's try this for 2019:
+
+```{python}
+#| code-fold: false
+tb_census_df["recompute incidence 2019"] = tb_census_df["TB cases 2019"]/tb_census_df["2019"]*100000
+tb_census_df.head(5)
+```
+
+Awesome!!!
+
+Let's use a for-loop and `Python` format strings to compute TB incidence for all years. `Python` f-strings are just used for the purposes of this demo, but they're handy to know when you explore data beyond this course ([documentation](https://docs.python.org/3/tutorial/inputoutput.html)).
+
+```{python}
+#| code-fold: false
+# recompute incidence for all years
+for year in [2019, 2020, 2021]:
+ tb_census_df[f"recompute incidence {year}"] = tb_census_df[f"TB cases {year}"]/tb_census_df[f"{year}"]*100000
+tb_census_df.head(5)
+```
+
+These numbers look pretty close!!! There are a few errors in the hundredths place, particularly in 2021. It may be useful to further explore reasons behind this discrepancy.
+
+```{python}
+#| code-fold: false
+tb_census_df.describe()
+```
+
+## Bonus EDA: Reproducing the Reported Statistic
+
+
+**How do we reproduce that reported statistic in the original [CDC report](https://www.cdc.gov/mmwr/volumes/71/wr/mm7112a1.htm?s_cid=mm7112a1_w)?**
+
+> Reported TB incidence (cases per 100,000 persons) increased **9.4%**, from **2.2** during 2020 to **2.4** during 2021 but was lower than incidence during 2019 (2.7). Increases occurred among both U.S.-born and non–U.S.-born persons.
+
+This is TB incidence computed across the entire U.S. population! How do we reproduce this?
+* We need to reproduce the "Total" TB incidences in our rolled record.
+* But our current `tb_census_df` only has 51 entries (50 states plus Washington, D.C.). There is no rolled record.
+* What happened...?
+
+Let's get exploring!
+
+Before we keep exploring, we'll set the index of each `DataFrame` to a more meaningful value than the default row numbers. This will make our cleaning slightly easier.
+
+```{python}
+#| code-fold: true
+tb_df = tb_df.set_index("U.S. jurisdiction")
+tb_df.head(5)
+```
+
+```{python}
+#| code-fold: false
+census_2010s_df = census_2010s_df.set_index("Geographic Area")
+census_2010s_df.head(5)
+```
+
+```{python}
+#| code-fold: false
+census_2020s_df = census_2020s_df.set_index("Geographic Area")
+census_2020s_df.head(5)
+```
+
+It turns out that our merge above only kept state records, even though our original `tb_df` had the "Total" rolled record:
+
+```{python}
+#| code-fold: false
+tb_df.head()
+```
+
+Recall that `merge` does an **inner** merge by default, meaning that it only preserves keys that are present in **both** `DataFrame`s.
+
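+One way to see exactly which keys an inner merge would drop is to temporarily switch to an outer merge with `indicator=True`. Here is a diagnostic sketch, not part of our cleaning (the `debug` name is just for illustration):
+
+```python
+# Sketch: rows tagged left_only / right_only are the ones an inner merge discards
+debug = tb_df.merge(right=census_2010s_df[["2019"]],
+                    left_index=True, right_index=True,
+                    how="outer", indicator=True)
+debug[debug["_merge"] != "both"]
+```
+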
+The rolled records in our census `DataFrame` have different `Geographic Area` fields, which was the key we merged on:
+
+```{python}
+#| code-fold: false
+census_2010s_df.head(5)
+```
+
+The Census `DataFrame` has several rolled records. The aggregate record we are looking for actually has the Geographic Area named "United States".
+
+One straightforward way to get the right merge is to rename the value itself. Because we now have the Geographic Area index, we'll use `df.rename()` ([documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.rename.html)):
+
+```{python}
+#| code-fold: false
+# rename rolled record for 2010s
+census_2010s_df.rename(index={'United States':'Total'}, inplace=True)
+census_2010s_df.head(5)
+```
+
+```{python}
+#| code-fold: false
+# same, but for 2020s rename rolled record
+census_2020s_df.rename(index={'United States':'Total'}, inplace=True)
+census_2020s_df.head(5)
+```
+
+<br/>
+
+Next let's rerun our merge. Note the different chaining, because we are now merging on indexes (`df.merge()` [documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.merge.html)).
+
+```{python}
+#| code-fold: false
+tb_census_df = (
+ tb_df
+ .merge(right=census_2010s_df[["2019"]],
+ left_index=True, right_index=True)
+ .merge(right=census_2020s_df[["2020", "2021"]],
+ left_index=True, right_index=True)
+)
+tb_census_df.head(5)
+```
+
+<br/>
+
+Finally, let's recompute our incidences:
+
+```{python}
+#| code-fold: false
+# recompute incidence for all years
+for year in [2019, 2020, 2021]:
+ tb_census_df[f"recompute incidence {year}"] = tb_census_df[f"TB cases {year}"]/tb_census_df[f"{year}"]*100000
+tb_census_df.head(5)
+```
+
+We reproduced the total U.S. incidences correctly!
+
+We're almost there. Let's revisit the quote:
+
+> Reported TB incidence (cases per 100,000 persons) increased **9.4%**, from **2.2** during 2020 to **2.4** during 2021 but was lower than incidence during 2019 (2.7). Increases occurred among both U.S.-born and non–U.S.-born persons.
+
+Recall that percent change from $A$ to $B$ is computed as
+$\text{percent change} = \frac{B - A}{A} \times 100$.
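+
+Plugging in the rounded values from the quote gives $(2.4 - 2.2)/2.2 \times 100 \approx 9.1\%$; the reported **9.4%** presumably comes from the unrounded incidences, which is what we compute next.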
+
+```{python}
+#| code-fold: false
+#| tags: []
+incidence_2020 = tb_census_df.loc['Total', 'recompute incidence 2020']
+incidence_2020
+```
+
+```{python}
+#| code-fold: false
+#| tags: []
+incidence_2021 = tb_census_df.loc['Total', 'recompute incidence 2021']
+incidence_2021
+```
+
+```{python}
+#| code-fold: false
+#| tags: []
+difference = (incidence_2021 - incidence_2020)/incidence_2020 * 100
+difference
+```
+
+# EDA Demo 2: Mauna Loa CO<sub>2</sub> Data -- A Lesson in Data Faithfulness
+
+[Mauna Loa Observatory](https://gml.noaa.gov/ccgg/trends/data.html) has been monitoring CO<sub>2</sub> concentrations since 1958.
+
+```{python}
+#| code-fold: false
+co2_file = "data/co2_mm_mlo.txt"
+```
+
+Let's do some **EDA**!!
+
+## Reading this file into Pandas?
+Let's instead check out this `.txt` file directly. Some questions to keep in mind: Do we trust this file extension? What structure does the file have?
+
+Lines 71-78 (inclusive) are shown below:
+
+ line number | file contents
+
+ 71 | # decimal average interpolated trend #days
+ 72 | # date (season corr)
+ 73 | 1958 3 1958.208 315.71 315.71 314.62 -1
+ 74 | 1958 4 1958.292 317.45 317.45 315.29 -1
+ 75 | 1958 5 1958.375 317.50 317.50 314.71 -1
+ 76 | 1958 6 1958.458 -99.99 317.10 314.85 -1
+ 77 | 1958 7 1958.542 315.86 315.86 314.98 -1
+ 78 | 1958 8 1958.625 314.93 314.93 315.94 -1
+
+
+Notice how:
+
+- The values are separated by white space, possibly tabs.
+- The data line up in fixed columns down the rows. For example, the month appears in the 7th to 8th position of each line.
+- The 71st and 72nd lines in the file contain column headings split over two lines.
+
+We can use `read_csv` to read the data into a `pandas` `DataFrame`, and we provide several arguments to specify that the separators are white space, there is no header (**we will set our own column names**), and to skip the first 72 rows of the file.
+
+```{python}
+#| code-fold: false
+co2 = pd.read_csv(
+ co2_file, header = None, skiprows = 72,
+    sep = r'\s+'  # delimiter for continuous whitespace (stay tuned for regex next lecture)
+)
+co2.head()
+```
+
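+As an aside, since the values also line up positionally, `pandas`' fixed-width reader would likely parse this file too. A sketch we won't use further (the `co2_fwf` name is just for illustration):
+
+```python
+# Sketch: an alternative read using the fixed-width-file reader
+co2_fwf = pd.read_fwf(co2_file, skiprows=72, header=None)
+```
+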
+Congratulations! You've wrangled the data!
+
+<br/>
+
+...But our columns aren't named.
+**We need to do more EDA.**
+
+## Exploring Variable Feature Types
+
+The NOAA [webpage](https://gml.noaa.gov/ccgg/trends/) might have some useful tidbits (in this case it doesn't, but the header lines of the file itself describe each column).
+
+Using that information, we'll rerun `pd.read_csv`, but this time with some **custom column names.**
+
+```{python}
+#| code-fold: false
+co2 = pd.read_csv(
+ co2_file, header = None, skiprows = 72,
+    sep = r'\s+', # regex for continuous whitespace (next lecture)
+ names = ['Yr', 'Mo', 'DecDate', 'Avg', 'Int', 'Trend', 'Days']
+)
+co2.head()
+```
+
+## Visualizing CO<sub>2</sub>
+Scientific studies tend to have very clean data, right...? Let's jump right in and make a time series plot of CO2 monthly averages.
+
+```{python}
+#| code-fold: true
+sns.lineplot(x='DecDate', y='Avg', data=co2);
+```
+
+The code above uses the `seaborn` plotting library (abbreviated `sns`). We will cover it in the Visualization lecture; for now, you don't need to worry about how it works!
+
+Yikes! Plotting the data uncovered a problem. The sharp vertical lines suggest that we have some **missing values**. What happened here?
+
+```{python}
+#| code-fold: false
+co2.head()
+```
+
+```{python}
+#| code-fold: false
+co2.tail()
+```
+
+Some data have unusual values like -1 and -99.99.
+
+Let's check the description at the top of the file again.
+
+* -1 signifies a missing value for the number of days `Days` the equipment was in operation that month.
+* -99.99 denotes a missing monthly average `Avg`.
+
+How can we fix this? First, let's explore other aspects of our data. Understanding our data will help us decide what to do with the missing values.
+
+<br/>
+
+
+## Sanity Checks: Reasoning about the data
+First, we consider the shape of the data. How many rows should we have?
+
+* If the data are in chronological order, we should have one record per month.
+* The data run from March 1958 to August 2019.
+* We should have $12 \times (2019-1957) - 2 - 4 = 738$ records: 12 months for each of the 62 years from 1958 through 2019, minus the absent Jan-Feb 1958 and Sep-Dec 2019.
+
+```{python}
+#| code-fold: false
+co2.shape
+```
+
+Nice!! The number of rows (i.e., records) matches our expectations.
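+
+As a quick sketch, we can spell out that arithmetic and compare it against the shape we just printed (variable names here are just for illustration):
+
+```python
+# Sketch: expected number of monthly records
+n_years = 2019 - 1957              # 1958 through 2019, inclusive
+n_absent = 2 + 4                   # Jan-Feb 1958 and Sep-Dec 2019 are not in the file
+print(12 * n_years - n_absent)     # 738
+print(co2.shape[0] == 12 * n_years - n_absent)
+```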
+
+<br/>
+
+
+Let's now check the quality of each feature.
+
+## Understanding Missing Value 1: `Days`
+`Days` is a time field, so let's analyze other time fields to see if there is an explanation for missing values of days of operation.
+
+Let's start with **months**, `Mo`.
+
+Are we missing any records? Each month should appear 61 or 62 times (March 1958 to August 2019).
+
+```{python}
+#| code-fold: false
+co2["Mo"].value_counts().sort_index()
+```
+
+As expected, Jan, Feb, Sep, Oct, Nov, and Dec have 61 occurrences, and the rest have 62.
+
+<br/>
+
+Next let's explore **days** `Days` itself, which is the number of days that the measurement equipment worked.
+
+```{python}
+#| code-fold: true
+sns.displot(co2['Days']);
+plt.title("Distribution of days feature"); # suppresses unneeded plotting output
+```
+
+In terms of data quality, a handful of months have averages based on measurements taken on fewer than half the days. In addition, there are nearly 200 missing values, **about 27% of the data**!
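+
+We can confirm those figures directly from the -1 sentinel values; a quick sketch:
+
+```python
+# Sketch: count the -1 sentinel values that mark missing Days
+n_missing = (co2['Days'] == -1).sum()
+print(n_missing, f"({n_missing / len(co2):.0%} of records)")
+```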
+
+<br/>
+
+Finally, let's check the last time feature, **year** `Yr`.
+
+Let's check to see if there is any connection between missing-ness and the year of the recording.
+
+```{python}
+#| code-fold: true
+sns.scatterplot(x="Yr", y="Days", data=co2);
+plt.title("Day field by Year"); # the ; suppresses output
+```
+
+**Observations**:
+
+* All of the missing data are in the early years of operation.
+* It appears there may have been problems with equipment in the mid to late 80s.
+
+**Potential Next Steps**:
+
+* Confirm these explanations through documentation about the historical readings.
+* Maybe drop the earliest recordings? However, we would want to delay such action until after we have examined the time trends and assessed whether there are any potential problems.
+
+<br/>
+
+## Understanding Missing Value 2: `Avg`
+Next, let's return to the -99.99 values in `Avg` to analyze the overall quality of the CO2 measurements. We'll plot a histogram of the average CO<sub>2</sub> measurements.
+
+```{python}
+#| code-fold: true
+# Histograms of average CO2 measurements
+sns.displot(co2['Avg']);
+```
+
+The non-missing values are in the 300-400 range (a regular range of CO2 levels).
+
+We also see that there are only a few missing `Avg` values (**<1% of values**). Let's examine all of them:
+
+```{python}
+#| code-fold: false
+co2[co2["Avg"] < 0]
+```
+
+There doesn't seem to be a pattern to these values, other than that most records also were missing `Days` data.
+
+## Drop, `NaN`, or Impute Missing `Avg` Data?
+
+How should we address the invalid `Avg` data?
+
+1. Drop records
+2. Set to NaN
+3. Impute using some strategy
+
+Remember we want to fix the following plot:
+
+```{python}
+#| code-fold: true
+sns.lineplot(x='DecDate', y='Avg', data=co2)
+plt.title("CO2 Average By Month");
+```
+
+Since we are plotting `Avg` vs `DecDate`, we should just focus on dealing with missing values for `Avg`.
+
+
+Let's consider a few options:
+1. Drop those records
+2. Replace -99.99 with NaN
+3. Substitute a likely value for the missing average CO2
+
+What do you think are the pros and cons of each possible action?
+
+<br/>
+
+
+Let's examine each of these three options.
+
+```{python}
+#| code-fold: false
+# 1. Drop missing values
+co2_drop = co2[co2['Avg'] > 0]
+co2_drop.head()
+```
+
+```{python}
+#| code-fold: false
+# 2. Replace -99.99 with NaN
+co2_NA = co2.replace(-99.99, np.nan)
+co2_NA.head()
+```
+
+We'll also use a third version of the data.
+
+First, we note that the dataset already comes with a **substitute value** for the -99.99.
+
+From the file description:
+
+> The `interpolated` column includes average values from the preceding column (`average`)
+and **interpolated values** where data are missing. Interpolated values are
+computed in two steps...
+
+The `Int` feature has values that exactly match those in `Avg`, except when `Avg` is -99.99, in which case a **reasonable** estimate is used instead.
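+
+A quick sketch of a check that is consistent with that description (the `valid` name is just for illustration):
+
+```python
+# Sketch: Int agrees with Avg everywhere except the -99.99 sentinel rows
+valid = co2['Avg'] > 0
+print((co2.loc[valid, 'Avg'] == co2.loc[valid, 'Int']).all())   # expect True
+print((~valid).sum())                                           # the handful of sentinel rows
+```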
+
+So, the third version of our data will use the `Int` feature instead of `Avg`.
+
+```{python}
+#| code-fold: false
+# 3. Use interpolated column which estimates missing Avg values
+co2_impute = co2.copy()
+co2_impute['Avg'] = co2['Int']
+co2_impute.head()
+```
+
+What's a **reasonable** estimate?
+
+To answer this question, let's zoom in on a short time period, say the measurements in 1958 (where we know we have two missing values).
+
+```{python}
+#| code-fold: true
+# results of plotting data in 1958
+
+def line_and_points(data, ax, title):
+ # assumes single year, hence Mo
+ ax.plot('Mo', 'Avg', data=data)
+ ax.scatter('Mo', 'Avg', data=data)
+ ax.set_xlim(2, 13)
+ ax.set_title(title)
+ ax.set_xticks(np.arange(3, 13))
+
+def data_year(data, year):
+    return data[data["Yr"] == year]
+
+# uses matplotlib subplots
+# you may see more next week; focus on output for now
+fig, axes = plt.subplots(ncols = 3, figsize=(12, 4), sharey=True)
+
+year = 1958
+line_and_points(data_year(co2_drop, year), axes[0], title="1. Drop Missing")
+line_and_points(data_year(co2_NA, year), axes[1], title="2. Missing Set to NaN")
+line_and_points(data_year(co2_impute, year), axes[2], title="3. Missing Interpolated")
+
+fig.suptitle(f"Monthly Averages for {year}")
+plt.tight_layout()
+```
+
+In the big picture, since there are only 7 `Avg` values missing (**<1%** of 738 months), any of these approaches would work.
+
+However, there is some appeal to **option 3: imputing**:
+
+* It shows the seasonal trends for CO2.
+* We are plotting all months in our data as a line plot, so a value for every month keeps the line unbroken.
+
+<br/>
+
+
+Let's replot our original figure with option 3:
+
+```{python}
+#| code-fold: true
+sns.lineplot(x='DecDate', y='Avg', data=co2_impute)
+plt.title("CO2 Average By Month, Imputed");
+```
+
+Looks pretty close to what we see on the NOAA [website](https://gml.noaa.gov/ccgg/trends/)!
+
+## Presenting the data: A Discussion on Data Granularity
+
+From the description:
+
+* Monthly measurements are averages of daily average measurements.
+* The NOAA GML website has datasets for daily/hourly measurements too.
+
+The data you present depends on your research question.
+
+**How do CO2 levels vary by season?**
+
+* You might want to keep average monthly data.
+
+**Are CO2 levels rising over the past 50+ years, consistent with global warming predictions?**
+
+* You might be happier with a **coarser granularity** of average year data!
+
+```{python}
+#| code-fold: true
+co2_year = co2_impute.groupby('Yr').mean()
+sns.lineplot(x='Yr', y='Avg', data=co2_year)
+plt.title("CO2 Average By Year");
+```
+
+Indeed, we see a rise by nearly 100 ppm of CO2 since Mauna Loa began recording in 1958.
+
+# Summary
+We went over a lot of content this lecture; let's summarize the most important points:
+
+## Dealing with Missing Values
+There are a few options we can take to deal with missing data:
+
+* Drop missing records
+* Keep `NaN` missing values
+* Impute using an interpolated column
+
+## EDA and Data Wrangling
+There are several ways to approach EDA and Data Wrangling:
+
+* Examine the **data and metadata**: what is the date, size, organization, and structure of the data?
+* Examine each **field/attribute/dimension** individually.
+* Examine pairs of related dimensions (e.g. breaking down grades by major).
+* Along the way, we can:
+ * **Visualize** or summarize the data.
+ * **Validate assumptions** about data and its collection process. Pay particular attention to when the data was collected.
+ * Identify and **address anomalies**.
+ * Apply data transformations and corrections (we'll cover this in the upcoming lecture).
+ * **Record everything you do!** Developing in Jupyter Notebook promotes *reproducibility* of your own work!
diff --git a/docs/eda/eda_files/figure-html/cell-62-output-1.png b/docs/eda/eda_files/figure-html/cell-62-output-1.png
index a04218cf..f392d5f9 100644
Binary files a/docs/eda/eda_files/figure-html/cell-62-output-1.png and b/docs/eda/eda_files/figure-html/cell-62-output-1.png differ
diff --git a/docs/eda/eda_files/figure-html/cell-67-output-1.png b/docs/eda/eda_files/figure-html/cell-67-output-1.png
new file mode 100644
index 00000000..be96b8c9
Binary files /dev/null and b/docs/eda/eda_files/figure-html/cell-67-output-1.png differ
diff --git a/docs/eda/eda_files/figure-html/cell-67-output-2.png b/docs/eda/eda_files/figure-html/cell-67-output-2.png
deleted file mode 100644
index 31857f62..00000000
Binary files a/docs/eda/eda_files/figure-html/cell-67-output-2.png and /dev/null differ
diff --git a/docs/eda/eda_files/figure-html/cell-68-output-1.png b/docs/eda/eda_files/figure-html/cell-68-output-1.png
index 67c3959d..ffd29ff8 100644
Binary files a/docs/eda/eda_files/figure-html/cell-68-output-1.png and b/docs/eda/eda_files/figure-html/cell-68-output-1.png differ
diff --git a/docs/eda/eda_files/figure-html/cell-69-output-1.png b/docs/eda/eda_files/figure-html/cell-69-output-1.png
new file mode 100644
index 00000000..29088928
Binary files /dev/null and b/docs/eda/eda_files/figure-html/cell-69-output-1.png differ
diff --git a/docs/eda/eda_files/figure-html/cell-69-output-2.png b/docs/eda/eda_files/figure-html/cell-69-output-2.png
deleted file mode 100644
index fb28f5d5..00000000
Binary files a/docs/eda/eda_files/figure-html/cell-69-output-2.png and /dev/null differ
diff --git a/docs/eda/eda_files/figure-html/cell-71-output-1.png b/docs/eda/eda_files/figure-html/cell-71-output-1.png
index 39cac822..49ef3d6a 100644
Binary files a/docs/eda/eda_files/figure-html/cell-71-output-1.png and b/docs/eda/eda_files/figure-html/cell-71-output-1.png differ
diff --git a/docs/eda/eda_files/figure-html/cell-75-output-1.png b/docs/eda/eda_files/figure-html/cell-75-output-1.png
index 6382e58a..15a5fe82 100644
Binary files a/docs/eda/eda_files/figure-html/cell-75-output-1.png and b/docs/eda/eda_files/figure-html/cell-75-output-1.png differ
diff --git a/docs/eda/eda_files/figure-html/cell-76-output-1.png b/docs/eda/eda_files/figure-html/cell-76-output-1.png
index db2b0dee..40b1fc71 100644
Binary files a/docs/eda/eda_files/figure-html/cell-76-output-1.png and b/docs/eda/eda_files/figure-html/cell-76-output-1.png differ
diff --git a/docs/eda/eda_files/figure-html/cell-77-output-1.png b/docs/eda/eda_files/figure-html/cell-77-output-1.png
index 897b8b39..99b6c2d1 100644
Binary files a/docs/eda/eda_files/figure-html/cell-77-output-1.png and b/docs/eda/eda_files/figure-html/cell-77-output-1.png differ
diff --git a/docs/feature_engineering/feature_engineering.html b/docs/feature_engineering/feature_engineering.html
index ea770e7f..22d26788 100644
--- a/docs/feature_engineering/feature_engineering.html
+++ b/docs/feature_engineering/feature_engineering.html
@@ -556,7 +556,7 @@
my_model.fit(X, Y)
-LinearRegression()
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.LinearRegression()
+LinearRegression()
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.LinearRegression()
Notice that we use double brackets to extract this column. Why double brackets instead of just single brackets? The .fit
method, by default, expects to receive 2-dimensional data – some kind of data that includes both rows and columns. Writing penguins["flipper_length_mm"]
would return a 1D Series
, causing sklearn
to error. We avoid this by writing penguins[["flipper_length_mm"]]
to produce a 2D DataFrame
.
@@ -607,7 +607,7 @@
print(f"The RMSE of the model is {np.sqrt(np.mean((Y-Y_hat_two_features)**2))}")
-The RMSE of the model is 0.9881331104079044
+The RMSE of the model is 0.9881331104079045
# 3. Use interpolated column which estimates missing Avg values
-co2_impute = co2.copy()
-co2_impute['Avg'] = co2['Int']
-co2_impute.head()
# 3. Use interpolated column which estimates missing Avg values
+co2_impute = co2.copy()
+co2_impute['Avg'] = co2['Int']
+co2_impute.head()
@@ -4564,30 +4552,30 @@
Code
-# results of plotting data in 1958
-
-def line_and_points(data, ax, title):
- # assumes single year, hence Mo
- ax.plot('Mo', 'Avg', data=data)
- ax.scatter('Mo', 'Avg', data=data)
- ax.set_xlim(2, 13)
- ax.set_title(title)
- ax.set_xticks(np.arange(3, 13))
-
-def data_year(data, year):
- return data[data["Yr"] == 1958]
-
-# uses matplotlib subplots
-# you may see more next week; focus on output for now
-fig, axes = plt.subplots(ncols = 3, figsize=(12, 4), sharey=True)
-
-year = 1958
-line_and_points(data_year(co2_drop, year), axes[0], title="1. Drop Missing")
-line_and_points(data_year(co2_NA, year), axes[1], title="2. Missing Set to NaN")
-line_and_points(data_year(co2_impute, year), axes[2], title="3. Missing Interpolated")
-
-fig.suptitle(f"Monthly Averages for {year}")
-plt.tight_layout()
+# results of plotting data in 1958
+
+def line_and_points(data, ax, title):
+ # assumes single year, hence Mo
+ ax.plot('Mo', 'Avg', data=data)
+ ax.scatter('Mo', 'Avg', data=data)
+ ax.set_xlim(2, 13)
+ ax.set_title(title)
+ ax.set_xticks(np.arange(3, 13))
+
+def data_year(data, year):
+ return data[data["Yr"] == 1958]
+
+# uses matplotlib subplots
+# you may see more next week; focus on output for now
+fig, axes = plt.subplots(ncols = 3, figsize=(12, 4), sharey=True)
+
+year = 1958
+line_and_points(data_year(co2_drop, year), axes[0], title="1. Drop Missing")
+line_and_points(data_year(co2_NA, year), axes[1], title="2. Missing Set to NaN")
+line_and_points(data_year(co2_impute, year), axes[2], title="3. Missing Interpolated")
+
+fig.suptitle(f"Monthly Averages for {year}")
+plt.tight_layout()
@@ -4604,8 +4592,8 @@
Code
-
+
@@ -4632,9 +4620,9 @@
Code
-
+
@@ -4975,1218 +4963,1218 @@ <
Source Code
----
-title: Data Cleaning and EDA
-execute:
- echo: true
-format:
- html:
- code-fold: true
- code-tools: true
- toc: true
- toc-title: Data Cleaning and EDA
- page-layout: full
- theme:
- - cosmo
- - cerulean
- callout-icon: false
-jupyter: python3
----
-
-```{python}
-#| code-fold: true
-import numpy as np
-import pandas as pd
-
-import matplotlib.pyplot as plt
-import seaborn as sns
-#%matplotlib inline
-plt.rcParams['figure.figsize'] = (12, 9)
-
-sns.set()
-sns.set_context('talk')
-np.set_printoptions(threshold=20, precision=2, suppress=True)
-pd.set_option('display.max_rows', 30)
-pd.set_option('display.max_columns', None)
-pd.set_option('display.precision', 2)
-# This option stops scientific notation for pandas
-pd.set_option('display.float_format', '{:.2f}'.format)
-
-# Silence some spurious seaborn warnings
-import warnings
-warnings.filterwarnings("ignore", category=FutureWarning)
-```
-
-::: {.callout-note collapse="false"}
-## Learning Outcomes
-* Recognize common file formats
-* Categorize data by its variable type
-* Build awareness of issues with data faithfulness and develop targeted solutions
-:::
-
-**This content is covered in lectures 4, 5, and 6.**
-
-In the past few lectures, we've learned that `pandas` is a toolkit to restructure, modify, and explore a dataset. What we haven't yet touched on is *how* to make these data transformation decisions. When we receive a new set of data from the "real world," how do we know what processing we should do to convert this data into a usable form?
-
-**Data cleaning**, also called **data wrangling**, is the process of transforming raw data to facilitate subsequent analysis. It is often used to address issues like:
-
-* Unclear structure or formatting
-* Missing or corrupted values
-* Unit conversions
-* ...and so on
-
-**Exploratory Data Analysis (EDA)** is the process of understanding a new dataset. It is an open-ended, informal analysis that involves familiarizing ourselves with the variables present in the data, discovering potential hypotheses, and identifying possible issues with the data. This last point can often motivate further data cleaning to address any problems with the dataset's format; because of this, EDA and data cleaning are often thought of as an "infinite loop," with each process driving the other.
-
-In this lecture, we will consider the key properties of data to consider when performing data cleaning and EDA. In doing so, we'll develop a "checklist" of sorts for you to consider when approaching a new dataset. Throughout this process, we'll build a deeper understanding of this early (but very important!) stage of the data science lifecycle.
-
-## Structure
-
-### File Formats
-There are many file types for storing structured data: TSV, JSON, XML, ASCII, SAS, etc. We'll only cover CSV, TSV, and JSON in lecture, but you'll likely encounter other formats as you work with different datasets. Reading documentation is your best bet for understanding how to process the multitude of different file types.
-
-#### CSV
-CSVs, which stand for **Comma-Separated Values**, are a common tabular data format.
-In the past two `pandas` lectures, we briefly touched on the idea of file format: the way data is encoded in a file for storage. Specifically, our `elections` and `babynames` datasets were stored and loaded as CSVs:
-
-```{python}
-#| code-fold: false
-pd.read_csv("data/elections.csv").head(5)
-```
-
-To better understand the properties of a CSV, let's take a look at the first few rows of the raw data file to see what it looks like before being loaded into a `DataFrame`. We'll use the `repr()` function to return the raw string with its special characters:
-
-```{python}
-#| code-fold: false
-with open("data/elections.csv", "r") as table:
- i = 0
- for row in table:
- print(repr(row))
- i += 1
- if i > 3:
- break
-```
-
-Each row, or **record**, in the data is delimited by a newline `\n`. Each column, or **field**, in the data is delimited by a comma `,` (hence, comma-separated!).
-
-#### TSV
-
-Another common file type is **TSV (Tab-Separated Values)**. In a TSV, records are still delimited by a newline `\n`, while fields are delimited by `\t` tab character.
-
-Let's check out the first few rows of the raw TSV file. Again, we'll use the `repr()` function so that `print` shows the special characters.
-
-```{python}
-#| code-fold: false
-with open("data/elections.txt", "r") as table:
- i = 0
- for row in table:
- print(repr(row))
- i += 1
- if i > 3:
- break
-```
-
-TSVs can be loaded into `pandas` using `pd.read_csv`. We'll need to specify the **delimiter** with parameter` sep='\t'` [(documentation)](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html).
-
-```{python}
-#| code-fold: false
-pd.read_csv("data/elections.txt", sep='\t').head(3)
-```
-
-An issue with CSVs and TSVs comes up whenever there are commas or tabs within the records. How does `pandas` differentiate between a comma delimiter vs. a comma within the field itself, for example `8,900`? To remedy this, check out the [`quotechar` parameter](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html).
-
-#### JSON
-**JSON (JavaScript Object Notation)** files behave similarly to Python dictionaries. A raw JSON is shown below.
-
-```{python}
-#| code-fold: false
-with open("data/elections.json", "r") as table:
- i = 0
- for row in table:
- print(row)
- i += 1
- if i > 8:
- break
-```
-
-JSON files can be loaded into `pandas` using `pd.read_json`.
-
-```{python}
-#| code-fold: false
-pd.read_json('data/elections.json').head(3)
-```
-
-##### EDA with JSON: Berkeley COVID-19 Data
-The City of Berkeley Open Data [website](https://data.cityofberkeley.info/Health/COVID-19-Confirmed-Cases/xn6j-b766) has a dataset with COVID-19 Confirmed Cases among Berkeley residents by date. Let's download the file and save it as a JSON (note the source URL file type is also a JSON). In the interest of reproducible data science, we will download the data programatically. We have defined some helper functions in the [`ds100_utils.py`](https://ds100.org/fa23/resources/assets/lectures/lec05/lec05-eda.html) file that we can reuse these helper functions in many different notebooks.
-
-```{python}
-#| code-fold: false
-from ds100_utils import fetch_and_cache
-
-covid_file = fetch_and_cache(
- "https://data.cityofberkeley.info/api/views/xn6j-b766/rows.json?accessType=DOWNLOAD",
- "confirmed-cases.json",
- force=False)
-covid_file # a file path wrapper object
-```
-
-###### File Size
-Let's start our analysis by getting a rough estimate of the size of the dataset to inform the tools we use to view the data. For relatively small datasets, we can use a text editor or spreadsheet. For larger datasets, more programmatic exploration or distributed computing tools may be more fitting. Here we will use `Python` tools to probe the file.
-
-Since there seem to be text files, let's investigate the number of lines, which often corresponds to the number of records
-
-```{python}
-#| code-fold: false
-import os
-
-print(covid_file, "is", os.path.getsize(covid_file) / 1e6, "MB")
-
-with open(covid_file, "r") as f:
- print(covid_file, "is", sum(1 for l in f), "lines.")
-```
-
-###### Unix Commands
-As part of the EDA workflow, Unix commands can come in very handy. In fact, there's an entire book called ["Data Science at the Command Line"](https://datascienceatthecommandline.com/) that explores this idea in depth!
-In Jupyter/IPython, you can prefix lines with `!` to execute arbitrary Unix commands, and within those lines, you can refer to `Python` variables and expressions with the syntax `{expr}`.
-
-Here, we use the `ls` command to list files, using the `-lh` flags, which request "long format with information in human-readable form." We also use the `wc` command for "word count," but with the `-l` flag, which asks for line counts instead of words.
-
-These two give us the same information as the code above, albeit in a slightly different form:
-
-```{python}
-#| code-fold: false
-!ls -lh {covid_file}
-!wc -l {covid_file}
-```
-
-###### File Contents
-Let's explore the data format using `Python`.
-
-```{python}
-#| code-fold: false
-with open(covid_file, "r") as f:
- for i, row in enumerate(f):
- print(repr(row)) # print raw strings
- if i >= 4: break
-```
-
-We can use the `head` Unix command (which is where `pandas`' `head` method comes from!) to see the first few lines of the file:
-
-```{python}
-#| code-fold: false
-!head -5 {covid_file}
-```
-
-In order to load the JSON file into `pandas`, Let's first do some EDA with `Python`'s `json` package to understand the particular structure of this JSON file so that we can decide what (if anything) to load into `pandas`. `Python` has relatively good support for JSON data since it closely matches the internal python object model. In the following cell we import the entire JSON datafile into a python dictionary using the `json` package.
-
-```{python}
-#| code-fold: false
-import json
-
-with open(covid_file, "rb") as f:
- covid_json = json.load(f)
-```
-
-The `covid_json` variable is now a dictionary encoding the data in the file:
-
-```{python}
-#| code-fold: false
-type(covid_json)
-```
-
-We can examine what keys are in the top level json object by listing out the keys.
-
-```{python}
-#| code-fold: false
-covid_json.keys()
-```
-
-**Observation**: The JSON dictionary contains a `meta` key which likely refers to meta data (data about the data). Meta data often maintained with the data and can be a good source of additional information.
-
-
-We can investigate the meta data further by examining the keys associated with the metadata.
-
-```{python}
-#| code-fold: false
-covid_json['meta'].keys()
-```
-
-The `meta` key contains another dictionary called `view`. This likely refers to meta-data about a particular "view" of some underlying database. We will learn more about views when we study SQL later in the class.
-
-```{python}
-#| code-fold: false
-covid_json['meta']['view'].keys()
-```
-
-Notice that this a nested/recursive data structure. As we dig deeper we reveal more and more keys and the corresponding data:
-
-```
-meta
-|-> data
- | ... (haven't explored yet)
-|-> view
- | -> id
- | -> name
- | -> attribution
- ...
- | -> description
- ...
- | -> columns
- ...
-```
-
-
-There is a key called description in the view sub dictionary. This likely contains a description of the data:
-
-```{python}
-#| code-fold: false
-print(covid_json['meta']['view']['description'])
-```
-
-###### Examining the Data Field for Records
-
-We can look at a few entries in the `data` field. This is what we'll load into `pandas`.
-
-```{python}
-#| code-fold: false
-for i in range(3):
- print(f"{i:03} | {covid_json['data'][i]}")
-```
-
-Observations:
-* These look like equal-length records, so maybe `data` is a table!
-* But what do each of values in the record mean? Where can we find column headers?
-
-For that, we'll need the `columns` key in the metadata dictionary. This returns a list:
-
-```{python}
-#| code-fold: false
-type(covid_json['meta']['view']['columns'])
-```
-
-###### Summary of exploring the JSON file
-
-1. The above **metadata** tells us a lot about the columns in the data including column names, potential data anomalies, and a basic statistic.
-1. Because of its non-tabular structure, JSON makes it easier (than CSV) to create **self-documenting data**, meaning that information about the data is stored in the same file as the data.
-1. Self-documenting data can be helpful since it maintains its own description and these descriptions are more likely to be updated as data changes.
-
-###### Loading COVID Data into `pandas`
-Finally, let's load the data (not the metadata) into a `pandas` `DataFrame`. In the following block of code we:
-
-1. Translate the JSON records into a `DataFrame`:
-
- * fields: `covid_json['meta']['view']['columns']`
- * records: `covid_json['data']`
-
-
-1. Remove columns that have no metadata description. This would be a bad idea in general, but here we remove these columns since the above analysis suggests they are unlikely to contain useful information.
-
-1. Examine the `tail` of the table.
-
-```{python}
-#| code-fold: false
-# Load the data from JSON and assign column titles
-covid = pd.DataFrame(
- covid_json['data'],
- columns=[c['name'] for c in covid_json['meta']['view']['columns']])
-
-covid.tail()
-```
-
-### Variable Types
-
-After loading data into a file, it's a good idea to take the time to understand what pieces of information are encoded in the dataset. In particular, we want to identify what variable types are present in our data. Broadly speaking, we can categorize variables into one of two overarching types.
-
-**Quantitative variables** describe some numeric quantity or amount. We can divide quantitative data further into:
-
-* **Continuous quantitative variables**: numeric data that can be measured on a continuous scale to arbitrary precision. Continuous variables do not have a strict set of possible values – they can be recorded to any number of decimal places. For example, weights, GPA, or CO<sub>2</sub> concentrations.
-* **Discrete quantitative variables**: numeric data that can only take on a finite set of possible values. For example, someone's age or the number of siblings they have.
-
-**Qualitative variables**, also known as **categorical variables**, describe data that isn't measuring some quantity or amount. The sub-categories of categorical data are:
-
-* **Ordinal qualitative variables**: categories with ordered levels. Specifically, ordinal variables are those where the difference between levels has no consistent, quantifiable meaning. Some examples include levels of education (high school, undergrad, grad, etc.), income bracket (low, medium, high), or Yelp rating.
-* **Nominal qualitative variables**: categories with no specific order. For example, someone's political affiliation or Cal ID number.
-
-![Classification of variable types](images/variable.png)
-
-Note that many variables don't sit neatly in just one of these categories. Qualitative variables could have numeric levels, and conversely, quantitative variables could be stored as strings.
-
-### Primary and Foreign Keys
-
-Last time, we introduced `.merge` as the `pandas` method for joining multiple `DataFrame`s together. In our discussion of joins, we touched on the idea of using a "key" to determine what rows should be merged from each table. Let's take a moment to examine this idea more closely.
-
-The **primary key** is the column or set of columns in a table that *uniquely* determine the values of the remaining columns. It can be thought of as the unique identifier for each individual row in the table. For example, a table of Data 100 students might use each student's Cal ID as the primary key.
-
-```{python}
-#| echo: false
-pd.DataFrame({"Cal ID":[3034619471, 3035619472, 3025619473, 3046789372], \
- "Name":["Oski", "Ollie", "Orrie", "Ollie"], \
- "Major":["Data Science", "Computer Science", "Data Science", "Economics"]})
-```
-
-The **foreign key** is the column or set of columns in a table that reference primary keys in other tables. Knowing a dataset's foreign keys can be useful when assigning the `left_on` and `right_on` parameters of `.merge`. In the table of office hour tickets below, `"Cal ID"` is a foreign key referencing the previous table.
-
-```{python}
-#| echo: false
-pd.DataFrame({"OH Request":[1, 2, 3, 4], \
- "Cal ID":[3034619471, 3035619472, 3025619473, 3035619472], \
- "Question":["HW 2 Q1", "HW 2 Q3", "Lab 3 Q4", "HW 2 Q7"]})
-```
-
-## Granularity, Scope, and Temporality
-
-After understanding the structure of the dataset, the next task is to determine what exactly the data represents. We'll do so by considering the data's granularity, scope, and temporality.
-
-### Granularity
-The **granularity** of a dataset is what a single row represents. You can also think of it as the level of detail included in the data. To determine the data's granularity, ask: what does each row in the dataset represent? Fine-grained data contains a high level of detail, with a single row representing a small individual unit. For example, each record may represent one person. Coarse-grained data is encoded such that a single row represents a large individual unit – for example, each record may represent a group of people.
-
-### Scope
-The **scope** of a dataset is the subset of the population covered by the data. If we were investigating student performance in Data Science courses, a dataset with a narrow scope might encompass all students enrolled in Data 100 whereas a dataset with an expansive scope might encompass all students in California.
-
-### Temporality
-The **temporality** of a dataset describes the periodicity over which the data was collected as well as when the data was most recently collected or updated.
-
-Time and date fields of a dataset could represent a few things:
-
-1. when the "event" happened
-2. when the data was collected, or when it was entered into the system
-3. when the data was copied into the database
-
-To fully understand the temporality of the data, it also may be necessary to standardize time zones or inspect recurring time-based trends in the data (do patterns recur in 24-hour periods? Over the course of a month? Seasonally?). The convention for standardizing time is the Coordinated Universal Time (UTC), an international time standard measured at 0 degrees latitude that stays consistent throughout the year (no daylight savings). We can represent Berkeley's time zone, Pacific Standard Time (PST), as UTC-7 (with daylight savings).
-
-#### Temporality with `pandas`' `dt` accessors
-Let's briefly look at how we can use `pandas`' `dt` accessors to work with dates/times in a dataset using the dataset you'll see in Lab 3: the Berkeley PD Calls for Service dataset.
-
-```{python}
-#| code-fold: true
-calls = pd.read_csv("data/Berkeley_PD_-_Calls_for_Service.csv")
-calls.head()
-```
-
-Looks like there are three columns with dates/times: `EVENTDT`, `EVENTTM`, and `InDbDate`.
-
-Most likely, `EVENTDT` stands for the date when the event took place, `EVENTTM` stands for the time of day the event took place (in 24-hr format), and `InDbDate` is the date this call is recorded onto the database.
-
-If we check the data type of these columns, we will see they are stored as strings. We can convert them to `datetime` objects using pandas `to_datetime` function.
-
-```{python}
-#| code-fold: false
-calls["EVENTDT"] = pd.to_datetime(calls["EVENTDT"])
-calls.head()
-```
-
-Now, we can use the `dt` accessor on this column.
-
-We can get the month:
-
-```{python}
-#| code-fold: false
-calls["EVENTDT"].dt.month.head()
-```
-
-Which day of the week the date is on:
-
-```{python}
-#| code-fold: false
-calls["EVENTDT"].dt.dayofweek.head()
-```
-
-Check the mimimum values to see if there are any suspicious-looking, 70s dates:
-
-```{python}
-#| code-fold: false
-calls.sort_values("EVENTDT").head()
-```
-
-Doesn't look like it! We are good!
-
-
-We can also do many things with the `dt` accessor like switching time zones and converting time back to UNIX/POSIX time. Check out the documentation on [`.dt` accessor](https://pandas.pydata.org/docs/user_guide/basics.html#basics-dt-accessors) and [time series/date functionality](https://pandas.pydata.org/docs/user_guide/timeseries.html#).
-
-## Faithfulness
-
-At this stage in our data cleaning and EDA workflow, we've achieved quite a lot: we've identified how our data is structured, come to terms with what information it encodes, and gained insight as to how it was generated. Throughout this process, we should always recall the original intent of our work in Data Science – to use data to better understand and model the real world. To achieve this goal, we need to ensure that the data we use is faithful to reality; that is, that our data accurately captures the "real world."
-
-Data used in research or industry is often "messy" – there may be errors or inaccuracies that impact the faithfulness of the dataset. Signs that data may not be faithful include:
-
-* Unrealistic or "incorrect" values, such as negative counts, locations that don't exist, or dates set in the future
-* Violations of obvious dependencies, like an age that does not match a birthday
-* Clear signs that data was entered by hand, which can lead to spelling errors or fields that are incorrectly shifted
-* Signs of data falsification, such as fake email addresses or repeated use of the same names
-* Duplicated records or fields containing the same information
-* Truncated data, e.g. Microsoft Excel would limit the number of rows to 655536 and the number of columns to 255
-
-We often solve some of these more common issues in the following ways:
-
-* Spelling errors: apply corrections or drop records that aren't in a dictionary
-* Time zone inconsistencies: convert to a common time zone (e.g. UTC)
-* Duplicated records or fields: identify and eliminate duplicates (using primary keys)
-* Unspecified or inconsistent units: infer the units and check that values are in reasonable ranges in the data
-
-### Missing Values
-Another common issue encountered with real-world datasets is that of missing data. One strategy to resolve this is to simply drop any records with missing values from the dataset. This does, however, introduce the risk of inducing biases – it is possible that the missing or corrupt records may be systemically related to some feature of interest in the data. Another solution is to keep the data as `NaN` values.
-
-A third method to address missing data is to perform **imputation**: infer the missing values using other data available in the dataset. There is a wide variety of imputation techniques that can be implemented; some of the most common are listed below.
-
-* Average imputation: replace missing values with the average value for that field
-* Hot deck imputation: replace missing values with some random value
-* Regression imputation: develop a model to predict missing values
-* Multiple imputation: replace missing values with multiple random values
-
-Regardless of the strategy used to deal with missing data, we should think carefully about *why* particular records or fields may be missing – this can help inform whether or not the absence of these values is significant or meaningful.
-
-# EDA Demo 1: Tuberculosis in the United States
-
-Now, let's walk through the data-cleaning and EDA workflow to see what can we learn about the presence of Tuberculosis in the United States!
-
-We will examine the data included in the [original CDC article](https://www.cdc.gov/mmwr/volumes/71/wr/mm7112a1.htm?s_cid=mm7112a1_w#T1_down) published in 2021.
-
-
-## CSVs and Field Names
-Suppose Table 1 was saved as a CSV file located in `data/cdc_tuberculosis.csv`.
-
-We can then explore the CSV (which is a text file, and does not contain binary-encoded data) in many ways:
-1. Using a text editor like emacs, vim, VSCode, etc.
-2. Opening the CSV directly in DataHub (read-only), Excel, Google Sheets, etc.
-3. The `Python` file object
-4. `pandas`, using `pd.read_csv()`
-
-To try out options 1 and 2, you can view or download the Tuberculosis from the [lecture demo notebook](https://data100.datahub.berkeley.edu/hub/user-redirect/git-pull?repo=https%3A%2F%2Fgithub.com%2FDS-100%2Ffa23-student&urlpath=lab%2Ftree%2Ffa23-student%2Flecture%2Flec05%2Flec04-eda.ipynb&branch=main) under the `data` folder in the left hand menu. Notice how the CSV file is a type of **rectangular data (i.e., tabular data) stored as comma-separated values**.
-
-Next, let's try out option 3 using the `Python` file object. We'll look at the first four lines:
-
-```{python}
-#| code-fold: true
-with open("data/cdc_tuberculosis.csv", "r") as f:
- i = 0
- for row in f:
- print(row)
- i += 1
- if i > 3:
- break
-```
-
-Whoa, why are there blank lines interspaced between the lines of the CSV?
-
-You may recall that all line breaks in text files are encoded as the special newline character `\n`. Python's `print()` prints each string (including the newline), and an additional newline on top of that.
-
-If you're curious, we can use the `repr()` function to return the raw string with all special characters:
-
-```{python}
-#| code-fold: true
-with open("data/cdc_tuberculosis.csv", "r") as f:
- i = 0
- for row in f:
- print(repr(row)) # print raw strings
- i += 1
- if i > 3:
- break
-```
-
-Finally, let's try option 4 and use the tried-and-true Data 100 approach: `pandas`.
-
-```{python}
-#| code-fold: false
-tb_df = pd.read_csv("data/cdc_tuberculosis.csv")
-tb_df.head()
-```
-
-You may notice some strange things about this table: what's up with the "Unnamed" column names and the first row?
-
-Congratulations — you're ready to wrangle your data! Because of how things are stored, we'll need to clean the data a bit to name our columns better.
-
-A reasonable first step is to identify the row with the right header. The `pd.read_csv()` function ([documentation](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html)) has the convenient `header` parameter that we can set to use the elements in row 1 as the appropriate columns:
-
-```{python}
-#| code-fold: false
-tb_df = pd.read_csv("data/cdc_tuberculosis.csv", header=1) # row index
-tb_df.head(5)
-```
-
-Wait...but now we can't differentiate betwen the "Number of TB cases" and "TB incidence" year columns. `pandas` has tried to make our lives easier by automatically adding ".1" to the latter columns, but this doesn't help us, as humans, understand the data.
-
-We can do this manually with `df.rename()` ([documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.rename.html?highlight=rename#pandas.DataFrame.rename)):
-
-```{python}
-#| code-fold: false
-rename_dict = {'2019': 'TB cases 2019',
- '2020': 'TB cases 2020',
- '2021': 'TB cases 2021',
- '2019.1': 'TB incidence 2019',
- '2020.1': 'TB incidence 2020',
- '2021.1': 'TB incidence 2021'}
-tb_df = tb_df.rename(columns=rename_dict)
-tb_df.head(5)
-```
-
-## Record Granularity
-
-You might already be wondering: what's up with that first record?
-
-Row 0 is what we call a **rollup record**, or summary record. It's often useful when displaying tables to humans. The **granularity** of record 0 (Totals) vs the rest of the records (States) is different.
-
-Okay, EDA step two. How was the rollup record aggregated?
-
-Let's check if Total TB cases is the sum of all state TB cases. If we sum over all rows, we should get **2x** the total cases in each of our TB cases by year (why do you think this is?).
-
-```{python}
-#| code-fold: true
-tb_df.sum(axis=0)
-```
-
-Whoa, what's going on with the TB cases in 2019, 2020, and 2021? Check out the column types:
-
-```{python}
-#| code-fold: true
-tb_df.dtypes
-```
-
-Since there are commas in the values for TB cases, the numbers are read as the `object` datatype, or **storage type** (close to the `Python` string datatype), so `pandas` is concatenating strings instead of adding integers (recall that `Python` can "sum", or concatenate, strings together: `"data" + "100"` evaluates to `"data100"`).
-
-
-Fortunately `read_csv` also has a `thousands` parameter ([documentation](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html)):
-
-```{python}
-#| code-fold: false
-# improve readability: chaining method calls with outer parentheses/line breaks
-tb_df = (
- pd.read_csv("data/cdc_tuberculosis.csv", header=1, thousands=',')
- .rename(columns=rename_dict)
-)
-tb_df.head(5)
-```
-
-```{python}
-#| code-fold: false
-tb_df.sum()
-```
-
-The Total TB cases look right. Phew!
-
-Let's just look at the records with **state-level granularity**:
-
-```{python}
-#| code-fold: true
-state_tb_df = tb_df[1:]
-state_tb_df.head(5)
-```
-
-## Gather Census Data
-
-U.S. Census population estimates [source](https://www.census.gov/data/tables/time-series/demo/popest/2010s-state-total.html) (2019), [source](https://www.census.gov/data/tables/time-series/demo/popest/2020s-state-total.html) (2020-2021).
-
-Running the below cells cleans the data.
-There are a few new methods here:
-* `df.convert_dtypes()` ([documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.convert_dtypes.html)) conveniently converts all float dtypes into ints and is out of scope for the class.
-* `df.drop_na()` ([documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.dropna.html)) will be explained in more detail next time.
-
-```{python}
-#| code-fold: true
-# 2010s census data
-census_2010s_df = pd.read_csv("data/nst-est2019-01.csv", header=3, thousands=",")
-census_2010s_df = (
- census_2010s_df
- .reset_index()
- .drop(columns=["index", "Census", "Estimates Base"])
- .rename(columns={"Unnamed: 0": "Geographic Area"})
- .convert_dtypes() # "smart" converting of columns, use at your own risk
- .dropna() # we'll introduce this next time
-)
-census_2010s_df['Geographic Area'] = census_2010s_df['Geographic Area'].str.strip('.')
-
-# with pd.option_context('display.min_rows', 30): # shows more rows
-# display(census_2010s_df)
-
-census_2010s_df.head(5)
-```
-
-Occasionally, you will want to modify code that you have imported. To reimport those modifications you can either use `python`'s `importlib` library:
-
-```python
-from importlib import reload
-reload(utils)
-```
-
-or use `iPython` magic which will intelligently import code when files change:
-
-```python
-%load_ext autoreload
-%autoreload 2
-```
-
-```{python}
-#| code-fold: true
-# census 2020s data
-census_2020s_df = pd.read_csv("data/NST-EST2022-POP.csv", header=3, thousands=",")
-census_2020s_df = (
- census_2020s_df
- .reset_index()
- .drop(columns=["index", "Unnamed: 1"])
- .rename(columns={"Unnamed: 0": "Geographic Area"})
- .convert_dtypes() # "smart" converting of columns, use at your own risk
- .dropna() # we'll introduce this next time
-)
-census_2020s_df['Geographic Area'] = census_2020s_df['Geographic Area'].str.strip('.')
-
-census_2020s_df.head(5)
-```
-
-## Joining Data (Merging `DataFrame`s)
-
-Time to `merge`! Here we use the `DataFrame` method `df1.merge(right=df2, ...)` on `DataFrame` `df1` ([documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.merge.html)). Contrast this with the function `pd.merge(left=df1, right=df2, ...)` ([documentation](https://pandas.pydata.org/docs/reference/api/pandas.merge.html?highlight=pandas%20merge#pandas.merge)). Feel free to use either.
-
-```{python}
-#| code-fold: false
-# merge TB DataFrame with two US census DataFrames
-tb_census_df = (
- tb_df
- .merge(right=census_2010s_df,
- left_on="U.S. jurisdiction", right_on="Geographic Area")
- .merge(right=census_2020s_df,
- left_on="U.S. jurisdiction", right_on="Geographic Area")
-)
-tb_census_df.head(5)
-```
-
-Having all of these columns is a little unwieldy. We could either drop the unneeded columns now, or just merge on smaller census `DataFrame`s. Let's do the latter.
-
-```{python}
-#| code-fold: false
-# try merging again, but cleaner this time
-tb_census_df = (
- tb_df
- .merge(right=census_2010s_df[["Geographic Area", "2019"]],
- left_on="U.S. jurisdiction", right_on="Geographic Area")
- .drop(columns="Geographic Area")
- .merge(right=census_2020s_df[["Geographic Area", "2020", "2021"]],
- left_on="U.S. jurisdiction", right_on="Geographic Area")
- .drop(columns="Geographic Area")
-)
-tb_census_df.head(5)
-```
-
-## Reproducing Data: Compute Incidence
-
-Let's recompute incidence to make sure we know where the original CDC numbers came from.
-
-From the [CDC report](https://www.cdc.gov/mmwr/volumes/71/wr/mm7112a1.htm?s_cid=mm7112a1_w#T1_down): TB incidence is computed as “Cases per 100,000 persons using mid-year population estimates from the U.S. Census Bureau.”
-
-If we define a group as 100,000 people, then we can compute the TB incidence for a given state population as
-
-$$\text{TB incidence} = \frac{\text{TB cases in population}}{\text{groups in population}} = \frac{\text{TB cases in population}}{\text{population}/100000} $$
-
-$$= \frac{\text{TB cases in population}}{\text{population}} \times 100000$$
-
-Let's try this for 2019:
-
-```{python}
-#| code-fold: false
-tb_census_df["recompute incidence 2019"] = tb_census_df["TB cases 2019"]/tb_census_df["2019"]*100000
-tb_census_df.head(5)
-```
-
-Awesome!!!
-
-Let's use a for-loop and `Python` format strings to compute TB incidence for all years. `Python` f-strings are just used for the purposes of this demo, but they're handy to know when you explore data beyond this course ([documentation](https://docs.python.org/3/tutorial/inputoutput.html)).
-
-```{python}
-#| code-fold: false
-# recompute incidence for all years
-for year in [2019, 2020, 2021]:
- tb_census_df[f"recompute incidence {year}"] = tb_census_df[f"TB cases {year}"]/tb_census_df[f"{year}"]*100000
-tb_census_df.head(5)
-```
-
-These numbers look pretty close!!! There are a few errors in the hundredths place, particularly in 2021. It may be useful to further explore reasons behind this discrepancy.
-
-```{python}
-#| code-fold: false
-tb_census_df.describe()
-```
-
-## Bonus EDA: Reproducing the Reported Statistic
-
-
-**How do we reproduce that reported statistic in the original [CDC report](https://www.cdc.gov/mmwr/volumes/71/wr/mm7112a1.htm?s_cid=mm7112a1_w)?**
-
-> Reported TB incidence (cases per 100,000 persons) increased **9.4%**, from **2.2** during 2020 to **2.4** during 2021 but was lower than incidence during 2019 (2.7). Increases occurred among both U.S.-born and non–U.S.-born persons.
-
-This is TB incidence computed across the entire U.S. population! How do we reproduce this?
-
-* We need to reproduce the "Total" TB incidences in our rolled record.
-* But our current `tb_census_df` only has 51 entries (50 states plus Washington, D.C.). There is no rolled record.
-* What happened...?
-
-Let's get exploring!
-
-Before we keep exploring, we'll set all indexes to more meaningful values, instead of just numbers that pertain to some row at some point. This will make our cleaning slightly easier.
-
-```{python}
-#| code-fold: true
-tb_df = tb_df.set_index("U.S. jurisdiction")
-tb_df.head(5)
-```
-
-```{python}
-#| code-fold: false
-census_2010s_df = census_2010s_df.set_index("Geographic Area")
-census_2010s_df.head(5)
-```
-
-```{python}
-#| code-fold: false
-census_2020s_df = census_2020s_df.set_index("Geographic Area")
-census_2020s_df.head(5)
-```
-
-It turns out that our merge above only kept state records, even though our original `tb_df` had the "Total" rolled record:
-
-```{python}
-#| code-fold: false
-tb_df.head()
-```
-
-Recall that `merge` does an **inner** merge by default, meaning that it only preserves keys that are present in **both** `DataFrame`s.
-
-The rolled records in our census `DataFrame` have different `Geographic Area` fields, which was the key we merged on:
-
-```{python}
-#| code-fold: false
-census_2010s_df.head(5)
-```
-
-The Census `DataFrame` has several rolled records. The aggregate record we are looking for actually has the Geographic Area named "United States".
-
-One straightforward way to get the right merge is to rename the value itself. Because we now have the Geographic Area index, we'll use `df.rename()` ([documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.rename.html)):
-
-```{python}
-#| code-fold: false
-# rename rolled record for 2010s
-census_2010s_df.rename(index={'United States':'Total'}, inplace=True)
-census_2010s_df.head(5)
-```
-
-```{python}
-#| code-fold: false
-# same, but for 2020s rename rolled record
-census_2020s_df.rename(index={'United States':'Total'}, inplace=True)
-census_2020s_df.head(5)
-```
-
-<br/>
-
-Next let's rerun our merge. Note the different chaining, because we are now merging on indexes (`df.merge()` [documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.merge.html)).
-
-```{python}
-#| code-fold: false
-tb_census_df = (
- tb_df
- .merge(right=census_2010s_df[["2019"]],
- left_index=True, right_index=True)
- .merge(right=census_2020s_df[["2020", "2021"]],
- left_index=True, right_index=True)
-)
-tb_census_df.head(5)
-```
-
-<br/>
-
-Finally, let's recompute our incidences:
-
-```{python}
-#| code-fold: false
-# recompute incidence for all years
-for year in [2019, 2020, 2021]:
- tb_census_df[f"recompute incidence {year}"] = tb_census_df[f"TB cases {year}"]/tb_census_df[f"{year}"]*100000
-tb_census_df.head(5)
-```
-
-We reproduced the total U.S. incidences correctly!
-
-We're almost there. Let's revisit the quote:
-
-> Reported TB incidence (cases per 100,000 persons) increased **9.4%**, from **2.2** during 2020 to **2.4** during 2021 but was lower than incidence during 2019 (2.7). Increases occurred among both U.S.-born and non–U.S.-born persons.
-
-Recall that percent change from $A$ to $B$ is computed as
-$\text{percent change} = \frac{B - A}{A} \times 100$.
-
-```{python}
-#| code-fold: false
-#| tags: []
-incidence_2020 = tb_census_df.loc['Total', 'recompute incidence 2020']
-incidence_2020
-```
-
-```{python}
-#| code-fold: false
-#| tags: []
-incidence_2021 = tb_census_df.loc['Total', 'recompute incidence 2021']
-incidence_2021
-```
-
-```{python}
-#| code-fold: false
-#| tags: []
-difference = (incidence_2021 - incidence_2020)/incidence_2020 * 100
-difference
-```
-
-# EDA Demo 2: Mauna Loa CO<sub>2</sub> Data -- A Lesson in Data Faithfulness
-
-[Mauna Loa Observatory](https://gml.noaa.gov/ccgg/trends/data.html) has been monitoring CO<sub>2</sub> concentrations since 1958
-
-```{python}
-#| code-fold: false
-co2_file = "data/co2_mm_mlo.txt"
-```
-
-Let's do some **EDA**!!
-
-## Reading this file into Pandas?
-Let's instead check out this `.txt` file. Some questions to keep in mind: Do we trust this file extension? What structure is it?
-
-Lines 71-78 (inclusive) are shown below:
-
- line number | file contents
-
- 71 | # decimal average interpolated trend #days
- 72 | # date (season corr)
- 73 | 1958 3 1958.208 315.71 315.71 314.62 -1
- 74 | 1958 4 1958.292 317.45 317.45 315.29 -1
- 75 | 1958 5 1958.375 317.50 317.50 314.71 -1
- 76 | 1958 6 1958.458 -99.99 317.10 314.85 -1
- 77 | 1958 7 1958.542 315.86 315.86 314.98 -1
- 78 | 1958 8 1958.625 314.93 314.93 315.94 -1
-
-
-Notice how:
-
-- The values are separated by white space, possibly tabs.
-- The values line up down the rows. For example, the month appears in the 7th to 8th position of each line.
-- The 71st and 72nd lines in the file contain column headings split over two lines.
-
-We can use `read_csv` to read the data into a `pandas` `DataFrame`, and we provide several arguments to specify that the separators are white space, there is no header (**we will set our own column names**), and to skip the first 72 rows of the file.
-
-```{python}
-#| code-fold: false
-co2 = pd.read_csv(
- co2_file, header = None, skiprows = 72,
-    sep = r'\s+' # delimiter for continuous whitespace (stay tuned for regex next lecture)
-)
-co2.head()
-```
-
-Congratulations! You've wrangled the data!
-
-<br/>
-
-...But our columns aren't named.
-**We need to do more EDA.**
-
-## Exploring Variable Feature Types
-
-The NOAA [webpage](https://gml.noaa.gov/ccgg/trends/) might have some useful tidbits (in this case it doesn't).
-
-Using this information, we'll rerun `pd.read_csv`, but this time with some **custom column names.**
-
-```{python}
-#| code-fold: false
-co2 = pd.read_csv(
- co2_file, header = None, skiprows = 72,
-    sep = r'\s+', # regex for continuous whitespace (next lecture)
- names = ['Yr', 'Mo', 'DecDate', 'Avg', 'Int', 'Trend', 'Days']
-)
-co2.head()
-```
-
-## Visualizing CO<sub>2</sub>
-Scientific studies tend to have very clean data, right...? Let's jump right in and make a time series plot of CO2 monthly averages.
-
-```{python}
-#| code-fold: true
-sns.lineplot(x='DecDate', y='Avg', data=co2);
-```
-
-The code above uses the `seaborn` plotting library (abbreviated `sns`). We will cover it in the Visualization lecture; for now, you don't need to worry about how it works!
-
-Yikes! Plotting the data uncovered a problem. The sharp vertical lines suggest that we have some **missing values**. What happened here?
-
-```{python}
-#| code-fold: false
-co2.head()
-```
-
-```{python}
-#| code-fold: false
-co2.tail()
-```
-
-Some data have unusual values like -1 and -99.99.
-
-Let's check the description at the top of the file again.
-
-* -1 signifies a missing value for the number of days `Days` the equipment was in operation that month.
-* -99.99 denotes a missing monthly average `Avg`
-
-How can we fix this? First, let's explore other aspects of our data. Understanding our data will help us decide what to do with the missing values.
-
-<br/>
-
-
-## Sanity Checks: Reasoning about the data
-First, we consider the shape of the data. How many rows should we have?
-
-* If the data are in chronological order, we should have one record per month.
-* Data from March 1958 to August 2019.
-* We should have $ 12 \times (2019-1957) - 2 - 4 = 738 $ records.
-
-```{python}
-#| code-fold: false
-co2.shape
-```
-
-Nice!! The number of rows (i.e., records) matches our expectations.
-
-<br/>
-
-
-Let's now check the quality of each feature.
-
-## Understanding Missing Value 1: `Days`
-`Days` is a time field, so let's analyze other time fields to see if there is an explanation for missing values of days of operation.
-
-Let's start with **months**, `Mo`.
-
-Are we missing any records? Each month should appear 61 or 62 times (March 1958-August 2019).
-
-```{python}
-#| code-fold: false
-co2["Mo"].value_counts().sort_index()
-```
-
-As expected Jan, Feb, Sep, Oct, Nov, and Dec have 61 occurrences and the rest 62.
-
-<br/>
-
-Next let's explore **days** `Days` itself, which is the number of days that the measurement equipment worked.
-
-```{python}
-#| code-fold: true
-sns.displot(co2['Days']);
-plt.title("Distribution of days feature"); # suppresses unneeded plotting output
-```
-
-In terms of data quality, a handful of months have averages based on measurements taken on fewer than half the days. In addition, there are nearly 200 missing values--**that's about 27% of the data**!
-
-<br/>
-
-Finally, let's check the last time feature, **year** `Yr`.
-
-Let's check to see if there is any connection between missing-ness and the year of the recording.
-
-```{python}
-#| code-fold: true
-sns.scatterplot(x="Yr", y="Days", data=co2);
-plt.title("Day field by Year"); # the ; suppresses output
-```
-
-**Observations**:
-
-* All of the missing data are in the early years of operation.
-* It appears there may have been problems with equipment in the mid to late 80s.
-
-**Potential Next Steps**:
-
-* Confirm these explanations through documentation about the historical readings.
-* Maybe drop earliest recordings? However, we would want to delay such action until after we have examined the time trends and assess whether there are any potential problems.
-
-<br/>
-
-## Understanding Missing Value 2: `Avg`
-Next, let's return to the -99.99 values in `Avg` to analyze the overall quality of the CO2 measurements. We'll plot a histogram of the average CO<sub>2</sub> measurements
-
-```{python}
-#| code-fold: true
-# Histograms of average CO2 measurements
-sns.displot(co2['Avg']);
-```
-
-The non-missing values are in the 300-400 range (a regular range of CO2 levels).
-
-We also see that there are only a few missing `Avg` values (**<1% of values**). Let's examine all of them:
-
-```{python}
-#| code-fold: false
-co2[co2["Avg"] < 0]
-```
-
-There doesn't seem to be a pattern to these values, other than that most records also were missing `Days` data.
-
-## Drop, `NaN`, or Impute Missing `Avg` Data?
-
-How should we address the invalid `Avg` data?
-
-1. Drop records
-2. Set to NaN
-3. Impute using some strategy
-
-Remember we want to fix the following plot:
-
-```{python}
-#| code-fold: true
-sns.lineplot(x='DecDate', y='Avg', data=co2)
-plt.title("CO2 Average By Month");
-```
-
-Since we are plotting `Avg` vs `DecDate`, we should just focus on dealing with missing values for `Avg`.
-
-
-Let's consider a few options:
-
-1. Drop those records
-2. Replace -99.99 with NaN
-3. Substitute it with a likely value for the average CO2?
-
-What do you think are the pros and cons of each possible action?
-
-<br/>
-
-
-Let's examine each of these three options.
-
-```{python}
-#| code-fold: false
-# 1. Drop missing values
-co2_drop = co2[co2['Avg'] > 0]
-co2_drop.head()
-```
-
-```{python}
-#| code-fold: false
-# 2. Replace -99.99 with NaN
-co2_NA = co2.replace(-99.99, np.NaN)
-co2_NA.head()
-```
-
-We'll also use a third version of the data.
-
-First, we note that the dataset already comes with a **substitute value** for the -99.99.
-
-From the file description:
-
-> The `interpolated` column includes average values from the preceding column (`average`)
-and **interpolated values** where data are missing. Interpolated values are
-computed in two steps...
-
-The `Int` feature has values that exactly match those in `Avg`, except when `Avg` is -99.99, and then a **reasonable** estimate is used instead.
-
-So, the third version of our data will use the `Int` feature instead of `Avg`.
-
-```{python}
-#| code-fold: false
-# 3. Use interpolated column which estimates missing Avg values
-co2_impute = co2.copy()
-co2_impute['Avg'] = co2['Int']
-co2_impute.head()
-```
-
-What's a **reasonable** estimate?
-
-To answer this question, let's zoom in on a short time period, say the measurements in 1958 (where we know we have two missing values).
-
-```{python}
-#| code-fold: true
-# results of plotting data in 1958
-
-def line_and_points(data, ax, title):
- # assumes single year, hence Mo
- ax.plot('Mo', 'Avg', data=data)
- ax.scatter('Mo', 'Avg', data=data)
- ax.set_xlim(2, 13)
- ax.set_title(title)
- ax.set_xticks(np.arange(3, 13))
-
-def data_year(data, year):
-    return data[data["Yr"] == year]
-
-# uses matplotlib subplots
-# you may see more next week; focus on output for now
-fig, axes = plt.subplots(ncols = 3, figsize=(12, 4), sharey=True)
-
-year = 1958
-line_and_points(data_year(co2_drop, year), axes[0], title="1. Drop Missing")
-line_and_points(data_year(co2_NA, year), axes[1], title="2. Missing Set to NaN")
-line_and_points(data_year(co2_impute, year), axes[2], title="3. Missing Interpolated")
-
-fig.suptitle(f"Monthly Averages for {year}")
-plt.tight_layout()
-```
-
-In the big picture, since there are only 7 `Avg` values missing (**<1%** of 738 months), any of these approaches would work.
-
-However, there is some appeal to **option 3: imputing**:
-
-* Shows seasonal trends for CO2
-* We are plotting all months in our data as a line plot
-
-<br/>
-
-
-Let's replot our original figure with option 3:
-
-```{python}
-#| code-fold: true
-sns.lineplot(x='DecDate', y='Avg', data=co2_impute)
-plt.title("CO2 Average By Month, Imputed");
-```
-
-Looks pretty close to what we see on the NOAA [website](https://gml.noaa.gov/ccgg/trends/)!
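-
-If we preferred to impute the values ourselves rather than rely on the file's `Int` column, `pandas` has a built-in `Series.interpolate` method. The cell below is a minimal sketch (not part of the original demo) that applies simple linear interpolation to the NaN version of the data from option 2; NOAA's own interpolation procedure is more sophisticated, so treat this only as an approximation.
-
-```python
-# Sketch: linearly interpolate the missing monthly averages ourselves.
-co2_lin = co2_NA.copy()
-co2_lin['Avg'] = co2_lin['Avg'].interpolate()  # fills NaN gaps linearly
-co2_lin.head()
-```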
-
-## Presenting the data: A Discussion on Data Granularity
-
-From the description:
-
-* monthly measurements are averages of average day measurements.
-* The NOAA GML website has datasets for daily/hourly measurements too.
-
-The data you present depends on your research question.
-
-**How do CO2 levels vary by season?**
-
-* You might want to keep average monthly data.
-
-**Are CO2 levels rising over the past 50+ years, consistent with global warming predictions?**
-
-* You might be happier with a **coarser granularity** of average year data!
-
-```{python}
-#| code-fold: true
-co2_year = co2_impute.groupby('Yr').mean()
-sns.lineplot(x='Yr', y='Avg', data=co2_year)
-plt.title("CO2 Average By Year");
-```
-
-Indeed, we see a rise by nearly 100 ppm of CO2 since Mauna Loa began recording in 1958.
-
-# Summary
-We went over a lot of content this lecture; let's summarize the most important points:
-
-## Dealing with Missing Values
-There are a few options we can take to deal with missing data:
-
-* Drop missing records
-* Keep `NaN` missing values
-* Impute using an interpolated column
-
-## EDA and Data Wrangling
-There are several ways to approach EDA and Data Wrangling:
-
-* Examine the **data and metadata**: what is the date, size, organization, and structure of the data?
-* Examine each **field/attribute/dimension** individually.
-* Examine pairs of related dimensions (e.g. breaking down grades by major).
-* Along the way, we can:
- * **Visualize** or summarize the data.
- * **Validate assumptions** about data and its collection process. Pay particular attention to when the data was collected.
- * Identify and **address anomalies**.
- * Apply data transformations and corrections (we'll cover this in the upcoming lecture).
- * **Record everything you do!** Developing in Jupyter Notebook promotes *reproducibility* of your own work!
+---
+title: Data Cleaning and EDA
+execute:
+ echo: true
+format:
+ html:
+ code-fold: true
+ code-tools: true
+ toc: true
+ toc-title: Data Cleaning and EDA
+ page-layout: full
+ theme:
+ - cosmo
+ - cerulean
+ callout-icon: false
+jupyter: python3
+---
+
+```{python}
+#| code-fold: true
+import numpy as np
+import pandas as pd
+
+import matplotlib.pyplot as plt
+import seaborn as sns
+#%matplotlib inline
+plt.rcParams['figure.figsize'] = (12, 9)
+
+sns.set()
+sns.set_context('talk')
+np.set_printoptions(threshold=20, precision=2, suppress=True)
+pd.set_option('display.max_rows', 30)
+pd.set_option('display.max_columns', None)
+pd.set_option('display.precision', 2)
+# This option stops scientific notation for pandas
+pd.set_option('display.float_format', '{:.2f}'.format)
+
+# Silence some spurious seaborn warnings
+import warnings
+warnings.filterwarnings("ignore", category=FutureWarning)
+```
+
+::: {.callout-note collapse="false"}
+## Learning Outcomes
+* Recognize common file formats
+* Categorize data by its variable type
+* Build awareness of issues with data faithfulness and develop targeted solutions
+:::
+
+**This content is covered in lectures 4, 5, and 6.**
+
+In the past few lectures, we've learned that `pandas` is a toolkit to restructure, modify, and explore a dataset. What we haven't yet touched on is *how* to make these data transformation decisions. When we receive a new set of data from the "real world," how do we know what processing we should do to convert this data into a usable form?
+
+**Data cleaning**, also called **data wrangling**, is the process of transforming raw data to facilitate subsequent analysis. It is often used to address issues like:
+
+* Unclear structure or formatting
+* Missing or corrupted values
+* Unit conversions
+* ...and so on
+
+**Exploratory Data Analysis (EDA)** is the process of understanding a new dataset. It is an open-ended, informal analysis that involves familiarizing ourselves with the variables present in the data, discovering potential hypotheses, and identifying possible issues with the data. This last point can often motivate further data cleaning to address any problems with the dataset's format; because of this, EDA and data cleaning are often thought of as an "infinite loop," with each process driving the other.
+
+In this lecture, we will consider the key properties of data to consider when performing data cleaning and EDA. In doing so, we'll develop a "checklist" of sorts for you to consider when approaching a new dataset. Throughout this process, we'll build a deeper understanding of this early (but very important!) stage of the data science lifecycle.
+
+## Structure
+
+### File Formats
+There are many file types for storing structured data: TSV, JSON, XML, ASCII, SAS, etc. We'll only cover CSV, TSV, and JSON in lecture, but you'll likely encounter other formats as you work with different datasets. Reading documentation is your best bet for understanding how to process the multitude of different file types.
+
+#### CSV
+CSVs, which stand for **Comma-Separated Values**, are a common tabular data format.
+In the past two `pandas` lectures, we briefly touched on the idea of file format: the way data is encoded in a file for storage. Specifically, our `elections` and `babynames` datasets were stored and loaded as CSVs:
+
+```{python}
+#| code-fold: false
+pd.read_csv("data/elections.csv").head(5)
+```
+
+To better understand the properties of a CSV, let's take a look at the first few rows of the raw data file to see what it looks like before being loaded into a `DataFrame`. We'll use the `repr()` function to return the raw string with its special characters:
+
+```{python}
+#| code-fold: false
+with open("data/elections.csv", "r") as table:
+ i = 0
+ for row in table:
+ print(repr(row))
+ i += 1
+ if i > 3:
+ break
+```
+
+Each row, or **record**, in the data is delimited by a newline `\n`. Each column, or **field**, in the data is delimited by a comma `,` (hence, comma-separated!).
+
+#### TSV
+
+Another common file type is **TSV (Tab-Separated Values)**. In a TSV, records are still delimited by a newline `\n`, while fields are delimited by the tab character `\t`.
+
+Let's check out the first few rows of the raw TSV file. Again, we'll use the `repr()` function so that `print` shows the special characters.
+
+```{python}
+#| code-fold: false
+with open("data/elections.txt", "r") as table:
+ i = 0
+ for row in table:
+ print(repr(row))
+ i += 1
+ if i > 3:
+ break
+```
+
+TSVs can be loaded into `pandas` using `pd.read_csv`. We'll need to specify the **delimiter** with the parameter `sep='\t'` [(documentation)](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html).
+
+```{python}
+#| code-fold: false
+pd.read_csv("data/elections.txt", sep='\t').head(3)
+```
+
+An issue with CSVs and TSVs comes up whenever there are commas or tabs within the records. How does `pandas` differentiate between a comma delimiter vs. a comma within the field itself, for example `8,900`? To remedy this, check out the [`quotechar` parameter](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html).
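+
+As a small illustration, the sketch below builds a made-up in-memory CSV in which both fields contain embedded commas; wrapping the fields in quotes keeps them intact (`quotechar='"'` is already the default and is passed explicitly only for emphasis).
+
+```python
+import io
+import pandas as pd
+
+# Made-up CSV string: both fields contain embedded commas, protected by quotes.
+raw = 'Candidate,Votes\n"Smith, Jr.","8,900"\n'
+pd.read_csv(io.StringIO(raw), quotechar='"', thousands=',')
+```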
+
+#### JSON
+**JSON (JavaScript Object Notation)** files behave similarly to Python dictionaries. A raw JSON is shown below.
+
+```{python}
+#| code-fold: false
+with open("data/elections.json", "r") as table:
+ i = 0
+ for row in table:
+ print(row)
+ i += 1
+ if i > 8:
+ break
+```
+
+JSON files can be loaded into `pandas` using `pd.read_json`.
+
+```{python}
+#| code-fold: false
+pd.read_json('data/elections.json').head(3)
+```
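+
+Real-world JSON is often nested. When that happens, `pd.json_normalize` can flatten records into a table; the records below are made up purely for illustration.
+
+```python
+import pandas as pd
+
+# Made-up nested records (for illustration only).
+records = [
+    {"name": "Oski", "contact": {"email": "oski@berkeley.edu", "city": "Berkeley"}},
+    {"name": "Ollie", "contact": {"email": "ollie@berkeley.edu", "city": "Albany"}},
+]
+pd.json_normalize(records)  # columns: name, contact.email, contact.city
+```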
+
+##### EDA with JSON: Berkeley COVID-19 Data
+The City of Berkeley Open Data [website](https://data.cityofberkeley.info/Health/COVID-19-Confirmed-Cases/xn6j-b766) has a dataset with COVID-19 Confirmed Cases among Berkeley residents by date. Let's download the file and save it as a JSON (note the source URL file type is also a JSON). In the interest of reproducible data science, we will download the data programmatically. We have defined some helper functions in the [`ds100_utils.py`](https://ds100.org/fa23/resources/assets/lectures/lec05/lec05-eda.html) file so that we can reuse them in many different notebooks.
+
+```{python}
+#| code-fold: false
+from ds100_utils import fetch_and_cache
+
+covid_file = fetch_and_cache(
+ "https://data.cityofberkeley.info/api/views/xn6j-b766/rows.json?accessType=DOWNLOAD",
+ "confirmed-cases.json",
+ force=False)
+covid_file # a file path wrapper object
+```
+
+###### File Size
+Let's start our analysis by getting a rough estimate of the size of the dataset to inform the tools we use to view the data. For relatively small datasets, we can use a text editor or spreadsheet. For larger datasets, more programmatic exploration or distributed computing tools may be more fitting. Here we will use `Python` tools to probe the file.
+
+Since the file appears to be text, let's investigate the number of lines, which often corresponds to the number of records.
+
+```{python}
+#| code-fold: false
+import os
+
+print(covid_file, "is", os.path.getsize(covid_file) / 1e6, "MB")
+
+with open(covid_file, "r") as f:
+ print(covid_file, "is", sum(1 for l in f), "lines.")
+```
+
+###### Unix Commands
+As part of the EDA workflow, Unix commands can come in very handy. In fact, there's an entire book called ["Data Science at the Command Line"](https://datascienceatthecommandline.com/) that explores this idea in depth!
+In Jupyter/IPython, you can prefix lines with `!` to execute arbitrary Unix commands, and within those lines, you can refer to `Python` variables and expressions with the syntax `{expr}`.
+
+Here, we use the `ls` command to list files, using the `-lh` flags, which request "long format with information in human-readable form." We also use the `wc` command for "word count," but with the `-l` flag, which asks for line counts instead of words.
+
+These two give us the same information as the code above, albeit in a slightly different form:
+
+```{python}
+#| code-fold: false
+!ls -lh {covid_file}
+!wc -l {covid_file}
+```
+
+###### File Contents
+Let's explore the data format using `Python`.
+
+```{python}
+#| code-fold: false
+with open(covid_file, "r") as f:
+ for i, row in enumerate(f):
+ print(repr(row)) # print raw strings
+ if i >= 4: break
+```
+
+We can use the `head` Unix command (which is where `pandas`' `head` method comes from!) to see the first few lines of the file:
+
+```{python}
+#| code-fold: false
+!head -5 {covid_file}
+```
+
+In order to load the JSON file into `pandas`, let's first do some EDA with `Python`'s `json` package to understand the particular structure of this JSON file so that we can decide what (if anything) to load into `pandas`. `Python` has relatively good support for JSON data since it closely matches the internal Python object model. In the following cell, we import the entire JSON datafile into a Python dictionary using the `json` package.
+
+```{python}
+#| code-fold: false
+import json
+
+with open(covid_file, "rb") as f:
+ covid_json = json.load(f)
+```
+
+The `covid_json` variable is now a dictionary encoding the data in the file:
+
+```{python}
+#| code-fold: false
+type(covid_json)
+```
+
+We can examine what keys are in the top-level JSON object by listing them out.
+
+```{python}
+#| code-fold: false
+covid_json.keys()
+```
+
+**Observation**: The JSON dictionary contains a `meta` key, which likely refers to metadata (data about the data). Metadata is often maintained with the data and can be a good source of additional information.
+
+
+We can investigate the metadata further by examining its keys.
+
+```{python}
+#| code-fold: false
+covid_json['meta'].keys()
+```
+
+The `meta` key contains another dictionary called `view`. This likely refers to metadata about a particular "view" of some underlying database. We will learn more about views when we study SQL later in the class.
+
+```{python}
+#| code-fold: false
+covid_json['meta']['view'].keys()
+```
+
+Notice that this is a nested/recursive data structure. As we dig deeper, we reveal more and more keys and the corresponding data:
+
+```
+meta
+|-> data
+ | ... (haven't explored yet)
+|-> view
+ | -> id
+ | -> name
+ | -> attribution
+ ...
+ | -> description
+ ...
+ | -> columns
+ ...
+```
+
+
+There is a key called `description` in the `view` sub-dictionary. This likely contains a description of the data:
+
+```{python}
+#| code-fold: false
+print(covid_json['meta']['view']['description'])
+```
+
+###### Examining the Data Field for Records
+
+We can look at a few entries in the `data` field. This is what we'll load into `pandas`.
+
+```{python}
+#| code-fold: false
+for i in range(3):
+ print(f"{i:03} | {covid_json['data'][i]}")
+```
+
+Observations:
+* These look like equal-length records, so maybe `data` is a table!
+* But what does each of the values in a record mean? Where can we find the column headers?
+
+For that, we'll need the `columns` key in the metadata dictionary. This returns a list:
+
+```{python}
+#| code-fold: false
+type(covid_json['meta']['view']['columns'])
+```
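+
+Each element of that list describes one column. As a quick peek (the exact field names come from this particular Socrata export and may differ for other datasets), we can inspect the keys of a single entry:
+
+```python
+# Inspect the metadata entry for the first column.
+first_col = covid_json['meta']['view']['columns'][0]
+first_col.keys()
+```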
+
+###### Summary of exploring the JSON file
+
+1. The above **metadata** tells us a lot about the columns in the data including column names, potential data anomalies, and a basic statistic.
+1. Because of its non-tabular structure, JSON makes it easier (than CSV) to create **self-documenting data**, meaning that information about the data is stored in the same file as the data.
+1. Self-documenting data can be helpful since it maintains its own description and these descriptions are more likely to be updated as data changes.
+
+###### Loading COVID Data into `pandas`
+Finally, let's load the data (not the metadata) into a `pandas` `DataFrame`. In the following block of code we:
+
+1. Translate the JSON records into a `DataFrame`:
+
+ * fields: `covid_json['meta']['view']['columns']`
+ * records: `covid_json['data']`
+
+
+1. Remove columns that have no metadata description. This would be a bad idea in general, but here we remove these columns since the above analysis suggests they are unlikely to contain useful information.
+
+1. Examine the `tail` of the table.
+
+```{python}
+#| code-fold: false
+# Load the data from JSON and assign column titles
+covid = pd.DataFrame(
+ covid_json['data'],
+ columns=[c['name'] for c in covid_json['meta']['view']['columns']])
+
+covid.tail()
+```
+
+### Variable Types
+
+After loading data from a file, it's a good idea to take the time to understand what pieces of information are encoded in the dataset. In particular, we want to identify what variable types are present in our data. Broadly speaking, we can categorize variables into one of two overarching types.
+
+**Quantitative variables** describe some numeric quantity or amount. We can divide quantitative data further into:
+
+* **Continuous quantitative variables**: numeric data that can be measured on a continuous scale to arbitrary precision. Continuous variables do not have a strict set of possible values – they can be recorded to any number of decimal places. For example, weights, GPA, or CO<sub>2</sub> concentrations.
+* **Discrete quantitative variables**: numeric data that can only take on a finite set of possible values. For example, someone's age or the number of siblings they have.
+
+**Qualitative variables**, also known as **categorical variables**, describe data that isn't measuring some quantity or amount. The sub-categories of categorical data are:
+
+* **Ordinal qualitative variables**: categories with ordered levels. Specifically, ordinal variables are those where the difference between levels has no consistent, quantifiable meaning. Some examples include levels of education (high school, undergrad, grad, etc.), income bracket (low, medium, high), or Yelp rating.
+* **Nominal qualitative variables**: categories with no specific order. For example, someone's political affiliation or Cal ID number.
+
+![Classification of variable types](images/variable.png)
+
+Note that many variables don't sit neatly in just one of these categories. Qualitative variables could have numeric levels, and conversely, quantitative variables could be stored as strings.
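+
+A storage dtype alone doesn't determine the variable type. The toy example below (made up for illustration) stores a nominal variable as an integer; computing its mean would be meaningless even though `pandas` happily allows it.
+
+```python
+import pandas as pd
+
+df = pd.DataFrame({
+    "Zip Code": [94720, 94709, 94704],  # nominal qualitative, despite numeric storage
+    "GPA": [3.2, 3.7, 3.5],             # continuous quantitative
+})
+df.dtypes  # both columns show numeric dtypes -- dtypes don't reveal the variable type
+```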
+
+### Primary and Foreign Keys
+
+Last time, we introduced `.merge` as the `pandas` method for joining multiple `DataFrame`s together. In our discussion of joins, we touched on the idea of using a "key" to determine what rows should be merged from each table. Let's take a moment to examine this idea more closely.
+
+The **primary key** is the column or set of columns in a table that *uniquely* determine the values of the remaining columns. It can be thought of as the unique identifier for each individual row in the table. For example, a table of Data 100 students might use each student's Cal ID as the primary key.
+
+```{python}
+#| echo: false
+pd.DataFrame({"Cal ID":[3034619471, 3035619472, 3025619473, 3046789372], \
+ "Name":["Oski", "Ollie", "Orrie", "Ollie"], \
+ "Major":["Data Science", "Computer Science", "Data Science", "Economics"]})
+```
+
+The **foreign key** is the column or set of columns in a table that reference primary keys in other tables. Knowing a dataset's foreign keys can be useful when assigning the `left_on` and `right_on` parameters of `.merge`. In the table of office hour tickets below, `"Cal ID"` is a foreign key referencing the previous table.
+
+```{python}
+#| echo: false
+pd.DataFrame({"OH Request":[1, 2, 3, 4], \
+ "Cal ID":[3034619471, 3035619472, 3025619473, 3035619472], \
+ "Question":["HW 2 Q1", "HW 2 Q3", "Lab 3 Q4", "HW 2 Q7"]})
+```
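+
+The foreign key is exactly what we would hand to `.merge`. Below is a minimal sketch that rebuilds the two toy tables above under our own variable names (`students` and `tickets` are not defined elsewhere in this lecture):
+
+```python
+import pandas as pd
+
+students = pd.DataFrame({
+    "Cal ID": [3034619471, 3035619472, 3025619473, 3046789372],
+    "Name": ["Oski", "Ollie", "Orrie", "Ollie"],
+    "Major": ["Data Science", "Computer Science", "Data Science", "Economics"],
+})
+tickets = pd.DataFrame({
+    "OH Request": [1, 2, 3, 4],
+    "Cal ID": [3034619471, 3035619472, 3025619473, 3035619472],
+    "Question": ["HW 2 Q1", "HW 2 Q3", "Lab 3 Q4", "HW 2 Q7"],
+})
+
+# Join each ticket to the student it references via the foreign key.
+tickets.merge(students, on="Cal ID", how="left")
+```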
+
+## Granularity, Scope, and Temporality
+
+After understanding the structure of the dataset, the next task is to determine what exactly the data represents. We'll do so by considering the data's granularity, scope, and temporality.
+
+### Granularity
+The **granularity** of a dataset is what a single row represents. You can also think of it as the level of detail included in the data. To determine the data's granularity, ask: what does each row in the dataset represent? Fine-grained data contains a high level of detail, with a single row representing a small individual unit. For example, each record may represent one person. Coarse-grained data is encoded such that a single row represents a large individual unit – for example, each record may represent a group of people.
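+
+Aggregation is how we move from fine-grained to coarse-grained data. A small made-up sketch: person-level records are coarsened into one row per city.
+
+```python
+import pandas as pd
+
+# Fine-grained: one row per person (made-up data).
+people = pd.DataFrame({
+    "City": ["Berkeley", "Berkeley", "Oakland", "Oakland", "Oakland"],
+    "Age": [22, 35, 41, 29, 53],
+})
+
+# Coarse-grained: one row per city.
+people.groupby("City").agg(residents=("Age", "size"), mean_age=("Age", "mean"))
+```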
+
+### Scope
+The **scope** of a dataset is the subset of the population covered by the data. If we were investigating student performance in Data Science courses, a dataset with a narrow scope might encompass all students enrolled in Data 100 whereas a dataset with an expansive scope might encompass all students in California.
+
+### Temporality
+The **temporality** of a dataset describes the periodicity over which the data was collected as well as when the data was most recently collected or updated.
+
+Time and date fields of a dataset could represent a few things:
+
+1. when the "event" happened
+2. when the data was collected, or when it was entered into the system
+3. when the data was copied into the database
+
+To fully understand the temporality of the data, it also may be necessary to standardize time zones or inspect recurring time-based trends in the data (do patterns recur in 24-hour periods? Over the course of a month? Seasonally?). The convention for standardizing time is Coordinated Universal Time (UTC), an international time standard measured at 0 degrees longitude that stays consistent throughout the year (no daylight saving time). Berkeley's time zone, Pacific Standard Time (PST), is UTC-8; during daylight saving time (PDT), it is UTC-7.
+
+#### Temporality with `pandas`' `dt` accessors
+Let's briefly look at how we can use `pandas`' `dt` accessors to work with dates/times in a dataset using the dataset you'll see in Lab 3: the Berkeley PD Calls for Service dataset.
+
+```{python}
+#| code-fold: true
+calls = pd.read_csv("data/Berkeley_PD_-_Calls_for_Service.csv")
+calls.head()
+```
+
+Looks like there are three columns with dates/times: `EVENTDT`, `EVENTTM`, and `InDbDate`.
+
+Most likely, `EVENTDT` stands for the date when the event took place, `EVENTTM` stands for the time of day the event took place (in 24-hr format), and `InDbDate` is the date this call was recorded into the database.
+
+If we check the data type of these columns, we will see they are stored as strings. We can convert them to `datetime` objects using the `pandas` `to_datetime` function.
+
+```{python}
+#| code-fold: false
+calls["EVENTDT"] = pd.to_datetime(calls["EVENTDT"])
+calls.head()
+```
+
+Now, we can use the `dt` accessor on this column.
+
+We can get the month:
+
+```{python}
+#| code-fold: false
+calls["EVENTDT"].dt.month.head()
+```
+
+Which day of the week the date is on:
+
+```{python}
+#| code-fold: false
+calls["EVENTDT"].dt.dayofweek.head()
+```
+
+Check the minimum values to see if there are any suspicious-looking dates from the 1970s:
+
+```{python}
+#| code-fold: false
+calls.sort_values("EVENTDT").head()
+```
+
+Doesn't look like it! We are good!
+
+
+We can also do many things with the `dt` accessor like switching time zones and converting time back to UNIX/POSIX time. Check out the documentation on [`.dt` accessor](https://pandas.pydata.org/docs/user_guide/basics.html#basics-dt-accessors) and [time series/date functionality](https://pandas.pydata.org/docs/user_guide/timeseries.html#).
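+
+As a sketch of what that might look like on `EVENTDT` (how to handle daylight-saving edge cases is a judgment call; here ambiguous or nonexistent times simply become `NaT`):
+
+```python
+# Treat the naive timestamps as Pacific time, convert to UTC,
+# then express them as UNIX seconds.
+event_utc = (
+    calls["EVENTDT"]
+    .dt.tz_localize("US/Pacific", ambiguous="NaT", nonexistent="NaT")
+    .dt.tz_convert("UTC")
+)
+unix_seconds = (event_utc - pd.Timestamp("1970-01-01", tz="UTC")) // pd.Timedelta("1s")
+unix_seconds.head()
+```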
+
+## Faithfulness
+
+At this stage in our data cleaning and EDA workflow, we've achieved quite a lot: we've identified how our data is structured, come to terms with what information it encodes, and gained insight as to how it was generated. Throughout this process, we should always recall the original intent of our work in Data Science – to use data to better understand and model the real world. To achieve this goal, we need to ensure that the data we use is faithful to reality; that is, that our data accurately captures the "real world."
+
+Data used in research or industry is often "messy" – there may be errors or inaccuracies that impact the faithfulness of the dataset. Signs that data may not be faithful include:
+
+* Unrealistic or "incorrect" values, such as negative counts, locations that don't exist, or dates set in the future
+* Violations of obvious dependencies, like an age that does not match a birthday
+* Clear signs that data was entered by hand, which can lead to spelling errors or fields that are incorrectly shifted
+* Signs of data falsification, such as fake email addresses or repeated use of the same names
+* Duplicated records or fields containing the same information
+* Truncated data, e.g., older versions of Microsoft Excel limited the number of rows to 65,536 and the number of columns to 256
+
+We often solve some of these more common issues in the following ways:
+
+* Spelling errors: apply corrections or drop records that aren't in a dictionary
+* Time zone inconsistencies: convert to a common time zone (e.g. UTC)
+* Duplicated records or fields: identify and eliminate duplicates (using primary keys)
+* Unspecified or inconsistent units: infer the units and check that values are in reasonable ranges in the data
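+
+A few of these checks take only a line or two in `pandas`. A minimal sketch on a toy table (the column names are placeholders, not from a real dataset):
+
+```python
+import pandas as pd
+
+df = pd.DataFrame({
+    "age": [21, -3, 21, 47],
+    "email": ["a@x.com", "b@x.com", "a@x.com", "c@x.com"],
+})
+
+df[df["age"] < 0]                      # flag unrealistic values (e.g., negative ages)
+df.duplicated(subset=["email"]).sum()  # count potential duplicate records by key
+df.drop_duplicates(subset=["email"])   # drop duplicates on the chosen key
+```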
+
+### Missing Values
+Another common issue encountered with real-world datasets is that of missing data. One strategy to resolve this is to simply drop any records with missing values from the dataset. This does, however, introduce the risk of inducing biases – it is possible that the missing or corrupt records may be systematically related to some feature of interest in the data. Another solution is to keep the data as `NaN` values.
+
+A third method to address missing data is to perform **imputation**: infer the missing values using other data available in the dataset. There is a wide variety of imputation techniques that can be implemented; some of the most common are listed below.
+
+* Average imputation: replace missing values with the average value for that field
+* Hot deck imputation: replace missing values with a value drawn at random from observed records (often from similar records)
+* Regression imputation: develop a model to predict missing values
+* Multiple imputation: replace missing values with multiple random values
+
+Regardless of the strategy used to deal with missing data, we should think carefully about *why* particular records or fields may be missing – this can help inform whether or not the absence of these values is significant or meaningful.
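+
+As a tiny sketch of the first technique above, average imputation with `fillna` on made-up data:
+
+```python
+import numpy as np
+import pandas as pd
+
+# Made-up series with one missing value.
+temps = pd.Series([68.0, 70.0, np.nan, 75.0])
+
+# Average imputation: fill the missing entry with the mean of the observed values.
+temps.fillna(temps.mean())
+```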
+
+# EDA Demo 1: Tuberculosis in the United States
+
+Now, let's walk through the data-cleaning and EDA workflow to see what we can learn about the presence of Tuberculosis in the United States!
+
+We will examine the data included in the [original CDC article](https://www.cdc.gov/mmwr/volumes/71/wr/mm7112a1.htm?s_cid=mm7112a1_w#T1_down) published in 2021.
+
+
+## CSVs and Field Names
+Suppose Table 1 was saved as a CSV file located in `data/cdc_tuberculosis.csv`.
+
+We can then explore the CSV (which is a text file, and does not contain binary-encoded data) in many ways:
+
+1. Using a text editor like emacs, vim, VSCode, etc.
+2. Opening the CSV directly in DataHub (read-only), Excel, Google Sheets, etc.
+3. The `Python` file object
+4. `pandas`, using `pd.read_csv()`
+
+To try out options 1 and 2, you can view or download the Tuberculosis data from the [lecture demo notebook](https://data100.datahub.berkeley.edu/hub/user-redirect/git-pull?repo=https%3A%2F%2Fgithub.com%2FDS-100%2Ffa23-student&urlpath=lab%2Ftree%2Ffa23-student%2Flecture%2Flec05%2Flec04-eda.ipynb&branch=main) under the `data` folder in the left-hand menu. Notice how the CSV file is a type of **rectangular data (i.e., tabular data) stored as comma-separated values**.
+
+Next, let's try out option 3 using the `Python` file object. We'll look at the first four lines:
+
+```{python}
+#| code-fold: true
+with open("data/cdc_tuberculosis.csv", "r") as f:
+ i = 0
+ for row in f:
+ print(row)
+ i += 1
+ if i > 3:
+ break
+```
+
+Whoa, why are there blank lines interspersed between the lines of the CSV?
+
+You may recall that all line breaks in text files are encoded as the special newline character `\n`. Python's `print()` prints each string (which already ends in a newline) and then adds an additional newline on top of that.
+
+If you're curious, we can use the `repr()` function to return the raw string with all special characters:
+
+```{python}
+#| code-fold: true
+with open("data/cdc_tuberculosis.csv", "r") as f:
+ i = 0
+ for row in f:
+ print(repr(row)) # print raw strings
+ i += 1
+ if i > 3:
+ break
+```
+
+Finally, let's try option 4 and use the tried-and-true Data 100 approach: `pandas`.
+
+```{python}
+#| code-fold: false
+tb_df = pd.read_csv("data/cdc_tuberculosis.csv")
+tb_df.head()
+```
+
+You may notice some strange things about this table: what's up with the "Unnamed" column names and the first row?
+
+Congratulations — you're ready to wrangle your data! Because of how things are stored, we'll need to clean the data a bit to name our columns better.
+
+A reasonable first step is to identify the row with the right header. The `pd.read_csv()` function ([documentation](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html)) has the convenient `header` parameter that we can set to use the elements in row 1 as the appropriate columns:
+
+```{python}
+#| code-fold: false
+tb_df = pd.read_csv("data/cdc_tuberculosis.csv", header=1) # row index
+tb_df.head(5)
+```
+
+Wait...but now we can't differentiate between the "Number of TB cases" and "TB incidence" year columns. `pandas` has tried to make our lives easier by automatically adding ".1" to the latter columns, but this doesn't help us, as humans, understand the data.
+
+We can do this manually with `df.rename()` ([documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.rename.html?highlight=rename#pandas.DataFrame.rename)):
+
+```{python}
+#| code-fold: false
+rename_dict = {'2019': 'TB cases 2019',
+ '2020': 'TB cases 2020',
+ '2021': 'TB cases 2021',
+ '2019.1': 'TB incidence 2019',
+ '2020.1': 'TB incidence 2020',
+ '2021.1': 'TB incidence 2021'}
+tb_df = tb_df.rename(columns=rename_dict)
+tb_df.head(5)
+```
+
+## Record Granularity
+
+You might already be wondering: what's up with that first record?
+
+Row 0 is what we call a **rollup record**, or summary record. It's often useful when displaying tables to humans. The **granularity** of record 0 (Totals) vs the rest of the records (States) is different.
+
+Okay, EDA step two. How was the rollup record aggregated?
+
+Let's check if Total TB cases is the sum of all state TB cases. If we sum over all rows, we should get **2x** the total cases in each of the TB cases year columns (why do you think this is?).
+
+```{python}
+#| code-fold: true
+tb_df.sum(axis=0)
+```
+
+Whoa, what's going on with the TB cases in 2019, 2020, and 2021? Check out the column types:
+
+```{python}
+#| code-fold: true
+tb_df.dtypes
+```
+
+Since there are commas in the values for TB cases, the numbers are read as the `object` datatype, or **storage type** (close to the `Python` string datatype), so `pandas` is concatenating strings instead of adding integers (recall that `Python` can "sum", or concatenate, strings together: `"data" + "100"` evaluates to `"data100"`).
+
+
+Fortunately `read_csv` also has a `thousands` parameter ([documentation](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html)):
+
+```{python}
+#| code-fold: false
+# improve readability: chaining method calls with outer parentheses/line breaks
+tb_df = (
+ pd.read_csv("data/cdc_tuberculosis.csv", header=1, thousands=',')
+ .rename(columns=rename_dict)
+)
+tb_df.head(5)
+```
+
+```{python}
+#| code-fold: false
+tb_df.sum()
+```
+
+The Total TB cases look right. Phew!
+
+Let's just look at the records with **state-level granularity**:
+
+```{python}
+#| code-fold: true
+state_tb_df = tb_df[1:]
+state_tb_df.head(5)
+```
+
+## Gather Census Data
+
+U.S. Census population estimates [source](https://www.census.gov/data/tables/time-series/demo/popest/2010s-state-total.html) (2019), [source](https://www.census.gov/data/tables/time-series/demo/popest/2020s-state-total.html) (2020-2021).
+
+Running the cells below cleans the data.
+
+There are a few new methods here:
+
+* `df.convert_dtypes()` ([documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.convert_dtypes.html)) conveniently converts all float dtypes into ints; the details are out of scope for this class.
+* `df.dropna()` ([documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.dropna.html)) will be explained in more detail next time.
+
+```{python}
+#| code-fold: true
+# 2010s census data
+census_2010s_df = pd.read_csv("data/nst-est2019-01.csv", header=3, thousands=",")
+census_2010s_df = (
+ census_2010s_df
+ .reset_index()
+ .drop(columns=["index", "Census", "Estimates Base"])
+ .rename(columns={"Unnamed: 0": "Geographic Area"})
+ .convert_dtypes() # "smart" converting of columns, use at your own risk
+ .dropna() # we'll introduce this next time
+)
+census_2010s_df['Geographic Area'] = census_2010s_df['Geographic Area'].str.strip('.')
+
+# with pd.option_context('display.min_rows', 30): # shows more rows
+# display(census_2010s_df)
+
+census_2010s_df.head(5)
+```
+
+Occasionally, you will want to modify code that you have imported. To reimport those modifications, you can either use `Python`'s `importlib` library:
+
+```python
+from importlib import reload
+reload(utils)
+```
+
+or use `iPython` magic which will intelligently import code when files change:
+
+```python
+%load_ext autoreload
+%autoreload 2
+```
+
+```{python}
+#| code-fold: true
+# census 2020s data
+census_2020s_df = pd.read_csv("data/NST-EST2022-POP.csv", header=3, thousands=",")
+census_2020s_df = (
+ census_2020s_df
+ .reset_index()
+ .drop(columns=["index", "Unnamed: 1"])
+ .rename(columns={"Unnamed: 0": "Geographic Area"})
+ .convert_dtypes() # "smart" converting of columns, use at your own risk
+ .dropna() # we'll introduce this next time
+)
+census_2020s_df['Geographic Area'] = census_2020s_df['Geographic Area'].str.strip('.')
+
+census_2020s_df.head(5)
+```
+
+## Joining Data (Merging `DataFrame`s)
+
+Time to `merge`! Here we use the `DataFrame` method `df1.merge(right=df2, ...)` on `DataFrame` `df1` ([documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.merge.html)). Contrast this with the function `pd.merge(left=df1, right=df2, ...)` ([documentation](https://pandas.pydata.org/docs/reference/api/pandas.merge.html?highlight=pandas%20merge#pandas.merge)). Feel free to use either.
+
+```{python}
+#| code-fold: false
+# merge TB DataFrame with two US census DataFrames
+tb_census_df = (
+ tb_df
+ .merge(right=census_2010s_df,
+ left_on="U.S. jurisdiction", right_on="Geographic Area")
+ .merge(right=census_2020s_df,
+ left_on="U.S. jurisdiction", right_on="Geographic Area")
+)
+tb_census_df.head(5)
+```
+
+Having all of these columns is a little unwieldy. We could either drop the unneeded columns now, or just merge on smaller census `DataFrame`s. Let's do the latter.
+
+```{python}
+#| code-fold: false
+# try merging again, but cleaner this time
+tb_census_df = (
+ tb_df
+ .merge(right=census_2010s_df[["Geographic Area", "2019"]],
+ left_on="U.S. jurisdiction", right_on="Geographic Area")
+ .drop(columns="Geographic Area")
+ .merge(right=census_2020s_df[["Geographic Area", "2020", "2021"]],
+ left_on="U.S. jurisdiction", right_on="Geographic Area")
+ .drop(columns="Geographic Area")
+)
+tb_census_df.head(5)
+```
+
+## Reproducing Data: Compute Incidence
+
+Let's recompute incidence to make sure we know where the original CDC numbers came from.
+
+From the [CDC report](https://www.cdc.gov/mmwr/volumes/71/wr/mm7112a1.htm?s_cid=mm7112a1_w#T1_down): TB incidence is computed as “Cases per 100,000 persons using mid-year population estimates from the U.S. Census Bureau.”
+
+If we define a group as 100,000 people, then we can compute the TB incidence for a given state population as
+
+$$\text{TB incidence} = \frac{\text{TB cases in population}}{\text{groups in population}} = \frac{\text{TB cases in population}}{\text{population}/100000} $$
+
+$$= \frac{\text{TB cases in population}}{\text{population}} \times 100000$$
+
+Let's try this for 2019:
+
+```{python}
+#| code-fold: false
+tb_census_df["recompute incidence 2019"] = tb_census_df["TB cases 2019"]/tb_census_df["2019"]*100000
+tb_census_df.head(5)
+```
+
+Awesome!!!
+
+Let's use a for-loop and `Python` format strings to compute TB incidence for all years. `Python` f-strings are just used for the purposes of this demo, but they're handy to know when you explore data beyond this course ([documentation](https://docs.python.org/3/tutorial/inputoutput.html)).
+
+```{python}
+#| code-fold: false
+# recompute incidence for all years
+for year in [2019, 2020, 2021]:
+ tb_census_df[f"recompute incidence {year}"] = tb_census_df[f"TB cases {year}"]/tb_census_df[f"{year}"]*100000
+tb_census_df.head(5)
+```
+
+These numbers look pretty close!!! There are a few errors in the hundredths place, particularly in 2021. It may be useful to further explore reasons behind this discrepancy.
+
+```{python}
+#| code-fold: false
+tb_census_df.describe()
+```
+
+## Bonus EDA: Reproducing the Reported Statistic
+
+
+**How do we reproduce that reported statistic in the original [CDC report](https://www.cdc.gov/mmwr/volumes/71/wr/mm7112a1.htm?s_cid=mm7112a1_w)?**
+
+> Reported TB incidence (cases per 100,000 persons) increased **9.4%**, from **2.2** during 2020 to **2.4** during 2021 but was lower than incidence during 2019 (2.7). Increases occurred among both U.S.-born and non–U.S.-born persons.
+
+This is TB incidence computed across the entire U.S. population! How do we reproduce this?
+
+* We need to reproduce the "Total" TB incidences in our rolled record.
+* But our current `tb_census_df` only has 51 entries (50 states plus Washington, D.C.). There is no rolled record.
+* What happened...?
+
+Let's get exploring!
+
+Before we keep exploring, we'll set all indexes to more meaningful values, instead of just numbers that pertain to some row at some point. This will make our cleaning slightly easier.
+
+```{python}
+#| code-fold: true
+tb_df = tb_df.set_index("U.S. jurisdiction")
+tb_df.head(5)
+```
+
+```{python}
+#| code-fold: false
+census_2010s_df = census_2010s_df.set_index("Geographic Area")
+census_2010s_df.head(5)
+```
+
+```{python}
+#| code-fold: false
+census_2020s_df = census_2020s_df.set_index("Geographic Area")
+census_2020s_df.head(5)
+```
+
+It turns out that our merge above only kept state records, even though our original `tb_df` had the "Total" rolled record:
+
+```{python}
+#| code-fold: false
+tb_df.head()
+```
+
+Recall that `merge` does an **inner** merge by default, meaning that it only preserves keys that are present in **both** `DataFrame`s.
+
+The rolled records in our census `DataFrame` have different `Geographic Area` fields, which was the key we merged on:
+
+```{python}
+#| code-fold: false
+census_2010s_df.head(5)
+```
+
+The Census `DataFrame` has several rolled records. The aggregate record we are looking for actually has the Geographic Area named "United States".
+
+One straightforward way to get the right merge is to rename the value itself. Because we now have the Geographic Area index, we'll use `df.rename()` ([documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.rename.html)):
+
+```{python}
+#| code-fold: false
+# rename rolled record for 2010s
+census_2010s_df.rename(index={'United States':'Total'}, inplace=True)
+census_2010s_df.head(5)
+```
+
+```{python}
+#| code-fold: false
+# same, but for 2020s rename rolled record
+census_2020s_df.rename(index={'United States':'Total'}, inplace=True)
+census_2020s_df.head(5)
+```
+
+<br/>
+
+Next let's rerun our merge. Note the different chaining, because we are now merging on indexes (`df.merge()` [documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.merge.html)).
+
+```{python}
+#| code-fold: false
+tb_census_df = (
+ tb_df
+ .merge(right=census_2010s_df[["2019"]],
+ left_index=True, right_index=True)
+ .merge(right=census_2020s_df[["2020", "2021"]],
+ left_index=True, right_index=True)
+)
+tb_census_df.head(5)
+```
+
+<br/>
+
+Finally, let's recompute our incidences:
+
+```{python}
+#| code-fold: false
+# recompute incidence for all years
+for year in [2019, 2020, 2021]:
+ tb_census_df[f"recompute incidence {year}"] = tb_census_df[f"TB cases {year}"]/tb_census_df[f"{year}"]*100000
+tb_census_df.head(5)
+```
+
+We reproduced the total U.S. incidences correctly!
+
+We're almost there. Let's revisit the quote:
+
+> Reported TB incidence (cases per 100,000 persons) increased **9.4%**, from **2.2** during 2020 to **2.4** during 2021 but was lower than incidence during 2019 (2.7). Increases occurred among both U.S.-born and non–U.S.-born persons.
+
+Recall that percent change from $A$ to $B$ is computed as
+$\text{percent change} = \frac{B - A}{A} \times 100$.
+
+```{python}
+#| code-fold: false
+#| tags: []
+incidence_2020 = tb_census_df.loc['Total', 'recompute incidence 2020']
+incidence_2020
+```
+
+```{python}
+#| code-fold: false
+#| tags: []
+incidence_2021 = tb_census_df.loc['Total', 'recompute incidence 2021']
+incidence_2021
+```
+
+```{python}
+#| code-fold: false
+#| tags: []
+difference = (incidence_2021 - incidence_2020)/incidence_2020 * 100
+difference
+```
+
+# EDA Demo 2: Mauna Loa CO<sub>2</sub> Data -- A Lesson in Data Faithfulness
+
+[Mauna Loa Observatory](https://gml.noaa.gov/ccgg/trends/data.html) has been monitoring CO<sub>2</sub> concentrations since 1958
+
+```{python}
+#| code-fold: false
+co2_file = "data/co2_mm_mlo.txt"
+```
+
+Let's do some **EDA**!!
+
+## Reading this file into Pandas?
+Let's instead check out this `.txt` file. Some questions to keep in mind: Do we trust this file extension? What structure is it?
+
+Lines 71-78 (inclusive) are shown below:
+
+ line number | file contents
+
+ 71 | # decimal average interpolated trend #days
+ 72 | # date (season corr)
+ 73 | 1958 3 1958.208 315.71 315.71 314.62 -1
+ 74 | 1958 4 1958.292 317.45 317.45 315.29 -1
+ 75 | 1958 5 1958.375 317.50 317.50 314.71 -1
+ 76 | 1958 6 1958.458 -99.99 317.10 314.85 -1
+ 77 | 1958 7 1958.542 315.86 315.86 314.98 -1
+ 78 | 1958 8 1958.625 314.93 314.93 315.94 -1
+
+
+Notice how:
+
+- The values are separated by white space, possibly tabs.
+- The values line up down the rows. For example, the month appears in the 7th to 8th position of each line.
+- The 71st and 72nd lines in the file contain column headings split over two lines.
+
+We can use `read_csv` to read the data into a `pandas` `DataFrame`, and we provide several arguments to specify that the separators are white space, there is no header (**we will set our own column names**), and to skip the first 72 rows of the file.
+
+```{python}
+#| code-fold: false
+co2 = pd.read_csv(
+ co2_file, header = None, skiprows = 72,
+    sep = r'\s+' # delimiter for continuous whitespace (stay tuned for regex next lecture)
+)
+co2.head()
+```
+
+Congratulations! You've wrangled the data!
+
+<br/>
+
+...But our columns aren't named.
+**We need to do more EDA.**
+
+## Exploring Variable Feature Types
+
+The NOAA [webpage](https://gml.noaa.gov/ccgg/trends/) might have some useful tidbits (in this case it doesn't).
+
+Using this information, we'll rerun `pd.read_csv`, but this time with some **custom column names.**
+
+```{python}
+#| code-fold: false
+co2 = pd.read_csv(
+ co2_file, header = None, skiprows = 72,
+    sep = r'\s+', # regex for continuous whitespace (next lecture)
+ names = ['Yr', 'Mo', 'DecDate', 'Avg', 'Int', 'Trend', 'Days']
+)
+co2.head()
+```
+
+## Visualizing CO<sub>2</sub>
+Scientific studies tend to have very clean data, right...? Let's jump right in and make a time series plot of CO2 monthly averages.
+
+```{python}
+#| code-fold: true
+sns.lineplot(x='DecDate', y='Avg', data=co2);
+```
+
+The code above uses the `seaborn` plotting library (abbreviated `sns`). We will cover it in the Visualization lecture; for now, you don't need to worry about how it works!
+
+Yikes! Plotting the data uncovered a problem. The sharp vertical lines suggest that we have some **missing values**. What happened here?
+
+```{python}
+#| code-fold: false
+co2.head()
+```
+
+```{python}
+#| code-fold: false
+co2.tail()
+```
+
+Some data have unusual values like -1 and -99.99.
+
+Let's check the description at the top of the file again.
+
+* -1 signifies a missing value for the number of days `Days` the equipment was in operation that month.
+* -99.99 denotes a missing monthly average `Avg`
+
+How can we fix this? First, let's explore other aspects of our data. Understanding our data will help us decide what to do with the missing values.
+
+<br/>
+
+
+## Sanity Checks: Reasoning about the data
+First, we consider the shape of the data. How many rows should we have?
+
+* If the data are in chronological order, we should have exactly one record per month.
+* The data run from March 1958 to August 2019.
+* We should have $12 \times (2019-1957) - 2 - 4 = 738$ records: 12 months for each of the 62 years from 1958 through 2019, minus January and February of 1958 and September through December of 2019 (checked below).
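+Checking that arithmetic in plain Python (nothing here depends on the dataset itself):
+
+```python
+full_years = 2019 - 1957                     # the 62 calendar years 1958 through 2019
+expected_records = 12 * full_years - 2 - 4   # drop Jan-Feb 1958 and Sep-Dec 2019
+expected_records                             # 738
+```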
+
+```{python}
+#| code-fold: false
+co2.shape
+```
+
+Nice!! The number of rows (i.e., records) matches our expectations.
+
+<br/>
+
+
+Let's now check the quality of each feature.
+
+## Understanding Missing Value 1: `Days`
+`Days` is a time field, so let's analyze other time fields to see if there is an explanation for missing values of days of operation.
+
+Let's start with **months**, `Mo`.
+
+Are we missing any records? Each month should appear 61 or 62 times (the data run from March 1958 to August 2019).
+
+```{python}
+#| code-fold: false
+co2["Mo"].value_counts().sort_index()
+```
+
+As expected, Jan, Feb, Sep, Oct, Nov, and Dec have 61 occurrences each, and the rest have 62.
+
+<br/>
+
+Next, let's explore **days** (`Days`) itself, which is the number of days that the measurement equipment worked that month.
+
+```{python}
+#| code-fold: true
+sns.displot(co2['Days']);
+plt.title("Distribution of days feature"); # suppresses unneeded plotting output
+```
+
+In terms of data quality, a handful of months have averages based on measurements taken on fewer than half the days. In addition, there are nearly 200 missing values--**that's about 27% of the data**!
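+We can verify that count directly (a quick sketch; recall that -1 is the sentinel for a missing `Days` value):
+
+```python
+n_missing_days = (co2["Days"] == -1).sum()
+n_missing_days, round(n_missing_days / len(co2) * 100, 1)  # count and percentage of all 738 records
+```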
+
+<br/>
+
+Finally, let's check the last time feature, **year** `Yr`.
+
+Let's check to see if there is any connection between missing-ness and the year of the recording.
+
+```{python}
+#| code-fold: true
+sns.scatterplot(x="Yr", y="Days", data=co2);
+plt.title("Day field by Year"); # the ; suppresses output
+```
+
+**Observations**:
+
+* All of the missing data are in the early years of operation.
+* It appears there may have been problems with equipment in the mid to late 80s.
+
+**Potential Next Steps**:
+
+* Confirm these explanations through documentation about the historical readings.
+* Maybe drop the earliest recordings? However, we would want to delay such action until after we have examined the time trends and assessed whether there are any potential problems.
+
+<br/>
+
+## Understanding Missing Value 2: `Avg`
+Next, let's return to the -99.99 values in `Avg` to analyze the overall quality of the CO2 measurements. We'll plot a histogram of the average CO<sub>2</sub> measurements:
+
+```{python}
+#| code-fold: true
+# Histograms of average CO2 measurements
+sns.displot(co2['Avg']);
+```
+
+The non-missing values fall in the 300-400 ppm range, a typical range for atmospheric CO2 levels.
+
+We also see that there are only a few missing `Avg` values (**<1% of values**). Let's examine all of them:
+
+```{python}
+#| code-fold: false
+co2[co2["Avg"] < 0]
+```
+
+There doesn't seem to be a pattern to these records, other than that most of them were also missing `Days` data.
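+We can quantify that overlap (a quick sketch):
+
+```python
+missing_avg = co2[co2["Avg"] < 0]
+(missing_avg["Days"] == -1).mean()  # fraction of the missing-Avg records that are also missing Days
+```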
+
+## Drop, `NaN`, or Impute Missing `Avg` Data?
+
+How should we address the invalid `Avg` data?
+
+1. Drop records
+2. Set to NaN
+3. Impute using some strategy
+
+Remember we want to fix the following plot:
+
+```{python}
+#| code-fold: true
+sns.lineplot(x='DecDate', y='Avg', data=co2)
+plt.title("CO2 Average By Month");
+```
+
+Since we are plotting `Avg` vs `DecDate`, we should just focus on dealing with missing values for `Avg`.
+
+
+Let's consider a few options:
+
+1. Drop those records
+2. Replace -99.99 with NaN
+3. Substitute the -99.99 values with a likely value for the average CO2
+
+What do you think are the pros and cons of each possible action?
+
+<br/>
+
+
+Let's examine each of these three options.
+
+```{python}
+#| code-fold: false
+# 1. Drop missing values
+co2_drop = co2[co2['Avg'] > 0]
+co2_drop.head()
+```
+
+```{python}
+#| code-fold: false
+# 2. Replace -99.99 with NaN
+co2_NA = co2.replace(-99.99, np.nan)
+co2_NA.head()
+```
+
+We'll also use a third version of the data.
+
+First, we note that the dataset already comes with a **substitute value** for the -99.99.
+
+From the file description:
+
+> The `interpolated` column includes average values from the preceding column (`average`)
+and **interpolated values** where data are missing. Interpolated values are
+computed in two steps...
+
+The `Int` feature has values that exactly match those in `Avg`, except when `Avg` is -99.99, in which case a **reasonable** estimate is used instead.
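+We can sanity-check that description by comparing the two columns on the rows where `Avg` is not the -99.99 sentinel (a quick sketch):
+
+```python
+valid = co2["Avg"] > 0
+(co2.loc[valid, "Avg"] == co2.loc[valid, "Int"]).all()  # expected to be True, per the file description
+```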
+
+So, the third version of our data will use the `Int` feature instead of `Avg`.
+
+```{python}
+#| code-fold: false
+# 3. Use interpolated column which estimates missing Avg values
+co2_impute = co2.copy()
+co2_impute['Avg'] = co2['Int']
+co2_impute.head()
+```
+
+What's a **reasonable** estimate?
+
+To answer this question, let's zoom in on a short time period, say the measurements in 1958 (where we know we have two missing values).
+
+```{python}
+#| code-fold: true
+# results of plotting data in 1958
+
+def line_and_points(data, ax, title):
+ # assumes single year, hence Mo
+ ax.plot('Mo', 'Avg', data=data)
+ ax.scatter('Mo', 'Avg', data=data)
+ ax.set_xlim(2, 13)
+ ax.set_title(title)
+ ax.set_xticks(np.arange(3, 13))
+
+def data_year(data, year):
+    return data[data["Yr"] == year]
+
+# uses matplotlib subplots
+# you may see more next week; focus on output for now
+fig, axes = plt.subplots(ncols = 3, figsize=(12, 4), sharey=True)
+
+year = 1958
+line_and_points(data_year(co2_drop, year), axes[0], title="1. Drop Missing")
+line_and_points(data_year(co2_NA, year), axes[1], title="2. Missing Set to NaN")
+line_and_points(data_year(co2_impute, year), axes[2], title="3. Missing Interpolated")
+
+fig.suptitle(f"Monthly Averages for {year}")
+plt.tight_layout()
+```
+
+In the big picture, since there are only 7 `Avg` values missing (**<1%** of the 738 months), any of these approaches would work.
+
+However, there is some appeal to **option 3, imputing**:
+
+* It shows the seasonal trends for CO2.
+* We are plotting all months in our data as a line plot, and imputed values keep that line unbroken.
+
+<br/>
+
+
+Let's replot our original figure with option 3:
+
+```{python}
+#| code-fold: true
+sns.lineplot(x='DecDate', y='Avg', data=co2_impute)
+plt.title("CO2 Average By Month, Imputed");
+```
+
+Looks pretty close to what we see on the NOAA [website](https://gml.noaa.gov/ccgg/trends/)!
+
+## Presenting the data: A Discussion on Data Granularity
+
+From the description:
+
+* Monthly measurements are averages of daily average measurements.
+* The NOAA GML website has datasets for daily/hourly measurements too.
+
+The granularity of the data you present depends on your research question.
+
+**How do CO2 levels vary by season?**
+
+* You might want to keep average monthly data.
+
+**Are CO2 levels rising over the past 50+ years, consistent with global warming predictions?**
+
+* You might be happier with a **coarser granularity** of average year data!
+
+```{python}
+#| code-fold: true
+co2_year = co2_impute.groupby('Yr').mean()
+sns.lineplot(x='Yr', y='Avg', data=co2_year)
+plt.title("CO2 Average By Year");
+```
+
+Indeed, we see a rise by nearly 100 ppm of CO2 since Mauna Loa began recording in 1958.
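+We can put an approximate number on that rise using the yearly table we just computed (a quick sketch; the exact figure depends on the vintage of the data file):
+
+```python
+rise = co2_year["Avg"].iloc[-1] - co2_year["Avg"].iloc[0]
+rise  # difference between the last and first yearly averages, on the order of 100 ppm
+```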
+
+# Summary
+We went over a lot of content this lecture; let's summarize the most important points:
+
+## Dealing with Missing Values
+There are a few options we can take to deal with missing data (a generic sketch in `pandas` follows this list):
+
+* Drop missing records
+* Keep `NaN` missing values
+* Impute using an interpolated column
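+These options map onto standard `pandas` methods. Here is a minimal, generic sketch on a toy `DataFrame` (an illustration only, not the demo code above; it assumes missing entries are already coded as `NaN`):
+
+```python
+import numpy as np
+import pandas as pd
+
+toy = pd.DataFrame({"month": [1, 2, 3, 4], "avg": [315.7, np.nan, 317.5, 317.1]})
+
+dropped = toy.dropna()                              # 1. drop records with missing values
+kept = toy.copy()                                   # 2. keep the NaN values and handle them downstream
+imputed = toy.assign(avg=toy["avg"].interpolate())  # 3. impute, here by linear interpolation
+imputed
+```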
+
+## EDA and Data Wrangling
+There are several ways to approach EDA and Data Wrangling:
+
+* Examine the **data and metadata**: what is the date, size, organization, and structure of the data?
+* Examine each **field/attribute/dimension** individually.
+* Examine pairs of related dimensions (e.g. breaking down grades by major).
+* Along the way, we can:
+ * **Visualize** or summarize the data.
+ * **Validate assumptions** about data and its collection process. Pay particular attention to when the data was collected.
+ * Identify and **address anomalies**.
+ * Apply data transformations and corrections (we'll cover this in the upcoming lecture).
+ * **Record everything you do!** Developing in Jupyter Notebook promotes *reproducibility* of your own work!
diff --git a/docs/eda/eda_files/figure-html/cell-62-output-1.png b/docs/eda/eda_files/figure-html/cell-62-output-1.png
index a04218cf..f392d5f9 100644
Binary files a/docs/eda/eda_files/figure-html/cell-62-output-1.png and b/docs/eda/eda_files/figure-html/cell-62-output-1.png differ
diff --git a/docs/eda/eda_files/figure-html/cell-67-output-1.png b/docs/eda/eda_files/figure-html/cell-67-output-1.png
new file mode 100644
index 00000000..be96b8c9
Binary files /dev/null and b/docs/eda/eda_files/figure-html/cell-67-output-1.png differ
diff --git a/docs/eda/eda_files/figure-html/cell-67-output-2.png b/docs/eda/eda_files/figure-html/cell-67-output-2.png
deleted file mode 100644
index 31857f62..00000000
Binary files a/docs/eda/eda_files/figure-html/cell-67-output-2.png and /dev/null differ
diff --git a/docs/eda/eda_files/figure-html/cell-68-output-1.png b/docs/eda/eda_files/figure-html/cell-68-output-1.png
index 67c3959d..ffd29ff8 100644
Binary files a/docs/eda/eda_files/figure-html/cell-68-output-1.png and b/docs/eda/eda_files/figure-html/cell-68-output-1.png differ
diff --git a/docs/eda/eda_files/figure-html/cell-69-output-1.png b/docs/eda/eda_files/figure-html/cell-69-output-1.png
new file mode 100644
index 00000000..29088928
Binary files /dev/null and b/docs/eda/eda_files/figure-html/cell-69-output-1.png differ
diff --git a/docs/eda/eda_files/figure-html/cell-69-output-2.png b/docs/eda/eda_files/figure-html/cell-69-output-2.png
deleted file mode 100644
index fb28f5d5..00000000
Binary files a/docs/eda/eda_files/figure-html/cell-69-output-2.png and /dev/null differ
diff --git a/docs/eda/eda_files/figure-html/cell-71-output-1.png b/docs/eda/eda_files/figure-html/cell-71-output-1.png
index 39cac822..49ef3d6a 100644
Binary files a/docs/eda/eda_files/figure-html/cell-71-output-1.png and b/docs/eda/eda_files/figure-html/cell-71-output-1.png differ
diff --git a/docs/eda/eda_files/figure-html/cell-75-output-1.png b/docs/eda/eda_files/figure-html/cell-75-output-1.png
index 6382e58a..15a5fe82 100644
Binary files a/docs/eda/eda_files/figure-html/cell-75-output-1.png and b/docs/eda/eda_files/figure-html/cell-75-output-1.png differ
diff --git a/docs/eda/eda_files/figure-html/cell-76-output-1.png b/docs/eda/eda_files/figure-html/cell-76-output-1.png
index db2b0dee..40b1fc71 100644
Binary files a/docs/eda/eda_files/figure-html/cell-76-output-1.png and b/docs/eda/eda_files/figure-html/cell-76-output-1.png differ
diff --git a/docs/eda/eda_files/figure-html/cell-77-output-1.png b/docs/eda/eda_files/figure-html/cell-77-output-1.png
index 897b8b39..99b6c2d1 100644
Binary files a/docs/eda/eda_files/figure-html/cell-77-output-1.png and b/docs/eda/eda_files/figure-html/cell-77-output-1.png differ
diff --git a/docs/feature_engineering/feature_engineering.html b/docs/feature_engineering/feature_engineering.html
index ea770e7f..22d26788 100644
--- a/docs/feature_engineering/feature_engineering.html
+++ b/docs/feature_engineering/feature_engineering.html
@@ -556,7 +556,7 @@
my_model.fit(X, Y)
-LinearRegression()
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.LinearRegression()
+LinearRegression()
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.LinearRegression()
Code
-# results of plotting data in 1958
-
-def line_and_points(data, ax, title):
- # assumes single year, hence Mo
- ax.plot('Mo', 'Avg', data=data)
- ax.scatter('Mo', 'Avg', data=data)
- ax.set_xlim(2, 13)
- ax.set_title(title)
- ax.set_xticks(np.arange(3, 13))
-
-def data_year(data, year):
- return data[data["Yr"] == 1958]
-
-# uses matplotlib subplots
-# you may see more next week; focus on output for now
-fig, axes = plt.subplots(ncols = 3, figsize=(12, 4), sharey=True)
-
-year = 1958
-line_and_points(data_year(co2_drop, year), axes[0], title="1. Drop Missing")
-line_and_points(data_year(co2_NA, year), axes[1], title="2. Missing Set to NaN")
-line_and_points(data_year(co2_impute, year), axes[2], title="3. Missing Interpolated")
-
-fig.suptitle(f"Monthly Averages for {year}")
-plt.tight_layout()
+# results of plotting data in 1958
+
+def line_and_points(data, ax, title):
+ # assumes single year, hence Mo
+ ax.plot('Mo', 'Avg', data=data)
+ ax.scatter('Mo', 'Avg', data=data)
+ ax.set_xlim(2, 13)
+ ax.set_title(title)
+ ax.set_xticks(np.arange(3, 13))
+
+def data_year(data, year):
+ return data[data["Yr"] == 1958]
+
+# uses matplotlib subplots
+# you may see more next week; focus on output for now
+fig, axes = plt.subplots(ncols = 3, figsize=(12, 4), sharey=True)
+
+year = 1958
+line_and_points(data_year(co2_drop, year), axes[0], title="1. Drop Missing")
+line_and_points(data_year(co2_NA, year), axes[1], title="2. Missing Set to NaN")
+line_and_points(data_year(co2_impute, year), axes[2], title="3. Missing Interpolated")
+
+fig.suptitle(f"Monthly Averages for {year}")
+plt.tight_layout()
Code
-
+
@@ -4632,9 +4620,9 @@
Code
-
+
@@ -4975,1218 +4963,1218 @@ <
Source Code
----
-title: Data Cleaning and EDA
-execute:
- echo: true
-format:
- html:
- code-fold: true
- code-tools: true
- toc: true
- toc-title: Data Cleaning and EDA
- page-layout: full
- theme:
- - cosmo
- - cerulean
- callout-icon: false
-jupyter: python3
----
-
-```{python}
-#| code-fold: true
-import numpy as np
-import pandas as pd
-
-import matplotlib.pyplot as plt
-import seaborn as sns
-#%matplotlib inline
-plt.rcParams['figure.figsize'] = (12, 9)
-
-sns.set()
-sns.set_context('talk')
-np.set_printoptions(threshold=20, precision=2, suppress=True)
-pd.set_option('display.max_rows', 30)
-pd.set_option('display.max_columns', None)
-pd.set_option('display.precision', 2)
-# This option stops scientific notation for pandas
-pd.set_option('display.float_format', '{:.2f}'.format)
-
-# Silence some spurious seaborn warnings
-import warnings
-warnings.filterwarnings("ignore", category=FutureWarning)
-```
-
-::: {.callout-note collapse="false"}
-## Learning Outcomes
-* Recognize common file formats
-* Categorize data by its variable type
-* Build awareness of issues with data faithfulness and develop targeted solutions
-:::
-
-**This content is covered in lectures 4, 5, and 6.**
-
-In the past few lectures, we've learned that `pandas` is a toolkit to restructure, modify, and explore a dataset. What we haven't yet touched on is *how* to make these data transformation decisions. When we receive a new set of data from the "real world," how do we know what processing we should do to convert this data into a usable form?
-
-**Data cleaning**, also called **data wrangling**, is the process of transforming raw data to facilitate subsequent analysis. It is often used to address issues like:
-
-* Unclear structure or formatting
-* Missing or corrupted values
-* Unit conversions
-* ...and so on
-
-**Exploratory Data Analysis (EDA)** is the process of understanding a new dataset. It is an open-ended, informal analysis that involves familiarizing ourselves with the variables present in the data, discovering potential hypotheses, and identifying possible issues with the data. This last point can often motivate further data cleaning to address any problems with the dataset's format; because of this, EDA and data cleaning are often thought of as an "infinite loop," with each process driving the other.
-
-In this lecture, we will consider the key properties of data to consider when performing data cleaning and EDA. In doing so, we'll develop a "checklist" of sorts for you to consider when approaching a new dataset. Throughout this process, we'll build a deeper understanding of this early (but very important!) stage of the data science lifecycle.
-
-## Structure
-
-### File Formats
-There are many file types for storing structured data: TSV, JSON, XML, ASCII, SAS, etc. We'll only cover CSV, TSV, and JSON in lecture, but you'll likely encounter other formats as you work with different datasets. Reading documentation is your best bet for understanding how to process the multitude of different file types.
-
-#### CSV
-CSVs, which stand for **Comma-Separated Values**, are a common tabular data format.
-In the past two `pandas` lectures, we briefly touched on the idea of file format: the way data is encoded in a file for storage. Specifically, our `elections` and `babynames` datasets were stored and loaded as CSVs:
-
-```{python}
-#| code-fold: false
-pd.read_csv("data/elections.csv").head(5)
-```
-
-To better understand the properties of a CSV, let's take a look at the first few rows of the raw data file to see what it looks like before being loaded into a `DataFrame`. We'll use the `repr()` function to return the raw string with its special characters:
-
-```{python}
-#| code-fold: false
-with open("data/elections.csv", "r") as table:
- i = 0
- for row in table:
- print(repr(row))
- i += 1
- if i > 3:
- break
-```
-
-Each row, or **record**, in the data is delimited by a newline `\n`. Each column, or **field**, in the data is delimited by a comma `,` (hence, comma-separated!).
-
-#### TSV
-
-Another common file type is **TSV (Tab-Separated Values)**. In a TSV, records are still delimited by a newline `\n`, while fields are delimited by `\t` tab character.
-
-Let's check out the first few rows of the raw TSV file. Again, we'll use the `repr()` function so that `print` shows the special characters.
-
-```{python}
-#| code-fold: false
-with open("data/elections.txt", "r") as table:
- i = 0
- for row in table:
- print(repr(row))
- i += 1
- if i > 3:
- break
-```
-
-TSVs can be loaded into `pandas` using `pd.read_csv`. We'll need to specify the **delimiter** with parameter` sep='\t'` [(documentation)](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html).
-
-```{python}
-#| code-fold: false
-pd.read_csv("data/elections.txt", sep='\t').head(3)
-```
-
-An issue with CSVs and TSVs comes up whenever there are commas or tabs within the records. How does `pandas` differentiate between a comma delimiter vs. a comma within the field itself, for example `8,900`? To remedy this, check out the [`quotechar` parameter](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html).
-
-#### JSON
-**JSON (JavaScript Object Notation)** files behave similarly to Python dictionaries. A raw JSON is shown below.
-
-```{python}
-#| code-fold: false
-with open("data/elections.json", "r") as table:
- i = 0
- for row in table:
- print(row)
- i += 1
- if i > 8:
- break
-```
-
-JSON files can be loaded into `pandas` using `pd.read_json`.
-
-```{python}
-#| code-fold: false
-pd.read_json('data/elections.json').head(3)
-```
-
-##### EDA with JSON: Berkeley COVID-19 Data
-The City of Berkeley Open Data [website](https://data.cityofberkeley.info/Health/COVID-19-Confirmed-Cases/xn6j-b766) has a dataset with COVID-19 Confirmed Cases among Berkeley residents by date. Let's download the file and save it as a JSON (note the source URL file type is also a JSON). In the interest of reproducible data science, we will download the data programatically. We have defined some helper functions in the [`ds100_utils.py`](https://ds100.org/fa23/resources/assets/lectures/lec05/lec05-eda.html) file that we can reuse these helper functions in many different notebooks.
-
-```{python}
-#| code-fold: false
-from ds100_utils import fetch_and_cache
-
-covid_file = fetch_and_cache(
- "https://data.cityofberkeley.info/api/views/xn6j-b766/rows.json?accessType=DOWNLOAD",
- "confirmed-cases.json",
- force=False)
-covid_file # a file path wrapper object
-```
-
-###### File Size
-Let's start our analysis by getting a rough estimate of the size of the dataset to inform the tools we use to view the data. For relatively small datasets, we can use a text editor or spreadsheet. For larger datasets, more programmatic exploration or distributed computing tools may be more fitting. Here we will use `Python` tools to probe the file.
-
-Since there seem to be text files, let's investigate the number of lines, which often corresponds to the number of records
-
-```{python}
-#| code-fold: false
-import os
-
-print(covid_file, "is", os.path.getsize(covid_file) / 1e6, "MB")
-
-with open(covid_file, "r") as f:
- print(covid_file, "is", sum(1 for l in f), "lines.")
-```
-
-###### Unix Commands
-As part of the EDA workflow, Unix commands can come in very handy. In fact, there's an entire book called ["Data Science at the Command Line"](https://datascienceatthecommandline.com/) that explores this idea in depth!
-In Jupyter/IPython, you can prefix lines with `!` to execute arbitrary Unix commands, and within those lines, you can refer to `Python` variables and expressions with the syntax `{expr}`.
-
-Here, we use the `ls` command to list files, using the `-lh` flags, which request "long format with information in human-readable form." We also use the `wc` command for "word count," but with the `-l` flag, which asks for line counts instead of words.
-
-These two give us the same information as the code above, albeit in a slightly different form:
-
-```{python}
-#| code-fold: false
-!ls -lh {covid_file}
-!wc -l {covid_file}
-```
-
-###### File Contents
-Let's explore the data format using `Python`.
-
-```{python}
-#| code-fold: false
-with open(covid_file, "r") as f:
- for i, row in enumerate(f):
- print(repr(row)) # print raw strings
- if i >= 4: break
-```
-
-We can use the `head` Unix command (which is where `pandas`' `head` method comes from!) to see the first few lines of the file:
-
-```{python}
-#| code-fold: false
-!head -5 {covid_file}
-```
-
-In order to load the JSON file into `pandas`, Let's first do some EDA with `Python`'s `json` package to understand the particular structure of this JSON file so that we can decide what (if anything) to load into `pandas`. `Python` has relatively good support for JSON data since it closely matches the internal python object model. In the following cell we import the entire JSON datafile into a python dictionary using the `json` package.
-
-```{python}
-#| code-fold: false
-import json
-
-with open(covid_file, "rb") as f:
- covid_json = json.load(f)
-```
-
-The `covid_json` variable is now a dictionary encoding the data in the file:
-
-```{python}
-#| code-fold: false
-type(covid_json)
-```
-
-We can examine what keys are in the top level json object by listing out the keys.
-
-```{python}
-#| code-fold: false
-covid_json.keys()
-```
-
-**Observation**: The JSON dictionary contains a `meta` key which likely refers to meta data (data about the data). Meta data often maintained with the data and can be a good source of additional information.
-
-
-We can investigate the meta data further by examining the keys associated with the metadata.
-
-```{python}
-#| code-fold: false
-covid_json['meta'].keys()
-```
-
-The `meta` key contains another dictionary called `view`. This likely refers to meta-data about a particular "view" of some underlying database. We will learn more about views when we study SQL later in the class.
-
-```{python}
-#| code-fold: false
-covid_json['meta']['view'].keys()
-```
-
-Notice that this a nested/recursive data structure. As we dig deeper we reveal more and more keys and the corresponding data:
-
-```
-meta
-|-> data
- | ... (haven't explored yet)
-|-> view
- | -> id
- | -> name
- | -> attribution
- ...
- | -> description
- ...
- | -> columns
- ...
-```
-
-
-There is a key called description in the view sub dictionary. This likely contains a description of the data:
-
-```{python}
-#| code-fold: false
-print(covid_json['meta']['view']['description'])
-```
-
-###### Examining the Data Field for Records
-
-We can look at a few entries in the `data` field. This is what we'll load into `pandas`.
-
-```{python}
-#| code-fold: false
-for i in range(3):
- print(f"{i:03} | {covid_json['data'][i]}")
-```
-
-Observations:
-* These look like equal-length records, so maybe `data` is a table!
-* But what do each of values in the record mean? Where can we find column headers?
-
-For that, we'll need the `columns` key in the metadata dictionary. This returns a list:
-
-```{python}
-#| code-fold: false
-type(covid_json['meta']['view']['columns'])
-```
-
-###### Summary of exploring the JSON file
-
-1. The above **metadata** tells us a lot about the columns in the data including column names, potential data anomalies, and a basic statistic.
-1. Because of its non-tabular structure, JSON makes it easier (than CSV) to create **self-documenting data**, meaning that information about the data is stored in the same file as the data.
-1. Self-documenting data can be helpful since it maintains its own description and these descriptions are more likely to be updated as data changes.
-
-###### Loading COVID Data into `pandas`
-Finally, let's load the data (not the metadata) into a `pandas` `DataFrame`. In the following block of code we:
-
-1. Translate the JSON records into a `DataFrame`:
-
- * fields: `covid_json['meta']['view']['columns']`
- * records: `covid_json['data']`
-
-
-1. Remove columns that have no metadata description. This would be a bad idea in general, but here we remove these columns since the above analysis suggests they are unlikely to contain useful information.
-
-1. Examine the `tail` of the table.
-
-```{python}
-#| code-fold: false
-# Load the data from JSON and assign column titles
-covid = pd.DataFrame(
- covid_json['data'],
- columns=[c['name'] for c in covid_json['meta']['view']['columns']])
-
-covid.tail()
-```
-
-### Variable Types
-
-After loading data into a file, it's a good idea to take the time to understand what pieces of information are encoded in the dataset. In particular, we want to identify what variable types are present in our data. Broadly speaking, we can categorize variables into one of two overarching types.
-
-**Quantitative variables** describe some numeric quantity or amount. We can divide quantitative data further into:
-
-* **Continuous quantitative variables**: numeric data that can be measured on a continuous scale to arbitrary precision. Continuous variables do not have a strict set of possible values – they can be recorded to any number of decimal places. For example, weights, GPA, or CO<sub>2</sub> concentrations.
-* **Discrete quantitative variables**: numeric data that can only take on a finite set of possible values. For example, someone's age or the number of siblings they have.
-
-**Qualitative variables**, also known as **categorical variables**, describe data that isn't measuring some quantity or amount. The sub-categories of categorical data are:
-
-* **Ordinal qualitative variables**: categories with ordered levels. Specifically, ordinal variables are those where the difference between levels has no consistent, quantifiable meaning. Some examples include levels of education (high school, undergrad, grad, etc.), income bracket (low, medium, high), or Yelp rating.
-* **Nominal qualitative variables**: categories with no specific order. For example, someone's political affiliation or Cal ID number.
-
-![Classification of variable types](images/variable.png)
-
-Note that many variables don't sit neatly in just one of these categories. Qualitative variables could have numeric levels, and conversely, quantitative variables could be stored as strings.
-
-### Primary and Foreign Keys
-
-Last time, we introduced `.merge` as the `pandas` method for joining multiple `DataFrame`s together. In our discussion of joins, we touched on the idea of using a "key" to determine what rows should be merged from each table. Let's take a moment to examine this idea more closely.
-
-The **primary key** is the column or set of columns in a table that *uniquely* determine the values of the remaining columns. It can be thought of as the unique identifier for each individual row in the table. For example, a table of Data 100 students might use each student's Cal ID as the primary key.
-
-```{python}
-#| echo: false
-pd.DataFrame({"Cal ID":[3034619471, 3035619472, 3025619473, 3046789372], \
- "Name":["Oski", "Ollie", "Orrie", "Ollie"], \
- "Major":["Data Science", "Computer Science", "Data Science", "Economics"]})
-```
-
-The **foreign key** is the column or set of columns in a table that reference primary keys in other tables. Knowing a dataset's foreign keys can be useful when assigning the `left_on` and `right_on` parameters of `.merge`. In the table of office hour tickets below, `"Cal ID"` is a foreign key referencing the previous table.
-
-```{python}
-#| echo: false
-pd.DataFrame({"OH Request":[1, 2, 3, 4], \
- "Cal ID":[3034619471, 3035619472, 3025619473, 3035619472], \
- "Question":["HW 2 Q1", "HW 2 Q3", "Lab 3 Q4", "HW 2 Q7"]})
-```
-
-## Granularity, Scope, and Temporality
-
-After understanding the structure of the dataset, the next task is to determine what exactly the data represents. We'll do so by considering the data's granularity, scope, and temporality.
-
-### Granularity
-The **granularity** of a dataset is what a single row represents. You can also think of it as the level of detail included in the data. To determine the data's granularity, ask: what does each row in the dataset represent? Fine-grained data contains a high level of detail, with a single row representing a small individual unit. For example, each record may represent one person. Coarse-grained data is encoded such that a single row represents a large individual unit – for example, each record may represent a group of people.
-
-### Scope
-The **scope** of a dataset is the subset of the population covered by the data. If we were investigating student performance in Data Science courses, a dataset with a narrow scope might encompass all students enrolled in Data 100 whereas a dataset with an expansive scope might encompass all students in California.
-
-### Temporality
-The **temporality** of a dataset describes the periodicity over which the data was collected as well as when the data was most recently collected or updated.
-
-Time and date fields of a dataset could represent a few things:
-
-1. when the "event" happened
-2. when the data was collected, or when it was entered into the system
-3. when the data was copied into the database
-
-To fully understand the temporality of the data, it also may be necessary to standardize time zones or inspect recurring time-based trends in the data (do patterns recur in 24-hour periods? Over the course of a month? Seasonally?). The convention for standardizing time is the Coordinated Universal Time (UTC), an international time standard measured at 0 degrees latitude that stays consistent throughout the year (no daylight savings). We can represent Berkeley's time zone, Pacific Standard Time (PST), as UTC-7 (with daylight savings).
-
-#### Temporality with `pandas`' `dt` accessors
-Let's briefly look at how we can use `pandas`' `dt` accessors to work with dates/times in a dataset using the dataset you'll see in Lab 3: the Berkeley PD Calls for Service dataset.
-
-```{python}
-#| code-fold: true
-calls = pd.read_csv("data/Berkeley_PD_-_Calls_for_Service.csv")
-calls.head()
-```
-
-Looks like there are three columns with dates/times: `EVENTDT`, `EVENTTM`, and `InDbDate`.
-
-Most likely, `EVENTDT` stands for the date when the event took place, `EVENTTM` stands for the time of day the event took place (in 24-hr format), and `InDbDate` is the date this call is recorded onto the database.
-
-If we check the data type of these columns, we will see they are stored as strings. We can convert them to `datetime` objects using pandas `to_datetime` function.
-
-```{python}
-#| code-fold: false
-calls["EVENTDT"] = pd.to_datetime(calls["EVENTDT"])
-calls.head()
-```
-
-Now, we can use the `dt` accessor on this column.
-
-We can get the month:
-
-```{python}
-#| code-fold: false
-calls["EVENTDT"].dt.month.head()
-```
-
-Which day of the week the date is on:
-
-```{python}
-#| code-fold: false
-calls["EVENTDT"].dt.dayofweek.head()
-```
-
-Check the mimimum values to see if there are any suspicious-looking, 70s dates:
-
-```{python}
-#| code-fold: false
-calls.sort_values("EVENTDT").head()
-```
-
-Doesn't look like it! We are good!
-
-
-We can also do many things with the `dt` accessor like switching time zones and converting time back to UNIX/POSIX time. Check out the documentation on [`.dt` accessor](https://pandas.pydata.org/docs/user_guide/basics.html#basics-dt-accessors) and [time series/date functionality](https://pandas.pydata.org/docs/user_guide/timeseries.html#).
-
-## Faithfulness
-
-At this stage in our data cleaning and EDA workflow, we've achieved quite a lot: we've identified how our data is structured, come to terms with what information it encodes, and gained insight as to how it was generated. Throughout this process, we should always recall the original intent of our work in Data Science – to use data to better understand and model the real world. To achieve this goal, we need to ensure that the data we use is faithful to reality; that is, that our data accurately captures the "real world."
-
-Data used in research or industry is often "messy" – there may be errors or inaccuracies that impact the faithfulness of the dataset. Signs that data may not be faithful include:
-
-* Unrealistic or "incorrect" values, such as negative counts, locations that don't exist, or dates set in the future
-* Violations of obvious dependencies, like an age that does not match a birthday
-* Clear signs that data was entered by hand, which can lead to spelling errors or fields that are incorrectly shifted
-* Signs of data falsification, such as fake email addresses or repeated use of the same names
-* Duplicated records or fields containing the same information
-* Truncated data, e.g. Microsoft Excel would limit the number of rows to 655536 and the number of columns to 255
-
-We often solve some of these more common issues in the following ways:
-
-* Spelling errors: apply corrections or drop records that aren't in a dictionary
-* Time zone inconsistencies: convert to a common time zone (e.g. UTC)
-* Duplicated records or fields: identify and eliminate duplicates (using primary keys)
-* Unspecified or inconsistent units: infer the units and check that values are in reasonable ranges in the data
-
-### Missing Values
-Another common issue encountered with real-world datasets is that of missing data. One strategy to resolve this is to simply drop any records with missing values from the dataset. This does, however, introduce the risk of inducing biases – it is possible that the missing or corrupt records may be systemically related to some feature of interest in the data. Another solution is to keep the data as `NaN` values.
-
-A third method to address missing data is to perform **imputation**: infer the missing values using other data available in the dataset. There is a wide variety of imputation techniques that can be implemented; some of the most common are listed below.
-
-* Average imputation: replace missing values with the average value for that field
-* Hot deck imputation: replace missing values with some random value
-* Regression imputation: develop a model to predict missing values
-* Multiple imputation: replace missing values with multiple random values
-
-Regardless of the strategy used to deal with missing data, we should think carefully about *why* particular records or fields may be missing – this can help inform whether or not the absence of these values is significant or meaningful.
-
-# EDA Demo 1: Tuberculosis in the United States
-
-Now, let's walk through the data-cleaning and EDA workflow to see what can we learn about the presence of Tuberculosis in the United States!
-
-We will examine the data included in the [original CDC article](https://www.cdc.gov/mmwr/volumes/71/wr/mm7112a1.htm?s_cid=mm7112a1_w#T1_down) published in 2021.
-
-
-## CSVs and Field Names
-Suppose Table 1 was saved as a CSV file located in `data/cdc_tuberculosis.csv`.
-
-We can then explore the CSV (which is a text file, and does not contain binary-encoded data) in many ways:
-1. Using a text editor like emacs, vim, VSCode, etc.
-2. Opening the CSV directly in DataHub (read-only), Excel, Google Sheets, etc.
-3. The `Python` file object
-4. `pandas`, using `pd.read_csv()`
-
-To try out options 1 and 2, you can view or download the Tuberculosis from the [lecture demo notebook](https://data100.datahub.berkeley.edu/hub/user-redirect/git-pull?repo=https%3A%2F%2Fgithub.com%2FDS-100%2Ffa23-student&urlpath=lab%2Ftree%2Ffa23-student%2Flecture%2Flec05%2Flec04-eda.ipynb&branch=main) under the `data` folder in the left hand menu. Notice how the CSV file is a type of **rectangular data (i.e., tabular data) stored as comma-separated values**.
-
-Next, let's try out option 3 using the `Python` file object. We'll look at the first four lines:
-
-```{python}
-#| code-fold: true
-with open("data/cdc_tuberculosis.csv", "r") as f:
- i = 0
- for row in f:
- print(row)
- i += 1
- if i > 3:
- break
-```
-
-Whoa, why are there blank lines interspaced between the lines of the CSV?
-
-You may recall that all line breaks in text files are encoded as the special newline character `\n`. Python's `print()` prints each string (including the newline), and an additional newline on top of that.
-
-If you're curious, we can use the `repr()` function to return the raw string with all special characters:
-
-```{python}
-#| code-fold: true
-with open("data/cdc_tuberculosis.csv", "r") as f:
- i = 0
- for row in f:
- print(repr(row)) # print raw strings
- i += 1
- if i > 3:
- break
-```
-
-Finally, let's try option 4 and use the tried-and-true Data 100 approach: `pandas`.
-
-```{python}
-#| code-fold: false
-tb_df = pd.read_csv("data/cdc_tuberculosis.csv")
-tb_df.head()
-```
-
-You may notice some strange things about this table: what's up with the "Unnamed" column names and the first row?
-
-Congratulations — you're ready to wrangle your data! Because of how things are stored, we'll need to clean the data a bit to name our columns better.
-
-A reasonable first step is to identify the row with the right header. The `pd.read_csv()` function ([documentation](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html)) has the convenient `header` parameter that we can set to use the elements in row 1 as the appropriate columns:
-
-```{python}
-#| code-fold: false
-tb_df = pd.read_csv("data/cdc_tuberculosis.csv", header=1) # row index
-tb_df.head(5)
-```
-
-Wait...but now we can't differentiate betwen the "Number of TB cases" and "TB incidence" year columns. `pandas` has tried to make our lives easier by automatically adding ".1" to the latter columns, but this doesn't help us, as humans, understand the data.
-
-We can do this manually with `df.rename()` ([documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.rename.html?highlight=rename#pandas.DataFrame.rename)):
-
-```{python}
-#| code-fold: false
-rename_dict = {'2019': 'TB cases 2019',
- '2020': 'TB cases 2020',
- '2021': 'TB cases 2021',
- '2019.1': 'TB incidence 2019',
- '2020.1': 'TB incidence 2020',
- '2021.1': 'TB incidence 2021'}
-tb_df = tb_df.rename(columns=rename_dict)
-tb_df.head(5)
-```
-
-## Record Granularity
-
-You might already be wondering: what's up with that first record?
-
-Row 0 is what we call a **rollup record**, or summary record. It's often useful when displaying tables to humans. The **granularity** of record 0 (Totals) vs the rest of the records (States) is different.
-
-Okay, EDA step two. How was the rollup record aggregated?
-
-Let's check if Total TB cases is the sum of all state TB cases. If we sum over all rows, we should get **2x** the total cases in each of our TB cases by year (why do you think this is?).
-
-```{python}
-#| code-fold: true
-tb_df.sum(axis=0)
-```
-
-Whoa, what's going on with the TB cases in 2019, 2020, and 2021? Check out the column types:
-
-```{python}
-#| code-fold: true
-tb_df.dtypes
-```
-
-Since there are commas in the values for TB cases, the numbers are read as the `object` datatype, or **storage type** (close to the `Python` string datatype), so `pandas` is concatenating strings instead of adding integers (recall that `Python` can "sum", or concatenate, strings together: `"data" + "100"` evaluates to `"data100"`).
-
-
-Fortunately `read_csv` also has a `thousands` parameter ([documentation](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html)):
-
-```{python}
-#| code-fold: false
-# improve readability: chaining method calls with outer parentheses/line breaks
-tb_df = (
- pd.read_csv("data/cdc_tuberculosis.csv", header=1, thousands=',')
- .rename(columns=rename_dict)
-)
-tb_df.head(5)
-```
-
-```{python}
-#| code-fold: false
-tb_df.sum()
-```
-
-The Total TB cases look right. Phew!
-
-Let's just look at the records with **state-level granularity**:
-
-```{python}
-#| code-fold: true
-state_tb_df = tb_df[1:]
-state_tb_df.head(5)
-```
-
-## Gather Census Data
-
-U.S. Census population estimates [source](https://www.census.gov/data/tables/time-series/demo/popest/2010s-state-total.html) (2019), [source](https://www.census.gov/data/tables/time-series/demo/popest/2020s-state-total.html) (2020-2021).
-
-Running the below cells cleans the data.
-There are a few new methods here:
-* `df.convert_dtypes()` ([documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.convert_dtypes.html)) conveniently converts all float dtypes into ints and is out of scope for the class.
-* `df.drop_na()` ([documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.dropna.html)) will be explained in more detail next time.
-
-```{python}
-#| code-fold: true
-# 2010s census data
-census_2010s_df = pd.read_csv("data/nst-est2019-01.csv", header=3, thousands=",")
-census_2010s_df = (
- census_2010s_df
- .reset_index()
- .drop(columns=["index", "Census", "Estimates Base"])
- .rename(columns={"Unnamed: 0": "Geographic Area"})
- .convert_dtypes() # "smart" converting of columns, use at your own risk
- .dropna() # we'll introduce this next time
-)
-census_2010s_df['Geographic Area'] = census_2010s_df['Geographic Area'].str.strip('.')
-
-# with pd.option_context('display.min_rows', 30): # shows more rows
-# display(census_2010s_df)
-
-census_2010s_df.head(5)
-```
-
-Occasionally, you will want to modify code that you have imported. To reimport those modifications you can either use `python`'s `importlib` library:
-
-```python
-from importlib import reload
-reload(utils)
-```
-
-or use `iPython` magic which will intelligently import code when files change:
-
-```python
-%load_ext autoreload
-%autoreload 2
-```
-
-```{python}
-#| code-fold: true
-# census 2020s data
-census_2020s_df = pd.read_csv("data/NST-EST2022-POP.csv", header=3, thousands=",")
-census_2020s_df = (
- census_2020s_df
- .reset_index()
- .drop(columns=["index", "Unnamed: 1"])
- .rename(columns={"Unnamed: 0": "Geographic Area"})
- .convert_dtypes() # "smart" converting of columns, use at your own risk
- .dropna() # we'll introduce this next time
-)
-census_2020s_df['Geographic Area'] = census_2020s_df['Geographic Area'].str.strip('.')
-
-census_2020s_df.head(5)
-```
-
-## Joining Data (Merging `DataFrame`s)
-
-Time to `merge`! Here we use the `DataFrame` method `df1.merge(right=df2, ...)` on `DataFrame` `df1` ([documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.merge.html)). Contrast this with the function `pd.merge(left=df1, right=df2, ...)` ([documentation](https://pandas.pydata.org/docs/reference/api/pandas.merge.html?highlight=pandas%20merge#pandas.merge)). Feel free to use either.
-
-```{python}
-#| code-fold: false
-# merge TB DataFrame with two US census DataFrames
-tb_census_df = (
- tb_df
- .merge(right=census_2010s_df,
- left_on="U.S. jurisdiction", right_on="Geographic Area")
- .merge(right=census_2020s_df,
- left_on="U.S. jurisdiction", right_on="Geographic Area")
-)
-tb_census_df.head(5)
-```
-
-Having all of these columns is a little unwieldy. We could either drop the unneeded columns now, or just merge on smaller census `DataFrame`s. Let's do the latter.
-
-```{python}
-#| code-fold: false
-# try merging again, but cleaner this time
-tb_census_df = (
- tb_df
- .merge(right=census_2010s_df[["Geographic Area", "2019"]],
- left_on="U.S. jurisdiction", right_on="Geographic Area")
- .drop(columns="Geographic Area")
- .merge(right=census_2020s_df[["Geographic Area", "2020", "2021"]],
- left_on="U.S. jurisdiction", right_on="Geographic Area")
- .drop(columns="Geographic Area")
-)
-tb_census_df.head(5)
-```
-
-## Reproducing Data: Compute Incidence
-
-Let's recompute incidence to make sure we know where the original CDC numbers came from.
-
-From the [CDC report](https://www.cdc.gov/mmwr/volumes/71/wr/mm7112a1.htm?s_cid=mm7112a1_w#T1_down): TB incidence is computed as “Cases per 100,000 persons using mid-year population estimates from the U.S. Census Bureau.”
-
-If we define a group as 100,000 people, then we can compute the TB incidence for a given state population as
-
-$$\text{TB incidence} = \frac{\text{TB cases in population}}{\text{groups in population}} = \frac{\text{TB cases in population}}{\text{population}/100000} $$
-
-$$= \frac{\text{TB cases in population}}{\text{population}} \times 100000$$
-
-Let's try this for 2019:
-
-```{python}
-#| code-fold: false
-tb_census_df["recompute incidence 2019"] = tb_census_df["TB cases 2019"]/tb_census_df["2019"]*100000
-tb_census_df.head(5)
-```
-
-Awesome!!!
-
-Let's use a for-loop and `Python` format strings to compute TB incidence for all years. `Python` f-strings are just used for the purposes of this demo, but they're handy to know when you explore data beyond this course ([documentation](https://docs.python.org/3/tutorial/inputoutput.html)).
-
-```{python}
-#| code-fold: false
-# recompute incidence for all years
-for year in [2019, 2020, 2021]:
- tb_census_df[f"recompute incidence {year}"] = tb_census_df[f"TB cases {year}"]/tb_census_df[f"{year}"]*100000
-tb_census_df.head(5)
-```
-
-These numbers look pretty close!!! There are a few errors in the hundredths place, particularly in 2021. It may be useful to further explore reasons behind this discrepancy.
-
-```{python}
-#| code-fold: false
-tb_census_df.describe()
-```
-
-## Bonus EDA: Reproducing the Reported Statistic
-
-
-**How do we reproduce that reported statistic in the original [CDC report](https://www.cdc.gov/mmwr/volumes/71/wr/mm7112a1.htm?s_cid=mm7112a1_w)?**
-
-> Reported TB incidence (cases per 100,000 persons) increased **9.4%**, from **2.2** during 2020 to **2.4** during 2021 but was lower than incidence during 2019 (2.7). Increases occurred among both U.S.-born and non–U.S.-born persons.
-
-This is TB incidence computed across the entire U.S. population! How do we reproduce this?
-* We need to reproduce the "Total" TB incidences in our rolled record.
-* But our current `tb_census_df` only has 51 entries (50 states plus Washington, D.C.). There is no rolled record.
-* What happened...?
-
-Let's get exploring!
-
-Before we keep exploring, we'll set all indexes to more meaningful values, instead of just numbers that pertain to some row at some point. This will make our cleaning slightly easier.
-
-```{python}
-#| code-fold: true
-tb_df = tb_df.set_index("U.S. jurisdiction")
-tb_df.head(5)
-```
-
-```{python}
-#| code-fold: false
-census_2010s_df = census_2010s_df.set_index("Geographic Area")
-census_2010s_df.head(5)
-```
-
-```{python}
-#| code-fold: false
-census_2020s_df = census_2020s_df.set_index("Geographic Area")
-census_2020s_df.head(5)
-```
-
-It turns out that our merge above only kept state records, even though our original `tb_df` had the "Total" rolled record:
-
-```{python}
-#| code-fold: false
-tb_df.head()
-```
-
-Recall that `merge` by default does an **inner** merge by default, meaning that it only preserves keys that are present in **both** `DataFrame`s.
-
-The rolled records in our census `DataFrame` have different `Geographic Area` fields, which was the key we merged on:
-
-```{python}
-#| code-fold: false
-census_2010s_df.head(5)
-```
-
-The Census `DataFrame` has several rolled records. The aggregate record we are looking for actually has the Geographic Area named "United States".
-
-One straightforward way to get the right merge is to rename the value itself. Because we now have the Geographic Area index, we'll use `df.rename()` ([documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.rename.html)):
-
-```{python}
-#| code-fold: false
-# rename rolled record for 2010s
-census_2010s_df.rename(index={'United States':'Total'}, inplace=True)
-census_2010s_df.head(5)
-```
-
-```{python}
-#| code-fold: false
-# same, but for 2020s rename rolled record
-census_2020s_df.rename(index={'United States':'Total'}, inplace=True)
-census_2020s_df.head(5)
-```
-
-<br/>
-
-Next let's rerun our merge. Note the different chaining, because we are now merging on indexes (`df.merge()` [documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.merge.html)).
-
-```{python}
-#| code-fold: false
-tb_census_df = (
- tb_df
- .merge(right=census_2010s_df[["2019"]],
- left_index=True, right_index=True)
- .merge(right=census_2020s_df[["2020", "2021"]],
- left_index=True, right_index=True)
-)
-tb_census_df.head(5)
-```
-
-<br/>
-
-Finally, let's recompute our incidences:
-
-```{python}
-#| code-fold: false
-# recompute incidence for all years
-for year in [2019, 2020, 2021]:
- tb_census_df[f"recompute incidence {year}"] = tb_census_df[f"TB cases {year}"]/tb_census_df[f"{year}"]*100000
-tb_census_df.head(5)
-```
-
-We reproduced the total U.S. incidences correctly!
-
-We're almost there. Let's revisit the quote:
-
-> Reported TB incidence (cases per 100,000 persons) increased **9.4%**, from **2.2** during 2020 to **2.4** during 2021 but was lower than incidence during 2019 (2.7). Increases occurred among both U.S.-born and non–U.S.-born persons.
-
-Recall that percent change from $A$ to $B$ is computed as
-$\text{percent change} = \frac{B - A}{A} \times 100$.
-
-```{python}
-#| code-fold: false
-#| tags: []
-incidence_2020 = tb_census_df.loc['Total', 'recompute incidence 2020']
-incidence_2020
-```
-
-```{python}
-#| code-fold: false
-#| tags: []
-incidence_2021 = tb_census_df.loc['Total', 'recompute incidence 2021']
-incidence_2021
-```
-
-```{python}
-#| code-fold: false
-#| tags: []
-difference = (incidence_2021 - incidence_2020)/incidence_2020 * 100
-difference
-```
-
-# EDA Demo 2: Mauna Loa CO<sub>2</sub> Data -- A Lesson in Data Faithfulness
-
-[Mauna Loa Observatory](https://gml.noaa.gov/ccgg/trends/data.html) has been monitoring CO<sub>2</sub> concentrations since 1958
-
-```{python}
-#| code-fold: false
-co2_file = "data/co2_mm_mlo.txt"
-```
-
-Let's do some **EDA**!!
-
-## Reading this file into Pandas?
-Let's instead check out this `.txt` file. Some questions to keep in mind: Do we trust this file extension? What structure is it?
-
-Lines 71-78 (inclusive) are shown below:
-
- line number | file contents
-
- 71 | # decimal average interpolated trend #days
- 72 | # date (season corr)
- 73 | 1958 3 1958.208 315.71 315.71 314.62 -1
- 74 | 1958 4 1958.292 317.45 317.45 315.29 -1
- 75 | 1958 5 1958.375 317.50 317.50 314.71 -1
- 76 | 1958 6 1958.458 -99.99 317.10 314.85 -1
- 77 | 1958 7 1958.542 315.86 315.86 314.98 -1
- 78 | 1958 8 1958.625 314.93 314.93 315.94 -1
-
-
-Notice how:
-
-- The values are separated by white space, possibly tabs.
-- The data line up down the rows. For example, the month appears in 7th to 8th position of each line.
-- The 71st and 72nd lines in the file contain column headings split over two lines.
-
-We can use `read_csv` to read the data into a `pandas` `DataFrame`, and we provide several arguments to specify that the separators are white space, there is no header (**we will set our own column names**), and to skip the first 72 rows of the file.
-
-```{python}
-#| code-fold: false
-co2 = pd.read_csv(
- co2_file, header = None, skiprows = 72,
- sep = r'\s+' #delimiter for continuous whitespace (stay tuned for regex next lecture))
-)
-co2.head()
-```
-
-Congratulations! You've wrangled the data!
-
-<br/>
-
-...But our columns aren't named.
-**We need to do more EDA.**
-
-## Exploring Variable Feature Types
-
-The NOAA [webpage](https://gml.noaa.gov/ccgg/trends/) might have some useful tidbits (in this case it doesn't).
-
-Using this information, we'll rerun `pd.read_csv`, but this time with some **custom column names.**
-
-```{python}
-#| code-fold: false
-co2 = pd.read_csv(
- co2_file, header = None, skiprows = 72,
- sep = '\s+', #regex for continuous whitespace (next lecture)
- names = ['Yr', 'Mo', 'DecDate', 'Avg', 'Int', 'Trend', 'Days']
-)
-co2.head()
-```
-
-## Visualizing CO<sub>2</sub>
-Scientific studies tend to have very clean data, right...? Let's jump right in and make a time series plot of CO2 monthly averages.
-
-```{python}
-#| code-fold: true
-sns.lineplot(x='DecDate', y='Avg', data=co2);
-```
-
-The code above uses the `seaborn` plotting library (abbreviated `sns`). We will cover this in the Visualization lecture, but now you don't need to worry about how it works!
-
-Yikes! Plotting the data uncovered a problem. The sharp vertical lines suggest that we have some **missing values**. What happened here?
-
-```{python}
-#| code-fold: false
-co2.head()
-```
-
-```{python}
-#| code-fold: false
-co2.tail()
-```
-
-Some data have unusual values like -1 and -99.99.
-
-Let's check the description at the top of the file again.
-
-* -1 signifies a missing value for the number of days `Days` the equipment was in operation that month.
-* -99.99 denotes a missing monthly average `Avg`
-
-How can we fix this? First, let's explore other aspects of our data. Understanding our data will help us decide what to do with the missing values.
-
-<br/>
-
-
-## Sanity Checks: Reasoning about the data
-First, we consider the shape of the data. How many rows should we have?
-
-* If chronological order, we should have one record per month.
-* Data from March 1958 to August 2019.
-* We should have $ 12 \times (2019-1957) - 2 - 4 = 738 $ records.
-
-```{python}
-#| code-fold: false
-co2.shape
-```
-
-Nice!! The number of rows (i.e. records) match our expectations.\
-
-<br/>
-
-
-Let's now check the quality of each feature.
-
-## Understanding Missing Value 1: `Days`
-`Days` is a time field, so let's analyze other time fields to see if there is an explanation for missing values of days of operation.
-
-Let's start with **months**, `Mo`.
-
-Are we missing any records? The number of months should have 62 or 61 instances (March 1957-August 2019).
-
-```{python}
-#| code-fold: false
-co2["Mo"].value_counts().sort_index()
-```
-
-As expected Jan, Feb, Sep, Oct, Nov, and Dec have 61 occurrences and the rest 62.
-
-<br/>
-
-Next let's explore **days** `Days` itself, which is the number of days that the measurement equipment worked.
-
-```{python}
-#| code-fold: true
-sns.displot(co2['Days']);
-plt.title("Distribution of days feature"); # suppresses unneeded plotting output
-```
-
-In terms of data quality, a handful of months have averages based on measurements taken on fewer than half the days. In addition, there are nearly 200 missing values--**that's about 27% of the data**!
-
-<br/>
-
-Finally, let's check the last time feature, **year** `Yr`.
-
-Let's check to see if there is any connection between missing-ness and the year of the recording.
-
-```{python}
-#| code-fold: true
-sns.scatterplot(x="Yr", y="Days", data=co2);
-plt.title("Day field by Year"); # the ; suppresses output
-```
-
-**Observations**:
-
-* All of the missing data are in the early years of operation.
-* It appears there may have been problems with equipment in the mid to late 80s.
-
-**Potential Next Steps**:
-
-* Confirm these explanations through documentation about the historical readings.
-* Maybe drop earliest recordings? However, we would want to delay such action until after we have examined the time trends and assess whether there are any potential problems.
-
-<br/>
-
-## Understanding Missing Value 2: `Avg`
-Next, let's return to the -99.99 values in `Avg` to analyze the overall quality of the CO2 measurements. We'll plot a histogram of the average CO<sub>2</sub> measurements
-
-```{python}
-#| code-fold: true
-# Histograms of average CO2 measurements
-sns.displot(co2['Avg']);
-```
-
-The non-missing values are in the 300-400 range (a regular range of CO2 levels).
-
-We also see that there are only a few missing `Avg` values (**<1% of values**). Let's examine all of them:
-
-```{python}
-#| code-fold: false
-co2[co2["Avg"] < 0]
-```
-
-There doesn't seem to be a pattern to these values, other than that most records also were missing `Days` data.
-
-## Drop, `NaN`, or Impute Missing `Avg` Data?
-
-How should we address the invalid `Avg` data?
-
-1. Drop records
-2. Set to NaN
-3. Impute using some strategy
-
-Remember we want to fix the following plot:
-
-```{python}
-#| code-fold: true
-sns.lineplot(x='DecDate', y='Avg', data=co2)
-plt.title("CO2 Average By Month");
-```
-
-Since we are plotting `Avg` vs `DecDate`, we should just focus on dealing with missing values for `Avg`.
-
-
-Let's consider a few options:
-1. Drop those records
-2. Replace -99.99 with NaN
-3. Substitute it with a likely value for the average CO2?
-
-What do you think are the pros and cons of each possible action?
-
-<br/>
-
-
-Let's examine each of these three options.
-
-```{python}
-#| code-fold: false
-# 1. Drop missing values
-co2_drop = co2[co2['Avg'] > 0]
-co2_drop.head()
-```
-
-```{python}
-#| code-fold: false
-# 2. Replace NaN with -99.99
-co2_NA = co2.replace(-99.99, np.NaN)
-co2_NA.head()
-```
-
-We'll also use a third version of the data.
-
-First, we note that the dataset already comes with a **substitute value** for the -99.99.
-
-From the file description:
-
-> The `interpolated` column includes average values from the preceding column (`average`)
-and **interpolated values** where data are missing. Interpolated values are
-computed in two steps...
-
-The `Int` feature has values that exactly match those in `Avg`, except when `Avg` is -99.99, and then a **reasonable** estimate is used instead.
-
-So, the third version of our data will use the `Int` feature instead of `Avg`.
-
-```{python}
-#| code-fold: false
-# 3. Use interpolated column which estimates missing Avg values
-co2_impute = co2.copy()
-co2_impute['Avg'] = co2['Int']
-co2_impute.head()
-```
-
-What's a **reasonable** estimate?
-
-To answer this question, let's zoom in on a short time period, say the measurements in 1958 (where we know we have two missing values).
-
-```{python}
-#| code-fold: true
-# results of plotting data in 1958
-
-def line_and_points(data, ax, title):
- # assumes single year, hence Mo
- ax.plot('Mo', 'Avg', data=data)
- ax.scatter('Mo', 'Avg', data=data)
- ax.set_xlim(2, 13)
- ax.set_title(title)
- ax.set_xticks(np.arange(3, 13))
-
-def data_year(data, year):
- return data[data["Yr"] == 1958]
-
-# uses matplotlib subplots
-# you may see more next week; focus on output for now
-fig, axes = plt.subplots(ncols = 3, figsize=(12, 4), sharey=True)
-
-year = 1958
-line_and_points(data_year(co2_drop, year), axes[0], title="1. Drop Missing")
-line_and_points(data_year(co2_NA, year), axes[1], title="2. Missing Set to NaN")
-line_and_points(data_year(co2_impute, year), axes[2], title="3. Missing Interpolated")
-
-fig.suptitle(f"Monthly Averages for {year}")
-plt.tight_layout()
-```
-
-In the big picture since there are only 7 `Avg` values missing (**<1%** of 738 months), any of these approaches would work.
-
-However there is some appeal to **option C: Imputing**:
-
-* Shows seasonal trends for CO2
-* We are plotting all months in our data as a line plot
-
-<br/>
-
-
-Let's replot our original figure with option 3:
-
-```{python}
-#| code-fold: true
-sns.lineplot(x='DecDate', y='Avg', data=co2_impute)
-plt.title("CO2 Average By Month, Imputed");
-```
-
-Looks pretty close to what we see on the NOAA [website](https://gml.noaa.gov/ccgg/trends/)!
-
-## Presenting the data: A Discussion on Data Granularity
-
-From the description:
-
-* monthly measurements are averages of average day measurements.
-* The NOAA GML website has datasets for daily/hourly measurements too.
-
-The data you present depends on your research question.
-
-**How do CO2 levels vary by season?**
-
-* You might want to keep average monthly data.
-
-**Are CO2 levels rising over the past 50+ years, consistent with global warming predictions?**
-
-* You might be happier with a **coarser granularity** of average year data!
-
-```{python}
-#| code-fold: true
-co2_year = co2_impute.groupby('Yr').mean()
-sns.lineplot(x='Yr', y='Avg', data=co2_year)
-plt.title("CO2 Average By Year");
-```
-
-Indeed, we see a rise by nearly 100 ppm of CO2 since Mauna Loa began recording in 1958.
-
-# Summary
-We went over a lot of content this lecture; let's summarize the most important points:
-
-## Dealing with Missing Values
-There are a few options we can take to deal with missing data:
-
-* Drop missing records
-* Keep `NaN` missing values
-* Impute using an interpolated column
-
-## EDA and Data Wrangling
-There are several ways to approach EDA and Data Wrangling:
-
-* Examine the **data and metadata**: what is the date, size, organization, and structure of the data?
-* Examine each **field/attribute/dimension** individually.
-* Examine pairs of related dimensions (e.g. breaking down grades by major).
-* Along the way, we can:
- * **Visualize** or summarize the data.
- * **Validate assumptions** about data and its collection process. Pay particular attention to when the data was collected.
- * Identify and **address anomalies**.
- * Apply data transformations and corrections (we'll cover this in the upcoming lecture).
- * **Record everything you do!** Developing in Jupyter Notebook promotes *reproducibility* of your own work!
+---
+title: Data Cleaning and EDA
+execute:
+ echo: true
+format:
+ html:
+ code-fold: true
+ code-tools: true
+ toc: true
+ toc-title: Data Cleaning and EDA
+ page-layout: full
+ theme:
+ - cosmo
+ - cerulean
+ callout-icon: false
+jupyter: python3
+---
+
+```{python}
+#| code-fold: true
+import numpy as np
+import pandas as pd
+
+import matplotlib.pyplot as plt
+import seaborn as sns
+#%matplotlib inline
+plt.rcParams['figure.figsize'] = (12, 9)
+
+sns.set()
+sns.set_context('talk')
+np.set_printoptions(threshold=20, precision=2, suppress=True)
+pd.set_option('display.max_rows', 30)
+pd.set_option('display.max_columns', None)
+pd.set_option('display.precision', 2)
+# This option stops scientific notation for pandas
+pd.set_option('display.float_format', '{:.2f}'.format)
+
+# Silence some spurious seaborn warnings
+import warnings
+warnings.filterwarnings("ignore", category=FutureWarning)
+```
+
+::: {.callout-note collapse="false"}
+## Learning Outcomes
+* Recognize common file formats
+* Categorize data by its variable type
+* Build awareness of issues with data faithfulness and develop targeted solutions
+:::
+
+**This content is covered in lectures 4, 5, and 6.**
+
+In the past few lectures, we've learned that `pandas` is a toolkit to restructure, modify, and explore a dataset. What we haven't yet touched on is *how* to make these data transformation decisions. When we receive a new set of data from the "real world," how do we know what processing we should do to convert this data into a usable form?
+
+**Data cleaning**, also called **data wrangling**, is the process of transforming raw data to facilitate subsequent analysis. It is often used to address issues like:
+
+* Unclear structure or formatting
+* Missing or corrupted values
+* Unit conversions
+* ...and so on
+
+**Exploratory Data Analysis (EDA)** is the process of understanding a new dataset. It is an open-ended, informal analysis that involves familiarizing ourselves with the variables present in the data, discovering potential hypotheses, and identifying possible issues with the data. This last point can often motivate further data cleaning to address any problems with the dataset's format; because of this, EDA and data cleaning are often thought of as an "infinite loop," with each process driving the other.
+
+In this lecture, we will consider the key properties of data to consider when performing data cleaning and EDA. In doing so, we'll develop a "checklist" of sorts for you to consider when approaching a new dataset. Throughout this process, we'll build a deeper understanding of this early (but very important!) stage of the data science lifecycle.
+
+## Structure
+
+### File Formats
+There are many file types for storing structured data: TSV, JSON, XML, ASCII, SAS, etc. We'll only cover CSV, TSV, and JSON in lecture, but you'll likely encounter other formats as you work with different datasets. Reading documentation is your best bet for understanding how to process the multitude of different file types.
+
+#### CSV
+CSVs, which stand for **Comma-Separated Values**, are a common tabular data format.
+In the past two `pandas` lectures, we briefly touched on the idea of file format: the way data is encoded in a file for storage. Specifically, our `elections` and `babynames` datasets were stored and loaded as CSVs:
+
+```{python}
+#| code-fold: false
+pd.read_csv("data/elections.csv").head(5)
+```
+
+To better understand the properties of a CSV, let's take a look at the first few rows of the raw data file to see what it looks like before being loaded into a `DataFrame`. We'll use the `repr()` function to return the raw string with its special characters:
+
+```{python}
+#| code-fold: false
+with open("data/elections.csv", "r") as table:
+ i = 0
+ for row in table:
+ print(repr(row))
+ i += 1
+ if i > 3:
+ break
+```
+
+Each row, or **record**, in the data is delimited by a newline `\n`. Each column, or **field**, in the data is delimited by a comma `,` (hence, comma-separated!).
+
+#### TSV
+
+Another common file type is **TSV (Tab-Separated Values)**. In a TSV, records are still delimited by a newline `\n`, while fields are delimited by the tab character `\t`.
+
+Let's check out the first few rows of the raw TSV file. Again, we'll use the `repr()` function so that `print` shows the special characters.
+
+```{python}
+#| code-fold: false
+with open("data/elections.txt", "r") as table:
+ i = 0
+ for row in table:
+ print(repr(row))
+ i += 1
+ if i > 3:
+ break
+```
+
+TSVs can be loaded into `pandas` using `pd.read_csv`. We'll need to specify the **delimiter** with the parameter `sep='\t'` [(documentation)](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html).
+
+```{python}
+#| code-fold: false
+pd.read_csv("data/elections.txt", sep='\t').head(3)
+```
+
+An issue with CSVs and TSVs comes up whenever there are commas or tabs within the records. How does `pandas` differentiate between a comma delimiter vs. a comma within the field itself, for example `8,900`? To remedy this, check out the [`quotechar` parameter](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html).
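+
+As a small, hypothetical illustration (the two-row table below is made up), quoting keeps an embedded comma inside a single field, and `thousands=','` then parses the quoted number:
+
+```python
+import io
+import pandas as pd
+
+# Toy CSV: "Berkeley, CA" and "8,900" each contain a comma, but the quotes
+# keep them as single fields; thousands=',' then parses 8,900 as a number.
+raw = 'city,population\n"Berkeley, CA","8,900"\n'
+pd.read_csv(io.StringIO(raw), quotechar='"', thousands=',')
+```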
+
+#### JSON
+**JSON (JavaScript Object Notation)** files behave similarly to Python dictionaries. A raw JSON is shown below.
+
+```{python}
+#| code-fold: false
+with open("data/elections.json", "r") as table:
+ i = 0
+ for row in table:
+ print(row)
+ i += 1
+ if i > 8:
+ break
+```
+
+JSON files can be loaded into `pandas` using `pd.read_json`.
+
+```{python}
+#| code-fold: false
+pd.read_json('data/elections.json').head(3)
+```
+
+##### EDA with JSON: Berkeley COVID-19 Data
+The City of Berkeley Open Data [website](https://data.cityofberkeley.info/Health/COVID-19-Confirmed-Cases/xn6j-b766) has a dataset with COVID-19 Confirmed Cases among Berkeley residents by date. Let's download the file and save it as a JSON (note the source URL file type is also a JSON). In the interest of reproducible data science, we will download the data programmatically. We have defined some helper functions in the [`ds100_utils.py`](https://ds100.org/fa23/resources/assets/lectures/lec05/lec05-eda.html) file so that we can reuse them in many different notebooks.
+
+```{python}
+#| code-fold: false
+from ds100_utils import fetch_and_cache
+
+covid_file = fetch_and_cache(
+ "https://data.cityofberkeley.info/api/views/xn6j-b766/rows.json?accessType=DOWNLOAD",
+ "confirmed-cases.json",
+ force=False)
+covid_file # a file path wrapper object
+```
+
+###### File Size
+Let's start our analysis by getting a rough estimate of the size of the dataset to inform the tools we use to view the data. For relatively small datasets, we can use a text editor or spreadsheet. For larger datasets, more programmatic exploration or distributed computing tools may be more fitting. Here we will use `Python` tools to probe the file.
+
+Since this seems to be a text file, let's investigate the number of lines, which often corresponds to the number of records.
+
+```{python}
+#| code-fold: false
+import os
+
+print(covid_file, "is", os.path.getsize(covid_file) / 1e6, "MB")
+
+with open(covid_file, "r") as f:
+ print(covid_file, "is", sum(1 for l in f), "lines.")
+```
+
+###### Unix Commands
+As part of the EDA workflow, Unix commands can come in very handy. In fact, there's an entire book called ["Data Science at the Command Line"](https://datascienceatthecommandline.com/) that explores this idea in depth!
+In Jupyter/IPython, you can prefix lines with `!` to execute arbitrary Unix commands, and within those lines, you can refer to `Python` variables and expressions with the syntax `{expr}`.
+
+Here, we use the `ls` command to list files, using the `-lh` flags, which request "long format with information in human-readable form." We also use the `wc` command for "word count," but with the `-l` flag, which asks for line counts instead of words.
+
+These two give us the same information as the code above, albeit in a slightly different form:
+
+```{python}
+#| code-fold: false
+!ls -lh {covid_file}
+!wc -l {covid_file}
+```
+
+###### File Contents
+Let's explore the data format using `Python`.
+
+```{python}
+#| code-fold: false
+with open(covid_file, "r") as f:
+ for i, row in enumerate(f):
+ print(repr(row)) # print raw strings
+ if i >= 4: break
+```
+
+We can use the `head` Unix command (which is where `pandas`' `head` method comes from!) to see the first few lines of the file:
+
+```{python}
+#| code-fold: false
+!head -5 {covid_file}
+```
+
+In order to load the JSON file into `pandas`, let's first do some EDA with `Python`'s `json` package to understand the particular structure of this JSON file so that we can decide what (if anything) to load into `pandas`. `Python` has relatively good support for JSON data since it closely matches the internal `Python` object model. In the following cell we import the entire JSON datafile into a `Python` dictionary using the `json` package.
+
+```{python}
+#| code-fold: false
+import json
+
+with open(covid_file, "rb") as f:
+ covid_json = json.load(f)
+```
+
+The `covid_json` variable is now a dictionary encoding the data in the file:
+
+```{python}
+#| code-fold: false
+type(covid_json)
+```
+
+We can examine what keys are in the top level json object by listing out the keys.
+
+```{python}
+#| code-fold: false
+covid_json.keys()
+```
+
+**Observation**: The JSON dictionary contains a `meta` key, which likely refers to metadata (data about the data). Metadata is often maintained with the data and can be a good source of additional information.
+
+
+We can investigate the meta data further by examining the keys associated with the metadata.
+
+```{python}
+#| code-fold: false
+covid_json['meta'].keys()
+```
+
+The `meta` key contains another dictionary called `view`. This likely refers to meta-data about a particular "view" of some underlying database. We will learn more about views when we study SQL later in the class.
+
+```{python}
+#| code-fold: false
+covid_json['meta']['view'].keys()
+```
+
+Notice that this is a nested/recursive data structure. As we dig deeper we reveal more and more keys and the corresponding data:
+
+```
+meta
+|-> data
+ | ... (haven't explored yet)
+|-> view
+ | -> id
+ | -> name
+ | -> attribution
+ ...
+ | -> description
+ ...
+ | -> columns
+ ...
+```
+
+
+There is a key called `description` in the `view` sub-dictionary. This likely contains a description of the data:
+
+```{python}
+#| code-fold: false
+print(covid_json['meta']['view']['description'])
+```
+
+###### Examining the Data Field for Records
+
+We can look at a few entries in the `data` field. This is what we'll load into `pandas`.
+
+```{python}
+#| code-fold: false
+for i in range(3):
+ print(f"{i:03} | {covid_json['data'][i]}")
+```
+
+Observations:
+
+* These look like equal-length records, so maybe `data` is a table!
+* But what does each of the values in the record mean? Where can we find the column headers?
+
+For that, we'll need the `columns` key in the metadata dictionary. This returns a list:
+
+```{python}
+#| code-fold: false
+type(covid_json['meta']['view']['columns'])
+```
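+
+If you're curious, you can peek at one of these column descriptors. The exact fields present depend on this particular export, but the `name` field is the one we rely on later when building the `DataFrame`:
+
+```python
+# Inspect the first column descriptor (a dictionary of column metadata).
+covid_json['meta']['view']['columns'][0]
+```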
+
+###### Summary of exploring the JSON file
+
+1. The above **metadata** tells us a lot about the columns in the data including column names, potential data anomalies, and a basic statistic.
+1. Because of its non-tabular structure, JSON makes it easier (than CSV) to create **self-documenting data**, meaning that information about the data is stored in the same file as the data.
+1. Self-documenting data can be helpful since it maintains its own description and these descriptions are more likely to be updated as data changes.
+
+###### Loading COVID Data into `pandas`
+Finally, let's load the data (not the metadata) into a `pandas` `DataFrame`. In the following block of code we:
+
+1. Translate the JSON records into a `DataFrame`:
+
+ * fields: `covid_json['meta']['view']['columns']`
+ * records: `covid_json['data']`
+
+
+1. Remove columns that have no metadata description. This would be a bad idea in general, but here we remove these columns since the above analysis suggests they are unlikely to contain useful information.
+
+1. Examine the `tail` of the table.
+
+```{python}
+#| code-fold: false
+# Load the data from JSON and assign column titles
+covid = pd.DataFrame(
+ covid_json['data'],
+ columns=[c['name'] for c in covid_json['meta']['view']['columns']])
+
+covid.tail()
+```
+
+### Variable Types
+
+After loading data from a file, it's a good idea to take the time to understand what pieces of information are encoded in the dataset. In particular, we want to identify what variable types are present in our data. Broadly speaking, we can categorize variables into one of two overarching types.
+
+**Quantitative variables** describe some numeric quantity or amount. We can divide quantitative data further into:
+
+* **Continuous quantitative variables**: numeric data that can be measured on a continuous scale to arbitrary precision. Continuous variables do not have a strict set of possible values – they can be recorded to any number of decimal places. For example, weights, GPA, or CO<sub>2</sub> concentrations.
+* **Discrete quantitative variables**: numeric data that can only take on a finite set of possible values. For example, someone's age or the number of siblings they have.
+
+**Qualitative variables**, also known as **categorical variables**, describe data that isn't measuring some quantity or amount. The sub-categories of categorical data are:
+
+* **Ordinal qualitative variables**: categories with ordered levels. Specifically, ordinal variables are those where the difference between levels has no consistent, quantifiable meaning. Some examples include levels of education (high school, undergrad, grad, etc.), income bracket (low, medium, high), or Yelp rating.
+* **Nominal qualitative variables**: categories with no specific order. For example, someone's political affiliation or Cal ID number.
+
+![Classification of variable types](images/variable.png)
+
+Note that many variables don't sit neatly in just one of these categories. Qualitative variables could have numeric levels, and conversely, quantitative variables could be stored as strings.
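+
+Here is a minimal, made-up example of that mismatch between storage type and variable type:
+
+```python
+import pandas as pd
+
+# "Cal ID" is stored as an integer but is nominal qualitative (IDs aren't quantities);
+# "GPA" is stored as strings (object dtype) but is quantitative.
+example = pd.DataFrame({"Cal ID": [3034619471, 3035619472],
+                        "GPA": ["3.70", "3.95"]})
+example.dtypes
+```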
+
+### Primary and Foreign Keys
+
+Last time, we introduced `.merge` as the `pandas` method for joining multiple `DataFrame`s together. In our discussion of joins, we touched on the idea of using a "key" to determine what rows should be merged from each table. Let's take a moment to examine this idea more closely.
+
+The **primary key** is the column or set of columns in a table that *uniquely* determine the values of the remaining columns. It can be thought of as the unique identifier for each individual row in the table. For example, a table of Data 100 students might use each student's Cal ID as the primary key.
+
+```{python}
+#| echo: false
+pd.DataFrame({"Cal ID":[3034619471, 3035619472, 3025619473, 3046789372], \
+ "Name":["Oski", "Ollie", "Orrie", "Ollie"], \
+ "Major":["Data Science", "Computer Science", "Data Science", "Economics"]})
+```
+
+The **foreign key** is the column or set of columns in a table that reference primary keys in other tables. Knowing a dataset's foreign keys can be useful when assigning the `left_on` and `right_on` parameters of `.merge`. In the table of office hour tickets below, `"Cal ID"` is a foreign key referencing the previous table.
+
+```{python}
+#| echo: false
+pd.DataFrame({"OH Request":[1, 2, 3, 4], \
+ "Cal ID":[3034619471, 3035619472, 3025619473, 3035619472], \
+ "Question":["HW 2 Q1", "HW 2 Q3", "Lab 3 Q4", "HW 2 Q7"]})
+```
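+
+As a sketch (re-creating the two toy tables above under hypothetical names), the foreign key is what we pass to `on` when merging the tickets back onto the student table:
+
+```python
+import pandas as pd
+
+students = pd.DataFrame({
+    "Cal ID": [3034619471, 3035619472, 3025619473, 3046789372],
+    "Name": ["Oski", "Ollie", "Orrie", "Ollie"],
+    "Major": ["Data Science", "Computer Science", "Data Science", "Economics"]})
+
+tickets = pd.DataFrame({
+    "OH Request": [1, 2, 3, 4],
+    "Cal ID": [3034619471, 3035619472, 3025619473, 3035619472],
+    "Question": ["HW 2 Q1", "HW 2 Q3", "Lab 3 Q4", "HW 2 Q7"]})
+
+# Each ticket's foreign key ("Cal ID") matches the students table's primary key.
+tickets.merge(right=students, on="Cal ID")
+```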
+
+## Granularity, Scope, and Temporality
+
+After understanding the structure of the dataset, the next task is to determine what exactly the data represents. We'll do so by considering the data's granularity, scope, and temporality.
+
+### Granularity
+The **granularity** of a dataset is what a single row represents. You can also think of it as the level of detail included in the data. To determine the data's granularity, ask: what does each row in the dataset represent? Fine-grained data contains a high level of detail, with a single row representing a small individual unit. For example, each record may represent one person. Coarse-grained data is encoded such that a single row represents a large individual unit – for example, each record may represent a group of people.
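+
+As a quick, made-up sketch, coarsening granularity usually means aggregating fine-grained rows, for example with `groupby`:
+
+```python
+import pandas as pd
+
+# Fine-grained: one row per person.
+people = pd.DataFrame({"city": ["Berkeley", "Berkeley", "Oakland"],
+                       "age": [20, 22, 30]})
+
+# Coarser-grained: one row per city.
+people.groupby("city").agg(residents=("age", "size"), mean_age=("age", "mean"))
+```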
+
+### Scope
+The **scope** of a dataset is the subset of the population covered by the data. If we were investigating student performance in Data Science courses, a dataset with a narrow scope might encompass all students enrolled in Data 100 whereas a dataset with an expansive scope might encompass all students in California.
+
+### Temporality
+The **temporality** of a dataset describes the periodicity over which the data was collected as well as when the data was most recently collected or updated.
+
+Time and date fields of a dataset could represent a few things:
+
+1. when the "event" happened
+2. when the data was collected, or when it was entered into the system
+3. when the data was copied into the database
+
+To fully understand the temporality of the data, it also may be necessary to standardize time zones or inspect recurring time-based trends in the data (do patterns recur in 24-hour periods? Over the course of a month? Seasonally?). The convention for standardizing time is Coordinated Universal Time (UTC), an international time standard measured at 0 degrees longitude that stays consistent throughout the year (no daylight saving time). Berkeley's local time is Pacific Time: UTC-8 during Pacific Standard Time (PST) and UTC-7 during daylight saving time (PDT).
+
+#### Temporality with `pandas`' `dt` accessors
+Let's briefly look at how we can use `pandas`' `dt` accessors to work with dates/times in a dataset using the dataset you'll see in Lab 3: the Berkeley PD Calls for Service dataset.
+
+```{python}
+#| code-fold: true
+calls = pd.read_csv("data/Berkeley_PD_-_Calls_for_Service.csv")
+calls.head()
+```
+
+Looks like there are three columns with dates/times: `EVENTDT`, `EVENTTM`, and `InDbDate`.
+
+Most likely, `EVENTDT` stands for the date when the event took place, `EVENTTM` stands for the time of day the event took place (in 24-hr format), and `InDbDate` is the date this call is recorded onto the database.
+
+If we check the data type of these columns, we will see they are stored as strings. We can convert them to `datetime` objects using pandas `to_datetime` function.
+
+```{python}
+#| code-fold: false
+calls["EVENTDT"] = pd.to_datetime(calls["EVENTDT"])
+calls.head()
+```
+
+Now, we can use the `dt` accessor on this column.
+
+We can get the month:
+
+```{python}
+#| code-fold: false
+calls["EVENTDT"].dt.month.head()
+```
+
+Which day of the week the date is on:
+
+```{python}
+#| code-fold: false
+calls["EVENTDT"].dt.dayofweek.head()
+```
+
+Check the minimum values to see if there are any suspicious-looking dates from the 1970s:
+
+```{python}
+#| code-fold: false
+calls.sort_values("EVENTDT").head()
+```
+
+Doesn't look like it! We are good!
+
+
+We can also do many things with the `dt` accessor like switching time zones and converting time back to UNIX/POSIX time. Check out the documentation on [`.dt` accessor](https://pandas.pydata.org/docs/user_guide/basics.html#basics-dt-accessors) and [time series/date functionality](https://pandas.pydata.org/docs/user_guide/timeseries.html#).
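+
+For instance, here is a hedged sketch (assuming the `calls["EVENTDT"]` column from above, and that its timestamps are naive local times) of localizing to US/Pacific, converting to UTC, and expressing the result as UNIX epoch seconds:
+
+```python
+import pandas as pd
+
+# Attach the US/Pacific time zone, then convert to UTC.
+event_utc = (calls["EVENTDT"]
+             .dt.tz_localize("US/Pacific")
+             .dt.tz_convert("UTC"))
+
+# Seconds since the UNIX epoch (1970-01-01 00:00:00 UTC).
+unix_seconds = (event_utc - pd.Timestamp("1970-01-01", tz="UTC")) // pd.Timedelta("1s")
+unix_seconds.head()
+```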
+
+## Faithfulness
+
+At this stage in our data cleaning and EDA workflow, we've achieved quite a lot: we've identified how our data is structured, come to terms with what information it encodes, and gained insight as to how it was generated. Throughout this process, we should always recall the original intent of our work in Data Science – to use data to better understand and model the real world. To achieve this goal, we need to ensure that the data we use is faithful to reality; that is, that our data accurately captures the "real world."
+
+Data used in research or industry is often "messy" – there may be errors or inaccuracies that impact the faithfulness of the dataset. Signs that data may not be faithful include:
+
+* Unrealistic or "incorrect" values, such as negative counts, locations that don't exist, or dates set in the future
+* Violations of obvious dependencies, like an age that does not match a birthday
+* Clear signs that data was entered by hand, which can lead to spelling errors or fields that are incorrectly shifted
+* Signs of data falsification, such as fake email addresses or repeated use of the same names
+* Duplicated records or fields containing the same information
+* Truncated data, e.g. older versions of Microsoft Excel limited spreadsheets to 65,536 rows and 256 columns
+
+We often solve some of these more common issues in the following ways:
+
+* Spelling errors: apply corrections or drop records that aren't in a dictionary
+* Time zone inconsistencies: convert to a common time zone (e.g. UTC)
+* Duplicated records or fields: identify and eliminate duplicates (using primary keys); see the sketch below
+* Unspecified or inconsistent units: infer the units and check that values are in reasonable ranges in the data
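+
+For example, dropping duplicated records often comes down to `drop_duplicates` keyed on the primary key (toy, hypothetical data below):
+
+```python
+import pandas as pd
+
+log = pd.DataFrame({"record_id": [1, 2, 2, 3],
+                    "reading": [10.1, 12.3, 12.3, 9.8]})
+
+# Keep only the first row for each primary key value.
+log.drop_duplicates(subset="record_id", keep="first")
+```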
+
+### Missing Values
+Another common issue encountered with real-world datasets is that of missing data. One strategy to resolve this is to simply drop any records with missing values from the dataset. This does, however, introduce the risk of inducing biases – it is possible that the missing or corrupt records may be systemically related to some feature of interest in the data. Another solution is to keep the data as `NaN` values.
+
+A third method to address missing data is to perform **imputation**: infer the missing values using other data available in the dataset. There is a wide variety of imputation techniques that can be implemented; some of the most common are listed below.
+
+* Average imputation: replace missing values with the average value for that field
+* Hot deck imputation: replace missing values with a randomly chosen observed value (often from a similar record)
+* Regression imputation: develop a model to predict missing values
+* Multiple imputation: fill in the missing values several times to create multiple plausible datasets, then combine the results
+
+Regardless of the strategy used to deal with missing data, we should think carefully about *why* particular records or fields may be missing – this can help inform whether or not the absence of these values is significant or meaningful.
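+
+As a minimal sketch (on a made-up `Series` with two missing entries), average imputation and the interpolation idea we will use later in this lecture look like this in `pandas`:
+
+```python
+import numpy as np
+import pandas as pd
+
+s = pd.Series([2.0, np.nan, 4.0, np.nan, 8.0])
+
+s.fillna(s.mean())   # average imputation: fill gaps with the observed mean
+s.interpolate()      # linear interpolation between neighboring observed values
+```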
+
+# EDA Demo 1: Tuberculosis in the United States
+
+Now, let's walk through the data-cleaning and EDA workflow to see what can we learn about the presence of Tuberculosis in the United States!
+
+We will examine the data included in the [original CDC article](https://www.cdc.gov/mmwr/volumes/71/wr/mm7112a1.htm?s_cid=mm7112a1_w#T1_down) published in 2021.
+
+
+## CSVs and Field Names
+Suppose Table 1 was saved as a CSV file located in `data/cdc_tuberculosis.csv`.
+
+We can then explore the CSV (which is a text file, and does not contain binary-encoded data) in many ways:
+
+1. Using a text editor like emacs, vim, VSCode, etc.
+2. Opening the CSV directly in DataHub (read-only), Excel, Google Sheets, etc.
+3. The `Python` file object
+4. `pandas`, using `pd.read_csv()`
+
+To try out options 1 and 2, you can view or download the Tuberculosis dataset from the [lecture demo notebook](https://data100.datahub.berkeley.edu/hub/user-redirect/git-pull?repo=https%3A%2F%2Fgithub.com%2FDS-100%2Ffa23-student&urlpath=lab%2Ftree%2Ffa23-student%2Flecture%2Flec05%2Flec04-eda.ipynb&branch=main) under the `data` folder in the left hand menu. Notice how the CSV file is a type of **rectangular data (i.e., tabular data) stored as comma-separated values**.
+
+Next, let's try out option 3 using the `Python` file object. We'll look at the first four lines:
+
+```{python}
+#| code-fold: true
+with open("data/cdc_tuberculosis.csv", "r") as f:
+ i = 0
+ for row in f:
+ print(row)
+ i += 1
+ if i > 3:
+ break
+```
+
+Whoa, why are there blank lines interspersed between the lines of the CSV?
+
+You may recall that all line breaks in text files are encoded as the special newline character `\n`. Python's `print()` prints each string (including the newline), and an additional newline on top of that.
+
+If you're curious, we can use the `repr()` function to return the raw string with all special characters:
+
+```{python}
+#| code-fold: true
+with open("data/cdc_tuberculosis.csv", "r") as f:
+ i = 0
+ for row in f:
+ print(repr(row)) # print raw strings
+ i += 1
+ if i > 3:
+ break
+```
+
+Finally, let's try option 4 and use the tried-and-true Data 100 approach: `pandas`.
+
+```{python}
+#| code-fold: false
+tb_df = pd.read_csv("data/cdc_tuberculosis.csv")
+tb_df.head()
+```
+
+You may notice some strange things about this table: what's up with the "Unnamed" column names and the first row?
+
+Congratulations — you're ready to wrangle your data! Because of how things are stored, we'll need to clean the data a bit to name our columns better.
+
+A reasonable first step is to identify the row with the right header. The `pd.read_csv()` function ([documentation](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html)) has the convenient `header` parameter that we can set to use the elements in row 1 as the appropriate columns:
+
+```{python}
+#| code-fold: false
+tb_df = pd.read_csv("data/cdc_tuberculosis.csv", header=1) # row index
+tb_df.head(5)
+```
+
+Wait...but now we can't differentiate between the "Number of TB cases" and "TB incidence" year columns. `pandas` has tried to make our lives easier by automatically adding ".1" to the latter columns, but this doesn't help us, as humans, understand the data.
+
+We can do this manually with `df.rename()` ([documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.rename.html?highlight=rename#pandas.DataFrame.rename)):
+
+```{python}
+#| code-fold: false
+rename_dict = {'2019': 'TB cases 2019',
+ '2020': 'TB cases 2020',
+ '2021': 'TB cases 2021',
+ '2019.1': 'TB incidence 2019',
+ '2020.1': 'TB incidence 2020',
+ '2021.1': 'TB incidence 2021'}
+tb_df = tb_df.rename(columns=rename_dict)
+tb_df.head(5)
+```
+
+## Record Granularity
+
+You might already be wondering: what's up with that first record?
+
+Row 0 is what we call a **rollup record**, or summary record. It's often useful when displaying tables to humans. The **granularity** of record 0 (Totals) vs the rest of the records (States) is different.
+
+Okay, EDA step two. How was the rollup record aggregated?
+
+Let's check if the Total TB cases are the sum of all state TB cases. If we sum over all rows, we should get **2x** the total cases in each of the TB cases columns (why do you think this is?).
+
+```{python}
+#| code-fold: true
+tb_df.sum(axis=0)
+```
+
+Whoa, what's going on with the TB cases in 2019, 2020, and 2021? Check out the column types:
+
+```{python}
+#| code-fold: true
+tb_df.dtypes
+```
+
+Since there are commas in the values for TB cases, the numbers are read as the `object` datatype, or **storage type** (close to the `Python` string datatype), so `pandas` is concatenating strings instead of adding integers (recall that `Python` can "sum", or concatenate, strings together: `"data" + "100"` evaluates to `"data100"`).
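+
+A tiny demo of the problem (made-up values): with the `object` dtype, "summing" concatenates, while numeric dtypes add as expected.
+
+```python
+import pandas as pd
+
+pd.Series(["1,234", "5,678"]).sum()   # -> '1,2345,678' (string concatenation)
+pd.Series([1234, 5678]).sum()         # -> 6912 (numeric addition)
+```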
+
+
+Fortunately `read_csv` also has a `thousands` parameter ([documentation](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html)):
+
+```{python}
+#| code-fold: false
+# improve readability: chaining method calls with outer parentheses/line breaks
+tb_df = (
+ pd.read_csv("data/cdc_tuberculosis.csv", header=1, thousands=',')
+ .rename(columns=rename_dict)
+)
+tb_df.head(5)
+```
+
+```{python}
+#| code-fold: false
+tb_df.sum()
+```
+
+The Total TB cases look right. Phew!
+
+Let's just look at the records with **state-level granularity**:
+
+```{python}
+#| code-fold: true
+state_tb_df = tb_df[1:]
+state_tb_df.head(5)
+```
+
+## Gather Census Data
+
+U.S. Census population estimates [source](https://www.census.gov/data/tables/time-series/demo/popest/2010s-state-total.html) (2019), [source](https://www.census.gov/data/tables/time-series/demo/popest/2020s-state-total.html) (2020-2021).
+
+Running the below cells cleans the data.
+There are a few new methods here:
+
+* `df.convert_dtypes()` ([documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.convert_dtypes.html)) conveniently infers better dtypes for each column (e.g., whole-number floats become integers); the details are out of scope for the class.
+* `df.dropna()` ([documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.dropna.html)) will be explained in more detail next time.
+
+```{python}
+#| code-fold: true
+# 2010s census data
+census_2010s_df = pd.read_csv("data/nst-est2019-01.csv", header=3, thousands=",")
+census_2010s_df = (
+ census_2010s_df
+ .reset_index()
+ .drop(columns=["index", "Census", "Estimates Base"])
+ .rename(columns={"Unnamed: 0": "Geographic Area"})
+ .convert_dtypes() # "smart" converting of columns, use at your own risk
+ .dropna() # we'll introduce this next time
+)
+census_2010s_df['Geographic Area'] = census_2010s_df['Geographic Area'].str.strip('.')
+
+# with pd.option_context('display.min_rows', 30): # shows more rows
+# display(census_2010s_df)
+
+census_2010s_df.head(5)
+```
+
+Occasionally, you will want to modify code that you have imported. To reimport those modifications, you can either use `Python`'s `importlib` library:
+
+```python
+from importlib import reload
+reload(utils)
+```
+
+or use `IPython` magic, which will intelligently reimport code when files change:
+
+```python
+%load_ext autoreload
+%autoreload 2
+```
+
+```{python}
+#| code-fold: true
+# census 2020s data
+census_2020s_df = pd.read_csv("data/NST-EST2022-POP.csv", header=3, thousands=",")
+census_2020s_df = (
+ census_2020s_df
+ .reset_index()
+ .drop(columns=["index", "Unnamed: 1"])
+ .rename(columns={"Unnamed: 0": "Geographic Area"})
+ .convert_dtypes() # "smart" converting of columns, use at your own risk
+ .dropna() # we'll introduce this next time
+)
+census_2020s_df['Geographic Area'] = census_2020s_df['Geographic Area'].str.strip('.')
+
+census_2020s_df.head(5)
+```
+
+## Joining Data (Merging `DataFrame`s)
+
+Time to `merge`! Here we use the `DataFrame` method `df1.merge(right=df2, ...)` on `DataFrame` `df1` ([documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.merge.html)). Contrast this with the function `pd.merge(left=df1, right=df2, ...)` ([documentation](https://pandas.pydata.org/docs/reference/api/pandas.merge.html?highlight=pandas%20merge#pandas.merge)). Feel free to use either.
+
+```{python}
+#| code-fold: false
+# merge TB DataFrame with two US census DataFrames
+tb_census_df = (
+ tb_df
+ .merge(right=census_2010s_df,
+ left_on="U.S. jurisdiction", right_on="Geographic Area")
+ .merge(right=census_2020s_df,
+ left_on="U.S. jurisdiction", right_on="Geographic Area")
+)
+tb_census_df.head(5)
+```
+
+Having all of these columns is a little unwieldy. We could either drop the unneeded columns now, or just merge on smaller census `DataFrame`s. Let's do the latter.
+
+```{python}
+#| code-fold: false
+# try merging again, but cleaner this time
+tb_census_df = (
+ tb_df
+ .merge(right=census_2010s_df[["Geographic Area", "2019"]],
+ left_on="U.S. jurisdiction", right_on="Geographic Area")
+ .drop(columns="Geographic Area")
+ .merge(right=census_2020s_df[["Geographic Area", "2020", "2021"]],
+ left_on="U.S. jurisdiction", right_on="Geographic Area")
+ .drop(columns="Geographic Area")
+)
+tb_census_df.head(5)
+```
+
+## Reproducing Data: Compute Incidence
+
+Let's recompute incidence to make sure we know where the original CDC numbers came from.
+
+From the [CDC report](https://www.cdc.gov/mmwr/volumes/71/wr/mm7112a1.htm?s_cid=mm7112a1_w#T1_down): TB incidence is computed as “Cases per 100,000 persons using mid-year population estimates from the U.S. Census Bureau.”
+
+If we define a group as 100,000 people, then we can compute the TB incidence for a given state population as
+
+$$\text{TB incidence} = \frac{\text{TB cases in population}}{\text{groups in population}} = \frac{\text{TB cases in population}}{\text{population}/100000} $$
+
+$$= \frac{\text{TB cases in population}}{\text{population}} \times 100000$$
+
+Let's try this for 2019:
+
+```{python}
+#| code-fold: false
+tb_census_df["recompute incidence 2019"] = tb_census_df["TB cases 2019"]/tb_census_df["2019"]*100000
+tb_census_df.head(5)
+```
+
+Awesome!!!
+
+Let's use a for-loop and `Python` format strings to compute TB incidence for all years. `Python` f-strings are just used for the purposes of this demo, but they're handy to know when you explore data beyond this course ([documentation](https://docs.python.org/3/tutorial/inputoutput.html)).
+
+```{python}
+#| code-fold: false
+# recompute incidence for all years
+for year in [2019, 2020, 2021]:
+ tb_census_df[f"recompute incidence {year}"] = tb_census_df[f"TB cases {year}"]/tb_census_df[f"{year}"]*100000
+tb_census_df.head(5)
+```
+
+These numbers look pretty close!!! There are a few errors in the hundredths place, particularly in 2021. It may be useful to further explore reasons behind this discrepancy.
+
+```{python}
+#| code-fold: false
+tb_census_df.describe()
+```
+
+## Bonus EDA: Reproducing the Reported Statistic
+
+
+**How do we reproduce that reported statistic in the original [CDC report](https://www.cdc.gov/mmwr/volumes/71/wr/mm7112a1.htm?s_cid=mm7112a1_w)?**
+
+> Reported TB incidence (cases per 100,000 persons) increased **9.4%**, from **2.2** during 2020 to **2.4** during 2021 but was lower than incidence during 2019 (2.7). Increases occurred among both U.S.-born and non–U.S.-born persons.
+
+This is TB incidence computed across the entire U.S. population! How do we reproduce this?
+
+* We need to reproduce the "Total" TB incidences in our rolled record.
+* But our current `tb_census_df` only has 51 entries (50 states plus Washington, D.C.). There is no rolled record.
+* What happened...?
+
+Let's get exploring!
+
+Before we keep exploring, we'll set all indexes to more meaningful values, instead of just numbers that pertain to some row at some point. This will make our cleaning slightly easier.
+
+```{python}
+#| code-fold: true
+tb_df = tb_df.set_index("U.S. jurisdiction")
+tb_df.head(5)
+```
+
+```{python}
+#| code-fold: false
+census_2010s_df = census_2010s_df.set_index("Geographic Area")
+census_2010s_df.head(5)
+```
+
+```{python}
+#| code-fold: false
+census_2020s_df = census_2020s_df.set_index("Geographic Area")
+census_2020s_df.head(5)
+```
+
+It turns out that our merge above only kept state records, even though our original `tb_df` had the "Total" rolled record:
+
+```{python}
+#| code-fold: false
+tb_df.head()
+```
+
+Recall that `merge` performs an **inner** merge by default, meaning that it only preserves keys that are present in **both** `DataFrame`s.
+
+The rolled records in our census `DataFrame` have different `Geographic Area` fields, which was the key we merged on:
+
+```{python}
+#| code-fold: false
+census_2010s_df.head(5)
+```
+
+The Census `DataFrame` has several rolled records. The aggregate record we are looking for actually has the Geographic Area named "United States".
+
+One straightforward way to get the right merge is to rename the value itself. Because we now have the Geographic Area index, we'll use `df.rename()` ([documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.rename.html)):
+
+```{python}
+#| code-fold: false
+# rename rolled record for 2010s
+census_2010s_df.rename(index={'United States':'Total'}, inplace=True)
+census_2010s_df.head(5)
+```
+
+```{python}
+#| code-fold: false
+# same, but for 2020s rename rolled record
+census_2020s_df.rename(index={'United States':'Total'}, inplace=True)
+census_2020s_df.head(5)
+```
+
+<br/>
+
+Next let's rerun our merge. Note the different chaining, because we are now merging on indexes (`df.merge()` [documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.merge.html)).
+
+```{python}
+#| code-fold: false
+tb_census_df = (
+ tb_df
+ .merge(right=census_2010s_df[["2019"]],
+ left_index=True, right_index=True)
+ .merge(right=census_2020s_df[["2020", "2021"]],
+ left_index=True, right_index=True)
+)
+tb_census_df.head(5)
+```
+
+<br/>
+
+Finally, let's recompute our incidences:
+
+```{python}
+#| code-fold: false
+# recompute incidence for all years
+for year in [2019, 2020, 2021]:
+ tb_census_df[f"recompute incidence {year}"] = tb_census_df[f"TB cases {year}"]/tb_census_df[f"{year}"]*100000
+tb_census_df.head(5)
+```
+
+We reproduced the total U.S. incidences correctly!
+
+We're almost there. Let's revisit the quote:
+
+> Reported TB incidence (cases per 100,000 persons) increased **9.4%**, from **2.2** during 2020 to **2.4** during 2021 but was lower than incidence during 2019 (2.7). Increases occurred among both U.S.-born and non–U.S.-born persons.
+
+Recall that percent change from $A$ to $B$ is computed as
+$\text{percent change} = \frac{B - A}{A} \times 100$.
+
+```{python}
+#| code-fold: false
+#| tags: []
+incidence_2020 = tb_census_df.loc['Total', 'recompute incidence 2020']
+incidence_2020
+```
+
+```{python}
+#| code-fold: false
+#| tags: []
+incidence_2021 = tb_census_df.loc['Total', 'recompute incidence 2021']
+incidence_2021
+```
+
+```{python}
+#| code-fold: false
+#| tags: []
+difference = (incidence_2021 - incidence_2020)/incidence_2020 * 100
+difference
+```
+
+# EDA Demo 2: Mauna Loa CO<sub>2</sub> Data -- A Lesson in Data Faithfulness
+
+[Mauna Loa Observatory](https://gml.noaa.gov/ccgg/trends/data.html) has been monitoring CO<sub>2</sub> concentrations since 1958
+
+```{python}
+#| code-fold: false
+co2_file = "data/co2_mm_mlo.txt"
+```
+
+Let's do some **EDA**!!
+
+## Reading this file into Pandas?
+Let's instead check out this `.txt` file. Some questions to keep in mind: Do we trust this file extension? What structure is it?
+
+Lines 71-78 (inclusive) are shown below:
+
+ line number | file contents
+
+ 71 | # decimal average interpolated trend #days
+ 72 | # date (season corr)
+ 73 | 1958 3 1958.208 315.71 315.71 314.62 -1
+ 74 | 1958 4 1958.292 317.45 317.45 315.29 -1
+ 75 | 1958 5 1958.375 317.50 317.50 314.71 -1
+ 76 | 1958 6 1958.458 -99.99 317.10 314.85 -1
+ 77 | 1958 7 1958.542 315.86 315.86 314.98 -1
+ 78 | 1958 8 1958.625 314.93 314.93 315.94 -1
+
+
+Notice how:
+
+- The values are separated by white space, possibly tabs.
+- The values line up in fixed-width columns down the rows. For example, the month always appears in the 7th to 8th character positions of each line.
+- The 71st and 72nd lines in the file contain column headings split over two lines.
+
+We can use `read_csv` to read the data into a `pandas` `DataFrame`, and we provide several arguments to specify that the separators are white space, there is no header (**we will set our own column names**), and to skip the first 72 rows of the file.
+
+```{python}
+#| code-fold: false
+co2 = pd.read_csv(
+    co2_file, header = None, skiprows = 72,
+    sep = r'\s+'  # delimiter for continuous whitespace (stay tuned for regex next lecture)
+)
+co2.head()
+```
+
+Congratulations! You've wrangled the data!
+
+<br/>
+
+...But our columns aren't named.
+**We need to do more EDA.**
+
+## Exploring Variable Feature Types
+
+The NOAA [webpage](https://gml.noaa.gov/ccgg/trends/) might have some useful tidbits (in this case it doesn't).
+
+Using this information, we'll rerun `pd.read_csv`, but this time with some **custom column names.**
+
+```{python}
+#| code-fold: false
+co2 = pd.read_csv(
+    co2_file, header = None, skiprows = 72,
+    sep = r'\s+',  # regex for continuous whitespace (next lecture)
+    names = ['Yr', 'Mo', 'DecDate', 'Avg', 'Int', 'Trend', 'Days']
+)
+co2.head()
+```
+
+## Visualizing CO<sub>2</sub>
+Scientific studies tend to have very clean data, right...? Let's jump right in and make a time series plot of CO2 monthly averages.
+
+```{python}
+#| code-fold: true
+sns.lineplot(x='DecDate', y='Avg', data=co2);
+```
+
+The code above uses the `seaborn` plotting library (abbreviated `sns`). We will cover this in the Visualization lecture, but for now you don't need to worry about how it works!
+
+Yikes! Plotting the data uncovered a problem. The sharp vertical lines suggest that we have some **missing values**. What happened here?
+
+```{python}
+#| code-fold: false
+co2.head()
+```
+
+```{python}
+#| code-fold: false
+co2.tail()
+```
+
+Some data have unusual values like -1 and -99.99.
+
+Let's check the description at the top of the file again.
+
+* -1 signifies a missing value for the number of days `Days` the equipment was in operation that month.
+* -99.99 denotes a missing monthly average `Avg`
+
+How can we fix this? First, let's explore other aspects of our data. Understanding our data will help us decide what to do with the missing values.
+
+<br/>
+
+
+## Sanity Checks: Reasoning about the data
+First, we consider the shape of the data. How many rows should we have?
+
+* If chronological order, we should have one record per month.
+* Data from March 1958 to August 2019.
+* We should have $ 12 \times (2019-1957) - 2 - 4 = 738 $ records.
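+
+A quick check of that arithmetic:
+
+```python
+# 62 calendar years (1958-2019), minus Jan-Feb 1958 and Sep-Dec 2019
+12 * (2019 - 1957) - 2 - 4   # 738
+```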
+
+```{python}
+#| code-fold: false
+co2.shape
+```
+
+Nice!! The number of rows (i.e., records) matches our expectations.
+
+<br/>
+
+
+Let's now check the quality of each feature.
+
+## Understanding Missing Value 1: `Days`
+`Days` is a time field, so let's analyze other time fields to see if there is an explanation for missing values of days of operation.
+
+Let's start with **months**, `Mo`.
+
+Are we missing any records? Each month should appear either 61 or 62 times (March 1958-August 2019).
+
+```{python}
+#| code-fold: false
+co2["Mo"].value_counts().sort_index()
+```
+
+As expected, Jan, Feb, Sep, Oct, Nov, and Dec have 61 occurrences, and the rest have 62.
+
+<br/>
+
+Next let's explore **days** `Days` itself, which is the number of days that the measurement equipment worked.
+
+```{python}
+#| code-fold: true
+sns.displot(co2['Days']);
+plt.title("Distribution of days feature"); # suppresses unneeded plotting output
+```
+
+In terms of data quality, a handful of months have averages based on measurements taken on fewer than half the days. In addition, there are nearly 200 missing values--**that's about 27% of the data**!
+
+<br/>
+
+Finally, let's check the last time feature, **year** `Yr`.
+
+Let's check whether there is any connection between missingness and the year of the recording.
+
+```{python}
+#| code-fold: true
+sns.scatterplot(x="Yr", y="Days", data=co2);
+plt.title("Day field by Year"); # the ; suppresses output
+```
+
+**Observations**:
+
+* All of the missing data are in the early years of operation.
+* It appears there may have been problems with equipment in the mid to late 80s.
+
+**Potential Next Steps**:
+
+* Confirm these explanations through documentation about the historical readings.
+* Maybe drop the earliest recordings? However, we would want to delay such action until after we have examined the time trends and assessed whether there are any potential problems.
+
+<br/>
+
+## Understanding Missing Value 2: `Avg`
+Next, let's return to the -99.99 values in `Avg` to analyze the overall quality of the CO2 measurements. We'll plot a histogram of the average CO<sub>2</sub> measurements
+
+```{python}
+#| code-fold: true
+# Histograms of average CO2 measurements
+sns.displot(co2['Avg']);
+```
+
+The non-missing values are in the 300-400 range (a regular range of CO2 levels).
+
+We also see that there are only a few missing `Avg` values (**<1% of values**). Let's examine all of them:
+
+```{python}
+#| code-fold: false
+co2[co2["Avg"] < 0]
+```
+
+There doesn't seem to be a pattern to these values, other than that most records also were missing `Days` data.
+
+## Drop, `NaN`, or Impute Missing `Avg` Data?
+
+How should we address the invalid `Avg` data?
+
+1. Drop records
+2. Set to NaN
+3. Impute using some strategy
+
+Remember we want to fix the following plot:
+
+```{python}
+#| code-fold: true
+sns.lineplot(x='DecDate', y='Avg', data=co2)
+plt.title("CO2 Average By Month");
+```
+
+Since we are plotting `Avg` vs `DecDate`, we should just focus on dealing with missing values for `Avg`.
+
+
+Let's consider a few options:
+
+1. Drop those records
+2. Replace -99.99 with NaN
+3. Substitute it with a likely value for the average CO2?
+
+What do you think are the pros and cons of each possible action?
+
+<br/>
+
+
+Let's examine each of these three options.
+
+```{python}
+#| code-fold: false
+# 1. Drop missing values
+co2_drop = co2[co2['Avg'] > 0]
+co2_drop.head()
+```
+
+```{python}
+#| code-fold: false
+# 2. Replace -99.99 with NaN
+co2_NA = co2.replace(-99.99, np.nan)
+co2_NA.head()
+```
+
+We'll also use a third version of the data.
+
+First, we note that the dataset already comes with a **substitute value** for the -99.99.
+
+From the file description:
+
+> The `interpolated` column includes average values from the preceding column (`average`)
+and **interpolated values** where data are missing. Interpolated values are
+computed in two steps...
+
+The `Int` feature has values that exactly match those in `Avg`, except when `Avg` is -99.99, in which case a **reasonable** estimate is used instead.
+
+So, the third version of our data will use the `Int` feature instead of `Avg`.
+
+```{python}
+#| code-fold: false
+# 3. Use interpolated column which estimates missing Avg values
+co2_impute = co2.copy()
+co2_impute['Avg'] = co2['Int']
+co2_impute.head()
+```
+
+What's a **reasonable** estimate?
+
+To answer this question, let's zoom in on a short time period, say the measurements in 1958 (where we know we have two missing values).
+
+```{python}
+#| code-fold: true
+# results of plotting data in 1958
+
+def line_and_points(data, ax, title):
+ # assumes single year, hence Mo
+ ax.plot('Mo', 'Avg', data=data)
+ ax.scatter('Mo', 'Avg', data=data)
+ ax.set_xlim(2, 13)
+ ax.set_title(title)
+ ax.set_xticks(np.arange(3, 13))
+
+def data_year(data, year):
+    # select the records for the given year
+    return data[data["Yr"] == year]
+
+# uses matplotlib subplots
+# you may see more next week; focus on output for now
+fig, axes = plt.subplots(ncols = 3, figsize=(12, 4), sharey=True)
+
+year = 1958
+line_and_points(data_year(co2_drop, year), axes[0], title="1. Drop Missing")
+line_and_points(data_year(co2_NA, year), axes[1], title="2. Missing Set to NaN")
+line_and_points(data_year(co2_impute, year), axes[2], title="3. Missing Interpolated")
+
+fig.suptitle(f"Monthly Averages for {year}")
+plt.tight_layout()
+```
+
+In the big picture, since only 7 `Avg` values are missing (**<1%** of 738 months), any of these approaches would work.
+
+However, there is some appeal to **option 3: imputing**:
+
+* Shows seasonal trends for CO2
+* We are plotting all months in our data as a line plot
+
+<br/>
+
+
+Let's replot our original figure with option 3:
+
+```{python}
+#| code-fold: true
+sns.lineplot(x='DecDate', y='Avg', data=co2_impute)
+plt.title("CO2 Average By Month, Imputed");
+```
+
+Looks pretty close to what we see on the NOAA [website](https://gml.noaa.gov/ccgg/trends/)!
+
+## Presenting the data: A Discussion on Data Granularity
+
+From the description:
+
+* Monthly measurements are averages of daily average measurements.
+* The NOAA GML website has datasets for daily/hourly measurements too.
+
+The data you present depends on your research question.
+
+**How do CO2 levels vary by season?**
+
+* You might want to keep average monthly data.
+
+**Are CO2 levels rising over the past 50+ years, consistent with global warming predictions?**
+
+* You might be happier with a **coarser granularity** of average year data!
+
+```{python}
+#| code-fold: true
+co2_year = co2_impute.groupby('Yr').mean()
+sns.lineplot(x='Yr', y='Avg', data=co2_year)
+plt.title("CO2 Average By Year");
+```
+
+Indeed, we see a rise by nearly 100 ppm of CO2 since Mauna Loa began recording in 1958.
+
+# Summary
+We went over a lot of content this lecture; let's summarize the most important points:
+
+## Dealing with Missing Values
+There are a few options we can take to deal with missing data:
+
+* Drop missing records
+* Keep `NaN` missing values
+* Impute using an interpolated column
+
+## EDA and Data Wrangling
+There are several ways to approach EDA and Data Wrangling:
+
+* Examine the **data and metadata**: what is the date, size, organization, and structure of the data?
+* Examine each **field/attribute/dimension** individually.
+* Examine pairs of related dimensions (e.g. breaking down grades by major).
+* Along the way, we can:
+ * **Visualize** or summarize the data.
+ * **Validate assumptions** about data and its collection process. Pay particular attention to when the data was collected.
+ * Identify and **address anomalies**.
+ * Apply data transformations and corrections (we'll cover this in the upcoming lecture).
+ * **Record everything you do!** Developing in Jupyter Notebook promotes *reproducibility* of your own work!
diff --git a/docs/eda/eda_files/figure-html/cell-62-output-1.png b/docs/eda/eda_files/figure-html/cell-62-output-1.png
index a04218cf..f392d5f9 100644
Binary files a/docs/eda/eda_files/figure-html/cell-62-output-1.png and b/docs/eda/eda_files/figure-html/cell-62-output-1.png differ
diff --git a/docs/eda/eda_files/figure-html/cell-67-output-1.png b/docs/eda/eda_files/figure-html/cell-67-output-1.png
new file mode 100644
index 00000000..be96b8c9
Binary files /dev/null and b/docs/eda/eda_files/figure-html/cell-67-output-1.png differ
diff --git a/docs/eda/eda_files/figure-html/cell-67-output-2.png b/docs/eda/eda_files/figure-html/cell-67-output-2.png
deleted file mode 100644
index 31857f62..00000000
Binary files a/docs/eda/eda_files/figure-html/cell-67-output-2.png and /dev/null differ
diff --git a/docs/eda/eda_files/figure-html/cell-68-output-1.png b/docs/eda/eda_files/figure-html/cell-68-output-1.png
index 67c3959d..ffd29ff8 100644
Binary files a/docs/eda/eda_files/figure-html/cell-68-output-1.png and b/docs/eda/eda_files/figure-html/cell-68-output-1.png differ
diff --git a/docs/eda/eda_files/figure-html/cell-69-output-1.png b/docs/eda/eda_files/figure-html/cell-69-output-1.png
new file mode 100644
index 00000000..29088928
Binary files /dev/null and b/docs/eda/eda_files/figure-html/cell-69-output-1.png differ
diff --git a/docs/eda/eda_files/figure-html/cell-69-output-2.png b/docs/eda/eda_files/figure-html/cell-69-output-2.png
deleted file mode 100644
index fb28f5d5..00000000
Binary files a/docs/eda/eda_files/figure-html/cell-69-output-2.png and /dev/null differ
diff --git a/docs/eda/eda_files/figure-html/cell-71-output-1.png b/docs/eda/eda_files/figure-html/cell-71-output-1.png
index 39cac822..49ef3d6a 100644
Binary files a/docs/eda/eda_files/figure-html/cell-71-output-1.png and b/docs/eda/eda_files/figure-html/cell-71-output-1.png differ
diff --git a/docs/eda/eda_files/figure-html/cell-75-output-1.png b/docs/eda/eda_files/figure-html/cell-75-output-1.png
index 6382e58a..15a5fe82 100644
Binary files a/docs/eda/eda_files/figure-html/cell-75-output-1.png and b/docs/eda/eda_files/figure-html/cell-75-output-1.png differ
diff --git a/docs/eda/eda_files/figure-html/cell-76-output-1.png b/docs/eda/eda_files/figure-html/cell-76-output-1.png
index db2b0dee..40b1fc71 100644
Binary files a/docs/eda/eda_files/figure-html/cell-76-output-1.png and b/docs/eda/eda_files/figure-html/cell-76-output-1.png differ
diff --git a/docs/eda/eda_files/figure-html/cell-77-output-1.png b/docs/eda/eda_files/figure-html/cell-77-output-1.png
index 897b8b39..99b6c2d1 100644
Binary files a/docs/eda/eda_files/figure-html/cell-77-output-1.png and b/docs/eda/eda_files/figure-html/cell-77-output-1.png differ
diff --git a/docs/feature_engineering/feature_engineering.html b/docs/feature_engineering/feature_engineering.html
index ea770e7f..22d26788 100644
--- a/docs/feature_engineering/feature_engineering.html
+++ b/docs/feature_engineering/feature_engineering.html
@@ -556,7 +556,7 @@
my_model.fit(X, Y)
-LinearRegression()
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.LinearRegression()
+LinearRegression()
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.LinearRegression()
Code
- +
Code
-
+
@@ -4975,1218 +4963,1218 @@ <
Source Code
----
-title: Data Cleaning and EDA
-execute:
- echo: true
-format:
- html:
- code-fold: true
- code-tools: true
- toc: true
- toc-title: Data Cleaning and EDA
- page-layout: full
- theme:
- - cosmo
- - cerulean
- callout-icon: false
-jupyter: python3
----
-
-```{python}
-#| code-fold: true
-import numpy as np
-import pandas as pd
-
-import matplotlib.pyplot as plt
-import seaborn as sns
-#%matplotlib inline
-plt.rcParams['figure.figsize'] = (12, 9)
-
-sns.set()
-sns.set_context('talk')
-np.set_printoptions(threshold=20, precision=2, suppress=True)
-pd.set_option('display.max_rows', 30)
-pd.set_option('display.max_columns', None)
-pd.set_option('display.precision', 2)
-# This option stops scientific notation for pandas
-pd.set_option('display.float_format', '{:.2f}'.format)
-
-# Silence some spurious seaborn warnings
-import warnings
-warnings.filterwarnings("ignore", category=FutureWarning)
-```
-
-::: {.callout-note collapse="false"}
-## Learning Outcomes
-* Recognize common file formats
-* Categorize data by its variable type
-* Build awareness of issues with data faithfulness and develop targeted solutions
-:::
-
-**This content is covered in lectures 4, 5, and 6.**
-
-In the past few lectures, we've learned that `pandas` is a toolkit to restructure, modify, and explore a dataset. What we haven't yet touched on is *how* to make these data transformation decisions. When we receive a new set of data from the "real world," how do we know what processing we should do to convert this data into a usable form?
-
-**Data cleaning**, also called **data wrangling**, is the process of transforming raw data to facilitate subsequent analysis. It is often used to address issues like:
-
-* Unclear structure or formatting
-* Missing or corrupted values
-* Unit conversions
-* ...and so on
-
-**Exploratory Data Analysis (EDA)** is the process of understanding a new dataset. It is an open-ended, informal analysis that involves familiarizing ourselves with the variables present in the data, discovering potential hypotheses, and identifying possible issues with the data. This last point can often motivate further data cleaning to address any problems with the dataset's format; because of this, EDA and data cleaning are often thought of as an "infinite loop," with each process driving the other.
-
-In this lecture, we will consider the key properties of data to consider when performing data cleaning and EDA. In doing so, we'll develop a "checklist" of sorts for you to consider when approaching a new dataset. Throughout this process, we'll build a deeper understanding of this early (but very important!) stage of the data science lifecycle.
-
-## Structure
-
-### File Formats
-There are many file types for storing structured data: TSV, JSON, XML, ASCII, SAS, etc. We'll only cover CSV, TSV, and JSON in lecture, but you'll likely encounter other formats as you work with different datasets. Reading documentation is your best bet for understanding how to process the multitude of different file types.
-
-#### CSV
-CSVs, which stand for **Comma-Separated Values**, are a common tabular data format.
-In the past two `pandas` lectures, we briefly touched on the idea of file format: the way data is encoded in a file for storage. Specifically, our `elections` and `babynames` datasets were stored and loaded as CSVs:
-
-```{python}
-#| code-fold: false
-pd.read_csv("data/elections.csv").head(5)
-```
-
-To better understand the properties of a CSV, let's take a look at the first few rows of the raw data file to see what it looks like before being loaded into a `DataFrame`. We'll use the `repr()` function to return the raw string with its special characters:
-
-```{python}
-#| code-fold: false
-with open("data/elections.csv", "r") as table:
- i = 0
- for row in table:
- print(repr(row))
- i += 1
- if i > 3:
- break
-```
-
-Each row, or **record**, in the data is delimited by a newline `\n`. Each column, or **field**, in the data is delimited by a comma `,` (hence, comma-separated!).
-
-#### TSV
-
-Another common file type is **TSV (Tab-Separated Values)**. In a TSV, records are still delimited by a newline `\n`, while fields are delimited by `\t` tab character.
-
-Let's check out the first few rows of the raw TSV file. Again, we'll use the `repr()` function so that `print` shows the special characters.
-
-```{python}
-#| code-fold: false
-with open("data/elections.txt", "r") as table:
- i = 0
- for row in table:
- print(repr(row))
- i += 1
- if i > 3:
- break
-```
-
-TSVs can be loaded into `pandas` using `pd.read_csv`. We'll need to specify the **delimiter** with the parameter `sep='\t'` [(documentation)](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html).
-
-```{python}
-#| code-fold: false
-pd.read_csv("data/elections.txt", sep='\t').head(3)
-```
-
-An issue with CSVs and TSVs comes up whenever there are commas or tabs within the records. How does `pandas` differentiate between a comma delimiter vs. a comma within the field itself, for example `8,900`? To remedy this, check out the [`quotechar` parameter](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html).
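-
-As a quick aside, here is a minimal sketch (using a tiny made-up CSV string rather than one of the lecture's files) of how quoting lets `pandas` keep an embedded comma inside a single field:
-
-```{python}
-#| code-fold: false
-from io import StringIO
-
-# the quoted field "Smith, Jr." survives parsing as one value, and thousands=',' parses "8,900" as a number
-tiny_csv = StringIO('Candidate,Votes\n"Smith, Jr.","8,900"\n')
-pd.read_csv(tiny_csv, quotechar='"', thousands=',')
-```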
-
-#### JSON
-**JSON (JavaScript Object Notation)** files behave similarly to Python dictionaries. A raw JSON is shown below.
-
-```{python}
-#| code-fold: false
-with open("data/elections.json", "r") as table:
- i = 0
- for row in table:
- print(row)
- i += 1
- if i > 8:
- break
-```
-
-JSON files can be loaded into `pandas` using `pd.read_json`.
-
-```{python}
-#| code-fold: false
-pd.read_json('data/elections.json').head(3)
-```
-
-##### EDA with JSON: Berkeley COVID-19 Data
-The City of Berkeley Open Data [website](https://data.cityofberkeley.info/Health/COVID-19-Confirmed-Cases/xn6j-b766) has a dataset with COVID-19 Confirmed Cases among Berkeley residents by date. Let's download the file and save it as a JSON (note the source URL file type is also a JSON). In the interest of reproducible data science, we will download the data programmatically. We have defined some helper functions in the [`ds100_utils.py`](https://ds100.org/fa23/resources/assets/lectures/lec05/lec05-eda.html) file so that we can reuse them in many different notebooks.
-
-```{python}
-#| code-fold: false
-from ds100_utils import fetch_and_cache
-
-covid_file = fetch_and_cache(
- "https://data.cityofberkeley.info/api/views/xn6j-b766/rows.json?accessType=DOWNLOAD",
- "confirmed-cases.json",
- force=False)
-covid_file # a file path wrapper object
-```
-
-###### File Size
-Let's start our analysis by getting a rough estimate of the size of the dataset to inform the tools we use to view the data. For relatively small datasets, we can use a text editor or spreadsheet. For larger datasets, more programmatic exploration or distributed computing tools may be more fitting. Here we will use `Python` tools to probe the file.
-
-Since this appears to be a text file, let's investigate the number of lines, which often corresponds to the number of records.
-
-```{python}
-#| code-fold: false
-import os
-
-print(covid_file, "is", os.path.getsize(covid_file) / 1e6, "MB")
-
-with open(covid_file, "r") as f:
- print(covid_file, "is", sum(1 for l in f), "lines.")
-```
-
-###### Unix Commands
-As part of the EDA workflow, Unix commands can come in very handy. In fact, there's an entire book called ["Data Science at the Command Line"](https://datascienceatthecommandline.com/) that explores this idea in depth!
-In Jupyter/IPython, you can prefix lines with `!` to execute arbitrary Unix commands, and within those lines, you can refer to `Python` variables and expressions with the syntax `{expr}`.
-
-Here, we use the `ls` command to list files, using the `-lh` flags, which request "long format with information in human-readable form." We also use the `wc` command for "word count," but with the `-l` flag, which asks for line counts instead of words.
-
-These two give us the same information as the code above, albeit in a slightly different form:
-
-```{python}
-#| code-fold: false
-!ls -lh {covid_file}
-!wc -l {covid_file}
-```
-
-###### File Contents
-Let's explore the data format using `Python`.
-
-```{python}
-#| code-fold: false
-with open(covid_file, "r") as f:
- for i, row in enumerate(f):
- print(repr(row)) # print raw strings
- if i >= 4: break
-```
-
-We can use the `head` Unix command (which is where `pandas`' `head` method comes from!) to see the first few lines of the file:
-
-```{python}
-#| code-fold: false
-!head -5 {covid_file}
-```
-
-In order to load the JSON file into `pandas`, let's first do some EDA with `Python`'s `json` package to understand the particular structure of this JSON file so that we can decide what (if anything) to load into `pandas`. `Python` has relatively good support for JSON data since it closely matches the internal `Python` object model. In the following cell, we import the entire JSON datafile into a `Python` dictionary using the `json` package.
-
-```{python}
-#| code-fold: false
-import json
-
-with open(covid_file, "rb") as f:
- covid_json = json.load(f)
-```
-
-The `covid_json` variable is now a dictionary encoding the data in the file:
-
-```{python}
-#| code-fold: false
-type(covid_json)
-```
-
-We can examine what keys are in the top level json object by listing out the keys.
-
-```{python}
-#| code-fold: false
-covid_json.keys()
-```
-
-**Observation**: The JSON dictionary contains a `meta` key, which likely refers to metadata (data about the data). Metadata is often maintained alongside the data and can be a good source of additional information.
-
-
-We can investigate the metadata further by examining its keys.
-
-```{python}
-#| code-fold: false
-covid_json['meta'].keys()
-```
-
-The `meta` key contains another dictionary called `view`. This likely refers to meta-data about a particular "view" of some underlying database. We will learn more about views when we study SQL later in the class.
-
-```{python}
-#| code-fold: false
-covid_json['meta']['view'].keys()
-```
-
-Notice that this is a nested/recursive data structure. As we dig deeper, we reveal more and more keys and the corresponding data:
-
-```
-meta
-|-> data
- | ... (haven't explored yet)
-|-> view
- | -> id
- | -> name
- | -> attribution
- ...
- | -> description
- ...
- | -> columns
- ...
-```
-
-
-There is a key called description in the view sub dictionary. This likely contains a description of the data:
-
-```{python}
-#| code-fold: false
-print(covid_json['meta']['view']['description'])
-```
-
-###### Examining the Data Field for Records
-
-We can look at a few entries in the `data` field. This is what we'll load into `pandas`.
-
-```{python}
-#| code-fold: false
-for i in range(3):
- print(f"{i:03} | {covid_json['data'][i]}")
-```
-
-Observations:
-* These look like equal-length records, so maybe `data` is a table!
-* But what does each value in the record mean? Where can we find the column headers?
-
-For that, we'll need the `columns` key in the metadata dictionary. This returns a list:
-
-```{python}
-#| code-fold: false
-type(covid_json['meta']['view']['columns'])
-```
-
-###### Summary of exploring the JSON file
-
-1. The above **metadata** tells us a lot about the columns in the data including column names, potential data anomalies, and a basic statistic.
-1. Because of its non-tabular structure, JSON makes it easier (than CSV) to create **self-documenting data**, meaning that information about the data is stored in the same file as the data.
-1. Self-documenting data can be helpful since it maintains its own description and these descriptions are more likely to be updated as data changes.
-
-###### Loading COVID Data into `pandas`
-Finally, let's load the data (not the metadata) into a `pandas` `DataFrame`. In the following block of code we:
-
-1. Translate the JSON records into a `DataFrame`:
-
- * fields: `covid_json['meta']['view']['columns']`
- * records: `covid_json['data']`
-
-
-1. Remove columns that have no metadata description. This would be a bad idea in general, but here we remove these columns since the above analysis suggests they are unlikely to contain useful information.
-
-1. Examine the `tail` of the table.
-
-```{python}
-#| code-fold: false
-# Load the data from JSON and assign column titles
-covid = pd.DataFrame(
- covid_json['data'],
- columns=[c['name'] for c in covid_json['meta']['view']['columns']])
-
-covid.tail()
-```
-
-### Variable Types
-
-After loading data into a file, it's a good idea to take the time to understand what pieces of information are encoded in the dataset. In particular, we want to identify what variable types are present in our data. Broadly speaking, we can categorize variables into one of two overarching types.
-
-**Quantitative variables** describe some numeric quantity or amount. We can divide quantitative data further into:
-
-* **Continuous quantitative variables**: numeric data that can be measured on a continuous scale to arbitrary precision. Continuous variables do not have a strict set of possible values – they can be recorded to any number of decimal places. For example, weights, GPA, or CO<sub>2</sub> concentrations.
-* **Discrete quantitative variables**: numeric data that can only take on a finite set of possible values. For example, someone's age or the number of siblings they have.
-
-**Qualitative variables**, also known as **categorical variables**, describe data that isn't measuring some quantity or amount. The sub-categories of categorical data are:
-
-* **Ordinal qualitative variables**: categories with ordered levels. Specifically, ordinal variables are those where the difference between levels has no consistent, quantifiable meaning. Some examples include levels of education (high school, undergrad, grad, etc.), income bracket (low, medium, high), or Yelp rating.
-* **Nominal qualitative variables**: categories with no specific order. For example, someone's political affiliation or Cal ID number.
-
-![Classification of variable types](images/variable.png)
-
-Note that many variables don't sit neatly in just one of these categories. Qualitative variables could have numeric levels, and conversely, quantitative variables could be stored as strings.
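-
-For instance, here is a minimal sketch (with made-up values and a hypothetical `DataFrame`) showing that the storage dtype alone doesn't determine the variable type:
-
-```{python}
-#| code-fold: false
-# "Cal ID" is stored as integers but is nominal; "Income" is stored as strings but is quantitative
-mixed_types = pd.DataFrame({
-    "Cal ID": [3034619471, 3035619472],
-    "Income": ["1,000", "2,500"],
-})
-mixed_types.dtypes
-```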
-
-### Primary and Foreign Keys
-
-Last time, we introduced `.merge` as the `pandas` method for joining multiple `DataFrame`s together. In our discussion of joins, we touched on the idea of using a "key" to determine what rows should be merged from each table. Let's take a moment to examine this idea more closely.
-
-The **primary key** is the column or set of columns in a table that *uniquely* determine the values of the remaining columns. It can be thought of as the unique identifier for each individual row in the table. For example, a table of Data 100 students might use each student's Cal ID as the primary key.
-
-```{python}
-#| echo: false
-pd.DataFrame({"Cal ID":[3034619471, 3035619472, 3025619473, 3046789372], \
- "Name":["Oski", "Ollie", "Orrie", "Ollie"], \
- "Major":["Data Science", "Computer Science", "Data Science", "Economics"]})
-```
-
-The **foreign key** is the column or set of columns in a table that reference primary keys in other tables. Knowing a dataset's foreign keys can be useful when assigning the `left_on` and `right_on` parameters of `.merge`. In the table of office hour tickets below, `"Cal ID"` is a foreign key referencing the previous table.
-
-```{python}
-#| echo: false
-pd.DataFrame({"OH Request":[1, 2, 3, 4], \
- "Cal ID":[3034619471, 3035619472, 3025619473, 3035619472], \
- "Question":["HW 2 Q1", "HW 2 Q3", "Lab 3 Q4", "HW 2 Q7"]})
-```
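-
-As a minimal sketch (assuming the two tables above were assigned to the hypothetical names `students` and `requests`), merging on the foreign key might look like this:
-
-```{python}
-#| code-fold: false
-# hypothetical names for the two tables displayed above
-students = pd.DataFrame({"Cal ID": [3034619471, 3035619472, 3025619473, 3046789372],
-                         "Name": ["Oski", "Ollie", "Orrie", "Ollie"],
-                         "Major": ["Data Science", "Computer Science", "Data Science", "Economics"]})
-requests = pd.DataFrame({"OH Request": [1, 2, 3, 4],
-                         "Cal ID": [3034619471, 3035619472, 3025619473, 3035619472],
-                         "Question": ["HW 2 Q1", "HW 2 Q3", "Lab 3 Q4", "HW 2 Q7"]})
-
-# each office hours ticket picks up the matching student's Name and Major via the "Cal ID" key
-requests.merge(right=students, on="Cal ID")
-```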
-
-## Granularity, Scope, and Temporality
-
-After understanding the structure of the dataset, the next task is to determine what exactly the data represents. We'll do so by considering the data's granularity, scope, and temporality.
-
-### Granularity
-The **granularity** of a dataset is what a single row represents. You can also think of it as the level of detail included in the data. To determine the data's granularity, ask: what does each row in the dataset represent? Fine-grained data contains a high level of detail, with a single row representing a small individual unit. For example, each record may represent one person. Coarse-grained data is encoded such that a single row represents a large individual unit – for example, each record may represent a group of people.
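-
-As a minimal sketch (on made-up data with hypothetical names), aggregating fine-grained rows produces a coarser-grained table:
-
-```{python}
-#| code-fold: false
-# fine-grained: one row per person
-people = pd.DataFrame({"City": ["Berkeley", "Berkeley", "Oakland"],
-                       "Age": [20, 30, 40]})
-
-# coarse-grained: one row per city
-people.groupby("City").agg(residents=("Age", "size"), mean_age=("Age", "mean"))
-```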
-
-### Scope
-The **scope** of a dataset is the subset of the population covered by the data. If we were investigating student performance in Data Science courses, a dataset with a narrow scope might encompass all students enrolled in Data 100 whereas a dataset with an expansive scope might encompass all students in California.
-
-### Temporality
-The **temporality** of a dataset describes the periodicity over which the data was collected as well as when the data was most recently collected or updated.
-
-Time and date fields of a dataset could represent a few things:
-
-1. when the "event" happened
-2. when the data was collected, or when it was entered into the system
-3. when the data was copied into the database
-
-To fully understand the temporality of the data, it also may be necessary to standardize time zones or inspect recurring time-based trends in the data (do patterns recur in 24-hour periods? Over the course of a month? Seasonally?). The convention for standardizing time is Coordinated Universal Time (UTC), an international time standard measured at 0 degrees longitude that stays consistent throughout the year (no daylight saving time). Berkeley's time zone, Pacific Standard Time (PST), corresponds to UTC-8; during daylight saving time, Pacific Daylight Time (PDT) corresponds to UTC-7.
-
-#### Temporality with `pandas`' `dt` accessors
-Let's briefly look at how we can use `pandas`' `dt` accessors to work with dates/times in a dataset using the dataset you'll see in Lab 3: the Berkeley PD Calls for Service dataset.
-
-```{python}
-#| code-fold: true
-calls = pd.read_csv("data/Berkeley_PD_-_Calls_for_Service.csv")
-calls.head()
-```
-
-Looks like there are three columns with dates/times: `EVENTDT`, `EVENTTM`, and `InDbDate`.
-
-Most likely, `EVENTDT` stands for the date when the event took place, `EVENTTM` stands for the time of day the event took place (in 24-hr format), and `InDbDate` is the date this call was recorded in the database.
-
-If we check the data type of these columns, we will see they are stored as strings. We can convert them to `datetime` objects using the `pandas` `to_datetime` function.
-
-```{python}
-#| code-fold: false
-calls["EVENTDT"] = pd.to_datetime(calls["EVENTDT"])
-calls.head()
-```
-
-Now, we can use the `dt` accessor on this column.
-
-We can get the month:
-
-```{python}
-#| code-fold: false
-calls["EVENTDT"].dt.month.head()
-```
-
-Which day of the week the date is on:
-
-```{python}
-#| code-fold: false
-calls["EVENTDT"].dt.dayofweek.head()
-```
-
-Check the minimum values to see if there are any suspicious-looking dates from the 1970s:
-
-```{python}
-#| code-fold: false
-calls.sort_values("EVENTDT").head()
-```
-
-Doesn't look like it! We are good!
-
-
-We can also do many things with the `dt` accessor like switching time zones and converting time back to UNIX/POSIX time. Check out the documentation on [`.dt` accessor](https://pandas.pydata.org/docs/user_guide/basics.html#basics-dt-accessors) and [time series/date functionality](https://pandas.pydata.org/docs/user_guide/timeseries.html#).
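-
-For example, here is a minimal sketch of the time zone conversion mentioned above (assuming the `EVENTDT` timestamps were recorded in Pacific time):
-
-```{python}
-#| code-fold: false
-# attach a Pacific time zone to EVENTDT, then convert the timestamps to UTC
-calls["EVENTDT"].dt.tz_localize("US/Pacific").dt.tz_convert("UTC").head()
-```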
-
-## Faithfulness
-
-At this stage in our data cleaning and EDA workflow, we've achieved quite a lot: we've identified how our data is structured, come to terms with what information it encodes, and gained insight as to how it was generated. Throughout this process, we should always recall the original intent of our work in Data Science – to use data to better understand and model the real world. To achieve this goal, we need to ensure that the data we use is faithful to reality; that is, that our data accurately captures the "real world."
-
-Data used in research or industry is often "messy" – there may be errors or inaccuracies that impact the faithfulness of the dataset. Signs that data may not be faithful include:
-
-* Unrealistic or "incorrect" values, such as negative counts, locations that don't exist, or dates set in the future
-* Violations of obvious dependencies, like an age that does not match a birthday
-* Clear signs that data was entered by hand, which can lead to spelling errors or fields that are incorrectly shifted
-* Signs of data falsification, such as fake email addresses or repeated use of the same names
-* Duplicated records or fields containing the same information
-* Truncated data, e.g. older versions of Microsoft Excel limited spreadsheets to 65,536 rows and 256 columns
-
-We often solve some of these more common issues in the following ways (a small sketch follows the list):
-
-* Spelling errors: apply corrections or drop records that aren't in a dictionary
-* Time zone inconsistencies: convert to a common time zone (e.g. UTC)
-* Duplicated records or fields: identify and eliminate duplicates (using primary keys)
-* Unspecified or inconsistent units: infer the units and check that values are in reasonable ranges in the data
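-
-Here is a minimal sketch (on made-up records with a hypothetical `Cal ID` primary key) of two of these fixes: de-duplicating on a primary key and range-checking a field:
-
-```{python}
-#| code-fold: false
-records = pd.DataFrame({"Cal ID": [1, 1, 2, 3],
-                        "Age":    [20, 20, 21, -3]})
-
-# drop repeated primary keys, then keep only plausible ages
-deduped = records.drop_duplicates(subset="Cal ID")
-deduped[deduped["Age"].between(0, 120)]
-```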
-
-### Missing Values
-Another common issue encountered with real-world datasets is that of missing data. One strategy to resolve this is to simply drop any records with missing values from the dataset. This does, however, introduce the risk of inducing biases – it is possible that the missing or corrupt records may be systemically related to some feature of interest in the data. Another solution is to keep the data as `NaN` values.
-
-A third method to address missing data is to perform **imputation**: infer the missing values using other data available in the dataset. There is a wide variety of imputation techniques that can be implemented; some of the most common are listed below.
-
-* Average imputation: replace missing values with the average value for that field
-* Hot deck imputation: replace missing values with a value drawn at random from similar records
-* Regression imputation: develop a model to predict missing values
-* Multiple imputation: replace missing values with multiple random values
-
-Regardless of the strategy used to deal with missing data, we should think carefully about *why* particular records or fields may be missing – this can help inform whether or not the absence of these values is significant or meaningful.
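-
-As a minimal sketch (on a made-up `Series`, not a recommendation for any particular dataset), average imputation and interpolation can each be expressed in one line of `pandas`:
-
-```{python}
-#| code-fold: false
-vals = pd.Series([1.0, np.nan, 3.0, np.nan, 5.0])
-
-# average imputation: fill every gap with the overall mean
-# interpolation: fill each gap based on its neighboring values
-pd.DataFrame({"original": vals,
-              "mean imputed": vals.fillna(vals.mean()),
-              "interpolated": vals.interpolate()})
-```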
-
-# EDA Demo 1: Tuberculosis in the United States
-
-Now, let's walk through the data-cleaning and EDA workflow to see what can we learn about the presence of Tuberculosis in the United States!
-
-We will examine the data included in the [original CDC article](https://www.cdc.gov/mmwr/volumes/71/wr/mm7112a1.htm?s_cid=mm7112a1_w#T1_down) published in 2021.
-
-
-## CSVs and Field Names
-Suppose Table 1 was saved as a CSV file located in `data/cdc_tuberculosis.csv`.
-
-We can then explore the CSV (which is a text file, and does not contain binary-encoded data) in many ways:
-
-1. Using a text editor like emacs, vim, VSCode, etc.
-2. Opening the CSV directly in DataHub (read-only), Excel, Google Sheets, etc.
-3. The `Python` file object
-4. `pandas`, using `pd.read_csv()`
-
-To try out options 1 and 2, you can view or download the Tuberculosis dataset from the [lecture demo notebook](https://data100.datahub.berkeley.edu/hub/user-redirect/git-pull?repo=https%3A%2F%2Fgithub.com%2FDS-100%2Ffa23-student&urlpath=lab%2Ftree%2Ffa23-student%2Flecture%2Flec05%2Flec04-eda.ipynb&branch=main) under the `data` folder in the left-hand menu. Notice how the CSV file is a type of **rectangular data (i.e., tabular data) stored as comma-separated values**.
-
-Next, let's try out option 3 using the `Python` file object. We'll look at the first four lines:
-
-```{python}
-#| code-fold: true
-with open("data/cdc_tuberculosis.csv", "r") as f:
- i = 0
- for row in f:
- print(row)
- i += 1
- if i > 3:
- break
-```
-
-Whoa, why are there blank lines interspaced between the lines of the CSV?
-
-You may recall that all line breaks in text files are encoded as the special newline character `\n`. Python's `print()` prints each string (including the newline), and an additional newline on top of that.
-
-If you're curious, we can use the `repr()` function to return the raw string with all special characters:
-
-```{python}
-#| code-fold: true
-with open("data/cdc_tuberculosis.csv", "r") as f:
- i = 0
- for row in f:
- print(repr(row)) # print raw strings
- i += 1
- if i > 3:
- break
-```
-
-Finally, let's try option 4 and use the tried-and-true Data 100 approach: `pandas`.
-
-```{python}
-#| code-fold: false
-tb_df = pd.read_csv("data/cdc_tuberculosis.csv")
-tb_df.head()
-```
-
-You may notice some strange things about this table: what's up with the "Unnamed" column names and the first row?
-
-Congratulations — you're ready to wrangle your data! Because of how things are stored, we'll need to clean the data a bit to name our columns better.
-
-A reasonable first step is to identify the row with the right header. The `pd.read_csv()` function ([documentation](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html)) has the convenient `header` parameter that we can set to use the elements in row 1 as the appropriate columns:
-
-```{python}
-#| code-fold: false
-tb_df = pd.read_csv("data/cdc_tuberculosis.csv", header=1) # row index
-tb_df.head(5)
-```
-
-Wait...but now we can't differentiate between the "Number of TB cases" and "TB incidence" year columns. `pandas` has tried to make our lives easier by automatically adding ".1" to the latter columns, but this doesn't help us, as humans, understand the data.
-
-We can do this manually with `df.rename()` ([documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.rename.html?highlight=rename#pandas.DataFrame.rename)):
-
-```{python}
-#| code-fold: false
-rename_dict = {'2019': 'TB cases 2019',
- '2020': 'TB cases 2020',
- '2021': 'TB cases 2021',
- '2019.1': 'TB incidence 2019',
- '2020.1': 'TB incidence 2020',
- '2021.1': 'TB incidence 2021'}
-tb_df = tb_df.rename(columns=rename_dict)
-tb_df.head(5)
-```
-
-## Record Granularity
-
-You might already be wondering: what's up with that first record?
-
-Row 0 is what we call a **rollup record**, or summary record. It's often useful when displaying tables to humans. The **granularity** of record 0 (Totals) vs the rest of the records (States) is different.
-
-Okay, EDA step two. How was the rollup record aggregated?
-
-Let's check if Total TB cases is the sum of all state TB cases. If we sum over all rows, we should get **2x** the total cases in each of our TB cases by year (why do you think this is?).
-
-```{python}
-#| code-fold: true
-tb_df.sum(axis=0)
-```
-
-Whoa, what's going on with the TB cases in 2019, 2020, and 2021? Check out the column types:
-
-```{python}
-#| code-fold: true
-tb_df.dtypes
-```
-
-Since there are commas in the values for TB cases, the numbers are read as the `object` datatype, or **storage type** (close to the `Python` string datatype), so `pandas` is concatenating strings instead of adding integers (recall that `Python` can "sum", or concatenate, strings together: `"data" + "100"` evaluates to `"data100"`).
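-
-One manual fix (a quick sketch of the idea, though not the route we take below) would be to strip the commas ourselves and cast the strings to integers:
-
-```{python}
-#| code-fold: false
-# remove the thousands separators, then convert the strings to integers
-tb_df["TB cases 2019"].str.replace(",", "").astype(int).head()
-```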
-
-
-Fortunately `read_csv` also has a `thousands` parameter ([documentation](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html)):
-
-```{python}
-#| code-fold: false
-# improve readability: chaining method calls with outer parentheses/line breaks
-tb_df = (
- pd.read_csv("data/cdc_tuberculosis.csv", header=1, thousands=',')
- .rename(columns=rename_dict)
-)
-tb_df.head(5)
-```
-
-```{python}
-#| code-fold: false
-tb_df.sum()
-```
-
-The Total TB cases look right. Phew!
-
-Let's just look at the records with **state-level granularity**:
-
-```{python}
-#| code-fold: true
-state_tb_df = tb_df[1:]
-state_tb_df.head(5)
-```
-
-## Gather Census Data
-
-U.S. Census population estimates [source](https://www.census.gov/data/tables/time-series/demo/popest/2010s-state-total.html) (2019), [source](https://www.census.gov/data/tables/time-series/demo/popest/2020s-state-total.html) (2020-2021).
-
-Running the below cells cleans the data.
-There are a few new methods here:
-
-* `df.convert_dtypes()` ([documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.convert_dtypes.html)) conveniently converts each column to the best possible dtype (e.g., whole-number floats become integers) and is out of scope for the class.
-* `df.dropna()` ([documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.dropna.html)) will be explained in more detail next time.
-
-```{python}
-#| code-fold: true
-# 2010s census data
-census_2010s_df = pd.read_csv("data/nst-est2019-01.csv", header=3, thousands=",")
-census_2010s_df = (
- census_2010s_df
- .reset_index()
- .drop(columns=["index", "Census", "Estimates Base"])
- .rename(columns={"Unnamed: 0": "Geographic Area"})
- .convert_dtypes() # "smart" converting of columns, use at your own risk
- .dropna() # we'll introduce this next time
-)
-census_2010s_df['Geographic Area'] = census_2010s_df['Geographic Area'].str.strip('.')
-
-# with pd.option_context('display.min_rows', 30): # shows more rows
-# display(census_2010s_df)
-
-census_2010s_df.head(5)
-```
-
-Occasionally, you will want to modify code that you have imported. To reimport those modifications, you can either use `Python`'s `importlib` library:
-
-```python
-from importlib import reload
-reload(utils)
-```
-
-or use `iPython` magic which will intelligently import code when files change:
-
-```python
-%load_ext autoreload
-%autoreload 2
-```
-
-```{python}
-#| code-fold: true
-# census 2020s data
-census_2020s_df = pd.read_csv("data/NST-EST2022-POP.csv", header=3, thousands=",")
-census_2020s_df = (
- census_2020s_df
- .reset_index()
- .drop(columns=["index", "Unnamed: 1"])
- .rename(columns={"Unnamed: 0": "Geographic Area"})
- .convert_dtypes() # "smart" converting of columns, use at your own risk
- .dropna() # we'll introduce this next time
-)
-census_2020s_df['Geographic Area'] = census_2020s_df['Geographic Area'].str.strip('.')
-
-census_2020s_df.head(5)
-```
-
-## Joining Data (Merging `DataFrame`s)
-
-Time to `merge`! Here we use the `DataFrame` method `df1.merge(right=df2, ...)` on `DataFrame` `df1` ([documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.merge.html)). Contrast this with the function `pd.merge(left=df1, right=df2, ...)` ([documentation](https://pandas.pydata.org/docs/reference/api/pandas.merge.html?highlight=pandas%20merge#pandas.merge)). Feel free to use either.
-
-```{python}
-#| code-fold: false
-# merge TB DataFrame with two US census DataFrames
-tb_census_df = (
- tb_df
- .merge(right=census_2010s_df,
- left_on="U.S. jurisdiction", right_on="Geographic Area")
- .merge(right=census_2020s_df,
- left_on="U.S. jurisdiction", right_on="Geographic Area")
-)
-tb_census_df.head(5)
-```
-
-Having all of these columns is a little unwieldy. We could either drop the unneeded columns now, or just merge on smaller census `DataFrame`s. Let's do the latter.
-
-```{python}
-#| code-fold: false
-# try merging again, but cleaner this time
-tb_census_df = (
- tb_df
- .merge(right=census_2010s_df[["Geographic Area", "2019"]],
- left_on="U.S. jurisdiction", right_on="Geographic Area")
- .drop(columns="Geographic Area")
- .merge(right=census_2020s_df[["Geographic Area", "2020", "2021"]],
- left_on="U.S. jurisdiction", right_on="Geographic Area")
- .drop(columns="Geographic Area")
-)
-tb_census_df.head(5)
-```
-
-## Reproducing Data: Compute Incidence
-
-Let's recompute incidence to make sure we know where the original CDC numbers came from.
-
-From the [CDC report](https://www.cdc.gov/mmwr/volumes/71/wr/mm7112a1.htm?s_cid=mm7112a1_w#T1_down): TB incidence is computed as “Cases per 100,000 persons using mid-year population estimates from the U.S. Census Bureau.”
-
-If we define a group as 100,000 people, then we can compute the TB incidence for a given state population as
-
-$$\text{TB incidence} = \frac{\text{TB cases in population}}{\text{groups in population}} = \frac{\text{TB cases in population}}{\text{population}/100000} $$
-
-$$= \frac{\text{TB cases in population}}{\text{population}} \times 100000$$
-
-Let's try this for 2019:
-
-```{python}
-#| code-fold: false
-tb_census_df["recompute incidence 2019"] = tb_census_df["TB cases 2019"]/tb_census_df["2019"]*100000
-tb_census_df.head(5)
-```
-
-Awesome!!!
-
-Let's use a for-loop and `Python` format strings to compute TB incidence for all years. `Python` f-strings are just used for the purposes of this demo, but they're handy to know when you explore data beyond this course ([documentation](https://docs.python.org/3/tutorial/inputoutput.html)).
-
-```{python}
-#| code-fold: false
-# recompute incidence for all years
-for year in [2019, 2020, 2021]:
- tb_census_df[f"recompute incidence {year}"] = tb_census_df[f"TB cases {year}"]/tb_census_df[f"{year}"]*100000
-tb_census_df.head(5)
-```
-
-These numbers look pretty close!!! There are a few errors in the hundredths place, particularly in 2021. It may be useful to further explore reasons behind this discrepancy.
-
-```{python}
-#| code-fold: false
-tb_census_df.describe()
-```
-
-## Bonus EDA: Reproducing the Reported Statistic
-
-
-**How do we reproduce that reported statistic in the original [CDC report](https://www.cdc.gov/mmwr/volumes/71/wr/mm7112a1.htm?s_cid=mm7112a1_w)?**
-
-> Reported TB incidence (cases per 100,000 persons) increased **9.4%**, from **2.2** during 2020 to **2.4** during 2021 but was lower than incidence during 2019 (2.7). Increases occurred among both U.S.-born and non–U.S.-born persons.
-
-This is TB incidence computed across the entire U.S. population! How do we reproduce this?
-* We need to reproduce the "Total" TB incidences in our rolled record.
-* But our current `tb_census_df` only has 51 entries (50 states plus Washington, D.C.). There is no rolled record.
-* What happened...?
-
-Let's get exploring!
-
-Before we keep exploring, we'll set all indexes to more meaningful values, instead of just numbers that pertain to some row at some point. This will make our cleaning slightly easier.
-
-```{python}
-#| code-fold: true
-tb_df = tb_df.set_index("U.S. jurisdiction")
-tb_df.head(5)
-```
-
-```{python}
-#| code-fold: false
-census_2010s_df = census_2010s_df.set_index("Geographic Area")
-census_2010s_df.head(5)
-```
-
-```{python}
-#| code-fold: false
-census_2020s_df = census_2020s_df.set_index("Geographic Area")
-census_2020s_df.head(5)
-```
-
-It turns out that our merge above only kept state records, even though our original `tb_df` had the "Total" rolled record:
-
-```{python}
-#| code-fold: false
-tb_df.head()
-```
-
-Recall that `merge` performs an **inner** merge by default, meaning that it only preserves keys that are present in **both** `DataFrame`s.
-
-The rolled records in our census `DataFrame` have different `Geographic Area` fields, which was the key we merged on:
-
-```{python}
-#| code-fold: false
-census_2010s_df.head(5)
-```
-
-The Census `DataFrame` has several rolled records. The aggregate record we are looking for actually has the Geographic Area named "United States".
-
-One straightforward way to get the right merge is to rename the value itself. Because we now have the Geographic Area index, we'll use `df.rename()` ([documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.rename.html)):
-
-```{python}
-#| code-fold: false
-# rename rolled record for 2010s
-census_2010s_df.rename(index={'United States':'Total'}, inplace=True)
-census_2010s_df.head(5)
-```
-
-```{python}
-#| code-fold: false
-# same, but for 2020s rename rolled record
-census_2020s_df.rename(index={'United States':'Total'}, inplace=True)
-census_2020s_df.head(5)
-```
-
-<br/>
-
-Next let's rerun our merge. Note the different chaining, because we are now merging on indexes (`df.merge()` [documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.merge.html)).
-
-```{python}
-#| code-fold: false
-tb_census_df = (
- tb_df
- .merge(right=census_2010s_df[["2019"]],
- left_index=True, right_index=True)
- .merge(right=census_2020s_df[["2020", "2021"]],
- left_index=True, right_index=True)
-)
-tb_census_df.head(5)
-```
-
-<br/>
-
-Finally, let's recompute our incidences:
-
-```{python}
-#| code-fold: false
-# recompute incidence for all years
-for year in [2019, 2020, 2021]:
- tb_census_df[f"recompute incidence {year}"] = tb_census_df[f"TB cases {year}"]/tb_census_df[f"{year}"]*100000
-tb_census_df.head(5)
-```
-
-We reproduced the total U.S. incidences correctly!
-
-We're almost there. Let's revisit the quote:
-
-> Reported TB incidence (cases per 100,000 persons) increased **9.4%**, from **2.2** during 2020 to **2.4** during 2021 but was lower than incidence during 2019 (2.7). Increases occurred among both U.S.-born and non–U.S.-born persons.
-
-Recall that percent change from $A$ to $B$ is computed as
-$\text{percent change} = \frac{B - A}{A} \times 100$.
-
-```{python}
-#| code-fold: false
-#| tags: []
-incidence_2020 = tb_census_df.loc['Total', 'recompute incidence 2020']
-incidence_2020
-```
-
-```{python}
-#| code-fold: false
-#| tags: []
-incidence_2021 = tb_census_df.loc['Total', 'recompute incidence 2021']
-incidence_2021
-```
-
-```{python}
-#| code-fold: false
-#| tags: []
-difference = (incidence_2021 - incidence_2020)/incidence_2020 * 100
-difference
-```
-
-# EDA Demo 2: Mauna Loa CO<sub>2</sub> Data -- A Lesson in Data Faithfulness
-
-[Mauna Loa Observatory](https://gml.noaa.gov/ccgg/trends/data.html) has been monitoring CO<sub>2</sub> concentrations since 1958.
-
-```{python}
-#| code-fold: false
-co2_file = "data/co2_mm_mlo.txt"
-```
-
-Let's do some **EDA**!!
-
-## Reading this file into Pandas?
-Let's instead check out this `.txt` file. Some questions to keep in mind: Do we trust this file extension? What structure does the file have?
-
-Lines 71-78 (inclusive) are shown below:
-
- line number | file contents
-
- 71 | # decimal average interpolated trend #days
- 72 | # date (season corr)
- 73 | 1958 3 1958.208 315.71 315.71 314.62 -1
- 74 | 1958 4 1958.292 317.45 317.45 315.29 -1
- 75 | 1958 5 1958.375 317.50 317.50 314.71 -1
- 76 | 1958 6 1958.458 -99.99 317.10 314.85 -1
- 77 | 1958 7 1958.542 315.86 315.86 314.98 -1
- 78 | 1958 8 1958.625 314.93 314.93 315.94 -1
-
-
-Notice how:
-
-- The values are separated by white space, possibly tabs.
-- The values line up down the rows. For example, the month appears in the 7th to 8th position of each line.
-- The 71st and 72nd lines in the file contain column headings split over two lines.
-
-We can use `read_csv` to read the data into a `pandas` `DataFrame`, and we provide several arguments to specify that the separators are white space, there is no header (**we will set our own column names**), and to skip the first 72 rows of the file.
-
-```{python}
-#| code-fold: false
-co2 = pd.read_csv(
- co2_file, header = None, skiprows = 72,
-    sep = r'\s+' # delimiter for continuous whitespace (stay tuned for regex next lecture)
-)
-co2.head()
-```
-
-Congratulations! You've wrangled the data!
-
-<br/>
-
-...But our columns aren't named.
-**We need to do more EDA.**
-
-## Exploring Variable Feature Types
-
-The NOAA [webpage](https://gml.noaa.gov/ccgg/trends/) might have some useful tidbits (in this case it doesn't).
-
-Using this information, we'll rerun `pd.read_csv`, but this time with some **custom column names.**
-
-```{python}
-#| code-fold: false
-co2 = pd.read_csv(
- co2_file, header = None, skiprows = 72,
-    sep = r'\s+', # regex for continuous whitespace (next lecture)
- names = ['Yr', 'Mo', 'DecDate', 'Avg', 'Int', 'Trend', 'Days']
-)
-co2.head()
-```
-
-## Visualizing CO<sub>2</sub>
-Scientific studies tend to have very clean data, right...? Let's jump right in and make a time series plot of CO2 monthly averages.
-
-```{python}
-#| code-fold: true
-sns.lineplot(x='DecDate', y='Avg', data=co2);
-```
-
-The code above uses the `seaborn` plotting library (abbreviated `sns`). We will cover this in the Visualization lecture, but for now you don't need to worry about how it works!
-
-Yikes! Plotting the data uncovered a problem. The sharp vertical lines suggest that we have some **missing values**. What happened here?
-
-```{python}
-#| code-fold: false
-co2.head()
-```
-
-```{python}
-#| code-fold: false
-co2.tail()
-```
-
-Some data have unusual values like -1 and -99.99.
-
-Let's check the description at the top of the file again.
-
-* -1 signifies a missing value for the number of days `Days` the equipment was in operation that month.
-* -99.99 denotes a missing monthly average `Avg`
-
-How can we fix this? First, let's explore other aspects of our data. Understanding our data will help us decide what to do with the missing values.
-
-<br/>
-
-
-## Sanity Checks: Reasoning about the data
-First, we consider the shape of the data. How many rows should we have?
-
-* If the data is in chronological order, we should have one record per month.
-* The data spans March 1958 to August 2019.
-* That is 62 years of months, minus January and February of 1958 and September through December of 2019, so we should have $ 12 \times (2019-1957) - 2 - 4 = 738 $ records.
-
-```{python}
-#| code-fold: false
-co2.shape
-```
-
-Nice!! The number of rows (i.e. records) matches our expectations.
-
-<br/>
-
-
-Let's now check the quality of each feature.
-
-## Understanding Missing Value 1: `Days`
-`Days` is a time field, so let's analyze other time fields to see if there is an explanation for missing values of days of operation.
-
-Let's start with **months**, `Mo`.
-
-Are we missing any records? Each month should appear 61 or 62 times (March 1958-August 2019).
-
-```{python}
-#| code-fold: false
-co2["Mo"].value_counts().sort_index()
-```
-
-As expected, Jan, Feb, Sep, Oct, Nov, and Dec have 61 occurrences, and the rest have 62.
-
-<br/>
-
-Next let's explore **days** `Days` itself, which is the number of days that the measurement equipment worked.
-
-```{python}
-#| code-fold: true
-sns.displot(co2['Days']);
-plt.title("Distribution of days feature"); # suppresses unneeded plotting output
-```
-
-In terms of data quality, a handful of months have averages based on measurements taken on fewer than half the days. In addition, there are nearly 200 missing values--**that's about 27% of the data**!
-
-<br/>
-
-Finally, let's check the last time feature, **year** `Yr`.
-
-Let's check to see if there is any connection between missing-ness and the year of the recording.
-
-```{python}
-#| code-fold: true
-sns.scatterplot(x="Yr", y="Days", data=co2);
-plt.title("Day field by Year"); # the ; suppresses output
-```
-
-**Observations**:
-
-* All of the missing data are in the early years of operation.
-* It appears there may have been problems with equipment in the mid to late 80s.
-
-**Potential Next Steps**:
-
-* Confirm these explanations through documentation about the historical readings.
-* Maybe drop the earliest recordings? However, we would want to delay such action until after we have examined the time trends and assessed whether there are any potential problems.
-
-<br/>
-
-## Understanding Missing Value 2: `Avg`
-Next, let's return to the -99.99 values in `Avg` to analyze the overall quality of the CO2 measurements. We'll plot a histogram of the average CO<sub>2</sub> measurements.
-
-```{python}
-#| code-fold: true
-# Histograms of average CO2 measurements
-sns.displot(co2['Avg']);
-```
-
-The non-missing values are in the 300-400 range (a regular range of CO2 levels).
-
-We also see that there are only a few missing `Avg` values (**<1% of values**). Let's examine all of them:
-
-```{python}
-#| code-fold: false
-co2[co2["Avg"] < 0]
-```
-
-There doesn't seem to be a pattern to these values, other than that most records also were missing `Days` data.
-
-## Drop, `NaN`, or Impute Missing `Avg` Data?
-
-How should we address the invalid `Avg` data?
-
-1. Drop records
-2. Set to NaN
-3. Impute using some strategy
-
-Remember we want to fix the following plot:
-
-```{python}
-#| code-fold: true
-sns.lineplot(x='DecDate', y='Avg', data=co2)
-plt.title("CO2 Average By Month");
-```
-
-Since we are plotting `Avg` vs `DecDate`, we should just focus on dealing with missing values for `Avg`.
-
-
-Let's consider a few options:
-
-1. Drop those records
-2. Replace -99.99 with NaN
-3. Substitute a likely value for the average CO2
-
-What do you think are the pros and cons of each possible action?
-
-<br/>
-
-
-Let's examine each of these three options.
-
-```{python}
-#| code-fold: false
-# 1. Drop missing values
-co2_drop = co2[co2['Avg'] > 0]
-co2_drop.head()
-```
-
-```{python}
-#| code-fold: false
-# 2. Replace -99.99 with NaN
-co2_NA = co2.replace(-99.99, np.nan)
-co2_NA.head()
-```
-
-We'll also use a third version of the data.
-
-First, we note that the dataset already comes with a **substitute value** for the -99.99.
-
-From the file description:
-
-> The `interpolated` column includes average values from the preceding column (`average`)
-and **interpolated values** where data are missing. Interpolated values are
-computed in two steps...
-
-The `Int` feature has values that exactly match those in `Avg`, except when `Avg` is -99.99, and then a **reasonable** estimate is used instead.
-
-So, the third version of our data will use the `Int` feature instead of `Avg`.
-
-```{python}
-#| code-fold: false
-# 3. Use interpolated column which estimates missing Avg values
-co2_impute = co2.copy()
-co2_impute['Avg'] = co2['Int']
-co2_impute.head()
-```
-
-What's a **reasonable** estimate?
-
-To answer this question, let's zoom in on a short time period, say the measurements in 1958 (where we know we have two missing values).
-
-```{python}
-#| code-fold: true
-# results of plotting data in 1958
-
-def line_and_points(data, ax, title):
- # assumes single year, hence Mo
- ax.plot('Mo', 'Avg', data=data)
- ax.scatter('Mo', 'Avg', data=data)
- ax.set_xlim(2, 13)
- ax.set_title(title)
- ax.set_xticks(np.arange(3, 13))
-
-def data_year(data, year):
-    # filter the data down to a single year
-    return data[data["Yr"] == year]
-
-# uses matplotlib subplots
-# you may see more next week; focus on output for now
-fig, axes = plt.subplots(ncols = 3, figsize=(12, 4), sharey=True)
-
-year = 1958
-line_and_points(data_year(co2_drop, year), axes[0], title="1. Drop Missing")
-line_and_points(data_year(co2_NA, year), axes[1], title="2. Missing Set to NaN")
-line_and_points(data_year(co2_impute, year), axes[2], title="3. Missing Interpolated")
-
-fig.suptitle(f"Monthly Averages for {year}")
-plt.tight_layout()
-```
-
-In the big picture, since only 7 `Avg` values are missing (**<1%** of 738 months), any of these approaches would work.
-
-However, there is some appeal to **option 3: imputing**:
-
-* It preserves the seasonal trends in CO2.
-* Since we are plotting all months in our data as a line plot, imputation avoids breaks in the line.
-
-<br/>
-
-
-Let's replot our original figure with option 3:
-
-```{python}
-#| code-fold: true
-sns.lineplot(x='DecDate', y='Avg', data=co2_impute)
-plt.title("CO2 Average By Month, Imputed");
-```
-
-Looks pretty close to what we see on the NOAA [website](https://gml.noaa.gov/ccgg/trends/)!
-
-## Presenting the data: A Discussion on Data Granularity
-
-From the description:
-
-* Monthly measurements are averages of daily average measurements.
-* The NOAA GML website has datasets for daily/hourly measurements too.
-
-The data you present depends on your research question.
-
-**How do CO2 levels vary by season?**
-
-* You might want to keep average monthly data.
-
-**Are CO2 levels rising over the past 50+ years, consistent with global warming predictions?**
-
-* You might be happier with a **coarser granularity** of average year data!
-
-```{python}
-#| code-fold: true
-co2_year = co2_impute.groupby('Yr').mean()
-sns.lineplot(x='Yr', y='Avg', data=co2_year)
-plt.title("CO2 Average By Year");
-```
-
-Indeed, we see a rise by nearly 100 ppm of CO2 since Mauna Loa began recording in 1958.
-
-# Summary
-We went over a lot of content this lecture; let's summarize the most important points:
-
-## Dealing with Missing Values
-There are a few options we can take to deal with missing data:
-
-* Drop missing records
-* Keep `NaN` missing values
-* Impute using an interpolated column
-
-## EDA and Data Wrangling
-There are several ways to approach EDA and Data Wrangling:
-
-* Examine the **data and metadata**: what is the date, size, organization, and structure of the data?
-* Examine each **field/attribute/dimension** individually.
-* Examine pairs of related dimensions (e.g. breaking down grades by major).
-* Along the way, we can:
- * **Visualize** or summarize the data.
- * **Validate assumptions** about data and its collection process. Pay particular attention to when the data was collected.
- * Identify and **address anomalies**.
- * Apply data transformations and corrections (we'll cover this in the upcoming lecture).
- * **Record everything you do!** Developing in Jupyter Notebook promotes *reproducibility* of your own work!
+---
+title: Data Cleaning and EDA
+execute:
+ echo: true
+format:
+ html:
+ code-fold: true
+ code-tools: true
+ toc: true
+ toc-title: Data Cleaning and EDA
+ page-layout: full
+ theme:
+ - cosmo
+ - cerulean
+ callout-icon: false
+jupyter: python3
+---
+
+```{python}
+#| code-fold: true
+import numpy as np
+import pandas as pd
+
+import matplotlib.pyplot as plt
+import seaborn as sns
+#%matplotlib inline
+plt.rcParams['figure.figsize'] = (12, 9)
+
+sns.set()
+sns.set_context('talk')
+np.set_printoptions(threshold=20, precision=2, suppress=True)
+pd.set_option('display.max_rows', 30)
+pd.set_option('display.max_columns', None)
+pd.set_option('display.precision', 2)
+# This option stops scientific notation for pandas
+pd.set_option('display.float_format', '{:.2f}'.format)
+
+# Silence some spurious seaborn warnings
+import warnings
+warnings.filterwarnings("ignore", category=FutureWarning)
+```
+
+::: {.callout-note collapse="false"}
+## Learning Outcomes
+* Recognize common file formats
+* Categorize data by its variable type
+* Build awareness of issues with data faithfulness and develop targeted solutions
+:::
+
+**This content is covered in lectures 4, 5, and 6.**
+
+In the past few lectures, we've learned that `pandas` is a toolkit to restructure, modify, and explore a dataset. What we haven't yet touched on is *how* to make these data transformation decisions. When we receive a new set of data from the "real world," how do we know what processing we should do to convert this data into a usable form?
+
+**Data cleaning**, also called **data wrangling**, is the process of transforming raw data to facilitate subsequent analysis. It is often used to address issues like:
+
+* Unclear structure or formatting
+* Missing or corrupted values
+* Unit conversions
+* ...and so on
+
+**Exploratory Data Analysis (EDA)** is the process of understanding a new dataset. It is an open-ended, informal analysis that involves familiarizing ourselves with the variables present in the data, discovering potential hypotheses, and identifying possible issues with the data. This last point can often motivate further data cleaning to address any problems with the dataset's format; because of this, EDA and data cleaning are often thought of as an "infinite loop," with each process driving the other.
+
+In this lecture, we will consider the key properties of data to consider when performing data cleaning and EDA. In doing so, we'll develop a "checklist" of sorts for you to consider when approaching a new dataset. Throughout this process, we'll build a deeper understanding of this early (but very important!) stage of the data science lifecycle.
+
+## Structure
+
+### File Formats
+There are many file types for storing structured data: TSV, JSON, XML, ASCII, SAS, etc. We'll only cover CSV, TSV, and JSON in lecture, but you'll likely encounter other formats as you work with different datasets. Reading documentation is your best bet for understanding how to process the multitude of different file types.
+
+#### CSV
+CSVs, which stand for **Comma-Separated Values**, are a common tabular data format.
+In the past two `pandas` lectures, we briefly touched on the idea of file format: the way data is encoded in a file for storage. Specifically, our `elections` and `babynames` datasets were stored and loaded as CSVs:
+
+```{python}
+#| code-fold: false
+pd.read_csv("data/elections.csv").head(5)
+```
+
+To better understand the properties of a CSV, let's take a look at the first few rows of the raw data file to see what it looks like before being loaded into a `DataFrame`. We'll use the `repr()` function to return the raw string with its special characters:
+
+```{python}
+#| code-fold: false
+with open("data/elections.csv", "r") as table:
+ i = 0
+ for row in table:
+ print(repr(row))
+ i += 1
+ if i > 3:
+ break
+```
+
+Each row, or **record**, in the data is delimited by a newline `\n`. Each column, or **field**, in the data is delimited by a comma `,` (hence, comma-separated!).
+
+#### TSV
+
+Another common file type is **TSV (Tab-Separated Values)**. In a TSV, records are still delimited by a newline `\n`, while fields are delimited by `\t` tab character.
+
+Let's check out the first few rows of the raw TSV file. Again, we'll use the `repr()` function so that `print` shows the special characters.
+
+```{python}
+#| code-fold: false
+with open("data/elections.txt", "r") as table:
+ i = 0
+ for row in table:
+ print(repr(row))
+ i += 1
+ if i > 3:
+ break
+```
+
+TSVs can be loaded into `pandas` using `pd.read_csv`. We'll need to specify the **delimiter** with the parameter `sep='\t'` [(documentation)](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html).
+
+```{python}
+#| code-fold: false
+pd.read_csv("data/elections.txt", sep='\t').head(3)
+```
+
+An issue with CSVs and TSVs comes up whenever there are commas or tabs within the records. How does `pandas` differentiate between a comma delimiter vs. a comma within the field itself, for example `8,900`? To remedy this, check out the [`quotechar` parameter](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html).
+
+#### JSON
+**JSON (JavaScript Object Notation)** files behave similarly to Python dictionaries. A raw JSON is shown below.
+
+```{python}
+#| code-fold: false
+with open("data/elections.json", "r") as table:
+ i = 0
+ for row in table:
+ print(row)
+ i += 1
+ if i > 8:
+ break
+```
+
+JSON files can be loaded into `pandas` using `pd.read_json`.
+
+```{python}
+#| code-fold: false
+pd.read_json('data/elections.json').head(3)
+```
+
+##### EDA with JSON: Berkeley COVID-19 Data
+The City of Berkeley Open Data [website](https://data.cityofberkeley.info/Health/COVID-19-Confirmed-Cases/xn6j-b766) has a dataset with COVID-19 Confirmed Cases among Berkeley residents by date. Let's download the file and save it as a JSON (note the source URL file type is also a JSON). In the interest of reproducible data science, we will download the data programmatically. We have defined some helper functions in the [`ds100_utils.py`](https://ds100.org/fa23/resources/assets/lectures/lec05/lec05-eda.html) file so that we can reuse them in many different notebooks.
+
+```{python}
+#| code-fold: false
+from ds100_utils import fetch_and_cache
+
+covid_file = fetch_and_cache(
+ "https://data.cityofberkeley.info/api/views/xn6j-b766/rows.json?accessType=DOWNLOAD",
+ "confirmed-cases.json",
+ force=False)
+covid_file # a file path wrapper object
+```
+
+###### File Size
+Let's start our analysis by getting a rough estimate of the size of the dataset to inform the tools we use to view the data. For relatively small datasets, we can use a text editor or spreadsheet. For larger datasets, more programmatic exploration or distributed computing tools may be more fitting. Here we will use `Python` tools to probe the file.
+
+Since this appears to be a text file, let's investigate the number of lines, which often corresponds to the number of records.
+
+```{python}
+#| code-fold: false
+import os
+
+print(covid_file, "is", os.path.getsize(covid_file) / 1e6, "MB")
+
+with open(covid_file, "r") as f:
+ print(covid_file, "is", sum(1 for l in f), "lines.")
+```
+
+###### Unix Commands
+As part of the EDA workflow, Unix commands can come in very handy. In fact, there's an entire book called ["Data Science at the Command Line"](https://datascienceatthecommandline.com/) that explores this idea in depth!
+In Jupyter/IPython, you can prefix lines with `!` to execute arbitrary Unix commands, and within those lines, you can refer to `Python` variables and expressions with the syntax `{expr}`.
+
+Here, we use the `ls` command to list files, using the `-lh` flags, which request "long format with information in human-readable form." We also use the `wc` command for "word count," but with the `-l` flag, which asks for line counts instead of words.
+
+These two give us the same information as the code above, albeit in a slightly different form:
+
+```{python}
+#| code-fold: false
+!ls -lh {covid_file}
+!wc -l {covid_file}
+```
+
+###### File Contents
+Let's explore the data format using `Python`.
+
+```{python}
+#| code-fold: false
+with open(covid_file, "r") as f:
+ for i, row in enumerate(f):
+ print(repr(row)) # print raw strings
+ if i >= 4: break
+```
+
+We can use the `head` Unix command (which is where `pandas`' `head` method comes from!) to see the first few lines of the file:
+
+```{python}
+#| code-fold: false
+!head -5 {covid_file}
+```
+
+Before loading the JSON file into `pandas`, let's first do some EDA with `Python`'s `json` package to understand the particular structure of this JSON file so that we can decide what (if anything) to load into `pandas`. `Python` has relatively good support for JSON data since it closely matches the internal `Python` object model. In the following cell, we import the entire JSON datafile into a `Python` dictionary using the `json` package.
+
+```{python}
+#| code-fold: false
+import json
+
+with open(covid_file, "rb") as f:
+ covid_json = json.load(f)
+```
+
+The `covid_json` variable is now a dictionary encoding the data in the file:
+
+```{python}
+#| code-fold: false
+type(covid_json)
+```
+
+We can examine what keys are in the top level json object by listing out the keys.
+
+```{python}
+#| code-fold: false
+covid_json.keys()
+```
+
+**Observation**: The JSON dictionary contains a `meta` key, which likely refers to metadata (data about the data). Metadata is often maintained alongside the data and can be a good source of additional information.
+
+
+We can investigate the metadata further by examining its keys.
+
+```{python}
+#| code-fold: false
+covid_json['meta'].keys()
+```
+
+The `meta` key contains another dictionary called `view`. This likely refers to meta-data about a particular "view" of some underlying database. We will learn more about views when we study SQL later in the class.
+
+```{python}
+#| code-fold: false
+covid_json['meta']['view'].keys()
+```
+
+Notice that this is a nested/recursive data structure. As we dig deeper, we reveal more and more keys and the corresponding data:
+
+```
+meta
+|-> data
+ | ... (haven't explored yet)
+|-> view
+ | -> id
+ | -> name
+ | -> attribution
+ ...
+ | -> description
+ ...
+ | -> columns
+ ...
+```
+
+
+There is a key called `description` in the `view` sub-dictionary. This likely contains a description of the data:
+
+```{python}
+#| code-fold: false
+print(covid_json['meta']['view']['description'])
+```
+
+###### Examining the Data Field for Records
+
+We can look at a few entries in the `data` field. This is what we'll load into `pandas`.
+
+```{python}
+#| code-fold: false
+for i in range(3):
+ print(f"{i:03} | {covid_json['data'][i]}")
+```
+
+Observations:
+
+* These look like equal-length records, so maybe `data` is a table!
+* But what does each value in the record mean? Where can we find the column headers?
+
+For that, we'll need the `columns` key in the metadata dictionary. This returns a list:
+
+```{python}
+#| code-fold: false
+type(covid_json['meta']['view']['columns'])
+```
+
+###### Summary of exploring the JSON file
+
+1. The above **metadata** tells us a lot about the columns in the data including column names, potential data anomalies, and a basic statistic.
+1. Because of its non-tabular structure, JSON makes it easier (than CSV) to create **self-documenting data**, meaning that information about the data is stored in the same file as the data.
+1. Self-documenting data can be helpful since it maintains its own description and these descriptions are more likely to be updated as data changes.
+
+###### Loading COVID Data into `pandas`
+Finally, let's load the data (not the metadata) into a `pandas` `DataFrame`. In the following block of code we:
+
+1. Translate the JSON records into a `DataFrame`:
+
+ * fields: `covid_json['meta']['view']['columns']`
+ * records: `covid_json['data']`
+
+
+1. Remove columns that have no metadata description. This would be a bad idea in general, but here we remove these columns since the above analysis suggests they are unlikely to contain useful information.
+
+1. Examine the `tail` of the table.
+
+```{python}
+#| code-fold: false
+# Load the data from JSON and assign column titles
+covid = pd.DataFrame(
+ covid_json['data'],
+ columns=[c['name'] for c in covid_json['meta']['view']['columns']])
+
+covid.tail()
+```
+
+### Variable Types
+
+After loading data from a file, it's a good idea to take the time to understand what pieces of information are encoded in the dataset. In particular, we want to identify what variable types are present in our data. Broadly speaking, we can categorize variables into one of two overarching types.
+
+**Quantitative variables** describe some numeric quantity or amount. We can divide quantitative data further into:
+
+* **Continuous quantitative variables**: numeric data that can be measured on a continuous scale to arbitrary precision. Continuous variables do not have a strict set of possible values – they can be recorded to any number of decimal places. For example, weights, GPA, or CO<sub>2</sub> concentrations.
+* **Discrete quantitative variables**: numeric data that can only take on a finite set of possible values. For example, someone's age or the number of siblings they have.
+
+**Qualitative variables**, also known as **categorical variables**, describe data that isn't measuring some quantity or amount. The sub-categories of categorical data are:
+
+* **Ordinal qualitative variables**: categories with ordered levels. Specifically, ordinal variables are those where the difference between levels has no consistent, quantifiable meaning. Some examples include levels of education (high school, undergrad, grad, etc.), income bracket (low, medium, high), or Yelp rating.
+* **Nominal qualitative variables**: categories with no specific order. For example, someone's political affiliation or Cal ID number.
+
+![Classification of variable types](images/variable.png)
+
+Note that many variables don't sit neatly in just one of these categories. Qualitative variables could have numeric levels, and conversely, quantitative variables could be stored as strings.
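+
+As a small illustration of how these distinctions can show up in code, `pandas` lets us mark a qualitative column as ordered (ordinal) or unordered (nominal) with the categorical dtype. This is just a sketch with made-up values:
+
+```python
+import pandas as pd
+
+# Ordinal: the levels have an order, but differences between levels aren't quantifiable
+income = pd.Categorical(["low", "high", "medium"],
+                        categories=["low", "medium", "high"], ordered=True)
+
+# Nominal: no inherent order among the levels
+party = pd.Categorical(["Democratic", "Republican", "Independent"])
+
+income < "high"  # comparisons are only meaningful for ordered categoricals
+```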
+
+### Primary and Foreign Keys
+
+Last time, we introduced `.merge` as the `pandas` method for joining multiple `DataFrame`s together. In our discussion of joins, we touched on the idea of using a "key" to determine what rows should be merged from each table. Let's take a moment to examine this idea more closely.
+
+The **primary key** is the column or set of columns in a table that *uniquely* determine the values of the remaining columns. It can be thought of as the unique identifier for each individual row in the table. For example, a table of Data 100 students might use each student's Cal ID as the primary key.
+
+```{python}
+#| echo: false
+pd.DataFrame({"Cal ID":[3034619471, 3035619472, 3025619473, 3046789372], \
+ "Name":["Oski", "Ollie", "Orrie", "Ollie"], \
+ "Major":["Data Science", "Computer Science", "Data Science", "Economics"]})
+```
+
+The **foreign key** is the column or set of columns in a table that reference primary keys in other tables. Knowing a dataset's foreign keys can be useful when assigning the `left_on` and `right_on` parameters of `.merge`. In the table of office hour tickets below, `"Cal ID"` is a foreign key referencing the previous table.
+
+```{python}
+#| echo: false
+pd.DataFrame({"OH Request":[1, 2, 3, 4], \
+ "Cal ID":[3034619471, 3035619472, 3025619473, 3035619472], \
+ "Question":["HW 2 Q1", "HW 2 Q3", "Lab 3 Q4", "HW 2 Q7"]})
+```
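+
+To make the connection to `.merge` concrete, here is a sketch that recreates the two example tables above (with hypothetical names `students` and `tickets`) and joins them on the foreign key:
+
+```python
+import pandas as pd
+
+students = pd.DataFrame({"Cal ID": [3034619471, 3035619472, 3025619473, 3046789372],
+                         "Name": ["Oski", "Ollie", "Orrie", "Ollie"],
+                         "Major": ["Data Science", "Computer Science", "Data Science", "Economics"]})
+tickets = pd.DataFrame({"OH Request": [1, 2, 3, 4],
+                        "Cal ID": [3034619471, 3035619472, 3025619473, 3035619472],
+                        "Question": ["HW 2 Q1", "HW 2 Q3", "Lab 3 Q4", "HW 2 Q7"]})
+
+# "Cal ID" is the primary key of students and a foreign key in tickets
+tickets.merge(students, left_on="Cal ID", right_on="Cal ID")
+```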
+
+## Granularity, Scope, and Temporality
+
+After understanding the structure of the dataset, the next task is to determine what exactly the data represents. We'll do so by considering the data's granularity, scope, and temporality.
+
+### Granularity
+The **granularity** of a dataset is what a single row represents. You can also think of it as the level of detail included in the data. To determine the data's granularity, ask: what does each row in the dataset represent? Fine-grained data contains a high level of detail, with a single row representing a small individual unit. For example, each record may represent one person. Coarse-grained data is encoded such that a single row represents a large individual unit – for example, each record may represent a group of people.
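+
+For instance, fine-grained one-row-per-person data can be coarsened into one-row-per-group data with an aggregation. A sketch with a hypothetical `people` table:
+
+```python
+import pandas as pd
+
+# Fine granularity: one row per person
+people = pd.DataFrame({"city": ["Berkeley", "Berkeley", "Oakland"],
+                       "age": [20, 35, 41]})
+
+# Coarse granularity: one row per city
+people.groupby("city").agg(residents=("age", "size"), mean_age=("age", "mean"))
+```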
+
+### Scope
+The **scope** of a dataset is the subset of the population covered by the data. If we were investigating student performance in Data Science courses, a dataset with a narrow scope might encompass all students enrolled in Data 100 whereas a dataset with an expansive scope might encompass all students in California.
+
+### Temporality
+The **temporality** of a dataset describes the periodicity over which the data was collected as well as when the data was most recently collected or updated.
+
+Time and date fields of a dataset could represent a few things:
+
+1. when the "event" happened
+2. when the data was collected, or when it was entered into the system
+3. when the data was copied into the database
+
+To fully understand the temporality of the data, it may also be necessary to standardize time zones or inspect recurring time-based trends in the data (do patterns recur in 24-hour periods? Over the course of a month? Seasonally?). The convention for standardizing time is Coordinated Universal Time (UTC), an international time standard measured at 0 degrees longitude that stays consistent throughout the year (no daylight saving time). Berkeley observes Pacific Time, which is UTC-8 during standard time (PST) and UTC-7 during daylight saving time (PDT).
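+
+For example, a UTC timestamp can be re-expressed in Berkeley's local time with `pandas`' time zone tools (a minimal sketch):
+
+```python
+import pandas as pd
+
+ts_utc = pd.Timestamp("2024-01-15 12:00", tz="UTC")
+ts_utc.tz_convert("US/Pacific")  # same instant, shown in Pacific time (UTC-8 in January)
+```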
+
+#### Temporality with `pandas`' `dt` accessors
+Let's briefly look at how we can use `pandas`' `dt` accessors to work with dates/times in a dataset using the dataset you'll see in Lab 3: the Berkeley PD Calls for Service dataset.
+
+```{python}
+#| code-fold: true
+calls = pd.read_csv("data/Berkeley_PD_-_Calls_for_Service.csv")
+calls.head()
+```
+
+Looks like there are three columns with dates/times: `EVENTDT`, `EVENTTM`, and `InDbDate`.
+
+Most likely, `EVENTDT` stands for the date when the event took place, `EVENTTM` stands for the time of day the event took place (in 24-hour format), and `InDbDate` is the date this call was recorded in the database.
+
+If we check the data type of these columns, we will see they are stored as strings. We can convert them to `datetime` objects using the `pandas` function `to_datetime`.
+
+```{python}
+#| code-fold: false
+calls["EVENTDT"] = pd.to_datetime(calls["EVENTDT"])
+calls.head()
+```
+
+Now, we can use the `dt` accessor on this column.
+
+We can get the month:
+
+```{python}
+#| code-fold: false
+calls["EVENTDT"].dt.month.head()
+```
+
+Which day of the week the date is on:
+
+```{python}
+#| code-fold: false
+calls["EVENTDT"].dt.dayofweek.head()
+```
+
+Let's check the minimum values to see if there are any suspicious-looking dates from the 1970s (timestamps near the Unix epoch often indicate missing or corrupted dates):
+
+```{python}
+#| code-fold: false
+calls.sort_values("EVENTDT").head()
+```
+
+Doesn't look like it! We are good!
+
+
+We can also do many things with the `dt` accessor like switching time zones and converting time back to UNIX/POSIX time. Check out the documentation on [`.dt` accessor](https://pandas.pydata.org/docs/user_guide/basics.html#basics-dt-accessors) and [time series/date functionality](https://pandas.pydata.org/docs/user_guide/timeseries.html#).
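+
+As a hedged sketch of those two operations (assuming the `EVENTDT` timestamps represent local Pacific time):
+
+```python
+# Attach a time zone to the naive timestamps, then convert to UTC
+event_pacific = calls["EVENTDT"].dt.tz_localize("US/Pacific")
+event_utc = event_pacific.dt.tz_convert("UTC")
+
+# Seconds since the Unix epoch (POSIX time)
+posix_seconds = (event_utc - pd.Timestamp("1970-01-01", tz="UTC")) // pd.Timedelta("1s")
+posix_seconds.head()
+```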
+
+## Faithfulness
+
+At this stage in our data cleaning and EDA workflow, we've achieved quite a lot: we've identified how our data is structured, come to terms with what information it encodes, and gained insight as to how it was generated. Throughout this process, we should always recall the original intent of our work in Data Science – to use data to better understand and model the real world. To achieve this goal, we need to ensure that the data we use is faithful to reality; that is, that our data accurately captures the "real world."
+
+Data used in research or industry is often "messy" – there may be errors or inaccuracies that impact the faithfulness of the dataset. Signs that data may not be faithful include:
+
+* Unrealistic or "incorrect" values, such as negative counts, locations that don't exist, or dates set in the future
+* Violations of obvious dependencies, like an age that does not match a birthday
+* Clear signs that data was entered by hand, which can lead to spelling errors or fields that are incorrectly shifted
+* Signs of data falsification, such as fake email addresses or repeated use of the same names
+* Duplicated records or fields containing the same information
+* Truncated data, e.g. older versions of Microsoft Excel limited spreadsheets to 65,536 rows and 256 columns
+
+We often solve some of these more common issues in the following ways:
+
+* Spelling errors: apply corrections or drop records that aren't in a dictionary
+* Time zone inconsistencies: convert to a common time zone (e.g. UTC)
+* Duplicated records or fields: identify and eliminate duplicates (using primary keys), as sketched below
+* Unspecified or inconsistent units: infer the units and check that values are in reasonable ranges in the data
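+
+For instance, the duplicate-record fix often comes down to a single line once a primary key is identified. A sketch, assuming a hypothetical `df` whose primary key is `"Cal ID"`:
+
+```python
+# Keep only the first occurrence of each primary key value
+df_deduped = df.drop_duplicates(subset=["Cal ID"], keep="first")
+```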
+
+### Missing Values
+Another common issue encountered with real-world datasets is that of missing data. One strategy to resolve this is to simply drop any records with missing values from the dataset. This does, however, introduce the risk of inducing biases – it is possible that the missing or corrupt records may be systematically related to some feature of interest in the data. Another solution is to keep the data as `NaN` values.
+
+A third method to address missing data is to perform **imputation**: infer the missing values using other data available in the dataset. There is a wide variety of imputation techniques that can be implemented; some of the most common are listed below.
+
+* Average imputation: replace missing values with the average value for that field
+* Hot deck imputation: replace missing values with some random value
+* Regression imputation: develop a model to predict missing values
+* Multiple imputation: replace missing values with multiple random values
+
+Regardless of the strategy used to deal with missing data, we should think carefully about *why* particular records or fields may be missing – this can help inform whether or not the absence of these values is significant or meaningful.
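+
+As a quick sketch of the first technique, average imputation, on a hypothetical `DataFrame` with a numeric `"weight"` column:
+
+```python
+import numpy as np
+import pandas as pd
+
+df = pd.DataFrame({"weight": [120.5, np.nan, 135.0, np.nan, 150.2]})
+
+# Average imputation: fill missing entries with the column mean
+df["weight"] = df["weight"].fillna(df["weight"].mean())
+df
+```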
+
+# EDA Demo 1: Tuberculosis in the United States
+
+Now, let's walk through the data-cleaning and EDA workflow to see what we can learn about the presence of Tuberculosis in the United States!
+
+We will examine the data included in the [original CDC article](https://www.cdc.gov/mmwr/volumes/71/wr/mm7112a1.htm?s_cid=mm7112a1_w#T1_down), which reports tuberculosis cases through 2021.
+
+
+## CSVs and Field Names
+Suppose Table 1 was saved as a CSV file located in `data/cdc_tuberculosis.csv`.
+
+We can then explore the CSV (which is a text file, and does not contain binary-encoded data) in many ways:
+
+1. Using a text editor like emacs, vim, VSCode, etc.
+2. Opening the CSV directly in DataHub (read-only), Excel, Google Sheets, etc.
+3. The `Python` file object
+4. `pandas`, using `pd.read_csv()`
+
+To try out options 1 and 2, you can view or download the Tuberculosis CSV from the [lecture demo notebook](https://data100.datahub.berkeley.edu/hub/user-redirect/git-pull?repo=https%3A%2F%2Fgithub.com%2FDS-100%2Ffa23-student&urlpath=lab%2Ftree%2Ffa23-student%2Flecture%2Flec05%2Flec04-eda.ipynb&branch=main) under the `data` folder in the left-hand menu. Notice how the CSV file is a type of **rectangular data (i.e., tabular data) stored as comma-separated values**.
+
+Next, let's try out option 3 using the `Python` file object. We'll look at the first four lines:
+
+```{python}
+#| code-fold: true
+with open("data/cdc_tuberculosis.csv", "r") as f:
+ i = 0
+ for row in f:
+ print(row)
+ i += 1
+ if i > 3:
+ break
+```
+
+Whoa, why are there blank lines interspersed between the lines of the CSV?
+
+You may recall that all line breaks in text files are encoded as the special newline character `\n`. `Python`'s `print()` prints each string (including its trailing newline) and then adds an additional newline of its own.
+
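+As a quick aside, one way to avoid the doubled blank lines is to suppress `print`'s own trailing newline (a minimal sketch):
+
+```python
+with open("data/cdc_tuberculosis.csv", "r") as f:
+    for i, row in enumerate(f):
+        print(row, end="")  # row already ends with "\n", so don't add another
+        if i >= 3:
+            break
+```
+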
+If you're curious, we can use the `repr()` function to return the raw string with all special characters:
+
+```{python}
+#| code-fold: true
+with open("data/cdc_tuberculosis.csv", "r") as f:
+ i = 0
+ for row in f:
+ print(repr(row)) # print raw strings
+ i += 1
+ if i > 3:
+ break
+```
+
+Finally, let's try option 4 and use the tried-and-true Data 100 approach: `pandas`.
+
+```{python}
+#| code-fold: false
+tb_df = pd.read_csv("data/cdc_tuberculosis.csv")
+tb_df.head()
+```
+
+You may notice some strange things about this table: what's up with the "Unnamed" column names and the first row?
+
+Congratulations — you're ready to wrangle your data! Because of how things are stored, we'll need to clean the data a bit to name our columns better.
+
+A reasonable first step is to identify the row with the right header. The `pd.read_csv()` function ([documentation](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html)) has the convenient `header` parameter, which we can set so that the elements in row 1 (the second row, since `pandas` uses zero-indexing) become the column names:
+
+```{python}
+#| code-fold: false
+tb_df = pd.read_csv("data/cdc_tuberculosis.csv", header=1) # row index
+tb_df.head(5)
+```
+
+Wait...but now we can't differentiate between the "Number of TB cases" and "TB incidence" year columns. `pandas` has tried to make our lives easier by automatically adding ".1" to the latter columns, but this doesn't help us, as humans, understand the data.
+
+We can do this manually with `df.rename()` ([documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.rename.html?highlight=rename#pandas.DataFrame.rename)):
+
+```{python}
+#| code-fold: false
+rename_dict = {'2019': 'TB cases 2019',
+ '2020': 'TB cases 2020',
+ '2021': 'TB cases 2021',
+ '2019.1': 'TB incidence 2019',
+ '2020.1': 'TB incidence 2020',
+ '2021.1': 'TB incidence 2021'}
+tb_df = tb_df.rename(columns=rename_dict)
+tb_df.head(5)
+```
+
+## Record Granularity
+
+You might already be wondering: what's up with that first record?
+
+Row 0 is what we call a **rollup record**, or summary record. It's often useful when displaying tables to humans. The **granularity** of record 0 (Totals) vs the rest of the records (States) is different.
+
+Okay, EDA step two. How was the rollup record aggregated?
+
+Let's check if the Total TB cases are the sum of all state TB cases. If we sum over all rows, we should get **2x** the total cases in each of the TB cases columns (why do you think this is?).
+
+```{python}
+#| code-fold: true
+tb_df.sum(axis=0)
+```
+
+Whoa, what's going on with the TB cases in 2019, 2020, and 2021? Check out the column types:
+
+```{python}
+#| code-fold: true
+tb_df.dtypes
+```
+
+Since there are commas in the values for TB cases, the numbers are read as the `object` datatype, or **storage type** (close to the `Python` string datatype), so `pandas` is concatenating strings instead of adding integers (recall that `Python` can "sum", or concatenate, strings together: `"data" + "100"` evaluates to `"data100"`).
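+
+One manual fix would be to strip the commas and cast the column ourselves (a hedged sketch):
+
+```python
+# Remove the thousands separators, then convert to integers
+tb_df["TB cases 2019"].str.replace(",", "").astype(int).head()
+```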
+
+
+Fortunately `read_csv` also has a `thousands` parameter ([documentation](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html)):
+
+```{python}
+#| code-fold: false
+# improve readability: chaining method calls with outer parentheses/line breaks
+tb_df = (
+ pd.read_csv("data/cdc_tuberculosis.csv", header=1, thousands=',')
+ .rename(columns=rename_dict)
+)
+tb_df.head(5)
+```
+
+```{python}
+#| code-fold: false
+tb_df.sum()
+```
+
+The Total TB cases look right. Phew!
+
+Let's just look at the records with **state-level granularity**:
+
+```{python}
+#| code-fold: true
+state_tb_df = tb_df[1:]
+state_tb_df.head(5)
+```
+
+## Gather Census Data
+
+U.S. Census population estimates [source](https://www.census.gov/data/tables/time-series/demo/popest/2010s-state-total.html) (2019), [source](https://www.census.gov/data/tables/time-series/demo/popest/2020s-state-total.html) (2020-2021).
+
+Running the cells below cleans the data. There are a few new methods here:
+
+* `df.convert_dtypes()` ([documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.convert_dtypes.html)) conveniently converts each column to the best possible dtype (for example, whole-number floats become integers) and is out of scope for the class.
+* `df.dropna()` ([documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.dropna.html)) will be explained in more detail next time.
+
+```{python}
+#| code-fold: true
+# 2010s census data
+census_2010s_df = pd.read_csv("data/nst-est2019-01.csv", header=3, thousands=",")
+census_2010s_df = (
+ census_2010s_df
+ .reset_index()
+ .drop(columns=["index", "Census", "Estimates Base"])
+ .rename(columns={"Unnamed: 0": "Geographic Area"})
+ .convert_dtypes() # "smart" converting of columns, use at your own risk
+ .dropna() # we'll introduce this next time
+)
+census_2010s_df['Geographic Area'] = census_2010s_df['Geographic Area'].str.strip('.')
+
+# with pd.option_context('display.min_rows', 30): # shows more rows
+# display(census_2010s_df)
+
+census_2010s_df.head(5)
+```
+
+Occasionally, you will want to modify code that you have imported. To re-import those modifications, you can either use `Python`'s `importlib` library:
+
+```python
+from importlib import reload
+reload(utils)
+```
+
+or use `iPython` magic which will intelligently import code when files change:
+
+```python
+%load_ext autoreload
+%autoreload 2
+```
+
+```{python}
+#| code-fold: true
+# census 2020s data
+census_2020s_df = pd.read_csv("data/NST-EST2022-POP.csv", header=3, thousands=",")
+census_2020s_df = (
+ census_2020s_df
+ .reset_index()
+ .drop(columns=["index", "Unnamed: 1"])
+ .rename(columns={"Unnamed: 0": "Geographic Area"})
+ .convert_dtypes() # "smart" converting of columns, use at your own risk
+ .dropna() # we'll introduce this next time
+)
+census_2020s_df['Geographic Area'] = census_2020s_df['Geographic Area'].str.strip('.')
+
+census_2020s_df.head(5)
+```
+
+## Joining Data (Merging `DataFrame`s)
+
+Time to `merge`! Here we use the `DataFrame` method `df1.merge(right=df2, ...)` on `DataFrame` `df1` ([documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.merge.html)). Contrast this with the function `pd.merge(left=df1, right=df2, ...)` ([documentation](https://pandas.pydata.org/docs/reference/api/pandas.merge.html?highlight=pandas%20merge#pandas.merge)). Feel free to use either.
+
+```{python}
+#| code-fold: false
+# merge TB DataFrame with two US census DataFrames
+tb_census_df = (
+ tb_df
+ .merge(right=census_2010s_df,
+ left_on="U.S. jurisdiction", right_on="Geographic Area")
+ .merge(right=census_2020s_df,
+ left_on="U.S. jurisdiction", right_on="Geographic Area")
+)
+tb_census_df.head(5)
+```
+
+Having all of these columns is a little unwieldy. We could either drop the unneeded columns now, or just merge on smaller census `DataFrame`s. Let's do the latter.
+
+```{python}
+#| code-fold: false
+# try merging again, but cleaner this time
+tb_census_df = (
+ tb_df
+ .merge(right=census_2010s_df[["Geographic Area", "2019"]],
+ left_on="U.S. jurisdiction", right_on="Geographic Area")
+ .drop(columns="Geographic Area")
+ .merge(right=census_2020s_df[["Geographic Area", "2020", "2021"]],
+ left_on="U.S. jurisdiction", right_on="Geographic Area")
+ .drop(columns="Geographic Area")
+)
+tb_census_df.head(5)
+```
+
+## Reproducing Data: Compute Incidence
+
+Let's recompute incidence to make sure we know where the original CDC numbers came from.
+
+From the [CDC report](https://www.cdc.gov/mmwr/volumes/71/wr/mm7112a1.htm?s_cid=mm7112a1_w#T1_down): TB incidence is computed as “Cases per 100,000 persons using mid-year population estimates from the U.S. Census Bureau.”
+
+If we define a group as 100,000 people, then we can compute the TB incidence for a given state population as
+
+$$\text{TB incidence} = \frac{\text{TB cases in population}}{\text{groups in population}} = \frac{\text{TB cases in population}}{\text{population}/100000} $$
+
+$$= \frac{\text{TB cases in population}}{\text{population}} \times 100000$$
+
+Let's try this for 2019:
+
+```{python}
+#| code-fold: false
+tb_census_df["recompute incidence 2019"] = tb_census_df["TB cases 2019"]/tb_census_df["2019"]*100000
+tb_census_df.head(5)
+```
+
+Awesome!!!
+
+Let's use a for-loop and `Python` format strings to compute TB incidence for all years. `Python` f-strings are just used for the purposes of this demo, but they're handy to know when you explore data beyond this course ([documentation](https://docs.python.org/3/tutorial/inputoutput.html)).
+
+```{python}
+#| code-fold: false
+# recompute incidence for all years
+for year in [2019, 2020, 2021]:
+ tb_census_df[f"recompute incidence {year}"] = tb_census_df[f"TB cases {year}"]/tb_census_df[f"{year}"]*100000
+tb_census_df.head(5)
+```
+
+These numbers look pretty close!!! There are a few errors in the hundredths place, particularly in 2021. It may be useful to further explore reasons behind this discrepancy.
+
+```{python}
+#| code-fold: false
+tb_census_df.describe()
+```
+
+## Bonus EDA: Reproducing the Reported Statistic
+
+
+**How do we reproduce that reported statistic in the original [CDC report](https://www.cdc.gov/mmwr/volumes/71/wr/mm7112a1.htm?s_cid=mm7112a1_w)?**
+
+> Reported TB incidence (cases per 100,000 persons) increased **9.4%**, from **2.2** during 2020 to **2.4** during 2021 but was lower than incidence during 2019 (2.7). Increases occurred among both U.S.-born and non–U.S.-born persons.
+
+This is TB incidence computed across the entire U.S. population! How do we reproduce this?
+* We need to reproduce the "Total" TB incidences in our rolled record.
+* But our current `tb_census_df` only has 51 entries (50 states plus Washington, D.C.). There is no rolled record.
+* What happened...?
+
+Let's get exploring!
+
+Before we keep exploring, we'll set all indexes to more meaningful values, instead of just numbers that pertain to some row at some point. This will make our cleaning slightly easier.
+
+```{python}
+#| code-fold: true
+tb_df = tb_df.set_index("U.S. jurisdiction")
+tb_df.head(5)
+```
+
+```{python}
+#| code-fold: false
+census_2010s_df = census_2010s_df.set_index("Geographic Area")
+census_2010s_df.head(5)
+```
+
+```{python}
+#| code-fold: false
+census_2020s_df = census_2020s_df.set_index("Geographic Area")
+census_2020s_df.head(5)
+```
+
+It turns out that our merge above only kept state records, even though our original `tb_df` had the "Total" rolled record:
+
+```{python}
+#| code-fold: false
+tb_df.head()
+```
+
+Recall that `merge` performs an **inner** merge by default, meaning that it only preserves keys that are present in **both** `DataFrame`s.
+
+The rolled records in our census `DataFrame` have different `Geographic Area` fields, which was the key we merged on:
+
+```{python}
+#| code-fold: false
+census_2010s_df.head(5)
+```
+
+The Census `DataFrame` has several rolled records. The aggregate record we are looking for actually has the Geographic Area named "United States".
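+
+If you want to verify this programmatically, one option (a sketch, not the approach we take below) is an outer merge with `indicator=True`, which labels where each key came from:
+
+```python
+# how="outer" keeps unmatched keys; the _merge column records each row's origin
+check = tb_df.merge(right=census_2010s_df, how="outer",
+                    left_index=True, right_index=True, indicator=True)
+check[check["_merge"] != "both"].index
+```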
+
+One straightforward way to get the right merge is to rename the value itself. Because we now have the Geographic Area index, we'll use `df.rename()` ([documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.rename.html)):
+
+```{python}
+#| code-fold: false
+# rename rolled record for 2010s
+census_2010s_df.rename(index={'United States':'Total'}, inplace=True)
+census_2010s_df.head(5)
+```
+
+```{python}
+#| code-fold: false
+# same, but for 2020s rename rolled record
+census_2020s_df.rename(index={'United States':'Total'}, inplace=True)
+census_2020s_df.head(5)
+```
+
+<br/>
+
+Next let's rerun our merge. Note the different chaining, because we are now merging on indexes (`df.merge()` [documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.merge.html)).
+
+```{python}
+#| code-fold: false
+tb_census_df = (
+ tb_df
+ .merge(right=census_2010s_df[["2019"]],
+ left_index=True, right_index=True)
+ .merge(right=census_2020s_df[["2020", "2021"]],
+ left_index=True, right_index=True)
+)
+tb_census_df.head(5)
+```
+
+<br/>
+
+Finally, let's recompute our incidences:
+
+```{python}
+#| code-fold: false
+# recompute incidence for all years
+for year in [2019, 2020, 2021]:
+ tb_census_df[f"recompute incidence {year}"] = tb_census_df[f"TB cases {year}"]/tb_census_df[f"{year}"]*100000
+tb_census_df.head(5)
+```
+
+We reproduced the total U.S. incidences correctly!
+
+We're almost there. Let's revisit the quote:
+
+> Reported TB incidence (cases per 100,000 persons) increased **9.4%**, from **2.2** during 2020 to **2.4** during 2021 but was lower than incidence during 2019 (2.7). Increases occurred among both U.S.-born and non–U.S.-born persons.
+
+Recall that percent change from $A$ to $B$ is computed as
+$\text{percent change} = \frac{B - A}{A} \times 100$.
+
+```{python}
+#| code-fold: false
+#| tags: []
+incidence_2020 = tb_census_df.loc['Total', 'recompute incidence 2020']
+incidence_2020
+```
+
+```{python}
+#| code-fold: false
+#| tags: []
+incidence_2021 = tb_census_df.loc['Total', 'recompute incidence 2021']
+incidence_2021
+```
+
+```{python}
+#| code-fold: false
+#| tags: []
+difference = (incidence_2021 - incidence_2020)/incidence_2020 * 100
+difference
+```
+
+# EDA Demo 2: Mauna Loa CO<sub>2</sub> Data -- A Lesson in Data Faithfulness
+
+[Mauna Loa Observatory](https://gml.noaa.gov/ccgg/trends/data.html) has been monitoring CO<sub>2</sub> concentrations since 1958.
+
+```{python}
+#| code-fold: false
+co2_file = "data/co2_mm_mlo.txt"
+```
+
+Let's do some **EDA**!!
+
+## Reading this file into Pandas?
+Let's instead check out this `.txt` file. Some questions to keep in mind: Do we trust this file extension? What structure is it?
+
+Lines 71-78 (inclusive) are shown below:
+
+ line number | file contents
+
+ 71 | # decimal average interpolated trend #days
+ 72 | # date (season corr)
+ 73 | 1958 3 1958.208 315.71 315.71 314.62 -1
+ 74 | 1958 4 1958.292 317.45 317.45 315.29 -1
+ 75 | 1958 5 1958.375 317.50 317.50 314.71 -1
+ 76 | 1958 6 1958.458 -99.99 317.10 314.85 -1
+ 77 | 1958 7 1958.542 315.86 315.86 314.98 -1
+ 78 | 1958 8 1958.625 314.93 314.93 315.94 -1
+
+
+Notice how:
+
+- The values are separated by white space, possibly tabs.
+- The values line up in fixed positions down the rows. For example, the month appears in the 7th to 8th position of each line.
+- The 71st and 72nd lines in the file contain column headings split over two lines.
+
+We can use `read_csv` to read the data into a `pandas` `DataFrame`, and we provide several arguments to specify that the separators are white space, there is no header (**we will set our own column names**), and to skip the first 72 rows of the file.
+
+```{python}
+#| code-fold: false
+co2 = pd.read_csv(
+    co2_file, header = None, skiprows = 72,
+    sep = r'\s+'  # delimiter for continuous whitespace (stay tuned for regex next lecture)
+)
+co2.head()
+```
+
+Congratulations! You've wrangled the data!
+
+<br/>
+
+...But our columns aren't named.
+**We need to do more EDA.**
+
+## Exploring Variable Feature Types
+
+The NOAA [webpage](https://gml.noaa.gov/ccgg/trends/) might have some useful tidbits (in this case, it doesn't tell us much beyond what's already in the file's header comments).
+
+Using the column headings from lines 71-72 of the raw file, we'll rerun `pd.read_csv`, but this time with some **custom column names.**
+
+```{python}
+#| code-fold: false
+co2 = pd.read_csv(
+    co2_file, header = None, skiprows = 72,
+    sep = r'\s+',  # regex for continuous whitespace (next lecture)
+    names = ['Yr', 'Mo', 'DecDate', 'Avg', 'Int', 'Trend', 'Days']
+)
+co2.head()
+```
+
+## Visualizing CO<sub>2</sub>
+Scientific studies tend to have very clean data, right...? Let's jump right in and make a time series plot of CO2 monthly averages.
+
+```{python}
+#| code-fold: true
+sns.lineplot(x='DecDate', y='Avg', data=co2);
+```
+
+The code above uses the `seaborn` plotting library (abbreviated `sns`). We will cover this in the Visualization lecture; for now, you don't need to worry about how it works!
+
+Yikes! Plotting the data uncovered a problem. The sharp vertical lines suggest that we have some **missing values**. What happened here?
+
+```{python}
+#| code-fold: false
+co2.head()
+```
+
+```{python}
+#| code-fold: false
+co2.tail()
+```
+
+Some data have unusual values like -1 and -99.99.
+
+Let's check the description at the top of the file again.
+
+* -1 signifies a missing value for `Days`, the number of days the equipment was in operation that month.
+* -99.99 denotes a missing monthly average, `Avg`.
+
+How can we fix this? First, let's explore other aspects of our data. Understanding our data will help us decide what to do with the missing values.
+
+<br/>
+
+
+## Sanity Checks: Reasoning about the data
+First, we consider the shape of the data. How many rows should we have?
+
+* The data should have one record per month, in chronological order.
+* The data run from March 1958 to August 2019.
+* So we should have $12 \times (2019-1957) - 2 - 4 = 738$ records (12 months for each of the 62 years from 1958 through 2019, minus the missing Jan-Feb 1958 and Sep-Dec 2019).
+
+```{python}
+#| code-fold: false
+co2.shape
+```
+
+Nice!! The number of rows (i.e., records) matches our expectations.
+
+<br/>
+
+
+Let's now check the quality of each feature.
+
+## Understanding Missing Value 1: `Days`
+`Days` is a time field, so let's analyze other time fields to see if there is an explanation for missing values of days of operation.
+
+Let's start with **months**, `Mo`.
+
+Are we missing any records? Each month should appear 61 or 62 times (March 1958-August 2019).
+
+```{python}
+#| code-fold: false
+co2["Mo"].value_counts().sort_index()
+```
+
+As expected, Jan, Feb, Sep, Oct, Nov, and Dec have 61 occurrences, and the rest have 62.
+
+<br/>
+
+Next let's explore **days** `Days` itself, which is the number of days that the measurement equipment worked.
+
+```{python}
+#| code-fold: true
+sns.displot(co2['Days']);
+plt.title("Distribution of days feature"); # suppresses unneeded plotting output
+```
+
+In terms of data quality, a handful of months have averages based on measurements taken on fewer than half the days. In addition, there are nearly 200 missing values--**that's about 27% of the data**!
+
+<br/>
+
+Finally, let's check the last time feature, **year** `Yr`.
+
+Let's check to see if there is any connection between missingness and the year of the recording.
+
+```{python}
+#| code-fold: true
+sns.scatterplot(x="Yr", y="Days", data=co2);
+plt.title("Day field by Year"); # the ; suppresses output
+```
+
+**Observations**:
+
+* All of the missing data are in the early years of operation.
+* It appears there may have been problems with equipment in the mid to late 80s.
+
+**Potential Next Steps**:
+
+* Confirm these explanations through documentation about the historical readings.
+* Maybe drop the earliest recordings? However, we would want to delay such action until after we have examined the time trends and assessed whether there are any potential problems.
+
+<br/>
+
+## Understanding Missing Value 2: `Avg`
+Next, let's return to the -99.99 values in `Avg` to analyze the overall quality of the CO<sub>2</sub> measurements. We'll plot a histogram of the average CO<sub>2</sub> measurements:
+
+```{python}
+#| code-fold: true
+# Histograms of average CO2 measurements
+sns.displot(co2['Avg']);
+```
+
+The non-missing values are in the 300-400 range (a regular range of CO2 levels).
+
+We also see that there are only a few missing `Avg` values (**<1% of values**). Let's examine all of them:
+
+```{python}
+#| code-fold: false
+co2[co2["Avg"] < 0]
+```
+
+There doesn't seem to be a pattern to these values, other than that most records also were missing `Days` data.
+
+## Drop, `NaN`, or Impute Missing `Avg` Data?
+
+How should we address the invalid `Avg` data?
+
+1. Drop records
+2. Set to NaN
+3. Impute using some strategy
+
+Remember we want to fix the following plot:
+
+```{python}
+#| code-fold: true
+sns.lineplot(x='DecDate', y='Avg', data=co2)
+plt.title("CO2 Average By Month");
+```
+
+Since we are plotting `Avg` vs `DecDate`, we should just focus on dealing with missing values for `Avg`.
+
+
+Let's consider a few options:
+
+1. Drop those records
+2. Replace -99.99 with `NaN`
+3. Substitute a likely value for the average CO2
+
+What do you think are the pros and cons of each possible action?
+
+<br/>
+
+
+Let's examine each of these three options.
+
+```{python}
+#| code-fold: false
+# 1. Drop missing values
+co2_drop = co2[co2['Avg'] > 0]
+co2_drop.head()
+```
+
+```{python}
+#| code-fold: false
+# 2. Replace -99.99 with NaN
+co2_NA = co2.replace(-99.99, np.nan)
+co2_NA.head()
+```
+
+We'll also use a third version of the data.
+
+First, we note that the dataset already comes with a **substitute value** for the -99.99.
+
+From the file description:
+
+> The `interpolated` column includes average values from the preceding column (`average`)
+and **interpolated values** where data are missing. Interpolated values are
+computed in two steps...
+
+The `Int` feature has values that exactly match those in `Avg`, except when `Avg` is -99.99, and then a **reasonable** estimate is used instead.
+
+So, the third version of our data will use the `Int` feature instead of `Avg`.
+
+```{python}
+#| code-fold: false
+# 3. Use interpolated column which estimates missing Avg values
+co2_impute = co2.copy()
+co2_impute['Avg'] = co2['Int']
+co2_impute.head()
+```
+
+What's a **reasonable** estimate?
+
+To answer this question, let's zoom in on a short time period, say the measurements in 1958 (where we know we have two missing values).
+
+```{python}
+#| code-fold: true
+# results of plotting data in 1958
+
+def line_and_points(data, ax, title):
+ # assumes single year, hence Mo
+ ax.plot('Mo', 'Avg', data=data)
+ ax.scatter('Mo', 'Avg', data=data)
+ ax.set_xlim(2, 13)
+ ax.set_title(title)
+ ax.set_xticks(np.arange(3, 13))
+
+def data_year(data, year):
+    return data[data["Yr"] == year]
+
+# uses matplotlib subplots
+# you may see more next week; focus on output for now
+fig, axes = plt.subplots(ncols = 3, figsize=(12, 4), sharey=True)
+
+year = 1958
+line_and_points(data_year(co2_drop, year), axes[0], title="1. Drop Missing")
+line_and_points(data_year(co2_NA, year), axes[1], title="2. Missing Set to NaN")
+line_and_points(data_year(co2_impute, year), axes[2], title="3. Missing Interpolated")
+
+fig.suptitle(f"Monthly Averages for {year}")
+plt.tight_layout()
+```
+
+In the big picture, since there are only 7 missing `Avg` values (**<1%** of 738 months), any of these approaches would work.
+
+However, there is some appeal to **option 3, imputing**:
+
+* It shows seasonal trends for CO2.
+* We are plotting all months in our data as a line plot.
+
+<br/>
+
+
+Let's replot our original figure with option 3:
+
+```{python}
+#| code-fold: true
+sns.lineplot(x='DecDate', y='Avg', data=co2_impute)
+plt.title("CO2 Average By Month, Imputed");
+```
+
+Looks pretty close to what we see on the NOAA [website](https://gml.noaa.gov/ccgg/trends/)!
+
+## Presenting the data: A Discussion on Data Granularity
+
+From the description:
+
+* Monthly measurements are averages of daily average measurements.
+* The NOAA GML website has datasets for daily/hourly measurements too.
+
+The data you present depends on your research question.
+
+**How do CO2 levels vary by season?**
+
+* You might want to keep average monthly data.
+
+**Are CO2 levels rising over the past 50+ years, consistent with global warming predictions?**
+
+* You might be happier with a **coarser granularity** of yearly average data!
+
+```{python}
+#| code-fold: true
+co2_year = co2_impute.groupby('Yr').mean()
+sns.lineplot(x='Yr', y='Avg', data=co2_year)
+plt.title("CO2 Average By Year");
+```
+
+Indeed, we see a rise by nearly 100 ppm of CO2 since Mauna Loa began recording in 1958.
+
+# Summary
+We went over a lot of content in this lecture; let's summarize the most important points:
+
+## Dealing with Missing Values
+There are a few options we can take to deal with missing data:
+
+* Drop missing records
+* Keep `NaN` missing values
+* Impute using an interpolated column
+
+## EDA and Data Wrangling
+There are several ways to approach EDA and Data Wrangling:
+
+* Examine the **data and metadata**: what is the date, size, organization, and structure of the data?
+* Examine each **field/attribute/dimension** individually.
+* Examine pairs of related dimensions (e.g. breaking down grades by major).
+* Along the way, we can:
+ * **Visualize** or summarize the data.
+ * **Validate assumptions** about data and its collection process. Pay particular attention to when the data was collected.
+ * Identify and **address anomalies**.
+ * Apply data transformations and corrections (we'll cover this in the upcoming lecture).
+ * **Record everything you do!** Developing in Jupyter Notebook promotes *reproducibility* of your own work!
diff --git a/docs/eda/eda_files/figure-html/cell-62-output-1.png b/docs/eda/eda_files/figure-html/cell-62-output-1.png
index a04218cf..f392d5f9 100644
Binary files a/docs/eda/eda_files/figure-html/cell-62-output-1.png and b/docs/eda/eda_files/figure-html/cell-62-output-1.png differ
diff --git a/docs/eda/eda_files/figure-html/cell-67-output-1.png b/docs/eda/eda_files/figure-html/cell-67-output-1.png
new file mode 100644
index 00000000..be96b8c9
Binary files /dev/null and b/docs/eda/eda_files/figure-html/cell-67-output-1.png differ
diff --git a/docs/eda/eda_files/figure-html/cell-67-output-2.png b/docs/eda/eda_files/figure-html/cell-67-output-2.png
deleted file mode 100644
index 31857f62..00000000
Binary files a/docs/eda/eda_files/figure-html/cell-67-output-2.png and /dev/null differ
diff --git a/docs/eda/eda_files/figure-html/cell-68-output-1.png b/docs/eda/eda_files/figure-html/cell-68-output-1.png
index 67c3959d..ffd29ff8 100644
Binary files a/docs/eda/eda_files/figure-html/cell-68-output-1.png and b/docs/eda/eda_files/figure-html/cell-68-output-1.png differ
diff --git a/docs/eda/eda_files/figure-html/cell-69-output-1.png b/docs/eda/eda_files/figure-html/cell-69-output-1.png
new file mode 100644
index 00000000..29088928
Binary files /dev/null and b/docs/eda/eda_files/figure-html/cell-69-output-1.png differ
diff --git a/docs/eda/eda_files/figure-html/cell-69-output-2.png b/docs/eda/eda_files/figure-html/cell-69-output-2.png
deleted file mode 100644
index fb28f5d5..00000000
Binary files a/docs/eda/eda_files/figure-html/cell-69-output-2.png and /dev/null differ
diff --git a/docs/eda/eda_files/figure-html/cell-71-output-1.png b/docs/eda/eda_files/figure-html/cell-71-output-1.png
index 39cac822..49ef3d6a 100644
Binary files a/docs/eda/eda_files/figure-html/cell-71-output-1.png and b/docs/eda/eda_files/figure-html/cell-71-output-1.png differ
diff --git a/docs/eda/eda_files/figure-html/cell-75-output-1.png b/docs/eda/eda_files/figure-html/cell-75-output-1.png
index 6382e58a..15a5fe82 100644
Binary files a/docs/eda/eda_files/figure-html/cell-75-output-1.png and b/docs/eda/eda_files/figure-html/cell-75-output-1.png differ
diff --git a/docs/eda/eda_files/figure-html/cell-76-output-1.png b/docs/eda/eda_files/figure-html/cell-76-output-1.png
index db2b0dee..40b1fc71 100644
Binary files a/docs/eda/eda_files/figure-html/cell-76-output-1.png and b/docs/eda/eda_files/figure-html/cell-76-output-1.png differ
diff --git a/docs/eda/eda_files/figure-html/cell-77-output-1.png b/docs/eda/eda_files/figure-html/cell-77-output-1.png
index 897b8b39..99b6c2d1 100644
Binary files a/docs/eda/eda_files/figure-html/cell-77-output-1.png and b/docs/eda/eda_files/figure-html/cell-77-output-1.png differ
-
---
-title: Data Cleaning and EDA
-execute:
- echo: true
-format:
- html:
- code-fold: true
- code-tools: true
- toc: true
- toc-title: Data Cleaning and EDA
- page-layout: full
- theme:
- - cosmo
- - cerulean
- callout-icon: false
-jupyter: python3
----
-
-```{python}
-#| code-fold: true
-import numpy as np
-import pandas as pd
-
-import matplotlib.pyplot as plt
-import seaborn as sns
-#%matplotlib inline
-plt.rcParams['figure.figsize'] = (12, 9)
-
-sns.set()
-sns.set_context('talk')
-np.set_printoptions(threshold=20, precision=2, suppress=True)
-pd.set_option('display.max_rows', 30)
-pd.set_option('display.max_columns', None)
-pd.set_option('display.precision', 2)
-# This option stops scientific notation for pandas
-pd.set_option('display.float_format', '{:.2f}'.format)
-
-# Silence some spurious seaborn warnings
-import warnings
-warnings.filterwarnings("ignore", category=FutureWarning)
-```
-
-::: {.callout-note collapse="false"}
-## Learning Outcomes
-* Recognize common file formats
-* Categorize data by its variable type
-* Build awareness of issues with data faithfulness and develop targeted solutions
-:::
-
-**This content is covered in lectures 4, 5, and 6.**
-
-In the past few lectures, we've learned that `pandas` is a toolkit to restructure, modify, and explore a dataset. What we haven't yet touched on is *how* to make these data transformation decisions. When we receive a new set of data from the "real world," how do we know what processing we should do to convert this data into a usable form?
-
-**Data cleaning**, also called **data wrangling**, is the process of transforming raw data to facilitate subsequent analysis. It is often used to address issues like:
-
-* Unclear structure or formatting
-* Missing or corrupted values
-* Unit conversions
-* ...and so on
-
-**Exploratory Data Analysis (EDA)** is the process of understanding a new dataset. It is an open-ended, informal analysis that involves familiarizing ourselves with the variables present in the data, discovering potential hypotheses, and identifying possible issues with the data. This last point can often motivate further data cleaning to address any problems with the dataset's format; because of this, EDA and data cleaning are often thought of as an "infinite loop," with each process driving the other.
-
-In this lecture, we will consider the key properties of data to consider when performing data cleaning and EDA. In doing so, we'll develop a "checklist" of sorts for you to consider when approaching a new dataset. Throughout this process, we'll build a deeper understanding of this early (but very important!) stage of the data science lifecycle.
-
-## Structure
-
-### File Formats
-There are many file types for storing structured data: TSV, JSON, XML, ASCII, SAS, etc. We'll only cover CSV, TSV, and JSON in lecture, but you'll likely encounter other formats as you work with different datasets. Reading documentation is your best bet for understanding how to process the multitude of different file types.
-
-#### CSV
-CSVs, which stand for **Comma-Separated Values**, are a common tabular data format.
-In the past two `pandas` lectures, we briefly touched on the idea of file format: the way data is encoded in a file for storage. Specifically, our `elections` and `babynames` datasets were stored and loaded as CSVs:
-
-```{python}
-#| code-fold: false
-pd.read_csv("data/elections.csv").head(5)
-```
-
-To better understand the properties of a CSV, let's take a look at the first few rows of the raw data file to see what it looks like before being loaded into a `DataFrame`. We'll use the `repr()` function to return the raw string with its special characters:
-
-```{python}
-#| code-fold: false
-with open("data/elections.csv", "r") as table:
- i = 0
- for row in table:
- print(repr(row))
- i += 1
- if i > 3:
- break
-```
-
-Each row, or **record**, in the data is delimited by a newline `\n`. Each column, or **field**, in the data is delimited by a comma `,` (hence, comma-separated!).
-
-#### TSV
-
-Another common file type is **TSV (Tab-Separated Values)**. In a TSV, records are still delimited by a newline `\n`, while fields are delimited by `\t` tab character.
-
-Let's check out the first few rows of the raw TSV file. Again, we'll use the `repr()` function so that `print` shows the special characters.
-
-```{python}
-#| code-fold: false
-with open("data/elections.txt", "r") as table:
- i = 0
- for row in table:
- print(repr(row))
- i += 1
- if i > 3:
- break
-```
-
-TSVs can be loaded into `pandas` using `pd.read_csv`. We'll need to specify the **delimiter** with parameter` sep='\t'` [(documentation)](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html).
-
-```{python}
-#| code-fold: false
-pd.read_csv("data/elections.txt", sep='\t').head(3)
-```
-
-An issue with CSVs and TSVs comes up whenever there are commas or tabs within the records. How does `pandas` differentiate between a comma delimiter vs. a comma within the field itself, for example `8,900`? To remedy this, check out the [`quotechar` parameter](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html).
-
-#### JSON
-**JSON (JavaScript Object Notation)** files behave similarly to Python dictionaries. A raw JSON is shown below.
-
-```{python}
-#| code-fold: false
-with open("data/elections.json", "r") as table:
- i = 0
- for row in table:
- print(row)
- i += 1
- if i > 8:
- break
-```
-
-JSON files can be loaded into `pandas` using `pd.read_json`.
-
-```{python}
-#| code-fold: false
-pd.read_json('data/elections.json').head(3)
-```
-
-##### EDA with JSON: Berkeley COVID-19 Data
-The City of Berkeley Open Data [website](https://data.cityofberkeley.info/Health/COVID-19-Confirmed-Cases/xn6j-b766) has a dataset with COVID-19 Confirmed Cases among Berkeley residents by date. Let's download the file and save it as a JSON (note the source URL file type is also a JSON). In the interest of reproducible data science, we will download the data programatically. We have defined some helper functions in the [`ds100_utils.py`](https://ds100.org/fa23/resources/assets/lectures/lec05/lec05-eda.html) file that we can reuse these helper functions in many different notebooks.
-
-```{python}
-#| code-fold: false
-from ds100_utils import fetch_and_cache
-
-covid_file = fetch_and_cache(
- "https://data.cityofberkeley.info/api/views/xn6j-b766/rows.json?accessType=DOWNLOAD",
- "confirmed-cases.json",
- force=False)
-covid_file # a file path wrapper object
-```
-
-###### File Size
-Let's start our analysis by getting a rough estimate of the size of the dataset to inform the tools we use to view the data. For relatively small datasets, we can use a text editor or spreadsheet. For larger datasets, more programmatic exploration or distributed computing tools may be more fitting. Here we will use `Python` tools to probe the file.
-
-Since there seem to be text files, let's investigate the number of lines, which often corresponds to the number of records
-
-```{python}
-#| code-fold: false
-import os
-
-print(covid_file, "is", os.path.getsize(covid_file) / 1e6, "MB")
-
-with open(covid_file, "r") as f:
- print(covid_file, "is", sum(1 for l in f), "lines.")
-```
-
-###### Unix Commands
-As part of the EDA workflow, Unix commands can come in very handy. In fact, there's an entire book called ["Data Science at the Command Line"](https://datascienceatthecommandline.com/) that explores this idea in depth!
-In Jupyter/IPython, you can prefix lines with `!` to execute arbitrary Unix commands, and within those lines, you can refer to `Python` variables and expressions with the syntax `{expr}`.
-
-Here, we use the `ls` command to list files, using the `-lh` flags, which request "long format with information in human-readable form." We also use the `wc` command for "word count," but with the `-l` flag, which asks for line counts instead of words.
-
-These two give us the same information as the code above, albeit in a slightly different form:
-
-```{python}
-#| code-fold: false
-!ls -lh {covid_file}
-!wc -l {covid_file}
-```
-
-###### File Contents
-Let's explore the data format using `Python`.
-
-```{python}
-#| code-fold: false
-with open(covid_file, "r") as f:
- for i, row in enumerate(f):
- print(repr(row)) # print raw strings
- if i >= 4: break
-```
-
-We can use the `head` Unix command (which is where `pandas`' `head` method comes from!) to see the first few lines of the file:
-
-```{python}
-#| code-fold: false
-!head -5 {covid_file}
-```
-
-In order to load the JSON file into `pandas`, Let's first do some EDA with `Python`'s `json` package to understand the particular structure of this JSON file so that we can decide what (if anything) to load into `pandas`. `Python` has relatively good support for JSON data since it closely matches the internal python object model. In the following cell we import the entire JSON datafile into a python dictionary using the `json` package.
-
-```{python}
-#| code-fold: false
-import json
-
-with open(covid_file, "rb") as f:
- covid_json = json.load(f)
-```
-
-The `covid_json` variable is now a dictionary encoding the data in the file:
-
-```{python}
-#| code-fold: false
-type(covid_json)
-```
-
-We can examine what keys are in the top level json object by listing out the keys.
-
-```{python}
-#| code-fold: false
-covid_json.keys()
-```
-
-**Observation**: The JSON dictionary contains a `meta` key which likely refers to meta data (data about the data). Meta data often maintained with the data and can be a good source of additional information.
-
-
-We can investigate the meta data further by examining the keys associated with the metadata.
-
-```{python}
-#| code-fold: false
-covid_json['meta'].keys()
-```
-
-The `meta` key contains another dictionary called `view`. This likely refers to meta-data about a particular "view" of some underlying database. We will learn more about views when we study SQL later in the class.
-
-```{python}
-#| code-fold: false
-covid_json['meta']['view'].keys()
-```
-
-Notice that this a nested/recursive data structure. As we dig deeper we reveal more and more keys and the corresponding data:
-
-```
-meta
-|-> data
- | ... (haven't explored yet)
-|-> view
- | -> id
- | -> name
- | -> attribution
- ...
- | -> description
- ...
- | -> columns
- ...
-```
-
-
-There is a key called description in the view sub dictionary. This likely contains a description of the data:
-
-```{python}
-#| code-fold: false
-print(covid_json['meta']['view']['description'])
-```
-
-###### Examining the Data Field for Records
-
-We can look at a few entries in the `data` field. This is what we'll load into `pandas`.
-
-```{python}
-#| code-fold: false
-for i in range(3):
- print(f"{i:03} | {covid_json['data'][i]}")
-```
-
-Observations:
-* These look like equal-length records, so maybe `data` is a table!
-* But what do each of values in the record mean? Where can we find column headers?
-
-For that, we'll need the `columns` key in the metadata dictionary. This returns a list:
-
-```{python}
-#| code-fold: false
-type(covid_json['meta']['view']['columns'])
-```
-
-###### Summary of exploring the JSON file
-
-1. The above **metadata** tells us a lot about the columns in the data including column names, potential data anomalies, and a basic statistic.
-1. Because of its non-tabular structure, JSON makes it easier (than CSV) to create **self-documenting data**, meaning that information about the data is stored in the same file as the data.
-1. Self-documenting data can be helpful since it maintains its own description and these descriptions are more likely to be updated as data changes.
-
-###### Loading COVID Data into `pandas`
-Finally, let's load the data (not the metadata) into a `pandas` `DataFrame`. In the following block of code we:
-
-1. Translate the JSON records into a `DataFrame`:
-
- * fields: `covid_json['meta']['view']['columns']`
- * records: `covid_json['data']`
-
-
-1. Remove columns that have no metadata description. This would be a bad idea in general, but here we remove these columns since the above analysis suggests they are unlikely to contain useful information.
-
-1. Examine the `tail` of the table.
-
-```{python}
-#| code-fold: false
-# Load the data from JSON and assign column titles
-covid = pd.DataFrame(
- covid_json['data'],
- columns=[c['name'] for c in covid_json['meta']['view']['columns']])
-
-covid.tail()
-```
-
-### Variable Types
-
-After loading data from a file, it's a good idea to take the time to understand what pieces of information are encoded in the dataset. In particular, we want to identify what variable types are present in our data. Broadly speaking, we can categorize variables into one of two overarching types.
-
-**Quantitative variables** describe some numeric quantity or amount. We can divide quantitative data further into:
-
-* **Continuous quantitative variables**: numeric data that can be measured on a continuous scale to arbitrary precision. Continuous variables do not have a strict set of possible values – they can be recorded to any number of decimal places. For example, weights, GPA, or CO<sub>2</sub> concentrations.
-* **Discrete quantitative variables**: numeric data that can only take on a finite set of possible values. For example, someone's age or the number of siblings they have.
-
-**Qualitative variables**, also known as **categorical variables**, describe data that isn't measuring some quantity or amount. The sub-categories of categorical data are:
-
-* **Ordinal qualitative variables**: categories with ordered levels. Specifically, ordinal variables are those where the difference between levels has no consistent, quantifiable meaning. Some examples include levels of education (high school, undergrad, grad, etc.), income bracket (low, medium, high), or Yelp rating.
-* **Nominal qualitative variables**: categories with no specific order. For example, someone's political affiliation or Cal ID number.
-
-![Classification of variable types](images/variable.png)
-
-Note that many variables don't sit neatly in just one of these categories. Qualitative variables could have numeric levels, and conversely, quantitative variables could be stored as strings.
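-
-If you want to make an ordinal variable's ordering explicit in `pandas`, one option is an ordered `Categorical`. The values below are made up purely for illustration:
-
-```{python}
-#| code-fold: false
-# Sketch: encode an ordinal variable with an ordered Categorical (made-up data)
-income_bracket = pd.Categorical(["low", "high", "medium", "low"],
-                                categories=["low", "medium", "high"],
-                                ordered=True)
-pd.Series(income_bracket).sort_values()
-```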
-
-### Primary and Foreign Keys
-
-Last time, we introduced `.merge` as the `pandas` method for joining multiple `DataFrame`s together. In our discussion of joins, we touched on the idea of using a "key" to determine what rows should be merged from each table. Let's take a moment to examine this idea more closely.
-
-The **primary key** is the column or set of columns in a table that *uniquely* determine the values of the remaining columns. It can be thought of as the unique identifier for each individual row in the table. For example, a table of Data 100 students might use each student's Cal ID as the primary key.
-
-```{python}
-#| echo: false
-pd.DataFrame({"Cal ID":[3034619471, 3035619472, 3025619473, 3046789372], \
- "Name":["Oski", "Ollie", "Orrie", "Ollie"], \
- "Major":["Data Science", "Computer Science", "Data Science", "Economics"]})
-```
-
-The **foreign key** is the column or set of columns in a table that reference primary keys in other tables. Knowing a dataset's foreign keys can be useful when assigning the `left_on` and `right_on` parameters of `.merge`. In the table of office hour tickets below, `"Cal ID"` is a foreign key referencing the previous table.
-
-```{python}
-#| echo: false
-pd.DataFrame({"OH Request":[1, 2, 3, 4], \
- "Cal ID":[3034619471, 3035619472, 3025619473, 3035619472], \
- "Question":["HW 2 Q1", "HW 2 Q3", "Lab 3 Q4", "HW 2 Q7"]})
-```
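-
-To see how these keys guide a join, here is a small sketch that re-creates simplified versions of the two tables above and merges the tickets with the student roster on `"Cal ID"`:
-
-```{python}
-#| code-fold: false
-# Sketch: join office hour tickets to the student table via the foreign key
-students = pd.DataFrame({"Cal ID": [3034619471, 3035619472, 3025619473, 3046789372],
-                         "Name": ["Oski", "Ollie", "Orrie", "Ollie"]})
-tickets = pd.DataFrame({"OH Request": [1, 2, 3, 4],
-                        "Cal ID": [3034619471, 3035619472, 3025619473, 3035619472]})
-tickets.merge(students, on="Cal ID", how="left")
-```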
-
-## Granularity, Scope, and Temporality
-
-After understanding the structure of the dataset, the next task is to determine what exactly the data represents. We'll do so by considering the data's granularity, scope, and temporality.
-
-### Granularity
-The **granularity** of a dataset is what a single row represents. You can also think of it as the level of detail included in the data. To determine the data's granularity, ask: what does each row in the dataset represent? Fine-grained data contains a high level of detail, with a single row representing a small individual unit. For example, each record may represent one person. Coarse-grained data is encoded such that a single row represents a large individual unit – for example, each record may represent a group of people.
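-
-As a toy sketch of moving from fine-grained to coarse-grained data, we can aggregate person-level rows into one row per city (the data below is made up):
-
-```{python}
-#| code-fold: false
-# Sketch: coarsen granularity from one row per person to one row per city
-people = pd.DataFrame({"city": ["Berkeley", "Berkeley", "Oakland"],
-                       "age": [20, 22, 35]})
-people.groupby("city").agg(num_people=("age", "size"), mean_age=("age", "mean"))
-```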
-
-### Scope
-The **scope** of a dataset is the subset of the population covered by the data. If we were investigating student performance in Data Science courses, a dataset with a narrow scope might encompass all students enrolled in Data 100 whereas a dataset with an expansive scope might encompass all students in California.
-
-### Temporality
-The **temporality** of a dataset describes the periodicity over which the data was collected as well as when the data was most recently collected or updated.
-
-Time and date fields of a dataset could represent a few things:
-
-1. when the "event" happened
-2. when the data was collected, or when it was entered into the system
-3. when the data was copied into the database
-
-To fully understand the temporality of the data, it may also be necessary to standardize time zones or inspect recurring time-based trends in the data (do patterns recur in 24-hour periods? Over the course of a month? Seasonally?). The convention for standardizing time is Coordinated Universal Time (UTC), an international time standard measured at 0 degrees longitude that stays consistent throughout the year (no daylight savings). Berkeley's time zone, Pacific Standard Time (PST), is UTC-8; during daylight savings, Berkeley observes Pacific Daylight Time (PDT), which is UTC-7.
-
-#### Temporality with `pandas`' `dt` accessors
-Let's briefly look at how we can use `pandas`' `dt` accessors to work with dates/times in a dataset using the dataset you'll see in Lab 3: the Berkeley PD Calls for Service dataset.
-
-```{python}
-#| code-fold: true
-calls = pd.read_csv("data/Berkeley_PD_-_Calls_for_Service.csv")
-calls.head()
-```
-
-Looks like there are three columns with dates/times: `EVENTDT`, `EVENTTM`, and `InDbDate`.
-
-Most likely, `EVENTDT` stands for the date when the event took place, `EVENTTM` stands for the time of day the event took place (in 24-hr format), and `InDbDate` is the date this call is recorded onto the database.
-
-If we check the data type of these columns, we will see they are stored as strings. We can convert them to `datetime` objects using pandas `to_datetime` function.
-
-```{python}
-#| code-fold: false
-calls["EVENTDT"] = pd.to_datetime(calls["EVENTDT"])
-calls.head()
-```
-
-Now, we can use the `dt` accessor on this column.
-
-We can get the month:
-
-```{python}
-#| code-fold: false
-calls["EVENTDT"].dt.month.head()
-```
-
-Which day of the week the date is on:
-
-```{python}
-#| code-fold: false
-calls["EVENTDT"].dt.dayofweek.head()
-```
-
-Check the minimum values to see if there are any suspicious-looking 70s dates (which would suggest default UNIX-epoch timestamps):
-
-```{python}
-#| code-fold: false
-calls.sort_values("EVENTDT").head()
-```
-
-Doesn't look like it! We are good!
-
-
-We can also do many things with the `dt` accessor like switching time zones and converting time back to UNIX/POSIX time. Check out the documentation on [`.dt` accessor](https://pandas.pydata.org/docs/user_guide/basics.html#basics-dt-accessors) and [time series/date functionality](https://pandas.pydata.org/docs/user_guide/timeseries.html#).
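-
-For example, a minimal sketch of a time zone conversion and a UNIX-time computation on the `EVENTDT` column might look like the following (this assumes the dates are naive Pacific-time dates, which is an assumption, not something stated in the dataset):
-
-```{python}
-#| code-fold: false
-# Sketch: localize to Pacific time, convert to UTC, then compute seconds since the UNIX epoch
-event_utc = calls["EVENTDT"].dt.tz_localize("US/Pacific").dt.tz_convert("UTC")
-unix_seconds = (event_utc - pd.Timestamp("1970-01-01", tz="UTC")) // pd.Timedelta("1s")
-unix_seconds.head()
-```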
-
-## Faithfulness
-
-At this stage in our data cleaning and EDA workflow, we've achieved quite a lot: we've identified how our data is structured, come to terms with what information it encodes, and gained insight as to how it was generated. Throughout this process, we should always recall the original intent of our work in Data Science – to use data to better understand and model the real world. To achieve this goal, we need to ensure that the data we use is faithful to reality; that is, that our data accurately captures the "real world."
-
-Data used in research or industry is often "messy" – there may be errors or inaccuracies that impact the faithfulness of the dataset. Signs that data may not be faithful include:
-
-* Unrealistic or "incorrect" values, such as negative counts, locations that don't exist, or dates set in the future
-* Violations of obvious dependencies, like an age that does not match a birthday
-* Clear signs that data was entered by hand, which can lead to spelling errors or fields that are incorrectly shifted
-* Signs of data falsification, such as fake email addresses or repeated use of the same names
-* Duplicated records or fields containing the same information
-* Truncated data, e.g. older versions of Microsoft Excel limited spreadsheets to 65,536 rows and 256 columns
-
-We often solve some of these more common issues in the following ways:
-
-* Spelling errors: apply corrections or drop records that aren't in a dictionary
-* Time zone inconsistencies: convert to a common time zone (e.g. UTC)
-* Duplicated records or fields: identify and eliminate duplicates (using primary keys)
-* Unspecified or inconsistent units: infer the units and check that values are in reasonable ranges in the data
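-
-For the duplicated-records case above, `pandas` provides `drop_duplicates`. A minimal sketch on a toy table keyed by a primary key column:
-
-```{python}
-#| code-fold: false
-# Sketch: drop duplicate records, keeping the first occurrence of each primary key
-records = pd.DataFrame({"Cal ID": [3034619471, 3034619471, 3035619472],
-                        "Name": ["Oski", "Oski", "Ollie"]})
-records.drop_duplicates(subset="Cal ID")
-```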
-
-### Missing Values
-Another common issue encountered with real-world datasets is that of missing data. One strategy to resolve this is to simply drop any records with missing values from the dataset. This does, however, introduce the risk of inducing biases – it is possible that the missing or corrupt records may be systemically related to some feature of interest in the data. Another solution is to keep the data as `NaN` values.
-
-A third method to address missing data is to perform **imputation**: infer the missing values using other data available in the dataset. There is a wide variety of imputation techniques that can be implemented; some of the most common are listed below.
-
-* Average imputation: replace missing values with the average value for that field
-* Hot deck imputation: replace missing values with a value drawn from a randomly chosen similar record
-* Regression imputation: develop a model to predict missing values
-* Multiple imputation: replace missing values with multiple random values
-
-Regardless of the strategy used to deal with missing data, we should think carefully about *why* particular records or fields may be missing – this can help inform whether or not the absence of these values is significant or meaningful.
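-
-As a toy illustration of average imputation, we can fill the missing entries of a `Series` with its mean. (Later in this note we use a different strategy, interpolation, on real data.)
-
-```{python}
-#| code-fold: false
-# Sketch: mean imputation on a toy Series
-s = pd.Series([1.0, np.nan, 3.0, np.nan, 5.0])
-s.fillna(s.mean())
-```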
-
-# EDA Demo 1: Tuberculosis in the United States
-
-Now, let's walk through the data-cleaning and EDA workflow to see what we can learn about the presence of Tuberculosis in the United States!
-
-We will examine the data included in the [original CDC article](https://www.cdc.gov/mmwr/volumes/71/wr/mm7112a1.htm?s_cid=mm7112a1_w#T1_down) published in 2021.
-
-
-## CSVs and Field Names
-Suppose Table 1 was saved as a CSV file located in `data/cdc_tuberculosis.csv`.
-
-We can then explore the CSV (which is a text file, and does not contain binary-encoded data) in many ways:
-1. Using a text editor like emacs, vim, VSCode, etc.
-2. Opening the CSV directly in DataHub (read-only), Excel, Google Sheets, etc.
-3. The `Python` file object
-4. `pandas`, using `pd.read_csv()`
-
-To try out options 1 and 2, you can view or download the Tuberculosis CSV from the [lecture demo notebook](https://data100.datahub.berkeley.edu/hub/user-redirect/git-pull?repo=https%3A%2F%2Fgithub.com%2FDS-100%2Ffa23-student&urlpath=lab%2Ftree%2Ffa23-student%2Flecture%2Flec05%2Flec04-eda.ipynb&branch=main) under the `data` folder in the left-hand menu. Notice how the CSV file is a type of **rectangular data (i.e., tabular data) stored as comma-separated values**.
-
-Next, let's try out option 3 using the `Python` file object. We'll look at the first four lines:
-
-```{python}
-#| code-fold: true
-with open("data/cdc_tuberculosis.csv", "r") as f:
- i = 0
- for row in f:
- print(row)
- i += 1
- if i > 3:
- break
-```
-
-Whoa, why are there blank lines interspersed between the lines of the CSV?
-
-You may recall that all line breaks in text files are encoded as the special newline character `\n`. Python's `print()` prints each string (including the newline), and an additional newline on top of that.
-
-If you're curious, we can use the `repr()` function to return the raw string with all special characters:
-
-```{python}
-#| code-fold: true
-with open("data/cdc_tuberculosis.csv", "r") as f:
- i = 0
- for row in f:
- print(repr(row)) # print raw strings
- i += 1
- if i > 3:
- break
-```
-
-Finally, let's try option 4 and use the tried-and-true Data 100 approach: `pandas`.
-
-```{python}
-#| code-fold: false
-tb_df = pd.read_csv("data/cdc_tuberculosis.csv")
-tb_df.head()
-```
-
-You may notice some strange things about this table: what's up with the "Unnamed" column names and the first row?
-
-Congratulations — you're ready to wrangle your data! Because of how things are stored, we'll need to clean the data a bit to name our columns better.
-
-A reasonable first step is to identify the row with the right header. The `pd.read_csv()` function ([documentation](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html)) has the convenient `header` parameter that we can set to use the elements in row 1 as the appropriate columns:
-
-```{python}
-#| code-fold: false
-tb_df = pd.read_csv("data/cdc_tuberculosis.csv", header=1) # row index
-tb_df.head(5)
-```
-
-Wait...but now we can't differentiate between the "Number of TB cases" and "TB incidence" year columns. `pandas` has tried to make our lives easier by automatically adding ".1" to the latter columns, but this doesn't help us, as humans, understand the data.
-
-We can do this manually with `df.rename()` ([documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.rename.html?highlight=rename#pandas.DataFrame.rename)):
-
-```{python}
-#| code-fold: false
-rename_dict = {'2019': 'TB cases 2019',
- '2020': 'TB cases 2020',
- '2021': 'TB cases 2021',
- '2019.1': 'TB incidence 2019',
- '2020.1': 'TB incidence 2020',
- '2021.1': 'TB incidence 2021'}
-tb_df = tb_df.rename(columns=rename_dict)
-tb_df.head(5)
-```
-
-## Record Granularity
-
-You might already be wondering: what's up with that first record?
-
-Row 0 is what we call a **rollup record**, or summary record. It's often useful when displaying tables to humans. The **granularity** of record 0 (Totals) vs the rest of the records (States) is different.
-
-Okay, EDA step two. How was the rollup record aggregated?
-
-Let's check if the Total TB cases equal the sum of all state TB cases. If we sum over all rows, we should get **2x** the total cases in each of the TB cases columns by year (why do you think this is?).
-
-```{python}
-#| code-fold: true
-tb_df.sum(axis=0)
-```
-
-Whoa, what's going on with the TB cases in 2019, 2020, and 2021? Check out the column types:
-
-```{python}
-#| code-fold: true
-tb_df.dtypes
-```
-
-Since there are commas in the values for TB cases, the numbers are read as the `object` datatype, or **storage type** (close to the `Python` string datatype), so `pandas` is concatenating strings instead of adding integers (recall that `Python` can "sum", or concatenate, strings together: `"data" + "100"` evaluates to `"data100"`).
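-
-One manual fix (a sketch, assuming the column holds comma-formatted strings) is to strip the commas and cast the column ourselves, though the `thousands` parameter shown below is the cleaner approach:
-
-```{python}
-#| code-fold: false
-# Sketch: strip commas from one column and convert it to integers
-tb_df["TB cases 2019"].str.replace(",", "").astype(int).head()
-```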
-
-
-Fortunately `read_csv` also has a `thousands` parameter ([documentation](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html)):
-
-```{python}
-#| code-fold: false
-# improve readability: chaining method calls with outer parentheses/line breaks
-tb_df = (
- pd.read_csv("data/cdc_tuberculosis.csv", header=1, thousands=',')
- .rename(columns=rename_dict)
-)
-tb_df.head(5)
-```
-
-```{python}
-#| code-fold: false
-tb_df.sum()
-```
-
-The Total TB cases look right. Phew!
-
-Let's just look at the records with **state-level granularity**:
-
-```{python}
-#| code-fold: true
-state_tb_df = tb_df[1:]
-state_tb_df.head(5)
-```
-
-## Gather Census Data
-
-U.S. Census population estimates [source](https://www.census.gov/data/tables/time-series/demo/popest/2010s-state-total.html) (2019), [source](https://www.census.gov/data/tables/time-series/demo/popest/2020s-state-total.html) (2020-2021).
-
-Running the below cells cleans the data.
-There are a few new methods here:
-* `df.convert_dtypes()` ([documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.convert_dtypes.html)) conveniently converts each column to its best possible dtype (e.g. whole-number floats become integers) and is out of scope for the class.
-* `df.dropna()` ([documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.dropna.html)) will be explained in more detail next time.
-
-```{python}
-#| code-fold: true
-# 2010s census data
-census_2010s_df = pd.read_csv("data/nst-est2019-01.csv", header=3, thousands=",")
-census_2010s_df = (
- census_2010s_df
- .reset_index()
- .drop(columns=["index", "Census", "Estimates Base"])
- .rename(columns={"Unnamed: 0": "Geographic Area"})
- .convert_dtypes() # "smart" converting of columns, use at your own risk
- .dropna() # we'll introduce this next time
-)
-census_2010s_df['Geographic Area'] = census_2010s_df['Geographic Area'].str.strip('.')
-
-# with pd.option_context('display.min_rows', 30): # shows more rows
-# display(census_2010s_df)
-
-census_2010s_df.head(5)
-```
-
-Occasionally, you will want to modify code that you have imported. To reimport those modifications, you can either use `Python`'s `importlib` library:
-
-```python
-from importlib import reload
-reload(utils)
-```
-
-or use `IPython` magic, which will intelligently reimport code when files change:
-
-```python
-%load_ext autoreload
-%autoreload 2
-```
-
-```{python}
-#| code-fold: true
-# census 2020s data
-census_2020s_df = pd.read_csv("data/NST-EST2022-POP.csv", header=3, thousands=",")
-census_2020s_df = (
- census_2020s_df
- .reset_index()
- .drop(columns=["index", "Unnamed: 1"])
- .rename(columns={"Unnamed: 0": "Geographic Area"})
- .convert_dtypes() # "smart" converting of columns, use at your own risk
- .dropna() # we'll introduce this next time
-)
-census_2020s_df['Geographic Area'] = census_2020s_df['Geographic Area'].str.strip('.')
-
-census_2020s_df.head(5)
-```
-
-## Joining Data (Merging `DataFrame`s)
-
-Time to `merge`! Here we use the `DataFrame` method `df1.merge(right=df2, ...)` on `DataFrame` `df1` ([documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.merge.html)). Contrast this with the function `pd.merge(left=df1, right=df2, ...)` ([documentation](https://pandas.pydata.org/docs/reference/api/pandas.merge.html?highlight=pandas%20merge#pandas.merge)). Feel free to use either.
-
-```{python}
-#| code-fold: false
-# merge TB DataFrame with two US census DataFrames
-tb_census_df = (
- tb_df
- .merge(right=census_2010s_df,
- left_on="U.S. jurisdiction", right_on="Geographic Area")
- .merge(right=census_2020s_df,
- left_on="U.S. jurisdiction", right_on="Geographic Area")
-)
-tb_census_df.head(5)
-```
-
-Having all of these columns is a little unwieldy. We could either drop the unneeded columns now, or just merge on smaller census `DataFrame`s. Let's do the latter.
-
-```{python}
-#| code-fold: false
-# try merging again, but cleaner this time
-tb_census_df = (
- tb_df
- .merge(right=census_2010s_df[["Geographic Area", "2019"]],
- left_on="U.S. jurisdiction", right_on="Geographic Area")
- .drop(columns="Geographic Area")
- .merge(right=census_2020s_df[["Geographic Area", "2020", "2021"]],
- left_on="U.S. jurisdiction", right_on="Geographic Area")
- .drop(columns="Geographic Area")
-)
-tb_census_df.head(5)
-```
-
-## Reproducing Data: Compute Incidence
-
-Let's recompute incidence to make sure we know where the original CDC numbers came from.
-
-From the [CDC report](https://www.cdc.gov/mmwr/volumes/71/wr/mm7112a1.htm?s_cid=mm7112a1_w#T1_down): TB incidence is computed as “Cases per 100,000 persons using mid-year population estimates from the U.S. Census Bureau.”
-
-If we define a group as 100,000 people, then we can compute the TB incidence for a given state population as
-
-$$\text{TB incidence} = \frac{\text{TB cases in population}}{\text{groups in population}} = \frac{\text{TB cases in population}}{\text{population}/100000} $$
-
-$$= \frac{\text{TB cases in population}}{\text{population}} \times 100000$$
-
-Let's try this for 2019:
-
-```{python}
-#| code-fold: false
-tb_census_df["recompute incidence 2019"] = tb_census_df["TB cases 2019"]/tb_census_df["2019"]*100000
-tb_census_df.head(5)
-```
-
-Awesome!!!
-
-Let's use a for-loop and `Python` format strings to compute TB incidence for all years. `Python` f-strings are just used for the purposes of this demo, but they're handy to know when you explore data beyond this course ([documentation](https://docs.python.org/3/tutorial/inputoutput.html)).
-
-```{python}
-#| code-fold: false
-# recompute incidence for all years
-for year in [2019, 2020, 2021]:
- tb_census_df[f"recompute incidence {year}"] = tb_census_df[f"TB cases {year}"]/tb_census_df[f"{year}"]*100000
-tb_census_df.head(5)
-```
-
-These numbers look pretty close!!! There are a few errors in the hundredths place, particularly in 2021. It may be useful to further explore reasons behind this discrepancy.
-
-```{python}
-#| code-fold: false
-tb_census_df.describe()
-```
-
-## Bonus EDA: Reproducing the Reported Statistic
-
-
-**How do we reproduce that reported statistic in the original [CDC report](https://www.cdc.gov/mmwr/volumes/71/wr/mm7112a1.htm?s_cid=mm7112a1_w)?**
-
-> Reported TB incidence (cases per 100,000 persons) increased **9.4%**, from **2.2** during 2020 to **2.4** during 2021 but was lower than incidence during 2019 (2.7). Increases occurred among both U.S.-born and non–U.S.-born persons.
-
-This is TB incidence computed across the entire U.S. population! How do we reproduce this?
-* We need to reproduce the "Total" TB incidences in our rolled record.
-* But our current `tb_census_df` only has 51 entries (50 states plus Washington, D.C.). There is no rolled record.
-* What happened...?
-
-Let's get exploring!
-
-Before we keep exploring, we'll set all indexes to more meaningful values, instead of just numbers that pertain to some row at some point. This will make our cleaning slightly easier.
-
-```{python}
-#| code-fold: true
-tb_df = tb_df.set_index("U.S. jurisdiction")
-tb_df.head(5)
-```
-
-```{python}
-#| code-fold: false
-census_2010s_df = census_2010s_df.set_index("Geographic Area")
-census_2010s_df.head(5)
-```
-
-```{python}
-#| code-fold: false
-census_2020s_df = census_2020s_df.set_index("Geographic Area")
-census_2020s_df.head(5)
-```
-
-It turns out that our merge above only kept state records, even though our original `tb_df` had the "Total" rolled record:
-
-```{python}
-#| code-fold: false
-tb_df.head()
-```
-
-Recall that `merge` does an **inner** merge by default, meaning that it only preserves keys that are present in **both** `DataFrame`s.
-
-The rolled records in our census `DataFrame` have different `Geographic Area` fields, which was the key we merged on:
-
-```{python}
-#| code-fold: false
-census_2010s_df.head(5)
-```
-
-The Census `DataFrame` has several rolled records. The aggregate record we are looking for actually has the Geographic Area named "United States".
-
-One straightforward way to get the right merge is to rename the value itself. Because we now have the Geographic Area index, we'll use `df.rename()` ([documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.rename.html)):
-
-```{python}
-#| code-fold: false
-# rename rolled record for 2010s
-census_2010s_df.rename(index={'United States':'Total'}, inplace=True)
-census_2010s_df.head(5)
-```
-
-```{python}
-#| code-fold: false
-# same, but for 2020s rename rolled record
-census_2020s_df.rename(index={'United States':'Total'}, inplace=True)
-census_2020s_df.head(5)
-```
-
-<br/>
-
-Next let's rerun our merge. Note the different chaining, because we are now merging on indexes (`df.merge()` [documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.merge.html)).
-
-```{python}
-#| code-fold: false
-tb_census_df = (
- tb_df
- .merge(right=census_2010s_df[["2019"]],
- left_index=True, right_index=True)
- .merge(right=census_2020s_df[["2020", "2021"]],
- left_index=True, right_index=True)
-)
-tb_census_df.head(5)
-```
-
-<br/>
-
-Finally, let's recompute our incidences:
-
-```{python}
-#| code-fold: false
-# recompute incidence for all years
-for year in [2019, 2020, 2021]:
- tb_census_df[f"recompute incidence {year}"] = tb_census_df[f"TB cases {year}"]/tb_census_df[f"{year}"]*100000
-tb_census_df.head(5)
-```
-
-We reproduced the total U.S. incidences correctly!
-
-We're almost there. Let's revisit the quote:
-
-> Reported TB incidence (cases per 100,000 persons) increased **9.4%**, from **2.2** during 2020 to **2.4** during 2021 but was lower than incidence during 2019 (2.7). Increases occurred among both U.S.-born and non–U.S.-born persons.
-
-Recall that percent change from $A$ to $B$ is computed as
-$\text{percent change} = \frac{B - A}{A} \times 100$.
-
-```{python}
-#| code-fold: false
-#| tags: []
-incidence_2020 = tb_census_df.loc['Total', 'recompute incidence 2020']
-incidence_2020
-```
-
-```{python}
-#| code-fold: false
-#| tags: []
-incidence_2021 = tb_census_df.loc['Total', 'recompute incidence 2021']
-incidence_2021
-```
-
-```{python}
-#| code-fold: false
-#| tags: []
-difference = (incidence_2021 - incidence_2020)/incidence_2020 * 100
-difference
-```
-
-# EDA Demo 2: Mauna Loa CO<sub>2</sub> Data -- A Lesson in Data Faithfulness
-
-[Mauna Loa Observatory](https://gml.noaa.gov/ccgg/trends/data.html) has been monitoring CO<sub>2</sub> concentrations since 1958.
-
-```{python}
-#| code-fold: false
-co2_file = "data/co2_mm_mlo.txt"
-```
-
-Let's do some **EDA**!!
-
-## Reading this file into Pandas?
-Before loading anything, let's check out this `.txt` file directly. Some questions to keep in mind: Do we trust this file extension? What structure does the file have?
-
-Lines 71-78 (inclusive) are shown below:
-
- line number | file contents
-
- 71 | # decimal average interpolated trend #days
- 72 | # date (season corr)
- 73 | 1958 3 1958.208 315.71 315.71 314.62 -1
- 74 | 1958 4 1958.292 317.45 317.45 315.29 -1
- 75 | 1958 5 1958.375 317.50 317.50 314.71 -1
- 76 | 1958 6 1958.458 -99.99 317.10 314.85 -1
- 77 | 1958 7 1958.542 315.86 315.86 314.98 -1
- 78 | 1958 8 1958.625 314.93 314.93 315.94 -1
-
-
-Notice how:
-
-- The values are separated by white space, possibly tabs.
-- The values line up down the rows. For example, the month appears in the 7th to 8th position of each line.
-- The 71st and 72nd lines in the file contain column headings split over two lines.
-
-We can use `read_csv` to read the data into a `pandas` `DataFrame`, and we provide several arguments to specify that the separators are white space, there is no header (**we will set our own column names**), and to skip the first 72 rows of the file.
-
-```{python}
-#| code-fold: false
-co2 = pd.read_csv(
- co2_file, header = None, skiprows = 72,
- sep = r'\s+' # delimiter for continuous whitespace (stay tuned for regex next lecture)
-)
-co2.head()
-```
-
-Congratulations! You've wrangled the data!
-
-<br/>
-
-...But our columns aren't named.
-**We need to do more EDA.**
-
-## Exploring Variable Feature Types
-
-The NOAA [webpage](https://gml.noaa.gov/ccgg/trends/) might have some useful tidbits (in this case it doesn't).
-
-Instead, using the column headings we saw on lines 71-72 of the file, we'll rerun `pd.read_csv`, but this time with some **custom column names.**
-
-```{python}
-#| code-fold: false
-co2 = pd.read_csv(
- co2_file, header = None, skiprows = 72,
- sep = r'\s+', # regex for continuous whitespace (next lecture)
- names = ['Yr', 'Mo', 'DecDate', 'Avg', 'Int', 'Trend', 'Days']
-)
-co2.head()
-```
-
-## Visualizing CO<sub>2</sub>
-Scientific studies tend to have very clean data, right...? Let's jump right in and make a time series plot of CO2 monthly averages.
-
-```{python}
-#| code-fold: true
-sns.lineplot(x='DecDate', y='Avg', data=co2);
-```
-
-The code above uses the `seaborn` plotting library (abbreviated `sns`). We will cover this in the Visualization lecture, but for now you don't need to worry about how it works!
-
-Yikes! Plotting the data uncovered a problem. The sharp vertical lines suggest that we have some **missing values**. What happened here?
-
-```{python}
-#| code-fold: false
-co2.head()
-```
-
-```{python}
-#| code-fold: false
-co2.tail()
-```
-
-Some data have unusual values like -1 and -99.99.
-
-Let's check the description at the top of the file again.
-
-* -1 signifies a missing value for the number of days `Days` the equipment was in operation that month.
-* -99.99 denotes a missing monthly average `Avg`
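-
-A quick count of each sentinel value (a small sanity-check sketch) shows how much of the data is affected:
-
-```{python}
-#| code-fold: false
-# Sketch: count rows that use each missing-data sentinel
-(co2['Days'] == -1).sum(), (co2['Avg'] == -99.99).sum()
-```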
-
-How can we fix this? First, let's explore other aspects of our data. Understanding our data will help us decide what to do with the missing values.
-
-<br/>
-
-
-## Sanity Checks: Reasoning about the data
-First, we consider the shape of the data. How many rows should we have?
-
-* If the data are in chronological order, we should have one record per month.
-* Data from March 1958 to August 2019.
-* We should have $ 12 \times (2019-1957) - 2 - 4 = 738 $ records.
-
-```{python}
-#| code-fold: false
-co2.shape
-```
-
-Nice!! The number of rows (i.e. records) matches our expectations.
-
-<br/>
-
-
-Let's now check the quality of each feature.
-
-## Understanding Missing Value 1: `Days`
-`Days` is a time field, so let's analyze other time fields to see if there is an explanation for missing values of days of operation.
-
-Let's start with **months**, `Mo`.
-
-Are we missing any records? Each month should appear 61 or 62 times (March 1958-August 2019).
-
-```{python}
-#| code-fold: false
-co2["Mo"].value_counts().sort_index()
-```
-
-As expected, Jan, Feb, Sep, Oct, Nov, and Dec have 61 occurrences, and the rest have 62.
-
-<br/>
-
-Next let's explore **days** `Days` itself, which is the number of days that the measurement equipment worked.
-
-```{python}
-#| code-fold: true
-sns.displot(co2['Days']);
-plt.title("Distribution of days feature"); # suppresses unneeded plotting output
-```
-
-In terms of data quality, a handful of months have averages based on measurements taken on fewer than half the days. In addition, there are nearly 200 missing values--**that's about 27% of the data**!
-
-<br/>
-
-Finally, let's check the last time feature, **year** `Yr`.
-
-Let's check to see if there is any connection between missing-ness and the year of the recording.
-
-```{python}
-#| code-fold: true
-sns.scatterplot(x="Yr", y="Days", data=co2);
-plt.title("Day field by Year"); # the ; suppresses output
-```
-
-**Observations**:
-
-* All of the missing data are in the early years of operation.
-* It appears there may have been problems with equipment in the mid to late 80s.
-
-**Potential Next Steps**:
-
-* Confirm these explanations through documentation about the historical readings.
-* Maybe drop the earliest recordings? However, we would want to delay such action until after we have examined the time trends and assessed whether there are any potential problems.
-
-<br/>
-
-## Understanding Missing Value 2: `Avg`
-Next, let's return to the -99.99 values in `Avg` to analyze the overall quality of the CO2 measurements. We'll plot a histogram of the average CO<sub>2</sub> measurements
-
-```{python}
-#| code-fold: true
-# Histograms of average CO2 measurements
-sns.displot(co2['Avg']);
-```
-
-The non-missing values are in the 300-400 range (a regular range of CO2 levels).
-
-We also see that there are only a few missing `Avg` values (**<1% of values**). Let's examine all of them:
-
-```{python}
-#| code-fold: false
-co2[co2["Avg"] < 0]
-```
-
-There doesn't seem to be a pattern to these values, other than that most records also were missing `Days` data.
-
-## Drop, `NaN`, or Impute Missing `Avg` Data?
-
-How should we address the invalid `Avg` data?
-
-1. Drop records
-2. Set to NaN
-3. Impute using some strategy
-
-Remember we want to fix the following plot:
-
-```{python}
-#| code-fold: true
-sns.lineplot(x='DecDate', y='Avg', data=co2)
-plt.title("CO2 Average By Month");
-```
-
-Since we are plotting `Avg` vs `DecDate`, we should just focus on dealing with missing values for `Avg`.
-
-
-Let's consider a few options:
-1. Drop those records
-2. Replace -99.99 with NaN
-3. Substitute a likely value for the average CO2
-
-What do you think are the pros and cons of each possible action?
-
-<br/>
-
-
-Let's examine each of these three options.
-
-```{python}
-#| code-fold: false
-# 1. Drop missing values
-co2_drop = co2[co2['Avg'] > 0]
-co2_drop.head()
-```
-
-```{python}
-#| code-fold: false
-# 2. Replace the -99.99 sentinel with NaN
-co2_NA = co2.replace(-99.99, np.nan)
-co2_NA.head()
-```
-
-We'll also use a third version of the data.
-
-First, we note that the dataset already comes with a **substitute value** for the -99.99.
-
-From the file description:
-
-> The `interpolated` column includes average values from the preceding column (`average`)
-and **interpolated values** where data are missing. Interpolated values are
-computed in two steps...
-
-The `Int` feature has values that exactly match those in `Avg`, except when `Avg` is -99.99, and then a **reasonable** estimate is used instead.
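-
-We can sanity-check that claim directly (a small sketch): wherever `Avg` is not the -99.99 sentinel, it should agree with `Int`.
-
-```{python}
-#| code-fold: false
-# Sketch: verify that Int equals Avg on the non-missing rows
-valid = co2[co2['Avg'] > 0]
-(valid['Avg'] == valid['Int']).all()
-```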
-
-So, the third version of our data will use the `Int` feature instead of `Avg`.
-
-```{python}
-#| code-fold: false
-# 3. Use interpolated column which estimates missing Avg values
-co2_impute = co2.copy()
-co2_impute['Avg'] = co2['Int']
-co2_impute.head()
-```
-
-What's a **reasonable** estimate?
-
-To answer this question, let's zoom in on a short time period, say the measurements in 1958 (where we know we have two missing values).
-
-```{python}
-#| code-fold: true
-# results of plotting data in 1958
-
-def line_and_points(data, ax, title):
- # assumes single year, hence Mo
- ax.plot('Mo', 'Avg', data=data)
- ax.scatter('Mo', 'Avg', data=data)
- ax.set_xlim(2, 13)
- ax.set_title(title)
- ax.set_xticks(np.arange(3, 13))
-
-def data_year(data, year):
- return data[data["Yr"] == year]
-
-# uses matplotlib subplots
-# you may see more next week; focus on output for now
-fig, axes = plt.subplots(ncols = 3, figsize=(12, 4), sharey=True)
-
-year = 1958
-line_and_points(data_year(co2_drop, year), axes[0], title="1. Drop Missing")
-line_and_points(data_year(co2_NA, year), axes[1], title="2. Missing Set to NaN")
-line_and_points(data_year(co2_impute, year), axes[2], title="3. Missing Interpolated")
-
-fig.suptitle(f"Monthly Averages for {year}")
-plt.tight_layout()
-```
-
-In the big picture, since there are only 7 `Avg` values missing (**<1%** of 738 months), any of these approaches would work.
-
-However, there is some appeal to **option 3: imputing**:
-
-* It preserves the seasonal trends for CO2.
-* The line plot shows every month in our data, with no gaps.
-
-<br/>
-
-
-Let's replot our original figure with option 3:
-
-```{python}
-#| code-fold: true
-sns.lineplot(x='DecDate', y='Avg', data=co2_impute)
-plt.title("CO2 Average By Month, Imputed");
-```
-
-Looks pretty close to what we see on the NOAA [website](https://gml.noaa.gov/ccgg/trends/)!
-
-## Presenting the data: A Discussion on Data Granularity
-
-From the description:
-
-* Monthly measurements are averages of daily average measurements.
-* The NOAA GML website has datasets for daily/hourly measurements too.
-
-The data you present depends on your research question.
-
-**How do CO2 levels vary by season?**
-
-* You might want to keep average monthly data.
-
-**Are CO2 levels rising over the past 50+ years, consistent with global warming predictions?**
-
-* You might be happier with a **coarser granularity** of average year data!
-
-```{python}
-#| code-fold: true
-co2_year = co2_impute.groupby('Yr').mean()
-sns.lineplot(x='Yr', y='Avg', data=co2_year)
-plt.title("CO2 Average By Year");
-```
-
-Indeed, we see a rise by nearly 100 ppm of CO2 since Mauna Loa began recording in 1958.
-
-# Summary
-We went over a lot of content this lecture; let's summarize the most important points:
-
-## Dealing with Missing Values
-There are a few options we can take to deal with missing data:
-
-* Drop missing records
-* Keep `NaN` missing values
-* Impute using an interpolated column
-
-## EDA and Data Wrangling
-There are several ways to approach EDA and Data Wrangling:
-
-* Examine the **data and metadata**: what is the date, size, organization, and structure of the data?
-* Examine each **field/attribute/dimension** individually.
-* Examine pairs of related dimensions (e.g. breaking down grades by major).
-* Along the way, we can:
- * **Visualize** or summarize the data.
- * **Validate assumptions** about data and its collection process. Pay particular attention to when the data was collected.
- * Identify and **address anomalies**.
- * Apply data transformations and corrections (we'll cover this in the upcoming lecture).
- * **Record everything you do!** Developing in Jupyter Notebook promotes *reproducibility* of your own work!
+---
+title: Data Cleaning and EDA
+execute:
+ echo: true
+format:
+ html:
+ code-fold: true
+ code-tools: true
+ toc: true
+ toc-title: Data Cleaning and EDA
+ page-layout: full
+ theme:
+ - cosmo
+ - cerulean
+ callout-icon: false
+jupyter: python3
+---
+
+```{python}
+#| code-fold: true
+import numpy as np
+import pandas as pd
+
+import matplotlib.pyplot as plt
+import seaborn as sns
+#%matplotlib inline
+plt.rcParams['figure.figsize'] = (12, 9)
+
+sns.set()
+sns.set_context('talk')
+np.set_printoptions(threshold=20, precision=2, suppress=True)
+pd.set_option('display.max_rows', 30)
+pd.set_option('display.max_columns', None)
+pd.set_option('display.precision', 2)
+# This option stops scientific notation for pandas
+pd.set_option('display.float_format', '{:.2f}'.format)
+
+# Silence some spurious seaborn warnings
+import warnings
+warnings.filterwarnings("ignore", category=FutureWarning)
+```
+
+::: {.callout-note collapse="false"}
+## Learning Outcomes
+* Recognize common file formats
+* Categorize data by its variable type
+* Build awareness of issues with data faithfulness and develop targeted solutions
+:::
+
+**This content is covered in lectures 4, 5, and 6.**
+
+In the past few lectures, we've learned that `pandas` is a toolkit to restructure, modify, and explore a dataset. What we haven't yet touched on is *how* to make these data transformation decisions. When we receive a new set of data from the "real world," how do we know what processing we should do to convert this data into a usable form?
+
+**Data cleaning**, also called **data wrangling**, is the process of transforming raw data to facilitate subsequent analysis. It is often used to address issues like:
+
+* Unclear structure or formatting
+* Missing or corrupted values
+* Unit conversions
+* ...and so on
+
+**Exploratory Data Analysis (EDA)** is the process of understanding a new dataset. It is an open-ended, informal analysis that involves familiarizing ourselves with the variables present in the data, discovering potential hypotheses, and identifying possible issues with the data. This last point can often motivate further data cleaning to address any problems with the dataset's format; because of this, EDA and data cleaning are often thought of as an "infinite loop," with each process driving the other.
+
+In this lecture, we will consider the key properties of data to consider when performing data cleaning and EDA. In doing so, we'll develop a "checklist" of sorts for you to consider when approaching a new dataset. Throughout this process, we'll build a deeper understanding of this early (but very important!) stage of the data science lifecycle.
+
+## Structure
+
+### File Formats
+There are many file types for storing structured data: TSV, JSON, XML, ASCII, SAS, etc. We'll only cover CSV, TSV, and JSON in lecture, but you'll likely encounter other formats as you work with different datasets. Reading documentation is your best bet for understanding how to process the multitude of different file types.
+
+#### CSV
+CSVs, which stand for **Comma-Separated Values**, are a common tabular data format.
+In the past two `pandas` lectures, we briefly touched on the idea of file format: the way data is encoded in a file for storage. Specifically, our `elections` and `babynames` datasets were stored and loaded as CSVs:
+
+```{python}
+#| code-fold: false
+pd.read_csv("data/elections.csv").head(5)
+```
+
+To better understand the properties of a CSV, let's take a look at the first few rows of the raw data file to see what it looks like before being loaded into a `DataFrame`. We'll use the `repr()` function to return the raw string with its special characters:
+
+```{python}
+#| code-fold: false
+with open("data/elections.csv", "r") as table:
+ i = 0
+ for row in table:
+ print(repr(row))
+ i += 1
+ if i > 3:
+ break
+```
+
+Each row, or **record**, in the data is delimited by a newline `\n`. Each column, or **field**, in the data is delimited by a comma `,` (hence, comma-separated!).
+
+#### TSV
+
+Another common file type is **TSV (Tab-Separated Values)**. In a TSV, records are still delimited by a newline `\n`, while fields are delimited by the tab character `\t`.
+
+Let's check out the first few rows of the raw TSV file. Again, we'll use the `repr()` function so that `print` shows the special characters.
+
+```{python}
+#| code-fold: false
+with open("data/elections.txt", "r") as table:
+ i = 0
+ for row in table:
+ print(repr(row))
+ i += 1
+ if i > 3:
+ break
+```
+
+TSVs can be loaded into `pandas` using `pd.read_csv`. We'll need to specify the **delimiter** with the parameter `sep='\t'` [(documentation)](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html).
+
+```{python}
+#| code-fold: false
+pd.read_csv("data/elections.txt", sep='\t').head(3)
+```
+
+An issue with CSVs and TSVs comes up whenever there are commas or tabs within the records. How does `pandas` differentiate between a comma delimiter vs. a comma within the field itself, for example `8,900`? To remedy this, check out the [`quotechar` parameter](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html).
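+
+As a toy sketch of how quoting rescues embedded delimiters, consider a tiny CSV built in memory (the file contents here are made up):
+
+```{python}
+#| code-fold: false
+# Sketch: quoted fields keep the comma in "Berkeley, CA" from being treated as a delimiter
+from io import StringIO
+toy_csv = StringIO('City,Population\n"Berkeley, CA","121,643"\n')
+pd.read_csv(toy_csv, quotechar='"', thousands=',')
+```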
+
+#### JSON
+**JSON (JavaScript Object Notation)** files behave similarly to Python dictionaries. A raw JSON is shown below.
+
+```{python}
+#| code-fold: false
+with open("data/elections.json", "r") as table:
+ i = 0
+ for row in table:
+ print(row)
+ i += 1
+ if i > 8:
+ break
+```
+
+JSON files can be loaded into `pandas` using `pd.read_json`.
+
+```{python}
+#| code-fold: false
+pd.read_json('data/elections.json').head(3)
+```
+
+##### EDA with JSON: Berkeley COVID-19 Data
+The City of Berkeley Open Data [website](https://data.cityofberkeley.info/Health/COVID-19-Confirmed-Cases/xn6j-b766) has a dataset with COVID-19 Confirmed Cases among Berkeley residents by date. Let's download the file and save it as a JSON (note the source URL file type is also a JSON). In the interest of reproducible data science, we will download the data programmatically. We have defined some helper functions in the [`ds100_utils.py`](https://ds100.org/fa23/resources/assets/lectures/lec05/lec05-eda.html) file so that we can reuse these helper functions in many different notebooks.
+
+```{python}
+#| code-fold: false
+from ds100_utils import fetch_and_cache
+
+covid_file = fetch_and_cache(
+ "https://data.cityofberkeley.info/api/views/xn6j-b766/rows.json?accessType=DOWNLOAD",
+ "confirmed-cases.json",
+ force=False)
+covid_file # a file path wrapper object
+```
+
+###### File Size
+Let's start our analysis by getting a rough estimate of the size of the dataset to inform the tools we use to view the data. For relatively small datasets, we can use a text editor or spreadsheet. For larger datasets, more programmatic exploration or distributed computing tools may be more fitting. Here we will use `Python` tools to probe the file.
+
+Since this seems to be a text file, let's investigate the number of lines, which often corresponds to the number of records.
+
+```{python}
+#| code-fold: false
+import os
+
+print(covid_file, "is", os.path.getsize(covid_file) / 1e6, "MB")
+
+with open(covid_file, "r") as f:
+ print(covid_file, "is", sum(1 for l in f), "lines.")
+```
+
+###### Unix Commands
+As part of the EDA workflow, Unix commands can come in very handy. In fact, there's an entire book called ["Data Science at the Command Line"](https://datascienceatthecommandline.com/) that explores this idea in depth!
+In Jupyter/IPython, you can prefix lines with `!` to execute arbitrary Unix commands, and within those lines, you can refer to `Python` variables and expressions with the syntax `{expr}`.
+
+Here, we use the `ls` command to list files, using the `-lh` flags, which request "long format with information in human-readable form." We also use the `wc` command for "word count," but with the `-l` flag, which asks for line counts instead of words.
+
+These two give us the same information as the code above, albeit in a slightly different form:
+
+```{python}
+#| code-fold: false
+!ls -lh {covid_file}
+!wc -l {covid_file}
+```
+
+###### File Contents
+Let's explore the data format using `Python`.
+
+```{python}
+#| code-fold: false
+with open(covid_file, "r") as f:
+ for i, row in enumerate(f):
+ print(repr(row)) # print raw strings
+ if i >= 4: break
+```
+
+We can use the `head` Unix command (which is where `pandas`' `head` method comes from!) to see the first few lines of the file:
+
+```{python}
+#| code-fold: false
+!head -5 {covid_file}
+```
+
+In order to load the JSON file into `pandas`, let's first do some EDA with `Python`'s `json` package to understand the particular structure of this JSON file so that we can decide what (if anything) to load into `pandas`. `Python` has relatively good support for JSON data since it closely matches the internal `Python` object model. In the following cell, we import the entire JSON datafile into a `Python` dictionary using the `json` package.
+
+```{python}
+#| code-fold: false
+import json
+
+with open(covid_file, "rb") as f:
+ covid_json = json.load(f)
+```
+
+The `covid_json` variable is now a dictionary encoding the data in the file:
+
+```{python}
+#| code-fold: false
+type(covid_json)
+```
+
+We can examine what keys are in the top level json object by listing out the keys.
+
+```{python}
+#| code-fold: false
+covid_json.keys()
+```
+
+**Observation**: The JSON dictionary contains a `meta` key, which likely refers to metadata (data about the data). Metadata is often maintained with the data and can be a good source of additional information.
+
+
+We can investigate the meta data further by examining the keys associated with the metadata.
+
+```{python}
+#| code-fold: false
+covid_json['meta'].keys()
+```
+
+The `meta` key contains another dictionary called `view`. This likely refers to meta-data about a particular "view" of some underlying database. We will learn more about views when we study SQL later in the class.
+
+```{python}
+#| code-fold: false
+covid_json['meta']['view'].keys()
+```
+
+Notice that this is a nested/recursive data structure. As we dig deeper, we reveal more and more keys and the corresponding data:
+
+```
+meta
+|-> data
+ | ... (haven't explored yet)
+|-> view
+ | -> id
+ | -> name
+ | -> attribution
+ ...
+ | -> description
+ ...
+ | -> columns
+ ...
+```
+
+
+There is a key called description in the view sub dictionary. This likely contains a description of the data:
+
+```{python}
+#| code-fold: false
+print(covid_json['meta']['view']['description'])
+```
+
+###### Examining the Data Field for Records
+
+We can look at a few entries in the `data` field. This is what we'll load into `pandas`.
+
+```{python}
+#| code-fold: false
+for i in range(3):
+ print(f"{i:03} | {covid_json['data'][i]}")
+```
+
+Observations:
+* These look like equal-length records, so maybe `data` is a table!
+* But what does each value in the record mean? Where can we find the column headers?
+
+For that, we'll need the `columns` key in the metadata dictionary. This returns a list:
+
+```{python}
+#| code-fold: false
+type(covid_json['meta']['view']['columns'])
+```
+
+###### Summary of exploring the JSON file
+
+1. The above **metadata** tells us a lot about the columns in the data including column names, potential data anomalies, and a basic statistic.
+1. Because of its non-tabular structure, JSON makes it easier (than CSV) to create **self-documenting data**, meaning that information about the data is stored in the same file as the data.
+1. Self-documenting data can be helpful since it maintains its own description and these descriptions are more likely to be updated as data changes.
+
+###### Loading COVID Data into `pandas`
+Finally, let's load the data (not the metadata) into a `pandas` `DataFrame`. In the following block of code we:
+
+1. Translate the JSON records into a `DataFrame`:
+
+ * fields: `covid_json['meta']['view']['columns']`
+ * records: `covid_json['data']`
+
+
+1. Remove columns that have no metadata description. This would be a bad idea in general, but here we remove these columns since the above analysis suggests they are unlikely to contain useful information.
+
+1. Examine the `tail` of the table.
+
+```{python}
+#| code-fold: false
+# Load the data from JSON and assign column titles
+covid = pd.DataFrame(
+ covid_json['data'],
+ columns=[c['name'] for c in covid_json['meta']['view']['columns']])
+
+covid.tail()
+```
+
+### Variable Types
+
+After loading data from a file, it's a good idea to take the time to understand what pieces of information are encoded in the dataset. In particular, we want to identify what variable types are present in our data. Broadly speaking, we can categorize variables into one of two overarching types.
+
+**Quantitative variables** describe some numeric quantity or amount. We can divide quantitative data further into:
+
+* **Continuous quantitative variables**: numeric data that can be measured on a continuous scale to arbitrary precision. Continuous variables do not have a strict set of possible values – they can be recorded to any number of decimal places. For example, weights, GPA, or CO<sub>2</sub> concentrations.
+* **Discrete quantitative variables**: numeric data that can only take on a finite set of possible values. For example, someone's age or the number of siblings they have.
+
+**Qualitative variables**, also known as **categorical variables**, describe data that isn't measuring some quantity or amount. The sub-categories of categorical data are:
+
+* **Ordinal qualitative variables**: categories with ordered levels. Specifically, ordinal variables are those where the difference between levels has no consistent, quantifiable meaning. Some examples include levels of education (high school, undergrad, grad, etc.), income bracket (low, medium, high), or Yelp rating.
+* **Nominal qualitative variables**: categories with no specific order. For example, someone's political affiliation or Cal ID number.
+
+![Classification of variable types](images/variable.png)
+
+Note that many variables don't sit neatly in just one of these categories. Qualitative variables could have numeric levels, and conversely, quantitative variables could be stored as strings.
+
+### Primary and Foreign Keys
+
+Last time, we introduced `.merge` as the `pandas` method for joining multiple `DataFrame`s together. In our discussion of joins, we touched on the idea of using a "key" to determine what rows should be merged from each table. Let's take a moment to examine this idea more closely.
+
+The **primary key** is the column or set of columns in a table that *uniquely* determine the values of the remaining columns. It can be thought of as the unique identifier for each individual row in the table. For example, a table of Data 100 students might use each student's Cal ID as the primary key.
+
+```{python}
+#| echo: false
+pd.DataFrame({"Cal ID":[3034619471, 3035619472, 3025619473, 3046789372], \
+ "Name":["Oski", "Ollie", "Orrie", "Ollie"], \
+ "Major":["Data Science", "Computer Science", "Data Science", "Economics"]})
+```
+
+The **foreign key** is the column or set of columns in a table that reference primary keys in other tables. Knowing a dataset's foreign keys can be useful when assigning the `left_on` and `right_on` parameters of `.merge`. In the table of office hour tickets below, `"Cal ID"` is a foreign key referencing the previous table.
+
+```{python}
+#| echo: false
+pd.DataFrame({"OH Request":[1, 2, 3, 4], \
+ "Cal ID":[3034619471, 3035619472, 3025619473, 3035619472], \
+ "Question":["HW 2 Q1", "HW 2 Q3", "Lab 3 Q4", "HW 2 Q7"]})
+```
+
+## Granularity, Scope, and Temporality
+
+After understanding the structure of the dataset, the next task is to determine what exactly the data represents. We'll do so by considering the data's granularity, scope, and temporality.
+
+### Granularity
+The **granularity** of a dataset is what a single row represents. You can also think of it as the level of detail included in the data. To determine the data's granularity, ask: what does each row in the dataset represent? Fine-grained data contains a high level of detail, with a single row representing a small individual unit. For example, each record may represent one person. Coarse-grained data is encoded such that a single row represents a large individual unit – for example, each record may represent a group of people.
+
+### Scope
+The **scope** of a dataset is the subset of the population covered by the data. If we were investigating student performance in Data Science courses, a dataset with a narrow scope might encompass all students enrolled in Data 100 whereas a dataset with an expansive scope might encompass all students in California.
+
+### Temporality
+The **temporality** of a dataset describes the periodicity over which the data was collected as well as when the data was most recently collected or updated.
+
+Time and date fields of a dataset could represent a few things:
+
+1. when the "event" happened
+2. when the data was collected, or when it was entered into the system
+3. when the data was copied into the database
+
+To fully understand the temporality of the data, it may also be necessary to standardize time zones or inspect recurring time-based trends in the data (do patterns recur in 24-hour periods? Over the course of a month? Seasonally?). The convention for standardizing time is Coordinated Universal Time (UTC), an international time standard measured at 0 degrees longitude that stays consistent throughout the year (no daylight savings). Berkeley's time zone, Pacific Standard Time (PST), is UTC-8; during daylight savings, Berkeley observes Pacific Daylight Time (PDT), which is UTC-7.
+
+#### Temporality with `pandas`' `dt` accessors
+Let's briefly look at how we can use `pandas`' `dt` accessors to work with dates/times in a dataset using the dataset you'll see in Lab 3: the Berkeley PD Calls for Service dataset.
+
+```{python}
+#| code-fold: true
+calls = pd.read_csv("data/Berkeley_PD_-_Calls_for_Service.csv")
+calls.head()
+```
+
+Looks like there are three columns with dates/times: `EVENTDT`, `EVENTTM`, and `InDbDate`.
+
+Most likely, `EVENTDT` stands for the date when the event took place, `EVENTTM` stands for the time of day the event took place (in 24-hr format), and `InDbDate` is the date this call is recorded onto the database.
+
+If we check the data type of these columns, we will see they are stored as strings. We can convert them to `datetime` objects using pandas `to_datetime` function.
+
+```{python}
+#| code-fold: false
+calls["EVENTDT"] = pd.to_datetime(calls["EVENTDT"])
+calls.head()
+```
+
+Now, we can use the `dt` accessor on this column.
+
+We can get the month:
+
+```{python}
+#| code-fold: false
+calls["EVENTDT"].dt.month.head()
+```
+
+Which day of the week the date is on:
+
+```{python}
+#| code-fold: false
+calls["EVENTDT"].dt.dayofweek.head()
+```
+
+Check the minimum values to see if there are any suspicious-looking 70s dates (which would suggest default UNIX-epoch timestamps):
+
+```{python}
+#| code-fold: false
+calls.sort_values("EVENTDT").head()
+```
+
+Doesn't look like it! We are good!
+
+
+We can also do many things with the `dt` accessor like switching time zones and converting time back to UNIX/POSIX time. Check out the documentation on [`.dt` accessor](https://pandas.pydata.org/docs/user_guide/basics.html#basics-dt-accessors) and [time series/date functionality](https://pandas.pydata.org/docs/user_guide/timeseries.html#).
+
+## Faithfulness
+
+At this stage in our data cleaning and EDA workflow, we've achieved quite a lot: we've identified how our data is structured, come to terms with what information it encodes, and gained insight as to how it was generated. Throughout this process, we should always recall the original intent of our work in Data Science – to use data to better understand and model the real world. To achieve this goal, we need to ensure that the data we use is faithful to reality; that is, that our data accurately captures the "real world."
+
+Data used in research or industry is often "messy" – there may be errors or inaccuracies that impact the faithfulness of the dataset. Signs that data may not be faithful include:
+
+* Unrealistic or "incorrect" values, such as negative counts, locations that don't exist, or dates set in the future
+* Violations of obvious dependencies, like an age that does not match a birthday
+* Clear signs that data was entered by hand, which can lead to spelling errors or fields that are incorrectly shifted
+* Signs of data falsification, such as fake email addresses or repeated use of the same names
+* Duplicated records or fields containing the same information
+* Truncated data, e.g., older versions of Microsoft Excel limited spreadsheets to 65,536 rows and 256 columns
+
+We often solve some of these more common issues in the following ways:
+
+* Spelling errors: apply corrections or drop records that aren't in a dictionary
+* Time zone inconsistencies: convert to a common time zone (e.g. UTC)
+* Duplicated records or fields: identify and eliminate duplicates (using primary keys)
+* Unspecified or inconsistent units: infer the units and check that values fall within reasonable ranges (a small sketch of such checks follows this list)
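+
+For instance, a rough sketch of de-duplication and range checks in `pandas`, using a small hypothetical table (the column names here are made up for illustration):
+
+```python
+import pandas as pd
+
+# hypothetical table with a primary key ("id"), an age field, and a count field
+records = pd.DataFrame({
+    "id":    [1, 1, 2, 3],
+    "age":   [21, 21, -3, 40],
+    "count": [10, 10, 5, 7],
+})
+
+# duplicated records: drop rows that repeat the primary key
+deduped = records.drop_duplicates(subset="id")
+
+# unrealistic values: flag negative ages or counts for closer inspection
+suspicious = deduped[(deduped["age"] < 0) | (deduped["count"] < 0)]
+print(suspicious)
+```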
+
+### Missing Values
+Another common issue encountered with real-world datasets is that of missing data. One strategy to resolve this is to simply drop any records with missing values from the dataset. This does, however, introduce the risk of inducing biases – it is possible that the missing or corrupt records may be systemically related to some feature of interest in the data. Another solution is to keep the data as `NaN` values.
+
+A third method to address missing data is to perform **imputation**: infer the missing values using other data available in the dataset. There is a wide variety of imputation techniques that can be implemented; some of the most common are listed below.
+
+* Average imputation: replace missing values with the average value for that field
+* Hot deck imputation: replace missing values with a value drawn at random from a similar record
+* Regression imputation: develop a model to predict missing values
+* Multiple imputation: replace missing values with multiple plausible values, producing several completed versions of the dataset
+
+Regardless of the strategy used to deal with missing data, we should think carefully about *why* particular records or fields may be missing – this can help inform whether or not the absence of these values is significant or meaningful.
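+
+As a small sketch of average imputation alongside simple interpolation (which we'll see again in the CO2 demo later), using a short hypothetical `Series` with missing entries:
+
+```python
+import numpy as np
+import pandas as pd
+
+s = pd.Series([2.0, np.nan, 4.0, np.nan, 8.0])
+
+# average imputation: fill NaNs with the mean of the observed values
+s_mean = s.fillna(s.mean())
+
+# linear interpolation: fill NaNs using the neighboring observed values
+s_interp = s.interpolate()
+
+print(s_mean.tolist())    # [2.0, ~4.67, 4.0, ~4.67, 8.0]
+print(s_interp.tolist())  # [2.0, 3.0, 4.0, 6.0, 8.0]
+```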
+
+# EDA Demo 1: Tuberculosis in the United States
+
+Now, let's walk through the data-cleaning and EDA workflow to see what we can learn about the presence of Tuberculosis in the United States!
+
+We will examine the data included in the [original CDC article](https://www.cdc.gov/mmwr/volumes/71/wr/mm7112a1.htm?s_cid=mm7112a1_w#T1_down) published in 2022, which reports TB cases through 2021.
+
+
+## CSVs and Field Names
+Suppose Table 1 was saved as a CSV file located in `data/cdc_tuberculosis.csv`.
+
+We can then explore the CSV (which is a text file, and does not contain binary-encoded data) in many ways:
+1. Using a text editor like emacs, vim, VSCode, etc.
+2. Opening the CSV directly in DataHub (read-only), Excel, Google Sheets, etc.
+3. The `Python` file object
+4. `pandas`, using `pd.read_csv()`
+
+To try out options 1 and 2, you can view or download the Tuberculosis dataset from the [lecture demo notebook](https://data100.datahub.berkeley.edu/hub/user-redirect/git-pull?repo=https%3A%2F%2Fgithub.com%2FDS-100%2Ffa23-student&urlpath=lab%2Ftree%2Ffa23-student%2Flecture%2Flec05%2Flec04-eda.ipynb&branch=main) under the `data` folder in the left hand menu. Notice how the CSV file is a type of **rectangular data (i.e., tabular data) stored as comma-separated values**.
+
+Next, let's try out option 3 using the `Python` file object. We'll look at the first four lines:
+
+```{python}
+#| code-fold: true
+with open("data/cdc_tuberculosis.csv", "r") as f:
+ i = 0
+ for row in f:
+ print(row)
+ i += 1
+ if i > 3:
+ break
+```
+
+Whoa, why are there blank lines interspersed between the lines of the CSV?
+
+You may recall that all line breaks in text files are encoded as the special newline character `\n`. Python's `print()` prints each string (which already ends in a newline) and then adds an extra newline of its own, producing the blank lines.
+
+If you're curious, we can use the `repr()` function to return the raw string with all special characters:
+
+```{python}
+#| code-fold: true
+with open("data/cdc_tuberculosis.csv", "r") as f:
+ i = 0
+ for row in f:
+ print(repr(row)) # print raw strings
+ i += 1
+ if i > 3:
+ break
+```
+
+Finally, let's try option 4 and use the tried-and-true Data 100 approach: `pandas`.
+
+```{python}
+#| code-fold: false
+tb_df = pd.read_csv("data/cdc_tuberculosis.csv")
+tb_df.head()
+```
+
+You may notice some strange things about this table: what's up with the "Unnamed" column names and the first row?
+
+Congratulations — you're ready to wrangle your data! Because of how things are stored, we'll need to clean the data a bit to name our columns better.
+
+A reasonable first step is to identify the row with the right header. The `pd.read_csv()` function ([documentation](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html)) has the convenient `header` parameter that we can set to use the elements in row 1 (the second row, since `pandas` counts from zero) as the appropriate column names:
+
+```{python}
+#| code-fold: false
+tb_df = pd.read_csv("data/cdc_tuberculosis.csv", header=1) # row index
+tb_df.head(5)
+```
+
+Wait...but now we can't differentiate between the "Number of TB cases" and "TB incidence" year columns. `pandas` has tried to make our lives easier by automatically adding ".1" to the latter columns, but this doesn't help us, as humans, understand the data.
+
+We can do this manually with `df.rename()` ([documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.rename.html?highlight=rename#pandas.DataFrame.rename)):
+
+```{python}
+#| code-fold: false
+rename_dict = {'2019': 'TB cases 2019',
+ '2020': 'TB cases 2020',
+ '2021': 'TB cases 2021',
+ '2019.1': 'TB incidence 2019',
+ '2020.1': 'TB incidence 2020',
+ '2021.1': 'TB incidence 2021'}
+tb_df = tb_df.rename(columns=rename_dict)
+tb_df.head(5)
+```
+
+## Record Granularity
+
+You might already be wondering: what's up with that first record?
+
+Row 0 is what we call a **rollup record**, or summary record. It's often useful when displaying tables to humans. The **granularity** of record 0 (Totals) vs the rest of the records (States) is different.
+
+Okay, EDA step two. How was the rollup record aggregated?
+
+Let's check if Total TB cases is the sum of all state TB cases. If we sum across all rows, each of the TB cases columns should come out to **2x** the total cases for that year (why do you think this is?).
+
+```{python}
+#| code-fold: true
+tb_df.sum(axis=0)
+```
+
+Whoa, what's going on with the TB cases in 2019, 2020, and 2021? Check out the column types:
+
+```{python}
+#| code-fold: true
+tb_df.dtypes
+```
+
+Since there are commas in the values for TB cases, the numbers are read in with the `object` datatype, or **storage type** (close to the `Python` string datatype). As a result, `pandas` is concatenating strings instead of adding integers (recall that `Python` can "sum", or concatenate, strings together: `"data" + "100"` evaluates to `"data100"`).
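+
+One way to repair this after loading would be to strip the commas and cast the columns ourselves. Here is a rough sketch, operating on a copy and assuming the renamed columns from above:
+
+```python
+# rough sketch: strip thousands separators and cast the case counts to integers
+tb_fixed = tb_df.copy()
+for col in ["TB cases 2019", "TB cases 2020", "TB cases 2021"]:
+    tb_fixed[col] = tb_fixed[col].str.replace(",", "").astype(int)
+tb_fixed.dtypes
+```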
+
+
+Fortunately `read_csv` also has a `thousands` parameter ([documentation](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html)):
+
+```{python}
+#| code-fold: false
+# improve readability: chaining method calls with outer parentheses/line breaks
+tb_df = (
+ pd.read_csv("data/cdc_tuberculosis.csv", header=1, thousands=',')
+ .rename(columns=rename_dict)
+)
+tb_df.head(5)
+```
+
+```{python}
+#| code-fold: false
+tb_df.sum()
+```
+
+The Total TB cases look right. Phew!
+
+Let's just look at the records with **state-level granularity**:
+
+```{python}
+#| code-fold: true
+state_tb_df = tb_df[1:]
+state_tb_df.head(5)
+```
+
+## Gather Census Data
+
+U.S. Census population estimates [source](https://www.census.gov/data/tables/time-series/demo/popest/2010s-state-total.html) (2019), [source](https://www.census.gov/data/tables/time-series/demo/popest/2020s-state-total.html) (2020-2021).
+
+Running the below cells cleans the data.
+There are a few new methods here:
+* `df.convert_dtypes()` ([documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.convert_dtypes.html)) conveniently converts columns to the best possible dtypes (for example, whole-number floats become nullable integers); the details are out of scope for this class.
+* `df.dropna()` ([documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.dropna.html)) will be explained in more detail next time.
+
+```{python}
+#| code-fold: true
+# 2010s census data
+census_2010s_df = pd.read_csv("data/nst-est2019-01.csv", header=3, thousands=",")
+census_2010s_df = (
+ census_2010s_df
+ .reset_index()
+ .drop(columns=["index", "Census", "Estimates Base"])
+ .rename(columns={"Unnamed: 0": "Geographic Area"})
+ .convert_dtypes() # "smart" converting of columns, use at your own risk
+ .dropna() # we'll introduce this next time
+)
+census_2010s_df['Geographic Area'] = census_2010s_df['Geographic Area'].str.strip('.')
+
+# with pd.option_context('display.min_rows', 30): # shows more rows
+# display(census_2010s_df)
+
+census_2010s_df.head(5)
+```
+
+Occasionally, you will want to modify code that you have imported. To re-import those modifications, you can either use `Python`'s `importlib` library:
+
+```python
+from importlib import reload
+reload(utils)
+```
+
+or use the `IPython` `autoreload` magic, which will automatically re-import code when files change:
+
+```python
+%load_ext autoreload
+%autoreload 2
+```
+
+```{python}
+#| code-fold: true
+# census 2020s data
+census_2020s_df = pd.read_csv("data/NST-EST2022-POP.csv", header=3, thousands=",")
+census_2020s_df = (
+ census_2020s_df
+ .reset_index()
+ .drop(columns=["index", "Unnamed: 1"])
+ .rename(columns={"Unnamed: 0": "Geographic Area"})
+ .convert_dtypes() # "smart" converting of columns, use at your own risk
+ .dropna() # we'll introduce this next time
+)
+census_2020s_df['Geographic Area'] = census_2020s_df['Geographic Area'].str.strip('.')
+
+census_2020s_df.head(5)
+```
+
+## Joining Data (Merging `DataFrame`s)
+
+Time to `merge`! Here we use the `DataFrame` method `df1.merge(right=df2, ...)` on `DataFrame` `df1` ([documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.merge.html)). Contrast this with the function `pd.merge(left=df1, right=df2, ...)` ([documentation](https://pandas.pydata.org/docs/reference/api/pandas.merge.html?highlight=pandas%20merge#pandas.merge)). Feel free to use either.
+
+```{python}
+#| code-fold: false
+# merge TB DataFrame with two US census DataFrames
+tb_census_df = (
+ tb_df
+ .merge(right=census_2010s_df,
+ left_on="U.S. jurisdiction", right_on="Geographic Area")
+ .merge(right=census_2020s_df,
+ left_on="U.S. jurisdiction", right_on="Geographic Area")
+)
+tb_census_df.head(5)
+```
+
+Having all of these columns is a little unwieldy. We could either drop the unneeded columns now, or just merge on smaller census `DataFrame`s. Let's do the latter.
+
+```{python}
+#| code-fold: false
+# try merging again, but cleaner this time
+tb_census_df = (
+ tb_df
+ .merge(right=census_2010s_df[["Geographic Area", "2019"]],
+ left_on="U.S. jurisdiction", right_on="Geographic Area")
+ .drop(columns="Geographic Area")
+ .merge(right=census_2020s_df[["Geographic Area", "2020", "2021"]],
+ left_on="U.S. jurisdiction", right_on="Geographic Area")
+ .drop(columns="Geographic Area")
+)
+tb_census_df.head(5)
+```
+
+## Reproducing Data: Compute Incidence
+
+Let's recompute incidence to make sure we know where the original CDC numbers came from.
+
+From the [CDC report](https://www.cdc.gov/mmwr/volumes/71/wr/mm7112a1.htm?s_cid=mm7112a1_w#T1_down): TB incidence is computed as “Cases per 100,000 persons using mid-year population estimates from the U.S. Census Bureau.”
+
+If we define a group as 100,000 people, then we can compute the TB incidence for a given state population as
+
+$$\text{TB incidence} = \frac{\text{TB cases in population}}{\text{groups in population}} = \frac{\text{TB cases in population}}{\text{population}/100000} $$
+
+$$= \frac{\text{TB cases in population}}{\text{population}} \times 100000$$
+
+Let's try this for 2019:
+
+```{python}
+#| code-fold: false
+tb_census_df["recompute incidence 2019"] = tb_census_df["TB cases 2019"]/tb_census_df["2019"]*100000
+tb_census_df.head(5)
+```
+
+Awesome!!!
+
+Let's use a for-loop and `Python` format strings to compute TB incidence for all years. `Python` f-strings are just used for the purposes of this demo, but they're handy to know when you explore data beyond this course ([documentation](https://docs.python.org/3/tutorial/inputoutput.html)).
+
+```{python}
+#| code-fold: false
+# recompute incidence for all years
+for year in [2019, 2020, 2021]:
+ tb_census_df[f"recompute incidence {year}"] = tb_census_df[f"TB cases {year}"]/tb_census_df[f"{year}"]*100000
+tb_census_df.head(5)
+```
+
+These numbers look pretty close!!! There are a few errors in the hundredths place, particularly in 2021. It may be useful to further explore reasons behind this discrepancy.
+
+```{python}
+#| code-fold: false
+tb_census_df.describe()
+```
+
+## Bonus EDA: Reproducing the Reported Statistic
+
+
+**How do we reproduce that reported statistic in the original [CDC report](https://www.cdc.gov/mmwr/volumes/71/wr/mm7112a1.htm?s_cid=mm7112a1_w)?**
+
+> Reported TB incidence (cases per 100,000 persons) increased **9.4%**, from **2.2** during 2020 to **2.4** during 2021 but was lower than incidence during 2019 (2.7). Increases occurred among both U.S.-born and non–U.S.-born persons.
+
+This is TB incidence computed across the entire U.S. population! How do we reproduce this?
+* We need to reproduce the "Total" TB incidences in our rolled record.
+* But our current `tb_census_df` only has 51 entries (50 states plus Washington, D.C.). There is no rolled record.
+* What happened...?
+
+Let's get exploring!
+
+Before we keep exploring, we'll set all indexes to more meaningful values, instead of just numbers that pertain to some row at some point. This will make our cleaning slightly easier.
+
+```{python}
+#| code-fold: true
+tb_df = tb_df.set_index("U.S. jurisdiction")
+tb_df.head(5)
+```
+
+```{python}
+#| code-fold: false
+census_2010s_df = census_2010s_df.set_index("Geographic Area")
+census_2010s_df.head(5)
+```
+
+```{python}
+#| code-fold: false
+census_2020s_df = census_2020s_df.set_index("Geographic Area")
+census_2020s_df.head(5)
+```
+
+It turns out that our merge above only kept state records, even though our original `tb_df` had the "Total" rolled record:
+
+```{python}
+#| code-fold: false
+tb_df.head()
+```
+
+Recall that `merge` performs an **inner** merge by default, meaning that it only preserves keys that are present in **both** `DataFrame`s.
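+
+If we weren't sure which keys were being dropped, one way to diagnose it (a hedged sketch, not part of the original analysis) is an outer merge with `indicator=True`, which labels whether each key appears in the left table, the right table, or both:
+
+```python
+# sketch: see which index keys fail to match across the two tables
+diagnostic = tb_df.merge(
+    right=census_2010s_df[["2019"]],
+    left_index=True, right_index=True,
+    how="outer", indicator=True
+)
+diagnostic["_merge"].value_counts()
+```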
+
+The rolled records in our census `DataFrame` have different `Geographic Area` fields, which was the key we merged on:
+
+```{python}
+#| code-fold: false
+census_2010s_df.head(5)
+```
+
+The Census `DataFrame` has several rolled records. The aggregate record we are looking for actually has the Geographic Area named "United States".
+
+One straightforward way to get the right merge is to rename the value itself. Because we now have the Geographic Area index, we'll use `df.rename()` ([documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.rename.html)):
+
+```{python}
+#| code-fold: false
+# rename rolled record for 2010s
+census_2010s_df.rename(index={'United States':'Total'}, inplace=True)
+census_2010s_df.head(5)
+```
+
+```{python}
+#| code-fold: false
+# same, but for 2020s rename rolled record
+census_2020s_df.rename(index={'United States':'Total'}, inplace=True)
+census_2020s_df.head(5)
+```
+
+<br/>
+
+Next let's rerun our merge. Note the different chaining, because we are now merging on indexes (`df.merge()` [documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.merge.html)).
+
+```{python}
+#| code-fold: false
+tb_census_df = (
+ tb_df
+ .merge(right=census_2010s_df[["2019"]],
+ left_index=True, right_index=True)
+ .merge(right=census_2020s_df[["2020", "2021"]],
+ left_index=True, right_index=True)
+)
+tb_census_df.head(5)
+```
+
+<br/>
+
+Finally, let's recompute our incidences:
+
+```{python}
+#| code-fold: false
+# recompute incidence for all years
+for year in [2019, 2020, 2021]:
+ tb_census_df[f"recompute incidence {year}"] = tb_census_df[f"TB cases {year}"]/tb_census_df[f"{year}"]*100000
+tb_census_df.head(5)
+```
+
+We reproduced the total U.S. incidences correctly!
+
+We're almost there. Let's revisit the quote:
+
+> Reported TB incidence (cases per 100,000 persons) increased **9.4%**, from **2.2** during 2020 to **2.4** during 2021 but was lower than incidence during 2019 (2.7). Increases occurred among both U.S.-born and non–U.S.-born persons.
+
+Recall that percent change from $A$ to $B$ is computed as
+$\text{percent change} = \frac{B - A}{A} \times 100$.
+
+```{python}
+#| code-fold: false
+#| tags: []
+incidence_2020 = tb_census_df.loc['Total', 'recompute incidence 2020']
+incidence_2020
+```
+
+```{python}
+#| code-fold: false
+#| tags: []
+incidence_2021 = tb_census_df.loc['Total', 'recompute incidence 2021']
+incidence_2021
+```
+
+```{python}
+#| code-fold: false
+#| tags: []
+difference = (incidence_2021 - incidence_2020)/incidence_2020 * 100
+difference
+```
+
+# EDA Demo 2: Mauna Loa CO<sub>2</sub> Data -- A Lesson in Data Faithfulness
+
+[Mauna Loa Observatory](https://gml.noaa.gov/ccgg/trends/data.html) has been monitoring CO<sub>2</sub> concentrations since 1958.
+
+```{python}
+#| code-fold: false
+co2_file = "data/co2_mm_mlo.txt"
+```
+
+Let's do some **EDA**!!
+
+## Reading this file into Pandas?
+Before we blindly read it into `pandas`, let's check out this `.txt` file. Some questions to keep in mind: Do we trust this file extension? How is the file structured?
+
+Lines 71-78 (inclusive) are shown below:
+
+ line number | file contents
+
+ 71 | # decimal average interpolated trend #days
+ 72 | # date (season corr)
+ 73 | 1958 3 1958.208 315.71 315.71 314.62 -1
+ 74 | 1958 4 1958.292 317.45 317.45 315.29 -1
+ 75 | 1958 5 1958.375 317.50 317.50 314.71 -1
+ 76 | 1958 6 1958.458 -99.99 317.10 314.85 -1
+ 77 | 1958 7 1958.542 315.86 315.86 314.98 -1
+ 78 | 1958 8 1958.625 314.93 314.93 315.94 -1
+
+
+Notice how:
+
+- The values are separated by white space, possibly tabs.
+- The data are aligned down the rows. For example, the month appears in the 7th to 8th character position of each line.
+- The 71st and 72nd lines in the file contain column headings split over two lines.
+
+We can use `read_csv` to read the data into a `pandas` `DataFrame`, and we provide several arguments to specify that the separators are white space, there is no header (**we will set our own column names**), and to skip the first 72 rows of the file.
+
+```{python}
+#| code-fold: false
+co2 = pd.read_csv(
+ co2_file, header = None, skiprows = 72,
+    sep = r'\s+' # delimiter for continuous whitespace (stay tuned for regex next lecture)
+)
+co2.head()
+```
+
+Congratulations! You've wrangled the data!
+
+<br/>
+
+...But our columns aren't named.
+**We need to do more EDA.**
+
+## Exploring Variable Feature Types
+
+The NOAA [webpage](https://gml.noaa.gov/ccgg/trends/) might have some useful tidbits (in this case it doesn't).
+
+Using the column descriptions from the file's header, we'll rerun `pd.read_csv`, but this time with some **custom column names.**
+
+```{python}
+#| code-fold: false
+co2 = pd.read_csv(
+ co2_file, header = None, skiprows = 72,
+    sep = r'\s+', # regex for continuous whitespace (next lecture)
+ names = ['Yr', 'Mo', 'DecDate', 'Avg', 'Int', 'Trend', 'Days']
+)
+co2.head()
+```
+
+## Visualizing CO<sub>2</sub>
+Scientific studies tend to have very clean data, right...? Let's jump right in and make a time series plot of CO2 monthly averages.
+
+```{python}
+#| code-fold: true
+sns.lineplot(x='DecDate', y='Avg', data=co2);
+```
+
+The code above uses the `seaborn` plotting library (abbreviated `sns`). We will cover it in the Visualization lecture; for now, you don't need to worry about how it works!
+
+Yikes! Plotting the data uncovered a problem. The sharp vertical lines suggest that we have some **missing values**. What happened here?
+
+```{python}
+#| code-fold: false
+co2.head()
+```
+
+```{python}
+#| code-fold: false
+co2.tail()
+```
+
+Some data have unusual values like -1 and -99.99.
+
+Let's check the description at the top of the file again.
+
+* -1 signifies a missing value for the number of days `Days` the equipment was in operation that month.
+* -99.99 denotes a missing monthly average `Avg`.
+
+How can we fix this? First, let's explore other aspects of our data. Understanding our data will help us decide what to do with the missing values.
+
+<br/>
+
+
+## Sanity Checks: Reasoning about the data
+First, we consider the shape of the data. How many rows should we have?
+
+* If the data are in chronological order, we should have one record per month.
+* The data run from March 1958 to August 2019.
+* We should have $12 \times (2019-1957) - 2 - 4 = 738$ records (a quick cross-check appears below).
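+
+As a small sketch of that arithmetic, we can count the months in the range directly:
+
+```python
+# sketch: count the months from March 1958 through August 2019 (inclusive)
+len(pd.period_range("1958-03", "2019-08", freq="M"))  # expect 738
+```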
+
+```{python}
+#| code-fold: false
+co2.shape
+```
+
+Nice!! The number of rows (i.e., records) matches our expectations.
+
+<br/>
+
+
+Let's now check the quality of each feature.
+
+## Understanding Missing Value 1: `Days`
+`Days` is a time field, so let's analyze other time fields to see if there is an explanation for missing values of days of operation.
+
+Let's start with **months**, `Mo`.
+
+Are we missing any records? Each month should appear 61 or 62 times (March 1958-August 2019).
+
+```{python}
+#| code-fold: false
+co2["Mo"].value_counts().sort_index()
+```
+
+As expected, Jan, Feb, Sep, Oct, Nov, and Dec have 61 occurrences, and the rest have 62.
+
+<br/>
+
+Next let's explore **days** `Days` itself, which is the number of days that the measurement equipment worked.
+
+```{python}
+#| code-fold: true
+sns.displot(co2['Days']);
+plt.title("Distribution of days feature"); # suppresses unneeded plotting output
+```
+
+In terms of data quality, a handful of months have averages based on measurements taken on fewer than half the days. In addition, there are nearly 200 missing values--**that's about 27% of the data**!
+
+<br/>
+
+Finally, let's check the last time feature, **year** `Yr`.
+
+Let's check to see if there is any connection between missingness and the year of the recording.
+
+```{python}
+#| code-fold: true
+sns.scatterplot(x="Yr", y="Days", data=co2);
+plt.title("Day field by Year"); # the ; suppresses output
+```
+
+**Observations**:
+
+* All of the missing data are in the early years of operation.
+* It appears there may have been problems with equipment in the mid to late 80s.
+
+**Potential Next Steps**:
+
+* Confirm these explanations through documentation about the historical readings.
+* Maybe drop the earliest recordings? However, we would want to delay such action until after we have examined the time trends and assessed whether there are any potential problems.
+
+<br/>
+
+## Understanding Missing Value 2: `Avg`
+Next, let's return to the -99.99 values in `Avg` to analyze the overall quality of the CO2 measurements. We'll plot a histogram of the average CO<sub>2</sub> measurements.
+
+```{python}
+#| code-fold: true
+# Histograms of average CO2 measurements
+sns.displot(co2['Avg']);
+```
+
+The non-missing values are in the 300-400 range, which is a typical range for CO2 levels.
+
+We also see that there are only a few missing `Avg` values (**<1% of values**). Let's examine all of them:
+
+```{python}
+#| code-fold: false
+co2[co2["Avg"] < 0]
+```
+
+There doesn't seem to be a pattern to these values, other than that most of these records were also missing `Days` data.
+
+## Drop, `NaN`, or Impute Missing `Avg` Data?
+
+How should we address the invalid `Avg` data?
+
+1. Drop records
+2. Set to NaN
+3. Impute using some strategy
+
+Remember we want to fix the following plot:
+
+```{python}
+#| code-fold: true
+sns.lineplot(x='DecDate', y='Avg', data=co2)
+plt.title("CO2 Average By Month");
+```
+
+Since we are plotting `Avg` vs `DecDate`, we should just focus on dealing with missing values for `Avg`.
+
+
+Let's consider a few options:
+1. Drop those records
+2. Replace -99.99 with NaN
+3. Substitute the -99.99s with a likely value for the average CO2
+
+What do you think are the pros and cons of each possible action?
+
+<br/>
+
+
+Let's examine each of these three options.
+
+```{python}
+#| code-fold: false
+# 1. Drop missing values
+co2_drop = co2[co2['Avg'] > 0]
+co2_drop.head()
+```
+
+```{python}
+#| code-fold: false
+# 2. Replace -99.99 with NaN
+co2_NA = co2.replace(-99.99, np.nan)
+co2_NA.head()
+```
+
+We'll also use a third version of the data.
+
+First, we note that the dataset already comes with a **substitute value** for the -99.99.
+
+From the file description:
+
+> The `interpolated` column includes average values from the preceding column (`average`)
+and **interpolated values** where data are missing. Interpolated values are
+computed in two steps...
+
+The `Int` feature has values that exactly match those in `Avg`, except when `Avg` is -99.99, and then a **reasonable** estimate is used instead.
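+
+We can sanity-check that claim directly. A small sketch, comparing the two columns on the rows where `Avg` is not missing:
+
+```python
+# sketch: wherever Avg is present (not -99.99), Int should agree with it exactly;
+# this should print True if the file description holds
+print((co2.loc[co2["Avg"] > 0, "Avg"] == co2.loc[co2["Avg"] > 0, "Int"]).all())
+```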
+
+So, the third version of our data will use the `Int` feature instead of `Avg`.
+
+```{python}
+#| code-fold: false
+# 3. Use interpolated column which estimates missing Avg values
+co2_impute = co2.copy()
+co2_impute['Avg'] = co2['Int']
+co2_impute.head()
+```
+
+What's a **reasonable** estimate?
+
+To answer this question, let's zoom in on a short time period, say the measurements in 1958 (where we know we have two missing values).
+
+```{python}
+#| code-fold: true
+# results of plotting data in 1958
+
+def line_and_points(data, ax, title):
+ # assumes single year, hence Mo
+ ax.plot('Mo', 'Avg', data=data)
+ ax.scatter('Mo', 'Avg', data=data)
+ ax.set_xlim(2, 13)
+ ax.set_title(title)
+ ax.set_xticks(np.arange(3, 13))
+
+def data_year(data, year):
+    return data[data["Yr"] == year]
+
+# uses matplotlib subplots
+# you may see more next week; focus on output for now
+fig, axes = plt.subplots(ncols = 3, figsize=(12, 4), sharey=True)
+
+year = 1958
+line_and_points(data_year(co2_drop, year), axes[0], title="1. Drop Missing")
+line_and_points(data_year(co2_NA, year), axes[1], title="2. Missing Set to NaN")
+line_and_points(data_year(co2_impute, year), axes[2], title="3. Missing Interpolated")
+
+fig.suptitle(f"Monthly Averages for {year}")
+plt.tight_layout()
+```
+
+In the big picture, since there are only 7 `Avg` values missing (**<1%** of 738 months), any of these approaches would work.
+
+However, there is some appeal to **option 3: imputing**:
+
+* It preserves the seasonal trends in the CO2 data.
+* Since we are plotting all months in our data as a line plot, imputed values avoid misleading gaps in the line.
+
+<br/>
+
+
+Let's replot our original figure with option 3:
+
+```{python}
+#| code-fold: true
+sns.lineplot(x='DecDate', y='Avg', data=co2_impute)
+plt.title("CO2 Average By Month, Imputed");
+```
+
+Looks pretty close to what we see on the NOAA [website](https://gml.noaa.gov/ccgg/trends/)!
+
+## Presenting the data: A Discussion on Data Granularity
+
+From the description:
+
+* Monthly measurements are averages of daily average measurements.
+* The NOAA GML website has datasets for daily/hourly measurements too.
+
+The data you present depends on your research question.
+
+**How do CO2 levels vary by season?**
+
+* You might want to keep average monthly data.
+
+**Are CO2 levels rising over the past 50+ years, consistent with global warming predictions?**
+
+* You might be happier with a **coarser granularity** of average year data!
+
+```{python}
+#| code-fold: true
+co2_year = co2_impute.groupby('Yr').mean()
+sns.lineplot(x='Yr', y='Avg', data=co2_year)
+plt.title("CO2 Average By Year");
+```
+
+Indeed, we see a rise by nearly 100 ppm of CO2 since Mauna Loa began recording in 1958.
+
+# Summary
+We went over a lot of content this lecture; let's summarize the most important points:
+
+## Dealing with Missing Values
+There are a few options we can take to deal with missing data:
+
+* Drop missing records
+* Keep `NaN` missing values
+* Impute using an interpolated column
+
+## EDA and Data Wrangling
+There are several ways to approach EDA and Data Wrangling:
+
+* Examine the **data and metadata**: what is the date, size, organization, and structure of the data?
+* Examine each **field/attribute/dimension** individually.
+* Examine pairs of related dimensions (e.g. breaking down grades by major).
+* Along the way, we can:
+ * **Visualize** or summarize the data.
+ * **Validate assumptions** about data and its collection process. Pay particular attention to when the data was collected.
+ * Identify and **address anomalies**.
+ * Apply data transformations and corrections (we'll cover this in the upcoming lecture).
+ * **Record everything you do!** Developing in Jupyter Notebook promotes *reproducibility* of your own work!
Notice that we use double brackets to extract this column. Why double brackets instead of just single brackets? The `.fit` method, by default, expects to receive 2-dimensional data – some kind of data that includes both rows and columns. Writing `penguins["flipper_length_mm"]` would return a 1D `Series`, causing `sklearn` to error. We avoid this by writing `penguins[["flipper_length_mm"]]` to produce a 2D `DataFrame`.
print(f"The RMSE of the model is {np.sqrt(np.mean((Y-Y_hat_two_features)**2))}")
-
The RMSE of the model is 0.9881331104079044
+The RMSE of the model is 0.9881331104079045
We can also see that we obtain the same predictions using `sklearn` as we did when applying the ordinary least squares formula before!
print(f"MSE of model with (hp^2) feature: {np.mean((Y-hp2_model_predictions)**2)}")
-
MSE of model with (hp^2) feature: 18.984768907617223
+MSE of model with (hp^2) feature: 18.984768907617216
diff --git a/docs/feature_engineering/feature_engineering_files/figure-html/cell-16-output-2.png b/docs/feature_engineering/feature_engineering_files/figure-html/cell-16-output-2.png
index 92cb01c9..f8396667 100644
Binary files a/docs/feature_engineering/feature_engineering_files/figure-html/cell-16-output-2.png and b/docs/feature_engineering/feature_engineering_files/figure-html/cell-16-output-2.png differ
diff --git a/docs/feature_engineering/feature_engineering_files/figure-html/cell-17-output-2.png b/docs/feature_engineering/feature_engineering_files/figure-html/cell-17-output-2.png
index f4ae4ea0..ceecd30f 100644
Binary files a/docs/feature_engineering/feature_engineering_files/figure-html/cell-17-output-2.png and b/docs/feature_engineering/feature_engineering_files/figure-html/cell-17-output-2.png differ
diff --git a/docs/gradient_descent/gradient_descent.html b/docs/gradient_descent/gradient_descent.html
index 467ee5fb..ed238d2c 100644
--- a/docs/gradient_descent/gradient_descent.html
+++ b/docs/gradient_descent/gradient_descent.html
@@ -106,7 +106,7 @@
require.undef("plotly");
requirejs.config({
paths: {
- 'plotly': ['https://cdn.plot.ly/plotly-2.25.2.min']
+ 'plotly': ['https://cdn.plot.ly/plotly-2.12.1.min']
}
});
require(['plotly'], function(Plotly) {
@@ -439,9 +439,9 @@
-