From 449970610f06e0c861fa08f615c7007ebf882ae5 Mon Sep 17 00:00:00 2001
From: Lillian Weng
Date: Mon, 23 Oct 2023 23:21:17 -0700
Subject: [PATCH] final changes to random var notes

---
 probability_1/probability_1.html  | 67 +++++++++++++++----------------
 probability_1/probability_1.ipynb | 50 +++++++++++------------
 probability_1/probability_1.qmd   | 50 +++++++++++------------
 3 files changed, 78 insertions(+), 89 deletions(-)

diff --git a/probability_1/probability_1.html b/probability_1/probability_1.html
index df06be04..6648c8c7 100644
--- a/probability_1/probability_1.html
+++ b/probability_1/probability_1.html
@@ -56,8 +56,6 @@

Random Variables

  • Expectation and Variance
  • @@ -126,14 +124,16 @@

    Random Variables

  • Estimators, Bias, and Variance: re-express the ideas of model variance and training error in terms of random variables and use this new perspective to investigate our choice of model complexity
  • -
    + +

    Recall the following concepts from Data 8:

      @@ -147,6 +147,7 @@

      Random Variables

    +

    Random Variables and Distributions

    Suppose we generate a set of random data, like a random sample from some population. A random variable is a numerical function of the randomness in the data. It is random since our sample was drawn at random; it is variable because its exact value depends on how this random sample came out. As such, the domain or input of our random variable is all possible (random) outcomes in a sample space, and its range or output is the number line. We typically denote random variables with uppercase letters, such as \(X\) or \(Y\).

    @@ -157,14 +158,9 @@

    Distribution

  • Possible values: the set of values the random variable can take on.
  • Probabilities: the set of probabilities describing how the total probability of 100% is split over the possible values.
  • -

    If \(X\) is discrete (has a finite number of possible values),

    -
      -
    • The probability that a random variable \(X\) takes on the value \(x\) is given by \(P(X=x)\).
    • -
    • Probabilities must sum to 1: \(\sum_{\text{all} x} P(X=x) = 1\),
    • -
    -

    We can often display this using a probability distribution table (example shown below).

    -

    The distribution of a random variable \(X\) is a description of how the total probability of 100% is split over all the possible values of \(X\), and it fully defines a random variable.

    -

    The distribution of a discrete random variable can also be represented using a histogram. If a variable is continuous – it can take on infinitely many values – we can illustrate its distribution using a density curve.

    +

    If \(X\) is discrete (has a finite number of possible values), the probability that a random variable \(X\) takes on the value \(x\) is given by \(P(X=x)\), and probabilities must sum to 1: \(\sum_{\text{all } x} P(X=x) = 1\).

    +

    We can often display this using a probability distribution table, which you will see in the coin toss example below.

    +

    The distribution of a random variable \(X\) is a description of how the total probability of 100% is split over all the possible values of \(X\), and it fully defines a random variable. The distribution of a discrete random variable can also be represented using a histogram. If a variable is continuous – it can take on infinitely many values – we can illustrate its distribution using a density curve.

    discrete_continuous

    @@ -180,7 +176,7 @@

    Example: Tossing a Coin

    \[X = \begin{cases} 1, \text{if the coin lands heads} \\ 0, \text{if the coin lands tails} \end{cases}\]

    -

    \(X\) is a function with a domain (input) of \(\{H, T\}\) and a range (output) of \(\{1, 0\}\). We can write this in function notation as \[\begin{cases} X(H) = 1 \\ X(T) = 0 \end{cases}\] and the probability distribution table of \(X\) is

    +

    \(X\) is a function with a domain, or input, of \(\{H, T\}\) and a range, or output, of \(\{1, 0\}\). We can write this in function notation as \[\begin{cases} X(H) = 1 \\ X(T) = 0 \end{cases}\] The probability distribution table of \(X\) is given by:
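As a quick sanity check, this two-row distribution table can be constructed and validated in code; the pandas construction below is an illustrative sketch of my own, not part of the notes:

```python
import pandas as pd

# Probability distribution table of X for a fair coin:
# possible values of X and the probability of each
dist = pd.DataFrame({"x": [1, 0], "P(X=x)": [0.5, 0.5]})

# A valid distribution must have its probabilities sum to 1
assert dist["P(X=x)"].sum() == 1.0
print(dist)
```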

    @@ -205,7 +201,7 @@

    Example: Tossing a

    We can show the distribution of \(Y\) in the following tables. The table on the left lists all possible samples of \(s\) and the number of times they can appear (\(Y(s)\)). We can use this to calculate the values for the table on the right, a probability distribution table.

    -distribution +distribution

    @@ -215,14 +211,17 @@

    Simulation

    Expectation and Variance

    -

    There are several ways to describe a random variable. The methods shown above - table of all samples \(s, X(s)\), distribution table \(P(X=x)\), and histograms - are all definitions that fully describe a random variable. Often, it is easier to describe a random variable using some numerical summary, rather than fully defining its distribution. These numerical summaries are numbers that characterize some properties of the random variable. Because they give a “summary” of how the variable tends to behave, they are not random – think of them as a static number that describes a certain property of the random variable. In Data 100, we will focus our attention on the expectation and variance of a random variable.

    +

    There are several ways to describe a random variable. The methods shown above – a table of all samples \(s, X(s)\), distribution table \(P(X=x)\), and histograms – are all definitions that fully describe a random variable. Often, it is easier to describe a random variable using some numerical summary rather than fully defining its distribution. These numerical summaries are numbers that characterize some properties of the random variable. Because they give a “summary” of how the variable tends to behave, they are not random – think of them as a static number that describes a certain property of the random variable. In Data 100, we will focus our attention on the expectation and variance of a random variable.

    Expectation

    -

    The expectation of a random variable \(X\) is the weighted average of the values of \(X\), where the weights are the probabilities of each value occurring. There are two equivalent ways to compute the expectation: 1. Apply the weights one sample at a time: \[\mathbb{E}[X] = \sum_{\text{all possible } s} X(s) P(s)\]. 2. Apply the weights one possible value at a time: \[\mathbb{E}[X] = \sum_{\text{all possible } x} x P(X=x)\]

    -

    We want to emphasize that the expectation is a number, not a random variable. Expectation is a generalization of the average, and it has the same units as the random variable. It is also the center of gravity of the probability distribution histogram. If we simulate the variable many times, it is the long-run average of the random variable.

    -
    -
    -

    Example 1: Coin Toss

    +

    The expectation of a random variable \(X\) is the weighted average of the values of \(X\), where the weights are the probabilities of each value occurring. There are two equivalent ways to compute the expectation:

    +
      +
    1. Apply the weights one sample at a time: \[\mathbb{E}[X] = \sum_{\text{all possible } s} X(s) P(s)\]
    2. +
    3. Apply the weights one possible value at a time: \[\mathbb{E}[X] = \sum_{\text{all possible } x} x P(X=x)\]
    4. +
    +

    We want to emphasize that the expectation is a number, not a random variable. Expectation is a generalization of the average, and it has the same units as the random variable. It is also the center of gravity of the probability distribution histogram, meaning if we simulate the variable many times, it is the long-run average of the random variable.
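To make the two equivalent formulas concrete, here is a small sketch using a hypothetical example of counting heads in two fair coin tosses (this particular example is an assumption of mine, not one worked in the notes):

```python
from fractions import Fraction
from itertools import product

# Sample space of two fair coin tosses; each outcome s has P(s) = 1/4
outcomes = list(product("HT", repeat=2))

def X(s):
    """X = number of heads in the outcome s."""
    return s.count("H")

# Way 1: apply the weights one *sample* at a time: E[X] = sum over s of X(s) P(s)
e_by_sample = sum(Fraction(1, 4) * X(s) for s in outcomes)

# Way 2: apply the weights one possible *value* at a time: E[X] = sum over x of x P(X=x)
p = {0: Fraction(1, 4), 1: Fraction(1, 2), 2: Fraction(1, 4)}
e_by_value = sum(x * px for x, px in p.items())

assert e_by_sample == e_by_value == 1  # both formulas give E[X] = 1
```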

    +
    +

    Example 1: Coin Toss

    Going back to our coin toss example, we define a random variable \(X\) as: \[X = \begin{cases} 1, \text{if the coin lands heads} \\ 0, \text{if the coin lands tails} @@ -232,8 +231,8 @@

    Example 1: Coin Toss Note that \(\mathbb{E}[X] = 0.5\) is not a possible value of \(X\); it’s an average. The expectation of X does not need to be a possible value of X.

    -
    -

    Example 2

    +
    +

    Example 2

    Consider the random variable \(X\):

    @@ -268,14 +267,15 @@

    Example 2

    &= 5.9 \end{align}\] Again, note that \(\mathbb{E}[X] = 5.9\) is not a possible value of \(X\); it’s an average. The expectation of X does not need to be a possible value of X.

    +

    Variance

    The variance of a random variable is a measure of its chance error. It is defined as the expected squared deviation from the expectation of \(X\). Put more simply, variance asks: how far does \(X\) typically vary from its average value, just by chance? What is the spread of \(X\)’s distribution?

    \[\text{Var}(X) = \mathbb{E}[(X-\mathbb{E}[X])^2]\]

    -

    The units of variance are the square of the units of \(X\). To get it back to the right scale, use the standard deviation of \(X\): \(\text{SD}(X) = \sqrt{\text{Var}(X)}\).

    +

    The units of variance are the square of the units of \(X\). To get it back to the right scale, use the standard deviation of \(X\): \[\text{SD}(X) = \sqrt{\text{Var}(X)}\]

    Like with expectation, variance is a number, not a random variable! Its main use is to quantify chance error.

    By Chebyshev’s inequality, which you saw in Data 8, no matter what the shape of the distribution of \(X\) is, the vast majority of the probability lies in the interval “expectation plus or minus a few SDs.”

    -

    If we expand the square and use properties of expectation, we can re-express variance as the computational formula for variance. This form is often more convenient to use when computing the variance of a variable by hand, and it is also useful in Mean Squared Error calculations (if \(X\) is centered and \(E(X)=0\), then \(\mathbb{E}[X^2] = \text{Var}(X)\)).

    +

    If we expand the square and use properties of expectation, we can re-express variance as the computational formula for variance. This form is often more convenient to use when computing the variance of a variable by hand, and it is also useful in Mean Squared Error calculations, as \(\mathbb{E}[X^2] = \text{Var}(X)\) if \(X\) is centered and \(E(X)=0\).

    \[\text{Var}(X) = \mathbb{E}[X^2] - (\mathbb{E}[X])^2\]
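The definitional and computational formulas can be checked against each other with exact arithmetic. The sketch below uses a fair six-sided die as an assumed example, for which \(\text{Var}(X) = \frac{91}{6} - (\frac{7}{2})^2 = \frac{35}{12}\):

```python
from fractions import Fraction

# Distribution of X = face of a fair six-sided die: P(X=x) = 1/6 for x in 1..6
dist = {x: Fraction(1, 6) for x in range(1, 7)}

e = sum(x * p for x, p in dist.items())                    # E[X] = 7/2
var_def = sum((x - e) ** 2 * p for x, p in dist.items())   # E[(X - E[X])^2]
var_comp = sum(x**2 * p for x, p in dist.items()) - e**2   # E[X^2] - (E[X])^2

# Both formulas give Var(X) = 35/12
assert var_def == var_comp == Fraction(35, 12)
```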

    Sums of Random Variables

    -

    Often, we will work with multiple random variables at the same time. A function of a random variable is also a random variable, so if you create multiple random variables based on your sample, then functions of those random variables are also random variables.

    +

    Often, we will work with multiple random variables at the same time. A function of a random variable is also a random variable; if you create multiple random variables based on your sample, then functions of those random variables are also random variables.

    For example, if \(X_1, X_2, ..., X_n\) are random variables, then so are all of these:

    • \(X_n^2\)
    • @@ -374,16 +374,16 @@

      \(X_1\) and \(X_2\) be numbers on rolls of two fair dice. \(X_1\) and \(X_2\) are i.i.d., so \(X_1\) and \(X_2\) have the same distribution. However, the sums \(Y = X_1 + X_1 = 2X_1\) and \(Z=X_1+X_2\) have different distributions but the same expectation.

      -distribution +distribution

      However, \(Y = 2X_1\) has a larger variance.

      -distribution +distribution
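The claims about \(Y = 2X_1\) and \(Z = X_1 + X_2\) can also be checked by simulation, in the spirit of the `np.random.choice` approach mentioned earlier (the seed and sample size below are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(42)  # arbitrary seed for reproducibility
faces = np.arange(1, 7)

# X1 and X2: i.i.d. rolls of two fair dice
x1 = rng.choice(faces, size=100_000)
x2 = rng.choice(faces, size=100_000)

y = 2 * x1       # Y = X1 + X1 = 2 X1
z = x1 + x2      # Z = X1 + X2

# Same expectation (both close to 7), but Var(Y) = 4 Var(X1) is about
# twice Var(Z) = 2 Var(X1), where Var(X1) = 35/12
print(y.mean(), z.mean())
print(y.var(), z.var())
```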

    Properties of Expectation

    -

    Instead of simulating full distributions, we often just compute expectation and variance directly. Recall the definition of expectation: \[\mathbb{E}[X] = \sum_{x} x P(X=x)\]

    +

    Instead of simulating full distributions, we often just compute expectation and variance directly. Recall the definition of expectation: \[\mathbb{E}[X] = \sum_{x} x P(X=x)\] From it, we can derive some useful properties of expectation:

    1. Linearity of expectation. The expectation of the linear transformation \(aX+b\), where \(a\) and \(b\) are constants, is:
    @@ -435,15 +435,12 @@

    Properties of Ex
      -
    1. If \(g\) is a non-linear function, then in general, \[\mathbb{E}[g(X)] \neq g(\mathbb{E}[X])\]
    2. +
    3. If \(g\) is a non-linear function, then in general, \[\mathbb{E}[g(X)] \neq g(\mathbb{E}[X])\] For example, if \(X\) is -1 or 1 with equal probability, then \(\mathbb{E}[X] = 0\), but \(\mathbb{E}[X^2] = 1 \neq 0\).
    -
      -
    • For example, if \(X\) is -1 or 1 with equal probability, then \(\mathbb{E}[X] = 0\) but \(\mathbb{E}[X^2] = 1 \neq 0\)
    • -
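The counterexample above is easy to verify with exact fractions; a minimal check:

```python
from fractions import Fraction

# X is -1 or 1 with equal probability
dist = {-1: Fraction(1, 2), 1: Fraction(1, 2)}

e_x = sum(x * p for x, p in dist.items())      # E[X] = 0, so g(E[X]) = 0 for g(x) = x^2
e_x2 = sum(x**2 * p for x, p in dist.items())  # E[g(X)] = E[X^2] = 1

assert e_x == 0 and e_x2 == 1  # E[X^2] = 1, which differs from (E[X])^2 = 0
```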

    Properties of Variance

    -

    Recall the definition of variance: \[\text{Var}(X) = \mathbb{E}[(X-\mathbb{E}[X])^2]\]

    +

    Recall the definition of variance: \[\text{Var}(X) = \mathbb{E}[(X-\mathbb{E}[X])^2]\] Combining it with the properties of expectation, we can derive some useful properties of variance:

    1. Unlike expectation, variance is non-linear. The variance of the linear transformation \(aX+b\) is: \[\text{Var}(aX+b) = a^2 \text{Var}(X)\]
    @@ -467,7 +464,7 @@

    Properties of Varia

    In order to compute \(\text{Var}(aX+b)\), consider that a shift by \(b\) units does not affect spread, so \(\text{Var}(aX+b) = \text{Var}(aX)\).

    Then, \[\begin{align} \text{Var}(aX+b) &= \text{Var}(aX) \\ - &= E((aX)^2) - (E(aX))^2 + &= E((aX)^2) - (E(aX))^2 \\ &= E(a^2 X^2) - (aE(X))^2\\ &= a^2 (E(X^2) - (E(X))^2) \\ &= a^2 \text{Var}(X) @@ -612,7 +609,7 @@
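The derived result \(\text{Var}(aX+b) = a^2 \text{Var}(X)\) can be confirmed numerically; the die population and the constants \(a\) and \(b\) below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.choice(np.arange(1, 7), size=200_000)  # simulated fair-die values of X

a, b = 3, 10
transformed = a * x + b  # the linear transformation aX + b

# The shift by b leaves the spread untouched; the scale a multiplies variance by a^2
assert np.isclose(transformed.var(), a**2 * x.var())
```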

    Sample Mean

    Central Limit Theorem

    The CLT states that no matter what population you are drawing from, if an i.i.d. sample of size \(n\) is large, the probability distribution of the sample mean is roughly normal with mean \(\mu\) and SD \(\sigma/\sqrt{n}\).

    Any theorem that provides the rough distribution of a statistic and doesn’t need the distribution of the population is valuable to data scientists because we rarely know a lot about the population!

    -

    For a more in-depth demo check out onlinestatbook.

    +

    For a more in-depth demo, check out onlinestatbook.
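A small simulation also illustrates the CLT statement above; the exponential population (which is strongly skewed) and the constants are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
mu, sigma, n = 1.0, 1.0, 100  # an exponential(1) population has mean 1 and SD 1

# Draw 10,000 i.i.d. samples of size n and record each sample mean
sample_means = rng.exponential(scale=mu, size=(10_000, n)).mean(axis=1)

# CLT: the sample means are roughly normal with mean mu and SD sigma / sqrt(n)
print(sample_means.mean())  # close to mu = 1
print(sample_means.std())   # close to sigma / sqrt(n) = 0.1
```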

    The CLT applies if the sample size \(n\) is large, but how large does \(n\) have to be for the normal approximation to be good? It depends on the shape of the distribution of the population.

    • If the population is roughly symmetric and unimodal/uniform, could need as few as \(n = 20\).
    • diff --git a/probability_1/probability_1.ipynb b/probability_1/probability_1.ipynb index 76b093bf..265440f6 100644 --- a/probability_1/probability_1.ipynb +++ b/probability_1/probability_1.ipynb @@ -42,8 +42,8 @@ "1. Random Variables Estimators: introduce random variables, considering the concepts of expectation, variance, and covariance\n", "2. Estimators, Bias, and Variance: re-express the ideas of model variance and training error in terms of random variables and use this new perspective to investigate our choice of model complexity\n", "\n", - "::: {.callout-tip}\n", - "## Data 8\n", + "::: {.callout-tip collapse=\"true\"}\n", + "## Data 8 Recap\n", "Recall the following concepts from Data 8: \n", "\n", "1. Sample mean: the mean of your random sample\n", @@ -72,16 +72,11 @@ "1. Possible values: the set of values the random variable can take on.\n", "2. Probabilities: the set of probabilities describing how the total probability of 100% is split over the possible values.\n", "\n", - "If $X$ is discrete (has a finite number of possible values),\n", - "\n", - "* The probability that a random variable $X$ takes on the value $x$ is given by $P(X=x)$.\n", - "* Probabilities must sum to 1: $\\sum_{\\text{all} x} P(X=x) = 1$,\n", - "\n", - "We can often display this using a **probability distribution table** (example shown below).\n", + "If $X$ is discrete (has a finite number of possible values), the probability that a random variable $X$ takes on the value $x$ is given by $P(X=x)$, and probabilities must sum to 1: $\\sum_{\\text{all} x} P(X=x) = 1$,\n", "\n", - "The **distribution** of a random variable $X$ is a description of how the total probability of 100% is split over all the possible values of $X$, and it fully defines a random variable.\n", + "We can often display this using a **probability distribution table**, which you will see in the coin toss example below.\n", "\n", - "The distribution of a discrete random variable can also be represented using a 
histogram. If a variable is **continuous** – it can take on infinitely many values – we can illustrate its distribution using a density curve. \n", + "The **distribution** of a random variable $X$ is a description of how the total probability of 100% is split over all the possible values of $X$, and it fully defines a random variable. The distribution of a discrete random variable can also be represented using a histogram. If a variable is **continuous** – it can take on infinitely many values – we can illustrate its distribution using a density curve. \n", "\n", "

      \n", "discrete_continuous\n", @@ -103,9 +98,9 @@ " 0, \\text{if the coin lands tails} \n", " \\end{cases}$$\n", "\n", - "$X$ is a function with a domain (input) of $\\{H, T\\}$ and a range (output) of $\\{1, 0\\}$. We can write this in function notation as \n", + "$X$ is a function with a domain, or input, of $\\{H, T\\}$ and a range, or output, of $\\{1, 0\\}$. We can write this in function notation as \n", "$$\\begin{cases} X(H) = 1 \\\\ X(T) = 0 \\end{cases}$$\n", - "and the probability distribution table of $X$ is \n", + "The probability distribution table of $X$ is given by.\n", "\n", "| $x$ | $P(X=x)$ | \n", "| --- | -------- |\n", @@ -122,7 +117,7 @@ "We can show the distribution of $Y$ in the following tables. The table on the left lists all possible samples of $s$ and the number of times they can appear ($Y(s)$). We can use this to calculate the values for the table on the right, a **probability distribution table**. \n", "\n", "

      \n", - "distribution\n", + "distribution\n", "

      \n", "\n", "### Simulation\n", @@ -134,16 +129,17 @@ "metadata": {}, "source": [ "## Expectation and Variance\n", - "There are several ways to describe a random variable. The methods shown above - table of all samples $s, X(s)$, distribution table $P(X=x)$, and histograms - are all definitions that fully describe a random variable. Often, it is easier to describe a random variable using some numerical summary, rather than fully defining its distribution. These numerical summaries are numbers that characterize some properties of the random variable. Because they give a \"summary\" of how the variable tends to behave, they are *not* random – think of them as a static number that describes a certain property of the random variable. In Data 100, we will focus our attention on the expectation and variance of a random variable.\n", + "There are several ways to describe a random variable. The methods shown above -- a table of all samples $s, X(s)$, distribution table $P(X=x)$, and histograms -- are all definitions that *fully describe* a random variable. Often, it is easier to describe a random variable using some *numerical summary* rather than fully defining its distribution. These numerical summaries are numbers that characterize some properties of the random variable. Because they give a \"summary\" of how the variable tends to behave, they are *not* random – think of them as a static number that describes a certain property of the random variable. In Data 100, we will focus our attention on the expectation and variance of a random variable.\n", "\n", "### Expectation\n", "The **expectation** of a random variable $X$ is the weighted average of the values of $X$, where the weights are the probabilities of each value occurring. There are two equivalent ways to compute the expectation: \n", + "\n", "1. Apply the weights one *sample* at a time: $$\\mathbb{E}[X] = \\sum_{\\text{all possible } s} X(s) P(s)$$.\n", "2. 
Apply the weights one possible *value* at a time: $$\\mathbb{E}[X] = \\sum_{\\text{all possible } x} x P(X=x)$$\n", "\n", - "We want to emphasize that the expectation is a *number*, not a random variable. Expectation is a generalization of the average, and it has the same units as the random variable. It is also the center of gravity of the probability distribution histogram. If we simulate the variable many times, it is the long-run average of the random variable.\n", + "We want to emphasize that the expectation is a *number*, not a random variable. Expectation is a generalization of the average, and it has the same units as the random variable. It is also the center of gravity of the probability distribution histogram, meaning if we simulate the variable many times, it is the long-run average of the random variable.\n", "\n", - "### Example 1: Coin Toss\n", + "#### Example 1: Coin Toss\n", "Going back to our coin toss example, we define a random variable $X$ as: \n", "$$X = \\begin{cases} \n", " 1, \\text{if the coin lands heads} \\\\\n", @@ -157,7 +153,7 @@ "\\end{align}$$\n", "Note that $\\mathbb{E}[X] = 0.5$ is not a possible value of $X$; it's an average. **The expectation of X does not need to be a possible value of X**.\n", "\n", - "### Example 2\n", + "#### Example 2\n", "Consider the random variable $X$: \n", "\n", "| $x$ | $P(X=x)$ | \n", @@ -181,13 +177,13 @@ "\n", "$$\\text{Var}(X) = \\mathbb{E}[(X-\\mathbb{E}[X])^2]$$\n", "\n", - "The units of variance are the square of the units of $X$. To get it back to the right scale, use the standard deviation of $X$: $\\text{SD}(X) = \\sqrt{\\text{Var}(X)}$.\n", + "The units of variance are the square of the units of $X$. To get it back to the right scale, use the standard deviation of $X$: $$\\text{SD}(X) = \\sqrt{\\text{Var}(X)}$$\n", "\n", "Like with expectation, **variance is a number, not a random variable**! 
Its main use is to quantify chance error.\n", "\n", "By [Chebyshev’s inequality](https://www.inferentialthinking.com/chapters/14/2/Variability.html#Chebychev's-Bounds), which you saw in Data 8, no matter what the shape of the distribution of X is, the vast majority of the probability lies in the interval “expectation plus or minus a few SDs.”\n", "\n", - "If we expand the square and use properties of expectation, we can re-express variance as the **computational formula for variance**. This form is often more convenient to use when computing the variance of a variable by hand, and it is also useful in Mean Squared Error calculations (if $X$ is centered and $E(X)=0$, then $\\mathbb{E}[X^2] = \\text{Var}(X)$).\n", + "If we expand the square and use properties of expectation, we can re-express variance as the **computational formula for variance**. This form is often more convenient to use when computing the variance of a variable by hand, and it is also useful in Mean Squared Error calculations, as $\\mathbb{E}[X^2] = \\text{Var}(X)$ if $X$ is centered and $E(X)=0$.\n", "\n", "$$\\text{Var}(X) = \\mathbb{E}[X^2] - (\\mathbb{E}[X])^2$$\n", "\n", @@ -242,7 +238,7 @@ "metadata": {}, "source": [ "## Sums of Random Variables\n", - "Often, we will work with multiple random variables at the same time. A function of a random variable is also a random variable, so if you create multiple random variables based on your sample, then functions of those random variables are also random variables.\n", + "Often, we will work with multiple random variables at the same time. A function of a random variable is also a random variable; if you create multiple random variables based on your sample, then functions of those random variables are also random variables.\n", "\n", "For example, if $X_1, X_2, ..., X_n$ are random variables, then so are all of these: \n", "\n", @@ -265,17 +261,18 @@ "For example, let $X_1$ and $X_2$ be numbers on rolls of two fair die. 
$X_1$ and $X_2$ are i.i.d, so $X_1$ and $X_2$ have the same distribution. However, the sums $Y = X_1 + X_1 = 2X_1$ and $Z=X_1+X_2$ have different distributions but the same expectation.\n", "\n", "

      \n", - "distribution\n", + "distribution\n", "

      \n", "\n", "However, $Y = 2X_1$ has a larger variance\n", "\n", "

      \n", - "distribution\n", + "distribution\n", "

      \n", "\n", "### Properties of Expectation \n", "Instead of simulating full distributions, we often just compute expectation and variance directly. Recall the definition of expectation: $$\\mathbb{E}[X] = \\sum_{x} x P(X=x)$$\n", + "From it, we can derive some useful properties of expectation: \n", "\n", "1. **Linearity of expectation**. The expectation of the linear transformation $aX+b$, where $a$ and $b$ are constants, is:\n", "\n", @@ -306,13 +303,12 @@ ":::\n", "\n", "3. If $g$ is a non-linear function, then in general, \n", - "$$\\mathbb{E}[g(X)] \\neq g(\\mathbb{E}[X])$$\n", - "\n", - "* For example, if $X$ is -1 or 1 with equal probability, then $\\mathbb{E}[X] = 0$ but $\\mathbb{E}[X^2] = 1 \\neq 0$\n", + "$$\\mathbb{E}[g(X)] \\neq g(\\mathbb{E}[X])$$ For example, if $X$ is -1 or 1 with equal probability, then $\\mathbb{E}[X] = 0$, but $\\mathbb{E}[X^2] = 1 \\neq 0$.\n", "\n", "### Properties of Variance\n", "Recall the definition of variance: \n", "$$\\text{Var}(X) = \\mathbb{E}[(X-\\mathbb{E}[X])^2]$$\n", + "Combining it with the properties of expectation, we can derive some useful properties of variance: \n", "\n", "1. Unlike expectation, variance is *non-linear*. The variance of the linear transformation $aX+b$ is:\n", "$$\\text{Var}(aX+b) = a^2 \\text{Var}(X)$$\n", @@ -329,7 +325,7 @@ "Then, \n", "$$\\begin{align}\n", " \\text{Var}(aX+b) &= \\text{Var}(aX) \\\\\n", - " &= E((aX)^2) - (E(aX))^2\n", + " &= E((aX)^2) - (E(aX))^2 \\\\\n", " &= E(a^2 X^2) - (aE(X))^2\\\\\n", " &= a^2 (E(X^2) - (E(X))^2) \\\\\n", " &= a^2 \\text{Var}(X)\n", @@ -454,7 +450,7 @@ "\n", "Any theorem that provides the rough distribution of a statistic and doesn’t need the distribution of the population is valuable to data scientists because we rarely know a lot about the population!\n", "\n", - "For a more in-depth demo check out [onlinestatbook](https://onlinestatbook.com/stat_sim/sampling_dist/). 
\n", + "For a more in-depth demo, check out [onlinestatbook](https://onlinestatbook.com/stat_sim/sampling_dist/). \n", "\n", "The CLT applies if the sample size $n$ is large, but how large does n have to be for the normal approximation to be good? It depends on the shape of the distribution of the population.\n", "\n", diff --git a/probability_1/probability_1.qmd b/probability_1/probability_1.qmd index a6abbb8c..395d10db 100644 --- a/probability_1/probability_1.qmd +++ b/probability_1/probability_1.qmd @@ -32,8 +32,8 @@ To better understand the origin of this tradeoff, we will need to introduce the 1. Random Variables Estimators: introduce random variables, considering the concepts of expectation, variance, and covariance 2. Estimators, Bias, and Variance: re-express the ideas of model variance and training error in terms of random variables and use this new perspective to investigate our choice of model complexity -::: {.callout-tip} -## Data 8 +::: {.callout-tip collapse="true"} +## Data 8 Recap Recall the following concepts from Data 8: 1. Sample mean: the mean of your random sample @@ -57,16 +57,11 @@ For any random variable $X$, we need to be able to specify 2 things: 1. Possible values: the set of values the random variable can take on. 2. Probabilities: the set of probabilities describing how the total probability of 100% is split over the possible values. -If $X$ is discrete (has a finite number of possible values), - -* The probability that a random variable $X$ takes on the value $x$ is given by $P(X=x)$. -* Probabilities must sum to 1: $\sum_{\text{all} x} P(X=x) = 1$, - -We can often display this using a **probability distribution table** (example shown below). 
+If $X$ is discrete (has a finite number of possible values), the probability that a random variable $X$ takes on the value $x$ is given by $P(X=x)$, and probabilities must sum to 1: $\sum_{\text{all} x} P(X=x) = 1$, -The **distribution** of a random variable $X$ is a description of how the total probability of 100% is split over all the possible values of $X$, and it fully defines a random variable. +We can often display this using a **probability distribution table**, which you will see in the coin toss example below. -The distribution of a discrete random variable can also be represented using a histogram. If a variable is **continuous** – it can take on infinitely many values – we can illustrate its distribution using a density curve. +The **distribution** of a random variable $X$ is a description of how the total probability of 100% is split over all the possible values of $X$, and it fully defines a random variable. The distribution of a discrete random variable can also be represented using a histogram. If a variable is **continuous** – it can take on infinitely many values – we can illustrate its distribution using a density curve.

      discrete_continuous @@ -88,9 +83,9 @@ $$X = \begin{cases} 0, \text{if the coin lands tails} \end{cases}$$ -$X$ is a function with a domain (input) of $\{H, T\}$ and a range (output) of $\{1, 0\}$. We can write this in function notation as +$X$ is a function with a domain, or input, of $\{H, T\}$ and a range, or output, of $\{1, 0\}$. We can write this in function notation as $$\begin{cases} X(H) = 1 \\ X(T) = 0 \end{cases}$$ -and the probability distribution table of $X$ is +The probability distribution table of $X$ is given by. | $x$ | $P(X=x)$ | | --- | -------- | @@ -107,23 +102,24 @@ We can define $Y$ as the number of data science students in our sample. Its doma We can show the distribution of $Y$ in the following tables. The table on the left lists all possible samples of $s$ and the number of times they can appear ($Y(s)$). We can use this to calculate the values for the table on the right, a **probability distribution table**.

      -distribution +distribution

      ### Simulation Given a random variable $X$’s distribution, how could we **generate/simulate** a population? To do so, we can randomly pick values of $X$ according to its distribution using `np.random.choice` or `df.sample`. ## Expectation and Variance -There are several ways to describe a random variable. The methods shown above - table of all samples $s, X(s)$, distribution table $P(X=x)$, and histograms - are all definitions that fully describe a random variable. Often, it is easier to describe a random variable using some numerical summary, rather than fully defining its distribution. These numerical summaries are numbers that characterize some properties of the random variable. Because they give a "summary" of how the variable tends to behave, they are *not* random – think of them as a static number that describes a certain property of the random variable. In Data 100, we will focus our attention on the expectation and variance of a random variable. +There are several ways to describe a random variable. The methods shown above -- a table of all samples $s, X(s)$, distribution table $P(X=x)$, and histograms -- are all definitions that *fully describe* a random variable. Often, it is easier to describe a random variable using some *numerical summary* rather than fully defining its distribution. These numerical summaries are numbers that characterize some properties of the random variable. Because they give a "summary" of how the variable tends to behave, they are *not* random – think of them as a static number that describes a certain property of the random variable. In Data 100, we will focus our attention on the expectation and variance of a random variable. ### Expectation The **expectation** of a random variable $X$ is the weighted average of the values of $X$, where the weights are the probabilities of each value occurring. There are two equivalent ways to compute the expectation: + 1. 
Apply the weights one *sample* at a time: $$\mathbb{E}[X] = \sum_{\text{all possible } s} X(s) P(s)$$. 2. Apply the weights one possible *value* at a time: $$\mathbb{E}[X] = \sum_{\text{all possible } x} x P(X=x)$$ -We want to emphasize that the expectation is a *number*, not a random variable. Expectation is a generalization of the average, and it has the same units as the random variable. It is also the center of gravity of the probability distribution histogram. If we simulate the variable many times, it is the long-run average of the random variable. +We want to emphasize that the expectation is a *number*, not a random variable. Expectation is a generalization of the average, and it has the same units as the random variable. It is also the center of gravity of the probability distribution histogram, meaning if we simulate the variable many times, it is the long-run average of the random variable. -### Example 1: Coin Toss +#### Example 1: Coin Toss Going back to our coin toss example, we define a random variable $X$ as: $$X = \begin{cases} 1, \text{if the coin lands heads} \\ @@ -137,7 +133,7 @@ $$\begin{align} \end{align}$$ Note that $\mathbb{E}[X] = 0.5$ is not a possible value of $X$; it's an average. **The expectation of X does not need to be a possible value of X**. -### Example 2 +#### Example 2 Consider the random variable $X$: | $x$ | $P(X=x)$ | @@ -161,13 +157,13 @@ The **variance** of a random variable is a measure of its chance error. It is de $$\text{Var}(X) = \mathbb{E}[(X-\mathbb{E}[X])^2]$$ -The units of variance are the square of the units of $X$. To get it back to the right scale, use the standard deviation of $X$: $\text{SD}(X) = \sqrt{\text{Var}(X)}$. +The units of variance are the square of the units of $X$. To get it back to the right scale, use the standard deviation of $X$: $$\text{SD}(X) = \sqrt{\text{Var}(X)}$$ Like with expectation, **variance is a number, not a random variable**! Its main use is to quantify chance error. 
By [Chebyshev’s inequality](https://www.inferentialthinking.com/chapters/14/2/Variability.html#Chebychev's-Bounds), which you saw in Data 8, no matter what the shape of the distribution of $X$ is, the vast majority of the probability lies in the interval “expectation plus or minus a few SDs.” -If we expand the square and use properties of expectation, we can re-express variance as the **computational formula for variance**. This form is often more convenient to use when computing the variance of a variable by hand, and it is also useful in Mean Squared Error calculations (if $X$ is centered and $E(X)=0$, then $\mathbb{E}[X^2] = \text{Var}(X)$). +If we expand the square and use properties of expectation, we can re-express variance as the **computational formula for variance**. This form is often more convenient to use when computing the variance of a variable by hand, and it is also useful in Mean Squared Error calculations, as $\mathbb{E}[X^2] = \text{Var}(X)$ if $X$ is centered, i.e., $\mathbb{E}[X]=0$. $$\text{Var}(X) = \mathbb{E}[X^2] - (\mathbb{E}[X])^2$$ @@ -217,7 +213,7 @@ $$\text{Var}(X) = \frac{91}{6} - (\frac{7}{2})^2 = \frac{35}{12}$$ ::: ## Sums of Random Variables -Often, we will work with multiple random variables at the same time. A function of a random variable is also a random variable, so if you create multiple random variables based on your sample, then functions of those random variables are also random variables. +Often, we will work with multiple random variables at the same time. A function of a random variable is also a random variable; if you create multiple random variables based on your sample, then functions of those random variables are also random variables. For example, if $X_1, X_2, ..., X_n$ are random variables, then so are all of these: @@ -240,17 +236,18 @@ Suppose that we have two random variables $X$ and $Y$: For example, let $X_1$ and $X_2$ be numbers on rolls of two fair dice.
$X_1$ and $X_2$ are i.i.d., so $X_1$ and $X_2$ have the same distribution. However, the sums $Y = X_1 + X_1 = 2X_1$ and $Z=X_1+X_2$ have different distributions but the same expectation.

      -distribution +distribution

However, $Y = 2X_1$ has a larger variance

      -distribution +distribution

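This difference can also be checked empirically. A short simulation sketch under the two-dice setup above (variable names are illustrative): the sums agree in expectation, but doubling one die inflates the variance relative to adding two independent dice.

```python
import numpy as np

np.random.seed(0)
n = 100_000

# Two i.i.d. fair dice
x1 = np.random.choice(np.arange(1, 7), size=n)
x2 = np.random.choice(np.arange(1, 7), size=n)

y = 2 * x1      # Y = X1 + X1 = 2 * X1
z = x1 + x2     # Z = X1 + X2

# Same expectation: both long-run averages approach 7 ...
print(y.mean(), z.mean())
# ... but Y has the larger variance: 4 * Var(X1) vs. 2 * Var(X1)
print(y.var(), z.var())
```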
      ### Properties of Expectation Instead of simulating full distributions, we often just compute expectation and variance directly. Recall the definition of expectation: $$\mathbb{E}[X] = \sum_{x} x P(X=x)$$ +From it, we can derive some useful properties of expectation: 1. **Linearity of expectation**. The expectation of the linear transformation $aX+b$, where $a$ and $b$ are constants, is: @@ -281,13 +278,12 @@ $$\begin{align} ::: 3. If $g$ is a non-linear function, then in general, -$$\mathbb{E}[g(X)] \neq g(\mathbb{E}[X])$$ - -* For example, if $X$ is -1 or 1 with equal probability, then $\mathbb{E}[X] = 0$ but $\mathbb{E}[X^2] = 1 \neq 0$ +$$\mathbb{E}[g(X)] \neq g(\mathbb{E}[X])$$ For example, if $X$ is -1 or 1 with equal probability, then $\mathbb{E}[X] = 0$, but $\mathbb{E}[X^2] = 1 \neq 0$. ### Properties of Variance Recall the definition of variance: $$\text{Var}(X) = \mathbb{E}[(X-\mathbb{E}[X])^2]$$ +Combining it with the properties of expectation, we can derive some useful properties of variance: 1. Unlike expectation, variance is *non-linear*. The variance of the linear transformation $aX+b$ is: $$\text{Var}(aX+b) = a^2 \text{Var}(X)$$ @@ -304,7 +300,7 @@ In order to compute $\text{Var}(aX+b)$, consider that a shift by b units does no Then, $$\begin{align} \text{Var}(aX+b) &= \text{Var}(aX) \\ - &= E((aX)^2) - (E(aX))^2 + &= E((aX)^2) - (E(aX))^2 \\ &= E(a^2 X^2) - (aE(X))^2\\ &= a^2 (E(X^2) - (E(X))^2) \\ &= a^2 \text{Var}(X) @@ -419,7 +415,7 @@ The CLT states that no matter what population you are drawing from, if an i.i.d. Any theorem that provides the rough distribution of a statistic and doesn’t need the distribution of the population is valuable to data scientists because we rarely know a lot about the population! -For a more in-depth demo check out [onlinestatbook](https://onlinestatbook.com/stat_sim/sampling_dist/). +For a more in-depth demo, check out [onlinestatbook](https://onlinestatbook.com/stat_sim/sampling_dist/). 
The CLT applies if the sample size $n$ is large, but how large does $n$ have to be for the normal approximation to be good? It depends on the shape of the distribution of the population.
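To make this concrete, here is a small simulation sketch (the exponential population and the sample sizes are illustrative choices, not part of the notes): sample means center on the population mean, and their spread shrinks like $\sigma/\sqrt{n}$ even though the population itself is far from normal.

```python
import numpy as np

np.random.seed(1)

# A decidedly non-normal population: exponential with mean 1 and SD 1
population_mean, population_sd = 1.0, 1.0

def sample_means(n, reps=50_000):
    """Draw `reps` i.i.d. samples of size n and return their sample means."""
    draws = np.random.exponential(scale=1.0, size=(reps, n))
    return draws.mean(axis=1)

for n in (1, 10, 100):
    means = sample_means(n)
    # CLT: means center at the population mean, with SD ~ sigma / sqrt(n)
    print(n, means.mean(), means.std(), population_sd / np.sqrt(n))
```

For $n=100$ the simulated SD of the sample mean is already close to $1/\sqrt{100} = 0.1$, while for $n=1$ the histogram of "means" is just the skewed population itself.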