From 449970610f06e0c861fa08f615c7007ebf882ae5 Mon Sep 17 00:00:00 2001
From: Lillian Weng
Date: Mon, 23 Oct 2023 23:21:17 -0700
Subject: [PATCH] final changes to random var notes

---
 probability_1/probability_1.html  | 67 +++++++++++++++----------------
 probability_1/probability_1.ipynb | 50 +++++++++++------------
 probability_1/probability_1.qmd   | 50 +++++++++++------------
 3 files changed, 78 insertions(+), 89 deletions(-)

diff --git a/probability_1/probability_1.html b/probability_1/probability_1.html
index df06be04..6648c8c7 100644
--- a/probability_1/probability_1.html
+++ b/probability_1/probability_1.html
@@ -56,8 +56,6 @@

Random Variables

  • Expectation and Variance
  • @@ -126,14 +124,16 @@

    Random Variables

  • Estimators, Bias, and Variance: re-express the ideas of model variance and training error in terms of random variables and use this new perspective to investigate our choice of model complexity
  • -
    + +

    Recall the following concepts from Data 8:

      @@ -147,6 +147,7 @@

      Random Variables

    +

    Random Variables and Distributions

    Suppose we generate a set of random data, like a random sample from some population. A random variable is a numerical function of the randomness in the data. It is random since our sample was drawn at random; it is variable because its exact value depends on how this random sample came out. As such, the domain or input of our random variable is all possible (random) outcomes in a sample space, and its range or output is the number line. We typically denote random variables with uppercase letters, such as \(X\) or \(Y\).

    @@ -157,14 +158,9 @@

    Distribution

  • Possible values: the set of values the random variable can take on.
  • Probabilities: the set of probabilities describing how the total probability of 100% is split over the possible values.
  • -

    If \(X\) is discrete (has a finite number of possible values),

    -
      -
    • The probability that a random variable \(X\) takes on the value \(x\) is given by \(P(X=x)\).
    • -
    • Probabilities must sum to 1: \(\sum_{\text{all} x} P(X=x) = 1\),
    • -
    -

    We can often display this using a probability distribution table (example shown below).

    -

    The distribution of a random variable \(X\) is a description of how the total probability of 100% is split over all the possible values of \(X\), and it fully defines a random variable.

    -

    The distribution of a discrete random variable can also be represented using a histogram. If a variable is continuous – it can take on infinitely many values – we can illustrate its distribution using a density curve.

    +

    If \(X\) is discrete (has a finite number of possible values), the probability that a random variable \(X\) takes on the value \(x\) is given by \(P(X=x)\), and probabilities must sum to 1: \(\sum_{\text{all } x} P(X=x) = 1\).

    +

    We can often display this using a probability distribution table, which you will see in the coin toss example below.

    +

    The distribution of a random variable \(X\) is a description of how the total probability of 100% is split over all the possible values of \(X\), and it fully defines a random variable. The distribution of a discrete random variable can also be represented using a histogram. If a variable is continuous – it can take on infinitely many values – we can illustrate its distribution using a density curve.

    discrete_continuous

    @@ -180,7 +176,7 @@

    Example: Tossing a Coin

    \[X = \begin{cases} 1, \text{if the coin lands heads} \\ 0, \text{if the coin lands tails} \end{cases}\]

    -

    \(X\) is a function with a domain (input) of \(\{H, T\}\) and a range (output) of \(\{1, 0\}\). We can write this in function notation as \[\begin{cases} X(H) = 1 \\ X(T) = 0 \end{cases}\] and the probability distribution table of \(X\) is

    +

    \(X\) is a function with a domain, or input, of \(\{H, T\}\) and a range, or output, of \(\{1, 0\}\). We can write this in function notation as \[\begin{cases} X(H) = 1 \\ X(T) = 0 \end{cases}\] The probability distribution table of \(X\) is given by:
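As a quick sanity check, this two-row distribution table can be constructed and validated in code; the pandas construction below is an illustrative sketch of my own, not part of the notes:

```python
import pandas as pd

# Probability distribution table of X for a fair coin:
# possible values of X and the probability of each
dist = pd.DataFrame({"x": [1, 0], "P(X=x)": [0.5, 0.5]})

# A valid distribution must have its probabilities sum to 1
assert dist["P(X=x)"].sum() == 1.0
print(dist)
```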

    @@ -205,7 +201,7 @@

    Example: Tossing a

    We can show the distribution of \(Y\) in the following tables. The table on the left lists all possible samples of \(s\) and the number of times they can appear (\(Y(s)\)). We can use this to calculate the values for the table on the right, a probability distribution table.

    -distribution +distribution

    @@ -215,14 +211,17 @@

    Simulation

    Expectation and Variance

    -

    There are several ways to describe a random variable. The methods shown above - table of all samples \(s, X(s)\), distribution table \(P(X=x)\), and histograms - are all definitions that fully describe a random variable. Often, it is easier to describe a random variable using some numerical summary, rather than fully defining its distribution. These numerical summaries are numbers that characterize some properties of the random variable. Because they give a “summary” of how the variable tends to behave, they are not random – think of them as a static number that describes a certain property of the random variable. In Data 100, we will focus our attention on the expectation and variance of a random variable.

    +

    There are several ways to describe a random variable. The methods shown above – a table of all samples \(s, X(s)\), distribution table \(P(X=x)\), and histograms – are all definitions that fully describe a random variable. Often, it is easier to describe a random variable using some numerical summary rather than fully defining its distribution. These numerical summaries are numbers that characterize some properties of the random variable. Because they give a “summary” of how the variable tends to behave, they are not random – think of them as a static number that describes a certain property of the random variable. In Data 100, we will focus our attention on the expectation and variance of a random variable.

    Expectation

    -

    The expectation of a random variable \(X\) is the weighted average of the values of \(X\), where the weights are the probabilities of each value occurring. There are two equivalent ways to compute the expectation: 1. Apply the weights one sample at a time: \[\mathbb{E}[X] = \sum_{\text{all possible } s} X(s) P(s)\]. 2. Apply the weights one possible value at a time: \[\mathbb{E}[X] = \sum_{\text{all possible } x} x P(X=x)\]

    -

    We want to emphasize that the expectation is a number, not a random variable. Expectation is a generalization of the average, and it has the same units as the random variable. It is also the center of gravity of the probability distribution histogram. If we simulate the variable many times, it is the long-run average of the random variable.

    -
    -
    -

    Example 1: Coin Toss

    +

    The expectation of a random variable \(X\) is the weighted average of the values of \(X\), where the weights are the probabilities of each value occurring. There are two equivalent ways to compute the expectation:

    +
      +
    1. Apply the weights one sample at a time: \[\mathbb{E}[X] = \sum_{\text{all possible } s} X(s) P(s)\]
    2. +
    3. Apply the weights one possible value at a time: \[\mathbb{E}[X] = \sum_{\text{all possible } x} x P(X=x)\]
    4. +
    +

    We want to emphasize that the expectation is a number, not a random variable. Expectation is a generalization of the average, and it has the same units as the random variable. It is also the center of gravity of the probability distribution histogram, meaning if we simulate the variable many times, it is the long-run average of the random variable.
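To make the two equivalent formulas concrete, here is a small sketch using a hypothetical example of counting heads in two fair coin tosses (this particular example is an assumption of mine, not one worked in the notes):

```python
from fractions import Fraction
from itertools import product

# Sample space of two fair coin tosses; each outcome s has P(s) = 1/4
outcomes = list(product("HT", repeat=2))

def X(s):
    """X = number of heads in the outcome s."""
    return s.count("H")

# Way 1: apply the weights one *sample* at a time: E[X] = sum over s of X(s) P(s)
e_by_sample = sum(Fraction(1, 4) * X(s) for s in outcomes)

# Way 2: apply the weights one possible *value* at a time: E[X] = sum over x of x P(X=x)
p = {0: Fraction(1, 4), 1: Fraction(1, 2), 2: Fraction(1, 4)}
e_by_value = sum(x * px for x, px in p.items())

assert e_by_sample == e_by_value == 1  # both formulas give E[X] = 1
```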

    +
    +

    Example 1: Coin Toss

    Going back to our coin toss example, we define a random variable \(X\) as: \[X = \begin{cases} 1, \text{if the coin lands heads} \\ 0, \text{if the coin lands tails} @@ -232,8 +231,8 @@

    Example 1: Coin Toss Note that \(\mathbb{E}[X] = 0.5\) is not a possible value of \(X\); it’s an average. The expectation of X does not need to be a possible value of X.

    -
    -

    Example 2

    +
    +

    Example 2

    Consider the random variable \(X\):

    @@ -268,14 +267,15 @@

    Example 2

    &= 5.9 \end{align}\] Again, note that \(\mathbb{E}[X] = 5.9\) is not a possible value of \(X\); it’s an average. The expectation of X does not need to be a possible value of X.

    +

    Variance

    The variance of a random variable is a measure of its chance error. It is defined as the expected squared deviation from the expectation of \(X\). Put more simply, variance asks: how far does \(X\) typically vary from its average value, just by chance? What is the spread of \(X\)’s distribution?

    \[\text{Var}(X) = \mathbb{E}[(X-\mathbb{E}[X])^2]\]

    -

    The units of variance are the square of the units of \(X\). To get it back to the right scale, use the standard deviation of \(X\): \(\text{SD}(X) = \sqrt{\text{Var}(X)}\).

    +

    The units of variance are the square of the units of \(X\). To get it back to the right scale, use the standard deviation of \(X\): \[\text{SD}(X) = \sqrt{\text{Var}(X)}\]

    Like with expectation, variance is a number, not a random variable! Its main use is to quantify chance error.

    By Chebyshev’s inequality, which you saw in Data 8, no matter what the shape of the distribution of \(X\) is, the vast majority of the probability lies in the interval “expectation plus or minus a few SDs.”

    -

    If we expand the square and use properties of expectation, we can re-express variance as the computational formula for variance. This form is often more convenient to use when computing the variance of a variable by hand, and it is also useful in Mean Squared Error calculations (if \(X\) is centered and \(E(X)=0\), then \(\mathbb{E}[X^2] = \text{Var}(X)\)).

    +

    If we expand the square and use properties of expectation, we can re-express variance as the computational formula for variance. This form is often more convenient to use when computing the variance of a variable by hand, and it is also useful in Mean Squared Error calculations, as \(\mathbb{E}[X^2] = \text{Var}(X)\) if \(X\) is centered and \(E(X)=0\).

    \[\text{Var}(X) = \mathbb{E}[X^2] - (\mathbb{E}[X])^2\]
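The definitional and computational formulas can be checked against each other with exact arithmetic. The sketch below uses a fair six-sided die as an assumed example, for which \(\text{Var}(X) = \frac{91}{6} - (\frac{7}{2})^2 = \frac{35}{12}\):

```python
from fractions import Fraction

# Distribution of X = face of a fair six-sided die: P(X=x) = 1/6 for x in 1..6
dist = {x: Fraction(1, 6) for x in range(1, 7)}

e = sum(x * p for x, p in dist.items())                    # E[X] = 7/2
var_def = sum((x - e) ** 2 * p for x, p in dist.items())   # E[(X - E[X])^2]
var_comp = sum(x**2 * p for x, p in dist.items()) - e**2   # E[X^2] - (E[X])^2

# Both formulas give Var(X) = 35/12
assert var_def == var_comp == Fraction(35, 12)
```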

    Sums of Random Variables

    -

    Often, we will work with multiple random variables at the same time. A function of a random variable is also a random variable, so if you create multiple random variables based on your sample, then functions of those random variables are also random variables.

    +

    Often, we will work with multiple random variables at the same time. A function of a random variable is also a random variable; if you create multiple random variables based on your sample, then functions of those random variables are also random variables.

    For example, if \(X_1, X_2, ..., X_n\) are random variables, then so are all of these:

    • \(X_n^2\)
    • @@ -374,16 +374,16 @@

      \(X_1\) and \(X_2\) be numbers on rolls of two fair dice. \(X_1\) and \(X_2\) are i.i.d., so \(X_1\) and \(X_2\) have the same distribution. However, the sums \(Y = X_1 + X_1 = 2X_1\) and \(Z=X_1+X_2\) have different distributions but the same expectation.

      -distribution +distribution

      However, \(Y = 2X_1\) has a larger variance.

      -distribution +distribution
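The claims about \(Y = 2X_1\) and \(Z = X_1 + X_2\) can also be checked by simulation, in the spirit of the `np.random.choice` approach mentioned earlier (the seed and sample size below are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(42)  # arbitrary seed for reproducibility
faces = np.arange(1, 7)

# X1 and X2: i.i.d. rolls of two fair dice
x1 = rng.choice(faces, size=100_000)
x2 = rng.choice(faces, size=100_000)

y = 2 * x1       # Y = X1 + X1 = 2 X1
z = x1 + x2      # Z = X1 + X2

# Same expectation (both close to 7), but Var(Y) = 4 Var(X1) is about
# twice Var(Z) = 2 Var(X1), where Var(X1) = 35/12
print(y.mean(), z.mean())
print(y.var(), z.var())
```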

    Properties of Expectation

    -

    Instead of simulating full distributions, we often just compute expectation and variance directly. Recall the definition of expectation: \[\mathbb{E}[X] = \sum_{x} x P(X=x)\]

    +

    Instead of simulating full distributions, we often just compute expectation and variance directly. Recall the definition of expectation: \[\mathbb{E}[X] = \sum_{x} x P(X=x)\] From it, we can derive some useful properties of expectation:

    1. Linearity of expectation. The expectation of the linear transformation \(aX+b\), where \(a\) and \(b\) are constants, is:
    @@ -435,15 +435,12 @@

    Properties of Ex
      -
    1. If \(g\) is a non-linear function, then in general, \[\mathbb{E}[g(X)] \neq g(\mathbb{E}[X])\]
    2. +
    3. If \(g\) is a non-linear function, then in general, \[\mathbb{E}[g(X)] \neq g(\mathbb{E}[X])\] For example, if \(X\) is -1 or 1 with equal probability, then \(\mathbb{E}[X] = 0\), but \(\mathbb{E}[X^2] = 1 \neq 0\).
    -
      -
    • For example, if \(X\) is -1 or 1 with equal probability, then \(\mathbb{E}[X] = 0\) but \(\mathbb{E}[X^2] = 1 \neq 0\)
    • -
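The counterexample above is easy to verify with exact fractions; a minimal check:

```python
from fractions import Fraction

# X is -1 or 1 with equal probability
dist = {-1: Fraction(1, 2), 1: Fraction(1, 2)}

e_x = sum(x * p for x, p in dist.items())      # E[X] = 0, so g(E[X]) = 0 for g(x) = x^2
e_x2 = sum(x**2 * p for x, p in dist.items())  # E[g(X)] = E[X^2] = 1

assert e_x == 0 and e_x2 == 1  # E[X^2] = 1, which differs from (E[X])^2 = 0
```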

    Properties of Variance

    -

    Recall the definition of variance: \[\text{Var}(X) = \mathbb{E}[(X-\mathbb{E}[X])^2]\]

    +

    Recall the definition of variance: \[\text{Var}(X) = \mathbb{E}[(X-\mathbb{E}[X])^2]\] Combining it with the properties of expectation, we can derive some useful properties of variance:

    1. Unlike expectation, variance is non-linear. The variance of the linear transformation \(aX+b\) is: \[\text{Var}(aX+b) = a^2 \text{Var}(X)\]
    @@ -467,7 +464,7 @@

    Properties of Varia

    In order to compute \(\text{Var}(aX+b)\), consider that a shift by \(b\) units does not affect spread, so \(\text{Var}(aX+b) = \text{Var}(aX)\).

    Then, \[\begin{align} \text{Var}(aX+b) &= \text{Var}(aX) \\ - &= E((aX)^2) - (E(aX))^2 + &= E((aX)^2) - (E(aX))^2 \\ &= E(a^2 X^2) - (aE(X))^2\\ &= a^2 (E(X^2) - (E(X))^2) \\ &= a^2 \text{Var}(X) @@ -612,7 +609,7 @@
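The derived result \(\text{Var}(aX+b) = a^2 \text{Var}(X)\) can be confirmed numerically; the die population and the constants \(a\) and \(b\) below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.choice(np.arange(1, 7), size=200_000)  # simulated fair-die values of X

a, b = 3, 10
transformed = a * x + b  # the linear transformation aX + b

# The shift by b leaves the spread untouched; the scale a multiplies variance by a^2
assert np.isclose(transformed.var(), a**2 * x.var())
```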

    Sample Mean

    Central Limit Theorem

    The CLT states that no matter what population you are drawing from, if an i.i.d. sample of size \(n\) is large, the probability distribution of the sample mean is roughly normal with mean \(\mu\) and SD \(\sigma/\sqrt{n}\).

    Any theorem that provides the rough distribution of a statistic and doesn’t need the distribution of the population is valuable to data scientists because we rarely know a lot about the population!

    -

    For a more in-depth demo check out onlinestatbook.

    +

    For a more in-depth demo, check out onlinestatbook.
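A small simulation also illustrates the CLT statement above; the exponential population (which is strongly skewed) and the constants are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
mu, sigma, n = 1.0, 1.0, 100  # an exponential(1) population has mean 1 and SD 1

# Draw 10,000 i.i.d. samples of size n and record each sample mean
sample_means = rng.exponential(scale=mu, size=(10_000, n)).mean(axis=1)

# CLT: the sample means are roughly normal with mean mu and SD sigma / sqrt(n)
print(sample_means.mean())  # close to mu = 1
print(sample_means.std())   # close to sigma / sqrt(n) = 0.1
```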

    The CLT applies if the sample size \(n\) is large, but how large does \(n\) have to be for the normal approximation to be good? It depends on the shape of the distribution of the population.

    • If the population is roughly symmetric and unimodal/uniform, could need as few as \(n = 20\).
    • diff --git a/probability_1/probability_1.ipynb b/probability_1/probability_1.ipynb index 76b093bf..265440f6 100644 --- a/probability_1/probability_1.ipynb +++ b/probability_1/probability_1.ipynb @@ -42,8 +42,8 @@ "1. Random Variables Estimators: introduce random variables, considering the concepts of expectation, variance, and covariance\n", "2. Estimators, Bias, and Variance: re-express the ideas of model variance and training error in terms of random variables and use this new perspective to investigate our choice of model complexity\n", "\n", - "::: {.callout-tip}\n", - "## Data 8\n", + "::: {.callout-tip collapse=\"true\"}\n", + "## Data 8 Recap\n", "Recall the following concepts from Data 8: \n", "\n", "1. Sample mean: the mean of your random sample\n", @@ -72,16 +72,11 @@ "1. Possible values: the set of values the random variable can take on.\n", "2. Probabilities: the set of probabilities describing how the total probability of 100% is split over the possible values.\n", "\n", - "If $X$ is discrete (has a finite number of possible values),\n", - "\n", - "* The probability that a random variable $X$ takes on the value $x$ is given by $P(X=x)$.\n", - "* Probabilities must sum to 1: $\\sum_{\\text{all} x} P(X=x) = 1$,\n", - "\n", - "We can often display this using a **probability distribution table** (example shown below).\n", + "If $X$ is discrete (has a finite number of possible values), the probability that a random variable $X$ takes on the value $x$ is given by $P(X=x)$, and probabilities must sum to 1: $\\sum_{\\text{all} x} P(X=x) = 1$,\n", "\n", - "The **distribution** of a random variable $X$ is a description of how the total probability of 100% is split over all the possible values of $X$, and it fully defines a random variable.\n", + "We can often display this using a **probability distribution table**, which you will see in the coin toss example below.\n", "\n", - "The distribution of a discrete random variable can also be represented using a 
histogram. If a variable is **continuous** – it can take on infinitely many values – we can illustrate its distribution using a density curve. \n", + "The **distribution** of a random variable $X$ is a description of how the total probability of 100% is split over all the possible values of $X$, and it fully defines a random variable. The distribution of a discrete random variable can also be represented using a histogram. If a variable is **continuous** – it can take on infinitely many values – we can illustrate its distribution using a density curve. \n", "\n", "

      \n", "discrete_continuous\n", @@ -103,9 +98,9 @@ " 0, \\text{if the coin lands tails} \n", " \\end{cases}$$\n", "\n", - "$X$ is a function with a domain (input) of $\\{H, T\\}$ and a range (output) of $\\{1, 0\\}$. We can write this in function notation as \n", + "$X$ is a function with a domain, or input, of $\\{H, T\\}$ and a range, or output, of $\\{1, 0\\}$. We can write this in function notation as \n", "$$\\begin{cases} X(H) = 1 \\\\ X(T) = 0 \\end{cases}$$\n", - "and the probability distribution table of $X$ is \n", + "The probability distribution table of $X$ is given by.\n", "\n", "| $x$ | $P(X=x)$ | \n", "| --- | -------- |\n", @@ -122,7 +117,7 @@ "We can show the distribution of $Y$ in the following tables. The table on the left lists all possible samples of $s$ and the number of times they can appear ($Y(s)$). We can use this to calculate the values for the table on the right, a **probability distribution table**. \n", "\n", "

      \n", - "distribution\n", + "distribution\n", "

      \n", "\n", "### Simulation\n", @@ -134,16 +129,17 @@ "metadata": {}, "source": [ "## Expectation and Variance\n", - "There are several ways to describe a random variable. The methods shown above - table of all samples $s, X(s)$, distribution table $P(X=x)$, and histograms - are all definitions that fully describe a random variable. Often, it is easier to describe a random variable using some numerical summary, rather than fully defining its distribution. These numerical summaries are numbers that characterize some properties of the random variable. Because they give a \"summary\" of how the variable tends to behave, they are *not* random – think of them as a static number that describes a certain property of the random variable. In Data 100, we will focus our attention on the expectation and variance of a random variable.\n", + "There are several ways to describe a random variable. The methods shown above -- a table of all samples $s, X(s)$, distribution table $P(X=x)$, and histograms -- are all definitions that *fully describe* a random variable. Often, it is easier to describe a random variable using some *numerical summary* rather than fully defining its distribution. These numerical summaries are numbers that characterize some properties of the random variable. Because they give a \"summary\" of how the variable tends to behave, they are *not* random – think of them as a static number that describes a certain property of the random variable. In Data 100, we will focus our attention on the expectation and variance of a random variable.\n", "\n", "### Expectation\n", "The **expectation** of a random variable $X$ is the weighted average of the values of $X$, where the weights are the probabilities of each value occurring. There are two equivalent ways to compute the expectation: \n", + "\n", "1. Apply the weights one *sample* at a time: $$\\mathbb{E}[X] = \\sum_{\\text{all possible } s} X(s) P(s)$$.\n", "2. 
Apply the weights one possible *value* at a time: $$\\mathbb{E}[X] = \\sum_{\\text{all possible } x} x P(X=x)$$\n", "\n", - "We want to emphasize that the expectation is a *number*, not a random variable. Expectation is a generalization of the average, and it has the same units as the random variable. It is also the center of gravity of the probability distribution histogram. If we simulate the variable many times, it is the long-run average of the random variable.\n", + "We want to emphasize that the expectation is a *number*, not a random variable. Expectation is a generalization of the average, and it has the same units as the random variable. It is also the center of gravity of the probability distribution histogram, meaning if we simulate the variable many times, it is the long-run average of the random variable.\n", "\n", - "### Example 1: Coin Toss\n", + "#### Example 1: Coin Toss\n", "Going back to our coin toss example, we define a random variable $X$ as: \n", "$$X = \\begin{cases} \n", " 1, \\text{if the coin lands heads} \\\\\n", @@ -157,7 +153,7 @@ "\\end{align}$$\n", "Note that $\\mathbb{E}[X] = 0.5$ is not a possible value of $X$; it's an average. **The expectation of X does not need to be a possible value of X**.\n", "\n", - "### Example 2\n", + "#### Example 2\n", "Consider the random variable $X$: \n", "\n", "| $x$ | $P(X=x)$ | \n", @@ -181,13 +177,13 @@ "\n", "$$\\text{Var}(X) = \\mathbb{E}[(X-\\mathbb{E}[X])^2]$$\n", "\n", - "The units of variance are the square of the units of $X$. To get it back to the right scale, use the standard deviation of $X$: $\\text{SD}(X) = \\sqrt{\\text{Var}(X)}$.\n", + "The units of variance are the square of the units of $X$. To get it back to the right scale, use the standard deviation of $X$: $$\\text{SD}(X) = \\sqrt{\\text{Var}(X)}$$\n", "\n", "Like with expectation, **variance is a number, not a random variable**! 
Its main use is to quantify chance error.\n", "\n", "By [Chebyshev’s inequality](https://www.inferentialthinking.com/chapters/14/2/Variability.html#Chebychev's-Bounds), which you saw in Data 8, no matter what the shape of the distribution of X is, the vast majority of the probability lies in the interval “expectation plus or minus a few SDs.”\n", "\n", - "If we expand the square and use properties of expectation, we can re-express variance as the **computational formula for variance**. This form is often more convenient to use when computing the variance of a variable by hand, and it is also useful in Mean Squared Error calculations (if $X$ is centered and $E(X)=0$, then $\\mathbb{E}[X^2] = \\text{Var}(X)$).\n", + "If we expand the square and use properties of expectation, we can re-express variance as the **computational formula for variance**. This form is often more convenient to use when computing the variance of a variable by hand, and it is also useful in Mean Squared Error calculations, as $\\mathbb{E}[X^2] = \\text{Var}(X)$ if $X$ is centered and $E(X)=0$.\n", "\n", "$$\\text{Var}(X) = \\mathbb{E}[X^2] - (\\mathbb{E}[X])^2$$\n", "\n", @@ -242,7 +238,7 @@ "metadata": {}, "source": [ "## Sums of Random Variables\n", - "Often, we will work with multiple random variables at the same time. A function of a random variable is also a random variable, so if you create multiple random variables based on your sample, then functions of those random variables are also random variables.\n", + "Often, we will work with multiple random variables at the same time. A function of a random variable is also a random variable; if you create multiple random variables based on your sample, then functions of those random variables are also random variables.\n", "\n", "For example, if $X_1, X_2, ..., X_n$ are random variables, then so are all of these: \n", "\n", @@ -265,17 +261,18 @@ "For example, let $X_1$ and $X_2$ be numbers on rolls of two fair die. 
$X_1$ and $X_2$ are i.i.d, so $X_1$ and $X_2$ have the same distribution. However, the sums $Y = X_1 + X_1 = 2X_1$ and $Z=X_1+X_2$ have different distributions but the same expectation.\n", "\n", "

      \n", - "distribution\n", + "distribution\n", "

      \n", "\n", "However, $Y = 2X_1$ has a larger variance\n", "\n", "

      \n", - "distribution\n", + "distribution\n", "

      \n", "\n", "### Properties of Expectation \n", "Instead of simulating full distributions, we often just compute expectation and variance directly. Recall the definition of expectation: $$\\mathbb{E}[X] = \\sum_{x} x P(X=x)$$\n", + "From it, we can derive some useful properties of expectation: \n", "\n", "1. **Linearity of expectation**. The expectation of the linear transformation $aX+b$, where $a$ and $b$ are constants, is:\n", "\n", @@ -306,13 +303,12 @@ ":::\n", "\n", "3. If $g$ is a non-linear function, then in general, \n", - "$$\\mathbb{E}[g(X)] \\neq g(\\mathbb{E}[X])$$\n", - "\n", - "* For example, if $X$ is -1 or 1 with equal probability, then $\\mathbb{E}[X] = 0$ but $\\mathbb{E}[X^2] = 1 \\neq 0$\n", + "$$\\mathbb{E}[g(X)] \\neq g(\\mathbb{E}[X])$$ For example, if $X$ is -1 or 1 with equal probability, then $\\mathbb{E}[X] = 0$, but $\\mathbb{E}[X^2] = 1 \\neq 0$.\n", "\n", "### Properties of Variance\n", "Recall the definition of variance: \n", "$$\\text{Var}(X) = \\mathbb{E}[(X-\\mathbb{E}[X])^2]$$\n", + "Combining it with the properties of expectation, we can derive some useful properties of variance: \n", "\n", "1. Unlike expectation, variance is *non-linear*. The variance of the linear transformation $aX+b$ is:\n", "$$\\text{Var}(aX+b) = a^2 \\text{Var}(X)$$\n", @@ -329,7 +325,7 @@ "Then, \n", "$$\\begin{align}\n", " \\text{Var}(aX+b) &= \\text{Var}(aX) \\\\\n", - " &= E((aX)^2) - (E(aX))^2\n", + " &= E((aX)^2) - (E(aX))^2 \\\\\n", " &= E(a^2 X^2) - (aE(X))^2\\\\\n", " &= a^2 (E(X^2) - (E(X))^2) \\\\\n", " &= a^2 \\text{Var}(X)\n", @@ -454,7 +450,7 @@ "\n", "Any theorem that provides the rough distribution of a statistic and doesn’t need the distribution of the population is valuable to data scientists because we rarely know a lot about the population!\n", "\n", - "For a more in-depth demo check out [onlinestatbook](https://onlinestatbook.com/stat_sim/sampling_dist/). 
\n", + "For a more in-depth demo, check out [onlinestatbook](https://onlinestatbook.com/stat_sim/sampling_dist/). \n", "\n", "The CLT applies if the sample size $n$ is large, but how large does n have to be for the normal approximation to be good? It depends on the shape of the distribution of the population.\n", "\n", diff --git a/probability_1/probability_1.qmd b/probability_1/probability_1.qmd index a6abbb8c..395d10db 100644 --- a/probability_1/probability_1.qmd +++ b/probability_1/probability_1.qmd @@ -32,8 +32,8 @@ To better understand the origin of this tradeoff, we will need to introduce the 1. Random Variables Estimators: introduce random variables, considering the concepts of expectation, variance, and covariance 2. Estimators, Bias, and Variance: re-express the ideas of model variance and training error in terms of random variables and use this new perspective to investigate our choice of model complexity -::: {.callout-tip} -## Data 8 +::: {.callout-tip collapse="true"} +## Data 8 Recap Recall the following concepts from Data 8: 1. Sample mean: the mean of your random sample @@ -57,16 +57,11 @@ For any random variable $X$, we need to be able to specify 2 things: 1. Possible values: the set of values the random variable can take on. 2. Probabilities: the set of probabilities describing how the total probability of 100% is split over the possible values. -If $X$ is discrete (has a finite number of possible values), - -* The probability that a random variable $X$ takes on the value $x$ is given by $P(X=x)$. -* Probabilities must sum to 1: $\sum_{\text{all} x} P(X=x) = 1$, - -We can often display this using a **probability distribution table** (example shown below). 
+If $X$ is discrete (has a finite number of possible values), the probability that a random variable $X$ takes on the value $x$ is given by $P(X=x)$, and probabilities must sum to 1: $\sum_{\text{all} x} P(X=x) = 1$, -The **distribution** of a random variable $X$ is a description of how the total probability of 100% is split over all the possible values of $X$, and it fully defines a random variable. +We can often display this using a **probability distribution table**, which you will see in the coin toss example below. -The distribution of a discrete random variable can also be represented using a histogram. If a variable is **continuous** – it can take on infinitely many values – we can illustrate its distribution using a density curve. +The **distribution** of a random variable $X$ is a description of how the total probability of 100% is split over all the possible values of $X$, and it fully defines a random variable. The distribution of a discrete random variable can also be represented using a histogram. If a variable is **continuous** – it can take on infinitely many values – we can illustrate its distribution using a density curve.

      discrete_continuous @@ -88,9 +83,9 @@ $$X = \begin{cases} 0, \text{if the coin lands tails} \end{cases}$$ -$X$ is a function with a domain (input) of $\{H, T\}$ and a range (output) of $\{1, 0\}$. We can write this in function notation as +$X$ is a function with a domain, or input, of $\{H, T\}$ and a range, or output, of $\{1, 0\}$. We can write this in function notation as $$\begin{cases} X(H) = 1 \\ X(T) = 0 \end{cases}$$ -and the probability distribution table of $X$ is +The probability distribution table of $X$ is given by. | $x$ | $P(X=x)$ | | --- | -------- | @@ -107,23 +102,24 @@ We can define $Y$ as the number of data science students in our sample. Its doma We can show the distribution of $Y$ in the following tables. The table on the left lists all possible samples of $s$ and the number of times they can appear ($Y(s)$). We can use this to calculate the values for the table on the right, a **probability distribution table**.

      -distribution +distribution

      ### Simulation Given a random variable $X$’s distribution, how could we **generate/simulate** a population? To do so, we can randomly pick values of $X$ according to its distribution using `np.random.choice` or `df.sample`. ## Expectation and Variance -There are several ways to describe a random variable. The methods shown above - table of all samples $s, X(s)$, distribution table $P(X=x)$, and histograms - are all definitions that fully describe a random variable. Often, it is easier to describe a random variable using some numerical summary, rather than fully defining its distribution. These numerical summaries are numbers that characterize some properties of the random variable. Because they give a "summary" of how the variable tends to behave, they are *not* random – think of them as a static number that describes a certain property of the random variable. In Data 100, we will focus our attention on the expectation and variance of a random variable. +There are several ways to describe a random variable. The methods shown above -- a table of all samples $s, X(s)$, distribution table $P(X=x)$, and histograms -- are all definitions that *fully describe* a random variable. Often, it is easier to describe a random variable using some *numerical summary* rather than fully defining its distribution. These numerical summaries are numbers that characterize some properties of the random variable. Because they give a "summary" of how the variable tends to behave, they are *not* random – think of them as a static number that describes a certain property of the random variable. In Data 100, we will focus our attention on the expectation and variance of a random variable. ### Expectation The **expectation** of a random variable $X$ is the weighted average of the values of $X$, where the weights are the probabilities of each value occurring. There are two equivalent ways to compute the expectation: + 1. 
Apply the weights one *sample* at a time: $$\mathbb{E}[X] = \sum_{\text{all possible } s} X(s) P(s)$$. 2. Apply the weights one possible *value* at a time: $$\mathbb{E}[X] = \sum_{\text{all possible } x} x P(X=x)$$ -We want to emphasize that the expectation is a *number*, not a random variable. Expectation is a generalization of the average, and it has the same units as the random variable. It is also the center of gravity of the probability distribution histogram. If we simulate the variable many times, it is the long-run average of the random variable. +We want to emphasize that the expectation is a *number*, not a random variable. Expectation is a generalization of the average, and it has the same units as the random variable. It is also the center of gravity of the probability distribution histogram, meaning if we simulate the variable many times, it is the long-run average of the random variable. -### Example 1: Coin Toss +#### Example 1: Coin Toss Going back to our coin toss example, we define a random variable $X$ as: $$X = \begin{cases} 1, \text{if the coin lands heads} \\ @@ -137,7 +133,7 @@ $$\begin{align} \end{align}$$ Note that $\mathbb{E}[X] = 0.5$ is not a possible value of $X$; it's an average. **The expectation of X does not need to be a possible value of X**. -### Example 2 +#### Example 2 Consider the random variable $X$: | $x$ | $P(X=x)$ | @@ -161,13 +157,13 @@ The **variance** of a random variable is a measure of its chance error. It is de $$\text{Var}(X) = \mathbb{E}[(X-\mathbb{E}[X])^2]$$ -The units of variance are the square of the units of $X$. To get it back to the right scale, use the standard deviation of $X$: $\text{SD}(X) = \sqrt{\text{Var}(X)}$. +The units of variance are the square of the units of $X$. To get it back to the right scale, use the standard deviation of $X$: $$\text{SD}(X) = \sqrt{\text{Var}(X)}$$ Like with expectation, **variance is a number, not a random variable**! Its main use is to quantify chance error. 
By [Chebyshev’s inequality](https://www.inferentialthinking.com/chapters/14/2/Variability.html#Chebychev's-Bounds), which you saw in Data 8, no matter what the shape of the distribution of $X$ is, the vast majority of the probability lies in the interval “expectation plus or minus a few SDs.” -If we expand the square and use properties of expectation, we can re-express variance as the **computational formula for variance**. This form is often more convenient to use when computing the variance of a variable by hand, and it is also useful in Mean Squared Error calculations (if $X$ is centered and $E(X)=0$, then $\mathbb{E}[X^2] = \text{Var}(X)$). +If we expand the square and use properties of expectation, we can re-express variance as the **computational formula for variance**. This form is often more convenient to use when computing the variance of a variable by hand, and it is also useful in Mean Squared Error calculations, as $\mathbb{E}[X^2] = \text{Var}(X)$ if $X$ is centered, i.e., $\mathbb{E}[X]=0$. $$\text{Var}(X) = \mathbb{E}[X^2] - (\mathbb{E}[X])^2$$ @@ -217,7 +213,7 @@ $$\text{Var}(X) = \frac{91}{6} - (\frac{7}{2})^2 = \frac{35}{12}$$ ::: ## Sums of Random Variables -Often, we will work with multiple random variables at the same time. A function of a random variable is also a random variable, so if you create multiple random variables based on your sample, then functions of those random variables are also random variables. +Often, we will work with multiple random variables at the same time. A function of a random variable is also a random variable; if you create multiple random variables based on your sample, then functions of those random variables are also random variables. For example, if $X_1, X_2, ..., X_n$ are random variables, then so are all of these: @@ -240,17 +236,18 @@ Suppose that we have two random variables $X$ and $Y$: For example, let $X_1$ and $X_2$ be numbers on rolls of two fair dice.
$X_1$ and $X_2$ are i.i.d., so $X_1$ and $X_2$ have the same distribution. However, the sums $Y = X_1 + X_1 = 2X_1$ and $Z=X_1+X_2$ have different distributions but the same expectation.

      -distribution +distribution

However, $Y = 2X_1$ has a larger variance

      -distribution +distribution

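This difference can also be checked empirically. A short simulation sketch under the two-dice setup above (variable names are illustrative): the sums agree in expectation, but doubling one die inflates the variance relative to adding two independent dice.

```python
import numpy as np

np.random.seed(0)
n = 100_000

# Two i.i.d. fair dice
x1 = np.random.choice(np.arange(1, 7), size=n)
x2 = np.random.choice(np.arange(1, 7), size=n)

y = 2 * x1      # Y = X1 + X1 = 2 * X1
z = x1 + x2     # Z = X1 + X2

# Same expectation: both long-run averages approach 7 ...
print(y.mean(), z.mean())
# ... but Y has the larger variance: 4 * Var(X1) vs. 2 * Var(X1)
print(y.var(), z.var())
```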
      ### Properties of Expectation Instead of simulating full distributions, we often just compute expectation and variance directly. Recall the definition of expectation: $$\mathbb{E}[X] = \sum_{x} x P(X=x)$$ +From it, we can derive some useful properties of expectation: 1. **Linearity of expectation**. The expectation of the linear transformation $aX+b$, where $a$ and $b$ are constants, is: @@ -281,13 +278,12 @@ $$\begin{align} ::: 3. If $g$ is a non-linear function, then in general, -$$\mathbb{E}[g(X)] \neq g(\mathbb{E}[X])$$ - -* For example, if $X$ is -1 or 1 with equal probability, then $\mathbb{E}[X] = 0$ but $\mathbb{E}[X^2] = 1 \neq 0$ +$$\mathbb{E}[g(X)] \neq g(\mathbb{E}[X])$$ For example, if $X$ is -1 or 1 with equal probability, then $\mathbb{E}[X] = 0$, but $\mathbb{E}[X^2] = 1 \neq 0$. ### Properties of Variance Recall the definition of variance: $$\text{Var}(X) = \mathbb{E}[(X-\mathbb{E}[X])^2]$$ +Combining it with the properties of expectation, we can derive some useful properties of variance: 1. Unlike expectation, variance is *non-linear*. The variance of the linear transformation $aX+b$ is: $$\text{Var}(aX+b) = a^2 \text{Var}(X)$$ @@ -304,7 +300,7 @@ In order to compute $\text{Var}(aX+b)$, consider that a shift by b units does no Then, $$\begin{align} \text{Var}(aX+b) &= \text{Var}(aX) \\ - &= E((aX)^2) - (E(aX))^2 + &= E((aX)^2) - (E(aX))^2 \\ &= E(a^2 X^2) - (aE(X))^2\\ &= a^2 (E(X^2) - (E(X))^2) \\ &= a^2 \text{Var}(X) @@ -419,7 +415,7 @@ The CLT states that no matter what population you are drawing from, if an i.i.d. Any theorem that provides the rough distribution of a statistic and doesn’t need the distribution of the population is valuable to data scientists because we rarely know a lot about the population! -For a more in-depth demo check out [onlinestatbook](https://onlinestatbook.com/stat_sim/sampling_dist/). +For a more in-depth demo, check out [onlinestatbook](https://onlinestatbook.com/stat_sim/sampling_dist/). 
The CLT applies if the sample size $n$ is large, but how large does $n$ have to be for the normal approximation to be good? It depends on the shape of the distribution of the population.
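To make this concrete, here is a small simulation sketch (the exponential population and the sample sizes are illustrative choices, not part of the notes): sample means center on the population mean, and their spread shrinks like $\sigma/\sqrt{n}$ even though the population itself is far from normal.

```python
import numpy as np

np.random.seed(1)

# A decidedly non-normal population: exponential with mean 1 and SD 1
population_mean, population_sd = 1.0, 1.0

def sample_means(n, reps=50_000):
    """Draw `reps` i.i.d. samples of size n and return their sample means."""
    draws = np.random.exponential(scale=1.0, size=(reps, n))
    return draws.mean(axis=1)

for n in (1, 10, 100):
    means = sample_means(n)
    # CLT: means center at the population mean, with SD ~ sigma / sqrt(n)
    print(n, means.mean(), means.std(), population_sd / np.sqrt(n))
```

For $n=100$ the simulated SD of the sample mean is already close to $1/\sqrt{100} = 0.1$, while for $n=1$ the histogram of "means" is just the skewed population itself.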