# Defining Random Variables
```{r load packages unit 02, echo=FALSE}
library(tidyverse)
library(patchwork)
theme_set(theme_minimal())
```
![yosemite valley](./images/yosemite.jpg)
## Learning Objectives
At the end of this week's course of study (which includes the async, sync, and homework), students should be able to:
1. **Remember** that random variables are neither random nor variables, but instead are a foundational object that we can use to reason about the world.
2. **Understand** that the intuition developed by the use of set-theoretic probability maps into the more expressive space of random variables.
3. **Apply** the appropriate mathematical transformations to move between joint, marginal, and conditional distributions.
This week's materials are theoretical tooling to build toward one of the first notable results of the course, **conditional probability**. This is the idea that, if we know that one event has occurred, we can make a conditional statement about the probability distribution of another, dependent random variable.
## Introduction to the Materials
From the axioms of probability, it is possible to build a whole, expressive modeling system (that need not be grounded **at all** in the minutiae of the world). With this probability model in place, we can describe how frequently events in the random variable will occur. When variables are dependent upon each other, we can utilize information that is encoded in this dependence in order to make predictions that are *closer to the truth* than predictions made without this information.
There is both a beauty and a tragedy when reasoning about random variables: we describe random variables using their joint density function.
The **beauty** is that reasoning with such general objects -- the definitions that we create, and the theorems that we derive in this section of the course -- produces guarantees that hold in every case, no matter what function stands in for the joint density function. We will compute several examples of *specific* functions to provide a chance to reason about these objects and how they "work".
The **tragedy** is that in the "real world", the world where we are eventually going to train and deploy our models, we are never provided with this joint density function. Perhaps this is the creation myth for probability theory: in a perfect world, we can produce a perfect result. But, in the "fallen" world of data, we will only be able to produce approximations.
## Class Announcements
### Homework {-}
1. You should have turned in your first homework. The solution set for this homework is scheduled to be released to you in two days. The solution set contains a full explanation of how we solved the questions posed to you. You can expect that feedback for this homework will be released back to you within seven days.
2. You can start working on your second homework as soon as this class session is over.
### Study Groups {-}
It is a **very** good idea for you to create a recurring time to work with a set of your classmates. Working together will help you solve questions more effectively and more quickly, and will also help you learn how to communicate what you do and do not understand about a problem to a group of collaborating data scientists. And, working together with a group will help you to find people who share data science interests with you.
### Course Resources {-}
There are several resources to support your learning. A learning objective last week was that you would be introduced to each of these systems. Please continue to make sure that you have access to the:
- [Library VPN](https://www.lib.berkeley.edu/using-the-libraries/vpn) to read all of the scholarly content in the known universe, including the course textbook.
- [Course LMS Page](https://www.bcourses.berkeley.edu)
## Using Definitions of Random Variables
### Random Variable
What is a random variable? Does this definition help you?
::: {.definition name="Random Variable"}
A random variable is a function $X : \Omega \rightarrow \mathbb{R},$ such that $\forall r \in \mathbb{R}, \{\omega \in \Omega: X(\omega) \leq r\} \in S$.
:::
Someone, please, read that without using a single "omega", $\mathbb{R}$, or other jargon terminology. Instead, read this aloud and tell us what each of the concepts means.
The goal of writing with math symbols like this is to be *absolutely* clear about which concepts the author does and does not mean to invoke when they write a definition or a theorem. In a very real sense, this is a language that has specific meaning attached to specific symbols; there is a correspondence between the mathematical language and each of our home languages, but exactly what that relationship is needs to be translated into each student's home language.
::: {.discussion-question name="Why Random Variables?"}
- What are the key things that random variables allow you to accomplish?
- Suppose that you were going to try to make a model that predicts the probability of winning "big money" on a slot machine. Big money might be that you get :cherries: :cherries: :cherries:. Can you do *math* with :cherries:?
- Suppose that you wanted to build a chatbot that uses a language model so that you don't have to do your homework anymore. How would you go about it?
- Suppose you want to direct class support to students in 203, but their grades are scored `[A, A-, ...]` and the features include grades in prior statistics classes, also scored `[A, A-, ...]`.
:::
## Pieces of a Random Variable
::: {.definition name="Random Variable, Suite"}
A random variable is a function $X : \Omega \rightarrow \mathbb{R},$ such that $\forall r \in \mathbb{R}, \{\omega \in \Omega: X(\omega) \leq r\} \in S$.
:::
There are two key pieces that must exist for every random variable. What are these pieces? The first of these pieces is provided to us in **Definition 1.2.1** *Random Variable* (on page 16). The second is provided to us in **Definition 1.2.5** *Probability Mass Function* (on page 18).
1.
2.
::: {.discussion-question}
Suppose that a random variable is simple and discrete. For concreteness, you could think of this random variable as the answer to the question, "Is the grass wet outside?" (A small simulation sketch follows this question.)
1. What is the sample space?
2. What is a sensible function that you might use to map from the sample space to real values?
3. What is an insensible function that you might use to map from the sample space to real values? (A student well-seasoned in Maths might use (and define for the rest of the class) the concept of a *bijective function*.)
4. If you simply had the values that the random variable function maps to, are you guaranteed to be able to describe the entire sample space? Why or why not?
5. How would you go about determining the probability mass function for this random variable?
:::
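To make questions 2 and 5 concrete, here is a minimal simulation sketch. The outcome labels, the mapping `X`, and the probability of wet grass used by `sample` are all made-up choices for illustration -- they are not "the" answer to the question above.
```{r sketch a simple discrete random variable}
# a hypothetical sample space for "Is the grass wet outside?"
omega <- c('wet', 'dry')

# one sensible mapping from outcomes to real numbers
X <- function(outcome) ifelse(outcome == 'wet', 1, 0)

# simulate outcomes with a made-up probability of wet grass, then tabulate
# the simulated values of X as an approximation of the pmf
outcomes <- sample(omega, size = 10000, replace = TRUE, prob = c(0.3, 0.7))
table(X(outcomes)) / length(outcomes)
```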
### Functions of Functions
::: {.discussion-question name="Why Functions?"}
Why do we say that random variables are functions? Is there some useful property of these being functions rather than any other quantity? What else *could* they be if not a function?
:::
What about a function of a random variable, which is a function of a function?
::: {.definition name="Function of a Random Variable"}
Let $g : U \rightarrow \mathbb{R}$ be some function, where $X(\Omega) \subset U \subset \mathbb{R}$. Then, if $g \circ X : \Omega \rightarrow \mathbb{R}$ is a random variable, we say that $g$ is a *function* of X and write $g(X)$ to denote the random variable $g \circ X$.
:::
If a random variable is a function from the real world, or the sample space, or the outcome space to a real number, then what does it mean to define a function of a random variable?
- At what point does this function do its work? Does this function change the sample space that is possible to observe? Or, does this function change the real number that each outcome points to?
::: {.example name="MNIST"}
Suppose that you are doing some image processing work. To keep things simple, suppose that you are doing image classification in the style of the MNIST dataset.
- Can someone describe what this task is trying to accomplish?
- Has anyone done work like this?
However, suppose that rather than having good, clean indicators of whether a pixel is on or off, you instead have weak indicators -- there's a lot of grey. A lot of the cells are marked in the range $0.2 - 0.3$. (A small remapping sketch follows this example.)
1. How might creating a function that re-maps this grey into more extreme values help your model?
2. Is it possible to "blur" events that are in the outcome space? Does this "blurring" meet the requirements of a function of a random variable, as provided above?
:::
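As a point of departure for the first question, here is a minimal sketch of one possible remapping. The particular function (a logistic squash centered at 0.5) and the `steepness` value are arbitrary illustrative choices, not a recommended preprocessing step.
```{r sketch a pixel remapping function}
# g pushes middling grey intensities toward 0 or 1; steepness is arbitrary
g <- function(x, steepness = 10) {
  1 / (1 + exp(-steepness * (x - 0.5)))
}

# a few weak, grey-ish pixel intensities, before and after remapping
grey_pixels <- c(0.20, 0.25, 0.30, 0.50, 0.70)
round(g(grey_pixels), 3)
```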
### Probability Density Functions and Cumulative Distribution Functions
- What is a probability mass function?
- What do the **Kolmogorov Axioms** mean must be true about any probability mass function (*pmf*)?
::: {.example name="Berkeley Drivers, No Survivors"}
You should try driving in Berkeley some time. It is a **trip**! Without being deliberately ageist, the city is full of ageing hippies driving Subaru Outbacks and making what seem to be stochastic right-or-left turns to buy incense, pottery, or just sourdough bread.
Suppose that you are walking to campus, and you have to cross 10 crosswalks, each of which are spaced a block apart. Further, suppose that as you get closer to campus, there are fewer aging hippies, and therefore, there is decreasing risk that you're hit by a Subaru as you cross the street. Specifically, and fortunately for our math, the risk of being hit decreases linearly with each block that you cross.
Finally, campus provides you with last year's safety report, which records that there were 120 student-Subaru incidents out of 10,000 student-crosswalk crossings. (A small R sketch after this example shows one way to set up the pmf.)
1. What is the *pmf* for the probability that you are involved in a student-Subaru incident as you walk across these 10 blocks? What sample space, $\Omega$, is appropriate to represent this scenario?
2. Suppose that you don't leave your house -- this is a remote program after all! What is your cumulative probability of being involved in a student-Subaru incident?
3. What is the cumulative mass function (*cmf*) for the probability that you are involved in a student-Subaru incident?
4. Suppose that you live three blocks from campus, but your classmate lives five blocks from campus. What is the difference in the cumulative probability?
5. How would you describe the cumulative probability of being hit as you walk closer to campus? That is, suppose that you start 10 blocks away from campus, and are walking to get closer. Is your cumulative probability of being hit on your way to campus increasing or decreasing as you get closer to campus?
6. How would you describe the cumulative probability of being hit as you walk **further** from campus? That is, suppose that you start on campus, and you're walking to a bar after classes. Is your cumulative probability of being hit on your way away from campus increasing or decreasing as you get further from campus?
:::
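Here is a minimal sketch of one way to set up this pmf in R. It assumes that the per-block risk is proportional to how many blocks you still are from campus (so it decreases linearly) and that the ten per-block probabilities together account for the observed incident rate of $120/10000$; both readings of the problem are assumptions you should check against your own.
```{r sketch the crosswalk pmf}
# per-block risk proportional to distance from campus: 10, 9, ..., 1 blocks out
blocks_from_campus <- 10:1
risk_shape <- blocks_from_campus / sum(blocks_from_campus)

# scale so that the ten blocks together match the observed incident rate
p_incident_total <- 120 / 10000
pmf <- risk_shape * p_incident_total

# cumulative probability of an incident by the time you pass each crosswalk
cmf <- cumsum(pmf)
round(rbind(pmf, cmf), 4)
```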
## Discrete & Continuous Random Variables
What, if anything, is fundamentally different between discrete and continuous random variables? As a way of starting the conversation, consider the following cases:
- Suppose $X$ is a random variable that describes the time a student spends on w203 homework 1.
- If you have only granular measurement -- i.e. the number of nights spent working on the homework -- is this discrete or continuous?
- If you have the number of hours, is it discrete or continuous?
- If you have the number of seconds? Or milliseconds?
- Is it possible that $P(X = a) = 0$ for every point $a$? For example, that $P(X = 3600) = 0$.
- Does one of these measures have more *information* in it than another?
- How are the measurement choices that we make as designers of information capture systems -- i.e. the machine processes, human processes, or other processes that we are going to work with as data scientists -- reflected in the amount of information that is gathered, the type of information that is gathered, and the types of random variables that are manifest as a result?
## Moving Between PDF and CDF
The book defines the *pmf* and the *cmf* first, as a way of developing intuition and a way of reasoning about these concepts. It then moves to defining continuous density functions, which in many ways are easier to work with, although they are harder to reason about intuitively. Continuous distributions are defined in the book, and more generally, in terms of the *cdf*, the cumulative distribution function. There are technical reasons for this choice of definition, some of which are noted in the footnotes on the page where the book presents it.
More importantly for this course, in **Definition 1.2.15** the book defines the relationship between *cdf* and *pdf* in the following way:
::: {.definition name="Probability Density Function (PDF)"}
For a continuous random variable $X$ with CDF $F$, the *probability density function* of $X$ is
$$
f(x) = \left. \frac{d F(u)}{du} \right|_{u=x}, \forall x \in \mathbb{R}.
$$
:::
- How does this definition, which relates *pdf* and *cdf* by a means of differentiation and integration, fit with the ideas that we just developed in the context of walking to and from campus?
::: {.example name="Working with a continuous pdf and cdf"}
Suppose that you learn that a particular random variable, $X$, has the following function that describes its *pdf*: $f_{X}(x) = \frac{1}{10}x$. Also, suppose that you know that the smallest value that it is possible for this random variable to attain is 0. (A numerical sketch after this example shows one way to check your answers.)
1. What is the CDF of $X$?
2. What is the maximum possible value that $x$ can obtain? How did you develop this answer, using the Kolmogorov axioms of probability?
3. What is the cumulative probability of an outcome up to 0.5?
4. What is the probability of an outcome between 0.25 and 0.75? Produce an answer to this in two ways:
1. Using the $pdf$
2. Using the $cdf$
:::
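For checking your pencil-and-paper answers, here is a minimal numerical sketch. It uses base R's `integrate` and `uniroot`, and it treats the upper end of the support as the value at which the total probability reaches one -- an assumption that follows from the Kolmogorov axioms rather than from anything stated directly in the problem.
```{r sketch numerical checks for the continuous pdf}
pdf_X <- function(x) x / 10

# find the upper end of the support such that total probability equals 1
total_probability_minus_one <- function(b) {
  integrate(pdf_X, lower = 0, upper = b)$value - 1
}
upper_bound <- uniroot(total_probability_minus_one, interval = c(0.1, 10))$root
upper_bound

# cumulative probability up to 0.5, and probability of an outcome in (0.25, 0.75)
integrate(pdf_X, lower = 0, upper = 0.5)$value
integrate(pdf_X, lower = 0.25, upper = 0.75)$value
```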
## Joint Density
Working with a single random variable helps to develop our understanding of how to relate the different features of a *pdf* and a *cdf* through differentiation and integration. However, there's not really *that* much else that we can do; and, there is probably very little in our professional worlds that would look like a single random variable in isolation.
We really start to get to something useful when we consider joint density functions. Joint density functions describe the probability that *both* of two random variables take particular values. That is, if we are working with random variables $X$ and $Y$, then the joint density function provides a probability statement for $P(X \cap Y)$.
In this course, we might typically write this joint density function as $f_{X,Y}(x,y) = f(\cdot)$ where $f(\cdot)$ is the actual function that represents the joint probability. The $f(\cdot)$ means, essentially, "some function" where we just have not designated the specifics of the function; you might think of this as a generic function.
### Example: Uniform Joint Density
Suppose that we know that two variables, $X$ and $Y$, are jointly uniformly distributed within the *support* $x \in [0,4], y \in [0,4]$. We have a requirement, imposed by the *Kolmogorov Axioms*, that all probabilities must be non-negative, and that the total probability across the whole support must be one. (A small numerical sketch after these questions illustrates the joint-to-marginal move.)
- Can you use these facts to determine answers to the following:
- What kind of shape does this joint *pdf* have?
- What is the specific function that describes this shape?
- If you draw this shape on three axes -- $X$, $Y$, and $P(X,Y)$ -- what does this plot look like?
- How do you get from the joint density function, to a marginal density function for $X$?
- How do you get from the joint density function, to a marginal density function for $Y$?
- How do you get from these marginal density functions of $X$ and $Y$ back to the joint density? Is this always possible?
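Here is a minimal numerical sketch of the joint-to-marginal move for this uniform case. The constant height of $1/16$ and the grid spacing are assumptions made for illustration; you should verify the height yourself from the Kolmogorov requirements above.
```{r sketch the uniform joint density and its marginals}
# evaluate the (assumed) joint density of 1/16 on a grid over [0,4] x [0,4]
step <- 0.01
grid_xy <- expand.grid(x = seq(0, 4, by = step), y = seq(0, 4, by = step))
grid_xy$f_xy <- 1 / 16

# marginal density of X: integrate (here, approximate with a sum) over y
marginal_x <- grid_xy %>%
  group_by(x) %>%
  summarise(f_x = sum(f_xy) * step)
head(marginal_x)  # f_x should be (approximately) constant at 1/4
```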
### Examples: Thinking Through Many Plots
An alumnus of the MIDS program, and a former instructor of this course, [Todd Young](https://www.linkedin.com/in/dtoddyoung/) built this nifty tool that lets us consider several different joint probability functions.
As a class, let's consider a few of these PDFs, beginning with this "triangle" distribution.
```{r, fig.width=8}
knitr::include_app('http://www.statistics.wtf/PDF_Explorer/', height="1000px")
```
### Triangle Math
After considering the intuition for the triangle distribution, do the following: write down the function that accords with the figure that you're seeing above.^[Notice that, in general, this kind of *curve fitting* isn't really a common data science task. Instead, this is just a learning task that lets the class assess their understanding of the definitions of random variables.]
- What is a full statement of the PDF of this image?
- What is the marginal distribution of $X$, $f_{X}(x)$?
- What is the marginal distribution of $Y$, $f_{Y}(y)$?
- Using the definition of independence, are $X$ and $Y$ independent of each other?
- What is the CDF of $X$, $F_{X}(x)$?
### Saddle Sores
Suppose that you know that two random variables, $X$ and $Y$ are jointly distributed with the following *pdf*:
\[
f_{X,Y}(x,y) =
\begin{cases}
a \cdot x^{2} y^{2} & 0 < x < 1, 0 < y < 1 \\
0 & otherwise
\end{cases}
\]
This joint pdf is similar to the pdf that you can visualize above, under the distribution called "saddle". The difference between this function and the image above is that this function bounds the support of $x$ and $y$ to the range $[0,1]$. This is to make the math easier for us in the next step. (A numerical sketch after these questions shows one way to check the normalizing constant.)
- Can you use these facts to determine the following?
- What value of $a$ makes this a valid joint pdf?
- What is the marginal pdf of $X$? That is, what is $f_{X}(x)$?
- What is the conditional pdf of $X$ given $Y$? That is, what is $f_{X|Y}(x|y)$?
- Given these facts, would you say that $X$ and $Y$ are dependent or independent?
- If the support for this joint distribution were instead $[0,4]$ (rather than $[0,1]$), how would the shape of the distribution change?
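For a brute-force check of the normalizing constant, here is a minimal sketch that performs the double integral numerically with `integrate`. It does not replace the analytic calculation; it only gives you a number to compare your answer against.
```{r sketch a numerical check of the saddle constant}
# the un-normalized joint density on the unit square
f_unnormalized <- function(x, y) x^2 * y^2

# integrate over x for each fixed y, then integrate that result over y
inner_over_x <- function(y) {
  sapply(y, function(y_fixed) {
    integrate(function(x) f_unnormalized(x, y_fixed), lower = 0, upper = 1)$value
  })
}
total_mass <- integrate(inner_over_x, lower = 0, upper = 1)$value

# the constant a must rescale this total mass to one
a <- 1 / total_mass
a
```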
## Computing Different Distributions
Suppose that random variables $X$ and $Y$ are jointly continuous, with joint density function given by,
$$
f(x,y) =
\begin{cases}
c, & 0 \leq x \leq 1, 0 \leq y \leq x \\
0, & otherwise
\end{cases}
$$
where $c$ is a constant.
1. Draw a graph showing the region of the $X$-$Y$ plane with positive probability density. (A plotting sketch after these questions shows one way to shade such a region.)
2. What is the constant $c$?
3. Compute the marginal density function for $X$. (Be sure to write a complete expression)
4. Compute the conditional density function for $Y$, conditional on $X=x$. (Be sure to specify for what values of $x$ this is defined)
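Because question 1 asks for a picture, here is a minimal ggplot sketch that shades the region where the density is positive. The triangle's corners are read directly off the support conditions $0 \leq x \leq 1$ and $0 \leq y \leq x$; the sketch deliberately does not compute $c$ or the requested densities.
```{r sketch the region of positive density, fig.width=4, fig.height=4}
# the support 0 <= y <= x <= 1 is the triangle with these three corners
support_triangle <- data.frame(
  x = c(0, 1, 1),
  y = c(0, 0, 1)
)

ggplot(support_triangle) +
  aes(x = x, y = y) +
  geom_polygon(alpha = 0.5) +
  lims(x = c(0, 1), y = c(0, 1)) +
  labs(title = 'Region with Positive Density')
```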
## Conditional Probability
Conditional probability is **incredible**. In fact, without exaggeration, almost **all** of data science is an exercise in making statements about conditional probability distributions. *Don't believe us?*
- What is the goal of a "customer churn" model or a conversion model?
- What is the goal of a language-completion model?
- What is the goal of a flight-departures model?
::: {.discussion-question}
**If** we possessed complete information about a process -- **if** we had the CDF that governed the probability of occurrences -- what kinds of statements would we be able to make? Would we even need data?
:::
::: {.discussion-question}
Using the distribution above, produce a statement of conditional probability, $f_{Y|X}(y|x)$. (The general relationship between joint, marginal, and conditional densities is restated just after this question.)
:::
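As a reminder of the machinery behind the question above, the conditional density is built from the joint and marginal densities. This is the general relationship, not the answer to the specific exercise:
$$
f_{Y|X}(y|x) = \frac{f_{X,Y}(x,y)}{f_{X}(x)}, \quad \text{wherever } f_{X}(x) > 0.
$$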
## Visualizing Distributions Via Simulation
To this point in the course, we have focused on concepts in "the population" with no reference to samples. This is on purpose! We want to develop the theory that defines the **best possible** predictor if we knew **everything** (if we know the formula of the function that maps $\omega \rightarrow \mathbb{R}$, and we know the probability of each $\omega \in \Omega$, then we know everything). Beginning in week 5 of the course, we will talk about "approximating" (which we will call estimating) this best possible predictor with a limited sample of data.
However, at this point, to help build your working understanding, or intuition, for what is happening, we are going to work on a way to *simulate* draws from a population. In some places, people might refer to these as *Monte Carlo* methods -- this is because the method was developed by von Neumann \& Ulam during World War II, and they needed a way to talk about it using a code name. They chose *Monte Carlo* after a famous casino in Monaco.
### Example: The Uniform Distribution
> You: "Gosh. There sure are a lot of examples that use the uniform distribution. That must be a really important statistical distribution."
>
> Instructor: "Nah. Not really. We're just using the uniform a bunch so that we don't get too lost in doing math while we're working with these concepts."
We'll start with a simple uniform distribution, but then we'll make it a little more complex in a moment.
We can use R to simulate draws from a probability distribution function by providing it with the name of the distribution that we're considering, the support of that distribution, or other features of the distribution. In the case of the uniform, the entire distribution can be described just from its support.
So, suppose that you had a uniform distribution that had positive probability on the range $[1.1, 4.3]$. Why these values? No particular reason. That is, suppose
\[
f_{X}(x) = \begin{cases}
a & 1.1 \leq x \leq 4.3 \\
0 & otherwise
\end{cases}
\]
What does this distribution "look like"? Because it is a uniform, you might have a sense that it will be a horizontal line. But, what is the height of that line? Aha! We could do the math to figure it out, or we could generate an approximation using a simulation.
In the code below, we are going to create an object called `samples_uniform` that stores the results of the `runif` function call.
```{r create uniform samples}
# draw 1,000 samples from a Uniform(min = 1.1, max = 4.3) distribution
samples_uniform <- runif(n=1000, min=1.1, max=4.3)
```
What is happening inside `runif`?
When you're writing your own code, you can pull up the documentation for this (and any) function using a question mark, i.e. `?`, followed by the function name -- `?runif`.
But, we can speed this up slightly by simply telling you that `n` is the number of samples to take from the population; `min` is the low-end of the support, and `max` is the high-end of the support.
If we look into this object, we can see the results of the function call. Below, we will show the first $20$ elements of the `samples_uniform` object.
```{r show first 20 results}
samples_uniform[1:20]
```
(Notice that R is a $1$-indexed language; Python is a zero-indexed language.)
With this object created, we can plot a density of the data and then learn from this plot what the pdf looks like.
```{r plot uniform samples}
plot_full_data <- ggplot() +
  aes(x=1:length(samples_uniform), y=samples_uniform) +
  geom_point() +
  labs(
    title = 'Showing the Data',
    y = 'Sample Value',
    x = 'Index')
plot_density <- ggplot() +
  aes(x=samples_uniform) +
  geom_density(bw=0.1) +
  labs(
    title = 'Showing the PDF',
    y = 'Density',
    x = 'Sample Value')
(plot_full_data | (plot_density + coord_flip())) /
  plot_density
```
Interesting. From what we can see here, there does not appear to be any discernible pattern. This leaves us with two options: either we might reduce the resolution that we're using to view this pattern, or we might take more samples and hold the resolution constant. Below, two different plots show these differing approaches, and are *very* explicit about the code that creates them.
```{r create more data}
samples_uniform_moar <- runif(n=1000000, min=1.1, max=4.3)
```
```{r plot uniform distributions}
plot_low_res <- ggplot() +
  aes(x=samples_uniform) +
  geom_density(bw=0.1) +
  lims(y=c(0,0.4)) +
  labs(title = 'Low Res, Low Data')
plot_high_res <- ggplot() +
  aes(x=samples_uniform_moar) +
  geom_density(bw=0.01) +
  lims(y=c(0,0.4)) +
  labs(title = 'High Res, More Data')
plot_low_res | plot_high_res
```
### Example: The Normal Distribution
Folks might have some prior beliefs about the Normal distribution. Don't worry, we'll cover this later in the course. But, this is the distribution that you have in mind when you're thinking of a "bell curve".
We can use the same method to visualize a normal distribution as we did for a uniform distribution. In this case, we would issue the call `rnorm`, together with the population parameters that define the population. At this point in the course, we do not expect that you will know these (and, actually, memorizing these facts is not a core focus of the course), but you can [look them up](https://en.wikipedia.org/wiki/Normal_distribution) if you like. Truthfully, statistics Wikipedia is *very* good.
Do you notice anything about the `runif` and the `rnorm` calls that we have identified? Both seem to name the distribution -- $unif \approx uniform$ and $norm \approx normal$ -- but prepended with an `r`. This stands for "random draw".
Base R is loaded with a *pile* of basic statistics distributions, which you can look into using `?distributions`.
```{r draw many samples from a normal distribution}
# draw 100,000 samples from a Normal(mean = 18, sd = 4) distribution
samples_normal <- rnorm(n=100000, mean=18, sd=4)
```
Like before, we could look at the first $20$ of these samples.
```{r spot-check first 20}
samples_normal[1:20]
```
And, from here we could visualize this distribution.
```{r plot normal}
ggplot() +
  aes(x=samples_normal) +
  geom_density() +
  labs(title='Visualization of this Normal Distribution')
```
#### Combining This Ability
::: {.discussion-question}
Consider three random variables $A, B, C$. Suppose,
\[
\begin{aligned}
A & \sim Uniform(min=1.1, max=4.3) \\
B & \sim Normal(mean=18, sd=4) \\
C & = A + B
\end{aligned}
\]
That is, $A$ is the uniform random variable and $B$ is the normal random variable that we considered earlier. Suppose that $A$ and $B$ are independent of each other.
Finally, suppose that $C$ is their sum, $C = A + B$.
What does $C$ look like?
:::
Although this is a simple function applied to random variables -- a legal move -- the math would be tedious. What if, instead, we used this simulation method to get a sense of the distribution?
```{r create C}
# independent draws from A and B; C is their element-wise sum
samples_A <- runif(n=10000, min=1.1, max=4.3)
samples_B <- rnorm(n=10000, mean=18, sd=4)
samples_C <- samples_A + samples_B
```
```{r plot C}
plot_C <- ggplot() +
  aes(x=samples_C) +
  geom_density()
# overlay the three densities: A in Berkeley blue, B in gold, C in dark red
plot_C_and_A_and_B <- ggplot() +
  geom_density(aes(x=samples_A), color = '#003262') +
  geom_density(aes(x=samples_B), color = '#FDB515') +
  geom_density(aes(x=samples_C), color = 'darkred')
plot_C_and_A_and_B
```
## Review of Terms
Remember some of the key terms we learned in the async:
- Joint Density Function
- Conditional Distribution
- Marginal Distribution
Explain each of these three in terms of the cake metaphor.