# Defining Random Variables
```{r load packages unit 02, echo=FALSE}
library(tidyverse)
library(patchwork)
theme_set(theme_minimal())
```
![yosemite valley](./images/yosemite.jpg)
## Learning Objectives
At the end of this week's course of study (which includes the async, sync, and homework), students should be able to:
1. **Remember** that random variables are neither random nor variables, but instead are a foundational object that we can use to reason about the world.
2. **Understand** that the intuition developed by the use of set-theoretic probability maps into the more expressive space of random variables.
3. **Apply** the appropriate mathematical transformations to move between joint, marginal, and conditional distributions.
This week's materials are theoretical tooling to build toward one of the first notable results of the course, **conditional probability**. This is the idea that, if we know that one event has occurred, we can make a conditional statement about the probability distribution of another, dependent random variable.
## Introduction to the Materials
From the axioms of probability, it is possible to build a whole, expressive modeling system (that need not be grounded **at all** in the minutiae of the world). With this probability model in place, we can describe how frequently events in the random variable will occur. When variables are dependent upon each other, we can utilize information that is encoded in this dependence in order to make predictions that are *closer to the truth* than predictions made without this information.
There is both a beauty and a tragedy when reasoning about random variables: we describe random variables using their joint density function.
The **beauty** is that reasoning with such general objects -- the definitions that we create, and the theorems that we derive in this section of the course -- produces guarantees that hold in every case, no matter what function stands in for the joint density function. We will compute several examples of *specific* functions to provide a chance to reason about these objects and how they "work".
The **tragedy** is that in the "real world", the world where we are eventually going to train and deploy our models, we are never provided with this joint density function. Perhaps this is the creation myth for probability theory: in a perfect world, we can produce a perfect result. But, in the "fallen" world of data, we will only be able to produce approximations.
## Class Announcements
### Homework {-}
1. You should have turned in your first homework. The solution set for this homework is scheduled to be released to you in two days. The solution set contains a full explanation of how we solved the questions posed to you. You can expect that feedback for this homework will be released back to you within seven days.
2. You can start working on your second homework as soon as this class session is over.
### Study Groups {-}
It is a **very** good idea for you to create a recurring time to work with a set of your classmates. Working together will help you solve questions more effectively and more quickly, and will also help you learn how to communicate what you do and do not understand about a problem to a group of collaborating data scientists. And, working together with a group will help you to find people who share data science interests with you.
### Course Resources {-}
There are several resources to support your learning. A learning objective last week was that you would be introduced to each of these systems. Please continue to make sure that you have access to the:
- [Library VPN](https://www.lib.berkeley.edu/using-the-libraries/vpn) to read all of the scholarly content in the known universe, including the course textbook.
- [Course LMS Page](https://www.bcourses.berkeley.edu)
## Using Definitions of Random Variables
### Random Variable
What is a random variable? Does this definition help you?
::: {.definition name="Random Variable"}
A random variable is a function $X : \Omega \rightarrow \mathbb{R},$ such that $\forall r \in \mathbb{R}, \{\omega \in \Omega: X(\omega) \leq r\} \in S$.
:::
Someone, please, read that without using a single "omega", $\mathbb{R}$, or other jargon terminology. Instead, read this aloud and tell us what each of the concepts means.
The goal of writing with math symbols like this is to be *absolutely* clear about which concepts the author does and does not mean to invoke when they write a definition or a theorem. In a very real sense, this is a language that has specific meaning attached to specific symbols; there is a correspondence between the mathematical language and each of our home languages, but exactly what that relationship is needs to be translated into each student's home language.
::: {.discussion-question name="Why Random Variables?"}
- What are the key things that random variables allow you to accomplish?
- Suppose that you were going to try to make a model that predicts the probability of winning "big money" on a slot machine. Big money might be that you get :cherries: :cherries: :cherries:. Can you do *math* with :cherries:?
- Suppose that you wanted to build a chatbot that uses a language model so that you don't have to do your homework anymore. How would you go about it?
- Suppose you want to direct class support to students in 203, but their grades are scored `[A, A-, ...]` and the features include grades in prior statistics classes, also scored `[A, A-, ...]`.
:::
## Pieces of a Random Variable
::: {.definition name="Random Variable, Suite"}
A random variable is a function $X : \Omega \rightarrow \mathbb{R},$ such that $\forall r \in \mathbb{R}, \{\omega \in \Omega: X(\omega) \leq r\} \in S$.
:::
There are two key pieces that must exist for every random variable. What are these pieces? The first of these pieces is provided to us in **Definition 1.2.1** *Random Variable* (on page 16). The second is provided to us in **Definition 1.2.5** *Probability Mass Function* (on page 18).
1.
2.
::: {.discussion-question}
Suppose that a random variable is simple and discrete. For concreteness, you could think of this random variable as the answer to the question, "Is the grass wet outside?" (A small simulation sketch follows this question.)
1. What is the sample space?
2. What is a sensible function that you might use to map from the sample space to real values?
3. What is an insensible function that you might use to map from the sample space to real values? (A student well-seasoned in Maths might use (and define for the rest of the class) the concept of a *bijective function*.)
4. If you simply had the values that the random variable function maps to, are you guaranteed to be able to describe the entire sample space? Why or why not?
5. How would you go about determining the probability mass function for this random variable?
:::
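To make questions 2 and 5 concrete, here is a minimal simulation sketch. The outcome labels, the mapping `X`, and the probability of wet grass used by `sample` are all made-up choices for illustration -- they are not "the" answer to the question above.
```{r sketch a simple discrete random variable}
# a hypothetical sample space for "Is the grass wet outside?"
omega <- c('wet', 'dry')

# one sensible mapping from outcomes to real numbers
X <- function(outcome) ifelse(outcome == 'wet', 1, 0)

# simulate outcomes with a made-up probability of wet grass, then tabulate
# the simulated values of X as an approximation of the pmf
outcomes <- sample(omega, size = 10000, replace = TRUE, prob = c(0.3, 0.7))
table(X(outcomes)) / length(outcomes)
```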
### Functions of Functions
::: {.discussion-question name="Why Functions?"}
Why do we say that random variables are functions? Is there some useful property of these being functions rather than any other quantity? What else *could* they be if not a function?
:::
What about a function of a random variable, which is a function of a function?
::: {.definition name="Function of a Random Variable"}
Let $g : U \rightarrow \mathbb{R}$ be some function, where $X(\Omega) \subset U \subset \mathbb{R}$. Then, if $g \circ X : \Omega \rightarrow \mathbb{R}$ is a random variable, we say that $g$ is a *function* of X and write $g(X)$ to denote the random variable $g \circ X$.
:::
If a random variable is a function from the real world, or the sample space, or the outcome space to a real number, then what does it mean to define a function of a random variable?
- At what point does this function do its work? Does this function change the sample space that is possible to observe? Or, does this function change the real number that each outcome points to?
::: {.example name="MNIST"}
Suppose that you are doing some image processing work. To keep things simple, suppose that you are doing image classification in the style of the MNIST dataset.
- Can someone describe what this task is trying to accomplish?
- Has anyone done work like this?
However, suppose that rather than having good, clean indicators of whether a pixel is on or off, you instead have weak indicators -- there's a lot of grey. A lot of the cells are marked in the range $0.2 - 0.3$. (A small remapping sketch follows this example.)
1. How might creating a function that re-maps this grey into more extreme values help your model?
2. Is it possible to "blur" events that are in the outcome space? Does this "blurring" meet the requirements of a function of a random variable, as provided above?
:::
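As a point of departure for the first question, here is a minimal sketch of one possible remapping. The particular function (a logistic squash centered at 0.5) and the `steepness` value are arbitrary illustrative choices, not a recommended preprocessing step.
```{r sketch a pixel remapping function}
# g pushes middling grey intensities toward 0 or 1; steepness is arbitrary
g <- function(x, steepness = 10) {
  1 / (1 + exp(-steepness * (x - 0.5)))
}

# a few weak, grey-ish pixel intensities, before and after remapping
grey_pixels <- c(0.20, 0.25, 0.30, 0.50, 0.70)
round(g(grey_pixels), 3)
```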
### Probability Density Functions and Cumulative Distribution Functions
- What is a probability mass function?
- What do the **Kolmogorov Axioms** mean must be true about any probability mass function (*pmf*)?
::: {.example name="Berkeley Drivers, No Survivors"}
You should try driving in Berkeley some time. It is a **trip**! Without being deliberately ageist, the city is full of ageing hippies driving Subaru Outbacks and making what seem to be stochastic right-or-left turns to buy incense, pottery, or just sourdough bread.
Suppose that you are walking to campus, and you have to cross 10 crosswalks, each of which are spaced a block apart. Further, suppose that as you get closer to campus, there are fewer aging hippies, and therefore, there is decreasing risk that you're hit by a Subaru as you cross the street. Specifically, and fortunately for our math, the risk of being hit decreases linearly with each block that you cross.
Finally, campus provides you with last year's safety report, which records that there were 120 student-Subaru incidents out of 10,000 student-crosswalk crossings. (A small R sketch after this example shows one way to set up the pmf.)
1. What is the *pmf* for the probability that you are involved in a student-Subaru incident as you walk across these 10 blocks? What sample space, $\Omega$, is appropriate to represent this scenario?
2. Suppose that you don't leave your house -- this is a remote program after all! What is your cumulative probability of being involved in a student-Subaru incident?
3. What is the cumulative mass function (*cmf*) for the probability that you are involved in a student-Subaru incident?
4. Suppose that you live three blocks from campus, but your classmate lives five blocks from campus. What is the difference in the cumulative probability?
5. How would you describe the cumulative probability of being hit as you walk closer to campus? That is, suppose that you start 10 blocks away from campus, and are walking to get closer. Is your cumulative probability of being hit on your way to campus increasing or decreasing as you get closer to campus?
6. How would you describe the cumulative probability of being hit as you walk **further** from campus? That is, suppose that you start on campus, and you're walking to a bar after classes. Is your cumulative probability of being hit on your way away from campus increasing or decreasing as you get further from campus?
:::
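Here is a minimal sketch of one way to set up this pmf in R. It assumes that the per-block risk is proportional to how many blocks you still are from campus (so it decreases linearly) and that the ten per-block probabilities together account for the observed incident rate of $120/10000$; both readings of the problem are assumptions you should check against your own.
```{r sketch the crosswalk pmf}
# per-block risk proportional to distance from campus: 10, 9, ..., 1 blocks out
blocks_from_campus <- 10:1
risk_shape <- blocks_from_campus / sum(blocks_from_campus)

# scale so that the ten blocks together match the observed incident rate
p_incident_total <- 120 / 10000
pmf <- risk_shape * p_incident_total

# cumulative probability of an incident by the time you pass each crosswalk
cmf <- cumsum(pmf)
round(rbind(pmf, cmf), 4)
```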
## Discrete & Continuous Random Variables
What, if anything, is fundamentally different between discrete and continuous random variables? As a way of starting the conversation, consider the following cases:
- Suppose $X$ is a random variable that describes the time a student spends on w203 homework 1.
- If you have only granular measurement -- i.e. the number of nights spent working on the homework -- is this discrete or continuous?
- If you have the number of hours, is it discrete or continuous?
- If you have the number of seconds? Or milliseconds?
- Is it possible that $P(X = a) = 0$ for every point $a$? For example, that $P(X = 3600) = 0$.
- Does one of these measures have more *information* in it than another?
- How are the measurement choices that we make as designers of information capture systems -- i.e. the machine processes, human processes, or other processes that we are going to work with as data scientists -- reflected in the amount of information that is gathered, the type of information that is gathered, and the types of random variables that are manifest as a result?
## Moving Between PDF and CDF
The book defines the *pmf* and the *cmf* first, as a way of developing intuition and a way of reasoning about these concepts. It then moves to defining continuous density functions, which in many ways are easier to work with, although they are harder to reason about intuitively. Continuous distributions are defined in the book, and more generally, in terms of the *cdf*, the cumulative distribution function. There are technical reasons for this choice of definition, some of which are noted in the footnotes on the page where the book presents it.
More importantly for this course, in **Definition 1.2.15** the book defines the relationship between *cdf* and *pdf* in the following way:
::: {.definition name="Probability Density Function (PDF)"}
For a continuous random variable $X$ with CDF $F$, the *probability density function* of $X$ is
$$
f(x) = \left. \frac{d F(u)}{du} \right|_{u=x}, \forall x \in \mathbb{R}.
$$
:::
- How does this definition, which relates *pdf* and *cdf* by a means of differentiation and integration, fit with the ideas that we just developed in the context of walking to and from campus?
::: {.example name="Working with a continuous pdf and cdf"}
Suppose that you learn that a particular random variable, $X$, has the following function that describes its *pdf*: $f_{X}(x) = \frac{1}{10}x$. Also, suppose that you know that the smallest value that it is possible for this random variable to attain is 0. (A numerical sketch after this example shows one way to check your answers.)
1. What is the CDF of $X$?
2. What is the maximum possible value that $x$ can obtain? How did you develop this answer, using the Kolmogorov axioms of probability?
3. What is the cumulative probability of an outcome up to 0.5?
4. What is the probability of an outcome between 0.25 and 0.75? Produce an answer to this in two ways:
1. Using the $pdf$
2. Using the $cdf$
:::
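For checking your pencil-and-paper answers, here is a minimal numerical sketch. It uses base R's `integrate` and `uniroot`, and it treats the upper end of the support as the value at which the total probability reaches one -- an assumption that follows from the Kolmogorov axioms rather than from anything stated directly in the problem.
```{r sketch numerical checks for the continuous pdf}
pdf_X <- function(x) x / 10

# find the upper end of the support such that total probability equals 1
total_probability_minus_one <- function(b) {
  integrate(pdf_X, lower = 0, upper = b)$value - 1
}
upper_bound <- uniroot(total_probability_minus_one, interval = c(0.1, 10))$root
upper_bound

# cumulative probability up to 0.5, and probability of an outcome in (0.25, 0.75)
integrate(pdf_X, lower = 0, upper = 0.5)$value
integrate(pdf_X, lower = 0.25, upper = 0.75)$value
```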
## Joint Density
Working with a single random variable helps to develop our understanding of how to relate the different features of a *pdf* and a *cdf* through differentiation and integration. However, there's not really *that* much else that we can do; and, there is probably very little in our professional worlds that would look like a single random variable in isolation.
We really start to get to something useful when we consider joint density functions. Joint density functions describe the probability that *both* of two random variables take particular values. That is, if we are working with random variables $X$ and $Y$, then the joint density function provides a probability statement for $P(X \cap Y)$.
In this course, we might typically write this joint density function as $f_{X,Y}(x,y) = f(\cdot)$ where $f(\cdot)$ is the actual function that represents the joint probability. The $f(\cdot)$ means, essentially, "some function" where we just have not designated the specifics of the function; you might think of this as a generic function.
### Example: Uniform Joint Density
Suppose that we know that two variables, $X$ and $Y$, are jointly uniformly distributed within the *support* $x \in [0,4], y \in [0,4]$. We have a requirement, imposed by the *Kolmogorov Axioms*, that all probabilities must be non-negative, and that the total probability across the whole support must be one. (A small numerical sketch after these questions illustrates the joint-to-marginal move.)
- Can you use these facts to determine answers to the following:
- What kind of shape does this joint *pdf* have?
- What is the specific function that describes this shape?
- If you draw this shape on three axes -- $X$, $Y$, and $P(X,Y)$ -- what does this plot look like?
- How do you get from the joint density function, to a marginal density function for $X$?
- How do you get from the joint density function, to a marginal density function for $Y$?
- How do you get from these marginal density functions of $X$ and $Y$ back to the joint density? Is this always possible?
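Here is a minimal numerical sketch of the joint-to-marginal move for this uniform case. The constant height of $1/16$ and the grid spacing are assumptions made for illustration; you should verify the height yourself from the Kolmogorov requirements above.
```{r sketch the uniform joint density and its marginals}
# evaluate the (assumed) joint density of 1/16 on a grid over [0,4] x [0,4]
step <- 0.01
grid_xy <- expand.grid(x = seq(0, 4, by = step), y = seq(0, 4, by = step))
grid_xy$f_xy <- 1 / 16

# marginal density of X: integrate (here, approximate with a sum) over y
marginal_x <- grid_xy %>%
  group_by(x) %>%
  summarise(f_x = sum(f_xy) * step)
head(marginal_x)  # f_x should be (approximately) constant at 1/4
```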
### Examples: Thinking Through Many Plots
An alumnus of the MIDS program, and a former instructor of this course, [Todd Young](https://www.linkedin.com/in/dtoddyoung/) built this nifty tool that lets us consider several different joint probability functions.
As a class, let's consider a few of these PDFs, beginning with this "triangle" distribution.
```{r, fig.width=8}
knitr::include_app('http://www.statistics.wtf/PDF_Explorer/', height="1000px")
```
### Triangle Math
After considering the intuition for the triangle distribution, do the following: write down the function that accords with the figure that you're seeing above.^[Notice that, in general, this kind of *curve fitting* isn't really a common data science task. Instead, this is just a learning task that lets the class assess their understanding of the definitions of random variables.]
- What is a full statement of the PDF of this image?
- What is the marginal distribution of $X$, $f_{X}(x)$?
- What is the marginal distribution of $Y$, $f_{Y}(y)$?
- Using the definition of independence, are $X$ and $Y$ independent of each other?
- What is the CDF of $X$, $F_{X}(x)$?
### Saddle Sores
Suppose that you know that two random variables, $X$ and $Y$ are jointly distributed with the following *pdf*:
\[
f_{X,Y}(x,y) =
\begin{cases}
a \cdot x^{2} y^{2} & 0 < x < 1, 0 < y < 1 \\
0 & otherwise
\end{cases}
\]
This joint pdf is similar to the pdf that you can visualize above, under the distribution called "saddle". The difference between this function and the image above is that this function bounds the support of $x$ and $y$ to the range $[0,1]$. This is to make the math easier for us in the next step. (A numerical sketch after these questions shows one way to check the normalizing constant.)
- Can you use these facts to determine the following?
- What value of $a$ makes this a valid joint pdf?
- What is the marginal pdf of $X$? That is, what is $f_{X}(x)$?
- What is the conditional pdf of $X$ given $Y$? That is, what is $f_{X|Y}(x|y)$?
- Given these facts, would you say that $X$ and $Y$ are dependent or independent?
- If the support for this joint distribution were instead $[0,4]$ (rather than $[0,1]$), how would the shape of the distribution change?
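For a brute-force check of the normalizing constant, here is a minimal sketch that performs the double integral numerically with `integrate`. It does not replace the analytic calculation; it only gives you a number to compare your answer against.
```{r sketch a numerical check of the saddle constant}
# the un-normalized joint density on the unit square
f_unnormalized <- function(x, y) x^2 * y^2

# integrate over x for each fixed y, then integrate that result over y
inner_over_x <- function(y) {
  sapply(y, function(y_fixed) {
    integrate(function(x) f_unnormalized(x, y_fixed), lower = 0, upper = 1)$value
  })
}
total_mass <- integrate(inner_over_x, lower = 0, upper = 1)$value

# the constant a must rescale this total mass to one
a <- 1 / total_mass
a
```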
## Computing Different Distributions
Suppose that random variables $X$ and $Y$ are jointly continuous, with joint density function given by,
$$
f(x,y) =
\begin{cases}
c, & 0 \leq x \leq 1, 0 \leq y \leq x \\
0, & otherwise
\end{cases}
$$
where $c$ is a constant.
1. Draw a graph showing the region of the $X$-$Y$ plane with positive probability density. (A plotting sketch after these questions shows one way to shade such a region.)
2. What is the constant $c$?
3. Compute the marginal density function for $X$. (Be sure to write a complete expression)
4. Compute the conditional density function for $Y$, conditional on $X=x$. (Be sure to specify for what values of $x$ this is defined)
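Because question 1 asks for a picture, here is a minimal ggplot sketch that shades the region where the density is positive. The triangle's corners are read directly off the support conditions $0 \leq x \leq 1$ and $0 \leq y \leq x$; the sketch deliberately does not compute $c$ or the requested densities.
```{r sketch the region of positive density, fig.width=4, fig.height=4}
# the support 0 <= y <= x <= 1 is the triangle with these three corners
support_triangle <- data.frame(
  x = c(0, 1, 1),
  y = c(0, 0, 1)
)

ggplot(support_triangle) +
  aes(x = x, y = y) +
  geom_polygon(alpha = 0.5) +
  lims(x = c(0, 1), y = c(0, 1)) +
  labs(title = 'Region with Positive Density')
```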
## Conditional Probability
Conditional probability is **incredible**. In fact, without exaggeration, almost **all** of data science is an exercise in making statements about conditional probability distributions. *Don't believe us?*
- What is the goal of a "customer churn" model or a conversion model?
- What is the goal of a language-completion model?
- What is the goal of a flight-departures model?
::: {.discussion-question}
**If** we possessed complete information about a process -- **if** we had the CDF that governed the probability of occurrences -- what kinds of statements would we be able to make? Would we even need data?
:::
::: {.discussion-question}
Using the distribution above, produce a statement of conditional probability, $f_{Y|X}(y|x)$. (The general relationship between joint, marginal, and conditional densities is restated just after this question.)
:::
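As a reminder of the machinery behind the question above, the conditional density is built from the joint and marginal densities. This is the general relationship, not the answer to the specific exercise:
$$
f_{Y|X}(y|x) = \frac{f_{X,Y}(x,y)}{f_{X}(x)}, \quad \text{wherever } f_{X}(x) > 0.
$$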
## Visualizing Distributions Via Simulation
To this point in the course, we have focused on concepts in "the population" with no reference to samples. This is on purpose! We want to develop the theory that defines the **best possible** predictor if we knew **everything** (if we know the formula of the function that maps $\omega \rightarrow \mathbb{R}$, and we know the probability of each $\omega \in \Omega$, then we know everything). Beginning in week 5 of the course, we will talk about "approximating" (which we will call estimating) this best possible predictor with a limited sample of data.
However, at this point, to help build your working understanding, or intuition, for what is happening, we are going to work on a way to *simulate* draws from a population. In some places, people might refer to these as *Monte Carlo* methods -- this is because the method was developed by von Neumann \& Ulam during World War II, and they needed a way to talk about it using a code name. They chose *Monte Carlo* after a famous casino in Monaco.
### Example: The Uniform Distribution
> You: "Gosh. There sure are a lot of examples that use the uniform distribution. That must be a really important statistical distribution."
>
> Instructor: "Nah. Not really. We're just using the uniform a bunch so that we don't get too lost in doing math while we're working with these concepts."
We'll start with a simple uniform distribution, but then we'll make it a little more complex in a moment.
We can use R to simulate draws from a probability distribution function by providing it with the name of the distribution that we're considering, the support of that distribution, or other features of the distribution. In the case of the uniform, the entire distribution can be described just from its support.
So, suppose that you had a uniform distribution that had positive probability on the range $[1.1, 4.3]$. Why these values? No particular reason. That is, suppose
\[
f_{X}(x) = \begin{cases}
a & 1.1 \leq x \leq 4.3 \\
0 & otherwise
\end{cases}
\]
What does this distribution "look like"? Because it is a uniform, you might have a sense that it will be a horizontal line. But, what is the height of that line? Aha! We could do the math to figure it out, or we could generate an approximation using a simulation.
In the code below, we are going to create an object called `samples_uniform` that stores the results of the `runif` function call.
```{r create uniform samples}
# draw 1,000 samples from a Uniform(min = 1.1, max = 4.3) distribution
samples_uniform <- runif(n=1000, min=1.1, max=4.3)
```
What is happening inside `runif`?
When you're writing your own code, you can pull up the documentation for this (and any) function using a question mark, i.e. `?`, followed by the function name -- `?runif`.
But, we can speed this up slightly by simply telling you that `n` is the number of samples to take from the population; `min` is the low-end of the support, and `max` is the high-end of the support.
If we look into this object, we can see the results of the function call. Below, we will show the first $20$ elements of the `samples_uniform` object.
```{r show first 20 results}
samples_uniform[1:20]
```
(Notice that R is a $1$-indexed language; Python is a zero-indexed language.)
With this object created, we can plot a density of the data and then learn from this plot what the pdf looks like.
```{r plot uniform samples}
plot_full_data <- ggplot() +
  aes(x=1:length(samples_uniform), y=samples_uniform) +
  geom_point() +
  labs(
    title = 'Showing the Data',
    y = 'Sample Value',
    x = 'Index')
plot_density <- ggplot() +
  aes(x=samples_uniform) +
  geom_density(bw=0.1) +
  labs(
    title = 'Showing the PDF',
    y = 'Density',
    x = 'Sample Value')
(plot_full_data | (plot_density + coord_flip())) /
  plot_density
```
Interesting. From what we can see here, there does not appear to be any discernible pattern. This leaves us with two options: either we might reduce the resolution that we're using to view this pattern, or we might take more samples and hold the resolution constant. Below, two different plots show these differing approaches, and are *very* explicit about the code that creates them.
```{r create more data}
samples_uniform_moar <- runif(n=1000000, min=1.1, max=4.3)
```
```{r plot uniform distributions}
plot_low_res <- ggplot() +
  aes(x=samples_uniform) +
  geom_density(bw=0.1) +
  lims(y=c(0,0.4)) +
  labs(title = 'Low Res, Low Data')
plot_high_res <- ggplot() +
  aes(x=samples_uniform_moar) +
  geom_density(bw=0.01) +
  lims(y=c(0,0.4)) +
  labs(title = 'High Res, More Data')
plot_low_res | plot_high_res
```
### Example: The Normal Distribution
Folks might have some prior beliefs about the Normal distribution. Don't worry, we'll cover this later in the course. But, this is the distribution that you have in mind when you're thinking of a "bell curve".
We can use the same method to visualize a normal distribution as we did for a uniform distribution. In this case, we would issue the call `rnorm`, together with the population parameters that define the population. At this point in the course, we do not expect that you will know these (and, actually, memorizing these facts is not a core focus of the course), but you can [look them up](https://en.wikipedia.org/wiki/Normal_distribution) if you like. Truthfully, statistics Wikipedia is *very* good.
Do you notice anything about the `runif` and the `rnorm` calls that we have identified? Both seem to name the distribution -- $unif \approx uniform$ and $norm \approx normal$ -- but prepended with an `r`. This stands for "random draw".
Base R is loaded with a *pile* of basic statistics distributions, which you can look into using `?distributions`.
```{r draw many samples from a normal distribution}
# draw 100,000 samples from a Normal(mean = 18, sd = 4) distribution
samples_normal <- rnorm(n=100000, mean=18, sd=4)
```
Like before, we could look at the first $20$ of these samples.
```{r spot-check first 20}
samples_normal[1:20]
```
And, from here we could visualize this distribution.
```{r plot normal}
ggplot() +
  aes(x=samples_normal) +
  geom_density() +
  labs(title='Visualization of this Normal Distribution')
```
#### Combining This Ability
::: {.discussion-question}
Consider three random variables $A, B, C$. Suppose,
\[
\begin{aligned}
A & \sim Uniform(min=1.1, max=4.3) \\
B & \sim Normal(mean=18, sd=4) \\
C & = A + B
\end{aligned}
\]
That is, $A$ is the uniform random variable and $B$ is the normal random variable that we considered earlier. Suppose that $A$ and $B$ are independent of each other.
Finally, suppose that $C$ is their sum, $C = A + B$.
What does $C$ look like?
:::
Although this is a simple function applied to random variables -- a legal move -- the math would be tedious. What if, instead, we used this simulation method to get a sense of the distribution?
```{r create C}
# independent draws from A and B; C is their element-wise sum
samples_A <- runif(n=10000, min=1.1, max=4.3)
samples_B <- rnorm(n=10000, mean=18, sd=4)
samples_C <- samples_A + samples_B
```
```{r plot C}
plot_C <- ggplot() +
  aes(x=samples_C) +
  geom_density()
# overlay the three densities: A in Berkeley blue, B in gold, C in dark red
plot_C_and_A_and_B <- ggplot() +
  geom_density(aes(x=samples_A), color = '#003262') +
  geom_density(aes(x=samples_B), color = '#FDB515') +
  geom_density(aes(x=samples_C), color = 'darkred')
plot_C_and_A_and_B
```
## Review of Terms
Remember some of the key terms we learned in the async:
- Joint Density Function
- Conditional Distribution
- Marginal Distribution
Explain each of these three in terms of the cake metaphor.