-
Notifications
You must be signed in to change notification settings - Fork 0
/
14-maximum-likelihood.Rmd
302 lines (211 loc) · 10.9 KB
/
14-maximum-likelihood.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
# Maximum Likelihood Estimation
![salvation mountain](./images/salvation_mountain.jpg)
## Learning Objectives
1.
2.
3.
## Class Announcements
## Roadmap
**Rearview Mirror: What We've Seen**
- **WLLN**: $\displaystyle\lim_{n \to \infty} \overline{X}_n \overset{p}{=} E[X]$
- **CLT** $\displaystyle\lim_{n \to \infty} \bar{X}_n \overset{d}{=} N(E[X], \text{Var}[X])$
**Today**
- Use maximum likelihood to generate a good guess for model parameters;
- Use a confidence interval to indicate a range of plausible parameter values
## What is a model?
- A data science model is:
- A representation of the world built from random variables
- FOIS: "agnostic" models place minimal restrictions on joint distribution
- Parametric models (i.e. MLE) are models based on a family of distributions.
- $f_{Y|X}(y|\mathbf{x}) \sim g(y, \mathbf{x}; \mathbf{\theta})$
## Estimation
- We have the tools to use data to infer information about the (joint) distribution
- Because the joint distribution is complicated, we'll usually estimate simpler summaries of the joint distribution -- e.g. $E[X]$, $V[X]$, $E[Y|X]$, $Cov[X,Y]$
- There are a number of techniques that you can use to develop an estimator for a parameter. These techniques vary in terms of the principle used to arrive at the estimator and the strength of the assumptions needed to support it.
- However, all of these estimators are statistics meaning they are functions of the data $\{X_i\}_{i=1}^n$
## Discussion of Maximum Likelihood Estimation
1. What is the goal of estimating a parameter? Why is this something that we are interested in as data scientists?
2. In your own words, describe how the method of maximum likelihood is used to estimate the unknown parameters.
3. Why does a likelihood function have a $\Pi$ (product operator) within it?
4. Is it possible to estimate using maximum likelihood without writing down a model for the data?
5. What happens if your model for the data is wrong? Are your estimates for the parameters "incorrect"? Or, are they "correct" within the context of the model that you've written down?
## Optimization in R
- The method of maximum likelihood requires an optimization routine.
- For a few very simple probability models, a closed-form solution exists and the MLE can be derived by hand. (This is also *potentially* the case for OLS regression.)
- But, instead lets use some machine learning to find the estimates that maximize the likelihood function.
- There are many optimizers (e.g. `optimize`, and `optim`). `optimize` is the simplest to use, but only works in one dimension.
### Optimization Example: Optimum Price
- Suppose that a firm's profit from selling a product is related to price, $p$, and cost, $c$, as follows:
$$
\text{profit} = (p - p^2) - c + 100
$$
1. Explain how you would use calculus to find the maximizing price. Assume that cost is fixed.
2. What is the firms revenue as `p=0, cost = 2`? What is it at `p=10, cost = 2`?
3. Create a plot with the following characteristics:
- On the x-axis is a sequence (`seq()`) of prices from [0, 10].
- On the y-axis is the revenue as a function of those prices. Hold cost constant at `c=2`.
- What does the best price seem to be?
4. Solve this numerically in *R*, using the *optimize()* function.
- Take note: using the default arguments, will `optimize` try to find a maximum or a minimum?
- Check into the help documentation.
```{r define profit function}
profit <- function(p, c) {
r = (p - p^2) - c + 100
return(r)
}
```
```{r compute profit for specific parameter values}
profit(p=2, c=2)
```
```{r use optimize}
best_price <- optimize(
profit, # profit is the function
lower = 0, upper = 1000, # this is the low and high we consider
c = 2, # here, we're passing cost into profit
maximum = TRUE) # we'd like to maximize, not minimize
best_price
```
## MLE for Poisson Random Variables
- Suppose we use a camera to record an intersection for a particular length of time, and we write down the number of cars accidents in that interval.
- This process can be modeled by a *Poisson* random variable (now we are non-agnostic), that has a well-known probability mass function given by,
$$
f(x;\lambda) = \frac{\lambda^x e^{-\lambda}}{x!}
$$
Here is an example of a string of outcomes generated by a Poisson RV, with parameter $\lambda = 2$.
```{r}
rpois(n = 10, lambda = 2)
```
### MLE for Poisson Random Variables: Data
- Suppose that we conduct an iid sample, and gather the following number of accidents. (It is a busy street!)
```{r}
data <- c(
2, 6, 2, 1, 3, 3, 4, 4, 24, 1, 5, 4, 5, 1, 2, 2, 5, 2, 1, 5,
2, 1, 2, 9, 9, 1, 3, 2, 1, 1, 3, 1, 3, 2, 2, 4, 1, 1, 5, 3,
3, 2, 2, 1, 1, 1, 5, 1, 3, 1, 1, 1, 1, 2, 2, 4, 2, 1, 2, 2,
3, 1, 2, 6, 2, 2, 3, 2, 3, 5, 1, 3, 2, 5, 2, 1, 3, 2, 1, 2,
4, 2, 6, 1, 2, 2, 3, 5, 2, 1, 4, 2, 2, 1, 3, 2, 2, 4, 1, 1,
1, 1, 2, 3, 5, 1, 2, 2, 3, 1, 4, 1, 3, 2, 2, 2, 2, 2, 2, 3,
3, 1, 1, 2, 2, 4, 1, 5, 2, 7, 5, 2, 3, 2, 5, 3, 1, 2, 1, 1,
2, 3, 1, 5, 3, 4, 6, 3, 3, 2, 2, 1, 2, 2, 4, 2, 3, 4, 3, 1,
6, 3, 1, 2, 3, 2, 2, 3, 1, 1, 1, 1, 1, 10, 3, 2, 1, 1, 3, 2,
2, 3, 1, 1, 2, 2, 2, 4, 2, 2, 3, 3, 6, 1, 3, 2, 3, 2, 2, 2
)
table(data)
```
### MLE Estimation
- Use the data that is stored in `data`, together with a Poisson model to estimate the $\lambda$ values that produce the "good" model from the Poisson family.
- That is, use MLE to estimate $\lambda$.
- Here is your work flow:
1. Define your random variables.
2. Write down the likelihood function for a sample of data that is generated by a *Poisson* process.
3. To make the math easier, take the log of this likelihood function.
4. Optimize this log-likelihood using calculus -- what is the value of $\lambda$ that results? Compute this value, given the data that you have.
5. Maximize this log-likelihood numerically, and report the value for $\lambda$ that produces the highest likelihood of seeing this data.
6. Comment on your answers from parts 4 and 5. Are you surprised or not by what you see?
```{r writing poisson: students will write this}
poisson_ll <- function(data, lambda) {
## fill this in:
lambda # this is a placeholder, change this
}
```
```{r plot the likelihood}
search_space <- seq(0,100, by = 0.1)
plot(
x = search_space, xlab = 'Search Space',
y = poisson_ll(data=data, lambda=search_space), ylab = 'Log Likelihood',
type = 'l'
)
```
```{r pass poisson into optimizer}
# optimize(poisson_ll, lower = 0, upper = 100, data = data, maximum = TRUE)
```
## Confidence Intervals
This exercise is meant to demonstrate what the confidence level in a confidence interval represents. We will assume a standard normal population distribution and simulate what happens when we draw a sample and compute a confidence interval.
Your task is to complete the following function so that it,
1) simulates and stores` `n draws from a standard normal distribution
2) based on those draws, computes a valid confidence interval with confidence level $\alpha$, a parameter that you pass to the function.
Your function should return a vector of length 2, containing the lower bound and upper bound of the confidence interval.
$$
CI_{\alpha} = \overline{X} \pm t_{\alpha/2} \cdot \frac{s}{\sqrt{n}}
$$
where:
- $CI_\alpha$ is the confidence interval that you're seeking to produce
- $\overline{X}$ is the sample average,
- $t_{\alpha/2}$ is your critical value (accessible through `qt`),
- and $s$ is your sample standard deviation. Notice that you'll need each of these pieces in the code that you're about to write.
```{r}
sim_conf_int <- function(n, alpha) {
# Fill in your code to:
# 1. simulate n draws from a standard normal dist.
# 2. compute a confidence interval with confidence level alpha
sample_draws <- 'fill this in'
sample_mean <- 'fill this in'
sample_sd <- 'fill this in'
critical_t <- 'fill this in'
ci_95 <- 'fill this in'
return(ci_95)
}
sim_conf_int(n = 100, alpha = 0.25)
```
When your function is complete, you can use the following code to run your function 100 times and plot the results.
```{r make confidence intervals data frame)}
many_confidence_intervals <- function(num_simulations, n, alpha) {
## args:
## - num_simulations: the number of simulated confidence intervals
## - n: the number of observations in each simulation that will pass
## into your `sim_conf_int` function
## - alpha: the confidence interval that you will pass into
## your `sim_conf_int` function
results <- NULL
for(i in 1:num_simulations) {
interval = sim_conf_int(n, alpha)
results = rbind(results, c(interval[1], interval[2], interval[1]<0 & interval[2]>0))
}
resultsdf = data.frame(results)
names(resultsdf) = c("low", "high", "captured")
return(resultsdf)
}
n = 20
confidence_intervals = many_confidence_intervals(100, n, .05)
```
```{r}
plot_many_confidence_intervals <- function(c) {
plot(NULL, type = "n",
xlim = c(1,100), xlab = 'Trial',
ylim = c(min(c$low), max(c$high)), ylab=expression(mu),pch=19)
abline(h = 0, col = 'gray')
abline(h = qt(0.975, n-1)/sqrt(n), lty = 2, col = 'gray')
abline(h = qt(0.025, n-1)/sqrt(n), lty = 2, col = 'gray')
points(c$high, col = 2+c$captured, pch = 20)
points(c$low, col = 2+c$captured, pch = 20)
for(i in 1:nrow(c)) {
lines(c(i,i), c(c$low[i],c$high[i]), col = 2+c$captured[i], pch = 19)
}
title(expression(paste("Simulation of t-Confidence Intervals for ", mu,
" with Sample Size 20")))
legend(0,-.65, legend = c(expression(paste(mu," Captured")),
expression(paste(mu," Not Captured"))), fill = c(3,2))
}
```
```{r use plot_many_confidence_intervals}
# plot_many_confidence_intervals(confidence_intervals)
```
1. How many of the simulated confidence intervals contain the true mean, zero?
2. Suppose you run a single study. Based on what you've just written above, why is it incorrect to say that, "There is a 95% probability that the true mean is inside this (single) confidence interval"?
## Maximum Likelihood Example: Printers
**Part I**
Suppose that you’ve got a particular sequence of values: ${1, 0, 0, 1, 0, 1, 1, 1, 1, 1}$ that indicate whether a printer any particular time you try to print.
You have data from the last 10 times you tried.
**Question**:
- What is the probability ($p$) that the printer jams on the next print job?
![bbc, office space](./images/office_space.jpg){width=80%}
**Part II**
The data resembles draws from a Bernoulli distribution.
However, even if we want to model this as a Bernoulli distribution, we do not know the value of the parameter, $p$.
1- Define your random variable.
2- Write down the likelihood function
3- If it will make the math easier, log the likelihood function.
4- *Path 1:* Maximize the likelihood using calculus
5- *Path 2:* Maximize using numeric methods.
```{r}
```