14-maximum-likelihood.Rmd

# Maximum Likelihood Estimation
![salvation mountain](./images/salvation_mountain.jpg)

## Learning Objectives 

1. 
2. 
3. 

## Class Announcements

## Roadmap
**Rearview Mirror: What We've Seen**

- **WLLN**:  $\displaystyle\lim_{n \to \infty} \overline{X}_n \overset{p}{=} E[X]$ 
- **CLT** $\displaystyle\lim_{n \to \infty} \bar{X}_n \overset{d}{=}  N(E[X], \text{Var}[X])$ 


**Today**

- Use maximum likelihood to generate a good guess for model parameters; 
- Use a confidence interval to indicate a range of plausible parameter values

## What is a model? 

- A data science model is: 
  - A representation of the world built from random variables
  - FOIS: "agnostic" models place minimal restrictions on joint distribution
  - Parametric models (i.e. MLE) are models based on a family of distributions. 
  - $f_{Y|X}(y|\mathbf{x}) \sim g(y, \mathbf{x}; \mathbf{\theta})$

 
## Estimation 

- We have the tools to use data to infer information about the (joint) distribution
- Because the joint distribution is complicated,  we'll usually estimate simpler summaries of the joint distribution -- e.g. $E[X]$, $V[X]$, $E[Y|X]$, $Cov[X,Y]$

- There are a number of techniques that you can use to develop an estimator for a parameter. These techniques vary in terms of the principle used to arrive at the estimator and the strength of the assumptions needed to support it. 

- However, all of these estimators are statistics meaning they are functions of the data $\{X_i\}_{i=1}^n$ 

## Discussion of Maximum Likelihood Estimation 

1. What is the goal of estimating a parameter? Why is this something that we are interested in as data scientists? 
2. In your own words, describe how the method of maximum likelihood is used to estimate the unknown parameters. 
3. Why does a likelihood function have a $\Pi$ (product operator) within it? 
4. Is it possible to estimate using maximum likelihood without writing down a model for the data? 
5. What happens if your model for the data is wrong? Are your estimates for the parameters "incorrect"? Or, are they "correct" within the context of the model that you've written down? 

## Optimization in R 

- The method of maximum likelihood requires an optimization routine. 
- For a few very simple probability models, a closed-form solution exists and the MLE can be derived by hand. (This is also *potentially* the case for OLS regression.) 
- But, instead lets use some machine learning to find the estimates that maximize the likelihood function. 
- There are many optimizers (e.g. `optimize`, and `optim`). `optimize` is the simplest to use, but only works in one dimension. 

### Optimization Example: Optimum Price 

- Suppose that a firm's profit from selling a product is related to price, $p$, and cost, $c$, as follows:

$$ 
\text{profit} = (p - p^2) - c + 100
$$


1. Explain how you would use calculus to find the maximizing price. Assume that cost is fixed. 
2. What is the firms revenue as `p=0, cost = 2`? What is it at `p=10, cost = 2`? 
3. Create a plot with the following characteristics:

   - On the x-axis is a sequence (`seq()`) of prices from [0, 10]. 
   - On the y-axis is the revenue as a function of those prices. Hold cost constant at `c=2`.  
   - What does the best price seem to be? 
4. Solve this numerically in *R*, using the *optimize()* function. 

   - Take note: using the default arguments, will `optimize` try to find a maximum or a minimum? 
   - Check into the help documentation. 

```{r define profit function}
profit <- function(p, c) { 
  r = (p - p^2) - c + 100
  return(r) 
  } 
```

```{r compute profit for specific parameter values} 
profit(p=2, c=2)
```


```{r use optimize}
best_price <- optimize(
  profit,                    # profit is the function
  lower = 0, upper = 1000,   # this is the low and high we consider
  c = 2,                     # here, we're passing cost into profit
  maximum = TRUE)            # we'd like to maximize, not minimize
best_price
```


## MLE for Poisson Random Variables

- Suppose we use a camera to record an intersection for a particular length of time, and we write down the number of cars accidents in that interval.  
- This process can be modeled by a *Poisson* random variable (now we are non-agnostic), that has a well-known probability mass function given by,

$$
f(x;\lambda) = \frac{\lambda^x e^{-\lambda}}{x!}
$$

Here is an example of a string of outcomes generated by a Poisson RV, with parameter $\lambda = 2$. 

```{r}
rpois(n = 10, lambda = 2)
```

### MLE for Poisson Random Variables: Data 

- Suppose that we conduct an iid sample, and gather the following number of accidents. (It is a busy street!) 

```{r}
data <- c(
  2, 6, 2, 1, 3, 3, 4, 4, 24, 1, 5, 4, 5, 1,  2, 2, 5, 2, 1, 5, 
  2, 1, 2, 9, 9, 1, 3, 2, 1,  1, 3, 1, 3, 2,  2, 4, 1, 1, 5, 3, 
  3, 2, 2, 1, 1, 1, 5, 1, 3,  1, 1, 1, 1, 2,  2, 4, 2, 1, 2, 2, 
  3, 1, 2, 6, 2, 2, 3, 2, 3,  5, 1, 3, 2, 5,  2, 1, 3, 2, 1, 2, 
  4, 2, 6, 1, 2, 2, 3, 5, 2,  1, 4, 2, 2, 1,  3, 2, 2, 4, 1, 1, 
  1, 1, 2, 3, 5, 1, 2, 2, 3,  1, 4, 1, 3, 2,  2, 2, 2, 2, 2, 3, 
  3, 1, 1, 2, 2, 4, 1, 5, 2,  7, 5, 2, 3, 2,  5, 3, 1, 2, 1, 1, 
  2, 3, 1, 5, 3, 4, 6, 3, 3,  2, 2, 1, 2, 2,  4, 2, 3, 4, 3, 1, 
  6, 3, 1, 2, 3, 2, 2, 3, 1,  1, 1, 1, 1, 10, 3, 2, 1, 1, 3, 2, 
  2, 3, 1, 1, 2, 2, 2, 4, 2,  2, 3, 3, 6, 1,  3, 2, 3, 2, 2, 2
  )

table(data)
```

### MLE Estimation 

- Use the data that is stored in `data`, together with a Poisson model to estimate the $\lambda$ values that produce the "good" model from the Poisson family. 
- That is, use MLE to estimate $\lambda$. 

- Here is your work flow: 

  1. Define your random variables.
  2. Write down the likelihood function for a sample of data that is generated by a *Poisson* process. 
  3. To make the math easier, take the log of this likelihood function. 
  4. Optimize this log-likelihood using calculus -- what is the value of $\lambda$ that results? Compute this value, given the data that you have. 
  5. Maximize this log-likelihood numerically, and report the value for $\lambda$ that produces the highest likelihood of seeing this data. 
  6. Comment on your answers from parts 4 and 5. Are you surprised or not by what you see? 

```{r writing poisson: students will write this}
poisson_ll <- function(data, lambda) { 
  ## fill this in:  
  lambda # this is a placeholder, change this
}
```

```{r plot the likelihood}
search_space <- seq(0,100, by = 0.1)
plot(
  x = search_space, xlab = 'Search Space',
  y = poisson_ll(data=data, lambda=search_space), ylab = 'Log Likelihood',
  type = 'l'
)
```

```{r pass poisson into optimizer}
# optimize(poisson_ll, lower = 0, upper = 100, data = data, maximum = TRUE)
```

## Confidence Intervals 

This exercise is meant to demonstrate what the confidence level in a confidence interval represents. We will assume a standard normal population distribution and simulate what happens when we draw a sample and compute a confidence interval.

Your task is to complete the following function so that it,

1) simulates and stores` `n draws from a standard normal distribution
2) based on those draws, computes a valid confidence interval with confidence level $\alpha$, a parameter that you pass to the function.

Your function should return a vector of length 2, containing the lower bound and upper bound of the confidence interval.

$$
CI_{\alpha} = \overline{X} \pm t_{\alpha/2} \cdot \frac{s}{\sqrt{n}}
$$ 

where: 

- $CI_\alpha$ is the confidence interval that you're seeking to produce
- $\overline{X}$ is the sample average,  
- $t_{\alpha/2}$ is your critical value (accessible through `qt`), 
- and $s$ is your sample standard deviation. Notice that you'll need each of these pieces in the code that you're about to write. 

```{r}
sim_conf_int <- function(n, alpha) {
  # Fill in your code to: 
  # 1. simulate n draws from a standard normal dist.
  # 2. compute a confidence interval with confidence level alpha
  
  sample_draws <- 'fill this in'
  sample_mean  <- 'fill this in'
  sample_sd    <- 'fill this in'
  
  critical_t <- 'fill this in'
  
  ci_95 <- 'fill this in'
  
  return(ci_95)  
}

sim_conf_int(n = 100, alpha = 0.25)
```

When your function is complete, you can use the following code to run your function 100 times and plot the results.

```{r make confidence intervals data frame)}
many_confidence_intervals <- function(num_simulations, n, alpha) {
  ## args: 
  ##  - num_simulations: the number of simulated confidence intervals 
  ##  - n: the number of observations in each simulation that will pass 
  ##           into your `sim_conf_int` function
  ##  - alpha: the confidence interval that you will pass into 
  ##           your `sim_conf_int` function
  
  results <- NULL
  for(i in 1:num_simulations) {
    interval = sim_conf_int(n, alpha)
    results = rbind(results, c(interval[1], interval[2], interval[1]<0 & interval[2]>0))
  }
  resultsdf = data.frame(results)
  names(resultsdf) = c("low", "high", "captured")
  return(resultsdf)
}

n = 20
confidence_intervals = many_confidence_intervals(100, n, .05)
```

```{r}
plot_many_confidence_intervals <- function(c) { 
  plot(NULL, type = "n",
       xlim = c(1,100), xlab = 'Trial',
       ylim = c(min(c$low), max(c$high)), ylab=expression(mu),pch=19)

  abline(h = 0, col = 'gray')
  abline(h = qt(0.975, n-1)/sqrt(n), lty = 2, col = 'gray')
  abline(h = qt(0.025, n-1)/sqrt(n), lty = 2, col = 'gray')
  
  points(c$high, col = 2+c$captured, pch = 20)
  points(c$low,  col = 2+c$captured, pch = 20)
  for(i in 1:nrow(c)) {
    lines(c(i,i), c(c$low[i],c$high[i]), col = 2+c$captured[i], pch = 19)
  }
  
  title(expression(paste("Simulation of t-Confidence Intervals for ", mu,
                          " with Sample Size 20")))

  legend(0,-.65, legend = c(expression(paste(mu," Captured")),
                             expression(paste(mu," Not Captured"))), fill = c(3,2))
  }
```

```{r use plot_many_confidence_intervals}
# plot_many_confidence_intervals(confidence_intervals)
```

1. How many of the simulated confidence intervals contain the true mean, zero?   
2. Suppose you run a single study. Based on what you've just written above, why is it incorrect to say that, "There is a 95% probability that the true mean is inside this (single) confidence interval"? 


## Maximum Likelihood Example: Printers 

**Part I** 

Suppose that you’ve got a particular sequence of values: ${1, 0, 0, 1, 0, 1, 1, 1, 1, 1}$ that indicate whether a printer any particular time you try to print. 

You have data from the last 10 times you tried. 

**Question**: 
  
  - What is the probability ($p$) that the printer jams on the next print job?

![bbc, office space](./images/office_space.jpg){width=80%}

**Part II** 

The data resembles draws from a Bernoulli distribution. 

However, even if we want to model this as a Bernoulli distribution, we do not know the value of the parameter, $p$. 

1- Define your random variable. 

2- Write down the likelihood function 

3- If it will make the math easier, log the likelihood function. 

4- *Path 1:* Maximize the likelihood using calculus

5- *Path 2:* Maximize using numeric methods. 


```{r}

```