---
title: "Data exploration: the importance of plotting"
output: html_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
## Simple example
Suppose you have two variables, `x` and `y`.
```{r create data}
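set.seed(1)  # arbitrary seed (not in the original) so results are reproducible
x <- 1:100
# quadratic signal in x plus Gaussian noise (mean 0, sd 30)
y <- 20 * x - 0.2 * x^2 + rnorm(100, 0, 30)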
```
You want to know whether `x` explains variation in `y`.
How would you approach this?
## Should you build a linear model first or plot the variables first and visually explore?
Let's see what we would conclude if **we do not visually examine our data**
and rely only on regression modeling.
```{r model}
mod_lin <- lm(y ~ x)
summary(mod_lin)
```
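The model summary above would lead us to believe that there is no relationship
between `x` and `y`: the slope estimate is typically small relative to its
standard error, and R-squared is close to zero. We know this conclusion is false
because we created `y` from `x` ourselves.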
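To make this concrete, the sketch below refits the linear model on a seeded copy of the simulation and extracts the numbers a reader would look at in the summary table. The names `x_chk`, `y_chk`, `fit_chk` and the seed are illustrative assumptions, not part of the original analysis. As a check that the flat slope is not a fluke of the noise, it also fits the same line to the noise-free curve.

```{r linear-fit-check}
set.seed(42)  # arbitrary seed, for reproducibility only
x_chk <- 1:100
y_chk <- 20 * x_chk - 0.2 * x_chk^2 + rnorm(100, 0, 30)

fit_chk  <- lm(y_chk ~ x_chk)
coef_tab <- summary(fit_chk)$coefficients

# slope, its p-value, and R-squared from the linear fit
slope_chk <- coef_tab["x_chk", "Estimate"]
p_chk     <- coef_tab["x_chk", "Pr(>|t|)"]
r2_chk    <- summary(fit_chk)$r.squared
round(c(slope = slope_chk, p_value = p_chk, r_squared = r2_chk), 3)

# the same straight line fitted to the noise-free curve
signal_chk   <- 20 * x_chk - 0.2 * x_chk^2
slope_signal <- unname(coef(lm(signal_chk ~ x_chk))["x_chk"])
slope_signal
```

For `x = 1:100` the noise-free slope works out to 20 - 0.2 * 101 = -0.2, because the curve is symmetric about its peak near the middle of the `x` range. A straight line therefore has almost nothing to pick up.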
## Why is our regression table leading us to an incorrect inference?
Because our model systematically misrepresents the functional form of the
relationship between `x` and `y`, which we defined to be quadratic.
This would have been obvious if we had first graphed `y` against `x`:
```{r}
plot(y ~ x)
```
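When the shape is harder to see by eye, a smoother drawn over the scatterplot can help. The following is a minimal sketch using `lowess()` from base R on a seeded copy of the simulation. The names `x_sm`, `y_sm`, `smooth_sm` and the seed are assumptions made for illustration. The location of the smoothed peak gives a rough idea of where the relationship changes direction.

```{r scatter-with-smoother}
set.seed(42)  # arbitrary seed, for reproducibility only
x_sm <- 1:100
y_sm <- 20 * x_sm - 0.2 * x_sm^2 + rnorm(100, 0, 30)

plot(y_sm ~ x_sm)
smooth_sm <- lowess(x_sm, y_sm)  # list with components x (sorted) and y (smoothed)
lines(smooth_sm, col = "darkgreen", lwd = 2)

# x value at which the smoothed curve is highest
peak_x <- smooth_sm$x[which.max(smooth_sm$y)]
peak_x
```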
This simple step would indicate to us that `y` is not just a linear function of
`x` but a quadratic one, so a more appropriate model is:
```{r}
mod_quad <- lm(y ~ x + I(x^2))
summary(mod_quad)
```
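One way to check that the quadratic term is worth keeping is to compare the two nested models directly. The sketch below does this on a seeded copy of the simulation with `anova()` (an F-test for the added term) and `AIC()`. The names ending in `_cmp` and the seed are illustrative assumptions. A small p-value and a lower AIC for the quadratic model both point to the curved fit.

```{r compare-models}
set.seed(42)  # arbitrary seed, for reproducibility only
x_cmp <- 1:100
y_cmp <- 20 * x_cmp - 0.2 * x_cmp^2 + rnorm(100, 0, 30)

lin_cmp  <- lm(y_cmp ~ x_cmp)
quad_cmp <- lm(y_cmp ~ x_cmp + I(x_cmp^2))

# F-test: does adding the squared term reduce residual variation?
cmp_tab <- anova(lin_cmp, quad_cmp)
cmp_tab
p_quad <- cmp_tab[["Pr(>F)"]][2]

# information criterion for both fits (lower is better)
aic_cmp <- AIC(lin_cmp, quad_cmp)
aic_cmp

# estimated coefficient on the squared term (the simulation used -0.2)
quad_coef <- unname(coef(quad_cmp)["I(x_cmp^2)"])
quad_coef
```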
To further illustrate why the regression-only approach failed, we should
overlay the fitted models on the data. Unfortunately, this critical step is often
overlooked.
```{r}
plot(y ~ x)
# x is already sorted, so the fitted values can be drawn directly as lines
lines(x, predict(mod_lin), col = 'red')
lines(x, predict(mod_quad), col = 'blue')
legend('bottom', c('linear', 'quadratic'), col = c('red', 'blue'), lty = 1, bty = 'n')
```
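A complementary view is to plot each model's residuals against `x`. The sketch below does so for a seeded copy of the simulation. The names ending in `_res` and the seed are assumptions for illustration. If the functional form is wrong, the residuals should show a systematic arch rather than a patternless band around zero, and the residual standard error should be much larger than the noise level that was simulated.

```{r residual-check}
set.seed(42)  # arbitrary seed, for reproducibility only
x_res <- 1:100
y_res <- 20 * x_res - 0.2 * x_res^2 + rnorm(100, 0, 30)

lin_res  <- lm(y_res ~ x_res)
quad_res <- lm(y_res ~ x_res + I(x_res^2))

# side-by-side residual plots; restore graphics settings afterwards
op <- par(mfrow = c(1, 2))
plot(x_res, resid(lin_res), main = "linear", xlab = "x", ylab = "residual")
abline(h = 0, lty = 2)
plot(x_res, resid(quad_res), main = "quadratic", xlab = "x", ylab = "residual")
abline(h = 0, lty = 2)
par(op)

# residual standard error of each fit
sigma_lin  <- summary(lin_res)$sigma
sigma_quad <- summary(quad_res)$sigma
c(linear = sigma_lin, quadratic = sigma_quad)
```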
Here we looked at only two variables, so it may seem obvious that graphing them is wise.
But even in multivariate scenarios, where graphical exploration may be more
tedious, it can still be very helpful and illuminating prior to model fitting.
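For the multivariate case, a scatterplot matrix is one inexpensive starting point. The sketch below is a hypothetical example, not data from this document: the data frame `dat_mv`, its columns `x1` and `x2`, the coefficients, and the seed are all made up for illustration. It uses `pairs()` with the base R `panel.smooth` panel function, so each pairwise plot gets a smoothed trend line.

```{r pairs-sketch}
set.seed(42)  # arbitrary seed, for reproducibility only
n_mv <- 100

# hypothetical predictors: one with a curved effect, one with a linear effect
dat_mv <- data.frame(
  x1 = runif(n_mv, 0, 100),
  x2 = rnorm(n_mv)
)
dat_mv$y <- 20 * dat_mv$x1 - 0.2 * dat_mv$x1^2 + 50 * dat_mv$x2 +
  rnorm(n_mv, 0, 30)

# scatterplot matrix with a smoother in every panel
pairs(dat_mv, panel = panel.smooth)
```

Scanning the `y` row of the matrix for curved smoothers is a quick way to spot predictors that may need a nonlinear term before any model is fit.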