---
output: html_document
editor_options:
chunk_output_type: console
---
# Resistant lines
```{r echo=FALSE}
source("libs/Common.R")
```
```{r echo = FALSE}
pkg_ver(c("dplyr", "ggplot2", "MASS"))
```
## Introduction
Ordinary least squares regression lines (those created with the `lm()` function) are sensitive to outliers. Because `lm`'s best-fit line makes use of the mean (which is not a robust measure of location), its breakdown point is $1/n$: a single data point that behaves differently from the rest is enough to significantly alter the slope and intercept of the best-fit line. For example, let's start with a well-behaved dataset where all points are perfectly aligned, and fit this batch with a regression line:
```{r fig.height=2.5, fig.width=3, small.mar=TRUE}
x <- 1:11
y <- 5 + 2 * x
plot(y ~ x, pch = 20)
M <- lm(y ~ x)
abline( M , col = "red")
```
As expected, we have a perfect fit. And the regression model's coefficients match those used to create the data.
```{r}
coef(M)
```
Now, what happens to the regression line if one of the points is re-positioned in the plot?
```{r fig.height=2.5, fig.width=3, small.mar=TRUE}
y[11] <- 2
plot(y ~ x, pch = 20)
M <- lm(y ~ x)
abline( M , col = "red")
```
Note the significant change in the line's characteristics; its intercept and slope are now:
```{r}
coef(M)
```
The slope dropped from 2 to 0.86 because of a single point!
If our goal is to explore what the *bulk* of the data has to say about the phenomenon being investigated, we certainly don't want a few "maverick" values to hijack the analysis. We therefore need a set of fitting tools that minimize the influence of outliers. There are many options out there; the most notable are **Tukey's resistant line** and the **bisquare** robust estimation method.
## Robust lines
### Tukey's resistant line
The idea is simple in concept. It involves dividing the dataset into three approximately equal groups of values and summarizing these groups by computing their respective medians. The batches' median values are then used to compute the slope and intercept. The motivation behind this approach is to use the three-point summary to provide a robust assessment of the relationship between the two variables.
Let's look at an example of its implementation, continuing with the dataset from the previous section. First, we divide the plot into three approximately equal batches.
```{r fig.height=3, fig.width=3, small.mar=TRUE,echo=FALSE}
library(tukeyedar)
df <- data.frame(x,y)
Mr <- eda_rline(df, x, y, iter=1)
plot(Mr, model=F, pt3=F, fit = F)
```
Next, we compute the median x and y values within each section.
```{r fig.height=3, fig.width=3, small.mar=TRUE,echo=FALSE}
plot(Mr, model=F, pt3=T, fit = F)
```
The x median values are `r Mr$xmed` and the y median values are `r Mr$ymed` respectively.
The two end medians are used to compute the slope as:
$$
b = \frac{y_r - y_l}{x_r-x_l}
$$
where the subscripts $r$ and $l$ reference the median values for the right-most and left-most batches.
Once the slope is computed, the intercept can be computed as follows:
$$
a = \operatorname{median}(y_{l,m,r} - b\, x_{l,m,r})
$$
where $(x,y)_{l,m,r}$ are the median x and y values for each batch.
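To make the computation concrete, here is a minimal by-hand sketch of the first-iteration fit. It assumes the 11-point dataset created above, split into thirds of sizes 4, 3, and 4 (R's `line()` function uses a more refined partitioning rule, so its numbers may differ slightly from these):

```{r}
# Split the sorted x-values into left, middle, and right thirds
thirds <- rep(c("l", "m", "r"), times = c(4, 3, 4))

# Median x and y values within each third
xmed <- tapply(x, thirds, median)
ymed <- tapply(y, thirds, median)

# Slope from the two outer medians, intercept from all three medians
b <- unname((ymed["r"] - ymed["l"]) / (xmed["r"] - xmed["l"]))
a <- median(ymed - b * xmed)
c(intercept = a, slope = b)
```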
```{r fig.height=3, fig.width=3, small.mar=TRUE,echo=FALSE}
plot(Mr, model=F, pt3=T, fit = F)
rl1 <- line(x,y,iter=1)
abline(rl1, col="red")
```
This is not bad for a first iteration, but the outlier's influence is still present given that the line does not pass through the remaining well-behaved points. Recall that the goal is to *not* allow one lone observation to influence the fitted line.
From this first iteration, we can extract the residuals. A line is then fitted to the residuals following the same procedure outlined above.
```{r fig.height=3, fig.width=3, small.mar=TRUE,echo=FALSE}
df$r1 <- residuals(rl1)
plot(eda_rline(df, x, r1), model=F, pt3=T, fit = F, ylab = "Residuals")
rl1r <- line(x,df$r1,iter=1)
abline(rl1r,col="red")
```
The initial model's intercept and slope are `r round(rl1$coefficients[1],3)` and `r round(rl1$coefficients[2],3)`, respectively, and the residual fit's intercept and slope are `r round(rl1r$coefficients[1],3)` and `r round(rl1r$coefficients[2],3)`, respectively.
The residual slope is then added to the first computed slope, giving us an updated slope of `r round(rl1$coefficients[2] + rl1r$coefficients[2], 3)`. The process is then repeated, generating the following tweaked slope and updated residuals:
```{r fig.height=3, fig.width=6, small.mar=TRUE,echo=FALSE}
OP <- par(mfrow = c(1, 2))
plot(Mr, model=F, pt3=T, fit = F)
rl2 <- line(x,y,iter=2)
abline(rl2, col="red")
df$r2 <- residuals(rl2)
plot(eda_rline(df, x, r2), model=F, pt3=T, fit = F, ylab = "Residuals")
rl2r <- line(x,df$r2,iter=1)
abline(rl2r,col="red")
par(OP)
```
The second residual slope is added to the previously computed slope, giving us an updated slope of `r round(rl2$coefficients[2] + rl2r$coefficients[2],3)`. The iterations continue until the residual slope stabilizes. Each successive iteration is shown in the following plot, with the final line in dark red. Here, three iterations were needed, giving us a final fitted line whose intercept and slope are `r round(line(x,y,iter=3)$coefficients[1],2)` and `r round(line(x,y,iter=3)$coefficients[2],2)` respectively, very close to the actual relationship between the two variables.
```{r fig.height=3, fig.width=3, small.mar=TRUE,echo=FALSE}
df$r1 <- residuals(rl1)
plot(eda_rline(df, x, y), model=F, pt3=T, fit = F)
abline(line(x,y,iter=1),col="#FFCCCC")
abline(line(x,y,iter=2),col="#FF8888")
abline(line(x,y,iter=3),col="#FF0000")
```
Tukey's resistant line can be computed using R's built-in `line()` function. You specify the maximum number of iterations via the `iter` argument. Usually, five iterations will be enough, but the iterations will stop early if the residual slope stabilizes before the maximum is reached.
```{r}
r.fit <- line(x,y, iter = 5)
```
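The intercept and slope can be extracted with `coef()`. As a quick sketch of the convergence check (assuming five iterations is ample for this small dataset), we can also compare the slope estimate across maximum iteration counts; it should stop changing once the residual slope has stabilized:

```{r}
coef(r.fit)

# Slope estimate for increasing maximum iteration counts
sapply(1:5, function(i) coef(line(x, y, iter = i))[2])
```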
You can add the resistant line to a base plot as follows:
```{r fig.height=3, fig.width=3, echo = 2:3}
OP <- par(mar = c(4,4,1,0))
plot(y ~ x, df)
abline(r.fit, col = "red")
par(OP)
```
You can also add the line to a `ggplot`:
```{r fig.height=3, fig.width=3}
library(ggplot2)
ggplot(df, aes(x, y)) + geom_point() +
  geom_abline(intercept = r.fit$coefficients[1],
              slope = r.fit$coefficients[2],
              col = "red")
```
### Bisquare
A bisquare fitting strategy assigns *weights* to observations based on the size of their residuals: the closer an observation is to the bulk of the data, the greater its weight.
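Specifically, the weight function used throughout this section is Tukey's bisquare (or biweight) function. With $u$ denoting a residual scaled by the median absolute residual and $c$ a tuning constant (set to 6 in the code that follows), the weights are:

$$
w(u) = \begin{cases}
\left(1 - (u/c)^2\right)^2 & \text{if } |u/c| < 1 \\
0 & \text{otherwise}
\end{cases}
$$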
The first step is to run a linear regression model on the data and extract its residuals. Next, weights are computed from the residuals such that points with extreme residuals are assigned the smallest weights and points with small residuals are assigned the largest weights. The regression analysis is then re-run using those weights, thus minimizing the influence of the rogue points. This process is repeated several times until the residuals stabilize. The following figure shows three iterations of the bisquare procedure whereby the weights (shown as grey text next to each point) start off at `1`, then are updated from the residuals of the most recent regression model.
```{r fig.height=2.5, fig.width=8, small.mar=TRUE, echo = FALSE}
wt.bisquare <- function(u, c = 6) {
ifelse( abs(u/c) < 1, (1-(u/c)^2)^2, 0)
}
# Initialize weights
wt <- rep(1, length(x))
ramp <- colorRamp(c("gold", "orange", "black"), interpolate = "spline")
OP <- par(mfrow = c(1,3))
# First plot
M <- lm(y ~ x, weights = wt)
plot(y ~ x, pch = 20, xlim = c(0,11), cex = 2)
abline( M , col = "blue")
text(x, y, labels = round(wt,2), pos = 2, offset = 0.5 , col = "grey30", font=1)
# Second plot
wt <- wt.bisquare( M$residuals/ median(abs(M$residuals)), c=6 )
M <- lm(y ~ x, weights = wt)
plot(y ~ x, pch = 20, xlim = c(0,11), cex = 2)
abline( M, col = "blue")
text(x, y, labels = round(wt,2), pos = 2, offset = 0.5 , col = "grey30", font=1)
# Third plot
wt <- wt.bisquare( M$residuals/ median(abs(M$residuals)), c=6 )
M <- lm(y ~ x, weights = wt)
plot(y ~ x, pch = 20, xlim = c(0,11), cex = 2)
abline(M , col = "blue")
text(x, y, labels = round(wt,2), pos = 2, offset = 0.5 , col = "grey30", font=1)
par(OP)
```
There are different weighting strategies that can be implemented. One such implementation is presented next.
```{r fig.height=2.5, fig.width=3, small.mar=TRUE}
# Create the bisquare function
wt.bisquare <- function(u, c = 6) {
  ifelse(abs(u/c) < 1, (1 - (u/c)^2)^2, 0)
}

# Assign an equal weight to all points
wt <- rep(1, length(x))

# Compute the regression, then re-assign the weights based on the residuals
for(i in 1:10){
  dat.lm <- lm(y ~ x, weights = wt)
  wt <- wt.bisquare(dat.lm$residuals / median(abs(dat.lm$residuals)), c = 6)
}
# Plot the data and the resulting line
plot(x, y, pch = 20)
abline(dat.lm, col = rgb(1,0,0,0.3))
```
In the above example, the bisquare method does a great job of eliminating the outlier's influence.
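You can inspect the final weights to see how this was accomplished; the outlier should have received a weight at or near zero:

```{r}
round(wt, 2)
```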
#### Built-in implementation of the bisquare
The `MASS` package has a robust linear modeling function called `rlm` that implements a variation of the aforementioned bisquare estimation technique. Its results may differ slightly from those presented above, but the difference will be insignificant for the most part.
Note that if you make use of `dplyr` in a workflow, loading `MASS` after `dplyr` will mask dplyr's `select` function. This can be problematic, so either load `MASS` before `dplyr`, or call the function via `MASS::rlm` without loading `MASS` at all. An example of its use follows.
```{r fig.height=2.5, fig.width=3, small.mar=TRUE, fig.show='hold'}
dat <- data.frame(x, y)
M  <- lm(y ~ x, data = dat)
Mr <- MASS::rlm(y ~ x, data = dat, psi = "psi.bisquare")
plot(y ~ x, dat, pch = 16, col = rgb(0,0,0,0.2))
# Add the robust line
abline(Mr, col = "red")
# Add the default regression line for reference
abline(M, col = "grey50", lty = 2)
```
The default regression model is added as a dashed line for reference.
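You can also compare the two fits' coefficients directly:

```{r}
coef(M)   # ordinary least squares fit
coef(Mr)  # bisquare robust fit
```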
The function `rlm` can also be called directly from within `ggplot`.
```{r fig.height=2.5, fig.width=3, small.mar=TRUE, fig.show='hold'}
library(ggplot2)
ggplot(dat, aes(x = x, y = y)) + geom_point() +
  stat_smooth(method = MASS::rlm, se = FALSE, col = "red",
              method.args = list(psi = "psi.bisquare"))
```
## Robust loess
The loess fit is also not immune to outliers. The bisquare estimation method can be extended to the loess smoothing function by passing the `"symmetric"` option to the `family` argument.
```{r fig.height=2.5, fig.width=3, small.mar=TRUE, fig.show='hold'}
# Fit a regular loess model
lo <- loess(y ~ x, dat)
# Fit a robust loess model
lor <- loess(y ~ x, dat, family = "symmetric")
# Plot the data
plot(y ~ x, dat, pch = 16, col = rgb(0,0,0,0.2))
# Add the default loess
lines(dat$x, predict(lo), col = "grey50", lty = 2)
# Add the robust loess
lines(dat$x, predict(lor), col = "red")
```
The robust loess is shown in red; the default loess fit is added as a dashed line for reference.
The robust `loess` can be implemented in `ggplot2`'s `stat_smooth` function as follows:
```{r fig.height=2.5, fig.width=3, small.mar=TRUE, fig.show='hold'}
library(ggplot2)
ggplot(dat, aes(x = x, y = y)) + geom_point() +
  stat_smooth(method = "loess",
              method.args = list(family = "symmetric"),
              se = FALSE, col = "red")
```
The loess parameters are passed to the `loess` function via the `method.args` argument. The red curve is that of the robust loess; the grey dashed curve is that of the standard loess.