---
output: html_document
editor_options:
chunk_output_type: console
---
# Resistant lines
```{r echo=FALSE}
source("libs/Common.R")
```
```{r echo = FALSE}
pkg_ver(c("dplyr", "ggplot2", "MASS"))
```
## Introduction
Ordinary least squares regression lines (those created with the `lm()` function) are sensitive to outliers. Because `lm`'s best-fit line makes use of the mean (which is not a robust measure of location), its breakdown point is $1/n$: a single data point that behaves differently from the rest is enough to significantly alter the slope and intercept of the best-fit line. For example, let's start with a well-behaved dataset where all points are perfectly aligned, and fit this batch with a regression line:
```{r fig.height=2.5, fig.width=3, small.mar=TRUE}
x <- 1:11
y <- 5 + 2 * x
plot(y ~ x, pch = 20)
M <- lm(y ~ x)
abline( M , col = "red")
```
As expected, we have a perfect fit. And the regression model's coefficients match those used to create the data.
```{r}
coef(M)
```
Now, what happens to the regression line if one of the points is re-positioned in the plot?
```{r fig.height=2.5, fig.width=3, small.mar=TRUE}
y[11] <- 2
plot(y ~ x, pch = 20)
M <- lm(y ~ x)
abline( M , col = "red")
```
Note the significant change in the line's characteristics; its intercept and slope are now:
```{r}
coef(M)
```
The slope dropped from 2 to 0.86 because of a single point!
If our goal is to explore what the *bulk* of the data has to say about the phenomenon being investigated, we certainly don't want a few "maverick" values to hijack the analysis. We therefore need a set of fitting tools that minimize the influence of outliers. There are many options out there; the most notable are **Tukey's resistant line** and the **bisquare** robust estimation method.
## Robust lines
### Tukey's resistant line
The idea is simple in concept. It involves dividing the dataset into three approximately equal groups of values and summarizing these groups by computing their respective medians. The batches' median values are then used to compute the slope and intercept. The motivation behind this approach is to use the three-point summary to provide a robust assessment of the relationship between the two variables.
Let's look at an example of its implementation, continuing with the dataset from the previous section. First, we divide the plot into three approximately equal batches.
```{r fig.height=3, fig.width=3, small.mar=TRUE,echo=FALSE}
library(tukeyedar)
df <- data.frame(x,y)
Mr <- eda_rline(df, x, y, iter=1)
plot(Mr, model=F, pt3=F, fit = F)
```
Next, we compute the median x and y values within each section.
```{r fig.height=3, fig.width=3, small.mar=TRUE,echo=FALSE}
plot(Mr, model=F, pt3=T, fit = F)
```
The x median values are `r Mr$xmed` and the y median values are `r Mr$ymed` respectively.
The two end medians are used to compute the slope as:
$$
b = \frac{y_r - y_l}{x_r-x_l}
$$
where the subscripts $r$ and $l$ reference the median values for the right-most and left-most batches.
Once the slope is computed, the intercept can be computed as follows:
$$
a = \operatorname{median}(y_{l,m,r} - b\, x_{l,m,r})
$$
where $(x,y)_{l,m,r}$ are the median x and y values for each batch.
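To make the computation concrete, here is a minimal by-hand sketch of the first-iteration fit. It assumes the 11-point dataset created above, split into thirds of sizes 4, 3, and 4 (R's `line()` function uses a more refined partitioning rule, so its numbers may differ slightly from these):

```{r}
# Split the sorted x-values into left, middle, and right thirds
thirds <- rep(c("l", "m", "r"), times = c(4, 3, 4))

# Median x and y values within each third
xmed <- tapply(x, thirds, median)
ymed <- tapply(y, thirds, median)

# Slope from the two outer medians, intercept from all three medians
b <- unname((ymed["r"] - ymed["l"]) / (xmed["r"] - xmed["l"]))
a <- median(ymed - b * xmed)
c(intercept = a, slope = b)
```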
```{r fig.height=3, fig.width=3, small.mar=TRUE,echo=FALSE}
plot(Mr, model=F, pt3=T, fit = F)
rl1 <- line(x,y,iter=1)
abline(rl1, col="red")
```
This is not bad for a first iteration, but the outlier's influence is still present given that the line does not pass through the remaining well-behaved points. Recall that the goal is to *not* allow one lone observation to influence the fitted line.
From this first iteration, we can extract the residuals. A line is then fitted to the residuals following the same procedure outlined above.
```{r fig.height=3, fig.width=3, small.mar=TRUE,echo=FALSE}
df$r1 <- residuals(rl1)
plot(eda_rline(df, x, r1), model=F, pt3=T, fit = F, ylab = "Residuals")
rl1r <- line(x,df$r1,iter=1)
abline(rl1r,col="red")
```
The initial model's intercept and slope are `r round(rl1$coefficients[1],3)` and `r round(rl1$coefficients[2],3)`, respectively, and the residual fit's intercept and slope are `r round(rl1r$coefficients[1],3)` and `r round(rl1r$coefficients[2],3)`, respectively.
The residual slope is then added to the first computed slope, giving us an updated slope of `r round(rl1$coefficients[2] + rl1r$coefficients[2], 3)`. The process is then repeated, generating the following tweaked slope and updated residuals:
```{r fig.height=3, fig.width=6, small.mar=TRUE,echo=FALSE}
OP <- par(mfrow = c(1, 2))
plot(Mr, model=F, pt3=T, fit = F)
rl2 <- line(x,y,iter=2)
abline(rl2, col="red")
df$r2 <- residuals(rl2)
plot(eda_rline(df, x, r2), model=F, pt3=T, fit = F, ylab = "Residuals")
rl2r <- line(x,df$r2,iter=1)
abline(rl2r,col="red")
par(OP)
```
The second residual slope is added to the previously computed slope, giving us an updated slope of `r round(rl2$coefficients[2] + rl2r$coefficients[2],3)`. The iterations continue until the residual slope stabilizes. Each successive iteration is shown in the following plot, with the final line in dark red. Here, three iterations were needed, giving us a final fitted line whose intercept and slope are `r round(line(x,y,iter=3)$coefficients[1],2)` and `r round(line(x,y,iter=3)$coefficients[2],2)` respectively, very close to the actual relationship between the two variables.
```{r fig.height=3, fig.width=3, small.mar=TRUE,echo=FALSE}
df$r1 <- residuals(rl1)
plot(eda_rline(df, x, y), model=F, pt3=T, fit = F)
abline(line(x,y,iter=1),col="#FFCCCC")
abline(line(x,y,iter=2),col="#FF8888")
abline(line(x,y,iter=3),col="#FF0000")
```
Tukey's resistant line can be computed using R's built-in `line()` function. You specify the maximum number of iterations via the `iter` argument. Usually, five iterations will be enough, but the iterations will stop early if the residual slope stabilizes before the maximum is reached.
```{r}
r.fit <- line(x,y, iter = 5)
```
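The intercept and slope can be extracted with `coef()`. As a quick sketch of the convergence check (assuming five iterations is ample for this small dataset), we can also compare the slope estimate across maximum iteration counts; it should stop changing once the residual slope has stabilized:

```{r}
coef(r.fit)

# Slope estimate for increasing maximum iteration counts
sapply(1:5, function(i) coef(line(x, y, iter = i))[2])
```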
You can add the resistant line to a base plot as follows:
```{r fig.height=3, fig.width=3, echo = 2:3}
OP <- par(mar = c(4,4,1,0))
plot(y ~ x, df)
abline(r.fit, col = "red")
par(OP)
```
You can also add the line to a `ggplot`:
```{r fig.height=3, fig.width=3}
library(ggplot2)
ggplot(df, aes(x, y)) + geom_point() +
  geom_abline(intercept = r.fit$coefficients[1],
              slope = r.fit$coefficients[2],
              col = "red")
```
### Bisquare
A bisquare fitting strategy assigns *weights* to observations based on the size of their residuals: the closer an observation is to the bulk of the data, the greater its weight.
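Specifically, the weight function used throughout this section is Tukey's bisquare (or biweight) function. With $u$ denoting a residual scaled by the median absolute residual and $c$ a tuning constant (set to 6 in the code that follows), the weights are:

$$
w(u) = \begin{cases}
\left(1 - (u/c)^2\right)^2 & \text{if } |u/c| < 1 \\
0 & \text{otherwise}
\end{cases}
$$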
The first step is to run a linear regression model on the data and extract its residuals. Next, weights are computed from the residuals such that points with extreme residuals are assigned the smallest weights and points with small residuals are assigned the largest weights. The regression analysis is then re-run using those weights, thus minimizing the influence of the rogue points. This process is repeated several times until the residuals stabilize. The following figure shows three iterations of the bisquare procedure whereby the weights (shown as grey text next to each point) start off at `1`, then are updated from the residuals of the most recent regression model.
```{r fig.height=2.5, fig.width=8, small.mar=TRUE, echo = FALSE}
wt.bisquare <- function(u, c = 6) {
ifelse( abs(u/c) < 1, (1-(u/c)^2)^2, 0)
}
# Initialize weights
wt <- rep(1, length(x))
ramp <- colorRamp(c("gold", "orange", "black"), interpolate = "spline")
OP <- par(mfrow = c(1,3))
# First plot
M <- lm(y ~ x, weights = wt)
plot(y ~ x, pch = 20, xlim = c(0,11), cex = 2)
abline( M , col = "blue")
text(x, y, labels = round(wt,2), pos = 2, offset = 0.5 , col = "grey30", font=1)
# Second plot
wt <- wt.bisquare( M$residuals/ median(abs(M$residuals)), c=6 )
M <- lm(y ~ x, weights = wt)
plot(y ~ x, pch = 20, xlim = c(0,11), cex = 2)
abline( M, col = "blue")
text(x, y, labels = round(wt,2), pos = 2, offset = 0.5 , col = "grey30", font=1)
# Third plot
wt <- wt.bisquare( M$residuals/ median(abs(M$residuals)), c=6 )
M <- lm(y ~ x, weights = wt)
plot(y ~ x, pch = 20, xlim = c(0,11), cex = 2)
abline(M , col = "blue")
text(x, y, labels = round(wt,2), pos = 2, offset = 0.5 , col = "grey30", font=1)
par(OP)
```
There are different weighting strategies that can be implemented. One such implementation is presented next.
```{r fig.height=2.5, fig.width=3, small.mar=TRUE}
# Create the bisquare function
wt.bisquare <- function(u, c = 6) {
  ifelse(abs(u/c) < 1, (1 - (u/c)^2)^2, 0)
}

# Assign an equal weight to all points
wt <- rep(1, length(x))

# Compute the regression, then re-assign the weights based on the residuals
for(i in 1:10){
  dat.lm <- lm(y ~ x, weights = wt)
  wt <- wt.bisquare(dat.lm$residuals / median(abs(dat.lm$residuals)), c = 6)
}
# Plot the data and the resulting line
plot(x, y, pch = 20)
abline(dat.lm, col = rgb(1,0,0,0.3))
```
In the above example, the bisquare method does a great job of eliminating the outlier's influence.
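You can inspect the final weights to see how this was accomplished; the outlier should have received a weight at or near zero:

```{r}
round(wt, 2)
```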
#### Built-in implementation of the bisquare
The `MASS` package has a robust linear modeling function called `rlm` that implements a variation of the aforementioned bisquare estimation technique. Its results may differ slightly from those presented above, but the difference will be insignificant for the most part.
Note that if you make use of `dplyr` in a workflow, loading `MASS` after `dplyr` will mask dplyr's `select` function. This can be problematic, so either load `MASS` before `dplyr`, or call the function via `MASS::rlm` without loading `MASS` at all. An example of its use follows.
```{r fig.height=2.5, fig.width=3, small.mar=TRUE, fig.show='hold'}
dat <- data.frame(x, y)
M  <- lm(y ~ x, data = dat)
Mr <- MASS::rlm(y ~ x, data = dat, psi = "psi.bisquare")
plot(y ~ x, dat, pch = 16, col = rgb(0,0,0,0.2))
# Add the robust line
abline(Mr, col = "red")
# Add the default regression line for reference
abline(M, col = "grey50", lty = 2)
```
The default regression model is added as a dashed line for reference.
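You can also compare the two fits' coefficients directly:

```{r}
coef(M)   # ordinary least squares fit
coef(Mr)  # bisquare robust fit
```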
The function `rlm` can also be called directly from within `ggplot`.
```{r fig.height=2.5, fig.width=3, small.mar=TRUE, fig.show='hold'}
library(ggplot2)
ggplot(dat, aes(x = x, y = y)) + geom_point() +
  stat_smooth(method = MASS::rlm, se = FALSE, col = "red",
              method.args = list(psi = "psi.bisquare"))
```
## Robust loess
The loess fit is also not immune to outliers. The bisquare estimation method can be extended to the loess smoothing function by passing the `"symmetric"` option to the `family` argument.
```{r fig.height=2.5, fig.width=3, small.mar=TRUE, fig.show='hold'}
# Fit a regular loess model
lo <- loess(y ~ x, dat)
# Fit a robust loess model
lor <- loess(y ~ x, dat, family = "symmetric")
# Plot the data
plot(y ~ x, dat, pch = 16, col = rgb(0,0,0,0.2))
# Add the default loess
lines(dat$x, predict(lo), col = "grey50", lty = 2)
# Add the robust loess
lines(dat$x, predict(lor), col = "red")
```
The robust loess is shown in red; the default loess fit is added as a dashed line for reference.
The robust `loess` can be implemented in `ggplot2`'s `stat_smooth` function as follows:
```{r fig.height=2.5, fig.width=3, small.mar=TRUE, fig.show='hold'}
library(ggplot2)
ggplot(dat, aes(x = x, y = y)) + geom_point() +
  stat_smooth(method = "loess",
              method.args = list(family = "symmetric"),
              se = FALSE, col = "red")
```
The loess parameters are passed to the `loess` function via the `method.args` argument. The red curve is that of the robust loess; the grey dashed curve is that of the standard loess.