-
Notifications
You must be signed in to change notification settings - Fork 1
/
Copy pathREADME.Rmd
215 lines (157 loc) · 8.34 KB
/
README.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
---
output: github_document
---
<!-- README.md is generated from README.Rmd. Please edit that file and knit again -->
```{r, echo = FALSE}
knitr::opts_chunk$set(
warning = FALSE, # avoid warnings and messages in the output
message = FALSE,
collapse = TRUE,
fig.width = 5,
fig.height = 5,
comment = "#>",
fig.path = "man/figures/README-"
)
par(mar=c(3,3,1,1)+.1)
options(digits=3)
```
```{r, echo=FALSE}
library(mvinfluence)
```
[![CRAN_Status_Badge](http://www.r-pkg.org/badges/version/mvinfluence)](https://cran.r-project.org/package=mvinfluence)
[![](http://cranlogs.r-pkg.org/badges/grand-total/mvinfluence)](https://cran.r-project.org/package=mvinfluence)
[![DOI](https://zenodo.org/badge/128774860.svg)](https://zenodo.org/badge/latestdoi/128774860)
# mvinfluence <img src="man/figures/logo.png" align="right" height="200px" />
**Influence Measures and Diagnostic Plots for Multivariate Linear Models**
Version 0.9-1
Functions in this package compute regression deletion diagnostics for multivariate linear models following
methods proposed by Barrett & Ling (1992)
and provide some associated
diagnostic plots. The diagnostic measures include hat-values (leverages), generalized Cook's distance, and
generalized squared 'studentized' residuals. Several types of plots to detect influential observations are
provided.
In addition, the functions provide diagnostics for deletion of
subsets of observations of size `m>1`. This case is theoretically interesting
because sometimes pairs (`m=2`) of influential observations can mask each other,
sometimes they can have joint influence far exceeding their individual effects,
as well as other interesting phenomena described by Lawrence (1995).
Associated methods for the case
`m>1` are still under development in this package.
## Documentation
Documentation for the package is now available at [https://friendly.github.io/mvinfluence/](https://friendly.github.io/mvinfluence/).
## Installation
Get the released CRAN version or this development version:
| | |
|---------------------|---------------------------------------------------|
| CRAN version | `install.packages("mvinfluence")` |
| Development version | `remotes::install_github("friendly/mvinfluence")` |
## Goals
The design goal for this package is that, as an extension of standard methods for univariate linear models, you should be able to fit a linear model with a **multivariate** response,
mymlm <- lm( cbind(y1, y2, y3) ~ x1 + x2 + x3, data=mydata)
and then get useful diagnostics and plots with:
influence(mymlm)
hatvalues(mymlm)
cooks.distance(mymlm)
influencePlot(mymlm, ...)
As is done in comparable univariate functions in the `car` package, *noteworthy* points are identified in printed output and graphs.
## Examples
The `Rohwer` data contains data on kindergarten children designed to examine how well performance on a set of paired-associate (PA)
learning tasks can predict performance on some measures of aptitude and achievement---
`SAT` (a scholastic aptitude test),
`PPVT` (Peabody Picture Vocabulary Test), and
`Raven` ( Raven Progressive Matrices Test). The PA tasks differ in how the stimulus item was presented:
`n` (named),
`s` (still),
`ns` (named still),
`na` (named action) and
`ss` (sentence still).
Here, we fit a MLM to a subset of the Rohwer data (the Low SES group).
```{r rohwer1}
data(Rohwer, package="heplots")
Rohwer2 <- subset(Rohwer, subset=group==2)
rownames(Rohwer2)<- 1:nrow(Rohwer2)
Rohwer.mod <- lm(cbind(SAT, PPVT, Raven) ~ n + s + ns + na + ss, data=Rohwer2)
car::Anova(Rohwer.mod)
```
### Influence plots
The default influence plot (`type="stres"`) shows the squared standardized residual against the Hat value. The areas of the circles representing the observations are proportional to generalized Cook's distances.
```{r rohwer2}
(infl <-influencePlot(Rohwer.mod, id.n=4, type = "stres"))
```
As you can see above,
the function returns a data frame of the influence statistics for the identified points. "Noteworthy" points are those that are unusual on *either* Hat value (H) or the squared studentized residual (Q), so more points will be shown than the `id.n` value. It is often more useful to sort these in descending order by one of the influence measures.
```{r infl}
infl |> dplyr::arrange(desc(H))
```
An alternative (`type="LR"`) plots residual components against leverage components, both on log scales. Because influence is a product of residual $\times$ Leverage, this plot had the property that contours of constant Cook's distance fall on diagonal lines with slope = -1. Each successive dashed line represents
a **multiple** of Cook's D.
This plot is often easier to read than the standard version.
```{r rohwer3}
influencePlot(Rohwer.mod, id.n=4, type="LR")
```
We observe that case 5 has the largest leverage and it is highly influential.
Case 25 has the largest residual component and middling leverage, so it is moderately influential.
Cases 14, 29, 27 have nearly identical residuals, and their influence increases from left to right
with leverage.
### Index plots
If you wish to see how the observations fare on each of the the measures (as well as Mahalanobis $D^2$ of the residuals from the origin),
the `inflIndexPlot()` function gives you index plots.
There are extensive options for identifying and labeling "noteworthy"
observations, with various methods. These rely on `car::showLabels()`, where the default `id.method = "y"` label points whose
Y coordinate is very large.
```{r indexplot, fig.width=9}
infIndexPlot(Rohwer.mod,
id.n=3, id.col = "red", id.cex=1.5, id.location="ab")
```
In this example, note that while case 5 stands out as influential, it does not have an exceptionally large Mahalanobis
squared distance, $D^2$ of the residuals.
# Robust MLMs
Influential cases and those with large residuals can sometimes be dealt with by fitting a **robust** version of
the multivariate model. The function `heplots::robmlm()` uses a simple M-estimator that down-weights cases
with large residuals. Fitting is done by iterated re-weighted least squares (IWLS), using weights based on the Mahalanobis squared distances of the current residuals from the origin, and a scaling (covariance) matrix calculated by `MASS::cov.trob()`.
```{r robmlm1}
Rohwer.rmod <- heplots::robmlm(cbind(SAT, PPVT, Raven) ~ n + s + ns + na + ss,
data=Rohwer2)
```
The returned object has a `weights` component, the weight for each case in the final iteration. Which ones are less than 0.9 here?
```{r rob-weights}
which(Rohwer.rmod$weights < .9)
```
A simple index plot makes the down-weighted observations stand out. Case 5 is not among them, but I label it anyway.
```{r rob-index-plot, fig.width=8, fig.height=4}
par(mar = c(4,4,1,1)+.1)
wts <- Rohwer.rmod$weights
idx <- c(5, which(wts < .9))
plot(wts, type="h",
xlab = "Case index",
ylab = "Robust mlm weight",
cex.lab = 1.25)
rect(0, .9, 33, 1.1,
col=scales::alpha("gray", .25),
border=NA)
points(wts, pch = 16,
cex = ifelse(wts < .9, 1.5, 1),
col = ifelse(wts < .9, "red", "black"))
text(idx, wts[idx], label=idx, pos=3, cex=1.2, xpd=NA )
```
What's up with case 5? It had the largest leverage, but it's Mahalanobis $D^2$ was not large.
Thus, it was not down-weighted, even though it is an influential observation.
What difference do these observations make in the fitted regression? This calculates the percentage relative difference
between the coefficients in the standard `lm()` and the robust version. The largest changes are for the coefficients
of the `ss` task, but there is an even greater one for `PPVT` on the `n` task.
```{r}
100 * abs(coef(Rohwer.mod) - coef(Rohwer.rmod)) / abs(coef(Rohwer.mod))
```
## Citation
To cite `mvinfluence` in publications, use:
```{r citation}
citation("mvinfluence")
```
## References
Barrett, B. E. and Ling, R. F. (1992).
General Classes of Influence Measures for Multivariate Regression.
*Journal of the American Statistical Association*, **87**(417), 184-191.
Barrett, B. E. (2003). Understanding Influence in Multivariate Regression.
*Communications in Statistics -- Theory and Methods*, **32**, 3, 667-680.
Lawrence, A. J. (1995). Deletion Influence and Masking in Regression.
*Journal of the Royal Statistical Society. Series B (Methodological)* , **57**, No. 1, pp. 181-189.