---
title: "Multivariate models"
date: '`r paste("First created on 2015-02-15. Updated on", Sys.Date())`'
output:
  html_document:
    toc: true
    toc_float: true
---
Home Page - http://dmcglinn.github.io/quant_methods/
GitHub Repo - https://github.com/dmcglinn/quant_methods
### Source Code Link
https://raw.githubusercontent.com/dmcglinn/quant_methods/gh-pages/lessons/multivariate_models.Rmd
The goal of this lesson is to introduce multivariate ordination analyses.
## Readings
* Chapters 5 and 6 of *Numerical Ecology with R*
* Chapters 9 and 11 of *Numerical Ecology* 2nd edition
## Online Docs
* [Ordination analysis by David Zeleny](https://www.davidzeleny.net/anadat-r/doku.php/en:ordination)
* [The Ordination Webpage by Mike Palmer](http://ordination.okstate.edu/)
    - great for term definitions, layman's explanations of how the methods
      differ, and how ecologists should interpret them
* [Vegan: an introduction to ordination by Jari Oksanen](http://cran.r-project.org/web/packages/vegan/vignettes/intro-vegan.pdf)
- A brief demonstration of an ordination analysis in the R package vegan
* [Multivariate Analysis of Ecological Communities in R: vegan tutorial by Jari Oksanen](https://john-quensen.com/wp-content/uploads/2018/10/Oksanen-Jari-vegantutor.pdf)
    - A more thorough treatment of ordination in the R package vegan
## Outline
* Overview of ordination methods
* Create a community matrix
* Indirect or Unconstrained Ordination
    - Principal Components Analysis (PCA)
    - Correspondence Analysis (CA)
    - Detrended Correspondence Analysis (DCA)
    - Non-metric Multidimensional Scaling (NMDS)
* Direct or Constrained Ordination
    - Redundancy Analysis (RDA)
    - Canonical Correspondence Analysis (CCA)
    - Hypothesis Testing
    - Model Comparison
    - Variance Partitioning
## Overview of multivariate ordination methods
Multivariate statistical analyses are generalizations of the univariate
approaches we examined in the last lesson. In the case of a univariate approach
we considered a model in which
$$y_i = b_0 + b_1 x_{i1} + \cdots + b_p x_{ip} + \varepsilon_i, \qquad i = 1, \ldots, n, \qquad \text{(equ. 1)}$$
We are interested in using multivariate methods when we are no longer just
interested in explaining variance in a vector *y*; instead, we would like to
understand the dominant patterns and sources of variation in a matrix of response
variables **Y** that are all of the same type (i.e., same units).
Oftentimes, we may simply be interested in whether we can represent our response matrix **Y**
using a relatively small number of dimensions (without any reference to an
explanatory matrix). For example, if **Y** has *n* samples (i.e., rows) and *p*
descriptors (i.e., columns), then we can fully represent all the variation in
**Y** using *p* hypothetical axes; however, what would a *p*-dimensional graph
look like? Any dimensionality > 3 is difficult for the human mind to comprehend.
Furthermore, is that degree of complexity even needed to see the major trends
in **Y**? Picture trying to interpret a pairs plot that has $p(p-1)/2$
graphs (one graph for each unique pair of columns in **Y**).
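As a quick, hypothetical illustration of how fast that number of panels grows (the value of `p` here is made up):
```{r}
# number of unique column pairs, i.e., panels in a pairs plot, for p descriptors
p <- 10
p * (p - 1) / 2    # 45 panels for just 10 descriptors
choose(p, 2)       # the same count via the binomial coefficient
```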
The good news is that, in general, the information in matrices can be reduced
down to a few major axes of variation. Ordination is the term for this
process of reducing the information needed to represent a matrix. The term
derives from the idea of ordinating, or putting things in order. With ordination
approaches we are attempting to take a high-dimensional data matrix and explain
its patterns with a small number of axes. Although this is a very general
concept across scientific disciplines, the term ordination is generally only
used in ecology.
There are generally considered to be two types of ordination:

1. Indirect or unconstrained ordination, in which only a single matrix is analyzed.
2. Direct or constrained ordination, in which one matrix is used to explain the
variance of another matrix.

The figure below (Legendre and Legendre 1998, Fig. 11.1) visually illustrates this
conceptual shift from modeling a vector *y* (panel b) to a matrix **Y**.
Indirect ordination does not use an explanatory matrix **X** (panel a), while
direct ordination methods do (panel c).
![](./figures/Fig11-1_legendre.png)
Today we will examine both indirect and direct methods of ordination. In
general, ordination is frequently used to explore patterns in datasets
graphically; however, it can also be used to carry out hypothesis testing.
Despite their sometimes terrifying names, ordination methods have a close kinship
with multiple regression (as illustrated in the figure above). One key difference is
that the response variable is multivariate rather than univariate; however, for
many approaches the underlying algebra of regression and ordination is very
similar.
Overview of Ordination methods (adapted from <http://ordination.okstate.edu/terminol.htm>)
```{r, echo = FALSE}
library(knitr)
# set up the R environment for knitting the markdown doc properly
opts_knit$set(root.dir='../')
ord <- read.csv('./ordination_table.csv')
names(ord) <- c("Acronym", "Name", "Type of analysis", "Algorithm", "R function")
kable(ord)
```
This [presentation](./community_structure_slides_with_notes.pdf) by
[Abel Valdivia](https://twitter.com/AbelValdivia) reviews the types of
ordination and gives examples of their graphical output.
Additionally this [key](http://ordination.okstate.edu/key.htm) created by Mike
Palmer provides a decision tree that can help to guide your choice of methods.
## Create a community matrix
The first, and a very common, challenge when working with multivariate analyses is constructing the multivariate matrix we wish to analyze.
Essentially, a community matrix is a cross-tab structure in which each descriptor element (e.g., species identity) is a column id and each sample element (e.g., site identity) is a row id.
A cross-tab structure is a very inefficient way to store your raw data, so in many cases we must aggregate the raw data, which can be accomplished using a for loop or other simple functions.
Using the Great Smoky Mountains tree dataset that was used in HW 3, I will demonstrate how to do this with a computationally inefficient but easy-to-understand for loop.
As a reminder, the metadata for the tree data are located [here](../data/tree_metadata.txt).
Quick note: in this lesson we will use the R package `dummies` to create dummy
variables. This package is now defunct and no longer on the main package repository,
CRAN. To install it you will have to use the package `remotes` and its function
`install_github`; specifically, once `remotes` is installed, run
`remotes::install_github('cran/dummies')`
```{r load data}
# load relevant packages and code for today's lesson
library(vegan)
library(dummies) # this package is no longer on cran see note above
source('./scripts/utility_functions.R')
# load data
tree <- read.csv('./data/treedata_subset.csv')
# create a community site x species matrix by averaging species cover values
# we can do this with a for loop but it takes a while to run
uni_sp <- unique(tree$spcode)
uni_site <- unique(tree$plotID)
```
```{r for loop, eval=FALSE}
comm <- matrix(0, ncol=length(uni_sp), nrow=length(uni_site))
colnames(comm) <- uni_sp
rownames(comm) <- uni_site
for(i in seq_along(uni_sp)) {
    for(j in seq_along(uni_site)) {
        comm[j, i] <- mean(tree$cover[tree$spcode == uni_sp[i] &
                                      tree$plotID == uni_site[j]])
    }
}
comm[1:5, 1:5]
```
Alternatively we could use the function `tapply` which is much more efficient.
```{r tapply}
# alternatively we can use a tapply function
comm <- tapply(tree$cover,
               INDEX = list(tree$plotID, tree$spcode),
               mean)
# examine the community matrix
comm[1:5, 1:5]
# replace the NAs with zeros
comm <- ifelse(is.na(comm), 0, comm)
comm[1:5, 1:5]
```
Now that we've created our matrix, it is usually a good idea to make sure everything looks OK - the row and column sums (i.e., marginal sums) provide one reasonable, if coarse, way to check how the data are structured.
```{r marginal sums}
# visually explore the cover variable between species and sites
uni_sp <- unique(tree$spcode)
sp_sum <- apply(comm, 2, sum)
site_sum <- apply(comm, 1, sum)
par(mfrow=c(2,2))
hist(sp_sum)
col <- colorRampPalette(c('red', 'orange', 'blue'))
sp_cols <- col(length(uni_sp))
plot(sp_sum[order(sp_sum, decreasing = TRUE)], type = 'o', col = 'red', lwd = 2,
     xlab = 'Sp Rank', ylab = 'Sum Cover')
hist(site_sum)
plot(site_sum[order(site_sum, decreasing = TRUE)], type = 'o', col = 'red', lwd = 2,
     xlab = 'Site Rank', ylab = 'Sum Cover')
par(mfrow=c(1,1))
```
Above we can see that most species have small total covers (i.e., most species are rare), which is a fairly universal ecological law, so this seems correct.
Also we can see that most sites have a total cover of approximately 40, and that this distribution isn't as highly skewed as the species marginal totals.
NOTE: These plots are not terribly helpful at understanding the variability in
the community matrix. They are just useful for looking at gross aggregate
properties of the matrix; therefore, they can be useful for spotting an error in
the matrix. Let's get our environmental variables ready to use and start to
conduct some actual analyses.
## Create an explanatory matrix
In the tree dataset, each site has one set of environmental measurements. These
are replicated across the rows of the `tree` data object:
```{r, echo=FALSE}
with(tree, head(tree[plotID == "ATBN-01-0403", ]))
```
So we just need to pull out the environmental data from a single one of those rows.
In the end we need the explanatory matrix to have the same number of rows, in the
same order, as the community matrix. We could write a for loop to do this, but
we'll take the easier approach of using the function `aggregate`.
```{r create env matrix}
cols_to_keep <- c('elev', 'tci', 'streamdist', 'disturb', 'beers')
env <- aggregate(tree[ , cols_to_keep], by = list(tree$plotID), function(x) x[1])
# aggregate does have a side effect of adding a column to the output which
# is the unique plotID in this case. We just need to move that so it is a
# row name instead.
row.names(env) <- env[ , 1]
env <- env[ , -1]
head(env)
# before we can use this explanatory matrix we need to check
# that its rows are in the same order as our response matrix
all.equal(rownames(comm), rownames(env))
# now that we've completed that check we can rename the rows to something
# more manageable
rownames(comm) <- 1:nrow(comm)
rownames(env) <- 1:nrow(env)
```
## Indirect or Unconstrained Ordination
### Principal Components Analysis (PCA)
Principal Components Analysis (PCA) is a useful tool for either
1. examining correlations between the columns of a matrix
2. potentially reducing the set of explanatory variables included in a model
Simplified description of the PCA algorithm (from [Zelený 2018](https://www.davidzeleny.net/anadat-r/doku.php/en:pca)):
a. Take the matrix of samples x species (or, generally, samples x descriptors) and display each sample in the multidimensional space where each dimension is defined by the abundance of one species (or descriptor). The samples form a cloud of points in this multidimensional space.
b. Calculate the centroid of the cloud.
c. Move the origin of the axes to this centroid.
d. Rotate the axes in such a way that the first axis goes through the cloud in the direction of the highest variance, the second is perpendicular to the first and goes in the direction of the second highest variance, and so on. The positions of the samples on the resulting rotated axes are the sample scores on the ordination axes.
<img src="https://www.davidzeleny.net/anadat-r/lib/exe/fetch.php/obrazky:pca_algorithm_2d_legendre-legendre.jpg?cache="/>
Figure 1: PCA ordination of five samples and two species. (Fig. 9.2 from Legendre & Legendre 1998.)
<img src="https://www.davidzeleny.net/anadat-r/lib/exe/fetch.php/obrazky:pca_algorithm_3d.jpg?cache="/>
Figure 2: 3D schema of PCA ordination algorithm
Check out the nice interactive visualization at [Explained Visually](http://setosa.io/ev/principal-component-analysis/)
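To make the recipe above concrete, here is a minimal base-R sketch of those four steps on a small made-up matrix (the data in `Y` are hypothetical, just for illustration); up to the sign of each axis, the scores it produces should match `prcomp`:
```{r}
# a tiny hypothetical 5 samples x 2 species matrix
Y <- matrix(c(2, 3, 5, 7, 9,
              1, 4, 0, 6, 2), ncol = 2)
# steps b & c: center the cloud on its centroid
Y_cen <- scale(Y, center = TRUE, scale = FALSE)
# step d: the rotation is given by the eigenvectors of the covariance matrix
eig <- eigen(cov(Y_cen))
# sample scores = centered data projected onto the rotated axes
scores <- Y_cen %*% eig$vectors
# cross-check against R's built-in PCA (axis signs may flip)
prcomp(Y)$x
scores
```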
We'll start by using PCA to simply examine the correlations in tree species
covers that are contained in the community matrix.
```{r comm pca}
tree_pca <- rda(comm)
tree_pca
```
Like the object returned by the function `lm`, the output of `rda` is just a named list.
We can get a sense of the structure of this named list using the function `str`
```{r using str}
str(tree_pca)
```
This is useful because, for example, if we wanted to pull out the eigenvalues of the
analysis, we would see that they are stored under `$CA$eig`.
We can access each of the objects stored in a named list just as we would reference columns in a `data.frame`, using the `$` operator.
```{r eigenvalues}
tree_pca$tot.chi
tree_pca$CA$eig
# the eigenvalues sum up to equal the total inertia
sum(tree_pca$CA$eig)
# inertia in a PCA is equal to the total variance of the matrix because PCA
# uses the usual Euclidean distance as its metric of how different samples are:
# 1 / (n-1) * sum((x - xbar)^2) where n is sample size and xbar is the mean x value
# you can verify that this is equal to the sum of the variances (i.e., diagonal
# elements) in the species covariance matrix
sum(diag(cov(comm)))
# or if you want to do it by hand with matrix algebra:
# first center the community matrix by subtracting the column means
cen_comm <- apply(comm, 2, function(x) x - mean(x))
# then compute the sample covariance matrix, dividing by n - 1 to match cov()
cov_comm <- (nrow(comm) - 1)^-1 * t(cen_comm) %*% cen_comm
sum(diag(cov_comm))
# the ratio of the eigenvalue to the total variance is the amount of
# variance explained by each PCA axis
round(tree_pca$CA$eig / tree_pca$tot.chi, 2)
```
We can see from above that the PCA axis 1 captures approximately 20% of the total
variance in the community matrix.
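One common way to judge how many of these axes are worth interpreting is a scree plot; as a sketch, vegan's `screeplot` method for `rda` objects can overlay a broken-stick null distribution:
```{r}
# eigenvalues by axis; bstick = TRUE adds the broken-stick benchmark, a
# common rule of thumb for which axes capture more variance than expected
screeplot(tree_pca, bstick = TRUE, type = 'lines')
```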
Let's graph the data to better get a sense of the correlation structure.
```{r plot pca}
plot(tree_pca)
biplot(tree_pca)
cleanplot.pca(tree_pca)
# p120-121 of Numerical Ecology in R:
# Scaling 1 = distance biplot: the eigenvectors are scaled to unit length. (1)
# Distances among objects in the biplot are approximations of their
# Euclidean distances in multidimensional space. (2) The angles among
# descriptor vectors are meaningless.
# Scaling 2 = correlation biplot: each eigenvector is scaled to the square root of
# its eigenvalue. (1) Distances among objects in the biplot are not approximations
# of their Euclidean distances in multidimensional space. (2) The angles
# between descriptors in the biplot reflect their correlations.
# scaling 2 is the default for vegan ordination graphing functions
ordiplot(tree_pca, display = 'sp')
```
### Correspondence Analysis (CA), Detrended Correspondence Analysis (DCA), and NMDS
```{r, eval=FALSE}
# each of these different indirect ordination approaches
# has different strengths and weaknesses
# Correspondence analysis (CA), also known as reciprocal averaging (RA),
# examines differences in the weighted averages of the columns
# (i.e., the species in this case)
tree_ca <- cca(comm)
# Detrended correspondence analysis (DCA) is identical except that it attempts
# to account for a well-known artifact of CA, the arch effect,
# by detrending subsequent axes against previous axes.
tree_dca <- decorana(comm)
# Non-metric multidimensional scaling (NMDS) is unique in that you may
# specify one of a number of different distance metrics to use. By
# default the Bray-Curtis distance is used by metaMDS.
tree_mds <- metaMDS(comm)
```
NMDS maximizes the rank-order correlation between the observed distances and the
distances in ordination space. Points are moved iteratively to minimize "stress",
a measure of the mismatch between the two kinds of distance.
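As a hedged sketch of how that stress can be inspected in practice (assuming the `tree_mds` object from the chunk above has been created; that chunk is not evaluated here):
```{r, eval=FALSE}
# stress of the final configuration; values below roughly 0.2 are
# conventionally considered interpretable
tree_mds$stress
# Shepard diagram: ordination distances vs. observed dissimilarities
stressplot(tree_mds)
```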
## Direct or Constrained Ordination
### Redundancy Analysis (RDA)
First let's carry out an RDA, which expects a linear response of each species to
the environmental variables. RDA is the most direct analog of OLS regression for
a multivariate response variable.
```{r, error=TRUE}
rda_tree <- rda(comm, env)
# the above breaks b/c we have a categorical factor in env
# vegan requires that we write out each term if we are not going to
# convert the factor to a dummy matrix
rda_tree <- rda(comm ~ env$elev + env$tci +
                env$streamdist + env$disturb + env$beers)
# alternatively we could use a shorthand approach
rda_tree <- rda(comm ~ . , data=env)
rda_tree
RsquareAdj(rda_tree)
```
The output above provides us with some useful information. Inertia is another name
for variation, or in this case variance. "Total" refers to the total variance,
"Constrained" refers to the amount of variance explained by the explanatory variables,
and "Unconstrained" refers to the residual variance; Constrained + Unconstrained = Total.
An $R^2$ statistic can be derived simply as Constrained / Total. The function
`RsquareAdj` computes both $R^2$ and adjusted $R^2$.
The variable "Rank" indicates the number of variables included.
The eigenvalues are displayed for both the constrained and unconstrained axes;
in this context they indicate how much variance each axis explains.
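As a minimal sketch of that arithmetic (assuming vegan's usual slots, with the constrained inertia stored under `$CCA$tot.chi`):
```{r}
# unadjusted R^2 by hand: constrained inertia / total inertia
rda_tree$CCA$tot.chi / rda_tree$tot.chi
# this should match the r.squared element returned by RsquareAdj
RsquareAdj(rda_tree)$r.squared
```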
We can plot our model result to get a sense of which variables are correlated
with which species along which axes.
```{r}
plot(rda_tree, type='n', scaling=1)
orditorp(rda_tree, display='sp', cex=0.5, scaling=1, col='blue')
text(rda_tree, display='cn', col='red')
```
We interpret the plot above as we interpreted the previous ordination
plots, with one important difference: the environmental variables are now
displayed, and their placement indicates their loading on the two displayed
RDA axes. `elev` is loading heavily on RDA1, indicating that this variable explains
a large portion of the variance associated with axis 1. The location of a
species relative to the environmental variables indicates how strongly that species
is associated with a particular environmental variable. So, for example,
ABIEFRA (*Abies fraseri*) increases as elevation increases.
### Canonical Correspondence Analysis (CCA)
Let's carry out a Canonical Correspondence Analysis (CCA) as well. CCA is appropriate
for modeling unimodal or hump-shaped responses to explanatory variables (rather
than linear as with RDA).
```{r}
cca_tree <- cca(comm ~ ., data=env)
RsquareAdj(cca_tree, 100)
anova(cca_tree, permutations = 999)
anova(cca_tree, by='margin', permutations = 999)
plot(cca_tree, type='n', scaling=1)
orditorp(cca_tree, display='sp', cex=0.5, scaling=1, col='blue')
text(cca_tree, display='bp', col='red')
```
The CCA model doesn't explain as much variation and its plots look slightly
different, but the general take-home message has not changed much.
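One way to put that comparison on a common footing is adjusted $R^2$; a sketch (keeping in mind that the CCA version is permutation-based, so the two values are only roughly comparable):
```{r}
# adjusted R^2 of the linear (RDA) vs. unimodal (CCA) models
RsquareAdj(rda_tree)$adj.r.squared
RsquareAdj(cca_tree, permutations = 100)$adj.r.squared
```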
### Hypothesis testing
Now let's carry out hypothesis testing.
```{r}
anova(rda_tree, permutations=10)
anova(rda_tree, by='margin', permutations=10)
```
In a real analysis you would specify a much larger number of permutations (at
least 1000). The first test examines overall model fit relative to a randomized
or permuted matrix of data. The second test examines the partial effects of the
individual variables included in the model.
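As a sketch of what a more complete run might look like (not evaluated here, to keep knitting fast); `anova.cca` also accepts `by = 'axis'` to test each constrained axis in turn:
```{r, eval=FALSE}
# overall and per-axis tests with a more defensible number of permutations
anova(rda_tree, permutations = 999)
anova(rda_tree, by = 'axis', permutations = 999)
```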
### Model Comparison
Model comparison can be carried out using direct ordination methods either using
nested models of different degrees of complexity as we did with univariate `lm`
type models
```{r}
rda_tree_simple <- update(rda_tree, . ~ . - beers)
anova(rda_tree_simple, rda_tree)
```
The above test suggests that the more complex model, which includes the variable
`beers`, is strongly supported. Note that this specific example is identical to the
output for the `beers` term when using `anova(rda_tree, by = 'margin')`.
Model comparison can also be undertaken using adjusted $R^2$. This is often
easier to discuss and communicate than a myriad of significance tests.
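For example, a minimal sketch comparing the two nested models fit above:
```{r}
# adjusted R^2 of the reduced vs. full model; a negligible difference
# would argue for dropping beers
RsquareAdj(rda_tree_simple)$adj.r.squared
RsquareAdj(rda_tree)$adj.r.squared
```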
### Variation Partitioning
Lastly, let's carry out variance partitioning. We can use this approach to
examine how much of the explained variance is due to different groups of
variables. In other words, this approach is really only useful if you are
interested in comparing the relative importance of one set of variables to
another set of variables.
```{r}
## variance partitioning
moisture <- env[ , c('elev', 'tci', 'beers', 'streamdist')]
# because the variable disturb is a factor we need to convert it into
# a dummy matrix using the function dummies::dummy
disturb <- dummy(env$disturb)
# examine the variance explained by each group of variables
varpart(comm, moisture, disturb)
showvarparts(2)
```
The output indicates that the moisture group of variables has the largest
individual fraction of explained variance (10%), whereas the disturbance group
of variables explains only approximately 2%.
We can also see that there are no really large fractions of shared variance,
which indicates that the effects of the variables are somewhat independent of one another.
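As a hedged follow-up sketch, the testable individual fractions reported by `varpart` can be examined with partial ordination, e.g., the moisture fraction after conditioning on disturbance (not evaluated here):
```{r, eval=FALSE}
# partial RDA: moisture variables with the disturbance dummies partialled out
moisture_part <- rda(comm, moisture, disturb)
anova(moisture_part, permutations = 999)
```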