---
title: "Multivariate models"
date: '`r paste("First created on 2015-02-15. Updated on", Sys.Date())`'
output:
  html_document:
    toc: true
    toc_float: true
---
Home Page - http://dmcglinn.github.io/quant_methods/
GitHub Repo - https://github.com/dmcglinn/quant_methods
### Source Code Link
https://raw.githubusercontent.com/dmcglinn/quant_methods/gh-pages/lessons/multivariate_models.Rmd
The goal of this lesson is to introduce multivariate ordination analyses.
## Readings
* Chapters 5 and 6 of *Numerical Ecology with R*
* Chapters 9 and 11 of *Numerical Ecology* 2nd edition
## Online Docs
* [Ordination analysis by David Zeleny](https://www.davidzeleny.net/anadat-r/doku.php/en:ordination)
* [The Ordination Webpage by Mike Palmer](http://ordination.okstate.edu/)
    - great for term definitions, layman's explanations of how the methods
      differ, and how ecologists should interpret them
* [Vegan: an introduction to ordination by Jari Oksanen](http://cran.r-project.org/web/packages/vegan/vignettes/intro-vegan.pdf)
- A brief demonstration of an ordination analysis in the R package vegan
* [Multivariate Analysis of Ecological Communities in R: vegan tutorial by Jari Oksanen](https://john-quensen.com/wp-content/uploads/2018/10/Oksanen-Jari-vegantutor.pdf)
    - A more thorough treatment of ordination in the R package vegan
## Outline
* Overview of ordination methods
* Create a community matrix
* Indirect or Unconstrained Ordination
    - Principal Components Analysis (PCA)
    - Correspondence Analysis (CA)
    - Detrended Correspondence Analysis (DCA)
    - Non-metric Multidimensional Scaling (NMDS)
* Direct or Constrained Ordination
    - Redundancy Analysis (RDA)
    - Canonical Correspondence Analysis (CCA)
    - Hypothesis Testing
    - Model Comparison
    - Variance Partitioning
## Overview of multivariate ordination methods
Multivariate statistical analyses are generalizations of the univariate
approaches we examined in the last lesson. In the case of a univariate approach
we considered a model in which
$$y_i = b_0 + b_1 x_{i1} + \cdots + b_p x_{ip} + \varepsilon_i, \qquad i = 1, \ldots, n, \qquad \text{(equ. 1)}$$
We are interested in using multivariate methods when we are no longer just
interested in explaining variance in a vector *y*; instead, we would like to
understand the dominant patterns and sources of variation in a matrix of response
variables **Y** that are all of the same type (i.e., same units).
Oftentimes, we may simply be interested in whether we can represent our response matrix **Y**
using a relatively small number of dimensions (without any reference to an
explanatory matrix). For example, if **Y** has *n* samples (i.e., rows) and *p*
descriptors (i.e., columns), then we can fully represent all the variation in
**Y** using *p* hypothetical axes; however, what would a *p*-dimensional graph
look like? Any dimensionality > 3 is difficult for the human mind to comprehend.
Furthermore, is that degree of complexity even needed to see the major trends
in **Y**? Picture trying to interpret a pairs plot that has $p(p-1)/2$
graphs (one graph for each unique pair of columns in **Y**).
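As a quick, hypothetical illustration of how fast that number of panels grows (the value of `p` here is made up):
```{r}
# number of unique column pairs, i.e., panels in a pairs plot, for p descriptors
p <- 10
p * (p - 1) / 2    # 45 panels for just 10 descriptors
choose(p, 2)       # the same count via the binomial coefficient
```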
The good news is that, in general, the information in matrices can be reduced
down to a few major axes of variation. Ordination is the term for this
process of reducing the information needed to represent a matrix. The term
derives from the idea of ordinating, or putting things in order. With ordination
approaches we are attempting to take a high-dimensional data matrix and explain
its patterns with a small number of axes. Although this is a very general
concept across scientific disciplines, the term ordination is generally only
used in ecology.
There are generally considered to be two types of ordination:

1. Indirect or unconstrained ordination, in which only a single matrix is analyzed.
2. Direct or constrained ordination, in which one matrix is used to explain the
variance of another matrix.

The figure below (Legendre and Legendre 1998, Fig. 11.1) visually illustrates this
conceptual shift from modeling a vector *y* (panel b) to a matrix **Y**.
Indirect ordination does not use an explanatory matrix **X** (panel a), while
direct ordination methods do (panel c).
![](./figures/Fig11-1_legendre.png)
Today we will examine both indirect and direct methods of ordination. In
general, ordination is frequently used to explore patterns in datasets
graphically; however, it can also be used to carry out hypothesis testing.
Despite their sometimes terrifying names, ordination methods have a close kinship
with multiple regression (as illustrated in the figure above). One key difference is
that the response variable is multivariate rather than univariate; however, for
many approaches the underlying algebra of regression and ordination is very
similar.
Overview of Ordination methods (adapted from <http://ordination.okstate.edu/terminol.htm>)
```{r, echo = FALSE}
library(knitr)
# set up the R environment for knitting the markdown doc properly
opts_knit$set(root.dir='../')
ord <- read.csv('./ordination_table.csv')
names(ord) <- c("Acronym", "Name", "Type of analysis", "Algorithm", "R function")
kable(ord)
```
This [presentation](./community_structure_slides_with_notes.pdf) by
[Abel Valdivia](https://twitter.com/AbelValdivia) reviews the types of
ordination and gives examples of their graphical output.
Additionally this [key](http://ordination.okstate.edu/key.htm) created by Mike
Palmer provides a decision tree that can help to guide your choice of methods.
## Create a community matrix
The first, and a very common, challenge when working with multivariate analyses is constructing the multivariate matrix we wish to analyze.
Essentially, a community matrix is a cross-tab structure in which each descriptor element (e.g., species identity) is a column id and each sample element (e.g., site identity) is a row id.
A cross-tab structure is a very inefficient way to store your raw data, so in many cases we must aggregate the raw data, which can be accomplished using a for loop or other simple functions.
Using the Great Smoky Mountains tree dataset that was used in HW 3, I will demonstrate how to do this with a computationally inefficient but easy-to-understand for loop.
As a reminder, the metadata for the tree data are located [here](../data/tree_metadata.txt).
Quick note: in this lesson we will use the R package `dummies` to create dummy
variables. This package is now defunct and no longer on the main package repository,
CRAN. To install it you will have to use the package `remotes` and its function
`install_github`; specifically, once `remotes` is installed, run
`remotes::install_github('cran/dummies')`
```{r load data}
# load relevant packages and code for today's lesson
library(vegan)
library(dummies) # this package is no longer on cran see note above
source('./scripts/utility_functions.R')
# load data
tree <- read.csv('./data/treedata_subset.csv')
# create a community site x species matrix by averaging species cover values
# we can do this with a for loop but it takes a while to run
uni_sp <- unique(tree$spcode)
uni_site <- unique(tree$plotID)
```
```{r for loop, eval=FALSE}
comm <- matrix(0, ncol=length(uni_sp), nrow=length(uni_site))
colnames(comm) <- uni_sp
rownames(comm) <- uni_site
for(i in seq_along(uni_sp)) {
    for(j in seq_along(uni_site)) {
        comm[j, i] <- mean(tree$cover[tree$spcode == uni_sp[i] &
                                      tree$plotID == uni_site[j]])
    }
}
comm[1:5, 1:5]
```
Alternatively we could use the function `tapply` which is much more efficient.
```{r tapply}
# alternatively we can use a tapply function
comm <- tapply(tree$cover,
               INDEX = list(tree$plotID, tree$spcode),
               mean)
# examine the community matrix
comm[1:5, 1:5]
# replace the NAs with zeros
comm <- ifelse(is.na(comm), 0, comm)
comm[1:5, 1:5]
```
Now that we've created our matrix, it is usually a good idea to make sure everything looks OK - the row and column sums (i.e., marginal sums) provide one reasonable, if coarse, way to check how the data are structured.
```{r marginal sums}
# visually explore the cover variable between species and sites
uni_sp <- unique(tree$spcode)
sp_sum <- apply(comm, 2, sum)
site_sum <- apply(comm, 1, sum)
par(mfrow=c(2,2))
hist(sp_sum)
col <- colorRampPalette(c('red', 'orange', 'blue'))
sp_cols <- col(length(uni_sp))
plot(sp_sum[order(sp_sum, decreasing = TRUE)], type = 'o', col = 'red', lwd = 2,
     xlab = 'Sp Rank', ylab = 'Sum Cover')
hist(site_sum)
plot(site_sum[order(site_sum, decreasing = TRUE)], type = 'o', col = 'red', lwd = 2,
     xlab = 'Site Rank', ylab = 'Sum Cover')
par(mfrow=c(1,1))
```
Above we can see that most species have small total covers (i.e., most species are rare), which is a fairly universal ecological law, so this seems correct.
Also we can see that most sites have a total cover of approximately 40, and that this distribution isn't as highly skewed as the species marginal totals.
NOTE: These plots are not terribly helpful at understanding the variability in
the community matrix. They are just useful for looking at gross aggregate
properties of the matrix; therefore, they can be useful for spotting an error in
the matrix. Let's get our environmental variables ready to use and start to
conduct some actual analyses.
## Create an explanatory matrix
In the tree dataset, each site has one set of environmental measurements. These
are replicated across the rows of the `tree` data object:
```{r, echo=FALSE}
with(tree, head(tree[plotID == "ATBN-01-0403", ]))
```
So we just need to pull out the environmental data from a single one of those rows.
In the end we need the explanatory matrix to have the same number of rows, in the
same order, as the community matrix. We could write a for loop to do this, but
we'll take the easier approach of using the function `aggregate`.
```{r create env matrix}
cols_to_keep <- c('elev', 'tci', 'streamdist', 'disturb', 'beers')
env <- aggregate(tree[ , cols_to_keep], by = list(tree$plotID), function(x) x[1])
# aggregate does have a side effect of adding a column to the output which
# is the unique plotID in this case. We just need to move that so it is a
# row name instead.
row.names(env) <- env[ , 1]
env <- env[ , -1]
head(env)
# before we can use this explanatory matrix we need to check
# that its rows are in the same order as our response matrix
all.equal(rownames(comm), rownames(env))
# now that we've completed that check we can rename the rows to something
# more manageable
rownames(comm) <- 1:nrow(comm)
rownames(env) <- 1:nrow(env)
```
## Indirect or Unconstrained Ordination
### Principal Components Analysis (PCA)
Principal Components Analysis (PCA) is a useful tool for either
1. examining correlations between the columns of a matrix
2. potentially reducing the set of explanatory variables included in a model
Simplified description of the PCA algorithm (from [Zelený 2018](https://www.davidzeleny.net/anadat-r/doku.php/en:pca)):
a. Take the matrix of samples x species (or, generally, samples x descriptors) and display each sample in the multidimensional space where each dimension is defined by the abundance of one species (or descriptor). The samples form a cloud of points in this multidimensional space.
b. Calculate the centroid of the cloud.
c. Move the origin of the axes to this centroid.
d. Rotate the axes in such a way that the first axis goes through the cloud in the direction of the highest variance, the second is perpendicular to the first and goes in the direction of the second highest variance, and so on. The positions of the samples on the resulting rotated axes are the sample scores on the ordination axes.
<img src="https://www.davidzeleny.net/anadat-r/lib/exe/fetch.php/obrazky:pca_algorithm_2d_legendre-legendre.jpg?cache="/>
Figure 1: PCA ordination of five samples and two species. (Fig. 9.2 from Legendre & Legendre 1998.)
<img src="https://www.davidzeleny.net/anadat-r/lib/exe/fetch.php/obrazky:pca_algorithm_3d.jpg?cache="/>
Figure 2: 3D schema of PCA ordination algorithm
Check out the nice interactive visualization at [Explained Visually](http://setosa.io/ev/principal-component-analysis/)
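To make the recipe above concrete, here is a minimal base-R sketch of those four steps on a small made-up matrix (the data in `Y` are hypothetical, just for illustration); up to the sign of each axis, the scores it produces should match `prcomp`:
```{r}
# a tiny hypothetical 5 samples x 2 species matrix
Y <- matrix(c(2, 3, 5, 7, 9,
              1, 4, 0, 6, 2), ncol = 2)
# steps b & c: center the cloud on its centroid
Y_cen <- scale(Y, center = TRUE, scale = FALSE)
# step d: the rotation is given by the eigenvectors of the covariance matrix
eig <- eigen(cov(Y_cen))
# sample scores = centered data projected onto the rotated axes
scores <- Y_cen %*% eig$vectors
# cross-check against R's built-in PCA (axis signs may flip)
prcomp(Y)$x
scores
```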
We'll start by using PCA to simply examine the correlations in tree species
covers that are contained in the community matrix.
```{r comm pca}
tree_pca <- rda(comm)
tree_pca
```
Like the object returned by the function `lm`, the output of `rda` is just a named list.
We can get a sense of the structure of this named list using the function `str`
```{r using str}
str(tree_pca)
```
This is useful because, for example, if we wanted to pull out the eigenvalues of the
analysis, we would see that they are stored under `$CA$eig`.
We can access each of the objects stored in a named list just as we would reference columns in a `data.frame`, using the `$` operator.
```{r eigenvalues}
tree_pca$tot.chi
tree_pca$CA$eig
# the eigenvalues sum up to equal the total inertia
sum(tree_pca$CA$eig)
# inertia in a PCA is equal to the total variance of the matrix because PCA
# uses the usual Euclidean distance as its metric of how different samples are:
# 1 / (n-1) * sum((x - xbar)^2) where n is sample size and xbar is the mean x value
# you can verify that this is equal to the sum of the variances (i.e., diagonal
# elements) in the species covariance matrix
sum(diag(cov(comm)))
# or if you want to do it by hand with matrix algebra:
# first center the community matrix by subtracting the column means
cen_comm <- apply(comm, 2, function(x) x - mean(x))
# then compute the sample covariance matrix, dividing by n - 1 to match cov()
cov_comm <- (nrow(comm) - 1)^-1 * t(cen_comm) %*% cen_comm
sum(diag(cov_comm))
# the ratio of the eigenvalue to the total variance is the amount of
# variance explained by each PCA axis
round(tree_pca$CA$eig / tree_pca$tot.chi, 2)
```
We can see from above that the PCA axis 1 captures approximately 20% of the total
variance in the community matrix.
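One common way to judge how many of these axes are worth interpreting is a scree plot; as a sketch, vegan's `screeplot` method for `rda` objects can overlay a broken-stick null distribution:
```{r}
# eigenvalues by axis; bstick = TRUE adds the broken-stick benchmark, a
# common rule of thumb for which axes capture more variance than expected
screeplot(tree_pca, bstick = TRUE, type = 'lines')
```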
Let's graph the data to better get a sense of the correlation structure.
```{r plot pca}
plot(tree_pca)
biplot(tree_pca)
cleanplot.pca(tree_pca)
# p120-121 of Numerical Ecology in R:
# Scaling 1 = distance biplot: the eigenvectors are scaled to unit length. (1)
# Distances among objects in the biplot are approximations of their
# Euclidean distances in multidimensional space. (2) The angles among
# descriptor vectors are meaningless.
# Scaling 2 = correlation biplot: each eigenvector is scaled to the square root of
# its eigenvalue. (1) Distances among objects in the biplot are not approximations
# of their Euclidean distances in multidimensional space. (2) The angles
# between descriptors in the biplot reflect their correlations.
# scaling 2 is the default for vegan ordination graphing functions
ordiplot(tree_pca, display = 'sp')
```
### Correspondence Analysis (CA), Detrended Correspondence Analysis (DCA), and NMDS
```{r, eval=FALSE}
# each of these different indirect ordination approaches
# has different strengths and weaknesses
# Correspondence analysis (CA), also known as reciprocal averaging (RA),
# examines differences in the weighted averages of the columns
# (i.e., the species in this case)
tree_ca <- cca(comm)
# Detrended correspondence analysis (DCA) is identical except that it attempts
# to account for a well-known artifact of CA, the arch effect,
# by detrending subsequent axes against previous axes.
tree_dca <- decorana(comm)
# Non-metric multidimensional scaling (NMDS) is unique in that you may
# specify one of a number of different distance metrics to use. By
# default the Bray-Curtis distance is used by metaMDS.
tree_mds <- metaMDS(comm)
```
NMDS maximizes the rank-order correlation between the observed distances and the
distances in ordination space. Points are moved iteratively to minimize "stress",
a measure of the mismatch between the two kinds of distance.
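As a hedged sketch of how that stress can be inspected in practice (assuming the `tree_mds` object from the chunk above has been created; that chunk is not evaluated here):
```{r, eval=FALSE}
# stress of the final configuration; values below roughly 0.2 are
# conventionally considered interpretable
tree_mds$stress
# Shepard diagram: ordination distances vs. observed dissimilarities
stressplot(tree_mds)
```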
## Direct or Constrained Ordination
### Redundancy Analysis (RDA)
First let's carry out an RDA, which expects a linear response of each species to
the environmental variables. RDA is the most direct analog of OLS regression for
a multivariate response variable.
```{r, error=TRUE}
rda_tree <- rda(comm, env)
# the above breaks b/c we have a categorical factor in env
# vegan requires that we write out each term if we are not going to
# convert the factor to a dummy matrix
rda_tree <- rda(comm ~ env$elev + env$tci +
                env$streamdist + env$disturb + env$beers)
# alternatively we could use a shorthand approach
rda_tree <- rda(comm ~ . , data=env)
rda_tree
RsquareAdj(rda_tree)
```
The output above provides us with some useful information. Inertia is another name
for variation, or in this case variance. "Total" refers to the total variance,
"Constrained" refers to the amount of variance explained by the explanatory variables,
and "Unconstrained" refers to the residual variance; Constrained + Unconstrained = Total.
An $R^2$ statistic can be derived simply as Constrained / Total. The function
`RsquareAdj` computes both $R^2$ and adjusted $R^2$.
The variable "Rank" indicates the number of variables included.
The eigenvalues are displayed for both the constrained and unconstrained axes;
in this context they indicate how much variance each axis explains.
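As a minimal sketch of that arithmetic (assuming vegan's usual slots, with the constrained inertia stored under `$CCA$tot.chi`):
```{r}
# unadjusted R^2 by hand: constrained inertia / total inertia
rda_tree$CCA$tot.chi / rda_tree$tot.chi
# this should match the r.squared element returned by RsquareAdj
RsquareAdj(rda_tree)$r.squared
```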
We can plot our model result to get a sense of which variables are correlated
with which species along which axes.
```{r}
plot(rda_tree, type='n', scaling=1)
orditorp(rda_tree, display='sp', cex=0.5, scaling=1, col='blue')
text(rda_tree, display='cn', col='red')
```
We interpret the plot above as we interpreted the previous ordination
plots, with one important difference: the environmental variables are now
displayed, and their placement indicates their loading on the two displayed
RDA axes. `elev` is loading heavily on RDA1, indicating that this variable explains
a large portion of the variance associated with axis 1. The location of a
species relative to the environmental variables indicates how strongly that species
is associated with a particular environmental variable. So, for example,
ABIEFRA (*Abies fraseri*) increases as elevation increases.
### Canonical Correspondence Analysis (CCA)
Let's carry out a Canonical Correspondence Analysis (CCA) as well. CCA is appropriate
for modeling unimodal or hump-shaped responses to explanatory variables (rather
than linear as with RDA).
```{r}
cca_tree <- cca(comm ~ ., data=env)
RsquareAdj(cca_tree, 100)
anova(cca_tree, permutations = 999)
anova(cca_tree, by='margin', permutations = 999)
plot(cca_tree, type='n', scaling=1)
orditorp(cca_tree, display='sp', cex=0.5, scaling=1, col='blue')
text(cca_tree, display='bp', col='red')
```
The CCA model doesn't explain as much variation and its plots look slightly
different, but the general take-home message has not changed much.
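One way to put that comparison on a common footing is adjusted $R^2$; a sketch (keeping in mind that the CCA version is permutation-based, so the two values are only roughly comparable):
```{r}
# adjusted R^2 of the linear (RDA) vs. unimodal (CCA) models
RsquareAdj(rda_tree)$adj.r.squared
RsquareAdj(cca_tree, permutations = 100)$adj.r.squared
```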
### Hypothesis testing
Now let's carry out hypothesis testing.
```{r}
anova(rda_tree, permutations=10)
anova(rda_tree, by='margin', permutations=10)
```
In a real analysis you would specify a much larger number of permutations (at
least 1000). The first test examines overall model fit relative to a randomized
or permuted matrix of data. The second test examines the partial effects of the
individual variables included in the model.
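As a sketch of what a more complete run might look like (not evaluated here, to keep knitting fast); `anova.cca` also accepts `by = 'axis'` to test each constrained axis in turn:
```{r, eval=FALSE}
# overall and per-axis tests with a more defensible number of permutations
anova(rda_tree, permutations = 999)
anova(rda_tree, by = 'axis', permutations = 999)
```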
### Model Comparison
Model comparison can be carried out using direct ordination methods either using
nested models of different degrees of complexity as we did with univariate `lm`
type models
```{r}
rda_tree_simple <- update(rda_tree, . ~ . - beers)
anova(rda_tree_simple, rda_tree)
```
The above test suggests that the more complex model, which includes the variable
`beers`, is strongly supported. Note that this specific example is identical to the
output for the `beers` term when using `anova(rda_tree, by = 'margin')`.
Model comparison can also be undertaken using adjusted $R^2$. This is often
easier to discuss and communicate than a myriad of significance tests.
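For example, a minimal sketch comparing the two nested models fit above:
```{r}
# adjusted R^2 of the reduced vs. full model; a negligible difference
# would argue for dropping beers
RsquareAdj(rda_tree_simple)$adj.r.squared
RsquareAdj(rda_tree)$adj.r.squared
```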
### Variation Partitioning
Lastly, let's carry out variance partitioning. We can use this approach to
examine how much of the explained variance is due to different groups of
variables. In other words, this approach is really only useful if you are
interested in comparing the relative importance of one set of variables to
another set of variables.
```{r}
## variance partitioning
moisture <- env[ , c('elev', 'tci', 'beers', 'streamdist')]
# because the variable disturb is a factor we need to convert it into
# a dummy matrix using the function dummies::dummy
disturb <- dummy(env$disturb)
# examine the variance explained by each group of variables
varpart(comm, moisture, disturb)
showvarparts(2)
```
The output indicates that the moisture group of variables has the largest
individual fraction of explained variance (10%), whereas the disturbance group
of variables explains only approximately 2%.
We can also see that there are no really large fractions of shared variance,
which indicates that the effects of the variables are somewhat independent of one another.
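As a hedged follow-up sketch, the testable individual fractions reported by `varpart` can be examined with partial ordination, e.g., the moisture fraction after conditioning on disturbance (not evaluated here):
```{r, eval=FALSE}
# partial RDA: moisture variables with the disturbance dummies partialled out
moisture_part <- rda(comm, moisture, disturb)
anova(moisture_part, permutations = 999)
```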