-
Notifications
You must be signed in to change notification settings - Fork 26
/
spatial_models.Rmd
551 lines (446 loc) · 22.2 KB
/
spatial_models.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
---
title: "Spatial Models"
output: html_document
---
<!--clickable code reveal:
http://stackoverflow.com/questions/14127321/how-to-hide-code-in-rmarkdown-with-option-to-see-it
-->
<style>
div.hidecode + pre {display: none}
</style>
<script>
doclick = function(e){
e.nextSibling.nextSibling.style.display="block";
}
</script>
The goals of this lesson are to introduce how spatial dependence detected and
modeled for uni- and multi-variate response variables. Although in this lesson
we focus on spatial dependence the core concepts of how to detect and model a
lack of independence in samples may be applied to dependence due to: time,
relatedness, position on a genome, and many other study specific driver
variables.
## Readings
* The R Book p778-785 on Generalized Least Squares models with spatially
correlated errors
* Numerical Ecology in R, p228-238 on detecting spatial dependence.
* https://beckmw.wordpress.com/2013/01/07/breaking-the-rules-with-spatial-correlation/
## Outline
* Spatial autocorrelation and induced spatial dependence
* Detecting a spatial signal
* Univariate modeling
* Multivariate modeling
## Spatial autocorrelation and induced spatial dependence
There are two general reasons why spatial dependence may
exist in a response variable.
1) autocorrelation - the variable is spatially non-random due its own
internal dynamics. In ecology this is sometimes referred to as a false gradient.
An example of this is when offspring have limited dispersal from their natal
site and therefore finding one organism increases the likelihood of finding another.
2) induced spatial dependence: the variable is not inherently spatially structured
but the driving factors that it responds to are. In ecology this is sometimes
referred to as a true gradient. An example of this phenomena would be if an
organism's physiology is driven by temperature, and if temperature is non-randomly
spatially structured then the organism should also display non-random patterns
of occurrence.
There is no statistical way to distinguish false from true gradients.
## Detecting a spatial signal
We are going to use data on the occurrence of Oribatid mites collected from 70
soil cores across a field to examine how to detect non-random spatial signals
for both univariate and multi-variate response variables. Specifically, we will
examine a vector of species richness (i.e., number of species in a site), and a
matrix of species composition (i.e., a community abundance matrix):
```{r}
library(vegan)
library(nlme)
# Oribatid mite data. 70 soil cores collected by Daniel Borcard in 1989.
# See Borcard et al. (1992, 1994) for details.
data(mite)
data(mite.env)
data(mite.xy)
?mite
```
```{r}
plot(mite.xy)
sr <- rowSums(mite > 0)
hist(sr)
plot(mite.xy, cex = sr/max(sr))
col_brks <- hist(sr, plot=F)$breaks
col_indices <- as.numeric(cut(sr, col_brks))
cols <- rev(terrain.colors(length(col_brks)))
plot(mite.xy, cex=2, pch=19, col=cols[col_indices])
```
Visually it appears that low richness sites (i.e., brown circles) are more
likely to be near other low richness sites - this indicates a pattern of
spatial dependence. We can carry out some very simple analyses to
examine if this relationship is stronger than we would expect under a
null model of randomly shuffled spatial positions. Our test statistic in this
context is the Pearson correlation coefficient between the difference in the
response variable (i.e., richness) and the difference in spatial proximity.
```{r}
# calculate Euclidean distance between richness and spatial coordinates
sr_dist <- dist(sr)
xy_dist <- dist(mite.xy)
```
For interpretation purposes a rule of thumb is not to interpret distances
great than 1/2 the maximum distance in the dataset. This is to avoid
examining spatial patterns that are underlaid by only a few samples. At
small to intermediate distances there are typically many more pairs of samples
where as at the extreme ends of a sampling grid there are only two sets of
samples (i.e., those that lie along the two diagonals corners from one another)
```{r}
max_dist <- max(xy_dist) / 2
# plot result
plot(xy_dist, sr_dist)
abline(lm(sr_dist ~ xy_dist), lwd=3, col='red')
lines(lowess(xy_dist, sr_dist), lwd=3, col='pink')
abline(v = max_dist, col='red', lwd=3, lty=2)
# compute correlation
obs_cor <- cor(xy_dist, sr_dist)
obs_cor
```
##### **Question**
Based on the graphic above does it appear that values of species richness are
more similar or more different the further apart they are in space?
<div class="hidecode" onclick="doclick(this);">[Show answer]</div>
```
The above graphic demonstrates that the greater the distance between two samples
the more different those samples are in their species richness. In other words,
values of species richness are displaying positive autocorrelation or a spatially
aggregated pattern.
```
##### **Question**
What would a random spatial signal look like on this graph?
<div class="hidecode" onclick="doclick(this);">[Show answer]</div>
```
A flat line (i.e., a correlation close to zero).
```
We can better examine if our intuition is correct by randomizing the location
of each sample and re-making our graphic and re-estimating the correlation
coefficient.
```{r}
# randomize the rows of the spatial coordinates matrix
?sample
null_xy <- mite.xy[sample(nrow(mite.xy)), ]
null_dist <- dist(null_xy)
plot(null_dist, sr_dist)
abline(lm(sr_dist ~ null_dist), lwd=3, col='red')
lines(lowess(null_dist, sr_dist), lwd=3, col='pink')
abline(v = max_dist, col='red', lwd=3, lty=2)
# compute null correlation
null_cor <- cor(null_dist, sr_dist)
null_cor
```
That does look pretty flat for this randomized spatial location. We can
develop a permutation test if we carry out this same procedure many times to
develop a null distribution of correlation values. Then we can ask if our
observed value of `r paste('r = ', round(obs_cor, 2), sep='')` is larger than
we would expect simply due to random chance.
Here we will demonstrate how to make make this on your own and how to carry it
out using the `vegan` function `mantel`
```{r}
# carry out a permutation test for significance:
nperm <- 999
null_cor <- NULL
# now we'll want to generate 999 random null estimates of r
for (i in 1:nperm) {
# shuffle the rows of the spatial coordinates
null_xy <- mite.xy[sample(nrow(mite.xy)), ]
# correlation between the shuffled spatial coordinates and sr_dist
null_cor[i] <- cor(dist(null_xy), sr_dist)
}
# to get our 1000th possible outcome we'll consider that the observed
# correlation value is a 'possibility' we should consider in the null distribution:
null_cor <- c(null_cor, obs_cor)
# compute the p-value which is simply the proportion of times we observed a
# value equal to or greater than the observed correlation. We will learn that a
# value equal to or as large as the observed value only occurs once out of the
# 1000 possiblities considered so p = 0.001
sum(null_cor >= obs_cor) / (nperm + 1)
# carry out the same analysis using the function mantel()
?mantel
sr_mantel <- mantel(xy_dist, sr_dist)
sr_mantel
```
We can see that the estimated p-value is 0.001 (from both our hand made
algorithm and the `vegan::mantel` function). This indicates that there is a
statistically significant pattern of positive spatial autocorrelation which fits
our intuitive interpretation we developed graphically.
Now let's examine the distribution of null correlation values to better
understand how the p-value was computed.
```{r}
# compare the two approaches graphically using stacked boxplots
boxplot(list(null_cor, sr_mantel$perm), horizontal = F, boxwex = 0.5,
names = c('hand-made algo', 'vegan::mantel'), ylab='Correlation',
ylim = range(null_cor))
abline(h=obs_cor, col='red')
abline(h=0, lty = 2, col = 'grey')
```
As expected the null distribution is much closer to zero (grey dashed line) than
the observed value at `r round(obs_cor, 2)` (red solid line). Additionally, it
is re-assuring to see that our hand-made algorithm and the `vegan::mantel`
approaches are generating identical null distributions.
##### **Question**
What would a significant negative correlation indicate?
<div class="hidecode" onclick="doclick(this);">[Show answer]</div>
```
That the spatial pattern is uniform (rather than random or aggregated). This
might occur for example when individuals repel one another due to competition.
```
This preliminary analysis suggests that there is a significant relationship
between the spatial distance that two points are separated and the difference
in species richness of the points.
Let's take a similar approach but using a multivariate response variable in this
case a site-by-species community matrix. It is difficult to visualize from a
birds-eye view changes in a multivariate response variable because you would
need to layer a different map for each column in the matrix. That still might
not really provide you with a reduction of information that is useful for making
an inference. Here we will first reduce the information in the response matrix
using a non-metric multi-dimensional scaling ordination procedure. For simplicity
of visualization we'll just map the 1st axis on to the spatial location of the
samples.
```{r}
mite_mds <- metaMDS(mite)
mds_axs1 <- mite_mds$points[ , 1]
col_brks <- hist(mds_axs1, plot=F)$breaks
col_indices <- as.numeric(cut(mds_axs1, col_brks))
cols <- rev(terrain.colors(length(col_brks)))
plot(mite.xy, cex=2, pch=19, col=cols[col_indices])
```
As with richness there is a pretty clear spatial pattern from south to north.
In this case though it appears that the signal is even stronger than in the
species richness example as clear bands appear.
Let's now undertake a more quantitative approach and estimate the degree of
autocorrelation.
```{r}
## compute bray curtis distance for the community matrix
## note when working with species-richness we just used Euclidean distance
comm_dist <- vegdist(mite)
plot(xy_dist, comm_dist)
abline(lm(comm_dist ~ xy_dist), lwd=3, col='red')
lines(lowess(xy_dist, comm_dist), lwd=3, col='pink')
lines(lowess(xy_dist, comm_dist, f=0.1), lwd=3, col='blue')
abline(v = max_dist, col='red', lwd=3, lty=2)
comm_mantel <- mantel(xy_dist, comm_dist)
comm_mantel
```
Species composition also appears to display positive spatial auto-correlation
and it is a little stronger than species richness.
The previous plots included both the linear regression model which is what the
mantel analysis is based upon and the lowess smoother line. The smoother can
help to identify if the relationship is non-linear and how the strength of the
relationship varies with spatial distance.
Note that the estimated correlation coefficients that we have been using as
our test statistic use all of the distance classes including distances that
are greater than 1/2 the max distance which we noted above is generally
frowned upon. Notice in the above plot for `comm_dist` that the slope of the
lowess lines is steeper than the linear regression line - this indicates a
signal of spatial dependence or a non-linear change in the degree of spatial
autocorrelation as a function of distance
To develop a more nuanced understanding of the pattern of spatial autocorrelation
we can bin distances into bins and then compute **r** at each distance class to
build a correlogram.
```{r}
sr_corlog <- mantel.correlog(sr_dist, xy_dist)
comm_corlog <- mantel.correlog(comm_dist, xy_dist)
sr_corlog
comm_corlog
par(mfrow=c(1,2))
plot(sr_corlog)
mtext(side=3, 'Species Richness')
abline(v = max_dist, col='red', lwd=3, lty=2)
plot(comm_corlog)
mtext(side=3, 'Community Composition')
abline(v = max_dist, col='red', lwd=3, lty=2)
```
Above we can see that samples close to one another are more similar while
samples that are further apart are more different than expected due to chance.
Note that for some of the intermediate distance classes species-richness
does not differ from random expectation.
Last point to make is that in this analysis rather than computing 1 p-value
we have computed a p-value for each distance class. Therefore, this kind of
analysis suffers from the problem of multiple comparisons in which the p-values
are not adhering to expected type-I error. Therefore, a correction is applied
to each successive test using by default the Holm correction. See the
documentation for the `vegan::manel_correlog` function for more information.
## Univariate Modeling
Spatial (and temporal) dependence is a potential problem for inferential
statistics because of an assumption of independence of error. However, if
sufficient data is available it is often possible to model the spatial
component of error and thus "correct" for the lack of independence in a model's
error.
Crawley (2014) provides a straightforward description of these methods and a
few examples. Pinheiro and Bates (2000) provide a more detailed discussion with
more examples and they provide a useful table and figure that is helpful when
deciding which error model to chose from:
This is Table 5.2 from Pinheiro and Bates (2000) in which *s* is the spatial lag
and *rho* is the correlation parameter. This is a subset of the models
presented in Cressie (1993).
![table](figures/isotropic_variogram_models_table.png)
Graphically these models of spatial correlation can be visualized like this
(Figure 5.9 of Pinheiro and Bates 2000):
![plots](figures/isotropic_variogram_models_plots.png)
### Model fitting, interpretation, and comparison
When we carried out ordinary regression we used the formula:
$$y = X\beta + \varepsilon, \qquad \varepsilon \sim N(0, \sigma^2) \qquad \text{(equ. 1)}$$
Where *y* was a vector we wished to explain variance in (our response variable),
*X* was a matrix of explanatory variables we thought influenced *y*, the vector
$\beta$ was the vector of intercept and slope coefficients, and $\varepsilon$ was
the error or residuals of the model.
As noted above we can consider two sources of non-random spatial structure in
*y*. For induced spatial dependence or true gradients we can consider which
explanatory variables we include in *X* and if they themselves are spatially
structured. In practice, considering induced spatial dependence requires no
changes to our modeling structure or philosophy assuming that our samples are
truly independent of each other (i.e., sample similarity is only due to having
similar conditions in *X* and not because they were simply located near each
other).
If however samples are not truly independent and autocorrelation is present in
the response variable then we are violating a core assumption of OLS regression
that residual error is identically independently distributed (i.i.d). As equ. 1
states the error should be Normally distributed with variance $\sigma^2$
regardless of which samples we are considering. The key problem with ignoring
correlations in the error between samples is it means that our sample size is
inflated.
##### **Question**
If we are over estimating our sample size will that bias p-values to be
smaller or larger?
<div class="hidecode" onclick="doclick(this);">[Show answer]</div>
```
They will be biased to be smaller because larger samples require a smaller
signal to appear to not be conforming to the null hypothesis. Thus when we
inflate our samples size artifically we are inflating our Type I error and are
more likely to reject the null hypothesis when there is actually no signal.
```
Well what are we going to do about this inflated sample size? The most
conservative but often least appealing approach would be to censor your data
down to a set of samples that are truly independent. Alternatively you average
across non-independent replicates (aka pseudo-replicates). This is often either
not feasible or simply unappealing because of the loss of precision and/or power
that accompanies a loss of samples. An alternative solution we will examine here
is modeling the pattern of autocorrelation in the error of the model using
Generalized Linear Models (GLS).
**Note:** GLS models are distinct from General Linear Models (GLM) because GLS
models are primarily concerned with modeling the covariance between samples
while still adhering to the standard Normal distribution assumption for
measurement error. In contract, GLMs are typically harnessed when the Normal
distribution is not appropriate and we wish to use a different distribution for
the error term such as Poisson or Binomial distributions. It is possible to
model the correlation structure of the error term in of a GLM but we will not
consider that case here.
Let's start with a simple OLS model for species richness of mites. We'll examine
if substrate density (`SubsDens`) is linearly correlated with richness and if the
error of that regression model shows spatial structure thus violating a core
assumption of the model.
```{r}
sr_dat <- data.frame(sr, mite.env, mite.xy)
sr_lm <- gls(sr ~ SubsDens, data=sr_dat)
summary(sr_lm)
```
We can see from the output of that model that substrate density does not
seem to correlated to species richness. Let's examine the residuals of that model
using a special kind of plot called a *Variogram* which plots the semivariance
of the residuals against spatial lag (i.e., the spatial distance between samples).
Although that might sound intimidating we can build our own kind of variogram
plot using the `dist` function we used above in the lesson to compute Euclidean
distances between entries in a vector or matrix.
```{r}
vario_sr <- Variogram(sr_lm, form= ~ x + y, resType = 'response')
# note the default of teh Variogram function is to compute normalized residuals
plot(vario_sr)
# compare that to a plot we make on our own
res <- residuals(sr_lm)
res_var <- dist(res)^2 * 0.5
plot(dist(sr_dat[, c('x', 'y')]), res_var)
lines(lowess(dist(sr_dat[, c('x', 'y')]), res_var), col='red', lwd=2)
abline(v = max_dist, col='red', lwd=3, lty=2)
# they seem to tell a similar story - the error of the model is showing
# positive autocorrelation.
```
The two plots may look a little different only because the Variogram plot has
averaged across many pairs of points (the `n.pairs` column of that function's
ouput) to get semivariance estimates for different distance classes. One
important point to note is that if you take the weighted average of all the
semi-variance estiamtes then that equals the variance of the residuals, here's a
quick demonstration of that:
```{r}
sum(vario_sr$variog * vario_sr$n.pairs) / sum(vario_sr$n.pairs)
var(res)
```
Ok so far we've learned that substrate density does not seem to be influencing
richness spatially or otherwise and that there still remains autcorrelation in
richness. Let's examine the most common model of spatial (and temporal)
autocorrelation: the exponential model also know as Autoregressive 1 model or AR1.
```{r}
sr_exp <- update(sr_lm, corr=corExp(form=~x + y))
summary(sr_exp)
# examine fit of error model to the raw model residuals
# note this function defaults to displaying pearson standardized residuals
# resType='p' or resType='pearson' which have been standardized
plot(Variogram(sr_exp, maxDist = max_dist))
# that doesn't look so good because clearly the model does not fit the error
# very well, it appears that there is a nugget (i.e., non-zero y-intercept)
# Let's examine the normalized residuals in which the residuals are
# divided by the estimate of the variance-covariance matrix. If the model
# fits well these residuals should be normally distributed.
plot(Variogram(sr_exp, resType='normalized', maxDist = max_dist))
# we see a little bit of a trend in the residuals but not too bad
# actually which is a bit surprising given the output of the raw residuals
# let's look at the same model but with a nugget
sr_exp_nug <- update(sr_exp, corr=corExp(c(0.5, 0.1), form=~x + y, nugget=T))
# Notice above I had to put in starting values for the rate and nugget of the
# spatial model. I found that if I left those off the model did not converge.
# This may be due to the fact that the estimated value of the nugget is very
# close to zero.
summary(sr_exp_nug)
plot(Variogram(sr_exp_nug, maxDist = max_dist))
plot(Variogram(sr_exp_nug, resType='n', maxDist = max_dist))
# those look like they provide a better fit to the data
# let's examine the rational quadratic error model
sr_rat_nug<-update(sr_lm, corr=corRatio(form=~x + y, nugget=T))
# examine fit of error model to model residuals
plot(Variogram(sr_rat_nug, maxDist = max_dist))
plot(Variogram(sr_rat_nug, resType='n', maxDist = max_dist))
# this model seems to fit about as a good as the exponential with the nugget
# let's compare the models
anova(sr_lm, sr_exp, sr_exp_nug, sr_rat_nug, test=F)
# so it appears that the exponential and rational models with the nuggets
# fit equally as well and much better than models without spatial error terms
# and better than a model with a nugget set to zero.
summary(sr_exp_nug)
summary(sr_rat_nug)
```
Note that in the above model summaries both the regression coefficients and the
associated t-statistics were shifted towards greater importance. In general, it
is thought that beta coefficients should be fairly robust to spatial dependence
but that tests of significance will be highly sensitive.
Let's now examine our spatial map of richness and this time focus on the
residuals of our model. Do they still look spatially structured?
```{r}
col_brks <- hist(residuals(sr_exp_nug), plot=F)$breaks
col_indices <- as.numeric(cut(residuals(sr_exp_nug), col_brks))
cols <- rev(terrain.colors(length(col_brks)))
plot(mite.xy, cex=2, pch=19, col=cols[col_indices])
```
It does appear that they do still look spatially structured at least at large
distances, it is difficult to tell at short distances.
Exercise: Include the variable `WatrCont` along with `SubsDens` in the linear
model of richness. Go back through the spatial modeling routines. Does the map
of the residuals still seem to show the strong pattern of non-stationary?
Now let's examine spatial dependence in multivariate response such as species
composition in a modeling framework
```{r}
mite_rda <- rda(mite, mite.env[ , 1:2])
plot(mite_rda, display=c('sp', 'bp'))
anova(mite_rda)
mite_mso_raw <- mso(rda(mite), mite.xy, permutations = 1000)
mite_mso <- mso(mite_rda, mite.xy, permutations = 1000)
mite_mso
par(mfrow=c(1,1))
msoplot(mite_mso_raw)
msoplot(mite_mso)
```
## Literature Cited
* Pinheiro, J, and D.M. Bates. 2000. Mixed-Effects Models in S and S-PLUS.
Springer. New York, NY, USA.
* Crawley, M. 2014. The R Book, 2nd ed. Wiley, New York.
* Cressie, N.A.C. 1993. Statistics for Spatial Data, Wiley, New York.