# Ecology {#eco}
## Prerequisites {-}
This chapter assumes you have a strong grasp of geographic data analysis and processing, covered in chapters \@ref(spatial-class) to \@ref(geometric-operations).
In it you will also make use of R's interfaces to dedicated GIS software, and spatial cross validation, topics covered in chapters \@ref(gis) and \@ref(spatial-cv) respectively.
The chapter uses the following packages:
```{r, message=FALSE}
library(sf)
library(raster)
library(RQGIS)
library(mlr)
library(dplyr)
library(vegan)
```
## Introduction
In this chapter we will model the floristic gradient of fog oases to reveal distinctive vegetation belts that are clearly controlled by water availability.
To do so, we will bring together concepts presented in previous chapters and even extend them (chapters \@ref(spatial-class) to \@ref(geometric-operations) and chapters \@ref(gis) and \@ref(spatial-cv)).
Fog oases are one of the most fascinating vegetation formations we have ever encountered.
These formations, locally termed *lomas*, develop on mountains along the coastal deserts of Peru and Chile.^[Similar vegetation formations develop also in other parts of the world, e.g. in Namibia and along the coasts of Yemen and Oman [@galletti_land_2016].]
The deserts' extreme conditions and remoteness provide the habitat for a unique ecosystem, including species endemic to the fog oases.
Despite the arid conditions and low precipitation of around 30-50 mm per year on average, fog deposition increases the amount of water available to plants during the austral winter.
This results in green southern-facing mountain slopes along the coastal strip of Peru (Fig. \@ref(fig:study-area-mongon)).
This fog, which develops below the temperature inversion caused by the cold Humboldt current in austral winter, provides the name for this habitat.
Every few years, the El Niño phenomenon brings torrential rainfall to this sun-baked environment [@dillon_lomas_2003].
This causes the desert to bloom, and provides tree seedlings a chance to develop roots long enough to survive the following arid conditions.
Unfortunately, fog oases are heavily endangered.
This is mostly due to human activity (agriculture and climate change).
To effectively protect the last remnants of this unique vegetation ecosystem, evidence is needed on the composition and spatial distribution of the native flora [@muenchow_predictive_2013; @muenchow_soil_2013].
*Lomas* mountains also have economic value as a tourist destination, and can contribute to the well-being of local people via recreation.
For example, most Peruvians live in the coastal desert, and *lomas* mountains are frequently the closest "green" destination.
In this chapter we will demonstrate ecological applications of some of the techniques learned in the previous chapters.
This case study will involve analyzing the composition and the spatial distribution of the vascular plants on the southern slope of Mt. Mongón, a *lomas* mountain near Casma on the central northern coast of Peru (Fig. \@ref(fig:study-area-mongon)).
```{r study-area-mongon, echo=FALSE, fig.cap="The Mt. Mongón study area, from Muenchow, Schratz, and Brenning (2017).", out.width="60%", fig.scap="The Mt. Mongón study area."}
knitr::include_graphics("https://user-images.githubusercontent.com/1825120/38989956-6eae7c9a-43d0-11e8-8f25-3dd3594f7e74.png")
```
During a field study on Mt. Mongón we recorded all vascular plants living in 100 randomly sampled 4 x 4 m plots in the austral winter of 2011 [@muenchow_predictive_2013].
The sampling coincided with a strong La Niña event that year (see ENSO monitoring of the [NOAA Climate Prediction Center](http://origin.cpc.ncep.noaa.gov/products/analysis_monitoring/ensostuff/ONI_v5.php)).
This led to even higher levels of aridity than usual in the coastal desert.
On the other hand, it also increased fog activity on the southern slopes of Peruvian *lomas* mountains.
<!--
The first hypothesis is that four plant belts will be found along the altitudinal gradient: a low-elevation *Tillandsia* belt, a herbaceous belt, a bromeliad belt, and an uppermost succulent belt [@muenchow_soil_2013].
-->
Ordinations are dimension-reducing techniques which allow the extraction of the main gradients from a (noisy) dataset, in our case the floristic gradient developing along the southern mountain slope (see next section).
In this chapter we will model the first ordination axis, i.e., the floristic gradient, as a function of environmental predictors such as altitude, slope, catchment area and NDVI.
For this, we will make use of a random forest model - a very popular machine learning algorithm [@breiman_random_2001].
The model will allow us to make spatial predictions of the floristic composition anywhere in the study area.
To obtain the best possible prediction, it is advisable to tune the hyperparameters beforehand with the help of spatial cross-validation (see Section \@ref(svm)).
## Data and data preparation
All the data needed for the subsequent analyses is available via the **RQGIS** package.
```{r}
data("study_area", "random_points", "comm", "dem", "ndvi", package = "RQGIS")
```
`study_area` is an `sf` polygon representing the outlines of the study area.
`random_points` is an `sf` object, and contains the 100 randomly chosen sites.
`comm` is a community matrix in wide data format [@wickham_tidy_2014] where the rows represent the sites visited in the field and the columns the observed species.^[In statistics this is also called a contingency or cross-table.]
```{r}
# sites 35 to 40 and corresponding occurrences of the first five species in the
# community matrix
comm[35:40, 1:5]
```
The values represent species cover per site, and were recorded as the area covered by a species in proportion to the site area in percentage points (%; please note that one site can have >100% due to overlapping cover between individual plants).
The rownames of `comm` correspond to the `id` column of `random_points`.
`dem` is the digital elevation model for the study area, and `ndvi` is the Normalized Difference Vegetation Index (NDVI) computed from the red and near-infrared channels of a Landsat scene (see Section \@ref(local-operations) and `?ndvi`).
Visualizing the data helps to get more familiar with it, as in Figure \@ref(fig:sa-mongon) where the `dem` is overplotted by the `random_points` and the `study_area`.
```{r, eval=FALSE, echo=FALSE}
# create hillshade
hs = hillShade(terrain(dem), terrain(dem, "aspect"))
# plot the data
par(mar = rep(1, 4))
plot(hs, col = gray(0:100 / 100), legend = FALSE, axes = FALSE, box = FALSE)
plot(dem, axes = FALSE, add = TRUE, alpha = 0.5, legend = FALSE)
axis(1)
axis(2)
plot(st_geometry(random_points), add = TRUE)
plot(st_geometry(study_area), add = TRUE)
# white margins between axes and plot are too wide
```
```{r sa-mongon, echo=FALSE, message=FALSE, fig.cap="Study mask (polygon), location of the sampling sites (black points) and DEM in the background.", fig.scap="Study mask, location of the sampling sites."}
library("latticeExtra")
library("grid")
hs = hillShade(terrain(dem), terrain(dem, "aspect"))
spplot(dem, col.regions = terrain.colors(50), alpha.regions = 0.5,
scales = list(draw = TRUE,
tck = c(1, 0)),
colorkey = list(space = "right", title = "m asl",
width = 0.5, height = 0.5,
axis.line = list(col = "black")),
sp.layout = list(
list("sp.points", as(random_points, "Spatial"), pch = 16,
col = "black", cex = 0.8, first = FALSE),
list("sp.polygons", as(study_area, "Spatial"),
col = "black", first = FALSE)
)
) +
latticeExtra::as.layer(spplot(hs, col.regions = gray(0:100 / 100)),
under = TRUE)
grid.text("m asl", x = unit(0.8, "npc"), y = unit(0.75, "npc"),
gp = gpar(cex = 0.8))
```
The next step is to compute variables which we will predominantly need for the modeling and predictive mapping (see Section \@ref(predictive-mapping)) but also for aligning the NMDS axes with the main gradient, altitude and humidity, respectively, in the study area (see Section \@ref(nmds)).
Specifically, we will compute catchment slope and catchment area from a digital elevation model using R-GIS bridges (see Chapter \@ref(gis)).
Curvatures might also represent valuable predictors; in the exercise section you can find out how they would change the modeling result.
To compute catchment area and catchment slope, we will make use of the `saga:sagawetnessindex` function.^[Admittedly, it is a bit unsatisfying that the only way of knowing that `sagawetnessindex` computes the desired terrain attributes is to be familiar with SAGA and/or to search the web for "SAGA catchment slope".]
`get_usage()` returns all function parameters and default values of a specific geoalgorithm.
Here, we present only a selection of the complete output.
```{r, eval=FALSE}
get_usage("saga:sagawetnessindex")
#>ALGORITHM: Saga wetness index
#> DEM <ParameterRaster>
#> ...
#> SLOPE_TYPE <ParameterSelection>
#> ...
#> AREA <OutputRaster>
#> SLOPE <OutputRaster>
#> AREA_MOD <OutputRaster>
#> TWI <OutputRaster>
#> ...
#>SLOPE_TYPE(Type of Slope)
#> 0 - [0] local slope
#> 1 - [1] catchment slope
#> ...
```
Subsequently, we can specify the needed parameters using R named arguments (see Section \@ref(rqgis)).
Remember that we can use a `RasterLayer` living in R's global environment to specify the input raster `DEM` (see Section \@ref(rqgis)).
Specifying 1 as the `SLOPE_TYPE` makes sure that the algorithm will return the catchment slope.
The resulting output rasters should be saved to temporary files with an `.sdat` extension, which is a SAGA raster format.
Setting `load_output` to `TRUE` ensures that the resulting rasters will be imported into R.
```{r, eval=FALSE}
# environmental predictors: catchment slope and catchment area
ep = run_qgis(alg = "saga:sagawetnessindex",
              DEM = dem,
              SLOPE_TYPE = 1,
              SLOPE = tempfile(fileext = ".sdat"),
              AREA = tempfile(fileext = ".sdat"),
              load_output = TRUE,
              show_output_paths = FALSE)
```
This returns a list named `ep` consisting of two elements: `AREA` and `SLOPE`.
Let us add two more raster objects to the list, namely `dem` and `ndvi`, and convert it into a raster stack (see Section \@ref(raster-classes)).
```{r, eval=FALSE}
ep = stack(c(dem, ndvi, ep))
names(ep) = c("dem", "ndvi", "carea", "cslope")
```
Additionally, the catchment area values are highly skewed to the right (`hist(ep$carea)`).
A log10-transformation makes the distribution more normal.
```{r, eval=FALSE}
ep$carea = log10(ep$carea)
```
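The effect of the transformation can be illustrated with simulated values. The following is only a sketch: `carea_sim` is a hypothetical, log-normally distributed stand-in for the catchment area (not the real raster values), and `skewness()` is a small helper defined here for illustration.

```{r, eval=FALSE}
# simulated, right-skewed stand-in for catchment area values (not the real data)
set.seed(1)
carea_sim = rlnorm(1000, meanlog = 8, sdlog = 1.5)
# simple moment-based skewness (helper defined for this sketch only)
skewness = function(x) mean((x - mean(x))^3) / sd(x)^3
# strongly positive: long right tail
skewness(carea_sim)
# close to zero after the log10-transformation
skewness(log10(carea_sim))
```

A skewness near zero indicates a roughly symmetric distribution, which is what the log10-transformation aims for here.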
As a convenience to the reader we have added `ep` to **spDataLarge**:
```{r}
data("ep", package = "spDataLarge")
```
Finally, we can extract the terrain attributes to our field observations (see also Section \@ref(raster-extraction)).
```{r}
random_points[, names(ep)] = raster::extract(ep, as(random_points, "Spatial"))
```
## Reducing dimensionality {#nmds}
Ordinations are a popular tool in vegetation science to extract the main information, frequently corresponding to ecological gradients, from large species-plot matrices mostly filled with 0s.
However, they are also used in remote sensing, the soil sciences, geomarketing and many other fields.
If you are unfamiliar with ordination techniques or in need of a refresher, have a look at Michael W. Palmer's [webpage](http://ordination.okstate.edu/overview.htm) for a short introduction to popular ordination techniques in ecology and at @borcard_numerical_2011 for a deeper look at how to apply these techniques in R.
**vegan**'s package documentation is also a very helpful resource (`vignette(package = "vegan")`).
Principal component analysis (PCA) is probably the most famous ordination technique.
It is a great tool to reduce dimensionality if one can expect linear relationships between variables, and if the joint absence of a variable (for example calcium) in two plots (observations) can be considered a similarity.
This is rarely the case with vegetation data.
For one, relationships are usually non-linear along environmental gradients.
That means the presence of a plant usually follows a unimodal relationship along a gradient (e.g., humidity, temperature or salinity) with a peak at the most favorable conditions and declining ends towards the unfavorable conditions.
Secondly, the joint absence of a species in two plots is hardly an indication of similarity.
Suppose a plant species is absent from the driest (e.g., an extreme desert) and the wettest locations (e.g., a tree savanna) of our sampling.
Then we really should refrain from counting this as a similarity because it is very likely that the only thing these two completely different environmental settings have in common in terms of floristic composition is the shared absence of species (except for rare ubiquitous species).
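The joint-absence problem can be made concrete with a toy example. The sketch below uses two hypothetical sites (`desert` and `savanna`) and six made-up species; it compares a similarity measure that counts joint absences (simple matching) with one that ignores them (Jaccard). Neither measure is claimed to be what `metaMDS()` uses internally.

```{r, eval=FALSE}
# hypothetical presence-absence data: 2 sites, 6 species
desert  = c(1, 0, 0, 0, 0, 0)
savanna = c(0, 1, 0, 0, 0, 0)
# simple matching: shared presences AND shared absences count as agreement
simple_matching = mean(desert == savanna)
# Jaccard: only species present in at least one of the two sites are considered
jaccard = sum(desert & savanna) / sum(desert | savanna)
# 4 of 6 species "agree", all of them joint absences
simple_matching
# 0: the two sites have no species in common
jaccard
```

Although the two sites share no species at all, simple matching suggests they are fairly similar, which is exactly the artifact described above.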
Non-metric multidimensional scaling (NMDS) is one popular dimension-reducing technique in ecology [@vonwehrden_pluralism_2009].
NMDS reduces the rank-based differences between the distances between objects in the original matrix and distances between the ordinated objects.
The difference is expressed as stress.
The lower the stress value, the better the ordination, i.e. the low-dimensional representation of the original matrix.
Stress values lower than 10 represent an excellent fit, stress values of around 15 are still good, and values greater than 20 represent a poor fit [@mccune_analysis_2002].
In R, `metaMDS()` of the **vegan** package can execute an NMDS.
As input it expects a community matrix with the sites as rows and the species as columns.
Often ordinations using presence-absence data yield better results (in terms of explained variance) though the price is, of course, a less informative input matrix (see also exercises).
`decostand()` converts numerical observations into presences and absences with 1 indicating the occurrence of a species and 0 the absence of a species.
Ordination techniques such as NMDS require at least one observation per site.
Hence, we need to dismiss all sites in which no species were found.
```{r}
# presence-absence matrix
pa = decostand(comm, "pa") # 100 rows (sites), 69 columns (species)
# keep only sites in which at least one species was found
pa = pa[rowSums(pa) != 0, ] # 84 rows, 69 columns
```
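For readers who want to see what this conversion does, here is a base R sketch on a hypothetical cover matrix (`cover`, three sites and three species, all values made up). It mimics the presence-absence conversion and the removal of empty sites, but it is not how `decostand()` is implemented.

```{r, eval=FALSE}
# hypothetical cover values (%) for 3 sites and 3 species
cover = matrix(c(12,  0,   3,
                  0,  0,   0,
                  0, 40, 110),
               nrow = 3, byrow = TRUE,
               dimnames = list(paste0("site", 1:3), paste0("sp", 1:3)))
# any cover > 0 counts as a presence (1), everything else as an absence (0)
pa_toy = (cover > 0) * 1
# drop sites without any recorded species
pa_toy = pa_toy[rowSums(pa_toy) != 0, , drop = FALSE]
pa_toy
```

Here `site2` is dropped because no species was recorded there, just as 16 of the 100 sites are dropped from `pa` above.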
The resulting output matrix serves as input for the NMDS.
`k` specifies the number of output axes, here, set to 4.^[One way of choosing `k` is to try `k` values between 1 and 6 and then use the result which yields the best stress value [@mccune_analysis_2002].]
NMDS is an iterative procedure trying to make the ordinated space more similar to the input matrix in each step.
To make sure that the algorithm converges, we set the number of random starts to 500 (`try` parameter).
```{r, eval=FALSE, message=FALSE}
set.seed(25072018)
nmds = metaMDS(comm = pa, k = 4, try = 500)
nmds$stress
#> ...
#> Run 498 stress 0.08834745
#> ... Procrustes: rmse 0.004100446 max resid 0.03041186
#> Run 499 stress 0.08874805
#> ... Procrustes: rmse 0.01822361 max resid 0.08054538
#> Run 500 stress 0.08863627
#> ... Procrustes: rmse 0.01421176 max resid 0.04985418
#> *** Solution reached
#> 0.08831395
```
```{r, eval=FALSE, echo=FALSE}
saveRDS(nmds, "extdata/14-nmds.rds")
```
```{r, include=FALSE}
nmds = readRDS("extdata/14-nmds.rds")
```
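A stress value of 9 represents a very good result (`metaMDS()` reports stress as a proportion, here 0.088, whereas the thresholds above are given in percent), which means that the reduced ordination space represents the large majority of the variance of the input matrix.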
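How stress changes with the number of axes can be explored with a short loop. The sketch below uses **vegan**'s built-in `dune` dataset rather than our community matrix, and the range of `k` and the number of random starts are arbitrary choices made to keep the run time short.

```{r, eval=FALSE}
library(vegan)
# example community matrix shipped with vegan (not the Mt. Mongón data)
data(dune)
set.seed(1)
# run one NMDS per candidate number of axes and collect the stress values
stress = sapply(2:5, function(k) {
  metaMDS(dune, k = k, try = 20, trace = FALSE)$stress
})
# convert proportions into percent for comparison with the thresholds above
round(stress * 100, 1)
```

Since stress generally decreases as axes are added, one would typically look for the point beyond which additional axes yield only minor improvements.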
Overall, NMDS puts objects that are more similar (in terms of species composition) closer together in ordination space.
However, as opposed to most other ordination techniques, the axes are arbitrary and not necessarily ordered by importance [@borcard_numerical_2011].
Fortunately, we already know that humidity represents the main gradient in the study area [@muenchow_predictive_2013;@muenchow_rqgis:_2017].
Since humidity is highly correlated with elevation, we rotate the NMDS in accordance with elevation (see also `?MDSrotate` for more details on rotating NMDS axes).
Plotting the result reveals that the first axis is, as intended, clearly associated with altitude (Figure \@ref(fig:xy-nmds)).
```{r xy-nmds, fig.cap="Plotting the first NMDS axis against altitude.", fig.scap = "First NMDS axis against altitude plot.", fig.asp=1, out.width="60%"}
elev = dplyr::filter(random_points, id %in% rownames(pa)) %>%
  dplyr::pull(dem)
# rotating NMDS in accordance with altitude (proxy for humidity)
rotnmds = MDSrotate(nmds, elev)
# extracting the first two axes
sc = scores(rotnmds, choices = 1:2)
# plotting the first axis against altitude
plot(y = sc[, 1], x = elev, xlab = "elevation in m",
     ylab = "First NMDS axis", cex.lab = 0.8, cex.axis = 0.8)
```
```{r, eval=FALSE, echo=FALSE}
# scores and rotated scores in one figure
p1 = xyplot(scores(rotnmds)[, 2] ~ scores(rotnmds)[, 1], pch = 16,
col = "lightblue", xlim = c(-3, 2), ylim = c(-2, 2),
xlab = list("Dimension 1", cex = 0.8),
ylab = list("Dimension 2", cex = 0.8),
scales = list(x = list(relation = "same", cex = 0.8),
y = list(relation = "same", cex = 0.8),
# ticks on top are suppressed
tck = c(1, 0),
# plots axes labels only in row and column 1 and 4
alternating = c(1, 0, 0, 1),
draw = TRUE),
# we have to use the same colors in the legend as used for the plot
# points
par.settings = simpleTheme(col = c("lightblue", "salmon"),
pch = 16, cex = 0.9),
# also the legend point size should be somewhat smaller
auto.key = list(x = 0.7, y = 0.9, text = c("unrotated", "rotated"),
between = 0.5, cex = 0.9),
panel = function(x, y, ...) {
# Plot the points
panel.points(x, y, cex = 0.6, ...)
panel.points(x = scores(nmds)[, 1],
y = scores(nmds)[, 2],
col = "salmon", pch = 16, cex = 0.6)
panel.arrows(x0 = scores(nmds)[, 1],
y0 = scores(nmds)[, 2],
x1 = x,
y1 = y,
length = 0.04,
lwd = 0.4)
})
plot(scores(nmds, choices = 1:2))
points(scores(rotnmds, choices = 1:2), col = "lightblue", pch = 16)
sc = scores(nmds, choices = 1:2) %>% as.data.frame
sc$id = rownames(sc) %>% as.numeric
rp = inner_join(select(sc, id), st_set_geometry(random_points, NULL))
fit_1 = envfit(nmds, select(rp, dem))
fit_2 = envfit(rotnmds, select(rp, dem))
par(mfrow = c(1, 2))
plot(nmds, display = "sites")
plot(fit_1)
plot(rotnmds, display = "sites")
plot(fit_2)
```
The scores of the first NMDS axis represent the different vegetation formations, i.e. the floristic gradient, appearing along the slope of Mt. Mongón.
To spatially visualize them, we can model the NMDS scores with the previously created predictors (Section \@ref(data-and-data-preparation)), and use the resulting model for predictive mapping (see next section).
## Modeling the floristic gradient
To predict the floristic gradient spatially, we will make use of a random forest model [@hengl_random_2018].
Random forest models are frequently used in environmental and ecological modeling, and often provide the best results in terms of predictive performance [@schratz_performance_nodate].
Here, we briefly introduce decision trees and bagging, since they form the basis of random forests.
We refer the reader to @james_introduction_2013 for a more detailed description of random forests and related techniques.
To introduce decision trees by example, we first construct a response-predictor matrix by joining the rotated NMDS scores to the field observations (`random_points`).
We will also use the resulting data frame for the **mlr** modeling later on.
```{r}
# construct response-predictor matrix
# id- and response variable
rp = data.frame(id = as.numeric(rownames(sc)), sc = sc[, 1])
# join the predictors (dem, ndvi and terrain attributes)
rp = inner_join(random_points, rp, by = "id")
```
Decision trees split the predictor space into a number of regions.
To illustrate this, we apply a decision tree to our data using the scores of the first NMDS axis as the response (`sc`) and altitude (`dem`) as the only predictor.
```{r, eval=FALSE}
library("tree")
tree_mo = tree(sc ~ dem, data = rp)
plot(tree_mo)
text(tree_mo, pretty = 0)
```
```{r, echo=FALSE, eval=TRUE}
library("tree")
tree_mo = tree(sc ~ dem, data = rp)
```
```{r, eval=FALSE, echo=FALSE}
png("figures/14_tree.png", width = 1100, height = 700, units = "px", res = 300)
par(mar = rep(1, 4))
plot(tree_mo)
text(tree_mo, pretty = 0)
dev.off()
```
```{r tree, echo=FALSE, fig.cap="Simple example of a decision tree with three internal nodes and four terminal nodes.", fig.scap="Simple example of a decision tree."}
knitr::include_graphics("figures/14_tree.png")
```
The resulting tree consists of three internal nodes and four terminal nodes (Figure \@ref(fig:tree)).
The first internal node at the top of the tree assigns all observations which are below `r gsub("<", "", as.character(as.data.frame(tree_mo$frame$splits)[1, "cutleft"]))` m to the left and all other observations to the right branch.
The observations falling into the left branch have a mean NMDS score of `r round(tree_mo$frame$yval[2], 3)`.
Overall, we can interpret the tree as follows: the higher the elevation, the higher the NMDS score becomes.
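To give an idea of how such a split point can be found, the following sketch searches for the cut that minimizes the residual sum of squares on simulated data. All object names (`elev_sim`, `score_sim`, `rss_split()`) are hypothetical, and the exhaustive search shown here is a simplification; we do not claim it reproduces the internals of the **tree** package.

```{r, eval=FALSE}
# simulated stand-in data: the response jumps at an elevation of 600 m
set.seed(1)
elev_sim = runif(80, 200, 1000)
score_sim = ifelse(elev_sim < 600, -1, 1) + rnorm(80, sd = 0.2)
# residual sum of squares of the two groups created by a cut point
rss_split = function(cut, x, y) {
  left = y[x < cut]
  right = y[x >= cut]
  sum((left - mean(left))^2) + sum((right - mean(right))^2)
}
# candidate cut points: midpoints between neighboring predictor values
x_sorted = sort(unique(elev_sim))
cuts = (head(x_sorted, -1) + tail(x_sorted, -1)) / 2
rss = sapply(cuts, rss_split, x = elev_sim, y = score_sim)
# the best split is the cut point with the smallest residual sum of squares
best_cut = cuts[which.min(rss)]
best_cut
```

A regression tree repeats this search within each resulting branch until a stopping criterion, such as a minimum number of observations per node, is reached.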
Decision trees have a tendency to overfit, that is, they mirror the input data including its noise too closely, which in turn leads to bad predictive performance [Section \@ref(intro-cv); @james_introduction_2013].
Bootstrap aggregation (bagging) is an ensemble technique and helps to overcome this problem.
Ensemble techniques simply combine the predictions of multiple models.
Thus, bagging takes repeated samples from the same input data and averages the predictions.
This reduces the variance and overfitting with the result of a much better predictive accuracy compared to decision trees.
Finally, random forests extend and improve bagging by decorrelating trees. This is desirable since the average of highly correlated trees has a higher variance, and is thus less reliable, than the average of decorrelated trees [@james_introduction_2013].
To achieve this, random forests use bagging, but in contrast to traditional bagging, where each tree may consider all available predictors, random forests consider only a random sample of the available predictors at each split.
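Bagging itself can be sketched in a few lines. The example below is an illustration on simulated data and uses **rpart** (a recommended package shipped with R) instead of **tree**; the number of trees and all object names are arbitrary assumptions of this sketch.

```{r, eval=FALSE}
# rpart is used here instead of the tree package (assumption of this sketch)
library(rpart)
set.seed(1)
# simulated stand-in data (not the Mt. Mongón observations)
d = data.frame(x = runif(100, 0, 10))
d$y = sin(d$x) + rnorm(100, sd = 0.3)
new_d = data.frame(x = seq(0, 10, length.out = 50))
# fit one tree per bootstrap sample and predict on the same new data
n_trees = 100
preds = sapply(seq_len(n_trees), function(i) {
  boot = d[sample(nrow(d), replace = TRUE), ]
  fit = rpart(y ~ x, data = boot)
  predict(fit, newdata = new_d)
})
# bagged prediction: average over all trees (one value per new observation)
bagged = rowMeans(preds)
```

A random forest would additionally restrict each split to a random subset of the predictors, which has no effect in this one-predictor example.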
<!--
Recall that bagging is simply a special case of a random forest with m = p. Therefore, the randomForest() function can be used to perform both random forests and bagging.
The argument mtry=13 indicates that all 13 predictors should be considered
for each split of the tree—in other words, that bagging should be done.
@james_introduction_2013
-->
### **mlr** building blocks
The code in this section largely follows the steps we have introduced in Section \@ref(svm).
The only differences are:
1. The response variable is numeric, hence a regression task will replace the classification task of Section \@ref(svm).
1. Instead of the AUROC which can only be used for categorical response variables, we will use the root mean squared error (RMSE) as performance measure.
1. We use a random forest model instead of a support vector machine which naturally goes along with different hyperparameters.
1. We are leaving the assessment of a bias-reduced performance measure as an exercise to the reader (see exercises).
Instead we show how to tune hyperparameters for (spatial) predictions.
Remember that 125,500 models were necessary to retrieve bias-reduced performance estimates when using 100-repeated 5-fold spatial cross-validation and a random search of 50 iterations (see Section \@ref(svm)).
In the hyperparameter tuning level, we found the best hyperparameter combination which in turn was used in the outer performance level for predicting the test data of a specific spatial partition (see also Figure \@ref(fig:inner-outer)).
This was done for five spatial partitions, and repeated 100 times, yielding in total 500 optimal hyperparameter combinations.
Which one should we use for making spatial predictions?
The answer is simple: none at all.
Remember, the tuning was done to retrieve a bias-reduced performance estimate, not to do the best possible spatial prediction.
For the latter, one estimates the best hyperparameter combination from the complete dataset.
This means that the nested set-up is no longer needed and a single resampling level, used only for tuning, suffices. This makes sense since we are applying our model to new data (unvisited locations) for which the true outcomes are unavailable, hence testing is impossible in any case.
Therefore, we tune the hyperparameters for a good spatial prediction on the complete dataset via a 5-fold spatial CV with one repetition.
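The number of models involved in both set-ups follows from simple arithmetic. The sketch below assumes that the nested set-up used five folds in the inner tuning level as well; the variable names are only for illustration.

```{r, eval=FALSE}
# nested set-up: 100 repetitions, 5 outer folds, 5 inner folds (assumed),
# and 50 random search iterations
reps = 100
folds = 5
inner_folds = 5
iters = 50
# models fitted in the inner tuning level
tuning_models = reps * folds * inner_folds * iters
# one model per outer fold for the performance estimation
outer_models = reps * folds
# total number of models in the nested set-up
tuning_models + outer_models
# tuning-only set-up used here: one repetition of a 5-fold spatial CV
folds * iters
```

The nested set-up thus requires 125,500 models, whereas tuning for prediction needs only 250 models plus the final model fitted on the complete dataset.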
<!-- If we used more than one repetition (say 2) we would retrieve multiple optimal tuned hyperparameter combinations (say 2) -->
The preparation for the modeling using the **mlr** package includes the construction of a response-predictor matrix containing only variables which should be used in the modeling and the construction of a separate coordinate data frame.
```{r}
# extract the coordinates into a separate data frame
coords = sf::st_coordinates(rp) %>%
  as.data.frame() %>%
  rename(x = X, y = Y)
# only keep response and predictors which should be used for the modeling
rp = dplyr::select(rp, -id, -spri) %>%
  st_set_geometry(NULL)
```
Having constructed the input variables, we are all set for specifying the **mlr** building blocks (task, learner, and resampling).
We will use a regression task since the response variable is numeric.
The learner is a random forest model implementation from the **ranger** package.
```{r}
# create task
task = makeRegrTask(data = rp, target = "sc", coordinates = coords)
# learner
lrn_rf = makeLearner(cl = "regr.ranger", predict.type = "response")
```
As opposed to, for example, support vector machines (see Section \@ref(svm)), random forests often already show good performance when used with the default values of their hyperparameters (which may be one reason for their popularity).
Still, tuning often moderately improves model results, and thus is worth the effort [@probst_hyperparameters_2018].
Since we deal with geographic data, we will again make use of spatial cross-validation to tune the hyperparameters (see sections \@ref(intro-cv) and \@ref(spatial-cv-with-mlr)).
Specifically, we will use a five-fold spatial partitioning with only one repetition (`makeResampleDesc()`).
In each of these spatial partitions, we run 50 models (`makeTuneControlRandom`) to find the optimal hyperparameter combination.
```{r}
# spatial partitioning
perf_level = makeResampleDesc("SpCV", iters = 5)
# specifying random search
ctrl = makeTuneControlRandom(maxit = 50L)
```
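To convey the idea behind a spatial partitioning, the sketch below clusters simulated coordinates with k-means and uses the clusters as folds. To our understanding, **mlr**'s `"SpCV"` resampling follows a similar k-means approach, but the code is only an illustration on made-up coordinates (`xy`) and not **mlr**'s implementation.

```{r, eval=FALSE}
set.seed(1)
# simulated coordinates (hypothetical, not the sampling sites)
xy = data.frame(x = runif(84), y = runif(84))
# cluster the coordinates into five spatially compact groups
fold = kmeans(xy, centers = 5)$cluster
table(fold)
# each fold serves once as test set, the remaining four as training set
test_1 = which(fold == 1)
train_1 = which(fold != 1)
```

Because test and training observations come from different spatial clusters, they are less likely to be close neighbors than under random partitioning.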
In random forests, the hyperparameters `mtry`, `min.node.size` and `sample.fraction` determine the degree of randomness, and should be tuned [@probst_hyperparameters_2018].
`mtry` indicates how many predictor variables should be considered as split candidates at each split.
If all predictors are used, then this corresponds in fact to bagging (see beginning of Section \@ref(modeling-the-floristic-gradient)).
The `sample.fraction` parameter specifies the fraction of observations to be used in each tree.
Smaller fractions lead to greater diversity, and thus less correlated trees which often is desirable (see above).
The `min.node.size` parameter indicates the number of observations a terminal node should at least have (see also Figure \@ref(fig:tree)).
Naturally, the lower the `min.node.size`, the larger the trees and the longer the computing time.
Hyperparameter combinations will be selected randomly but should fall inside specific tuning limits (`makeParamSet()`).
`mtry` should range between 1 and the number of predictors (`r ncol(rp) - 1`), `sample.fraction` should range between 0.2 and 0.9 and `min.node.size` should range between 1 and 10.
```{r}
# specifying the search space
ps = makeParamSet(
  makeIntegerParam("mtry", lower = 1, upper = ncol(rp) - 1),
  makeNumericParam("sample.fraction", lower = 0.2, upper = 0.9),
  makeIntegerParam("min.node.size", lower = 1, upper = 10)
)
```
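What a random search draws from such a search space can be sketched in base R. The example assumes four predictors (`dem`, `ndvi`, `carea` and `cslope`) and uniform sampling within the limits; `candidates` is a hypothetical object and the code does not reproduce **mlr**'s internal sampling.

```{r, eval=FALSE}
set.seed(1)
# assumed number of predictors (dem, ndvi, carea, cslope)
n_pred = 4
# number of random search iterations
n_iter = 50
# draw one hyperparameter combination per iteration within the tuning limits
candidates = data.frame(
  mtry = sample(1:n_pred, n_iter, replace = TRUE),
  sample.fraction = runif(n_iter, min = 0.2, max = 0.9),
  min.node.size = sample(1:10, n_iter, replace = TRUE)
)
head(candidates, 3)
```

Each row corresponds to one candidate model, which is then evaluated on all spatial partitions.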
Finally, `tuneParams()` runs the hyperparameter tuning, and will find the optimal hyperparameter combination for the specified parameters.
The performance measure is the root mean squared error (RMSE).
```{r, eval=FALSE}
# hyperparameter tuning
set.seed(02082018)
tune = tuneParams(learner = lrn_rf,
                  task = task,
                  resampling = perf_level,
                  par.set = ps,
                  control = ctrl,
                  measures = mlr::rmse)
#>...
#> [Tune-x] 49: mtry=3; sample.fraction=0.533; min.node.size=5
#> [Tune-y] 49: rmse.test.rmse=0.5636692; time: 0.0 min
#> [Tune-x] 50: mtry=1; sample.fraction=0.68; min.node.size=5
#> [Tune-y] 50: rmse.test.rmse=0.6314249; time: 0.0 min
#> [Tune] Result: mtry=4; sample.fraction=0.887; min.node.size=10 :
#> rmse.test.rmse=0.5104918
```
```{r, eval=FALSE, echo=FALSE}
saveRDS(tune, "extdata/14-tune.rds")
```
```{r, echo=FALSE}
tune = readRDS("extdata/14-tune.rds")
```
An `mtry` of `r tune$x$mtry`, a `sample.fraction` of `r tune$x$sample.fraction`, and a `min.node.size` of `r tune$x$min.node.size` represent the best hyperparameter combination.
An RMSE of `r round(tune$y[attr(tune$y, "names") == "rmse.test.rmse"], 2)` is relatively good when considering the range of the response variable which is `r round(diff(range(rp$sc)), 2)` (`diff(range(rp$sc))`).
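For reference, the RMSE is the square root of the mean squared difference between observed and predicted values. The sketch below computes it for a handful of made-up values (`obs` and `pred_toy` are hypothetical) and relates it to the range of the observations.

```{r, eval=FALSE}
# hypothetical observed and predicted NMDS scores
obs = c(-1.2, -0.4, 0.1, 0.8, 1.5)
pred_toy = c(-0.9, -0.6, 0.4, 0.5, 1.1)
# root mean squared error
rmse_toy = sqrt(mean((obs - pred_toy)^2))
rmse_toy
# error relative to the range of the response
rmse_toy / diff(range(obs))
```

Relating the RMSE to the range of the response, as done in the text above, helps to judge whether an error is large or small.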
### Predictive mapping
The tuned hyperparameters can now be used for the prediction.
We simply have to modify our learner using the result of the hyperparameter tuning, and run the corresponding model.
```{r}
# learning using the best hyperparameter combination
lrn_rf = makeLearner(cl = "regr.ranger",
                     predict.type = "response",
                     mtry = tune$x$mtry,
                     sample.fraction = tune$x$sample.fraction,
                     min.node.size = tune$x$min.node.size)
# doing the same more elegantly using setHyperPars()
# lrn_rf = setHyperPars(makeLearner("regr.ranger", predict.type = "response"),
#                       par.vals = tune$x)
# train model
model_rf = train(lrn_rf, task)
# to retrieve the ranger output, run:
# mlr::getLearnerModel(model_rf)
# which corresponds to:
# ranger(sc ~ ., data = rp,
#        mtry = tune$x$mtry,
#        sample.fraction = tune$x$sample.fraction,
#        min.node.size = tune$x$min.node.size)
```
The last step is to apply the model to the spatially available predictors, i.e. to the raster stack.
So far, `raster::predict()` does not support the output of **ranger** models; hence, we have to program the prediction ourselves.
First, we convert `ep` into a prediction data frame.
Second, this data frame serves as input for the `predict.ranger()` function.
Third, we put the predicted values back into a `RasterLayer` (see Section \@ref(raster-subsetting) and Figure \@ref(fig:rf-pred)).
```{r}
# convert raster stack into a data frame
new_data = as.data.frame(as.matrix(ep))
# apply the model to the data frame
pred_rf = predict(model_rf, newdata = new_data)
# put the predicted values into a raster
pred = dem
# replace altitudinal values by rf-prediction values
pred[] = pred_rf$data$response
```
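The assignment `pred[] = ...` relies on there being exactly one predicted value per raster cell, in cell order.
If the predictor layers contained cells without data, prediction might fail or return fewer values, depending on the model.
The following base R sketch shows one defensive pattern under clearly stated assumptions: the data are simulated, `lm()` is merely a stand-in for the random forest, and all object names (`train_df`, `grid_df`, `cell_values`, `pred_matrix`) are hypothetical.

```r
set.seed(1)
# hypothetical training data and a stand-in model (lm() instead of ranger)
train_df = data.frame(dem = runif(30, 0, 1000), ndvi = runif(30, -1, 1))
train_df$sc = 0.002 * train_df$dem + 0.5 * train_df$ndvi + rnorm(30, sd = 0.1)
fit = lm(sc ~ dem + ndvi, data = train_df)
# hypothetical cell values of a 3 x 4 raster, one row per cell
grid_df = data.frame(dem = runif(12, 0, 1000), ndvi = runif(12, -1, 1))
# two cells lack NDVI values
grid_df$ndvi[c(2, 7)] = NA
# predict only for complete cells and keep NA elsewhere
ok = complete.cases(grid_df)
cell_values = rep(NA_real_, nrow(grid_df))
cell_values[ok] = predict(fit, newdata = grid_df[ok, ])
# arrange the values row-wise, mimicking the cell order of a raster
pred_matrix = matrix(cell_values, nrow = 3, ncol = 4, byrow = TRUE)
```

Because `cell_values` has the same length and order as the cells, it could be assigned to a raster in the same way as in the chunk above.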
```{r rf-pred, echo=FALSE, fig.cap="Predictive mapping of the floristic gradient clearly revealing distinct vegetation belts.", fig.width = 10, fig.height = 10, fig.scap="Predictive mapping of the floristic gradient."}
library("latticeExtra")
library("grid")
# create a color palette
blue = rgb(0, 0, 146, maxColorValue = 255)
lightblue = rgb(0, 129, 255, maxColorValue = 255)
turquoise = rgb(0, 233, 255, maxColorValue = 255)
green = rgb(142, 255, 11, maxColorValue = 255)
yellow = rgb(245, 255, 8, maxColorValue = 255)
orange = rgb(255, 173, 0, maxColorValue = 255)
lightred = rgb(255, 67, 0, maxColorValue = 255)
red = rgb(170, 0, 0, maxColorValue = 255)
pal = colorRampPalette(c(blue, lightblue, turquoise, green, yellow,
orange, lightred, red))
# restrict the prediction to your study area
pred = mask(pred, study_area) %>%
trim
# create a hillshade
hs = hillShade(terrain(dem), terrain(dem, "aspect")) %>%
mask(., study_area)
spplot(extend(pred, 2), col.regions = pal(50), alpha.regions = 0.7,
scales = list(draw = TRUE,
tck = c(1, 0),
cex = 0.8),
colorkey = list(space = "right", width = 0.5, height = 0.5,
axis.line = list(col = "black")),
sp.layout = list(
# list("sp.points", as(random_points, "Spatial"), pch = 16,
# col = "black", cex = 0.8, first = FALSE),
list("sp.polygons", as(study_area, "Spatial"),
col = "black", first = FALSE, lwd = 3)
)
) +
latticeExtra::as.layer(spplot(hs, col.regions = gray(0:100 / 100)),
under = TRUE)
grid.text("NMDS1", x = unit(0.75, "npc"), y = unit(0.75, "npc"),
gp = gpar(cex = 0.8))
```
The predictive mapping clearly reveals distinct vegetation belts (Figure \@ref(fig:rf-pred)).
Please refer to @muenchow_soil_2013 for a detailed description of the vegetation belts on *lomas* mountains.
The blue color tones represent the so-called *Tillandsia*-belt.
*Tillandsia* is a highly adapted genus found in especially high quantities at the sandy and quite desertic foot of *lomas* mountains.
The yellow color tones refer to a herbaceous vegetation belt with a much higher plant cover compared to the *Tillandsia*-belt.
The orange colors represent the bromeliad belt, which features the highest species richness and plant cover.
It can be found directly beneath the temperature inversion (ca. 750-850 m asl) where humidity due to fog is highest.
Water availability naturally decreases above the temperature inversion, and the landscape becomes desertic again with only a few succulent species (succulent belt; red colors).
Interestingly, the spatial prediction clearly reveals that the bromeliad belt is interrupted, a finding we would not have detected without the predictive mapping.
## Conclusions
In this chapter we have ordinated the community matrix of the *lomas* Mt. Mongón with the help of an NMDS (Section \@ref(nmds)).
The first axis, representing the main floristic gradient in the study area, was modeled as a function of environmental predictors, which were partly derived through R-GIS bridges (Section \@ref(data-and-data-preparation)).
The **mlr** package provided the building blocks to spatially tune the hyperparameters `mtry`, `sample.fraction` and `min.node.size` (Section \@ref(mlr-building-blocks)).
The tuned hyperparameters served as input for the final model which in turn was applied to the environmental predictors for a spatial representation of the floristic gradient (Section \@ref(predictive-mapping)).
The result demonstrates spatially the astounding biodiversity in the middle of the desert.
Since *lomas* mountains are heavily endangered, the prediction map can serve as a basis for informed decision-making on delineating protection zones and for making the local population aware of the uniqueness found in their immediate neighborhood.
In terms of methodology, a few additional points could be addressed:
- It would be interesting to also model the second ordination axis, and to subsequently find an innovative way of visualizing jointly the modeled scores of the two axes in one prediction map.
- If we were interested in interpreting the model in an ecologically meaningful way, we should probably use (semi-)parametric models [@muenchow_predictive_2013;@zuur_mixed_2009;@zuur_beginners_2017].
However, there are at least approaches that help to interpret machine learning models such as random forests (see e.g., [https://mlr-org.github.io/interpretable-machine-learning-iml-and-mlr/](https://mlr-org.github.io/interpretable-machine-learning-iml-and-mlr/)).
- A sequential model-based optimization (SMBO) might be preferable to the random search used here for hyperparameter optimization [@probst_hyperparameters_2018].
Finally, please note that random forest and other machine-learning models are frequently used in a setting with lots of observations and many predictors, far more than used in this chapter, and where it is unclear which variables and variable interactions contribute to explaining the response.
Additionally, the relationships might be highly non-linear.
In our use case, the relationships between response and predictors are pretty clear, there is only a slight amount of non-linearity, and the number of observations and predictors is low.
Hence, it might be worth trying a linear model.
A linear model is much easier to explain and understand than a random forest model and is therefore to be preferred (law of parsimony); additionally, it is computationally less demanding (see exercises).
If the linear model cannot cope with the degree of non-linearity present in the data, one could also try a generalized additive model (GAM).
The point here is that the toolbox of a data scientist consists of more than one tool, and it is your responsibility to select the tool best suited for the task or purpose at hand.
Here, we wanted to introduce the reader to random forest modeling and how to use the corresponding results for spatial predictions.
For this purpose, a well-studied dataset with known relationships between response and predictors is appropriate.
However, this does not imply that the random forest model has returned the best result in terms of predictive performance (see exercises).
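To give a rough idea of how the predictive performance of a simpler model could be checked, the following base R sketch estimates the RMSE of a linear model with a spatially partitioned cross-validation.
It rests on several assumptions: the data are simulated, the folds are formed by k-means clustering of the coordinates (a simplified stand-in for the spatial resampling provided by **mlr**), and all names (`d`, `fold`, `rmse_fold`) are hypothetical.

```r
set.seed(1)
n = 100
# hypothetical point coordinates, one predictor and a response
d = data.frame(x = runif(n), y = runif(n))
d$dem = 1000 * d$y + rnorm(n, sd = 50)
d$sc = 0.002 * d$dem + rnorm(n, sd = 0.2)
# spatial partitioning: cluster the coordinates into five folds
fold = kmeans(d[, c("x", "y")], centers = 5)$cluster
# fit on four folds, predict the held-out fold, and compute its RMSE
rmse_fold = sapply(sort(unique(fold)), function(k) {
  fit = lm(sc ~ dem, data = d[fold != k, ])
  pred_k = predict(fit, newdata = d[fold == k, ])
  sqrt(mean((d$sc[fold == k] - pred_k)^2))
})
# mean RMSE across the spatial folds
mean(rmse_fold)
```

The same fold assignment could be reused for a second model so that both are compared on identical partitions.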
## Exercises
1. Run an NMDS using the percentage data of the community matrix.
Report the stress value and compare it to the stress value as retrieved from the NMDS using presence-absence data.
What might explain the observed difference?
1. Compute all the predictor rasters we have used in the chapter (catchment slope, catchment area), and put them into a raster stack.
Add `dem` and `ndvi` to the raster stack.
Next, compute profile and tangential curvature as additional predictor rasters and add them to the raster stack (hint: `grass7:r.slope.aspect`).
Finally, construct a response-predictor matrix.
The scores of the first NMDS axis (which were the result when using the presence-absence community matrix) rotated in accordance with elevation represent the response variable, and should be joined to `random_points` (use an inner join).
To complete the response-predictor matrix, extract the values of the environmental predictor raster stack to `random_points`.
1. Use the response-predictor matrix of the previous exercise to fit a random forest model.
Find the optimal hyperparameters and use them for making a prediction map.
1. Retrieve the bias-reduced RMSE of a random forest model using spatial cross-validation including the estimation of optimal hyperparameter combinations (random search with 50 iterations) in an inner tuning loop (see Section \@ref(svm)).
Parallelize the tuning level (see Section \@ref(svm)).
Report the mean RMSE and use a boxplot to visualize all retrieved RMSEs.
1. Retrieve the bias-reduced RMSE of a simple linear model using spatial cross-validation.
Compare the result to the result of the random forest model by making RMSE boxplots for each modeling approach.