forked from fmsabatini/sPlotOpen_Code
-
Notifications
You must be signed in to change notification settings - Fork 0
/
02_BuildDataset.Rmd
1949 lines (1725 loc) · 80.7 KB
/
02_BuildDataset.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
959
960
961
962
963
964
965
966
967
968
969
970
971
972
973
974
975
976
977
978
979
980
981
982
983
984
985
986
987
988
989
990
991
992
993
994
995
996
997
998
999
1000
---
title: "Project#02 - Build Dataset"
author: "Francesco Maria Sabatini"
date: "4/28/2020"
output:
html_document:
toc: true
theme: united
---
<center>
![](https://www.idiv.de/fileadmin/content/Files_sDiv/sDiv_Workshops_Photos_Docs/sDiv_WS_Documents_sPlot/splot-long-rgb.png "sPlot Logo")
</center>
**Timestamp:** `r date()`
**Drafted:** Francesco Maria Sabatini
**Revised:**
**Version:** 2.0
*Changes in version 2.0* - Build sPlotOpen based on three resampling runs.
<br>
This report describes how data from the sPlot database have been extracted to build an environmentally-balanced subset. Resampling of plots in the environmental space follows [Bruelheide et al. 2018 NEE](https://www.nature.com/articles/s41559-018-0699-8), and is done elsewhere.
All data custodians were contacted individually and asked for permission to make a chunk of their data open-access. Here I only reported the collated answers.
```{r results="hide", message=F, warning=F}
library(tidyverse)
library(openxlsx)
library(bib2df)
library(knitr)
library(kableExtra)
library(viridis)
library(plotbiomes)
library(colorRamps)
library(fBasics)
library(raster)
library(sp)
library(sf)
library(rgdal)
library(rnaturalearth)
library(dggridR)
# library(rgeos)
library(Taxonstand)
filter <- dplyr::filter
#save temporary files
write("TMPDIR = /data/sPlot/users/Francesco/_tmp", file=file.path(Sys.getenv('TMPDIR'), '.Renviron'))
write("R_USER = /data/sPlot/users/Francesco/_tmp", file=file.path(Sys.getenv('R_USER'), '.Renviron'))
#rasterOptions(tmpdir="/data/sPlot/users/Francesco/_tmp")
```
# Import and clean data
Import and fix sPlot data. Import and attach database-level information and [GIVD](https://www.givd.info/) codes.
```{r, cache=T, results="hide", message=F, warning=F}
load("/data/sPlot/releases/sPlot2.1/DT2_20161025.RData")
load("/data/sPlot/releases/sPlot2.1/sPlot_header_20161124.RData")
#fix header data
source("/data/sPlot/users/Francesco/_sPlot_Management/Fix.header.R")
header <- fix.header(header, exclude.sophy = F)
```
Import database level answers from custodians. This table reports whether the plots from a given dataset can be released open-access without condition (Yes); whether this is true only for a set of manually selected plots (Conditional); or cannot be used (No).
```{r}
answers <- openxlsx::read.xlsx("_management/resampling_answers.xlsx", sheet = 2)
answers <- answers %>%
mutate(`Yes/Conditional/No`=fct_recode(`Yes/Conditional/No`, No="NO", Yes="yes")) %>%
# Manually set some dataset to yes
# Rasmus Revermann and Donald Walker's acceptance is conditional,
# but depends on conditions others than the selection of plot.
mutate(`Yes/Conditional/No`=replace(`Yes/Conditional/No`,
list=GIVD.ID %in% c("NA-US-014","AF-00-009",
"AF-00-006", "00-00-003",
"00-RU-001", "EU-UA-001"),
values="Yes")) %>%
mutate(`Yes/Conditional/No`=replace(`Yes/Conditional/No`,
list=GIVD.ID %in% c("AF-00-008"),
values="Conditional"))
head(answers)
```
## Last-minute adjustments to plot selections
Mark usable plots from AF-00-008, as last-minute approved.
```{r, warning=F, message=F}
#plots usable:
tava.keywords <- paste0(c("Azagny", "Djouroutou", "GEPRENAF", "Grebo", "Kayan", "Sapo", "sobeya", "Tai-E", "Tai-R"), collapse="|")
tavaplots <- read_delim("_data/Update_TavaApes/tava_header.csv", delim="\t") %>%
dplyr::select(PlotObservationID, `Original nr in database`) %>%
filter(PlotObservationID %in% (header %>%
filter(`GIVD ID`=="AF-00-008") %>%
pull(PlotObservationID))) %>%
filter(grepl(pattern=tava.keywords, x = `Original nr in database`))
```
Add additional plots to the usable list, from Luis Cayuela
```{r}
luis.sel <- c( 26827, 27251, 27252, 27285, 27286, 27287, 27288, 27289, 27295, 27297)
```
## Mark Usable plots
Import IDs of first choice plots, i.e. plots resampled in iteration 1. Load redundant list of plots selected in runs 1-3 of resampling (first choice + reserves), with plot-level specification from dataset custodians wheter a plot is usable (i.e., can be released OA) or not.
```{r results="hide", message=F, warning=F}
load("_data/plot_sel.RData")
sel123 <- plot_data_sel[1:3]
# first choice plot IDs
#sel1 <- readr::read_csv("_data/Resampled1.csv")$x
# First choice plots + reserves
usable.plots123 <- readr::read_csv("_output/header.sel.final.csv") %>%
filter(PlotObservationID %in% unique(unlist(sel123))) %>%
mutate(first.choice=PlotObservationID %in% sel123[[1]]) %>%
mutate(Usable=ifelse(`GIVD ID` %in% (answers %>%
filter(`Yes/Conditional/No`=="Yes") %>%
pull(GIVD.ID)),
"Yes", Usable)) %>%
#Hjalmar Kuhl could only grant access to subset of data
mutate(Usable=ifelse(`GIVD ID` %in% c("AF-00-008") &
PlotObservationID %in% tavaplots$PlotObservationID, "Yes", Usable)) %>%
mutate(Usable=ifelse(`GIVD ID` %in% c("AF-00-008") &
!PlotObservationID %in% tavaplots$PlotObservationID, "No", Usable)) %>%
#Luis Cayuela additional plots
mutate(Usable=ifelse(PlotObservationID %in% luis.sel, "Yes", Usable)) %>%
distinct()
table(usable.plots123$Usable)
## Plots for which we received no authorizarion
usable.plots123 %>% filter(Usable != "Yes") %>% nrow() + 99
#99 is the number of selected plots in the dataset China_Xinjang, which has been withdrawn from sPlot
#compute summary
summary.sel.final <- usable.plots123 %>%
group_by(`GIVD ID`, Dataset, Custodian, `Deputy custodian`) %>%
### Summarize data at dataset level
summarize(N.redundant=n(),
usable=sum(Usable=="Yes"),
not.usable=sum(Usable=="No"),
unknown=sum(Usable=="Unknown"), .groups = 'drop') %>%
### total number of plots in a dataset
left_join(header %>%
group_by(`GIVD ID`) %>%
summarize(n.tot.plot=n(), .groups = 'drop'),
by="GIVD ID") %>%
### number of first choice plots
left_join(header %>%
filter(PlotObservationID %in% sel123[[1]]) %>%
group_by(`GIVD ID`) %>%
summarize(n.sel.plot=n(), .groups = 'drop'),
by="GIVD ID") %>%
mutate(share.perc=round(n.sel.plot/n.tot.plot*100),1) %>%
dplyr::select(`GIVD ID`:`Deputy custodian`, n.tot.plot:share.perc, N.redundant:unknown)
#check how many first choice plots can be used, and how many need replacement
firstchoice <- usable.plots123 %>%
mutate(Usable=forcats::fct_recode(Usable, "No" = "Unknown")) %>%
group_by(first.choice, Usable) %>%
summarize(n=n(), .groups = 'drop')
firstchoice
```
```{r, echo=F}
knitr::kable(summary.sel.final %>%
rename(`Total # plots in DB (A)`=n.tot.plot,
`First choice plots (B)`=n.sel.plot,
`Percentage B/A`=share.perc,
`# First choice + reserves (C)`=N.redundant,
`# of plots in (C) usable`=usable,
`# of plots in (C) not_usable`=not.usable,
`# of plots in (C) no_info`=unknown),
caption="Summary of first choice and reserve plots per dataset, with aggregated info on how many plots can be used (i.e., release OA)") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"),
full_width = F, position = "center")
```
Out of the `r length(sel123[[1]])` plots selected in the first run of the resampling, `r firstchoice %>% filter(first.choice==T & Usable =="Yes") %>% pull(n)` can be used. The remaining `r firstchoice %>% filter(first.choice==T & Usable =="No") %>% pull(n)` require to be replaced.
# Explore distribution of all plots in PCA space
Load PCA data
```{r}
load("_data/pca3.RData") ### PCA ordination of the world
path.sPlot <- "/data/sPlot2.0/"
load(paste(path.sPlot, "splot.world2.RData", sep="/")) ## environmental data of the world at 2.5 res
```
Assign PCA values to selected plots
```{r}
## code adapted from @lenjon's 'resampling_2d_JL.R'
plot_data <- header %>%
filter(PlotObservationID %in% unique(unlist(sel123))) %>%
dplyr::select(PlotObservationID, Longitude, Latitude) %>%
dplyr::filter(!is.na(Latitude))
#filter(PlotObservationID %in% sel123[[1]])
## transform to SpatialPointsDataFrame
CRSlonlat <- CRS("+proj=longlat +ellps=WGS84 +datum=WGS84 +no_defs +towgs84=0,0,0")
coords <- cbind(plot_data$Longitude, plot_data$Latitude)
coords <- SpatialPoints(coords, proj4string=CRSlonlat)
plot_data <- SpatialPointsDataFrame(coords, plot_data)#, proj4string=CRSlonlat)
# Create world rasters of PCA values and extract plot values by geographic intersection
# raster at half a degree resolution (cf. 30 arc minute resolution)
rgeo <- raster(nrows=360, ncols=720, xmn=-180, xmx=180, ymn=-90, ymx=90)
rgeo <- disaggregate(rgeo, fact=12) # raster at 2.5 arc minute resolution
splot.world2$cellID <- cellFromXY(rgeo, cbind(splot.world2$RAST_X, splot.world2$RAST_Y))
### create rasters from PCA
posit <- splot.world2$cellID
temp <- getValues(rgeo)
temp[posit] <- pca3$x[, 1]
PC1_r <- setValues(rgeo, temp)
temp[posit] <- pca3$x[, 2]
PC2_r <- setValues(rgeo, temp)
plot_data@data$cellID <- cellFromXY(rgeo,
cbind(plot_data@data$Longitude, plot_data@data$Latitude))
plot_data@data$pc1_val <- extract(PC1_r, coordinates(plot_data))
plot_data@data$pc2_val <- extract(PC2_r, coordinates(plot_data))
# Compute the density of environmental conditions available at the global scale across the entire bivariate (PC1-PC2) environmental space
res <- 100 # Setting the number of bins per PCA axis to 100
reco <- raster(nrows=res, ncols=res,
xmn=min(pca3$x[, 1]), xmx=max(pca3$x[, 1]),
ymn=min(pca3$x[, 2]), ymx=max(pca3$x[, 2]))
PC1_PC2_r <- rasterize(pca3$x[, 1:2], reco, fun="count")
plot_data <- plot_data@data
plot_data$pc_cellID <- cellFromXY(reco, cbind(plot_data$pc1_val, plot_data$pc2_val))
# Compute the sampling effort (number of vegetation plots) per environmental unit (cell) across the entire bivariate (PC1-PC2) environmental space
sPlot_reco <- rasterize(plot_data[, c("pc1_val", "pc2_val")], reco, fun="count")
# Put zero values for the empty cells (cf. there is no vegeteation plots available for those environmental conditions: gaps)
temp1 <- getValues(PC1_PC2_r)
temp1[!is.na(temp1)] <- 0
temp2 <- getValues(sPlot_reco)
temp2[which(temp1==0&is.na(temp2))] <- 0
sPlot_reco <- setValues(reco, temp2)
```
Plotting the number of sPlot relevés for each cell of the PC1-PC2 space
```{r, fig.width=5, fig.height=5, fig.align="center", warning=F, message=F, cache=T}
#png(filename="Sampling_effort_PC1-PC2.png", width=12, height=12, res=300, unit="cm")
par(mar=c(4, 4, 4, 1))
plot(log(sPlot_reco+1), asp=0, col=c("grey", rev(divPalette(n=99, name="RdBu"))), xlab="PC1 (cold and seasonal to hot and stable)", ylab="PC2 (dry to wet)", legend=F)
plot(log(sPlot_reco+1), asp=0, col=c("grey", rev(divPalette(n=99, name="RdBu"))),
legend.only=TRUE, legend.width=1, legend.shrink=0.75,
axis.args=list(at=seq(0, log(maxValue(sPlot_reco)+1), length.out=5),
labels=round(exp(seq(0, log(maxValue(sPlot_reco)+1), length.out=5))),
cex.axis=0.6),
legend.args=list(text="N", side=3, font=2, line=0, cex=0.8))
title(main="Number of sPlotOpen relevés \nper environmental cell (log scale)")
#dev.off()
```
# Replace plots not usable with reserves
In those cases where we do not have permission to use a plot selected in resampling run #1 [first.choice], we replace it with a reserve belonging to the same cell in the PCA space grid. Reserves correspond to plots selected in resamplings runs #2 or #3, whose use was approved by the respective custodians. ~~Additionally we considered as usable reserves ALL those plots belonging to datasets whose custodian gave us unconditional permission to use their data.~~
PCA is calculated in the environmental space defined by the 30 climatic and soil variables used in Bruelheide et al. 2018 NEE.
<br>
```{r}
pca.grids <- plot_data %>%
filter(PlotObservationID %in% header$PlotObservationID) %>%
#filter(PlotObservationID %in% unique(unlist(sel123))) %>%
#attach GIVD codes
left_join(header %>%
distinct(PlotObservationID, `GIVD ID`,Dataset),
by="PlotObservationID") %>%
dplyr::select(PlotObservationID,`GIVD ID`, Dataset,
pc_cellID, pc1_val, pc2_val) %>%
as_tibble() %>%
#Attach info on first choice, reserve and usable plots
mutate(first.choice=PlotObservationID %in% sel123[[1]]) %>%
left_join(usable.plots123 %>%
dplyr::select(PlotObservationID, Usable) %>%
mutate(Usable=Usable=="Yes"),
by="PlotObservationID") %>%
mutate(reserve= (!PlotObservationID %in% sel123[[1]]) & Usable==T) %>%
# #Consider as usable reserves ALL those plots belonging to datasets whose custodian gave us
# # unconditional permission to use the data
mutate(Usable=replace(Usable,
list= ( is.na(Usable) &
`GIVD ID` %in% (answers %>%
filter(`Yes/Conditional/No`=="Yes") %>%
pull(GIVD.ID))),
values=T)) %>%
mutate(reserve=replace(reserve,
list= (first.choice==F &
`GIVD ID` %in% (answers %>%
filter(`Yes/Conditional/No`=="Yes") %>%
pull(GIVD.ID))),
values=T)) %>%
filter(!is.na(Usable))
head(pca.grids %>%
dplyr::select(-pc1_val, -pc2_val))
```
For each non-usable first choice plot, find a reserve from the same grid cell in the PCA space.
```{r}
toreplace <- pca.grids %>%
filter(first.choice==T) %>%
filter(Usable==F)
# number of PCA cells from which the plots to replace stem
(npcacell <- toreplace %>%
distinct(pc_cellID) %>%
nrow())
# proportion of occupied cells
npcacell / (pca.grids %>% distinct(pc_cellID) %>% nrow())
toreplace <- toreplace %>%
pull(PlotObservationID)
#number of first choice plots needing replacement
length(toreplace)
set.seed(9999)
selected.reserves <- pca.grids %>%
#for each cell, calculate how many reserves would be needed, and how many reserves are available
left_join(pca.grids %>%
group_by(pc_cellID) %>%
summarize(n.first=sum(first.choice, na.rm=T),
n.first.usable=sum(first.choice*Usable, na.rm=T),
reserve.available=sum(reserve, na.rm=T), .groups = 'drop') %>%
mutate(reserve.needed=n.first-n.first.usable),
by=c("pc_cellID")) %>%
filter(reserve.needed>0)
# calculate number of plots that cannot be replaced
not_replaceable <- selected.reserves %>%
distinct(pc_cellID, .keep_all = T) %>%
filter(reserve.needed>0) %>%
mutate(missing=reserve.needed-reserve.available) %>%
mutate(missing=ifelse(missing<0, 0, missing)) %>%
filter(missing>0) %>%
pull(missing)
#number of plots that cannot be replaced
sum(not_replaceable)
#number of grid cells these plots come from
length(not_replaceable)
#distribution of number of not-replaceable plots per PC grid cell
summary(not_replaceable)
# from each cell where >0 reserves are needed, sample randomly n usable reserves,
# where n is the minimun between the number of reserves needed and reserves available
selected.reserves <- selected.reserves %>%
filter(reserve==T) %>%
group_by(pc_cellID) %>%
mutate(reserve.available=min(reserve.needed, reserve.available)) %>%
sample_n(reserve.available)
```
```{r, echo=F}
knitr::kable(selected.reserves %>%
dplyr::select(-pc1_val, -pc2_val) %>%
ungroup() %>%
arrange(pc_cellID) %>%
slice(1:20),
caption="Example of selected reserves [first 20 rows shown]") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"),
full_width = F, position = "center")
```
By selecting plots within the same cell in the PCA grid, we can only replace `r nrow(selected.reserves)`, out of the `r length(toreplace)` non-usable first choice plots.
# Build sPlot OA dataset
## Header
```{r}
header.oa <- header %>%
filter(PlotObservationID %in% (usable.plots123 %>%
#filter(first.choice==T) %>%
filter(Usable=="Yes") %>%
pull(PlotObservationID)) #|
# PlotObservationID %in% (selected.reserves %>%
# pull(PlotObservationID))
) %>%
left_join(pca.grids %>%
dplyr::select(PlotObservationID, SoilClim_PC1=pc1_val, SoilClim_PC2=pc2_val, pc_cellID),
by="PlotObservationID") %>%
mutate(Resample_1=PlotObservationID %in% sel123[[1]]) %>%
mutate(Resample_2=PlotObservationID %in% sel123[[2]]) %>%
mutate(Resample_3=PlotObservationID %in% sel123[[3]]) %>%
mutate(Resample_1_consensus=ifelse(PlotObservationID %in% selected.reserves$PlotObservationID,
TRUE,
Resample_1))
```
After merging first choice plots and the corresponding reserves, the database contains `r nrow(header.oa)` unique plots, stemming from `r header.oa %>% distinct("GIVD ID") %>% nrow()` databases.
```{r}
# calculate share of unique plots for each resampling, and plots shared across resamplings
header.oa %>%
group_by(Resample_1,Resample_2, Resample_3) %>%
summarize(n=n())
```
Number of plots for which data on mosses and lichens is available
```{r, warning=F}
header.oa %>%
mutate_at(.vars=vars(`Mosses identified (y/n)`, `Lichens identified (y/n)`),
.funs=~factor(.)) %>%
mutate_at(.vars=vars(`Mosses identified (y/n)`, `Lichens identified (y/n)`),
.funs=~forcats::fct_recode(.,
"NA" = ".",
"TRUE" = "1" ,
"TRUE" = "y" ,
"TRUE" = "Y" ,
"TRUE" = "J" ,
"FALSE" = "N",
"FALSE" = "n")) %>%
summarize(n.mosses=sum(`Mosses identified (y/n)`=="TRUE", na.rm=T),
n.lichens=sum(`Lichens identified (y/n)`=="TRUE", na.rm=T))
```
Data preparation: adjust header data, select relevant variables, reformat variables into the right classes, correct macroscopic errors.
```{r, warning=F}
header.oa <- header.oa %>%
# MEMO - releve area of SA-BR-002 is always 1000
#TO ADD
#reformat and rename
mutate_at(.vars=vars(`Altitude (m)`, `Aspect (°)`, `Slope (°)`),
~as.numeric(.)) %>%
mutate_at(.vars=vars(ESY),
~as.character(.)) %>%
mutate_at(.vars=vars(Forest:Wetland),
~as.logical(.)) %>%
mutate_at(.vars=vars(`Herbs identified (y/n)`, `Mosses identified (y/n)`, `Lichens identified (y/n)`),
~ifelse(.=="Y", T, F)) %>%
mutate(`Date of recording`=ifelse(`Date of recording`=="1-1-101", NA, `Date of recording`)) %>%
mutate(`Date of recording`=as.Date(`Date of recording`, "%d-%m-%Y") ) %>%
mutate(CONTINENT=factor(CONTINENT, exclude = " ")) %>%
mutate(`Plants recorded`=forcats::fct_explicit_na(f = `Plants recorded`, "Not specified")) %>%
mutate(`Plants recorded`=forcats::fct_recode(`Plants recorded`,
"Not specified" = "#N/A",
"All vascular plants"="Complete vegetation",
"All vascular plants"="all vascular plants",
"All vascular plants"="complete",
"All vascular plants"="Complete vegetation (including non-terricolous tax",
"All vascular plants"="Vascular plants",
"All vascular plants"="All vascular plants and dominant cryptogams",
"All woody plants"="Woody plants",
"All woody plants"="All woody species",
"Woody plants >= 10 cm dbh"= "trees>=10cm dbh",
"Woody plants >= 10 cm dbh"= "Woody plants >= 10 cm dbh and domin",
"All trees & dominant understory"="All trees & dominant shrubs",
"Woody plants >= 5 cm dbh"="Woody plants >= 5 cm dbh & dominant",
"Woody plants >= 1 cm dbh" = "Plants >= 1 cm dbh",
"Only dominant species"="Dominant vascular plants",
"Woody plants >= 1 m height"="trees and shrubs >1 m height"
)) %>%
mutate(Biome = fct_recode(Biome, "Subtropics with year-round rain"="Subtrop. with year-round rain")) %>%
#reorder levels of `Plants recorded`
mutate(`Plants recorded`=factor(`Plants recorded`,
levels=c('All vascular plants',
'All trees & dominant understory',
'Dominant trees',
'Only dominant species',
'Dominant woody plants >= 2.5 cm dbh',
'All woody plants',
'Woody plants >= 1 cm dbh',
'Woody plants >= 2.5 cm dbh',
'Woody plants >= 5 cm dbh',
'Woody plants >= 10 cm dbh',
'Woody plants >= 20 cm dbh',
'Woody plants >= 1 m height',
'Not specified'))) %>%
##correct mistakes
mutate(`Altitude (m)`=ifelse(`Altitude (m)`< -100, NA, `Altitude (m)`)) %>%
#plots from Veg_bank seem to have a mix of feet and meter in Altitude
mutate(`Altitude (m)`=ifelse(`GIVD ID`=="NA-US-002", NA, `Altitude (m)`)) %>%
#constrain Aspect between 1 and 360
mutate(`Aspect (°)`=ifelse(`Aspect (°)`<1, 360-`Aspect (°)`, `Aspect (°)`)) %>%
mutate(`Slope (°)`=ifelse(`Slope (°)`<0, NA, `Slope (°)`)) %>%
mutate(`Slope (°)`=ifelse(`Slope (°)`>90, NA, `Slope (°)`)) %>%
mutate_at(.vars=vars(starts_with("Height") & contains("shrubs")),
~ifelse(.>=10|.<0, NA, .)) %>%
mutate(`Relevé area (m²)`=ifelse(`Relevé area (m²)`<0, NA, `Relevé area (m²)`)) %>%
mutate(`Cover bare soil (%)`=ifelse(`Cover bare soil (%)`<0, NA, `Cover bare soil (%)`)) %>%
mutate(`Date of recording`=replace(`Date of recording`,
list=`Date of recording`> as.Date('2016-01-01'),
NA)) %>%
# round SoilClim PCA
mutate_at(.vars=vars(SoilClim_PC1, SoilClim_PC2),
.funs=list(~round(., 3))) %>%
# Rename fields
dplyr::select(
#metadata + location
PlotObservationID,
GIVD_ID = "GIVD ID",
Dataset,
Continent = CONTINENT,
Country,
Biome,
Date_of_recording = "Date of recording",
Latitude,
Longitude,
Location_uncertainty = "Location uncertainty (m)", #POINT_X, POINT_Y,
#sampling design
Releve_area = "Relevé area (m²)",
#Herbs_identified = "Herbs identified (y/n)",
#Mosses_identified="Mosses identified (y/n)",
#"Lichens identified (y/n)",
Plant_recorded = "Plants recorded",
#topography
Elevation = "Altitude (m)",
Aspect = "Aspect (°)",
Slope = "Slope (°)",
#vegetation type
is_forest = "is.forest",
is_nonforest = "is.non.forest",
ESY,
Naturalness,
Forest,
Shrubland,
Grassland,
Sparse_vegetation = "Sparse.vegetation",
Wetland,
#vegetation structure
Cover_total = "Cover total (%)",
Cover_tree_layer = "Cover tree layer (%)",
Cover_shrub_layer = "Cover shrub layer (%)",
Cover_herb_layer = "Cover herb layer (%)",
Cover_moss_layer = "Cover moss layer (%)",
Cover_lichen_layer ="Cover lichen layer (%)",
Cover_algae_layer = "Cover algae layer (%)",
Cover_litter_layer = "Cover litter layer (%)",
Cover_bare_rocks = "Cover bare rock (%)",
Cover_cryptogams = "Cover cryptogams (%)",
Cover_bare_soil = "Cover bare soil (%)",
Height_trees_highest = "Height (highest) trees (m)",
Height_trees_lowest = "Height lowest trees (m)",
Height_shrubs_highest = "Height (highest) shrubs (m)",
Height_shrubs_lowest = "Height lowest shrubs (m)",
Height_herbs_average = "Aver. height (high) herbs (cm)",
Height_herbs_lowest = "Aver. height lowest herbs (cm)",
Height_herbs_highest = "Maximum height herbs (cm)",
#environment PCA
SoilClim_PC1,
SoilClim_PC2,
#Resampling
Resample_1,
Resample_2,
Resample_3,
Resample_1_consensus)
```
The location of some RAINFOR plots is sensitive. I reduce the precision of their spatial coordinates
```{r}
header.oa <- header.oa %>%
mutate(Latitude=ifelse(GIVD_ID=="00-00-001",
round(Latitude, 2),
Latitude)) %>%
mutate(Longitude=ifelse(GIVD_ID=="00-00-001",
round(Longitude, 2),
Longitude)) %>%
mutate(Location_uncertainty=ifelse(GIVD_ID=="00-00-001",
1000,
Location_uncertainty))
```
### Formations
For those plots being classified based on the EUNIS codes, we used a cross-link table to use EUNIS to assign vegetation types and naturalness, but only when these columns are empty.
```{r}
eunis.key <- xlsx::read.xlsx("/data/sPlot/users/Francesco/sPlot3/_input/EUNIS_WFT.xlsx",
sheetIndex = "Sheet1", endRow = 246) %>%
dplyr::select(EUNIS_code, NATURALNESS:SPARSE_VEG) %>%
mutate(EUNIS_code=as.character(EUNIS_code)) %>%
rename(ESY=EUNIS_code,
Naturalness=NATURALNESS,
Forest=FOREST,
Shrubland=SCRUBLAND,
Grassland=GRASSLAND,
Wetland=WETLAND,
Sparse_vegetation=SPARSE_VEG)#,
header.oa <- header.oa %>% # header.backup %>%
mutate(ESY=as.character(ESY)) %>%
#mutate(ESY=ifelse(ESY=="?", NA, ESY)) %>%
# Systematically assign some databases to forest
mutate(Forest=ifelse(Dataset %in%
c("Turkey Oak_Forest Database",
"Turkey Forest Database",
"Chile_forest", "Ethiopia"),
T, Forest)) %>%
#fill up with F those rows where at least one column on formation is assigned
rowwise() %>%
mutate(Any=any(Forest, Shrubland, Grassland, Wetland, Sparse_vegetation)) %>%
mutate(Forest=ifelse( (is.na(Forest) & Any), F, Forest)) %>%
mutate(Shrubland=ifelse( (is.na(Shrubland) & Any), F, Shrubland)) %>%
mutate(Grassland=ifelse( (is.na(Grassland) & Any), F, Grassland)) %>%
mutate(Wetland=ifelse( (is.na(Wetland) & Any), F, Wetland)) %>%
mutate(Sparse_vegetation=ifelse( (is.na(Sparse_vegetation) & Any), F, Sparse_vegetation)) %>%
ungroup() %>%
dplyr::select(-Any) %>%
##join and coalesce with eunis.key
left_join(eunis.key %>%
distinct(), by = "ESY") %>%
mutate(
Forest = dplyr:::coalesce(Forest.x, Forest.y),
Shrubland = coalesce(Shrubland.x, Shrubland.y),
Grassland = coalesce(Grassland.x, Grassland.y),
Wetland = coalesce(Wetland.x, Wetland.y),
Sparse_vegetation = coalesce(Sparse_vegetation.x, Sparse_vegetation.y),
Naturalness = coalesce(Naturalness.x, Naturalness.y)
) %>%
dplyr::select(-ends_with(".x"), -ends_with(".y")) %>%
#transform naturalness to ordered factor
mutate(Naturalness=factor(Naturalness,
levels=c(1,2),
labels=c("Natural", "Semi-natural"),
ordered = T)) %>%
relocate(Forest:Sparse_vegetation, .after=ESY) %>%
relocate(Naturalness, .after=ESY)
```
Fix `is_forest` and `is_nonforest` based on vegetation type. Make the fields consistent with each other.
```{r}
header.oa <- header.oa %>%
# If a plot has Forest ==1 and all other veg types==0, force is_forest to TRUE and is_nonforest to F
mutate(is_forest=replace(is_forest,
list=(Forest==T & Grassland==F & Shrubland==F & Wetland==F & Sparse_vegetation==F),
values=T)) %>%
mutate(is_nonforest=replace(is_nonforest,
list=(Forest==T & Grassland==F & Shrubland==F & Wetland==F & Sparse_vegetation==F),
values=F)) %>%
# If a plot has Forest ==0 and any other veg types==1, force is_forest to F and is_nonforest to T
mutate(is_forest=replace(is_forest,
list=(Forest==F & (Grassland==T | Shrubland==T | Wetland==T | Sparse_vegetation==T)),
values=F)) %>%
mutate(is_nonforest=replace(is_nonforest,
list=(Forest==F & (Grassland==T | Shrubland==T | Wetland==T | Sparse_vegetation==T)),
values=T)) %>%
## fill up NAs when either is_forest or is_nonforest is not NA
## note that if a plot is marked as is_forest == T, the plot is assigned to is_nonforest =F
## BUT is the plots is marked as is_nonforest = T, the opposite is not done (conservative choice)
mutate(is_forest=replace(is_forest,
list=is.na(is_forest) & is_nonforest==T,
values=F)) %>%
mutate(is_nonforest=replace(is_nonforest,
list=is.na(is_nonforest) & is_forest==F,
values=T)) %>%
## assign replace double FALSE to NA
mutate(is_forest=replace(is_forest,
list= ((is.na(is_forest) | is_forest==F) & (is.na(is_nonforest) | is_nonforest==F)),
values=NA)) %>%
mutate(is_nonforest=replace(is_nonforest,
list= ((is.na(is_forest) | is_forest==F) & (is.na(is_nonforest) | is_nonforest==F)),
values=NA))
## Correct known misassigne plots
indre <- c(977877:978502, 981362:981969) #as requested by Adrian Indreica
header.oa <- header.oa %>%
mutate(is_forest=ifelse(PlotObservationID %in% indre,
T,
is_forest))
#double check
header.oa %>%
group_by(is_forest, is_nonforest) %>%
summarize(n=n())
```
### Complete missing values, when possible
There are 75 entries without continent info
```{r}
header.oa <- header.oa %>%
mutate(Continent=as.character(Continent)) %>%
mutate(Continent=ifelse(is.na(Continent) & Country %in% c("Bulgaria", "Denmark", "Greece", "Iceland",
"Italy", "Norway", "Svalbard and Jan Mayen Is",
"Sweden", "United Kingdom", "France", "Spain"),
"Europe", Continent)) %>%
mutate(Continent=ifelse(is.na(Continent) & Country=="Australia",
"Australia", Continent)) %>%
mutate(Continent=ifelse(is.na(Continent) & Country %in% c("Chile", "Colombia"),
"South America", Continent)) %>%
mutate(Continent=ifelse(is.na(Continent) & Country %in% c("United States", "Canada", "Greenland"),
"North America", Continent)) %>%
# correct a couple of mistakes
mutate(Continent=ifelse(Continent=="Australia", "Oceania", Continent)) %>%
mutate(Continent=ifelse(Country=="Papua New Guinea", "Oceania", Continent)) %>%
mutate(Continent=as.factor(Continent)) %>%
# tranform Forest:Sparse veg to T/F
mutate_at(.vars=vars(Forest:Sparse_vegetation),
.funs = ~as.logical(.) )
```
The field `is_nonforest` is now redundant. Drop it.
```{r}
header.oa <- header.oa %>%
dplyr::select(-is_nonforest)
```
```{r}
## distribution of plot sizes:
cut(header.oa$Releve_area, breaks=c(0,10,100,1000, Inf),
labels=c("<10", "10-100", "100-1000", ">=10000")) %>% table()
```
### Show Output
```{r, echo=F}
knitr::kable(header.oa %>%
sample_n(20),
caption="Example of header.oa [20 randomly selected plots shown]") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"),
full_width = F, position = "center")
```
## Global reference list
sPlot stems from the work of thousands of vegetation scientists. Much of this work has already been published. Here, we try to create a list of all relevant references, which can be used to refer to when providing information on the plots or datasets contained in sPlotOpen. This reference list if formatted according to .BibTex standards.
### Import and parse plot level info
Plot level info imported from Turboveg. Before importing, need to seek and replace all quotation and double quotation marks, and escape them. Done via LINUX console
```{bash, engine.opts='-l', eval=F}
#not sure it works from markdown, works from console,though
sed "s/'/\\'/g" _data/PlotLevelInfo/TV3_PlotLevelInfo_Export_123.csv > _data/PlotLevelInfo/TV3_PlotLevelInfo_Export_test.csv
sed 's/"/\\"/g' _data/PlotLevelInfo/TV3_PlotLevelInfo_Export_test.csv > _data/PlotLevelInfo/TV3_PlotLevelInfo_Export_test2.csv
```
```{r}
plotinfo.raw <- read_delim("_data/PlotLevelInfo/TV3_PlotLevelInfo_Export_test2.csv", delim="\t",
col_types = cols(
.default = col_character(),
PlotObservationID = col_double(),
PlotID = col_double(),
Country = col_character(),
`Nr. table in publ.` = col_character(),
`Nr. relevé in table` = col_character(),
Author = col_character(),
Remarks = col_character(),
`Original nr in database` = col_character(),
Collection = col_character(),
Dataset = col_character(),
SURVEY = col_character(),
Longitude = col_double(),
Latitude = col_double(),
`Location uncertainty (m)` = col_double(),
Dataset_1 = col_character(),
GUID = col_character(),
DB_OWNER = col_character(),
ORIG_DB = col_character()
)) %>%
#drop empty field
dplyr::select(where(~ !(all(is.na(.)) | all(. == ""))))
```
### Dataset-level biblio-references
Import dataset-level BibTex reference list and database-level information
```{r, message=F}
bib.db <- bib2df("/data/sPlot/users/Francesco/_sPlot_Management/Consortium/sPlot_References.bib")
databases <- read_csv("/data/sPlot/users/Francesco/_sPlot_Management/Consortium/Databases.out.csv")
plotinfo.dbref <- header.oa %>%
dplyr::select(PlotObservationID, GIVD_ID) %>%
left_join(databases %>%
dplyr::select(GIVD_ID=`GIVD ID`, DB_BIBTEXKEY=BIBTEXKEY) %>%
distinct(),
by="GIVD_ID")
dim(plotinfo.dbref)
```
### Plot-level biblio-references
Data from Turobevg come with a dictionary of references. These references, however, are not formally formatted, but are simple strings of text.
```{r, message=F}
#Import biblioreference dictionary from TurboVeg3
Biblioref.raw <- read_delim("_data/PlotLevelInfo/BiblioReference_v2.txt",
delim="\t", col_names = c("PlotObservationID", "Fullref"))
Biblioref.raw_123 <- read_delim("_data/PlotLevelInfo/BiblioReference_v123_additional.txt",
delim="\t", col_names = c("PlotObservationID", "Fullref"))
```
These bibliographic references are then parsed using the library [anystyle](https://github.com/inukshuk/anystyle), in ruby. Yet, the output needs some additional cleaning first.
Below some code which might help for the scope, which might benefit from further refinements.
Do some string modification before parsing with anystyle. Need to convert all words being completely upper case to lower case, with first letter uppercase.
```{r}
.simpleCap <- function(x) {
s <- strsplit(x, "-")[[1]]
s <- tolower(s)
paste(toupper(substring(s, 1, 1)), substring(s, 2),
sep = "", collapse = "-")
}
br1 <- Biblioref.raw %>%
#bind_rows(Biblioref.raw_123)
distinct(Fullref) %>%
arrange(Fullref)
for(i in 1:nrow(br1)){
tmp <- str_split(br1[i,], pattern = " ")[[1]]
tochange <- str_detect(tmp, "^[:upper:]+$|^[:upper:]+,$|^[:upper:]+-[:upper:]+$") & str_count(tmp, pattern="[A-Za-z]|-")>1 #doesn't match non ASCI letters, though
if(sum(tochange)>0){
tmp[tochange] <- sapply(tmp[tochange], .simpleCap)
br1[i,] <- paste(tmp, collapse=" ")
}
}
#split in chunks with 300 refs each
nchunks <- ceiling(nrow(br1)/300)
iii <- 1:nrow(br1)
splitted <- split(iii, sort(iii%%nchunks))
```
```{r, eval=F}
##clean up output folder first
filenames <- paste0("_data/PlotLevelInfo/Ref_to_parse_", 1:nchunks, ".txt")
if(any(file.exists(filenames))){
#Delete file if it exists
file.remove(filenames)
}
## sink references to format into .txt files in batches of 300
for(i in 1:nchunks){
tmp <- br1$Fullref[splitted[[i]]]
write_lines(tmp, paste0("_data/PlotLevelInfo/Ref_to_parse_", i, ".txt"), )
}
```
These references were submitted to the [anystyle](https://github.com/inukshuk/anystyle)'s web interface. Output was exported to bibTex.
Reimport and match.
```{r, message=F, warning=F}
filenames <- paste0("_data/PlotLevelInfo/anystyle_", 1:11, ".bib")
bib.list <- lapply(filenames, bib2df)
bib.df <- bind_rows(bib.list) %>%
select_if(function(x) !(all(is.na(x)) | all(x=="")))
br1.out <- bind_cols(Biblioref.raw %>%
distinct(Fullref) %>%
arrange(Fullref), bib.df)
filenames <- paste0("_data/PlotLevelInfo/anystyle_123_", 1:5, ".bib")
bib.list <- lapply(filenames, bib2df)
bib.df <- bind_rows(bib.list) %>%
select_if(function(x) !(all(is.na(x)) | all(x=="")))
br1.out.123 <- bind_cols(Biblioref.raw_123 %>%
distinct(Fullref) %>%
arrange(Fullref), bib.df)
```
Parse additional references not stored in TURBOVEG3's dictionary
```{r}
br2 <- plotinfo.raw %>%
dplyr::select(`Biblio reference`) %>%
distinct(`Biblio reference`) %>%
arrange(`Biblio reference`) %>%
filter(!is.na(`Biblio reference`)) %>%
filter(!str_detect(`Biblio reference`, pattern ="^\\d+$"))
#split in chunks with 300 refs each
nchunks2 <- ceiling(nrow(br2)/300)
iii2 <- 1:nrow(br2)
splitted2 <- split(iii2, sort(iii2%%nchunks2))
```
Manually submit the files to [anystyle](https://github.com/inukshuk/anystyle)'s web interface. Output was exported to bibTex.
```{r, eval=F}
##clean up before saving
filenames <- paste0("_data/PlotLevelInfo/Ref2_to_parse_", 1:nchunks2, ".txt")
if(any(file.exists(filenames))){
#Delete file if it exists
file.remove(filenames)
}
for(i in 1:nchunks2){
tmp <- br2$`Biblio reference`[splitted2[[i]]]
write_lines(tmp, paste0("_data/PlotLevelInfo/Ref2_to_parse_", i, ".txt"), )
}
```
Reimport and match
```{r, message=F, warning=F}
filenames2 <- paste0("_data/PlotLevelInfo/anystyle2_", 1:nchunks2, ".bib")
bib.list <- lapply(filenames2, bib2df)
bib.df <- bind_rows(bib.list) %>%
select_if(function(x) !(all(is.na(x)) | all(x=="")))
br2.out <- bind_cols(br2, bib.df) %>%
rename(Fullref=`Biblio reference`)
```
Create a unique df with all formatted references and correct duplicated bibtexkeys
```{r}
#define helper function
rename.duplicates <- function(x){
tick <- 1
while(sum(duplicated(x)) > 0) {
#print(tick)
if (tick == 1) {x[duplicated(x)] <- paste0(x[duplicated(x)], tick, sep = "")}
if (tick > 1) {
x[duplicated(x)] <-paste0(str_sub(x[duplicated(x)], end = -2), #strip last character of string
tick, sep ="")}
tick <- tick + 1
}
return(x)
}
#fix duplicated bibtex keys
reference.oa <- bind_rows(bib.db, br1.out, br1.out.123, br2.out) %>%
distinct() %>%
filter((Fullref %in% Biblioref.raw$Fullref |
Fullref %in% Biblioref.raw_123$Fullref |
BIBTEXKEY %in% bib.db$BIBTEXKEY))
reference.oa$BIBTEXKEY <- rename.duplicates(reference.oa$BIBTEXKEY)
```
The reference list contains `r nrow(reference.oa)` parsed references.
```{ruby, eval=F, echo=F, engine.path = '~/.rubies/ruby-2.7.2/bin/ruby'}
#Check if I can use Ruby's `anystyle` library inside RMarkdown.
#Parsing is in general pretty good, yet not perfect. I leave this on standby for the time being.
puts RUBY_VERSION
require "anystyle"
File.open("_data/PlotLevelInfo/BiblioReference.clean.txt", "r") do |file_handle|
file_handle.each_line do |ref|
File.open("_data/PlotLevelInfo/output.txt", mode:"a") {|f| f.write AnyStyle.parse ref }
end
end
```
### Show Output
```{r, echo=F}
knitr::kable(reference.oa %>%
sample_n(20)%>%
select_if(function(x) !(all(is.na(x)) | all(x==""))),
caption="Example of reference.oa [20 randomly selected references, and only non-empty columns shown]") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"),
full_width = F, position = "center")
```
Clearly, there is some additional parsing needed, as well as some encoding problems (these come from the original data, though, which means that there is little we can do programmatically). Yet the results is a very good starting point. We refer to this reference list as a Beta-Version, and recommend the users to carefully check the references they need to cite, before their use.
## Metadata - Plot-level
We store a lot of plot-level metadata in TurboVeg3. Yet, these metadata are only partially standardized across datasets. Here, we try to organize this metadata information into a few meaningful fields. For each plot we provide information on:
- DB_BIBTEXTKEY - Key linking to the bibliographic reference of the dataset from which the plot stems from. Keys refer to the reference.oa object.
- Releve_author - Name of the person originally collecting the data in the field
- Releve_coauthors - Names of additional persons originally collecting the data in the field
- Plot_Biblioreference - Bibliographic reference where the plot was first published, if any
- BIBTEXTKEY - Key linking to the plot-level bibliographic reference. It refers to the reference.oa object
- Nr_table_in_publ - Number of the table reporting the plot in the publication where it was originally published, if any
- Nr_releve_in_table - Plot number in the table where the plot was originally reported
- Original_nr_in_database - Original plot number, in the database the plot stems from
- Original_plotID - Only for nested plots
- Original_subplotID - Only for nested plots. In case a plot is nested inside another
- Project - Name of the project a specific plot stems from
- Remarks - Any additional notes associated with a plot
- GUID - Unique ID generated by Turboveg
Plot-level metadata information is stored in a heterogeneous manner across the datasets participating to sPlot.
```{r}
colnames(plotinfo.raw)
```
In the following subsection, we try and harmonize the information from these multiple fields.
### Plot-level biblioreference
```{r}
#select all fields in plotinfo.raw having biblioref info
plotinfo.biblio <- plotinfo.raw %>%
dplyr::select(PlotObservationID, Country,
Biblioreference, `Biblio reference`, PUBL, THESIS) %>%
#keep only non-empty
filter_at(.vars = vars(Biblioreference:THESIS), .vars_predicate = any_vars(!is.na(.))) %>%
## attach full reference
left_join(Biblioref.raw, by="PlotObservationID") %>%
#coalesce into a unique field
mutate(Biblioreference=coalesce(Fullref, Biblioreference, `Biblio reference`, PUBL, THESIS)) %>%
dplyr::select(PlotObservationID, Biblioreference) %>%
left_join(reference.oa %>%
dplyr::select(Fullref, BIBTEXKEY), by=c("Biblioreference"="Fullref"))
dim(plotinfo.biblio)
```
Biblioreference information exists for `r plotinfo.biblio %>% filter(!is.na(Biblioreference)) %>% nrow()` plots.
### Author
Coalesce columns reporting information on author
```{r}
plotinfo.author <- plotinfo.raw %>%