---
title: "Project#02 - Build Dataset"
author: "Francesco Maria Sabatini"
date: "4/28/2020"
output:
  html_document:
    toc: true
    theme: united
---
<center>
![](https://www.idiv.de/fileadmin/content/Files_sDiv/sDiv_Workshops_Photos_Docs/sDiv_WS_Documents_sPlot/splot-long-rgb.png "sPlot Logo")
</center>
**Timestamp:** `r date()`
**Drafted:** Francesco Maria Sabatini
**Revised:**
**Version:** 2.0
*Changes in version 2.0* - Build sPlotOpen based on three resampling runs.
<br>
This report describes how data from the sPlot database have been extracted to build an environmentally-balanced subset. Resampling of plots in the environmental space follows [Bruelheide et al. 2018 NEE](https://www.nature.com/articles/s41559-018-0699-8), and is done elsewhere.
All data custodians were contacted individually and asked for permission to make a subset of their data open-access. Here, I only report the collated answers.
```{r results="hide", message=F, warning=F}
library(tidyverse)
library(openxlsx)
library(bib2df)
library(knitr)
library(kableExtra)
library(viridis)
library(plotbiomes)
library(colorRamps)
library(fBasics)
library(raster)
library(sp)
library(sf)
library(rgdal)
library(rnaturalearth)
library(dggridR)
# library(rgeos)
library(Taxonstand)
filter <- dplyr::filter
#save temporary files
write("TMPDIR = /data/sPlot/users/Francesco/_tmp", file=file.path(Sys.getenv('TMPDIR'), '.Renviron'))
write("R_USER = /data/sPlot/users/Francesco/_tmp", file=file.path(Sys.getenv('R_USER'), '.Renviron'))
#rasterOptions(tmpdir="/data/sPlot/users/Francesco/_tmp")
```
# Import and clean data
Import and fix sPlot data. Import and attach database-level information and [GIVD](https://www.givd.info/) codes.
```{r, cache=T, results="hide", message=F, warning=F}
load("~/share/groups/splot/releases/sPlot2.1/DT2_20161025.RData")
load("~/share/groups/splot/releases/sPlot2.1/sPlot_header_20161124.RData")
#fix header data
source("~/share/groups/splot/users/Francesco/_sPlot_Management/Fix.header.R")
header <- fix.header(header, exclude.sophy = F)
```
Import database-level answers from custodians. This table reports whether the plots from a given dataset can be released open-access without conditions (Yes), whether this is true only for a set of manually selected plots (Conditional), or whether the data cannot be used (No).
```{r}
answers <- openxlsx::read.xlsx("_management/resampling_answers.xlsx", sheet = 2)
answers <- answers %>%
mutate(`Yes/Conditional/No`=fct_recode(`Yes/Conditional/No`, No="NO", Yes="yes")) %>%
# Manually set some datasets to yes
# Rasmus Revermann's and Donald Walker's acceptance is conditional,
# but depends on conditions other than the selection of plots.
mutate(`Yes/Conditional/No`=replace(`Yes/Conditional/No`,
list=GIVD.ID %in% c("NA-US-014","AF-00-009",
"AF-00-006", "00-00-003",
"00-RU-001", "EU-UA-001"),
values="Yes")) %>%
mutate(`Yes/Conditional/No`=replace(`Yes/Conditional/No`,
list=GIVD.ID %in% c("AF-00-008"),
values="Conditional"))
head(answers)
```
## Last-minute adjustments to plot selections
Mark usable plots from AF-00-008, which were approved at the last minute.
```{r, warning=F, message=F}
#plots usable:
tava.keywords <- paste0(c("Azagny", "Djouroutou", "GEPRENAF", "Grebo", "Kayan", "Sapo", "sobeya", "Tai-E", "Tai-R"), collapse="|")
tavaplots <- read_delim("_data/Update_TavaApes/tava_header.csv", delim="\t") %>%
dplyr::select(PlotObservationID, `Original nr in database...101`) %>%
filter(PlotObservationID %in% (header %>%
filter(`GIVD ID`=="AF-00-008") %>%
pull(PlotObservationID))) %>%
filter(grepl(pattern=tava.keywords, x = `Original nr in database...101`))
```
Add additional plots from Luis Cayuela to the usable list.
```{r}
luis.sel <- c( 26827, 27251, 27252, 27285, 27286, 27287, 27288, 27289, 27295, 27297)
```
## Mark Usable plots
Import IDs of first-choice plots, i.e., plots resampled in iteration 1. Load the redundant list of plots selected in runs 1-3 of the resampling (first choice + reserves), with a plot-level specification from dataset custodians of whether a plot is usable (i.e., can be released OA) or not.
```{r results="hide", message=F, warning=F}
load("_data/plot_sel.RData")
sel123 <- plot_data_sel[1:3]
# first choice plot IDs
#sel1 <- readr::read_csv("_data/Resampled1.csv")$x
# First choice plots + reserves
usable.plots123 <- readr::read_csv("_output/header.sel.final.csv") %>%
filter(PlotObservationID %in% unique(unlist(sel123))) %>%
mutate(first.choice=PlotObservationID %in% sel123[[1]]) %>%
mutate(Usable=ifelse(`GIVD ID` %in% (answers %>%
filter(`Yes/Conditional/No`=="Yes") %>%
pull(GIVD.ID)),
"Yes", Usable)) %>%
#Hjalmar Kuhl could only grant access to subset of data
mutate(Usable=ifelse(`GIVD ID` %in% c("AF-00-008") &
PlotObservationID %in% tavaplots$PlotObservationID, "Yes", Usable)) %>%
mutate(Usable=ifelse(`GIVD ID` %in% c("AF-00-008") &
!PlotObservationID %in% tavaplots$PlotObservationID, "No", Usable)) %>%
#Luis Cayuela additional plots
mutate(Usable=ifelse(PlotObservationID %in% luis.sel, "Yes", Usable)) %>%
distinct()
table(usable.plots123$Usable)
## Plots for which we received no authorization
usable.plots123 %>% filter(Usable != "Yes") %>% nrow() + 99
#99 is the number of selected plots in the dataset China_Xinjang, which has been withdrawn from sPlot
#compute summary
summary.sel.final <- usable.plots123 %>%
group_by(`GIVD ID`, Dataset, Custodian, `Deputy custodian`) %>%
### Summarize data at dataset level
summarize(N.redundant=n(),
usable=sum(Usable=="Yes"),
not.usable=sum(Usable=="No"),
unknown=sum(Usable=="Unknown"), .groups = 'drop') %>%
### total number of plots in a dataset
left_join(header %>%
group_by(`GIVD ID`) %>%
summarize(n.tot.plot=n(), .groups = 'drop'),
by="GIVD ID") %>%
### number of first choice plots
left_join(header %>%
filter(PlotObservationID %in% sel123[[1]]) %>%
group_by(`GIVD ID`) %>%
summarize(n.sel.plot=n(), .groups = 'drop'),
by="GIVD ID") %>%
mutate(share.perc=round(n.sel.plot/n.tot.plot*100, 1)) %>%
dplyr::select(`GIVD ID`:`Deputy custodian`, n.tot.plot:share.perc, N.redundant:unknown)
#check how many first choice plots can be used, and how many need replacement
firstchoice <- usable.plots123 %>%
mutate(Usable=forcats::fct_recode(Usable, "No" = "Unknown")) %>%
group_by(first.choice, Usable) %>%
summarize(n=n(), .groups = 'drop')
firstchoice
```
```{r, echo=F}
knitr::kable(summary.sel.final %>%
rename(`Total # plots in DB (A)`=n.tot.plot,
`First choice plots (B)`=n.sel.plot,
`Percentage B/A`=share.perc,
`# First choice + reserves (C)`=N.redundant,
`# of plots in (C) usable`=usable,
`# of plots in (C) not_usable`=not.usable,
`# of plots in (C) no_info`=unknown),
caption="Summary of first choice and reserve plots per dataset, with aggregated info on how many plots can be used (i.e., release OA)") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"),
full_width = F, position = "center")
```
Out of the `r length(sel123[[1]])` plots selected in the first run of the resampling, `r firstchoice %>% filter(first.choice==T & Usable =="Yes") %>% pull(n)` can be used. The remaining `r firstchoice %>% filter(first.choice==T & Usable =="No") %>% pull(n)` need to be replaced.
# Explore distribution of all plots in PCA space
Load PCA data
```{r}
load("_data/pca3.RData") ### PCA ordination of the world
path.sPlot <- "~/share/groups/splot/releases/sPlot2.0/"
load(paste(path.sPlot, "splot.world2.RData", sep="/")) ## environmental data of the world at 2.5 res
```
Assign PCA values to selected plots
```{r}
## code adapted from @lenjon's 'resampling_2d_JL.R'
plot_data <- header %>%
filter(PlotObservationID %in% unique(unlist(sel123))) %>%
dplyr::select(PlotObservationID, Longitude, Latitude) %>%
dplyr::filter(!is.na(Latitude))
#filter(PlotObservationID %in% sel123[[1]])
## transform to SpatialPointsDataFrame
CRSlonlat <- CRS("+proj=longlat +ellps=WGS84 +datum=WGS84 +no_defs +towgs84=0,0,0")
coords <- cbind(plot_data$Longitude, plot_data$Latitude)
coords <- SpatialPoints(coords, proj4string=CRSlonlat)
plot_data <- SpatialPointsDataFrame(coords, plot_data)#, proj4string=CRSlonlat)
# Create world rasters of PCA values and extract plot values by geographic intersection
# raster at half a degree resolution (i.e., 30 arc-minute resolution)
rgeo <- raster(nrows=360, ncols=720, xmn=-180, xmx=180, ymn=-90, ymx=90)
rgeo <- disaggregate(rgeo, fact=12) # raster at 2.5 arc minute resolution
splot.world2$cellID <- cellFromXY(rgeo, cbind(splot.world2$RAST_X, splot.world2$RAST_Y))
### create rasters from PCA
posit <- splot.world2$cellID
temp <- getValues(rgeo)
temp[posit] <- pca3$x[, 1]
PC1_r <- setValues(rgeo, temp)
temp[posit] <- pca3$x[, 2]
PC2_r <- setValues(rgeo, temp)
plot_data@data$cellID <- cellFromXY(rgeo,
cbind(plot_data@data$Longitude, plot_data@data$Latitude))
plot_data@data$pc1_val <- extract(PC1_r, coordinates(plot_data))
plot_data@data$pc2_val <- extract(PC2_r, coordinates(plot_data))
# Compute the density of environmental conditions available at the global scale across the entire bivariate (PC1-PC2) environmental space
res <- 100 # Setting the number of bins per PCA axis to 100
reco <- raster(nrows=res, ncols=res,
xmn=min(pca3$x[, 1]), xmx=max(pca3$x[, 1]),
ymn=min(pca3$x[, 2]), ymx=max(pca3$x[, 2]))
PC1_PC2_r <- rasterize(pca3$x[, 1:2], reco, fun="count")
plot_data <- plot_data@data
plot_data$pc_cellID <- cellFromXY(reco, cbind(plot_data$pc1_val, plot_data$pc2_val))
# Compute the sampling effort (number of vegetation plots) per environmental unit (cell) across the entire bivariate (PC1-PC2) environmental space
sPlot_reco <- rasterize(plot_data[, c("pc1_val", "pc2_val")], reco, fun="count")
# Assign zero to empty cells (i.e., environmental conditions for which no vegetation plots are available: gaps)
temp1 <- getValues(PC1_PC2_r)
temp1[!is.na(temp1)] <- 0
temp2 <- getValues(sPlot_reco)
temp2[which(temp1==0&is.na(temp2))] <- 0
sPlot_reco <- setValues(reco, temp2)
```
Plotting the number of sPlot relevés for each cell of the PC1-PC2 space
```{r, fig.width=5, fig.height=5, fig.align="center", warning=F, message=F, cache=T}
#png(filename="Sampling_effort_PC1-PC2.png", width=12, height=12, res=300, unit="cm")
par(mar=c(4, 4, 4, 1))
plot(log(sPlot_reco+1), asp=0, col=c("grey", rev(divPalette(n=99, name="RdBu"))), xlab="PC1 (cold and seasonal to hot and stable)", ylab="PC2 (dry to wet)", legend=F)
plot(log(sPlot_reco+1), asp=0, col=c("grey", rev(divPalette(n=99, name="RdBu"))),
legend.only=TRUE, legend.width=1, legend.shrink=0.75,
axis.args=list(at=seq(0, log(maxValue(sPlot_reco)+1), length.out=5),
labels=round(exp(seq(0, log(maxValue(sPlot_reco)+1), length.out=5))),
cex.axis=0.6),
legend.args=list(text="N", side=3, font=2, line=0, cex=0.8))
title(main="Number of sPlotOpen relevés \nper environmental cell (log scale)")
#dev.off()
```
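The binning step above (assigning each plot to a cell of the 100 × 100 PC1-PC2 raster via `cellFromXY()`) can be illustrated on toy coordinates (hypothetical values; the real grid spans the range of `pca3$x`):

```r
library(raster)

# 100 x 100 grid spanning a toy PCA space from -5 to 5 on both axes
reco.toy <- raster(nrows = 100, ncols = 100, xmn = -5, xmx = 5, ymn = -5, ymx = 5)

# two plots with near-identical PCA scores share a cell; a third falls elsewhere
pts <- cbind(pc1 = c(0.12, 0.13, -4.85), pc2 = c(1.0, 1.0, -4.85))
cellFromXY(reco.toy, pts)
# the first two cell IDs are equal; the third differs
```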
# Replace plots not usable with reserves
In those cases where we do not have permission to use a plot selected in resampling run #1 [first.choice], we replace it with a reserve belonging to the same cell of the PCA space grid. Reserves are plots selected in resampling runs #2 or #3, whose use was approved by the respective custodians. ~~Additionally we considered as usable reserves ALL those plots belonging to datasets whose custodian gave us unconditional permission to use their data.~~
PCA is calculated in the environmental space defined by the 30 climatic and soil variables used in Bruelheide et al. 2018 NEE.
<br>
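As a minimal sketch of this replacement logic on toy data (hypothetical plot IDs; the actual run below operates on `pca.grids` and samples as many reserves per cell as are needed and available):

```r
library(dplyr)

# Toy data: three plots in PCA cell 10, one in cell 20;
# plot 2 is a non-usable first-choice plot
toy <- tibble(
  PlotObservationID = 1:4,
  pc_cellID    = c(10, 10, 10, 20),
  first.choice = c(TRUE, TRUE, FALSE, TRUE),
  Usable       = c(TRUE, FALSE, TRUE, TRUE)
)

# A reserve is a usable plot in the same PCA cell that was not a first choice
replacements <- toy %>%
  group_by(pc_cellID) %>%
  mutate(reserve.needed = sum(first.choice & !Usable)) %>%
  filter(!first.choice & Usable & reserve.needed > 0) %>%
  slice_sample(n = 1) %>%   # draw reserves at random (toy case: one is enough)
  ungroup()

replacements$PlotObservationID  # plot 3 replaces the non-usable plot 2
```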
```{r}
pca.grids <- plot_data %>%
filter(PlotObservationID %in% header$PlotObservationID) %>%
#filter(PlotObservationID %in% unique(unlist(sel123))) %>%
#attach GIVD codes
left_join(header %>%
distinct(PlotObservationID, `GIVD ID`,Dataset),
by="PlotObservationID") %>%
dplyr::select(PlotObservationID,`GIVD ID`, Dataset,
pc_cellID, pc1_val, pc2_val) %>%
as_tibble() %>%
#Attach info on first choice, reserve and usable plots
mutate(first.choice=PlotObservationID %in% sel123[[1]]) %>%
left_join(usable.plots123 %>%
dplyr::select(PlotObservationID, Usable) %>%
mutate(Usable=Usable=="Yes"),
by="PlotObservationID") %>%
mutate(reserve= (!PlotObservationID %in% sel123[[1]]) & Usable==T) %>%
# #Consider as usable reserves ALL those plots belonging to datasets whose custodian gave us
# # unconditional permission to use the data
mutate(Usable=replace(Usable,
list= ( is.na(Usable) &
`GIVD ID` %in% (answers %>%
filter(`Yes/Conditional/No`=="Yes") %>%
pull(GIVD.ID))),
values=T)) %>%
mutate(reserve=replace(reserve,
list= (first.choice==F &
`GIVD ID` %in% (answers %>%
filter(`Yes/Conditional/No`=="Yes") %>%
pull(GIVD.ID))),
values=T)) %>%
filter(!is.na(Usable))
head(pca.grids %>%
dplyr::select(-pc1_val, -pc2_val))
```
For each non-usable first choice plot, find a reserve from the same grid cell in the PCA space.
```{r}
toreplace <- pca.grids %>%
filter(first.choice==T) %>%
filter(Usable==F)
# number of PCA cells from which the plots to replace stem
(npcacell <- toreplace %>%
distinct(pc_cellID) %>%
nrow())
# proportion of occupied cells
npcacell / (pca.grids %>% distinct(pc_cellID) %>% nrow())
toreplace <- toreplace %>%
pull(PlotObservationID)
#number of first choice plots needing replacement
length(toreplace)
set.seed(9999)
selected.reserves <- pca.grids %>%
#for each cell, calculate how many reserves would be needed, and how many reserves are available
left_join(pca.grids %>%
group_by(pc_cellID) %>%
summarize(n.first=sum(first.choice, na.rm=T),
n.first.usable=sum(first.choice*Usable, na.rm=T),
reserve.available=sum(reserve, na.rm=T), .groups = 'drop') %>%
mutate(reserve.needed=n.first-n.first.usable),
by=c("pc_cellID")) %>%
filter(reserve.needed>0)
# calculate number of plots that cannot be replaced
not_replaceable <- selected.reserves %>%
distinct(pc_cellID, .keep_all = T) %>%
filter(reserve.needed>0) %>%
mutate(missing=reserve.needed-reserve.available) %>%
mutate(missing=ifelse(missing<0, 0, missing)) %>%
filter(missing>0) %>%
pull(missing)
#number of plots that cannot be replaced
sum(not_replaceable)
#number of grid cells these plots come from
length(not_replaceable)
#distribution of number of not-replaceable plots per PC grid cell
summary(not_replaceable)
# from each cell where >0 reserves are needed, sample randomly n usable reserves,
# where n is the minimum between the number of reserves needed and reserves available
selected.reserves <- selected.reserves %>%
filter(reserve==T) %>%
group_by(pc_cellID) %>%
mutate(reserve.available=min(reserve.needed, reserve.available)) %>%
#slice_sample(n=reserve.available)
#mutate(nn = n()) %>%
mutate(samp = sample(n())) %>%
filter(samp <= reserve.available) %>%
dplyr::select(-samp) %>%
ungroup()
```
```{r, echo=F}
knitr::kable(selected.reserves %>%
dplyr::select(-pc1_val, -pc2_val) %>%
ungroup() %>%
arrange(pc_cellID) %>%
slice(1:20),
caption="Example of selected reserves [first 20 rows shown]") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"),
full_width = F, position = "center")
```
By selecting plots within the same cell of the PCA grid, we can only replace `r nrow(selected.reserves)` of the `r length(toreplace)` non-usable first-choice plots.
# Build sPlot OA dataset
## Header
```{r}
header.oa <- header %>%
filter(PlotObservationID %in% (usable.plots123 %>%
#filter(first.choice==T) %>%
filter(Usable=="Yes") %>%
pull(PlotObservationID)) #|
# PlotObservationID %in% (selected.reserves %>%
# pull(PlotObservationID))
) %>%
left_join(pca.grids %>%
dplyr::select(PlotObservationID, SoilClim_PC1=pc1_val, SoilClim_PC2=pc2_val, pc_cellID),
by="PlotObservationID") %>%
mutate(Resample_1=PlotObservationID %in% sel123[[1]]) %>%
mutate(Resample_2=PlotObservationID %in% sel123[[2]]) %>%
mutate(Resample_3=PlotObservationID %in% sel123[[3]]) %>%
mutate(Resample_1_consensus=ifelse(PlotObservationID %in% selected.reserves$PlotObservationID,
TRUE,
Resample_1))
```
After merging first choice plots and the corresponding reserves, the database contains `r nrow(header.oa)` unique plots, stemming from `r length(unique(header.oa[["GIVD ID"]]))` databases.
```{r}
# calculate share of unique plots for each resampling, and plots shared across resamplings
header.oa %>%
group_by(Resample_1,Resample_2, Resample_3) %>%
summarize(n=n())
```
Number of plots for which data on mosses and lichens are available.
```{r, warning=F}
header.oa %>%
mutate_at(.vars=vars(`Mosses identified (y/n)`, `Lichens identified (y/n)`),
.funs=~factor(.)) %>%
mutate_at(.vars=vars(`Mosses identified (y/n)`, `Lichens identified (y/n)`),
.funs=~forcats::fct_recode(.,
"NA" = ".",
"TRUE" = "1" ,
"TRUE" = "y" ,
"TRUE" = "Y" ,
"TRUE" = "J" ,
"FALSE" = "N",
"FALSE" = "n")) %>%
summarize(n.mosses=sum(`Mosses identified (y/n)`=="TRUE", na.rm=T),
n.lichens=sum(`Lichens identified (y/n)`=="TRUE", na.rm=T))
```
Data preparation: adjust header data, select relevant variables, reformat variables into the right classes, correct macroscopic errors.
```{r, warning=F}
header.oa <- header.oa %>%
# MEMO - releve area of SA-BR-002 is always 1000
#TO ADD
#reformat and rename
mutate_at(.vars=vars(`Altitude (m)`, `Aspect (°)`, `Slope (°)`),
~as.numeric(.)) %>%
mutate_at(.vars=vars(ESY),
~as.character(.)) %>%
mutate_at(.vars=vars(Forest:Wetland),
~as.logical(.)) %>%
mutate_at(.vars=vars(`Herbs identified (y/n)`, `Mosses identified (y/n)`, `Lichens identified (y/n)`),
~ifelse(.=="Y", T, F)) %>%
mutate(`Date of recording`=ifelse(`Date of recording`=="1-1-101", NA, `Date of recording`)) %>%
mutate(`Date of recording`=as.Date(`Date of recording`, "%d-%m-%Y") ) %>%
mutate(CONTINENT=factor(CONTINENT, exclude = " ")) %>%
mutate(`Plants recorded`=forcats::fct_explicit_na(f = `Plants recorded`, "Not specified")) %>%
mutate(`Plants recorded`=forcats::fct_recode(`Plants recorded`,
"Not specified" = "#N/A",
"All vascular plants"="Complete vegetation",
"All vascular plants"="all vascular plants",
"All vascular plants"="complete",
"All vascular plants"="Complete vegetation (including non-terricolous tax",
"All vascular plants"="Vascular plants",
"All vascular plants"="All vascular plants and dominant cryptogams",
"All woody plants"="Woody plants",
"All woody plants"="All woody species",
"Woody plants >= 10 cm dbh"= "trees>=10cm dbh",
"Woody plants >= 10 cm dbh"= "Woody plants >= 10 cm dbh and domin",
"All trees & dominant understory"="All trees & dominant shrubs",
"Woody plants >= 5 cm dbh"="Woody plants >= 5 cm dbh & dominant",
"Woody plants >= 1 cm dbh" = "Plants >= 1 cm dbh",
"Only dominant species"="Dominant vascular plants",
"Woody plants >= 1 m height"="trees and shrubs >1 m height"
)) %>%
mutate(Biome = fct_recode(Biome, "Subtropics with year-round rain"="Subtrop. with year-round rain")) %>%
#reorder levels of `Plants recorded`
mutate(`Plants recorded`=factor(`Plants recorded`,
levels=c('All vascular plants',
'All trees & dominant understory',
'Dominant trees',
'Only dominant species',
'Dominant woody plants >= 2.5 cm dbh',
'All woody plants',
'Woody plants >= 1 cm dbh',
'Woody plants >= 2.5 cm dbh',
'Woody plants >= 5 cm dbh',
'Woody plants >= 10 cm dbh',
'Woody plants >= 20 cm dbh',
'Woody plants >= 1 m height',
'Not specified'))) %>%
##correct mistakes
mutate(`Altitude (m)`=ifelse(`Altitude (m)`< -100, NA, `Altitude (m)`)) %>%
#plots from Veg_bank seem to have a mix of feet and meter in Altitude
mutate(`Altitude (m)`=ifelse(`GIVD ID`=="NA-US-002", NA, `Altitude (m)`)) %>%
#constrain Aspect between 1 and 360
mutate(`Aspect (°)`=ifelse(`Aspect (°)`<1, 360-`Aspect (°)`, `Aspect (°)`)) %>%
mutate(`Slope (°)`=ifelse(`Slope (°)`<0, NA, `Slope (°)`)) %>%
mutate(`Slope (°)`=ifelse(`Slope (°)`>90, NA, `Slope (°)`)) %>%
mutate_at(.vars=vars(starts_with("Height") & contains("shrubs")),
~ifelse(.>=10|.<0, NA, .)) %>%
mutate(`Relevé area (m²)`=ifelse(`Relevé area (m²)`<0, NA, `Relevé area (m²)`)) %>%
mutate(`Cover bare soil (%)`=ifelse(`Cover bare soil (%)`<0, NA, `Cover bare soil (%)`)) %>%
mutate(`Date of recording`=replace(`Date of recording`,
list=`Date of recording`> as.Date('2016-01-01'),
NA)) %>%
# round SoilClim PCA
mutate_at(.vars=vars(SoilClim_PC1, SoilClim_PC2),
.funs=list(~round(., 3))) %>%
# Rename fields
dplyr::select(
#metadata + location
PlotObservationID,
GIVD_ID = "GIVD ID",
Dataset,
Continent = CONTINENT,
Country,
Biome,
Date_of_recording = "Date of recording",
Latitude,
Longitude,
Location_uncertainty = "Location uncertainty (m)", #POINT_X, POINT_Y,
#sampling design
Releve_area = "Relevé area (m²)",
#Herbs_identified = "Herbs identified (y/n)",
#Mosses_identified="Mosses identified (y/n)",
#"Lichens identified (y/n)",
Plant_recorded = "Plants recorded",
#topography
Elevation = "Altitude (m)",
Aspect = "Aspect (°)",
Slope = "Slope (°)",
#vegetation type
is_forest = "is.forest",
is_nonforest = "is.non.forest",
ESY,
Naturalness,
Forest,
Shrubland,
Grassland,
Sparse_vegetation = "Sparse.vegetation",
Wetland,
#vegetation structure
Cover_total = "Cover total (%)",
Cover_tree_layer = "Cover tree layer (%)",
Cover_shrub_layer = "Cover shrub layer (%)",
Cover_herb_layer = "Cover herb layer (%)",
Cover_moss_layer = "Cover moss layer (%)",
Cover_lichen_layer ="Cover lichen layer (%)",
Cover_algae_layer = "Cover algae layer (%)",
Cover_litter_layer = "Cover litter layer (%)",
Cover_bare_rocks = "Cover bare rock (%)",
Cover_cryptogams = "Cover cryptogams (%)",
Cover_bare_soil = "Cover bare soil (%)",
Height_trees_highest = "Height (highest) trees (m)",
Height_trees_lowest = "Height lowest trees (m)",
Height_shrubs_highest = "Height (highest) shrubs (m)",
Height_shrubs_lowest = "Height lowest shrubs (m)",
Height_herbs_average = "Aver. height (high) herbs (cm)",
Height_herbs_lowest = "Aver. height lowest herbs (cm)",
Height_herbs_highest = "Maximum height herbs (cm)",
#environment PCA
SoilClim_PC1,
SoilClim_PC2,
#Resampling
Resample_1,
Resample_2,
Resample_3,
Resample_1_consensus)
```
The location of some RAINFOR plots is sensitive. I therefore reduce the precision of their spatial coordinates.
```{r}
header.oa <- header.oa %>%
mutate(Latitude=ifelse(GIVD_ID=="00-00-001",
round(Latitude, 2),
Latitude)) %>%
mutate(Longitude=ifelse(GIVD_ID=="00-00-001",
round(Longitude, 2),
Longitude)) %>%
mutate(Location_uncertainty=ifelse(GIVD_ID=="00-00-001",
1000,
Location_uncertainty))
```
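As a back-of-the-envelope check (assuming a spherical Earth with one degree of latitude spanning about 111.32 km), rounding coordinates to two decimal degrees is consistent with the 1000 m uncertainty assigned above:

```r
# grid spacing implied by two-decimal rounding of latitude
0.01 * 111.32   # ~1.11 km, on the order of the 1000 m assigned
```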
### Formations
For plots classified with EUNIS codes, we use a cross-link table to derive vegetation types and naturalness from the EUNIS code, but only where these columns are empty.
```{r}
eunis.key <- openxlsx::read.xlsx("~/share/groups/splot/users/Francesco/sPlot3/_input/EUNIS_WFT.xlsx",
sheet = "Sheet1") %>%
dplyr::select(EUNIS_code, NATURALNESS:SPARSE_VEG) %>%
mutate(EUNIS_code=as.character(EUNIS_code)) %>%
rename(ESY=EUNIS_code,
Naturalness=NATURALNESS,
Forest=FOREST,
Shrubland=SCRUBLAND,
Grassland=GRASSLAND,
Wetland=WETLAND,
Sparse_vegetation=SPARSE_VEG)
header.oa <- header.oa %>% # header.backup %>%
mutate(ESY=as.character(ESY)) %>%
#mutate(ESY=ifelse(ESY=="?", NA, ESY)) %>%
# Systematically assign some databases to forest
mutate(Forest=ifelse(Dataset %in%
c("Turkey Oak_Forest Database",
"Turkey Forest Database",
"Chile_forest", "Ethiopia"),
T, Forest)) %>%
#fill up with F those rows where at least one column on formation is assigned
rowwise() %>%
mutate(Any=any(Forest, Shrubland, Grassland, Wetland, Sparse_vegetation)) %>%
mutate(Forest=ifelse( (is.na(Forest) & Any), F, Forest)) %>%
mutate(Shrubland=ifelse( (is.na(Shrubland) & Any), F, Shrubland)) %>%
mutate(Grassland=ifelse( (is.na(Grassland) & Any), F, Grassland)) %>%
mutate(Wetland=ifelse( (is.na(Wetland) & Any), F, Wetland)) %>%
mutate(Sparse_vegetation=ifelse( (is.na(Sparse_vegetation) & Any), F, Sparse_vegetation)) %>%
ungroup() %>%
dplyr::select(-Any) %>%
##join and coalesce with eunis.key
left_join(eunis.key %>%
distinct(), by = "ESY") %>%
mutate(
Forest = dplyr::coalesce(Forest.x, Forest.y),
Shrubland = coalesce(Shrubland.x, Shrubland.y),
Grassland = coalesce(Grassland.x, Grassland.y),
Wetland = coalesce(Wetland.x, Wetland.y),
Sparse_vegetation = coalesce(Sparse_vegetation.x, Sparse_vegetation.y),
Naturalness = coalesce(Naturalness.x, Naturalness.y)
) %>%
dplyr::select(-ends_with(".x"), -ends_with(".y")) %>%
#transform naturalness to ordered factor
mutate(Naturalness=factor(Naturalness,
levels=c(1,2),
labels=c("Natural", "Semi-natural"),
ordered = T)) %>%
relocate(Forest:Sparse_vegetation, .after=ESY) %>%
relocate(Naturalness, .after=ESY)
```
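The join-and-coalesce pattern used above (a `left_join()` that produces `.x`/`.y` suffixed columns, then `coalesce()` to prefer the original value and fall back to the lookup table) can be seen on toy data (hypothetical values):

```r
library(dplyr)

plots <- tibble(ESY = c("G1", "T2"), Forest = c(NA, TRUE))
key   <- tibble(ESY = c("G1", "T2"), Forest = c(FALSE, FALSE))

plots %>%
  left_join(key, by = "ESY", suffix = c(".x", ".y")) %>%
  # keep the plot-level value where present, else take it from the key
  mutate(Forest = coalesce(Forest.x, Forest.y)) %>%
  select(ESY, Forest)
# "G1" gets FALSE from the key; "T2" keeps its original TRUE
```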
Fix `is_forest` and `is_nonforest` based on vegetation type. Make the fields consistent with each other.
```{r}
header.oa <- header.oa %>%
# If a plot has Forest ==1 and all other veg types==0, force is_forest to TRUE and is_nonforest to F
mutate(is_forest=replace(is_forest,
list=(Forest==T & Grassland==F & Shrubland==F & Wetland==F & Sparse_vegetation==F),
values=T)) %>%
mutate(is_nonforest=replace(is_nonforest,
list=(Forest==T & Grassland==F & Shrubland==F & Wetland==F & Sparse_vegetation==F),
values=F)) %>%
# If a plot has Forest ==0 and any other veg types==1, force is_forest to F and is_nonforest to T
mutate(is_forest=replace(is_forest,
list=(Forest==F & (Grassland==T | Shrubland==T | Wetland==T | Sparse_vegetation==T)),
values=F)) %>%
mutate(is_nonforest=replace(is_nonforest,
list=(Forest==F & (Grassland==T | Shrubland==T | Wetland==T | Sparse_vegetation==T)),
values=T)) %>%
## fill up NAs when either is_forest or is_nonforest is not NA
## note that if a plot is marked as is_forest == T, the plot is assigned is_nonforest == F,
## BUT if a plot is marked as is_nonforest == T, the opposite is not done (conservative choice)
mutate(is_forest=replace(is_forest,
list=is.na(is_forest) & is_nonforest==T,
values=F)) %>%
mutate(is_nonforest=replace(is_nonforest,
list=is.na(is_nonforest) & is_forest==F,
values=T)) %>%
## replace double FALSE with NA
mutate(is_forest=replace(is_forest,
list= ((is.na(is_forest) | is_forest==F) & (is.na(is_nonforest) | is_nonforest==F)),
values=NA)) %>%
mutate(is_nonforest=replace(is_nonforest,
list= ((is.na(is_forest) | is_forest==F) & (is.na(is_nonforest) | is_nonforest==F)),
values=NA))
## Correct known misassigned plots
indre <- c(977877:978502, 981362:981969) #as requested by Adrian Indreica
header.oa <- header.oa %>%
mutate(is_forest=ifelse(PlotObservationID %in% indre,
T,
is_forest))
#double check
header.oa %>%
group_by(is_forest, is_nonforest) %>%
summarize(n=n())
```
### Complete missing values, when possible
There are 75 entries without continent information.
```{r}
header.oa <- header.oa %>%
mutate(Continent=as.character(Continent)) %>%
mutate(Continent=ifelse(is.na(Continent) & Country %in% c("Bulgaria", "Denmark", "Greece", "Iceland",
"Italy", "Norway", "Svalbard and Jan Mayen Is",
"Sweden", "United Kingdom", "France", "Spain"),
"Europe", Continent)) %>%
mutate(Continent=ifelse(is.na(Continent) & Country=="Australia",
"Australia", Continent)) %>%
mutate(Continent=ifelse(is.na(Continent) & Country %in% c("Chile", "Colombia"),
"South America", Continent)) %>%
mutate(Continent=ifelse(is.na(Continent) & Country %in% c("United States", "Canada", "Greenland"),
"North America", Continent)) %>%
# correct a couple of mistakes
mutate(Continent=ifelse(Continent=="Australia", "Oceania", Continent)) %>%
mutate(Continent=ifelse(Country=="Papua New Guinea", "Oceania", Continent)) %>%
mutate(Continent=as.factor(Continent)) %>%
# tranform Forest:Sparse veg to T/F
mutate_at(.vars=vars(Forest:Sparse_vegetation),
.funs = ~as.logical(.) )
```
The field `is_nonforest` is now redundant. Drop it.
```{r}
header.oa <- header.oa %>%
dplyr::select(-is_nonforest)
```
```{r}
## distribution of plot sizes:
cut(header.oa$Releve_area, breaks=c(0, 10, 100, 1000, Inf),
    labels=c("<10", "10-100", "100-1000", ">=1000")) %>% table()
```
### Show Output
```{r, echo=F}
knitr::kable(header.oa %>%
sample_n(20),
caption="Example of header.oa [20 randomly selected plots shown]") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"),
full_width = F, position = "center")
```
## Global reference list
sPlot stems from the work of thousands of vegetation scientists, and much of this work has already been published. Here, we create a list of all relevant references, which can be cited when providing information on the plots or datasets contained in sPlotOpen. This reference list is formatted according to the BibTeX standard.
### Import and parse plot level info
Plot-level info is imported from Turboveg. Before importing, all single and double quotation marks need to be found and escaped. This is done via the Linux console.
```{bash, engine.opts='-l', eval=F}
# not sure this works from R Markdown; it works from the console, though
sed "s/'/\\'/g" _data/PlotLevelInfo/TV3_PlotLevelInfo_Export_123.csv > _data/PlotLevelInfo/TV3_PlotLevelInfo_Export_test.csv
sed 's/"/\\"/g' _data/PlotLevelInfo/TV3_PlotLevelInfo_Export_test.csv > _data/PlotLevelInfo/TV3_PlotLevelInfo_Export_test2.csv
```
```{r}
plotinfo.raw <- read_delim("_data/PlotLevelInfo/TV3_PlotLevelInfo_Export_test2.csv", delim="\t",
col_types = cols(
.default = col_character(),
PlotObservationID = col_double(),
PlotID = col_double(),
Country = col_character(),
`Nr. table in publ.` = col_character(),
`Nr. relevé in table` = col_character(),
Author = col_character(),
Remarks = col_character(),
`Original nr in database` = col_character(),
Collection = col_character(),
`Dataset...14` = col_character(),
SURVEY = col_character(),
Longitude = col_double(),
Latitude = col_double(),
`Location uncertainty (m)` = col_double(),
`Dataset...65` = col_character(),
GUID = col_character(),
DB_OWNER = col_character(),
ORIG_DB = col_character()
)) %>%
#drop empty field
dplyr::select(where(~ !(all(is.na(.)) | all(. == "")))) %>%
rename(Dataset = Dataset...14,
Dataset_1 = Dataset...65,
TAB_NR=`TAB_NR...30`,
TAB_NR_1=`TAB_NR...49`,
ORIG_ID=`ORIG_ID...27`,
ORIG_ID_1=`ORIG_ID...34`)
```
### Dataset-level biblio-references
Import the dataset-level BibTeX reference list and the database-level information.
```{r, message=F}
bib.db <- bib2df("~/share/groups/splot/users/Francesco/_sPlot_Management/Consortium/sPlot_References.bib")
databases <- read_csv("~/share/groups/splot/users/Francesco/_sPlot_Management/Consortium/Databases.out.csv")
plotinfo.dbref <- header.oa %>%
dplyr::select(PlotObservationID, GIVD_ID) %>%
left_join(databases %>%
dplyr::select(GIVD_ID=`GIVD ID`, DB_BIBTEXKEY=BIBTEXKEY) %>%
distinct(),
by="GIVD_ID")
dim(plotinfo.dbref)
```
### Plot-level biblio-references
Data from Turboveg come with a dictionary of references. These references, however, are not formally formatted, but are simple text strings.
```{r, message=F}
#Import biblioreference dictionary from TurboVeg3
Biblioref.raw <- read_delim("_data/PlotLevelInfo/BiblioReference_v2.txt",
delim="\t", col_names = c("PlotObservationID", "Fullref"))
Biblioref.raw_123 <- read_delim("_data/PlotLevelInfo/BiblioReference_v123_additional.txt",
delim="\t", col_names = c("PlotObservationID", "Fullref"))
```
These bibliographic references are then parsed using the Ruby library [anystyle](https://github.com/inukshuk/anystyle). The input, however, needs some additional cleaning first.
The code below serves this purpose, although it might benefit from further refinement.
Before parsing with anystyle, we do some string modification: all fully upper-case words are converted to lower case, with the first letter capitalized.
```{r}
.simpleCap <- function(x) {
s <- strsplit(x, "-")[[1]]
s <- tolower(s)
paste(toupper(substring(s, 1, 1)), substring(s, 2),
sep = "", collapse = "-")
}
br1 <- Biblioref.raw %>%
#bind_rows(Biblioref.raw_123)
distinct(Fullref) %>%
arrange(Fullref)
for(i in 1:nrow(br1)){
tmp <- str_split(br1[i,], pattern = " ")[[1]]
  tochange <- str_detect(tmp, "^[:upper:]+$|^[:upper:]+,$|^[:upper:]+-[:upper:]+$") & str_count(tmp, pattern="[A-Za-z]|-")>1 #doesn't match non-ASCII letters, though
if(sum(tochange)>0){
tmp[tochange] <- sapply(tmp[tochange], .simpleCap)
br1[i,] <- paste(tmp, collapse=" ")
}
}
#split in chunks with 300 refs each
nchunks <- ceiling(nrow(br1)/300)
iii <- 1:nrow(br1)
splitted <- split(iii, sort(iii%%nchunks))
```
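Two quick illustrations of the helpers above (not part of the pipeline): what `.simpleCap` does to a fully upper-case, possibly hyphenated word, and how the modulo trick in `split()` partitions indices into nearly equal-sized chunks:
```{r, eval=F}
# Fully upper-case (possibly hyphenated) words become capitalized words
.simpleCap("MUELLER-DOMBOIS")  # "Mueller-Dombois"
.simpleCap("SMITH")            # "Smith"
# The modulo trick splits a sequence of indices into nearly equal groups
split(1:10, sort(1:10 %% 3))   # three groups: 1-3, 4-7, 8-10
```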
```{r, eval=F}
##clean up output folder first
filenames <- paste0("_data/PlotLevelInfo/Ref_to_parse_", 1:nchunks, ".txt")
if(any(file.exists(filenames))){
#Delete file if it exists
file.remove(filenames)
}
## sink references to format into .txt files in batches of 300
for(i in 1:nchunks){
tmp <- br1$Fullref[splitted[[i]]]
write_lines(tmp, paste0("_data/PlotLevelInfo/Ref_to_parse_", i, ".txt"), )
}
```
These references were submitted to the [anystyle](https://github.com/inukshuk/anystyle) web interface. The output was exported to BibTeX.
Reimport and match.
```{r, message=F, warning=F}
filenames <- paste0("_data/PlotLevelInfo/anystyle_", 1:11, ".bib")
bib.list <- lapply(filenames, bib2df)
bib.df <- bind_rows(bib.list) %>%
select_if(function(x) !(all(is.na(x)) | all(x=="")))
br1.out <- bind_cols(Biblioref.raw %>%
distinct(Fullref) %>%
arrange(Fullref), bib.df)
filenames <- paste0("_data/PlotLevelInfo/anystyle_123_", 1:5, ".bib")
bib.list <- lapply(filenames, bib2df)
bib.df <- bind_rows(bib.list) %>%
select_if(function(x) !(all(is.na(x)) | all(x=="")))
br1.out.123 <- bind_cols(Biblioref.raw_123 %>%
distinct(Fullref) %>%
arrange(Fullref), bib.df)
```
Parse additional references not stored in Turboveg3's dictionary.
```{r}
br2 <- plotinfo.raw %>%
dplyr::select(`Biblio reference`) %>%
distinct(`Biblio reference`) %>%
arrange(`Biblio reference`) %>%
filter(!is.na(`Biblio reference`)) %>%
filter(!str_detect(`Biblio reference`, pattern ="^\\d+$"))
#split in chunks with 300 refs each
nchunks2 <- ceiling(nrow(br2)/300)
iii2 <- 1:nrow(br2)
splitted2 <- split(iii2, sort(iii2%%nchunks2))
```
Manually submit the files to the [anystyle](https://github.com/inukshuk/anystyle) web interface. The output was exported to BibTeX.
```{r, eval=F}
##clean up before saving
filenames <- paste0("_data/PlotLevelInfo/Ref2_to_parse_", 1:nchunks2, ".txt")
if(any(file.exists(filenames))){
#Delete file if it exists
file.remove(filenames)
}
for(i in 1:nchunks2){
tmp <- br2$`Biblio reference`[splitted2[[i]]]
write_lines(tmp, paste0("_data/PlotLevelInfo/Ref2_to_parse_", i, ".txt"), )
}
```
Reimport and match
```{r, message=F, warning=F}
filenames2 <- paste0("_data/PlotLevelInfo/anystyle2_", 1:nchunks2, ".bib")
bib.list <- lapply(filenames2, bib2df)
bib.df <- bind_rows(bib.list) %>%
select_if(function(x) !(all(is.na(x)) | all(x=="")))
br2.out <- bind_cols(br2, bib.df) %>%
rename(Fullref=`Biblio reference`)
```
Create a unique data.frame with all formatted references, and correct duplicated BibTeX keys.
```{r}
#define helper function
rename.duplicates <- function(x){
tick <- 1
while(sum(duplicated(x)) > 0) {
#print(tick)
    if (tick == 1) {x[duplicated(x)] <- paste0(x[duplicated(x)], tick)}
    if (tick > 1) {
      x[duplicated(x)] <- paste0(str_sub(x[duplicated(x)], end = -2), #strip the previous one-digit suffix
                                 tick)}
tick <- tick + 1
}
return(x)
}
#fix duplicated bibtex keys
reference.oa <- bind_rows(bib.db, br1.out, br1.out.123, br2.out) %>%
distinct() %>%
filter((Fullref %in% Biblioref.raw$Fullref |
Fullref %in% Biblioref.raw_123$Fullref |
BIBTEXKEY %in% bib.db$BIBTEXKEY))
reference.oa$BIBTEXKEY <- rename.duplicates(reference.oa$BIBTEXKEY)
```
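The behaviour of `rename.duplicates` can be illustrated with a few hypothetical keys: each duplicate receives a progressive numeric suffix, so all keys end up unique.
```{r, eval=F}
# Duplicated keys receive a progressive numeric suffix
rename.duplicates(c("Smith2001", "Smith2001", "Smith2001"))
# "Smith2001" "Smith20011" "Smith20012"
```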
The reference list contains `r nrow(reference.oa)` parsed references.
```{ruby, eval=F, echo=F, engine.path = '~/.rubies/ruby-2.7.2/bin/ruby'}
#Check if I can use Ruby's `anystyle` library inside RMarkdown.
#Parsing is in general pretty good, yet not perfect. I leave this on standby for the time being.
puts RUBY_VERSION
require "anystyle"
File.open("_data/PlotLevelInfo/BiblioReference.clean.txt", "r") do |file_handle|
file_handle.each_line do |ref|
File.open("_data/PlotLevelInfo/output.txt", mode:"a") {|f| f.write AnyStyle.parse ref }
end
end
```
### Show Output
```{r, echo=F}
knitr::kable(reference.oa %>%
sample_n(20)%>%
select_if(function(x) !(all(is.na(x)) | all(x==""))),
caption="Example of reference.oa [20 randomly selected references, and only non-empty columns shown]") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"),
full_width = F, position = "center")
```
Clearly, some additional parsing is needed, and there are some encoding problems (these come from the original data, though, which means there is little we can do programmatically). Yet the result is a very good starting point. We refer to this reference list as a beta version, and recommend that users carefully check the references they need to cite before using them.
## Metadata - Plot-level
We store a lot of plot-level metadata in TurboVeg3. Yet, these metadata are only partially standardized across datasets. Here, we try to organize this information into a few meaningful fields. For each plot we provide information on:
- DB_BIBTEXKEY - Key linking to the bibliographic reference of the dataset from which the plot stems. Keys refer to the reference.oa object.
- Releve_author - Name of the person originally collecting the data in the field
- Releve_coauthors - Names of additional persons originally collecting the data in the field
- Plot_Biblioreference - Bibliographic reference where the plot was first published, if any
- BIBTEXKEY - Key linking to the plot-level bibliographic reference. It refers to the reference.oa object
- Nr_table_in_publ - Number of the table reporting the plot in the publication where it was originally published, if any
- Nr_releve_in_table - Plot number in the table where the plot was originally reported
- Original_nr_in_database - Original plot number, in the database the plot stems from
- Original_plotID - Only for nested plots
- Original_subplotID - Only for nested plots, in case a plot is nested inside another
- Project - Name of the project a specific plot stems from
- Remarks - Any additional notes associated with a plot
- GUID - Unique ID generated by Turboveg
Plot-level metadata is stored in a heterogeneous manner across the datasets participating in sPlot.
```{r}
colnames(plotinfo.raw)
```
In the following subsections, we try to harmonize the information from these multiple fields.
### Plot-level biblioreference
```{r}
#select all fields in plotinfo.raw having biblioref info
plotinfo.biblio <- plotinfo.raw %>%
dplyr::select(PlotObservationID, Country,
Biblioreference, `Biblio reference`, PUBL, THESIS) %>%
#keep only non-empty
filter_at(.vars = vars(Biblioreference:THESIS), .vars_predicate = any_vars(!is.na(.))) %>%
## attach full reference
left_join(Biblioref.raw, by="PlotObservationID") %>%
#coalesce into a unique field
mutate(Biblioreference=coalesce(Fullref, Biblioreference, `Biblio reference`, PUBL, THESIS)) %>%
dplyr::select(PlotObservationID, Biblioreference) %>%
left_join(reference.oa %>%