The fingR
is a comprehensive package designed to support Sediment
Source Fingerprinting studies. It provides essential tools including:
dataset characterisation, tracer selection from analysed properties
using the three-step method, modelling of source contributions using the
Bayesian Mixing Model (BMM), and evaluation of model predictions using
virtual mixtures, and it supports BMM and
MixSIAR models.
The fingR
package is available in this
Github repository and
archived on Zenodo.
Table of Contents
- Installation
- Usage
- Future updates
- Getting help
- Citation
- References
# install.packages(devtools) # if not installed yet
library(devtools)
# Install the most recent version from GitHub - check github page for updates
devtools::install_github("https://github.com/tchalauxclergue/fingR/releases/tag/2.1.4", ref = "master", force = T)
# Alternatively, install from a downloaded '.tar.gz' file
devtools::install_local("path_to_file/fingR_2.1.4.tar.gz", repos = NULL)
# 'path_to_file' should be modified accordingly to your working environment
To illustrate the usage of the fingR package, a database containing layers from a sediment core and potential sources samples was used. The 38 layers of a sediment core collected in the Mano Dam reservoir (Fukushima, Japan) in June 2021 were used as mixture targets. The potential source samples include four classes: undecontaminated cropland (n = 24), remediated cropland (n = 22), forest (n = 24), and subsoil (mainly granite saprolite; n = 24). All sediment and soil samples were sieved to 63 μm and analysed, the organic matter (total organic carbon (TOC), and total nitrogen (TN)) and elemental geochemistry by ED-XRF (Al, Ca, Co, Cr, Cu, Fe, K, Mg, Mn, Ni, Pb, Rb, Si, Sr, Ti, Zn, Zr) were used as potential tracer properties.
This dataset, along with detailed measurement protocols, is available for download on Zenodo at Chalaux-Clergue et al. (2024c).
For more details about functions, please read function vignettes (RStudio ‘Help’ section - press F1 for a quick access).
The first step is to load the database, in this guide it consists of a data file and a metadata file. To support database sharing and compatibility, a database template has been proposed and made available on Zenodo at Chalaux-Clergue et al., (2024b) (ver. 24.03.01).
# Load the fingR package
library(fingR)
# Get the directory to data and metadata files within the fingR package
data.dr <- system.file("extdata", "TCC_MDD_20210608_data_ChalauxClergue_et_al_v240319.csv", package = "fingR")
metadata.dr <- system.file("extdata", "TCC_MDD_20210608_metadata_ChalauxClergue_et_al_v240319.csv", package = "fingR")
# Load the csv files of data and metadata - replace the dir with your file direction
db.data <- read.csv(data.dr, sep = ";", fileEncoding = "latin1", na = "")
db.metadata <- read.csv(metadata.dr, sep = ";", fileEncoding = "latin1", na = "")
The different classes of potential sources and sediment samples
information is in the Class_decontamination column
. In which, the
Target class refers to sediment core layers, and other classes to
potential sediment sources.
table(db.metadata$Class_decontamination)
#>
#> Forest Remediated Subsoil Target
#> 24 10 10 38
#> Undecontaminated
#> 24
A single dataset is built by joining the metadata (general information)
and the data (analyses) sub-datasets. Both metadata and data
data.frames
are joined by common variables, here using the columns
IGSN
and Sample_name
. In addition, two filters were applied. A
filter to keep the analysed performed on the fraction ≤ 63 μm, and a
filter to remove remediated cropland from the dataset, and therefore
using three potential sources.
library(dplyr)
# Create a single dataset with metadata and data information
database <- dplyr::left_join(x = db.metadata, # Metadata data.frame
y = db.data, # Data data.frame
by = join_by(IGSN, Sample_name)) %>% # In common variables
dplyr::filter(Sample_size == "< 63 µm") %>% # Filter sample fraction
dplyr::filter(Class_decontamination != "Remediated") # Remove remediated cropland to keep 3 potential sources
Therefore, the final dataset contains three potential sources (i.e. forest, subsoil, and undecontaminated cropland).
table(database$Class_decontamination)
#>
#> Forest Subsoil Target Undecontaminated
#> 24 10 38 24
Among the analysed properties, a vector is created listing the column
names of the values of the potential tracer properties,
i.e. prop.values
.
# Display the database columns
# colnames(database)
# Select the column names of the properties values
prop.values <- database %>% # Database data.frame
dplyr::select(TOC_PrC, TN_PrC, # Organic matter properties
EDXRF_Al_mg.kg.1:EDXRF_Zr_mg.kg.1) %>% # Elemental geochemistry properties
base::names() # Extract column names
prop.values
#> [1] "TOC_PrC" "TN_PrC" "EDXRF_Al_mg.kg.1" "EDXRF_Ca_mg.kg.1"
#> [5] "EDXRF_Co_mg.kg.1" "EDXRF_Cr_mg.kg.1" "EDXRF_Cu_mg.kg.1" "EDXRF_Fe_mg.kg.1"
#> [9] "EDXRF_K_mg.kg.1" "EDXRF_Mg_mg.kg.1" "EDXRF_Mn_mg.kg.1" "EDXRF_Ni_mg.kg.1"
#> [13] "EDXRF_Pb_mg.kg.1" "EDXRF_Rb_mg.kg.1" "EDXRF_Si_mg.kg.1" "EDXRF_Sr_mg.kg.1"
#> [17] "EDXRF_Ti_mg.kg.1" "EDXRF_Zn_mg.kg.1" "EDXRF_Zr_mg.kg.1"
A vector of property measurement uncertainty column names is also
created, prop.uncertainties
. In order to maintain the relationship
between the column names of the properties and the uncertainty, the
column names of the properties are set as names
of the vector of
property names. This also simplifies the selection of the uncertainty
column names, as they can be selected from the property column names.
# Select the column names of the property measurement uncertainties values
prop.uncertainties <- database %>% # Database data.frame
dplyr::select(TOC_SD, TN_SD, # Organic matter properties
EDXRF_Al_RMSE:EDXRF_Zr_RMSE) %>% # Elemental geochemistry uncertainties
base::names() # Extract column names
# Set property names to property uncertainty names for easier selection
base::names(prop.uncertainties) <- prop.values # Set properties values as names for uncertainties
unname(prop.uncertainties)
#> [1] "TOC_SD" "TN_SD" "EDXRF_Al_RMSE" "EDXRF_Ca_RMSE"
#> [5] "EDXRF_Co_RMSE" "EDXRF_Cr_RMSE" "EDXRF_Cu_RMSE" "EDXRF_Fe_RMSE"
#> [9] "EDXRF_K_RMSE" "EDXRF_Mg_RMSE" "EDXRF_Mn_RMSE" "EDXRF_Ni_RMSE"
#> [13] "EDXRF_Pb_RMSE" "EDXRF_Rb_RMSE" "EDXRF_Si_RMSE" "EDXRF_Sr_RMSE"
#> [17] "EDXRF_Ti_RMSE" "EDXRF_Zn_RMSE" "EDXRF_Zr_RMSE"
The data.watcher
function allows to check the format of the selected
properties. The function verifies if the values of the property are all
negative (e.g. δ13C, δ15N ), which will require an
inversion of the value prior to Bayesian modelling where the data is
systematically log-transformed, and if some samples have negative
values, which will require manual verification or correction of the
data. If the measurement uncertainties are provided to the prop.uncer
argument, the function will indicate if the measurement uncertainty
makes some values virtually impossible, and will give the maximum
relative uncertainty if it is greater than 5 %. The result of this
function should encourage the user to consider the quality of his data.
fingR::data.watcher(data = database, # Database data.frame
properties = prop.values, # Vector of property labels
prop.uncer = prop.uncertainties) # vector of measurement uncertainty labels
#>
#> Following column(s) contain(s) some negative values: EDXRF_Cr_mg.kg.1.
#> Following column(s) have a measurement uncertainty that makes some values to be virtually impossible: EDXRF_Co_mg.kg.1, EDXRF_Cr_mg.kg.1, EDXRF_Cu_mg.kg.1, EDXRF_Ni_mg.kg.1.
#> Following column(s) have a relative measurement uncertainty above 5% (up to - number): EDXRF_Co_mg.kg.1 (max:753% - n:26), EDXRF_Cr_mg.kg.1 (max:211% - n:38), EDXRF_Ni_mg.kg.1 (max:105% - n:96), EDXRF_Cu_mg.kg.1 (max:103% - n:52), EDXRF_Rb_mg.kg.1 (max:89% - n:93), TN_PrC (max:45% - n:91), EDXRF_Pb_mg.kg.1 (max:38% - n:91), EDXRF_Zn_mg.kg.1 (max:34% - n:96), EDXRF_Sr_mg.kg.1 (max:15% - n:46), TOC_PrC (max:14% - n:95), EDXRF_Zr_mg.kg.1 (max:7% - n:2).
According to the data.watcher
function, some samples have negative
ED-XRF Cr values and the Co, Cr, Cu, Ni and Rb appear to be high. These
properties are removed from the properties used in the following guide.
# Remove the properties listed from the vector of properties
prop.values <- prop.values[!prop.values %in% c("EDXRF_Co_mg.kg.1", "EDXRF_Cr_mg.kg.1", "EDXRF_Cu_mg.kg.1", "EDXRF_Ni_mg.kg.1", "EDXRF_Rb_mg.kg.1")]
prop.values
#> [1] "TOC_PrC" "TN_PrC" "EDXRF_Al_mg.kg.1" "EDXRF_Ca_mg.kg.1"
#> [5] "EDXRF_Fe_mg.kg.1" "EDXRF_K_mg.kg.1" "EDXRF_Mg_mg.kg.1" "EDXRF_Mn_mg.kg.1"
#> [9] "EDXRF_Pb_mg.kg.1" "EDXRF_Si_mg.kg.1" "EDXRF_Sr_mg.kg.1" "EDXRF_Ti_mg.kg.1"
#> [13] "EDXRF_Zn_mg.kg.1" "EDXRF_Zr_mg.kg.1"
As well the measurement uncertainties of these properties are removed from the selection.
# Keep uncertainties associated to the new vector of properties
prop.uncertainties <- prop.uncertainties[prop.values]
prop.uncertainties
#> TOC_PrC TN_PrC EDXRF_Al_mg.kg.1 EDXRF_Ca_mg.kg.1
#> "TOC_SD" "TN_SD" "EDXRF_Al_RMSE" "EDXRF_Ca_RMSE"
#> EDXRF_Fe_mg.kg.1 EDXRF_K_mg.kg.1 EDXRF_Mg_mg.kg.1 EDXRF_Mn_mg.kg.1
#> "EDXRF_Fe_RMSE" "EDXRF_K_RMSE" "EDXRF_Mg_RMSE" "EDXRF_Mn_RMSE"
#> EDXRF_Pb_mg.kg.1 EDXRF_Si_mg.kg.1 EDXRF_Sr_mg.kg.1 EDXRF_Ti_mg.kg.1
#> "EDXRF_Pb_RMSE" "EDXRF_Si_RMSE" "EDXRF_Sr_RMSE" "EDXRF_Ti_RMSE"
#> EDXRF_Zn_mg.kg.1 EDXRF_Zr_mg.kg.1
#> "EDXRF_Zn_RMSE" "EDXRF_Zr_RMSE"
The three-step method assesses conservative behaviour using range tests
(RT), also known as bracket tests. For a property to be considered
conservative, the values of each target sample must fall within the
range of potential source classes. This range is defined by the highest
and lowest values in a given source class for a given criterion. The
range.test
function allows the calculation of the range test on each
property according to multiple range test criteria.
Several range test criteria have been documented, including
minimum-maximum (MM
), minimum-maximum with a 10 % margin of error
(MMe
), boxplot whiskers (outlier exclusion threshold), boxplot hinges
(representing the middle 50 %), mean, mean plus/minus one standard
deviation (mean.sd
), and median. For the mean and mean.sd
criteria,
log-transformed values are used, assuming a normal distribution. The
function applies all criteria by default (criteria = c("all")
), or the
user can freely select several tests from the following list: MM
,
MMe
, whiskers
, hinge
, mean
, mean.sd
and/or median
. Their
effectiveness in detecting conservative properties varies. Among them,
the mean.sd
criterion is mathematically the most robust.
The range.test
function outputs a list with two data frames:
-
$results.df
: A summary overview of all properties range test outcomes. -
$results.RT
: Detailed range test results for each target sample across properties, whereTRUE
indicates that the sample value is within the range, whilehigh
andlow
indicate values outside the range.
Within the range.test
function, the class
argument specifies the
column name containing the classifications for potential source and
target mixture samples. The label for the target mixtures used in
class
should be provided in the mixture
argument. The sample.id
argument designates the column containing the unique identifiers or
labels for each sample. When a saving directory path the results is
provided to save.dir
, the function write two CSV files:
“RT_samples[_note].csv” corresponding to results.RT
, and
“Range_test[_note].csv” corresponding to results.df
.
rt.results <- fingR::range.tests(data = database, # Data.frame with source and mixture information
class = "Class_decontamination", # Column with classification or grouping or sources and mixtures
mixture = "Target", # Labeling of mixtures in `class`
properties = prop.values, # Vector of property labels
sample.id = "Sample_name", # Identifier for individual samples
criteria = c("mean.sd"), # Range test critirion/criteria (options: "MM", "MMe", "whiskers", "hinge", "mean", "mean.sd", "median", or "all")
# MM.error = c(0.1), # Optional - Set min-max +/- error as a percentage (here 10 %)
# save.dir = dir.example, # Optional - Directory path to save results
# note = "example" # Optional - Additional file name annotation
)
# Results of the mean.sd range test for all properties
rt.results$results.df
#> Property n_source n_mixture NAs RT_mean.sd_single
#> 1 TOC_PrC 58 38 0 TRUE
#> 2 TN_PrC 58 38 0 TRUE
#> 3 EDXRF_Al_mg.kg.1 58 38 0 TRUE
#> 4 EDXRF_Ca_mg.kg.1 58 38 0 FALSE
#> 5 EDXRF_Fe_mg.kg.1 58 38 0 FALSE
#> 6 EDXRF_K_mg.kg.1 58 38 0 FALSE
#> 7 EDXRF_Mg_mg.kg.1 58 38 0 FALSE
#> 8 EDXRF_Mn_mg.kg.1 58 38 0 FALSE
#> 9 EDXRF_Pb_mg.kg.1 58 38 0 FALSE
#> 10 EDXRF_Si_mg.kg.1 58 38 0 FALSE
#> 11 EDXRF_Sr_mg.kg.1 58 38 0 FALSE
#> 12 EDXRF_Ti_mg.kg.1 58 38 0 TRUE
#> 13 EDXRF_Zn_mg.kg.1 58 38 0 FALSE
#> 14 EDXRF_Zr_mg.kg.1 58 38 0 FALSE
The result of the range test for some target samples and a property that did not pass the range test.
rt.results$results.RT$EDXRF_Pb_mg.kg.1[1:5,]
#> Sample_name Property n_source RT_mean.sd
#> 1 ManoDd_2106_00-01 EDXRF_Pb_mg.kg.1 58 TRUE
#> 2 ManoDd_2106_01-02 EDXRF_Pb_mg.kg.1 58 TRUE
#> 3 ManoDd_2106_02-03 EDXRF_Pb_mg.kg.1 58 high
#> 4 ManoDd_2106_03-04 EDXRF_Pb_mg.kg.1 58 high
#> 5 ManoDd_2106_04-05 EDXRF_Pb_mg.kg.1 58 TRUE
Based on the $results.df
data frame from range.tests
, the
is.conservative
function filter the properties that passed the range
tests for each criterion, and returns a list of vectors. Each vector
contains the names of the properties that were identified as
conservative according to each criterion.
prop.cons <- fingR::is.conservative(data = rt.results$results.df, # Data.frame of the range test results (generated by fingR::range.tests())
# property = "Property", # Optional - Column with property labels
# test.format = "RT", # Optional - Common pattern in test column labels
# position = 2, # Optional - Position of test column labels in column names
# separator = "_", # Optional - Character to split test column labels
# note = "example" # Optional - Additional file name annotation
)
prop.cons
#> $mean.sd
#> [1] "TOC_PrC" "TN_PrC" "EDXRF_Al_mg.kg.1" "EDXRF_Ti_mg.kg.1"
The three-step method evaluates the ability of the traits to
discriminate between source groups using a statistical test to test
whether or not samples are from the same distribution. Conventionally,
the non-parametric Kruskal-Wallis H test (KW) is used. The KW test is an
extension of the two-sample Mann-Whitney U test. As an alternative to
the KW test, the two-sample Kolmogorov-Smirnov (KS) test could be used.
The KS test allows for a more detailed understanding of the
discrimination between pairs of source groups. For instance, the KS
tests for for three source groups will be structured as: source A
vs. source B, source A vs. source C, and source B vs source C. The
discriminant.test
function allows the calculation of the discriminant
power of each property. By setting the using the test argument to KW
or KS
, the Kruskal-Wallis H-test or Kolmogorov-Smirnov test is
carried out.
When using the KS test (test = "KW"
), the significance of the test p
value is commonly set to 5 % (p.level = 0.05
). Whilst, when using the
Kolmogorov-Smirnov test ( test = “KS” ), which is highly sensitive to
distribution differences, the significance of the test p value could
be set to lower value, for instance, 1 % (p.level = 0.01
).
Two CSV files will be written if the path to the save directory is given
to the save.dir
argument:
- ’Discriminant_pairs[_note].csv’, generated if
save.discrim.tests = TRUE
(default), contains the p value of each two-samples KS tests for each property. Setreturn.tests = TRUE
to return this table as$result.KS
(default isFALSE
).
Warning
When using the KW test, the KS test results can also help verify the KW assumption of similar distribution shape across groups.
-
’Discriminant_tests[_note].csv’ summarises the results from KS or KW tests.
-
For
test = "KW"
, the number of groups that follows a similar distribution shape(n.similar.groups
) is reported. If this value matches the total number of source groups, the the assumption of similar distribution shapes is satisfied (respect.KW.shape.assumption
). The table also includes the p value (KW_p.value
) and its significance (KW_signif
) for each properties. -
For
test = "KS"
, the number of statistically different groups (n.diff.groups
) is reported, and if it is at least equal to the threshold (min.discriminant = 1
) required to identify a property as discriminant, the property is marked as discriminant inKS_discriminant
.
-
Example for KW test:
KW.results <- fingR::discriminant.test(data = database, # Data.frame with source and mixture information
class = "Class_decontamination", # Column with classification or grouping or sources and mixtures
mixture = "Target", # Labeling of mixtures in `class`
test = "KW", # Performed test: Kruskal-Wallis (KW) or Kolmogorov-Smornov (KS)
properties = prop.values, # Vector of property labels
p.level = .05, # Optional - significance level value (i.e. p value)
# save.discrim.tests = TRUE, # Optional - If source couple tests should be saved
# save.dir = dir.example, # Optional - Directory path to save results
# note = "example", # Optional - Additional file name annotation
return.tests = TRUE # Optional - If source couple tests should be return
)
KW.results$result.KS[1:3,]
#> Property Source_A Source_B n_source_A n_source_B KS_p.value KS_signif
#> 1 TOC_PrC Forest Subsoil 24 10 1.01e-06 ***
#> 2 TOC_PrC Forest Undecontaminated 24 24 4.57e-06 ***
#> 3 TOC_PrC Subsoil Undecontaminated 10 24 1.34e-06 ***
KW.results$results.df[1:5,]
#> Property n.similar.groups respect.KW.shape.assumption KW_p.value
#> 1 TOC_PrC 0 FALSE 0.00000
#> 2 TN_PrC 0 FALSE 0.00000
#> 3 EDXRF_Al_mg.kg.1 0 FALSE 0.00000
#> 4 EDXRF_Ca_mg.kg.1 1 FALSE 0.00301
#> 5 EDXRF_Fe_mg.kg.1 3 TRUE 0.79411
#> KW_signif
#> 1 ***
#> 2 ***
#> 3 ***
#> 4 **
#> 5
Example for KS test:
KS.results <- fingR::discriminant.test(data = database, # Data.frame with source and mixture information
class = "Class_decontamination", # Column with classification or grouping or sources and mixtures
mixture = "Target", # Labeling of mixtures in `class`
test = "KS", # Performed test: Kruskal-Wallis (KW) or Kolmogorov-Smornov (KS)
properties = prop.values, # Vector of property labels
p.level = .01, # Optional - significance level value (i.e. p value)
min.discriminant = 1, # Optional - KS only. Minimum number of significantly different source couples
save.discrim.tests = TRUE, # Optional - If source couple tests should be saved
# save.dir = dir.example, # Optional - Directory path to save results
# note = "example", # Optional - Additional file name annotation
return.tests = TRUE # Optional - If source couple tests should be return
)
KS.results$result.KS[1:3,]
#> Property Source_A Source_B n_source_A n_source_B KS_p.value KS_signif
#> 1 TOC_PrC Forest Subsoil 24 10 1.01e-06 ***
#> 2 TOC_PrC Forest Undecontaminated 24 24 4.57e-06 ***
#> 3 TOC_PrC Subsoil Undecontaminated 10 24 1.34e-06 ***
KS.results$results.df[1:5,]
#> Property n.diff.groups KS_discriminant
#> 1 TOC_PrC 3 TRUE
#> 2 TN_PrC 3 TRUE
#> 3 EDXRF_Al_mg.kg.1 3 TRUE
#> 4 EDXRF_Ca_mg.kg.1 1 TRUE
#> 5 EDXRF_Fe_mg.kg.1 0 FALSE
The results from the KS test are used for the following guide.
Based on the $results.df
data frame from discriminant.test
, the
is.discriminant
function filter out the properties that passed the
discriminant power tests according to the set p value (p.level
) as
list them in a vector. The function recognise the structure of the data
frame produced by the discriminant.test
function, however, it is
possible to set it for other data frame format by setting its arguments.
prop.discrim <- fingR::is.discriminant(KS.results$results.df, # data.frame with the information regarding discriminant power (TRUE/FALSE)
## Folowing arguments are needed if the discrimant power test is not performed with `fingR::discriminant.test`
# property = "Property", # Optional - Column with property names
# test.format = "Kruskal.Wallis_p.value", # Optional - Test label pattern
# test.pos = 1, # Optional - Position of the test name in `test.format`
# sep.format = "_", # Optional - Separator used in `test.format`
# p.level = 0.01, # Optional - significance level value (i.e. p value)
# note = "example" # Optional - Additional file name annotation
)
prop.discrim
#> $KS
#> [1] "TOC_PrC" "TN_PrC" "EDXRF_Al_mg.kg.1" "EDXRF_Ca_mg.kg.1"
#> [5] "EDXRF_K_mg.kg.1" "EDXRF_Pb_mg.kg.1" "EDXRF_Si_mg.kg.1" "EDXRF_Sr_mg.kg.1"
#> [9] "EDXRF_Zn_mg.kg.1" "EDXRF_Zr_mg.kg.1"
The properties that pass both the evaluation of conservative behaviour
(prop.cons
from is.conservative
) and discriminant power
(prop.discrim
from is.discriminant
) are compared by the
selected.tracers
function, which list them in a vector.
tracers <- fingR::is.tracers(cons = prop.cons, # List of properties considered as conservative
discrim = prop.discrim) # List of properties considered as discriminant
tracers
#> $mean.sd_KS
#> [1] "TOC_PrC" "TN_PrC" "EDXRF_Al_mg.kg.1"
Frequently, a Discriminant Function Analysis (DFA) stepwise selection is applyed on the selected tracers. The DFA stepwise selection aims to retain the tracers that maximise source discrimination. Of note, the DFA stepwise selection has been criticised since its use is associated to a reduction of the number of tracers, especially since some studies showed that the use of a higher number of tracers decreases the sensitivity of sediment source fingerprinting modelling to non-conservative tracers (Martínez-Carreras et al., 2008; Sherriff et al., 2015). In addition, it is not useful for small selection of tracers, as it is the case here.
The stepwise.selection
function allows to apply a forward stepwise
selection based on the tracer selected, and returns a vector containing
the retained tracers.
tracers.SW <- fingR::stepwise.selection(data = database, # Data.frame with source and mixture information
class = "Class_decontamination", # Column with classification or grouping or sources and mixtures
tracers = tracers$mean.sd_KS, # Vector of selected tracers
target = "Target", # Labeling of mixtures in `class`
# save.dir = dir.example, # Optional - Directory path to save results
# note = "example" # Optional - Additional file name annotation
)
tracers.SW
#> [1] "EDXRF_Al_mg.kg.1" "TOC_PrC" "TN_PrC"
Here, the stepwise selection did not remove any of the selected tracers. However, if the stepwise selection resulted in different tracers, examining the modelling results for both sets may provide useful insights. Both tracer selection vectors could be joint in a list as follows:
all.tracers <- list("mean.sd_KS" = tracers$mean.sd_KS, # Vector of selected tracers
"mean.sd_KS_DFA" = tracers.SW) # Vector of selected tracers after the DFA stepwise selection
all.tracers
#> $mean.sd_KS
#> [1] "TOC_PrC" "TN_PrC" "EDXRF_Al_mg.kg.1"
#>
#> $mean.sd_KS_DFA
#> [1] "EDXRF_Al_mg.kg.1" "TOC_PrC" "TN_PrC"
To evaluate the accuracy of unmixing models, virtual mixtures could be used as a reliable alternative to laboratory mixtures (Batista et al., 2022). These artificial mixtures (virtual or laboratory made) act as target samples with known contributions, enabling for the calculation of modelling accuracy metrics. In addition, the ease of virtual mixture generation allows to explore a wide range of source contribution combinations.
The VM.contrib.generator
function allows a combination of source
contributions to be generated. The contributions are designed to range
from a minimum value (min
) to a maximum value (max
) by a specified
step (step
). The function first creates a contribution sequence for
the first group of sources, from min
to max
according to the step
(e.g. min = 0
, max = 100
, step = 5
: 0, 5, 10, … 100). The
following sources are then defined as the remaining maximum after the
specified step. The contributions can be generated as a percentage
(e.g. min = 0
and max= 100
) or as a ratio (e.g. min = 0
and
max= 1
). A small increment results in a higher number of virtual
mixtures. For example, three sources with 5 % increment (min = 0
,
max = 100
, step = 5
) generated 231 contribution combinations, and
5151 contribution combinations with 1 % increment (step = 1
).
The contribution combination data frame is saved
(’VM_contributions[_note][number of virtual mixtures].csv’) in the
save directory specified by the save.dir
argument.
# Generate virtual mixture source contributions
VM.contrib <- fingR::VM.contrib.generator(n.sources = 3, # Number of sources groups
min = 0, # Minimum contribution value
max = 100, # Maximum contribution value (here in %)
step = 5, # Step between two contribution levels (here in %)
sources.class = c("Forest", "Subsoil", "Undecontaminated"), # Optional - Label for source groups
VM.name = "Sample_name", # Optional - Virtual mixtures id column name
save.dir = dir.example, # Optional - Directory path to save results
# note = "example", # Optional - Additional file name annotation
# return = TRUE, # Optional - To return VM data frames
# save = TRUE, # Optional - To save VM data frames
# fileEncoding = "latin1", # Optional - Encoding format used to save data
)
VM.contrib[1:5,]
#> Sample_name Forest Subsoil Undecontaminated
#> 1 VM-001 0 0 100
#> 2 VM-002 0 5 95
#> 3 VM-003 0 10 90
#> 4 VM-004 0 15 85
#> 5 VM-005 0 20 80
The property values for each virtual mixture are generated based on the source group property signatures (i.e. mean values). The property values are calculated as a simple mass balance, the mean value of each source group is multiplied by the theoretical contribution of each source group and the result is summed, as detailed in the following equation:
Where
The VM.builder
function allows the calculation of the tracers values
(tracers
) of the virtual mixtures generated with
VM.contrib.generator
according to their respective theoretical source
contribution combinations (contributions
). The VM.builder
function
returns and, if save.dir
is set, saves three data frames:
-
$property
- ’VM_properties[_note].csv’: contains the values of the tracers for each virtual mixtures. -
$uncertainty
- ’VM_properties_SD[_note].csv’: contains the uncertainty of the tracers. If the actual measurement uncertainty are provided (uncertainty
) they will be used, otherwise 5 % of the tracer values will be considered as measurement uncertainty. -
$full
- ’VM_properties_full[_note].csv’: contains the values of the tracers and their measurement uncertainty.
It is possible to add the property values data frame of the source
samples to the one of the virtual mixture by setting
add.sources = TRUE
, which could be required when running unmixing
models.
VM <- fingR::VM.builder(data = database, # Data.frame with source and mixture information
material = "Material", # Column with classification betwen source and target
source.name = "Source", # Labeling of source samples in `material`
class = "Class_decontamination", # Column with classification or grouping or sources and mixtures
tracers = tracers$mean.sd_KS, # Vector of tracers
uncertainty = unname(prop.uncertainties[tracers$mean.sd_KS]), # Vector of tracer uncertainties - note: easy selection of uncertainty labels
contributions = VM.contrib, # Virtual mixtures contributions
VM.name = "Sample_name", # Virtual mixtures id column name
add.sources = TRUE, # Add source information at the end of the VM data frames
save.dir = dir.example, # Optional - Directory path to save results
# note = "example" # Optional - Additional file name annotation
)
VM$full[1:5,]
#> Sample_name Class_decontamination TOC_PrC TN_PrC EDXRF_Al_mg.kg.1 TOC_SD
#> 1 VM-001 Virtual Mixture 5.16 0.42 84858.53 4.75
#> 2 VM-002 Virtual Mixture 4.97 0.40 86004.71 4.75
#> 3 VM-003 Virtual Mixture 4.78 0.39 87150.90 4.75
#> 4 VM-004 Virtual Mixture 4.60 0.38 88297.08 4.75
#> 5 VM-005 Virtual Mixture 4.41 0.36 89443.26 4.75
#> TN_SD EDXRF_Al_RMSE
#> 1 0.28 17840.72
#> 2 0.28 17840.72
#> 3 0.28 17840.72
#> 4 0.28 17840.72
#> 5 0.28 17840.72
It is also possible to generate virtual mixture contributions and tracer
values in one step in the VM.builder
function, by setting the
contribution range as a vector (VM.range
) and the step (VM.step
),
such as below:
VM <- fingR::VM.builder(data = database, # Data.frame with source and mixture information
material = "Material", # Column with classification betwen source and target
source.name = "Source", # Labeling of source samples in `material`
class = "Class_decontamination", # Column with classification or grouping or sources and mixtures
tracers = tracers$mean.sd_KS, # Vector of tracers
uncertainty = unname(prop.uncertainties[tracers$mean.sd_KS]), # Vector of tracer uncertainties - note: easy selection of uncertainty labels
VM.range = c(0, 100), # Optional - Range of VM contributions
VM.step = 5, # Optional - Step between two contribution levels
VM.name = "Sample_name", # Virtual mixtures id column name
add.sources = TRUE, # Add source information at the end of the VM data frames
save.dir = dir.example, # Optional - Directory path to save results
note = "VM" # Optional - Additional file name annotation
)
VM$full[1:5,]
#> Sample_name Class_decontamination TOC_PrC TN_PrC EDXRF_Al_mg.kg.1 TOC_SD
#> 1 VM-001 Virtual Mixture 5.16 0.42 84858.53 4.75
#> 2 VM-002 Virtual Mixture 4.97 0.40 86004.71 4.75
#> 3 VM-003 Virtual Mixture 4.78 0.39 87150.90 4.75
#> 4 VM-004 Virtual Mixture 4.60 0.38 88297.08 4.75
#> 5 VM-005 Virtual Mixture 4.41 0.36 89443.26 4.75
#> TN_SD EDXRF_Al_RMSE
#> 1 0.28 17840.72
#> 2 0.28 17840.72
#> 3 0.28 17840.72
#> 4 0.28 17840.72
#> 5 0.28 17840.72
Running the models will generate numerous files, therefore we
recommended to create a dedicated folder to organise them, and keep the
directory with dir.modelling
.
dir.create(file.path(dir.example, "Modelling/"), showWarnings = FALSE) # Create a folder
dir.modelling <- paste0(dir.example, "Modelling/") # Keep folder path
Similarly, create a dedicated folder for Bayesian Mean modelling (BMM),
and keep the directory with dir.mod.BMM.
dir.create(file.path(dir.modelling, "BMM/"), showWarnings = FALSE) # Create a folder
dir.mod.BMM <- paste0(dir.modelling, "BMM/") # Keep folder path
The Bayesian Mean Model (BMM) is a statistical approach that uses Bayesian inference to estimate the mean of a population or dataset (Laceby and Olley, 2015; Batista et al., 2019). This approach incorporates prior knowledge and observed data to calculate a posterior distribution for the mean, providing a probabilistic framework that accounts for uncertainty in the estimates. The BMM approach that minimises the sum of square residuals of the mixing equation for each iteration of a Monte Carlo simulation.
The run.BMM
function has been developed to allow to simplify the use
of the BMM model. The number of iterations should be set in n.iter
,
between 2500 and 7500, to ensure convergence of the Monte Carlo chain.
The function returns a data frame containing the individual iteration
source contributions for each mixture and, if save.dir
is set, saves
it as ’BMM_prevision[_note].csv
Tip
Setting n.iter
to 30 allows the structure to be tested before
the actual modelling is carried out.
Note
It is not compulsory to provide the actual measurement uncertainty
of tracers to uncertainty
argument, although, it is highly
recommended.
BMM.mix <- fingR::run.BMM(data = database, # Data.frame with source and mixture information
class = "Class_decontamination", # Column with classification or grouping or sources and mixtures
mixture = "Target", # Labeling of mixtures in `class`
sample.id = "Sample_name", # Identifier for individual samples
tracers = tracers$mean.sd_KS, # Vector of tracers
uncertainty = unname(prop.uncertainties[tracers$mean.sd_KS]), # Vector of tracer uncertainties
n.iter = 30, # Number of iteration - TEST VERSION
save.dir = dir.mod.BMM, # Optional - Directory path to save results
# note = "example" # Optional - Additional file name annotation
)
When dealing with isotopic ratios, which are non-linear properties, the
residuals of the mixing equation for each iteration should be calculated
taking into account the relative content of the related property (see
Laceby and Olley
(2015) for
further explanation). Several isotopic ratios could be used, bearing in
mind that the order must be identical between isotope.ratio
,
isotope.prop and isotopes.unc
. For example, when using the
δ13C isotopic ratio in organic matter, the run.BMM function
should be set as follows:
Note
isotope.ratio
, isotope.prop
, and isotopes.unc
work for any
type of non linear properties.
BMM.mix.iso <- fingR::run.BMM(data = database, # Data.frame with source and mixture information
class = "Class_decontamination", # Column with classification or grouping or sources and mixtures
mixture = "Target", # Labeling of mixtures in `class`
sample.id = "Sample_name", # Identifier for individual samples
tracers = tracers$mean.sd_KS, # Vector of tracers
uncertainty = unname(prop.uncertainties[tracers$mean.sd_KS]), # Vector of tracer uncertainties
isotope.ratio = c("d13C_PrM"), # Optional: Character vector containing isotopic ratios
isotope.prop = c("TOC_PrC"), # Optional: Character vector containing isotopic ratios respective properties
isotopes.unc = c("d13C_SD"), # Optional: Character vecotr containing uncertainty of the isotopic ratios
n.iter = 30, # Number of iteration - TEST VERSION
save.dir = dir.mod.BMM, # Optional - Directory path to save results
# note = "example" # Optional - Additional file name annotation
)
Then, the predictions resulting from the BMM must be processed to determine the contribution values for each target mixture.
The BMM.summary
function provides a summary of the predictions,
including mean value, standard deviation, and various quantiles (2.5, 5,
25, 50, 75, 95, 97.5) for each target mixture (saves
’BMM_contrib[_note].csv’). From this summary, the BMM.pred
function
extracts the median and/or mean, stats = "Median"
, stats = "Mean"
,
or stats = c("Median", "Mean")
, value of source contributions for each
target mixture (saves ’BMM_ordered_contrib[_note].csv’). Finally, the
ensure.total
function ensures that the total of the predicted source
contributions from all sources sum to 1 or 100 % (saves
’corrected_contrib[_note].csv’).
# Summarise BMM model previsions
BMM.summary.mix <- fingR::BMM.summary(pred = BMM.mix, # Data.frame with the predicted contributions from BMM
# sample.id = "mix.names", # Column name for individual sample id
# source = "source", # Column name for source information
# value = "value", # Column name for individual sample predictions
save.dir = dir.mod.BMM, # Optional - Directory path to save results
# note = "example" # Optional - Additional file name annotation
)
BMM.summary.mix[1:5,]
#> mix.names source Mean SD Q2.5 Q5 Q25 Q50 Q75
#> 1 ManoDd_2106_00-01 Forest 0.105 0.178 0.001 0.001 0.001 0.001 0.109
#> 2 ManoDd_2106_00-01 Subsoil 0.516 0.319 0.001 0.001 0.360 0.524 0.750
#> 3 ManoDd_2106_00-01 Undecontaminated 0.379 0.363 0.001 0.001 0.001 0.364 0.639
#> 4 ManoDd_2106_01-02 Forest 0.068 0.107 0.001 0.001 0.001 0.008 0.082
#> 5 ManoDd_2106_01-02 Subsoil 0.721 0.286 0.035 0.126 0.590 0.853 0.939
#> Q95 Q97.5
#> 1 0.496 0.545
#> 2 0.998 0.998
#> 3 0.971 0.991
#> 4 0.336 0.355
#> 5 0.992 0.998
# Extracts the median value of the previsions
BMM.preds.mix <- fingR::BMM.pred(data = BMM.summary.mix, # Data.frame with the summary statistics of BMM predicted contribution (usually from `fingR::BMM.summary.mix`)
stats = "Median", # Summary statistics for source contribution ("Mean" or "Median")
# sample.id = "mix.names", # Column name for individual sample id
# source = "source", # Column name for source information
save.dir = dir.mod.BMM, # Optional - Directory path to save results
# note = "example" # Optional - Additional file name annotation
)
BMM.preds.mix[1:5,]
#> mix.names Median_Forest Median_Subsoil Median_Undecontaminated
#> 1 ManoDd_2106_00-01 0.001 0.524 0.364
#> 2 ManoDd_2106_01-02 0.008 0.853 0.049
#> 3 ManoDd_2106_02-03 0.001 0.739 0.101
#> 4 ManoDd_2106_03-04 0.001 0.586 0.210
#> 5 ManoDd_2106_04-05 0.001 0.582 0.375
# Ensure that the total predicted contribution sums to 1 or to 100%
BMM.preds.mixE <- fingR::ensure.total(data = BMM.preds.mix, # Data.frame with each sample predicted source contribution (usually from `fingR::BMM.pred`)
sample.id = "mix.names", # Column name for individual sample id
save.dir = dir.mod.BMM, # Optional - Directory path to save results
# note = "example" # Optional - Additional file name annotation
)
BMM.preds.mixE[1:5,]
#> mix.names Median_Forest Median_Subsoil Median_Undecontaminated total
#> 1 ManoDd_2106_00-01 0.001 0.590 0.409 1
#> 2 ManoDd_2106_01-02 0.009 0.938 0.053 1
#> 3 ManoDd_2106_02-03 0.001 0.879 0.120 1
#> 4 ManoDd_2106_03-04 0.001 0.735 0.264 1
#> 5 ManoDd_2106_04-05 0.001 0.607 0.392 1
The same sequence of functions is applied for the virtual mixture
predictions, resulting in BMM.summary.VM
from BMM.summary
,
BMM.preds.VM
from BMM.pred
, and BMM.pred.VME
from ensure.total
.
BMM.VM <- fingR::run.BMM(data = VM$full, # Data.frame with source and mixture information
class = "Class_decontamination", # Column with classification or grouping or sources and mixtures
mixture = "Virtual Mixture", # Labeling of mixtures in `class`
sample.id = "Sample_name", # Identifier for individual samples
tracers = tracers$mean.sd_KS, # Vector of tracers
uncertainty = unname(prop.uncertainties[tracers$mean.sd_KS]), # Vector of tracer uncertainties
n.iter = 30, # Number of iteration - TEST VERSION
save.dir = dir.mod.BMM, # Optional - Directory path to save results
note = "VM" # Optional - Additional file name annotation
)
# Summarise BMM model previsions
BMM.summary.VM <- fingR::BMM.summary(pred = BMM.VM, # Data.frame with the predicted contributions from BMM
# sample.id = "mix.names", # Column name for individual sample id
# source = "source", # Column name for source information
# value = "value", # Column name for individual sample predictions
save.dir = dir.mod.BMM, # Optional - Directory path to save results
note = "VM" # Optional - Additional file name annotation
)
# Extracts the median value of the previsions
BMM.preds.VM <- fingR::BMM.pred(data = BMM.summary.VM, # Data.frame with the summary statistics of BMM predicted contribution (usually from `fingR::BMM.summary.mix`)
stats = "Median", # Summary statistics for source contribution ("Mean" or "Median")
# sample.id = "mix.names", # Column name for individual sample id
# source = "source", # Column name for source information
save.dir = dir.mod.BMM, # Optional - Directory path to save results
note = "VM" # Optional - Additional file name annotation
)
# Ensure that the total predicted contribution sums to 1 or to 100%
BMM.pred.VME <- fingR::ensure.total(data = BMM.preds.VM, # Data.frame with each sample predicted source contribution (usually from `fingR::BMM.pred`)
sample.id = "mix.names", # Column name for individual sample id
save.dir = dir.mod.BMM, # Optional - Directory path to save results
note = "VM" # Optional - Additional file name annotation
)
BMM.pred.VME[1:5,]
#> mix.names Median_Forest Median_Subsoil Median_Undecontaminated total
#> 1 VM-001 0.175 0.708 0.117 1
#> 2 VM-002 0.068 0.931 0.001 1
#> 3 VM-003 0.001 0.937 0.062 1
#> 4 VM-004 0.078 0.687 0.235 1
#> 5 VM-005 0.001 0.961 0.038 1
Create a dedicated folder for MixSIAR modelling, and keep the directory
with dir.mod.MixSIAR
.
dir.create(file.path(dir.modelling, "MixSIAR/"), showWarnings = FALSE) # Create a folder
dir.mod.MixSIAR <- paste0(dir.modelling, "MixSIAR/") # Keep folder path
The MixSIAR package is designed to create and run Bayesian mixing models using Just Another Gibbs Sampler (JAGS). This package is widely used in the sediment source fingerprinting community to predict source contribution. To explore more about MixSIAR, including detailed tutorials, examples, and technical documentation, please visit the official MixSIAR website. Additionally, the source code and further resources can be found on the MixSIAR GitHub page (https://github.com/brianstock/MixSIAR/tree/3.1.9).
According to MixSIAR guide, installation should follow these steps:
-
Download and install/update R (https://cran.r-project.org/bin/).
-
Download and install JAGS (https://mcmc-jags.sourceforge.io/).
-
Open R and run:
devtools::install.packages("MixSIAR", dependencies = TRUE)
Or install the GitHub version:
# install.packages(remotes)
remotes::install_github("brianstock/MixSIAR", dependencies = TRUE)
- Load the
MixSIAR
package:
library(MixSIAR)
MixSIAR model requires data to be formated in a specific format to
load the information of source and mixture samples. The
data.for.MixSIAR
function generates CSV files that conform to the
format required by MixSIAR loading functions (i.e. load_mix_data
,
load_source_data
, and load_discr_data
). The function generates
three files with the properties according to the specified tracer
selection (in tracers
argument):
-
’MixSIAR_mix[_note].csv’: containing mixtures information.
-
’MixSIAR_sources[_note].csv’: containing the mean and standard deviation (sd) of the source classes.
-
’MixSIAR_discrimination[_note].csv’: containing is a matrix of zero as there is no trophic information in sediment source fingerprinting studies.
# Actual sediment samples
MixSIAR.dt <- fingR::data.for.MixSIAR(data = database, # Data.frame with source and mixture information
class = "Class_decontamination", # Column with classification or grouping or sources and mixtures
target = "Target", # Labeling of mixtures in `class`
tracers = tracers$mean.sd_KS, # Vector of tracers
sample.id = "Sample_name", # Column name for individual sample
save.dir = dir.mod.MixSIAR, # Optional - Directory path to save results
# note = "exemple", # Optional - Additional file name annotation
# fileEncoding = "latin1", # Optional - Encoding format used to save data
show.data = TRUE, # Optional - To show created data.frame
)
MixSIAR.dt$sources
#> MeanTOC_PrC SDTOC_PrC MeanTN_PrC SDTN_PrC MeanEDXRF_Al_mg.kg.1
#> Forest 10.92 4.25 0.70 0.25 67477.06
#> Subsoil 1.41 1.10 0.12 0.10 107782.19
#> Undecontaminated 5.16 2.05 0.42 0.11 84858.53
#> SDEDXRF_Al_mg.kg.1 n
#> Forest 12201.85 24
#> Subsoil 9672.67 10
#> Undecontaminated 9276.76 24
MixSIAR.dt$mix[1:5,]
#> Sample_name TOC_PrC TN_PrC EDXRF_Al_mg.kg.1
#> 1 ManoDd_2106_00-01 3.17 0.28 83869.76
#> 2 ManoDd_2106_01-02 2.81 0.24 80412.18
#> 3 ManoDd_2106_02-03 3.04 0.25 83741.07
#> 4 ManoDd_2106_03-04 3.23 0.27 84663.05
#> 5 ManoDd_2106_04-05 2.88 0.27 81518.68
MixSIAR.dt$discimination
#> MeanTOC_PrC SDTOC_PrC MeanTN_PrC SDTN_PrC MeanEDXRF_Al_mg.kg.1
#> Forest 0 0 0 0 0
#> Subsoil 0 0 0 0 0
#> Undecontaminated 0 0 0 0 0
#> SDEDXRF_Al_mg.kg.1
#> Forest 0
#> Subsoil 0
#> Undecontaminated 0
The same function is run for the virtual mixtures.
# Virtual Mixtures
fingR::data.for.MixSIAR(data = VM$full, # Data.frame with source and virtual mixture information
class = "Class_decontamination", # Column with classification or grouping or sources and mixtures
target = "Virtual Mixture", # Labeling of mixtures in `class`
tracers = tracers$mean.sd_KS, # Vector of tracers
sample.id = "Sample_name", # Column name for individual sample
save.dir = dir.mod.MixSIAR, # Optional - Directory path to save results
note = "VM", # Optional - Additional file name annotation
# fileEncoding = "latin1", # Optional - Encoding format used to save data
show.data = FALSE, # Optional - To show created data.frame
)
The MixSIAR functions load_mix_data
, load_source_data
, and
load_discr_data
are used to load the corresponding CSV files for the
actual mixtures: ’MixSIAR_mix.csv’ (actual mixtures),
’MixSIAR_sources.csv’ (sources), and ’MixSIAR_discrimination.csv’
(discrimination matrix) that were generated previously.
# Load sediment samples data
MixSIAR.mix <- MixSIAR::load_mix_data(filename = paste0(dir.mod.MixSIAR, "MixSIAR_mix.csv"), # CSV file with mixture data
iso_names = tracers$mean.sd_KS, # Vector of tracers
factors = c("Sample_name"), # Column name for individual sample
fac_random = FALSE, # Indicates if `factors` are a random effect
cont_effects = NULL # Specify the column with continuous effect (here none)
)
# Load source data
MixSIAR.source <- MixSIAR::load_source_data(filename = paste0(dir.mod.MixSIAR, "MixSIAR_sources.csv"), # CSV file with source data
source_factors = NULL, # No source factors specified
conc_dep = FALSE, # Consideration of concentration
data_type = "means", # Type of source group values (here means)
mix = MixSIAR.mix # `load_mix_data` generated object
)
# Load discrimination data
MixSIAR.discr <- MixSIAR::load_discr_data(filename = paste0(dir.mod.MixSIAR, "MixSIAR_discrimination.csv"), # CSv file with discrimination data
mix = MixSIAR.mix # `load_mix_data` generated object
)
The same functions are used to load the virtual mixtures information.
The virtual mixtures properties (’MixSIAR_mix_VM.csv’) are loaded using
load_mix_data
and the same source and discrimination matrix files as
for the actual mixtures.
# Load virtual mixtures data
MixSIAR.VM <- MixSIAR::load_mix_data(filename = paste0(dir.mod.MixSIAR, "MixSIAR_mix_VM.csv"), # CSV file with mixture data
iso_names = tracers$mean.sd_KS, # Vector of tracers
factors = c("Sample_name"), # Column name for individual sample
fac_random = FALSE, # Indicates if `factors` are a random effect
cont_effects = NULL # Specify the column with continuous effect (here none)
)
# Load source data
MixSIAR.source.VM <- MixSIAR::load_source_data(filename = paste0(dir.mod.MixSIAR, "MixSIAR_sources_VM.csv"), # CSV file with source data
source_factors = NULL, # No source factors specified
conc_dep = FALSE, # Consideration of concentration
data_type = "means", # Type of source group values (here means)
mix = MixSIAR.VM # `load_mix_data` generated object
)
# Load discrimination data
MixSIAR.discr.VM <- MixSIAR::load_discr_data(filename = paste0(dir.mod.MixSIAR, "MixSIAR_discrimination_VM.csv"), # CSv file with discrimination data
mix = MixSIAR.mix # `load_mix_data` generated object
)
The “JAGS is Just Another Gibbs Sampler. It is a program for analysis of
Bayesian hierarchical models using Markov Chain Monte Carlo (MCMC)
simulation not wholly” used in MixSIAR package (Stock et al.,
2022). The write_JAGS_model
writes the JAGS file to define the structure of the Bayesian model. The
JAGS model is saved at the path set in filename
and using the
specified txt file name (e.g. ’MixSIAR_model.txt’).
# Write JAGS model file for actual samples
MixSIAR::write_JAGS_model(filename = paste0(dir.mod.MixSIAR, "MixSIAR_model.txt"), # JAGS model file path and name
resid_err = FALSE, # Whether to include residual error
process_err = TRUE, # Whether to include process error
mix = MixSIAR.mix, # Mixtures loaded dataset (from MixSIAR::load_mix_data)
source = MixSIAR.source) # Sources loaded dataset (from MixSIAR::load_source_data)
Another JAGS model is written for virtual mixtures since the the mixture
samples and sources are set in mix
and source
.
# Write JAGS model file for virtual mixtures
MixSIAR::write_JAGS_model(filename = paste0(dir.mod.MixSIAR, "MixSIAR_model_VM.txt"), # JAGS model file path and name
resid_err = FALSE, # Whether to include residual error
process_err = TRUE, # Whether to include process error
mix = MixSIAR.VM, # Mixtures loaded dataset (from MixSIAR::load_mix_data)
source = MixSIAR.source.VM) # Sources loaded dataset (from MixSIAR::load_source_data)
When running MixSIAR model you must choose one of the MCMC run option (Stock et al., 2020)
run == | Chain Length | Burn-in | Thin | # Chains |
---|---|---|---|---|
“test” | 1,000 | 500 | 1 | 3 |
“very short” | 10,000 | 5,000 | 5 | 3 |
“short” | 50,000 | 25,000 | 25 | 3 |
“normal” | 100,000 | 50,000 | 50 | 3 |
“long” | 300,000 | 200,000 | 100 | 3 |
“very long” | 1,000,000 | 500,000 | 500 | 3 |
“extreme” | 3,000,000 | 1,500,000 | 500 | 3 |
In this example MCMC is set to “test”, which allows to quickly ensures that the code implemented is functional.
Warning
If “Error: .onload … ‘rgags’ -> could occurs when the R version is too old. You need at least R.2.2.
# Run MixSIAR model for sediment samples
jags.mix <- MixSIAR::run_model(run = "test", # Type of run
mix = MixSIAR.mix, # Mixtures loaded dataset (from MixSIAR::load_mix_data)
source = MixSIAR.source, # Sources loaded dataset (from MixSIAR::load_source_data)
discr = MixSIAR.discr, # Discrimination loaded dataset (from MixSIAR::load_discr_data)
model_filename = paste0(dir.mod.MixSIAR, "MixSIAR_model.txt") # JAGS model file path and name (generated with MixSIAR::write_JAGS_model)
)
#> Compiling model graph
#> Resolving undeclared variables
#> Allocating nodes
#> Graph information:
#> Observed stochastic nodes: 114
#> Unobserved stochastic nodes: 93
#> Total graph size: 2838
#>
#> Initializing model
# Run MixSIAR model for sediment samples
jags.VM <- MixSIAR::run_model(run = "test", # Type of run
mix = MixSIAR.VM, # Mixtures loaded dataset (from MixSIAR::load_mix_data)
source = MixSIAR.source.VM, # Sources loaded dataset (from MixSIAR::load_source_data)
discr = MixSIAR.discr.VM, # Discrimination loaded dataset (from MixSIAR::load_discr_data)
model_filename = paste0(dir.mod.MixSIAR, "MixSIAR_model_VM.txt") # JAGS model file path and name (generated with MixSIAR::write_JAGS_model)
)
#> Compiling model graph
#> Resolving undeclared variables
#> Allocating nodes
#> Graph information:
#> Observed stochastic nodes: 693
#> Unobserved stochastic nodes: 479
#> Total graph size: 16541
#>
#> Initializing model
Similar to the BMM modelling, the predictions from the MixSIAR must be processed to determine the contribution values for each target mixture.
Conversely to fingR::run.BMM
function, the MixSIAR::run_model
function does not save the Bayesian previsions directly. Therefore the
JAGS.summary
function saves the previsions
(’JAGS_prevision[_note].csv’) and provides a summary of the
predictions, including mean value, standard deviation, and various
quantiles (2.5, 5, 25, 50, 75, 95, 97.5) for each target mixture (saves
’JAGS_contrib[_note].csv’). From this summary, the JAGS.pred
function extracts the median and/or mean, stats = "Median"
,
stats = "Mean"
, or stats = c("Median", "Mean")
, value of source
contributions for each target mixture (saves
’JAGS_ordered_contrib[_note].csv’). Finally, the ensure.total
function ensures that the total of the predicted source contributions
from all sources sum to 1 or 100 % (saves
’corrected_contrib[_note].csv’).
## Summarise MixSIAR model previsions
MixSIAR.summary.mix <- fingR::JAGS.summary(jags.1 = jags.mix, # Results from MixSIAR::run_model
mix = MixSIAR.mix, # Mixtures loaded dataset (from MixSIAR::load_mix_data)
sources = MixSIAR.source, # Sources loaded dataset (from MixSIAR::load_source_data)
save.dir = dir.mod.MixSIAR, # Optional - Directory path to save results
save.pred = TRUE # To also save Monte Carlo predictions
)
#> [1] " MixSIAR Monte-Carlo chain predictions saved."
## Extracts the median value of the previsions
MixSIAR.preds.mix <- fingR::JAGS.pred(path = paste0(dir.mod.MixSIAR, "JAGS_contrib.csv"), # Directory path to `fingR::JAGS.summary` saved file
stats = "Median", # Summary statistics (`Median` or `Mean`)
save = TRUE, # To save results
# note = "example" # Optional - Additional file name annotation
)
## Ensure that the total predicted contribution sums to 1 or 100%
MixSIAR.preds.mixE <- fingR::ensure.total(data = MixSIAR.preds.mix, # data.frame with Bayensian model predictions (usually from `fingR::JAGS.pred`)
sample.id = "sample", # Identifier for individual samples
save.dir = dir.mod.MixSIAR, # Optional - Directory path to save results
# note = "example # Optional - Additional file name annotation
)
MixSIAR.preds.mixE[1:5,]
#> sample Median_Forest Median_Subsoil Median_Undecontaminated total
#> 1 ManoDd_2106_00-01 0.171 0.022 0.807 0.999
#> 2 ManoDd_2106_01-02 0.057 0.021 0.922 0.999
#> 3 ManoDd_2106_02-03 0.063 0.024 0.913 0.999
#> 4 ManoDd_2106_03-04 0.064 0.022 0.914 1.000
#> 5 ManoDd_2106_04-05 0.063 0.023 0.914 1.000
The same sequence of functions is used for the virtual mixture
predictions, resulting in MixSIAR.summary.VM
from JAGS.summary
,
MixSIAR.preds.VM
from JAGS.pred
, and MixSIAR.pred.VME
from
ensure.total
.
The accuracy of unmixing models is evaluated by comparing the
theoretical source contribution values (VM.contrib
), representing the
known contributions of the virtual mixtures, with the predicted
contributions of the virtual mixtures (BMM.preds.VM
) generated by the
unmixing model.
Several criteria can be used to evaluate the accuracy of the predictions:
-
Uncertainty: Assessed using prediction interval widths (e.g., W50, W95; for Bayesian models).
-
Residual Error or Bias: Evaluated using metrics such as mean error (ME).
-
Performance: Measured using metrics such as the squared Pearson correlation coefficient (r2), root-mean-square error (RMSE), Nash-Sutcliffe modeling efficiency coefficient (NSE), and continuous ranked probability score (CRPS; for Bayesian models).
Modelling accuracy statistics could be interpreted the following way (Chalaux-Clergue et al., 2024a): “Higher values of W50 indicate a wider distribution, which is related to a higher uncertainty. The sign of the ME indicates the direction of the bias, i.e. an overestimation or underestimation (positive or negative value, respectively). As ME is affected by cancellation, a ME of zero can also reflect a balanced distribution of predictions around the 1:1 line. Although this is not a bias, it does not mean that the model outputs are devoid of errors. The RMSE is a measure of the accuracy and allows us to calculate prediction errors of different models for a particular dataset. RMSE is always positive, and its ideal value is zero, which indicates a perfect fit to the data. As RMSE depends on the squared error, it is sensitive to outliers. The r2 describes how linear the prediction is. The NSE indicates the magnitude of variance explained by the model, i.e. how well the predictions match with the observations. A negative RMSE indicates that the mean of the measured values provides a better predictor than the model. The joint use of r2 and NSE allows for a better appreciation of the distribution shape of predictions and thus facilitates the understanding of the nature of model prediction errors. The CRPS evaluates both the accuracy and sharpness (i.e. precision) of a distribution of predicted continuous values from a probabilistic model for each sample (Matheson and Winkler, 1976). The CRPS is minimised when the observed value corresponds to a high probability value in the distribution of model outputs.”
To illustrate, only predictions made using BMM are used, however, it also applies to predictions made using MixSIAR.
The eval.groups
function calculates general accuracy metrics (ME,
RMSE, r2, NSE) by comparing the theoretical source
contribution values (provided in df.obs
) with the predicted values
(provided in df.pred
) (saves ’stats[_note].csv’). In df.obs , the
virtual mixtures are identified by the column Sample_name , while in
df.pred
, they are identified by the column mix.names
, which is
specified in by
(The resulting data frame is saved as
’ObsPred[_note].csv’).
BMM.stats <- fingR::eval.groups(df.obs = VM.contrib, # data.frame of theoretical contribution
df.pred = BMM.pred.VME %>% # data.frame of predicted contribution
dplyr::select(-total), ## remove the $total column from ensured data.frame
by = c("Sample_name" = "mix.names"), # Variable to join df.obs and df.pred
save.dir = dir.mod.BMM, # Optional - Directory path to save results
#note = "example" # Optional - Additional file name annotation
)
BMM.stats
#> Type Source ME RMSE r2 NSE
#> 1 Median Forest -0.16 0.24 0.53 0.08
#> 2 Median Subsoil 0.41 0.48 0.37 -2.63
#> 3 Median Undecontaminated -0.25 0.39 0.01 -1.44
The CRPS
function calculates the continuous ranked probability score
(CRPS). The predicted values of the virtual mixture (prev
) are the
previsions for each sample from the Bayesian models. Since a large
number of previsions are made when running the Bayesian models, it is
possible to specify the path to the file containing the previsions
rather than loading them into the workspace in path.to.prev
(previously saved by BMM.summary
)
The function returns and, if save.path
is set, saves two data frames
at the save.dir
direction or if save = TRUE
is set, saves them at
the direction indicated by path.to.prev
:
-
$samples
- ’CRPS[_note].csv’: contains the CRPS values for each source group for each virtual mixtures. -
$mean
- ’CRPS[_note]_mean.csv’: contains the values of the mean CRPS value per contribution source class group for each virtual mixtures.
# Calculate prediction CRPS values
BMM.CRPS <- fingR::CRPS(obs = VM.contrib, # data.frame of theoretical contribution
prev = read.csv(paste0(dir.mod.BMM, "BMM_prevision_VM.csv")), # data.frame of Monte-Carlo simulation prevision (usually from fingR::JAGS.summary)
source.groups = c("Forest", "Subsoil", "Undecontaminated"), # vector of source group labels
mean.cal = TRUE, # Calculate the group mean CRPS (default FALSE)
save.dir = dir.mod.BMM, # Optional - Directory path to save results
# note = "example" # Optional - Additional file name annotation
)
#> Lade nötiges Paket: scoringRules
BMM.CRPS$samples[1:5,]
#> Sample_name Forest Subsoil Undecontaminated
#> 1 VM-001 0.0841 0.2615 0.5343
#> 2 VM-002 0.0710 0.3822 0.6758
#> 3 VM-003 0.0225 0.3324 0.4596
#> 4 VM-004 0.0774 0.2062 0.4256
#> 5 VM-005 0.0360 0.2387 0.3727
BMM.CRPS$mean
#> Source CRPS.mean
#> 1 Forest 0.1375
#> 2 Subsoil 0.1645
#> 3 Undecontaminated 0.1886
The interval.width
function calculates two prediction interval widths:
W50 and W95. The W50 contains 50 % of the previsions (Q75-Q25) and the
W95 contains 95 % of the previsions (Q97.5-Q2.5). Since the interval
width is only calculated on the Bayesian model previsions, it is
possible to calculate them on actual and virtual mixtures. Similarly to
CRPS
, it is possible to specify the path to the file containing the
Bayesian model previsions, rather than loading them into the workspace
with path.to.prev
(previously saved by BMM.summary
).
The function returns and, if save.path
is set, saves two data frames
at the save.dir
direction or if save = TRUE
is set, saves them at
the direction indicated by path.to.prev
:
-
$samples
- ’Interval_width[_note].csv’: contains both interval width values for each source group for each virtual mixtures. -
$mean
- ’Interval_width_mean[_note].csv’: contains the mean of both interval width values per contribution source class group for each virtual mixtures.
# Calculate prediction interval width (W95, W50)
BMM.predWidth <- fingR::interval.width(path.to.prev = paste0(dir.mod.BMM, "BMM_prevision_VM.csv"), # Directory to Monte-Carlo simulation prevision CSV file (usually from fingR::JAGS.summary)
mean.cal = TRUE, # Calculate the group mean interval width (default FALSE)
save = FALSE, # To save results at the same location as path.to.prev (default FALSE)
save.dir = dir.mod.BMM, # Optional - Directory path to save results
# note = "exemple" # Optional - Additional file name annotation
)
BMM.predWidth$samples[1:6,]
#> mix.names source W50 W95
#> 1 VM-001 Forest 0.329 0.997
#> 2 VM-001 Subsoil 0.820 0.997
#> 3 VM-001 Undecontaminated 0.464 0.997
#> 4 VM-002 Forest 0.319 0.997
#> 5 VM-002 Subsoil 0.623 0.997
#> 6 VM-002 Undecontaminated 0.224 0.997
BMM.predWidth$mean
#> # A tibble: 3 × 3
#> Source W50.mean W95.mean
#> <chr> <dbl> <dbl>
#> 1 Forest 0.361 0.893
#> 2 Subsoil 0.589 0.963
#> 3 Undecontaminated 0.407 0.931
The ESP
function calculates the Encompassed Sample Prediction (ESP).
The ESP is a newly introduced statistics in Chalaux-Clergue et
al. (2024d)
and was created to assess the transferability of the statistics
calculated on virtual mixtures to actual sediment samples. The ESP was
calculated as the percentage of actual samples for which the predicted
contributions remained within the lowest and the highest predicted
contributions obtained for the virtual mixtures. When expressed as a
percentage, ESP ranges from 0 to 100 %, the latter providing an optimal
value. Values close to 100 % indicate a higher transferability of
modelling evaluation statistics calculated on virtual mixture to actual
sediment samples.
sources.lvl <- c("Forest", "Subsoil", "Undecontaminated")
# Calculate encompassed sample predictions (ESP)
BMM.ESP <- fingR::ESP(obs = BMM.preds.VM, # data.frame with virtual mixtures predicted contributions
pred = BMM.preds.mixE, # Actual sediment samples predicted contributions
sources = paste0("Median_", sources.lvl), # Sources labels in prediction objects
# sources.obs = "", # Optional - character indicating sample id in obs data.frame
# sources.pred = "", # Optional - character indicating sample id in pred data.frame
count = "Both" # Count 'Number' and 'Percentage'
)
BMM.ESP
#> Source ESP.Number ESP.Percentage
#> Median_Forest Forest 36 95
#> Median_Subsoil Subsoil 38 100
#> Median_Undecontaminated Undecontaminated 17 45
The upcoming version will introduce the Python version of fingR, making the sediment source fingerprinting tools available to a broader community.
Graphical support functions such as Bayesian prediction density plots, prediction vs. observation plots, and ternary diagrams are under development.
If you encounter a clear bug, have a question or suggestion, please either open an Issues or send an email to Thomas Chalaux-Clergue (thomaschalaux@icloud.com) and Amaury Bardelle (amaury.bardelle@icloud.com).
To cite this packages:
utils::citation(package = "fingR")
#> To cite the 'fingR' package in publications please use:
#>
#> Chalaux-Clergue et al. (2025). fingR: A package to support sediment
#> source fingerprinting studies, Zenodo [Package]:
#> https://doi.org/10.5281/zenodo.8293595, Github [Package]:
#> https://github.com/tchalauxclergue/fingR, Version = 2.1.4.
#>
#> A BibTeX entry for LaTeX users is
#>
#> @Manual{,
#> title = {fingR: A support to sediment source fingerprinting studies},
#> author = {{Chalaux-Clergue} and {Thomas} and {Bizeul} and {Rémi} and {Amaury} and {Bardelle}},
#> year = {2025},
#> month = {7},
#> note = {R package version 2.1.4},
#> doi = {https://doi.org/10.5281/zenodo.8293595},
#> url = {https://github.com/tchalauxclergue/fingR},
#> }
-
Batista, P. V. G., Laceby, J. P., & Evrard, O. (2022). How to evaluate sediment fingerprinting source apportionments. Journal of Soils and Sediments, 22(4), 1315–1328. https://doi.org/10.1007/s11368-022-03157-4
-
Batista, P. V. G., Laceby, J. P., Silva, M. L. N., Tassinari, D., Bispo, D. F. A., Curi, N., Davies, J., & Quinton, J. N. (2019). Using pedological knowledge to improve sediment source apportionment in tropical environments. Journal of Soils and Sediments, 19(9), 3274–3289. https://doi.org/10.1007/s11368-018-2199-5
-
Chalaux-Clergue, T., Bizeul, R., Batista, P. V. G., Martínez-Carreras, N., Laceby, J. P., & Evrard, O. (2024a). Sensitivity of source sediment fingerprinting to tracer selection methods. SOIL, 10(1), 109–138. https://doi.org/10.5194/soil-10-109-2024
-
Chalaux-Clergue, T., Bizeul, R., Foucher, A., & Evrard, O. (2024b). A unified template for sediment source fingerprinting databases (Version 24.03.01) [Data set]. [object Object]. https://doi.org/10.5281/ZENODO.10725788
-
Chalaux-Clergue, T., Evrard, O., Durand, R., Caumon, A., Hayashi, S., Tsuji, H., Huon, S., Vaury, V., Wakiyama, Y., Nakao, A., Laceby, J. P., & Onda, Y. (2022c). Organic matter, geochemical and colorimetric properties of potential source material, target sediment and laboratory mixtures for conducting sediment fingerprinting approaches in the Mano Dam Reservoir (Hayama Lake) catchment, Fukushima Prefecture, Japan. (Version 1) [Data set]. Zenodo. https://doi.org/10.5281/ZENODO.7081094
-
Chalaux-Clergue, T., Foucher, A., Chaboche, P.-A., Hayashi, S., Tsuji, H., Wakiyama, Y., Huon, S., Vandromme, R., Cerdan, O., Nakao, A., & Evrard, O. (2024d). Impacts of farmland decontamination on 137Cs transfers in rivers after Fukushima nuclear accident: Evidence from a retrospective sediment core study. Science of The Total Environment, 947, 174546. https://doi.org/10.1016/j.scitotenv.2024.174546
-
Evrard, O., Batista, P. V. G., Company, J., Dabrin, A., Foucher, A., Frankl, A., García-Comendador, J., Huguet, A., Lake, N., Lizaga, I., Martínez‑Carreras, N., Navratil, O., Pignol, C., & Sellier, V. (2022). Improving the design and implementation of sediment fingerprinting studies: Summary and outcomes of the TRACING 2021 Scientific School. Journal of Soils and Sediments, 22(6), 1648–1661. https://doi.org/10.1007/s11368-022-03203-1
-
Laceby, J. P., & Olley, J. (2015). An examination of geochemical modelling approaches to tracing sediment sources incorporating distribution mixing and elemental correlations. Hydrological Processes, 29(6), 1669–1685. https://doi.org/10.1002/hyp.10287
-
Martínez-Carreras, N., Gallart, F., Iffly, J. F., Pfister, L., Walling, D. E., & Krein, A. (2008). Uncertainty assessment in suspended sediment fingerprinting based on tracer mixing models: A case study from Luxembourg. IAHS Publication, 325, 94.
-
Matheson, J. E., & Winkler, R. L. (1976). Scoring Rules for Continuous Probability Distributions. Management Science, 22(10), 1087–1096. https://doi.org/10.1287/mnsc.22.10.1087
-
Sherriff, S. C., Franks, S. W., Rowan, J. S., Fenton, O., & Ó’hUallacháin, D. (2015). Uncertainty-based assessment of tracer selection, tracer non-conservativeness and multiple solutions in sediment fingerprinting using synthetic and field data. Journal of Soils and Sediments, 15(10), 2101–2116. https://doi.org/10.1007/s11368-015-1123-5
-
Stock, B. C., Semmens, B. X., Ward, E. J., Parnell, A. C., & Phillips, D. L. (2020). MixSIAR: Bayesian Mixing Models in R (Version 3.1.12) [Computer software]. https://doi.org/10.5281/zenodo.1209993
-
Stock, B. C., Semmens, B. X., Ward, E. J., Parnell, A. C., & Phillips, D. L. (2022). JAGS: Bayesian Mixing Models in R (Version 4.3.1) [Computer software]. https://doi.org/10.5281/zenodo.1209993