Spatial-Temporal Patterns of Per- and Polyfluoroalkyl Substances (PFAS) in the Biota of the Laurentian Great Lakes: A Meta-Analysis (Martin et al., 2025)

Directory

This GitHub repository contains the R code (R version 4.4.1 – Race for Your Life¹), finalized dataset (i.e., the dataset produced after performing imputation of missing values and adding in additional variables), and the generalized additive models produced through the methodological workflow described in detail in Martin et al. (2025). The different parts of the repository are as follows:

Coding Workflow folder: this folder contains the six R scripts that were used to process the raw data files, impute left-censored concentration values, finalize the data frame, create generalized additive models for the concentrations of six PFAS (PFOS, PFNA, PFDA, PFUnA, PFDoA, and PFTrDA), test the models (using 80:20 holdout validation along with a supplementary validation data set), and generate figures and tables. The files within this folder are:
- Step_1_Data_Import_and_Spatial_Join.R - Imports the raw data files (initially 3,418 data points collected across 53 studies surveying a total of 79 PFAS), conducts a spatial join, with our custom shapefile of the Great Lakes watersheds, to assign watershed designations and remove points outside the study area (final count of 2,489 samples collected across 50 studies), and performs some initial formatting of the data frame
- Step_2_Imputation.R - Performs imputation of left-censored concentrations (i.e., values entered as "less than a limit of detection or quantification") for 10 PFAS (PFOS, PFDS, PFEtCHxS, PFNA, PFDA, PFUnA, PFDoA, PFTrDA, PFTeDA, and PFPeDA) that met our initial criteria for inclusion: a detection frequency ≥ 60%, at least 3 quantified samples for each watershed, and at least two watersheds represented in the contaminant's samples. Two algorithms (from the zCompositions package²) are used in imputation: a log-ratio Expectation-Maximization Algorithm, from the function lrEM(), and a log-ratio Data Augmentation Algorithm, from the function lrDA(). Because lrDA() will produce slightly different imputed estimates each time the algorithm is run, the finalized imputation results that were going to be used in modeling, generated on July 12th, 2024 with Version 1.5.0-4 of the zCompositions package, were saved as a .csv file (Finalized_Concentration_Values_07_12_2024.csv) to be loaded in during the next step
- Step_3_Additional_Variables.R - Finalizes the data frame that will be used in modeling: (i) Additional variables (e.g., Revised_Tissue, Composite, Trophic_Level, Waterbody_Type, Water_Level, etc.) are inserted into the data frame, (ii) the data frame's columns are reordered, (iii) the imputed concentration values to be used in modeling (saved in Finalized_Concentration_Values_07_12_2024.csv) are attached to the data frame, and (iv) extraneous columns are removed from the data frame, including the four contaminants (PFDS, PFEtCHxS, PFTeDA, and PFPeDA) that do not meet our final criterion for modeling (i.e., a detection frequency ≥ 80%). The finalized data frame is saved as Finalized_Imputed_Data_Frame.csv
- Step_4_Supplementary_Validation_Dataset.R - Contains all code used to format the supplementary validation data (i.e., the set of 8 raw data points that will be used with the test data set during modeling to assess the predictive accuracy of our models). The same procedures used in Step 3 are employed here, and the finalized data frame is saved as Finalized_Supp_Validation_Data_Frame.csv
- Step_5_Modeling.R - Constructs generalized additive models for the six selected analytes: PFOS, PFNA, PFDA, PFUnA, PFDoA, and PFTrDA). Models are built and evaluated using 80:20 holdout validation (80% training and 20% test data), using a number of functions provided by the mgcv³, caret⁴, gratia⁵, car⁶, and DHARMa⁷ packages, along with the supplementary validation data frame produced from Step 4. The six finalized models are saved in the Models folder
- Step_6_Results_and_Figures.R - Extracts summary statistics and concentration estimates from the six models, and constructs the majority of tables and figures that are presented in the main text and supplementary information of this project. Tables and figures are saved in the Tables and Figures folder, using the ggsave() function from the ggplot2 package⁸
Great Lakes Shapefiles folder: this folder contains the finalized shapefile, GL_Watershed_shapefile.shp, that was used to assign data points to one of the five watersheds of the Great Lakes, as well as to remove any points that fell outside the Great Lakes watersheds. This finalized vector layer, created with QGIS software version 3.36.0 – Maidenhead⁹, cobmines information from shapefiles characterizing the Great Lakes subbasins¹⁰ and the Saint Lawrence River¹¹.
Models folder: this folder contains the six finalized generalized additive models produced from Step 5, saved as .Rdata files. The estimates produced by these models are reported in the main text and supplement of Martin et al. (2025)
Tables and Figures folder: this folder stores the majority of tables and figures that are presented in the main text and supplementary information of Martin et al. (2025). Specifically, in addition to the first half of the graphical abstract (Conceptual_Figure.png), this folder contains Figure 1-4 and S3-S7, the relative concentration (%) sections of Table 2 and S5-S6, and a visual from Step 1 (Step_1_Visual.png) of the distribution of all 3,418 initial samples prior to any data frame formatting
Finalized_Concentration_Values_07_12_2024.csv file: this file contains the finalized imputation results, generated and saved on July 12th, 2024 with Version 1.5.0-4 of the zCompositions package², that were derived from the output of the lrEM() and lrDA() functions for 10 initial contaminants (PFOS, PFDS, PFEtCHxS, PFNA, PFDA, PFUnA, PFDoA, PFTrDA, PFTeDA, and PFPeDA) and that were used to generate the six models (for PFOS, PFNA, PFDA, PFUnA, PFDoA, and PFTrDA) reported in this project
Finalized_Imputed_Data_Frame.csv file: this file stores the finalized data frame used in modeling. A total of 2,489 samples from 50 studies are included in the file, and those samples have been heavily formatted (including going through the process of imputation) to achieve the end product reported here. Concentration values are directly taken from Finalized_Concentration_Values_07_12_2024.csv. For a thorough description of the different variables included in the data frame, please see the Supplementary Information (Text S1-2, Table S1-3). The data frame has not been reviewed or approved by agencies or entities that may have been involved in the individual studies included in the meta-analysis. For copies of the original raw data files, we recommend contacting the corresponding authors of the appropriate papers and/or accessing the data portals where those files are available. The data sources we used for this project, along with full citations for the 50 included studies, are provided in the Supplementary Information (Table S1).
Finalized_Supp_Validation_Data_Frame.csv file: this file contains the set of 8 formatted data points that are used with the test set in Step 5 to assess the predictive accuracy of the six generalized additive models. The set of included variables, as well as the general formatting, mirror Finalized_Imputed_Data_Frame.csv
PFAS_Review_supportingFunctions.R file: this R script contains a couple of custom functions, collect.frames() and exp10(), that were utilized by the scripts found in the Coding Workflow folder. In particular, collect.frames() was used in Step_1_Data_Import_and_Spatial_Join.R to read in and bind the 53 raw .csv data files (from the 53 initial studies) together into one data frame that could be formatted and saved for subsequent steps in the workflow

Summary

If you want to access the complete, finalized and imputed data frame used in modeling, the endpoint of the workflow for this meta-analysis, download Finalized_Imputed_Data_Frame.csv. You can also download Finalized_Supp_Validation_Data_Frame.csv and bind the two files together to assemble the full breadth of samples (n = 2,497) compiled by the authors. If you wish to acess the six generalized additive models reported in the publication, download the files found in the Models folder. And finally, if you wish to repeat/verify the entire workflow of this project, clone the entire repository and execute the scripts in the Coding Workflow folder: as long as the setwd() line at the top of each script correctly specifies the path to the repository on your local computer, and as long as the correct version of R¹ is installed, along with all the packages listed in each script (specifically the versions of those packages that were available in early July of 2024, see Main Text and Supplementary Information for more specific version information), the code should run without any errors.

Citation Information

IMPORTANT
If you wish to use the finalized dataset for research purposes, please cite Martin et al. (2025), as well as the data repository location on Zenodo where Finalized_Imputed_Data_Frame.csv and Finalized_Supp_Validation_Data_Frame.csv have been archived and can be downloaded (https://doi.org/10.5281/zenodo.16173644)

The Zenodo archive also offers additional information about the studes included in this meta-analysis and the variables that were incorporated into the two data frames

License

Shield:

This work is licensed under a Creative Commons Attribution 4.0 International License.

R Core Team. (2024). R: A language and environment for statistical computing. (Version 4.4.1 – Race for Your Life) [Computer software]. R Foundation for Statistical Computing, Vienna, Austria. https://www.R-project.org ↩ ↩²
Palarea-Albaladejo, J., & Martín-Fernández, J. A. (2015). zCompositions—R package for multivariate imputation of left-censored data under a compositional approach. Chemometrics and Intelligent Laboratory Systems, 143, 85–96. https://doi.org/10.1016/j.chemolab.2015.02.019 ↩ ↩²
Wood, S. (2023). mgcv: Mixed GAM Computation Vehicle with Automatic Smoothness Estimation (Version 1.9-1) [R package]. https://CRAN.R-project.org/package=mgcv
Wood, S. N. (2017). Generalized Additive Models: An Introduction with R, Second Edition (2nd ed.). Chapman and Hall/CRC. https://doi.org/10.1201/9781315370279 ↩
Kuhn, Max (2008). “Building Predictive Models in R Using the caret Package.” Journal of Statistical Software, 28(5), 1–26. https://doi.org/10.18637/jss.v028.i05 ↩
Simpson, G. (2024). gratia: Graceful ggplot-Based Graphics and Other Functions for GAMs Fitted using mgcv (Version 0.9.2) [R package]. https://doi.org/10.32614/CRAN.package.gratia ↩
Fox J, Weisberg S (2019). An R Companion to Applied Regression (3rd ed.). Sage, Thousand Oaks CA. https://www.john-fox.ca/Companion/. ↩
Hartig, F. (2022). DHARMa: Residual diagnostics for hierarchical (multi-level/mixed) regression models (Version 0.4.6) [R package]. http://florianhartig.github.io/DHARMa/ ↩
Wickham H (2016). ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York. ISBN 978-3-319-24277-4, https://ggplot2.tidyverse.org. ↩
QGIS Development Team. (2024). QGIS Geographic Information System (Version 3.36.0 – Maidenhead) [Computer software]. QGIS Association. https://www.qgis.org ↩
US Geological Survey. (2010). Great Lakes and Watersheds Shapefiles [Shapefile]. USGS ScienceBase-Catalog. https://www.sciencebase.gov/catalog/item/530f8a0ee4b0e7e46bd300dd ↩
Flanders Marine Institute. (2017). Gulf of Saint Lawrence [Shapefile]. Marine Regions (the VLIMAR Gazetteer and the VLIZ Maritime Boundaries Geodatabase). http://marineregions.org/mrgid/4290
Natural Resources Canada & US Geological Survey. (2010). North American Atlas – Basin Watersheds [Shapefile]. Government of Canada. http://www.cec.org/north-american-environmental-atlas/watersheds/ ↩

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

Spatial-Temporal Patterns of Per- and Polyfluoroalkyl Substances (PFAS) in the Biota of the Laurentian Great Lakes: A Meta-Analysis (Martin et al., 2025)

Directory

Summary

Citation Information

License

About

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 95 Commits
Coding Workflow		Coding Workflow
Great Lakes Shapefiles		Great Lakes Shapefiles
Models		Models
Tables and Figures		Tables and Figures
.gitattributes		.gitattributes
.gitignore		.gitignore
Finalized_Concentration_Values_07_12_2024.csv		Finalized_Concentration_Values_07_12_2024.csv
Finalized_Imputed_Data_Frame.csv		Finalized_Imputed_Data_Frame.csv
Finalized_Supp_Validation_Data_Frame.csv		Finalized_Supp_Validation_Data_Frame.csv
LICENSE.txt		LICENSE.txt
PFAS_Review_supportingFunctions.R		PFAS_Review_supportingFunctions.R
README.md		README.md

License

peter-o-martin/Martin-et-al.-2025

Folders and files

Latest commit

History

Repository files navigation

Spatial-Temporal Patterns of Per- and Polyfluoroalkyl Substances (PFAS) in the Biota of the Laurentian Great Lakes: A Meta-Analysis (Martin et al., 2025)

Directory

Summary

Citation Information

License

Footnotes

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Languages