This repository provides code for gap-filling carbon flux data measured with eddy covariance using Marginal Distribution Sampling (MDS) and Extreme Gradient Boosting (XGB), respectively.
- Folder Environment: python environment for XGBoost
- Flux data for Bartlett Research Forest, siteID: US-Bar, https://ameriflux.lbl.gov/sites/siteinfo/US-Bar
STEP 01 MDS_EProc_object.R
- IQR fitering
- u* filtering
- save RDS object
STEP 02 train_XGB.ipynb The script utilizes BayesSearchCV from scikit-optimize, efficiently completing the gap-filling of a 13-year time series with XGBoost (XGB) for FCO2 in approximately 20 minutes. It employs 10-fold cross-validation for model evaluation. This script consists of the following steps:
- 01: Finding the best hyperparameters for XGBoost.
- 02: Training the model using the best hyperparameters determined in Step 1.
- 03: Evaluating the model performance using 10-fold cross-validation. Several model performance metrics (RMSE, R2, and bias) are computed, and learning curves are plotted.
- 04: Plotting Feature (variable) importance
- 05: Computing annual sums of FCO2
- 06: Computing monthly sums of FCO2
STEP 03 MDS_10_CV.Rmd
- Gap filling using MDS, following the same cross-validation (10 fold) to ensure the best comparison between MDS and XGB.
STEP 04 ANN_create_synthetic_data.ipynb
- The script was executed on Google Colab with the purpose of generating synthetic data using an Artificial Neural Network (ANN). The script is adapted from Vekuri et al. 2023, https://doi.org/10.1038/s41598-023-28827-2, by adding GCC as input feature.
STEP 05 generate_data_for_scenarios.Rmd
- To generate data for different experimental scenario 1,2 and 3, as described in the manuscript.
- Experimental scenario 1: different gap lengths
- Experimental scenario 2: different gap timing or locations
- Experimental scenario 3: different subset (by year) of input data
STEP 06 repeat STEP 02 and 03 for different scenarios