Update to_do_data_analysis.md
heleenderoo authored Dec 11, 2023
1 parent 7635f38 commit 298476f
Showing 1 changed file with 27 additions and 27 deletions.
54 changes: 27 additions & 27 deletions src/transformation_to_layer1/to_do_data_analysis.md
@@ -6,40 +6,44 @@ Please add any observed data issues to this file

* R script: `./src/transformation_to_layer1/solid_soil_data_transformation_to_layer1.R`
* Gap-filling from external data sources and internal gap-filling using assumptions:
+ ~~Folder with direct partner communication (AFSCDB.LII.2.1 subfolder) - at least Austria, Spain, bulk density and coarse fragments from Sweden… (Note: still necessary?)~~
+ Folder AFSCDB.LII.2.2: also check whether there are any plot surveys that do not appear in so_som at all. Use the original data forms (with different repetitions etc), not the aggregated version.
+ ~~Folder BIOSOIL.LII - at least Spain, Finland… (Note: missing Spanish data now in layer 0)~~
+ Anything to gap-fill for LI? (e.g. folders BIOSOIL.LI, FSCDB.LI.1?) FSCDB.LI.1: check whether any “OPT” data are currently missing. Oldest survey from Italy seems to be missing, Latvia and Austria probably incomplete too?
+ FSCDB.LI: add profiles to pfh and prf that are lacking (check if this is actually the case: there may be an issue with different former German code plots with the same plot_id that now look like the same plot (same partner_code))
+ Add column with data source for different variables (Note: partly completed)
* profiles with data until 20 cm or 40 cm: ~~(i) assume carbon density in subsoil does not change with time; (ii) take a fixed carbon density of 0.1 ton C cm-1 ha-1 between 80 and 100 cm; or~~ (iii) Monte Carlo machine learning prediction of carbon density at a depth of 100 cm (assessing confidence interval)
* considerable uncertainty in code_horizon_coarse_vol (in “pfh”) → first priority: coarse fragment fractions from other survey year?
* Harmonise “horizon_coarse_weight” if volumetric instead of wt% (e.g. Slovak Republic), as indicated by partners in PIRs.
* Harmonise soil textures where needed (e.g. Wallonia) to 63 µm limit using R soil texture wizard
* Check “other_obs” columns properly
* Are coordinates fine? Was there a systematic coordinate issue in the Pyrenees in Spain? Correct coordinate sign mistakes (e.g. in Spain, where the minus sign was clearly sometimes forgotten, since it was there for other records of the same plot)
* Update plot codes LI UK back to original for 1994
* Solve issue with non-unique German plot codes in LI
* survey_year in "som" and "pfh" does not always correspond with the actual sampling year (i.e. sometimes lab analysis year). Correct by means of survey_year in "prf" and "pls"
* Add potential sources of uncertainty, for example ring test standard deviations. In theory, this lab analytical uncertainty as well as sample pretreatment uncertainty should be somehow included along with spatial variation in the variation between plot repetitions. At this stage, we will just compare the order of magnitude of the ring test standard deviation with the standard deviation between plot repetitions. At this stage, no need to exclude any data on the basis of bad ring tests.
* Make a separate script for LI versus LII, in which you can choose the survey form (so_som, so_pfh, s1_som, s1_pfh) and the variable to calculate stocks from. Different methodological decisions (e.g. about internal gap-filling opportunities) should be changeable via function input variables. List these important methodological variables and their options. A file with this methodological information should be exported as metadata in the output.
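
The coordinate sign correction named in the list above (a minus sign forgotten on some records while other records of the same plot carry it) could be sketched as follows. This is a minimal illustration in Python, not the project's R code. The field names `plot_id` and `longitude`, the `tolerance` value and the majority rule are all assumptions made for the example.

```python
from collections import defaultdict


def fix_longitude_signs(records, tolerance=0.05):
    """Return copies of the records with forgotten minus signs restored.

    A positive longitude is flipped only when (i) a negative longitude of the
    same plot matches it in absolute value within `tolerance` degrees and
    (ii) negative values are the majority for that plot.
    All field names are hypothetical.
    """
    by_plot = defaultdict(list)
    for rec in records:
        by_plot[rec["plot_id"]].append(rec["longitude"])

    fixed = []
    for rec in records:
        lon = rec["longitude"]
        values = by_plot[rec["plot_id"]]
        negatives = [x for x in values if x < 0]
        positives = [x for x in values if x > 0]
        mirrored = any(abs(abs(x) - lon) <= tolerance for x in negatives)

        new_rec = dict(rec)
        if lon > 0 and mirrored and len(negatives) > len(positives):
            new_rec["longitude"] = -lon
            new_rec["longitude_corrected"] = True
        else:
            new_rec["longitude_corrected"] = False
        fixed.append(new_rec)
    return fixed
```

Keeping a `longitude_corrected` flag next to the value means every automatic change stays traceable, which fits the separate item on adding a data source column.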


* ~~in s1_som files, for all MINERAL horizons, set organic_layer_weight==NA if bulk density value is given.~~ (Note: completed)
+ ~~Folder with direct partner communication (AFSCDB.LII.2.1 subfolder) - at least Austria, Spain, bulk density and coarse fragments from Sweden… (Note: still necessary?)~~
+ ~~Folder BIOSOIL.LII - at least Spain, Finland… (Note: missing Spanish data now in layer 0)~~
+ ~~PIRs! (note separate e-mail Sture Wijk; Czech pH-H2O in “pfh”) + add column with validation code different parameters?~~
+ ~~Delete impossible values or codes in data? (e.g. forest type 32)~~
+ Add column with data source for different variables (Note: partly completed)
+ ~~Apply further internal gap-filling possibilities (e.g. constant bulk density) + list assumptions (see presentation Bruno Vienna)~~
+ ~~"som" survey forms gap-filling for C stocks:~~
- ~~bulk_density: assumption constant over time; from "pfh" ("horizon_bulk_dens_measure", "horizon_bulk_dens_est"); "sw_swc" ("bulk_density");~~ pedotransfer functions/machine learning
- ~~organic_carbon_total: from "pfh" ("horizon_c_organic_total")~~
- ~~coarse_fragment_vol: assumption constant over time; from "pfh" ("horizon_coarse_weight", "code_horizon_coarse_vol");~~ machine learning?
- considerable uncertainty in code_horizon_coarse_vol (in “pfh”) → first priority: coarse fragment fractions from other survey year?
- ~~effective_soil_depth: assumption constant over time; from maximum "layer_limit_inferior" + machine learning? Or assumption: always deeper than 100 cm if we know it is deeper than 80 cm? This was manually harmonised by Nathalie (data form "./data/additional_data/SO_PRF_ADDS.csv")~~
- profiles with data until 20 cm or 40 cm: (i) assume carbon density in subsoil does not change with time; or (ii) Monte Carlo machine learning prediction of carbon density at a depth of 100 cm (assessing confidence interval)
- ~~"pfh" survey forms gap filling for C stocks: add bulk densities forest floor from "som" after linking forest floor layers across "pfh" and "som" (same survey, i.e. maximum difference in survey years of 3 years)~~
* ~~Check whether vertical shifting of layer limits is needed, e.g. negative for forest floor layers (e.g. Estonia, where top of forest floor was designated as the 0-cm line). Rule for organic H layers:~~
+ ~~if code_layer is H and no layer limits, the layer should be in the forest floor. Change layer_type to forest_floor.~~
+ ~~if organic H layers are < 40 cm thick in total (and below any forest floor or above any mineral soil), this layer(s) should be considered as the forest floor. Change layer_type to forest_floor.~~
+ ~~if organic H layers are >= 40 cm thick in total, they can be considered as actual peat layers.~~
~~The null line should be between the forest floor (including those H layers just added; negative layer limits) and the peat/mineral layers (positive layer limits). Move null line in accordance if necessary (by shifting the layers up or down and changing their layer limits).~~
* Check “other_obs” columns properly
* ~~Harmonise plot_ids and coordinates (e.g. Poland, UK)~~ (Note: completed)
* Was there a systematic coordinate issue in the Pyrenees in Spain?
* Correct coordinate sign mistakes (e.g. in Spain, where the minus sign was clearly sometimes forgotten, since it was there for other records of the same plot)
* Harmonise “horizon_coarse_weight” if volumetric instead of wt% (e.g. Slovak Republic), as indicated by partners in PIRs.
* Harmonise soil textures where needed (e.g. Wallonia) to 63 µm limit using R soil texture wizard
* ~~“so_prf” (+ when possible: “s1_prf”): Left-join dataframe with harmonised WRB soil classification information by Nathalie --> one record per profile~~
+ ~~Include column with harmonisation method~~
+ ~~Include column with qualitative eutric/dystric factor~~
+ Recode humus type (e.g. amphihumus) in accordance with survey year
+ After joining: identify missing plots without soil classification (due to gap-filling) - also check in other survey forms.
+ s1_prf: create machine-learning model to predict WRB soil classes in plots where this information is lacking?
* ~~Check Russian plots in “so” survey: some of them actually belong to “s1”. Move accordingly.~~
* ~~Germany: harmonisation partner code across different survey forms?~~
* ~~Replace impossible data (e.g. bulk density above 2650 kg m-3) by NA~~
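
The rule for organic H layers and the null line stated earlier in this list can be read as the following simplified sketch. It is written in Python for illustration only and is not the project's R implementation. The dictionary keys (`code_layer`, `layer_type`, `superior`, `inferior`), the use of a leading "H" in `code_layer` to recognise H layers, and the omission of the positional condition (below any forest floor or above any mineral soil) are all assumptions.

```python
def reclassify_h_layers(layers, peat_threshold_cm=40.0):
    """Reclassify H layers and move the null line below the forest floor.

    `layers` is a list of dicts for one profile, ordered top to bottom, with
    hypothetical keys 'code_layer', 'layer_type', 'superior' and 'inferior'
    (layer limits in cm, or None when not reported).
    """
    out = [dict(layer) for layer in layers]
    h_layers = [l for l in out if l["code_layer"].startswith("H")]

    # Total thickness of the H layers that have both limits reported
    total_h = sum(
        l["inferior"] - l["superior"]
        for l in h_layers
        if l["superior"] is not None and l["inferior"] is not None
    )

    for l in h_layers:
        no_limits = l["superior"] is None or l["inferior"] is None
        if no_limits or total_h < peat_threshold_cm:
            l["layer_type"] = "forest_floor"
        else:
            l["layer_type"] = "peat"

    # The null line sits at the bottom of the lowest forest floor layer
    bottoms = [
        l["inferior"]
        for l in out
        if l["layer_type"] == "forest_floor" and l["inferior"] is not None
    ]
    offset = max(bottoms) if bottoms else 0.0

    # Shift every layer with known limits by the same offset
    for l in out:
        if l["superior"] is not None and l["inferior"] is not None:
            l["superior"] -= offset
            l["inferior"] -= offset
    return out
```

For a profile reported with the top of the forest floor at 0 cm (the Estonian case mentioned in the list), this gives negative limits for the forest floor and positive limits for the mineral layers.
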
@@ -50,16 +54,10 @@ Please add any observed data issues to this file
- ~~Plot 212: organic layer weights give a bulk density of 19212.08 - 88668.75 kg m-3 (95 % quantile), i.e. a factor 455 higher~~
- ~~Conclusion: there are two options: (i) either the data are wrongly reported in tonnes per ha (factor 100 higher); or (ii) the data are wrongly reported in g per m2 (factor 1000 higher). Assumption: they were reported in g per m2.~~
+ ~~Values of organic_layer_weight for which the derived bulk density is higher than 1400 kg m-3 (density of organic matter) are impossible and replaced by NA.~~
* survey_year in "som" and "pfh" does not always correspond with the actual sampling year (i.e. sometimes lab analysis year). Correct by means of survey_year in "prf" and "pls"
* ~~LOQ: harmonise and list assumptions~~
* Add potential sources of uncertainty, for example ring test standard deviations. In theory, this lab analytical uncertainty as well as sample pretreatment uncertainty should be somehow included along with spatial variation in the variation between plot repetitions. At this stage, we will just compare the order of magnitude of the ring test standard deviation with the standard deviation between plot repetitions. At this stage, no need to exclude any data on the basis of bad ring tests.
* Gap-filling forest types and WRB LI and humus + confirmation by national experts
* ~~Gap-filling forest types and WRB LI and humus + confirmation by national experts~~
* ~~Remove incomplete unique profiles (e.g. profiles with only one forest floor layer)? Or do we keep it for plot-level integration (at least in “som” forms, i.e. with fixed depths)?~~
* Add list with structure similar to PIRs, in which specific data updates by FSCC can be listed (along with their reason and a "change_date"). These can then be applied in a way similar to updated values in the checked PIRs.
* Convert script into a function, in which you can choose the survey form (so_som, so_pfh, s1_som, s1_pfh) and the variable to calculate stocks from. Different methodological decisions (e.g. about internal gap-filling opportunities) should be changeable via function input variables. List these important methodological variables and their options. A file with this methodological information should be exported as metadata in the output.

LESSONS LEARNED FROM MANUAL CORRECTIONS:
* ~~in s1_som files, for all MINERAL horizons, set organic_layer_weight==NA if bulk density value is given.~~ (Note: completed)
* ~~Add list with structure similar to PIRs, in which specific data updates by FSCC can be listed (along with their reason and a "change_date"). These can then be applied in a way similar to updated values in the checked PIRs.~~
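
The two plausibility limits named in this file (bulk density above 2650 kg m-3, and organic_layer_weight values whose derived bulk density exceeds 1400 kg m-3) could be applied along these lines. This is a hedged Python sketch, not the project's R code. It assumes organic_layer_weight is in kg m-2 and uses a hypothetical `thickness_cm` field; the key names are illustrative.

```python
MAX_BULK_DENSITY = 2650.0          # kg m-3, limit stated in the list above
MAX_ORGANIC_BULK_DENSITY = 1400.0  # kg m-3, density of organic matter


def derived_bulk_density(organic_layer_weight, thickness_cm):
    """Bulk density (kg m-3) from layer weight (kg m-2, assumed) and thickness (cm)."""
    return organic_layer_weight / (thickness_cm / 100.0)


def screen_layer(layer):
    """Return a copy of the layer with impossible values replaced by None."""
    out = dict(layer)

    bd = out.get("bulk_density")
    if bd is not None and bd > MAX_BULK_DENSITY:
        out["bulk_density"] = None

    weight = out.get("organic_layer_weight")
    thickness = out.get("thickness_cm")
    # A derived bulk density needs both a weight and a non-zero thickness
    if weight is not None and thickness:
        if derived_bulk_density(weight, thickness) > MAX_ORGANIC_BULK_DENSITY:
            out["organic_layer_weight"] = None
    return out
```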



@@ -85,15 +83,16 @@ LESSONS LEARNED FROM MANUAL CORRECTIONS:

## TO DO - Stocks, indicators

* Machine-learning prediction of lowest point splines (depth of 100 cm) + Monte Carlo uncertainty assessment?
* FSCDB.LI: Check whether indicator data in VWDD tables match the Vanmechelen report formulas
* ~~Check within-plot variability~~
* Propagate any uncertainty correctly.
* Convert scripts into a function. Different methodological decisions should be changeable via function input variables. List these important methodological variables and their options. A file with this methodological information should be included as metadata in the output. Also include total uncertainty (including uncertainty from the spline fitting + spline extrapolation + propagated uncertainty from other sources) in the output.
* Make functions to visualise output, e.g. violin plots per stratifier, overview graphs per plot_id, dynamic maps.
* Calculate stocks based on "pfh" survey forms too. Compare stocks based on fixed-depth layers with those based on pedogenetic horizons.
* Calculate change in carbon stock per year.
* Also include stocks until 30 cm as output in plot-aggregated stock files.
* Carbon stocks for carbon contents < LOQ?
* ~~Machine-learning prediction of lowest point splines (depth of 100 cm)~~ + Monte Carlo uncertainty assessment?
* ~~Convert scripts into a function.~~ Different methodological decisions should be changeable via function input variables. List these important methodological variables and their options. A file with this methodological information should be included as metadata in the output. Also include total uncertainty (including uncertainty from the spline fitting + spline extrapolation + propagated uncertainty from other sources) in the output.
* ~~Calculate stocks based on "pfh" survey forms too. Compare stocks based on fixed-depth layers with those based on pedogenetic horizons.~~
* ~~Also include stocks until 30 cm as output in plot-aggregated stock files.~~
* ~~Check within-plot variability~~
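
Two of the stock items above (stocks until 30 cm, and change in carbon stock per year) can be illustrated with a plain layer sum. This Python sketch does not implement the spline approach mentioned in the list. The key names and the units (organic carbon in g kg-1, bulk density in kg m-3, coarse fragments as a volume fraction) are assumptions.

```python
def layer_stock(layer, depth_limit_cm):
    """Carbon stock of one layer in t C ha-1, clipped to the depth limit.

    g kg-1 * kg m-3 * cm * (1 - coarse fragment fraction) / 10000 = t C ha-1
    """
    top = max(layer["top_cm"], 0.0)
    bottom = min(layer["bottom_cm"], depth_limit_cm)
    thickness = max(0.0, bottom - top)
    return (
        layer["oc_g_kg"]
        * layer["bd_kg_m3"]
        * thickness
        * (1.0 - layer["cf_vol_frac"])
        / 10000.0
    )


def profile_stock(layers, depth_limit_cm=30.0):
    """Sum the layer stocks of the below-ground layers down to the depth limit."""
    return sum(layer_stock(layer, depth_limit_cm) for layer in layers)


def annual_stock_change(stock_a, year_a, stock_b, year_b):
    """Mean change in stock per year between two surveys (t C ha-1 year-1)."""
    if year_a == year_b:
        raise ValueError("the two surveys must be from different years")
    return (stock_b - stock_a) / (year_b - year_a)
```

Calling `profile_stock(layers, depth_limit_cm=100.0)` on the same layers gives the stock to 100 cm, provided the layers actually reach that depth.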


## TO DO - How to assess uncertainty?
@@ -119,9 +118,10 @@ LESSONS LEARNED FROM MANUAL CORRECTIONS:
* When no O layers are reported: are they actually not present or have they just been ignored in the survey?
* Check whether plots in other survey forms are also reported in “pls” (and system installment)
* Check whether values below LOQ equal -1
* ~~Add a plausible range test for organic_layer_weight~~
* Histosol: are H layers reported in “som”?
* Are % OC values below 20 % in M layers and above 20 % in H layers?
* When NA is reported for coarse fragments: is this actually not measured or does this mean that there are no stones?
* Check whether code_plots are uniquely linked with coordinates.
* Check whether H layers should be converted to forest floor and whether null line is on the correct location in the profile.
* ~~Add a plausible range test for organic_layer_weight~~
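
Two of the checks above (organic carbon relative to the 20 % limit in M and H layers, and code_plots uniquely linked with coordinates) could be expressed as simple flagging functions. This is a Python sketch for illustration, not the project's R code. The field names `oc_percent`, `code_plot`, `latitude` and `longitude`, and the use of the first letter of `code_layer` to tell M from H layers, are assumptions.

```python
from collections import defaultdict


def flag_oc_layer_mismatch(layers, threshold_pct=20.0):
    """Return layers whose organic carbon content contradicts their layer code.

    Flags M layers above the threshold and H layers below it; layers without
    an organic carbon value are skipped.
    """
    flagged = []
    for layer in layers:
        oc = layer["oc_percent"]
        if oc is None:
            continue
        code = layer["code_layer"]
        if code.startswith("M") and oc > threshold_pct:
            flagged.append(layer)
        elif code.startswith("H") and oc < threshold_pct:
            flagged.append(layer)
    return flagged


def plots_with_multiple_coordinates(records):
    """Return the sorted plot codes that occur with more than one coordinate pair."""
    coords = defaultdict(set)
    for rec in records:
        coords[rec["code_plot"]].add((rec["latitude"], rec["longitude"]))
    return sorted(plot for plot, pairs in coords.items() if len(pairs) > 1)
```

Both functions only report candidates; whether a flagged record is an error still needs a decision by a person or a partner.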
