Commit

Merge pull request #128 from holukas/ML-long-term-gap-filling
Ml long term gap filling
holukas authored Jun 11, 2024
2 parents ceebdb4 + 8464e20 commit 60e6623
Showing 26 changed files with 5,762 additions and 3,133 deletions.
1 change: 1 addition & 0 deletions .gitignore
@@ -5,6 +5,7 @@
/__local_folders
/notebooks/_scratch/
/notebooks/Workbench/FLUXNET_CH4-N2O_Committee_WP2/data/
/diive/configs/exampledata/local

# Byte-compiled / optimized / DLL files
__pycache__/
46 changes: 46 additions & 0 deletions CHANGELOG.md
@@ -2,6 +2,52 @@

![DIIVE](images/logo_diive1_256px.png)

## v0.77.0 | 11 Jun 2024

### Additions

- Plotting cumulatives with `CumulativeYear` now also shows the cumulative for the reference, i.e. for the mean over the
reference years (`diive.core.plotting.cumulative.CumulativeYear`)
- Plotting `DielCycle` now accepts `ylim` parameter (`diive.core.plotting.dielcycle.DielCycle`)
- Added long-term dataset for local testing purposes (internal
only) (`diive.configs.exampledata.load_exampledata_parquet_long`)
- Added several classes in preparation for long-term gap-filling for a future update

### Changes

- Several updates and changes to the base class for regressor decision
  trees (`diive.core.ml.common.MlRegressorGapFillingBase`):
    - The data are now split into training set and test set at the very start of regressor setup. This test set is used
      to evaluate models on unseen data. The default split is 80% training and 20% test data.
    - Plotting (scores, importances etc.) is now generally separated from the method where they are calculated.
    - The same `random_state` is now used for all processing steps
    - Refactored code
    - Beautified console output
- When correcting for relative humidity values above 100%, the maximum of the corrected time series is now set to 100,
after the (daily) offset was removed (`diive.pkgs.corrections.offsetcorrection.remove_relativehumidity_offset`)
- During feature reduction in machine learning regressors, features with permutation importance < 0 are now always
removed (`diive.core.ml.common.MlRegressorGapFillingBase._remove_rejected_features`)
- Changed default parameters for quick random forest gap-filling (`diive.pkgs.gapfilling.randomforest_ts.QuickFillRFTS`)
- Improved the clarity of the console output for several functions and methods
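
The split and feature-reduction changes listed above can be illustrated with a minimal sketch. It uses scikit-learn directly and is not the `diive` implementation: the synthetic data, the feature names, the value of `RANDOM_STATE` and the choice to compute permutation importance on the test set are all assumptions made for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

RANDOM_STATE = 42  # assumption: any fixed integer, reused in every step

# Synthetic data (assumption): two informative features and one pure-noise feature
rng = np.random.default_rng(RANDOM_STATE)
n_records = 500
X = rng.normal(size=(n_records, 3))
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(scale=0.1, size=n_records)
feature_names = ["feat_a", "feat_b", "noise"]  # hypothetical names

# Split once, at the very start: 80% training data, 20% test data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=RANDOM_STATE)

# Fit on the training set only
model = RandomForestRegressor(n_estimators=50, random_state=RANDOM_STATE)
model.fit(X_train, y_train)

# Evaluate on unseen data (R2 score on the test set)
test_r2 = model.score(X_test, y_test)

# Permutation importance, here computed on the test set (assumption)
perm = permutation_importance(
    model, X_test, y_test, n_repeats=5, random_state=RANDOM_STATE)

# Features with mean permutation importance < 0 are rejected
accepted = [name for name, imp in zip(feature_names, perm.importances_mean) if imp >= 0]
rejected = [name for name in feature_names if name not in accepted]

print(f"Test R2: {test_r2:.3f}")
print(f"Accepted features: {accepted}, rejected features: {rejected}")
```

Reusing one `random_state` for the split, the model and the permutation step makes a run reproducible from start to finish.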

### Environment

- Added package [dtreeviz](https://github.com/parrt/dtreeviz?tab=readme-ov-file) to visualize decision trees

### Notebooks

- Updated notebook (`notebooks/GapFilling/RandomForestGapFilling.ipynb`)
- Updated notebook (`notebooks/GapFilling/LinearInterpolation.ipynb`)
- Updated notebook (`notebooks/GapFilling/XGBoostGapFillingExtensive.ipynb`)
- Updated notebook (`notebooks/GapFilling/XGBoostGapFillingMinimal.ipynb`)
- Updated notebook (`notebooks/GapFilling/RandomForestParamOptimization.ipynb`)
- Updated notebook (`notebooks/GapFilling/QuickRandomForestGapFilling.ipynb`)

### Tests

- Updated and fixed test case (`tests.test_outlierdetection.TestOutlierDetection.test_zscore_increments`)
- Updated and fixed test case (`tests.test_gapfilling.TestGapFilling.test_gapfilling_randomforest`)

## v0.76.2 | 23 May 2024

### Additions
78 changes: 54 additions & 24 deletions README.md
@@ -16,7 +16,8 @@ Recent releases: [Releases](https://github.com/holukas/diive/releases)

## Overview of example notebooks

- For many examples see notebooks here: [Notebook overview](https://github.com/holukas/diive/blob/main/notebooks/OVERVIEW.ipynb)
- For many examples see notebooks
here: [Notebook overview](https://github.com/holukas/diive/blob/main/notebooks/OVERVIEW.ipynb)
- More notebooks are added constantly.

## Current Features
@@ -25,7 +26,8 @@ Recent releases: [Releases](https://github.com/holukas/diive/releases)

- Calculate z-aggregates in quantiles (classes) of x and
y ([notebook example](https://github.com/holukas/diive/blob/main/notebooks/Analyses/CalculateZaggregatesInQuantileClassesOfXY.ipynb))
- Daily correlation ([notebook example](https://github.com/holukas/diive/blob/main/notebooks/Analyses/DailyCorrelation.ipynb))
- Daily
correlation ([notebook example](https://github.com/holukas/diive/blob/main/notebooks/Analyses/DailyCorrelation.ipynb))
- Decoupling: Sorting bins
method ([notebook example](https://github.com/holukas/diive/blob/main/notebooks/Analyses/DecouplingSortingBins.ipynb))
- Find data gaps ([notebook example](https://github.com/holukas/diive/blob/main/notebooks/Analyses/GapFinder.ipynb))
@@ -42,7 +44,8 @@ Recent releases: [Releases](https://github.com/holukas/diive/releases)

### Create variable

- Calculate time since last occurrence, e.g. since last precipitation ([notebook example](https://github.com/holukas/diive/blob/main/notebooks/CalculateVariable/TimeSince.ipynb))
- Calculate time since last occurrence, e.g. since last
precipitation ([notebook example](https://github.com/holukas/diive/blob/main/notebooks/CalculateVariable/TimeSince.ipynb))
- Calculate daytime flag, nighttime flag and potential radiation from latitude and
longitude ([notebook example](https://github.com/holukas/diive/blob/main/notebooks/CalculateVariable/Daytime_and_nighttime_flag.ipynb))
- Day/night flag from sun angle
@@ -78,9 +81,11 @@ Recent releases: [Releases](https://github.com/holukas/diive/releases)

### Flux processing chain

For info about the Swiss FluxNet flux levels, see [here](https://www.swissfluxnet.ethz.ch/index.php/data/ecosystem-fluxes/flux-processing-chain/).
For info about the Swiss FluxNet flux levels,
see [here](https://www.swissfluxnet.ethz.ch/index.php/data/ecosystem-fluxes/flux-processing-chain/).

- Flux processing chain ([notebook example](https://github.com/holukas/diive/blob/main/notebooks/FluxProcessingChain/FluxProcessingChain.ipynb))
- Flux processing
chain ([notebook example](https://github.com/holukas/diive/blob/main/notebooks/FluxProcessingChain/FluxProcessingChain.ipynb))
- The notebook example shows the application of:
- Level-2 quality flags
- Level-3.1 storage correction
@@ -101,10 +106,14 @@ Format data to specific formats

Fill gaps in time series with various methods

- XGBoostTS ([notebook example (minimal)](https://github.com/holukas/diive/blob/main/notebooks/GapFilling/XGBoostGapFillingMinimal.ipynb), [notebook example (more extensive)](https://github.com/holukas/diive/blob/main/notebooks/GapFilling/XGBoostGapFillingExtensive.ipynb))
- RandomForestTS ([notebook example](https://github.com/holukas/diive/blob/main/notebooks/GapFilling/RandomForestGapFilling.ipynb))
- Linear interpolation ([notebook example](https://github.com/holukas/diive/blob/main/notebooks/GapFilling/LinearInterpolation.ipynb))
- Quick random forest gap-filling ([notebook example](https://github.com/holukas/diive/blob/main/notebooks/GapFilling/QuickRandomForestGapFilling.ipynb))
-
XGBoostTS ([notebook example (minimal)](https://github.com/holukas/diive/blob/main/notebooks/GapFilling/XGBoostGapFillingMinimal.ipynb), [notebook example (more extensive)](https://github.com/holukas/diive/blob/main/notebooks/GapFilling/XGBoostGapFillingExtensive.ipynb))
-
RandomForestTS ([notebook example](https://github.com/holukas/diive/blob/main/notebooks/GapFilling/RandomForestGapFilling.ipynb))
- Linear
interpolation ([notebook example](https://github.com/holukas/diive/blob/main/notebooks/GapFilling/LinearInterpolation.ipynb))
- Quick random forest
gap-filling ([notebook example](https://github.com/holukas/diive/blob/main/notebooks/GapFilling/QuickRandomForestGapFilling.ipynb))

### Outlier Detection

@@ -116,10 +125,14 @@ Fill gaps in time series with various methods

Single outlier tests create a flag where `0=OK` and `2=outlier`.
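
As a rough illustration of this flag convention, the sketch below builds a `0`/`2` flag for an absolute-limits test with plain pandas. The function name, the flag column name and the example values are hypothetical and do not reflect the `diive` API.

```python
import pandas as pd


def flag_absolute_limits(series: pd.Series, minval: float, maxval: float) -> pd.Series:
    """Return a flag series where 0 = OK and 2 = outlier (value outside [minval, maxval])."""
    # Start with all records flagged as OK
    flag = pd.Series(0, index=series.index, dtype=int, name=f"FLAG_{series.name}_ABSLIM")
    # Records below the minimum or above the maximum are flagged as outliers
    is_outlier = (series < minval) | (series > maxval)
    flag.loc[is_outlier] = 2
    return flag


# Hypothetical air temperature values in degrees Celsius
ta = pd.Series([5.2, 48.0, -3.1, -60.0, 12.4], name="TA")
flag = flag_absolute_limits(ta, minval=-30, maxval=40)
print(flag.tolist())
```

In this sketch missing values compare as not outside the limits and therefore keep flag `0`.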

- Absolute limits ([notebook example](https://github.com/holukas/diive/blob/main/notebooks/OutlierDetection/AbsoluteLimits.ipynb))
- Absolute limits, separately defined for daytime and nighttime data ([notebook example](https://github.com/holukas/diive/blob/main/notebooks/OutlierDetection/AbsoluteLimitsDaytimeNighttime.ipynb))
- Incremental z-score: Identify outliers based on the z-score of double increments ([notebook example](https://github.com/holukas/diive/blob/main/notebooks/OutlierDetection/zScoreIncremental.ipynb))
- Local standard deviation: Identify outliers based on the local standard deviation from a running median ([notebook example](https://github.com/holukas/diive/blob/main/notebooks/OutlierDetection/LocalSD.ipynb))
- Absolute
limits ([notebook example](https://github.com/holukas/diive/blob/main/notebooks/OutlierDetection/AbsoluteLimits.ipynb))
- Absolute limits, separately defined for daytime and nighttime
data ([notebook example](https://github.com/holukas/diive/blob/main/notebooks/OutlierDetection/AbsoluteLimitsDaytimeNighttime.ipynb))
- Incremental z-score: Identify outliers based on the z-score of double
increments ([notebook example](https://github.com/holukas/diive/blob/main/notebooks/OutlierDetection/zScoreIncremental.ipynb))
- Local standard deviation: Identify outliers based on the local standard deviation from a running
median ([notebook example](https://github.com/holukas/diive/blob/main/notebooks/OutlierDetection/LocalSD.ipynb))
- Local outlier factor: Identify outliers based on local outlier factor, across all data
- Local outlier factor: Identify outliers based on local outlier factor, daytime nighttime separately
- Manual removal: Remove time periods (from-to) or single records from time series
@@ -130,7 +143,8 @@ Single outlier tests create a flag where `0=OK` and `2=outlier`.

### Plotting

- Diel cycle per month ([notebook example](https://github.com/holukas/diive/blob/main/notebooks/Plotting/DielCycle.ipynb))
- Diel cycle per
month ([notebook example](https://github.com/holukas/diive/blob/main/notebooks/Plotting/DielCycle.ipynb))
- Heatmap showing values (z) of time series as date (y) vs time (
x) ([notebook example](https://github.com/holukas/diive/blob/main/notebooks/Plotting/HeatmapDateTime.ipynb))
- Heatmap showing values (z) of time series as year (y) vs month (
@@ -148,11 +162,14 @@ Single outlier tests create a flag where `0=OK` and `2=outlier`.
database ([notebook example](https://github.com/holukas/diive/blob/main/notebooks/MeteoScreening/StepwiseMeteoScreeningFromDatabase.ipynb))

### Resampling
- Calculate diel cycle per month ([notebook example](https://github.com/holukas/diive/blob/main/notebooks/Resampling/ResamplingDielCycle.ipynb))

- Calculate diel cycle per
month ([notebook example](https://github.com/holukas/diive/blob/main/notebooks/Resampling/ResamplingDielCycle.ipynb))

### Stats

- Time series stats ([notebook example](https://github.com/holukas/diive/blob/main/notebooks/Stats/TimeSeriesStats.ipynb))
- Time series
stats ([notebook example](https://github.com/holukas/diive/blob/main/notebooks/Stats/TimeSeriesStats.ipynb))

### Timestamps

@@ -163,22 +180,35 @@ Single outlier tests create a flag where `0=OK` and `2=outlier`.

## Installation

`diive` can be installed from source code, e.g. using [`poetry`](https://python-poetry.org/) for dependencies.

`diive` is currently developed under Python 3.9.7, but newer (and many older) versions should also work.

`diive` can be installed using conda with `conda install -c conda-forge diive`
### Using pip

`pip install diive`

### Using poetry

`poetry add diive`

### Using conda

`conda install -c conda-forge diive`

### From source

Install directly from the `.tar.gz` file of the desired version.

`pip install https://github.com/holukas/diive/archive/refs/tags/v0.76.2.tar.gz`

### Create and use a conda environment for diive

One way to install and use `diive` with a specific Python version on a local machine:

- Install [miniconda](https://docs.conda.io/en/latest/miniconda.html)
- Start `miniconda` prompt
- Create an environment named `diive-env` that contains Python 3.9.7:
`conda create --name diive-env python=3.9.7`
- Create an environment named `diive-env` that contains Python 3.9.7: `conda create --name diive-env python=3.9.7`
- Activate the new environment: `conda activate diive-env`
- Install `diive` version directly from source code:
`pip install https://github.com/holukas/diive/archive/refs/tags/v0.63.1.tar.gz` (select .tar.gz file of the desired
version)
- Install `diive` using pip: `pip install diive`
- If you want to use `diive` in Jupyter notebooks, you can install Jupyterlab.
In this example Jupyterlab is installed from the `conda` distribution channel `conda-forge`:
`conda install -c conda-forge jupyterlab`
22 changes: 17 additions & 5 deletions diive/configs/exampledata/__init__.py
@@ -15,6 +15,12 @@ def load_exampledata_parquet() -> DataFrame:
return data_df


def load_exampledata_parquet_long() -> DataFrame:
    filepath = Path(DIR_PATH) / 'local/exampledata_PARQUET_CH-DAV_FP2022.5_1997-2022_ID20230206154316_30MIN.parquet'
    data_df = load_parquet(filepath=filepath)
    return data_df


def load_exampledata_DIIVE_CSV_30MIN():
filepath = Path(DIR_PATH) / 'exampledata_DIIVE-CSV-30MIN_CH-DAV_FP2022.5_2022.07_ID20230206154316_30MIN.diive.csv'
loaddatafile = ReadFileType(filetype='DIIVE-CSV-30MIN',
@@ -103,6 +109,7 @@ def load_exampledata_TOA5_DAT_1MIN():
data_df, metadata_df = loaddatafile.get_filedata()
return data_df, metadata_df


def load_exampledata_GENERIC_CSV_HEADER_1ROW_TS_MIDDLE_FULL_1MIN_long():
filepath = Path(
DIR_PATH) / 'exampledata_GENERIC-CSV-HEADER-1ROW-TS-MIDDLE-FULL-1MIN_CH-FRU_iDL_BOX1_0_1_TBL1_20240401-0000.dat.csv'
@@ -129,17 +136,22 @@ def load_exampledata_EDDYPRO_FLUXNET_CSV_30MIN_with_datafilereader_parameters():
dfr = DataFileReader(filepath=filepath,
data_header_section_rows=[0], # Header section (before data) comprises 1 row
data_skip_rows=[], # Skip no rows
data_header_rows=[0], # Header with variable names and units, in this case only variable names in first row of header
data_header_rows=[0],
# Header with variable names and units, in this case only variable names in first row of header
data_varnames_row=0, # Variable names are in first row of header
data_varunits_row=None, # Header does not contain any variable units
data_na_vals=[-9999], # List of values interpreted as missing values, EddyPro uses -9999 for missing values in output file
data_na_vals=[-9999],
# List of values interpreted as missing values, EddyPro uses -9999 for missing values in output file
data_freq="30min", # Time resolution of the data is 30-minutes
data_delimiter=",", # This csv file uses the comma as delimiter
data_nrows=None, # How many data rows to read from files, mainly used for testing, in this case None to read all rows in file
data_nrows=None,
# How many data rows to read from files, mainly used for testing, in this case None to read all rows in file
timestamp_idx_col=["TIMESTAMP_END"], # Name of the column that is used for the timestamp index
timestamp_datetime_format="%Y%m%d%H%M", # Timestamp in the files looks like this: 202107010300
timestamp_start_middle_end="end", # Timestamp in the file defined in *timestamp_idx_col* refers to the END of the averaging interval
output_middle_timestamp=True, # Timestamp in output dataframe (after reading the file) refers to the MIDDLE of the averaging interval
timestamp_start_middle_end="end",
# Timestamp in the file defined in *timestamp_idx_col* refers to the END of the averaging interval
output_middle_timestamp=True,
# Timestamp in output dataframe (after reading the file) refers to the MIDDLE of the averaging interval
compression=None) # File is not compressed (not zipped)
data_df, metadata_df = dfr.get_data()
return data_df, metadata_df
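
The timestamp-related parameters above (`timestamp_datetime_format`, `timestamp_start_middle_end`, `output_middle_timestamp`, `data_na_vals`) can be sketched with plain pandas. This is not how `DataFileReader` is implemented: the example values, the `FC` column and the index name `TIMESTAMP_MIDDLE` are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical file content: timestamps mark the END of each 30-minute averaging interval
raw = pd.DataFrame({
    "TIMESTAMP_END": ["202107010030", "202107010100", "202107010130"],
    "FC": [-1.2, -9999, -0.8],  # -9999 marks a missing value
})

freq = pd.Timedelta("30min")  # time resolution of the data

# Parse the timestamp strings, e.g. 202107010030 -> 2021-07-01 00:30
timestamp_end = pd.to_datetime(raw["TIMESTAMP_END"], format="%Y%m%d%H%M")

# Replace the missing value code with NaN
df = raw.drop(columns="TIMESTAMP_END")
df["FC"] = df["FC"].mask(df["FC"] == -9999)

# Shift END timestamps back by half an interval to get the MIDDLE of the averaging interval
df.index = pd.DatetimeIndex(timestamp_end - freq / 2, name="TIMESTAMP_MIDDLE")

print(df)
```

A record with end timestamp 00:30 covers 00:00 to 00:30, so its middle timestamp is 00:15.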
10 changes: 4 additions & 6 deletions diive/core/dfun/frames.py
@@ -786,11 +786,10 @@ def rolling_variants(df, records: int, aggtypes: list, exclude_cols: list = None

def add_continuous_record_number(df: DataFrame) -> DataFrame:
"""Add continuous record number as new column"""
print("\nAdding continuous record number ...")
newcol = '.RECORDNUMBER'
data = range(1, len(df) + 1)
df[newcol] = data
print(f"Added new column {newcol} with record numbers from {df[newcol].iloc[0]} to {df[newcol].iloc[-1]}.")
print(f"++ Added new column {newcol} with record numbers from {df[newcol].iloc[0]} to {df[newcol].iloc[-1]}.")
return df


@@ -830,7 +829,7 @@ def lagged_variants(df: DataFrame,
Example:
"""
print(f"\nCreating lagged variants ...")

if len(df.columns) == 1:
if df.columns[0] in exclude_cols:
raise Exception(f"(!) No lagged variants can be created "
Expand Down Expand Up @@ -881,9 +880,8 @@ def lagged_variants(df: DataFrame,
_included.append(col)

if verbose:
print(f"Created lagged variants for: {_included} (lags between {lag[0]} and {lag[1]} "
f"with stepsize {stepsize})\n"
f"No lagged variants for: {_excluded}")
print(f"++ Added new columns with lagged variants for: {_included} (lags between {lag[0]} and {lag[1]} "
f"with stepsize {stepsize}), no lagged variants for: {_excluded}.")
return df
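
For orientation, the idea behind lagged variants (lags between `lag[0]` and `lag[1]` with a given `stepsize`, skipping excluded columns) can be sketched with `DataFrame.shift`. This is a simplified stand-in, not the `lagged_variants` function from this file: the function name, the column naming scheme and the example data are assumptions.

```python
import pandas as pd


def add_lagged_variants(df: pd.DataFrame, lag: list, stepsize: int = 1,
                        exclude_cols: list = None) -> pd.DataFrame:
    """Add shifted copies of each column for lags between lag[0] and lag[1]."""
    exclude_cols = exclude_cols or []
    out = df.copy()
    for col in df.columns:
        if col in exclude_cols:
            continue  # no lagged variants for excluded columns
        for step in range(lag[0], lag[1] + 1, stepsize):
            if step == 0:
                continue  # lag 0 is the original column
            sign = "-" if step < 0 else "+"
            # A negative lag takes the value from an earlier record, e.g. -1 = previous record
            out[f".{col}{sign}{abs(step)}"] = df[col].shift(-step)
    return out


# Hypothetical example data
data = pd.DataFrame({"TA": [1.0, 2.0, 3.0, 4.0], "NEE": [0.1, 0.2, 0.3, 0.4]})
lagged = add_lagged_variants(data, lag=[-2, -1], stepsize=1, exclude_cols=["NEE"])
print(lagged)
```

Records at the start of the series have no earlier values available, so the lagged columns contain `NaN` there.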


0 comments on commit 60e6623