Skip to content

Conversation

@SiAndRo2002
Copy link
Collaborator

@SiAndRo2002 SiAndRo2002 commented Nov 20, 2025

Pull Request

Description

Restructure and individualise shift and add interpolation methods

Type of Change

  • Bug fix (non-breaking change fixing an issue)
  • New feature (non-breaking change adding functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Refactoring (code change that neither fixes a bug nor adds a feature)
  • Documentation update

Required Checklist

Testing

  • Unit tests have been created/updated for new/modified functionality
  • CI/CD pipeline passes all tests (pytest, coverage, executables, installation)

Examples

  • Add examples for new features and functionality

Compatibility

  • Changes are backward compatible OR deprecation warnings added
  • No breaking changes to public APIs
  • New dependencies added to pyproject.toml (required in dependencies or optional in [project.optional-dependencies])

Documentation

  • Docstrings added/updated for new/modified public methods (Google style)
  • Type hints added for new functions/methods

Optional

GitHub Copilot Review

  • Request Copilot review via GitHub UI (add 'copilot' as a reviewer)

@SiAndRo2002 SiAndRo2002 linked an issue Nov 20, 2025 that may be closed by this pull request
@SiAndRo2002 SiAndRo2002 marked this pull request as ready for review December 3, 2025 15:07
Comment on lines 50 to 54
shift = {
'reaTZon_y': 'previous', # for all lags of reaTZon_y, the shift will be set automatically
'weaSta_reaWeaHDirNor_y': 'mean_over_interval',
'_default': 0,
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should include this as part of Feature with attribute sampling_method. Default should be None, and there should be a class attribute default_sampling_method that is used if None
Shift should be removed from preprocessing (not backwards compatible, but necessary to maintain usability)

Comment on lines 291 to 296
if (len(self.inputs) != len(self.shift.keys())) or not all(inp in self.shift.keys() for inp in self.inputs):
self.shift = convert_shift_to_dict(self.shift, self.inputs, custom_default=self.shift_default)

assert len(self.inputs) == len(self.shift.keys()), (
f"Something went wrong, number of inputs ({len(self.inputs)})"
f" doesn't match number of inputs defined in shift ({len(self.shift.keys())})")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this not already done in the init function?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

L291-292 are necessary in case of recursive feature selection since preprocessing object is only created once but inputs are changed

FeatureConstruction.process(df)
# Only apply for those features that are not lags since lags must be constructed after sampling the data
# according to the given time step
FeatureConstruction.process(df, feature_names=inputs_without_lags + [out for out in self.output if out not in inputs_without_lags])
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might be easier to implement a process_without_lags and process_only_lags function, to avoid too much code in high level functions

"""

def __init__(self, inputs: list[str], output: Union[str, list[str]], shift: int = 1,
def __init__(self, inputs: list[str], output: Union[str, list[str]], shift: Union[int, str, dict] = 'previous',
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can probably be removed if sampling_method is part of feature

os.environ['TF_CPP_MIN_LOG_LEVEL'] = '0'


def convert_shift_to_dict(s: Union[int, str, dict], inputs: list[str], custom_default: Union[int, str] = None) -> dict:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Function should (hopefully) get much easier if shift is a attribute of Feature

Comment on lines 346 to 363
if all('current' == self.shift[k] for k in inputs_without_lags):
# filter / sample data
X = self.filter_df_according_to_timestep(X)
# nothing more to do here
elif all('previous' == self.shift[k] for k in inputs_without_lags):
# filter / sample data
X = self.filter_df_according_to_timestep(X)

# shift data by 1 and shorten DataFrames accordingly
X = X.shift(1)
y = y.iloc[1:]
X = X.iloc[1:]
elif all('mean_over_interval' == self.shift[k] for k in inputs_without_lags):
X = get_mean_over_interval(y, X)
# synchronize length between X and y
y = y.iloc[1:]

else: # different inputs have different shifts
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this not a duplicate of the more general "else: # different inputs have different shifts" statement?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, in fact, the else would do the same thing. The idea was that it might be faster in case all inputs have the same shift

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see, i think the preprocessing is not really a performance bottleneck, so i would sacrifice a little performance for a more general implementation

Comment on lines 148 to 150
if isinstance(shift, dict) and '_default' in shift.keys():
self.shift_default = shift['_default']
shift.__delitem__('_default')
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should get easier if Feature has a default sampling_method (as class attribute, so it is changeable)

df = df.loc[first_valid_index:last_valid_index]
if df.isnull().values.any():

def get_mean_over_interval(y: pd.DataFrame, x: pd.DataFrame):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this similar to how the sampling is done in agentlib/agentlib-mpc?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The get_mean_over_interval function is mostly copied from agentlib-mpc/utils/sampling.py. However, small adaptions were necessary due to differing data structures

y = y.iloc[:-self.shift]
X = X.iloc[:-self.shift]
# Applies feature constructions defined in `FeatureConstruction` to the lagged inputs
FeatureConstruction.process(res_df, feature_names=lagged_inputs)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How is it handeld if there is a constructed feature that is based on a lagged input?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Resolved in commit 367624f

ross.simon and others added 24 commits December 7, 2025 11:58
deleted deprecated code and test for shift conversion
sampling_method of constructed features determined based on corresponding base_features(s)
resetting FeatureConstruction.features also affected p_hp_data
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Add Interpolation Methods for Input / Output

5 participants