Preprocessing Functions for Time-Based Windowing, Differencing, and Data Transformation; XGBoost for GridSearch #54

Open · Reinaldo-Kn wants to merge 8 commits into main
Conversation


@Reinaldo-Kn commented Oct 17, 2024

These functions were implemented in response to the requests in issue #45:

  • time_based_windowing()

This function applies a time-based rolling window to the dataset based on the timestamp index, allowing you to aggregate data over defined time intervals.

Parameters:

  df (pandas.DataFrame): The data to be processed.
   train_or_test (string, optional): Specifies whether the data is for training or testing.
   window_size (string, optional): The size of the time window (e.g., '30min' for 30 minutes or '1H' for 1 hour).
   agg_func (string, optional): The aggregation function to apply within each window ('mean', 'sum', 'max', 'min').

Returns:

   (pandas.DataFrame): The DataFrame with aggregated values for each time window.

Raises:

   ValueError: If the DataFrame index is not a datetime type or if an unsupported aggregation function is provided.
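A minimal sketch of the described behavior, assuming a DataFrame with a DatetimeIndex; the PR's actual method also takes the `train_or_test` selector, which is omitted here for brevity:

```python
import pandas as pd

def time_based_windowing(df: pd.DataFrame, window_size: str = "30min",
                         agg_func: str = "mean") -> pd.DataFrame:
    """Aggregate a datetime-indexed DataFrame over time-based rolling windows."""
    if not isinstance(df.index, pd.DatetimeIndex):
        raise ValueError("DataFrame index must be a DatetimeIndex.")
    if agg_func not in ("mean", "sum", "max", "min"):
        raise ValueError(f"Unsupported aggregation function: {agg_func}")
    # rolling() with an offset string ("30min", "1H", ...) builds windows by
    # elapsed time rather than by a fixed number of rows.
    return df.rolling(window_size).agg(agg_func)
```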
  • remove_constant_difference()

This function computes the cumulative sum of the input DataFrame and removes columns where the differences between consecutive rows are constant. This can be useful for identifying and dropping features that do not provide significant variability.

Parameters:

    df (pandas.DataFrame): The data to be processed.
    train_or_test (string, optional): Specifies whether the data is for training or testing.

Returns:

    (pandas.DataFrame): The DataFrame after removing columns with constant differences.

Raises:

    TypeError: If any column in the DataFrame is not numeric.
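A rough standalone sketch of the described logic (again dropping the `train_or_test` selector), assuming "constant difference" means every consecutive delta of the cumulative sum is identical:

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

def remove_constant_difference(df: pd.DataFrame) -> pd.DataFrame:
    """Drop columns whose consecutive differences of the cumulative sum are constant."""
    if not all(is_numeric_dtype(dtype) for dtype in df.dtypes):
        raise TypeError("All columns in the DataFrame must be numeric.")
    # Cumulative sum, then row-to-row differences; a column has a constant
    # difference if its deltas take at most one unique value.
    deltas = df.cumsum().diff().dropna()
    constant_cols = [col for col in deltas.columns if deltas[col].nunique() <= 1]
    return df.drop(columns=constant_cols)
```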
  • differencing()

This function applies differencing to the dataset, a technique used to remove trends or seasonal patterns by subtracting the previous observation (at the given lag) from the current one.

Parameters:

    df (pandas.DataFrame): The data to be processed.
    train_or_test (string, optional): Specifies whether the data is for training or testing.
    lag (int, optional): The lag value used for differencing, with a default of 1.

Returns:

    (pandas.DataFrame): The DataFrame after applying differencing, with NaN values dropped.
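As a sketch, this maps directly onto pandas' `diff` (the `train_or_test` selector is again omitted):

```python
import pandas as pd

def differencing(df: pd.DataFrame, lag: int = 1) -> pd.DataFrame:
    """Compute x_t - x_{t-lag} for every column and drop the leading NaNs."""
    # diff(periods=lag) subtracts the value `lag` rows earlier from each row;
    # the first `lag` rows have no predecessor, become NaN, and are dropped.
    return df.diff(periods=lag).dropna()
```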
  • log_transform()

This function applies a logarithmic transformation to the dataset to reduce variability and help normalize the data. The transformation used is log(1 + x) to handle values close to zero safely.

Parameters:

    df (pandas.DataFrame): The data to be processed.
    train_or_test (string, optional): Specifies whether the data is for training or testing.

Returns:

    (pandas.DataFrame): The DataFrame after applying the log transformation.
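A minimal sketch using NumPy's `log1p`, which implements the log(1 + x) transform described above:

```python
import numpy as np
import pandas as pd

def log_transform(df: pd.DataFrame) -> pd.DataFrame:
    """Apply log(1 + x) element-wise to compress large values and reduce skew."""
    # np.log1p is accurate for values near zero and, when given a DataFrame,
    # returns a DataFrame with the same index and column labels.
    return np.log1p(df)
```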

More methods for _gridsearch.py

A new function has also been added for GridSearch #52, now using the XGBoost library so that hyperparameters can be tuned more quickly on the GPU.

  • xgboost_best()

This function performs a grid search to identify the best hyperparameters for the XGBRegressor model using a predefined parameter grid. It splits the dataset into training and testing sets, then applies GridSearchCV to tune the model for optimal performance.

Parameters:

    df (pandas.DataFrame): The dataset to be used for training and testing.
    target_column (str): The name of the target column that the model will predict.

Returns:

    (XGBRegressor): The best XGBoost estimator found during the grid search.

Key Attributes:

    self.best_params_: Stores the best parameters found by the grid search.
    self.best_estimator_: The best estimator (model) found based on the grid search results.

Grid Search Default Parameters:

    n_estimators: [100, 200]
    learning_rate: [0.05, 0.1]
    max_depth: [None, 6]
    subsample: [0.1, 0.5]
    colsample_bytree: [0.1, 0.5]
    gamma: [0, 0.1]
    reg_alpha: [0, 0.1]
    reg_lambda: [0.1, 0.5]

Notes:

  • The grid search utilizes the GPU for faster computation by setting tree_method="gpu_hist" and predictor="gpu_predictor".
  • The function is designed to optimize the model based on the scoring method and cross-validation (cv) specified during class initialization.
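A standalone sketch of the grid search described above; the PR implements this as a class method that reuses the instance's `scoring` and `cv`, so the function signature, defaults, and the 80/20 split shown here are assumptions. Note that `tree_method="gpu_hist"` and `predictor="gpu_predictor"` are XGBoost 1.x parameter names; in XGBoost 2.0+ the equivalent is `tree_method="hist"` with `device="cuda"`.

```python
import pandas as pd
import xgboost as xgb
from sklearn.model_selection import GridSearchCV, train_test_split

def xgboost_best(df: pd.DataFrame, target_column: str,
                 scoring: str = "neg_mean_squared_error", cv: int = 3):
    """Grid-search an XGBRegressor over the default parameter grid listed above."""
    X = df.drop(columns=[target_column])
    y = df[target_column]
    # The test split is held out for later evaluation of the tuned model.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42
    )
    param_grid = {
        "n_estimators": [100, 200],
        "learning_rate": [0.05, 0.1],
        "max_depth": [None, 6],
        "subsample": [0.1, 0.5],
        "colsample_bytree": [0.1, 0.5],
        "gamma": [0, 0.1],
        "reg_alpha": [0, 0.1],
        "reg_lambda": [0.1, 0.5],
    }
    # GPU-accelerated histogram algorithm (XGBoost 1.x parameter names).
    model = xgb.XGBRegressor(tree_method="gpu_hist", predictor="gpu_predictor")
    search = GridSearchCV(model, param_grid, scoring=scoring, cv=cv)
    search.fit(X_train, y_train)
    # search.best_params_ / search.best_estimator_ correspond to the
    # attributes listed above.
    return search.best_estimator_
```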

You can view the new functions in Colab
