Preprocessing Functions for Time-Based Windowing, Differencing, and Data Transformation; XGBoost for GridSearch #54

Open · Reinaldo-Kn wants to merge 8 commits into main
Conversation


@Reinaldo-Kn commented Oct 17, 2024

These functions were implemented in response to the requests in issue #45:

  • time_based_windowing()

This function applies a time-based rolling window to the dataset based on the timestamp index, allowing you to aggregate data over defined time intervals.

Parameters:

  df (pandas.DataFrame): The data to be processed.
   train_or_test (string, optional): Specifies whether the data is for training or testing.
   window_size (string, optional): The size of the time window (e.g., '30min' for 30 minutes or '1H' for 1 hour).
   agg_func (string, optional): The aggregation function to apply within each window ('mean', 'sum', 'max', 'min').

Returns:

   (pandas.DataFrame): The DataFrame with aggregated values for each time window.

Raises:

   ValueError: If the DataFrame index is not a datetime type or if an unsupported aggregation function is provided.
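A minimal sketch of the described behavior, assuming a DataFrame with a DatetimeIndex; the PR's actual method also takes the `train_or_test` selector, which is omitted here for brevity:

```python
import pandas as pd

def time_based_windowing(df: pd.DataFrame, window_size: str = "30min",
                         agg_func: str = "mean") -> pd.DataFrame:
    """Aggregate a datetime-indexed DataFrame over time-based rolling windows."""
    if not isinstance(df.index, pd.DatetimeIndex):
        raise ValueError("DataFrame index must be a DatetimeIndex.")
    if agg_func not in ("mean", "sum", "max", "min"):
        raise ValueError(f"Unsupported aggregation function: {agg_func}")
    # rolling() with an offset string ("30min", "1H", ...) builds windows by
    # elapsed time rather than by a fixed number of rows.
    return df.rolling(window_size).agg(agg_func)
```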
  • remove_constant_difference()

This function computes the cumulative sum of the input DataFrame and removes columns where the differences between consecutive rows are constant. This can be useful for identifying and dropping features that do not provide significant variability.

Parameters:

    df (pandas.DataFrame): The data to be processed.
    train_or_test (string, optional): Specifies whether the data is for training or testing.

Returns:

    (pandas.DataFrame): The DataFrame after removing columns with constant differences.

Raises:

    TypeError: If any column in the DataFrame is not numeric.
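A rough standalone sketch of the described logic (again dropping the `train_or_test` selector), assuming "constant difference" means every consecutive delta of the cumulative sum is identical:

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

def remove_constant_difference(df: pd.DataFrame) -> pd.DataFrame:
    """Drop columns whose consecutive differences of the cumulative sum are constant."""
    if not all(is_numeric_dtype(dtype) for dtype in df.dtypes):
        raise TypeError("All columns in the DataFrame must be numeric.")
    # Cumulative sum, then row-to-row differences; a column has a constant
    # difference if its deltas take at most one unique value.
    deltas = df.cumsum().diff().dropna()
    constant_cols = [col for col in deltas.columns if deltas[col].nunique() <= 1]
    return df.drop(columns=constant_cols)
```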
  • differencing()

This function applies differencing to the dataset, a technique used to remove trends or seasonal patterns by subtracting the previous observation (at the given lag) from the current one.

Parameters:

    df (pandas.DataFrame): The data to be processed.
    train_or_test (string, optional): Specifies whether the data is for training or testing.
    lag (int, optional): The lag value used for differencing, with a default of 1.

Returns:

    (pandas.DataFrame): The DataFrame after applying differencing, with NaN values dropped.
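As a sketch, this maps directly onto pandas' `diff` (the `train_or_test` selector is again omitted):

```python
import pandas as pd

def differencing(df: pd.DataFrame, lag: int = 1) -> pd.DataFrame:
    """Compute x_t - x_{t-lag} for every column and drop the leading NaNs."""
    # diff(periods=lag) subtracts the value `lag` rows earlier from each row;
    # the first `lag` rows have no predecessor, become NaN, and are dropped.
    return df.diff(periods=lag).dropna()
```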
  • log_transform()

This function applies a logarithmic transformation to the dataset to reduce variability and help normalize the data. The transformation used is log(1 + x) to handle values close to zero safely.

Parameters:

    df (pandas.DataFrame): The data to be processed.
    train_or_test (string, optional): Specifies whether the data is for training or testing.

Returns:

    (pandas.DataFrame): The DataFrame after applying the log transformation.
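A minimal sketch using NumPy's `log1p`, which implements the log(1 + x) transform described above:

```python
import numpy as np
import pandas as pd

def log_transform(df: pd.DataFrame) -> pd.DataFrame:
    """Apply log(1 + x) element-wise to compress large values and reduce skew."""
    # np.log1p is accurate for values near zero and, when given a DataFrame,
    # returns a DataFrame with the same index and column labels.
    return np.log1p(df)
```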

More methods for _gridsearch.py

A new function has also been added for GridSearch #52, now using the XGBoost library so that hyperparameters can be tuned more quickly on the GPU.

  • xgboost_best()

This function performs a grid search to identify the best hyperparameters for the XGBRegressor model using a predefined parameter grid. It splits the dataset into training and testing sets, then applies GridSearchCV to tune the model for optimal performance.

Parameters:

    df (pandas.DataFrame): The dataset to be used for training and testing.
    target_column (str): The name of the target column that the model will predict.

Returns:

    (XGBRegressor): The best XGBoost estimator found during the grid search.

Key Attributes:

    self.best_params_: Stores the best parameters found by the grid search.
    self.best_estimator_: The best estimator (model) found based on the grid search results.

Grid Search Default Parameters:

    n_estimators: [100, 200]
    learning_rate: [0.05, 0.1]
    max_depth: [None, 6]
    subsample: [0.1, 0.5]
    colsample_bytree: [0.1, 0.5]
    gamma: [0, 0.1]
    reg_alpha: [0, 0.1]
    reg_lambda: [0.1, 0.5]

Notes:

  • The grid search utilizes the GPU for faster computation by setting tree_method="gpu_hist" and predictor="gpu_predictor".
  • The function is designed to optimize the model based on the scoring method and cross-validation (cv) specified during class initialization.
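A standalone sketch of the grid search described above; the PR implements this as a class method that reuses the instance's `scoring` and `cv`, so the function signature, defaults, and the 80/20 split shown here are assumptions. Note that `tree_method="gpu_hist"` and `predictor="gpu_predictor"` are XGBoost 1.x parameter names; in XGBoost 2.0+ the equivalent is `tree_method="hist"` with `device="cuda"`.

```python
import pandas as pd
import xgboost as xgb
from sklearn.model_selection import GridSearchCV, train_test_split

def xgboost_best(df: pd.DataFrame, target_column: str,
                 scoring: str = "neg_mean_squared_error", cv: int = 3):
    """Grid-search an XGBRegressor over the default parameter grid listed above."""
    X = df.drop(columns=[target_column])
    y = df[target_column]
    # The test split is held out for later evaluation of the tuned model.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42
    )
    param_grid = {
        "n_estimators": [100, 200],
        "learning_rate": [0.05, 0.1],
        "max_depth": [None, 6],
        "subsample": [0.1, 0.5],
        "colsample_bytree": [0.1, 0.5],
        "gamma": [0, 0.1],
        "reg_alpha": [0, 0.1],
        "reg_lambda": [0.1, 0.5],
    }
    # GPU-accelerated histogram algorithm (XGBoost 1.x parameter names).
    model = xgb.XGBRegressor(tree_method="gpu_hist", predictor="gpu_predictor")
    search = GridSearchCV(model, param_grid, scoring=scoring, cv=cv)
    search.fit(X_train, y_train)
    # search.best_params_ / search.best_estimator_ correspond to the
    # attributes listed above.
    return search.best_estimator_
```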

You can view the new functions in Colab
