# -*- coding: utf-8 -*-
"""optiver-dev-lgbm.ipynb
Automatically generated by Colab.
Original file is located at
https://colab.research.google.com/drive/1bELhm-NODp7-NAqvCLxTvM8QgwS4f2CN
# [Optiver] LGBM Dev Solution - For Kaggle
# 1. Baseline
First, just as a baseline, let's feed the training data into LightGBM and see how good the public score is.
### Imports and Configuration
1. **Standard Libraries and Data Handling**:
- `os`: This module provides a way of using operating system dependent functionality like reading or writing to a file system.
- `pandas as pd`: Pandas is crucial for data manipulation and analysis. It offers data structures and operations for manipulating numerical tables and time series.
- `numpy as np`: NumPy is used for large, multi-dimensional arrays and matrices, along with a large collection of high-level mathematical functions to operate on these arrays.
2. **Visualization**:
- `matplotlib.pyplot as plt`: This is used for creating static, interactive, and animated visualizations in Python. `%matplotlib inline` is a magic function that renders the figure in a notebook (instead of displaying a figure in a new window) immediately after a plot command.
3. **Warnings**:
- `warnings`: This module is used to suppress warnings that might interrupt the viewing experience or clutter the output. `warnings.filterwarnings('ignore')` instructs Python to ignore specific categories of warnings.
4. **Hyperparameter Optimization**:
- `optuna`: Optuna is an automatic hyperparameter optimization software framework, particularly designed for machine learning. It is used to automate the optimization of the parameters to best fit the model. `optuna.logging.set_verbosity(optuna.logging.WARNING)` configures Optuna to only output warnings and more severe messages, reducing log noise.
5. **Machine Learning Tools**:
- `sklearn.model_selection.KFold`: KFold is a cross-validator that divides the dataset into k consecutive folds (without shuffling by default). Each fold is then used once as a validation while the k - 1 remaining folds form the training set.
- `sklearn.metrics.mean_absolute_error`: This function measures the average magnitude of the errors in a set of predictions, without considering their direction. It’s particularly useful as it’s the metric used to evaluate the performance of the model.
- `lightgbm as lgbm`: LightGBM is a gradient boosting framework that uses tree-based learning algorithms and is designed for distributed and efficient training, particularly on large datasets.
6. **Specific Functional Configurations**:
- `from lightgbm import *`: Imports all functions and classes from LightGBM directly into the namespace. This is generally not best practice due to potential naming conflicts; specific imports are preferable.
- `pd.set_option("display.max_columns", None)`: This pandas function is set to ensure that when dataframes are displayed, no columns are omitted in the output, regardless of how many columns are in the dataframe.
"""
# Commented out IPython magic to ensure Python compatibility.
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# %matplotlib inline
import warnings
warnings.filterwarnings('ignore')
import optuna
from sklearn.model_selection import KFold
from sklearn.metrics import mean_absolute_error
import lightgbm as lgbm
optuna.logging.set_verbosity(optuna.logging.WARNING)
from lightgbm import *
pd.set_option("display.max_columns", None)
"""This snippet of code handles the loading of datasets necessary for building and testing the predictive model for the Nasdaq Closing Price Prediction Challenge. Let's delve into each line to understand its functionality:
### Code Explanation
1. **Loading Training Data**:
```python
df_train = pd.read_csv('../input/optiver-trading-at-the-close/train.csv')
```
This line reads the `train.csv` file into a Pandas DataFrame called `df_train`. This dataset likely contains historical order book and auction data for a variety of stocks listed on the Nasdaq, and serves as the primary dataset for training the predictive model.
2. **Loading Testing Data**:
```python
df_test = pd.read_csv('../input/optiver-trading-at-the-close/example_test_files/test.csv')
```
Here, the `test.csv` file is loaded into a DataFrame called `df_test`. This file is used to test the model after training, allowing you to evaluate how well the model predicts new, unseen data. This dataset would mimic the structure of the training data but without revealing the target variables (i.e., the closing prices).
3. **Loading Sample Submission Format**:
```python
sample_sub = pd.read_csv('../input/optiver-trading-at-the-close/example_test_files/sample_submission.csv')
```
This line loads a sample submission file, `sample_submission.csv`, into a DataFrame called `sample_sub`. This file likely outlines the required format for submitting predictions to a competition or evaluation framework, showing the expected structure of predictions, typically including identifiers and predicted values.
4. **Loading Revealed Targets for Testing**:
```python
rev_target = pd.read_csv('../input/optiver-trading-at-the-close/example_test_files/revealed_targets.csv')
```
The `revealed_targets.csv` file is read into a DataFrame called `rev_target`. This dataset contains actual values of the targets for the test set, which are revealed for the purpose of model evaluation and validation post-prediction. It is used to calculate the accuracy metrics (such as MAE) to judge the model's performance.
### Purpose and Usage
- **Training and Testing**: The primary purpose of loading these datasets is to split the model's workflow into training and evaluation phases. `df_train` is used to fit the model, while `df_test` is crucial for predicting and testing the model's generalization capabilities on new data.
- **Evaluation**: The `revealed_targets.csv` allows for the direct comparison of the model’s predictions against actual outcomes, which is essential for iterative model tuning and refinement.
- **Submission**: The `sample_submission.csv` ensures that predictions are formatted correctly for submission, adhering to the specifications of the competition or project requirements.
"""
df_train = pd.read_csv('../input/optiver-trading-at-the-close/train.csv')
df_test = pd.read_csv('../input/optiver-trading-at-the-close/example_test_files/test.csv')
sample_sub = pd.read_csv('../input/optiver-trading-at-the-close/example_test_files/sample_submission.csv')
rev_target = pd.read_csv('../input/optiver-trading-at-the-close/example_test_files/revealed_targets.csv')
"""### Column Descriptions
1. **stock_id**:
- **Description**: A unique identifier for each stock.
- **Context**: Not every stock appears in every time bucket (a specific, often small, period of time during trading), making this identifier crucial for tracking the performance and data specific to each stock across different periods.
2. **date_id**:
- **Description**: A unique identifier for the date on which the trading data was recorded. These IDs are sequential and consistent across all stocks.
- **Context**: Allows the model to differentiate data across different trading days and to identify trends or patterns over time.
3. **imbalance_size**:
- **Description**: Represents the volume of shares that remain unmatched at the current reference price, expressed in USD.
- **Context**: A critical measure in understanding supply and demand dynamics at the closing auction, influencing the model's ability to predict price movements based on existing imbalances.
4. **imbalance_buy_sell_flag**:
- **Description**: A categorical flag indicating the direction of the auction imbalance:
- 1 for a buy-side imbalance (more demand than supply)
- -1 for a sell-side imbalance (more supply than demand)
- 0 for no imbalance
- **Context**: This indicator helps in predicting whether the price is likely to rise or fall at the close, based on whether there is excess buying pressure or selling pressure.
5. **reference_price**:
- **Description**: The price at which the number of paired shares is maximized, the imbalance is minimized, and the price is closest to the bid-ask midpoint.
- **Context**: Acts as a pivotal price point for the model, as it represents a theoretically optimal trading price considering current market conditions.
6. **matched_size**:
- **Description**: The total amount in USD that can be matched at the current reference price.
- **Context**: Indicates the volume of trades that can be executed without affecting the market price too significantly, crucial for understanding market liquidity.
7. **far_price, near_price, [bid/ask]_price, [bid/ask]_size**:
- **Description**: These are various price points and quantities in the order book.
    - **Far price**: the crossing price that would maximize the number of shares matched based on auction interest alone (continuous-market orders excluded).
    - **Near price**: the crossing price that would maximize the number of shares matched based on auction interest and continuous-market orders together.
- **[bid/ask]_price** are the highest buy and lowest sell prices respectively.
- **[bid/ask]_size** are the volumes available at these prices.
- **Context**: These metrics provide detailed insights into the order book's depth and the distribution of buy and sell orders around the reference price, informing predictions on price movement pressures.
8. **wap** (Weighted Average Price):
- **Description**: Calculated over a specific time frame within the non-auction book, it's a price that reflects the average price at which stocks are traded, weighted by volume.
- **Context**: WAP is used to gauge the average trading price over a period, often used in financial models to understand market trends and to normalize the impact of large trades on simple average price calculations.
9. **seconds_in_bucket**:
- **Description**: Measures the number of seconds since the start of the day’s closing auction, starting always from zero.
- **Context**: Useful for models that need to understand and predict price movements and market behavior at very specific intervals during the closing auction.
10. **target**:
- **Description**: The difference between the 60-second future movement in the stock's WAP and the 60-second future movement of a synthetic index, provided only in the training set.
- **Context**: Serves as the dependent variable in training the model. It represents the relative movement of a stock's price compared to the market, which is central to predicting future price movements effectively.
Understanding these columns and their interrelationships is essential for developing an effective predictive model that can accurately forecast stock price movements during the crucial final minutes of trading based on order book dynamics and auction data.
"""
df_train
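"""As a quick, optional sanity check (not part of the original notebook), the snippet below inspects the columns described above: shape, dtypes, missing-value rates, and the distribution of the imbalance flag. It assumes `df_train` has been loaded as shown earlier.

```python
# Shape and column dtypes of the training data.
print(df_train.shape)
print(df_train.dtypes)

# Missing-value rate per column; far_price / near_price are only populated
# late in the auction window, so a high NaN rate there is expected.
print(df_train.isna().mean().sort_values(ascending=False).head(10))

# Distribution of the buy/sell imbalance flag (1 = buy, -1 = sell, 0 = none).
print(df_train['imbalance_buy_sell_flag'].value_counts())
```
"""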
"""### Function: `feature_cols`
This function is designed to filter out certain columns from a given DataFrame and return the modified DataFrame.
```python
def feature_cols(df):
cols = [c for c in df.columns if c not in ['row_id', 'time_id', 'date_id']]
df = df[cols]
return df
```
#### Details:
- **Input Parameter**:
- `df`: A pandas DataFrame from which specific columns need to be excluded.
- **Process**:
- The function uses a list comprehension that iterates over all column names in the DataFrame and keeps only those that are not `'row_id'`, `'time_id'`, or `'date_id'`. These columns are identifiers that do not provide predictive power for the model (i.e., they are not features but merely identifiers or indexes).
- It then filters the DataFrame to include only the columns listed in `cols`, effectively removing any columns that might skew the model or are not useful as features.
- **Return**:
- The function returns the DataFrame with the specified columns removed, focusing the DataFrame on potentially relevant features for the model.
### Data Preprocessing
```python
df_train.fillna(0, inplace=True)
```
- **Description**: This line replaces all missing values (`NaN`s) in the `df_train` DataFrame with `0`. Handling missing values is crucial to avoid errors during the modeling process and can also impact the model’s performance.
- **`inplace=True`**: This parameter ensures that the modification is done in place and does not return a new DataFrame, thus directly updating `df_train`.
### Feature Selection and Target Separation
```python
x_train = feature_cols(df_train.drop(columns='target'))
y_train = df_train['target'].values
```
- **Feature DataFrame (`x_train`)**:
- `df_train.drop(columns='target')`: Drops the 'target' column from `df_train`, as the target column is what the model is trying to predict and should not be used as a feature.
- `feature_cols(...)`: Applies the `feature_cols` function to the result, further filtering out the non-feature columns ('row_id', 'time_id', 'date_id'), and assigns the result to `x_train`.
- **Target Array (`y_train`)**:
- `df_train['target'].values`: Extracts the target values from the `df_train` DataFrame. This creates a NumPy array of the target variable, which is used as the dependent variable in model training.
### Summary
This setup is typical in supervised machine learning tasks where the goal is to predict a target variable based on a set of features. The code effectively prepares the dataset by cleaning up non-feature columns, handling missing values, and segregating features and targets, which is critical for the subsequent model training phase.
"""
def feature_cols(df):
    cols = [c for c in df.columns if c not in ['row_id', 'time_id', 'date_id']]
    df = df[cols]
    return df
df_train.fillna(0, inplace = True)
x_train = feature_cols(df_train.drop(columns='target'))
y_train = df_train['target'].values
"""### Model Initialization
```python
lgbm_model = lgbm.LGBMRegressor(objective='mae', n_estimators=500, random_state=1234)
```
- **`LGBMRegressor`**:
- This class implements the LightGBM regressor. A regressor predicts continuous values, which is appropriate for predicting stock prices as continuous numerical data.
- **Parameters**:
- **`objective='mae'`**: Specifies the loss function to be minimized in the learning process. Here, 'mae' stands for Mean Absolute Error, which aligns with the project’s evaluation criteria. It measures the average magnitude of errors in a set of predictions, without considering their direction (i.e., whether they are over or underestimates).
- **`n_estimators=500`**: This defines the number of boosting stages the model has to go through. More trees can lead to a more accurate model but can also cause overfitting if not handled correctly. In this context, 500 trees are chosen to balance between bias and variance.
- **`random_state=1234`**: This parameter ensures reproducibility of the model’s results by providing a fixed seed for the random number generator, which influences aspects of model training like the selection of features at each split.
### Model Training
```python
lgbm_model.fit(x_train, y_train)
```
- **Description**:
- This line fits the LightGBM model to the training data. The `fit` method adjusts the weights of the model over the specified number of boosting rounds (`n_estimators`) to minimize the specified loss function.
- **Parameters**:
- **`x_train`**: Feature matrix (independent variables) used for training the model.
- **`y_train`**: Target variable (dependent variable) the model needs to predict.
### Importance in the Context of the Project
The use of the LightGBM model in this project is particularly well-suited for several reasons:
- **Efficiency**: LightGBM is known for its high efficiency with large data sets and handles large volumes of data faster than many other implementations of gradient boosting.
- **Handling Sparse Data**: Given the potentially large and sparse nature of financial data (like order books), LightGBM’s ability to handle sparse data effectively is beneficial.
- **Gradient-based Learning**: The model’s learning is based on identifying errors from previous trees and improving on them, which is effective for complex patterns like those found in stock price movements.
Training this model on the defined features and target prepares it to forecast the closing prices of Nasdaq-listed stocks, providing crucial insights into short-term price movements essential for traders and financial analysts.
"""
lgbm_model = lgbm.LGBMRegressor(objective='mae', n_estimators=500, random_state=1234)
lgbm_model.fit(x_train, y_train)
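"""The baseline above is fitted on the full training set and evaluated only via the public leaderboard. As a rough local check, one could hold out the most recent dates as a validation set. The sketch below is illustrative only — the 90/10 cutoff and the `holdout_model` name are assumptions, not the author's validation scheme — and it reuses `feature_cols` and `mean_absolute_error` from above.

```python
# Hypothetical date-based holdout: train on the earliest ~90% of date_ids,
# validate on the most recent ~10%.
cutoff = df_train['date_id'].quantile(0.9)
train_part = df_train[df_train['date_id'] <= cutoff]
valid_part = df_train[df_train['date_id'] > cutoff]

x_tr = feature_cols(train_part.drop(columns='target'))
y_tr = train_part['target'].values
x_va = feature_cols(valid_part.drop(columns='target'))
y_va = valid_part['target'].values

holdout_model = lgbm.LGBMRegressor(objective='mae', n_estimators=500, random_state=1234)
holdout_model.fit(x_tr, y_tr)
print('holdout MAE:', mean_absolute_error(y_va, holdout_model.predict(x_va)))
```
"""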
"""### Function: `lgbm.plot_importance`
This function is part of the LightGBM framework and is used to plot the importance of each feature used by the model. The importance can be derived in different ways, and in this case, the importance type specified is "gain".
- **`lgbm_model`**: This is the trained LightGBM model from which the feature importance is calculated.
- **`importance_type="gain"`**: Specifies the type of importance measure to be used. "Gain" refers to the total gains of splits which use the feature. Essentially, it measures the contribution of each feature to the model by calculating how much each feature's splits improve the performance measure (in this case, the reduction in loss or "gain").
### Understanding "Gain"
- **Gain (also known as 'split gain')**: This is the improvement in accuracy brought by a feature to the branches it is on. The idea is that before making a split on a feature, the model calculates how much using this feature would reduce the loss (mean absolute error, in this case). A higher gain implies a more significant contribution of the feature to making more accurate predictions.
### Significance of Feature Importance Visualization
- **Model Interpretability**: This visualization helps in understanding which features are most influential in predicting the target variable. For financial modeling, where interpretability is crucial for trust and understanding, knowing which features affect predictions most can guide further data collection, feature engineering, and model tweaking.
- **Feature Engineering**: Insights from feature importance can lead to improved feature engineering. Features with low importance might be candidates for removal or modification, while understanding high-importance features might lead to the creation of new features that enhance the model’s predictive power.
- **Strategic Decisions**: In trading, understanding which features (e.g., aspects of the order book, price movements, etc.) most influence predictions can help in formulating more effective trading strategies.
### Practical Use
To execute this function properly and ensure that the plot displays as intended, make sure your environment has plotting support; in environments without GUI support, additional settings (such as a non-interactive matplotlib backend) may be needed. Also, ensure that the LightGBM library is correctly installed and imported.
The result of this function is a bar chart where each feature is listed along with its importance score. Features are typically sorted in descending order of importance, making it clear which are the most critical for the model’s predictions. This visual tool is invaluable for presentations and reports to stakeholders, providing a clear and intuitive way to discuss model dynamics.
"""
lgbm.plot_importance(lgbm_model, importance_type="gain")
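"""If the raw numbers are preferred over the plot, the same gain-based importances can be pulled from the fitted model's underlying booster. A small sketch (the `gain_imp` name is just illustrative):

```python
# Gain-based importances as a sorted table instead of a bar chart.
gain_imp = pd.DataFrame({
    'feature': lgbm_model.booster_.feature_name(),
    'gain': lgbm_model.booster_.feature_importance(importance_type='gain'),
}).sort_values('gain', ascending=False)
print(gain_imp.head(10))
```
"""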
"""The function `lgbm.plot_importance` with the parameter `importance_type="split"` is used to visualize the importance of each feature in the trained LightGBM model based on the number of times each feature is used to split the data across all trees. This provides a different perspective on feature importance compared to the "gain" method. Here’s what you need to know about using this particular function and parameter:
### Function: `lgbm.plot_importance`
- **Parameter**: `importance_type="split"`
- When set to "split", the importance of each feature is calculated based on how often the feature is used to split data points at a tree node, across all trees in the model. Essentially, it counts the number of times a feature is selected to make a decision in the tree.
### Significance of "Split" as an Importance Measure
- **Usage Frequency**: This measure provides insight into how frequently a feature is used in the tree models, irrespective of the magnitude of its impact (or gain). A feature that is used very often to make splits might be considered crucial for the decision process within the model.
- **Interpretability**: Understanding which features are most frequently used can help in assessing the reliance of the model on certain data points or feature types. For example, in stock price prediction, if the model frequently splits on features related to volume, it suggests a strong dependency on trading volume for making predictions.
### Practical Uses of Feature Importance (Split)
- **Model Simplification**: If some features are rarely used to make splits, it might indicate that they are not contributing much to the model's decisions, providing a basis for potentially simplifying the model by removing these less important features.
- **Feature Engineering**: By identifying which features are most frequently used, you can focus your feature engineering efforts to enhance these features or create new features that are similar in nature but might capture additional nuances.
- **Model Validation**: Frequent use of intuitive or expected features for splits can serve as a sanity check, validating that the model is considering relevant factors (e.g., certain market indicators in stock trading models).
### Visualization and Output
When you execute `lgbm.plot_importance`, the function typically produces a bar chart. Each bar represents a feature with its length proportional to the count of times the feature has been used in splits across all boosting rounds (trees). The features are generally ordered by their importance, with the most frequently used feature at the top.
### Code Execution
To ensure the plot is correctly generated and displayed, consider the following:
- Ensure your Python environment supports plotting; for Jupyter notebooks or IPython environments, `%matplotlib inline` should be used to display plots within the notebook.
- Validate that the `lightgbm` library is correctly installed and imported.
- If running in a non-interactive environment or needing to save the plot to a file, additional code may be required to handle these aspects.
This visualization tool is particularly useful for presentations or detailed analysis where understanding the structure and decision-making process of the model is crucial.
"""
# splits means number of times a feature is used to split the data across all trees in the model
lgbm.plot_importance(lgbm_model, importance_type="split")
"""If you submit at this point, the public score will show **5.3888**, which is not so bad. Let's try some approaches from this baseline.
# 2. Optimize parameter with Optuna
Changing the approach here, let's see how the score improves when we optimize the parameters. I used Optuna for the optimization; as a result, the public score improved to **5.3878**.
Optuna is an open-source optimization library designed specifically for automating the process of optimizing the hyperparameters of machine learning algorithms. It is highly regarded for its efficiency and flexibility in tuning parameters to enhance the performance of models. Here’s a detailed look at what Optuna offers and why it's beneficial for machine learning projects:
### Key Features of Optuna:
1. **Automatic Hyperparameter Optimization**:
- Optuna automates the tedious process of manually searching for the best hyperparameters, using sophisticated algorithms to explore the parameter space efficiently.
2. **Efficient Search Algorithms**:
- Optuna supports several state-of-the-art algorithms for hyperparameter optimization, including the Tree-structured Parzen Estimator (TPE), Covariance Matrix Adaptation Evolution Strategy (CMA-ES), and random search. These algorithms predict which hyperparameters are likely to yield better results and focus the search around those areas.
3. **Easy Parallelization**:
- One of the strengths of Optuna is its support for easy parallelization, allowing users to speed up their optimization processes by running trials simultaneously across multiple processors or even across different machines.
4. **Pruning of Trials**:
- Optuna provides an automatic trial pruning feature which can terminate poorly performing trials early. This feature is useful in saving computational resources and focusing efforts on more promising parameter sets.
5. **Visualization**:
- The library includes functions for visualizing the optimization process, such as plots for the history of trials, parallel coordinate plots of parameter relationships, and importance plots for assessing which parameters are most influential in achieving the best performance.
6. **User-Friendly**:
- Despite its sophisticated capabilities, Optuna is designed to be user-friendly. It allows for defining the search space using Pythonic APIs and integrates seamlessly with existing Python data science ecosystems like NumPy, Pandas, and major machine learning frameworks like PyTorch, TensorFlow, and Scikit-learn.
### Usage in Machine Learning Projects:
Optuna is particularly useful in projects where the optimal combination of parameters is not known in advance and where manual tuning could be impractical due to the vastness of the parameter space or the complexity of the model. For instance, when optimizing a LightGBM model (as in the Nasdaq Closing Price Prediction Challenge), Optuna can systematically and efficiently explore different combinations of parameters like `num_leaves`, `max_depth`, `learning_rate`, etc., to find the configuration that minimizes the error or maximizes the accuracy of predictions.
### Code Breakdown
#### Data Copy
```python
x = x_train.copy()
y = y_train.copy()
```
- **Purpose**: Creates copies of the training data (`x_train`) and target variable (`y_train`). This is generally done to avoid modifying the original data during the optimization process, ensuring data integrity throughout the experimentation.
#### Objective Function
```python
def objective(trial):
params = {
'random_seed': 123,
'n_estimators': trial.suggest_int('n_estimators', 300, 1000),
'num_leaves': trial.suggest_int('num_leaves', 4, 32),
'max_depth': trial.suggest_int("max_depth", 1, 10)}
model = lgbm.LGBMRegressor(**params)
model.fit(x, y)
y_pred = model.predict(x)
score = mean_absolute_error(y, y_pred)
return score
```
- **Function**: `objective(trial)`
- **Purpose**: This function defines the objective for optimization, which Optuna aims to minimize—in this case, the mean absolute error (MAE) between the predicted and actual values.
- **Parameters**:
- `random_seed`: Ensures reproducibility.
- `n_estimators`: Number of boosted trees to fit. Suggested range is 300 to 1000.
- `num_leaves`: Maximum number of leaves in one tree. Suggested range is 4 to 32.
- `max_depth`: Maximum depth of a tree. Suggested range is 1 to 10.
- **Process**: The function initializes a LightGBM regressor with the suggested parameters, fits the model to the training data, and computes the MAE on that same training set. Note that scoring on the data the model was just fitted to tends to favor overfitting configurations; a held-out split or cross-validation inside the objective gives a more reliable signal (see the sketch after the code below).
#### Optuna Study Creation and Optimization
```python
#study = optuna.create_study(sampler=optuna.samplers.RandomSampler(seed=123))
#study.optimize(objective, n_trials=50)
#study.best_params
```
- **Purpose**: These commented-out lines are used to create an Optuna study that manages the optimization process.
- `create_study()`: Sets up the optimization framework with a random sampler, which selects parameter values randomly and ensures the reproducibility with a seed.
- `optimize()`: Executes the optimization over a specified number of trials (`n_trials=50`), where each trial evaluates the objective function with a different set of parameters.
- `best_params`: This attribute of the study object stores the best parameter values found during the optimization.
#### Notes
- The process is commented out due to its time-consuming nature (taking a couple of hours to complete). For practical implementations, especially in development environments, such lengthy computations are typically run in dedicated sessions, possibly on optimized hardware or cloud resources.
#### Suggestions for Use
- **Uncomment and Run**: If you have the resources and time, uncomment these lines to perform the optimization and potentially improve your model.
- **Experimentation**: After running the initial optimization, you might want to further refine the ranges or try optimizing additional parameters based on the initial results.
"""
x = x_train.copy()
y = y_train.copy()

def objective(trial):
    params = {
        'random_seed': 123,
        'n_estimators': trial.suggest_int('n_estimators', 300, 1000),
        'num_leaves': trial.suggest_int('num_leaves', 4, 32),
        'max_depth': trial.suggest_int('max_depth', 1, 10)}
    model = lgbm.LGBMRegressor(**params)
    model.fit(x, y)
    y_pred = model.predict(x)
    score = mean_absolute_error(y, y_pred)
    return score
#study = optuna.create_study(sampler=optuna.samplers.RandomSampler(seed=123))
#study.optimize(objective, n_trials=50)
#study.best_params
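"""As noted above, the objective scores the model on the same rows it was trained on. A minimal variant is sketched below under the assumption of a date-based 90/10 holdout; the `objective_holdout` name and the cutoff are illustrative, not the author's setup.

```python
# Score each Optuna trial on a held-out slice of dates so the search does
# not simply reward overfitting to the training rows.
mask = (df_train['date_id'] <= df_train['date_id'].quantile(0.9)).values

def objective_holdout(trial):
    params = {
        'objective': 'mae',
        'random_state': 123,
        'n_estimators': trial.suggest_int('n_estimators', 300, 1000),
        'num_leaves': trial.suggest_int('num_leaves', 4, 32),
        'max_depth': trial.suggest_int('max_depth', 1, 10),
    }
    model = lgbm.LGBMRegressor(**params)
    model.fit(x[mask], y[mask])
    return mean_absolute_error(y[~mask], model.predict(x[~mask]))

# study = optuna.create_study(direction='minimize',
#                             sampler=optuna.samplers.TPESampler(seed=123))
# study.optimize(objective_holdout, n_trials=50)
# study.best_params
```
"""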
"""# 3. Add imbalance_size
Next, let's see how the score is improved by adding new features.
As you can see the Light GBM "importance" in the above Section 1, not Price-related but Size-related features were regarded as important by LightGBM, thus try to create Size-related new features. First, let's create the ratio between imbalance_size and matched_size.
### Concept of Feature Engineering
**Feature Engineering** is a critical aspect of model development in machine learning, particularly in fields like finance where market dynamics can be complex. Creating new features can help capture additional insights from the data that are not immediately apparent but may significantly influence the outcome.
### Code Explanation
Although the specific code snippet for creating the `imbalance_ratio` feature is commented out, here’s a breakdown:
```python
# def pre_process1(df):
# df['imbalance_ratio'] = df['imbalance_size'] / df['matched_size']
# return df
```
#### Function: `pre_process1`
- **Purpose**: This function adds a new column to the DataFrame `df` that represents the ratio of `imbalance_size` to `matched_size`.
- **New Feature**: `imbalance_ratio`
- **Definition**: It is calculated as the division of `imbalance_size` by `matched_size`.
- **`imbalance_size`**: This could represent the volume of shares that remain unmatched at the current reference price.
- **`matched_size`**: This typically indicates the volume of shares that can be matched at the current reference price.
### Significance of the `imbalance_ratio` Feature
- **Insight into Market Dynamics**: This ratio provides insight into the relative size of unmatched orders compared to matched orders at the reference price, potentially signaling market pressure (either buying or selling pressure) that isn't fully resolved by current order matches.
- **Indicator of Market Sentiment**: A high `imbalance_ratio` might suggest a strong imbalance in buy or sell orders that could affect the stock price shortly, especially during the closing auction when liquidity and volatility are high.
### Improvement in Model Performance
- **Public Score Improvement**: The noted improvement in the public score (to **5.3866**) suggests that the `imbalance_ratio` provides meaningful information that enhances the model’s ability to predict stock price movements accurately. In machine learning competitions and real-world applications, even small improvements in score can be significant, reflecting better alignment of the model with underlying patterns in the data.
### Utilizing the Feature
To utilize this feature effectively:
- **Uncomment and Integrate**: To apply this preprocessing step, you would uncomment the function and apply it to your data frames where needed (both training and testing datasets).
- **Model Re-training**: After integrating this new feature, re-train your model to ensure that it learns to use this new information.
- **Continuous Evaluation**: Continuously evaluate the impact of this new feature on model performance, using validation sets or through cross-validation, ensuring that it genuinely improves the model rather than fitting noise.
This approach is an excellent example of iterative model improvement through feature engineering, highlighting how domain insights (like the importance of size-related features in trading models) can lead directly to tangible enhancements in predictive accuracy.
"""
#def pre_process1(df):
# df['imbalance_ratio'] = df['imbalance_size'] / df['matched_size']
# return df
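"""One caveat worth keeping in mind when this feature is enabled: rows where `matched_size` is zero would make the ratio infinite, and the later `fillna(0)` call does not touch `inf` values. A guarded variant is sketched below (the `pre_process1_safe` name is just illustrative):

```python
# Guarded version of the ratio: map +/-inf (from matched_size == 0) to NaN
# so that the subsequent fillna(0) handles those rows as well.
def pre_process1_safe(df):
    df['imbalance_ratio'] = (df['imbalance_size'] / df['matched_size']).replace(
        [np.inf, -np.inf], np.nan)
    return df
```
"""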
"""# 4. Add imbalance_size
Then, let's try to add 2 more features related to imbalance between bid-size and ask-size, which will improve Public score to **5.3852**.
For these features, I referred to below great notebook.
https://www.kaggle.com/code/renatoreggiani/optv-lightgbm
### Overview of Feature Engineering Steps
The snippet shows an expanded version of feature engineering where new features are derived from the order book data, focusing on the imbalance and differences between bid and ask sizes, as well as their cumulative and differential impacts on stock price movements.
### Explanation of Each Feature
#### 1. `imbalance_ratio`
- **Definition**: The ratio of `imbalance_size` to `matched_size`.
- **Purpose**: Measures the proportion of unmatched orders to matched orders, providing insight into the market's directional pressure.
#### 2. `imbl_size1`
- **Definition**: The normalized difference between `bid_size` and `ask_size`.
- **Formula**: `(df['bid_size'] - df['ask_size']) / (df['bid_size'] + df['ask_size'])`
- **Purpose**: Captures the net order flow direction, indicating whether buying or selling pressure is dominant.
#### 3. `imbl_size2`
- **Definition**: The normalized difference between `imbalance_size` and `matched_size`.
- **Formula**: `(df['imbalance_size'] - df['matched_size']) / (df['imbalance_size'] + df['matched_size'])`
- **Purpose**: Similar to `imbalance_ratio` but focuses on the relative difference rather than the ratio, providing another perspective on market liquidity and order imbalance.
### Additional Features Considered (but not always effective)
- **`bid_size_diff` and `ask_size_diff`**: Attempt to capture the sequential changes in bid and ask sizes, respectively, which could reflect momentum or shifts in market sentiment. However, these features are noted to not perform well.
- **`bid_size_over_ask_size` and `bid_price_over_ask_price`**: These features aim to directly compare the bid and ask sides, potentially useful in models that are sensitive to such direct ratios.
### Feature Engineering Process
The function `pre_process1` is systematically updated to include these new features, and through testing and validation, their impact on the model's predictive accuracy is assessed. As noted, each group of features contributes to an incremental improvement in the model's public score, indicating their effectiveness.
### Applying Feature Engineering in the Model
```python
df_train = pre_process1(df_train)
df_train = feature_cols(df_train)
df_train.fillna(0, inplace = True)
df_train
```
- **Preprocessing**: Apply the `pre_process1` function to add new features.
- **Feature Selection**: Use the `feature_cols` function to filter the DataFrame, ensuring that only relevant features are included.
- **Handling Missing Data**: Fill any NaN values with zero, a necessary step to prepare the data for modeling without errors due to missing values.
### Conclusion
The iterative approach to adding and testing new features as shown in this example is a cornerstone of effective machine learning practices, particularly in complex domains like financial markets where the dynamics are influenced by numerous and often subtle factors. Each feature is an attempt to encapsulate some aspect of market behavior, and their validation through improved scores demonstrates their utility in enhancing model performance.
"""
#def pre_process1(df):
#
# df['imbl_size1'] = (df['bid_size']-df['ask_size']) / (df['bid_size']+df['ask_size'])
# df['imbl_size2'] = (df['imbalance_size']-df['matched_size']) / (df['imbalance_size']+df['matched_size'])
#
# return df
# Original
# def pre_process1(df):
# df['imbalance_ratio'] = df['imbalance_size'] / df['matched_size']
# #---> improve 0.0012
# df['imbl_size1'] = (df['bid_size']-df['ask_size']) / (df['bid_size']+df['ask_size'])
# df['imbl_size2'] = (df['imbalance_size']-df['matched_size']) / (df['imbalance_size']+df['matched_size'])
# #---> improve 0.0014
# df['bid_size_diff'] = df[["stock_id", "date_id", "bid_size"]].groupby(["stock_id","date_id"]).diff()
# df['ask_size_diff'] = df[["stock_id", "date_id", "ask_size"]].groupby(["stock_id","date_id"]).diff()
# #<--- "diff" doesn't work well
# df["bid_size_over_ask_size"] = df["bid_size"].div(df["ask_size"])
# df["bid_price_over_ask_price"] = df["bid_price"].div(df["ask_price"])
# #---> improve 0.0018
# return df
# Edited
def pre_process1(df):
    df['imbalance_ratio'] = df['imbalance_size'] / df['matched_size']
    #---> improve 0.0012
    df['imbl_size1'] = (df['bid_size'] - df['ask_size']) / (df['bid_size'] + df['ask_size'])
    df['imbl_size2'] = (df['imbalance_size'] - df['matched_size']) / (df['imbalance_size'] + df['matched_size'])
    #---> improve 0.0014
    # df['bid_size_diff'] = df[["stock_id", "date_id", "bid_size"]].groupby(["stock_id", "date_id"]).diff()
    # df['ask_size_diff'] = df[["stock_id", "date_id", "ask_size"]].groupby(["stock_id", "date_id"]).diff()
    # #<--- "diff" doesn't work well
    # df["bid_size_over_ask_size"] = df["bid_size"].div(df["ask_size"])
    # df["bid_price_over_ask_price"] = df["bid_price"].div(df["ask_price"])
    #---> improve 0.0018
    return df
df_train = pre_process1(df_train)
df_train = feature_cols(df_train)
df_train.fillna(0, inplace = True)
df_train
lgbm.plot_importance(lgbm_model, importance_type="gain")
"""# 5. Add bid/ask ratio in size and price
Adding ratios between bid and ask in terms of both price and size as new features in your predictive model for stock price movements reflects a strategic move in feature engineering. These types of features can capture essential aspects of market sentiment and liquidity that are not explicitly represented by individual size or price features. Let's delve into how these features are conceptualized, their potential impact, and the importance of avoiding redundant calculations.
### Concept of Bid/Ask Ratios
1. **Bid/Ask Size Ratio**: This ratio compares the total quantity of buy orders (bids) to the total quantity of sell orders (asks). A higher ratio indicates a dominance of buy orders, which could be interpreted as a bullish signal, whereas a lower ratio might suggest bearish sentiment.
2. **Bid/Ask Price Ratio**: This compares the highest price buyers are willing to pay (bid price) to the lowest price sellers are willing to accept (ask price). This ratio can indicate the immediate direction the market participants expect the stock to move. A ratio close to or greater than 1 might suggest that buyers are willing to pay a price close to or higher than sellers' lowest asking price, potentially driving the price upwards.
### Implementation and Improvement
By adding these ratios, you are essentially trying to leverage the structural information in the order book data that might not be fully utilized by simpler models. LightGBM can indeed consider nonlinear interactions between features, but explicitly modeling interactions that are known to be predictive in financial contexts (like these ratios) can often lead to more robust predictions.
### Code Example for Adding Bid/Ask Ratios
Here's a refined version of how you might implement these features in your preprocessing function:
```python
def pre_process1(df):
# Calculate bid/ask ratios only once to avoid redundancy
df['bid_ask_size_ratio'] = df['bid_size'] / df['ask_size']
df['bid_ask_price_ratio'] = df['bid_price'] / df['ask_price']
# Calculate normalized differences for size and imbalance
df['imbl_size1'] = (df['bid_size'] - df['ask_size']) / (df['bid_size'] + df['ask_size'])
df['imbl_size2'] = (df['imbalance_size'] - df['matched_size']) / (df['imbalance_size'] + df['matched_size'])
return df
```
### Addressing Redundancy and Efficiency
As noted, redundant calculations can be a significant inefficiency in data preprocessing, especially with large datasets typical in financial modeling. Here are strategies to address this:
- **Avoid Repeated Calculations**: Ensure that each unique calculation is only done once, and if the result is needed again, store it rather than recalculating. This approach saves computational resources and execution time.
- **Use Caching**: For more complex or expensive calculations that are used multiple times across different parts of your application or model training process, consider implementing caching. This can be done at the code level using decorators like `@lru_cache` from Python's `functools` or by manually saving results to a temporary data structure.
### Impact on Model Performance
Improving the public score to **5.3834** by adding these features suggests that these aspects of the trading dynamics are crucial in predicting closing prices accurately. This confirms the importance of careful feature selection based on domain knowledge and the behavior of the underlying model (in this case, LightGBM).
By refining the feature engineering process to focus on meaningful relationships and interactions within the data while avoiding unnecessary recalculations, you optimize both the efficiency and effectiveness of your predictive modeling efforts.
"""
x_train = feature_cols(df_train.drop(columns='target'))
y_train = df_train['target'].values
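"""Note that the importance plot at the end of Section 4 still reflects the baseline model fitted in Section 1, before the engineered columns existed. To actually inspect the importance of `imbalance_ratio`, `imbl_size1`, and `imbl_size2`, the model has to be refitted on the expanded `x_train`. A brief sketch (the `lgbm_model_fe` name is illustrative):

```python
# Refit on the expanded feature set so the engineered columns show up in
# the gain-based importance plot.
lgbm_model_fe = lgbm.LGBMRegressor(objective='mae', n_estimators=500, random_state=1234)
lgbm_model_fe.fit(x_train, y_train)
lgbm.plot_importance(lgbm_model_fe, importance_type="gain")
```
"""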
"""I have outlined the process of hyperparameter tuning using GridSearchCV, which is part of the scikit-learn library. This technique is used to find the optimal hyperparameters for the LightGBM model aiming to predict stock prices with the lowest Mean Absolute Error (MAE). Let's go through the main components of this code and discuss each step:
### Step-by-Step Breakdown
1. **Hyperparameter Grid Definition**:
- **`param_grid`** is a dictionary where keys are the names of parameters to tune, and values are the ranges of values to test for each parameter. For this model, the parameters being tuned are:
- `n_estimators`: The number of boosting stages the model will go through. More stages increase the model's complexity and potential accuracy but can lead to overfitting.
- `num_leaves`: The maximum number of leaves in one tree. Increasing this number can make the model more detailed but may cause overfitting.
- `max_depth`: The maximum depth of each tree. Deeper trees can learn more specific patterns but might overfit on the training data.
2. **Grid Search Setup**:
- **`GridSearchCV`**:
- `estimator`: Here, `lgbm.LGBMRegressor(objective='mae')` specifies that the model is a LightGBM regressor with the objective set to minimize the mean absolute error, which is relevant for regression problems where you want to minimize the error magnitude without considering direction.
- `param_grid`: The grid of parameters to test.
- `cv=5`: Specifies that 5-fold cross-validation should be used. In 5-fold cross-validation, the data is split into 5 parts, with each part being used as a validation set once while the remaining 4 parts form the training set. This method helps ensure that the model's performance is stable across different subsets of the data.
3. **Fitting the Grid Search**:
- **`grid_search.fit(x_train, y_train)`**: This command starts the grid search process. The model will be trained multiple times with different combinations of parameters from `param_grid`. Each combination will be evaluated using 5-fold cross-validation to determine its effectiveness.
4. **Best Parameters and Model Training**:
- **`best_params = grid_search.best_params_`**: After the grid search completes, you can retrieve the best parameter set that led to the lowest average cross-validation error.
- **Creating and Training a New Model with the Best Parameters**:
- `lgbm.LGBMRegressor(objective='mae', **best_params)`: This initializes a new LightGBM regressor using the best parameters found.
- `.fit(x_train, y_train)`: Fits the model to the entire training dataset using these optimized parameters.
### Significance
This approach is particularly beneficial for refining model performance, ensuring that you are using the best possible parameters for your specific dataset and problem. By systematically searching through a predefined space of parameter values with cross-validation, GridSearchCV helps avoid overfitting and ensures that the model's performance is robust across different data samples.
### Conclusion
Using GridSearchCV for hyperparameter tuning is a robust method for improving the predictive power of machine learning models. It automates the laborious process of manually searching for the best model settings, leading to more effective and reliable predictions, which is crucial in high-stakes fields like stock price prediction.
"""
# import numpy as np
# from sklearn.model_selection import GridSearchCV
# param_grid = {
# 'n_estimators': [500, 1000, 2000],
# 'num_leaves': [25, 50, 100],
# 'max_depth': [5, 7, 10]
# }
# grid_search = GridSearchCV(estimator=lgbm.LGBMRegressor(objective='mae'), param_grid=param_grid, cv=5)
# grid_search.fit(x_train, y_train)
# best_params = grid_search.best_params_
# # Create and train a new model with the best hyperparameters
# lgbm_model = lgbm.LGBMRegressor(objective='mae', **best_params)
# lgbm_model.fit(x_train, y_train)
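"""One detail worth adding if this grid search is run: `GridSearchCV` scores a regressor with R² by default, so to select parameters by MAE (as the discussion above intends) the `scoring` argument should be set explicitly. A sketch under that assumption:

```python
# Same grid as above, but ranking candidates by (negated) mean absolute error.
from sklearn.model_selection import GridSearchCV

param_grid = {
    'n_estimators': [500, 1000, 2000],
    'num_leaves': [25, 50, 100],
    'max_depth': [5, 7, 10],
}
grid_search = GridSearchCV(
    estimator=lgbm.LGBMRegressor(objective='mae'),
    param_grid=param_grid,
    scoring='neg_mean_absolute_error',
    cv=5,
)
# grid_search.fit(x_train, y_train)
# best_params = grid_search.best_params_
```
"""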
# !pip install hyperopt --upgrade
"""Hyperopt is a powerful tool for optimizing model parameters via various search algorithms, such as Tree-structured Parzen Estimator (TPE), which is used in this case. Let's discuss each component of the code to understand how it contributes to optimizing the LightGBM model:
### Import Statements
```python
# import hyperopt
# from lightgbm import LGBMRegressor
```
- These lines import the Hyperopt library and the LGBMRegressor class from LightGBM. Commented out here, but they are necessary to run the code.
### Objective Function
```python
def objective(params):
model = LGBMRegressor(objective='mae',
n_estimators=params['n_estimators'],
num_leaves=params['num_leaves'],
max_depth=params['max_depth'])
model.fit(x_train, y_train)
y_pred = model.predict(x_train)
mae = mean_absolute_error(y_train, y_pred)
return mae
```
- **Purpose**: Defines the function that Hyperopt will minimize. Here, it trains a LightGBM model with given parameters and calculates the mean absolute error (MAE) on the training set.
- **Parameters**: Takes a dictionary `params` that includes settings for `n_estimators`, `num_leaves`, and `max_depth`.
### Search Space
```python
search_space = {
'n_estimators': hyperopt.hp.choice('n_estimators', range(500, 1000)),
'num_leaves': hyperopt.hp.choice('num_leaves', range(20, 50)),
'max_depth': hyperopt.hp.choice('max_depth', range(5, 10))
}
```
- Defines the hyperparameter space over which to search. `hyperopt.hp.choice` specifies a list of discrete values for each parameter. Hyperopt will test different combinations of these values to find the set that results in the lowest MAE.
### Trials Object and Optimization Call
```python
trials = hyperopt.Trials()
best_hyperparams = hyperopt.fmin(objective, search_space, algo=hyperopt.tpe.suggest, max_evals=13, trials=trials)
```
- **`Trials()`**: Stores details of each trial, including parameters and the resulting MAE.
- **`fmin()`**: Runs the optimization process, using the TPE algorithm (`hyperopt.tpe.suggest`) over 13 evaluations.
### Extract Best Parameters
```python
best_hyperparams = trials.best_trial['misc']['vals']
```
- Extracts the parameter settings recorded for the best trial. Note that with `hp.choice`, the values stored under `'misc'['vals']` (and those returned by `fmin`) are indices into the candidate lists, each wrapped in a single-element list, rather than the parameter values themselves; see the sketch after the code below for converting them back with `hyperopt.space_eval`.
### Train the Model with Best Parameters
```python
lgbm_model = LGBMRegressor(objective='mae',
n_estimators=best_hyperparams['n_estimators'],
num_leaves=best_hyperparams['num_leaves'],
max_depth=best_hyperparams['max_depth'])
lgbm_model.fit(x_train, y_train)
```
- Initializes a new LGBMRegressor with the best parameters found and fits it to the training data.
### Predictions
```python
y_pred = lgbm_model.predict(x_test)
```
- Makes predictions using the optimized model on the test data (`x_test`).
### Conclusion
This approach provides a systematic way to tune model parameters using Hyperopt, which can lead to significant improvements in model performance by carefully searching the parameter space. It's particularly useful in scenarios where manual tuning is impractical due to the large number of combinations and the complexity of interactions between parameters.
"""
# import hyperopt
# from lightgbm import LGBMRegressor
# def objective(params):
# model = LGBMRegressor(objective='mae',
# n_estimators=params['n_estimators'],
# num_leaves=params['num_leaves'],
# max_depth=params['max_depth'])
# model.fit(x_train, y_train)
# y_pred = model.predict(x_train)
# mae = mean_absolute_error(y_train, y_pred)
# return mae
# search_space = {
# 'n_estimators': hyperopt.hp.choice('n_estimators', range(500, 1000)),
# 'num_leaves': hyperopt.hp.choice('num_leaves', range(20, 50)),
# 'max_depth': hyperopt.hp.choice('max_depth', range(5, 10))
# }
# trials = hyperopt.Trials()
# best_hyperparams = hyperopt.fmin(objective, search_space, algo=hyperopt.tpe.suggest, max_evals=13, trials=trials)
# best_hyperparams = trials.best_trial['misc']['vals']
# lgbm_model = LGBMRegressor(objective='mae',
# n_estimators=best_hyperparams['n_estimators'],
# num_leaves=best_hyperparams['num_leaves'],
# max_depth=best_hyperparams['max_depth'])
# lgbm_model.fit(x_train, y_train)
# y_pred = lgbm_model.predict(x_test)
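"""If the Hyperopt block above is uncommented, one adjustment is needed before the final fit: with `hp.choice`, both the dictionary returned by `fmin` and `trials.best_trial['misc']['vals']` hold indices into the candidate lists, not the parameter values themselves. `hyperopt.space_eval` converts those indices back. A sketch assuming `objective` and `search_space` are defined as above:

```python
# Map the indices returned by fmin (hp.choice) back to actual parameter values.
import hyperopt

trials = hyperopt.Trials()
best_indices = hyperopt.fmin(objective, search_space,
                             algo=hyperopt.tpe.suggest,
                             max_evals=13, trials=trials)
best_params = hyperopt.space_eval(search_space, best_indices)

lgbm_model = lgbm.LGBMRegressor(objective='mae', **best_params)
# lgbm_model.fit(x_train, y_train)
```
"""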
"""### Parameter Explanation
1. **`task`: 'train'**
- Specifies the task that LightGBM will perform, which is 'train' in this case. This is typical when you are using LightGBM for building and training new models.
2. **`boosting_type`: 'gbdt'**
- Specifies the boosting algorithm. 'gbdt' stands for Gradient Boosting Decision Tree, which is the standard boosting framework that LightGBM uses. It creates a series of decision trees where each tree learns to correct the errors of the previous one.
3. **`objective`: 'regression'**
- Indicates the learning task and the corresponding learning objective. 'Regression' means the model will predict continuous target values, which is typical for predicting metrics like prices or rates.
4. **`metric`: ['l1', 'l2']**
- Metrics for evaluating model performance. 'l1' is the mean absolute error (MAE), and 'l2' is the mean squared error (MSE). Including both allows you to evaluate the model under different error metrics during the training phase.
5. **`learning_rate`: 0.005**
- Determines the step size at each iteration while moving toward a minimum of the loss function. A smaller learning rate can lead to better performance (at the risk of longer training time and potentially getting stuck in local minima).
6. **`feature_fraction`: 0.9**
- Specifies the fraction of features to be randomly selected for building each tree. A lower value can provide better performance because it provides a better generalization capability and can prevent overfitting.
7. **`bagging_fraction`: 0.7**
- Specifies the fraction of the data to be randomly sampled for each iteration; bagging speeds up training and helps control overfitting.
8. **`bagging_freq`: 10**
- Specifies the frequency for performing bagging. Every 10 iterations, a new subset of the data is selected according to the `bagging_fraction`.
9. **`verbose`: 0**
- Controls the level of LightGBM’s output (verbosity of printing messages). Setting it to 0 means silent mode.
10. **`max_depth`: 8**
- Maximum depth of the trees. Restricting the depth of the trees helps prevent the model from becoming overly complex and overfitting.
11. **`num_leaves`: 20**
- The maximum number of leaves in one tree. More leaves will make the model more complex and can lead to overfitting.
12. **`max_bin`: 512**
- Maximum number of bins that feature values will be bucketed into. A larger number increases the model’s complexity.
13. **`num_iterations`: 1000**
- The number of boosting iterations to be run. More iterations can improve accuracy but might lead to overfitting if not controlled with other parameters like `bagging_fraction`.
14. **`force_col_wise`: 'true'**
- Forces LightGBM to build histograms column-wise (feature-wise). This can be faster and use less memory when the number of features or the total number of bins is large; for datasets with many rows and relatively few features, the row-wise method is usually preferable.
### Fitting the Model
```python
lgbm_model = lgbm.LGBMRegressor(**params)
lgbm_model.fit(x_train, y_train)
```
- These lines initialize an `LGBMRegressor` with the specified parameters and fit it to `x_train` and `y_train`. Training minimizes the `'regression'` (L2) objective; the `'l1'` and `'l2'` entries under `metric` are evaluation metrics, reported only when an evaluation set is supplied, not additional training losses.
This setup applies gradient-boosted regression to a continuous target, with the hyperparameters above chosen to balance accuracy against overfitting.
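Because those metrics are only reported against an evaluation set, one optional refinement is to monitor a hold-out split and stop early. A minimal sketch, assuming a hypothetical `x_val`/`y_val` split (an assumption; the original pipeline trains on `x_train`/`y_train` only):
```python
lgbm_model = lgbm.LGBMRegressor(**params)
lgbm_model.fit(
    x_train, y_train,
    eval_set=[(x_val, y_val)],  # hypothetical hold-out split, not part of the original pipeline
    callbacks=[lgbm.early_stopping(stopping_rounds=100),  # stop when validation scores stall for 100 rounds
               lgbm.log_evaluation(period=100)],          # print l1/l2 every 100 rounds
)
print(lgbm_model.best_iteration_)
```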
"""
params = {
'task': 'train',
'boosting_type': 'gbdt',
'objective': 'regression',
'metric': ['l1','l2'],
'learning_rate': 0.005,
'feature_fraction': 0.9,
'bagging_fraction': 0.7,
'bagging_freq': 10,
'verbose': 0,
"max_depth": 8,
"num_leaves": 20,
"max_bin": 512,
"num_iterations": 1000,
"force_col_wise": 'true'
}
lgbm_model = lgbm.LGBMRegressor(**params)
lgbm_model.fit(x_train, y_train)
# import xgboost as xgb
# # Create an XGBoost regressor
# xgb_model = xgb.XGBRegressor(objective='reg:squarederror',
# n_estimators=895,
# max_depth=7)
# # Train the model
# xgb_model.fit(x_train, y_train)
# xgb.plot_importance(xgb_model)
"""The provided code snippet demonstrates how to use XGBoost, a powerful and widely used machine learning library, to train a regression model with GPU acceleration, plot feature importances, and save the trained model in various formats. Let's break down each part of this process:
### Step-by-Step Explanation
#### 1. **XGBoost Regressor Initialization**
```python
import xgboost as xgb
xgb_model = xgb.XGBRegressor(objective='reg:squarederror',
n_estimators=895,
max_depth=7,
tree_method='gpu_hist')
```
- **`xgb.XGBRegressor`**: This creates an instance of XGBoost's regressor. Parameters specified in the constructor configure the behavior of the model:
- **`objective='reg:squarederror'`**: Sets the loss function to be minimized as squared error, which is appropriate for regression tasks.
- **`n_estimators=895`**: Defines the number of gradient boosted trees to fit. More trees can improve the model's predictive accuracy but may lead to longer training times and overfitting.
- **`max_depth=7`**: Limits the maximum depth of each tree. Deeper trees can model more complex patterns but also can overfit.
- **`tree_method='gpu_hist'`**: Runs the histogram-based tree-building algorithm on a GPU, which speeds up training significantly on large datasets. (In XGBoost 2.0 and later this spelling is deprecated in favor of `tree_method='hist'` combined with `device='cuda'`; see the sketch after this list.)
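If the notebook runs on XGBoost 2.0 or later (an assumption about the environment; the original snippet targets an earlier release where `'gpu_hist'` is valid), the equivalent constructor is:
```python
import xgboost as xgb

# XGBoost >= 2.0 spelling of GPU-accelerated histogram training
xgb_model = xgb.XGBRegressor(objective='reg:squarederror',
                             n_estimators=895,
                             max_depth=7,
                             tree_method='hist',
                             device='cuda')
```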
#### 2. **Training the Model**
```python
xgb_model.fit(x_train, y_train)
```
- **`.fit(x_train, y_train)`**: This method trains the XGBoost model using the provided training data (`x_train`) and targets (`y_train`).
#### 3. **Plotting Feature Importances**
```python
xgb.plot_importance(xgb_model)
```
- **`plot_importance`**: This function from the XGBoost module plots a chart of feature importances. By default (`importance_type='weight'`) the importance of a feature is the number of times it is used to split the data across all trees. The visualization helps identify which features are most influential in predicting the target variable; a gain-based alternative is sketched below.
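For a different view of the same model, importance can be ranked by total gain instead of split count. A sketch (the matplotlib rendering is an addition, not part of the original snippet):
```python
import matplotlib.pyplot as plt

# Rank features by the total loss reduction (gain) they contribute, keep the top 20
ax = xgb.plot_importance(xgb_model, importance_type='gain', max_num_features=20)
ax.figure.tight_layout()
plt.show()
```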
#### 4. **Saving the Model**
```python
xgb_model.save_model('xgb_model.bin')
xgb_model.save_model("xgb_model.json")
xgb_model.save_model("xgb_model.txt")
```
- **`save_model`**: This method saves the trained model to a file so it can be loaded later without retraining. The serialization format is chosen from the file extension:
- **Binary file (`'xgb_model.bin'`)**: Any extension other than `.json` or `.ubj` falls back to XGBoost's legacy internal binary format, which is compact but XGBoost-specific (and deprecated in recent releases).
- **JSON file (`"xgb_model.json"`)**: Saves the model in XGBoost's JSON schema, which is more transparent, portable, and the recommended format in newer releases.
- **Text file (`"xgb_model.txt"`)**: Despite the extension, this still produces the legacy binary format, not a human-readable file. For a readable dump of the trees (useful for inspection, but not loadable back), use `xgb_model.get_booster().dump_model("dump.txt")`. Loading a saved model back is sketched below.
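To reuse a saved model without retraining, a minimal sketch (the file name matches the JSON save above, and `x_test` is a hypothetical preprocessed feature frame):
```python
import xgboost as xgb

# Recreate an empty regressor and restore the trained booster from disk
restored_model = xgb.XGBRegressor()
restored_model.load_model("xgb_model.json")

# y_new = restored_model.predict(x_test)
```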
### Summary
This sequence of operations demonstrates a comprehensive approach to model training with XGBoost, leveraging GPU capabilities for speed, examining feature importance for insights, and preserving the model in various formats for future use, sharing, or deployment. The flexibility in saving models in different formats ensures that you can choose the appropriate one based on your needs for performance, transparency, or compatibility.
"""
import xgboost as xgb
# Create an XGBoost regressor with the gpu_hist tree construction algorithm
xgb_model = xgb.XGBRegressor(objective='reg:squarederror',
n_estimators=895,
max_depth=7,
tree_method='gpu_hist')
# Train the model
xgb_model.fit(x_train, y_train)
# Plot the feature importances
xgb.plot_importance(xgb_model)
# Save the model (a non-.json/.ubj extension selects XGBoost's legacy binary format)
xgb_model.save_model('xgb_model.bin')
# Save as JSON file (recommended, portable format)
xgb_model.save_model("xgb_model.json")
# Note: a ".txt" extension still yields the legacy binary format; for a human-readable
# dump of the trees use xgb_model.get_booster().dump_model("xgb_model_dump.txt")
xgb_model.save_model("xgb_model.txt")
# # Make predictions on the test data
# y_pred = xgb_model.predict(X_test)
# lgbm_model = lgbm.LGBMRegressor(objective='mae',
# n_estimators=895,
# num_leaves= 25,
# max_depth= 7)
# lgbm_model.fit(x_train, y_train)
# from sklearn.ensemble import RandomForestRegressor
# rf_model = RandomForestRegressor(n_estimators=895,
# max_depth= 8,criterion="squared_error",bootstrap=True)
# rf_model.fit(x_train, y_train)
"""# 6. Submission
The provided code snippet outlines how to submit predictions in a Kaggle competition that requires real-time interaction with an API. This setup is often used in "Code Competitions," where the submissions are evaluated on the fly. Let's break down the steps and functionalities involved in this submission process:
### Understanding the Kaggle Environment API
1. **Initialization**:
```python
import optiver2023
env = optiver2023.make_env()
iter_test = env.iter_test()
```
- **`import optiver2023`**: Imports the competition-specific Python module provided by Kaggle, which contains methods necessary for the submission.
- **`env = optiver2023.make_env()`**: Initializes the competition environment. This environment handles the process of receiving the test data and submitting predictions.
- **`iter_test = env.iter_test()`**: Creates an iterator that will provide batches of test data. This method is typically used when the test data is revealed in chunks over time, simulating a real-world scenario such as a trading environment.
2. **Processing and Making Predictions**:
```python
counter = 0
for (test, revealed_targets, sample_prediction) in iter_test:
test = pre_process1(test)
test_df = feature_cols(test)
sample_prediction['target'] = xgb_model.predict(test_df)
env.predict(sample_prediction)
counter += 1
```
- **Loop Over Test Data**: The `for` loop iterates over each batch of data provided by the `iter_test` iterator.
- **Preprocessing**: `pre_process1(test)` applies preprocessing steps to the test data, preparing it by creating new features or transforming existing ones as defined earlier in your workflow.
- **Feature Selection**: `feature_cols(test)` ensures that only the relevant features are used for making predictions, filtering out any non-predictive or extraneous data columns.
- **Making Predictions**: `xgb_model.predict(test_df)` uses the pre-trained XGBoost model to generate predictions based on the processed test data.
- **Submitting Predictions**: `env.predict(sample_prediction)` submits the predictions back to the Kaggle environment. `sample_prediction` is a DataFrame provided by the iterator that already carries the required submission layout, including a `target` column to overwrite with the model's predictions. A lightly hardened variant of this loop is sketched after this list.
- **Counter**: An optional counter is used here to keep track of the number of iterations or batches processed.
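The sketch below is the same loop with two defensive additions (assumptions, not something the original code does): the feature columns are re-aligned to the training layout, and missing values are filled. `x_train.columns` is assumed to hold the features the model was trained on.
```python
counter = 0
for (test, revealed_targets, sample_prediction) in iter_test:
    test = pre_process1(test)
    test_df = feature_cols(test)
    # Defensive additions: keep exactly the training-time feature columns and fill gaps
    test_df = test_df.reindex(columns=x_train.columns).fillna(0)
    sample_prediction['target'] = xgb_model.predict(test_df)
    env.predict(sample_prediction)
    counter += 1
```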
### Important Notes
- **API Optimization Warning**: Note that the current API version is not optimized and should not be used to estimate the runtime of your code on the hidden test set.
- **Contact Information**: For production-level, optimized code, you may email "adityasaxena@g.harvard.edu".
"""
import optiver2023
env = optiver2023.make_env()
iter_test = env.iter_test()