# -*- coding: utf-8 -*-
"""optiver-dev-lgbm.ipynb
Automatically generated by Colab.
Original file is located at
https://colab.research.google.com/drive/1bELhm-NODp7-NAqvCLxTvM8QgwS4f2CN
# [Optiver] LGBM Dev Solution - For Kaggle
# 1. Baseline
First, just as a baseline, let's feed the training data into LightGBM and see how good the public score is.
### Imports and Configuration
1. **Standard Libraries and Data Handling**:
- `os`: This module provides a way of using operating system dependent functionality like reading or writing to a file system.
- `pandas as pd`: Pandas is crucial for data manipulation and analysis. It offers data structures and operations for manipulating numerical tables and time series.
- `numpy as np`: NumPy is used for large, multi-dimensional arrays and matrices, along with a large collection of high-level mathematical functions to operate on these arrays.
2. **Visualization**:
- `matplotlib.pyplot as plt`: This is used for creating static, interactive, and animated visualizations in Python. `%matplotlib inline` is a magic function that renders the figure in a notebook (instead of displaying a figure in a new window) immediately after a plot command.
3. **Warnings**:
- `warnings`: This module is used to suppress warnings that might interrupt the viewing experience or clutter the output. `warnings.filterwarnings('ignore')` instructs Python to ignore specific categories of warnings.
4. **Hyperparameter Optimization**:
- `optuna`: Optuna is an automatic hyperparameter optimization software framework, particularly designed for machine learning. It is used to automate the optimization of the parameters to best fit the model. `optuna.logging.set_verbosity(optuna.logging.WARNING)` configures Optuna to only output warnings and more severe messages, reducing log noise.
5. **Machine Learning Tools**:
- `sklearn.model_selection.KFold`: KFold is a cross-validator that divides the dataset into k consecutive folds (without shuffling by default). Each fold is then used once as a validation while the k - 1 remaining folds form the training set.
- `sklearn.metrics.mean_absolute_error`: This function measures the average magnitude of the errors in a set of predictions, without considering their direction. It’s particularly useful as it’s the metric used to evaluate the performance of the model.
- `lightgbm as lgbm`: LightGBM is a gradient boosting framework that uses tree-based learning algorithms and is designed for distributed and efficient training, particularly on large datasets.
6. **Specific Functional Configurations**:
- `from lightgbm import *`: Imports all functions and classes from LightGBM directly into the namespace. This is generally not best practice due to potential naming conflicts; specific imports are preferable.
- `pd.set_option("display.max_columns", None)`: This pandas function is set to ensure that when dataframes are displayed, no columns are omitted in the output, regardless of how many columns are in the dataframe.
"""
# Commented out IPython magic to ensure Python compatibility.
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# %matplotlib inline
import warnings
warnings.filterwarnings('ignore')
import optuna
from sklearn.model_selection import KFold
from sklearn.metrics import mean_absolute_error
import lightgbm as lgbm
optuna.logging.set_verbosity(optuna.logging.WARNING)
from lightgbm import *
pd.set_option("display.max_columns", None)
"""This snippet of code handles the loading of datasets necessary for building and testing the predictive model for the Nasdaq Closing Price Prediction Challenge. Let's delve into each line to understand its functionality:
### Code Explanation
1. **Loading Training Data**:
```python
df_train = pd.read_csv('../input/optiver-trading-at-the-close/train.csv')
```
This line reads the `train.csv` file into a Pandas DataFrame called `df_train`. This dataset likely contains historical order book and auction data for a variety of stocks listed on the Nasdaq, and serves as the primary dataset for training the predictive model.
2. **Loading Testing Data**:
```python
df_test = pd.read_csv('../input/optiver-trading-at-the-close/example_test_files/test.csv')
```
Here, the `test.csv` file is loaded into a DataFrame called `df_test`. This file is used to test the model after training, allowing you to evaluate how well the model predicts new, unseen data. This dataset would mimic the structure of the training data but without revealing the target variables (i.e., the closing prices).
3. **Loading Sample Submission Format**:
```python
sample_sub = pd.read_csv('../input/optiver-trading-at-the-close/example_test_files/sample_submission.csv')
```
This line loads a sample submission file, `sample_submission.csv`, into a DataFrame called `sample_sub`. This file likely outlines the required format for submitting predictions to a competition or evaluation framework, showing the expected structure of predictions, typically including identifiers and predicted values.
4. **Loading Revealed Targets for Testing**:
```python
rev_target = pd.read_csv('../input/optiver-trading-at-the-close/example_test_files/revealed_targets.csv')
```
The `revealed_targets.csv` file is read into a DataFrame called `rev_target`. This dataset contains actual values of the targets for the test set, which are revealed for the purpose of model evaluation and validation post-prediction. It is used to calculate the accuracy metrics (such as MAE) to judge the model's performance.
### Purpose and Usage
- **Training and Testing**: The primary purpose of loading these datasets is to split the model's workflow into training and evaluation phases. `df_train` is used to fit the model, while `df_test` is crucial for predicting and testing the model's generalization capabilities on new data.
- **Evaluation**: The `revealed_targets.csv` allows for the direct comparison of the model’s predictions against actual outcomes, which is essential for iterative model tuning and refinement.
- **Submission**: The `sample_submission.csv` ensures that predictions are formatted correctly for submission, adhering to the specifications of the competition or project requirements.
"""
df_train = pd.read_csv('../input/optiver-trading-at-the-close/train.csv')
df_test = pd.read_csv('../input/optiver-trading-at-the-close/example_test_files/test.csv')
sample_sub = pd.read_csv('../input/optiver-trading-at-the-close/example_test_files/sample_submission.csv')
rev_target = pd.read_csv('../input/optiver-trading-at-the-close/example_test_files/revealed_targets.csv')
"""### Column Descriptions
1. **stock_id**:
- **Description**: A unique identifier for each stock.
- **Context**: Not every stock appears in every time bucket (a specific, often small, period of time during trading), making this identifier crucial for tracking the performance and data specific to each stock across different periods.
2. **date_id**:
- **Description**: A unique identifier for the date on which the trading data was recorded. These IDs are sequential and consistent across all stocks.
- **Context**: Allows the model to differentiate data across different trading days and to identify trends or patterns over time.
3. **imbalance_size**:
- **Description**: Represents the volume of shares that remain unmatched at the current reference price, expressed in USD.
- **Context**: A critical measure in understanding supply and demand dynamics at the closing auction, influencing the model's ability to predict price movements based on existing imbalances.
4. **imbalance_buy_sell_flag**:
- **Description**: A categorical flag indicating the direction of the auction imbalance:
- 1 for a buy-side imbalance (more demand than supply)
- -1 for a sell-side imbalance (more supply than demand)
- 0 for no imbalance
- **Context**: This indicator helps in predicting whether the price is likely to rise or fall at the close, based on whether there is excess buying pressure or selling pressure.
5. **reference_price**:
- **Description**: The price at which the number of paired shares is maximized, the imbalance is minimized, and the price is closest to the bid-ask midpoint.
- **Context**: Acts as a pivotal price point for the model, as it represents a theoretically optimal trading price considering current market conditions.
6. **matched_size**:
- **Description**: The total amount in USD that can be matched at the current reference price.
- **Context**: Indicates the volume of trades that can be executed without affecting the market price too significantly, crucial for understanding market liquidity.
7. **far_price, near_price, [bid/ask]_price, [bid/ask]_size**:
- **Description**: These are various price points and quantities in the order book.
    - **Far price**: the crossing price that would maximize the number of shares matched based on auction interest alone (continuous-market orders excluded).
    - **Near price**: the crossing price that would maximize the number of shares matched based on auction interest and continuous-market orders together.
- **[bid/ask]_price** are the highest buy and lowest sell prices respectively.
- **[bid/ask]_size** are the volumes available at these prices.
- **Context**: These metrics provide detailed insights into the order book's depth and the distribution of buy and sell orders around the reference price, informing predictions on price movement pressures.
8. **wap** (Weighted Average Price):
- **Description**: Calculated over a specific time frame within the non-auction book, it's a price that reflects the average price at which stocks are traded, weighted by volume.
- **Context**: WAP is used to gauge the average trading price over a period, often used in financial models to understand market trends and to normalize the impact of large trades on simple average price calculations.
9. **seconds_in_bucket**:
- **Description**: Measures the number of seconds since the start of the day’s closing auction, starting always from zero.
- **Context**: Useful for models that need to understand and predict price movements and market behavior at very specific intervals during the closing auction.
10. **target**:
- **Description**: The difference between the 60-second future movement in the stock's WAP and the 60-second future movement of a synthetic index, provided only in the training set.
- **Context**: Serves as the dependent variable in training the model. It represents the relative movement of a stock's price compared to the market, which is central to predicting future price movements effectively.
Understanding these columns and their interrelationships is essential for developing an effective predictive model that can accurately forecast stock price movements during the crucial final minutes of trading based on order book dynamics and auction data.
"""
df_train
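"""As a quick, optional sanity check (not part of the original notebook), the snippet below inspects the columns described above: shape, dtypes, missing-value rates, and the distribution of the imbalance flag. It assumes `df_train` has been loaded as shown earlier.

```python
# Shape and column dtypes of the training data.
print(df_train.shape)
print(df_train.dtypes)

# Missing-value rate per column; far_price / near_price are only populated
# late in the auction window, so a high NaN rate there is expected.
print(df_train.isna().mean().sort_values(ascending=False).head(10))

# Distribution of the buy/sell imbalance flag (1 = buy, -1 = sell, 0 = none).
print(df_train['imbalance_buy_sell_flag'].value_counts())
```
"""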
"""### Function: `feature_cols`
This function is designed to filter out certain columns from a given DataFrame and return the modified DataFrame.
```python
def feature_cols(df):
cols = [c for c in df.columns if c not in ['row_id', 'time_id', 'date_id']]
df = df[cols]
return df
```
#### Details:
- **Input Parameter**:
- `df`: A pandas DataFrame from which specific columns need to be excluded.
- **Process**:
- The function uses a list comprehension that iterates over all column names in the DataFrame and keeps only those that are not `'row_id'`, `'time_id'`, or `'date_id'`. These columns are identifiers that do not provide predictive power for the model (i.e., they are not features but merely identifiers or indexes).
- It then filters the DataFrame to include only the columns listed in `cols`, effectively removing any columns that might skew the model or are not useful as features.
- **Return**:
- The function returns the DataFrame with the specified columns removed, focusing the DataFrame on potentially relevant features for the model.
### Data Preprocessing
```python
df_train.fillna(0, inplace=True)
```
- **Description**: This line replaces all missing values (`NaN`s) in the `df_train` DataFrame with `0`. Handling missing values is crucial to avoid errors during the modeling process and can also impact the model’s performance.
- **`inplace=True`**: This parameter ensures that the modification is done in place and does not return a new DataFrame, thus directly updating `df_train`.
### Feature Selection and Target Separation
```python
x_train = feature_cols(df_train.drop(columns='target'))
y_train = df_train['target'].values
```
- **Feature DataFrame (`x_train`)**:
- `df_train.drop(columns='target')`: Drops the 'target' column from `df_train`, as the target column is what the model is trying to predict and should not be used as a feature.
- `feature_cols(...)`: Applies the `feature_cols` function to the result, further filtering out the non-feature columns ('row_id', 'time_id', 'date_id'), and assigns the result to `x_train`.
- **Target Array (`y_train`)**:
- `df_train['target'].values`: Extracts the target values from the `df_train` DataFrame. This creates a NumPy array of the target variable, which is used as the dependent variable in model training.
### Summary
This setup is typical in supervised machine learning tasks where the goal is to predict a target variable based on a set of features. The code effectively prepares the dataset by cleaning up non-feature columns, handling missing values, and segregating features and targets, which is critical for the subsequent model training phase.
"""
def feature_cols(df):
    cols = [c for c in df.columns if c not in ['row_id', 'time_id', 'date_id']]
    df = df[cols]
    return df
df_train.fillna(0, inplace = True)
x_train = feature_cols(df_train.drop(columns='target'))
y_train = df_train['target'].values
"""### Model Initialization
```python
lgbm_model = lgbm.LGBMRegressor(objective='mae', n_estimators=500, random_state=1234)
```
- **`LGBMRegressor`**:
- This class implements the LightGBM regressor. A regressor predicts continuous values, which is appropriate for predicting stock prices as continuous numerical data.
- **Parameters**:
- **`objective='mae'`**: Specifies the loss function to be minimized in the learning process. Here, 'mae' stands for Mean Absolute Error, which aligns with the project’s evaluation criteria. It measures the average magnitude of errors in a set of predictions, without considering their direction (i.e., whether they are over or underestimates).
- **`n_estimators=500`**: This defines the number of boosting stages the model has to go through. More trees can lead to a more accurate model but can also cause overfitting if not handled correctly. In this context, 500 trees are chosen to balance between bias and variance.
- **`random_state=1234`**: This parameter ensures reproducibility of the model’s results by providing a fixed seed for the random number generator, which influences aspects of model training like the selection of features at each split.
### Model Training
```python
lgbm_model.fit(x_train, y_train)
```
- **Description**:
- This line fits the LightGBM model to the training data. The `fit` method adjusts the weights of the model over the specified number of boosting rounds (`n_estimators`) to minimize the specified loss function.
- **Parameters**:
- **`x_train`**: Feature matrix (independent variables) used for training the model.
- **`y_train`**: Target variable (dependent variable) the model needs to predict.
### Importance in the Context of the Project
The use of the LightGBM model in this project is particularly well-suited for several reasons:
- **Efficiency**: LightGBM is known for its high efficiency with large data sets and handles large volumes of data faster than many other implementations of gradient boosting.
- **Handling Sparse Data**: Given the potentially large and sparse nature of financial data (like order books), LightGBM’s ability to handle sparse data effectively is beneficial.
- **Gradient-based Learning**: The model’s learning is based on identifying errors from previous trees and improving on them, which is effective for complex patterns like those found in stock price movements.
Training this model on the defined features and target prepares it to forecast the closing prices of Nasdaq-listed stocks, providing crucial insights into short-term price movements essential for traders and financial analysts.
"""
lgbm_model = lgbm.LGBMRegressor(objective='mae', n_estimators=500, random_state=1234)
lgbm_model.fit(x_train, y_train)
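"""The baseline above is fitted on the full training set and evaluated only via the public leaderboard. As a rough local check, one could hold out the most recent dates as a validation set. The sketch below is illustrative only — the 90/10 cutoff and the `holdout_model` name are assumptions, not the author's validation scheme — and it reuses `feature_cols` and `mean_absolute_error` from above.

```python
# Hypothetical date-based holdout: train on the earliest ~90% of date_ids,
# validate on the most recent ~10%.
cutoff = df_train['date_id'].quantile(0.9)
train_part = df_train[df_train['date_id'] <= cutoff]
valid_part = df_train[df_train['date_id'] > cutoff]

x_tr = feature_cols(train_part.drop(columns='target'))
y_tr = train_part['target'].values
x_va = feature_cols(valid_part.drop(columns='target'))
y_va = valid_part['target'].values

holdout_model = lgbm.LGBMRegressor(objective='mae', n_estimators=500, random_state=1234)
holdout_model.fit(x_tr, y_tr)
print('holdout MAE:', mean_absolute_error(y_va, holdout_model.predict(x_va)))
```
"""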
"""### Function: `lgbm.plot_importance`
This function is part of the LightGBM framework and is used to plot the importance of each feature used by the model. The importance can be derived in different ways, and in this case, the importance type specified is "gain".
- **`lgbm_model`**: This is the trained LightGBM model from which the feature importance is calculated.
- **`importance_type="gain"`**: Specifies the type of importance measure to be used. "Gain" refers to the total gains of splits which use the feature. Essentially, it measures the contribution of each feature to the model by calculating how much each feature's splits improve the performance measure (in this case, the reduction in loss or "gain").
### Understanding "Gain"
- **Gain (also known as 'split gain')**: This is the improvement in accuracy brought by a feature to the branches it is on. The idea is that before making a split on a feature, the model calculates how much using this feature would reduce the loss (mean absolute error, in this case). A higher gain implies a more significant contribution of the feature to making more accurate predictions.
### Significance of Feature Importance Visualization
- **Model Interpretability**: This visualization helps in understanding which features are most influential in predicting the target variable. For financial modeling, where interpretability is crucial for trust and understanding, knowing which features affect predictions most can guide further data collection, feature engineering, and model tweaking.
- **Feature Engineering**: Insights from feature importance can lead to improved feature engineering. Features with low importance might be candidates for removal or modification, while understanding high-importance features might lead to the creation of new features that enhance the model’s predictive power.
- **Strategic Decisions**: In trading, understanding which features (e.g., aspects of the order book, price movements, etc.) most influence predictions can help in formulating more effective trading strategies.
### Practical Use
To execute this function properly and ensure that the plot displays as intended, make sure your environment has plotting support; in environments without GUI support, additional settings (such as a non-interactive matplotlib backend) may be needed. Also, ensure that the LightGBM library is correctly installed and imported.
The result of this function is a bar chart where each feature is listed along with its importance score. Features are typically sorted in descending order of importance, making it clear which are the most critical for the model’s predictions. This visual tool is invaluable for presentations and reports to stakeholders, providing a clear and intuitive way to discuss model dynamics.
"""
lgbm.plot_importance(lgbm_model, importance_type="gain")
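"""If the raw numbers are preferred over the plot, the same gain-based importances can be pulled from the fitted model's underlying booster. A small sketch (the `gain_imp` name is just illustrative):

```python
# Gain-based importances as a sorted table instead of a bar chart.
gain_imp = pd.DataFrame({
    'feature': lgbm_model.booster_.feature_name(),
    'gain': lgbm_model.booster_.feature_importance(importance_type='gain'),
}).sort_values('gain', ascending=False)
print(gain_imp.head(10))
```
"""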
"""The function `lgbm.plot_importance` with the parameter `importance_type="split"` is used to visualize the importance of each feature in the trained LightGBM model based on the number of times each feature is used to split the data across all trees. This provides a different perspective on feature importance compared to the "gain" method. Here’s what you need to know about using this particular function and parameter:
### Function: `lgbm.plot_importance`
- **Parameter**: `importance_type="split"`
- When set to "split", the importance of each feature is calculated based on how often the feature is used to split data points at a tree node, across all trees in the model. Essentially, it counts the number of times a feature is selected to make a decision in the tree.
### Significance of "Split" as an Importance Measure
- **Usage Frequency**: This measure provides insight into how frequently a feature is used in the tree models, irrespective of the magnitude of its impact (or gain). A feature that is used very often to make splits might be considered crucial for the decision process within the model.
- **Interpretability**: Understanding which features are most frequently used can help in assessing the reliance of the model on certain data points or feature types. For example, in stock price prediction, if the model frequently splits on features related to volume, it suggests a strong dependency on trading volume for making predictions.
### Practical Uses of Feature Importance (Split)
- **Model Simplification**: If some features are rarely used to make splits, it might indicate that they are not contributing much to the model's decisions, providing a basis for potentially simplifying the model by removing these less important features.
- **Feature Engineering**: By identifying which features are most frequently used, you can focus your feature engineering efforts to enhance these features or create new features that are similar in nature but might capture additional nuances.
- **Model Validation**: Frequent use of intuitive or expected features for splits can serve as a sanity check, validating that the model is considering relevant factors (e.g., certain market indicators in stock trading models).
### Visualization and Output
When you execute `lgbm.plot_importance`, the function typically produces a bar chart. Each bar represents a feature with its length proportional to the count of times the feature has been used in splits across all boosting rounds (trees). The features are generally ordered by their importance, with the most frequently used feature at the top.
### Code Execution
To ensure the plot is correctly generated and displayed, consider the following:
- Ensure your Python environment supports plotting; for Jupyter notebooks or IPython environments, `%matplotlib inline` should be used to display plots within the notebook.
- Validate that the `lightgbm` library is correctly installed and imported.
- If running in a non-interactive environment or needing to save the plot to a file, additional code may be required to handle these aspects.
This visualization tool is particularly useful for presentations or detailed analysis where understanding the structure and decision-making process of the model is crucial.
"""
# splits means number of times a feature is used to split the data across all trees in the model
lgbm.plot_importance(lgbm_model, importance_type="split")
"""If you submit at this point, the public score will show **5.3888**, which is not so bad. Let's try some approaches from this baseline.
# 2. Optimize parameter with Optuna
Changing the approach here, let's see how the score improves when we optimize the parameters. I used Optuna for the optimization; as a result, the public score improved to **5.3878**.
Optuna is an open-source optimization library designed specifically for automating the process of optimizing the hyperparameters of machine learning algorithms. It is highly regarded for its efficiency and flexibility in tuning parameters to enhance the performance of models. Here’s a detailed look at what Optuna offers and why it's beneficial for machine learning projects:
### Key Features of Optuna:
1. **Automatic Hyperparameter Optimization**:
- Optuna automates the tedious process of manually searching for the best hyperparameters, using sophisticated algorithms to explore the parameter space efficiently.
2. **Efficient Search Algorithms**:
- Optuna supports several state-of-the-art algorithms for hyperparameter optimization, including the Tree-structured Parzen Estimator (TPE), Covariance Matrix Adaptation Evolution Strategy (CMA-ES), and random search. These algorithms predict which hyperparameters are likely to yield better results and focus the search around those areas.
3. **Easy Parallelization**:
- One of the strengths of Optuna is its support for easy parallelization, allowing users to speed up their optimization processes by running trials simultaneously across multiple processors or even across different machines.
4. **Pruning of Trials**:
- Optuna provides an automatic trial pruning feature which can terminate poorly performing trials early. This feature is useful in saving computational resources and focusing efforts on more promising parameter sets.
5. **Visualization**:
- The library includes functions for visualizing the optimization process, such as plots for the history of trials, parallel coordinate plots of parameter relationships, and importance plots for assessing which parameters are most influential in achieving the best performance.
6. **User-Friendly**:
- Despite its sophisticated capabilities, Optuna is designed to be user-friendly. It allows for defining the search space using Pythonic APIs and integrates seamlessly with existing Python data science ecosystems like NumPy, Pandas, and major machine learning frameworks like PyTorch, TensorFlow, and Scikit-learn.
### Usage in Machine Learning Projects:
Optuna is particularly useful in projects where the optimal combination of parameters is not known in advance and where manual tuning could be impractical due to the vastness of the parameter space or the complexity of the model. For instance, when optimizing a LightGBM model (as in the Nasdaq Closing Price Prediction Challenge), Optuna can systematically and efficiently explore different combinations of parameters like `num_leaves`, `max_depth`, `learning_rate`, etc., to find the configuration that minimizes the error or maximizes the accuracy of predictions.
### Code Breakdown
#### Data Copy
```python
x = x_train.copy()
y = y_train.copy()
```
- **Purpose**: Creates copies of the training data (`x_train`) and target variable (`y_train`). This is generally done to avoid modifying the original data during the optimization process, ensuring data integrity throughout the experimentation.
#### Objective Function
```python
def objective(trial):
params = {
'random_seed': 123,
'n_estimators': trial.suggest_int('n_estimators', 300, 1000),
'num_leaves': trial.suggest_int('num_leaves', 4, 32),
'max_depth': trial.suggest_int("max_depth", 1, 10)}
model = lgbm.LGBMRegressor(**params)
model.fit(x, y)
y_pred = model.predict(x)
score = mean_absolute_error(y, y_pred)
return score
```
- **Function**: `objective(trial)`
- **Purpose**: This function defines the objective for optimization, which Optuna aims to minimize—in this case, the mean absolute error (MAE) between the predicted and actual values.
- **Parameters**:
- `random_seed`: Ensures reproducibility.
- `n_estimators`: Number of boosted trees to fit. Suggested range is 300 to 1000.
- `num_leaves`: Maximum number of leaves in one tree. Suggested range is 4 to 32.
- `max_depth`: Maximum depth of a tree. Suggested range is 1 to 10.
- **Process**: The function initializes a LightGBM regressor with the suggested parameters, fits the model to the training data, and computes the MAE on that same training set. Note that scoring on the data the model was just fitted to tends to favor overfitting configurations; a held-out split or cross-validation inside the objective gives a more reliable signal (see the sketch after the code below).
#### Optuna Study Creation and Optimization
```python
#study = optuna.create_study(sampler=optuna.samplers.RandomSampler(seed=123))
#study.optimize(objective, n_trials=50)
#study.best_params
```
- **Purpose**: These commented-out lines are used to create an Optuna study that manages the optimization process.
- `create_study()`: Sets up the optimization framework with a random sampler, which selects parameter values randomly and ensures the reproducibility with a seed.
- `optimize()`: Executes the optimization over a specified number of trials (`n_trials=50`), where each trial evaluates the objective function with a different set of parameters.
- `best_params`: This attribute of the study object stores the best parameter values found during the optimization.
#### Notes
- The process is commented out due to its time-consuming nature (taking a couple of hours to complete). For practical implementations, especially in development environments, such lengthy computations are typically run in dedicated sessions, possibly on optimized hardware or cloud resources.
#### Suggestions for Use
- **Uncomment and Run**: If you have the resources and time, uncomment these lines to perform the optimization and potentially improve your model.
- **Experimentation**: After running the initial optimization, you might want to further refine the ranges or try optimizing additional parameters based on the initial results.
"""
x = x_train.copy()
y = y_train.copy()

def objective(trial):
    params = {
        'random_seed': 123,
        'n_estimators': trial.suggest_int('n_estimators', 300, 1000),
        'num_leaves': trial.suggest_int('num_leaves', 4, 32),
        'max_depth': trial.suggest_int('max_depth', 1, 10)}
    model = lgbm.LGBMRegressor(**params)
    model.fit(x, y)
    y_pred = model.predict(x)
    score = mean_absolute_error(y, y_pred)
    return score
#study = optuna.create_study(sampler=optuna.samplers.RandomSampler(seed=123))
#study.optimize(objective, n_trials=50)
#study.best_params
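"""As noted above, the objective scores the model on the same rows it was trained on. A minimal variant is sketched below under the assumption of a date-based 90/10 holdout; the `objective_holdout` name and the cutoff are illustrative, not the author's setup.

```python
# Score each Optuna trial on a held-out slice of dates so the search does
# not simply reward overfitting to the training rows.
mask = (df_train['date_id'] <= df_train['date_id'].quantile(0.9)).values

def objective_holdout(trial):
    params = {
        'objective': 'mae',
        'random_state': 123,
        'n_estimators': trial.suggest_int('n_estimators', 300, 1000),
        'num_leaves': trial.suggest_int('num_leaves', 4, 32),
        'max_depth': trial.suggest_int('max_depth', 1, 10),
    }
    model = lgbm.LGBMRegressor(**params)
    model.fit(x[mask], y[mask])
    return mean_absolute_error(y[~mask], model.predict(x[~mask]))

# study = optuna.create_study(direction='minimize',
#                             sampler=optuna.samplers.TPESampler(seed=123))
# study.optimize(objective_holdout, n_trials=50)
# study.best_params
```
"""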
"""# 3. Add imbalance_size
Next, let's see how the score is improved by adding new features.
As you can see the Light GBM "importance" in the above Section 1, not Price-related but Size-related features were regarded as important by LightGBM, thus try to create Size-related new features. First, let's create the ratio between imbalance_size and matched_size.
### Concept of Feature Engineering
**Feature Engineering** is a critical aspect of model development in machine learning, particularly in fields like finance where market dynamics can be complex. Creating new features can help capture additional insights from the data that are not immediately apparent but may significantly influence the outcome.
### Code Explanation
Although the specific code snippet for creating the `imbalance_ratio` feature is commented out, here’s a breakdown:
```python
# def pre_process1(df):
# df['imbalance_ratio'] = df['imbalance_size'] / df['matched_size']
# return df
```
#### Function: `pre_process1`
- **Purpose**: This function adds a new column to the DataFrame `df` that represents the ratio of `imbalance_size` to `matched_size`.
- **New Feature**: `imbalance_ratio`
- **Definition**: It is calculated as the division of `imbalance_size` by `matched_size`.
- **`imbalance_size`**: This could represent the volume of shares that remain unmatched at the current reference price.
- **`matched_size`**: This typically indicates the volume of shares that can be matched at the current reference price.
### Significance of the `imbalance_ratio` Feature
- **Insight into Market Dynamics**: This ratio provides insight into the relative size of unmatched orders compared to matched orders at the reference price, potentially signaling market pressure (either buying or selling pressure) that isn't fully resolved by current order matches.
- **Indicator of Market Sentiment**: A high `imbalance_ratio` might suggest a strong imbalance in buy or sell orders that could affect the stock price shortly, especially during the closing auction when liquidity and volatility are high.
### Improvement in Model Performance
- **Public Score Improvement**: The noted improvement in the public score (to **5.3866**) suggests that the `imbalance_ratio` provides meaningful information that enhances the model’s ability to predict stock price movements accurately. In machine learning competitions and real-world applications, even small improvements in score can be significant, reflecting better alignment of the model with underlying patterns in the data.
### Utilizing the Feature
To utilize this feature effectively:
- **Uncomment and Integrate**: To apply this preprocessing step, you would uncomment the function and apply it to your data frames where needed (both training and testing datasets).
- **Model Re-training**: After integrating this new feature, re-train your model to ensure that it learns to use this new information.
- **Continuous Evaluation**: Continuously evaluate the impact of this new feature on model performance, using validation sets or through cross-validation, ensuring that it genuinely improves the model rather than fitting noise.
This approach is an excellent example of iterative model improvement through feature engineering, highlighting how domain insights (like the importance of size-related features in trading models) can lead directly to tangible enhancements in predictive accuracy.
"""
#def pre_process1(df):
# df['imbalance_ratio'] = df['imbalance_size'] / df['matched_size']
# return df
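"""One caveat worth keeping in mind when this feature is enabled: rows where `matched_size` is zero would make the ratio infinite, and the later `fillna(0)` call does not touch `inf` values. A guarded variant is sketched below (the `pre_process1_safe` name is just illustrative):

```python
# Guarded version of the ratio: map +/-inf (from matched_size == 0) to NaN
# so that the subsequent fillna(0) handles those rows as well.
def pre_process1_safe(df):
    df['imbalance_ratio'] = (df['imbalance_size'] / df['matched_size']).replace(
        [np.inf, -np.inf], np.nan)
    return df
```
"""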
"""# 4. Add imbalance_size
Then, let's try to add 2 more features related to imbalance between bid-size and ask-size, which will improve Public score to **5.3852**.
For these features, I referred to below great notebook.
https://www.kaggle.com/code/renatoreggiani/optv-lightgbm
### Overview of Feature Engineering Steps
The snippet shows an expanded version of feature engineering where new features are derived from the order book data, focusing on the imbalance and differences between bid and ask sizes, as well as their cumulative and differential impacts on stock price movements.
### Explanation of Each Feature
#### 1. `imbalance_ratio`
- **Definition**: The ratio of `imbalance_size` to `matched_size`.
- **Purpose**: Measures the proportion of unmatched orders to matched orders, providing insight into the market's directional pressure.
#### 2. `imbl_size1`
- **Definition**: The normalized difference between `bid_size` and `ask_size`.
- **Formula**: `(df['bid_size'] - df['ask_size']) / (df['bid_size'] + df['ask_size'])`
- **Purpose**: Captures the net order flow direction, indicating whether buying or selling pressure is dominant.
#### 3. `imbl_size2`
- **Definition**: The normalized difference between `imbalance_size` and `matched_size`.
- **Formula**: `(df['imbalance_size'] - df['matched_size']) / (df['imbalance_size'] + df['matched_size'])`
- **Purpose**: Similar to `imbalance_ratio` but focuses on the relative difference rather than the ratio, providing another perspective on market liquidity and order imbalance.
### Additional Features Considered (but not always effective)
- **`bid_size_diff` and `ask_size_diff`**: Attempt to capture the sequential changes in bid and ask sizes, respectively, which could reflect momentum or shifts in market sentiment. However, these features are noted to not perform well.
- **`bid_size_over_ask_size` and `bid_price_over_ask_price`**: These features aim to directly compare the bid and ask sides, potentially useful in models that are sensitive to such direct ratios.
### Feature Engineering Process
The function `pre_process1` is systematically updated to include these new features, and through testing and validation, their impact on the model's predictive accuracy is assessed. As noted, each group of features contributes to an incremental improvement in the model's public score, indicating their effectiveness.
### Applying Feature Engineering in the Model
```python
df_train = pre_process1(df_train)
df_train = feature_cols(df_train)
df_train.fillna(0, inplace = True)
df_train
```
- **Preprocessing**: Apply the `pre_process1` function to add new features.
- **Feature Selection**: Use the `feature_cols` function to filter the DataFrame, ensuring that only relevant features are included.
- **Handling Missing Data**: Fill any NaN values with zero, a necessary step to prepare the data for modeling without errors due to missing values.
### Conclusion
The iterative approach to adding and testing new features as shown in this example is a cornerstone of effective machine learning practices, particularly in complex domains like financial markets where the dynamics are influenced by numerous and often subtle factors. Each feature is an attempt to encapsulate some aspect of market behavior, and their validation through improved scores demonstrates their utility in enhancing model performance.
"""
#def pre_process1(df):
#
# df['imbl_size1'] = (df['bid_size']-df['ask_size']) / (df['bid_size']+df['ask_size'])
# df['imbl_size2'] = (df['imbalance_size']-df['matched_size']) / (df['imbalance_size']+df['matched_size'])
#
# return df
# Original
# def pre_process1(df):
# df['imbalance_ratio'] = df['imbalance_size'] / df['matched_size']
# #---> improve 0.0012
# df['imbl_size1'] = (df['bid_size']-df['ask_size']) / (df['bid_size']+df['ask_size'])
# df['imbl_size2'] = (df['imbalance_size']-df['matched_size']) / (df['imbalance_size']+df['matched_size'])
# #---> improve 0.0014
# df['bid_size_diff'] = df[["stock_id", "date_id", "bid_size"]].groupby(["stock_id","date_id"]).diff()
# df['ask_size_diff'] = df[["stock_id", "date_id", "ask_size"]].groupby(["stock_id","date_id"]).diff()
# #<--- "diff" doesn't work well
# df["bid_size_over_ask_size"] = df["bid_size"].div(df["ask_size"])
# df["bid_price_over_ask_price"] = df["bid_price"].div(df["ask_price"])
# #---> improve 0.0018
# return df
# Edited
def pre_process1(df):
    df['imbalance_ratio'] = df['imbalance_size'] / df['matched_size']
    #---> improve 0.0012
    df['imbl_size1'] = (df['bid_size'] - df['ask_size']) / (df['bid_size'] + df['ask_size'])
    df['imbl_size2'] = (df['imbalance_size'] - df['matched_size']) / (df['imbalance_size'] + df['matched_size'])
    #---> improve 0.0014
    # df['bid_size_diff'] = df[["stock_id", "date_id", "bid_size"]].groupby(["stock_id", "date_id"]).diff()
    # df['ask_size_diff'] = df[["stock_id", "date_id", "ask_size"]].groupby(["stock_id", "date_id"]).diff()
    # #<--- "diff" doesn't work well
    # df["bid_size_over_ask_size"] = df["bid_size"].div(df["ask_size"])
    # df["bid_price_over_ask_price"] = df["bid_price"].div(df["ask_price"])
    #---> improve 0.0018
    return df
df_train = pre_process1(df_train)
df_train = feature_cols(df_train)
df_train.fillna(0, inplace = True)
df_train
lgbm.plot_importance(lgbm_model, importance_type="gain")
"""# 5. Add bid/ask ratio in size and price
Adding ratios between bid and ask in terms of both price and size as new features in your predictive model for stock price movements reflects a strategic move in feature engineering. These types of features can capture essential aspects of market sentiment and liquidity that are not explicitly represented by individual size or price features. Let's delve into how these features are conceptualized, their potential impact, and the importance of avoiding redundant calculations.
### Concept of Bid/Ask Ratios
1. **Bid/Ask Size Ratio**: This ratio compares the total quantity of buy orders (bids) to the total quantity of sell orders (asks). A higher ratio indicates a dominance of buy orders, which could be interpreted as a bullish signal, whereas a lower ratio might suggest bearish sentiment.
2. **Bid/Ask Price Ratio**: This compares the highest price buyers are willing to pay (bid price) to the lowest price sellers are willing to accept (ask price). This ratio can indicate the immediate direction the market participants expect the stock to move. A ratio close to or greater than 1 might suggest that buyers are willing to pay a price close to or higher than sellers' lowest asking price, potentially driving the price upwards.
### Implementation and Improvement
By adding these ratios, you are essentially trying to leverage the structural information in the order book data that might not be fully utilized by simpler models. LightGBM can indeed consider nonlinear interactions between features, but explicitly modeling interactions that are known to be predictive in financial contexts (like these ratios) can often lead to more robust predictions.
### Code Example for Adding Bid/Ask Ratios
Here's a refined version of how you might implement these features in your preprocessing function:
```python
def pre_process1(df):
# Calculate bid/ask ratios only once to avoid redundancy
df['bid_ask_size_ratio'] = df['bid_size'] / df['ask_size']
df['bid_ask_price_ratio'] = df['bid_price'] / df['ask_price']
# Calculate normalized differences for size and imbalance
df['imbl_size1'] = (df['bid_size'] - df['ask_size']) / (df['bid_size'] + df['ask_size'])
df['imbl_size2'] = (df['imbalance_size'] - df['matched_size']) / (df['imbalance_size'] + df['matched_size'])
return df
```
### Addressing Redundancy and Efficiency
As noted, redundant calculations can be a significant inefficiency in data preprocessing, especially with large datasets typical in financial modeling. Here are strategies to address this:
- **Avoid Repeated Calculations**: Ensure that each unique calculation is only done once, and if the result is needed again, store it rather than recalculating. This approach saves computational resources and execution time.
- **Use Caching**: For more complex or expensive calculations that are used multiple times across different parts of your application or model training process, consider implementing caching. This can be done at the code level using decorators like `@lru_cache` from Python's `functools` or by manually saving results to a temporary data structure.
### Impact on Model Performance
Improving the public score to **5.3834** by adding these features suggests that these aspects of the trading dynamics are crucial in predicting closing prices accurately. This confirms the importance of careful feature selection based on domain knowledge and the behavior of the underlying model (in this case, LightGBM).
By refining the feature engineering process to focus on meaningful relationships and interactions within the data while avoiding unnecessary recalculations, you optimize both the efficiency and effectiveness of your predictive modeling efforts.
"""
x_train = feature_cols(df_train.drop(columns='target'))
y_train = df_train['target'].values
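"""Note that the importance plot at the end of Section 4 still reflects the baseline model fitted in Section 1, before the engineered columns existed. To actually inspect the importance of `imbalance_ratio`, `imbl_size1`, and `imbl_size2`, the model has to be refitted on the expanded `x_train`. A brief sketch (the `lgbm_model_fe` name is illustrative):

```python
# Refit on the expanded feature set so the engineered columns show up in
# the gain-based importance plot.
lgbm_model_fe = lgbm.LGBMRegressor(objective='mae', n_estimators=500, random_state=1234)
lgbm_model_fe.fit(x_train, y_train)
lgbm.plot_importance(lgbm_model_fe, importance_type="gain")
```
"""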
"""I have outlined the process of hyperparameter tuning using GridSearchCV, which is part of the scikit-learn library. This technique is used to find the optimal hyperparameters for the LightGBM model aiming to predict stock prices with the lowest Mean Absolute Error (MAE). Let's go through the main components of this code and discuss each step:
### Step-by-Step Breakdown
1. **Hyperparameter Grid Definition**:
- **`param_grid`** is a dictionary where keys are the names of parameters to tune, and values are the ranges of values to test for each parameter. For this model, the parameters being tuned are:
- `n_estimators`: The number of boosting stages the model will go through. More stages increase the model's complexity and potential accuracy but can lead to overfitting.
- `num_leaves`: The maximum number of leaves in one tree. Increasing this number can make the model more detailed but may cause overfitting.
- `max_depth`: The maximum depth of each tree. Deeper trees can learn more specific patterns but might overfit on the training data.
2. **Grid Search Setup**:
- **`GridSearchCV`**:
- `estimator`: Here, `lgbm.LGBMRegressor(objective='mae')` specifies that the model is a LightGBM regressor with the objective set to minimize the mean absolute error, which is relevant for regression problems where you want to minimize the error magnitude without considering direction.
- `param_grid`: The grid of parameters to test.
- `cv=5`: Specifies that 5-fold cross-validation should be used. In 5-fold cross-validation, the data is split into 5 parts, with each part being used as a validation set once while the remaining 4 parts form the training set. This method helps ensure that the model's performance is stable across different subsets of the data.
3. **Fitting the Grid Search**:
- **`grid_search.fit(x_train, y_train)`**: This command starts the grid search process. The model will be trained multiple times with different combinations of parameters from `param_grid`. Each combination will be evaluated using 5-fold cross-validation to determine its effectiveness.
4. **Best Parameters and Model Training**:
- **`best_params = grid_search.best_params_`**: After the grid search completes, you can retrieve the best parameter set that led to the lowest average cross-validation error.
- **Creating and Training a New Model with the Best Parameters**:
- `lgbm.LGBMRegressor(objective='mae', **best_params)`: This initializes a new LightGBM regressor using the best parameters found.
- `.fit(x_train, y_train)`: Fits the model to the entire training dataset using these optimized parameters.
### Significance
This approach is particularly beneficial for refining model performance, ensuring that you are using the best possible parameters for your specific dataset and problem. By systematically searching through a predefined space of parameter values with cross-validation, GridSearchCV helps avoid overfitting and ensures that the model's performance is robust across different data samples.
### Conclusion
Using GridSearchCV for hyperparameter tuning is a robust method for improving the predictive power of machine learning models. It automates the laborious process of manually searching for the best model settings, leading to more effective and reliable predictions, which is crucial in high-stakes fields like stock price prediction.
"""
# import numpy as np
# from sklearn.model_selection import GridSearchCV
# param_grid = {
# 'n_estimators': [500, 1000, 2000],
# 'num_leaves': [25, 50, 100],
# 'max_depth': [5, 7, 10]
# }
# grid_search = GridSearchCV(estimator=lgbm.LGBMRegressor(objective='mae'), param_grid=param_grid, cv=5)
# grid_search.fit(x_train, y_train)
# best_params = grid_search.best_params_
# # Create and train a new model with the best hyperparameters
# lgbm_model = lgbm.LGBMRegressor(objective='mae', **best_params)
# lgbm_model.fit(x_train, y_train)
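"""One detail worth adding if this grid search is run: `GridSearchCV` scores a regressor with R² by default, so to select parameters by MAE (as the discussion above intends) the `scoring` argument should be set explicitly. A sketch under that assumption:

```python
# Same grid as above, but ranking candidates by (negated) mean absolute error.
from sklearn.model_selection import GridSearchCV

param_grid = {
    'n_estimators': [500, 1000, 2000],
    'num_leaves': [25, 50, 100],
    'max_depth': [5, 7, 10],
}
grid_search = GridSearchCV(
    estimator=lgbm.LGBMRegressor(objective='mae'),
    param_grid=param_grid,
    scoring='neg_mean_absolute_error',
    cv=5,
)
# grid_search.fit(x_train, y_train)
# best_params = grid_search.best_params_
```
"""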
# !pip install hyperopt --upgrade
"""Hyperopt is a powerful tool for optimizing model parameters via various search algorithms, such as Tree-structured Parzen Estimator (TPE), which is used in this case. Let's discuss each component of the code to understand how it contributes to optimizing the LightGBM model:
### Import Statements
```python
# import hyperopt
# from lightgbm import LGBMRegressor
```
- These lines import the Hyperopt library and the LGBMRegressor class from LightGBM. Commented out here, but they are necessary to run the code.
### Objective Function
```python
def objective(params):
model = LGBMRegressor(objective='mae',
n_estimators=params['n_estimators'],
num_leaves=params['num_leaves'],
max_depth=params['max_depth'])
model.fit(x_train, y_train)
y_pred = model.predict(x_train)
mae = mean_absolute_error(y_train, y_pred)
return mae
```
- **Purpose**: Defines the function that Hyperopt will minimize. Here, it trains a LightGBM model with given parameters and calculates the mean absolute error (MAE) on the training set.
- **Parameters**: Takes a dictionary `params` that includes settings for `n_estimators`, `num_leaves`, and `max_depth`.
### Search Space
```python
search_space = {
'n_estimators': hyperopt.hp.choice('n_estimators', range(500, 1000)),
'num_leaves': hyperopt.hp.choice('num_leaves', range(20, 50)),
'max_depth': hyperopt.hp.choice('max_depth', range(5, 10))
}
```
- Defines the hyperparameter space over which to search. `hyperopt.hp.choice` specifies a list of discrete values for each parameter. Hyperopt will test different combinations of these values to find the set that results in the lowest MAE.
### Trials Object and Optimization Call
```python
trials = hyperopt.Trials()
best_hyperparams = hyperopt.fmin(objective, search_space, algo=hyperopt.tpe.suggest, max_evals=13, trials=trials)
```
- **`Trials()`**: Stores details of each trial, including parameters and the resulting MAE.
- **`fmin()`**: Runs the optimization process, using the TPE algorithm (`hyperopt.tpe.suggest`) over 13 evaluations.
### Extract Best Parameters
```python
best_hyperparams = trials.best_trial['misc']['vals']
```
- Extracts the parameter settings recorded for the best trial. Note that with `hp.choice`, the values stored under `'misc'['vals']` (and those returned by `fmin`) are indices into the candidate lists, each wrapped in a single-element list, rather than the parameter values themselves; see the sketch after the code below for converting them back with `hyperopt.space_eval`.
### Train the Model with Best Parameters
```python
lgbm_model = LGBMRegressor(objective='mae',
n_estimators=best_hyperparams['n_estimators'],
num_leaves=best_hyperparams['num_leaves'],
max_depth=best_hyperparams['max_depth'])
lgbm_model.fit(x_train, y_train)
```
- Initializes a new LGBMRegressor with the best parameters found and fits it to the training data.
### Predictions
```python
y_pred = lgbm_model.predict(x_test)
```
- Makes predictions using the optimized model on the test data (`x_test`).
### Conclusion
This approach provides a systematic way to tune model parameters using Hyperopt, which can lead to significant improvements in model performance by carefully searching the parameter space. It's particularly useful in scenarios where manual tuning is impractical due to the large number of combinations and the complexity of interactions between parameters.
"""
# import hyperopt
# from lightgbm import LGBMRegressor
# def objective(params):
# model = LGBMRegressor(objective='mae',
# n_estimators=params['n_estimators'],
# num_leaves=params['num_leaves'],
# max_depth=params['max_depth'])
# model.fit(x_train, y_train)
# y_pred = model.predict(x_train)
# mae = mean_absolute_error(y_train, y_pred)
# return mae
# search_space = {
# 'n_estimators': hyperopt.hp.choice('n_estimators', range(500, 1000)),
# 'num_leaves': hyperopt.hp.choice('num_leaves', range(20, 50)),
# 'max_depth': hyperopt.hp.choice('max_depth', range(5, 10))
# }
# trials = hyperopt.Trials()
# best_hyperparams = hyperopt.fmin(objective, search_space, algo=hyperopt.tpe.suggest, max_evals=13, trials=trials)
# best_hyperparams = trials.best_trial['misc']['vals']
# lgbm_model = LGBMRegressor(objective='mae',
# n_estimators=best_hyperparams['n_estimators'],
# num_leaves=best_hyperparams['num_leaves'],
# max_depth=best_hyperparams['max_depth'])
# lgbm_model.fit(x_train, y_train)
# y_pred = lgbm_model.predict(x_test)
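"""If the Hyperopt block above is uncommented, one adjustment is needed before the final fit: with `hp.choice`, both the dictionary returned by `fmin` and `trials.best_trial['misc']['vals']` hold indices into the candidate lists, not the parameter values themselves. `hyperopt.space_eval` converts those indices back. A sketch assuming `objective` and `search_space` are defined as above:

```python
# Map the indices returned by fmin (hp.choice) back to actual parameter values.
import hyperopt

trials = hyperopt.Trials()
best_indices = hyperopt.fmin(objective, search_space,
                             algo=hyperopt.tpe.suggest,
                             max_evals=13, trials=trials)
best_params = hyperopt.space_eval(search_space, best_indices)

lgbm_model = lgbm.LGBMRegressor(objective='mae', **best_params)
# lgbm_model.fit(x_train, y_train)
```
"""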
"""### Parameter Explanation
1. **`task`: 'train'**
- Specifies the task that LightGBM will perform, which is 'train' in this case. This is typical when you are using LightGBM for building and training new models.
2. **`boosting_type`: 'gbdt'**
- Specifies the boosting algorithm. 'gbdt' stands for Gradient Boosting Decision Tree, which is the standard boosting framework that LightGBM uses. It creates a series of decision trees where each tree learns to correct the errors of the previous one.
3. **`objective`: 'regression'**
- Indicates the learning task and the corresponding learning objective. 'Regression' means the model will predict continuous target values, which is typical for predicting metrics like prices or rates.
4. **`metric`: ['l1', 'l2']**
- Metrics for evaluating model performance. 'l1' is the mean absolute error (MAE), and 'l2' is the mean squared error (MSE). Including both allows you to evaluate the model under different error metrics during the training phase.
5. **`learning_rate`: 0.005**
- Determines the step size at each iteration while moving toward a minimum of the loss function. A smaller learning rate can lead to better performance (at the risk of longer training time and potentially getting stuck in local minima).
6. **`feature_fraction`: 0.9**
- Specifies the fraction of features to be randomly selected for building each tree. A lower value can provide better performance because it provides a better generalization capability and can prevent overfitting.
7. **`bagging_fraction`: 0.7**
- Specifies the fraction of the data to be randomly sampled for each iteration; bagging speeds up training and helps control overfitting.
8. **`bagging_freq`: 10**
- Specifies the frequency for performing bagging. Every 10 iterations, a new subset of the data is selected according to the `bagging_fraction`.
9. **`verbose`: 0**
- Controls the level of LightGBM’s output (verbosity of printing messages). Setting it to 0 means silent mode.
10. **`max_depth`: 8**
- Maximum depth of the trees. Restricting the depth of the trees helps prevent the model from becoming overly complex and overfitting.
11. **`num_leaves`: 20**
- The maximum number of leaves in one tree. More leaves will make the model more complex and can lead to overfitting.
12. **`max_bin`: 512**
- Maximum number of bins that feature values will be bucketed into. A larger number increases the model’s complexity.
13. **`num_iterations`: 1000**
- The number of boosting iterations to be run. More iterations can improve accuracy but might lead to overfitting if not controlled with other parameters like `bagging_fraction`.
14. **`force_col_wise`: 'true'**
- Forces LightGBM to build histograms column-wise (feature-wise). This can be faster and use less memory when the number of features or the total number of bins is large; for datasets with many rows and relatively few features, the row-wise method is usually preferable.
### Fitting the Model
```python
lgbm_model = lgbm.LGBMRegressor(**params)
lgbm_model.fit(x_train, y_train)
```
- These lines initialize an `LGBMRegressor` with the specified parameters and fit it to `x_train` and `y_train`. Training minimizes the `'regression'` (L2) objective; the `'l1'` and `'l2'` entries under `metric` are evaluation metrics, reported only when an evaluation set is supplied, not additional training losses.
This setup applies gradient-boosted regression to a continuous target, with the hyperparameters above chosen to balance accuracy against overfitting.
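Because those metrics are only reported against an evaluation set, one optional refinement is to monitor a hold-out split and stop early. A minimal sketch, assuming a hypothetical `x_val`/`y_val` split (an assumption; the original pipeline trains on `x_train`/`y_train` only):
```python
lgbm_model = lgbm.LGBMRegressor(**params)
lgbm_model.fit(
    x_train, y_train,
    eval_set=[(x_val, y_val)],  # hypothetical hold-out split, not part of the original pipeline
    callbacks=[lgbm.early_stopping(stopping_rounds=100),  # stop when validation scores stall for 100 rounds
               lgbm.log_evaluation(period=100)],          # print l1/l2 every 100 rounds
)
print(lgbm_model.best_iteration_)
```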
"""
params = {
'task': 'train',
'boosting_type': 'gbdt',
'objective': 'regression',
'metric': ['l1','l2'],
'learning_rate': 0.005,
'feature_fraction': 0.9,
'bagging_fraction': 0.7,
'bagging_freq': 10,
'verbose': 0,
"max_depth": 8,
"num_leaves": 20,
"max_bin": 512,
"num_iterations": 1000,
"force_col_wise": 'true'
}
lgbm_model = lgbm.LGBMRegressor(**params)
lgbm_model.fit(x_train, y_train)
# import xgboost as xgb
# # Create an XGBoost regressor
# xgb_model = xgb.XGBRegressor(objective='reg:squarederror',
# n_estimators=895,
# max_depth=7)
# # Train the model
# xgb_model.fit(x_train, y_train)
# xgb.plot_importance(xgb_model)
"""The provided code snippet demonstrates how to use XGBoost, a powerful and widely used machine learning library, to train a regression model with GPU acceleration, plot feature importances, and save the trained model in various formats. Let's break down each part of this process:
### Step-by-Step Explanation
#### 1. **XGBoost Regressor Initialization**
```python
import xgboost as xgb
xgb_model = xgb.XGBRegressor(objective='reg:squarederror',
n_estimators=895,
max_depth=7,
tree_method='gpu_hist')
```
- **`xgb.XGBRegressor`**: This creates an instance of XGBoost's regressor. Parameters specified in the constructor configure the behavior of the model:
- **`objective='reg:squarederror'`**: Sets the loss function to be minimized as squared error, which is appropriate for regression tasks.
- **`n_estimators=895`**: Defines the number of gradient boosted trees to fit. More trees can improve the model's predictive accuracy but may lead to longer training times and overfitting.
- **`max_depth=7`**: Limits the maximum depth of each tree. Deeper trees can model more complex patterns but also can overfit.
- **`tree_method='gpu_hist'`**: Runs the histogram-based tree-building algorithm on a GPU, which speeds up training significantly on large datasets. (In XGBoost 2.0 and later this spelling is deprecated in favor of `tree_method='hist'` combined with `device='cuda'`; see the sketch after this list.)
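If the notebook runs on XGBoost 2.0 or later (an assumption about the environment; the original snippet targets an earlier release where `'gpu_hist'` is valid), the equivalent constructor is:
```python
import xgboost as xgb

# XGBoost >= 2.0 spelling of GPU-accelerated histogram training
xgb_model = xgb.XGBRegressor(objective='reg:squarederror',
                             n_estimators=895,
                             max_depth=7,
                             tree_method='hist',
                             device='cuda')
```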
#### 2. **Training the Model**
```python
xgb_model.fit(x_train, y_train)
```
- **`.fit(x_train, y_train)`**: This method trains the XGBoost model using the provided training data (`x_train`) and targets (`y_train`).
#### 3. **Plotting Feature Importances**
```python
xgb.plot_importance(xgb_model)
```
- **`plot_importance`**: This function from the XGBoost module plots a chart of feature importances. By default (`importance_type='weight'`) the importance of a feature is the number of times it is used to split the data across all trees. The visualization helps identify which features are most influential in predicting the target variable; a gain-based alternative is sketched below.
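For a different view of the same model, importance can be ranked by total gain instead of split count. A sketch (the matplotlib rendering is an addition, not part of the original snippet):
```python
import matplotlib.pyplot as plt

# Rank features by the total loss reduction (gain) they contribute, keep the top 20
ax = xgb.plot_importance(xgb_model, importance_type='gain', max_num_features=20)
ax.figure.tight_layout()
plt.show()
```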
#### 4. **Saving the Model**
```python
xgb_model.save_model('xgb_model.bin')
xgb_model.save_model("xgb_model.json")
xgb_model.save_model("xgb_model.txt")
```
- **`save_model`**: This method saves the trained model to a file so it can be loaded later without retraining. The serialization format is chosen from the file extension:
- **Binary file (`'xgb_model.bin'`)**: Any extension other than `.json` or `.ubj` falls back to XGBoost's legacy internal binary format, which is compact but XGBoost-specific (and deprecated in recent releases).
- **JSON file (`"xgb_model.json"`)**: Saves the model in XGBoost's JSON schema, which is more transparent, portable, and the recommended format in newer releases.
- **Text file (`"xgb_model.txt"`)**: Despite the extension, this still produces the legacy binary format, not a human-readable file. For a readable dump of the trees (useful for inspection, but not loadable back), use `xgb_model.get_booster().dump_model("dump.txt")`. Loading a saved model back is sketched below.
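To reuse a saved model without retraining, a minimal sketch (the file name matches the JSON save above, and `x_test` is a hypothetical preprocessed feature frame):
```python
import xgboost as xgb

# Recreate an empty regressor and restore the trained booster from disk
restored_model = xgb.XGBRegressor()
restored_model.load_model("xgb_model.json")

# y_new = restored_model.predict(x_test)
```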
### Summary
This sequence of operations demonstrates a comprehensive approach to model training with XGBoost, leveraging GPU capabilities for speed, examining feature importance for insights, and preserving the model in various formats for future use, sharing, or deployment. The flexibility in saving models in different formats ensures that you can choose the appropriate one based on your needs for performance, transparency, or compatibility.
"""
import xgboost as xgb
# Create an XGBoost regressor with the gpu_hist tree construction algorithm
xgb_model = xgb.XGBRegressor(objective='reg:squarederror',
n_estimators=895,
max_depth=7,
tree_method='gpu_hist')
# Train the model
xgb_model.fit(x_train, y_train)
# Plot the feature importances
xgb.plot_importance(xgb_model)
# Save the model (a non-.json/.ubj extension selects XGBoost's legacy binary format)
xgb_model.save_model('xgb_model.bin')
# Save as JSON file (recommended, portable format)
xgb_model.save_model("xgb_model.json")
# Note: a ".txt" extension still yields the legacy binary format; for a human-readable
# dump of the trees use xgb_model.get_booster().dump_model("xgb_model_dump.txt")
xgb_model.save_model("xgb_model.txt")
# # Make predictions on the test data
# y_pred = xgb_model.predict(X_test)
# lgbm_model = lgbm.LGBMRegressor(objective='mae',
# n_estimators=895,
# num_leaves= 25,
# max_depth= 7)
# lgbm_model.fit(x_train, y_train)
# from sklearn.ensemble import RandomForestRegressor
# rf_model = RandomForestRegressor(n_estimators=895,
# max_depth= 8,criterion="squared_error",bootstrap=True)
# rf_model.fit(x_train, y_train)
"""# 6. Submission
The provided code snippet outlines how to submit predictions in a Kaggle competition that requires real-time interaction with an API. This setup is often used in "Code Competitions," where the submissions are evaluated on the fly. Let's break down the steps and functionalities involved in this submission process:
### Understanding the Kaggle Environment API
1. **Initialization**:
```python
import optiver2023
env = optiver2023.make_env()
iter_test = env.iter_test()
```
- **`import optiver2023`**: Imports the competition-specific Python module provided by Kaggle, which contains methods necessary for the submission.
- **`env = optiver2023.make_env()`**: Initializes the competition environment. This environment handles the process of receiving the test data and submitting predictions.
- **`iter_test = env.iter_test()`**: Creates an iterator that will provide batches of test data. This method is typically used when the test data is revealed in chunks over time, simulating a real-world scenario such as a trading environment.
2. **Processing and Making Predictions**:
```python
counter = 0
for (test, revealed_targets, sample_prediction) in iter_test:
test = pre_process1(test)
test_df = feature_cols(test)
sample_prediction['target'] = xgb_model.predict(test_df)
env.predict(sample_prediction)
counter += 1
```
- **Loop Over Test Data**: The `for` loop iterates over each batch of data provided by the `iter_test` iterator.
- **Preprocessing**: `pre_process1(test)` applies preprocessing steps to the test data, preparing it by creating new features or transforming existing ones as defined earlier in your workflow.
- **Feature Selection**: `feature_cols(test)` ensures that only the relevant features are used for making predictions, filtering out any non-predictive or extraneous data columns.
- **Making Predictions**: `xgb_model.predict(test_df)` uses the pre-trained XGBoost model to generate predictions based on the processed test data.
- **Submitting Predictions**: `env.predict(sample_prediction)` submits the predictions back to the Kaggle environment. `sample_prediction` is a DataFrame provided by the iterator that already carries the required submission layout, including a `target` column to overwrite with the model's predictions. A lightly hardened variant of this loop is sketched after this list.
- **Counter**: An optional counter is used here to keep track of the number of iterations or batches processed.
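The sketch below is the same loop with two defensive additions (assumptions, not something the original code does): the feature columns are re-aligned to the training layout, and missing values are filled. `x_train.columns` is assumed to hold the features the model was trained on.
```python
counter = 0
for (test, revealed_targets, sample_prediction) in iter_test:
    test = pre_process1(test)
    test_df = feature_cols(test)
    # Defensive additions: keep exactly the training-time feature columns and fill gaps
    test_df = test_df.reindex(columns=x_train.columns).fillna(0)
    sample_prediction['target'] = xgb_model.predict(test_df)
    env.predict(sample_prediction)
    counter += 1
```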
### Important Notes
- **API Optimization Warning**: Note that the current API version is not optimized and should not be used to estimate the runtime of your code on the hidden test set.
- **Contact Information**: For production-level, optimized code, you may email "adityasaxena@g.harvard.edu".
"""
import optiver2023
env = optiver2023.make_env()
iter_test = env.iter_test()