Cascade-Cup-2020

APPROACH

  1. Ensemble of Random Forest, XGBoost, CatBoost, LightGBM, and KNN
  2. Stratified K-Fold with the same seed used for all the models
  3. Adding UserId and Unnamed: 0 as features increased accuracy and F1 score
  4. Creations == 0 perfectly identified class 1
  5. Samples with non-zero Creations had all the classes roughly equally distributed
  6. Trained XGBoost on the dataset with only non-zero Creations and hardcoded the class to 1 whenever Creations is 0
  7. Trained XGBoost on the entire dataset with a new binary feature, "zero creations"
  8. Decomposed the feature columns into 3 components using PCA
  9. Trained XGBoost on the entire dataset after adding the 3 new decomposed columns
  10. Feature selection using feature importances of the models and Recursive Feature Elimination CV
  11. Out-of-fold (OOF) predictions were used, instead of the average across all folds, for deciding the weights of the ensemble (a minimal sketch of this CV setup follows the list). Source
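
A minimal sketch of the shared cross-validation setup (points 2 and 11), assuming generic feature/label arrays `X` and `y` (placeholder names) and any scikit-learn-compatible model:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from xgboost import XGBClassifier

SEED = 42  # assumption: one fixed seed shared by every model in the ensemble

def get_oof_probabilities(model_factory, X, y, n_splits=5):
    """Return out-of-fold class probabilities for one model."""
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=SEED)
    oof = np.zeros((len(y), len(np.unique(y))))
    for train_idx, valid_idx in skf.split(X, y):
        model = model_factory()
        model.fit(X[train_idx], y[train_idx])
        oof[valid_idx] = model.predict_proba(X[valid_idx])
    return oof

# Example: OOF probabilities for one ensemble member.
# oof_xgb = get_oof_probabilities(lambda: XGBClassifier(n_estimators=500), X, y)
```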

MODELS AND NOTEBOOKS

  1. RANDOM FOREST - 1
  • Important features were selected using feature importances.
  • Two Random Forests were combined for better predictions.
  • The soft probabilities from both forests were averaged to get the final probabilities.
  • Notebook
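
A minimal sketch of the two-forest averaging, assuming `X_train`, `y_train`, and `X_test` already exist; the hyperparameters are placeholders, not the tuned values from the notebook:

```python
from sklearn.ensemble import RandomForestClassifier

# Two forests with different settings/seeds; parameter values are illustrative.
rf1 = RandomForestClassifier(n_estimators=500, max_depth=12, random_state=1)
rf2 = RandomForestClassifier(n_estimators=800, max_features="sqrt", random_state=2)

rf1.fit(X_train, y_train)
rf2.fit(X_train, y_train)

# Average the soft probabilities and take the argmax as the final class.
probs = (rf1.predict_proba(X_test) + rf2.predict_proba(X_test)) / 2
preds = probs.argmax(axis=1)
```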
  2. RANDOM FOREST - 2
  3. LIGHTGBM
  • UserId and Unnamed: 0 were used for better age_group prediction.
  • Tuned the number of leaves together with regularization parameters to increase the score.
  • Explored a double-tree setup (class provided in the notebook), but the score decreased.
  • Notebook
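
An illustrative LGBMClassifier configuration reflecting the num_leaves-plus-regularization tuning described above; the actual tuned values are in the notebook:

```python
from lightgbm import LGBMClassifier

# num_leaves is tuned together with L1/L2 regularization; values are placeholders.
lgbm = LGBMClassifier(
    n_estimators=1000,
    learning_rate=0.05,
    num_leaves=63,
    reg_alpha=0.1,   # L1 regularization
    reg_lambda=1.0,  # L2 regularization
    random_state=42,
)
lgbm.fit(X_train, y_train)
lgbm_probs = lgbm.predict_proba(X_test)
```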
  4. CATBOOST
  • UserId and Unnamed: 0 were used for better age_group prediction.
  • The parameters were tuned using a scikit-learn optimizer, and the tuning code is provided.
  • A triple-tree setup was explored, but the score decreased with the addition of UserId.
  • Notebook
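
The "scikit-learn optimizer" is not specified here; RandomizedSearchCV is one plausible reading, sketched below with a hypothetical search space:

```python
from catboost import CatBoostClassifier
from sklearn.model_selection import RandomizedSearchCV

# Hypothetical search space; the actual grid lives in the notebook.
param_distributions = {
    "depth": [4, 6, 8, 10],
    "learning_rate": [0.01, 0.03, 0.1],
    "l2_leaf_reg": [1, 3, 5, 7],
}

search = RandomizedSearchCV(
    CatBoostClassifier(iterations=500, verbose=0, random_state=42),
    param_distributions=param_distributions,
    n_iter=20,
    scoring="f1_macro",
    cv=5,
    random_state=42,
)
search.fit(X_train, y_train)
print(search.best_params_)
```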
  5. KNN CLASSIFIER
  • Trained a KNN classifier on the GPU using the RAPIDS library.
  • Found the best number of neighbors using the elbow method.
  • Notebook
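
A minimal sketch of the neighbour sweep, assuming the RAPIDS cuML KNeighborsClassifier and NumPy inputs (the train/validation splits are placeholders); the elbow is read off the resulting score curve:

```python
from cuml.neighbors import KNeighborsClassifier  # RAPIDS cuML, runs on the GPU
from sklearn.metrics import f1_score

# Sweep k and look for the point where the validation score stops improving.
scores = {}
for k in range(3, 52, 2):
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train, y_train)
    preds = knn.predict(X_valid)
    scores[k] = f1_score(y_valid, preds, average="macro")

best_k = max(scores, key=scores.get)
```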
  6. DOUBLE XGBOOST - WITH MANUAL TUNING
  • Removed features using feature importances and recursive feature elimination.
  • Used two XGBoost models together for better predictions.
  • The parameters were tuned manually.
  • Notebook
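
A minimal sketch of recursive feature elimination with cross-validation around an XGBoost estimator; the fold count and scoring metric are assumptions:

```python
from sklearn.feature_selection import RFECV
from xgboost import XGBClassifier

selector = RFECV(
    estimator=XGBClassifier(n_estimators=200, random_state=42),
    step=1,            # drop one feature per iteration
    cv=5,
    scoring="f1_macro",
)
selector.fit(X_train, y_train)

# Keep only the selected features for both training and test data.
X_train_sel = selector.transform(X_train)
X_test_sel = selector.transform(X_test)
```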
  7. XGBOOST - BASELINE
  • Removed features using recursive feature elimination.
  • Trained a baseline XGBoost model with no parameter tuning.
  • Notebook
  8. XGBOOST - UNNAMED
  • Removed features using feature importances.
  • Used the Unnamed: 0 and UserId features for better classification.
  • Tuned the number of trees and the regularization parameters by hand; the results are documented as comments.
  • Notebook
  9. XGBOOST - UNNAMED AND NON-ZERO TRAINING
  • Removed features using feature importances.
  • Trained only on samples with a non-zero Creations value, which resulted in faster training and better results.
  • Samples with Creations equal to zero were hardcoded to age group 1.
  • This method was discovered during EDA of the training data.
  • Notebook
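
A minimal sketch of the split-and-hardcode scheme; the column names (creations, age_group) and the features list are assumptions about the dataset:

```python
import numpy as np
from xgboost import XGBClassifier

# Train only on rows with non-zero creations; column names are assumptions.
nonzero_train = train_df["creations"] != 0
model = XGBClassifier(n_estimators=500, random_state=42)
model.fit(train_df.loc[nonzero_train, features], train_df.loc[nonzero_train, "age_group"])

# At inference time, hardcode age group 1 wherever creations == 0.
nonzero_test = (test_df["creations"] != 0).values
preds = np.full(len(test_df), 1, dtype=int)
preds[nonzero_test] = model.predict(test_df.loc[nonzero_test, features])
```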
  10. XGBOOST - UNNAMED AND BINARY FEATURES
  • Removed features using feature importances and recursive feature elimination.
  • Created a new binary feature from the Creations column, which gave the model information about the sparse nature of that column in the training data.
  • Notebook
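
The binary flag itself is a one-liner; column names are assumptions:

```python
# Flag rows where the creations column is zero (column names are assumptions).
for df in (train_df, test_df):
    df["zero_creations"] = (df["creations"] == 0).astype(int)
```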
  11. XGBOOST - PCA
  • Used Principal Component Analysis (PCA) to decompose the data and create new features for the model.
  • The PCA features were used in addition to the original features so as not to lose information from the original data.
  • Notebook
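
A minimal sketch of appending 3 PCA components to the original features; the standardisation step and column names are assumptions:

```python
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Fit the scaler and PCA on the training features only.
scaler = StandardScaler().fit(train_df[features])
pca = PCA(n_components=3, random_state=42).fit(scaler.transform(train_df[features]))

# Append the components alongside the original columns for both splits.
for df in (train_df, test_df):
    components = pca.transform(scaler.transform(df[features]))
    for i in range(components.shape[1]):
        df[f"pca_{i}"] = components[:, i]
```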
  12. ENSEMBLING
  • All the above models were used to generate OOF files, which were then blended using a SciPy optimizer.
  • Appropriate weights were found from the OOF files and used to combine the test predictions.
  • The blending increased the F1 score by as much as 1.2.
  • Notebook
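
A minimal sketch of the weight search over OOF files, assuming `oof_list` and `test_list` hold per-model probability arrays of shape (n_samples, n_classes) and `y_true` holds 0-indexed training labels; scipy.optimize.minimize with Nelder-Mead is one plausible reading of "SciPy optimizer":

```python
import numpy as np
from scipy.optimize import minimize
from sklearn.metrics import f1_score

def neg_f1(weights, oof_list, y_true):
    """Negative macro-F1 of the weighted blend (minimised below)."""
    blend = np.average(oof_list, axis=0, weights=weights)
    return -f1_score(y_true, blend.argmax(axis=1), average="macro")

n_models = len(oof_list)
result = minimize(
    neg_f1,
    x0=np.full(n_models, 1.0 / n_models),  # start from equal weights
    args=(oof_list, y_true),
    method="Nelder-Mead",
)

# Apply the learned weights to the test-set probabilities.
final_preds = np.average(test_list, axis=0, weights=result.x).argmax(axis=1)
```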

WHAT DIDN’T WORK

  1. Neural networks couldn't cross 70. We tried both a plain neural network and a skip-connection (ResNet-style) model. (Approach)
  2. TabNet couldn't cross 67 and fluctuated a lot.
  3. Training on GPU resulted in a lower F1 score (0.3 less than on CPU). (RESOURCE)
  4. t-SNE decomposition was very slow (9 hours on GPU were not enough).
  5. Tried using Kernel PCA but hit a memory limit error.
  6. SVM was very slow (9 hours on CPU were not enough to complete even a single fold).
  7. Kaggle results were not reproducible on Colab.
  8. The fast.ai tabular learner gave a high training loss.

WHAT MORE COULD BE DONE

  1. Better hyperparameter search using Optuna (our approach); a minimal Optuna sketch follows this list.
  2. Better feature engineering could be done.
  3. More deep learning approaches could be explored.
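
For point 1, a sketch of what an Optuna search could look like for one of the XGBoost models, assuming feature/label arrays `X` and `y`; the search space and trial budget are illustrative:

```python
import optuna
from sklearn.model_selection import cross_val_score
from xgboost import XGBClassifier

def objective(trial):
    params = {
        "n_estimators": trial.suggest_int("n_estimators", 200, 1500),
        "max_depth": trial.suggest_int("max_depth", 3, 10),
        "learning_rate": trial.suggest_float("learning_rate", 1e-3, 0.3, log=True),
        "subsample": trial.suggest_float("subsample", 0.5, 1.0),
    }
    model = XGBClassifier(**params, random_state=42)
    return cross_val_score(model, X, y, cv=5, scoring="f1_macro").mean()

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=50)
print(study.best_params)
```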
