A submission for HUAWEI - 2020 DIGIX GLOBAL AI CHALLENGE
team: Melbourne dağları
members: @mustafahakkoz, @Aysenuryilmazz
rank: 94/343
score (AUC): 0.679876
dataset: advertising behavior data. A heavily unbalanced and very large (out-of-core) dataset containing advertising behavior collected over seven consecutive days.
- training dataset (6.09 GB, 43M rows, 36 cols)
- 2 testing datasets (153 MB, 1M rows, 36 cols)
The main ideas of the project are:
- Reading the dataset in chunks and downcasting dtypes to fit it into memory.
- Target encoding with smoothing.
- SGD model with mini-batches.
- class_weights to balance classes.
Implementation details can be found in the document DIGIX Implementation Instruction.docx.
1. reading dataset with chunks and downcasting
- We read the whole train dataset (~42M rows) with a chunk size of 10K and applied downcasting to reduce its size in memory.
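A minimal sketch of this step, assuming a pipe-separated raw file named train_data.csv (both the file name and the separator are assumptions):

```python
# Minimal sketch: read the raw file in 10K-row chunks and downcast
# numeric columns to smaller dtypes before concatenating.
# "train_data.csv" and sep="|" are assumptions about the raw file.
import pandas as pd

def downcast(df: pd.DataFrame) -> pd.DataFrame:
    """Downcast numeric columns to the smallest dtype that can hold them."""
    for col in df.columns:
        if pd.api.types.is_integer_dtype(df[col]):
            df[col] = pd.to_numeric(df[col], downcast="integer")
        elif pd.api.types.is_float_dtype(df[col]):
            df[col] = pd.to_numeric(df[col], downcast="float")
    return df

chunks = [downcast(chunk)
          for chunk in pd.read_csv("train_data.csv", sep="|", chunksize=10_000)]
train = pd.concat(chunks, ignore_index=True)
train.info(memory_usage="deep")  # check how much memory was saved
```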
2. target encoding with smoothing
- We implemented target encoding on the columns using a custom function which smooths the standard target encoding of each column with the global mean (see the sketch after this list).
- We dropped the uid and pt_d columns from the train dataset.
- We shuffled the dataset and split it into 40M rows for training and the rest (~2M rows) for testing.
- We produced the train dataset across several notebooks due to the hard disk limitation of the Kaggle platform (only 5 GB).
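A minimal sketch of the smoothed target encoding, assuming the binary target column is called label; the column name, the smoothing weight, and the example column slot_id are placeholders rather than the exact values used in the notebooks:

```python
# Minimal sketch of target encoding smoothed toward the global target mean.
# "label", weight=100 and "slot_id" are illustrative assumptions.
import pandas as pd

def smoothed_target_encoding(train: pd.DataFrame, col: str,
                             target: str = "label", weight: float = 100.0):
    """Return a category -> encoded value mapping and the global target mean.

    Rare categories are pulled toward the global mean; frequent categories
    stay close to their own mean."""
    global_mean = train[target].mean()
    stats = train.groupby(col)[target].agg(["mean", "count"])
    encoding = ((stats["count"] * stats["mean"] + weight * global_mean)
                / (stats["count"] + weight))
    return encoding, global_mean

# usage on one column; categories unseen in training map to NaN on the
# test side and fall back to the global mean
# encoding, global_mean = smoothed_target_encoding(train, "slot_id")
# train["slot_id"] = train["slot_id"].map(encoding)
# test["slot_id"] = test["slot_id"].map(encoding).fillna(global_mean)
```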
3. SGD model with mini-batches
- We chose the SGD model of Scikit-Learn with default parameters and fed it batches of 10K rows, since it supports out-of-core learning via partial_fit and the warm_start parameter (see the sketch after this list).
- For every batch, we used the class_weight parameter to balance the classes.
- After evaluating our model on our test split (AUC score of 70%), we refit the model on the whole training set (~42M rows) and exported it.
- Using the exported model, we ran prediction on the submission dataset test_data_B.csv. For this step we filled the NA values produced by target encoding (caused by newly encountered categories) with mean values.
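A minimal sketch of the mini-batch training, evaluation, export, and submission steps, assuming pre-encoded CSV files (train_encoded.csv, holdout_encoded.csv, test_data_B_encoded.csv) and a target column named label; all of these names are placeholders. The class balancing shown here computes per-batch sample weights from class frequencies, a stand-in for the class_weight handling described above:

```python
# Minimal sketch of out-of-core training with SGDClassifier.partial_fit.
# File names and the "label" column are assumptions; features are expected
# to be numeric already (i.e. after target encoding).
import joblib
import numpy as np
import pandas as pd
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import roc_auc_score
from sklearn.utils.class_weight import compute_sample_weight

model = SGDClassifier(random_state=42)   # default parameters, as in the text
classes = np.array([0, 1])

for chunk in pd.read_csv("train_encoded.csv", chunksize=10_000):
    y = chunk["label"]
    X = chunk.drop(columns=["label"])
    # weight each sample inversely to its class frequency within the batch
    weights = compute_sample_weight(class_weight="balanced", y=y)
    model.partial_fit(X, y, classes=classes, sample_weight=weights)

# evaluate on the held-out split, then export the fitted model
holdout = pd.read_csv("holdout_encoded.csv")
scores = model.decision_function(holdout.drop(columns=["label"]))
print("AUC:", roc_auc_score(holdout["label"], scores))
joblib.dump(model, "sgd_model.joblib")

# submission set: categories unseen during encoding become NaN, so fill
# them with column means before scoring
test = pd.read_csv("test_data_B_encoded.csv")
test = test.fillna(test.mean(numeric_only=True))
submission_scores = model.decision_function(test)
```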
- We didn't use any cross-validation or hyperparameter tuning techniques for this contest due to the computational constraints of the online platforms.
- We also didn't perform any feature engineering.
- We also tried Decision Tree, XGBoost, CatBoost, and LightGBM with several parameter settings, but they didn't work out due to memory errors.
- This repo contains only the final versions. Experiments were implemented on the Kaggle platform. All of the notebooks, including scratch work, are listed below.
- Notebook set 1:
  a. Splitting Dataset into 4 parts-1 [deleted]
  b. Splitting Dataset into 4 parts-2 [deleted]
  c. Splitting Dataset into 4 parts-3
  d. Splitting Dataset into 4 parts-4
  e. Trying out XGBoost with batches, which failed since boosting cannot work with different datasets
- Notebook set 2:
  a. Creating the test set, encoding map dictionary, and datatype dictionary for reading data by chunks
  b. Splitting train set by chunk size of 5M-1
  c. Splitting train set by chunk size of 5M-2
  d. Splitting train set by chunk size of 5M-3
  e. Splitting train set by chunk size of 5M-4
  f. Splitting train set by chunk size of 5M-5
  g. Splitting train set by chunk size of 5M-6
  h. Splitting train set by chunk size of 5M-7
  i. Splitting train set by chunk size of 5M-8
  j. Training the SGD model by chunks with class_weights, testing the model, then refitting on the whole data