
# Sberbank Data Science Journey 2018: AutoML

SDSJ AutoML is an AutoML (automatic machine learning) competition aimed at developing machine-learning systems for banking datasets: transactions, time series, and classic tabular data from real banking operations. The system handles processing automatically, including model selection, architecture, and hyperparameters.

## Team members

Dmitriy Kulagin, Yauheni Kachan, Nastassia Smolskaya, Vadim Yermakov

## Solution description

### Preprocessing

- Drop constant columns
- Add time-shifted columns
- Extract features from datetime columns (year, weekday, month, day)
- Smoothed target encoding (Semenov encoding) for string and id columns; when the dataset has more than 1000 rows, also for numeric columns with fewer than 31 unique values (a rough sketch of this encoding follows this list)
- If the dataset is larger than 250 MB, convert the data to the np.float32 data type
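A minimal sketch of the datetime features and the smoothed (Semenov) target encoding is below. The helper names and the smoothing coefficient `alpha` are illustrative assumptions; the solution's exact values are not stated in this README.

```python
import pandas as pd


def add_datetime_features(df: pd.DataFrame, col: str) -> pd.DataFrame:
    """Derive year / weekday / month / day features from a datetime column."""
    dt = pd.to_datetime(df[col])
    df[f'{col}_year'] = dt.dt.year
    df[f'{col}_weekday'] = dt.dt.weekday
    df[f'{col}_month'] = dt.dt.month
    df[f'{col}_day'] = dt.dt.day
    return df


def smoothed_target_encode(train: pd.DataFrame, test: pd.DataFrame,
                           column: str, target: str, alpha: float = 10.0):
    """Smoothed mean target encoding: categories with few rows are pulled
    toward the global target mean. `alpha` is an assumed smoothing strength."""
    global_mean = train[target].mean()
    stats = train.groupby(column)[target].agg(['sum', 'count'])
    encoding = (stats['sum'] + alpha * global_mean) / (stats['count'] + alpha)
    return (train[column].map(encoding).fillna(global_mean),
            test[column].map(encoding).fillna(global_mean))
```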

### Training

If the dataset has fewer than 1000 rows (e.g. the first dataset) and the task is regression, train a linear model; otherwise train gradient boosting model(s). A minimal sketch of this branching rule follows.
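In this sketch the two callables stand for the Linear Model and Gradient Boosting pipelines described below, and the `task` label is an assumed convention:

```python
def choose_and_train(df, task, train_linear, train_boosting):
    """Dispatch rule described above.

    `task` is assumed to be 'regression' or 'classification'; the two
    callables correspond to the pipelines sketched in the next sections.
    """
    if task == 'regression' and len(df) < 1000:
        return train_linear(df)
    return train_boosting(df)
```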

#### Linear Model

1. Fill missing values with the column mean
2. Transform the data with QuantileTransformer
3. Train Lasso with regularization term alpha = 0.1
4. Search for the best alpha for Lasso and Ridge and select the better of the two:
   1. Cross-validation: time_series_split.TimeSeriesCV with min(6, number_of_rows / 30) folds if datetime_0 is in the dataset, otherwise KFold with 3 folds
   2. Grid-search alpha for Lasso over np.logspace(-2, 0, n_points), where n_points is the minimum of 35 and an estimate of how many times the model can be trained on all folds in the remaining time
   3. Grid-search alpha for Ridge over np.logspace(-2, 2, n_points) if the estimate allows more than 2 additional training runs on all folds
5. If the grid search succeeds, select the better of Lasso and Ridge; otherwise fall back to the Lasso trained earlier with alpha = 0.1 (a scikit-learn sketch follows this list)
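A rough scikit-learn sketch of this pipeline, assuming scikit-learn's TimeSeriesSplit in place of the solution's time_series_split.TimeSeriesCV, a fixed n_points instead of the time-budget estimate, and a regression scoring metric not specified in this README:

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import Lasso, Ridge
from sklearn.model_selection import GridSearchCV, KFold, TimeSeriesSplit
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import QuantileTransformer


def build_linear_search(n_rows: int, has_datetime: bool, n_points: int = 35) -> GridSearchCV:
    # Step 4.1: time-aware CV when a datetime column is present, plain KFold otherwise.
    if has_datetime:
        cv = TimeSeriesSplit(n_splits=min(6, max(2, n_rows // 30)))
    else:
        cv = KFold(n_splits=3)

    pipeline = Pipeline([
        ('impute', SimpleImputer(strategy='mean')),   # step 1: fill NaNs with column means
        ('scale', QuantileTransformer()),             # step 2
        ('model', Lasso(alpha=0.1)),                  # step 3: fallback model
    ])

    # Steps 4.2-4.3: grid-search alpha for Lasso and Ridge on the same pipeline.
    param_grid = [
        {'model': [Lasso()], 'model__alpha': np.logspace(-2, 0, n_points)},
        {'model': [Ridge()], 'model__alpha': np.logspace(-2, 2, n_points)},
    ]
    return GridSearchCV(pipeline, param_grid, cv=cv, scoring='neg_mean_squared_error')
```

Fitting the returned search object (`search.fit(X, y)`) yields the best of the two linear models; if the search cannot be completed in time, the pipeline with Lasso(alpha = 0.1) serves as the fallback (step 5).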

#### Gradient Boosting

1. Train XGBoost in several passes of 700 trees each (early_stopping_rounds=20), continuing from the previous pass as long as there is enough time for another pass or until early stopping is triggered
2. If XGBoost training takes less than two-thirds of the available time, also train LightGBM with 5000 trees (early_stopping_rounds=20)
3. If both XGBoost and LightGBM train successfully, stack them with Logistic Regression or Ridge, depending on the prediction task (see the sketch after this list)
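An illustrative sketch of this procedure, with the time-budget bookkeeping and the chunked 700-tree continuation loop simplified away; all parameter values other than those named above are assumptions:

```python
import numpy as np
import xgboost as xgb
import lightgbm as lgb
from sklearn.linear_model import LogisticRegression, Ridge


def train_boosting(X_train, y_train, X_valid, y_valid, is_classification: bool):
    dtrain = xgb.DMatrix(X_train, label=y_train)
    dvalid = xgb.DMatrix(X_valid, label=y_valid)

    # Step 1: one 700-tree pass; the real solution repeats this call with
    # continuation (the xgb_model argument) while time remains.
    params = {'objective': 'binary:logistic' if is_classification else 'reg:squarederror'}
    xgb_model = xgb.train(params, dtrain, num_boost_round=700,
                          evals=[(dvalid, 'valid')], early_stopping_rounds=20)

    # Step 2: LightGBM with 5000 trees and the same early-stopping patience.
    lgb_model = lgb.train(
        {'objective': 'binary' if is_classification else 'regression'},
        lgb.Dataset(X_train, label=y_train),
        num_boost_round=5000,
        valid_sets=[lgb.Dataset(X_valid, label=y_valid)],
        callbacks=[lgb.early_stopping(20)],
    )

    # Step 3: stack the two models' validation predictions with a linear meta-model.
    stacked = np.column_stack([xgb_model.predict(dvalid), lgb_model.predict(X_valid)])
    meta = LogisticRegression() if is_classification else Ridge()
    meta.fit(stacked, y_valid)
    return xgb_model, lgb_model, meta
```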

## Local Validation

Public datasets for local validation: sdsj2018_automl_check_datasets.zip

## Docker 🐳

`docker pull rekcahd/sdsj2018`

## Useful links