The objective of this project is to enable quick experimentation in data analytics projects with minimal boilerplate code, getting rid of all the repeated `fit_transform` calls.
NOTE
- This is a work in progress. The underlying modules are still under development.
- As this project matures, scripts such as `train.py` and `predict.py` will change.
- TODO
  - Create modules for tuning and stacking
  - Remove modules that are redundant
 
The framework is designed to make the data science workflow easier by encapsulating the different techniques for each step within a single method. There are classes for each of the steps listed below:
- Feature Evaluation - Report to give an intuition of the dataset
- Feature Engineering - Modules to perform feature transformations on categorical and numerical data.
  - The applicable techniques are encoded within these modules and are selected with an argument.
- Feature Generation - Module to create new features based on different techniques
- Cross Validation - Stratified folding for both regression and classification
- Training - Run multiple models using a single class.
  - Evaluates and saves the results in an organized manner
- Tuning - Hyper-parameter tuning of multiple models, based on JSON arguments for parameter values.
- Prediction
- Evaluating the model

To set up the framework:
- Clone the repo.
- Create 3 folders: `input`, `model` and `tune`.
- Save the training, testing and sample submission files in the `input` folder.
- The outputs generated from training, such as the trained model, encoders and `oof_preds`, will be saved in the `model` folder.
- The parameters for fine-tuning the models should be saved in the `tune` folder.
- Update `config.py` to point to the correct paths for the data, model and tuning folders.
- Update `dispatcher.py` with the model or models you want to run on your dataset.
- Use the sample notebook to understand how to use this framework after this initial configuration is completed.
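
The folder setup described above can also be scripted. The following is a minimal sketch, not part of the repository: the helper name `create_project_folders` is hypothetical, and it assumes the three folders live directly under the repository root.

```python
from pathlib import Path


def create_project_folders(root="."):
    """Create the input, model and tune folders under `root` if they are missing."""
    root = Path(root)
    folders = {name: root / name for name in ("input", "model", "tune")}
    for path in folders.values():
        # exist_ok=True makes the call safe to repeat on an existing checkout.
        path.mkdir(parents=True, exist_ok=True)
    return folders
```

Calling `create_project_folders()` from the repository root leaves existing folders and their contents untouched.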

The scripts in the framework are:
- `config.py`: Configuration file that holds the paths to all the datasets and other standard settings, such as the CSV file paths and the random seed.
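
As an illustration of what such a configuration file might contain, here is a hedged sketch. Every constant name and file name below is an assumption made for this example; the actual `config.py` in the repository may use different names and values.

```python
from pathlib import Path

# Assumption: all folders sit next to the scripts, under the repository root.
BASE_DIR = Path(".")
INPUT_DIR = BASE_DIR / "input"
MODEL_DIR = BASE_DIR / "model"
TUNE_DIR = BASE_DIR / "tune"

# Hypothetical file names for the training, testing and sample submission files.
TRAIN_FILE = INPUT_DIR / "train.csv"
TEST_FILE = INPUT_DIR / "test.csv"
SAMPLE_SUBMISSION_FILE = INPUT_DIR / "sample_submission.csv"

# A single seed shared by fold creation and model training keeps runs repeatable.
RANDOM_SEED = 42
```

Keeping paths and the seed in one place means the other scripts only need to import this module rather than hard-code locations.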
- `feature_eval.py`: This script and the class inside it analyze a dataframe and its columns to produce the following output:
  - min, max and unique values of each column
  - histogram / distribution of each column
  - correlation of columns using a heat map
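
A report of this kind can be approximated with plain pandas. The sketch below is not the code in `feature_eval.py`; the function names `summarize_columns` and `correlation_matrix` are hypothetical and only illustrate the per-column summary and the correlation table that a heat map would be drawn from.

```python
import pandas as pd


def summarize_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Return one row per column: dtype, unique count, missing count, min and max."""
    numeric = df.select_dtypes(include="number")
    summary = pd.DataFrame(
        {
            "dtype": df.dtypes.astype(str),
            "n_unique": df.nunique(),
            "n_missing": df.isna().sum(),
        }
    )
    # min/max are only computed for numeric columns; other columns get NaN.
    summary["min"] = numeric.min()
    summary["max"] = numeric.max()
    return summary


def correlation_matrix(df: pd.DataFrame) -> pd.DataFrame:
    """Pearson correlation between the numeric columns."""
    return df.select_dtypes(include="number").corr()
```

If matplotlib and seaborn are installed, `df.hist()` draws the per-column histograms and `seaborn.heatmap(correlation_matrix(df))` draws the heat map.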
 
- `feature_gen.py`: Encapsulates the method to generate new features. Currently implements the `PolynomialFeatures` method from sklearn. Returns a dataframe with the new features.
- `feature_impute.py`: Encapsulates the method to impute blank values in a dataframe and returns the updated dataframe. Currently, it supports 3 imputation methods:
  - Simple Imputer
  - Model Based Imputer: Extra Trees or KNN
  - KNN based imputer
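
The three imputation styles can be sketched with scikit-learn as follows. This is an illustration under assumptions, not the repository's implementation: the function name `impute` and the `method` values are hypothetical, the model-based variant is assumed here to be `IterativeImputer` with an Extra Trees estimator, and all columns are assumed to be numeric with at least one observed value each.

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (enables IterativeImputer)
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.impute import IterativeImputer, KNNImputer, SimpleImputer


def impute(df, method="simple"):
    """Fill missing values in a numeric dataframe and return a new dataframe."""
    if method == "simple":
        imputer = SimpleImputer(strategy="median")
    elif method == "knn":
        imputer = KNNImputer(n_neighbors=5)
    elif method == "model":
        # Assumption: "model based" means iterative imputation with Extra Trees.
        estimator = ExtraTreesRegressor(n_estimators=10, random_state=0)
        imputer = IterativeImputer(estimator=estimator, max_iter=5, random_state=0)
    else:
        raise ValueError(f"Unknown imputation method: {method!r}")
    values = imputer.fit_transform(df)
    return pd.DataFrame(values, columns=df.columns, index=df.index)
```

Selecting the technique with a single argument mirrors the framework's idea of choosing a method by name rather than wiring up each imputer by hand.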
 
- `cross_validation.py`: This class performs cross validation on any dataframe based on the type of problem statement (regression or classification). It is used to create the cross-validated (folded) dataset.
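
One common way to stratify folds for both problem types is sketched below. How `cross_validation.py` actually does this is not shown in this document, so treat the function `create_folds`, the `kfold` column name and the binning rule (Sturges' rule for regression targets) as assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedKFold


def create_folds(df, target, problem="classification", n_splits=5, seed=42):
    """Return a copy of `df` with a `kfold` column holding the validation fold of each row."""
    out = df.copy()
    out["kfold"] = -1
    if problem == "classification":
        strata = out[target]
    else:
        # Regression: bin the continuous target so it can be stratified like a class label.
        n_bins = int(np.floor(1 + np.log2(len(out))))
        strata = pd.cut(out[target], bins=n_bins, labels=False)
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    kfold_position = out.columns.get_loc("kfold")
    for fold, (_, valid_idx) in enumerate(skf.split(out, strata)):
        out.iloc[valid_idx, kfold_position] = fold
    return out
```

Rows with `kfold == k` form the validation set of fold `k`, and all other rows form its training set.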
- `categorical.py`: This class can be used to encode the categorical features in a given dataframe.
  - Inputs: Dataframe, Categorical Columns List, Type of Encoding
  - Output: Encoded Dataframe
  - Supported Encoding Techniques:
    - Label Encoding
    - Binary Encoding
    - One Hot Encoding
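
The three encodings can be illustrated with pandas alone. This sketch is not the class in `categorical.py`: the function name `encode_categorical` and the `method` values are hypothetical, the binary encoding is a hand-written version (label code written out in bits), and the columns are assumed to contain no missing values.

```python
import pandas as pd


def encode_categorical(df, columns, method="label"):
    """Encode `columns` of `df` with label, binary or one-hot encoding."""
    if method not in ("label", "binary", "onehot"):
        raise ValueError(f"Unknown encoding: {method!r}")
    if method == "onehot":
        return pd.get_dummies(df, columns=columns, dtype=int)
    out = df.copy()
    for col in columns:
        # Integer codes follow the sorted order of the unique values.
        codes, _ = pd.factorize(out[col], sort=True)
        if method == "label":
            out[col] = codes
        else:
            # Binary encoding: one 0/1 column per bit of the integer code.
            n_bits = max(1, int(codes.max()).bit_length())
            for bit in range(n_bits):
                out[f"{col}_bin{bit}"] = (codes >> bit) & 1
            out = out.drop(columns=col)
    return out
```

Binary encoding needs roughly log2(n) columns for n categories, which is why it is often preferred over one-hot encoding for high-cardinality features.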
 
 
- `numerical.py`: This class can be used to transform the numerical features in a given dataframe.
  - Inputs: Dataframe, Numerical Columns List, Type of Encoding
  - Output: Encoded Dataframe, Transformer Object for later use.
  - Supported Techniques:
    - Standard Scaler
    - Min-Max Scaler
    - Power Transformer
    - Log Transformer
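
A sketch of the "dataframe in, dataframe plus fitted transformer out" pattern is shown below. The function name `transform_numerical` and the `method` values are assumptions for this example, and the log transform is assumed to be `log1p`; the actual class in `numerical.py` may differ.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import (
    FunctionTransformer,
    MinMaxScaler,
    PowerTransformer,
    StandardScaler,
)


def transform_numerical(df, columns, method="standard"):
    """Transform `columns` and return the new dataframe and the fitted transformer."""
    transformers = {
        "standard": StandardScaler(),
        "minmax": MinMaxScaler(),
        "power": PowerTransformer(),  # Yeo-Johnson by default
        "log": FunctionTransformer(np.log1p, inverse_func=np.expm1),
    }
    if method not in transformers:
        raise ValueError(f"Unknown transformation: {method!r}")
    transformer = transformers[method]
    out = df.copy()
    out[columns] = np.asarray(transformer.fit_transform(df[columns]))
    return out, transformer
```

Returning the fitted transformer lets the same scaling be applied to the test set later with `transformer.transform(test_df[columns])`, which avoids fitting on test data.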
 
 
- `metrics.py`: This class can be used to evaluate predictions against the actual values.
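
A metrics wrapper of this kind might look like the sketch below. The class name `Metrics`, the metric keys and the call signature are assumptions for illustration; the set of metrics supported by the real `metrics.py` is not listed in this document.

```python
import numpy as np
from sklearn import metrics as skm


class Metrics:
    """Look up an evaluation metric by name and apply it to predictions."""

    def __init__(self):
        self._metrics = {
            "accuracy": skm.accuracy_score,
            "f1": skm.f1_score,
            "mae": skm.mean_absolute_error,
            "rmse": self._rmse,
        }

    @staticmethod
    def _rmse(y_true, y_pred):
        # Root mean squared error computed from sklearn's mean squared error.
        return float(np.sqrt(skm.mean_squared_error(y_true, y_pred)))

    def __call__(self, name, y_true, y_pred):
        if name not in self._metrics:
            raise ValueError(f"Unknown metric: {name!r}")
        return self._metrics[name](y_true, y_pred)
```

For example, `Metrics()("accuracy", y_true, y_pred)` scores a classifier and `Metrics()("rmse", y_true, y_pred)` scores a regressor through the same entry point.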
- `dispatcher.py`: Python file with the models and their parameters. It is designed to supply the models to `engine.py` for training on a given dataset.
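
A dispatcher is typically just a mapping from a model name to a configured estimator. The sketch below assumes a dictionary named `MODELS` and two example classifiers; the names, models and parameter values are illustrative and not taken from the repository.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

# Hypothetical dispatcher: add or remove entries to choose what gets trained.
MODELS = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(
        n_estimators=200, n_jobs=-1, random_state=42
    ),
}
```

Because every entry exposes the same `fit` and `predict` methods, the training code can loop over the dictionary without knowing which models it contains.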
- `engine.py`: This script encapsulates the method to train and evaluate multiple models simultaneously.
  - Relies on `dispatcher.py` and `metrics.py` for the models and metrics
  - The results for each fold are also saved in the `models` folder as `oof_predictions.csv` for each model.
  - To do: stacking module to support stacking of multiple models
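
The training loop with out-of-fold (OOF) predictions can be sketched as follows. This is an illustration under assumptions rather than the code in `engine.py`: the function name `train_models` is hypothetical, the task is assumed to be classification with integer class labels scored by accuracy, and for brevity all models are written to one CSV with a column per model.

```python
from pathlib import Path

import numpy as np
import pandas as pd
from sklearn.base import clone
from sklearn.metrics import accuracy_score
from sklearn.model_selection import StratifiedKFold


def train_models(models, X, y, n_splits=5, seed=42, output_dir=None):
    """Train every model with K-fold CV; return OOF predictions and scores."""
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    oof = pd.DataFrame(index=X.index)
    scores = {}
    for name, model in models.items():
        preds = np.zeros(len(X), dtype=int)
        for train_idx, valid_idx in skf.split(X, y):
            # Clone so that each fold starts from an unfitted copy of the model.
            fold_model = clone(model)
            fold_model.fit(X.iloc[train_idx], y.iloc[train_idx])
            preds[valid_idx] = fold_model.predict(X.iloc[valid_idx])
        oof[name] = preds
        scores[name] = accuracy_score(y, preds)
    if output_dir is not None:
        out_path = Path(output_dir)
        out_path.mkdir(parents=True, exist_ok=True)
        oof.to_csv(out_path / "oof_predictions.csv", index=False)
    return oof, scores
```

Each row's OOF prediction comes from a model that never saw that row during training, which is what makes these predictions usable as inputs for a later stacking step.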
 
- Scripts to be ignored for now:
  - `train.py`: For training
  - `predict.py`: For prediction
  - `tune.py`: For hyper-parameter tuning
  - `create_folds.py`: To create the folded dataframe