Kaggle Competition : Microsoft-Malware-Detection

Requirements

Python 3.7+
numpy 1.19.1
scikit-learn 0.23.1
category-encoders 2.2.2
pandas 1.0.5*
lightgbm 2.3.1

* Do not use pandas 1.1+ because it is incompatible with category-encoders 2.2.2's CatBoost encoder.

Usage Instructions

Feature Selection

Run the feature_selection.py script to perform feature selection. The result is printed to stdout. As it runs, the script reports each removed feature along with the reason for its removal. At the end of the output, it lists all of the selected features.

Parameters for the script are:

path/to/dataset, the path to the dataset file
encoding, encoding method to be used to encode categorical variables. Available encoding methods are target, js, catboost, and freq*.

* See the program arguments glossary below.

Usage examples:

python3 feature_selection.py data/randombalancedsample10000_train.csv freq
python3 feature_selection.py data/randombalancedsample100000_train.csv catboost > catboost_100k_features.txt
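The two positional arguments above could be validated along the following lines. This is only a sketch: the README does not say how feature_selection.py parses its arguments, so the use of argparse, the parse_args helper, and the argument names dataset and encoding are assumptions made for illustration.

```python
import argparse

# Encoding names listed in this README.
ENCODINGS = ("target", "js", "catboost", "freq")


def parse_args(argv=None):
    """Parse 'path/to/dataset encoding' (hypothetical helper, not the script's own code)."""
    parser = argparse.ArgumentParser(description="Feature selection (illustrative CLI sketch)")
    parser.add_argument("dataset", help="path to the dataset file")
    parser.add_argument("encoding", choices=ENCODINGS, help="categorical encoding method")
    return parser.parse_args(argv)


args = parse_args(["data/randombalancedsample10000_train.csv", "freq"])
print(args.dataset, args.encoding)
```

With choices set, an unsupported encoding name makes argparse exit with a usage message instead of failing later in the run.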

Model Benchmarking

Run the benchmark.py script to benchmark a model. The result is printed to stdout. It shows the model parameters selected by GridSearchCV and the highest score (ROC AUC) from the grid search, followed by the ROC AUC score obtained by applying the model to the test set. If the algorithm is tree-based (cart, rf, adaboost, lgbm), it also shows a list of features sorted by importance.

Parameters for the script are:

path/to/dataset, the path to the dataset file to train the model
algorithm, the algorithm to train the model. Available algorithms are adaboost, bagging, cart, knn, lgbm, logistic, rf, and svm*.
encoding, encoding method to be used to encode categorical variables. Available encoding methods are target, js, catboost, and freq*.

* See the program arguments glossary below.

Usage examples:

python3 benchmark.py data/randombalancedsample10000_train.csv knn freq
python3 benchmark.py data/randombalancedsample100000_train.csv lgbm catboost > lgbm_catboost_100k.txt
python3 benchmark.py data/randombalancedsample10000_train.csv svm js > svm_js_10k.txt
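The benchmark flow described above (grid search scored by ROC AUC, evaluation on a held-out test set, then feature importances for tree-based models) can be sketched with scikit-learn as follows. This is not the code of benchmark.py: the synthetic data, the parameter grid, the 3-fold setting, and the choice of a random forest as the example algorithm are all assumptions made to keep the sketch self-contained.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data (assumption); the real script reads a sampled CSV.
X, y = make_classification(n_samples=400, n_features=8, n_informative=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Hypothetical parameter grid; the grid used by benchmark.py is not documented here.
param_grid = {"n_estimators": [20, 50], "max_depth": [3, None]}
search = GridSearchCV(
    RandomForestClassifier(random_state=0), param_grid, scoring="roc_auc", cv=3
)
search.fit(X_train, y_train)

print("Selected parameters:", search.best_params_)
print("Best cross-validated ROC AUC:", search.best_score_)

# Score the refitted best model on the held-out test set.
test_scores = search.predict_proba(X_test)[:, 1]
test_auc = roc_auc_score(y_test, test_scores)
print("Test ROC AUC:", test_auc)

# Tree-based models expose feature_importances_; list features from most to least important.
importances = search.best_estimator_.feature_importances_
ranked = sorted(enumerate(importances), key=lambda pair: pair[1], reverse=True)
for index, importance in ranked:
    print(f"feature_{index}: {importance:.4f}")
```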

Sampling

The sampling procedure uses 2 scripts, sampler/splitter.py and sampler/random_balanced_sampler.py.

The splitter.py script splits the original dataset into 2 files: one containing only positive examples and one containing only negative examples.

Parameters:

path/to/input/dataset
path/to/output/negative/example/file
path/to/output/positive/example/file

Example:

python3 splitter.py data/train.csv data/train0.csv data/train1.csv
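A split of this kind could look like the sketch below. It is not the code of splitter.py: the split_by_label helper, the label column name HasDetections, the 0/1 label values, and the choice to repeat the header row in both output files are assumptions.

```python
import csv
import os
import tempfile


def split_by_label(input_path, negative_path, positive_path, label_column="HasDetections"):
    """Write rows labelled 0 to negative_path and rows labelled 1 to positive_path.

    Returns (number of negative rows, number of positive rows).
    """
    with open(input_path, newline="") as src, \
            open(negative_path, "w", newline="") as neg, \
            open(positive_path, "w", newline="") as pos:
        reader = csv.DictReader(src)
        writers = {
            "0": csv.DictWriter(neg, fieldnames=reader.fieldnames),
            "1": csv.DictWriter(pos, fieldnames=reader.fieldnames),
        }
        for writer in writers.values():
            writer.writeheader()
        counts = {"0": 0, "1": 0}
        # Stream row by row so the full dataset never has to fit in memory.
        for row in reader:
            label = row[label_column]
            writers[label].writerow(row)
            counts[label] += 1
    return counts["0"], counts["1"]


# Tiny demonstration on a temporary file.
workdir = tempfile.mkdtemp()
train = os.path.join(workdir, "train.csv")
with open(train, "w", newline="") as f:
    f.write("MachineIdentifier,HasDetections\na,0\nb,1\nc,1\n")

n_neg, n_pos = split_by_label(
    train, os.path.join(workdir, "train0.csv"), os.path.join(workdir, "train1.csv")
)
print(n_neg, n_pos)
```

Returning the two row counts is a convenience of this sketch; they are the numbers the sampler expects as arguments.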

The random_balanced_sampler.py script randomly draws examples from the 2 split files and combines them into a single balanced sample file. For example, if the specified sample size is 2S, it takes S random positive examples and S random negative examples and combines them into a single file containing 2S examples.

Note that the script needs the number of examples in each input file for the random sampling to work. We provide a script, sampler/counter.py, that can help in getting these numbers.

Parameters:

path/to/input/negative/example/file
path/to/input/positive/example/file
path/to/output/sample/file
number_of_examples_in_negative_file
number_of_examples_in_positive_file
output_sample_size

Example:

python3 random_balanced_sampler.py train0.csv train1.csv randombalancedsample10000_train.csv 4462591 4458892 10000
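One way such a sampler could use the row counts is to choose the row indices up front and then stream each file once, as sketched below. This is an illustration, not the code of random_balanced_sampler.py: the helper names, the seed parameter, the header row in each input file, and the final shuffle are assumptions.

```python
import os
import random
import tempfile


def pick_rows(path, n_rows, k, rng):
    """Return the header and k random data rows from a file with n_rows data rows."""
    # Knowing n_rows lets us pick the indices before reading the file.
    chosen = set(rng.sample(range(n_rows), k))
    with open(path) as f:
        header = next(f)
        rows = [line for i, line in enumerate(f) if i in chosen]
    return header, rows


def balanced_sample(neg_path, pos_path, out_path, n_neg, n_pos, sample_size, seed=None):
    """Write sample_size // 2 negative and sample_size // 2 positive rows to out_path."""
    rng = random.Random(seed)
    half = sample_size // 2
    header, neg_rows = pick_rows(neg_path, n_neg, half, rng)
    _, pos_rows = pick_rows(pos_path, n_pos, half, rng)
    combined = neg_rows + pos_rows
    rng.shuffle(combined)  # avoid all negatives preceding all positives
    with open(out_path, "w") as out:
        out.write(header)
        out.writelines(combined)
    return len(combined)


# Tiny demonstration: 5 negative rows, 6 positive rows, sample of 4.
workdir = tempfile.mkdtemp()
neg_path = os.path.join(workdir, "train0.csv")
pos_path = os.path.join(workdir, "train1.csv")
out_path = os.path.join(workdir, "sample.csv")
with open(neg_path, "w") as f:
    f.write("id,HasDetections\n" + "".join(f"n{i},0\n" for i in range(5)))
with open(pos_path, "w") as f:
    f.write("id,HasDetections\n" + "".join(f"p{i},1\n" for i in range(6)))

written = balanced_sample(neg_path, pos_path, out_path, 5, 6, 4, seed=0)
print(written)
```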

Helper Modules

The feature selection and model benchmarking scripts require these modules:

attr_map.py: Map of the feature selection results for each encoding method. It is required by the benchmark script and the preparer module to take only the selected features when training the model.
attr_classes.py: Classification of features. It is required by the feature selection script to decide which feature is nominal categorical, continuous, boolean, etc.
preparer.py: Contains a helper function to prepare the dataset for model benchmarking. The preparation includes dropping features and cleansing the values.

Result Files

Feature selection result files are stored in the feature_selection_results directory. The file name format is [encoding]_[sample_size]_features.txt. For example, the result file of the feature selection process using the CatBoost encoder and a sample of 100,000 examples is catboost_100k_features.txt.

Model benchmark result files are stored in the benchmark_results directory. The file name format is [algorithm]_[encoding]_[sample_size].txt. For example, the benchmark result file of a model trained by the random forest algorithm using the James-Stein encoder and a sample of 10,000 examples is rf_js_10k.txt.
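The naming convention above can be expressed as a small helper, sketched here for illustration. The function names are hypothetical and not part of the repository, and the handling of sample sizes that are not whole thousands is an assumption, since the README only shows 10k and 100k.

```python
def sample_size_label(sample_size):
    """Abbreviate whole thousands (100000 -> '100k'); other sizes stay as digits (assumption)."""
    if sample_size > 0 and sample_size % 1000 == 0:
        return f"{sample_size // 1000}k"
    return str(sample_size)


def feature_result_name(encoding, sample_size):
    """Build a name in the [encoding]_[sample_size]_features.txt format."""
    return f"{encoding}_{sample_size_label(sample_size)}_features.txt"


def benchmark_result_name(algorithm, encoding, sample_size):
    """Build a name in the [algorithm]_[encoding]_[sample_size].txt format."""
    return f"{algorithm}_{encoding}_{sample_size_label(sample_size)}.txt"


print(feature_result_name("catboost", 100000))   # catboost_100k_features.txt
print(benchmark_result_name("rf", "js", 10000))  # rf_js_10k.txt
```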

Program Arguments Glossary

target: Base target encoding
catboost: CatBoost encoding
js: James-Stein encoding
freq: Frequency encoding
adaboost: AdaBoost ensemble algorithm
bagging: Bagging ensemble algorithm
cart: scikit-learn implementation of decision tree algorithm
knn: K-nearest neighbour algorithm
lgbm: LightGBM implementation of gradient boosting algorithm
logistic: Logistic regression algorithm
rf: Random forest algorithm
svm: Support vector machine algorithm
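As an illustration of the freq entry in the glossary above, frequency encoding is commonly understood as replacing each category with how often it occurs in the column. The sketch below uses relative frequencies; whether the scripts in this repository use relative frequencies or raw counts is not stated here, so treat that choice, the frequency_encode helper, and the example values as assumptions.

```python
from collections import Counter


def frequency_encode(values):
    """Replace each category with its relative frequency in the column."""
    counts = Counter(values)
    total = len(values)
    return [counts[value] / total for value in values]


# Hypothetical categorical column with three 'win10' entries and one 'win8'.
column = ["win10", "win10", "win8", "win10"]
print(frequency_encode(column))  # [0.75, 0.75, 0.25, 0.75]
```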

Links to Dataset and Samples

Dataset:

https://www.kaggle.com/c/microsoft-malware-prediction/data

jatin7gupta/Microsoft-Malware-Detection
