Link to Kaggle: https://www.kaggle.com/c/microsoft-malware-prediction
The goal is to use CatBoost to predict the probability that a Windows machine will be infected by various families of malware, based on properties of that machine collected through Windows Defender telemetry.
CatBoost is a state-of-the-art open-source library for gradient boosting on decision trees. Developed by Yandex researchers and engineers, it is the successor to the MatrixNet algorithm, which is widely used within the company for ranking, forecasting, and recommendation tasks. It is general-purpose and can be applied to a wide range of problems across many domains.
- Accurate: leads or ties competition on standard benchmarks
- Robust: reduces the need for extensive hyperparameter tuning
- Easy-to-use: offers Python interfaces integrated with scikit-learn, as well as R and command-line interfaces
- Practical: uses categorical features directly and scalably
- Extensible: allows specifying custom loss functions
- Download the dataset
- Clean the dataset
- Perform feature engineering on the dataset
- Encode the dataset
- Fit the model to the training dataset
- Compute accuracy and other evaluation metrics on the held-out validation dataset
Since this is a Kaggle competition, the model's output is evaluated by Kaggle. Kaggle maintains two leaderboards, public and private. For this competition, the public leaderboard is calculated with approximately 37% of the test data, and the private leaderboard is calculated with the remaining approximately 63%.
Private Score: 0.64949
Public Score: 0.65380