Link to Kaggle: https://www.kaggle.com/c/microsoft-malware-prediction
The goal is to help protect more than one billion machines from damage before it happens.
Predict the probability that a Windows machine will be infected by various families of malware, using CatBoost and machine properties derived from Windows Defender telemetry data.
Python, Jupyter, Pandas, NumPy, Dask and CatBoost.
CatBoost is a state-of-the-art open-source gradient boosting on decision trees library. Developed by Yandex researchers and engineers, it is the successor of the MatrixNet algorithm that is widely used within the company for ranking tasks, forecasting and making recommendations. It is universal and can be applied across a wide range of areas and to a variety of problems.
- Accurate: leads or ties competition on standard benchmarks
- Robust: reduces the need for extensive hyperparameter tuning
- Easy-to-use: offers Python interfaces integrated with scikit-learn, as well as R and command-line interfaces
- Practical: uses categorical features directly and scalably
- Extensible: allows specifying custom loss functions
- Download the dataset
- Clean the dataset
- Perform feature engineering on the dataset
- Encode the dataset
- Fit the model to the training dataset
- Compute accuracy and other evaluation metrics on the test (validation) dataset