In this project, I extracted features from raw input data and trained several classification models using the MLlib library. I then compared the models' performance to identify the best one.
This project is inspired by the book Machine Learning with Spark.
- Used PySpark to extract the appropriate features from raw input data.
- Trained a number of classification models using MLlib.
- Made predictions with the trained classification models.
- Applied a number of standard evaluation techniques to assess the predictive performance of the models.
- Explored the impact of parameter tuning on model performance and learned how to use cross-validation to select the best model parameters.
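
The evaluation and cross-validation steps above can be sketched in plain Python, independent of Spark. This is only an illustration of the idea, not the notebook's MLlib code: the threshold "model", the helper names (`accuracy`, `precision`, `recall`, `k_fold_indices`, `cross_validate`), and the toy data are all hypothetical.

```python
import random
from statistics import mean


def confusion_counts(labels, predictions):
    """Return (tp, fp, fn, tn) for binary labels in {0, 1}."""
    pairs = list(zip(labels, predictions))
    tp = sum(1 for y, p in pairs if y == 1 and p == 1)
    fp = sum(1 for y, p in pairs if y == 0 and p == 1)
    fn = sum(1 for y, p in pairs if y == 1 and p == 0)
    tn = sum(1 for y, p in pairs if y == 0 and p == 0)
    return tp, fp, fn, tn


def accuracy(labels, predictions):
    tp, fp, fn, tn = confusion_counts(labels, predictions)
    return (tp + tn) / len(labels)


def precision(labels, predictions):
    tp, fp, _, _ = confusion_counts(labels, predictions)
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(labels, predictions):
    tp, _, fn, _ = confusion_counts(labels, predictions)
    return tp / (tp + fn) if (tp + fn) else 0.0


def k_fold_indices(n, k, seed=0):
    """Shuffle the indices 0..n-1 and split them into k disjoint folds."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def train_threshold_model(features, labels, threshold):
    """Hypothetical toy 'model': predict 1 when the feature >= threshold.

    It ignores the training data; a real model would be fitted to it here.
    """
    return lambda x: 1 if x >= threshold else 0


def cross_validate(features, labels, train_fn, param, k=3):
    """Mean held-out accuracy of train_fn(param) over k folds."""
    scores = []
    for held_out in k_fold_indices(len(labels), k):
        held = set(held_out)
        train_idx = [i for i in range(len(labels)) if i not in held]
        model = train_fn(
            [features[i] for i in train_idx],
            [labels[i] for i in train_idx],
            param,
        )
        preds = [model(features[i]) for i in held_out]
        scores.append(accuracy([labels[i] for i in held_out], preds))
    return mean(scores)


# Toy data (assumed for illustration): one feature, binary label.
xs = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
ys = [0, 0, 0, 0, 1, 1, 1, 1]

# Pick the parameter value with the best cross-validated accuracy.
grid = [0.25, 0.5, 0.75]
best_threshold = max(
    grid, key=lambda t: cross_validate(xs, ys, train_threshold_model, t)
)
```

In Spark, the same pattern (a parameter grid, an evaluator, and k folds) is typically expressed with the library's own tuning utilities rather than hand-written loops.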
The notebook Classification_with_Pyspark.ipynb has a full description of each step of this project.