This is my learning project when taking Machine Learning 0-Master course on Udemy. The goal is to predict the sale price of bulldozer, and this is a time-series related
project. The original project is from Kaggle.
How well can we predict the future sale price of a bulldozer, giving the characterics and the previous examples of how much similar bulldozers have been sold for?
The data is downloaded from the Kaggle Bluebook for Bulldozers competition: https://www.kaggle.com/c/bluebook-for-bulldozers/data
View and download the benchmark code from Github: https://github.com/benhamner/BluebookForBulldozers/tree/master/Benchmark
The data is split into three parts:
-
Train.csv is the training set, which contains data through the end of 2011.
-
Valid.csv is the validation set, which contains data from January 1, 2012 - April 30, 2012 You make predictions on this set throughout the majority of the competition. Your score on this set is used to create the public leaderboard.
-
Test.csv is the test set, which won't be released until the last week of the competition. It contains data from May 1, 2012 - November 2012. Your score on the test set determines your final rank for the competition.
Python version: 3.7 Packages: pandas, numpy, matplotlib.pyplot, sklearn, catboost
I looked at the distributions of the data and values counts for various categorical variables. Below are a few highlights from pivot tables:
After download the data from Kaggle, I cleaned the data so it could be used by our models. This is what I have done to clean the data:
- Parsing Dates: using parse_date parameter in pandas to convert the date column, so I can sort the data by sale date.
- Make a copy of the original data frame, so we can paly around with it.
- Add datetime parameters for saledate column.
- Convert astring into category.
- Filling missing nummerical values.
- Filling and tuning categorical variables into numbers
I used two models to predcit the price: Random Forest Regressor and catboost. In both models, I used RandomSearchCV to find the best paramters for each model to fine tune the models.
The evaluation metric for this competition is the RMSLE (root mean squared log error) between the actual and predicted auction prices.
For more the evaluation of the this project, please check: https://www.kaggle.com/c/bluebook-for-bulldozers/overview/evaluation
Random Forest regressor model performed slightly better than Catboost, but there is no significant difference between the two models that I used in this project.
Random Forest regressor - Valid RMSLE': 0.25346274312057565 (with the best hyperparameters found via RandomSearchCV Catboost: - Valid RMSLE: ': 0.2545226666113938