Using decision trees and random forests to solve a real-world data analysis problem ("sklearn_decision_trees_random_forests").
This project takes a coding-focused approach to using decision trees and random forests
to solve a real-world problem from Kaggle:
QUESTION: The dataset contains about 10 years of daily weather observations from numerous Australian weather stations.
As a data scientist at the Bureau of Meteorology, you are tasked with creating a fully automated system that can use today's weather data for a given location to predict whether it will rain at that location tomorrow.
Perform the following steps to prepare the dataset for training:
- Create a train/test/validation split
- Identify input and target columns
- Identify numeric and categorical columns
- Impute (fill) missing numeric values
- Scale numeric values to the $(0, 1)$ range
- Encode categorical columns to one-hot vectors
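The preparation steps above can be sketched with scikit-learn roughly as follows. This is a minimal illustration, not the project's actual code: the small synthetic DataFrame, its column names (`MinTemp`, `Humidity3pm`, `Location`, `RainToday`, `RainTomorrow`), the 60/20/20 split, and the `prepare` helper are all assumptions made so the example runs without downloading the Kaggle data.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# Hypothetical stand-in for the weather data; column names are illustrative.
rng = np.random.default_rng(42)
n = 200
df = pd.DataFrame({
    "MinTemp": rng.normal(12, 5, n),
    "Humidity3pm": rng.uniform(10, 100, n),
    "Location": rng.choice(["Sydney", "Perth", "Darwin"], n),
    "RainToday": rng.choice(["Yes", "No"], n),
})
df["RainTomorrow"] = np.where(df["Humidity3pm"] > 60, "Yes", "No")
df.loc[rng.choice(n, 20, replace=False), "MinTemp"] = np.nan  # inject missing values

# 1. Train/validation/test split (60/20/20, an assumed ratio).
train_val_df, test_df = train_test_split(df, test_size=0.2, random_state=42)
train_df, val_df = train_test_split(train_val_df, test_size=0.25, random_state=42)

# 2. Input and target columns.
target_col = "RainTomorrow"
input_cols = [c for c in df.columns if c != target_col]

# 3. Numeric and categorical input columns.
numeric_cols = df[input_cols].select_dtypes(include="number").columns.tolist()
categorical_cols = [c for c in input_cols if c not in numeric_cols]

# 4-6. Fit the imputer, scaler, and encoder on the training set only,
# so no information leaks from the validation or test sets.
imputer = SimpleImputer(strategy="mean").fit(train_df[numeric_cols])
scaler = MinMaxScaler().fit(imputer.transform(train_df[numeric_cols]))
encoder = OneHotEncoder(handle_unknown="ignore").fit(train_df[categorical_cols])

def prepare(part):
    """Return (features, targets) for one split (hypothetical helper)."""
    numeric = scaler.transform(imputer.transform(part[numeric_cols]))
    onehot = encoder.transform(part[categorical_cols]).toarray()
    return np.hstack([numeric, onehot]), part[target_col].to_numpy()

X_train, y_train = prepare(train_df)
X_val, y_val = prepare(val_df)
X_test, y_test = prepare(test_df)
print(X_train.shape, X_val.shape, X_test.shape)
```

Because the scaler is fitted on the training split alone, validation and test values can fall slightly outside the $(0, 1)$ range; that is expected and usually harmless for tree-based models.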
In general parlance, a decision tree represents a hierarchical series of binary decisions.
A decision tree in machine learning works the same way, except that we let the computer figure out the optimal structure and hierarchy of decisions, guided by a splitting criterion such as Gini impurity or entropy.
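To make the idea of a learned hierarchy concrete, here is a small sketch that fits a shallow `DecisionTreeClassifier` and prints the decisions it found. The data is synthetic and the "rain" rule, the feature names, and the choice of `criterion="gini"` with `max_depth=2` are assumptions for illustration only.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic data: two made-up weather features.
rng = np.random.default_rng(0)
humidity = rng.uniform(0, 100, 300)
pressure = rng.uniform(990, 1030, 300)
X = np.column_stack([humidity, pressure])

# Assumed ground-truth rule: rain when it is humid AND pressure is low.
y = ((humidity > 70) & (pressure < 1010)).astype(int)

# The criterion tells the algorithm how to score candidate splits;
# max_depth limits how many decisions can be stacked.
tree = DecisionTreeClassifier(criterion="gini", max_depth=2, random_state=42)
tree.fit(X, y)

# Print the learned hierarchy of binary decisions as text.
print(export_text(tree, feature_names=["Humidity3pm", "Pressure3pm"]))
print("training accuracy:", tree.score(X, y))
```

The printed tree should show thresholds close to the ones used to generate the labels, which is the sense in which the computer "figures out" the structure on its own.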
The following topics were covered in this tutorial:
- Downloading a real-world dataset
- Preparing a dataset for training
- Training and interpreting decision trees
- Training and interpreting random forests
- Overfitting, hyperparameter tuning & regularization
- Making predictions on single inputs
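As a rough recap of the last three topics, the sketch below compares an unconstrained tree with a depth-limited one, tries a few `max_depth` values for a random forest, and predicts on a single input. The synthetic data, the candidate depths, and `n_estimators=50` are assumptions chosen to keep the example small; they are not the settings used in the tutorial.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, noisy binary classification problem (illustrative only).
rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 5))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=1.0, size=1000) > 0).astype(int)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.25, random_state=42)

# Overfitting: an unconstrained tree memorizes the training set.
deep_tree = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
# Regularization: limiting depth trades training accuracy for generalization.
shallow_tree = DecisionTreeClassifier(max_depth=3, random_state=42).fit(X_train, y_train)
for name, model in [("deep tree", deep_tree), ("shallow tree", shallow_tree)]:
    print(f"{name}: train={model.score(X_train, y_train):.3f} "
          f"val={model.score(X_val, y_val):.3f}")

# Hyperparameter tuning: keep the forest with the best validation accuracy.
best_depth, best_val, forest = None, -1.0, None
for depth in (2, 4, 8, None):
    candidate = RandomForestClassifier(n_estimators=50, max_depth=depth, random_state=42)
    candidate.fit(X_train, y_train)
    val_acc = candidate.score(X_val, y_val)
    print(f"forest max_depth={depth}: val={val_acc:.3f}")
    if val_acc > best_val:
        best_depth, best_val, forest = depth, val_acc, candidate

# Prediction on a single input: the model expects a 2D array of shape (1, n_features).
new_input = np.array([[0.5, -1.0, 0.0, 0.2, 1.3]])
pred = forest.predict(new_input)[0]
prob = forest.predict_proba(new_input)[0]
print("prediction:", pred, "class probabilities:", prob)
```

In a real pipeline the single input would first pass through the same imputation, scaling, and encoding steps that were fitted on the training data.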
The tutorial also introduced the following terms:
- Decision tree
- Random forest
- Overfitting
- Hyperparameter
- Hyperparameter tuning
- Regularization
- Ensembling
- Generalization
- Bootstrapping
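Bootstrapping and ensembling are easiest to see when done by hand. The sketch below trains several trees on rows sampled with replacement and combines them by majority vote. It is a simplified illustration on synthetic data, not a reimplementation of `RandomForestClassifier` (which, among other differences, averages predicted class probabilities rather than counting hard votes); the number of trees and the data are assumptions.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Synthetic, noisy binary classification data (illustrative only).
rng = np.random.default_rng(7)
X = rng.normal(size=(400, 4))
y = (X[:, 0] - X[:, 2] + rng.normal(scale=0.8, size=400) > 0).astype(int)

n_trees = 25  # odd, so a majority vote can never tie
trees = []
for i in range(n_trees):
    # Bootstrapping: draw len(X) row indices WITH replacement,
    # so each tree sees a slightly different version of the data.
    idx = rng.integers(0, len(X), size=len(X))
    tree = DecisionTreeClassifier(max_features="sqrt", random_state=i)
    trees.append(tree.fit(X[idx], y[idx]))

# Fraction of distinct rows in the last bootstrap sample (typically about 63%).
unique_fraction = len(np.unique(idx)) / len(X)

# Ensembling: stack each tree's predictions and take the majority vote.
votes = np.stack([t.predict(X) for t in trees])  # shape (n_trees, n_samples)
ensemble_pred = (votes.mean(axis=0) > 0.5).astype(int)

print("unique rows in one bootstrap sample:", round(unique_fraction, 2))
print("ensemble accuracy on the data it was built from:", (ensemble_pred == y).mean())
```

Individual trees overfit their own bootstrap sample in different ways, and combining them tends to cancel out some of those errors, which is the intuition behind better generalization.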
Check out the following resources to learn more:
- https://scikit-learn.org/stable/modules/tree.html
- https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html
- https://www.kaggle.com/willkoehrsen/start-here-a-gentle-introduction
- https://www.kaggle.com/willkoehrsen/introduction-to-manual-feature-engineering
- https://www.kaggle.com/willkoehrsen/intro-to-model-tuning-grid-and-random-search
- https://www.kaggle.com/c/home-credit-default-risk/discussion/64821