6.5 Decision trees parameter tuning

Notes

In this lesson, we discuss the parameters used to control a Decision Tree (DT). Two of them, max_depth and min_samples_leaf, matter more than the others. We first tune max_depth on its own and then tune min_samples_leaf for the most promising depths. After that, we create a dataframe with all the tried combinations of max_depth and min_samples_leaf and the AUC score for each. Pivoting this dataframe and plotting it as a heatmap makes it easy to identify the best combination. Finally, the DT is retrained with the chosen combination and displayed as a tree diagram to visualize its decision rules.

Steps

Fine-Tuning Process: iterate to find optimal parameter settings.
- Start by tuning max_depth with various values to determine a subset of optimal depths.
- Then, using this subset, fine-tune the model further by exploring different min_samples_leaf values.
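This two-step search is cheaper than trying every combination, which matters for large datasets. For smaller datasets, where training is fast, it may miss the best setting, and trying all combinations directly can be the better choice.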
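
The two-step search described above can be sketched as follows. This is a minimal illustration, not the code from the lesson: the synthetic data from `make_classification`, the candidate value lists, the `val_auc` helper and the choice to keep the three best depths are all assumptions made for the example.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data stands in for the real dataset (assumption).
X, y = make_classification(
    n_samples=2000, n_features=10, n_informative=5, random_state=1
)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, random_state=1
)


def val_auc(**params):
    """Train a tree with the given parameters and return its validation AUC."""
    dt = DecisionTreeClassifier(random_state=1, **params)
    dt.fit(X_train, y_train)
    y_pred = dt.predict_proba(X_val)[:, 1]
    return roc_auc_score(y_val, y_pred)


# Step 1: tune max_depth alone and keep the best few depths.
depth_candidates = [1, 2, 3, 4, 5, 6, 10, 15, 20]  # hypothetical values
depth_scores = {d: val_auc(max_depth=d) for d in depth_candidates}
best_depths = sorted(depth_scores, key=depth_scores.get, reverse=True)[:3]

# Step 2: explore min_samples_leaf only for the shortlisted depths.
leaf_candidates = [1, 5, 10, 15, 20, 50, 100, 200]  # hypothetical values
scores = [
    (d, s, val_auc(max_depth=d, min_samples_leaf=s))
    for d in best_depths
    for s in leaf_candidates
]

best_depth, best_leaf, best_auc = max(scores, key=lambda row: row[2])
print(f"max_depth={best_depth}, min_samples_leaf={best_leaf}, AUC={best_auc:.3f}")
```

Step 2 trains 3 x 8 = 24 trees instead of the 9 x 8 = 72 a full grid would need, which is where the saving comes from.
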
Heatmaps for Visualization: Store the scores (e.g., AUC) obtained during tuning in a pivot table, and create a heatmap with seaborn to easily identify high score areas, which helps pinpoint the optimal max_depth and min_samples_leaf combination.
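
A possible way to build the pivot table and heatmap is sketched below. The dataset is synthetic, the parameter values are hypothetical, and the names `df_scores` and `df_pivot` are chosen for the example only.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; not needed inside a notebook

import pandas as pd
import seaborn as sns
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data stands in for the real dataset (assumption).
X, y = make_classification(
    n_samples=2000, n_features=10, n_informative=5, random_state=1
)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, random_state=1
)

# Collect one (max_depth, min_samples_leaf, auc) row per trained tree.
rows = []
for depth in [4, 5, 6]:                      # hypothetical shortlist
    for leaf in [1, 5, 10, 20, 50, 100]:     # hypothetical values
        dt = DecisionTreeClassifier(
            max_depth=depth, min_samples_leaf=leaf, random_state=1
        )
        dt.fit(X_train, y_train)
        auc = roc_auc_score(y_val, dt.predict_proba(X_val)[:, 1])
        rows.append((depth, leaf, auc))

df_scores = pd.DataFrame(rows, columns=["max_depth", "min_samples_leaf", "auc"])

# Pivot: one row per min_samples_leaf, one column per max_depth.
df_pivot = df_scores.pivot(
    index="min_samples_leaf", columns="max_depth", values="auc"
)

# Annotated heatmap; the cells with the highest values are the candidates.
ax = sns.heatmap(df_pivot, annot=True, fmt=".3f")
ax.set_title("Validation AUC")
ax.figure.savefig("auc_heatmap.png")
```
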

NB: Choose parameter values that effectively control the tree's size. Avoid settings that leave the tree unrestricted, such as max_depth=None (which shows up as 'nan', Not a Number, in the pivot table), even if they seem to lead to better scores.
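
Once a combination has been chosen, the tree can be retrained and its decision rules inspected. The sketch below uses placeholder values (`max_depth=3`, `min_samples_leaf=15`) and made-up feature names on synthetic data; substitute the combination read off your own heatmap. `export_text` prints the rules as text, and `sklearn.tree.plot_tree` can be used instead to draw a diagram.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic data and generic feature names (assumptions for the example).
X, y = make_classification(
    n_samples=2000, n_features=10, n_informative=5, random_state=1
)
feature_names = [f"feature_{i}" for i in range(X.shape[1])]

# Placeholder parameters, not values recommended by the lesson.
dt = DecisionTreeClassifier(max_depth=3, min_samples_leaf=15, random_state=1)
dt.fit(X, y)

# Text rendering of the learned decision rules.
rules = export_text(dt, feature_names=feature_names)
print(rules)
```
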

Importance of `max_depth` and `min_samples_leaf`

Controlling Overfitting: these parameters play a critical role in preventing overfitting.
- max_depth limits the tree's complexity, preventing it from growing too deep and memorizing the training data.
- min_samples_leaf ensures that leaf nodes have a sufficient number of samples, reducing the chance of creating nodes that are too specific to the training data.
Impact on Bias and Variance: They also affect the model's bias and variance.
- Increasing max_depth and decreasing min_samples_leaf can lead to a more complex model with lower bias but higher variance.
- Decreasing max_depth and increasing min_samples_leaf results in a simpler model with higher bias but lower variance.

It is therefore important to balance max_depth and min_samples_leaf to achieve good model performance. This is a trade-off between bias and variance, and the best values depend on the specific dataset and problem.
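
One way to observe this trade-off is to compare training and validation AUC for a heavily constrained tree and an unconstrained one. The sketch below uses synthetic data and two hypothetical settings labelled "simple" and "complex"; a large gap between training and validation AUC is the usual sign of high variance (overfitting).

```python
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data stands in for the real dataset (assumption).
X, y = make_classification(
    n_samples=2000, n_features=10, n_informative=5, random_state=1
)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, random_state=1
)

# Two illustrative settings (hypothetical values).
settings = {
    "simple": {"max_depth": 2, "min_samples_leaf": 100},     # higher bias
    "complex": {"max_depth": None, "min_samples_leaf": 1},   # higher variance
}

results = {}
for name, params in settings.items():
    dt = DecisionTreeClassifier(random_state=1, **params)
    dt.fit(X_train, y_train)
    auc_train = roc_auc_score(y_train, dt.predict_proba(X_train)[:, 1])
    auc_val = roc_auc_score(y_val, dt.predict_proba(X_val)[:, 1])
    results[name] = {"train": auc_train, "val": auc_val}
    print(f"{name:8s} train AUC={auc_train:.3f}  val AUC={auc_val:.3f}")
```
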

Add notes from the video (PRs are welcome)

⚠️	The notes are written by the community. If you see an error here, please create a PR with a fix.

Notes from Peter Ernicke

Navigation

Machine Learning Zoomcamp course
Session 6: Decision Trees and Ensemble Learning
Previous: Decision tree learning algorithm
Next: Ensemble learning and random forest
