Gini Importance, also known as Mean Decrease in Impurity (MDI), is a measure of feature importance in decision tree-based models such as Random Forests. It is based on the reduction in Gini Impurity, a metric that quantifies how mixed the class labels are at a node in a decision tree.
Gini Impurity is a measure of the likelihood of incorrect classification at a decision node if a data point were randomly classified based on the distribution of labels in the node.
The formula for Gini Impurity at a node is:
\[ G = 1 - \sum_{i=1}^{C} p_i^2 \]
where:
- \( p_i \) is the proportion of samples belonging to class \( i \) at the node.
- \( C \) is the total number of classes.
Two reference points:
- \( G = 0 \): Perfect purity, all samples belong to one class.
- \( G = 0.5 \): Maximum impurity for a binary classification problem with an equal split between two classes.
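As a minimal sketch of this definition (one minus the sum of squared class proportions), the helper below computes Gini Impurity from a list of labels. The function name `gini_impurity` is an illustrative choice and not part of any library.

```python
from collections import Counter

def gini_impurity(labels):
    """Return 1 - sum(p_i^2) over the classes present in `labels`."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

print(gini_impurity(["a", "a", "a", "a"]))  # pure node -> 0.0
print(gini_impurity(["a", "a", "b", "b"]))  # even binary split -> 0.5
```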
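- Split Evaluation: At each split in a decision tree, the Gini Impurity of the parent node and of the resulting child nodes is calculated.
- Reduction in Gini Impurity: The reduction in Gini Impurity caused by a split is the parent's impurity minus the sample-weighted impurity of its children:
\[ \Delta G = G_{\text{parent}} - \left( \frac{N_{\text{left}}}{N_{\text{parent}}} G_{\text{left}} + \frac{N_{\text{right}}}{N_{\text{parent}}} G_{\text{right}} \right) \]
  where \( N \) is the number of samples reaching the corresponding node.
- Feature Importance Aggregation: The Gini reductions caused by splits using a specific feature are summed across all the trees in the ensemble (e.g., a Random Forest).
- The sum of reductions for each feature is normalized to provide a relative importance score, typically so that the scores of all features sum to 1.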
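The reduction for a single binary split can be worked through by hand. The sketch below assumes the usual sample-weighted form (parent impurity minus the weighted average of the child impurities); the class counts are made up for illustration, and `gini` and `gini_reduction` are hypothetical helper names.

```python
def gini(counts):
    """Gini impurity from a list of per-class sample counts."""
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts) if n else 0.0

def gini_reduction(parent, left, right):
    """Weighted impurity decrease of a split; arguments are per-class counts."""
    n, n_l, n_r = sum(parent), sum(left), sum(right)
    return gini(parent) - (n_l / n) * gini(left) - (n_r / n) * gini(right)

# Hypothetical node: 10 samples, 5 per class, split into [4, 1] and [1, 4].
reduction = gini_reduction([5, 5], [4, 1], [1, 4])
print(round(reduction, 3))  # 0.5 - 0.32 -> 0.18
```

A split that leaves both children as mixed as the parent yields a reduction of 0, so it contributes nothing to the feature's score.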
- Feature Dependence: The Gini Importance score reflects how often a feature is used in splits and how much it reduces impurity.
- Higher Importance: Features that cause significant impurity reduction or are frequently used for splitting get higher importance scores.
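Both ingredients, how often a feature is split on and how much each split reduces impurity, can be read off a fitted Scikit-learn tree. The sketch below walks the `tree_` arrays of a single `DecisionTreeClassifier` on Iris, accumulates the sample-weighted impurity decrease per feature, and compares the normalized result with `feature_importances_`. It is an illustrative reconstruction rather than the library's own code, and variable names such as `split_counts` are assumptions.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
clf = DecisionTreeClassifier(criterion="gini", random_state=0).fit(X, y)
t = clf.tree_
n = t.weighted_n_node_samples

importance = np.zeros(X.shape[1])
split_counts = np.zeros(X.shape[1], dtype=int)
for node in range(t.node_count):
    left, right = t.children_left[node], t.children_right[node]
    if left == -1:  # leaf nodes have no children and no split
        continue
    f = t.feature[node]
    # Sample-weighted impurity decrease produced by this split
    decrease = (n[node] * t.impurity[node]
                - n[left] * t.impurity[left]
                - n[right] * t.impurity[right])
    importance[f] += decrease
    split_counts[f] += 1

importance /= importance.sum()  # normalize to relative scores
print("splits per feature:", split_counts)
print("matches feature_importances_:",
      np.allclose(importance, clf.feature_importances_))
```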
- Efficient Computation: Gini Importance is computed during model training, making it computationally efficient.
- The scores provide a straightforward way to rank feature importance.
- Global Perspective: Gini Importance reflects the overall contribution of a feature across the entire model.
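To make the model-wide view concrete, the sketch below checks how a Random Forest's score relates to its individual trees. It assumes Scikit-learn's behavior of averaging the per-tree normalized importances and renormalizing, which may differ in detail from a plain sum of raw reductions; the spread across trees is printed as a rough indication of how stable each score is.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
rf = RandomForestClassifier(n_estimators=100, random_state=42).fit(X, y)

# One row of importances per tree in the forest
per_tree = np.array([tree.feature_importances_ for tree in rf.estimators_])
averaged = per_tree.mean(axis=0)
averaged /= averaged.sum()

print("forest score equals per-tree average:",
      np.allclose(averaged, rf.feature_importances_))
print("std across trees:", per_tree.std(axis=0))
```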
- Bias Toward High-Cardinality Features: Features with many unique values (e.g., ID numbers) can get artificially high importance scores because they create many small, pure splits.
- Correlation Effect: If two features are highly correlated, their importance scores may be shared or skewed, making it harder to distinguish their true contribution.
- Gini Importance is specific to tree-based models and does not generalize to other model types.
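The high-cardinality bias can be reproduced on synthetic data. In the sketch below, which uses made-up data and hypothetical names (`signal`, `noise_id`), the label depends only on a binary feature plus 30% label noise, while `noise_id` is a unique random value per row with no relation to the label. Fully grown trees can still use `noise_id` to carve out small, pure leaves that memorize the noisy labels, so it tends to receive a large share of the importance.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n = 500
signal = rng.integers(0, 2, size=n)      # binary feature that drives the label
flip = rng.random(n) < 0.3               # 30% label noise
y = np.where(flip, 1 - signal, signal)
noise_id = rng.random(n)                 # unique per row, unrelated to y

X = np.column_stack([signal, noise_id])
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

signal_imp, noise_imp = rf.feature_importances_
print(f"signal: {signal_imp:.2f}, noise_id: {noise_imp:.2f}")
```

Limiting tree depth or evaluating importance on held-out data are common ways to reduce this effect, though neither is shown here.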
Here’s an example code snippet to compute Gini Importance (Mean Decrease in Impurity) using a Random Forest classifier from Scikit-learn. This method is based on the reduction in Gini impurity contributed by each feature in the decision-making process of the trees in the forest.
# Import necessary libraries
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_iris
import matplotlib.pyplot as plt
# Load a sample dataset (Iris dataset)
data = load_iris()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = data.target
# Train a Random Forest Classifier
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X, y)
# Extract feature importances (Gini Importance)
feature_importances = rf.feature_importances_
# Create a DataFrame for better visualization
importance_df = pd.DataFrame({
'Feature': X.columns,
'Importance': feature_importances
}).sort_values(by='Importance', ascending=False)
print("Feature Importances (Gini Importance):")
print(importance_df)
# Plot feature importances
plt.figure(figsize=(8, 6))
plt.barh(importance_df['Feature'], importance_df['Importance'], color='skyblue')
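plt.xlabel('Importance')
plt.title('Feature Importance using Gini (Mean Decrease in Impurity)')
plt.gca().invert_yaxis()  # show the most important feature at the top
plt.tight_layout()
plt.show()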
- Dataset: The Iris dataset is used here as an example. You can replace it with your dataset.
- Random Forest: Trains a Random Forest classifier with 100 trees.
- Feature Importances: The `feature_importances_` attribute of the trained Random Forest provides the Gini Importance for each feature.
- Visualization: A horizontal bar chart shows the relative importance of each feature.