This repository contains the implementation and performance evaluation of various tree-based machine learning models, including Decision Trees (DT), Random Forests (RF), and Permutation Decision Forests (PDF). Additionally, we used a novel impurity measure, Effort-To-Compress (ETC), and empirically evaluate its performance compared to traditional impurity measures like Shannon Entropy and Gini Index.
├── load.py # Script for loading all datasets
├── classical_decision_tree_using_entropy.ipynb # DT using Shannon Entropy
├── classical_decision_tree_using_gini.ipynb # DT using Gini Index
├── random_forest_entropy.ipynb # RF using Shannon Entropy
├── random_forest_gini.ipynb # RF using Gini Index
├── permutated_decision_tree_and_forest.ipynb # PDF and PDT Implemenataion using PDT
├── barplots.ipynb # Script for generating barplots for results
└── README.md # Project documentation
The following datasets were analyzed:
- Appendicitis
- Breast Cancer Wisconsin
- Diabetes Pima Indian
- Ionosphere
- Iris
- Sonar
- Wine
Each dataset was preprocessed and split into training and testing sets using a random seed (42) and time series split (validation size = 15).
The following parameters were optimized:
- Depth: 2 to 20
- Impurity Measures: Shannon Entropy and Gini Index
- n_estimators: [10, 50, 100, 150, 1000]
- Depth: 2 to 20
- Impurity Measures: Shannon Entropy and Gini Index
-
Permutation of Data:
- 21 unique seed values for datasets: Ionosphere, Iris, Wine, Diabetes, Appendicitis
- 100 and 200 unique seed values for datasets: Breast Cancer Wisconsin and Sonar
-
Depth Tuning:
- PDT depths range from 2 to 20.
- For each permutation, the ideal depth that produces the highest macro F1-score is determined using time series split (five splits, validation size = 15).
-
Testing Phase:
- PDT models are retrained on the shuffled data with optimal depth and seed.
- For PDF, the test data is passed through multiple PDTs, and a voting scheme is employed to predict the final label.
The models were evaluated based on the following metrics:
- Accuracy
- Macro F1-Score
- Precision
- Recall
- Permutation Decision Tree (PDT) outperformed classical Decision Trees on average.
- Permutation Decision Forest (PDF) achieved performance close to Random Forests with significantly fewer estimators:
- PDF: 21–200 estimators
- RF: 50–1000 estimators
- ETC impurity measure behaved comparably to Shannon Entropy in empirical evaluations.
Results are visualized using barplots to compare Macro F1-Scores across all models and datasets. Execute barplots.ipynb to generate the plots.
Ensure you have the required libraries installed:
pip install numpy pandas scikit-learn matplotlib-
Clone the repository:
git clone https://github.com/arham-jain12/Permutation-Decision-Tree.git cd Permutation-Decision-Tree -
Load the datasets by running
load.py:python load.py
-
Run the specific Jupyter Notebooks for model implementation:
- Decision Trees (DT):
classical_decision_tree_using_entropy.ipynbclassical_decision_tree_using_gini.ipynb
- Random Forests (RF):
random_forest_entropy.ipynbrandom_forest_gini.ipynb
- Permutation Decision Forests:
permutated_decision_tree_and_forest.ipynb
- Decision Trees (DT):
-
Generate the barplots using
barplots.ipynb.
- Investigate the relationship between ETC and Kolmogorov complexity.
- Test PDT and PDF on non-iid datasets to evaluate their robustness further.
Contributions are welcome! If you encounter any issues or have suggestions for improvements, feel free to open an issue or submit a pull request.
This project is licensed under the MIT License.