REFERENCES:
- "Induction of Decision Trees" (J.R. Quinlan) - research paper
- Scikit-learn decision tree documentation
- Breast Cancer dataset - UCI Machine Learning Repository
- Medium article on modeling decision trees
- Hands-On Machine Learning - chapters on decision trees and random forests
- Medium article on Seaborn pair plots
Advantages
- Decision trees have logarithmic prediction cost (in the number of training samples), so they are not very computationally intensive, especially on high-feature datasets in comparison with other models.
- White-box model, meaning that we can actually understand and explain how it arrives at a prediction
- Minimal data preparation needed
Disadvantages
- Prone to overfitting; we could prune to see if that helps
- Slightly unstable; small changes in the data can produce a different tree, and there might be other trees that represent the true population better
- Works best on a balanced dataset without "inadequate" attributes, meaning we can't have several examples where the same set of attribute values is indicative of different classes
This is a dataset with 10 features derived from images of breast masses; they describe characteristics of the cell nuclei.
- Each object in the dataset has a class value of 2 or 4: 2 for benign masses and 4 for malignant ones.
- The goal is to correctly classify an object based on its attributes.
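For reference, loading the raw UCI file looks roughly like this. The column names below are labels I'm using for illustration (the raw file has no header row), and it assumes missing values appear as "?":

```python
import pandas as pd

# Illustrative column names for the Breast Cancer Wisconsin (Original) file.
columns = [
    "id", "clump_thickness", "size_uniformity", "shape_uniformity",
    "marginal_adhesion", "epithelial_size", "bare_nuclei",
    "bland_chromatin", "normal_nucleoli", "mitoses", "class",
]

# The raw file has no header row; "?" marks missing values.
df = pd.read_csv("breast-cancer-wisconsin.data", header=None,
                 names=columns, na_values="?")

print(df["class"].value_counts())  # 2 = benign, 4 = malignant
```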
- We create a custom transformer class using BaseEstimator so it can sit in a scikit-learn pipeline that cleans the data, primarily converting certain values to numeric and dealing with the nulls
- The class does not need any initialization code, and because it is a stateless transformer, we don't need to add any code in the fit function
- We also add a MinMaxScaler to the pipeline, both so it can be reused for other models we might build and to deal with the diagnosis column (which currently holds 2 or 4): after scaling, malignant masses have a value of 1 and benign masses a value of 0
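A minimal sketch of what that pipeline could look like. The class name, the TransformerMixin mixin, and the median fill are my own illustrative choices, not necessarily the exact implementation:

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

class CleanBreastCancerData(BaseEstimator, TransformerMixin):
    """Coerce every column to numeric and fill the nulls."""

    def fit(self, X, y=None):
        # Stateless transformer: nothing to learn, so fit just returns self.
        return self

    def transform(self, X):
        X = X.copy()
        # Force every column to numeric; unparseable values (e.g. "?") become NaN.
        X = X.apply(pd.to_numeric, errors="coerce")
        # Fill remaining nulls with each column's median (one simple choice).
        return X.fillna(X.median())

# MinMaxScaler rescales every column to [0, 1]; as a side effect the class
# column (2 = benign, 4 = malignant) becomes 0 for benign and 1 for malignant.
pipeline = Pipeline([
    ("clean", CleanBreastCancerData()),
    ("scale", MinMaxScaler()),
])
```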
A pair plot shows the relationships between the size-related attributes. I thought this was important because, from my personal knowledge, size is a common indicator of malignant tumors.
Below is the distribution of (some of!) the features, split by benign and malignant. As you can see, malignant tumors are generally associated with larger values for the given attributes.
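Something along these lines produces the pair plot, assuming the illustrative column names from the loading sketch above:

```python
import seaborn as sns
import matplotlib.pyplot as plt

# Hypothetical names for the size-related attributes plus the class label.
size_cols = ["clump_thickness", "size_uniformity", "shape_uniformity", "class"]

# hue colors the points by diagnosis so the benign/malignant split is visible.
sns.pairplot(df[size_cols], hue="class")
plt.show()
```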
Another thing I checked was the "adequacy" of the data. Based on the research paper "Induction of Decision Trees" by J.R. Quinlan, it is ideal for decision trees to be used on datasets where the same set of attributes for an object always results in the same classification. We check this in the code, finding that although there are some duplicates (rows where all attributes are the same), there are no instances where the attributes are the same but the classification is different.
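One way to run that adequacy check with pandas (again reusing the column names assumed in the loading sketch):

```python
# Group rows by the full set of attribute values; if any group contains more
# than one distinct class label, the attributes are "inadequate".
feature_cols = [c for c in df.columns if c not in ("id", "class")]

conflicts = (
    df.groupby(feature_cols, dropna=False)["class"]
      .nunique()
      .loc[lambda n: n > 1]
)
print("duplicate attribute rows:", df.duplicated(subset=feature_cols).sum())
print("same attributes but different class:", len(conflicts))
```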
The model we use is a classification decision tree from scikit-learn, which uses the CART algorithm.
- Meaning that these are binary trees, where each node has at most two children.
- If we built this with an ID3-based model, we might see different results
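A minimal sketch of the baseline fit, reusing df and feature_cols from the sketches above (the depth-9 and 94.16% figures above came from a specific run and split, so this won't reproduce them exactly):

```python
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Minimal inline cleaning for the sketch: fill nulls, remap 2/4 to 0/1.
X = df[feature_cols].fillna(df[feature_cols].median())
y = df["class"].map({2: 0, 4: 1})   # 0 = benign, 1 = malignant

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

tree = DecisionTreeClassifier()      # scikit-learn's CART implementation
tree.fit(X_train, y_train)

print("depth:", tree.get_depth())
print("test accuracy:", tree.score(X_test, y_test))
```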
Running it initially, we find a depth of 9 and an accuracy of 94.16%. We improve on this by customizing some of the default hyper-parameters in scikit-learn. First we increase the values of min_samples_leaf and min_samples_split, which raises the accuracy to 95.62%. These simply set minimum requirements for when a node may be split and when a leaf may be created (see the sketch below).
Doing this is a form of regularization that reduces overfitting of the model when running it against the test set. In general, increasing the min_* hyperparameters and decreasing the max_* hyperparameters will regularize the model.
Note that these values might change between runs because there is some randomness in how trees are constructed, leading to slightly different results.
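For illustration, the regularized tree might be built like this; the specific min_samples_split and min_samples_leaf values here are placeholders, not necessarily the ones that produced 95.62%:

```python
from sklearn.tree import DecisionTreeClassifier

# Placeholder hyperparameter values for illustration.
reg_tree = DecisionTreeClassifier(min_samples_split=10, min_samples_leaf=5)
reg_tree.fit(X_train, y_train)

print("regularized test accuracy:", reg_tree.score(X_test, y_test))
```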
Finally, we see that the most important feature seems to be size uniformity.
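The importances can be read directly off the fitted tree:

```python
import pandas as pd

# feature_importances_ sums to 1; higher values mean the feature contributed
# more to the impurity reduction across the tree's splits.
importances = pd.Series(reg_tree.feature_importances_, index=feature_cols)
print(importances.sort_values(ascending=False))
```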
We further tune the hyperparameters by optimizing max_depth. We test depths from 4 to 9, since the tree reached a depth of 9 when no limit was set. This is another way of preventing overfitting.
Because results differ from run to run, we train the models 100 times and take the average to get a more realistic accuracy for each hyperparameter set (see the sketch below).
We find that the optimal max_depth is 6, by a small margin! By regularizing and tuning, we were able to improve the overall model accuracy to nearly 95%.
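A sketch of that sweep, assuming the run-to-run variation comes from re-splitting the data each time (the actual notebook may introduce the randomness differently):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

results = {}
for depth in range(4, 10):                 # depths 4 through 9
    scores = []
    for _ in range(100):                   # average over 100 runs per depth
        X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2)
        model = DecisionTreeClassifier(max_depth=depth,
                                       min_samples_split=10,
                                       min_samples_leaf=5)
        model.fit(X_tr, y_tr)
        scores.append(model.score(X_te, y_te))
    results[depth] = np.mean(scores)

best_depth = max(results, key=results.get)
print(results, "best max_depth:", best_depth)
```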
We built a random forest from scratch with a few hyperparameters we can use to adjust how the forest is constructed. The main ones to note are feature_split and max_samples (a sketch of this kind of forest follows the list below).
- max_samples controls how many samples are drawn in the bootstrapping step for each tree in the forest
- feature_split defines how many features each tree is randomly assigned; depending on the parameter we use the square root of the feature count, the log, or all of the features.
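The from-scratch implementation itself isn't reproduced here, so below is a minimal sketch of how a forest with these two hyperparameters could be structured. SimpleRandomForest and its attribute names are illustrative, and the majority vote assumes 0/1 labels as produced earlier:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

class SimpleRandomForest:
    """Minimal sketch: bootstrapped rows plus a random feature subset per tree,
    with a majority vote at prediction time. Not the author's exact code."""

    def __init__(self, n_trees=128, max_samples=1.0, feature_split="sqrt"):
        self.n_trees = n_trees
        self.max_samples = max_samples      # fraction of rows per bootstrap
        self.feature_split = feature_split  # "sqrt", "log", or "all"

    def _n_features(self, total):
        if self.feature_split == "sqrt":
            return max(1, int(np.sqrt(total)))
        if self.feature_split == "log":
            return max(1, int(np.log2(total)))
        return total                        # "all"

    def fit(self, X, y):
        X, y = np.asarray(X), np.asarray(y)
        n_rows, n_cols = X.shape
        n_sample = int(self.max_samples * n_rows)
        k = self._n_features(n_cols)
        self.trees_, self.features_ = [], []
        for _ in range(self.n_trees):
            rows = np.random.choice(n_rows, size=n_sample, replace=True)  # bootstrap
            cols = np.random.choice(n_cols, size=k, replace=False)        # feature subset
            tree = DecisionTreeClassifier()
            tree.fit(X[np.ix_(rows, cols)], y[rows])
            self.trees_.append(tree)
            self.features_.append(cols)
        return self

    def predict(self, X):
        X = np.asarray(X)
        # Cache every tree's predictions so they can be inspected later.
        self.tree_predictions_ = np.stack(
            [t.predict(X[:, cols]) for t, cols in zip(self.trees_, self.features_)])
        # Majority vote across trees for each sample (assumes 0/1 labels).
        return (self.tree_predictions_.mean(axis=0) >= 0.5).astype(int)
```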
By building a random forest with 128 trees, we actually see an increase in accuracy on the test data from 95% to nearly 98%!
We can also access items cached by the forest, including each individual tree's predictions for every sample.
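Using the sketch above, accessing those cached per-tree predictions might look like this (the real implementation likely organizes its cache differently):

```python
forest = SimpleRandomForest(n_trees=128, max_samples=0.8, feature_split="sqrt")
forest.fit(X_train, y_train)

preds = forest.predict(X_test)
print("forest test accuracy:", (preds == np.asarray(y_test)).mean())

# Per-tree predictions cached during predict(): shape (n_trees, n_samples).
print(forest.tree_predictions_.shape)
print(forest.tree_predictions_[:, 0])  # every tree's vote for the first test sample
```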
Thanks for reading :D