We have built a convolutional neural network (CNN) that analyzes images of skin lesions and categorizes them into one of seven classes, three of which are cancerous and four of which are benign. We have also developed a web page, currently hosted on GitHub Pages, and plan to embed a web app with our model.
Our dataset is from Kaggle and can be accessed at the link below. It contains 10,015 images of skin lesions across the seven classes detailed below.
Demographics:
Class Definitions:
https://www.kaggle.com/datasets/farjanakabirsamanta/skin-cancer-dataset
We started by benchmarking three CNN architectures detailed in Aurélien Géron's book "Hands-On Machine Learning with Scikit-Learn, Keras & TensorFlow" (2022): InceptionV3, ResNet50, and VGG16. For each architecture, we tested three different optimizers: Adam, RMSprop, and SGD. In total, nine models were benchmarked, and from those, we chose InceptionV3 with the Adam optimizer as our primary model.
- For benchmarking metrics, see "run1/visualizations/"
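
The nine benchmarked combinations can be enumerated as a simple grid. This is only a sketch of the bookkeeping, not the training code from the repo; the helper name `benchmark_grid` is hypothetical, and the `Architecture.Optimizer` label format follows the naming used in this README.

```python
from itertools import product

ARCHITECTURES = ["InceptionV3", "ResNet50", "VGG16"]
OPTIMIZERS = ["Adam", "RMSprop", "SGD"]


def benchmark_grid(architectures=ARCHITECTURES, optimizers=OPTIMIZERS):
    """Return one 'Architecture.Optimizer' label per benchmarked combination."""
    return [f"{arch}.{opt}" for arch, opt in product(architectures, optimizers)]


for label in benchmark_grid():
    print(label)  # one line per model, starting with InceptionV3.Adam
```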
Although InceptionV3.Adam slightly underperformed ResNet50.Adam in the classification reports, it was chosen for its superior performance on AUC metrics (see run1/visualizations/roc_curve/roc_curve_InceptionV3_Adam.png).
Once we chose our primary model, we continued to fine-tune it to maximize our AUC, precision, and recall scores, prioritizing recall on our three cancerous classes. In the precision/recall trade-off, favoring recall reduces false negatives, and in a cancer identification model such as this one, a false negative in a cancerous class is the most detrimental outcome and should be minimized to the extent possible. Please note the 'Cancer Catcher' model in run4, which reached our highest recall for melanoma at 0.70.
Our fine-tuning steps, along with their corresponding run folders in our repo, are detailed below.
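
To make the precision/recall trade-off concrete, the sketch below computes both metrics from raw counts. The counts are illustrative only (100 truly cancerous lesions under two hypothetical decision policies) and are not taken from any run in this repo.

```python
def precision_recall(tp, fp, fn):
    """Precision = TP / (TP + FP); recall = TP / (TP + FN)."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical "cautious" model: flags few lesions, misses 60 of 100 cancers.
cautious = precision_recall(tp=40, fp=60, fn=60)

# Hypothetical "eager" model: flags many lesions, misses only 30 of 100 cancers.
eager = precision_recall(tp=70, fp=400, fn=30)

print("cautious: precision=%.2f recall=%.2f" % cautious)
print("eager:    precision=%.2f recall=%.2f" % eager)
```

The eager model pays for its higher recall with many more false positives, which is the cost we accept in order to miss fewer cancerous lesions.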
- Running InceptionV3.Adam at 150 epochs (run3; v6)
  - Removed image augmentation. The original benchmarking involved preliminary image augmentation: rotation_range=20, width_shift_range=0.2, height_shift_range=0.2, horizontal_flip=True
  - Results from testing do not show improvement with higher epochs.

  | class | precision | recall | f1-score | support |
  |---|---|---|---|---|
  | akiec | 0.02 | 0.05 | 0.03 | 65 |
  | bcc | 0.05 | 0.09 | 0.07 | 103 |
  | bkl | 0.08 | 0.07 | 0.08 | 220 |
  | df | 0.00 | 0.00 | 0.00 | 23 |
  | mel | 0.11 | 0.17 | 0.13 | 223 |
  | nv | 0.68 | 0.50 | 0.57 | 1341 |
  | vasc | 0.01 | 0.04 | 0.02 | 28 |
  | accuracy | | | 0.37 | 2003 |
  | macro avg | 0.14 | 0.13 | 0.13 | 2003 |
  | weighted avg | 0.48 | 0.37 | 0.41 | 2003 |

  - Resolution: test at lower epochs; weights need to be adjusted
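
The gap between the macro and weighted averages in these reports comes from the class imbalance: the macro average treats every class equally, while the weighted average is dominated by 'nv'. A minimal sketch that recomputes both precision averages from the run3 report above (per-class values are the rounded figures from the report, so the match is approximate):

```python
# class: (precision, support), copied from the run3 classification report
RUN3_REPORT = {
    "akiec": (0.02, 65),
    "bcc": (0.05, 103),
    "bkl": (0.08, 220),
    "df": (0.00, 23),
    "mel": (0.11, 223),
    "nv": (0.68, 1341),
    "vasc": (0.01, 28),
}


def macro_average(report):
    """Unweighted mean of the per-class scores."""
    return sum(score for score, _ in report.values()) / len(report)


def weighted_average(report):
    """Mean of the per-class scores weighted by class support."""
    total = sum(support for _, support in report.values())
    return sum(score * support for score, support in report.values()) / total


print(round(macro_average(RUN3_REPORT), 2))     # 0.14
print(round(weighted_average(RUN3_REPORT), 2))  # 0.48
```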
- Weighting scheme testing 1 - The Cancer Catcher (run4; v7)
  - All testing to this point utilized TensorFlow's 'balanced' weighting system to account for the large imbalance in classes.
  - Tested effectiveness of different weights
  - Increased weighting for underrepresented classes by a factor of 4x
  - 4x on 'bcc' and 'akiec', 20x on 'mel'
  - Note the recall for 'mel' at 0.70

  | class | precision | recall | f1-score | support |
  |---|---|---|---|---|
  | akiec | 0.04 | 0.06 | 0.05 | 65 |
  | bcc | 0.06 | 0.07 | 0.06 | 103 |
  | bkl | 0.00 | 0.00 | 0.00 | 220 |
  | df | 0.00 | 0.00 | 0.00 | 23 |
  | mel | 0.11 | 0.70 | 0.19 | 223 |
  | nv | 0.63 | 0.12 | 0.20 | 1341 |
  | vasc | 0.02 | 0.04 | 0.03 | 28 |
  | accuracy | | | 0.16 | 2003 |
  | macro avg | 0.12 | 0.14 | 0.08 | 2003 |
  | weighted avg | 0.44 | 0.16 | 0.16 | 2003 |
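
A rough sketch of how such a weighting scheme could be computed. Two assumptions are made here that the README does not confirm: that 'balanced' means the common heuristic `n_samples / (n_classes * class_count)`, and that the 4x and 20x factors are applied on top of those balanced weights. The counts are the test-set supports from the report, used only as stand-ins for the training counts.

```python
# Stand-in class counts (test-set supports); the real training counts differ.
COUNTS = {"akiec": 65, "bcc": 103, "bkl": 220, "df": 23,
          "mel": 223, "nv": 1341, "vasc": 28}

# Extra emphasis on the cancerous classes, as listed for run4.
MULTIPLIERS = {"akiec": 4.0, "bcc": 4.0, "mel": 20.0}


def balanced_weights(counts):
    """Assumed 'balanced' heuristic: n_samples / (n_classes * class_count)."""
    n_samples = sum(counts.values())
    n_classes = len(counts)
    return {c: n_samples / (n_classes * n) for c, n in counts.items()}


def apply_multipliers(weights, multipliers):
    """Scale selected class weights; classes without a multiplier are unchanged."""
    return {c: w * multipliers.get(c, 1.0) for c, w in weights.items()}


weights = apply_multipliers(balanced_weights(COUNTS), MULTIPLIERS)
for name, weight in sorted(weights.items()):
    print(f"{name}: {weight:.2f}")
```

In Keras, a dictionary like this would typically be re-keyed by integer class index before being passed as `class_weight` to `Model.fit`.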
- Binary classification testing (run5; v8)
  - Testing conducted at the same time as weighting scheme testing
  - All testing to this point involved a multiclass classifier.
  - Tested the effectiveness of a binary classifier as opposed to a multiclass classifier.
  - Results unremarkable

  | class | precision | recall | f1-score | support |
  |---|---|---|---|---|
  | benign | 0.79 | 0.58 | 0.67 | 1612 |
  | cancerous | 0.17 | 0.37 | 0.24 | 391 |
  | accuracy | | | 0.54 | 2003 |
  | macro avg | 0.48 | 0.47 | 0.45 | 2003 |
  | weighted avg | 0.67 | 0.54 | 0.59 | 2003 |
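
The binary labels can be derived by collapsing the seven classes into two groups. The sketch below assumes the cancerous group is 'akiec', 'bcc', and 'mel'; this is inferred from the run4 weighting and is consistent with the supports in the binary report (65 + 103 + 223 = 391), but the mapping used in run5 is not stated explicitly.

```python
from collections import Counter

# Assumed cancerous classes (inferred, see lead-in above).
CANCEROUS = {"akiec", "bcc", "mel"}

# Per-class test supports from the multiclass reports.
SUPPORT = {"akiec": 65, "bcc": 103, "bkl": 220, "df": 23,
           "mel": 223, "nv": 1341, "vasc": 28}


def to_binary(label):
    """Map a seven-class label to 'cancerous' or 'benign'."""
    return "cancerous" if label in CANCEROUS else "benign"


binary_support = Counter()
for label, count in SUPPORT.items():
    binary_support[to_binary(label)] += count

print(dict(binary_support))  # benign: 1612, cancerous: 391
```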
- Inverse proportional weighting (run8)
  - Weighted classes based on the inverse of their frequency
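
One way to express inverse-frequency weighting is sketched below. Normalizing so that the most frequent class gets a weight of 1.0 is an assumption made for readability, not necessarily what run8 did, and the counts are again the test-set supports used as stand-ins.

```python
# Stand-in class counts (test-set supports); the real training counts differ.
COUNTS = {"akiec": 65, "bcc": 103, "bkl": 220, "df": 23,
          "mel": 223, "nv": 1341, "vasc": 28}


def inverse_frequency_weights(counts):
    """Weight each class by 1 / frequency, scaled so the largest class is 1.0."""
    largest = max(counts.values())
    return {c: largest / n for c, n in counts.items()}


inv_weights = inverse_frequency_weights(COUNTS)
for name, weight in sorted(inv_weights.items(), key=lambda item: item[1]):
    print(f"{name}: {weight:.1f}")
```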
- Class balanced loss approach weighting (run9)
  - Attempted to implement class-balanced loss weighting; the model performed poorly
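
If "class balanced loss" here refers to the effective-number-of-samples weighting of Cui et al. (2019), which is an assumption on our part, the per-class weights are proportional to `(1 - beta) / (1 - beta ** n)` for a class with `n` samples. The sketch below uses a hypothetical `beta=0.999` and stand-in counts, and rescales the weights to sum to the number of classes.

```python
# Stand-in class counts (test-set supports); the real training counts differ.
COUNTS = {"akiec": 65, "bcc": 103, "bkl": 220, "df": 23,
          "mel": 223, "nv": 1341, "vasc": 28}


def class_balanced_weights(counts, beta=0.999):
    """Weights proportional to the inverse effective number of samples."""
    raw = {c: (1.0 - beta) / (1.0 - beta ** n) for c, n in counts.items()}
    scale = len(raw) / sum(raw.values())  # rescale so the weights sum to n_classes
    return {c: w * scale for c, w in raw.items()}


cb_weights = class_balanced_weights(COUNTS)
for name, weight in sorted(cb_weights.items()):
    print(f"{name}: {weight:.3f}")
```

As `beta` approaches 1 this tends toward inverse-frequency weighting, and at `beta = 0` every class receives the same weight.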
- Adding generated augmented images to training data (run10)
  - Added a random image augmentor and image generator
  - Added randomly generated images back into the training data
  - Wanted to normalize the percentage representation of underrepresented classes in the data set

  | class | precision | recall | f1-score | support |
  |---|---|---|---|---|
  | akiec | 0.01 | 0.02 | 0.01 | 65 |
  | bcc | 0.04 | 0.06 | 0.05 | 103 |
  | bkl | 0.07 | 0.03 | 0.04 | 220 |
  | df | 0.03 | 0.04 | 0.04 | 23 |
  | mel | 0.12 | 0.22 | 0.15 | 223 |
  | nv | 0.68 | 0.61 | 0.64 | 1341 |
  | vasc | 0.00 | 0.00 | 0.00 | 28 |
  | accuracy | | | 0.44 | 2003 |
  | macro avg | 0.14 | 0.14 | 0.13 | 2003 |
  | weighted avg | 0.48 | 0.44 | 0.45 | 2003 |
- Increasing custom layer neuron density from 512 to 1024 and rerunning promising models (run11; v12)
  - Testing multiple models with increased neuron count
  - Top performers are as follows: InceptionV3.Adam, ResNet50.Adam, VGG16.SGD
  - Ultimately, InceptionV3.Adam remained the highest performer

  InceptionV3.Adam

  | class | precision | recall | f1-score | support |
  |---|---|---|---|---|
  | akiec | 0.03 | 0.06 | 0.04 | 65 |
  | bcc | 0.04 | 0.06 | 0.05 | 103 |
  | bkl | 0.12 | 0.14 | 0.13 | 220 |
  | df | 0.01 | 0.04 | 0.02 | 23 |
  | mel | 0.09 | 0.15 | 0.12 | 223 |
  | nv | 0.67 | 0.50 | 0.57 | 1341 |
  | vasc | 0.03 | 0.07 | 0.04 | 28 |
  | accuracy | | | 0.37 | 2003 |
  | macro avg | 0.14 | 0.15 | 0.14 | 2003 |
  | weighted avg | 0.48 | 0.37 | 0.41 | 2003 |

  ResNet50.Adam

  | class | precision | recall | f1-score | support |
  |---|---|---|---|---|
  | akiec | 0.03 | 0.05 | 0.04 | 65 |
  | bcc | 0.04 | 0.06 | 0.05 | 103 |
  | bkl | 0.12 | 0.13 | 0.12 | 220 |
  | df | 0.00 | 0.00 | 0.00 | 23 |
  | mel | 0.13 | 0.24 | 0.17 | 223 |
  | nv | 0.69 | 0.52 | 0.59 | 1341 |
  | vasc | 0.07 | 0.11 | 0.09 | 28 |
  | accuracy | | | 0.39 | 2003 |
  | macro avg | 0.16 | 0.16 | 0.15 | 2003 |
  | weighted avg | 0.49 | 0.39 | 0.43 | 2003 |

  VGG16.SGD

  | class | precision | recall | f1-score | support |
  |---|---|---|---|---|
  | akiec | 0.04 | 0.06 | 0.05 | 65 |
  | bcc | 0.02 | 0.03 | 0.02 | 103 |
  | bkl | 0.13 | 0.15 | 0.14 | 220 |
  | df | 0.00 | 0.00 | 0.00 | 23 |
  | mel | 0.15 | 0.24 | 0.19 | 223 |
  | nv | 0.68 | 0.54 | 0.61 | 1341 |
  | vasc | 0.03 | 0.04 | 0.03 | 28 |
  | accuracy | | | 0.41 | 2003 |
  | macro avg | 0.15 | 0.15 | 0.15 | 2003 |
  | weighted avg | 0.49 | 0.41 | 0.44 | 2003 |
- Augmented image generation with 1000 images for underrepresented classes with InceptionV3.Adam (run12; v11)
  - This version was technically run in two parts: the first generated augmented images such that each minority class would contain at least 500 images; the second generated augmented images such that each minority class would contain at least 1000 images.
  - 'df' performing relatively well, but 'vasc' is not being identified at all
  - Our theory was that the 'vasc' class was being subsumed into the other minority classes due to augmentation noise.

  | class | precision | recall | f1-score | support |
  |---|---|---|---|---|
  | akiec | 0.03 | 0.06 | 0.04 | 65 |
  | bcc | 0.08 | 0.13 | 0.09 | 103 |
  | bkl | 0.14 | 0.15 | 0.14 | 220 |
  | df | 0.00 | 0.00 | 0.00 | 23 |
  | mel | 0.12 | 0.22 | 0.15 | 223 |
  | nv | 0.67 | 0.50 | 0.58 | 1341 |
  | vasc | 0.00 | 0.00 | 0.00 | 28 |
  | accuracy | | | 0.39 | 2003 |
  | macro avg | 0.15 | 0.15 | 0.14 | 2003 |
  | weighted avg | 0.48 | 0.39 | 0.42 | 2003 |
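
The "at least 500" and "at least 1000" targets amount to computing, per class, how many augmented images are needed to reach a floor. The sketch below shows that bookkeeping and the resulting class shares; `TRAIN_COUNTS` holds hypothetical placeholder counts, not the actual training split from this repo, and both helper names are illustrative.

```python
# Hypothetical training counts per class (placeholders, not from the repo).
TRAIN_COUNTS = {"akiec": 262, "bcc": 411, "bkl": 879, "df": 92,
                "mel": 890, "nv": 5364, "vasc": 114}


def augmentation_plan(counts, floor):
    """Number of augmented images to generate so each class has at least `floor`."""
    return {c: max(0, floor - n) for c, n in counts.items()}


def class_shares(counts):
    """Fraction of the data set that each class represents."""
    total = sum(counts.values())
    return {c: n / total for c, n in counts.items()}


for floor in (500, 1000):
    plan = augmentation_plan(TRAIN_COUNTS, floor)
    augmented = {c: TRAIN_COUNTS[c] + plan[c] for c in TRAIN_COUNTS}
    shares = class_shares(augmented)
    print(f"floor={floor}: generate {sum(plan.values())} images, "
          f"'nv' share becomes {shares['nv']:.2f}")
```

Even at a floor of 1000, 'nv' remains the largest class by a wide margin, so this approach narrows the imbalance without removing it.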