Applied ML for Breast Cancer Detection: Unveiling Diagnostic Potentials

Introduction

Breast cancer, one of the most common and potentially lethal cancers affecting women worldwide, presents a significant public health challenge. Early detection is crucial to improving the prognosis and survival rates of breast cancer patients. Studies show that 1 in 8 women in the United States will be diagnosed with breast cancer in their lifetime, and in 2023, an estimated 297,790 women and 2,800 men were diagnosed with invasive breast cancer. Traditional methods such as mammography, ultrasound, and biopsy have long been used for detection, but these techniques often require expert interpretation and can produce false positives or false negatives. In recent years, machine learning has emerged as a powerful tool in the medical field, offering new avenues for enhancing breast cancer detection. Machine learning algorithms, trained on large datasets of medical images and patient data, can identify patterns and anomalies, assisting in early and accurate diagnosis and in creating personalized treatment plans. Our study examines the application of machine learning to breast cancer detection, exploring several modeling algorithms and techniques used to improve diagnostic accuracy.

Methodology

Our goal was to develop an accurate and reliable machine learning model for classifying breast tumors as “benign” or “malignant” using the Wisconsin Diagnostic Breast Cancer (WDBC) dataset. We focused on Logistic Regression, Random Forest, and XGBoost, optimizing each to distinguish between benign and malignant samples, with the aim of improving the accuracy of breast cancer diagnostics. The WDBC dataset comprises features computed from a digitized image of a fine needle aspirate (FNA) of a breast mass. It has 569 instances and 32 attributes: an ID number, the diagnosis, and 30 real-valued input features. Of the 569 tumors, 212 are malignant and 357 are benign. Key features describe characteristics of the cell nuclei, such as texture, radius, and smoothness.
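As a minimal sketch of the dataset's shape and class balance, the copy of WDBC bundled with scikit-learn can be inspected as shown below. Using `load_breast_cancer` is an assumption made for illustration; the study may have read the data from a file instead, and the bundled copy omits the ID column.

```python
from collections import Counter

from sklearn.datasets import load_breast_cancer

# Assumption: scikit-learn's bundled copy of WDBC stands in for the study's data file.
data = load_breast_cancer()
X, y = data.data, data.target  # in this copy, label 0 = malignant and 1 = benign

n_samples, n_features = X.shape
class_counts = Counter(str(data.target_names[label]) for label in y)

print(f"{n_samples} samples, {n_features} input features")
print(dict(class_counts))
```

Note that the label encoding in this copy (0 for malignant) is the reverse of what many write-ups assume, so it is worth checking before computing sensitivity or specificity.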

Results

When we plotted boxplots of the features for benign and malignant tumors, the average feature values were larger for the malignant tumors than for the benign ones. This makes sense because malignant tumors tend to be more aggressive, typically exhibiting higher values in features like texture, density, and size. To explore the data further, we created a pairwise plot. It shows that malignant tumors (red dots) tend to have higher values than benign tumors (blue dots) in most features, and that features with a clear separation between the two classes on the scatterplots are likely to be strong predictors. Our correlation heatmap shows that features like “mean area,” “mean perimeter,” “mean radius,” and “mean concavity” correlate strongly with the diagnosis: as these features increase, the likelihood of the tumor being malignant also increases. The histograms show that the distribution of values for malignant tumors is often shifted toward the higher end of the scale, while benign tumors cluster toward the lower end.
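The correlation reading can be reproduced numerically without any plots. The sketch below is illustrative rather than the study's code: it assumes scikit-learn's bundled WDBC copy (where feature names are spelled "mean radius", "mean area", and so on) and computes the Pearson correlation of each feature with a malignancy indicator.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer

# Assumption: scikit-learn's bundled WDBC copy; label 0 means malignant there.
data = load_breast_cancer()
X = data.data
is_malignant = (data.target == 0).astype(float)

# Pearson correlation of each feature with the malignancy indicator.
corr = np.array(
    [np.corrcoef(X[:, j], is_malignant)[0, 1] for j in range(X.shape[1])]
)
corr_by_name = {str(name): float(c) for name, c in zip(data.feature_names, corr)}

# Per-class means, the quantity a boxplot comparison summarizes.
mean_malignant = X[is_malignant == 1].mean(axis=0)
mean_benign = X[is_malignant == 0].mean(axis=0)

# Show the five features most strongly correlated with malignancy.
for j in np.argsort(-np.abs(corr))[:5]:
    print(
        f"{data.feature_names[j]:<25} corr={corr[j]:+.2f} "
        f"malignant mean={mean_malignant[j]:.3g} benign mean={mean_benign[j]:.3g}"
    )
```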

We used a Random Forest classifier to rank the features by variable importance. In our output, “area_mean” had the highest variable importance, followed by “perimeter_se,” “compactness_mean,” “texture_mean,” “symmetry_worst,” “concave_points_se,” “smoothness_mean,” and “fractal_dimension_worst.”
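A ranking of this kind can be sketched with scikit-learn's impurity-based importances. The hyperparameters, the random seed, and the use of the bundled dataset are assumptions, so the resulting order will not necessarily match the ranking reported above.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

# Assumption: bundled WDBC copy and illustrative hyperparameters.
data = load_breast_cancer()
forest = RandomForestClassifier(n_estimators=200, random_state=0)
forest.fit(data.data, data.target)

# Impurity-based importances; they are non-negative and sum to 1.
importances = forest.feature_importances_
ranking = np.argsort(importances)[::-1]  # feature indices, most important first

top_features = [(str(data.feature_names[i]), float(importances[i])) for i in ranking[:8]]
for name, score in top_features:
    print(f"{name:<25} {score:.3f}")
```

Impurity-based importances can favor correlated or high-cardinality features, so a permutation-based check is a reasonable complement when the ranking matters.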

To train and evaluate our models, we split the dataset into training, validation, and test sets in a 60-20-20 ratio: we first set aside 60% of the data for training, then split the remaining 40% evenly between validation and testing. Logistic Regression reached an accuracy of 0.91 on the validation set and 0.94 on the test set, with a ROC AUC of 0.98 (validation) and 0.99 (test) and a PR AUC of 0.96 (validation) and 0.98 (test). Random Forest reached an accuracy of 0.93 on the validation set and 0.95 on the test set, with a ROC AUC of 0.99 on both sets and a PR AUC of 0.99 (validation) and 0.98 (test). XGBoost reached an accuracy of 0.97 on the validation set and 0.96 on the test set. Surprisingly, its ROC AUC and PR AUC were both 1.00 on the validation set, while both came out to 0.98 on the test set.
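The two-stage split and the three metrics can be sketched as follows for the Logistic Regression case. Stratified splitting, feature scaling, the random seed, and the use of average precision as the PR AUC summary are all assumptions; the numbers this produces are not expected to match the figures reported above.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, average_precision_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

data = load_breast_cancer()
X = data.data
y = (data.target == 0).astype(int)  # recode so that malignant = 1 (positive class)

# Stage 1: 60% training, 40% held out. Stage 2: split the held-out part 50-50.
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, test_size=0.4, stratify=y, random_state=42
)
X_val, X_test, y_val, y_test = train_test_split(
    X_rest, y_rest, test_size=0.5, stratify=y_rest, random_state=42
)

# Scaling inside the pipeline keeps validation and test data out of the fit.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)


def evaluate(X_part, y_part):
    """Return accuracy, ROC AUC, and PR AUC (average precision) for one split."""
    scores = model.predict_proba(X_part)[:, 1]
    return {
        "accuracy": accuracy_score(y_part, model.predict(X_part)),
        "roc_auc": roc_auc_score(y_part, scores),
        "pr_auc": average_precision_score(y_part, scores),
    }


val_metrics = evaluate(X_val, y_val)
test_metrics = evaluate(X_test, y_test)
print("validation:", val_metrics)
print("test:", test_metrics)
```

With only 114 samples in each of the validation and test sets, a single misclassified tumor moves accuracy by almost a full percentage point, which helps put small differences between models in context.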

Traditional breast cancer diagnostics often face a sensitivity-specificity trade-off, where increasing the sensitivity to detect more true positives can lead to a decrease in specificity, resulting in false positives. Conversely, increasing specificity to reduce false positives may lower sensitivity, potentially missing true cancer cases. This balance is crucial for effective screening programs, as it impacts the overall accuracy of diagnoses and the rate at which patients undergo unnecessary treatments due to false alarms. In our model evaluation, Logistic Regression had a sensitivity of 0.90 and a specificity of 0.96. Random Forest showed an improved sensitivity of 0.93, and XGBoost achieved the highest sensitivity of 0.98 with a specificity of 0.94.
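For reference, sensitivity and specificity follow directly from the confusion-matrix counts. The sketch below uses a hypothetical helper and made-up toy labels (1 = malignant, 0 = benign), not the study's predictions.

```python
def sensitivity_specificity(y_true, y_pred):
    """Return (sensitivity, specificity), treating label 1 as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # malignant, flagged
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # malignant, missed
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # benign, cleared
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # benign, false alarm
    sensitivity = tp / (tp + fn)  # share of malignant cases detected
    specificity = tn / (tn + fp)  # share of benign cases correctly cleared
    return sensitivity, specificity


# Toy labels for illustration only.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]

sens, spec = sensitivity_specificity(y_true, y_pred)
print(f"sensitivity={sens:.2f} specificity={spec:.2f}")
```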

A SHAP values plot helps us understand which features influence a model’s predictions most. It increases the transparency of machine learning models by quantifying the impact of each feature on the prediction. This is particularly important in healthcare, where stakeholders need to trust and understand model decisions. Our SHAP values plot, run on a logistic regression model, showed that features such as area mean, texture mean, and worst perimeter are significant predictors of malignancy. Meanwhile, features such as worst fractal dimension, mean smoothness, and mean compactness had much less of an impact in predicting a malignant tumor.
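For a linear model such as logistic regression, SHAP values have a closed form in log-odds space when features are treated as independent: each feature's contribution is its coefficient times the feature's deviation from the background mean. The sketch below computes this directly with NumPy instead of calling the `shap` library. The bundled dataset, the standardization step, and fitting on the full data are assumptions made for illustration, so the resulting ranking may differ from the plot described above.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

data = load_breast_cancer()
X = StandardScaler().fit_transform(data.data)
y = (data.target == 0).astype(int)  # malignant = 1

# Fit on all rows purely for illustration (no train/test split here).
model = LogisticRegression(max_iter=1000).fit(X, y)
weights = model.coef_[0]
background_mean = X.mean(axis=0)

# Linear SHAP under feature independence, in log-odds units:
# phi[i, j] = w_j * (x[i, j] - mean_j)
shap_values = weights * (X - background_mean)
base_value = model.intercept_[0] + weights @ background_mean

# Global importance: mean absolute SHAP value per feature.
mean_abs = np.abs(shap_values).mean(axis=0)
for j in np.argsort(mean_abs)[::-1][:5]:
    print(f"{data.feature_names[j]:<25} {mean_abs[j]:.3f}")
```

The contributions for each sample add up, together with the base value, to that sample's log-odds prediction, which is the additivity property that makes SHAP plots interpretable.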

Discussion

The models' performance metrics support the potential of machine learning to improve breast cancer diagnostics. The XGBoost algorithm, in particular, performed well, significantly reducing false negatives, a crucial factor in cancer detection, where the cost of missing a diagnosis is substantial. The ability of such models to handle many data types and distributions points toward more personalized treatment and improved intervention efficacy. However, the near-perfect metrics observed in our study call for caution. These results, while promising, might be overly optimistic because of potential overfitting, where the model adapts too closely to the training data, including noise and outliers, at the expense of generalizability. Additionally, the dataset consists predominantly of morphological features of the tumor cells and lacks broader patient data. Without variables such as age, ethnicity, and lifestyle factors, the representativeness of the dataset is constrained, potentially limiting the applicability of the findings to the general population.

We also explored the sensitivity-specificity trade-off by adjusting the classification threshold. At a higher threshold, the Logistic Regression model achieved perfect specificity, though at the cost of reduced sensitivity. The Random Forest and XGBoost models also gained specificity, with a slight loss of sensitivity. Conversely, lowering the threshold enhanced sensitivity at the expense of specificity. Notably, the XGBoost model was considerably robust, maintaining high sensitivity with commendable specificity across the thresholds we tried. These threshold experiments underline the models' flexibility and the need to tailor them to the clinical context, optimizing for sensitivity or specificity based on the screening program's objectives.
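The effect of moving the threshold can be illustrated with a small sweep. The helper name, the scores, and the labels below are made up for illustration (1 = malignant, 0 = benign) and are not outputs of the study's models.

```python
def rates_at_threshold(y_true, scores, threshold):
    """Classify as positive when score >= threshold; return (sensitivity, specificity)."""
    y_pred = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    return tp / (tp + fn), tn / (tn + fp)


# Toy predicted probabilities of malignancy, for illustration only.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
scores = [0.95, 0.80, 0.60, 0.35, 0.55, 0.30, 0.20, 0.05]

sweep = {t: rates_at_threshold(y_true, scores, t) for t in (0.3, 0.5, 0.7)}
for threshold, (sens, spec) in sweep.items():
    print(f"threshold={threshold:.1f} sensitivity={sens:.2f} specificity={spec:.2f}")
```

On this toy data, raising the threshold from 0.3 to 0.7 moves sensitivity from 1.00 down to 0.50 while specificity rises from 0.50 to 1.00, which is the same direction of trade-off described above.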

Conclusion

Our study showed the potential of machine learning in breast cancer detection, mainly through the adept application of the XGBoost model. By fine-tuning the classification thresholds, we observed a nuanced sensitivity-specificity trade-off, revealing the complex dynamics of diagnostic accuracy. With threshold adjustments, our models demonstrated that specificity could be maximized at the expense of sensitivity and vice versa. The XGBoost model, in particular, exhibited a remarkable balance, maintaining robust performance metrics under varying thresholds.

Despite the promise shown in our results, the pursuit of perfection in model metrics must be tempered with the practicality of clinical application. We recognize the risks of overfitting and the pressing need for model validation against broader, more diverse datasets. These steps are crucial to ensuring the real-world efficacy of our findings. Additionally, the calibration of sensitivity and specificity thresholds must be tailored to the specific screening requirements of the target population.

Given these insights, future endeavors should enhance the models' generalizability and fine-tune the balance between sensitivity and specificity to meet clinical demands. Advancing these efforts will allow machine learning to evolve from a burgeoning tool to an essential element in breast cancer diagnostics, thereby improving patient outcomes and pioneering advancements in medical care.