The goal of this project is to develop a machine-learning model to identify breast cancer in mammograms obtained from regular screening exams. The dataset consists of radiographic breast images of approximately 20,000 female subjects, along with metadata such as patient age, breast density, whether the breast was positive for malignant cancer, and further follow-up information. The model will be trained on the provided training set and evaluated on a hidden test set, with the ultimate goal of improving the accuracy and efficiency of breast cancer detection, reducing costs and unnecessary medical procedures, and extending the benefits of early detection to a broader population. The current state-of-the-art probabilistic F1 score (pF1) stands at 55%.
Breast cancer is the most commonly occurring cancer worldwide and early detection is critical in reducing mortality rates. However, current screening programs are expensive, require the expertise of highly-trained radiologists, and often result in false positives, leading to unnecessary medical procedures and increased patient anxiety. By developing a machine learning model, we aim to improve the accuracy, efficiency, and accessibility of breast cancer screening programs. This would have significant implications for patient and health care delivery, reducing costs and unnecessary medical procedures, and extending the benefits of early detection to a broader population. Application of this technology could be used in clinical settings to assist radiologists in the screening process, reducing their workload and improving the accuracy of their diagnosis. This could also be used in areas with limited access to radiologists, where the machine learning model could provide a preliminary diagnosis before being reviewed by a radiologist. Ultimately, this technology could help save lives by improving the early detection and treatment of breast cancer.
Based on the context provided and the description statement, this is a binary classification problem, where the goal is to classify mammogram images into positive/negative. We aim to train a model based on not only the target variable but also additional metadata as described in the dataset. We later plan to drop some of these parameters as few of them are not present in the test data set, hoping that the model has extracted enough useful information from them to help reduce false positives.
This dataset comes from the Kaggle challenge "RSNA Screening Mammography Breast Cancer Detection" by the Radiological Society of North America. It contains radiographic breast images of roughly 20,000 female patients (60:40 split) with usually 4 images per patient (2 lateral [left
, right
] per view [mediolateral-oblique
, crainal-caudal
]).
site id
,machine id
- ID code for the source hospital, imaging devicepatient id
,image id
- ID code for the patient and respective imagelaterality
- whether the image is of the left or right breastview
- orientation of the imageage
- patient’s age in yearsimplant
- whether the patient had breast implants at the patient leveldensity
(train only) - rating for how dense the breast tissue is, with A being the least dense and D being the most densebiopsy
(train only) - whether a follow-up biopsy was performed on the breastinvasive
(train only) - whether or not cancer (iftrue
) proved to be invasiveBIRADS
(train only) -0
if the breast required follow-up,1
if the breast was rated as negative for cancer, and2
if the breast was rated as normaldifficult negative case
(train only) -true
if the case was unusually difficultcancer
(target) - whether or not the breast was positive for malignant cancer
The feature vectors for the images are not provided in the dataset and will need to be extracted using an encoder module. The images may also be subjected to context-based cropping and blurring. No other pre-processing requirement is estimated, but this need may be modified in the future. We plan to train the secondary network on all the five image properties, and later remove everything except the target variable.
- National Cancer Institute. Common Cancer Types. https://www.cancer.gov/types/common-cancers, 2022.
- Radiological Society of North America. RSNA Screening Mammography Breast Cancer Detection. https://www.kaggle.com/competitions/rsna-breast-cancer-detection/leaderboard, 2023.