This repository contains a multi-stage pipeline for analyzing hematoxylin and eosin (H&E) stained histopathological images of hepatocellular carcinoma. The pipeline covers stain color normalization, nuclei mask segmentation, feature extraction, and prognosis prediction.
- Overview
- Stage I: Stain Colour Normalisation and Segmentation
- Stage II: Feature Extraction and Analysis
- Requirements
- Key Highlights
Hepatocellular carcinoma (HCC) is a primary liver cancer that originates in hepatocytes. It is the most common type of liver cancer and a significant global health concern. Machine learning applied to histopathological imaging has made automated staging far more tractable, and neural networks perform reasonably well at it. This research quantifies specific biomarkers and features to make the process more "explainable". Quantifying these markers lets us tune the pipeline and staging systems for different body compositions. It also helps establish baseline scores for each marker, which in turn helps us understand the leading causes of a particular cancer variant.
The pipeline consists of two main stages:
- Stain Colour Normalisation and Nuclei Mask Segmentation: This stage normalizes the color of histopathological images and segments the nuclei.
- Feature Extraction and Analysis: This stage extracts features from the segmented nuclei masks and predicts prognosis based on these features.
The Hepatoma_Pipeline.ipynb notebook performs stain color normalization on the given histopathological images and segments the nuclei masks. It also generates a Gradio link that hosts an interface where images can be dragged and dropped to view the intermediate outputs.
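The Key Highlights describe the normalization as deconvolution-based. As a rough illustration of that idea, the sketch below converts a single RGB pixel to optical density, solves for per-stain concentrations, and recombines them with a reference stain matrix. It is only a sketch under stated assumptions: the stain vectors are the commonly cited Ruifrok and Johnston H&E values, and the function names (`separate_stains`, `recombine`, `normalize_pixel`) are hypothetical. The notebook may estimate its own stain vectors or use a different variant.

```python
import math

# Assumed reference stain vectors in optical-density space
# (rows: hematoxylin, eosin, residual), after Ruifrok and Johnston.
STAINS = [
    (0.65, 0.70, 0.29),
    (0.07, 0.99, 0.11),
    (0.27, 0.57, 0.78),
]


def rgb_to_od(rgb, background=255.0):
    """Beer-Lambert: OD = -log10(I / I0). Values are clipped to avoid log(0)."""
    return [-math.log10(max(c, 1.0) / background) for c in rgb]


def od_to_rgb(od, background=255.0):
    """Inverse of rgb_to_od (up to the clipping above)."""
    return [background * 10 ** (-d) for d in od]


def det3(m):
    """Determinant of a 3x3 matrix given as nested sequences."""
    (a, b, c), (d, e, f), (g, h, i) = m
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)


def separate_stains(rgb, stains=STAINS):
    """Solve OD = sum_k conc[k] * stains[k] for conc using Cramer's rule."""
    od = rgb_to_od(rgb)
    # System matrix: one row per colour channel, one column per stain.
    system = [[stains[k][ch] for k in range(3)] for ch in range(3)]
    d = det3(system)
    conc = []
    for k in range(3):
        replaced = [row[:] for row in system]
        for ch in range(3):
            replaced[ch][k] = od[ch]
        conc.append(det3(replaced) / d)
    return conc


def recombine(conc, stains=STAINS):
    """Rebuild an RGB pixel from stain concentrations and a stain matrix."""
    od = [sum(conc[k] * stains[k][ch] for k in range(3)) for ch in range(3)]
    return od_to_rgb(od)


def normalize_pixel(rgb, source_stains, target_stains=STAINS):
    """Re-express a pixel stained with source_stains using target_stains."""
    return recombine(separate_stains(rgb, source_stains), target_stains)
```

In practice the same arithmetic would be applied to a whole image array at once. The per-pixel form is used here only to keep the example dependency-free.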
Feed the nuclei masks generated from the given images to Cell Profiler, an open-source tool, to detect primary objects with a size between 5 px and 40 px.
Pass the obtained database file to Cell Profiler Analyst, which will produce an Excel sheet containing suitable features extracted for each nucleus identified from all the images.
- Hepatoma_Generate_CSV.py: Aggregates all the features and creates a condensed table with image names and related features.
- Hepatoma_Prognosis.py: Runs the final prediction script based on the generated feature table.
- Python 3.x
- Gradio
- Jupyter Notebook
- Cell Profiler
- Cell Profiler Analyst
- ADASYN Data Augmentation: We address the class imbalance, particularly the scarcity of images in advanced cancer stages, using Adaptive Synthetic Sampling (ADASYN). This method was chosen over SMOTE for its better sensitivity score, crucial for medical staging and imaging tasks.
- Deconvolution Stain Color Normalization: Applied to generalize the pipeline to various staining methods. This step enhances the generalization of the UNet model for generating nuclei segmented masks.
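
To illustrate what ADASYN does differently from uniform oversampling, here is a minimal pure-Python sketch of its core idea: minority samples surrounded by more majority-class neighbours receive a larger share of the synthetic samples. It is an illustration, not the project's code. The function name `adasyn_sketch` and its parameters are hypothetical, and synthetic points are interpolated towards minority-class neighbours only, which is a simplifying assumption. A maintained library implementation would normally be used instead.

```python
import math
import random


def adasyn_sketch(minority, majority, k=3, beta=1.0, seed=0):
    """Return synthetic minority samples, weighted towards hard-to-learn points.

    minority, majority: lists of equal-length numeric tuples.
    k: number of neighbours; beta: desired balance level (1.0 = fully balanced).
    """
    rng = random.Random(seed)
    target = int((len(majority) - len(minority)) * beta)
    if target <= 0 or len(minority) < 2:
        return []

    # Step 1: for each minority point, the fraction of its k nearest
    # neighbours (over the whole dataset) that belong to the majority class.
    ratios = []
    for i, x in enumerate(minority):
        pool = [(math.dist(x, p), True) for p in majority]
        pool += [(math.dist(x, p), False)
                 for j, p in enumerate(minority) if j != i]
        pool.sort(key=lambda item: item[0])
        ratios.append(sum(is_major for _, is_major in pool[:k]) / k)

    total = sum(ratios)
    if total == 0:
        return []  # no minority point has majority neighbours

    # Step 2: allocate synthetic samples in proportion to those ratios and
    # interpolate between each point and a random nearby minority point.
    synthetic = []
    for i, x in enumerate(minority):
        n_new = round(ratios[i] / total * target)
        neighbours = sorted(
            (p for j, p in enumerate(minority) if j != i),
            key=lambda p: math.dist(x, p),
        )[:k]
        for _ in range(n_new):
            z = rng.choice(neighbours)
            lam = rng.random()
            synthetic.append(tuple(a + lam * (b - a) for a, b in zip(x, z)))
    return synthetic
```

Because each synthetic point is a convex combination of two minority samples, it always lies on the segment between them. The adaptive part is only the per-point allocation in step 2.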
Using the segmented masks, we extract quantifiable features with Cell Profiler and Cell Profiler Analyst:
- Total Cell Count: The total number of cells in each image.
- Average Cell Area: The mean area of segmented cells within each image.
- Spatial Variance in X and Y Coordinates: The variance in the spatial distribution of cell centroids along the X and Y axes.
- Coefficient of Variation (CV) of Cell Area: Standard deviation of cell areas divided by the mean cell area, indicating variability in cell sizes.
- Cell Density: The sum of the areas of all segmented cells within each image.
- Perimeter-to-Area Ratio: The ratio of the sum of cell perimeters to the total cell area, indicating cell shape complexity.
- Compactness Variation: Standard deviation of cell eccentricities divided by the mean eccentricity, quantifying variation in cell compactness.
- Normalized Cell Roundness (NCR): Sum of major axis lengths of cells, providing a measure of cell roundness.
- Feret’s Diameter Variation: Standard deviation of Feret diameters divided by the mean Feret diameter, assessing variability in cell elongation.
- Model Performance: Various machine learning models were trained on the extracted features, achieving a peak test accuracy of 74.45%.
- Feature Impact: Adding image moments (Central Moments, Normalized Moments, Zernike Moments) as metrics decreased model performance. This is attributed to the large number of cells per image, making aggregate moment values less useful for distinguishing HCC stages.
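
The per-image features listed above are simple aggregates over per-nucleus measurements, so the aggregation step can be sketched in a few lines. The example below follows the definitions as written in this README. The dictionary keys (`area`, `perimeter`, `eccentricity`, `major_axis`, `feret`, `x`, `y`) are hypothetical placeholders rather than the actual Cell Profiler column names, and the use of population (not sample) variance and standard deviation is an assumption. This is not the code in Hepatoma_Generate_CSV.py.

```python
from statistics import mean, pstdev, pvariance


def coefficient_of_variation(values):
    """Population standard deviation divided by the mean."""
    return pstdev(values) / mean(values)


def summarize_image(nuclei):
    """Collapse per-nucleus rows (a list of dicts) into per-image features."""
    if not nuclei:
        raise ValueError("need at least one segmented nucleus")

    areas = [n["area"] for n in nuclei]
    perimeters = [n["perimeter"] for n in nuclei]

    return {
        "total_cell_count": len(nuclei),
        "average_cell_area": mean(areas),
        "spatial_variance_x": pvariance([n["x"] for n in nuclei]),
        "spatial_variance_y": pvariance([n["y"] for n in nuclei]),
        "cv_cell_area": coefficient_of_variation(areas),
        # Defined in this README as the summed area of all segmented cells.
        "cell_density": sum(areas),
        "perimeter_to_area_ratio": sum(perimeters) / sum(areas),
        "compactness_variation": coefficient_of_variation(
            [n["eccentricity"] for n in nuclei]
        ),
        # Defined in this README as the sum of major axis lengths.
        "normalized_cell_roundness": sum(n["major_axis"] for n in nuclei),
        "feret_diameter_variation": coefficient_of_variation(
            [n["feret"] for n in nuclei]
        ),
    }
```

Calling `summarize_image` once per image and writing the resulting dictionaries with `csv.DictWriter` would give a condensed table with one row per image.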
Feel free to reach out if you have any questions or need further assistance with the project!