Multiplex images of tissue contain information on the gene expression, morphology, and spatial distribution of individual cells comprising biologically specialized niches. However, accurate extraction of cell-level features from pixel-level data is hindered by the presence of microscopy artifacts. Manual curation of noisy cell segmentation instances scales poorly with increasing dataset size, and methods capable of automated artifact detection are needed to enhance workflow efficiency, minimize curator burden, and mitigate human bias. In this challenge, participants will draw on classical and/or machine learning approaches to develop probabilistic classifiers for detecting cell segmentation instances in multiplex images of tissue corrupted by microscopy artifacts.
Test data for this challenge was collected as part of the Human Tumor Atlas Network (HTAN) and consists of a single 1.6 cm² section of primary human colorectal adenocarcinoma probed for 21 tumor, immune, and stromal markers over 8 rounds of CyCIF multiplex immunofluorescence imaging (SARDANA-097).
Multiclass quality control (QC) annotations for microscopy artifacts present in the multi-channel SARDANA-097 image have been manually curated and are provided as labels for model training. These and other data files for the SARDANA-097 image can be found at the Sage Synapse data repository under Synapse ID: syn26848598.
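For convenience, these files can be pulled programmatically with the `synapseclient` package; the following is a minimal sketch assuming a registered Synapse account and a local target directory named `01-artifacts`:

```python
# Minimal sketch: bulk-download the challenge files from Sage Synapse.
# Assumes a registered Synapse account (pip install synapseclient).
import synapseclient
import synapseutils

syn = synapseclient.Synapse()
syn.login()  # or syn.login(authToken="<personal access token>")

# Recursively download everything under the challenge Synapse ID
synapseutils.syncFromSynapse(syn, "syn26848598", path="01-artifacts")
```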
Data files and descriptions:
```
01-artifacts
│
└───csv
│   │   ReadMe.txt
│   │   unmicst-WD-76845-097_cellRing.csv
│
└───markers
│   │   markers.csv
│
└───mask
│   │   ReadMe.txt
│   │   cellRingMask.tif
│
└───qc
│   │   ROI_table.csv
│   │   polygon_dict.pkl
│   │   qcmask_cell.tif
│   │   qcmask_pixel.tif
│   │   truth.csv
│
└───seg
│   │   ReadMe.txt
│   │   WD-76845-097.ome.tif
│
└───tif
│   │   ReadMe.txt
│   │   WD-76845-097.ome.tif
```
- `csv/unmicst-WD-76845-097_cellRing.csv`: single-cell feature table containing cell IDs, (x, y) spatial coordinates, integrated fluorescence signal intensities, and various nuclear morphology attributes for the 1,242,756 cells constituting the SARDANA-097 image.
- `markers/markers.csv`: metadata mapping immunomarkers to image channel and CyCIF cycle numbers.
- `mask/cellRingMask.tif`: cell segmentation mask for the SARDANA-097 image, indexed 0 to 1,242,756 with 0 reserved for background pixels.
- `qc/ROI_table.csv`: ROI metadata for artifacts in the SARDANA-097 image.
- `qc/polygon_dict.pkl`: shape type (ellipse or polygon) and vertex coordinates defining each ROI in `qc/ROI_table.csv`.
- `qc/qcmask_cell.tif`: cell segmentation mask annotated by artifact class: 0=background, 1=artifact-free, 2=fluorescence aberration, 3=slide debris, 4=coverslip air bubble, 5=uneven immunolabeling, 6=image blur.
- `qc/qcmask_pixel.tif`: pixel-level ROI mask: 1=no ROI, 2=fluorescence aberration, 3=slide debris, 4=coverslip air bubble, 5=uneven immunolabeling, 6=image blur.
- `qc/truth.csv`: multiclass ground truth annotations for the 1,242,756 cells comprising the SARDANA-097 image.
- `seg/WD-76845-097.ome.tif`: segmentation outlines defining cell boundaries in the SARDANA-097 image.
- `tif/WD-76845-097.ome.tif`: stitched and registered 40-channel OME-TIFF file constituting the SARDANA-097 image.
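As a starting point, the tabular files and masks can be loaded as follows; a minimal sketch assuming the directory layout above (the full-resolution images are large, so they are opened lazily rather than read into memory):

```python
import pandas as pd
import tifffile
import zarr

# Single-cell feature table and per-cell ground truth annotations
features = pd.read_csv("01-artifacts/csv/unmicst-WD-76845-097_cellRing.csv")
truth = pd.read_csv("01-artifacts/qc/truth.csv")

# The 40-channel image and QC masks are large; open them as zarr stores
img = zarr.open(tifffile.imread("01-artifacts/tif/WD-76845-097.ome.tif", aszarr=True), mode="r")
qc_cell = zarr.open(tifffile.imread("01-artifacts/qc/qcmask_cell.tif", aszarr=True), mode="r")

print(features.shape, truth.shape)
```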
While the SARDANA-097 image comprises a total of 40 channels, artifacts were only curated from 22 of these, as some channels either contained signals from secondary antibodies alone or were determined unsuitable for the purposes of this hackathon challenge. Please consider the following channels for model training:
```
'Hoechst0', 'anti_CD3', 'anti_CD45RO', 'Keratin_570', 'aSMA_660',
'CD4_488', 'CD45_PE', 'PD1_647', 'CD20_488', 'CD68_555', 'CD8a_660',
'CD163_488', 'FOXP3_570', 'PDL1_647', 'Ecad_488', 'Vimentin_555',
'CDX2_647', 'LaminABC_488', 'Desmin_555', 'CD31_647', 'PCNA_488',
'CollagenIV_647'
```
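When slicing channels from the image or columns from the feature table, `markers/markers.csv` can be used to translate these marker names into channel indices. A sketch, where the `marker_name` column and row-order-matches-channel-order behavior are assumptions about the layout of `markers.csv`:

```python
import pandas as pd

# The 22 channels curated for this challenge
CURATED = [
    "Hoechst0", "anti_CD3", "anti_CD45RO", "Keratin_570", "aSMA_660",
    "CD4_488", "CD45_PE", "PD1_647", "CD20_488", "CD68_555", "CD8a_660",
    "CD163_488", "FOXP3_570", "PDL1_647", "Ecad_488", "Vimentin_555",
    "CDX2_647", "LaminABC_488", "Desmin_555", "CD31_647", "PCNA_488",
    "CollagenIV_647",
]

markers = pd.read_csv("01-artifacts/markers/markers.csv")
# Row order in markers.csv is assumed to match the image channel order
channel_idx = markers.index[markers["marker_name"].isin(CURATED)].tolist()
```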
Examples of artifact classes in the SARDANA-097 image:
Classifier output should consist of a CSV file named `scores.csv` containing probability scores for each of the six classes (1=artifact-free, 2=fluorescence aberration, 3=slide debris, 4=coverslip air bubble, 5=uneven immunolabeling, 6=image blur) for each cell in the dataset. The table should be formatted as follows:
"CellID","1","2","3","4","5","6"
1,0.11,0.29,0.13,0.35,0.05,0.07
2,0.09,0.49,0.14,0.17,0.10,0.02
3,0.28,0.06,0.20,0.06,0.10,0.29
.
.
.
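A table in this format can be written directly from the `predict_proba` output of any scikit-learn-style probabilistic classifier. The sketch below trains a random forest on the feature table; the choice of features, the `class_label` column name in `truth.csv`, and the in-sample fit are all illustrative assumptions:

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

features = pd.read_csv("01-artifacts/csv/unmicst-WD-76845-097_cellRing.csv")
truth = pd.read_csv("01-artifacts/qc/truth.csv")

X = features.drop(columns=["CellID"]).to_numpy()
y = truth["class_label"].to_numpy()  # assumed column name

# In practice, fit on held-out folds rather than the full dataset
clf = RandomForestClassifier(n_estimators=200, random_state=0)
clf.fit(X, y)

proba = clf.predict_proba(X)  # column order follows clf.classes_ (1..6)
scores = pd.DataFrame(proba, columns=[str(c) for c in clf.classes_])
scores.insert(0, "CellID", features["CellID"])
scores.to_csv("scores.csv", index=False)
```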
Classifier predictions will be evaluated against the multiclass ground truth annotations (`qc/truth.csv`) using receiver operating characteristic (ROC) curve analysis by passing `scores.csv` and `qc/truth.csv` as ordered arguments to `roc.py`:
```
$ python roc.py scores.csv truth.csv
```
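`roc.py` is provided with the challenge materials. For intuition, a one-vs-rest version of the same analysis can be sketched with scikit-learn (assuming `truth.csv` carries a `class_label` column aligned row-for-row with `scores.csv`):

```python
import pandas as pd
from sklearn.metrics import roc_auc_score
from sklearn.preprocessing import label_binarize

scores = pd.read_csv("scores.csv")
truth = pd.read_csv("truth.csv")  # assumed 'class_label' column

classes = [1, 2, 3, 4, 5, 6]
y_true = label_binarize(truth["class_label"], classes=classes)
y_score = scores[[str(c) for c in classes]].to_numpy()

# Per-class (one-vs-rest) and macro-averaged areas under the ROC curve
for i, cls in enumerate(classes):
    print(f"class {cls}: AUC={roc_auc_score(y_true[:, i], y_score[:, i]):.3f}")
print(f"macro-average AUC={roc_auc_score(y_true, y_score, average='macro'):.3f}")
```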
Classifier predictions will also be scored against per-class binarized ground truth labels using measures of precision and recall. This is achieved by first making discrete artifact class calls and saving them to a file named `calls.csv`:
"CellID","class_label"
1,1
2,2
3,3
4,1
5,2
.
.
.
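One simple way to derive `calls.csv` from `scores.csv` is to assign each cell the class with the highest probability; a minimal argmax sketch (per-class thresholds are an equally valid strategy):

```python
import pandas as pd

scores = pd.read_csv("scores.csv")
class_cols = ["1", "2", "3", "4", "5", "6"]

calls = pd.DataFrame({
    "CellID": scores["CellID"],
    # idxmax returns the name of the highest-scoring column; cast back to int
    "class_label": scores[class_cols].idxmax(axis=1).astype(int),
})
calls.to_csv("calls.csv", index=False)
```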
`calls.csv` and `qc/truth.csv` can then be passed as ordered arguments to `pr.py` to compute precision and recall for individual and combined artifact classes:
```
$ python pr.py calls.csv truth.csv
Fluor: precision=0.78, recall=0.67
Debris: precision=0.61, recall=0.45
Bubble: precision=0.73, recall=0.84
Staining: precision=0.90, recall=0.62
Blur: precision=0.57, recall=0.56
Overall: precision=0.87, recall=0.79
```
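Conceptually, these numbers can be reproduced with scikit-learn as below; note that treating "Overall" as any-artifact (classes 2-6) versus artifact-free (class 1) is an assumption about `pr.py`'s behavior:

```python
import pandas as pd
from sklearn.metrics import precision_score, recall_score

calls = pd.read_csv("calls.csv")
truth = pd.read_csv("truth.csv")  # assumed 'class_label' column

y_true, y_pred = truth["class_label"], calls["class_label"]

# Per-class precision/recall for the five artifact classes (2-6)
names = ["Fluor", "Debris", "Bubble", "Staining", "Blur"]
for cls, name in zip(range(2, 7), names):
    p = precision_score(y_true == cls, y_pred == cls)
    r = recall_score(y_true == cls, y_pred == cls)
    print(f"{name}: precision={p:.2f}, recall={r:.2f}")

# Overall: any artifact (classes 2-6) vs. artifact-free (class 1)
p = precision_score(y_true > 1, y_pred > 1)
r = recall_score(y_true > 1, y_pred > 1)
print(f"Overall: precision={p:.2f}, recall={r:.2f}")
```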
Questions to consider:

- Ground truth labels can themselves be inaccurate. How might classifiers be developed to guard against 1) artifact misclassification, 2) false positives (i.e., artifact-free cells inadvertently classified as noisy), and 3) false negatives (i.e., artifacts that have gone unannotated)?
- Which model type achieves superior classifier performance: those trained on single-cell feature tables (i.e., `csv/unmicst-WD-76845-097_cellRing.csv`), or those trained on pixel-level imaging data (`tif/WD-76845-097.ome.tif`)? What about hybrid models trained on both data types?
Suggested tools and software packages:

- High-level programming language (Python 3 is recommended)
- Core data science packages (e.g., `pandas`, `numpy`, and `scipy`)
- Libraries for reading, writing, analyzing, and visualizing multi-channel TIFF images (e.g., `tifffile`, `skimage`, `matplotlib`, `napari`)
- Machine learning and artificial intelligence libraries (e.g., `scikit-learn`, `tensorflow`, `keras`, `pytorch`)
If using Python 3, the aforementioned libraries can be installed in a new Python virtual environment dedicated to this project by running the following commands:
```
# on Mac
$ python3 -m venv ~/artifacts        # Creates a new Python virtual environment in the home directory
$ source ~/artifacts/bin/activate    # Activates the newly created virtual environment
$ pip install -r requirements.txt    # Installs software packages using the "requirements.txt" file in this GitHub repo
```
Virtual check-ins will occur daily at 9am & 1pm (US EST) at the following Zoom link:
- https://harvard.zoom.us/j/97485448563?pwd=dWw3VDA5RUZ3emhHVFJsMVZtSUMydz09
- For questions outside of these times, please post to the #01-artifacts Slack channel.