Multiplex images of tissue contain information on the gene expression, morphology, and spatial distribution of individual cells comprising biologically specialized niches. However, accurate extraction of cell-level features from pixel-level data is hindered by the presence of microscopy artifacts. Manual curation of noisy cell segmentation instances scales poorly with increasing dataset size, and methods capable of automated artifact detection are needed to enhance workflow efficiency, minimize curator burden, and mitigate human bias. In this challenge, participants will draw on classical and/or machine learning approaches to develop probabilistic classifiers for detecting cell segmentation instances in multiplex images of tissue corrupted by microscopy artifacts.
Test data for this challenge was collected as part of the Human Tumor Atlas Network (HTAN) and consists of a single 1.6 cm² section of primary human colorectal adenocarcinoma probed for 21 tumor, immune, and stromal markers over 8 rounds of CyCIF multiplex immunofluorescence imaging (SARDANA-097).
Multiclass quality control (QC) annotations for microscopy artifacts present in the multi-channel SARDANA-097 image have been manually curated and are provided as labels for model training. These and other data files for the SARDANA-097 image can be found at the Sage Synapse data repository under Synapse ID: syn26848598.
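For convenience, these files can be pulled programmatically with the `synapseclient` package; the following is a minimal sketch assuming a registered Synapse account and a local target directory named `01-artifacts`:

```python
# Minimal sketch: bulk-download the challenge files from Sage Synapse.
# Assumes a registered Synapse account (pip install synapseclient).
import synapseclient
import synapseutils

syn = synapseclient.Synapse()
syn.login()  # or syn.login(authToken="<personal access token>")

# Recursively download everything under the challenge Synapse ID
synapseutils.syncFromSynapse(syn, "syn26848598", path="01-artifacts")
```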
Data files and descriptions:
```
01-artifacts
│
└───csv
│   │   ReadMe.txt
│   │   unmicst-WD-76845-097_cellRing.csv
│
└───markers
│   │   markers.csv
│
└───mask
│   │   ReadMe.txt
│   │   cellRingMask.tif
│
└───qc
│   │   ROI_table.csv
│   │   polygon_dict.pkl
│   │   qcmask_cell.tif
│   │   qcmask_pixel.tif
│   │   truth.csv
│
└───seg
│   │   ReadMe.txt
│   │   WD-76845-097.ome.tif
│
└───tif
│   │   ReadMe.txt
│   │   WD-76845-097.ome.tif
```
- `csv/unmicst-WD-76845-097_cellRing.csv`: single-cell feature table containing cell IDs, (x, y) spatial coordinates, integrated fluorescence signal intensities, and various nuclear morphology attributes for the 1,242,756 cells constituting the SARDANA-097 image.
- `markers/markers.csv`: metadata mapping immunomarkers to image channel and CyCIF cycle numbers.
- `mask/cellRingMask.tif`: cell segmentation mask for the SARDANA-097 image, indexed 0 to 1,242,756 with 0 reserved for background pixels.
- `qc/ROI_table.csv`: ROI metadata for artifacts in the SARDANA-097 image.
- `qc/polygon_dict.pkl`: shape type (ellipse or polygon) and vertex coordinates defining each ROI in `qc/ROI_table.csv`.
- `qc/qcmask_cell.tif`: cell segmentation mask annotated by artifact class: 0=background, 1=artifact-free, 2=fluorescence aberration, 3=slide debris, 4=coverslip air bubble, 5=uneven immunolabeling, 6=image blur.
- `qc/qcmask_pixel.tif`: pixel-level ROI mask: 1=no ROI, 2=fluorescence aberration, 3=slide debris, 4=coverslip air bubble, 5=uneven immunolabeling, 6=image blur.
- `qc/truth.csv`: multiclass ground truth annotations for the 1,242,756 cells comprising the SARDANA-097 image.
- `seg/WD-76845-097.ome.tif`: segmentation outlines defining cell boundaries in the SARDANA-097 image.
- `tif/WD-76845-097.ome.tif`: stitched and registered 40-channel OME-TIFF file constituting the SARDANA-097 image.
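As a starting point, the tabular files and masks can be loaded as follows; a minimal sketch assuming the directory layout above (the full-resolution images are large, so they are opened lazily rather than read into memory):

```python
import pandas as pd
import tifffile
import zarr

# Single-cell feature table and per-cell ground truth annotations
features = pd.read_csv("01-artifacts/csv/unmicst-WD-76845-097_cellRing.csv")
truth = pd.read_csv("01-artifacts/qc/truth.csv")

# The 40-channel image and QC masks are large; open them as zarr stores
img = zarr.open(tifffile.imread("01-artifacts/tif/WD-76845-097.ome.tif", aszarr=True), mode="r")
qc_cell = zarr.open(tifffile.imread("01-artifacts/qc/qcmask_cell.tif", aszarr=True), mode="r")

print(features.shape, truth.shape)
```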
While the SARDANA-097 image comprises a total of 40 channels, artifacts were only curated from 22 of these, as some channels either contained signals from secondary antibodies alone or were determined unsuitable for the purposes of this hackathon challenge. Please consider the following channels for model training:
```
'Hoechst0', 'anti_CD3', 'anti_CD45RO', 'Keratin_570', 'aSMA_660',
'CD4_488', 'CD45_PE', 'PD1_647', 'CD20_488', 'CD68_555', 'CD8a_660',
'CD163_488', 'FOXP3_570', 'PDL1_647', 'Ecad_488', 'Vimentin_555',
'CDX2_647', 'LaminABC_488', 'Desmin_555', 'CD31_647', 'PCNA_488',
'CollagenIV_647'
```
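When slicing channels from the image or columns from the feature table, `markers/markers.csv` can be used to translate these marker names into channel indices. A sketch, where the `marker_name` column and row-order-matches-channel-order behavior are assumptions about the layout of `markers.csv`:

```python
import pandas as pd

# The 22 channels curated for this challenge
CURATED = [
    "Hoechst0", "anti_CD3", "anti_CD45RO", "Keratin_570", "aSMA_660",
    "CD4_488", "CD45_PE", "PD1_647", "CD20_488", "CD68_555", "CD8a_660",
    "CD163_488", "FOXP3_570", "PDL1_647", "Ecad_488", "Vimentin_555",
    "CDX2_647", "LaminABC_488", "Desmin_555", "CD31_647", "PCNA_488",
    "CollagenIV_647",
]

markers = pd.read_csv("01-artifacts/markers/markers.csv")
# Row order in markers.csv is assumed to match the image channel order
channel_idx = markers.index[markers["marker_name"].isin(CURATED)].tolist()
```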
Examples of artifact classes in the SARDANA-097 image:
Classifier output should consist of a CSV file named `scores.csv` containing probability scores for each of the six classes (1=artifact-free, 2=fluorescence aberration, 3=slide debris, 4=coverslip air bubble, 5=uneven immunolabeling, 6=image blur) for each cell in the dataset. The table should be formatted as follows:
"CellID","1","2","3","4","5","6"
1,0.11,0.29,0.13,0.35,0.05,0.07
2,0.09,0.49,0.14,0.17,0.10,0.02
3,0.28,0.06,0.20,0.06,0.10,0.29
.
.
.
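A table in this format can be written directly from the `predict_proba` output of any scikit-learn-style probabilistic classifier. The sketch below trains a random forest on the feature table; the choice of features, the `class_label` column name in `truth.csv`, and the in-sample fit are all illustrative assumptions:

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

features = pd.read_csv("01-artifacts/csv/unmicst-WD-76845-097_cellRing.csv")
truth = pd.read_csv("01-artifacts/qc/truth.csv")

X = features.drop(columns=["CellID"]).to_numpy()
y = truth["class_label"].to_numpy()  # assumed column name

# In practice, fit on held-out folds rather than the full dataset
clf = RandomForestClassifier(n_estimators=200, random_state=0)
clf.fit(X, y)

proba = clf.predict_proba(X)  # column order follows clf.classes_ (1..6)
scores = pd.DataFrame(proba, columns=[str(c) for c in clf.classes_])
scores.insert(0, "CellID", features["CellID"])
scores.to_csv("scores.csv", index=False)
```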
Classifier predictions will be evaluated against the multiclass ground truth annotations (`qc/truth.csv`) using receiver operating characteristic (ROC) curve analysis by passing `scores.csv` and `qc/truth.csv` as ordered arguments to `roc.py`:
```
$ python roc.py scores.csv truth.csv
```
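`roc.py` is provided with the challenge materials. For intuition, a one-vs-rest version of the same analysis can be sketched with scikit-learn (assuming `truth.csv` carries a `class_label` column aligned row-for-row with `scores.csv`):

```python
import pandas as pd
from sklearn.metrics import roc_auc_score
from sklearn.preprocessing import label_binarize

scores = pd.read_csv("scores.csv")
truth = pd.read_csv("truth.csv")  # assumed 'class_label' column

classes = [1, 2, 3, 4, 5, 6]
y_true = label_binarize(truth["class_label"], classes=classes)
y_score = scores[[str(c) for c in classes]].to_numpy()

# Per-class (one-vs-rest) and macro-averaged areas under the ROC curve
for i, cls in enumerate(classes):
    print(f"class {cls}: AUC={roc_auc_score(y_true[:, i], y_score[:, i]):.3f}")
print(f"macro-average AUC={roc_auc_score(y_true, y_score, average='macro'):.3f}")
```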
Classifier predictions will also be scored against per-class binarized ground truth labels using measures of precision and recall. This is achieved by first making discrete artifact class calls and saving them to a file named `calls.csv`:
"CellID","class_label"
1,1
2,2
3,3
4,1
5,2
.
.
.
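One simple way to derive `calls.csv` from `scores.csv` is to assign each cell the class with the highest probability; a minimal argmax sketch (per-class thresholds are an equally valid strategy):

```python
import pandas as pd

scores = pd.read_csv("scores.csv")
class_cols = ["1", "2", "3", "4", "5", "6"]

calls = pd.DataFrame({
    "CellID": scores["CellID"],
    # idxmax returns the name of the highest-scoring column; cast back to int
    "class_label": scores[class_cols].idxmax(axis=1).astype(int),
})
calls.to_csv("calls.csv", index=False)
```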
`calls.csv` and `qc/truth.csv` can then be passed as ordered arguments to `pr.py` to compute precision and recall for individual and combined artifact classes:
```
$ python pr.py calls.csv truth.csv
Fluor: precision=0.78, recall=0.67
Debris: precision=0.61, recall=0.45
Bubble: precision=0.73, recall=0.84
Staining: precision=0.90, recall=0.62
Blur: precision=0.57, recall=0.56
Overall: precision=0.87, recall=0.79
```
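Conceptually, these numbers can be reproduced with scikit-learn as below; note that treating "Overall" as any-artifact (classes 2-6) versus artifact-free (class 1) is an assumption about `pr.py`'s behavior:

```python
import pandas as pd
from sklearn.metrics import precision_score, recall_score

calls = pd.read_csv("calls.csv")
truth = pd.read_csv("truth.csv")  # assumed 'class_label' column

y_true, y_pred = truth["class_label"], calls["class_label"]

# Per-class precision/recall for the five artifact classes (2-6)
names = ["Fluor", "Debris", "Bubble", "Staining", "Blur"]
for cls, name in zip(range(2, 7), names):
    p = precision_score(y_true == cls, y_pred == cls)
    r = recall_score(y_true == cls, y_pred == cls)
    print(f"{name}: precision={p:.2f}, recall={r:.2f}")

# Overall: any artifact (classes 2-6) vs. artifact-free (class 1)
p = precision_score(y_true > 1, y_pred > 1)
r = recall_score(y_true > 1, y_pred > 1)
print(f"Overall: precision={p:.2f}, recall={r:.2f}")
```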
Questions to consider:

- Ground truth labels can themselves be inaccurate. How might classifiers be developed to guard against 1) artifact misclassification, 2) false positives (i.e., artifact-free cells inadvertently classified as noisy), and 3) false negatives (i.e., artifacts that have gone unannotated)?
- Which model type achieves superior classifier performance: those trained on single-cell feature tables (i.e., `csv/unmicst-WD-76845-097_cellRing.csv`), or those trained on pixel-level imaging data (`tif/WD-76845-097.ome.tif`)? What about hybrid models trained on both data types?
Suggested tools and software packages:

- High-level programming language (Python 3 is recommended)
- Core data science packages (e.g., `pandas`, `numpy`, and `scipy`)
- Libraries for reading, writing, analyzing, and visualizing multi-channel TIFF images (e.g., `tifffile`, `skimage`, `matplotlib`, `napari`)
- Machine learning and artificial intelligence libraries (e.g., `scikit-learn`, `tensorflow`, `keras`, `pytorch`)
If using Python 3, the aforementioned libraries can be installed in a new Python virtual environment dedicated to this project by running the following commands:
```
# on Mac
$ python3 -m venv ~/artifacts        # Creates a new Python virtual environment in the home directory
$ source ~/artifacts/bin/activate    # Activates the newly created virtual environment
$ pip install -r requirements.txt    # Installs software packages using the "requirements.txt" file in this GitHub repo
```
Virtual check-ins will occur daily at 9am & 1pm (US EST) at the following Zoom link:
- https://harvard.zoom.us/j/97485448563?pwd=dWw3VDA5RUZ3emhHVFJsMVZtSUMydz09
- For questions outside of these times, please post to the #01-artifacts Slack channel.