CODEX Pipeline

This repository contains the pipeline for analyzing CODEX output images.

Introduction

This repository contains the Python scripts needed to analyze CODEX data. The scripts perform image registration and nucleus segmentation, then extract the average brightness of each cell and save it to a .csv file.

Usage

Repo Contents:

This repository contains the following scripts, intended to be run on data generated by CODEX:

shift registration

  • requirements_registrator.txt: List of requirements to run shift registration. Activate your conda environment and run pip install -r requirements_registrator.txt
  • registrator.py: perform a focus stack using local (provide path) or remote (using GCSFS) images.

segmentation

  • requirements_segment.txt: requirements for operation, install using pip install -r requirements_segment.txt
  • pbmc_cellpose_model.pth: pretrained cellpose model for PBMC.
  • segmenter.py: segment nuclei with Cellpose using a pretrained model.

analysis

  • requirements_analyze.txt: requirements for operation, install using pip install -r requirements_analyze.txt
  • analyzer.py: measure the size and average brightness of each cell in each channel and save the results as a CSV. Shift registration and segmentation must be run on the data before this step.

view cells

  • cell_crop.py: Once the cells are identified, load the CSV, find a certain number of each cell and save them as images.

Usage Guide

preliminary initialization

First, install Miniconda for Python 3, which will manage our packages and Python environment, by following this guide: https://docs.conda.io/projects/conda/en/latest/user-guide/install/linux.html

Next, install CUDA 11.3 and cuDNN 8.5.0 from nvidia. This lets us use the graphics card for accelerated ML and image processing. We need version 11.3 for compatibility with Cellpose and M2Unet.

CUDA: run the following commands in order:

  • sudo apt-get update
  • sudo apt-get upgrade
  • sudo apt-get install cuda=11.3.1-1
  • sudo apt-get install nvidia-gds=11.4.1-1
  • export PATH="/usr/local/cuda-11.3/bin:$PATH"
  • export LD_LIBRARY_PATH="/usr/local/cuda-11.3/lib64:$LD_LIBRARY_PATH"
  • sudo reboot

After rebooting, verify that PATH and LD_LIBRARY_PATH contain the CUDA directories. If they do not, add the two export lines to ~/.bashrc.
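One way to check the exports is a short Python script. This is a minimal sketch and not part of the repository: the helper name `cuda_dirs_exported` is hypothetical, and it only inspects the environment variables; it does not test the driver or the GPU.

```python
import os

CUDA_BIN = "/usr/local/cuda-11.3/bin"
CUDA_LIB = "/usr/local/cuda-11.3/lib64"


def cuda_dirs_exported(env=None):
    """Return (bin_ok, lib_ok): whether the CUDA 11.3 directories are
    listed in PATH and LD_LIBRARY_PATH respectively."""
    env = os.environ if env is None else env
    # Both variables are colon-separated lists of directories.
    path_entries = env.get("PATH", "").split(":")
    lib_entries = env.get("LD_LIBRARY_PATH", "").split(":")
    return CUDA_BIN in path_entries, CUDA_LIB in lib_entries


if __name__ == "__main__":
    bin_ok, lib_ok = cuda_dirs_exported()
    print("PATH ok:", bin_ok, "| LD_LIBRARY_PATH ok:", lib_ok)
```

If either value prints False, the corresponding export did not persist and should be added to ~/.bashrc.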

cuDNN: Follow the directions for Ubuntu network installation here: https://docs.nvidia.com/deeplearning/cudnn/install-guide/index.html#package-manager-ubuntu-install Make sure you install a version of cuDNN compatible with CUDA 11.3 (e.g. libcudnn8=8.2.1.32-1+cuda11.3 and libcudnn8-dev=8.2.1.32-1+cuda11.3).

Create a new conda environment and install the requirements: run conda create --name pipeline, then conda activate pipeline, then follow this guide to install the correct version of PyTorch: https://pytorch.org/get-started/locally/. Next, run the following commands: pip install -r requirements_fstack.txt, pip install -r requirements_segment.txt, pip install -r requirements_analyze.txt, pip install -r requirements_deepzoom.txt

registrator usage

registrator operation

Shift-register and crop every imageset from a CODEX experiment. We assume the following file structure:

  • src source directory (can be any valid string)
    • exp_id_1 experiment ID (can be any string)
      • index.csv CSV with the valid cycle names (must be named "index.csv")
      • cycle1 folder with name of first cycle (can be any string as long as it matches index.csv)
        • 0 folder named "0" (must be named "0")
          • 0_0_0_Fluorescence_405_nm_Ex.bmp bmp image with this name format. The first number is the i coordinate, the second is the j coordinate, the third is the z height, and the rest of the filename is the channel wavelength
          • 0_0_0_Fluorescence_488_nm_Ex.bmp bmp with same coordinates as above but with a different channel
          • more BMPs
          • 6_7_5_Fluorescence_488_nm_Ex.bmp for our example, suppose i ranges from 0 to 6, j ranges from 0 to 7, and z index ranges from 0 to 5
    • exp_id_2 another experiment (can have any number)
      • 0
        • identical structure to exp_id_1

For each experiment ID, each channel Fluorescence_NNN_nm_Ex, and each pair of indices i and j in range, registrator.py generates an image named "i_j_f_Fluorescence_NNN_nm_Ex.png" (with the actual values of i, j, and NNN substituted) and saves it to either the src directory or a different directory of your choosing.
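The naming convention can be illustrated with a short Python sketch. The helpers `parse_name` and `output_name` are hypothetical and are not taken from registrator.py; they only show how the input and output filenames described above relate.

```python
import re

# Pattern for names such as "0_0_0_Fluorescence_405_nm_Ex.bmp":
# i index, j index, z index, then the channel name.
NAME_RE = re.compile(r"^(\d+)_(\d+)_(\d+)_(.+)\.bmp$")


def parse_name(filename):
    """Split an input filename into (i, j, z, channel)."""
    match = NAME_RE.match(filename)
    if match is None:
        raise ValueError(f"unexpected filename: {filename}")
    i, j, z, channel = match.groups()
    return int(i), int(j), int(z), channel


def output_name(i, j, channel):
    """Name of the registered image for one (i, j) view and channel."""
    return f"{i}_{j}_f_{channel}.png"


i, j, z, channel = parse_name("6_7_5_Fluorescence_488_nm_Ex.bmp")
print(i, j, z, channel)            # 6 7 5 Fluorescence_488_nm_Ex
print(output_name(i, j, channel))  # 6_7_f_Fluorescence_488_nm_Ex.png
```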

set registrator parameters

There are many parameters that set which images get focus stacked and where to save them. Here is the complete list of parameters and what they mean:

  • prefix: string. If you have an index.csv, leave this string empty. If you don't have an index.csv with the names of the cycles to analyze, you can select which cycles to run by prefix. For example, if you have three cycles named cycle_good1, cycle_also_good, and bad_cycle and you only want to run focus stacking on the two good datasets, you can set prefix="cy". Set prefix = '*' to get all folders.
  • key: string. If you are connecting to Google Cloud Storage, set this to the local path to the authentication token .json. If you are running locally, this doesn't matter.
  • gcs_project: string. Set this to the Google Cloud Storage project name if you are connecting to GCS. Otherwise, it doesn't matter.
  • src: string. Path to the folder that contains the experiment folder. Can be a GCS locator (e.g. gs://octopi-data) or a local path (e.g. /home/user/Documents/src/). Note: it must have a trailing slash (/).
  • dst: string. path to folder to save data. If left blank, the images will be stored in the same directory as src. registrator.py will recreate the source folder structure in this folder. Also must have a trailing slash.
  • exp: list of strings. List of experiment IDs. In the example above, this would be ["exp_id_1", "exp_id_2"]
  • cha: list of strings. List of channel names. In the example above, this would be ["Fluorescence_405_nm_Ex", "Fluorescence_488_nm_Ex"] but it should be left as-is.
  • typ: string. Filetype of the images to read. Leave as-is.
  • colors: Dictionary of lists. Leave as-is.
  • remove_background: boolean. Set True to run a white-tophat filter. This will increase runtime considerably.
  • invert_contrast: boolean. Set True to invert the image.
  • shift_registration: boolean. Set True to ensure images stay aligned across cycles.
  • subtract_background: boolean. Set True to subtract the minimum brightness from the entire image.
  • use_color: boolean. Set True when processing RGB images (untested).
  • crop_start and crop_end: integers. Crop the image; pixels with x or y coordinates less than crop_start or greater than crop_end are cropped out.
  • verbose: boolean. Set True to print extra details during operation (good for determining whether GPU stacking is faster than CPU)
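To make the prefix and crop parameters concrete, here is a minimal Python sketch. The functions `select_cycles` and `crop_square` are hypothetical illustrations, not the code in registrator.py. In particular, whether the pixel at crop_end itself is kept is an assumption; this sketch excludes it.

```python
import numpy as np


def select_cycles(folder_names, prefix):
    """Keep the cycle folders whose names start with `prefix`; '*' keeps all."""
    if prefix == "*":
        return list(folder_names)
    return [name for name in folder_names if name.startswith(prefix)]


def crop_square(image, crop_start, crop_end):
    """Keep rows and columns in [crop_start, crop_end).

    Assumption: the end bound is exclusive, as in normal Python slicing.
    """
    return image[crop_start:crop_end, crop_start:crop_end]


cycles = ["cycle_good1", "cycle_also_good", "bad_cycle"]
print(select_cycles(cycles, "cy"))  # ['cycle_good1', 'cycle_also_good']
print(select_cycles(cycles, "*"))   # all three folders

image = np.zeros((3000, 3000), dtype=np.uint8)  # stand-in for one image
print(crop_square(image, 100, 2900).shape)      # (2800, 2800)
```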

segmenter usage

segmenter operation

Cells don't move between cycles so we only need to segment one cycle for each (i,j) view. We generally choose to segment the nuclear channel because it is brightest. The script first loads all the channel paths, sorts them to find the 0th channel, then filters the image paths so only the 0th channel images are loaded. Alternatively, we can choose to segment all the images. Images are then segmented one at a time. In principle, Cellpose can work faster by segmenting multi-image batches but in my experience not all GPUs can handle segmenting multiple images. Cellpose then saves a series of .npy files with the masks to the destination directory.
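The sort-then-filter step can be sketched in Python as follows. This is an illustration under stated assumptions, not the code in segmenter.py: the helper names and the example paths are hypothetical, and it assumes filenames follow the i_j_z_channel pattern so that the channel that sorts first is the one to segment.

```python
def channel_of(path):
    """Channel name from a path ending in 'i_j_z_<channel>.<ext>'."""
    filename = path.rsplit("/", 1)[-1]
    stem = filename.rsplit(".", 1)[0]
    # Split off the i, j, and z fields; the remainder is the channel.
    return stem.split("_", 3)[3]


def first_channel_paths(paths):
    """Keep only the images of the channel that sorts first."""
    channels = sorted({channel_of(p) for p in paths})
    first = channels[0]
    return [p for p in sorted(paths) if channel_of(p) == first]


paths = [
    "cycle1/0/0_0_f_Fluorescence_488_nm_Ex.png",
    "cycle1/0/0_0_f_Fluorescence_405_nm_Ex.png",
    "cycle1/0/0_1_f_Fluorescence_405_nm_Ex.png",
]
# Only the two 405 nm images remain, one per (i, j) view.
print(first_channel_paths(paths))
```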

If you are having trouble installing cellpose, try uninstalling all other Python packages that use QTpy (e.g. cv2), install cellpose, then reinstall everything you uninstalled. If you are using CUDA, ensure your CUDA version is compatible with your version of torch. Cellpose[gui] seems to only work in the base channel.

set segmenter parameters
  • root_dir: string. local or remote path to where the images are stored
  • exp_id: string. experiment ID to get the images from. see "registrator operation" above for more detail.
  • channel: string. image channel to randomly select from. see "registrator operation" above for more detail.
  • zstack: string. if you ran registrator.py and want to use the shift-registered images, set this value to "f". otherwise set it to the z stack you want to use (for example, if z=5 is in focus for all cells across all images, you can set zstack="5" and those images will be utilized)
  • key: string. If you are connecting to a Google Cloud File Storage, set this to the local path to the authentication token .json.
  • gcs_project: string. Set this to the Google Cloud Storage project name if you are connecting to GCS. Otherwise, it doesn't matter.
  • cpmodel: string. path to cellpose model. Can be remote or local.
  • use_gpu: boolean. Set to True to try segmenting using the GPU.
  • segment_all: boolean. Set to True to segment all cycles, not just the first
  • channels: list of ints. Which channel to run segmentation on. Set to [0,0] for the monochrome channel

analyzer usage

analyzer operation

We assume cell positions don't change between cycles and we can mask the cells by expanding the nucleus masks from segmenter.py. We have a mask for each (i,j) view; for each view we load all the images across all cycles at that view and load the nucleus mask. The mask is expanded to mask the entire cell. Then for each cell in the view, for each channel, for each cycle we calculate the average brightness of the cell and store it in a csv.
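The mask expansion and brightness measurement can be sketched with NumPy. This is a minimal illustration and not the implementation in analyzer.py: the function names are hypothetical, the expansion is done with a square window of odd width, and where two expanded nuclei overlap this sketch simply gives the pixel to the larger label.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def expand_labels(nuclei, expansion):
    """Grow each labelled nucleus using a square window of odd width."""
    if expansion % 2 == 0:
        raise ValueError("expansion must be odd")
    pad = expansion // 2
    padded = np.pad(nuclei, pad, mode="constant", constant_values=0)
    windows = sliding_window_view(padded, (expansion, expansion))
    grown = windows.max(axis=(2, 3))
    # Nucleus pixels keep their label; background takes a neighbour's label.
    return np.where(nuclei > 0, nuclei, grown)


def mean_brightness(image, cell_mask):
    """Map each cell label to the mean pixel value inside its mask."""
    labels = np.unique(cell_mask)
    return {int(l): float(image[cell_mask == l].mean()) for l in labels if l != 0}


nuclei = np.zeros((7, 7), dtype=np.int32)
nuclei[3, 3] = 1                      # a one-pixel "nucleus" with label 1
cells = expand_labels(nuclei, 3)      # becomes a 3x3 cell mask
image = np.full((7, 7), 10.0)
image[3, 3] = 100.0
print(int((cells == 1).sum()))        # 9
print(mean_brightness(image, cells))  # {1: 20.0}
```

In the pipeline this measurement would be repeated for every channel and every cycle at the same (i, j) view, reusing one mask.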

The CSV columns are cell index in a given view, the i index, j index, x position in the image, y position in the image, the number of pixels in the expanded mask, and the number of pixels in the nuclear mask. Then, there is a column for each element in the cartesian product of channel index and cycle index. The header is re-printed in the csv and the cell index resets for each view.
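The column layout can be sketched as follows. The column names and the ordering of the channel/cycle pairs are assumptions made for illustration; the actual header written by analyzer.py may differ. The second half shows one way to skip the repeated header rows when reading the file back.

```python
import csv
import io
from itertools import product

# Hypothetical names for the seven fixed columns described above.
FIXED = ["cell", "i", "j", "x", "y", "cell_px", "nucleus_px"]


def build_header(n_ch, n_cycles):
    """Fixed columns followed by one column per (channel, cycle) pair."""
    pairs = product(range(n_ch), range(n_cycles))
    return FIXED + [f"ch{ch}_cy{cy}" for ch, cy in pairs]


header = build_header(n_ch=2, n_cycles=3)
print(header[7:])
# ['ch0_cy0', 'ch0_cy1', 'ch0_cy2', 'ch1_cy0', 'ch1_cy1', 'ch1_cy2']

# The header repeats for each view, so a reader can drop the repeats.
text = "cell,i,j\n0,0,0\n1,0,0\ncell,i,j\n0,0,1\n"
rows = [r for r in csv.reader(io.StringIO(text)) if r[0] != "cell"]
print(rows)  # [['0', '0', '0'], ['1', '0', '0'], ['0', '0', '1']]
```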

set analyzer parameters
  • cy_name: string. If segment_all = True was set during segmentation, set this to the name of the cycle you want to use as the mask. If segment_all = False, set cy_name = "first".
  • start_idx, end_idx: integers. Select a range of cycles to analyze. Set both to -1 to select all.
  • n_ch: integer. Number of channels to analyze
  • expansion: integer, must be odd. The number of pixels to expand around the nucleus mask to create the cell masks.
  • root_dir: string. local or remote path to where the images are stored
  • exp_id: string. experiment ID to get the images from. see "registrator operation" above for more detail.
  • channel: string. image channel to randomly select from. see "registrator operation" above for more detail.
  • zstack: string. if you ran registrator.py and want to use the shift registered images, set this value to "f". otherwise set it to the z stack you want to use (for example, if z=5 is in focus for all cells across all images, you can set zstack="5" and those images will be utilized)
  • key: string. If you are connecting to a Google Cloud File Storage, set this to the local path to the authentication token .json.
  • gcs_project: string. Set this to the Google Cloud Storage project name if you are connecting to GCS. Otherwise, it doesn't matter.
  • mask_union: boolean. Set to true to save a .npy pickle for each view with the union of all nuclear and cell masks.
  • out: string. Path to store the csv. Local or remote path.

cell crop usage

cell crop operation

We first read the cell type csv file. For each row, we check whether the cell type is one we have already seen enough times. If so, we skip it. If we haven't seen this cell type enough times, we increment the number of times we have seen the cell type and load the image for each channel and for each cycle. Then, we crop the images such that the cell is centered and concatenate the different channels/cycles together to form a single image. Then we save the image to disk and move on to the next row in the csv.
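The loop described above can be sketched in Python. This is a hypothetical illustration rather than the code in cell_crop.py: the row keys ("cell_type", "x", "y"), the helper names, and the stand-in images are assumptions, and the crop does no handling of cells near the image edge.

```python
from collections import Counter

import numpy as np


def pick_rows(rows, n_of_each_type):
    """Return the first `n_of_each_type` rows seen for every cell type."""
    seen = Counter()
    picked = []
    for row in rows:
        cell_type = row["cell_type"]
        if seen[cell_type] >= n_of_each_type:
            continue  # already have enough examples of this type
        seen[cell_type] += 1
        picked.append(row)
    return picked


def crop_cell(image, x, y, cell_radius):
    """Square crop of side 2 * cell_radius centred on (x, y)."""
    return image[y - cell_radius:y + cell_radius, x - cell_radius:x + cell_radius]


rows = [
    {"cell_type": "A", "x": 20, "y": 30},
    {"cell_type": "A", "x": 40, "y": 40},
    {"cell_type": "B", "x": 25, "y": 25},
    {"cell_type": "A", "x": 50, "y": 50},
]
# Stand-ins for the same view imaged in two channels.
channel_images = [np.zeros((64, 64)), np.ones((64, 64))]

for row in pick_rows(rows, n_of_each_type=2):
    crops = [crop_cell(img, row["x"], row["y"], cell_radius=8) for img in channel_images]
    combined = np.concatenate(crops, axis=1)  # channels side by side
    print(row["cell_type"], combined.shape)   # each is (16, 32)
```

The fourth row is skipped because two cells of type A have already been collected.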

set cell crop parameters
  • root_dir: string. local or remote path to where the images are stored
  • exp_id: string. experiment ID to get the images from. see "registrator operation" above for more detail.
  • dest_dir: string. local or remote path to where the cropped images are saved
  • channels: list of strings. channels to read from when making the combined image.
  • celltype_file: string. local or remote path to celltype .csv file.
  • zstack: string. if you ran registrator.py and want to use the shift registered images, set this value to "f". otherwise set it to the z stack you want to use (for example, if z=5 is in focus for all cells across all images, you can set zstack="5" and those images will be utilized)
  • key: string. If you are connecting to a Google Cloud File Storage, set this to the local path to the authentication token .json.
  • gcs_project: string. Set this to the Google Cloud Storage project name if you are connecting to GCS. Otherwise, it doesn't matter.
  • cell_radius: integer. half the width/height of the final views.
  • n_of_each_type: integer. number of each cell type to make views for
  • ftype: string. file type to save the outputs as
  • subtract_min: boolean. set "True" to subtract the minimum value off from each view.

Contributing

Currently, this project only accepts outside contributions in the form of bug reports, feature requests, documentation improvements, and bug fixes. Additional features carry a long-term maintenance burden which this project is not yet mature enough to support, so pull requests adding new features will not be accepted at this time. Please start a new discussion (in the "Discussions" tab of this repository on Github) if you have a new feature you'd like to propose for this project.

Licensing

We have chosen the following licenses in order to give away our work for free, so that you can freely use it for whatever purposes you have, with minimal restrictions, while still protecting our disclaimer that this work is provided without any warranties at all. If you're using this project, or if you have questions about the licenses, we'd love to hear from you. Please start a new discussion thread in the "Discussions" tab of this repository on GitHub or email us at lietk12@gmail.com.

Software

Except where otherwise indicated in this repository, software files provided here are covered by the following information:

Copyright Prakash Lab and template-permissive project contributors

SPDX-License-Identifier: Apache-2.0 OR BlueOak-1.0.0

Software files in this project are released under the Apache License v2.0 and the Blue Oak Model License 1.0.0; you can use the source code provided here under the Apache License or under the Blue Oak Model License, and you get to decide which license you will agree to. We are making the software available under the Apache license because it's OSI-approved and it goes well together with the Solderpad Hardware License, which is an open hardware license used in various projects released by the Prakash Lab; but we like the Blue Oak Model License more because it's easier to read and understand. Please read and understand the licenses for the specific language governing permissions and limitations.

Everything else

Except where otherwise indicated in this repository, any other files (such as images, media, data, and textual documentation) provided here not already covered by software or hardware licenses (described above) are instead covered by the following information:

Copyright Prakash Lab and codex-analysis-pipeline project contributors

SPDX-License-Identifier: CC-BY-4.0

Files in this project are released under the Creative Commons Attribution 4.0 International License. Please read and understand the license for the specific language governing permissions and limitations.
