
DAb-seq: combined single-cell DNA genotyping and protein quantification.

AbateLab/DAb-seq

DAb-seq

DAb-seq: combined single-cell DNA and Antibody sequencing.

Published as: Demaree*, B., Delley*, C.L., Vasudevan, H.N. et al. Joint profiling of DNA and proteins in single cells to dissect genotype-phenotype associations in leukemia. Nature Communications 12, 1583 (2021).

Cite this repository:

DOI

Introduction

DAb-seq is a multiomic tool combining targeted genotyping and immunophenotyping in single cells. Through the use of DNA-antibody conjugates, phenotypic signal is encoded into next-generation sequencing data, providing a readout analogous to that of flow cytometry. The result is a dataset of linked proteogenomic information from thousands of single cells.

DAb-seq Workflow

The experimental DAb-seq workflow, from sample collection to bioinformatics.

The DAb-seq data analysis pipeline includes modules for targeted DNA genotyping and antibody tag counting. The primary outputs of the pipeline are a genotyping matrix of variant calls by cell and a matrix of antibody UMI counts by cell.

Input File Requirements

The pipeline requires a configuration file and paired-end, compressed FASTQ files (which must end in ".fastq.gz"), in addition to other mode-specific files. The configuration file is subdivided into sections for each analysis module (see the provided example, dabseq.hg19.cfg).

The folder structure of the input FASTQ files needs to be set up in a specific way to allow the program to find everything. For a given cohort (CohortA) of samples to joint-genotype (Timepoint1 and Timepoint2), the folder structure should look like:

Timepoint1 DNA Panel:   .../CohortA/Timepoint1/fastq/panel/<filename.fastq.gz>
Timepoint1 Abs:         .../CohortA/Timepoint1/fastq/abs/<filename.fastq.gz>

Timepoint2 DNA Panel:   .../CohortA/Timepoint2/fastq/panel/<filename.fastq.gz>
Timepoint2 Abs:         .../CohortA/Timepoint2/fastq/abs/<filename.fastq.gz>

For each <filename.fastq.gz>, both R1 and R2 files need to be present. When sequencing multiple tubes of Tapestri output (e.g. grouping tubes 1-4 and 5-8 into two libraries), FASTQ files from multiple tubes should be placed in the same folder. Users should verify that the panel and antibody filenames remain in the same order when sorted lexicographically (a simple file-naming scheme like panel-A/abs-A, panel-B/abs-B, etc. works well).

When running the pipeline in dna-only or ab-only modes, the user is not required to create the folders for the missing file types.
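
The layout rules above can be checked before a run with a short script. The following is a minimal sketch, not part of the pipeline: the function name check_sample_layout is hypothetical, and it assumes that read mates are named identically except for an "_R1" / "_R2" token, which may not match your sequencer's naming.

```python
from pathlib import Path


def check_sample_layout(sample_dir, modes=("panel", "abs")):
    """Check <sample_dir>/fastq/<mode>/ for each requested mode.

    Returns (problems, pairing): a list of problem strings and the
    panel/abs R1 files paired by lexicographic order.
    """
    problems = []
    r1_by_mode = {}
    for mode in modes:
        folder = Path(sample_dir) / "fastq" / mode
        if not folder.is_dir():
            problems.append(f"missing folder: {folder}")
            continue
        files = sorted(p.name for p in folder.iterdir()
                       if p.name.endswith(".fastq.gz"))
        if not files:
            problems.append(f"no .fastq.gz files in {folder}")
        # ASSUMPTION: mates differ only by an "_R1" / "_R2" token.
        r1 = [f for f in files if "_R1" in f]
        r2 = [f for f in files if "_R2" in f]
        for name in r1:
            if name.replace("_R1", "_R2", 1) not in r2:
                problems.append(f"{mode}: no R2 mate for {name}")
        for name in r2:
            if name.replace("_R2", "_R1", 1) not in r1:
                problems.append(f"{mode}: no R1 mate for {name}")
        r1_by_mode[mode] = r1

    # Pair panel and antibody libraries in sorted order so the user can
    # confirm that each panel file lines up with the intended abs file.
    pairing = []
    if "panel" in r1_by_mode and "abs" in r1_by_mode:
        panel, abs_ = r1_by_mode["panel"], r1_by_mode["abs"]
        if len(panel) != len(abs_):
            problems.append(f"{len(panel)} panel vs {len(abs_)} abs R1 files")
        pairing = list(zip(panel, abs_))
    return problems, pairing
```

Calling check_sample_layout on a sample folder such as .../CohortA/Timepoint1 returns any problems found plus the sorted panel/abs pairing, which is worth reading by eye. For dna-only or ab-only runs, pass modes=("panel",) or modes=("abs",) so the absent folder is not reported.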

Software Dependencies

The following software packages should be installed and located on the user's PATH. The version numbers shown are those used in the DAb-seq publication.

  • GATK (4.1.3.0)
  • bowtie2 (2.3.4.1)
  • ITDseek (1.2)
  • samtools (1.8)
  • bedtools (2.27.1)
  • bcftools (1.9)
  • cutadapt (2.4)
  • BBMap (38.57)
  • snpEff (4.3t)
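
For a native (non-Docker) install, a quick way to confirm that the tools are reachable is to look each one up on the PATH. This is a rough sketch under stated assumptions: the executable names in EXPECTED_TOOLS are the common launcher names for these packages, not names confirmed by the pipeline, and ITDseek, BBMap, and snpEff are left out because their entry points (shell scripts or jar files) vary by installation. The helper name missing_tools is hypothetical.

```python
import shutil

# ASSUMPTION: typical launcher names; adjust to match your installation.
EXPECTED_TOOLS = {
    "GATK": "gatk",
    "bowtie2": "bowtie2",
    "samtools": "samtools",
    "bedtools": "bedtools",
    "bcftools": "bcftools",
    "cutadapt": "cutadapt",
}


def missing_tools(tools=EXPECTED_TOOLS, which=shutil.which):
    """Return the package names whose executable is not found on PATH."""
    return sorted(name for name, exe in tools.items() if which(exe) is None)
```

An empty list from missing_tools() means every listed executable was found. It does not check version numbers, so versions should still be compared against the list above by hand.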

To simplify installation and enhance data reproducibility, the pipeline can also be run in a Docker container. Instructions for building and running the DAb-seq image are listed in the section DAb-seq in Docker.

Running the Pipeline

The pipeline is run through the main Python script, dabseq_pipeline.py, which must be run first in barcode mode and then in genotype mode.

barcode Mode

In barcode mode, the pipeline processes raw FASTQ files according to settings in a configuration file. The script demultiplexes DNA panel amplicons and antibody tags into single cells, aligns panel reads to the human genome, and generates a GVCF file for each cell. All samples belonging to the same cohort must be individually barcoded before joint genotyping.

genotype Mode

In genotype mode, the pipeline calls variants for a single cohort, containing samples that have been individually processed in barcode mode. The script imports GVCFs into a GenomicsDB database (GATK GenomicsDBImport) and calls genotypes (GATK GenotypeGVCFs) separately for each genomic interval in parallel.
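
To illustrate the per-interval pattern described above, the sketch below builds the two GATK commands for one interval and runs intervals concurrently. It is not the pipeline's own code: the function names, output paths, and the max_workers default are assumptions made for illustration, and only standard GATK4 arguments (--genomicsdb-workspace-path, -V, -L, -R, -O, and the gendb:// prefix) are used.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def interval_commands(gvcfs, interval, reference, workdir):
    """Build the GenomicsDBImport and GenotypeGVCFs commands for one interval."""
    tag = interval.replace(":", "_")  # make the interval safe for file names
    workspace = f"{workdir}/{tag}.gdb"  # hypothetical output location
    import_cmd = ["gatk", "GenomicsDBImport",
                  "--genomicsdb-workspace-path", workspace,
                  "-L", interval]
    for gvcf in gvcfs:  # one -V argument per single-cell GVCF
        import_cmd += ["-V", gvcf]
    genotype_cmd = ["gatk", "GenotypeGVCFs",
                    "-R", reference,
                    "-V", f"gendb://{workspace}",
                    "-O", f"{workdir}/{tag}.vcf.gz"]
    return import_cmd, genotype_cmd


def run_interval(gvcfs, interval, reference, workdir):
    """Import, then genotype, a single interval; raises if either step fails."""
    for cmd in interval_commands(gvcfs, interval, reference, workdir):
        subprocess.run(cmd, check=True)


def genotype_all(gvcfs, intervals, reference, workdir, max_workers=4):
    """Process intervals concurrently, at most max_workers at a time."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(run_interval, gvcfs, iv, reference, workdir)
                   for iv in intervals]
        for future in futures:
            future.result()  # re-raise any failure from a worker
```

Threads are sufficient here because the heavy work happens in the GATK subprocesses. Lowering max_workers trades run time for a smaller memory and CPU footprint.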

DAb-seq in Docker

Running the DAb-seq pipeline in a Docker container is recommended and ensures that the pipeline dependencies are installed and configured properly. The Docker image is hosted on Docker Hub.

  1. Build the DAb-seq image using Docker:
docker build <path_to_dabseq_repo> -t dab-seq:human

Alternatively, pull the Docker image from Docker Hub:

docker pull bendemaree/dab-seq:human

There are two tags of the Docker image available: base and human. Any study involving human cells should use the human tag. The smaller base image is intended for non-human studies and does not include the hg19 FASTA file and associated indices.

  2. Organize the input files on the host machine in the same file structure as described in Input File Requirements.

  3. Edit the included example bash script run_dabseq_docker.sh with the appropriate cohort and sample information. Samples to process should be added as needed. The user running the pipeline should also be specified, and this user must have read/write access to the input FASTQ files.

  4. Run run_dabseq_docker.sh. You may need to change the file permissions to allow execution before running (e.g. chmod +x run_dabseq_docker.sh).

That's it! The pipeline will run in the Docker container and produce output files at the mounted locations on the host machine.

Memory and CPU Considerations

DNA genotyping is CPU- and memory-intensive, particularly when processing thousands of single-cell samples. The DAb-seq pipeline is implemented with tunable parallelization to scale to the resources available on different systems. In its default configuration, the pipeline requires at least 64 GB of memory and 16 physical threads to run. Reducing the number of cells aligned simultaneously or the number of genomic intervals genotyped in parallel can lower this hardware requirement significantly, with a corresponding increase in total processing time.

Output Files

DAb-seq Output

The DAb-seq output can be represented as linked genotyping and antibody UMI count tables.

Output genotyping data is saved in a new directory labeled GENOTYPING in the root of each cohort. A compressed HDF5 file contains matrices of discrete genotyping calls and antibody UMI counts for all cells in the cohort. Genotyping calls for each variant assume a genome ploidy of 2. Each matrix entry therefore takes one of four values:

  • Wildtype (0)
  • Heterozygous Alternate (1)
  • Homozygous Alternate (2)
  • No Call (3)

Further details on reading and manipulating data from this file can be found in the included Python notebook in the sample_data folder.
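
One practical consequence of this encoding is that the No Call value (3) is a flag, not an allele count, so it has to be excluded before averaging. The sketch below is a minimal illustration of that point for a single variant. It takes a plain list of calls across cells and deliberately does not read the HDF5 file, since the dataset names and matrix orientation are covered in the notebook. The function name summarize_variant is hypothetical.

```python
# Genotype codes as listed above (diploid encoding).
WILDTYPE, HET, HOM, NO_CALL = 0, 1, 2, 3


def summarize_variant(calls):
    """Summarize one variant's genotype calls across cells.

    calls: iterable of integer codes (0, 1, 2 or 3), one per cell.
    """
    calls = list(calls)
    called = [c for c in calls if c != NO_CALL]  # drop No Call entries
    n_called = len(called)
    mutated = sum(1 for c in called if c in (HET, HOM))
    # For called cells the code equals the number of alternate alleles.
    alt_alleles = sum(called)
    return {
        "call_rate": n_called / len(calls) if calls else 0.0,
        "mutated_fraction": mutated / n_called if n_called else None,
        "alt_allele_fraction": alt_alleles / (2 * n_called) if n_called else None,
    }
```

For example, summarize_variant([0, 1, 2, 3]) reports three of four cells called, two of the three called cells mutated, and three alternate alleles out of six. Averaging the raw codes directly would instead count the No Call cell as if it carried three alternate alleles.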

Non-Human Organism Support

The DAb-seq pipeline includes support for non-human organisms using the flag --non-human. When combined with additional settings such as --ploidy 1, the pipeline can be used to perform single-cell genotyping on haploid bacteria and yeast cells.

Sequencing Data and Examples

All data from the initial DAb-seq publication is available in FASTQ format on the NCBI Sequence Read Archive. In addition, the compressed HDF5 file for the three-cell-line validation experiment is included in the sample_data folder of this repository. A Jupyter Notebook providing example analysis code is also included in this folder.