DAb-seq: combined single-cell DNA and Antibody sequencing.
Published as: Demaree*, B., Delley*, C.L., Vasudevan, H.N. et al. Joint profiling of DNA and proteins in single cells to dissect genotype-phenotype associations in leukemia. Nature Communications 12, 1583 (2021).
DAb-seq is a multiomic tool combining targeted genotyping and immunophenotyping in single cells. Through the use of DNA-antibody conjugates, phenotypic signal is encoded into next-generation sequencing data, providing a readout analogous to that of flow cytometry. The result is a dataset of linked proteogenomic information from thousands of single cells.
The experimental DAb-seq workflow, from sample collection to bioinformatics.
The DAb-seq data analysis pipeline includes modules for targeted DNA genotyping and antibody tag counting. The primary outputs of the pipeline are a genotyping matrix of variant calls by cell and a matrix of antibody UMI counts by cell.
The pipeline requires a configuration file and paired-end, compressed FASTQ files (which must end in ".fastq.gz"), in addition to other mode-specific files. The configuration file is subdivided into sections for each analysis module (see the provided example, dabseq.hg19.cfg).
The input FASTQ files must follow a fixed folder structure so that the pipeline can locate the reads for each sample. For a given cohort (CohortA) of samples to joint-genotype (Timepoint1 and Timepoint2), the folder structure should look like:
Timepoint1 DNA Panel: .../CohortA/Timepoint1/fastq/panel/<filename.fastq.gz>
Timepoint1 Abs: .../CohortA/Timepoint1/fastq/abs/<filename.fastq.gz>
Timepoint2 DNA Panel: .../CohortA/Timepoint2/fastq/panel/<filename.fastq.gz>
Timepoint2 Abs: .../CohortA/Timepoint2/fastq/abs/<filename.fastq.gz>
For each <filename.fastq.gz>, both R1 and R2 files need to be present. When sequencing multiple tubes of Tapestri output (e.g. grouping tubes 1-4 and 5-8 into two libraries), FASTQ files from multiple tubes should be placed in the same folder. Users should verify that the panel and antibody filenames remain in the same order when sorted lexicographically (a simple file-naming scheme like panel-A/abs-A, panel-B/abs-B, etc. works well).
When running the pipeline in dna-only or ab-only modes, the user is not required to create the folders for the missing file types.
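The layout rules above can be checked before a run with a short script. The following is a minimal sketch, not part of the pipeline: the function name check_sample_layout is hypothetical, and it assumes that read mates are distinguished by "_R1" and "_R2" in the filename, which may not match every sequencer's naming.

```python
from pathlib import Path


def check_sample_layout(sample_dir, modes=("panel", "abs")):
    """Return a list of problems found under <sample_dir>/fastq/<mode>/.

    Assumption: mates are named with "_R1" / "_R2" and end in ".fastq.gz".
    """
    problems = []
    for mode in modes:
        folder = Path(sample_dir) / "fastq" / mode
        if not folder.is_dir():
            problems.append(f"missing folder: {folder}")
            continue
        # Sorted so that the order matches a lexicographic listing.
        names = sorted(p.name for p in folder.glob("*.fastq.gz"))
        r1 = [n for n in names if "_R1" in n]
        r2 = [n for n in names if "_R2" in n]
        if not r1 and not r2:
            problems.append(f"no FASTQ files in {folder}")
        # Every R1 file needs an R2 mate, and vice versa.
        for name in r1:
            if name.replace("_R1", "_R2", 1) not in r2:
                problems.append(f"missing R2 mate for {folder / name}")
        for name in r2:
            if name.replace("_R2", "_R1", 1) not in r1:
                problems.append(f"missing R1 mate for {folder / name}")
    return problems
```

For a dna-only run, calling check_sample_layout(sample_dir, modes=("panel",)) skips the antibody folder, mirroring the rule that folders for missing file types are not required.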
The following software packages should be installed and located on the user's PATH. The version numbers shown are those used in the DAb-seq publication.
- GATK (4.1.3.0)
- bowtie2 (2.3.4.1)
- ITDseek (1.2)
- samtools (1.8)
- bedtools (2.27.1)
- bcftools (1.9)
- cutadapt (2.4)
- BBMap (38.57)
- snpEff (4.3t)
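Whether these tools are visible on the PATH can be checked from Python with the standard library. This is a minimal sketch: the executable names below are assumptions about a typical installation and may differ locally, and ITDseek and snpEff are left out because how they are launched varies between installations. The sketch checks presence only, not version numbers.

```python
import shutil

# Assumed executable names; adjust these to match the local installation.
REQUIRED_TOOLS = {
    "GATK": "gatk",
    "bowtie2": "bowtie2",
    "samtools": "samtools",
    "bedtools": "bedtools",
    "bcftools": "bcftools",
    "cutadapt": "cutadapt",
    "BBMap": "bbmap.sh",
}


def find_missing_tools(tools=REQUIRED_TOOLS):
    """Return the names of tools whose executable is not found on the PATH."""
    return sorted(
        name for name, exe in tools.items() if shutil.which(exe) is None
    )


if __name__ == "__main__":
    missing = find_missing_tools()
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("All checked tools were found on PATH.")
```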
To simplify installation and enhance data reproducibility, the pipeline can also be run in a Docker container. Instructions for building and running the DAb-seq image are listed in the section DAb-seq in Docker.
The pipeline is run through the main Python script, dabseq_pipeline.py, which must be run sequentially in barcode and genotype modes.
In barcode mode, the pipeline processes raw FASTQ files according to settings in a configuration file. The script demultiplexes DNA panel amplicons and antibody tags into single cells, aligns panel reads to the human genome, and generates a GVCF file for each cell. All samples belonging to the same cohort must be individually barcoded before joint genotyping.
In genotype mode, the pipeline calls variants for a single cohort, containing samples that have been individually processed in barcode mode. The script imports GVCFs into a GenomicsDB database (GATK GenomicsDBImport) and calls genotypes (GATK GenotypeGVCFs) separately for each genomic interval in parallel.
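The per-interval pattern described for genotype mode can be illustrated as follows. This is a sketch of the general scatter-by-interval approach with GATK, not the code in dabseq_pipeline.py: the function names, workspace paths, and output filenames are hypothetical, and the pipeline's actual arguments may differ. Only the GATK options -V, -L, -R, -O and --genomicsdb-workspace-path are used.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def build_genotype_commands(gvcfs, intervals, reference, workdir):
    """Build one (import, genotype) command pair per genomic interval."""
    jobs = []
    for interval in intervals:
        tag = interval.replace(":", "_")       # make the interval filename-safe
        workspace = f"{workdir}/{tag}.gdb"     # hypothetical workspace location
        import_cmd = ["gatk", "GenomicsDBImport",
                      "--genomicsdb-workspace-path", workspace,
                      "-L", interval]
        for gvcf in gvcfs:                     # one -V per single-cell GVCF
            import_cmd += ["-V", gvcf]
        call_cmd = ["gatk", "GenotypeGVCFs",
                    "-R", reference,
                    "-V", f"gendb://{workspace}",
                    "-O", f"{workdir}/{tag}.vcf.gz"]
        jobs.append((import_cmd, call_cmd))
    return jobs


def run_job(job):
    """Run the import step, then the genotyping step, for one interval."""
    for cmd in job:
        subprocess.run(cmd, check=True)


def run_all(jobs, max_workers=4):
    """Process intervals in parallel; each worker handles one interval."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        list(pool.map(run_job, jobs))
```

Because each interval gets its own workspace and output file, the jobs are independent, so max_workers can be lowered on smaller machines at the cost of a longer total run time.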
Running the DAb-seq pipeline in a Docker container is recommended and ensures that the pipeline dependencies are installed and configured properly. The Docker image is hosted on Docker Hub.
- Build the DAb-seq image using Docker:
docker build <path_to_dabseq_repo> -t dab-seq:human
Alternatively, pull the Docker image from Docker Hub:
docker pull bendemaree/dab-seq:human
There are two tags of the Docker image available: base and human. Any study involving human cells should use the human tag. The smaller base image is intended for non-human studies and does not include the hg19 FASTA file and associated indices.
- Organize the input files on the host machine in the same file structure as described in Input File Requirements.
- Edit the included example bash script run_dabseq_docker.sh with the appropriate cohort and sample information. Samples to process should be added as needed. The user running the pipeline should also be specified. This user should also have read/write access to the input FASTQ files.
- Run run_dabseq_docker.sh. You may need to change the file permissions to allow execution before running (e.g. chmod +x run_dabseq_docker.sh).
That's it! The pipeline will run in the Docker container and produce output files at the mounted locations on the host machine.
DNA genotyping is CPU- and memory-intensive, particularly when processing thousands of single-cell samples. The DAb-seq pipeline is implemented with tunable parallelization to scale to the resources available on different systems. In its default configuration, the pipeline requires at least 64 GB of memory and 16 physical threads to run. Reducing the number of cells aligned simultaneously or the number of genomic intervals genotyped in parallel can lower this hardware requirement significantly, with a corresponding increase in total processing time.
The DAb-seq output can be represented as linked genotyping and antibody UMI count tables.
Output genotyping data is saved in a new directory labeled GENOTYPING in the root of each cohort. A compressed HDF5 file contains matrices of discrete genotyping calls and antibody UMI counts for all cells in the cohort. Genotyping calls for each variant assume a genome ploidy of 2. Therefore, there are four possible values per matrix entry:
- Wildtype (0)
- Heterozygous Alternate (1)
- Homozygous Alternate (2)
- No Call (3)
Further details on reading and manipulating data from this file can be found in the included Python notebook in the sample_data folder.
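To make the genotype encoding concrete, the sketch below summarizes the calls for one variant across cells. It is an illustration only: the function name summarize_variant and the toy input are hypothetical, and it works on a plain Python list rather than reading the HDF5 file, whose internal dataset names are not assumed here.

```python
# Genotype codes as listed above (diploid calls).
WILDTYPE, HET, HOM, NO_CALL = 0, 1, 2, 3


def summarize_variant(calls):
    """Summarize the genotype codes of one variant across all cells."""
    called = [g for g in calls if g != NO_CALL]   # drop cells without a call
    call_rate = len(called) / len(calls) if calls else 0.0
    # With ploidy 2, each called cell carries two alleles; a heterozygous
    # cell contributes one alternate allele and a homozygous cell two, so
    # the codes 0/1/2 can be summed directly.
    alt_allele_freq = sum(called) / (2 * len(called)) if called else None
    mutated_cells = sum(1 for g in called if g in (HET, HOM))
    return {
        "call_rate": call_rate,
        "alt_allele_freq": alt_allele_freq,
        "mutated_cells": mutated_cells,
    }


# Toy example: four cells, one of each code.
summary = summarize_variant([WILDTYPE, HET, HOM, NO_CALL])
```

Treating No Call (3) as missing rather than as a numeric value matters here: including it in a sum or mean would inflate the apparent alternate allele frequency.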
The DAb-seq pipeline includes support for non-human organisms using the flag --non-human. When combined with additional settings such as --ploidy 1, the pipeline can be used to perform single-cell genotyping on haploid bacteria and yeast cells.
All data from the initial DAb-seq publication is available in FASTQ format on the NCBI Sequence Read Archive. In addition, the compressed HDF5 file for the three cell line validation experiment is included in the sample_data folder of this repository. A Jupyter Notebook providing example analysis code is also included in this folder.