pmdbs-bulk-rnaseq-wf

Repo for testing and developing a common postmortem-derived brain sequencing (PMDBS) workflow harmonized across ASAP with human bulk RNA sequencing data.

Common workflows, tasks, utility scripts, and docker images reused across harmonized ASAP workflows are defined in the wf-common repository.


Workflows

Workflows are defined in the workflows directory.

This workflow analyzes bulk RNAseq data and is written in WDL, primarily orchestrating command-line tools and a Python script.

Workflow diagram

Entrypoint: workflows/main.wdl

Input template: workflows/inputs.json

The workflow is broken up into four main chunks:

  1. Index reference genome
  2. Upstream
  3. Downstream
  4. Cohort analysis

Index reference genome

Optional step to index the reference genome, producing the genome directory input required by STAR and/or Salmon.
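Outside the workflow, equivalent indexes can be generated with standard STAR and Salmon commands along these lines (a minimal sketch using the GENCODE file names from the reference data notes; the workflow's exact parameters may differ):

# Build a STAR genome directory from the primary assembly and gene annotation
mkdir -p star_index
STAR --runMode genomeGenerate \
	--genomeDir star_index \
	--genomeFastaFiles GRCh38.primary_assembly.genome.fa \
	--sjdbGTFfile gencode.v46.annotation.gtf

# Decoy-aware Salmon index: use the genome as decoy sequences, concatenate the
# transcriptome and genome, then index the concatenated reference
grep "^>" GRCh38.primary_assembly.genome.fa | cut -d " " -f 1 | sed 's/>//g' > decoys.txt
cat gencode.v46.transcripts.fa GRCh38.primary_assembly.genome.fa > gentrome.fa
salmon index --transcripts gentrome.fa --decoys decoys.txt --index salmon_index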

Upstream

Run once per sample; only rerun when the upstream workflow version is updated. Upstream outputs are stored in the originating team's raw and staging data buckets.

Downstream

Run once per team (all samples from a single team). This can be rerun using different sample subsets; including additional samples requires this entire analysis to be rerun. Intermediate files from previous runs are not reused and are stored in timestamped directories.

Cohort analysis

Run once per team (all samples from a single team) if project.run_project_cohort_analysis is set to true, and once for the whole cohort (all samples from all teams) if run_cross_team_cohort_analysis is set to true. This can be rerun using different sample subsets; including additional samples requires this entire analysis to be rerun. Intermediate files from previous runs are not reused and are stored in timestamped directories.

Inputs

An input template file can be found at workflows/inputs.json.

| Type | Name | Description |
| :- | :- | :- |
| String | cohort_id | Name of the cohort; used to name output files during cross-team downstream analysis. |
| Array[Project] | projects | The project ID, set of samples and their associated reads and metadata, output bucket locations, and whether or not to run project-level downstream analysis. |
| Boolean? | run_alignment_quantification | Option to align raw reads with STAR and quantify aligned reads with Salmon. This and/or 'run_pseudo_mapping_quantification' must be set to true. [true] |
| Boolean? | run_star_index_ref_genome | Option to index the reference genome with STAR. If set to false, star_genome_dir_tar_gz must be provided. [false] |
| File? | star_genome_dir_tar_gz | The indexed reference genome files required for STAR. |
| Boolean? | run_pseudo_mapping_quantification | Option to map and directly quantify raw reads with Salmon. This and/or 'run_alignment_quantification' must be set to true. [false] |
| Boolean? | run_salmon_index_ref_genome | Option to create decoy sequences from the genome, concatenate the transcriptome and genome, and index the concatenated reference with Salmon. If set to false, salmon_genome_dir_tar_gz must be provided. [false] |
| File? | salmon_genome_dir_tar_gz | The indexed concatenated transcriptome and genome files required for Salmon. |
| Boolean? | run_cross_team_cohort_analysis | Whether to run downstream harmonization steps on all samples across projects. If set to false, only upstream steps (QC, align/map, and quantify) will run for samples. [false] |
| String | cohort_raw_data_bucket | Bucket to upload cross-team downstream intermediate files to. |
| Array[String] | cohort_staging_data_buckets | Buckets to upload cross-team downstream analysis outputs to. |
| File | gene_map_csv | CSV mapping transcript IDs to gene IDs; the columns must appear in that order. |
| File | gene_ids_and_names_json | JSON file containing mapped gene IDs and gene names created from the gene annotation GTF. |
| String | container_registry | Container registry where workflow Docker images are hosted. |
| String? | zones | Space-delimited set of GCP zones where compute will take place. ['us-central1-c us-central1-f'] |

Structs

Project

| Type | Name | Description |
| :- | :- | :- |
| String | team_id | Unique identifier for the team; used for naming output files. |
| String | dataset_id | Unique identifier for the dataset; used for naming output files. |
| String | dataset_doi_url | Generated Zenodo DOI URL referencing the dataset. |
| Array[Sample] | samples | The set of samples associated with this project. |
| File | project_sample_metadata_csv | CSV containing all sample information (batch, condition, etc.) used for the DESeq2 pairwise condition comparison ('PD' vs. 'Control'). The batch column must contain at least two distinct values. |
| Boolean | run_project_cohort_analysis | Whether or not to run cohort analysis within the project. |
| String | raw_data_bucket | Raw data bucket; intermediate output files that are not final workflow outputs are stored here. |
| String | staging_data_bucket | Staging data bucket; final project-level outputs are stored here. |

Sample

| Type | Name | Description |
| :- | :- | :- |
| String | sample_id | Unique identifier for the sample within the project. |
| String? | batch | The sample's batch. |
| File | fastq_R1 | Path to the sample's read 1 FASTQ file. |
| File | fastq_R2 | Path to the sample's read 2 FASTQ file. |
| File? | fastq_I1 | Path to the sample's optional index 1 FASTQ file. |
| File? | fastq_I2 | Path to the sample's optional index 2 FASTQ file. |

Reference

See reference data notes for more details.

| Type | Name | Description |
| :- | :- | :- |
| File | primary_assembly_fasta | Nucleotide sequence of the GRCh38 primary genome assembly (chromosomes and scaffolds). |
| File | gene_annotation_gtf | Comprehensive gene annotation on the reference chromosomes only. |
| File | transcripts_fasta | Nucleotide sequences of all transcripts on the reference chromosomes. |
| File | all_transcripts_fasta | Nucleotide sequences of all transcripts on the reference chromosomes, manually generated from the primary_assembly_fasta and gene_annotation_gtf (see reference data notes). |
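
As an illustration, a trimmed-down inputs JSON for a single project might look like the following (hypothetical values; key prefixes assume the workflow name pmdbs_bulk_rnaseq_analysis, and reference files, gene_map_csv, gene_ids_and_names_json, and container_registry are omitted for brevity — the template at workflows/inputs.json is the authoritative reference):

{
	"pmdbs_bulk_rnaseq_analysis.cohort_id": "asap-cohort",
	"pmdbs_bulk_rnaseq_analysis.projects": [
		{
			"team_id": "team-hardy",
			"dataset_id": "bulk-rnaseq",
			"dataset_doi_url": "https://doi.org/10.5281/zenodo.0000000",
			"samples": [
				{
					"sample_id": "sampleA",
					"batch": "1",
					"fastq_R1": "gs://fastq_bucket/some/path/sampleA_R1.fastq.gz",
					"fastq_R2": "gs://fastq_bucket/some/path/sampleA_R2.fastq.gz"
				}
			],
			"project_sample_metadata_csv": "gs://asap-raw-team-hardy-pmdbs-bulk-rnaseq/metadata/sample_metadata.csv",
			"run_project_cohort_analysis": true,
			"raw_data_bucket": "asap-raw-team-hardy-pmdbs-bulk-rnaseq",
			"staging_data_bucket": "asap-dev-team-hardy-pmdbs-bulk-rnaseq"
		}
	],
	"pmdbs_bulk_rnaseq_analysis.run_cross_team_cohort_analysis": false,
	"pmdbs_bulk_rnaseq_analysis.cohort_raw_data_bucket": "asap-raw-cohort-pmdbs-bulk-rnaseq",
	"pmdbs_bulk_rnaseq_analysis.cohort_staging_data_buckets": ["asap-dev-cohort-pmdbs-bulk-rnaseq"]
}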

Generating the inputs JSON

The inputs JSON may be generated manually; however, when running a large number of samples this can become unwieldy. The generate_inputs utility script may be used to automatically generate the inputs JSON (inputs.{staging_env}.{source}-{cohort_dataset}.{date}.json) and a sample list TSV ({team_id}.{source}-{cohort_dataset}.sample_list.{date}.tsv; the same file generated by the write_cohort_sample_list task). The script requires the libraries outlined in the requirements.txt file and the following inputs:

  • project-tsv: One or more project TSVs with one row per sample and columns team_id, dataset_id, sample_id, batch, and fastq_path (an example is shown after this list). All samples from all projects may be included in the same project TSV, or multiple project TSVs may be provided.
    • team_id: A unique identifier for the team from which the sample(s) arose
    • dataset_id: A unique identifier for the dataset from which the sample(s) arose
    • sample_id: A unique identifier for the sample within the project
    • batch: The sample's batch
    • fastq_path: The directory in which paired sample FASTQs may be found, including the gs:// bucket name and path
      • This column is populated from the fastq-locs-txt input: FASTQ locations for all samples provided in the project-tsv, one per line. Each sample is expected to have one set of paired FASTQs located at ${fastq_path}/${sample_id}*. The read 1 file should include 'R1' somewhere in the filename; the read 2 file should include 'R2' somewhere in the filename. Generate this file e.g. by running gcloud storage ls gs://fastq_bucket/some/path/**.fastq.gz >> fastq_locs.txt
  • inputs-template: The inputs template JSON file into which the projects information derived from the project-tsv will be inserted. Must have a key ending in *.projects. Other default values filled out in the inputs template will be written to the output inputs.json file.
  • run-project-cohort-analysis: Optionally run project-level cohort analysis for provided projects. This value will apply to all projects. [false]
  • workflow-name: WDL workflow name.
  • cohort-dataset: Dataset name in cohort bucket name (e.g. 'bulk-rnaseq').
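
For example, a minimal project-tsv (tab-separated, hypothetical values) might look like:

team_id	dataset_id	sample_id	batch	fastq_path
team-hardy	bulk-rnaseq	sampleA	1	gs://fastq_bucket/some/path
team-hardy	bulk-rnaseq	sampleB	2	gs://fastq_bucket/some/path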

Example usage:

./wf-common/util/generate_inputs \
	--project-tsv metadata.tsv \
	--inputs-template workflows/inputs.json \
	--run-project-cohort-analysis \
	--workflow-name pmdbs_bulk_rnaseq_analysis \
	--cohort-dataset bulk-rnaseq

Outputs

Output structure

  • cohort_id: either the team_id for project-level downstream analysis, or the cohort_id for the full cohort
  • workflow_run_timestamp: format: %Y-%m-%dT%H-%M-%SZ
  • The list of samples used to generate the downstream analysis will be output alongside other downstream analysis outputs in the staging data bucket (${cohort_id}.sample_list.tsv)
  • The MANIFEST.tsv file in the staging data bucket describes the file name, md5 hash, timestamp, workflow version, workflow name, and workflow release for the run used to generate each file in that directory
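
A MANIFEST.tsv row might therefore look something like this (illustrative column names and values only; the actual headers are defined by the workflow):

filename	md5	timestamp	workflow_version	workflow_name	workflow_release
${team_id}.${salmon_mode}.dds.pkl	<md5 hash>	2024-01-01T00-00-00Z	v1.0.0	pmdbs_bulk_rnaseq	<workflow release URL>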

Raw data (intermediate files and final outputs for all runs of the workflow)

The raw data bucket will contain some artifacts generated as part of workflow execution. Following successful workflow execution, the artifacts will also be copied into the staging bucket as final outputs.

In the workflow, task outputs are declared either as String (final outputs, which are copied so that they live in the raw data and staging buckets) or File (intermediate outputs that are periodically cleaned up, which live in the cromwell-output bucket). This was implemented to reduce storage costs.
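
A minimal sketch of this pattern (not the workflow's actual task code; task name, paths, and inputs here are hypothetical) is a task that copies its final artifact to the raw data bucket and exposes the destination path as a String, while intermediates remain File outputs:

task example_quantify {
	input {
		File counts_tsv
		String raw_data_bucket
	}

	command <<<
		set -euo pipefail

		# Intermediate output: stays in the cromwell-output bucket and may be cleaned up
		gzip -c ~{counts_tsv} > counts.tsv.gz

		# Final output: copied to the raw data bucket; the gs:// path is returned as a String
		gcloud storage cp counts.tsv.gz ~{raw_data_bucket}/upstream/counts.tsv.gz
	>>>

	output {
		File intermediate_counts = "counts.tsv.gz"
		String final_counts = "~{raw_data_bucket}/upstream/counts.tsv.gz"
	}
}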

asap-raw-{cohort,team-xxyy}-{source}-{dataset}
└── workflow_execution
    └── pmdbs_bulk_rnaseq
        ├── cohort_analysis
        │   └── ${cohort_analysis_workflow_version}
        │       └── ${salmon_mode}
        │           └── ${workflow_run_timestamp}
        │               └── <cohort_analysis outputs>
        ├── downstream
        │   └── ${downstream_workflow_version}
        │       └── ${salmon_mode}
        │           └── ${workflow_run_timestamp}
        │               └── <downstream outputs>
        └── upstream
            ├── fastqc_raw_reads
            │   └── ${fastqc_task_version}
            │       └── <fastqc_raw_reads output>
            ├── trim_and_qc
            │   └── ${trim_and_qc_task_version}
            │       └── <trim_and_qc output>
            ├── fastqc_trimmed_reads
            │   └── ${fastqc_task_version}
            │       └── <fastqc_trimmed_reads output>
            ├── alignment_quantification
            │   └── ${alignment_quantification_workflow_version}
            │       └── <alignment_quantification output>
            └── pseudo_mapping_quantification
                └── ${pseudo_mapping_quantification_workflow_version}
                    └── <pseudo_mapping_quantification output>

Staging data (intermediate workflow objects and final workflow outputs for the latest run of the workflow)

Following QC by researchers, the objects in the dev or uat bucket are synced into the curated data buckets, maintaining the same file structure. Curated data buckets are named asap-curated-{cohort,team-xxyy}-{source}-{dataset}.

Data may be synced using the promote_staging_data script.

asap-dev-{cohort,team-xxyy}-{source}-{dataset}
└── pmdbs_bulk_rnaseq
    ├── cohort_analysis
    │   └── ${salmon_mode}
    │       ├── ${cohort_id}.sample_list.tsv
    │       ├── ${cohort_id}.${salmon_mode}.overlapping_significant_genes.csv # Only for cross_team_cohort_analysis
    │       ├── ${cohort_id}.${salmon_mode}.pca_plot.png
    │       └── MANIFEST.tsv
    ├── downstream
    │   └── ${salmon_mode}
    │       ├── ${team_id}.${output_name}.html # Includes ${salmon_mode} in output_name
    │       ├── ${team_id}.${output_name}_data.zip # Includes ${salmon_mode} in output_name
    │       ├── ${team_id}.${salmon_mode}.dds.pkl
    │       ├── ${team_id}.${salmon_mode}.pydeseq2_significant_genes.csv
    │       ├── ${team_id}.${salmon_mode}.volcano_plot.png
    │       └── MANIFEST.tsv
    └── upstream
        └── ${salmon_mode}
            ├── ${sampleA_id}.${salmon_mode}.salmon_quant.tar.gz
            ├── MANIFEST.tsv
            ├── ...
            ├── ${sampleN_id}.${salmon_mode}.salmon_quant.tar.gz
            └── MANIFEST.tsv

Promoting staging data

The promote_staging_data script can be used to promote staging data that has been approved to the curated data bucket for a team or set of teams.

This script compiles bucket and file information for both the initial (staging) and target (prod) environments, runs data integrity tests to ensure the staging data can be promoted, and generates a Markdown report. The tests (1) check that files are not empty and are larger than 10 bytes (after accounting for whitespace), and (2) check that each file has associated metadata and is present in MANIFEST.tsv.
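
The two checks amount to something like the following Python sketch (an illustration only, not the script's actual implementation):

import csv
from pathlib import Path

def file_is_big_enough(path: Path) -> bool:
	# Check (1): not empty and more than 10 bytes once whitespace is ignored
	content = path.read_bytes()
	return len(b"".join(content.split())) > 10

def file_in_manifest(path: Path, manifest_tsv: Path) -> bool:
	# Check (2): the file has an associated metadata row in MANIFEST.tsv
	with open(manifest_tsv, newline="") as f:
		# The 'filename' column name is an assumption for this sketch
		filenames = {row["filename"] for row in csv.DictReader(f, delimiter="\t")}
	return path.name in filenames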

If data integrity tests pass, this script will upload a combined MANIFEST.tsv and the data promotion Markdown report under a metadata/{timestamp} directory in the staging bucket. Previous manifest files and reports will be kept. Next, it will rsync all files in the staging bucket to the curated bucket's upstream, downstream, cohort_analysis, and metadata directories. Exercise caution when using this script; files that are not present in the source (staging) bucket will be deleted at the destination (curated) bucket.

If data integrity tests fail, staging data cannot be promoted. The combined MANIFEST.tsv, Markdown report, and promote_staging_data_script.log will be available locally.

The script defaults to a dry run, printing out the files that would be copied or deleted for each selected team.

Options

-h  Display this message and exit
-t  Space-delimited team(s) to promote data for
-l  List available teams
-s  Source name in bucket name
-d  Space-delimited dataset name(s) in team bucket name, must follow the same order as {team}
-w  Workflow name used as a directory in bucket
-p  Promote data. If this option is not selected, data that would be copied or deleted is printed out, but files are not actually changed (dry run)

Usage

# List available teams
./wf-common/util/promote_staging_data -t cohort -l -s pmdbs -d bulk-rnaseq -w pmdbs_bulk_rnaseq

# Print out the files that would be copied or deleted from the staging bucket to the curated bucket for teams team-hardy and team-wood
./wf-common/util/promote_staging_data -t team-hardy team-wood -s pmdbs -d bulk-rnaseq -w pmdbs_bulk_rnaseq

# Promote data for team-hardy and cohort
./wf-common/util/promote_staging_data -t team-hardy cohort -s pmdbs -d bulk-rnaseq -w pmdbs_bulk_rnaseq

Docker images

Docker images are defined in the docker directory. Each image must minimally define a build.env file and a Dockerfile.

Example directory structure:

docker
├── fastp
│   ├── build.env
│   └── Dockerfile
├── star_samtools
│   ├── build.env
│   └── Dockerfile
└── pydeseq2
    ├── build.env
    ├── Dockerfile
    ├── requirements.txt
    └── scripts
        ├── dge_analysis.py
        └── cohort_analysis.py

The build.env file

Each target image is defined using the build.env file, which is used to specify the name and version tag for the corresponding Docker image. It must contain at minimum the following variables:

  • IMAGE_NAME
  • IMAGE_TAG

All variables defined in the build.env file will be made available as build arguments during Docker image build.

The DOCKERFILE variable may be used to specify the path to a Dockerfile if that file is not found alongside the build.env file, for example when multiple images use the same base Dockerfile definition.
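
For example, a build.env for the fastp image might look like this (illustrative values; see the files in the docker directory for the actual names and tags):

IMAGE_NAME=fastp
IMAGE_TAG=0.23.4

# Optional: point at a shared Dockerfile elsewhere in the repo (path is hypothetical)
# DOCKERFILE=docker/common/Dockerfile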

Building Docker images

Docker images can be built using the build_docker_images script.

# Build a single image
./build_docker_images -d docker/fastp

# Build all images in the `docker` directory
./build_docker_images -d docker

# Build and push all images in the docker directory, using the `dnastack` container registry
./build_docker_images -d docker -c dnastack -p

Tool and library versions

| Image | Major tool versions | Links |
| :- | :- | :- |
| fastqc | | Dockerfile |
| fastp | | Dockerfile |
| star_samtools | | Dockerfile |
| salmon | | Dockerfile |
| pydeseq2 | Python (v3.12.5) libraries | Dockerfile |
| multiqc | | Dockerfile |
| util | | Dockerfile |

wdl-ci

wdl-ci provides tools to validate and test workflows and tasks written in Workflow Description Language (WDL). In addition to the tests packaged in wdl-ci, the pmdbs-wdl-ci-custom-test-dir directory contains custom WDL-based tests used to test workflow tasks. wdl-ci in this repository is set up to run on pull requests.

In general, wdl-ci uses the inputs provided in wdl-ci.config.json and compares current outputs against validated outputs for changed tasks/workflows, ensuring outputs remain valid by meeting the criteria of the specified tests. For example, if the Differential Gene Expression Analysis task in our workflow were changed, that task would be submitted and its output would be considered the "current output". For the raw counts generated by PyDESeq2, a test called "check_pkl" is specified in wdl-ci.config.json; it compares the "current output" against the "validated output" (provided in wdl-ci.config.json) to make sure the dds.pkl file is still a valid PKL file.
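
A rough sketch of what a pickle-validity check like this could look like (an illustration only, not the actual wdl-ci test implementation):

import pickle

def check_pkl(current_output: str, validated_output: str) -> bool:
	# Both the current and validated dds.pkl must unpickle successfully
	try:
		with open(current_output, "rb") as f:
			current = pickle.load(f)
		with open(validated_output, "rb") as f:
			validated = pickle.load(f)
	except (pickle.UnpicklingError, EOFError):
		return False

	# The current object is expected to have the same type as the validated one
	# (e.g. a PyDESeq2 DeseqDataSet)
	return type(current) is type(validated)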

Notes

Reference data

GENCODE release 46 (GRCh38.p14) was used in this pipeline.

The GENCODE gene annotation file was used to create a tx2gene dataframe in R:

library(GenomicFeatures)

txdb <- makeTxDbFromGFF("gencode.v46.annotation.gtf", format="gtf", organism="Homo sapiens")
k <- keys(txdb, keytype = "TXNAME")
# The column names do not matter but this column order must be used: 1) transcript ID and 2) gene ID
tx2gene <- select(txdb, k, "GENEID", "TXNAME")
write.csv(tx2gene, "tx2gene.gencode.v46.csv", row.names = FALSE)

The GENCODE primary assembly and gene annotation files were used to create a transcriptome FASTA file with GffRead. This FASTA is used for salmon quant in alignment-based mode (see the related issue for full context):

# After installing gffread, extract transcript sequences from the primary assembly
gffread -w gencode.v46.all_transcripts.fa -g GRCh38.primary_assembly.genome.fa gencode.v46.annotation.gtf

The GENCODE gene annotation file was used to create a JSON that maps gene IDs and gene names for easier readability:

import json

with open("gencode.v46.annotation.gtf") as f:
	gtf = list(f)

# Drop header lines and keep only records containing both a gene ID and a gene name
gtf = [x for x in gtf if not x.startswith('#')]
gtf = [x for x in gtf if 'gene_id "' in x and 'gene_name "' in x]
# Extract (gene_id, gene_name) pairs, de-duplicate, and convert to a dict
gtf = list(map(lambda x: (x.split('gene_id "')[1].split('"')[0], x.split('gene_name "')[1].split('"')[0]), gtf))
gtf = list(set(gtf))
gtf = dict(gtf)

with open("gencode.v46.gene_id_and_gene_name.json", "w") as file:
	json.dump(gtf, file)
