marvel

MARVEL: Multigranular Analysis of Regulatory Variants on the Epigenomic Landscape.

Introduction

MARVEL is a pipeline for analyzing noncoding regulatory variants using whole-genome sequencing data and cell-type-specific epigenomic profiles. The workflow is summarized in the following figure:

Figure 1. Schematic overview of MARVEL.

  • (a) Epigenomic data of the relevant cell type (hNC in the case of Hirschsprung disease, HSCR) are integrated with a gene annotation set to identify the active regulatory elements relevant to the phenotype of interest.
  • (b) In each regulatory element, the functional significance of genetic variants is evaluated by their perturbation of TF sequence motifs.
  • (c) Since the perturbation effects of multiple genetic variants may not add up linearly, the variants are considered together to reconstruct the sample-specific sequences, from which the overall change in TF motif match scores is determined.
  • (d) For motifs that appear multiple times within the same regulatory element, the match scores are aggregated into a single score.
  • (e) At a higher level, if a gene involves multiple regulatory elements, the aggregated match scores of a motif in the different elements can be further aggregated into a single score. This is done in the gene-based analysis.
  • (f-g) The aggregated match score matrix of all the motifs for a regulatory element or gene is used as the input of an association test. The test selects a subset of the most informative motif features (f), then uses a likelihood ratio (LR) test to compare a model involving both the selected features and the covariates against a null model involving only the covariates (g).
  • (h) The regulatory elements and genes found to be significantly associated with the phenotype can be studied further by downstream analyses, such as gene set enrichment and single-cell expression analyses.
  • (i) TFs with recurrently perturbed match scores in different regulatory elements are collected to infer a network that highlights the phenotype-associated perturbations.

Please note that steps (h) and (i) are not included in this repository at the current stage, but they can be obtained easily using the results produced by MARVEL together with Cytoscape.

The pipeline is built using Nextflow, a workflow tool that runs tasks portably across multiple compute infrastructures. It comes with Docker containers, which make installation trivial and results highly reproducible.

Quick Start

i. Install nextflow

curl -s https://get.nextflow.io | bash

and add it to your PATH. This install route is suggested because the normal release has a bug in its conda integration.

ii. Install one of Docker, Singularity, or conda

iii. Clone the repo and test it on a minimal dataset

Notice: the test.vcf.gz and pheno_covar.txt files were temporarily removed because they were made from real genomic data. To run the test, create your own versions as follows:

For test.vcf.gz: use bcftools to select the variants in a small region, producing a VCF with genotypes of multiple samples.
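As a rough sketch of that step, the commands below extract one region from a larger multi-sample VCF. The input file name `cohort.vcf.gz` and the region are hypothetical placeholders, and the contig name must match the naming used in your VCF (for example `chr1` versus `1`). Region queries with `-r` require the input to be bgzip-compressed and indexed.

```shell
# Hypothetical inputs: replace with your own cohort VCF and a region of interest.
INPUT_VCF="cohort.vcf.gz"
REGION="chr1:1000000-1100000"

if command -v bcftools >/dev/null 2>&1 && [ -f "$INPUT_VCF" ]; then
    # Keep only variants inside REGION and write a bgzip-compressed VCF.
    bcftools view -r "$REGION" -Oz -o test.vcf.gz "$INPUT_VCF"
    # Build a tabix index next to the new file.
    bcftools index -t test.vcf.gz
else
    echo "bcftools or $INPUT_VCF not found; nothing was written"
fi
```

A region of around a hundred kilobases is used here only to keep the test file small; any region that overlaps the regulatory elements in your configuration should serve the same purpose.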

For pheno_covar.txt:

  • It's a TSV file
  • First column is sample name (in the same order as in the VCF file)
  • Second column is the phenotype (y), coded as 0 or 1
  • Third or later columns are covariates (all numeric)
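To illustrate the layout, the sketch below writes a three-sample pheno_covar.txt and checks it against the rules listed above. The sample names and covariate values are invented, and the absence of a header row is an assumption, so confirm against main.nf whether a header is expected before using this for real data.

```shell
# Write a three-sample example file. Sample names and covariate values are
# made up; the absence of a header row is an assumption.
printf 'sample1\t0\t34\t0.12\n'  >  pheno_covar.txt
printf 'sample2\t1\t29\t-0.40\n' >> pheno_covar.txt
printf 'sample3\t1\t41\t0.03\n'  >> pheno_covar.txt

# Validate the layout: tab-separated, phenotype coded 0/1, and every
# covariate a plain decimal number (scientific notation is rejected here).
awk -F'\t' '
    NF < 2 { printf "line %d: expected at least 2 columns\n", NR; bad = 1; next }
    $2 != "0" && $2 != "1" { printf "line %d: phenotype must be 0 or 1\n", NR; bad = 1 }
    {
        for (i = 3; i <= NF; i++)
            if ($i !~ /^-?[0-9]+(\.[0-9]+)?$/) {
                printf "line %d: column %d is not numeric\n", NR, i
                bad = 1
            }
    }
    END { exit bad }
' pheno_covar.txt && echo "pheno_covar.txt looks OK"
```

To confirm that the sample order matches the VCF, `bcftools query -l test.vcf.gz` prints the sample names in VCF order, which can be compared with the output of `cut -f1 pheno_covar.txt`.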

With both files in place, clone the repository and run the test profile:

git clone https://github.com/fuxialexander/marvel.git
cd marvel
nextflow main.nf -profile test,<docker/singularity/conda> -resume

iv. Look into nextflow.config and test/test.conf, and modify them to start running your own analysis:

nextflow main.nf -profile <docker/singularity/conda> -resume

See usage docs for all of the available options when running the pipeline.

Documentation

The documentation for the marvel pipeline is in the docs/ directory:

  1. Installation
  2. Pipeline configuration
  3. Running the pipeline
  4. Output and how to interpret the results
  5. Troubleshooting

Credits

MARVEL is implemented using a boilerplate created by the nf-core team (https://nf-co.re/).

Citation

If you use MARVEL for your analysis, please cite it as: Fu AX, Lui KN, Tang CS, Ng RK, Lai FP, Lau ST, Li Z, Garcia-Barcelo MM, Sham PC, Tam PK, Ngan ES, Yip KY. Whole-genome analysis of noncoding genetic variations identifies multiscale regulatory element perturbations associated with Hirschsprung disease. Genome Res. 2020 Nov;30(11):1618-1632. doi: 10.1101/gr.264473.120.