Skip to content

Scripts

supermaxiste edited this page Feb 19, 2020 · 5 revisions

ARPEGGIO includes four different R scripts. In the following sections there will be a description of all these scripts and how they work. Note that all scripts were developed with command line interaction in mind and are not meant to run on RStudio by default. Different scripts are meant for different contexts:

  • CoverageFileGeneratorComplete.R is used after Bismark generates a coverage file for all Cs and all contexts
  • dmrseq.R runs dmrseq and takes care of arranging the samples in the correct way
  • significantGenesToBed.R is part of the downstream analyses and does some simple data wrangling
  • DMGeneSummary.R is part of the downstream analyses summarizes all the output from dmrseq and downstream analyses in a concise txt file about differential methylation in genes

CoverageFileGeneratorComplete.R

When people want to analyse all methylation contexts (CpG, CHG, CHH) Bismark provides an option to consider all of those contexts together. In practice those contexts are always considered separately because biologically they have three different enzymes regulated in three different ways. In addition many papers (contact me for more info) discussed how the three contexts target different genomic regions depending on the type of stress and its duration. To cut the story short, we need to separate Bismark's CX report that includes information about all cytosines and all contexts and turn it into a Bismark cov (coverage) file to use as input for dmrseq.

Input:

  • [1] Bismark CX report
  • [2] Path to output folder
  • [3] Name of output file (no file extension needed)

Command structure:

CoverageFileGeneratorComplete.R [1] [2] [3]

Output:

Three separate cov for each methylation context (CpG, CHG, CHH).

Script in a nutshell:

  • Remove Cytosines showing no counts
  • Separate Cytosines with counts in three files depending on context

dmrseq.R

This script runs dmrseq in a very similar fashion as in their vignette. Two additional aspects were taken into account in this script: the order of samples (any number) and saving the resulting Rdata file. The first aspect is very important because the order of samples influences the interpretation of the results (reference defined by the order), so we made sure to keep the same order. For default mode, parents vs polyploid, parents are always the reference, while for the other two special modes (polyploid vs polyploid and diploid vs diploid) condition B is always the reference. The second aspect is very practical: by saving the Rdata file with the output from dmrseq, users can do more data wrangling and plotting without needing to rerun dmrseq every time.

Input:

  • [1] Number of samples for parent1 or parent2 or samples in condition B
  • [2] Number of samples for polyploid or samples in condition A
  • [3] Name of output file with file extension
  • [4] Number of cores to use
  • [5, ...] Bismark cov files in the correct order (samples from [1] first then samples from [2])

Command structure:

dmrseq.R [1] [2] [3] [4] [5, ...]

Output:

A csv file with differentially methylated regions and a Rdata file with a GRanges object about differentially methylated regions.

Script in a nutshell:

  • Order cov files correctly
  • Read cov files in the right order
  • Run dmrseq

significantGenesToBed.R

The output of dmrseq includes all candidate regions showing differential methylation (DM), but not all of them are significant. To select significant regions we pick regions showing a q-value < 0.05. This script takes care of doing this and outputs a bed file containing only significant regions after FDR correction.

Input:

  • [1] dmrseq output (in csv format)
  • [2] Output name (no extension needed)

Command structure:

significantGenesToBed.R [1] [2]

Output:

A bed file with information about the genomic ranges of DM regions.

Script in a nutshell:

  • Select regions with q < 0.05
  • Make sure start position < end position

DMGeneSummary.R

Biologists always look for genes and that's why this script exist. From regions showing differential methylation (DM) we check for overlaps with genic regions. By combining information from the dmrseq output, the overlap between regions showing DM and gene regions, and geneID from annotation files, we generate a summary file with all the information a biologist would need.

Input:

  • [1] Intersection file (from the rules bedtools_intersect and bedtools_intersect_special, ends with overlap.txt)
  • [2] dmrseq output (in csv format)
  • [3] gene ID as in the annotation file
  • [4] output name (no extension needed)

Command structure:

DMGeneSummary.R [1] [2] [3] [4]

Output:

A txt file with summary information about genes overlapping DM regions.

Script in a nutshell:

  • Pick the columns of interest from different files
  • Add methylation status (increase or decrease) and gene ID
  • Arrange everything nicely

Next section

  1. Software and tools

Wiki index

Basic steps to run & get an idea of the workflow:

  1. Overview
  2. Input files
  3. Configuration file
  4. Running Snakemake
  5. Output structure

Advanced information to better understand & modify the workflow:

  1. Snakefile (rules and relationships)
  2. Scripts
  3. Software and tools