Skip to content

EM‐Seq Pipeline

sprokopec edited this page Jun 13, 2024 · 5 revisions

Overview

Requirements:

  • Adapter trimming is performed using trim_galore (for targeted panels) or fastq (for WG).
  • The PughLab EM-Seq pipeline uses bwa-meth for alignment (currently tested with v0.2.7): https://github.com/brentp/bwa-meth
  • After alignment, mark-nonconverted-reads is used to mark non-converted reads: https://github.com/nebiolabs/mark-nonconverted-reads
  • Duplicates are marked using either Picard MarkDuplicates or Sambamba markdup.
  • QC checks utilize functions from Picard via GATK (ie, GATK v4.1.8.1)
  • MethylDackel is used to estimate methylation at each CpG site: https://github.com/dpryan79/MethylDackel
    • NOTE on MethylDackel: MethylDackel has two steps
      1. mbias (to evaluate bias as the ends of reads and recommend trimming parameters): Pipeline-suite will run this, however there is no way to programmatically collect the recommended parameters therefore it is up to the user to check these and implement if necessary
      2. extract (to extract the per-site methylation metrics): if you want to apply the recommended mbias parameters (formatted as --OT A,B,C,D --OB E,F,G,H to indicate that positions from A through B (E-F for OB) on read #1 and C through D (G-H for OB) on read #2 should be included). Note that there is a bug in the version currently available on H4H that doesn't allow 0's in these parameters [change these to either 1 or 150 (or whatever your read length is)]

How to run on H4H:

module load perl

perl /path/to/pughlab_emseq_pipeline.pl \
-t /path/to/emseq_pipeline_config.yaml \
-d /path/to/dna_fastq_config.yaml \
--fastq_prep \
--alignment \
--qc \
--analysis \
-c slurm \
--remove \
--dry-run { if this is a dry-run }

Directory Structure

PROJECT
├── fastq_trimmed
│   ├── emseq_fastq_trimmed_config.yaml
│   ├── trimmed fastq files
│   └── logs
├── fastqc
│   ├── fastqc_key_metrics.tsv (combined cohort)
│   ├── fastqc outputs for individual files
│   └── logs
├── BWA
│   ├── bwa_bam_config.yaml
│   ├── logs
│   ├── SAMPLE1
│   │   ├── SMP-001-N
│   │   └── SMP-001-T
│   └── SMP-002
│       ├── SMP-002-N
│       ├── SMP-002-T1
│       └── SMP-002-T2
├── BAMQC
│   ├── date_PROJECTNAME_AlignmentMetrics.tsv
│   ├── date_PROJECTNAME_GCbiasMetrics.tsv
│   ├── date_PROJECTNAME_InsertSizes.tsv
│   ├── date_PROJECTNAME_WGSMetrics.tsv or date_PROJECTNAME_HSMetrics.tsv
│   ├── logs
│   ├── SAMPLE1
│   │   └── SMP-001-T_{*metrics}
│   └── SMP-002
│       ├── SMP-002-N_{*metrics}
│       ├── SMP-002-T1_{c*metrics}
│       └── SMP-002-T2_{*metrics}
├── MethylDackel
│   ├── logs
│   ├── SAMPLE1
│   │   ├── mbias (contains plots with recommended trimming parameters)
│   │   └── per-site bedGraph files
│   └── SMP-002
│   │   ├── mbias (contains plots with recommended trimming parameters)
│   │   └── per-site bedGraph files
└── logs
    └── run_DNA_pipeline_timestamp
        ├── pughlab_dna_pipeline__run_bwa
        ├── pughlab_dna_pipeline__run_gatk
        ├── pughlab_dna_pipeline__run_coverage
        └── pughlab_dna_pipeline__run_run_qc