
URLpipe

Table of Contents

Introduction
Pipeline summary
Quick start
Documentation
Credits
Bug report/Support
Citations
Release notes

Introduction

Gene editing via CRISPR/Cas9 technologies has emerged as a promising strategy for treating certain Repeat Expansion Diseases (REDs), including Huntington's Disease (HD), by permanently reducing the length of pathogenic expansions in DNA regions. One crucial step in this process is evaluating the editing results, which involves checking for undesired Insertions and Deletions (INDELs) near the targeted editing sites and determining the edited length of the DNA repeat expansion through sequencing. However, accurately determining the editing outcomes remains a challenge due to PCR artifacts caused by polymerase slippage in repetitive DNA regions, decreased efficiency when amplifying larger fragments, and sequencing errors. The URLpipe (UMI-based Repeat Length analysis pipeline) tackles this problem by leveraging Unique Molecular Identifiers (UMIs) to improve the accuracy of inferring gene editing outcomes.

Powered by Nextflow, URLpipe is designed to be user-friendly and portable, enabling execution across various compute infrastructures through Docker/Singularity technologies. URLpipe takes raw fastq files as input and generates statistical tables and plots that summarize the editing outcomes. Below is an overview of the design and the implemented sub-workflows/modules in URLpipe.

The development of the pipeline is guided by nf-core TEMPLATE.

Pipeline Summary

URLpipe supports sequencing reads from both Illumina and Nanopore platforms. The relevant sub-workflows/modules are illustrated in the diagram below. For detailed instructions on configuring your analysis and examples, refer to the usage documentation.

Illumina reads

The Illumina branch of the pipeline is structured into distinct sub-workflows, each with a specific role in processing data:

  • INPUT_CHECK:
    • Validate the input files and configurations to ensure they meet the requirements for analysis.
  • PREPROCESS_QC:
    • Perform preprocessing and quality control on the raw data.
    • Result folders: 1_preprocess and 2_qc_and_umi
  • CLASSIFY_READ:
    • Categorize reads into different classes to facilitate downstream analysis.
    • Result folder: 3_read_category
  • REPEAT_STAT_DEFAULT and REPEAT_STAT_MERGE:
    • Determine repeat lengths by leveraging UMI.
    • Result folder: 4_repeat_statistics
  • INDEL_STAT:
    • Analyze patterns of insertions and deletions around the repeat region.
    • Result folder: 5_indel_statistics
  • GET_SUMMARY:
    • Generate tables and plots summarizing the editing outcome.
    • Result folder: 6_summary

Selected sub-workflows and their functionalities are summarized below. Refer to output - Results for more details.

PREPROCESS_QC:

  1. Merge fastq files from different lanes (if any) that belong to the same library (1a_lane_merge).
  2. Extract UMI from each read and append it to the read name (1b_umi_extract).
  3. Trim adapter sequences (1c_cutadapt).
  4. Quality control using FastQC (2a_fastqc).
  5. Quality control by plotting read count per UMI (2b_read_per_umi_cutadapt).
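Conceptually, the UMI extraction in step 2 clips the UMI off the start of each read and appends it to the read name so downstream steps can group reads by molecule of origin (as tools like umi_tools extract do). The sketch below is illustrative only; the 10 bp UMI length and the `_`-delimited naming convention are assumptions, not necessarily what URLpipe uses.

```python
def extract_umi(read_name, sequence, quality, umi_len=10):
    """Move the first umi_len bases of a read into its name.

    Returns (new_name, trimmed_sequence, trimmed_quality).
    """
    umi = sequence[:umi_len]
    new_name = f"{read_name}_{umi}"  # append UMI to the read name
    # Clip the UMI bases (and their quality scores) off the read
    return new_name, sequence[umi_len:], quality[umi_len:]

name, seq, qual = extract_umi("read1", "ACGTACGTAATTTCAGCAG", "I" * 19)
print(name)  # read1_ACGTACGTAA
print(seq)   # TTTCAGCAG
```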

CLASSIFY_READ:

  1. Determine if read is mapped to the predefined target region (on-locus) (3a_classify_locus).
  2. Classify on-locus reads based on the presence of INDELs around the repeat region (non-indel) (3b_classify_indel).
  3. For each non-indel read, determine whether it covers the entire repeat region (readthrough) (3c_classify_readthrough).

The readthrough reads are then used to determine the repeat lengths.
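The readthrough test in step 3 can be pictured as a simple coordinate check: a read "reads through" the repeat if its aligned span covers the whole repeat region. This is an illustrative sketch, not URLpipe's actual code; the function name and the 1-based inclusive coordinates are assumptions (the demo values below reuse the ref_repeat_start/ref_repeat_end settings from the quick-start config).

```python
def is_readthrough(aln_start, aln_end, repeat_start, repeat_end):
    """True if the read's aligned span (1-based, inclusive)
    covers the entire repeat region."""
    return aln_start <= repeat_start and aln_end >= repeat_end

# Using the quick-start demo coordinates (repeat spans 69..218):
print(is_readthrough(10, 250, 69, 218))   # True: span covers the repeat
print(is_readthrough(100, 250, 69, 218))  # False: alignment starts inside it
```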

REPEAT_STAT_DEFAULT/MERGE:

In URLpipe, repeat length determination can be performed in two modes: DEFAULT mode, which uses only R1 reads, and MERGE mode, which merges R1 and R2 reads. For UMI correction, four methods are currently available: "mode", "mean", "least distance", and "square distance".

  1. Figure out repeat length distribution (4a_repeat_length_distribution).
  2. Perform UMI correction to refine repeat length measurements (4a_repeat_length_distribution).
  3. Plot the repeat length distribution per UMI (4b_4a_repeat_length_distribution_per_umi).
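The four UMI-correction methods can be pictured as follows: reads sharing a UMI yield a list of candidate repeat lengths (PCR replicates of one molecule), which is collapsed to a single length. This is an illustrative re-implementation, not URLpipe's actual code; in particular, the tie-breaking behavior is an assumption.

```python
from statistics import mean, mode

def correct_umi_group(lengths, method="mode"):
    """Collapse the repeat lengths observed for one UMI into one value."""
    if method == "mode":
        return mode(lengths)        # most frequent observed length
    if method == "mean":
        return round(mean(lengths)) # average observed length
    if method == "least_distance":
        # observed length minimizing total absolute distance to the others
        return min(lengths, key=lambda c: sum(abs(c - x) for x in lengths))
    if method == "square_distance":
        # observed length minimizing total squared distance to the others
        return min(lengths, key=lambda c: sum((c - x) ** 2 for x in lengths))
    raise ValueError(f"unknown method: {method}")

obs = [48, 48, 47, 50, 48]  # lengths seen for one UMI group
print(correct_umi_group(obs, "mode"))            # 48
print(correct_umi_group(obs, "least_distance"))  # 48
```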

INDEL_STAT:

Gather statistics for reads containing INDELs.

GET_SUMMARY:

Obtain summary statistics from CLASSIFY_READ, REPEAT_STAT_DEFAULT/MERGE, and INDEL_STAT results.

  1. Generate master statistic tables (6a_master_table).
  2. Generate summary plots (6b_bin_plot).

Nanopore reads

The setup for the Nanopore branch is quite similar to that of the Illumina branch, with the main difference being the inclusion of an optional PREPROCESS_NANOPORE sub-workflow specifically designed for pre-processing Nanopore data.

Quick start

  1. Install Nextflow (>=23.10.0).

  2. To avoid potential issues with dependency installation, all URLpipe dependencies are built into container images. Therefore, you should install either Docker or Singularity (Apptainer) and specify -profile docker or -profile singularity, respectively, when running URLpipe. Otherwise, you will need to manually install all dependencies and ensure they are available on your local PATH, which is unlikely to be the case!

  3. Download the pipeline:

git clone https://github.com/hukai916/URLpipe.git
cd URLpipe
  4. Download a minimal test dataset:
    • Dataset1 comprises a subset of samples from a CRISPR editing experiment using the HQ50 (human) cell line. HQ50 cells contain two HTT alleles with differing CAG-repeat lengths: one with approximately 18CAG/20Q and the other with approximately 48CAG/50Q. The objective of the experiment is to examine the editing outcomes when treated with various DNA damage repair inhibitors. For demonstration purposes, six samples (three conditions, each with two replicates) have been selected:
    • Two samples with no electroporation (no_E), serving as the unedited control.
    • Two samples with no inhibitor (noINH_DMSO), serving as the edited control.
    • Two samples with the D103 inhibitor (D103_10uM) to assess its effect on the editing outcome.
wget https://www.dropbox.com/scl/fi/b4xspm0ydq4y1p8s9u55g/sample_dataset1.zip
unzip sample_dataset1.zip
  5. Edit the replace_with_full_path placeholder in the "assets/samplesheet_dataset1.csv" file to use the actual full path.

  6. Test the pipeline with this minimal dataset1:

  • At least 8GB memory is recommended for dataset1.

  • By default, the local executor (-profile local) is used, meaning that all jobs will be executed on your local computer. Nextflow supports many other executors, including SLURM, LSF, etc. You can create a profile file to configure which executor to use. Multiple profiles can be supplied, separated by commas, e.g. -profile docker,lsf.

  • Please check nf-core/configs to see what other custom configurations can be supplied.

  • Example command for running URLpipe with Docker and the local executor:

nextflow run main.nf -c conf/sample_dataset1.config -profile docker,local

By executing the above command:

  • The "local executor" (-profile local) will be used.
  • The "docker" (-profile docker) will be leveraged.
  • The configurations specified via -c conf/sample_dataset1.config will be applied, which includes:
    • input = "./assets/samplesheet_dataset1.csv": input samplesheet file path
    • outdir = "./results_dataset1": output directory path
    • ref = "./assets/IlluminaHsQ50FibTrim_Ref.fa": reference file path
    • ref_repeat_start = 69: 1-based repeat start coordinate in reference
    • ref_repeat_end = 218: 1-based repeat end coordinate in reference
    • ref_repeat_unit = "CAG": repeat unit in reference
    • length_mode = "reference_align": repeat length determination method
    • umi_cutoffs = "1,3,5,7,10,30,100": UMI cutoffs for correction
    • umi_correction_method = "least_distance": UMI correction method
    • repeat_bins = "[(0,50), (51,60), (61,137), (138,154), (155,1000)]": number and range of bins to plot
    • allele_number = 2: number of alleles in reference
    • max_memory = "16.GB": maximum memory to use, do not exceed what your system has
    • max_cpus = 16: maximum number of cpu to use, do not exceed what your system has
    • max_time = "240.h": maximum running time
    • other module-specific configurations
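As an illustration of how the repeat_bins setting above is typically used: corrected repeat lengths are grouped into the configured bins for plotting. The parsing and inclusive-bounds convention in this sketch are assumptions, not URLpipe's actual code.

```python
from ast import literal_eval

# The repeat_bins value from the sample config, parsed into tuples
repeat_bins = "[(0,50), (51,60), (61,137), (138,154), (155,1000)]"
bins = literal_eval(repeat_bins)  # [(0, 50), (51, 60), ...]

def bin_length(length, bins):
    """Return the (lo, hi) bin containing a repeat length (inclusive bounds)."""
    for lo, hi in bins:
        if lo <= length <= hi:
            return (lo, hi)
    return None  # outside all configured bins

print(bin_length(48, bins))   # (0, 50)
print(bin_length(150, bins))  # (138, 154)
```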

For detailed explanations, refer to usage.

  • Example command for running URLpipe with Singularity and the LSF executor:
nextflow run main.nf -c conf/sample_dataset1.config -profile singularity,lsf

Unlike the first example, the above command directs the pipeline to use Singularity and the LSF executor rather than Docker and the local executor via -profile singularity,lsf.

  7. Run your own analysis:
  • Typical commands:
# Supply configurations through command flags
nextflow run main.nf -profile <singularity/docker/lsf/local> --input <path_to_input_samplesheet_file> --outdir <path_to_result_dir> --allele_number <1/2> --length_mode <reference_align/distance_count> --ref <path_to_ref_file> ...

# Or include configurations into a single file, e.g. test.config
nextflow run main.nf -profile <singularity/docker/lsf/local> -c test.config
  • For help: # todo
nextflow run main.nf --help

See the usage documentation for all available options.

Documentation

The URLpipe workflow includes comprehensive documentation, covering both usage and output.

Credits

URLpipe was originally designed and written by Kai Hu, Michael Brodsky, and Lihua Julie Zhu. We also extend our gratitude to Rui Li, Haibo Liu, and Junhui Li for their extensive assistance in the development of this tool.

Bug report/Support

For help, bug reports, or feature requests, please create a GitHub issue by clicking here. If you would like to extend URLpipe for your own use, feel free to fork the repository.

Citations: todo

Please cite [URLpipe](to be added) if you use it for your research.

A template for the Methods section can be found here.

A complete list of references for the tools used by URLpipe can be found here.

Release notes

v0.1.0
  • initial release
