- Introduction
- Pipeline summary
- Quick start
- Documentation
- Credits
- Bug report/Support
- Citations
- Release notes
## Introduction

Gene editing via CRISPR/Cas9 technologies has emerged as a promising strategy for treating certain Repeat Expansion Diseases (REDs), including Huntington's Disease (HD), by permanently reducing the length of pathogenic expansions in DNA regions. One crucial step in this process is evaluating the editing results, which involves checking for undesired Insertions and Deletions (INDELs) near the targeted editing sites and determining the edited length of the DNA repeat expansion through sequencing. However, accurately determining the editing outcomes remains challenging due to PCR artifacts caused by polymerase slippage in repetitive DNA regions, decreased efficiency when amplifying larger fragments, and sequencing errors. URLpipe (UMI-based Repeat Length analysis pipeline) tackles this problem by leveraging Unique Molecular Identifiers (UMIs) to improve the accuracy of inferring gene editing outcomes.
Powered by Nextflow, URLpipe is designed to be user-friendly and portable, enabling execution across various compute infrastructures through Docker/Singularity technologies. URLpipe takes raw fastq files as input and generates statistical tables and plots that summarize the editing outcomes. Below is an overview of the design and the implemented sub-workflows/modules in URLpipe.
The development of the pipeline is guided by the nf-core TEMPLATE.
## Pipeline summary

URLpipe supports sequencing reads from both Illumina and Nanopore platforms. The relevant sub-workflows/modules are illustrated in the diagram below. For detailed instructions on configuring your analysis and examples, refer to the usage documentation.
The Illumina branch of the pipeline is structured into seven distinct sub-workflows, each with a specific role in processing data:
- INPUT_CHECK:
  - Validate the input files and configurations to ensure they meet the requirements for analysis.
- PREPROCESS_QC:
  - Perform preprocessing and quality control on the raw data.
  - Result folders: `1_preprocess` and `2_qc_and_umi`
- CLASSIFY_READ:
  - Categorize reads into different classes to facilitate downstream analysis.
  - Result folder: `3_read_category`
- REPEAT_STAT_DEFAULT and REPEAT_STAT_MERGE:
  - Determine repeat lengths by leveraging UMIs.
  - Result folder: `4_repeat_statistics`
- INDEL_STAT:
  - Analyze patterns of insertions and deletions around the repeat region.
  - Result folder: `5_indel_statistics`
- GET_SUMMARY:
  - Generate tables and plots summarizing the editing outcome.
  - Result folder: `6_summary`
Selected sub-workflows and their functionalities are summarized below. Refer to the output documentation (Results) for more details.
- Merge fastq files from different lanes (if any) that belong to the same library (`1a_lane_merge`).
- Extract the UMI from each read and append it to the read name (`1b_umi_extract`); see the sketch after this list.
- Trim adapter sequences (`1c_cutadapt`).
- Quality control using FastQC (`2a_fastqc`).
- Quality control by plotting read count per UMI (`2b_read_per_umi_cutadapt`).
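For intuition, below is a minimal Python sketch of the UMI-extraction idea (illustrative only: the UMI length and its position at the 5' end of the read are assumptions, and URLpipe's actual implementation may differ):

```python
def extract_umi(name, seq, qual, umi_len=12):
    """Move the first umi_len bases of a read into its name so that
    reads originating from the same molecule can be grouped later."""
    umi = seq[:umi_len]
    # e.g. "@read1" becomes "@read1_ACGTTGCAACGT"
    return f"{name}_{umi}", seq[umi_len:], qual[umi_len:]

print(extract_umi("@read1", "ACGTTGCAACGTCAGCAGCAGCAG", "I" * 24))
```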
- Determine whether each read maps to the predefined target region (on-locus) (`3a_classify_locus`).
- Classify on-locus reads based on the presence of INDELs around the repeat region (non-indel) (`3b_classify_indel`).
- For each non-indel read, determine whether it covers the entire repeat region (readthrough) (`3c_classify_readthrough`).

The readthrough reads are then used to determine the repeat lengths; the sketch below illustrates the classification cascade.
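Conceptually, the three steps form a decision cascade. The following Python sketch (hypothetical attribute names, not URLpipe's actual data model) illustrates the hierarchy:

```python
def classify_read(read):
    """Assign a read to one category of the CLASSIFY_READ hierarchy.
    The three boolean attributes below are illustrative placeholders."""
    if not read.maps_to_target_locus:      # 3a: on-locus check
        return "off-locus"
    if read.has_indel_near_repeat:         # 3b: INDEL check
        return "on-locus, indel"
    if not read.covers_entire_repeat:      # 3c: readthrough check
        return "on-locus, non-indel, non-readthrough"
    return "readthrough"  # used downstream for repeat-length determination
```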
In URLpipe, repeat length determination can be performed in two modes: DEFAULT mode, which uses only R1 reads, and MERGE mode, which merges R1 and R2 reads. For UMI correction, four methods are currently available: "mode", "mean", "least distance", and "square distance"; an illustrative sketch follows the list below.
- Compute the repeat length distribution (`4a_repeat_length_distribution`).
- Perform UMI correction to refine repeat length measurements (`4a_repeat_length_distribution`).
- Plot the repeat length distribution per UMI (`4b_repeat_length_distribution_per_umi`).
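To make the four correction methods concrete, here is a minimal Python sketch (a hypothetical function, not URLpipe's implementation; "least distance" and "square distance" are interpreted here as the call minimizing the summed absolute and squared distances, respectively):

```python
from statistics import mean, mode

def correct_repeat_length(lengths, method="least_distance"):
    """Collapse the repeat-length calls of all reads sharing one UMI
    into a single consensus length (illustrative sketch only)."""
    if method == "mode":
        return mode(lengths)                 # most frequent call
    if method == "mean":
        return round(mean(lengths))          # average call, rounded
    if method == "least_distance":
        # call minimizing the summed absolute distance to all other calls
        return min(lengths, key=lambda c: sum(abs(c - l) for l in lengths))
    if method == "square_distance":
        # call minimizing the summed squared distance to all other calls
        return min(lengths, key=lambda c: sum((c - l) ** 2 for l in lengths))
    raise ValueError(f"unknown method: {method}")

# e.g. four reads sharing one UMI reported these CAG-repeat lengths:
print(correct_repeat_length([48, 48, 48, 21]))  # -> 48
```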
Gather statistics for reads containing INDELs.
Obtain summary statistics from the CLASSIFY_READ, REPEAT_STAT_DEFAULT/MERGE, and INDEL_STAT results.
- Generate master statistic tables (`6a_master_table`).
- Generate summary plots (`6b_bin_plot`).
The setup for the Nanopore branch is quite similar to that of the Illumina branch, with the main difference being the inclusion of an optional PREPROCESS_NANOPORE sub-workflow specifically designed for pre-processing Nanopore data.
## Quick start

- Install nextflow (>=23.10.0).
- To avoid potential issues with dependency installation, all URLpipe dependencies are built into images. Therefore, you should install either Docker or Singularity (Apptainer) and specify `-profile singularity` or `-profile docker`, respectively, when running URLpipe. Otherwise, you will need to manually install all dependencies and ensure they are available on your local PATH, which is unlikely to be the case!
- Download the pipeline:

```bash
git clone https://github.com/hukai916/URLpipe.git
cd URLpipe
```
- Download a minimal test dataset:
  - Dataset1 comprises a subset of samples from a CRISPR editing experiment using the HQ50 (human) cell line. HQ50 cells contain two HTT alleles with differing CAG-repeat lengths: one with approximately 18 CAG/20Q and the other with approximately 48 CAG/50Q. The objective of the experiment is to examine the editing outcomes when treated with various DNA damage repair inhibitors. For demonstration purposes, six samples (three conditions, each with two replicates) have been selected:
    - Two samples with no electroporation (no_E), serving as the unedited control.
    - Two samples with no inhibitor (noINH_DMSO), serving as the edited control.
    - Two samples with the D103 inhibitor (D103_10uM) to assess its effect on the editing outcome.

```bash
wget https://www.dropbox.com/scl/fi/b4xspm0ydq4y1p8s9u55g/sample_dataset1.zip
unzip sample_dataset1.zip
```
- Edit the `replace_with_full_path` placeholder in the `assets/samplesheet_dataset1.csv` file to use the actual full path.
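For reference, a samplesheet row might look like the following (a hypothetical sketch assuming the common nf-core-style `sample,fastq_1,fastq_2` layout; consult `assets/samplesheet_dataset1.csv` for the authoritative columns):

```csv
sample,fastq_1,fastq_2
no_E_rep1,/full/path/to/no_E_rep1_R1.fastq.gz,/full/path/to/no_E_rep1_R2.fastq.gz
```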
- Test the pipeline with this minimal dataset1:
  - At least 8 GB of memory is recommended for dataset1.
  - By default, the local executor (your local computer) will be used (`-profile local`), meaning that all jobs will be executed on your local computer. Nextflow supports many other executors, including SLURM, LSF, etc. You can create a profile file to configure which executor to use (see the sketch after this list). Multiple profiles can be supplied, separated by commas, e.g. `-profile docker,lsf`.
  - Please check nf-core/configs to see what other custom configurations can be supplied.
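As an illustration, a minimal custom profile for an LSF cluster might look like this (a sketch only; the file name, profile name, and queue are hypothetical and must match your cluster's setup):

```nextflow
// conf/custom.config (hypothetical file name)
profiles {
    lsf {
        process.executor = 'lsf'   // submit jobs via LSF
        process.queue    = 'long'  // hypothetical queue name
    }
}
```

It could then be activated with, e.g., `nextflow run main.nf -c conf/custom.config -profile docker,lsf`.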
- Example command for running URLpipe with Docker and the local executor:

```bash
nextflow run main.nf -c conf/sample_dataset1.config -profile docker,local
```
By executing the above command:

- The local executor (`-profile local`) will be used.
- Docker (`-profile docker`) will be leveraged.
- The configurations specified via `-c conf/sample_dataset1.config` will be applied, which include:
  - `input = "./assets/samplesheet_dataset1.csv"`: input samplesheet file path
  - `outdir = "./results_dataset1"`: output directory path
  - `ref = "./assets/IlluminaHsQ50FibTrim_Ref.fa"`: reference file path
  - `ref_repeat_start = 69`: 1-based repeat start coordinate in the reference
  - `ref_repeat_end = 218`: 1-based repeat end coordinate in the reference
  - `ref_repeat_unit = "CAG"`: repeat unit in the reference
  - `length_mode = "reference_align"`: repeat length determination method
  - `umi_cutoffs = "1,3,5,7,10,30,100"`: UMI cutoffs for correction
  - `umi_correction_method = "least_distance"`: UMI correction method
  - `repeat_bins = "[(0,50), (51,60), (61,137), (138,154), (155,1000)]"`: number and range of bins to plot
  - `allele_number = 2`: number of alleles in the reference
  - `max_memory = "16.GB"`: maximum memory to use; do not exceed what your system has
  - `max_cpus = 16`: maximum number of CPUs to use; do not exceed what your system has
  - `max_time = "240.h"`: maximum running time
  - other module-specific configurations
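Put together, a config file like `conf/sample_dataset1.config` might contain entries along these lines (a sketch assembled from the parameters listed above and wrapped in the conventional Nextflow `params` scope; consult the actual file in the repository for the authoritative version):

```nextflow
params {
    input                 = "./assets/samplesheet_dataset1.csv"
    outdir                = "./results_dataset1"
    ref                   = "./assets/IlluminaHsQ50FibTrim_Ref.fa"
    ref_repeat_start      = 69
    ref_repeat_end        = 218
    ref_repeat_unit       = "CAG"
    length_mode           = "reference_align"
    umi_cutoffs           = "1,3,5,7,10,30,100"
    umi_correction_method = "least_distance"
    repeat_bins           = "[(0,50), (51,60), (61,137), (138,154), (155,1000)]"
    allele_number         = 2
    max_memory            = "16.GB"
    max_cpus              = 16
    max_time              = "240.h"
}
```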
For detailed explanations, refer to the usage documentation.
- Example command for running URLpipe with Singularity and the LSF executor:

```bash
nextflow run main.nf -c conf/sample_dataset1.config -profile singularity,lsf
```

Unlike the first example, the above command directs the pipeline to use Singularity and the LSF executor rather than Docker and the local executor, via `-profile singularity,lsf`.
- Note: Singularity images will be downloaded and saved to the `work/singularity` directory by default. It is recommended to configure the `NXF_SINGULARITY_CACHEDIR` environment variable or the `singularity.cacheDir` setting to store the images in a central location, for example as shown below.
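For example, the cache location can be set in a config file supplied via `-c` (the path below is a placeholder):

```nextflow
// central cache for downloaded Singularity images
singularity.cacheDir = "/path/to/central/singularity_cache"
```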
- Run your own analysis:
  - Typical commands:

```bash
# Supply configurations through command flags
nextflow run main.nf -profile <singularity/docker/lsf/local> --input <path_to_input_samplesheet_file> --outdir <path_to_result_dir> --allele_number <1/2> --length_mode <reference_align/distance_count> --ref <path_to_ref_file> ...

# Or include configurations in a single file, e.g. test.config
nextflow run main.nf -profile <singularity/docker/lsf/local> -c test.config
```

  - For help:

```bash
# todo
nextflow run main.nf --help
```
See the usage documentation for all of the available options.
## Documentation

The URLpipe workflow includes comprehensive documentation, covering both usage and output.
## Credits

URLpipe was originally designed and written by Kai Hu, Michael Brodsky, and Lihua Julie Zhu. We also extend our gratitude to Rui Li, Haibo Liu, and Junhui Li for their extensive assistance in the development of this tool.
## Bug report/Support

For help, bug reports, or feature requests, please create a GitHub issue by clicking here. If you would like to extend URLpipe for your own use, feel free to fork the repository.
## Citations

Please cite [URLpipe](to be added) if you use it for your research.

A template Methods section can be found here.

A complete list of references for the tools used by URLpipe can be found here.
## Release notes

- v0.1.0
  - initial release