polymeval

POLYMerase EVALuation (polymeval) is a Snakemake pipeline to evaluate polymerase-amplified HiFi read sets. For many samples, it is challenging to extract sufficient amounts of DNA to prepare PacBio High-Fidelity (HiFi) sequencing libraries. Examples include tissue biopsies, small-bodied organisms, and ethanol-preserved specimens, where the DNA can be severely fragmented and long fragments might be rare. Furthermore, certain samples tend to retain secondary metabolites even after library preparation, and these can inhibit the DNA polymerase that sequences the reads.

In such cases, long-range Polymerase Chain Reaction (PCR) or Multiple Displacement Amplification (MDA) has proven helpful for generating enough input material for downstream HiFi library preparation and sequencing. However, depending on the polymerase used for amplification, certain biases can affect the amplified product, notably GC dropouts and uneven coverage. polymeval helps to uncover such biases.

Pipeline Setup

Standard Mode

This is how the pipeline operates in standard mode (on each input sample):

(optional) read summary stats (with seqkit, rdeval and bbmap)
(optional) read k-mer statistics/histogram (with kmc and genomescope)
assembly (with hifiasm)
assembly error-rate, quality and k-mer completeness (with merqury)
assembly orthologue completeness (with compleasm)
(optional) read error estimation (with hifieval)

Downsample Mode

Since the number of sequenced nucleotides per polymerase read set might differ considerably, polymeval has a downsample mode:

downsample each sample to the smallest coverage of all samples (with rasusa)
as in standard mode
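
The downsampling target described above can be sketched as follows. This is a minimal illustration and not polymeval's own code: the function name `downsample_fractions`, the sample names and the base counts are all hypothetical, and the sketch assumes that "smallest coverage" is equivalent to the smallest total number of sequenced bases.

```python
def downsample_fractions(bases_per_sample):
    """Return the common base target and the fraction of each read set to keep.

    bases_per_sample: dict mapping a sample name to its total number of
    sequenced bases (hypothetical input, e.g. taken from a read summary).
    """
    if not bases_per_sample:
        raise ValueError("need at least one sample")
    if any(total <= 0 for total in bases_per_sample.values()):
        raise ValueError("every sample needs a positive base count")
    # The smallest read set defines the target for all other read sets.
    target = min(bases_per_sample.values())
    fractions = {name: target / total for name, total in bases_per_sample.items()}
    return target, fractions


# Made-up base counts, for illustration only.
example = {
    "sample_A": 1_200_000_000,
    "sample_B": 800_000_000,
    "sample_C": 1_000_000_000,
}
target, fractions = downsample_fractions(example)
for name in sorted(fractions):
    print(f"{name}: keep {fractions[name]:.2f} of the bases (target: {target} bases)")
```

The smallest read set keeps a fraction of 1.0, so it passes through unchanged, while larger read sets are reduced to the same number of bases.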

Combine Mode

Given that each polymerase used in amplification might have unique biases, combining different polymerase read sets might alleviate single-polymerase weaknesses. This is how polymeval works in combine mode:

select up to five samples that should be combined
downsample to the smallest coverage of all selected samples (and to 0.5, 0.4, 0.33, 0.2x of that depending on the number of combinations, with rasusa)
as in standard mode

Reference Mode

Once one of the modes above (most likely standard mode) has been run, it can be informative to compare the resulting assemblies to a 'gold standard reference'. This way, dataset-specific contig breaks and their underlying causes can be investigated. For this, polymeval has a reference mode:

select a reference genome
map reads of each sample to reference (with minimap2 and samtools)
map dataset-specific assemblies to reference (with minimap2, creating paf output files)
determine the amount of chimeric reads in the dataset (reading samtools-flags from bam files in R)
identify contig breaks relative to reference (with an R script using the PAF-reading function from SVbyEye)
(optional) determine coverage against reference (with PanDepth)
(optional) create summary plots displaying how coverage, GC content and contig breaks are related (R scripts)
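
To illustrate the chimeric-read step, here is a minimal Python sketch of how SAM flags can be used for this purpose. It is not the pipeline's R script. It assumes that a read counts as chimeric when it has at least one supplementary alignment (SAM flag bit 0x800), which may differ from the definition polymeval uses. The function name `chimeric_fraction` and the example records are hypothetical.

```python
# Bit values from the SAM specification.
UNMAPPED = 0x4
SUPPLEMENTARY = 0x800


def chimeric_fraction(records):
    """Fraction of mapped reads that have at least one supplementary alignment.

    records: iterable of (read_name, flag) pairs, e.g. columns 1 and 2 of
    `samtools view` output.
    """
    mapped_reads = set()
    chimeric_reads = set()
    for name, flag in records:
        if flag & UNMAPPED:
            continue  # unmapped reads do not count towards the denominator
        mapped_reads.add(name)
        if flag & SUPPLEMENTARY:
            chimeric_reads.add(name)
    if not mapped_reads:
        return 0.0
    return len(chimeric_reads) / len(mapped_reads)


# Made-up records: read1 has a supplementary alignment, read3 is unmapped,
# read4 has a primary and a secondary (0x100) alignment.
example = [
    ("read1", 0),
    ("read1", 2048),
    ("read2", 16),
    ("read3", 4),
    ("read4", 0),
    ("read4", 256),
]
print(f"chimeric fraction: {chimeric_fraction(example):.3f}")
```

Counting read names rather than alignment records avoids counting a read with several supplementary alignments more than once.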

All of this is done automatically. The user only has to provide a path to the input reads (in .fastq.gz format) and, for reference mode, the reference and dataset-specific assemblies (in .fa format).

Installation

Polymeval uses the snakemake pipeline language to string together different tools. Snakemake will take care of installing all necessary dependencies through conda or mamba. After cloning the repository, it is highly recommended to set up a new environment like so:

git clone https://github.com/casparbein/polymeval
conda env create -f polymeval/environment.yaml
conda activate polymeval
python3 polymeval/polymeval.py -h

potential dependencies

-> Compleasm, a reimplementation of BUSCO, is part of the pipeline. It is highly recommended to download the necessary databases once and store them somewhere accessible on the cluster. For the test case (see below), we need the saccharomycetes_odb12 database (install compleasm as described in its GitHub repository):

conda create -n compleasm -c conda-forge -c bioconda compleasm
conda activate compleasm
compleasm download -L compleasm_libs --odb odb12 saccharomycetes_odb12

## The library path is now ~/compleasm_libs

-> One optional step is to use KMC to count k-mers. Since KMC is not available through conda, if you want to include this step, you have to download and install KMC yourself. Otherwise, just skip this step.

-> For most of the analyses done in reference mode, PanDepth has to be installed on the system (it is highly recommended in any case because it is very fast). A pre-compiled binary is available that only has to be downloaded and unzipped; alternatively, PanDepth can be downloaded and installed from GitHub. Instructions can be found in the PanDepth repository. Once installed, the user can provide the absolute path to PanDepth through polymeval (--pandepth_path).

Polymeval is implemented with the slurm scheduling system. If you do not use slurm, you can run the pipeline locally using --local_run, but some of the jobs create large files and need considerable amounts of both RAM and storage space.

Test case

To test the pipeline, I used data generated by Quail et al. (2024), who sequenced yeast samples that were amplified with different polymerases. For a minimal example, you can download the following three files: ERR10357093.fastq.gz, ERR10357086.fastq.gz, ERR10357098.fastq.gz

Download with wget:

mkdir yeast_reads
cd yeast_reads
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR103/093/ERR10357093/ERR10357093.fastq.gz
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR103/086/ERR10357086/ERR10357086.fastq.gz
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR103/098/ERR10357098/ERR10357098.fastq.gz

Unfortunately, wget downloads are not available on all clusters (depending on the firewall), so you can either download them manually from ENA (by clicking on the link under 'Generated FASTQ files: FTP'):
https://www.ebi.ac.uk/ena/browser/view/ERR10357098
https://www.ebi.ac.uk/ena/browser/view/ERR10357086
https://www.ebi.ac.uk/ena/browser/view/ERR10357093

Or download them with the sra toolkit:

mkdir yeast_reads
cd yeast_reads
fastq-dump --accession ERR10357093 --gzip
fastq-dump --accession ERR10357086 --gzip
fastq-dump --accession ERR10357098 --gzip

Standard Mode

Now, we can just run polymeval in standard mode:

python3 ~/polymeval/polymeval.py \
--standard \
--directory_name yeast_standard_test \
--dry_run \
--readstats \
--hifieval \
--compleasm_db saccharomycetes_odb12 \
--compleasm_dp_path ~/compleasm_libs \
--input_reads ~/yeast_reads/

This will run a "dry-run", which just means that it will test whether snakemake can locate all necessary input files and whether the input and output steps are clear. If this works, you can then start a run:

python3 ~/polymeval/polymeval.py \
--standard \
--directory_name yeast_standard_test \
--run_snakemake \
--readstats \
--hifieval \
--compleasm_db saccharomycetes_odb12 \
--compleasm_dp_path ~/compleasm_libs \
--input_reads ~/yeast_reads/

Depending on the cluster, this should finish within 30 minutes to 2 hours.

Downsample Mode

After the standard run has finished, we can now try downsampling the samples to see how the polymerase read sets perform when the number of input nucleotides is the same. The standard run will have created a seqkit summary file, which is used to determine the downsampling base target:

ls ~/yeast_standard_test/out/stats/seqkit_all.tsv

If this file exists, you can simply run (first as dry-run, then as the actual run):

python3 ~/polymeval/polymeval.py \
--downsample \
--directory_name yeast_downsample_test \
--dry_run \
--readstats \
--hifieval \
--compleasm_db saccharomycetes_odb12 \
--compleasm_dp_path ~/compleasm_libs \
--input_reads ~/yeast_reads/ \
--seqkit_file_path ~/yeast_standard_test/out/stats/seqkit_all.tsv

Combine Mode

Now we can investigate whether read set combinations perform better than single read sets. In this case, we have three read sets, so we can theoretically combine them as follows:

X1 = nucleotides of smallest read set
X2 = all other read sets downsampled to X1
X3 = all read sets downsampled to 0.5x X1
X4 = all read sets downsampled to 0.33x X1

The possible combinations are thus:

X1 (the smallest read set)
X2 (downsampled read sets, in this case there are two such downsampled sets)
X3 + X3 (combining two downsampled read sets, in this case there are three such combinations)
X4 + X4 + X4 (combining three downsampled read sets, in this case there is one such combination)
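
The enumeration above can be reproduced with a short Python sketch. This is only an illustration of the counting, not the pipeline's implementation: the function name `enumerate_combinations` and the `FRACTION_BY_SIZE` table are hypothetical, and the table covers only the three-sample case with the factors given above.

```python
from itertools import combinations

# Fraction of X1 that each read set contributes, per combination size,
# using the factors listed above for three read sets.
FRACTION_BY_SIZE = {1: 1.0, 2: 0.5, 3: 0.33}


def enumerate_combinations(samples):
    """List every combination of read sets with its per-sample fraction of X1."""
    if len(samples) > max(FRACTION_BY_SIZE):
        raise ValueError("this sketch only covers up to three samples")
    plan = []
    for size in range(1, len(samples) + 1):
        for combo in combinations(samples, size):
            plan.append((combo, FRACTION_BY_SIZE[size]))
    return plan


samples = ["ERR10357093", "ERR10357086", "ERR10357098"]
for combo, fraction in enumerate_combinations(samples):
    print(" + ".join(combo), f"(each at {fraction}x X1)")
```

For three samples this yields seven data sets: three single read sets (X1 plus the two X2 sets), three pairs and one triple.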

All this is done automatically with:

python3 ~/polymeval/polymeval.py \
--combine \
--directory_name yeast_combine_test \
--dry_run \
--hifieval \
--compleasm_db saccharomycetes_odb12 \
--compleasm_dp_path ~/compleasm_libs \
--input_reads ~/yeast_reads/ \
--seqkit_file_path ~/yeast_standard_test/out/stats/seqkit_all.tsv \
--samples ERR10357093,ERR10357086,ERR10357098
