POLYMerase EVALuation (polymeval) is a Snakemake pipeline to evaluate polymerase-amplified HiFi read sets. For many samples, it is challenging to extract enough DNA to prepare PacBio High-Fidelity (HiFi) sequencing libraries. Examples include tissue biopsies, small-bodied organisms, and ethanol-preserved specimens, where the DNA can be severely fragmented and long fragments may be rare. Furthermore, certain samples tend to retain secondary metabolites even after library preparation, and these can inhibit the DNA polymerase that sequences the reads.
In such cases, long-range Polymerase Chain Reaction (PCR) or Multiple Displacement Amplification (MDA) has proven helpful for generating enough input material for downstream HiFi library preparation and sequencing. However, depending on the polymerase used for amplification, certain biases can affect the amplified product, notably GC dropouts and uneven coverage. Polymeval helps to uncover such biases.
This is how the pipeline operates in standard mode (on each input sample):
- (optional) read summary stats (with seqkit, rdeval and bbmap)
- (optional) read k-mer statistics/histogram (with kmc and genomescope)
- assembly (with hifiasm)
- assembly error-rate, quality and k-mer completeness (with merqury)
- assembly orthologue completeness (with compleasm)
- (optional) read error estimation (with hifieval)
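To make the k-mer completeness step more concrete, here is a minimal Python sketch of the idea: collect the "reliable" k-mers of a read set (those seen at least a few times) and ask what fraction of them also occurs in the assembly. This is a simplified illustration, not how merqury is implemented. The function names, the tiny `k`, and the `min_count` threshold are assumptions chosen for readability.

```python
from collections import Counter

# Translation table used to build reverse complements.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")
_VALID = set("ACGT")


def canonical_kmers(seq, k):
    """Yield canonical k-mers (the smaller of a k-mer and its reverse complement)."""
    seq = seq.upper()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if set(kmer) <= _VALID:  # skip k-mers containing N or other symbols
            yield min(kmer, kmer.translate(_COMPLEMENT)[::-1])


def kmer_completeness(reads, assembly, k=5, min_count=2):
    """Fraction of reliable read k-mers that are also present in the assembly.

    'Reliable' is an assumption here: k-mers seen at least min_count times.
    """
    read_counts = Counter(km for read in reads for km in canonical_kmers(read, k))
    reliable = {km for km, count in read_counts.items() if count >= min_count}
    if not reliable:
        return 0.0
    assembly_kmers = {km for contig in assembly for km in canonical_kmers(contig, k)}
    return len(reliable & assembly_kmers) / len(reliable)
```

Because the k-mers are canonical, an assembly that represents the reverse strand of the reads still counts as complete. Real tools use much larger k (for example around 21) and disk-backed counting.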
Since the number of sequenced nucleotides per polymerase read set might differ considerably, polymeval has a downsample mode:
- downsample each sample to the smallest coverage of all samples (with rasusa)
- as in standard mode
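The choice of the downsampling target can be sketched as follows. The example assumes that the summary file has the tab-separated layout produced by `seqkit stats -T` (columns `file` and `sum_len`, among others); whether polymeval's own summary file has exactly this layout is an assumption, and the sample names and numbers below are made up for illustration.

```python
import csv
import io


def downsample_target(tsv_text):
    """Return (file, total_bases) of the read set with the fewest bases."""
    rows = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    # sum_len is the total number of bases per file; strip thousands separators if present.
    totals = {row["file"]: int(row["sum_len"].replace(",", "")) for row in rows}
    smallest = min(totals, key=totals.get)
    return smallest, totals[smallest]


# Hypothetical summary table (values are invented, not real results).
example = (
    "file\tformat\ttype\tnum_seqs\tsum_len\tmin_len\tavg_len\tmax_len\n"
    "sample_a.fastq.gz\tFASTQ\tDNA\t100000\t1200000000\t50\t12000.0\t40000\n"
    "sample_b.fastq.gz\tFASTQ\tDNA\t80000\t900000000\t60\t11250.0\t38000\n"
    "sample_c.fastq.gz\tFASTQ\tDNA\t150000\t1500000000\t45\t10000.0\t42000\n"
)

target_file, target_bases = downsample_target(example)
```

For read sets of the same genome, the smallest number of bases corresponds to the smallest coverage, so every other read set would be subsampled down to `target_bases`.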
Given that each polymerase used in amplification might have unique biases, combining different polymerase read sets might alleviate single-polymerase weaknesses. This is how polymeval works in combine mode:
- select up to five samples that should be combined
- downsample to the smallest coverage of all selected samples (and to 0.5, 0.4, 0.33, 0.2x of that depending on the number of combinations, with rasusa)
- as in standard mode
Once one of the modes above (most likely standard) has been run, it can be informative to compare the resulting assemblies to a 'gold standard reference'. This way, dataset-specific contig breaks and their underlying causes can be investigated. For this, polymeval has a reference mode:
- select a reference genome
- map reads of each sample to reference (with minimap2 and samtools)
- map dataset-specific assemblies to reference (with minimap2, creating paf output files)
- determine the amount of chimeric reads in the dataset (reading samtools-flags from bam files in R)
- identify contig breaks relative to reference (with an R script using the PAF-reading function from SVbyEye)
- (optional) determine coverage against reference (with PanDepth)
- (optional) create summary plots displaying how coverage, GC content and contig breaks are related (R scripts)
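The chimeric-read step relies on SAM flags. The sketch below shows one plausible way to do this in Python: a read counts as chimeric if at least one of its alignment records carries the supplementary flag (0x800). The exact criterion used by polymeval's R script is not given here, so this definition, like the function name, is an assumption.

```python
SUPPLEMENTARY = 0x800  # SAM flag bit: supplementary (split) alignment
UNMAPPED = 0x4         # SAM flag bit: read is unmapped


def chimeric_fraction(sam_lines):
    """Fraction of mapped reads that have at least one supplementary alignment."""
    mapped, split = set(), set()
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):
            continue  # skip blank lines and header lines
        fields = line.rstrip("\n").split("\t")
        name, flag = fields[0], int(fields[1])
        if flag & UNMAPPED:
            continue
        mapped.add(name)
        if flag & SUPPLEMENTARY:
            split.add(name)
    return len(split) / len(mapped) if mapped else 0.0
```

The function takes any iterable of SAM text lines, for example the output of `samtools view` on a BAM file read line by line.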
All of this is done automatically; the user only has to provide a path to the input reads (in .fastq.gz format) and, for reference mode, the reference and the dataset-specific assemblies (in .fa format).
Polymeval uses the snakemake pipeline language to string together different tools. Snakemake will take care of installing all necessary dependencies through conda or mamba. After cloning the repository, it is highly recommended to set up a new environment like so:
git clone https://github.com/casparbein/polymeval
conda env create -f polymeval/environment.yaml
conda activate polymeval
python3 polymeval/polymeval.py -h
-> Compleasm, a reimplementation of BUSCO, is part of the pipeline. It is highly recommended to download the necessary databases once and store them somewhere accessible on the cluster. For the test case (see below), we need the saccharomycetes_odb12 database (download compleasm as specified on the GitHub):
conda create -n compleasm -c conda-forge -c bioconda compleasm
conda activate compleasm
compleasm download -L compleasm_libs --odb odb12 saccharomycetes_odb12
## The library path is now ~/compleasm_libs
-> One optional step is to use KMC to count k-mers. Since KMC is not available through conda, if you want to include this step, you have to download and install KMC yourself. Otherwise, just skip this step.
-> For most of the analyses done in reference mode, PanDepth has to be installed on the system (this is highly recommended anyway because it is very fast).
A pre-compiled binary is available that only needs to be downloaded and unzipped; alternatively, PanDepth can be downloaded and installed from GitHub.
Instructions can be found here: PanDepth. Once installed, the user can provide the absolute path to PanDepth through polymeval (--pandepth_path).
Polymeval is implemented for the Slurm scheduling system. If you do not use Slurm, you can run the pipeline locally using --local_run, but some of the jobs create large files and need considerable amounts of both RAM and storage space.
To test the pipeline, I used data generated by Quail et al. (2024), who sequenced yeast samples that were amplified with different polymerases. For a minimal example, you can download the following three files: ERR10357093.fastq.gz, ERR10357086.fastq.gz, ERR10357098.fastq.gz
Download with wget:
mkdir yeast_reads
cd yeast_reads
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR103/093/ERR10357093/ERR10357093.fastq.gz
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR103/086/ERR10357086/ERR10357086.fastq.gz
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR103/098/ERR10357098/ERR10357098.fastq.gz
Unfortunately, wget downloads are not available on all clusters (depending on the firewall), so you can either download them manually from ENA (by clicking on the link under 'Generated FASTQ files: FTP'):
https://www.ebi.ac.uk/ena/browser/view/ERR10357098
https://www.ebi.ac.uk/ena/browser/view/ERR10357086
https://www.ebi.ac.uk/ena/browser/view/ERR10357093
Or download them with the sra toolkit:
mkdir yeast_reads
cd yeast_reads
fastq-dump --accession ERR10357093 --gzip
fastq-dump --accession ERR10357086 --gzip
fastq-dump --accession ERR10357098 --gzip
Now, we can just run polymeval in standard mode:
python3 ~/polymeval/polymeval.py \
--standard \
--directory_name yeast_standard_test \
--dry_run \
--readstats \
--hifieval \
--compleasm_db saccharomycetes_odb12 \
--compleasm_dp_path ~/compleasm_libs \
--input_reads ~/yeast_reads/
This will run a "dry-run", which just means that it will test whether snakemake can locate all necessary input files and whether the input and output steps are clear. If this works, you can then start a run:
python3 ~/polymeval/polymeval.py \
--standard \
--directory_name yeast_standard_test \
--run_snakemake \
--readstats \
--hifieval \
--compleasm_db saccharomycetes_odb12 \
--compleasm_dp_path ~/compleasm_libs \
--input_reads ~/yeast_reads/
Depending on the cluster, this should finish within 30 min to 2 h.
After the standard run has finished, we can try downsampling the samples to see how the polymerase read sets perform when the number of input nucleotides is the same. The standard run will have created a seqkit summary file, which is used to determine the downsampling base target:
ls ~/yeast_standard_test/out/stats/seqkit_all.tsv
If this file exists, you can simply run (first as dry-run, then as the actual run):
python3 ~/polymeval/polymeval.py \
--downsample \
--directory_name yeast_downsample_test \
--dry_run \
--readstats \
--hifieval \
--compleasm_db saccharomycetes_odb12 \
--compleasm_dp_path ~/compleasm_libs \
--input_reads ~/yeast_reads/ \
--seqkit_file_path ~/yeast_standard_test/out/stats/seqkit_all.tsv
Now we can investigate whether read set combinations perform better than single read sets.
In this case, we have three read sets, so we can combine them as follows:
X1 = nucleotides of smallest read set
X2 = all other read sets downsampled to X1
X3 = all read sets downsampled to 0.5x X1
X4 = all read sets downsampled to 0.33x X1
The possible combinations are thus:
X1,
X2 (downsampled read sets, in this case there are two such downsampled sets),
X3 + X3 (combining two downsampled read sets, in this case there are three such combinations)
X4 + X4 + X4 (combining three downsampled read sets, in this case there is one such combination)
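The enumeration above can be reproduced with a short Python sketch. It only covers up to three read sets, using the fractions stated in the text (1, 0.5 and 0.33 of the smallest read set); the function name and the returned structure are hypothetical and are not taken from polymeval's code.

```python
from itertools import combinations

# Fraction of the smallest read set (X1) that each member of a combination
# is downsampled to, keyed by combination size. Values follow the text above.
FRACTIONS = {1: 1.0, 2: 0.5, 3: 0.33}


def combination_plan(samples):
    """List every (combination, per-sample fraction of X1) for up to three samples."""
    if not 1 <= len(samples) <= max(FRACTIONS):
        raise ValueError("this sketch only covers one to three read sets")
    plan = []
    for size in range(1, len(samples) + 1):
        for combo in combinations(samples, size):
            plan.append((combo, FRACTIONS[size]))
    return plan


plan = combination_plan(["ERR10357093", "ERR10357086", "ERR10357098"])
```

For three read sets this yields three single sets (the smallest one plus the two downsampled to its size), three pairs and one triple, matching the list above.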
All this is done automatically with:
python3 ~/polymeval/polymeval.py \
--combine \
--directory_name yeast_combine_test \
--dry_run \
--hifieval \
--compleasm_db saccharomycetes_odb12 \
--compleasm_dp_path ~/compleasm_libs \
--input_reads ~/yeast_reads/ \
--seqkit_file_path ~/yeast_standard_test/out/stats/seqkit_all.tsv \
--samples ERR10357093,ERR10357086,ERR10357098