Skip to content
SAMtoBAM edited this page Jan 15, 2026 · 26 revisions

Post-Assembly Quality manager (PAQman)

NOT UPDATED SINCE v1.0; PLEASE REFER TO GITHUB README

PAQman is combines a set of tools (and PAQman scripts) designed to comprehensively evaluate a de-novo genome assembly using a range of important metrics and without the need for a reference
Note: PAQman has been primarily designed for long-read reference quality assemblies

PAQman evaluates 7 important metrics:
Contiguity; Gene Content; Completeness; Error rate; Correctness; Coverage; Telomerality*.


Installation and quick start

Conda installation

conda install samtobam::paqman

Quick run

paqman.sh -a assembly.fa -l long-reads.fq.gz
paqman.sh -a assembly.fa -l long-reads.fq.gz -x ont -1 illumina.R1.fq.gz -2 illumina.R2.fq.gz -b eukaryota -w 30000 -s 10000 -r TTAGGG -p genome -o paqman_output -c yes

Required inputs:
-a | --assembly     Genome assemly in fasta format (*.fa / *.fasta / *.fna) and can be gzipped (*.gz)
-l | --longreads    Long reads used for assembly in fastq or fasta format  (*.fa / *.fasta / *.fna / *.fastq / *.fq) and can be gzipped (*.gz)

Recommended inputs:
-x | --platform     Long-read technology to determine mapping mapping parameters. Choose between 'ont' or 'pacbio-hifi' or 'pacbio-clr' (default: ont)
-b | --buscodb      Name of BUSCO database to be used (default: eukaryota)
-t | --threads      Number of threads for tools that accept this option (default: 1)
-r | --repeat       Telomeric repeat pattern (default: TTAGGG)

Optional parameters:
-1 | --pair1        Paired end illumina reads in fastq format; first pair. Used by Merqury, CRAQ and coverage analysis (Recommended). Can be gzipped (*.gz)
-2 | --pair2        Paired end illumina reads in fastq format; second pair. Used by Merqury, CRAQ and coverage analysis (Recommended). Can be gzipped (*.gz)
-w | --window       Number of basepairs for window averaging for coverage (default: 30000)
-s | --slide        Number of basepairs for the window to slide for coverage (default: 10000)
-p | --prefix       Prefix for output (default: name of assembly file (-a) before the fasta suffix)
-o | --output       Name of output folder for all results (default: paqman_output)
-c | --cleanup      Remove a large number of files produced by each of the tools that can take up a lot of space. Choose between 'yes' or 'no' (default: 'yes')
-h | --help         Print this help message

The output:

Summary stats: 'summary_stats.tsv':

Column     Header Description
Column 01 strain prefix given to the output files (-p)
Column 02 assembly prefix from the fasta file without suffix (-a)
Column 03 quast_#contigs Number of total contigs
Column 04 quast_#contigs>10kb Number of contigs >10 kb
Column 05 quast_assembly_N50 Assembly N50
Column 06 quast_assembly_N90 Assembly N90
Column 07 quast_largest_contig Largest contig in the assembly
Column 08 BUSCO_db The BUSCO database used to evaluate the assembly
Column 09 BUSCO_total Total number of BUSCOs in the database
Column 10 BUSCO_complete_single BUSCOs identified as complete and as a single copy
Column 11 BUSCO_fragmented BUSCOs identified as fragmented
Column 12 BUSCO_missing BUSCOs not identified
Column 13 merqury_completeness(%) A k-mer estimation of compmeteness
Column 14 merqury_qv(phred) A k-mer estimation of the genome wide error rate
Column 15 CRAQ_average_CRE(%) Percentage of assembly without small regional errors
Column 16 CRAQ_average_CSE(%) Percentage of assembly without large structural errors
Column 17 coverage_normal(%) Percentage of the genome with a median coverage within two standard deviations from the genome wide median
Column 18 telomeric_ends Number of contig ends with telomeric repeats
Column 19 telomeric_ends(%) Percentage of contig ends with telomeric repeats
Column 20 t2t_contigs Number of contigs with telomeric repeats at both ends

*Telomerality:

*I am using the term to describe stats about how many of the assembled contig have reached telomere sequences giving confidence of structural completeness at contig ends in repeat regions
Telomerality is calculated using a few simple steps specific to PAQman

  1. All exact single repeats are identified (seqkit locate --ignore-case -p ${telomererepeat})
  2. Coordinates of single repeats are merged if withint 7 bp (allowing for 1 repeat to deviate mildly)
  3. Only keep regions where at least two consecutive repeats were found (i.e. only keep region > 2*repeat length)
    Can find all the coordinates for telomeric regions (including interstitial) in the bed file with explanatory header: 'telomerality/telomeres.bed'
  4. Contig ends are labelled in 3 ways
    A. 'telomeric': Coordinates for a telomeric repeat are at least within 0.75*length from the end (e.g. a 100 bp long telomeric repeat region with within 75bp of a contig end)
    B. 'distant': >0.75*length bp away but within 5kb
    C. 'absent': >5kb from the end or no repeats identified in contig
    Can find these classifications (and coordinates/distance from edge etc) for each contig end in the tsv file with explanatory header: 'telomerality/telomeres.classification.tsv'

Comparing PAQman output across multiple assemblies

PAQman also has a tool (paqplots) to compare and analyse summary files from multiple assemblies
This makes it easier to benchmark tools and parameters using all the variables analysed
Simply provide paqplots with a combined summary file (with the same header) or a list of paths to the summary files

	paqplots.sh -s summary_file.tsv -p genome -o paqman_output
	OR
	paqplots.sh -l list_of_summary_files.txt -p genome -o paqman_output

	
	Required inputs:
	-s | --summary     A PAQman summary file with multiple assemblies combined and the same header
	-l | --list		   A list of paths to multiple summary files
	
	Optional parameters:
	-p | --prefix       Prefix for output (default: paqplot)
	-o | --output       Name of output folder for all results (default: paqplot_output)
	-h | --help         Print this help message
	

paqplot will provide two images (alongside the R scripts used to generate them)
First: Figures containing all of the raw variables compared
A radar plot (thanks ggradar!) for stats of percentages and a lollipop plot for all others (due to the wildly different scales)

Second: Another radar plot (thanks ggradar again!) containing a subset of stats and their relative values (i.e. all stats divided by the maximum value for that stat)
In this example, all stats should be maximised except for contig count hence the PACman like shape in blue below

To easily combined multiple outputs run:

##concatenate all the output files and give them the same header
echo "strain;assembly;quast_#contigs;quast_#contigs>10kb;quast_assembly_size;quast_assembly_N50;quast_assembly_N90;quast_largest_contig;BUSCO_db;BUSCO_total;BUSCO_complete;BUSCO_complete_single;BUSCO_fragmented;BUSCO_missing;merqury_completeness(%);merqury_qv(phred);CRAQ_average_CRE(%);CRAQ_average_CSE(%);telomeric_ends;telomeric_ends(%);t2t_contigs" | tr ';' '\t' >  summary_stats.combined.tsv

##just change pathn to the path to your summary files
for summaryfile in path1 path2 path3 path4
do
tail -n+2 ${summaryfile} 
done >> summary_stats.combined.tsv

##you can now input the summary_stats.combined.tsv file to paqplots (-s)