- Install through pip:
pip install --user --force-reinstall hifieval
# This installs hifieval into $HOME/.local/lib/python{your_version}/site-packages.
# Then add path/to/your/site-packages to your $PATH and run the tool:
export PATH=path/to/your/site-packages:$PATH
# The command below helps you locate the package
pip show hifieval
- Install through conda:
conda install hifieval
If running from raw read data using an existing EC tool:
- Install one error correction/assembly tool, for example hifiasm
- Install minimap2
If just running hifieval.py:
- You can download PAF files from Hifieval output data
- You can also download the raw Hifieval output of EC tools on CHM13 HiFi reads from Hifieval output data to compare against the performance of your own EC tool.
# get test data
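# -O sets the local file name; without it wget keeps "?download=1" in the name
wget -O ecoli.reads.fastq "https://zenodo.org/record/7799845/files/ecoli.reads.fastq?download=1" # simulated raw reads
wget -O ecoli.ref.fasta "https://zenodo.org/record/7799845/files/ecoli.ref.fasta?download=1" # reference genome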
# get error corrected reads
hifiasm -o ecoli.asm.hifiasm --primary -t 10 --write-ec ecoli.reads.fastq 2> ecoli.asm.hifiasm.log
# get alignment paf files
minimap2 -t 8 -cx map-hifi --secondary=no --paf-no-hit --cs ecoli.ref.fasta ecoli.reads.fastq > ecoli.raw.paf
minimap2 -t 8 -cx map-hifi --secondary=no --paf-no-hit --cs ecoli.ref.fasta ecoli.asm.hifiasm.ec.fa > ecoli.hifiasm.paf
# get evaluation files
hifieval.py -o ecoli.hifiasm -r ecoli.raw.paf -c ecoli.hifiasm.paf
hifieval [options] -r <raw.paf> -c <corrected.paf>
Hifieval is a tool to evaluate long-read error correction, mainly with PacBio High-Fidelity reads (HiFi reads). Run hifieval to see the available options.
The tool takes two PAF files as input: one with the raw reads aligned to the reference genome, the other with the corrected reads aligned to the same reference. PAF is a text format describing the approximate mapping positions between two sets of sequences.
Each PAF file encodes the differences in the sequence alignments in short form, indicating substitutions, insertions, and deletions. The metrics of error correction are:
- OC (over-correction): errors that appear in the corrected reads but not in the raw reads
- UC (under-correction): errors in the raw reads that are still present in the corrected reads
- CC (correct-correction): errors that are in the raw reads but not in the corrected reads
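The three metrics amount to set operations on the errors seen in a raw read and in its corrected counterpart. The sketch below illustrates that idea only; the (reference_position, error_type) tuples and the function name classify_errors are assumptions made for the example, not hifieval's internal representation.

```python
def classify_errors(raw_errors, corrected_errors):
    """Split error sites into OC, UC and CC using set operations.

    Both arguments are iterables of hashable error descriptors, here
    (reference_position, error_type) tuples. This layout is hypothetical.
    """
    raw = set(raw_errors)
    corrected = set(corrected_errors)
    return {
        "oc": corrected - raw,  # only in the corrected read: introduced by correction
        "uc": raw & corrected,  # in both reads: survived correction
        "cc": raw - corrected,  # only in the raw read: removed by correction
    }


# Made-up error sites for one read, before and after correction.
raw_read = [(100, "sub"), (250, "ins"), (400, "del")]
corrected_read = [(250, "ins"), (512, "sub")]

result = classify_errors(raw_read, corrected_read)
counts = {name: len(sites) for name, sites in result.items()}
print(counts)  # {'oc': 1, 'uc': 1, 'cc': 2}
```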
Examples of Error Correction (EC) tools that output error-corrected reads:
- hifiasm:
hifiasm -o <prefix> --write-ec -t32 <read_files> 2> <prefix>.log
- LJA:
lja -o <output_dir> --reads <reads_file> [--reads <reads_file2> …]
- Verkko:
verkko -d <output_dir> --hifi <reads_files>
If the EC tool produces homopolymer-compressed (HPC) corrected reads, use seqtk to apply HPC to the raw reads and the reference as well:
seqtk hpc <file>
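For readers unfamiliar with homopolymer compression, the idea is to collapse every run of identical bases into a single base. The following is a minimal sketch of the concept, not of seqtk's implementation; the function name hpc is hypothetical, and details such as case handling or ambiguous bases may differ in seqtk.

```python
from itertools import groupby


def hpc(seq):
    """Collapse each run of identical characters into one character."""
    # groupby yields one (key, run) pair per maximal run of equal characters.
    return "".join(base for base, _run in groupby(seq))


compressed = hpc("GATTTACCAAAG")
print(compressed)  # GATACAG
```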
Minimap2 generates the PAF files with the command below; the --cs option is required:
./minimap2 -t8 -cx map-hifi --secondary=no --paf-no-hit --cs <ref_fasta_file> <read_files> > <prefix>.paf
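With --cs, minimap2 adds a cs tag to each PAF record. In its short form, :N marks N matching bases, *xy a substitution of reference base x by query base y, +seq an insertion, and -seq a deletion. The sketch below shows one way to turn such a string into a list of differences with reference coordinates. The function name cs_errors and the tuple layout are assumptions for illustration; the sketch does not handle the long form (=) or intron (~) operators and is not hifieval's parser.

```python
import re

# One alternative per short-form cs operator handled here.
CS_OP = re.compile(r":(\d+)|\*([a-z])([a-z])|\+([a-z]+)|-([a-z]+)")


def cs_errors(cs, ref_start=0):
    """Return (ref_pos, kind, bases) for each difference in a short-form cs string."""
    if cs.startswith("cs:Z:"):
        cs = cs[len("cs:Z:"):]  # drop the SAM-style tag prefix used in PAF
    pos = ref_start
    events = []
    for m in CS_OP.finditer(cs):
        match_len, ref_base, qry_base, ins, dele = m.groups()
        if match_len is not None:
            pos += int(match_len)  # identical block, advance on the reference
        elif ref_base is not None:
            events.append((pos, "sub", ref_base + qry_base))
            pos += 1
        elif ins is not None:
            # Insertions consume no reference bases; record the next reference position.
            events.append((pos, "ins", ins))
        else:
            events.append((pos, "del", dele))
            pos += len(dele)
    return events


events = cs_errors("cs:Z::6-ata:10+gtc:4*at:3")
print(events)  # [(6, 'del', 'ata'), (19, 'ins', 'gtc'), (23, 'sub', 'at')]
```

In a PAF record, ref_start would come from the target start column, so that positions are comparable between the raw and the corrected alignment of the same read.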
In addition to the FPR and TPR of the corrections, errors in homopolymer (HP) regions can be evaluated, provided the assembly tool does not perform HPC on the raw reads during the error correction step. To do so, supply the reference with -h:
hifieval [options] -h <reference_file> -r <raw.paf> -c <corrected.paf>
HP regions of different lengths are identified, and the UC/OC events that fall within these regions are counted. An error rate is then calculated for each HP length.
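As a rough illustration of the first part of this step, the sketch below finds homopolymer runs in a reference sequence and counts how many error positions fall inside runs of each length. The function names, the min_len threshold, and the example positions are all assumptions; how hifieval defines HP regions and normalizes the counts into a rate is not reproduced here.

```python
from itertools import groupby


def homopolymer_runs(seq, min_len=3):
    """Return (start, length, base) for every run of at least min_len identical bases."""
    runs = []
    pos = 0
    for base, group in groupby(seq.upper()):
        length = sum(1 for _ in group)
        if length >= min_len:
            runs.append((pos, length, base))
        pos += length
    return runs


def errors_per_hp_length(runs, error_positions):
    """Count error positions that fall inside runs, keyed by run length."""
    counts = {}
    for start, length, _base in runs:
        hits = sum(1 for p in error_positions if start <= p < start + length)
        if hits:
            counts[length] = counts.get(length, 0) + hits
    return counts


reference = "ACGGGTTAAAAC"           # toy reference sequence
runs = homopolymer_runs(reference)   # min_len=3 is an arbitrary choice
hp_counts = errors_per_hp_length(runs, [3, 6, 8, 10])  # made-up error positions
print(runs)       # [(2, 3, 'G'), (7, 4, 'A')]
print(hp_counts)  # {3: 1, 4: 2}
```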
Hifieval writes the following output files:
- summary.tsv: the most detailed summary of EC performance, suitable for any downstream analysis
  - contains 12 columns: readName, raw_mapped_chr, raw_start, raw_end, raw_mq, corrected_mapped_chr, corrected_start, corrected_end, corrected_mq, num_oc, num_uc, num_cc
- rdlvl.eval.tsv
  - counts how many corrected reads have 1 OC/UC, 2 OC/UC, etc., for each chromosome and for all chromosomes
- metric.eval.tsv
  - overall metrics for each chromosome and for all chromosomes
- hp.ErrorRate.tsv
  - contains the error rate for each homopolymer length
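For downstream analysis, summary.tsv can be read with a few lines of Python. The sketch below sums the three count columns over all reads. It assumes the file is tab-separated, that the columns appear in the order listed above, and that the counts are integers; whether the file starts with a header row is not stated here, so the sketch skips a first line beginning with readName if one is present. The function name total_counts and the example rows are made up.

```python
import csv
import io

COLUMNS = [
    "readName", "raw_mapped_chr", "raw_start", "raw_end", "raw_mq",
    "corrected_mapped_chr", "corrected_start", "corrected_end", "corrected_mq",
    "num_oc", "num_uc", "num_cc",
]


def total_counts(handle):
    """Sum num_oc, num_uc and num_cc over all rows of a summary.tsv-like file."""
    totals = {"num_oc": 0, "num_uc": 0, "num_cc": 0}
    for row in csv.reader(handle, delimiter="\t"):
        if not row or row[0] == "readName":
            continue  # skip blank lines and an optional header row
        record = dict(zip(COLUMNS, row))
        for key in totals:
            totals[key] += int(record[key])
    return totals


# Two made-up rows standing in for a real summary.tsv.
example = io.StringIO(
    "read1\tchr1\t0\t1000\t60\tchr1\t0\t998\t60\t1\t0\t4\n"
    "read2\tchr1\t500\t1500\t60\tchr1\t500\t1501\t60\t0\t2\t3\n"
)
totals = total_counts(example)
print(totals)  # {'num_oc': 1, 'num_uc': 2, 'num_cc': 7}
```

To run it on real output, pass an open file handle for the summary.tsv written by hifieval instead of the in-memory example.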
Yujie Guo, Xiaowen Feng, Heng Li, Evaluation of haplotype-aware long-read error correction with hifieval, Bioinformatics, Volume 39, Issue 10, October 2023, btad631, https://doi.org/10.1093/bioinformatics/btad631