**# PRECISE-QC: Step‑by‑Step Pipeline Manual
A reproducible guide to running the Yee et al. (2025) analysis, from raw POD5 files to error profiles.
- Dorado v1.0.0
- samtools ≥1.17
- BWA-MEM v0.7.15 (or 0.7.17)
- Porechop v0.2.4
- Python ≥3.9 with pysamstats installed (`pip install pysamstats`)
- IGV (optional, for visual inspection)
# adjust PROJECT to your desired location
export PROJECT=$HOME/projects/precise-qc
mkdir -p $PROJECT/{raw,ref,work,out}
export RAW=$PROJECT/raw
export REF=$PROJECT/ref
export WORK=$PROJECT/work
export OUT=$PROJECT/out

- Place your POD5 (or TAR of POD5s) in `$RAW/`.
- Put the sgRNA reference sequence FASTA at `$REF/reference.fa`.
- (Optional, for reproducing the paper figures) Download the datasets:
  - Labeled data: link in the paper’s Figshare page
  - Unlabeled data: link in the paper’s Figshare page
To reproduce Yee et al. (2025) exactly, download the data from https://www.ncbi.nlm.nih.gov/sra/PRJNA1305499 and place the POD5 files under `$RAW/`.
Recommended Dorado Version (Dorado ≥0.8):
# MODEL: rna004_130bps_sup@v5.2.0
dorado basecaller rna004_130bps_sup@v5.2.0 $RAW \
--modified-bases pseU_2OmeU m5C_2OmeC 2OmeG inosine_m6A_2OmeA \
--emit-moves > $WORK/basecalled.bam
The list of modifications in the command is optional; adjust it to match your sequence:
# Equivalent to the command shown in the manuscript methods
# dorado --modified-bases ... --emit-moves > basecalled.bam
bwa index $REF/reference.fa
bwa mem -t 1 -w 13 -k 6 -x ont2d $REF/reference.fa $WORK/basecalled.fastq > $WORK/alignment.sam

Note: These parameters are tuned for short (~100 nt) ONT reads against a small sgRNA reference.
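The `bwa mem` call above reads `$WORK/basecalled.fastq`, while Dorado writes BAM. One possible way to produce the FASTQ is `samtools fastq`; the sketch below wraps it in a small helper. The function name `export_fastq` is hypothetical and not part of the original pipeline, and the manuscript may have used a different export route.

```shell
# Hypothetical helper (not from the paper): export reads from a basecalled BAM to FASTQ.
export_fastq() {
  bam=$1
  fastq=$2
  # Fail early if samtools is unavailable or the BAM is missing/empty.
  command -v samtools >/dev/null 2>&1 || { echo "samtools not found" >&2; return 1; }
  [ -s "$bam" ] || { echo "missing or empty BAM: $bam" >&2; return 1; }
  samtools fastq "$bam" > "$fastq"
}
```

With the environment variables defined earlier, it would be called as `export_fastq "$WORK/basecalled.bam" "$WORK/basecalled.fastq"` before running `bwa mem`.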
samtools view -@ 4 -bS $WORK/alignment.sam | samtools sort -o $WORK/alignment.sorted.bam
samtools index $WORK/alignment.sorted.bam
samtools view -@ 4 -b -F 0x100 $WORK/alignment.sorted.bam -o $WORK/primary.bam
samtools index $WORK/primary.bam
samtools view -h $WORK/primary.bam \
| awk 'BEGIN {OFS="\t"} /^@/ {print; next} {
split($6,C,/[0-9]*/); split($6,L,/[SMDIN]/);
if (C[2]=="S") {$10=substr($10,L[1]+1); if($11!~/^\*$/) $11=substr($11,L[1]+1)};
if (C[length(C)]=="S") {L1=length($10)-L[length(L)-1];
$10=substr($10,1,L1); if($11!~/^\*$/) $11=substr($11,1,L1)};
gsub(/[0-9]*S/,"",$6); print
}' \
| samtools view -b -o $WORK/unclipped.bam -
samtools index $WORK/unclipped.bam
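To make the effect of the soft-clip removal concrete, here is a minimal sketch on a single toy SAM record. It is not the pipeline's awk script: it re-implements the same idea with `match()` so it runs on any POSIX awk, and the read name, sequence, and qualities are made up for illustration.

```shell
# Toy SAM record (hypothetical): 3 soft-clipped bases, 6 aligned bases, 2 soft-clipped bases.
record=$(printf 'read1\t0\tsgRNA\t1\t60\t3S6M2S\t*\t0\t0\tAAACCCGGGTT\tIIIIIIIIIII')

trimmed=$(printf '%s\n' "$record" | awk 'BEGIN { FS = OFS = "\t" } {
  lead = 0; trail = 0
  # Leading clip: CIGAR starts with <n>S.
  if (match($6, /^[0-9]+S/)) lead = substr($6, 1, RLENGTH - 1) + 0
  # Trailing clip: CIGAR ends with <n>S.
  if (match($6, /[0-9]+S$/)) trail = substr($6, RSTART, RLENGTH - 1) + 0
  keep = length($10) - lead - trail
  # Trim the sequence and, if present, the qualities by the same amounts.
  $10 = substr($10, lead + 1, keep)
  if ($11 != "*") $11 = substr($11, lead + 1, keep)
  # Drop the S operations from the CIGAR string.
  gsub(/[0-9]+S/, "", $6)
  print
}')
printf '%s\n' "$trimmed"
```

The CIGAR `3S6M2S` becomes `6M`, and the sequence `AAACCCGGGTT` is trimmed to `CCCGGG`, so the read length now reflects only the aligned portion, which is what the subsequent length filter relies on.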
samtools view -h $WORK/unclipped.bam \
| awk 'BEGIN {OFS="\t"} /^@/ {print; next} {
if (length($10) >= 95 && length($10) <= 105) print
}' \
| samtools view -b -o $WORK/full_length.bam -
samtools index $WORK/full_length.bam
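Before settling on the 95–105 nt window, it can help to look at the read-length distribution. The sketch below tallies sequence lengths from SAM text; the three toy reads are made up, and in practice the input would presumably come from `samtools view $WORK/unclipped.bam`.

```shell
# Toy SAM body lines (hypothetical reads; only column 10, the sequence, matters here).
sam_lines=$(printf 'r1\t0\tsgRNA\t1\t60\t4M\t*\t0\t0\tACGT\t*\nr2\t0\tsgRNA\t1\t60\t4M\t*\t0\t0\tTTGA\t*\nr3\t0\tsgRNA\t1\t60\t6M\t*\t0\t0\tACGTAC\t*\n')

# Count reads per sequence length, skipping header lines, sorted by length.
length_counts=$(printf '%s\n' "$sam_lines" | awk -F'\t' '
  /^@/ { next }
  { n[length($10)]++ }
  END { for (len in n) print len "\t" n[len] }
' | sort -n)
printf '%s\n' "$length_counts"
```

Each output line is `length<TAB>count`; for the toy input this gives two reads of length 4 and one of length 6.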
pysamstats --type variation --fasta $REF/reference.fa $WORK/full_length.bam > $OUT/only_variation.txt

Output: `$OUT/only_variation.txt` (columns include ref pos, depth, mismatches, insertions, deletions, etc.).
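As an illustration of how the variation table could be turned into per-position error fractions, the sketch below looks up columns by header name and divides by depth. The column names `pos`, `reads_all`, `mismatches`, `deletions`, and `insertions` are an assumption about the pysamstats header; check them against your own output. The toy values are made up.

```shell
# Toy table in the assumed pysamstats layout (tab-separated, header row; values are made up).
variation=$(printf 'chrom\tpos\tref\treads_all\tmismatches\tdeletions\tinsertions\nsgRNA\t1\tG\t100\t5\t3\t2\nsgRNA\t2\tA\t200\t10\t0\t10\n')

# Look up columns by header name, then print per-position error fractions.
error_rates=$(printf '%s\n' "$variation" | awk -F'\t' '
  NR == 1 { for (i = 1; i <= NF; i++) col[$i] = i; next }
  $(col["reads_all"]) > 0 {
    d = $(col["reads_all"])
    printf "%s\t%.3f\t%.3f\t%.3f\n", $(col["pos"]),
      $(col["mismatches"]) / d, $(col["deletions"]) / d, $(col["insertions"]) / d
  }
')
printf '%s\n' "$error_rates"
```

The output columns are position, mismatch fraction, deletion fraction, and insertion fraction. Resolving columns by name rather than by index keeps the sketch independent of the exact column order.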
porechop -i $WORK/basecalled.fastq -o $WORK/adapter_reads.fastq \
--barcode_diff 1 --barcode_threshold 74 --verbosity 2
bwa mem -t 1 -w 13 -k 6 -x ont2d $REF/reference.fa $WORK/adapter_reads.fastq > $WORK/trunc_alignment.sam
samtools view -bS $WORK/trunc_alignment.sam | samtools sort -o $WORK/trunc_alignment.bam
samtools index $WORK/trunc_alignment.bam

- Open `$REF/reference.fa` and `$WORK/trunc_alignment.bam` in IGV.
- Examine coverage and CIGAR patterns near the 5′ end.
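Alongside visual inspection in IGV, the aligned span of each truncated read can be tabulated from the CIGAR string. The sketch below is one possible approach, not the paper's method: it prints the read name, the leftmost aligned position (SAM column 4), and the number of reference bases covered. The two toy records are made up; real input would presumably come from `samtools view $WORK/trunc_alignment.bam`.

```shell
# Toy SAM body lines (hypothetical alignments; sequence and qualities omitted as "*").
sam_lines=$(printf 'r1\t0\tsgRNA\t1\t60\t10M2D5M\t*\t0\t0\t*\t*\nr2\t0\tsgRNA\t8\t60\t3S6M1I4M\t*\t0\t0\t*\t*\n')

spans=$(printf '%s\n' "$sam_lines" | awk -F'\t' '
  /^@/ || $6 == "*" { next }
  {
    cigar = $6; span = 0
    # Walk the CIGAR one <length><op> pair at a time.
    while (match(cigar, /^[0-9]+[MIDNSHP=X]/)) {
      n = substr(cigar, 1, RLENGTH - 1) + 0
      op = substr(cigar, RLENGTH, 1)
      # Only M, D, N, = and X consume reference positions.
      if (op ~ /[MDN=X]/) span += n
      cigar = substr(cigar, RLENGTH + 1)
    }
    print $1 "\t" $4 "\t" span
  }
')
printf '%s\n' "$spans"
```

For the toy input, `10M2D5M` covers 17 reference bases starting at position 1, and `3S6M1I4M` covers 10 reference bases starting at position 8, since soft clips and insertions do not consume reference positions.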
- `$WORK/basecalled.bam(.bai)`: basecalled reads with modified‑base and move metadata
- `$WORK/basecalled.fastq`: (optional) FASTQ export of the above
- `$WORK/alignment.sorted.bam(.bai)`: all alignments to sgRNA
- `$WORK/primary.bam(.bai)`: primary alignments only
- `$WORK/unclipped.bam(.bai)`: primary alignments with no soft clipping
- `$WORK/full_length.bam(.bai)`: 95–105 nt full‑length reads
- `$WORK/adapter_reads.fastq`: reads containing the 5′ adapter (truncated set)
- `$WORK/trunc_alignment.bam(.bai)`: alignments of adapter‑containing reads
- `$OUT/error_profile.txt`: per‑nucleotide error profile for full‑length reads
- `$OUT/truncated_*lengths.tsv`: length distributions for truncated reads
- No MM/ML tags: ensure Dorado was run with `--modified-bases` and the selected modifications.
- No moves: ensure `--emit-moves` was included; some tags are model/CLI dependent.
- pysamstats fails: make sure BAMs are coordinate‑sorted and indexed; use the same `reference.fa` you aligned to.
- Soft‑clipped reads remain: re‑run step 3.2; your filter must run on SAM text.
- Few full‑length reads: adjust the length window in step 4.1 (e.g., 90–110 nt) and re‑profile.
- Adapter detection weak: try relaxing/tightening `--barcode_threshold` in Porechop; verify the expected 5′ adapter sequence is included in Porechop’s database or provide a custom adapter file.
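For the first two troubleshooting items, a quick way to check whether reads carry the modified-base and move tags is to scan the optional fields of the SAM text. The sketch below counts reads with `MM`/`ML` and `mv` tags on two made-up records; with real data the input would presumably come from `samtools view $WORK/basecalled.bam`. The tag names are the ones Dorado is generally understood to emit, so treat them as an assumption to verify against your own file.

```shell
# Toy SAM body lines (hypothetical): r1 carries MM/ML and mv tags, r2 carries none.
sam_lines=$(printf 'r1\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII\tmv:B:c,5,1,0,1,0\tMM:Z:A+a.,0;\tML:B:C,200\nr2\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII\n')

# Optional tags start at column 12; count reads that have each kind.
tag_summary=$(printf '%s\n' "$sam_lines" | awk -F'\t' '
  /^@/ { next }
  {
    total++
    has_mm = has_ml = has_mv = 0
    for (i = 12; i <= NF; i++) {
      if ($i ~ /^MM:Z:/) has_mm = 1
      if ($i ~ /^ML:B:/) has_ml = 1
      if ($i ~ /^mv:B:/) has_mv = 1
    }
    if (has_mm && has_ml) mods++
    if (has_mv) moves++
  }
  END { printf "reads=%d with_MM_ML=%d with_moves=%d\n", total, mods, moves }
')
printf '%s\n' "$tag_summary"
```

If `with_MM_ML` or `with_moves` is zero on real data, that points back to the basecalling options rather than to the downstream alignment steps.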
Analysis by Yvonne Yee and Dinara Boyko (Northeastern University, Departments of Chemical Engineering & Physics).
If you use this pipeline, please cite: https://www.biorxiv.org/content/10.1101/2025.09.20.677417v1
---**