GitHub - wanunulab/PRECISE-QC_analysis: Code and workflows for PRECISE-QC: assessing sgRNA synthesis quality using nanopore sequencing.

**# PRECISE-QC: Step‑by‑Step Pipeline Manual

A reproducible guide to run the Yee et al. (2025) analysis from raw POD5 to error profiles.

0) Prerequisites

0.1 Install required tools (tested versions)

Dorado v1.0.0
samtools ≥1.17
BWA-MEM v0.7.15 (or 0.7.17)
Porechop v0.2.4
Python ≥3.9 with pysamstats installed (pip install pysamstats)
IGV (optional, for visual inspection)

0.2 Prepare project folders

# adjust PROJECT to your desired location
export PROJECT=$HOME/projects/precise-qc
mkdir -p $PROJECT/{raw,ref,work,out}
export RAW=$PROJECT/raw
export REF=$PROJECT/ref
export WORK=$PROJECT/work
export OUT=$PROJECT/out

0.3 Get the reference and data

Place your POD5 (or TAR of POD5s) in $RAW/.
Put the sgRNA reference sequence FASTA at $REF/reference.fa.
(Optional, for reproducing the paper figures) Download the datasets:
- Labeled data: link in the paper’s Figshare page
- Unlabeled data: link in the paper’s Figshare page

If you are reproducing exactly Yee et al. (2025), download from https://www.ncbi.nlm.nih.gov/sra/PRJNA1305499 and place POD5 files under $RAW/.

1) Basecalling (with modified bases + moves table)

1.1 Run Dorado basecalling

Recommended Dorado Version (Dorado ≥0.8):

# MODEL: rna004_130bps_sup@v5.2.0
dorado basecaller rna004_130bps_sup@v5.2.0 $RAW \
  --modified-bases pseU_2OmeU m5C_2OmeC 2OmeG inosine_m6A_2OmeA \
  --emit-moves \

The list of modifications in the command is optional, adjust according to your sequence:

# Equivalent to the command shown in the manuscript methods
# dorado --modified-bases ... --emit-moves > basecalled.bam

2) Align reads to the sgRNA reference

2.1 Index the reference

bwa index $REF/reference.fa

2.2 Run BWA-MEM with ONT presets (as used in the study)

bwa mem -t 1 -w 13 -k 6 -x ont2d $REF/reference.fa $WORK/basecalled.fastq > $WORK/alignment.sam

Note: These parameters are tuned for short (~100 nt) ONT reads against a small sgRNA reference.

2.3 Convert and sort alignments

samtools view -@ 4 -bS $WORK/alignment.sam | samtools sort -o $WORK/alignment.sorted.bam
samtools index $WORK/alignment.sorted.bam

3) Filter primary alignments and remove soft-clipped reads

3.1 Keep only primary alignments

samtools view -@ 4 -b -F 0x100 $WORK/alignment.sorted.bam -o $WORK/primary.bam
samtools index $WORK/primary.bam

3.2 Remove any read whose CIGAR contains soft clipping ("S")

samtools view -h $WORK/primary.bam \
| awk 'BEGIN {OFS="\t"} /^@/ {print; next} {
  split($6,C,/[0-9]*/); split($6,L,/[SMDIN]/);
  if (C[2]=="S") {$10=substr($10,L[1]+1); if($11!~/^\*$/) $11=substr($11,L[1]+1)};
  if (C[length(C)]=="S") {L1=length($10)-L[length(L)-1];
    $10=substr($10,1,L1); if($11!~/^\*$/) $11=substr($11,1,L1)};
  gsub(/[0-9]*S/,"",$6); print
}' \
| samtools view -b -o $WORK/unclipped.bam -

samtools index $WORK/unclipped.bam

4) Select full‑length reads (95–105 nt) and profile errors

4.1 Filter by read length on unclipped alignments

samtools view -h $WORK/unclipped.bam \
| awk 'BEGIN {OFS="\t"} /^@/ {print; next} {
  if (length($10) >= 95 && length($10) <= 105) print
}' \
| samtools view -b -o $WORK/full_length.bam -

samtools index $WORK/full_length.bam

4.2 Generate per‑nucleotide error profiles (pysamstats)

pysamstats --type variation --fasta $REF/reference.fa $WORK/full_length.bam > $OUT/only_variation.txt

Output: $OUT/only_variation.txt (columns include ref pos, depth, mismatches, insertions, deletions, etc.).

5) Analyze truncated reads

5.1 Find reads with the 5′ adapter (Porechop)

porechop -i $WORK/basecalled.fastq -o $WORK/adapter_reads.fastq \
  --barcode_diff 1 --barcode_threshold 74 --verbosity 2

5.2 Align adapter‑containing truncated reads

bwa mem -t 1 -w 13 -k 6 -x ont2d $REF/reference.fa $WORK/adapter_reads.fastq > $WORK/trunc_alignment.sam
samtools view -bS $WORK/trunc_alignment.sam | samtools sort -o $WORK/trunc_alignment.bam
samtools index $WORK/trunc_alignment.bam

5.3 Inspect in IGV (optional)

Open $REF/reference.fa and $WORK/trunc_alignment.bam in IGV.
Examine coverage and CIGAR patterns near the 5′ end.

6) Outputs you should have

$WORK/basecalled.bam(.bai) — basecalled reads with modified‑base and move metadata
$WORK/basecalled.fastq — (optional) FASTQ export of the above
$WORK/alignment.sorted.bam(.bai) — all alignments to sgRNA
$WORK/primary.bam(.bai) — primary alignments only
$WORK/unclipped.bam(.bai) — primary alignments with no soft clipping
$WORK/full_length.bam(.bai) — 95–105 nt full‑length reads
$WORK/adapter_reads.fastq — reads containing the 5′ adapter (truncated set)
$WORK/trunc_alignment.bam(.bai) — alignments of adapter‑containing reads
$OUT/error_profile.txt — per‑nucleotide error profile for full‑length reads
$OUT/truncated_*lengths.tsv — length distributions for truncated reads

7) Common pitfalls & troubleshooting

No MM/ML tags: ensure Dorado was run with --modified-bases and the selected modifications.
No moves: ensure --emit-moves was included; some tags are model/CLI dependent.
pysamstats fails: make sure BAMs are coordinate‑sorted and indexed; use the same reference.fa you aligned to.
Soft‑clipped reads remain: re‑run step 3.2; your filter must run on SAM text.
Few full‑length reads: adjust the length window in step 4.1 (e.g., 90–110 nt) and re‑profile.
Adapter detection weak: try relaxing/tightening --barcode_threshold in Porechop; verify the expected 5′ adapter sequence is included in Porechop’s database or provide a custom adapter file.

Authors & Credits

Analysis by Yvonne Yee and Dinara Boyko (Northeastern University, Departments of Chemical Engineering & Physics).

How to cite

If you use this pipeline, please cite: https://www.biorxiv.org/content/10.1101/2025.09.20.677417v1

---**

Name		Name	Last commit message	Last commit date
Latest commit History 10 Commits
notebooks		notebooks
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

0) Prerequisites

0.1 Install required tools (tested versions)

0.2 Prepare project folders

0.3 Get the reference and data

1) Basecalling (with modified bases + moves table)

1.1 Run Dorado basecalling

2) Align reads to the sgRNA reference

2.1 Index the reference

2.2 Run BWA-MEM with ONT presets (as used in the study)

2.3 Convert and sort alignments

3) Filter primary alignments and remove soft-clipped reads

3.1 Keep only primary alignments

3.2 Remove any read whose CIGAR contains soft clipping ("S")

4) Select full‑length reads (95–105 nt) and profile errors

4.1 Filter by read length on unclipped alignments

4.2 Generate per‑nucleotide error profiles (pysamstats)

5) Analyze truncated reads

5.1 Find reads with the 5′ adapter (Porechop)

5.2 Align adapter‑containing truncated reads

5.3 Inspect in IGV (optional)

6) Outputs you should have

7) Common pitfalls & troubleshooting

Authors & Credits

How to cite

About

Uh oh!

Releases 1

Packages

Languages

wanunulab/PRECISE-QC_analysis

Folders and files

Latest commit

History

Repository files navigation

0) Prerequisites

0.1 Install required tools (tested versions)

0.2 Prepare project folders

0.3 Get the reference and data

1) Basecalling (with modified bases + moves table)

1.1 Run Dorado basecalling

2) Align reads to the sgRNA reference

2.1 Index the reference

2.2 Run BWA-MEM with ONT presets (as used in the study)

2.3 Convert and sort alignments

3) Filter primary alignments and remove soft-clipped reads

3.1 Keep only primary alignments

3.2 Remove any read whose CIGAR contains soft clipping ("S")

4) Select full‑length reads (95–105 nt) and profile errors

4.1 Filter by read length on unclipped alignments

4.2 Generate per‑nucleotide error profiles (pysamstats)

5) Analyze truncated reads

5.1 Find reads with the 5′ adapter (Porechop)

5.2 Align adapter‑containing truncated reads

5.3 Inspect in IGV (optional)

6) Outputs you should have

7) Common pitfalls & troubleshooting

Authors & Credits

How to cite

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases 1

Packages 0

Languages

Packages