🚀 Ultra-optimized search engines for mismatches, edits, and local alignment
This repository is a full-scale bioinformatics search engine, built from the
ground up using every major string-matching paradigm used in real genomic tools.
Why?
Because no single algorithm works best for all DNA data.
Genome datasets range from:
• 100bp short reads
• 30,000bp viral genomes
• 3Bbp human genome
• terabytes of metagenomic data
A single method cannot handle all of them —
that’s why professional tools like BLAST, Bowtie, BWA, minimap2 use multiple models internally.
This repo rebuilds that entire ecosystem, algorithm by algorithm:
✔ DP Edit Distance → full gap-aware accuracy
✔ Myers Bit-vector → fastest short-pattern fuzzy search
✔ FFT Convolution (JS) → browser-scale pattern scanning
✔ FFT WASM → native-level mismatch search
✔ q-gram / LSH → TB-scale approximate indexing
✔ Seed-and-Extend → BLAST-style heuristics
✔ Wu–Manber → skip-based large-text matching
✔ FM-index / Suffix Array → genome-level ultra-fast search
Together, these models form a miniature version of a real genomics pipeline: index → seed → filter → align → verify.
This project shows how modern DNA search engines are built —
from raw bit-operations to FFT convolution to compressed suffix indexes.
Not just a matcher.
A complete bioinformatics architecture, recreated algorithm-by-algorithm, fully open-source.
Perfect for:
• genomic research
• alignment engines
• mutation scanning
• virus detection
• coursework + teaching tools
• high-performance computing demos
• future AI + genomics integrations
This is not a toy project.
This is the foundation for future DNA search systems.
This project is a collection of 8 powerful DNA pattern-matching algorithms, each optimized for:
- Large DNA sequences (MB → GB scale)
- Allowed mismatches / edits
- Local alignment
- Convolution-based matching
- Index-based search at huge scale
Every algorithm reports:
✔ Gene Present
✔ Mutation Present
✔ Virus Marker Probability
✔ Variant Similarity
✔ Approx. Similarity (%)
✔ Neon Heatmap of match strength
| Algorithm | Errors | Speed | Pattern Size | Text Size | TB Realistic? | Best Use |
|---|---|---|---|---|---|---|
| DP Edit Distance | full edits | ❌ slow | ≤1k | MB–100MB | ❌ | full accuracy |
| Myers Bit-vector | k edits/mism | ⚡⚡ fast | ≤128bp | MB–GB | short pattern fuzzy | |
| FFT JS (Hamming) | mismatches | ⚡ fast | 10–200bp | 10⁵–10⁶ | ❌ | medium in-browser |
| FFT WASM | mismatches | ⚡⚡ faster | 10–200bp | 1M–10M | high-performance | |
| q-gram / LSH | edits/mism | ⚡ | any | GB–TB | ✅ | massive DB search |
| Seed-and-Extend | mismatches | ⚡ | 10–200 | MB–GB | BLAST-like | |
| Wu–Manber | mismatches | ⚡ | 20–200 | MB–GB | skip-based | |
| FM-index / SA | k edits | ⚡⚡ | short | GB–tens GB | genome aligners |
- Full edits: insertions, deletions, substitutions
- Perfect accuracy but slow
- 50k text → ~10–20 ms
- 500k → ~150 ms
- 2M → ~1–2 s
- MB–100MB
- ❌ Not GB/TB suitable
- Fastest for patterns ≤ 128 bp
- DP simulated with bitwise operations
- 50k → ~5 ms
- 1M → ~40 ms
- 10M → ~0.5–1.2 s
- MB → a few GB
- TB possible via streaming
- Pure JavaScript FFT
- 4-channel DNA convolution (A/C/G/T)
- 50k → 10–15 ms
- 500k → ~90 ms
- 2M → ~0.5 s
- Best for 10⁵–10⁶
- Native-like convolution
- Rust/C++ → WebAssembly
- 500k → 50–80 ms
- 2M → 150–250 ms
- 10M → 600–800 ms
- 1M–10M
- Chunking needed beyond this
- TB-scale realistic
- Divide into q-grams → LSH buckets → verify candidates
- Extremely fast on large datasets
- Designed for GB–TB databases
- Many GB → Many TB
- Exact seeds → extend & check mismatches
- Used widely in genomics
- 50k → ~5–8 ms
- 500k → ~60 ms
- 2M → 200–300 ms
- MB → few GB
- TB possible with distributed indexing
- Skip-based matching
- Uses block hashes and shift table
- 50k → 6–10 ms
- 500k → 50–90 ms
- 2M → 200–350 ms
- Best: MB–GB
- Compressed full-text index
- Supports backtracking for k edits
- GB–tens of GB reference
- TB hard (requires distributed FM-index)
| Algorithm | Full Edits | Mismatches | Speed | Best Pattern | Best Text | TB? | Notes |
|---|---|---|---|---|---|---|---|
| Levenshtein | ✅ | ✅ | ❌ | ≤1k | MB–100MB | ❌ | Exact DP |
| Myers | ✅ | ✅ | ⚡⚡ | ≤128 | MB–GB | Best for short | |
| FFT JS | ❌ | ✅ | ⚡ | 10–200 | 10⁵–10⁶ | ❌ | Browser FFT |
| FFT WASM | ❌ | ✅ | ⚡⚡ | 10–200 | 1M–10M | Native-like | |
| q-gram / LSH | ⚡ | any | GB–TB | ✅ | Large DBs | ||
| Seed+Extend | ❌ | ✅ | ⚡ | 10–200 | MB–GB | BLAST-like | |
| Wu–Manber | ❌ | ✅ | ⚡ | 20–200 | MB–GB | Skip-based | |
| FM-index | ⚡⚡ | short | GB–tens GB | Genome aligners |
This repository demonstrates every major approximate string-matching family,
from basic DP → bit-vector → FFT → seeds → index structures.
It can be used for:
- Genomic research
- Mutation scanning
- Fast approximate search
- Educational visualizations
- Benchmarking diverse algorithms
→ FM-index / Suffix-Array (large genomes, repeated queries)
Super-fast queries (microseconds–ms) once index is built.
Ideal for huge reference genomes (GB–tens of GB).
→ Myers Bit-vector (short patterns, streaming text)
Fastest approximate DP for ≤128bp patterns.
Perfect for MB–GB scans.
→ FFT WASM k-mismatches (large mismatch scanning)
Fastest for long windows (hundreds of thousands–millions of bases).
Mismatch-only model but extremely high throughput.
→ Seed-and-Extend + Smith–Waterman
Seeds for fast filtering, SW for accurate local alignment.
Classic pipeline used in BLAST, Bowtie-2, minimap2.
→ FM-index / Suffix Array
- microsecond–millisecond search
- handles billions of bases
- backbone of BWA / Bowtie
- ideal for repeated queries
→ Myers Bit-vector
- O(n / wordsize) streaming
- pattern ≤ 128bp
- best for MB–GB linear scans
→ FFT WASM k-mismatches
- O(n log n) convolution
- best for 200k–10M text size
- perfect for mismatch-only genomic scanning
→ q-gram / LSH (distributed collection search)
Great for many GB–TB of sequences (as documents/reads).
→ Distributed FM-index (cluster-scale reference)
Research-level but extremely powerful for massive references.
→ DP Edit Distance (Levenshtein) or Smith–Waterman
Use for final verification or small local windows.
Built with passion by Pankaj Kumar.