SeqMorph is a bioinformatics toolkit for simulating mutations in DNA/RNA/Protein sequences and comparing the results. It focuses on correctness, scalability, and reproducibility using a store-first design (pluggable sequence backends) and a small FastAPI service.
- 🧬 Sequence storage, not just strings
  - `SequenceStructure` registry + `BaseStore` backends:
    - `StringStore` for small inputs
    - `ChunkedStore` for long sequences (O(log k) index mapping)
- 🔁 Structural mutations (reproducible with a seed)
  - inversion, duplication, translocation (windowed; non-overlapping sampling)
  - event mix/weights & mean segment length
- 🧪 Sequence analysis
  - GC%, k-mers, Shannon entropy, ORFs/translation (RNA handled via U→T normalization)
  - comparison report between original & mutated sequences
- 📦 Reproducibility & outputs
  - `manifest.json` (seed, params, backend), per-event logs, FASTA export, `events.json`
- 🌐 API-first
  - add/fetch sequences and run mutations via HTTP
  - easy to drive from notebooks, CLI wrappers, or a future GUI
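To make the "O(log k) index mapping" idea concrete, here is a minimal sketch of how a chunked backend could map a global position to a chunk. It assumes k is the number of chunks and uses a binary search over chunk start offsets. `ChunkIndex`, `locate`, and `base_at` are hypothetical names for illustration; this is not the actual `ChunkedStore` code.

```python
from bisect import bisect_right
from itertools import accumulate


class ChunkIndex:
    """Toy index over k string chunks (illustrative, not SeqMorph's ChunkedStore)."""

    def __init__(self, chunks: list[str]) -> None:
        self.chunks = chunks
        # starts[i] is the global offset of the first base stored in chunk i.
        self.starts = [0, *accumulate(len(c) for c in chunks)][:-1]
        self.length = sum(len(c) for c in chunks)

    def locate(self, pos: int) -> tuple[int, int]:
        """Map a global 0-based position to (chunk_index, offset_in_chunk)."""
        if not 0 <= pos < self.length:
            raise IndexError(pos)
        # Binary search over the k chunk starts: O(log k) per lookup.
        i = bisect_right(self.starts, pos) - 1
        return i, pos - self.starts[i]

    def base_at(self, pos: int) -> str:
        i, offset = self.locate(pos)
        return self.chunks[i][offset]


idx = ChunkIndex(["ACGT", "GG", "TTTAA"])
print(idx.locate(0), idx.locate(4), idx.locate(10))
```

A real store would also have to keep the offsets up to date as chunks are edited or split, which this sketch leaves out.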
Note: Point mutations with context policy are designed (engine supports them) and will be exposed via API once we surface the policy knobs.
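For reference, the analysis metrics named in the feature list (GC%, k-mers, Shannon entropy) can be sketched in a few lines of plain Python. The function names below are illustrative assumptions and do not mirror `analysis_module.py`.

```python
import math
from collections import Counter


def gc_percent(seq: str) -> float:
    """Percentage of G and C characters in the sequence (0.0 for empty input)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return 100.0 * (seq.count("G") + seq.count("C")) / len(seq)


def kmer_counts(seq: str, k: int) -> Counter:
    """Count every overlapping substring of length k."""
    if k < 1:
        raise ValueError("k must be >= 1")
    seq = seq.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def shannon_entropy(seq: str) -> float:
    """Shannon entropy of the symbol distribution, in bits per symbol."""
    n = len(seq)
    if n == 0:
        return 0.0
    counts = Counter(seq.upper())
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


print(gc_percent("ATGC"), kmer_counts("ATAT", 2), shannon_entropy("ACGT"))
```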
- ✅ Unified `library/` (replaces `gene_library.py` and `protein_library.py`)
- ✅ New `mutation/` package (replaces `mutation_module.py`)
- ✅ FastAPI backend (`SeqMorph_Main.py`) instead of a monolithic CLI
- ✅ Store-first sequence handling (`sequence_store/` backends)
- ✅ Correct CpG “boost” logic (applies only when the current base is C and next is G)
- 🗑️ Old CLI/GUI temporarily removed; they’ll be rebuilt on the API
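The CpG rule above can be illustrated with a small sketch: a position gets the boosted weight only when it holds a C and the next base is a G. The function name and the `base_rate` / `cpg_boost` parameters are assumptions for illustration, not the engine's actual signature.

```python
def point_mutation_weights(seq: str, base_rate: float = 1.0,
                           cpg_boost: float = 5.0) -> list[float]:
    """Relative per-position mutation weights with a CpG boost.

    The boost is applied only at position i when seq[i] == 'C' and
    seq[i + 1] == 'G'. A trailing C, or the G of a CpG, is not boosted.
    """
    seq = seq.upper()
    weights = [base_rate] * len(seq)
    for i in range(len(seq) - 1):  # stop one short: the last base has no "next"
        if seq[i] == "C" and seq[i + 1] == "G":
            weights[i] = base_rate * cpg_boost
    return weights


print(point_mutation_weights("ACGCG"))
```

Weights like these could then feed a weighted sampler such as `random.Random(seed).choices(range(len(seq)), weights=...)`.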
Requires Python 3.11+.
# From repo root
pip install -r requirements.txt
`requirements.txt` (for reference):
biopython>=1.83
requests>=2.31
fastapi>=0.110
pydantic>=2.5
uvicorn[standard]>=0.27
scipy>=1.11
matplotlib>=3.8
# macOS/Linux
export PYTHONPATH=./src
python src/SeqMorph_Main.py
# Windows PowerShell
$env:PYTHONPATH = "$PWD\src"
python .\src\SeqMorph_Main.py
Open http://127.0.0.1:8000/docs for interactive docs.
- Add a sequence
curl -X POST http://127.0.0.1:8000/sequence/add -H "Content-Type: application/json" -d "$(python -c 'import json; print(json.dumps({"sequence": "ATG" + "ACGT" * 50}))')"
- Mutate & analyze
curl -X POST http://127.0.0.1:8000/mutate-and-analyze -H "Content-Type: application/json" -d '{
"accession_id": "seq_1",
"struct_rate": 3.0,
"mean_seg_len": 200,
"start": 1,
"seed": 123,
"save_outputs": true
}'
Response includes: original/mutated lengths, delta, per-op counts, total duplicated length added, full structural events, a 10-event preview, params (window/rate/seed/store), duration, and an analysis report. When `save_outputs=true`, files are written under `./output/<timestamp>_mut/`:
- `*_original.fasta`, `*_mutated.fasta`
- `*_events.json`
- `manifest.json`
- `seqmorph.log`
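The same two calls can be driven from a notebook or script. The sketch below uses only the standard library and assumes the server is running locally on port 8000 and that the first added sequence is assigned `seq_1`, as in the curl example. `build_post`, `post_json`, and `demo` are hypothetical helper names, and the response keys read in `demo` are the ones this README mentions.

```python
import json
import urllib.request

BASE_URL = "http://127.0.0.1:8000"  # assumption: default local server address


def build_post(path: str, payload: dict) -> urllib.request.Request:
    """Build a JSON POST request without sending it."""
    return urllib.request.Request(
        BASE_URL + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def post_json(path: str, payload: dict, timeout: float = 30.0) -> dict:
    """Send the request and decode the JSON response body."""
    with urllib.request.urlopen(build_post(path, payload), timeout=timeout) as resp:
        return json.load(resp)


def demo() -> None:
    """Add a sequence, then mutate and analyze it (needs a running server)."""
    post_json("/sequence/add", {"sequence": "ATG" + "ACGT" * 50})
    result = post_json("/mutate-and-analyze", {
        "accession_id": "seq_1",
        "struct_rate": 3.0,
        "mean_seg_len": 200,
        "start": 1,
        "seed": 123,
        "save_outputs": True,
    })
    print(result.get("num_events"), result.get("delta_length"))
```

Call `demo()` once the server from Quick start is up; nothing is sent at import time.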
You can quickly sanity-check the core without any extra files. Pick A (API-only) or B (direct Python).
A) API-only
- Start the server (see Quick start).
- In a new terminal, add & mutate a random DNA:
macOS/Linux
SEQ=$(python - <<'PY'
import random; random.seed(1)
print(''.join(random.choice('ACGT') for _ in range(20000)))
PY
)
curl -s -X POST http://127.0.0.1:8000/sequence/add -H "Content-Type: application/json" -d "{\"sequence\":\"$SEQ\"}"
curl -s -X POST http://127.0.0.1:8000/mutate-and-analyze -H "Content-Type: application/json" -d '{"accession_id":"seq_1","struct_rate":3.0,"mean_seg_len":200,"start":1,"seed":123,"save_outputs":true}' | python -m json.tool
Windows PowerShell
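# PowerShell has no Bash-style heredoc, so generate the sequence with python -c
$seq = python -c "import random; random.seed(1); print(''.join(random.choice('ACGT') for _ in range(20000)))"
# Build JSON bodies with ConvertTo-Json instead of hand-escaped quotes
$addBody = @{ sequence = $seq } | ConvertTo-Json
Invoke-RestMethod -Method Post -Uri http://127.0.0.1:8000/sequence/add `
-ContentType "application/json" -Body $addBody
$mutBody = @{ accession_id = "seq_1"; struct_rate = 3.0; mean_seg_len = 200; start = 1; seed = 123; save_outputs = $true } | ConvertTo-Json
Invoke-RestMethod -Method Post -Uri http://127.0.0.1:8000/mutate-and-analyze `
-ContentType "application/json" -Body $mutBody `
| ConvertTo-Json -Depth 10
What to check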
- `num_events` > 0
- `delta_length == struct_dup_length_added` (inversions/translocations keep length)
- `output_paths.run_dir` exists with FASTA/JSON/log files
- Repeat with the same seed → identical `struct_events`
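The length check above follows from what each operation does to the sequence. A toy sketch on plain strings shows why only duplications change the length. It assumes an inversion is a reverse complement of the segment and a duplication is a tandem copy; the helper names are illustrative and not SeqMorph's primitives.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def invert(seq: str, start: int, end: int) -> str:
    """Replace seq[start:end] with its reverse complement (length unchanged)."""
    return seq[:start] + seq[start:end].translate(_COMPLEMENT)[::-1] + seq[end:]


def translocate(seq: str, start: int, end: int, dest: int) -> str:
    """Cut seq[start:end] and reinsert it at index dest of the remainder."""
    segment = seq[start:end]
    rest = seq[:start] + seq[end:]
    return rest[:dest] + segment + rest[dest:]


def duplicate(seq: str, start: int, end: int) -> str:
    """Insert a tandem copy of seq[start:end] right after it (length grows)."""
    return seq[:end] + seq[start:end] + seq[end:]


original = "AACCGATTAC"
mutated = invert(original, 2, 6)          # same length
mutated = translocate(mutated, 0, 2, 5)   # same length
mutated = duplicate(mutated, 3, 7)        # adds 4 bases
dup_added = 7 - 3
delta = len(mutated) - len(original)
print("delta:", delta, "dup_added:", dup_added)
```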
B) Direct Python
This checks the storage and mutation engine without the HTTP layer.
macOS/Linux
PYTHONPATH=./src python - <<'PY'
from sequence_store import StringStore, ChunkedStore
from sequence_structure import SequenceStructure
from mutation import MutationEngine
import random
rng = random.Random(123)
seq = ''.join(rng.choice('ACGT') for _ in range(20000))
# auto-pick store
def store_factory(s: str):
    return ChunkedStore(s, min_chunk=1024, max_chunk=4096) if len(s) >= 50000 else StringStore(s)
ss = SequenceStructure(store_factory=store_factory)
acc = ss.add_sequence("acc1", seq, seq_type="DNA")
store = ss.sequences[acc]
eng = MutationEngine(seed=123)
events = eng.mutate_structural(store, win_start=0, win_end=len(store), rate_pct=3.0, mean_seg_len=200)
print("events:", len(events), "orig_len:", len(seq), "mut_len:", len(store))
dup_added = sum(e.end - e.start for e in events if e.op == "duplication")
print("delta:", len(store)-len(seq), "dup_added:", dup_added)
PY
Windows PowerShell
$env:PYTHONPATH = "$PWD\src"
# PowerShell has no Bash-style heredoc; pipe a here-string to python instead
@'
from sequence_store import StringStore, ChunkedStore
from sequence_structure import SequenceStructure
from mutation import MutationEngine
import random
rng = random.Random(123)
seq = ''.join(rng.choice('ACGT') for _ in range(20000))
# auto-pick store
def store_factory(s: str):
    return ChunkedStore(s, min_chunk=1024, max_chunk=4096) if len(s) >= 50000 else StringStore(s)
ss = SequenceStructure(store_factory=store_factory)
acc = ss.add_sequence("acc1", seq, seq_type="DNA")
store = ss.sequences[acc]
eng = MutationEngine(seed=123)
events = eng.mutate_structural(store, win_start=0, win_end=len(store), rate_pct=3.0, mean_seg_len=200)
print("events:", len(events), "orig_len:", len(seq), "mut_len:", len(store))
dup_added = sum(e.end - e.start for e in events if e.op == "duplication")
print("delta:", len(store)-len(seq), "dup_added:", dup_added)
'@ | python -
src/
  analysis_module.py      # analysis + comparison report (headless-safe)
  input_module.py         # FASTA I/O + NCBI/UniProt fetching
  library/                # unified analysis/validation/utils
  mutation/               # MutationEngine + structural/point primitives
  sequence_store/         # BaseStore + StringStore + ChunkedStore
  sequence_structure.py   # registry + metadata + file import/export
  SeqMorph_Main.py        # FastAPI app (backend “brain”)
- Pass `seed` in `/mutate-and-analyze` to reproduce events deterministically.
- Runs can write:
  - `manifest.json` (seed, backend, rate, mean segment, window, counts)
  - `seqmorph.log` (one line per structural event applied)
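Seeded determinism is easiest to see with a private random generator. The sketch below samples non-overlapping segments by bounded rejection sampling and records the parameters in a manifest-like dict. The exponential length distribution, the function name, and the manifest fields are assumptions for illustration, not SeqMorph's actual sampler or file schema.

```python
import json
import random


def sample_segments(seq_len: int, n_events: int, mean_seg_len: float,
                    seed: int) -> list[tuple[int, int]]:
    """Pick up to n_events non-overlapping (start, end) windows, reproducibly."""
    rng = random.Random(seed)  # private RNG: global random state is not touched
    segments: list[tuple[int, int]] = []
    for _ in range(n_events * 20):  # bounded number of attempts
        if len(segments) == n_events:
            break
        # Assumed length model: exponential with the requested mean, clamped.
        length = max(1, min(seq_len, round(rng.expovariate(1.0 / mean_seg_len))))
        start = rng.randrange(0, seq_len - length + 1)
        end = start + length
        # Accept only candidates that do not overlap an accepted segment.
        if all(end <= s or start >= e for s, e in segments):
            segments.append((start, end))
    return sorted(segments)


params = {"seq_len": 20000, "n_events": 5, "mean_seg_len": 200, "seed": 123}
first = sample_segments(**params)
second = sample_segments(**params)
assert first == second  # same seed and params give the same segments

manifest = {"seed": params["seed"], "params": params, "segments": first}
print(json.dumps(manifest, indent=2))
```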
- ModuleNotFoundError for local imports
  Ensure `PYTHONPATH=./src` (see Quick start). Also make sure `src/__init__.py` does not exist (we treat `src` as a source root, not a package).
- Biopython errors
  Install with `pip install biopython`. FASTA parsing depends on it.
- RNA translation yields X
  We normalize U→T before translation and handle reverse-complement with the correct alphabet.
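To see why unnormalized RNA translates to X, consider a DNA-keyed codon table: codons containing U are simply not found. The sketch below builds the standard genetic code table and shows the effect of U→T normalization. The `translate` and `reverse_complement` helpers are illustrative stand-ins, not the functions in `library/`.

```python
from itertools import product

_BASES = "TCAG"
# Standard genetic code, codons enumerated in TCAG order (TTT, TTC, TTA, ...).
_AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(_BASES, repeat=3), _AMINO)}


def translate(seq: str, normalize_rna: bool = True) -> str:
    """Translate in frame 0; unknown codons become 'X'."""
    seq = seq.upper()
    if normalize_rna:
        seq = seq.replace("U", "T")  # RNA -> DNA alphabet before table lookup
    usable = len(seq) - len(seq) % 3  # ignore a trailing partial codon
    return "".join(CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, usable, 3))


def reverse_complement(seq: str) -> str:
    """Reverse complement that stays in the input's alphabet (RNA if it has U)."""
    seq = seq.upper()
    table = (str.maketrans("ACGU", "UGCA") if "U" in seq
             else str.maketrans("ACGT", "TGCA"))
    return seq.translate(table)[::-1]


print(translate("AUGGCUUAA"), translate("AUGGCUUAA", normalize_rna=False))
print(reverse_complement("AUGC"), reverse_complement("ATGC"))
```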
- Expose point-mutation policies via API (context/CpG, ti/tv)
- Harden `ChunkedStore` rebalance thresholds with profiling
- New CLI & GUI (call the API)
- Optional packed DNA store (2-bit) for large datasets
- Test suite & benchmarks
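As a rough idea of what the planned 2-bit packed store could look like, the sketch below packs four bases per byte. It is an exploratory illustration only: the encoding order, the function names, and the lack of support for ambiguous bases such as N are all assumptions, and nothing here exists in the codebase yet.

```python
_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
_BASE = "ACGT"


def pack_dna(seq: str) -> tuple[bytes, int]:
    """Pack A/C/G/T at 2 bits per base; returns (packed_bytes, n_bases)."""
    out = bytearray((len(seq) + 3) // 4)  # 4 bases per byte, rounded up
    for i, base in enumerate(seq.upper()):
        # Raises KeyError for anything outside ACGT (e.g. N).
        out[i // 4] |= _CODE[base] << (2 * (i % 4))
    return bytes(out), len(seq)


def unpack_dna(data: bytes, n_bases: int) -> str:
    """Inverse of pack_dna; n_bases is needed because the last byte may be padded."""
    return "".join(_BASE[(data[i // 4] >> (2 * (i % 4))) & 3] for i in range(n_bases))


packed, n = pack_dna("ACGTTGCAAC")
print(len(packed), unpack_dna(packed, n))
```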
We welcome issues and PRs. Focus areas:
- `sequence_store/` (performance, packed encodings)
- `mutation/` (policies, additional structural ops)
- `library/analysis.py` (metrics & scalable χ² strategies)
- `SeqMorph_Main.py` (new endpoints, params, CORS for future GUI)
Basic steps:
- Fork & branch (`feature/xyz`)
- Keep code PEP8 + typed; add docstrings
- Prefer backend-shared logic (CLI/GUI must not duplicate logic)
- Open a PR with a short rationale and example inputs/outputs
MIT (see LICENSE).