SeqMorph

SeqMorph is a bioinformatics toolkit for simulating mutations in DNA/RNA/Protein sequences and comparing the results. It focuses on correctness, scalability, and reproducibility using a store-first design (pluggable sequence backends) and a small FastAPI service.

Key features (current)

🧬 Sequence storage, not just strings
- SequenceStructure registry + BaseStore backends:
  - StringStore for small inputs
  - ChunkedStore for long sequences (O(log k) index mapping)
🔁 Structural mutations (reproducible with a seed)
- inversion, duplication, translocation (windowed; non-overlapping sampling)
- event mix/weights & mean segment length
🧪 Sequence analysis
- GC%, k-mers, Shannon entropy, ORFs/translation (RNA handled via U→T normalization)
- comparison report between original & mutated sequences
📦 Reproducibility & outputs
- manifest.json (seed, params, backend), per-event logs, FASTA export, events.json
🌐 API-first
- add/fetch sequences and run mutations via HTTP
- easy to drive from notebooks, CLI wrappers, or a future GUI

Note: Point mutations with context policy are designed (engine supports them) and will be exposed via API once we surface the policy knobs.

What’s new

✅ Unified library/ (replaces gene_library.py and protein_library.py)
✅ New mutation/ package (replaces mutation_module.py)
✅ FastAPI backend (SeqMorph_Main.py) instead of a monolithic CLI
✅ Store-first sequence handling (sequence_store/ backends)
✅ Correct CpG “boost” logic (applies only when the current base is C and next is G)
🗑️ Old CLI/GUI temporarily removed; they’ll be rebuilt on the API

Install

Requires Python 3.11+.

# From repo root
pip install -r requirements.txt

requirements.txt (for reference):

biopython>=1.83
requests>=2.31
fastapi>=0.110
pydantic>=2.5
uvicorn[standard]>=0.27
scipy>=1.11
matplotlib>=3.8

Quick start (run the API)

# macOS/Linux
export PYTHONPATH=./src
python src/SeqMorph_Main.py

# Windows PowerShell
$env:PYTHONPATH = "$PWD\src"
python .\src\SeqMorph_Main.py

Open http://127.0.0.1:8000/docs for interactive docs.

Example flow (via API)

Add a sequence

curl -X POST http://127.0.0.1:8000/sequence/add   -H "Content-Type: application/json"   -d '{"sequence":"ATG" + "ACGT"*50}'

Mutate & analyze

curl -X POST http://127.0.0.1:8000/mutate-and-analyze   -H "Content-Type: application/json"   -d '{
    "accession_id": "seq_1",
    "struct_rate": 3.0,
    "mean_seg_len": 200,
    "start": 1,
    "seed": 123,
    "save_outputs": true
  }'

Response includes: original/mutated lengths, delta, per-op counts, total duplicated length added, full structural events, a 10-event preview, params (window/rate/seed/store), duration, and an analysis report. When save_outputs=true, files are written under ./output/<timestamp>_mut/:

*_original.fasta, *_mutated.fasta
*_events.json
manifest.json
seqmorph.log

DIY “smoke test” (no script in repo)

You can quickly sanity-check the core without any extra files. Pick A (API-only) or B (direct Python).

A) API-only smoke (works anywhere)

Start the server (see Quick start).
In a new terminal, add & mutate a random DNA:

macOS/Linux

SEQ=$(python - <<'PY'
import random; random.seed(1)
print(''.join(random.choice('ACGT') for _ in range(20000)))
PY
)
curl -s -X POST http://127.0.0.1:8000/sequence/add   -H "Content-Type: application/json"   -d "{"sequence":"$SEQ"}"

curl -s -X POST http://127.0.0.1:8000/mutate-and-analyze   -H "Content-Type: application/json"   -d '{"accession_id":"seq_1","struct_rate":3.0,"mean_seg_len":200,"start":1,"seed":123,"save_outputs":true}'   | python -m json.tool

Windows PowerShell

$seq = python - <<'PY'
import random; random.seed(1)
print(''.join(random.choice('ACGT') for _ in range(20000)))
PY

curl -s -X POST http://127.0.0.1:8000/sequence/add `
  -H "Content-Type: application/json" `
  -d "{""sequence"":""$seq""}"

curl -s -X POST http://127.0.0.1:8000/mutate-and-analyze `
  -H "Content-Type: application/json" `
  -d "{""accession_id"":""seq_1"",""struct_rate"":3.0,""mean_seg_len"":200,""start"":1,""seed"":123,""save_outputs"":true}" `
  | python -m json.tool

What to check

num_events > 0
delta_length == struct_dup_length_added (inversions/translocations keep length)
output_paths.run_dir exists with FASTA/JSON/log files
Repeat with the same seed → identical struct_events

B) Direct Python (import modules)

This checks the storage and mutation engine without the HTTP layer.

macOS/Linux

PYTHONPATH=./src python - <<'PY'
from sequence_store import StringStore, ChunkedStore
from sequence_structure import SequenceStructure
from mutation import MutationEngine
import random

rng = random.Random(123)
seq = ''.join(rng.choice('ACGT') for _ in range(20000))

# auto-pick store
def store_factory(s: str):
    return ChunkedStore(s, min_chunk=1024, max_chunk=4096) if len(s) >= 50000 else StringStore(s)

ss = SequenceStructure(store_factory=store_factory)
acc = ss.add_sequence("acc1", seq, seq_type="DNA")
store = ss.sequences[acc]

eng = MutationEngine(seed=123)
events = eng.mutate_structural(store, win_start=0, win_end=len(store), rate_pct=3.0, mean_seg_len=200)
print("events:", len(events), "orig_len:", len(seq), "mut_len:", len(store))
dup_added = sum(e.end - e.start for e in events if e.op == "duplication")
print("delta:", len(store)-len(seq), "dup_added:", dup_added)
PY

Windows PowerShell

$env:PYTHONPATH = "$PWD\src"
python - <<'PY'
from sequence_store import StringStore, ChunkedStore
from sequence_structure import SequenceStructure
from mutation import MutationEngine
import random

rng = random.Random(123)
seq = ''.join(rng.choice('ACGT') for _ in range(20000))

def store_factory(s: str):
    return ChunkedStore(s, min_chunk=1024, max_chunk=4096) if len(s) >= 50000 else StringStore(s)

ss = SequenceStructure(store_factory=store_factory)
acc = ss.add_sequence("acc1", seq, seq_type="DNA")
store = ss.sequences[acc]

eng = MutationEngine(seed=123)
events = eng.mutate_structural(store, win_start=0, win_end=len(store), rate_pct=3.0, mean_seg_len=200)
print("events:", len(events), "orig_len:", len(seq), "mut_len:", len(store))
dup_added = sum(e.end - e.start for e in events if e.op == "duplication")
print("delta:", len(store)-len(seq), "dup_added:", dup_added)
PY

Project layout (core)

src/
  analysis_module.py          # analysis + comparison report (headless-safe)
  input_module.py             # FASTA I/O + NCBI/UniProt fetching
  library/                    # unified analysis/validation/utils
  mutation/                   # MutationEngine + structural/point primitives
  sequence_store/             # BaseStore + StringStore + ChunkedStore
  sequence_structure.py       # registry + metadata + file import/export
  SeqMorph_Main.py            # FastAPI app (backend “brain”)

Reproducibility

Pass seed in /mutate-and-analyze to reproduce events deterministically.
Runs can write:
- manifest.json (seed, backend, rate, mean segment, window, counts)
- seqmorph.log (one line per structural event applied)

Troubleshooting

ModuleNotFoundError for local imports
Ensure PYTHONPATH=./src (see Quick start). Also make sure src/__init__.py does not exist (we treat src as a source root, not a package).
Biopython errors
Install with pip install biopython. FASTA parsing depends on it.
RNA translation yields X
We normalize U→T before translation and handle reverse-complement with the correct alphabet.

Roadmap

Expose point-mutation policies via API (context/CpG, ti/tv)
Harden ChunkedStore rebalance thresholds with profiling
New CLI & GUI (call the API)
Optional packed DNA store (2-bit) for large datasets
Test suite & benchmarks

Contributing

We welcome issues and PRs. Focus areas:

sequence_store/ (performance, packed encodings)
mutation/ (policies, additional structural ops)
library/analysis.py (metrics & scalable χ² strategies)
SeqMorph_Main.py (new endpoints, params, CORS for future GUI)

Basic steps:

Fork & branch (feature/xyz)
Keep code PEP8 + typed; add docstrings
Prefer backend-shared logic (CLI/GUI must not duplicate logic)
Open a PR with a short rationale and example inputs/outputs

License

MIT (see LICENSE).

Name		Name	Last commit message	Last commit date
Latest commit History 13 Commits
.vscode		.vscode
archive		archive
docs		docs
src		src
.gitattributes		.gitattributes
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
requirements.txt		requirements.txt
sequence_cache.json		sequence_cache.json

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

SeqMorph

Key features (current)

What’s new

Install

Quick start (run the API)

Example flow (via API)

DIY “smoke test” (no script in repo)

A) API-only smoke (works anywhere)

B) Direct Python (import modules)

Project layout (core)

Reproducibility

Troubleshooting

Roadmap

Contributing

License

About

Uh oh!

Releases

Uh oh!

Contributors 2

Uh oh!

Languages

License

Bit-2310/SeqMorph

Folders and files

Latest commit

History

Repository files navigation

SeqMorph

Key features (current)

What’s new

Install

Quick start (run the API)

Example flow (via API)

DIY “smoke test” (no script in repo)

A) API-only smoke (works anywhere)

B) Direct Python (import modules)

Project layout (core)

Reproducibility

Troubleshooting

Roadmap

Contributing

License

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Uh oh!

Contributors 2

Uh oh!

Languages