VAE-powered minimal genome generation pipeline for E. coli.
Data Files → [Preprocess] → [Explore] → [Training] → [Sample] → [Minimize]
- Preprocess: Extract essential gene positions from literature
- Explore: Analyze dataset distributions and generate visualizations
- Training: Train VAE models with different configurations
- Sample: Generate synthetic genomes from trained models
- Minimize: Create actual minimized genome sequences
- Python >=3.9,<3.11
- pip (usually comes with Python)
-
Clone the repository:
git clone <repository-url> cd genome-minimizer-2
-
Create virtual environment:
python -m venv venv source venv/bin/activate # On Windows: venv\Scripts\activate
-
Install project:
pip install -e .
python main.py --mode preprocess
python main.py --mode training --preset v0 --epochs 1
python main.py --mode sample --model-path models/trained_models/v0_model/saved_VAE_v0.pt --genes-path data/essential_genes/essential_gene_positions.pkl --num-samples 100
python main.py --mode minimizer --genes-path models/v0_model/sampling_results/binary_samples_default.npy --single-file --output-file results.fastaPlace these files in data/:
data/
├── F4_complete_presence_absence.csv # Gene presence/absence matrix
├── accessionID_phylogroup_BD.csv # Phylogroup classifications
├── essential_genes.csv # Essential genes from literature
└── wild_type_sequence.gb # E. coli reference genome
| Mode | Purpose | Example |
|---|---|---|
preprocess |
Extract essential gene positions | python main.py --mode preprocess |
explore |
Generate data analysis plots | python main.py --mode explore |
training |
Train VAE models | python main.py --mode training --preset v0 |
sample |
Generate synthetic genomes | python main.py --mode sample --model-path MODEL.pt --genes-path GENES.pkl |
minimizer |
Create FASTA sequences | python main.py --mode minimizer --genes-path SAMPLES.npy --single-file |
python main.py --mode preprocess [--force-reprocess]--force-reprocess: Regenerate even if files exist
python main.py --mode training --preset PRESET [--epochs N]--preset v0/v1/v2/v3: Model architecture (required)--epochs N: Training epochs (default: 10000)
python main.py --mode experiment [--interactive]--interactive: Prompt for custom parameters
python main.py --mode sample --model-path PATH --genes-path PATH [OPTIONS]Required:
--model-path: Trained model (.pt file)--genes-path: Essential gene positions (.pkl file)
Optional:
--num-samples N: Number of genomes (default: 1)--sampling-mode default/focused: Strategy (default: default)--noise-level N: Noise for focused sampling (default: 0.1)--genome-path: Reference genome (.gb file)
python main.py --mode minimizer --genes-path PATH [OPTIONS]Required:
--genes-path: Path to a .npy file containing lists of gene names (one list per sample); Note: this is not a binary mask! pass actual gene IDs/names like the ones in the original presence absence matrix referred to in the paper.--genome-path: Reference genome (.gb, .gbff, .genbank files allowed)
Optional:
--single-file: Output single FASTA file with all sequences (if not specified one FASTA file per sequence is generated)--output-file: Specific output filename (default for one file minimized_genomes_default.fasta, for multiple minimized_default_XXXX.fasta)--output-dir: Directory for outputs (default: ./minimized_genomes)--model-name: Label for file naming (default: "default")
Examples:
Single combined FASTA (generated output file - minimized_default.fasta):
python main.py --mode minimizer \
--genes-path data/data_full_validated_IDS.npy \
--genome-path data/GCF_000005845.2.gbff \
--single-file \
--output-file minimized_genomes/minimized_default.fasta \Single combined FASTA (generated output file - minimized_genomes_67.fasta):
python main.py --mode minimizer \
--genes-path data/data_full_validated_IDS.npy \
--genome-path data/GCF_000005845.2.gbff \
--single-file \
--model-name 67Multiple FASTAs (one per sample) into a directory (generated output files - minimized_default_XXXX.fasta):
python main.py --mode minimizer \
--genes-path data/data_full_validated_IDS.npy \
--genome-path data/GCF_000005845.2.gbff \
--output-dir minimized_genomes/ \
--model-name default| Preset | Architecture | Features |
|---|---|---|
| v0 | 1024→64 | Linear KL annealing |
| v1 | 512→32 | + Gene abundance + L1 regularization |
| v2 | 512→32 | + Cosine annealing |
| v3 | 512→32 | + Weighted abundance |
├── data/essential_genes/ # Preprocessing results
├── models/
│ ├── trained_models/v0_model/ # Saved model weights
│ └── v0_model/
│ ├── figures/ # Training plots
│ └── sampling_results/ # Generated samples
└── [output-dir]/ # Final FASTA files
- Missing files: Pipeline automatically checks and shows missing files
- Import errors: Ensure virtual environment is activated
- GPU issues: Auto-detects GPU/CPU availability
- Path errors: Use absolute paths for model/data files
Pipeline flow: preprocess → training → sample → minimizer