prepDyn: Preprocessing sequences for dynamic homology


A collection of Python scripts to facilitate the preprocessing of input sequences for dynamic homology.

In dynamic homology, data should be preprocessed to distinguish differences in sequence length caused by missing data from those caused by insertion-deletion events, so that groupings are not driven by artifacts. However, previous empirical studies using POY/PhyG preprocessed data manually, with varying approaches. Here we present prepDyn, a collection of Python scripts to facilitate the preprocessing of input sequences for POY/PhyG.

Copyright (C) Daniel Y. M. Nakamura 2025

Installation

Two dependencies must be installed by the user beforehand:

  • Python v. 3.10.9 (or newer), including argparse, ast, csv, importlib, re, StringIO, subprocess, sys, tempfile, and time, which are usually part of recent versions of Python.
  • MAFFT v. 7.5.2 (or newer), installed in $PATH as 'mafft'.
conda create -n new_env python=3.10 --yes
conda install bioconda::mafft

Other dependencies are Python modules that will be automatically installed by prepDyn when you run it for the first time:

  • Bio v. 1.73 (or newer), including AlignIO, Entrez, SeqIO, Align, Seq, and SeqRecord.
  • matplotlib v. 3.7.0 (or newer)
  • numpy v. 1.23.5 (or newer)
  • termcolor
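The "install on first run" behavior described above can be sketched with the common try-import-then-pip pattern (a hedged illustration of the idea, not prepDyn's actual mechanism):

```python
import importlib
import subprocess
import sys

def ensure_module(name, pip_name=None):
    """Import a module, installing it via pip first if it is missing.

    pip_name is the package name on PyPI when it differs from the import
    name (e.g. import "Bio" but install "biopython").
    """
    try:
        return importlib.import_module(name)
    except ImportError:
        subprocess.check_call([sys.executable, "-m", "pip", "install", pip_name or name])
        return importlib.import_module(name)

# Standard-library module: already importable, so no install is triggered.
json_mod = ensure_module("json")
```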

If the modules are not installed automatically, try:

conda install conda-forge::biopython
conda install conda-forge::matplotlib
conda install anaconda::numpy
conda install conda-forge::termcolor

Finally, clone the prepDyn repository using the command:

git clone https://github.com/danimelsz/PrepDyn.git

Introduction

prepDyn comprises four steps: (1) data collection from GenBank, (2) trimming, (3) identification of missing data, and (4) partitioning.

Usage

prepDyn is organized in three Python files in the directory src:

  • prepDyn.py: main script integrating the pipeline.
  • GB2MSA.py: script to download sequences from GenBank and identify internal missing data.
  • addSeq.py: script to align one or a few sequence(s) to a previously preprocessed alignment.

Warning: Do not move the files out of the directory src, otherwise Python may not resolve the module imports.

The following examples are designed for users with little experience with Unix. If you have questions, open a GitHub issue.

Example 1: Basic

The basic use of prepDyn is running all four steps with a single command. Given an input CSV whose first column is called Terminals and whose remaining columns are gene names (each cell containing the corresponding GenBank accession number), the following command will download sequences, trim invariant sites and orphan nucleotide blocks shorter than 10 bp at terminal positions, and code missing data as ? (all differences in sequence length at terminal positions are treated as missing data). The log reports the runtime.
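For illustration, a minimal input CSV of this shape could look like the following (terminal names and accession numbers here are hypothetical placeholders, not real GenBank records):

```csv
Terminals,gene1,gene2
Species_A,XX000001,XX000002
Species_B,XX000003/XX000004,XX000005
```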

python src/prepDyn.py \
    --GB_input test_data/tutorial/ex1.1/ex1.1_input.csv \
    --output_file test_data/tutorial/ex1.1/ex1.1 \
    --del_inv T \
    --orphan_method semi \
    --orphan_threshold 10 \
    --partitioning_method None \
    --log T 

In the CSV file, if more than one GenBank accession number is specified in the same cell, referring to non-overlapping fragments of the same gene (e.g. MT893619/MT895696), the region between them is automatically coded as internal missing data (?).
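The idea can be sketched as follows (an illustrative re-implementation, not prepDyn's internal code; here a single ? marks the boundary, whereas the real tool may insert a run of ? characters sized by context):

```python
def join_fragments(cell_value, seqs, sep="?"):
    """Concatenate non-overlapping fragments listed in one CSV cell.

    cell_value: slash-separated accessions, e.g. "MT893619/MT895696".
    seqs: mapping from accession number to its downloaded sequence.
    The unknown region between fragments is coded as missing data (?).
    """
    accessions = cell_value.split("/")
    return sep.join(seqs[acc] for acc in accessions)

# Toy sequences for demonstration only.
fragments = {"MT893619": "ACGT", "MT895696": "TTGA"}
print(join_fragments("MT893619/MT895696", fragments))  # ACGT?TTGA
```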

We specified --partitioning_method None, which means that partitioning was not performed. As a heuristic, we recommend testing the impact of adding pound signs on tree optimality scores using a successive partitioning strategy. For instance, if you specify --partitioning_method conservative and --partitioning_round 1, the largest block(s) of contiguous invariants will be partitioned.

python src/prepDyn.py \
    --input_file test_data/tutorial/ex1.2/ex1.2_input.fasta \
    --output_file test_data/tutorial/ex1.2/ex1.2 \
    --partitioning_method balanced \
    --partitioning_round 1 \
    --log T

This process can continue until tree costs reported by POY/PhyG remain stationary (e.g. --partitioning_round 2 inserts pound signs in the two largest blocks of contiguous invariants). Other partitioning methods are also available, and the user should explore whether they can reduce tree costs.
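The core of this heuristic, locating the largest runs of contiguous invariant columns, can be sketched like so (a simplified illustration under the assumption that "invariant" means identical in all sequences; prepDyn's actual implementation may differ):

```python
def invariant_runs(alignment):
    """Return (start, length) for each run of invariant columns.

    alignment: list of equal-length sequence strings.
    A column is invariant when all sequences share the same character.
    """
    ncols = len(alignment[0])
    invariant = [len({seq[i] for seq in alignment}) == 1 for i in range(ncols)]
    runs, start = [], None
    for i, inv in enumerate(invariant):
        if inv and start is None:
            start = i                      # run begins
        elif not inv and start is not None:
            runs.append((start, i - start))  # run ends
            start = None
    if start is not None:
        runs.append((start, ncols - start))  # run reaches the last column
    return runs

def largest_runs(alignment, k=1):
    """The k largest invariant runs, i.e. candidate partition points."""
    return sorted(invariant_runs(alignment), key=lambda r: r[1], reverse=True)[:k]

aln = ["ACGTAA", "ACGTTA", "ACGTCA"]
print(largest_runs(aln, k=1))  # [(0, 4)]
```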

Example 2: GB2MSA + prepDyn

Suppose you want to download sequences and preprocess them using separate commands. Given a CSV file called input.csv, the following command will download the sequences and align them with MAFFT. In addition, files containing the terminal names (useful for controlling taxon sampling in POY/PhyG) and the runtime will be written.

AAAA

Now, you can run prepDyn:

AAAA

Example 3: Multiple alignments

Suppose you have a phylogenomic dataset with hundreds of gene alignments in the directory ./data/. Phylogenomic datasets are usually unavailable in GenBank, but are available in repositories like Dryad and Zenodo. You can preprocess all unaligned gene files in FASTA format using a single command:

python src/prepDyn.py \
    --input_file test_data/tutorial/ex3.1/ \
    --input_format fasta \
    --output_file test_data/tutorial/ex3.1/out \
    --MSA T \
    --del_inv T \
    --orphan_method semi --orphan_threshold 10 \
    --internal_method semi --internal_threshold 15 \
    --partitioning_method max

If the input files are already aligned, just change the boolean parameter MSA to False:

python src/prepDyn.py \
    --input_file test_data/tutorial/ex3.2/ \
    --input_format fasta \
    --output_file test_data/tutorial/ex3.2/ \
    --MSA F \
    --del_inv T \
    --orphan_method semi --orphan_threshold 10 \
    --internal_method semi --internal_threshold 15 \
    --log T

Example 4: Appending new sequences

MUSCLE and MAFFT are unable to align sequences if pound signs or question marks are present. This is a problem when we try to align new sequences to a previously preprocessed profile alignment. To avoid manual alignment by eye, addSeq.py allows aligning new sequences to profile alignments. Gaps, missing data, and pound signs in the sequences of the profile alignment are left unmodified; they are only inserted into the new sequences.

A simple example:

python src/addSeq.py \
    --alignment test_data/tutorial/ex4.1/ex4.1_aln.fas \
    --new_seqs test_data/tutorial/ex4.1/ex4.1_new_seqs.fas \
    --output test_data/tutorial/ex4.1/ex4.1_out.fas \
    --log True

A more complex example, where the new sequences are preprocessed by trimming blocks of orphan nucleotides shorter than 45 bp, replacing internal gap blocks longer than 20 positions with question marks, and replacing all IUPAC Ns with question marks in the sequence Thoropa_miliaris_CFBH10125:

python src/addSeq.py \
    --alignment test_data/tutorial/ex4.2/ex4.2_aln.fas \
    --new_seqs test_data/tutorial/ex4.2/ex4.2_new_seqs.fas \
    --output test_data/tutorial/ex4.2/ex4.2_out.fas \
    --orphan_threshold 45 \
    --gaps2question 20 \
    --n2question Thoropa_miliaris_CFBH10125 \
    --write_names True \
    --log True

Warning: the input new_seqs cannot be longer than the profile alignment.

Example 5: Ancient DNA

Suppose you have a dataset with ancient DNA sequences from the sample Dendropsophus_tritaeniatus_MZUSP73973. The IUPAC Ns present in these sequences are ambiguous positions resulting from low coverage depth in DNA read mapping. It is unknown whether these positions actually correspond to nucleotides (N) or to indels (-). As such, the IUPAC Ns of ancient DNA sequences can be replaced with question marks using the following command:

AAAAAAAA
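The replacement itself is simple; the idea behind it can be sketched in Python (a hypothetical re-implementation for illustration, not prepDyn's actual code):

```python
def n_to_question(records, terminal):
    """Replace IUPAC N with "?" only in the named terminal.

    records: mapping from terminal name to sequence string.
    Other terminals keep their Ns, since their ambiguity codes may be real.
    """
    out = dict(records)
    out[terminal] = out[terminal].replace("N", "?")
    return out

# Toy records for demonstration only.
recs = {"Dendropsophus_tritaeniatus_MZUSP73973": "ACNNTG", "Other_sp": "ACNNTG"}
print(n_to_question(recs, "Dendropsophus_tritaeniatus_MZUSP73973"))
# {'Dendropsophus_tritaeniatus_MZUSP73973': 'AC??TG', 'Other_sp': 'ACNNTG'}
```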

Cite
