prepDyn: Preprocessing sequences for dynamic homology


A collection of Python scripts to facilitate the preprocessing of input sequences for dynamic homology.

In dynamic homology, data should be preprocessed to distinguish differences in sequence length caused by missing data from those caused by insertion-deletion events, so that groupings are not driven by artifacts. However, previous empirical studies using POY/PhyG preprocessed data manually, with varying approaches. Here we present prepDyn, a collection of Python scripts to facilitate the preprocessing of input sequences for POY/PhyG.

Copyright (C) Daniel Y. M. Nakamura 2025

Installation

Two dependencies must be installed by the user beforehand:

  • Python v. 3.10.9 (or newer), including argparse, ast, csv, importlib, re, StringIO, subprocess, sys, tempfile, and time, which are usually part of recent versions of Python.
  • MAFFT v. 7.5.2 (or newer), installed in $PATH as 'mafft'.
conda create -n new_env python=3.10 --yes
conda install bioconda::mafft

Other dependencies are Python modules that will be automatically installed by prepDyn when you run it for the first time:

  • Bio v. 1.73 (or newer), including AlignIO, Entrez, SeqIO, Align, Seq, and SeqRecord.
  • matplotlib v. 3.7.0 (or newer)
  • numpy v. 1.23.5 (or newer)
  • termcolor
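The "install on first run" behavior described above can be sketched with the common try-import-then-pip pattern (a hedged illustration of the idea, not prepDyn's actual mechanism):

```python
import importlib
import subprocess
import sys

def ensure_module(name, pip_name=None):
    """Import a module, installing it via pip first if it is missing.

    pip_name is the package name on PyPI when it differs from the import
    name (e.g. import "Bio" but install "biopython").
    """
    try:
        return importlib.import_module(name)
    except ImportError:
        subprocess.check_call([sys.executable, "-m", "pip", "install", pip_name or name])
        return importlib.import_module(name)

# Standard-library module: already importable, so no install is triggered.
json_mod = ensure_module("json")
```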

If the modules are not installed automatically, try:

conda install conda-forge::biopython
conda install conda-forge::matplotlib
conda install anaconda::numpy
conda install conda-forge::termcolor

Finally, clone the prepDyn repository using the command:

git clone https://github.com/danimelsz/PrepDyn.git

Introduction

prepDyn comprises four steps: (1) data collection from GenBank, (2) trimming, (3) identification of missing data, and (4) partitioning.

Usage

prepDyn is organized in three Python files in the directory src:

  • prepDyn.py: main script integrating the pipeline.
  • GB2MSA.py: script to download sequences from GenBank and identify internal missing data.
  • addSeq.py: script to align one or a few sequence(s) to a previously preprocessed alignment.

Warning: Do not move the files out of the directory src, otherwise Python may not resolve the module imports.

The following examples are designed for users with little experience with Unix. If you have questions, open a GitHub issue.

Example 1: Basic

The basic use of prepDyn is running all four steps with a single command. Given an input CSV whose first column is called Terminals and whose remaining columns are gene names (each cell containing the corresponding GenBank accession number), the following command will download sequences, trim invariant sites and orphan nucleotide blocks shorter than 10 bp at terminal positions, and code missing data as ? (all differences in sequence length at terminal positions are treated as missing data). The log reports the runtime.
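For illustration, a minimal input CSV of this shape could look like the following (terminal names and accession numbers here are hypothetical placeholders, not real GenBank records):

```csv
Terminals,gene1,gene2
Species_A,XX000001,XX000002
Species_B,XX000003/XX000004,XX000005
```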

python src/prepDyn.py \
    --GB_input test_data/tutorial/ex1.1/ex1.1_input.csv \
    --output_file test_data/tutorial/ex1.1/ex1.1 \
    --del_inv T \
    --orphan_method semi \
    --orphan_threshold 10 \
    --partitioning_method None \
    --log T 

In the CSV file, if more than one GenBank accession number is specified in the same cell, referring to non-overlapping fragments of the same gene (e.g. MT893619/MT895696), the region between them is automatically coded as internal missing data (?).
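The idea can be sketched as follows (an illustrative re-implementation, not prepDyn's internal code; here a single ? marks the boundary, whereas the real tool may insert a run of ? characters sized by context):

```python
def join_fragments(cell_value, seqs, sep="?"):
    """Concatenate non-overlapping fragments listed in one CSV cell.

    cell_value: slash-separated accessions, e.g. "MT893619/MT895696".
    seqs: mapping from accession number to its downloaded sequence.
    The unknown region between fragments is coded as missing data (?).
    """
    accessions = cell_value.split("/")
    return sep.join(seqs[acc] for acc in accessions)

# Toy sequences for demonstration only.
fragments = {"MT893619": "ACGT", "MT895696": "TTGA"}
print(join_fragments("MT893619/MT895696", fragments))  # ACGT?TTGA
```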

We specified --partitioning_method None, which means that partitioning was not performed. As a heuristic, we recommend testing the impact of adding pound signs on tree optimality scores using a successive partitioning strategy. For instance, if you specify --partitioning_method conservative and --partitioning_round 1, the largest block(s) of contiguous invariants will be partitioned.

python src/prepDyn.py \
    --input_file test_data/tutorial/ex1.2/ex1.2_input.fasta \
    --output_file test_data/tutorial/ex1.2/ex1.2 \
    --partitioning_method balanced \
    --partitioning_round 1 \
    --log T

This process can continue until tree costs reported by POY/PhyG remain stationary (e.g. --partitioning_round 2 inserts pound signs in the two largest blocks of contiguous invariants). Other partitioning methods are also available, and the user should explore whether they can reduce tree costs.
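The core of this heuristic, locating the largest runs of contiguous invariant columns, can be sketched like so (a simplified illustration under the assumption that "invariant" means identical in all sequences; prepDyn's actual implementation may differ):

```python
def invariant_runs(alignment):
    """Return (start, length) for each run of invariant columns.

    alignment: list of equal-length sequence strings.
    A column is invariant when all sequences share the same character.
    """
    ncols = len(alignment[0])
    invariant = [len({seq[i] for seq in alignment}) == 1 for i in range(ncols)]
    runs, start = [], None
    for i, inv in enumerate(invariant):
        if inv and start is None:
            start = i                      # run begins
        elif not inv and start is not None:
            runs.append((start, i - start))  # run ends
            start = None
    if start is not None:
        runs.append((start, ncols - start))  # run reaches the last column
    return runs

def largest_runs(alignment, k=1):
    """The k largest invariant runs, i.e. candidate partition points."""
    return sorted(invariant_runs(alignment), key=lambda r: r[1], reverse=True)[:k]

aln = ["ACGTAA", "ACGTTA", "ACGTCA"]
print(largest_runs(aln, k=1))  # [(0, 4)]
```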

Example 2: GB2MSA + prepDyn

Suppose you want to download sequences and preprocess them using separate commands. Given a CSV file called input.csv, the following command will download the sequences and align them with MAFFT. In addition, files containing the terminal names (useful for controlling taxon sampling in POY/PhyG) and the runtime will be written.

AAAA

Now, you can run prepDyn:

AAAA

Example 3: Multiple alignments

Suppose you have a phylogenomic dataset with hundreds of gene alignments in the directory ./data/. Phylogenomic datasets are usually unavailable in GenBank, but are available in repositories like Dryad and Zenodo. You can preprocess all unaligned gene files in FASTA format using a single command:

python src/prepDyn.py \
    --input_file test_data/tutorial/ex3.1/ \
    --input_format fasta \
    --output_file test_data/tutorial/ex3.1/out \
    --MSA T \
    --del_inv T \
    --orphan_method semi --orphan_threshold 10 \
    --internal_method semi --internal_threshold 15 \
    --partitioning_method max

If the input files are already aligned, just change the boolean parameter MSA to False:

python src/prepDyn.py \
    --input_file test_data/tutorial/ex3.2/ \
    --input_format fasta \
    --output_file test_data/tutorial/ex3.2/ \
    --MSA F \
    --del_inv T \
    --orphan_method semi --orphan_threshold 10 \
    --internal_method semi --internal_threshold 15 \
    --log T

Example 4: Appending new sequences

MUSCLE and MAFFT are unable to align sequences if pound signs or question marks are present. This is a problem when we try to align new sequences to a previously preprocessed profile alignment. To avoid manual alignment by eye, addSeq.py allows aligning new sequences to profile alignments. Gaps, missing data, and pound signs in the sequences of the profile alignment are left unmodified; they are only inserted into the new sequences.

A simple example:

python src/addSeq.py \
    --alignment test_data/tutorial/ex4.1/ex4.1_aln.fas \
    --new_seqs test_data/tutorial/ex4.1/ex4.1_new_seqs.fas \
    --output test_data/tutorial/ex4.1/ex4.1_out.fas \
    --log True

A more complex example, where the new sequences are preprocessed by trimming blocks of orphan nucleotides shorter than 45 bp, replacing internal gap blocks longer than 20 positions with question marks, and replacing all IUPAC Ns with question marks in the sequence Thoropa_miliaris_CFBH10125:

python src/addSeq.py \
    --alignment test_data/tutorial/ex4.2/ex4.2_aln.fas \
    --new_seqs test_data/tutorial/ex4.2/ex4.2_new_seqs.fas \
    --output test_data/tutorial/ex4.2/ex4.2_out.fas \
    --orphan_threshold 45 \
    --gaps2question 20 \
    --n2question Thoropa_miliaris_CFBH10125 \
    --write_names True \
    --log True

Warning: the input new_seqs cannot be longer than the profile alignment.

Example 5: Ancient DNA

Suppose you have a dataset with ancient DNA sequences from the sample Dendropsophus_tritaeniatus_MZUSP73973. The IUPAC Ns present in these sequences are ambiguous positions resulting from low coverage depth in DNA read mapping. It is unknown whether these positions actually correspond to nucleotides (N) or to indels (-). As such, the IUPAC Ns of ancient DNA sequences can be replaced with question marks using the following command:

AAAAAAAA
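The replacement itself is simple; the idea behind it can be sketched in Python (a hypothetical re-implementation for illustration, not prepDyn's actual code):

```python
def n_to_question(records, terminal):
    """Replace IUPAC N with "?" only in the named terminal.

    records: mapping from terminal name to sequence string.
    Other terminals keep their Ns, since their ambiguity codes may be real.
    """
    out = dict(records)
    out[terminal] = out[terminal].replace("N", "?")
    return out

# Toy records for demonstration only.
recs = {"Dendropsophus_tritaeniatus_MZUSP73973": "ACNNTG", "Other_sp": "ACNNTG"}
print(n_to_question(recs, "Dendropsophus_tritaeniatus_MZUSP73973"))
# {'Dendropsophus_tritaeniatus_MZUSP73973': 'AC??TG', 'Other_sp': 'ACNNTG'}
```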

Cite
