Speeding up iqtree gene trees #290

vmkhot · 2024-07-31T16:08:10Z

Hi there,

I'm attempting to make gene trees (13000+ of them) with a mixture model. Some of the gene families have more than 1000 sequences and take a very long time to run, even on an HPC cluster.

E.g. A gene family alignment with 418 sequences and ~500AA took 6GB RAM and a total of 11h30m with the following commands:

iqtree2 -s N0.HOG0000932_MSA_trimmed.faa -m LG+F+G -T 10 --mem 100G -pre ./guide_trees/pmsf_guide_N0.HOG0000932_MSA_trimmed

iqtree2 -s N0.HOG0000932_MSA_trimmed.faa -m LG+C60+G -ft ./guide_trees/pmsf_guide_N0.HOG0000932_MSA_trimmed.treefile -mwopt -B 5000 -wbtl -T 10 --mem 100G -pre results/pmsf_N0.HOG0000932_MSA_trimmed

I need 5000 bootstraps for a tree reconciliation in some later steps. Ideally I would use 10000, but I've already reduced this. I have about 1200 trees to go and would really appreciate some suggestions on speeding up the larger alignments. Here's what I've considered so far after scouring the manual:

1. Reducing the mixture model to C20 for the larger alignments. However, I'm not sure whether it is appropriate to use different mixture models for different gene families like this, or whether I should rerun all the others (the remaining 11800) with C20 as well.
2. I saw in the manual that I can use iqtree-mpi, which we successfully installed and ran, but I run into a trade-off between cores and RAM: some of the larger alignments require 20GB of RAM, so even with a total of 100GB of RAM for the tree I can only split it into 5 processes. Also, the manual says that this is only beneficial for longer alignments, and mine range from 50AA to 3000AA.
3. Splitting the bootstrap into chunks of 1000 replicates and concatenating the results. From what I read, this is possible for the standard bootstrap. Is it also possible for the ultrafast bootstrap? The log files show that the bootstrap goes through 100 iterations at a time and writes to the .ufboot file multiple times, so how would this work if I split the bootstrap into chunks?

Any input from the community is greatly appreciated!

Thanks,

Varada

roblanf · 2024-08-06T04:55:07Z

Maintainer

Hi Varada,

Some answers that I hope might help.

1. Reducing to C20 seems entirely reasonable to me. You have to be pragmatic and fit the best model that you are able to. And yes, it's fine to have different models for different genes.
2. The amount of RAM scales with the number of classes in the mixture, so C20 will use 1/3 the RAM of C60. This might help! If you have a lot of alignments to run (many more than the number of processors you have), it's usually more efficient to give 1 processor to each job. This is because multi-threading speeds up one job but is never 100% efficient. In other words, one job per thread is more efficient than multithreading each job. You can marginally improve things by doing the biggest jobs first, so that you don't end up waiting a long time at the end.
3. I don't see why this shouldn't be possible for UFBOOT. It uses trees seen during the search, and splitting across multiple searches should be better, not worse, as long as the searches are truly independent (i.e. start from different trees, with different random number seeds, etc.). @bqminh may know better though. One uncertainty I'd have is about how IQ-TREE measures convergence, and whether you'd be inappropriately sampling across multiple runs. In practical terms, I think you would just cat the bootstrap tree files from UFBOOT. As with answer 2, though, I don't think this will improve efficiency overall unless you have many more CPUs than you have loci.
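
The idea of splitting the UFBoot replicates across independent runs and concatenating the bootstrap tree files could be sketched roughly as follows. This is only an illustration under assumptions: the function names, the chunk count, the seeds, and the `chunk_*` prefixes are hypothetical, the model and flags are copied from the original command with `--seed` added, and whether pooling UFBoot trees across runs is statistically appropriate is exactly the open question about convergence raised above.

```shell
#!/bin/bash
# Sketch: split the UFBoot replicates into independent runs, then pool the
# bootstrap trees. All names and numbers here are illustrative assumptions.

# Print one iqtree2 command per chunk, each with its own seed and prefix
# so that the searches are independent of each other.
print_chunk_commands() {
    local aln=$1 guide=$2 n_chunks=$3 reps=$4 i
    for i in $(seq 1 "$n_chunks"); do
        echo "iqtree2 -s $aln -m LG+C60+G -ft $guide -mwopt -B $reps -wbtl -T 1 --seed $i -pre chunk_$i"
    done
}

# Concatenate the per-chunk .ufboot files (Newick trees) into one file.
pool_ufboot() {
    local out=$1
    shift
    cat "$@" > "$out"
}

# Example (uncomment to use): 5 chunks of 1000 replicates each.
# print_chunk_commands aln.faa guide.treefile 5 1000 > chunk_jobs.txt
# ... run the commands in chunk_jobs.txt, e.g. one per core ...
# pool_ufboot pooled.ufboot chunk_*.ufboot
```

Printing the commands rather than running them keeps the sketch independent of any particular scheduler; each line can be handed to a job array or run one per core.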

If I were in your situation, I'd start by questioning whether you really need the C60 model. To do this, you could try fitting various simpler models on a subset of the loci, then comparing the things you are interested in (e.g. topologies, bootstrap values) to determine how much benefit C60 really brings you. You could, for example, estimate trees under C60 plus several other models, then use the AU/KH tests to ask whether the C60 model would actually reject any of the trees produced under simpler models. It may be that C60 is not necessary to get what you need at the level of precision you require.
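
A rough sketch of that comparison for a single locus is below. It assumes you have already inferred one tree per candidate model, and it uses IQ-TREE's tree topology test options (`-z`, `-n 0`, `-zb`, `-au`); the function names, the `au_test` prefix, and the file names in the example are hypothetical.

```shell
#!/bin/bash
# Sketch: gather trees inferred under different models for one locus and
# evaluate them all under C60. File and function names are illustrative.

# Concatenate candidate treefiles (one Newick tree each) into one tree set.
collect_trees() {
    local out=$1
    shift
    cat "$@" > "$out"
}

# Print the IQ-TREE command that evaluates the tree set under LG+C60+G.
# -n 0 skips the tree search, -zb sets the number of test replicates,
# and -au adds the AU test to the topology tests that are reported.
print_au_command() {
    local aln=$1 treeset=$2
    echo "iqtree2 -s $aln -m LG+C60+G -z $treeset -n 0 -zb 10000 -au -pre au_test"
}

# Example (uncomment to use):
# collect_trees candidates.trees lg_g.treefile c20.treefile c60.treefile
# print_au_command locus.faa candidates.trees
```

If the trees from the simpler models are not rejected under C60 for most loci in the subset, that would be some evidence that the cheaper model is adequate for your purposes.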

Rob

crcardenas · 2024-12-09T09:02:28Z

I've had a similar issue, though not nearly as extreme as yours. Not to detract from Rob's advice, but alongside using the C20 model as recommended, you could also try another approach on your HPC. Rather than relying on IQ-TREE to perform the gene tree searches itself, I split up my alignment using AMAS and then used a Slurm array to run the individual searches. This really helps with the time constraints on the HPC I use. Afterwards I concatenate the treefiles.

It would look something like this:

#!/bin/bash
#SBATCH --time 06:00:00 # HH:MM:SS (6 hours)
#SBATCH --partition public-cpu
#SBATCH --cpus-per-task 3 # number of cores to use
#SBATCH --mem-per-cpu 100 # in MB
#SBATCH --array 1-8164%100 # run 8164 tasks with a maximum of 100 concurrent tasks; the maximum array size depends on your cluster
#SBATCH --job-name gene_trees

# to run: sbatch ufboot_array.sh
# be sure to adjust the length of your array based on your list

# load modules
module load Anaconda3;
source activate iqtree2;

# set variables
WORKING_DIR="/your/path/to/locus/trees/loci_trees"
ALIGN_LIST="alignments.list"

# your list should look something like this:
# flank200_40p_parti/flank200_40p_uce1000002-out.nex
# flank200_60p_parti/flank200_40p_uce1000002-out.nex
# flank200_40p_parti/flank200_40p_uce20103-out.nex
# flank200_60p_parti/flank200_40p_uce20103-out.nex
# ...

# get the alignment path for this array task: line number SLURM_ARRAY_TASK_ID of the list
LINE=$(awk -v n="${SLURM_ARRAY_TASK_ID}" 'NR==n' "${ALIGN_LIST}")
DATASET_DIR=$(dirname "${LINE}")
ALIGNMENT=$(basename "${LINE}")

# change directories to where the data lives (stop if the directory is missing) and run the search
cd "${WORKING_DIR}/${DATASET_DIR}" || exit 1
srun iqtree -s "${ALIGNMENT}" -bnni -bb 1000 --runs 20 -nt 3
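
To go with the script above, here is a small sketch for building `alignments.list` with the largest alignments first (following the suggestion earlier in the thread to start the biggest jobs first) and for sizing the array from the list instead of hard-coding it. It assumes file size is a reasonable proxy for run time, and the two helper function names are hypothetical.

```shell
#!/bin/bash
# Sketch: write the alignment list largest-first and derive the array range.
# Assumption: bigger alignment files take longer to run.

# Print the given alignment paths sorted by file size, largest first.
sort_by_size_desc() {
    local f
    for f in "$@"; do
        printf '%s\t%s\n' "$(wc -c < "$f" | tr -d ' ')" "$f"
    done | sort -rn | cut -f2-
}

# Print a value for --array: 1-N%MAX, where N is the number of lines in the list.
array_spec() {
    local list=$1 max_concurrent=$2 n
    n=$(wc -l < "$list" | tr -d ' ')
    echo "1-${n}%${max_concurrent}"
}

# Example (run from WORKING_DIR so paths look like dataset_dir/alignment.nex):
# sort_by_size_desc */*.nex > alignments.list
# sbatch --array "$(array_spec alignments.list 100)" ufboot_array.sh
```

Options given on the `sbatch` command line take precedence over the `#SBATCH` lines in the script, so the `--array` directive inside the script can stay as a default.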
