---
layout: page
title: Standard tree inference
categories: jekyll update
usemathjax: true
---
<script type="text/javascript" charset="utf-8" src="https://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML, https://vincenttam.github.io/javascripts/MathJaxLocal.js"></script>

For gene tree reconstruction, we will use a dataset of Australasian monitor lizards (genus Varanus) from Pavón-Vázquez et al. (2021). It consists of 388 nuclear loci obtained through anchored hybrid enrichment, a technique for capturing orthologous regions of the genome.

To estimate trees from these loci, we will rely on RAxML, a program for efficient tree inference based on maximum likelihood (ML). We will also estimate node support by bootstrapping.

Download and install the software

Download the latest version from GitHub. Alternatively, if you use UNIX (Linux or Mac) and have git installed, open the terminal and type:

git clone https://github.com/stamatak/standard-RAxML.git

to download the repository. If you downloaded the .zip archive instead, uncompress it. In either case, move into the newly created folder.

Installing RAxML in Windows

Windows executables are already included in the folder. To run the software, open the command prompt (cmd.exe), move to the folder that contains the executables, and call the program:

REM use the `cd` command to change the directory
cd \path\to\raxml\WindowsExecutables_v8.2.10
.\raxmlHPC
REM The following error message should appear:
REM Error, you must specify a model of substitution with the "-m" option

Be aware that Windows uses the backslash \ as the path-component separator, while Unix uses the forward slash /.

Compiling RAxML in Unix (Mac or Linux)

On Unix, the software must be compiled before it can be used.

Compile the standard version:

make -f Makefile.gcc # compile the standard (sequential) version

Alternatively, you can compile the multicore version, which can use multiple CPU cores:

rm *.o # remove previously compiled files if you installed a different version
make -f Makefile.PTHREADS.gcc

This will create an executable raxmlHPC (or raxmlHPC-PTHREADS, depending on the compiled version) that can be called:

./raxmlHPC
# The following error message should appear:
# Error, you must specify a model of substitution with the "-m" option

Running RAxML for gene tree inference

We will estimate a tree for one locus under the $\text{GTR} + \Gamma$ model of nucleotide substitution and compute 100 bootstrap pseudoreplicates to assess node support. The bootstrap is a statistical approach for assessing the accuracy of an estimate (a continuous parameter or a clade in a tree) by analyzing resampled replicates of the original data. Its use in phylogenetic inference was introduced by Felsenstein (1985) to assess "confidence" in each clade of a tree. More specifically, in each of $N$ cycles (typically $N$ = 100 or 1000), the algorithm samples sites (alignment columns) with replacement from the original alignment until the original number of sites is reached. A tree is estimated from each pseudoreplicate, and the proportion of pseudoreplicate trees in which a clade appears provides a measure of its support. The support values are usually displayed on the maximum likelihood tree.

The following command infers an ML tree and estimates node support in a single run:
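To make the resampling step concrete, here is a minimal shell sketch on a made-up alignment of three short sequences. The helper name `bootstrap_replicate`, the toy sequences, and the seed are illustrative assumptions; this is not how RAxML implements bootstrapping internally, only the column-sampling idea described above.

```shell
# bootstrap_replicate SEED: read one aligned sequence per line (all of equal
# length) and print one pseudoreplicate built by sampling columns with
# replacement. Hypothetical helper for illustration only.
bootstrap_replicate() {
    awk -v seed="$1" '
        BEGIN { srand(seed) }
        NR == 1 {
            n = length($0)
            # draw n column indices (1-based) with replacement
            for (k = 1; k <= n; k++) col[k] = int(rand() * n) + 1
        }
        {
            # apply the same sampled columns to every sequence
            out = ""
            for (k = 1; k <= n; k++) out = out substr($0, col[k], 1)
            print out
        }
    '
}

# Toy alignment of three made-up sequences (not real Varanus data).
printf '%s\n' ACGTACGTAC ACGTTCGTAC ACCTACGAAC | bootstrap_replicate 2334
```

Each call produces an alignment with the same number of sequences and sites as the input, in which some columns appear several times and others not at all. A bootstrap analysis repeats this $N$ times and infers a tree from every pseudoreplicate.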

./raxmlHPC -s locus177.phylip -n 177.boot -m GTRGAMMA -f a -N 100 -p 2334 -x 563454
  • -s: name of the sequence file (include the path to the file if it is located in a different folder)
  • -n: name of the run, which is appended to every output file (here, the files generated during the run will end in .177.boot)
  • -m: substitution model
  • -f: selects one of the algorithms available in RAxML. If nothing is specified, RAxML runs its default rapid hill-climbing tree search (equivalent to -f d). The a option tells RAxML to conduct a rapid bootstrap analysis and search for the best-scoring ML tree in a single run
  • -N: number of bootstrap pseudoreplicates
  • -p: random number seed to generate a parsimony starting tree (can be any integer)
  • -x: specify an integer number (random seed) and turn on rapid bootstrapping

Further command options are detailed in the software manual, or can be explored using:

./raxmlHPC -h

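The maximum likelihood tree is written to the RAxML_bestTree.177.boot file, and the same tree annotated with bootstrap support values is written to RAxML_bipartitions.177.boot. We can visualize the tree in FigTree (download from here) and, optionally, export it in any image format. To display the support values, open RAxML_bipartitions.177.boot in FigTree and, on the left-hand side of the screen, select: Branch Labels → Display → label.

Note: By default, RAxML writes its output files to the directory from which it is executed, so the simplest way to send them to a specific folder is to run RAxML from that folder (alternatively, the -w option accepts the full path to an output directory).
Suppose that we want the output files in a folder called output/: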

cd output/
path/to/raxmlHPC -s path/to/locus177.phylip -n 177.boot -m GTRGAMMA -f a -N 100 -p 2334 -x 563454

To simplify this command, you can add the RAxML executable to the path (follow this guide). This allows you to execute the program from any directory.
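As a minimal sketch of what this usually involves on Unix, assume the repository was cloned to `$HOME/standard-RAxML` (a hypothetical location; substitute the folder that actually contains your executable). The folder is appended to the `PATH` variable of the current terminal session:

```shell
# Hypothetical install location; replace with the folder that contains raxmlHPC.
RAXML_DIR="$HOME/standard-RAxML"

# Append the folder to PATH for the current terminal session only.
export PATH="$PATH:$RAXML_DIR"

# To keep the change across sessions, the export line can be added to a shell
# startup file such as ~/.bashrc (which file applies depends on your shell).

# Check whether the shell now finds the program.
command -v raxmlHPC || echo "raxmlHPC not found in $RAXML_DIR"
```

Once the program is found, the command above can be written with `raxmlHPC` in place of `path/to/raxmlHPC`.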

Let's estimate a tree for a different locus:

./raxmlHPC -s locus256.phylip -n 256.boot -m GTRGAMMA -f a -N 100 -p 2334 -x 563454

Visualize both trees (RAxML_bipartitions.177.boot and RAxML_bipartitions.256.boot). Some of the phylogenetic relationships differ between them. What could be the reason(s)?

Varanus komodoensis is the famous Komodo dragon.

Automating gene tree inference using a loop

We can run the analysis for all 388 loci automatically with a shell loop. All .phy files in the dataset folder are named L_1.phy, L_2.phy ... L_388.phy, so a loop with an iterator i taking values from 1 to 388 can pass each input file to RAxML in turn:

for i in {1..388}
do
    ./raxmlHPC -s "L_${i}.phy" -n "${i}.boot" -m GTRGAMMA -f a -N 100 -p 2334 -x 563454
done
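With hundreds of loci, it can help to skip missing inputs and loci that are already finished, since RAxML generally refuses to start when output files with the same run name exist. The sketch below is one possible variant of the loop. The helper name `run_locus` and the `DRY_RUN` variable are illustrative assumptions and not part of RAxML; the file names follow the L_1.phy ... L_388.phy convention described above.

```shell
# run_locus N: infer the tree for locus N unless its input is missing or its
# output already exists. Hypothetical helper, not part of RAxML.
run_locus() {
    locus=$1
    if [ ! -f "L_${locus}.phy" ]; then
        echo "skip ${locus}: L_${locus}.phy not found"
        return 0
    fi
    if [ -f "RAxML_bipartitions.${locus}.boot" ]; then
        echo "skip ${locus}: already done"
        return 0
    fi
    # If DRY_RUN is set (e.g. DRY_RUN=1), the command is printed, not executed.
    ${DRY_RUN:+echo} ./raxmlHPC -s "L_${locus}.phy" -n "${locus}.boot" \
        -m GTRGAMMA -f a -N 100 -p 2334 -x 563454
}

for i in $(seq 1 388)
do
    run_locus "$i"
done
```

Setting `DRY_RUN=1` before the loop prints the commands that would be run, which is a cheap way to check file names before starting a long analysis.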