
Lab 6: Phylogenetic Reinference via Maximum Likelihood

Ricky Woo edited this page Dec 17, 2018 · 2 revisions

Learning Objectives

After this lab, you will

  • Grasp the underlying theory of the maximum likelihood approach to phylogenetic inference
  • Know how to choose an evolutionary model
  • Understand the input and output of phylogenetic analyses
  • Use PhyML to conduct phylogenetic reinference
  • Draw trees using FigTree

1. PhyML Tutorial

1.1 Menu

First menu

[+] .................................... Next sub-menu
[-] ................................ Previous sub-menu
[Y] .............................. Launch the analysis

[D] .......................... Data type (DNA/AA/Generic)  DNA
[I] ......... Input sequences interleaved (or sequential)  interleaved
[M] .......................... Analyze multiple data sets  no
[R] .............................................. Run ID  none

Second menu

[+] .................................... Next sub-menu
[-] ................................ Previous sub-menu
[Y] .............................. Launch the analysis

[M] .................... Model of nucleotide substitution  HKY85
[F] .................... Optimise equilibrium frequencies  yes
[T] ....................... Ts/tv ratio (fixed/estimated)  estimated
[V] .... Proportion of invariable sites (fixed/estimated)  fixed (p-invar=0.00)
[R] .......... One category of substitution rate (yes/no)  no
[C] .............. Number of substitution rate categories  4
[G] ...... Gamma distribution parameter (fixed/estimated)  estimated
[O] .............................. Optimise tree topology  yes
[U] ........... Starting tree (BioNJ/parsimony/user tree)  BioNJ
[S] ..................... Tree topology search operations  NNI moves (fast, approximate)
[B] ................... Non parametric bootstrap analysis  no
[A] ................... Approximate likelihood ratio test  yes / SH-like supports

1.2 Model of nucleotide substitution

  • JC69
  • K80
  • F81
  • HKY85
  • F84
  • TN93
  • GTR
  • Custom (*): 000000, 012345
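The custom model codes above can be read as follows: each of the six digits labels one nucleotide pair, and pairs sharing a digit share a relative rate. The sketch below assumes the pair order AC, AG, AT, CG, CT, GT (check the PhyML manual for your version). The code 010010 is an extra illustration not listed above: it puts the two transitions (AG, CT) in one class and all transversions in another, which is the HKY85-like pattern.

```python
# Assumed order of the six exchange rates in a custom model code.
RATE_PAIRS = ("AC", "AG", "AT", "CG", "CT", "GT")

def rate_classes(code):
    """Group the six nucleotide pairs by the digit assigned to each of them."""
    if len(code) != 6 or not code.isdigit():
        raise ValueError("a custom model code must consist of six digits")
    groups = {}
    for pair, digit in zip(RATE_PAIRS, code):
        groups.setdefault(digit, []).append(pair)
    return groups

def free_rate_parameters(code):
    """Rates are relative, so one class is the reference: classes - 1 are free."""
    return len(rate_classes(code)) - 1

for code in ("000000", "010010", "012345"):
    print(code, rate_classes(code), free_rate_parameters(code))
```

With this reading, 000000 has no free exchange rate (all substitutions equally likely, apart from base frequencies) and 012345 has five, as in GTR.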

1.3 Equilibrium frequencies of nucleotides

  • Estimated empirically from the used datasets
  • Optimized with a maximum likelihood (ML) approach
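To make the "empirical" option concrete, here is a minimal sketch that counts base frequencies in a toy alignment. The sequences are made up for illustration, and PhyML's own handling of gaps and ambiguity codes may differ from the simple "skip anything that is not A, C, G or T" rule assumed here.

```python
from collections import Counter

def empirical_frequencies(sequences, alphabet="ACGT"):
    """Return the observed frequency of each base, ignoring other characters."""
    counts = Counter()
    for seq in sequences:
        counts.update(ch for ch in seq.upper() if ch in alphabet)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no unambiguous bases found")
    return {base: counts[base] / total for base in alphabet}

# Toy alignment (illustrative only); '-' marks a gap.
toy_alignment = ["ACGTAC", "ACGTTC", "AC-TAC"]
print(empirical_frequencies(toy_alignment))
```

The ML option instead treats the frequencies as free parameters and adjusts them together with the other model parameters to maximise the likelihood.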

1.4 Transition/Transversion (ts/tv) ratio

  • Fixed
  • Estimated through maximum likelihood
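As a reminder of what is being counted, the sketch below classifies the differences between two aligned sequences as transitions (A<->G, C<->T) or transversions (everything else). The sequences are invented, and the raw ratio of these counts is only an intuition aid: it is not the same quantity as the ts/tv parameter that PhyML estimates under a substitution model.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}

def count_ts_tv(seq_a, seq_b):
    """Count transitions and transversions between two aligned sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    ts = tv = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a == b or a not in "ACGT" or b not in "ACGT":
            continue  # identical sites, gaps and ambiguity codes are skipped
        if {a, b} <= PURINES or {a, b} <= PYRIMIDINES:
            ts += 1
        else:
            tv += 1
    return ts, tv

print(count_ts_tv("AACCGT", "GATCTT"))
```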

1.5 Proportion of invariable sites

  • Estimated
  • Fixed (.01)
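A quick sanity check before fixing or estimating this parameter is to look at how many alignment columns are constant. The sketch below (toy data, hypothetical function name) computes that fraction. Invariable sites always appear constant, but a variable site can also look constant by chance, so the observed fraction is best read as a rough upper bound on the proportion of invariable sites.

```python
def constant_site_fraction(sequences):
    """Fraction of alignment columns in which all sequences show the same character."""
    lengths = {len(seq) for seq in sequences}
    if len(lengths) != 1 or 0 in lengths:
        raise ValueError("sequences must be non-empty and of equal length")
    columns = list(zip(*(seq.upper() for seq in sequences)))
    constant = sum(1 for column in columns if len(set(column)) == 1)
    return constant / len(columns)

# Toy alignment (illustrative only).
print(constant_site_fraction(["ACGTAC", "ACGTTC", "ACGAAC"]))
```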

1.6 Tree topology search strategy

  • NNI moves (quick, approximate): nearest neighbor interchange
  • SPR moves (slow, accurate): subtree pruning and regrafting
  • Best of NNI and SPR
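One way to see why NNI is quick and SPR is slow is to compare how many trees each move can reach in a single step. The sketch below uses the standard counts for unrooted binary trees with n taxa: 2(n-3) NNI neighbours, 2(n-3)(2n-7) SPR neighbours, and (2n-5)!! topologies overall. These formulas are textbook results, not taken from PhyML itself, which also uses heuristics to avoid evaluating every neighbour.

```python
def nni_neighbours(n_taxa):
    """Each of the n-3 internal branches allows 2 distinct NNI rearrangements."""
    return 2 * (n_taxa - 3)

def spr_neighbours(n_taxa):
    """Size of the SPR neighbourhood of an unrooted binary tree."""
    return 2 * (n_taxa - 3) * (2 * n_taxa - 7)

def unrooted_topologies(n_taxa):
    """Number of unrooted binary tree topologies: (2n-5)!!."""
    count = 1
    for k in range(3, 2 * n_taxa - 4, 2):
        count *= k
    return count

for n in (4, 5, 10, 20):
    print(n, nni_neighbours(n), spr_neighbours(n), unrooted_topologies(n))
```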

1.7 Non-parametric bootstrap analysis

  • no
  • Number of replicates
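The resampling step behind the non-parametric bootstrap can be sketched in a few lines: columns of the alignment are drawn with replacement until the replicate has as many sites as the original. This is only an illustration on made-up sequences. PhyML does the resampling internally, then infers one tree per replicate and reports how often each branch of the original tree is recovered.

```python
import random

def bootstrap_alignment(sequences, rng):
    """Build one pseudo-replicate by sampling alignment columns with replacement."""
    n_sites = len(sequences[0])
    if any(len(seq) != n_sites for seq in sequences):
        raise ValueError("sequences must be aligned to the same length")
    picked = [rng.randrange(n_sites) for _ in range(n_sites)]
    return ["".join(seq[i] for i in picked) for seq in sequences]

# Toy alignment (illustrative only); a fixed seed keeps the example reproducible.
toy_alignment = ["ACGTAC", "ACGTTC", "ACGAAC"]
rng = random.Random(42)
for replicate_id in range(3):
    print(replicate_id, bootstrap_alignment(toy_alignment, rng))
```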

1.8 Approximate likelihood ratio test

  • aLRT statistics
  • Chi2-based supports
  • aBayes supports
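The idea behind the aLRT statistic can be sketched as follows: for a given internal branch, compare the log-likelihood of the best arrangement around that branch with that of the second-best one. The sketch assumes the statistic is 2(lnL1 - lnL2) and that the Chi2-based support uses an equal mixture of chi-square distributions with 0 and 1 degrees of freedom, as described in the aLRT literature. The log-likelihood values are invented, and PhyML's internal computation may differ in detail.

```python
import math

def alrt_statistic(lnl_best, lnl_second_best):
    """Twice the log-likelihood difference between the two best local arrangements."""
    return 2.0 * (lnl_best - lnl_second_best)

def chi2_mixture_pvalue(statistic):
    """P-value under 0.5*chi2(0 df) + 0.5*chi2(1 df), using P(chi2_1 > x) = erfc(sqrt(x/2))."""
    if statistic <= 0:
        return 1.0
    return 0.5 * math.erfc(math.sqrt(statistic / 2.0))

# Invented log-likelihoods, for illustration only.
stat = alrt_statistic(-1000.0, -1002.5)
print(stat, chi2_mixture_pvalue(stat))
```

A large statistic (small p-value) means the branch is clearly better supported than its best local alternative.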

2. PhyML Command Line

You can type in

phyml --help

to see the following phyml manual:

  • -i|--input seq_file_name: nt or aa file in PHYLIP format (the input MSA file)
  • -d|--datatype data_type: nt - nucleotide (default); aa - amino acids; generic (the data type)
  • -q|--sequential: changes interleaved format (default) to sequential format (the layout of the MSA file)
  • -n|--multiple nb_data_sets: number of datasets to analyze
  • -p|--pars: use a minimum parsimony starting tree
  • -b|--bootstrap int: number of bootstrap replicates, or the type of branch support to compute
    • >0: number of bootstrap replicates
    • 0: neither aLRT nor bootstrap values are computed
    • -1: aLRT statistics for aLRT test
    • -2: Chi2-based parametric branch supports for aLRT test
    • -4: (default) SH-like branch supports alone
    • -5: approximate Bayes branch supports
  • -m|--model model: substitution model name
    • nt-based model: HKY85 (default), JC69, K80, F81, F84, TN93, GTR, custom (*)
    • aa-based model: LG (default), WAG, JTT, MtREV, Dayhoff, DCMut, RtREV, CpREV, VT, Blosum62, MtMam, MtArt, HIVw, HIVb, custom
  • --aa_rate_file filename: aa substitution rate matrix in PAML format
  • -f e|m|fA,fC,fG,fT: equilibrium base/aa frequencies; e - empirical (counted from the alignment), m - estimated by ML, or four fixed nucleotide frequencies
  • -t|--ts/tv ts/tv_ratio: transition/transversion rate ratio; e to estimate it by ML, or a positive float to fix it
  • -v|--pinv prop_invar: proportion of invariable sites; e to estimate it by ML, or a value in [0,1] to fix it
  • -c|--nclasses nb_subst_cat: number of relative substitution rate categories, default 4
  • -a|--alpha alpha: shape parameter (alpha) of the gamma distribution; e to get the ML estimate, or a fixed positive value
  • -s|--search move: tree topology search method, NNI|SPR|BEST
  • -u|--inputtree user_tree_file: starting tree file in Newick format
  • -o params: optimizes specific parameters:
    • tlr: tree topology (t) + branch length (l) + rate parameters (r)
    • tl: tree topology + branch length
    • lr: branch length + rate parameters
    • l: branch length
    • r: rate parameters
    • n: no parameter
  • --rand_start: sets the initial tree to random, only for SPR search
  • --n_rand_starts num: number of random initial trees to be used
  • --r_seed num: seed for the random number generator
  • --print_site_lnl: print the likelihood for each site in file *_phyml_lk.txt
  • --print_trace: print each phylogeny during tree search process in file *_phyml_trace.txt
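To see how these options combine, here is a small sketch that assembles a phyml command line as an argument list. The helper names (phyml_command, run_if_installed) are hypothetical, the flags are the ones listed above, and the value e for an ML estimate follows the PhyML help text, so check phyml --help for your version. The command is only printed here and nothing is executed unless you call run_if_installed yourself.

```python
import shlex
import shutil
import subprocess

def phyml_command(alignment, model="HKY85", tstv="e", alpha="e",
                  n_classes=4, pinv=None, freqs=None, search="NNI",
                  support=None):
    """Assemble a phyml argument list from the options described above."""
    cmd = ["phyml", "-i", alignment, "-d", "nt", "-m", model,
           "-c", str(n_classes), "-s", search]
    if n_classes > 1:
        cmd += ["-a", str(alpha)]      # gamma shape: "e" or a fixed value
    if tstv is not None:
        cmd += ["-t", str(tstv)]       # ts/tv ratio: "e" or a fixed value
    if pinv is not None:
        cmd += ["-v", str(pinv)]       # proportion of invariable sites
    if freqs is not None:
        cmd += ["-f", freqs]           # "e" (empirical) or "m" (ML)
    if support is not None:
        cmd += ["-b", str(support)]    # >0 bootstrap replicates, <0 aLRT variants
    return cmd

def run_if_installed(cmd):
    """Run the command only if the executable is on PATH (not called below)."""
    if shutil.which(cmd[0]) is None:
        print(cmd[0] + " not found on PATH; command not executed")
        return None
    return subprocess.run(cmd, check=True)

cmd = phyml_command("primates-nt.phy", freqs="m")
print(" ".join(shlex.quote(arg) for arg in cmd))
```

Building the list in code rather than typing the string by hand makes it easy to loop over models or options later without copy-paste errors.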

3. Figtree tutorial

4. Exercises

4.1 Exercise 1 - Parameter estimation

  • Dataset: primates-nt.phy

Inferring phylogenies using maximum likelihood

In this exercise we need to run PhyML twice in order to compare the effect of

  • estimating nt frequencies from the used dataset versus
  • optimizing the frequencies via maximum likelihood
  1. First run: Set the model to HKY85+Gamma. Estimate the transition/transversion (ts/tv) ratio and the alpha parameter of the Gamma distribution by maximum likelihood, and optimize the nucleotide frequencies by ML as well.
  2. Second run: Set the model to HKY85+Gamma. Estimate the ts/tv ratio and the alpha parameter by ML, but estimate the nucleotide frequencies empirically from the data.
  3. Answer the following questions:
  • Do you see much difference in the resulting tree?
  • Do you observe much difference in the likelihood value (in stat file)?
  • Which option is better and why do you think so?
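To compare the two runs you need the log-likelihood from each stat file. The sketch below assumes the file contains a line of the form ". Log-likelihood: <value>" and that the file is named after the input alignment (for example with a _phyml_stats.txt suffix). Both are assumptions to check against your PhyML version, and the excerpt used here is a mock, not real output.

```python
import re

LNL_PATTERN = re.compile(r"Log-likelihood:\s*(-?\d+(?:\.\d+)?)")

def parse_log_likelihood(stats_text):
    """Return the first log-likelihood value found in the text of a stat file."""
    match = LNL_PATTERN.search(stats_text)
    if match is None:
        raise ValueError("no 'Log-likelihood:' line found")
    return float(match.group(1))

def read_log_likelihood(path):
    """Convenience wrapper that reads a stat file from disk."""
    with open(path, encoding="utf-8") as handle:
        return parse_log_likelihood(handle.read())

# Mock excerpt with an invented value, for illustration only.
MOCK_STATS = """
. Model of nucleotides substitution: \tHKY85
. Log-likelihood: \t\t\t-1234.56789
"""
print(parse_log_likelihood(MOCK_STATS))
```

Remember that a higher (less negative) log-likelihood means a better fit, and that optimizing more parameters by ML can never make the maximised likelihood worse.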

4.2 Exercise 2 - Tree topologies

In this exercise you will learn to optimize the tree topology by running a tree search (e.g., NNI, SPR, TBR) that starts from the initial tree topology and uses the substitution parameters estimated by ML:

  1. Compare the ML and the BioNJ trees and the model estimates (HKY85+Gamma) obtained for the two trees.
  2. Compare the likelihood of the ML and BioNJ trees.
  3. What do you observe and why?

4.3 Exercise 3 - Model comparison

In this exercise, you are asked to infer the phylogenetic tree on the same dataset using different substitution models (and their variations). Use GTR, HKY85 and JC69, each with and without Gamma and invariable sites, as listed below.

  1. First run: Run PhyML with the substitution model set to GTR, estimating the nt frequencies empirically from the dataset, and executing the tree search optimization routines.
  • GTR
  • GTR+Gamma with 4 discrete classes, estimate the shape parameter.
  • GTR+I: Adding invariable sites option
  • GTR+Gamma+I: with 4 classes, estimate alpha, and adding invariable sites
  2. Second run: Run PhyML with the substitution model set to HKY85, estimating the ts/tv ratio, estimating the nt frequencies empirically from the dataset, and executing the tree search optimization routines.
  • HKY85
  • HKY85+Gamma: Gamma with 4 classes; Estimating alpha
  • HKY85+I: Adding invariable sites
  • HKY85+Gamma+I: Gamma distribution with 4 classes; Estimating Gamma alpha parameter; Adding invariable sites
  3. Third run: Run PhyML with the substitution model set to JC69 and executing the tree search routines.
  • JC69
  • JC69+Gamma: Gamma with 4 classes; Estimating alpha
  • JC69+I: Adding invariable sites option
  • JC69+Gamma+I: Gamma with 4 classes; Estimating Gamma shape; Adding invariable sites
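Once all twelve runs are done, the log-likelihoods alone are not a fair comparison, because richer models always fit at least as well. One common way to penalise extra parameters is AIC = 2k - 2lnL, sketched below. The parameter counts assume 2n-3 branch lengths for n taxa plus the model's own parameters (JC69: 0, HKY85: 4, GTR: 8, counting three free base frequencies), plus one each for Gamma and invariable sites. Whether empirically estimated frequencies should be counted is a matter of convention. The taxon count and log-likelihoods in the demo are placeholders, not results.

```python
# Free substitution-model parameters (exchange rates + base frequencies); an assumed convention.
MODEL_PARAMETERS = {"JC69": 0, "HKY85": 4, "GTR": 8}

def n_free_parameters(n_taxa, model, gamma=False, invariant=False):
    """Branch lengths of an unrooted binary tree plus model parameters."""
    branch_lengths = 2 * n_taxa - 3
    return branch_lengths + MODEL_PARAMETERS[model] + int(gamma) + int(invariant)

def aic(log_likelihood, k):
    """Akaike information criterion; lower is better."""
    return 2 * k - 2 * log_likelihood

# Placeholder numbers for illustration only; substitute your own PhyML output.
n_taxa = 10
runs = {("JC69", False): -1300.0, ("GTR", True): -1250.0}
for (model, gamma), lnl in runs.items():
    k = n_free_parameters(n_taxa, model, gamma=gamma)
    print(model, "+Gamma" if gamma else "", k, aic(lnl, k))
```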

4.4 Exercise 4 - Branch support

In this exercise, you need to compute the branch support using 2 different approaches:

  • Nonparametric ML bootstrap
  • SH-aLRT

Compare the supports inferred by the nonparametric ML bootstrap to those obtained using the SH-aLRT method. Are the results compatible?
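To compare the two sets of supports outside a tree viewer, you can pull the values out of the Newick files. The sketch below assumes supports are written as labels right after a closing parenthesis, which is how internal-node labels appear in Newick. The tree string is invented. Note that the two methods may be reported on different scales (for example counts or percentages for the bootstrap versus values between 0 and 1 for SH-aLRT), so rescale before comparing.

```python
import re

SUPPORT_PATTERN = re.compile(r"\)(\d+(?:\.\d+)?)")

def branch_supports(newick):
    """Return the numeric labels that directly follow a ')' in a Newick string."""
    return [float(value) for value in SUPPORT_PATTERN.findall(newick)]

# Invented tree, for illustration only.
toy_tree = "((A:0.1,B:0.2)0.95:0.05,(C:0.3,D:0.4)87:0.1,E:0.5);"
print(branch_supports(toy_tree))
```

This simple pattern returns supports in the order they appear in the string, so it only lines up between two files when both trees have the same topology and taxon order.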

4.5 Exercise 5 - Command line

Rewrite the above exercises as phyml command lines.

4.6 Exercise 6 - Display the tree

Display the above trees in a nice and meaningful way using Figtree.

4.7 (Extra) Exercise 7 - Create a pipeline for phylogenetic reinference

  • The multiple sequence alignment (MSA) is inferred with MUSCLE
  • FASTA files are converted into PHYLIP format using fasta2phylip
  • Phylogenetic tree construction is performed using GTR+Gamma
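If you want to see what the conversion step in the middle of this pipeline involves, the sketch below is a minimal stand-in written in plain Python. It is not the fasta2phylip tool itself. It reads aligned FASTA text, keeps the first word of each header as the sequence name, and writes a PHYLIP file with one sequence per line. The function names and the toy input are made up for illustration.

```python
def read_fasta(text):
    """Parse FASTA text into a list of (name, sequence) tuples."""
    records = []
    name, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                records.append((name, "".join(chunks)))
            fields = line[1:].split()
            name, chunks = (fields[0] if fields else "unnamed"), []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        records.append((name, "".join(chunks)))
    return records

def to_phylip(records):
    """Format aligned records as PHYLIP text, one sequence per line."""
    lengths = {len(seq) for _, seq in records}
    if len(lengths) != 1:
        raise ValueError("sequences must be aligned to the same length")
    n_sites = lengths.pop()
    width = max(len(name) for name, _ in records) + 2
    lines = [" %d %d" % (len(records), n_sites)]
    lines += [name.ljust(width) + seq for name, seq in records]
    return "\n".join(lines) + "\n"

# Toy aligned FASTA input (illustrative only).
toy_fasta = ">seq1 first\nACGT\nAC\n>seq2\nACGTTC\n"
print(to_phylip(read_fasta(toy_fasta)), end="")
```

In a full pipeline this step sits between the MUSCLE alignment and the PhyML run, and each step can be chained from a script once you have checked the command-line options of the versions installed on your machine.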

A bioinformatics wiki for the course BI462.
