-
Notifications
You must be signed in to change notification settings - Fork 10
Lab 6: Phylogenetic Reinference via Maximum Likelihood
After this lab, you will
- Grasp the underlying theory of maximum likelihood approach for phylogenetic inference
- Know how to choose the evolutionary models
- Understand the input and output of phylogenetic analyses
- Use
phyml
for conducting phylogenetic reinference - Draw trees using
figtree
.
[+] .................................... Next sub-menu
[-] ................................ Previous sub-menu
[Y] .............................. Launch the analysis
[D] .......................... Data type (DNA/AA/Generic) DNA
[I] ......... Input sequences interleaved (or sequential) interleaved
[M] .......................... Analyze multiple data sets no
[R] .............................................. Run ID none
[+] .................................... Next sub-menu
[-] ................................ Previous sub-menu
[Y] .............................. Launch the analysis
[M] .................... Model of nucleotide substitution HKY85
[F] .................... Optimise equilibrium frequencies yes
[T] ....................... Ts/tv ratio (fixed/estimated) estimated
[V] .... Proportion of invariable sites (fixed/estimated) fixed (p-invar=0.00)
[R] .......... One category of substitution rate (yes/no) no
[C] .............. Number of substitution rate categories 4
[G] ...... Gamma distribution parameter (fixed/estimated) estimated
[O] .............................. Optimise tree topology yes
[U] ........... Starting tree (BioNJ/parsimony/user tree) BioNJ
[S] ..................... Tree toloLOGy search operations NNI moves (fast, approximate)
[B] ................... Non parametric bootstrap analysis no
[A] ................... Approximate likelihood ratio test yes / SH-like supports
- JC69
- K80
- F81
- HKY85
- F84
- TN93
- GTR
- Custom (*):
000000
,012345
- Estimated empirically from the used datasets
- Optimized with a maximum likelihood (ML) approach
- Fixed
- Estimated through maximum likelihood
- Estimated
- Fixed (.01)
- NNI moves (quick, approximate): nearest neighbor interchange
- SPR moves (slow, accurate): subtree pruning and regrafting
- Best of NNI and SPR
- no
- Number of replicates
- aLRT statistics
- Chi2-based supports
- aBayes supports
You can type in
phyml --help
to see the following phyml
manual:
-
-i|--input seq_file_name
:nt
oraa
file inPHYLIP
format; (输入MSA文件) -
-d|--datatype data_type
:nt
- nucleotide (default);aa
- amino acids;generic
(数据类型) -
-q|--sequential
: changesinterleaved
format (default) tosequential
format (MSA的存在方式) -
-n|--multiple nb_data_sets
: number of datasets to analyze -
-p|--pars
: useminimum parsimony starting tree
(用最大简约树作为初始树) -
-b|--bootstrap int
: number of bootstrap replicates (重采样次数)-
>0
: number of bootstrap -
0
: neither aLRT nor boostrap values are computed -
-1
: aLRT statistics for aLRT test -
-2
: Chi2-based parametric branch supports for aLRT test -
-4
: (default) SH-like branch supports alone -
-5
: approximate Bayes branch supports
-
-
-m|--model model
: substitution model name- nt-based model:
HKY85
(default),JC69
,K80
,F81
,F84
,TN93
,GTR
,custom (*)
- aa-based model:
LG
(default),WAG
,JTT
,MtREV
,Dayhoff
,DCMut
,RtREV
,CpREV
,VT
,Blosum62
,MtMam
,MtArt
,HIVw
,HIVb
,custom
- nt-based model:
-
--aa_rate_file filename
: aa subtitution rate matrix in PAML format. -
(-f e|m) | (fA,fC,fG,fT)
: __e__mpirical base/aa frequencies, estimated by __m__l(稳态核苷酸比例) -
-t|--ts/tv ratio
: transition/transversion ratio,m
to optimize,float
to fixed.(转换/颠换速率比) -
-v|--pinv prop_inv
: proportion of invariable sites (保守位点的比例) -
-c|--nclasses nb_subst_cat
: number of relative substitution rate categories, default 4. (相对替换速率的分类) -
-a|--alpha alpha
: gamma shape parameter,e
to mle, or fixed positive value. (伽马分布的参数alpha) -
-s|--search move
: NNI|SPR|BEST (搜索树拓扑结构的方法) -
-u|--inputtree user_tree_file
: starting tree file (初始树文件) in Newick format -
-o params
: optimizes specific parameters:-
tlr
: tree topology (t, 拓扑结构) + branch length (l) + rate parameter (r) -
tl
: tree topology + branch length -
lr
: branch length (枝长) + rate parameters (速率参数) -
l
: branch length -
r
: rate parameters -
n
: no parameter
-
-
--rand_start
: sets the initial tree to random, only for SPR search (初始树随机产生) -
n_rand_starts num
: number of initial trees to be used. (随机初始树的个数) -
--r_seed num
: random number generator (随机数发生器的seed) -
--print_site_lnl
: print the likelihood for each site in file*_phyml_lk.txt
(保存每个位点的似然值) -
--print_trace
: print each phylogeny during tree search process in file*_phyml_trace.txt
- Dataset: primates-nt.phy
In this exercise we need to run PhyML twice in order to compare the effect of
- estimating nt frequencies from the used dataset versus
- optimizing the frequencies via maximum likelihood
-
First run: Set the model to
HKY85+Gamma
, estimating the transition/transversion (ts/tv
ratio) and thealpha
parameter of the Gamma distribution by maximum likelihood,nucleotide frequencies
are estimated by ML. -
Second run: Set the model to
HKY85+Gamma
, estimating thets/tv
ratio andalpha
parameter by ML, however nucleotide frequencies are estimated empirically from the data. - Answer the following questions:
- Do you see much difference in the resulting tree?
- Do you observe much difference in the likelihood value (in stat file)?
- Which option is better and why do you think so?
In this exercise you will learn to optimize the tree topology on the substitution paramters obtained using ML performing a tree search (i.e., NNI, SPR, TBR) on the initial tree topology:
- Compare the ML and the BioNJ trees and the model estimates (HKY85+Gamma) obtained for the two trees.
- Compare the likelihood of the ML and BioNJ trees.
- What do you observe and why?
In this exercise, you are asked to infer the phylogenetic tree on the same dataset using
different subtitution models (and their variations). Use now GTR+Gamma
, JC69+Gamma
,
GTR
, HKY85
and JC69
.
-
First run: Run PhyMl with the substitution model set to
GTR
, estimating the nt frequencies empirically from the dataset, and executing the tree search optimization routines.
GTR
-
GTR+Gamma
with 4 discrete classes, estimate the shape parameter. -
GTR+I
: Adding invariable sites option -
GTR+Gamma+I
: with 4 classes, estimate alpha, and adding invariable sites
-
Second run: Run PhyML with the substitution model set to
HKY85
, estimating thets/tv
ratio, estimating the nt frequencies empirically from the dataset, and executing the tree search optimization routines.
HKY85
-
HKY85+Gamma
: Gamma with 4 classes; Estimating alpha -
HKY85+I
: Adding invariable sites -
HKY85+Gamma+I
: Gamma distribution with 4 classes; Estimating Gamma alpha parameter; Adding invariable sites
-
Third run: Run PhyML with the substitution model set to
JC69
and executing the tree search routines.
JC69
-
JC69+Gamma
: Gamma with 4 classes; Estimating alpha -
JC69+I
: Adding invariable sites option -
JC69+Gamma+I
: Gamma sith 4 classes; Estimating Gamma shape; Adding invariable site
In this exercise, you need to compute the branch support using 2 different approaches:
- Nonparametric ML,
- SH-aLRT
Compare supports inferred by nonparametric ML bootstrap to those obtain using SH-aLRT method. Are the results compatible?
Rewrite the command lines for the above exercises.
Display the above trees in a nice and meaningful way using Figtree.
- MSA alignment is inferred with
MUSCLE
- FASTA files are converted into phylip format using
fasta2phylip
- Phylogenetic tree consutruction is performed using
GTR+Gamma
On the way to the garden of bioinformatics.
A bioinformatics wiki for the course BI462.