This project is an abridged version of a planned, more involved project that aims to resolve uncertain details of the wolf-like canid phylogeny through combined-evidence analysis of DNA, fossil, and morphological data in a Bayesian framework. The description below covers the simpler version, carried out as the final project for EEOB563 - Molecular Phylogenetics, a graduate class at Iowa State University. Further details of data filtering, model building, model selection, and other steps will be considered in the future. See misc/final.pdf for a detailed explanation of the question, methods, and results of the analysis undertaken for EEOB563. A brief description of each part of the repository follows.
The main directory contains the .gitignore and this README.md file, along with all the .sh files, which are batch scripts written to run the analysis scripts on the HPC-class cluster. These cover model averaging for the nuclear and morphological data, the nuclear-only and morphological-only analyses, and the combined-evidence analysis.
misc/ contains the proposal, draft, and final version of the project report required in EEOB563.
data/ contains all the data used throughout the conception and execution of the project. taxa.tsv contains a list of fossil occurrences with their ages, while taxa_clean.tsv contains a single entry per species with its minimum and maximum age. nuclear.nex contains 583k bases of SNP data for 10 species (9 Canina and 1 fox), while nuclear_full.nex contains 621k bases of SNP data for 16 specimens spanning 12 species (10 Canina and 2 foxes). morpho.nex is the full morphological matrix including fossil occurrences (each occurrence receives the same morphological scoring as its species in the original matrix), and morpho_clean.nex is the morphological matrix used for analysis with the taxa in taxa_clean.tsv.
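The relationship between taxa.tsv and taxa_clean.tsv can be sketched roughly as follows. This is only an illustration, not the procedure actually used to build the file: the column names `taxon`, `min`, and `max`, the helper names, and the species in the example are all assumptions made up for the sketch.

```python
import csv
import io


def read_tsv(text):
    """Parse tab-separated text with a header row into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


def collapse_occurrences(rows):
    """Collapse occurrence rows into one (min_age, max_age) range per species.

    Assumes hypothetical columns 'taxon', 'min', and 'max'.
    """
    ranges = {}
    for row in rows:
        taxon = row["taxon"]
        lo, hi = float(row["min"]), float(row["max"])
        if taxon in ranges:
            old_lo, old_hi = ranges[taxon]
            # Widen the stored range to cover this occurrence as well.
            ranges[taxon] = (min(old_lo, lo), max(old_hi, hi))
        else:
            ranges[taxon] = (lo, hi)
    return ranges


# Made-up occurrences, for illustration only.
example_tsv = (
    "taxon\tmin\tmax\n"
    "Species_a\t1.8\t2.6\n"
    "Species_a\t0.3\t0.8\n"
    "Species_b\t4.9\t5.3\n"
)
clean = collapse_occurrences(read_tsv(example_tsv))
```

Here `clean` maps each species to the youngest minimum and oldest maximum age across its occurrences, which is the shape of table the description of taxa_clean.tsv suggests.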
scripts/ contains all the scripts used in the analysis, including those that did not make it into the final report. RB_bug_avg_nuclear_MCMC.Rev is a script that leads to a NaN likelihood when using a p_inv parameter for dnPhyloCTMC; it is kept for easy access by the RevBayes (RB) development team. avg_morpho_setup.Rev sets up some parameters for the morphological model averaging analysis, while avg_morpho_MCMC.Rev runs the analysis for a given value of k, the number of states in a set of characters. avg_nuclear_MCMC.Rev runs the molecular model averaging analysis. morpho_MCMC.Rev runs the morphological-only analysis for binary characters, and nuclear_MCMC.Rev runs the nuclear-only analysis (currently using the full data set). combined_evidence_fbd.Rev runs the combined-evidence analysis for the full data set with all fossil occurrences, while combined_evidence_fbd_clean.Rev runs it only for the min-max ages data set (and includes many other updates accumulated over the course of the project).
output/ contains the output from the RevBayes analyses and post-hoc tree summaries, and is a bit of a mess. All prev_ directories are simply backups from previous analyses, except for prev_output_nuclear, which contains the output from the last nuclear analysis with just 10 species. output_combined, output_combined1, etc. contain output from 4 separate samples of 300k generations each, which I ran to get a larger sample size without exceeding the maximum run time HPC-class allows. output_combined_asc is an intermediate output and not relevant to the EEOB563 project. output_morpho, output_morpho_avg, and the corresponding nuclear directories all contain the output of the latest version of their corresponding .Rev scripts.
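Pooling several independent samples into one larger sample, as with the output_combined directories, might look something like the sketch below. It assumes tab-separated trace logs with a header row and an arbitrary burn-in fraction of 25%; the function names, the column names in the example, and the burn-in choice are assumptions for illustration and not necessarily what was done for the report.

```python
import csv
import io


def read_trace(text):
    """Parse a tab-separated trace log into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


def pool_traces(traces, burnin_fraction=0.25):
    """Drop the first burnin_fraction of each trace and concatenate the rest.

    Each pooled row is tagged with the index of the run it came from.
    """
    pooled = []
    for run_index, rows in enumerate(traces):
        n_burnin = int(len(rows) * burnin_fraction)
        for row in rows[n_burnin:]:
            pooled.append({"run": run_index, **row})
    return pooled


# Two tiny made-up traces standing in for the real log files.
run_a = "Iteration\tPosterior\n0\t-10.0\n10\t-9.0\n20\t-8.5\n30\t-8.4\n"
run_b = "Iteration\tPosterior\n0\t-11.0\n10\t-9.2\n20\t-8.6\n30\t-8.3\n"
pooled = pool_traces([read_trace(run_a), read_trace(run_b)], burnin_fraction=0.25)
```

Keeping the run index on each row makes it possible to check afterwards that the separate samples converged on similar values before treating them as one.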
I would like to acknowledge Rachel Rompala and Mihir Kharate, who gave me valuable feedback during the peer-review part of this project.