MEOW is an extension of MAMMaL (Susko, 2022). In order to run the program, it must be installed in the same directory as MAMMaL. MAMMaL is a program designed to read in sequence files and output a list of estimated frequency classes, each containing relative frequency values of amino acid types, as described in Susko, Lincker and Roger (2018). The purpose of MEOW is to expand upon the core functionality of MAMMaL by providing a wider range of options on how to process the inputted sequence file. This document will primarily be focused on explaining the newer features added that are not present in MAMMaL; for a more general overview of the software, it is recommended to first read the documentation in Susko (2022).
A command line call to the program ‘meow’ will run the routine and execute with the provided flags. The base command looks like:
$ meow -s seqfile -c number_of_classes [OPTIONS]
Note that a tree file is not required in order to run the program. Required files needed for execution of the R-Script are listed below.
-s seqfile : The input sequence file, adhering toPHYLIP conventions. Can be sequential or interleaved.
-c number_of_classes : How many frequency classesto use for both the upper and lower partitions. NOTE: If using -cl and -ch, do not set this flag.
-cl num_classes_low : Number of frequency classesfor the low partition. Use 0 for no low partition.
-ch num_classes_low : Number of frequency classesfor the high partition. Use 0 for no high partition.
-t treefile : The name of the tree file to be usedfor calculating rates data. Not needed if rates are not being used. If rates are used and this is not specified, a default tree will be generated.
-p partition_type : How to split the data betweenlow and high partitions.DEFAULT: Entropy Accepted values: “E” → Entropy; “R” → Rates, “K” → Keff (Effective number of Amino Acids)
-f cluster_type : Clustering method to use when calculatingthe starting frequencies. DEFAULT: H-clust* Accepted values: “H” → Hierarchical clustering (H-clust); “C” → C-series frequencies of Le et. al. (2008)
* This defaults to C-series instead if both num_classes_low and num_classes_high are in {0, 10, 20, ... , 60}.
-o output_file : Name of the output Nexus file. DEFAULT: “esmodel”
-q quantile : The quantile at which to split the data,determined from calculated values using the partition type. DEFAULT: 0.75.
-C penalty : The penalty parameter, η in Susko, Linckerand Roger (2018). DEFAULT: η = 1.0e-10.
-log log_type : Method to use for program output logs. DEFAULT: Both (file and nexus) Accepted values: “N” → No logs; “F” → Text file, “X” → Nexus file, “B” → Both file and nexus
-ri : Drop invariant sites from the sequence data beforeprocessing. DEFAULT: Don’t drop invariant sites.
-I : Add a frequency class for invariant sites in the output file. DEFAULT: Don’t add the invariant class.
Additional flags can be found in the meow_flags.txt
file.
The program will output a nexus file called esmodel.nex
by default, though another name can be specified
through use of the -o
flag. This file can be used as input for IQ-TREE (Nguyen et al. 2015) by using the options
-mdef [nexus file]
, -m LG+ESMmodel+G
. If using a +F class, use -m LG+ESMmodel+F+G
.
Additionally, you can opt to keep the estimated frequency files (estimated-frequencies1 and
estimated-frequencies2) through use of the -ff
flag. These files contain amino acid frequencies for each class in
both the high (1) and low (2) partitions.
You can also keep the partitioned sequence files through use of the -pf
flag. These are generated by the program
according to the specified options and are in interleaved format.
All other program output is deleted upon completion of the routine unless the -tmp
flag is set.
Here is a list of these temporary files:
tmp.out, tmp.Sigma, tmp.frs1, tmp.frs2, tmp.err, tmp.lwt1, tmp.lwt2,
rate_est.dat, estimated-weights
Low/high partitions
MAMMaL only processes the sites in the provided sequence file that lies above the specified q-value. It is possible
to process the entire sequence file by specifying q=0, however, this does not retain the division made by the
rates-filtering algorithm. MEOW aims to fix this by processing both divisions - low and high - and outputting them in
separate files. If only the low or high partition is desired, one may set -ch
or -cl
to 0, meaning no classes will be
attributed to the high (h) or low (l) partitions.
Entropy / Keff partition methods
In MAMMaL it is only possible to filter sites in the sequence file by rates, which are calculated by dgpe using the provided tree file. This is still supported in MEOW, but it also supports two alternate filtering methods: entropy, and effective number of amino acids (abbreviated Keff).
Entropy is a single value calculated from a vector containing relative amino acid frequencies of a given site. It is
defined (for a 20-dimensional vector v) as: -sum(v * log(v))
. This calculation does not require a tree file to be
provided, and does not make any calls to dgpe. Values are placed in the upper and lower partition based on whether
the sitewise entropy is greater than the q-th percentile.
Keff is closely related to entropy and is defined as 1/sum(v^2)
, for the same vector v. It measures the effective
number of amino acids present in a given site. E.g.: a vector [1.0, 0.0, 0.0, .... , 0.0] has a Keff of 1, and
[1/20,1/20, 1/20, ... ] has a Keff of 20.
The default partition method in MEOW is entropy. To use either rates or Keff, the -p
flag can be used with “E”
representing entropy, “R” for rates, and “K” for Keff.
Sequence file sorting
It has been observed that permutations of sequence files can cause outputs generated by entropy or Keff filtering to greatly differ due to ties in the calculations. To mitigate this, MEOW attempts to sort all the sites in the sequence file by a consistent metric, which should remove the effects of permutations. The metric is a combination of site entropy and amino acid frequency, which produces a number that is relatively unique. From (limited) testing, it has been observed that this causes very few ties which results in a sequence file sorted in almost the same way every time. (As of ver0.5.1, this must be enabled via the -sort flag.)
Cluster algorithm
MEOW uses hierarchical clustering by default, unless rates are being used (-p R) and both num_classes_high
and
num_classes_low
are in {0, 10, 20, ... 60} - in which case the C-series algorithm will be used. This is so that the
default behavior for MAMMaL can be preserved when using rates. The -f
flag can be used to force a given
clustering algorithm under any circumstance. In MAMMaL, the -h
flag is used to force hierarchical clustering.
This flag does not exist in MEOW.
Output logs
As an additional feature, MEOW will output program logs to a desired location. These logs include information on the
given parameters as well as any modifications made to the sequence file by the program. By default, the logs are
appended to the top of the generated .nex file, as well as on top of a text file called meow_output.log
. This file is
automatically generated by the program if not already present. If both of these methods are not desired, it is
possible to specify using either/or, or neither of them with the -log
flag.
File suffix
If the -suf
flag is set, the program will generate random numerical suffixes at the ends of each temporary file. This
allows running multiple instances of the program at the same time, as they would not overwrite each other’s files.
Note: Some temporary files generated by dgpe do not have suffixes added on to them, so it is not recommended to
use this option for rate calculations.
Empty / Invariant sites
After sorting the sequence file, MEOW will automatically remove any empty sites from the data before processing it.
An empty site is one filled entirely with blank spots ('-'). Additionally, one can opt to remove invariant sites from
the data as well, using the -I
flag. An invariant site in MEOW is defined as a site whose entropy is less than 7e-10. By
default, these will not be removed, but the program will still output how many invariant sites it detects in the logs.
Tree Generation
MAMMaL requires a treefile as input when estimating rates data. In MEOW, a treefile can still be specified using the
-t
flag, but if a tree is not supplied then the program will generate its own tree to use on the data. The tree is
constructed using a neighbor-joining LG algorithm. (Le, S. Q., and Gascuel, O. , 2008)
Tree Verification
When using rates, all trees (either provided or generated) are checked for positive edge lengths. Any negative edge lengths on the tree will be zeroed. If such a tree is detected, a warning message will be displayed in the logs.
H-clust tweaks
The program generates a matrix of H-clust centres (frs) in much the same way as MAMMaL. However, in order to obtain more accurate cluster centres, MEOW will take any zeroes of the frs matrix and add a small value (1e-10) to them, and then normalize the data again. This generally results in smaller error values. (As of ver0.5.1, this must be enabled via the -e flag.)
MEOW is not currently supported on Windows machines due to incompatibility. It was developed and tested on a remote Linux shell, so that is the recommended run environment. It has also been tested to work on MacOS.
Note: Currently, MEOW requires the R package phangorn v2.11.1. This version is no longer in the CRAN but can be found here: link. An update will be eventually coming supporting the most recent version of phangorn.
It is recommended to first install MAMMaL following the instructions in Susko (2022) (see this link), and the R package phangorn v2.11.1. MEOW comes with three files:
meow, meow_functions, and mammal_functions. These must all be placed in the same directory, and should also be in
the same folder as MAMMaL, dgpe, and the rest of the binaries. If the binaries are not in the same folder as the
program, and not in your PATH, replace the line bindir <- ""
at the top of the program to
bindir <- "dir_with_files/"
, where dir_with_files
is the name of the directory containing the binaries.
Additionally, if the mammal_functions
and meow_functions
files are not in the same folder as the execution directory,
you can change the linefunctions.dir <- "" at thetop of the program in the same way. This also applies
to the c-series files (C10.aafreq.dat, ... ,C60.aafreq.dat),which have their own variable
cseries.dir- also at the top of the script.
Before running MEOW, Make sure that the file permissions allow execution from the terminal. Using changemod:
$ chmod a+x meow
Also ensure that the ape
, phangorn
, and quadprog
libraries are included in the installation of R you are using.
In the terminal, change directories to the location of the meow
source file. Any input sequence files or tree files
should also be in this folder, or a nearby directory. Then, you can execute the MEOW routine by typing
$ meow -s seqfile -c num_classes [options]
Alternatively, the meow
file can be copied to a PATH location, where it can be invoked from any location. This way,
it is not required to store both the meow source code and the sequence file data in the same folder.
Susko, E. (2022). MAMMaL: (M)ultinomial (A)pproximate (M)ixture (Ma)ximum (L)ikelihood Accelerated Estimation of Frequency Classes in Site-heterogeneous Profile Mixture Models Version 1.1.3: https://www.mathstat.dal.ca/~tsusko/doc/mammal.pdf
Susko, E., Lincker, L. and Roger, A.J. (2018). Accelerated Estimation of Frequency Classes in Siteheterogeneous Profile Mixture Models. Mol Biol. Evol. 35:1266–1283.
Nguyen L.T., Schmidt H.A., von Haeseler A., Minh B.Q. (2015). IQ-TREE: A fast and effective stochastic algorithm for estimating maximum likelihood phylogenies.. Mol. Biol. Evol., 32:268-274.
Le S.Q., Gascuel O., Lartillot N. (2008). Empirical profile mixture models for phylogenetic reconstruction. Bioinformatics. 24:23172323.
Le, S. Q., and Gascuel, O. (2008). LG: An improved general amino acid replacement matrix. MolecularBiology and Evolution , 25 (7), 1307–1320. https://doi.org/10.1093/molbev/msn