diff --git a/docs/introduction.md b/docs/introduction.md index c9e38f5c..b3ae57a1 100644 --- a/docs/introduction.md +++ b/docs/introduction.md @@ -18,7 +18,7 @@ kernelspec: # Introduction -`Tsdate` {cite}`wohns2022unified` infers dates for nodes in a genetic genealogy, +`tsdate` {cite}`wohns2022unified` infers dates for nodes in a genetic genealogy, sometimes loosely known as an ancestral recombination graph or ARG {cite}`wong2023general`. More precisely, it takes a genealogy in [tree sequence](https://tskit.dev/tutorials/what_is.html) format as an input @@ -43,7 +43,7 @@ for each of the nodes to date. By default, `tsdate` calculates priors from the [conditional coalescent](http://dx.doi.org/10.1006/tpbi.1998.1411), although alternative prior distributions can also be specified. -`Tsdate` provides several methods for assigning probabilities to different times, +`tsdate` provides several methods for assigning probabilities to different times, and updating information through the genealogy. These include discrete-time and continuous-time methods, see {ref}`sec_methods` for more details. diff --git a/docs/methods.md b/docs/methods.md index e684b761..e225849a 100644 --- a/docs/methods.md +++ b/docs/methods.md @@ -19,11 +19,15 @@ kernelspec: # Methods The methods available for `tsdate` inference can be divided into _discrete-time_ -and _continuous-time_ approaches. Discrete-time approaches define time grid based on -discrete timepoints, and assign a probability to each node being at -each timepoint. Continuous-time -approaches approximate the probability distribution for each node time using a continuous -mathematical function (e.g. a gamma distribution). +and _continuous-time_ approaches. +Both approaches iteratively propagate information between nodes to +construct an approximation of the marginal posterior distribution for the age of each node, +given the mutational information in the tree sequence. +Discrete-time approaches approximate the posterior across a grid of +discrete timepoints (e.g. assign a probability to each node being at +each timepoint). +Continuous-time approaches approximate the posterior by a continuous +univariate distribution (e.g. a gamma distribution). In tests, we find that the continuous-time `variational_gamma` approach is the most accurate (but can suffer from {ref}`numerical instability`). @@ -61,8 +65,8 @@ Cons based on the conditional coalescent) : Inferred times are imprecise due to discretization: a denser timegrid can increase precision, but also increases computational cost (quadratic with number of timepoints) -: In particular, the oldest nodes can suffer from poor dating, as time into the past - is an unbounded value, but a single oldest timepoint must be chosen. +: In particular, the oldest/youngest nodes can suffer from poor dating, as time into the past + is an unbounded value, but a single oldest/youngest timepoint must be chosen. ### Inside Outside vs Maximization @@ -96,8 +100,8 @@ Pros Cons : Assumes posterior times can be reasonably modelled by gamma distributions (e.g. they are not bimodal) -: The "Expectation propagation" approach requires multiple rounds of iteration - until convergence. +: The "expectation propagation" algorithm used to fit the posterior requires + multiple rounds of iteration until convergence. : Numerical stability issues are more common (but often indicate pathological of otherwise problematic tree sequences) @@ -120,14 +124,14 @@ See {ref}`sec_priors_conditional_coalescent` for details. #### Expectation propagation We are in the process of writing a formal description of the algorithm, but in -summary, the expectation propagation ("message passing") -approach updates the gamma distribution for each node based on the times of connected +summary, this approach uses an expectation propagation ("message passing") +approach to update the gamma distribution for each node based on the ages of connected nodes and the mutations on connected edges. Updates take the form of moment matching against an iteratively refined approximation to the posterior, which makes this method very fast. The iterative expectation propagation should converge to a fixed -point that approximately minimizes KL divergence between the true posterior +point that approximately minimizes Kullback-Leibler divergence between the true posterior distribution and the approximation {cite}`minka2001expectation`. At the moment, when `method="variational_gamma"`, a relatively large number of iterations is used (which testing indicates is @@ -143,4 +147,4 @@ ts = tsdate.date( progress=True, population_size=100, mutation_rate=1e-8) -``` \ No newline at end of file +``` diff --git a/docs/priors.md b/docs/priors.md index 45ca6dc1..9c20d11b 100644 --- a/docs/priors.md +++ b/docs/priors.md @@ -20,7 +20,8 @@ kernelspec: ## Basic usage -The {func}`build_prior_grid` function allows you to create a bespoke prior. +The {func}`build_prior_grid` and `build_parameter_grid` functions allow you to create a bespoke prior +for the {ref}`sec_methods_discrete_time` and {ref}`sec_methods_continuous_time`, respectively. This can be passed in to {func}`date` using the `priors` argument. It provides a tuneable alternative to passing the {ref}`population size` directly to the {func}`date` function. @@ -50,10 +51,12 @@ See below for more explanation of the interpretation of the parameters passed to ## Prior shape -For {ref}`sec_methods_discrete_time` methods, it is possible to switch from the (default) -lognormal approximation to a gamma distribution, used when building a mixture prior -for nodes that have variable numbers of children in different genomic regions. Tests have shown -that the lognormal is usually a better fit to the true prior in most cases. +For {ref}`sec_methods_discrete_time` methods, it is possible to switch from the +(default) lognormal approximation to a gamma distribution, used when building a +mixture prior for nodes that have variable numbers of children in different +genomic regions. The discretized prior is then constructed by evaluating the +lognormal (or gamma) distribution across a grid of fixed times. Tests have shown that the +lognormal is usually a better fit to the true prior in most cases. For {ref}`sec_methods_continuous_time` methods, only the gamma distribution is available. @@ -62,9 +65,9 @@ For {ref}`sec_methods_continuous_time` methods, only the gamma distribution is a ## Setting the timegrid For {ref}`sec_methods_discrete_time` methods, a grid of timepoints is created. An explicit -timgrid can be passed via the `timepoints` parameter. Alternatively, if a single integer is -passed, a nonuniform time grid will be chosen based on the expected quantiles of the -coalescent approximation. +timegrid can be passed via the `timepoints` parameter. Alternatively, if a single integer is +passed, a nonuniform time grid will be chosen based on quantiles of the +lognormal (or gamma) approximation of the mixture prior. Note that if an integer is given this is *not* the number of timepoints, but the number of quantiles used as a basis for generating timepoints. The actual number of timepoints will @@ -82,6 +85,14 @@ print( print(prior.timepoints) ``` +For {ref}`sec_methods_continuous_time` methods, a grid of variational parameters is +created (e.g. shape and rate parameters of gamma distributions for each node), which +may be modified manually. +Currently, node-specific priors are combined to generate a global i.i.d. prior +(although this behaviour will be changed in future releases to provide more +flexibility.) + + (sec_priors_conditional_coalescent)= ## The conditional coalescent @@ -104,4 +115,4 @@ indicates that using a single prior for all nodes may provide better accuracy; t may also be a result of too strongly constraining each node prior to a fixed topology before dating. Currently the `variational_gamma` method using a single ("global") prior for all nodes. The best prior to use for different methods is a current topic of -investigation, and may be subject to change in future versions of `tsdate`. \ No newline at end of file +investigation, and may be subject to change in future versions of `tsdate`. diff --git a/docs/usage.md b/docs/usage.md index c339c121..6e7db23f 100644 --- a/docs/usage.md +++ b/docs/usage.md @@ -179,6 +179,10 @@ that sum to one: each cell gives the probability that the time of the node whose ID is given by the column header lies within the timeslice specified by the `start_time` and `end_time` columns. +For the continuous-time `variational_gamma` method, the posterior for +each node is represented by the shape and rate parameter of the gamma approximation, +and the `start_time` and `end_time` entries may be ignored. + (sec_usage_popsize)= ## Population sizes @@ -229,7 +233,7 @@ instability and other problems. Here we detail some common issues found in real ### Memory and run time -`Tsdate` can be run on most modern computers: large tree sequences of millions or +`tsdate` can be run on most modern computers: large tree sequences of millions or tens of millions of edges will take of the order of hours, and use tens of GB of RAM (e.g. 24 GB / 1 hour on a 2022-era laptop for a tree sequence of 5 million edges covering @@ -309,4 +313,4 @@ that they cover, which causes numerical instability and inaccuracy. However, we expect *accurately* estimated unary regions to improve dating. It is therefore possible to date a tree sequence containing locally unary nodes -using the `allow_unary` option when {ref}`building a prior`. \ No newline at end of file +using the `allow_unary` option when {ref}`building a prior`. diff --git a/docs/variable_popsize.md b/docs/variable_popsize.md index 27c113f2..ab0cd7ee 100644 --- a/docs/variable_popsize.md +++ b/docs/variable_popsize.md @@ -94,11 +94,11 @@ plot_real_vs_tsdate_times(ts.nodes_time, redated_ts.nodes_time, ts, redated_ts, ## Estimating population size -If you don't know the population size, it is possible to use `tsdate`` to *estimate* changes in -population size over time, by extimating the number of coalescence points in different time -intervals, and re-estimating the dates. However, this -approach has not been fully tested or documented. +If you don't know the population size, it is possible to use `tsdate` to +*estimate* changes in population size over time, by first estimating the rate +of coalescence in different time intervals, and then re-estimating the dates. +However, this approach has not been fully tested or documented. If you are interested in doing this, see [GitHub issue #237](https://github.com/tskit-dev/tsdate/issues/237#issuecomment-1785655708) -for an example. \ No newline at end of file +for an example. diff --git a/tsdate/__init__.py b/tsdate/__init__.py index e92e6a1a..fb33b2a9 100644 --- a/tsdate/__init__.py +++ b/tsdate/__init__.py @@ -23,6 +23,7 @@ from .core import date # NOQA: F401 from .core import get_dates # NOQA: F401 from .prior import build_grid as build_prior_grid # NOQA: F401 +from .prior import parameter_grid as build_parameter_grid # NOQA: F401 from .provenance import __version__ # NOQA: F401 from .util import add_sampledata_times # NOQA: F401 from .util import preprocess_ts # NOQA: F401