Skip to content

Commit

Permalink
Merge pull request #339 from nspope/doc-proofread
Browse files Browse the repository at this point in the history
Docs proofread
  • Loading branch information
hyanwong authored Nov 17, 2023
2 parents 8de8a6a + 9c5d393 commit e24af53
Show file tree
Hide file tree
Showing 6 changed files with 51 additions and 31 deletions.
4 changes: 2 additions & 2 deletions docs/introduction.md
Original file line number Diff line number Diff line change
Expand Up @@ -18,7 +18,7 @@ kernelspec:

# Introduction

`Tsdate` {cite}`wohns2022unified` infers dates for nodes in a genetic genealogy,
`tsdate` {cite}`wohns2022unified` infers dates for nodes in a genetic genealogy,
sometimes loosely known as an ancestral recombination graph or ARG
{cite}`wong2023general`. More precisely, it takes a genealogy in
[tree sequence](https://tskit.dev/tutorials/what_is.html) format as an input
Expand All @@ -43,7 +43,7 @@ for each of the nodes to date. By default, `tsdate` calculates priors from the
[conditional coalescent](http://dx.doi.org/10.1006/tpbi.1998.1411), although
alternative prior distributions can also be specified.

`Tsdate` provides several methods for assigning probabilities to different times,
`tsdate` provides several methods for assigning probabilities to different times,
and updating information through the genealogy. These include discrete-time and
continuous-time methods, see {ref}`sec_methods` for more details.

Expand Down
30 changes: 17 additions & 13 deletions docs/methods.md
Original file line number Diff line number Diff line change
Expand Up @@ -19,11 +19,15 @@ kernelspec:
# Methods

The methods available for `tsdate` inference can be divided into _discrete-time_
and _continuous-time_ approaches. Discrete-time approaches define time grid based on
discrete timepoints, and assign a probability to each node being at
each timepoint. Continuous-time
approaches approximate the probability distribution for each node time using a continuous
mathematical function (e.g. a gamma distribution).
and _continuous-time_ approaches.
Both approaches iteratively propagate information between nodes to
construct an approximation of the marginal posterior distribution for the age of each node,
given the mutational information in the tree sequence.
Discrete-time approaches approximate the posterior across a grid of
discrete timepoints (e.g. assign a probability to each node being at
each timepoint).
Continuous-time approaches approximate the posterior by a continuous
univariate distribution (e.g. a gamma distribution).

In tests, we find that the continuous-time `variational_gamma` approach is
the most accurate (but can suffer from {ref}`numerical instability<sec_usage_real_data_stability>`).
Expand Down Expand Up @@ -61,8 +65,8 @@ Cons
based on the conditional coalescent)
: Inferred times are imprecise due to discretization: a denser timegrid can increase
precision, but also increases computational cost (quadratic with number of timepoints)
: In particular, the oldest nodes can suffer from poor dating, as time into the past
is an unbounded value, but a single oldest timepoint must be chosen.
: In particular, the oldest/youngest nodes can suffer from poor dating, as time into the past
is an unbounded value, but a single oldest/youngest timepoint must be chosen.

### Inside Outside vs Maximization

Expand Down Expand Up @@ -96,8 +100,8 @@ Pros
Cons
: Assumes posterior times can be reasonably modelled by gamma distributions
(e.g. they are not bimodal)
: The "Expectation propagation" approach requires multiple rounds of iteration
until convergence.
: The "expectation propagation" algorithm used to fit the posterior requires
multiple rounds of iteration until convergence.
: Numerical stability issues are more common (but often indicate pathological
of otherwise problematic tree sequences)

Expand All @@ -120,14 +124,14 @@ See {ref}`sec_priors_conditional_coalescent` for details.
#### Expectation propagation

We are in the process of writing a formal description of the algorithm, but in
summary, the expectation propagation ("message passing")
approach updates the gamma distribution for each node based on the times of connected
summary, this approach uses an expectation propagation ("message passing")
approach to update the gamma distribution for each node based on the ages of connected
nodes and the mutations on connected edges. Updates take the form of moment matching
against an iteratively refined approximation to the posterior, which makes this method
very fast.

The iterative expectation propagation should converge to a fixed
point that approximately minimizes KL divergence between the true posterior
point that approximately minimizes Kullback-Leibler divergence between the true posterior
distribution and the approximation {cite}`minka2001expectation`.
At the moment, when `method="variational_gamma"`,
a relatively large number of iterations is used (which testing indicates is
Expand All @@ -143,4 +147,4 @@ ts = tsdate.date(
progress=True,
population_size=100,
mutation_rate=1e-8)
```
```
29 changes: 20 additions & 9 deletions docs/priors.md
Original file line number Diff line number Diff line change
Expand Up @@ -20,7 +20,8 @@ kernelspec:

## Basic usage

The {func}`build_prior_grid` function allows you to create a bespoke prior.
The {func}`build_prior_grid` and `build_parameter_grid` functions allow you to create a bespoke prior
for the {ref}`sec_methods_discrete_time` and {ref}`sec_methods_continuous_time`, respectively.
This can be passed in to {func}`date` using the `priors` argument. It provides
a tuneable alternative to passing the {ref}`population size<sec_usage_popsize>`
directly to the {func}`date` function.
Expand Down Expand Up @@ -50,10 +51,12 @@ See below for more explanation of the interpretation of the parameters passed to

## Prior shape

For {ref}`sec_methods_discrete_time` methods, it is possible to switch from the (default)
lognormal approximation to a gamma distribution, used when building a mixture prior
for nodes that have variable numbers of children in different genomic regions. Tests have shown
that the lognormal is usually a better fit to the true prior in most cases.
For {ref}`sec_methods_discrete_time` methods, it is possible to switch from the
(default) lognormal approximation to a gamma distribution, used when building a
mixture prior for nodes that have variable numbers of children in different
genomic regions. The discretized prior is then constructed by evaluating the
lognormal (or gamma) distribution across a grid of fixed times. Tests have shown that the
lognormal is usually a better fit to the true prior in most cases.

For {ref}`sec_methods_continuous_time` methods, only the gamma distribution is available.

Expand All @@ -62,9 +65,9 @@ For {ref}`sec_methods_continuous_time` methods, only the gamma distribution is a
## Setting the timegrid

For {ref}`sec_methods_discrete_time` methods, a grid of timepoints is created. An explicit
timgrid can be passed via the `timepoints` parameter. Alternatively, if a single integer is
passed, a nonuniform time grid will be chosen based on the expected quantiles of the
coalescent approximation.
timegrid can be passed via the `timepoints` parameter. Alternatively, if a single integer is
passed, a nonuniform time grid will be chosen based on quantiles of the
lognormal (or gamma) approximation of the mixture prior.

Note that if an integer is given this is *not* the number of timepoints, but the number of
quantiles used as a basis for generating timepoints. The actual number of timepoints will
Expand All @@ -82,6 +85,14 @@ print(
print(prior.timepoints)
```

For {ref}`sec_methods_continuous_time` methods, a grid of variational parameters is
created (e.g. shape and rate parameters of gamma distributions for each node), which
may be modified manually.
Currently, node-specific priors are combined to generate a global i.i.d. prior
(although this behaviour will be changed in future releases to provide more
flexibility.)


(sec_priors_conditional_coalescent)=

## The conditional coalescent
Expand All @@ -104,4 +115,4 @@ indicates that using a single prior for all nodes may provide better accuracy; t
may also be a result of too strongly constraining each node prior to a fixed topology
before dating. Currently the `variational_gamma` method using a single ("global") prior
for all nodes. The best prior to use for different methods is a current topic of
investigation, and may be subject to change in future versions of `tsdate`.
investigation, and may be subject to change in future versions of `tsdate`.
8 changes: 6 additions & 2 deletions docs/usage.md
Original file line number Diff line number Diff line change
Expand Up @@ -179,6 +179,10 @@ that sum to one: each cell gives the probability that the time of the node
whose ID is given by the column header lies within the
timeslice specified by the `start_time` and `end_time` columns.

For the continuous-time `variational_gamma` method, the posterior for
each node is represented by the shape and rate parameter of the gamma approximation,
and the `start_time` and `end_time` entries may be ignored.

(sec_usage_popsize)=

## Population sizes
Expand Down Expand Up @@ -229,7 +233,7 @@ instability and other problems. Here we detail some common issues found in real

### Memory and run time

`Tsdate` can be run on most modern computers: large tree sequences of millions or
`tsdate` can be run on most modern computers: large tree sequences of millions or
tens of millions of edges will take of the order of hours, and use
tens of GB of RAM (e.g. 24 GB / 1 hour on a 2022-era laptop
for a tree sequence of 5 million edges covering
Expand Down Expand Up @@ -309,4 +313,4 @@ that they cover, which causes numerical instability and inaccuracy.

However, we expect *accurately* estimated unary regions to improve dating.
It is therefore possible to date a tree sequence containing locally unary nodes
using the `allow_unary` option when {ref}`building a prior<sec_priors>`.
using the `allow_unary` option when {ref}`building a prior<sec_priors>`.
10 changes: 5 additions & 5 deletions docs/variable_popsize.md
Original file line number Diff line number Diff line change
Expand Up @@ -94,11 +94,11 @@ plot_real_vs_tsdate_times(ts.nodes_time, redated_ts.nodes_time, ts, redated_ts,

## Estimating population size

If you don't know the population size, it is possible to use `tsdate`` to *estimate* changes in
population size over time, by extimating the number of coalescence points in different time
intervals, and re-estimating the dates. However, this
approach has not been fully tested or documented.
If you don't know the population size, it is possible to use `tsdate` to
*estimate* changes in population size over time, by first estimating the rate
of coalescence in different time intervals, and then re-estimating the dates.
However, this approach has not been fully tested or documented.

If you are interested in doing this, see
[GitHub issue #237](https://github.com/tskit-dev/tsdate/issues/237#issuecomment-1785655708)
for an example.
for an example.
1 change: 1 addition & 0 deletions tsdate/__init__.py
Original file line number Diff line number Diff line change
Expand Up @@ -23,6 +23,7 @@
from .core import date # NOQA: F401
from .core import get_dates # NOQA: F401
from .prior import build_grid as build_prior_grid # NOQA: F401
from .prior import parameter_grid as build_parameter_grid # NOQA: F401
from .provenance import __version__ # NOQA: F401
from .util import add_sampledata_times # NOQA: F401
from .util import preprocess_ts # NOQA: F401
Expand Down

0 comments on commit e24af53

Please sign in to comment.