Merge pull request #330 from hyanwong/docs-typos
A few more doc corrections
hyanwong authored Nov 8, 2023
2 parents b03502b + a9ccbdb commit 5785dc0
Showing 3 changed files with 29 additions and 15 deletions.
4 changes: 2 additions & 2 deletions docs/historical_samples.md
@@ -21,8 +21,8 @@ kernelspec:

Sometimes we wich to infer and date a genetic genealogy from
data which includes *historical samples*,
-whose time is older that the current generation (i.e. sample with
-node times > 0).
+whose time is older that the current generation (i.e. sample nodes with
+times > 0).

The output of {ref}`tsinfer:sec_introduction` is valid regardless
of the inclusion of historical samples, but *dating* such a tree sequence
15 changes: 11 additions & 4 deletions docs/methods.md
@@ -26,8 +26,8 @@ approaches approximate the probability distribution of times by a continuous
mathematical function (e.g. a gamma distribution).

In tests, we find that the continuous-time `variational_gamma` approach is
-the most accurate (but can suffer from numerical stability). The discrete-time
-`inside_outside` approach is slightly less accurate, especially for older times,
+the most accurate (but can suffer from {ref}`numerical instability<sec_usage_real_data_stability>`).
+The discrete-time `inside_outside` approach is slightly less accurate, especially for older times,
but is more numerically robust, and the discrete-time `maximization` approach is
always stable but is the least accurate.

@@ -49,8 +49,8 @@ Currently the default is `inside_outside`, but this may change in future release

## Discrete-time

-The `inside_outside` and `maximization` methods both implement discrete-time
-algorithms. These have the following advantages and disadvantages:
+The available discrete-time algorithms are the `inside_outside` and `maximization` methods.
+They have the following advantages and disadvantages:

Pros
: allows any shape for the distribution of times
@@ -91,6 +91,7 @@ Pros
with number of timepoints
: Old nodes do not suffer from time-discretisation issues caused by forcing
bounds on the oldest times
+: Iterative updating theoretically solves the "loopy belief propagation" problem

Cons
: Assumes posterior times can be reasonably modelled by a gamma distribution
@@ -106,6 +107,12 @@ The `variational_gamma` method approximates times by fitting a separate gamma
distribution for each node. Iteration is required to converge
to a stable solution.

+Note that as a result of testing, the default priors used for this method are
+identical for all nodes (i.e. a "global" prior is used), based on a composite
+of all the conditional coalescent priors for all nodes.
+See {ref}`sec_priors_conditional_coalescent`
+for details.
+
We are in the process of writing a formal description of the algorithm, but in
summary, this approach uses an expectation propagation ("message passing")
approach to update the gamma distribution for each node based on the times of connected
25 changes: 16 additions & 9 deletions docs/usage.md
@@ -74,7 +74,8 @@ redated_ts = tsdate.date(sim_ts, population_size=100, mutation_rate=1e-6)

This simple example has no recombination, infinite sites mutation,
a high mutation rate, and a known genealogy, so we would expect that the node times
-as estimated by tsdate from the mutations would be very close to the actual node times:
+as estimated by tsdate from the mutations would be very close to the actual node times,
+as indeed they seem to be:

```{code-cell} ipython3
:tags: [hide-input]
@@ -124,7 +125,7 @@ print(
There was not a fixed population size in the simulation used to generate the data,
so we have used a rough commonly-used
estimate of an human effective population size of 20,000 (see the
-[Variable population sizes]`sec_variable_popsize` section for more
+[Variable population sizes](sec_variable_popsize) section for more
sophisticated approaches).
:::

@@ -161,7 +162,7 @@ when calling {func}`tsdate.date`, which then returns both the dated tree sequenc
and a dictionary specifying the posterior distributions.

The returned posterior is a dictionary keyed by integer node ID, with values representing the
-probability distribution of times. This can be read in to a [pandas](https://pandas.pydata.org
+probability distribution of times. This can be read in to a [pandas](https://pandas.pydata.org)
dataframe:

```{code-cell} ipython3
@@ -227,14 +228,20 @@ instability and other problems. Here we detail some common issues found in real

### Memory and run time

-`Tsdate` is not particularly memory intensive: whole genome tree sequences with
+`Tsdate` can be run on most modern computers: large tree sequences of millions or
+tens of millions of edges will take of the order of hours, and use
+tens of GB of RAM (e.g. 24 GB / 1 hour on a 2022-era laptop
+for a tree sequence of 5 million edges covering
+60 megabases of 7500 samples of human chromosome 6 from {cite}`wohns2022unified`).
+
+
+:::{todo}
+Add some scaling plots.
+:::

Running the dating algorithm is linear in the number of edges in the tree sequence.
-This makes `tsdate` usable even for large tree sequences (e.g. millions of samples).
-Nevertheless, dating large tree sequences with millions of edges is likely to take
-some time (e.g. an hour or more for a tree sequence of 11 million edges covering
-150Mb of 7500 human chromosome 2, e.g. from {cite}`wohns2022unified`).
-If you are running `tsdate` interactively, it can be useful to
+This makes `tsdate` usable even for vary large tree sequences (e.g. millions of samples).
+For large instances, if you are running `tsdate` interactively, it can be useful to
specify the `progress` option to display a progress bar telling you how long
different stages of dating will take.
