Merge pull request #330 from hyanwong/docs-typos

A few more doc corrections
tskit-dev · Nov 8, 2023 · 5785dc0
2 parents b03502b + a9ccbdb
commit 5785dc0

Showing 3 changed files with 29 additions and 15 deletions.
diff --git a/docs/historical_samples.md b/docs/historical_samples.md
@@ -21,8 +21,8 @@ kernelspec:
 
 Sometimes we wish to infer and date a genetic genealogy from
 data which includes *historical samples*,
-whose time is older that the current generation (i.e. sample with
-node times > 0).
+whose time is older than the current generation (i.e. sample nodes with
+times > 0).
 
 The output of {ref}`tsinfer:sec_introduction` is valid regardless
 of the inclusion of historical samples, but *dating* such a tree sequence

diff --git a/docs/methods.md b/docs/methods.md
@@ -26,8 +26,8 @@ approaches approximate the probability distribution of times by a continuous
 mathematical function (e.g. a gamma distribution).
 
 In tests, we find that the continuous-time `variational_gamma` approach is
-the most accurate (but can suffer from numerical stability). The discrete-time
-`inside_outside` approach is slightly less accurate, especially for older times,
+the most accurate (but can suffer from {ref}`numerical instability<sec_usage_real_data_stability>`).
+The discrete-time `inside_outside` approach is slightly less accurate, especially for older times,
 but is more numerically robust, and the discrete-time `maximization` approach is
 always stable but is the least accurate.
 
@@ -49,8 +49,8 @@ Currently the default is `inside_outside`, but this may change in future release
 
 ## Discrete-time
 
-The `inside_outside` and `maximization` methods both implement discrete-time
-algorithms. These have the following advantages and disadvantages:
+The available discrete-time algorithms are the `inside_outside` and `maximization` methods.
+They have the following advantages and disadvantages:
 
 Pros
 : allows any shape for the distribution of times
@@ -91,6 +91,7 @@ Pros
     with number of timepoints
 : Old nodes do not suffer from time-discretisation issues caused by forcing
     bounds on the oldest times
+: Iterative updating theoretically solves the "loopy belief propagation" problem
 
 Cons
 : Assumes posterior times can be reasonably modelled by a gamma distribution
@@ -106,6 +107,12 @@ The `variational_gamma` method approximates times by fitting a separate gamma
 distribution for each node. Iteration is required to converge
 to a stable solution.
 
+Note that as a result of testing, the default priors used for this method are
+identical for all nodes (i.e. a "global" prior is used), based on a composite
+of all the conditional coalescent priors for all nodes.
+See {ref}`sec_priors_conditional_coalescent`
+for details.
+
 We are in the process of writing a formal description of the algorithm, but in
 summary, this approach uses an expectation propagation ("message passing")
 approach to update the gamma distribution for each node based on the times of connected

diff --git a/docs/usage.md b/docs/usage.md
@@ -74,7 +74,8 @@ redated_ts = tsdate.date(sim_ts, population_size=100, mutation_rate=1e-6)
 
 This simple example has no recombination, infinite sites mutation,
 a high mutation rate, and a known genealogy, so we would expect that the node times
-as estimated by tsdate from the mutations would be very close to the actual node times:
+as estimated by tsdate from the mutations would be very close to the actual node times,
+as indeed they seem to be:
 
 ```{code-cell} ipython3
 :tags: [hide-input]
@@ -124,7 +125,7 @@ print(
 There was not a fixed population size in the simulation used to generate the data,
 so we have used a rough commonly-used
 estimate of a human effective population size of 20,000 (see the
-[Variable population sizes]`sec_variable_popsize` section for more
+[Variable population sizes](sec_variable_popsize) section for more
 sophisticated approaches).
 :::
 
@@ -161,7 +162,7 @@ when calling {func}`tsdate.date`, which then returns both the dated tree sequenc
 and a dictionary specifying the posterior distributions.
 
 The returned posterior is a dictionary keyed by integer node ID, with values representing the
-probability distribution of times. This can be read in to a [pandas](https://pandas.pydata.org
+probability distribution of times. This can be read in to a [pandas](https://pandas.pydata.org)
 dataframe:
 
 ```{code-cell} ipython3
@@ -227,14 +228,20 @@ instability and other problems. Here we detail some common issues found in real
 
 ### Memory and run time
 
-`Tsdate` is not particularly memory intensive: whole genome tree sequences with
+`Tsdate` can be run on most modern computers: large tree sequences of millions or
+tens of millions of edges will take of the order of hours, and use
+tens of GB of RAM (e.g. 24 GB / 1 hour on a 2022-era laptop
+for a tree sequence of 5 million edges covering
+60 megabases of 7500 samples of human chromosome 6 from {cite}`wohns2022unified`).
+
+
+:::{todo}
+Add some scaling plots.
+:::
 
 Running the dating algorithm is linear in the number of edges in the tree sequence.
-This makes `tsdate` usable even for large tree sequences (e.g. millions of samples).
-Nevertheless, dating large tree sequences with millions of edges is likely to take
-some time (e.g. an hour or more for a tree sequence of 11 million edges covering
-150Mb of 7500 human chromosome 2, e.g. from {cite}`wohns2022unified`).
-If you are running `tsdate` interactively, it can be useful to
+This makes `tsdate` usable even for very large tree sequences (e.g. millions of samples).
+For large instances, if you are running `tsdate` interactively, it can be useful to
 specify the `progress` option to display a progress bar telling you how long
 different stages of dating will take.
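
As a rough illustration of the linear scaling described in the `docs/usage.md` change above, the sketch below extrapolates run time from the single data point quoted there (about 1 hour for a tree sequence of 5 million edges on a 2022-era laptop). The helper `estimate_hours` is hypothetical and not part of `tsdate`. It assumes the same proportionality constant holds for other machines and datasets, which may well not be true, so treat the result as an order-of-magnitude guide only.

```python
# Reference point quoted in the docs change: ~1 hour for ~5 million edges.
# Both constants come from that single example and are not a benchmark.
REFERENCE_EDGES = 5_000_000
REFERENCE_HOURS = 1.0


def estimate_hours(num_edges: int) -> float:
    """Extrapolate run time linearly from the reference point.

    Hypothetical helper (not part of tsdate): assumes run time is
    proportional to the number of edges, with an unchanged constant.
    """
    if num_edges < 0:
        raise ValueError("num_edges must be non-negative")
    return REFERENCE_HOURS * num_edges / REFERENCE_EDGES


if __name__ == "__main__":
    # Print rough estimates for a few tree sequence sizes.
    for edges in (1_000_000, 5_000_000, 11_000_000):
        print(f"{edges:>12,} edges: ~{estimate_hours(edges):.1f} h")
```

Memory use is deliberately left out of the sketch: the text states linear scaling only for run time, not for RAM.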