From 22dd1a56929159f5684fa3e21b7b7d78ee8fc91f Mon Sep 17 00:00:00 2001 From: Gregor Gorjanc Date: Fri, 20 Sep 2024 07:41:36 +0100 Subject: [PATCH 1/4] Minor formatting what_is.md, terminology_and_concepts.md, getting_started.md --- getting_started.md | 81 +++++++++++++++++++++---------------- terminology_and_concepts.md | 58 ++++++++++++-------------- what_is.md | 14 +++---- 3 files changed, 80 insertions(+), 73 deletions(-) diff --git a/getting_started.md b/getting_started.md index 672903ab..0c25bcd8 100644 --- a/getting_started.md +++ b/getting_started.md @@ -18,20 +18,21 @@ kernelspec: # Getting started with {program}`tskit` You've run some simulations or inference methods, and you now have a -{class}`TreeSequence` object; what now? This tutorial is aimed +{class}`TreeSequence` object; what now? This tutorial is aimed for users who are new to {program}`tskit` and would like to get some basic tasks completed. We'll look at five fundamental things you might -need to do: +want to do: {ref}`process trees`, -{ref}`sites & mutations`, and -{ref}`genotypes`, +{ref}`process sites & mutations`, +{ref}`extract genotypes`, {ref}`compute statistics`, and {ref}`save or export data`. Throughout, we'll also provide pointers to where you can learn more. + :::{note} The examples in this -tutorial are all written using the {ref}`sec_python_api`, but it's also possible to +tutorial are all written using the {ref}`tskit:sec_python_api`, but it's also possible to {ref}`use R `, or access the API in other languages, notably {ref}`C` and [Rust](https://github.com/tskit-dev/tskit-rust). ::: @@ -43,7 +44,6 @@ twenty diploid individuals. To make it a bit more interesting, we'll simulate th of a {ref}`selective sweep` in the middle of the chromosome, then throw some neutral mutations onto the resulting tree sequence. - ```{code-cell} ipython3 import msprime @@ -61,7 +61,7 @@ ts = msprime.sim_ancestry( recombination_rate=1e-8, random_seed=1234, # only needed for repeatabilty ) -# Optionally add finite-site mutations to the ts using the Jukes & Cantor model, creating SNPs +# Optionally add finite-site mutations to the ts using the default Jukes & Cantor model, creating SNPs ts = msprime.sim_mutations(ts, rate=1e-8, random_seed=4321) ts ``` @@ -250,13 +250,16 @@ for nmuts, count in enumerate(np.bincount(num_muts)): (sec_processing_genotypes)= -## Processing genotypes +## Extracting genotypes At each site, the sample nodes will have a particular allelic state (or be flagged as {ref}`tskit:sec_data_model_missing_data`). The -{meth}`TreeSequence.variants` method gives access to the -full variation data. For efficiency, the {attr}`~Variant.genotypes` -at a site are returned as a [numpy](https://numpy.org) array of integers: +{meth}`TreeSequence.variants` method gives access to +full variation data - possible allelic states and node genotypes at each site. +Since nodes are by definition haploid, their genotype at a site is their allelic state +at the site. For efficiency, its attribute {attr}`~Variant.genotypes` +at a site returns a [numpy](https://numpy.org) array of integers +representing allelic states: ```{code-cell} ipython3 import numpy as np @@ -275,11 +278,9 @@ Tree sequences are optimised to look at all samples at one site, then all sample adjacent site, and so on along the genome. It is much less efficient look at all the sites for a single sample, then all the sites for the next sample, etc. In other words, you should generally iterate over sites, not samples. Nevertheless, all the alleles for -a single sample can be obtained via the -{meth}`TreeSequence.haplotypes` method. +a single sample can be obtained via the {meth}`TreeSequence.haplotypes` method. ::: - To find the actual allelic states at a site, you can refer to the {attr}`~Variant.alleles` provided for each {class}`Variant`: the genotype value is an index into this list. Here's one way to print them out; for @@ -287,8 +288,6 @@ clarity this example also prints out the IDs of both the sample nodes (i.e. the and the diploid {ref}`individuals ` in which each sample node resides. - - ````{code-cell} ipython3 samp_ids = ts.samples() print(" ID of diploid individual: ", " ".join([f"{ts.node(s).individual:3}" for s in samp_ids])) @@ -309,7 +308,6 @@ such as indels, leading to allelic states which need not be one of these 4 lette even be a single letter. ::: - (sec_tskit_getting_started_compute_statistics)= ## Computing statistics @@ -353,17 +351,17 @@ is the spectrum for a section of the tree sequence between 5 and 5.5Mb, which we created by deleting the regions outside that interval using {meth}`TreeSequence.keep_intervals`. Unsurprisingly, as we noted when looking at the trees, there's a far higher proportion of singletons in -the region of the sweep. +the region of the sweep compared to the entire genome. (sec_tskit_getting_started_compute_statistics_windowing)= ### Windowing It is often useful to see how statistics vary in different genomic regions. This is done -by calculating them in {ref}`tskit:sec_stats_windows` along the genome. For this, +by calculating them in {ref}`windows` along the genome. For this, let's look at a single statistic, the genetic {meth}`~TreeSequence.diversity` (π). As a -site statistic this measures the average number of genetic differences between two -randomly chosen samples, whereas as a branch length statistic it measures the average +site statistic, it measures the average number of genetic differences between two +randomly chosen samples, whereas as a branch statistic, it measures the average branch length between them. We'll plot how the value of π changes using 10kb windows, plotting the resulting diversity between positions 4 and 6 Mb: @@ -380,15 +378,15 @@ ax1.set_yscale("log") ax1.set_ylim(1e-6, 1e-3) ax2.stairs(ts.diversity(windows=windows, mode="branch"), windows/1_000, baseline=None) ax2.set_xlabel("Genome position (kb)") -ax2.set_title("Branch-length-based calculation") +ax2.set_title("Branch-based calculation") ax2.set_xlim(4e3, 6e3) ax2.set_yscale("log") plt.show() ``` There's a clear drop in diversity in the region of the selective sweep. And as expected, -the statistic based on branch-lengths gives a less noisy signal. - +the statistic based on branch lengths gives a less noisy signal than the statistic based +on randomly occuring mutations at sites. (sec_tskit_getting_started_exporting_data)= @@ -434,7 +432,7 @@ print(small_ts.as_nexus(precision=3, include_alignments=False)) ### VCF -The standard way of interchanging genetic variation data is the Variant Call Format, +The standard way of interchanging genetic variation data is the Variant Call Format (VCF), for which tskit has basic support: ```{code-cell} ipython3 @@ -442,9 +440,8 @@ import sys small_ts.write_vcf(sys.stdout) ``` -The write_vcf method takes a file object as a parameter; to get it to write out to the -notebook here we ask it to write to stdout. - +The {meth}`TreeSequence.write_vcf` method takes a file object as a parameter; +to get it to write out to the notebook here we ask it to write to stdout. (sec_tskit_getting_started_exporting_data_allel)= @@ -466,8 +463,8 @@ print(h.n_variants, h.n_haplotypes) h ``` -Sckit.allel has a wide-ranging and efficient suite of tools for working with genotype -data, so should provide anything that's needed. For example, it gives us an +{program}`scikit-allel` has a wide-ranging and efficient suite of tools for working +with genotype data for most of users' needs. For example, it gives us an another way to compute the pairwise diversity statistic (that we calculated {ref}`above` using the native {meth}`TreeSequence.diversity` method): @@ -477,6 +474,23 @@ ac = h.count_alleles() allel.mean_pairwise_difference(ac) ``` +Comparative call with tskit is: + +```{code-cell} ipython3 +# Per site (not span-normalised to match scikit-allel) +small_ts.diversity(windows="sites", span_normalise=False) +# Per site (span-normalised to account for variable span of windows defined by sites) +small_ts.diversity(windows="sites") + +# Per whole genome +small_ts.diversity(span_normalise=False) +# ... equivalent call to the above +sum(small_ts.diversity(windows="sites", span_normalise=False)) +small_ts.diversity() +``` + +See {ref}`windows` and {ref}`windows` +for details about the ``span_normalise=False``. (sec_tskit_getting_started_key_points)= @@ -496,15 +510,15 @@ in rough order of importance: * {ref}`sec_terminology_nodes` (i.e. genomes) can belong to {ref}`individuals`. For example, sampling a diploid individual results in an {class}`Individual` object which - possesses two distinct {ref}`sample nodes`. + links to two distinct {ref}`sample nodes`. * Key tree sequence methods * {meth}`~TreeSequence.samples()` returns an array of node IDs specifying the nodes that are marked as samples * {meth}`~TreeSequence.node` returns the node object for a given integer node ID * {meth}`~TreeSequence.trees` iterates over all the trees * {meth}`~TreeSequence.sites` iterates over all the sites - * {meth}`~TreeSequence.variants` iterates over all the sites with their genotypes - and alleles + * {meth}`~TreeSequence.variants` iterates over all the sites with their list of + unique alleles and sample/node? genotypes * {meth}`~TreeSequence.simplify()` reduces the number of sample nodes in the tree sequence to a specified subset * {meth}`~TreeSequence.keep_intervals()` (or its complement, @@ -519,4 +533,3 @@ in rough order of importance: sequence, for example {meth}`~TreeSequence.allele_frequency_spectrum`, {meth}`~TreeSequence.diversity`, and {meth}`~TreeSequence.Fst`; these can also be calculated in windows along the genome. - diff --git a/terminology_and_concepts.md b/terminology_and_concepts.md index 360b587d..6e672150 100644 --- a/terminology_and_concepts.md +++ b/terminology_and_concepts.md @@ -48,8 +48,6 @@ def basics(): ) tables.mutations.time = np.full_like(tables.mutations.time, tskit.UNKNOWN_TIME) tables.tree_sequence().dump("data/basics.trees") - - def create_notebook_data(): basics() @@ -71,18 +69,18 @@ concepts behind {program}`tskit`, the tree sequence toolkit. ::::{margin} :::{note} -See {ref}`sec_intro_downloading_datafiles` to run this tutorial on your own computer +See {ref}`sec_intro_downloading_datafiles` to run this tutorial on your own computer. ::: :::: A tree sequence is a data structure which describes a set of correlated evolutionary trees, together with some associated data that specifies, for example, -the location of mutations in the tree sequence. More technically, a tree sequence -stores a biological structure known as an "Ancestral Recombination Graph", or ARG. +the location of mutations in the genome. More technically, a tree sequence +stores a population genetics object known as an Ancestral Recombination Graph (ARG). -Below are the most important {ref}`terms and concepts ` -that you'll encounter in these tutorials, but first we'll {func}`~tskit.load` a tree -sequence from a `.trees` file using the +Below are the most important terms and concepts that you'll encounter in these tutorials. +A concise glossary of these terms and concepts is available at {ref}`here `. +But first we'll {func}`~tskit.load` a tree sequence from a `.trees` file using the {ref}`tskit:sec_python_api` (which will be used in the rest of this tutorial): ```{code-cell} ipython3 @@ -97,7 +95,7 @@ ts = tskit.load("data/basics.trees") :::{note} {ref}`Workarounds` exist to represent a multi-chromosome genome as a tree -sequence, but are not covered here +sequence, but are not covered here. ::: :::: @@ -126,13 +124,12 @@ ts.draw_svg( ) ``` -Each tree records the lines of descent along which a piece of DNA has been -inherited (ignore for the moment the red symbols, which represent a mutation). +Each tree records the lines of descent along which a piece of DNA has been inherited. For example, the first tree tells us that DNA from ancestral genome 7 duplicated to produce two lineages, which ended up in genomes 1 and 4, both of which exist in the current population. In fact, since this pattern is seen in all trees, these particular lines of inheritance were taken by all the DNA in this 1000 base pair genome. - +The red symbol is a mutation, which we will describe later. (sec_terminology_nodes)= @@ -148,7 +145,6 @@ an *internal node*, representing an ancestor in which a single DNA sequence was duplicated (in forwards-time terminology) or in which multiple sequences coalesced (in backwards-time terminology). - (sec_terminology_nodes_samples)= #### Sample nodes @@ -163,7 +159,6 @@ labelled $0..5$, and also 6 non-sample nodes, labelled $6..11$, in the tree sequ print("There are", ts.num_nodes, "nodes, of which", ts.num_samples, "are sample nodes") ``` - (sec_terminology_edges)= ### Edges @@ -175,10 +170,9 @@ three trees in the example above has a branch from node 7 to node 1, but those t branches represent just a single edge. Each edge is associated with a parent node ID and a child node ID. The time of the parent -node must be -strictly greater than the time of the child node, and the difference in these times is -sometimes referred to as the "length" of the edge. Since trees in a tree sequence are -usually taken to represent marginal trees along a genome, as well as the time dimension +node must be strictly greater than the time of the child node, and the difference in these times +is sometimes referred to as the "length" of the edge. Since trees in a tree sequence are +usually taken to represent local trees along a genome, as well as the time dimension each edge also has a genomic _span_, defined by a *left* and a *right* position along the genome. There are 15 edges in the tree sequence above. Here's an example of one of them: @@ -240,9 +234,6 @@ children_of_7 = first_tree.children(7) print("Node 7's parent is", parent_of_7, "and childen are", children_of_7, "in the first tree") ``` - - - (sec_terminology_individuals_and_populations)= ### Individuals and populations @@ -332,6 +323,8 @@ homozygous for "T", Bob is homozygous for "G", and Cat is heterozygous "T/G". In other words the ancestral state and the details of any mutations at that site, when coupled with the tree topology at the site {attr}`~Site.position`, is sufficient to define the allelic state possessed by each sample. +See description for {attr}`~Mutation.parent` on how tskit handles multiple mutations along +a path in a tree. Note that even though the genome is 1000 base pairs long, the tree sequence only contains a single site, because we usually only bother defining *variable* sites in a tree @@ -340,7 +333,6 @@ that genomic location). It is perfectly possible to have a site with no mutation (or silent mutations) --- i.e. a "monomorphic" site --- but such sites are not normally used in further analysis. - (sec_terminology_provenance)= ### Provenance @@ -354,7 +346,6 @@ call to msprime that produced it, and the second the call to provenance entries are sufficient to exactly recreate the tree sequence, but this is not always possible. - (sec_concepts)= ## Concepts @@ -385,17 +376,18 @@ with 3 or more children in a particular tree (these are known as *polytomies*). ### Tree changes, ancestral recombinations, and SPRs The process of recombination usually results in trees along a genome where adjacent -trees differ by only a few "tree edit" or SPR (subtree-prune-and-regraft) operations. +trees differ by only a few "tree edit" or subtree-prune-and-regraft (SPR) operations. The result is a tree sequence in which very few edges {ref}`change from tree to tree`. This is the underlying reason that `tskit` is so efficient, and is well illustrated in the example tree sequence above. In this (simulated) tree sequence, each tree differs from the next by a single SPR. -The subtree defined by node 7 in the first tree has been pruned and regrafted onto the -branch between 0 and 10, to create the second tree. The second and third trees have the -same topology, but differ because their ultimate coalesence happened in a different -ancestor (easy to spot in a simulation, but hard to detect in real data). This is also +The subtree defined by node 7 in the first tree has been pruned (away from node 11) and +regrafted onto the branch between 0 and 9, to create the second tree. +The second and third trees have the same topology, +but differ because their ultimate coalesence happened in a different ancestor +(easy to spot in a simulation, but hard to detect in real data). This is also caused by a single SPR: looking at the second tree, either the subtree below node 8 or the subtree below node 9 must have been pruned and regrafted higher up on the same lineage to create the third tree. Because this is a fully {ref}`simplified` @@ -409,7 +401,7 @@ positions (an "infinite sites" model of breakpoints), then the number of trees i sequence equals the number of ancestral recombination events plus one. If recombinations can occur at the same physical position (e.g. if the genome is treated as a set of discrete integer positions, as in the simulation that created this tree sequence) then -moving from one tree to the next in a tree sequence might require multiple SPRs if +moving from one tree to the next in a tree sequence might require multiple SPRs if there are multiple, overlaid ancestral recombination events. (sec_concepts_args)= @@ -418,11 +410,11 @@ there are multiple, overlaid ancestral recombination events. ::::{margin} :::{note} -There is a subtle distinction between common ancestry and coalescence. In particular, all coalescent nodes are common ancestor events, but not all common ancestor events in an ARG result in coalescence in a local tree. +There is a subtle distinction between common ancestry and coalescence. In particular, all coalescent nodes are common ancestor events, but not all common ancestor events in an ARG result in coalescence in all local trees. ::: :::: -The term "Ancestral Recombination Graph", or ARG, is commonly used to describe a genetic +The term Ancestral Recombination Graph (ARG), is commonly used to describe a genetic genealogy. In particular, many (but not all) authors use it to mean a genetic genealogy in which details of the position and potentially the timing of all recombination and common ancestor events are explictly stored. For clarity @@ -438,7 +430,7 @@ which omits these extra nodes. This is for two main reasons: 2. The number of recombination and non-coalescing common ancestor events in the genealogy quickly grows to dominate the total number of nodes in the tree sequence, without actually contributing to the mutations inherited by the samples. - In other words, these nodes are redundant to the storing of genome data. + In other words, these nodes are redundant to the storing of genomic data. Therefore, compared to a full ARG, you can think of a simplified tree sequence as storing the trees *created by* recombination events, rather than attempting to record the @@ -450,6 +442,8 @@ way to put it: > whereas a [simplified] tree sequence encodes the outcome of those events" > ([Kelleher _et al._, 2019](https://doi.org/10.1534/genetics.120.303253)) +[Wong _et al._, 2024](https://doi.org/10.1093/genetics/iyae100) +review this topic in detail. ### Tables diff --git a/what_is.md b/what_is.md index faf89a6c..6fce9e48 100644 --- a/what_is.md +++ b/what_is.md @@ -79,7 +79,7 @@ they can be created by [evolutionary simulation](https://tskit.dev/software/#sim or by [inferring genealogies from empirical DNA data](https://tskit.dev/software/#infer). :::{margin} Key point -Tree sequences are used to encode and analyse large genetic datasets +Tree sequences are used to encode and analyse large genetic datasets. ::: Tree sequences provide an efficient way of storing @@ -226,7 +226,7 @@ branches of the trees, and the positions of the twelve variable sites associated mutations are shown along the X axis. :::{margin} Key point -Mutation on trees are the source of genetic variation +Mutation on trees are the source of genetic variation. ::: The trees inform us that, for example, the final mutation (at position 987) is inherited @@ -266,7 +266,7 @@ ts.draw_svg( ``` :::{margin} Key point -Tree sequences are efficient because they don't store each tree separately +Tree sequences are efficient because they don't store each tree separately. ::: A branch can be shared by many adjacent trees, but is stored as a single edge in the tree @@ -314,9 +314,9 @@ plt.show() ::::{margin} :::{note} -The genetic genealogy is sometimes referred to as an ancestral recombination graph, -or ARG, and one way to think of tskit tree sequence is as a way -to store various different sorts of ARGs (see the {ref}`ARG tutorial`) +The genetic genealogy is sometimes referred to as an ancestral recombination graph (ARG), +and one way to think of tskit tree sequence is as a way +to store various different sorts of ARGs (see the {ref}`ARG tutorial`). ::: :::: @@ -428,7 +428,7 @@ multiple correlated trees along a genome. ```{margin} Key point Most genetic calculations involve iterating over trees, which is highly efficient in -{program}`tskit` +{program}`tskit`. ``` For example, statistical measures of genetic variation can be thought of as a calculation From 177fb100e827613f27a9ac335e015bd826ac3a5d Mon Sep 17 00:00:00 2001 From: Gregor Gorjanc Date: Sat, 5 Oct 2024 20:29:56 +0100 Subject: [PATCH 2/4] Polishing analysing tree seq tutorial --- analysing_tree_sequences.md | 325 +++++++++++++++++++++++++----------- getting_started.md | 13 +- terminology_and_concepts.md | 4 +- 3 files changed, 231 insertions(+), 111 deletions(-) diff --git a/analysing_tree_sequences.md b/analysing_tree_sequences.md index 7da9dcf4..c1538a9d 100644 --- a/analysing_tree_sequences.md +++ b/analysing_tree_sequences.md @@ -33,7 +33,6 @@ def afs(): ts = tables.tree_sequence() ts.dump("data/afs.trees") - def create_notebook_data(): computing_statistics() afs() @@ -50,27 +49,28 @@ def create_notebook_data(): This tutorial is a work in progress. ::: - (sec_tutorial_stats)= ## Computing statistics Tskit provides an extensive and flexible interface for computing population genetic statistics, which is documented in detail in the -{ref}`general statistics ` section of the offical documentation. -This tutorial aims to give a quick overview of how the APIs work and how to use -them effectively. +{ref}`general statistics` section of the offical documentation. +This tutorial aims to give a quick overview of how the statistics APIs work +and how to use them effectively. After this overview, we will dive into +allele frequency spectrum analysis. -First, let's load a tree sequence to work with which has roughly human -parameters for 10 thousand samples and 10Mb chromosomes: +First, let's load a tree sequence that represents 10 thousand samples +over a 10 megabases chromosome from a simulation with human-like parameters: ```{code-cell} ipython3 +import tskit ts = tskit.load("data/computing_statistics.trees") -ts +print(ts) ``` This tree sequence has ~36.6 thousand trees & ~39 thousand segregating sites. -We'd now like to compute some statistics on this dataset. +We'd now like to analyse this dataset using different statistics. ### One-way statistics @@ -80,30 +80,46 @@ is computed using the {meth}`TreeSequence.diversity` method: ```{code-cell} ipython3 d = ts.diversity() -print("Average diversity per unit sequence length = {d:.3G}") +print(f"Average diversity per unit sequence length = {d:e}") ``` -This tells the average diversity across the whole sequence and returns a single -number. We'll usually want to compute statistics in -{ref}`windows ` along the genome and we -use the ``windows`` argument to do this: +The above result tells us the average diversity across the samples and +their whole sequence (that is their genomes), returning a single number. +Note the emphasis on "average" in the above print statement. +See the {ref}`Span normalisation` sub-section +below for more details on this point. + +#### Windows + +We often want to compute statistics in {ref}`windows` +along the genome. We can use the ``windows`` argument to do this: ```{code-cell} ipython3 +import numpy as np windows = np.linspace(0, ts.sequence_length, num=5) +print(windows) d = ts.diversity(windows=windows) -print(windows, d, sep="\n") +print(d) ``` -The ``windows`` argument takes a numpy array specifying the breakpoints -along the genome. Here, we use numpy to create four equally spaced windows -of size 2.5 megabases (the windows array contains k + 1 elements to define -k windows). Because we have asked for values in windows, tskit now returns -a numpy array rather than a single value. (See -{ref}`sec_stats_output_dimensions` for a full description of how the output -dimensions of statistics are determined by the ``windows`` argument.) - -Suppose we wanted to compute diversity within a specific subset of samples. -We can do this using the ``sample_sets`` argument: +The ``windows`` argument takes a numpy array with breakpoints along the genome. +Above, we created four equally spaced windows of size 2.5 megabases, +since the total sequence length (``ts.sequence_length``) is 10 megabases. +Note that the array contains `k + 1` elements to define `k` windows. +Because we have asked for values in windows, tskit now returns a numpy array +rather than a single value. +See {ref}`tskit:sec_stats_windows` for a full description on the ``windows`` argument, +including shortcuts on how to get statistics per local tree or per site. +Note though that there can be many local trees and sites in a tree sequence, +leading to a lot of output. +See {ref}`tskit:sec_stats_output_dimensions` for a full description of how +the output dimensions of statistics are determined by the ``windows`` argument. + +#### Sample sets + +Suppose we wanted to compute average diversity within a specific subset of samples, +instead of all samples. We can do this using the {ref}`tskit:sec_stats_sample_sets` +argument: ```{code-cell} ipython3 A = ts.samples()[:100] @@ -112,10 +128,10 @@ print(d) ``` Here, we've computed the average diversity within the first hundred samples across -the whole genome. As we've not specified any windows, this is again a single value. +the whole sequence. As we've not specified any windows, this is again a single value. -We can also compute diversity in *multiple* sample sets at the same time by providing -a list of sample sets as an argument: +We can also compute average diversity in *multiple* sample sets at the same time +by providing a list of sample sets as an argument: ```{code-cell} ipython3 A = ts.samples()[:100] @@ -125,75 +141,169 @@ d = ts.diversity(sample_sets=[A, B, C]) print(d) ``` -Because we've computed multiple statistics concurrently, tskit returns a numpy array -of these statistics. We have asked for diversity within three different sample sets, -and tskit therefore returns an array with three values. (In general, the -dimensions of the input determines the dimensions of the output: see -{ref}`tskit:sec_stats_output_dimensions` for a detailed description of the rules.) +Because we've computed a statistic for multiple sets, tskit returns a numpy array. +We have asked for diversity within three different sample sets, +and tskit therefore returns an array with three values. +As with ``windows``, the ``sample_sets`` argument determines the output dimensions. +See {ref}`tskit:sec_stats_output_dimensions` for a detailed description of the rules. + +#### Sample sets and windows -We can also compute multiple statistics in multiple windows: +We can also compute a statistic for multiple sets in multiple windows: ```{code-cell} ipython3 d = ts.diversity(sample_sets=[A, B, C], windows=windows) print("shape = ", d.shape, "\n", d) ``` -We have computed diversity within three different sample sets across four -genomic windows, and our output is therefore a 2D numpy array with four -rows and three columns: each row contains the diversity values within -A, B and C for a particular window. +We have computed average diversity in three sample sets across four genomic windows, +and our output is therefore a 2D numpy array with four rows and three columns: +each row contains the diversity values within sample set A, B, and C +for a particular window. + +(sec_span_normalisation)= + +#### Span normalisation + +Above we referred to the *average* diversity per unit sequence length, +either for the whole sequence or within windows. +To facilitate comparison, tskit by default normalises statistics +by the sequence length, returning average of a statistic. +When this is not what you want, you can use the {ref}`tskit:sec_stats_span_normalise` +argument: + +```{code-cell} ipython3 +davg = ts.diversity() +dtot = ts.diversity(span_normalise=False) +print(f"Average diversity per unit sequence length = {davg:e}") +print(f"Total diversity across sequence = {dtot}") +print(davg * ts.sequence_length) +``` + +As you see above, the average diversity per unit sequence length is equal to +the total diversity divided (normalised) by the sequence length. + +Normalisation is useful when comparing statistics across different windows +and/or sample sets. As an example, let's look at the span of +the first four local trees and compute the total diversity for each tree: + +```{code-cell} ipython3 +b = ts.breakpoints(as_array=True)[0:4] +print(b) +dtot = ts.diversity(windows="trees", span_normalise=False)[0:4] +print(f"Total diversity per local tree = {dtot}") +``` + +As you can see the span of local tree differs. +The first tree spans from 0 to 131.2..., +the second tree spans from 131.2... to 147.8..., +etc. +Consequently, the total diversity for each tree is also different. +The first two trees have total diversity of 0, indicating they contain no mutations. +The third tree has total diversity of 0.225... and +the fourth tree has total diversity of 0.001... +These last two values differ because of diffent tree spans and frequency of mutated +alleles in these trees, challenging comparison of the values. +To facilitate comparisons, tskit by default span-normalises statistics: + +```{code-cell} ipython3 +davg = ts.diversity(windows="trees")[0:4] +print(f"Average diversity per local tree = {davg}") +``` + +The above values indicate that the average diversity per unit sequence length +is larger in the third tree than in the fourth tree, indicating that the third tree +has more intermediate frequencies of alleles than the fourth tree. + +#### Mode + +TODO: The simulated tree sequence is without time units, +hence branch statistics have no units. Should we skip +the example below then and just keep the 1st paragraph below +(but surely someone will try branch mode and ask what is the meaning of +ts.diversity(mode="branch") being 40534.4...)? + +Above we have computed statistics based on site mutations in the tree sequence. +In tskit, these statistics are hence called "site-based". This "mode" of statistics +is computed by default, but we can also compute "branch-based" and "node-based" +statistics using the {ref}`tskit:sec_stats_mode` argument: + +```{code-cell} ipython3 +ds = ts.diversity() +print(ds) +db = ts.diversity(mode="branch") +print(db) +``` + +Site-based statistics use mutation differences between genomes, +while branch-based statistics use branch lengths between genomes. +Above, 0.0004... is average diversity per unit sequence length +in terms of mutation differences between samples. And, 40534.4... +is average diversity per unit sequence length in terms of +branch lengths between samples. + +Node-based statistics behave differently. They are computed per +each node in a tree sequence and TODO: REPORT WHAT? +Since there can be many nodes in a tree sequence, the output +can be large. + +```{code-cell} ipython3 +dn = ts.diversity(mode="node") +print(dn) +``` ### Multi-way statistics -Many population genetic statistics compare multiple sets of samples to -each other. For example, the {meth}`TreeSequence.divergence` method computes -the divergence between two subsets of samples: +Many population genetic statistics compare sample sets to each other, +leading to "multi-way" statistics. +For example, the {meth}`TreeSequence.divergence` method computes +the divergence between two sets of samples: ```{code-cell} ipython3 A = ts.samples()[:100] -B = ts.samples()[:100] -d = ts.divergence([A, B]) +B = ts.samples()[100:200] +d = ts.divergence(sample_sets=[A, B]) print(d) ``` The divergence between two sets of samples A and B is a single number, -and we we again return a single floating point value as the result. We can also -compute this in windows along the genome, as before: +hence we obtain a single value as the result. + +We can also compute "multi-way" statistics in windows along the genome, as before: ```{code-cell} ipython3 -d = ts.divergence([A, B], windows=windows) +windows = np.linspace(0, ts.sequence_length, num=5) +d = ts.divergence(sample_sets=[A, B], windows=windows) print(d) ``` Again, as we have defined four genomic windows along the sequence, the result is -numpy array with four values. - -A powerful feature of tskit's stats API is that we can compute the divergences -between multiple sets of samples simultaneously using the ``indexes`` argument: +numpy array with four values, representing the average divergence between sample +set A and B in each window. +A powerful feature of tskit's statistics API is that we can compute +multi-way statistics between multiple sets of samples simultaneously +using the {ref}`indexes`). +Because we know the ancestral and derived states at these sites, +we have set ``polarised`` argument to True. We can get the "folded" AFS by +setting ``polarised`` argument to False. +Because we want simple counts here and not averaged values, we set +``span_normalise=False``. As noted {ref}`above`, +by default, statistics are divided by the sequence length, +so they are comparable between samples and windows, which is not what we want here. + +We can also obtain AFS in windows along the genome: ```{code-cell} ipython3 -afs = ts.allele_frequency_spectrum(windows=[0, 0.5, 1], span_normalise=False, polarised=True) +afs = ts.allele_frequency_spectrum(windows=[0, 0.5, 1], polarised=True, span_normalise=False) print(afs) ``` This time, we've asked for the number of sites at each frequency in two -equal windows. Now we can see that in the first half of the sequence we -have three sites (compare with the site table above): one singleton, -one doubleton and one tripleton. +equally long windows. Now we can see that in the first half of the sequence we +have one singleton, one doubleton, and one tripleton. The second half of the +sequence has one singleton, five doubletons, and one 4-ton. ### Joint spectra We can also compute allele frequencies within multiple sets of samples, -the *joint allele frequency spectra*. +giving the *joint allele frequency spectra*. ```{code-cell} ipython3 node_colours = {0: "blue", 2: "blue", 3: "blue", 1: "green", 4: "green", 5: "green"} @@ -276,31 +398,29 @@ these belonging to different populations, for example). We can then compute the joint AFS based on these two sets: ```{code-cell} ipython3 -afs = ts.allele_frequency_spectrum([[0, 2, 3], [1, 4, 5]], polarised=True) +afs = ts.allele_frequency_spectrum(sample_sets=[[0, 2, 3], [1, 4, 5]], polarised=True) print(afs) ``` -Now, each window in our AFS is a 2D numpy array, where each dimension -corresponds to frequencies within the different sets. So, we see for example -that there are six sites that are singletons in both sets, 1 site -that is a doubleton in both sets, and 2 sites that singletons in $[1, 4, 5]$ -and not present in the other sample set. +This gave us a 2D numpy array, where each dimension corresponds to one sample set. +We see that there are: +two singletons in the second sample set and not present in the other sample set ($afs[0,1]$), +six singletons in both sets ($afs[1,1]$), +one doubleton in the first sample set that is singleton in the second sample set ($afs[1,2]$), +and one doubleton in both sets ($afs[2,2]$). ### Branch length spectra -Up to now we've used the {meth}`~TreeSequence.allele_frequency_spectrum` method -to summarise the number of sites that occur at different frequencies. We can also -use this approach to compute the total branch lengths subtending a given -number of samples by setting ``mode="branch"``: +Instead of summarising the number of sites that occur at different frequencies, +we can also compute the total branch lengths subtending a given number of samples +by setting ``mode="branch"``: ```{code-cell} ipython3 afs = ts.allele_frequency_spectrum(mode="branch", polarised=True, span_normalise=False) print(afs) ``` -Thus, the total branch length over example one sample is 4.86, over two is -5.39, and so on. - +Thus, the total branch length over one sample is 4.86, over two is 5.39, and so on. (sec_tutorial_afs_zeroth_entry)= @@ -309,18 +429,19 @@ Thus, the total branch length over example one sample is 4.86, over two is The zeroth element of the AFS is significant when we are working with sample sets that are a subset of all samples in the tree sequence. For example, in the following we compute the AFS within the sample set -[0, 1, 2]: +$[0, 1, 2]$: ```{code-cell} ipython3 -afs = ts.allele_frequency_spectrum([[0, 1, 2]], mode="branch", polarised=True) +afs = ts.allele_frequency_spectrum(sample_sets=[[0, 1, 2]], mode="branch", polarised=True) print(afs) ``` -Thus, the total branch length over 0, 1 and 2 is 5.3, and over pairs from this set -is 5.25. What does the zeroth value of 4.33 signify? This is the total branch length -over all samples that are **not** in this sample set. By including this value, we -maintain the property that for each tree, the sum of the AFS for any sample set -is always equal to the total branch length. For example, here we compute: +Thus, the total branch length over samples 0, 1, and 2 is 5.3, and over pairs +from this set is 5.25. What does the zeroth value of 4.33 mean? +This is the total branch length over all samples that are **not** in this sample set. +By including this value, we maintain the property that for each tree, +the sum of the AFS for any sample set is always equal to the total branch length. +For example, here we compute: ```{code-cell} ipython3 print("sum afs = ", np.sum(afs)) diff --git a/getting_started.md b/getting_started.md index 0c25bcd8..e5c23ec8 100644 --- a/getting_started.md +++ b/getting_started.md @@ -29,11 +29,10 @@ want to do: {ref}`save or export data`. Throughout, we'll also provide pointers to where you can learn more. - :::{note} The examples in this tutorial are all written using the {ref}`tskit:sec_python_api`, but it's also possible to -{ref}`use R `, or access the API in other languages, notably +{ref}`use R`, or access the API in other languages, notably {ref}`C` and [Rust](https://github.com/tskit-dev/tskit-rust). ::: @@ -119,7 +118,7 @@ forwards through trees in a tree sequence (or indeed backwards using the standar {func}`~py:reversed` function) is efficient. That means it's quick, for example to check if all the trees in a tree sequence have fully coalesced (which is to be expected in reverse-time, coalescent simulations, but not always for tree sequences produced by -{ref}`forward simulation `). +{ref}`forward simulation`). ```{code-cell} ipython3 import time @@ -173,7 +172,7 @@ print(f"Tree number {swept_tree.index}, which runs from position {intvl.left} to swept_tree.draw_svg(size=(1000, 200)) ``` :::{margin} -The {ref}`visualization tutorial ` gives more drawing possibilities +The {ref}`visualization tutorial` gives more drawing possibilities ::: This tree shows the classic signature of a recent expansion or selection event, with many @@ -227,7 +226,7 @@ sites or mutations in your analyses. For many purposes it may be better to focus on the genealogy of your samples, rather than the {ref}`sites` and {ref}`mutations` that -{ref}`define ` the genome sequence itself. Nevertheless, +{ref}`define` the genome sequence itself. Nevertheless, {program}`tskit` also provides efficient ways to return {class}`Site` object and {class}`Mutation` objects from a tree sequence. For instance, under the finite sites model of mutation that we used above, multiple mutations @@ -285,7 +284,7 @@ To find the actual allelic states at a site, you can refer to the {attr}`~Variant.alleles` provided for each {class}`Variant`: the genotype value is an index into this list. Here's one way to print them out; for clarity this example also prints out the IDs of both the sample nodes (i.e. the genomes) -and the diploid {ref}`individuals ` in which each sample +and the diploid {ref}`individuals` in which each sample node resides. ````{code-cell} ipython3 @@ -474,7 +473,7 @@ ac = h.count_alleles() allel.mean_pairwise_difference(ac) ``` -Comparative call with tskit is: +Comparable call with {program}`tskit` is: ```{code-cell} ipython3 # Per site (not span-normalised to match scikit-allel) diff --git a/terminology_and_concepts.md b/terminology_and_concepts.md index 6e672150..2d430776 100644 --- a/terminology_and_concepts.md +++ b/terminology_and_concepts.md @@ -79,7 +79,7 @@ the location of mutations in the genome. More technically, a tree sequence stores a population genetics object known as an Ancestral Recombination Graph (ARG). Below are the most important terms and concepts that you'll encounter in these tutorials. -A concise glossary of these terms and concepts is available at {ref}`here `. +A concise glossary of these terms and concepts is available at {ref}`here`. But first we'll {func}`~tskit.load` a tree sequence from a `.trees` file using the {ref}`tskit:sec_python_api` (which will be used in the rest of this tutorial): @@ -245,7 +245,7 @@ obtain two copies of each autosomal chromosome. The tree with six sample nodes a could therefore represent the result of sampling three diploid individuals from a larger population. The tree sequence can keep track of the individuals in which nodes reside, and store specific information about them (such as the individuals' spatial location) -as well as arbitrary {ref}`metadata ` (such as a name). In this particular +as well as arbitrary {ref}`metadata` (such as a name). In this particular tree sequence the sample nodes are indeed associated with three named diploid individuals: ``Ada``, ``Bob`` and ``Cat``. From b2ed60a3441ffd3bbf3eb4e9a62d629fc0bff7d9 Mon Sep 17 00:00:00 2001 From: Gregor Gorjanc Date: Sun, 6 Oct 2024 17:39:48 +0100 Subject: [PATCH 3/4] Fix label/link issue --- analysing_tree_sequences.md | 14 +++++++------- 1 file changed, 7 insertions(+), 7 deletions(-) diff --git a/analysing_tree_sequences.md b/analysing_tree_sequences.md index c1538a9d..f20f8801 100644 --- a/analysing_tree_sequences.md +++ b/analysing_tree_sequences.md @@ -86,7 +86,7 @@ print(f"Average diversity per unit sequence length = {d:e}") The above result tells us the average diversity across the samples and their whole sequence (that is their genomes), returning a single number. Note the emphasis on "average" in the above print statement. -See the {ref}`Span normalisation` sub-section +See the {ref}`Span normalisation` sub-section below for more details on this point. #### Windows @@ -118,8 +118,8 @@ the output dimensions of statistics are determined by the ``windows`` argument. #### Sample sets Suppose we wanted to compute average diversity within a specific subset of samples, -instead of all samples. We can do this using the {ref}`tskit:sec_stats_sample_sets` -argument: +instead of all samples. +We can do this using the {ref}`sample_sets` argument: ```{code-cell} ipython3 A = ts.samples()[:100] @@ -169,8 +169,8 @@ Above we referred to the *average* diversity per unit sequence length, either for the whole sequence or within windows. To facilitate comparison, tskit by default normalises statistics by the sequence length, returning average of a statistic. -When this is not what you want, you can use the {ref}`tskit:sec_stats_span_normalise` -argument: +When this is not what you want, +you can use the {ref}`span_normalise` argument: ```{code-cell} ipython3 davg = ts.diversity() @@ -226,7 +226,7 @@ ts.diversity(mode="branch") being 40534.4...)? Above we have computed statistics based on site mutations in the tree sequence. In tskit, these statistics are hence called "site-based". This "mode" of statistics is computed by default, but we can also compute "branch-based" and "node-based" -statistics using the {ref}`tskit:sec_stats_mode` argument: +statistics using the {ref}`mode` argument: ```{code-cell} ipython3 ds = ts.diversity() @@ -283,7 +283,7 @@ set A and B in each window. A powerful feature of tskit's statistics API is that we can compute multi-way statistics between multiple sets of samples simultaneously -using the {ref}`indexes` argument: ```{code-cell} ipython3 d = ts.divergence(sample_sets=[A, B, C], indexes=[(0, 1), (0, 2)]) From aea3587d180a9d5188613f2280931f41410c8557 Mon Sep 17 00:00:00 2001 From: Gregor Gorjanc Date: Mon, 7 Oct 2024 06:54:04 +0100 Subject: [PATCH 4/4] Fix another label/link issue --- analysing_tree_sequences.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/analysing_tree_sequences.md b/analysing_tree_sequences.md index f20f8801..ae88e5aa 100644 --- a/analysing_tree_sequences.md +++ b/analysing_tree_sequences.md @@ -86,7 +86,7 @@ print(f"Average diversity per unit sequence length = {d:e}") The above result tells us the average diversity across the samples and their whole sequence (that is their genomes), returning a single number. Note the emphasis on "average" in the above print statement. -See the {ref}`Span normalisation` sub-section +See the {ref}`Span normalisation` sub-section below for more details on this point. #### Windows