Null parents on sites with multiple mutations #2496

lkirk · 2022-09-07T17:27:37Z

lkirk
Sep 7, 2022
Collaborator

My team and I are familiarizing ourselves with the tskit data model and have come across an inconsistency between the documentation and the behavior of the library. The documentation of the Mutation Table specifies that "The parent column is only required in situations where there are multiple mutations at a given site. For “infinite sites” mutations, it can be ignored".

To demonstrate this, we're performing a simple msprime simulation to generate some mutations for a site. Here is our example tree:

When we generate mutations with msprime, we observe the expected behavior: sites with multiple mutations have parents. In this case, we focus on site 4, which has 3 mutations:

If we create a new tree sequence with the same mutations for site 4 but without the parent column, there are no errors in loading or in computing statistics such as diversity. Even so, the two tree sequences produce different diversity results (0.6 for the one with malformed data, 0.516 for the one with proper data). We believe the discrepancy stems from the way that parents are handled in the site general stat code.
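To illustrate how a missing parent could skew a site statistic, here is a minimal pure-Python sketch. It is not tskit's code: it assumes that allele counts are accumulated mutation by mutation, with each mutation taking the samples below its node away from its parent's allele, or from the ancestral allele when the parent is null. The toy tree, the `allele_counts` and `site_diversity` helpers, and the dictionary layout are all hypothetical and unrelated to the simulation above.

```python
NULL = -1  # sentinel for "no parent mutation"


def allele_counts(num_samples, ancestral_state, mutations, samples_below):
    """Count samples per allele at one site (simplified, assumed model).

    mutations: list of dicts with "node", "derived" and "parent" keys,
    ordered so that a parent mutation precedes its children.
    samples_below: maps a node to the number of samples beneath it.
    """
    counts = {ancestral_state: num_samples}
    for mut in mutations:
        below = samples_below[mut["node"]]
        # Samples below the mutation's node gain the derived allele ...
        counts[mut["derived"]] = counts.get(mut["derived"], 0) + below
        # ... and lose the allele they would otherwise have inherited.
        if mut["parent"] == NULL:
            replaced = ancestral_state
        else:
            replaced = mutations[mut["parent"]]["derived"]
        counts[replaced] = counts.get(replaced, 0) - below
    return counts


def site_diversity(counts, num_samples):
    """Fraction of ordered sample pairs that carry different alleles."""
    n = num_samples
    return sum(c * (n - c) for c in counts.values()) / (n * (n - 1))


# Toy tree with 4 samples: node 5 sits above 3 samples, node 4 above 2 of those.
samples_below = {4: 2, 5: 3}
good = [
    {"node": 5, "derived": "T", "parent": NULL},
    {"node": 4, "derived": "G", "parent": 0},  # nested under mutation 0
]
bad = [dict(m, parent=NULL) for m in good]  # same mutations, parents dropped

good_counts = allele_counts(4, "A", good, samples_below)
bad_counts = allele_counts(4, "A", bad, samples_below)
print(good_counts, site_diversity(good_counts, 4))
print(bad_counts, site_diversity(bad_counts, 4))
```

In this toy case, dropping the parent charges the nested mutation against the ancestral allele instead of against "T", so the ancestral count goes negative and the diversity value changes, even though nothing raises an error.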

For more detail, here is the Jupyter notebook that was used to generate the data/observations listed above.

This leaves us with the following questions:

Is this expected behavior? If yes, the documentation might need a bit of clarification around this.
If this is not expected behavior, should tskit validate the presence of parents in sites with multiple alleles?

benjeffery · 2022-09-07T21:58:38Z

benjeffery
Sep 7, 2022
Maintainer

Thank you for this detailed report! We'll have a look into it - @petrelharp do you know what the answer is here?

petrelharp · 2022-09-09T17:50:56Z

petrelharp
Sep 9, 2022
Maintainer

Yes, gee, thanks a lot for the great report. So, let's see. The only problem here is that you're passing in bad input and we're not catching it at load time. I think this is expected, and here's why: the information about which mutation is the parent to which other one is contained within the tables in nearly all cases, so the parent column in the mutation table is redundant except in the case when two mutations on the same branch have the same time. We provide the method tables.compute_mutation_parents() to fill out that column for you, and calling it on the table without the parent column produces the right answer:

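# Take an editable copy of the tables (ts.tables is intended for read-only access)
t = ts_bad_mutations.dump_tables()
# Fill in the parent column from the tree topology and the table order
t.compute_mutation_parents()
tts = t.tree_sequence()
tts.diversity()
# array(0.51666667)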

This method uses order in the table to resolve ambiguous cases.
When we wrote compute_mutation_parents() we discussed re-computing the parents column at load time to catch situations like this; however, it's kinda expensive, and we figured that anyone sophisticated enough to be producing their own tables could either fill out that column or use compute_mutation_parents() to fill it out. You could make the argument that we should be doing that check, if you like, but we'd want to do some profiling to see how it affects load times, I think.
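To make the redundancy concrete, here is a rough pure-Python sketch of how a parent column could be recovered from a single tree plus the table order. It is a simplification for illustration, not the implementation behind compute_mutation_parents(): it handles one tree only, assumes that older mutations at a site are listed before younger ones, and the `infer_mutation_parents` name and the `(site, node)` tuple layout are made up for this example.

```python
NULL = -1  # sentinel for "no parent"


def infer_mutation_parents(node_parent, mutations):
    """Infer a parent index for each mutation (illustrative sketch).

    node_parent: maps each node to its parent node in a single tree
    (the root is absent from the mapping or maps to NULL).
    mutations: list of (site, node) tuples in table order, with older
    mutations at a site listed before younger ones.
    """
    parents = []
    last_on_node = {}  # (site, node) -> index of latest mutation seen there
    for index, (site, node) in enumerate(mutations):
        found = NULL
        u = node
        # Walk from the mutation's own node towards the root and stop at
        # the first node that already carries a mutation at this site.
        while u != NULL:
            if (site, u) in last_on_node:
                found = last_on_node[(site, u)]
                break
            u = node_parent.get(u, NULL)
        parents.append(found)
        # Later mutations on this node or below it now descend from this one.
        last_on_node[(site, node)] = index
    return parents


# Toy tree: samples 0-3, with internal nodes 4 = (0, 1), 5 = (4, 2), 6 = (5, 3).
node_parent = {0: 4, 1: 4, 4: 5, 2: 5, 5: 6, 3: 6}
mutations = [
    (0, 5),  # oldest mutation at site 0
    (0, 4),  # on a branch below node 5
    (0, 4),  # same branch again: only table order can tell these two apart
    (1, 0),  # the only mutation at site 1
]
print(infer_mutation_parents(node_parent, mutations))
```

The third mutation is the ambiguous kind: it shares a branch with the second, so only its position in the table marks it as the child.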

Edit: you said there's an inconsistency, but I'm not seeing it - the docs say that the parent column is required in the case of multiple mutations at a given site, and that's what you've got here? What am I missing?

lkirk · 2022-09-12T15:28:22Z

lkirk
Sep 12, 2022
Collaborator Author

@petrelharp Thank you for the explanation, it clarifies all of our questions. The perceived inconsistency comes from our interpretation of the word "required": we assumed that an error or warning would be thrown if the parents were missing for a site with multiple mutations. After reading your response, I'm now interpreting the word "required" to mean "required for certain computations to be correct".

It's also worth noting that the documentation does not mention compute_mutation_parents() in this context. It might be the case that compute_mutation_parents() is too expensive to run at load time, but it might be useful to mention its existence in the documentation for the parent column. We'd be happy to open a PR against the documentation to add a bit of clarifying language if you think that compute_mutation_parents() would be appropriate to mention in the description of the parent column.

1 reply

jeromekelleher Sep 12, 2022
Maintainer

Thanks @lkirk, any updates to clarify the documentation would be much appreciated! This is one of the few places that "required" doesn't mean "is validated at load time", so we should perhaps change the wording.
