-
Thank you for this detailed report! We'll look into it. @petrelharp, do you know what the answer is here?
-
Yes, gee, thanks a lot for the great report. So, let's see. The only problem here is that you're passing in bad input and we're not catching it at load time. I think this is expected, and here's why: the information about which mutation is the parent to which other one is contained within the tables in nearly all cases, so the
This method uses order in the table to resolve ambiguous cases. Edit: you said there's an inconsistency, but I'm not seeing it - the docs say that the parent column is required in the case of multiple mutations at a given site, and that's what you've got here? What am I missing?
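To make the "order in the table" point concrete, here is a rough pure-Python sketch of how parents could be inferred from node ancestry plus row order. It is not tskit's implementation (tskit offers `TableCollection.compute_mutation_parents()` for this kind of inference); the tree, the `PARENT` map, and the `infer_mutation_parents` helper are all made up for illustration. It assumes the rows for a site are already ordered so that older mutations come before newer ones.

```python
# Hypothetical toy tree, stored as child -> parent. Node 6 is the root.
PARENT = {0: 4, 1: 4, 4: 5, 2: 5, 5: 6, 3: 6}


def infer_mutation_parents(mutations):
    """Infer a parent row index for each mutation.

    `mutations` is a list of (site, node) pairs in table order.
    Returns one index per row, with -1 meaning "no parent mutation".
    """
    parents = []
    for j, (site, node) in enumerate(mutations):
        found = -1
        u = node
        # Walk from the mutation's node towards the root.
        while u is not None and found == -1:
            # Among EARLIER rows, take the latest one at the same site on
            # node u. Row order is what breaks ties between several
            # mutations sitting on the same node.
            for k in range(j - 1, -1, -1):
                if mutations[k] == (site, u):
                    found = k
                    break
            u = PARENT.get(u)
        parents.append(found)
    return parents


# Three stacked mutations at site 0, plus an unrelated one at site 1.
rows = [(0, 5), (0, 4), (0, 4), (1, 0)]
print(infer_mutation_parents(rows))
```

The second and third rows sit on the same node, so only their order in the table says which one is the parent of the other.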
-
@petrelharp Thank you for the explanation, it clarifies all of our questions. The perceived inconsistency comes from our interpretation of the word "required": we assumed that an error or warning would be thrown if the parents were missing for a site with multiple mutations. After reading your response, I'm now interpreting the word "required" to mean "required for certain computations to be correct". It's also worth noting that the documentation does not mention
-
My team and I are familiarizing ourselves with the tskit data model and have come across an inconsistency between the documentation and the behavior of the library. The documentation of the Mutation Table specifies that "The `parent` column is only required in situations where there are multiple mutations at a given site. For “infinite sites” mutations, it can be ignored". To demonstrate this, we're performing a simple msprime simulation to generate some mutations for a site. Here is our example tree:
When we generate mutations with msprime, we observe the expected behavior: sites with multiple mutations have parents. In this case, we focus on site 4, which has 3 mutations:
If we create a new tree sequence with the same mutations for site 4 but without the parent column, loading and computing statistics such as diversity raise no errors. Even so, the two trees produce differing diversity results (0.6 for the tree with malformed data, 0.516 for the tree with proper data). We believe the discrepancy stems from the way that parents are handled in the site general stat code.
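As a rough illustration of how a missing parent could shift the result, here is a toy pure-Python model of per-site allele counting. It is a sketch of our understanding, not tskit's code: the tree, the mutation rows, and the helper names are hypothetical, and the numbers it prints are unrelated to the 0.6 and 0.516 above. The assumption is that each mutation moves the samples below its node from the allele it replaces (the parent's derived state, or the ancestral state when the parent is -1) to its own derived state.

```python
from collections import Counter

# Hypothetical toy tree, stored as child -> parent. Node 6 is the root.
PARENT = {0: 4, 1: 4, 4: 5, 2: 5, 5: 6, 3: 6}
SAMPLES = [0, 1, 2, 3]


def samples_below(node):
    """Count the samples whose path to the root passes through `node`."""
    total = 0
    for s in SAMPLES:
        u = s
        while u is not None:
            if u == node:
                total += 1
                break
            u = PARENT.get(u)
    return total


def allele_counts(ancestral, mutations):
    """mutations: list of (node, derived_state, parent_row) in table order."""
    counts = Counter({ancestral: len(SAMPLES)})
    for node, derived, parent in mutations:
        n = samples_below(node)
        # Assumed rule: with no parent, the replaced allele is the ancestral one.
        replaced = ancestral if parent == -1 else mutations[parent][1]
        counts[derived] += n
        counts[replaced] -= n
    return counts


def diversity(counts):
    """Mean pairwise differences at one site, from allele counts."""
    n = len(SAMPLES)
    return sum(c * (n - c) for c in counts.values()) / (n * (n - 1))


with_parents = [(5, "C", -1), (4, "G", 0)]      # second mutation sits under the first
without_parents = [(5, "C", -1), (4, "G", -1)]  # same rows, parent column dropped

print(allele_counts("A", with_parents), diversity(allele_counts("A", with_parents)))
print(allele_counts("A", without_parents), diversity(allele_counts("A", without_parents)))
```

In this toy case, dropping the parent makes the second mutation subtract from the ancestral allele instead of from "C", so the ancestral count goes negative and the diversity value changes, with nothing raising an error along the way.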
For more detail, here is the Jupyter notebook that was used to generate the data/observations listed above.
This leaves us with the following questions: