-
Notifications
You must be signed in to change notification settings - Fork 10
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Save flags and metadata when splitting disjoint nodes #382
Conversation
@nspope : I also added the following to the changelog. Do my short summaries read OK?
|
Also note the API for adding metadata information on split nodes: I'm unsure if this seems a bit hacky? But it seems good to me to always be able to run the routine by default. |
LGTM! Let me know when you're ready for a review. |
Definitely ready for a review, thanks! Re your comment about |
After a quick chat with Jerome, we think that (a) this routine is mostly tsinfer-specific so (b) we should probably always try to set the metadata, and simply warn if this is not possible. This could probably apply to the other tsinfer-set metadata fields like ancestor_data_id too. |
And add split_nodes to preprocess_ts. Fixes tskit-dev#373
OK, I think this is ready to go. Does it look OK to you @nspope ? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, just a couple comments re: tests. Also, just out of curiosity, I'm wondering what the timing of this routine on the 40K GEL data is, versus the old version (that didn't do anything with metadata). If there's a sizeable gap, lets open an issue about moving some of the internals into numba.
|
||
|
||
class TestSplitDisjointNodes: | ||
def test_nosplit(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
would also be good to test that split_disjoint_nodes leaves a tree sequence unchanged if it's already had the routine called on it
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
also it looks like I stuck "TestNodeSplitting" class into test_functions.py, let's move it over to this file (I think it includes something like the test I suggest above)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great, done. I'll get Ben to test the speed, then merge.
And address other comments
I didn't test on the 40K ts, but I did on chr2 of the unified genealogy (7524 samples), which isn't too bad a test, as The old routine took 15 secs to run. The new one, having set a permissive_json schema, takes 25 secs. I think that's fine, so I'll merge. At the moment, tsinfer doesn't set a node metadata schema, so the metadata is not saved anyway (and so it only takes 15 secs even with the new code). When we fix tsinfer to set a proper schema, it will take a little longer on large tree sequences, but still a fraction of the time required for simplification. |
And add split_nodes to preprocess_ts. Fixes #373