-
Notifications
You must be signed in to change notification settings - Fork 74
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Specify sites for two-site statistics #2863
Conversation
c/tskit/trees.c
Outdated
if (ret != 0) { \ | ||
goto out; \ | ||
} \ | ||
} while (0); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This macro simplifies the loops below, but I'll admit, it does make debugging a bit more difficult
if (sites == NULL) { | ||
goto out; | ||
} | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd like to add a few more error values, but I assumed that would merit some discussion.
Something like: TSK_ERR_DUPLICATE_SITES
, and TSK_ERR_NULL_SITE
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's file a follow-up issue on this one - additional error cases for 2 site stats.
The matrix filling is now a bit more complicated that it was initially. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've had a quick look and overall LGTM
Can you explain the rationale for doing the site index lookup, and what your assumptions about the input are please? I'm not quite following what you're checking for, and why.
Requiring that the row and cols are sorted seems reasonable all right.
Also, I don't follow the need for the site index and duplicate checking. Does this happen in common cases? Could we not deal with this more easily at a higher level, if so?
c/tskit/trees.c
Outdated
tsk_id_t id; | ||
tsk_size_t i; | ||
|
||
if (sites == NULL) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's the interpretation here? I would have though this would be an error
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is mistakenly leftover from when I thought I'd specify the defaults in C. If no sites were specified and we were going to use all of the sites, we wouldn't need to validate them, we'd just want to use all available sites. However, I think I decided it'd be much simpler to specify the defaults on the python side. Do you agree? If so, I'll make this condition an error.
c/tskit/trees.c
Outdated
} | ||
|
||
for (i = 0; i < num_sites - 1; i++) { | ||
id = sites[i]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can use a generic check like this:
if (sites[i] < 0 || sites[i] >= num_sites) {
ret = TSK_ERR_SITE_OUT_OF_BOUNDS
goto out
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is more straightforward. If we've encountered an out of bounds site, immediately fail.
Since we're checking pairs, we'll still need to handle the last site separately.
c/tskit/trees.c
Outdated
ret = TSK_ERR_BAD_PARAM_VALUE; // TODO: new err type? TSK_BAD_SITE_VALUE? | ||
goto out; | ||
} | ||
if (id > sites[i + 1]) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does it matter if there are repeats? We're just specifying things to compute in the output matrix, right? I worry we're putting effort in to stop people doing "stupid" things, when this might also stop them doing clever things
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Repeats are going to complicate the indexing of the output array. We'd have to track the number of shared for rows and columns, we wouldn't be able to rely on their union. Also, we'd need a bit of rework when walking the trees because we assume that the sites are ascending and not repeating.
In short, I think it's possible to support repeats, but it's not currently supported. Do you think it's something we should be supporting?
c/tskit/trees.c
Outdated
} | ||
} | ||
// check if the last or only value is null | ||
if (num_sites != 0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why are we doing this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We check if every item is non-null and ordered by iterating from 0->n-1. We then need to check the last item if it's null. Also, since we have checked if the sites are ordered, by checking if the last site exceeds the length of the site rows table, we can ensure we're not specifying a site that's out of bounds. I see now I'm checking ==
instead of sites[num_sites - 1] >= num_site_rows
... an error.
This case also applies if only one site has been specified.
Codecov ReportAttention:
Additional details and impacted files@@ Coverage Diff @@
## main #2863 +/- ##
==========================================
- Coverage 89.71% 89.70% -0.02%
==========================================
Files 30 30
Lines 30243 30296 +53
Branches 5883 5634 -249
==========================================
+ Hits 27134 27178 +44
- Misses 1780 1784 +4
- Partials 1329 1334 +5
Flags with carried forward coverage won't be shown. Click here to find out more.
... and 4 files with indirect coverage changes Continue to review full report in Codecov by Sentry.
|
Sorry about the input validation, there were a couple of issues with that code that made it confusing. I've cleared it up now I think. My only question is how to fail if no sites are specified. Do we want to do a The site index lookup is needed because we're only collecting data for a subset of sites. We first store all of the sites we want to collect data from, then use that list while walking over the trees. When we compute statistics, we index into the data we've gathered with the The assumptions about the input sites are that they are ordered, valid sites (non-null and in-range), and non-duplicated. As I've mentioned in my comments above, we're not equipped to handle duplicates, but I could add code that allows for duplicates if that seems like a reasonable requirement. Overall, I think that we'll specify sites in python and if they're not specified, we can pass through an array of all sites. The C code can simply error if no sites are specified, but I think it's worth validating on the C side if we intend to make this part of the public C api surface. |
Thanks for the explanations @lkirk. I'm afraid I'm still not really getting the algorithm here, and this is where I'm really missing having a straightforward Python implementation --- I find it very hard to think about actual algorithmic processes when it's obscured by all the C-stuff. Do you have a Python prototype of this code somewhere? I think we're going to need it at some point for maintainability anyway, so maybe we could switch to creating a I think the basic "shape" (and API) of this is good, I just don't follow the details of the "indexing" code. From a quick glance I can see it has some memory management issues, but I want to get my head around what it does and why before jumping in to that. In general, I think we should try to develop the algorithm in Python first, test, and then implement in C. |
I have a python prototype here, but it has fallen out of sync with the c implementation. It doesn't have sample sets or the ability to subset sites. In any case, it doesn't follow the c implementation and is more of an orthogonal check I'd say. I can add these new features to it, and place it in the python tests dir, but I'll need another day or so to get that done. In the meantime, I tried to come up with another document in an attempt to clarify the details of the indexing: |
I've added a python test file that currently tests itself ( |
Hi @jeromekelleher, just checking in on this. Is there anything more I can do to clarify the inner workings of the indexing algo? The python code I've attached mirrors the indexing closely. Lines 202-226 of the python code, along with the written documentation I've attached should explain things relatively well. I'm happy to have a zoom call or provide some further explanation if needed. |
Whoops, sorry @lkirk - I've been trying to get a manuscript over the line and been focusing completely on that for a few days. I'll have a good look at this tomorrow. |
Thanks @lkirk, I think I'm starting to get the idea here... The variable names on the indexing structure aren't that obvious to me, so some descriptive comments in the Python version would be helpful. I still don't follow why we think there'll be a lot of shared sites between the rows and the cols though. Do you have some ideas for applications where we expect this? And some quantitative ideas on how much time you'd expect to save by doing this indexing? Premature optimisation is a real thing, and I don't know how many times I've come up with some complicated fancy idea for how to do something, only to be disappointed to see that it's actually slower than the simple, obvious method. So, I would always now implement things in the simple, obvious way first, even if (or especially, maybe) I think I know a better way. |
Hi @jeromekelleher sorry for my late reply. Thank you for taking another look. I've thought about if we might be over-complicating things here. I think any way we slice it, it's going be complicated to index into gathered data, site ids, and result matrix. I tried to write a reduced version of this code and it wasn't a whole lot simpler. The issue is that we need to maintain a union of the sites and its corresponding indices (which happen to be the same indices as the gathered data/state matrix). Then we also need to store the row/column indices into the results matrix. All said, if we eliminated shared/diff tracking, we could get rid of 4 struct members in This complexity stems from the requirement that we compute LD for arbitrary NxM matricies. If it was an NxN matrix, this would be easier. The reason we want to track shared sites between rows and columns is the case where a user specifies a square output matrix where the row sites and column sites are identical (a very common case). We only want to compute the upper triangle and reflect to the lower triangle of the output matrix because computations across the diagonal are redundant. Most of the time is spent on intersecting sample sets between sites, so we want to avoid redundant computations as much as possible (IMO). I did a simple benchmark where I took chromosome 3 from the SGDP trees inferred by tsinfer and ran this code on:
We can see that by tracking overlaps and reflecting we get an UP TO 2x speedup in a common usecase. The nice thing is that we also handle edge cases where we can simply specify arbitrary lists of site ids, retaining any benefits from avoiding redundant computations. That said, I'm happy to consider alternative approaches. I also went ahead and tried to clean up the python prototype, renaming the indexing object attributes to (hopefully) more descriptive names. I also made a first pass at documenting the python code so it's hopefully more understandable. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry I've been so slow @lkirk - this is tricky stuff and I haven't found time to get by head into it. I started getting my head back into it here, but I've gotten bogged down again. Maybe we should organise a call for next week so we can go through the code together?
python/tests/test_ld_matrix.py
Outdated
states.append(mutation.derived_state) | ||
current_state = len(states) - 1 | ||
|
||
for sample_id in tree.samples(mutation.node): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think this is right - you're assuming that the samples are 0 to n - 1 here, but that's not true in general. What you need to do is use the sample_index_map:
sample_index_map = np.zeros(ts.num_nodes, dtype=int32) - 1
for j, u in enumerate(ts.samples()):
sample_index_map[u] = j
# Later:
for u in tree.samples(mutation.node):
sample_index = sample_index_map[u]
mutation[site.id, sample_index] = current_state
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I want to punt this change to when I write all of the python tests for ld_matrix. I don't currently have any data to test the sample index map with, but I will create some when I make those tests.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, can we drop a comment to this effect in here so we don't forget please?
python/tests/test_ld_matrix.py
Outdated
for all samples and sites in the index. | ||
""" | ||
state = np.zeros((len(sites), ts.num_samples), dtype=int) | ||
current_site = 0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No point in tracking current_site
here, just use site.id
below.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is still relevant
tsk_size_t i; | ||
|
||
if (sites == NULL || num_sites == 0) { | ||
ret = TSK_ERR_BAD_SITE_POSITION; // TODO: error should be no sites? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Still considering a TSK_ERR_NULL_SITES
error here or something of the sort.
Hi @jeromekelleher sorry for the delay in fixing this up. I've covered the issues that we discussed during our meeting:
I think this is about ready, please let me know if there are any other blockers that I need to address before it's ready to merge. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've gone through again and I'm afraid I still haven't managed to get my head fully around the SiteMatrixIndex. I feel like it could be substantially simpler, and I'm having to spend a lot of time understanding how it works. Unless the code is easy to follow, it won't get maintained or extended, so I do want to get this right.
This is a pity, because it's just an implementation detail, and it's blocking adding the actual functionality.
My suggestion here would be to take a backup branch of this PR, and just pull out the SiteMatrixIndex stuff from this PR, using a naive and direct algorithm instead. Let's not worry about absolute performance, and just focus on getting the functionality in and do optimisations afterwards. I think this is really just a case of deleting code, as the actual logic can be recast really easily using the "compute" primitive you have already.
Then, once we have the functionality in we can come back to the SiteMatrixIndex logic, which should be a simple diff from your current commit to main, and easy to recover.
I know this is annoying, and I'm sorry to make this so slow. It will be much easier to understand your optimisations when they are isolated though, and I'm confident we can get them in quickly when they are pulled out by themselves.
c/tskit/trees.h
Outdated
@@ -1818,6 +1816,29 @@ bool tsk_tree_position_prev(tsk_tree_position_t *self); | |||
int tsk_tree_position_seek_forward(tsk_tree_position_t *self, tsk_id_t index); | |||
int tsk_tree_position_seek_backward(tsk_tree_position_t *self, tsk_id_t index); | |||
|
|||
// These are in the header for testing purposes only | |||
struct stat_matrix_site_indicies { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
typo probably should be, stat_matrix_site_indices
. Elsewhere we use the indexes
plural form, FYI but whatever you prefer.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess strictly speaking if we put this in the public header we should also put in a tsk_
prefix so that we're fulfilling the contracts we specify in the docs. (Names space collisions are unlikely I agree, but we do say that all public symbols start with a tsk_
prefix)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also might as well follow the usual conventions and typedef
it as tsk_stat_matrix_site_indexes
(rather than as a struct)
python/tests/test_ld_matrix.py
Outdated
for all samples and sites in the index. | ||
""" | ||
state = np.zeros((len(sites), ts.num_samples), dtype=int) | ||
current_site = 0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is still relevant
python/tests/test_ld_matrix.py
Outdated
states.append(mutation.derived_state) | ||
current_state = len(states) - 1 | ||
|
||
for sample_id in tree.samples(mutation.node): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, can we drop a comment to this effect in here so we don't forget please?
python/tests/test_ld_matrix.py
Outdated
raise ValueError(f"Site out of bounds: {sites[i + 1]}") | ||
|
||
|
||
def get_matrix_indices(row_sites, col_sites): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this would be more naturally done as the constructor to SiteMatrixIndex, so that we call idx = SiteMatrixIndex(row_sites, col_sites)
in client code. Here, I found myself going up and down the file to try and figure out what the fields were actually doing, where we should really just encapsulate everything in the SiteMatrixIndex class.
python/tests/test_ld_matrix.py
Outdated
|
||
def compute(left_states, right_states): | ||
hap_counts = np.zeros((np.max(left_states) + 1, np.max(right_states) + 1)) | ||
# compute the haplotype matrix |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is still confusing me - what's the haplotype matrix? (I'm pretty sure it's not the same as what we get when calling ts.haplotypes()
, e.g.). Can we rename this please?
@jeromekelleher I agree with you. It could definitely be simpler and it'll be a bit of technical debt looming over this code. I'll pull it out and make things simpler. I can have something by Friday for you to look at. Along these lines... I was thinking about writing the python code from the bottom up to mirror the C code. Would that make things simpler in the long run? I'm happy to do that too. |
That sounds great, thanks @lkirk . Mirroring the python and c implementations (to a reasonable degree, no need to be too literal) would help all right |
import tskit | ||
|
||
|
||
class BitSet: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've modeled this class after the BitSet that I'm imagining for #2834.
The only strange thing about the API is that the intersect method saves to a third length-1 array, while the rest of the methods modify the calling instance. I suppose union
and difference
could write out to an out
array like intersect
, but it would be clunky and slightly confusing to be passing the calling instance back into the method.
In any case, I'm not sure how much effort should be put into this API since technically only the two-locus code is using it for now.
ed1a86b
to
a7e8609
Compare
Okay, so that took a little longer than expected. I've simplified things @jeromekelleher. There are no longer any optimizations or fancy index tracking around the results matrix. We're now just filling the matrix from top to bottom, using the exact indices specified by the user. The python code now closely mirrors the C code (except for the simple refactor of bit arrays that I commented on in the code). I've also annotated the python code with types (compatible with all supported versions of python) to add some further documentation. Code coverage is up, tests are passing and merge conflicts are handled. I'm happy to hear any feedback and/or points of confusion. Hopefully this makes things more understandable. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great! A couple of small issues, but looks good to merge then. Thanks for your patience @lkirk!
// If we've hit the last site in our index, finish up | ||
if (site_idx >= n_sites) { | ||
ret = 0; | ||
goto out; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We only use gotos for "exception handling", so it would be good to change the flow a bit here. I'm also not sure that you're correctly iterating over the trees (see comment above). How about this?
for (site_idx=0; site_id < num_sites; site_id++) {
tsk_treeseq_get_site(ts, sites[site_idx], &site);
ret = tsk_tree_seek(&tree, site.position, 0);
if (ret < 0) {
goto out;
}
tsk_bug_assert(ret == TSK_TREE_OK);
// do stuff with site and tree
}
The call to tsk_tree_seek
will exit quickly when it's no-op, so it'll be just as fast.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, I see what you're saying. After a few tests, I see what's going on. The issue I have in your proposed solution is that we'd also be calling realloc for every site. From what I understand, realloc is not specified in the C standard (GNU returns the same memory when realloc is called with the same size, but I'm not sure about other libc implementations -- see here in the portability notes).
I've proposed a solution (my most recent commit) where we seek to the first tree, then keep seeking until we run out of sites or trees.
What do you think about this? I can implement what you've proposed if you think that's the better, more straightforward route.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Generally, things have to be correct first and then fast afterwards. You make them correct by doing things simply first. Then you profile, and if it's not fast enough, you optimise based on this profiling.
The realloc thing you identify is a small detail (which is trivially worked around, if it really is a problem). You haven't profiled this, so you don't know if it is a problem. That is not sufficient reason to make the code so much more complicated.
I do understand the impulse to optimise, but you're going about it the wrong way. I've found this classic worse is better essay very illuminating, and has convinced me of the importance of simplicity of implementation, almost above all other considerations.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for passing that along, I've heard of "worse is better" colloquially, but I haven't seen this essay before. I agree with it and like the ideas, I just need to translate them into my software design.
I've adopted your suggestions and things look a bit simpler. tsk_tree_seek
always returns 0 unless there's an error, so there's no need to make any further asserts as far as I can tell.
I also updated the python code to reflect the change.
c/tskit/trees.c
Outdated
// interest. We rely on our sites bounds checking to access the first site position | ||
site_offset = 0; | ||
site_idx = 0; | ||
for (ret = tsk_tree_seek(&tree, ts->tables->sites.position[sites[0]], 0); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this correct? Won't you seek to the first tree over and over doing this? See below for suggested restructuring of flow control.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're right about that. I didn't realize that this would cause the loop to wrap around until I messed with it in isolation. See my detailed comment below.
if (sites == NULL) { | ||
goto out; | ||
} | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's file a follow-up issue on this one - additional error cases for 2 site stats.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great! Let's squash-n-merge 🎉
This allows a user to specify an array of row sites to be compared against an array of column sites. Rectangular matricies are returned as a result of these comparisons. We weighed the pros/cons of the indexing performance and redundant computation. The initial implementation avoided redundant computation, but at the expense of complexity and readability. The final implementation is much more simple and DOES perform redundant computations if the row/col sites cross the diagonal of the LD matrix. We will revisit if/when this becomes a problem. In addition to these changes, I've cleaned up the results matrix allocation. Since the python code will be allocating and managing this memory, we'll pull out the allocation code. C tests have been added to capture all of this new functionality. A type annotated python prototype that mirrors the C functionality was added to provide some documentation on how the C algorithms work, and to aid in the testing of the C code once the python API is ready.
d675999
to
979fc0f
Compare
Squashed and ready. I'll make that issue for the error types today. I also talked with @apragsdale and he suggests we make an issue for tracking the matrix optimizations if we ever decide to remove redundant computations. Does that sound reasonable? |
Sounds good to me. Let's get the Python-C plumbing setup up next and so we can start running this stuff! |
Description
As discussed on #2830, I've set the foundation on the C side of things for specifying row/column sites. I've also got a bit of the cpython code that wraps this once we're happy with how the c-api looks.
This allows a user to specify an array of row sites to be compared against an array of column sites. To avoid computing redundant entries in the matrix, we implement an index that tracks items and their overlap. When we encounter entries that are redundant, we reflect their value across the diagonal of the matrix (for the valid square sub-matrix that crosses the diagonal). I've attached a diagram to the pull request that describes this in more detail.
In addition to these changes, I've cleaned up the results matrix allocation. Since the python code will be allocating and managing this memory, we'll pull out the allocation code.
I've added tests to capture all of this new functionality. I'm sure we'll need to do a final sweep for coverage once we're done with review.
PR Checklist: