Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Specify sites for two-site statistics #2863

Merged
merged 1 commit into from
Jan 25, 2024

Conversation

lkirk
Copy link
Contributor

@lkirk lkirk commented Oct 16, 2023

Description

As discussed on #2830, I've set the foundation on the C side of things for specifying row/column sites. I've also got a bit of the cpython code that wraps this once we're happy with how the c-api looks.

This allows a user to specify an array of row sites to be compared against an array of column sites. To avoid computing redundant entries in the matrix, we implement an index that tracks items and their overlap. When we encounter entries that are redundant, we reflect their value across the diagonal of the matrix (for the valid square sub-matrix that crosses the diagonal). I've attached a diagram to the pull request that describes this in more detail.

In addition to these changes, I've cleaned up the results matrix allocation. Since the python code will be allocating and managing this memory, we'll pull out the allocation code.

I've added tests to capture all of this new functionality. I'm sure we'll need to do a final sweep for coverage once we're done with review.

PR Checklist:

  • Tests that fully cover new/changed functionality.
  • Documentation including tutorial content if appropriate.
  • Changelogs, if there are API changes.

c/tskit/trees.c Outdated
if (ret != 0) { \
goto out; \
} \
} while (0);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This macro simplifies the loops below, but I'll admit, it does make debugging a bit more difficult

if (sites == NULL) {
goto out;
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd like to add a few more error values, but I assumed that would merit some discussion.

Something like: TSK_ERR_DUPLICATE_SITES, and TSK_ERR_NULL_SITE.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's file a follow-up issue on this one - additional error cases for 2 site stats.

@lkirk
Copy link
Contributor Author

lkirk commented Oct 16, 2023

The matrix filling is now a bit more complicated that it was initially.
Here's a diagram of how it works: Matrix Filling.pdf. In the case of computing an NxN matrix, this will reduce redundant computations substantially, allowing us to compute the upper triangle and reflect it to make the results matrix. It also allows us to compute a rectangular NxM matrix while retaining the benefits of avoiding redundant computations (i.e. computing r^2 for site 0/1 and 1/0).

Copy link
Member

@jeromekelleher jeromekelleher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've had a quick look and overall LGTM

Can you explain the rationale for doing the site index lookup, and what your assumptions about the input are please? I'm not quite following what you're checking for, and why.

Requiring that the row and cols are sorted seems reasonable all right.

Also, I don't follow the need for the site index and duplicate checking. Does this happen in common cases? Could we not deal with this more easily at a higher level, if so?

c/tskit/trees.c Outdated
tsk_id_t id;
tsk_size_t i;

if (sites == NULL) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the interpretation here? I would have though this would be an error

Copy link
Contributor Author

@lkirk lkirk Oct 16, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is mistakenly leftover from when I thought I'd specify the defaults in C. If no sites were specified and we were going to use all of the sites, we wouldn't need to validate them, we'd just want to use all available sites. However, I think I decided it'd be much simpler to specify the defaults on the python side. Do you agree? If so, I'll make this condition an error.

c/tskit/trees.c Outdated
}

for (i = 0; i < num_sites - 1; i++) {
id = sites[i];
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can use a generic check like this:

if (sites[i] < 0 || sites[i] >= num_sites) {
    ret = TSK_ERR_SITE_OUT_OF_BOUNDS
    goto out

}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is more straightforward. If we've encountered an out of bounds site, immediately fail.

Since we're checking pairs, we'll still need to handle the last site separately.

c/tskit/trees.c Outdated
ret = TSK_ERR_BAD_PARAM_VALUE; // TODO: new err type? TSK_BAD_SITE_VALUE?
goto out;
}
if (id > sites[i + 1]) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does it matter if there are repeats? We're just specifying things to compute in the output matrix, right? I worry we're putting effort in to stop people doing "stupid" things, when this might also stop them doing clever things

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Repeats are going to complicate the indexing of the output array. We'd have to track the number of shared for rows and columns, we wouldn't be able to rely on their union. Also, we'd need a bit of rework when walking the trees because we assume that the sites are ascending and not repeating.

In short, I think it's possible to support repeats, but it's not currently supported. Do you think it's something we should be supporting?

c/tskit/trees.c Outdated
}
}
// check if the last or only value is null
if (num_sites != 0
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are we doing this?

Copy link
Contributor Author

@lkirk lkirk Oct 16, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We check if every item is non-null and ordered by iterating from 0->n-1. We then need to check the last item if it's null. Also, since we have checked if the sites are ordered, by checking if the last site exceeds the length of the site rows table, we can ensure we're not specifying a site that's out of bounds. I see now I'm checking == instead of sites[num_sites - 1] >= num_site_rows... an error.

This case also applies if only one site has been specified.

@codecov
Copy link

codecov bot commented Oct 17, 2023

Codecov Report

Attention: 11 lines in your changes are missing coverage. Please review.

Comparison is base (2dae133) 89.71% compared to head (979fc0f) 89.70%.

Additional details and impacted files

Impacted file tree graph

@@            Coverage Diff             @@
##             main    #2863      +/-   ##
==========================================
- Coverage   89.71%   89.70%   -0.02%     
==========================================
  Files          30       30              
  Lines       30243    30296      +53     
  Branches     5883     5634     -249     
==========================================
+ Hits        27134    27178      +44     
- Misses       1780     1784       +4     
- Partials     1329     1334       +5     
Flag Coverage Δ
c-tests 86.12% <90.00%> (+0.01%) ⬆️
lwt-tests 80.78% <ø> (ø)
python-c-tests 67.73% <ø> (ø)
python-tests 98.92% <ø> (-0.03%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files Coverage Δ
c/tskit/trees.c 90.18% <90.00%> (+<0.01%) ⬆️

... and 4 files with indirect coverage changes


Continue to review full report in Codecov by Sentry.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 2dae133...979fc0f. Read the comment docs.

@lkirk
Copy link
Contributor Author

lkirk commented Oct 17, 2023

Can you explain the rationale for doing the site index lookup, and what your assumptions about the input are please? I'm not quite following what you're checking for, and why.

Sorry about the input validation, there were a couple of issues with that code that made it confusing. I've cleared it up now I think. My only question is how to fail if no sites are specified. Do we want to do a tsk_bug_assert or provide a more informative error type?

The site index lookup is needed because we're only collecting data for a subset of sites. We first store all of the sites we want to collect data from, then use that list while walking over the trees. When we compute statistics, we index into the data we've gathered with the _idx suffix entries of the index struct.

The assumptions about the input sites are that they are ordered, valid sites (non-null and in-range), and non-duplicated. As I've mentioned in my comments above, we're not equipped to handle duplicates, but I could add code that allows for duplicates if that seems like a reasonable requirement.

Overall, I think that we'll specify sites in python and if they're not specified, we can pass through an array of all sites. The C code can simply error if no sites are specified, but I think it's worth validating on the C side if we intend to make this part of the public C api surface.

@jeromekelleher
Copy link
Member

Thanks for the explanations @lkirk. I'm afraid I'm still not really getting the algorithm here, and this is where I'm really missing having a straightforward Python implementation --- I find it very hard to think about actual algorithmic processes when it's obscured by all the C-stuff.

Do you have a Python prototype of this code somewhere? I think we're going to need it at some point for maintainability anyway, so maybe we could switch to creating a test_two_site_stats.py file, following the general pattern of test_divmat.py? We can fill in the calls to the "lib" code here later, but we're going to need it anyway, and it would really help me to have some python code to look at and play with.

I think the basic "shape" (and API) of this is good, I just don't follow the details of the "indexing" code. From a quick glance I can see it has some memory management issues, but I want to get my head around what it does and why before jumping in to that. In general, I think we should try to develop the algorithm in Python first, test, and then implement in C.

@lkirk
Copy link
Contributor Author

lkirk commented Oct 18, 2023

I have a python prototype here, but it has fallen out of sync with the c implementation. It doesn't have sample sets or the ability to subset sites. In any case, it doesn't follow the c implementation and is more of an orthogonal check I'd say. I can add these new features to it, and place it in the python tests dir, but I'll need another day or so to get that done. In the meantime, I tried to come up with another document in an attempt to clarify the details of the indexing:
Indexing.pdf

@lkirk
Copy link
Contributor Author

lkirk commented Oct 19, 2023

I've added a python test file that currently tests itself (python/tests/test_ld_matrix.py). It's up to date with the details of the algorithm now.

@lkirk
Copy link
Contributor Author

lkirk commented Oct 30, 2023

Hi @jeromekelleher, just checking in on this. Is there anything more I can do to clarify the inner workings of the indexing algo? The python code I've attached mirrors the indexing closely. Lines 202-226 of the python code, along with the written documentation I've attached should explain things relatively well. I'm happy to have a zoom call or provide some further explanation if needed.

@jeromekelleher
Copy link
Member

Whoops, sorry @lkirk - I've been trying to get a manuscript over the line and been focusing completely on that for a few days. I'll have a good look at this tomorrow.

@jeromekelleher
Copy link
Member

Thanks @lkirk, I think I'm starting to get the idea here... The variable names on the indexing structure aren't that obvious to me, so some descriptive comments in the Python version would be helpful.

I still don't follow why we think there'll be a lot of shared sites between the rows and the cols though. Do you have some ideas for applications where we expect this? And some quantitative ideas on how much time you'd expect to save by doing this indexing?

Premature optimisation is a real thing, and I don't know how many times I've come up with some complicated fancy idea for how to do something, only to be disappointed to see that it's actually slower than the simple, obvious method. So, I would always now implement things in the simple, obvious way first, even if (or especially, maybe) I think I know a better way.

@lkirk
Copy link
Contributor Author

lkirk commented Nov 11, 2023

Hi @jeromekelleher sorry for my late reply. Thank you for taking another look. I've thought about if we might be over-complicating things here. I think any way we slice it, it's going be complicated to index into gathered data, site ids, and result matrix.

I tried to write a reduced version of this code and it wasn't a whole lot simpler. The issue is that we need to maintain a union of the sites and its corresponding indices (which happen to be the same indices as the gathered data/state matrix). Then we also need to store the row/column indices into the results matrix. All said, if we eliminated shared/diff tracking, we could get rid of 4 struct members in stat_matrix_site_indicies, and we'd eliminate all but 1 loop in the result matrix filling (maybe this would be the real gain from simplification?).

This complexity stems from the requirement that we compute LD for arbitrary NxM matricies. If it was an NxN matrix, this would be easier.

The reason we want to track shared sites between rows and columns is the case where a user specifies a square output matrix where the row sites and column sites are identical (a very common case). We only want to compute the upper triangle and reflect to the lower triangle of the output matrix because computations across the diagonal are redundant. Most of the time is spent on intersecting sample sets between sites, so we want to avoid redundant computations as much as possible (IMO). I did a simple benchmark where I took chromosome 3 from the SGDP trees inferred by tsinfer and ran this code on:

  • 10,000 sites that perfectly overlap -- forming a square matrix (best_case)
  • 10,000 sites for the row site ids and 10,000 sites for the column site ids that are non-overlapping (worst_case)
best_case took 13.81 s
worst_case took 26.44 s

We can see that by tracking overlaps and reflecting we get an UP TO 2x speedup in a common usecase.

The nice thing is that we also handle edge cases where we can simply specify arbitrary lists of site ids, retaining any benefits from avoiding redundant computations.

That said, I'm happy to consider alternative approaches.

I also went ahead and tried to clean up the python prototype, renaming the indexing object attributes to (hopefully) more descriptive names. I also made a first pass at documenting the python code so it's hopefully more understandable.

Copy link
Member

@jeromekelleher jeromekelleher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry I've been so slow @lkirk - this is tricky stuff and I haven't found time to get by head into it. I started getting my head back into it here, but I've gotten bogged down again. Maybe we should organise a call for next week so we can go through the code together?

states.append(mutation.derived_state)
current_state = len(states) - 1

for sample_id in tree.samples(mutation.node):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this is right - you're assuming that the samples are 0 to n - 1 here, but that's not true in general. What you need to do is use the sample_index_map:

sample_index_map = np.zeros(ts.num_nodes, dtype=int32) - 1
for j, u in enumerate(ts.samples()):
     sample_index_map[u] = j

 # Later:
     for u in tree.samples(mutation.node):
         sample_index = sample_index_map[u]  
         mutation[site.id, sample_index] = current_state

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I want to punt this change to when I write all of the python tests for ld_matrix. I don't currently have any data to test the sample index map with, but I will create some when I make those tests.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, can we drop a comment to this effect in here so we don't forget please?

for all samples and sites in the index.
"""
state = np.zeros((len(sites), ts.num_samples), dtype=int)
current_site = 0
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No point in tracking current_site here, just use site.id below.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is still relevant

tsk_size_t i;

if (sites == NULL || num_sites == 0) {
ret = TSK_ERR_BAD_SITE_POSITION; // TODO: error should be no sites?
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Still considering a TSK_ERR_NULL_SITES error here or something of the sort.

@lkirk
Copy link
Contributor Author

lkirk commented Jan 9, 2024

Hi @jeromekelleher sorry for the delay in fixing this up. I've covered the issues that we discussed during our meeting:

  1. We're no longer reallocating things
  2. The malloc calls are now checked
  3. I've renamed instances of haplotype matrix to haplotype counts
  4. I've renamed members of the index struct to be a little more intuitive
  5. No more incrementing variables during use
  6. I've gotten test coverage up to 94%

I think this is about ready, please let me know if there are any other blockers that I need to address before it's ready to merge.

Copy link
Member

@jeromekelleher jeromekelleher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've gone through again and I'm afraid I still haven't managed to get my head fully around the SiteMatrixIndex. I feel like it could be substantially simpler, and I'm having to spend a lot of time understanding how it works. Unless the code is easy to follow, it won't get maintained or extended, so I do want to get this right.

This is a pity, because it's just an implementation detail, and it's blocking adding the actual functionality.

My suggestion here would be to take a backup branch of this PR, and just pull out the SiteMatrixIndex stuff from this PR, using a naive and direct algorithm instead. Let's not worry about absolute performance, and just focus on getting the functionality in and do optimisations afterwards. I think this is really just a case of deleting code, as the actual logic can be recast really easily using the "compute" primitive you have already.

Then, once we have the functionality in we can come back to the SiteMatrixIndex logic, which should be a simple diff from your current commit to main, and easy to recover.

I know this is annoying, and I'm sorry to make this so slow. It will be much easier to understand your optimisations when they are isolated though, and I'm confident we can get them in quickly when they are pulled out by themselves.

c/tskit/trees.h Outdated
@@ -1818,6 +1816,29 @@ bool tsk_tree_position_prev(tsk_tree_position_t *self);
int tsk_tree_position_seek_forward(tsk_tree_position_t *self, tsk_id_t index);
int tsk_tree_position_seek_backward(tsk_tree_position_t *self, tsk_id_t index);

// These are in the header for testing purposes only
struct stat_matrix_site_indicies {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typo probably should be, stat_matrix_site_indices. Elsewhere we use the indexes plural form, FYI but whatever you prefer.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess strictly speaking if we put this in the public header we should also put in a tsk_ prefix so that we're fulfilling the contracts we specify in the docs. (Names space collisions are unlikely I agree, but we do say that all public symbols start with a tsk_ prefix)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also might as well follow the usual conventions and typedef it as tsk_stat_matrix_site_indexes (rather than as a struct)

for all samples and sites in the index.
"""
state = np.zeros((len(sites), ts.num_samples), dtype=int)
current_site = 0
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is still relevant

states.append(mutation.derived_state)
current_state = len(states) - 1

for sample_id in tree.samples(mutation.node):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, can we drop a comment to this effect in here so we don't forget please?

raise ValueError(f"Site out of bounds: {sites[i + 1]}")


def get_matrix_indices(row_sites, col_sites):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this would be more naturally done as the constructor to SiteMatrixIndex, so that we call idx = SiteMatrixIndex(row_sites, col_sites) in client code. Here, I found myself going up and down the file to try and figure out what the fields were actually doing, where we should really just encapsulate everything in the SiteMatrixIndex class.


def compute(left_states, right_states):
hap_counts = np.zeros((np.max(left_states) + 1, np.max(right_states) + 1))
# compute the haplotype matrix
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is still confusing me - what's the haplotype matrix? (I'm pretty sure it's not the same as what we get when calling ts.haplotypes(), e.g.). Can we rename this please?

@lkirk
Copy link
Contributor Author

lkirk commented Jan 11, 2024

@jeromekelleher I agree with you. It could definitely be simpler and it'll be a bit of technical debt looming over this code. I'll pull it out and make things simpler. I can have something by Friday for you to look at.

Along these lines... I was thinking about writing the python code from the bottom up to mirror the C code. Would that make things simpler in the long run? I'm happy to do that too.

@jeromekelleher
Copy link
Member

That sounds great, thanks @lkirk . Mirroring the python and c implementations (to a reasonable degree, no need to be too literal) would help all right

import tskit


class BitSet:
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've modeled this class after the BitSet that I'm imagining for #2834.

The only strange thing about the API is that the intersect method saves to a third length-1 array, while the rest of the methods modify the calling instance. I suppose union and difference could write out to an out array like intersect, but it would be clunky and slightly confusing to be passing the calling instance back into the method.

In any case, I'm not sure how much effort should be put into this API since technically only the two-locus code is using it for now.

@lkirk lkirk force-pushed the two-locus-site-masking-c-api branch from ed1a86b to a7e8609 Compare January 18, 2024 07:01
@lkirk
Copy link
Contributor Author

lkirk commented Jan 18, 2024

Okay, so that took a little longer than expected. I've simplified things @jeromekelleher.

There are no longer any optimizations or fancy index tracking around the results matrix. We're now just filling the matrix from top to bottom, using the exact indices specified by the user.

The python code now closely mirrors the C code (except for the simple refactor of bit arrays that I commented on in the code). I've also annotated the python code with types (compatible with all supported versions of python) to add some further documentation.

Code coverage is up, tests are passing and merge conflicts are handled. I'm happy to hear any feedback and/or points of confusion. Hopefully this makes things more understandable.

Copy link
Member

@jeromekelleher jeromekelleher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great! A couple of small issues, but looks good to merge then. Thanks for your patience @lkirk!

// If we've hit the last site in our index, finish up
if (site_idx >= n_sites) {
ret = 0;
goto out;
Copy link
Member

@jeromekelleher jeromekelleher Jan 23, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We only use gotos for "exception handling", so it would be good to change the flow a bit here. I'm also not sure that you're correctly iterating over the trees (see comment above). How about this?

for (site_idx=0; site_id < num_sites; site_id++) {
      tsk_treeseq_get_site(ts, sites[site_idx], &site);
      ret = tsk_tree_seek(&tree, site.position, 0);
      if (ret < 0) {
          goto out;
      }
      tsk_bug_assert(ret == TSK_TREE_OK);
      // do stuff with site and tree
}

The call to tsk_tree_seek will exit quickly when it's no-op, so it'll be just as fast.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I see what you're saying. After a few tests, I see what's going on. The issue I have in your proposed solution is that we'd also be calling realloc for every site. From what I understand, realloc is not specified in the C standard (GNU returns the same memory when realloc is called with the same size, but I'm not sure about other libc implementations -- see here in the portability notes).

I've proposed a solution (my most recent commit) where we seek to the first tree, then keep seeking until we run out of sites or trees.

What do you think about this? I can implement what you've proposed if you think that's the better, more straightforward route.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generally, things have to be correct first and then fast afterwards. You make them correct by doing things simply first. Then you profile, and if it's not fast enough, you optimise based on this profiling.

The realloc thing you identify is a small detail (which is trivially worked around, if it really is a problem). You haven't profiled this, so you don't know if it is a problem. That is not sufficient reason to make the code so much more complicated.

I do understand the impulse to optimise, but you're going about it the wrong way. I've found this classic worse is better essay very illuminating, and has convinced me of the importance of simplicity of implementation, almost above all other considerations.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for passing that along, I've heard of "worse is better" colloquially, but I haven't seen this essay before. I agree with it and like the ideas, I just need to translate them into my software design.

I've adopted your suggestions and things look a bit simpler. tsk_tree_seek always returns 0 unless there's an error, so there's no need to make any further asserts as far as I can tell.

I also updated the python code to reflect the change.

c/tskit/trees.c Outdated
// interest. We rely on our sites bounds checking to access the first site position
site_offset = 0;
site_idx = 0;
for (ret = tsk_tree_seek(&tree, ts->tables->sites.position[sites[0]], 0);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this correct? Won't you seek to the first tree over and over doing this? See below for suggested restructuring of flow control.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right about that. I didn't realize that this would cause the loop to wrap around until I messed with it in isolation. See my detailed comment below.

if (sites == NULL) {
goto out;
}

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's file a follow-up issue on this one - additional error cases for 2 site stats.

Copy link
Member

@jeromekelleher jeromekelleher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great! Let's squash-n-merge 🎉

This allows a user to specify an array of row sites to be compared
against an array of column sites. Rectangular matricies are returned as
a result of these comparisons. We weighed the pros/cons of the
indexing performance and redundant computation. The initial
implementation avoided redundant computation, but at the expense of
complexity and readability. The final implementation is much more simple
and DOES perform redundant computations if the row/col sites cross the
diagonal of the LD matrix. We will revisit if/when this becomes a
problem.

In addition to these changes, I've cleaned up the results matrix
allocation. Since the python code will be allocating and managing this
memory, we'll pull out the allocation code.

C tests have been added to capture all of this new functionality.

A type annotated python prototype that mirrors the C functionality was
added to provide some documentation on how the C algorithms work, and
to aid in the testing of the C code once the python API is ready.
@lkirk lkirk force-pushed the two-locus-site-masking-c-api branch from d675999 to 979fc0f Compare January 25, 2024 15:45
@lkirk
Copy link
Contributor Author

lkirk commented Jan 25, 2024

Squashed and ready. I'll make that issue for the error types today. I also talked with @apragsdale and he suggests we make an issue for tracking the matrix optimizations if we ever decide to remove redundant computations. Does that sound reasonable?

@jeromekelleher jeromekelleher added the AUTOMERGE-REQUESTED Ask Mergify to merge this PR label Jan 25, 2024
@jeromekelleher
Copy link
Member

Sounds good to me. Let's get the Python-C plumbing setup up next and so we can start running this stuff!

@mergify mergify bot merged commit 1601a5e into tskit-dev:main Jan 25, 2024
23 checks passed
@mergify mergify bot removed the AUTOMERGE-REQUESTED Ask Mergify to merge this PR label Jan 25, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants