Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[ENH] Faster and more flexible code, and code sharing for kernel tests #19

Merged
merged 23 commits into from
Aug 30, 2023

Conversation

adam2392
Copy link
Collaborator

@adam2392 adam2392 commented Jul 19, 2023

Towards: #15

Changes proposed in this pull request:

  • refactors code to setup for kcd test
  • allows any of the pairwise kernel strings to be passed in from sklearn (which is significantly faster than using partial because sklearn optimizes the in-house kernels)
  • also requires kernel functions to be a specific API, so it's easier to test, implement and document

This should all make implementation of the kcd test pretty straightforward

Before submitting

  • I've read and followed all steps in the Making a pull request
    section of the CONTRIBUTING docs.
  • I've updated or added any relevant docstrings following the syntax described in the
    Writing docstrings section of the CONTRIBUTING docs.
  • If this PR fixes a bug, I've added a test that will fail without my fix.
  • If this PR adds a new feature, I've added tests that sufficiently cover my new functionality.
  • I have added a changelog entry succintly describing the change in this PR in the whats_new relevant version document.

After submitting

  • All GitHub Actions jobs for my pull request have passed.

Signed-off-by: Adam Li <adam2392@gmail.com>
Signed-off-by: Adam Li <adam2392@gmail.com>
Signed-off-by: Adam Li <adam2392@gmail.com>
Signed-off-by: Adam Li <adam2392@gmail.com>
Signed-off-by: Adam Li <adam2392@gmail.com>
Signed-off-by: Adam Li <adam2392@gmail.com>
@adam2392 adam2392 requested a review from bloebp July 19, 2023 15:59
@adam2392 adam2392 marked this pull request as ready for review July 19, 2023 16:00
@adam2392 adam2392 added the No Changelog Needed Does not require updating what's_new label Jul 21, 2023
Signed-off-by: Adam Li <adam2392@gmail.com>
Signed-off-by: Adam Li <adam2392@gmail.com>
@adam2392 adam2392 closed this Jul 21, 2023
@adam2392 adam2392 reopened this Jul 21, 2023
Signed-off-by: Adam Li <adam2392@gmail.com>
@codecov
Copy link

codecov bot commented Jul 26, 2023

Codecov Report

Merging #19 (ed2ebe0) into main (1938c19) will increase coverage by 1.09%.
The diff coverage is 78.99%.

@@            Coverage Diff             @@
##             main      #19      +/-   ##
==========================================
+ Coverage   81.73%   82.83%   +1.09%     
==========================================
  Files           8        9       +1     
  Lines         345      402      +57     
  Branches       66       72       +6     
==========================================
+ Hits          282      333      +51     
- Misses         47       49       +2     
- Partials       16       20       +4     
Files Changed Coverage Δ
pywhy_stats/kernels.py 81.81% <71.42%> (+16.60%) ⬆️
pywhy_stats/utils.py 62.16% <71.42%> (+5.64%) ⬆️
pywhy_stats/kernel_utils.py 75.00% <75.00%> (ø)
pywhy_stats/independence/kci.py 98.11% <92.30%> (ø)
pywhy_stats/api.py 93.93% <100.00%> (ø)
pywhy_stats/independence/fisherz.py 95.65% <100.00%> (ø)
pywhy_stats/independence/power_divergence.py 71.71% <100.00%> (ø)

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

pywhy_stats/api.py Show resolved Hide resolved
pywhy_stats/api.py Show resolved Hide resolved
pywhy_stats/independence/kci.py Outdated Show resolved Hide resolved
tests/test_kci.py Outdated Show resolved Hide resolved
tests/test_kci.py Show resolved Hide resolved
tests/test_kci.py Show resolved Hide resolved
pywhy_stats/kernel_utils.py Outdated Show resolved Hide resolved
@adam2392
Copy link
Collaborator Author

adam2392 commented Aug 8, 2023

Tldr re: #19 (comment)

I will put up a sep commit after Neurips rebuttal period is over to see if we can converge :)

@bloebp
Copy link
Member

bloebp commented Aug 9, 2023

Can you investigate the failing checks?

Signed-off-by: Adam Li <adam2392@gmail.com>
@adam2392 adam2392 requested a review from bloebp August 11, 2023 17:19
@adam2392
Copy link
Collaborator Author

adam2392 commented Aug 11, 2023

@bloebp lmk wdyt. I addressed the two issues:

  1. use RNG, so the seed does not universally affect the tests
  2. I refactored, but I think there might be still a discrepancy:
  • on one hand pairwise_kernels can take a callable that is defined on (X, Y) 1D row vectors and will run a joblib parallelization
  • on the other hand if you define your own kernel callable that can compute this in a vectorized fashion over all the samples, then you want to call it directly rather than joblib parallelize each row.

E.g.

    def my_custom_kernel(X):
        # X here is multidimensional, where the first dimension is continuous and the second categorical. We can handle
        # this by combining two kernels.
        return delta_kernel(X[:, 0].reshape(-1, 1)) * rbf_kernel(X[:, 1].reshape(-1, 1))

we want to call directly, but for the following, we would want to call it via pairwise_kernels.

def unvectorizable_pairwise_kernel(X, Y=None):
     ...

Should we have a rule that if the number of arguments in the kernel callable is exactly 1, then it is assumed to be some optimized kernel callable that is already vectorized and hence should be called directly?

Otoh, if 2 possible function arguments, then it is better to pass it through pairwise_kernels because it is assumed to be of the unvectorizable form that operates on a pair of samples.

Signed-off-by: Adam Li <adam2392@gmail.com>
@adam2392
Copy link
Collaborator Author

I addressed the issue above and just made it clear that we assume the following:

If the metric is a callable, it will have either one input for ``X``, or two inputs for ``X`` and
    ``Y``. If one input is passed in, it is assumed that the kernel operates on the entire array to compute
    the kernel array. If two inputs are passed in, then it is assumed that the kernel operates on
    pairwise row vectors from each input array. This callable is parallelized across rows of the input
    using :func:`~sklearn.metrics.pairwise.pairwise_kernels`. Note that ``(X, Y=None)`` is a valid input
    signature for the kernel function and would then get passed to the pairwise kernel function. If
    a callable is passed in, it is generally faster and more efficient if one can define a vectorized
    operation that operates on the whole array at once. Otherwise, the pairwise kernel function will
    call the function for each combination of rows in the input arrays.

Signed-off-by: Adam Li <adam2392@gmail.com>
Signed-off-by: Adam Li <adam2392@gmail.com>
@adam2392
Copy link
Collaborator Author

Xref: scikit-learn/scikit-learn#27123

Signed-off-by: Patrick Bloebaum <bloebp@amazon.com>
Signed-off-by: Patrick Bloebaum <bloebp@amazon.com>
Signed-off-by: Adam Li <adam2392@gmail.com>
Signed-off-by: Adam Li <adam2392@gmail.com>
bloebp and others added 3 commits August 21, 2023 12:35
Signed-off-by: Patrick Bloebaum <bloebp@amazon.com>
Signed-off-by: Adam Li <adam2392@gmail.com>
Signed-off-by: Adam Li <adam2392@gmail.com>
@adam2392
Copy link
Collaborator Author

adam2392 commented Aug 30, 2023

@bloebp I will let you merge if it looks good to you. I enabled auto-squash merge commit.

The last commit should fix the circleCI build.

Signed-off-by: Adam Li <adam2392@gmail.com>
@adam2392 adam2392 enabled auto-merge (squash) August 30, 2023 16:15
Signed-off-by: Adam Li <adam2392@gmail.com>
@bloebp bloebp disabled auto-merge August 30, 2023 17:25
@adam2392 adam2392 merged commit 256d8c9 into py-why:main Aug 30, 2023
20 of 21 checks passed
@adam2392 adam2392 deleted the kernelref branch August 30, 2023 17:27
bloebp pushed a commit that referenced this pull request Sep 6, 2023
Closes: #15 
Follow-up to #19 

Changes proposed in this pull request:
- Adds `kcd` and `bremgan` test along w/ unit-tests and documentation
update
- As a result of #19, the code is entirely self-contained and leverages
the kernel functions that are shared w/ the kci test.

## Before submitting

<!-- Please complete this checklist BEFORE submitting your PR to speed
along the review process. -->
- [ ] I've read and followed all steps in the [Making a pull
request](https://github.com/py-why/pywhy-stats/blob/main/CONTRIBUTING.md#making-a-pull-request)
    section of the `CONTRIBUTING` docs.
- [ ] I've updated or added any relevant docstrings following the syntax
described in the
[Writing
docstrings](https://github.com/py-why/pywhy-stats/blob/main/CONTRIBUTING.md#writing-docstrings)
section of the `CONTRIBUTING` docs.
- [ ] If this PR fixes a bug, I've added a test that will fail without
my fix.
- [ ] If this PR adds a new feature, I've added tests that sufficiently
cover my new functionality.
- [ ] I have added a changelog entry succintly describing the change in
this PR in the
[whats_new](https://github.com/py-why/pywhy-stats/blob/main/docs/whats_new/)
relevant version document.

## After submitting

<!-- Please complete this checklist AFTER submitting your PR to speed
along the review process. -->
- [ ] All GitHub Actions jobs for my pull request have passed.

---------

Signed-off-by: Adam Li <adam2392@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
No Changelog Needed Does not require updating what's_new
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants