[ENH] Faster and more flexible code, and code sharing for kernel tests (#19)

Towards: #15 

Changes proposed in this pull request:
- refactors code to set up for the kcd test
- allows any of the pairwise kernel strings from sklearn to be passed in
(which is significantly faster than using partial because sklearn
optimizes the in-house kernels)
- also requires kernel functions to follow a specific API, so they are easier to
test, implement and document

This should all make implementation of the kcd test pretty
straightforward
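
As an illustration of the second bullet, the following minimal sketch compares passing a kernel name to scikit-learn's `pairwise_kernels` with passing a callable. It assumes that "partial" in the message refers to `functools.partial` wrapping a kernel callable; the data and the `rbf_pair` helper are hypothetical and are not taken from this commit.

```python
from functools import partial

import numpy as np
from sklearn.metrics.pairwise import pairwise_kernels

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))

# Option 1: pass the kernel name as a string. scikit-learn dispatches to its
# built-in, vectorized implementation of the RBF kernel.
K_string = pairwise_kernels(X, metric="rbf", gamma=0.5)


# Option 2: pass a callable (hypothetical helper). scikit-learn evaluates it
# once per pair of rows in a Python loop, which is typically much slower.
def rbf_pair(x, y, gamma):
    return np.exp(-gamma * np.sum((x - y) ** 2))


K_callable = pairwise_kernels(X, metric=partial(rbf_pair, gamma=0.5))

# Both routes should produce the same kernel matrix.
print(np.allclose(K_string, K_callable))
```

Both calls compute the same Gram matrix; only the dispatch path differs, which is where the speed gap described in the message would come from.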

---------

Signed-off-by: Adam Li <adam2392@gmail.com>
Co-authored-by: Patrick Bloebaum <bloebp@amazon.com>
adam2392 and bloebp authored Aug 30, 2023
1 parent 1938c19 commit 256d8c9
Showing 22 changed files with 1,536 additions and 995 deletions.
6 changes: 3 additions & 3 deletions .github/workflows/main.yml
@@ -22,7 +22,7 @@ jobs:
 runs-on: ubuntu-latest
 strategy:
 matrix:
-poetry-version: [1.3.0]
+poetry-version: [1.6.1]
 steps:
 - name: Checkout repository
 uses: actions/checkout@v3
@@ -59,7 +59,7 @@ jobs:
 matrix:
 os: [ubuntu, macos, windows]
 python-version: [3.8, 3.9, "3.10"]
-poetry-version: [1.3.0]
+poetry-version: [1.6.1]
 name: build ${{ matrix.os }} - py${{ matrix.python-version }}
 runs-on: ${{ matrix.os }}-latest
 defaults:
@@ -122,7 +122,7 @@ jobs:
 matrix:
 os: [ubuntu, macos, windows]
 python-version: [3.8, "3.10"] # oldest and newest supported versions
-poetry-version: [1.3.0]
+poetry-version: [1.6.1]
 name: Unit-test ${{ matrix.os }} - py${{ matrix.python-version }}
 runs-on: ${{ matrix.os }}-latest
 defaults:
24 changes: 24 additions & 0 deletions CITATION.cff
@@ -0,0 +1,24 @@
# YAML 1.2
---
# Metadata for citation of this software according to the CFF format (https://citation-file-format.github.io/)
cff-version: 1.2.0
title: 'Pywhy-Stats: Statistical inference in Python.'
abstract: 'Pywhy-Stats is a Python library that leverages a simple API for performing independence and conditional independence testing.'
authors:
- given-names: Adam
family-names: Li
affiliation: 'Department of Computer Science, Columbia University, New York, NY, USA'
orcid: 'https://orcid.org/0000-0001-8421-365X'
- given-names: Patrick
family-names: Blöbaum
affiliation: 'Amazon'
email: 'bloebp@amazon.com'
type: software
repository-code: 'https://github.com/py-why/pywhy-stats'
license: MIT
keywords:
- causality
- pywhy
- statistics
- independence testing
...
5 changes: 5 additions & 0 deletions CONTRIBUTING.md
@@ -141,6 +141,11 @@ When you're ready to contribute code to address an open issue, please follow the

</details>

+5. Adding your name to the CITATION.cff file
+
+We are a community-driven open-source project and want to make sure all contributors are acknowledged. If you are a new contributor, add your name
+and relevant metadata to the ``CITATION.cff`` file.

### Writing docstrings

We use [Sphinx](https://www.sphinx-doc.org/en/master/index.html) to build our API docs, which automatically parses all docstrings
2 changes: 1 addition & 1 deletion doc/api.rst
@@ -57,7 +57,7 @@ contains the p-value and the test statistic and optionally additional informatio
Testing for conditional independence among variables is a core part
of many data analysis procedures.

-.. currentmodule:: pywhy_stats
+.. currentmodule:: pywhy_stats.independence
.. autosummary::
:toctree: generated/

43 changes: 17 additions & 26 deletions doc/conditional_independence.rst
@@ -80,20 +80,21 @@ various proposals in the literature for estimating CMI, which we summarize here:
estimating :math:`P(y|x)` and :math:`P(y|x,z)`, which can be used as plug-in estimates
to the equation for CMI.

-:mod:`pywhy_stats.fisherz` Partial (Pearson) Correlation
---------------------------------------------------------
+:mod:`pywhy_stats.independence.fisherz` Partial (Pearson) Correlation
+---------------------------------------------------------------------
Partial correlation based on the Pearson correlation is equivalent to CMI in the setting
of normally distributed data. Computing partial correlation is fast and efficient and
thus attractive to use. However, this **relies on the assumption that the variables are Gaussian**,
which may be unrealistic in certain datasets.

+.. currentmodule:: pywhy_stats.independence
.. autosummary::
:toctree: generated/

fisherz
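
For readers who want to see the idea behind the partial correlation test described above, here is a minimal sketch of a Fisher z-test built on NumPy and SciPy. It is not the implementation in `pywhy_stats.independence.fisherz`; the function name `fisherz_sketch` and the simulated data are assumptions made for illustration.

```python
import numpy as np
from scipy import stats


def fisherz_sketch(x, y, z):
    """Test x independent of y given z via partial correlation (2D column arrays)."""
    data = np.hstack([x, y, z])
    n_samples = data.shape[0]
    # The partial correlation of the first two columns given the rest can be
    # read off the inverse of the correlation matrix.
    precision = np.linalg.pinv(np.corrcoef(data, rowvar=False))
    r = -precision[0, 1] / np.sqrt(precision[0, 0] * precision[1, 1])
    # Fisher z-transform: approximately standard normal under the null after
    # scaling by sqrt(n - |z| - 3).
    z_stat = np.sqrt(n_samples - z.shape[1] - 3) * np.arctanh(r)
    pvalue = 2 * stats.norm.sf(abs(z_stat))
    return z_stat, pvalue


rng = np.random.default_rng(0)
z = rng.normal(size=(500, 1))
x = z + rng.normal(size=(500, 1))
y = z + rng.normal(size=(500, 1))  # x and y share only the common cause z
stat_ci, p_ci = fisherz_sketch(x, y, z)

y_dep = x + 0.1 * rng.normal(size=(500, 1))  # y_dep depends on x directly
stat_dep, p_dep = fisherz_sketch(x, y_dep, z)
print(p_ci, p_dep)
```

The test is only calibrated when the variables are jointly Gaussian, which is the assumption the paragraph above warns about.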

-:mod:`pywhy_stats.power_divergence` Discrete, Categorical and Binary Data
--------------------------------------------------------------------------
+:mod:`pywhy_stats.independence.power_divergence` Discrete, Categorical and Binary Data
+--------------------------------------------------------------------------------------
If one has discrete data, then the test to use is based on Chi-square tests. The :math:`G^2`
class of tests will construct a contingency table based on the number of levels across
each discrete variable. An exponential amount of data is needed for increasing levels
@@ -104,8 +105,8 @@ for a discrete variable.

power_divergence
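
As a rough illustration of the contingency-table approach described above, the sketch below runs an unconditional :math:`G^2` test with SciPy's `chi2_contingency`. It does not use the `pywhy_stats.independence.power_divergence` API; the simulated binary data and variable names are assumptions.

```python
import numpy as np
from scipy.stats import chi2_contingency

rng = np.random.default_rng(0)
x = rng.integers(0, 2, size=1000)
# y copies x but is flipped 10% of the time, so the two are strongly dependent.
flip = rng.random(1000) < 0.1
y = np.where(flip, 1 - x, x)

# Build the 2 x 2 contingency table of joint counts.
table = np.zeros((2, 2), dtype=int)
np.add.at(table, (x, y), 1)

# lambda_="log-likelihood" selects the G^2 (log-likelihood ratio) statistic
# from the power-divergence family; correction=False disables Yates' correction.
result = chi2_contingency(table, correction=False, lambda_="log-likelihood")
stat, pvalue, dof = result[0], result[1], result[2]
print(stat, pvalue, dof)
```

A conditional version would repeat this within each level of the conditioning variables, which is why the number of cells, and hence the data required, grows quickly with the number of levels.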

-Kernel-Approaches
------------------
+:mod:`pywhy_stats.independence.kci` Kernel-Approaches
+-----------------------------------------------------
Kernel independence tests are statistical methods used to determine if two random variables are independent or
conditionally independent. One such test is the Hilbert-Schmidt Independence Criterion (HSIC), which examines the
independence between two random variables, X and Y. HSIC employs kernel methods and, more specifically, it computes
@@ -125,6 +126,12 @@ Kernel-based tests are attractive for many applications, since they are semi-par
that have been shown to be robust in the machine-learning field. For more information, see :footcite:`Zhang2011`.


+.. currentmodule:: pywhy_stats.independence
+.. autosummary::
+:toctree: generated/
+
+kci
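
To make the HSIC idea above concrete, here is a minimal sketch of an unconditional kernel independence test that uses the biased HSIC estimator and a permutation null. It is not the `kci` implementation from this commit; the function names, the permutation scheme, and the default RBF bandwidth are assumptions chosen for brevity.

```python
import numpy as np
from sklearn.metrics.pairwise import pairwise_kernels


def hsic_statistic(K, L):
    """Biased HSIC estimate trace(K H L H) / n^2 from two Gram matrices."""
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n  # centering matrix
    return np.trace(K @ H @ L @ H) / n**2


def hsic_permutation_test(x, y, metric="rbf", n_permutations=200, seed=0):
    rng = np.random.default_rng(seed)
    # Kernel names are passed as strings so scikit-learn uses its built-ins.
    K = pairwise_kernels(x, metric=metric)
    L = pairwise_kernels(y, metric=metric)
    observed = hsic_statistic(K, L)
    null = np.empty(n_permutations)
    for b in range(n_permutations):
        # Permuting the samples of y breaks any dependence on x.
        perm = rng.permutation(L.shape[0])
        null[b] = hsic_statistic(K, L[np.ix_(perm, perm)])
    pvalue = (1 + np.sum(null >= observed)) / (1 + n_permutations)
    return observed, pvalue


rng = np.random.default_rng(1)
x = rng.normal(size=(100, 1))
y = x**2 + 0.1 * rng.normal(size=(100, 1))  # nonlinear dependence on x
observed, pvalue = hsic_permutation_test(x, y)
print(observed, pvalue)
```

The quadratic relationship here has near-zero linear correlation, which is the kind of dependence a kernel test can detect and a partial correlation test would miss.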

Classifier-based Approaches
---------------------------
Another suite of approaches that rely on permutation testing is the classifier-based approach.
@@ -144,9 +151,9 @@ helps maintain dependence between (X, Z) and (Y, Z) (if it exists), but generate
conditionally independent dataset.


-=======================
-Conditional Discrepancy
-=======================
+=========================================
+Conditional Distribution 2-Sample Testing
+=========================================

.. currentmodule:: pywhy_stats

@@ -170,23 +177,7 @@ indices of the distribution, one can convert the CD test:
:math:`P_{i=j}(y|x) =? P_{i=k}(y|x)` into the CI test :math:`P(y|x,i) = P(y|x)`, which can
be tested with the Chi-square CI tests.
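
The reduction described above can be sketched as follows: pool the datasets, record the dataset index :math:`i` as an extra variable, and test whether :math:`y` is independent of :math:`i` given :math:`x`. The helpers `pool_with_group_index`, `g2_conditional` and `sample` are hypothetical and written for integer-coded discrete data; they are not part of pywhy-stats.

```python
import numpy as np
from scipy.stats import chi2, chi2_contingency


def pool_with_group_index(datasets):
    """Stack (x, y) samples and record which dataset each row came from."""
    x = np.concatenate([x_k for x_k, _ in datasets])
    y = np.concatenate([y_k for _, y_k in datasets])
    i = np.concatenate([np.full(len(x_k), k) for k, (x_k, _) in enumerate(datasets)])
    return x, y, i


def g2_conditional(y, i, x):
    """G^2 test of y independent of i given x (integer-coded 1D arrays)."""
    stat, dof = 0.0, 0
    for level in np.unique(x):
        mask = x == level
        table = np.zeros((y.max() + 1, i.max() + 1))
        np.add.at(table, (y[mask], i[mask]), 1)
        # Drop empty rows and columns so that expected counts are nonzero.
        table = table[table.sum(axis=1) > 0][:, table.sum(axis=0) > 0]
        if table.shape[0] < 2 or table.shape[1] < 2:
            continue
        res = chi2_contingency(table, correction=False, lambda_="log-likelihood")
        stat += res[0]
        dof += res[2]
    return stat, dof, chi2.sf(stat, dof)


def sample(rng, n, flip_prob):
    """Binary x, and y equal to x except for flips with probability flip_prob."""
    x = rng.integers(0, 2, size=n)
    y = np.where(rng.random(n) < flip_prob, 1 - x, x)
    return x, y


rng = np.random.default_rng(0)
# The two datasets have different P(y | x), so the CD null should be rejected.
pooled_x, pooled_y, pooled_i = pool_with_group_index(
    [sample(rng, 1000, 0.1), sample(rng, 1000, 0.4)]
)
stat, dof, pvalue = g2_conditional(pooled_y, pooled_i, pooled_x)
print(stat, dof, pvalue)
```

Summing the per-stratum statistics and degrees of freedom is one common way to build a conditional test from unconditional ones; other aggregation choices are possible.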

-Kernel-Approaches
------------------
-Kernel-based tests are attractive since they are semi-parametric and use kernel-based ideas
-that have been shown to be robust in the machine-learning field. The Kernel CD test is a test
-that computes a test statistic from kernels of the data and uses a weighted permutation testing
-based on the estimated propensity scores to generate samples from the null distribution
-:footcite:`Park2021conditional`, which are then used to estimate a pvalue.
-
-
-Bregman-Divergences
--------------------
-The Bregman CD test is a divergence-based test
-that computes a test statistic from estimated Von-Neumann divergences of the data and uses a
-weighted permutation testing based on the estimated propensity scores to generate samples from the null distribution
-:footcite:`Yu2020Bregman`, which are then used to estimate a pvalue.
-
==========
References
==========
-.. footbibliography::
+.. footbibliography::
4 changes: 1 addition & 3 deletions doc/conf.py
@@ -39,7 +39,7 @@

# If your documentation needs a minimal Sphinx version, state it here.
#
-needs_sphinx = "4.0"
+needs_sphinx = "5.0"

# Add any Sphinx extension module names here, as strings. They can be
# extensions coming with Sphinx (named 'sphinx.ext.*') or your custom
@@ -146,9 +146,7 @@
"PValueResult": "pywhy_stats.pvalue_result.PValueResult",
# numpy
"NDArray": "numpy.ndarray",
# "ArrayLike": "numpy.typing.ArrayLike",
"ArrayLike": ":term:`array_like`",
"fisherz": "pywhy_stats.fisherz",
}

autodoc_typehints_format = "short"
2 changes: 1 addition & 1 deletion doc/whats_new/v0.1.rst
@@ -26,7 +26,7 @@ Version 0.1
Changelog
---------

-- |Feature| Implement partial correlation test :func:`pywhy_stats.fisherz`, by `Adam Li`_ (:pr:`7`)
+- |Feature| Implement partial correlation test :func:`pywhy_stats.independence.fisherz`, by `Adam Li`_ (:pr:`7`)
 - |Feature| Add (un)conditional kernel independence test by `Patrick Blöbaum`_, co-authored by `Adam Li`_ (:pr:`14`)
 - |Feature| Add categorical independence tests by `Adam Li`_, (:pr:`18`)
