[ENH] Faster and more flexible code, and code sharing for kernel tests (#19)

Towards: #15

Changes proposed in this pull request:

- refactors code to setup for kcd test
- allows any of the pairwise kernel strings to be passed in from sklearn (which is significantly faster than using partial because sklearn optimizes the in-house kernels)
- also requires kernel functions to be a specific API, so it's easier to test, implement and document

This should all make implementation of the kcd test pretty straightforward

---------

Signed-off-by: Adam Li <adam2392@gmail.com>
Co-authored-by: Patrick Bloebaum <bloebp@amazon.com>
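
The point about kernel strings versus `partial` can be illustrated with scikit-learn's `pairwise_kernels`. This is a minimal sketch and not code from this commit: `rbf_pair` is a hypothetical helper, and the sample data and `gamma` value are arbitrary. A string metric dispatches to scikit-learn's vectorized kernel, while a callable is evaluated on one pair of rows at a time in Python.

```python
from functools import partial

import numpy as np
from sklearn.metrics.pairwise import pairwise_kernels


def rbf_pair(x, y, gamma):
    """Hypothetical per-pair RBF kernel: exp(-gamma * ||x - y||^2)."""
    return np.exp(-gamma * np.sum((x - y) ** 2))


rng = np.random.default_rng(0)
X = rng.standard_normal((50, 3))

# String metric: scikit-learn uses its built-in, vectorized RBF kernel.
K_string = pairwise_kernels(X, metric="rbf", gamma=0.5)

# Callable metric: scikit-learn calls the function once per pair of rows.
K_callable = pairwise_kernels(X, metric=partial(rbf_pair, gamma=0.5))
```

Both calls should produce the same kernel matrix up to floating-point error; the difference is that the callable path loops over row pairs in Python, which is consistent with the speed difference the commit message describes.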
py-why · Aug 30, 2023 · 256d8c9
1 parent 1938c19
Showing 22 changed files with 1,536 additions and 995 deletions.
diff --git a/.github/workflows/main.yml b/.github/workflows/main.yml
@@ -22,7 +22,7 @@ jobs:
     runs-on: ubuntu-latest
     strategy:
       matrix:
-        poetry-version: [1.3.0]
+        poetry-version: [1.6.1]
     steps:
       - name: Checkout repository
         uses: actions/checkout@v3
@@ -59,7 +59,7 @@ jobs:
       matrix:
         os: [ubuntu, macos, windows]
         python-version: [3.8, 3.9, "3.10"]
-        poetry-version: [1.3.0]
+        poetry-version: [1.6.1]
     name: build ${{ matrix.os }} - py${{ matrix.python-version }}
     runs-on: ${{ matrix.os }}-latest
     defaults:
@@ -122,7 +122,7 @@ jobs:
       matrix:
         os: [ubuntu, macos, windows]
         python-version: [3.8, "3.10"]  # oldest and newest supported versions
-        poetry-version: [1.3.0]
+        poetry-version: [1.6.1]
     name: Unit-test ${{ matrix.os }} - py${{ matrix.python-version }}
     runs-on: ${{ matrix.os }}-latest
     defaults:

diff --git a/CITATION.cff b/CITATION.cff
@@ -0,0 +1,24 @@
+# YAML 1.2
+---
+# Metadata for citation of this software according to the CFF format (https://citation-file-format.github.io/)
+cff-version: 1.2.0
+title: 'Pywhy-Stats: Statistical inference in Python.'
+abstract: 'Pywhy-Stats is a Python library that leverages a simple API for performing independence and conditional independence testing.'
+authors:
+    - given-names: Adam
+      family-names: Li
+      affiliation: 'Department of Computer Science, Columbia University, New York, NY, USA'
+      orcid: 'https://orcid.org/0000-0001-8421-365X'
+    - given-names: Patrick
+      family-names: Blöbaum
+      affiliation: 'Amazon'
+      email: 'bloebp@amazon.com'
+type: software
+repository-code: 'https://github.com/py-why/pywhy-stats'
+license: MIT
+keywords:
+  - causality
+  - pywhy
+  - statistics
+  - independece testing
+...
diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md
@@ -141,6 +141,11 @@ When you're ready to contribute code to address an open issue, please follow the
 
     </details>
 
+5. Adding your name to the CITATION.cff file
+
+    We are a community-driven open-source project and want to make sure all contributors are acknowledged. If you are a new contributor, add your name
+    to the ``CITATION.cff`` file and relevant metadata.
+
 ### Writing docstrings
 
 We use [Sphinx](https://www.sphinx-doc.org/en/master/index.html) to build our API docs, which automatically parses all docstrings

diff --git a/doc/api.rst b/doc/api.rst
@@ -57,7 +57,7 @@ contains the p-value and the test statistic and optionally additional informatio
 Testing for conditional independence among variables is a core part
 of many data analysis procedures.
 
-.. currentmodule:: pywhy_stats
+.. currentmodule:: pywhy_stats.independence
 .. autosummary::
    :toctree: generated/
 

diff --git a/doc/conditional_independence.rst b/doc/conditional_independence.rst
@@ -80,20 +80,21 @@ various proposals in the literature for estimating CMI, which we summarize here:
   estimating :math:`P(y|x)` and :math:`P(y|x,z)`, which can be used as plug-in estimates
   to the equation for CMI.
 
-:mod:`pywhy_stats.fisherz` Partial (Pearson) Correlation
---------------------------------------------------------
+:mod:`pywhy_stats.independence.fisherz` Partial (Pearson) Correlation
+---------------------------------------------------------------------
 Partial correlation based on the Pearson correlation is equivalent to CMI in the setting
 of normally distributed data. Computing partial correlation is fast and efficient and
 thus attractive to use. However, this **relies on the assumption that the variables are Gaussiany**,
 which may be unrealistic in certain datasets.
 
+.. currentmodule:: pywhy_stats.independence
 .. autosummary::
    :toctree: generated/
 
     fisherz
 
-:mod:`pywhy_stats.power_divergence` Discrete, Categorical and Binary Data
--------------------------------------------------------------------------
+:mod:`pywhy_stats.independence.power_divergence` Discrete, Categorical and Binary Data
+--------------------------------------------------------------------------------------
 If one has discrete data, then the test to use is based on Chi-square tests. The :math:`G^2`
 class of tests will construct a contingency table based on the number of levels across
 each discrete variable. An exponential amount of data is needed for increasing levels
@@ -104,8 +105,8 @@ for a discrete variable.
 
     power_divergence
 
-Kernel-Approaches
------------------
+:mod:`pywhy_stats.independence.kci` Kernel-Approaches
+-----------------------------------------------------
 Kernel independence tests are statistical methods used to determine if two random variables are independent or
 conditionally independent. One such test is the Hilbert-Schmidt Independence Criterion (HSIC), which examines the
 independence between two random variables, X and Y. HSIC employs kernel methods and, more specifically, it computes
@@ -125,6 +126,12 @@ Kernel-based tests are attractive for many applications, since they are semi-par
 that have been shown to be robust in the machine-learning field. For more information, see :footcite:`Zhang2011`.
 
 
+.. currentmodule:: pywhy_stats.independence
+.. autosummary::
+   :toctree: generated/
+
+    kci
+
 Classifier-based Approaches
 ---------------------------
 Another suite of approaches that rely on permutation testing is the classifier-based approach.
@@ -144,9 +151,9 @@ helps maintain dependence between (X, Z) and (Y, Z) (if it exists), but generate
 conditionally independent dataset.
 
 
-=======================
-Conditional Discrepancy
-=======================
+=========================================
+Conditional Distribution 2-Sample Testing
+=========================================
 
 .. currentmodule:: pywhy_stats
 
@@ -170,23 +177,7 @@ indices of the distribution, one can convert the CD test:
 :math:`P_{i=j}(y|x) =? P_{i=k}(y|x)` into the CI test :math:`P(y|x,i) = P(y|x)`, which can
 be tested with the Chi-square CI tests.
 
-Kernel-Approaches
------------------
-Kernel-based tests are attractive since they are semi-parametric and use kernel-based ideas
-that have been shown to be robust in the machine-learning field. The Kernel CD test is a test
-that computes a test statistic from kernels of the data and uses a weighted permutation testing
-based on the estimated propensity scores to generate samples from the null distribution
-:footcite:`Park2021conditional`, which are then used to estimate a pvalue.
-
-
-Bregman-Divergences
--------------------
-The Bregman CD test is a divergence-based test
-that computes a test statistic from estimated Von-Neumann divergences of the data and uses a
-weighted permutation testing based on the estimated propensity scores to generate samples from the null distribution
-:footcite:`Yu2020Bregman`, which are then used to estimate a pvalue.
-
 ==========
 References
 ==========
-.. footbibliography::
+.. footbibliography::
diff --git a/doc/conf.py b/doc/conf.py
@@ -39,7 +39,7 @@
 
 # If your documentation needs a minimal Sphinx version, state it here.
 #
-needs_sphinx = "4.0"
+needs_sphinx = "5.0"
 
 # Add any Sphinx extension module names here, as strings. They can be
 # extensions coming with Sphinx (named 'sphinx.ext.*') or your custom
@@ -146,9 +146,7 @@
     "PValueResult": "pywhy_stats.pvalue_result.PValueResult",
     # numpy
     "NDArray": "numpy.ndarray",
-    # "ArrayLike": "numpy.typing.ArrayLike",
     "ArrayLike": ":term:`array_like`",
-    "fisherz": "pywhy_stats.fisherz",
 }
 
 autodoc_typehints_format = "short"

diff --git a/doc/whats_new/v0.1.rst b/doc/whats_new/v0.1.rst
@@ -26,7 +26,7 @@ Version 0.1
 Changelog
 ---------
 
-- |Feature| Implement partial correlation test :func:`pywhy_stats.fisherz`, by `Adam Li`_ (:pr:`7`)
+- |Feature| Implement partial correlation test :func:`pywhy_stats.independence.fisherz`, by `Adam Li`_ (:pr:`7`)
 - |Feature| Add (un)conditional kernel independence test by `Patrick Blöbaum`_, co-authored by `Adam Li`_ (:pr:`14`)
 - |Feature| Add categorical independence tests by `Adam Li`_, (:pr:`18`)