[ENH] Faster and more flexible code, and code sharing for kernel tests #19

Merged 23 commits on Aug 30, 2023
Changes from 10 commits
24 changes: 24 additions & 0 deletions CITATION.cff
@@ -0,0 +1,24 @@
# YAML 1.2
---
# Metadata for citation of this software according to the CFF format (https://citation-file-format.github.io/)
cff-version: 1.2.0
title: 'Pywhy-Stats: Statistical inference in Python.'
abstract: 'Pywhy-Stats is a Python library that provides a simple API for performing independence and conditional independence testing.'
authors:
- given-names: Adam
family-names: Li
affiliation: 'Department of Computer Science, Columbia University, New York, NY, USA'
orcid: 'https://orcid.org/0000-0001-8421-365X'
- given-names: Patrick
family-names: Blöbaum
affiliation: 'Amazon'
email: 'bloebp@amazon.com'
type: software
repository-code: 'https://github.com/py-why/pywhy-stats'
license: MIT
keywords:
- causality
- pywhy
- statistics
- independence testing
...
5 changes: 5 additions & 0 deletions CONTRIBUTING.md
@@ -141,6 +141,11 @@ When you're ready to contribute code to address an open issue, please follow the

</details>

5. Adding your name to the CITATION.cff file

We are a community-driven open-source project and want to make sure all contributors are acknowledged. If you are a new contributor, add your name
and relevant metadata to the ``CITATION.cff`` file.

### Writing docstrings

We use [Sphinx](https://www.sphinx-doc.org/en/master/index.html) to build our API docs, which automatically parses all docstrings
2 changes: 1 addition & 1 deletion doc/api.rst
@@ -57,7 +57,7 @@ contains the p-value and the test statistic and optionally additional information
Testing for conditional independence among variables is a core part
of many data analysis procedures.

-.. currentmodule:: pywhy_stats
+.. currentmodule:: pywhy_stats.independence
.. autosummary::
   :toctree: generated/

43 changes: 17 additions & 26 deletions doc/conditional_independence.rst
@@ -80,20 +80,21 @@ various proposals in the literature for estimating CMI, which we summarize here:
estimating :math:`P(y|x)` and :math:`P(y|x,z)`, which can be used as plug-in estimates
to the equation for CMI.
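
For reference, the quantity being estimated, conditional mutual information, can be
written in the standard form

.. math::

   I(X; Y \mid Z) = \mathbb{E}_{p(x, y, z)}\left[\log \frac{p(x, y \mid z)}{p(x \mid z)\, p(y \mid z)}\right],

so plug-in estimates of the conditional distributions above yield a plug-in estimate of CMI.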

-:mod:`pywhy_stats.fisherz` Partial (Pearson) Correlation
---------------------------------------------------------
+:mod:`pywhy_stats.independence.fisherz` Partial (Pearson) Correlation
+---------------------------------------------------------------------
Partial correlation based on the Pearson correlation is equivalent to CMI in the setting
of normally distributed data. Computing partial correlation is fast and efficient and
thus attractive to use. However, this **relies on the assumption that the variables are Gaussian**,
which may be unrealistic in certain datasets.

+.. currentmodule:: pywhy_stats.independence
.. autosummary::
   :toctree: generated/

   fisherz
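
As a quick illustration, here is a minimal sketch of using this module directly. The
``ind``/``condind`` call pattern follows the API shown in this PR; the ``pvalue``
attribute name on the returned ``PValueResult`` is assumed.

.. code-block:: python

   import numpy as np
   from pywhy_stats.independence import fisherz

   rng = np.random.default_rng(0)
   Z = rng.standard_normal((300, 1))
   X = Z + 0.5 * rng.standard_normal((300, 1))  # X and Y share the common cause Z
   Y = Z + 0.5 * rng.standard_normal((300, 1))

   print(fisherz.ind(X, Y).pvalue)         # small: X and Y are marginally dependent
   print(fisherz.condind(X, Y, Z).pvalue)  # large: X is independent of Y given Z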

-:mod:`pywhy_stats.power_divergence` Discrete, Categorical and Binary Data
--------------------------------------------------------------------------
+:mod:`pywhy_stats.independence.power_divergence` Discrete, Categorical and Binary Data
+--------------------------------------------------------------------------------------
If one has discrete data, then the test to use is based on Chi-square tests. The :math:`G^2`
class of tests will construct a contingency table based on the number of levels across
each discrete variable. An exponential amount of data is needed for increasing levels
@@ -104,8 +105,8 @@ for a discrete variable.

   power_divergence
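
A minimal sketch on synthetic categorical data (module path as introduced in this PR;
the ``pvalue`` attribute name is assumed):

.. code-block:: python

   import numpy as np
   from pywhy_stats.independence import power_divergence

   rng = np.random.default_rng(0)
   x = rng.integers(0, 3, size=1000)            # categorical variable with 3 levels
   y = (x + rng.integers(0, 2, size=1000)) % 3  # y depends on x

   print(power_divergence.ind(x, y).pvalue)     # small: x and y are dependent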

-Kernel-Approaches
------------------
+:mod:`pywhy_stats.independence.kci` Kernel-Approaches
+-----------------------------------------------------
Kernel independence tests are statistical methods used to determine if two random variables are independent or
conditionally independent. One such test is the Hilbert-Schmidt Independence Criterion (HSIC), which examines the
independence between two random variables, X and Y. HSIC employs kernel methods and, more specifically, it computes
@@ -125,6 +126,12 @@ Kernel-based tests are attractive for many applications, since they are semi-parametric and use kernel-based ideas
that have been shown to be robust in the machine-learning field. For more information, see :footcite:`Zhang2011`.


+.. currentmodule:: pywhy_stats.independence
+.. autosummary::
+   :toctree: generated/
+
+   kci
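
A minimal sketch of the kernel test on a nonlinear relationship (same ``ind``/``condind``
interface as the other independence modules; the ``pvalue`` attribute name is assumed):

.. code-block:: python

   import numpy as np
   from pywhy_stats.independence import kci

   rng = np.random.default_rng(0)
   X = rng.standard_normal((300, 1))
   Y = np.exp(X) + 0.1 * rng.standard_normal((300, 1))  # nonlinear dependence

   print(kci.ind(X, Y).pvalue)  # small: X and Y are dependent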

Classifier-based Approaches
---------------------------
Another suite of approaches that rely on permutation testing is the classifier-based approach.
@@ -144,9 +151,9 @@ helps maintain dependence between (X, Z) and (Y, Z) (if it exists), but generates a
conditionally independent dataset.
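
Such a test is not part of pywhy-stats in this PR; the following is a rough,
self-contained sketch of the idea (all names hypothetical), using a nearest-neighbor
permutation in Z to build the conditionally independent dataset and scikit-learn for
the classifier:

.. code-block:: python

   # Hypothetical sketch of a classifier-based CI test (in the spirit of CCIT).
   import numpy as np
   from sklearn.ensemble import RandomForestClassifier
   from sklearn.model_selection import train_test_split
   from sklearn.neighbors import NearestNeighbors

   def classifier_ci_statistic(X, Y, Z, random_state=0):
       """Held-out accuracy; near 0.5 supports X independent of Y given Z."""
       n = X.shape[0]
       # Swap each sample's Y with its nearest-Z neighbor's Y: this preserves the
       # (X, Z) and (Y, Z) dependence but breaks any X-Y link given Z.
       idx = NearestNeighbors(n_neighbors=2).fit(Z).kneighbors(Z)[1]
       Y_perm = Y[idx[:, 1]]

       data = np.vstack([np.hstack([X, Y, Z]), np.hstack([X, Y_perm, Z])])
       labels = np.r_[np.ones(n), np.zeros(n)]
       X_tr, X_te, y_tr, y_te = train_test_split(
           data, labels, test_size=0.3, random_state=random_state
       )
       clf = RandomForestClassifier(random_state=random_state).fit(X_tr, y_tr)
       return clf.score(X_te, y_te)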


-=======================
-Conditional Discrepancy
-=======================
+=========================================
+Conditional Distribution 2-Sample Testing
+=========================================

.. currentmodule:: pywhy_stats

@@ -170,23 +177,7 @@ indices of the distribution, one can convert the CD test:
:math:`P_{i=j}(y|x) =? P_{i=k}(y|x)` into the CI test :math:`P(y|x,i) = P(y|x)`, which can
be tested with the Chi-square CI tests.
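
Concretely, a hedged sketch of that conversion (positional ``condind`` arguments
follow the call pattern used in ``pywhy_stats.api``; the data are hypothetical):

.. code-block:: python

   import numpy as np
   from pywhy_stats.independence import power_divergence

   rng = np.random.default_rng(0)
   x1 = rng.integers(0, 3, size=500)             # group j
   y1 = (x1 + rng.integers(0, 2, size=500)) % 3  # P_{i=j}(y|x)
   x2 = rng.integers(0, 3, size=500)             # group k
   y2 = rng.integers(0, 3, size=500)             # a different P_{i=k}(y|x)

   x = np.concatenate([x1, x2])
   y = np.concatenate([y1, y2])
   i = np.repeat([0, 1], 500)                    # index of the distribution

   result = power_divergence.condind(i, y, x)    # tests P(y|x, i) = P(y|x)
   print(result.pvalue)                          # small: the groups differ in P(y|x)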

-Kernel-Approaches
------------------
-Kernel-based tests are attractive since they are semi-parametric and use kernel-based ideas
-that have been shown to be robust in the machine-learning field. The Kernel CD test is a test
-that computes a test statistic from kernels of the data and uses a weighted permutation testing
-based on the estimated propensity scores to generate samples from the null distribution
-:footcite:`Park2021conditional`, which are then used to estimate a pvalue.
-
-
-Bregman-Divergences
--------------------
-The Bregman CD test is a divergence-based test
-that computes a test statistic from estimated Von-Neumann divergences of the data and uses a
-weighted permutation testing based on the estimated propensity scores to generate samples from the null distribution
-:footcite:`Yu2020Bregman`, which are then used to estimate a pvalue.

==========
References
==========
-.. footbibliography::
\ No newline at end of file
+.. footbibliography::
4 changes: 1 addition & 3 deletions doc/conf.py
@@ -39,7 +39,7 @@

# If your documentation needs a minimal Sphinx version, state it here.
#
needs_sphinx = "4.0"
needs_sphinx = "5.0"

# Add any Sphinx extension module names here, as strings. They can be
# extensions coming with Sphinx (named 'sphinx.ext.*') or your custom
@@ -146,9 +146,7 @@
"PValueResult": "pywhy_stats.pvalue_result.PValueResult",
# numpy
"NDArray": "numpy.ndarray",
# "ArrayLike": "numpy.typing.ArrayLike",
"ArrayLike": ":term:`array_like`",
"fisherz": "pywhy_stats.fisherz",
}

autodoc_typehints_format = "short"
2 changes: 1 addition & 1 deletion doc/whats_new/v0.1.rst
@@ -26,7 +26,7 @@ Version 0.1
Changelog
---------

-- |Feature| Implement partial correlation test :func:`pywhy_stats.fisherz`, by `Adam Li`_ (:pr:`7`)
+- |Feature| Implement partial correlation test :func:`pywhy_stats.independence.fisherz`, by `Adam Li`_ (:pr:`7`)
- |Feature| Add (un)conditional kernel independence test by `Patrick Blöbaum`_, co-authored by `Adam Li`_ (:pr:`14`)
- |Feature| Add categorical independence tests by `Adam Li`_, (:pr:`18`)

5 changes: 3 additions & 2 deletions pywhy_stats/__init__.py
@@ -1,3 +1,4 @@
-from . import fisherz, kci
from ._version import __version__  # noqa: F401
-from .independence import Methods, independence_test
+from .api import Methods, independence_test
+from .independence import fisherz, kci, power_divergence
+from .pvalue_result import PValueResult
26 changes: 16 additions & 10 deletions pywhy_stats/independence.py → pywhy_stats/api.py
@@ -3,10 +3,11 @@
from typing import Optional
from warnings import warn

+import numpy as np
import scipy.stats
from numpy.typing import ArrayLike

-from pywhy_stats import fisherz, kci
+from pywhy_stats.independence import fisherz, kci

from .pvalue_result import PValueResult

@@ -18,10 +19,10 @@ class Methods(Enum):
"""Choose an automatic method based on the data."""

FISHERZ = fisherz
""":py:mod:`~pywhy_stats.fisherz`: Fisher's Z test for independence"""
""":py:mod:`pywhy_stats.independence.fisherz`: Fisher's Z test for independence"""

KCI = kci
""":py:mod:`~pywhy_stats.kci`: Conditional kernel independence test"""
""":py:mod:`pywhy_stats.independence.kci`: Conditional kernel independence test"""


def independence_test(
@@ -59,32 +60,37 @@

    See Also
    --------
-    fisherz : Fisher's Z test for independence
-    kci : Kernel Conditional Independence test
+    pywhy_stats.independence.fisherz : Fisher's Z test for independence
+    pywhy_stats.independence.kci : Kernel Conditional Independence test
    """
    method_module: ModuleType
    if method == Methods.AUTO:
        method_module = Methods.KCI
+    elif not isinstance(method, Methods):
+        raise ValueError(
+            f"Invalid method type. Expected one of {Methods.__members__.keys()}, "
+            f"but got {method}."
+        )
    else:
-        method_module = method
+        method_module = method  # type: ignore

    if method_module == Methods.FISHERZ:
        if condition_on is None:
            data = [X, Y]
        else:
            data = [X, Y, condition_on]
        for _data in data:
-            _, pval = scipy.stats.normaltest(_data)
+            res = scipy.stats.normaltest(_data, axis=0)

            # XXX: we should add pingouin as an optional dependency for doing multi-comp stuff
-            if pval < 0.05:
+            if (np.atleast_1d(res.pvalue) < 0.05).any():
                warn(
                    "The provided data does not seem to be Gaussian, but the Fisher-Z test "
                    "assumes that the data follows a Gaussian distribution. The result should "
                    "be interpreted carefully or consider a different independence test method."
                )

    if condition_on is None:
-        return method_module.ind(X, Y, method, **kwargs)
+        return method_module.value.ind(X, Y, **kwargs)
    else:
-        return method_module.condind(X, Y, condition_on, method, **kwargs)
+        return method_module.value.condind(X, Y, condition_on, **kwargs)
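
For reference, a minimal usage sketch of the refactored top-level API (the keyword name
``method`` and the ``pvalue`` attribute are assumed from the signatures above):

```python
import numpy as np

from pywhy_stats import Methods, independence_test

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 1))
Y = X + rng.standard_normal((200, 1))

res = independence_test(X, Y, method=Methods.FISHERZ)  # warns if data look non-Gaussian
print(res.pvalue)
```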
1 change: 1 addition & 0 deletions pywhy_stats/independence/__init__.py
@@ -0,0 +1 @@
from . import fisherz, kci, power_divergence
pywhy_stats/fisherz.py → pywhy_stats/independence/fisherz.py
@@ -4,7 +4,7 @@
It works on Gaussian random variables.

When the data is not Gaussian, this test is not valid. In this case, we recommend
-using the Kernel independence test at <insert link>.
+using the Kernel independence test at `pywhy_stats.kci`.

Examples
--------
@@ -21,7 +21,7 @@
from numpy.typing import ArrayLike
from scipy.stats import norm

-from .pvalue_result import PValueResult
+from pywhy_stats.pvalue_result import PValueResult


def ind(X: ArrayLike, Y: ArrayLike, correlation_matrix: Optional[ArrayLike] = None) -> PValueResult: