[ENH] Faster and more flexible code, and code sharing for kernel tests #19

Merged 23 commits on Aug 30, 2023
Changes from 10 commits
24 changes: 24 additions & 0 deletions CITATION.cff
@@ -0,0 +1,24 @@
# YAML 1.2
---
# Metadata for citation of this software according to the CFF format (https://citation-file-format.github.io/)
cff-version: 1.2.0
title: 'Pywhy-Stats: Statistical inference in Python.'
abstract: 'Pywhy-Stats is a Python library that provides a simple API for performing independence and conditional independence testing.'
authors:
- given-names: Adam
family-names: Li
affiliation: 'Department of Computer Science, Columbia University, New York, NY, USA'
orcid: 'https://orcid.org/0000-0001-8421-365X'
- given-names: Patrick
family-names: Blöbaum
affiliation: 'Amazon'
email: 'bloebp@amazon.com'
type: software
repository-code: 'https://github.com/py-why/pywhy-stats'
license: MIT
keywords:
- causality
- pywhy
- statistics
- independence testing
...
5 changes: 5 additions & 0 deletions CONTRIBUTING.md
@@ -141,6 +141,11 @@ When you're ready to contribute code to address an open issue, please follow the

</details>

5. Adding your name to the CITATION.cff file

We are a community-driven open-source project and want to make sure all contributors are acknowledged. If you are a new contributor, add your name
and relevant metadata to the ``CITATION.cff`` file.

### Writing docstrings

We use [Sphinx](https://www.sphinx-doc.org/en/master/index.html) to build our API docs, which automatically parses all docstrings
2 changes: 1 addition & 1 deletion doc/api.rst
@@ -57,7 +57,7 @@ contains the p-value and the test statistic and optionally additional information
Testing for conditional independence among variables is a core part
of many data analysis procedures.

-.. currentmodule:: pywhy_stats
+.. currentmodule:: pywhy_stats.independence
.. autosummary::
   :toctree: generated/

43 changes: 17 additions & 26 deletions doc/conditional_independence.rst
@@ -80,20 +80,21 @@ various proposals in the literature for estimating CMI, which we summarize here:
estimating :math:`P(y|x)` and :math:`P(y|x,z)`, which can be used as plug-in estimates
to the equation for CMI.
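
For reference, the quantity being estimated, conditional mutual information, can be
written in the standard form

.. math::

   I(X; Y \mid Z) = \mathbb{E}_{p(x, y, z)}\left[\log \frac{p(x, y \mid z)}{p(x \mid z)\, p(y \mid z)}\right],

so plug-in estimates of the conditional distributions above yield a plug-in estimate of CMI.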

-:mod:`pywhy_stats.fisherz` Partial (Pearson) Correlation
---------------------------------------------------------
+:mod:`pywhy_stats.independence.fisherz` Partial (Pearson) Correlation
+---------------------------------------------------------------------
Partial correlation based on the Pearson correlation is equivalent to CMI in the setting
of normally distributed data. Computing partial correlation is fast and efficient and
thus attractive to use. However, this **relies on the assumption that the variables are Gaussian**,
which may be unrealistic in certain datasets.

+.. currentmodule:: pywhy_stats.independence
.. autosummary::
   :toctree: generated/

   fisherz
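
As a quick illustration, here is a minimal sketch of using this module directly. The
``ind``/``condind`` call pattern follows the API shown in this PR; the ``pvalue``
attribute name on the returned ``PValueResult`` is assumed.

.. code-block:: python

   import numpy as np
   from pywhy_stats.independence import fisherz

   rng = np.random.default_rng(0)
   Z = rng.standard_normal((300, 1))
   X = Z + 0.5 * rng.standard_normal((300, 1))  # X and Y share the common cause Z
   Y = Z + 0.5 * rng.standard_normal((300, 1))

   print(fisherz.ind(X, Y).pvalue)         # small: X and Y are marginally dependent
   print(fisherz.condind(X, Y, Z).pvalue)  # large: X is independent of Y given Z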

-:mod:`pywhy_stats.power_divergence` Discrete, Categorical and Binary Data
--------------------------------------------------------------------------
+:mod:`pywhy_stats.independence.power_divergence` Discrete, Categorical and Binary Data
+--------------------------------------------------------------------------------------
If one has discrete data, then the test to use is based on Chi-square tests. The :math:`G^2`
class of tests will construct a contingency table based on the number of levels across
each discrete variable. An exponential amount of data is needed for increasing levels
@@ -104,8 +105,8 @@ for a discrete variable.

   power_divergence
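
A minimal sketch on synthetic categorical data (module path as introduced in this PR;
the ``pvalue`` attribute name is assumed):

.. code-block:: python

   import numpy as np
   from pywhy_stats.independence import power_divergence

   rng = np.random.default_rng(0)
   x = rng.integers(0, 3, size=1000)            # categorical variable with 3 levels
   y = (x + rng.integers(0, 2, size=1000)) % 3  # y depends on x

   print(power_divergence.ind(x, y).pvalue)     # small: x and y are dependent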

-Kernel-Approaches
------------------
+:mod:`pywhy_stats.independence.kci` Kernel-Approaches
+-----------------------------------------------------
Kernel independence tests are statistical methods used to determine if two random variables are independent or
conditionally independent. One such test is the Hilbert-Schmidt Independence Criterion (HSIC), which examines the
independence between two random variables, X and Y. HSIC employs kernel methods and, more specifically, it computes
@@ -125,6 +126,12 @@ Kernel-based tests are attractive for many applications, since they are semi-parametric and use kernel-based ideas
that have been shown to be robust in the machine-learning field. For more information, see :footcite:`Zhang2011`.


+.. currentmodule:: pywhy_stats.independence
+.. autosummary::
+   :toctree: generated/
+
+   kci
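
A minimal sketch of the kernel test on a nonlinear relationship (same ``ind``/``condind``
interface as the other independence modules; the ``pvalue`` attribute name is assumed):

.. code-block:: python

   import numpy as np
   from pywhy_stats.independence import kci

   rng = np.random.default_rng(0)
   X = rng.standard_normal((300, 1))
   Y = np.exp(X) + 0.1 * rng.standard_normal((300, 1))  # nonlinear dependence

   print(kci.ind(X, Y).pvalue)  # small: X and Y are dependent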

Classifier-based Approaches
---------------------------
Another suite of approaches that rely on permutation testing is the classifier-based approach.
@@ -144,9 +151,9 @@ helps maintain dependence between (X, Z) and (Y, Z) (if it exists), but generates a
conditionally independent dataset.
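
Such a test is not part of pywhy-stats in this PR; the following is a rough,
self-contained sketch of the idea (all names hypothetical), using a nearest-neighbor
permutation in Z to build the conditionally independent dataset and scikit-learn for
the classifier:

.. code-block:: python

   # Hypothetical sketch of a classifier-based CI test (in the spirit of CCIT).
   import numpy as np
   from sklearn.ensemble import RandomForestClassifier
   from sklearn.model_selection import train_test_split
   from sklearn.neighbors import NearestNeighbors

   def classifier_ci_statistic(X, Y, Z, random_state=0):
       """Held-out accuracy; near 0.5 supports X independent of Y given Z."""
       n = X.shape[0]
       # Swap each sample's Y with its nearest-Z neighbor's Y: this preserves the
       # (X, Z) and (Y, Z) dependence but breaks any X-Y link given Z.
       idx = NearestNeighbors(n_neighbors=2).fit(Z).kneighbors(Z)[1]
       Y_perm = Y[idx[:, 1]]

       data = np.vstack([np.hstack([X, Y, Z]), np.hstack([X, Y_perm, Z])])
       labels = np.r_[np.ones(n), np.zeros(n)]
       X_tr, X_te, y_tr, y_te = train_test_split(
           data, labels, test_size=0.3, random_state=random_state
       )
       clf = RandomForestClassifier(random_state=random_state).fit(X_tr, y_tr)
       return clf.score(X_te, y_te)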


-=======================
-Conditional Discrepancy
-=======================
+=========================================
+Conditional Distribution 2-Sample Testing
+=========================================

.. currentmodule:: pywhy_stats

@@ -170,23 +177,7 @@ indices of the distribution, one can convert the CD test:
:math:`P_{i=j}(y|x) =? P_{i=k}(y|x)` into the CI test :math:`P(y|x,i) = P(y|x)`, which can
be tested with the Chi-square CI tests.
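
Concretely, a hedged sketch of that conversion (positional ``condind`` arguments
follow the call pattern used in ``pywhy_stats.api``; the data are hypothetical):

.. code-block:: python

   import numpy as np
   from pywhy_stats.independence import power_divergence

   rng = np.random.default_rng(0)
   x1 = rng.integers(0, 3, size=500)             # group j
   y1 = (x1 + rng.integers(0, 2, size=500)) % 3  # P_{i=j}(y|x)
   x2 = rng.integers(0, 3, size=500)             # group k
   y2 = rng.integers(0, 3, size=500)             # a different P_{i=k}(y|x)

   x = np.concatenate([x1, x2])
   y = np.concatenate([y1, y2])
   i = np.repeat([0, 1], 500)                    # index of the distribution

   result = power_divergence.condind(i, y, x)    # tests P(y|x, i) = P(y|x)
   print(result.pvalue)                          # small: the groups differ in P(y|x)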

-Kernel-Approaches
------------------
-Kernel-based tests are attractive since they are semi-parametric and use kernel-based ideas
-that have been shown to be robust in the machine-learning field. The Kernel CD test is a test
-that computes a test statistic from kernels of the data and uses a weighted permutation testing
-based on the estimated propensity scores to generate samples from the null distribution
-:footcite:`Park2021conditional`, which are then used to estimate a pvalue.
-
-
-Bregman-Divergences
--------------------
-The Bregman CD test is a divergence-based test
-that computes a test statistic from estimated Von-Neumann divergences of the data and uses a
-weighted permutation testing based on the estimated propensity scores to generate samples from the null distribution
-:footcite:`Yu2020Bregman`, which are then used to estimate a pvalue.

==========
References
==========
-.. footbibliography::
\ No newline at end of file
+.. footbibliography::
4 changes: 1 addition & 3 deletions doc/conf.py
@@ -39,7 +39,7 @@

# If your documentation needs a minimal Sphinx version, state it here.
#
needs_sphinx = "4.0"
needs_sphinx = "5.0"

# Add any Sphinx extension module names here, as strings. They can be
# extensions coming with Sphinx (named 'sphinx.ext.*') or your custom
@@ -146,9 +146,7 @@
"PValueResult": "pywhy_stats.pvalue_result.PValueResult",
# numpy
"NDArray": "numpy.ndarray",
# "ArrayLike": "numpy.typing.ArrayLike",
"ArrayLike": ":term:`array_like`",
"fisherz": "pywhy_stats.fisherz",
}

autodoc_typehints_format = "short"
2 changes: 1 addition & 1 deletion doc/whats_new/v0.1.rst
@@ -26,7 +26,7 @@ Version 0.1
Changelog
---------

-- |Feature| Implement partial correlation test :func:`pywhy_stats.fisherz`, by `Adam Li`_ (:pr:`7`)
+- |Feature| Implement partial correlation test :func:`pywhy_stats.independence.fisherz`, by `Adam Li`_ (:pr:`7`)
- |Feature| Add (un)conditional kernel independence test by `Patrick Blöbaum`_, co-authored by `Adam Li`_ (:pr:`14`)
- |Feature| Add categorical independence tests by `Adam Li`_, (:pr:`18`)

5 changes: 3 additions & 2 deletions pywhy_stats/__init__.py
@@ -1,3 +1,4 @@
-from . import fisherz, kci
from ._version import __version__  # noqa: F401
-from .independence import Methods, independence_test
+from .api import Methods, independence_test
+from .independence import fisherz, kci, power_divergence
+from .pvalue_result import PValueResult
26 changes: 16 additions & 10 deletions pywhy_stats/independence.py → pywhy_stats/api.py
@@ -3,10 +3,11 @@
from typing import Optional
from warnings import warn

+import numpy as np
import scipy.stats
from numpy.typing import ArrayLike

-from pywhy_stats import fisherz, kci
+from pywhy_stats.independence import fisherz, kci

from .pvalue_result import PValueResult

@@ -18,10 +19,10 @@ class Methods(Enum):
"""Choose an automatic method based on the data."""

FISHERZ = fisherz
""":py:mod:`~pywhy_stats.fisherz`: Fisher's Z test for independence"""
""":py:mod:`pywhy_stats.independence.fisherz`: Fisher's Z test for independence"""

KCI = kci
""":py:mod:`~pywhy_stats.kci`: Conditional kernel independence test"""
""":py:mod:`pywhy_stats.independence.kci`: Conditional kernel independence test"""


def independence_test(
@@ -59,32 +60,37 @@

    See Also
    --------
-    fisherz : Fisher's Z test for independence
-    kci : Kernel Conditional Independence test
+    pywhy_stats.independence.fisherz : Fisher's Z test for independence
+    pywhy_stats.independence.kci : Kernel Conditional Independence test
    """
    method_module: ModuleType
    if method == Methods.AUTO:
        method_module = Methods.KCI
+    elif not isinstance(method, Methods):
+        raise ValueError(
+            f"Invalid method type. Expected one of {Methods.__members__.keys()}, "
+            f"but got {method}."
+        )
    else:
-        method_module = method
+        method_module = method  # type: ignore

    if method_module == Methods.FISHERZ:
        if condition_on is None:
            data = [X, Y]
        else:
            data = [X, Y, condition_on]
        for _data in data:
-            _, pval = scipy.stats.normaltest(_data)
+            res = scipy.stats.normaltest(_data, axis=0)

            # XXX: we should add pingouin as an optional dependency for doing multi-comp stuff
-            if pval < 0.05:
+            if (np.atleast_1d(res.pvalue) < 0.05).any():
                warn(
                    "The provided data does not seem to be Gaussian, but the Fisher-Z test "
                    "assumes that the data follows a Gaussian distribution. The result should "
                    "be interpreted carefully or consider a different independence test method."
                )

    if condition_on is None:
-        return method_module.ind(X, Y, method, **kwargs)
+        return method_module.value.ind(X, Y, **kwargs)
    else:
-        return method_module.condind(X, Y, condition_on, method, **kwargs)
+        return method_module.value.condind(X, Y, condition_on, **kwargs)
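
For reference, a minimal usage sketch of the refactored top-level API (the keyword name
``method`` and the ``pvalue`` attribute are assumed from the signatures above):

```python
import numpy as np

from pywhy_stats import Methods, independence_test

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 1))
Y = X + rng.standard_normal((200, 1))

res = independence_test(X, Y, method=Methods.FISHERZ)  # warns if data look non-Gaussian
print(res.pvalue)
```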
1 change: 1 addition & 0 deletions pywhy_stats/independence/__init__.py
@@ -0,0 +1 @@
from . import fisherz, kci, power_divergence
pywhy_stats/fisherz.py → pywhy_stats/independence/fisherz.py
@@ -4,7 +4,7 @@
It works on Gaussian random variables.

When the data is not Gaussian, this test is not valid. In this case, we recommend
-using the Kernel independence test at <insert link>.
+using the Kernel independence test at `pywhy_stats.kci`.

Examples
--------
@@ -21,7 +21,7 @@
from numpy.typing import ArrayLike
from scipy.stats import norm

-from .pvalue_result import PValueResult
+from pywhy_stats.pvalue_result import PValueResult


def ind(X: ArrayLike, Y: ArrayLike, correlation_matrix: Optional[ArrayLike] = None) -> PValueResult: