DOC update document references & include summary for papers at home page (#104)

* DOC include summary of papers used in package
* FIX correct module location & optimize reference
PSSF23 authored Aug 7, 2023
1 parent d179de3 commit b44c951
Showing 9 changed files with 84 additions and 46 deletions.
10 changes: 5 additions & 5 deletions README.md
@@ -60,18 +60,18 @@ We use the ``spin`` CLI to abstract away build details:

# run the build using Meson/Ninja
./spin build

# you can run the following command to see what other options there are
./spin --help
./spin build --help

# For example, you might want to start from a clean build
./spin build --clean

# or build in parallel for faster builds
./spin build -j 2

# you will need to double check the build-install has the proper path
# this might be different from machine to machine
export PYTHONPATH=${PWD}/build-install/usr/lib/python3.9/site-packages
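
# optional sanity check (an editor's sketch, not part of the original docs):
# the import below should resolve to a file under build-install/
python -c "import sktree; print(sktree.__file__)"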

@@ -105,4 +105,4 @@ Alternatively, you can use editable installs

References
==========
-[1]: [`Li, Adam, et al. "Manifold Oblique Random Forests: Towards Closing the Gap on Convolutional Deep Networks." arXiv preprint arXiv:1909.11799 (2019)`](https://arxiv.org/abs/1909.11799)
+[1]: [`Li, Adam, et al. "Manifold Oblique Random Forests: Towards Closing the Gap on Convolutional Deep Networks" SIAM Journal on Mathematics of Data Science, 5(1), 77-96, 2023`](https://doi.org/10.1137/21M1449117)
8 changes: 4 additions & 4 deletions docs/api.rst
@@ -54,15 +54,15 @@ on unsupervised criteria such as variance and BIC.
.. currentmodule:: sktree
.. autosummary::
:toctree: generated/

UnsupervisedRandomForest
UnsupervisedObliqueRandomForest

The trees that comprise those forests are also available as standalone classes.

.. autosummary::
:toctree: generated/

tree.UnsupervisedDecisionTree
tree.UnsupervisedObliqueDecisionTree
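
As a quick orientation, here is a minimal sketch of how these estimators are meant to be used. It assumes the scikit-learn-style ``fit`` contract described above; the ``compute_similarity_matrix`` helper is an assumption based on the ``SimMatrixMixin`` that appears later in this diff, so check the class reference for the exact name:

```python
import numpy as np
from sktree import UnsupervisedRandomForest

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))  # unlabeled data: 100 samples, 5 features

est = UnsupervisedRandomForest(n_estimators=50, random_state=0)
est.fit(X)  # no y: splits are chosen by unsupervised criteria (variance / BIC)

# assumed helper from SimMatrixMixin: pairwise similarity measured by how
# often two samples co-occur in the same leaf across the forest
sim = est.compute_similarity_matrix(X)
print(sim.shape)  # (100, 100)
```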

@@ -84,7 +84,7 @@ provide a natural way to compute neighbors based on the splits. We provide
an API for extracting the nearest neighbors from a tree model. This provides
an interface similar to :class:`~sklearn.neighbors.NearestNeighbors`.

-.. currentmodule:: sktree
+.. currentmodule:: sktree.neighbors
.. autosummary::
:toctree: generated/

@@ -120,4 +120,4 @@ for the entropy, MI and CMI of the Gaussian distributions.

mi_gaussian
cmi_gaussian
entropy_gaussian
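
For context, these helpers compute analytic quantities for Gaussian distributions. Below is a sketch of the standard closed form they are presumably built on (the textbook formula, not necessarily the package's exact implementation):

```python
import numpy as np

def entropy_gaussian_closed_form(cov):
    """Differential entropy of N(mu, cov): 0.5 * ln((2*pi*e)^d * det(cov))."""
    cov = np.atleast_2d(cov)
    d = cov.shape[0]
    return 0.5 * (d * np.log(2 * np.pi * np.e) + np.linalg.slogdet(cov)[1])

# mutual information of jointly Gaussian (X, Y) from the block covariance:
# I(X; Y) = H(X) + H(Y) - H(X, Y)
cov = np.array([[1.0, 0.5],
                [0.5, 1.0]])
h_x = entropy_gaussian_closed_form(cov[:1, :1])
h_y = entropy_gaussian_closed_form(cov[1:, 1:])
h_xy = entropy_gaussian_closed_form(cov)
print(h_x + h_y - h_xy)  # equals -0.5 * ln(1 - rho^2) for correlation rho
```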
26 changes: 26 additions & 0 deletions docs/index.rst
@@ -4,12 +4,38 @@ Scikit-tree is a package for modern tree-based algorithms for supervised and unsupervised
learning problems. It extends the robust API of `scikit-learn <https://github.com/scikit-learn/scikit-learn>`_
for tree algorithms that achieve strong performance in benchmark tasks.

Our package implements unsupervised forests (Geodesic Forests
[Madhyastha2020]_), oblique random forests (SPORF [Tomita2020]_ and
MORF [Li2023]_), and honest forests [Perry2021]_.
In the near future, we also plan to include extended isolation forests
and stream decision forests [Xu2022]_.

We encourage you to use the package for your research and also to build on it
with relevant Pull Requests. See our examples for walk-throughs of how to use the package.
Also, see our `contributing guide <https://github.com/neurodata/scikit-tree/blob/main/CONTRIBUTING.md>`_.

We are licensed under BSD-3 (see `License <https://github.com/neurodata/scikit-tree/blob/main/LICENSE>`_).

.. topic:: References

.. [Madhyastha2020] Madhyastha, Meghana, et al. :doi:`"Geodesic Forests"
<10.1145/3394486.3403094>`, KDD 2020, 513-523, 2020.
.. [Tomita2020] Tomita, Tyler M., et al. "Sparse Projection Oblique
Randomer Forests", The Journal of Machine Learning Research, 21(104),
1-39, 2020.
.. [Li2023] Li, Adam, et al. :doi:`"Manifold Oblique Random Forests: Towards
Closing the Gap on Convolutional Deep Networks" <10.1137/21M1449117>`,
SIAM Journal on Mathematics of Data Science, 5(1), 77-96, 2023.
.. [Perry2021] Perry, Ronan, et al. :arxiv:`"Random Forests for Adaptive
Nearest Neighbor Estimation of Information-Theoretic Quantities"
<1907.00325>`, arXiv preprint arXiv:1907.00325, 2021.
.. [Xu2022] Xu, Haoyin, et al. :arxiv:`"Simplest Streaming Trees"
<2110.08483>`, arXiv preprint arXiv:2110.08483, 2022.

Contents
--------

8 changes: 4 additions & 4 deletions docs/install.rst
@@ -21,7 +21,7 @@ Installing with ``pip``

.. code-block:: bash
-pip install scikit-tree
+pip install sktree
Installing from source with Meson
---------------------------------
@@ -43,12 +43,12 @@ Then run installation of build packages
pip install -r build_requirements.txt
pip install spin
# use spin CLI to run Meson build locally
./spin build -j 2
# you can now run tests
./spin test
via pip, you will be able to install in editable mode (pending Meson-Python support).

@@ -70,7 +70,7 @@ First, create a virtual environment using Conda.
conda create -n sklearn-dev python=3.9

# activate the virtual environment and install necessary packages to build from source

conda activate sklearn-dev
conda install -c conda-forge numpy scipy cython joblib threadpoolctl pytest compilers llvm-openmp poetry

20 changes: 11 additions & 9 deletions docs/modules/ensemble.rst
@@ -27,9 +27,9 @@ more information and intuition, see

.. topic:: References

-.. [B2001] L. Breiman, "Random Forests", Machine Learning, 45(1), 5-32, 2001.
+.. [B2001] Breiman, L. "Random Forests", Machine Learning, 45(1), 5-32, 2001.
-* P. Geurts, D. Ernst., and L. Wehenkel, "Extremely randomized
+.. [G2006] Geurts, P., Ernst, D., and Wehenkel, L. "Extremely randomized
trees", Machine Learning, 63(1), 3-42, 2006.
.. _oblique_forest_feature_importance:
@@ -52,8 +52,8 @@ By **averaging** the estimates of predictive ability over several randomized
trees one can **reduce the variance** of such an estimate and use it
for feature selection. This is known as the mean decrease in impurity, or MDI.
Refer to [L2014]_ for more information on MDI and feature importance
-evaluation with Random Forests. We implement the approach taken in [Li2019]_
-and [Tomita2015]_.
+evaluation with Random Forests. We implement the approach taken in [Li2023]_
+and [Tomita2020]_.

.. warning::

@@ -76,12 +76,14 @@ to the prediction function.

.. topic:: References

-.. [L2014] G. Louppe, :arxiv:`"Understanding Random Forests: From Theory to
+.. [L2014] Louppe, G. :arxiv:`"Understanding Random Forests: From Theory to
   Practice" <1407.7502>`,
   PhD Thesis, U. of Liege, 2014.
-.. [Li2019] Li, Adam, et al. :arxiv:`"Manifold Oblique Random Forests: Towards
-   Closing the Gap on Convolutional Deep Networks."` arXiv preprint arXiv:1909.11799 (2019).
+.. [Li2023] Li, Adam, et al. :doi:`"Manifold Oblique Random Forests: Towards
+   Closing the Gap on Convolutional Deep Networks" <10.1137/21M1449117>`,
+   SIAM Journal on Mathematics of Data Science, 5(1), 77-96, 2023.
-.. [Tomita2015] Tomita, Tyler M., et al. :arxiv:`"Sparse Projection Oblique Randomer Forests."`
-   arXiv preprint arXiv:1506.03410 (2015).
+.. [Tomita2020] Tomita, Tyler M., et al. "Sparse Projection Oblique
+   Randomer Forests", The Journal of Machine Learning Research, 21(104),
+   1-39, 2020.
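
To make the MDI workflow above concrete, here is a minimal sketch using the scikit-learn-style ``feature_importances_`` attribute; the class name ``ObliqueRandomForestClassifier`` is an assumption based on the SPORF implementation this page references:

```python
import numpy as np
from sklearn.datasets import make_classification
# assumed import path, following the package layout in this diff
from sktree import ObliqueRandomForestClassifier

X, y = make_classification(n_samples=200, n_features=10, random_state=0)

forest = ObliqueRandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X, y)

# MDI: impurity-based importances, averaged over the randomized trees
importances = forest.feature_importances_
print(np.argsort(importances)[::-1][:3])  # indices of the top-3 features
```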
32 changes: 19 additions & 13 deletions docs/modules/unsupervised_tree.rst
@@ -6,9 +6,9 @@ Unsupervised Decision Trees

.. currentmodule:: sklearn.tree

In unsupervised learning, the goal is to identify patterns
or structure in data without using labeled examples. Clustering is a common
unsupervised learning technique that groups similar examples together
based on their features. Unsupervised tree models are an adaptive way of generating
clusters of samples. For information on supervised tree models, see :ref:`supervised_tree`.

@@ -29,7 +29,7 @@ The two-means split finds the cutpoint that minimizes the one-dimensional
2-means objective, which is finding the cutoff point where the total variance
from cluster 1 and cluster 2 are minimal.

.. math::
\min_s \sum_{i=1}^s (x_i - \hat{\mu}_1)^2 + \sum_{i=s+1}^N (x_i - \hat{\mu}_2)^2
where x is an N-dimensional feature vector, N is the number of samples, and
@@ -38,14 +38,15 @@ the \mu terms are the estimated means of each cluster 1 and 2.
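
A brute-force NumPy sketch of this objective (for illustration only; the package implements the split search inside its tree builders):

```python
import numpy as np

def two_means_cutpoint(x):
    """Evaluate the 1-D 2-means objective at every cutpoint of sorted x."""
    x = np.sort(x)
    best_s, best_cost = None, np.inf
    for s in range(1, len(x)):  # split into x[:s] and x[s:]
        left, right = x[:s], x[s:]
        cost = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if cost < best_cost:
            best_s, best_cost = s, cost
    return best_s, best_cost

x = np.concatenate([np.random.normal(0, 1, 50), np.random.normal(5, 1, 50)])
print(two_means_cutpoint(x))  # cutpoint lands near the boundary of the two clusters
```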
Fast-BIC
~~~~~~~~

The Bayesian Information Criterion (BIC) is a popular model selection
criterion based on the log likelihood of the model given the data.
-Fast-BIC is a method that combines the speed of the :class:`sklearn.cluster.KMeans` clustering
-method with the model flexibility of Mclust-BIC. It sorts data for each
-feature and tries all possible splits to assign data points to one of
-two Gaussian distributions based on their position relative to the split.
-The parameters for each cluster are estimated using maximum likelihood
-estimation (MLE).The method performs hard clustering rather than soft
+Fast-BIC [Madhyastha2020]_ is a method that combines the speed of the
+:class:`sklearn.cluster.KMeans` clustering method with the model flexibility
+of Mclust-BIC. It sorts data for each feature and tries all possible splits to
+assign data points to one of two Gaussian distributions based on their position
+relative to the split.
+The parameters for each cluster are estimated using maximum likelihood
+estimation (MLE). The method performs hard clustering rather than soft
clustering like in GMM, resulting in a simpler calculation of the likelihood.

.. math::
@@ -63,9 +64,14 @@ where the prior, mean, and variance are defined as follows, respectively:
.. _unsup_evaluation:

.. topic:: References

.. [Madhyastha2020] Madhyastha, Meghana, et al. :doi:`"Geodesic Forests"
<10.1145/3394486.3403094>`, KDD 2020, 513-523, 2020.
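
Since the full BIC expression is not shown in this hunk, here is a hedged NumPy sketch of the Fast-BIC idea for a single feature: sort, try each cutpoint, fit both Gaussians by MLE under hard assignment, and keep the split with the lowest BIC. The penalty uses the standard BIC form and may differ from the package's exact expression:

```python
import numpy as np

def fast_bic_cutpoint(x, eps=1e-12):
    """Choose the cutpoint minimizing BIC under a hard two-Gaussian model."""
    x = np.sort(x)
    n = len(x)
    best_s, best_bic = None, np.inf
    for s in range(2, n - 1):           # keep at least 2 points per side
        ll = 0.0
        for part in (x[:s], x[s:]):
            pi = len(part) / n          # cluster prior (MLE)
            var = part.var() + eps      # cluster variance (MLE)
            ll += len(part) * (np.log(pi) - 0.5 * np.log(2 * np.pi * var))
            ll -= 0.5 * ((part - part.mean()) ** 2).sum() / var
        bic = 5 * np.log(n) - 2 * ll    # 5 params: 2 means, 2 variances, 1 prior
        if bic < best_bic:
            best_s, best_bic = s, bic
    return best_s, best_bic

x = np.concatenate([np.random.normal(0, 1, 50), np.random.normal(4, 1, 50)])
print(fast_bic_cutpoint(x))
```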
Evaluating Unsupervised Trees
-----------------------------

In clustering settings, there may be no natural
notion of “true” class labels; thus, the efficacy of the clustering scheme is
often measured based on distance-based metrics such as :func:`sklearn.metrics.adjusted_rand_score`.
18 changes: 11 additions & 7 deletions docs/references.bib
@@ -11,15 +11,19 @@ @article{breiman2001random
publisher = {Springer}
}

-@article{Li2019manifold,
-  title = {Manifold Oblique Random Forests: Towards Closing the Gap on Convolutional Deep Networks},
-  author = {Li, Adam and Perry, Ronan and Huynh, Chester and Tomita, Tyler M and Mehta, Ronak and Arroyo, Jesus and Patsolic, Jesse and Falk, Benjamin and Vogelstein, Joshua T},
-  journal = {arXiv preprint arXiv:1909.11799},
-  year = {2019}
+@article{Li2023manifold,
+  author = {Li, Adam and Perry, Ronan and Huynh, Chester and Tomita, Tyler M. and Mehta, Ronak and Arroyo, Jesus and Patsolic, Jesse and Falk, Ben and Sarma, Sridevi and Vogelstein, Joshua},
+  title = {Manifold Oblique Random Forests: Towards Closing the Gap on Convolutional Deep Networks},
+  journal = {SIAM Journal on Mathematics of Data Science},
+  volume = {5},
+  number = {1},
+  pages = {77-96},
+  year = {2023},
+  doi = {10.1137/21M1449117},
}

@article{perry2021random,
title={Random Forests for Adaptive Nearest Neighbor Estimation of Information-Theoretic Quantities},
author={Ronan Perry and Ronak Mehta and Richard Guo and Eva Yezerets and Jesús Arroyo and Mike Powell and Hayden Helm and Cencheng Shen and Joshua T. Vogelstein},
year={2021},
eprint={1907.00325},
@@ -79,4 +83,4 @@ @article{Kraskov_2004
note = {Publisher: American Physical Society},
pages = {066138},
file = {APS Snapshot:/Users/adam2392/Zotero/storage/GRW23BYU/PhysRevE.69.html:text/html;Full Text PDF:/Users/adam2392/Zotero/storage/NJT9QCVA/Kraskov et al. - 2004 - Estimating mutual information.pdf:application/pdf}
}
4 changes: 2 additions & 2 deletions sktree/ensemble/_supervised_forest.py
@@ -653,7 +653,7 @@ class PatchObliqueRandomForestClassifier(SimMatrixMixin, ForestClassifier):
forest that fits a number of patch oblique decision tree classifiers
on various sub-samples of the dataset and uses averaging to
improve the predictive accuracy and control over-fitting. For more
-details, see :footcite:`Li2019manifold`.
+details, see :footcite:`Li2023manifold`.
Parameters
----------
@@ -996,7 +996,7 @@ class PatchObliqueRandomForestRegressor(SimMatrixMixin, ForestRegressor):
forest that fits a number of patch oblique decision tree regressors
on various sub-samples of the dataset and uses averaging to
improve the predictive accuracy and control over-fitting. For more
-details, see :footcite:`Li2019manifold`.
+details, see :footcite:`Li2023manifold`.
Parameters
----------
4 changes: 2 additions & 2 deletions sktree/tree/_classes.py
@@ -1301,7 +1301,7 @@ class PatchObliqueDecisionTreeClassifier(SimMatrixMixin, DecisionTreeClassifier)
"""A oblique decision tree classifier that operates over patches of data.
A patch oblique decision tree is also known as a manifold oblique decision tree
-(called MORF in :footcite:`Li2019manifold`), where the splitter is aware of
+(called MORF in :footcite:`Li2023manifold`), where the splitter is aware of
the structure in the data. For example, in an image, a patch would be contiguous
along the rows and columns of the image. In a multivariate time-series, a patch
would be contiguous over time, but possibly discontiguous over the sensors.
@@ -1791,7 +1791,7 @@ class PatchObliqueDecisionTreeRegressor(SimMatrixMixin, DecisionTreeRegressor):
"""A oblique decision tree regressor that operates over patches of data.
A patch oblique decision tree is also known as a manifold oblique decision tree
-(called MORF in :footcite:`Li2019manifold`), where the splitter is aware of
+(called MORF in :footcite:`Li2023manifold`), where the splitter is aware of
the structure in the data. For example, in an image, a patch would be contiguous
along the rows and columns of the image. In a multivariate time-series, a patch
would be contiguous over time, but possibly discontiguous over the sensors.
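
Finally, a usage sketch for the patch-oblique estimators touched by this commit. The structure-related constructor arguments (``min_patch_dims``, ``max_patch_dims``, ``data_dims``) are assumptions about the signature based on how MORF patches are described in these docstrings; check the rendered API docs for the exact names:

```python
from sklearn.datasets import load_digits
# assumed top-level export for the class defined in sktree/ensemble/_supervised_forest.py
from sktree import PatchObliqueRandomForestClassifier

# 8x8 digit images, flattened to 64 features per sample
X, y = load_digits(return_X_y=True)

# assumed parameter names for telling the splitter about the image structure
clf = PatchObliqueRandomForestClassifier(
    n_estimators=50,
    min_patch_dims=(1, 1),   # smallest patch: a single pixel
    max_patch_dims=(3, 3),   # largest patch: a 3x3 contiguous block
    data_dims=(8, 8),        # original row/column layout of each sample
    random_state=0,
)
clf.fit(X, y)
print(clf.score(X, y))  # training accuracy, just to show the call pattern
```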
