
ENH, FIX i) build_oob_forest backwards compatibility with sklearn and ii) HonestForest stratification during bootstrap #283

Merged

28 commits merged into neurodata:main on Jun 24, 2024

Conversation

@adam2392 (Collaborator) commented Jun 12, 2024

Changes proposed in this pull request:

  • build_oob_forest will now work with any sklearn Forest that exposes estimators_samples_ (the in-bag sample indices per tree)
  • HonestForest will now stratify during the bootstrap step. This fixes a major bug where the posterior estimates were biased, resulting in a biased AUC score on independent data.
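For context, the in-bag/OOB complement relationship that build_oob_forest relies on can be sketched with any sklearn bagging-style estimator exposing estimators_samples_ (shown here with BaggingClassifier; this is an illustrative sketch, not the build_oob_forest implementation):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=50, random_state=0)
forest = BaggingClassifier(
    DecisionTreeClassifier(), n_estimators=5, bootstrap=True, random_state=0
).fit(X, y)

n_samples = X.shape[0]
oob_per_tree = []
for in_bag in forest.estimators_samples_:  # in-bag indices, one array per tree
    mask = np.ones(n_samples, dtype=bool)
    mask[in_bag] = False  # anything not sampled in-bag is out-of-bag
    oob_per_tree.append(np.flatnonzero(mask))

# each tree's OOB set is exactly the complement of its in-bag set
for in_bag, oob in zip(forest.estimators_samples_, oob_per_tree):
    assert np.intersect1d(in_bag, oob).size == 0
```

Newer scikit-learn forests expose the same estimators_samples_ attribute, which is what makes the backwards compatibility in this PR possible.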

Stratification should occur every time we sample the dataset, whether it's subsampling or bootstrapping:

  1. when we bootstrap-sample the dataset to get the in-bag and OOB samples, we stratify using sklearn.utils.resample (this PR)
  2. when we split the unique in-bag samples in half to get the structure and honest datasets, we stratify using StratifiedShuffleSplit.
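The two steps above can be sketched as follows (illustrative variable names and a toy label vector, not the HonestForest internals):

```python
import numpy as np
from sklearn.utils import resample
from sklearn.model_selection import StratifiedShuffleSplit

y = np.array([0] * 40 + [1] * 10)  # imbalanced toy labels (80/20)
indices = np.arange(len(y))

# step 1: stratified bootstrap -> in-bag indices preserve class proportions
in_bag = resample(indices, replace=True, n_samples=len(y), stratify=y, random_state=0)
oob = np.setdiff1d(indices, in_bag)  # out-of-bag = everything never drawn

# step 2: split the unique in-bag samples in half, again stratified,
# into a "structure" half and an "honest" half
unique_in_bag = np.unique(in_bag)
splitter = StratifiedShuffleSplit(n_splits=1, test_size=0.5, random_state=0)
structure_idx, honest_idx = next(
    splitter.split(unique_in_bag.reshape(-1, 1), y[unique_in_bag])
)
```

Because both steps stratify on y, every subsample a tree sees keeps roughly the original class proportions.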

Summary

On the main branch, using the following test:

import numpy as np
from sklearn.metrics import roc_auc_score

from sktree import HonestForestClassifier


def test_honest_forest_posteriors_on_independent():
    from sktree.datasets import make_trunk_classification

    scores = []
    for idx in range(5):
        # mu_0 == mu_1, so the classes are indistinguishable and the
        # expected AUC on this data is 0.5
        X, y = make_trunk_classification(
            n_samples=128, n_dim=4096, n_informative=1, mu_0=0.0, mu_1=0.0, seed=idx
        )
        clf = HonestForestClassifier(
            n_estimators=100,
            random_state=idx,
            bootstrap=True,
            max_samples=1.6,
            n_jobs=-1,
            honest_prior="ignore",
            stratify=True,
        )
        clf.fit(X, y)

        # per-tree posteriors, evaluated only on each tree's OOB samples
        oob_posteriors = clf.predict_proba_per_tree(X, clf.oob_samples_)
        auc_score = roc_auc_score(y, np.nanmean(oob_posteriors, axis=0)[:, 1])
        scores.append(auc_score)

    print(np.mean(scores), scores)
    assert np.mean(scores) > 0.49, f"{np.mean(scores)} {scores}"
    assert False  # fail deliberately so pytest prints the scores

we get the error:

(sktree) (base) adam2392@arm64-apple-darwin20 scikit-tree % pytest ./sktree/tests/test_honest_forest.py::test_honest_forest_posteriors_on_independent
==================================================================== test session starts ====================================================================
platform darwin -- Python 3.9.18, pytest-8.2.2, pluggy-1.5.0 -- /Users/adam2392/miniforge3/envs/sktree/bin/python3.9
cachedir: .pytest_cache
rootdir: /Users/adam2392/Documents/scikit-tree
configfile: pyproject.toml
plugins: cov-5.0.0, flaky-3.8.1
collected 1 item                                                                                                                                            

sktree/tests/test_honest_forest.py::test_honest_forest_posteriors_on_independent FAILED                                                               [100%]

========================================================================= FAILURES ==========================================================================
_______________________________________________________ test_honest_forest_posteriors_on_independent ________________________________________________________

    def test_honest_forest_posteriors_on_independent():
        from sktree.datasets import make_trunk_classification
    
        seed = 12345
        scores = []
        for idx in range(5):
            X, y = make_trunk_classification(
                n_samples=128, n_dim=4096, n_informative=1, mu_0=0.0, mu_1=0.0, seed=idx
            )
            clf = HonestForestClassifier(
                n_estimators=100,
                random_state=idx,
                bootstrap=True,
                max_samples=1.6,
                n_jobs=-1,
                honest_prior="ignore",
                stratify=True,
            )
            clf.fit(X, y)
    
            oob_posteriors = clf.predict_proba_per_tree(X, clf.oob_samples_)
            auc_score = roc_auc_score(y, np.nanmean(oob_posteriors, axis=0)[:, 1])
            scores.append(auc_score)
    
        print(np.mean(scores), scores)
>       assert np.mean(scores) > 0.49, f"{np.mean(scores)} {scores}"
E       AssertionError: 0.47548828125 [0.49951171875, 0.479736328125, 0.408203125, 0.464111328125, 0.52587890625]

sktree/tests/test_honest_forest.py:519: AssertionError

However, if we run it on this branch, we get 0.50498046875 [0.484375, 0.53076171875, 0.513671875, 0.46533203125, 0.53076171875], which shows that stratification fixes the bias.
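Why stratification removes the bias can be illustrated directly: a plain bootstrap lets the class proportion drift from resample to resample, while a stratified bootstrap holds it fixed (an illustrative sketch, not the HonestForest code):

```python
import numpy as np
from sklearn.utils import resample

rng = np.random.RandomState(0)
y = rng.randint(0, 2, size=128)  # toy binary labels
indices = np.arange(y.size)

plain_props, strat_props = [], []
for seed in range(200):
    plain = resample(indices, replace=True, random_state=seed)
    strat = resample(indices, replace=True, stratify=y, random_state=seed)
    plain_props.append(y[plain].mean())
    strat_props.append(y[strat].mean())

# plain bootstrap: the class-1 proportion fluctuates across resamples;
# stratified bootstrap: the proportion is held constant
print(np.std(plain_props), np.std(strat_props))
```

When each tree's in-bag class balance drifts, per-tree posteriors pick up systematic shifts that no longer average out, which is consistent with the biased OOB AUC seen above.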

Signed-off-by: Adam Li <adam2392@gmail.com>
@adam2392 adam2392 changed the title ENH build_oob_forest backwards compatibility with sklearn and HonestForest stratification during bootstrap ENH, FIX i) build_oob_forest backwards compatibility with sklearn and ii) HonestForest stratification during bootstrap Jun 21, 2024
@adam2392 (Collaborator, Author) commented:

Interestingly, this is not an issue with RandomForestClassifier, so I suspect it is related to the empty leaves, or to the fact that we use a separate dataset to estimate the posteriors.
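For reference, this is how np.nanmean combines per-tree posteriors when some entries are NaN, e.g. for in-bag samples or samples landing in empty honest leaves under honest_prior="ignore" (toy numbers, not from this PR):

```python
import numpy as np

# toy per-tree posteriors: 3 trees x 4 samples x 2 classes;
# NaN marks samples a tree cannot score (in-bag, or empty honest leaf)
posteriors = np.array([
    [[0.8, 0.2], [np.nan, np.nan], [0.6, 0.4], [0.5, 0.5]],
    [[np.nan, np.nan], [0.3, 0.7], [0.5, 0.5], [np.nan, np.nan]],
    [[0.9, 0.1], [0.4, 0.6], [np.nan, np.nan], [0.5, 0.5]],
])

# averages only the non-NaN trees for each sample, so each sample's
# posterior can come from a different (possibly class-imbalanced) subset of trees
mean_post = np.nanmean(posteriors, axis=0)
print(mean_post[:, 1])  # per-sample class-1 posterior
```

If the trees that happen to score a given sample saw skewed in-bag class balances, this per-sample averaging can propagate that skew, which is one plausible reason HonestForest is affected while RandomForestClassifier is not.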


codecov bot commented Jun 21, 2024

Codecov Report

Attention: Patch coverage is 82.14286% with 10 lines in your changes missing coverage. Please review.

Project coverage is 78.55%. Comparing base (b8da7b0) to head (290c5f6).
Report is 1 commit behind head on main.

Files                               Patch %   Lines
sktree/ensemble/_honest_forest.py   77.77%    4 Missing and 2 partials ⚠️
sktree/tree/_honest_tree.py         80.00%    2 Missing and 1 partial ⚠️
sktree/stats/forestht.py            88.88%    0 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #283      +/-   ##
==========================================
+ Coverage   76.79%   78.55%   +1.75%     
==========================================
  Files          25       24       -1     
  Lines        2267     2252      -15     
  Branches      409      414       +5     
==========================================
+ Hits         1741     1769      +28     
+ Misses        402      352      -50     
- Partials      124      131       +7     


@adam2392 adam2392 merged commit ea5d929 into neurodata:main Jun 24, 2024
28 of 33 checks passed
@adam2392 adam2392 deleted the inspect branch June 24, 2024 18:10