
ENH, FIX i) build_oob_forest backwards compatibility with sklearn and ii) HonestForest stratification during bootstrap #283

Merged

28 commits merged into neurodata:main on Jun 24, 2024

Conversation

@adam2392 (Collaborator) commented Jun 12, 2024

Changes proposed in this pull request:

  • build_oob_forest will now work with any sklearn Forest that exposes estimators_samples_ (the in-bag sample indices per tree)
  • HonestForest will now stratify during the bootstrap step. This fixes a major bug where the posterior estimates were biased, resulting in a biased AUC score on independent data.
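For context, the in-bag/OOB complement relationship that build_oob_forest relies on can be sketched with any sklearn bagging-style estimator exposing estimators_samples_ (shown here with BaggingClassifier; this is an illustrative sketch, not the build_oob_forest implementation):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=50, random_state=0)
forest = BaggingClassifier(
    DecisionTreeClassifier(), n_estimators=5, bootstrap=True, random_state=0
).fit(X, y)

n_samples = X.shape[0]
oob_per_tree = []
for in_bag in forest.estimators_samples_:  # in-bag indices, one array per tree
    mask = np.ones(n_samples, dtype=bool)
    mask[in_bag] = False  # anything not sampled in-bag is out-of-bag
    oob_per_tree.append(np.flatnonzero(mask))

# each tree's OOB set is exactly the complement of its in-bag set
for in_bag, oob in zip(forest.estimators_samples_, oob_per_tree):
    assert np.intersect1d(in_bag, oob).size == 0
```

Newer scikit-learn forests expose the same estimators_samples_ attribute, which is what makes the backwards compatibility in this PR possible.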

Stratification should occur every time we sample the dataset, whether it's subsampling or bootstrapping:

  1. when we bootstrap-sample the dataset to get the in-bag and OOB samples, we stratify using sklearn.utils.resample (this PR)
  2. when we split the unique in-bag samples in half to get the structure and honest datasets, we stratify using StratifiedShuffleSplit.
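The two steps above can be sketched as follows (illustrative variable names and a toy label vector, not the HonestForest internals):

```python
import numpy as np
from sklearn.utils import resample
from sklearn.model_selection import StratifiedShuffleSplit

y = np.array([0] * 40 + [1] * 10)  # imbalanced toy labels (80/20)
indices = np.arange(len(y))

# step 1: stratified bootstrap -> in-bag indices preserve class proportions
in_bag = resample(indices, replace=True, n_samples=len(y), stratify=y, random_state=0)
oob = np.setdiff1d(indices, in_bag)  # out-of-bag = everything never drawn

# step 2: split the unique in-bag samples in half, again stratified,
# into a "structure" half and an "honest" half
unique_in_bag = np.unique(in_bag)
splitter = StratifiedShuffleSplit(n_splits=1, test_size=0.5, random_state=0)
structure_idx, honest_idx = next(
    splitter.split(unique_in_bag.reshape(-1, 1), y[unique_in_bag])
)
```

Because both steps stratify on y, every subsample a tree sees keeps roughly the original class proportions.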

Summary

On the main branch, using the following test:

import numpy as np
from sklearn.metrics import roc_auc_score

from sktree import HonestForestClassifier


def test_honest_forest_posteriors_on_independent():
    from sktree.datasets import make_trunk_classification

    scores = []
    for idx in range(5):
        # mu_0 == mu_1, so the classes are indistinguishable and the
        # expected AUC on this data is 0.5
        X, y = make_trunk_classification(
            n_samples=128, n_dim=4096, n_informative=1, mu_0=0.0, mu_1=0.0, seed=idx
        )
        clf = HonestForestClassifier(
            n_estimators=100,
            random_state=idx,
            bootstrap=True,
            max_samples=1.6,
            n_jobs=-1,
            honest_prior="ignore",
            stratify=True,
        )
        clf.fit(X, y)

        # per-tree posteriors, evaluated only on each tree's OOB samples
        oob_posteriors = clf.predict_proba_per_tree(X, clf.oob_samples_)
        auc_score = roc_auc_score(y, np.nanmean(oob_posteriors, axis=0)[:, 1])
        scores.append(auc_score)

    print(np.mean(scores), scores)
    assert np.mean(scores) > 0.49, f"{np.mean(scores)} {scores}"
    assert False  # fail deliberately so pytest prints the scores

we get the error:

(sktree) (base) adam2392@arm64-apple-darwin20 scikit-tree % pytest ./sktree/tests/test_honest_forest.py::test_honest_forest_posteriors_on_independent
==================================================================== test session starts ====================================================================
platform darwin -- Python 3.9.18, pytest-8.2.2, pluggy-1.5.0 -- /Users/adam2392/miniforge3/envs/sktree/bin/python3.9
cachedir: .pytest_cache
rootdir: /Users/adam2392/Documents/scikit-tree
configfile: pyproject.toml
plugins: cov-5.0.0, flaky-3.8.1
collected 1 item                                                                                                                                            

sktree/tests/test_honest_forest.py::test_honest_forest_posteriors_on_independent FAILED                                                               [100%]

========================================================================= FAILURES ==========================================================================
_______________________________________________________ test_honest_forest_posteriors_on_independent ________________________________________________________

    def test_honest_forest_posteriors_on_independent():
        from sktree.datasets import make_trunk_classification
    
        seed = 12345
        scores = []
        for idx in range(5):
            X, y = make_trunk_classification(
                n_samples=128, n_dim=4096, n_informative=1, mu_0=0.0, mu_1=0.0, seed=idx
            )
            clf = HonestForestClassifier(
                n_estimators=100,
                random_state=idx,
                bootstrap=True,
                max_samples=1.6,
                n_jobs=-1,
                honest_prior="ignore",
                stratify=True,
            )
            clf.fit(X, y)
    
            oob_posteriors = clf.predict_proba_per_tree(X, clf.oob_samples_)
            auc_score = roc_auc_score(y, np.nanmean(oob_posteriors, axis=0)[:, 1])
            scores.append(auc_score)
    
        print(np.mean(scores), scores)
>       assert np.mean(scores) > 0.49, f"{np.mean(scores)} {scores}"
E       AssertionError: 0.47548828125 [0.49951171875, 0.479736328125, 0.408203125, 0.464111328125, 0.52587890625]

sktree/tests/test_honest_forest.py:519: AssertionError

However, if we run it on this branch, we get 0.50498046875 [0.484375, 0.53076171875, 0.513671875, 0.46533203125, 0.53076171875], which shows that stratification fixes the bias.
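Why stratification removes the bias can be illustrated directly: a plain bootstrap lets the class proportion drift from resample to resample, while a stratified bootstrap holds it fixed (an illustrative sketch, not the HonestForest code):

```python
import numpy as np
from sklearn.utils import resample

rng = np.random.RandomState(0)
y = rng.randint(0, 2, size=128)  # toy binary labels
indices = np.arange(y.size)

plain_props, strat_props = [], []
for seed in range(200):
    plain = resample(indices, replace=True, random_state=seed)
    strat = resample(indices, replace=True, stratify=y, random_state=seed)
    plain_props.append(y[plain].mean())
    strat_props.append(y[strat].mean())

# plain bootstrap: the class-1 proportion fluctuates across resamples;
# stratified bootstrap: the proportion is held constant
print(np.std(plain_props), np.std(strat_props))
```

When each tree's in-bag class balance drifts, per-tree posteriors pick up systematic shifts that no longer average out, which is consistent with the biased OOB AUC seen above.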

Signed-off-by: Adam Li <adam2392@gmail.com>
@adam2392 adam2392 changed the title ENH build_oob_forest backwards compatibility with sklearn and HonestForest stratification during bootstrap ENH, FIX i) build_oob_forest backwards compatibility with sklearn and ii) HonestForest stratification during bootstrap Jun 21, 2024
@adam2392 (Collaborator, Author) commented:

Interestingly, this is not an issue with RandomForestClassifier, so I suspect it is related to the empty leaves, or to the fact that we use a separate dataset to estimate the posteriors.
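For reference, this is how np.nanmean combines per-tree posteriors when some entries are NaN, e.g. for in-bag samples or samples landing in empty honest leaves under honest_prior="ignore" (toy numbers, not from this PR):

```python
import numpy as np

# toy per-tree posteriors: 3 trees x 4 samples x 2 classes;
# NaN marks samples a tree cannot score (in-bag, or empty honest leaf)
posteriors = np.array([
    [[0.8, 0.2], [np.nan, np.nan], [0.6, 0.4], [0.5, 0.5]],
    [[np.nan, np.nan], [0.3, 0.7], [0.5, 0.5], [np.nan, np.nan]],
    [[0.9, 0.1], [0.4, 0.6], [np.nan, np.nan], [0.5, 0.5]],
])

# averages only the non-NaN trees for each sample, so each sample's
# posterior can come from a different (possibly class-imbalanced) subset of trees
mean_post = np.nanmean(posteriors, axis=0)
print(mean_post[:, 1])  # per-sample class-1 posterior
```

If the trees that happen to score a given sample saw skewed in-bag class balances, this per-sample averaging can propagate that skew, which is one plausible reason HonestForest is affected while RandomForestClassifier is not.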


codecov bot commented Jun 21, 2024

Codecov Report

Attention: Patch coverage is 82.14286% with 10 lines in your changes missing coverage. Please review.

Project coverage is 78.55%. Comparing base (b8da7b0) to head (290c5f6).
Report is 1 commit behind head on main.

Files                               Patch %   Lines
sktree/ensemble/_honest_forest.py   77.77%    4 Missing and 2 partials ⚠️
sktree/tree/_honest_tree.py         80.00%    2 Missing and 1 partial ⚠️
sktree/stats/forestht.py            88.88%    0 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #283      +/-   ##
==========================================
+ Coverage   76.79%   78.55%   +1.75%     
==========================================
  Files          25       24       -1     
  Lines        2267     2252      -15     
  Branches      409      414       +5     
==========================================
+ Hits         1741     1769      +28     
+ Misses        402      352      -50     
- Partials      124      131       +7     


@adam2392 adam2392 merged commit ea5d929 into neurodata:main Jun 24, 2024
28 of 33 checks passed
@adam2392 adam2392 deleted the inspect branch June 24, 2024 18:10