Add multi-output support to honest trees #86
Conversation
Co-Authored-By: Ronan Perry <13107341+rflperry@users.noreply.github.com>
I have removed the test skipping, except for check_classifiers_multilabel_output_format_predict_proba, which is inconsistent with how sklearn.ensemble.RandomForestClassifier produces predict_proba results. The error states that a shape of (n_samples, n_outputs) is expected, but (n_outputs, n_samples, n_classes) is the actual shape produced by sklearn forests and trees. All other multi-output checks passed.
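To make the shape mismatch concrete, here is a small sketch (assuming scikit-learn is installed) showing that a multi-output RandomForestClassifier returns predict_proba as one (n_samples, n_classes) array per output, i.e. (n_outputs, n_samples, n_classes) overall, rather than a single (n_samples, n_outputs) array:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))
y = rng.integers(0, 2, size=(50, 2))  # two binary outputs

clf = RandomForestClassifier(n_estimators=5, random_state=0).fit(X, y)
proba = clf.predict_proba(X)

print(len(proba))       # one entry per output: n_outputs = 2
print(proba[0].shape)   # each entry is (n_samples, n_classes) = (50, 2)
```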
Codecov Report

Patch coverage: additional details and impacted files

@@           Coverage Diff            @@
##             main      #86      +/- ##
==========================================
+ Coverage   86.33%   86.95%   +0.62%
==========================================
  Files          24       24
  Lines        1924     1940      +16
==========================================
+ Hits         1661     1687      +26
+ Misses        263      253      -10

☔ View full report in Codecov by Sentry.
I'll take a look. In the meantime, can you think of a quick test we can do to judge the multi-output validity? E.g., perhaps test that the forest gets the correct answer on a short simulation? You can use the existing tests as inspiration. I wonder if we can extend the existing figure shown in the uncertainty forest paper to a multi-output setting.
I don't think
Signed-off-by: Adam Li <adam2392@gmail.com>
f1ab592 fixes the issue with the It would be great if we had a simulation to verify the results of multi-output honest classification.
Would that be possible, since we don't have a regressor class yet?

Sorry, classification.
sktree/ensemble/_honest_forest.py (Outdated)

    posteriors[~zero_mask] /= posteriors[~zero_mask].sum(1, keepdims=True)

    if impute_missing is None:
        posteriors[zero_mask] = self.empirical_prior_
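A minimal standalone sketch of the correction being reviewed here (variable names mirror the snippet above; the data is illustrative): rows whose class counts sum to zero correspond to empty honest leaves, and they receive an empirical prior instead of being divided by zero.

```python
import numpy as np

posteriors = np.array([[3.0, 1.0], [0.0, 0.0], [1.0, 1.0]])
empirical_prior = np.array([0.5, 0.5])  # illustrative prior over 2 classes

zero_mask = posteriors.sum(axis=1) == 0  # rows from empty honest leaves
posteriors[~zero_mask] /= posteriors[~zero_mask].sum(1, keepdims=True)
posteriors[zero_mask] = empirical_prior  # impute the prior for empty rows

print(posteriors)  # [[0.75 0.25], [0.5 0.5], [0.5 0.5]]
```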
Having trouble on this line. It seems that, as defined, empirical_prior_ doesn't account for the extra dimension of multi-output y labels, so I'll try to increase its dimensions.
Actually, why is the empirical prior set at the forest level? It's already set in the honest trees, so this part can be removed.
sktree/tests/test_honest_forest.py (Outdated)

    X = iris.data
    y = np.stack((iris.target, second_y)).T
    clf.fit(X, y)
    score = mean_squared_error(clf.predict(X), y)
Why not just use https://scikit-learn.org/stable/modules/generated/sklearn.metrics.accuracy_score.html and assert some non-trivial performance? Is it even possible to interpret an MSE of 0.5-1.0 here?
It appears that accuracy_score doesn't support multi-output. It seems the default way sklearn measures such predictions is through sklearn.multioutput.MultiOutputClassifier.
Updated the test to use r2_score, as that's the default sklearn uses for multi-output scores. Also included different honest_prior options.
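A quick sketch of the scoring trade-off discussed above (assuming scikit-learn): accuracy_score rejects multiclass-multioutput targets, while r2_score accepts a 2-D y and averages across outputs by default (multioutput="uniform_average"):

```python
import numpy as np
from sklearn.metrics import r2_score

# Two outputs per sample; column 0 is predicted perfectly, column 1 is not.
y_true = np.array([[0, 1], [1, 2], [2, 0], [1, 1]])
y_pred = np.array([[0, 1], [1, 2], [2, 0], [1, 2]])

score = r2_score(y_true, y_pred)  # averages per-output R^2: (1.0 + 0.5) / 2
print(score)  # 0.75
```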
I fixed some bugs with leaf updates and the scores are much higher now. It seems that
@adam2392 can you make a commit to include the submodule changes here? I just want to check whether the problem is fixed.
Run

This updates the submodule and then rebuilds the package locally.
Is this because the honest trees always get all classes? Can we raise the honest_ratio so this path gets activated? Otherwise, sure, you can delete it if you think that's fine. I just want to make sure this actually works in relevant edge cases.
Looks like there are some other issues to fix with the docstrings in the sklearn fork. I can do that later unless you want to make a quick fix now.
Signed-off-by: Adam Li <adam2392@gmail.com>
Signed-off-by: Adam Li <adam2392@gmail.com>
The CircleCI docs build should be fixed now @PSSF23
@adam2392 you reversed one of the changes in the honest tree for multi-output, which I updated back in the last commit.
Oh, apologies! I most likely messed up the merge commit.
I had two questions that I left.
Otherwise, LGTM to merge once those are addressed!
@@ -431,29 +425,31 @@ def _inherit_estimator_attributes(self):
         self.n_outputs_ = self.estimator_.n_outputs_
         self.tree_ = self.estimator_.tree_

-    def _empty_leaf_correction(self, proba, normalizer):
+    def _empty_leaf_correction(self, proba, pos=0):
Can you just add a short docstring describing what's going on? I'm reading these lines and having trouble figuring out what pos actually does :/
pos indicates the class dimension of y, so that posterior corrections only operate on that dimension in multi-output cases.
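To illustrate that explanation, here is a hypothetical sketch (the helper name and data are illustrative, not the library's actual implementation) of applying the empty-leaf correction one output dimension at a time, where pos selects which output's probabilities and prior are used:

```python
import numpy as np

def empty_leaf_correction(proba, prior):
    """Replace all-zero rows (empty honest leaves) of one output's
    (n_samples, n_classes) probabilities with that output's prior,
    and normalize the remaining rows. Hypothetical helper."""
    proba = proba.copy()
    zero_mask = proba.sum(axis=1) == 0
    proba[~zero_mask] /= proba[~zero_mask].sum(1, keepdims=True)
    proba[zero_mask] = prior
    return proba

# One prior and one probability block per output dimension `pos`.
priors = [np.array([0.5, 0.5]), np.array([0.25, 0.75])]
probas = [np.array([[0.0, 0.0], [2.0, 2.0]]),
          np.array([[1.0, 3.0], [0.0, 0.0]])]

corrected = [empty_leaf_correction(probas[pos], priors[pos])
             for pos in range(2)]
print(corrected[0])  # [[0.5 0.5], [0.5 0.5]]
print(corrected[1])  # [[0.25 0.75], [0.25 0.75]]
```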
Co-authored-by: Adam Li <adam2392@gmail.com>
This

There is no random_state parameter passed to the

If it is failing, does that mean the tree depth is hitting some edge case?

Maybe the samples were just not enough for the tree to split.
It looks like there is a depth of -1. How does that even occur?...

That occurs when the tree did not split: the root alone has depth -1.
I added the random seed for X generation as well, so the test results should be consistent. Overall, when the sample size is too small, some honest trees just won't work as intended.

I see: since the root does not split at all, the max_depth is -1 rather than 1, 2, and so on.
sktree/tests/test_honest_forest.py (Outdated)

@@ -108,6 +108,7 @@ def test_iris_multi(criterion, max_features, honest_prior, estimator):
 def test_max_samples():
     max_samples_list = [8, 0.5, None]
     depths = []
+    np.random.seed(0)
Don't set the global random seed; I just realized a lot of tests are doing this. It isn't thread safe, which is an issue for Cythonized code.

Instead, set a global seed = 12345, and then at each place that uses np.random, run rng = np.random.default_rng(seed) and use rng in place of np.random.

Can you do this everywhere in the file (I think just 8 places use np.random.seed)?

Here's a reference discussing it: https://albertcthomas.github.io/good-practices-random-number-generators/
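A small sketch of the suggested pattern (variable names are illustrative): a module-level seed feeds a local Generator instance in each test, instead of mutating global state with np.random.seed, and reproducibility is preserved because equal seeds yield equal streams:

```python
import numpy as np

seed = 12345  # module-level seed shared by the tests

def make_data(n=10):
    # Local generator: thread safe and reproducible, unlike np.random.seed
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n, 3))
    y = rng.integers(0, 2, size=n)
    return X, y

X1, y1 = make_data()
X2, y2 = make_data()
print(np.array_equal(X1, X2))  # True: same seed, same stream
```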
Will do.

One minor issue I caught just now, but otherwise this LGTM.
Co-Authored-By: Adam Li <3460267+adam2392@users.noreply.github.com>
Thanks @PSSF23 !
Fixes #85

Changes proposed in this pull request:
- Multi-output support for HonestTreeClassifier & HonestForestClassifier

Before submitting
- Read the relevant section of the CONTRIBUTING docs.
- Read the Writing docstrings section of the CONTRIBUTING docs.

After submitting