
Add multi-output support to honest trees #86

Merged
merged 20 commits into from
Jun 22, 2023

Conversation


@PSSF23 PSSF23 commented Jun 14, 2023

Fixes #85

Changes proposed in this pull request:

  • Add multi-output support to HonestTreeClassifier & HonestForestClassifier

Before submitting

  • I've read and followed all steps in the Making a pull request
    section of the CONTRIBUTING docs.
  • I've updated or added any relevant docstrings following the syntax described in the
    Writing docstrings section of the CONTRIBUTING docs.
  • If this PR fixes a bug, I've added a test that will fail without my fix.
  • If this PR adds a new feature, I've added tests that sufficiently cover my new functionality.

After submitting

  • All GitHub Actions jobs for my pull request have passed.

PSSF23 and others added 2 commits June 14, 2023 14:10
Co-Authored-By: Ronan Perry <13107341+rflperry@users.noreply.github.com>
@PSSF23 PSSF23 requested a review from adam2392 June 14, 2023 20:05

@PSSF23 PSSF23 left a comment


I have removed the test skipping except check_classifiers_multilabel_output_format_predict_proba, which is inconsistent with how sklearn.ensemble.RandomForestClassifier produces predict_proba results.

The error states that a shape of (n_samples, n_outputs) is expected, but (n_outputs, n_samples, n_classes) is the actual shape produced by sklearn forests and trees.

Other multi-output cases passed.
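For context, a minimal sketch of the shape mismatch described above, assuming scikit-learn's RandomForestClassifier (not the honest variants): for multi-output y, predict_proba returns one (n_samples, n_classes) array per output rather than a single (n_samples, n_outputs) array.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
# Two-output target, shape (n_samples, n_outputs): 3 classes, then 2 classes.
y = np.stack([np.arange(100) % 3, np.arange(100) % 2]).T

clf = RandomForestClassifier(n_estimators=10, random_state=0).fit(X, y)
proba = clf.predict_proba(X)

# One posterior array per output -- effectively (n_outputs, n_samples,
# n_classes), not the single (n_samples, n_outputs) array the check expects.
print(len(proba))      # 2
print(proba[0].shape)  # (100, 3)
print(proba[1].shape)  # (100, 2)
```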


codecov bot commented Jun 14, 2023

Codecov Report

Patch coverage: 100.00% and project coverage change: +0.62 🎉

Comparison is base (7eb96e6) 86.33% compared to head (0f773e1) 86.95%.

Additional details and impacted files
@@            Coverage Diff             @@
##             main      #86      +/-   ##
==========================================
+ Coverage   86.33%   86.95%   +0.62%     
==========================================
  Files          24       24              
  Lines        1924     1940      +16     
==========================================
+ Hits         1661     1687      +26     
+ Misses        263      253      -10     
Impacted Files Coverage Δ
sktree/tree/tests/test_honest_tree.py 100.00% <ø> (ø)
sktree/ensemble/_honest_forest.py 97.14% <100.00%> (+1.36%) ⬆️
sktree/tests/test_honest_forest.py 100.00% <100.00%> (ø)
sktree/tree/_honest_tree.py 100.00% <100.00%> (+8.03%) ⬆️



adam2392 commented Jun 14, 2023

I have removed the test skipping except check_classifiers_multilabel_output_format_predict_proba, which is inconsistent with how sklearn.ensemble.RandomForestClassifier produces predict_proba results.
Error states that the shape of (n_samples, n_outputs) should be expected, but (n_outputs, n_samples, n_classes) is the actual shape produced by sklearn forests and trees.

(n_samples, n_outputs) is the shape we want/required, but (n_outputs, n_samples, n_classes) is what we get for the honest tree/forest?

I'll take a look.

In the meantime, can you think of a quick test we can do to judge the multi-output validity? E.g., perhaps test that the forest gets the correct answer on a short simulation? You can use the existing tests as inspiration.

I wonder if we can extend the existing figure that's shown in the uncertainty forest paper to have a multi-output setting.


PSSF23 commented Jun 15, 2023

I don't think (n_samples, n_outputs) is a valid output format for predict_proba, and that's not the output of current sklearn implementations. It seems more appropriate for the results of predict, where we get class labels instead of posteriors. Maybe the check is outdated. I'll look into possible tests for multi-output.

Signed-off-by: Adam Li <adam2392@gmail.com>

adam2392 commented Jun 15, 2023

f1ab592 fixes the issue with the check_estimator test with multi-output predict_proba.

It would be great if we had a simulation to verify the results of multi-output honest classification.


PSSF23 commented Jun 15, 2023

Would that be possible as we don't have a regressor class yet?

@adam2392

Would that be possible as we don't have a regressor class yet?

Sorry classification

posteriors[~zero_mask] /= posteriors[~zero_mask].sum(1, keepdims=True)

if impute_missing is None:
    posteriors[zero_mask] = self.empirical_prior_


Having trouble on this line. It seems that, as defined, empirical_prior_ doesn't account for the extra dimension of multi-output y labels, so I'll try to increase its dimensions.



Actually, why is the empirical prior set at the forest level? It's already set in honest trees, so this part can be removed.

X = iris.data
y = np.stack((iris.target, second_y)).T
clf.fit(X, y)
score = mean_squared_error(clf.predict(X), y)


Why not just use https://scikit-learn.org/stable/modules/generated/sklearn.metrics.accuracy_score.html

and assert some non-trivial performance?

Is it possible to interpret MSE of 0.5-1.0?



It appears that accuracy_score doesn't support multi-output. It seems that the default way they measure such predictions is through sklearn.multioutput.MultiOutputClassifier.
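To illustrate the limitation (a small sketch; the per-column workaround shown here is just one option, not what the PR does): accuracy_score rejects multiclass-multioutput targets outright.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Multiclass-multioutput targets: 2D y with more than two class values.
y_true = np.array([[0, 1], [2, 1], [1, 0]])
y_pred = np.array([[0, 1], [2, 0], [1, 0]])

try:
    accuracy_score(y_true, y_pred)
except ValueError as e:
    print(e)  # multiclass-multioutput is not supported

# One workaround: score each output column separately, then average.
per_output = [accuracy_score(y_true[:, k], y_pred[:, k])
              for k in range(y_true.shape[1])]
print(per_output)  # [1.0, 0.666...]
```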


PSSF23 commented Jun 16, 2023


@PSSF23 PSSF23 left a comment


Updated the test to use r2_score, as that's the default sklearn uses for multi-output scores. Also included the different honest_prior options.
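For reference, a small sketch of how r2_score handles multi-output targets (it averages the per-output scores by default via multioutput='uniform_average'):

```python
import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([[0, 1], [2, 1], [1, 0], [2, 2]])
y_pred = np.array([[0, 1], [2, 1], [1, 0], [2, 1]])

# Default: R^2 per output column, then a uniform average across outputs.
print(r2_score(y_true, y_pred))  # 0.75

# Per-output scores, no averaging: output 0 is perfect, output 1 is not.
print(r2_score(y_true, y_pred, multioutput="raw_values"))  # [1.0, 0.5]
```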


@PSSF23 PSSF23 left a comment


I fixed some bugs with leaf updates and the scores are much higher now.


PSSF23 commented Jun 16, 2023

It seems that _impute_missing_classes is really not used... Should we just remove it?


PSSF23 commented Jun 16, 2023

@adam2392 can you make a commit to include the submodule changes in here? Just want to check if the problem is fixed.

@adam2392

@adam2392 can you make a commit to include the submodule changes in here? Just want to check if the problem is fixed.

Run

git submodule update --recursive --remote
git add -A
git commit -m "Update submodule commit ID"

and then

./spin build -j 4 --forcesubmodule

This updates the submodule and then rebuilds the package locally.


adam2392 commented Jun 16, 2023

It seems that _impute_missing_classes is really not used... Should we just remove it?

Is this because the honest trees always get all classes? Can we raise the honest_ratio so this gets activated?

Otherwise... sure, you can delete it if you think that's fine. I just want to make sure this actually works in relevant edge cases.

@adam2392

Looks like there are some other issues to fix with the docstrings in the sklearn fork. I can do that later unless you do a quick fix now.

PSSF23 and others added 3 commits June 19, 2023 10:37
Signed-off-by: Adam Li <adam2392@gmail.com>
Signed-off-by: Adam Li <adam2392@gmail.com>
@adam2392

The CircleCI docs build should be fixed now @PSSF23


@PSSF23 PSSF23 left a comment


@adam2392 you reverted one of the multi-output changes in the honest tree, which I restored in the last commit.

@adam2392

@adam2392 you reverted one of the multi-output changes in the honest tree, which I restored in the last commit.

Oh, apologies! I most likely messed up the merge commit.


@adam2392 adam2392 left a comment


I left two questions.

Otherwise, LGTM to merge once those are addressed!

@@ -431,29 +425,31 @@ def _inherit_estimator_attributes(self):
self.n_outputs_ = self.estimator_.n_outputs_
self.tree_ = self.estimator_.tree_

def _empty_leaf_correction(self, proba, normalizer):
def _empty_leaf_correction(self, proba, pos=0):


Can you just add a short docstring to describe what's going on? I'm reading these lines and having trouble figuring out what pos actually does :/



pos indicates the class dimension of y, so that posterior corrections only work on that dimension in multi-output cases.
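To illustrate the idea (a hypothetical sketch with made-up names, not the actual _empty_leaf_correction implementation): with one posterior array per output, pos selects which output's array the correction applies to; rows from empty leaves get the empirical prior and the rest are renormalized.

```python
import numpy as np

def empty_leaf_correction(probas, priors, pos=0):
    """Correct empty-leaf posteriors for output `pos` only (illustrative).

    probas : list of (n_samples, n_classes_k) arrays, one per output.
    priors : list of (n_classes_k,) empirical priors, one per output.
    """
    proba = probas[pos].copy()
    zero_mask = proba.sum(axis=1) == 0.0   # rows from empty honest leaves
    proba[~zero_mask] /= proba[~zero_mask].sum(axis=1, keepdims=True)
    proba[zero_mask] = priors[pos]         # impute the empirical prior
    return proba

# Two-output posteriors; second row of output 0 came from an empty leaf.
probas = [np.array([[2.0, 2.0], [0.0, 0.0]]), np.array([[1.0, 3.0]])]
priors = [np.array([0.5, 0.5]), np.array([0.25, 0.75])]
print(empty_leaf_correction(probas, priors, pos=0))
# [[0.5 0.5]
#  [0.5 0.5]]
```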

sktree/tests/test_honest_forest.py (review thread resolved)
Co-authored-by: Adam Li <adam2392@gmail.com>

PSSF23 commented Jun 20, 2023

This test_max_samples test is very weird, succeeding or failing randomly.

@adam2392

This test_max_samples test is very weird, succeeding or failing randomly.

There is no random_state parameter passed to the HonestForestClassifier, and that is perhaps what's causing this. Can we add one, since it will definitely be required?

@adam2392

If it is failing, does that mean the tree depth hits some edge case?


PSSF23 commented Jun 21, 2023

If it is failing, does that mean the tree depth hits some edge case?

Maybe there just weren't enough samples for the tree to split.

@adam2392

It looks like there is a depth of -1. How does that even occur?...

FAILED sktree/tests/test_honest_forest.py::test_max_samples - assert False
 +  where False = all(array([ 7, -1]) > 0)
 +    where array([ 7, -1]) = <function diff at 0x1102396b0>([1, 8, 7])
 +      where <function diff at 0x1102396b0> = np.diff


PSSF23 commented Jun 22, 2023

It looks like there is a depth of -1. How does that even occur?...

FAILED sktree/tests/test_honest_forest.py::test_max_samples - assert False
 +  where False = all(array([ 7, -1]) > 0)
 +    where array([ 7, -1]) = <function diff at 0x1102396b0>([1, 8, 7])
 +      where <function diff at 0x1102396b0> = np.diff

That occurs when the tree did not split: a root-only tree has depth -1.
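For reference, the failing assertion can be reproduced in isolation (a sketch of the check from the log above, not the full test):

```python
import numpy as np

# Depths recorded for max_samples_list = [8, 0.5, None] in one failing run.
depths = [1, 8, 7]

# The test asserts strictly increasing depths; the unsplit-tree case breaks it.
print(np.diff(depths))             # [ 7 -1]
print(all(np.diff(depths) > 0))    # False
```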


@PSSF23 PSSF23 left a comment


I added the random seed for X generation as well, so the test results should be consistent. Overall, when the sample size is too small, some honest trees just wouldn't work as intended.

@adam2392

I see: since the root does not split at all, the max_depth is -1 rather than 1, 2, and so on.

@@ -108,6 +108,7 @@ def test_iris_multi(criterion, max_features, honest_prior, estimator):
def test_max_samples():
max_samples_list = [8, 0.5, None]
depths = []
np.random.seed(0)

@adam2392 adam2392 Jun 22, 2023


Don't set the global random seed. I just realized a lot of tests are doing this; it actually isn't thread safe, which is an issue for Cythonized code.

Instead, set a global seed = 12345, and then wherever np.random is used, create rng = np.random.default_rng(seed) and use rng in place of np.random.

Can you do this everywhere in the file (I think just 8 places use np.random.seed)?

Here's a ref talking about it: https://albertcthomas.github.io/good-practices-random-number-generators/
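The suggested pattern can be sketched as follows (seed value from the comment above; array shapes are illustrative):

```python
import numpy as np

seed = 12345

# Before: mutates NumPy's global state, not thread safe.
# np.random.seed(0)
# X = np.random.normal(size=(10, 3))

# After: a local Generator, reproducible and thread safe.
rng = np.random.default_rng(seed)
X = rng.normal(size=(10, 3))
labels = rng.integers(0, 2, size=10)
```

Two Generators built from the same seed produce identical draws, so tests stay deterministic without touching global state.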



will do.

@adam2392

One minor issue I caught just now, but otherwise this LGTM.

Co-Authored-By: Adam Li <3460267+adam2392@users.noreply.github.com>
@adam2392 adam2392 merged commit d506519 into main Jun 22, 2023
@adam2392 adam2392 deleted the multi branch June 22, 2023 18:14
@adam2392

Thanks @PSSF23!
