Add multi-output support to honest trees #86
Conversation
Co-Authored-By: Ronan Perry <13107341+rflperry@users.noreply.github.com>
I have removed the test skipping, except for check_classifiers_multilabel_output_format_predict_proba, which is inconsistent with how sklearn.ensemble.RandomForestClassifier produces predict_proba results. The error states that a shape of (n_samples, n_outputs) is expected, but (n_outputs, n_samples, n_classes) is the actual shape produced by sklearn forests and trees. All other multi-output checks passed.
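To make the shape mismatch concrete, here is a small sketch (assuming scikit-learn is installed) showing that a multi-output RandomForestClassifier returns predict_proba as one (n_samples, n_classes) array per output, i.e. (n_outputs, n_samples, n_classes) overall, rather than a single (n_samples, n_outputs) array:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))
y = rng.integers(0, 2, size=(50, 2))  # two binary outputs

clf = RandomForestClassifier(n_estimators=5, random_state=0).fit(X, y)
proba = clf.predict_proba(X)

print(len(proba))       # one entry per output: n_outputs = 2
print(proba[0].shape)   # each entry is (n_samples, n_classes) = (50, 2)
```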
Codecov Report

Patch coverage: additional details and impacted files

@@           Coverage Diff            @@
##             main      #86      +/- ##
==========================================
+ Coverage   86.33%   86.95%   +0.62%
==========================================
  Files          24       24
  Lines        1924     1940      +16
==========================================
+ Hits         1661     1687      +26
+ Misses        263      253      -10

☔ View full report in Codecov by Sentry.
I'll take a look. In the meantime, can you think of a quick test we can do to judge the multi-output validity? E.g., perhaps test that the forest gets the correct answer on a short simulation? You can use the existing tests as inspiration. I wonder if we can extend the existing figure shown in the uncertainty forest paper to a multi-output setting.
I don't think
Signed-off-by: Adam Li <adam2392@gmail.com>
f1ab592 fixes the issue with the It would be great if we had a simulation to verify the results of multi-output honest classification.
Would that be possible, since we don't have a regressor class yet?

Sorry, classification.
sktree/ensemble/_honest_forest.py (Outdated)

    posteriors[~zero_mask] /= posteriors[~zero_mask].sum(1, keepdims=True)

    if impute_missing is None:
        posteriors[zero_mask] = self.empirical_prior_
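A minimal standalone sketch of the correction being reviewed here (variable names mirror the snippet above; the data is illustrative): rows whose class counts sum to zero correspond to empty honest leaves, and they receive an empirical prior instead of being divided by zero.

```python
import numpy as np

posteriors = np.array([[3.0, 1.0], [0.0, 0.0], [1.0, 1.0]])
empirical_prior = np.array([0.5, 0.5])  # illustrative prior over 2 classes

zero_mask = posteriors.sum(axis=1) == 0  # rows from empty honest leaves
posteriors[~zero_mask] /= posteriors[~zero_mask].sum(1, keepdims=True)
posteriors[zero_mask] = empirical_prior  # impute the prior for empty rows

print(posteriors)  # [[0.75 0.25], [0.5 0.5], [0.5 0.5]]
```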
Having trouble on this line. It seems that, as defined, empirical_prior_ doesn't account for the extra dimension of multi-output y labels, so I'll try to increase its dimensions.
Actually, why is the empirical prior set at the forest level? It's already set in the honest trees, so this part can be removed.
sktree/tests/test_honest_forest.py (Outdated)

    X = iris.data
    y = np.stack((iris.target, second_y)).T
    clf.fit(X, y)
    score = mean_squared_error(clf.predict(X), y)
Why not just use https://scikit-learn.org/stable/modules/generated/sklearn.metrics.accuracy_score.html and assert some non-trivial performance? Is it even possible to interpret an MSE of 0.5-1.0 here?
It appears that accuracy_score doesn't support multi-output. It seems the default way sklearn measures such predictions is through sklearn.multioutput.MultiOutputClassifier.
Updated the test to use r2_score, as that's the default sklearn uses for multi-output scores. Also included different honest_prior options.
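A quick sketch of the scoring trade-off discussed above (assuming scikit-learn): accuracy_score rejects multiclass-multioutput targets, while r2_score accepts a 2-D y and averages across outputs by default (multioutput="uniform_average"):

```python
import numpy as np
from sklearn.metrics import r2_score

# Two outputs per sample; column 0 is predicted perfectly, column 1 is not.
y_true = np.array([[0, 1], [1, 2], [2, 0], [1, 1]])
y_pred = np.array([[0, 1], [1, 2], [2, 0], [1, 2]])

score = r2_score(y_true, y_pred)  # averages per-output R^2: (1.0 + 0.5) / 2
print(score)  # 0.75
```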
I fixed some bugs with leaf updates and the scores are much higher now. It seems that
@adam2392 can you make a commit to include the submodule changes here? I just want to check whether the problem is fixed.
Run

This updates the submodule and then rebuilds the package locally.
Is this because the honest trees always get all classes? Can we raise the honest_ratio so this path gets activated? Otherwise, sure, you can delete it if you think that's fine. I just want to make sure this actually works in relevant edge cases.
Looks like there are some other issues to fix with the docstrings in the sklearn fork. I can do that later unless you want to make a quick fix now.
Signed-off-by: Adam Li <adam2392@gmail.com>
Signed-off-by: Adam Li <adam2392@gmail.com>
The CircleCI docs build should be fixed now @PSSF23
@adam2392 you reversed one of the changes in the honest tree for multi-output, which I updated back in the last commit.
Oh, apologies! I most likely messed up the merge commit.
I had two questions that I left.
Otherwise, LGTM to merge once those are addressed!
@@ -431,29 +425,31 @@ def _inherit_estimator_attributes(self):
         self.n_outputs_ = self.estimator_.n_outputs_
         self.tree_ = self.estimator_.tree_

-    def _empty_leaf_correction(self, proba, normalizer):
+    def _empty_leaf_correction(self, proba, pos=0):
Can you just add a short docstring describing what's going on? I'm reading these lines and having trouble figuring out what pos actually does :/
pos indicates the class dimension of y, so that posterior corrections only operate on that dimension in multi-output cases.
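To illustrate that explanation, here is a hypothetical sketch (the helper name and data are illustrative, not the library's actual implementation) of applying the empty-leaf correction one output dimension at a time, where pos selects which output's probabilities and prior are used:

```python
import numpy as np

def empty_leaf_correction(proba, prior):
    """Replace all-zero rows (empty honest leaves) of one output's
    (n_samples, n_classes) probabilities with that output's prior,
    and normalize the remaining rows. Hypothetical helper."""
    proba = proba.copy()
    zero_mask = proba.sum(axis=1) == 0
    proba[~zero_mask] /= proba[~zero_mask].sum(1, keepdims=True)
    proba[zero_mask] = prior
    return proba

# One prior and one probability block per output dimension `pos`.
priors = [np.array([0.5, 0.5]), np.array([0.25, 0.75])]
probas = [np.array([[0.0, 0.0], [2.0, 2.0]]),
          np.array([[1.0, 3.0], [0.0, 0.0]])]

corrected = [empty_leaf_correction(probas[pos], priors[pos])
             for pos in range(2)]
print(corrected[0])  # [[0.5 0.5], [0.5 0.5]]
print(corrected[1])  # [[0.25 0.75], [0.25 0.75]]
```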
Co-authored-by: Adam Li <adam2392@gmail.com>
This

There is no random_state parameter passed to the

If it is failing, does that mean the tree depth is hitting some edge case?

Maybe the samples were just not enough for the tree to split.
It looks like there is a depth of -1. How does that even occur?...

That occurs when the tree did not split: the root alone has depth -1.
I added the random seed for X generation as well, so the test results should be consistent. Overall, when the sample size is too small, some honest trees just won't work as intended.

I see: since the root does not split at all, the max_depth is -1 rather than 1, 2, and so on.
sktree/tests/test_honest_forest.py (Outdated)

@@ -108,6 +108,7 @@ def test_iris_multi(criterion, max_features, honest_prior, estimator):
 def test_max_samples():
     max_samples_list = [8, 0.5, None]
     depths = []
+    np.random.seed(0)
Don't set the global random seed; I just realized a lot of tests are doing this. It isn't thread safe, which is an issue for Cythonized code.

Instead, set a global seed = 12345, and then at each place that uses np.random, run rng = np.random.default_rng(seed) and use rng in place of np.random.

Can you do this everywhere in the file (I think just 8 places use np.random.seed)?

Here's a reference discussing it: https://albertcthomas.github.io/good-practices-random-number-generators/
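A small sketch of the suggested pattern (variable names are illustrative): a module-level seed feeds a local Generator instance in each test, instead of mutating global state with np.random.seed, and reproducibility is preserved because equal seeds yield equal streams:

```python
import numpy as np

seed = 12345  # module-level seed shared by the tests

def make_data(n=10):
    # Local generator: thread safe and reproducible, unlike np.random.seed
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n, 3))
    y = rng.integers(0, 2, size=n)
    return X, y

X1, y1 = make_data()
X2, y2 = make_data()
print(np.array_equal(X1, X2))  # True: same seed, same stream
```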
Will do.

One minor issue I caught just now, but otherwise this LGTM.
Co-Authored-By: Adam Li <3460267+adam2392@users.noreply.github.com>
Thanks @PSSF23 !
Fixes #85

Changes proposed in this pull request:
- Multi-output support for HonestTreeClassifier & HonestForestClassifier

Before submitting
- Read the relevant section of the CONTRIBUTING docs.
- Read the Writing docstrings section of the CONTRIBUTING docs.

After submitting