-
Notifications
You must be signed in to change notification settings - Fork 177
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Improve the determination of cluster exemplars fix for main #357
Conversation
…the model is trained to make predictions
// To determine the remaining exemplars, the best thing to do is randomly sample them from all the | ||
// points in this cluster. This could introduce duplicate exemplar points, but that is safer than | ||
// reducing the number of exemplars. | ||
SplittableRandom rand = new SplittableRandom(exemplarSampleSeed); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How much of an edge case is this randomized fallback? We have special treatment for all other RNGs in Tribuo where they are members of the trainer and split under a lock to preserve provenance information, which ensures that repeated runs of a trainer on the same data give different answers when not controlling the RNG state (e.g. when used in an ensemble). If this is purely an edge case issue then it might be ok to leave it as is, but if it might occur relatively frequently then it's probably worth using the same idiom we do elsewhere.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This edge case seems to be quite rare, but if there is a "best solution" which can be implemented here we should strive towards that. I'll assume that the implementation of this special treatment demonstrated in the KMeansTrainer
is a good example to follow.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, this synchronized idiom along with storing an rng as a field - https://github.com/oracle/tribuo/blob/main/Clustering/KMeans/src/main/java/org/tribuo/clustering/kmeans/KMeansTrainer.java#L278.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've added a commit which adds a RNG as a field, and the supporting logic. Let me know if anything else might be needed.
…er exemplars in some edge cases
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Needs a fix to the RNG creation, otherwise looks good.
"from the members of a cluster.") | ||
private long exemplarSampleSeed = Trainer.DEFAULT_SEED; | ||
|
||
private SplittableRandom rng = new SplittableRandom(exemplarSampleSeed); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The creation should happen in postConfig
, as OLCUT creates the object first then inserts all the field values, then calls postConfig
, so currently any trainers created with OLCUT from configs won't use the configured seed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right, I've made this change as well as adding the call to postConfig
from the Constructors.
* @return A list of {@link ClusterExemplar}s which are used for predictions. | ||
*/ | ||
private static List<ClusterExemplar> computeExemplars(SGDVector[] data, Map<Integer, List<Pair<Double, Integer>>> clusterAssignments, | ||
org.tribuo.math.distance.Distance dist) { | ||
private List<ClusterExemplar> computeExemplars(SGDVector[] data, Map<Integer, List<Pair<Double, Integer>>> clusterAssignments, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This can probably static again?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, good observation.
…l to it in the Constructors
Just following up to see what else might be needed here, no rush though. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, thanks.
Description
We need to improve the determination of cluster exemplars, which are used after the model is trained to make predictions. Some cases, such as those which use contrived datasets, can result in very "clean" clusters, with no outliers. One such case has exposed an issue in the current logic to determine cluster exemplars. The seed, which is used when sampling cluster exemplars at random from members of a cluster, is exposed as a config parameter.
Motivation
This change fixes #355 on the main branch.