Improve the determination of cluster exemplars fix for main #357

Merged: 3 commits, Dec 19, 2023

Conversation

geoffreydstewart
Member

Description

We need to improve how cluster exemplars are determined; these are used to make predictions after the model is trained. Some datasets, particularly contrived ones, produce very "clean" clusters with no outliers, and one such case exposed an issue in the current exemplar-selection logic. This change also exposes, as a config parameter, the seed used when sampling cluster exemplars at random from the members of a cluster.
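As a rough illustration of the random-sampling fallback described above, here is a minimal sketch. The class name, method name, and parameters are hypothetical and are not taken from the Tribuo source; the sketch only assumes that some exemplars have already been chosen and that the remainder are drawn with replacement from the cluster's members using a seeded `SplittableRandom`.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.SplittableRandom;

// Hypothetical sketch, not the Tribuo implementation.
final class ExemplarSamplingSketch {

    /**
     * Tops up the already chosen exemplar indices until there are numExemplars of them,
     * sampling with replacement from the members of the cluster.
     */
    static List<Integer> fillExemplars(List<Integer> chosen, List<Integer> clusterMembers,
                                       int numExemplars, SplittableRandom rng) {
        if (clusterMembers.isEmpty()) {
            throw new IllegalArgumentException("Cannot sample exemplars from an empty cluster");
        }
        List<Integer> exemplars = new ArrayList<>(chosen);
        while (exemplars.size() < numExemplars) {
            // Duplicates are possible here, but the number of exemplars is preserved.
            int position = rng.nextInt(clusterMembers.size());
            exemplars.add(clusterMembers.get(position));
        }
        return exemplars;
    }

    public static void main(String[] args) {
        List<Integer> members = Arrays.asList(3, 7, 11, 15, 19);
        List<Integer> chosen = Arrays.asList(7);
        // The same seed gives the same sample, so runs are repeatable.
        List<Integer> first = fillExemplars(chosen, members, 4, new SplittableRandom(12345L));
        List<Integer> second = fillExemplars(chosen, members, 4, new SplittableRandom(12345L));
        System.out.println(first + " repeatable: " + first.equals(second));
    }
}
```

Passing the RNG in as an argument keeps the helper free of instance state, so the seed handling can live with the caller.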

Motivation

This change fixes #355 on the main branch.

oracle-contributor-agreement bot added the OCA Verified label (All contributors have signed the Oracle Contributor Agreement.) on Nov 17, 2023
// To determine the remaining exemplars, the best thing to do is randomly sample them from all the
// points in this cluster. This could introduce duplicate exemplar points, but that is safer than
// reducing the number of exemplars.
SplittableRandom rand = new SplittableRandom(exemplarSampleSeed);
Member

How much of an edge case is this randomized fallback? We have special treatment for all other RNGs in Tribuo where they are members of the trainer and split under a lock to preserve provenance information, which ensures that repeated runs of a trainer on the same data give different answers when not controlling the RNG state (e.g. when used in an ensemble). If this is purely an edge case issue then it might be ok to leave it as is, but if it might occur relatively frequently then it's probably worth using the same idiom we do elsewhere.
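For readers unfamiliar with the idiom mentioned here, the following is a simplified sketch of a trainer that keeps its RNG as a member and splits it under a lock on each training call. The class and method names are hypothetical, and provenance capture is omitted; it is only meant to show why repeated calls differ from one another while the overall sequence stays reproducible for a given seed.

```java
import java.util.SplittableRandom;

// Hypothetical sketch of the "RNG as a member, split under a lock" idiom.
final class SplitRngTrainerSketch {
    private final SplittableRandom rng;
    private int trainInvocationCounter = 0;

    SplitRngTrainerSketch(long seed) {
        this.rng = new SplittableRandom(seed);
    }

    /** Stands in for train(): returns a value drawn from this call's private RNG. */
    long train() {
        SplittableRandom localRng;
        synchronized (this) {
            // Each call takes its own child RNG and advances the shared parent state,
            // so concurrent or repeated calls never share a random stream.
            localRng = rng.split();
            trainInvocationCounter++;
        }
        return localRng.nextLong();
    }

    synchronized int getInvocationCount() {
        return trainInvocationCounter;
    }

    public static void main(String[] args) {
        SplitRngTrainerSketch first = new SplitRngTrainerSketch(42L);
        SplitRngTrainerSketch second = new SplitRngTrainerSketch(42L);
        long a1 = first.train();
        long a2 = first.train();
        long b1 = second.train();
        long b2 = second.train();
        System.out.println("call 1 matches across trainers: " + (a1 == b1));
        System.out.println("call 2 matches across trainers: " + (a2 == b2));
        System.out.println("invocations on first trainer: " + first.getInvocationCount());
    }
}
```

Recording the invocation count alongside the seed is what would let a given call's RNG state be reconstructed later, which is the role provenance plays in the real trainers.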

Member Author

This edge case seems to be quite rare, but if there is a "best solution" which can be implemented here we should strive towards that. I'll assume that the implementation of this special treatment demonstrated in the KMeansTrainer is a good example to follow.

Member Author

I've added a commit which adds an RNG as a field, along with the supporting logic. Let me know if anything else might be needed.

Member

Craigacp left a comment

Needs a fix to the RNG creation, otherwise looks good.

"from the members of a cluster.")
private long exemplarSampleSeed = Trainer.DEFAULT_SEED;

private SplittableRandom rng = new SplittableRandom(exemplarSampleSeed);
Member

The creation should happen in postConfig. OLCUT creates the object first, then inserts all the field values, then calls postConfig, so currently any trainers created with OLCUT from configs won't use the configured seed.
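To make the ordering problem concrete, here is a small sketch that imitates the three steps by hand: construct the object, overwrite the seed field, then call `postConfig`. It does not use OLCUT itself. The classes, the plain field assignment standing in for config injection, and the `DEFAULT_SEED` value are all assumptions for illustration.

```java
import java.util.SplittableRandom;

// Hypothetical sketch; the config system is simulated with plain field assignment.
final class PostConfigSketch {
    static final long DEFAULT_SEED = 12345L; // placeholder default, assumed for this sketch

    /** RNG built in a field initializer: it is seeded before any config value arrives. */
    static class EagerRng {
        long seed = DEFAULT_SEED;
        SplittableRandom rng = new SplittableRandom(seed);
    }

    /** RNG built in postConfig: it is seeded after the field values have been inserted. */
    static class LazyRng {
        long seed = DEFAULT_SEED;
        SplittableRandom rng;

        LazyRng() { }                 // the path a config system would use

        LazyRng(long seed) {          // the programmatic path calls postConfig itself
            this.seed = seed;
            postConfig();
        }

        void postConfig() {
            rng = new SplittableRandom(seed);
        }
    }

    public static void main(String[] args) {
        long expected = new SplittableRandom(42L).nextLong();

        EagerRng eager = new EagerRng();
        eager.seed = 42L;             // simulated field injection, arrives too late
        System.out.println("eager uses configured seed: " + (eager.rng.nextLong() == expected));

        LazyRng lazy = new LazyRng();
        lazy.seed = 42L;              // simulated field injection
        lazy.postConfig();            // RNG is created from the injected seed
        System.out.println("lazy uses configured seed: " + (lazy.rng.nextLong() == expected));
    }
}
```

Calling `postConfig` from the seeded constructor keeps the programmatic and config-driven construction paths consistent with each other.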

Member Author

Right, I've made this change and also added a call to postConfig from the constructors.

* @return A list of {@link ClusterExemplar}s which are used for predictions.
*/
- private static List<ClusterExemplar> computeExemplars(SGDVector[] data, Map<Integer, List<Pair<Double, Integer>>> clusterAssignments,
-                                                       org.tribuo.math.distance.Distance dist) {
+ private List<ClusterExemplar> computeExemplars(SGDVector[] data, Map<Integer, List<Pair<Double, Integer>>> clusterAssignments,
Member

This can probably be static again?

Member Author

Yes, good observation.

geoffreydstewart
Member Author

> Needs a fix to the RNG creation, otherwise looks good.

Just following up to see what else might be needed here, no rush though.

Member

Craigacp left a comment

LGTM, thanks.

Craigacp merged commit ecab357 into oracle:main on Dec 19, 2023
8 checks passed
Labels
OCA Verified All contributors have signed the Oracle Contributor Agreement.
Development

Successfully merging this pull request may close these issues.

HDBSCAN implementation in 4.3+
2 participants