Improve the determination of cluster exemplars fix for main #357

geoffreydstewart · 2023-11-17T18:58:53Z

Description

We need to improve the determination of cluster exemplars, which are used after the model is trained to make predictions. Some cases, such as those which use contrived datasets, can result in very "clean" clusters, with no outliers. One such case has exposed an issue in the current logic to determine cluster exemplars. The seed, which is used when sampling cluster exemplars at random from members of a cluster, is exposed as a config parameter.
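As a rough illustration of the seeded fallback described above, a minimal sketch might look like the following. The class and method names are hypothetical and this is not the code in HdbscanTrainer; it only assumes that exemplars are drawn by index from a cluster's members and that duplicates are acceptable.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.SplittableRandom;

// Hypothetical sketch, not Tribuo code: draws exemplar indices at random
// from a cluster's members, with replacement.
final class ExemplarSamplingSketch {

    // memberIndices must be non-empty, otherwise nextInt(0) throws.
    static List<Integer> sampleWithReplacement(List<Integer> memberIndices,
                                               int numNeeded,
                                               SplittableRandom rng) {
        List<Integer> sampled = new ArrayList<>(numNeeded);
        for (int i = 0; i < numNeeded; i++) {
            // Sampling with replacement: the same member may be picked twice,
            // which keeps the exemplar count at numNeeded even for tiny clusters.
            int pick = rng.nextInt(memberIndices.size());
            sampled.add(memberIndices.get(pick));
        }
        return sampled;
    }
}
```

Because the draw depends only on the seed and the member list, two generators built from the same seed produce the same exemplars.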

Motivation

This change fixes #355 on the main branch.

Craigacp · 2023-11-22T01:03:54Z

Clustering/Hdbscan/src/main/java/org/tribuo/clustering/hdbscan/HdbscanTrainer.java

+                    // To determine the remaining exemplars, the best thing to do is randomly sample them from all the
+                    // points in this cluster. This could introduce duplicate exemplar points, but that is safer than
+                    // reducing the number of exemplars.
+                    SplittableRandom rand = new SplittableRandom(exemplarSampleSeed);


How much of an edge case is this randomized fallback? We have special treatment for all other RNGs in Tribuo where they are members of the trainer and split under a lock to preserve provenance information, which ensures that repeated runs of a trainer on the same data give different answers when not controlling the RNG state (e.g. when used in an ensemble). If this is purely an edge case issue then it might be ok to leave it as is, but if it might occur relatively frequently then it's probably worth using the same idiom we do elsewhere.

This edge case seems to be quite rare, but if there is a "best solution" which can be implemented here we should strive towards that. I'll assume that the implementation of this special treatment demonstrated in the KMeansTrainer is a good example to follow.

Yeah, this synchronized idiom, along with storing an RNG as a field: https://github.com/oracle/tribuo/blob/main/Clustering/KMeans/src/main/java/org/tribuo/clustering/kmeans/KMeansTrainer.java#L278

I've added a commit which adds an RNG as a field, and the supporting logic. Let me know if anything else might be needed.
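For readers unfamiliar with the idiom discussed in this thread, here is a minimal sketch of an RNG held as a field and split under a lock on each training call. All names are hypothetical and the class is a stand-in, not the KMeansTrainer or HdbscanTrainer code; it only assumes that each call should get its own generator and that the call count is tracked alongside it.

```java
import java.util.SplittableRandom;

// Hypothetical stand-in for a trainer; not Tribuo code.
final class SplitRngSketch {
    private final SplittableRandom rng;
    private int trainInvocationCounter = 0;

    SplitRngSketch(long seed) {
        this.rng = new SplittableRandom(seed);
    }

    // Returns the first value drawn by this call's private generator.
    long train() {
        SplittableRandom localRng;
        synchronized (this) {
            // Split and count under one lock so that concurrent callers
            // each get a distinct generator tied to a distinct call number.
            localRng = rng.split();
            trainInvocationCounter++;
        }
        // The rest of training would use localRng without holding the lock.
        return localRng.nextLong();
    }

    synchronized int getInvocationCount() {
        return trainInvocationCounter;
    }
}
```

With this shape, successive calls on one instance draw from different generators, while a fresh instance built with the same seed replays the same sequence of calls.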

Craigacp

Needs a fix to the RNG creation, otherwise looks good.

Craigacp · 2023-12-06T15:12:07Z

Clustering/Hdbscan/src/main/java/org/tribuo/clustering/hdbscan/HdbscanTrainer.java

+        "from the members of a cluster.")
+    private long exemplarSampleSeed = Trainer.DEFAULT_SEED;
+
+    private SplittableRandom rng = new SplittableRandom(exemplarSampleSeed);


The creation should happen in postConfig. OLCUT creates the object first, then inserts all the field values, then calls postConfig, so a field initializer runs before the configured seed is set. As written, any trainer created with OLCUT from a config won't use the configured seed.

Right, I've made this change and added a call to postConfig from the constructors.
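The ordering problem can be reproduced without OLCUT. The sketch below simulates the sequence described in the review (construct, set fields, call postConfig) using plain reflection; the class, the `fromConfig` helper, and the default seed value are all hypothetical, and nothing here is OLCUT's or Tribuo's actual API.

```java
import java.lang.reflect.Field;
import java.util.SplittableRandom;

// Hypothetical sketch; simulates a config system with reflection.
final class ConfiguredTrainerSketch {
    static final long DEFAULT_SEED = 12345L; // assumed value, for illustration only

    private long seed = DEFAULT_SEED;

    // Built during field initialization, before any configured value arrives,
    // so it always sees DEFAULT_SEED.
    private final SplittableRandom eagerRng = new SplittableRandom(seed);

    // Built in postConfig, after the seed field holds its final value.
    private SplittableRandom rng;

    ConfiguredTrainerSketch() { }

    ConfiguredTrainerSketch(long seed) {
        this.seed = seed;
        postConfig(); // programmatic construction takes the same path
    }

    void postConfig() {
        this.rng = new SplittableRandom(seed);
    }

    long firstEager() { return eagerRng.split().nextLong(); }

    long firstConfigured() { return rng.split().nextLong(); }

    // Stand-in for the config system: construct, inject field, then postConfig.
    static ConfiguredTrainerSketch fromConfig(long configuredSeed) {
        try {
            ConfiguredTrainerSketch t = new ConfiguredTrainerSketch();
            Field f = ConfiguredTrainerSketch.class.getDeclaredField("seed");
            f.setAccessible(true);
            f.setLong(t, configuredSeed);
            t.postConfig();
            return t;
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

In this sketch `firstEager()` ignores the injected seed, while `firstConfigured()` agrees with an instance built through the seed-taking constructor, which is why calling postConfig from the constructors keeps both construction paths consistent.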

Craigacp · 2023-12-06T15:12:57Z

Clustering/Hdbscan/src/main/java/org/tribuo/clustering/hdbscan/HdbscanTrainer.java

     * @return A list of {@link ClusterExemplar}s which are used for predictions.
     */
-    private static List<ClusterExemplar> computeExemplars(SGDVector[] data, Map<Integer, List<Pair<Double, Integer>>> clusterAssignments,
-                                                          org.tribuo.math.distance.Distance dist) {
+    private List<ClusterExemplar> computeExemplars(SGDVector[] data, Map<Integer, List<Pair<Double, Integer>>> clusterAssignments,


This can probably be static again?

Yes, good observation.

geoffreydstewart · 2023-12-18T23:52:02Z

Needs a fix to the RNG creation, otherwise looks good.

Just following up to see what else might be needed here, no rush though.

Craigacp

LGTM, thanks.

Improve the determination of cluster exemplars, which are used after …

1de1444

…the model is trained to make predictions

oracle-contributor-agreement bot added the "OCA Verified" label (All contributors have signed the Oracle Contributor Agreement.) Nov 17, 2023

Craigacp reviewed Nov 22, 2023

Add a RNG to HdbscanTrainer to be used for the determination of clust…

6df1fc0

…er exemplars in some edge cases

Craigacp requested changes Dec 6, 2023

Move the RNG initialization to the postConfig method, and add the cal…

bbc7deb

…l to it in the Constructors

Craigacp approved these changes Dec 19, 2023

Craigacp merged commit ecab357 into oracle:main Dec 19, 2023
8 checks passed

geoffreydstewart commented Nov 17, 2023

Craigacp Nov 22, 2023

geoffreydstewart Nov 23, 2023

Craigacp Nov 24, 2023

geoffreydstewart Nov 29, 2023

Craigacp left a comment

Craigacp Dec 6, 2023

geoffreydstewart Dec 6, 2023

Craigacp Dec 6, 2023

geoffreydstewart Dec 6, 2023

geoffreydstewart commented Dec 18, 2023

Craigacp left a comment
