The NeoML library provides several methods for clustering data.
The K-means method is the most popular clustering algorithm. It assigns each object to the cluster with the nearest center. It is implemented by the CKMeansClustering class.
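The assignment rule described above can be illustrated with a minimal standalone sketch. It does not use the NeoML API: the `Vec` alias and the functions `squaredDistance` and `assignToNearestCenter` are hypothetical names introduced only for this example, and squared Euclidean distance is assumed as the proximity measure.

```cpp
#include <cassert>
#include <cstddef>
#include <limits>
#include <vector>

// Hypothetical stand-in for a feature vector (not a NeoML type).
using Vec = std::vector<double>;

// Squared Euclidean distance between two vectors of equal length.
double squaredDistance( const Vec& a, const Vec& b )
{
	assert( a.size() == b.size() );
	double sum = 0.0;
	for( std::size_t i = 0; i < a.size(); ++i ) {
		const double d = a[i] - b[i];
		sum += d * d;
	}
	return sum;
}

// Returns, for each point, the index of the nearest center.
std::vector<int> assignToNearestCenter( const std::vector<Vec>& points,
	const std::vector<Vec>& centers )
{
	assert( !centers.empty() );
	std::vector<int> labels;
	labels.reserve( points.size() );
	for( const Vec& point : points ) {
		int best = 0;
		double bestDist = std::numeric_limits<double>::max();
		for( std::size_t c = 0; c < centers.size(); ++c ) {
			const double dist = squaredDistance( point, centers[c] );
			if( dist < bestDist ) {
				bestDist = dist;
				best = static_cast<int>( c );
			}
		}
		labels.push_back( best );
	}
	return labels;
}
```

A full K-means run would alternate this step with recomputing each center as the mean of its assigned points until the assignment stops changing.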
The ISODATA clustering algorithm is based on the geometrical proximity of the data points. The clustering result depends greatly on the initial settings. It is implemented by the CIsoDataClustering class.
The library provides a "naive" implementation of upward hierarchical clustering. It first creates one cluster per element, then merges clusters at each step until the final set of clusters is reached. It is implemented by the CHierarchicalClustering class.
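The merge loop of upward hierarchical clustering can be sketched in a few lines of standalone C++. This is not the NeoML implementation: the `Cluster` struct, the `agglomerate` function, the choice of merging the two clusters with the closest centers, and the `targetCount` stopping rule are all assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <limits>
#include <vector>

using Vec = std::vector<double>; // hypothetical feature vector type

struct Cluster {
	Vec center;               // mean of the member vectors
	std::vector<int> members; // indices of the input vectors
};

double squaredDistance( const Vec& a, const Vec& b )
{
	assert( a.size() == b.size() );
	double sum = 0.0;
	for( std::size_t i = 0; i < a.size(); ++i ) {
		sum += ( a[i] - b[i] ) * ( a[i] - b[i] );
	}
	return sum;
}

// Merges `from` into `into`; the new center is the size-weighted mean.
void mergeInto( Cluster& into, const Cluster& from )
{
	const double n = static_cast<double>( into.members.size() );
	const double m = static_cast<double>( from.members.size() );
	for( std::size_t i = 0; i < into.center.size(); ++i ) {
		into.center[i] = ( into.center[i] * n + from.center[i] * m ) / ( n + m );
	}
	into.members.insert( into.members.end(), from.members.begin(), from.members.end() );
}

// Starts with one cluster per point and merges the closest pair
// until only targetCount clusters remain (assumed stopping rule).
std::vector<Cluster> agglomerate( const std::vector<Vec>& points, std::size_t targetCount )
{
	assert( targetCount > 0 );
	std::vector<Cluster> clusters;
	for( std::size_t i = 0; i < points.size(); ++i ) {
		clusters.push_back( Cluster{ points[i], { static_cast<int>( i ) } } );
	}
	while( clusters.size() > targetCount ) {
		std::size_t bestA = 0;
		std::size_t bestB = 1;
		double bestDist = std::numeric_limits<double>::max();
		for( std::size_t a = 0; a < clusters.size(); ++a ) {
			for( std::size_t b = a + 1; b < clusters.size(); ++b ) {
				const double dist = squaredDistance( clusters[a].center, clusters[b].center );
				if( dist < bestDist ) {
					bestDist = dist;
					bestA = a;
					bestB = b;
				}
			}
		}
		mergeInto( clusters[bestA], clusters[bestB] );
		clusters.erase( clusters.begin() + static_cast<std::ptrdiff_t>( bestB ) );
	}
	return clusters;
}
```

The pairwise search makes each step quadratic in the number of clusters, which is why such an implementation is called naive.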
First come clustering is a simple algorithm that creates a new cluster for each new vector that is far enough from the existing clusters. It is implemented by the CFirstComeClustering class.
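The idea of opening a new cluster for a sufficiently distant vector can be sketched as follows. This standalone example does not reproduce the NeoML implementation: `FirstComeResult`, `firstComeClustering`, and the `maxDistance` threshold are hypothetical, and updating the center as a running mean when a vector joins an existing cluster is an assumption made for the sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <limits>
#include <vector>

using Vec = std::vector<double>; // hypothetical feature vector type

struct FirstComeResult {
	std::vector<Vec> centers; // one center per cluster
	std::vector<int> labels;  // cluster index for each input vector
};

double squaredDistance( const Vec& a, const Vec& b )
{
	assert( a.size() == b.size() );
	double sum = 0.0;
	for( std::size_t i = 0; i < a.size(); ++i ) {
		sum += ( a[i] - b[i] ) * ( a[i] - b[i] );
	}
	return sum;
}

// Processes vectors in order; a vector farther than maxDistance from
// every existing center starts a new cluster.
FirstComeResult firstComeClustering( const std::vector<Vec>& points, double maxDistance )
{
	FirstComeResult result;
	std::vector<int> sizes; // number of vectors in each cluster
	for( const Vec& point : points ) {
		std::size_t best = 0;
		double bestDist = std::numeric_limits<double>::max();
		for( std::size_t c = 0; c < result.centers.size(); ++c ) {
			const double dist = squaredDistance( point, result.centers[c] );
			if( dist < bestDist ) {
				bestDist = dist;
				best = c;
			}
		}
		if( result.centers.empty() || bestDist > maxDistance * maxDistance ) {
			// Far from everything seen so far: open a new cluster.
			result.centers.push_back( point );
			sizes.push_back( 1 );
			result.labels.push_back( static_cast<int>( result.centers.size() - 1 ) );
		} else {
			// Close enough: join the nearest cluster and update its
			// center as a running mean (an assumption of this sketch).
			Vec& center = result.centers[best];
			const double n = static_cast<double>( sizes[best] );
			for( std::size_t i = 0; i < center.size(); ++i ) {
				center[i] = ( center[i] * n + point[i] ) / ( n + 1.0 );
			}
			++sizes[best];
			result.labels.push_back( static_cast<int>( best ) );
		}
	}
	return result;
}
```

Because vectors are processed in order, the result of such a scheme depends on the order of the input as well as on the threshold.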
The input data to be split into clusters is passed to any of the algorithms as a pointer to an object that implements the IClusteringData interface:
class IClusteringData : public virtual IObject {
public:
	// The number of vectors
	virtual int GetVectorCount() const = 0;
	// The number of features
	virtual int GetFeaturesCount() const = 0;
	// Gets all input vectors as a matrix of size GetVectorCount() x GetFeaturesCount()
	virtual CFloatMatrixDesc GetMatrix() const = 0;
	// Gets the vector weight
	virtual double GetVectorWeight( int index ) const = 0;
};
Every clustering algorithm implements the IClustering interface:
class IClustering {
public:
	virtual ~IClustering() {}
	// Clusterizes the input data
	// and returns true if successful with the given parameters
	virtual bool Clusterize( const IClusteringData* data, CClusteringResult& result ) = 0;
};
The clustering result is described by the CClusteringResult class:
class NEOML_API CClusteringResult {
public:
	int ClusterCount;
	CArray<int> Data;
	CArray<CClusterCenter> Clusters;
};
- ClusterCount — the number of clusters
- Data — the array of cluster numbers for each of the input data elements (the clusters are numbered from 0 to ClusterCount - 1)
- Clusters — the cluster centers
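
To show how the Data array maps input elements to clusters, here is a small standalone sketch that groups element indices by cluster. It uses a hypothetical stand-in struct, `ClusteringResultSketch`, built on `std::vector` instead of the NeoML `CArray` type, and `groupByCluster` is likewise an illustrative helper and not part of the library.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for CClusteringResult (standard library only).
struct ClusteringResultSketch {
	int ClusterCount;      // the number of clusters
	std::vector<int> Data; // cluster index for each input element
};

// Returns, for each cluster, the indices of the input elements assigned to it.
std::vector<std::vector<int>> groupByCluster( const ClusteringResultSketch& result )
{
	assert( result.ClusterCount >= 0 );
	std::vector<std::vector<int>> groups( static_cast<std::size_t>( result.ClusterCount ) );
	for( std::size_t i = 0; i < result.Data.size(); ++i ) {
		const int cluster = result.Data[i];
		// Cluster numbers run from 0 to ClusterCount - 1.
		assert( cluster >= 0 && cluster < result.ClusterCount );
		groups[static_cast<std::size_t>( cluster )].push_back( static_cast<int>( i ) );
	}
	return groups;
}
```

With a real CClusteringResult the same loop would read the Data array by index; the exact accessors of the NeoML containers are not shown here.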