Clustering Algorithms

Joaquin Bedia edited this page Feb 14, 2020 · 9 revisions

K-means:

The K-means algorithm aims to minimize the distance between observations within the same cluster. It requires the number of clusters K (argument centers), which has no default. K-means uses random initialization to obtain the clusters, so the centroid coordinates and the cluster ordering will differ from one run to the next. These and other features of the algorithm can be tuned by passing clusterGrid the corresponding arguments of the kmeans function from the R package stats. An example follows:

library(transformeR)
data(NCEP_Iberia_psl, package = "transformeR")

A reanalysis of sea level pressure over the Iberian Peninsula will be used as the dataset in order to obtain 10 CTs (the clusters):

clusters <- clusterGrid(NCEP_Iberia_psl, type = "kmeans", centers = 10, iter.max = 1000)

After that, the centroids of the CTs can be plotted by using spatialPlot, a function in visualizeR, if we first process the output of clusterGrid by using subsetGrid as follows:

cts <- lapply(1:attr(clusters, "centers"), function(x) {
  climatology(subsetGrid(clusters, cluster = x))})

# A list of K grids has been created; the CT centroids are now located in the time dimension of each grid.
# "makeMultiGrid" can be used to create a multigrid containing all the elements of the CTs list.

cts.mg <- makeMultiGrid(cts, skip.temporal.check = TRUE)
visualizeR::spatialPlot(cts.mg, backdrop.theme = "coastline", rev.colors = TRUE, main="PSL Clusters from NCEP Iberia (Kmeans)", layout = c(2,ceiling(attr(clusters, "centers")/2)), as.table = TRUE)
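
Because K-means starts from random centroids, repeated runs can return different centroids and a different cluster ordering. The following is a minimal sketch using stats::kmeans directly on synthetic data (the matrix x and the choice of 3 centers are illustrative assumptions, not part of the example above). It shows that fixing the random seed makes a run reproducible, and that nstart, a standard kmeans argument, keeps the best of several random starts. Since clusterGrid passes extra arguments on to kmeans, calling set.seed before clusterGrid should have a similar effect, although that is not verified here.

```r
# Synthetic data for illustration only: 200 observations, 2 variables
set.seed(1)
x <- rbind(matrix(rnorm(200, mean = 0), ncol = 2),
           matrix(rnorm(200, mean = 4), ncol = 2))

# Two runs started from the same seed give identical centroids and labels
set.seed(42)
km1 <- kmeans(x, centers = 3, iter.max = 1000)
set.seed(42)
km2 <- kmeans(x, centers = 3, iter.max = 1000)
identical(km1$centers, km2$centers)
identical(km1$cluster, km2$cluster)

# nstart > 1 tries several random initializations and keeps the one
# with the lowest total within-cluster sum of squares
km3 <- kmeans(x, centers = 3, iter.max = 1000, nstart = 25)
km3$tot.withinss
```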

Hierarchical:

In contrast to K-means, the hierarchical algorithm does not require the number of clusters to be provided: the user may specify it or not. If centers is not provided, the number of clusters is set automatically, and the hierarchical "tree" is cut where the height difference between two consecutive divisions (sorted in ascending order) is larger than the interquartile range of the heights vector.

In this example, centers will not be provided, so the algorithm decides the number of clusters itself:

clusters <- clusterGrid(NCEP_Iberia_psl, type = "hierarchical")

The clusters will be plotted using spatialPlot after processing the data with subsetGrid and makeMultiGrid:

cts <- lapply(1:attr(clusters, "centers"), function(x) {
  climatology(subsetGrid(clusters, cluster = x))})
cts.mg <- makeMultiGrid(cts, skip.temporal.check = TRUE)
visualizeR::spatialPlot(cts.mg, backdrop.theme = "coastline", rev.colors = TRUE, main="PSL Clusters from NCEP Iberia (Hierarchical)", layout = c(2,ceiling(attr(clusters, "centers")/2)), as.table = TRUE)
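
The automatic cut rule described in this section can be sketched in base R with hclust and cutree. This is only one reading of the rule on synthetic data, not the code used inside clusterGrid: the data matrix x, the complete linkage method and the fallback to a single cluster when no gap exceeds the interquartile range are all assumptions made for illustration.

```r
# Synthetic data for illustration only: three groups of 30 points each
set.seed(1)
x <- rbind(matrix(rnorm(60, mean = 0,  sd = 0.5), ncol = 2),
           matrix(rnorm(60, mean = 5,  sd = 0.5), ncol = 2),
           matrix(rnorm(60, mean = 10, sd = 0.5), ncol = 2))

# Build the tree (linkage method is an assumption)
hc <- hclust(dist(x), method = "complete")

# Merge heights in ascending order and the gaps between consecutive ones
h <- sort(hc$height)
gaps <- diff(h)

# First gap that exceeds the interquartile range of the heights
first_big <- which(gaps > IQR(h))[1]

# Cutting between h[first_big] and h[first_big + 1] keeps the first
# 'first_big' merges, which leaves length(h) - first_big + 1 clusters.
# If no gap exceeds the IQR, fall back to one cluster (assumption).
k <- if (is.na(first_big)) 1L else length(h) - first_big + 1L

groups <- cutree(hc, k = k)
table(groups)
```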

SOM:

When using the SOM algorithm, the argument centers is provided as a two-element vector giving the dimensions (xdim, ydim) of the grid. If it is omitted, 48 clusters (8x6) with rectangular topology are obtained by default.

In this example, SOM is forced to create 10 CTs, which are then plotted:

clusters <- clusterGrid(NCEP_Iberia_psl, type = "som", centers = c(10, 1))

cts <- lapply(1:attr(clusters, "centers"), function(x) {
  climatology(subsetGrid(clusters, cluster = x))})
cts.mg <- makeMultiGrid(cts, skip.temporal.check = TRUE) 
visualizeR::spatialPlot(cts.mg, backdrop.theme = "coastline", rev.colors = TRUE, main="PSL Clusters from NCEP Iberia (SOM)", layout = c(2,ceiling(attr(clusters, "centers")/2)), as.table = TRUE)
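
To illustrate how a two-element centers vector such as c(10, 1) translates into a SOM grid, here is a minimal sketch that calls the kohonen package directly on synthetic data. It assumes kohonen version 3 or later (where codes is a list); the data matrix x is made up, and the way clusterGrid maps centers onto the grid internally is not shown in this page, so treat the correspondence as an assumption.

```r
library(kohonen)

# Synthetic data for illustration only: 500 observations, 4 variables
set.seed(1)
x <- matrix(rnorm(500 * 4), ncol = 4)

# A 10 x 1 rectangular grid gives 10 units, i.e. up to 10 clusters
grid <- somgrid(xdim = 10, ydim = 1, topo = "rectangular")
fit <- som(x, grid = grid)

# One codebook vector (centroid) per unit
nrow(fit$codes[[1]])

# Unit (cluster) assigned to each observation
table(fit$unit.classif)
```

Some units may end up with no observations assigned, so the number of non-empty clusters can be smaller than xdim * ydim.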

Lamb Weather Types:

Lamb Weather Types (LWTs) are among the best-known and most analysed weather type (WT) classifications, developed for the British Isles by Lamb (1972). The scheme is applied to daily sea level pressure data and defines 26 different WTs: 10 pure types (NE, E, SE, S, SW, W, NW, N, C and A) and 16 hybrid types (8 for each of C and A). For further information, see Jones et al. (2013) [1].

In the following example, we use daily sea level pressure from the NCEP1 Reanalysis for the 2001-2010 period to obtain the LWTs using clusterGrid. This dataset is included in the transformeR package.

data(NCEP_slp_2001_2010, package = "transformeR")

clusters <- clusterGrid(NCEP_slp_2001_2010, type = "lamb")

Plot the spatial distribution of the LWTs:

cts <- lapply(1:attr(clusters, "centers"), function(x) {
  climatology(subsetGrid(clusters, cluster = x))})
cts.mg <- makeMultiGrid(cts, skip.temporal.check = TRUE)
visualizeR::spatialPlot(cts.mg, backdrop.theme = "coastline", rev.colors = TRUE, main="PSL Clusters from NCEP (Lamb WTs)", as.table = TRUE)

[1] Jones, P. D., Harpham, C., & Briffa, K. R. (2013). Lamb weather types derived from reanalysis products. International Journal of Climatology, 33(5), 1129-1139. https://doi.org/10.1002/joc.3498

Session Info

R version 3.6.2 (2019-12-12)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 18.04.4 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.7.1
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.7.1

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=es_ES.UTF-8        LC_COLLATE=en_US.UTF-8     LC_MONETARY=es_ES.UTF-8   
 [6] LC_MESSAGES=en_US.UTF-8    LC_PAPER=es_ES.UTF-8       LC_NAME=C                  LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=es_ES.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] magrittr_1.5      transformeR_1.7.1

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.3          rmsfact_0.0.3       codetools_0.2-16    lattice_0.20-38     grid_3.6.2          spam_2.5-1          kohonen_3.0.10     
 [8] raster_3.0-12       sp_1.3-2            akima_0.6-2         cowsay_0.7.0        Matrix_1.2-18       fortunes_1.5-4      tools_3.6.2        
[15] RcppEigen_0.3.3.7.0 maps_3.3.0          fields_10.3         parallel_3.6.2      abind_1.4-5         compiler_3.6.2      dotCall64_1.0-0