---
title: "As1"
output: html_document
date: "2022-10-18"
---
Step 1: Load the data
Run the following code to load some clustering data with 2 features into your session.
```{r}
library(tidyverse)
library(RCurl)
x <- getURL("https://raw.githubusercontent.com/vankesteren/dav_practicals/master/12_Unsupervised_learning_Clustering/data/clusterdata.csv")
clus_df <- read.csv(text = x)
clus_df
```
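Before clustering, it can help to plot the raw data to see whether any groups are visible by eye. This is a small optional sketch; it assumes the two feature columns are called x1 and x2, as used in the clustering code further below.
```{r}
# Optional: quick look at the raw data before clustering (assumes columns x1 and x2)
ggplot(clus_df) +
  geom_point(aes(x1, x2)) +
  ggtitle("Raw clustering data")
```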
Step 2: Euclidean distance function
The Euclidean distance between two vectors $\mathbf{x}$ and $\mathbf{y}$ of equal length $N$ is $D = \|\mathbf{x} - \mathbf{y}\|_2 = \sqrt{\sum_{i=1}^{N} (x_i - y_i)^2}$. These two vectors represent points in $N$-dimensional space, and the Euclidean distance is the straight-line distance between these points.
Write a function l2_dist(x, y) that takes in two vectors and returns the Euclidean distance between them.
```{r}
# Euclidean distance between two numeric vectors of equal length
l2_dist <- function(x, y) {
  sqrt(sum((x - y)^2))
}
```
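As a quick sanity check of the distance function, a 3-4-5 right triangle should give a distance of 5:
```{r}
# Sanity check: the distance between (0, 0) and (3, 4) should be 5
l2_dist(c(0, 0), c(3, 4))
```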
Step 3: K-medians clustering algorithm
K-medians is a partitional clustering method. It is a variant of k-means that uses the median instead of the mean of each cluster to determine its centroid.
Program a K-medians clustering algorithm called kmedians. The inputs of this function should be X, a data frame, and K, an integer stating how many clusters you want. The output is (at least) the cluster assignments (an integer vector). Use your l2_dist function to compute the Euclidean distance. Add helpful comments to the code along the way.
```{r}
# Set seed for reproducible results
set.seed(1)

# Defining the K-medians clustering function
kmed_clustering <- function(x, k) {
  # Assign a random cluster to each observation
  x$clusters <- sample(k, size = nrow(x), replace = TRUE)
  repeat {
    changes <- TRUE # flag used to escape the loop
    # Compute the centroid of each cluster as the per-feature median
    # (assumes every cluster keeps at least one point)
    by_clus <- x %>%
      group_by(clusters) %>%
      summarise(
        centroidx = median(x1),
        centroidy = median(x2))
    # Empty matrix to store the distance from each point to each centroid
    distances <- matrix(NA, nrow = nrow(x), ncol = k)
    # Empty vector to store the selected (closest) cluster for each point
    winner <- integer(nrow(x))
    # Double loop to get the distance from each row to each centroid
    for (i in 1:nrow(x)) {
      for (clus in 1:k) {
        distances[i, clus] <- l2_dist(as.numeric(by_clus[clus, 2:3]), as.numeric(x[i, 1:2]))
      }
      # Assign the point to the closest centroid (which.min takes the first cluster in case of ties)
      winner[i] <- which.min(distances[i, 1:k])
    }
    # Escape the loop when the assignments do not change between two consecutive iterations
    if (identical(x$clusters, winner)) {
      changes <- FALSE
    }
    # Updating the cluster assignments
    x$clusters <- winner
    # If there are no more changes, break the loop
    if (!changes) break
  }
  # Returning the final cluster assignments as an integer vector
  return(winner)
}

# Running the function to obtain the K-medians cluster assignments
kmedians_clusters <- kmed_clustering(clus_df, 4)
# Creating a data frame to store the K-medians clusters
df_cluster_med <- clus_df
df_cluster_med$kmedian <- kmedians_clusters
df_cluster_med
```
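As a quick check on the result before plotting, the cluster sizes and the median centroids implied by the final assignments can be inspected:
```{r}
# Number of observations assigned to each K-medians cluster
table(df_cluster_med$kmedian)

# Median centroid of each cluster, recomputed from the final assignments
df_cluster_med %>%
  group_by(kmedian) %>%
  summarise(centroidx = median(x1), centroidy = median(x2))
```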
Step 4: Compare to K-means
Apply your kmedians function and the R kmeans function on clus_df and compare the results in a good visualisation. Reflect on the differences and similarities that you see.
```{r}
# Computing the k-means clusters with the built-in kmeans function on clus_df and storing them in a data frame
df_cluster_mean <- clus_df
df_cluster_mean$kmean <- kmeans(clus_df, centers = 4)$cluster

# patchwork is used to combine the two plots
library(patchwork)
# Plotting the K-medians clusters
p1 <- ggplot(df_cluster_med) + geom_point(aes(x1, x2, color = as.factor(kmedian))) + ggtitle("K-medians clusters")
# Plotting the K-means clusters
p2 <- ggplot(df_cluster_mean) + geom_point(aes(x1, x2, color = as.factor(kmean))) + ggtitle("K-means clusters")
# Combining both plots side by side, without legends
p1 + theme(legend.position = "none") + p2 + theme(legend.position = "none")
```
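Besides the visual comparison, a cross-tabulation of the two assignment vectors gives a rough idea of how much the solutions overlap; note that the cluster labels themselves are arbitrary, so only the pattern of the table is informative.
```{r}
# Cross-tabulation of K-medians vs. K-means assignments (labels are arbitrary)
table(kmedians = df_cluster_med$kmedian, kmeans = df_cluster_mean$kmean)
```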