-
Notifications
You must be signed in to change notification settings - Fork 41
/
asa136.html
304 lines (263 loc) · 8.2 KB
/
asa136.html
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
<html>
<head>
<title>
ASA136 - The K-Means Algorithm
</title>
</head>
<body bgcolor="#EEEEEE" link="#CC0000" alink="#FF3300" vlink="#000055">
<h1 align = "center">
ASA136 <br> The K-Means Algorithm
</h1>
<hr>
<p>
<b>ASA136</b>
is a C++ library which
divides M points in N dimensions into K clusters
so that the within-clusters sum of squares is minimized.
</p>
<p>
<b>ASA136</b> is Applied Statistics Algorithm 136. Source code for many
Applied Statistics Algorithms is available through
<a href = "http://lib.stat.cmu.edu/apstat">STATLIB</a>.
</p>
<p>
In the K-Means problem, a set of N points X(I) in M-dimensions is
given. The goal is to arrange these points into K clusters,
with each cluster having a representative point Z(J), usually
chosen as the centroid of the points in the cluster. The energy
of each cluster is <pre>
E(J) = Sum ( all points X(I) in cluster J ) || X(I) - Z(J) ||^2
</pre>
</p>
<p>
For a given set of clusters, the total energy is then simply
the sum of the cluster energies E(J). The goal is to choose the
clusters in such a way that the total energy is minimized.
Usually, a point X(I) goes into the cluster with the closest
representative point Z(J). So to define the clusters, it's
enough simply to specify the locations of the cluster representatives.
</p>
<p>
This is actually a fairly hard problem. Most algorithms do
reasonably well, but cannot guarantee that the best solution
has been found. It is very common for algorithms to get
stuck at a solution which is merely a "local minimum".
For such a local minimum, every slight rearrangement of
the solution makes the energy go up; however a major
rearrangement would result in a big drop in energy.
</p>
<p>
A simple algorithm for the problem is known as "H-Means".
It alternates between two procedures:
<ul>
<li>
Using the given cluster centers, assign each point to the
cluster with the nearest center;
</li>
<li>
Using the given cluster assignments, replace each cluster
center by the centroid or average of the points in the cluster.
</li>
</ul>
These steps are repeated until no points are moved, or some
other termination criterion is reached.
</p>
<p>
A more sophisticated algorithm, known as "K-Means", takes advantage
of the fact that it is possible to quickly determine the decrease in
energy caused by moving a point from its current cluster to another.
It repeats the following procedure:
<ul>
<li>
For each point, move it to another cluster if that would lower
the energy. If you move a point, immediately update the
cluster centers of the two affected clusters.
</li>
</ul>
This procedure is repeated until no points are moved, or some
other termination criterion is reached.
</p>
<h3 align = "center">
Licensing:
</h3>
<p>
The computer code and data files described and made available on this web page
are distributed under
<a href = "../../txt/gnu_lgpl.txt">the GNU LGPL license.</a>
</p>
<h3 align = "center">
Languages:
</h3>
<p>
<b>ASA136</b> is available in
<a href = "../../c_src/asa136/asa136.html">a C version</a> and
<a href = "../../cpp_src/asa136/asa136.html">a C++ version</a> and
<a href = "../../f77_src/asa136/asa136.html">a FORTRAN77 version</a> and
<a href = "../../f_src/asa136/asa136.html">a FORTRAN90 version</a> and
<a href = "../../m_src/asa136/asa136/html">a MATLAB version.</a>
</p>
<h3 align = "center">
Related Data and Programs:
</h3>
<p>
<a href = "../../cpp_src/asa058/asa058.html">
ASA058</a>,
a C++ library which
carries out the K-means algorithm for clustering data.
</p>
<p>
<a href = "../../cpp_src/asa113/asa113.html">
ASA113</a>,
a C++ library which
implements the Banfield and Bassill clustering algorithm using
transfers and swaps.
</p>
<p>
<a href = "../../cpp_src/cities/cities.html">
CITIES</a>,
a C++ library which
handles various problems associated
with a set of "cities" on a map.
</p>
<p>
<a href = "../../datasets/cities/cities.html">
CITIES</a>,
a dataset directory which
contains a number of city distance datasets.
</p>
<p>
<a href = "../../cpp_src/kmeans/kmeans.html">
KMEANS</a>,
a C++ library which
contains several implementations of the H-Means and K-Means clustering algorithms.
</p>
<p>
<a href = "../../f_src/lau_np/lau_np.html">
LAU_NP</a>,
a FORTRAN90 library which
contains heuristic algorithms for the K-center and K-median problems.
</p>
<p>
<a href = "../../f_src/spaeth/spaeth.html">
SPAETH</a>,
a FORTRAN90 library which
clusters data according to various principles.
</p>
<p>
<a href = "../../datasets/spaeth/spaeth.html">
SPAETH</a>,
a dataset directory which
contains test data for clustering.
</p>
<p>
<a href = "../../f_src/spaeth2/spaeth2.html">
SPAETH2</a>,
a FORTRAN90 library which
can cluster data according to various principles.
</p>
<p>
<a href = "../../datasets/spaeth2/spaeth2.html">
SPAETH2</a>,
a dataset directory which
contains test data for clustering.
</p>
<h3 align = "center">
Reference:
</h3>
<p>
<ol>
<li>
John Hartigan, Manchek Wong,<br>
Algorithm AS 136:
A K-Means Clustering Algorithm,<br>
Applied Statistics,<br>
Volume 28, Number 1, 1979, pages 100-108.
</li>
<li>
Wendy Martinez, Angel Martinez,<br>
Computational Statistics Handbook with MATLAB,<br>
pages 373-376,<br>
Chapman and Hall / CRC, 2002.
</li>
<li>
David Sparks,<br>
Algorithm AS 58: Euclidean Cluster Analysis,<br>
Applied Statistics,<br>
Volume 22, Number 1, 1973, pages 126-130.
</li>
</ol>
</p>
<h3 align = "center">
Source Code:
</h3>
<p>
<ul>
<li>
<a href = "asa136.cpp">asa136.cpp</a>, the source code.
</li>
<li>
<a href = "asa136.hpp">asa136.hpp</a>, the include file.
</li>
<li>
<a href = "asa136.sh">asa136.sh</a>,
commands to compile the source code.
</li>
</ul>
</p>
<h3 align = "center">
Examples and Tests:
</h3>
<p>
<ul>
<li>
<a href = "asa136_prb.cpp">asa136_prb.cpp</a>,
a sample calling program.
</li>
<li>
<a href = "asa136_prb.sh">asa136_prb.sh</a>,
commands to compile and run the sample program.
</li>
<li>
<a href = "points_100.txt">points_100.txt</a>,
a sample dataset of 100 points in 2D, to be arranged into 5 classes.
</li>
<li>
<a href = "asa136_prb_output.txt">asa136_prb_output.txt</a>,
the output file.
</li>
</ul>
</p>
<h3 align = "center">
List of Routines:
</h3>
<p>
<ul>
<li>
<b>KMNS</b> carries out the K-means algorithm.
</li>
<li>
<b>OPTRA</b> carries out the optimal transfer stage.
</li>
<li>
<b>QTRAN</b> carries out the quick transfer stage.
</li>
<li>
<b>R8_HUGE</b> returns a "huge" R8.
</li>
<li>
<b>TIMESTAMP</b> prints out the current YMDHMS date as a timestamp.
</li>
</ul>
</p>
<p>
You can go up one level to <a href = "../cpp_src.html">
the C++ source codes</a>.
</p>
<hr>
<i>
Last revised on 15 February 2008.
</i>
<!-- John Burkardt -->
</body>
<!-- Initial HTML skeleton created by HTMLINDEX. -->
</html>