en_Support Vector Machine (SVM).srt
0
00:00:00,530 --> 00:00:02,600
Hello, and welcome!
1
00:00:02,600 --> 00:00:09,260
In this video we will learn a machine learning method called Support Vector Machine (or SVM),
2
00:00:09,260 --> 00:00:11,059
which is used for classification.
3
00:00:11,059 --> 00:00:13,499
So let’s get started.
4
00:00:13,499 --> 00:00:18,830
Imagine that you’ve obtained a dataset containing characteristics of thousands of human cell
5
00:00:18,830 --> 00:00:25,760
samples extracted from patients who were believed to be at risk of developing cancer.
6
00:00:25,760 --> 00:00:31,150
Analysis of the original data showed that many of the characteristics differed significantly
7
00:00:31,150 --> 00:00:34,829
between benign and malignant samples.
8
00:00:34,829 --> 00:00:40,929
You can use the values of these cell characteristics in samples from other patients to give an
9
00:00:40,929 --> 00:00:46,510
early indication of whether a new sample might be benign or malignant.
10
00:00:46,510 --> 00:00:53,979
You can use a support vector machine, or SVM, as a classifier to train your model to recognize
11
00:00:53,979 --> 00:00:59,520
patterns within the data that might indicate benign or malignant cells.
12
00:00:59,520 --> 00:01:05,269
Once the model has been trained, it can be used to classify a new or unknown cell with
13
00:01:05,269 --> 00:01:06,990
rather high accuracy.
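As a rough illustration of this workflow, here is a minimal sketch of how such a classifier might be trained with scikit-learn in Python. The file name, column names, and label encoding are hypothetical placeholders, not the actual dataset from this video.

import pandas as pd
from sklearn import svm
from sklearn.model_selection import train_test_split

df = pd.read_csv("cell_samples.csv")                      # hypothetical file name
X = df[["ClumpThickness", "UnitSize"]].values             # hypothetical column names
y = df["Class"].values                                    # hypothetical benign/malignant label

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=4)

clf = svm.SVC(kernel="rbf")                               # RBF is a common default kernel
clf.fit(X_train, y_train)
y_hat = clf.predict(X_test)                               # classify unseen cell samples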
14
00:01:06,990 --> 00:01:11,570
Now, let me give you a formal definition of SVM.
15
00:01:11,570 --> 00:01:19,689
A Support Vector Machine is a supervised algorithm that can classify cases by finding a separator.
16
00:01:19,689 --> 00:01:27,960
SVM works by first, mapping data to a high-dimensional feature space so that data points can be categorized,
17
00:01:27,960 --> 00:01:31,490
even when the data are not otherwise linearly separable.
18
00:01:31,490 --> 00:01:35,939
Then, a separator is estimated for the data.
19
00:01:35,939 --> 00:01:42,430
The data should be transformed in such a way that a separator could be drawn as a hyperplane.
20
00:01:42,430 --> 00:01:47,759
For example, consider the following figure, which shows the distribution of a small set
21
00:01:47,759 --> 00:01:53,030
of cells, based only on their Unit Size and Clump Thickness.
22
00:01:53,030 --> 00:01:58,010
As you can see, the data points fall into two different categories.
23
00:01:58,010 --> 00:02:02,320
This represents a linearly non-separable dataset.
24
00:02:02,320 --> 00:02:07,150
The two categories can be separated with a curve, but not a line.
25
00:02:07,150 --> 00:02:13,890
That is, it represents a linearly non-separable dataset, which is the case for most real-world
26
00:02:13,890 --> 00:02:15,400
datasets.
27
00:02:15,400 --> 00:02:22,069
We can transform this data to a higher-dimensional space … for example, by mapping it to a 3-dimensional
28
00:02:22,069 --> 00:02:23,330
space.
29
00:02:23,330 --> 00:02:29,700
After the transformation, the boundary between the two categories can be defined by a hyperplane.
30
00:02:29,700 --> 00:02:35,069
As we are now in 3-dimensional space, the separator is shown as a plane.
31
00:02:35,069 --> 00:02:40,410
This plane can be used to classify new or unknown cases.
32
00:02:40,410 --> 00:02:49,170
Therefore, the SVM algorithm outputs an optimal hyperplane that categorizes new examples.
33
00:02:49,170 --> 00:02:53,510
Now, there are two challenging questions to consider:
34
00:02:53,510 --> 00:03:00,989
1) How do we transform data in such a way that a separator can be drawn as a hyperplane?
35
00:03:00,989 --> 00:03:05,150
and 2) How can we find the best/optimized hyperplane
36
00:03:05,150 --> 00:03:08,260
separator after transformation?
37
00:03:08,260 --> 00:03:13,280
Let’s first look at “transforming data” to see how it works.
38
00:03:13,280 --> 00:03:18,950
For the sake of simplicity, imagine that our dataset is 1-dimensional; this means
39
00:03:18,950 --> 00:03:21,310
we have only one feature x.
40
00:03:21,310 --> 00:03:24,720
As you can see, it is not linearly separable.
41
00:03:24,720 --> 00:03:27,310
So, what can we do here?
42
00:03:27,310 --> 00:03:31,620
Well, we can transform it into a 2-dimensional space.
43
00:03:31,620 --> 00:03:37,920
For example, you can increase the dimension of the data by mapping x into a new space using
44
00:03:37,920 --> 00:03:42,883
a function with outputs x and x-squared.
45
00:03:42,883 --> 00:03:46,230
Now, the data is linearly separable, right?
46
00:03:46,230 --> 00:03:51,879
Notice that, as we are in a two-dimensional space, the hyperplane is a line dividing a
47
00:03:51,879 --> 00:03:56,640
plane into two parts, where each class lies on either side.
48
00:03:56,640 --> 00:04:00,790
Now we can use this line to classify new cases.
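Here is a tiny sketch of this 1-D example in Python; the data values are made up for illustration. Points that overlap on the x-axis become linearly separable once each x is mapped to the pair (x, x-squared).

import numpy as np
from sklearn.svm import SVC

x = np.array([-3, -2, 2, 3, -0.5, 0, 0.5, 1])   # one feature; outer points vs. inner points
y = np.array([ 1,  1, 1, 1,  0,   0, 0,   0])   # class labels

X2 = np.column_stack([x, x ** 2])                # map each x to (x, x^2)

clf = SVC(kernel="linear")                       # a straight line separates them in the new space
clf.fit(X2, y)
print(clf.predict(np.column_stack([[2.5], [2.5 ** 2]])))   # -> [1]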
49
00:04:00,790 --> 00:04:06,829
Basically, mapping data into a higher dimensional space is called kernelling.
50
00:04:06,829 --> 00:04:12,730
The mathematical function used for the transformation is known as the kernel function, and can be
51
00:04:12,730 --> 00:04:20,370
of different types, such as: Linear, Polynomial, Radial basis function (or RBF), and Sigmoid.
52
00:04:20,370 --> 00:04:27,790
Each of these functions has its own characteristics, its pros and cons, and its equation, but the
53
00:04:27,790 --> 00:04:32,410
good news is that you don’t need to know them, as most of them are already
54
00:04:32,410 --> 00:04:37,449
implemented in libraries of data science programming languages.
55
00:04:37,449 --> 00:04:44,010
Also, as there's no easy way of knowing which function performs best with any given dataset,
56
00:04:44,010 --> 00:04:48,410
we usually choose different functions in turn and compare the results.
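Because these kernels are already implemented in scikit-learn, the "try each kernel and compare" approach can be sketched as follows; X and y are assumed to be the feature matrix and labels prepared earlier.

from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

for kernel in ["linear", "poly", "rbf", "sigmoid"]:
    clf = SVC(kernel=kernel)
    scores = cross_val_score(clf, X, y, cv=5)    # 5-fold cross-validation accuracy
    print(kernel, scores.mean())                 # keep the kernel that performs best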
57
00:04:48,410 --> 00:04:55,880
Now, we get to another question, specifically, “How do we find the right or optimized separator
58
00:04:55,880 --> 00:04:58,110
after transformation?”
59
00:04:58,110 --> 00:05:05,780
Basically, SVMs are based on the idea of finding a hyperplane that best divides a dataset into
60
00:05:05,780 --> 00:05:09,169
two classes, as shown here.
61
00:05:09,169 --> 00:05:14,940
As we’re in a 2-dimensional space, you can think of the hyperplane as a line that linearly
62
00:05:14,940 --> 00:05:18,880
separates the blue points from the red points.
63
00:05:18,880 --> 00:05:24,780
One reasonable choice as the best hyperplane is the one that represents the largest separation,
64
00:05:24,780 --> 00:05:27,259
or margin, between the two classes.
65
00:05:27,259 --> 00:05:34,970
So, the goal is to choose a hyperplane with as big a margin as possible.
66
00:05:34,970 --> 00:05:39,220
Examples closest to the hyperplane are support vectors.
67
00:05:39,220 --> 00:05:45,100
It is intuitive that only support vectors matter for achieving our goal; and thus, other
68
00:05:45,100 --> 00:05:48,030
training examples can be ignored.
69
00:05:48,030 --> 00:05:53,560
We try to find the hyperplane in such a way that it has the maximum distance to support
70
00:05:53,560 --> 00:05:54,560
vectors.
71
00:05:54,560 --> 00:06:01,009
Please note that the hyperplane and the decision boundary lines have their own equations.
72
00:06:01,009 --> 00:06:07,450
So, finding the optimized hyperplane can be formalized using an equation which involves
73
00:06:07,450 --> 00:06:13,169
quite a bit more math, so I’m not going to go through it here in detail.
74
00:06:13,169 --> 00:06:18,789
That said, the hyperplane is learned from training data using an optimization procedure
75
00:06:18,789 --> 00:06:25,870
that maximizes the margin; and like many other problems, this optimization problem can also
76
00:06:25,870 --> 00:06:30,759
be solved by gradient descent, which is out of the scope of this video.
77
00:06:30,759 --> 00:06:37,790
Therefore, the output of the algorithm is the values ‘w’ and ‘b’ for the line.
78
00:06:37,790 --> 00:06:42,289
You can make classifications using this estimated line.
79
00:06:42,289 --> 00:06:48,199
It is enough to plug the input values into the line equation; then, you can calculate
80
00:06:48,199 --> 00:06:52,620
whether an unknown point is above or below the line.
81
00:06:52,620 --> 00:06:58,410
If the equation returns a value greater than 0, then the point belongs to the first class,
82
00:06:58,410 --> 00:07:02,130
which is above the line, and vice versa.
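With a linear kernel, scikit-learn exposes w as clf.coef_ and b as clf.intercept_, so this decision rule can be sketched as below; clf is assumed to be an already fitted SVC(kernel="linear") on two features, and the new point is made up.

import numpy as np

w = clf.coef_[0]                      # learned weights [w1, w2]
b = clf.intercept_[0]                 # learned intercept b
new_point = np.array([4.0, 2.0])      # hypothetical unknown sample

score = np.dot(w, new_point) + b      # plug the point into the line equation
label = clf.classes_[1] if score > 0 else clf.classes_[0]   # positive side vs. negative side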
83
00:07:02,130 --> 00:07:07,099
The two main advantages of support vector machines are that they’re accurate in high
84
00:07:07,099 --> 00:07:12,900
dimensional spaces; and they use a subset of training points in the decision function
85
00:07:12,900 --> 00:07:18,169
(called support vectors), so they’re also memory efficient.
86
00:07:18,169 --> 00:07:23,850
The disadvantages of support vector machines include the fact that the algorithm is prone
87
00:07:23,850 --> 00:07:30,350
to over-fitting if the number of features is much greater than the number of samples.
88
00:07:30,350 --> 00:07:38,710
Also, SVMs do not directly provide probability estimates, which are desirable in most classification
89
00:07:38,710 --> 00:07:40,199
problems.
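As a side note, some libraries can layer probability estimates on top of an SVM; for example, scikit-learn’s SVC accepts probability=True, which fits an additional internal calibration step. A minimal sketch, assuming X_train and y_train are already prepared:

from sklearn.svm import SVC

clf = SVC(kernel="rbf", probability=True)   # enables calibrated probability outputs
clf.fit(X_train, y_train)
proba = clf.predict_proba(X_train[:5])      # per-class probability estimates for five samples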
90
00:07:40,199 --> 00:07:47,580
And finally, SVMs are not very computationally efficient if your dataset is very big, such as when
91
00:07:47,580 --> 00:07:51,110
you have more than one thousand rows.
92
00:07:51,110 --> 00:07:56,620
And now, our final question is, “In which situation should I use SVM?”
93
00:07:56,620 --> 00:08:04,340
Well, SVM is good for image analysis tasks, such as image classification and handwritten
94
00:08:04,340 --> 00:08:06,940
digit recognition.
95
00:08:06,940 --> 00:08:13,050
Also, SVM is very effective in text-mining tasks, particularly because it deals well
96
00:08:13,050 --> 00:08:16,160
with high-dimensional data.
97
00:08:16,160 --> 00:08:24,390
For example, it is used for spam detection, text category assignment, and sentiment analysis.
98
00:08:24,390 --> 00:08:30,509
Another application of SVM is in Gene Expression data classification, again, because of its
99
00:08:30,509 --> 00:08:33,580
power in high-dimensional data classification.
100
00:08:33,580 --> 00:08:40,630
SVM can also be used for other types of machine learning problems, such as regression,
101
00:08:40,630 --> 00:08:42,970
outlier detection, and clustering.
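For completeness, here is a hedged sketch of the related estimators in scikit-learn: SVR for regression and OneClassSVM for outlier detection (support-vector clustering is not sketched here). X and y are assumed to be a prepared feature matrix and numeric targets.

from sklearn.svm import SVR, OneClassSVM

reg = SVR(kernel="rbf").fit(X, y)           # support vector regression
detector = OneClassSVM(nu=0.05).fit(X)      # nu roughly bounds the fraction flagged as outliers
flags = detector.predict(X)                 # +1 = inlier, -1 = outlier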
102
00:08:42,970 --> 00:08:48,570
I’ll leave it to you to explore more about these particular problems.
103
00:08:48,570 --> 00:08:51,480
This concludes this video … Thanks for watching!