0
00:00:01,000 --> 00:00:05,080
Hello, and welcome! In this video we will learn the difference
1
00:00:05,080 --> 00:00:11,410
between linear regression and logistic regression. We go over linear regression and see why it
2
00:00:11,410 --> 00:00:16,080
cannot be used properly for some binary classification problems.
3
00:00:16,079 --> 00:00:20,830
We also look at the sigmoid function, which is the main part of logistic regression.
4
00:00:20,830 --> 00:00:22,680
Let’s start.
5
00:00:22,680 --> 00:00:27,060
Let’s look at the telecommunication dataset again.
6
00:00:27,060 --> 00:00:32,250
The goal of logistic regression is to build a model to predict the class of each customer,
7
00:00:32,250 --> 00:00:36,550
and also the probability of each sample belonging to a class.
8
00:00:36,550 --> 00:00:42,309
Ideally, we want to build a model, y^, that can estimate the probability that the class of a customer
9
00:00:42,309 --> 00:00:49,110
is 1, given its features, x. I want to emphasize that y is the “labels
10
00:00:49,110 --> 00:00:55,070
vector,” also called “actual values” that we would like to predict, and y^ is the
11
00:00:55,070 --> 00:01:01,920
vector of values predicted by our model. Mapping the class labels to integer numbers,
12
00:01:01,920 --> 00:01:05,140
can we use linear regression to solve this problem?
13
00:01:05,140 --> 00:01:11,920
First, let’s recall how linear regression works to better understand logistic regression.
14
00:01:11,920 --> 00:01:16,220
Forget about the churn prediction for a minute, and assume our goal is to predict the income
15
00:01:16,220 --> 00:01:21,330
of customers in the dataset. This means that instead of predicting churn,
16
00:01:21,330 --> 00:01:26,700
which is a categorical value, let’s predict income, which is a continuous value.
17
00:01:26,700 --> 00:01:31,409
So, how can we do this? Let’s select an independent variable, such
18
00:01:31,409 --> 00:01:36,500
as customer age, and predict a dependent variable, such as income.
19
00:01:36,500 --> 00:01:40,990
Of course we can have more features, but for the sake of simplicity, let’s just take
20
00:01:40,990 --> 00:01:45,310
one feature here. We can plot it, and show age as an independent
21
00:01:45,310 --> 00:01:49,790
variable, and income as the target value we would like to predict.
22
00:01:49,790 --> 00:01:55,000
With linear regression, you can fit a line or polynomial through the data.
23
00:01:55,000 --> 00:02:00,299
We can find this line through the training of our model, or calculating it mathematically
24
00:02:00,299 --> 00:02:04,409
based on the sample sets. We’ll say this is a straight line through
25
00:02:04,409 --> 00:02:11,009
the sample set. This line has an equation shown as a + b x_1.
26
00:02:11,008 --> 00:02:18,040
Now, use this line to predict the continuous value y, that is, use this line to predict
27
00:02:18,040 --> 00:02:22,930
the income of an unknown customer based on his or her age.
28
00:02:22,930 --> 00:02:25,290
And it is done.
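(As an aside, not part of the video: a minimal sketch of this income prediction in Python might look like the following, with made-up ages and incomes standing in for the real dataset.)

```python
import numpy as np

# Made-up sample data: customer ages (x1) and incomes (the target y).
ages = np.array([22.0, 30.0, 35.0, 42.0, 50.0, 58.0])
incomes = np.array([28.0, 40.0, 48.0, 61.0, 70.0, 75.0])  # e.g. in $1000s

# Fit a straight line income = a + b * x1 by least squares.
b, a = np.polyfit(ages, incomes, deg=1)

# Use the fitted line to predict the income of an unseen customer.
new_age = 40.0
print(f"income ~ {a:.2f} + {b:.2f} * age = {a + b * new_age:.2f}")
```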
29
00:02:25,290 --> 00:02:29,849
What if we want to predict churn? Can we use the same technique to predict a
30
00:02:29,849 --> 00:02:34,200
categorical field such as churn? OK, let’s see.
31
00:02:34,200 --> 00:02:38,680
Say we’re given data on customer churn and our goal this time is to predict the churn
32
00:02:38,680 --> 00:02:45,140
of customers based on their age. We have a feature, age, denoted as x_1, and
33
00:02:45,140 --> 00:02:52,269
a categorical feature, churn, with two classes: churn is yes and churn is no.
34
00:02:52,269 --> 00:02:57,980
As mentioned, we can map yes and no to integer values, 0 and 1.
35
00:02:57,980 --> 00:03:02,819
How can we model it now? Well, graphically, we could represent our
36
00:03:02,819 --> 00:03:07,260
data with a scatter plot. But this time we have only 2 values for the
37
00:03:07,260 --> 00:03:12,200
y axis. In this plot, class zero is denoted in red
38
00:03:12,200 --> 00:03:18,040
and class one is denoted in blue. Our goal here is to make a model based on
39
00:03:18,040 --> 00:03:22,920
existing data, to predict if a new customer is red or blue.
40
00:03:22,920 --> 00:03:28,090
Let’s do the same technique that we used for linear regression here to see if we can
41
00:03:28,090 --> 00:03:33,290
solve the problem for a categorical attribute, such as churn.
42
00:03:33,290 --> 00:03:39,290
With linear regression, you again can fit a polynomial through the data, which is shown
43
00:03:39,290 --> 00:03:47,230
traditionally as a + b x_1. This polynomial can also be shown equivalently
44
00:03:47,230 --> 00:03:54,290
as θ_0 + θ_1 x_1. This line has 2 parameters, which are shown
45
00:03:54,290 --> 00:04:00,640
with vector θ, where the values of the vector are θ_0 and θ_1.
46
00:04:00,640 --> 00:04:06,870
We can also show the equation of this line formally as θ^T X.
47
00:04:06,870 --> 00:04:13,819
And generally, we can show the equation for a multi-dimensional space as θ^T X, where
48
00:04:13,819 --> 00:04:20,910
θ is the vector of parameters of the line in 2-dimensional space, or of a plane in 3-dimensional
49
00:04:20,910 --> 00:04:26,450
space, and so on. As θ is a vector of parameters, and is supposed
50
00:04:26,450 --> 00:04:32,150
to be multiplied by X, it is shown conventionally as θ^T X.
51
00:04:32,150 --> 00:04:38,961
θ is also called the “weight vector” or “coefficients of the equation” -- with
52
00:04:38,961 --> 00:04:45,120
both of these terms used interchangeably. And X is the feature set, which represents
53
00:04:45,120 --> 00:04:50,520
a customer. Anyway, given a dataset with all the feature sets
54
00:04:50,520 --> 00:04:57,870
X, the θ parameters can be calculated through an optimization algorithm or mathematically,
55
00:04:57,870 --> 00:05:01,199
which results in the equation of the fitting line.
56
00:05:01,199 --> 00:05:09,740
For example, the parameters of this line are -1 and 0.1, and the equation for the line is
57
00:05:09,740 --> 00:05:14,470
-1 + 0.1 x_1.
58
00:05:14,470 --> 00:05:20,130
Now, we can use this regression line to predict the churn of the new customer.
59
00:05:20,130 --> 00:05:27,680
For example, for our customer, or let’s say a data point with an x value of age=13, we
60
00:05:27,680 --> 00:05:33,400
can plug the value into the line formula; the y value is calculated and returns
61
00:05:38,140 --> 00:05:51,330
a number. For instance, for point p1 we have:
62
00:05:38,140 --> 00:05:51,330
θ^T X = -1 + 0.1 × x_1 = -1 + 0.1 × 13
63
00:05:51,330 --> 00:05:54,530
= 0.3. We can show it on our graph.
64
00:05:54,530 --> 00:06:01,730
Now we can define a threshold here, for example at 0.5, to define the class.
65
00:06:01,730 --> 00:06:06,030
So, we write a rule here for our model, y^,
66
00:06:06,030 --> 00:06:09,440
which allows us to separate class 0 from class 1.
67
00:06:09,440 --> 00:06:17,840
If the value of θ^T X is less than 0.5, then the class is 0, otherwise, if the value of
68
00:06:17,840 --> 00:06:23,810
θ^T X is more than 0.5, then the class is 1.
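(To make this rule concrete, here is a small illustrative sketch, my own rather than code from the video, that applies the example line -1 + 0.1 x_1 and the 0.5 threshold.)

```python
import numpy as np

theta = np.array([-1.0, 0.1])  # [theta_0, theta_1] from the example line

def predict_class(age, threshold=0.5):
    x = np.array([1.0, age])       # X = [1, x1]; the leading 1 multiplies theta_0
    y = theta @ x                  # theta^T X
    return y, int(y >= threshold)  # the threshold acts as a step function

y, cls = predict_class(13)
print(y, cls)  # -1 + 0.1 * 13 = 0.3 -> below 0.5, so class 0
```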
69
00:06:23,810 --> 00:06:28,750
And because our customer’s y value is less than the threshold, we can say it belongs
70
00:06:28,750 --> 00:06:34,650
to class 0, based on our model. But there is one problem here; what is the
71
00:06:34,650 --> 00:06:38,620
probability that this customer belongs to class 0?
72
00:06:38,620 --> 00:06:42,600
As you can see, it’s not the best model to solve this problem.
73
00:06:42,600 --> 00:06:48,840
Also, there are some other issues which confirm that linear regression is not the proper method
74
00:06:48,840 --> 00:06:51,419
for classification problems.
75
00:06:51,419 --> 00:06:57,060
So, as mentioned, if we use the regression line to calculate the class of a point, it
76
00:06:57,060 --> 00:07:02,740
always returns a number, such as 3, or -2, and so on.
77
00:07:02,740 --> 00:07:09,330
Then, we should use a threshold, for example, 0.5, to assign that point to either class
78
00:07:09,330 --> 00:07:14,789
0 or 1. This threshold works as a step function that
79
00:07:14,789 --> 00:07:22,590
outputs 0 or 1, regardless of how big or small, positive or negative, the input is.
80
00:07:22,590 --> 00:07:27,900
So, using the threshold, we can find the class of a record.
81
00:07:27,900 --> 00:07:32,970
Notice that in the step function, no matter how big the value is, as long as it’s greater
82
00:07:32,970 --> 00:07:41,930
than 0.5, it simply equals 1. And vice versa, regardless of how small the value y is, the
83
00:07:41,930 --> 00:07:48,569
output would be zero if it is less than 0.5. In other words, there is no difference between
84
00:07:48,569 --> 00:07:55,270
a customer who has a value of one or 1000; the outcome would be 1.
85
00:07:55,270 --> 00:08:00,430
Instead of having this step function, wouldn’t it be nice if we had a smoother line - one
86
00:08:00,430 --> 00:08:04,000
that would project these values between zero and one?
87
00:08:04,000 --> 00:08:09,410
Indeed, the existing method does not really give us the probability of a customer belonging
88
00:08:09,410 --> 00:08:15,099
to a class, which is very desirable. We need a method that can give us the probability
89
00:08:15,099 --> 00:08:21,059
of falling in a class as well. So, what is the scientific solution here?
90
00:08:21,059 --> 00:08:25,460
Well, if instead of using θ^T X we use a
91
00:08:25,460 --> 00:08:33,349
specific function called sigmoid, then, sigmoid of θ^T X gives us the probability of a point
92
00:08:33,349 --> 00:08:37,790
belonging to a class, instead of the value of y directly.
93
00:08:37,789 --> 00:08:43,300
I’ll explain this sigmoid function in a second, but for now, please accept that it
94
00:08:43,299 --> 00:08:48,700
will do the trick. Instead of calculating the value of θ^T X
95
00:08:48,700 --> 00:08:56,640
directly, it maps that value to a probability, no matter how big or small θ^T X is.
96
00:08:56,640 --> 00:09:03,720
It always returns a value between 0 and 1 depending on how large the θ^T X actually
97
00:09:03,720 --> 00:09:09,800
is. Now, our model is σ(θ^T X), which represents
98
00:09:09,800 --> 00:09:13,870
the probability that the output is 1, given x.
99
00:09:13,870 --> 00:09:18,870
Now the question is, “What is the sigmoid function?”
100
00:09:18,870 --> 00:09:22,780
Let me explain in detail what sigmoid really is.
101
00:09:22,780 --> 00:09:28,550
The sigmoid function, also called the logistic function, resembles the step function and
102
00:09:28,550 --> 00:09:33,350
is given by the following expression in logistic regression:
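(For reference, the expression shown on screen at this point is the standard logistic function, reconstructed here to match the description that follows:)

σ(θ^T X) = 1 / (1 + e^(-θ^T X))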
103
00:09:33,350 --> 00:09:38,500
The sigmoid function looks a bit complicated at first, but don’t worry about remembering
104
00:09:38,500 --> 00:09:41,440
this equation. It’ll make sense to you after working with it.
105
00:09:41,440 --> 00:09:45,320
Notice that in the sigmoid equation, when
106
00:09:45,320 --> 00:09:53,810
θ^T X gets very big, the e^(-θ^T X) term in the denominator of the fraction becomes
107
00:09:53,810 --> 00:09:59,279
almost zero, and the value of the sigmoid function gets closer to 1.
108
00:09:59,279 --> 00:10:06,390
If θ^T X is very small, the sigmoid function gets closer to zero.
109
00:10:06,390 --> 00:10:13,171
As depicted in the sigmoid plot, when θ^T X gets bigger, the value of the sigmoid
110
00:10:13,171 --> 00:10:20,540
function gets closer to 1, and also, if the θ^T X is very small, the sigmoid function
111
00:10:20,540 --> 00:10:26,000
gets closer to zero. So, the sigmoid function’s output is always
112
00:10:26,000 --> 00:10:32,540
between 0 and 1, which makes it appropriate to interpret the results as probabilities.
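(As a quick illustration, not from the video, this squashing behavior is easy to check numerically with a few lines of Python.)

```python
import numpy as np

def sigmoid(z):
    # The logistic function 1 / (1 + e^(-z)); z plays the role of theta^T X.
    return 1.0 / (1.0 + np.exp(-z))

# Very negative inputs approach 0, very positive ones approach 1,
# and the output always stays strictly between 0 and 1.
for z in [-10.0, -2.0, 0.0, 2.0, 10.0]:
    print(z, sigmoid(z))
```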
113
00:10:32,540 --> 00:10:37,860
It is obvious that when the outcome of the sigmoid function gets closer to 1, the probability
114
00:10:37,860 --> 00:10:46,070
of y=1, given x, goes up, and in contrast, when the sigmoid value is closer to zero,
115
00:10:46,070 --> 00:10:51,510
the probability of y=1, given x, is very small.
116
00:10:51,510 --> 00:10:56,230
So what is the output of our model when we use the sigmoid function?
117
00:10:56,230 --> 00:11:02,740
In logistic regression, we model the probability that an input (X) belongs to the default class
118
00:11:02,740 --> 00:11:10,620
(Y=1), and we can write this formally as P(Y=1|X).
119
00:11:10,620 --> 00:11:23,790
We can also write P(y=0|X) = 1 - P(y=1|X). For example, the probability of a customer
120
00:11:23,790 --> 00:11:30,480
churning (that is, leaving the company) can be shown as the probability of churn equals 1, given the customer’s income
121
00:11:30,480 --> 00:11:37,660
and age, which can be, for instance, 0.8. And the probability of churn being 0, for the
122
00:11:37,660 --> 00:11:44,540
same customer, given the customer’s income and age, can be calculated as 1 - 0.8 = 0.2.
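(In code, this complement rule is a one-liner; the 0.8 below is just the figure from the example.)

```python
p_churn = 0.8           # P(churn = 1 | income, age), the figure from the example
p_stay = 1.0 - p_churn  # P(churn = 0 | income, age) = 1 - 0.8 = 0.2
print(p_churn, p_stay)
```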
123
00:11:44,540 --> 00:11:53,410
So, now, our job is to train the model to set its parameter values in such a way that
124
00:11:53,410 --> 00:12:02,370
our model is a good estimate of P(y=1∣x). In fact, this is what a good classifier model
125
00:12:02,370 --> 00:12:06,580
built by logistic regression is supposed to do for us.
126
00:12:06,580 --> 00:12:15,350
Also, it should be a good estimate of P(y=0∣x), which can be shown as 1 - σ(θ^T X).
127
00:12:15,350 --> 00:12:22,029
Now, the question is: "How can we achieve this?"
128
00:12:22,029 --> 00:12:27,670
We can find 𝜃 through the training process, so let’s see what the training process is.
129
00:12:27,670 --> 00:12:35,110
Step 1. Initialize the 𝜃 vector with random values, as with most machine learning algorithms,
130
00:12:35,110 --> 00:12:38,680
for example −1 or 2.
131
00:12:38,680 --> 00:12:45,410
Step 2. Calculate the model output, which is σ(θ^T X), for a sample customer in your
132
00:12:45,410 --> 00:12:51,450
training set. X in θ^T X is the vector of feature values -- for
133
00:12:51,450 --> 00:12:55,760
example, the age and income of the customer, for instance [2,5].
134
00:12:55,760 --> 00:13:01,910
And θ is the coefficient or weight vector that you set in the previous step.
135
00:13:01,910 --> 00:13:07,010
The output of this equation is the prediction value … in other words, the probability
136
00:13:07,010 --> 00:13:10,480
that the customer belongs to class 1.
137
00:13:10,480 --> 00:13:18,200
Step 3. Compare the output of our model, y^, which could be a value of, let’s say, 0.7,
138
00:13:18,200 --> 00:13:23,690
with the actual label of the customer, which is for example, 1 for churn.
139
00:13:23,690 --> 00:13:31,230
Then, record the difference as our model’s error for this customer, which would be 1-0.7,
140
00:13:31,230 --> 00:13:38,620
which of course equals 0.3. This is the error for only one customer out of all the customers
141
00:13:38,620 --> 00:13:40,600
in the training set.
142
00:13:40,600 --> 00:13:47,500
Step 4. Calculate the error for all customers as we did in the previous steps, and add up
143
00:13:47,500 --> 00:13:51,890
these errors. The total error is the cost of your model,
144
00:13:51,890 --> 00:13:58,550
and is calculated by the model’s cost function. The cost function, by the way, basically represents
145
00:13:58,550 --> 00:14:03,660
how to calculate the error of the model, which is the difference between the actual and the
146
00:14:03,660 --> 00:14:09,350
model’s predicted values. So, the cost shows how poorly the model is
147
00:14:09,350 --> 00:14:14,540
estimating the customer’s labels. Therefore, the lower the cost, the better
148
00:14:14,540 --> 00:14:18,930
the model is at estimating the customer’s labels correctly.
149
00:14:18,930 --> 00:14:23,930
And so, what we want to do is to try to minimize this cost.
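(Here is a minimal sketch of this cost calculation, using the simple per-customer difference the narration describes; real logistic regression implementations usually use the log-loss cost instead.)

```python
import numpy as np

def total_cost(y_actual, y_hat):
    # Per-customer error as described: |actual label - predicted probability|,
    # added up over every customer in the training set.
    return np.sum(np.abs(y_actual - y_hat))

y_actual = np.array([1.0, 0.0, 1.0])  # actual labels
y_hat = np.array([0.7, 0.2, 0.9])     # model outputs sigma(theta^T X)
print(total_cost(y_actual, y_hat))    # 0.3 + 0.2 + 0.1 = 0.6
```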
150
00:14:23,930 --> 00:14:29,910
Step 5. But, because the initial values for θ were chosen randomly, it’s very likely
151
00:14:29,910 --> 00:14:35,350
that the cost is very high. So, we change 𝜃 in such a way as to hopefully
152
00:14:35,350 --> 00:14:37,470
reduce the total cost.
153
00:14:37,470 --> 00:14:44,190
Step 6. After changing the values of θ, we go back to step 2.
154
00:14:44,190 --> 00:14:49,199
Then we start another iteration, and calculate the cost of the model again.
155
00:14:49,199 --> 00:14:54,699
And we keep doing those steps over and over, changing the values of θ each time, until
156
00:14:54,699 --> 00:15:00,810
the cost is low enough. So, this brings up two questions: first, "How
157
00:15:00,810 --> 00:15:06,060
can we change the values of θ so that the cost is reduced across iterations?"
158
00:15:06,060 --> 00:15:11,910
And second, "When should we stop the iterations?" There are different ways to change the values
159
00:15:11,910 --> 00:15:16,160
of θ, but one of the most popular ways is gradient descent.
160
00:15:16,160 --> 00:15:23,760
Also, there are various ways to stop the iterations, but essentially you calculate
161
00:15:23,760 --> 00:15:28,470
the accuracy of your model and stop training when it’s satisfactory.
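(Putting steps 1 through 6 together, here is a compact, self-contained sketch of the training loop with gradient descent. The data, learning rate, and iteration count are made up for illustration, and it uses the standard log-loss cost rather than the simple difference described earlier; production code would typically rely on a library such as scikit-learn.)

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Made-up training set: columns are [1 (for theta_0), age, income].
X = np.array([[1.0, 2.0, 5.0],
              [1.0, 3.0, 2.0],
              [1.0, 6.0, 1.0],
              [1.0, 7.0, 3.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])  # actual churn labels

# Step 1: initialize theta with random values.
rng = np.random.default_rng(seed=0)
theta = rng.normal(size=X.shape[1])

learning_rate = 0.1
for iteration in range(1000):       # Step 6: keep iterating
    y_hat = sigmoid(X @ theta)      # Step 2: model output sigma(theta^T X)
    error = y_hat - y               # Step 3: difference from the actual labels
    # Step 4: log-loss cost, averaged over all customers.
    cost = -np.mean(y * np.log(y_hat) + (1.0 - y) * np.log(1.0 - y_hat))
    # Step 5: gradient descent step -- nudge theta to reduce the cost.
    theta -= learning_rate * (X.T @ error) / len(y)

print(theta, cost)  # final parameters and the cost at the last iteration
```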
162
00:15:28,470 --> 00:15:30,050
Thanks for watching this video!