0
00:00:01,000 --> 00:00:05,080
Hello, and welcome! In this video we will learn the difference
1
00:00:05,080 --> 00:00:11,410
between linear regression and logistic regression. We go over linear regression and see why it
2
00:00:11,410 --> 00:00:16,080
cannot be used properly for some binary classification problems.
3
00:00:16,079 --> 00:00:20,830
We also look at the sigmoid function, which is the main part of logistic regression.
4
00:00:20,830 --> 00:00:22,680
Let’s start.
5
00:00:22,680 --> 00:00:27,060
Let’s look at the telecommunication dataset again.
6
00:00:27,060 --> 00:00:32,250
The goal of logistic regression is to build a model to predict the class of each customer,
7
00:00:32,250 --> 00:00:36,550
and also the probability of each sample belonging to a class.
8
00:00:36,550 --> 00:00:42,309
Ideally, we want to build a model, y^, that can estimate the probability that the class of a customer
9
00:00:42,309 --> 00:00:49,110
is 1, given its features, x. I want to emphasize that y is the “labels
10
00:00:49,110 --> 00:00:55,070
vector,” also called “actual values” that we would like to predict, and y^ is the
11
00:00:55,070 --> 00:01:01,920
vector of values predicted by our model. Mapping the class labels to integer numbers,
12
00:01:01,920 --> 00:01:05,140
can we use linear regression to solve this problem?
13
00:01:05,140 --> 00:01:11,920
First, let’s recall how linear regression works to better understand logistic regression.
14
00:01:11,920 --> 00:01:16,220
Forget about the churn prediction for a minute, and assume our goal is to predict the income
15
00:01:16,220 --> 00:01:21,330
of customers in the dataset. This means that instead of predicting churn,
16
00:01:21,330 --> 00:01:26,700
which is a categorical value, let’s predict income, which is a continuous value.
17
00:01:26,700 --> 00:01:31,409
So, how can we do this? Let’s select an independent variable, such
18
00:01:31,409 --> 00:01:36,500
as customer age, and predict a dependent variable, such as income.
19
00:01:36,500 --> 00:01:40,990
Of course we can have more features, but for the sake of simplicity, let’s just take
20
00:01:40,990 --> 00:01:45,310
one feature here. We can plot it, and show age as an independent
21
00:01:45,310 --> 00:01:49,790
variable, and income as the target value we would like to predict.
22
00:01:49,790 --> 00:01:55,000
With linear regression, you can fit a line or polynomial through the data.
23
00:01:55,000 --> 00:02:00,299
We can find this line through the training of our model, or calculating it mathematically
24
00:02:00,299 --> 00:02:04,409
based on the sample sets. We’ll say this is a straight line through
25
00:02:04,409 --> 00:02:11,009
the sample set. This line has an equation shown as a + b x_1.
26
00:02:11,008 --> 00:02:18,040
Now, use this line to predict the continuous value y, that is, use this line to predict
27
00:02:18,040 --> 00:02:22,930
the income of an unknown customer based on his or her age.
28
00:02:22,930 --> 00:02:25,290
And it is done.
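(As an aside, not part of the video: a minimal sketch of this income prediction in Python might look like the following, with made-up ages and incomes standing in for the real dataset.)

```python
import numpy as np

# Made-up sample data: customer ages (x1) and incomes (the target y).
ages = np.array([22.0, 30.0, 35.0, 42.0, 50.0, 58.0])
incomes = np.array([28.0, 40.0, 48.0, 61.0, 70.0, 75.0])  # e.g. in $1000s

# Fit a straight line income = a + b * x1 by least squares.
b, a = np.polyfit(ages, incomes, deg=1)

# Use the fitted line to predict the income of an unseen customer.
new_age = 40.0
print(f"income ~ {a:.2f} + {b:.2f} * age = {a + b * new_age:.2f}")
```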
29
00:02:25,290 --> 00:02:29,849
What if we want to predict churn? Can we use the same technique to predict a
30
00:02:29,849 --> 00:02:34,200
categorical field such as churn? OK, let’s see.
31
00:02:34,200 --> 00:02:38,680
Say we’re given data on customer churn and our goal this time is to predict the churn
32
00:02:38,680 --> 00:02:45,140
of customers based on their age. We have a feature, age, denoted as x_1, and
33
00:02:45,140 --> 00:02:52,269
a categorical feature, churn, with two classes: churn is yes and churn is no.
34
00:02:52,269 --> 00:02:57,980
As mentioned, we can map yes and no to integer values, 0 and 1.
35
00:02:57,980 --> 00:03:02,819
How can we model it now? Well, graphically, we could represent our
36
00:03:02,819 --> 00:03:07,260
data with a scatter plot. But this time we have only 2 values for the
37
00:03:07,260 --> 00:03:12,200
y axis. In this plot, class zero is denoted in red
38
00:03:12,200 --> 00:03:18,040
and class one is denoted in blue. Our goal here is to make a model based on
39
00:03:18,040 --> 00:03:22,920
existing data, to predict if a new customer is red or blue.
40
00:03:22,920 --> 00:03:28,090
Let’s do the same technique that we used for linear regression here to see if we can
41
00:03:28,090 --> 00:03:33,290
solve the problem for a categorical attribute, such as churn.
42
00:03:33,290 --> 00:03:39,290
With linear regression, you again can fit a polynomial through the data, which is shown
43
00:03:39,290 --> 00:03:47,230
traditionally as a + b x_1. This polynomial can also be shown equivalently
44
00:03:47,230 --> 00:03:54,290
as θ_0 + θ_1 x_1. This line has 2 parameters, which are shown
45
00:03:54,290 --> 00:04:00,640
with vector θ, where the values of the vector are θ_0 and θ_1.
46
00:04:00,640 --> 00:04:06,870
We can also show the equation of this line formally as θ^T X.
47
00:04:06,870 --> 00:04:13,819
And generally, we can show the equation for a multi-dimensional space as θ^T X, where
48
00:04:13,819 --> 00:04:20,910
θ is the vector of parameters of the line in 2-dimensional space, or of a plane in 3-dimensional
49
00:04:20,910 --> 00:04:26,450
space, and so on. As θ is a vector of parameters, and is supposed
50
00:04:26,450 --> 00:04:32,150
to be multiplied by X, it is shown conventionally as θ^T X.
51
00:04:32,150 --> 00:04:38,961
θ is also called the “weight vector” or “coefficients of the equation” -- with
52
00:04:38,961 --> 00:04:45,120
both of these terms used interchangeably. And X is the feature set, which represents
53
00:04:45,120 --> 00:04:50,520
a customer. Anyway, given a dataset with all the feature sets
54
00:04:50,520 --> 00:04:57,870
X, the θ parameters can be calculated through an optimization algorithm or mathematically,
55
00:04:57,870 --> 00:05:01,199
which results in the equation of the fitting line.
56
00:05:01,199 --> 00:05:09,740
For example, the parameters of this line are -1 and 0.1, and the equation for the line is
57
00:05:09,740 --> 00:05:14,470
-1 + 0.1 x_1.
58
00:05:14,470 --> 00:05:20,130
Now, we can use this regression line to predict the churn of the new customer.
59
00:05:20,130 --> 00:05:27,680
For example, for our customer, or let’s say a data point with an x value of age=13, we
60
00:05:27,680 --> 00:05:33,400
can plug the value into the line formula; the y value is calculated and returns
61
00:05:38,140 --> 00:05:51,330
a number. For instance, for point p1 we have:
62
00:05:38,140 --> 00:05:51,330
θ^T X = -1 + 0.1 × x_1 = -1 + 0.1 × 13
63
00:05:51,330 --> 00:05:54,530
= 0.3. We can show it on our graph.
64
00:05:54,530 --> 00:06:01,730
Now we can define a threshold here, for example at 0.5, to define the class.
65
00:06:01,730 --> 00:06:06,030
So, we write a rule here for our model, y^,
66
00:06:06,030 --> 00:06:09,440
which allows us to separate class 0 from class 1.
67
00:06:09,440 --> 00:06:17,840
If the value of θ^T X is less than 0.5, then the class is 0, otherwise, if the value of
68
00:06:17,840 --> 00:06:23,810
θ^T X is more than 0.5, then the class is 1.
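(To make this rule concrete, here is a small illustrative sketch, my own rather than code from the video, that applies the example line -1 + 0.1 x_1 and the 0.5 threshold.)

```python
import numpy as np

theta = np.array([-1.0, 0.1])  # [theta_0, theta_1] from the example line

def predict_class(age, threshold=0.5):
    x = np.array([1.0, age])       # X = [1, x1]; the leading 1 multiplies theta_0
    y = theta @ x                  # theta^T X
    return y, int(y >= threshold)  # the threshold acts as a step function

y, cls = predict_class(13)
print(y, cls)  # -1 + 0.1 * 13 = 0.3 -> below 0.5, so class 0
```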
69
00:06:23,810 --> 00:06:28,750
And because our customer’s y value is less than the threshold, we can say it belongs
70
00:06:28,750 --> 00:06:34,650
to class 0, based on our model. But there is one problem here; what is the
71
00:06:34,650 --> 00:06:38,620
probability that this customer belongs to class 0?
72
00:06:38,620 --> 00:06:42,600
As you can see, it’s not the best model to solve this problem.
73
00:06:42,600 --> 00:06:48,840
Also, there are some other issues which confirm that linear regression is not the proper method
74
00:06:48,840 --> 00:06:51,419
for classification problems.
75
00:06:51,419 --> 00:06:57,060
So, as mentioned, if we use the regression line to calculate the class of a point, it
76
00:06:57,060 --> 00:07:02,740
always returns a number, such as 3, or -2, and so on.
77
00:07:02,740 --> 00:07:09,330
Then, we should use a threshold, for example, 0.5, to assign that point to either class
78
00:07:09,330 --> 00:07:14,789
0 or 1. This threshold works as a step function that
79
00:07:14,789 --> 00:07:22,590
outputs 0 or 1, regardless of how big or small, positive or negative, the input is.
80
00:07:22,590 --> 00:07:27,900
So, using the threshold, we can find the class of a record.
81
00:07:27,900 --> 00:07:32,970
Notice that in the step function, no matter how big the value is, as long as it’s greater
82
00:07:32,970 --> 00:07:41,930
than 0.5, it simply equals 1. And vice versa, regardless of how small the value y is, the
83
00:07:41,930 --> 00:07:48,569
output would be zero if it is less than 0.5. In other words, there is no difference between
84
00:07:48,569 --> 00:07:55,270
a customer who has a value of one or 1000; the outcome would be 1.
85
00:07:55,270 --> 00:08:00,430
Instead of having this step function, wouldn’t it be nice if we had a smoother line - one
86
00:08:00,430 --> 00:08:04,000
that would project these values between zero and one?
87
00:08:04,000 --> 00:08:09,410
Indeed, the existing method does not really give us the probability of a customer belonging
88
00:08:09,410 --> 00:08:15,099
to a class, which is very desirable. We need a method that can give us the probability
89
00:08:15,099 --> 00:08:21,059
of falling in a class as well. So, what is the scientific solution here?
90
00:08:21,059 --> 00:08:25,460
Well, if instead of using θ^T X we use a
91
00:08:25,460 --> 00:08:33,349
specific function called sigmoid, then, sigmoid of θ^T X gives us the probability of a point
92
00:08:33,349 --> 00:08:37,790
belonging to a class, instead of the value of y directly.
93
00:08:37,789 --> 00:08:43,300
I’ll explain this sigmoid function in a second, but for now, please accept that it
94
00:08:43,299 --> 00:08:48,700
will do the trick. Instead of calculating the value of θ^T X
95
00:08:48,700 --> 00:08:56,640
directly, it maps that value to a probability, no matter how big or small θ^T X is.
96
00:08:56,640 --> 00:09:03,720
It always returns a value between 0 and 1 depending on how large the θ^T X actually
97
00:09:03,720 --> 00:09:09,800
is. Now, our model is σ(θ^T X), which represents
98
00:09:09,800 --> 00:09:13,870
the probability that the output is 1, given x.
99
00:09:13,870 --> 00:09:18,870
Now the question is, “What is the sigmoid function?”
100
00:09:18,870 --> 00:09:22,780
Let me explain in detail what sigmoid really is.
101
00:09:22,780 --> 00:09:28,550
The sigmoid function, also called the logistic function, resembles the step function and
102
00:09:28,550 --> 00:09:33,350
is given by the following expression in logistic regression:
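(For reference, the expression shown on screen at this point is the standard logistic function, reconstructed here to match the description that follows:)

σ(θ^T X) = 1 / (1 + e^(-θ^T X))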
103
00:09:33,350 --> 00:09:38,500
The sigmoid function looks a bit complicated at first, but don’t worry about remembering
104
00:09:38,500 --> 00:09:41,440
this equation. It’ll make sense to you after working with it.
105
00:09:41,440 --> 00:09:45,320
Notice that in the sigmoid equation, when
106
00:09:45,320 --> 00:09:53,810
θ^T X gets very big, the e^(-θ^T X) term in the denominator of the fraction becomes
107
00:09:53,810 --> 00:09:59,279
almost zero, and the value of the sigmoid function gets closer to 1.
108
00:09:59,279 --> 00:10:06,390
If θ^T X is very small, the sigmoid function gets closer to zero.
109
00:10:06,390 --> 00:10:13,171
As depicted in the sigmoid plot, when θ^T X gets bigger, the value of the sigmoid
110
00:10:13,171 --> 00:10:20,540
function gets closer to 1, and also, if the θ^T X is very small, the sigmoid function
111
00:10:20,540 --> 00:10:26,000
gets closer to zero. So, the sigmoid function’s output is always
112
00:10:26,000 --> 00:10:32,540
between 0 and 1, which makes it appropriate to interpret the results as probabilities.
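(As a quick illustration, not from the video, this squashing behavior is easy to check numerically with a few lines of Python.)

```python
import numpy as np

def sigmoid(z):
    # The logistic function 1 / (1 + e^(-z)); z plays the role of theta^T X.
    return 1.0 / (1.0 + np.exp(-z))

# Very negative inputs approach 0, very positive ones approach 1,
# and the output always stays strictly between 0 and 1.
for z in [-10.0, -2.0, 0.0, 2.0, 10.0]:
    print(z, sigmoid(z))
```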
113
00:10:32,540 --> 00:10:37,860
It is obvious that when the outcome of the sigmoid function gets closer to 1, the probability
114
00:10:37,860 --> 00:10:46,070
of y=1, given x, goes up, and in contrast, when the sigmoid value is closer to zero,
115
00:10:46,070 --> 00:10:51,510
the probability of y=1, given x, is very small.
116
00:10:51,510 --> 00:10:56,230
So what is the output of our model when we use the sigmoid function?
117
00:10:56,230 --> 00:11:02,740
In logistic regression, we model the probability that an input (X) belongs to the default class
118
00:11:02,740 --> 00:11:10,620
(Y=1), and we can write this formally as P(Y=1|X).
119
00:11:10,620 --> 00:11:23,790
We can also write P(y=0|X) = 1 - P(y=1|X). For example, the probability of a customer
120
00:11:23,790 --> 00:11:30,480
churning (that is, leaving the company) can be shown as the probability of churn equals 1, given the customer’s income
121
00:11:30,480 --> 00:11:37,660
and age, which can be, for instance, 0.8. And the probability of churn being 0, for the
122
00:11:37,660 --> 00:11:44,540
same customer, given the customer’s income and age, can be calculated as 1 - 0.8 = 0.2.
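(In code, this complement rule is a one-liner; the 0.8 below is just the figure from the example.)

```python
p_churn = 0.8           # P(churn = 1 | income, age), the figure from the example
p_stay = 1.0 - p_churn  # P(churn = 0 | income, age) = 1 - 0.8 = 0.2
print(p_churn, p_stay)
```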
123
00:11:44,540 --> 00:11:53,410
So, now, our job is to train the model to set its parameter values in such a way that
124
00:11:53,410 --> 00:12:02,370
our model is a good estimate of P(y=1∣x). In fact, this is what a good classifier model
125
00:12:02,370 --> 00:12:06,580
built by logistic regression is supposed to do for us.
126
00:12:06,580 --> 00:12:15,350
Also, it should be a good estimate of P(y=0∣x), which can be shown as 1 - σ(θ^T X).
127
00:12:15,350 --> 00:12:22,029
Now, the question is: "How can we achieve this?"
128
00:12:22,029 --> 00:12:27,670
We can find 𝜃 through the training process, so let’s see what the training process is.
129
00:12:27,670 --> 00:12:35,110
Step 1. Initialize the 𝜃 vector with random values, as with most machine learning algorithms,
130
00:12:35,110 --> 00:12:38,680
for example −1 or 2.
131
00:12:38,680 --> 00:12:45,410
Step 2. Calculate the model output, which is σ(θ^T X), for a sample customer in your
132
00:12:45,410 --> 00:12:51,450
training set. X in θ^T X is the vector of feature values -- for
133
00:12:51,450 --> 00:12:55,760
example, the age and income of the customer, for instance [2,5].
134
00:12:55,760 --> 00:13:01,910
And θ is the coefficient or weight vector that you set in the previous step.
135
00:13:01,910 --> 00:13:07,010
The output of this equation is the prediction value … in other words, the probability
136
00:13:07,010 --> 00:13:10,480
that the customer belongs to class 1.
137
00:13:10,480 --> 00:13:18,200
Step 3. Compare the output of our model, y^, which could be a value of, let’s say, 0.7,
138
00:13:18,200 --> 00:13:23,690
with the actual label of the customer, which is for example, 1 for churn.
139
00:13:23,690 --> 00:13:31,230
Then, record the difference as our model’s error for this customer, which would be 1-0.7,
140
00:13:31,230 --> 00:13:38,620
which of course equals 0.3. This is the error for only one customer out of all the customers
141
00:13:38,620 --> 00:13:40,600
in the training set.
142
00:13:40,600 --> 00:13:47,500
Step 4. Calculate the error for all customers as we did in the previous steps, and add up
143
00:13:47,500 --> 00:13:51,890
these errors. The total error is the cost of your model,
144
00:13:51,890 --> 00:13:58,550
and is calculated by the model’s cost function. The cost function, by the way, basically represents
145
00:13:58,550 --> 00:14:03,660
how to calculate the error of the model, which is the difference between the actual and the
146
00:14:03,660 --> 00:14:09,350
model’s predicted values. So, the cost shows how poorly the model is
147
00:14:09,350 --> 00:14:14,540
estimating the customer’s labels. Therefore, the lower the cost, the better
148
00:14:14,540 --> 00:14:18,930
the model is at estimating the customer’s labels correctly.
149
00:14:18,930 --> 00:14:23,930
And so, what we want to do is to try to minimize this cost.
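(Here is a minimal sketch of this cost calculation, using the simple per-customer difference the narration describes; real logistic regression implementations usually use the log-loss cost instead.)

```python
import numpy as np

def total_cost(y_actual, y_hat):
    # Per-customer error as described: |actual label - predicted probability|,
    # added up over every customer in the training set.
    return np.sum(np.abs(y_actual - y_hat))

y_actual = np.array([1.0, 0.0, 1.0])  # actual labels
y_hat = np.array([0.7, 0.2, 0.9])     # model outputs sigma(theta^T X)
print(total_cost(y_actual, y_hat))    # 0.3 + 0.2 + 0.1 = 0.6
```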
150
00:14:23,930 --> 00:14:29,910
Step 5. But, because the initial values for θ were chosen randomly, it’s very likely
151
00:14:29,910 --> 00:14:35,350
that the cost is very high. So, we change 𝜃 in such a way as to hopefully
152
00:14:35,350 --> 00:14:37,470
reduce the total cost.
153
00:14:37,470 --> 00:14:44,190
Step 6. After changing the values of θ, we go back to step 2.
154
00:14:44,190 --> 00:14:49,199
Then we start another iteration, and calculate the cost of the model again.
155
00:14:49,199 --> 00:14:54,699
And we keep doing those steps over and over, changing the values of θ each time, until
156
00:14:54,699 --> 00:15:00,810
the cost is low enough. So, this brings up two questions: first, "How
157
00:15:00,810 --> 00:15:06,060
can we change the values of θ so that the cost is reduced across iterations?"
158
00:15:06,060 --> 00:15:11,910
And second, "When should we stop the iterations?" There are different ways to change the values
159
00:15:11,910 --> 00:15:16,160
of θ, but one of the most popular ways is gradient descent.
160
00:15:16,160 --> 00:15:23,760
Also, there are various ways to stop the iterations, but essentially you calculate
161
00:15:23,760 --> 00:15:28,470
the accuracy of your model and stop training when it’s satisfactory.
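(Putting steps 1 through 6 together, here is a compact, self-contained sketch of the training loop with gradient descent. The data, learning rate, and iteration count are made up for illustration, and it uses the standard log-loss cost rather than the simple difference described earlier; production code would typically rely on a library such as scikit-learn.)

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Made-up training set: columns are [1 (for theta_0), age, income].
X = np.array([[1.0, 2.0, 5.0],
              [1.0, 3.0, 2.0],
              [1.0, 6.0, 1.0],
              [1.0, 7.0, 3.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])  # actual churn labels

# Step 1: initialize theta with random values.
rng = np.random.default_rng(seed=0)
theta = rng.normal(size=X.shape[1])

learning_rate = 0.1
for iteration in range(1000):       # Step 6: keep iterating
    y_hat = sigmoid(X @ theta)      # Step 2: model output sigma(theta^T X)
    error = y_hat - y               # Step 3: difference from the actual labels
    # Step 4: log-loss cost, averaged over all customers.
    cost = -np.mean(y * np.log(y_hat) + (1.0 - y) * np.log(1.0 - y_hat))
    # Step 5: gradient descent step -- nudge theta to reduce the cost.
    theta -= learning_rate * (X.T @ error) / len(y)

print(theta, cost)  # final parameters and the cost at the last iteration
```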
162
00:15:28,470 --> 00:15:30,050
Thanks for watching this video!