FT Linear Regression
This subject is about implementing a linear regression model from scratch.
- Train a linear regression model using the gradient descent algorithm
- Predict the price of a car using its mileage as the single input feature
Linear Regression is a supervised machine learning algorithm used to model the relationship between one or more independent variables (input features) and a continuous dependent variable (the value we are trying to predict).
Goal:
- Find a straight line (or a hyperplane in higher dimensions) that best fits the data points
- This "best fit" line should minimize the difference between the predicted values and the actual values.
Key Terms:
- Independent Variable: The input feature(s) used to predict the dependent variable (in this case, the mileage of a car)
- Dependent Variable: The value we are trying to predict (in this case, the price of a car)
- Parameters: The coefficients and the intercept that define the line
The linear regression equation is a linear equation that describes the relationship between the independent variable(s) and the dependent variable.
For a single input feature, the equation is:

$$\hat{y} = wx + b$$

Where:
- $\hat{y}$ is the predicted value (dependent variable)
- $x$ is the input feature (independent variable)
- $w$ is the slope of the line (weight or coefficient)
- $b$ is the intercept of the line (bias)
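The hypothesis above can be sketched as a one-line function (a minimal illustration; the function name is mine, not from the project):

```python
def predict(x: float, w: float, b: float) -> float:
    """Hypothesis of simple linear regression: y_hat = w * x + b."""
    return w * x + b

# Example: with w = 2 and b = 1, an input of 3 maps to 7.
print(predict(3.0, 2.0, 1.0))  # -> 7.0
```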
The goal of linear regression is to find the optimal values for the parameters $w$ and $b$.
The cost function is used to measure how well the model fits the data. It measures how far off the predicted values are from the actual values.
The cost function for linear regression is the Mean Squared Error (MSE):

$$MSE = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$

Where:
- $n$ is the number of data points
- $y_i$ is the actual value for the i-th data point
- $\hat{y}_i$ is the predicted value for the i-th data point
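The MSE definition translates directly into code. This is a minimal sketch (plain Python, no external libraries; the function name is mine):

```python
def mean_squared_error(xs, ys, w, b):
    """MSE = (1/n) * sum((y_i - y_hat_i)^2) for the line y_hat = w * x + b."""
    n = len(xs)
    return sum((y - (w * x + b)) ** 2 for x, y in zip(xs, ys)) / n

# Points lying exactly on y = 2x + 1 give a cost of zero.
print(mean_squared_error([0, 1, 2], [1, 3, 5], 2.0, 1.0))  # -> 0.0
```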
The goal is to minimize the MSE by finding the optimal values for the parameters $w$ and $b$.
Gradient refers to information that tells us "which direction to move the parameters from their current position to reduce the loss."
The cost function is a function of the parameters $w$ and $b$, so we take its partial derivatives with respect to each:

$$\frac{\partial MSE}{\partial w} = -\frac{2}{n} \sum_{i=1}^{n} x_i (y_i - \hat{y}_i)$$

$$\frac{\partial MSE}{\partial b} = -\frac{2}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)$$

Okay, we have the partial derivatives. But what do they actually mean?
The partial derivative with respect to $w$ tells us how the cost changes as $w$ changes: if it is positive, increasing $w$ increases the cost, and if it is negative, increasing $w$ decreases the cost.
And the same thing applies to the partial derivative with respect to $b$.
This is the key to the gradient descent algorithm.
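The two partial derivatives can be computed in a few lines. A minimal sketch (function name and signature are my own choice):

```python
def gradients(xs, ys, w, b):
    """Partial derivatives of the MSE with respect to w and b.

    dMSE/dw = -(2/n) * sum(x_i * (y_i - y_hat_i))
    dMSE/db = -(2/n) * sum(y_i - y_hat_i)
    """
    n = len(xs)
    errors = [y - (w * x + b) for x, y in zip(xs, ys)]
    dw = -2.0 / n * sum(e * x for e, x in zip(errors, xs))
    db = -2.0 / n * sum(errors)
    return dw, db

# At the optimum of a perfect fit, both gradients vanish.
print(gradients([0, 1, 2], [1, 3, 5], 2.0, 1.0))  # -> (0.0, -0.0)
```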
Gradient Descent is an optimization algorithm used to minimize the cost function by iteratively moving in the direction of steepest descent.
The algorithm is as follows:
- Initialize the parameters $w$ and $b$
- Compute the partial derivatives of the cost function with respect to the parameters
- Update the parameters in the opposite direction of the gradients, and repeat
Now, we have the gradients, which tell us the direction to move the parameters to reduce the cost function. But, how much do we move the parameters? This is determined by the learning rate.
The Learning Rate is a hyperparameter that controls how much we are adjusting the parameters of our model with respect to the gradient of the cost function.
- If the learning rate is too small, the model will take a long time to converge.
- If the learning rate is too large, the model may overshoot the minimum and fail to converge.
The update rule is as follows:

$$w := w - \alpha \frac{\partial MSE}{\partial w}$$

$$b := b - \alpha \frac{\partial MSE}{\partial b}$$

where $\alpha$ is the learning rate.
Why do we subtract the gradient from the parameters? Because we want to minimize the cost function. If the gradient is positive, the cost function increases as the parameter increases, so we need to decrease the parameter. The same logic applies, with the signs reversed, when the gradient is negative.
We just microscopically moved the parameters in the direction that reduces the cost function. Now, we repeat the process until the cost function converges to a minimum value.
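Putting the pieces together, the full training loop looks roughly like this (a sketch, not the project's actual implementation; the fixed iteration count stands in for a convergence check):

```python
def fit(xs, ys, learning_rate=0.01, iterations=1000):
    """Gradient descent on the MSE cost for the hypothesis y_hat = w * x + b."""
    w, b = 0.0, 0.0  # start from zero, as the subject suggests
    n = len(xs)
    for _ in range(iterations):
        errors = [y - (w * x + b) for x, y in zip(xs, ys)]
        dw = -2.0 / n * sum(e * x for e, x in zip(errors, xs))
        db = -2.0 / n * sum(errors)
        # Move each parameter against its gradient, scaled by the learning rate.
        w -= learning_rate * dw
        b -= learning_rate * db
    return w, b

# Data on the line y = 2x + 1: the loop should recover w ~ 2, b ~ 1.
w, b = fit([0, 1, 2, 3], [1, 3, 5, 7], learning_rate=0.05, iterations=5000)
print(round(w, 3), round(b, 3))
```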
After training the model, we can use it to make predictions on new data points.
While updating the parameters, after every CoreConstants.ITERATIONS iterations, the theta values are written to the data/theta.csv file. The updated theta values are then used for the next iterations.
If the cost function increases after updating the parameters, the model has overshot the minimum, which means the learning rate is too high. In this case, the learning rate is decreased by CoreConstants.LEARNING_RATE_DECAY_FACTOR, the data/theta.csv file is wiped, and the training process is restarted from scratch.
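The decay-and-restart behavior can be sketched as follows. This is a simplified illustration of the idea, not the project's code: the constant value is assumed, and the CSV persistence is replaced by simply resetting the in-memory thetas.

```python
# Hypothetical constant standing in for CoreConstants.LEARNING_RATE_DECAY_FACTOR
# (the value 0.5 is an assumption, not taken from the project).
LEARNING_RATE_DECAY_FACTOR = 0.5

def train_with_decay(xs, ys, learning_rate=0.5, iterations=1000):
    """Restart training with a smaller learning rate whenever the cost increases."""
    n = len(xs)
    while True:
        w, b = 0.0, 0.0               # "wiped" thetas: start from scratch
        prev_cost = float("inf")
        overshoot = False
        for _ in range(iterations):
            errors = [y - (w * x + b) for x, y in zip(xs, ys)]
            w -= learning_rate * (-2.0 / n) * sum(e * x for e, x in zip(errors, xs))
            b -= learning_rate * (-2.0 / n) * sum(errors)
            cost = sum((y - (w * x + b)) ** 2 for x, y in zip(xs, ys)) / n
            if cost > prev_cost:      # overshot the minimum: learning rate too high
                learning_rate *= LEARNING_RATE_DECAY_FACTOR
                overshoot = True
                break
            prev_cost = cost
        if not overshoot:
            return w, b, learning_rate
```

Starting from a deliberately large learning rate, the loop halves it until training becomes stable and converges.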
Usually, the initial weights (thetas) are set to zero, and the project document also says they should be. However, convergence is slow when starting from zero. So I added a feature that initializes the weights using the first and the last data points: drawing a straight line through those two points gives the initial slope and intercept, which speeds up the convergence process.
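The endpoint-based initialization described above amounts to the two-point line formula. A minimal sketch (function name is mine; assumes the first and last x values differ):

```python
def init_from_endpoints(xs, ys):
    """Initialize (w, b) from the line through the first and last data points."""
    w = (ys[-1] - ys[0]) / (xs[-1] - xs[0])   # slope between the two endpoints
    b = ys[0] - w * xs[0]                     # intercept so the line passes through the first point
    return w, b

# Endpoints (0, 1) and (4, 9) give the line y = 2x + 1.
print(init_from_endpoints([0, 1, 2, 4], [1, 3, 5, 9]))  # -> (2.0, 1.0)
```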
While updating the parameters, theta0 is updated by CoreConstants.THETA0_ACCELERATION_FACTOR times its gradient of the cost function. This speeds up the convergence process, and it also turns out to reduce the final cost value.