Here is my implementation of multiple logistic regression from scratch, using the sigmoid link function and batch gradient descent, applied to the Kaggle Heart Disease Dataset. The model predicts whether a given patient has heart disease based on age, cholesterol, and resting heart rate as input features.
- Introduction
- Cost Function for Logistic Regression
- Making Predictions
- Binary Classification
- Running the Predictions
When attempting to predict the values of a dataset containing binary results, linear regression is not a good choice. This is because a linear function cannot fit binary outputs (outputs of either 0 or 1) accurately. Linear regression assumes a continuous output, whereas binary classification problems require an output that is restricted to two possible values (0 or 1). Using a linear function in such cases can lead to inaccuracies, as it can predict values outside the desired range, such as negative values or values greater than one.
The sigmoid function is a more appropriate choice for binary classification tasks, as it maps input values to a range between 0 and 1. The sigmoid function is given by:

$$\sigma(z) = \frac{1}{1 + e^{-z}}$$

Evaluating the limits of the sigmoid function: as $z \to \infty$, $\sigma(z) \to 1$, and as $z \to -\infty$, $\sigma(z) \to 0$, so the output always lies strictly between 0 and 1.
To predict binary outcomes with multiple input features, we can adapt the sigmoid function as follows (or "sigmoidize", as I like to call it):

$$h_\theta(x) = \sigma(\theta_0 + \theta_1 x_1 + \theta_2 x_2 + \dots + \theta_n x_n)$$

In matrix form, this can be represented as:

$$h_\theta(x) = \sigma(\theta^T x)$$
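The hypothesis above can be sketched in a few lines of NumPy. This is a minimal illustration, not the repository's actual code; the names `sigmoid` and `predict_proba` are my own, and the design matrix is assumed to carry a leading column of ones for the bias term:

```python
import numpy as np

def sigmoid(z):
    """Map any real-valued input to the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(X, theta):
    """Hypothesis h_theta(x) = sigmoid(X @ theta), where X is an (m, n)
    design matrix whose first column is all ones (the bias term)."""
    return sigmoid(X @ theta)

# Example: bias column plus two illustrative features (age, cholesterol)
X = np.array([[1.0, 54.0, 230.0],
              [1.0, 61.0, 310.0]])
theta = np.zeros(3)
print(predict_proba(X, theta))  # with zero weights, every probability is 0.5
```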
Logistic regression uses a specific cost function that is suitable for binary output values:

$$J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log\left(h_\theta(x^{(i)})\right) + \left(1 - y^{(i)}\right) \log\left(1 - h_\theta(x^{(i)})\right) \right]$$

where:

- $m$ is the number of training examples
- $h_\theta(x^{(i)})$ is the predicted probability for example $i$
- $y^{(i)}$ is the true label (0 or 1) for example $i$

The cost function can be split into two parts depending on the value of $y$:

- When $y = 1$: $$- \log(h_\theta(x))$$
- When $y = 0$: $$- \log(1 - h_\theta(x))$$
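The cost function translates directly into vectorized NumPy. This is a sketch under my own naming (`cost` is a hypothetical helper, not necessarily what the repository calls it); it assumes a design matrix with a leading bias column, as before:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(X, y, theta):
    """Binary cross-entropy J(theta), averaged over the m examples."""
    m = len(y)
    h = sigmoid(X @ theta)
    return -(1.0 / m) * np.sum(y * np.log(h) + (1 - y) * np.log(1 - h))

# With theta = 0 every prediction is 0.5, so the cost is
# -log(0.5) = log(2) regardless of the labels.
X = np.array([[1.0, 2.0], [1.0, 3.0]])
y = np.array([0.0, 1.0])
print(cost(X, y, np.zeros(2)))  # log(2) ≈ 0.6931
```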
To optimize the cost function, we use gradient descent. The partial derivative of the cost function with respect to each parameter $\theta_j$ is:

$$\frac{\partial J(\theta)}{\partial \theta_j} = \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}$$

For the bias term $\theta_0$, the corresponding feature $x_0^{(i)}$ is fixed at 1; for the weights $\theta_1, \dots, \theta_n$, $x_j^{(i)}$ is the $j$-th feature of example $i$.

These derivatives are then used to update $\theta$ on each iteration:

$$\theta_j := \theta_j - \alpha \frac{\partial J(\theta)}{\partial \theta_j}$$

where $\alpha$ is the learning rate.
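The update rule can be sketched as a batch gradient descent loop. This is an illustrative version with hypothetical names (`gradient_descent`) and arbitrary defaults for `alpha` and `iters`, not the repository's tuned settings:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradient_descent(X, y, alpha=0.1, iters=5000):
    """Batch gradient descent on the logistic cost.
    X: (m, n) design matrix whose first column is all ones."""
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(iters):
        h = sigmoid(X @ theta)
        grad = (X.T @ (h - y)) / m  # vector of dJ/d theta_j, all j at once
        theta -= alpha * grad
    return theta

# Toy 1-feature example: labels flip between x = 1 and x = 2,
# so the fitted decision boundary should land between them.
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])
theta = gradient_descent(X, y)
print(theta)
```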
After optimizing $\theta$, we can make binary predictions by passing new inputs through the fitted sigmoid. For multiple input features:

$$h_\theta(x) = \sigma(\theta_0 + \theta_1 x_1 + \dots + \theta_n x_n)$$

Or, in matrix notation:

$$h_\theta(x) = \sigma(\theta^T x)$$

The resulting probability $h_\theta(x)$ is then thresholded at 0.5: predict $\hat{y} = 1$ if $h_\theta(x) \geq 0.5$, and $\hat{y} = 0$ otherwise.
This approach enables the model to output binary predictions after fitting the sigmoid function with the optimal weights and bias obtained through gradient descent.
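The thresholding step can be sketched as follows; `predict` is a hypothetical helper name, and a 0.5 cutoff is the conventional default rather than something the repository necessarily exposes as a parameter:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict(X, theta, threshold=0.5):
    """Convert sigmoid probabilities into hard 0/1 class labels."""
    return (sigmoid(X @ theta) >= threshold).astype(int)

# A weight vector with boundary at x = 1.5: theta0 + theta1 * x = 0 there
theta = np.array([-3.0, 2.0])
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
print(predict(X, theta))  # [0 0 1 1]
```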
The `main.py` file is the main entry point for running the logistic regression predictions. It initializes the dataset, runs the gradient descent algorithm on the cost function above, and passes the resulting parameters and error values to a visualizer.
To run the predictions, follow these steps:
- Ensure you have Python 3 installed on your system.
- Clone this repository and navigate to the project directory.
- Run the following commands to install any necessary dependencies and execute the script:
```bash
# Clone the repository
git clone https://github.com/oskccy/logistic-regression-from-scratch.git
cd logistic-regression-from-scratch

# Install dependencies
pip install -r requirements.txt

# Run the predictions
python3 main.py
```