From dc2c7403baa5fc810893f198108c71a89de5034a Mon Sep 17 00:00:00 2001 From: Pratik Jadhav Date: Wed, 11 Sep 2024 19:21:22 +0530 Subject: [PATCH 1/6] Started documentation for Stochastic Gradient Descent --- .../stochastic-gradient-descent.md | 61 +++++++++++++++++++ 1 file changed, 61 insertions(+) create mode 100644 content/ai/concepts/neural-networks/terms/stochastic-gradient-descent/stochastic-gradient-descent.md diff --git a/content/ai/concepts/neural-networks/terms/stochastic-gradient-descent/stochastic-gradient-descent.md b/content/ai/concepts/neural-networks/terms/stochastic-gradient-descent/stochastic-gradient-descent.md new file mode 100644 index 00000000000..dafc7471954 --- /dev/null +++ b/content/ai/concepts/neural-networks/terms/stochastic-gradient-descent/stochastic-gradient-descent.md @@ -0,0 +1,61 @@ +--- +Title: 'Stochastic Gradient Desent' +Description: 'Stochastic Gradient Desent is optimizer algorithm that minimizes the loss functions in machine learning and deep learning models.' +Subjects: + - 'Machine Learning' + - 'Deep Learning' + - 'Computer Science' +Tags: + - 'AI' + - 'Neural Network' + - 'Optimizer' +CatalogContent: + - 'paths/computer-science' + - 'paths/data-science' +--- + +**Stochastic Gradient Descent** (SGD) is one of the optimization algorithms. It is varient of gradient descent optimizer. The SGD minimize the loss function of machine learning algorithms and deep learning algorithms during backpropogation to update the weight and bias in Artificial Neural Networks. + +The term stochastic mean randomness on which algorithm based upon. In this algorithm instead of taking whole dataset like grdient descent we take single randomly selected data point or small batch of data.suppose if the data set contains 500 rows SGD update the model parameters 500 times in one cycle or one epoch. + +This approach significantly reduces computation time, especially for large datasets, making SGD faster and more scalable.SGD is used for training models like neural networks, support vector machines (SVMs), and logistic regression. However, it introduces more noise into the learning process, which can lead to less stable convergence but also helps escape local minima, making it suitable for non-convex problems. + + +![stochastic gradient descent](https://www.goglides.dev/images/Jq8EpuPoMjCcxm7PqMqWuQK7M_MoVtdfAUsGJsoUIMA/w:880/mb:500000/ar:1/aHR0cHM6Ly93d3ct/Z29nbGlkZXMtZGV2/LnMzLmFtYXpvbmF3/cy5jb20vdXBsb2Fk/cy9hcnRpY2xlcy8z/cGh3bjR0bmpnNGlo/eHV0Znpqby5wbmc) + +## Algorithms Step + +- At each iteration, a random sample is selected from the training dataset. +- The gradient of the cost function with respect to the model parameters is computed based on the selected sample. +- The model parameters are updated using the computed gradient and the learning rate. +- The process is repeated for multiple iterations until convergence or a specified number of epochs. + +## Formula + +$$ +\large \theta = \theta - \alpha * \nabla J((\theta ; x_iy_i)) +$$ + +Where: + +- θ represents the model parameter (weight or bias) being updated. +- α is the learning rate, a hyperparameter that controls the step size of the update. +- ∇J(θ;xi,yi) is the gradient of the cost or loss function J with respect to the model parameter θ, computed based on a single training sample (xi,yi). + +## Advantages +- **Faster convergence:** SGD updates parameters more frequently that hence it takes less time to converge especially for large datasets. 
+- **Reduced Computation Time:** SDD takes only subset of dataset or batch for each updates. This make it easy to handle large datasets and compute faster. +- **Avoid Local Minima:** The noise introduced by updating parameters with individual data points or small batches can help escape local minima.This can potentially lead to better solutions in complex, non-convex optimization problems. +- **Online Learning:** SGD can be used in scenarios where data is arriving sequentially (online learning).- It allows models to be updated continuously as new data comes in. + +## Disadvantages +- **Noisy Updates:** Updates are based on a single data point or small batch, which introduces variability in the gradient estimates.This noise can cause the algorithm to converge more slowly or oscillate around the optimal solution. +- **Convergence Issues:** The noisy updates can lead to less stable convergence and might make it harder to reach the exact minimum of the loss function.Fine-tuning the learning rate and other hyperparameters becomes crucial to achieving good results. +- **Hyperparameter Sensitivity:** - SGD's performance is sensitive to the choice of learning rate and other hyperparameters.Finding the right set of hyperparameters often requires experimentation and tuning. + +## Practical Tips And Tricks When Using SGD +- Shuffle data before training +- Use mini batches(batch size 32) +- Normalize input +- Choose suitable learning rate (0.01) + From 4ee61621fad1d25e07e7864b7d89a4b88c70816c Mon Sep 17 00:00:00 2001 From: Pratik Jadhav Date: Wed, 11 Sep 2024 19:58:25 +0530 Subject: [PATCH 2/6] correct some spelling mistake --- .../stochastic-gradient-descent.md | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/content/ai/concepts/neural-networks/terms/stochastic-gradient-descent/stochastic-gradient-descent.md b/content/ai/concepts/neural-networks/terms/stochastic-gradient-descent/stochastic-gradient-descent.md index dafc7471954..0fdd00ad70c 100644 --- a/content/ai/concepts/neural-networks/terms/stochastic-gradient-descent/stochastic-gradient-descent.md +++ b/content/ai/concepts/neural-networks/terms/stochastic-gradient-descent/stochastic-gradient-descent.md @@ -14,7 +14,7 @@ CatalogContent: - 'paths/data-science' --- -**Stochastic Gradient Descent** (SGD) is one of the optimization algorithms. It is varient of gradient descent optimizer. The SGD minimize the loss function of machine learning algorithms and deep learning algorithms during backpropogation to update the weight and bias in Artificial Neural Networks. +**Stochastic Gradient Descent** (SGD) is a optimization algorithm. It is variant of gradient descent optimizer. The SGD minimize the loss function of machine learning algorithms and deep learning algorithms during backpropagation to update the weight and bias in Artificial Neural Networks. The term stochastic mean randomness on which algorithm based upon. In this algorithm instead of taking whole dataset like grdient descent we take single randomly selected data point or small batch of data.suppose if the data set contains 500 rows SGD update the model parameters 500 times in one cycle or one epoch. @@ -43,8 +43,8 @@ Where: - ∇J(θ;xi,yi) is the gradient of the cost or loss function J with respect to the model parameter θ, computed based on a single training sample (xi,yi). ## Advantages -- **Faster convergence:** SGD updates parameters more frequently that hence it takes less time to converge especially for large datasets. 
-- **Reduced Computation Time:** SDD takes only subset of dataset or batch for each updates. This make it easy to handle large datasets and compute faster. +- **Faster convergence:** SGD updates parameters more frequently hence it takes less time to converge especially for large datasets. +- **Reduced Computation Time:** SDD takes only subset of dataset or batch for each update. This makes it easy to handle large datasets and compute faster. - **Avoid Local Minima:** The noise introduced by updating parameters with individual data points or small batches can help escape local minima.This can potentially lead to better solutions in complex, non-convex optimization problems. - **Online Learning:** SGD can be used in scenarios where data is arriving sequentially (online learning).- It allows models to be updated continuously as new data comes in. From 4ae857a5bea896f184c92999a3cb989f0b8bb536 Mon Sep 17 00:00:00 2001 From: Pratik Jadhav Date: Sat, 14 Sep 2024 19:25:23 +0530 Subject: [PATCH 3/6] Add example and update documentation for SGD --- .../stochastic-gradient-descent.md | 63 ++++++++++++++++--- 1 file changed, 54 insertions(+), 9 deletions(-) diff --git a/content/ai/concepts/neural-networks/terms/stochastic-gradient-descent/stochastic-gradient-descent.md b/content/ai/concepts/neural-networks/terms/stochastic-gradient-descent/stochastic-gradient-descent.md index 0fdd00ad70c..5f67ced591b 100644 --- a/content/ai/concepts/neural-networks/terms/stochastic-gradient-descent/stochastic-gradient-descent.md +++ b/content/ai/concepts/neural-networks/terms/stochastic-gradient-descent/stochastic-gradient-descent.md @@ -1,22 +1,20 @@ --- -Title: 'Stochastic Gradient Desent' -Description: 'Stochastic Gradient Desent is optimizer algorithm that minimizes the loss functions in machine learning and deep learning models.' +Title: 'Stochastic Gradient Descent' +Description: 'Stochastic Gradient Descent is an optimizer algorithm that minimizes the loss functions in machine learning and deep learning models.' Subjects: - 'Machine Learning' - - 'Deep Learning' - 'Computer Science' Tags: - 'AI' - 'Neural Network' - - 'Optimizer' CatalogContent: - 'paths/computer-science' - 'paths/data-science' --- -**Stochastic Gradient Descent** (SGD) is a optimization algorithm. It is variant of gradient descent optimizer. The SGD minimize the loss function of machine learning algorithms and deep learning algorithms during backpropagation to update the weight and bias in Artificial Neural Networks. +**Stochastic Gradient Descent** (SGD) is an optimization algorithm. It is variant of gradient descent optimizer. The SGD minimizes the loss function of machine learning algorithms and deep learning algorithms during backpropagation to update the weights and biases in Artificial Neural Networks. -The term stochastic mean randomness on which algorithm based upon. In this algorithm instead of taking whole dataset like grdient descent we take single randomly selected data point or small batch of data.suppose if the data set contains 500 rows SGD update the model parameters 500 times in one cycle or one epoch. +The term `stochastic` means randomness on which the algorithm is based. In this algorithm, instead of taking whole datasets like `gradient descent`, we take single randomly selected data points or small batches of data. Suppose if the data set contains 500 rows SGD updates the model parameters 500 times in one cycle or one epoch. 
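As a quick illustration of the claim above (a minimal sketch added here, not part of the original patch — the one-weight model, the toy data, and the variable names are assumptions for the example), one epoch of single-sample SGD over a 500-row dataset performs exactly 500 parameter updates:

```python
import numpy as np

# Toy dataset: 500 rows, one feature, true relationship y = 3x
rng = np.random.default_rng(0)
X = rng.random(500)
y = 3 * X + rng.normal(scale=0.1, size=500)

w, lr, updates = 0.0, 0.01, 0

# One epoch of SGD: one update per (shuffled) sample
for i in rng.permutation(len(X)):
    error = w * X[i] - y[i]       # prediction error for this single sample
    w -= lr * 2 * error * X[i]    # gradient of the squared error w.r.t. w
    updates += 1

print(updates)  # 500 -> one update per row in a single epoch
```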
This approach significantly reduces computation time, especially for large datasets, making SGD faster and more scalable.SGD is used for training models like neural networks, support vector machines (SVMs), and logistic regression. However, it introduces more noise into the learning process, which can lead to less stable convergence but also helps escape local minima, making it suitable for non-convex problems. @@ -33,7 +31,7 @@ This approach significantly reduces computation time, especially for large datas ## Formula $$ -\large \theta = \theta - \alpha * \nabla J((\theta ; x_iy_i)) +\large \theta = \theta - \alpha \cdot \nabla J(\theta ; x_i, y_i) $$ Where: @@ -44,7 +42,7 @@ Where: ## Advantages - **Faster convergence:** SGD updates parameters more frequently hence it takes less time to converge especially for large datasets. -- **Reduced Computation Time:** SDD takes only subset of dataset or batch for each update. This makes it easy to handle large datasets and compute faster. +- **Reduced Computation Time:** SGD takes only a subset of dataset or batch for each update. This makes it easy to handle large datasets and compute faster. - **Avoid Local Minima:** The noise introduced by updating parameters with individual data points or small batches can help escape local minima.This can potentially lead to better solutions in complex, non-convex optimization problems. - **Online Learning:** SGD can be used in scenarios where data is arriving sequentially (online learning).- It allows models to be updated continuously as new data comes in. @@ -57,5 +55,52 @@ Where: - Shuffle data before training - Use mini batches(batch size 32) - Normalize input -- Choose suitable learning rate (0.01) +- Choose a suitable learning rate (0.01) +## Syntax +- Learning Rate (α): A hyperparameter that controls the size of the update step. +- Number of Iterations: The number of times the algorithm will iterate over the dataset. +- Loss Function: The function that measures the error of the model predictions. +- Gradient Calculation: The method for computing gradients based on the loss function. 
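To show how these four ingredients fit together, here is a minimal, hedged sketch (not part of the original patch — the `sgd` function name, its signature, and the toy data are illustrative assumptions only). Each iteration draws one random sample and steps against its gradient:

```python
import numpy as np

def sgd(params, grad_fn, data, learning_rate=0.01, n_iterations=1000, seed=0):
    """Minimal SGD driver: each iteration uses one randomly chosen sample."""
    rng = np.random.default_rng(seed)
    for _ in range(n_iterations):
        i = rng.integers(len(data))                        # pick a random sample
        params = params - learning_rate * grad_fn(params, data[i])
    return params

# Ingredients for a tiny 1-D least-squares problem (true relationship y = 2x):
data = [(x, 2.0 * x) for x in np.linspace(0.0, 1.0, 50)]
loss = lambda w, s: (w * s[0] - s[1]) ** 2            # per-sample loss
grad = lambda w, s: 2.0 * (w * s[0] - s[1]) * s[0]    # its gradient w.r.t. w

w = sgd(0.0, grad, data, learning_rate=0.1, n_iterations=2000)
print(round(w, 3), round(np.mean([loss(w, s) for s in data]), 6))  # w близко... w close to 2.0, loss near 0
```

The Example section that follows gives a fuller linear-regression version of the same idea.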
+ +## Example + +Here’s a Python code snippet demonstrating how to implement SGD for linear regression: + +```codebyte/python +import numpy as np + +# Generate synthetic data +np.random.seed(42) +X = 2 * np.random.rand(100, 1) +y = 4 + 3 * X + np.random.randn(100, 1) + +# Initialize parameters +m, n = X.shape +theta = np.random.randn(n, 1) # Initial weights +learning_rate = 0.01 +n_iterations = 1000 + +# Stochastic Gradient Descent function +def stochastic_gradient_descent(X, y, theta, learning_rate, n_iterations): + m = len(y) + for iteration in range(n_iterations): + # Shuffle the data + indices = np.random.permutation(m) + X_shuffled = X[indices] + y_shuffled = y[indices] + + # Update weights for each sample + for i in range(m): + xi = X_shuffled[i:i+1] + yi = y_shuffled[i:i+1] + gradient = 2 * xi.T.dot(xi.dot(theta) - yi) + theta -= learning_rate * gradient + + return theta + +# Perform SGD +theta_final = stochastic_gradient_descent(X, y, theta, learning_rate, n_iterations) + +print("Optimized weights:", theta_final) +``` From 94ddb5b763f6f703618db107a7fcaf09c33da077 Mon Sep 17 00:00:00 2001 From: Pratik Jadhav Date: Mon, 30 Sep 2024 19:40:43 +0530 Subject: [PATCH 4/6] added syntax --- .../stochastic-gradient-descent.md | 21 +++++++++++++++---- 1 file changed, 17 insertions(+), 4 deletions(-) diff --git a/content/ai/concepts/neural-networks/terms/stochastic-gradient-descent/stochastic-gradient-descent.md b/content/ai/concepts/neural-networks/terms/stochastic-gradient-descent/stochastic-gradient-descent.md index 5f67ced591b..2674728f7dd 100644 --- a/content/ai/concepts/neural-networks/terms/stochastic-gradient-descent/stochastic-gradient-descent.md +++ b/content/ai/concepts/neural-networks/terms/stochastic-gradient-descent/stochastic-gradient-descent.md @@ -1,18 +1,18 @@ --- Title: 'Stochastic Gradient Descent' -Description: 'Stochastic Gradient Descent is an optimizer algorithm that minimizes the loss functions in machine learning and deep learning models.' +Description: 'Stochastic Gradient Descent is an optimizer algorithm that minimizes the loss function in machine learning and deep learning models.' Subjects: - 'Machine Learning' - 'Computer Science' Tags: - 'AI' - - 'Neural Network' + - 'Neural Networks' CatalogContent: - 'paths/computer-science' - 'paths/data-science' --- -**Stochastic Gradient Descent** (SGD) is an optimization algorithm. It is variant of gradient descent optimizer. The SGD minimizes the loss function of machine learning algorithms and deep learning algorithms during backpropagation to update the weights and biases in Artificial Neural Networks. +**Stochastic Gradient Descent** (SGD) is an optimization algorithm. It is a variant of gradient descent optimizer. The SGD minimizes the loss function of machine learning algorithms and deep learning algorithms during backpropagation to update the weights and biases in Artificial Neural Networks. The term `stochastic` means randomness on which the algorithm is based. In this algorithm, instead of taking whole datasets like `gradient descent`, we take single randomly selected data points or small batches of data. Suppose if the data set contains 500 rows SGD updates the model parameters 500 times in one cycle or one epoch. @@ -23,7 +23,7 @@ This approach significantly reduces computation time, especially for large datas ## Algorithms Step -- At each iteration, a random sample is selected from the training dataset. +- At each iteration, a random sample is selected from the training dataset. 
- The gradient of the cost function with respect to the model parameters is computed based on the selected sample. - The model parameters are updated using the computed gradient and the learning rate. - The process is repeated for multiple iterations until convergence or a specified number of epochs. @@ -58,12 +58,25 @@ Where: - Choose a suitable learning rate (0.01) ## Syntax + + ``SGD(learning_rate, n_iterations, loss_function, gradient_calculation)`` + - Learning Rate (α): A hyperparameter that controls the size of the update step. - Number of Iterations: The number of times the algorithm will iterate over the dataset. - Loss Function: The function that measures the error of the model predictions. - Gradient Calculation: The method for computing gradients based on the loss function. ## Example +```python + def stochastic_gradient_descent(X, y, theta, learning_rate, n_iterations): + for iteration in range(n_iterations): + for i in range(len(y)): + gradient = compute_gradient(X[i], y[i], theta) + theta -= learning_rate * gradient + return theta +``` + +## codebyte Example Here’s a Python code snippet demonstrating how to implement SGD for linear regression: From 3be99e8cf7f6fab1e35cf8f27a20ea3bfeba1ec8 Mon Sep 17 00:00:00 2001 From: Pratik Jadhav Date: Wed, 2 Oct 2024 19:34:22 +0530 Subject: [PATCH 5/6] corrected code --- .../stochastic-gradient-descent.md | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/content/ai/concepts/neural-networks/terms/stochastic-gradient-descent/stochastic-gradient-descent.md b/content/ai/concepts/neural-networks/terms/stochastic-gradient-descent/stochastic-gradient-descent.md index 2674728f7dd..1c3fe3a3887 100644 --- a/content/ai/concepts/neural-networks/terms/stochastic-gradient-descent/stochastic-gradient-descent.md +++ b/content/ai/concepts/neural-networks/terms/stochastic-gradient-descent/stochastic-gradient-descent.md @@ -70,9 +70,9 @@ Where: ```python def stochastic_gradient_descent(X, y, theta, learning_rate, n_iterations): for iteration in range(n_iterations): - for i in range(len(y)): - gradient = compute_gradient(X[i], y[i], theta) - theta -= learning_rate * gradient + for i in range(len(y)): + gradient = compute_gradient(X[i], y[i], theta) + theta -= learning_rate * gradient return theta ``` From 55a704b70ae4bb60cef6385e65506ffed60369b7 Mon Sep 17 00:00:00 2001 From: Avdhoot Fulsundar Date: Thu, 14 Nov 2024 17:19:21 +0530 Subject: [PATCH 6/6] updated the file --- .../stochastic-gradient-descent.md | 157 ++++++++++++------ 1 file changed, 110 insertions(+), 47 deletions(-) diff --git a/content/ai/concepts/neural-networks/terms/stochastic-gradient-descent/stochastic-gradient-descent.md b/content/ai/concepts/neural-networks/terms/stochastic-gradient-descent/stochastic-gradient-descent.md index 1c3fe3a3887..85877803682 100644 --- a/content/ai/concepts/neural-networks/terms/stochastic-gradient-descent/stochastic-gradient-descent.md +++ b/content/ai/concepts/neural-networks/terms/stochastic-gradient-descent/stochastic-gradient-descent.md @@ -12,14 +12,11 @@ CatalogContent: - 'paths/data-science' --- -**Stochastic Gradient Descent** (SGD) is an optimization algorithm. It is a variant of gradient descent optimizer. The SGD minimizes the loss function of machine learning algorithms and deep learning algorithms during backpropagation to update the weights and biases in Artificial Neural Networks. 
+**Stochastic Gradient Descent** (SGD) is an optimization algorithm used to minimize the loss function in machine learning and deep learning models. It is a variant of the traditional **Gradient Descent** (GD) algorithm. SGD updates the weights and biases of a model, such as those in an Artificial Neural Network (ANN), during the backpropagation process.

-The term `stochastic` means randomness on which the algorithm is based. In this algorithm, instead of taking whole datasets like `gradient descent`, we take single randomly selected data points or small batches of data. Suppose if the data set contains 500 rows SGD updates the model parameters 500 times in one cycle or one epoch.
+The term `stochastic` refers to the randomness involved in the algorithm. Instead of using the entire dataset to compute gradients as in batch `gradient descent`, SGD uses a randomly selected data point (or a small mini-batch) to perform each update. For instance, if the dataset contains 500 rows, SGD will update the model parameters 500 times in one epoch, each time using a different randomly chosen data point (or small batch).

-This approach significantly reduces computation time, especially for large datasets, making SGD faster and more scalable.SGD is used for training models like neural networks, support vector machines (SVMs), and logistic regression. However, it introduces more noise into the learning process, which can lead to less stable convergence but also helps escape local minima, making it suitable for non-convex problems.
+This approach significantly reduces computation time, especially for large datasets, making SGD faster and more scalable. SGD is used for training models like neural networks, support vector machines (SVMs), and logistic regression. However, it introduces more noise into the learning process, which can lead to less stable convergence but also helps escape local minima, making it suitable for non-convex problems.

-![stochastic gradient descent](https://www.goglides.dev/images/Jq8EpuPoMjCcxm7PqMqWuQK7M_MoVtdfAUsGJsoUIMA/w:880/mb:500000/ar:1/aHR0cHM6Ly93d3ct/Z29nbGlkZXMtZGV2/LnMzLmFtYXpvbmF3/cy5jb20vdXBsb2Fk/cy9hcnRpY2xlcy8z/cGh3bjR0bmpnNGlo/eHV0Znpqby5wbmc)

## Algorithm Steps

- At each iteration, a random sample is selected from the training dataset.
- The gradient of the cost function with respect to the model parameters is computed based on the selected sample.
- The model parameters are updated using the computed gradient and the learning rate.
- The process is repeated for multiple iterations until convergence or a specified number of epochs.

## Formula

$$
\large \theta = \theta - \alpha \cdot \nabla J(\theta ; x_i, y_i)
$$

Where:

-- θ represents the model parameter (weight or bias) being updated.
-- α is the learning rate, a hyperparameter that controls the step size of the update.
-- ∇J(θ;xi,yi) is the gradient of the cost or loss function J with respect to the model parameter θ, computed based on a single training sample (xi,yi).
+- `θ` represents the model parameter (weight or bias) being updated.
+- `α` is the learning rate, a hyperparameter that controls the step size of the update.
+- `∇J(θ;xi,yi)` is the gradient of the cost or loss function `J` with respect to the model parameter `θ`, computed based on a single training sample `(xi,yi)`.

## Advantages

- **Faster convergence:** SGD updates parameters more frequently, so it takes less time to converge, especially for large datasets.
- **Reduced Computation Time:** SGD takes only a subset of the dataset (a batch) for each update.
This makes it easy to handle large datasets and compute faster.
- **Avoid Local Minima:** The noise introduced by updating parameters with individual data points or small batches can help escape local minima. This can potentially lead to better solutions in complex, non-convex optimization problems.
- **Online Learning:** SGD can be used in scenarios where data arrives sequentially (online learning). It allows models to be updated continuously as new data comes in.

## Disadvantages

- **Noisy Updates:** Updates are based on a single data point or small batch, which introduces variability in the gradient estimates. This noise can cause the algorithm to converge more slowly or to oscillate around the optimal solution.
- **Convergence Issues:** The noisy updates can lead to less stable convergence and might make it harder to reach the exact minimum of the loss function. Fine-tuning the learning rate and other hyperparameters becomes crucial to achieving good results.
- **Hyperparameter Sensitivity:** SGD's performance is sensitive to the choice of learning rate and other hyperparameters. Finding the right set of hyperparameters often requires experimentation and tuning.

-## Practical Tips And Tricks When Using SGD
-- Shuffle data before training
-- Use mini batches(batch size 32)
-- Normalize input
-- Choose a suitable learning rate (0.01)

-## Syntax
-
-  ``SGD(learning_rate, n_iterations, loss_function, gradient_calculation)``
-
-- Learning Rate (α): A hyperparameter that controls the size of the update step.
-- Number of Iterations: The number of times the algorithm will iterate over the dataset.
-- Loss Function: The function that measures the error of the model predictions.
-- Gradient Calculation: The method for computing gradients based on the loss function.
-
-## Example
-
-```python
-def stochastic_gradient_descent(X, y, theta, learning_rate, n_iterations):
-    for iteration in range(n_iterations):
-        for i in range(len(y)):
-            gradient = compute_gradient(X[i], y[i], theta)
-            theta -= learning_rate * gradient
-    return theta
-```

## Example

The following code demonstrates **Stochastic Gradient Descent** (SGD) to fit a line to data points. Starting with initial guesses for the slope (`m`) and intercept (`b`), it updates these values iteratively by calculating the gradients of the **Mean Squared Error** (MSE) loss. The parameters are adjusted step by step based on the gradients, reducing the error between predicted and actual values:

```python
import numpy as np

# Data points (x, y) where the true line is y = 2x
x = np.array([1, 2, 3, 4, 5])
y = np.array([2, 4, 6, 8, 10])

# Initial guess for parameters (slope, intercept)
params = np.array([0.0, 0.0])

# Learning rate and epochs
learning_rate = 0.01
epochs = 1000

# Model: y = mx + b
def model(params, x):
    m, b = params
    return m * x + b

# MSE loss function
def loss(pred, actual):
    return np.mean((pred - actual) ** 2)  # Using mean instead of sum

# Compute gradients (partial derivatives)
def gradients(params, x, y):
    m, b = params
    pred = model(params, x)
    grad_m = 2 * (pred - y) * x  # Gradient for m
    grad_b = 2 * (pred - y)      # Gradient for b
    return np.array([grad_m, grad_b])

# Training history
history = []

# SGD: Update parameters
for epoch in range(epochs):
    total_loss = 0
    # Shuffle data
    indices = np.random.permutation(len(x))
    x_shuffled = x[indices]
    y_shuffled = y[indices]

    for i in range(len(x)):
        # Forward pass
        pred = model(params, x_shuffled[i])
        loss_value = loss(pred, y_shuffled[i])

        # Compute gradients
        grads = gradients(params, x_shuffled[i], y_shuffled[i])

        # Update parameters
        params -= learning_rate * grads
        total_loss += loss_value

    # Store loss for plotting
    avg_loss = total_loss / len(x)
    history.append(avg_loss)

    if epoch % 100 == 0:  # Print loss every 100 epochs
        print(f"Epoch {epoch}, Loss: {avg_loss:.6f}")

print(f"Final parameters: m = {params[0]:.4f}, b = {params[1]:.4f}")
```

The output of the code is as follows:

```shell
Epoch 0, Loss: 22.414958
Epoch 100, Loss: 0.001293
Epoch 200, Loss: 0.000037
Epoch 300, Loss: 0.000001
Epoch 400, Loss: 0.000000
Epoch 500, Loss: 0.000000
Epoch 600, Loss: 0.000000
Epoch 700, Loss: 0.000000
Epoch 800, Loss: 0.000000
Epoch 900, Loss: 0.000000
Final parameters: m = 2.0000, b = 0.0000
```

> **Note**: The output may vary depending on factors like the initial parameter values, learning rate, and number of epochs.

## Codebyte Example

Here’s a Python code snippet demonstrating how to implement SGD for linear regression:

```codebyte/python
import numpy as np

# Generate synthetic data
np.random.seed(42)
X = 2 * np.random.rand(100, 1)
y = 4 + 3 * X + np.random.randn(100, 1)

# Add a bias term (X0 = 1) to the input data
X_b = np.c_[np.ones((100, 1)), X]  # Add a column of ones for the intercept term

# Initialize parameters
m, n = X_b.shape
theta = np.random.randn(n, 1)  # Initial weights
learning_rate = 0.01
n_iterations = 1000

# Stochastic Gradient Descent function
def stochastic_gradient_descent(X, y, theta, learning_rate, n_iterations):
    m = len(y)
    for iteration in range(n_iterations):
        # Shuffle the data
        indices = np.random.permutation(m)
        X_shuffled = X[indices]
        y_shuffled = y[indices]

        # Update weights for each sample
        for i in range(m):
            xi = X_shuffled[i:i+1]
            yi = y_shuffled[i:i+1]
            gradient = 2 * xi.T.dot(xi.dot(theta) - yi)
            theta -= learning_rate * gradient

    return theta

# Perform SGD
theta_final = stochastic_gradient_descent(X_b, y, theta, learning_rate, n_iterations)

print("Optimized weights:", theta_final)
```
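As a complementary, hedged example (not part of the original entry, and assuming scikit-learn is installed), the same kind of model can be fit with a library implementation of SGD such as `SGDRegressor`, which performs the per-sample updates internally:

```python
import numpy as np
from sklearn.linear_model import SGDRegressor

# Same synthetic data as above: y = 4 + 3x + noise
np.random.seed(42)
X = 2 * np.random.rand(100, 1)
y = 4 + 3 * X + np.random.randn(100, 1)

# SGDRegressor runs stochastic gradient descent on a squared-error loss
model = SGDRegressor(learning_rate='constant', eta0=0.01, max_iter=1000, random_state=42)
model.fit(X, y.ravel())

print("Intercept:", model.intercept_)  # approximately 4
print("Weight:", model.coef_)          # approximately 3
```

In practice, library optimizers such as this (or `torch.optim.SGD` in PyTorch) are usually preferred over hand-written loops because they handle shuffling, learning-rate schedules, and numerical details.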