3D Loss Landscapes of SoftNet, based on the public code
@inproceedings{kang2023on,
  title     = {On the Soft-Subnetwork for Few-Shot Class Incremental Learning},
  author    = {Haeyong Kang and Jaehong Yoon and Sultan Rizky Hikmawan Madjid and Sung Ju Hwang and Chang D. Yoo},
  booktitle = {The Eleventh International Conference on Learning Representations},
  year      = {2023},
  url       = {https://openreview.net/forum?id=z57WK5lGeHd}
}
In this example the target classes $t$ belong to one of two classes, labelled $t=1$ and $t=0$, and are generated from two class distributions. The goal is to predict the target class $t$ from the corresponding input samples $x$.
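The original notebook generates its own example data; a minimal sketch under assumed settings (the class means, spread, sample counts, and the names `X` and `T` are illustrative, not taken from the source) could look like this:

```python
import numpy as np

# Hypothetical example data: 20 samples per class around assumed means.
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(loc=[-1., -1.], scale=0.5, size=(20, 2)),   # class t = 0
               rng.normal(loc=[ 1.,  1.], scale=0.5, size=(20, 2))])  # class t = 1
T = np.vstack([np.zeros((20, 1)), np.ones((20, 1))])                  # target labels
```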
The logistic function $y = \sigma(z) = \frac{1}{1 + e^{-z}}$ is implemented by the `logistic(z)` method below.
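A minimal sketch of this helper, reusing the numpy import from the data snippet above (the exact implementation in the original code may differ):

```python
def logistic(z):
    """Logistic (sigmoid) function: maps z to a value in (0, 1)."""
    return 1. / (1. + np.exp(-z))
```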
The loss function used to optimize the classification is the cross-entropy error function, defined for each sample $i$ as

$$\xi_i(t_i, y_i) = -t_i \log(y_i) - (1 - t_i) \log(1 - y_i).$$

Summing over all $N$ samples gives the total loss:

$$\xi(t, y) = \sum_{i=1}^{N} \xi_i(t_i, y_i) = -\sum_{i=1}^{N} \left[ t_i \log(y_i) + (1 - t_i) \log(1 - y_i) \right].$$
The loss function is implemented below by the `loss(y, t)` method; it is this loss, evaluated over the parameter space $\mathbf{w}$, that the loss landscape plots visualize.
The neural network output $y = \sigma(z)$ with $z = \mathbf{x} \cdot \mathbf{w}$ is implemented by the `nn(x, w)` method, and the neural network prediction (the output thresholded at $0.5$) by the `nn_predict(x, w)` method.
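Building on the `logistic(z)` helper above, minimal sketches of these three methods could look as follows (a single linear layer without a bias term, matching the derivation below; the original implementations may differ):

```python
def nn(x, w):
    """Network output y: logistic of the weighted sum z = x . w (no bias term)."""
    return logistic(x.dot(w.T))

def nn_predict(x, w):
    """Class prediction: threshold the network output at 0.5."""
    return np.around(nn(x, w))

def loss(y, t):
    """Cross-entropy error summed over all samples."""
    return -np.sum(t * np.log(y) + (1 - t) * np.log(1 - y))
```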
The logistic function, the cross-entropy loss function, and their derivatives are explained in detail in the tutorial on [logistic classification with cross-entropy]({% post_url /blog/cross_entropy/2015-06-10-cross-entropy-logistic %}).
The gradient descent algorithm works by taking the gradient (derivative) of the loss function $\xi$ with respect to the parameters and updating the parameters in the direction of the negative gradient.

The parameters $\mathbf{w}$ are updated at every iteration $k$ by taking a step proportional to the negative of the gradient:

$$\mathbf{w}(k+1) = \mathbf{w}(k) - \Delta \mathbf{w}(k), \qquad \Delta \mathbf{w} = \mu \frac{\partial \xi}{\partial \mathbf{w}},$$

where $\mu$ is the learning rate.

Following the chain rule, the gradient of the loss for sample $i$ with respect to the weights can then be written as

$$\frac{\partial \xi_i}{\partial \mathbf{w}} = \frac{\partial \xi_i}{\partial y_i} \frac{\partial y_i}{\partial z_i} \frac{\partial z_i}{\partial \mathbf{w}},$$

where:
- ${\partial \xi_i}/{\partial y_i}$ can be calculated as (see [this post]({% post_url /blog/cross_entropy/2015-06-10-cross-entropy-logistic %}) for the derivation): $\frac{\partial \xi_i}{\partial y_i} = \frac{y_i - t_i}{y_i (1 - y_i)}$
- ${\partial y_i}/{\partial z_i}$ can be calculated as (see [this post]({% post_url /blog/cross_entropy/2015-06-10-cross-entropy-logistic %}) for the derivation): $\frac{\partial y_i}{\partial z_i} = y_i (1 - y_i)$
- ${\partial z_i}/{\partial \mathbf{w}}$ can be calculated as: $\frac{\partial z_i}{\partial \mathbf{w}} = \mathbf{x}_i$
Bringing this together we can write:

$$\frac{\partial \xi_i}{\partial \mathbf{w}} = \frac{\partial \xi_i}{\partial y_i} \frac{\partial y_i}{\partial z_i} \frac{\partial z_i}{\partial \mathbf{w}} = \frac{y_i - t_i}{y_i (1 - y_i)} \cdot y_i (1 - y_i) \cdot \mathbf{x}_i = (y_i - t_i) \, \mathbf{x}_i.$$

Notice how this gradient is the same (up to a constant factor) as the gradient of the squared-error loss used in the regression example from the previous section.
So the full update for each weight $w_j$ and each sample $i$ becomes:

$$\Delta w_j = \mu \cdot \frac{\partial \xi_i}{\partial w_j} = \mu \cdot (y_i - t_i) \, x_{ij}.$$

In batch processing, we just average the gradients over all $N$ samples:

$$\Delta w_j = \mu \cdot \frac{1}{N} \sum_{i=1}^{N} (y_i - t_i) \, x_{ij}.$$

To start the gradient descent algorithm, you typically pick the initial parameters at random and then keep updating them according to the delta rule with $\mathbf{w}(k+1) = \mathbf{w}(k) - \Delta \mathbf{w}$ until convergence.
The gradient is implemented by the `gradient(w, x, t)` function, and the resulting parameter update by the `delta_w(w_k, x, t, learning_rate)` function.
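Under the same assumptions as the sketches above (no bias term, gradients averaged over the batch), these two helpers could be written as:

```python
def gradient(w, x, t):
    """Gradient of the cross-entropy loss w.r.t. the weights,
    averaged over all samples: mean_i (y_i - t_i) * x_i."""
    return (nn(x, w) - t).T.dot(x) / x.shape[0]

def delta_w(w_k, x, t, learning_rate):
    """Delta rule: the weight update is the learning rate times the gradient."""
    return learning_rate * gradient(w_k, x, t)
```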
Gradient descent is run on the example inputs for a fixed number of iterations, updating the weights at each step. The resulting decision boundary of running gradient descent on the example inputs is shown in the plot below. Note that the decision boundary goes through the point $(0, 0)$: since the network output $y = \sigma(\mathbf{x} \cdot \mathbf{w})$ has no bias term, the boundary $y = 0.5$ corresponds to $\mathbf{x} \cdot \mathbf{w} = 0$, which always contains the origin.
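A minimal end-to-end sketch of this training loop, reusing the helpers and the hypothetical `X`, `T` data from above (the initial weights, learning rate, and iteration count are illustrative, not taken from the source):

```python
w = np.asarray([[-4., -2.]])   # illustrative initial weights
learning_rate = 0.05

for k in range(50):            # iteration count chosen for illustration
    w = w - delta_w(w, X, T, learning_rate)

print('final weights:', w)
print('final loss:', loss(nn(X, w), T))
```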