This repo implements the methods in:
- Optimizing Millions of Hyperparameters by Implicit Differentiation.
- On the Iteration Complexity of Hypergradient Computation, ICML 2020
The motivation for this reimplementation is to:
- serve my own learning purposes
- have a cleaner source code
- compare several approaches to approximating the inverse-Hessian-vector product
Suppose we have a model which is a subclass of `nn.Module` and contains all the parameters. `BaseHyperOptModel` in `model.py` wraps this model and adds hyperparameters. `BaseHyperOptModel` manages and integrates all the hyperparameters of the main model, for example computing the training loss via the `train_loss` function and the validation loss via the `validation_loss` function. Currently, `BaseHyperOptModel` allows its subclasses to customize regularization and data augmentation.
Let us define the logistic regression model for the L2 regularization problem:

```python
import torch
import torch.nn as nn


class LogisticRegression(nn.Module):
    def __init__(self, input_dim):
        super().__init__()
        self.w = nn.Parameter(torch.randn((input_dim, 1)))

    def forward(self, x):
        return x @ self.w
```
In this example, we will optimize the L2 regularization hyperparameter. The following class handles this hyperparameter:

```python
class L2RHyperOptModel(BaseHyperOptModel):
    def __init__(self, input_dim) -> None:
        network = LogisticRegression(input_dim)
        criterion = nn.BCEWithLogitsLoss()
        super().__init__(network, criterion)
        # declare hyperparameters
        self.hparams = nn.Parameter(torch.ones(input_dim, 1))

    @property
    def hyper_parameters(self):
        # return a list of hyperparameters
        return [self.hparams]

    def regularizer(self):
        # the regularizer is added to the train loss
        return 0.5 * (self.network.w.t() @ torch.diag(self.hparams.squeeze())) @ self.network.w
```
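As a quick sanity check, the pieces above can be exercised directly. The snippet below is only an illustrative sketch: it builds the regularized training loss by hand from the wrapped network, the criterion, and `regularizer`, instead of going through `BaseHyperOptModel.train_loss` (whose exact signature lives in `model.py`), and it assumes the wrapped model is stored as `self.network`, as the `regularizer` above suggests.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.randn(32, 10)                   # a batch of 32 examples with 10 features
y = torch.randint(0, 2, (32, 1)).float()  # binary targets for BCEWithLogitsLoss

model = L2RHyperOptModel(input_dim=10)
logits = model.network(x)                 # LogisticRegression forward pass
train_loss = nn.BCEWithLogitsLoss()(logits, y) + model.regularizer()
print(train_loss.item())                  # data term plus the weighted L2 penalty
```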
We introduce the `BaseHyperOptimizer` object, which computes the hypergradient of the hyperparameters via the implicit function theorem. A subclass extending this object should provide a way to approximate the inverse-Hessian-vector product. The current implementation contains several approaches (a minimal sketch of the Neumann variant is shown after this list):
- Conjugate Gradient
- Neumann series expansion
- Fixed point
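To make this concrete, below is a minimal, standalone sketch of the Neumann-series approximation to the inverse-Hessian-vector product, `H^{-1} v ≈ alpha * sum_{i=0}^{K} (I - alpha*H)^i v`. The function name `neumann_ihvp` and its default arguments are illustrative only and are not the repo's API.

```python
import torch

def neumann_ihvp(train_loss, params, v, alpha=0.01, num_terms=20):
    """Approximate H^{-1} v, where H is the Hessian of train_loss w.r.t. params.

    Uses the truncated Neumann series
        H^{-1} v ~= alpha * sum_{i=0}^{K} (I - alpha * H)^i v,
    which converges when alpha is small enough that ||I - alpha * H|| < 1.
    `v` is a list of tensors shaped like `params`.
    """
    grads = torch.autograd.grad(train_loss, params, create_graph=True)
    term = [vi.clone() for vi in v]   # current term (I - alpha*H)^i v
    acc = [vi.clone() for vi in v]    # running sum of the series
    for _ in range(num_terms):
        # Hessian-vector product H @ term via double backward.
        hvp = torch.autograd.grad(grads, params, grad_outputs=term, retain_graph=True)
        term = [t - alpha * h for t, h in zip(term, hvp)]
        acc = [a + t for a, t in zip(acc, term)]
    return [alpha * a for a in acc]
```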
`BaseHyperOptimizer` allows choosing whether the hypergradient is computed over a single batch (set `stochastic=False`) or over multiple batches (set `stochastic=True`). Refer to the AISTATS 2021 paper (Convergence Properties of Stochastic Hypergradients) for the stochastic version.
This optimizer also allows choosing between the Hessian matrix and the Gauss-Newton Hessian matrix (see this); a rough sketch of both matrix-vector products is given below.
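For reference, here is a rough sketch of the two matrix-vector products involved (the Hessian-vector product and the Gauss-Newton-vector product `G v = J^T H_L J v`), written with plain `torch.autograd`. The helper names `hvp` and `ggn_vp` are made up for this example and are not the optimizer's internal API.

```python
import torch

def hvp(loss, params, v):
    """Hessian-vector product H v of `loss` w.r.t. `params`, via double backward."""
    grads = torch.autograd.grad(loss, params, create_graph=True)
    return torch.autograd.grad(grads, params, grad_outputs=v)

def ggn_vp(logits, loss, params, v):
    """Gauss-Newton-vector product G v = J^T H_L J v, where J = d logits / d params
    and H_L is the Hessian of the loss w.r.t. the logits."""
    # J v via the double-backward trick (differentiate J^T u w.r.t. the dummy u).
    u = torch.zeros_like(logits, requires_grad=True)
    jt_u = torch.autograd.grad(logits, params, grad_outputs=u, create_graph=True)
    jv = torch.autograd.grad(jt_u, u, grad_outputs=v, create_graph=True)[0]
    # H_L (J v): curvature of the loss in logit space.
    g_logits = torch.autograd.grad(loss, logits, create_graph=True)[0]
    hl_jv = torch.autograd.grad(g_logits, logits, grad_outputs=jv, retain_graph=True)[0]
    # J^T (H_L J v): pull the result back to parameter space.
    return torch.autograd.grad(logits, params, grad_outputs=hl_jv, retain_graph=True)
```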
In each optimizer step, `BaseHyperOptimizer` takes as inputs `train_loss_func`, a function returning two outputs (the train loss and the train logits), and `val_loss`, the validation loss. A hypothetical end-to-end sketch is given below.
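Putting it all together, a training loop could look roughly like the sketch below. This is a hypothetical usage example only: the subclass name `NeumannHyperOptimizer`, the data loaders, and the exact signatures of `train_loss`, `validation_loss`, and `step` are assumptions here and may differ from the actual code.

```python
import torch

model = L2RHyperOptModel(input_dim=10)
inner_opt = torch.optim.SGD(model.network.parameters(), lr=0.1)  # updates only the model weights w
hyper_opt = NeumannHyperOptimizer(model.hyper_parameters)        # assumed subclass of BaseHyperOptimizer

# train_loader / val_loader: torch DataLoaders over the train / validation splits (not shown)
for epoch in range(100):
    # Inner loop: fit the model parameters on the training set.
    for x_tr, y_tr in train_loader:
        inner_opt.zero_grad()
        loss, _ = model.train_loss(x_tr, y_tr)   # assumed to return (loss, logits)
        loss.backward()
        inner_opt.step()

    # Outer step: hypergradient update of the L2 hyperparameters.
    x_tr, y_tr = next(iter(train_loader))
    x_val, y_val = next(iter(val_loader))
    train_loss_func = lambda: model.train_loss(x_tr, y_tr)       # returns (train loss, train logits)
    val_loss = model.validation_loss(x_val, y_val)
    hyper_opt.step(train_loss_func, val_loss)                    # assumed call signature
```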
Related work and references:
- Hypertorch library: An excellent library which this repo adopts in many parts. However, it is a bit hard to work with `nn.Module.parameters`.
- Gradient-Based Optimization of Hyperparameters: Hyperparameter optimization dates back to the year 2000 with this work of Bengio.
- Hyperparameter optimization with approximate gradient, ICML 2016: Perhaps the first work on hyperparameter optimization using implicit gradients. Here the approximation tool is the conjugate gradient method.
- On the Iteration Complexity of Hypergradient Computation, ICML 2020: An in-depth comparison (convergence and approximation error) between iterative differentiation (or unrolling) and approximate implicit differentiation. The approximation considers two cases: fixed point vs. conjugate gradient.
- Convergence Properties of Stochastic Hypergradients, AISTATS 2021: This work is quite important since previously we might blindly train an implicit differentiation method with minibatches of data without knowing whether it really converges.
- Optimizing Millions of Hyperparameters by Implicit Differentiation: Approximate implicit differentiation with a Neumann series expansion.
- Efficient and Modular Implicit Differentiation: A recent work from Google describing a general, modular approach that separates solvers from autodiff.
- Roger Grosse's course: Excellent material for beginners, from basic optimization to bilevel optimization.