refactor: BO and ifBO #134

Merged
64 commits merged on Oct 8, 2024

Conversation

eddiebergman
Contributor

@eddiebergman eddiebergman commented Aug 13, 2024

This PR simplifies and speeds up/improves BO and ifBO. This list is by no means exhaustive, but it covers some of the major changes and new toys present.

How?


This was primarily done by using the SearchSpace only for its definitions, not its methods. When interacting with models that expect tensors, we encode directly to a tensor and act directly on that encoded space, instead of going back and forth between SearchSpace and the data format that the surrogate models expect.

Before: pass around list[SearchSpace] and have each component encode as needed, often performing operations directly on the SearchSpace.

After: Encode the list[SearchSpace] into what's required and inform the components about the encoding.

This frees up a lot of time for better acquisition optimization, and it avoids bloating the ever-growing list of methods in SearchSpace, which cannot provide a solution for every kind of model/optimizer we have.


As part of this, we now use botorch as a dependency; it is primarily built on top of gpytorch, which we already depended on. Some of the benefits include:

  • No hand-rolled kernels (potentially buggy, less code to maintain).
  • Fully differentiable GP/kernel operations, as well as the option to set GP hyperparameter priors. I hand-coded some priors based on prior work, such as Carl Hvarfner's, but these priors recently became the botorch defaults, which means less for us to maintain. We now do MAP optimization of GP hyperparameters instead of MLE optimization, which proved vastly more stable and consistent in some toy experiments. Botorch's optimization for it out-performed anything I could hand-roll before.
  • We now have access to the full suite of botorch acquisitions, some including numerical stability tricks, handling of pending configurations without fantasization, batching, batch acquisition and a lot more. One interesting thing to note is that this would make transitioning model-based methods to multi-objective much more trivial, as botorch directly supports MO in its models and acquisitions.
  • One area where I do hand-roll some things is a class WeightedAcquisition, which can take in a botorch AcquisitionFunction and apply a custom weighting to the output. For example, here is a function that wraps an arbitrary acquisition function and applies a weight based on the pdf of the samples under a prior, i.e. PiBo. You can see it in use here; a minimal sketch of the idea follows this list.
  • This made acquisition optimization much easier. We now scale from a measly ~50 samples to ~1000 samples, including some rounds of gradient-based optimization, restarts and more.
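To make the weighting idea concrete, here is a minimal sketch of such a wrapper. This is not the actual NePS class: the weight_fn callable (e.g. the pdf of the candidates under a user prior, PiBo-style) is an assumption for illustration.

from typing import Callable

import torch
from botorch.acquisition import AcquisitionFunction


class WeightedAcquisition(AcquisitionFunction):
    """Sketch: wrap any botorch acquisition and scale its output by a custom weight."""

    def __init__(
        self,
        acq: AcquisitionFunction,
        weight_fn: Callable[[torch.Tensor], torch.Tensor],
    ) -> None:
        super().__init__(model=acq.model)
        self.acq = acq
        self.weight_fn = weight_fn  # e.g. pdf of the candidates under a prior (PiBo-style)

    def forward(self, X: torch.Tensor) -> torch.Tensor:
        # Multiply the base acquisition value by the weight of each candidate batch.
        return self.acq(X) * self.weight_fn(X)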

Also, I have removed a lot of the fake flexibility that was offered for BO and ifBO. The primary change is that our hand-rolled GP and the ftpfn model are no longer treated as if they were the same. They share very little in common, are acquired from in very different manners and have very different data encodings. With the removal of DeepGP, these are our only two surrogates, so we just treat them as two very different things. Maybe we try to unify them in the future, but I do not know what we would gain from that.

In reality, we as developers would be the only ones to use the more advanced options, and in general they would be confusing to users actually looking to configure them, let alone the fact that passing custom objects or even some of our own classes/objects would not work. Maybe we introduce the flexibility at some point, but it obfuscated the code and made it harder to maintain, test and debug. As an example, both ifBO and BO now only have one method, ask(), which contains most of the logic you would expect to see when referencing a paper/description of the algorithm.

Here is the ask() of both BO and ifBO now, which removes most of the abstractions and is just direct function calls. It also removes the two-step load_configs() and get_next_config() that we had before.

The result of this is that using the models is now "stateless", and mostly accessible through a function call.
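To give a flavour of that style, here is a self-contained toy using plain botorch on random data. It is not NePS's actual ask(); the objective, bounds and sample counts are made up for illustration.

import torch
from botorch.acquisition import qLogExpectedImprovement
from botorch.fit import fit_gpytorch_mll
from botorch.models import SingleTaskGP
from botorch.models.transforms.input import Normalize
from botorch.models.transforms.outcome import Standardize
from botorch.optim import optimize_acqf
from gpytorch.mlls import ExactMarginalLogLikelihood

# Toy data: 10 observed configs already encoded as a (10, 2) tensor in [0, 1].
x = torch.rand(10, 2, dtype=torch.float64)
y = (x - 0.5).pow(2).sum(dim=-1, keepdim=True)  # pretend objective, lower is better

# Fit the GP (botorch maximizes, so the objective is negated).
gp = SingleTaskGP(x, -y, input_transform=Normalize(d=2), outcome_transform=Standardize(m=1))
fit_gpytorch_mll(ExactMarginalLogLikelihood(gp.likelihood, gp))

# Optimize the acquisition with raw samples plus gradient-based restarts.
acq = qLogExpectedImprovement(gp, best_f=(-y).max())
candidate, _ = optimize_acqf(
    acq,
    bounds=torch.tensor([[0.0, 0.0], [1.0, 1.0]], dtype=torch.float64),
    q=1,
    num_restarts=4,
    raw_samples=128,
)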

ifBO is fairly similar in terms of the function calls.


As representing configurations as a complex SearchSpace object is highly inefficient for some of the model routines, such as encoding/decoding/sampling/acquisition-function optimization, I avoid the methods present in SearchSpace and treat it as just a definition of hyperparameters. Instead, we define an encoding and encode all configuration information into one big tensor. The encoder can translate back and forth:

Conceptually, list[SearchSpace] <-> list[dict] <-> Encoder <-> Tensor

Doing so means we go from "asking the search space to sample itself and then doing all transformations" to "sampling a tensor and doing tensor operations to match the encoding". No objects, little Python, just torch.
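A hand-written illustration of what that encoding amounts to, using plain torch only; the hyperparameter names and the exact transforms are invented for the example and are not the ConfigEncoder API.

import torch

configs = [
    {"lr": 1e-3, "optimizer": "adam"},
    {"lr": 1e-1, "optimizer": "sgd"},
]
optimizers = ["adam", "sgd"]  # categorical encoded as an integer index

# Float on a log scale in [1e-4, 1e0]: log-transform, then min-max normalize into [0, 1].
lo, hi = torch.log(torch.tensor(1e-4)), torch.log(torch.tensor(1e0))
lr_col = (torch.tensor([c["lr"] for c in configs]).log() - lo) / (hi - lo)
opt_col = torch.tensor([float(optimizers.index(c["optimizer"])) for c in configs])

encoded = torch.stack([lr_col, opt_col], dim=-1)  # shape (n_configs, n_columns)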

This required some new infrastructure that was aware of how configurations are encoded (ConfigEncoder).

The most important piece of new infrastructure is the Domain.

  • Domain: A dataclass that represents a numeric range, dtype, whether it's binned, log scale, etc. The most important method is cast(), which allows you to convert between domains, e.g. cast from Domain.floating(10, 10_000, log=True) to Domain.floating(0, 1, bins=18):
import torch
# Domain comes from neps' new encoding infrastructure (exact import path omitted here).

domain1 = Domain.floating(10, 10_000, log=True)
x_in_domain_1 = torch.tensor([10.0, 100.0, 1_000.0, 10_000.0])

# Cast from the log-scaled range [10, 10_000] into [0, 1] with 18 bins.
domain2 = Domain.floating(0, 1, bins=18)
x_in_domain_2 = domain2.cast(x_in_domain_1, frm=domain1)

Anywhere we use a tensor, there is a Domain associated with it somehow.
In short, it tells you what kind of numbers are in the tensor.
We have them in quite a few places and put them to good use:

  • ConfigEncoder.domains: list[Domain], one domain for each column in the tensor representing encoded configs.
  • XXXParameter: gives information about the domain of a parameter's outputs.
  • TorchDistributionWithDomain: as dumb as it sounds, it combines a torch distribution with the domain over which it has support/samples.
  • Samplers (below): sample values in their own domain and take a Domain | list[Domain] into which you'd like those samples transformed.
  • Priors (below): take in a tensor and its domain, then can transform it into the space of the prior distribution and calculate the pdf.

The primary method is pretty straightforward. The most important argument is to=, which lets you say "in what domain(s) would you like your samples?". This means you can sample a big tensor of uniform values and convert it directly into the domain of encoded configs (i.e. integers for categoricals, min-max normalized floats/ints, etc.).

def sample(
    self,
    n: int | torch.Size,
    *,
    to: Domain | list[Domain],
    seed: torch.Generator | None = None,
    device: torch.device | None = None,
    dtype: torch.dtype | None = None,
) -> torch.Tensor:
    """Sample `n` points and convert them to the given domain.

    Args:
        n: The number of points to sample. If a torch.Size, an additional dimension
            will be added with [`.ncols`][neps.samplers.Sampler.ncols].
            For example, if `n = 5`, the output will be `(5, ncols)`. If
            `n = (5, 3)`, the output will be `(5, 3, ncols)`.
        to: If a single domain, `.ncols` columns will be produced from that one
            domain. If a list of domains, then it must have the same length as the
            number of columns, with each column being in the corresponding domain.
        seed: The seed generator
        dtype: The dtype of the output tensor.
        device: The device to cast the samples to.

    Returns:
        A tensor of (n, ndim) points sampled cast to the given domain.
    """
    ...
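Conceptually that looks something like the following, again in plain torch rather than the Sampler API itself; the two target domains are invented for the example.

import torch

n, gen = 5, torch.Generator().manual_seed(0)
u = torch.rand(n, 2, generator=gen, dtype=torch.float64)   # unit-uniform samples

# Column 0 -> a categorical index in {0, 1, 2}; column 1 -> a log-uniform float in [1e-4, 1e-1].
cat_col = (u[:, 0] * 3).floor().clamp(max=2.0)
lo, hi = torch.log(torch.tensor(1e-4)), torch.log(torch.tensor(1e-1))
float_col = torch.exp(lo + u[:, 1] * (hi - lo))

samples = torch.stack([cat_col, float_col], dim=-1)        # (n, ncols), each column in its own domain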

Most of the Priors are backed by torch distributions, for which we have the aptly named TorchDistributionWithDomain, which encapsulates both a distribution and the domain over which it samples. The cast() method allows fluidly transforming between distribution domains, sample domains and config-encoding domains.
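As a sketch of the idea, using plain torch.distributions; the default location and scale here are invented for the example.

import torch
from torch.distributions import Normal

# A prior centred on a user default that is encoded at 0.25 in the unit domain.
prior = Normal(loc=torch.tensor(0.25), scale=torch.tensor(0.1))

# One column of an encoded config tensor, already living in [0, 1]; if the distribution's
# support differed from the encoded domain, a Domain cast would translate between them first.
encoded = torch.tensor([0.10, 0.25, 0.80])
pdf = prior.log_prob(encoded).exp()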


For some future work, I believe many of the bandit prior methods could benefit from the Prior class, as it allows calculating priors over both uniform parameters and those with a prespecified default.

@eddiebergman
Contributor Author

Lol tests pass, I guess none of our tests anywhere hit this, because this definitely shouldn't work right now

@karibbov
Contributor

There are some minor updates on the mergeDyHPO branch, which I'll go over with some comments to follow. I don't think NotPSDError needs any special addressing; it usually arises when the model is fed bad data (as in many repeating or similar data points, perhaps even many zero values), so it should occur fairly seldom.

@eddiebergman eddiebergman changed the title refactor: DeepGP refactor: BO and ifBO Oct 2, 2024
@eddiebergman
Contributor Author

Adding tests to this PR before merging

@eddiebergman eddiebergman merged commit 5ed2bf3 into master Oct 8, 2024
7 checks passed