Skip to content

Latest commit

 

History

History
164 lines (119 loc) · 10.6 KB

File metadata and controls

164 lines (119 loc) · 10.6 KB

Learning to Optimize

When faced with a new task, humans usually do not start from scratch but incorporate knowledge obtained from previously solved problems. We implement the same procedure for optimization problems using deep learning: Our neural network learns and generalizes over similar problems!

Table of Contents

  1. Installation
  2. Introduction
    1.1 An Intuitive Explanation
    1.2 A Scientific Explanation
  3. Usage
  4. Modules
    3.1 Uncertainty Measurement
    3.2 Balancing Exploration and Exploitation
    3.3 Function Approximation Using Deep Learning
  5. NIPS


0 Installation

Before doing anything else, please install the Python dependencies (preferably in a virtual environment) using:

pip install -r requirements.txt


1 Introduction


1.1 An Intuitive Explanation

  • Problem: We focus on optimization problems of arbitrary dimensionality that can be represented with functions.
    Example: There are similar but different propellers for airplanes, hovercrafts, and ships. Their performance depends on features, e.g. width and length of different components. We want to maximize performance given similar scenarios.

  • Input: Some samples of the underlying function (performance over features) are available and serve as input for the neural network.

  • Output: The neural network proposes a new place for sampling.
    Example: Suppose you have much experience (samples) for building airplane propellers. Now you want to build a hovercraft propeller, but you don't know where to start! The neural network proposes a prototype that will most likely perform decently. Once you have tested this first prototype, the new knowledge can be incorporated in the model. Thus, the neural network learns along the way and improves its suggestions.

  • Performance: Increased efficiency by more than 50% [1]


1.2 A Scientific Explanation

Optimization constitutes a highly relevant but complex task in many industries. Special focus lies on efficiency, as obtaining samples from the objective functions is typically particularly expensive. We address this problem by using neural networks as surrogate models and estimates of uncertainty in the surrogates to orchestrate the optimization process. We evaluate multiple methods for uncertainty estimation and apply them to balance between exploration and exploitation. The approach is economical regarding the number of required samples. We further propose a method to build a meta-model over a class of similar optimization problems, which we show more than halves the number of samples required in new problems by generalizing from old ones.


2 Usage

You can find a notebook-demo for the generalization module here. To run it install your currently activated virtual environment as kernel, and then run jupyter notebook:

python -m ipykernel install --user --name=optimization-surrogate-generalization
jupyter notebook

Instantiation:

from generalizer import Generalizer

Generalizer(dim, initial_samples_X, initial_samples_Y, grid_size)
  • dim: Space of data, array with shape (2,dimensions), rows refer to min/max
  • initial_samples_X, initial_samples_Y: Samples, array with shape (data points,dimensions), prior scaling of domain necessary (e.g. using utils.scale_01(x, original_range))
  • grid_size: Amount of evaluated points along each dimension (integer) [2]

Proposition of new sampling location:

generalizer.get_next_sampling_location(slice)
  • slice: One optimization problem, e.g. airplane propeller optimization within propeller optimization. The entire model consists of many (n-1)-dimensional slices, e.g. functions for propellers of hovercrafts, ships and planes.
  • utils.scale_restore(data) transforms data back to original space (reverses prior scaling)

Incorporation of new sample:

generalizer.incorporate_new_sample(x_new, y_new)
  • x_new, y_new: x refers to features (e.g. propeller-width), y to respective labels (e.g. performance)

Exemplary convergence criteria can be found in the demo-notebook.


3 Modules


3.1 Uncertainty Measurement

This module measures prediction-uncertainty:

import uncertainty

uncertainty.compute_uncertainty(train_X, train_Y, test_X, dims, method='Ensembles')
  • train_X, train_Y: Training data and labels
  • test_X: test-grid (Where should uncertainty be computed?)
  • dims: Dimensionality
  • method: Method for obtaining uncertainty, either Kriging, XDist, Ensembles or MCDropout [3]


3.2 Balancing Exploration and Exploitation

This module calculates expected improvement for each future sampling location. The method balances exploration and exploitation based on uncertainty estimates [4]. This assures efficient sampling.

import expected_improvement

expected_improvement.compute_expected_improvement_forrester(prediction, uncertainty, y_min, grid)
  • prediction: Predicted function for slice (array with values for each grid-point)
  • grid: Represents space (array with points to evaluate)
  • uncertainty: Function displaying prediction-uncertainty (array with values for each grid-point)
  • y_min: Minimum value of all samples in slice


3.3 Function Approximation Using Deep Learning

Given sample points, this module approximates continuous functions with arbitrary dimensionality. [5]


4 NIPS

An extended abstract has been presented at a NIPS 2016 workshop. A copy can also be found in /docs/.

5 Details

  • [1]: Graphical illustration: Efficiency is measured by the required amount of samples for reaching the global minimum. We used the six hump camel function to test our implementation.

  • [2] Scaling to higher dimensionality exponentially increases the search space. We advise using a sparse grid in those cases.

  • [3]: Performance overview for uncertainty-measurement methods

    • Kriging: Gaussian process regression, interpolated values are modeled by a Gaussian process using variance as uncertainty estimates.
      Advantage: Efficient calculation; disadvantage: Independent of neural network.
    • XDist: Increasing Euclidean distance to samples leads to increased uncertainty.
      Advantage: Easy; disadvantages: too peaky, uneven results, independent of neural network
    • Ensembles: Using several neural networks with different random weight initializations. Mean over outputs constitutes prediction, variance depicts uncertainty.
      Advantage: Uncertainty estimates depend on neural network; disadvantage: computationally expensive
      Future work: Ensemble-uncertainty-estimates could be used as training data for a new neural network, which could then exclusively be used on new data. This would increase computational efficiency.
    • MCDropout: Monte-Carlo dropout: Approximation of deep Gaussian process by application of dropout (multiple times) in test instead of training. Mean of output constitutes prediction, variance depicts uncertainty.
      Advantage: Uncertainty estimate depends on neural network
      Disadvantage: Tendency that higher curvature (in the function) is modelled by scarce number of neurons. Thus, dropout disproportionally affects those regions. However, high curvature can be present in already explored regions, leading to poor uncertainty estimates in our scenario.
  • [4]: Based on two-stage approach according to formula (81), Section 4.3.1, Recent advances in surrogate-based optimization, Forrester and Keane

    • Graphical illustration: A Gaussian distribution displays the probability of possible prediction values for each grid-point x. Predictions decreasing the current slice-minimum are most relevant. Thus, the formula calculates the mean of the "area enclosed by the Gaussian distribution below the best observed value so far".
    • Intuition: The expected improvement evaluates all unobserved points x. If the expected improvement is high for point x, then sampling at this point will probably result in a new slice-minimum.
  • [5]: We target zero-bias regression as samples are assumed to be noise-free. Implementation in tensorflow with following parameters:

    • Hidden Layer: 2
    • Hidden units: 200/layer (More units -> local minima; less units -> underfit)
    • Activation: ReLu (1st layer) + Sigmoid (2nd layer)
      • ReLu: Sharp inflexions and higher latitude; sigmoid: Smoothness (higher underfit)
    • Optimizer: Adam (initial learning rate=0.01, higher rate -> coarseness)
    • Training epochs: 2000 (More: Oscillation between samples; fewer: underfit)
    • Weight/bias initialization: Truncated normal distribution (scaled up (stddev=2) to achieve desired overfit)
    • Regularization None as it leads to underfit (dropout and L2-norm weight decay regularization are implemented, but not activated)
    • Scaling: Domain has to be scaled to [0,1] for excellent results

6 License

Copyright 2016 Jonas Langhabel and Jannik Wolff. See License for details.