This repository contains the code used and presented in the paper *Efficient Generator of Mathematical Expressions for Symbolic Regression*, which can be cited as:
```bibtex
@article{Mežnar2023HVAE,
  author={Me{\v{z}}nar, Sebastian and D{\v{z}}eroski, Sa{\v{s}}o and Todorovski, Ljup{\v{c}}o},
  title={Efficient generator of mathematical expressions for symbolic regression},
  journal={Machine Learning},
  year={2023},
  month={Sep},
  day={06},
  issn={1573-0565},
  doi={10.1007/s10994-023-06400-2},
  url={https://doi.org/10.1007/s10994-023-06400-2}
}
```
*EDHiE, a mad scientist frantically searching for the right mathematical expression.*
We are currently refactoring the code and improving its performance, so some parts may be broken.
An overview of the approach (illustrated on a symbolic regression example) is shown below.
- Install HVAE/EDHiE with the command `pip install git+https://github.com/smeznar/HVAE`.
- Create an instance of `SRDataset` using SRToolkit.
- Run the `EDHiE` or `HVAR` function.
An example can be found below:
```python
import numpy as np

from EDHiE import EDHiE, HVAR
from SRToolkit.dataset import SRBenchmark, SRDataset
from SRToolkit.utils import SymbolLibrary

# Create your own dataset
dataset = SRDataset(np.array([[1, 1], [2, 3], [3, 4]]), np.array([2, 5, 7]),
                    ["X_0", "+", "X_1"], "y=a+b",
                    SymbolLibrary.from_symbol_list(["+", "*", "-"], num_variables=2))

# ... or import a dataset from SRBenchmark
dataset = SRBenchmark.feynman("./fd").create_dataset("I.39.1")

# Run EDHiE or HVAR
EDHiE(dataset, size_trainset=10000, epochs=10, num_runs=3)
HVAR(dataset, size_trainset=10000, epochs=10, num_runs=3)
```
When using the EDHiE pipeline, you can configure the following parameters:

| Hyperparameter | Description | Default value |
|---|---|---|
| `dataset` | The dataset to use for equation discovery/symbolic regression; an instance of `SRToolkit.dataset.SRDataset` | |
| `grammar` | A probabilistic grammar, given as a string in NLTK notation. If provided, the expressions in the training set are generated from it | `None` |
| `size_trainset` | The number of expressions generated for the training set | 50000 |
| `max_expression_length` | If provided, generated expressions have length at most `max_expression_length` | 35 |
| `trainset` | A list of expressions, each written as a list of tokens. If provided, this training set is used instead of generating one | `None` |
| `pretrained_params` | If provided, a pretrained HVAE model is used instead of training a new one | `None` |
| `latent_size` | Dimension of the latent space | 32 |
| `epochs` | Number of epochs for training the HVAE model | 40 |
| `batch_size` | Batch size for training the HVAE model | 32 |
| `save_params_to_file` | If provided, the parameters of the trained HVAE model are saved to the file `save_params_to_file` | `None` |
| `num_runs` | Number of symbolic regression runs | 10 |
| `population_size` | Size of the population in the evolutionary algorithm | 200 |
| `max_generations` | Maximum number of generations in the evolutionary algorithm | 500 |
| `seed` | Random seed used during symbolic regression. If not `None`, each subsequent run uses the previous seed + 1 | 18 |
| `verbose` | If `True`, the function reports the progress of training-set generation, HVAE training, and the evolutionary algorithm, and prints basic symbolic regression results | `True` |
HVAR has the parameter `expr_generated` instead of `population_size` and `max_generations`; it tells HVAR how many expressions to sample and evaluate (equivalent to `population_size * max_generations`).
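For instance, a more fully configured run might look like the sketch below. The parameter names follow the table above; the grammar string is only a minimal illustration of NLTK probabilistic-grammar notation, and both the nonterminal conventions HVAE expects and the file name `hvae_params.pt` are assumptions made for this example.

```python
# Illustration only: parameter names come from the table above; the grammar
# string uses NLTK PCFG notation, but the exact nonterminal conventions that
# HVAE expects are an assumption here.
grammar = """
E -> E '+' E [0.3]
E -> E '*' E [0.3]
E -> 'X_0' [0.2]
E -> 'X_1' [0.2]
"""

EDHiE(
    dataset,
    grammar=grammar,                        # training expressions sampled from this grammar
    size_trainset=20000,                    # number of generated training expressions
    max_expression_length=30,               # cap on the length of generated expressions
    latent_size=32,                         # dimension of the latent space
    epochs=20,
    batch_size=32,
    save_params_to_file="hvae_params.pt",   # hypothetical file name
    population_size=200,
    max_generations=500,
    num_runs=5,
    seed=18,
    verbose=True,
)

# HVAR replaces population_size and max_generations with expr_generated,
# i.e. the total number of expressions sampled and evaluated (200 * 500 here).
HVAR(dataset, size_trainset=20000, epochs=20, expr_generated=100000, num_runs=5)
```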
This repository implements HVAE, EDHiE (HVAE + evolutionary algorithm), and HVAR (HVAE + random sampling). HVAE is an autoencoder that needs to be trained before it can be used as a generator or for equation discovery/symbolic regression.
Our motivation for this approach is symbolic regression (equation discovery), a machine learning task where the goal is to find a closed-form expression that fits the given data. In symbolic regression, HVAE is used to generate expressions. To explore the latent space produced by HVAE efficiently, the variational autoencoder needs to possess the following characteristics:
- Produce syntactically valid expressions; HVAE produces only syntactically valid expressions by design.
- Reconstruct (unseen) expressions well; otherwise we cannot expect the latent space to have structure, the expressions produced by the generator are effectively random, and we gain nothing from methods for optimization in continuous space.
- Map points that are close in the latent space to (for now syntactically) similar expressions; this makes exploration of the latent space with optimization methods possible.
In this section we show how to evaluate these characteristics and how to run symbolic regression experiments using HVAE.
Disclaimer: Since the submission of the manuscript "Efficient generator of mathematical expressions for symbolic regression", we have changed some parts of the approach (mostly `BatchedNode`, regularization, and the symbolic regression script), which may impact performance.
The code for evaluating reconstruction accuracy can be found in the `EDHiE/reconstruction_accuracy.py` script, together with an example of how to run it.
The table below shows the percentage of syntactically correct expressions and the reconstruction accuracy, evaluated as the edit distance between the original and the predicted expression in postfix notation.
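To illustrate the metric, the sketch below computes the edit distance between two expressions in postfix notation with the standard Levenshtein dynamic program over their token sequences; this is a minimal illustration, not the repository's exact implementation in `reconstruction_accuracy.py`.

```python
def edit_distance(a: list[str], b: list[str]) -> int:
    """Levenshtein distance between two token sequences."""
    # dp[i][j] = number of edits needed to turn a[:i] into b[:j]
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        dp[i][0] = i
    for j in range(len(b) + 1):
        dp[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # substitution
    return dp[len(a)][len(b)]

# The postfix token sequences for x0 + x1 and x0 * x1 differ by one token:
assert edit_distance(["X_0", "X_1", "+"], ["X_0", "X_1", "*"]) == 1
```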
Additionally, we show below how efficient HVAE is with regard to the number of training examples needed and the dimension of the latent space.
We use linear interpolation to show that points close in the latent space decode to similar expressions: we encode two expressions into the latent space with the encoder, obtaining two latent vectors, and then decode evenly spaced points on the line between them.
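Conceptually, the procedure looks like the sketch below; `model.encode` and `model.decode` are placeholder names for a trained HVAE's encoder and decoder, not necessarily the exact API of this repository.

```python
import torch

# Hypothetical interface: `model` is a trained HVAE whose encoder maps an
# expression to a latent vector and whose decoder maps a vector back.
z1 = model.encode(expr1)   # latent vector for the first expression
z2 = model.encode(expr2)   # latent vector for the second expression

# Decode evenly spaced points on the line segment between z1 and z2.
for lam in torch.linspace(0.0, 1.0, steps=5):
    z = (1 - lam) * z1 + lam * z2
    print(lam.item(), model.decode(z))
```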
To try it out, use the `linear_interpolation.py` script. Some results of linear interpolation are shown in the table below:
To evaluate EDHiE (Equation Discovery with Hierarchical variational autoEncoders = HVAE + evolutionary algorithm) on the symbolic regression task, you can use the `symbolic_regression.py` script.
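As a rough sketch of the underlying idea (not the actual implementation in `symbolic_regression.py`), evolution takes place directly in the latent space: latent vectors are decoded into expressions, scored on the data, and varied by interpolation and Gaussian noise. The `model.decode` and `dataset.evaluate` calls below are hypothetical interfaces, and the operators are simplified.

```python
import numpy as np

# Minimal sketch of an evolutionary loop in the latent space (illustration
# only; the actual operators and evaluation in the repository differ).
rng = np.random.default_rng(18)
population = rng.standard_normal((200, 32))  # population_size x latent_size

for generation in range(500):
    # Decode each latent vector into an expression and score it on the data
    # (model.decode and dataset.evaluate are hypothetical interfaces here).
    errors = np.array([dataset.evaluate(model.decode(z)) for z in population])

    # Selection: keep the better half of the population.
    parents = population[np.argsort(errors)[:100]]

    # Crossover by interpolating random parent pairs, then Gaussian mutation.
    pairs = rng.integers(0, len(parents), size=(200, 2))
    lam = rng.random((200, 1))
    children = lam * parents[pairs[:, 0]] + (1 - lam) * parents[pairs[:, 1]]
    population = children + 0.1 * rng.standard_normal(children.shape)
```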
Some results of symbolic regression on the Nguyen symbolic regression benchmark can be found in the table below.