AI-Driven Methods for Symbolic Regression

This project explores two modern, AI-driven methods for symbolic regression: the Equation Learner (EQL) neural network and a Seq2Seq Transformer model. The goal is to discover the underlying mathematical formula that describes a given dataset. This repository implements both approaches, allowing for a comprehensive analysis of their strengths and weaknesses.

This project was developed for the System 2 AI (M.S.) course.

Features

  • EQL Implementation: A fully functional EQL model (SymbolicNet) built in PyTorch, using primitive mathematical operations as activation functions.
  • Transformer Implementation: A complete from-scratch Seq2Seq Transformer model that treats symbolic regression as a translation task (from data points to an expression).
  • Data Generation: Includes the complete pipeline for generating a large-scale dataset of (data, expression) pairs to train the Transformer.
  • Modular & Scriptable: The entire workflow is broken into clean scripts for data generation, training, and evaluation.
  • Analysis Notebooks: Includes demo notebooks that guide a user through running the entire pipeline for both methods.

Core Concepts & Techniques

  • Symbolic Regression: The task of finding a simple, interpretable mathematical expression that best fits a given dataset, as opposed to a black-box model (e.g., a standard DNN).
  • EQL (Equation Learner): A neural network architecture where layers are composed of basis functions (e.g., sin, cos, *, x). Training with sparsity-inducing regularization (like L1) prunes the network, revealing a simple underlying equation.
  • Seq2Seq Modeling: A model architecture consisting of an Encoder (to read an input sequence) and a Decoder (to write an output sequence).
  • Transformer Architecture: A modern Seq2Seq model based entirely on self-attention mechanisms, which allows it to learn dependencies between all parts of the input and output.
  • Permutation Invariance: A property of the Transformer's encoder, which is designed to process a set of (x, y) data points where the order of the points does not matter.

How It Works

What is Symbolic Regression?

Symbolic regression is a type of machine learning that aims to discover the underlying mathematical expression that best describes a dataset.

This task is fundamentally different from other types of regression:

  • In Traditional Regression (e.g., linear regression), we pre-define the structure of the equation (e.g., $y = mx + b$) and the algorithm's only job is to find the best parameters (e.g., $m$ and $b$).
  • In Black-Box Modeling (e.g., a standard deep neural network), the model learns a complex, uninterpretable function to map inputs to outputs, but it cannot be easily written down as a simple formula.

Symbolic regression, in contrast, searches for both the optimal structure (the symbols themselves, like +, *, sin, x_1) and the constants of the equation. The final output is a simple, human-readable formula, such as $y = 5x_1x_2^2$.

This project implements and compares two powerful, AI-driven approaches to solve this task:

1. EQL-Based Symbolic Regression

The Equation Learner (EQL) method frames symbolic regression as a specialized neural network design problem.

Architecture: SymbolicNet

The core of this method is the SymbolicNet, a multi-layer feedforward network. Unlike a standard DNN, which uses a single activation function (e.g., ReLU) throughout, a SymbolicNet layer assigns each of its neurons a different primitive mathematical function (Identity, Square, Sin, Product, etc.).

  • Input Layer: Takes the input variables (e.g., $x_1$, $x_2$).
  • Hidden Layers: Each layer receives the outputs from the previous layer and applies all primitive functions to them. For example, a hidden layer might compute $x_1^2$, $\sin(x_1)$, $x_1 \cdot x_2$, etc., in parallel. The weights of the network determine how these intermediate terms are combined.
  • Output Layer: The final layer is a simple linear neuron that computes a weighted sum of all the terms generated by the final hidden layer.

The final output is a single, differentiable equation built from these primitives.
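To make the layer structure concrete, here is a minimal, illustrative PyTorch sketch of one such layer. The class name, the exact primitive set, and the wiring are assumptions for illustration; the real SymbolicNet in src/eql/model.py may organize its primitives differently.

    import torch
    import torch.nn as nn

    class SymbolicLayerSketch(nn.Module):
        """Illustrative EQL-style layer (hypothetical; not the project's exact code).

        A linear map produces pre-activations; the unary primitives (identity,
        square, sine) each consume one of them, and the binary product
        primitive consumes two. The concatenated terms feed the next layer.
        """
        def __init__(self, in_dim):
            super().__init__()
            self.n_unary, self.n_binary = 3, 1          # id, square, sin | product
            width = self.n_unary + 2 * self.n_binary    # pre-activations needed
            self.linear = nn.Linear(in_dim, width)

        def forward(self, x):
            z = self.linear(x)                          # weighted sums of inputs
            unary = [z[:, 0:1],                         # identity
                     z[:, 1:2] ** 2,                    # square
                     torch.sin(z[:, 2:3])]              # sine
            binary = [z[:, 3:4] * z[:, 4:5]]            # product of two terms
            return torch.cat(unary + binary, dim=1)     # terms for the next layer

Stacking a few such layers and finishing with a single linear output neuron yields a fully differentiable expression whose surviving (nonzero-weight) terms can be read off as a formula.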

Analysis & Results

The key to EQL is sparsity. Applying L1 regularization (controlled by the reg_weight parameter) penalizes the magnitude of the weights, which drives most of them to zero, effectively "pruning" connections and leaving the simplest expression that still fits the data.
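As a rough sketch of how that regularization enters the objective (the exact loss composition in src/eql/trainer.py may differ), assuming reg_weight mirrors the --reg_weight flag:

    import torch
    import torch.nn.functional as F

    def train_step(model, optimizer, x, y, reg_weight=0.01):
        """One EQL training step with an L1 sparsity penalty (illustrative)."""
        optimizer.zero_grad()
        pred = model(x)
        mse = F.mse_loss(pred, y)
        # The L1 term penalizes weight magnitude; with a suitable reg_weight it
        # drives most weights to exactly zero, pruning connections so the
        # surviving terms form a short, readable expression.
        l1 = sum(p.abs().sum() for p in model.parameters())
        loss = mse + reg_weight * l1
        loss.backward()
        optimizer.step()
        return loss.item()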

  • Dataset 1 (Known Formula: $y = 2x + \sin(x) + x\sin(x)$ ):

    • When run_eql.py is executed on this dataset, the network successfully finds an expression that is a very close approximation of the true formula.
    • Typical Result: 1.99*x_1 + 0.98*sin(1.0*x_1) + 1.01*Product(x_1, sin(1.0*x_1))
  • Dataset 2 (Hidden Formula):

    • This dataset has two input variables, $x_1$ and $x_2$. Without regularization, the EQL model tends to overfit, producing a long, complex, and uninterpretable expression.
    • Bonus (Sparsity): By running the script with a non-zero --reg_weight (e.g., 0.01), we force the model to find a simpler solution.
    • Typical Result: The model converges to a simple, correct expression: 4.99 * Product(x_1, Square(x_2)). This reveals the hidden formula is $y = 5x_1x_2^2$.

2. Seq2Seq-Based Symbolic Regression

This method frames the task as a "translation" problem: we want to translate a set of data points into a sequence of mathematical tokens.

Architecture: Encoder-Decoder Transformer

  1. Data Generation: First, we run generate_transformer_data.py. This script creates thousands of random mathematical expressions (e.g., log(x1) + C*x2), samples data points from them, and saves these (data, expression) pairs.
  2. Encoder: The Encoder's job is to read the set of 50 (x, y) data points. Its architecture is permutation-invariant, meaning the order of the data points doesn't matter. It uses a series of MLPs and attention layers, followed by a max-pooling operation, to "condense" the entire dataset into a single vector (the "context vector"). This vector numerically represents the core pattern of the data (a simplified sketch of this encoder appears after this list).
  3. Decoder: The Decoder is a standard Transformer decoder. It receives the single context vector from the Encoder and begins generating the symbolic expression, token by token. It is trained to produce the expression in a prefix notation (e.g., add, mul, C, x1, sin, x2 for $C \cdot x_1 + \sin(x_2)$ ).
  4. Training: The train_transformer.py script trains this entire end-to-end model on the pre-generated dataset.
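Below is a simplified, illustrative sketch of the permutation-invariant encoder from step 2. The class name and dimensions are assumptions, and the real module in src/transformer/architecture.py may differ; the point is that embedding each point, letting the points attend to one another, and max-pooling over the set makes the result independent of the order of the 50 points.

    import torch
    import torch.nn as nn

    class SetEncoderSketch(nn.Module):
        """Permutation-invariant set encoder (illustrative, not the project's exact code)."""
        def __init__(self, point_dim=2, d_model=256, n_heads=4, n_layers=2):
            super().__init__()
            self.embed = nn.Sequential(nn.Linear(point_dim, d_model), nn.ReLU(),
                                       nn.Linear(d_model, d_model))
            layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
            self.attn = nn.TransformerEncoder(layer, n_layers)
            # No positional encoding is added, so self-attention followed by
            # max-pooling keeps the whole module order-invariant.

        def forward(self, points):                 # points: (batch, 50, 2) -> (x, y) pairs
            h = self.attn(self.embed(points))      # (batch, 50, d_model)
            context, _ = h.max(dim=1)              # (batch, d_model) context vector
            return context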

Analysis & Results

The evaluate_transformer.py script loads the trained model and tests it on our two datasets. Because the model predicts only a placeholder token, C, wherever a numeric constant belongs, a final step fits those constants to the data after prediction (a small sketch of this step follows the results below).

  • Dataset 1 (Known Formula: $y = 2x + \sin(x) + x\sin(x)$ ):

    • The model successfully predicts a sequence corresponding to the known formula, though it may not include the coefficients.
    • Typical Result: (add, (add, x1, (sin, x1)), (mul, x1, (sin, x1))). The script's constant optimizer (scipy.optimize.minimize) then finds the best coefficients to fit the expression, resulting in something like (add, (add, (mul, 1.99, x1), (sin, x1)), (mul, x1, (sin, x1))).
  • Dataset 2 (Hidden Formula: $y = 5x_1x_2^2$):

    • The model, having seen many similar patterns in its training data, correctly identifies the underlying structure.
    • Typical Prediction: (mul, C, (mul, x1, (sq, x2))).
    • Optimization: The evaluate_transformer.py script then uses scipy.optimize.minimize to find the value of C that minimizes the Mean Squared Error (MSE) between the expression and the true data.
    • Final Result: The optimizer robustly finds that $C \approx 5.0$. The final output is (mul, 5.0, (mul, x1, (sq, x2))), perfectly recovering the hidden formula $y = 5x_1x_2^2$.
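The constant-fitting step can be illustrated with a small standalone example. The predicted_expr callable below is a hypothetical stand-in for the expression decoded from the prefix tokens; evaluate_transformer.py builds this callable from the model's output rather than hard-coding it.

    import numpy as np
    from scipy.optimize import minimize

    # Hypothetical stand-in for the decoded expression (mul, C, (mul, x1, (sq, x2))).
    def predicted_expr(C, x1, x2):
        return C * x1 * x2 ** 2

    def fit_constant(x1, x2, y_true, c0=1.0):
        """Find the placeholder constant C that minimizes MSE against the data."""
        def mse(c):
            return np.mean((predicted_expr(c[0], x1, x2) - y_true) ** 2)
        return minimize(mse, x0=[c0], method="BFGS").x[0]

    # Usage sketch: with data drawn from y = 5 * x1 * x2**2, the recovered C
    # should come out close to 5.0.
    rng = np.random.default_rng(0)
    x1, x2 = rng.uniform(-2, 2, 200), rng.uniform(-2, 2, 200)
    C = fit_constant(x1, x2, 5.0 * x1 * x2 ** 2)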

Project Structure

neurosymbolic-regression-ai-methods/
├── .gitignore
├── LICENSE
├── README.md                          # This file
├── config.py                          # Central configuration for paths, device, etc.
├── requirements.txt                   # Python dependencies
├── data/
│   ├── hidden_formula_dataset.csv     # Provided 2D dataset
│   └── .gitkeep
├── logs/
│   └── .gitkeep                       # Directory for log files
├── notebooks/
│   ├── 1_EQL_Demo.ipynb               # Demo for running the EQL pipeline
│   └── 2_Transformer_Demo.ipynb       # Demo for the Transformer pipeline
├── outputs/
│   └── .gitkeep                       # Directory for saved models and plots
├── scripts/
│   ├── run_eql.py                     # Main script to train and evaluate EQL
│   ├── generate_transformer_data.py   # Step 1: Create Transformer dataset
│   ├── train_transformer.py           # Step 2: Train the Transformer model
│   └── evaluate_transformer.py        # Step 3: Evaluate the Transformer
└── src/
    ├── __init__.py
    ├── utils.py                       # Shared utilities (logging, plotting)
    ├── eql/                           # EQL-specific modules
    │   ├── __init__.py
    │   ├── data.py                    # Data loaders for EQL
    │   ├── expression.py              # SymPy expression retrieval
    │   ├── functions.py               # Primitive activation functions
    │   ├── model.py                   # SymbolicNet (EQL model)
    │   └── trainer.py                 # EQL training/testing logic
    └── transformer/                   # Transformer-specific modules
        ├── __init__.py
        ├── architecture.py            # Transformer model architecture
        ├── data_generator.py          # Functions for generating expression data
        ├── dataset.py                 # PyTorch Dataset for Transformer
        ├── trainer.py                 # Transformer training loop
        └── utils.py                   # Loss, accuracy, and evaluation functions

How to Use

  1. Clone the Repository:

    git clone https://github.com/msmrexe/neurosymbolic-regression-ai-methods.git
    cd neurosymbolic-regression-ai-methods
  2. Install Dependencies:

    pip install -r requirements.txt
  3. Run the EQL Pipeline (Demo):

    • Open and run the notebooks/1_EQL_Demo.ipynb notebook.
    • Alternatively, run the script directly:
    # Run on Dataset 1 (auto-generated)
    python scripts/run_eql.py --dataset_path dataset_1 --n_layers 2 --epochs 20000
    
    # Run on Dataset 2 (from CSV) with L1 regularization
    python scripts/run_eql.py --dataset_path data/hidden_formula_dataset.csv --n_layers 2 --reg_weight 0.01
  4. Run the Transformer Pipeline (Demo):

    • Open and run the notebooks/2_Transformer_Demo.ipynb notebook.
    • Alternatively, run the scripts in order:
    # Step 1: Generate the training data (This takes a long time!)
    # This will create ~7900 data files in data/transformer_pregen/
    python scripts/generate_transformer_data.py --nb_trails 10000
    
    # Step 2: Train the model (This also takes a long time!)
    # Model will be saved to outputs/transformer_best.pth
    python scripts/train_transformer.py --epochs 100 --batch_size 128 --d_model 256
    
    # Step 3: Evaluate the trained model on our two datasets
    python scripts/evaluate_transformer.py --model_path outputs/transformer_best.pth

Author

Feel free to connect or reach out if you have any questions!


License

This project is licensed under the MIT License. See the LICENSE file for full details.
