
This repository includes the code used to generate data and figures in "Carlozo, M. N., Wang, K., Dowling, A. W. (2024). Bayesian Optimization Methods for Nonlinear Model Calibration"


Bayesian Optimization Methods for Nonlinear Model Calibration

Authors: Montana N. Carlozo, Ke Wang, and Alexander W. Dowling

Introduction

GPBO_Emulators is a repository for calibrating computationally expensive models against experimental data. Its key feature is the use of machine learning tools, namely Gaussian processes (GPs) and Bayesian optimization (BO), to sample parameter space intelligently and decrease the objective. This work compares standard GPBO, in which a GP models an expensive objective function, with emulator GPBO, in which the expensive function itself is emulated directly by the GP.

Citation

This work was submitted to Industrial & Engineering Chemistry Research (I&ECR). Please cite as:

Montana N. Carlozo, Ke Wang, Alexander W. Dowling, “Bayesian Optimization Methods for Nonlinear Model Calibration”, 2024

Available Data

Repository Organization

The repository is organized as follows:
GPBO_Emulators/ is the top level directory. It contains:

  1. .gitignore prevents git from tracking large files from the signac workflow, generated plots, and other unimportant files.
  2. init_gpborand.py generates jobs to run for the stochastic model implementation of the workflow with signac.
  3. init_gpbo_nonoise.py generates jobs to run for the deterministic model implementation of the workflow with signac.
  4. make_1meth_hms.py makes contour plots of Simulated SSE, GP SSE, GP SSE variance, and acquisition function for a given GP model.
  5. make_1obj_hms.py makes contour plots of either Simulated SSE, GP SSE, GP SSE variance, or acquisition function for a given GP model for each method.
  6. make_bar_charts.py makes Figures 2 and 7 from the main text (bar charts of relevant data).
  7. make_cond_num_data.py generates condition number data for the best GP models.
  8. make_least_squares_data.py generates data related to nonlinear least squares (NLS), including categorizing the number of local minima and the best-performing NLS runs.
  9. make_line_plots.py generates all line plots in the main text and SI including parity plots and plots of BO iteration for SSE and acquisition function for all modes of evaluation.
  10. make_movies_from_hms.py makes movies (.mp4) from the contour plots generated with make_1obj_hms.py.
  11. make_muly0_hist.py makes the histogram for the Müller y0 case study data.
  12. make_full_results.py makes the full-results.csv and results.csv files from the SI.
  13. gpbo-emul.yml is the environment for running this workflow.

Directory bo_methods_lib/ contains the package for running the workflow.
bo_methods_lib/bo_methods_lib/ contains the following files:

  1. GPBO_Classes_New.py is the main script for the algorithm.
  2. GPBO_Class_fxns.py contains helper functions that define parameters for the multiple case studies. This file is also useful for mapping the numerical markers of case studies to their formal names in the manuscript.
  3. analyze_data.py and GPBO_Classes_plotters.py are used for analyzing and plotting the results.
  4. tests/*.py contains test functions for public methods of all classes in GPBO_Classes_New.py.

Directories GPBO_rand/ and GPBO_nonoise/ are initially created via init_gpborand.py and init_gpbo_nonoise.py in the top directory through signac.
Each contains the following files/subdirectories:

  1. delete_jobs.py is a script for quickly deleting targeted jobs/results in signac.
  2. view_unfinished_jobs.py is a script for viewing individual case study job progress.
  3. project_gpborand.py (project_gpbononoise.py) is the script for running the workflow using signac.
  4. templates/ contains the templates required to run this workflow in signac on the CRC.
  5. workspace/ will appear after running init_gpbo*.py and stores all raw results generated during the workflow. This directory is not tracked by git due to its size; the workspace/ folder for this study can be downloaded from Google Drive (see the section 'Workflow Files and Results').
  6. signac_project_document.json will also appear to track the status of jobs in the signac workflow.

Running the analysis scripts (see 'Final Analysis' below) will cause results directories to appear in the GPBO_rand/ and GPBO_nonoise/ directories with relevant human-readable data and plots. Subdirectories further categorize the results by case study and methods analyzed.

  1. Results_acq/ shows data where we analyze the best results based on how efficiently the acquisition function was optimized.
  2. Results_act/ shows data where we analyze the best results based on how efficiently the actual SSE was optimized.
  3. Results_gp/ shows data where we analyze the best results based on how efficiently the GP predicted SSE was optimized.

We note that this repository is based on the branch update_and_merge in the dowlinglab/Toy_Problem repository, which is private.

Workflow Files and Results

All workflow iterations were performed inside either GPBO_Emulators/GPBO_rand or GPBO_Emulators/GPBO_nonoise, and each iteration was managed with signac-flow. Inside GPBO_rand or GPBO_nonoise you will find all the files necessary to run the workflow. Note that you may not get exactly the same simulation results due to differences in software versions, random seeds, etc.

The raw results from this study are available on Google Drive. To reproduce all figures and tables in the paper, these files must be downloaded and added to GPBO_Emulators/GPBO_rand and GPBO_Emulators/GPBO_nonoise: the directories GPBO_Emulators/GPBO_rand/workspace and GPBO_Emulators/GPBO_nonoise/workspace must exist and contain all files from Google Drive before the analysis scripts are run. The analysis scripts work by parsing the signac workflow data and manipulating it into a form conducive to analysis. We use Google Drive because the files in which the GP models reside are often on the order of gigabytes, and the workspace folder has hundreds of subdirectories (many of which are also large), which makes storage on GitHub impractical.

When reproducing the results of this workflow, more practical, human-readable csv files and figures are stored under GPBO_aaa/Results_xxx/cs_name_val_y/, where aaa represents either the deterministic (nonoise) or stochastic (rand) implementation, xxx represents an analysis scheme, and y represents a case study (or set of case studies). For example, data pertaining to the BOD Curve case study for all methods, analyzed by actual SSE values, would be generated in the Results_act/cs_name_val_11/ep_enum_val_1/gp_package_gpflow/meth_name_val_in_1_2_3_4_5_6_7 directory. We specifically note that the best runs for each case study and method are provided under Results_xxx/cs_name_val_y/best_results.csv.
These results for each case study make up results.xlsx in the supporting information (SI). We also note that the nonlinear least squares results categorizing the number of local minima, as well as all runs for other derivative-free optimization (DFO) methods, are found in Results_act/cs_name_val_y.
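The directory convention above can be sketched in a few lines of Python. The helper below is purely illustrative (it is not part of the repository); the path components come from the pattern described in this section:

```python
import os

def results_dir(aaa: str, xxx: str, y: int) -> str:
    """Build GPBO_aaa/Results_xxx/cs_name_val_y following the naming
    convention described above (illustrative helper, not repo code)."""
    return os.path.join(f"GPBO_{aaa}", f"Results_{xxx}", f"cs_name_val_{y}")

# Best runs for a case study/method live in best_results.csv under that
# directory, e.g. the stochastic (rand) implementation, actual-SSE
# analysis, case study 11:
best = os.path.join(results_dir("rand", "act", 11), "best_results.csv")
```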

Workflow Code

All of the scripts for running the workflow are provided in bo_methods_lib/bo_methods_lib/GPBO_Classes_New.py. All scripts for analyzing the data in GPBO_rand/workspace and GPBO_nonoise/workspace are provided in GPBO_Emulators/make_*.py.

Figures

All scripts required to generate the primary figures in the manuscript and SI are reported under GPBO_Emulators/make_*.py. When running analysis scripts, these figures are saved under GPBO_aaa/Results_xxx/cs_name_val_y/*.

Installation

To run this software, you must have access to all packages in the gpbo-emul environment (gpbo-emul.yml), which can be installed using the instructions in the next section.

This package has a number of requirements that can be installed in different ways. We recommend using a conda environment to manage most of the installation and dependencies. However, some items will need to be installed from source or pip.

Running the simulations also requires an installation of TASMANIAN, which can be installed separately (see the Tasmanian installation instructions).

An example of the procedure is provided below:

# Install pip/conda available dependencies
# with a new conda environment named gpbo-emul
conda env create -f gpbo-emul.yml
conda activate gpbo-emul
python -m pip install Tasmanian
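After creating the environment, a quick sanity check is to confirm that the key packages resolve. The helper below is illustrative, not part of the repository; the module names mirror the Software Versions section (note that signac-flow imports as flow):

```python
from importlib.util import find_spec

def missing(packages):
    """Return the subset of packages that cannot be found in this environment."""
    return [p for p in packages if find_spec(p) is None]

# Key packages from the Software Versions section (module names assumed)
print(missing(["gpflow", "signac", "flow", "Tasmanian", "torch", "pygad"]))
# An empty list means the gpbo-emul environment resolved correctly.
```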

Usage

GPBO Workflow Execution

NOTE: We use signac and signac-flow to manage the setup and execution of the workflow. These instructions assume a working knowledge of that software.

WARNING: Running these scripts will overwrite your local copy of our data (GPBO_aaa/workspace/*) with the data from your workflow runs.

To run the GPBO workflow, follow the following steps:

  1. Use init_gpborand.py and init_gpbo_nonoise.py to initialize files for simulation use
      python init_gpborand.py
      python init_gpbo_nonoise.py
      cd GPBO_aaa (rand or nonoise)
    
  2. Do the remaining steps from within the GPBO_aaa directory (cd GPBO_aaa).
  3. Check the status a few times throughout the process
      python project_gpbo[aaa].py status 
    
  4. Modify project_gpbo[aaa].py if desired. This file is currently set to reproduce the paper results.
  5. Run the simulations
      python project_gpbo[aaa].py run -o run_ep_or_sf_exp
    

Note: running rm -r workspace/ signac_project_document.json signac.rc will remove everything and allow you to start fresh if you make a mistake.

Final Analysis

The final processing and figure generation steps can be run using the following once all signac jobs have finished. Note that for each make_* file in this section, the project must be updated to be either GPBO_rand or GPBO_nonoise depending on which workflow you want to generate results for.

WARNING: Running these scripts will overwrite your local copy of our workflow results (GPBO_aaa/Results_xxx/*) with the results from your workflow runs.

  1. Make bar charts for the objective data, time data, and derivative-free method data (Figures 2 and 7) and get all data shown in full-results.xlsx
    When this step is run, the data for Table 3, results.xlsx, and full-results.xlsx are generated. Note: results.xlsx is not automatically assembled; it is compiled from Results_act/*/*/*/*/best_results.csv for each method by make_full_results.py.

      python make_bar_charts.py
      python make_full_results.py
    
  2. Gather nonlinear least squares data including number of local minima (in Table 3) and the best NLS solution for each run

      python make_least_squares_data.py
    
  3. Make objective line plots (Figures 3 and S12-S29), parity plots (Figures S1-S11) for all case studies

    Hyperparameter and parameter value plots over BO iteration can also be generated in GPBO_aaa/workspace/

      python make_line_plots.py
    
  4. Make objective contour plots (Figures 4 and S30-S41) for the 2-Parameter case studies. This function also prints the MAE for each method.

      python make_1obj_hms.py
    
  5. (Optional) Make movies of the plots generated via python make_1obj_hms.py

      python make_movies_from_hms.py
    
  6. Gather condition number data for the best runs of each case study. Condition number data is generated in gpflow_condition_numbers_raw.csv (raw) and gpflow_condition_numbers.csv (analyzed)

      python make_cond_num_data.py
    
  7. Make a histogram of the Müller y0 data (Figure S52). This function also prints out the statistics mentioned in the paper.

      python make_muly0_hist.py
    

Known Issues

The instructions outlined above seem to be system-dependent. In some cases, users have encountered the following error:

ImportError: /lib64/libstdc++.so.6: version `GLIBCXX_3.4.29' not found

If you observe this, please try the following in the terminal

export LD_LIBRARY_PATH=$CONDA_PREFIX/lib:$LD_LIBRARY_PATH

which should fix the problem. This is not an optimal solution and is something we would like to address. We found that related projects have similar issues. If you are aware of a robust solution to this issue, please let us know by raising an issue or sending an email!
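To diagnose the error, you can check whether a given libstdc++ actually exports the required symbol version. The sketch below is an informal stand-in for running strings on the library and grepping for GLIBCXX; the helper and the commented path are assumptions, not part of this repository:

```python
import os

def exports_glibcxx(lib_path: str, version: str = "GLIBCXX_3.4.29") -> bool:
    """Return True if the shared library embeds the given symbol-version
    string; a crude stand-in for `strings libstdc++.so.6 | grep GLIBCXX`."""
    with open(lib_path, "rb") as f:
        return version.encode() in f.read()

# Typical check against the conda-provided library (path is an assumption):
# exports_glibcxx(os.path.join(os.environ["CONDA_PREFIX"], "lib", "libstdc++.so.6"))
```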

Credits

This work is funded by the Graduate Assistance in Areas of National Need fellowship from the Department of Education via grant number P200A210048, the National Science Foundation via award number CBET-1917474, the University of Notre Dame College of Engineering and Graduate School, and uses the computing resources provided by the Center for Research Computing (CRC) at the University of Notre Dame. Ke Wang also acknowledges the Patrick and Jana Eilers Graduate Student Fellowship for Energy-Related Research for providing financial support to advance this research. Montana Carlozo also acknowledges Dr. Ryan Smith and Dr. Bridgette Befort who offered technical guidance in the development of this work.

Contact

Please contact Montana Carlozo (mcarlozo@nd.edu) or Dr. Alex Dowling (adowling@nd.edu) with any questions, suggestions, or issues.

Software Versions

This section lists software versions for the most important packages.
gpflow==2.9.1
numpy==1.24.4
pandas==1.4.2
Pyomo==6.6.2
pytest==7.2.0
Python==3.9.12
scipy==1.8.0
signac==1.8.0
signac-flow==0.21.0
Tasmanian==7.7.1
torch==1.11.0
torchaudio==0.13.1
torchvision==0.14.1
pygad==3.3.1
