Authors: Montana N. Carlozo, Ke Wang, and Alexander W. Dowling
GPBO_Emulators is a repository used to calibrate computationally expensive models given experimental data. The key feature of this work is the use of machine learning tools, namely Gaussian processes (GPs) and Bayesian optimization (BO), which allow us to sample parameter space intelligently to decrease the objective. This work compares standard GPBO, in which a GP models an expensive objective function, to emulator GPBO, in which the expensive function is emulated directly by the GP.
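As a hedged illustration of this distinction (the notation here is an assumption for exposition, not necessarily the manuscript's), suppose the objective is a sum of squared errors (SSE) between experimental data $y_i$ measured at conditions $x_i$ and the predictions of an expensive model $f(x_i, \theta)$ with parameters $\theta$:

```latex
\mathrm{SSE}(\theta) = \sum_{i=1}^{n} \left( y_i - f(x_i, \theta) \right)^2
```

Under this reading, standard GPBO would train the GP to approximate $\mathrm{SSE}(\theta)$ itself, whereas emulator GPBO would train the GP to approximate $f(x, \theta)$ and then compute the objective from the GP predictions.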
This work was submitted to Industrial & Engineering Chemistry Research (I&ECR). Please cite as:
Montana N. Carlozo, Ke Wang, Alexander W. Dowling, “Bayesian Optimization Methods for Nonlinear Model Calibration”, 2024
The repository is organized as follows:
GPBO_Emulators/ is the top level directory. It contains:
- .gitignore prevents large files from the signac workflow and plots from being tracked by git and prevents tracking of other unimportant files.
- init_gpborand.py generates jobs to run for the stochastic model implementation of the workflow with signac
- init_gpbo_nonoise.py generates jobs to run for the deterministic model implementation of the workflow with signac
- make_1meth_hms.py makes contour plots of Simulated SSE, GP SSE, GP SSE variance, and acquisition function for a given GP model.
- make_1obj_hms.py makes contour plots of either Simulated SSE, GP SSE, GP SSE variance, or acquisition function for a given GP model for each method.
- make_bar_charts.py makes Figures 2 and 7 from the main text (bar charts of relevant data).
- make_cond_num_data.py generates condition number data for the best GP models.
- make_least_squares_data.py generates data related to nonlinear least squares (NLS), including categorizing the number of local minima and the best-performing NLS runs.
- make_line_plots.py generates all line plots in the main text and SI including parity plots and plots of BO iteration for SSE and acquisition function for all modes of evaluation.
- make_movies_from_hms.py makes movies (.mp4) from the contour plots generated with make_1obj_hms.py.
- make_muly0_hist.py makes the histogram for Müller y0 case study data.
- make_full_results.py makes the full-results.csv and results.csv files from the SI.
- gpbo-emul.yml is the environment for running this workflow.
Directory bo_methods_lib/ contains the package for running the workflow.
bo_methods_lib/bo_methods_lib/ contains the following files:
- GPBO_Classes_New.py is the main script for the algorithm.
- GPBO_Class_fxns.py contains helper functions that define parameters for the multiple case studies. It is also useful for mapping the numerical markers of case studies to their formal names from the manuscript.
- analyze_data.py and GPBO_Classes_plotters.py are used for analyzing and plotting the results.
- tests/*.py contains test functions for public methods of all classes in GPBO_Classes_New.py.
Directories GPBO_rand/ and GPBO_nonoise/ are initially created via init_gpborand.py and init_gpbo_nonoise.py in the top directory through signac.
Each contains the following files/subdirectories:
- delete_jobs.py is a script for quickly deleting targeted jobs/results in signac.
- view_unfinished_jobs.py is a script for viewing individual case study job progress.
- project_gpborand.py (project_gpbononoise.py) is the script for running the workflow using signac.
- templates/ contains the templates required to run this workflow in signac on the CRC.
- workspace/ will appear after running init_gpbo*.py and stores all raw results generated during the workflow. This folder is not tracked by git due to its size. The workspace/ folder for this study can be downloaded from Google Drive (see section 'Workflow Files and Results').
- signac_project_document.json will also appear to track the status of jobs in the signac workflow
Running the analysis (steps 6-11 below) will cause results directories to appear in the GPBO_rand/ and GPBO_nonoise/ directories with relevant human-readable data and plots. Subdirectories further categorize the results by case study and methods analyzed.
- Results_acq/ shows data where we analyze the best results based on how efficiently the acquisition function was optimized.
- Results_act/ shows data where we analyze the best results based on how efficiently the actual SSE was optimized.
- Results_gp/ shows data where we analyze the best results based on how efficiently the GP predicted SSE was optimized.
We note that this repository is based on the branch update_and_merge in the dowlinglab/Toy_Problem repository, which is private.
All workflow iterations were performed inside either GPBO_Emulators/GPBO_rand or GPBO_Emulators/GPBO_nonoise where it exists.
Each iteration was managed with signac-flow. Inside GPBO_rand or GPBO_nonoise, you will find all the necessary files to
run the workflow. Note that you may not get the exact same simulation
results due to differences in software versions, random seeds, etc.
The raw results from this study are available in Google Drive. These files are necessary to download and add to GPBO_Emulators/GPBO_rand and GPBO_Emulators/GPBO_nonoise to reproduce all figures and tables identified in the paper. As such, the directories GPBO_Emulators/GPBO_rand/workspace and GPBO_Emulators/GPBO_nonoise/workspace must exist and contain all files from the google drive before analysis scripts are run. The analysis scripts work by parsing the signac workflow data and manipulating the data into a form conducive for analysis. The files are saved on google drive because the files in which the GP models reside are often on the order of gigabytes. For other plots, we also use Google Drive because the workspace folder has hundreds of subdirectories (many of which are also large) which makes storage on GitHub impractical.
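Because the analysis scripts depend on these directories, it can help to confirm they are present before running anything. The following is a minimal sketch, not part of the repository: the function name `missing_workspaces` is hypothetical, and only the directory names come from this README.

```python
from pathlib import Path

# Project directory names are taken from this README; the rest is illustrative.
PROJECT_DIRS = ("GPBO_rand", "GPBO_nonoise")


def missing_workspaces(repo_root):
    """Return the workspace/ directories that are absent or empty.

    `repo_root` is assumed to be the path to the GPBO_Emulators/ directory.
    """
    missing = []
    for name in PROJECT_DIRS:
        workspace = Path(repo_root) / name / "workspace"
        # any() over iterdir() is False when the directory has no entries.
        if not workspace.is_dir() or not any(workspace.iterdir()):
            missing.append(workspace)
    return missing
```

Calling `missing_workspaces(".")` from the top-level directory returns an empty list when both workspace/ folders exist and contain at least one entry. It does not verify that the download from Google Drive is complete.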
When reproducing the results of this workflow, more practical, human-readable csv files and figures are stored under GPBO_aaa/Results_xxx/cs_name_val_y/, where aaa represents either the deterministic (nonoise) or stochastic (rand) implementation, xxx represents the analysis scheme (acq, act, or gp), and y represents a case study or set of case studies. For example, data pertaining to the BOD Curve case study for all methods where the actual SSE values were analyzed would be generated in the Results_act/cs_name_val_11/ep_enum_val_1/gp_package_gpflow/meth_name_val_in_1_2_3_4_5_6_7 directory. The best runs for each case study and method are provided under Results_xxx/cs_name_val_y/best_results.csv. These results for each case study make up results.xlsx in the supporting information (SI). Nonlinear least squares results categorizing the number of minima, and all runs for other derivative-free optimization (DFO) methods, are found in Results_act/cs_name_val_y.
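The directory layout described above can be searched programmatically. The sketch below is an illustration rather than repository code: `find_best_results` is a hypothetical helper, and it searches recursively so that it does not rely on the exact nesting depth beneath Results_xxx/.

```python
from pathlib import Path


def find_best_results(project_dir, scheme="act"):
    """List every best_results.csv under Results_<scheme>/, sorted by path.

    `project_dir` is assumed to be GPBO_rand/ or GPBO_nonoise/, and `scheme`
    one of "acq", "act", or "gp" as described in this README.
    """
    results_root = Path(project_dir) / f"Results_{scheme}"
    if not results_root.is_dir():
        # No analysis has been run yet for this scheme.
        return []
    # rglob walks all subdirectories, whatever their depth.
    return sorted(results_root.rglob("best_results.csv"))
```

For example, `find_best_results("GPBO_rand", "act")` would return one path per case study and method grouping once the analysis scripts have produced their output.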
All of the scripts for running the workflow are provided in
bo_methods_lib/bo_methods_lib/GPBO_Classes_New. All scripts for analyzing the data in GPBO_rand/workflow and GPBO_nonoise/workflow are provided in GPBO_Emulators/make_*.py.
All scripts required to generate the primary figures in the
manuscript and SI are reported under GPBO_Emulators/make_*.py. When running analysis scripts, these figures are saved under GPBO_aaa/Results_xxx/cs_name_val_y/*.
To run this software, you must have access to all packages in the gpbo-emul environment (gpbo-emul.yml) which can be installed using the instructions in the next section.
This package has a number of requirements that can be installed in
different ways. We recommend using a conda environment to manage
most of the installation and dependencies. However, some items will
need to be installed from source or pip.
Running the simulations will also require an installation of TASMANIAN.
This can be installed separately (see the TASMANIAN installation instructions).
An example of the procedure is provided below:
# Install pip/conda available dependencies
# with a new conda environment named gpbo-emul
conda env create -f gpbo-emul.yml
conda activate gpbo-emul
python -m pip install Tasmanian
NOTE: We use Signac and signac flow
to manage the setup and execution of the workflow. These
instructions assume a working knowledge of that software.
WARNING: Running these scripts will overwrite your local copy of our data (GPBO_aaa/workflow/*) with the data from your workflow runs.
To run the GPBO workflow, follow the following steps:
- Use init_gpborand.py and init_gpbo_nonoise.py to initialize files for simulation use
    python init_gpborand.py
    python init_gpbo_nonoise.py
    cd GPBO_aaa (rand or nonoise)
- Do the following in the GPBO_aaa directory (cd GPBO_aaa):
  - Check status a few times throughout the process
      python project_gpbo[aaa].py status
  - Modify project_gpbo[aaa].py if desired. This file is currently set to reproduce the paper results.
  - Run the simulations
      python project_gpbo[aaa].py run -o run_ep_or_sf_exp
Note: rm -r workspace/ signac_project_document.json signac.rc will remove everything and allow you to start fresh if you make a mistake.
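The steps above can also be collected into a small driver. This is a sketch under stated assumptions: `workflow_commands` is a hypothetical helper that is not part of the repository, and the script names are those listed in this README (init_gpborand.py, init_gpbo_nonoise.py, project_gpborand.py, project_gpbononoise.py).

```python
def workflow_commands(variant, operation="run_ep_or_sf_exp"):
    """Build (working_directory, command) pairs for one workflow variant.

    `variant` is "rand" (stochastic) or "nonoise" (deterministic).
    Directories are relative to the top-level GPBO_Emulators/ directory.
    """
    init_scripts = {"rand": "init_gpborand.py", "nonoise": "init_gpbo_nonoise.py"}
    project_scripts = {"rand": "project_gpborand.py",
                       "nonoise": "project_gpbononoise.py"}
    if variant not in init_scripts:
        raise ValueError(f"variant must be 'rand' or 'nonoise', got {variant!r}")
    project_dir = f"GPBO_{variant}"
    project = project_scripts[variant]
    return [
        (".", ["python", init_scripts[variant]]),                # initialize jobs
        (project_dir, ["python", project, "status"]),            # check status
        (project_dir, ["python", project, "run", "-o", operation]),  # run jobs
    ]
```

Each pair could then be executed with `subprocess.run(command, cwd=directory, check=True)`. Keeping command construction separate from execution makes it possible to inspect the sequence before anything overwrites local data.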
The final processing and figure generation steps can be run using the following once all signac jobs have finished. Note that for each make_* file in this section, the project must be updated to be either GPBO_rand or GPBO_nonoise depending on which workflow you want to generate results for.
WARNING: Running these scripts will overwrite your local copy of our workflow results (GPBO_aaa/Results_xxx/*) with the results from your workflow runs.
- Make bar charts for the objective data, time data, and derivative-free method data (Figures 2 and 7) and get all data shown in full-results.xlsx.
  When these scripts are run, the data for Table 3, results.xlsx, and full-results.xlsx are generated. Note: results.xlsx is not automatically assembled. It is composed of Results_act/*/*/*/*/best_results.csv for each method, which is compiled by make_full_results.py.
    python make_bar_charts.py
    python make_full_results.py
- Gather nonlinear least squares data including number of local minima (in Table 3) and the best NLS solution for each run.
    python make_least_squares_data.py
- Make objective line plots (Figures 3 and S12-S29) and parity plots (Figures S1-S11) for all case studies.
  Hyperparameter and parameter value plots over BO iteration can also be generated in GPBO_aaa/workspace/.
    python make_line_plots.py
- Make objective contour plots (Figures 4 and S30-S41) for the 2-Parameter case studies. This script also prints the MAE for each method.
    python make_1obj_hms.py
- (Optional) Make movies of the plots generated via make_1obj_hms.py.
    python make_movies_from_hms.py
- Gather condition number data for the best runs of each case study. Condition number data is generated in gpflow_condition_numbers_raw.csv (raw) and gpflow_condition_numbers.csv (analyzed).
    python make_cond_num_data.py
- Make a histogram of Müller y0 data (Figure S52). This script also prints the statistics mentioned in the paper.
    python make_muly0_hist.py
The instructions outlined above appear to be system-dependent. In some cases, users encounter the following error:
ImportError: /lib64/libstdc++.so.6: version `GLIBCXX_3.4.29' not found
If you observe this, please try the following in the terminal
export LD_LIBRARY_PATH=$CONDA_PREFIX/lib:$LD_LIBRARY_PATH
which should fix the problem. This is not an optimal solution and is something we would like to address. We found that related projects have similar issues. If you are aware of a robust solution to this issue, please let us know by raising an issue or sending an email!
This work is funded by the Graduate Assistance in Areas of National Need fellowship from the Department of Education via grant number P200A210048, the National Science Foundation via award number CBET-1917474, the University of Notre Dame College of Engineering and Graduate School, and uses the computing resources provided by the Center for Research Computing (CRC) at the University of Notre Dame. Ke Wang also acknowledges the Patrick and Jana Eilers Graduate Student Fellowship for Energy-Related Research for providing financial support to advance this research. Montana Carlozo also acknowledges Dr. Ryan Smith and Dr. Bridgette Befort who offered technical guidance in the development of this work.
Please contact Montana Carlozo (mcarlozo@nd.edu) or Dr. Alex Dowling (adowling@nd.edu) with any questions, suggestions, or issues.
This section lists software versions for the most important packages.
gpflow==2.9.1
numpy==1.24.4
pandas==1.4.2
Pyomo==6.6.2
pytest==7.2.0
Python==3.9.12
scipy==1.8.0
signac==1.8.0
signac-flow==0.21.0
Tasmanian==7.7.1
torch==1.11.0
torchaudio==0.13.1
torchvision==0.14.1
pygad==3.3.1
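To compare an existing environment against these pins, a check along the following lines may be useful. This is a minimal sketch using the standard library's `importlib.metadata`; the helper name `version_mismatches` and the `PINNED` subset are illustrative and not part of the repository.

```python
from importlib.metadata import PackageNotFoundError, version

# A subset of the versions listed above, keyed by distribution name.
PINNED = {
    "gpflow": "2.9.1",
    "numpy": "1.24.4",
    "pandas": "1.4.2",
    "scipy": "1.8.0",
    "signac": "1.8.0",
    "signac-flow": "0.21.0",
}


def version_mismatches(pins):
    """Map each mismatched package to (expected, installed).

    `installed` is None when the package is not found in the environment.
    """
    mismatches = {}
    for name, expected in pins.items():
        try:
            installed = version(name)
        except PackageNotFoundError:
            installed = None
        if installed != expected:
            mismatches[name] = (expected, installed)
    return mismatches
```

Running `version_mismatches(PINNED)` inside the gpbo-emul environment should return an empty dictionary if the installed versions match the list. A non-empty result is one possible explanation for differences in simulation output.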