I'm using this repo as a compact, reproducible playground for comparing a few tabular ML baselines against a tiny neural net on engineering-flavoured datasets. The goal is to show that I can set up clean pipelines, keep runs reproducible, and explain the trade-offs I'm seeing.
- I wanted an end-to-end benchmark I can run quickly when I talk about engineering design ML.
- The IDEAL lab at ETH keeps asking for tidy infrastructure, so I've packaged datasets, models, metrics, and plots the way I like to work.
- Everything is small on purpose: it's easy to rerun, inspect, and extend without spinning up heavy tooling.
None of the datasets requires deep physics knowledge; they just reward solid ML hygiene.
- A reproducible pipeline around two regression tasks:
- Airfoil Self-Noise: predict sound pressure level from airfoil and flow descriptors.
- Concrete Compressive Strength: predict strength from mix proportions.
- Standardised train/val/test splits with RMSE, MAE, R^2, and timing metrics.
- JSON logs plus plotting helpers for quick comparisons.
- A conditional-VAE generative design benchmark (`benchmarks/run_generative_design.py`) that logs validity, diversity, and surrogate-comparison diagnostics.
- An inverse-design loop (`inverse_design.py`) that lets me test surrogate-driven optimisation under different data budgets.
- Install Miniforge: https://github.com/conda-forge/miniforge
- Create and activate the environment:

  ```bash
  conda create -n mini-engibench python=3.11 -y
  conda activate mini-engibench
  ```

- Pull in the scientific stack:

  ```bash
  conda install -c conda-forge numpy pandas scipy scikit-learn matplotlib jupyterlab -y
  conda install -c conda-forge xgboost openpyxl -y
  ```

- Optional: install PyTorch (the CPU wheel works fine; MPS is a bonus if available).

  ```bash
  pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu
  ```

  If I stick to pip, I run `pip install -r requirements.txt` inside the environment above.
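Before the first run I sometimes sanity-check the environment. A stdlib-only sketch (the package list is my assumption about what the benchmarks import):

```python
import importlib.util

# Packages the benchmarks are assumed to import; adjust to taste.
REQUIRED = ["numpy", "pandas", "scipy", "sklearn", "matplotlib", "xgboost"]
OPTIONAL = ["torch"]

def check_packages(names):
    """Return the subset of `names` that cannot be imported."""
    return [n for n in names if importlib.util.find_spec(n) is None]

if __name__ == "__main__":
    missing = check_packages(REQUIRED)
    if missing:
        print("Missing required packages:", ", ".join(missing))
    else:
        print("Scientific stack looks complete.")
    if check_packages(OPTIONAL):
        print("PyTorch not found; the neural-net baseline will be skipped.")
```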
```
mini_engibench/
  datasets/    # dataset loaders with a load_<task>() helper
  models/      # wrappers exposing fit/predict
  benchmarks/  # scripts orchestrating experiments and logging metrics
  results/     # JSON outputs and figures
  notebooks/   # scratchpads or mini walk-throughs
```
Keeping loaders, models, and experiment scripts in separate folders makes it painless to add new pieces without touching everything else.
- Airfoil Self-Noise (UCI): small aerospace regression dataset that runs quickly.
- Concrete Compressive Strength (UCI): civil engineering flavour with different target behaviour.
Both are public, tidy, and easy to standardise, so I can focus on the benchmarking side.
Each loader also accepts return_metadata=True to expose the scaler and feature bounds needed for generative validity checks.
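The repo's loaders aren't reproduced here; a hypothetical loader following the same contract, on synthetic data, might look like:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

def load_synthetic(return_metadata=False, seed=0):
    """Hypothetical loader mirroring the load_<task>() contract:
    (train, val, test) splits and, optionally, the fitted scaler
    plus per-feature bounds for generative validity checks."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(300, 5))
    y = X @ rng.normal(size=5) + rng.normal(scale=0.1, size=300)

    # 60/20/20 split, standardised on the training portion only.
    X_tr, X_tmp, y_tr, y_tmp = train_test_split(X, y, test_size=0.4, random_state=seed)
    X_val, X_te, y_val, y_te = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=seed)

    scaler = StandardScaler().fit(X_tr)
    splits = tuple((scaler.transform(Xs), ys)
                   for Xs, ys in [(X_tr, y_tr), (X_val, y_val), (X_te, y_te)])
    if return_metadata:
        X_tr_scaled = scaler.transform(X_tr)
        bounds = {"min": X_tr_scaled.min(axis=0), "max": X_tr_scaled.max(axis=0)}
        return splits, {"scaler": scaler, "bounds": bounds}
    return splits
```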
- Run the airfoil benchmark:

  ```bash
  python -m benchmarks.run_airfoil
  ```

  This writes metrics and timings to `results/airfoil.json`.

- Run the concrete benchmark:

  ```bash
  python -m benchmarks.run_concrete
  ```

  Outputs land in `results/concrete.json`.
- Regenerate the comparison plots:

  ```bash
  python -m benchmarks.plot_results --task airfoil
  python -m benchmarks.plot_results --task concrete
  ```

  Figures show up as `results/fig_airfoil.png` and `results/fig_concrete.png`.
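The internals of `plot_results` aren't shown here; a minimal stand-in for the comparison figure, assuming a `{model: {metric: value}}` results dict and made-up numbers, could be:

```python
import os
import tempfile
import matplotlib
matplotlib.use("Agg")  # headless-safe backend
import matplotlib.pyplot as plt

def plot_metric_comparison(metrics, metric="rmse", out_path="fig.png"):
    """Bar chart comparing one metric across models from a results dict."""
    names = list(metrics)
    values = [metrics[m][metric] for m in names]
    fig, ax = plt.subplots(figsize=(5, 3))
    ax.bar(names, values)
    ax.set_ylabel(metric.upper())
    ax.set_title("Model comparison")
    fig.tight_layout()
    fig.savefig(out_path)
    plt.close(fig)
    return out_path

# Example with fabricated scores, purely for illustration.
fake = {"linear": {"rmse": 4.8}, "xgboost": {"rmse": 2.1}, "mlp": {"rmse": 2.6}}
out = plot_metric_comparison(fake, out_path=os.path.join(tempfile.gettempdir(), "fig_demo.png"))
```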
- Check how models behave with less data:

  ```bash
  python -m benchmarks.run_data_efficiency --task airfoil
  python -m benchmarks.run_data_efficiency --task concrete
  ```

  Each run logs `results/<task>_data_efficiency.json` for 10%, 20%, and 50% training splits. Plotting helper:

  ```bash
  python -m benchmarks.plot_data_efficiency --task airfoil --split test
  python -m benchmarks.plot_data_efficiency --task concrete --split test
  ```

  Swap in `--split val` if I only care about validation scores. The figures land under `results/<task>_data_eff_*.png`.
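The data-efficiency protocol can be sketched as refitting on random subsets of the training data; the Ridge model here is just a placeholder for whichever baseline is being measured:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error

def data_efficiency_curve(X_train, y_train, X_test, y_test,
                          fractions=(0.1, 0.2, 0.5), seed=0):
    """Refit on random subsets of the training data and record
    test RMSE per fraction (mirrors the 10/20/50% splits)."""
    rng = np.random.default_rng(seed)
    results = {}
    for frac in fractions:
        n = max(2, int(frac * len(X_train)))
        idx = rng.choice(len(X_train), size=n, replace=False)
        model = Ridge().fit(X_train[idx], y_train[idx])
        preds = model.predict(X_test)
        results[frac] = float(np.sqrt(mean_squared_error(y_test, preds)))
    return results
```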
- Spin up the inverse-design loop:

  ```bash
  python -m benchmarks.run_inverse_design --task airfoil --budget 25 --num-runs 10
  python -m benchmarks.run_inverse_design --task concrete --budget 25 --num-runs 10
  ```

  This logs optimisation traces to `results/<task>_inverse_design.json` for different starting data budgets. Plot the optimisation traces:

  ```bash
  python -m benchmarks.plot_inverse_design --task airfoil --quantity best
  python -m benchmarks.plot_inverse_design --task concrete --quantity regret
  ```

  Saved figures follow `results/<task>_inverse_design_<quantity>.png`.
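The exact definitions behind `--quantity best` and `--quantity regret` aren't spelled out above; under the usual conventions (maximisation against a known or estimated optimum) they reduce to:

```python
import numpy as np

def best_so_far(scores):
    """Cumulative best objective value along an optimisation trace
    (assuming higher is better)."""
    return np.maximum.accumulate(np.asarray(scores, dtype=float))

def simple_regret(scores, optimum):
    """Gap between the optimum and the best design found so far."""
    return optimum - best_so_far(scores)

trace = [0.2, 0.5, 0.4, 0.8, 0.7]
# best-so-far: [0.2, 0.5, 0.5, 0.8, 0.8]
# regret vs optimum 1.0: [0.8, 0.5, 0.5, 0.2, 0.2]
```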
- Sample designs with the conditional VAE benchmark:

  ```bash
  python -m benchmarks.run_generative_design --task airfoil --num-samples 256 --eval-surrogate
  python -m benchmarks.run_generative_design --task concrete --num-samples 256 --eval-surrogate
  ```

  The script saves `results/<task>_generative_cvae.json` with the full training config, validity/diversity stats, surrogate score summaries, and (when `--eval-surrogate` is used) a histogram under `results/<task>_generative_performance_hist.png`.
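The validity check against the scaled training envelope can be sketched with per-feature bounds; the returned field names echo the log fields but are otherwise my assumption:

```python
import numpy as np

def validity_report(samples, lower, upper):
    """Check scaled samples against the per-feature training envelope.
    Returns valid indices and per-feature violation counts, roughly
    the information a validity block would record."""
    samples = np.asarray(samples, dtype=float)
    in_bounds = (samples >= lower) & (samples <= upper)  # (n, d) mask
    valid_mask = in_bounds.all(axis=1)
    return {
        "n_valid": int(valid_mask.sum()),
        "n_invalid": int((~valid_mask).sum()),
        "valid_design_indices": np.flatnonzero(valid_mask).tolist(),
        "per_feature_violations": (~in_bounds).sum(axis=0).tolist(),
    }
```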
- MAE / RMSE: lower is better.
- R^2: closer to 1 means more variance explained.
- Train / inference timing: I keep an eye on these when thinking about optimisation loops.
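These three metrics are standard; for reference, a small helper mirroring what the benchmarks log:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

def regression_metrics(y_true, y_pred):
    """The three headline regression metrics logged per model."""
    return {
        "mae": float(mean_absolute_error(y_true, y_pred)),
        "rmse": float(np.sqrt(mean_squared_error(y_true, y_pred))),
        "r2": float(r2_score(y_true, y_pred)),
    }

# Perfect predictions give MAE = RMSE = 0 and R^2 = 1.
```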
Each CVAE run writes results/<task>_generative_cvae.json with:
- `validity`: counts of valid/invalid samples plus per-feature constraint violations.
- `valid_design_indices`: zero-based indices for the samples that stayed within the scaled training envelope.
- `diversity`: average pairwise distance among valid unscaled designs when metadata is available.
- `performance`: surrogate statistics on the valid set and the histogram path (if `--eval-surrogate`).
- `paper1_comparison`: best inverse-design baseline recovered from `results/<task>_inverse_design.json` to compare against.
When the surrogate is evaluated, the script also saves results/<task>_generative_performance_hist.png to visualise the score distribution.
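Reading the log back is plain JSON; a self-contained demo with a fabricated log (the exact keys inside `validity` are my assumption):

```python
import json
import os
import tempfile

# Minimal fake log; the top-level fields follow the list above,
# but the key names inside "validity" are my assumption.
fake_log = {
    "validity": {"n_valid": 240, "n_invalid": 16, "per_feature_violations": [3, 13]},
    "valid_design_indices": list(range(240)),
}

path = os.path.join(tempfile.gettempdir(), "airfoil_generative_cvae_demo.json")
with open(path, "w") as f:
    json.dump(fake_log, f)

with open(path) as f:
    log = json.load(f)

print(f"{log['validity']['n_valid']} valid / {log['validity']['n_invalid']} invalid samples")
```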
- New dataset: drop a loader in `datasets/` with a `load_<name>()` helper that returns train/val/test splits and, when `return_metadata=True`, the fitted scaler plus min/max ranges for generative validity checks.
- New model: add a wrapper in `models/` exposing `fit` and `predict`, just like the existing ones.
- New task: create `benchmarks/run_<name>.py` that wires loaders and models together and dumps JSON the same way.

The consistent interface keeps any additions from breaking the rest of the pipeline.
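A new model wrapper can be as small as this sketch around scikit-learn's Ridge (the `name` attribute and constructor signature are my guess at the convention):

```python
from sklearn.linear_model import Ridge

class RidgeWrapper:
    """Minimal wrapper matching the fit/predict interface the
    benchmarks expect."""

    name = "ridge"  # label used in logs/plots; naming is assumed

    def __init__(self, alpha=1.0):
        self.model = Ridge(alpha=alpha)

    def fit(self, X, y):
        self.model.fit(X, y)
        return self

    def predict(self, X):
        return self.model.predict(X)
```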
When I write things up, I usually cover: brief motivation, dataset summaries, models and metrics, a couple of plots from plot_results, observations on trade-offs, and the exact commands plus environment details for reproducibility.
- If `xgboost` fails to import, reinstall it from `conda-forge` inside the environment above.
- If PyTorch cannot see MPS, I stick to CPU; the workloads are tiny.
- If UCI downloads hiccup because of SSL, I download the CSV manually and point the loader to it.
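For the SSL workaround, the loader only needs a local path; a hedged sketch of that fallback (the error message's `csv_path` keyword is hypothetical, and parsing stays with the loader):

```python
import csv
import tempfile
from pathlib import Path

def read_local_or_fail(path):
    """Fallback for flaky UCI downloads: read a manually downloaded
    CSV instead of fetching over SSL."""
    p = Path(path)
    if not p.exists():
        raise FileNotFoundError(
            f"{p} not found. Download the dataset manually and pass its "
            "path to the loader (e.g. a csv_path argument in this sketch)."
        )
    with p.open(newline="") as f:
        return list(csv.reader(f))

# Demo with a tiny, manually "downloaded" CSV.
demo = Path(tempfile.gettempdir()) / "airfoil_demo.csv"
demo.write_text("freq,angle,sspl\n800,0.0,126.2\n")
rows = read_local_or_fail(demo)
```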
- Add a third engineering dataset (maybe something CFD-related).
- Wire in lightweight hyperparameter sweeps with Optuna or similar.
- Ship a `Dockerfile` or `environment.yml` for one-command setup.
- Publish the repo publicly and archive a release on Zenodo for a DOI.
That's the whole setup.