DPP-PDC: Bayesian Diversity Control for Phase Diagram Construction

A Bayesian active learning framework for efficient phase diagram determination using uncertainty-weighted Determinantal Point Processes (UwDPP).

Overview

Machine learning methods are increasingly used for experimental design in phase diagram determination. Some methods perform batch design, in which multiple points are sampled from the design space at once. In this setting, it is essential to diversify the samples to avoid performing nearly identical experiments, and to control the level of diversity appropriately. Manual diversity control is unintuitive and may require additional trial and error before the experiments start. We propose a Bayesian model, the determinantal point process for phase diagram construction (DPP-PDC), that performs batch design and automatic diversity control simultaneously. Central to this model is the uncertainty-weighted determinantal point process (UwDPP), which samples a set of points with high uncertainty under diversity control. Experiments with the Cu-Mg-Zn ternary system demonstrate that DPP-PDC can actively control sample diversity to achieve high efficiency.
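
As a rough illustration of how uncertainty weighting and diversity interact in a DPP, the sketch below builds the kernel L[i][j] = u_i * k(x_i, x_j) * u_j from per-point uncertainties u and a Gaussian similarity k, then greedily picks a batch that approximately maximizes det(L_S). This is the standard quality-diversity construction and is not necessarily the kernel DPP-PDC uses. The function names, the Gaussian similarity, and the role of `sigma2` as a hand-set bandwidth are all assumptions made for illustration; DPP-PDC controls diversity automatically, whereas here it is fixed by hand.

```python
import math

def rbf_kernel(x, y, sigma2):
    """Gaussian similarity between two points; sigma2 is an assumed bandwidth."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma2))

def greedy_uncertainty_weighted_dpp(points, uncertainty, batch_size, sigma2):
    """Greedily pick indices that approximately maximize det(L_S),
    where L[i][j] = u_i * k(x_i, x_j) * u_j (hypothetical helper)."""
    n = len(points)
    # Marginal gain of each candidate; initially the diagonal L[i][i] = u_i^2.
    gain = [u * u for u in uncertainty]
    # Rows of an incremental Cholesky factor, one list per candidate.
    chol = [[] for _ in range(n)]
    selected = []
    while len(selected) < min(batch_size, n):
        j = max((i for i in range(n) if i not in selected), key=lambda i: gain[i])
        if gain[j] <= 1e-12:
            break  # remaining candidates add essentially no volume
        selected.append(j)
        root = math.sqrt(gain[j])
        for i in range(n):
            if i in selected:
                continue
            l_ji = uncertainty[j] * rbf_kernel(points[j], points[i], sigma2) * uncertainty[i]
            e = (l_ji - sum(a * b for a, b in zip(chol[j], chol[i]))) / root
            chol[i].append(e)
            gain[i] -= e * e  # similar points lose gain once a neighbour is chosen
    return selected

# Two near-duplicate, highly uncertain points and one distant, less uncertain point.
points = [(0.0, 0.0), (0.01, 0.0), (1.0, 1.0)]
uncertainty = [1.0, 0.95, 0.6]
wide = greedy_uncertainty_weighted_dpp(points, uncertainty, 2, sigma2=0.5)
narrow = greedy_uncertainty_weighted_dpp(points, uncertainty, 2, sigma2=1e-6)
```

With the wide bandwidth the two near-duplicate points count as redundant, so the distant point is chosen second. With the very narrow bandwidth the selection falls back to ranking by uncertainty alone.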

Installation

Prerequisites

Python 3.12 or higher
We recommend using a virtual environment

Option 1: Using conda (Recommended)

# Create a new environment
conda create -n dpp-pdc python=3.12 -y

# Activate the environment
conda activate dpp-pdc

# Clone the repository
git clone https://github.com/tsudalab/DPP-PDC.git
cd DPP-PDC

# Install the package
pip install -e .

Option 2: Using venv

# Clone the repository
git clone https://github.com/tsudalab/DPP-PDC.git
cd DPP-PDC

# Create virtual environment
python -m venv venv

# Activate (Linux/macOS)
source venv/bin/activate
# Activate (Windows)
venv\Scripts\activate

# Install the package
pip install -e .

Verify Installation

# Check CLI is available
dpp-pdc --help

# Or test import
python -c 'import dpp_pdc; print("Installation successful!")'

Quick Start

1. Command Line Interface (Recommended)

Run a simple experiment:

dpp-pdc run --temperatures 500K --algorithms DPP-PDC --batch-sizes 10

Run the full benchmark from the paper:

dpp-pdc run --data datasets/Cu-Mg_Zn_all.csv --algorithms RS PDC K-Medoids DPP-PDC --batch-sizes 5 8 10 12 --n-runs 10 --max-sampling 300

2. Using Configuration Files

Generate a configuration template:

dpp-pdc init-config --output configs/my_config.toml

Edit the configuration file, then run:

dpp-pdc run --config configs/my_config.toml

3. Python API

from dpp_pdc import standardize_data
from dpp_pdc.run_experiments import run_single_temperature_experiment

# Load and preprocess data
scaler, X_std, y_encoded = standardize_data("datasets/Cu-Mg_Zn_500K.csv")

# Run experiment
results = run_single_temperature_experiment(
    temperature="500K",
    file_path_template="datasets/Cu-Mg_Zn_{temperature}.csv",
    algorithms=["RS", "PDC", "DPP-PDC"],
    batch_sizes=[10, 12],
    n_runs=5,
    max_sampling=300
)

CLI Reference

`dpp-pdc run`

Run active learning experiments.

Argument	Short	Description	Default
`--config`	`-c`	Path to TOML configuration file	-
`--data`	`-d`	Path to dataset CSV file	-
`--algorithms`	`-a`	Algorithms to compare	`RS PDC K-Medoids DPP-PDC`
`--batch-sizes`	`-b`	Batch sizes to test	`10`
`--n-runs`	`-n`	Number of independent runs	`10`
`--n-initial`	-	Initial labeled samples	`10`
`--max-sampling`	-	Maximum total samples	auto
`--output-dir`	`-o`	Output directory	`results`
`--verbose`	`-v`	Enable verbose output	`False`
`--kmedoids-variant`	-	K-Medoids variant: FPS or PAM	`PAM`
`--kmedoids-top-percentile`	-	Top percentile for candidate selection (0.0-1.0)	`0.025`

`dpp-pdc init-config`

Generate a configuration file template.

dpp-pdc init-config --output configs/my_config.toml

Algorithms

Algorithm	Description
RS	Random Sampling (baseline)
PDC	Phase Diagram Construction using uncertainty sampling
K-Medoids	Two-step approach: select the most uncertain points, then apply diversity sampling. Supports FPS (Farthest Point Sampling) and PAM variants; the CLI reference and the example configuration list PAM as the default
DPP-PDC	Our method: uncertainty-weighted DPP with Bayesian diversity control
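
The two-step idea behind the K-Medoids baseline with the FPS variant can be sketched as follows: keep only the most uncertain fraction of candidates, then spread the batch out with farthest point sampling. This is a minimal sketch under stated assumptions, not the package's implementation. The function names are hypothetical, the uncertainty scores are taken as given, and starting FPS from the most uncertain candidate is an assumption.

```python
import math

def top_uncertain(uncertainty, top_percentile):
    """Indices of the most uncertain candidates, most uncertain first (at least one)."""
    n_keep = max(1, math.ceil(len(uncertainty) * top_percentile))
    order = sorted(range(len(uncertainty)), key=lambda i: uncertainty[i], reverse=True)
    return order[:n_keep]

def farthest_point_sampling(points, candidates, batch_size):
    """Start from the first candidate, then repeatedly add the candidate
    that is farthest from everything chosen so far."""
    chosen = [candidates[0]]
    # Distance from each candidate to its nearest chosen point.
    nearest = {i: math.dist(points[i], points[chosen[0]]) for i in candidates}
    while len(chosen) < min(batch_size, len(candidates)):
        nxt = max((i for i in candidates if i not in chosen), key=nearest.get)
        chosen.append(nxt)
        for i in candidates:
            nearest[i] = min(nearest[i], math.dist(points[i], points[nxt]))
    return chosen

points = [(0.0, 0.0), (0.1, 0.0), (1.0, 0.0), (0.5, 0.0), (5.0, 5.0)]
uncertainty = [0.9, 0.8, 0.7, 0.6, 0.1]
candidates = top_uncertain(uncertainty, top_percentile=0.7)
batch = farthest_point_sampling(points, candidates, batch_size=3)
```

The distant but confident point at (5.0, 5.0) never enters the batch because the first step filters it out. This is the trade-off that the top-percentile setting controls.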

Dataset Format

Input CSV files should have the following format:

Single Temperature Dataset

phase_name,Zn,Mg,Cu
FCC,0.1,0.2,0.7
BCC,0.3,0.3,0.4
HCP+LIQUID,0.5,0.4,0.1
...

Complete Dataset (with Temperature)

phase_name,Zn,Mg,Cu,T
FCC,0.1,0.2,0.7,850
BCC,0.3,0.3,0.4,850
HCP+LIQUID,0.5,0.4,0.1,650
...

Format requirements:

First column: phase label (categorical)
Composition columns (e.g., Zn, Mg, Cu) should sum to 1.0
Temperature column T is required for the complete dataset
The complete dataset used in the paper contains 71 distinct phase labels across temperatures from 500K to 1500K
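
A quick way to check a file against these requirements before running an experiment is sketched below. The helper `check_rows` is hypothetical (it is not part of the package), the column names follow the examples above, and the tolerance of 1e-6 is an assumption.

```python
import csv
import io
import math

def check_rows(csv_text, composition_columns=("Zn", "Mg", "Cu"), tol=1e-6):
    """Return (line_number, problem) pairs for rows that break the format rules."""
    problems = []
    reader = csv.DictReader(io.StringIO(csv_text))
    for line_number, row in enumerate(reader, start=2):  # the header is line 1
        if not row["phase_name"].strip():
            problems.append((line_number, "empty phase label"))
        total = sum(float(row[name]) for name in composition_columns)
        if not math.isclose(total, 1.0, abs_tol=tol):
            problems.append((line_number, f"compositions sum to {total:.4f}"))
    return problems

# The last row is deliberately wrong: 0.5 + 0.4 + 0.2 does not sum to 1.0.
sample = """phase_name,Zn,Mg,Cu
FCC,0.1,0.2,0.7
BCC,0.3,0.3,0.4
HCP+LIQUID,0.5,0.4,0.2
"""
problems = check_rows(sample)
```

For a file on disk, the same function can be fed the result of `open(path).read()`.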

Output

Results are saved to the results_pymc_{temperature}/ directory.

Complete Dataset (`Cu-Mg-Zn_complete.csv`)

results_pymc_complete/
├── combined_grid_complete.png/pdf       # Phase discovery curves + σ² evolution (all batch sizes)
├── auc_bar_complete.png/pdf             # AUC bar chart comparison
├── auc_table_complete.csv               # AUC values for all algorithms and batch sizes
├── metrics_curves_complete.png/pdf      # Classification metrics (Accuracy, Precision, Recall, F1, mAP)
├── phase_discovery_batch5_complete.png  # Phase discovery curve for batch size 5
├── phase_discovery_batch8_complete.png
├── phase_discovery_batch10_complete.png
├── phase_discovery_batch12_complete.png
├── RS_batch5_results.csv                # Phase counts per iteration
├── RS_batch8_results.csv
├── ...
├── DPP-PDC_batch5_results.csv
├── DPP-PDC_batch5_sigma.csv             # σ² values per iteration
├── DPP-PDC_batch8_sigma.csv
├── ...
├── RS_batch5_metrics.csv                # Classification metrics per iteration
├── PDC_batch5_metrics.csv
└── DPP-PDC_batch5_metrics.csv

Single Temperature Dataset (e.g., `Cu-Mg_Zn_850K.csv`)

results_pymc_850K/
├── combined_grid_850K.png/pdf           # Phase discovery curves + σ² evolution
├── auc_bar_850K.png/pdf                 # AUC bar chart comparison
├── auc_table_850K.csv                   # AUC values
├── metrics_curves_850K.png/pdf          # Classification metrics curves
├── phase_discovery_batch10_850K.png     # Phase discovery curve
├── ternary_DPP-PDC_batch10_850K.png     # Ternary diagram with sampling points
├── RS_batch10_results.csv
├── PDC_batch10_results.csv
├── DPP-PDC_batch10_results.csv
├── DPP-PDC_batch10_sigma.csv
├── RS_batch10_metrics.csv
├── PDC_batch10_metrics.csv
└── DPP-PDC_batch10_metrics.csv

Output files description:

File	Description
`combined_grid_*.png/pdf`	Combined plot: phase discovery curves (top) and σ² evolution (bottom)
`auc_bar_*.png/pdf`	Bar chart comparing AUC across algorithms and batch sizes
`auc_table_*.csv`	Area under the phase discovery curve (numerical values)
`metrics_curves_*.png/pdf`	Classification metrics over iterations (Accuracy, Precision, Recall, F1, mAP)
`phase_discovery_batch{size}_*.png`	Phase discovery curve for a specific batch size
`ternary_{algorithm}_batch{size}_*.png`	Ternary composition diagram with sampling evolution (single temperature only)
`{algorithm}_batch{size}_results.csv`	Number of discovered phases per iteration
`{algorithm}_batch{size}_sigma.csv`	σ² values per iteration (DPP methods only)
`{algorithm}_batch{size}_metrics.csv`	Classification metrics per iteration
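
To make the AUC values concrete, the sketch below computes a trapezoidal area under a phase discovery curve, given the number of discovered phases at each iteration. It is an illustration only: the function name is hypothetical, and whether the package normalizes the area or uses the number of samples rather than the iteration index on the x-axis is not stated here, so `step` is an assumed knob.

```python
def discovery_auc(phases_found, step=1.0):
    """Trapezoidal area under a curve sampled every `step` units on the x-axis."""
    pairs = zip(phases_found, phases_found[1:])
    return step * sum((left + right) / 2.0 for left, right in pairs)

# Two made-up curves that end at the same phase count.
fast = [3, 6, 8, 8]   # discovers phases early
slow = [3, 4, 5, 8]   # discovers the same phases late
fast_auc = discovery_auc(fast)
slow_auc = discovery_auc(slow)
```

Both curves reach the same final count, but the one that finds phases earlier has the larger area. That is why AUC serves as a summary of sampling efficiency.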

Configuration File Format

Example experiment.toml:

[experiment]
# Temperature conditions to evaluate (ignored if using --data with complete dataset)
temperatures = ["500K", "650K", "1000K", "1050K"]

# Algorithms to compare
algorithms = ["RS", "PDC", "K-Medoids", "DPP-PDC"]

# Batch sizes for each sampling iteration
batch_sizes = [5, 8, 10, 12]

# Number of independent runs per configuration
n_runs = 10

# Number of initial random samples
n_initial_sampling = 10

# Maximum total sampling points (comment out for auto-calculation)
# max_sampling = 300

[data]
# Template for dataset file paths (use {temperature} as placeholder)
file_path_template = "datasets/Cu-Mg_Zn_{temperature}.csv"

[output]
output_dir = "results"
save_plots = true
verbose = false

[PyMC]
# Prior parameters for noise variance (log-normal distribution)
prior_mu = -4.0
prior_sigma = 4.0

# MCMC sampling parameters
mcmc_samples = 1000
mcmc_tune = 1000
mcmc_chains = 1
target_accept = 0.9

[kmedoids]
# K-Medoids variant: "FPS" (Farthest Point Sampling) or "PAM" (original k-medoids)
variant = "PAM"

# Top percentile of uncertain points for candidate selection (0.0-1.0)
# Lower values focus on more uncertain points, higher values increase diversity
top_percentile = 0.025

Reproducing Paper Results

To reproduce the results from the paper:

# Full experiment
dpp-pdc run --data datasets/Cu-Mg_Zn_all.csv --algorithms RS PDC K-Medoids DPP-PDC --batch-sizes 5 8 10 12 --n-runs 10 --max-sampling 300 --output-dir paper_results

Project Structure

DPP-PDC/
├── pyproject.toml              # Package configuration
├── README.md                   # This file
├── configs/
│   └── experiment.toml         # Example configuration
├── datasets/
│   ├── Cu-Mg_Zn_500K.csv       # Single temperature slices
│   ├── Cu-Mg_Zn_550K.csv       # (500K to 1500K, 50K intervals)
│   ├── ...
│   ├── Cu-Mg_Zn_1500K.csv
│   └── Cu-Mg_Zn_all.csv        # Complete dataset with all temperatures
├── dpp_pdc/
│   ├── __init__.py
│   ├── cli.py                  # Command-line interface
│   ├── config.py               # Configuration management
│   ├── active_learning.py      # Core active learning loop
│   ├── data_preprocessing.py   # Data loading and preprocessing
│   ├── uncertainty_strategies.py  # RS, PDC, K-Medoids sampling
│   ├── pymc_dpp.py             # Bayesian DPP sampling with PyMC
│   ├── visualization.py        # Plotting functions
│   ├── sampling_distribution.py   # Sampling visualization demo
│   └── run_experiments.py      # Experiment runner
└── examples/
    └── quickstart.ipynb        # Jupyter notebook tutorial

Citation

If you use this code in your research, please cite:

License

This project is licensed under the MIT License - see the LICENSE file for details.

Acknowledgments

This work was supported by JSPS KAKENHI 25K01492, JST CREST JPMJCR21O2, and ERATO JPMJER1903.
