Follow the steps below to get the files:

```bash
# clone this repo into the folder flowmm/
cd flowmm
git submodule init
git submodule update
bash create_env_file.sh  # creates the necessary .env file
```
The submodules include CDVAE, DiffCSP, and Riemannian Flow Matching.
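If you prefer, the clone and submodule initialization can be combined into a single command. This is only a sketch; it assumes the public repository URL is `https://github.com/facebookresearch/flowmm`:

```bash
# clone the repo together with its submodules (CDVAE, DiffCSP, Riemannian Flow Matching)
git clone --recurse-submodules https://github.com/facebookresearch/flowmm.git flowmm
cd flowmm
bash create_env_file.sh  # creates the necessary .env file
```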
Now we can install `flowmm`. We recommend using micromamba because conda is extremely slow. You can install micromamba by following their guide (a one-line installer is sketched below). If creating the environment fails, try running it again a few times.
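For convenience, this is the one-line installer from the micromamba documentation at the time of writing; defer to their guide if it has changed:

```bash
# downloads micromamba and initializes it for your shell
"${SHELL}" <(curl -L micro.mamba.pm/install.sh)
```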
```bash
micromamba env create -f environment.yml
```

Activate using:

```bash
micromamba activate flowmm
```
The training data is in `.csv` format in `data/`. When you first train on it, the script will convert the data into a faster-to-load format.
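Each dataset folder should contain the usual train/validation/test `.csv` splits inherited from the upstream codebases (an assumption; verify against your checkout):

```bash
ls data/mp_20/
# expected, approximately: train.csv  val.csv  test.csv
```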
If you want to compute energy above hull, you must download the convex hull from 2023-02-07 and extract the files to the folder `mp_02072023/`. We got this hull from Matbench Discovery.
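A minimal sketch of the extraction step; the archive name is a placeholder for whatever file you download from Matbench Discovery:

```bash
mkdir -p mp_02072023
# replace <hull_archive> with the 2023-02-07 hull file downloaded from Matbench Discovery
tar -xf <hull_archive> -C mp_02072023/
```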
The various datasets you can train on are as follows:

`data ∈ {perov, carbon, mp_20, mpts_52}`
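For example, to train on `mp_20` instead of `perov` in the commands further below, pass the corresponding `data=` override (using one of the model options described next):

```bash
python scripts_model/run.py data=mp_20 model=abits_params
```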
There are a few preselected model options in `scripts_model/conf/model`. The format is `{atom_type_manifold}_{lattice_manifold}`.
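Since these are just config files in that directory, you can list the available options directly:

```bash
ls scripts_model/conf/model/
```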
Atom type manifolds:
- `abits` - analog bits, proposed for use in FlowMM
- `null` - used for conditional generation
- `simplex` - the method used to fit the atom types in DiffCSP
Lattice manifolds:
- `nonsym` - the method used to fit the lattice in DiffCSP
- `params` - our proposed method in FlowMM, lattice parameters
- `params_normal_base` - an ablated version where the base distribution over lattice parameters is Gaussian and is not mapped to the constrained space
```bash
# conditional generation / crystal structure prediction (null atom type manifold)
python scripts_model/run.py data=perov model=null_params
# de novo generation (analog bits atom type manifold, as proposed in FlowMM)
python scripts_model/run.py data=perov model=abits_params
```
Discussion about evaluation is limited to FlowMM. `scripts_model/evaluate.py` uses `click`, allowing it to serve as a multi-purpose evaluation program.
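Because it is a `click` program, the available subcommands and their options can be listed with `--help`:

```bash
python scripts_model/evaluate.py --help
python scripts_model/evaluate.py reconstruct --help
```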
These commands will:
- reconstruct the `test` set
- consolidate the results into a single torch pickle with the correct format
- compute the match rate and root mean square error, compared to the `test` set
- create plots of the distribution of lattice parameters, compared to the `test` set
The user must provide `PATH_TO_CHECKPOINT`, `NAME_OF_SUBDIRECTORY_AT_CHECKPOINT`, and `SLOPE_OF_INFERENCE_ANTI_ANNEALING`.
```bash
ckpt=PATH_TO_CHECKPOINT
subdir=NAME_OF_SUBDIRECTORY_AT_CHECKPOINT
slope=SLOPE_OF_INFERENCE_ANTI_ANNEALING

python scripts_model/evaluate.py reconstruct ${ckpt} --subdir ${subdir} --inference_anneal_slope ${slope} --stage test && \
python scripts_model/evaluate.py consolidate ${ckpt} --subdir ${subdir} && \
python scripts_model/evaluate.py old_eval_metrics ${ckpt} --subdir ${subdir} --stage test && \
python scripts_model/evaluate.py lattice_metrics ${ckpt} --subdir ${subdir} --stage test
```
These commands will:
- generate 10k structures from a checkpoint
- consolidate the results into a single torch pickle with the correct format
- compute the De Novo Generation proxy metrics, compared to the `test` set
- create plots of the distribution of lattice parameters, compared to the `test` set
The user must provide `PATH_TO_CHECKPOINT`, `NAME_OF_SUBDIRECTORY_AT_CHECKPOINT`, and `SLOPE_OF_INFERENCE_ANTI_ANNEALING`.
```bash
ckpt=PATH_TO_CHECKPOINT
subdir=NAME_OF_SUBDIRECTORY_AT_CHECKPOINT
slope=SLOPE_OF_INFERENCE_ANTI_ANNEALING

python scripts_model/evaluate.py generate ${ckpt} --subdir ${subdir} --inference_anneal_slope ${slope} && \
python scripts_model/evaluate.py consolidate ${ckpt} --subdir ${subdir} && \
python scripts_model/evaluate.py old_eval_metrics ${ckpt} --subdir ${subdir} --stage test && \
python scripts_model/evaluate.py lattice_metrics ${ckpt} --subdir ${subdir} --stage test
```
Taking the generations from the previous step, we can prerelax them using CHGNet on the CPU. The script works locally, but it is designed to parallelize the process over nodes on a slurm cluster. If you want to use slurm, you must also provide `YOUR_SLURM_PARTITION`.
```bash
# get the path to the structures
eval_for_dft_pt=$(python scripts_model/evaluate.py consolidate "${ckpt}" --subdir "${subdir}" --path_eval_pt eval_for_dft.pt | tail -n 1)

# derive the eval_for_dft_json path
parent=${eval_for_dft_pt%/*}                    # retain the part before the last slash
eval_for_dft_json="${eval_for_dft_pt%.*}.json"  # retain the part before the last period, add .json
log_dir="${parent}/chgnet_log_dir"

# set the other flags if you are using slurm
num_jobs=1
slurm_partition=YOUR_SLURM_PARTITION

# prerelax
python scripts_analysis/prerelax.py "$eval_for_dft_pt" "$eval_for_dft_json" "$log_dir" --num_jobs "$num_jobs" --slurm_partition "$slurm_partition"
```
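Once the prerelaxation finishes, a quick sanity check is to look at the log directory and peek at the output JSON (the exact field names are determined by `prerelax.py`):

```bash
ls "${log_dir}"
python -m json.tool "${eval_for_dft_json}" | head -n 40
```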
We can continue by doing density functional theory (DFT) with VASP. The user must provide `PATH_TO_YOUR_PSEUDOPOTENTIALS`, which requires a VASP license.
```bash
export PMG_VASP_PSP_DIR=PATH_TO_YOUR_PSEUDOPOTENTIALS

# create the folder to hold the dft files
dft_folder="${parent}/dft"
mkdir -p "$dft_folder"

# create the dft inputs
python scripts_analysis/dft_create_inputs.py "${eval_for_dft_json}" "${dft_folder}"
```
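It can help to confirm that the pseudopotential path is set and to glance at what was written; the exact layout of `dft/` is determined by `dft_create_inputs.py`, so treat this as a sketch:

```bash
echo "${PMG_VASP_PSP_DIR}"   # should print PATH_TO_YOUR_PSEUDOPOTENTIALS
ls "${dft_folder}" | head    # assumed: one set of VASP inputs per generated structure
```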
We do not provide guidance on running DFT.
We note that your DFT results should typically be corrected using the settings from the Materials Project.
We can compute the energy above hull using the (corrected) DFT relaxed energies or the CHGNet prerelaxed energies.
Note: This whole section requires downloading the convex hull from above!
If you want to use the CHGNet prerelaxed energies, you can use the following commands. Since the prerelaxed energies are generally inaccurate, they go in their own column in the `.json` file. The prerelaxed energy above hull from CHGNet will be computed whether or not DFT relaxations were run.
```bash
json_e_above_hull="${parent}/ehulls.json"
python scripts_analysis/ehull.py "${eval_for_dft_json}" "${json_e_above_hull}"
```
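As a rough check, you can count how many entries received a hull energy; this assumes the output is a flat JSON mapping, which may not match the actual schema:

```bash
python -c "import json; print(len(json.load(open('${json_e_above_hull}'))))"
```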
If you want to use your DFT calculations, you can use the following code. Our script expects that `clean_outputs` contains trajectory files named `######.traj`, where `######` corresponds to the index of the corresponding row in the `${eval_for_dft_json}` file.
```bash
clean_outputs_dir="${parent}/clean_outputs"
json_e_above_hull="${parent}/ehulls.json"
python scripts_analysis/ehull.py "${eval_for_dft_json}" "${json_e_above_hull}" --clean_outputs_dir "${clean_outputs_dir}"
```
We have a script which applies the corrections for you, but your files must correspond to our format. There must be a root directory that has (1) a subdirectory called `dft`, which was created using `scripts_analysis/dft_create_inputs.py`, and (2) a subdirectory called `clean_outputs`, which contains trajectory files named `######.traj`, where `######` corresponds to the index of the corresponding row in the `${eval_for_dft_json}` file.
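In other words, the expected layout is roughly the following; the zero-padded six-digit file names are an assumption based on the `######.traj` pattern:

```bash
# ${parent}/
# ├── dft/              # created by scripts_analysis/dft_create_inputs.py
# └── clean_outputs/
#     ├── 000000.traj   # index of the corresponding row in ${eval_for_dft_json}
#     ├── 000001.traj
#     └── ...
ls "${parent}/dft" "${parent}/clean_outputs" | head
```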
```bash
root_dft_clean_outputs="${parent}"
ehulls_corrected_json="${parent}/ehulls_corrected.json"
python scripts_analysis/ehull_correction.py "${eval_for_dft_json}" "${ehulls_corrected_json}" --root_dft_clean_outputs "${root_dft_clean_outputs}"
```
The most accurate estimates of S.U.N. (stable, unique, and novel) structures occur when using corrected, DFT-relaxed energies.
```bash
sun_json=sun.json
python scripts_analysis/novelty.py "${eval_for_dft_json}" "${sun_json}" --ehulls "${ehulls_corrected_json}"
```
The FlowLLM model combines RFM and CrystalLLM by using the LLM as a learned base distribution for the RFM model.
To train the FlowLLM from scratch, you need to train the CrystalLLM model first:
1. Get the CrystalLLM codebase from https://github.com/facebookresearch/crystal-text-llm.
2. Follow the instructions in that repo to fine-tune a LLaMA model on the MP-20 dataset.
3. After training, generate a large number of samples from that model and create a dataset to train the RFM model.
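The first step amounts to cloning the repository linked above and following its README; the later steps are described there:

```bash
git clone https://github.com/facebookresearch/crystal-text-llm.git
cd crystal-text-llm
# follow that repository's README to fine-tune LLaMA on MP-20 and to sample from the fine-tuned model
```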
For convenience, a subset of the data used in the FlowLLM paper is available in `data/mp20_llama/`.
Once this training data has been created, the FlowLLM model can be trained just like FlowMM with Conditional Training. Be sure to set `base_distribution_from_data=True` to read the initial samples from the dataset file.
```bash
python scripts_model/run.py data=mp20_llama model=null_params base_distribution_from_data=True
```
If you find this repository helpful for your publications, please consider citing our papers:
```bibtex
@inproceedings{
miller2024flowmm,
title={Flow{MM}: Generating Materials with Riemannian Flow Matching},
author={Benjamin Kurt Miller and Ricky T. Q. Chen and Anuroop Sriram and Brandon M Wood},
booktitle={Forty-first International Conference on Machine Learning},
year={2024},
url={https://openreview.net/forum?id=W4pB7VbzZI}
}

@inproceedings{
sriram2024flowllm,
title={FlowLLM: Flow Matching for Material Generation with Large Language Models as Base Distributions},
author={Anuroop Sriram and Benjamin Kurt Miller and Ricky T. Q. Chen and Brandon M. Wood},
booktitle={NeurIPS 2024},
year={2024},
url={}
}
```
`flowmm` is CC-BY-NC licensed, as found in the `LICENSE.md` file. However, the git submodules may have different license terms:
- `cdvae`: MIT License
- `DiffCSP-official`: MIT License
- `riemmanian-fm`: CC BY-NC 4.0 License
The licenses for the dependencies can be viewed at the corresponding project's homepage.