The code was run on Python 3.10.4. The following modules are required: gurobipy, numpy, pandas, qml, sklearn, skmatter, plotly, kaleido, json, qmllib.
- LAPACK and BLAS are required to install qmllib.
conda env create -f environment.yml
conda activate ilpselect
If that does not work, install step by step:
conda create -n ilpselect python=3.10.4
conda activate ilpselect
conda install numpy=1.26.4 pandas=2.2.3 scikit-learn=1.3.0 skmatter=0.2.0 plotly=5.24.1
conda install blas=1.1 lapack=3.9.0
export LD_LIBRARY_PATH=$CONDA_PREFIX/lib:$LD_LIBRARY_PATH
export LIBRARY_PATH=$LD_LIBRARY_PATH
pip install qmllib==1.1.5 kaleido==0.2.1 gurobipy==10.0.2
Gurobi is used to solve integer linear programs (FPS and ILP). A license is required. Academic licenses are available, and clusters need special licenses. If gurobipy cannot find the license file on its own, set the following environment variable to its path.
export GRB_LICENSE_FILE=/ssoft/spack/external/gurobi/gurobi.lic
Note that different results will be obtained with different versions of Gurobi.
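
Before launching a long run, it can help to check that the license variable points to a real file. The snippet below is a minimal sketch using only the standard library; the helper name `find_gurobi_license` is hypothetical and is not part of this repository or of gurobipy.

```python
import os
from pathlib import Path


def find_gurobi_license(env=None):
    """Return the path named by GRB_LICENSE_FILE, or None if the variable is unset.

    Hypothetical helper for illustration only. It does not validate the
    license itself, it only checks that the file exists.
    """
    env = os.environ if env is None else env
    value = env.get("GRB_LICENSE_FILE")
    if not value:
        # Variable unset or empty: gurobipy falls back to its default search locations.
        return None
    path = Path(value)
    if not path.is_file():
        raise FileNotFoundError(f"GRB_LICENSE_FILE points to a missing file: {path}")
    return path
```

Calling `find_gurobi_license()` at the start of a job fails early with a clear message when the path is wrong, instead of failing when the first model is built.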
- Create the folders models, rankings, solutions, and learning_curves. The .mps files they contain are large and are therefore listed in .gitignore.
mkdir models rankings solutions learning_curves
- Verify that the folder `qm7` exists, and that it contains the energies described in an `energies.csv` file (with columns `file` and `energy / Ha`).
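
The two setup steps above can also be done from Python. This is a sketch under stated assumptions: `energies.csv` is assumed to be comma-separated with a header row, and the function name `prepare_workspace` is hypothetical (it does not exist in the repository).

```python
import csv
from pathlib import Path

OUTPUT_FOLDERS = ("models", "rankings", "solutions", "learning_curves")
REQUIRED_COLUMNS = ("file", "energy / Ha")


def prepare_workspace(root=".", database="qm7"):
    """Create the output folders and check the header of {database}/energies.csv.

    Hypothetical helper: assumes a comma-separated file with a header row.
    Returns the list of column names found.
    """
    root = Path(root)
    for name in OUTPUT_FOLDERS:
        # Same effect as `mkdir -p`: no error if the folder already exists.
        (root / name).mkdir(parents=True, exist_ok=True)

    energies = root / database / "energies.csv"
    if not energies.is_file():
        raise FileNotFoundError(f"missing energies file: {energies}")

    with energies.open(newline="") as handle:
        header = [column.strip() for column in next(csv.reader(handle), [])]

    missing = [column for column in REQUIRED_COLUMNS if column not in header]
    if missing:
        raise ValueError(f"{energies} lacks the columns: {missing}")
    return header
```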
The `main.py` file runs everything based on a Python config file. The default config file `config.py` is used when running `main.py` with no argument.
To use a custom config `config-foo.py`, use the command `python3 main.py "config-foo"`.
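
How `main.py` resolves the config is not reproduced here. One plausible way to turn the optional command-line argument into a config module is sketched below, assuming the config file sits in a directory on `sys.path` (for example next to `main.py`); the function name `load_config` is hypothetical.

```python
import importlib
import sys


def load_config(argv, default="config"):
    """Import the config module named by the first CLI argument, or the default.

    Hypothetical sketch, not the repository's actual loader.
    """
    name = argv[1] if len(argv) > 1 else default
    # Accept "config-foo.py" as well as "config-foo".
    if name.endswith(".py"):
        name = name[:-3]
    # import_module takes the name as a string, so a hyphenated file name
    # such as config-foo.py works even though `import config-foo` would not.
    return importlib.import_module(name)


if __name__ == "__main__" and len(sys.argv) > 1:
    config = load_config(sys.argv)
    print(f"Loaded config module: {config.__name__}")
```

Passing the module name rather than a path keeps the config a plain Python file whose variables can be read as attributes.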
`main.py` combines all files in the `scripts` folder to do the following.
- Read the target names from the config file. The corresponding `{target_name}.xyz` files should be present in the `targets` folder.
- Read the config file for the following parameters: database (qm7 for now), representation (FCHL), algorithm-specific parameters, learning curve parameters, ...
- Generate the representations if not present (with the naming conventions `{rep}_{target}.npz` and `{rep}_{database}.npz`) and save them to the `data` folder. The database must be in the `{database}` folder. The file `scripts/generate.py` is responsible for this step.
- Compute fragment subsets using different techniques (indices of the database):
  - Subset selection by ILP (named `algo` in the code):
    - Generate the model and write it to the `models` folder, OR read it and modify its parameters if possible (a simple penalty change, for example). The file `scripts/algo_model.py` is responsible for this step and builds on the file `scripts/fragments.py`.
    - Solve the model, output the subset to the `rankings` folder with prefix `algo_`, and output the ILP solution to the `solutions` folder. The file `scripts/algo_subset.py` is responsible for this step.
  - Subset selection by SML:
    - Output the subset to the `rankings` folder with prefix `sml_`. The file `scripts/sml_subset.py` is responsible for this step.
  - Subset selection by CUR:
    - Output the subset to the `rankings` folder with prefix `cur_`. The file `scripts/cur_subset.py` is responsible for this step.
  - Subset selection by FPS:
    - Output the subset to the `rankings` folder with prefix `fps_`. The file `scripts/fps_subset.py` is responsible for this step.
- Compute the learning curve of each subset and save it to the `learning_curves` folder. The file `scripts/learning_curves.py` is responsible for this step.
- Draw the learning curves and save them to the `plots` folder. The file `scripts/plots.py` is responsible for this step.
- The timings of each step are saved in a dump file in the `run` folder. An example file `dump-template.csv` can be found in the folder.
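
The pattern shared by these steps (skip the work when the output file already exists, and record how long each step took) can be sketched as follows. This is an illustration, not the code in `scripts`: the helper names and the CSV columns `step`, `seconds`, `skipped` are assumptions and may not match `dump-template.csv`. Only the `{rep}_{name}.npz` naming convention comes from the description above.

```python
import csv
import time
from pathlib import Path


def representation_path(rep, name, folder="data"):
    """Build the {rep}_{name}.npz path, where name is a target or a database."""
    return Path(folder) / f"{rep}_{name}.npz"


def run_step(name, output, action, timings):
    """Run action(output) unless output already exists, and record the timing.

    Hypothetical helper: `action` is any callable that writes the file `output`.
    """
    output = Path(output)
    start = time.perf_counter()
    skipped = output.exists()
    if not skipped:
        output.parent.mkdir(parents=True, exist_ok=True)
        action(output)
    elapsed = time.perf_counter() - start
    timings.append({"step": name, "seconds": elapsed, "skipped": skipped})


def write_timings(timings, path):
    """Dump the recorded timings to a CSV file (column names are assumptions)."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=["step", "seconds", "skipped"])
        writer.writeheader()
        writer.writerows(timings)
```

A driver would call `run_step` once per stage (representations, each subset selection, learning curves, plots) with a shared `timings` list, then call `write_timings` at the end.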
The folder `run` contains a `main.run` file which describes how the scripts were run on the JED cluster.
An example output file `slurm.out` is included.
Add a `{target_name}.xyz` file to the folder `targets`.
Add a corresponding entry with the associated energy in the `energies.csv` file in the same folder.
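
A quick way to catch a forgotten entry is to compare the `.xyz` files in the folder with the rows of `energies.csv`. The sketch below assumes that the targets' `energies.csv` has a header row with a `file` column holding the target name, with or without the `.xyz` extension; the layout of that file is not specified here, so adjust `name_column` if it differs. The function name is hypothetical.

```python
import csv
from pathlib import Path


def targets_without_energy(targets_dir="targets", name_column="file"):
    """Return the names of .xyz targets that have no row in energies.csv.

    Hypothetical helper: the `file` column name is an assumption about the
    targets' energies.csv.
    """
    targets_dir = Path(targets_dir)
    with (targets_dir / "energies.csv").open(newline="") as handle:
        # Path(...).stem drops a trailing ".xyz", so both "foo" and "foo.xyz" match.
        listed = {Path(row[name_column].strip()).stem for row in csv.DictReader(handle)}
    present = {path.stem for path in targets_dir.glob("*.xyz")}
    return sorted(present - listed)
```

An empty list means every target file has an energy entry.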
Create a `{database}` folder, which contains the energies described in an `energies.csv` file (with columns `file` and `energy / Ha`).
One may add a column `atomization energy / Ha`.
See the `qm9/generate.py` script and the `cluster/scripts/generate_qm9.py` file in the master branch for an example of a qm9 implementation from a master file.
Modify the file `scripts/generate.py` accordingly. Currently, the `get_representations` function asserts that FCHL is used.
- Add a list of the class attributes of `scripts.fragments.model`.
- Implement the qm9 database. The only thing currently missing is some pruning of the database, because Gurobi uses too much memory (even on clusters).
- Implement representations other than FCHL. Not a priority, but it should not be too difficult (`representation` is already a parameter).