Interaction Datasets #40
Merged
Changes from 11 commits
51 commits
bd3fcf9
started splitting datasets into 'interaction' and 'potential'
mcneela a800ea5
add num_unique_molecules property
mcneela 9d6fca6
added logging
mcneela 794e63f
started base interaction dataset
mcneela 0db4765
add interaction __init__ file and revise potential __init__ file
mcneela 6e5a002
add des370k interaction to config_factory.py
mcneela 8e1e003
have BaseInteractionDataset inherit BaseDataset
mcneela d68bae6
implemented read_raw_entries for DES370K
mcneela 5e94d67
finished implementation of DES370K interaction
mcneela 3c9508b
finished implementation of DES370K interaction
mcneela 768fb2e
update BaseDataset import path
mcneela 8aeadd8
added Metcalf dataset
mcneela 9cf6034
updated DES370K based on Prudencio's comments
mcneela ce2c53b
Merge branch 'interaction' into metcalf
mcneela 6206665
added const molecule_groups lookup for DES370K dataset
mcneela 5cb57d9
updated subsets for DES370K
mcneela e18b710
added download url for des5m_interaction
mcneela 54cadbf
updated README with new datasets
mcneela 7f83eb5
Merge branch 'metcalf' into interaction
mcneela a922ef7
Added DES5M dataset
mcneela 2146058
added des_s66 dataset
mcneela 4d9a4ba
added DESS66x8 dataset
mcneela c2229e3
small update to __init__ file
mcneela 9349454
added L7 dataset
mcneela c3bdc64
added X40 dataset
mcneela 23c0739
add new datasets to __init__.py
mcneela 74f87a6
added splinter dataset
mcneela f046ea9
fixed a couple splinter things
mcneela 3c84ee9
update default data shapes for interaction datasets
mcneela 04c81ae
updated test_dummy.py with new import structure
mcneela 11e2858
fix test_import.py
mcneela 78f0423
code cleanup for the linter
mcneela bd58fdf
fix ani import
mcneela 5dfcf55
Merge branch 'refactoring' into interaction
mcneela 4bc3a49
fix base dataset import
mcneela b046eea
black formatting
mcneela fe54044
ran precommit
mcneela ef2528c
removed DES from datasets/__init__.py
mcneela c0ef5b1
removed DES from datasets/__init__.py
mcneela ad55296
fix X40 energy methods
mcneela 0a51e7c
added interaction dataset docstrings
mcneela b6c3a6a
update readme with all interaction datasets
mcneela 07f70b8
update metcalf __energy_methods__
mcneela 1443450
refactored des370k and des5m
mcneela 802b70b
update base interaction dataset to add n_atoms_first property
mcneela e969b54
update L7 and X40 to use python base yaml package
mcneela 5725fed
modify interaction/base.py to save keys other than force/energy in pr…
mcneela 6c6b286
fix base dataset issue
mcneela 46c5ebe
fix circular imports
mcneela d5ec053
merge origin/develop into interaction
mcneela cb9987c
removed print statements
mcneela
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,48 @@ | ||
import importlib | ||
import os | ||
from typing import TYPE_CHECKING # noqa F401 | ||
|
||
# The below lazy import logic is coming from openff-toolkit: | ||
# https://github.com/openforcefield/openff-toolkit/blob/b52879569a0344878c40248ceb3bd0f90348076a/openff/toolkit/__init__.py#L44 | ||
|
||
# Dictionary of objects to lazily import; maps the object's name to its module path | ||
|
||
_lazy_imports_obj = { | ||
"BaseInteractionDataset": "openqdc.datasets.interaction.base", | ||
"DES370K": "openqdc.datasets.interaction.des370k", | ||
} | ||
|
||
_lazy_imports_mod = {} | ||
|
||
|
||
def __getattr__(name): | ||
"""Lazily import objects from _lazy_imports_obj or _lazy_imports_mod | ||
|
||
Note that this method is only called by Python if the name cannot be found | ||
in the current module.""" | ||
obj_mod = _lazy_imports_obj.get(name) | ||
if obj_mod is not None: | ||
mod = importlib.import_module(obj_mod) | ||
return mod.__dict__[name] | ||
|
||
lazy_mod = _lazy_imports_mod.get(name) | ||
if lazy_mod is not None: | ||
return importlib.import_module(lazy_mod) | ||
|
||
raise AttributeError(f"module {__name__!r} has no attribute {name!r}") | ||
|
||
|
||
def __dir__(): | ||
"""Add _lazy_imports_obj and _lazy_imports_mod to dir(<module>)""" | ||
keys = (*globals().keys(), *_lazy_imports_obj.keys(), *_lazy_imports_mod.keys()) | ||
return sorted(keys) | ||
|
||
|
||
if TYPE_CHECKING or os.environ.get("OPENQDC_DISABLE_LAZY_LOADING", "0") == "1": | ||
from .base import BaseInteractionDataset | ||
from .des370k import DES370K | ||
|
||
__all__ = [ | ||
"BaseInteractionDataset", | ||
"DES370K", | ||
] |
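The lazy loading in this `__init__.py` relies on the module-level `__getattr__` hook (PEP 562). The following is a minimal, self-contained sketch of that mechanism, not the code from this PR. `make_lazy_module` and `demo_lazy` are hypothetical names, the mapping points at standard-library modules so the snippet runs without openqdc installed, and the caching step is an optional addition that the PR's version does not perform.

```python
import importlib
import types

# Hypothetical mapping: object name -> module path (stdlib targets for the demo).
_lazy_imports_obj = {"OrderedDict": "collections", "sqrt": "math"}


def make_lazy_module(name, lazy_objs):
    """Build a module object whose attributes are imported on first access."""
    mod = types.ModuleType(name)

    def __getattr__(attr):
        # Python only calls this hook when normal attribute lookup fails.
        target = lazy_objs.get(attr)
        if target is None:
            raise AttributeError(f"module {name!r} has no attribute {attr!r}")
        obj = getattr(importlib.import_module(target), attr)
        # Optional (assumption, not in the PR): cache so the hook runs only once.
        setattr(mod, attr, obj)
        return obj

    # Storing the function in the module namespace is what activates PEP 562.
    mod.__getattr__ = __getattr__
    return mod


demo = make_lazy_module("demo_lazy", _lazy_imports_obj)
was_loaded_before = "sqrt" in vars(demo)  # nothing imported yet
root = demo.sqrt(9.0)                     # triggers the lazy import
```

In a real package the same `__getattr__` function simply lives at the top level of `__init__.py`, as in the diff above, so `from package import Name` only pays the import cost for the names actually used.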
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,43 @@ | ||
from typing import Dict, List, Optional, Union | ||
from openqdc.utils.io import ( | ||
copy_exists, | ||
dict_to_atoms, | ||
get_local_cache, | ||
load_hdf5_file, | ||
load_pkl, | ||
pull_locally, | ||
push_remote, | ||
set_cache_dir, | ||
) | ||
from openqdc.datasets.potential.base import BaseDataset | ||
|
||
from loguru import logger | ||
|
||
import numpy as np | ||
|
||
class BaseInteractionDataset(BaseDataset): | ||
def __init__( | ||
self, | ||
energy_unit: Optional[str] = None, | ||
distance_unit: Optional[str] = None, | ||
overwrite_local_cache: bool = False, | ||
cache_dir: Optional[str] = None, | ||
) -> None: | ||
super().__init__( | ||
energy_unit=energy_unit, | ||
distance_unit=distance_unit, | ||
overwrite_local_cache=overwrite_local_cache, | ||
cache_dir=cache_dir | ||
) | ||
|
||
def collate_list(self, list_entries: List[Dict]): | ||
# concatenate entries | ||
res = {key: np.concatenate([r[key] for r in list_entries if r is not None], axis=0) \ | ||
for key in list_entries[0] if not isinstance(list_entries[0][key], dict)} | ||
|
||
csum = np.cumsum(res.get("n_atoms")) | ||
x = np.zeros((csum.shape[0], 2), dtype=np.int32) | ||
x[1:, 0], x[:, 1] = csum[:-1], csum | ||
res["position_idx_range"] = x | ||
|
||
return res |
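To make the last lines of `collate_list` concrete, here is a small sketch of how per-entry atom counts become `[start, stop)` index ranges into the concatenated arrays. The helper name `position_idx_range` and the sample counts are assumptions chosen for illustration; only the cumulative-sum logic is taken from the diff.

```python
import numpy as np


def position_idx_range(n_atoms):
    """Return an (N, 2) array of [start, stop) row ranges, one per entry."""
    csum = np.cumsum(n_atoms)
    x = np.zeros((csum.shape[0], 2), dtype=np.int32)
    # Starts are the previous cumulative totals (0 for the first entry);
    # stops are the running totals themselves.
    x[1:, 0], x[:, 1] = csum[:-1], csum
    return x


# Three hypothetical entries with 3, 5 and 2 atoms.
ranges = position_idx_range(np.array([3, 5, 2], dtype=np.int32))
start, stop = ranges[1]  # rows of the second entry in the concatenated array
```

With these ranges, the atoms of entry `i` can be recovered as `atomic_inputs[start:stop]` from the single concatenated array.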
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,115 @@ | ||
import os | ||
import numpy as np | ||
import pandas as pd | ||
|
||
from typing import Dict, List | ||
|
||
from tqdm import tqdm | ||
from loguru import logger | ||
from openqdc.datasets.interaction import BaseInteractionDataset | ||
from openqdc.utils.molecule import atom_table | ||
|
||
|
||
class DES370K(BaseInteractionDataset): | ||
__name__ = "des370k_interaction" | ||
__energy_unit__ = "hartree" | ||
__distance_unit__ = "ang" | ||
__forces_unit__ = "hartree/ang" | ||
__energy_methods__ = [ | ||
"mp2/cc-pvdz", | ||
"mp2/cc-pvqz", | ||
"mp2/cc-pvtz", | ||
"mp2/cbs", | ||
"ccsd(t)/cc-pvdz", | ||
"ccsd(t)/cbs", # cbs | ||
"ccsd(t)/nn", # nn | ||
"sapt0/aug-cc-pwcvxz", | ||
|
||
"sapt0/aug-cc-pwcvxz", | ||
"sapt0/aug-cc-pwcvxz", | ||
"sapt0/aug-cc-pwcvxz", | ||
"sapt0/aug-cc-pwcvxz", | ||
"sapt0/aug-cc-pwcvxz", | ||
"sapt0/aug-cc-pwcvxz", | ||
"sapt0/aug-cc-pwcvxz", | ||
"sapt0/aug-cc-pwcvxz", | ||
"sapt0/aug-cc-pwcvxz", | ||
] | ||
|
||
energy_target_names = [ | ||
"cc_MP2_all", | ||
"qz_MP2_all", | ||
"tz_MP2_all", | ||
"cbs_MP2_all", | ||
"cc_CCSD(T)_all", | ||
"cbs_CCSD(T)_all", | ||
"nn_CCSD(T)_all", | ||
"sapt_all", | ||
"sapt_es", | ||
"sapt_ex", | ||
"sapt_exs2", | ||
"sapt_ind", | ||
"sapt_exind", | ||
"sapt_disp", | ||
"sapt_exdisp_os", | ||
"sapt_exdisp_ss", | ||
"sapt_delta_HF", | ||
] | ||
|
||
def read_raw_entries(self) -> List[Dict]: | ||
self.filepath = os.path.join(self.root, "DES370K.csv") | ||
logger.info(f"Reading DES370K interaction data from {self.filepath}") | ||
df = pd.read_csv(self.filepath) | ||
data = [] | ||
for idx, row in tqdm(df.iterrows(), total=df.shape[0]): | ||
smiles0, smiles1 = row["smiles0"], row["smiles1"] | ||
charge0, charge1 = row["charge0"], row["charge1"] | ||
natoms0, natoms1 = row["natoms0"], row["natoms1"] | ||
pos = np.array(list(map(float, row["xyz"].split()))).reshape(-1, 3) | ||
pos0 = pos[:natoms0] | ||
pos1 = pos[natoms0:] | ||
|
||
elements = row["elements"].split() | ||
elements0 = np.array(elements[:natoms0]) | ||
elements1 = np.array(elements[natoms0:]) | ||
|
||
atomic_nums = np.expand_dims(np.array([atom_table.GetAtomicNumber(x) for x in elements]), axis=1) | ||
atomic_nums0 = np.array(atomic_nums[:natoms0]) | ||
atomic_nums1 = np.array(atomic_nums[natoms0:]) | ||
|
||
charges = np.expand_dims(np.array([charge0] * natoms0 + [charge1] * natoms1), axis=1) | ||
|
||
atomic_inputs = np.concatenate((atomic_nums, charges, pos), axis=-1, dtype=np.float32) | ||
atomic_inputs0 = atomic_inputs[:natoms0, :] | ||
atomic_inputs1 = atomic_inputs[natoms0:, :] | ||
|
||
energies = np.array(row[self.energy_target_names].values).astype(np.float32)[None, :] | ||
|
||
name = np.array([smiles0 + "." + smiles1]) | ||
|
||
item = dict( | ||
I think we only need this. A lot of the information in the sub-dicts can be retrieved from the info below alone!
|
||
mol0=dict( | ||
smiles=smiles0, | ||
atomic_inputs=atomic_inputs0, | ||
n_atoms=natoms0, | ||
charge=charge0, | ||
elements=elements0, | ||
atomic_nums=atomic_nums0, | ||
pos=pos0, | ||
), | ||
mol1=dict( | ||
smiles=smiles1, | ||
atomic_inputs=atomic_inputs1, | ||
n_atoms=natoms1, | ||
charge=charge1, | ||
elements=elements1, | ||
atomic_nums=atomic_nums1, | ||
pos=pos1, | ||
), | ||
energies=energies, | ||
subset=np.array(["DES370K"]), | ||
n_atoms=np.array([natoms0 + natoms1], dtype=np.int32), | ||
atomic_inputs=atomic_inputs, | ||
name=name, | ||
) | ||
data.append(item) | ||
return data |
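The per-row parsing in `read_raw_entries` can be sketched in isolation as follows. This is an illustrative reduction under stated assumptions, not the PR's implementation: `build_atomic_inputs` is a hypothetical helper, `ATOMIC_NUMBERS` is a tiny stand-in for `atom_table`, and the sample row (H2 plus a lithium cation) is made up rather than taken from DES370K.

```python
import numpy as np

# Hypothetical stand-in for the periodic-table lookup used in the PR.
ATOMIC_NUMBERS = {"H": 1, "Li": 3}


def build_atomic_inputs(row):
    """Split one dimer row into per-monomer (n_atoms, 5) float32 arrays.

    Columns are: atomic number, molecular charge, x, y, z.
    """
    natoms0, natoms1 = int(row["natoms0"]), int(row["natoms1"])
    pos = np.array([float(v) for v in row["xyz"].split()]).reshape(-1, 3)
    if pos.shape[0] != natoms0 + natoms1:
        raise ValueError("xyz length does not match natoms0 + natoms1")

    atomic_nums = np.array([[ATOMIC_NUMBERS[e]] for e in row["elements"].split()])
    # Each atom carries the total charge of the monomer it belongs to.
    charges = np.array([[row["charge0"]]] * natoms0 + [[row["charge1"]]] * natoms1)

    atomic_inputs = np.concatenate((atomic_nums, charges, pos), axis=-1)
    atomic_inputs = atomic_inputs.astype(np.float32)
    # The first natoms0 rows belong to monomer 0, the rest to monomer 1.
    return atomic_inputs[:natoms0], atomic_inputs[natoms0:]


# Made-up sample row for illustration only.
row = {
    "natoms0": 2, "natoms1": 1,
    "charge0": 0, "charge1": 1,
    "elements": "H H Li",
    "xyz": "0.0 0.0 0.0 0.0 0.0 0.75 3.0 0.0 0.0",
}
mol0, mol1 = build_atomic_inputs(row)
```

Because both monomer arrays are slices of the combined `atomic_inputs`, storing `atomic_inputs` together with the first monomer's atom count is enough to rebuild either half, which is the point raised in the review comment on the `item` dict.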
File renamed without changes.
2 changes: 1 addition & 1 deletion
src/openqdc/datasets/comp6.py → src/openqdc/datasets/potential/comp6.py
2 changes: 1 addition & 1 deletion
src/openqdc/datasets/gdml.py → src/openqdc/datasets/potential/gdml.py
2 changes: 1 addition & 1 deletion
src/openqdc/datasets/iso_17.py → src/openqdc/datasets/potential/iso_17.py
2 changes: 1 addition & 1 deletion
src/openqdc/datasets/sn2_rxn.py → src/openqdc/datasets/potential/sn2_rxn.py
The write and read functions for the preprocessed data must be changed here, no? New keys have been added, so the base class must adapt those functions, no?
We must also change the logic of a few other functions to avoid normalizing the interaction energies, no, @FNTwin?
Yes, I still need to update the preprocessing functions to add the new keys. I'm not familiar with the normalization of the energies, Cristian will probably be able to help with that.