Refactoring #38

Merged on Mar 8, 2024 (33 commits). Changes from 23 commits.

Commits
4bfb1c1
Updated dummy dataset
Feb 9, 2024
862dc60
ran pre-commit on dummy
Feb 9, 2024
f8a6f0f
Download and upload progress. No init method
Feb 29, 2024
c159603
Basic versioning, refactor src->openqdc, updated .toml
Feb 29, 2024
77acba2
__init__.py clean
Feb 29, 2024
8b54153
black + isort
Feb 29, 2024
e6995a7
DES fix, base utilies, CLI
Feb 29, 2024
2a06637
mkdocs dependency + fix
Feb 29, 2024
5e179d0
Tutorial update + mkdocs fix
Feb 29, 2024
caf4976
Updated readme
Feb 29, 2024
984e971
ruff .
Feb 29, 2024
ef8264a
Fix cli multiple option
Feb 29, 2024
b96158f
Upgraded to black>=24, fixed import tests, cleaning, gh action lint w…
Mar 1, 2024
bc49b9d
pre-commit check
Mar 1, 2024
cedf344
Removed TypeAlias for compatibility with py3.9
Mar 1, 2024
77f2051
First shot at improved docstrings
Mar 1, 2024
79d62a2
Updated Readme, Docstrings datasets and small additions notebook
Mar 1, 2024
3e7c6e9
revert naming AVAILABLE_DATASETS
Mar 4, 2024
b2e9100
Pleasing isort / pre-commit
Mar 4, 2024
534232a
Fixes wrong isort - likely due to phasing out src directory
Mar 4, 2024
a610870
isort fixes
Mar 4, 2024
d695cde
Fixes 5min timeout issue when downloading large datasets
Mar 4, 2024
9b70b84
Merge pull request #39 from OpenDrugDiscovery/docstring_enhancements
FNTwin Mar 5, 2024
0dd2184
Refactored into potential and interaction
Mar 5, 2024
817bbed
Cli test
Mar 5, 2024
e03c6b1
WIP
Mar 7, 2024
f394953
Merge branch 'develop' into minor_fixes
Mar 7, 2024
362d0f4
Correct units for Transition1X
Mar 8, 2024
8ca966a
Merge branch 'refactoring' into minor_fixes
Mar 8, 2024
088d457
Merge pull request #34 from OpenDrugDiscovery/minor_fixes
FNTwin Mar 8, 2024
1e6652f
Cli docs + readme
Mar 8, 2024
b297cef
Merge branch 'refactoring' of https://github.com/OpenDrugDiscovery/op…
Mar 8, 2024
56d1d1b
Cli update
Mar 8, 2024
2 changes: 1 addition & 1 deletion .github/workflows/code-check.yml
@@ -24,7 +24,7 @@ jobs:

- name: Install black
run: |
- pip install black>=23
+ pip install black>=24

- name: Lint
run: black --check .
43 changes: 38 additions & 5 deletions README.md
@@ -19,6 +19,26 @@ You can run tests locally with:
pytest
```

### Documentation

You can build the documentation locally with:

```bash
mkdocs serve
```

# Downloading Datasets

A command line interface is available to download datasets or list the available ones. Run `openqdc --help` for details.

```bash
# Display the available datasets
openqdc datasets

# Download the Spice and QMugs datasets
openqdc download --datasets Spice QMugs
```

# Overview of Datasets

<!-- Create a table with the following columns
@@ -32,17 +52,30 @@ pytest

We provide support for the following publicly available QM Datasets.

# Potential Energy

| Dataset | # Molecules | # Conformers | Average Conformers per Molecule | Force Labels | Atom Types | QM Level of Theory | Off-Equilibrium Conformations|
| --- | --- | --- | --- | --- | --- | --- | --- |
| [ANI](https://pubs.rsc.org/en/content/articlelanding/2017/SC/C6SC05720A) | 57,462 | 20,000,000 | 348 | No | 4 | ωB97x:6-31G(d) | Yes |
| [GEOM](https://www.nature.com/articles/s41597-022-01288-4) | 450,000 | 37,000,000 | 82 | No | 18 | GFN2-xTB | No |
| [Molecule3D](https://arxiv.org/abs/2110.01717) | 3,899,647 | 3,899,647 | 1 | No | 5 | B3LYP/6-31G* | No |
| [NablaDFT](https://pubs.rsc.org/en/content/articlelanding/2022/CP/D2CP03966D) | 1,000,000 | 5,000,000 | 5 | No | 6 | ωB97X-D/def2-SVP | |
| [OrbNet Denali](https://arxiv.org/abs/2107.00299) | 212,905 | 2,300,000 | 11 | No | 16 | GFN1-xTB | Yes |
| [PCQM_PM6](https://pubs.acs.org/doi/abs/10.1021/acs.jcim.0c00740) | | | 1| No| | PM6 | No
| [PCQM_B3LYP](https://arxiv.org/abs/2305.18454) | 85,938,443|85,938,443 | 1| No| | B3LYP/6-31G* | No
| [QMugs](https://www.nature.com/articles/s41597-022-01390-7) | 665,000 | 2,000,000 | 3 | No | 10 | GFN2-xTB, ωB97X-D/def2-SVP | No |
| [QM7X](https://www.nature.com/articles/s41597-021-00812-2) | 6,950 | 4,195,237 | 603 | Yes | 7 | PBE0+MBD | Yes |
| [SN2RXN](https://pubs.acs.org/doi/10.1021/acs.jctc.9b00181) | 39 | 452709 | 11,600 | Yes | 6 | DSD-BLYP-D3(BJ)/def2-TZVP | |
| [SolvatedPeptides](https://doi.org/10.1021/acs.jctc.9b00181) | | 2,731,180 | | Yes | | revPBE-D3(BJ)/def2-TZVP | |
| [Spice](https://arxiv.org/abs/2209.10702) | 19,238 | 1,132,808 | 59 | Yes | 15 | ωB97M-D3(BJ)/def2-TZVPPD | Yes |
- | [ANI](https://pubs.rsc.org/en/content/articlelanding/2017/SC/C6SC05720A) | 57,462 | 20,000,000 | 348 | No | 4 | ωB97x:6-31G(d) | Yes |
- | [tmQM](https://pubs.acs.org/doi/10.1021/acs.jcim.0c01041) | 86,665 | | | No | | TPSSh-D3BJ/def2-SVP | |
+ | [tmQM](https://pubs.acs.org/doi/10.1021/acs.jcim.0c01041) | 86,665 | 86,665| 1| No | | TPSSh-D3BJ/def2-SVP | |
| [Transition1X](https://www.nature.com/articles/s41597-022-01870-w) | | 9,654,813| | Yes | | ωB97x/6–31 G(d) | Yes |
| [WaterClusters](https://doi.org/10.1063/1.5128378) | 1 | 4,464,740| | No | 2 | TTM2.1-F | Yes|


# Interaction energy

| Dataset | # Molecules | # Conformers | Average Conformers per Molecule | Force Labels | Atom Types | QM Level of Theory | Off-Equilibrium Conformations|
| --- | --- | --- | --- | --- | --- | --- | --- |
| [DES370K](https://www.nature.com/articles/s41597-021-00833-x) | 3,700 | 370,000 | 100 | No | 20 | CCSD(T) | Yes |
| [DES5M](https://www.nature.com/articles/s41597-021-00833-x) | 3,700 | 5,000,000 | 1351 | No | 20 | SNS-MP2 | Yes |
- | [OrbNet Denali](https://arxiv.org/abs/2107.00299) | 212,905 | 2,300,000 | 11 | No | 16 | GFN1-xTB | Yes |
- | [SN2RXN](https://pubs.acs.org/doi/10.1021/acs.jctc.9b00181) | 39 | 452709 | 11,600 | Yes | 6 | DSD-BLYP-D3(BJ)/def2-TZVP | |
- | [QM7X](https://www.nature.com/articles/s41597-021-00812-2) | 6,950 | 4,195,237 | 603 | Yes | 7 | PBE0+MBD | Yes |
793 changes: 598 additions & 195 deletions docs/tutorials/usage.ipynb

Large diffs are not rendered by default.

5 changes: 4 additions & 1 deletion env.yml
@@ -9,6 +9,8 @@ dependencies:
- loguru
- fsspec
- gcsfs
- typer
- prettytable

# Scientific
- pandas
@@ -28,7 +30,7 @@ dependencies:
- pytest >=6.0
- pytest-cov
- nbconvert
- - black >=23
+ - black >=24
- jupyterlab
- pre-commit
- ruff
@@ -42,3 +44,4 @@ dependencies:
- mkdocs-jupyter
- markdown-include
- mdx_truly_sane_lists
- mkdocstrings-python
4 changes: 2 additions & 2 deletions mkdocs.yml
@@ -58,13 +58,13 @@ plugins:
- search
- mkdocstrings:
watch:
- - src/
+ - openqdc/
handlers:
python:
setup_commands:
- import sys
- sys.path.append("docs")
- - sys.path.append("src")
+ - sys.path.append("openqdc")
selection:
new_path_syntax: yes
rendering:
28 changes: 26 additions & 2 deletions src/openqdc/__init__.py → openqdc/__init__.py
@@ -8,6 +8,8 @@
# Dictionary of objects to lazily import; maps the object's name to its module path

_lazy_imports_obj = {
"__version__": "openqdc._version",
"BaseDataset": "openqdc.datasets.base",
"ANI1": "openqdc.datasets.ani",
"ANI1CCX": "openqdc.datasets.ani",
"ANI1X": "openqdc.datasets.ani",
@@ -21,7 +23,7 @@
"OrbnetDenali": "openqdc.datasets.orbnet_denali",
"SN2RXN": "openqdc.datasets.sn2_rxn",
"QM7X": "openqdc.datasets.qm7x",
- "DESS": "openqdc.datasets.dess",
+ "DES": "openqdc.datasets.des",
"NablaDFT": "openqdc.datasets.nabladft",
"SolvatedPeptides": "openqdc.datasets.solvated_peptides",
"WaterClusters": "openqdc.datasets.waterclusters3_30",
@@ -30,6 +32,7 @@
"PCQM_B3LYP": "openqdc.datasets.pcqm",
"PCQM_PM6": "openqdc.datasets.pcqm",
"Transition1X": "openqdc.datasets.transition1x",
"AVAILABLE_DATASETS": "openqdc.datasets",
}

_lazy_imports_mod = {"datasets": "openqdc.datasets", "utils": "openqdc.utils"}
@@ -61,4 +64,25 @@ def __dir__():
if TYPE_CHECKING or os.environ.get("OPENQDC_DISABLE_LAZY_LOADING", "0") == "1":
# These types are imported lazily at runtime, but we need to tell type
# checkers what they are.
- from .datasets import *
from ._version import __version__ # noqa
from .datasets import AVAILABLE_DATASETS # noqa
from .datasets.ani import ANI1, ANI1CCX, ANI1X # noqa
from .datasets.base import BaseDataset # noqa
from .datasets.comp6 import COMP6 # noqa
from .datasets.des import DES # noqa
from .datasets.dummy import Dummy # noqa
from .datasets.gdml import GDML # noqa
from .datasets.geom import GEOM # noqa
from .datasets.iso_17 import ISO17 # noqa
from .datasets.molecule3d import Molecule3D # noqa
from .datasets.nabladft import NablaDFT # noqa
from .datasets.orbnet_denali import OrbnetDenali # noqa
from .datasets.pcqm import PCQM_B3LYP, PCQM_PM6 # noqa
from .datasets.qm7x import QM7X # noqa
from .datasets.qmugs import QMugs # noqa
from .datasets.sn2_rxn import SN2RXN # noqa
from .datasets.solvated_peptides import SolvatedPeptides # noqa
from .datasets.spice import Spice # noqa
from .datasets.tmqm import TMQM # noqa
from .datasets.transition1x import Transition1X # noqa
from .datasets.waterclusters3_30 import WaterClusters # noqa
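
The `_lazy_imports_obj` mapping in this file resolves each public name to its module only when the name is first accessed. The loader itself falls outside the lines shown in this diff, so what follows is only a minimal sketch of the general pattern (a module-level `__getattr__`, PEP 562). Standard-library modules stand in for the dataset modules, and `_LAZY_OBJECTS` and `lazy_getattr` are hypothetical names, not openqdc's.

```python
import importlib

# Hypothetical mapping for illustration: attribute name -> module that defines it.
# Standard-library modules stand in for the openqdc dataset modules.
_LAZY_OBJECTS = {
    "sqrt": "math",
    "OrderedDict": "collections",
}


def lazy_getattr(name):
    """Import the owning module on first access and return the attribute."""
    if name not in _LAZY_OBJECTS:
        raise AttributeError(f"no lazily importable attribute named {name!r}")
    module = importlib.import_module(_LAZY_OBJECTS[name])
    return getattr(module, name)


# In a package __init__.py this function would be named __getattr__ (PEP 562),
# so that `package.sqrt` triggers the import only when it is first accessed.
print(lazy_getattr("sqrt")(9.0))
```

The explicit imports under `TYPE_CHECKING` complement this pattern: static type checkers cannot follow a dynamic `__getattr__`, so they need the names spelled out.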
11 changes: 11 additions & 0 deletions openqdc/_version.py
@@ -0,0 +1,11 @@
try:
    from importlib.metadata import PackageNotFoundError, version
except ModuleNotFoundError:
    # Fall back to the `importlib_metadata` backport on Python < 3.8.
    from importlib_metadata import PackageNotFoundError, version

try:
    __version__ = version("openqdc")
except PackageNotFoundError:
    # package is not installed
    __version__ = "dev"
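
The fallback behaviour of this module can be tried in isolation. The sketch below wraps the same `importlib.metadata` lookup in a helper; `resolve_version` is a hypothetical name introduced here for illustration and assumes Python 3.8 or newer.

```python
from importlib.metadata import PackageNotFoundError, version


def resolve_version(dist_name, fallback="dev"):
    """Return the installed version of dist_name, or fallback if it is not installed."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        # The distribution has no installed metadata, e.g. when running from a source checkout.
        return fallback


# A name that is not installed exercises the fallback branch.
print(resolve_version("a-distribution-that-is-not-installed"))
```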
77 changes: 77 additions & 0 deletions openqdc/cli.py
@@ -0,0 +1,77 @@
from typing import List, Optional

import typer
from loguru import logger
from prettytable import PrettyTable
from typing_extensions import Annotated

from openqdc import AVAILABLE_DATASETS
from openqdc.raws.config_factory import DataConfigFactory
from openqdc.raws.fetch import DataDownloader

app = typer.Typer(help="OpenQDC CLI")


def exist_dataset(dataset):
    if dataset not in AVAILABLE_DATASETS:
        logger.error(f"{dataset} is not available. Please open an issue on Github for the team to look into it.")
        return False
    return True


@app.command()
def download(
    datasets: List[str],
    overwrite: Annotated[
        bool,
        typer.Option(
            help="Whether to overwrite the datasets",
        ),
    ] = False,
    cache_dir: Annotated[
        Optional[str],
        typer.Option(
            help="Path to cache directory",
        ),
    ] = None,
):
    """
    Download preprocessed datasets from openQDC.
Review comment (Contributor): Please update the documentation so the help tells the user more about the arguments.

Reply (Collaborator Author): Updated. Typer also provides --help for the subcommands; I will make that clearer in the README.
    """
    for dataset in list(map(lambda x: x.lower().replace("_", ""), datasets)):
        if exist_dataset(dataset):
            if AVAILABLE_DATASETS[dataset].no_init().is_cached() and not overwrite:
                logger.info(f"{dataset} is already cached. Skipping download")
            else:
                AVAILABLE_DATASETS[dataset](overwrite_local_cache=True, cache_dir=cache_dir)
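
The `download` loop normalizes each requested name (lower-case, underscores removed) before looking it up, which is why `Spice`, `spice` and `PCQM_B3LYP` are all accepted. A minimal sketch of that lookup follows. The `registry` dictionary is a stand-in for `AVAILABLE_DATASETS`, and `normalize_name` is a hypothetical helper name.

```python
def normalize_name(name):
    """Lower-case a dataset name and drop underscores, as the download loop does."""
    return name.lower().replace("_", "")


# Stand-in registry for illustration; the real AVAILABLE_DATASETS maps keys to dataset classes.
registry = {"spice": "Spice", "qmugs": "QMugs", "pcqmb3lyp": "PCQM_B3LYP"}

for requested in ["Spice", "PCQM_B3LYP", "NotADataset"]:
    key = normalize_name(requested)
    status = "found" if key in registry else "not available"
    print(f"{requested} -> {key}: {status}")
```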


@app.command()
def datasets():
    """
    Print the available datasets.
    """
    table = PrettyTable(["Name", "Forces", "Level of theory"])
    for dataset in AVAILABLE_DATASETS:
        empty_dataset = AVAILABLE_DATASETS[dataset].no_init()
        has_forces = False if not empty_dataset.__force_methods__ else True
        table.add_row([dataset, has_forces, ",".join(empty_dataset.__energy_methods__)])
    table.align = "l"
    print(table)


@app.command()
def fetch(datasets: List[str]):
    """
    Download the raw datasets files from openQDC.
    """
    if datasets[0] == "all":
        dataset_names = DataConfigFactory.available_datasets

    for dataset_name in dataset_names:
        dd = DataDownloader()
        dd.from_name(dataset_name)
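
In `fetch` as shown, `dataset_names` is assigned only when the first argument is `"all"`. If the loop runs at function level, any other input would reach it with the name unbound. One possible way to cover both cases is sketched below. `resolve_dataset_names` and the `available` list are hypothetical and stand in for `DataConfigFactory.available_datasets`; whether the project adopted this form is not shown in this diff.

```python
from typing import List, Sequence


def resolve_dataset_names(requested: Sequence[str], available: Sequence[str]) -> List[str]:
    """Expand the special value "all"; otherwise return the requested names unchanged."""
    if requested and requested[0] == "all":
        return list(available)
    return list(requested)


# Hypothetical list standing in for DataConfigFactory.available_datasets.
available = ["spice", "qmugs", "des"]
print(resolve_dataset_names(["all"], available))
print(resolve_dataset_names(["spice"], available))
```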


if __name__ == "__main__":
    app()
44 changes: 44 additions & 0 deletions openqdc/datasets/__init__.py
@@ -0,0 +1,44 @@
from .base import BaseDataset # noqa
from .interaction import DES # noqa
from .potential.ani import ANI1, ANI1CCX, ANI1X # noqa
from .potential.comp6 import COMP6 # noqa
from .potential.dummy import Dummy # noqa
from .potential.gdml import GDML # noqa
from .potential.geom import GEOM # noqa
from .potential.iso_17 import ISO17 # noqa
from .potential.molecule3d import Molecule3D # noqa
from .potential.nabladft import NablaDFT # noqa
from .potential.orbnet_denali import OrbnetDenali # noqa
from .potential.pcqm import PCQM_B3LYP, PCQM_PM6 # noqa
from .potential.qm7x import QM7X # noqa
from .potential.qmugs import QMugs # noqa
from .potential.sn2_rxn import SN2RXN # noqa
from .potential.solvated_peptides import SolvatedPeptides # noqa
from .potential.spice import Spice # noqa
from .potential.tmqm import TMQM # noqa
from .potential.transition1x import Transition1X # noqa
from .potential.waterclusters3_30 import WaterClusters # noqa

AVAILABLE_DATASETS = {
    "ani1": ANI1,
    "ani1ccx": ANI1CCX,
    "ani1x": ANI1X,
    "comp6": COMP6,
    "des": DES,
    "gdml": GDML,
    "geom": GEOM,
    "iso17": ISO17,
    "molecule3d": Molecule3D,
    "nabladft": NablaDFT,
    "orbnetdenali": OrbnetDenali,
    "pcqmb3lyp": PCQM_B3LYP,
    "pcqmpm6": PCQM_PM6,
    "qm7x": QM7X,
    "qmugs": QMugs,
    "sn2rxn": SN2RXN,
    "solvatedpeptides": SolvatedPeptides,
    "spice": Spice,
    "tmqm": TMQM,
    "transition1x": Transition1X,
    "watercluster": WaterClusters,
}
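
The CLI looks up `name.lower().replace("_", "")` in this dictionary, so a key can be reached from a class-style name only if it equals that normalized form. The key `"watercluster"` (singular) differs from the normalized class name `waterclusters`; the PR does not say whether this is intentional. A small check along the following lines could flag such keys. The classes below are empty stand-ins, and `mismatched_keys` is a hypothetical helper.

```python
def normalize_name(name):
    """Lower-case a name and drop underscores, matching the CLI lookup."""
    return name.lower().replace("_", "")


def mismatched_keys(registry):
    """Return keys that differ from the normalized name of the class they map to."""
    return [key for key, cls in registry.items() if key != normalize_name(cls.__name__)]


# Stand-in classes for illustration only; they carry no dataset logic.
class Spice:
    pass


class PCQM_B3LYP:
    pass


class WaterClusters:
    pass


registry = {"spice": Spice, "pcqmb3lyp": PCQM_B3LYP, "watercluster": WaterClusters}
print(mismatched_keys(registry))
```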