
Implement client-side dataset caching #802

Merged: 22 commits merged into main from qcportal_caching on Feb 27, 2024
Conversation

@bennybp (Contributor) commented on Jan 12, 2024

Description

Previously, dataset information was not cached locally at all, so rerunning a script, or just calling client.get_dataset again, could require re-fetching all of the data, even if it had been fetched before.

This PR implements that caching. Records are now stored in an SQLite database, either in a file or in memory. Some care has been taken to keep the cache as up-to-date as possible, but I am sure there are still loopholes. This includes records writing themselves back to the cache when they have been updated with additional data (for example, after fetching molecules or trajectories).

There are a few ways to use this (a short sketch follows the list):

  • Set the cache_dir parameter when creating a client. This will automatically create SQLite files for each dataset and re-use them as long as the same cache_dir is used when constructing later clients.
  • The PortalClient has a dataset_from_cache function where you can pass a file directly (i.e., one downloaded out-of-band).
  • There is a free dataset_from_cache function in dataset_models.py that works similarly, but results in an offline dataset object completely disconnected from any server.
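
A short sketch of the three modes, assuming the API described above; the server URL and cache file path are copied from the examples later in this thread and are illustrative only:

import qcportal
from qcportal import dataset_models

# 1. Persistent cache directory: an SQLite file is created per dataset and
#    reused whenever a client is constructed with the same cache_dir
client = qcportal.PortalClient("https://api.qcarchive.molssi.org:443", cache_dir="./cache2")
ds = client.get_dataset("torsiondrive", "XtalPi Shared Fragments TorsiondriveDataset v1.0")

# 2. Load a cache file directly through the client (e.g. one downloaded out-of-band)
ds_from_file = client.dataset_from_cache("./cache2/api.qcarchive.molssi.org_443/dataset_378.sqlite")

# 3. Load a cache file with no client at all: a fully offline dataset object
ds_offline = dataset_models.dataset_from_cache("./cache2/api.qcarchive.molssi.org_443/dataset_378.sqlite")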

This is purely a client-side change, so this branch will work with the currently-deployed MolSSI QCArchive servers.

There is still some polishing to be done (and docs to be written), but I am looking for feedback and bug reports before merging.

See #740

Todos and missing features:

  • refresh_cache needs to be finished
  • A function to delete the cache(s)
  • Cache size restrictions are not yet enforced
  • Docs, of course

Changelog description

Implement client-side caching of datasets

Status

  • Code base linted
  • Ready to go

@bennybp (Contributor, Author) commented on Jan 16, 2024

After testing this, the way records are cached (with all their child records) is not going to work. It works for singlepoints, but torsiondrives are way too big, so this is going to need a bit more work.

I think the solution is to store individual records (without children) in a separate table, and store foreign keys in the current record_data table. Some of these tables probably need to be renamed to something like dataset_records.
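
A hypothetical sketch of that layout (table, column, and file names here are illustrative, not the actual implementation):

import sqlite3

conn = sqlite3.connect("dataset_cache.sqlite")  # illustrative file name

# Individual records stored once, without their child records serialized inline
conn.execute("""
    CREATE TABLE IF NOT EXISTS records (
        id INTEGER PRIMARY KEY,     -- record id from the server
        record_data BLOB NOT NULL,  -- serialized record, children excluded
        modified_on TEXT            -- used to decide when a refresh is needed
    )
""")

# Dataset membership: (entry, specification) points at a record via a foreign key
conn.execute("""
    CREATE TABLE IF NOT EXISTS dataset_records (
        entry_name TEXT NOT NULL,
        specification TEXT NOT NULL,
        record_id INTEGER NOT NULL REFERENCES records(id),
        PRIMARY KEY (entry_name, specification)
    )
""")
conn.commit()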

@j-wags commented on Feb 16, 2024

I've been playing around with this today, and it's great! Very intuitive. Two comments follow, each with an importance rating out of 10 (I wouldn't consider either one blocking).

  • (5/10) "cache" makes me think that this will start overwriting itself in some conditions. I'd like to have a mode that's like "I have infinite disk space, don't limit the cache size". Is this the way it works currently/could this mode be added? If not, could the behavior be documented?
  • (2/10) It's a little incongruous that the API has me specify cache_dir, but then to load things using dataset_from_cache I need to provide a file path inside that cache dir, with no API point that tells me the path (as in, I have to go to my file system and ls around to get the path to the cache file).
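
One possible workaround, assuming the on-disk layout shown later in this thread (<cache_dir>/<host>_<port>/dataset_<id>.sqlite); this is a hypothetical helper, not part of the qcportal API:

import glob
import os

# Hypothetical: list the per-dataset cache files under a given cache_dir
cache_files = glob.glob(os.path.join("./cache2", "*", "dataset_*.sqlite"))
print(cache_files)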

The code that I'm playing with is:

import qcportal

client = qcportal.PortalClient("https://api.qcarchive.molssi.org:443", cache_dir="./cache2")
ds = client.get_dataset("torsiondrive", "XtalPi Shared Fragments TorsiondriveDataset v1.0")

# the next two lines didn't immediately do what I wanted, so I ran the loops below
#ds.fetch_entries()
#ds.fetch_records(include=["optimizations"], force_refetch=True) 

# iterating and touching the data forces it to be fetched and written to the cache
for entry in ds.iterate_entries():
    entry
for record in ds.iterate_records():
    # iterate_records yields (entry_name, specification_name, record) tuples
    for angle, opt in record[2].minimum_optimizations.items():
        opt.final_molecule

The resulting cache file size is pretty reasonable:

(bespokefit) jw@mba$ ls -lrth cache2/api.qcarchive.molssi.org_443/dataset_378.sqlite
-rw-r--r--  1 jeffreywagner  staff    13M Feb 16 15:32 cache2/api.qcarchive.molssi.org_443/dataset_378.sqlite

Then in a separate interpreter (and with minor changes to qcsubmit):

from qcportal import dataset_models
# build a fully offline dataset object directly from the cache file
ds2 = dataset_models.dataset_from_cache("./cache2/api.qcarchive.molssi.org_443/dataset_378.sqlite")
from openff.qcsubmit.results import TorsionDriveResultCollection
tdrc = TorsionDriveResultCollection.from_datasets([ds2])
tdrc

TorsionDriveResultCollection(entries={'local': [TorsionDriveResult(type='torsion', record_id=119138412, cmiles='[H:12][C:1](=[C:3]1[C:4](=[O:9])[N:8]([C@@:6]([C:5](=[O:10])[N:7]1[H:18])([H:17])[O:11][C:2]([H:14])([H:15])[H:16])[H:19])[H:13]', inchi_key='BFHIBVQMZBLHGM-OCSBBNMYNA-N'), TorsionDriveResult(type='torsion', ...

Success!!

@bennybp (Contributor, Author) commented on Feb 17, 2024

Glad it's working so far!

* (5/10) "cache" makes me think that this will start overwriting itself in some conditions. I'd like to have a mode that's like "I have infinite disk space, don't limit the cache size". Is this the way it works currently/could this mode be added? If not, could the behavior be documented?

At the moment, there is no limit on the cache size (effectively the same as setting the max size to None). I haven't added the logic for finite cache sizes yet. But yes, docs always need to be written.

> (2/10) It's a little incongruous that the API has me specify `cache_dir`, but then to load things using `dataset_from_cache` I need to provide a file path _inside_ that cache dir, with no API point that tells me the path (as in, I have to go to my file system and `ls` around to get the path to the cache file).

If you create a client with the same cache_dir, then it should automatically find the existing cache files when you use get_dataset. At least, that is the intent. dataset_from_cache is more for when you want to pass around or download the cache file separately (kind of like the old 'views').
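
A minimal sketch of that intended reuse, assuming the same cache_dir as in the example above (whether the data actually comes from the cache depends on this PR's implementation):

import qcportal

# Second session: same cache_dir as before, so get_dataset should find the
# existing per-dataset SQLite file and reuse the cached entries/records
client = qcportal.PortalClient("https://api.qcarchive.molssi.org:443", cache_dir="./cache2")
ds = client.get_dataset("torsiondrive", "XtalPi Shared Fragments TorsiondriveDataset v1.0")

# Iterating again should now be served largely from the local cache;
# iterate_records appears to yield (entry_name, specification_name, record) tuples
for entry_name, spec_name, record in ds.iterate_records():
    pass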

@bennybp (Contributor, Author) commented on Feb 27, 2024

I'm going to go ahead and merge this. There are still some tasks to be done before the next release, but it seems to be working well.

The main reason is that I have another feature being built on top of this, and leaving this open makes things a bit complicated.

@bennybp merged commit 345b92c into main on Feb 27, 2024 (17 checks passed).
@bennybp deleted the qcportal_caching branch on March 20, 2024.