Download example databases for use in CI tests #608

Kezzsim · 2023-11-16T21:41:30Z

Continuing from the last PR I submitted to Tiled which switched postgresql indexing from btree to btree_gin indexes to support faster queries, an issue emerged when it came to running accurate index usage tests.

The query planner will not use an index if a catalogue table contains fewer than 10,000 records. As a workaround, this PR looks to add a form of caching, via a container registry and the docker postgres image.
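
One way to make an index-usage test concrete is to inspect the text that PostgreSQL's EXPLAIN returns and check whether the expected index appears in a scan node. The helper below is a minimal sketch, not the test in this PR: the function name plan_uses_index, the index name metadata_index, and the sample plan lines are hypothetical illustrations.

```python
def plan_uses_index(plan_lines, index_name):
    """Return True if any EXPLAIN line shows a scan that names the given index."""
    for line in plan_lines:
        # PostgreSQL reports index access as "Index Scan", "Index Only Scan",
        # or "Bitmap Index Scan"; a "Seq Scan" means the table was read in full.
        if "Index" in line and "Scan" in line and index_name in line:
            return True
    return False


# Hypothetical plans, shaped like EXPLAIN output, for a small and a large table.
small_table_plan = [
    "Seq Scan on nodes  (cost=0.00..1.05 rows=1 width=64)",
    "  Filter: (metadata @> '{\"color\": \"red\"}'::jsonb)",
]
large_table_plan = [
    "Bitmap Heap Scan on nodes  (cost=12.01..16.02 rows=1 width=64)",
    "  Recheck Cond: (metadata @> '{\"color\": \"red\"}'::jsonb)",
    "  ->  Bitmap Index Scan on metadata_index  (cost=0.00..12.01 rows=1 width=0)",
]

print(plan_uses_index(small_table_plan, "metadata_index"))
print(plan_uses_index(large_table_plan, "metadata_index"))
```

A test built this way only passes when the table is large enough for the planner to prefer the index, which is why pre-populated data is needed.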

This is a work in progress to track changes and will contain numerous additional commits prior to any potential merge.

To see the specific tasks where the Asana app for GitHub is being used, see below:
- https://app.asana.com/0/0/1205837840821678
- https://app.asana.com/0/0/1206267196412358

danielballan · 2023-11-16T22:19:44Z

Very interesting! I like this general direction.

I do think we should avoid merging binary blobs like postgres-ci-db.sql into the repo. In general it's best to commit the reproducible code that creates the binary blob, not the binary blob itself.

Taking in your comments on Slack, I was thinking about an approach like this to stitch it all together:

Commit a short Python script that connects to a given DATABASE_URI and writes in the test data:

#!/usr/bin/env python

# Usage: generate_sample_data.py DATABASE_URI

import sys
from tiled.catalog import from_uri
from tiled.client import Context, from_context
from tiled.server.app import build_app

uri = sys.argv[1]
catalog = from_uri(uri)
with Context.from_app(build_app(catalog)) as context:
    client = from_context(context)
    # Write data
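
The "# Write data" step is left open above. As a rough sketch of what could feed it, the generator below produces deterministic synthetic metadata records in bulk, enough to exceed the row count mentioned in the PR description. The function name, field names, and value ranges are made up for illustration; writing each record through the Tiled client is omitted because the exact calls are not shown in this thread.

```python
import random


def generate_metadata(n, seed=0):
    """Yield n synthetic metadata dicts; the same seed gives the same records."""
    rng = random.Random(seed)
    colors = ["red", "green", "blue"]
    for i in range(n):
        yield {
            "index": i,
            "color": rng.choice(colors),
            "temperature": round(rng.uniform(250.0, 350.0), 2),
        }


records = list(generate_metadata(10_000))
print(len(records), records[0]["index"])
```

Seeding the generator keeps the published data reproducible from the script, in line with committing the code that creates the data instead of the data itself.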

At the top of ci.yml, add a job that:
1. Starts the postgres:16 image
2. Runs this script against it to populate it with test data
3. Commits and publishes that populated image to the GitHub container registry
Then, below in ci.yml, the unit tests can use that image. Anyone running the tests locally can fetch the image from the GitHub container registry and use it as well.
It will also be possible to generate a file like postgresql-ci-db.sql from the image, which may be a useful way to share the test data. But, as you alluded in Slack, a layered image is a convenient way to publish the data, especially because GitHub gives us a container registry to work with.

danielballan · 2023-11-21T20:19:49Z

Notes from Zoom chat:

Create a new repo, bluesky/tiled-example-database which will hold:

- Dockerfile
- GH Workflow that publishes image to that repo's "packages", following https://github.com/NSLS2/databroker-nsls2/blob/main/.github/workflows/publish-image.yml
- Data generation script

Then, in Tiled, as with other tests that have external dependencies, make the test skippable, conditional on an env var being set, where the env var points to the URI of the database. Notably, this does not strictly require the database to be running in a container, but the container will be a convenient way to set this up.

danielballan · 2023-11-29T17:00:39Z

I have pushed commits refactoring the pytest fixtures.

- Separate sqlite and postgres code branches into separate fixtures, following a pattern I learned from @padraic-shafer
- Make a second postgres-backed adapter fixture that looks for a specific named database that is expected to have example data. (The existing postgres-backed adapter fixture creates an empty database with a random name and cleans it up at exit.)

As before, if TILED_TEST_POSTGRESQL_URI is not set, the test is skipped as follows:

tiled/_tests/test_catalog.py::test_metadata_index_is_used SKIPPED (No TILED_TEST_POSTGRESQL_URI configured)                                                                                         [100%]

And now, if TILED_TEST_POSTGRESQL_URI is set but that PostgreSQL instance does not contain a pre-populated database with an expected name, it is also skipped:

tiled/_tests/test_catalog.py::test_metadata_index_is_used SKIPPED (PostgreSQL instance contains no database named 'example_data')                                                                   [100%]
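
The two skip conditions above can be sketched as a single decision function. This is an illustration under stated assumptions, not the fixture code in this PR: the function name skip_reason and its arguments are hypothetical, and only the environment variable name, the database name, and the two messages are taken from the output shown above.

```python
ENV_VAR = "TILED_TEST_POSTGRESQL_URI"
EXAMPLE_DB = "example_data"


def skip_reason(environ, database_names):
    """Return a skip message, or None if the example-data test can run."""
    if not environ.get(ENV_VAR):
        return f"No {ENV_VAR} configured"
    if EXAMPLE_DB not in database_names:
        return f"PostgreSQL instance contains no database named {EXAMPLE_DB!r}"
    return None


# In a real fixture, environ would be os.environ, database_names would come
# from the server (for example via "SELECT datname FROM pg_database"), and a
# non-None reason would be passed to pytest.skip(reason).
print(skip_reason({}, []))
print(skip_reason({ENV_VAR: "postgresql://localhost:5432"}, ["postgres"]))
print(skip_reason({ENV_VAR: "postgresql://localhost:5432"}, ["postgres", "example_data"]))
```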

TO DO:

- Update ci.yml to populate the PostgreSQL database with data from the tiled-example-database Releases.
- Update ci.yml to download a SQLite file from the tiled-example-database Releases.
- Confirm the name of the example database aligns.
- Add a SQLite-backed fixture with pre-populated data.

padraic-shafer · 2023-11-29T17:29:03Z

tiled/_tests/conftest.py

+    Note that startup() and shutdown() are not called, and must be run
+    either manually (as in the fixture 'a') or via the app (as in the fixture 'client').


For ease of use, it seems like it would be convenient to keep startup() and shutdown() in the fixtures. Is it straightforward to add a guard to startup() that checks whether it has already been called or whether an app is running?

I do dislike this structure and would welcome suggestions to improve it.

The problem is that startup() must be called on the same thread where the application will run. If this adapter is going to be used by the TestClient, via Context.from_app(build_app(adapter)), a background thread is created at that point and startup() needs to be run on that thread.

I see. In that case, I think the solution using the fixture a is probably already optimal.

One could of course add fixture b (or equivalent) for test_metadata_index_is_used(b) -- if you expect additional tests that would make use of postgresql_with_example_data_adapter:

@pytest_asyncio.fixture
async def b(postgresql_with_example_data_adapter):
    "Raw adapter, not to be used within an app because it is manually started and stopped."
    adapter = postgresql_with_example_data_adapter
    await adapter.startup()
    yield adapter
    await adapter.shutdown()

However, a DRYer and more composable version might look like this...

@pytest.mark.parametrize("a", ["postgresql_with_example_data_adapter"], indirect=True)
@pytest.mark.asyncio
async def test_metadata_index_is_used(a):
    # a, a.startup(), a.shutdown() are no longer needed
    ...

Marking the parameter as indirect passes each listed value to the fixture named a (available there as request.param) instead of to the test function directly. See "Indirect parametrization" and parametrize in the pytest documentation, and especially this informative example.
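
For the indirect approach to work, the fixture a has to resolve the name it receives. The following is a minimal synchronous sketch of that mechanism, with stand-in fixtures that return strings in place of real adapters; the fixture bodies and the test name test_uses_example_data are hypothetical and do not reflect the actual conftest.py.

```python
import pytest


@pytest.fixture
def sqlite_adapter():
    # Stand-in for a real adapter fixture.
    return "sqlite adapter"


@pytest.fixture
def postgresql_with_example_data_adapter():
    # Stand-in for the fixture that connects to the pre-populated database.
    return "postgresql adapter with example data"


@pytest.fixture
def a(request):
    # With indirect=True, request.param holds the fixture *name* supplied by
    # the test; look that fixture up and return its value.
    return request.getfixturevalue(request.param)


@pytest.mark.parametrize("a", ["postgresql_with_example_data_adapter"], indirect=True)
def test_uses_example_data(a):
    assert a == "postgresql adapter with example data"
```

Note that in this sketch a only works when a test parametrizes it, since request.param is not set otherwise; a real fixture would likely declare default params.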

💽 Create SQL test data and dockerfile spec

62b7f1e

Remove large binary file from history w/git-filter

10cbf89

Kezzsim force-pushed the sql-cache-ci branch from 850b0c1 to 10cbf89 Compare November 20, 2023 18:00

danielballan added 4 commits November 29, 2023 11:16

Rely on externally provided example data.

ee78c04

Remove slow-test code, no longer needed.

e8cc08e

Skip test if example database does not exist.

01722b6

Merge branch 'adapter-fixture' into sql-cache-ci

11fe5e2

padraic-shafer reviewed Nov 29, 2023


Kezzsim and others added 11 commits December 12, 2023 16:37

Docker container no longer hosted here

a8c0d1a

Merge branch 'bluesky:main' into sql-cache-ci

8d623b0

Merge branch 'bluesky:main' into sql-cache-ci

03dd13b

Download, get and mount 💾 .sql data (empty)

0551a63

Correct mountpoint 🐳

672f9ab

Namespaced test database deployment 📇

23e134c

Merge branch 'bluesky:main' into sql-cache-ci

74804fb

Pin Postgres image to ensure compat with .sql dump.

27a2216

Use curl instead of wget; make script idempotent.

97cfc5d

Download sqlite data.

318fb0a

Test against SQLite with example data

f51cbc3

danielballan changed the title ~~Cache postgresql database for more accurate and expedient C.I. tests~~ Cache example database for more accurate and expedient C.I. tests Jan 2, 2024

danielballan changed the title ~~Cache example database for more accurate and expedient C.I. tests~~ Download example databases for use in CI tests Jan 2, 2024

danielballan merged commit 58969e7 into bluesky:main Jan 2, 2024
8 checks passed
