Download example databases for use in CI tests #608

Merged
merged 17 commits into from
Jan 2, 2024

Kezzsim
Contributor

@Kezzsim Kezzsim commented Nov 16, 2023

This continues from the last PR I submitted to Tiled, which switched PostgreSQL indexing from btree to btree_gin indexes to support faster queries. An issue emerged when it came to running accurate index usage tests.

The query planner will not use an index if a catalogue table contains fewer than 10,000 records. As a workaround, this PR looks to add a form of caching, via a container registry and the docker postgres image.
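
An index usage test of this kind typically inspects the query plan and checks whether the expected index appears in it. The sketch below illustrates the idea with the standard-library `sqlite3` module as a stand-in, only so that it runs without a server. The table, index, and helper names are hypothetical and are not Tiled's schema. SQLite's planner behaves differently from PostgreSQL's, which chooses plans from cost estimates based on table statistics, hence the need for a large table in the real test.

```python
import sqlite3


def plan_uses_index(conn, sql, index_name, params=()):
    """Return True if the query plan for `sql` mentions `index_name`."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    # Each row has four columns; the last one is the human-readable detail,
    # which names any index the planner decided to use.
    return any(index_name in row[3] for row in rows)


# Hypothetical table and index, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE nodes (id INTEGER PRIMARY KEY, key TEXT, metadata TEXT)")
conn.execute("CREATE INDEX idx_nodes_key ON nodes (key)")
conn.executemany(
    "INSERT INTO nodes (key, metadata) VALUES (?, ?)",
    [(f"k{i}", "{}") for i in range(100)],
)

# A lookup on the indexed column should mention the index in its plan.
indexed = plan_uses_index(
    conn, "SELECT * FROM nodes WHERE key = ?", "idx_nodes_key", ("k5",)
)
# A lookup on an unindexed column should fall back to a table scan.
unindexed = plan_uses_index(
    conn, "SELECT * FROM nodes WHERE metadata = ?", "idx_nodes_key", ("{}",)
)
```

Against PostgreSQL the same pattern would run `EXPLAIN` on the query and search its output for the index name, with the caveat described above about small tables.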

This is a work in progress to track changes and will contain numerous additional commits prior to any potential merge.


@danielballan
Member

danielballan commented Nov 16, 2023

Very interesting! I like this general direction.

I do think we should avoid merging binary blobs like postgres-ci-db.sql into the repo. In general it's best to commit the reproducible code that creates the binary blob, not the binary blob itself.

Taking in your comments on Slack, I was thinking about an approach like this to stitch it all together:

  • Commit a short Python script that connects to a given DATABASE_URI and writes in the test data:

    #!/usr/bin/env python
    
    # Usage: generate_sample_data.py DATABASE_URI
    
    import sys
    from tiled.catalog import from_uri
    from tiled.client import Context, from_context
    from tiled.server.app import build_app
    
    uri = sys.argv[1]
    catalog = from_uri(uri)
    with Context.from_app(build_app(catalog)) as context:
        client = from_context(context)
        # Write data
  • At the top of ci.yml, add a job that:

    1. Starts the postgres:16 image
    2. Runs this script against it to populate it with test data
    3. Commits and publishes that populated image to the GitHub container registry
  • Then, below in ci.yml, the unit tests can use that image. Anyone running the tests locally can fetch the image from the GitHub container registry and use it as well.

  • It will also be possible to generate a file like postgresql-ci-db.sql from the image, which may be a useful way to share the test data. But, as you alluded in Slack, a layered image is a convenient way to publish the data, especially because GitHub gives us a container registry to work with.
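
The `# Write data` step in the script above is left open. As a rough sketch of what could feed it, the generator below produces reproducible synthetic metadata records. Everything here is hypothetical: the function name, the field names, and the record count (chosen only to exceed the 10,000-record threshold mentioned in the PR description). How each record is written through the Tiled client is not shown.

```python
import random

# Hypothetical metadata values, for illustration only.
COLORS = ("red", "green", "blue")


def synthetic_metadata(n, seed=0):
    """Yield `n` metadata dicts; the same seed always gives the same records."""
    rng = random.Random(seed)  # local generator, so global random state is untouched
    for i in range(n):
        yield {
            "index": i,
            "color": rng.choice(COLORS),
            "temperature": round(rng.uniform(0.0, 100.0), 2),
        }


# Enough records to exceed the threshold discussed above.
records = list(synthetic_metadata(10_001))
```

Seeding the generator keeps the published image reproducible: rerunning the job yields the same data, in line with committing the code that creates the data rather than the data itself.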

@danielballan
Member

Notes from Zoom chat:

Create a new repo, bluesky/tiled-example-database which will hold:

Then, in Tiled, as with other tests that have external dependencies, make the test skippable, conditional on an env var being set, where the env var points to the URI of the database. Notably, this does not strictly require the database to be running in a container, but the container will be a convenient way to set this up.

@danielballan
Member

I have pushed commits refactoring the pytests fixtures.

  1. Separate sqlite and postgres code branches into separate fixtures, following a pattern I learned from @padraic-shafer
  2. Make a second postgres-backed adapter fixture that looks for a specific named database that is expected to have example data. (The existing postgres-backed adapter fixture creates an empty database with a random name and cleans it up at exit.)

As before, if TILED_TEST_POSTGRESQL_URI is not set, the test is skipped as follows:

tiled/_tests/test_catalog.py::test_metadata_index_is_used SKIPPED (No TILED_TEST_POSTGRESQL_URI configured)                                                                                         [100%]

And now, if TILED_TEST_POSTGRESQL_URI is set but that PostgreSQL instance does not contain a pre-populated database with an expected name, it is also skipped:

tiled/_tests/test_catalog.py::test_metadata_index_is_used SKIPPED (PostgreSQL instance contains no database named 'example_data')                                                                   [100%]
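
The two skip conditions can be summarized as a small decision function. This is a sketch, not the fixture code in this PR: the function name and its arguments are hypothetical, and the list of database names is assumed to have been fetched from the PostgreSQL instance beforehand. The reason strings mirror the output shown above.

```python
def example_data_skip_reason(environ, database_names, expected="example_data"):
    """Return a skip reason, or None if the example-data test can run.

    `environ` is a mapping such as os.environ; `database_names` is the
    collection of database names present on the PostgreSQL instance.
    """
    if not environ.get("TILED_TEST_POSTGRESQL_URI"):
        return "No TILED_TEST_POSTGRESQL_URI configured"
    if expected not in database_names:
        return f"PostgreSQL instance contains no database named {expected!r}"
    return None


# Hypothetical inputs covering the three cases.
no_uri = example_data_skip_reason({}, [])
no_db = example_data_skip_reason(
    {"TILED_TEST_POSTGRESQL_URI": "postgresql://localhost:5432"}, ["postgres"]
)
ready = example_data_skip_reason(
    {"TILED_TEST_POSTGRESQL_URI": "postgresql://localhost:5432"},
    ["postgres", "example_data"],
)
```

Inside a fixture, a non-None reason could be passed to `pytest.skip(reason)` so that the test reports as SKIPPED rather than failing.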

TO DO:

  • Update ci.yml to populate the PostgreSQL database with data from the tiled-example-database Releases.
  • Update ci.yml to download a SQLite file from the tiled-example-database Releases.
  • Confirm the name of the example database aligns.
  • Add a SQLite-backed fixture with pre-populated data.

Comment on lines +177 to +178
Note that startup() and shutdown() are not called, and must be run
either manually (as in the fixture 'a') or via the app (as in the fixture 'client').
Contributor

For ease of use, it seems like it would be convenient to keep startup() and shutdown() in the fixtures. Is it straightforward to add a guard to startup() that checks whether it has already been called or whether an app is running?

Member

I do dislike this structure and would welcome suggestions to improve it.

The problem is that startup() must be called on the same thread where the application will run. If this adapter is going to be used by the TestClient, via Context.from_app(build_app(adapter)), a background thread is created at that point and startup() needs to be run on that thread.
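
To illustrate the constraint, here is a toy sketch using only the standard library. It assumes, purely for illustration, that the adapter holds resources tied to the event loop on which startup() ran. `FakeAdapter` and `run_in_thread` are hypothetical stand-ins and say nothing about Tiled's actual internals.

```python
import asyncio
import threading


class FakeAdapter:
    """Stand-in for an adapter whose resources are bound to one event loop."""

    def __init__(self):
        self.loop = None

    async def startup(self):
        # Remember the loop that startup() ran on.
        self.loop = asyncio.get_running_loop()

    async def use(self):
        if asyncio.get_running_loop() is not self.loop:
            raise RuntimeError("adapter used on a different loop than startup()")
        return "ok"


def run_in_thread(coro_factory):
    """Run a coroutine on a fresh event loop in a background thread."""
    result = {}

    def target():
        try:
            result["value"] = asyncio.run(coro_factory())
        except Exception as err:  # captured so the caller can inspect it
            result["error"] = err

    thread = threading.Thread(target=target)
    thread.start()
    thread.join()
    return result


# Started on the main thread, then used on the background thread: fails.
early = FakeAdapter()
asyncio.run(early.startup())
wrong = run_in_thread(early.use)


# Started and used on the background thread: works.
async def start_and_use():
    adapter = FakeAdapter()
    await adapter.startup()
    return await adapter.use()


right = run_in_thread(start_and_use)
```

In this toy model a fixture that eagerly calls startup() would bind the adapter to the wrong loop whenever the app later runs on its own thread, which matches the reason for leaving startup() to the caller.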

Contributor

I see. In that case, I think the solution using the fixture a is probably already optimal.


One could of course add fixture b (or equivalent) for test_metadata_index_is_used(b) -- if you expect additional tests that would make use of postgresql_with_example_data_adapter:

@pytest_asyncio.fixture
async def b(postgresql_with_example_data_adapter):
    "Raw adapter, not to be used within an app because it is manually started and stopped."
    adapter = postgresql_with_example_data_adapter
    await adapter.startup()
    yield adapter
    await adapter.shutdown()

However, a DRYer and more composable version might look like this...

@pytest.mark.parametrize("a", ["postgresql_with_example_data_adapter"], indirect=True)
@pytest.mark.asyncio
async def test_metadata_index_is_used(a):
    # a, a.startup(), a.shutdown() are no longer needed
    ...

Marking the parameter as indirect will override the argument passed to a parameterized fixture. See "Indirect parametrization" and parametrize in the pytest documentation, and especially this informative example.

@danielballan danielballan changed the title Cache postgresql database for more accurate and expedient C.I. tests Cache example database for more accurate and expedient C.I. tests Jan 2, 2024
@danielballan danielballan changed the title Cache example database for more accurate and expedient C.I. tests Download example databases for use in CI tests Jan 2, 2024
@danielballan danielballan merged commit 58969e7 into bluesky:main Jan 2, 2024
8 checks passed