
Replace btree with btree_gin indexes for PostgreSQL to allow larger metadata and faster nested search #588

Merged · 14 commits into bluesky:main on Nov 14, 2023

Conversation

@Kezzsim (Contributor) commented Oct 24, 2023

The title explains it all. This PR is intended to have the following effects:

  • Enable PostgreSQL ingestion of real Bluesky runs from existing beamline datasets, which can occasionally have very large metadata. Previously, attempting this with the default btree index type failed with an error.
  • Enable the eq binary operation, for queries like .search(Key("start.purpose") == "test"), to utilize the new index for rapid traversal, greatly reducing query time when searching for nested strings in metadata objects.

Caveats:

  • The btree_gin index is currently not invoked for the other binary operations (ne, lt, le, gt, ge), so those conditions still fall back to slow sequential scans. Hypothetically this could be supported, but no way to do it is currently known.
  • The eq operator is actually translated to the PostgreSQL @> (JSONB containment) operator rather than a strict equality test, which may have unintended side effects (see the sketch below).
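For illustration, here is a minimal sketch of the mechanism (not this PR's actual code). With the btree_gin extension installed, a GIN index on the JSONB metadata column lets PostgreSQL answer @> containment queries without visiting every row. The nodes table and metadata column names follow the error log quoted later in this thread; the DSN, the exact index definition, and the query text are assumptions.

# Sketch only: an eq search expressed as a JSONB containment query
# that a GIN index can serve. asyncpg is the driver this repo already
# uses (via SQLAlchemy); here it is used directly for brevity.
import asyncio
import asyncpg

async def main():
    conn = await asyncpg.connect("postgresql://localhost/tiled")  # hypothetical DSN
    # One-time setup: enable the extension and index the metadata column.
    await conn.execute("CREATE EXTENSION IF NOT EXISTS btree_gin")
    await conn.execute(
        "CREATE INDEX IF NOT EXISTS top_level_metadata "
        "ON nodes USING gin (metadata)"
    )
    # .search(Key("start.purpose") == "test") becomes, per this PR,
    # a containment test that the GIN index can answer quickly:
    rows = await conn.fetch(
        "SELECT key FROM nodes WHERE metadata @> $1::jsonb",
        '{"start": {"purpose": "test"}}',
    )
    print([row["key"] for row in rows])
    await conn.close()

asyncio.run(main())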

@danielballan (Member)
Point of clarification: in the end, was the issue the size of a given key, or the overall size of the document? I thought it seemed to be the former but turned out to be the latter, though I could be misremembering or behind. Worth nailing that down here for the record in case we revisit later.

elif dialect_name == "postgresql" and operation == operator.eq:
    condition = orm.Node.metadata_.op("@>")(
        type_coerce(
            {keys[0]: reduce(lambda x, y: {y: x}, keys[1:][::-1], query.value)},
@danielballan (Member)

Clever. Can you refactor this into a utility function so that it can be precisely targeted with a unit test? It might also end up being reused elsewhere in the future.
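For instance (a hypothetical sketch with illustrative names, not the PR's final code), the nesting logic could be pulled out like this:

from functools import reduce

def keys_to_nested_dict(keys, value):
    """Nest a key path into the document shape that @> expects.

    >>> keys_to_nested_dict(["start", "purpose"], "test")
    {'start': {'purpose': 'test'}}
    """
    return reduce(lambda acc, key: {key: acc}, reversed(keys), value)

A unit test can then assert on the returned dictionary directly, without constructing any SQLAlchemy expression.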

Review thread on tiled/catalog/adapter.py (outdated, resolved)
@danielballan (Member)

CI shows that this test is failing:

async def test_metadata_index_is_used(a):

I think this test (mine) was not the right approach. The query planner's internal details are slippery and may depend on the number of records, the available system resources, the version of Postgres or SQLite, and so on.

I think it would be best to develop a separate test harness, which may or may not use pytest, to evaluate index usage. Like performance benchmarks, this is a different category of test and does not sit well inside the unit test suite. I'd be in favor of just ripping this out for now, but I'm open to alternatives.
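Such a harness might, for example, run EXPLAIN and inspect the plan text for index usage. This is only a sketch under assumed table, index, and DSN names, and it inherits exactly the fragility described above:

# Sketch: ask the planner whether a containment query would use the
# GIN index. Table/index names are assumptions, not repo fixtures.
import asyncio
import asyncpg

EXPLAIN = (
    "EXPLAIN SELECT key FROM nodes "
    "WHERE metadata @> '{\"start\": {\"purpose\": \"test\"}}'::jsonb"
)

async def index_is_used(dsn: str) -> bool:
    conn = await asyncpg.connect(dsn)
    try:
        rows = await conn.fetch(EXPLAIN)
        plan = "\n".join(row[0] for row in rows)
        # Fragile by nature: on a small table the planner may prefer
        # a sequential scan even when the index exists.
        return "top_level_metadata" in plan and "Seq Scan" not in plan
    finally:
        await conn.close()

if __name__ == "__main__":
    print(asyncio.run(index_is_used("postgresql://localhost/tiled")))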

@danielballan (Member)

@jmaruland Your (C2QA) use case is one motivation for this fix. At your convenience, would you check out this PR branch locally, from @Kezzsim's fork, start a temp catalog server, and test that you can insert your data?

@Kezzsim (Contributor, Author) commented Oct 25, 2023

> Point of clarification: in the end, was the issue about the size of a given key or was it the overall size of the document? I thought it seemed to be the former but turned out to be the latter, but I could be misremembering or behind. Worth nailing that down here for the record in case we revisit later.

Technically it isn't key-related; it's related to overall metadata size. Here's an example error from importing a Bluesky run:

sqlalchemy.exc.DBAPIError: (sqlalchemy.dialects.postgresql.asyncpg.Error) <class 'asyncpg.exceptions.ProgramLimitExceededError'>: index row size 2792 exceeds btree version 4 maximum 2704 for index "top_level_metadata"
DETAIL:  Index row references tuple (40,10) in relation "nodes".
HINT:  Values larger than 1/3 of a buffer page cannot be indexed.
Consider a function index of an MD5 hash of the value, or use full text indexing.
[SQL: INSERT INTO nodes (key, ancestors, structure_family, metadata, specs) VALUES ($1::VARCHAR, $2::JSONB, $3::structurefamily, $4::JSONB, $5::JSONB) RETURNING nodes.id, nodes.time_created, nodes.time_updated]
[parameters: ('b603b6ae-5cd8-43d7-b340-b7d8678ad7c4', '[]', 'container', '{"start":{"time":1527797320.2107823,"uid":"b603b6ae-5cd8-43d7-b340-b7d8678ad7c4","motors":["dwell_time_dwell_time","dcm_energy"],"XDI,Mono,name":"Si( ... (5836 characters truncated) ... mp":1527797320.2107823,"datetime":"2018-05-31T20:08:40.210782+00:00","plan_name":"scan_nd","stream_names":["primary"],"duration":178.91345500946045}}', '[]')]
(Background on this error at: https://sqlalche.me/e/20/dbapi)

I will remove the term "keys" from the PR title and description for clarity.
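For the record, the failure mode is reproducible with any document whose serialized metadata exceeds roughly 2704 bytes (one third of an 8 kB page), because a plain btree index must store the entire indexed value in a single index row; a GIN index instead indexes individual keys and values and has no such per-row cap. A minimal reproduction sketch, reusing the INSERT shape from the error above (the DSN is hypothetical):

# Sketch: inserting metadata larger than the btree row cap fails under
# the old index and succeeds under the GIN index from this PR.
import asyncio
import json
import uuid
import asyncpg

async def main():
    conn = await asyncpg.connect("postgresql://localhost/tiled")
    big = json.dumps({"start": {"notes": "x" * 5000}})  # well past ~2704 bytes
    await conn.execute(
        "INSERT INTO nodes (key, ancestors, structure_family, metadata, specs) "
        "VALUES ($1, '[]'::jsonb, 'container', $2::jsonb, '[]'::jsonb)",
        str(uuid.uuid4()), big,
    )
    await conn.close()

asyncio.run(main())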

@Kezzsim changed the title from "Replace btree with btree_gin indexes for PostgreSQL to allow larger metadata keys and faster nested search" to "Replace btree with btree_gin indexes for PostgreSQL to allow larger metadata and faster nested search" on Oct 25, 2023
@danielballan (Member)

Once CI is passing, we must remember to add a database migration that drops the old index and replaces it with the new one on existing deployments. (I can pair-code this on Zoom, @Kezzsim; the process is not too complicated, but a little obscure.)
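A migration along those lines might look roughly like the following Alembic sketch. The index name follows the error log above; the indexed column list and the shape of the original btree index are assumptions, not the migration that ultimately landed:

# Alembic migration sketch (revision identifiers omitted).
from alembic import op

def upgrade():
    # btree_gin must be installed before a GIN index can mix
    # scalar columns with JSONB.
    op.execute("CREATE EXTENSION IF NOT EXISTS btree_gin")
    op.drop_index("top_level_metadata", table_name="nodes")
    op.create_index(
        "top_level_metadata",
        "nodes",
        ["metadata"],
        postgresql_using="gin",
    )

def downgrade():
    op.drop_index("top_level_metadata", table_name="nodes")
    # Restore the previous (btree) index definition.
    op.create_index("top_level_metadata", "nodes", ["metadata"])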

@danielballan merged commit d9fb7f8 into bluesky:main on Nov 14, 2023
8 checks passed