Scientific Metadata Search Engine (Fulltext) implementation for PostgreSQL 🔎 #640

Kezzsim · 2024-01-24T21:53:07Z

This feature adds a rudimentary scientific metadata search engine, implemented purely through PostgreSQL's native tsvector and tsquery operations. It is documented as part of the Tiled client's FullText query.

Tasks:

Add fulltext logic to adapter.py to enable functionality
Create a new computed index metadata_search for storing jsonb_to_tsvector data
Generate Alembic migration to move existing tiled collections to the new standard
Add new tests that ensure the index is being invoked
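A minimal sketch of how the index and the query from the tasks above could fit together. The table name nodes, the columns key and metadata, and the index name are assumptions for illustration, not necessarily what this PR uses. The point it shows: PostgreSQL only considers an expression index when the query repeats the indexed expression, so both statements are built from one shared string.

```python
# Assumed names (not taken from the PR): table "nodes", JSONB column
# "metadata", column "key", and the index name below.
TSVECTOR_EXPR = "jsonb_to_tsvector('simple', metadata, '[\"string\"]')"

# GIN expression index over the string values found in the JSONB metadata.
CREATE_INDEX = (
    "CREATE INDEX metadata_tsvector_search ON nodes "
    f"USING gin ({TSVECTOR_EXPR})"
)

# The WHERE clause repeats the indexed expression verbatim.
# %(text)s is a driver-level named placeholder (psycopg style).
SEARCH = (
    "SELECT key FROM nodes "
    f"WHERE {TSVECTOR_EXPR} @@ to_tsquery('simple', %(text)s)"
)

print(CREATE_INDEX)
print(SEARCH)
```

Prefixing the search statement with EXPLAIN is one way a test could check whether the planner chooses the index, although on very small tables the planner may still prefer a sequential scan.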

Resolves #457

danielballan

Very excited to see this!

tiled/catalog/orm.py

tiled/catalog/adapter.py

tiled/catalog/migrations/versions/1cd99c02d0c7_create_index_for_fulltext_search.py

danielballan · 2024-01-24T22:50:11Z

Also: our FullText query accepts a case_sensitive parameter. I added that somewhat speculatively, when I was more naive about what is involved in supporting both options. As we're still in alpha, I wonder if it makes sense to remove that for now (being case-insensitive only) and add it back later if there is demand for it.

There are tradeoffs in insert speed and database size that make me question whether case-sensitive search is useful and important enough to justify them.

tiled/tiled/queries.py

Lines 42 to 50 in fca4330

    Parameters
    ----------
    text : str
    case_sensitive : bool, optional
        Default False (case-insensitive).
    """

    text: str
    case_sensitive: bool = False

dylanmcreynolds · 2024-01-26T21:01:29Z

tiled/_tests/test_queries.py

@@ -168,6 +171,9 @@ def cm():
        cm = nullcontext
    with cm():
        assert list(client.search(FullText("z"))) == ["z", "does_contain_z"]
+        # plainto_tsquery fails to find certain words, weirdly, so it is a useful
+        # test that we are using tsquery
+        assert list(client.search(FullText("purple"))) == ["full_text_test_case"]


I don't know how much to expect of this index, but could we do partial word search? e.g. urple

“light urple!”

I think support is limited, but robust fuzzy text search is next up for @Kezzsim.
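On the partial-word question: to_tsquery accepts a `:*` suffix on a term for prefix matching, so "purp" can match "purple", but there is no equivalent for matching from the middle or end of a word, so "urple" would not match. The helper below is a hypothetical sketch (it is not part of this PR) that turns free text into a prefix-matching tsquery string while dropping characters that to_tsquery treats as operators.

```python
import re


def to_prefix_tsquery(text: str) -> str:
    """Build a to_tsquery() argument that ANDs the terms and prefix-matches each one."""
    # Keep only runs of letters and digits, so tsquery operators
    # such as & | ! ( ) : * never reach the database.
    terms = re.findall(r"[^\W_]+", text.lower())
    return " & ".join(f"{term}:*" for term in terms)


print(to_prefix_tsquery("light purp"))  # light:* & purp:*
```

An input with no usable terms yields an empty string, which the caller would need to handle before querying. Matching arbitrary substrings such as "urple" would need a different mechanism, for example trigram matching with the pg_trgm extension.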

Kezzsim · 2024-01-26T21:03:42Z

The current CI failures are caused by the caching we set up previously. The ORM and Alembic need to handle the case where an index already exists in that cache.
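One way to tolerate an index that already exists is to make the creation statement idempotent. This is a sketch under assumptions, not necessarily how the PR resolved the issue: the function name, table name, column name, and index name are all hypothetical. It also returns nothing for non-PostgreSQL backends, since SQLite has no tsvector support.

```python
from typing import Optional


def create_index_sql(
    dialect: str, index_name: str = "metadata_tsvector_search"
) -> Optional[str]:
    """Return idempotent DDL for the full-text index, or None if unsupported."""
    if dialect != "postgresql":
        # tsvector indexes are PostgreSQL-specific; skip other backends.
        return None
    # IF NOT EXISTS turns a repeated run into a no-op instead of an error.
    return (
        f"CREATE INDEX IF NOT EXISTS {index_name} ON nodes USING gin "
        "(jsonb_to_tsvector('simple', metadata, '[\"string\"]'))"
    )


print(create_index_sql("postgresql"))
print(create_index_sql("sqlite"))  # None
```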

Kezzsim · 2024-01-29T16:41:45Z

I will remove the case_sensitive flag from both the source code and the documentation, minimizing the impact on the API surface by ignoring any kwargs sent to the query other than text.
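A sketch of what "ignoring any other kwargs" could look like. The class below is a simplified, hypothetical stand-in for the real query class, and the encode/decode method names are assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass
class FullText:
    """Hypothetical stand-in for the query class, reduced to a single field."""

    text: str

    def encode(self) -> dict:
        return {"text": self.text}

    @classmethod
    def decode(cls, *, text: str, **ignored) -> "FullText":
        # Extra keys, such as case_sensitive sent by an older client,
        # are accepted and silently dropped.
        return cls(text=text)


query = FullText.decode(text="purple", case_sensitive=True)
print(query)  # FullText(text='purple')
```

With this shape, a client that still sends case_sensitive keeps working, while the server treats every search as case-insensitive.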

The PostgreSQL documentation states:

A lexeme is a string, just like a token, but it has been normalized so that different forms of the same word are made alike. For example, normalization almost always includes folding upper-case letters to lower-case
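To illustrate why case folding rules out case-sensitive matching against the same index, here is a rough Python emulation of the normalization step. It is only an approximation for illustration and is not PostgreSQL's actual parser or dictionary logic; the function name is made up.

```python
import re


def simple_lexemes(text: str) -> set:
    """Rough emulation of case folding: split on non-alphanumerics, then lowercase."""
    return {token.lower() for token in re.findall(r"[^\W_]+", text)}


indexed = simple_lexemes("Light Purple sample")
print("purple" in indexed)  # True
print("Purple" in indexed)  # False
```

Because only the folded form is stored, the original capitalization cannot be recovered from the index. Supporting case-sensitive search would mean storing a second, unfolded representation, which is where the insert-speed and database-size costs come from.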

Kezzsim added 4 commits January 17, 2024 15:02

Merge and reset with head upstream 👷‍♀️ Conform to pep∞

9a583ba

🗂️ Implement jsonb_to_tsvector which actually engages the index for real this time.

9b3933e

📜 Implement fulltext query in adapter.py

7498e0e

⬆️ Alembic migration creates text search index

e01b201

Kezzsim added the smse label (Scientific Metadata Search Engine, everything pertaining to natural language search) Jan 24, 2024

Kezzsim added 4 commits January 24, 2024 21:59

Merge branch 'main' into smse_posgres

8efe691

Upstream merge removed paren 🔣

42696c3

Add new migration to list in core.py ➕

237ae47

Black formatting for precommit 🧹

686c93d

danielballan reviewed Jan 24, 2024

tiled/catalog/orm.py (outdated)

tiled/catalog/adapter.py (outdated)

tiled/catalog/migrations/versions/1cd99c02d0c7_create_index_for_fulltext_search.py (outdated)

danielballan requested a review from dylanmcreynolds January 25, 2024 01:49

Kezzsim and others added 6 commits January 26, 2024 14:20

Verbose index naming convention 🪪

6b23411

Change from a list of conditions to a single condition

9c77994

Skip unsupported ts_vector index creation for sqlite

0b4e9a9

Preformat with black ⬛️ and enable tests ✅

3e00b4b

change op from sqlalchemy match to to_tsquery 🪄

8634e31

Fix oddity with plainto_tsquery vs to_tsquery

810f6b2

dylanmcreynolds reviewed Jan 26, 2024

danielballan added 3 commits January 27, 2024 08:27

Isolate migration from orm; orm may change!

2b4d09a

Provide a more useful high-level comment.

30b864c

Put index creation in its own function.

f094d24

Kezzsim and others added 2 commits January 29, 2024 13:41

Remove all allusions to case sensitivity from fulltext 🔠🧽

535f9cd

Taking a heavier hand to some light-touch changes

04c6092

danielballan merged commit eb79bdb into bluesky:main Jan 29, 2024
8 checks passed