Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Scientific Metadata Search Engine (Fulltext) implementation for PostgreSQL 🔎 #640

Merged
merged 19 commits into from
Jan 29, 2024

Conversation

Kezzsim
Copy link
Contributor

@Kezzsim Kezzsim commented Jan 24, 2024

This feature adds a rudimentary scientific metadata search engine, implemented purely through Postgresql's native ts_vector and ts_query operations. Documented as part of the Tiled client's FullText query.

Tasks:

  • Add fulltext logic to adapter.py to enable functionality
  • Create a new computed index metadata_search for storing jsonb_to_tsvector data
  • Generate Alembic migration to move existing tiled collections to the new standard
  • Add new tests that ensure the index is being invoked

Resolves #457

@Kezzsim Kezzsim added the smse Scientific Metadata Search Engine, everything pertaining to natural language search label Jan 24, 2024
Copy link
Member

@danielballan danielballan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very excited to see this!

tiled/catalog/orm.py Outdated Show resolved Hide resolved
tiled/catalog/adapter.py Outdated Show resolved Hide resolved
@danielballan
Copy link
Member

Also: our FullText query accepts a case_sensitive parameter. I added that somewhat speculatively, when I was more naively about what is involved in supporting both options. As we're still in alpha, I wonder if it makes sense to remove that for now---being case insensitive only---and add it back later if there is demand for it.

There are tradeoffs in insert speed and database size that make me question whether case sensitive search is useful and important enough to justify that.

tiled/tiled/queries.py

Lines 42 to 50 in fca4330

Parameters
----------
text : str
case_sensitive : bool, optional
Default False (case-insensitive).
"""
text: str
case_sensitive: bool = False

@@ -168,6 +171,9 @@ def cm():
cm = nullcontext
with cm():
assert list(client.search(FullText("z"))) == ["z", "does_contain_z"]
# plainto_tsquery fails to find certain words, weirdly, so it is a useful
# test that we are using tsquery
assert list(client.search(FullText("purple"))) == ["full_text_test_case"]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know how much to expect of this index, but could we do partial word search? e.g. urple

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

“light urple!”

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think support is limited, but robust fuzzy text search is next up for @Kezzsim.

@Kezzsim
Copy link
Contributor Author

Kezzsim commented Jan 26, 2024

Current C.I. issues are caused by the caching we set up previously, ORM and alembic need to be able to handle if an index already exists in that cache somehow.

@Kezzsim
Copy link
Contributor Author

Kezzsim commented Jan 29, 2024

Also: our FullText query accepts a case_sensitive parameter. I added that somewhat speculatively, when I was more naively about what is involved in supporting both options. As we're still in alpha, I wonder if it makes sense to remove that for now---being case insensitive only---and add it back later if there is demand for it.

There are tradeoffs in insert speed and database size that make me question whether case sensitive search is useful and important enough to justify that.

tiled/tiled/queries.py

Lines 42 to 50 in fca4330

Parameters
----------
text : str
case_sensitive : bool, optional
Default False (case-insensitive).
"""
text: str
case_sensitive: bool = False

I will remove the case sensitive flag from both the source code and the documentation, minimizing the API surface impact by ignoring any other kwargs sent to query other than text.

The postgresql documentation writes:

A lexeme is a string, just like a token, but it has been normalized so that different forms of the same word are made alike. For example, normalization almost always includes folding upper-case letters to lower-case

@danielballan danielballan merged commit eb79bdb into bluesky:main Jan 29, 2024
8 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
smse Scientific Metadata Search Engine, everything pertaining to natural language search
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Support FullText search in PostgreSQL
3 participants