Really fix UUID search + Add documentation for search (#739)

* Add test for searching uuids * Add documentation for search Signed-off-by: Mark90 <mark_moes@live.nl> * Fixes #732 by actually deploying the uuid search workaround from #603 In #603 the uuid search workaround was added to existing migration da5c9f4cce1c in orchestrator-core commit 3e93263. Because of that, it was never deployed to existing environments where the original migration was already executed. In a new migration (460ec6748e37) this workaround will be deployed onto existing environments. The old migration (da5c9f4cce1c) is restored to its original state before commit 3e93263. Signed-off-by: Mark90 <mark_moes@live.nl> * Bump version to 2.7.6rc2 --------- Signed-off-by: Mark90 <mark_moes@live.nl>
workfloworchestrator · Oct 3, 2024 · 1d0367d · 1d0367d
1 parent 62d64fd
commit 1d0367d
Show file tree

Hide file tree

Showing 8 changed files with 507 additions and 27 deletions.
diff --git a/.bumpversion.cfg b/.bumpversion.cfg
@@ -1,5 +1,5 @@
 [bumpversion]
-current_version = 2.7.6rc1
+current_version = 2.7.6rc2
 commit = False
 tag = False
 parse = (?P<major>\d+)\.(?P<minor>\d+)\.(?P<patch>\d+)(rc(?P<build>\d+))?

diff --git a/docs/reference-docs/search.md b/docs/reference-docs/search.md
@@ -0,0 +1,262 @@
+# Search
+
+<!-- TOC -->
+- [Search](#search)
+  - [Usage](#usage)
+  - [Implementation 1: Filter on DB Table](#implementation-1-filter-on-db-table)
+  - [Implementation 2: Subscription Search](#implementation-2-subscription-search)
+    - [Postgres Text Search](#postgres-text-search)
+    - [Materialized View subscriptions\_search](#materialized-view-subscriptions_search)
+<!-- TOC -->
+
+The orchestrator-core provides search functionality to find data within the Postgres DB.
+
+The below overview describes which objects can be searched through which endpoint, and the underlying search implementation.
+
+| Object         | REST endpoint                      | GraphQL endpoint               | Implementation      |
+|----------------|------------------------------------|--------------------------------|---------------------|
+| Subscriptions  | `/api/subscriptions/search?query=` | `subscriptions(query: "...")`  | Subscription Search |
+| Processes      |                                    | `processes(query: "...")`      | Filter on DB table  |
+| Product Blocks |                                    | `product_blocks(query: "...")` | Filter on DB table  |
+| Products       |                                    | `products(query: "...")`       | Filter on DB table  |
+| Resource types |                                    | `resource_types(query: "...")` | Filter on DB table  |
+| Workflows      |                                    | `workflows(query: "...")`      | Filter on DB table  |
+
+There are 2 implementations:
+
+1. **Filter on DB Table**: generic implementation to search on values in the DB table for an object, or related tables
+1. **Subscription Search**: specialized implementation to search for values anywhere in a subscription, such as instance values or customer descriptions
+
+## Usage
+
+Call the REST or GraphQL endpoint with the `query` parameter set to the search query.
+
+Some search query examples for Subscriptions:
+
+| Query                                                    | Matches subscriptions with                                           |
+|----------------------------------------------------------|----------------------------------------------------------------------|
+| `test`                                                   | `test` in any field                                                  |
+| `description:test`                                       | `test` in description field                                          |
+| `L2VPN`                                                  | `L2VPN` in any field                                                 |
+| `tag:L2VPN`                                              | `L2VPN` in product tag field                                         |
+| `tag:(FW \| L2VPN)` <br/> Or: <br/>`tag:FW \| tag:L2VPN` | `FW` **OR** `L2VPN` in product tag field                             |
+| `description:test tag:L2VPN`                             | `test` in description field **and** `L2VPN` in product tag field     |
+| `description:test -tag:L2VPN`                            | `test` in description field **and NOT** `L2VPN` in product tag field |
+| `test 123` <br/> Or: <br/>`123 test`                     | `test` **AND** `123` anywhere
+| `"test 123"`                                             | `test 123` anywhere (Phrase search)                                  |
+| `test*`                                                  | any field starting with `test` (Prefix search)                       |
+
+Note that:
+* For other objects the query syntax is the same
+* Searching is case-insensitive
+* Ordering of words does not matter (unless it is a Phrase)
+* Search words cannot contain the characters `|-*():"` as they are part of the search query grammar
+
+## Implementation 1: Filter on DB Table
+
+This implementation translates the user's search query to `WHERE` clauses on DB columns of the object's DB table. For some objects this extends to related DB tables.
+
+We can distinguish these steps:
+
+* The module `search_query.py` parses the user's query and generates a sequence of sqlalchemy `.filter()` clauses
+* The REST/GraphQL endpoint appends these clauses to the sqlalchemy `select()` to find objects that match the user's query
+
+## Implementation 2: Subscription Search
+
+This is a specialized implementation to allow searching values anywhere in a subscription or in related entities, without sacrificing performance.
+
+We can distinguish these steps/components:
+
+ * The DB view `subscriptions_search` is a search index with _Text Search_ (TS) vectors. Both are explained in the next sections
+ * The module `search_query.py` parses the user's query into a string that is used to create a TS query
+ * The TS query is wrapped in a single sqlalchemy `.filter()` clause to match TS documents in `subscriptions_search` and get the corresponding `subscription_id`
+ * The REST/GraphQL endpoint appends this clause to the sqlalchemy `select()` to find subscription objects that match the user's query
+
+The diagram below visualizes the flow and dependencies between components in the Database, API and Frontend.
+
+![Diagram search](search_overview.png)
+
+<!-- search_overview.png can be opened and modified in draw.io -->
+
+### Postgres Text Search
+
+Postgres _Full Text Search_ (TS) has extensive [documentation](https://www.postgresql.org/docs/current/textsearch.html) but we'll cover the fundamentals in this section. TS queries can be done on "normal" DB tables, i.e. without a search index, but this is too slow. It is recommended to maintain a search index which we do in DB view `subscriptions_search`.
+
+**Creating TS vectors**
+
+The query behind `subscriptions_search` retrieves Subscriptions joined with several other tables (as shown in the previous diagram), forming a "document" of keywords that in some way relate to the subscription.
+
+Each document is turned into a _tsvector_ with Postgres function `to_tsvector()` which consists of these phases:
+* Parse document into _tokens_: split text into _tokens_ using special characters as delimiters
+* Convert tokens into _lexemes_: a lexeme is a normalized token, i.e. this folds upper-case to lower-case. This step can also normalize based on language, but we disable that by using the `'simple'` dictionary (shown in diagram above)
+* Create vector optimized for search: store array of _lexemes_ along with positional information
+
+The array of _lexemes_ makes up the tsvector document that we store in `subsriptions_search`, and which we can query through the Postgres function `to_tsquery()`.
+
+We'll demonstrate how this works through a few examples. You can follow along in any  postgres shell (v14 or above). If you have Docker installed, run these commands in separate shells.
+
+```sh
+docker run --rm -e POSTGRES_HOST_AUTH_METHOD=trust --name pg15 postgres:15
+docker exec -i -t pg15 su - postgres -c psql
+```
+
+Translating the input text `'color:Light_Blue count:4'` to a tsvector:
+
+```
+postgres=# select to_tsvector('simple', 'color:Light_Blue count:4');
+                 to_tsvector
+----------------------------------------------
+ '4':5 'blue':3 'color':1 'count':4 'light':2
+(1 row)
+```
+
+The result `'4':5 'blue':3 'color':1 'count':4 'light':2` is an array of lexemes combined with information about their position in the original text.
+
+With Postgres function `to_tsdebug()` one can investigate how the input was parsed.
+
+```
+postgres=# select * from ts_debug('simple', 'color:Light_Blue count:4');
+   alias   |   description    | token | dictionaries | dictionary | lexemes
+-----------+------------------+-------+--------------+------------+---------
+ asciiword | Word, all ASCII  | color | {simple}     | simple     | {color}
+ blank     | Space symbols    | :     | {}           |            |
+ asciiword | Word, all ASCII  | Light | {simple}     | simple     | {light}
+ blank     | Space symbols    | _     | {}           |            |
+ asciiword | Word, all ASCII  | Blue  | {simple}     | simple     | {blue}
+ blank     | Space symbols    |       | {}           |            |
+ asciiword | Word, all ASCII  | count | {simple}     | simple     | {count}
+ blank     | Space symbols    | :     | {}           |            |
+ uint      | Unsigned integer | 4     | {simple}     | simple     | {4}
+ ```
+
+Note: when we would write "Light-Blue" with a dash instead of underscore, Postgres translates this into the vector `'blue':4 'light':3 'light-blue':2`, which makes it very complicated to query, in particular for UUID strings. To mitigate this we replace all occurrences of `-` with `_` when creating TS Vectors and before executing TS Queries.
+
+**Querying TS Vectors**
+
+To run queries against TS Vectors we have to prepare a TS Query with Postgres function `to_tsquery()`. We also pass the `'simple'` dictionary here to prevent language-specific normalization.
+
+```
+postgres=# select to_tsquery('simple', 'Light_Blue');
+     to_tsquery
+--------------------
+ 'light' <-> 'blue'
+ ```
+
+The resulting TS query means as much as: `the vector must contain 'light' followed by 'blue'`. Note that the input string has been tokenized similar to TS vectors.
+
+We can execute this TS query against the TS vector:
+
+```
+postgres=# select to_tsvector('simple', 'color:Light_Blue count:4') @@ to_tsquery('simple', 'Light_Blue');
+ ?column?
+----------
+ t
+(1 row)
+```
+
+Returns `t` for `true`.
+
+As a final example we can also add an `OR` condition:
+
+```
+postgres=# select to_tsvector('simple', 'color:Light_Blue count:4') @@ to_tsquery('simple', 'color <-> Green | count <-> 4');
+ ?column?
+----------
+ t
+(1 row)
+```
+
+Which returns true because while the vector does not contain the color green, it does contain count 4.
+
+Note that the `color <-> Green | count <-> 4` string passed to `ts_query()` must be constructed in a specific way.
+This happens in the orchestrator-core module `search_query.py` as shown in the overview diagram.
+
+### Materialized View subscriptions_search
+
+As mentioned before, `subscriptions_search` is a DB view which lies at the heart of the implementation.
+If you're not familiar with database views; they represent a (usually complicated) database query in the form of a "virtual table".
+
+In this case we're using a [Materialized View](https://www.postgresql.org/docs/current/rules-materializedviews.html#RULES-MATERIALIZEDVIEWS) which is like a normal view, except that the "virtual table" is persisted to save resources and increase performance.
+The view's data is persisted until it is refreshed, which can be done manually or through an update trigger. (further explained below)
+
+This table has a [GIN index](https://www.postgresql.org/docs/current/textsearch-indexes.html#TEXTSEARCH-INDEXES) for efficient search queries.
+
+**Triggers**
+
+Database function `refresh_subscriptions_search_view` takes care of refreshing `subscriptions_search`. It is called by triggers on the following tables:
+
+* `fi_refresh_search` on table `fixed_inputs`
+* `products_refresh_search` on table `products`
+* `siv_refresh_search` on table `subscription_instance_values`
+* `sub_cust_desc_refresh_search` on table `subscription_customer_descriptions`
+* `sub_refresh_search` on table `subscriptions`
+
+The following query shows the current state of the triggers:
+
+```sql
+SELECT tgname, tgenabled
+FROM pg_trigger
+where pg_trigger.tgname in
+    ('fi_refresh_search',
+     'products_refresh_search',
+     'sub_cust_desc_refresh_search',
+     'siv_refresh_search',
+     'sub_refresh_search');
+```
+
+When all triggers are disabled the output looks like:
+
+```
+            tgname            | tgenabled
+------------------------------+-----------
+ fi_refresh_search            | D
+ products_refresh_search      | D
+ siv_refresh_search           | D
+ sub_cust_desc_refresh_search | D
+ sub_refresh_search           | D
+```
+
+And when all triggers are enabled the output looks like this:
+```
+            tgname            | tgenabled
+------------------------------+-----------
+ fi_refresh_search            | O
+ products_refresh_search      | O
+ siv_refresh_search           | O
+ sub_cust_desc_refresh_search | O
+ sub_refresh_search           | O
+```
+
+Enabling all triggers is done with these statements:
+
+```sql
+ALTER TABLE fixed_inputs ENABLE TRIGGER fi_refresh_search;
+ALTER TABLE products ENABLE TRIGGER products_refresh_search;
+ALTER TABLE subscription_customer_descriptions ENABLE TRIGGER sub_cust_desc_refresh_search;
+ALTER TABLE subscription_instance_values ENABLE TRIGGER siv_refresh_search;
+ALTER TABLE subscriptions ENABLE TRIGGER sub_refresh_search;
+```
+
+Disabling all triggers is done with these statements:
+
+```sql
+ALTER TABLE fixed_inputs DISABLE TRIGGER fi_refresh_search;
+ALTER TABLE products DISABLE TRIGGER products_refresh_search;
+ALTER TABLE subscription_customer_descriptions DISABLE TRIGGER sub_cust_desc_refresh_search;
+ALTER TABLE subscription_instance_values DISABLE TRIGGER siv_refresh_search;
+ALTER TABLE subscriptions DISABLE TRIGGER sub_refresh_search;
+```
+
+The following query returns the number of seconds since the last refresh, which can be useful for debugging.
+
+```sql
+SELECT extract(epoch from now())::int - coalesce(pg_catalog.obj_description('subscriptions_search'::regclass)::int, 0);
+```
+
+**Limitations**
+
+Updating the `subscriptions_search` materialized view is expensive and limited to once every 2 minutes.
+
+This means that when 2 changes happen within, say, 5 seconds of each other, the first change will be picked up directly.
+However, the second change will only be processed on the next refresh of the view.
+So during that period the second change will not show up in the search results.
diff --git a/docs/reference-docs/search_overview.png b/docs/reference-docs/search_overview.png
diff --git a/orchestrator/__init__.py b/orchestrator/__init__.py
@@ -13,7 +13,7 @@
 
 """This is the orchestrator workflow engine."""
 
-__version__ = "2.7.6rc1"
+__version__ = "2.7.6rc2"
 
 from orchestrator.app import OrchestratorCore
 from orchestrator.settings import app_settings

diff --git a/...ations/versions/schema/2023-09-25_da5c9f4cce1c_add_subscription_metadata_to_fulltext_.sql b/...ations/versions/schema/2023-09-25_da5c9f4cce1c_add_subscription_metadata_to_fulltext_.sql
@@ -5,10 +5,17 @@ DROP MATERIALIZED VIEW IF EXISTS subscriptions_search;
 CREATE MATERIALIZED VIEW IF NOT EXISTS subscriptions_search AS
 WITH rt_info AS (SELECT s.subscription_id,
                         concat_ws(
-                                ' ',
-                                string_agg(rt.resource_type || ' ' || siv.value, ' '),
-                                string_agg(distinct 'subscription_instance_id' || ':' || si.subscription_instance_id, ' ')
-                        ) AS rt_info
+                                ', ',
+                                string_agg(
+                                                rt.resource_type || ':' || siv.value,
+                                                ', '
+                                                ORDER BY rt.resource_type
+                                    ),
+                                string_agg(
+                                        distinct 'subscription_instance_id' || ':' || si.subscription_instance_id,
+                                        ', '
+                                    )
+                            ) AS rt_info
                  FROM subscription_instance_values siv
                           JOIN resource_types rt ON siv.resource_type_id = rt.resource_type_id
                           JOIN subscription_instances si ON siv.subscription_instance_id = si.subscription_instance_id
@@ -23,46 +30,48 @@ WITH rt_info AS (SELECT s.subscription_id,
                                           'note:' || coalesce(s.note, ''),
                                           'customer_id:' || s.customer_id,
                                           'product_id:' || s.product_id],
-                                      ' '
-                              ) AS sub_info,
+                                      ', '
+                                  ) AS sub_info,
                               array_to_string(
                                       ARRAY ['product_name:' || p.name,
                                           'product_description:' || p.description,
                                           'tag:' || p.tag,
                                           'product_type:', p.product_type],
-                                      ' '
-                              ) AS prod_info
+                                      ', '
+                                  ) AS prod_info
                        FROM subscriptions s
                                 JOIN products p ON s.product_id = p.product_id),
      fi_info AS (SELECT s.subscription_id,
-                        string_agg(fi.name || ':' || fi.value, ' ') AS fi_info
+                        string_agg(
+                                        fi.name || ':' || fi.value,
+                                        ', '
+                                        ORDER BY fi.name
+                            ) AS fi_info
                  FROM subscriptions s
                           JOIN products p ON s.product_id = p.product_id
                           JOIN fixed_inputs fi ON p.product_id = fi.product_id
                  GROUP BY s.subscription_id),
      cust_info AS (SELECT s.subscription_id,
-                          string_agg('customer_description: ' || scd.description, ' ') AS cust_info
+                          string_agg(
+                                      'customer_description: ' || scd.description,
+                                      ', '
+                              ) AS cust_info
                    FROM subscriptions s
                             JOIN subscription_customer_descriptions scd ON s.subscription_id = scd.subscription_id
                    GROUP BY s.subscription_id)
--- to_tsvector handles parsing of hyphened words in a peculiar way and is inconsistent with how to_tsquery parses it in Postgres <14
--- Replacing all hyphens with underscores makes the parsing more predictable and removes some issues arising when searching for subscription ids for example
--- See: https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=0c4f355c6a5fd437f71349f2f3d5d491382572b7
 SELECT s.subscription_id,
        to_tsvector(
                'simple',
-               replace(
-                       concat_ws(
-                               ' ',
-                               spi.sub_info,
-                               spi.prod_info,
-                               fi.fi_info,
-                               rti.rt_info,
-                               ci.cust_info,
-                               md.metadata::text
-                       ),
-                       '-', '_')
-       ) as tsv
+               concat_ws(
+                       ', ',
+                       spi.sub_info,
+                       spi.prod_info,
+                       fi.fi_info,
+                       rti.rt_info,
+                       ci.cust_info,
+                       md.metadata::text
+                   )
+           ) as tsv
 FROM subscriptions s
          LEFT JOIN sub_prod_info spi ON s.subscription_id = spi.subscription_id
          LEFT JOIN fi_info fi ON s.subscription_id = fi.subscription_id