Skip to content

Commit

Permalink
Really fix UUID search + Add documentation for search (#739)
Browse files Browse the repository at this point in the history
* Add test for searching uuids

* Add documentation for search

Signed-off-by: Mark90 <mark_moes@live.nl>

* Fixes #732 by actually deploying the uuid search workaround from #603

In #603 the uuid search workaround was added to existing migration da5c9f4cce1c in orchestrator-core commit 3e93263.
Because of that, it was never deployed to existing environments where the original migration was already executed.

In a new migration (460ec6748e37) this workaround will be deployed onto existing environments.
The old migration (da5c9f4cce1c) is restored to its original state before commit 3e93263.

Signed-off-by: Mark90 <mark_moes@live.nl>

* Bump version to 2.7.6rc2

---------

Signed-off-by: Mark90 <mark_moes@live.nl>
  • Loading branch information
Mark90 authored Oct 3, 2024
1 parent 62d64fd commit 1d0367d
Show file tree
Hide file tree
Showing 8 changed files with 507 additions and 27 deletions.
2 changes: 1 addition & 1 deletion .bumpversion.cfg
Original file line number Diff line number Diff line change
@@ -1,5 +1,5 @@
[bumpversion]
current_version = 2.7.6rc1
current_version = 2.7.6rc2
commit = False
tag = False
parse = (?P<major>\d+)\.(?P<minor>\d+)\.(?P<patch>\d+)(rc(?P<build>\d+))?
Expand Down
262 changes: 262 additions & 0 deletions docs/reference-docs/search.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,262 @@
# Search

<!-- TOC -->
- [Search](#search)
- [Usage](#usage)
- [Implementation 1: Filter on DB Table](#implementation-1-filter-on-db-table)
- [Implementation 2: Subscription Search](#implementation-2-subscription-search)
- [Postgres Text Search](#postgres-text-search)
- [Materialized View subscriptions\_search](#materialized-view-subscriptions_search)
<!-- TOC -->

The orchestrator-core provides search functionality to find data within the Postgres DB.

The below overview describes which objects can be searched through which endpoint, and the underlying search implementation.

| Object | REST endpoint | GraphQL endpoint | Implementation |
|----------------|------------------------------------|--------------------------------|---------------------|
| Subscriptions | `/api/subscriptions/search?query=` | `subscriptions(query: "...")` | Subscription Search |
| Processes | | `processes(query: "...")` | Filter on DB table |
| Product Blocks | | `product_blocks(query: "...")` | Filter on DB table |
| Products | | `products(query: "...")` | Filter on DB table |
| Resource types | | `resource_types(query: "...")` | Filter on DB table |
| Workflows | | `workflows(query: "...")` | Filter on DB table |

There are 2 implementations:

1. **Filter on DB Table**: generic implementation to search on values in the DB table for an object, or related tables
1. **Subscription Search**: specialized implementation to search for values anywhere in a subscription, such as instance values or customer descriptions

## Usage

Call the REST or GraphQL endpoint with the `query` parameter set to the search query.

Some search query examples for Subscriptions:

| Query | Matches subscriptions with |
|----------------------------------------------------------|----------------------------------------------------------------------|
| `test` | `test` in any field |
| `description:test` | `test` in description field |
| `L2VPN` | `L2VPN` in any field |
| `tag:L2VPN` | `L2VPN` in product tag field |
| `tag:(FW \| L2VPN)` <br/> Or: <br/>`tag:FW \| tag:L2VPN` | `FW` **OR** `L2VPN` in product tag field |
| `description:test tag:L2VPN` | `test` in description field **and** `L2VPN` in product tag field |
| `description:test -tag:L2VPN` | `test` in description field **and NOT** `L2VPN` in product tag field |
| `test 123` <br/> Or: <br/>`123 test` | `test` **AND** `123` anywhere
| `"test 123"` | `test 123` anywhere (Phrase search) |
| `test*` | any field starting with `test` (Prefix search) |

Note that:
* For other objects the query syntax is the same
* Searching is case-insensitive
* Ordering of words does not matter (unless it is a Phrase)
* Search words cannot contain the characters `|-*():"` as they are part of the search query grammar

## Implementation 1: Filter on DB Table

This implementation translates the user's search query to `WHERE` clauses on DB columns of the object's DB table. For some objects this extends to related DB tables.

We can distinguish these steps:

* The module `search_query.py` parses the user's query and generates a sequence of sqlalchemy `.filter()` clauses
* The REST/GraphQL endpoint appends these clauses to the sqlalchemy `select()` to find objects that match the user's query

## Implementation 2: Subscription Search

This is a specialized implementation to allow searching values anywhere in a subscription or in related entities, without sacrificing performance.

We can distinguish these steps/components:

* The DB view `subscriptions_search` is a search index with _Text Search_ (TS) vectors. Both are explained in the next sections
* The module `search_query.py` parses the user's query into a string that is used to create a TS query
* The TS query is wrapped in a single sqlalchemy `.filter()` clause to match TS documents in `subscriptions_search` and get the corresponding `subscription_id`
* The REST/GraphQL endpoint appends this clause to the sqlalchemy `select()` to find subscription objects that match the user's query

The diagram below visualizes the flow and dependencies between components in the Database, API and Frontend.

![Diagram search](search_overview.png)

<!-- search_overview.png can be opened and modified in draw.io -->

### Postgres Text Search

Postgres _Full Text Search_ (TS) has extensive [documentation](https://www.postgresql.org/docs/current/textsearch.html) but we'll cover the fundamentals in this section. TS queries can be done on "normal" DB tables, i.e. without a search index, but this is too slow. It is recommended to maintain a search index which we do in DB view `subscriptions_search`.

**Creating TS vectors**

The query behind `subscriptions_search` retrieves Subscriptions joined with several other tables (as shown in the previous diagram), forming a "document" of keywords that in some way relate to the subscription.

Each document is turned into a _tsvector_ with Postgres function `to_tsvector()` which consists of these phases:
* Parse document into _tokens_: split text into _tokens_ using special characters as delimiters
* Convert tokens into _lexemes_: a lexeme is a normalized token, i.e. this folds upper-case to lower-case. This step can also normalize based on language, but we disable that by using the `'simple'` dictionary (shown in diagram above)
* Create vector optimized for search: store array of _lexemes_ along with positional information

The array of _lexemes_ makes up the tsvector document that we store in `subsriptions_search`, and which we can query through the Postgres function `to_tsquery()`.

We'll demonstrate how this works through a few examples. You can follow along in any postgres shell (v14 or above). If you have Docker installed, run these commands in separate shells.

```sh
docker run --rm -e POSTGRES_HOST_AUTH_METHOD=trust --name pg15 postgres:15
docker exec -i -t pg15 su - postgres -c psql
```

Translating the input text `'color:Light_Blue count:4'` to a tsvector:

```
postgres=# select to_tsvector('simple', 'color:Light_Blue count:4');
to_tsvector
----------------------------------------------
'4':5 'blue':3 'color':1 'count':4 'light':2
(1 row)
```

The result `'4':5 'blue':3 'color':1 'count':4 'light':2` is an array of lexemes combined with information about their position in the original text.

With Postgres function `to_tsdebug()` one can investigate how the input was parsed.

```
postgres=# select * from ts_debug('simple', 'color:Light_Blue count:4');
alias | description | token | dictionaries | dictionary | lexemes
-----------+------------------+-------+--------------+------------+---------
asciiword | Word, all ASCII | color | {simple} | simple | {color}
blank | Space symbols | : | {} | |
asciiword | Word, all ASCII | Light | {simple} | simple | {light}
blank | Space symbols | _ | {} | |
asciiword | Word, all ASCII | Blue | {simple} | simple | {blue}
blank | Space symbols | | {} | |
asciiword | Word, all ASCII | count | {simple} | simple | {count}
blank | Space symbols | : | {} | |
uint | Unsigned integer | 4 | {simple} | simple | {4}
```

Note: when we would write "Light-Blue" with a dash instead of underscore, Postgres translates this into the vector `'blue':4 'light':3 'light-blue':2`, which makes it very complicated to query, in particular for UUID strings. To mitigate this we replace all occurrences of `-` with `_` when creating TS Vectors and before executing TS Queries.

**Querying TS Vectors**

To run queries against TS Vectors we have to prepare a TS Query with Postgres function `to_tsquery()`. We also pass the `'simple'` dictionary here to prevent language-specific normalization.

```
postgres=# select to_tsquery('simple', 'Light_Blue');
to_tsquery
--------------------
'light' <-> 'blue'
```

The resulting TS query means as much as: `the vector must contain 'light' followed by 'blue'`. Note that the input string has been tokenized similar to TS vectors.

We can execute this TS query against the TS vector:

```
postgres=# select to_tsvector('simple', 'color:Light_Blue count:4') @@ to_tsquery('simple', 'Light_Blue');
?column?
----------
t
(1 row)
```

Returns `t` for `true`.

As a final example we can also add an `OR` condition:

```
postgres=# select to_tsvector('simple', 'color:Light_Blue count:4') @@ to_tsquery('simple', 'color <-> Green | count <-> 4');
?column?
----------
t
(1 row)
```

Which returns true because while the vector does not contain the color green, it does contain count 4.

Note that the `color <-> Green | count <-> 4` string passed to `ts_query()` must be constructed in a specific way.
This happens in the orchestrator-core module `search_query.py` as shown in the overview diagram.

### Materialized View subscriptions_search

As mentioned before, `subscriptions_search` is a DB view which lies at the heart of the implementation.
If you're not familiar with database views; they represent a (usually complicated) database query in the form of a "virtual table".

In this case we're using a [Materialized View](https://www.postgresql.org/docs/current/rules-materializedviews.html#RULES-MATERIALIZEDVIEWS) which is like a normal view, except that the "virtual table" is persisted to save resources and increase performance.
The view's data is persisted until it is refreshed, which can be done manually or through an update trigger. (further explained below)

This table has a [GIN index](https://www.postgresql.org/docs/current/textsearch-indexes.html#TEXTSEARCH-INDEXES) for efficient search queries.

**Triggers**

Database function `refresh_subscriptions_search_view` takes care of refreshing `subscriptions_search`. It is called by triggers on the following tables:

* `fi_refresh_search` on table `fixed_inputs`
* `products_refresh_search` on table `products`
* `siv_refresh_search` on table `subscription_instance_values`
* `sub_cust_desc_refresh_search` on table `subscription_customer_descriptions`
* `sub_refresh_search` on table `subscriptions`

The following query shows the current state of the triggers:

```sql
SELECT tgname, tgenabled
FROM pg_trigger
where pg_trigger.tgname in
('fi_refresh_search',
'products_refresh_search',
'sub_cust_desc_refresh_search',
'siv_refresh_search',
'sub_refresh_search');
```

When all triggers are disabled the output looks like:

```
tgname | tgenabled
------------------------------+-----------
fi_refresh_search | D
products_refresh_search | D
siv_refresh_search | D
sub_cust_desc_refresh_search | D
sub_refresh_search | D
```

And when all triggers are enabled the output looks like this:
```
tgname | tgenabled
------------------------------+-----------
fi_refresh_search | O
products_refresh_search | O
siv_refresh_search | O
sub_cust_desc_refresh_search | O
sub_refresh_search | O
```

Enabling all triggers is done with these statements:

```sql
ALTER TABLE fixed_inputs ENABLE TRIGGER fi_refresh_search;
ALTER TABLE products ENABLE TRIGGER products_refresh_search;
ALTER TABLE subscription_customer_descriptions ENABLE TRIGGER sub_cust_desc_refresh_search;
ALTER TABLE subscription_instance_values ENABLE TRIGGER siv_refresh_search;
ALTER TABLE subscriptions ENABLE TRIGGER sub_refresh_search;
```

Disabling all triggers is done with these statements:

```sql
ALTER TABLE fixed_inputs DISABLE TRIGGER fi_refresh_search;
ALTER TABLE products DISABLE TRIGGER products_refresh_search;
ALTER TABLE subscription_customer_descriptions DISABLE TRIGGER sub_cust_desc_refresh_search;
ALTER TABLE subscription_instance_values DISABLE TRIGGER siv_refresh_search;
ALTER TABLE subscriptions DISABLE TRIGGER sub_refresh_search;
```

The following query returns the number of seconds since the last refresh, which can be useful for debugging.

```sql
SELECT extract(epoch from now())::int - coalesce(pg_catalog.obj_description('subscriptions_search'::regclass)::int, 0);
```

**Limitations**

Updating the `subscriptions_search` materialized view is expensive and limited to once every 2 minutes.

This means that when 2 changes happen within, say, 5 seconds of each other, the first change will be picked up directly.
However, the second change will only be processed on the next refresh of the view.
So during that period the second change will not show up in the search results.
Binary file added docs/reference-docs/search_overview.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
2 changes: 1 addition & 1 deletion orchestrator/__init__.py
Original file line number Diff line number Diff line change
Expand Up @@ -13,7 +13,7 @@

"""This is the orchestrator workflow engine."""

__version__ = "2.7.6rc1"
__version__ = "2.7.6rc2"

from orchestrator.app import OrchestratorCore
from orchestrator.settings import app_settings
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -5,10 +5,17 @@ DROP MATERIALIZED VIEW IF EXISTS subscriptions_search;
CREATE MATERIALIZED VIEW IF NOT EXISTS subscriptions_search AS
WITH rt_info AS (SELECT s.subscription_id,
concat_ws(
' ',
string_agg(rt.resource_type || ' ' || siv.value, ' '),
string_agg(distinct 'subscription_instance_id' || ':' || si.subscription_instance_id, ' ')
) AS rt_info
', ',
string_agg(
rt.resource_type || ':' || siv.value,
', '
ORDER BY rt.resource_type
),
string_agg(
distinct 'subscription_instance_id' || ':' || si.subscription_instance_id,
', '
)
) AS rt_info
FROM subscription_instance_values siv
JOIN resource_types rt ON siv.resource_type_id = rt.resource_type_id
JOIN subscription_instances si ON siv.subscription_instance_id = si.subscription_instance_id
Expand All @@ -23,46 +30,48 @@ WITH rt_info AS (SELECT s.subscription_id,
'note:' || coalesce(s.note, ''),
'customer_id:' || s.customer_id,
'product_id:' || s.product_id],
' '
) AS sub_info,
', '
) AS sub_info,
array_to_string(
ARRAY ['product_name:' || p.name,
'product_description:' || p.description,
'tag:' || p.tag,
'product_type:', p.product_type],
' '
) AS prod_info
', '
) AS prod_info
FROM subscriptions s
JOIN products p ON s.product_id = p.product_id),
fi_info AS (SELECT s.subscription_id,
string_agg(fi.name || ':' || fi.value, ' ') AS fi_info
string_agg(
fi.name || ':' || fi.value,
', '
ORDER BY fi.name
) AS fi_info
FROM subscriptions s
JOIN products p ON s.product_id = p.product_id
JOIN fixed_inputs fi ON p.product_id = fi.product_id
GROUP BY s.subscription_id),
cust_info AS (SELECT s.subscription_id,
string_agg('customer_description: ' || scd.description, ' ') AS cust_info
string_agg(
'customer_description: ' || scd.description,
', '
) AS cust_info
FROM subscriptions s
JOIN subscription_customer_descriptions scd ON s.subscription_id = scd.subscription_id
GROUP BY s.subscription_id)
-- to_tsvector handles parsing of hyphened words in a peculiar way and is inconsistent with how to_tsquery parses it in Postgres <14
-- Replacing all hyphens with underscores makes the parsing more predictable and removes some issues arising when searching for subscription ids for example
-- See: https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=0c4f355c6a5fd437f71349f2f3d5d491382572b7
SELECT s.subscription_id,
to_tsvector(
'simple',
replace(
concat_ws(
' ',
spi.sub_info,
spi.prod_info,
fi.fi_info,
rti.rt_info,
ci.cust_info,
md.metadata::text
),
'-', '_')
) as tsv
concat_ws(
', ',
spi.sub_info,
spi.prod_info,
fi.fi_info,
rti.rt_info,
ci.cust_info,
md.metadata::text
)
) as tsv
FROM subscriptions s
LEFT JOIN sub_prod_info spi ON s.subscription_id = spi.subscription_id
LEFT JOIN fi_info fi ON s.subscription_id = fi.subscription_id
Expand Down
Loading

0 comments on commit 1d0367d

Please sign in to comment.