Search mvp (#149)
* Search MVP

* Change default mediatype to text/anot+turtle
Fix search bugs

* Bugfix - return annotations with search results

* Update tests - context is included.
Add search tests.

* Add context ontologies

* Align tests (PR + push to feature)
Update cache tests

* Update search query to sum weights. Remove duplicative Regex for exact matches (duplicates LCASE).
Update cache tests - "Purge" references updated to "Reset"
Black code
attempt test bugfix in GH action -
Update poetry
Change mgmt API to reset cache instead of purge.
Reset instead of purging TBOX cache.
Update search methods to allow multiple params
Align prez function to get item with item function
Add SPARQL_TIMEOUT configuration environment variable, and default this in async and sync clients as 30 seconds.

* Debug test failing in pipeline but working locally

* Debug test failing in pipeline but working locally. Set to xfail

* Mark xfail properly

* Another xfail

* Address PR comments

* Change /reset-tbox-cache endpoint back to /purge-tbox-cache after discussion on PR.

* Update to reflect PR comments

* And another

* Remove Jena FT Search
recalcitrantsupplant authored Sep 28, 2023
1 parent df00e01 commit 68a3d4d
Showing 52 changed files with 19,165 additions and 727 deletions.
3 changes: 2 additions & 1 deletion .github/workflows/on_push_to_feature.yaml
@@ -62,4 +62,5 @@ jobs:
cd ../profiles && poetry run pytest
cd ../services && poetry run pytest
cd ../identifier && poetry run pytest
# cd ../local_sparql_store && poetry run pytest
cd ../object && poetry run pytest
cd ../caching && poetry run pytest
16 changes: 16 additions & 0 deletions README-Dev.md
@@ -163,6 +163,22 @@ At present the parameterised SPARQL queries accept the following parameters: PRE
These parameters are substituted into the SPARQL query using the `string.Template` module. This module substitutes where $PREZ and $TERM are found in the query.
You must also escape any $ characters in the query using a second $.
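
A minimal sketch of the substitution described above, using only the standard library. The query text and the values passed for `PREZ` and `TERM` are illustrative assumptions, not the queries shipped with Prez. Note the `$$`, which renders a literal `$` (here a regex end anchor).

```python
from string import Template

# Hypothetical parameterised query: $PREZ and $TERM are placeholders,
# "$$" is an escaped dollar sign that becomes a literal "$" in the output.
template = Template(
    "PREFIX prez: <$PREZ>\n"
    'SELECT ?s WHERE { ?s ?p ?o . FILTER(REGEX(STR(?o), "^$TERM$$")) }'
)

# Assumed example values; a real deployment supplies its own.
query = template.substitute(PREZ="https://prez.dev/", TERM="contact")
print(query)
```

An unescaped `$` that is not followed by a valid placeholder name would make `substitute` raise `ValueError`, which is why the doubling matters.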

To configure filters on search the following patterns can be used:
- Specification of `filter-to-focus` and `focus-to-filter` filters as Query String Arguments on the search route. Examples:

1. `/search?term=contact&method=default&
/&_format=text/anot+turtle`
_adds a triple to the search query of the form `?search_result_uri skos:broader <http://resource.geosciml.org/classifier/cgi/contacttype/metamorphic_contact>`_

2. `/search?term=address&method=default&filter-to-focus[rdfs:member]=https://linked.data.gov.au/datasets/gnaf`
_adds a triple to the search query of the form `<https://linked.data.gov.au/datasets/gnaf> rdfs:member ?search_result_uri`_

3. Search with a filter on multiple objects (the list of objects is treated as an OR): `/search?term=address&method=default&filter-to-focus[rdfs:member]=https://linked.data.gov.au/datasets/gnaf,https://linked.data.gov.au/datasets/defg`
_adds a triple to the search query of the form `<https://linked.data.gov.au/datasets/gnaf> rdfs:member ?o . VALUES ?o { <https://linked.data.gov.au/datasets/gnaf> <https://linked.data.gov.au/datasets/defg>}`_

- URIs and CURIEs can be used to specify filters.
_If CURIEs are used, they should only be CURIEs returned as `dcterms:identifier "{identifier}"^^prez:identifier` or in prez:links. There is no guarantee prefix declarations in turtle or other RDF serialisations returned by prez are consistent with the prefixes used internally by prez for links and identifiers._
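
The filter patterns above could be assembled client-side roughly as follows. This is a sketch only: `build_search_query` is a hypothetical helper, not part of Prez, and it assumes the server accepts percent-encoded brackets and commas in the query string.

```python
from urllib.parse import urlencode, parse_qs


def build_search_query(term, method="default", filter_to_focus=None, focus_to_filter=None):
    """Return a query string for the /search route (hypothetical helper).

    Each filter argument maps a predicate (URI or CURIE) to a list of
    object URIs; multiple objects are comma-joined and treated as an OR.
    """
    params = {"term": term, "method": method}
    directions = (
        ("filter-to-focus", filter_to_focus),
        ("focus-to-filter", focus_to_filter),
    )
    for direction, filters in directions:
        for predicate, objects in (filters or {}).items():
            params[f"{direction}[{predicate}]"] = ",".join(objects)
    return urlencode(params)


query = build_search_query(
    "address",
    filter_to_focus={
        "rdfs:member": [
            "https://linked.data.gov.au/datasets/gnaf",
            "https://linked.data.gov.au/datasets/defg",
        ]
    },
)
print("/search?" + query)
```

`urlencode` percent-encodes the brackets, colons and commas; decoding the string with `parse_qs` recovers the `filter-to-focus[rdfs:member]` key and its comma-separated value.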

## Scaled instances of Prez
When using Prez for large volumes of data, it is recommended the support graph data is created offline. This includes:
- Identifiers for all objects (a `dcterms:identifier`)
16 changes: 16 additions & 0 deletions changelog.md
@@ -0,0 +1,16 @@
## Changes for 2023-09-27

### Features

- Default search added. This is a simple search that will search for terms across all annotation predicates Prez has configured. By default in prez/config.py these are set to:
  - label_predicates = [SKOS.prefLabel, DCTERMS.title, RDFS.label, SDO.name]
  - description_predicates = [SKOS.definition, DCTERMS.description, SDO.description]
  - provenance_predicates = [DCTERMS.provenance]

  These are configurable via environment variables using the Pydantic BaseSettings functionality but will need to be properly escaped as they are a list.

  More detail on adding filters to search is provided in the readme.
- Timeout for httpx AsyncClient and Client instances is set on the shared instance to 30s. Previously this was set in some individual calls resulting in inconsistent behaviour, as the default is otherwise 5s.
- Updated `purge-tbox-cache` endpoint functionality. This reflects that prez now
includes a number of common ontologies by default (prez/reference_data/context_ontologies), and on startup will load
annotation triples (e.g. x rdfs:label y) from these. As such, the tbox or annotation cache is no longer completely
purged but can be reset to this default state instead.
675 changes: 355 additions & 320 deletions poetry.lock

Large diffs are not rendered by default.

2 changes: 2 additions & 0 deletions prez/app.py
@@ -32,6 +32,7 @@
create_endpoints_graph,
populate_api_info,
add_prefixes_to_prefix_graph,
add_common_context_ontologies_to_tbox_cache,
)
from prez.services.exception_catchers import (
catch_400,
@@ -121,6 +122,7 @@ async def app_startup():
await create_endpoints_graph()
await count_objects()
await populate_api_info()
await add_common_context_ontologies_to_tbox_cache()


@app.on_event("shutdown")
65 changes: 9 additions & 56 deletions prez/config.py
@@ -4,10 +4,10 @@

import toml
from pydantic import BaseSettings, root_validator
-from rdflib import URIRef
-from rdflib.namespace import GEO, DCAT, SKOS, PROF
+from rdflib import URIRef, DCTERMS, RDFS, SDO
+from rdflib.namespace import SKOS

-from prez.reference_data.prez_ns import PREZ
+from prez.reference_data.prez_ns import REG


class Settings(BaseSettings):
@@ -43,6 +43,12 @@ class Settings(BaseSettings):
order_lists_by_label: bool = True
base_classes: Optional[dict]
prez_flavours: Optional[list] = ["SpacePrez", "VocPrez", "CatPrez", "ProfilesPrez"]
label_predicates = [SKOS.prefLabel, DCTERMS.title, RDFS.label, SDO.name]
description_predicates = [SKOS.definition, DCTERMS.description, SDO.description]
provenance_predicates = [DCTERMS.provenance]
other_predicates = [SDO.color, REG.status]
sparql_timeout = 30.0

log_level = "INFO"
log_output = "stdout"
prez_title: Optional[str] = "Prez"
@@ -81,58 +87,5 @@ def set_system_uri(cls, values):
)
return values

-# @root_validator()
-# def populate_top_level_classes(cls, values):
-# values["top_level_classes"] = {
-# "Profiles": [
-# PROF.Profile,
-# PREZ.SpacePrezProfile,
-# PREZ.VocPrezProfile,
-# PREZ.CatPrezProfile,
-# ],
-# "SpacePrez": [DCAT.Dataset],
-# "VocPrez": [SKOS.ConceptScheme, SKOS.Collection],
-# "CatPrez": [DCAT.Catalog],
-# }
-# return values
-#
-# @root_validator()
-# def populate_collection_classes(cls, values):
-# additional_classes = {
-# "Profiles": [],
-# "SpacePrez": [GEO.FeatureCollection],
-# "VocPrez": [],
-# "CatPrez": [DCAT.Resource],
-# }
-# values["collection_classes"] = {}
-# for prez in list(additional_classes.keys()) + ["Profiles"]:
-# values["collection_classes"][prez] = (
-# values["top_level_classes"].get(prez) + additional_classes[prez]
-# )
-# return values
-#
-# @root_validator()
-# def populate_base_classes(cls, values):
-# additional_classes = {
-# "SpacePrez": [GEO.Feature],
-# "VocPrez": [SKOS.Concept],
-# "CatPrez": [DCAT.Dataset],
-# "Profiles": [PROF.Profile],
-# }
-# values["base_classes"] = {}
-# for prez in list(additional_classes.keys()) + ["Profiles"]:
-# values["base_classes"][prez] = (
-# values["collection_classes"].get(prez) + additional_classes[prez]
-# )
-# return values
-#
-# @root_validator()
-# def populate_sparql_creds(cls, values):
-# username = values.get("sparql_username")
-# password = values.get("sparql_password")
-# if username is not None and password is not None:
-# values["sparql_auth"] = (username, password)
-# return values

settings = Settings()
10 changes: 8 additions & 2 deletions prez/models/search_method.py
@@ -24,7 +24,13 @@ class SearchMethod(BaseModel):
     def __hash__(self):
         return hash(self.uri)
 
-    def populate_query(self, term, limit):
+    def populate_query(self, term, limit, focus_to_filter, filter_to_focus, predicates):
         self.populated_query = self.template_query.substitute(
-            {"TERM": term, "LIMIT": limit}
+            {
+                "TERM": term,
+                "LIMIT": limit,
+                "FOCUS_TO_FILTER": focus_to_filter,
+                "FILTER_TO_FOCUS": filter_to_focus,
+                "PREDICATES": predicates,
+            }
         )
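
To illustrate why `populate_query` now passes all five keys: `Template.substitute` raises `KeyError` for any placeholder left without a value. The sketch below assumes a made-up template and made-up filter strings; the real search queries and the way Prez builds the filter fragments are not shown in this diff.

```python
from string import Template

# Hypothetical template using the five placeholders from populate_query.
template_query = Template(
    "SELECT ?uri WHERE { "
    "?uri ?pred ?match . "
    "VALUES ?pred { $PREDICATES } "
    "$FOCUS_TO_FILTER $FILTER_TO_FOCUS "
    'FILTER(CONTAINS(LCASE(?match), LCASE("$TERM"))) '
    "} LIMIT $LIMIT"
)

# Assumed example values; an unused filter is passed as an empty string.
values = {
    "TERM": "address",
    "LIMIT": 10,
    "FOCUS_TO_FILTER": "",
    "FILTER_TO_FOCUS": "<https://linked.data.gov.au/datasets/gnaf> rdfs:member ?uri .",
    "PREDICATES": "rdfs:label skos:prefLabel",
}
populated_query = template_query.substitute(values)
print(populated_query)

# Supplying only the two original keys fails on the first unmatched placeholder.
try:
    template_query.substitute({"TERM": "address", "LIMIT": 10})
    missing = None
except KeyError as err:
    missing = err.args[0]
print("missing placeholder:", missing)
```

Passing an empty string for a filter that is not in use keeps the template valid without resorting to `safe_substitute`, which would leave the raw `$FOCUS_TO_FILTER` text in the query.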