Shift the human-friendly query translation into the backend. #7

Merged
merged 5 commits into from
Apr 2, 2024
66 changes: 66 additions & 0 deletions README.md
@@ -142,6 +142,72 @@ The API returns a request body that contains a JSON object with the following pr
Callers can control the number of results to return in each page by setting the `limit=` query parameter.
This should be a positive integer that is no greater than 100.

## Using a human-readable text query syntax

For text searches, we support a more human-readable syntax for boolean operations in the query.
The search string below will look for all metadata documents that match `foo` or `bar` but not `whee`:

```
(foo OR bar) AND NOT whee
```

The `AND`, `OR` and `NOT` (note the all-caps!) are automatically translated to the corresponding search clauses.
This can be combined with parentheses to control precedence; otherwise, `AND` takes precedence over `OR`, and `NOT` takes precedence over both.
Note that any sequence of adjacent text terms is implicitly `AND`'d together, so the two expressions below are equivalent:

```
foo bar whee
foo AND bar AND whee
```
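The precedence rules above can be illustrated with a small recursive-descent parser. This is only a sketch and not the backend's `translateQuery`: the `node` type, the `parse` function and the prefix-style output are hypothetical names made up for the illustration, field prefixes and wildcards are ignored, and a run of adjacent terms is kept as a single text node on the assumption that the text search itself requires every term in the run to match.

```go
package main

import (
	"fmt"
	"strings"
)

// node is a hypothetical parse-tree type used only for this illustration.
type node struct {
	op       string // "text", "not", "and" or "or"
	text     string // only set for "text" nodes
	children []*node
}

// String renders the tree in prefix form, e.g. AND("a", NOT("b")).
func (n *node) String() string {
	if n.op == "text" {
		return fmt.Sprintf("%q", n.text)
	}
	parts := make([]string, len(n.children))
	for i, c := range n.children {
		parts[i] = c.String()
	}
	return strings.ToUpper(n.op) + "(" + strings.Join(parts, ", ") + ")"
}

var reserved = map[string]bool{"(": true, ")": true, "AND": true, "OR": true, "NOT": true}

type parser struct {
	tokens []string
	pos    int
}

func (p *parser) peek() string {
	if p.pos < len(p.tokens) {
		return p.tokens[p.pos]
	}
	return ""
}

// binary parses a left-associative chain of operands separated by keyword.
func (p *parser) binary(keyword string, operand func() (*node, error)) (*node, error) {
	left, err := operand()
	for err == nil && p.peek() == keyword {
		p.pos++
		var right *node
		if right, err = operand(); err == nil {
			left = &node{op: strings.ToLower(keyword), children: []*node{left, right}}
		}
	}
	return left, err
}

// OR binds loosest, then AND, then NOT; a run of adjacent terms binds tightest.
func (p *parser) parseOr() (*node, error)  { return p.binary("OR", p.parseAnd) }
func (p *parser) parseAnd() (*node, error) { return p.binary("AND", p.parseNot) }

func (p *parser) parseNot() (*node, error) {
	if p.peek() != "NOT" {
		return p.parseAtom()
	}
	p.pos++
	child, err := p.parseNot()
	if err != nil {
		return nil, err
	}
	return &node{op: "not", children: []*node{child}}, nil
}

// parseAtom reads a parenthesised expression or a run of adjacent terms.
func (p *parser) parseAtom() (*node, error) {
	if p.peek() == "(" {
		p.pos++
		inner, err := p.parseOr()
		if err != nil {
			return nil, err
		}
		if p.peek() != ")" {
			return nil, fmt.Errorf("missing closing parenthesis")
		}
		p.pos++
		return inner, nil
	}
	var words []string
	for t := p.peek(); t != "" && !reserved[t]; t = p.peek() {
		words = append(words, t)
		p.pos++
	}
	if len(words) == 0 {
		return nil, fmt.Errorf("expected a search term at token %d", p.pos)
	}
	return &node{op: "text", text: strings.Join(words, " ")}, nil
}

// parse returns the prefix-form rendering of a search string.
func parse(query string) (string, error) {
	query = strings.NewReplacer("(", " ( ", ")", " ) ").Replace(query)
	p := &parser{tokens: strings.Fields(query)}
	root, err := p.parseOr()
	if err != nil {
		return "", err
	}
	if p.pos != len(p.tokens) {
		return "", fmt.Errorf("unexpected token %q", p.peek())
	}
	return root.String(), nil
}

func main() {
	for _, q := range []string{"(foo OR bar) AND NOT whee", "a OR b AND c", "NOT a b AND c d"} {
		out, err := parse(q)
		fmt.Println(q, "=>", out, err)
	}
}
```

In this sketch, `a OR b AND c` groups as `OR("a", AND("b", "c"))`, and `NOT a b AND c d` negates only the run `a b`.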

Users can prefix any sequence of text terms with the name of a metadata field, to only search for matches within that field of the metadata file.
For example:

```
(title: prostate cancer) AND (genome: GRCh38 OR genome: GRCm38)
```

Note that this does not extend to the `AND`, `OR` and `NOT` keywords,
e.g., `title:foo OR bar` will not limit the search for `bar` to the `title` field.
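To make the scoping rule concrete, the sketch below splits a single run of adjacent terms into an optional field name and the remaining text. `splitField` is a hypothetical helper written for this illustration, and the rule that a field name must be a single word before the first colon is an assumption rather than the backend's documented behaviour.

```go
package main

import (
	"fmt"
	"strings"
)

// splitField is a hypothetical helper: given one run of adjacent terms, it
// returns the leading "name:" prefix (if any) and the remaining text.
func splitField(run string) (field string, text string) {
	before, after, found := strings.Cut(run, ":")
	before = strings.TrimSpace(before)
	// Assumption: a field name is a single non-empty word before the colon.
	if !found || before == "" || strings.ContainsAny(before, " \t") {
		return "", strings.TrimSpace(run)
	}
	return before, strings.TrimSpace(after)
}

func main() {
	// "title:foo OR bar" consists of two runs separated by the OR keyword,
	// so only the first run carries the field.
	for _, run := range []string{"title:foo", "bar", "title: prostate cancer"} {
		field, text := splitField(run)
		fmt.Printf("field=%q text=%q\n", field, text)
	}
}
```

Because the keyword ends the run, `bar` comes out with an empty field and would be searched across all fields.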

If a `%` wildcard is present in a search term, the search clause for that term performs a partial search, e.g., `neur%` will match files with `neuron`, `neural`, `neurological`, etc.

The human-friendly mode can be enabled by setting the `translate=true` query parameter in the request to the `/query` endpoint.
The structure of the request body is unchanged except that any `text` field is assumed to contain a search string and will be translated into the relevant search clause.

```shell
curl -X POST -L "${SEWER_RAT_URL}/query?translate=true" \
-H "Content-Type: application/json" \
-d '{ "type": "text", "text": "Aaron OR stuff" }' | jq
## {
## "results": [
## {
## "path": "/Users/luna/Programming/ArtifactDB/SewerRat/scripts/test/sub/B.json",
## "user": "luna",
## "time": 1711754321,
## "metadata": {
## "foo": "bar",
## "gunk": [
## "stuff",
## "blah"
## ]
## }
## },
## {
## "path": "/Users/luna/Programming/ArtifactDB/SewerRat/scripts/test/sub/A.json",
## "user": "luna",
## "time": 1711754321,
## "metadata": {
## "authors": {
## "first": "Aaron",
## "last": "Lun"
## }
## }
## }
## ]
## }
```

## Spinning up an instance

Clone this repository and build the binary.
24 changes: 20 additions & 4 deletions handlers.go
@@ -327,20 +327,32 @@ func newQueryHandler(db *sql.DB, tokenizer *unicodeTokenizer, wild_tokenizer *un
limit = limit0
}

translate := false
if params.Has("translate") {
translate = strings.ToLower(params.Get("translate")) == "true"
}

if r.Body == nil {
dumpJsonResponse(w, http.StatusBadRequest, map[string]string{ "status": "ERROR", "reason": "expected a non-empty request body" })
return
}
query := searchClause{}
query := &searchClause{}
restricted := http.MaxBytesReader(w, r.Body, 1048576)
dec := json.NewDecoder(restricted)
err := dec.Decode(&query)
err := dec.Decode(query)
if err != nil {
dumpJsonResponse(w, http.StatusBadRequest, map[string]string{ "status": "ERROR", "reason": fmt.Sprintf("failed to parse request body; %v", err) })
return
}

san, err := sanitizeQuery(&query, tokenizer, wild_tokenizer)
if translate {
query, err = translateQuery(query)
if err != nil {
dumpJsonResponse(w, http.StatusBadRequest, map[string]string{ "status": "ERROR", "reason": fmt.Sprintf("failed to translate text query; %v", err) })
return // stop here so that a failed translation is not passed on to sanitizeQuery
}
}

san, err := sanitizeQuery(query, tokenizer, wild_tokenizer)
if err != nil {
dumpJsonResponse(w, http.StatusBadRequest, map[string]string{ "status": "ERROR", "reason": fmt.Sprintf("failed to sanitize an invalid query; %v", err) })
return
@@ -355,7 +367,11 @@ func newQueryHandler(db *sql.DB, tokenizer *unicodeTokenizer, wild_tokenizer *un
respbody := map[string]interface{} { "results": res }
if len(res) == limit {
last := &(res[limit-1])
respbody["next"] = endpoint + "?scroll=" + strconv.FormatInt(last.Time, 10) + "," + strconv.FormatInt(last.Pid, 10)
next := endpoint + "?scroll=" + strconv.FormatInt(last.Time, 10) + "," + strconv.FormatInt(last.Pid, 10)
if translate {
next += "&translate=true"
}
respbody["next"] = next
}

dumpJsonResponse(w, http.StatusOK, respbody)
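The reason for carrying the flag into the `next` link is that a client following the link re-sends the same human-readable body, which would otherwise be treated as literal text on later pages. A minimal sketch of the link construction, with `nextLink` as a hypothetical standalone helper rather than a function in this PR:

```go
package main

import (
	"fmt"
	"strconv"
)

// nextLink builds the scroll URL for the following page and preserves the
// translate flag so that later pages interpret the body in the same way.
func nextLink(endpoint string, lastTime, lastPid int64, translate bool) string {
	next := endpoint + "?scroll=" + strconv.FormatInt(lastTime, 10) + "," + strconv.FormatInt(lastPid, 10)
	if translate {
		next += "&translate=true"
	}
	return next
}

func main() {
	fmt.Println(nextLink("/query", 1711754321, 42, true))
	fmt.Println(nextLink("/query", 1711754321, 42, false))
}
```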
21 changes: 21 additions & 0 deletions handlers_test.go
@@ -580,6 +580,27 @@ func TestQueryHandler(t *testing.T) {
}
})

t.Run("translated", func (t *testing.T) {
req, err := http.NewRequest("POST", "/query?translate=true", strings.NewReader(`{ "type": "text", "text": "lamb OR chicken" }`))
if err != nil {
t.Fatal(err)
}

rr := httptest.NewRecorder()
handler.ServeHTTP(rr, req)
if rr.Code != http.StatusOK {
t.Fatalf("should have succeeded")
}

all_paths, scroll := validateSearchResults(rr.Body)
if scroll != "" {
t.Fatalf("unexpected scroll %v", scroll)
}
if len(all_paths) != 2 || all_paths[0] != filepath.Join(to_add, "stuff/other.json") || all_paths[1] != filepath.Join(to_add, "metadata.json") {
t.Fatalf("unexpected paths %v", all_paths)
}
})

t.Run("scroll", func (t *testing.T) {
dummy_query := `{ "type": "text", "text": " " }`

20 changes: 10 additions & 10 deletions html/index.html
@@ -25,7 +25,6 @@
}
</style>

<script type="text/javascript" src="parseQuery.js"></script>
<script type="text/javascript">
var query_body = null;

@@ -36,12 +35,13 @@
errdiv.replaceChildren();

const metadiv = self.document.getElementById('metadata');
try {
let parsed = parseQuery(metadiv.value);
all_clauses.push(parsed.metadata);
} catch (e) {
errdiv.appendChild(self.document.createTextNode(e.message));
return false;
if (metadiv.value != "") {
try {
all_clauses.push({ type: "text", text: metadiv.value })
} catch (e) {
errdiv.appendChild(self.document.createTextNode(e.message));
return false;
}
}

const user = self.document.getElementById('user');
@@ -192,7 +192,7 @@
search.setAttribute("disabled", "");

populateQueryBody();
populateSearchResults("/query", true)
populateSearchResults("/query?translate=true", true)
return false;
}

@@ -255,14 +255,14 @@ <h1>SewerRat search</h1>
<br><br>
On a similar note, the <code>NOT</code> keyword can be used for unary negation.
This should be put before any search terms, e.g., <code>(NOT a b) AND (c d)</code>.
If there are no parenthese, any <code>NOT</code> will take precedence over the other boolean operations,
If there are no parentheses, any <code>NOT</code> will take precedence over the other boolean operations,
i.e., the above query is the same as <code>NOT a b AND c d</code>.
<br><br>
Even more advanced users can prefix any sequence of search terms with the name of a metadata field,
to only search for matches within that field of the metadata file, e.g.,
<code>(title: prostate cancer) AND (genome: GRCh38 OR genome: GRCm38)</code>.
Note that this does not extend to the <code>AND</code>, <code>OR</code> and <code>NOT</code> keywords,
i.e., <code>title:foo OR bar</code> will not limit the search for <code>bar</code> to the <code>title</code> field.
e.g., <code>title:foo OR bar</code> will not limit the search for <code>bar</code> to the <code>title</code> field.
<br><br>
Extremely advanced users can attach a <code>%</code> wildcard to any term to enable a partial search,
e.g., <code>neur%</code> will match files with <code>neuron</code>, <code>neural</code>, <code>neurological</code>, etc.
150 changes: 0 additions & 150 deletions html/parseQuery.js

This file was deleted.
