3033 Introduced V4 RECAP Search API #3975
Conversation
- Serializers to support the V4 Search API.
- Handling of the highlight parameter to enable or disable highlighting.
HERE WE GOOOO!
… highlight param
- If the highlight parameter is not passed to the API request, highlighting will be disabled by default.
- To enable highlighting in the Search API, the "highlight=on" parameter must be sent.
- Fixed highlight for other V3 search types.
- Added tests to test consistency of results across pagination.
- Refactored tests to reuse code.
I don't think so. If the backend can handle it then the consumer can too.
Yes. Seems fine.
Yeah, that makes sense. Do we have the docket_entry_id in the recap_docket object in ES also?
Hm, we can do fast count queries that lose accuracy after a certain point, right? I'd say we do that instead. We can just provide the approximate count, and then document that it's only accurate for result sets smaller than XXX (whatever it was we discussed before, if we made a decision about it).
That's not a big deal, but we should open an issue to have it on our backlog as a "someday" issue. Seems easy to fix on the front end, right?
How does this compare to the front end?
This all sounds good. My one concern is that the backwards pagination sounds like a pretty big hack. Is it something of a best practice or something you came up with to solve the problem?
I just did a very quick skim of the code. @ERosendo, if you can do a full review, that would be great, and I'll do a longer review after that.
Correct, I'll add the
Sure, so that means we'd only provide the "count" key containing the accurate result if it's below a threshold, or the approximate count if it's greater. The threshold we defined in #3926 was 2,000 documents, considering up to 100 pages of 20 documents each. We can perform the same cardinality count with accurate results up to that threshold, but since we're currently getting an accurate count of up to 10,000 items from the main query, we can use that count if the results are fewer than 10,000 and use the approximate count returned by the cardinality aggregation if they exceed 10,000. Does that sound good?
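For illustration, the count-selection logic described above might look roughly like this (the function and constant names are hypothetical, not the actual implementation):

```python
CARDINALITY_THRESHOLD = 10_000  # cap on the accurate count from the main query

def choose_result_count(main_total: int, cardinality_count: int) -> int:
    """Use the accurate main-query count below the cap; otherwise fall back
    to the approximate count from the cardinality aggregation."""
    if main_total < CARDINALITY_THRESHOLD:
        return main_total
    return cardinality_count
```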
Sure, here is the issue: #3999
The HL fields in the front end are the same as in the "r" search type (docket + RD fields) when
Initially, it was just a brief idea I came up with in #3645 (comment) when assessing the
Everything above sounds great, thanks Alberto. My last comment is in response to this:
That sounds right, but to be certain I understand correctly: are there fields that are shown on both the RD and the R search type that are not highlighted in RD even though they are highlighted in the R search type (or the same question for the D result type)?
All the fields HL in the

Let me explain it with a response example for each type.

"r" type:
"d" type:
"rd" type:
Perfect, that's what I was expecting, but just wanted to be sure! Thank you!
Thanks Alberto. All sounds good, and we await Eduardo's review. One other thought: Is it hard to add a second count to the
It's not difficult to add a count for the recap documents. I didn't include it because it requires an additional ES query for every API request, and the current V3 only displays one count. However, if it's beneficial for users to have the recap document count in

We might also use this for the frontend in the future. Alternatively, if the document count is only useful for the frontend, we could add an extra parameter to the request, something like
I think it's OK to add it to all responses, particularly if we do the cardinality thing that should make it pretty performant?
Yeah! I'll add the secondary count for the
cl/search/api_utils.py
```python
def limit_api_results_to_page(
    results: Response | AttrList, cursor: ESCursor
) -> Response | AttrList:
```
The `cursor` argument in this method seems to be optional. Let's update the type hint to reflect this by using `| None`.
Suggested change:

```diff
 def limit_api_results_to_page(
-    results: Response | AttrList, cursor: ESCursor
+    results: Response | AttrList, cursor: ESCursor | None
 ) -> Response | AttrList:
```
You're right, I've added the missing `None` to the type hint.
cl/search/api_utils.py
```python
reverse = False
if cursor is not None:
    search_after, reverse, _ = cursor
```
We're not using the `search_after` variable here. I think we can simplify this using a ternary if statement:

```python
reverse = cursor.reverse if cursor else False
```
Yeah, this is better, changed it!
cl/search/types.py
```diff
@@ -106,3 +107,6 @@ class EventTable(StrEnum):
     DOCKET_ENTRY = "search.DocketEntry"
     RECAP_DOCUMENT = "search.RECAPDocument"
     UNDEFINED = ""
+
+
+ESCursor = namedtuple("ESCursor", ["search_after", "reverse", "search_type"])
```
is there a reason to use a `namedtuple` instead of a `dataclass`? Dataclasses provide the same data storage capabilities with the added benefit of built-in type hints and cleaner syntax. This can improve code readability and maintainability.
Thanks for the suggestion. Using a `dataclass` can improve readability, so I've made the change:

```python
@dataclass(frozen=True)
class ESCursor:
    search_after: AttrList | None
    reverse: bool
    search_type: str
```

`frozen=True` makes it immutable, as the `namedtuple` was.
cl/lib/elasticsearch_utils.py
```python
main_query = s
if cd["highlight"]:
    main_query = add_es_highlighting(s, cd)
```
We can use a ternary if statement here.
Yeah, changed it!
cl/lib/document_serializer.py
```python
class DateField(serializers.Field):
    """Handles date objects."""
```
Let's update this docstring to improve clarity. Initially, I struggled to differentiate between this class and the `DateOrDateTimeField` class. However, I realized that this new class uses only the `serializers.DateField()` class to handle dates. In contrast, the older class offers more flexibility, using either the `DateTimeField` or the `DateField` serializer.
Sure, I've updated the docstring and also renamed the class to `CoerceDateField` to avoid confusion.
cl/api/pagination.py
```python
def __init__(self):
    self.page_size = settings.SEARCH_API_PAGE_SIZE
    self.request = None
    self.es_list_instance = None
    self.results_count_exact = None
    self.results_count_approximate = None
    self.child_results_count = None
    self.results_in_page = None
    self.base_url = None
    self.cursor = None
    self.search_type = None
    self.cursor_query_param = "cursor"
    self.invalid_cursor_message = "Invalid cursor"
```
Since the `__init__` method doesn't accept any arguments and only assigns values to variables, I believe we can simplify the code by moving these assignments directly to the class level as class variables.
Yeah, you're right. I have moved them to class variables. The only one that needed to be within `__init__` was `self.page_size`, for testing purposes. However, it's better to use `settings.SEARCH_API_PAGE_SIZE` directly in the code and remove `__init__`.
cl/lib/elasticsearch_utils.py
```python
if toggle_sorting and order_by:
    sort_components = order_by.split(",")
    toggle_sort_components = []
    for component in sort_components:
        component = component.strip()
        if "desc" in component:
            toggle_sort_components.append(component.replace("desc", "asc"))
        elif "asc" in component:
            toggle_sort_components.append(component.replace("asc", "desc"))
        else:
            toggle_sort_components.append(component)

    return ", ".join(toggle_sort_components)

return order_by
```
We can simplify the code structure and improve readability by using an early return to reduce the size of the nested block.
Sure, updated it.
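For reference, a sketch of the early-return version as a standalone helper (the name `toggle_sort_order` and the signature are illustrative; the real code lives in `cl/lib/elasticsearch_utils.py`):

```python
def toggle_sort_order(order_by: str, toggle_sorting: bool) -> str:
    """Flip asc/desc in each sort component; the early return keeps the
    main loop out of a nested block."""
    if not toggle_sorting or not order_by:
        return order_by

    toggled = []
    for component in order_by.split(","):
        component = component.strip()
        if "desc" in component:
            toggled.append(component.replace("desc", "asc"))
        elif "asc" in component:
            toggled.append(component.replace("asc", "desc"))
        else:
            toggled.append(component)
    return ", ".join(toggled)
```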
The code looks good overall 👍 . My comments are mostly minor nits, so I think we can merge after they are addressed.
Thank you @ERosendo
@albertisfu LGTM 👍
@blancoramiro, this has a noop migration, so it needs a little hand-holding. Do you think you can get it deployed, please?
Migrations applied and merge deployed! |
This PR introduces version 4 of the RECAP Search API as outlined in #3033, which will serve as the base for other version 4 API search types.
Here are the main features of this endpoint:
Search types:

The v4 RECAP Search API supports three different search types. All of them operate similarly in terms of queries (they support the same queries as the frontend), but they differ in how results are displayed. For example, the Docket type `"d"` can match dockets by RD fields, although RD fields are not displayed. The RECAP_DOCUMENT type `"rd"` can match RDs by docket fields that are indexed within each RD.

RECAP `"r"`

This is the main type and it mimics the RECAP Search results in the frontend. The object structure looks like this:
- The `recap_documents` field displays up to 5 RECAPDocuments that matched the query.
- `more_docs` will be true, indicating there are additional RDs matched by the query. Otherwise, it is false.
- Timestamps are `datetime` in UTC.
in UTC.DOCKETS "d"
This search type only display docket fields without the "recap_documents" or
more_docs
fields.Regarding the
d
andr
types, I noticed that the parties and attorneys fields can be massive in some cases, which can make some responses quite large.Should we consider displaying a maximum number of parties and attorneys?
RECAP_DOCUMENT `"rd"`

This search type only displays RECAPDocument fields.

I realized that this document type can be useful to accomplish the same results as the `docket_id` query in the frontend. For instance, a query like this:

`?order_by=score+desc&type=rd&q=docket_id:1`

will return all RDs that belong to that `docket_id`. So I did not add the `docket_id` query to the `r` type, which would increase the number of nested documents from 5 to 100; maybe the `rd` type is just better for the same objective?

One question here: would it be necessary to add some Docket fields to this serializer so users can easily identify the parent docket? Perhaps adding the `docket_id`?

Results Count
Since the total count of results is no longer required for computing pagination, an additional count query is not needed. Instead, the count is taken from the main query. With the count limit of 10,000, we can show the actual number of results when there are fewer than 10,000.

So the count key looks like this:

- `exact` is the number of results matched by the query, up to 10,000 items.
- `more` indicates whether there are more than 10,000 documents; if so, it will be `true`.

Suggestions are welcome if you have a better way to display the exact count and indicate that there are more results.
Cursor pagination and sorting

As requested in #3645, V4 now uses cursor pagination. The cursor paginator is custom-made to work alongside the ES `search_after` parameter, enhancing performance during deep pagination. However, the architecture of our `ESCursorPagination` class follows the standards of `CursorPaginator`.

For the cursor paginator to function, it is mandatory to set a sorting key used as the "cursor" in the ES request. The supported sortings are:
Additionally, to avoid pagination inconsistencies due to repeated values like scores or dates, a secondary sorting key, which must be unique for each result, is required. For the 'r' and 'd' types, this key is `docket_id`, and for the 'rd' type, it is the RD `id`. It's important to note that using two sorting keys can lead to discrepancies between sorting in the frontend and in the API v4, even though the primary sorting field remains the same. The introduction of the secondary sorting key will sort results with duplicated sorting values in `docket_id desc` or `id desc` order, acting as a tiebreaker to ensure consistent results across pages. In the frontend, the order of documents with the same sorting values is displayed arbitrarily.

For date sorting such as 'dateFiled', a workaround was necessary regarding sorting and the `search_after` request. The issue arises when Dockets with a `None` 'dateFiled' are indexed as null in ES. However, when using this field for sorting, the null fields are represented as `-9,223,372,036,854,775,808`, which is the `long.min_value` in Java, causing an illegal value error when sent as part of the `search_after` parameter.

The workaround applied was to use the function score (with a few tweaks) we are currently using for the `entry_date_filed` sorting in the frontend. Thus, when a document with a 'dateFiled' of None is in the results, its sorting value is 0 instead of `long.min_value`.
So when sorting by either:

Results where the sorting field is None will be shown at the end, regardless of the order (`desc` or `asc`).

By default, ES does not provide a "search_before" parameter for backward pagination, so a different approach was required to implement cursor-based backward pagination. It uses the same `search_after` approach, but when going backwards, the sorting keys are inverted and the item selected as the "search_after" is the first item on the page, allowing it to go back to the previous page. A final step is required, as the results for this backward query are returned in the reverse order of the original one: it is necessary to invert the results on that page to achieve the original order.

Results per page are controlled by the setting `SEARCH_API_PAGE_SIZE`, which defaults to 20.

Random sorting
The random sorting key is being omitted from this PR, as it currently uses a sorting script instead of a function score, so it would require a change in approach to work alongside the `search_after` parameter and might not function properly due to the randomness of the `search_after` parameter. Therefore, as we agreed, it will not be implemented at this time.

In general, cursor pagination is working as expected, and results are consistent across pages, even in scenarios where new items are indexed or removed. However, there are some corner cases where cursor pagination can lead to inconsistent results: for instance, if the last or first document on the current page (used as the cursor) is updated before moving to the next page, and the update affects the field used as the sorting key (like 'dateFiled') or, if sorting by relevance, the document score. This can lead to inconsistencies when moving to the next or previous page, where the updated document or documents can be displayed again. To solve this issue, the ES documentation recommends using a Point in Time. I'll open an issue to describe how it works in detail in case it is required to implement in the future.
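The backward-pagination trick described above can be modeled with a toy example over integer sort keys (an illustration of the idea, not the real ES code):

```python
def es_like_search(items, descending, after, size):
    """Toy stand-in for an ES query: sort the corpus, apply a search_after
    filter relative to the sort direction, and take one page."""
    ordered = sorted(items, reverse=descending)
    if after is not None:
        ordered = [x for x in ordered if (x < after if descending else x > after)]
    return ordered[:size]

items = [10, 20, 30, 40, 50, 60]
# Forward: descending order, pages of 3.
page1 = es_like_search(items, descending=True, after=None, size=3)
page2 = es_like_search(items, descending=True, after=page1[-1], size=3)
# Backward from page2: invert the sort, use the FIRST item of the current
# page as search_after, then reverse the returned page to restore the order.
prev = es_like_search(items, descending=False, after=page2[0], size=3)
prev.reverse()
```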
The cursor is a base64-encoded string that looks like:

`cursor=cz0yLjIwNzg0ODUmcz0xNjYmdD1k`

It contains the following parameters:

- the `search_after` parameter.

If an invalid cursor string is sent, the response will contain the following body:
Also, if a user switches to a different search type, for instance, if the original request was performed for the `"r"` type and then it's changed to `"d"` without cleaning the current cursor, the Invalid cursor error will be shown to avoid pagination inconsistencies due to the current cursor not matching the sorting values of the new search type.

On every ES request, 'page_size + 1' documents are requested to check whether there are more results and to determine whether to display a next or previous page.
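Decoding the sample token above yields a querystring-like payload, so a rough sketch of the encoding could look like this (the exact layout, including how the reverse flag is packed, is an assumption inferred from the sample token):

```python
import base64
from urllib.parse import parse_qs, urlencode

def encode_cursor(search_after_values, search_type):
    """Pack sort values and the search type into a base64 token.
    One "s" entry per sort value plus "t" for the search type; this
    mirrors the decoded sample token but is only an approximation."""
    pairs = [("s", v) for v in search_after_values] + [("t", search_type)]
    return base64.urlsafe_b64encode(urlencode(pairs).encode()).decode()

def decode_cursor(token):
    """Reverse of encode_cursor: base64-decode, then parse the querystring."""
    params = parse_qs(base64.urlsafe_b64decode(token.encode()).decode())
    return params.get("s", []), params.get("t", [""])[0]

token = encode_cursor([2.2078485, 166], "d")
sort_values, search_type = decode_cursor(token)
```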
The next and previous links look like these:
Highlighting on demand

By default, highlighting in the v4 Search API is disabled, providing a performance boost to requests. When highlighting is disabled in the RECAP results, the plain-text snippet (first 500 characters) is extracted directly from the database for the results on a page. This is because highlighting would otherwise be required to retrieve the 'no_match' fragment while avoiding retrieval of the entire plain text, which can be expensive. Thus, to fully benefit from disabling highlighting, data extraction from the database is necessary.

To enable highlighting, users should pass the `highlight=on` parameter in the request.
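In other words, the parameter check amounts to something like this (an illustrative helper, not the actual view code):

```python
def highlight_enabled(query_params) -> bool:
    """Highlighting is off unless the request explicitly passes highlight=on."""
    return query_params.get("highlight", "") == "on"
```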
Highlighted fields include:
Dockets
RDs
Let me know what you think.