-
Notifications
You must be signed in to change notification settings - Fork 400
Index to search #1276
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Index to search #1276
Conversation
89fd00e
to
526d757
Compare
10bfd94
to
5bd6b18
Compare
️✅ There are no secrets present in this pull request anymore.If these secrets were true positive and are still valid, we highly recommend you to revoke them. 🦉 GitGuardian detects secrets in your source code to help developers and security teams secure the modern development process. You are seeing this because you or someone else with access to this repository has authorized GitGuardian to scan your pull request. |
5bd6b18
to
e966594
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
First review, I know work is still ongoing and I did not read all the tests... :)
src/backend/core/api/serializers.py
Outdated
q = serializers.CharField(required=True) | ||
|
||
def validate_q(self, value): | ||
"""Ensure the text field is not empty.""" | ||
|
||
if len(value.strip()) == 0: | ||
raise serializers.ValidationError("Text field cannot be empty.") | ||
|
||
return value |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
q = serializers.CharField(required=True) | |
def validate_q(self, value): | |
"""Ensure the text field is not empty.""" | |
if len(value.strip()) == 0: | |
raise serializers.ValidationError("Text field cannot be empty.") | |
return value | |
q = serializers.CharField(required=True, allow_blank=False) |
You may also add trim_whitespace=True
src/backend/core/api/viewsets.py
Outdated
serializer.is_valid(raise_exception=True) | ||
|
||
try: | ||
indexer = FindDocumentIndexer() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess this class should come from settings, because not everyone will have an indexer. I also think this view might fallback on searching locally on title if no indexer is configured.
url = getattr(settings, "SEARCH_INDEXER_QUERY_URL", None) | ||
|
||
if not url: | ||
raise RuntimeError( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
raise RuntimeError( | |
raise ImproperlyConfigured( |
Returns: | ||
dict: A JSON-serializable dictionary. | ||
""" | ||
url = getattr(settings, "SEARCH_INDEXER_QUERY_URL", None) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
url = getattr(settings, "SEARCH_INDEXER_QUERY_URL", None) | |
url = settings.SEARCH_INDEXER_QUERY_URL |
logger.error("HTTPError: %s", e) | ||
logger.error("Response content: %s", response.text) # type: ignore |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Log the error only once
src/backend/core/models.py
Outdated
} | ||
|
||
|
||
@receiver(signals.post_save, sender=DocumentAccess) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We try to follow "Use signals as a last resort" (Two scoops of Django): is there a problem calling this from the save
method?
def sortkey(d): | ||
return d["id"] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
push_call_args = [call.args[0] for call in mock_push.call_args_list] | ||
|
||
assert len(push_call_args) == 1 # called once but with a batch of docs | ||
assert sorted(push_call_args[0], key=sortkey) == sorted( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Interesting, actually I think the documents sorting should be deterministic, in case we need tu run several times the index command => I think we should change the index command to sort documents by creation date or something ^^
We can keep it this way for now, but we surely need to add a comment in the "index" management command.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the sort is by id because the indexation is done by batch : loop + id__gt=prev_batch_last_id
push_call_args = [call.args[0] for call in mock_push.call_args_list] | ||
|
||
assert len(push_call_args) == 1 # called once but with a batch of docs |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Don't you want to simply check the assert_called_once, then use the first value?
7cfa907
to
7255ec2
Compare
client = APIClient() | ||
client.force_login(user) | ||
|
||
response = APIClient().get("/api/v1.0/documents/search/", data={"q": "alpha"}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Too much APIClient()
^^'
response = client.get
|
||
response = APIClient().get("/api/v1.0/documents/search/", data={"q": "alpha"}) | ||
|
||
assert response.status_code == 401 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it the expected status code for misconfiguration?
client = APIClient() | ||
client.force_login(user) | ||
|
||
response = APIClient().get("/api/v1.0/documents/search/") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
client = APIClient() | |
client.force_login(user) | |
response = APIClient().get("/api/v1.0/documents/search/") | |
client = APIClient() | |
client.force_login(user) | |
response = client.get("/api/v1.0/documents/search/") |
assert response.status_code == 400 | ||
assert response.json() == {"q": ["This field is required."]} | ||
|
||
response = APIClient().get("/api/v1.0/documents/search/", data={"q": " "}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
response = APIClient().get("/api/v1.0/documents/search/", data={"q": " "}) | |
response = client.get("/api/v1.0/documents/search/", data={"q": " "}) |
src/backend/core/api/viewsets.py
Outdated
except ImproperlyConfigured: | ||
return drf.response.Response( | ||
{"detail": "The service is not properly configured."}, | ||
status=status.HTTP_401_UNAUTHORIZED, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why a 401?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree... why ? A 500 is a better idea 😅
src/backend/impress/settings.py
Outdated
# Setup indexer configuration to make test working on the CI. | ||
SEARCH_INDEXER_SECRET = "ThisIsAKeyForTest" # noqa | ||
SEARCH_INDEXER_URL = "http://localhost:8081/api/v1.0/documents/index/" | ||
SEARCH_INDEXER_QUERY_URL = "http://localhost:8081/api/v1.0/documents/search/" | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure to understand.
My two cents:
- you should only set the settings in the tests which requires it
- there must be a way to disable the indexation, because not every instance of docs will have an index, and it must be the default case.
=> I would remove those settings, determine which settings enables the indexation (I think SEARCH_INDEXER_CLASS should be None by default, and if None, then the document must not be indexed). Add new tests on the documents save, with SEARCH_INDEXER_CLASS not None, to check the indexation task etc.
Search in Docs relies on an external project like "La Suite Find". We need to declare a common external network in order to connect to the search app and index our documents.
We need to content in our demo documents so that we can test indexing.
Add indexer that loops across documents in the database, formats them as json objects and indexes them in the remote "Find" mico-service.
On document content or permission changes, start a celery job that will call the indexation API of the app "Find". Signed-off-by: Fabre Florian <ffabre@hybird.org>
Signed-off-by: Fabre Florian <ffabre@hybird.org>
Signed-off-by: Fabre Florian <ffabre@hybird.org>
New API view that calls the indexed documents search view (resource server) of app "Find". Signed-off-by: Fabre Florian <ffabre@hybird.org>
New SEARCH_INDEXER_CLASS setting to define the indexer service class. Raise ImpoperlyConfigured errors instead of RuntimeError in index service. Signed-off-by: Fabre Florian <ffabre@hybird.org>
Signed-off-by: Fabre Florian <ffabre@hybird.org>
1c6a226
to
31d09a7
Compare
Filter deleted documents from visited ones. Set default ordering to the Find API search call (-updated_at) BaseDocumentIndexer.search now returns a list of document ids instead of models. Do not call the indexer in signals when SEARCH_INDEXER_CLASS is not defined or properly configured. Signed-off-by: Fabre Florian <ffabre@hybird.org>
31d09a7
to
5e9f48b
Compare
Purpose
We want to add fulltext (and semantic in a second phase) search to Docs.
The goal is to enable efficient and scalable search across document content by pushing relevant data to a dedicated search backend, such as OpenSearch. The backend should be pluggable.
Proposal
Fixes #322