-
Notifications
You must be signed in to change notification settings - Fork 49
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[DS-4301] Added Content Reports section and Filtered Collections report therein #202
Conversation
Added the second report, Filtered Items (for metadata queries). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jeffmorin : Gave this a quick review today (while also reviewing the corresponding backend PR: DSpace/DSpace#8598 -- still testing it though). I have some questions about the design here, namely about the GET vs POST versions of these endpoints... I'm not sure I understand why POST is used so heavily (as POST tends to mean you are creating something new).
Also, I noticed that in the old REST API, these were all GET queries: https://wiki.lyrasis.org/display/DSDOC6x/REST+Reports+-+Summary+of+API+Calls
It's possible I'm simply forgetting why these are switched to POST, or it has to do with the limit to the number of params that you pass to a GET. But, I think we need to describe GET vs POST better in this Contract.
?filters=is_discoverable,has_multiple_originals,has_pdf_original | ||
``` | ||
|
||
In POST mode, it is defined as a JSON document like this: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd prefer if we describe GET and POST mode separately. I'm having a hard time understanding the way this is documented. When would someone use GET and when would they use POST? It's unclear if everything below this point in the docs is ONLY for POST or if it also applies to GET? Could we give some more examples here as to what the differences are?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I reorganized report parameterization in the Filtered Collections report documentation. I also fixed a few mistakes and added some info I realised that was missing (e.g., definition of "basic report" in Filtered Items).
About usage of GET vs. POST, please see my other comments below.
contentreports-filtereditems.md
Outdated
[Back to the list of all defined endpoints](endpoints.md) | ||
|
||
## Statistics for the whole repository | ||
**POST /api/contentreports/filtereditems** |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is this a POST instead of a GET? I notice that the "statistics" endpoints always use GET except when they are adding data to the statistics. See https://github.com/DSpace/RestContract/blob/main/statistics-reports.md and https://github.com/DSpace/RestContract/blob/main/statistics-viewevents.md
Could we better describe why we need to use POST for these endpoints? It appears they are readonly, which implies they might be switched to GET.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's true, I thought about it. My concern with GET is, as you suggested above, the limited length of the parameters passed as part of the URL. There should be no problems with the Filtered Collections report (only Boolean filters).
Parameterization of the Filtered Items report, however, is much more complex and can easily become long enough to exceed any limit enforced by application servers for URL query strings. This is why I implemented this report as a POST endpoint.
For (a bit of) uniformity, I also added POST support to the Filtered Collections report.
Besides, while the HTTP spec clearly states that GET should be used for read-only requests, I saw nothing stating that POST should be used only for data-changing requests.
If you feel that everything should be switched to GET anyway, I can do it. In this case, the Filtered Items report shall be thoroughly tested with lots of parameters selected to make sure that nothing goes wrong due to too long a query string.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
About Filtered Items: The parameters could be organized into a GET query string, although such a string might become quite long. Another concern I had while designing the API for this report is the "query predicates" part: these are structured parameters (a query predicate is a (field, operator, value) tuple). This is another reason why I didn't include a GET version of this report.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please check: https://github.com/DSpace/RestContract/blob/main/search-endpoint.md solution.
A query string solution it's used for filtering results:
f.<:filter-name>=<:filter-value>,<:filter-operator>
where a filter-name
belongs to a predefined structure that is previously returned. Like: f.title=rainbows,notcontains
{
"query":"my query",
"scope":"9076bd16-e69a-48d6-9e41-0238cb40d863",
"appliedFilters": [
{
"filter" : "title",
"operator" : "notcontains",
"value" : "abcd",
"label" : "abcd"
},
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this discussion stems from a disagreement between HTTP and REST about what POST is for. RFC 9110 says that creating a new resource is only one possible use for POST. "The POST method requests that the target resource process the representation enclosed in the request according to the resource's own specific semantics." The description of POST here is quite a bit narrower. One might say that REST is a reuse of HTTP syntax with different semantics.
So it can be argued that REST is not a very good fit for this operation, but it is what we have.
Would it be a violation of the "spirit of REST" to consider such a POST to be storing a report description, which is consumed in the act of generating the report? Reports may take some time to create. Suppose that one POSTs a document describing the desired report and receives a token in the response. The report generator runs in the background. When finished, the token can be presented (using GET) to receive the report, and the report description is then destroyed.
I changed all |
Thank you @jeffmorin for this important contribution. I would like to help you bringing this to DSpace core. I didn't (yet) check your implementation and I was expecting to have the DSpace demo site working for comparing, but it's unavailable. I agree with @tdonohue , our Rest Contract defines that POST method should be used to create new resources, leaving the GET to access resources: My understanting it's that we aren't creating resources, but access them, so we should use GET. At the discovery or search endpoints, a complex structure is defined at the backend, stating what the client can do (defining available browses, facets, sorts options,...). It's also possible use query string to pass params. But, when pre-defining what structure to use, you don't need to pass a lot of query params. |
I could argue that we are neither creating nor accessing resources, because these reports are not resources; they are snapshots of the operational state of a store of resources. (I realize that this is the opposite of what I said above.) What is wanted is really not REST at all, but REST is all that we have to work with. REST basically defines putting things into a store and getting them out again. Complex configurable reports on the status of the store itself are something that REST isn't able to do if one cares about ideological purity. The notion of triggering pre-defined reports (which is how I understand the Discovery analogy) is interesting, but I think that we need to consider how the reports will be used. Statistics might be a useful model. One sometimes doesn't yet know what questions to ask, so one engages in wide-ranging exploration of the data to find promising perspectives. At other times (such as after exploratory examination) one has specific questions to be answered, possibly over and over again as the dataset evolves. Is this a good fit to the way we expect Content Reports to be used? Because doing exploratory work by repeatedly editing a set of preconfigured reports on the back-end sounds slow, tedious and exhausting. |
I don't have access to this feature right now, but I found a very useful video by Terry Brady (@terrywbrady ) Just to contextualize others, I would also like to add the Wiki page URL for the DS6 feature: |
I just added a GET endpoint to the Filtered Items content report. I also made a minor update to the Filtered Collections report documentation. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jeffmorin thanks for sharing your proposal to move the DSpace 6 reporting system to DSpace 7. I have added my feedback inline
contentreport-filteredcollections.md
Outdated
The GET-based endpoint takes a `filters` query parameter whose value is a comma-separated list of filters | ||
like the following: | ||
``` | ||
?filters=is_discoverable,has_multiple_originals,has_pdf_original | ||
``` | ||
|
||
Alternatively, the comma-separated list can be replaced by a repetition of the `filters` parameter | ||
for each requested filter: | ||
``` | ||
?filters=is_discoverable&filter=has_multiple_originals&filter=has_pdf_original | ||
``` | ||
|
||
|
||
Please see [below](#available-filters) for the list of available filters. | ||
|
||
**POST /api/contentreport/filteredcollections** |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would recommend against that. We should use the filter to build an unique id of the report, similar to what is done in the statistics endpoint see https://github.com/DSpace/RestContract/blob/main/statistics-reports.md
In this way the /api/contentreports/filteredcollections
will represent our collection of resources endpoint (it can return 405 as there is no use case in scrolling all the potential report) and /api/contentreports/filteredcollections/<concatenation-of-filter-as-report-id>
the individual report. In this way the HTTP caching mechanism will work at the best
contentreport-filtereditems.md
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems to be very close to the discovery endpoint... we need to build a query and get the list of matching items. If we want to get a different representation of the resulting items it should be an "export" script.
My suggestion is to look to which "filter" type are not yet supported in discover and eventually implement them instead than build from scratch a separate endpoint. This would allow to reuse almost everything in the UI just defining a special discovery configuration used for this purpose (similar to what has been done for the administrative search)
Please see the comment I wrote at DSpace/DSpace#8598. |
My fork is now synchronized with the main branch. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @jeffmorin !
I've already had a manifested my thoughts about when we should bring this feature into DS. But that doesn't mean that I don't want it for DS8. I just think we need to straighten our heads in which way is the right path for it, then build the needed scaffolding to have this feature in top of it. Just for comparison, a specific Working Group was created just to address discussions about Entities single feature on DSpace 7.
I'm afraid that having this feature as is, others might build features on top of it, or use it and then it will be very hard to have it refactored later on.
DS6 content report feature has a set of functionalities that are still missing in this specification on this PR, for example collection filters (I didn't see any endpoint specification that retrieved available filters) and also filtering by collection id.
You can take a glance on:
https://wiki.lyrasis.org/display/DSDOC6x/REST+Reports+-+Collection+Report+Screenshots+with+Annotated+API+Calls
Also, I think queryPredicates
at the item filtering level could be better described in the documentation. And I also think we need to have endpoints for what type of operations are available within fields (perhaps advanced search could help in here).
Even if the implementation doesn't support it yet, I think those missing endpoints should be defined at this phase.
Perhaps some of the difficulty is in calling it "beta"? That implies that the design is finished and the code is expected to work well. Would "experimental" or "preview" be more suitable? With a stern warning: "the work is unfinished and we expect to make substantial changes in the final, supported code to address, at least, these issues...." Because that's what was agreed upon, I think: an early release in 8.0 despite 8.0's compressed schedule, to get this facility into the hands of users this year and to gather wider experience; and a stable release with full support by 9.0. If people choose to rely on internals of code when they've been clearly warned not to, that is their problem, and we should not be shy about reminding them. Politely appreciative of feedback: yes; apologetic: no. It would be helpful to have missing features logged as issues so that they can be studied, scheduled and tracked. |
@paulo-graca : From talking with @pilasou and @jeffmorin , I don't think it's possible to update this PR any further in time for 8.0. The decision for 8.0 was to scale back because @pilasou and @jeffmorin don't have availability to do any more detailed work on this code. We have to remember that, while this is wanted by many, the development on this is entirely donated by U of Laval. There are only two options that remain:
As decided in the meeting on Feb 8, we voted for option 1. That means we let this into 8.0 so that everyone can start to improve/enhance it. But, put warnings on it that it is missing features & there are some known issues. Those missing features / known issues can be detailed in GitHub issue tickets to be addressed in future 8.x or 9.0 releases. |
@tdonohue, to me, there are two different aspects here:
|
@paulo-graca : To clarify my point... I disagree with this statement:
I feel the REST Contract should only define the endpoints that are implemented at this time. Otherwise, it is misleading to other developers as to what features actually exist. (And to clarify, we do have a versioned REST Contract. We have the Overall, I agree that we should enhance this contract as necessary to describe the current implementation. However, we should not add endpoints which do not yet exist, as the Contract is meant to describe the implementation. I hope that clarifies things. If you do have feedback on the contract compared to the current implementation, then please do point it out. We both agree that there is the opportunity to clean up this PR as the implementation PR also gets cleaned up. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍 Thanks @jeffmorin ! This looks good now.
This pull request features a REST specification for the first of the two REST-based reports to be ported to DSpace 7.x (see ticket DS-4301, DSpace/DSpace#7641), the Filtered Collections report.
Following a suggestion from Andrea Bollini, I created a new
/contentreports
branch in which I created a/filteredcollections
URI for the report itself. I added an entry under Endpoints Under Development/Discussion inendpoints.md
.I also added a missing link and removed a duplicate entry in
endpoints.md
.