-
Notifications
You must be signed in to change notification settings - Fork 37
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Adding the property_ranges
query parameter and associated metadata
#481
base: develop
Are you sure you want to change the base?
Adding the property_ranges
query parameter and associated metadata
#481
Conversation
Thanks for writing this up! Why did you move the whole section "Transmission of large property values"? It makes it tricky to see what has changed. Can you move it back? Also, my interpretation of the workshop discussions was to make the |
It seemed more logical to place the metadata section before the "Transmission of large property values" section, as the "Transmission of large property values" section depends on the metadata section but not the other way round. So it is more useful to read the metadata section before the "Transmission of large property values" section. |
You are right, I could still separate the property ranges query parameter and the extra metadata fields from the transmission of large data section, as each could still be used separate from the other. I'll try to do this next week. |
Make sure each sentence starts on a new line. Co-authored-by: Antanas Vaitkus <antanas.vaitkus90@gmail.com>
d2a18b7
to
a84c72e
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Mostly formatting changes and suggesting rewordings -- I think I can go through and force most of them in if there are no objections
property_ranges
query parameter and associated metadata
This PR appears a bit stalled and is important. @JPBergsma do you plan to continue working on this, or are you ok with me (or someone else) starting to merge changes into it? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good overall, just a few minor typos.
On the topic of separators: The business with url-safe characters is complicated. I note that according to RFC3986 both bracket parentheses and colons are in the same category, "gen-delims", which I think is generally worse in terms of having to be escaped in practice compared to the characters in "sub-delims" (and in particular the unreserved characters listed right below the linked segment, which are always safe). However, very particular for bracket parentheses, apparently chrome and Firefox had decided to violate the standard and generally do not encode those characters out of 'tradition'. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good, I left only some minor comments.
@@ -602,6 +605,162 @@ Example of the corresponding metadata property definition contained in the field | |||
} | |||
// ... | |||
|
|||
Slices of array properties |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would suggest using "list" instead of "array" everywhere, as in the specification "list" is defined as data type, whereas "array" is used much less often and mostly as a synonym of "list".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You are right, good catch. Array should almost everywhere be replaced by list. Array should only be kept when we explicitly refer to JSON:API output arrays.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After thinking about this, I don't find a good way to handle this without defining the term 'array' to be a general word for a list or a structure of nested lists. (IMO "Multidimensional lists" is far worse as a term.) So, I've pushed a commit that does this.
(Note: 'resolve' this conversation if you are happy with the fix, otherwise comment below).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll leave it up to somebody more experienced in the protocol to resolve this.
Co-authored-by: Andrius Merkys <andrius.merkys@gmail.com> Co-authored-by: Antanas Vaitkus <antanas.vaitkus90@gmail.com>
…ates with different format than the structure one
Co-authored-by: Antanas Vaitkus <antanas.vaitkus90@gmail.com>
The start value specifies the first index in that dimension for which values should be returned (which is 0-based and inclusive). | ||
The default is :val:`0`. | ||
The stop value specifies the last index for which values should be returned (inclusive). | ||
The default is the last index of the array along the specified dimension. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The default is the last index of the array along the specified dimension. | |
The default is :val:`null`, which represents the last index of the array along the specified dimension. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's actually already explaiend that :val:null
refers to the default value, and that it should be returned unless a specified value was used.
From how I understood it, the same applies to the start
and step
keys too.
Co-authored-by: Antanas Vaitkus <antanas.vaitkus90@gmail.com>
In view of Monday's online meeting where we need to discuss this, it would be great if we can get a last check and, if all is OK, approvals. The only remaining things are:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey! In anticipation of the upcoming meeting, here is the promised review.
Mostly just questions for clarification, or me pointing out where the phrasing may be somewhat dense for those less familiar with OPTIMADE.
From the NOMAD perspective, my main question would be "whether array dimensions MAY also just be integers (served as strings)?"
We are also working on a system for adding (named) variables, so some dimensions may be dynamic. I don't think this conflicts with the specificatiosn per se, but I'd have to consult my coworkers about the implementation.
Slices are used for a client to ask the server to only provide a subset of items of an array, which can result in a small or large set of items. | ||
In contrast, the protocol for large property values is used by the server implementation to transmit a set of items that it deems too large to provide inside the normal OPTIMADE response. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I gather that "small" and "large" have a techncial meaning here (wrt to the transmission).
Do you have any kind of annotation trick to makes this clearer?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Slices are used for a client to ask the server to only provide a subset of items of an array, which can result in a small or large set of items. | |
In contrast, the protocol for large property values is used by the server implementation to transmit a set of items that it deems too large to provide inside the normal OPTIMADE response. | |
The protocol for large property values is used by the server implementation to transmit a set of items that it deems too large to provide inside the normal OPTIMADE response. | |
Slices, on the other hand, are used for a client to request a subset of any size of the items of an array, which can possibly (but not necessarily) result in such a large amount of values that the protocol for large property values is required to transmit them. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like this better, yes. Given that only large has a meaning, it's better to remove small, as you did.
2 minor suggestions:
The protocol for large property values
: can this refer to the relevant section?- If we decide on using some kind of highlighting (as brought up above), I'd mark server and client. I think it's a good short-hand for navigating the protocol.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The protocol for large property values: can this refer to the relevant section?
I realize that the way these comments chop up the text can make it a bit difficult to see the context, but the link you ask for is already right on the row before this line (numbered 616 right now).
In contrast, the protocol for large property values is used by the server implementation to transmit a set of items that it deems too large to provide inside the normal OPTIMADE response. | ||
|
||
The main mechanism is provided through the query parameter :query-param:`property_slices` defined in section `Single Entry URL Query Parameters`_. | ||
Information relating to the ability of the server to handle this query parameter and the relevant ranges of indexes is provided using metadata property field :field:`array_axes` (see `Metadata properties`_). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the relevant ranges of indexes
Above you used the term "array axis". I'd use the same term here, as with "index" idk whether you mean the dimension or a slice along a dimension.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure what you mean here, can you use the GitHub suggest edit feature to show a suggestion? (Also, the discussion of terminology with regards to indices and axes may have been clarified by a discussion we had last web meet?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(Also, the discussion of terminology with regards to indices and axes may have been clarified by a discussion we had last web meet?)
I'd need a refresher there, sorry.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Information relating to the ability of the server to handle this query parameter and the relevant ranges of indexes is provided using metadata property field :field:`array_axes` (see `Metadata properties`_). | |
Information relating to the ability of the server to handle this query parameter. This SHOULD include the property field :field:`array_axes` (see `Metadata properties`_), listing the numerical indices of axes that may be sliced. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm confused, this doesn't read cleanly to me in the context it appears. The first sentence in the suggestion isn't even a complete sentence?
The context here is that we are in the general text at the top of the section trying to outline what this feature is and how it is used. The sentence right now is just trying to communicate that the metadata property array_axis is one of the essential mechanisms of the protocol.
The step value specifies the step size in that dimension. | ||
The default is :val:`1`. | ||
|
||
An empty value of the :query-param:`property_slices` query parameter MUST be interpreted as equivalent to the query parameter not being included in the request. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this then equivalent to property_ranges=[<first_dim>:::, ...]
?
I.e. should this return the full property or leave it out?
This might be defined somewhere else. If so, pls refer to that section.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The query parameter not being included means the client isn't trying to use this protocol; so I think the natural interpretation is that "everything works as usual without the slices protocol". I guess that could be clarified somewhere, but I'm not sure exactly where.
|
||
- :query-url:`http://optimade.example.com/v1/trajectories/id_12345?response_fields=frame_cartesian_site_positions&property_ranges=dim_frames::999:10,dim_sites:30:70:` | ||
|
||
This query URL requests items from the array :field:`frame_cartesian_site_positions` only for the 31st to 71st sites (i.e., with indexes 30 through 70 inclusive) for 1 out of every 10 frames of the first 1000 frames (i.e., taking steps of 10 over indexes 0 through 999 inclusive, which requests the frames with indexes 0, 10, 20, 30, ..., 990) of a trajectory with ID :val:`id_12345`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Minor style changes to make the phrase clearer:
- Split the phrase up by both attributes.
- Align the order of the example with the explanation, i.e. concerning
dim_frames
anddim_sites
. - Just reiterate the 0-indexing convention after "the 31st to 71st sites". Apart from that, it's a great refresher.
the frames with indexes 0, 10, 20, 30, ..., 990
explains it enough. The paranteses else become too verbose.
@@ -1211,6 +1375,31 @@ While the following URL query parameters are OPTIONAL for clients, API implement | |||
The URL query parameter :query-param:`include` is OPTIONAL for both clients and API implementations. | |||
The meaning of these URL query parameters are as defined above in section `Entry Listing URL Query Parameters`_. | |||
|
|||
One additional query parameter :query-param:`property_slices` MUST be handled by the API implementation either as defined below or by returning the error :http-error:`501 Not Implemented`: | |||
|
|||
- **property\_slices**: A number of slice specifications to request only parts of array properties for the functionality described in `Slices of array properties`_. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
May the client request multiple (different) slices from the same array axis in the same query?
E.g. property_slices=[dim_frames:3:37:5,dim_frames:49:105:2]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is clarified in "REQUIRED keys". A link suffices then.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the present version (which may have changed since you commented) I think this information is quite clear in the text right below.
For example, let us consider the property :property:`frame_cartesian_site_positions` of the trajectory entry, where the first dimension name is :val:`dim_frames`. | ||
If there is another one-dimensional (i.e., with a single axis) array property :property:`_exmpl_energy` of the same trajectory entry that specifies in its property definition the same dimension name :val:`dim_frames` for its axis, then the values of :property:`_exmpl_energy` and of :property:`frame_cartesian_site_positions` at index *i* pertain to the same frame. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Her, you lost me
As a follow-up on my comment, the decision in today's online meeting is to go for the syntax |
Coming back to last meeting's discussion about the slicing, I'd like to explain my question further via an example. In NOMAD, we have a tree-like schema with data at the nodes. That data may have any tensor rank, i.e. scalar, vector, matrix, etc. and slicing it is very sensible. However, our schema is being extended to be more dynamic and easier tailor. Consequentially, we foresee that rank may vary. Consider for example a property, like
At a database-wide level, we can only ensure a minimum set of dimensionalities for Lastly, while I named the dimensions here, most of them only bear integers in practice (at least the dependent ones). Would integers (represented as strings) also be fine to identify dimensions, or is that semantically too vague? |
- :field:`requested_slice`: Dictionary. | ||
A field that describes the requested slice that was provided via the query parameter :query-param:`property_slices`. | ||
The subfields MUST reflect the values provided via the :query-param:`property_slices`. | ||
The implementation MUST preserve the values as given in the query parameter, including the distinction between specific values and default values even when they are equivalent. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The implementation MUST preserve the values as given in the query parameter, including the distinction between specific values and default values even when they are equivalent. | |
The implementation MUST preserve the values as given in the query parameter, including the distinction between specific values and default values even when they are equivalent (see example below). |
- :query-url:`http://optimade.example.com/v1/trajectories/id_12345?response_fields=frame_cartesian_site_positions&property_ranges=dim_frames::999:10,dim_sites:30:70:` | ||
|
||
This query URL requests items from the array :field:`frame_cartesian_site_positions` only for the 31st to 71st sites (i.e., with indexes 30 through 70 inclusive) for 1 out of every 10 frames of the first 1000 frames (i.e., taking steps of 10 over indexes 0 through 999 inclusive, which requests the frames with indexes 0, 10, 20, 30, ..., 990) of a trajectory with ID :val:`id_12345`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this addresses @ndaelman-hu suggestions, except for removing the explanation in the last parenthesis, which from our discussions on the workshop I do think still is necessary to be completely clear.
- :query-url:`http://optimade.example.com/v1/trajectories/id_12345?response_fields=frame_cartesian_site_positions&property_ranges=dim_frames::999:10,dim_sites:30:70:` | |
This query URL requests items from the array :field:`frame_cartesian_site_positions` only for the 31st to 71st sites (i.e., with indexes 30 through 70 inclusive) for 1 out of every 10 frames of the first 1000 frames (i.e., taking steps of 10 over indexes 0 through 999 inclusive, which requests the frames with indexes 0, 10, 20, 30, ..., 990) of a trajectory with ID :val:`id_12345`. | |
- :query-url:`http://optimade.example.com/v1/trajectories/id_12345?response_fields=frame_cartesian_site_positions&property_ranges=dim_sites:30:70:,dim_frames::999:10` | |
This query URL requests items from the trajectory with ID :val:`id_12345`. | |
It requests items from the array :field:`frame_cartesian_site_positions` for this trajectory. | |
The items that are requested are for only the 31st to 71st sites (i.e., with indexes 30 through 70 inclusive) for 1 out of every 10 frames of the first 1000 frames (i.e., taking steps of 10 over indexes 0 through 999 inclusive, which requests the frames with indexes 0, 10, 20, 30, ..., 990). |
Dimension names defined by database or definition providers MUST be prefixed by the corresponding database or namespace prefix, and SHOULD also be prefixed by ``dim_``, e.g., ``_exmpl_dim_particles``. | ||
If, within one entry, two or more array axes in one or more properties share the same dimension :field:`name`, those represent the same dimension. | ||
For example, let us consider the property :property:`frame_cartesian_site_positions` of the trajectory entry, where the first dimension name is :val:`dim_frames`. | ||
If there is another one-dimensional (i.e., with a single axis) array property :property:`_exmpl_energy` of the same trajectory entry that specifies in its property definition the same dimension name :val:`dim_frames` for its axis, then the values of :property:`_exmpl_energy` and of :property:`frame_cartesian_site_positions` at index *i* pertain to the same frame. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ndaelman-hu is this more clear?:
If there is another one-dimensional (i.e., with a single axis) array property :property:`_exmpl_energy` of the same trajectory entry that specifies in its property definition the same dimension name :val:`dim_frames` for its axis, then the values of :property:`_exmpl_energy` and of :property:`frame_cartesian_site_positions` at index *i* pertain to the same frame. | |
Let the trajectory entry in this example have another, one-dimensional, array property :property:`_exmpl_energy`, which in its property definition specifies *the same name*, :val:`dim_frames`, as the name of the axis corresponding to its single dimension. | |
The joint dimension name means the values of :property:`_exmpl_energy` and of :property:`frame_cartesian_site_positions` at index *i* pertain to the same frame. | |
If slicing is used to request only parts of the data along the :val:`dim_frames` dimension, that is a request to slice both the properties according to the specified slice. |
I have now updated the PR to reflect the syntax I have also tried to somehow address all outstanding comments of @ndaelman-hu. It would be great if you can:
Finally, as a reply to this:
Just to make sure I understand: do you mean that according to the new schemas of NOMAD:
Because then the answer is that you cannot today represent this kind of freedom as a single OPTIMADE property definition. A single property has to have a fixed dimensionality. However, if you are OK with dynamically assigning alternative names to the different options, you can translate your data representation to that of OPTIMADE. Lets for the moment disregard that if your example with forces vs time is actually the atomic forces in a trajectory, we will already have a specific standard field for that; lets instead just discuss these as if they were completely custom NOMAD properties. What you would do is provide a set of force property definitions:
In the OPTIMADE property definitions, a dimension name is a string. There is nothing preventing you from generating OPTIMADE dimension names from your numbers, e.g., |
Thank you for clarifying @rartino ! Indeed, you understood the example correctly.
While NOMAD natively provides aggregation queries, those are more so for overall statistics.
This dynamic dimensionality has been requested in several cases. Atm, we have a preliminary implementation (in our new, revised schema). The final form isn't set yet, however. If the dimensionality remains dynamic, we'll make sure to project out the varieties as you suggested here.
Yes, I just wanted to know whether this was appropriate. Thx for confirming. |
- **dictionary**: an associative array of **keys** and **values**, where **keys** are pre-determined strings, i.e., for the same entry property, the **keys** remain the same among different entries whereas the **values** change. | ||
Multidimensional collections of items are represented as nested lists. | ||
The specification uses **array** as a more general term for structures of nested lists representing single or multidimensional data, and the term **array axes** for the levels of nesting. | ||
Note that arrays are represented using lists and not as a separate data type. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In JSON? If we move to e.g. HDF5 then there the arrays will be a real distinct data type.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This part discusses the data model internal to OPTIMADE itself, where we (for now) only formally need list in lists (but we say here ~"lets use the term 'arrays' for lists in lists"). How this data model is mapped onto JSON types is first described in section 4.2, where indeed lists -> JSON lists. A future section discussing HDF5, or for that matter XML, parquet, etc., would define how these types are mapped into native data types. It would be valid for HDF5 to say that lists in lists (in lists...) should be mapped to the HDF5 array format.
If you ask "why don't we just adopt arrays as a fundamental data type and say that in JSON arrays map to lists in lists", I think that would be a valid change (but not in this PR).
@@ -218,7 +218,10 @@ representation in all contexts. They are as follows: | |||
- Basic types: **string**, **integer**, **float**, **boolean**, **timestamp**. | |||
- **list**: an ordered collection of items, where all items are of the same type, unless they are unknown. | |||
A list can be empty, i.e., contain no items. | |||
- **dictionary**: an associative array of **keys** and **values**, where **keys** are pre-determined strings, i.e., for the same entry property, the **keys** remain the same among different entries whereas the **values** change. | |||
Multidimensional collections of items are represented as nested lists. | |||
The specification uses **array** as a more general term for structures of nested lists representing single or multidimensional data, and the term **array axes** for the levels of nesting. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it make sense to require that a multidimensional array, as opposed to just a list of lists, MUST always have the same size for all sublists at the same dimension? I.e. a matrix (2D array) can be square or rectangular, but can not be triangle or arbitrarily ragged?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right; we deliberately wanted to allow raggedness in the design when referencing items using the property_ranges protocol.
I guess I should take your comment to mean: "should we reserve the term Array for non-ragged multidimensional data?" That isn't a bad idea, but, what is then a good term for lists-in-lists-in-lists that can be ragged? Because this PR got very tricky to formulate without a specific term for that (which now is "Array"...)
Co-authored-by: ndaelman-hu <107392603+ndaelman-hu@users.noreply.github.com>
This PR describes how a client can request a specific part of a property that is returned as partial data.
I have tried to keep things as much as possible the same as in the original ranged properties proposal #452.