Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adding the property_ranges query parameter and associated metadata #481

Open
wants to merge 39 commits into
base: develop
Choose a base branch
from

Conversation

JPBergsma
Copy link
Contributor

@JPBergsma JPBergsma commented Jun 29, 2023

This PR describes how a client can request a specific part of a property that is returned as partial data.
I have tried to keep things as much as possible the same as in the original ranged properties proposal #452.

@JPBergsma JPBergsma marked this pull request as ready for review June 29, 2023 16:32
@blokhin blokhin changed the title Property_Ranges Querry Parameter Property_Ranges Query Parameter Jun 29, 2023
optimade.rst Outdated Show resolved Hide resolved
@rartino
Copy link
Contributor

rartino commented Jun 30, 2023

Thanks for writing this up!

Why did you move the whole section "Transmission of large property values"? It makes it tricky to see what has changed. Can you move it back?

Also, my interpretation of the workshop discussions was to make the property_ranges query parameter completely separate from, and orthogonal to, the mechanism for "Transmission of large property values" (this was a strong point of @sauliusg). Hence, it isn't clear to me why anything has to change in the "Transmission of large property values". Why can't we keep all documentation about this feature in the definition of the property_ranges query parameter?

@rartino rartino mentioned this pull request Jun 30, 2023
5 tasks
@JPBergsma
Copy link
Contributor Author

It seemed more logical to place the metadata section before the "Transmission of large property values" section, as the "Transmission of large property values" section depends on the metadata section but not the other way round. So it is more useful to read the metadata section before the "Transmission of large property values" section.

@JPBergsma
Copy link
Contributor Author

You are right, I could still separate the property ranges query parameter and the extra metadata fields from the transmission of large data section, as each could still be used separate from the other. I'll try to do this next week.

optimade.rst Outdated Show resolved Hide resolved
@ml-evs ml-evs added the blocking-release This is a PR or issue that presently blocks the release of next version of the spec. label Dec 19, 2023
Copy link
Member

@ml-evs ml-evs left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mostly formatting changes and suggesting rewordings -- I think I can go through and force most of them in if there are no objections

optimade.rst Outdated Show resolved Hide resolved
optimade.rst Outdated Show resolved Hide resolved
optimade.rst Outdated Show resolved Hide resolved
optimade.rst Outdated Show resolved Hide resolved
optimade.rst Outdated Show resolved Hide resolved
optimade.rst Outdated Show resolved Hide resolved
optimade.rst Outdated Show resolved Hide resolved
optimade.rst Outdated Show resolved Hide resolved
optimade.rst Outdated Show resolved Hide resolved
optimade.rst Outdated Show resolved Hide resolved
@ml-evs ml-evs changed the title Property_Ranges Query Parameter Adding the property_ranges query parameter and associated metadata Dec 19, 2023
@rartino
Copy link
Contributor

rartino commented Jan 9, 2024

This PR appears a bit stalled and is important. @JPBergsma do you plan to continue working on this, or are you ok with me (or someone else) starting to merge changes into it?

Copy link
Contributor

@vaitkus vaitkus left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good overall, just a few minor typos.

optimade.rst Outdated Show resolved Hide resolved
optimade.rst Outdated Show resolved Hide resolved
optimade.rst Outdated Show resolved Hide resolved
optimade.rst Outdated Show resolved Hide resolved
optimade.rst Outdated Show resolved Hide resolved
optimade.rst Outdated Show resolved Hide resolved
optimade.rst Outdated Show resolved Hide resolved
optimade.rst Outdated Show resolved Hide resolved
optimade.rst Outdated Show resolved Hide resolved
@rartino
Copy link
Contributor

rartino commented Jun 14, 2024

On the topic of separators: The business with url-safe characters is complicated. I note that according to RFC3986 both bracket parentheses and colons are in the same category, "gen-delims", which I think is generally worse in terms of having to be escaped in practice compared to the characters in "sub-delims" (and in particular the unreserved characters listed right below the linked segment, which are always safe). However, very particular for bracket parentheses, apparently chrome and Firefox had decided to violate the standard and generally do not encode those characters out of 'tradition'.

Copy link
Member

@merkys merkys left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, I left only some minor comments.

optimade.rst Outdated Show resolved Hide resolved
optimade.rst Outdated Show resolved Hide resolved
@@ -602,6 +605,162 @@ Example of the corresponding metadata property definition contained in the field
}
// ...

Slices of array properties
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would suggest using "list" instead of "array" everywhere, as in the specification "list" is defined as data type, whereas "array" is used much less often and mostly as a synonym of "list".

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are right, good catch. Array should almost everywhere be replaced by list. Array should only be kept when we explicitly refer to JSON:API output arrays.

Copy link
Contributor

@rartino rartino Jun 15, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After thinking about this, I don't find a good way to handle this without defining the term 'array' to be a general word for a list or a structure of nested lists. (IMO "Multidimensional lists" is far worse as a term.) So, I've pushed a commit that does this.

(Note: 'resolve' this conversation if you are happy with the fix, otherwise comment below).

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll leave it up to somebody more experienced in the protocol to resolve this.

optimade.rst Outdated Show resolved Hide resolved
optimade.rst Outdated Show resolved Hide resolved
rartino and others added 4 commits June 15, 2024 15:30
Co-authored-by: Andrius Merkys <andrius.merkys@gmail.com>
Co-authored-by: Antanas Vaitkus <antanas.vaitkus90@gmail.com>
…ates with different format than the structure one
optimade.rst Outdated Show resolved Hide resolved
optimade.rst Outdated Show resolved Hide resolved
optimade.rst Outdated Show resolved Hide resolved
optimade.rst Outdated Show resolved Hide resolved
Co-authored-by: Antanas Vaitkus <antanas.vaitkus90@gmail.com>
The start value specifies the first index in that dimension for which values should be returned (which is 0-based and inclusive).
The default is :val:`0`.
The stop value specifies the last index for which values should be returned (inclusive).
The default is the last index of the array along the specified dimension.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
The default is the last index of the array along the specified dimension.
The default is :val:`null`, which represents the last index of the array along the specified dimension.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's actually already explaiend that :val:null refers to the default value, and that it should be returned unless a specified value was used.
From how I understood it, the same applies to the start and step keys too.

Co-authored-by: Antanas Vaitkus <antanas.vaitkus90@gmail.com>
@giovannipizzi
Copy link
Contributor

In view of Monday's online meeting where we need to discuss this, it would be great if we can get a last check and, if all is OK, approvals.

The only remaining things are:

  • a comment between @rartino and @merkys but it seems to me that the current state is probably OK?
  • the discussion on the syntax dim_xxx:start:stop:step vs dim_xxx[start:stop:step] or similar. I still think the second is easier to read, but I'm OK to keep the currently suggested version only with : if nobody else feels strongly about it

Copy link

@ndaelman-hu ndaelman-hu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey! In anticipation of the upcoming meeting, here is the promised review.

Mostly just questions for clarification, or me pointing out where the phrasing may be somewhat dense for those less familiar with OPTIMADE.

From the NOMAD perspective, my main question would be "whether array dimensions MAY also just be integers (served as strings)?"
We are also working on a system for adding (named) variables, so some dimensions may be dynamic. I don't think this conflicts with the specificatiosn per se, but I'd have to consult my coworkers about the implementation.

optimade.rst Show resolved Hide resolved
optimade.rst Outdated Show resolved Hide resolved
optimade.rst Outdated Show resolved Hide resolved
Comment on lines +617 to +618
Slices are used for a client to ask the server to only provide a subset of items of an array, which can result in a small or large set of items.
In contrast, the protocol for large property values is used by the server implementation to transmit a set of items that it deems too large to provide inside the normal OPTIMADE response.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I gather that "small" and "large" have a techncial meaning here (wrt to the transmission).
Do you have any kind of annotation trick to makes this clearer?

Copy link
Contributor

@rartino rartino Oct 18, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
Slices are used for a client to ask the server to only provide a subset of items of an array, which can result in a small or large set of items.
In contrast, the protocol for large property values is used by the server implementation to transmit a set of items that it deems too large to provide inside the normal OPTIMADE response.
The protocol for large property values is used by the server implementation to transmit a set of items that it deems too large to provide inside the normal OPTIMADE response.
Slices, on the other hand, are used for a client to request a subset of any size of the items of an array, which can possibly (but not necessarily) result in such a large amount of values that the protocol for large property values is required to transmit them.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like this better, yes. Given that only large has a meaning, it's better to remove small, as you did.
2 minor suggestions:

  1. The protocol for large property values: can this refer to the relevant section?
  2. If we decide on using some kind of highlighting (as brought up above), I'd mark server and client. I think it's a good short-hand for navigating the protocol.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The protocol for large property values: can this refer to the relevant section?

I realize that the way these comments chop up the text can make it a bit difficult to see the context, but the link you ask for is already right on the row before this line (numbered 616 right now).

In contrast, the protocol for large property values is used by the server implementation to transmit a set of items that it deems too large to provide inside the normal OPTIMADE response.

The main mechanism is provided through the query parameter :query-param:`property_slices` defined in section `Single Entry URL Query Parameters`_.
Information relating to the ability of the server to handle this query parameter and the relevant ranges of indexes is provided using metadata property field :field:`array_axes` (see `Metadata properties`_).

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the relevant ranges of indexes

Above you used the term "array axis". I'd use the same term here, as with "index" idk whether you mean the dimension or a slice along a dimension.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure what you mean here, can you use the GitHub suggest edit feature to show a suggestion? (Also, the discussion of terminology with regards to indices and axes may have been clarified by a discussion we had last web meet?)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(Also, the discussion of terminology with regards to indices and axes may have been clarified by a discussion we had last web meet?)

I'd need a refresher there, sorry.

Copy link

@ndaelman-hu ndaelman-hu Oct 18, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
Information relating to the ability of the server to handle this query parameter and the relevant ranges of indexes is provided using metadata property field :field:`array_axes` (see `Metadata properties`_).
Information relating to the ability of the server to handle this query parameter. This SHOULD include the property field :field:`array_axes` (see `Metadata properties`_), listing the numerical indices of axes that may be sliced.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm confused, this doesn't read cleanly to me in the context it appears. The first sentence in the suggestion isn't even a complete sentence?

The context here is that we are in the general text at the top of the section trying to outline what this feature is and how it is used. The sentence right now is just trying to communicate that the metadata property array_axis is one of the essential mechanisms of the protocol.

The step value specifies the step size in that dimension.
The default is :val:`1`.

An empty value of the :query-param:`property_slices` query parameter MUST be interpreted as equivalent to the query parameter not being included in the request.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this then equivalent to property_ranges=[<first_dim>:::, ...]?
I.e. should this return the full property or leave it out?

This might be defined somewhere else. If so, pls refer to that section.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The query parameter not being included means the client isn't trying to use this protocol; so I think the natural interpretation is that "everything works as usual without the slices protocol". I guess that could be clarified somewhere, but I'm not sure exactly where.


- :query-url:`http://optimade.example.com/v1/trajectories/id_12345?response_fields=frame_cartesian_site_positions&property_ranges=dim_frames::999:10,dim_sites:30:70:`

This query URL requests items from the array :field:`frame_cartesian_site_positions` only for the 31st to 71st sites (i.e., with indexes 30 through 70 inclusive) for 1 out of every 10 frames of the first 1000 frames (i.e., taking steps of 10 over indexes 0 through 999 inclusive, which requests the frames with indexes 0, 10, 20, 30, ..., 990) of a trajectory with ID :val:`id_12345`.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor style changes to make the phrase clearer:

  1. Split the phrase up by both attributes.
  2. Align the order of the example with the explanation, i.e. concerning dim_frames and dim_sites.
  3. Just reiterate the 0-indexing convention after "the 31st to 71st sites". Apart from that, it's a great refresher.
  4. the frames with indexes 0, 10, 20, 30, ..., 990 explains it enough. The paranteses else become too verbose.

optimade.rst Outdated Show resolved Hide resolved
@@ -1211,6 +1375,31 @@ While the following URL query parameters are OPTIONAL for clients, API implement
The URL query parameter :query-param:`include` is OPTIONAL for both clients and API implementations.
The meaning of these URL query parameters are as defined above in section `Entry Listing URL Query Parameters`_.

One additional query parameter :query-param:`property_slices` MUST be handled by the API implementation either as defined below or by returning the error :http-error:`501 Not Implemented`:

- **property\_slices**: A number of slice specifications to request only parts of array properties for the functionality described in `Slices of array properties`_.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

May the client request multiple (different) slices from the same array axis in the same query?
E.g. property_slices=[dim_frames:3:37:5,dim_frames:49:105:2]

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is clarified in "REQUIRED keys". A link suffices then.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the present version (which may have changed since you commented) I think this information is quite clear in the text right below.

Comment on lines +2379 to +2380
For example, let us consider the property :property:`frame_cartesian_site_positions` of the trajectory entry, where the first dimension name is :val:`dim_frames`.
If there is another one-dimensional (i.e., with a single axis) array property :property:`_exmpl_energy` of the same trajectory entry that specifies in its property definition the same dimension name :val:`dim_frames` for its axis, then the values of :property:`_exmpl_energy` and of :property:`frame_cartesian_site_positions` at index *i* pertain to the same frame.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Her, you lost me

@giovannipizzi
Copy link
Contributor

As a follow-up on my comment, the decision in today's online meeting is to go for the syntax dim_xxx[start:stop:step]

@ndaelman-hu
Copy link

Coming back to last meeting's discussion about the slicing, I'd like to explain my question further via an example.

In NOMAD, we have a tree-like schema with data at the nodes. That data may have any tensor rank, i.e. scalar, vector, matrix, etc. and slicing it is very sensible. However, our schema is being extended to be more dynamic and easier tailor.

Consequentially, we foresee that rank may vary. Consider for example a property, like forces, sliceable along the constituent atoms or spatial axes (e.g. x, y, z). Any entry can extend that definition's rank by adding independent variables, e.g. forces(time). A client could now, in principle, also slice along the time dimension. The table below lists various examples of independent variables and the full rank:

dependent variable | rank dependent variable | independent variables | rank independent variables | full rank
--|--|--|--|--
forces | 2 | time | 1 | 3
forces | 2 | temperature | 1 | 3
forces | 2 | electric field | 2 | 4
forces | 2 | strain | 3 | 5

At a database-wide level, we can only ensure a minimum set of dimensionalities for forces, i.e. the force vector over each atom. Additional, entry-specific dimensions could at best only be returned after a query has filtered down the data. Does this limitation conflict with "query parameter :query-param:property_slices for metadata"?

Lastly, while I named the dimensions here, most of them only bear integers in practice (at least the dependent ones). Would integers (represented as strings) also be fine to identify dimensions, or is that semantically too vague?

- :field:`requested_slice`: Dictionary.
A field that describes the requested slice that was provided via the query parameter :query-param:`property_slices`.
The subfields MUST reflect the values provided via the :query-param:`property_slices`.
The implementation MUST preserve the values as given in the query parameter, including the distinction between specific values and default values even when they are equivalent.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
The implementation MUST preserve the values as given in the query parameter, including the distinction between specific values and default values even when they are equivalent.
The implementation MUST preserve the values as given in the query parameter, including the distinction between specific values and default values even when they are equivalent (see example below).

optimade.rst Outdated Show resolved Hide resolved
Comment on lines +1400 to +1402
- :query-url:`http://optimade.example.com/v1/trajectories/id_12345?response_fields=frame_cartesian_site_positions&property_ranges=dim_frames::999:10,dim_sites:30:70:`

This query URL requests items from the array :field:`frame_cartesian_site_positions` only for the 31st to 71st sites (i.e., with indexes 30 through 70 inclusive) for 1 out of every 10 frames of the first 1000 frames (i.e., taking steps of 10 over indexes 0 through 999 inclusive, which requests the frames with indexes 0, 10, 20, 30, ..., 990) of a trajectory with ID :val:`id_12345`.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this addresses @ndaelman-hu suggestions, except for removing the explanation in the last parenthesis, which from our discussions on the workshop I do think still is necessary to be completely clear.

Suggested change
- :query-url:`http://optimade.example.com/v1/trajectories/id_12345?response_fields=frame_cartesian_site_positions&property_ranges=dim_frames::999:10,dim_sites:30:70:`
This query URL requests items from the array :field:`frame_cartesian_site_positions` only for the 31st to 71st sites (i.e., with indexes 30 through 70 inclusive) for 1 out of every 10 frames of the first 1000 frames (i.e., taking steps of 10 over indexes 0 through 999 inclusive, which requests the frames with indexes 0, 10, 20, 30, ..., 990) of a trajectory with ID :val:`id_12345`.
- :query-url:`http://optimade.example.com/v1/trajectories/id_12345?response_fields=frame_cartesian_site_positions&property_ranges=dim_sites:30:70:,dim_frames::999:10`
This query URL requests items from the trajectory with ID :val:`id_12345`.
It requests items from the array :field:`frame_cartesian_site_positions` for this trajectory.
The items that are requested are for only the 31st to 71st sites (i.e., with indexes 30 through 70 inclusive) for 1 out of every 10 frames of the first 1000 frames (i.e., taking steps of 10 over indexes 0 through 999 inclusive, which requests the frames with indexes 0, 10, 20, 30, ..., 990).

Dimension names defined by database or definition providers MUST be prefixed by the corresponding database or namespace prefix, and SHOULD also be prefixed by ``dim_``, e.g., ``_exmpl_dim_particles``.
If, within one entry, two or more array axes in one or more properties share the same dimension :field:`name`, those represent the same dimension.
For example, let us consider the property :property:`frame_cartesian_site_positions` of the trajectory entry, where the first dimension name is :val:`dim_frames`.
If there is another one-dimensional (i.e., with a single axis) array property :property:`_exmpl_energy` of the same trajectory entry that specifies in its property definition the same dimension name :val:`dim_frames` for its axis, then the values of :property:`_exmpl_energy` and of :property:`frame_cartesian_site_positions` at index *i* pertain to the same frame.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ndaelman-hu is this more clear?:

Suggested change
If there is another one-dimensional (i.e., with a single axis) array property :property:`_exmpl_energy` of the same trajectory entry that specifies in its property definition the same dimension name :val:`dim_frames` for its axis, then the values of :property:`_exmpl_energy` and of :property:`frame_cartesian_site_positions` at index *i* pertain to the same frame.
Let the trajectory entry in this example have another, one-dimensional, array property :property:`_exmpl_energy`, which in its property definition specifies *the same name*, :val:`dim_frames`, as the name of the axis corresponding to its single dimension.
The joint dimension name means the values of :property:`_exmpl_energy` and of :property:`frame_cartesian_site_positions` at index *i* pertain to the same frame.
If slicing is used to request only parts of the data along the :val:`dim_frames` dimension, that is a request to slice both the properties according to the specified slice.

@rartino
Copy link
Contributor

rartino commented Oct 18, 2024

I have now updated the PR to reflect the syntax dim_xxx[start:stop:step].

I have also tried to somehow address all outstanding comments of @ndaelman-hu. It would be great if you can:

  • "thumb up" any edit suggestions that matches your intent
  • counter-suggest edits if my suggested edit do not match your intent.
  • press 'resolve' on any conversations where I replied without an edit where you think the matter is resolved.
  • reply in any conversation where you think the matter is not resolved.

Finally, as a reply to this:

At a database-wide level, we can only ensure a minimum set of dimensionalities for forces, i.e. the force vector over each atom. Additional, entry-specific dimensions could at best only be returned after a query has filtered down the data. Does this limitation conflict with "query parameter :query-param:property_slices for metadata"?

Just to make sure I understand: do you mean that according to the new schemas of NOMAD:

  • There exist a concept "force".

  • The NOMAD schemas say "a force is an array of at least 3xN spatial dimensions, e.g., force = [3.2 , 4.2, 6.4], [6.2, 3.6, 6.2] for a two-atom system (in units of, e.g., E_h/a0, which you also specify somehow).

  • However, at any time when I (as a client) get a force from NOMAD, it may instead be an array of dimension, e.g., 3 x N x 2064, or even 3 x N x 2064 x 160, or 3 x N x 160 x 2064, where the force is resolved along a new 'dimension' such as time and (just to add something) trajectory_set_index (to index 160 trajectory sets).

  • Presumably, I can get information about what these dimensions are and which order they come, however, NOMAD do not already assign these different options different names, they are all 'force'?

Because then the answer is that you cannot today represent this kind of freedom as a single OPTIMADE property definition. A single property has to have a fixed dimensionality. However, if you are OK with dynamically assigning alternative names to the different options, you can translate your data representation to that of OPTIMADE.

Lets for the moment disregard that if your example with forces vs time is actually the atomic forces in a trajectory, we will already have a specific standard field for that; lets instead just discuss these as if they were completely custom NOMAD properties.

What you would do is provide a set of force property definitions:

  • If your force is a 3 x N array, then you define _nomad_force defined via an OPTIMADE property definition to be a two-dimensional array with the dimension names dim_spatial and dim_sites.
  • If your force is a 3 x N x 2064 array with forces for 2064 times, you'd call it force_per_time and define it via an OPTIMADE property definition to be a three-dimensional array with the dimension names dim_spatial, dim_sites, and _nomad_dim_time.
  • and so on...

Lastly, while I named the dimensions here, most of them only bear integers in practice (at least the dependent ones). Would integers (represented as strings) also be fine to identify dimensions, or is that semantically too vague?

In the OPTIMADE property definitions, a dimension name is a string. There is nothing preventing you from generating OPTIMADE dimension names from your numbers, e.g., _nomad_dim_force_3 as the dimension name of the third axis of your force property.

@ndaelman-hu
Copy link

I have now updated the PR to reflect the syntax dim_xxx[start:stop:step].

I have also tried to somehow address all outstanding comments of @ndaelman-hu. It would be great if you can:

* "thumb up" any edit suggestions that matches your intent

* counter-suggest edits if my suggested edit do not match your intent.

* press 'resolve' on any conversations where I replied without an edit where you think the matter is resolved.

* reply in any conversation where you think the matter is not resolved.

Finally, as a reply to this:

At a database-wide level, we can only ensure a minimum set of dimensionalities for forces, i.e. the force vector over each atom. Additional, entry-specific dimensions could at best only be returned after a query has filtered down the data. Does this limitation conflict with "query parameter :query-param:property_slices for metadata"?

Just to make sure I understand: do you mean that according to the new schemas of NOMAD:

* There exist a concept "force".

* The NOMAD schemas say "a force is an array of at least 3xN spatial dimensions, e.g., force = [3.2 , 4.2, 6.4], [6.2, 3.6, 6.2] for a two-atom system (in units of, e.g., E_h/a0, which you also specify somehow).

* However, at any time when I (as a client) get a force from NOMAD, it may instead be an array of dimension, e.g., 3 x N x 2064, or even 3 x N x 2064 x 160, or 3 x N x 160 x 2064, where the force is resolved along a new 'dimension' such as time and (just to add something) trajectory_set_index (to index 160 trajectory sets).

Thank you for clarifying @rartino ! Indeed, you understood the example correctly.

* Presumably, I can get information about what these dimensions are and which order they come, however, NOMAD do not already assign these different options different names, they are all 'force'?

While NOMAD natively provides aggregation queries, those are more so for overall statistics.
In some cases, the dimensions are stored as individual properties that may be returned. This isn't applicable across the board, however.

Because then the answer is that you cannot today represent this kind of freedom as a single OPTIMADE property definition. A single property has to have a fixed dimensionality. However, if you are OK with dynamically assigning alternative names to the different options, you can translate your data representation to that of OPTIMADE.

Lets for the moment disregard that if your example with forces vs time is actually the atomic forces in a trajectory, we will already have a specific standard field for that; lets instead just discuss these as if they were completely custom NOMAD properties.

What you would do is provide a set of force property definitions:

* If your `force` is a 3 x N array, then you define `_nomad_force` defined via an OPTIMADE property definition to be a two-dimensional array with the dimension names `dim_spatial` and `dim_sites`.

* If your `force` is a 3 x N x 2064 array with forces for 2064 times, you'd call it `force_per_time` and define it via an OPTIMADE property definition to be a three-dimensional array with the dimension names `dim_spatial`, `dim_sites`, and `_nomad_dim_time`.

* and so on...

This dynamic dimensionality has been requested in several cases. Atm, we have a preliminary implementation (in our new, revised schema). The final form isn't set yet, however. If the dimensionality remains dynamic, we'll make sure to project out the varieties as you suggested here.

Lastly, while I named the dimensions here, most of them only bear integers in practice (at least the dependent ones). Would integers (represented as strings) also be fine to identify dimensions, or is that semantically too vague?

In the OPTIMADE property definitions, a dimension name is a string. There is nothing preventing you from generating OPTIMADE dimension names from your numbers, e.g., _nomad_dim_force_3 as the dimension name of the third axis of your force property.

Yes, I just wanted to know whether this was appropriate. Thx for confirming.

- **dictionary**: an associative array of **keys** and **values**, where **keys** are pre-determined strings, i.e., for the same entry property, the **keys** remain the same among different entries whereas the **values** change.
Multidimensional collections of items are represented as nested lists.
The specification uses **array** as a more general term for structures of nested lists representing single or multidimensional data, and the term **array axes** for the levels of nesting.
Note that arrays are represented using lists and not as a separate data type.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In JSON? If we move to e.g. HDF5 then there the arrays will be a real distinct data type.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This part discusses the data model internal to OPTIMADE itself, where we (for now) only formally need list in lists (but we say here ~"lets use the term 'arrays' for lists in lists"). How this data model is mapped onto JSON types is first described in section 4.2, where indeed lists -> JSON lists. A future section discussing HDF5, or for that matter XML, parquet, etc., would define how these types are mapped into native data types. It would be valid for HDF5 to say that lists in lists (in lists...) should be mapped to the HDF5 array format.

If you ask "why don't we just adopt arrays as a fundamental data type and say that in JSON arrays map to lists in lists", I think that would be a valid change (but not in this PR).

@@ -218,7 +218,10 @@ representation in all contexts. They are as follows:
- Basic types: **string**, **integer**, **float**, **boolean**, **timestamp**.
- **list**: an ordered collection of items, where all items are of the same type, unless they are unknown.
A list can be empty, i.e., contain no items.
- **dictionary**: an associative array of **keys** and **values**, where **keys** are pre-determined strings, i.e., for the same entry property, the **keys** remain the same among different entries whereas the **values** change.
Multidimensional collections of items are represented as nested lists.
The specification uses **array** as a more general term for structures of nested lists representing single or multidimensional data, and the term **array axes** for the levels of nesting.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it make sense to require that a multidimensional array, as opposed to just a list of lists, MUST always have the same size for all sublists at the same dimension? I.e. a matrix (2D array) can be square or rectangular, but can not be triangle or arbitrarily ragged?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right; we deliberately wanted to allow raggedness in the design when referencing items using the property_ranges protocol.

I guess I should take your comment to mean: "should we reserve the term Array for non-ragged multidimensional data?" That isn't a bad idea, but, what is then a good term for lists-in-lists-in-lists that can be ragged? Because this PR got very tricky to formulate without a specific term for that (which now is "Array"...)

Co-authored-by: ndaelman-hu <107392603+ndaelman-hu@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
PR/requires-discussion status/has-concrete-suggestion This issue has one or more concrete suggestions spelled out that can be brought up for consensus.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants