SMILES data type #436

merkys · 2022-12-07T07:55:16Z

#368 proposed a SMILES property for structures. PR #392 proposed introducing it as a string-valued property; however, some argued that the internal semantics of SMILES strings have to be respected, which does not fit nicely with a "plain string" data type. Thus the current PR introduces both a SMILES data type and an associated property.

Fixes #368.

JPBergsma · 2022-12-07T12:58:43Z

optimade.rst

+Equality comparisons ('=' and '!=') MUST be supported for SMILES values.
+When handling equality comparisons of SMILES values, an implementation SHOULD NOT regard them as simple strings.
+Instead, an implementation SHOULD either compare the described chemical structures or canonicalize SMILES representations and then perform direct string matching.
+In addition to equality comparison operators, :val:`CONTAINS` MAY be supported optionally as an operator to check whether one structure is a substructure of another.


To me, this line is not entirely clear. Does this mean that we support querying for chemical groups, as would be defined with the SMARTS query language? In that case we should probably mention SMARTS.
Or do we only support searching for whole molecules in the SMILES string, which could be separated by a "."?

Indeed, this deserves some clarification. I would not introduce SMARTS yet, but it is worth explaining what smiles CONTAINS "c1ccccc1" means.

When I was putting this PR together, I was thinking about substructure search. That is, "c1ccccc1" would also be found in fluorobenzene. But we could limit ourselves to complete matches of whole molecular entities (i.e., the parts of a SMILES string separated by "."). Which option would have the better cost/benefit ratio?
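To make the two readings concrete, here is a minimal Python sketch of the narrower one: matching whole molecular entities, i.e. the "."-separated parts of a SMILES string. The function names and the `canonicalize` callback are hypothetical and not part of this PR; a real implementation would presumably pass in a canonicalizer from a cheminformatics toolkit. True substructure search (finding "c1ccccc1" inside fluorobenzene) needs graph matching, which this sketch does not attempt. It also assumes that no ring-closure bond spans a ".".

```python
from collections import Counter
from typing import Callable

# Hypothetical type: maps one SMILES entity to its canonical form.
Canonicalizer = Callable[[str], str]


def split_entities(smiles: str) -> list[str]:
    """Split a SMILES string into its dot-separated parts.

    Assumption: no ring-closure digits link parts across a '.',
    so each part is one molecular entity.
    """
    return [part for part in smiles.split(".") if part]


def smiles_equal(a: str, b: str, canonicalize: Canonicalizer) -> bool:
    """Equality as canonicalize-then-compare, ignoring entity order."""
    canon_a = Counter(canonicalize(p) for p in split_entities(a))
    canon_b = Counter(canonicalize(p) for p in split_entities(b))
    return canon_a == canon_b


def contains_whole_entities(
    haystack: str, needle: str, canonicalize: Canonicalizer
) -> bool:
    """True if every entity of `needle` occurs as a whole entity of `haystack`."""
    have = Counter(canonicalize(p) for p in split_entities(haystack))
    want = Counter(canonicalize(p) for p in split_entities(needle))
    return all(have[entity] >= count for entity, count in want.items())


# Toy stand-in for a real canonicalizer: a lookup table that only
# knows the one example used below.
_TOY_CANONICAL = {"OCC": "CCO"}


def toy_canonicalize(smiles: str) -> str:
    return _TOY_CANONICAL.get(smiles, smiles)
```

Under this reading, "c1ccccc1" is contained in "c1ccccc1.OCC" but not in fluorobenzene ("Fc1ccccc1"); the substructure reading would match both.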

optimade.rst

JPBergsma

Thanks for making this nice PR.
Apart from the two small comments, it looks good to me.

vaitkus · 2022-12-22T17:58:37Z

The introduction of the SMILES data type will most likely also require some changes similar to those introduced in PR #444 for the timestamp data type.

merkys · 2022-12-30T07:36:14Z

> The introduction of the SMILES data type will most likely also require some changes similar to those introduced in PR #444 for the timestamp data type.

Fair point. However, I am not sure JSON Schema supports custom formats; I need to take a closer look at it.

ml-evs · 2022-12-30T12:27:08Z

> > The introduction of the SMILES data type will most likely also require some changes similar to those introduced in PR #444 for the timestamp data type.

> Fair point. However, I am not sure JSON Schema supports custom formats; I need to take a closer look at it.

It does, see: https://datatracker.ietf.org/doc/html/draft-bhutton-json-schema-validation-01#section-7.2

Lots of impenetrable waffle in the spec about it though -- not clear to me how we would announce what "format": "smiles" actually means within JSON Schema yet. We could at least include a regex for structural validation.
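As an illustration of what such a regex could and could not do, the following Python sketch checks only that a string consists of characters used in SMILES notation. The pattern and the function name `looks_like_smiles` are assumptions made for illustration, not something proposed in this PR. The check is deliberately coarse: it does not verify ring closures, balanced parentheses or brackets, or valences.

```python
import re

# Coarse character-level pattern (an assumption, for illustration only):
# letters, digits, brackets, parentheses, bond symbols, charges,
# chirality marks, '%' for two-digit ring closures, '.' and '*'.
_SMILES_CHARS = re.compile(r"[A-Za-z0-9@+\-\[\]()\\/%=#$:.*]+")


def looks_like_smiles(value: str) -> bool:
    """Return True if `value` is non-empty and uses only SMILES characters."""
    return _SMILES_CHARS.fullmatch(value) is not None
```

A string such as "C1CC" (an unclosed ring) still passes, which shows the limit of purely structural validation by regex.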

rartino · 2022-12-30T21:26:53Z

> > > The introduction of the SMILES data type will most likely also require some changes similar to those introduced in PR #444 for the timestamp data type.

> > Fair point. However, I am not sure JSON Schema supports custom formats; I need to take a closer look at it.

> It does, see: https://datatracker.ietf.org/doc/html/draft-bhutton-json-schema-validation-01#section-7.2

> Lots of impenetrable waffle in the spec about it though -- not clear to me how we would announce what "format": "smiles" actually means within JSON Schema yet. We could at least include a regex for structural validation.

Reading your link:

> Implementations MAY support custom format attributes. Save for agreement between parties, schema authors SHALL NOT expect a peer implementation to support such custom format attributes. An implementation MUST NOT fail to collect unknown formats as annotations. When the Format-Assertion vocabulary is specified, implementations MUST fail upon encountering unknown formats. Vocabularies do not support specifically declaring different value sets for keywords. Due to this limitation, and the historically uneven implementation of this keyword, it is RECOMMENDED to define additional keywords in a custom vocabulary rather than additional format attributes if interoperability is desired.

So, if I read this correctly, they are saying: "But, don't use this, use your own field instead", e.g., x-optimade-type?

The third (IMO very inelegant) option is to specify the format as "regex" + the best JSON Schema compatible regex we can come up with for SMILES strings. Implementations would then have to recognize specifically that regex and reverse-map it into the knowledge that the field is of OPTIMADE SMILES-type. However, in that case I think x-optimade-type is about 1e10 times better as a solution.

A happy side note: it appears the JSON Schema regex situation is clearer now than when we last looked into it. They now define a "least common denominator subset" regex standard.

https://datatracker.ietf.org/doc/html/draft-bhutton-json-schema-01#name-regular-expressions

To me, this seems to finally provide a solution to the regular expression dilemma discussed in #42 and #160. We should now simply follow their lead and standardize this both for our filter language and for format: regex in property definitions (which presently are not allowed at all). In contrast to JSON Schema, our property definitions should say that the regex MUST only use the subset.
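A minimal sketch of how the x-optimade-type idea could look, written as a Python dict mirroring a JSON Schema fragment. The key name follows the suggestion above; the value "smiles", the surrounding fields, the fallback rule, and the helper `optimade_type` are assumptions for illustration and not text from the specification.

```python
import json

# Hypothetical property definition: plain JSON Schema tooling sees a
# nullable string, while the custom keyword carries the OPTIMADE type.
smiles_property = {
    "title": "SMILES",
    "type": ["string", "null"],
    "x-optimade-type": "smiles",
}


def optimade_type(definition: dict) -> str:
    """Return the OPTIMADE-level type of a property definition.

    Assumed rule: prefer "x-optimade-type"; otherwise fall back to the
    first non-null JSON Schema type.
    """
    custom = definition.get("x-optimade-type")
    if custom is not None:
        return custom
    json_type = definition.get("type", "string")
    if isinstance(json_type, list):
        non_null = [t for t in json_type if t != "null"]
        return non_null[0] if non_null else "null"
    return json_type


# The definition round-trips through JSON unchanged, since a custom
# keyword is just another object member.
assert json.loads(json.dumps(smiles_property)) == smiles_property
```

With a custom keyword, a validator that does not know about SMILES can still treat the value as a string, and a client that does can dispatch on `optimade_type(smiles_property)`.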

merkys · 2023-01-17T08:41:10Z

> So, if I read this correctly, they are saying: "But, don't use this, use your own field instead", e.g., x-optimade-type?

> The third (IMO very inelegant) option is to specify the format as "regex" + the best JSON Schema compatible regex we can come up with for SMILES strings. Implementations would then have to recognize specifically that regex and reverse-map it into the knowledge that the field is of OPTIMADE SMILES-type. However, in that case I think x-optimade-type is about 1e10 times better as a solution.

x-optimade-type sounds reasonable to me. Should I open a separate PR to add x-optimade-type to the specification or should I lump it together with this one?

rartino · 2023-01-17T08:52:25Z

> x-optimade-type sounds reasonable to me. Should I open a separate PR to add x-optimade-type to the specification or should I lump it together with this one?

Let's take it in the open PR on property definitions; I have another thing that should be adjusted in the specification there as well, moving the identifier to "$ID". If you want to hurry things along, feel free to do a PR against my branch for that PR, or don't and I'll add x-optimade-type as I add '$ID'.

merkys · 2023-01-17T09:08:17Z

> Let's take it in the open PR on property definitions; I have another thing that should be adjusted in the specification there as well, moving the identifier to "$ID". If you want to hurry things along, feel free to do a PR against my branch for that PR, or don't and I'll add x-optimade-type as I add '$ID'.

I am OK to wait. Thanks.

merkys added 3 commits November 25, 2022 16:32

Describe SMILES data type.

9ec5181

Add SMILES property as is done in Materials-Consortia#392, but using SMILES data type.

775fda0

Describe comparisons involving SMILES.

06409e0

merkys requested review from rartino, vaitkus, JPBergsma and ml-evs December 7, 2022 07:56

merkys added the type/proposal and PR/requires-discussion labels Dec 7, 2022

merkys mentioned this pull request Dec 7, 2022

Add SMILES property #368

Open

Merge branch 'develop' into SMILES-data-type

9d94f74

JPBergsma reviewed Dec 7, 2022

vaitkus reviewed Dec 7, 2022

JPBergsma reviewed Dec 7, 2022

merkys and others added 3 commits December 12, 2022 18:09

Decouple SMILES from other data types in the enumeration of operator tokens.

59af3fa

Update optimade.rst

08997e3

Co-authored-by: Antanas Vaitkus <antanas.vaitkus90@gmail.com>

Update optimade.rst

bcc69c1

Co-authored-by: Antanas Vaitkus <antanas.vaitkus90@gmail.com>

JPBergsma reviewed Dec 12, 2022

rartino mentioned this pull request Dec 22, 2022

Property definitions type for timestamp (and other non-json OPTIMADE types) #443

Closed

ml-evs mentioned this pull request Dec 22, 2022

OPTIMADE v1.2 release planning #429

Closed

Merge branch 'develop' into SMILES-data-type

e938254

rartino mentioned this pull request Dec 30, 2022

Like operator #160

Closed

merkys added 3 commits January 17, 2023 10:22

Merge branch 'develop' into SMILES-data-type

c96cab1

Removing sentences mentioning SMILES canonicalization as they do not add much of a benefit for the specification.

57b6ca8

Merge branch 'SMILES-data-type' of github.com:merkys/OPTIMADE into SMILES-data-type

b7c002b

rartino mentioned this pull request Feb 20, 2023

Changes to property definitions #457

Merged

merkys and others added 4 commits February 21, 2023 15:23

Define x-optimade-type for smiles as Materials-Consortia#457 is merged now.

8fb816c

Merge branch 'develop' of github.com:Materials-Consortia/OPTIMADE into SMILES-data-type

7444d6e

Merge branch 'develop' into SMILES-data-type

9621645

Update optimade.rst

afb319b

Co-authored-by: Johan Bergsma <29785380+JPBergsma@users.noreply.github.com>

rartino mentioned this pull request Jan 10, 2024

InChIKey property #466

Closed

merkys mentioned this pull request Jun 14, 2024

Allow provider-specific data types #529

Open
