
replace schema with autogen from qcelemental #77

Open · wants to merge 2 commits into master

Conversation

@loriab (Collaborator) commented Sep 3, 2020

Description

In response to the discussion at #68 on maintainability, synchronization, accuracy, and community involvement, this is the initial commit refactoring schema storage here at QCSchema as autogenerated output from QCElemental. Overview here.

After I fill in questions and topics for discussion below, I'll be pinging the main participants of #68. I don't expect this to be a quick PR, but in the end it should close #68. (qcsk := QCSchema; qcel := QCElemental)

Questions

  • This is not a translation of qcsk into a qcel rendering but a reconciliation of the two representations. Changes are minor but present, so look it over and make sure nothing is alarming. Because of the change from Python to JSON in the dev/ dir, diffs are useless; the things to compare are all the deleted qcschema/dev/*.py files against the new qcschema/dev/QCSchema.schema file.
  • The description fields for all the schema have been merged between qcsk and qcel and moderately edited for coherence. Suggested edits, elaborations, and guidance are welcome.
  • integer vs. number+multipleOf issue. integer isn't part of JSON Schema Core, but it is part of JSON Schema Validation (https://pydantic-docs.helpmanual.io/usage/schema/#json-schema-types), from which we've also been using pattern, anyOf, and enum. OK to keep using integer then (which is pydantic's normal export for ints) rather than persuading pydantic to export number + multipleOf? (See the sketch after this question list.)
  • The separate (and sometimes redundant) files in qcschema/data/vdev/ are what I consider the core qcsk schema. Other schema from qcel haven't been exported to here. Even within the core schema are sections that might be considered associated with the qcel implementation. At a minimum, the fields where these live need to remain in qcsk. Should the subschema be adopted or removed?
    • Molecule.Identifiers
    • AtomicInput.Protocols
  • While it's a given that arrays shall be stored flat and units shall be atomic, I've added fields (transparent to validation) to hold shaped dimension and units info separate from the description. This follows from what was started in Wfn. Is this a satisfactory way to handle these matters?
  • List provenance. Once upon a time, I thought the provenance field should be either a single entry (of creator/version/routine/etc.) or a list of entries so that one could trace the layers of software that touched an instance. I think I got the list provenance added to this repo. Since then, I was talked out of list provenance, ripped it out of psi4.Molecule, and haven't missed it. On the other hand, the Horton folks like it. Presently in this repo, output.provenance is single or list and molecule.provenance and all others are single only. How do we want it to be?
  • Required fields.
    • Two major use cases for the required fields list:
      • Initialization: For a human writing JSON, for the constructor of a qcsk implementation, or for efficient web transport of non-default data, there is a minimum set of fields that can be used to construct a schema instance.
      • Storage: For software wanting to use qcsk records or a programmer writing an implementation, there is a set of fields that should be present and already validated and defaulted, so that no further post-JSON-validation logic is required.
    • Points
      • While many models require practically all fields (e.g., center data needs angular momenta, exponents, and coefficients), for some there's a big gap between required, deducible (by implementation, not by JSON validation), and optional fields. An example is Molecule, which I think of as having the following breakdown:
        • required: symbols, geometry
        • deducible: atomic_numbers, masses, mass_numbers, real, fix_com, fix_orientation, labels, fragments, fragment_multiplicity, fragment_charges, molecular_charge, molecular_multiplicity, provenance, extras ({})
        • optional: fix_symmetry, connectivity, comment, name
      • Both previously (hand-written) and newly (autogenerated), the required fields list is the initialization set (required). But it can be useful for implementation writers to know the storage set (required + deducible). It could even be useful to be able to call JSON validation on the storage set. Does this seem important, and how should we indicate these sets of fields?
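
As a concrete illustration of the integer question above, here is a minimal sketch (assuming the Python jsonschema package; the values and behavior notes are illustrative, not normative) contrasting the two renderings:

```python
import jsonschema

schema_integer = {"type": "integer"}                        # pydantic's normal export for ints
schema_multipleof = {"type": "number", "multipleOf": 1.0}   # JSON Schema Core-only alternative

jsonschema.validate(2, schema_integer)       # passes
jsonschema.validate(2, schema_multipleof)    # passes
jsonschema.validate(2.0, schema_multipleof)  # passes: mathematically integral
# Whether 2.0 passes schema_integer depends on the draft and the validator
# (see the json-schema.org quote later in this thread), which is the crux of the question.
```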

Status

  • Ready to go

@lgtm-com bot commented Sep 3, 2020

This pull request fixes 1 alert when merging b613648 into 09cb381 - view on LGTM.com

fixed alerts:

  • 1 for Duplicate key in dict literal

@lgtm-com bot commented Sep 3, 2020

This pull request fixes 1 alert when merging d7b31eb into 09cb381 - view on LGTM.com

fixed alerts:

  • 1 for Duplicate key in dict literal

@loriab (Collaborator Author) commented Sep 14, 2020

Pinging @ghutchis @cryos @mattwelborn @dgasmith @bennybp @awvwgk @wilhadams as persons involved in the #68 discussion who may have opinions. I'll leave this open at least another week. If discussion doesn't develop by then, I'll seek ordinary MolSSI merge approval.

Thanks for any comments.

@cryos (Collaborator) commented Sep 14, 2020

For me I would go back to concerns we had when we started discussing QCSchema - not tying it to one particular code/implementation. This feels a lot like that is what is happening, and that is how most of the other standards have evolved, including Chemical JSON, which I developed years before MolSSI was founded. Maybe it is unavoidable, and it is something we should simply accept while aiming to establish a flexible standard we can use. Not trying to be negative, just summarizing my thoughts.

I would like to develop import/export of appropriate data to/from Chemical JSON, making sure we can round-trip data, and build up some examples of simple files, and then ones of increasing complexity, that are version controlled. E.g., a molecule with/without bonds; I am not sure what the status of more complex output is (the equivalent of FCHK, or something I can generate orbitals/cubes from). I lost track a little in the midst of the pandemic and changing jobs...

@ghutchis (Collaborator) commented:

In general, I'm all in favor of automation, but I'd agree with @cryos that we started with a community workshop, and the end result of this merge seems to imply "anything that happens to QCElemental becomes the schema."

Personally, I'd still rather see the reverse - that the schema lives here, with regular discussion about extension / changes, and auto-generated bits get submitted to QCElemental as a pull request.

-OR-

I don't understand why QCElemental can't use a prefix / extension rather than force a change to the schema. What I remember of the in-person discussion was the need for extensions to the schema for particular programs, and that they would get standardized. My analogy was with vendor-specific CSS prefixes.

  • molecule.identifiers has been discussed and is much needed in the core (e.g., tracking a molecule through a workflow, retaining SMILES bond orders, etc.). Most programs permit a title or comment.
  • atomicinput.protocols seems more specific to QCElemental

I know @berquist was working on an implementation for cclib: cclib/cclib#798

@loriab (Collaborator Author) commented Sep 14, 2020

Thanks for the comments and concerns. While I'm personally fond of qcel, be assured that I don't want it to encroach on the independence of qcsk, and my primary motives are still maintenance and engagement, #68 (comment).

I worry QCSchema the repository will stagnate and be a source of confusion to new implementors unless there's a project (preferably multiple projects) regularly running their "qcschema" instances through some sort of JSON validator tied to this repo. qcel hasn't been, and I don't know of other projects that might have been. (I suspect not, because there was a ( vs [ syntax error preventing this repo from even being written to JSON.) That is, I think I agree with @cryos's thoughts about research-driven standards.

Personally, I'd still rather see the reverse - that the schema lives here, with regular discussion about extension / changes, and auto-generated bits get submitted to QCElemental as a pull request.

Regarding qcel dominating the development of qcsk, I hoped the approach in this PR could alleviate the worst concerns in #68. The reverse you describe would be best. However, since pydantic classes can generate JSON Schema but JSON Schema can't generate pydantic classes, the dependency order is forced. This PR does incorporate an important aspect of the reverse idea in that qcsk repo can block changes in qcel repo that alter QCSchema. Only after proposed changes are merged here, presumably after discussion, do schema-altering PRs to qcel become mergeable. (Well, not at the moment, because this is just the initial delta schema rearrangement PR so that future diffs are meaningful.) So I think the loss-of-independence risk is greater for qcel than for qcsk. It would be helpful to separate concerns about this future update process from concerns about the rearrangement and changes in this initial delta PR.
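
To make the dependency order concrete, here is a minimal sketch (pydantic v1 API; the model below is a toy stand-in, not the actual qcel class) of the one-way path from a pydantic model to JSON Schema:

```python
from pydantic import BaseModel, Field

class Provenance(BaseModel):
    creator: str = Field(..., description="The name of the program, library, or person who created the object.")
    version: str = Field("", description="The version of the creator, blank otherwise.")
    routine: str = Field("", description="The name of the routine or function within the creator, blank otherwise.")

# A model can always emit JSON Schema; there is no built-in path from JSON Schema back to a model.
print(Provenance.schema_json(indent=2))
```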

atomicinput.protocols seems more specific to QCElemental

Broadly, it's the user's opportunity to control the data layout. If the user (person or database) knows they don't care about the ASCII output, prevent it from being included. If the user (person or geometry optimizer) knows they don't care about the Fock matrix, prevent it from being included. Mostly of concern for minimizing internet transfer size. Definitely used by QCFractal, but I think it's of general interest.
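
For concreteness, a hedged sketch of what such a request might look like in an input instance; the particular keys and values below follow QCElemental's usage only loosely and are illustrative rather than normative:

```python
# Illustrative only: an input that asks the runner not to return bulky result data.
atomic_input = {
    "schema_name": "qcschema_input",
    "schema_version": 1,
    "molecule": {"symbols": ["He"], "geometry": [0.0, 0.0, 0.0]},
    "driver": "energy",
    "model": {"method": "hf", "basis": "sto-3g"},
    "protocols": {
        "wavefunction": "none",   # don't return orbitals/Fock matrices
        "stdout": False,          # don't return the ASCII output stream
    },
}
```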

I don't understand why QCElemental can't use a prefix / extension rather than force a change to the schema. What I remember of the in-person discussion was the need for extensions to the schema for particular programs, and that they would get standardized. My analogy was with vendor-specific CSS prefixes.

I may not be understanding your suggestion. I agree on the general progression from unformalized extension -> formalized and adopted. What the PR does for most big schemas (provenance is the exception that comes to mind) is (1) always include a field extras (hash type with string keys and anything values) that can be used freely for development fields like psi4:qcvars, and (2) batten down on additionalProperties and forbid any other top-level fields not in the formal schema. It makes pass-through easier because everything unexpected is behind one field. Is this rearrangement too abrupt or otherwise hard to work within?
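
A toy fragment (assuming the Python jsonschema package; this is not the real qcschema_molecule schema) showing the extras-plus-additionalProperties pattern described above:

```python
import jsonschema

# Unknown top-level keys are rejected; anything experimental rides along inside `extras`.
fragment = {
    "type": "object",
    "properties": {
        "symbols": {"type": "array", "items": {"type": "string"}},
        "extras": {"type": "object"},
    },
    "required": ["symbols"],
    "additionalProperties": False,
}

jsonschema.validate({"symbols": ["O", "H", "H"], "extras": {"psi4:qcvars": {}}}, fragment)  # passes
# jsonschema.validate({"symbols": ["O"], "psi4:qcvars": {}}, fragment)                      # would raise ValidationError
```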

I would like to develop import/export of appropriate data to/from Chemical JSON, making sure we can round-trip data, and build up some examples of simple files, and then ones of increasing complexity, that are version controlled. E.g., a molecule with/without bonds; I am not sure what the status of more complex output is (the equivalent of FCHK, or something I can generate orbitals/cubes from).

Also in the course of this PR series, over at qcel I've added a bit to testing where the usual Python tests drop JSON instances, and these are later run through JSON validation against the qcel-exported qcsk schema. Those are generated, not stored, but they can form another body of examples.

@berquist (Contributor) commented Sep 15, 2020

Just a couple of comments before reading the actual changes.

Regarding the initial PR text:

While it's a given that arrays shall be stored flat and units shall be atomic, I've added fields (transparent to validation) to hold shaped dimension and units info separate from the description. This follows from what was started in Wfn. Is this a satisfactory way to handle these matters?

This is similar to datasets and metadata in HDF5 and will map nicely to that format.
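
A small hedged sketch (numpy assumed; the variable names are illustrative) of how a consumer might act on those annotations: the instance stores the array flat, and the shape annotation attached to the field in the schema says how to reinterpret it:

```python
import numpy as np

nao = 2                                          # e.g., taken from calcinfo_nbasis in the instance
flat_fock = [0.1, 0.2, 0.2, 0.3]                 # stored flat, in atomic units
fock = np.asarray(flat_fock).reshape(nao, nao)   # reshaped per the schema's ("nao", "nao") shape annotation
```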

Presently in this repo, output.provenance is single or list and molecule.provenance and all others are single only. How do we want it to be?

I think this is less important if there is the ability to trace in some other way the previous calculations that were performed to create a single input, such as performing a frequency calculation after a separate geometry optimization. I know that we discussed such calculations, plus finite difference, SAPT, and so on as being separate in the eyes of logical QCSchema blocks, but we will need a way of connecting them. List provenance is one (albeit lossy) way of helping with this. If there were provenance (or extensions) at the level of granularity one up from fields (for example, SCF energy), program modules could register their own provenance (SCF-specific info from this package, CC-specific info from another package). This doesn't answer the question, but I haven't read the PR yet.


Regarding later comments:

I worry QCSchema the repository will stagnate and be a source of confusion to new implementors unless there's a project (preferably multiple projects) regularly running their "qcschema" instances through some sort of JSON validator tied to this repo.

Because I haven't been watching this repo, only QCSchema, I didn't know how deeply a part of qcel the schema was, having only hooked into the periodic table and physical constants. I know parts of Psi4 were migrated here, but didn't check the contents. So I am/was confused from the other direction.

If the user (person or geometry optimizer) knows they don't care about the Fock matrix, prevent it from being included. Mostly of concern for minimizing internet transfer size. Definitely used by QCFractal, but I think it's of general interest.

This will definitely be useful for source programs themselves (engines?); for translators it's less clear, but its inclusion doesn't hurt.

For the prefix/non-standard extensions,

Is this rearrangement too abrupt or otherwise hard to work within?

I think your example is actually what we agreed on at the meeting.

Ultimately, we need to have more substantive examples than what's in https://github.com/MolSSI/QCSchema/tree/09cb381e77704e809fb4d8cee4be47a7e17d6673/tests/simple, unless these fully cover the current schema, excluding the geometry optimization differences. If the generated qcel examples fit the bill then that's good; otherwise maybe something from QCFractal?

@wilhadams left a comment

This PR does incorporate an important aspect of the reverse idea in that qcsk repo can block changes in qcel repo that alter QCSchema. Only after proposed changes are merged here, presumably after discussion, do schema-altering PRs to qcel become mergeable.

This doesn't change the fact that in order for a change to be made to the schema, it must be made to the Pydantic models in QCElemental. Thus, QCSchema is inherently dependent on the implementation in QCElemental. One specific concern I have is that QCSchema seems to be using JSON draft 04 only because Pydantic does. If the choice of JSON specification is made for that reason, then not only is QCSchema dependent on QCElemental, but also on Pydantic's choice of specification to support.

integer vs. number+multipleOf issue. integer isn't part of JSON Schema Core, but it is part of JSON Schema Validation (https://pydantic-docs.helpmanual.io/usage/schema/#json-schema-types), from which we've also been using pattern, anyOf, and enum. OK to keep using integer then (which is pydantic's normal export for ints) rather than persuading pydantic to export number + multipleOf?

In the Pydantic documentation referring to JSON Schema Validation, it links to draft 2019-09, which is the current draft of the JSON Schema specification. In Appendix A of that specification (see link), "Keywords Moved from Validation to Core", both enum and anyOf are now core vocabulary. This isn't necessarily important, because the default JSON Schema meta-schema includes both the Validation and Core vocabularies; however, it shows that Pydantic isn't necessarily keeping up with the JSON Schema specifications as they develop, given that every time a specification is given via the "$schema" keyword, draft 04 is used. The integer case is mainly problematic due to the way that it is validated by each validator; I'll quote json-schema.org here:

The precise treatment of the “integer” type may depend on the implementation of your JSON Schema validator. JavaScript (and thus also JSON) does not have distinct types for integers and floating-point values. Therefore, JSON Schema can not use type alone to distinguish between integers and non-integers. The JSON Schema specification recommends, but does not require, that validators use the mathematical value to determine whether a number is an integer, and not the type alone. Therefore, there is some disagreement between validators on this point. For example, a JavaScript-based validator may accept 1.0 as an integer, whereas the Python-based jsonschema does not.

At a minimum, the fields where these live need to remain in qcsk. Should the subschema be adopted or removed?

  • Molecule.Identifiers
  • AtomicInput.Protocols

I don't think it's necessary to keep them, since there won't be a JSON file containing just identifiers or just protocols.

While many models require practically all fields (e.g., center data needs angular momenta, exponents, and coefficients), for some there's a big gap between required, deducible (by implementation, not by JSON validation), and optional fields. An example is Molecule, which I think of having the following breakdown: ...

I think that fix_com and fix_orientation apply more to QCElemental processes than to QCSchema in general. As commented in the schema, if there is a specific, expected way for every QCSchema parser to handle these processes, then it should be explicitly documented here - otherwise, it could be an optional field, with each implementation specifying what happens when you request that keyword. I know that in general QCElemental validates beyond just ensuring that the JSON file is valid as per the schema - are any of those behaviours expected from any program supporting QCSchema, or just from QCElemental?

In Molecule, I think it's important to specify what field is the source for inferring "nat", and similar for any variable ("nfr", etc.). For the specific case of "nat", "symbols" is a good choice, as it is a required field; it would be good to explicitly state that this is the field from which "nat" is inferred, such that if any other, optional but linked field is a different length, then that optional field is considered to be invalid, while "symbols" cannot be invalid based on length. That could be what's done in QCElemental, I haven't checked, but should be included as an explicit instruction for parsing QCSchema files, to ensure consistent behaviour between parsers.
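
A hedged sketch of the parsing rule proposed above (the helper is hypothetical, not part of any schema or implementation): nat comes from the required symbols field, and any linked optional field whose length disagrees is the one flagged as invalid:

```python
def check_atom_counts(mol: dict) -> list:
    """Flag optional per-atom fields whose length disagrees with len(symbols)."""
    nat = len(mol["symbols"])  # "symbols" is required, so nat is always well defined
    problems = []
    for key, per_atom in (("masses", 1), ("real", 1), ("geometry", 3)):
        if key in mol and len(mol[key]) != per_atom * nat:
            problems.append(f"{key}: expected {per_atom * nat} entries, found {len(mol[key])}")
    return problems

check_atom_counts({"symbols": ["O", "H", "H"], "geometry": [0.0] * 9})  # -> []
```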

In the QCSchema docs it states that snake case will be used for keys; however, all of the top-level keys (Molecule, et al.) are camel case.

"definitions": {
"Identifiers": {
"title": "Identifiers",
"description": "Canonical chemical identifiers\n\nParameters\n----------\nmolecule_hash : str, Optional\nmolecular_formula : str, Optional\nsmiles : str, Optional\ninchi : str, Optional\ninchikey : str, Optional\ncanonical_explicit_hydrogen_smiles : str, Optional\ncanonical_isomeric_explicit_hydrogen_mapped_smiles : str, Optional\ncanonical_isomeric_explicit_hydrogen_smiles : str, Optional\ncanonical_isomeric_smiles : str, Optional\ncanonical_smiles : str, Optional\npubchem_cid : str, Optional\n PubChem Compound ID\npubchem_sid : str, Optional\n PubChem Substance ID\npubchem_conformerid : str, Optional\n PubChem Conformer ID",


broken description

loriab (Collaborator Author) replied:

What's broken? It's a merge of all the field descriptions. Rather ghastly to read, but it's cleaner to read them at their individual fields.

"title": "Identifiers",
"description": "Canonical chemical identifiers\n\nParameters\n----------\nmolecule_hash : str, Optional\nmolecular_formula : str, Optional\nsmiles : str, Optional\ninchi : str, Optional\ninchikey : str, Optional\ncanonical_explicit_hydrogen_smiles : str, Optional\ncanonical_isomeric_explicit_hydrogen_mapped_smiles : str, Optional\ncanonical_isomeric_explicit_hydrogen_smiles : str, Optional\ncanonical_isomeric_smiles : str, Optional\ncanonical_smiles : str, Optional\npubchem_cid : str, Optional\n PubChem Compound ID\npubchem_sid : str, Optional\n PubChem Substance ID\npubchem_conformerid : str, Optional\n PubChem Conformer ID",
"type": "object",
"properties": {


These identifiers should be documented on the website (https://molssi-qc-schema.readthedocs.io/en/latest/auto_topology.html) and required validation for those identifiers should be specified and documented.

loriab (Collaborator Author) replied:

I quite agree. I got the docs for qcsk fixed up wrt current master a few weeks ago. Better to keep this PR to the stuff the community needs to review and leave docs reconciliation to a subsequent PR.

The current JSON to Sphinx process for qcsk is homegrown. Poking around a bit, I didn't see an external project that does it much better, but if anyone knows a clean way, I'd be glad to learn of it.

},
"Provenance": {
"title": "Provenance",
"description": "Provenance information.\n\nParameters\n----------\ncreator : str\n The name of the program, library, or person who created the object.\nversion : str, Default: \n The version of the creator, blank otherwise. This should be sortable by the very broad [PEP 440](https://www.python.org/dev/peps/pep-0440/).\nroutine : str, Default: \n The name of the routine or function within the creator, blank otherwise.",


broken description

"version": {
"title": "Version",
"description": "The version of the creator, blank otherwise. This should be sortable by the very broad [PEP 440](https://www.python.org/dev/peps/pep-0440/).",
"default": "",


Vague: default value for optional key

"routine": {
"title": "Routine",
"description": "The name of the routine or function within the creator, blank otherwise.",
"default": "",


Vague: default value for optional key

loriab (Collaborator Author) replied:

Good catch, I had meant to have all three -- creator, version, and routine -- as required, like qcsk dev had. Having a "required empty" like this is anomalous for qcsk and another reason I want to have the input required vs. validated required discussion I mentioned.

"schema_version": {
"title": "Schema Version",
"description": "The version number of ``schema_name`` to which this model conforms.",
"default": 1,


This file is the specific schema for a specific version of QCSchema, so it shouldn't have a default, it should be explicitly fixed.

"keywords": {
"title": "Keywords",
"description": "The program-specific keywords to be used.",
"default": {},


Vague: Default value for optional key

"protocols": {
"title": "Protocols",
"description": "Protocols regarding the manipulation of computational result data.",
"default": {},


Vague: Default value for optional key

"extras": {
"title": "Extras",
"description": "Additional information to bundle with the computation. Use for schema development and scratch space.",
"default": {},


Vague: Default value for optional key

},
"properties": {
"title": "Properties",
"description": "Named properties of quantum chemistry computations following the MolSSI QCSchema.\n\n Notes\n -----\n All arrays are stored flat but must be reshapable into the dimensions in attribute ``shape``, with abbreviations as follows:\n nao: number of atomic orbitals = calcinfo_nbasis\n nmo: number of molecular orbitals\n ",


broken description

@loriab (Collaborator Author) left a comment

Thanks for the detailed review!

"definitions": {
"Identifiers": {
"title": "Identifiers",
"description": "Canonical chemical identifiers\n\nParameters\n----------\nmolecule_hash : str, Optional\nmolecular_formula : str, Optional\nsmiles : str, Optional\ninchi : str, Optional\ninchikey : str, Optional\ncanonical_explicit_hydrogen_smiles : str, Optional\ncanonical_isomeric_explicit_hydrogen_mapped_smiles : str, Optional\ncanonical_isomeric_explicit_hydrogen_smiles : str, Optional\ncanonical_isomeric_smiles : str, Optional\ncanonical_smiles : str, Optional\npubchem_cid : str, Optional\n PubChem Compound ID\npubchem_sid : str, Optional\n PubChem Substance ID\npubchem_conformerid : str, Optional\n PubChem Conformer ID",
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what's broken? it's a merge of all the field descriptions. rather ghastly to read, but it's cleaner to read them at their individual fields.

"title": "Identifiers",
"description": "Canonical chemical identifiers\n\nParameters\n----------\nmolecule_hash : str, Optional\nmolecular_formula : str, Optional\nsmiles : str, Optional\ninchi : str, Optional\ninchikey : str, Optional\ncanonical_explicit_hydrogen_smiles : str, Optional\ncanonical_isomeric_explicit_hydrogen_mapped_smiles : str, Optional\ncanonical_isomeric_explicit_hydrogen_smiles : str, Optional\ncanonical_isomeric_smiles : str, Optional\ncanonical_smiles : str, Optional\npubchem_cid : str, Optional\n PubChem Compound ID\npubchem_sid : str, Optional\n PubChem Substance ID\npubchem_conformerid : str, Optional\n PubChem Conformer ID",
"type": "object",
"properties": {
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I quite agree. I got the docs for qcsk fixed up wrt current master a few weeks ago. Better to keep this PR to the stuff the community needs to review and leave docs reconciliation to a subsequent PR.

The current JSON to Sphinx process for qcsk is homegrown. Poking around a bit, I didn't see an external project that does it much better, but if anyone knows a clean way, I'd be glad to learn of it.

"routine": {
"title": "Routine",
"description": "The name of the routine or function within the creator, blank otherwise.",
"default": "",
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch, I had meant to have all three -- creator, version, and routine -- as required, like qcsk dev had. Having a "required empty" like this is anomalous for qcsk and another reason I want to have the input required vs. validated required discussion I mentioned.

"required": [
"creator"
],
"$schema": "http://json-schema.org/draft-04/schema#"
loriab (Collaborator Author) replied:

Probably not -- it's simply a shared component.

"title": "Schema Name",
"description": "The QCSchema specification to which this model conforms. Explicitly fixed as qcschema_molecule.",
"default": "qcschema_molecule",
"pattern": "^(qcschema_molecule)$",
loriab (Collaborator Author) replied:

Simply an analogy to other schema, which had qcschema_input and qc_schema_input variants. The effect of the current pattern is the same as const or enum, afaik. I can post-edit the type if people feel strongly.

Comment on lines +386 to +388
{
"type": "string"
}
loriab (Collaborator Author) replied:

Excellent question -- I had it, too. Exponents and coefficients have always been strings in qcsk for the sensible reason that Ben Pritchard uses them in BSE, and he wants precision from the literature preserved as strings, not lossy floats. Meanwhile, qcel and most practical implementations want numbers.
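
A small hedged illustration of that tension (the exponent value below is made up): a literature exponent kept as a string round-trips exactly, e.g. via decimal, while the float most implementations want is a binary approximation:

```python
from decimal import Decimal

literature = "0.31364978780000E+02"   # made-up exponent, as it might appear as a BSE-style string
print(Decimal(literature))            # 31.364978780000  -- literature digits preserved
print(float(literature))              # 31.36497878      -- the number qcel and most codes actually use
```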

Comment on lines +637 to +658
"ErrorCorrectionProtocol": {
"title": "ErrorCorrectionProtocol",
"description": "Configuration for how QCEngine handles error correction\n\n WARNING: These protocols are currently experimental and only supported by NWChem tasks\n\n\nParameters\n----------\ndefault_policy : bool, Default: True\n Whether to allow error corrections to be used if not directly specified in `policies`\npolicies : name='policies' type=Optional[Mapping[str, bool]] required=False default=None, Optional\n Settings that define whether specific error corrections are allowed. Keys are the name of a known error and values are whether it is allowed to be used.",
"type": "object",
"properties": {
"default_policy": {
"title": "Default Policy",
"description": "Whether to allow error corrections to be used if not directly specified in `policies`",
"default": true,
"type": "boolean"
},
"policies": {
"title": "Policies",
"description": "Settings that define whether specific error corrections are allowed. Keys are the name of a known error and values are whether it is allowed to be used.",
"type": "object",
"additionalProperties": {
"type": "boolean"
}
}
},
"additionalProperties": false
},
loriab (Collaborator Author) replied:

We'll want to keep the field reserved, but otherwise I agree, since this is experimental.

"title": "Schema Name",
"description": "The QCSchema specification this model conforms to. Explicitly fixed as qcschema_input.",
"default": "qcschema_input",
"pattern": "^(qc_?schema_input)$",
loriab (Collaborator Author) replied:

Early versions did qc_. It's harmless to allow old instances to validate if they are otherwise compatible; implementations should reset to the new qcschema_input name. I ought to add that to the description.

"keywords": {
"title": "Keywords",
"description": "The program-specific keywords to be used.",
"default": {},
loriab (Collaborator Author) replied:

If the notion of a separate list of required fields for full instances catches on, this and certain other fields need to have an empty dictionary default so that access is easy without a bunch of error checking.

Comment on lines +1399 to +1425
"ComputeError": {
"title": "ComputeError",
"description": "Complete description of the error from an unsuccessful program execution.\n\nParameters\n----------\nerror_type : str\n The type of error which was thrown. Restrict this field to short classifiers e.g. 'input_error'. Suggested classifiers: https://github.com/MolSSI/QCEngine/blob/master/qcengine/exceptions.py\nerror_message : str\n Text associated with the thrown error. This is often the backtrace, but it can contain additional information as well.\nextras : Dict[str, Any], Optional\n Additional information to bundle with the error.",
"type": "object",
"properties": {
"error_type": {
"title": "Error Type",
"description": "The type of error which was thrown. Restrict this field to short classifiers e.g. 'input_error'. Suggested classifiers: https://github.com/MolSSI/QCEngine/blob/master/qcengine/exceptions.py",
"type": "string"
},
"error_message": {
"title": "Error Message",
"description": "Text associated with the thrown error. This is often the backtrace, but it can contain additional information as well.",
"type": "string"
},
"extras": {
"title": "Extras",
"description": "Additional information to bundle with the error.",
"type": "object"
}
},
"required": [
"error_type",
"error_message"
],
"additionalProperties": false
},
loriab (Collaborator Author) replied:

It is used in QCEngine and above, but it's generic. Inclusion/exclusion can go either way.

@loriab (Collaborator Author) commented Sep 18, 2020

This PR does incorporate an important aspect of the reverse idea in that qcsk repo can block changes in qcel repo that alter QCSchema. Only after proposed changes are merged here, presumably after discussion, do schema-altering PRs to qcel become mergeable.

This doesn't change the fact that in order for a change to be made to the schema, it must be made to the Pydantic models in QCElemental. Thus, QCSchema is inherently dependent on the implementation in QCElemental. One specific concern I have is that QCSchema seems to be using JSON draft 04 only because Pydantic does. If the choice of JSON specification is made for that reason, then not only is QCSchema dependent on QCElemental, but also on Pydantic's choice of specification to support.

Pydantic is working off draft-07, I think. The draft-04 marker was forcibly added by me to imitate qcsk:
https://github.com/MolSSI/QCElemental/pull/237/files#diff-7a9af1eb4f18dc820783bf210d2cfc95R197
pydantic/pydantic#1478

@langner commented Sep 22, 2020

A couple of thoughts from my point of view (cclib dev):

  • I think QCSchema/QCElemental/QCEngine are all really cool - a lot of excellent work being done here! I've been following the projects, but not closely.
  • I too feel it would be healthier to have a schema not tied to a specific project/organization, or at least tied in equal measure to several projects/organizations.
  • I actually like the idea of having the source of truth in Python or some other more robust representation - and exporting JSON or whatnot from that. I never quite understood why the standard needs to be specific to JSON/XML or another serialized form.
  • Perhaps a fork of QCSchema/QCElemental that sets a common standard as a consensus of several orgs (MolSSI, OpenChemistry, others) is something to try?
  • Communities are great, but capricious. If QCEngine fails, will QCElemental have a life of its own? Would it be used by someone else to develop QCEngine2 from scratch?
  • I'm somehow reminded of Quixote (https://jcheminf.biomedcentral.com/articles/10.1186/1758-2946-3-38) - I'm not sure how successful that was and whether it's still alive at all, but I wonder if there are takeaways about what worked well and what could be done better this time around.

@dgasmith (Collaborator) commented Sep 23, 2020

While it is undoubtedly best for the schema's generality not to originate from QCElemental, a significant push behind this PR is that QCElemental ended up receiving much more attention, community effort, and uptake than the QCSchema repo, as the library stack underpins a variety of efforts (OpenFF, the Psi4 ecosystem, QCArchive, etc.). A practical question: if the schema implementation of Elemental is moved to this repo or the schema is expanded here, will it realistically influence the uptake of the project for those commenting here?

It became reasonably clear that we ended up with the "schema + 1" problem without an implementation driving requirements and uptake. But then the driving project's requirements became imbalanced against community requirements, since there wasn't much push for the schema itself. Perhaps this is the inevitable issue with a schema without more explicit resources focused on promoting the schema itself than were allocated for this project.

It would be terrific if MolSSI could weigh in on future directions.

@cryos (Collaborator) commented Sep 23, 2020

I am not aware of any continued effort with Quixote @langner; there were some great ideas, and there was a lot of energy around the approaches being explored at the time. I think the QCSchema will likely end up being supported to the extent that many other formats are, to interoperate with the projects that use it; our original goals may not be realized if it is driven by one or two dominant projects. It was valuable to gather community feedback and to explore how to better document/validate JSON formats.

This is one area where we had hoped that MolSSI could help, so it may see wider uptake. It would be great to see a community schema developed even if this is not the one; several of us have our pet schemas if not, and format translation in the form of cclib can add one more format to its repertoire and be easily reused in other projects.

@berquist mentioned this pull request Jan 21, 2022
Successfully merging this pull request may close these issues: Keeping QCSchema in sync with QCElemental

7 participants