Multiple conformations in a single file? #34

Open
leeping opened this issue Jan 6, 2018 · 16 comments

Comments

@leeping

leeping commented Jan 6, 2018

As a force field developer and quantum chemistry user, I often find myself working with collections of structures (conformations) and associated energies. This could be useful for torsion drives in 1D and 2D, as well as reaction energies / minimum energy paths. Often I am also interested in running the same quantum chemistry method on the whole set of conformations. Thus, I think it would be very helpful if the schema could support this.

@dgasmith
Collaborator

dgasmith commented Jan 6, 2018

I believe this is part of the change where the base schema starts with a list and looks like the following:

[
    "spec_version",
    {
      ...input_one
    },
    {
      ...input_two
    },
    ...
]

While I do like this, my main concern is getting various QM programs to actually execute it. Can @saromleang or @loriab weigh in?

@davidlmobley

Totally agree with @leeping here. @vtlim in my group also uses this aspect a lot, and @bannanc likely will as well.

@dgasmith
Collaborator

I think Psi4 could support this natively, but other codes would require wrappers, which might as well live in other program layers rather than be baked into the spec itself. Can I get other QM devs to weigh in here?

@wadejong
Collaborator

wadejong commented Jan 17, 2018 via email

@loriab
Collaborator

loriab commented Jan 17, 2018

I once held the view that whatever a QC input file could support, a job schema should support. I've since retreated to the position that a single schema job should cover what is, quantum-chemically, a single logical unit: a whole SAPT is one job, but a CCSD followed by a CISD is two jobs, even if the SCF is shared between them. That keeps the job spec schema from getting too combinatorial: loop over these molecules, with these methods, in all these basis sets, at each of this list of torsion angles. Psi could do that job, but I'm reluctant to see it be do-able in the job schema immediately facing a QC program. Better that it be driven by the next layer up in the workflow.

@saromleang

I don't believe GAMESS is set up to take in a batch of inputs and process them. A lot of work would need to be done within GAMESS to allow this (not saying that there isn't any benefit to it).

@leeping
Author

leeping commented Jan 18, 2018

A set of configurations with the same atomic symbols / charge / multiplicity / method is a common kind of calculation; it's also a convenient unit of data to include in a database, because the user is likely going to request the entire set rather than just one configuration. That was my starting point for requesting this feature.

I appreciate the concerns of the QM developers. It's a major task to make a quantum chemistry code support this kind of batch processing if it doesn't already. My hope is that requesting a feature in the schema is not equivalent to requesting this feature in all QM codes.

Maybe the "set of multiple conformations" should have a variable name other than "geometry", such as:

json_molecule["geometries"] = [[0, 0, 0, 0, 0, 1], [0, 0, 0, 0, 0, 2]]

If "geometries" is provided, then "geometry" should not be provided. That way, the QM codes that support batch processing will loop over the configurations, and those that don't can simply throw an error. What do you think?

@loriab
Collaborator

loriab commented Jan 18, 2018

Can multiple job spec documents differing in geometry serve the same role? It's considerable redundancy for the workflows you're planning, but it's pretty modest compared to the output and cost of QC calculations. So long as the runtime database is indexable by molecule identity, the records should be readily grouped. Conformations can then be associated even if they came from different input files.
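
As a rough sketch of that grouping, assuming each job document carries a "molecule" block with the usual identity fields (names illustrative only):

import hashlib
import json
from collections import defaultdict

def molecule_key(job):
    """Hash the conformation-independent parts of a job's molecule block."""
    mol = job["molecule"]
    identity = {
        "symbols": mol["symbols"],
        "charge": mol.get("molecular_charge", 0),
        "multiplicity": mol.get("molecular_multiplicity", 1),
    }
    return hashlib.sha1(json.dumps(identity, sort_keys=True).encode()).hexdigest()

def group_conformations(job_documents):
    """Group single-geometry job documents that describe the same molecule."""
    groups = defaultdict(list)
    for job in job_documents:
        groups[molecule_key(job)].append(job)
    return groups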

@leeping
Author

leeping commented Jan 18, 2018

I think multiple job spec documents could serve the same role, similar to how a stack of looseleaf pages can play a similar role as a book. It's mainly a matter of organization and convenience, and having the technology to bind the book can save a lot of time.

@langner

langner commented Jan 18, 2018

I tend to agree with @loriab - it will be much easier to implement a simple schema that covers a single unit of computation. But how do we intend to deal with multiple conformations when they occur within a single job, for example a geometry optimization? Surely that output should be in one file. Perhaps the design for those cases could be extended to support the more generic case?

@leeping
Author

leeping commented Jan 19, 2018

I certainly understand @loriab 's concern that a single conformation makes the most sense as a single unit of computation. On the other hand, an array of single-point calculations (sharing atomic symbols / method / charge / multiplicity) is becoming increasingly common and important. There is currently no easy and standard way to manage these arrays, leading to a lot of overhead in doing this research.

I'm mainly asking for this feature as an organizational tool, which would enable us to store the data in one file, have our data-processing programs process a single file instead of looping over multiple ones, store an array of single-point calculations as one entry in a database, and refer to the whole array using one key.

It would also be great if QM codes could support running an array of single-point calculations as a feature, but that's not what I'm directly interested in. I would implement this behavior in the codes I contribute to, but I wouldn't go as far as to request it in every single code. A small script could be provided to split the job array file into multiple single units, or multiple outputs may be combined into one file.
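
A minimal sketch of such a splitting script, assuming a hypothetical batch input whose molecule block uses the proposed "geometries" list:

import copy
import json

def split_job_array(path):
    """Expand a hypothetical batch input into one single-geometry job per conformation."""
    with open(path) as f:
        batch = json.load(f)

    jobs = []
    for geometry in batch["molecule"]["geometries"]:
        job = copy.deepcopy(batch)
        job["molecule"]["geometry"] = geometry   # standard single-geometry key
        del job["molecule"]["geometries"]        # drop the batch-only key
        jobs.append(job)
    return jobs

# Each element returned by split_job_array("batch.json") can then be written
# out and run as an ordinary single-job input.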

More generally, if we request such "organizational features" in the schema, is it equivalent to requesting the same kind of organization in the QM codes that use it?

@vtlim

vtlim commented Jan 20, 2018

This feature would definitely be useful for me as well -- running the same geometry optimization scheme on a large array of conformations (10s to 1000s of conformations per molecule). If I want to perform additional optimizations or visualizations, I have to iterate through the conformations' directories a number of times. Supporting multiple conformations would make processing and maintenance easier.

@jchodera

It would indeed be extremely useful for JSON files intended to specify input for quantum chemical calculations to list a number of configurations on which the same operation is to be performed. If this is difficult for all programs to support natively, could this just be added as a separate Tier of spec support? It would seem like a simple Python driver would be sufficient to then act as a harness for codes that conform to a lower level Tier of the spec.

Another relevant question: if the JSON spec also provides a way to associate outputs with inputs, how would the mapping from calculation outputs to input configurations be managed? And if several configurations are produced by a single input (e.g., a geometry optimization), we also have to worry about the association between each input configuration and many output configurations (as well as, for example, the other properties associated with every output configuration).
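
A rough sketch of what such a driver harness might look like, which also keeps the output-to-input mapping explicit; run_single_job is a placeholder for whatever executes one standard single-geometry document, and all names are illustrative:

import copy

def run_batch(batch_input, run_single_job):
    """Fan a hypothetical multi-geometry input out over a single-job QC code."""
    outputs = []
    for index, geometry in enumerate(batch_input["molecule"]["geometries"]):
        job = copy.deepcopy(batch_input)
        job["molecule"]["geometry"] = geometry
        del job["molecule"]["geometries"]
        result = run_single_job(job)            # lower-tier code handles one geometry
        result["conformation_index"] = index    # record which input it came from
        outputs.append(result)
    return outputs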

@dgasmith
Collaborator

I think this goes back to what the scope of the schema project really is. Folks who have implemented successful schemas before have indicated that projects which are narrow in scope have a much (much!) higher chance of succeeding. Having this spec be a simple "API" for single QC applications seems complex enough (to me).

A few downsides to implementing "workflows" in the spec:

  • What happens if the QC program crashes on computation number 49/50?
  • Putting multiple computations through a single QC program makes parallelizing over these computations harder.
  • How do we link multiple inputs and outputs and what happens with nested outputs? (as John mentioned)
  • We create further divisions in the spec between QM programs that support this and those that do not.

I do understand that this ability is very useful; however, I believe it would best live at a higher workflow level. As John mentioned, this seems like an extremely useful capability, so could we move it to a higher-level "workflow" schema?

I'm happy to be convinced otherwise here; I mostly worry that increasing the scope and complexity of this project pushes its first release even further out (if ever).

@davidlmobley

I think I'm inclined to agree that, while useful, this should be punted further down the line or up to a higher workflow level.

@leeping
Author

leeping commented Jan 27, 2018

Sounds good - I agree it's best to focus on making something narrow in scope that works. A higher-level workflow schema would be a good way forward.
