Prototype_data

This page is for sorting out the data we will be extracting from the log files for the prototype.

This is put together from: http://okfnpad.org/quixote-data

The first section lists the data we need and there are two sections examining an NWChem and a Gaussian log file to see where that information is to be found.


Required data

External metadata

These are not likely to be in the output file but should be added:

 * **identifier** [SINGLE] See sec. 4.14 at http://dublincore.org/documents/usageguide/elements.shtml...  It's critical that all calculations have an identifier and we should think out a semi-semantic scheme for general use - based on domain-names and URIs

 * **url/location of original log file**  [SINGLE]  -- I WOULD SAY THIS IS NOT ESSENTIAL FOR SEARCHING, BUT IT IS VERY USEFUL DATA. It's very useful for those creating and maintaining the files locally. Until the files are publicly posted this is the primary address of the data

 * **name of submitter**  [SINGLE] This will be critical for publication

 * **name of creator** [SINGLE] and this

 * **email/contact details of submitter**  [STRING]  -- I WOULD SAY THIS IS NOT ESSENTIAL FOR SEARCHING, BUT IT IS VERY USEFUL DATA. Again, this is needed for publication. CIFs carry all the above

 * **date of submission** [STRING] highly desirable

 * **associated publications** (if any)  [STRING] yes - this is harder as it represents the whole scholarly environment. It's the sort of thing we have tackled in OREChem

 * **rights** [STRING] See http://creativecommons.org/science and the Open Data section at the wiki (http://quixote.wikispot.org/Related_and_links) Critically important

 * **subject** [STRING] Subject, keywords, classification codes, or key phrases describing the resource. (see the isCitedBy article) This would be for motivation and environment. We should be able to extract the actual compchem keywords automatically

 * **description** [STRING]

Internal metadata

This should be contained in the file

 * **program** what was used to run the calculation  [STRING]

 * **program version**  [STRING] Essential

 * **date run**  [STRING] essential

 * **title** (often gives reason for calculation)  [STRING] This is really important - won't be easy as it should reference related calculations

 * **OS** [STRING]
 * **no of cores** [INT]
 * **architecture** [STRING] (Intel, etc.)
 * **memory** [INT]

Definition of the system

 * Molecular structure (connection tables) [CMLMolecule]
 * **chemical formula** [CMLFormula] (compositional formula).

 * **charge** [INTEGER] CML has formal charge. We can also use heuristics for organic molecules to check

 * **multiplicity** [INTEGER] (maybe this is more logically included in the provenance section, since it is a constraint on the wavefunction, and therefore part of the method). Again, CML can hold this

 * **geometry/structure/nuclear coordinates**  [ARRAY] NB - for the prototype we ignore Z-matrices - just need coordinates

Provenance (type of calculation)

 * **basis set** (either with an agreed-upon name from EMSL BSE, or noted as a custom basis set) [SINGLE] Yes - we should probably use EMSL if possible

 * **level of the theory** (RHF, DFT, MP2, AM1, CC, MRCI, SDCI, CCSD, CCSD(T), etc.) [SINGLE].

 * **additional theory details** (DFT functional, frozen core, etc.)  [STRING] Does EMSL do this? If not we need to come up with a normalized representation. They don't and we do.

 * **calculation type** (single point, optimisation, frequency, etc) [SINGLE]

 * **list of properties** (e.g. energy, spectra...) [LIST]
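As a rough illustration of how the required data above might be held in one place, here is a minimal Python sketch. The class and field names (`CalculationRecord`, `Atom`, `log_url`, `missing_essentials` and so on) are assumptions made for this example and are not part of CML or of any existing converter. Only a subset of the fields listed above is included.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Atom:
    element: str                     # element symbol, e.g. "O"
    xyz: Tuple[float, float, float]  # nuclear coordinates as found in the log


@dataclass
class CalculationRecord:
    # External metadata: supplied by the submitter, not parsed from the log.
    identifier: str
    submitter: str
    creator: str
    log_url: Optional[str] = None
    rights: Optional[str] = None
    # Internal metadata: expected to be in the log file itself.
    program: Optional[str] = None
    program_version: Optional[str] = None
    date_run: Optional[str] = None
    title: Optional[str] = None
    # Definition of the system.
    charge: Optional[int] = None
    multiplicity: Optional[int] = None
    atoms: List[Atom] = field(default_factory=list)
    # Provenance (type of calculation).
    basis_set: Optional[str] = None
    level_of_theory: Optional[str] = None
    calculation_type: Optional[str] = None
    properties: List[str] = field(default_factory=list)

    def missing_essentials(self) -> List[str]:
        """Names of the fields marked 'essential' above that are still unset."""
        essential = ("program_version", "date_run")
        return [name for name in essential if getattr(self, name) is None]
```

A parser could fill such a record field by field and call `missing_essentials()` at the end to report what it failed to find.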

NWChem

This is based on the NWChem output:

https://bitbucket.org/wwmm/jumbo-converters/src/df04a8b0be7f/jumbo-converters-compchem/src/test/resources/compchem/nwchem/log/in/test1.out

and the parsed output:

https://bitbucket.org/wwmm/jumbo-converters/src/df04a8b0be7f/jumbo-converters-compchem/src/test/resources/compchem/nwchem/log/ref/test1.xml

NB: in retrospect, this isn't an ideal example, as it's a single file with multiple jobs in it, and a DFT calculation, which is one of the hardest ones to quantify, but it should at least serve to highlight some of the issues we'll encounter.

Where the data is already captured in the xml, the identifier is listed.

NWChem internal metadata

 * **program**

Will have to infer code from the header e.g.:

 * **program version**

Captured by: **nwchem:nwchem_branch**

 * **date run**

Captured by: **nwchem:date**

 * **title**

Within NWChem, a title is optional. Each module can have its own title, and there can also be a general title for the calculation. The title is printed beneath each module header, e.g.:

If there is a "general" title it is printed by the input module:

NWChem definition of the system

 * **chemical formula**

Unfortunately (by default at least) NWChem doesn't print out the chemical formula, so it would have to be determined from the structure (see below).

 * **charge**

Each module prints out the charge in its own way.

For the test1.out calculation, the DFT module charge is captured in: **nwchem:Charge**

 * **multiplicity**

Again - printed by each module. Captured by: **nwchem:Spin_multiplicity**

 * **geometry/structure/nuclear coordinates**

This is captured by the block associated with: **nwchem:geom**
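Since NWChem does not print the chemical formula by default, it has to be derived from the element symbols in the geometry. A minimal sketch of that step, assuming the symbols have already been extracted from the geometry block into a list (the function name `hill_formula` is made up for this example), using the Hill ordering convention:

```python
from collections import Counter
from typing import Iterable


def hill_formula(symbols: Iterable[str]) -> str:
    """Build a compositional formula in Hill order from element symbols.

    Hill order: if carbon is present, C first, then H, then the remaining
    elements alphabetically; otherwise all elements alphabetically.
    """
    counts = Counter(symbols)
    if "C" in counts:
        order = ["C"] + (["H"] if "H" in counts else [])
        order += sorted(s for s in counts if s not in ("C", "H"))
    else:
        order = sorted(counts)
    # A count of 1 is left implicit, as is usual in formulae.
    return "".join(s + (str(counts[s]) if counts[s] > 1 else "") for s in order)
```

For example, `hill_formula(["O", "H", "H"])` gives `H2O`.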

NWChem provenance (type of calculation)

 * **basis set**

NWChem can have multiple basis sets defined within an input file, any of which can be used by a particular module.

The default basis is always called "ao basis".

The basis is (usually) printed explicitly, although there is also a summary.

If using DFT, the charge density fitting basis is called "cd basis" and the exchange-correlation fitting basis is called "xc basis"

The summary for the basis set called "ao basis" is the block:

If the same basis set is applied to each atom, the "Description" field should suffice as the basis set description (which we then need to map to the appropriate EMSL descriptor).

If different basis sets are used on different atoms, then there is no single string that can be used to label the basis set.
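The rule above (one label only if every atom shares the same description) could be sketched as follows. This assumes the "Description" column of the basis summary has already been parsed into a mapping from atom tag to description; the function name and the case-insensitive comparison are assumptions of this example, not something the log format guarantees.

```python
from typing import Dict, Optional


def single_basis_label(basis_by_atom: Dict[str, str]) -> Optional[str]:
    """Return the shared basis description, or None if atoms differ.

    basis_by_atom maps an atom tag (e.g. "O") to the description string
    found for it in the basis summary.
    """
    # Compare descriptions ignoring case and surrounding whitespace.
    labels = {desc.strip().lower() for desc in basis_by_atom.values()}
    if len(labels) == 1:
        return next(iter(basis_by_atom.values())).strip()
    return None
```

A `None` result signals the mixed case, where the per-atom descriptions would need to be stored individually.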

 * **level of the theory**

This will usually be determined by the name of the module running the type of calculation, together with the options specified for that module.

For test1.out, the module (for all calculations) is:

"NWChem DFT Module" i.e. the DFT module, so theory is DFT.
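A minimal sketch of picking the theory out of module headers, assuming every relevant header has the form "NWChem XXX Module" as in the DFT case above. The regular expression and the decision to skip the input module are assumptions of this example; other header styles would not be caught.

```python
import re
from typing import List

# Matches headers such as "NWChem DFT Module" and captures the middle word.
MODULE_HEADER = re.compile(r"NWChem\s+(\S+)\s+Module")

# The input module also prints a header of this form but is not a theory.
NON_THEORY_MODULES = {"Input"}


def theories_in_log(log_text: str) -> List[str]:
    """Return the module names found in the log, in order of appearance."""
    found = MODULE_HEADER.findall(log_text)
    return [name for name in found if name not in NON_THEORY_MODULES]
```

For a file with several jobs, such as test1.out, this yields one entry per module header, so repeated names are expected.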

 * **additional theory details**

The functional in use is specified by the block:

In this case, this defines the default, which to quote the manual is:

//The default exchange-correlation functional is defined as the local density approximation (LDA) for closed shell systems and its counterpart the local spin-density (LSD) approximation for open shell systems. Within this approximation the exchange functional is the Slater ρ^(1/3) functional (from J.C. Slater, Quantum Theory of Molecules and Solids, Vol. 4: The Self-Consistent Field for Molecules and Solids (McGraw-Hill, New York, 1974)), and the correlation functional is the Vosko-Wilk-Nusair (VWN) functional (functional V) (S.J. Vosko, L. Wilk and M. Nusair, Can. J. Phys. 58, 1200 (1980)). The parameters used in this formula are obtained by fitting to the Ceperley and Alder Quantum Monte-Carlo solution of the homogeneous electron gas. //

For DFT you could also specify the grid etc, but it won't be needed for the prototype.

 * **calculation type**

This job is a series of single-point calculations with DFT. If it were an optimisation, this would be indicated by a block starting with:

and there would then be multiple single-point calculations within that.

A frequency run would start with the block

 * **list of properties**

For the prototype, just the energy should suffice.

For test1.out, for the first calculation, this is the string:

Gaussian

This is based on the Gaussian output:

https://bitbucket.org/wwmm/jumbo-converters/src/bcdeceeed021/jumbo-converters-compchem/src/test/resources/compchem/gaussian/log/in/pablo_formyl-alanyl-amide_nohashP.log

The parsers are currently being developed so there is no xml to check against at the moment.

Gaussian internal metadata

This should be contained in the file

 * **program**

 * **program version**

The best place to get both pieces of data is a piece of text close to the beginning of the logfile, reading in this case:

 * **date run**

The last line of the logfile reads:

 * **title**

After the version bit, Gaussian echoes some input directives:

and then some numerical values corresponding to the so called IOp's:

The next thing is the title, in this case:

Gaussian definition of the system

 * **chemical formula**

This can be extracted from many places:

 1. It can be deduced from the geometry, nuclear coordinates (see below).
 1. At the end of the file, in the "archive" part, we can read:

which apparently is the information about the point group and the type and quantity of the atoms.

 1.#3 This example file is a geometry optimization in which each step begins by

In each one of these blocks, we can find the text

 * **charge**

 * **multiplicity**

When Gaussian echoes the input (see above), after

it prints

where the two last bits of data can be extracted.
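A minimal sketch of extracting the two values, assuming the line has the form `Charge =  0 Multiplicity = 1`, which is how Gaussian logs commonly print it. The sample lines in the usage note are illustrative and are not copied from the example file; the function name is made up for this example.

```python
import re
from typing import Optional, Tuple

# Assumed line format: " Charge =  0 Multiplicity = 1"
CHARGE_MULT = re.compile(r"Charge\s*=\s*(-?\d+)\s+Multiplicity\s*=\s*(\d+)")


def charge_and_multiplicity(log_text: str) -> Optional[Tuple[int, int]]:
    """Return (charge, multiplicity) from the first matching line, or None."""
    m = CHARGE_MULT.search(log_text)
    if m is None:
        return None
    return int(m.group(1)), int(m.group(2))
```

For instance, `charge_and_multiplicity(" Charge = -1 Multiplicity = 2")` returns `(-1, 2)`.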

 * **geometry/structure/nuclear coordinates**

As mentioned, this example is a geometry optimization in which the structure is changed at each step.

Each step begins and ends by

printing additional information after the Grad data.

The next lines are

for the first step,

for the subsequent ones, and the final one contains

in the Grad block if the optimization converged.

In the three cases, the best places to get the Cartesian coordinates are either

in Z-matrix orientation, or from

in the standard one.

Both tables are found after the corresponding Grad blocks.

Gaussian provenance (type of calculation)

 * **basis set**

This information can be taken from several places:

 1. When the input is echoed, we can read:

where the first line contains the basis set in the method/basisset notation.

 1.#2 After each grad block in each optimization step, we can read:

 1.#3 In the first lines of the final archive section, we have it also:

 * **level of the theory**

The simplest place to get this is from the echoed input at the beginning of the logfile:

or from the archive section:

It is MP2 in this case.
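A minimal sketch of splitting an echoed route line in the method/basisset notation into its parts. The route string used in the usage note is hypothetical and is not copied from the example file; the function name and the returned dictionary keys are likewise assumptions of this example.

```python
import re
from typing import Dict, List, Optional


def parse_route(route: str) -> Dict[str, object]:
    """Split a Gaussian route line into method, basis and other keywords."""
    # Strip the leading "#" and an optional print-level letter (P, N or T).
    body = re.sub(r"^\s*#(?:[PNTpnt](?=\s))?\s*", "", route)
    method: Optional[str] = None
    basis: Optional[str] = None
    keywords: List[str] = []
    for token in body.split():
        if "/" in token and method is None:
            # The first token containing "/" is taken as method/basisset.
            method, basis = token.split("/", 1)
        else:
            keywords.append(token)
    return {"method": method, "basis": basis, "keywords": keywords}
```

With the hypothetical input `"#P MP2/6-31G(d) Opt"` this returns method `MP2`, basis `6-31G(d)` and keywords `["Opt"]`. Routes that give the method and basis as separate keywords are left entirely in `keywords` by this sketch.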

 * **additional theory details**

This is really scattered all over the place, and many details are not even printed. Probably the best place to pick them if one wants to be precise is from the IOp's mentioned before. Parsing the theory details without interpreting the manual, i.e., using only information in the logfile itself, is probably impossible.

 * **calculation type**

The fact that this is an optimization is signalled by the Opt keyword in the echoed input:

We have this repeated in the archive section:

 * **list of properties**

In this example there are two energies that are useful: the RHF and the MP2 ones. We can read them from

and

respectively, after each grad block.
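A minimal sketch of collecting both energies at every step, assuming the common Gaussian line formats `SCF Done:  E(RHF) = ...` and `EUMP2 = ...`, where the MP2 value uses a Fortran-style `D` exponent. The patterns and function names are assumptions of this example, and the numbers in the usage note are invented, not taken from the example file.

```python
import re
from typing import List

# Assumed formats:
#   " SCF Done:  E(RHF) =  -414.500000000     A.U. after   12 cycles"
#   " E2 =    -0.1250000000D+01 EUMP2 =    -0.41575000000000D+03"
SCF_DONE = re.compile(r"SCF Done:\s+E\([^)]+\)\s*=\s*(-?\d+\.\d+)")
EUMP2 = re.compile(r"EUMP2\s*=\s*(-?\d*\.\d+D[+-]\d+)")


def fortran_float(text: str) -> float:
    """Convert a Fortran double such as '-0.4D+03' to a Python float."""
    return float(text.replace("D", "E"))


def scf_energies(log_text: str) -> List[float]:
    """All SCF energies in the log, in order of appearance."""
    return [float(value) for value in SCF_DONE.findall(log_text)]


def mp2_energies(log_text: str) -> List[float]:
    """All MP2 total energies (EUMP2) in the log, in order of appearance."""
    return [fortran_float(value) for value in EUMP2.findall(log_text)]
```

For an optimization, the last element of each list would correspond to the final geometry, provided the run converged.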

At the end of the archive part, we can find the final energies again:

Gaussian Archive Files

Here is the parse of a simple Gaussian Archive File (\1\1GINC...)

Conventions, Dictionaries and CML

The information that was previously here concerning conventions and dictionaries has now been integrated into the convention and dictionary
