
Chempound

Oliver Stueker edited this page May 11, 2015 · 3 revisions


Chempound

Chempound is a server for archiving and searching the outputs of computational chemistry calculations. It can be used as a standalone tool for managing the files on a user's personal computer, or as a managed server for curating the data generated by a group or company.

The website for chempound can be found at: http://www.chempound.net. This also contains links to download the latest version of the software and descriptions of how to use it.

An example of a chempound server containing the results of several thousand calculations can be found here: http://quixote.ch.cam.ac.uk.

The rest of this page is a temporary placeholder for information that will be moved to the chempound website, so please ignore for the time being!

Documentation

The existing documentation for Chempound can be found here:

* this page: http://quixote.wikispot.org/Chempound
* the chempound website: http://www.chempound.net
* Jorge's repository: https://bytebucket.org/jestrada/quixote-docs/wiki/main/quixote-main.html
* Sam's in-press JODI paper: http://wwmm.ch.cam.ac.uk/~sea36/chempound/

Repositories

The repository for the chempound packages is hosted on bitbucket: https://bitbucket.org/chempound

Using Chempound

With a functioning chempound repository in place, we can now start to query the data held within it.

For simple searches, we can just **Browse** through the files, or use the simple **Search** functionality on the web interface to pull out entries of interest.

This is fine for small, arbitrary searches, but Chempound also makes it very easy to automate searches and extract subsets of the data in a variety of ways.

Chempound uses a RESTful interface: each calculation has its own url, and the format in which the data is returned depends on how the request to the server is made (in practice, on the HTTP Accept header sent with it).

The currently supported formats are:

* html
* xhtml
* json
* rdf/xml
* rdf/turtle
* rdf/N3

As an example, take a computational chemistry calculation done with the Gaussian code and hosted on the Cambridge Chempound server. If we go to the url for the calculation with a browser, we get an html Splash Page with a human-readable summary of the calculation and the ability to view the structure in jmol:

http://quixote.ch.cam.ac.uk/content/compchem/spectra-dspace/to-8800_8899/to-8893/

We get this page because our browser has requested a text/html representation of the resource.

Getting json with Python

The following python script sets the http Accept header to request json, and then prints out the json returned.
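A minimal sketch of such a script, using only the Python standard library, might look like the block below. It assumes the server honours the standard `application/json` media type in the Accept header; the helper names (`build_request`, `fetch_json`, `pretty`) are illustrative and not part of Chempound.

```python
import json
import urllib.request

# URL of the example calculation quoted earlier on this page.
ENTRY_URL = ("http://quixote.ch.cam.ac.uk/content/compchem/"
             "spectra-dspace/to-8800_8899/to-8893/")


def build_request(url, media_type="application/json"):
    """Return a GET request whose Accept header asks for media_type."""
    return urllib.request.Request(url, headers={"Accept": media_type})


def fetch_json(url, timeout=30):
    """Fetch url asking for JSON and decode the response body.

    ASSUMPTION: the server answers 'Accept: application/json' with UTF-8 JSON.
    """
    with urllib.request.urlopen(build_request(url), timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


def pretty(data):
    """Format decoded JSON with indentation so it is easy to read."""
    return json.dumps(data, indent=2, sort_keys=True)


# Usage (needs network access to a running Chempound server):
#     print(pretty(fetch_json(ENTRY_URL)))
```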

This outputs:

Within chempound, the various files that make up the entry for the calculation are grouped together as an ORE object. The **resources** key of the json object holds these, and includes the uri of the original log file, the cml file, the gif picture generated by jmol, etc.

Requesting the RDF

Chempound is built on RDF and a primary component is a triple store containing RDF statements describing the structure of the data, and its associated metadata.

If we query the url and request the rdf serialised as xml, we can receive an object that contains the full data of the object, including the links to the files. The following python script does this and prints out the resulting rdf/xml:
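As a hedged sketch, the request can be made with the standard library by sending `Accept: application/rdf+xml` (the registered media type for RDF/XML; that Chempound keys on exactly this value is an assumption). The second helper, `described_resources`, is an illustrative addition that lists every `rdf:about` URI in the returned document, which is a quick way to see which resources the entry describes.

```python
import urllib.request
import xml.etree.ElementTree as ET

# Standard RDF syntax namespace, used for the rdf:about attribute.
RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"


def fetch_rdf_xml(url, timeout=30):
    """Return the raw RDF/XML bytes for url.

    ASSUMPTION: the server selects RDF/XML for 'Accept: application/rdf+xml'.
    """
    request = urllib.request.Request(
        url, headers={"Accept": "application/rdf+xml"})
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read()


def described_resources(rdf_xml):
    """Return the rdf:about URI of every element that carries one."""
    about = "{%s}about" % RDF_NS
    root = ET.fromstring(rdf_xml)
    return [el.attrib[about] for el in root.iter() if about in el.attrib]


# Usage (needs network access to a running Chempound server):
#     data = fetch_rdf_xml("http://quixote.ch.cam.ac.uk/content/compchem/"
#                          "spectra-dspace/to-8800_8899/to-8893/")
#     print(data.decode("utf-8"))
#     print(described_resources(data))
```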

SPARQL queries

SPARQL is a query language for extracting data represented as RDF, in much the same way that SQL is a language for querying data in relational databases. As the data in Chempound is stored as RDF, SPARQL is the language of choice for making complex queries against the stored data.

A good - and chemistry related - tutorial on SPARQL can be found here.

The chempound webserver provides a page where SPARQL queries can be typed into a webpage and the results returned as html or RDF. The SPARQL page on the Cambridge server can be found here.

The easiest way to get to grips with SPARQL is to dissect a simple query:

The crucial line is the one stating: **?molecule "H 2 O 1" .**

This uses the RDF subject:predicate:object pattern. The subject is the variable **molecule** (variables in SPARQL are prefixed with a **?**, although you can also use **$**), the predicate is a uri which references the CML schema, and the object is a string literal. The statement is then terminated by a full stop.

What this says is that we want to assign to the variable molecule all the entities where the cml **formula** property is **"H 2 O 1"**.

The **SELECT** statement says that we want the query to return the molecule variable, which will contain the list of all objects that matched the statement.
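The shape of such a single-pattern query can be sketched as a small Python helper that assembles the query text. The predicate URI used here is a placeholder under `example.org`, not the real CML schema URI, which should be copied from the RDF of an entry on your server; the function name is also illustrative.

```python
# ASSUMPTION: placeholder predicate URI. Replace it with the formula
# predicate that appears in the RDF served by your Chempound instance.
FORMULA_PREDICATE = "http://example.org/cml-schema#formula"


def simple_formula_query(formula, predicate=FORMULA_PREDICATE):
    """Build a query with one subject-predicate-object pattern.

    The subject is the variable ?molecule, the predicate is a full URI in
    angle brackets, and the object is a quoted string literal.
    """
    # Escape backslashes and double quotes so the literal stays well formed.
    literal = formula.replace("\\", "\\\\").replace('"', '\\"')
    template = 'SELECT ?molecule\nWHERE {\n  ?molecule <%s> "%s" .\n}'
    return template % (predicate, literal)


print(simple_formula_query("H 2 O 1"))
```

The printed text can be pasted into the SPARQL page of a server once the placeholder predicate has been replaced.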

If we run this against the cambridge chempound server we get back something like the following:

This returns the uris of all the water molecules in the database.

We can now look at a more advanced query:

The first line is equivalent to declaring a namespace in xml, and associates a convenient label with a long uri, so that instead of writing ****, we can just write **cml**.

We are now selecting 3 variables from our dataset, and they will be returned in the order we have listed them. The **WHERE** statement has been omitted as it is implicit.

The next two lines by themselves would select all entities in the database (and return them in the molecule variable) that had the cml properties **formula** and **inchi**. However, we are filtering the returned data to restrict the values returned to those where the value contained in the formula variable is **"H 2 O 1"**.
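A query of this shape (a prefix declaration, several selected variables, an implicit WHERE and a FILTER) can be sketched with the helper below. The namespace is a placeholder rather than the real CML schema URI, and the property names passed in the usage line are assumptions that should be checked against the RDF of a real entry.

```python
# ASSUMPTION: placeholder namespace standing in for the CML schema URI.
CML_NAMESPACE = "http://example.org/cml-schema#"


def filtered_query(properties, filter_on, value, namespace=CML_NAMESPACE):
    """Select ?molecule plus one variable per property, keeping only the
    rows where the variable named filter_on equals the string value."""
    if filter_on not in properties:
        raise ValueError("filter_on must be one of the selected properties")
    literal = value.replace("\\", "\\\\").replace('"', '\\"')
    lines = ["PREFIX cml: <%s>" % namespace]
    # Variables are returned in the order they are listed after SELECT.
    lines.append("SELECT ?molecule " + " ".join("?" + p for p in properties))
    lines.append("{")  # the WHERE keyword is optional in SPARQL
    for prop in properties:
        lines.append("  ?molecule cml:%s ?%s ." % (prop, prop))
    lines.append('  FILTER (?%s = "%s")' % (filter_on, literal))
    lines.append("}")
    return "\n".join(lines)


print(filtered_query(["formula", "inchi"], "formula", "H 2 O 1"))
```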

Discovering the available search terms

The data that is extracted into RDF and therefore available for searching in Chempound is determined by the convention and dictionaries that apply to the files in question.

Please follow these links for more information on conventions and dictionaries.

For CIF files, the CIF dictionary lists all the terms that are available.

For Computational Chemistry outputs, the CompChem dictionary lists the indexed terms.

To determine how best to search for data, it is usually useful to go to the splash page for a representative structure in chempound and download the RDF file. This shows the form of the RDF and therefore how a query needs to be constructed.

For example, if we wish to search on the cell_measurement_temperature, looking at the RDF for a CIF file, we see it is structured as shown below:

If we just search for the cell_measurement_temperature, we will be returned the RDF resource. We therefore need to go one step further and extract the value, which is done with the following query:

A similar example for a CompChem file is shown below. This searches on a term in the compchem dictionary, and then filters the value for only those structures with a charge of 0.

Remote Chempound SPARQL queries with Python

The Chempound SPARQL page will return the results as html or rdf/xml. The rdf/xml can of course be saved and processed offline, but it is more useful to be able to query and download the results all from within a single script.

As the Chempound SPARQL endpoint exposes a RESTful API, we can query it directly. The following python script executes a SPARQL query against chempound and then saves the result as a csv (comma-separated values) file, so that the results of the query can be imported into a spreadsheet program for (e.g.) plotting a graph of the results.
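A rough sketch of such a script follows. It makes several assumptions that should be checked against a real server: that the endpoint accepts the query in a `query` URL parameter (as in the W3C SPARQL Protocol), and that it can return the W3C SPARQL Query Results XML format. The endpoint URL is left as a parameter because it is not given here, and the function names are illustrative.

```python
import csv
import io
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

# Namespace of the W3C SPARQL Query Results XML format.
SR = "{http://www.w3.org/2005/sparql-results#}"


def run_query(endpoint, query, param="query", timeout=60):
    """Send query to endpoint with a GET request and return the raw bytes.

    ASSUMPTION: both the parameter name and the Accept value are guesses
    based on the SPARQL Protocol, not confirmed Chempound behaviour.
    """
    url = endpoint + "?" + urllib.parse.urlencode({param: query})
    request = urllib.request.Request(
        url, headers={"Accept": "application/sparql-results+xml"})
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read()


def results_to_csv(results_xml):
    """Convert SPARQL Query Results XML into CSV text, one row per result."""
    root = ET.fromstring(results_xml)
    names = [v.attrib["name"] for v in root.iter(SR + "variable")]
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(names)  # header row: the selected variable names
    for result in root.iter(SR + "result"):
        row = dict.fromkeys(names, "")  # unbound variables stay empty
        for binding in result.findall(SR + "binding"):
            if len(binding):  # wraps a single uri, literal or bnode element
                row[binding.attrib["name"]] = binding[0].text or ""
        writer.writerow([row[name] for name in names])
    return out.getvalue()


# Usage (sparql_url is whatever your server's SPARQL endpoint is):
#     text = results_to_csv(run_query(sparql_url, "SELECT ..."))
#     with open("results.csv", "w", newline="") as handle:
#         handle.write(text)
```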

If, however, we change the http headers of our request, we can request the data in json. You can do this on Windows, OSX or Linux from the command line using the telnet program. Type the request below, followed by a blank line to end the headers; sending **Accept: text/html** instead returns the html page:

    telnet quixote.ch.cam.ac.uk 80
    GET /content/compchem/spectra-dspace/to-2300_2399/to-2301 HTTP/1.0
    Accept: application/json

Hacking Chempound

This section is for those who may be interested in altering or extending Chempound. It isn't intended to be a programmer's manual, more a brief overview of chempound's current structure and a walk-through on how to add additional CML data to the repository, which is expected to be the reason why most people would currently want to extend Chempound.

**NB:** additional information can be found in Jorge Estrada's repository.

Chempound is actually a very general tool for managing collections of objects (collected as ORE aggregates) and their associated data and metadata, using RDF for the data model. As such, almost all of the chemistry functionality is implemented using plugins, so the code that needs to be modified to change the chemistry behaviour is very localised.

Overview of the repositories

The repositories for the chempound packages are hosted on bitbucket: https://bitbucket.org/chempound

Currently, there are 9 repositories, as detailed below:

* https://bitbucket.org/chempound/ - this contains the main server code. There is almost no chemistry-specific code here, apart from in the chempound-rdf-cml directory, which has a very small class to add some CML data to the RDF model.
* https://bitbucket.org/chempound/chemistry - this is where the most general chemistry code lives, and where the general functions to handle the conversion of data from CML are.
* https://bitbucket.org/chempound/chempound-client - the base classes for the command-line client are here (it is the client that actually handles the conversion of logfiles into CML and the generation of the jmol pictures etc), although there is no chemistry-specific code here.
* https://bitbucket.org/chempound/chempound-parent - this just contains the central maven pom.xml that is used to configure maven for chempound.
* https://bitbucket.org/chempound/compchem - all the code to handle the data associated with computational chemistry calculations (both server and client) lives here.
* https://bitbucket.org/chempound/crystallography - all the code to handle the crystallography-specific aspects of the data.
* https://bitbucket.org/chempound/deposit-client - TODO - not had to look at this yet.
* https://bitbucket.org/chempound/quixote-client - the code to drive the code-specific imports of compchem logfiles.
* https://bitbucket.org/chempound/quixote-repository - this is more code to package chempound for use by the quixote project and create the stand-alone chempound server war file.

A slightly more detailed view of the chemistry-specific repositories and their modules follows below.

|| **Repository** || **Modules** || **Description and important classes** ||
|| chemistry || chemistry-common || Classes to handle the generic processing of CML datatypes and the conversion to RDF ||
|| || || * **net.chempound.chemistry.cmlChemicalMine.java** - mime types ||
|| || || * **net.chempound.chemistry.Cml2RdfConverter.java** - code to handle the conversion of generic, simple cml datatypes into RDF. ||
|| || chemistry-importer || Base classes for the client-side conversion of files and the generation of images ||
|| || chemistry-jmol-plugin || Classes to drive jmol to generate the images, and also the jmol code itself ||
|| || chemistry-search-structure || Classes to handle the chemistry-specific search page - if you want to add more chemistry search boxes, then you'll need to edit things here. ||
|| compchem || compchem-common || General code related to the compchem RDF data structures. The utility functions used by the freemarker templates to access the compchem data live here. ||
|| || compchem-handler || Code to handle the processing of chemical data on the server, such as display of the html pages and the freemarker templates. ||
|| || compchem-importer || The classes to handle importing code-specific logfiles (NWChem, Gaussian etc) using the jumbo-classes. These classes are used by the client, not chempound itself. The test cases for checking the imports also live here. ||
|| || compchem-test-harness || Code to test the various compchem-specific modules, as most do not contain any test code themselves. ||

Adding New Data and editing the Splash page

Chempound extracts data from CML in accordance with the compchem convention. Provided that the data is a CML scalar, and is in the job's environment, initialization or finalization modules, with a dictRef (ideally) in the compchem dictionary, then the data will already be extracted into RDF.

If additional data needs to be extracted (such as is currently done for basis sets and dft functionals), then all that may be necessary is to edit the file CmlComp2RdfConverter.java to add the additional data to the RDF.

The html pages in chempound are generated using the freemarker template engine. The freemarker template that is used to generate the html page for each individual structure is the file: comp.ftl (other template and css files are in the parent directory).

In order to facilitate extracting key RDF data for use with the freemarker templates, several classes are used. For adding new terms, the following files need to be edited:

* CompChemCalculation.java - this defines the interface that will be used by the freemarker template to access the data.

* CompChem.java - this creates the RDF terms that are used.

* CompChemCalculationImpl.java - this actually implements the functions to get the data.

When the new terms have been added, the tests should be updated, or a new test added in the directory https://bitbucket.org/chempound/compchem/src/ef32d64ba51b/compchem-importer/src/test/java/net/chempound/compchem

If the new terms are to be added to the chemistry search page, then the CompChemSearchProvider.java file will need to be edited, and suitable tests added to the file CompChemSearchIntegrationTest.java

Installing Chempound

For installing chempound for personal use on a local machine, the getting-started notes should be sufficient.

The following instructions apply for installing Chempound on an existing server, for use by an institution or group.

Installing Chempound into an existing Jetty server

Chempound is a pure java program, so can be run in any java container. These instructions are specific to installing it into jetty for use on unix systems.

The latest version of the war file for chempound can be downloaded here.

There are any number of ways to configure jetty, so this just describes one way, with some pointers to the other possibilities.

If you do not already have jetty installed on your server, and it is not available within the package management software for your distribution, a jetty hightide distribution can be downloaded from codehaus.

The following instructions assume that you have a jetty server, with a directory structure similar to the following (only the relevant files and directories are listed).

The file **start.jar** is the java file used to start jetty with the command:

By default, on startup, jetty will parse the file **start.ini**, which contains command-line options for the server, including the list of modules to include, and a list of XML configuration files that determine various options (these are listed one per line in the start.ini files and can be removed by commenting the line out with the **#** character). By default, the XML files reside in the **etc** directory. In this example, only one configuration file is used, the **jetty.xml** file in the **etc** directory.

This file contains the following:

There are two ways that jetty is usually configured to serve applications:

* jetty can monitor a directory (by default the **webapps** directory), and any **.war** files placed there will be served at a URL determined from the name of the war file (i.e. **quixote-repository-webapp-0.1-SNAPSHOT.war** would be served at the URL **/quixote-repository-webapp-0.1-SNAPSHOT** relative to the base server url).
* jetty can monitor a directory (by default the **contexts** directory) for XML files, and these will then be parsed to determine the location of the application's war file and the options required for serving the application.

This example uses the second approach, with the final block of XML in the **jetty.xml** above configuring jetty to monitor the contexts directory. The contexts directory contains one file, **quixote.xml**, the contents of which are shown below (with comments to explain relevant bits):

**NB:** For general information and examples of context files, please see the jetty wiki.

The first two **Set** commands should be self-explanatory.

The next block sets two important variables that are needed by chempound:

* **chempound.uri** - this is a string included in the html pages served by chempound and is used to set the url where various files (such as the CSS files) are expected to be found. It should be the full url where the base chempound server will be found, such as **http://cdsora4.dl.ac.uk/chempound**.
* **chempound.workspace** - this is the path to a locally accessible directory on the server where all the files needed by chempound will be stored. The actual files held by chempound (such as the logfiles, CML files etc) are stored in this directory in the **content** folder.

These two variables can also be set by setting them as environment variables before the server is started, or setting them on the command-line when the server is started as shown below:
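As an illustration of the command-line route, the sketch below assembles a jetty start command in Python. It assumes the two variables are passed as Java system properties with `-D`, which is the usual mechanism but is not confirmed here; the workspace path in the usage line is hypothetical.

```python
def jetty_command(uri, workspace, start_jar="start.jar"):
    """Return the argument list for starting jetty with both variables set.

    ASSUMPTION: chempound reads these as Java system properties (-Dname=value).
    """
    return [
        "java",
        "-Dchempound.uri=" + uri,
        "-Dchempound.workspace=" + workspace,
        "-jar",
        start_jar,
    ]


# Hypothetical workspace path, for illustration only.
command = jetty_command("http://cdsora4.dl.ac.uk/chempound", "/var/chempound")
print(" ".join(command))
```

The resulting list can be passed to `subprocess.run` from the jetty home directory, or the printed line can be typed at a shell prompt.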

Security Considerations

In order to make chempound available on a standard URL (such as http://cdsora4.dl.ac.uk/chempound), the server needs to listen for TCP requests on port 80.

On unix systems, only processes started by root are permitted to bind to ports numbered less than 1024, which would entail a requirement to run the chempound jetty server as root. However, this is not considered a good security practice, and there is no other reason why the server needs to run as root.

On debian-based systems, a way around this is the **authbind** package, which allows users to bind non-root servers to a low-numbered port.

Another approach is to start the server under a non-root user, binding to a high-numbered port and to use a firewall to redirect requests from port 80 to the port the server is listening on. If the server was started on port 8080, then the iptables rule to accomplish this would be:
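One common form of such a redirect is a REDIRECT rule in the PREROUTING chain of the nat table. The Python helper below simply assembles the iptables arguments for it; this is a sketch of a typical rule, not necessarily the one the authors used, and it must be run as root and adapted to the firewall setup of your system.

```python
def redirect_rule(from_port=80, to_port=8080):
    """Return iptables arguments that redirect incoming TCP traffic
    arriving on from_port to the local port to_port."""
    return [
        "iptables", "-t", "nat",          # rules for address translation
        "-A", "PREROUTING",               # applied to incoming packets
        "-p", "tcp", "--dport", str(from_port),
        "-j", "REDIRECT", "--to-ports", str(to_port),
    ]


print(" ".join(redirect_rule()))
```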

**NB:** if using this method, it is important to remember that the **chempound.uri** variable will need to be set to point at the url as visible externally, and should not include the port number, as otherwise the CSS files will not be found.

Debugging Chempound

If Chempound is not working as expected, the logging facility can be used to increase the amount of information printed, which is useful for tracking down the causes of problems.

The logging subsystem consists of the interface SLF4J 1.6.1 (Simple Logging Facade for Java) and the implementation LOG4J 1.2.

The included configuration for DepositNWChem is:

However, you can change the logging behavior of the application by adding your own log4j.properties file to the classpath. For instance, the following configuration file will set the general log level to INFO and, for class uk.ac.cam.ch.wwmm.chempound.compchem.CmlComp2RdfConverter, the level will be DEBUG.
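A configuration of that kind can be sketched with a small Python helper that writes out the properties text. The keys follow the standard log4j 1.2 properties syntax; the appender name `stdout` and the conversion pattern are illustrative choices rather than the settings shipped with Chempound.

```python
def log4j_properties(root_level="INFO", overrides=None):
    """Return log4j 1.2 properties text with a console appender.

    overrides maps a logger (class or package) name to its own level.
    """
    lines = [
        "log4j.rootLogger=%s, stdout" % root_level,
        "log4j.appender.stdout=org.apache.log4j.ConsoleAppender",
        "log4j.appender.stdout.layout=org.apache.log4j.PatternLayout",
        # Timestamp, padded level, logger name, message, newline.
        "log4j.appender.stdout.layout.ConversionPattern="
        "%d{ISO8601} %-5p %c - %m%n",
    ]
    for logger, level in sorted((overrides or {}).items()):
        lines.append("log4j.logger.%s=%s" % (logger, level))
    return "\n".join(lines) + "\n"


text = log4j_properties(overrides={
    "uk.ac.cam.ch.wwmm.chempound.compchem.CmlComp2RdfConverter": "DEBUG",
})
print(text)
```

Saving the printed text as log4j.properties on the classpath gives a general level of INFO with DEBUG output for the one named class.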

For example, if you place the log4j.properties configuration file in your current working directory and you run DepositNWChem from there, you can add the current directory to the classpath as follows (it assumes you have the jar file of DepositNWChem with its dependencies in a target subdirectory):

You can use logging anywhere in the code. You will need to grab a Logger object to pass the logging messages. Simply import the Logger and LoggerFactory classes, and call LoggerFactory.getLogger to obtain a Logger object. Then call any of the debug, info, warn or error methods to log your message at the appropriate log level.

The following code snippet shows how to get the root Logger as well as another child Logger (identified with the uk.ac.cam.ch.wwmm.chempound.compchem.CmlComp2RdfConverter class name) and how to emit an INFO level message.

Debugging Chempound running under Jetty

A simple way to debug chempound when running under jetty is to add the following lines to the jetty **start.ini** file, which is used to prepend command-line arguments to jetty (the arguments can also be added to the command-line when starting jetty, or indeed to any java program that supports log4j):

The first line turns on debugging for log4j itself - this is useful as it causes log4j to print which configuration file it is using. The second line gives the path to a log4j configuration file, which should contain the directives as described above.
