
Planning



Overview

Each day, countless calculations are run by thousands of computational chemistry researchers around the world, on everything from ageing, dusty desktops to the most powerful supercomputers on the planet.

It might be supposed that this would lead to a deluge of valuable data, but the surprising fact remains that most of this data, if it is archived at all, usually lies hidden away on hard disks or buried on tape backups – often lost to the original researcher and never seen by the wider chemistry community at all.

However, it is widely accepted that if the results of all these calculations were publicly accessible it would be extremely valuable as it would:

 * avoid the costly duplication of results.
 * allow different codes to be easily validated and benchmarked.
 * provide the data required for the development of new methods.
 * provide a valuable resource for data mining.
 * provide an easy, automated way of generating and archiving supporting information for publications.

In the rare cases when data is made openly available, the output of each calculation is inevitably produced in a code-specific format - there being no currently accepted output standard. This means that interpreting or reusing the data requires knowledge of the code, or the use of specific software that understands the output.

A standard output format would:

 * allow tools (e.g. GUIs) to operate on the input and output of any code supporting the format, vastly increasing their utility and range.
 * enable different codes to interoperate to create complex workflows.
 * allow data to be easily validated, provided a semantic model underlies the format.

The benefits of a common data standard and results databases are obvious, but previous efforts to realise them have failed, largely because of an inability to settle on a data standard or to provide useful tools that would make it worthwhile for code developers to spend the time making their codes compatible.

The Quixote project aims to tackle both of these problems in a pragmatic way, building an infrastructure that can be used to both archive and search calculations on a local hard-drive, or expose the data on publicly accessible servers to make it available to the wider community.

The data standard will be consolidated around the tools, and the tools themselves will encourage its adoption by giving code and tool developers an obvious reason to support it; the "If you build it, they will come" approach.

The project is rooted in the belief that scientific codes and data should be "open", and we are therefore focussing our efforts on using existing open-source solutions and standards where possible, and then developing any additional tools within the project.

The Quixote project is itself completely open, de-centralised and community-driven. We are composed of passionate researchers from around the globe and are happy to collaborate with anyone who shares our aims.

Requirements

The requirements of our solution are that it must:

 * be lightweight
 * be easy to install
 * be flexible and easy to update
 * have a simple user interface
 * support the major Quantum Chemistry codes

Vision and goals

See The different ways of looking at the world by Peter Murray-Rust.

Some goals are practical, some are midterm, some are wild:

Short term

 * To get a feeling for the current problems faced by people interacting with QC data
 * To converge on the core requirements about the data that we want to extract and index from QC datafiles
 * To collect most of the existing technology regarding the following section (no need to reinvent the wheel, or "rediscover the Mediterranean sea", as we say in Spain)

Mid term

 * To build/connect the tools to organize, validate, parse, index, aggregate, search and share Quantum Chemistry (or more generally Computational Chemistry) data. The knowledge gained here should prove useful in related fields such as Molecular Dynamics
 * To successfully deploy the tools and encourage their widespread uptake

Long term

 * To maintain the tools and accommodate any changes in the codes or in the users' practice
 * To increase the openness and availability of QC data

Longest term

 * To change the ways in which the community works, and to shift the positions of other players, such as journals and funding agencies, towards fostering the previous point
 * To increase the efficiency of the field and ultimately facilitate scientific advances

Scope

//Limits and context of Quixote, to keep it focused and manageable.//

**Note: This scope statement is still under revision.**

Quixote currently focuses on small molecules (no periodicity), and avoids dynamics and relativistic studies.

From an existing Quantum Chemical calculation, a user would upload the files to a public server running Quixote, which would parse the results (using, for instance, Jumbo converters), structure them with the help of a dictionary (such as CML and the related compchem dictionary), store them in a database (perhaps using the RDF format), and allow users of the public server to retrieve the structured data through web queries (in the style of the SPARQL language, for instance) and through HTML browsing.
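
To make the retrieval step concrete, here is a minimal sketch of such a query run with Python and rdflib against a local RDF store of parsed results. The file name, namespace and property terms (cc:molecularFormula, cc:totalEnergy) are invented placeholders, not actual compchem dictionary entries.

```python
# Minimal sketch: query a local RDF store of parsed calculation results.
# The file name, namespace and property names are illustrative assumptions.
from rdflib import Graph

g = Graph()
g.parse("calculations.ttl", format="turtle")  # hypothetical dump produced by the parsers

# Find calculations on a given molecular formula and return their total energies.
query = """
PREFIX cc: <http://example.org/compchem#>
SELECT ?calc ?energy WHERE {
    ?calc cc:molecularFormula "C4H4" .
    ?calc cc:totalEnergy ?energy .
}
"""
for calc, energy in g.query(query):
    print(calc, energy)
```

On a public server, the same query would simply be sent to a SPARQL endpoint over HTTP instead of being run against a local file.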

In the future, Quixote may be extended to work with further chemical systems and to offer more kinds of retrieval and analysis web utilities.

Stakeholders

//How Quixote may help you, and how you can help Quixote.//

**Note: This list of stakeholders is currently under revision.**

If you are ...
 * an **experimental chemist**, you may guide the design of your experiments using the data organized with the Quixote project tools. You can help Quixote by suggesting the kind of data and properties you would be interested in.
 * a **theoretical chemist**, you may use the data organized using the Quixote project tools. This data can help you in developing new theoretical models, or in extending and comparing existing ones. You can help Quixote by suggesting the methodological properties that should be present with all data. You can also suggest what kind of data you would like to see organized by Quixote.
 * a **computational chemistry software developer**, you can reuse Quixote project modules in your software. In addition, by coordinating your software development with Quixote, you can add features to your software so that your users can easily organize and publish their results. Particularly important for Quixote is knowing what kind of data the different software programs create, and which theoretical methods are implemented in the software.
 * a **computational scientist**, you are probably the main kind of user of Quixote. With it, you will be able to organize your computational results and, if you wish, easily share them with your collaborators or with the whole scientific community. You can also use Quixote to find out what kind of theoretical model best fits your problem. You can contribute to Quixote by suggesting what kind of data organization and search facilities you would like to see in it.
 * a **software engineer, software developer, or computer scientist**, you can contribute by sharing your knowledge in creating software products, or you can even directly contribute to the Quixote software development. You may also be interested in seeing how different software technologies are applied in Quixote.
 * a **librarian**, you have experience in organizing reference information, so you can help shape the way Quixote organizes the data it processes and decide what kind of metadata should accompany each data set. You may also consider using Quixote to help your library users organize and share their computational results.
 * a **teacher** in a higher education institution, you can show your students the results of computational studies, and teach them how to use computational Quantum Chemistry software and how to organize their results. You can contribute to the Quixote project by suggesting what kind of data and properties you would like to see in it.
 * a **journal publisher**, you can use Quixote to provide supplementary material for research papers that use Quantum Chemistry computational results. You can contribute to Quixote with suggestions for the kind of metadata to include in each data set, in order to allow for reproducibility of results.

If you want to collaborate with the Quixote project, join the mailing lists mentioned in the Front_Page and share your ideas, or contact one of the project members in People (though that list is not up to date).

Use cases and scenarios

//These use cases and scenarios will help in identifying what we want Quixote for.//

**Note: The set of use cases and scenarios is currently a draft.**

The simple use cases we are currently gathering try to show how we can, from an existing QC calculation, upload the files to a public server, parse the results with Quixote modules (Jumbo converters), structure them with CML and a related compchem dictionary, store them in a database (RDF format), and retrieve them through web queries (at first, by using the SPARQL language) or HTML browsing. The purpose of each use case is to show that this process is useful in some scientific or educational way, and so to guide future Quixote development.
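
To make the "structure them with CML" step more tangible, the sketch below pulls one named property out of a converted CML file using only the Python standard library. The file name and the dictRef value are hypothetical placeholders rather than actual compchem dictionary terms.

```python
# Minimal sketch: extract one scalar property from a CML file.
# The file name and dictRef value are illustrative placeholders.
import xml.etree.ElementTree as ET

CML_NS = {"cml": "http://www.xml-cml.org/schema"}

def extract_scalar(cml_path, dict_ref):
    """Return the text of the first <scalar> element whose dictRef matches."""
    root = ET.parse(cml_path).getroot()
    for scalar in root.findall(".//cml:scalar", CML_NS):
        if scalar.get("dictRef") == dict_ref:
            return scalar.text
    return None

print(extract_scalar("cyclobutadiene.cml", "cc:totalEnergy"))
```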

See the minutes of the meeting on November 26 for a list of ideas for use cases.

Currently, we are working to develop a use case around the cyclobutadiene mysteries.

Design(s)

Jorge Estrada and Pablo Echenique

Based on previous discussions and on what has been done until now, we have prepared a few diagrams showing the architecture of the proposed system.

The diagrams show components (running software), storage (databases, filesystems, etc.), layers (organization of the source code) and data flows (for components) or usage relations (for layers).

A short description of the different systems appearing in the diagrams follows (more suitable names are needed, but these serve as a first approach); a rough code sketch of how they might chain together appears after the list:

 * Local Results Storage System: Where the results of computational chemistry calculations are stored.
 * Publishing System: Converts the results to an intermediate format (CML) and transfers them to distributed storage.
 * Distributed Semistructured Results Storage System: Stores the intermediate format (CML). It can be truly distributed, or a single remote filesystem, or even the local filesystem.
 * Structuring System: Transforms intermediate-format results (CML) into a database or triple store of inter-result and intra-result relationships.
 * Distributed Structured Results Storage System: Stores inter-result and intra-result relationships in a database or triple store. It can be truly distributed, or a single remote filesystem, or even the local filesystem.
 * Retrieval Service System: Knows how to query and retrieve data from both the semistructured and structured storage systems.
 * RSS Feed System: Creates RSS feeds through queries to the Retrieval Service System.
 * Web Browsing Server System: Creates an HTML view of the semistructured and structured data (like CrystalEye).
 * Retrieval Client System: Sends retrieval requests to the Retrieval Service System.
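
The following self-contained Python sketch walks a single result through this chain, with in-memory lists standing in for the storage systems. None of the names correspond to existing Quixote code; it only illustrates the intended data flow.

```python
# Deliberately simplified sketch of the data flow between the systems above.
# All names are stand-ins; no real Quixote component or converter is called.

semistructured_store = []   # stands in for the Distributed Semistructured Results Storage
structured_store = []       # stands in for the Distributed Structured Results Storage

def publish(raw_output: str) -> str:
    """Publishing System: convert a code-specific output into a CML-like document."""
    cml_document = f"<cml>{raw_output}</cml>"   # a real system would call a Jumbo converter here
    semistructured_store.append(cml_document)
    return cml_document

def structure(cml_document: str) -> None:
    """Structuring System: derive (subject, predicate, object) relationships."""
    structured_store.append(("calc:1", "hasDocument", cml_document))

def retrieve(predicate: str):
    """Retrieval Service System: answer queries from the structured store."""
    return [triple for triple in structured_store if triple[1] == predicate]

structure(publish("...raw quantum chemistry output..."))
print(retrieve("hasDocument"))
```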

The first overview diagram shows the main components of the Quixote system, and the data flows. Some components are optional (such as the RSS Feed System), and others may be missing. This diagram may help to refine the list of functionality required.

(Diagram: OpenOffice.org 3 Draw original file)

How does this architecture map to the specific components we are using or we plan to use? This annotated overview diagram tries to answer this question:

(Diagram: OpenOffice.org 3 Draw original file)

Finally, we have already tested a rough prototype using Lensfield2. It would correspond to the Publishing System of the overview diagram:

(Diagram: OpenOffice.org 3 Draw original file)

Peter Murray-Rust

See Components of the Quixote Open computational chemistry system and the WWMM.

Philosophy

 * run anywhere (favours Java, Python)
 * Open Source and Open Data
 * community-driven - interoperate and re-use rather than compete
 * evolution rather than top-down
 * web-friendly (e.g. Linked Open Data)
 * lightweight wherever possible
 * minimal formal design (i.e. avoid heavy schemas, messaging protocols, databases)
 * many modular components linked by lightweight glue
 * mixture of push (deliberate upload) and pull (expose for harvesting)
 * favour simplicity over formal robustness

Pablo Echenique

 * You have datafiles.
 * They can be in your hard disk, in the cloud, in an institutional repository, anywhere.
 * Each datafile has a URI (unique identifier, I think this is the proper term). PMR: each file has an address (the filename, URL, etc.) and one or more identifiers (URIs, DataCite, etc.)
 * The datafiles can be grouped in, say, "datagroups", e.g., all data in a given machine can be a datagroup (the simplest one), or you can have many datagroups in a single site.
 * The datagroups may contain more "human" metadata, such as Publication, Author, Description, Funder, etc., that apply to all its datafiles.
 * Anyone can define a datagroup, as long as they know the URIs of the datafiles they want to put into it. Probably, new datagroups should inherit or merge the original metadata.
 * Each datagroup has a URI.
 * Your datafiles can be accessible (through the internet) or they can be private.
 * You have an application that, given a set of accessible URIs (of datafiles or datagroups or both), generates a database after parsing the relevant information from the datafiles. If the URIs are local, that's fine; if they point somewhere else, you just retrieve the datafiles through the internet.
 * Now you have databases.
 * They can be in your hard disk, in the cloud, in an institutional repository, anywhere.
 * Each database has a URI.
 * You have a library that, when plugged into an application, given a set of accessible database URIs, allows you to perform search queries, aggregate databases into new merged ones with new URIs, and do anything else we find meaningful.
 * Using this library, we can build an application to search any database, probably using a beautiful and intuitive UI, probably inside a browser. But in fact, anyone can. This application can be exposed in a web server or used privately.

In this way, datafiles, datagroups and databases can sit anywhere, be public or private, and anyone can choose to use, or even code, an application to access and search any collection of them.
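
A minimal Python sketch of this model, assuming nothing beyond the description above (class names, fields and the indexing stub are all illustrative, not existing Quixote code):

```python
# Minimal sketch of the datafile / datagroup / database model described above.
# All class and field names are illustrative, not existing Quixote code.
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class DataFile:
    uri: str                       # unique identifier (a plain URL is the simplest choice)
    address: str                   # where the bytes live: local path, remote URL, repository, ...
    public: bool = False


@dataclass
class DataGroup:
    uri: str
    datafiles: List[DataFile]
    metadata: Dict[str, str] = field(default_factory=dict)   # Publication, Author, Funder, ...


def build_database(groups: List[DataGroup]) -> Dict[str, Dict[str, str]]:
    """Build a searchable index from the datafiles in the groups (parsing stubbed out)."""
    index = {}
    for group in groups:
        for datafile in group.datafiles:
            index[datafile.uri] = dict(group.metadata)   # a real system would also parse the datafile
    return index


# Hypothetical usage: one datagroup covering two local files.
group = DataGroup(
    uri="http://example.org/groups/1",
    datafiles=[
        DataFile("http://example.org/files/calc1", "file:///data/calc1.out"),
        DataFile("http://example.org/files/calc2", "file:///data/calc2.out"),
    ],
    metadata={"Author": "A. Researcher", "Description": "Test calculations"},
)
print(build_database([group]))
```

In a real deployment the URIs would resolve over the web and the database builder would call the actual parsers, but the shape of the model stays the same.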

The only things we would need to build are:

 * The system for assigning URIs to datafiles, datagroups and databases (plain URLs are maybe the simplest choice).
 * The system for building and tagging datagroups.
 * The application for building databases.
 * The library for merging and searching databases.
 * One instance of a beautiful frontend for this library.

Then you can have users for whom this whole process is transparent (they just see the final application in a browser), people who build and aggregate databases for their particular interests, programmers who code new UIs, and so on.

Roadmap

**Note: This is an incomplete list.**

Current efforts in the Quixote project focus on:
 * Building a basic system: basic uploading, basic structuring, basic web search and browse capabilities.
 * Building parsers for different QC software. See [[Codes]] and [[Parsing]].
 * Developing use cases and scenarios.
 * Making a list of requirements for the system.
 * Refining the list of stakeholders.
 * Writing an introductory document for newcomers. See [[http://quixote.wikispot.org/Collaborative_projects#JChemInf|Journal of Cheminformatics Paper]].
 * Building dictionaries (starting by documenting CML and compchem). See [[CML]] and [[Dictionaries]].
 * ...

Meetings

 * [[First_Quixote_Conference_-_22nd-23rd_March_2010]]