edit docs about DDL2 validation
wojdyr committed Jan 10, 2025
1 parent e72d7b8 commit eb7b5ee
Showing 3 changed files with 139 additions and 50 deletions.
148 changes: 121 additions & 27 deletions docs/cif.rst
@@ -1489,22 +1489,29 @@ Validation
==========

A CIF document can conform to a dictionary (ontology, think DTD for XML
or JSON Schema for JSON). A dictionary is called DDL (dictionary
definition language) and is itself a CIF document. There are three
versions of DDL:

* DDL1 is the simplest one. It enables fewer checks than the others.
It is used, for instance, for small molecule CIFs.
* DDL2 is used for PDBx/mmCIF and the activity in this area
is centered around the PDB.
* DDLm is a newer (from around 2011) version from the IUCr's COMCIFS
or JSON Schema for JSON). A dictionary, written in one of the versions
of DDL (Dictionary Definition Language), is itself a CIF document.
There are three versions of DDL:

* DDL1 is the simplest. It is used, for instance, for small molecule CIFs.
* DDL2 is used for PDBx/mmCIF, with activity in this area
centered around the PDB.
* DDLm is a newer version (from around 2011) from the IUCr's COMCIFS
(Committee for the Maintenance of the CIF Standard). It's not widely
used yet and, similarly to CIF2, is not supported by Gemmi.
used yet and, like CIF2, is not supported by Gemmi.

Gemmi is used primarily in structural biology and it's exercised mostly
Gemmi is primarily used in structural biology and is mostly exercised
with mmCIF and DDL2. DDL1 is supported to a limited extent (which could be
expanded if there was interest and a good use case for it).
expanded if there was a good use case).

.. note::

  In most cases, it's simpler to use the command-line program
  :ref:`gemmi validate <gemmi-validate>` instead of the functions
  described below. If you use mmCIF-like files, make sure you read
  :ref:`notes about DDL2 <DDL2>`.

The validation capabilities are implemented in class `cif::Ddl`.
Let's start with a simple example, a pet weighing experiment:

.. doctest::
@@ -1519,7 +1526,7 @@ Let's start with a simple example, a pet weighing experiment:
  ... 2 dog 15
  ... '''

Now let's make a contrived DDL1 dictionary for it:
Now let's create a contrived DDL1 dictionary for it:

.. doctest::

@@ -1543,7 +1550,8 @@ Now let's make a contrived DDL1 dictionary for it:
  ... _units kg
  ... ''')

We can now use class `Ddl` to check if it's:
The `Ddl` class must first be supplied with a dictionary; it can then validate
CIF files.

.. doctest::

@@ -1552,7 +1560,7 @@ We can now use class `Ddl` to check if it's:
  >>> validator.validate_cif(cif.read_string(pet_example))
  True

Now let's add one line to our example that will trigger errors:
Now let's append a line that will trigger errors:

.. doctest::

@@ -1567,9 +1575,8 @@ Now let's add one line to our example that will trigger errors:
  string:2 [pets] _pet_weight: value out of expected range: 3000
  False

The errors are sent to a logger as described
in a :ref:`separate section <logger>`.
The logger was set in constructor and can be changed at any point:
Errors are sent to a logger as described in a :ref:`separate section <logger>`.
The logger is set in the constructor and can be changed at any point:

.. doctest::

@@ -1578,18 +1585,105 @@ The logger was set in constructor and can be changed at any point:
  False

Calling `read_ddl()` moves the content of a `Document` to the `Ddl` class,
leaving the original object empty (it avoids copying to make it faster).
`read_ddl()` can be called multiple time to use multiple dictionaries
(or extensions) simultanously.
leaving the original object empty (it's slightly faster this way).
`read_ddl()` can be called multiple times to use multiple dictionaries
(or extensions) simultaneously.

`Ddl` has a few flags to enable or disable certain types of checks.
These correspond to the optional checks listed in the documentation
of the :ref:`gemmi validate <gemmi-validate>` subcommand.
In C++, these are member variables that can be set directly.
In Python, they are set through keyword arguments in the constructor.

The minimal example above used a contrived dictionary. Normally, you will
use a dictionary downloaded from the IUCr, wwPDB or another source --
perhaps with your own extensions. So you'll use `cif.read()` instead of
`cif.read_string()`.

`Ddl` has a few flags that enable or disable a few types of checks.
They correspond to Optional Checks of the :ref:`gemmi validate <gemmi-validate>`
subcommand. In C++ they are member variables that can be set directly,
while Python bindings provide corresponding keyword arguments for constructor.
.. _ddl2:

Notes on DDL2
-------------

The commonly used DDL2-based dictionaries are
`available from wwPDB <https://mmcif.wwpdb.org/dictionaries/downloads.html>`_.
To validate mmCIF files, use the current version of the PDBx/mmCIF
dictionary (`mmcif_pdbx_v50.dic` as of 2025).
The original IUCr mmCIF dictionary (`cif_mm.dic`) is now only of historical
interest. It was actively developed in the 1990s, but in the 2000s development
was taken over by the PDB under the PDBx/mmCIF name. Formally, PDBx/mmCIF
is an extension of the IUCr mmCIF, but for all practical purposes it can be
thought of as the current mmCIF dictionary.
When we talk about mmCIF files, it's shorthand for PDBx/mmCIF
or mmCIF-like files. No software targets the original mmCIF specification.

The mmCIF dictionary itself is massive (over 5 MB of text), so it, too, can
use some validation. That's what the `mmcif_ddl.dic` (DDL2) dictionary is for.
This dictionary can also validate itself, closing the loop:

.. code-block:: console

    $ gemmi validate -d mmcif_ddl.dic mmcif_ddl.dic

.. TBC - overview of DDL2 dictionaries,
   how to interpret results, what to focus on, etc

The PDBx/mmCIF dictionary can be used to validate coordinate,
structure factor and chemical component files.
From a validation perspective, they are all the same thing.
That's why all categories in mmCIF files are, according to the dictionary,
optional (`_category.mandatory_code no`) -- we can't tell what the file
must contain. However, many items are marked as mandatory within
categories (`_item.mandatory_code`). If you've ever wondered
about the difference between null values `?` and `.` in mmCIF files:
the PDB's software writes `?` and `.` for optional and mandatory items,
respectively (an implementation detail that deviates from the CIF 1.1 spec).
If an item is *mandatory*, it only means that if its category is present,
the tag must also be present, but its value can be unknown or n/a.
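
The rule about mandatory items can be sketched in a few lines of Python.
This is only an illustration under simplifying assumptions, not how Gemmi
implements the check: a category is modelled as a plain dict mapping tags
to columns of values, and the function name and the `_demo.*` tags are
made up for the example.

```python
def missing_mandatory_tags(category, mandatory_tags):
    """Return mandatory tags that are absent from a category that is present.

    Only the presence of the tag is checked. Null values (`?` or `.`)
    are accepted, because a mandatory item may still be unknown or n/a.
    """
    if not category:  # the category is absent, so nothing is required
        return []
    return [tag for tag in mandatory_tags if tag not in category]


# Hypothetical category: tag -> column of values.
demo_category = {
    "_demo.id": ["1", "2"],
    "_demo.name": ["?", "."],  # tag is present, values are null: still fine
}
mandatory = ["_demo.id", "_demo.name", "_demo.weight"]

print(missing_mandatory_tags(demo_category, mandatory))
```

Only `_demo.weight` is reported, because its tag is missing altogether.
An empty (absent) category yields no errors, since all categories are optional.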

DDL2 is missing a comprehensive specification. What is not covered in
International Tables for Crystallography, vol. G (2006), has to be
inferred from studying dictionaries and asking around.
Parent-child relationships are particularly challenging.
Tags may have associated parent tags (e.g. `_entity.id` is the parent
of `_entity_poly.entity_id`), and groups of tags may have associated
parent groups (defined in the `pdbx_item_linked_group` category).
But it's unclear if every parent must exist.
The PDB's own validation software (CifCheck from
`mmcif-dict-suite <https://sw-tools.rcsb.org/apps/MMCIF-DICT-SUITE/>`_)
checks for the presence of parent tags but has a long list of arbitrary
exceptions hardcoded into the program; otherwise, most of the files from
the PDB wouldn't validate. In gemmi, the relationships are not checked
by default, but there is an option for it (`-p` in `gemmi validate`).
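
To make the idea of a parent-child check concrete, here is a minimal Python
sketch. It is an assumption-laden illustration, not the algorithm used by
Gemmi or CifCheck: columns are plain lists of strings, the function name is
hypothetical, and skipping null values is a choice made for this example.

```python
def orphaned_values(child_values, parent_values):
    """Return child values that have no matching value in the parent column.

    The result keeps the order of first appearance and has no duplicates.
    """
    parents = set(parent_values)
    seen = set()
    orphans = []
    for value in child_values:
        if value in ("?", "."):  # assumption: nulls are not treated as references
            continue
        if value not in parents and value not in seen:
            seen.add(value)
            orphans.append(value)
    return orphans


entity_id = ["1", "2"]               # values of _entity.id (parent)
entity_poly_entity_id = ["1", "3"]   # values of _entity_poly.entity_id (child)

print(orphaned_values(entity_poly_entity_id, entity_id))
```

Here the child value `3` has no parent. The sketch leaves open the harder
question raised above: what to do when the parent tag or category is absent.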

In some cases, broken relationships are fixable. In others, there is a
fundamental mismatch between the design of the mmCIF schema and the capabilities
of DDL2. For example, some aspects of polymers and non-polymers are
described in different categories, but a residue can't be conditionally
linked to one or the other. So, it's linked only to polymeric categories,
leaving the schema partially incorrect.

If you run gemmi validation in verbose mode, you might see warnings about
incorrect regular expressions in a dictionary. In general, regexes
come in various flavors. Over the years, some flavors have been formally
defined and standardized (POSIX BRE, ERE, RegExp in ECMAScript, etc.).
I think the regexes used in DDL2 are closest to POSIX ERE (Extended RegExp).
Gemmi has hacks for parsing all the regexes that have been in mmCIF dictionaries
for a long time, but sometimes new ones are added that are inconsistent with
the older ones, so full support can't be guaranteed.

Dictionaries allow only relatively simple checks.
When you deposit files to the PDB, they are primarily validated by other means.
Coordinate files are processed by a program called MAXIT,
and structure factor files -- by SF-CONVERT. These are part of a C++ codebase
that has been developed at RCSB since the late 1990s.
The files you deposit don't need to strictly conform to the dictionary,
but they must be able to pass through the processing programs, which rewrite
them anyway. Attempts to make an mmCIF file more conformant with the spec
sometimes backfire, choking the OneDep pipeline.

Validation helps spot certain types of mistakes
but shouldn't be overemphasized.
When generating an mmCIF file, the goal is to ensure that it can be
read and correctly interpreted by the software it will be used with next.
Validation against a dictionary is a guideline, not the goal.


JSON
40 changes: 17 additions & 23 deletions docs/program.rst
@@ -24,29 +24,23 @@ use a tool such as `GNU parallel <https://www.gnu.org/software/parallel/>`_::
validate
========

A CIF validator. Apart from checking the syntax it can check most of the rules
imposed by DDL1 and DDL2 dictionaries.

If you want to validate mmCIF files,
the current version of the PDBx/mmCIF specification, maintained by the PDB,
is distributed as one file
(`mmcif_pdbx_v50.dic <https://mmcif.wwpdb.org/dictionaries/mmcif_pdbx_v50.dic/Index/>`_),
which can be used to validate all kinds of mmCIF files: coordinate files,
reflection files, and CCD monomers.
Note that such validation can spot only certain types of mistakes.
It won't tell you if the file is appropriate for deposition to the PDB.
Dictionary-based validation can't even tell if the file contains all
the necessary tables; it is unaware of what the file represents --
coordinates, reflection data or something else.
On the other hand, the mmCIF files deposited to the PDB do not need
to strictly conform to the PDBx/mmCIF spec.
Not even the files distributed by the PDB are fully compliant
(partly because not everything can be expressed in DDL2 syntax;
usually it's about child-parent relationships;
PDB's own validator, program CifCheck from
`mmcif-dict-suite <https://sw-tools.rcsb.org/apps/MMCIF-DICT-SUITE/>`_,
has a few exceptions hardcoded in C++,
so that non-conformance is not accidental).
This program validates CIF and mmCIF files. It can:

* check the STAR/CIF syntax (CIF 1.1, not 2)::

    gemmi validate file1.cif file2.cif

* verify rules imposed by DDL1 and DDL2 dictionaries::

    gemmi validate -d mmcif_pdbx_v50.dic -d extension.dic file.mmcif

* and perform a few extra checks for CCP4 monomer files::

    gemmi validate -m $CLIBD_MON/a/AAA.cif

Before validating mmCIF-like files, see :ref:`the notes on DDL2 <ddl2>`.
The dictionary used for mmCIF files is the first one listed on the
`wwPDB downloads page <https://mmcif.wwpdb.org/dictionaries/downloads.html>`_.

.. literalinclude:: validate-help.txt
   :language: console
1 change: 1 addition & 0 deletions src/ddl.cpp
@@ -478,6 +478,7 @@ void Ddl::read_ddl2_block(cif::Block& block) {
gemmi::replace_all(re_str, "/\\{}", "/\\\\{}");
// in binary, \<newline> is apparently meant to be ignored
gemmi::replace_all(re_str, "\\\n", "");
gemmi::replace_all(re_str, "\\\r\n", "");
auto flag = std::regex::awk | std::regex::optimize;
regexes_.emplace(row.str(0), std::regex(re_str, flag));
} catch (const std::regex_error& e) {
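
The line added in `src/ddl.cpp` appears to extend the backslash-newline
handling to dictionaries saved with CRLF line endings. Below is a rough Python
equivalent of those two replacements, for illustration only. The function name
is hypothetical, and the rest of Gemmi's regex preprocessing is not reproduced.

```python
def strip_line_continuations(pattern: str) -> str:
    """Remove backslash-newline pairs from a regex read from a dictionary.

    Mirrors the two replace_all() calls in the hunk above: first the
    LF variant, then the CRLF variant.
    """
    pattern = pattern.replace("\\\n", "")
    pattern = pattern.replace("\\\r\n", "")
    return pattern


unix_style = "[0-9]+\\\n[a-z]+"       # backslash followed by LF
windows_style = "[0-9]+\\\r\n[a-z]+"  # backslash followed by CRLF

print(strip_line_continuations(unix_style))
print(strip_line_continuations(windows_style))
```

Both inputs reduce to the same pattern. Without the second replacement, the
CRLF variant would keep the backslash and the line break inside the regex.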
