diff --git a/docs/cif.rst b/docs/cif.rst index e2920eb8..d209ac3b 100644 --- a/docs/cif.rst +++ b/docs/cif.rst @@ -1489,22 +1489,29 @@ Validation ========== A CIF document can conform to a dictionary (ontology, think DTD for XML -or JSON Schema for JSON). A dictionary is called DDL (dictionary -definition language) and is itself a CIF document. There are three -versions of DDL: - -* DDL1 is the simplest one. It enables fewer checks than the others. - It is used, for instance, for small molecule CIFs. -* DDL2 is used for PDBx/mmCIF and the activity in this area - is centered around the PDB. -* DDLm is a newer (from around 2011) version from the IUCr's COMCIFS +or JSON Schema for JSON). A dictionary, written in one of the versions +of DDL (Dictionary Definition Language), is itself a CIF document. +There are three versions of DDL: + +* DDL1 is the simplest. It is used, for instance, for small molecule CIFs. +* DDL2 is used for PDBx/mmCIF, with activity in this area + centered around the PDB. +* DDLm is a newer version (from around 2011) from the IUCr's COMCIFS (Committee for the Maintenance of the CIF Standard). It's not widely - used yet and, similarly to CIF2, is not supported by Gemmi. + used yet and, like CIF2, is not supported by Gemmi. -Gemmi is used primarily in structural biology and it's exercised mostly +Gemmi is primarily used in structural biology and is mostly exercised with mmCIF and DDL2. DDL1 is supported to a limited extent (which could be -expanded if there was interest and a good use case for it). +expanded if there was a good use case). +.. note:: + + In most cases, it's simpler to use the command-line program + :ref:`gemmi validate ` instead of the functions + described below. If you use mmCIF-like files, make sure you read + :ref:`notes about DDL2 `. + +The validation capabilities are implemented in class `cif::Ddl`. Let's start with a simple example, a pet weighting experiment: .. 
doctest:: @@ -1519,7 +1526,7 @@ Let's start with a simple example, a pet weighting experiment: ... 2 dog 15 ... ''' -Now let's make a contrived DDL1 dictionary for it: +Now let's create a contrived DDL1 dictionary for it: .. doctest:: @@ -1543,7 +1550,8 @@ Now let's make a contrived DDL1 dictionary for it: ... _units kg ... ''') -We can now use class `Ddl` to check if it's: +The `Ddl` class must first be supplied with a dictionary and can then validate +CIF files. .. doctest:: @@ -1552,7 +1560,7 @@ We can now use class `Ddl` to check if it's: >>> validator.validate_cif(cif.read_string(pet_example)) True -Now let's add one line to our example that will trigger errors: +Now let's append a line that will trigger errors: .. doctest:: @@ -1567,9 +1575,8 @@ Now let's add one line to our example that will trigger errors: string:2 [pets] _pet_weight: value out of expected range: 3000 False -The errors are sent to a logger as described -in a :ref:`separate section `. -The logger was set in constructor and can be changed at any point: +Errors are sent to a logger as described in a :ref:`separate section `. +The logger is set in the constructor and can be changed at any point: .. doctest:: @@ -1578,18 +1585,105 @@ The logger was set in constructor and can be changed at any point: False Calling `read_ddl()` moves the content of a `Document` to the `Ddl` class, -leaving the original object empty (it avoids copying to make it faster). -`read_ddl()` can be called multiple time to use multiple dictionaries -(or extensions) simultanously. +leaving the original object empty (it's slightly faster this way). +`read_ddl()` can be called multiple times to use multiple dictionaries +(or extensions) simultaneously. + +`Ddl` has a few flags to enable or disable certain types of checks. +These correspond to the optional checks listed in the documentation +of the :ref:`gemmi validate ` subcommand. +In C++, these are member variables that can be set directly. 
+In Python, they are set through keyword arguments in the constructor. + +The minimal example above used a contrived dictionary. Normally, you will +use a dictionary downloaded from the IUCr, wwPDB or another source -- +perhaps with your own extensions. So you'll use `cif.read()` instead of +`cif.read_string()`. -`Ddl` has a few flags that enable or disable a few types of checks. -They correspond to Optional Checks of the :ref:`gemmi validate ` -subcommand. In C++ they are member variables that can be set directly, -while Python bindings provide corresponding keyword arguments for constructor. +.. _ddl2: +Notes on DDL2 +------------- + +The commonly used DDL2-based dictionaries are +`available from wwPDB `_. +To validate mmCIF files, use the current version of the PDBx/mmCIF +dictionary (`mmcif_pdbx_v50.dic` as of 2025). +The original IUCr mmCIF dictionary (`cif_mm.dic`) is now only of historical +interest. It was actively developed in the 1990s, but in the 2000s development +was taken over by the PDB under the PDBx/mmCIF name. Formally, PDBx/mmCIF +is an extension of the IUCr mmCIF, but for all practical purposes it can be +thought of as the current mmCIF dictionary. +When we talk about mmCIF files, it's shorthand for PDBx/mmCIF +or mmCIF-like files. No software targets the original mmCIF specification. + +The mmCIF dictionary itself is massive---over 5MB of text---so it, too, can +use some validation. That's what the `mmcif_ddl.dic` (DDL2) dictionary is for. +This dictionary can also validate itself, closing the loop: + +.. code-block:: console -TBC - overview of DDL2 dictionaries, -how to interpret results, what to focus on, etc + $ gemmi validate -d mmcif_ddl.dic mmcif_ddl.dic + +The PDBx/mmCIF dictionary can be used to validate coordinate, +structure factor and chemical component files. +From a validation perspective, they are all the same thing. 
+That's why all categories in mmCIF files are, according to the dictionary, +optional (`_category.mandatory_code no`) -- we can't tell what the file +must contain. However, many items are marked as mandatory within +categories (`_item.mandatory_code`). If you've ever wondered +about the difference between null values `?` and `.` in mmCIF files: +the PDB's software writes `?` and `.` for optional and mandatory items, +respectively (an implementation detail that deviates from the CIF 1.1 spec). +If an item is *mandatory*, it only means that if its category is present, +the tag must also be present, but its value can be unknown or n/a. + +DDL2 is missing a comprehensive specification. What is not covered in +International Tables for Crystallography, vol. G (2006), has to be +inferred from studying dictionaries and asking around. +Parent-child relationships are particularly challenging. +Tags may have associated parent tags (e.g. `_entity.id` is the parent +of `_entity_poly.entity_id`), and groups of tags may have associated +parent groups (defined in the `pdbx_item_linked_group` category). +But it's unclear if every parent must exist. +The PDB's own validation software (CifCheck from +`mmcif-dict-suite `_) +checks for the presence of parent tags but has a long list of arbitrary +exceptions hardcoded into the program; otherwise most of the files from +the PDB wouldn't validate. In gemmi, the relationships are not checked +by default, but there is an option for it (`-p` in `gemmi validate`). + +In some cases, broken relationships are fixable. In others, there is a +fundamental mismatch between the design of the mmCIF schema and the capabilities +of DDL2. For example, some aspects of polymers and non-polymers are +described in different categories, but a residue can't be conditionally +linked to one or the other. So, it's linked only to polymeric categories, +leaving the schema partially incorrect. 
+ +If you run gemmi validation in verbose mode, you might see warnings about +incorrect regular expressions in a dictionary. In general, regexes +come in various flavors. Over the years, some flavors have been formally +defined and standardized (POSIX BRE, ERE, RegExp in ECMAScript, etc.). +I think the regexes used in DDL2 are closest to POSIX ERE (Extended RegExp). +Gemmi has hacks for parsing all the regexes that have been in mmCIF dictionaries +for a long time, but sometimes new ones are added that are inconsistent with +the older ones, so full support can't be guaranteed. + +Dictionaries allow only relatively simple checks. +When you deposit files to the PDB, they are primarily validated by other means. +Coordinate files are processed by a program called MAXIT, +and structure factor files -- by SF-CONVERT. These are part of a C++ codebase +that has been developed at RCSB since the late 1990s. +The files you deposit don't need to strictly conform to the dictionary, +but they must be able to pass through the processing programs, which rewrite +them anyway. Attempts to make an mmCIF file more conformant with the spec +sometimes backfire, choking the OneDep pipeline. + +Validation helps spot certain types of mistakes +but shouldn't be overemphasized. +When generating an mmCIF file, the goal is to ensure that it can be +read and correctly interpreted by the software it will be used with next. +Validation against a dictionary is a guideline, not the goal. JSON diff --git a/docs/program.rst b/docs/program.rst index 5e1722c5..c1b8749d 100644 --- a/docs/program.rst +++ b/docs/program.rst @@ -24,29 +24,23 @@ use a tool such as `GNU parallel `_:: validate ======== -A CIF validator. Apart from checking the syntax it can check most of the rules -imposed by DDL1 and DDL2 dictionaries. 
- -If you want to validate mmCIF files, -the current version of the PDBx/mmCIF specification, maintained by the PDB, -is distributed as one file -(`mmcif_pdbx_v50.dic `_), -which can be used to validate all kinds of mmCIF files: coordinate files, -reflection files, and CCD monomers. -Note that such validation can spot only certain types of mistakes. -It won't tell you if the file is appropriate for deposition to the PDB. -Dictionary-based validation can't even tell if the file contains all -the necessary tables; it is unaware of what the file represents -- -coordinates, reflection data or something else. -On the other hand, the mmCIF files deposited to the PDB do not need -to strictly conform to the PDBx/mmCIF spec. -Not even the files distributed by the PDB are fully compliant -(partly because not everything can be expressed in DDL2 syntax; -usually it's about child-parent relationships; -PDB's own validator, program CifCheck from -`mmcif-dict-suite `_, -has a few exceptions hardcoded in C++, -so that non-conformance is not accidental). +This program validates CIF and mmCIF files. It can: + +* check the STAR/CIF syntax (CIF 1.1, not 2):: + + gemmi validate file1.cif file2.cif + +* verify rules imposed by DDL1 and DDL2 dictionaries:: + + gemmi validate -d mmcif_pdbx_v50.dic -d extension.dic file.mmcif + +* and perform a few extra checks for CCP4 monomer files:: + + gemmi validate -m $CLIBD_MON/a/AAA.cif + +Before validating mmCIF-like files, see :ref:`the notes on DDL2 `. +The dictionary used for mmCIF files is the first one +`from here `_. .. 
literalinclude:: validate-help.txt :language: console diff --git a/src/ddl.cpp b/src/ddl.cpp index 3a2cc23e..f2b8efe6 100644 --- a/src/ddl.cpp +++ b/src/ddl.cpp @@ -478,6 +478,7 @@ void Ddl::read_ddl2_block(cif::Block& block) { gemmi::replace_all(re_str, "/\\{}", "/\\\\{}"); // in binary, \ is apparently meant to be ignored gemmi::replace_all(re_str, "\\\n", ""); + gemmi::replace_all(re_str, "\\\r\n", ""); auto flag = std::regex::awk | std::regex::optimize; regexes_.emplace(row.str(0), std::regex(re_str, flag)); } catch (const std::regex_error& e) {