This package gives developers a way to validate and error-check NACC-specific forms before submission. It can also validate any extra data your forms include.
See Data Quality Rule Definition Guidelines for more information on how the quality rules themselves work, which includes custom rules written specifically for NACC forms.
The NACC Form Validator primarily consists of two classes: QualityCheck, which in turn creates a NACCValidator. Together they are used to validate a record against a validation schema. The NACCValidator itself is an extension of Cerberus' Validator class, and provides definitions for custom NACC rules. These custom rules are described in more detail in Data Quality Rule Definition Guidelines.
The NACCValidator can also use an optional Datastore object, which you can implement to access records in your own database. See Example Usage - Records and Datastores for more information.
The usage workflow is to instantiate a QualityCheck object with the schema you want to validate against, and then pass the record you want to validate to the validate_record method. This method returns four values:
Variable | Type | Description |
---|---|---|
passed | bool | Whether the record satisfied all validation rules defined by the schema |
sys_failure | bool | Whether a system error occurred |
errors | dict[str, list[str]] | Dict of errors encountered, keyed by the variable that failed. Empty if no errors were encountered for the record. |
error_tree | DocumentErrorTree | A dict-like object of ValidationError instances. See Cerberus' Errors documentation for more information. |
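As a rough illustration of how the first three return values might be consumed, the sketch below turns them into printable report lines. The helper name `describe_result` and the wording of its messages are hypothetical and not part of the package; only the shapes of `passed`, `sys_failure`, and `errors` come from the table above.

```python
def describe_result(passed: bool, sys_failure: bool,
                    errors: dict[str, list[str]]) -> list[str]:
    """Turn the values returned by validate_record into report lines.

    Hypothetical helper: the message wording is an assumption.
    """
    if sys_failure:
        return ["system error reported during validation"]
    if passed:
        return ["record passed"]
    # errors maps each failing variable to a list of messages
    return [
        f"{field}: {message}"
        for field, messages in sorted(errors.items())
        for message in messages
    ]


# Values shaped like the ones described in the table above
print(describe_result(True, False, {}))
print(describe_result(False, False, {"hello": ["unallowed value pluto"]}))
```

Sorting by field name keeps the report stable across runs, which is convenient when comparing the output of two validation passes.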
The following is a simple example of using this package. The schema contains rules for the field hello, which can only be assigned the string world, and the primary key example_primary_key, which must be an integer. The primary key should always be a required field in the schema.
from nacc_form_validator import QualityCheck
pk_name = "example_primary_key"
schema = {
    "example_primary_key": {
        "type": "integer",
        "required": True
    },
    "hello": {
        "type": "string",
        "required": True,
        "allowed": [
            "world"
        ]
    }
}
qc = QualityCheck(pk_name, schema, strict=True, datastore=None)
passed, sys_failure, errors, error_tree = qc.validate_record({"example_primary_key": 1, "hello": "world"})
# passed = True
# sys_failure = False
# errors = {}
# error_tree = [],{}
passed, sys_failure, errors, error_tree = qc.validate_record({"example_primary_key": 2, "hello": "pluto"})
# passed = False
# sys_failure = False
# errors = {'hello': ['unallowed value pluto']}
# error_tree = [],{
# 'hello': [
# ValidationError @ 0x101649390 ( document_path=('hello',),
# schema_path=('hello', 'allowed'),
# code=0x44,
# constraint=['world'],
# value="pluto",
# info=('pluto',)
# )
# ],
# {}
# }
You may want to compare a record against previous records, particularly if your schema uses temporal rules, which are usually associated with plausibility checks. This is where pk_name and datastore come in: pk_name must be the "primary key" used to index the datastore, and it must always be a required field. datastore.py provides a Datastore abstract class that you must implement, specifically its get_previous_record method.
For more information on temporal rules, see Data Quality Rule Definition Guidelines. If you are validating against a schema that doesn't have temporal rules, it isn't necessary to set up a datastore.
As an example, say pk_name is patient_id. For the sake of simplicity, the "database" in this example is a hard-coded Python dict, but yours will likely be much more complicated. The validation schema checks whether a patient filled out taxes in a previous visit (0); if so, the current record cannot indicate that they never did it (8).
import copy
from nacc_form_validator import QualityCheck
from nacc_form_validator.datastore import Datastore
class CustomDatastore(Datastore):

    def __init__(self, pk_field: str, orderby: str) -> None:
        # Hard-coded stand-in for a real database, keyed by primary key
        self.__db = {
            'PatientID1': [
                {
                    "visit_num": 1,
                    "taxes": 8
                },
                {
                    "visit_num": 2,
                    "taxes": 0
                }
            ]
        }
        super().__init__(pk_field, orderby)

    def get_previous_record(self, current_record: dict[str, str]) -> dict[str, str] | None:
        """
        See where current record would fit in the sorted record and return the previous record
        Making a deep copy since we don't actually want to modify the record in this method
        """
        key = current_record[self.pk_field]
        if key not in self.__db:
            return None

        sorted_record = copy.deepcopy(self.__db[key])
        sorted_record.append(current_record)
        # self.__orderby would be name-mangled inside this subclass; this assumes
        # the base class exposes orderby the same way it exposes pk_field
        sorted_record.sort(key=lambda record: record[self.orderby])

        index = sorted_record.index(current_record)
        return sorted_record[index - 1] if index != 0 else None
pk_name = "patient_id"
orderby = "visit_num"
datastore = CustomDatastore(pk_name, orderby)
schema = {
    "patient_id": {
        "type": "string",
        "required": True
    },
    "visit_num": {
        "type": "integer",
        "required": True
    },
    "taxes": {
        "type": "integer",
        "temporalrules": [
            {
                "previous": {
                    "taxes": {
                        "allowed": [0]
                    }
                },
                "current": {
                    "taxes": {
                        "forbidden": [8]
                    }
                }
            }
        ]
    }
}
qc = QualityCheck(pk_name, schema, strict=True, datastore=datastore)
record = {
"patient_id": "PatientID1",
"visit_num": 3,
"taxes": 1
}
passed, sys_failure, errors, error_tree = qc.validate_record(record)
# passed = True
# sys_failure = False
# errors = {}
# error_tree = [],{}
record = {
"patient_id": "PatientID1",
"visit_num": 3,
"taxes": 8
}
passed, sys_failure, errors, error_tree = qc.validate_record(record)
# passed = False
# sys_failure = False
# errors = {
# 'taxes': [
# "('taxes',['unallowed value 8']) in current visit for {'allowed': [0]} in previous visit - temporal rule no: 1"
# ]
# }
# error_tree = [],{
# 'taxes': [
# ValidationError @ 0x100a70ed0 ( document_path=('taxes',),
# schema_path=('taxes', 'temporalrules'),
# code=0x2000,
# constraint={
# "previous": {
# "taxes": {
# "allowed": [0]
# }
# },
# "current": {
# "taxes": {
# "forbidden": [8]
# }
# }
# },
# value=8,
# info=(1, "('taxes', ['unallowed value 8'])",{'allowed': [0]})
# )
# ],{}
# }
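The temporal rule in the example above can be read as "if the previous record matches the `previous` constraints, then the current record must match the `current` constraints". The sketch below spells out that reading in plain Python for the `allowed` and `forbidden` rules only. It is an illustration of the logic, not the package's implementation: the function names `satisfies` and `temporal_rule_passes` are hypothetical, and treating a missing previous record as "rule does not apply" is an assumption that may differ from how NACCValidator behaves.

```python
from typing import Optional


def satisfies(record: dict, constraints: dict) -> bool:
    """Check a record against a tiny rule subset: 'allowed' and 'forbidden'."""
    for field, rules in constraints.items():
        value = record.get(field)
        if "allowed" in rules and value not in rules["allowed"]:
            return False
        if "forbidden" in rules and value in rules["forbidden"]:
            return False
    return True


def temporal_rule_passes(rule: dict, previous: Optional[dict], current: dict) -> bool:
    """If `previous` matches rule['previous'], `current` must match rule['current']."""
    # Assumption: with no previous record, or one that does not match the
    # condition, there is nothing to enforce on the current record.
    if previous is None or not satisfies(previous, rule["previous"]):
        return True
    return satisfies(current, rule["current"])


rule = {
    "previous": {"taxes": {"allowed": [0]}},
    "current": {"taxes": {"forbidden": [8]}},
}
previous_visit = {"visit_num": 2, "taxes": 0}

print(temporal_rule_passes(rule, previous_visit, {"visit_num": 3, "taxes": 1}))
print(temporal_rule_passes(rule, previous_visit, {"visit_num": 3, "taxes": 8}))
```

With visit 2 recorded as `taxes: 0`, the first call corresponds to the passing record in the example and the second to the failing one.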
It is likely you will want to validate multiple records at once. This is easily achieved by instantiating a QualityCheck with the corresponding schema and looping over the records you want to validate as Python dict objects. What this data looks like outside of that is up to you: you may wish to read in forms from external files (JSON, YAML, or CSV) or directly from a database.
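A minimal sketch of such a loop follows. So that it runs on its own, it accepts any callable that returns the same four values as validate_record and uses a stand-in validator; in real use you would pass `qc.validate_record` instead. The names `validate_all` and `fake_validate_record`, and the shape of the collected summary, are hypothetical.

```python
from typing import Callable, Iterable


def validate_all(records: Iterable[dict],
                 validate_record: Callable[[dict], tuple],
                 pk_name: str) -> dict:
    """Validate each record and collect failures keyed by primary key."""
    failures = {}
    for record in records:
        passed, sys_failure, errors, _error_tree = validate_record(record)
        if not passed:
            failures[record[pk_name]] = {
                "sys_failure": sys_failure,
                "errors": errors,
            }
    return failures


def fake_validate_record(record: dict) -> tuple:
    """Stand-in for qc.validate_record: only checks that hello == 'world'."""
    if record.get("hello") == "world":
        return True, False, {}, None
    return False, False, {"hello": [f"unallowed value {record.get('hello')}"]}, None


records = [
    {"example_primary_key": 1, "hello": "world"},
    {"example_primary_key": 2, "hello": "pluto"},
]
failures = validate_all(records, fake_validate_record, "example_primary_key")
print(failures)
```

Reusing one QualityCheck for the whole batch, as the paragraph above suggests, avoids rebuilding the validator from the schema for every record.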
docs/validate_csv_records.py sets up an example CLI script that reads multiple records from a CSV file (where each row is a record) and validates them against a schema passed as a JSON file. It then summarizes the results and writes them to a CSV or JSON file (chosen by file extension, defaulting to CSV if no extension is provided), or to stdout (as JSON) if no output file is specified.
Example usage:
python3 validate_csv_records.py \
-r rules-schema.json \
-i input-records.csv \
-o output-errors.csv
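If you write your own CSV reader instead, keep in mind that Python's csv module yields every value as a string, while a schema may expect integers. The sketch below shows one way to turn rows into record dicts using only the standard library. It is not taken from validate_csv_records.py: the function name `read_records`, the explicit `integer_fields` argument, and the choice to map empty cells to None are all assumptions made for illustration.

```python
import csv
import io


def read_records(csv_text: str, integer_fields: set[str]) -> list[dict]:
    """Parse CSV text into one dict per row, casting selected fields to int."""
    reader = csv.DictReader(io.StringIO(csv_text))
    records = []
    for row in reader:
        record = {}
        for field, value in row.items():
            if value == "":
                # Assumption: an empty cell means the value is missing
                record[field] = None
            elif field in integer_fields:
                record[field] = int(value)
            else:
                record[field] = value
        records.append(record)
    return records


csv_text = "patient_id,visit_num,taxes\nPatientID1,3,8\nPatientID2,1,\n"
for record in read_records(csv_text, {"visit_num", "taxes"}):
    print(record)
```

Each resulting dict can then be passed to validate_record in the same way as the hand-written records in the earlier examples.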