Validator module
The Validator module is a collection of classes that compares one dataset to another and produces output to explain discrepancies between them.
The `validator` module is composed of 3 main classes:
`Validator` is the primary interface for validation. Once instantiated with a mapper type and validatable fields, it can accept records to compare. Typically, this will be one mapped Rikolti record and one Solr record, but any set of comparable data will work.
If a certain mapper type requires specific logic, `Validator` can be subclassed. See subclassing for details.
`ValidationLog` is where validation results are stored; it also contains methods for outputting data in a consumable format (typically a CSV). For the most part, it is used by `Validator` and shouldn't have to be accessed directly except to generate output.
`ValidationMode` is an enum class that tells a `Validator` instance how to compare values in its validations. Options are `STRICT` and `LAX`; the latter compares strings case-insensitively and lists order-independently.
Validation modes can be passed into a `Validator` or an individual validatable field to change this behavior. By default, validations are made in `STRICT` mode (case and order sensitive).
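The difference between the two modes can be illustrated with a minimal sketch. This is not the module's implementation: the enum below is a stand-in with assumed member values, and `values_match` is a hypothetical helper. It also assumes that `LAX` list comparison ignores order only and does not fold the case of list items.

```python
from enum import Enum


class ValidationMode(Enum):
    # Stand-in for the module's enum; the member values are assumptions.
    STRICT = "strict"
    LAX = "lax"


def values_match(left, right, mode=ValidationMode.STRICT):
    """Hypothetical helper: compare two values under the given mode."""
    if mode is ValidationMode.LAX:
        if isinstance(left, str) and isinstance(right, str):
            # Case-insensitive string comparison.
            return left.lower() == right.lower()
        if isinstance(left, list) and isinstance(right, list):
            # Order-independent list comparison; items still compare exactly.
            return sorted(left, key=repr) == sorted(right, key=repr)
    # STRICT, or a type LAX has no special rule for: plain equality.
    return left == right


print(values_match("Title", "title"))                      # strict comparison
print(values_match("Title", "title", ValidationMode.LAX))  # lax comparison
```

Keeping the mode as an explicit argument mirrors the description above, where a mode can be set once on the validator or overridden for a single field.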
In most cases, a collection can be validated and output generated using the convenience methods in the `validate_mapping.py` module.
python validate_mapping.py {collection_id} --log-level={log_level} --verbose --filename={filename}
`collection_id` (required): The collection ID to validate. Mapped metadata must exist for this collection.
`--log-level` (default: `WARNING`): The level of validation result to output. By default, it'll only output validation failures of types `WARNING` and `ERROR`. `INFO` or lower will generate rows for validation successes, and can result in lengthy CSVs.
`--filename/-f`: A filename/path to save the generated file. Default is a timestamp.
`--verbose/-v` (default: false): If set, outputs in verbose mode. At log level `INFO`, this will result in line-by-line validation successes being recorded. This can potentially generate very long CSV files.
Generate a CSV of validation failures for collection 1234:
python validate_mapping.py 1234
Generate a CSV of validation failures and non-verbose (single line per record) successes for collection 1234, output to filename "my_validation.csv":
python validate_mapping.py 1234 --log-level=INFO --filename=my_validation.csv
Generate a (probably very large) CSV of all validation failures and successes for collection 1234:
python validate_mapping.py 1234 --log-level=INFO --verbose
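For readers who want to see how the options above fit together, here is one way they could be declared with `argparse`. This is only a sketch: the script's actual parsing code is not shown in this document, and `build_parser` is a hypothetical name.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the documented options."""
    parser = argparse.ArgumentParser(prog="validate_mapping.py")
    # Required positional argument.
    parser.add_argument("collection_id")
    # Only failures (WARNING and ERROR) are reported by default.
    parser.add_argument("--log-level", default="WARNING")
    # None here stands for "fall back to a timestamp".
    parser.add_argument("--filename", "-f", default=None)
    # A flag: absent means False, present means True.
    parser.add_argument("--verbose", "-v", action="store_true")
    return parser


args = build_parser().parse_args(["1234", "--log-level=INFO", "-v"])
print(args.collection_id, args.log_level, args.verbose, args.filename)
```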
In Python, `Validator` (or any subclass of `Validator`) can be instantiated directly.
Validator(log_level: ValidationLogLevel, validation_mode: ValidationMode, verbose: bool)
You can then call the `validate` method on each pair of records you want to compare:
validator.validate(record_harvest_id: str, rikolti_data: dict, comparison_data: dict, validation_mode: ValidationMode)
...and access the log afterward:
results = validator.log
In order to run validations, `Validator` must have a definition for each field it should validate, including how to validate it. These definitions are provided as a list of dictionaries, saved on a `Validator` instance as `validatable_fields`.
The schema for a validation dictionary is:
{
"field": The name (as a string) of the field to be looked up in both datasets to compare
"type": The Python `type` the value is expected to be; alternatively, can be a `Callable` that will be invoked by the `type_match` validation; see [customizing validations](#customizing-validations) for more
"validations": A `Callable`, `list[Callable]`, or `dict[Callable, ValidationLogLevel]` that will be invoked for the field defined above; see [customizing validations](#customizing-validations) for more
"validation_mode": The `ValidationMode` to use to perform validations. If not defined, uses the `validation_mode` set on the Validator.
"level": The `ValidationLogLevel` to record validation failures as. Default `ERROR`.
}
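A concrete definition following this schema might look like the sketch below. Several things here are assumptions: the two enums are stand-ins with assumed values, `"title"` is a hypothetical field name, and `placeholder_check` accepts any arguments because the real signature for validation callables is not shown here. Treating `field`, `type`, and `validations` as the required keys is inferred from the schema's wording.

```python
from enum import Enum


class ValidationLogLevel(Enum):
    # Stand-in enum; the numeric values are assumptions.
    INFO = 20
    WARNING = 30
    ERROR = 40


class ValidationMode(Enum):
    # Stand-in enum; the member values are assumptions.
    STRICT = "strict"
    LAX = "lax"


def placeholder_check(*args, **kwargs):
    """Hypothetical validation callable; the real signature may differ."""
    return True


title_field = {
    "field": "title",                 # hypothetical field name
    "type": str,
    # dict form: each callable maps to the level its failure is logged at
    "validations": {placeholder_check: ValidationLogLevel.WARNING},
    "validation_mode": ValidationMode.LAX,   # optional per-field override
    "level": ValidationLogLevel.ERROR,       # optional; documented default
}

# Assumed split: the schema gives fallbacks only for the other two keys.
REQUIRED_KEYS = {"field", "type", "validations"}


def missing_keys(definition):
    """Return the required keys that a definition does not provide."""
    return REQUIRED_KEYS - set(definition)


print(missing_keys(title_field))
```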
The `validator.validator` module exposes a property called `default_validatable_fields` that provides a reasonable default, but this can be overridden for any mapper type.
By default, `Validator` uses `default_validatable_fields`; however, it exposes 3 methods to change its validatable fields:
add_validatable_field(field: str, type: type | Callable, validations: Callable | list[Callable] | dict[Callable, ValidationLogLevel], level: ValidationLogLevel (optional), validation_mode: ValidationMode (optional), replace: bool (optional)) -> bool
Adds a validatable field to the Validator's `validatable_fields`.
If a definition already exists for the given field, it is replaced when `replace` is true; otherwise, the method does nothing and returns `False`.
remove_validatable_field(field: str) -> bool
Removes a validatable field definition from the Validator's `validatable_fields` list for a given field name. Returns `True` if a field was found and removed, else `False`.
set_validatable_fields(fields: list[dict[str, Any]]) -> list[dict[str, Any]]
Completely replaces the Validator's `validatable_fields` with the fields provided.
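The return values of these three methods can be illustrated with a small stand-in. `FieldRegistry` is a hypothetical class written for this example, not part of the module. It assumes that a successful add returns `True` and that `set_validatable_fields` returns the new list.

```python
from __future__ import annotations

from typing import Any


class FieldRegistry:
    """Hypothetical stand-in that mimics the documented return values."""

    def __init__(self) -> None:
        self.validatable_fields: list[dict[str, Any]] = []

    def _index_of(self, field: str) -> int:
        # Position of the definition for `field`, or -1 if there is none.
        for i, definition in enumerate(self.validatable_fields):
            if definition["field"] == field:
                return i
        return -1

    def add_validatable_field(self, field, type, validations,
                              replace=False, **options) -> bool:
        definition = {"field": field, "type": type,
                      "validations": validations, **options}
        index = self._index_of(field)
        if index >= 0:
            if not replace:
                return False  # already defined and replace not requested
            self.validatable_fields[index] = definition
            return True
        self.validatable_fields.append(definition)
        return True  # assumed: a successful add reports True

    def remove_validatable_field(self, field: str) -> bool:
        index = self._index_of(field)
        if index < 0:
            return False
        del self.validatable_fields[index]
        return True

    def set_validatable_fields(self, fields):
        # Wholesale replacement; a copy keeps the caller's list independent.
        self.validatable_fields = list(fields)
        return self.validatable_fields


registry = FieldRegistry()
print(registry.add_validatable_field("title", str, []))
print(registry.add_validatable_field("title", list, []))
print(registry.add_validatable_field("title", list, [], replace=True))
```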
`Validator` is designed to be subclassed if a given mapper requires unique logic. The easiest way to do this is to define the subclass in the mapper module, then tell the mapper's "Record" class to use it.
# example_mapper.py
from validator import Validator

class ExampleValidator(Validator):
    def setup(self) -> None:
        # Do custom setup here
        pass

class ExampleRecord(Record):
    # Record is the mapper's existing record base class
    validator = ExampleValidator
`validate_mapping.py` will find and use the correct validator class based on the mapper defined for the provided collection.
To hook into a `Validator` at different points in its run, you can override one of several callbacks that it invokes as it operates.
setup() -> None
This method is called immediately after `__init__` and is a good place to modify/change validatable fields.
before_validation() -> None
This method runs before each record comparison is performed. Though no arguments are provided to the method, the records are available as `self.rikolti_data` and `self.comparison_data`.
after_validation() -> None
This method runs after the validations are performed for a pair of records. It is a good place to perform any logic required on the `ValidationLog` (`self.log`) after it's been populated.
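The order in which these callbacks fire can be sketched with a stand-in class. `SketchValidator` is hypothetical and deliberately simplified: it calls `setup()` at the end of `__init__`, uses a plain list in place of `ValidationLog`, and runs a placeholder comparison. The record ID `"rec-1"` is also made up.

```python
class SketchValidator:
    """Hypothetical stand-in showing when each callback fires."""

    def __init__(self) -> None:
        self.log = []              # plain list standing in for ValidationLog
        self.rikolti_data = {}
        self.comparison_data = {}
        self.setup()               # fires once per instance

    def setup(self) -> None:
        pass

    def before_validation(self) -> None:
        pass

    def after_validation(self) -> None:
        pass

    def validate(self, record_harvest_id, rikolti_data, comparison_data):
        # Records are stored first so the callbacks can read them.
        self.rikolti_data = rikolti_data
        self.comparison_data = comparison_data
        self.before_validation()
        # Placeholder comparison: log every key whose values differ.
        for key in sorted(set(rikolti_data) | set(comparison_data)):
            if rikolti_data.get(key) != comparison_data.get(key):
                self.log.append((record_harvest_id, key))
        self.after_validation()


class TracingValidator(SketchValidator):
    """Records the name of each callback as it is invoked."""

    def __init__(self) -> None:
        self.calls = []            # must exist before setup() runs
        super().__init__()

    def setup(self) -> None:
        self.calls.append("setup")

    def before_validation(self) -> None:
        self.calls.append("before_validation")

    def after_validation(self) -> None:
        self.calls.append("after_validation")


validator = TracingValidator()
validator.validate("rec-1", {"title": "A"}, {"title": "B"})
print(validator.calls)
print(validator.log)
```

Note that the subclass sets `self.calls` before calling `super().__init__()`, since `setup()` runs during construction in this sketch.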