Validator module

timfrazee edited this page Jul 3, 2023 · 2 revisions

Validator

The Validator module is a collection of classes that compares one dataset to another and produces output to explain discrepancies between them.

Module overview

The validator module is composed of 3 main classes:

Validator

Validator is the primary interface for validation. Once instantiated with a mapper type and validatable fields, it can accept records to compare. Typically, this will be one mapped Rikolti record and one Solr record, but any set of comparable data will work.

If a certain mapper type requires specific logic, Validator can be subclassed. See subclassing for details.

ValidationLog

The ValidationLog is where validation results are stored, and contains methods for outputting data in a consumable format (typically a CSV). For the most part, it is used by Validator and shouldn't have to be accessed directly except to generate output.

ValidationMode

ValidationMode is an enum class that tells a Validator instance how to compare values in its validations. The options are STRICT and LAX; LAX compares strings case-insensitively and compares lists without regard to order.

Validation modes can be passed into a Validator or an individual validatable field to change this behavior. By default, validations are made in STRICT mode (case and order sensitive).
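As a rough illustration of the difference between the two modes, here is a minimal sketch. The enum member values and the `values_match` helper are assumptions made for this example and are not taken from the module itself; whether LAX also ignores case for strings inside lists is not covered here.

```python
from enum import Enum


class ValidationMode(Enum):
    # Stand-in for the module's enum; the member values are assumptions.
    STRICT = "strict"
    LAX = "lax"


def values_match(left, right, mode=ValidationMode.STRICT):
    """Hypothetical helper: compare two values under the given mode."""
    if mode is ValidationMode.LAX:
        if isinstance(left, str) and isinstance(right, str):
            # LAX: ignore case when comparing strings.
            return left.lower() == right.lower()
        if isinstance(left, list) and isinstance(right, list):
            # LAX: ignore order when comparing lists (duplicates still count).
            return sorted(map(repr, left)) == sorted(map(repr, right))
    # STRICT, or any value that is neither a string nor a list: plain equality.
    return left == right


print(values_match("Title", "title"))                      # False
print(values_match("Title", "title", ValidationMode.LAX))  # True
print(values_match(["a", "b"], ["b", "a"]))                # False
print(values_match(["a", "b"], ["b", "a"], ValidationMode.LAX))  # True
```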

Usage

CLI

In most cases, a collection can be validated and output generated from convenience methods in the validate_mapping.py module.

python validate_mapping.py {collection_id} --log-level={log_level} --verbose --filename={filename}

Arguments

collection_id (required): The collection ID to validate. Mapped metadata must exist for this collection.

--log-level (default: WARNING): The level of validation result to output. By default, it'll only output validation failures of types WARNING and ERROR. INFO or lower will generate rows for validation successes, and can result in lengthy CSVs.

--filename/-f: The filename or path to save the generated file to. Defaults to a timestamp.

--verbose/-v (default: false): If set, outputs in verbose mode. At log level INFO, this will result in line-by-line validation successes being recorded. This can potentially generate very long CSV files.
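For readers who want to see how these arguments fit together, the following is a sketch of an equivalent argparse definition. It is an assumption about how the script could be wired, not a copy of validate_mapping.py; the `build_parser` name and the help strings are made up for illustration.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # Illustrative only: the real script may define its arguments differently.
    parser = argparse.ArgumentParser(prog="validate_mapping.py")
    parser.add_argument("collection_id", help="Collection ID to validate")
    parser.add_argument("--log-level", default="WARNING",
                        help="Lowest result level to include in the output")
    parser.add_argument("--filename", "-f", default=None,
                        help="Output path; a timestamp is used when omitted")
    parser.add_argument("--verbose", "-v", action="store_true",
                        help="Record line-by-line validation results")
    return parser


# Parse a sample command line instead of sys.argv so the sketch runs anywhere.
args = build_parser().parse_args(["1234", "--log-level=INFO", "-v"])
print(args.collection_id, args.log_level, args.verbose, args.filename)
```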

Examples

Generate a CSV of validation failures for collection 1234:

python validate_mapping.py 1234

Generate a CSV of validation failures and non-verbose (single line per record) successes for collection 1234, output to filename "my_validation.csv":

python validate_mapping.py 1234 --log-level=INFO --filename=my_validation.csv

Generate a (probably very large) CSV of all validation failures and successes for collection 1234:

python validate_mapping.py 1234 --log-level=INFO --verbose

Direct usage

In Python, Validator (or any subclass of Validator) can be instantiated directly.

Validator(log_level: ValidationLogLevel, validation_mode: ValidationMode, verbose: bool)

You can then call the validate method on each pair of records you want to compare:

validator.validate(record_harvest_id: str, rikolti_data: dict, comparison_data: dict, validation_mode: ValidationMode)

...and access the log afterward:

results = validator.log

Validatable Fields

In order to run validations, Validator must have definitions for each field it should validate, and how. These definitions are provided as a list of dictionaries, saved on a Validator instance as validatable_fields.

The schema for a validation dictionary is:

{
  "field": The name (as a string) of the field to be looked up in both datasets to compare
  "type": The Python `type` the value is expected to be; alternatively, can be a `Callable` that will be invoked by the `type_match` validation; see [customizing validations](#customizing-validations) for more
  "validations": A `Callable`, `list[Callable]`, or `dict[Callable, ValidationLogLevel]` that will be invoked for the field defined above; see [customizing validations](#customizing-validations) for more
  "validation_mode": The `ValidationMode` to use to perform validations. If not defined, uses the `validation_mode` set on the Validator.
  "level": The `ValidationLogLevel` to record validation failures as. Default `ERROR`.
}

The validator.validator module exposes a property called default_validatable_fields that provides a reasonable default, but this can be overridden for any mapper type.
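To make the schema concrete, here is a sketch of what a small list of field definitions might look like. The two enums are stand-ins so the example runs on its own, the field names "title" and "rights" are hypothetical, and `placeholder_check` is a dummy because the signature that real validation callables receive is not described on this page.

```python
from enum import Enum


class ValidationLogLevel(Enum):
    # Stand-in enum; only the level names come from this page.
    INFO = "INFO"
    WARNING = "WARNING"
    ERROR = "ERROR"


class ValidationMode(Enum):
    # Stand-in enum; only the mode names come from this page.
    STRICT = "STRICT"
    LAX = "LAX"


def placeholder_check(*args, **kwargs):
    """Dummy validation callable; the real call signature is an unknown here."""
    return True


example_fields = [
    {
        # Validations given as a list, all recorded at the field's level.
        "field": "title",
        "type": list,
        "validations": [placeholder_check],
        "level": ValidationLogLevel.ERROR,
    },
    {
        # Validations given as a dict mapping each callable to its own level,
        # with a per-field mode overriding the Validator-wide one.
        "field": "rights",
        "type": str,
        "validations": {placeholder_check: ValidationLogLevel.WARNING},
        "validation_mode": ValidationMode.LAX,
    },
]

for definition in example_fields:
    print(definition["field"], definition["type"].__name__)
```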

Customizing Validations

By default, Validator uses default_validatable_fields. It exposes three methods for changing its validatable fields:

add_validatable_field(field: str, type: type | Callable, validations: Callable | list[Callable] | dict[Callable, ValidationLogLevel], level: ValidationLogLevel (optional), validation_mode: ValidationMode (optional), replace: bool (optional)) -> bool

Adds a validatable field to the Validator's validatable_fields.

If a definition for the given field already exists, it is replaced only when replace is true; otherwise the method does nothing and returns False.

remove_validatable_field(field: str) -> bool

Removes a validatable field definition from the Validator's validatable_fields list for a given field name. Returns True if a field was found and removed, else False.

set_validatable_fields(fields: list[dict[str, Any]]) -> list[dict[str, Any]]

Completely replaces the Validator's validatable_fields with the fields provided.
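The behavior of these three methods can be modelled on a plain list. The `FieldRegistry` class below is a stand-in written for this example, not the real Validator; in particular, returning True after a successful add or replace is an assumption, and the optional level and validation_mode parameters are left out for brevity.

```python
from typing import Any


class FieldRegistry:
    """Stand-in that mimics the documented behavior; not the real Validator."""

    def __init__(self) -> None:
        self.validatable_fields = []

    def _index_of(self, field: str) -> int:
        # Position of the definition for `field`, or -1 if there is none.
        for i, definition in enumerate(self.validatable_fields):
            if definition["field"] == field:
                return i
        return -1

    def add_validatable_field(self, field, type, validations, replace=False) -> bool:
        definition = {"field": field, "type": type, "validations": validations}
        i = self._index_of(field)
        if i >= 0:
            if not replace:
                return False  # Existing definition is left untouched.
            self.validatable_fields[i] = definition
            return True  # Assumption: a successful replace reports True.
        self.validatable_fields.append(definition)
        return True

    def remove_validatable_field(self, field: str) -> bool:
        i = self._index_of(field)
        if i < 0:
            return False
        del self.validatable_fields[i]
        return True

    def set_validatable_fields(self, fields: list) -> list:
        # Discard whatever was there and keep a copy of the new definitions.
        self.validatable_fields = list(fields)
        return self.validatable_fields


registry = FieldRegistry()
print(registry.add_validatable_field("title", list, []))                # True
print(registry.add_validatable_field("title", str, []))                 # False
print(registry.add_validatable_field("title", str, [], replace=True))   # True
print(registry.remove_validatable_field("title"))                       # True
print(registry.remove_validatable_field("title"))                       # False
```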

Subclassing

Validator is designed to be subclassed if a given mapper requires unique logic. The easiest way to do this is to define it in the mapper module, then tell the mapper's "Record" class to use the validator.

# example_mapper.py

from validator import Validator

class ExampleValidator(Validator):
  
  def setup(self) -> None:
    # Do custom setup here
    pass

# Record is the mapper's existing Record class; its import is omitted here.
class ExampleRecord(Record):
  
  validator = ExampleValidator

validate_mapping.py will find and use the correct validator class based on the mapper defined for the provided collection.

Callbacks

To hook into a Validator at specific points, you can override one of several callbacks that it invokes as it operates.

setup() -> None

This method is called immediately after __init__ and is a good place to modify validatable fields.

before_validation() -> None

This method runs before each record comparison is performed. Though no arguments are provided to the method, the records are available as self.rikolti_data and self.comparison_data.

after_validation() -> None

This method runs after the validations are performed for a pair of records. It is a good place to perform any logic required on the ValidationLog (self.log) after it's been populated.
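The order in which the callbacks fire can be sketched with a small stand-in base class. `SketchValidator` below is written for this example and only imitates the lifecycle described above; its comparison logic and log format are assumptions, and the real Validator does considerably more.

```python
class SketchValidator:
    """Stand-in showing when each hook fires; not the real Validator."""

    def __init__(self):
        self.log = []
        self.rikolti_data = None
        self.comparison_data = None
        self.setup()  # Called once, right after initialization.

    def setup(self):
        pass

    def before_validation(self):
        pass

    def after_validation(self):
        pass

    def validate(self, record_id, rikolti_data, comparison_data):
        # The records are stored on the instance so the hooks can reach them.
        self.rikolti_data = rikolti_data
        self.comparison_data = comparison_data
        self.before_validation()
        for key, value in self.rikolti_data.items():
            if value != self.comparison_data.get(key):
                self.log.append((record_id, key))
        self.after_validation()


class TracingValidator(SketchValidator):
    def setup(self):
        self.calls = ["setup"]

    def before_validation(self):
        self.calls.append("before_validation")

    def after_validation(self):
        self.calls.append("after_validation")
        self.mismatch_count = len(self.log)


validator = TracingValidator()
validator.validate("id-1", {"title": "A"}, {"title": "B"})
print(validator.calls)
print(validator.mismatch_count)
```

With one pair of records, the hooks run in the order setup, before_validation, after_validation; for further pairs, only the last two repeat.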