This gear validates file contents, Flywheel file metadata, or Flywheel container metadata against a user-provided JSONSchema file. It can be run at the project, subject, session, or acquisition level.
Validates a json or csv file based on a provided validation schema
License: MIT
Category: utility
Gear Level:
- Project
- Subject
- Session
- Acquisition
- Analysis
[[TOC]]
- input_file
  - Name: input_file
  - Type: file
  - Optional: true
  - Description: The file to validate. If none is provided, only the destination container metadata will be validated
- validation_schema
  - Name: validation_schema
  - Type: file
  - Optional: false
  - Description: The JSONSchema to use to validate the file and/or container metadata
- validation_level
  - Name: validation_level
  - Type: string
  - Description: Select if validation should run on the file or the flywheel representation of the file. 'Validate File Contents' will read the input file and run validation on it, 'Validate Flywheel Objects' will load the json representation of the file in flywheel, including the parent container objects of the file
  - Default: Validate File Contents
  - Choices: ['Validate File Contents', 'Validate Flywheel Objects']
- add_parents
  - Name: add_parents
  - Type: boolean
  - Description: If validating Flywheel Objects, add the parent containers of the object to the schema for validation
  - Default: false
- tag
  - Name: tag
  - Type: string
  - Description: Tag to attach to files that gear runs on upon run completion
- debug
  - Name: debug
  - Type: boolean
  - Description: Enable debug-level log output
  - Default: false
- none
Any schema errors identified by the gear are stored in the file's metadata, along with the overall pass/fail state of the validation.
The metadata will be added to the file or container under the file custom information in the following format:
qc:
  file-validator:
    validation:
      state: "PASS" # or "FAIL" depending on the file validation
      data: [ "<list of error objects>" ]
where each error object has the following keys:
{
"type": "str - will always be 'error' in this gear",
"code": "str - Type of the error (e.g. MaxLength)",
"location": "<location object>",
"flywheel_path": "str – Flywheel path to the container/file",
"container_id": "str – ID of the source container/file ",
"value": "str – current value",
"expected": "str – expected value",
"message": "str – error message description"
}
The value for location is formatted as follows:
For a JSON input file:
{ "key_path": "str - the json key that raised the error" }
For a CSV input file:
{ "line": "int - the row number that raised the error",
"column_name": "str - the column name that raised the error" }
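The layout described above can be sketched in Python as follows. This is only an illustration of the documented structure, not the gear's actual code: the helper name `build_qc_metadata` is hypothetical, and every value in `example_error` is a placeholder.

```python
from typing import Any, Dict, List


def build_qc_metadata(errors: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Wrap error objects in the qc.file-validator.validation layout.

    Hypothetical helper: the state is "PASS" when there are no errors
    and "FAIL" otherwise.
    """
    return {
        "qc": {
            "file-validator": {
                "validation": {
                    "state": "PASS" if not errors else "FAIL",
                    "data": errors,
                }
            }
        }
    }


# Placeholder error object for a CSV cell; all values are made up.
example_error = {
    "type": "error",
    "code": "MaxLength",
    "location": {"line": 3, "column_name": "MyKey"},
    "flywheel_path": "<flywheel path>",
    "container_id": "<container id>",
    "value": "Some-Value",
    "expected": "<expected value>",
    "message": "<error message>",
}

failed = build_qc_metadata([example_error])
passed = build_qc_metadata([])
```

Here `failed` carries one error object and a "FAIL" state, while `passed` has an empty error list and a "PASS" state.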
When validating Flywheel file metadata, the file content first needs to be parsed. The form-importer gear can be used to parse the file content and add it to the file metadata.
This gear validates file content and extracted file metadata against a JSONSchema. The JSONSchema file is provided as a gear input and describes the content of the file, or the metadata of the file and, optionally, its parent containers. The gear can be triggered automatically through a gear rule, or be used as part of a validation pipeline.
Supported file types: JSON, CSV.
For a json file there are two validation checks that are done:
- Empty File Validation - checks to see if the file is empty
- Schema validation - applies the schema directly to the file
For a csv file, three validation checks are performed:
- Empty File Validation - checks to see if the file is empty.
- Header Validation - checks that a header is present and that it has no extra columns that are NOT specified in the schema.
- Schema Validation - each row is turned into a json object with key:value pairs, where the keys come from the column headers and the values come from the cells in the given row. Each row is then validated against the schema.
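The header check and the row-to-object step can be sketched with Python's standard `csv` module. This is a minimal illustration under assumptions, not the gear's implementation: the function names are hypothetical, a header counts as "present" whenever the first row is non-empty, and allowed columns are taken from the schema's top-level `properties`.

```python
import csv
import io
from typing import Dict, List


def find_header_errors(csv_text: str, schema: dict) -> List[str]:
    """Report a missing header or columns absent from schema['properties']."""
    header = csv.DictReader(io.StringIO(csv_text)).fieldnames
    if not header:
        return ["no header found"]
    allowed = set(schema.get("properties", {}))
    return [f"unexpected column: {name}" for name in header if name not in allowed]


def rows_as_objects(csv_text: str) -> List[Dict[str, str]]:
    """Turn each data row into a dict keyed by the column headers."""
    return [dict(row) for row in csv.DictReader(io.StringIO(csv_text))]


# Hypothetical schema with two known columns.
schema = {"properties": {"a": {"type": "integer"}, "b": {"type": "string"}}}
header_errors = find_header_errors("a,b,c\n1,x,y\n", schema)
objects = rows_as_objects("a,b\n1,x\n")
```

Note that every value in `objects` is still a string at this point; type casting is a separate step.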
CSVs are inherently untyped, so the exact type of each column must be provided
in the jsonschema file. By default, python reads everything as a string,
including blank cells, which are read as an empty string (''). We then use
default python casting rules to attempt to cast these values to the specified types.
Because of the complications of determining the true intended type of an untyped
list of strings, we do NOT allow columns to have multiple types. This means that
in your jsonschema, a property's type can NOT be a list of types.
Not allowed:
"INVALID_property": {"type": ["integer", "boolean"] }
Nulls are handled in the following way: If a row has a blank cell, that value is treated as null, and is removed from the row dictionary for validation. This does not modify the original data in any way, just the data that's passed to the validator.
This way, if that cell was required, the schema will catch it as a missing value, but if it was optional, the schema will pass as intended.
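The null handling above can be sketched as follows. The helper names are hypothetical, and treating only the exact empty string as blank is an assumption; the gear may handle whitespace-only cells differently.

```python
from typing import Dict, List


def drop_blank_cells(row: Dict[str, str]) -> Dict[str, str]:
    """Return a copy of the row without blank cells; the input is not modified."""
    return {key: value for key, value in row.items() if value != ""}


def missing_required(row: Dict[str, str], required: List[str]) -> List[str]:
    """List required keys absent from the row, as a schema's 'required' check would."""
    return [name for name in required if name not in row]


row = {"id": "7", "age": "", "name": "x"}
cleaned = drop_blank_cells(row)
```

With `age` removed from `cleaned`, a schema that lists `age` as required would flag it as missing, while a schema where `age` is optional would not.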
Here is a list of considerations when setting types for CSV:
- If a column is a REQUIRED property that every row MUST have a value for, then only ONE type is allowed to be specified for that column.
- If a column is an OPTIONAL property, some rows may have no value for it. Blank cells are removed before validation, so a single type is still sufficient for that column.
- The use of two or more types for a column is NEVER allowed in any case.
- Casting is done using python's built in type casting. Json types and python types aren't perfectly aligned, but we have assigned the following conversions:
JSON | Python |
---|---|
string | str |
number | float |
integer | int |
boolean | bool |
null | None |
- If your data has a custom null value ("NA", "None", etc), then you will need
to modify your schema to account for this in the following way:
- any "optional" fields must have something like this to allow a custom "null" value:
{
  "anyOf": [
    {
      "type": "integer",
      "minimum": 0,
      "maximum": 10
    },
    {
      "type": "string",
      "enum": [
        "null",
        "NA"
      ]
    }
  ]
}
- All data is initially read in as strings. This is important for casting in python for the following reasons:
  - Python sees ANY non-empty string as "true" when you try to cast the string to a boolean. Only the empty string '' is recognized as False in python. This can lead to confusing behavior, for example:
    bool("FALSE") >>> True
  - Python will convert a string containing an integer to a float when asked to. Technically this aligns with JavaScript's type definition anyway, as "number" means either an int or a float, but it is important to be aware that a column with type "number" will convert "123" to 123.0:
    float("123") >>> 123.0
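The JSON-to-python conversions listed above can be sketched as a small lookup table of casters. The names `JSON_TO_PYTHON` and `cast_cell` are hypothetical, and returning the original string when a cast fails is an assumption made for this sketch; the gear's actual behavior on a failed cast is not documented here.

```python
from typing import Any, Callable, Dict

# Mapping from JSON Schema type names to python casters.
JSON_TO_PYTHON: Dict[str, Callable[[str], Any]] = {
    "string": str,
    "number": float,
    "integer": int,
    "boolean": bool,
}


def cast_cell(value: str, json_type: str) -> Any:
    """Cast a CSV cell using python's built-in casting.

    Assumption: if the cast raises ValueError, the original string is
    returned so that a later schema check can report the type mismatch.
    """
    caster = JSON_TO_PYTHON[json_type]
    try:
        return caster(value)
    except ValueError:
        return value


as_int = cast_cell("123", "integer")
as_float = cast_cell("123", "number")
as_bool = cast_cell("FALSE", "boolean")
```

In this sketch `as_bool` is `True` even though the cell reads "FALSE", which is the boolean pitfall described above.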
The chart below shows the various casting results from different strings that might be read in, to different types.
value | int(value) | float(value) | bool(value) | str(value) |
---|---|---|---|---|
"123" | 123 | 123.0 | True | "123" |
"12.3" | ValueError | 12.3 | True | "12.3" |
"any string" | ValueError | ValueError | True | "any string" |
"" | ValueError | ValueError | False | "" |
This section contains specifications on any input files that the gear may need
The input file content, or the metadata extracted from the file, is validated against the JSONSchema.
The JSONSchema used to validate the file and/or container metadata must be provided as a gear input. It describes the content of the file or the metadata extracted from it. The schema file must be valid JSON that follows the JSONSchema standard.
For example, if the input_file is a JSON with the following content:
{
"KeyA": "Some-Value"
}
and the following JSONSchema is used:
{
  "$schema": "http://json-schema.org/draft-07/schema",
  "$comment": "JSON Schema for my file",
  "$id": "MyID",
  "title": "MyTitle",
  "type": "object",
  "required": ["MyKey"],
  "definitions": {
    "MyKey": {
      "type": "string",
      "maxLength": 4
    }
  },
  "properties": {
    "MyKey": {
      "$ref": "#/definitions/MyKey"
    }
  }
}
The gear will validate the content of the input_file and generate a report
of any validation errors it finds. The report is saved in the metadata.
graph LR;
A[input_file]:::input --> E(Empty \nValidation);
B[validation_schema]:::input --> E;
subgraph Gear;
E:::container -->|Not Empty| AA{Is Csv?}
AA -->|no| CC
AA --> |yes| BB(Header \nValidation)
BB:::container -->|Valid| CC[Validate Content\nWith Schema]
CC -->G
BB -->|Invalid| G
E -->|Empty| G[Error Metadata];
end
G-->F
F[Output QC\nMetadata]:::container;
classDef container fill:#57d,color:#fff
classDef input fill:#7a9,color:#fff
classDef gear fill:#659,color:#fff
For more information about how to get started contributing to this gear, check out CONTRIBUTING.md.