This document specifies how to express data specifications in whip. How meta! 🤘
Whip specifications are expressed in YAML, a human and machine-readable data serialization language.
A specification in whip describes what a data value in a field should adhere to. For example, to express that the field age
should always contain the literal value 33
, use:
age: # Name of the field
allowed: 33 # Specification
Multiple specifications can be defined for a field. For example, to express that age
values should fall between 9
and 99
, use:
age:
min: 9
max: 99
To add specifications for another field, just add the name of the field and its specification(s):
age:
min: 9
max: 99
sex:
allowed: [male, female]
Together, all these field/term-based specifications form a specification file (e.g. my_specifications.yaml
), which can be used by a validator to test how well data meets certain specifications.
Tests if a value is the same as an allowed value or belongs to a list of allowed values:
sex:
allowed: male # A single allowed value. Will accept "male", but
# not "Male" (case-sensitive) or anything else.
sex:
allowed: "male" # Same as above
sex:
allowed: 'male' # Same as above
sex:
allowed: [male] # Same as above
sex:
allowed: [male, female] # A list of allowed values: separate by commas and
# wrap in square brackets. Will accept "male" or
# "female".
sex:
allowed: [male, female, 'male, female'] # Use quotes to escape commas and
# white space. Will accept "male", "female" or
# "male, female", but not "male,female" (no
# white space) or "female, male".
Note: to pass, a value needs to be literally the same (= same sequence of characters) as (one of) the allowed value(s). This means that allowed
is sensitive to case and white space.
Tests if a value has a minimum number of characters:
postal_code:
minlength: 4 # Will accept "9050" and "B-9050", but not "905".
Tests if a value has a maximum number of characters:
license_plate:
maxlength: 6 # Will accept "AF8934" and "AF893",
# but not "AF8-934" (note the dash)
Tests if a value conforms to a specific string format (url
or json
):
website:
stringformat: url # Will accept "http://github.com/inbo/whip",
# including urls with http, querystrings and
# anchors, but not "github.com/inbo/whip"
measurements:
stringformat: json # Will accept {"length": 2.0} and
# {"length": 2.0, "length_unit": "cm"}, but not
# {'length': 2.0} or {length: 2.0} (use double
# quotes) or "length": 2.0 (use curly brackets).
Tests if a value matches a regular expression (regex):
observation_id:
regex: 'INBO:VIS:\d+' # Will accept "INBO:VIS:12" and "INBO:VIS:456",
# but not "INBO:VIS:" or "INBO:VIS:ABC"
issue_url:
regex: 'https:\/\/github\.com\/inbo\/whip\/issues\/\d+' # Don't forget to
# escape (using "\") reserved characters like "."
# and "/". Will accept "https://github.com/inbo/
# whip/issues/4"
utm1km:
regex: '31U[D-G][S-T]\d\d\d\d' # Will accept UTM 1km codes for Flanders,
# e.g. "31UDS8748"
Note: regular expressions allow to craft very specific specifications, but are often frustratingly difficult to get right. Use a tool like https://regex101.com to verify that they will match/unmatch what you intend.
Note: Always wrap the regex specification in single quotes. Not quoting will fail expressions containing [ ]
, as they are interpreted by YAML as a list. Double quoting can cause escaped characters to be escaped again.
Note: The regex always expects to have a full match of the value.
Tests if a numeric value is equal to or higher than a minimum value:
age:
min: 9 # Will accept "9", "9.0", "9.1", "10", but not
# "8.99999" or "-9".
age:
min: 9.0 # Same as above
Tests if a numeric value is equal to or lower than a maximum value:
age:
max: 99 # Will accept "99", "99.0", "89.9", "88", "-99",
# but not "99.1".
age:
max: 99.0 # Same as above
Tests if a numeric value conforms to a specific number format:
length:
numberformat: '.3' # Will accept numbers with 3 digits to the right
# of the decimal point, such as ".123", "1.123",
# "12.123" and "-1.123", but not "1.12", "1.1234"
# or "a.abc".
length:
numberformat: '2.' # Will accept numbers with 2 digits to the left
# of the decimal point, such as "12", "12.",
# "12.1" and "-12.", but not "123".
length:
numberformat: '2.3' # Will accept numbers with 2 digits to the left
# and 3 digits to the right of the decimal point,
# such as "12.123" and "-12.123".
length:
numberformat: '.' # Will accept any float value, such as "1.0", but
# not integers, such as "1".
length:
numberformat: 'x' # Will accept any integer value, such as "1", but
# not floats, such as "1.0".
Note: Always wrap the numberformat specification in single quotes. The negative sign is ignored, only digits are taken into account.
Tests if a date value is equal to or later than a minimum date:
date:
mindate: 1985-11-29 # Will accept "1985-11-29" and "2012-09-12", but
# not "1942-11-26".
Tests if a date value is equal to or earlier than a maximum date:
date:
maxdate: 2012-09-12 # Will accept "2012-09-12" and "1985-11-29",
# but not "2016-12-07".
Tests if a date value conforms to a specific date format. Syntax follows strftime:
date:
dateformat: '%Y-%m-%d' # Will accept "2016-12-07", but not "2016/12/07",
# "07-12-2016", "2016-12", or "2016-12-32"
# (invalid date).
date:
dateformat: ['%Y-%m-%d', '%Y-%m', '%Y'] # Will accept valid ISO8601 dates
# "2016-12-07", "2016-12", and "2016".
date:
dateformat: ['%Y-%m-%d/%Y-%m-%d'] # Will accept valid day-precise ISO8601
# date ranges, such as "2016-01-01/2017-02-13"
Note: Always wrap the dateformat specification in single quotes.
By default, whip specifications reject empty values, apply to the whole content of a single field, and are independent from other fields. There are three methods to change that scope: empty
allows empty values to pass specifications, delimitedvalues
restricts the scope of specifications to individual delimited values within a field, and if
makes specifications dependent on the value of another field.
Makes specifications accept empty values:
# Default
sex:
allowed: [male, female] # Will accept "male" and "female", but not empty
# values.
sex:
allowed: [male, female]
empty: False # Same as above, implied by default.
# With empty: True
sex:
empty: True # Makes all specifications of this field to accept
# empty values.
allowed: [male, female] # Will now accept "male", "female", and empty
# values.
sex:
allowed: [male, female]
empty: True # Same as above, order does not matter.
Note: Whip specifications will only accept empty values when empty: True
is explicitly added as a specification. That means that the following specifications will not accept empty values, even though you might intuitively think so:
field:
maxlength: 2 # Will not accept empty values.
field:
maxlength: 0 # Will not accept anything.
field:
minlength: 0 # Will not accept empty values.
field:
allowed: '' # Will not accept empty values.
field:
allowed: [male, female, ''] # Will not accept empty values.
field:
regex: '^\s*$' # Regex to match an empty string, will not accept
# empty values.
Note: to only accept empty values (and nothing else), use:
required_to_be_empty:
allowed: ''
empty: True
Makes specifications apply to delimited values within a field, rather than the whole field. Requires delimiter
:
sex:
delimitedvalues:
delimiter: ' | ' # Required. Will use this to separate content
# of a field. All specifications within the
# "delimitedvalues" group apply to values
# delimited with this delimiter.
allowed: [male, female] # Will accept "male" or "female".
# Valid values for the whole field thus are:
# "male", "female", "male | female", and
# "female | male", but not "male, female" (wrong
# delimiter), "male|female" (missing spaces), or
# "male | " (empty second value).
empty: True # It is still possible to set specifications for
# the whole field. Here it is specified that the
# whole field can be empty (but delimited values
# cannot).
Note: to specify that a field cannot contain empty delimited values (but without defining other specifications for those values), use:
list_of_names:
empty: true # The whole field can be empty...
delimitedvalues:
delimiter: ' | ' # .. but using this delimiter, delimited values
# cannot be empty, since "empty: False" is implied
# by default.
Makes specifications conditional. This means that they are only verified if another field of the same record (i.e. the same row in tabular format) successfully passes certain specifications:
lifestage:
if:
- sex:
allowed: [male, female] # If sex is "male" or "female"...
allowed: adult # ... then lifestage needs to be "adult".
- sex:
allowed: '' # If sex is empty (and nothing else)...
empty: True
allowed: '' # ... then lifestage needs to be empty.
empty: True
Note: Always use the correct indentation and -
to define a new condition:
field_1:
if:
- field_2: # Condition A
spec_for_field_2: value
spec_for_field_2: value
conditional_spec_for_field_1: value
conditional_spec_for_field_1: value
- field_2: # Condition B
spec_for_field_2: value
conditional_spec_for_field_1: value
conditional_spec_for_field_1: value
Note: conditions are independent from each other. This means that certain records can pass one condition, but fail another. In the example below a record with sex = male
and lifestage = adult
will pass the first condition, but fail the second. Avoid using very broad conditions:
lifestage:
if:
- sex:
allowed: [male, female] # If sex is "male" or "female"...
allowed: adult # ... then lifestage needs to be "adult".
- sex:
empty: False # If sex is any non-empty value, which includes
# "male" or "female"...
allowed: unknown # ... then lifestage needs to be "unknown".
Note: the value of a field on which a condition is based needs to successfully pass all specifications (i.e. they are combined with the AND
operator) before the conditional specifications are tested:
province:
if:
- postalcode:
type: integer # province needs to be an integer
min: 8000 # AND province needs to be larger or equal to 8000
max: 8999 # AND province needs to be smaller or equal to 8999
allowed: 'West Flanders' # Only then it is tested if province is
# "West Flanders".