Skip to content

Latest commit

 

History

History
415 lines (310 loc) · 13.6 KB

syntax.md

File metadata and controls

415 lines (310 loc) · 13.6 KB

Whip syntax

This document specifies how to express data specifications in whip. How meta! 🤘

General

Whip specifications are expressed in YAML, a human and machine-readable data serialization language.

A single specification

A specification in whip describes what a data value in a field should adhere to. For example, to express that the field age should always contain the literal value 33, use:

age:                       # Name of the field
  allowed: 33              # Specification

Multiple specifications

Multiple specifications can be defined for a field. For example, to express that age values should fall between 9 and 99, use:

age:
  min: 9
  max: 99

To add specifications for another field, just add the name of the field and its specification(s):

age:
  min: 9
  max: 99

sex:
  allowed: [male, female]

A specification file

Together, all these field/term-based specifications form a specification file (e.g. my_specifications.yaml), which can be used by a validator to test how well data meets certain specifications.

Specification types

allowed

Tests if a value is the same as an allowed value or belongs to a list of allowed values:

sex:
  allowed: male             # A single allowed value. Will accept "male", but 
                            # not "Male" (case-sensitive) or anything else.

sex:  
  allowed: "male"           # Same as above

sex:  
  allowed: 'male'           # Same as above

sex:
  allowed: [male]           # Same as above

sex:
  allowed: [male, female]   # A list of allowed values: separate by commas and 
                            # wrap in square brackets. Will accept "male" or 
                            # "female".

sex:
  allowed: [male, female, 'male, female'] # Use quotes to escape commas and 
                            # white space. Will accept "male", "female" or 
                            # "male, female", but not "male,female" (no 
                            # white space) or "female, male".

Note: to pass, a value needs to be literally the same (= same sequence of characters) as (one of) the allowed value(s). This means that allowed is sensitive to case and white space.

minlength

Tests if a value has a minimum number of characters:

postal_code:
  minlength: 4              # Will accept "9050" and "B-9050", but not "905".

maxlength

Tests if a value has a maximum number of characters:

license_plate:
  maxlength: 6              # Will accept "AF8934" and "AF893", 
                            # but not "AF8-934" (note the dash)

stringformat

Tests if a value conforms to a specific string format (url or json):

website:
  stringformat: url         # Will accept "http://github.com/inbo/whip", 
                            # including urls with http, querystrings and 
                            # anchors, but not "github.com/inbo/whip"

measurements:
  stringformat: json        # Will accept {"length": 2.0} and 
                            # {"length": 2.0, "length_unit": "cm"}, but not 
                            # {'length': 2.0} or {length: 2.0} (use double 
                            # quotes) or "length": 2.0 (use curly brackets).

regex

Tests if a value matches a regular expression (regex):

observation_id:
  regex: 'INBO:VIS:\d+'     # Will accept "INBO:VIS:12" and "INBO:VIS:456", 
                            # but not "INBO:VIS:" or "INBO:VIS:ABC"

issue_url:
  regex: 'https:\/\/github\.com\/inbo\/whip\/issues\/\d+' # Don't forget to 
                            # escape (using "\") reserved characters like "." 
                            # and "/". Will accept "https://github.com/inbo/
                            # whip/issues/4"

utm1km:
  regex: '31U[D-G][S-T]\d\d\d\d' # Will accept UTM 1km codes for Flanders, 
                            # e.g. "31UDS8748"

Note: regular expressions allow to craft very specific specifications, but are often frustratingly difficult to get right. Use a tool like https://regex101.com to verify that they will match/unmatch what you intend.

Note: Always wrap the regex specification in single quotes. Not quoting will fail expressions containing [ ], as they are interpreted by YAML as a list. Double quoting can cause escaped characters to be escaped again.

Note: The regex always expects to have a full match of the value.

min

Tests if a numeric value is equal to or higher than a minimum value:

age:
  min: 9                    # Will accept "9", "9.0", "9.1", "10", but not 
                            # "8.99999" or "-9".

age:
  min: 9.0                  # Same as above

max

Tests if a numeric value is equal to or lower than a maximum value:

age:
  max: 99                   # Will accept "99", "99.0", "89.9", "88", "-99", 
                            # but not "99.1".

age:
  max: 99.0                 # Same as above

numberformat

Tests if a numeric value conforms to a specific number format:

length:
  numberformat: '.3'        # Will accept numbers with 3 digits to the right 
                            # of the decimal point, such as ".123", "1.123", 
                            # "12.123" and "-1.123", but not "1.12", "1.1234"  
                            # or "a.abc".

length:
  numberformat: '2.'        # Will accept numbers with 2 digits to the left 
                            # of the decimal point, such as "12", "12.", 
                            # "12.1" and "-12.", but not "123".

length:
  numberformat: '2.3'       # Will accept numbers with 2 digits to the left 
                            # and 3 digits to the right of the decimal point, 
                            # such as "12.123" and "-12.123".

length:
  numberformat: '.'         # Will accept any float value, such as "1.0", but 
                            # not integers, such as "1".

length:
  numberformat: 'x'         # Will accept any integer value, such as "1", but 
                            # not floats, such as "1.0".

Note: Always wrap the numberformat specification in single quotes. The negative sign is ignored, only digits are taken into account.

mindate

Tests if a date value is equal to or later than a minimum date:

date:
  mindate: 1985-11-29       # Will accept "1985-11-29" and "2012-09-12", but 
                            # not "1942-11-26".

maxdate

Tests if a date value is equal to or earlier than a maximum date:

date:
  maxdate: 2012-09-12       # Will accept "2012-09-12" and "1985-11-29", 
                            # but not "2016-12-07".

dateformat

Tests if a date value conforms to a specific date format. Syntax follows strftime:

date:
  dateformat: '%Y-%m-%d'    # Will accept "2016-12-07", but not "2016/12/07", 
                            # "07-12-2016", "2016-12", or "2016-12-32" 
                            # (invalid date).

date:
  dateformat: ['%Y-%m-%d', '%Y-%m', '%Y'] # Will accept valid ISO8601 dates 
                            # "2016-12-07", "2016-12", and "2016".

date:
  dateformat: ['%Y-%m-%d/%Y-%m-%d'] # Will accept valid day-precise ISO8601 
                            # date ranges, such as "2016-01-01/2017-02-13"

Note: Always wrap the dateformat specification in single quotes.

Changing scope

By default, whip specifications reject empty values, apply to the whole content of a single field, and are independent from other fields. There are three methods to change that scope: empty allows empty values to pass specifications, delimitedvalues restricts the scope of specifications to individual delimited values within a field, and if makes specifications dependent on the value of another field.

empty

Makes specifications accept empty values:

# Default
sex:
  allowed: [male, female]   # Will accept "male" and "female", but not empty 
                            # values.

sex:
  allowed: [male, female]
  empty: False              # Same as above, implied by default.

# With empty: True
sex:
  empty: True               # Makes all specifications of this field to accept 
                            # empty values.
  allowed: [male, female]   # Will now accept "male", "female", and empty 
                            # values.

sex:
  allowed: [male, female]
  empty: True               # Same as above, order does not matter.

Note: Whip specifications will only accept empty values when empty: True is explicitly added as a specification. That means that the following specifications will not accept empty values, even though you might intuitively think so:

field:
  maxlength: 2              # Will not accept empty values.

field:
  maxlength: 0              # Will not accept anything.

field:
  minlength: 0              # Will not accept empty values.

field:
  allowed: ''               # Will not accept empty values.

field:
  allowed: [male, female, ''] # Will not accept empty values.

field:
  regex: '^\s*$'            # Regex to match an empty string, will not accept 
                            # empty values.

Note: to only accept empty values (and nothing else), use:

required_to_be_empty:
  allowed: ''
  empty: True

delimitedvalues

Makes specifications apply to delimited values within a field, rather than the whole field. Requires delimiter:

sex:
  delimitedvalues:
    delimiter: ' | '        # Required. Will use this to separate content 
                            # of a field. All specifications within the 
                            # "delimitedvalues" group apply to values 
                            # delimited with this delimiter.
    
    allowed: [male, female] # Will accept "male" or "female".
                            # Valid values for the whole field thus are: 
                            # "male", "female", "male | female", and 
                            # "female | male", but not "male, female" (wrong 
                            # delimiter), "male|female" (missing spaces), or 
                            # "male | " (empty second value).
  
  empty: True               # It is still possible to set specifications for 
                            # the whole field. Here it is specified that the 
                            # whole field can be empty (but delimited values 
                            # cannot).

Note: to specify that a field cannot contain empty delimited values (but without defining other specifications for those values), use:

list_of_names:
  empty: true               # The whole field can be empty...
  delimitedvalues:
    delimiter: ' | '        # .. but using this delimiter, delimited values 
                            # cannot be empty, since "empty: False" is implied 
                            # by default.

if

Makes specifications conditional. This means that they are only verified if another field of the same record (i.e. the same row in tabular format) successfully passes certain specifications:

lifestage:
  if:
    - sex:
        allowed: [male, female] # If sex is "male" or "female"...
      allowed: adult        # ... then lifestage needs to be "adult".
    - sex:
        allowed: ''         # If sex is empty (and nothing else)...
        empty: True
      allowed: ''           # ... then lifestage needs to be empty.
      empty: True

Note: Always use the correct indentation and - to define a new condition:

field_1:
  if:
    - field_2:              # Condition A
        spec_for_field_2: value
        spec_for_field_2: value
      conditional_spec_for_field_1: value
      conditional_spec_for_field_1: value

    - field_2:              # Condition B
        spec_for_field_2: value
      conditional_spec_for_field_1: value
      conditional_spec_for_field_1: value

Note: conditions are independent from each other. This means that certain records can pass one condition, but fail another. In the example below a record with sex = male and lifestage = adult will pass the first condition, but fail the second. Avoid using very broad conditions:

lifestage:
  if:
    - sex:
        allowed: [male, female] # If sex is "male" or "female"...
      allowed: adult        # ... then lifestage needs to be "adult".
    - sex:
        empty: False        # If sex is any non-empty value, which includes 
                            # "male" or "female"...
      allowed: unknown      # ... then lifestage needs to be "unknown".

Note: the value of a field on which a condition is based needs to successfully pass all specifications (i.e. they are combined with the AND operator) before the conditional specifications are tested:

province:
  if:
    - postalcode:
        type: integer       # province needs to be an integer
        min: 8000           # AND province needs to be larger or equal to 8000
        max: 8999           # AND province needs to be smaller or equal to 8999
      allowed: 'West Flanders' # Only then it is tested if province is 
                            # "West Flanders".