Mark Howison edited this page Sep 3, 2020 · 11 revisions

Documentation

Configuration file

To set configuration options, create a file called sirad_config.py and place it either in the directory where you execute the sirad command or somewhere else on your Python path. See _options in config.py for a complete list of possible options and default values.

For an example of a configuration file, see sirad_config.py from the SIRAD worked example repo.

The following options are available:

  • DATA_SALT: secret salt used for hashing data values. This shouldn't be shared. A warning will be issued if it is not set. Defaults to None.

  • PII_SALT: secret salt used for hashing PII values. This shouldn't be shared. A warning will be issued if it is not set. Defaults to None.

  • LAYOUTS: directory that contains layout files. Defaults to layouts/.

  • RAW_DIR, DATA_DIR, PII_DIR, LINK_DIR, RESEARCH_DIR: paths to where the original data, the processed files, and the research files will be saved.

  • VERSION: the current version number of the processed and research files.
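
As a rough illustration of the options above, a sirad_config.py might look like the following sketch. Every value is a placeholder chosen for this example: only the layouts/ default comes from the list above, while the directory names and the version value are assumptions rather than defaults taken from config.py.

```python
# sirad_config.py (illustrative sketch; all values are placeholders)

# Secret salts: keep these out of version control and do not share them.
DATA_SALT = "replace-with-a-secret-value"
PII_SALT = "replace-with-another-secret-value"

# Directory containing the YAML layout files (documented default).
LAYOUTS = "layouts/"

# Assumed directory names, chosen only for this example.
RAW_DIR = "raw/"
DATA_DIR = "data/"
PII_DIR = "pii/"
LINK_DIR = "link/"
RESEARCH_DIR = "research/"

# Assumed to be an integer here; check config.py for the expected type.
VERSION = 1
```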

YAML layout format

sirad uses YAML files to define the layout, or structure, of raw data files. These YAML files define each column in the incoming data and how it should be processed.

For an example of a YAML layout file, see tax.yaml from the SIRAD worked example repo.

The following properties can be specified in a YAML layout file:

source (required)

The path to the source file, relative to RAW_DIR.

type (default: csv)

The following values for file type are supported:

  • csv - delimited text file, defaulting to comma-delimited (see delimiter below)
  • fixed - fixed width format, which requires the specification of a width property for each field (see fields below)
  • xlsx - Excel .xlsx file (note: .xls is not currently supported)

delimiter (default: ',')

For the csv file type, this specifies the delimiter to use. Common alternatives to comma-delimited include tab-delimited ('\t') and pipe-delimited ('|').
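
To make the effect of the delimiter concrete, the sketch below parses a small tab-delimited sample with Python's standard csv module. It is not sirad's own reader; the sample data and variable names are made up for illustration.

```python
import csv
import io

# A tiny tab-delimited sample standing in for a raw source file.
sample = "id\tname\n1\tAda\n2\tGrace\n"

# Passing delimiter="\t" tells the parser to split on tabs instead of commas.
rows = list(csv.reader(io.StringIO(sample), delimiter="\t"))
```

With the default comma delimiter, each of these lines would instead be read as a single field containing embedded tabs.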

encoding (default: ascii)

The file encoding to use when opening a source file of type csv or fixed. If you do not know the encoding ahead of time, you can detect the encoding by running the Unix file command on the source file. Line endings (LF or CRLF) are detected automatically by the file parser. Non-ASCII characters are automatically transliterated to ASCII according to the character mapping found in readers.py.
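
The transliteration step can be pictured with the sketch below. It does not reproduce the character mapping in readers.py; it swaps in Unicode NFKD decomposition from the standard library as a stand-in, and the helper name to_ascii is hypothetical.

```python
import unicodedata

def to_ascii(text):
    """Approximate ASCII transliteration (stand-in for sirad's own mapping)."""
    # NFKD splits accented letters into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFKD", text)
    # Dropping non-ASCII bytes removes the combining marks. Characters with
    # no ASCII decomposition are dropped entirely by this approach.
    return decomposed.encode("ascii", "ignore").decode("ascii")

example = to_ascii("José Muñoz")
```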

header (default: True)

Whether to read the first line of the file as the column headers.

fields (required)

A list specifying the header name and type of each field in the source file.

For a fixed-width source file, or when setting header=False:

  • The fields list must be in the same order as the contents of the source file.

For a csv or xlsx file:

  • You can specify a different order, which will be used as the order in the output.
  • Every field that appears in the fields list must also appear with the same name in the source file header.
  • If a field exists in the source file header, but not in the fields list, it will be skipped in the output.
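
The rules above for csv and xlsx files can be sketched with the standard csv module. This is only an illustration of the described behavior, not sirad's implementation; the sample data and the fields list are invented for the example.

```python
import csv
import io

# Source file with three columns; the layout lists two, in a different order.
source = "last_name,first_name,notes\nLovelace,Ada,ignored\n"
fields = ["first_name", "last_name"]

reader = csv.DictReader(io.StringIO(source))

# Every field in the layout must appear in the source header.
missing = [name for name in fields if name not in reader.fieldnames]
if missing:
    raise ValueError(f"fields not found in source header: {missing}")

# Output follows the layout order; unlisted columns ("notes") are skipped.
rows = [[row[name] for name in fields] for row in reader]
```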

Each field consists of a name, optionally followed by a dictionary of the following field properties:

type (default: varchar)

Specify date if you wish to interpret the value as a date and convert it to a standardized YYYYMMDD format during processing.

pii

Marks the field as a type of personally identifiable information (PII). The field will be included in the PII_DIR output and not in the DATA_DIR output. The named PII fields used in calculating the sirad_id are:

  • first_name
  • last_name
  • dob

The named PII fields used in censuscoding addresses have one of the following prefixes for address type (additional types can be added by editing research.py):

  • home
  • mailing
  • employer
  • employer1
  • employer2
  • employer3

and one of the following suffixes for the address element:

  • _address: a field containing the entire street address including street number, e.g. 3 Main St
  • _street: a field containing only the street name
  • _street_num: a field containing only the street number
  • _city
  • _zip5: the five-digit zip code
  • _zip9: the nine-digit zip code
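
Assuming the field name is formed by joining a prefix directly to a suffix (for example home_zip5), the recognized address field names can be enumerated as in this short sketch. The variable names are hypothetical and the lists simply copy the prefixes and suffixes given above.

```python
# Address-type prefixes and address-element suffixes as listed above.
prefixes = ["home", "mailing", "employer", "employer1", "employer2", "employer3"]
suffixes = ["_address", "_street", "_street_num", "_city", "_zip5", "_zip9"]

# Assumption: a field name is a prefix immediately followed by a suffix.
address_fields = [prefix + suffix for prefix in prefixes for suffix in suffixes]
```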

hash

Replaces the value with an irreversible SHA-1 hash of the value, using the salt in PII_SALT for PII_DIR output or the DATA_SALT for DATA_DIR output. Commonly used in conjunction with ssn or with sensitive identifiers that will be included in DATA_DIR output.
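
A minimal sketch of salted SHA-1 hashing follows, using the standard hashlib module. How sirad actually combines the salt with the value is not stated here, so prepending the salt is an assumption, and the helper name salted_sha1 is hypothetical.

```python
import hashlib

def salted_sha1(value, salt):
    """Return the hex SHA-1 digest of a value combined with a secret salt."""
    # Assumption: the salt is prepended to the value before hashing.
    return hashlib.sha1((salt + value).encode("utf-8")).hexdigest()

# The same value hashed with two different salts gives unrelated digests,
# so hashed identifiers in the two outputs cannot be matched directly.
pii_hash = salted_sha1("123456789", "example-pii-salt")
data_hash = salted_sha1("123456789", "example-data-salt")
```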

ssn

Marks the field as containing a Social Security Number, which will be validated according to the rules found in dataset.py. Two additional fields, named by appending _invalid and _last4 to the field name, will be added to the output with the result of the validation.

format

Specifies the date format in strftime notation for a field of date type.
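
As an illustration of how a format string relates to the standardized output, the sketch below parses a date with a given format and writes it back as YYYYMMDD using the standard datetime module. The sample value and format are invented; sirad's internal conversion may differ in detail.

```python
from datetime import datetime

# Format as it might be given in the layout, in strftime notation.
date_format = "%m/%d/%Y"
raw_value = "09/03/2020"

# Parse with the declared format, then emit the standardized YYYYMMDD form.
standardized = datetime.strptime(raw_value, date_format).strftime("%Y%m%d")
```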

width

For a fixed-width file, this specifies the number of characters that will be read for this field.
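
The role of width can be sketched by slicing a line at cumulative offsets. This is an illustration only, not sirad's parser; the widths and sample line are invented, and stripping the padding from each value is an assumption.

```python
from itertools import accumulate

# Widths as they might be declared for three fields, in file order.
widths = [10, 10, 8]

# Build a sample fixed-width line: two padded text fields and a date.
line = "Ada".ljust(10) + "Lovelace".ljust(10) + "18151210"

# Each field ends at the running total of the widths and starts where
# the previous field ended.
ends = list(accumulate(widths))
starts = [0] + ends[:-1]

# Assumption: padding whitespace is stripped from each extracted value.
values = [line[start:end].strip() for start, end in zip(starts, ends)]
```

Because the fields are located purely by position, the order of the fields list must match the file exactly.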

skip

Skip the field in all output. This is equivalent to omitting the field from the fields list for a csv or xlsx file, but can be useful if you want to document the existence of the field in the layout file.

data

Includes the field in the data output. Used to force a field marked pii to be included in both the PII_DIR and DATA_DIR outputs. This is useful in the case where a field is needed for calculating the sirad_id or for censuscoding, but is not actually considered PII. Examples might include dob for sirad_id (date of birth may not be classified as PII in a data sharing agreement) or a city or zip code field for censuscoding.

Output
