Home
To set configuration options, create a file called sirad_config.py and place it either in the directory where you are executing the sirad command or somewhere else on your Python path. See _options in config.py for a complete list of possible options and default values.
For an example of a configuration file, see sirad_config.py from the SIRAD worked example repo.
The following options are available:
- DATA_SALT: secret salt used for hashing data values. This shouldn't be shared. A warning will be issued if it is not set. Defaults to None.
- PII_SALT: secret salt used for hashing PII values. This shouldn't be shared. A warning will be issued if it is not set. Defaults to None.
- LAYOUTS: directory that contains layout files. Defaults to layouts/.
- RAW_DIR, DATA_DIR, PII_DIR, LINK_DIR, RESEARCH_DIR: paths to where the original data, the processed files, and the research files will be saved.
- VERSION: the current version number of the processed and research files.
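As a rough illustration of the options listed above, a sirad_config.py might look like the following sketch. Every value here is a placeholder chosen for illustration: the directory names are not the package defaults (except layouts/, which is the documented default for LAYOUTS), and the type of VERSION is an assumption. Check _options in config.py for the actual defaults.

```python
# sirad_config.py - hypothetical example; all values are placeholders.

# Secret salts used for hashing. These should not be shared.
DATA_SALT = "replace-with-a-long-random-secret"
PII_SALT = "replace-with-another-long-random-secret"

# Directory that contains the YAML layout files (documented default).
LAYOUTS = "layouts/"

# Illustrative directory names; these are assumptions, not package defaults.
RAW_DIR = "raw/"
DATA_DIR = "data/"
PII_DIR = "pii/"
LINK_DIR = "link/"
RESEARCH_DIR = "research/"

# Version number of the processed and research files.
# Assumption: an integer is accepted; the expected type is not documented here.
VERSION = 1
```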
sirad uses YAML files to define the layout, or structure, of raw data files. These YAML files define each column in the incoming data and how it should be processed.
For an example of a YAML layout file, see tax.yaml from the SIRAD worked example repo.
The following properties can be specified in a YAML layout file:
The path to the source file, relative to RAW_DIR.
The following values for file type are supported:
- csv: delimited text file, defaulting to comma-delimited (see delimiter below)
- fixed: fixed width format, which requires the specification of a width property for each field (see fields below)
- xlsx: Excel .xlsx file (note: .xls is not currently supported)
For the csv file type, the delimiter property specifies the delimiter to use. Common alternatives to comma-delimited include tab-delimited ('\t') and pipe-delimited ('|').
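To illustrate what a non-default delimiter means in practice, the sketch below parses a pipe-delimited sample with Python's standard csv module. It is not sirad's own reader, and the sample data is made up.

```python
import csv
import io

# Hypothetical pipe-delimited sample data.
raw = "first_name|last_name|dob\nAda|Lovelace|1815-12-10\n"

# Passing delimiter="|" splits each line on pipes instead of commas.
reader = csv.reader(io.StringIO(raw), delimiter="|")
rows = list(reader)

header, first_record = rows[0], rows[1]
print(header)
print(first_record)
```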
The file encoding to use when opening a source file of type csv or fixed. If you do not know the encoding ahead of time, you can detect the encoding by running the Unix file command on the source file.
Line endings (LF or CRLF) are detected automatically by the file parser.
Non-ASCII characters are automatically transliterated to ASCII according to the character mapping found in readers.py.
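The general idea of decoding with a known encoding and then transliterating through a character mapping can be sketched as follows. The mapping here is a small hypothetical table for illustration only; the actual mapping lives in readers.py and may differ in both content and mechanism.

```python
# Hypothetical mapping; the real table is defined in readers.py.
CHAR_MAP = str.maketrans({"é": "e", "ñ": "n", "ü": "u", "\u2019": "'"})

# Simulate bytes read from a Latin-1 encoded source file.
raw_bytes = "José Muñoz".encode("latin-1")

# Decode using the declared encoding, then map non-ASCII characters to ASCII.
text = raw_bytes.decode("latin-1")
ascii_text = text.translate(CHAR_MAP)

print(ascii_text)
```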
Whether to read the first line of the file as the column headers.
A list specifying the header name and type of each field in the source file.
For a fixed-width source file, or when setting header=False:
- The fields list must be in the same order as the contents of the source file.
For a csv or xlsx file:
- You can specify a different order, which will be used as the order in the output.
- Every field that appears in the fields list must also appear with the same name in the source file header.
- If a field exists in the source file header, but not in the fields list, it will be skipped in the output.
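The rules above for csv files with a header can be sketched with the standard csv module. The helper select_fields is hypothetical and not part of sirad. Raising an error for a missing field is an assumption about how a violation might be handled.

```python
import csv
import io

def select_fields(source, fields):
    """Return rows containing only `fields`, in the order given."""
    reader = csv.DictReader(source)
    header = reader.fieldnames or []
    # Every requested field must exist in the source header.
    missing = [name for name in fields if name not in header]
    if missing:
        # Assumption: treat a missing field as an error.
        raise ValueError(f"fields not found in header: {missing}")
    # Header columns not listed in `fields` are skipped.
    return [[row[name] for name in fields] for row in reader]

# Hypothetical source data with an extra "notes" column.
raw = "id,last_name,first_name,notes\n1,Lovelace,Ada,x\n"
rows = select_fields(io.StringIO(raw), ["first_name", "last_name", "id"])
print(rows)
```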
Each field consists of a name, optionally followed by a dictionary of the following field properties:
Specify date if you wish to interpret the value as a date and convert it to a standardized YYYYMMDD format during processing.
Marks the field as a type of personally identifiable information (PII). The field will be included in the PII_DIR output and not in the DATA_DIR output. The named PII fields used in calculating the sirad_id are:
- first_name
- last_name
- dob
The named PII fields used in censuscoding addresses have one of the following prefixes for address type (additional types can be added by editing research.py):
- home
- mailing
- employer
- employer1
- employer2
- employer3
and one of the following suffixes for the address element:
- _address: a field containing the entire street address including street number, ex. 3 Main St
- _street: a field containing only the street name
- _street_num: a field containing only the street number
- _city
- _zip5: the five digit zip code
- _zip9: a nine digit zip code
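Since each address field name is a prefix joined with a suffix, the full set of recognized names can be enumerated with a short sketch. The variable names are hypothetical, and the lists simply restate the prefixes and suffixes above.

```python
from itertools import product

# Address type prefixes and address element suffixes, as listed above.
PREFIXES = ["home", "mailing", "employer", "employer1", "employer2", "employer3"]
SUFFIXES = ["_address", "_street", "_street_num", "_city", "_zip5", "_zip9"]

# Each combination of prefix and suffix is a named address field.
address_fields = [prefix + suffix for prefix, suffix in product(PREFIXES, SUFFIXES)]

print(address_fields[:6])
```

For example, home_address, mailing_zip5, and employer2_city all follow this naming pattern.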
Replaces the value with an irreversible SHA-1 hash of the value, using the salt in PII_SALT for PII_DIR output or the DATA_SALT for DATA_DIR output. Commonly used in conjunction with ssn or with sensitive identifiers that will be included in DATA_DIR output.
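A minimal sketch of salted SHA-1 hashing with Python's hashlib is shown below. The helper salted_sha1 is hypothetical, and prepending the salt to the value is an assumption: how sirad actually combines the salt with the value is not described here.

```python
import hashlib

def salted_sha1(value, salt):
    """Return the hex SHA-1 digest of a salted value."""
    # Assumption: the salt is prepended to the value before hashing.
    return hashlib.sha1((salt + value).encode("utf-8")).hexdigest()

# Placeholder salts and a made-up identifier.
pii_digest = salted_sha1("123456789", "example-pii-salt")
data_digest = salted_sha1("123456789", "example-data-salt")

print(pii_digest)
print(data_digest)
```

Because the two outputs use different salts, the same source value produces different hashes in each output.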
Marks the field as containing a Social Security Number, which will be validated according to the rules found in dataset.py. Fields with _invalid and _last4 appended to the field name will be added to the output with the result of the validation.
Specifies the date format in strftime notation for a field of date type.
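The conversion from a source date format to YYYYMMDD can be illustrated with the standard datetime module. The helper to_yyyymmdd is hypothetical and only demonstrates how strftime notation describes an input format.

```python
from datetime import datetime

def to_yyyymmdd(value, fmt):
    """Parse `value` using a strftime-style format and return YYYYMMDD."""
    return datetime.strptime(value, fmt).strftime("%Y%m%d")

us_style = to_yyyymmdd("03/15/1985", "%m/%d/%Y")
iso_style = to_yyyymmdd("1985-03-15", "%Y-%m-%d")

print(us_style)   # 19850315
print(iso_style)  # 19850315
```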
For a fixed-width file, this specifies the number of characters that will be read for this field.
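As a sketch of how consecutive widths carve up a fixed-width line, the hypothetical helper below slices a line by a list of widths. Stripping the padding whitespace is an assumption made for readability, not a documented behavior of sirad.

```python
def split_fixed(line, widths):
    """Split `line` into consecutive slices of the given character widths."""
    values = []
    pos = 0
    for width in widths:
        # Assumption: surrounding padding is stripped from each value.
        values.append(line[pos:pos + width].strip())
        pos += width
    return values

# Build a made-up line with field widths of 10, 10, and 8 characters.
line = "Ada".ljust(10) + "Lovelace".ljust(10) + "18151210"
values = split_fixed(line, [10, 10, 8])

print(values)
```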
Skip the field in all output. This is equivalent to omitting the field from the fields list for a csv or xlsx file, but can be useful if you want to document the existence of the field in the layout file.
Includes the field in the data output. Used to force a field marked pii to be included in both the PII_DIR and DATA_DIR outputs. This is useful in the case where a field is needed for calculating the sirad_id or for censuscoding, but is not actually considered PII. Examples might include dob for sirad_id (date of birth may not be classified as PII in a data sharing agreement) or a city or zip code field for censuscoding.