Home
To set configuration options, create a file called sirad_config.py and place it
either in the directory where you execute the sirad command or
somewhere else on your Python path. See _options in config.py for a
complete list of possible options and default values.
For an example of a configuration file, see sirad_config.py from the SIRAD worked example repo.
The following options are available:
- DATA_SALT: secret salt used for hashing data values. Do not share it. A warning is issued if it is not set. Defaults to None.
- PII_SALT: secret salt used for hashing PII values. Do not share it. A warning is issued if it is not set. Defaults to None.
- LAYOUTS: directory that contains layout files. Defaults to layouts/.
- RAW_DIR, DATA_DIR, PII_DIR, LINK_DIR, RESEARCH_DIR: paths where the original data, the processed files, and the research files are saved.
- VERSION: the current version number of the processed and research files.
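As a rough illustration, a sirad_config.py using the options listed above might look like the following. Every value is a placeholder chosen for this sketch (only the layouts/ default comes from the list above), and the type of VERSION is an assumption.

```python
# sirad_config.py: illustrative values only. All salts and paths are placeholders.

# Secret salts. Keep these out of version control and do not share them.
DATA_SALT = "replace-with-a-long-random-secret"
PII_SALT = "replace-with-a-different-random-secret"

# Directory containing the YAML layout files (the documented default).
LAYOUTS = "layouts/"

# Input and output locations (hypothetical paths).
RAW_DIR = "raw/"
DATA_DIR = "data/"
PII_DIR = "pii/"
LINK_DIR = "link/"
RESEARCH_DIR = "research/"

# Version of the processed and research files (assumed here to be an integer).
VERSION = 1
```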
sirad uses YAML files to define the layout, or structure, of raw data files. These YAML files define each column in the incoming data and how it should be processed.
For an example of a YAML layout file, see tax.yaml from the SIRAD worked example repo.
The following properties can be specified in a YAML layout file:
The path to the source file, relative to RAW_DIR.
The following values for file type are supported:
- csv: delimited text file, comma-delimited by default (see delimiter below)
- fixed: fixed-width format, which requires a width property for each field (see fields below)
- xlsx: Excel .xlsx file (note: .xls is not currently supported)
For the csv file type, this specifies the delimiter to use. Common alternatives to comma-delimited include tab-delimited ('\t') and pipe-delimited ('|').
The file encoding to use when opening a source file of type csv or fixed. If you do not know the encoding ahead of time, you can detect the encoding by running the Unix file command on the source file.
Line endings (LF or CRLF) are detected automatically by the file parser.
Non-ASCII characters are automatically transliterated to ASCII according to the character mapping found in readers.py.
Whether to read the first line of the file as the column headers.
A list specifying the header name and type of each field in the source file.
For a fixed-width source file, or when setting header=False:
- The fields list must be in the same order as the contents of the source file.
For a csv or xlsx file:
- You can specify a different order, which will be used as the order in the output.
- Every field that appears in the fields list must also appear with the same name in the source file header.
- If a field exists in the source file header, but not in the fields list, it will be skipped in the output.
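The rules for csv files with a header can be sketched in a few lines of Python. This is not sirad's reader; select_fields is a hypothetical helper that only illustrates reordering, skipping unlisted columns, and rejecting names that are missing from the header.

```python
import csv
import io


def select_fields(source, fields, delimiter=","):
    """Yield rows containing only the requested fields, in the requested order."""
    reader = csv.DictReader(source, delimiter=delimiter)
    header = reader.fieldnames or []
    # Every listed field must be present in the source header.
    missing = [name for name in fields if name not in header]
    if missing:
        raise ValueError(f"fields not found in source header: {missing}")
    for row in reader:
        # Columns absent from `fields` (here: notes) are dropped.
        yield [row[name] for name in fields]


raw = io.StringIO("id,first,last,notes\n7,Ada,Lovelace,x\n")
rows = list(select_fields(raw, ["last", "first", "id"]))
```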
Each field consists of a name, optionally followed with a dictionary of the following field properties:
Specify date if you wish to interpret the value as a date and convert to a standardized YYYYMMDD format during processing.
Marks the field as a type of personally identifiable information (PII). The field will be included in the PII_DIR output and not in the DATA_DIR output. The named PII fields used in calculating the sirad_id are:
- first_name
- last_name
- dob
The named PII fields used in censuscoding addresses have one of the following prefixes for address type (additional types can be added by editing research.py):
- home
- mailing
- employer
- employer1
- employer2
- employer3
and one of the following suffixes for the address element:
- _address: a field containing the entire street address including the street number, e.g. 3 Main St
- _street: a field containing only the street name
- _street_num: a field containing only the street number
- _city
- _zip5: the five-digit zip code
- _zip9: a nine-digit zip code
Replaces the value with an irreversible SHA-1 hash of the value, using the salt in PII_SALT for PII_DIR output or the DATA_SALT for DATA_DIR output. Commonly used in conjunction with ssn or with sensitive identifiers that will be included in DATA_DIR output.
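To make the idea of salted hashing concrete, here is a minimal sketch using Python's hashlib. How sirad actually combines the salt with the value is not stated here, so the simple concatenation below is an assumption, and salted_sha1 is a hypothetical name.

```python
import hashlib


def salted_sha1(value, salt):
    """Return the hex SHA-1 digest of salt + value (concatenation order is assumed)."""
    return hashlib.sha1((salt + value).encode("utf-8")).hexdigest()


# The same input hashed with different salts yields unrelated digests,
# so hashed values in the PII output and the data output cannot be matched.
pii_hash = salted_sha1("123456789", "example-pii-salt")
data_hash = salted_sha1("123456789", "example-data-salt")
```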
Marks the field as containing a Social Security Number, which will be validated according to the rules found in dataset.py. A field with _invalid appended will be added to the output with the result of the validation.
Specifies the date format in strftime notation for a field of date type. You can specify multiple formats separated by '|' in the case where the input data does not have a consistent format, and each format will be attempted in order after splitting on the '|' separator.
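The try-each-format behavior can be sketched as follows. normalize_date is a hypothetical helper, not sirad's own function; it returns None for values that match none of the formats.

```python
from datetime import datetime


def normalize_date(value, formats):
    """Try each '|'-separated format in order; return YYYYMMDD or None."""
    for fmt in formats.split("|"):
        try:
            return datetime.strptime(value, fmt).strftime("%Y%m%d")
        except ValueError:
            continue  # this format did not match, try the next one
    return None


formats = "%m/%d/%Y|%Y-%m-%d"
us_style = normalize_date("03/15/1990", formats)
iso_style = normalize_date("1990-03-15", formats)
unparsed = normalize_date("not a date", formats)
```

Because formats are tried in order, an ambiguous value such as 01/02/2020 is interpreted by whichever matching format is listed first.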
For a fixed-width file, this specifies the number of characters that will be read for this field.
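As an illustration of how per-field widths carve up a fixed-width line, consider this sketch. split_fixed_width is a hypothetical helper, and stripping the padding from each value is an assumption rather than documented sirad behavior.

```python
def split_fixed_width(line, widths):
    """Slice one line into fields using consecutive character widths."""
    values = []
    start = 0
    for width in widths:
        # Assumption: padding spaces around each value are stripped.
        values.append(line[start:start + width].strip())
        start += width
    return values


# Build a sample line with widths 10, 10, and 8.
line = "Ada".ljust(10) + "Lovelace".ljust(10) + "18151210"
record = split_fixed_width(line, [10, 10, 8])
```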
Skip the field in all output. This is equivalent to omitting the field from the fields list for a csv or xlsx file, but can be useful if you want to document the existence of the field in the layout file.
Includes the field in the data output. Used to force a field marked pii to be included in both the PII_DIR and DATA_DIR outputs. This is useful in the case where a field is needed for calculating the sirad_id or for censuscoding, but is not actually considered PII. Examples might include dob for sirad_id (date of birth may not be classified as PII in a data sharing agreement) or a city or zip code field for censuscoding.
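The interaction between the pii, data, and skip properties can be sketched with a small routing function. This is an interpretation of the descriptions above, not sirad's code: the field names are invented, the layout is represented as plain (name, properties) pairs rather than loaded from YAML, and the property key names (in particular skip) are assumptions.

```python
def split_outputs(fields):
    """Return (data_columns, pii_columns) for a list of (name, properties) pairs."""
    data_columns, pii_columns = [], []
    for name, props in fields:
        if props.get("skip"):
            continue  # skipped fields appear in no output
        pii_name = props.get("pii")
        if pii_name:
            pii_columns.append(pii_name)  # PII output uses the PII name
        if not pii_name or props.get("data"):
            data_columns.append(name)  # non-PII fields, or PII forced with data
    return data_columns, pii_columns


layout_fields = [
    ("claim_id", {}),
    ("fname", {"pii": "first_name"}),
    ("birth_date", {"pii": "dob", "type": "date", "data": True}),
    ("filler", {"skip": True}),
]
data_columns, pii_columns = split_outputs(layout_fields)
```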
All output from SIRAD is in pipe-delimited CSV files, and the pipe character is stripped from all field values.
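A minimal sketch of this output convention using Python's csv module is shown below. write_pipe_delimited is a hypothetical helper; it illustrates why stripping pipes from values keeps the output unambiguous to parse.

```python
import csv
import io


def write_pipe_delimited(rows, out):
    """Write rows as pipe-delimited text, removing any pipe inside a value."""
    writer = csv.writer(out, delimiter="|", lineterminator="\n")
    for row in rows:
        writer.writerow([str(value).replace("|", "") for value in row])


buffer = io.StringIO()
write_pipe_delimited([["record_id", "note"], [1, "a|b"]], buffer)
output = buffer.getvalue()
```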
The sirad process command stages output files in the following output directories (which can be deleted after a successful run of the sirad research command):
Contains an output CSV file corresponding to each layout file, using the basename of the layout file. The field record_id is prepended, and is the row number from the source file (1-based indexing). Only fields that were not marked as pii (plus pii fields with data=True) are included, in the order provided in the fields list.
Contains an output CSV file corresponding to each layout file, using the basename of the layout file. The row order is randomly shuffled relative to the source file, so that the PII files cannot be directly joined to the data files. The field record_id is prepended, which is the row number after random shuffling (1-based indexing). Only fields marked as pii are included, and they are renamed according to the PII name. Additionally, each field marked as ssn has a corresponding _invalid field with the indicator for SSN validation that is appended at the end of the fields.
Contains an output CSV file corresponding to each layout file, using the basename of the layout file. This file contains a record_id field which corresponds to the record_id in the data file, and a pii_id field which corresponds to the record_id in the PII file. This mapping provides a link between the randomly-shuffled PII rows and the data rows.
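The relationship between the three staged files can be sketched as follows. build_link is a hypothetical helper that shows how a shuffle produces a record_id to pii_id mapping; it is not sirad's implementation, and the seed parameter exists only to make the sketch reproducible.

```python
import random


def build_link(n_rows, seed=None):
    """Return (record_id, pii_id) pairs for n_rows source rows."""
    rng = random.Random(seed)
    record_ids = list(range(1, n_rows + 1))  # data row numbers, 1-based
    shuffled = record_ids[:]
    rng.shuffle(shuffled)  # shuffled[i] is the source row written at PII row i + 1
    pii_id_for = {
        record_id: position
        for position, record_id in enumerate(shuffled, start=1)
    }
    return [(record_id, pii_id_for[record_id]) for record_id in record_ids]


link = build_link(5, seed=0)
```

Without the link rows, nothing in the PII file indicates which data row a PII row came from, which is why the link files are needed to join them.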
The sirad research command generates a final, versioned release of de-identified data that can be used in research. It uses the PII_DIR files to construct the sirad_id and perform censuscoding, and the LINK_DIR to map and prepend any fields constructed from the PII to each of the DATA_DIR files.
An output CSV file corresponding to each layout file is written to RESEARCH_DIR, using the basename of the layout file. If the source file contained PII sufficient to construct a sirad_id (first name, last name, DOB) then a sirad_id field is prepended. For each type of address (home, mailing, employer, employer1, employer2, employer3), if the source file contained PII sufficient for censuscoding (address/zip or street/street_num/zip), then a corresponding triplet of anonymous geolocations (_city, _zip, _blkgrp) is prepended for that address type.
As described above, the following transformations are applied in the final output:
A row identifier, called record_id, is added to every output file.
Fields marked as type=date are interpreted according to the format value (which can be a pipe-delimited list of formats), and then transformed to a normalized YYYYMMDD format in the output. Values that cannot be interpreted according to the format string are replaced with nulls, and a warning is printed when the --debug option is used.
All PII fields are removed from the output, unless they are explicitly marked with data=true.
The sirad_id field is added to the output for any file that contains sufficient PII to construct it.
Each field marked as ssn=true has a corresponding _invalid field with the indicator for SSN validation added to the output.
For each set of address PII fields that can be censuscoded, a triplet of (_city, _zip, _blkgrp) fields is added to the output. Even though the original _city and _zip PII fields are dropped from the output (as per the transformation on PII described above), the censuscoder adds normalized versions of these fields back into the output. To normalize _city, characters are converted to upper case and only letter and space characters are retained. To normalize _zip, only digit characters are retained.
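The two normalization rules just described can be sketched with regular expressions. The function names are hypothetical, and the sketch assumes input has already been transliterated to ASCII, as noted earlier for source files.

```python
import re


def normalize_city(value):
    """Upper-case the value and keep only letters and spaces."""
    return re.sub(r"[^A-Z ]", "", value.upper())


def normalize_zip(value):
    """Keep only digit characters."""
    return re.sub(r"[^0-9]", "", value)


city = normalize_city("St. Paul's")
zip_code = normalize_zip("02903-1234")
```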