output_checker is a tool that allows analysts to check whether their code and outputs are ok to export.
- Check a code repo to ensure it's ok to remove from a secure environment
- flag any large files
- function to calculate file size
- flag any long files
- function to calculate file length
- flag any CSVs, or other known data files
- flag any images, artifacts etc.
- scan files for hardcoded datastructures, e.g. embedded data, tokens
- scan files for "entities" e.g. names, or other identifiers
- scan notebooks to ensure they have had outputs cleared
- Check "outputs" to ensure they are ok to remove from a secure environment
- make sure "data" outputs, e.g. CSV/parquet, etc. follow disclosure control rules
- scan output files for entities
- flag any long files
- flag any large files
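The size and length checks in the lists above could be sketched as small predicate functions. This is a minimal illustration, not the project's actual code; the function names and thresholds are assumptions, and the real limits would come from the secure environment's policy.

```python
from pathlib import Path

# Hypothetical thresholds -- real limits would be set by export policy.
MAX_BYTES = 1_000_000   # flag files larger than ~1 MB
MAX_LINES = 500         # flag files longer than 500 lines


def file_size(path):
    """Return the size of a file in bytes."""
    return Path(path).stat().st_size


def file_length(path):
    """Return the number of lines in a text file."""
    with open(path, encoding="utf-8", errors="ignore") as f:
        return sum(1 for _ in f)


def flag_large(path, max_bytes=MAX_BYTES):
    """True when the file exceeds the size limit."""
    return file_size(path) > max_bytes


def flag_long(path, max_lines=MAX_LINES):
    """True when the file exceeds the line-count limit."""
    return file_length(path) > max_lines
```

Each predicate returns a boolean, so the directory-level scanner can treat them uniformly.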
This will work by:
- pointing a checking function at a given directory and indicating if it contains code or data files
- function which scans all files in a directory and applies the necessary checks
- This will return a table detailing each file in which there were problems, and which problems were found
- functionality which returns all failures in a tabular form
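The scan-and-report workflow described above might look like the sketch below, assuming checks are expressed as named predicates. The `scan_directory` name and the `(file, problem)` row shape are illustrative choices, not the repo's actual API.

```python
import os


def scan_directory(directory, checks):
    """Apply each named check to every file under `directory`.

    `checks` maps a problem name to a predicate that takes a file path
    and returns True when the problem is present. Returns a list of
    (file, problem) rows -- a simple tabular form of all failures.
    """
    failures = []
    for root, _dirs, files in os.walk(directory):
        for name in files:
            path = os.path.join(root, name)
            for problem, check in checks.items():
                if check(path):
                    failures.append((path, problem))
    return failures
```

The list of rows could then be handed to something like `pandas.DataFrame` for the tabular report of which problems were found in which files.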
This repo will also contain test cases to show it works correctly. E.g.
- example code files with and without hardcoded datastructures, entities, etc.
- example data files with and without entities, and which pass and fail disclosure control etc.
- For large and long files - we need functions which can generate these on the fly
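Generating the large and long test files on the fly could be as simple as the sketch below; the function names are hypothetical, and the fixtures would be written to a temporary directory during tests rather than stored in the repo.

```python
def make_long_file(path, n_lines):
    """Write a throwaway text file with `n_lines` lines, for testing
    the 'file too long' check without committing big fixtures."""
    with open(path, "w") as f:
        for i in range(n_lines):
            f.write(f"line {i}\n")


def make_large_file(path, n_bytes):
    """Write `n_bytes` of filler, for testing the 'file too large' check."""
    with open(path, "wb") as f:
        f.write(b"x" * n_bytes)
```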
This will contain functions to do a few different checks for both data outputs and code.
data outputs:
code outputs:
- ToDo: Large files
- ToDo: Files which are too long
- ToDo: Entity Recognition
- ToDo: Embedded tables
- ToDo: notebooks without cleared outputs
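Of the checks listed, the notebook one is easy to sketch because `.ipynb` files are JSON, with each code cell carrying an `outputs` list. This is an assumed stand-alone implementation using only the standard library; the real tool might use the `nbformat` package instead.

```python
import json


def notebook_outputs_cleared(path):
    """Return True when every code cell in a .ipynb file has no outputs.

    A cleared notebook has an empty `outputs` list in every code cell;
    markdown cells have no outputs and are skipped.
    """
    with open(path, encoding="utf-8") as f:
        nb = json.load(f)
    return all(
        not cell.get("outputs")
        for cell in nb.get("cells", [])
        if cell.get("cell_type") == "code"
    )
```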
Initially, we need to make simple functions which address the checks above; then we can build other functions which apply them to multiple files.
A core part of the use case is that people can be alerted to what issues there are, and in which files.
It does this by running a few simple checks. The envisaged workflow is:
```mermaid
graph LR
    A[Statistical Disclosure Control] --> B{Values below a threshold?};
    A --> C{Values not rounded?};
    B -->|Pass| D[Ok to output];
    B -->|Fail| E[Identifies problems];
    C -->|Pass| D[Ok to output];
    C -->|Fail| E[Identifies problems];
    click A "https://github.com/SamHollings/output_checker/tree/main/src/disclosure_control_check" "Disclosure Control code" _blank
```
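The two disclosure control checks in the workflow above (values below a threshold, values not rounded) could be sketched like this. The threshold of 10 and rounding base of 5 are purely illustrative; real disclosure rules come from the secure environment's policy.

```python
def disclosure_check(values, threshold=10, rounding=5):
    """Run simple disclosure control checks on a list of counts.

    Returns a list of problem strings; an empty list means the values
    are ok to output. Threshold and rounding base are assumptions,
    not actual disclosure rules.
    """
    problems = []
    # Small non-zero counts risk identifying individuals.
    if any(0 < v < threshold for v in values):
        problems.append("values below the threshold")
    # Unrounded counts can also be disclosive.
    if any(v % rounding != 0 for v in values):
        problems.append("values not rounded")
    return problems
```

Returning a list of named problems, rather than a single pass/fail flag, matches the "Identifies problems" branch of the diagram.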
To start using this project, first make sure your system meets its requirements.
Contributors have some additional requirements!
- Python 3.6.1+ installed
- a `.secrets` file with the required secrets and credentials
- environment variables loaded from `.env`
To install the Python requirements, open your terminal and enter:

`pip install -r requirements.txt`

To run this project, you need a `.secrets` file with secrets/credentials set as environment variables. The secrets/credentials should have the following environment variable name(s):
| Secret/credential | Environment variable name | Description |
|---|---|---|
| Secret 1 | `SECRET_VARIABLE_1` | Plain English description of Secret 1. |
| Credential 1 | `CREDENTIAL_VARIABLE_1` | Plain English description of Credential 1. |
Once you've added these, load the environment variables using `.env`.
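Loading variables from a `.env` file is commonly done with the `python-dotenv` package's `load_dotenv()`; a minimal stdlib-only stand-in, shown purely to illustrate the mechanism, could look like this (the function name is hypothetical):

```python
import os


def load_env_file(path=".env"):
    """Read KEY=VALUE lines from a .env-style file into os.environ.

    A sketch of what `python-dotenv` does: blank lines and comments are
    skipped, and existing environment variables are not overwritten.
    """
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            os.environ.setdefault(key.strip(), value.strip())
```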
Unless stated otherwise, the codebase is released under the MIT License. This covers both the codebase and any sample code in the documentation. The documentation is © Crown copyright and available under the terms of the Open Government Licence v3.0.
If you want to help us build and improve output_checker, view our contributing guidelines.
This project structure is based on the govcookiecutter template project.