This repository contains scripts to calculate data quality for datasets used in the Pacific Salmon Strategy Initiative (PSSI) Data Portal. Tests are organized into dimensions, and tests that evaluate similar aspects of data quality are conceptually grouped as metrics. Each test can be run using a Jupyter Notebook or a no-code UI tool on CSV or XLSX datasets.
For instructions on setting up the repository and running tests, see the Getting Started section.
This framework is open source, and contributions are welcome. Check out the CONTRIBUTING page to add new tests or customize existing ones.
Data quality tests are divided into 8 dimensions:
- Accessibility
- Accuracy
- Completeness
- Consistency
- Interdependency
- Relevance
- Timeliness
- Uniqueness
See the Tests Reference Table for a complete list of runnable tests.
For full details on each test, see the Detailed Tests page.
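To give a feel for what a dimension measures, here is a minimal sketch of two checks in the spirit of the Completeness and Uniqueness dimensions. It is not taken from the repository: the function names, the sample data, and the scoring rule (a ratio between 0 and 1) are all assumptions made for illustration, and the actual tests may be defined differently.

```python
import csv
import io

# Hypothetical sample data; an empty cell is treated as a missing value.
SAMPLE_CSV = """site_id,species,count
A1,Chinook,12
A2,,7
A1,Chinook,12
A3,Coho,
A4,,3
"""


def completeness(rows, column):
    """Share of rows with a non-empty value in `column` (0.0 to 1.0)."""
    if not rows:
        return 0.0
    filled = sum(1 for row in rows if (row[column] or "").strip())
    return filled / len(rows)


def uniqueness(rows):
    """Share of rows that are distinct across all columns (0.0 to 1.0)."""
    if not rows:
        return 0.0
    distinct = {tuple(row.items()) for row in rows}
    return len(distinct) / len(rows)


# Parse the in-memory CSV into a list of dicts, one per data row.
rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))

print(f"completeness(species) = {completeness(rows, 'species'):.2f}")
print(f"completeness(count)   = {completeness(rows, 'count'):.2f}")
print(f"uniqueness            = {uniqueness(rows):.2f}")
```

The sketch uses only the standard library so it runs without extra dependencies. XLSX input would need a spreadsheet reader, which is left out here.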
Set up the repository and run tests.
- Python 3.10 or later. See instructions.
- Git. See instructions.
- Jupyter Notebook or JupyterLab. See instructions.
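A quick way to confirm the prerequisites above before continuing is a short script like the following. This is only a sketch and not part of the repository: the helper names `python_ok` and `tool_on_path` are hypothetical, and it assumes that Git and Jupyter expose `git` and `jupyter` executables on your `PATH`.

```python
import shutil
import sys

MIN_PYTHON = (3, 10)  # minimum version stated in the prerequisites


def python_ok(version_info=sys.version_info):
    """True if the given interpreter version meets the stated minimum."""
    return tuple(version_info[:2]) >= MIN_PYTHON


def tool_on_path(name):
    """True if an executable called `name` is found on PATH."""
    return shutil.which(name) is not None


if __name__ == "__main__":
    print(f"Python >= 3.10: {python_ok()}")
    # Assumed executable names; adjust if your installation differs.
    for tool in ("git", "jupyter"):
        print(f"{tool} on PATH: {tool_on_path(tool)}")
```

If any line reports `False`, follow the corresponding installation instructions before cloning the repository.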
- Fork the repository and clone it locally

  In a terminal, run:

      git clone https://github.com/dfo-mpo/DataQuality.git
      cd DataQuality

- Choose how to run tests