diff --git a/README.md b/README.md index 73c00b6a8..f16a19b0f 100644 --- a/README.md +++ b/README.md @@ -6,13 +6,19 @@ PIXL Image eXtraction Laboratory `PIXL` is a system for extracting, linking and de-identifying DICOM imaging data, structured EHR data and free-text data from radiology reports at UCLH. -Please see the [rolling-skeleton]([https://github.com/SAFEHR-data/the-rolling-skeleton=](https://github.com/SAFEHR-data/the-rolling-skeleton/blob/main/docs/design/100-day-design.md)) for more details. -PIXL is intended run on one of the [GAE (General Application Environments)](https://github.com/SAFEHR-data/Book-of-FlowEHR/blob/main/glossary.md#gaes)s and comprises -several services orchestrated by [Docker Compose](https://docs.docker.com/compose/). +It comprises several services orchestrated by [Docker Compose](https://docs.docker.com/compose/). + +
UCLH SPECIFIC + +PIXL is intended to run on one of the [GAE (General Application Environments)](https://github.com/SAFEHR-data/Book-of-FlowEHR/blob/main/glossary.md#gaes)s. To get access to the GAE, [see the documentation on Slab](https://uclh.slab.com/posts/gae-access-7hkddxap). -Please request access to Slab and add further details in a [new blank issue](https://github.com/SAFEHR-data/PIXL/issues/new). + +Please request access to Slab and add further details in a [new blank issue](https://github.com/SAFEHR-data/PIXL/issues/new). + +
+ ## Installation in production @@ -66,7 +72,7 @@ destination. Provides helper functions for de-identifying DICOM data -### PostgreSQL +### [PostgreSQL](.postgres/README.md) RDBMS which stores DICOM metadata, application data and anonymised patient record data. @@ -78,7 +84,7 @@ HTTP API to export files (parquet and DICOM) from UCLH to endpoints. HTTP API to process messages from the `imaging` queue and populate the raw orthanc instance with images from PACS/VNA. -## Setup `PIXL` in GAE +## Setup `PIXL`
Click here to expand steps and configurations @@ -208,7 +214,7 @@ These variables can be set in the `.env` file. For testing, they can be set in the `test/.secrets.env` file. For dev purposes find the `pixl-dev-secrets.env` note on LastPass for the necessary values. -If an Azure Keyvault hasn't been set up yet, follow [these instructions](./docs/setup/azure-keyvault.md). +At UCLH, if an Azure Keyvault hasn't been set up yet, follow [these instructions](./docs/setup/azure-keyvault.md). A second Azure Keyvault is used to store hashing keys and salts for the `hasher` service. This keyvault is configured with the following environment variables: @@ -224,7 +230,7 @@ See the [hasher documentation](./hasher/README.md) for more information.
-## Run `PIXL` in GAE +## Run `PIXL`
Click here to view detailed steps @@ -284,6 +290,9 @@ test/resources/omop/public /*.parquet ### OMOP ES extract dir (input to PIXL) +>[!NOTE] +> OMOP ES is the tool used to extract Electronic Health Records that may be linked to images. + EXTRACT_DIR is the directory passed to `pixl populate` as the input `PARQUET_PATH` argument. ``` @@ -294,8 +303,8 @@ EXTRACT_DIR/public /*.parquet ### PIXL Export dir (PIXL intermediate) -The directory where PIXL will copy the public OMOP extract files (which now contain -the radiology reports) to. +The directory where PIXL will copy the public OMOP extract files and the radiology reports. + These files will subsequently be uploaded to the `parquet` destination specified in the [project config](#3-configure-a-new-project). @@ -316,10 +325,63 @@ FTPROOT/PROJECT_SLUG/EXTRACT_DATETIME/parquet/radiology/radiology.parquet ..............................................omop/public/*.parquet ``` -## :octocat: Cloning repository -* Generate your SSH keys as suggested [here](https://docs.github.com/en/github/authenticating-to-github/generating-a-new-ssh-key-and-adding-it-to-the-ssh-agent) -* Clone the repository by typing (or copying) the following lines in a terminal -``` -git clone git@github.com:SAFEHR-data/PIXL.git -``` +## 'PIXL' Directory Contents + +
+ + +

Subdirectories with links to the relevant README

+ +
+ + +[bin](./bin/README.md) + +[cli](./cli/README.md) + +[docker](./docker/README.md) + +[docs](./docs/README.md) + +[hasher](./hasher/README.md) + +[orthanc](./orthanc/README.md) + +[pixl_core](./pixl_core/README.md) + +[pixl_dcmd](./pixl_dcmd/README.md) + +[pixl_export](./pixl_export/README.md) + +[pixl_imaging](./pixl_imaging/README.md) + +[postgres](./postgres/README.md) + +[projects](./projects/README.md) + +[pytest-pixl](./pytest-pixl/README.md) + +[schemas](./schemas/README.md) + +[scripts](./scripts/README.md) + +[test](./test/README.md) +
+
+ + +### Files + + + +| **Configuration** | **User docs** | **Housekeeping** | +| :--- | :--- | :--- | +| .env.sample | CODE_OF_CONDUCT.md | .renovaterc.json5 | +| .pre-commit-config.yaml | CONTRIBUTING.md | codecov.yml | +| docker-compose.yml | LICENSE | | +| mypy.ini | NOTICE | | +| pytest.ini | README.md | | +| ruff.toml | | | +| template_config.yaml | | | +
\ No newline at end of file diff --git a/docs/README.md b/docs/README.md new file mode 100644 index 000000000..57025896a --- /dev/null +++ b/docs/README.md @@ -0,0 +1,18 @@ +## 'PIXL/docs' Directory Contents + +
+ +

Subdirectories with links to the relevant README

+ +
+ +[archive](./archive/README.md) + +[design](./design/README.md) + +[developer](./developer/README.md) + +[joss-publication](./joss-publication/README.md) + +
+ diff --git a/docs/archive/PIXLv1/Considerations.md b/docs/archive/PIXLv1/Considerations.md new file mode 100644 index 000000000..f24bdec52 --- /dev/null +++ b/docs/archive/PIXLv1/Considerations.md @@ -0,0 +1,15 @@ +## Risks and Considerations + +### Technical Risks +The primary technical risk is overburdening the PACS & VNA and causing an adverse impact on the operational performance of these systems. +To mitigate this risk, queries will be managed with a task queue. The system will enforce rate limiting of any commands sent to the PACS & VNA with an adapted [token bucket](https://en.wikipedia.org/wiki/Token_bucket) algorithm which can be adjusted at runtime in response to system load. A [circuit breaker](https://en.wikipedia.org/wiki/Circuit_breaker_design_pattern) will wrap the retrieval processes and act as a fail-safe. Individual request retries will be subject to an [exponential backoff](https://en.wikipedia.org/wiki/Exponential_backoff) strategy. + + +### Security Considerations +#### Inbound access to the Cloud Environment in Azure +It is expected that a VPN connection (or ExpressRoute connection) between the on-prem UCLH estate and Azure will not initially be available. +Point-to-point firewall restrictions and Azure access tokens will manage secure communication between PIXL and the DICOM service. + +#### Outbound access +All outbound connections will be over HTTPS. +The existing proxy service will be relied upon to manage outbound access from the GAE. \ No newline at end of file diff --git a/docs/archive/PIXLv1/DICOM_tags.md b/docs/archive/PIXLv1/DICOM_tags.md new file mode 100644 index 000000000..a96f6dc9f --- /dev/null +++ b/docs/archive/PIXLv1/DICOM_tags.md @@ -0,0 +1,234 @@ +# CXR tags - v1.0.3 + +The following tables list the DICOM tags with associated operations to produce the whitelist. + +## Post-anonymisation + +- All private tags will be deleted. +- All data in `(60xx,xxxx)` (Overlays) will be removed. 
+ + +| Tag | Attribute | Op | +| ----------- | ----------------------------------------- | - | +| (0008,0005) | Specific Character Set | keep | +| (0008,0008) | Image Type | keep | +| (0008,0016) | SOP Class UID | keep | +| (0008,0018) | SOP Instance UID | change | +| (0008,0020) | Study Date | keep | +| (0008,0021) | Series Date | keep | +| (0008,0022) | Acquisition Date | keep | +| (0008,0023) | Image Date | keep | +| (0008,002a) | Acquisition Date Time | date-shift | +| (0008,0030) | Study Time | date-shift | +| (0008,0031) | Series Time | date-shift | +| (0008,0032) | Acquisition Time | date-shift | +| (0008,0033) | Image Time | date-shift | +| (0008,0050) | Accession Number | secure-hash | +| (0008,0060) | Modality | keep | +| (0008,0061) | Modalities In Study | keep | +| (0008,0070) | Manufacturer | keep | +| (0008,1030) | Study Description | keep | +| (0008,103e) | Series Description | keep | +| (0008,1090) | Manufacturers Model Name | keep | +| (0010,0010) | Patients Name | secure-hash | +| (0010,0020) | Patient ID | fixed | +| (0010,1010) | Patients Age | num-range | +| (0010,1020) | Patients Size | keep | +| (0010,1030) | Patients Weight | keep | +| (0018,0015) | Body Part Examined | keep | +| (0018,0060) | kVp | keep | +| (0018,1020) | Software Version | keep | +| (0018,1149) | Field Of View Dimension | keep | +| (0018,1150) | Exposure Time | keep | +| (0018,1151) | X Ray Tube Current | keep | +| (0018,1152) | Exposure | keep | +| (0018,1153) | Exposure In Uas | keep | +| (0018,115e) | Image Area Dose Product | keep | +| (0018,1164) | Imager Pixel Spacing | keep | +| (0018,1166) | Grid | keep | +| (0018,1190) | Focal Spot | keep | +| (0018,1400) | Acquisition Device Processing Description | keep | +| (0018,1411) | Exposure Index | keep | +| (0018,1412) | Target Exposure Index | keep | +| (0018,1413) | Deviation Index | keep | +| (0018,1508) | Positioner Type | keep | +| (0018,1700) | Collimator Shape | keep | +| (0018,1720) | Vertices Of The Polygonal 
Collimator | keep | +| (0018,5101) | View Position | keep | +| (0018,6000) | Sensitivity | keep | +| (0018,7001) | Detector Temperature | keep | +| (0018,7004) | Detector Type | keep | +| (0018,7005) | Detector Configuration | keep | +| (0018,700a) | Detector ID | keep | +| (0018,701a) | Detector Binning | keep | +| (0018,7020) | Detector Element Physical Size | keep | +| (0018,7022) | Detector Element Spacing | keep | +| (0018,7024) | Detector Active Shape | keep | +| (0018,7026) | Detector Active Dimensions | keep | +| (0018,7030) | Field Of View Origin | keep | +| (0018,7032) | Field Of View Rotation | keep | +| (0018,7034) | Field Of View Horizontal Flip | keep | +| (0018,704c) | Grid Focal Distance | keep | +| (0018,7060) | Exposure Control Mode | keep | +| (0020,000d) | Study Instance UID | change | +| (0020,000e) | Series Instance UID | change | +| (0020,0010) | Study ID | change | +| (0020,0011) | Series Number | keep | +| (0020,0013) | Image Number | keep | +| (0020,0020) | Patient Orientation | keep | +| (0020,0062) | Image Laterality | keep | +| (0028,0002) | Samples Per Pixel | keep | +| (0028,0004) | Photometric Interpretation | keep | +| (0028,0010) | Rows | keep | +| (0028,0011) | Columns | keep | +| (0028,0030) | Pixel Spacing | keep | +| (0028,0100) | Bits Allocated | keep | +| (0028,0101) | Bits Stored | keep | +| (0028,0102) | High Bit | keep | +| (0028,0103) | Pixel Representation | keep | +| (0028,0300) | Quality Control Image | keep | +| (0028,0301) | Burned In Annotation | keep | +| (0028,0a02) | Pixel Spacing Calibration Type | keep | +| (0028,0a04) | Pixel Spacing Calibration Description | keep | +| (0028,1040) | Pixel Intensity Relationship | keep | +| (0028,1041) | Pixel Intensity Relationship Sign | keep | +| (0028,1050) | Window Center | keep | +| (0028,1051) | Window Width | keep | +| (0028,1052) | Rescale Intercept | keep | +| (0028,1053) | Rescale Slope | keep | +| (0028,1054) | Rescale Type | keep | +| (0028,1055) | Window Center 
And Width Explanation | keep | +| (0028,2110) | Lossy Image Compression | keep | +| (0028,3010) | VOI LUT Sequence | keep | +| (0054,0220) | View Code Sequence | keep | + +## All tags + +| Tag | Attribute | Op | +| ----------- | ----------------------------------------- | - | +| (0008,0005) | Specific Character Set | keep | +| (0008,0008) | Image Type | keep | +| (0008,0016) | SOP Class UID | keep | +| (0008,0018) | SOP Instance UID | change | +| (0008,0020) | Study Date | keep | +| (0008,0021) | Series Date | keep | +| (0008,0022) | Acquisition Date | keep | +| (0008,0023) | Image Date | keep | +| (0008,002a) | Acquisition Date Time | date-shift | +| (0008,0030) | Study Time | date-shift | +| (0008,0031) | Series Time | date-shift | +| (0008,0032) | Acquisition Time | date-shift | +| (0008,0033) | Image Time | date-shift | +| (0008,0050) | Accession Number | secure-hash | +| (0008,0060) | Modality | keep | +| (0008,0061) | Modalities In Study | keep | +| (0008,0068) | Presentation Intent Type | delete | +| (0008,0070) | Manufacturer | keep | +| (0008,0080) | Institution Name | delete | +| (0008,0081) | Institution Address | delete | +| (0008,0090) | Referring Physicians Name | delete | +| (0008,1010) | Station Name | delete | +| (0008,1030) | Study Description | keep | +| (0008,103e) | Series Description | keep | +| (0008,1040) | Institutional Department Name | delete | +| (0008,1050) | Performing Physicians Name | delete | +| (0008,1070) | Operators Name | delete | +| (0008,1090) | Manufacturers Model Name | keep | +| (0008,1110) | Referenced Study Sequence | delete | +| (0008,1120) | Referenced Patient Sequence | delete | +| (0008,2112) | Source Image Sequence | delete | +| (0008,2218) | Anatomic Region Sequence | delete | +| (0008,3010) | Irradiation Event UID | delete | +| (0010,0010) | Patients Name | secure-hash | +| (0010,0020) | Patient ID | fixed | +| (0010,0021) | Issuer Of Patient ID | delete | +| (0010,0030) | Patients Birth Date | delete | +| 
(0010,0032) | Patients Birth Time | delete | +| (0010,0040) | Patients Sex | delete | +| (0010,1001) | Other Patient Names | delete | +| (0010,1010) | Patients Age | num-range | +| (0010,1020) | Patients Size | keep | +| (0010,1030) | Patients Weight | keep | +| (0010,2000) | Medical Alerts | delete | +| (0010,2110) | Contrast Allergies | delete | +| (0010,4000) | Patient Comments | delete | +| (0011,0010) | Private Creator Data Element | delete | +| (0018,0015) | Body Part Examined | keep | +| (0018,0060) | kVp | keep | +| (0018,1020) | Software Version | keep | +| (0018,1030) | Protocol Name | delete | +| (0018,1149) | Field Of View Dimension | keep | +| (0018,1150) | Exposure Time | keep | +| (0018,1151) | X Ray Tube Current | keep | +| (0018,1152) | Exposure | keep | +| (0018,1153) | Exposure In Uas | keep | +| (0018,115e) | Image Area Dose Product | keep | +| (0018,1164) | Imager Pixel Spacing | keep | +| (0018,1166) | Grid | keep | +| (0018,1190) | Focal Spot | keep | +| (0018,1400) | Acquisition Device Processing Description | keep | +| (0018,1411) | Exposure Index | keep | +| (0018,1412) | Target Exposure Index | keep | +| (0018,1413) | Deviation Index | keep | +| (0018,1508) | Positioner Type | keep | +| (0018,1700) | Collimator Shape | keep | +| (0018,1720) | Vertices Of The Polygonal Collimator | keep | +| (0018,5101) | View Position | keep | +| (0018,6000) | Sensitivity | keep | +| (0018,7001) | Detector Temperature | keep | +| (0018,7004) | Detector Type | keep | +| (0018,7005) | Detector Configuration | keep | +| (0018,700a) | Detector ID | keep | +| (0018,701a) | Detector Binning | keep | +| (0018,7020) | Detector Element Physical Size | keep | +| (0018,7022) | Detector Element Spacing | keep | +| (0018,7024) | Detector Active Shape | keep | +| (0018,7026) | Detector Active Dimensions | keep | +| (0018,7030) | Field Of View Origin | keep | +| (0018,7032) | Field Of View Rotation | keep | +| (0018,7034) | Field Of View Horizontal Flip | keep | +| 
(0018,704c) | Grid Focal Distance | keep | +| (0018,7060) | Exposure Control Mode | keep | +| (0020,000d) | Study Instance UID | change | +| (0020,000e) | Series Instance UID | change | +| (0020,0010) | Study ID | change | +| (0020,0011) | Series Number | keep | +| (0020,0013) | Image Number | keep | +| (0020,0020) | Patient Orientation | keep | +| (0020,0062) | Image Laterality | keep | +| (0020,1208) | Number Of Study Related Images | delete | +| (0028,0002) | Samples Per Pixel | keep | +| (0028,0004) | Photometric Interpretation | keep | +| (0028,0010) | Rows | keep | +| (0028,0011) | Columns | keep | +| (0028,0030) | Pixel Spacing | keep | +| (0028,0100) | Bits Allocated | keep | +| (0028,0101) | Bits Stored | keep | +| (0028,0102) | High Bit | keep | +| (0028,0103) | Pixel Representation | keep | +| (0028,0300) | Quality Control Image | keep | +| (0028,0301) | Burned In Annotation | keep | +| (0028,0a02) | Pixel Spacing Calibration Type | keep | +| (0028,0a04) | Pixel Spacing Calibration Description | keep | +| (0028,1040) | Pixel Intensity Relationship | keep | +| (0028,1041) | Pixel Intensity Relationship Sign | keep | +| (0028,1050) | Window Center | keep | +| (0028,1051) | Window Width | keep | +| (0028,1052) | Rescale Intercept | keep | +| (0028,1053) | Rescale Slope | keep | +| (0028,1054) | Rescale Type | keep | +| (0028,1055) | Window Center And Width Explanation | keep | +| (0028,2110) | Lossy Image Compression | keep | +| (0028,3010) | VOI LUT Sequence | keep | +| (0038,0300) | Current Patient Location | delete | +| (0038,0500) | Patient State | delete | +| (0040,0244) | Performed Procedure Start Date | delete | +| (0040,0245) | Performed Procedure Start Time | delete | +| (0040,0253) | Performed Procedure Step ID | delete | +| (0040,0254) | Performed Procedure Step Description | delete | +| (0040,0260) | Performed Action Item Sequence | delete | +| (0040,0275) | Request Attributes Sequence | delete | +| (0040,0555) | Acquisition Context Sequence | 
delete | +| (0040,1008) | Confidentiality Code | delete | +| (0045,0010) | Private Creator Data Element | delete | +| (0054,0220) | View Code Sequence | keep | \ No newline at end of file diff --git a/docs/archive/PIXLv1/De-identification.md b/docs/archive/PIXLv1/De-identification.md new file mode 100644 index 000000000..b4d4fc4a7 --- /dev/null +++ b/docs/archive/PIXLv1/De-identification.md @@ -0,0 +1,42 @@ +# De-identification +![De-identification Logical Data Flow Diagram](./diagrams/PIXL-De-identification_Data_Flow.drawio.png) +The technical elements of the de-identification process are purposefully limited for the [MVP](MVP.md). +The strategy is to initially lean on the other 4 Safes rather than attempt to immediately produce a gold-plated automated anonymisation process for both pixel data and every possible private DICOM element. This approach allows standing up of the infrastructure scaffolding around which more complex de-identification strategies will be constructed for subsequent versions. + +## Hashing +As is standard practice, a hashing function will be used to de-identify any personal identifiers which need to retain their uniqueness in relation to the rest of the data. +The [BLAKE2](https://en.wikipedia.org/wiki/BLAKE_(hash_function)#BLAKE2) hashing function will be used. It is an efficient cryptographic hashing function and the original algorithm was a SHA-3 finalist. BLAKE2 is used in the Linux kernel RNG and is the hashing function for the WireGuard VPN. +To mitigate the risk of a brute force attack in the rare event of an identifier leak, the hashing function will be used as part of a Hash-based Message Authentication Code primitive, or in so-called "keyed hashing mode". This usage pattern involves the addition of a secret key as part of the digest generation process. The secret key will be stored in Azure Key Vault. We will refer to this process as _secure hashing_ to distinguish it from simply applying a hashing function. 
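A minimal sketch of the _secure hashing_ described above, using the keyed mode of `hashlib.blake2b` from the Python standard library. The function name `secure_hash` and the sample identifier are hypothetical, and the key is generated locally for illustration only; in the design above it would be fetched from Azure Key Vault. The 48-byte digest, 64-byte key and base64 encoding follow the `secure-hash` operation as specified in this document.

```python
import base64
import hashlib
import secrets


def secure_hash(identifier: str, key: bytes) -> str:
    """Return the base64-encoded, 48-byte keyed BLAKE2b digest of an identifier.

    `secure_hash` is a hypothetical helper name, not part of PIXL.
    """
    digest = hashlib.blake2b(
        identifier.encode("utf-8"),
        key=key,          # secret key; BLAKE2b accepts up to 64 bytes
        digest_size=48,   # 48-byte digest, as in the operations table
    ).digest()
    return base64.b64encode(digest).decode("ascii")


if __name__ == "__main__":
    # Assumption: a locally generated key stands in for the Key Vault secret.
    demo_key = secrets.token_bytes(64)
    print(secure_hash("example-patient-id", demo_key))
```

Because the key enters the digest computation, someone who holds a leaked hash but not the key cannot confirm a guessed identifier by hashing it themselves, which is the brute-force mitigation the paragraph above relies on.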
+ +## Primary Identifiers +After filtering on opt-out status, the primary patient identifier will be securely hashed and stored alongside the original in an on-prem PostgreSQL database. +This lookup database will be for tracing and debugging purposes and will not be made available to researchers. + +## DICOM Instances +The [MVP](MVP.md) will be limited to the de-identification of DICOM elements and disclosure risk from pixel data will be managed via [policies](./Referenced_notes/disclosure-incidents.md) for the FlowEHR platform. +Extensive DICOM metadata is not required for the NGT-CXR project so a whitelisting strategy will be pursued for including elements. + +A configuration file will be created to specify the list of DICOM elements to include in the output and the transformations to apply to each. +The following operations will be available for processing DICOM elements: + +**Operation** | **Effect** +---|--- +keep | Retain as is +fixed | Replace with a constant +change | Replace with a new GUID +date-floor | Truncate a date or date time to the start of the day +date-shift | Shift a date or date time by the specified number of days (applied at study scope) +num-range | Floor & ceiling limits for a numerical value +hash | BLAKE2 hashing with 48-byte digest (base64 encoded) +secure-hash | BLAKE2 keyed hashing with 48-byte digest and 64-byte key from Azure Key Vault (base64 encoded) + +The list of elements and corresponding transformations within scope of the MVP can be found [here](DICOM_tags). Instances will be stripped of all other elements not included in the whitelist. + +Overlays will be removed from DICOM instances but no "pixel scrubbing" will be performed for the MVP. More advanced techniques for removing text which has been embedded in images will be developed in a subsequent version. + +## Radiology Reports +Developing custom NLP de-identification techniques lies outside the focus of the PIXL project. 
+The NGT-CXR team have obtained ethics approval to use the [Microsoft Presidio](https://github.com/microsoft/presidio/) OSS data anonymisation tool for de-identification of free text data. It is however important to note that no such tool can guarantee the ability to remove all sensitive information and remaining disclosure risk will need to be managed by policies & procedures. + +## Electronic Health Record +No data points which would present a risk of _direct_ identification will be retrieved from the EHR. \ No newline at end of file diff --git a/docs/archive/PIXLv1/Design.md b/docs/archive/PIXLv1/Design.md new file mode 100644 index 000000000..726524af2 --- /dev/null +++ b/docs/archive/PIXLv1/Design.md @@ -0,0 +1,13 @@ + +## System Context +![System Context](./diagrams/PIXL-System_Context.drawio.png) + + +## Components +![Components](./diagrams/PIXL-Components.drawio.png) + + +---- + +More info in +[DICOM Solution Design](PIXL-FlowEHR_Dicom-Solution-Design.md) and [FlowEHR-PIXL MVP Proposal](PIXL-FlowEHR_Solution-Design.md) \ No newline at end of file diff --git a/docs/archive/PIXLv1/Home.md b/docs/archive/PIXLv1/Home.md new file mode 100644 index 000000000..c2b0508dd --- /dev/null +++ b/docs/archive/PIXLv1/Home.md @@ -0,0 +1,13 @@ +# PIXL Image eXtraction Laboratory + +## Summary + +`PIXL` is a system for extracting, linking and de-identifying DICOM imaging data, structured EHR data and free-text data from radiology reports at UCLH. + +The initial aim is to prove the concept of ingestion and processing of linked multi modal data into a useful and secure environment in the cloud which enables medical imaging ML research at scale. + +The follow-on ambition is to progress from retrospective research and harness linked imaging data for innovation through direct improvement in patient care and increased efficiency in the health system. + +Initial effort is focused on building an [MVP](MVP.md) to serve a single use case. 
+The UCLH and Microsoft Research collaboration includes the NGT-CXR "Radiology Co-pilot" project which will act as our exemplar. See the [Research Programme Plan](./Referenced_notes/UCLH_MSR_Research_Programme_Plan_COPY_12_09_2022.docx) for further details about this project. + diff --git a/docs/archive/PIXLv1/MVP.md b/docs/archive/PIXLv1/MVP.md new file mode 100644 index 000000000..a26c367ff --- /dev/null +++ b/docs/archive/PIXLv1/MVP.md @@ -0,0 +1,30 @@ +# The PIXL MVP + +## Deliverables +Delivering the `PIXL` MVP requires that + +0. We pre-specify the cohort of patients who received chest X-rays (estimated ~300k images) +1. We intelligently query the PACS/VNA for the chest X-rays from this cohort without causing operational systems to fall over +1. We automatically de-identify DICOM elements with a simple whitelisting approach and removal of PII overlays +1. We automatically push DICOM instances to a DICOM node in Azure via DICOMweb +1. We extract EHR and free-text radiology reports for the specified cohort +1. We de-identify free-text radiology reports with Presidio +1. We de-identify PII EHR attributes using a blacklisting approach +1. We link de-identified data securely +1. We automatically push radiology reports & EHR data into Delta Lake on Azure +1. We automatically ingest data from Delta Lake into the Feathr feature store +1. We provide controlled access to the DICOM node and the Feathr feature store to an Azure TRE workspace +1. We offer useful written guidance to the research team +1. 
We provide workable policies and SOPs for managing image extraction via the PIXL pipeline + + +## Out of Scope +- Real-time data ingestion +- DICOM pixel data de-identification +- Identifiable patient data in Azure +- Perfection + + +## Target Delivery Date +~~mid-November 2022~~ +mid December 2022 \ No newline at end of file diff --git a/docs/archive/PIXLv1/PIXL-FlowEHR_Dicom-Solution-Design.md b/docs/archive/PIXLv1/PIXL-FlowEHR_Dicom-Solution-Design.md new file mode 100644 index 000000000..91b399b67 --- /dev/null +++ b/docs/archive/PIXLv1/PIXL-FlowEHR_Dicom-Solution-Design.md @@ -0,0 +1,48 @@ +# DICOM Service solution design + +## Overview +![High-level diagram](./diagrams/PIXL-FlowEHR_DICOM_service.drawio.png?raw=true "High-level diagram of DICOM service.") + +## References +- [REST API of Orthanc](https://book.orthanc-server.com/users/rest.html) +- [Performing Query/Retrieve (C-Find) and Find with REST](https://book.orthanc-server.com/users/rest.html#performing-query-retrieve-c-find-and-find-with-rest) +- [Performing Retrieve (C-Move)](https://book.orthanc-server.com/users/rest.html#performing-retrieve-c-move) +- [Quick reference of the REST API of Orthanc](https://book.orthanc-server.com/users/rest-cheatsheet.html#cheatsheet) +- [Python plugin for Orthanc](https://book.orthanc-server.com/plugins/python.html#id1) +- [DICOM Standard - Part 15](https://dicom.nema.org/medical/dicom/current/output/pdf/part15.pdf) + +## Steps + +### Orthanc `raw` +- [ ] Set up Orthanc `raw` with Postgres plugin. Configure to be index only, i.e. images stored on disk. +- [ ] Set up PACS/VNA as Q/R target. +- [ ] Perform C-FIND on PACS/VNA via `raw` API to collate dataset (if required). +- [ ] Retrieve dataset (C-MOVE) using API: +``` +For each series/instance do: + - Issue C-MOVE from PACS/VNA to raw. + - Verify move success. + - Remove series/instance from queue/list. +``` + +### Orthanc `anon` +- [ ] Set up Orthanc `anon` with DICOMweb plugin. +- [ ] Set up `raw` as Q/R target. 
+- [ ] Set up `Azure DICOM service` as DICOMweb endpoint +- [ ] Poll `raw` for new instances via API: +``` +For each instance do: + - C-MOVE from raw to anon. + - Get patient UID from hasher. +``` +- [ ] Python Plugin written to execute on `ReceivedInstanceCallback`. See [Modifying received instances (new in 4.0)](https://book.orthanc-server.com/plugins/python.html?highlight=python#id33) + +### Callback plugin +``` +For each received instance do: + - GET anon. patient UID from hasher API. + - Apply anonymisation. + - Send via STOW to target. + - Check instance received via WADO (?) + - Drop Instance (i.e. do not write to disk) +``` diff --git a/docs/archive/PIXLv1/PIXL-FlowEHR_Solution-Design.md b/docs/archive/PIXLv1/PIXL-FlowEHR_Solution-Design.md new file mode 100644 index 000000000..f86d022d6 --- /dev/null +++ b/docs/archive/PIXLv1/PIXL-FlowEHR_Solution-Design.md @@ -0,0 +1,192 @@ +# Summary Overview +UCLH intends to develop the capability to enable safe medical imaging research at scale by utilising cloud infrastructure. + The collaboration agreement between UCLH and Microsoft Research presents an opportunity to drive this capability building through the NGT-CXR "Radiology Co-pilot" project. See the [Research Programme Plan](./Referenced_notes/UCLH_MSR_Research_Programme_Plan_COPY_12_09_2022.docx) for further details about this project. + +# Objective +The primary objective is to deliver a technical system and associated governance policies and procedures, driven by the [Five Safes framework](https://ukdataservice.ac.uk/help/secure-lab/what-is-the-five-safes-framework/), which provides safe researcher access to _effectively anonymised_ UCLH medical imaging data. 
+This will be achieved through +- Suitable de-identification of configurable batches of DICOM images, radiology reports, demographics & selected observations (`Safe Data`) without adverse impact on operational imaging & data systems; +- Storing de-identified data on a DICOM node and EHR data & reports in a Data Lake, both running in Azure and governed by Role-Based Access Control (`Safe Setting`); +- Instantiating a Trusted Research Environment workspace specific to the approved NGT-CXR project and providing a clear disclosure risk incident reporting procedure (`Safe Project`); +- Managing access from the TRE workspace to the DICOM node and Data Lake with Azure Active Directory users & group membership for a trained research team with UCLH honorary contracts (`Safe People`); +- Requiring broad stakeholder consensus for any egress from the TRE workspace, including data and model artifacts (`Safe Outputs`). + +# Scope +Scope will be limited to batch processing of specific but configurable cohorts and a limited set of associated EHR data points. +The system will comply with the [NHS National Data Opt-out](https://digital.nhs.uk/services/national-data-opt-out/compliance-with-the-national-data-opt-out/check-for-national-data-opt-outs-service). +The purpose is to serve approved & trained researchers with UCLH honorary contracts (Safe People), in an environment under strict access control (Safe Setting) with restriction on extraction of any artifacts from the environment (Safe Outputs). +Consequently the system will not guarantee perfect anonymisation to a level where data can be made available for public release. +As per ICO guidelines, pseudonymised data will be considered _effectively anonymised_ within the TRE workspace as researchers will not have access to any link tables or cryptographic keys. 
See chapters 1-3 of the latest ICO [guidance on anonymisation, pseudonymisation & privacy enhancing technologies](https://ico.org.uk/about-the-ico/ico-and-stakeholder-consultations/ico-call-for-views-anonymisation-pseudonymisation-and-privacy-enhancing-technologies-guidance/). +Furthermore, the reality is that de-identification of DICOM data is inherently imperfect due to a possibility of Personally Identifiable Information leaking via private DICOM elements or burnt into images. Hence the system must include policies & procedures in addition to technical controls in order to provide a robust solution. +The automated de-identification process will initially be limited to DICOM elements, radiology reports and PII in the EHR. It will be assumed that images are free from burn-in. Disclosure risk from burn-in will be addressed via procedures while subsequent versions will address pixel de-identification via an automated technical solution. + +# Technical Requirements +- Compute & storage resources on a GAE for processing small batches of DICOM data in serial +- Access & permission to query the PACS & VNA from an Application Entity running on a GAE +- Access & permission to query the EMAP UDS & IDS +- HTTPS access from the local PIXL application to Azure +- An Azure-hosted DICOM node backed by Azure Blob storage +- An Azure-hosted instance of Delta Lake +- An Azure TRE with VNet access to the DICOM node + +# Technical Risks +The primary technical risk is overburdening the PACS & VNA and causing an adverse impact on the operational performance of these systems. +To mitigate this risk, queries will be managed with a task queue. The system will enforce rate limiting of any commands sent to the PACS & VNA with an adapted [token bucket](https://en.wikipedia.org/wiki/Token_bucket) algorithm which can be adjusted at runtime in response to system load. 
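The runtime-adjustable rate limiting described here could look roughly like the sketch below. It is a plain token bucket, not the "adapted" variant the design refers to, and every name in it (`TokenBucket`, `set_rate`, `try_acquire`) is hypothetical. The clock is injectable so the behaviour can be checked without sleeping.

```python
import time


class TokenBucket:
    """Token bucket whose refill rate can be changed while it is in use.

    Hypothetical illustration; not the PIXL implementation.
    """

    def __init__(self, rate_per_s, capacity, clock=time.monotonic):
        self.rate_per_s = rate_per_s    # tokens added per second
        self.capacity = capacity        # maximum burst size
        self._clock = clock
        self._tokens = float(capacity)  # start with a full bucket
        self._last = clock()

    def _refill(self):
        # Credit tokens for the time elapsed since the last check, capped at capacity.
        now = self._clock()
        earned = (now - self._last) * self.rate_per_s
        self._tokens = min(self.capacity, self._tokens + earned)
        self._last = now

    def set_rate(self, rate_per_s):
        # Settle tokens earned at the old rate before switching to the new one.
        self._refill()
        self.rate_per_s = rate_per_s

    def try_acquire(self, tokens=1.0):
        # Return True and spend the tokens if enough are available, else False.
        self._refill()
        if self._tokens >= tokens:
            self._tokens -= tokens
            return True
        return False
```

A worker would call `try_acquire()` before sending each command to the PACS or VNA and leave the task on the queue when it returns `False`; an operator or a monitoring hook could call `set_rate()` to throttle the system under load.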
A [circuit breaker](https://en.wikipedia.org/wiki/Circuit_breaker_design_pattern) will wrap the retrieval processes and act as a fail-safe. Individual request retries will be subject to an [exponential backoff](https://en.wikipedia.org/wiki/Exponential_backoff) strategy. + +# De-identification +![De-identification Logical Data Flow Diagram](./diagrams/PIXL-De-identification_Data_Flow.drawio.png) + +The technical elements of the de-identification process are purposefully limited for the alpha version. +The strategy is to initially lean on the other 4 Safes rather than attempt to immediately produce a gold-plated automated anonymisation process for both pixel data and every possible private DICOM element. This approach allows standing up of the infrastructure scaffolding around which more complex de-identification strategies will be constructed for subsequent versions. + +## Hashing +As is standard practice, a hashing function will be used to de-identify any personal identifiers which need to retain their uniqueness in relation to the rest of the data. +The [BLAKE2](https://en.wikipedia.org/wiki/BLAKE_(hash_function)#BLAKE2) hashing function will be used. It is an efficient cryptographic hashing function and the original algorithm was a SHA-3 finalist. BLAKE2 is used in the Linux kernel RNG and is the hashing function for the WireGuard VPN. +To mitigate the risk of a brute force attack in the rare event of an identifier leak, the hashing function will be used as part of a Hash-based Message Authentication Code primitive, or in so-called "keyed hashing mode". This usage pattern involves the addition of a secret key as part of the digest generation process. The secret key will be stored in Azure Key Vault. We will refer to this process as _secure hashing_ to distinguish it from simply applying a hashing function. 
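
As a minimal sketch of the _secure hashing_ idea described above, the snippet below uses BLAKE2b in keyed mode from Python's standard `hashlib`. The choice of the BLAKE2b variant, the function name `secure_hash`, and the randomly generated demo key are assumptions for illustration only. The 48-byte digest, 64-byte key and base64 encoding follow the operations table in the DICOM Instances section. In the design described here the key would be retrieved from Azure Key Vault rather than generated locally.

```python
import base64
import hashlib
import secrets


def secure_hash(identifier: str, key: bytes, digest_size: int = 48) -> str:
    """Return a base64-encoded BLAKE2b keyed digest of `identifier`.

    Hypothetical helper: the name and signature are illustrative only.
    """
    # BLAKE2b accepts keys of up to 64 bytes; an empty key would be unkeyed hashing.
    if not 1 <= len(key) <= hashlib.blake2b.MAX_KEY_SIZE:
        raise ValueError("key must be between 1 and 64 bytes")
    digest = hashlib.blake2b(
        identifier.encode("utf-8"), key=key, digest_size=digest_size
    ).digest()
    return base64.b64encode(digest).decode("ascii")


# Assumption: a random demo key stands in for the secret held in Azure Key Vault.
demo_key = secrets.token_bytes(64)
pseudonym = secure_hash("example-patient-id", demo_key)
```

The same identifier always maps to the same digest under a given key, which preserves uniqueness for linkage, while anyone without the key cannot rebuild the mapping by hashing candidate identifiers.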
+ +## Primary Identifiers +After filtering on opt-out status, the primary patient identifier will be securely hashed and stored alongside the original in an on-prem PostgreSQL database. +This lookup database will be for tracing and debugging purposes and will not be made available to researchers. + +***!!Questions*** +- Do we need to hash the Study & Series IDs too? + Do these include hospital or other identifying attributes? + Can this wait until beta? + +## DICOM Instances +As defined in _Scope_, the initial version will be limited to the de-identification of DICOM elements and disclosure risk from pixel data will be managed via policies & procedures. +Extensive DICOM metadata is not required for the NGT-CXR project so a whitelisting strategy will be pursued for including elements. +A configuration file will be created to specify the list of DICOM elements to include in the output and the transformations to apply to each. +Instances will be stripped from all other elements not included in the whitelist. +The following operations will be available for processing DICOM elements: +**Operation** | **Effect** +---|--- +keep | Retain as is +fixed | Replace with a constant +date-floor | Truncate a date or date time to the start of the year +date-shift | Shift a date or date time by the specified number of days (applied at study scope) +num-range | Floor & ceiling limits for a numerical value +hash | BLAKE2 hashing with 48-byte digest (base64 encoded) +secure-hash | BLAKE2 keyed hashing with 48-byte digest and 64-byte key from Azure Key Vault (base64 encoded) + +***!!Questions*** +- What other VR types & operations need to be supported? + +## Radiology Reports +Developing custom NLP de-identification techniques lies outside the focus of the PIXL project. +The NGT-CXR team have obtained ethics approval to use the [Microsoft Presidio](https://github.com/microsoft/presidio/) OSS data anonymisation tool for de-identification of free text data. 
It is however important to note that no such tool can guarantee the ability to remove all sensitive information and remaining disclosure risk will need to be managed by policies & procedures. + +## Electronic Health Record +No data points which would present a risk of direct identification will be retrieved from the EHR. + +# Policies & Procedures +## Policies +- System & Usage Scope + Guidelines to define the boundaries of the system and its objectives. +- User & Project Requirements + Technical, governance and financial obligations required of projects and users. + +## Procedures +- User & Project On-boarding + To execute at the start of a project or user participation journey. +- User & Project Off-boarding + To execute at the end of a project or user participation journey. +- Data Egress Management + To perform when analysis results or other data are to be removed from the environment. +- Disclosure Incident Reporting + To be followed by researchers when encountering PII. +- Disclosure Incident Management + Tasks and responsibilities when responding to an incident of PII disclosure. +- Operational Systems Impact Management + Instructions for monitoring and management as to ensure minimal impact on operational systems. 
+ +# Key Technologies +- [Docker Compose](https://docs.docker.com/compose/) +- [Python](https://www.python.org) + - [pydicom](https://github.com/pydicom/pydicom) + - [pynetdicom](https://github.com/pydicom/pynetdicom3) + - [deid](https://pydicom.github.io/deid/) + - [Azure SDK for Python](https://docs.microsoft.com/en-us/azure/developer/python/sdk/azure-sdk-overview) +- [Celery](https://docs.celeryq.dev/en/stable/) +- [Redis](https://redis.io) +- [Orthanc](https://www.orthanc-server.com/static.php?page=documentation) +- [DICOM DIMSE](https://www.dicomstandard.org/current) +- [DICOMweb](https://www.dicomstandard.org/using/dicomweb) +- [SQL Server](https://www.microsoft.com/sql-server) +- [PostgreSQL](https://www.postgresql.org) +- [Delta Lake](https://delta.io) +- [Feathr](https://linkedin.github.io/feathr/) +- [Terraform](https://www.terraform.io) +- [Azure Blob storage](https://docs.microsoft.com/en-us/azure/storage/blobs/) +- [Azure DICOM Service](https://docs.microsoft.com/en-us/azure/healthcare-apis/dicom/dicom-services-overview) based on the [Microsoft OSS DICOM Server](https://github.com/microsoft/dicom-server) +- [Apache Spark](https://spark.apache.org) in [Azure Synapse](https://docs.microsoft.com/en-us/azure/synapse-analytics/spark/apache-spark-overview) +- [Microsoft Purview](https://docs.microsoft.com/en-gb/azure/purview/) +- [Azure TRE](https://github.com/microsoft/AzureTRE) +- [OpenTelemetry](https://opentelemetry.io) +- [Azure Monitor](https://docs.microsoft.com/en-us/azure/azure-monitor/overview) +- [Presidio](https://microsoft.github.io/presidio/) +- [Azure Key Vault](https://docs.microsoft.com/en-us/azure/key-vault/general/) +- [Azure Active Directory](https://docs.microsoft.com/en-us/azure/active-directory/) +- [Azure Policy](https://docs.microsoft.com/en-us/azure/governance/policy/overview) +- [Azure RBAC](https://docs.microsoft.com/en-us/azure/role-based-access-control/) +- [Azure 
Functions](https://docs.microsoft.com/en-us/azure/azure-functions/functions-overview) +- [Azure Machine Learning](https://docs.microsoft.com/en-us/azure/machine-learning/) + + +# Security Considerations +## Inbound access to the Cloud Environment in Azure +It is expected that a VPN connection (or ExpressRoute connection) between the on-prem UCLH estate and Azure will not initially be available. +Point-to-point firewall restrictions and Azure access tokens will manage secure access to the DICOM service. +## Researcher access to the Research Workspace +The Research Workspace will not be accessible from the public internet. +Access will only be available from Trust devices, either physical or via a Citrix session. +## Outbound access from the on-prem PIXL process +All outbound connections will be over HTTPS. +The existing proxy service will be relied upon to manage outbound access from the GAE. + +# Diagrams + +## System Context +![System Context](./diagrams/PIXL-FlowEHR-System_Context.drawio.png) + +## Deployment Units +![Deployment Units](./diagrams/PIXL-FlowEHR-Deployment_Units.drawio.png) + +## Components +### PIXL +![Local Components](./diagrams/PIXL-FlowEHR-PIXL_Components.drawio.png) +### FlowEHR +![Cloud Components](./diagrams/PIXL-FlowEHR-FlowEHR_Components.drawio.png) + + +Security Considerations +Inbound access to the Cloud Environment in Azure +It is expected that a VPN connection (or ExpressRoute connection) between the on-prem UCLH estate and Azure will not initially be available. +Point-to-point firewall restrictions and Azure access tokens will manage secure access to the DICOM service. + +Researcher access to the Research Workspace +The Research Workspace will not be accessible from the public internet. Access will only be available from Trust devices, either physical or via a Citrix session. + +Outbound access from the on-prem PIXL process +All outbound connections will be over HTTPS. 
The existing proxy service will be relied upon to manage outbound access from the GAE. + + + + +Requires that + +### PIXL +_On-prem_ + +0. We **intelligently query the PACS/VNA** for ~300k chest X-rays without causing it to fall over +1. We **de-identify DICOM elements** with a simple whitelisting approach and **remove PII overlays** +1. We **de-identify free-text radiology reports** with Presidio +1. We **de-identify PII EHR** attributes +1. We **link** de-identified data securely +1. We **push DICOM instances** to Azure +1. We **push radiology reports & EHR data** into Delta Lake on Azure + +(see [PIXL](https://github.com/UCLH-DIF/PIXL) for more details) + +### FlowEHR +_Azure_ \ No newline at end of file diff --git a/docs/archive/PIXLv1/Referenced_notes/README.md b/docs/archive/PIXLv1/Referenced_notes/README.md new file mode 100644 index 000000000..6ac50f4bf --- /dev/null +++ b/docs/archive/PIXLv1/Referenced_notes/README.md @@ -0,0 +1,16 @@ +## 'PIXL/docs/archive/PIXLv1/Referenced_notes' Directory Contents + +
+ +

Files from other repositories referenced here

+ +
+ +| **User docs** | +| :--- | +| disclosure-incidents.md | +| UCLH_MSR_Research_Programme_Plan_COPY_12_09_2022.docx | +| UCLH_SOP_0006.md | + +
+ diff --git a/docs/archive/PIXLv1/Referenced_notes/UCLH_MSR_Research_Programme_Plan_COPY_12_09_2022.docx b/docs/archive/PIXLv1/Referenced_notes/UCLH_MSR_Research_Programme_Plan_COPY_12_09_2022.docx new file mode 100644 index 000000000..1a271c232 Binary files /dev/null and b/docs/archive/PIXLv1/Referenced_notes/UCLH_MSR_Research_Programme_Plan_COPY_12_09_2022.docx differ diff --git a/docs/archive/PIXLv1/Referenced_notes/UCLH_SOP_0006.md b/docs/archive/PIXLv1/Referenced_notes/UCLH_SOP_0006.md new file mode 100644 index 000000000..c449fa6f9 --- /dev/null +++ b/docs/archive/PIXLv1/Referenced_notes/UCLH_SOP_0006.md @@ -0,0 +1,40 @@ +# Title: UCLH SOP 6: Disclosure Incident Reporting + +## General information + +| **SOP Number and Version:** | UCLH SOP 6, V0 | +|-----------------------------|:--------------------------------------------------| +| **Effective date:** | 16/01/2023 | +| **Last reviewed date:** | 16/01/2023 | +| **Author:** | Dr Anika Cawthorn (Senior RSE) | +| **Approved by:** | Nel Swanepoel (Enterprise Architect) | +| **Authorised by:** | Awaiting Dr Mark White (Chief Technology Officer) | + +## Revision chronology + +| Version | Effective date | Reason for changes | Author | +|---------|----------------|--------------------|----------------| +| V0 | 16/01/2023 | Initial draft | Anika Cawthorn | + +## 1. Introduction + +This Standard Operating Procedure (SOP) defines which checks need to happen when an end user reports a disclosure incident. In a +nutshell a disclosure incident occurs when personally identifiable information (PII) is discovered in the +Trusted Research Environment (TRE) by end users not authorised to work with identifiable patient data. Further information and examples for disclosure incidents can be found +in the [disclosure policy](../policies/disclosure-incidents.md). + +## 2. 
Pre-requisites + +The purpose of this SOP is to ensure that end users of the TRE understand their role in reporting unintended disclosure events of PII that +may (albeit with a low likelihood) happen in the automatic provision of data. + +## 3. Procedure + +![SOP_0006](./diagrams/SOP_0006.png) + +Note that logging an incident will trigger [SOP 7](./UCLH_SOP_0007.md) immediately. + +## 4. Implementation and training + +This procedure will be shared with those seeking access to the FlowEHR TRE. It will also be shared as part of the end user +documentation through the GitHub repository that is mirrored to the Gitea service inside the TRE. \ No newline at end of file diff --git a/docs/archive/PIXLv1/Referenced_notes/disclosure-incidents.md b/docs/archive/PIXLv1/Referenced_notes/disclosure-incidents.md new file mode 100644 index 000000000..542ae630a --- /dev/null +++ b/docs/archive/PIXLv1/Referenced_notes/disclosure-incidents.md @@ -0,0 +1,26 @@ +# Disclosures + + +## Introduction + +While great care has been taken when anonymising patient-relevant data, all methods applied are automated and hence personally identifiable +information (PII) may be revealed accidentally. In cases where PII is discovered, either by end users or by the TRE providers, the instances of data +will need to be reported (following [SOP 6](UCLH_SOP_0006.md)). + +### Examples of PII + +1. Names, e.g. "Sophia Andrews" or "S. Andrews" +2. Date of birth or dates of specific investigations, including partial dates, e.g. "21/02/22" or "21/02" +3. Ethnic group information in DICOM images + +If in doubt whether or not PII may have leaked outside the hospital compute infrastructure, please reach out to the TRE team to decide the best +way forward. + + +## Responsibilities + +Once an incident has been reported, an assessment of the severity needs to be made and appropriate steps taken to mitigate risks around the +disclosure. 
This will likely involve following up with the Information Governance team and a potential data breach reporting procedure. + +Furthermore, it is important to assess how the data propagated to the end user and any steps for prevention in the future need to be identified and +executed, if it cannot be excluded that the behaviour constituted an exception. diff --git a/docs/archive/PIXLv1/Technologies.md b/docs/archive/PIXLv1/Technologies.md new file mode 100644 index 000000000..362d2885b --- /dev/null +++ b/docs/archive/PIXLv1/Technologies.md @@ -0,0 +1,23 @@ +## Key Technologies + +- [Docker](https://www.docker.com/) +- [Docker Compose](https://docs.docker.com/compose/) +- [Python](https://www.python.org) + - [pydicom](https://github.com/pydicom/pydicom) + - [pynetdicom](https://github.com/pydicom/pynetdicom3) + - [deid](https://pydicom.github.io/deid/) + - [Azure SDK for Python](https://docs.microsoft.com/en-us/azure/developer/python/sdk/azure-sdk-overview) +- [Celery](https://docs.celeryq.dev/en/stable/) +- [Redis](https://redis.io) +- [Presidio](https://microsoft.github.io/presidio/) +- [DICOM DIMSE](https://www.dicomstandard.org/current) +- [Orthanc](https://www.orthanc-server.com/static.php?page=documentation) + - [Orthanc Python plugin](https://book.orthanc-server.com/plugins/python.html#id1) + - [Orthanc DICOMweb plugin](https://book.orthanc-server.com/plugins/dicomweb.html) + - [Orthanc PostgreSQL plugin](https://book.orthanc-server.com/plugins/postgresql.html?highlight=postgresql) +- [DICOMweb](https://www.dicomstandard.org/using/dicomweb) +- [PostgreSQL](https://www.postgresql.org) +- [Azure DICOM Service](https://docs.microsoft.com/en-us/azure/healthcare-apis/dicom/dicom-services-overview) based on the [Microsoft OSS DICOM Server](https://github.com/microsoft/dicom-server) +- [OpenTelemetry](https://opentelemetry.io) +- [Azure Monitor](https://docs.microsoft.com/en-us/azure/azure-monitor/overview) +- [Azure Key 
Vault](https://docs.microsoft.com/en-us/azure/key-vault/general/) diff --git a/docs/archive/PIXLv1/images/PIXL-Components.drawio.png b/docs/archive/PIXLv1/images/PIXL-Components.drawio.png new file mode 100644 index 000000000..469ee5118 Binary files /dev/null and b/docs/archive/PIXLv1/images/PIXL-Components.drawio.png differ diff --git a/docs/archive/PIXLv1/images/PIXL-De-identification_Data_Flow.drawio.png b/docs/archive/PIXLv1/images/PIXL-De-identification_Data_Flow.drawio.png new file mode 100644 index 000000000..5cc381f9d Binary files /dev/null and b/docs/archive/PIXLv1/images/PIXL-De-identification_Data_Flow.drawio.png differ diff --git a/docs/archive/PIXLv1/images/PIXL-FlowEHR_DICOM_service.drawio b/docs/archive/PIXLv1/images/PIXL-FlowEHR_DICOM_service.drawio new file mode 100644 index 000000000..0f86218ef --- /dev/null +++ b/docs/archive/PIXLv1/images/PIXL-FlowEHR_DICOM_service.drawio @@ -0,0 +1 @@ +7Vtbd5s4EP41Pmf3IRxAgOHRlzjJ2bTJ1t1ennJkkG1tMfKCHNv59TsCgQFhx2lwnL00PS2MhEBz+eabgXTQYLG5ivFy/oEFJOyYerDpoGHHNA3LNDvirx5sM4nr2JlgFtNATtoJxvSJSKEupSsakKQykTMWcrqsCn0WRcTnFRmOY7auTpuysHrXJZ4RRTD2cahKv9KAz6XUcLzdwDWhs7m8tWt2s4EFzifLnSRzHLB1SYQuO2gQM8azo8VmQEKhvFwv2XWjPaPFg8Uk4sdcMEwcZ/X56fuX+6tNfPlx8Xh95V/IXTzicCU33HtaxUQ+Md/mapjSMBywkMXpKbocjfQR6qB+wmP2g+QjEYtgfj/AyZyIu+pw8khiTkGfvZDOIpBxtgTplEV8LJfX5XlpffgZwQHqq5uU+xbLkk1JJDd9RdiC8HgLU+So7UqPkx7oSXusS+bsyinzkiUtW07E0oVmxdI7LcOBVHSz0h0HrSdTkgwvPbqJvc/bxTfnwlKU/sfg9vrfpHPD0p/TuWXqmmE0qB2dSu2OomESQKzLUxbzOZuxCIeXO2k/ZqsoKPS6m3PLhEqHBgj/JJxvJXDhFWcgmvNFKEfJhvJv4nINfCw7/V4aGm7k0unJVp4kHMe8J9ALBH6Ik4T6uXhEw3ztvYZK2Cr2yQFV5EiK4xnhB+bJsBB6Omj2mISY08cqZjYZML0Udoa3pQlLRiOelFa+F4KdNyG3W/EmZHSrSFebb+nOoflwkD3BzpWKrfy8d3nn9S7zH+hd7vvwLsd8kXcVYPaW3pUj5Xncyyj5lq7ZB70rzTLZgob5Jm6UU8X2/OhVacYwlfQ++HD35bLRgLd4AoS5onQsk7YPCiNxQzZf0CDI7EsS+oQnRUKXng6L2/2OPTzKGAe9TUn4Bc+Wd61Q2SYicAFJ3vUkHL8sjneBl09h02kCnlA3UBvhpZhMsRbQ96U4pIu0YiisklrwniWUUyasM2GcswVMCMVAH/s/Zmkc5kwrIFO8CvleloaTZVbHTOlGxG4/vWEvl+q5BI7nnIsqqCfUYY7W67UmYhxH/kVCYlhd8+FBzBFd
zODfu2xIW0YzWMIP6fI+q1BoBEr9xTBTNgYJX0eehlB2aCDNcKQUUM7+tR16iOwqhALp0/KCqkwRPa3rNTBEELum57rIMGzdtswTBTJ6QRyfGogj2JBAYognHSwiJRkeu143F+wgOT3LMbkRx/eiuG0WEH9PYgrKFDC0W7Yu/HmEN49F+HcF8Cq+d0wnjegJHMzEwX1vMM6FcI9CvhdV/G1IwUViUd+t55ST8RKnqlvHeFn1hUnmTLeTQlBgzN2KwzJEyhMJ/MKYtVoywMSd+kotCSOO75LJtKUq0KmFOeoqMZ5PKQe4e6oK0N5rOZEoK5Zx/lqxfOAi02RPAKK53KTKycdzy0p4hSk3YJ0N/M+iUOii8I145wR9sTko3LVUw5yS9OYhYz9Wy5LbZA+1z3N+EO7PZQgqSaPkLyorUBhGntPqdCFPNFkGCelEaD3xMREkV/SJHoYYqABOyMOUxQ+QBfksJuPfbx+yBJRoyaPINgk4MxV5Z2i1lEDqHBwSiKv4FjI1FzXkD6SdKmV01ZQxuvk4PGvKKE4K1N+TKcxuhvsvwvhnsdtsG7v3VHF1SuFY1SWyZCSvOlTeWfWFamiU7VhZqC0q6ioudJNGoakDhhMVTRLOGvq150aHIU38PPpbiHezW493V3NUvtiQSxDKJ74m2Bt756qlxp/vvl58Gp8j3OvGqTDGGmG0dfc5wtguz8sr9DJYNKrUfCO0cLr1IHc12yv9OVXIN+5aNpDeR4LY5YA9vlH3tNO3E4/2HnQS73lpQ9Hyqi+cLPkC6rQNQr2B2/7fwjhZC8OqI8i5WxjNyNLU2KrVqj0oRLYLCvo+qmIF/fAqkDS+jywXnFJ0fGOzqQKu4t3RXeYXWNQWRinlgOpLJNOAFNEtDav1RlMt28bbzGbTHtGG+ITX/22bIsvSvPdtRpVFKuY514cAJzBI/cMA21K/DHDMBgMY5qksUHyfc1a+XmFR0pRlCnU0AW/2Mu9ICpWjytu3Wg8+d7mhcnujEpszV7tQKSUsaq/cRbajoSpuGUgNlLydWo6TNj5aag4T6xxhUhQppbrke3nsDEVKs3oa+mFvVKS8zqwqQ/yNbFOBoPjvLdLg2R6+iCdrsbmEXCdvCOdZyelqnqHEm6VrVtMXa6ame6cyj8ry5Aea+vBmcPehoR8IdRP1319HMHtf0Lu/SV8VjK5vPrVnQccSret6OfYMzbMbWMbp0NM+M3pabrdTafN46DkMbWwC/gxvIVGgQi8I2wFepALv23zL+DqHUL8uGBM/i+w5UHjxugx2vyeY93VuimprbxlWrvFkO2gZM58kybM1Wumdsq4Z4k0ejn3pC06Da7z8nXNY204BOAoCtUSyLL0EEdWPTC1HJVyG3g5mwOnuu/+sA7j77Ql0+Tc= \ No newline at end of file diff --git a/docs/archive/PIXLv1/images/PIXL-FlowEHR_DICOM_service.drawio.png b/docs/archive/PIXLv1/images/PIXL-FlowEHR_DICOM_service.drawio.png new file mode 100644 index 000000000..3e2420daf Binary files /dev/null and b/docs/archive/PIXLv1/images/PIXL-FlowEHR_DICOM_service.drawio.png differ diff --git a/docs/archive/PIXLv1/images/PIXL-System_Context.drawio.png b/docs/archive/PIXLv1/images/PIXL-System_Context.drawio.png new file mode 100644 index 000000000..14e304cd9 Binary files /dev/null and b/docs/archive/PIXLv1/images/PIXL-System_Context.drawio.png differ 
diff --git a/docs/archive/PIXLv1/images/PIXL.drawio b/docs/archive/PIXLv1/images/PIXL.drawio new file mode 100644 index 000000000..7a21288c9 --- /dev/null +++ b/docs/archive/PIXLv1/images/PIXL.drawio @@ -0,0 +1 @@ 7V1Zd5u6Fv41eUwWs+1HZ+45TZuTdPU29+UuBWSbGiMKOInPr78SCBu0ZUOIAQ9Jh4CAjbS39O1JEif6xeztJkTB5I442DvRFOftRL880TTVUDX6i5Us0hJzkJ6PQ9fh96wKHt1/MS9UeOncdXBUuDEmxIvdoFhoE9/HdlwoQ2FIXou3jYhXfGuAxhgUPNrIg6X/cZ14wktVa7C6cIvd8YS/uq/10gszlN3MWxJNkENec0X61Yl+ERISp0eztwvsMd5lfEmfu15zdVmxEPtxlQdsfNPrTfz+9Gf8v8nty/Sfm1vtlLfiBXlz3mBe2XiRcSAkc9/BjIhyop+/TtwYPwbIZldfqchp2SSeefRMpYdRHJLpklO0jecj1/MuiEfChJp+PWB/aDmsPm/RCw5j/JYr4s25wWSG43BBb8mu6py1i6zPZAWvOVEZFi+c5MRkKrwQ8f4xXlJfsZAecC7KOfrkG9PLoXH65Zvzm/Qs7+f5f/861QFHL79cfL87YdKxPPr68+eQHo3Z0SU+pb3bj92Ri0PAeNphAnYYhMTGUVTO/GdkT8eJuL7PY8/1MS9fSXAlopxIDNW0rHO58IgfP/IaqRJhKtf96+vtCNPsC8JUZcLUJLK0mhKlBUT589sQiCknAAdFk+JIiYrCghxeLw2R1/3t8VqHvDYhrxXpuGmI1z0JEKXDhbLQzwbM1d3wPiunr8lf6lwu+bGiNCEk3ZIMiFaF1K8mpNuHUqirJsIjR0Bd6RoBZeYBFPgDDkgYf8p8GzI32pT59fDPqU4h5On38OrL16fJr+HF36eqROaCjLDvDJltTc9sD0WRaxdFUp/1+M2NfzEAPTP52ROnwo4v3ziGJyeL7MSnzf6VP3laUWCnq8eSs+y5teKLyDy0cbmNF6NwjONNrOSsw07ByYC9IcQeit2Xos8hkyx/9J64tM7LXmQNRENYE3pH2iT+WN5JECj1SimljQaUaH9Ai9xtAbsh2lBlQ3iRphfcF3qQklx15CW/6vdtDfRt7OEZ5jUt9nEqs6/omXqztD/jyP0XPXPVXujqyHPHPhsHlAgFOf2cjX2X+o9DfmHmOg578CN4VNkS3DSguYfMm3Gy9EsrgFXd7pndQkajCMcnIjJtQaDQ2+ocrJR9BSu1HbDqKSLE6DXBql9KqSmwUgYtgJUB+rY7Q2N8yFCltwdVH7KRzE5hpyaE1IGr+rDTg7CzyadsGnUMMVaoa4Mzs1cPeMwqxLaEPbDibRhKMPSVePUHCzxmZzZSRlXQXM2ZTLJI255i12a/sHnsGrSEXaL5ofdqWkwwxGQ0YzEB1DLaQC0YnwyR49KOOV7A6FSYRK0O2Zrq7Yk1VSEJubuIpObwaIVO20YkqyIi6S0hkiniiJhPrYxIYsBJE7Ht3Yi0bYWZqWaY0hUhxfWjGPn2Qbtogz0BFRXGsR/wjNDXagr9P/TQAoppF5IPINlwva1kQw+EMWQZxXaTDTAi+4AdZNOhfsHqgKJJekS1tceEQIVHh5syxTigv0YhmbFqBVRgL5TdtCVMXp4bxccuW637RBIMYOUzgkxa2dSY9cD5rix+JmFEQRN5Dwyf/HEKl7y9SiPQWUwYflyYq1T/coKGIZFmXyJNMQK6PWlCv/EG+zhEMYaK0MevEuke2ZAciOFpVZPMsml3SMpmcLRnatcwmVfm+coif8pf+0i+ZKPhUJ4w6bVjbFuisa2KTnvl7C6gJJrtW3L/ByJ+tZIwUaEjaXsY+UzVAFP90FMp2VDfeUNdg4b6nmKS2hwmabuOSV
rdAADAJK2hkCTApHZmnEAHZy0mHcFcFGG5xmFNRtFgWngYBB6r9D0Tm+MSKPShT/zFzI1om4gPL/8gxJu6x+bN9kVvtvtpkdpe583aiVIPKmqpbKC0njgz6s6LBGFq48NaauvoA327UYhZxIwOOgggB6RT9iXPpUH/BAY70/jmKtwJQ5zP9Mr0CEOcQCm0uj5CPl8NpoY+lYJ8gle5UmhpNkVPLYPyyvNPSyl1rhR06Fw7eEbGbK22ax+yp5ENzZ3XCrpkxbCQJrm6ffjMjHDMF714XZUE1VvNjOgwz9WmEvhQAOt9iqOSEpDzqKproLc0g2W1VkCcV/tuNTAAk+rEyZzdqwEYmcCjEUUBykBPMjcO8aCELEp12BOBs8HcQYCKdx+j+ECDa6egwygqntz0SWXdfMlPPZTaft3rIehePmJ7HiYjPPMu7x6+JXWM2ckULxJZkjCRtuuzs+SRhMzcOzYXc5UQ3R0f04AZhDbNi9ZNBa2yqdCSw6jScXym5H5MoYvU9R+XWfH1NkjnloMBvZMgwnPnWO2DbDQegX1gdDsvfy/8GrPqZCG9pXWO0K8x6vo1FI0GZ4Pcj1lCuHOwMmE4trabE/JNbQ4XyVqcyd8xkpkwCvppGX/UMu4+JW92axkvl6QOBnpRSSnZ9e2qKUbkHocu5R7DlI/a2WZF1ZWNnqZV16D3PnVT3czu75see7fRffDaKhvrx6CtPvMJ5Xa3ZEHsxkVIzdvdPUE/qqZZ1+4G27OKpLpHqC0mFI54MW021I8B1eCWMp82+HttcHEGVPcrysy9nBa7VhjlxrDVlj7RtqZPwD75O6hPYC6yxOI9Zq3R4oTXjrWGBV38H+E8ipMegWYMzf3nKEh4J3aF7wFbncym0wo9YQ937FdFk1AzLIj8TW0GLxdNt25KzTV4jW2CuLH7lqoVqyW1ArcSEy2DqlpFA0qlv2thFAs6KVE8dxbJc3yWS3jIisLqzr1oQTVAh2L9l3Ye6Mtc/HL0X9lRQZSiza/sSD9BBXMz+a+8fIqwVIS62aITKBUhNNJ2wBIo2yGksS1FK1sCm4ZD04aABvaLEg3FyoYACEoMOjQEpDyVpFMo73DSioOzBDaN0H03BKRt2+vN2Le+Bm+jrVS6f2hbbog48VGtvaNxKaXO0WetnSpaOH/mmCvjwwShFjdZ/5A5s5ch7cYAZZOVUoonLW1HpJVuz1AVT0DutdM1vVLew1D54z9fjw9Nut/1gTz/Zp+cpnVlezVMiOcwIzJPzDa+oRmnf//lF5WSkltiZfM9f5SvZMx4y66iGNFf1x5JNtB00ThEs4zWj0XAaV1QdxNRvzV8tEmAz5kIUCZs27jEkR26QUI8uf3HBNd9szJK9qOIEwq8CTPs8IfYLnquP6ZHTvo4d8FJeJZWxku7WoqhaQ9lrnKhT1p/5iS7cBolXwAf0htUK3hbXVx9L5L3c7HAcV+kVNnmJ6e8OzOyHh7FMrImlxXLqC6/VJkQLb6HloJ307K0VVuvUl6U62umlcZYlNwI4BiTA4VkgxjJd7yz0ekTFiIpDFhelOFE0gSIEjEJ1kRjil8VpyMNXTluvEQf2guH2dfbFQGqUv0aozgHXUUUSJ+w52FER/cDTvsUV38B8rO3JhCNw6uXZNO1XFkyiM1z+lc5Y1y/UJJ/5iXbqIUVysp6sFBlZxmFYqGsrGfKSKqSd4tlmqRQSlLybkWoJP27nTgV2HypD0ONpixOJWrcClBMTzkaZ6chIXFe77IdL+6Iw9Do6v8= \ No newline at end of file diff --git a/docs/archive/PIXLv1/images/README.md b/docs/archive/PIXLv1/images/README.md new file mode 100644 index 000000000..36aaa2314 --- /dev/null +++ b/docs/archive/PIXLv1/images/README.md @@ 
-0,0 +1,19 @@ +## 'PIXL/docs/archive/PIXLv1/images' Directory Contents + +
+ +

Image files used in the documentation of PIXLv1

+ +
+ +| **Image files** | +| :--- | +| PIXL-Components.drawio.png | +| PIXL-De-identification_Data_Flow.drawio.png | +| PIXL-FlowEHR_DICOM_service.drawio | +| PIXL-FlowEHR_DICOM_service.drawio.png | +| PIXL-System_Context.drawio.png | +| PIXL.drawio | + +
+ diff --git a/docs/archive/PIXLv1/readme.md b/docs/archive/PIXLv1/readme.md new file mode 100644 index 000000000..18b28341a --- /dev/null +++ b/docs/archive/PIXLv1/readme.md @@ -0,0 +1,37 @@ +## 'PIXL/docs/archive/PIXLv1' Directory Contents + +This is a clone of the wiki from the original PIXL repository. + +
+ +

Subdirectories with links to the relevant README

+ +
+ +[diagrams](./diagrams/README.md) + +[Referenced_notes](./Referenced_notes/README.md) + +
+ +
+ +

Files

+ +
+ +| **User docs** | +| :--- | +| Considerations.md | +| De-identification.md | +| Design.md | +| DICOM_tags.md | +| Home.md | +| MVP.md | +| PIXL-FlowEHR_Dicom-Solution-Design.md | +| PIXL-FlowEHR_Solution-Design.md | +| readme.md | +| Technologies.md | + +
+ diff --git a/docs/archive/README.md b/docs/archive/README.md new file mode 100644 index 000000000..91d2fcfd3 --- /dev/null +++ b/docs/archive/README.md @@ -0,0 +1,12 @@ +## 'PIXL/docs/archive' Directory Contents + +
+ +

Subdirectories with links to the relevant README

+ +
+ +[PIXLv1](./PIXLv1/README.md) + +
+ diff --git a/docs/design/README.md b/docs/design/README.md new file mode 100644 index 000000000..63a5f0a24 --- /dev/null +++ b/docs/design/README.md @@ -0,0 +1,16 @@ +## 'PIXL/docs/design' Directory Contents + +
+ +

Subdirectories with links to the relevant README

+ +
+ +[decision-record](./decision-record/README.md) + +[details](./details/README.md) + +[images](./images/README.md) + +
+ diff --git a/docs/design/bigger_picture.md b/docs/design/bigger_picture.md deleted file mode 100644 index 27750abbf..000000000 --- a/docs/design/bigger_picture.md +++ /dev/null @@ -1,7 +0,0 @@ -# The bigger picture - -The [System design for linkage and export of imaging and EHR data](https://github.com/SAFEHR-data/the-rolling-skeleton/blob/main/docs/design/system-design.md) -document in `the-rolling-skeleton` repository provides an overview of the overall system -architecture and the components that make up the system that PIXL is part of. - -Please request access to the `the-rolling-skeleton` repository and add further details in a [new blank issue](https://github.com/SAFEHR-data/PIXL/issues/new). \ No newline at end of file diff --git a/docs/design/details/100-day-design.md b/docs/design/details/100-day-design.md new file mode 100644 index 000000000..334a9fa10 --- /dev/null +++ b/docs/design/details/100-day-design.md @@ -0,0 +1,61 @@ +# 100 day plan data export + +## Background + +The data export part of the 100 day plan aims to export chest X-rays to the UCL Data Safe Haven (DSH) for MSc student projects. + +This will use existing UCLH services: + +- OMOP ES for cohort definition and structured patient data extraction +- Cogstack free text de-identification api to anonymise radiology reports +- PIXL to export and de-identify chest X-ray DICOM data and radiology reports + +## Constraints + +- PIXL is an open source project and the Clarity and Caboodle databases are EPIC intellectual property, so queries using these databases can't exist + in the PIXL repository. +- Aiming for reasonable but not perfect solutions given the time for the project +- DSH allows data ingress using FTPS only currently +- VNA should only be requested to move images once per project, to reduce load +- De-identification of radiology reports should be checked before export? 
+- De-identification and linking of data should all happen within the UCLH network + +## Not in scope + +- System-wide automated tests +- CI/CD based processing and transfer automation tool +- Separate project-based identifier hashing component + +## Data flow through components + +![Data flow diagram for pixl](../images/100-days.drawio.png) + +There will be 3 stages for exporting data to the DSH: + +1. The configuration of an OMOP ES extract will be reviewed in GitHub and added to version control. + The `OMOP ES` container will be run based on this using `Caboodle` as a data source, using the OMOP ES imaging plugin to extract + imaging study data: accession number and study datetime. + `OMOP ES` will write files into 2 groups (plus a log file) as an input to PIXL: + - Public parquet files that have had identifiers removed and replaced with a sequential ID for the export + - Private link files (parquet) that map sequential identifiers to patient identifiers (e.g. MRNs, Accession numbers, NHS numbers) + - log file (json) which contains the extract datetime, project name and a hash of the OMOP ES repository for the extract +2. The `PIXL` pipeline will be run, using the OMOP output files as an input, with the `PIXL CLI` populating the imaging and EHR queues by reading the public and + private + parquet files to generate messages that contain the MRN, accession number, study datetime, project name and OMOP ES extract datetime. + The `PIXL CLI` also copies the public parquet files that are accessible to the `ehr-api` for later export + - The `imaging-api` queries the VNA for the study and if it exists, the DICOM data is sent to `orthanc-raw` + - `orthanc-raw` acts as a cache for identifiable images by storing as much DICOM data as possible, and sending the DICOM data to + the `orthanc-anon` service, along with the project name. + - `orthanc-anon` uses an allow-list approach with DICOM tags, each of these having a defined action which allows for de-identification. 
The + accession number will be hashed by the `hashing-api` for linking to the EHR data. + This will also reject any images which are not of the expected modality. After the image has been de-identified, the DICOM data will be sent + to the DSH using FTPS + - The `ehr-api` will access `EMAP` for imaging reports and de-identify them using the `cogstack deidentification api` + - The de-identification will be checked with another method to have a rough metric of certainty in the ongoing de-identification in case of + data drift. This will be persisted to the postgres database along with any of the reports which are flagged as containing identifiable data. + - The accession number will be hashed using the `hashing-api` + - The de-identified radiology reports will be saved to a parquet file, which acts as a link between the OMOP ES sequential study identifier + and the hashed accession number +3. The de-identification will be reviewed and then the `PIXL CLI` will be used to send a REST request to the `ehr-api` to export the public and report + parquet files to the DSH using FTPS. + diff --git a/docs/design/details/README.md b/docs/design/details/README.md new file mode 100644 index 000000000..90d8098ed --- /dev/null +++ b/docs/design/details/README.md @@ -0,0 +1,16 @@ +## 'PIXL/docs/design/details' Directory Contents + +
+ +

Design documents in order

+ +
+ +| **Number** | **User docs** | +| :---: | :--- | +| 1 | 100-day-design.md | +| 2 | system-design.md | +| 3 | bigger_picture.md | + +
+ diff --git a/docs/design/details/bigger_picture.md b/docs/design/details/bigger_picture.md new file mode 100644 index 000000000..bbe819725 --- /dev/null +++ b/docs/design/details/bigger_picture.md @@ -0,0 +1,4 @@ +# The bigger picture + +The [System design for linkage and export of imaging and EHR data](system-design.md) document provides an overview of the overall system +architecture and the components that make up the system that PIXL is part of. diff --git a/docs/design/details/system-design.md b/docs/design/details/system-design.md new file mode 100644 index 000000000..5b2c51549 --- /dev/null +++ b/docs/design/details/system-design.md @@ -0,0 +1,76 @@ +# System design for linkage and export of imaging and EHR data + +## Background + +Continuation of work in [100 day plan](100-day-design.md). +This iteration will aim to bring in MRI and CT data, with a more coherent design using OMOP ES, Cogstack, an identifier hashing service and PIXL. + +This will use existing UCLH services: + +- OMOP ES for cohort definition and structured patient data extraction, with an escape hatch option of a flatfile input for pre-Epic imaging data +- Cogstack for free text data retrieval and de-identification +- PIXL to retrieve, de-identify, and export DICOM data. Also allow for export of electronic health record data (including radiology reports and clinical notes) + +## Constraints + +- PIXL is an open source project and the Clarity and Caboodle databases are EPIC intellectual property, so queries using these databases can't exist + in the PIXL repository. +- Projects may store their DICOM data in the UCL Data Safe Haven (DSH), an Azure DICOM service or a NHS Digital approved (Data Security and Protection Toolkit) DICOMWeb-compliant server. +- Imaging studies retrieved from the VNA should be cached and reused across runs within the PIXL infrastructure to the maximum extent possible (e.g. 
limited by GAE storage limits) to reduce unnecessary query and retrieval from UCLH imaging systems. +- De-identification and linking of data should be managed within the UCLH network using a maintained suite of Standard Operating Procedures +- Each project should have its own salt used for hashing +- Need time to consider the ramifications of OMOP ES using constant pseudonyms for identifiers, + so PIXL will keep on using its own hashing service for linking images to OMOP data in a mapping parquet file. + This adds extra overhead on researchers with multiple extracts over time, but is a reasonable tradeoff. + +## Not in scope + +- System-wide automated tests +- CI/CD based processing and transfer automation tool + +## Data flow through components + +![Data flow diagram for the entire process](../images/voxl.drawio.png) + +There will be 3 stages for exporting data from UCLH since Epic: + +0. The de-identification of clinical notes and imaging reports will be trained and validated before any of the electronic health record data will + be exported in Step 3. +1. `OMOP ES` will define the cohort and + - The configuration of an OMOP ES extract will be reviewed in GitHub and added to version control. + - The `OMOP ES` container will be run based on this using `Caboodle` as a data source, using the OMOP ES plugins to bring in imaging information. + - This will then request clinical notes and imaging reports from `Cogstack`, which will be de-identified before returning to `OMOP ES`. + - In the case of emergency imaging (where the study UID will not match), the imaging accession number will also be exported. + - `OMOP ES` will write files into 2 groups (plus a log file) as an input to PIXL: + - Public parquet files that have had identifiers removed and replaced with a sequential ID for the export + - Private link files (parquet) that map sequential identifiers to patient identifiers (e.g.
MRNs, Accession numbers, NHS numbers) + - log file (json) which contains the extract datetime, project name and a hash of the OMOP ES repository for the extract +2. The `PIXL` pipeline will be run to process the imaging data + - A project is registered to PIXL using a pull request of the project configuration; only projects in the source code can be processed. + - Using the `OMOP ES` output files as an input, with the `PIXL CLI` populating the imaging and EHR queues by reading the public and private + parquet files to generate messages that contain the: + - study UID + - MRN + - accession number + - study datetime + - project name + - OMOP ES extract datetime + - The CLI command `pixl populate` persists the details of each message being sent for processing so that the progress can be tracked and + the pseudonymised data can be linked with a project for export later on. + - The `imaging-api` requests that `orthanc-raw` queries the VNA for the study and if it exists, the DICOM data is sent to `orthanc-raw` + - `orthanc-raw` acts as a cache for identifiable images by storing as much DICOM data as possible (with the max. storage size being configurable), and sending the DICOM data to + the `orthanc-anon` service. + - `orthanc-anon` uses an allow-list approach with DICOM tags, each of these having a defined action which allows for de-identification. + This will also reject any images which are not of the expected modality. + The imaging identifiers will be hashed by the `hashing-api` for linking to the EHR data. + To enable a mapping table export in the next step, the hashed study UID will be saved to the PIXL database along with the input fields.
+ - After the image has been de-identified, the DICOM data will be sent by the `export-api` to one of three possible destinations: + - An Azure DICOM service, with the `export-api` sending a REST request to `orthanc-anon` to send it onwards + - A researcher-managed DICOMWeb compliant server, with the `export-api` sending a REST request to `orthanc-anon` to send it onwards + - To the DSH directly via FTPS +3. The CLI command `pixl export` will export all electronic health record data. + - The PIXL database will be queried for all images that have been exported from the project, + mapping their OMOP ES sequential IDs with the hashed identifiers used in the imaging export. + - This mapping table will be written into an export-specific directory accessible by the `export-api` + - The command will copy over the public files from the `OMOP ES` extract into the same export-specific directory. + - All of these files are then exported via FTPS to the secure endpoint diff --git a/docs/design/images/100-days.drawio b/docs/design/images/100-days.drawio new file mode 100644 index 000000000..3f018f006 --- /dev/null +++ b/docs/design/images/100-days.drawio diff --git a/docs/design/images/100-days.drawio.png b/docs/design/images/100-days.drawio.png new file mode 100644 index 000000000..0a147fdf9 Binary files /dev/null and b/docs/design/images/100-days.drawio.png differ diff --git a/docs/design/images/README.md b/docs/design/images/README.md new file mode 100644 index 000000000..eb5eb9dc9 --- /dev/null +++ b/docs/design/images/README.md @@ -0,0 +1,19 @@ +## 'PIXL/docs/design/images' Directory Contents + +
+ +

Image files used in the design documents

+ +
+ +| **Image files** | +| :--- | +| 100-days.drawio | +| 100-days.drawio.png | +| pixl-multi-project-config.drawio | +| pixl-multi-project-config.png | +| voxl.drawio | +| voxl.drawio.png | + +
+ diff --git a/docs/design/diagrams/pixl-multi-project-config.drawio b/docs/design/images/pixl-multi-project-config.drawio similarity index 100% rename from docs/design/diagrams/pixl-multi-project-config.drawio rename to docs/design/images/pixl-multi-project-config.drawio diff --git a/docs/design/diagrams/pixl-multi-project-config.png b/docs/design/images/pixl-multi-project-config.png similarity index 100% rename from docs/design/diagrams/pixl-multi-project-config.png rename to docs/design/images/pixl-multi-project-config.png diff --git a/docs/design/images/voxl.drawio b/docs/design/images/voxl.drawio new file mode 100644 index 000000000..38997c1d7 --- /dev/null +++ b/docs/design/images/voxl.drawio @@ -0,0 +1,395 @@ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + diff --git a/docs/design/images/voxl.drawio.png b/docs/design/images/voxl.drawio.png new file mode 100644 index 000000000..34bc64c82 Binary files /dev/null and b/docs/design/images/voxl.drawio.png differ diff --git a/docs/developer/DICOM-anonymisation-SOP.md b/docs/developer/DICOM-anonymisation-SOP.md new file mode 100644 index 000000000..3c9c53732 --- /dev/null +++ b/docs/developer/DICOM-anonymisation-SOP.md @@ -0,0 +1,147 @@ +# DICOM Anonymisation Standard Operating Procedure 
(SOP) + +## Subject + +DICOM Anonymisation SOP for the structured chest X-ray imaging data collected for the PIXL Nasogastric tube study. This SOP explains the procedure for +removing all identifiable data from the DICOM files while maintaining data fidelity. This can be extended to other imaging modalities and other +studies but would require re-validation of the anonymisation process for other scan modalities or anatomical regions. + +## Purpose + +To establish a standard process for anonymising chest X-ray images prior to use in the research project. Anonymisation protects patient privacy while +allowing data to be used for research purposes. + +The criteria for successful anonymisation are: + +- Removal of all direct identifiers from the DICOM files +- The process does not interfere with fidelity of data as defined below. +- The process is consistent when repeated, so that: + - The same project-level unique ids are produced for a specific patient id, regardless of which historical archives the scans are retrieved from. + +**Data fidelity** is defined as the faithfulness of the relationship with the original data: + +- If there is associated data (e.g. radiology reports) then date or time shifts should be consistent + +To ensure **Data fidelity** across modalities: + +- Date or time jitters should be matched to ensure they are the same in both the scan (or DICOM) and any accompanying radiology report. +- The process can be repeated in a consistent way across the different modalities. + +## Scope + +This SOP applies to all research projects at UCLH. The process being used to anonymise the accompanying radiology reports can be found in SOP **ADD +TITLE**. + +## Process + +1. Set metadata tags as defined for per-project configuration, defaults shown in +2. If scan is isotropic and covers facial region at high resolution, a face removal script may be + run to anonymise facial features in image voxels (3D pixels) while maintaining remainder of voxel data integrity. 
+ This will be considered on a case-by-case basis. +3. Document anonymisation process (separate SOPs) + +## Relevant configuration + +The relevant configuration for the orthanc-anon DICOM tag de-identification is +located in PIXL's [project config directory](https://github.com/SAFEHR-data/PIXL/tree/main/projects/configs) and the `tag-operations` subdirectory. + +### Delete Operations + +If a tag isn't defined then it will be removed by default. + +### Kept without altering + +These tags will be kept by default for all projects + +| Name | Group | Element | +|----------------------------------------|--------|---------| +| Modality | 0x0008 | 0x0060 | +| Modalities in study | 0x0008 | 0x0061 | +| Manufacturer | 0x0008 | 0x0070 | +| Study Description | 0x0008 | 0x1030 | +| Series Description | 0x0008 | 0x103e | +| Patient Sex | 0x0010 | 0x0040 | + + +### Regeneration of UIDs + +Table 1 + +| Name | Group | Element | +|----------------------------------------|--------|---------| +| SOP Instance UID | 0x0008 | 0x0018 | +| Study Instance UID | 0x0020 | 0x000d | +| Series Instance UID | 0x0020 | 0x000e | +| Instance Creator UID | 0x0008 | 0x0014 | +| Referenced SOP Instance UID | 0x0008 | 0x1155 | +| Frame of Reference UID | 0x0020 | 0x0052 | +| Synchronization Frame of Reference UID | 0x0020 | 0x0200 | +| Storage Media File-set UID | 0x0088 | 0x0140 | +| Referenced Frame of Reference UID | 0x3006 | 0x0024 | +| Related Frame of Reference UID | 0x3006 | 0x00C2 | +| UID | 0x0040 | 0xA124 | + +### Hashing Operations + +Hashing is used to replace identifiable data with a unique identifier. The following tags are hashed: + +Table 2 + +| Name | Group | Element | +|----------------------------------------|--------|---------| +| Patient ID | 0x0010 | 0x0020 | + +The hashing algorithm is the [BLAKE2](https://en.wikipedia.org/wiki/BLAKE_(hash_function)#BLAKE2) hashing function. +A per-project salt and a locally held salt are used.
These are stored separately and not included in exports for users. + +### Time Shifting Operations + +Anonymisation may be performed without removal of dates and times where this can be justified. However, our default will be to remove these data. +This requires deletion of dates and times from the DICOM file, with +linkage on an accession number or similar. + +Date and time fields that might be affected by this are listed below. + +Table 3 + +| Name | Group | Element | +|---------------------------|--------|---------| +| Study Date | 0x0008 | 0x0020 | +| Series Date | 0x0008 | 0x0021 | +| Acquisition Date | 0x0008 | 0x0022 | +| Image Date | 0x0008 | 0x0023 | +| Acquisition Date Time | 0x0008 | 0x002a | +| Study Time | 0x0008 | 0x0030 | +| Series Time | 0x0008 | 0x0031 | +| Acquisition Time | 0x0008 | 0x0032 | +| Image Time | 0x0008 | 0x0033 | + +### Fixed tags + +The following tag(s) are set to a fixed value: + +Table 4 + +| Name | Group | Element | +|-----------|--------|---------| +| Study ID | 0x0020 | 0x0010 | +| Accession Number | 0x0008 | 0x0050 | +| Patient's Name | 0x0010 | 0x0010 | + +## Other Anonymisation + +- Other anatomical regions or 3D scans (for example DICOM MRI) would require a slightly different anonymisation process (to be defined). + +## Documentation + +- PIXL automatically logs audit trails for the steps taken and database tables used; these should include the project name and the date on which anonymisation (the + current process) is performed. +- Record any issues encountered during the anonymisation process as a GitHub issue, referring to this SOP. + +## Monitoring Compliance + +- Spot check anonymisation quality for names and dates on 1% of cases per project release (minimum of 100 and maximum of 500). Report issues as GitHub + issues to the PIXL team. Iterate this process until the quality is acceptable and no identifiable data remains.
The person checking the + anonymisation should not be the same person who performed the anonymisation; it can be a manager, a CRIU member or a research team member under + a Confidentiality Advisory Group-approved project. +- Update SOP as needed based on results of compliance checks. diff --git a/docs/developer/README.md b/docs/developer/README.md new file mode 100644 index 000000000..78f1d2a0e --- /dev/null +++ b/docs/developer/README.md @@ -0,0 +1,29 @@ +## 'PIXL/docs/developer' Directory Contents + +Developer documentation + +
+ +

Subdirectories with links to the relevant README

+ +
+ +[setup](./setup/README.md) + +
+ +
+ +

Files

+ +
+ +| **User docs** | +| :--- | +| DICOM-anonymisation-SOP.md | +| ftp-server.md | +| parquet_files.md | +| pixl_database.md | + +
+ diff --git a/docs/services/ftp-server.md b/docs/developer/ftp-server.md similarity index 59% rename from docs/services/ftp-server.md rename to docs/developer/ftp-server.md index 7564f281b..2606f48b3 100644 --- a/docs/services/ftp-server.md +++ b/docs/developer/ftp-server.md @@ -3,7 +3,7 @@ Currently, we can only upload files to the Data Safe Haven (DSH) through an [FTPS](https://en.wikipedia.org/wiki/FTPS) connection. -The [`core.upload`](../../pixl_core/src/core/upload.py) module implements functionality to upload +The [`core.upload`](../../pixl_core/src/core/uploader/README.md) module implements functionality to upload DICOM tags and parquet files to an **FTPS server**. This requires the following environment variables to be set: @@ -13,8 +13,11 @@ variables to be set: - `FTP_USER_PASSWORD`: password for the authorised user We provide mock values for these for the unit tests (see -[`./tests/conftest.py`](./tests/conftest.py)). When running in production, these should be defined -in the `.env` file (see [the example](../.env.sample)). +[`pixl_core/tests/conftest.py`](../../pixl_core/tests/conftest.py)). When running in production, these should be defined +in the `.env` file (see [the example](../../.env.sample)). + + +## The `pytest-pixl` plugin + +We provide a [`pytest` plugin](../../pytest-pixl/README.md) with shared functionality for PIXL system +and unit tests. This includes an `ftp_server` fixture to spin up a lightweight FTP server, +to mock the FTP server used by the Data Safe Haven. 
diff --git a/docs/file_types/parquet_files.md b/docs/developer/parquet_files.md similarity index 91% rename from docs/file_types/parquet_files.md rename to docs/developer/parquet_files.md index 8b56234e1..b8785dcbf 100644 --- a/docs/file_types/parquet_files.md +++ b/docs/developer/parquet_files.md @@ -3,7 +3,7 @@ ## OMOP-ES files From -[OMOP-ES](https://github.com/SAFEHR-data/the-rolling-skeleton/blob/main/docs/design/100-day-design.md#data-flow-through-components) +[OMOP-ES](../design/details/100-day-design.md#data-flow-through-components) we receive parquet files defining the data we need to export. These input files appear as 2 groups: 1. **Public** parquet files: have had identifiers removed and replaced with a sequential ID for the @@ -25,7 +25,8 @@ An output parquet file named `IMAGE_LINKER.parquet` that defines the connection OMOP procedure_occurrence_id for the current extract and the pseudo image/study ID. The procedure_occurrence_id can get renumbered from extract to extract. -See method `make_radiology_linker_table` for more. + ## Exporting (copying from OMOP ES) diff --git a/docs/services/pixl_database.md b/docs/developer/pixl_database.md similarity index 100% rename from docs/services/pixl_database.md rename to docs/developer/pixl_database.md diff --git a/docs/developer/setup/README.md b/docs/developer/setup/README.md new file mode 100644 index 000000000..49d0728f6 --- /dev/null +++ b/docs/developer/setup/README.md @@ -0,0 +1,16 @@ +## 'PIXL/docs/developer/setup' Directory Contents + +
+ +

Files relating to the setup of PIXL

+ +
+ +| **User docs** | +| :--- | +| azure-keyvault.md | +| pixl-setup.md | +| uclh-infrastructure-setup.md | + +
+ diff --git a/docs/setup/azure-keyvault.md b/docs/developer/setup/azure-keyvault.md similarity index 100% rename from docs/setup/azure-keyvault.md rename to docs/developer/setup/azure-keyvault.md diff --git a/docs/setup/developer.md b/docs/developer/setup/pixl-setup.md similarity index 92% rename from docs/setup/developer.md rename to docs/developer/setup/pixl-setup.md index 7f8ae9764..a0eb73873 100644 --- a/docs/setup/developer.md +++ b/docs/developer/setup/pixl-setup.md @@ -71,7 +71,7 @@ If your are running Docker Engine on Linux, listening on this socket should be ### Integration tests There are also integration tests in `PIXL/test/` directory that can be run using the `PIXL/test/run-system-test.sh`. See the -[integration test docs](test/README.md) for more info. +[integration test docs](../../../test/README.md) for more info. ### Workflow @@ -99,18 +99,18 @@ To install the git pre-commit hook locally so it runs every time you make a comm pre-commit install ``` -The `pre-commit` configuration can be found in [`.pre-commit-config.yml`](../../.pre-commit-config.yaml). +The `pre-commit` configuration can be found in [`.pre-commit-config.yml`](../../../.pre-commit-config.yaml). ## Environment variables Running the `pixl` pipeline and the tests requires a set of environment variables to be set. The `test/` -directory contains a complete [`.env` file](../../test/.env) that can be used to run the pipeline and tests locally. +directory contains a complete [`.env` file](../../../test/.env) that can be used to run the pipeline and tests locally. Either run any `pixl` commands from the `test/` directory, or copy the `test/.env` file to the root of the repository. ### Secrets -PIXL uses an [Azure Keyvault](../../README.md#project-secrets) to store authentication details for +PIXL uses an [Azure Keyvault](../azure-keyvault.md) to store authentication details for external services. We have a development keyvault for testing. 
Access to this keyvault is provided by a set of environment variables specified in `test/.secrets.env.sample`. To run the pipeline locally, you will need to copy this file to `test/.secrets.env` and fill out diff --git a/docs/setup/uclh-infrastructure-setup.md b/docs/developer/setup/uclh-infrastructure-setup.md similarity index 100% rename from docs/setup/uclh-infrastructure-setup.md rename to docs/developer/setup/uclh-infrastructure-setup.md diff --git a/docs/joss-publication/paper.bib b/docs/joss-publication/paper.bib new file mode 100644 index 000000000..e69de29bb diff --git a/docs/joss-publication/paper.md b/docs/joss-publication/paper.md new file mode 100644 index 000000000..c6c0206c8 --- /dev/null +++ b/docs/joss-publication/paper.md @@ -0,0 +1 @@ +This is the file where the paper must be written \ No newline at end of file diff --git a/index.md b/index.md index badac46d3..0e3f06e0c 100644 --- a/index.md +++ b/index.md @@ -6,19 +6,19 @@ * [ADR-2](cli/tests/README.md) - * [ADR-3](CODE_OF_CONDUCT.md) - Contributor Covenant Code of Conduct * [ADR-4](CONTRIBUTING.md) - Contributing to `PIXL`. 
-* [ADR-5](docs/design/bigger_picture.md) - The bigger picture +* [ADR-5](docs/design/details/bigger_picture.md) - The bigger picture * [ADR-0001](docs/design/decision-record/0001-multiservice-architecture.md) - Multi-service Architecture * [ADR-0002](docs/design/decision-record/0002-message-processing.md) - Message-based Processing of Images * [ADR-0003](docs/design/decision-record/0003-dicom-processing.md) - DICOM server and processing * [ADR-0004](docs/design/decision-record/0004-secrets-storage.md) - Secrets Storage * [ADR-6](docs/design/decision-record/index.md) - Architectural Decision Log * [ADR-7](docs/design/decision-record/template.md) - [short title of solved problem and solution] -* [ADR-8](docs/file_types/parquet_files.md) - Parquet files you might encounter throughout PIXL -* [ADR-9](docs/services/ftp-server.md) - FTPS server -* [ADR-10](docs/services/pixl_database.md) - The PIXL database -* [ADR-11](docs/setup/azure-keyvault.md) - Azure Keyvault setup -* [ADR-12](docs/setup/developer.md) - Developer setup -* [ADR-13](docs/setup/uclh-infrastructure-setup.md) - UCLH Infrastructure setup instructions +* [ADR-8](docs/developer/parquet_files.md) - Parquet files you might encounter throughout PIXL +* [ADR-9](docs/developer/ftp-server.md) - FTPS server +* [ADR-10](docs/developer/pixl_database.md) - The PIXL database +* [ADR-11](docs/developer/setup/azure-keyvault.md) - Azure Keyvault setup +* [ADR-12](docs/developer/setup/pixl-setup.md) - Developer setup +* [ADR-13](docs/developer/setup/uclh-infrastructure-setup.md) - UCLH Infrastructure setup instructions * [ADR-14](hasher/README.md) - Hasher API * [ADR-15](orthanc/orthanc-anon/docs/DicomServiceViaAAD.md) - Retrieving an access token for the DICOM service using and Azure AD application * [ADR-16](orthanc/orthanc-anon/README.md) - Orthanc Anon @@ -36,7 +36,7 @@ - +not sure what 26 is supposed to be about or pointing to diff --git a/pixl_core/src/core/uploader/README.md 
b/pixl_core/src/core/uploader/README.md new file mode 100644 index 000000000..09a39793d --- /dev/null +++ b/pixl_core/src/core/uploader/README.md @@ -0,0 +1,19 @@ +## 'PIXL/pixl_core/src/core/uploader' Directory Contents + +
+ +

Files

+ +
+ +| **Code** | **User docs** | +| :--- | :--- | +| base.py | README.md | +| _dicomweb.py | | +| _ftps.py | | +| _orthanc.py | | +| _xnat.py | | +| __init__.py | | + +
+