The purpose of this project is to provide an automated import mechanism for emergency medical service data into an i2b2 data warehouse.
This script acts as the bridge between an external, relational data format (CSV) and the i2b2 clinical data model. Its primary function is to Extract, Transform, and Load (ETL) this specific dataset.
This script is not intended to be run as a standalone file by end-users. Instead, it is designed to be integrated into a larger Data Warehouse (DWH) or data management platform.
The intended workflow is as follows:
- User Action: In DWH Admin, a user navigates to the "Art der Verarbeitung" (processing type) selection under Daten-Import (data import).
- Script Selection: The user selects this script (e.g., "Rettungsdienst Importscript").
- File Upload: The platform prompts the user to upload a single `.zip` file containing the emergency service data.
- Execution: Upon receiving the file, the DWH backend executes this Python script, passing the path to the uploaded `.zip` file as an argument.
- Transformation: The script handles the entire ETL process (see the example below):
  - It opens the `.zip` file.
  - It reads `einsatzdaten.csv` from within the zip.
  - It transforms the relational (flat) CSV data into the vertical Entity-Attribute-Value (EAV) model used by i2b2's `OBSERVATION_FACT` table.
  - It loads the transformed data into the i2b2 database.
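To make the EAV step concrete, here is a purely illustrative example: one flat CSV row fans out into one `OBSERVATION_FACT` row per mapped attribute. The column `einsatzart`, the timestamps, and the mapping of `einsatznummer` to `encounter_num` are invented for illustration; only `einsatznummer` and the concept code `AS:TYPE:A01` appear in this document.

```text
einsatzdaten.csv (flat, one row per incident):

  einsatznummer;einsatzart;...
  1234567890;A01;...

OBSERVATION_FACT (vertical EAV, several rows per incident):

  encounter_num | concept_cd  | start_date          | tval_char
  --------------+-------------+---------------------+----------
  1234567890    | AS:TYPE:A01 | 2024-01-15 08:30:00 |
  1234567890    | ...         | 2024-01-15 08:30:00 | ...
```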
For a developer maintaining or improving this script, here is the step-by-step technical breakdown:
- All transformation logic, file-level validation, and i2b2 concept mappings are defined in the `CONFIG` dictionary at the top of the script; a sketch of its shape follows this list.
- To add a new field: add a new transformation rule to the `CONFIG["i2b2_transforms"]` list.
- To change validation: modify the regex patterns in `CONFIG["files"]["einsatzdaten"]["regex_patterns"]`.
- Unzip: The `extract_zip` function securely extracts the contents of the provided `.zip` file into a temporary directory.
- Read CSV: The `extract` function reads the required `einsatzdaten.csv` file from the temporary directory into a pandas DataFrame. It expects a semicolon (`;`) delimiter.
- Mandatory Columns: `preprocess` checks that all `mandatory_columns` (e.g., `einsatznummer`) are present.
- Regex Validation: The `validate_dataframe` function iterates through the regex patterns defined in `CONFIG` and checks every value in the corresponding columns (see the sketch after this list). If an invalid, non-empty value is found, the script raises an error and stops.
- Timestamp Check: It ensures that no row is missing all possible clock/time columns, as a valid `start_date` is required for i2b2.
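A minimal sketch of that regex pass: `validate_dataframe` is the script's own name, but the signature, the fullmatch semantics, and the error message here are assumptions.

```python
import re

import pandas as pd


def validate_dataframe(df: pd.DataFrame, regex_patterns: dict) -> None:
    """Raise on the first non-empty value that fails its column's pattern."""
    for column, pattern in regex_patterns.items():
        if column not in df.columns:
            continue
        compiled = re.compile(pattern)
        for idx, value in df[column].items():
            # empty cells are allowed; only non-empty values must match
            if pd.isna(value) or str(value).strip() == "":
                continue
            if not compiled.fullmatch(str(value)):
                raise ValueError(
                    f"Invalid value {value!r} in column {column!r} (row {idx})"
                )
```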
This is the core logic, handled by `transform_dataframe` and `dataframe_to_i2b2`.
Important: Facts are processed in ascending time order. This ensures that newer data automatically updates or overrides older information.
- Find Earliest Timestamp: It scans all `clock_columns` for each row to find the earliest valid timestamp. This becomes the `start_date` for the i2b2 `observation_fact` and is stored in `_metadata_start_date`.
- Assign Instance Number: It assigns an `instance_num` to ensure uniqueness for facts that share the same encounter and start time.
- Apply Transforms: The script iterates over every row of the DataFrame and applies every rule in the `CONFIG["i2b2_transforms"]` list. A dispatcher (`TRANSFORM_DISPATCHER`) calls the correct function (`tval_transform`, `code_transform`, `cd_transform`) based on the rule's `transform_type` (see the sketch after this list):
  - `tval_transform`: Creates a simple fact with the CSV value in `tval_char`.
  - `code_transform`: Creates a fact where the CSV value is appended to the `concept_cd_base` (e.g., `AS:TYPE:A01`).
  - `cd_transform`: Creates a "modifier fact" that links a value (`tval_char`) to a `concept_cd` using a `modifier_cd`.
- Add Metadata: The script enriches the i2b2 data with:
  - `import_date` and `update_date`.
  - A `sourcesystem_cd` generated by hashing the input `.zip` file, which allows tracing all facts back to their source file.
  - Script ID and version (from the environment variables `uuid` and `script_version` provided by the DWH).
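The dispatcher pattern reads roughly like the sketch below. The transform function names and `TRANSFORM_DISPATCHER` come from the script itself; the rule key `column`, the exact signatures, and the driver function `apply_transforms` are assumptions for illustration (the real code lives in `transform_dataframe` / `dataframe_to_i2b2`).

```python
def tval_transform(rule: dict, value: str) -> dict:
    # simple fact: the raw CSV value is stored in tval_char
    return {"concept_cd": rule["concept_cd"], "tval_char": value}


def code_transform(rule: dict, value: str) -> dict:
    # coded fact: the CSV value is appended to the concept code base,
    # e.g. "AS:TYPE:" + "A01" -> "AS:TYPE:A01"
    return {"concept_cd": rule["concept_cd_base"] + value}


def cd_transform(rule: dict, value: str) -> dict:
    # modifier fact: links the value to a concept via a modifier code
    return {
        "concept_cd": rule["concept_cd"],
        "modifier_cd": rule["modifier_cd"],
        "tval_char": value,
    }


TRANSFORM_DISPATCHER = {
    "tval_transform": tval_transform,
    "code_transform": code_transform,
    "cd_transform": cd_transform,
}


def apply_transforms(df, transforms):
    """Fan each flat row out into one fact dict per applicable rule."""
    facts = []
    for _, row in df.iterrows():
        for rule in transforms:
            value = row.get(rule["column"])
            if value is None or str(value).strip() == "":
                continue
            fact = TRANSFORM_DISPATCHER[rule["transform_type"]](rule, str(value))
            facts.append(fact)
    return facts
```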
- Connect: The `load` function reads database credentials (`username`, `password`, `connection-url`) from environment variables.
- Delete Old Data: To ensure idempotency, the `delete_from_db` function first deletes any existing facts from the `observation_fact` table that match the `encounter_num`, `start_date`, and `concept_cd` of the data about to be loaded (see the sketch below).
- Insert New Data: The `upload_into_db` function inserts the new, transformed DataFrame into the `observation_fact` table in batches.
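A sketch of the delete-then-insert idempotency pattern: `delete_from_db` and the three key columns are taken from the description above, but the SQL text, the qmark (`?`) paramstyle, and the `executemany` batching are assumptions; adjust to your driver's paramstyle.

```python
import pandas as pd

DELETE_SQL = (
    "DELETE FROM observation_fact "
    "WHERE encounter_num = ? AND start_date = ? AND concept_cd = ?"
)


def delete_from_db(conn, facts: pd.DataFrame) -> None:
    """Remove facts that would collide with the incoming batch (idempotency)."""
    keys = facts[["encounter_num", "start_date", "concept_cd"]].drop_duplicates()
    cursor = conn.cursor()
    cursor.executemany(DELETE_SQL, list(keys.itertuples(index=False, name=None)))
    conn.commit()
```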
To run the script locally, you must provide environment variables for the database connection. The `load_env()` helper function (used for `__main__` execution) tries to load a `.env` file from the parent directory of the script.
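A plausible shape for that helper, assuming the python-dotenv package (the actual implementation may differ):

```python
from pathlib import Path

from dotenv import load_dotenv  # provided by the python-dotenv package


def load_env() -> None:
    # look for a .env file in the script's parent directory
    env_path = Path(__file__).resolve().parent / ".env"
    load_dotenv(dotenv_path=env_path)
```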
Create a .env file with the following keys:
```
username=YOUR_DB_USER
password=YOUR_DB_PASSWORD
connection-url=jdbc:postgresql://your-host:5432/your-db?searchPath=your_i2b2_schema
uuid=test-script-uuid
script_version=1.0-test
```

Once your `.env` file is set up and dependencies are installed, you can run the script from your terminal:
```bash
# Install dependencies
pip install -r requirements.txt

# Run the import
python rd_import.py /path/to/your/RettungsdienstData.zip
```