reprodICU is a freely accessible pipeline that streamlines the creation of a harmonized critical care dataset covering up to 470k ICU admissions from multiple healthcare centers across the US and Europe. The pipeline harmonizes data from the following publicly available ICU datasets, each previously published by others: AmsterdamUMCdb, eICU-CRD, HiRID, MIMIC-III, MIMIC-IV, NWICU, and SICdb.
As part of the Charité Outcomes Research Repository (CORR), the pipeline was developed by the Institute of Medical Informatics (IMI) at Charité - Universitätsmedizin Berlin.
The dataset created by running the pipeline contains de-identified demographic information and a total of 136 routinely collected physiological variables, diagnostic test results and treatment parameters from almost 350k patients during the period from 2001 to 2022.
- AmsterdamUMCdb v1.0.2
- eICU Collaborative Research Database v2.0
- HiRID, a high time-resolution ICU dataset v1.1.1
- MIMIC-III Clinical Database v1.4
- MIMIC-IV v3.1
- Northwestern ICU (NWICU) database v0.1.0
- Salzburg Intensive Care database (SICdb) v1.0.8
Axioms are data points that cannot be derived from any other data. For example, a patient's heart_rate cannot be calculated from their lab values.
Anything else that can be calculated, however complicated the calculation may be, is not an axiom. Anything that can be calculated should be calculated. Calculable variables are called Concepts.
Concepts should be defined as Python functions that depend on their respective inputs. A concept does not have to be defined directly in terms of axioms; it may also be derived from other concepts. Following the chain of derivations to the point where nothing further can be derived always ends at the axioms.
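The relationship between axioms and concepts can be sketched as follows. This is a minimal illustration, not reprodICU's actual code: the function names and the choice of blood pressure variables are assumptions made for the example, and the mean arterial pressure formula is the common textbook estimate rather than a formula taken from the pipeline.

```python
from typing import Optional

# Axioms in this sketch: heart_rate, systolic_bp and diastolic_bp are measured
# values that cannot be derived from other data. All names are hypothetical.


def pulse_pressure(systolic_bp: float, diastolic_bp: float) -> float:
    """Concept derived directly from two axioms (mmHg)."""
    return systolic_bp - diastolic_bp


def estimated_map(systolic_bp: float, diastolic_bp: float) -> float:
    """Concept derived from an axiom and another concept (mmHg).

    Uses the textbook estimate: diastolic pressure plus one third of the
    pulse pressure.
    """
    return diastolic_bp + pulse_pressure(systolic_bp, diastolic_bp) / 3


def shock_index(heart_rate: float, systolic_bp: float) -> Optional[float]:
    """Concept derived from two axioms: heart rate divided by systolic pressure.

    Returns None when the systolic pressure is not positive, because the
    ratio is undefined in that case.
    """
    if systolic_bp <= 0:
        return None
    return heart_rate / systolic_bp
```

Here `estimated_map` never touches raw data on its own terms; it calls `pulse_pressure`, which in turn depends only on axioms, so the chain of derivations ends at measured values.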
reprodICU harmonizes 469,822 ICU admissions from seven major public datasets across four countries, creating the largest harmonized ICU dataset publicly available. This breadth enables cross-institutional and cross-national studies that were previously impractical due to data incompatibility.
reprodICU is harmonized using established clinical vocabularies (e.g., SNOMED, LOINC, RxNorm) and broadly follows the structure of the German Medical Informatics Initiative modules to ensure interoperability. Crucially, the project applies minimal preprocessing to preserve source fidelity and maintain compatibility with the original datasets.
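As a rough sketch of what vocabulary-based harmonization with minimal preprocessing could look like, the example below maps source-specific labels onto one shared variable carrying a LOINC code. The dataset names, source labels, class, and function are hypothetical and are not taken from reprodICU; only the LOINC code 8867-4 (heart rate) is a real identifier.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class HarmonizedVariable:
    """A shared target variable identified by a standard vocabulary code."""
    name: str
    loinc_code: str
    unit: str


# LOINC 8867-4 is the code for heart rate.
HEART_RATE = HarmonizedVariable(name="heart_rate", loinc_code="8867-4", unit="/min")

# Hypothetical (dataset, source label) pairs, invented for this illustration.
SOURCE_LABELS = {
    ("dataset_a", "HR"): HEART_RATE,
    ("dataset_b", "Heart Rate (bpm)"): HEART_RATE,
}


def harmonize(dataset: str, source_label: str, value: float) -> Optional[dict]:
    """Relabel one measurement; the value itself is passed through unchanged."""
    target = SOURCE_LABELS.get((dataset, source_label))
    if target is None:
        return None  # label not covered by the mapping
    return {
        "variable": target.name,
        "loinc": target.loinc_code,
        "value": value,
        "unit": target.unit,
    }
```

Keeping the mapping in a plain lookup table and leaving values untouched mirrors the stated goal of preserving source fidelity: only the label changes, so results can still be traced back to the original datasets.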
The project includes a large, curated catalog of clinical variables, ranging from advanced ventilator metrics to dozens of mortality and severity scoring systems (e.g., SOFA, APACHE, MODS, NEWS, SAPS). These ready-to-use components spare researchers and developers from manually redefining or looking up formulas, making it easier and faster to build robust analyses or models.
- Bennett et al. (2023), ricu: R’s interface to intensive care data, and the associated Git repository ricu
- Oliver et al. (2023), Introducing the BlendedICU dataset, the first harmonized, international intensive care dataset, and the associated Git repository BlendedICU