reprodICU

DOCUMENTATION

INSTALLATION & GETTING STARTED

INTRODUCTION

reprodICU is a freely accessible pipeline that streamlines the creation of a harmonized critical care dataset covering up to 470k ICU admissions from multiple healthcare centers across the US and Europe. It harmonizes data from the following publicly available ICU datasets, each previously published by others: AmsterdamUMCdb, eICU-CRD, HiRID, MIMIC-III, MIMIC-IV, NWICU, and SICdb.

As part of the Charité Outcomes Research Repository (CORR), the pipeline was developed by the Institute of Medical Informatics (IMI) at Charité - Universitätsmedizin Berlin.

The dataset created by running the pipeline contains de-identified demographic information and a total of 136 routinely collected physiological variables, diagnostic test results and treatment parameters from almost 350k patients during the period from 2001 to 2022.

INCLUDED DATASETS

AXIOMS

Axioms are data points that cannot be derived from any other data. For example, a patient's heart_rate cannot be calculated from their lab values. Anything that can be calculated, however complicated the calculation, is not an axiom, and anything that can be calculated should be calculated. Calculable variables are called concepts. Concepts should be defined as Python functions of their inputs. Those inputs need not be axioms; a concept may also be derived from other concepts. Followed far enough, every chain of derivation ends at axioms.
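
The distinction between axioms and concepts can be illustrated with a minimal sketch. The function and parameter names below are hypothetical and are not taken from the reprodICU code base; the formulas are the standard textbook definitions. Measured values (blood pressure, heart rate, weight, height) play the role of axioms, and each concept is a plain Python function of its inputs.

```python
# Illustrative sketch only: these names are assumptions, not reprodICU's API.

def mean_arterial_pressure(systolic_mmhg: float, diastolic_mmhg: float) -> float:
    """Concept derived from two axioms (measured systolic and diastolic pressure)."""
    return diastolic_mmhg + (systolic_mmhg - diastolic_mmhg) / 3.0


def body_mass_index(weight_kg: float, height_cm: float) -> float:
    """Concept derived from two axioms (measured weight and height)."""
    height_m = height_cm / 100.0
    return weight_kg / height_m ** 2


def modified_shock_index(
    heart_rate_bpm: float, systolic_mmhg: float, diastolic_mmhg: float
) -> float:
    """Concept derived from one axiom (heart rate) and another concept (MAP)."""
    return heart_rate_bpm / mean_arterial_pressure(systolic_mmhg, diastolic_mmhg)


if __name__ == "__main__":
    print(mean_arterial_pressure(120.0, 80.0))
    print(body_mass_index(70.0, 175.0))
    print(modified_shock_index(80.0, 120.0, 80.0))
```

Here modified_shock_index depends on mean_arterial_pressure, which in turn depends only on measured values, so the chain of derivation ends at axioms.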

HIGHLIGHTS

Scale and Scope

reprodICU harmonizes 469,822 ICU admissions from seven major public datasets across four countries, creating the largest harmonized ICU dataset publicly available. This breadth enables cross-institutional and cross-national studies that were previously impractical due to data incompatibility.

Standardization Without Overprocessing

reprodICU is harmonized using established clinical vocabularies (e.g., SNOMED, LOINC, RxNorm) and broadly follows the structure of the German Medical Informatics Initiative modules to ensure interoperability. Crucially, the project applies minimal preprocessing to preserve source fidelity and maintain compatibility with the original datasets.
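
As a rough illustration of vocabulary-based harmonization, the sketch below maps dataset-specific labels onto shared LOINC codes while passing the measured values through unchanged. The dataset names, local labels, and the harmonize function are hypothetical and do not reflect reprodICU's actual mapping tables; only the two LOINC codes (8867-4 for heart rate, 8310-5 for body temperature) are real.

```python
# Illustrative sketch only: dataset names and local labels are made up.
LOCAL_TO_LOINC = {
    ("dataset_a", "HR"): "8867-4",
    ("dataset_b", "Heart Rate"): "8867-4",
    ("dataset_a", "Temp"): "8310-5",
    ("dataset_b", "Temperature"): "8310-5",
}

LOINC_NAMES = {
    "8867-4": "Heart rate",
    "8310-5": "Body temperature",
}


def harmonize(records):
    """Attach a LOINC code to each (dataset, label, value) record.

    Values are passed through untouched; records without a known
    mapping are skipped rather than guessed.
    """
    harmonized = []
    for dataset, label, value in records:
        code = LOCAL_TO_LOINC.get((dataset, label))
        if code is None:
            continue  # unmapped label: leave it out instead of inventing a code
        harmonized.append(
            {"loinc": code, "name": LOINC_NAMES[code], "value": value, "source": dataset}
        )
    return harmonized


if __name__ == "__main__":
    rows = [
        ("dataset_a", "HR", 72.0),
        ("dataset_b", "Heart Rate", 88.0),
        ("dataset_b", "unknown_item", 1.0),
    ]
    for row in harmonize(rows):
        print(row)
```

Keeping the source value and the originating dataset on every record is one simple way to stay close to the original data while still sharing a common code.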

Rich Library of Pre-Defined Clinical Concepts

The project provides a large, curated catalog of clinical variables, ranging from advanced ventilator metrics to dozens of mortality and severity scoring systems (e.g., SOFA, APACHE, MODS, NEWS, SAPS). These ready-to-use components spare researchers and developers from redefining or looking up formulas, making it easier and faster to build robust analyses or models.

Inspired by previous work by

Bennett et al. (2023), "ricu: R’s interface to intensive care data", and the associated Git repository ricu
Oliver et al. (2023), "Introducing the BlendedICU dataset, the first harmonized, international intensive care dataset", and the associated Git repository BlendedICU
