stratosida · georgheinze · Aug 30, 2022 · Sep 1, 2022 · Sep 2, 2022 · Nov 14, 2022
diff --git a/Bact_IDA_plan.qmd b/Bact_IDA_plan.qmd
@@ -0,0 +1,124 @@
+# IDA plan {#IDA_plan}
+
+```{r}
+library(here)
+library(tidyverse)
+
+
+load(here::here("data", "a_bact.rda"))
+```
+
+
+This document exemplifies the prespecified plan for initial data analysis (IDA plan) for the bacteremia study.
+
+## Prerequisites for the IDA plan
+
+### Analysis strategy
+
+We assume that the aims of the study are to fit a diagnostic prediction model and to describe the functional form of each predictor. These aims are addressed by fitting a logistic regression  model with bacteremia status as the dependent variable. 
+
+Based on domain expertise, the predictors are grouped by their assumed importance for predicting bacteremia. Variables with known strong associations with bacteremia are age (AGE), leukocytes (WBC), blood urea nitrogen (BUN), creatinine (CREA), thrombocytes (PLT), and neutrophils (NEU); these predictors will be included in the model as key predictors. Predictors of medium importance are potassium (POTASS) and some acute-phase related parameters such as fibrinogen (FIB), C-reactive protein (CRP), aspartate transaminase (ASAT), alanine transaminase (ALAT), and gamma-glutamyl transpeptidase (GGT). All other predictors are of minor importance.
+
+Continuous predictors should be modelled with flexible functional forms: four degrees of freedom will be spent on each key predictor, and at most three and two degrees of freedom on each predictor of medium and minor importance, respectively. The decision on whether to use only key predictors, or to also consider predictors of medium or minor importance, depends on the results of data screening, but will be made before uncovering the association of predictors with the outcome variable.
+
+An adequate strategy to cope with missing values will also be chosen after screening the data. Candidate strategies are omission of predictors with abundant missing values, complete case analysis, single value imputation or multiple imputation with chained equations. 
+
+### Data dictionary
+
+The data dictionary of the bacteremia data set consists of columns for variable names, variable labels, scale of measurement (continuous or categorical), units, plausibility limits, and remarks:
+
+```{r}
+bact.dd<-read.csv(here::here("misc", "bacteremia-DataDictionary.csv"))
+
+bact.dd
+```
+
+
+### Domain expertise
+
+The demographic variables age and sex are chosen as the structural variables in this analysis for illustration purposes, since they are commonly considered important for describing a cohort in health studies. Key predictors and predictors of medium importance are as defined above. Laboratory analyses always bear the risk of machine failures, and hence missing values are a frequent challenge. This may differ between laboratory variables, but no a priori estimate about the expected proportion of missing values can be assumed. As most predictors measure concentrations of chemical compounds or cell counts, skewed distributions are expected. Some predictors describe related types of cells or chemical compounds, and hence some correlation between them is to be expected. For example, leukocytes consist of five different types of blood cells (BASO, EOS, NEU, LYM and MONO), and the sum of the concentrations of these types approximately (but not exactly) gives the leukocyte count, which is recorded in the variable WBC. Moreover, these variables are given as absolute counts and as percentages of the sum of the five variables, which creates some correlation. Some laboratory variables differ by sex and age, but the special selection of patients for this study (suspicion of bacteremia) may distort or alter the expected correlations with sex and age.
+
+For the purpose of stratifying IDA results by age, age will be categorized into the following three groups: (15, 50], (50, 65], (65, 101].
+
+The predictor grouping is defined here:
+
+```{r, echo=TRUE}
+#demog_vars <-c("AGE", "SEX")
+structural_vars <- c("AGE", "SEX")
+key_predictors <- c("WBC", "AGE", "BUN","CREA","NEU","PLT")
+medimp_predictors <-c("POTASS", "FIB", "CRP", "ASAT", "ALAT", "GGT")
+outcome_vars <-c("BloodCulture", "BC")
+remaining_predictors <- names(a_bact)[is.na(match(names(a_bact),c("ID", structural_vars, key_predictors,medimp_predictors, outcome_vars)))]
+
+b_bact <- 
+  a_bact %>%
+  mutate(
+    GENDER = factor(SEX, levels = c(1, 2), labels = c("male", "female")),
+    AGEGROUP = factor(
+      cut(AGE, c(min(AGE) - 1, 50, 65, max(AGE))),
+      labels = c("1:(15,50]", "2:(50,65]", "3:(65,101]")
+    )
+  )
+
+
+bact_variables <- list(structural_vars=structural_vars, key_predictors=key_predictors, medimp_predictors=medimp_predictors,
+                       remaining_predictors=remaining_predictors, outcome_vars=outcome_vars)
+
+```
+
+## IDA plan
+
+
+### M1: Prevalence of missing values
+
+Numbers and proportions of missing values will be reported for each predictor separately (M1). Type of missingness has not been recorded.
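+
+As an illustration only, M1 could be computed along the following lines. This is a minimal sketch in base R; the data frame `toy` and its values are made up and stand in for the predictor columns of the bacteremia data.
+
+```{r, eval=FALSE}
+# Made-up example data (hypothetical values, not the bacteremia data)
+toy <- data.frame(
+  WBC = c(5.1, NA, 7.3, 9.8),
+  CRP = c(NA, NA, 12.0, 3.4),
+  AGE = c(34, 61, 70, 52)
+)
+
+# Number and proportion of missing values per variable
+m1 <- data.frame(
+  variable     = names(toy),
+  n_missing    = colSums(is.na(toy)),
+  prop_missing = colMeans(is.na(toy)),
+  row.names    = NULL
+)
+m1
+```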
+
+### M2: Complete cases
+
+The number of available complete cases (outcome and predictors) will be reported when considering:
+
+1.  the outcome variable (BC)
+2.  outcome and structural variables (BC, AGE, SEX)
+3.  outcome and key predictors only (BC, AGE, WBC, BUN, CREA, PLT, NEU)
+4.  outcome, key predictors and predictors of medium importance (BC, AGE, WBC, BUN, CREA, PLT, NEU, POTASS, FIB, CRP, ASAT, ALAT, GGT)
+5.  outcome and all predictors.
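+
+The counts listed above could be obtained with a sketch like the following, assuming each variable set is stored as a character vector of column names. The data frame `toy` and the list `var_sets` are hypothetical and only mimic the nested structure of the sets.
+
+```{r, eval=FALSE}
+# Made-up example data (hypothetical values, not the bacteremia data)
+toy <- data.frame(
+  BC  = c(1, 0, 0, 1, NA),
+  AGE = c(34, 61, 70, 52, 45),
+  WBC = c(5.1, NA, 7.3, 9.8, 6.0),
+  CRP = c(NA, 2.1, 12.0, 3.4, 1.0)
+)
+
+# Nested variable sets (hypothetical, shortened versions of the sets above)
+var_sets <- list(
+  outcome            = "BC",
+  outcome_key        = c("BC", "AGE", "WBC"),
+  outcome_key_medium = c("BC", "AGE", "WBC", "CRP")
+)
+
+# Number of rows without any missing value in the respective set
+n_complete <- sapply(
+  var_sets,
+  function(v) sum(complete.cases(toy[, v, drop = FALSE]))
+)
+n_complete
+```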
+
+### M3: Patterns of missing values
+
+Patterns of missing values will be investigated by:
+
+1.  computing a table of complete cases (for the three predictor sets described above) for strata defined by the structural variables age and sex,
+2.  constructing a dendrogram of missing values to explore which predictors tend to be missing together.
+
+
+
+### U1: Univariate descriptions: categorical variables
+
+For sex and bacteremia status, the frequency and proportion of each category will be described numerically.
+
+### U2: Univariate descriptions: continuous variables
+
+For all continuous predictors, combo plots consisting of high-resolution histograms, boxplots and dotplots will be created. Because skewed distributions are expected, combo plots will also be created for log-transformed predictors.
+
+As numerical summaries, minimum and maximum values, main quantiles (5th, 10th, 25th, 50th, 75th, 90th, 95th), and the first four moments (mean, standard deviation, skewness, kurtosis) will be reported. The number of distinct values and the five most frequent values will be given, as well as the concentration ratio (ratio of the frequency of the most frequent value to the mean frequency of the unique values).
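+
+To make the concentration ratio concrete, here is a small sketch in base R. The vector `x` is made up; the object names are ours and purely illustrative.
+
+```{r, eval=FALSE}
+# Made-up values of a single predictor
+x <- c(0, 0, 0, 0, 1, 2, 2, 3)
+
+tab    <- table(x)            # frequency of each unique value
+counts <- as.integer(tab)
+
+n_distinct <- length(counts)  # number of distinct values
+top5       <- head(sort(tab, decreasing = TRUE), 5)  # up to five most frequent values
+
+# Frequency of the most frequent value divided by the mean frequency
+conc_ratio <- max(counts) / mean(counts)
+```
+
+A concentration ratio of 1 means that all unique values occur equally often; larger values indicate that observations pile up at a single value.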
+
+Graphical and parametric multivariate analyses of the predictor space such as cluster analyses or the computation of variance inflation factors are heavily influenced by the distribution of the predictors. In order to make this set of analyses more robust to highly influential points or areas of the predictor support, some predictors may need transformation (e.g. logarithmic). We will compute the correlation of the untransformed and log-transformed predictors with normal deviates. Since some predictors may have values at or close to 0, we will consider the pseudolog transformation $f(x;\sigma) = \sinh^{-1}\left(x/(2\sigma)\right)/\log(10)$ (Johnson, 1949) which provides a smooth transition from linear (close to 0) to logarithmic (further away from 0). The transformation has a parameter $\sigma$ which we will optimize separately for each predictor in order to achieve an optimal approximation to a normal distribution monitored via the correlation of normal deviates with the transformed predictor. For those predictors for which the pseudolog-transformation increases correlation with normal deviates by at least 0.2 units of the correlation coefficient, the pseudolog-transformed predictor will be used in multivariate IDA instead of the respective original predictor. For those predictors, histograms and boxplots will be provided on both the original and the transformed scale.
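+
+The following sketch shows one way the choice of $\sigma$ could be implemented. It rests on assumptions that are ours and not fixed by the plan: the correlation with normal deviates is taken as the correlation between the sorted values and standard normal quantiles, $\sigma$ is searched on a log scale over an arbitrary interval, and `x` is simulated data. The helper names `pseudo_log` and `normal_cor` are hypothetical.
+
+```{r, eval=FALSE}
+# Pseudolog transformation as defined above
+pseudo_log <- function(x, sigma = 1) asinh(x / (2 * sigma)) / log(10)
+
+# Correlation of the ordered values with standard normal quantiles
+# (our assumed reading of "correlation with normal deviates")
+normal_cor <- function(x) {
+  x <- sort(x[!is.na(x)])
+  cor(x, qnorm(ppoints(length(x))))
+}
+
+set.seed(1)
+x <- rlnorm(500, meanlog = 1, sdlog = 1)  # simulated, right-skewed predictor
+
+# Maximize the correlation over log(sigma); the interval is an assumption
+opt <- optimize(
+  function(log_sigma) normal_cor(pseudo_log(x, exp(log_sigma))),
+  interval = log(c(1e-4, 1e4)),
+  maximum  = TRUE
+)
+
+sigma_hat       <- exp(opt$maximum)
+gain            <- opt$objective - normal_cor(x)
+use_transformed <- gain >= 0.2  # decision rule stated in the text
+```
+
+Searching over $\log\sigma$ keeps $\sigma$ positive and treats small and large values of $\sigma$ symmetrically.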
+
+### V1: Multivariate descriptions: associations of predictors with structural variables
+
+A scatterplot of each predictor against age, with separate panels for males and females, will be constructed. The associated Spearman correlation coefficients will be computed.
+
+### V2: Multivariate descriptions: correlation analyses
+
+A matrix of Spearman correlation coefficients between all pairs of predictors will be computed and described numerically as well as by means of a heatmap.
+
+### VE1: Multivariate descriptions: comparing nonparametric and parametric predictor correlation
+
+A matrix of Pearson correlation coefficients will be computed. Predictor pairs for which Spearman and Pearson correlation coefficients differ by more than 0.1 correlation units will be depicted in scatterplots.
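+
+A possible way to identify such predictor pairs is sketched below with simulated data. The data frame `toy` is made up: `B` is a monotone but strongly nonlinear function of `A`, so the two coefficients disagree for that pair.
+
+```{r, eval=FALSE}
+set.seed(2)
+z   <- rnorm(200)
+toy <- data.frame(A = z, B = exp(2 * z), C = rnorm(200))  # simulated predictors
+
+r_s <- cor(toy, method = "spearman", use = "pairwise.complete.obs")
+r_p <- cor(toy, method = "pearson",  use = "pairwise.complete.obs")
+
+# Pairs (upper triangle only) whose coefficients differ by more than 0.1
+diff_mat <- abs(r_s - r_p)
+flag     <- which(diff_mat > 0.1 & upper.tri(diff_mat), arr.ind = TRUE)
+
+flagged_pairs <- data.frame(
+  var1     = rownames(diff_mat)[flag[, "row"]],
+  var2     = colnames(diff_mat)[flag[, "col"]],
+  spearman = r_s[flag],
+  pearson  = r_p[flag]
+)
+flagged_pairs
+```
+
+Each row of `flagged_pairs` would then correspond to one scatterplot.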
+
+### VE2: Variable clustering
+
+A variable clustering analysis will be performed to evaluate which predictors are closely associated. A dendrogram groups predictors by their correlation. Scatterplots of pairs of predictors with Spearman correlation coefficients greater than 0.8 will be created.
+
+### VE3: Redundancy
+
+Variance inflation factors will be computed between the candidate predictors. This will be done for the three possible candidate models, and using all complete cases in the respective candidate predictor sets. Redundancy will further be explored by computing parametric additive models for each predictor in the three candidate models.
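+
+For orientation, a variance inflation factor can be computed in base R as $1/(1-R^2_j)$, where $R^2_j$ is obtained by regressing predictor $j$ linearly on all other predictors in the set. The sketch below uses simulated data and a hypothetical helper `vif_table`; it covers only the linear variance inflation factors, not the additive models mentioned above.
+
+```{r, eval=FALSE}
+# Variance inflation factors for all columns of a data frame,
+# using the complete cases of that set of columns
+vif_table <- function(data) {
+  data <- data[complete.cases(data), , drop = FALSE]
+  sapply(names(data), function(v) {
+    fit <- lm(reformulate(setdiff(names(data), v), response = v), data = data)
+    1 / (1 - summary(fit)$r.squared)
+  })
+}
+
+set.seed(3)
+a   <- rnorm(300)
+toy <- data.frame(A = a, B = a + rnorm(300, sd = 0.3), C = rnorm(300))  # simulated
+
+vifs <- vif_table(toy)
+vifs
+```
+
+In this simulated example `A` and `B` are nearly collinear and therefore receive much larger values than the independent predictor `C`.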
+
+
+```{r}
+
+save(list=c("b_bact", "bact_variables",  "structural_vars","key_predictors","medimp_predictors","remaining_predictors"), file = here::here("data", "bact_env_b.rda"))
+```
diff --git a/Bact_SAP.qmd → Bact_SAP_orphaned.Rmd b/Bact_SAP.qmd → Bact_SAP_orphaned.Rmd
@@ -1,4 +1,13 @@
-# IDA plan
+---
+title: "Bact_SAP_orphaned"
+author: "GH"
+date: "2022-08-30"
+output: html_document
+---
+
+```{r setup, include=FALSE}
+knitr::opts_chunk$set(echo = TRUE)
+```
 
 ```{r, echo=FALSE, warning=FALSE, message=FALSE, echo=FALSE}
 ## Load libraries
@@ -274,3 +283,4 @@ Tudela P, Lacoma A, Prat C, Modol JM, Gimenez M, et al. (2010) Prediction of bac
 save(list=c("b_bact", "bact_variables", "demog_vars", "structural_vars","key_predictors","leuko_related_vars","leuko_ratio_vars","kidney_related_vars", "acute_related_vars","remaining_vars"), file = here::here("data", "bact_env_b.rda"))
 ```
 
+
diff --git a/Bact_intro.qmd b/Bact_intro.qmd
@@ -1,5 +1,4 @@
-# Bacteremia {#Bacteremia}
-
+# Bacteremia study  {#Bacteremia}
 
 ```{r, echo=FALSE, warning=FALSE, message=FALSE, echo=FALSE}
 ## Load libraries for this chapter
@@ -8,27 +7,16 @@ library(tidyverse)
 library(Hmisc)
 ```
 
-# Introduction to Bacteremia
-
-To demonstrate the workflow and content of IDA, we created a hypothetical research aim and corresponding statistical analysis plan, which is described in more detail in the section [Bact_SAP.Rmd](Bact_SAP.Rmd).
-
-**Hypothetical research aim for IDA** is to develop a multivariable diagnostic model for bacteremia using 49 continuous laboratory blood parameters, age and gender with the primary aim of prediction and a secondary aim of describing the association of each variable with the outcome ('explaining' the multivariable model). 
-
-A diagnostic prediction model was developed based on this data set and validated in "A Risk Prediction Model for Screening Bacteremic Patients: A Cross Sectional Study"  [Ratzinger et al, PLoS One 2014](https://doi.org/10.1371/journal.pone.0106765). The assumed research aim is in line with this diagnostic prediction model.
-
 
-## Dataset Description
 
-Ratzinger et al (2014) performed a diagnostic study in which age, sex and 49 laboratory variables can be used to diagnose bacteremia status  of a blood sample using a multivariable model.  Between January 2006 and December 2010, patients with the clinical suspicion to suffer from bacteraemia were included if blood culture analysis was requested by the responsible physician and blood was sampled for assessment of haematology and biochemistry. The data consists of 14,691 observations from different patients.
+## Overview of the bacteremia study
 
-Our version of this data was slightly modified compared to original version, and this modified version was cleared by the Medical University of Vienna for public use (DC 2019-0054). Variable names have been kept as they were (partly German abbreviations). A data dictionary is available in the **misc** folder of the project directory ('bacteremia-DataDictionary.csv').
+We will exemplify our proposed systematic approach to data screening by means of a diagnostic study with the primary aim of using age, sex and 49 laboratory variables to fit a diagnostic prediction model for the bacteremia status (= presence of bacteria in the blood stream) of a blood sample. A secondary aim of the study is to describe the functional form of each predictor in the model. Between January 2006 and December 2010, patients with a clinical suspicion of bacteremia were included if blood culture analysis was requested by the responsible physician and blood was sampled for assessment of hematology and biochemistry. An analysis of this study can be found in "A Risk Prediction Model for Screening Bacteremic Patients: A Cross Sectional Study" [(Ratzinger et al, PLoS One 2014)](https://doi.org/10.1371/journal.pone.0106765).
 
-In the original paper describing the study [(Ratzinger et al, PLoS One 2014)](https://doi.org/10.1371/journal.pone.0106765), a machine learning approach was taken to diagnose a positive status of blood culture. The true status was determined for all blood samples by blood culture analysis, which is the gold standard. Here we will make use of a multivariable logistic regression model.
+The data consists of 14,691 observations from different patients and 51 potential predictors. To protect data privacy, our version of this data set was slightly modified compared to the original version, and this modified version was cleared by the Medical University of Vienna for public use (DC 2019-0054). Our results may therefore differ to a negligible degree from the official results given in Ratzinger et al (2014).
 
 
-## Bacteremia dataset contents
-
-### Source dataset 
+### Source dataset
 
 We refer to the source data set as the dataset available in this repository.
 
@@ -59,25 +47,11 @@ As a cross check we display the contents again to ensure the additional data is
 
 ```{r contents_abact, warning=FALSE, message=FALSE, echo=FALSE, results='asis'} 
 
 ## Complete metadata by adding missing labels. 
 ## Generate a derived dataset stored in data as we are adding to the original source dataset obtained. 
 
-bact_subset <- bact
-
-## select candidate predictor variables. -- See SAP
-
-#bact_subset <- 
-#  bact  %>%
-#  dplyr::select(
-#    ID,
-#    AGE,
-#    WBC,
-#    KREA,
-#    BUN,
-#    PLT,
-#    NEU,
-#    BloodCulture
-#  )
 
 labels_list <- bact.dd$Label
 units_list <- bact.dd$Units
@@ -86,7 +60,7 @@ names(labels_list) <- names(units_list) <- bact.dd$Variable
 
 ## Complete metadata by adding missing labels.
 a_bact <- Hmisc::upData(
-  bact_subset,
+  bact,
-  labels = labels_list[names(bact_subset)], units=units_list[names(bact_subset)])
+  labels = labels_list[names(bact)], units = units_list[names(bact)])
 
 ## Derive outcome variable