R Tools for Routine Healthcare Data

Pieta 2024-11-15

NB this is very much work in progress and my own convenience function library

NBB These instructions are some what superseded by the CPRD wrangling GitHub approach still load it all into duckdb, but I have switched to using dbplyr and Lazy Evaluation rather than raw SQL

Motivation

I currently have several CPRD project using Aurum, I will also have a GOLD project soon. There are some common steps that need to be achieved with these projects.

The idea is that if I can import the data into an DuckDB database it will permit the use of SQL to join and extract data for the process of linkage and cleaning. This also means the original text files need only be visited once. Some projects have billions of observation records so managing them in memory in R is not really feasible. DuckDB also has a well documented R api and some very nifty SQL dialect aspects that are worth discovering. Its biggest frustration for me at the moment is that it seems to have a 1024 character limit on SQL statements and many of my SQL statements are more complex that that permits. So I have to do some queries in steps creating temporary tables rather than one big query. I am not sure if investing the time to learn dbplyr might be a neat tidyverse solution to this, but I have been using SQL for almost 40 years and perhaps I getting long in the tooth to learn quite so many new tricks.

NB. DuckDB can take a long time to install, a longer time than you think is reasonable but I have never had it crash on installing so be patient

Caveat

This is work in progress and is an amalgamation of functions I have written over the years and found useful but exist in many disparate packages. This is my attempt to put the really useful ones in one place. There are still some to import, namely the HES APC data ONS data and IMD data import. There are also other functions to be written yet.

There are a suite of matching GOLD data processing functions that as yet have not been imported

Install

It is available at GitHub https://github.com/PietaSchofield/rtrhd

R code

if(!require("remotes")){
  install.packages("remotes")
}
#> Loading required package: remotes
if(!require("tidyverse")){
  install.packages("tidyverse")
}
#> Loading required package: tidyverse
#> ── Attaching core tidyverse packages ──────────────────────────────────────────────────────── tidyverse 2.0.0 ──
#> ✔ dplyr     1.1.4     ✔ readr     2.1.5
#> ✔ forcats   1.0.0     ✔ stringr   1.5.1
#> ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
#> ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
#> ✔ purrr     1.0.2     
#> ── Conflicts ────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
#> ✖ dplyr::filter() masks stats::filter()
#> ✖ dplyr::lag()    masks stats::lag()
#> ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
if(!require("ggplot2")){
  install.packages("ggplot2")
}
if(!require("rtrhd")){
  remotes::install_github("PietaSchofield/rtrhd")
}
#> Loading required package: rtrhd

Setup

I have access to the CPRD Aurum Synthetic Dataset which I will I can share with folk at Liverpool University as we have Multi-Study License. But I cannot distribute with this package. It is what I use in this demo. Unfortunately it only has single files for each table unlike the usual many many observation and drug_issue files. So it doesn’t totally test the functionality of the table load functions.

Paths

It is important the patient data and the lookups are in separate directory trees and while the code recursively searches down the patient data tree for files so they can be in subdirectories by table name (or not) the code identifies files to import by wildcards and having the Patienttype lookup in the same subtree as the Patient file(s) will break the import of patients.

Make the DuckDb

I will use the duckdb package because it is reasonalbly fast for some analytics processing. It is slightly better tuned to this task than the current itteration of SQLite and I am not going to write this to be flexible enough to use a generic DBI SQL connection. I will leave that as a task for the dedicated and just make the source code fully available

R code

dbName <- "Aurum_Sythetic_Data.duckdb"
dbPath <- file.path(Sys.getenv("HOME"),"Projects","rtrhd")
patient_data <- file.path(dbPath,"Aurum_Synthetic_Data","patient_data")
lookup_data <- file.path(dbPath,"Aurum_Synthetic_Data","lookups")
  
synthdataDb <- rtrhd::make_aurum_database(dbName,dbPath,
                 patient_path=file.path(data_path,"patient_data"),
                 lookup_path=file.path(data_path,"lookups"))
#> 4491905 records exist
#> 4535639 records exist
#> 124552 records exist
#> 14 records exist
#> 87679 records exist
#> 111313 records exist
#> 12257 records exist

This function wraps a bunch of functions that load the individual tables

R code

tables <- rtrhd::list_tables(synthdataDb) 
names(tables) <- tables 
records <- lapply(tables, function(tn){
    flds <- paste0(rtrhd::list_fields(dbf=synthdataDb,tab=tn),collapse=", ")
    recs <- rtrhd::get_table(dbf=synthdataDb,
              sqlstr=paste0("SELECT COUNT(*) AS records FROM ",tn,";")) %>% pull(records)
    return(tibble(table=tn,fields=flds,records=recs))}) %>% bind_rows()
records %>% kableExtra::kable()

table	fields	records
aurum_common_dosages	dosageid, dosage_text, daily_dose, dose_number, dose_unit, dose_frequency, dose_interval, choice_of_dose, dose_max_average, change_dose, dose_duration	73068
aurum_conssource	conssourceid, description	88013
aurum_emiscodecat	emiscodecatid, description	48
aurum_gender	genderid, description	4
aurum_jobcat	jobcatid, description	208
aurum_medicaldictionary	medcodeid, term, originalreadcode, cleansedreadcode, snomedctconceptid, snomedctdescriptionid, release, emiscodecategoryid	240486
aurum_numunit	numunitid, description	18436
aurum_obstype	obstypeid, description	11
aurum_orgtype	orgtypeid, description	328
aurum_parentprobrel	parentprobrelid, description	3
aurum_patienttype	patienttypeid, description	32
aurum_probstatus	probstatusid, description	4
aurum_productdictionary	prodcodeid, dmdid, term from emis, productname, formulation, routeofadministration, drugsubstancename, substancestrength, bnfchapter, release	112148
aurum_quantunit	quantunitid, description	97
aurum_refmode	refmodeid, description	7
aurum_refservicetype	refservicetypeid, description	14
aurum_refurgency	refurgencyid, description	5
aurum_region	regionid, description	13
aurum_sign	signid, description	2
consultations	patid, consid, consdate, conssourceid, cprdconstype, consmedcodeid	4535639
drug_issues	patid, prodcodeid, issuedate, dosageid, quantity, quantunitid, duration	87679
observations	patid, consid, parentobsid, probobsid, medcodeid, obsdate, value, numunitid, numrangelow, numrangehigh	4491905
patients	patid, pracid, gender, yob, regstartdate, regenddate, emis_ddate, cprd_ddate	45662
practices	pracid, lcd, uts, region	14
problems	patid, obsid, parentprobobsid, probenddate, expduration, lastrevdate, parentprobrelid, probstatusid, probeddate	111313
referrals	patid, obsid, pracid, refsourceorgid, reftargetorgid, refurgencyid, refservicetypeid, refmodeid	124552
staff	staffid, pracid, jobcatid	12257

It is now possible to run SQL against this database

R code

## START SQL GEN BLOCK
unnamed_sql <- str_c("SELECT DISTINCT
  p.patid,
  COUNT(o.medcodeid) AS observations
FROM
  patients AS p
INNER JOIN
  observations AS o
  ON p.patid=o.patid
GROUP BY
  p.patid")
tmp_sqlfile <- rtrhd::link_sql(unnamed_sql)

SQL code

SELECT DISTINCT
  p.patid,
  COUNT(o.medcodeid) AS observations
FROM
  patients AS p
INNER JOIN
  observations AS o
  ON p.patid=o.patid
GROUP BY
  p.patid

R code

rtrhd::unlink_sql(tmp_sqlfile)
res <- rtrhd::get_table(dbf=synthdataDb,sqlstr=unnamed_sql)
res %>% ggplot(aes(x=observations)) + geom_histogram(binwidth=10,fill="blue",colour="black",alpha=0.7) +
  labs(title="Distribution of Observation Counts Per Patient",
       x="Number of Observations",
       y="Frequency") + theme_minimal()

## END SQL BLOCK

Other Stuff

There are several functions for importing SNOMED CT and DMplusD data available from TRUD. However the current manefestation of the DMplusD import uses R to import the xml files and this is very very slow unless your computer is powerful and a has a lot of memory. It still takes significant time on my 20 core 64GB RAM machine. I have found a quicker way to do this using calls to python scripts but this is not fully implemented yet.

Talking of slow. Duckdb takes a long time to install first time round and when it update it can also be slow but I think it is worth it.

Finally

This has been developed on linux and not install and tested on windows or mac so there is not guarantee it will work. It is actively being developed and while I will probably try to not change functionality it is definitely a moving target and as such I haven’t yet started to version control it. It is currently morphing on a bleeding edge model. If anyone else does actually start using it I will make the effort to change to a release model. But as I am the only user at the moment it hasn’t seemed necessary.

I also haven’t done clean install to check that I have all the dependencies in the Imports list so there maybe some I haven’t caught yet.

By The Way

This is the code that built the vignette and the README markdown for the GitHub repository

R code

homeDir <- Sys.getenv("HOME")
packagepath <- file.path(homeDir,"GitLab","rtrhd")
inputpath <- file.path(packagepath,"vignettes")
inputFile <- file.path(inputpath,"rtrhd_demo.Rmd")
noteDir <- file.path("/srv","http","uol","rtrhd")

fdisp <- rmarkdown::render(inputFile,encoding=encoding,output_dir=noteDir,clean=T)


rmarkdown::render(inputFile, output_format = "github_document", output_file = "README.md",
                  output_dir = packagepath)
browseURL(fdisp)

Name		Name	Last commit message	Last commit date
Latest commit History 73 Commits
R		R
inst		inst
man		man
vignettes		vignettes
.Rbuildignore		.Rbuildignore
.gitignore		.gitignore
CONTRIBUTING.md		CONTRIBUTING.md
DESCRIPTION		DESCRIPTION
LICENSE		LICENSE
NAMESPACE		NAMESPACE
README.md		README.md
meta.yaml		meta.yaml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

R Tools for Routine Healthcare Data

Motivation

Caveat

Install

Setup

Paths

Make the DuckDb

Other Stuff

Finally

By The Way

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

R Tools for Routine Healthcare Data

Motivation

Caveat

Install

Setup

Paths

Make the DuckDb

Other Stuff

Finally

By The Way

About

Topics

Resources

License

Contributing

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages