This is some extra tooling for the PIC-SURE ecosystem being used by the Genomic Information Commons (GIC) project.
Please note this is considered alpha software.
This repository is very much a work in progress. Its realized vision is still unclear. The goal is to make certain aspects of PIC-SURE maintenance and administration easier.
The repository is a testing ground for exploring various ideas, possibly with many breaking changes to come, or abandonment. In the worst case, it can serve as an example of what not to do :-) .
The work is currently unaffiliated with hms-dbmi, the primary drivers of PIC-SURE and GIC.
- Ensure that you have Java SDK / OpenJDK LTS edition (either version 11, 17 or 21) available on your system path.
- Download gic-tk, or the jar, from the release page.
- After downloading, ensure that it's an executable file, and simply invoke
gic-tk
or
java -jar gic-tools-0.0.1-standalone.jar
The above commands will show you the available subcommands to invoke. You should see something like this:
Usage: gic-tk <subcommand> <options>
Most subcommands support the options:
--help subcommand help
--debug enable debug logging mode
Subcommands:
irdb: commands to manipulate Intermediate Representation Databases (irdb)
init initialize a new irdb
add add or update concept data to an existing irdb
merge merge multiple target irdbs into a source irdb
dump generate a javabin file from an irdb
inspect inspect an irdb contents
help subcommand help message
help: this help message
version: version number information
gic-tk is designed as a command-line application made up of subcommands that each handle a particular aspect of PIC-SURE maintenance and administration. Currently, there is one subcommand, irdb (see below).
"IRDB" is an acronym for "Intermediate Representation Database". The PIC-SURE system loads its phenotypic, or Electronic Health Record (EHR), data via binary "javabin" files on the file system. These javabin files are essentially an organized concatenation of serialized java data structures. The IRDB subcommand aims to make the assembly, creation, and inspection of these PIC-SURE javabin storage files more practical and accessible.
The current PIC-SURE tooling constructs these javabin files by "compiling" a set of source CSV files. This compilation is currently an "all or nothing" process. While this is effective for small data sets, it becomes unwieldy as the phenotypic data grows in both quantity and variety. This group of subcommands provides a way to create and manage the javabin binary files from the source CSV (or parquet) files through a set of intermediate representation (IR) files that are based on DuckDB. These IRDB files can be created in parallel, have data added to them arbitrarily, be merged together to consolidate information, and, of course, be used to generate the associated javabin files.
Create an empty IRDB file.
$ gic-tk irdb help init
Usage: gic-tk irdb init <options> <arguments>
Options:
--debug false Enable additional debug logging
-i, --dbpath /path/to/irdb.db The file path to the irdb database [required]
Add phenotypic/EHR data to an existing IRDB file.
$ gic-tk irdb help add
Usage: gic-tk irdb add <options> <arguments>
Options:
--debug false Enable additional debug logging
-i, --input-parquet /path/to/input.parquet The input data source to add to the irdb database [required]
-o, --target-irdb /path/to/irdb.db The target irdb database to add data into [required]
-c, --concept \concept\path A specific concept path to add into the irdb from the input parquet data source
-l, --concepts-list /path/to/concepts.list A list of concept paths to add into the irdb from the input parquet data source (one concept per line)
--interval INTEGER 100000 The interval to display record processing updates
Note: If the target irdb file already contains the concept path of interest, the observations from the input source file are appended to the existing concept record (aka cube) in the database.
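As a minimal sketch of the --concepts-list option described above, the following writes a list file with one concept path per line and passes it to irdb add. The file names (concepts.list, demographics.parquet, demographics.duckdb) and the Race concept path are illustrative assumptions, not names taken from this project; only the flags come from the help text. The guard around the gic-tk call is there only so the snippet can run on a machine where gic-tk is not installed.

```shell
# Write the concept paths of interest, one per line.
# The quoted 'EOF' keeps the backslashes in the paths literal.
cat > concepts.list <<'EOF'
\ACT Demographics\Age\
\ACT Demographics\Race\
EOF

# Add only the listed concepts from the parquet source into the target irdb.
# Assumes demographics.duckdb was already created with `gic-tk irdb init`.
if command -v gic-tk >/dev/null 2>&1; then
  gic-tk irdb add \
    --input-parquet demographics.parquet \
    --target-irdb demographics.duckdb \
    --concepts-list concepts.list
else
  echo "gic-tk not found on PATH; skipping the add step" >&2
fi
```

For a single concept, the -c / --concept option takes the path directly instead of a list file.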
Merge multiple IRDBs into a single IRDB.
$ gic-tk irdb help merge
Usage: gic-tk irdb merge <options> /path/to/irdb1.db /path/to/irdb2.db ...
Options:
--debug false Enable additional debug logging
-m, --main-irdb /path/to/main-irdb.db The main input irdb database to merge cubes into [required]
Note: If the main irdb file already contains the concept path of interest, its concept record (aka cube) will be overwritten by the concept record contained in the child irdb file(s) being merged. If several child irdbs contain the same concept record, the last one specified on the command line "wins".
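The overwrite rule in the note above can be pictured with a toy model. This is not how gic-tk is implemented; it only mimics the documented ordering. Each "irdb" is stood in for by a hypothetical text file of concept=origin lines, and the merge_last_wins helper is invented for this illustration.

```shell
# Toy model: later files override earlier ones for the same concept key.
merge_last_wins() {
  awk -F= '{ cube[$1] = $2 } END { for (c in cube) print c "=" cube[c] }' "$@" | sort
}

workdir=$(mktemp -d)
printf '%s\n' 'Age=main' 'Race=main'   > "$workdir/main.txt"    # stands in for the main irdb
printf '%s\n' 'Age=child1'             > "$workdir/child1.txt"  # first child irdb
printf '%s\n' 'Age=child2' 'Sex=child2' > "$workdir/child2.txt" # last child irdb

# Same argument order as: gic-tk irdb merge --main-irdb main child1 child2
merged=$(merge_last_wins "$workdir/main.txt" "$workdir/child1.txt" "$workdir/child2.txt")
printf '%s\n' "$merged"
# prints:
# Age=child2
# Race=main
# Sex=child2

rm -rf "$workdir"
```

In this model, Age ends up with the record from the last child listed, Race keeps the main record because no child supplies it, and Sex is added as a new concept.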
Dump javabin files from a given input IRDB.
$ gic-tk irdb help dump
Usage: gic-tk irdb dump <options> <arguments>
Options:
--debug false Enable additional debug logging
-i, --input-irdb /path/to/input.parquet The input data source to add to the irdb database [required]
-t, --target-dir /path/to/javabin.store/ The target javabin directory to create and place data into [required]
-e, --encryption-file /path/to/encryption_key The encryption key file to secure the observation store files [required]
Inspect the contents of an IRDB file (both metadata and raw observation records).
$ gic-tk irdb help inspect
Usage: gic-tk irdb inspect <options> <arguments>
Options:
--debug false Enable additional debug logging
-i, --irdb /path/to/irdb.db The irdb database to inspect [required]
-c, --concept \concept\path A specific concept path to add into the irdb from the input parquet data source
-l, --concepts-list /path/to/concepts.list A list of concept paths to add into the irdb from the input parquet data source (one concept per line)
--show-data false Display the raw observation data
--display-concepts false Display the list of concepts paths in the irdb
--limit INTEGER limit the number of observations to show when displaying raw observation data
# initialize irdb databases
gic-tk irdb init -i age.duckdb
gic-tk irdb init -i race.duckdb
gic-tk irdb init -i age-race-merged.duckdb
# generate "age" and "race" irdb databases independently
gic-tk irdb add --input-parquet age.parquet --target-irdb age.duckdb
gic-tk irdb add --input-parquet race.parquet --target-irdb race.duckdb
# merge the "age" and "race" irdb databases into one
gic-tk irdb merge --main-irdb age-race-merged.duckdb age.duckdb race.duckdb
# create the javabin files from the merged irdb database
mkdir -p javabin-out
gic-tk irdb dump -i age-race-merged.duckdb -t javabin-out -e encryption_key
# view the overall summary contents of the merged irdb database
gic-tk irdb inspect --irdb age-race-merged.duckdb
# view the summary contents for just a single concept
gic-tk irdb inspect -i age-race-merged.duckdb -c '\ACT Demographics\Age\'
# view the first 10 observation records for a specific concept
gic-tk irdb inspect -i age-race-merged.duckdb -c '\ACT Demographics\Age\' --show-data --limit 10
# view all the observation records for a specific concept
gic-tk irdb inspect -i age-race-merged.duckdb -c '\ACT Demographics\Age\' --show-data
# view all the concept paths currently in the irdb database
gic-tk irdb inspect -i age-race-merged.duckdb --display-concepts
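Since IRDB files are described as something that can be created in parallel, the age and race steps above could be run as background jobs before the merge. The following is a sketch under assumptions, not a tested recipe: the run and build_one helpers and the DRY_RUN variable are hypothetical additions for this example, and only the gic-tk commands and flags are taken from the examples above. By default the script only prints the commands; set DRY_RUN=0 to execute them.

```shell
set -eu

# Hypothetical helper: print the command in dry-run mode, otherwise execute it.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Build one irdb from a parquet file with the same base name.
build_one() {
  name=$1
  run gic-tk irdb init -i "$name.duckdb"
  run gic-tk irdb add --input-parquet "$name.parquet" --target-irdb "$name.duckdb"
}

# Start one background job per data set and remember the process ids.
pids=""
for name in age race; do
  build_one "$name" &
  pids="$pids $!"
done

# Wait for each job individually so that a failed build stops the script.
for pid in $pids; do
  wait "$pid" || { echo "irdb build failed (pid $pid)" >&2; exit 1; }
done

# Merge only after every per-data-set irdb has been built.
run gic-tk irdb init -i age-race-merged.duckdb
run gic-tk irdb merge --main-irdb age-race-merged.duckdb age.duckdb race.duckdb
```

Each job writes to its own irdb file, so the jobs do not share a database until the merge step.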
The toolkit is currently written in Clojure. To tinker with gic-tk, please have the following requirements installed and available:
- Java SDK / OpenJDK LTS edition (either version 11, 17 or 21) -- see https://adoptium.net if you need to download an SDK
- Clojure CLI Tools
- GNU or BSD Make
This code is built upon a forked repository of pic-sure-hpds. See build-pic-sure.sh for more details.
make uberjar
Assuming you're in the root directory of the repository:
clj -M:cli irdb help
This utility is open to contribution. Feel free to open issues or submit PRs.
ISC