GIC Toolkit (`gic-tk`)

This is some extra tooling for the PIC-SURE ecosystem being used by the Genomic Information Commons (GIC) project.

Caveat Emptor

Please note this is considered alpha software.

This repository is very much a work in progress. Its realized vision is still unclear. The goal is to make certain aspects of PIC-SURE maintenance and administration easier.

The repository is a testing ground for exploring various ideas, possibly with many breaking changes to come, or even abandonment. In the worst case, it can serve as an example of what not to do :-) .

The work is currently unaffiliated with hms-dbmi, the primary drivers of PIC-SURE and GIC.

Quick Start

Ensure that you have a Java SDK / OpenJDK LTS edition (version 11, 17, or 21) available on your system path.
Download gic-tk, or the jar, from the release page.
After downloading, ensure that the file is executable, and simply invoke:

gic-tk

or

java -jar gic-tools-0.0.1-standalone.jar

The above commands will show you the available subcommands to invoke. You should see something like so:

Usage: gic-tk <subcommand> <options>

     Most subcommands support the options:
       --help   subcommand help
       --debug  enable debug logging mode

     Subcommands:

     irdb: commands to manipulate Intermediate Representation Databases (irdb)
       init     initialize a new irdb
       add      add or update concept data to an existing irdb
       merge    merge multiple target irdbs into a source irdb
       dump     generate a javabin file from an irdb
       inspect  inspect an irdb contents
       help     subcommand help message

     help: this help message
     version: version number information
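
If neither command starts, a quick check of the Java version on the path can help. The following is a minimal sketch, not part of gic-tk: the function names `java_major` and `is_supported_java` are hypothetical, and the parsing assumes the usual `version "..."` banner printed by `java -version`.

```shell
#!/bin/sh
# Extract the major version from a `java -version` banner.
# Handles both the legacy "1.8.0_392" scheme and the modern "17.0.9" scheme.
java_major() {
    # $1: banner text, e.g.: openjdk version "17.0.9" 2023-10-17
    ver=$(printf '%s\n' "$1" | sed -n 's/.*version "\([^"]*\)".*/\1/p' | head -n 1)
    case "$ver" in
        1.*) printf '%s\n' "$ver" | cut -d. -f2 ;;  # legacy: 1.8.x -> 8
        *)   printf '%s\n' "$ver" | cut -d. -f1 ;;  # modern: 17.0.9 -> 17
    esac
}

# Succeed only for the LTS versions named in the Quick Start (11, 17, 21).
is_supported_java() {
    case "$(java_major "$1")" in
        11|17|21) return 0 ;;
        *)        return 1 ;;
    esac
}
```

Since `java -version` writes its banner to standard error, a typical call would be `is_supported_java "$(java -version 2>&1)" && echo "Java OK"`.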

Architecture

gic-tk is designed as a command-line application made up of subcommands, each dealing with a different aspect of PIC-SURE maintenance and administration. Currently, there is one subcommand, `irdb` (see below).

Usage

`irdb` subcommand

"IRDB" is an acronym for "Intermediate Representation Database". The PIC-SURE system loads its phenotypic, or Electronic Health Record (EHR), data via binary "javabin" files on the file system. These javabin files are essentially an organized concatenation of serialized java data structures. The IRDB subcommand aims to make the assembly, creation, and inspection of these PIC-SURE javabin storage files more practical and accessible.

The current PIC-SURE tooling constructs these javabin files by "compiling" a set of source CSV files. Compilation is currently an "all or nothing" process. While this is effective for small data sets, it becomes unwieldy as the phenotypic data grows in both quantity and variety. This group of subcommands provides a way to manage and create the javabin binary files from the source CSV (or parquet) files through a set of intermediate representation (IR) files that are based on DuckDB. These IRDB files can be created in parallel, have data added to them arbitrarily, be merged together to consolidate information, and, of course, be used to generate the associated javabin files.
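
As a rough illustration of the "created in parallel, then merged" workflow, the sketch below builds one irdb per parquet file in background jobs and merges the results once all jobs finish. It is an assumption-laden example rather than a recommended procedure: the function name `build_and_merge` and the `GIC_TK` override variable are hypothetical (the override only exists so the function can be dry-run with `echo`), and it assumes each parquet file name ends in `.parquet`.

```shell
#!/bin/sh
# Sketch: build one irdb per parquet file concurrently, then merge them.
# GIC_TK is a hypothetical override (not a gic-tk feature), e.g. GIC_TK=echo.
build_and_merge() {
    tk=${GIC_TK:-gic-tk}
    main=$1
    shift
    pids=""
    status=0
    for pq in "$@"; do
        db="${pq%.parquet}.duckdb"
        # Each source gets its own irdb, so the jobs do not share a file.
        ( "$tk" irdb init -i "$db" && "$tk" irdb add -i "$pq" -o "$db" ) &
        pids="$pids $!"
    done
    # Wait for every job and remember whether any of them failed.
    for pid in $pids; do
        wait "$pid" || status=1
    done
    [ "$status" -eq 0 ] || return 1
    # Initialize the main irdb, then merge the per-source irdbs into it.
    "$tk" irdb init -i "$main" || return 1
    # Rewrite the argument list from parquet names to irdb names.
    for pq in "$@"; do
        set -- "$@" "${pq%.parquet}.duckdb"
        shift
    done
    "$tk" irdb merge -m "$main" "$@"
}
```

For example, `build_and_merge age-race-merged.duckdb age.parquet race.parquet` would run the two `irdb add` jobs side by side before merging.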

`irdb init`

Create an empty IRDB file.

$ gic-tk irdb help init

Usage: gic-tk irdb init <options> <arguments>
Options:
      --debug                   false Enable additional debug logging
  -i, --dbpath /path/to/irdb.db       The file path to the irdb database [required]

`irdb add`

Add phenotypic/EHR data to an existing IRDB file.

$ gic-tk irdb help add

Usage: gic-tk irdb add <options> <arguments>
Options:
      --debug                                false  Enable additional debug logging
  -i, --input-parquet /path/to/input.parquet        The input data source to add to the irdb database [required]
  -o, --target-irdb   /path/to/irdb.db              The target irdb database to add data into [required]
  -c, --concept       \concept\path                 A specific concept path to add into the irdb from the input parquet data source
  -l, --concepts-list /path/to/concepts.list        A list of concept paths to add into the irdb from the input parquet data source (one concept per line)
      --interval      INTEGER                100000 The interval to display record processing updates

Note: If the target irdb file already contains the concept path of interest, the observations from the source file are appended to the existing concept record (aka cube) in the database.
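
The `--concepts-list` option takes a plain text file with one concept path per line. Because concept paths contain backslashes, it is easy to mangle them when writing the file from a shell. The sketch below shows one way to avoid that; the function name `write_concepts_list` is hypothetical, and the `\ACT Demographics\Race\` path is an assumed example modeled on the `\ACT Demographics\Age\` path used elsewhere in this README.

```shell
#!/bin/sh
# Write a concepts list file: one concept path per line.
# The quoted heredoc delimiter ('EOF') keeps the backslashes literal.
write_concepts_list() {
    # $1: path of the list file to create
    cat > "$1" <<'EOF'
\ACT Demographics\Age\
\ACT Demographics\Race\
EOF
}
```

The resulting file could then be passed along as `gic-tk irdb add -i input.parquet -o irdb.db -l concepts.list`.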

`irdb merge`

Merge multiple IRDBs into a single IRDB.

$ gic-tk irdb help merge

Usage: gic-tk irdb merge <options> /path/to/irdb1.db /path/to/irdb2.db ...
Options:
      --debug                           false Enable additional debug logging
  -m, --main-irdb /path/to/main-irdb.db       The main input irdb database to merge cubes into [required]

Note: If the main irdb file already contains the concept path of interest, its concept record (aka cube) will be overwritten by the concept record contained in the child irdb file(s) being merged. If several child irdbs contain the concept record, the last one specified on the command line "wins".
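
The precedence rule in the note above can be pictured with a small toy model. This is only an illustration of the "last one wins" ordering, not how gic-tk performs the merge internally; the function name `resolve_winners` and the tab-separated input format are assumptions made for the example.

```shell
#!/bin/sh
# Toy model of the merge precedence rule, not the gic-tk implementation.
# Input: lines of "<concept><TAB><irdb file>", main irdb first, then the
# child irdbs in command-line order.
# Output: the irdb whose record survives for each concept, sorted by concept.
resolve_winners() {
    awk -F '\t' '
        { winner[$1] = $2 }    # a later line overwrites an earlier one
        END { for (c in winner) print c "\t" winner[c] }
    ' | sort
}
```

Feeding it the lines `age<TAB>main.db`, `age<TAB>child1.db`, `race<TAB>child1.db`, and `age<TAB>child2.db` reports `child2.db` as the source of `age` and `child1.db` as the source of `race`, which suggests listing the most authoritative child irdb last on the command line.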

`irdb dump`

Dump javabin files from a given input IRDB.

$ gic-tk irdb help dump

Usage: gic-tk irdb dump <options> <arguments>
Options:
      --debug                                   false Enable additional debug logging
  -i, --input-irdb      /path/to/input.parquet        The input data source to add to the irdb database [required]
  -t, --target-dir      /path/to/javabin.store/       The target javabin directory to create and place data into [required]
  -e, --encryption-file /path/to/encryption_key       The encryption key file to secure the observation store files [required]

`irdb inspect`

Inspect the contents of an IRDB file (both metadata and raw observation records).

$ gic-tk irdb help inspect

Usage: gic-tk irdb inspect <options> <arguments>
Options:
      --debug                                   false Enable additional debug logging
  -i, --irdb             /path/to/irdb.db             The irdb database to inspect [required]
  -c, --concept          \concept\path                A specific concept path to add into the irdb from the input parquet data source
  -l, --concepts-list    /path/to/concepts.list       A list of concept paths to add into the irdb from the input parquet data source (one concept per line)
      --show-data                               false Display the raw observation data
      --display-concepts                        false Display the list of concepts paths in the irdb
      --limit            INTEGER                      limit the number of observations to show when displaying raw observation data

Example Use Cases

# initialize irdb databases
gic-tk irdb init -i age.duckdb
gic-tk irdb init -i race.duckdb
gic-tk irdb init -i age-race-merged.duckdb

# generate "age" and "race" irdb databases independently
gic-tk irdb add --input-parquet age.parquet --target-irdb age.duckdb
gic-tk irdb add --input-parquet race.parquet --target-irdb race.duckdb

# merge the "age" and "race" irdb databases into one
gic-tk irdb merge --main-irdb age-race-merged.duckdb age.duckdb race.duckdb

# create the javabin files from the merged irdb database
mkdir -p javabin-out
gic-tk irdb dump -i age-race-merged.duckdb -t javabin-out -e encryption_key

# view the overall summary contents of the merged irdb database
gic-tk irdb inspect --irdb age-race-merged.duckdb

# view the summary contents for just a single concept
gic-tk irdb inspect -i age-race-merged.duckdb -c '\ACT Demographics\Age\'

# view the first 10 observation records for a specific concept
gic-tk irdb inspect -i age-race-merged.duckdb -c '\ACT Demographics\Age\' --show-data --limit 10

# view all the observation records for a specific concept
gic-tk irdb inspect -i age-race-merged.duckdb -c '\ACT Demographics\Age\' --show-data

# view all the concept paths currently in the irdb database
gic-tk irdb inspect -i age-race-merged.duckdb --display-concepts

Development

The toolkit is currently written in Clojure. Please have the following requirements installed and available to tinker with gic-tk:

Java SDK / OpenJDK LTS edition (version 11, 17, or 21) -- see https://adoptium.net if you need to download an SDK
Clojure CLI Tools
GNU or BSD Make

Custom Building PIC-SURE

This code is built upon a forked repository of pic-sure-hpds. See build-pic-sure.sh for more details.

Helpful Commands

Create an uberjar (and executable jar)

make uberjar

To run a command straight from the current source code tree

Assuming you're in the root directory of the repository:

clj -M:cli irdb help

Contribution

This utility is open to contribution. Feel free to open issues or submit PRs.

License

ISC
