CBS Python (Docker)

Conor Wild edited this page Sep 12, 2023 · 12 revisions

Description

Here you will find instructions about installing and using our software toolkit (aka CBS Python) that is contained in a pre-built and (hopefully!) easy-to-use Docker image.

Basically, this Docker image contains a functional Python environment for working with Cambridge Brain Sciences (CBS) data. It includes custom Python packages (and their required dependencies) for performing common preprocessing routines. Consider this Docker image an alternative to having to install and maintain a Python development environment; instead, just spin up a container and make use of all the Python-ic goodness stored inside! You can preprocess your CBS datasets to extract score features, re-organize the data, calculate norms, calculate domain scores, encrypt your data files, and other things. These functions are all accessed using Script Entrypoints - see below for descriptions.

Note, these tools do not include any statistics or analysis routines! You'll have to use your software of choice (R, SPSS, Python, etc.) to read and analyze the data saved by the commands you run here.

Why use Docker for this stuff? Well, it makes it easy to distribute these tools for others to use without having to worry about what kind of computer they're using, or what specific Python version, or package versions, are installed. Better yet, you don't even have to know how to use Python in order to use all these Python scripts! Also, it's easy to distribute updates for these tools without having to worry about dependencies, etc.; you can just tell Docker to pull the latest image when a new one is available. There are lots of reasons! Go forth and Google...

Getting Started

This toolkit was initially designed to pre-process CBS data in a Datalad pipeline to ensure computational reproducibility. See datalad-containers and the Datalad handbook. It's worth reading a bit about DataLad if you have never heard of it...

Alternatively, you can just use the CBS Python image to manually spin up a Docker container and process your data. See the appropriate Instructions depending on your use case.

Either way, you will have to use a command-line terminal (e.g., on MacOS) to use these tools. That's right - no GUI for you! If you're not familiar with a terminal, maybe get started with a tutorial or two.

Docker Installation

  1. Make sure that you have Docker installed. See instructions for MacOS, Windows, Linux, etc.
  2. In your terminal of choice, test your Docker installation:
# Running the following command should display a "Hello from Docker!" message:
docker run hello-world

Accessing the CBS Python Image

Our image is not stored on the DockerHub registry, but rather in a GitHub package associated with this repository. To use this image with Docker on your machine you will need to jump through a few extra hoops.

  1. The image is stored in a private GH package (for now), and to gain read access you will need to be added to TheOwenLab's "collaborators" team. Message me on the Discussion Board or via email with your GitHub account name and your intended use case, and I will grant you access.
  2. Create a Personal Access Token for your GitHub account. [UPDATE: use "classic" token, and not "fine grained" one] This acts like a secondary password for your account that grants restricted permissions. The only "scope" (permission) required for our purposes is read:packages.
  3. In your terminal, use your new token to log the Docker application into the Github container registry:
export GH_PAT=PASTE_YOUR_TOKEN_HERE
echo $GH_PAT | docker login ghcr.io -u YOUR_GITHUB_USERNAME --password-stdin

Note: You are probably going to need to use your token more than once, e.g., if you have to re-pull the Docker image for updates. Good practice would be to add the export GH_PAT=PASTE_YOUR_TOKEN_HERE to your ~/.bashrc, ~/.profile, or equivalent file. That way the environment variable GH_PAT will always be available when you start a new terminal (also, note that GitHub will only show you your token once when it is generated).
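For example, the note above could be applied like this. This sketch uses a scratch file as a stand-in for ~/.bashrc so nothing in your real profile is touched; in practice you would append the line to your actual profile file.

```shell
# Append the token export to a profile file so GH_PAT survives new terminals.
# A temporary file stands in for ~/.bashrc here.
profile=$(mktemp)
echo 'export GH_PAT=PASTE_YOUR_TOKEN_HERE' >> "$profile"

# Sourcing the profile makes the variable available in the current shell:
source "$profile"
echo "$GH_PAT"
```

With the variable persisted, the `docker login ghcr.io` command from step 3 can be re-run at any time without hunting down the token again.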

  4. Pull the image from the GH registry to your local machine:
docker pull ghcr.io/theowenlab/cbspython:latest

That's it! CBS Python is now installed on your machine, and you can use the docker run command (or datalad-containers run) to invoke various magical commands.

Script Entrypoints

The toolkit provided in this Docker image is really just a collection of Python scripts (kinda) that process your input files (e.g., a raw CBS data export in .csv form) and save some output files. Any of the following entrypoints can be executed in your terminal with a docker run ... command. Details are provided below.

cbs_parse_data

This script takes as input a raw CBS .csv data export, parses/extracts all kinds of score feature data, and saves it in a nice wideform format with one row per assessment.
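To illustrate the long-to-wide reshape this step performs (a conceptual sketch only, not the actual implementation; the column and test names here are hypothetical):

```python
# Hypothetical illustration of the long -> wide reshape: raw exports have one
# row per test score, and parsing pivots them to one row per assessment.
long_rows = [
    {"assessment_id": "A1", "test": "spatial_span", "score": 6},
    {"assessment_id": "A1", "test": "grammatical_reasoning", "score": 14},
    {"assessment_id": "A2", "test": "spatial_span", "score": 5},
]

wide = {}
for row in long_rows:
    # One output row per assessment, one column per test score.
    wide.setdefault(row["assessment_id"], {})[row["test"] + "_score"] = row["score"]

print(wide["A1"])  # {'spatial_span_score': 6, 'grammatical_reasoning_score': 14}
```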

cbs_score_calculator

Parses the data like the previous script, but also:

  • generates age/gender matched norms for each participant in your dataset,
  • z-scores all test scores and their features,
  • calculates "domain" scores based on the Varimax-rotated PCA loadings from Hampshire et al. 2012,
  • generates age/gender matched norms for the domain scores for each of your participants,
  • z-scores the domain scores.
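As a rough illustration of the z-scoring step above (a toy sketch only: the real pipeline matches each participant's norm group on age and gender, which this example glosses over with a single norm group):

```python
import statistics

# Hypothetical norm-group scores for one test (in practice, matched on
# age and gender for each participant).
norm_scores = [10, 12, 14, 11, 13]
mu = statistics.mean(norm_scores)
sigma = statistics.stdev(norm_scores)

def z_score(raw):
    """Standardize a raw score against the norm group."""
    return (raw - mu) / sigma

print(round(z_score(15), 2))  # how far above the norm mean, in SD units
```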

cbs_encrypt

Batch encryption of files using Fernet encryption, which is an implementation of symmetric (aka "secret key") authenticated cryptography. Basically, super securely encrypt data files. More information to be added...
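In case it helps to see what Fernet looks like under the hood, here is a minimal sketch using the cryptography package's Fernet API (this shows the underlying primitive, not the cbs_encrypt interface itself):

```python
from cryptography.fernet import Fernet

# Generate a secret key; whoever holds this key can both encrypt and decrypt
# (symmetric cryptography). Keep it safe, and separate from the data.
key = Fernet.generate_key()
f = Fernet(key)

token = f.encrypt(b"participant data")  # authenticated ciphertext
plain = f.decrypt(token)                # raises InvalidToken if tampered with
print(plain)
```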

cbs_decrypt

Batch decrypt files that have been encrypted by cbs_encrypt. More information to be added...

Usage Instructions

Running a CBS Python Script in a Datalad Pipeline

  • To be implemented at a later time (need to test the GHCR implementation....)

Manually Running a CBS Python Script

Each of the scripts provided in this toolkit can be executed in your terminal by using a docker run ... command that looks like this:

docker run --rm -it -v $PWD:/tmp -w /tmp ghcr.io/theowenlab/cbspython:latest ENTRYPOINT_NAME ARG1 ARG2 ETC

Ok, the command above:

  • creates and runs a new container ("docker run"),
  • that will be removed after it is done running ("--rm"),
  • in an interactive mode ("-it"),
  • using the latest cbspython image ("ghcr.io/theowenlab/cbspython:latest"),
  • mounts your current working directory to /tmp ("-v $PWD:/tmp"),
  • sets the container's working directory to that folder ("-w /tmp"),
  • and executes the remaining stuff (i.e., the script name with supplied arguments) within the container's environment.

For example, try looking at the "help" displayed by cbs_parse_data:

docker run --rm -it -v $PWD:/tmp -w /tmp ghcr.io/theowenlab/cbspython:latest cbs_parse_data --help

To run one of our scripts on your CBS data:

  1. Navigate to the folder containing your raw CBS data export (in a .csv form) (e.g., cd ~/Documents/myproject/data/)
  2. Run the desired script in a container:
# Replace ENTRYPOINT_NAME with one of the above script names, and ARGUMENTS with the required arguments.
docker run --rm -it -v $PWD:/tmp -w /tmp ghcr.io/theowenlab/cbspython:latest ENTRYPOINT_NAME ARGUMENTS

Image Details (Components)

The CBS Python Docker image is simply a python3.9-slim base image that installs three custom Python packages (and required dependencies). The various scripts are provided as console entrypoints by the individual packages:

  • Private CBS repository, for now.
  • Provides the cbs_parse_data and cbs_score_calculator console scripts, and other various functionality.
  • Provides the cbs_encrypt and cbs_decrypt console scripts.

Notes for Myself

Build

To build this image, you need to have SSH access to CBS Bitbucket and the Owenlab GIN Server. That is, you must have an account on each hosting site with your public SSH key added to the profile in each account.

# From the cbs-sci-containers root directory
ssh-add ~/.ssh/id_rsa
datalad run -m "Rebuilding cbspython Docker image" -o cbspython/image/ "cd cbspython && ./build.sh"

Debug

Run a bash shell: docker run --rm -it --entrypoint /bin/bash cbspython:latest

Test

$ datalad containers-run -m "Testing CBSPython parsing" -n cbspython -i test/test_data.csv "cbs_parse_data {inputs} test_data_out -o stdout --user-tfm 'lambda x: x[:x.index(\"@\")]'"
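The --user-tfm argument above takes a small Python lambda that is applied to each user identifier; the one shown trims an email-style ID at the "@":

```python
# The same transform passed via --user-tfm in the test command above.
user_tfm = lambda x: x[:x.index("@")]

print(user_tfm("participant01@example.com"))  # participant01
```

Note that str.index raises a ValueError if an ID contains no "@", so this particular transform assumes all user IDs in the test file are email addresses.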

Reset

git reset --hard # removes staged and working directory changes
git clean -f -d # remove untracked