CAPTURE
A framework and command line interface (CLI) for computational science.
Table of Contents
- Installation
- CLI usage
- Job helper functions
- Verification helper functions
- Environment helper functions
- Contributions
curl -sSL https://raw.githubusercontent.com/lasseignelab/capture/refs/heads/main/install.sh | bash
source ~/.bash_profile
cap update
The cap CLI provides commands to help with reproducible research.
cap <command> params...
Displays CAPTURE environment variables. This command must be executed from the project root directory.
Usage:
cap env
Options:
-e,--environment
Specifies the environment to show variables for.
Example:
$ cap env
CAP_ENV_PATH=/data/user/acrumley/3xtg-repurposing/bin/env
CAP_CONTAINER_PATH=/data/user/acrumley/3xtg-repurposing/bin/container
CAP_DATA_PATH=/data/user/acrumley/3xtg-repurposing/data
CAP_ENVIRONMENT=default
CAP_LOGS_PATH=/data/user/acrumley/3xtg-repurposing/logs
CAP_PROJECT_NAME=3xtg-repurposing
CAP_PROJECT_PATH=/data/user/acrumley/3xtg-repurposing
CAP_RANDOM_SEED=16600
CAP_RESULTS_PATH=/data/user/acrumley/3xtg-repurposing/results
CAP_VERIFICATIONS_PATH=/data/user/acrumley/3xtg-repurposing/verifications
Shows help for the cap command line tool.
Usage:
cap help [COMMAND]
Example:
$ cap help
Usage: cap COMMAND ...
Commands:
The following subcommands are available.
COMMAND
env Displays CAPTURE environment variables.
help Shows help for the cap command line tool.
md5 Calculates a combined MD5 checksum for one or more files.
new Creates a new reproducible research project.
run Runs a CAPTURE framework job.
update Updates the CAPTURE framework to the latest version.
version Displays the currently installed version of CAPTURE.
$ cap help md5
Calculates a combined MD5 checksum for one or more files.
The "md5" command produces a combined MD5 checksum for all the files
specified. It will show a list of all files included to ensure that the
result is as expected.
Usage:
cap md5 FILE...
FILE... can be one or more file and/or directory specifications.
Example:
$ cap md5 *
Files included:
43bd364a97a38fb1da7c57e6381886c1 capture/LICENSE
b794df25f796ac80680c0e4d27308bce capture/commands/md5.sh
0d9281c3586c420130bcb5d25c8a151a capture/lab
5e79c988140af1b7bd5735b0bf96306b capture/README.md
783a44ffae97afbce3f1649c5ff517a5 capture/install.sh
Combined MD5 checksum:
a225199964b84bdeef33bafe3df7c10b
The md5 command produces an MD5 checksum for each file specified and a
combined MD5 checksum for all the files. The purpose of this command is to
determine whether files downloaded or created are complete and accurate. If
the MD5 checksums from two sets of files match then the files are all the same.
Usage:
cap md5 [options] FILE...
FILE... One or more file and/or directory names or patterns. For directories,
all files in the directory and its subdirectories will be included.
Options:
--append
Append to the output file if it already exists.
-n,--dry-run
Lists the files that will have md5sums calculated in order to
verify the expected files are included. This is helpful when
the files are large and take a long time to process.
--ignore=PATTERN
Exclude files matching the file PATTERN based on the full relative
path. If the option is specified multiple times, all files matching
any of the patterns will be EXCLUDED (logical OR). The selector will
generally have wildcards. Ensure patterns are quoted ("*pattern*") to
prevent unintended shell expansion.
-o,--output=FILE
Specify an output file name to write the results to. See examples for
the output format.
--output-files-only
Output only the file names with their md5sum. This facilitates
programmatic verification of files.
--normalize
Normalizes the output file paths so that files in different root
directories can be easily compared.
--select=PATTERN
Include only files matching the file PATTERN based on the full relative
path. If the option is specified multiple times, all files matching
any of the patterns will be INCLUDED (logical OR). The selector will
generally have wildcards. Ensure patterns are quoted ("*pattern*") to
prevent unintended shell expansion.
-s,--slurm=[batch|run]
Runs the md5 command as a Slurm job. If the value is run then
srun is used and the output stays connected to the current
terminal session. If the value is batch then sbatch is used and
the output is written to cap-md5-<job_id>.out unless the -o or --output
option is specified.
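The path matching used by --select and --ignore can be sketched with ordinary shell glob patterns. This is illustrative only, not necessarily how cap implements it:

```shell
# Sketch: matching a full relative path against a glob such as "*/outs/*".
# In a shell `case` pattern, `*` also matches `/`, so one pattern can
# span directory boundaries.
matches() {  # usage: matches PATTERN PATH
  case "$2" in
    $1) return 0 ;;
    *)  return 1 ;;
  esac
}
matches "*/outs/*" "files/outs/four.bin" && echo "selected"
matches "*/outs/*" "files/one.bin" || echo "excluded"
```

Quoting the pattern when calling the real command matters for the same reason: an unquoted `*/outs/*` would be expanded by the shell before cap ever sees it.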
Examples:
Calculate md5 sums for all files in a directory and its subdirectories.
cap md5 files/*
Files included:
b3ac2b8b9998bf504ef708ec837a4cce files/one.bin
8d62064673ecb2a440b8802a2f752e8a files/outs/four.bin
74a08ee2de381ec8e19da52ad36bb5ae files/outs/three.bin
009c79f013fe8d4d97c95bf5ceea68ed files/two.bin
Combined MD5 checksum:
1060bcc0958e5cc774f84ccd24a3b010
Calculate md5 sums for files in the subdirectory named "outs".
cap md5 --select "*/outs/*" files/*
Files included:
8d62064673ecb2a440b8802a2f752e8a files/outs/four.bin
74a08ee2de381ec8e19da52ad36bb5ae files/outs/three.bin
Combined MD5 checksum:
feaaf18494b99f6570ab6e4730f9e4af
Calculate md5 sums for files not in the subdirectory named "outs".
cap md5 --ignore "*/outs/*" files/*
Files included:
b3ac2b8b9998bf504ef708ec837a4cce files/one.bin
009c79f013fe8d4d97c95bf5ceea68ed files/two.bin
Combined MD5 checksum:
c6f882353ed4c63582276bdd49974a86
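Conceptually, the combined checksum above is a digest computed over the per-file digests. A minimal sketch with standard coreutils follows; the exact combination scheme cap md5 uses internally may differ:

```shell
# Illustrative only: one way to derive a combined digest from per-file md5sums.
# (The exact scheme `cap md5` uses may differ.)
printf 'alpha\n' > one.bin
printf 'beta\n'  > two.bin
md5sum one.bin two.bin | sort -k 2 > sums.txt       # per-file checksums, stable order
cut -d ' ' -f 1 sums.txt | md5sum | cut -d ' ' -f 1  # combined checksum
```

Because the per-file list is sorted before combining, the result is deterministic for the same file contents regardless of argument order.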
The cap new command will create a new research project based on the
project-template submodule in the capture repository. The project
repository will be created with the origin remote pointed to a GitHub
repository determined by the owner and project name parameters.
Usage:
cap new [options] PROJECT_NAME
PROJECT_NAME Name of the project which will be used for the directory name.
It should also match the git host repo name if one is used.
Options:
--git-host=<host-domain-name>
Git host for the repository used for creating git remotes. The
default is "github.com".
-o,--owner=<owner-id>
Git host owner the project repo will be created under. This may
be a personal or organization account.
--skip-git
Skip making the project a git repository in order to allow
the use of other source control software.
Example:
$ cap new --owner lasseignelab PKD_Research
Create an empty repository for 'PKD_Research' on GitHub by using the
following link and settings:
https://github.com/organizations/lasseignelab/repositories/new
* No template
* Owner: lasseignelab
* Repository name: PKD_Research
* Private
* No README file
* No .gitignore
* No license
Were you able to create a repository (y/N)? y
Cloning into 'PKD_Research'...
done.
...
Happy researching!!!
The cap run command runs a CAPTURE framework job within the context of a
reproducible research project. It will configure the environment based
on configuration defined by the current user. By default, the job runs in
the current terminal session. This command must be executed from the project
root directory.
Usage:
cap run [options] FILE
FILE File name of the job to run.
Options:
-e,--environment
Specifies the environment to run jobs in. Environments allow
different setups for a pipeline. For instance, a pipeline may
use internal copies of data during development but download that
data when the pipeline is run in a different environment.
-n,--dry-run
Displays the contents of the job to run along with the context
it will run in.
-s,--slurm=[batch|run]
Runs the script as a Slurm job. If the value is run then
srun is used and the output stays connected to the current
terminal session. If the value is batch then sbatch is used and
the output is written to the log file in the logs directory.
Example:
$ cap run src/01_download.sh
CAPTURE environment: default
View job output with the following command:
cat logs/01_down_20241118_090854_tcrumley*
Submitted batch job 29818073
The runtime environment is configured with the following variables, which are available to Slurm scripts.
- CAP_CONTAINER_PATH: Path to where container files such as Docker images will be maintained. Defaults to <project-path>/bin/container.
- CAP_DATA_PATH: Path to where data files will be written. Defaults to <project-path>/data.
- CAP_ENVIRONMENT: The name of the current execution environment. Defaults to the value "default". A shell script in config/environments with a name matching the environment name will be executed during the CAPTURE configuration process, e.g. config/environments/default.sh. This variable will generally be set in the ~/.caprc file. It is possible to set it as a shell environment variable somewhere like ~/.bash_profile. Another option is to provide it before a command, e.g. CAP_ENVIRONMENT=mylab cap run foo.sh. Finally, some commands provide an environment option, such as cap run --environment=mylab foo.sh.
- CAP_ENV_PATH: Path to where conda and other runtime environment files will be maintained. Defaults to <project-path>/bin/env.
- CAP_LOGS_PATH: Path to where log files will be written. Defaults to <project-path>/logs.
- CAP_PROJECT_NAME: The name of the project given with the cap new command.
- CAP_PROJECT_PATH: Path to the root directory of the project.
- CAP_RANDOM_SEED: A randomly generated seed to facilitate reproducible random number generation.
- CAP_RESULTS_PATH: Path to where analysis results will be written. Defaults to <project-path>/results.
- CAP_VERIFICATIONS_PATH: Path to where verification files and the result files they produce are written. Defaults to <project-path>/verifications.
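A job script run by cap run can consume these variables directly. A minimal sketch; the := fallbacks (including the seed value) are only so the snippet runs standalone outside the framework:

```shell
#!/bin/sh
# Sketch: a job script consuming CAP_* variables exported by `cap run`.
# The := fallbacks below are only for running this sketch outside the framework.
: "${CAP_RESULTS_PATH:=./results}"
: "${CAP_RANDOM_SEED:=16600}"

mkdir -p "$CAP_RESULTS_PATH"
# Record the seed so a reviewer can reproduce any randomized step.
echo "seed=$CAP_RANDOM_SEED" > "$CAP_RESULTS_PATH/run_info.txt"
```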
Environment variables can be configured with the following configuration files.
/
`-- etc/
    `-- caprc
~/
`-- .caprc
<project-path>/
|-- .caprc
`-- config/
    |-- pipeline.sh
    `-- environments/
        |-- default.sh
        `-- <lab-name>.sh
Configuration files are loaded in the following order:
- <project-path>/config/pipeline.sh: Configuration to bootstrap the runtime environment. This file is configured by the cap new command with the CAP_PROJECT_NAME variable set to the name given as a parameter.
- defaults: The defaults described in the environment variable section are set at this point.
- /etc/caprc: Configuration set by an organization.
- ~/.caprc: Configuration set for a specific user. This is a good place to source in lab-specific configuration.
- <project-path>/.caprc: Configuration specific to a project.
- <project-path>/config/environments/<CAP_ENVIRONMENT>.sh: Configuration specific to a project and the environment it is being executed in. The default.sh configuration should only contain reproducible configuration that will work in any Slurm environment. Other lab-specific environment files can contain non-reproducible configuration, but the job must also work in the default environment for reproducibility. An example of environment-specific configuration would be creating symlinks in the data directory for sharing large datasets internal to a lab while also downloading the data when the symlink does not exist. See cap_data_link.
The cap update command will upgrade the CAPTURE framework to the latest
version.
Usage:
cap update
Example:
$ cap update
Switched to branch 'main'
Already up-to-date.
CAPTURE updated to version v0.0.1.
The verify command runs CAPTURE verifications which are shell scripts that
determine whether outputs are reproducible. The output of verification scripts
will be written to the verifications folder with the same name as the script
with a ".out" extension. These files should be committed to source control so
that reviewers can compare their results. This command must be executed from
the project root directory.
See also verification helper functions.
Environment variables (useful for custom verifications):
CAP_VERIFICATION_DRY_RUN: Boolean value ("true" or "false") indicating whether the current verification is a dry run.
CAP_VERIFICATION_OUTPUT_FILE: The file to which verification output is appended.
Usage:
cap verify [options] FILE
FILE The single verification script to run.
Options:
-n,--dry-run
Lists the files that will have verifications performed in order to
verify the expected files are included. This is helpful when
the files are large and take a long time to process.
-s,--slurm=[batch|run]
Runs the verify command as a Slurm job with sbatch or srun.
Example:
Perform verifications for a step in the pipeline which will produce an
output file named verifications/01_download.out.
cap verify verifications/01_download.sh
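A custom verification script can honor the environment variables above. A minimal sketch; the stand-in data file and := fallbacks are only so the example runs on its own:

```shell
#!/bin/sh
# Sketch: a custom verification that honors the dry-run flag.
# `cap verify` sets these variables; the := fallbacks are for standalone runs.
: "${CAP_VERIFICATION_DRY_RUN:=false}"
: "${CAP_VERIFICATION_OUTPUT_FILE:=verify_example.out}"

printf 'a\nb\nc\n' > counts.txt            # stand-in data file

if [ "$CAP_VERIFICATION_DRY_RUN" = "true" ]; then
  echo "would verify: counts.txt"          # list work without doing it
else
  echo "rows=$(wc -l < counts.txt)" >> "$CAP_VERIFICATION_OUTPUT_FILE"
fi
```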
The cap version command will display the currently installed version
of CAPTURE.
Usage:
cap version
Example:
$ cap version
v0.0.3
Retrieves a value from an array file based on a zero-based index.
cap_array_value FILE [INDEX]
FILE   The file containing an array value on each line.
INDEX  The optional zero-based index for the value of the array.
If a value is not provided for INDEX then the SLURM_ARRAY_TASK_ID
environment variable will be used as the default.
Example that retrieves array values based on the Slurm environment variable default index.
sample=$(cap_array_value "$CAP_DATA_PATH/sample_list.array")
Example with a for loop:
for index in {0..9}; do
  sample=$(cap_array_value "$CAP_DATA_PATH/sample_list.array" "$index")
  # Do something with each sample value.
done
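The lookup itself is equivalent to reading the (index + 1)-th line of the file. A sketch with awk, illustrative rather than the framework's actual implementation:

```shell
# Sketch: zero-based lookup of one value per line, as cap_array_value does.
array_value() {  # usage: array_value FILE [INDEX]; INDEX defaults to SLURM_ARRAY_TASK_ID
  awk -v i="${2:-$SLURM_ARRAY_TASK_ID}" 'NR == i + 1 { print; exit }' "$1"
}

printf 'sampleA\nsampleB\nsampleC\n' > sample_list.array
array_value sample_list.array 1   # prints sampleB
```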
Downloads data into the data directory.
cap_data_download [options] URL
URL  The URL of the file to download.
Options:
--source-file-name
The name of the file being downloaded. When the source file URL does not
end in a proper file name, this option allows a name to be provided. Final
downloaded file and/or directory names may be different if the --unzip
option is used.
--md5sum
The md5sum to check against the file being downloaded.
--unzip
Unzips and/or unarchives downloaded files.
--subdirectory
Specifies a subdirectory within the data directory where the downloaded
file will be stored. If the subdirectory does not exist, it will be created.
The file will be downloaded to the file name specified by the URL or the
--source-file-name option. If the --unzip option is provided then it will
be unarchived into the data directory and possibly have a different final name.
The data directory is specified by CAP_DATA_PATH which defaults to
CAP_PROJECT_PATH/data. If the --subdirectory option is provided, the
downloaded file will be saved in CAP_PROJECT_PATH/data/subdirectory.
If the file or directory already exists in the data directory (or subdirectory
if --subdirectory is provided) then it will not be downloaded again. This is
also true when the file or directory has been symlinked into the data directory
by cap_data_link.
The following example will download and unarchive a directory into
CAP_DATA_PATH/refdata-gex-GRCm39-2024-A.
cap_data_download \
--unzip \
--md5sum="37c51137ccaeabd4d151f80dc86ce0b3" \
"https://cf.10xgenomics.com/supp/cell-exp/refdata-gex-GRCm39-2024-A.tar.gz"
The following example will download and unarchive a directory into
CAP_DATA_PATH/reference/refdata-gex-GRCm39-2024-A.
cap_data_download \
--unzip \
--subdirectory "reference" \
--md5sum="37c51137ccaeabd4d151f80dc86ce0b3" \
"https://cf.10xgenomics.com/supp/cell-exp/refdata-gex-GRCm39-2024-A.tar.gz"
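The skip-if-present behavior can be sketched as follows. This is illustrative only; the URLs are placeholders, and the real command also handles checksum verification and unarchiving:

```shell
# Sketch: the presence check cap_data_download performs before fetching.
# A symlink created by cap_data_link also counts as "already present".
DATA="${CAP_DATA_PATH:-./data}"
mkdir -p "$DATA"

needs_download() {  # usage: needs_download URL
  target="$DATA/$(basename "$1")"
  [ ! -e "$target" ] && [ ! -L "$target" ]
}

touch "$DATA/ref.tar.gz"    # pretend a prior run already downloaded this
needs_download "https://example.com/ref.tar.gz" || echo "skipping ref.tar.gz"
needs_download "https://example.com/other.tar.gz" && echo "would download other.tar.gz"
```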
Downloads the proper Docker or Singularity container.
cap_container [options] REFERENCE
REFERENCE  The Docker image reference found on DockerHub. The format of the reference is <owner>/<repository_name>[:tag].
Options:
-c singularity
If specified, cap_container will use singularity pull instead of
docker pull. If CAP_CONTAINER_TYPE is specified in a caprc file then
the -c option is not necessary. CAP_CONTAINER_TYPE is the preferred method.
cap_container first checks whether the Docker image or Singularity .sif file
already exists in CAP_CONTAINER_PATH. If the image is not found, it is downloaded
from DockerHub. By default, cap_container uses Docker, but specifying the
-c singularity option or CAP_CONTAINER_TYPE=singularity in a caprc
directs it to generate a Singularity .sif file in the CAP_CONTAINER_PATH
directory instead.
The following example checks for the corresponding .sif file in CAP_CONTAINER_PATH.
If the file is not found, it downloads and converts the Docker image into the
Singularity .sif file - ollama_0.5.8.sif.
cap_container \
-c singularity \
"ollama/ollama:0.5.8"
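The cached file name in the example (ollama_0.5.8.sif) appears to follow the usual name_tag.sif convention for images pulled from docker:// references. A sketch of that derivation, inferred from the example rather than from the framework's source:

```shell
# Sketch: deriving the cached .sif name from a Docker reference
# (naming inferred from the ollama example above; the framework may differ).
sif_name() {  # e.g. ollama/ollama:0.5.8 -> ollama_0.5.8.sif
  ref=${1##*/}                            # strip the owner: ollama:0.5.8
  printf '%s_%s.sif\n' "${ref%%:*}" "${ref##*:}"
}

sif_name "ollama/ollama:0.5.8"            # prints ollama_0.5.8.sif
```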
Functions to facilitate verifying that pipeline results are reproducible.
Verification scripts are stored in the verifications directory in the
project root directory and should be committed to the code repository.
The output file will be given the same name as the verification file with a
.out extension and will be stored in the same directory. The output file
should also be committed to the code repository. When reproducing results, use
the git diff command to confirm that results of a verification match the
original results.
The cap_verify_append function appends text to the verification's .out
file. The purpose of this command is to facilitate custom verifications and to
add comments between groupings of verification output.
cap_verify_append TEXT
- TEXT Text that will be appended directly to the end of the .out file.
- Verify files with comments.
cap_verify_append "##### Mouse data #####"
cap_verify_md5 "data/mouse/*"
cap_verify_append "##### Human data #####"
cap_verify_md5 "data/human/*"
Results in verifications/verify_example.out:
##### Mouse data #####
b3ac2b8b9998bf504ef708ec837a4cce data/mouse/one.bin
8d62064673ecb2a440b8802a2f752e8a data/mouse/outs/four.bin
74a08ee2de381ec8e19da52ad36bb5ae data/mouse/outs/three.bin
009c79f013fe8d4d97c95bf5ceea68ed data/mouse/two.bin
##### Human data #####
b3ac2b8b9998bf504ef708ec837a4cc1 data/human/one.bin
8d62064673ecb2a440b8802a2f752e82 data/human/outs/four.bin
74a08ee2de381ec8e19da52ad36bb5a5 data/human/outs/three.bin
009c79f013fe8d4d97c95bf5ceea68e8 data/human/two.bin
- Custom verification from a Python script.
The environment variable CAP_VERIFICATION_DRY_RUN can be used to add dry-run functionality to custom verification scripts; it will be equal to "true" during a dry run.
cap_verify_append "$(python3 $CAP_VERIFICATIONS_PATH/verify_example.py)"
Results in verifications/verify_example.out:
Cell Count: 1000
Gene Count: 5000
The cap_verify_md5 function produces an MD5 checksum for each file specified,
storing the results in an output file to be checked into the repository for
verifying future reproducibility. The purpose of this command is to determine
whether files downloaded or created are complete and accurate when reproduced.
If the MD5 checksums from two sets of files match then the files are all the
same.
cap_verify_md5 [options] FILE...
- FILE... One or more file and/or directory names or patterns. For directories, all files in the directory and its subdirectories will be included.
Options:
- --ignore=PATTERN
Exclude files matching the file PATTERN based on the full relative path. If the option is specified multiple times, all files matching any of the patterns will be EXCLUDED (logical OR). The selector will generally have wildcards. Ensure patterns are quoted ("*pattern*") to prevent unintended shell expansion.
- --select=PATTERN
Include only files matching the file PATTERN based on the full relative path. If the option is specified multiple times, all files matching any of the patterns will be INCLUDED (logical OR). The selector will generally have wildcards. Ensure patterns are quoted ("*pattern*") to prevent unintended shell expansion.
- Verify all files in a directory and its subdirectories.
cap_verify_md5 "data/*"
Results in verifications/verify_example.out:
b3ac2b8b9998bf504ef708ec837a4cce data/one.bin
8d62064673ecb2a440b8802a2f752e8a data/outs/four.bin
74a08ee2de381ec8e19da52ad36bb5ae data/outs/three.bin
009c79f013fe8d4d97c95bf5ceea68ed data/two.bin
- Verify all files in the subdirectory named "outs".
cap_verify_md5 --select "*/outs/*" "data/*"
Results in verifications/verify_example.out:
8d62064673ecb2a440b8802a2f752e8a data/outs/four.bin
74a08ee2de381ec8e19da52ad36bb5ae data/outs/three.bin
- Verify all files not in the subdirectory named "outs".
cap_verify_md5 --ignore "*/outs/*" "data/*"
Results in verifications/verify_example.out:
b3ac2b8b9998bf504ef708ec837a4cce data/one.bin
009c79f013fe8d4d97c95bf5ceea68ed data/two.bin
Functions to facilitate setting up environments for CAPTURE to operate in.
Environments help create reproducible pipelines by allowing authors to
work in their unique development setup, which may only work for them, and
reviewers to run pipelines in a default environment that should work anywhere.
Environment files are stored in the config/environments directory.
Creates a symbolic link in the data directory. A common use is to prevent duplicate storage of large datasets in the author's compute environment. By linking to a shared copy, multiple authors won't create multiple copies. This function is often used in conjunction with cap_data_download, where cap_data_link prevents cap_data_download from downloading a new version of previously downloaded data while ensuring the data will be downloaded in other environments such as the default environment.
cap_data_link <FILE>|<DIR>
<FILE>|<DIR>  The full path to a file or directory.
The symbolic link will have the same name as the specified file or directory
and will be created in the directory specified by CAP_DATA_PATH which
defaults to CAP_PROJECT_PATH/data.
The following example will create a symbolic link at $CAP_DATA_PATH/mouse
and should be included in an environment file in config/environments, e.g.
config/environments/my_lab.sh. The $MY_LAB environment variable should
be created in a .caprc file (See Runtime environment).
cap_data_link "$MY_LAB/genome/mouse"
To use the my_lab environment when running a job, use the cap run command
with the -e/--environment option like in the following example.
cap run -e my_lab src/01_download.sh
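In effect, the helper creates a link like the following. A sketch with illustrative paths; the shared directory here stands in for a lab's copy of a dataset:

```shell
# Sketch: the symlink cap_data_link creates in the data directory.
DATA="${CAP_DATA_PATH:-./data}"
mkdir -p "$DATA" ./shared/genome/mouse      # stand-in for a lab's shared copy

# The link takes the base name of the source and lives under CAP_DATA_PATH.
ln -sfn "$(pwd)/shared/genome/mouse" "$DATA/mouse"
ls -ld "$DATA/mouse"
```

Because the link has the same name a download would produce, cap_data_download sees the data as already present and skips fetching it in this environment.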
All pull requests must include BATS tests covering the changes.
The testing framework is installed by the following command.
tests/install
The entire test suite is executed by the following command.
tests/run
The tests can be filtered with the --filter option. This saves time by allowing subsets of the test suite to be run while coding. The following examples of using --filter are based on this hypothetical BATS test.
@test "cap md5: All files in a folder" {
...
}
How to run just the cap md5 tests:
tests/run --filter "cap md5"
How to run just the single hypothetical test:
tests/run --filter "cap md5: All files in a folder"