CAPTURE
A framework and command line interface (CLI) for computational science.
Table of Contents
- Installation
- CLI usage
- Job helper functions
- Verification helper functions
- Environment helper functions
- Contributions
curl -sSL https://raw.githubusercontent.com/lasseignelab/capture/refs/heads/main/install.sh | bash
source ~/.bash_profile
cap update
The cap CLI provides commands to help with reproducible research.
cap <command> params...
Displays CAPTURE environment variables. This command must be executed from the project root directory.
Usage:
cap env
Options:
-e,--environment
Specifies the environment to show variables for.
Example:
$ cap env
CAP_ENV_PATH=/data/user/acrumley/3xtg-repurposing/bin/env
CAP_CONTAINER_PATH=/data/user/acrumley/3xtg-repurposing/bin/container
CAP_DATA_PATH=/data/user/acrumley/3xtg-repurposing/data
CAP_ENVIRONMENT=default
CAP_LOGS_PATH=/data/user/acrumley/3xtg-repurposing/logs
CAP_PROJECT_NAME=3xtg-repurposing
CAP_PROJECT_PATH=/data/user/acrumley/3xtg-repurposing
CAP_RANDOM_SEED=16600
CAP_RESULTS_PATH=/data/user/acrumley/3xtg-repurposing/results
CAP_VERIFICATIONS_PATH=/data/user/acrumley/3xtg-repurposing/verifications
Shows help for the cap command line tool.
Usage:
cap help [COMMAND]
Example:
$ cap help
Usage: cap COMMAND ...
Commands:
The following subcommands are available.
COMMAND
env Displays CAPTURE environment variables.
help Shows help for the cap command line tool.
md5 Calculates a combined MD5 checksum for one or more files.
new Creates a new reproducible research project.
run Runs a CAPTURE framework job.
update Updates the CAPTURE framework to the latest version.
version Displays the currently installed version of CAPTURE.
$ cap help md5
Calculates a combined MD5 checksum for one or more files.
The "md5" command produces a combined MD5 checksum for all the files
specified. It will show a list of all files included to ensure that the
result is as expected.
Usage:
cap md5 FILE...
FILE... can be one or more file and/or directory specifications.
Example:
$ cap md5 *
Files included:
43bd364a97a38fb1da7c57e6381886c1 capture/LICENSE
b794df25f796ac80680c0e4d27308bce capture/commands/md5.sh
0d9281c3586c420130bcb5d25c8a151a capture/lab
5e79c988140af1b7bd5735b0bf96306b capture/README.md
783a44ffae97afbce3f1649c5ff517a5 capture/install.sh
Combined MD5 checksum:
a225199964b84bdeef33bafe3df7c10b
The md5 command produces an MD5 checksum for each file specified and a
combined MD5 checksum for all the files. The purpose of this command is to
determine whether files downloaded or created are complete and accurate. If
the MD5 checksums from two sets of files match then the files are all the same.
Usage:
cap md5 [options] FILE...
FILE... One or more file and/or directory names or patterns. For directories,
all files in the directory and its subdirectories will be included.
Options:
--append
Append to the output file if it already exists.
-n,--dry-run
Lists the files that will have md5sums calculated in order to
verify the expected files are included. This is helpful when
the files are large and take a long time to process.
--ignore=PATTERN
Exclude files matching the file PATTERN based on the full relative
path. If the option is specified multiple times, all files matching
any of the patterns will be EXCLUDED (logical OR). The selector will
generally have wildcards. Ensure patterns are quoted ("*pattern*") to
prevent unintended shell expansion.
-o,--output=FILE
Specify an output file name to write the results to. See examples for
the output format.
--output-files-only
Output only the file names with their md5sum. This facilitates
programmatic verification of files.
--normalize
Normalizes the output file paths so that files in different root
directories can be easily compared.
--select=PATTERN
Include only files matching the file PATTERN based on the full relative
path. If the option is specified multiple times, all files matching
any of the patterns will be INCLUDED (logical OR). The selector will
generally have wildcards. Ensure patterns are quoted ("*pattern*") to
prevent unintended shell expansion.
-s,--slurm=[batch|run]
Runs the md5 command as a Slurm job. If the value is run then
srun is used and the output stays connected to the current
terminal session. If the value is batch then sbatch is used and
the output is written to cap-md5-<job_id>.out unless the -o or --output
option is specified.
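The path matching used by --select and --ignore can be sketched with ordinary shell glob patterns. This is illustrative only, not necessarily how cap implements it:

```shell
# Sketch: matching a full relative path against a glob such as "*/outs/*".
# In a shell `case` pattern, `*` also matches `/`, so one pattern can
# span directory boundaries.
matches() {  # usage: matches PATTERN PATH
  case "$2" in
    $1) return 0 ;;
    *)  return 1 ;;
  esac
}
matches "*/outs/*" "files/outs/four.bin" && echo "selected"
matches "*/outs/*" "files/one.bin" || echo "excluded"
```

Quoting the pattern when calling the real command matters for the same reason: an unquoted `*/outs/*` would be expanded by the shell before cap ever sees it.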
Examples:
Calculate md5 sums for all files in a directory and its subdirectories.
cap md5 files/*
Files included:
b3ac2b8b9998bf504ef708ec837a4cce files/one.bin
8d62064673ecb2a440b8802a2f752e8a files/outs/four.bin
74a08ee2de381ec8e19da52ad36bb5ae files/outs/three.bin
009c79f013fe8d4d97c95bf5ceea68ed files/two.bin
Combined MD5 checksum:
1060bcc0958e5cc774f84ccd24a3b010
Calculate md5 sums for files in the subdirectory named "outs".
cap md5 --select "*/outs/*" files/*
Files included:
8d62064673ecb2a440b8802a2f752e8a files/outs/four.bin
74a08ee2de381ec8e19da52ad36bb5ae files/outs/three.bin
Combined MD5 checksum:
feaaf18494b99f6570ab6e4730f9e4af
Calculate md5 sums for files not in the subdirectory named "outs".
cap md5 --ignore "*/outs/*" files/*
Files included:
b3ac2b8b9998bf504ef708ec837a4cce files/one.bin
009c79f013fe8d4d97c95bf5ceea68ed files/two.bin
Combined MD5 checksum:
c6f882353ed4c63582276bdd49974a86
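Conceptually, the combined checksum above is a digest computed over the per-file digests. A minimal sketch with standard coreutils follows; the exact combination scheme cap md5 uses internally may differ:

```shell
# Illustrative only: one way to derive a combined digest from per-file md5sums.
# (The exact scheme `cap md5` uses may differ.)
printf 'alpha\n' > one.bin
printf 'beta\n'  > two.bin
md5sum one.bin two.bin | sort -k 2 > sums.txt       # per-file checksums, stable order
cut -d ' ' -f 1 sums.txt | md5sum | cut -d ' ' -f 1  # combined checksum
```

Because the per-file list is sorted before combining, the result is deterministic for the same file contents regardless of argument order.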
The cap new command will create a new research project based on the
project-template submodule in the capture repository. The project
repository will be created with the origin remote pointed to a GitHub
repository determined by the owner and project name parameters.
Usage:
cap new [options] PROJECT_NAME
PROJECT_NAME Name of the project which will be used for the directory name.
It should also match the git host repo name if one is used.
Options:
--git-host=<host-domain-name>
Git host for the repository used for creating git remotes. The
default is "github.com".
-o,--owner=<owner-id>
Git host owner the project repo will be created under. This may
be a personal or organization account.
--skip-git
Skip making the project a git repository in order to allow
the use of other source control software.
Example:
$ cap new --owner lasseignelab PKD_Research
Create an empty repository for 'PKD_Research' on GitHub by using the
following link and settings:
https://github.com/organizations/lasseignelab/repositories/new
* No template
* Owner: lasseignelab
* Repository name: PKD_Research
* Private
* No README file
* No .gitignore
* No license
Were you able to create a repository (y/N)? y
Cloning into 'PKD_Research'...
done.
...
Happy researching!!!
The cap run command runs a CAPTURE framework job within the context of a
reproducible research project. It will configure the environment based
on configuration defined by the current user. By default, the job runs in
the current terminal session. This command must be executed from the project
root directory.
Usage:
cap run [options] FILE
FILE File name of the job to run.
Options:
-e,--environment
Specifies the environment to run jobs in. Environments allow
different setups for a pipeline. For instance, a pipeline may
use internal copies of data during development but download that
data when the pipeline is run in a different environment.
-n,--dry-run
Displays the contents of the job to run along with the context
it will run in.
-s,--slurm=[batch|run]
Runs the script as a Slurm job. If the value is run then
srun is used and the output stays connected to the current
terminal session. If the value is batch then sbatch is used and
the output is written to the log file in the logs directory.
Example:
$ cap run src/01_download.sh
CAPTURE environment: default
View job output with the following command:
cat logs/01_down_20241118_090854_tcrumley*
Submitted batch job 29818073
The runtime environment is configured with the following variables, which are available to Slurm scripts.
- CAP_CONTAINER_PATH: Path to where container files such as Docker images will be maintained. Defaults to <project-path>/bin/container.
- CAP_DATA_PATH: Path to where data files will be written. Defaults to <project-path>/data.
- CAP_ENVIRONMENT: The name of the current execution environment. Defaults to the value "default". A shell script in config/environments with a name matching the environment name will be executed during the CAPTURE configuration process, e.g. config/environments/default.sh. This variable will generally be set in the ~/.caprc file. It is possible to set it as a shell environment variable somewhere like ~/.bash_profile. Another option is to provide it before a command, e.g. CAP_ENVIRONMENT=mylab cap run foo.sh. Finally, some commands provide an environment option, such as cap run --environment=mylab foo.sh.
- CAP_ENV_PATH: Path to where conda and other runtime environment files will be maintained. Defaults to <project-path>/bin/env.
- CAP_LOGS_PATH: Path to where log files will be written. Defaults to <project-path>/logs.
- CAP_PROJECT_NAME: The name of the project given with the cap new command.
- CAP_PROJECT_PATH: Path to the root directory of the project.
- CAP_RANDOM_SEED: A randomly generated seed to facilitate reproducible random number generation.
- CAP_RESULTS_PATH: Path to where analysis results will be written. Defaults to <project-path>/results.
- CAP_VERIFICATIONS_PATH: Path to where verification files and the result files they produce are written. Defaults to <project-path>/verifications.
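A job script run by cap run can consume these variables directly. A minimal sketch; the := fallbacks (including the seed value) are only so the snippet runs standalone outside the framework:

```shell
#!/bin/sh
# Sketch: a job script consuming CAP_* variables exported by `cap run`.
# The := fallbacks below are only for running this sketch outside the framework.
: "${CAP_RESULTS_PATH:=./results}"
: "${CAP_RANDOM_SEED:=16600}"

mkdir -p "$CAP_RESULTS_PATH"
# Record the seed so a reviewer can reproduce any randomized step.
echo "seed=$CAP_RANDOM_SEED" > "$CAP_RESULTS_PATH/run_info.txt"
```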
Environment variables can be configured with the following configuration files.
/
`-- etc/
    `-- caprc
~/
`-- .caprc
<project-path>/
|-- .caprc
`-- config/
    |-- pipeline.sh
    `-- environments/
        |-- default.sh
        `-- <lab-name>.sh
Configuration files are loaded in the following order:
- <project-path>/config/pipeline.sh: Configuration to bootstrap the runtime environment. This file is configured by the cap new command with the CAP_PROJECT_NAME variable set to the name given as a parameter.
- defaults: The defaults described in the environment variable section are set at this point.
- /etc/caprc: Configuration set by an organization.
- ~/.caprc: Configuration set for a specific user. This is a good place to source in lab-specific configuration.
- <project-path>/.caprc: Configuration specific to a project.
- <project-path>/config/environments/<CAP_ENVIRONMENT>.sh: Configuration specific to a project and the environment it is being executed in. The default.sh configuration should only contain reproducible configuration that will work in any Slurm environment. Other lab-specific environment files can contain non-reproducible configuration, but the job must also work in the default environment for reproducibility. An example of environment-specific configuration would be creating symlinks in the data directory for sharing large datasets internal to a lab while also downloading the data when the symlink does not exist. See cap_data_link.
The cap update command will upgrade the CAPTURE framework to the latest
version.
Usage:
cap update
Example:
$ cap update
Switched to branch 'main'
Already up-to-date.
CAPTURE updated to version v0.0.1.
The verify command runs CAPTURE verifications which are shell scripts that
determine whether outputs are reproducible. The output of verification scripts
will be written to the verifications folder with the same name as the script
with a ".out" extension. These files should be committed to source control so
that reviewers can compare their results. This command must be executed from
the project root directory.
See also verification helper functions.
Environment variables (useful for custom verifications):
CAP_VERIFICATION_DRY_RUN: Boolean value ("true" or "false") indicating whether the current verification is a dry run.
CAP_VERIFICATION_OUTPUT_FILE: The file to which verification output is appended.
Usage:
cap verify [options] FILE
FILE The single verification script to run.
Options:
-n,--dry-run
Lists the files that will have verifications performed in order to
verify the expected files are included. This is helpful when
the files are large and take a long time to process.
-s,--slurm=[batch|run]
Runs the verify command as a Slurm job with sbatch or srun.
Example:
Perform verifications for a step in the pipeline which will produce an
output file named verifications/01_download.out.
cap verify verifications/01_download.sh
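A custom verification script can honor the environment variables above. A minimal sketch; the stand-in data file and := fallbacks are only so the example runs on its own:

```shell
#!/bin/sh
# Sketch: a custom verification that honors the dry-run flag.
# `cap verify` sets these variables; the := fallbacks are for standalone runs.
: "${CAP_VERIFICATION_DRY_RUN:=false}"
: "${CAP_VERIFICATION_OUTPUT_FILE:=verify_example.out}"

printf 'a\nb\nc\n' > counts.txt            # stand-in data file

if [ "$CAP_VERIFICATION_DRY_RUN" = "true" ]; then
  echo "would verify: counts.txt"          # list work without doing it
else
  echo "rows=$(wc -l < counts.txt)" >> "$CAP_VERIFICATION_OUTPUT_FILE"
fi
```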
The cap version command will display the currently installed version
of CAPTURE.
Usage:
cap version
Example:
$ cap version
v0.0.3
Retrieves a value from an array file based on a zero-based index.
cap_array_value FILE [INDEX]
FILE   The file containing an array value on each line.
INDEX  The optional zero-based index for the value of the array.
If a value is not provided for INDEX then the SLURM_ARRAY_TASK_ID
environment variable will be used as the default.
Example that retrieves array values based on the Slurm environment variable default index.
sample=$(cap_array_value "$CAP_DATA_PATH/sample_list.array")
Example with a for loop:
for index in {0..9}; do
  sample=$(cap_array_value "$CAP_DATA_PATH/sample_list.array" "$index")
  # Do something with each sample value.
done
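The lookup itself is equivalent to reading the (index + 1)-th line of the file. A sketch with awk, illustrative rather than the framework's actual implementation:

```shell
# Sketch: zero-based lookup of one value per line, as cap_array_value does.
array_value() {  # usage: array_value FILE [INDEX]; INDEX defaults to SLURM_ARRAY_TASK_ID
  awk -v i="${2:-$SLURM_ARRAY_TASK_ID}" 'NR == i + 1 { print; exit }' "$1"
}

printf 'sampleA\nsampleB\nsampleC\n' > sample_list.array
array_value sample_list.array 1   # prints sampleB
```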
Downloads data into the data directory.
cap_data_download [options] URL
URL  The URL of the file to download.
Options:
--source-file-name
The name of the file being downloaded. When the source file URL does not
end in a proper file name, this option allows a name to be provided. Final
downloaded file and/or directory names may be different if the --unzip
option is used.
--md5sum
The md5sum to check against the file being downloaded.
--unzip
Unzips and/or unarchives downloaded files.
--subdirectory
Specifies a subdirectory within the data directory where the downloaded
file will be stored. If the subdirectory does not exist, it will be created.
The file will be downloaded to the file name specified by the URL or the
--source-file-name option. If the --unzip option is provided then it will
be unarchived into the data directory and possibly have a different final name.
The data directory is specified by CAP_DATA_PATH which defaults to
CAP_PROJECT_PATH/data. If the --subdirectory option is provided, the
downloaded file will be saved in CAP_PROJECT_PATH/data/subdirectory.
If the file or directory already exists in the data directory (or subdirectory
if --subdirectory is provided) then it will not be downloaded again. This is
also true when the file or directory has been symlinked into the data directory
by cap_data_link.
The following example will download and unarchive a directory into
CAP_DATA_PATH/refdata-gex-GRCm39-2024-A.
cap_data_download \
--unzip \
--md5sum="37c51137ccaeabd4d151f80dc86ce0b3" \
"https://cf.10xgenomics.com/supp/cell-exp/refdata-gex-GRCm39-2024-A.tar.gz"
The following example will download and unarchive a directory into
CAP_DATA_PATH/reference/refdata-gex-GRCm39-2024-A.
cap_data_download \
--unzip \
--subdirectory "reference" \
--md5sum="37c51137ccaeabd4d151f80dc86ce0b3" \
"https://cf.10xgenomics.com/supp/cell-exp/refdata-gex-GRCm39-2024-A.tar.gz"
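The skip-if-present behavior can be sketched as follows. This is illustrative only; the URLs are placeholders, and the real command also handles checksum verification and unarchiving:

```shell
# Sketch: the presence check cap_data_download performs before fetching.
# A symlink created by cap_data_link also counts as "already present".
DATA="${CAP_DATA_PATH:-./data}"
mkdir -p "$DATA"

needs_download() {  # usage: needs_download URL
  target="$DATA/$(basename "$1")"
  [ ! -e "$target" ] && [ ! -L "$target" ]
}

touch "$DATA/ref.tar.gz"    # pretend a prior run already downloaded this
needs_download "https://example.com/ref.tar.gz" || echo "skipping ref.tar.gz"
needs_download "https://example.com/other.tar.gz" && echo "would download other.tar.gz"
```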
Downloads the proper Docker or Singularity container.
cap_container [options] REFERENCE
REFERENCE  The Docker image reference found on DockerHub. The format of the reference is <owner>/<repository_name>[:tag].
Options:
-c singularity
If specified, cap_container will use singularity pull instead of
docker pull. If CAP_CONTAINER_TYPE is specified in a caprc file then
the -c option is not necessary. CAP_CONTAINER_TYPE is the preferred method.
cap_container first checks whether the Docker image or Singularity .sif file
already exists in CAP_CONTAINER_PATH. If the image is not found, it is downloaded
from DockerHub. By default, cap_container uses Docker, but specifying the
-c singularity option or CAP_CONTAINER_TYPE=singularity in a caprc
directs it to generate a Singularity .sif file in the CAP_CONTAINER_PATH
directory instead.
The following example checks for the corresponding .sif file in CAP_CONTAINER_PATH.
If the file is not found, it downloads and converts the Docker image into the
Singularity .sif file - ollama_0.5.8.sif.
cap_container \
-c singularity \
"ollama/ollama:0.5.8"
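The cached file name in the example (ollama_0.5.8.sif) appears to follow the usual name_tag.sif convention for images pulled from docker:// references. A sketch of that derivation, inferred from the example rather than from the framework's source:

```shell
# Sketch: deriving the cached .sif name from a Docker reference
# (naming inferred from the ollama example above; the framework may differ).
sif_name() {  # e.g. ollama/ollama:0.5.8 -> ollama_0.5.8.sif
  ref=${1##*/}                            # strip the owner: ollama:0.5.8
  printf '%s_%s.sif\n' "${ref%%:*}" "${ref##*:}"
}

sif_name "ollama/ollama:0.5.8"            # prints ollama_0.5.8.sif
```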
Functions to facilitate verifying that pipeline results are reproducible.
Verification scripts are stored in the verifications directory in the
project root directory and should be committed to the code repository.
The output file will be given the same name as the verification file with a
.out extension and will be stored in the same directory. The output file
should also be committed to the code repository. When reproducing results, use
the git diff command to confirm that results of a verification match the
original results.
The cap_verify_append function appends text to the verification's .out
file. The purpose of this command is to facilitate custom verifications and to
add comments between groupings of verification output.
cap_verify_append TEXT
- TEXT Text that will be appended directly to the end of the .out file.
- Verify files with comments.
cap_verify_append "##### Mouse data #####"
cap_verify_md5 "data/mouse/*"
cap_verify_append "##### Human data #####"
cap_verify_md5 "data/human/*"
Results in verifications/verify_example.out:
##### Mouse data #####
b3ac2b8b9998bf504ef708ec837a4cce data/mouse/one.bin
8d62064673ecb2a440b8802a2f752e8a data/mouse/outs/four.bin
74a08ee2de381ec8e19da52ad36bb5ae data/mouse/outs/three.bin
009c79f013fe8d4d97c95bf5ceea68ed data/mouse/two.bin
##### Human data #####
b3ac2b8b9998bf504ef708ec837a4cc1 data/human/one.bin
8d62064673ecb2a440b8802a2f752e82 data/human/outs/four.bin
74a08ee2de381ec8e19da52ad36bb5a5 data/human/outs/three.bin
009c79f013fe8d4d97c95bf5ceea68e8 data/human/two.bin
- Custom verification from a Python script.
The environment variable CAP_VERIFICATION_DRY_RUN can be used to add dry-run functionality to custom verification scripts; it will be equal to "true" during a dry run.
cap_verify_append "$(python3 $CAP_VERIFICATIONS_PATH/verify_example.py)"
Results in verifications/verify_example.out:
Cell Count: 1000
Gene Count: 5000
The cap_verify_md5 function produces an MD5 checksum for each file specified,
storing the results in an output file to be checked into the repository for
verifying future reproducibility. The purpose of this command is to determine
whether files downloaded or created are complete and accurate when reproduced.
If the MD5 checksums from two sets of files match then the files are all the
same.
cap_verify_md5 [options] FILE...
- FILE... One or more file and/or directory names or patterns. For directories, all files in the directory and its subdirectories will be included.
Options:
- --ignore=PATTERN
Exclude files matching the file PATTERN based on the full relative path. If the option is specified multiple times, all files matching any of the patterns will be EXCLUDED (logical OR). The selector will generally have wildcards. Ensure patterns are quoted ("*pattern*") to prevent unintended shell expansion.
- --select=PATTERN
Include only files matching the file PATTERN based on the full relative path. If the option is specified multiple times, all files matching any of the patterns will be INCLUDED (logical OR). The selector will generally have wildcards. Ensure patterns are quoted ("*pattern*") to prevent unintended shell expansion.
- Verify all files in a directory and its subdirectories.
cap_verify_md5 "data/*"
Results in verifications/verify_example.out:
b3ac2b8b9998bf504ef708ec837a4cce data/one.bin
8d62064673ecb2a440b8802a2f752e8a data/outs/four.bin
74a08ee2de381ec8e19da52ad36bb5ae data/outs/three.bin
009c79f013fe8d4d97c95bf5ceea68ed data/two.bin
- Verify all files in the subdirectory named "outs".
cap_verify_md5 --select "*/outs/*" "data/*"
Results in verifications/verify_example.out:
8d62064673ecb2a440b8802a2f752e8a data/outs/four.bin
74a08ee2de381ec8e19da52ad36bb5ae data/outs/three.bin
- Verify all files not in the subdirectory named "outs".
cap_verify_md5 --ignore "*/outs/*" "data/*"
Results in verifications/verify_example.out:
b3ac2b8b9998bf504ef708ec837a4cce data/one.bin
009c79f013fe8d4d97c95bf5ceea68ed data/two.bin
Functions to facilitate setting up environments for CAPTURE to operate in.
Environments help create reproducible pipelines by allowing authors to
work in their unique development setup, which may only work for them, and
reviewers to run pipelines in a default environment that should work anywhere.
Environment files are stored in the config/environments directory.
Creates a symbolic link in the data directory. A common use is to prevent duplicate storage of large datasets in the author's compute environment. By linking to a shared copy, multiple authors won't create multiple copies. This function is often used in conjunction with cap_data_download, where cap_data_link prevents cap_data_download from downloading a new version of previously downloaded data while ensuring the data will be downloaded in other environments such as the default environment.
cap_data_link <FILE>|<DIR>
<FILE>|<DIR>  The full path to a file or directory.
The symbolic link will have the same name as the specified file or directory
and will be created in the directory specified by CAP_DATA_PATH which
defaults to CAP_PROJECT_PATH/data.
The following example will create a symbolic link at $CAP_DATA_PATH/mouse
and should be included in an environment file in config/environments, e.g.
config/environments/my_lab.sh. The $MY_LAB environment variable should
be created in a .caprc file (See Runtime environment).
cap_data_link "$MY_LAB/genome/mouse"
To use the my_lab environment when running a job, use the cap run command
with the -e/--environment option like in the following example.
cap run -e my_lab src/01_download.sh
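In effect, the helper creates a link like the following. A sketch with illustrative paths; the shared directory here stands in for a lab's copy of a dataset:

```shell
# Sketch: the symlink cap_data_link creates in the data directory.
DATA="${CAP_DATA_PATH:-./data}"
mkdir -p "$DATA" ./shared/genome/mouse      # stand-in for a lab's shared copy

# The link takes the base name of the source and lives under CAP_DATA_PATH.
ln -sfn "$(pwd)/shared/genome/mouse" "$DATA/mouse"
ls -ld "$DATA/mouse"
```

Because the link has the same name a download would produce, cap_data_download sees the data as already present and skips fetching it in this environment.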
All pull requests must include BATS tests covering the changes.
The testing framework is installed by the following command.
tests/install
The entire test suite is executed by the following command.
tests/run
The tests can be filtered with the --filter option. This saves time by allowing subsets of the test suite to be run while coding. The following examples of using --filter are based on this hypothetical BATS test.
@test "cap md5: All files in a folder" {
...
}
How to run just the cap md5 tests:
tests/run --filter "cap md5"
How to run just the single hypothetical test:
tests/run --filter "cap md5: All files in a folder"