a tool to archive and co-locate NGS data with project-level, sample-level, and analysis-level metadata.
- Overview
- Getting Started
2.1 Dependencies
2.2 Installation
- Run pyrkit
3.1 Usage
3.2 Required Arguments
3.3 OPTIONS
3.4 Example
pyrkit, pronounced park-it, automates the process of moving data from the cluster into object storage in HPC DME. It instantiates a collection hierarchy to archive raw data and results. pyrkit parses a project request template, a pipeline's output directory, and a MultiQC directory to capture project, analysis, and quality-control metadata. pyrkit was created to enable FAIR scientific data management and stewardship.
Please Note: Some of the metadata listed in the example above is pipeline-specific (i.e. only for the RNA-seq pipeline).
pyrkit has a few required dependencies. It requires the installation of the following programs:
Please note that if you are running pyrkit on Biowulf, the only dependency you will need to install is the HPC DME toolkit. pyrkit will attempt to module load jq and python/3.5 (which meets any Python requirements) if they are not in your $PATH.
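If you are not on Biowulf, it can help to confirm that the tools mentioned above are on your $PATH before running pyrkit. The following is a minimal sketch of such a check, not pyrkit's own logic; the helper name `missing_tools` is hypothetical, and the executable names `jq` and `python3` are assumptions based on the tools named in this section.

```python
import shutil


def missing_tools(tools):
    """Return the names in `tools` that cannot be found on $PATH."""
    # shutil.which returns None when no matching executable is found.
    return [name for name in tools if shutil.which(name) is None]


if __name__ == "__main__":
    # Assumed executable names; adjust the list to match your system.
    missing = missing_tools(["jq", "python3"])
    if missing:
        print("Not found on $PATH: " + ", ".join(missing))
    else:
        print("All listed tools were found on $PATH.")
```

Anything reported as missing would need to be installed or loaded (for example with `module load` on a cluster that provides environment modules).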
Installation of pyrkit is easy! Please clone the repository from GitHub, create a virtual environment, and install any dependencies. Again, if you are on Biowulf, all you will need to do is clone the repository.
# Clone the Repository
git clone https://github.com/skchronicles/pyrkit.git
# Steps below are optional for Biowulf users
# Create a virtual environment
python3 -m venv .venv
# Activate the virtual environment
. .venv/bin/activate
# Update pip
pip install --upgrade pip
# Download Dependencies
pip install -r requirements.txt

usage: pyrkit -i INPUT_DIRECTORY -o OUTPUT_VAULT -r REQUEST_TEMPLATE
              -m MULTIQC_DIRECTORY -d DME_REPO [-p PROJECT_ID] [-n] [-h]
              [--version]

| Argument | Type | Description | Example |
|---|---|---|---|
| -i, --input-directory | Path | Pipeline output directory | /scratch/RNA_hg38/ |
| -o, --output-vault | String | HPC DME vault to upload data to | /CCBR_Archive |
| -r, --request-template | File | Project Request Template | experiment_metadata.xlsx |
| -m, --multiqc-directory | Path | MultiQC Output Directory | /scratch/RNA_hg38/multiqc_data/ |
| -d, --dme-repo | Path | Path to an HPC DME toolkit install | ~/DME/HPC_DME_APIs/ |
| Argument | Type | Description | Example |
|---|---|---|---|
| -p, --project-id | String | Project ID | ccbr-123 |
| -n, --dry-run | Flag | Dry-run the entire pyrkit workflow | -n |
| -h, --help | Flag | Display help message and exit | -h |
| --version | Flag | Display version information and exit | --version |
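To make the split between required and optional arguments concrete, the tables above can be mirrored with Python's `argparse`. This is only an illustrative sketch: it is not how pyrkit itself parses its command line, the function name `build_parser` is hypothetical, and the version string is a placeholder.

```python
import argparse


def build_parser():
    """Hypothetical parser mirroring the documented pyrkit arguments."""
    parser = argparse.ArgumentParser(prog="pyrkit")
    # Required arguments
    parser.add_argument("-i", "--input-directory", required=True,
                        help="Pipeline output directory")
    parser.add_argument("-o", "--output-vault", required=True,
                        help="HPC DME vault to upload data to")
    parser.add_argument("-r", "--request-template", required=True,
                        help="Project Request Template")
    parser.add_argument("-m", "--multiqc-directory", required=True,
                        help="MultiQC output directory")
    parser.add_argument("-d", "--dme-repo", required=True,
                        help="Path to an HPC DME toolkit install")
    # Optional arguments (argparse adds -h/--help automatically)
    parser.add_argument("-p", "--project-id", default=None, help="Project ID")
    parser.add_argument("-n", "--dry-run", action="store_true",
                        help="Dry-run the entire workflow")
    parser.add_argument("--version", action="version",
                        version="%(prog)s (placeholder version)")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args([
        "-i", "/scratch/RNA_hg38/",
        "-o", "/CCBR_Archive",
        "-r", "experiment_metadata.xlsx",
        "-m", "/scratch/RNA_hg38/multiqc_data/",
        "-d", "~/DME/HPC_DME_APIs/",
    ])
    print(args)
```

Omitting any of the five required options makes `parse_args` exit with an error, which matches the usage line where only `-p`, `-n`, `-h`, and `--version` appear in brackets.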
# Grab an interactive node or submit pyrkit command to cluster
# Do not run this on the head node!
sinteractive --mem=8g --cpus-per-task=2
# Dry-runs pyrkit, then submits a job to the cluster to upload the data
./pyrkit -i /scratch/ccbr123/RNA_hg38/ \
-o /CCBR_Archive \
-r experiment_metadata.xlsx \
-m /scratch/ccbr123/RNA_hg38/multiqc_data/ \
-d ~/DME/HPC_DME_APIs/ \
-p ccbr-123
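When pyrkit is called from a script rather than typed by hand, it can be convenient to assemble the same command programmatically. The sketch below builds the argument list from the example above and optionally appends the documented `-n` dry-run flag; the helper name `pyrkit_command` and the `executable` default are assumptions for illustration. It only prints the command, it does not run it.

```python
import os
import shlex


def pyrkit_command(input_dir, vault, template, multiqc_dir, dme_repo,
                   project_id=None, dry_run=False, executable="./pyrkit"):
    """Build a pyrkit argument list from the documented options."""
    cmd = [executable,
           "-i", input_dir,
           "-o", vault,
           "-r", template,
           "-m", multiqc_dir,
           "-d", dme_repo]
    if project_id is not None:
        cmd += ["-p", project_id]
    if dry_run:
        cmd.append("-n")
    return cmd


if __name__ == "__main__":
    cmd = pyrkit_command(
        "/scratch/ccbr123/RNA_hg38/",
        "/CCBR_Archive",
        "experiment_metadata.xlsx",
        "/scratch/ccbr123/RNA_hg38/multiqc_data/",
        # A shell expands "~", but an argument list does not, so expand it here.
        os.path.expanduser("~/DME/HPC_DME_APIs/"),
        project_id="ccbr-123",
        dry_run=True,
    )
    # Print a shell-safe rendering of the command for review.
    print(" ".join(shlex.quote(part) for part in cmd))
```

Once the printed command looks right, the same list could be passed to `subprocess.run(cmd, check=True)` from an interactive node or a submitted job, not from the head node.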