diff --git a/README.md b/README.md index 75cc7ec..9b7c3d5 100644 --- a/README.md +++ b/README.md @@ -1,129 +1,85 @@ # SSAxgeo -This software provides protein Secondary Structure Assignment based on differential geometry and knot theory descriptors. +SSAxgeo assigns protein secondary structure elements using differential geometry and knot theory. It provides containerised command-line tools for analysing Protein Data Bank entries. ---- +## Features -## How to install? +- Differential geometry–based descriptors. +- Command-line utilities for sampling PDB entries and computing descriptors. +- Singularity container for reproducible execution. +- Workflows to reproduce the analyses from the associated publication. -1. Clone the repository: -```{bash} -git clone --recurse-submodules https://github.com/labstructbioinf/SSAxgeo.git -``` +## Prerequisites -2. build the container -The recomended way to run SSAxgeo is using the the container provided on this repository. Once [Singularity](https://docs.sylabs.io/guides/3.11/user-guide/) is available on your system, +- Git +- Python ≥3.8 +- Singularity 3.x +- localpdb (for reproducing the paper analyses) -```{bash} -sudo singularity build ssaxgeo.sif SingularityFile +```bash +python --version ``` -### How to run? - -on the container +## Installation -```{bash} -singularity exec ssaxgeo.sif ssaxgeo [pdb_filepath] +```bash +git clone --recurse-submodules https://github.com/labstructbioinf/SSAxgeo.git +cd SSAxgeo +sudo singularity build ssaxgeo.sif SingularityFile ``` ----- -## REPRODUCE PAPER ANALYSES - +## Basic Usage -### 0 - Get a local copy of the PDB - -To reproduce the analyses presented on the paper, be sure localpdb is available on your environment. -Then, setup your local pdb copy: - -```{bash} -localpdb_setup -db_path /path/to/mypdb/ -plugins DSSP PDBClustering PDBChain --fetch_cif --fetch_pdb +```bash +singularity exec ssaxgeo.sif ssaxgeo /path/to/structure.pdb ``` -This process most likely will take a long time. -### 1 - Get a sampling of a clustered PDB +## Reproduce Paper Analyses -Once the local pdb copy is in place, compute a clustered pdb with a given sequence redundancy. -For instance, with the command bellow the user can obtain entries clustered by 30% of redundance and entry with at least 2 angstron resolutions. +1. **Prepare a local copy of the PDB** -TODO: if dssp and xgeo folder are not there, create it -```{bash} -ssaxgeo_getSampleOfClstrPDB /path/to/mypdb/ -out_dir /path/to/mydir/ -redundancy 30 -res_lim 2.0 -ncpus 4 -seed 0 -``` + ```bash + localpdb_setup -db_path /path/to/mypdb/ -plugins DSSP PDBClustering PDBChain --fetch_cif --fetch_pdb + ``` -``` -usage: ssaxgeo_getSampleOfClstrPDB [-h] [-redundancy REDUNDANCY] [-out_dir OUT_DIR] [-res_lim RES_LIM] [-ncpus NCPUS] [-seed SEED] mylocalpdb + Downloads PDB files and sets up required plugins at `/path/to/mypdb/`. -This script loads data from localpdb, select a given clustered PDB, select randomly one exemplar of each cluster and save results as csv files. - -positional arguments: - mylocalpdb Path to a local PDB copy (must be obtained by localpdb package) - -options: - -h, --help show this help message and exit - -redundancy REDUNDANCY - redundancy by sequence identity [100, 95, 90, 70, 50 and 30] - -out_dir OUT_DIR Output directory (default=working dir) - -res_lim RES_LIM resolution limit of structures to be considered (default=2.0) - -ncpus NCPUS number of cpus to use (default = 1) - -seed SEED seed for random number generator (default = None -``` -### 2 - compute differential geometry descriptors +2. **Sample a clustered PDB** -For each entry on the clustered pdb, we need to compute our differential geometry descriptors: + ```bash + ssaxgeo_getSampleOfClstrPDB /path/to/mypdb/ -out_dir /path/to/mydir/ -redundancy 30 -res_lim 2.0 -ncpus 4 -seed 0 + ``` -```{bash} -ssaxgeo_computePDBxgeo --mylocalpdb_path /path/to/mypdb/ --sampled_clstrd_path /path/to/sampled_clust-30.csv --xgeo_output_dir /path/to/mypdb/xgeo_chains/ --ncpus 8 --out_csv /path/to/sampled_clust-30_updated.csv -``` + Generates a clustered set of structures with the specified redundancy in `/path/to/mydir/`. -``` -usage: ssaxgeo_computePDBxgeo [-h] --mylocalpdb_path MYLOCALPDB_PATH --sampled_clstrd_path SAMPLED_CLSTRD_PATH [--xgeo_output_dir XGEO_OUTPUT_DIR] [--ncpus NCPUS] - [--out_csv OUT_CSV] - -Compute xgeo data for a given set of protein chains provided. - -options: - -h, --help show this help message and exit - --mylocalpdb_path MYLOCALPDB_PATH - path to a localpdb database - --sampled_clstrd_path SAMPLED_CLSTRD_PATH - path to a sampled clustered csv (produced by getSampleOfCLstrPDB) - --xgeo_output_dir XGEO_OUTPUT_DIR - path of a dir to store xgeo csv files (default = xgeo_output_dir+"/xgeo_chains/" - --ncpus NCPUS Number of cpus to be used (default=1) - --out_csv OUT_CSV Description of out_csv -``` -### 4 - Clustering residues and generating "fragments" ----- +3. **Compute differential geometry descriptors** -The next step is to normalize and smooth xgeo representation for each entry, clustering residues and obtain "fragments" (i. e., consecutive residues which belongs to the same cluster). Optionally, is possible to label all residues according to canonical regions (via `--do_res_labeling`) + ```bash + ssaxgeo_computePDBxgeo --mylocalpdb_path /path/to/mypdb/ --sampled_clstrd_path /path/to/sampled_clust-30.csv --xgeo_output_dir /path/to/mypdb/xgeo_chains/ --ncpus 8 --out_csv /path/to/sampled_clust-30_updated.csv + ``` -**WARN**: normalizing and smoothing may not be necessary anymore + Produces xgeo descriptor files and an updated sample table. -``` -ssaxgeo_clusterResidues /path/to/sampled_clust-30_updated.csv clust-30 -ncpus 8 -``` +4. **Cluster residues and generate fragments** -To obtain residue labeling according to canonical regions a directory containing dataframes for canonical regions needs to be provided. Those dataframes needs to be named as: `alpha_can.p`, `pi_can.p`, `three_can.p` and `pp2_can.p`. + ```bash + ssaxgeo_clusterResidues /path/to/sampled_clust-30_updated.csv clust-30 -ncpus 8 + ``` -``` -ssaxgeo_clusterResidues /path/to/sampled_clust-30_updated.csv clust-30 -ncpus 8 -do_ -``` + Normalises xgeo values and outputs residue clusters representing structural fragments. -### 4 - Select canonical regions ----- -Once a csv with the fragments is obtained, canonical regions can be idenfied by filtering fragments for geometrical helices, and clustering those fragment based on density. A jupyter notebook to generate the canonical sets is provided at `notebooks/SetCanonicalRegions.ipynb` +5. **Select canonical regions** ----- + Use `notebooks/SetCanonicalRegions.ipynb` to filter fragments for geometrical helices and derive canonical sets. -### Algorithm description +## Roadmap ---- -## TODO: (ver 1.0) +### v1.0 - [x] bring diffgeo to be part of ssaxgeo - [x] migrate code for canonical regions detection to rely on localpdb - [x] add/update and adapt scripts to reproduce paper results more easily (via CLI) -- [ ] adapt old code scripts to rely on new structure +- [ ] adapt old code scripts to rely on new structure - [x] ssaxgeo - [x] computePDBxgeo - [x] getSampleOfClstrPDB @@ -135,7 +91,20 @@ Once a csv with the fragments is obtained, canonical regions can be idenfied by - [ ] update documentation - [ ] test end-to-end -## TODO: (ver 1.1) +### v1.1 + - [ ] add citation - [ ] add pymol viz support -- [ ] add xgeo Dlang code suport +- [ ] add xgeo Dlang code support + +## Citation + +If you use SSAxgeo in your research, please cite the associated publication. + +## Contributing + +Contributions are welcome. Please open an issue to discuss significant changes before submitting a pull request. + +## License + +This project is licensed under the MIT License.