155 changes: 62 additions & 93 deletions README.md
@@ -1,129 +1,85 @@
# SSAxgeo

This software provides protein Secondary Structure Assignment based on differential geometry and knot theory descriptors.
SSAxgeo assigns protein secondary structure elements using differential geometry and knot theory. It provides containerised command-line tools for analysing Protein Data Bank entries.

---
## Features

## How to install?
- Differential geometry–based descriptors.
- Command-line utilities for sampling PDB entries and computing descriptors.
- Singularity container for reproducible execution.
- Workflows to reproduce the analyses from the associated publication.

1. Clone the repository:
```{bash}
git clone --recurse-submodules https://github.com/labstructbioinf/SSAxgeo.git
```
## Prerequisites

2. build the container
The recomended way to run SSAxgeo is using the the container provided on this repository. Once [Singularity](https://docs.sylabs.io/guides/3.11/user-guide/) is available on your system,
- Git
- Python ≥3.8
- Singularity 3.x
- localpdb (for reproducing the paper analyses)
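
As a quick sanity check before building the container, the prerequisites listed above can be probed from the shell. This is a minimal sketch, not part of SSAxgeo: `check_tools` is a hypothetical helper name, and it only confirms that each command is on `PATH`, not that its version is recent enough.

```bash
# Hypothetical helper (not shipped with SSAxgeo): report which of the
# given commands are missing from PATH.
# Returns 0 when every command is found, 1 otherwise.
check_tools() {
    n_missing=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "missing: $tool"
            n_missing=1
        fi
    done
    return "$n_missing"
}

# On some systems the interpreter is installed as python3 rather than python.
# localpdb_setup is the localpdb command used later in this README.
check_tools git python singularity localpdb_setup \
    || echo "Install the missing tools before continuing."
```

The Python ≥3.8 requirement still has to be confirmed separately with `python --version`.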

```{bash}
sudo singularity build ssaxgeo.sif SingularityFile
```bash
python --version
```

### How to run?

on the container
## Installation

```{bash}
singularity exec ssaxgeo.sif ssaxgeo [pdb_filepath]
```bash
git clone --recurse-submodules https://github.com/labstructbioinf/SSAxgeo.git
cd SSAxgeo
sudo singularity build ssaxgeo.sif SingularityFile
```

----
## REPRODUCE PAPER ANALYSES

## Basic Usage

### 0 - Get a local copy of the PDB

To reproduce the analyses presented on the paper, be sure localpdb is available on your environment.
Then, setup your local pdb copy:

```{bash}
localpdb_setup -db_path /path/to/mypdb/ -plugins DSSP PDBClustering PDBChain --fetch_cif --fetch_pdb
```bash
singularity exec ssaxgeo.sif ssaxgeo /path/to/structure.pdb
```
This process most likely will take a long time.
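
The single-file `singularity exec` call shown under Basic Usage can be repeated over a directory of structures. The sketch below assumes the container was built as `ssaxgeo.sif` and that the input files end in `.pdb`; `run_ssaxgeo_on_dir` and the `DRY_RUN` variable are hypothetical names introduced for illustration. Where `ssaxgeo` writes its results is not stated in this README, so the loop does not try to collect them.

```bash
# Hypothetical helper: run the documented ssaxgeo command on every .pdb
# file in a directory. With DRY_RUN=1 the commands are printed instead
# of executed.
run_ssaxgeo_on_dir() {
    pdb_dir=$1
    sif_image=${2:-ssaxgeo.sif}
    for pdb_file in "$pdb_dir"/*.pdb; do
        # Skip the literal pattern when the directory has no .pdb files.
        [ -e "$pdb_file" ] || continue
        if [ "${DRY_RUN:-0}" = "1" ]; then
            echo "singularity exec $sif_image ssaxgeo $pdb_file"
        else
            singularity exec "$sif_image" ssaxgeo "$pdb_file"
        fi
    done
}
```

Running `DRY_RUN=1 run_ssaxgeo_on_dir /path/to/structures` first is a cheap way to review the commands before launching them.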

### 1 - Get a sampling of a clustered PDB
## Reproduce Paper Analyses

Once the local pdb copy is in place, compute a clustered pdb with a given sequence redundancy.
For instance, with the command bellow the user can obtain entries clustered by 30% of redundance and entry with at least 2 angstron resolutions.
1. **Prepare a local copy of the PDB**

TODO: if dssp and xgeo folder are not there, create it
```{bash}
ssaxgeo_getSampleOfClstrPDB /path/to/mypdb/ -out_dir /path/to/mydir/ -redundancy 30 -res_lim 2.0 -ncpus 4 -seed 0
```
```bash
localpdb_setup -db_path /path/to/mypdb/ -plugins DSSP PDBClustering PDBChain --fetch_cif --fetch_pdb
```

```
usage: ssaxgeo_getSampleOfClstrPDB [-h] [-redundancy REDUNDANCY] [-out_dir OUT_DIR] [-res_lim RES_LIM] [-ncpus NCPUS] [-seed SEED] mylocalpdb
Downloads PDB files and sets up required plugins at `/path/to/mypdb/`.

This script loads data from localpdb, select a given clustered PDB, select randomly one exemplar of each cluster and save results as csv files.

positional arguments:
mylocalpdb Path to a local PDB copy (must be obtained by localpdb package)

options:
-h, --help show this help message and exit
-redundancy REDUNDANCY
redundancy by sequence identity [100, 95, 90, 70, 50 and 30]
-out_dir OUT_DIR Output directory (default=working dir)
-res_lim RES_LIM resolution limit of structures to be considered (default=2.0)
-ncpus NCPUS number of cpus to use (default = 1)
-seed SEED seed for random number generator (default = None
```
### 2 - compute differential geometry descriptors
2. **Sample a clustered PDB**

For each entry on the clustered pdb, we need to compute our differential geometry descriptors:
```bash
ssaxgeo_getSampleOfClstrPDB /path/to/mypdb/ -out_dir /path/to/mydir/ -redundancy 30 -res_lim 2.0 -ncpus 4 -seed 0
```

```{bash}
ssaxgeo_computePDBxgeo --mylocalpdb_path /path/to/mypdb/ --sampled_clstrd_path /path/to/sampled_clust-30.csv --xgeo_output_dir /path/to/mypdb/xgeo_chains/ --ncpus 8 --out_csv /path/to/sampled_clust-30_updated.csv
```
Generates a clustered set of structures with the specified redundancy in `/path/to/mydir/`.

```
usage: ssaxgeo_computePDBxgeo [-h] --mylocalpdb_path MYLOCALPDB_PATH --sampled_clstrd_path SAMPLED_CLSTRD_PATH [--xgeo_output_dir XGEO_OUTPUT_DIR] [--ncpus NCPUS]
[--out_csv OUT_CSV]

Compute xgeo data for a given set of protein chains provided.

options:
-h, --help show this help message and exit
--mylocalpdb_path MYLOCALPDB_PATH
path to a localpdb database
--sampled_clstrd_path SAMPLED_CLSTRD_PATH
path to a sampled clustered csv (produced by getSampleOfCLstrPDB)
--xgeo_output_dir XGEO_OUTPUT_DIR
path of a dir to store xgeo csv files (default = xgeo_output_dir+"/xgeo_chains/"
--ncpus NCPUS Number of cpus to be used (default=1)
--out_csv OUT_CSV Description of out_csv
```
### 4 - Clustering residues and generating "fragments"
----
3. **Compute differential geometry descriptors**

The next step is to normalize and smooth xgeo representation for each entry, clustering residues and obtain "fragments" (i. e., consecutive residues which belongs to the same cluster). Optionally, is possible to label all residues according to canonical regions (via `--do_res_labeling`)
```bash
ssaxgeo_computePDBxgeo --mylocalpdb_path /path/to/mypdb/ --sampled_clstrd_path /path/to/sampled_clust-30.csv --xgeo_output_dir /path/to/mypdb/xgeo_chains/ --ncpus 8 --out_csv /path/to/sampled_clust-30_updated.csv
```

**WARN**: normalizing and smoothing may not be necessary anymore
Produces xgeo descriptor files and an updated sample table.

```
ssaxgeo_clusterResidues /path/to/sampled_clust-30_updated.csv clust-30 -ncpus 8
```
4. **Cluster residues and generate fragments**

To obtain residue labeling according to canonical regions a directory containing dataframes for canonical regions needs to be provided. Those dataframes needs to be named as: `alpha_can.p`, `pi_can.p`, `three_can.p` and `pp2_can.p`.
```bash
ssaxgeo_clusterResidues /path/to/sampled_clust-30_updated.csv clust-30 -ncpus 8
```

```
ssaxgeo_clusterResidues /path/to/sampled_clust-30_updated.csv clust-30 -ncpus 8 -do_
```
Normalises xgeo values and outputs residue clusters representing structural fragments.
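
Residue labeling depends on a directory holding the four canonical-region dataframes `alpha_can.p`, `pi_can.p`, `three_can.p` and `pp2_can.p`. A small pre-flight check along the following lines may catch a misnamed file early. `check_canonical_dir` is a hypothetical helper, and how the directory is passed to `ssaxgeo_clusterResidues` is not shown here, so the sketch only inspects the files.

```bash
# Hypothetical helper: list the canonical-region dataframes that are
# missing from a directory. Prints one "missing: <file>" line per absent
# file and returns 0 only when all four are present.
check_canonical_dir() {
    can_dir=$1
    n_absent=0
    for can_name in alpha_can.p pi_can.p three_can.p pp2_can.p; do
        if [ ! -f "$can_dir/$can_name" ]; then
            echo "missing: $can_name"
            n_absent=1
        fi
    done
    return "$n_absent"
}
```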

### 4 - Select canonical regions
----
Once a csv with the fragments is obtained, canonical regions can be idenfied by filtering fragments for geometrical helices, and clustering those fragment based on density. A jupyter notebook to generate the canonical sets is provided at `notebooks/SetCanonicalRegions.ipynb`
5. **Select canonical regions**

----
Use `notebooks/SetCanonicalRegions.ipynb` to filter fragments for geometrical helices and derive canonical sets.
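
The command-line steps above share the same paths, so it can help to generate them from a few variables. The following is a sketch under assumptions: `print_pipeline` is a hypothetical helper, the flags are copied from the commands in this README, and it assumes the sampler writes `sampled_clust-30.csv` into the output directory, which should be checked against the actual output. It only prints the commands; nothing is executed.

```bash
# Hypothetical helper: print the commands for steps 1-4 with shared paths.
# Assumption: ssaxgeo_getSampleOfClstrPDB writes sampled_clust-30.csv
# into the directory given by -out_dir.
print_pipeline() {
    pdb_db=$1        # localpdb database directory
    work_dir=$2      # directory for the sampled CSV files
    n_cpus=${3:-1}   # number of CPUs passed to each step
    sampled_csv="$work_dir/sampled_clust-30.csv"
    updated_csv="$work_dir/sampled_clust-30_updated.csv"
    echo "localpdb_setup -db_path $pdb_db -plugins DSSP PDBClustering PDBChain --fetch_cif --fetch_pdb"
    echo "ssaxgeo_getSampleOfClstrPDB $pdb_db -out_dir $work_dir -redundancy 30 -res_lim 2.0 -ncpus $n_cpus -seed 0"
    echo "ssaxgeo_computePDBxgeo --mylocalpdb_path $pdb_db --sampled_clstrd_path $sampled_csv --xgeo_output_dir $pdb_db/xgeo_chains/ --ncpus $n_cpus --out_csv $updated_csv"
    echo "ssaxgeo_clusterResidues $updated_csv clust-30 -ncpus $n_cpus"
}

print_pipeline /path/to/mypdb /path/to/mydir 8
```

Printing rather than executing keeps the long-running `localpdb_setup` step under manual control.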

### Algorithm description
## Roadmap

---
## TODO: (ver 1.0)
### v1.0

- [x] bring diffgeo to be part of ssaxgeo
- [x] migrate code for canonical regions detection to rely on localpdb
- [x] add/update and adapt scripts to reproduce paper results more easily (via CLI)
- [ ] adapt old code scripts to rely on new structure
- [ ] adapt old code scripts to rely on new structure
- [x] ssaxgeo
- [x] computePDBxgeo
- [x] getSampleOfClstrPDB
@@ -135,7 +91,20 @@ Once a csv with the fragments is obtained, canonical regions can be idenfied by
- [ ] update documentation
- [ ] test end-to-end

## TODO: (ver 1.1)
### v1.1

- [ ] add citation
- [ ] add pymol viz support
- [ ] add xgeo Dlang code suport
- [ ] add xgeo Dlang code support

## Citation

If you use SSAxgeo in your research, please cite the associated publication.

## Contributing

Contributions are welcome. Please open an issue to discuss significant changes before submitting a pull request.

## License

This project is licensed under the MIT License.