@@ -26,7 +26,7 @@ The Enhanced Database of Interacting Protein Structures for Interface Prediction
* Benchmark results included in our paper were run after this issue was resolved
* However, if you ran experiments using DB5-Plus' filename list for its test complexes, please re-run them using the latest list

- ## How to run creation tools
+ ## How to set up

First, download Mamba (if not already downloaded):
``` bash
@@ -51,66 +51,135 @@ conda activate DIPS-Plus # Note: One still needs to use `conda` to (de)activate
pip3 install -e .
```

- ## Default DIPS-Plus directory structure
+ To install PSAIA for feature generation, first install GCC 10:
+
+ ``` bash
+ # Install GCC 10 for Ubuntu 20.04:
+ sudo apt install software-properties-common
+ sudo add-apt-repository ppa:ubuntu-toolchain-r/ppa
+ sudo apt update
+ sudo apt install gcc-10 g++-10
+
+ # Or install GCC 10 for Arch Linux/Manjaro:
+ yay -S gcc10
+ ```
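+
+ If multiple GCC versions coexist on your system, the PSAIA build below may pick up the wrong compiler. As a hedged sketch (whether the qmake-generated Makefiles honor these variables depends on your Qt mkspecs), one can point the build at GCC 10 explicitly:
+
+ ``` bash
+ # Assumption: the build system respects CC/CXX; adjust if your qmake spec hardcodes g++
+ export CC=gcc-10
+ export CXX=g++-10
+ ```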
+
+ Then install QT4 for PSAIA:
+
+ ``` bash
+ # Install QT4 for Ubuntu 20.04:
+ sudo add-apt-repository ppa:rock-core/qt4
+ sudo apt update
+ sudo apt install libqt4* libqtcore4 libqtgui4 libqtwebkit4 qt4* libxext-dev
+
+ # Or install QT4 for Arch Linux/Manjaro:
+ yay -S qt4
+ ```
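+
+ Before compiling, it can help to confirm the Qt 4 toolchain is on the `PATH` (the `qmake-qt4` binary name matches the Ubuntu packages above and is what the build steps below invoke; on other distributions it may simply be `qmake`):
+
+ ``` bash
+ # Should report a Qt 4.x version if the installation above succeeded
+ qmake-qt4 -v
+ ```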
+
+ Conclude by compiling PSAIA from source:
+
+ ``` bash
+ # Select the location to install the software:
+ MY_LOCAL=~/Programs
+
+ # Download and extract PSAIA's source code:
+ mkdir "$MY_LOCAL"
+ cd "$MY_LOCAL"
+ wget http://complex.zesoi.fer.hr/data/PSAIA-1.0-source.tar.gz
+ tar -xvzf PSAIA-1.0-source.tar.gz
+
+ # Compile PSAIA (i.e., a GUI for PSA):
+ cd PSAIA_1.0_source/make/linux/psaia/
+ qmake-qt4 psaia.pro
+ make
+
+ # Compile PSA (i.e., the protein structure analysis (PSA) program):
+ cd ../psa/
+ qmake-qt4 psa.pro
+ make
+
+ # Compile PIA (i.e., the protein interaction analysis (PIA) program):
+ cd ../pia/
+ qmake-qt4 pia.pro
+ make
+
+ # Test run any of the above-compiled programs:
+ cd "$MY_LOCAL"/PSAIA_1.0_source/bin/linux
+ # Test run PSAIA inside a GUI:
+ ./psaia/psaia
+ # Test run PIA through a terminal:
+ ./pia/pia
+ # Test run PSA through a terminal:
+ ./psa/psa
+ ```
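+
+ The DIPS-Plus scripts further below locate PSAIA through the `$PSAIADIR` environment variable. A minimal sketch of setting it, assuming it should point at the `bin/linux` directory produced by the build above (the exact directory `generate_psaia_features.py` expects is an assumption here):
+
+ ``` bash
+ # Assumption: $PSAIADIR points at the freshly built PSAIA/PSA/PIA binaries
+ export PSAIADIR="$MY_LOCAL/PSAIA_1.0_source/bin/linux"
+ ```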
+
+ Lastly, install Docker by following the instructions at https://docs.docker.com/engine/install/
+
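+ For example, on Ubuntu one can use Docker's documented convenience script (a sketch; consult the link above for distribution-specific and rootless installs):
+
+ ``` bash
+ # Download and run Docker's official convenience install script
+ curl -fsSL https://get.docker.com -o get-docker.sh
+ sudo sh get-docker.sh
+ ```
+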
+ ## How to generate protein feature inputs
+ In our [feature generation notebook](notebooks/feature_generation.ipynb), we provide examples of how users can generate the protein features described in our [accompanying manuscript](https://arxiv.org/abs/2106.04362) for individual protein inputs.
+
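+ For instance, assuming Jupyter is installed in the `DIPS-Plus` Conda environment, the notebook can be launched locally as follows:
+
+ ``` bash
+ # Open the feature generation walkthrough in a browser
+ jupyter notebook notebooks/feature_generation.ipynb
+ ```
+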
+ ## How to use data
+ In our [data usage notebook](notebooks/data_usage.ipynb), we provide examples of how users might use DIPS-Plus (or DB5-Plus) for downstream analysis or prediction tasks. For example, to train a new NeiA model with DB5-Plus as its cross-validation dataset, first download DB5-Plus' raw files and process them via the `data_usage` notebook:
+
+ ``` bash
+ mkdir -p project/datasets/DB5/final
+ wget https://zenodo.org/record/5134732/files/final_raw_db5.tar.gz -O project/datasets/DB5/final/final_raw_db5.tar.gz
+ tar -xzf project/datasets/DB5/final/final_raw_db5.tar.gz -C project/datasets/DB5/final/
+
+ # To process these raw files for training and subsequently train a model:
+ python3 notebooks/data_usage.py
+ ```
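+
+ As a quick sanity check before processing, one can confirm the archive unpacked into `final/raw` as reflected in the directory structure below (the exact file listing will vary):
+
+ ``` bash
+ # List a few of the extracted raw DB5 files
+ ls project/datasets/DB5/final/raw | head
+ ```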
+
+ ## Standard DIPS-Plus directory structure

```
DIPS-Plus
│
└───project
- │    │
- │    └───datasets
- │    │    │
- │    │    └───builder
- │    │    │
- │    │    └───DB5
- │    │    │    │
- │    │    │    └───final
- │    │    │    │    │
- │    │    │    │    └───raw
- │    │    │    │
- │    │    │    └───interim
- │    │    │    │    │
- │    │    │    │    └───complexes
- │    │    │    │    │
- │    │    │    │    └───external_feats
- │    │    │    │    │
- │    │    │    │    └───pairs
- │    │    │    │
- │    │    │    └───raw
- │    │    │    │
- │    │    │    README
- │    │    │
- │    │    └───DIPS
- │    │    │
- │    │    └───filters
- │    │    │
- │    │    └───final
- │    │    │    │
- │    │    │    └───raw
- │    │    │
- │    │    └───interim
- │    │    │    │
- │    │    │    └───complexes
- │    │    │    │
- │    │    │    └───external_feats
- │    │    │    │
- │    │    │    └───pairs-pruned
- │    │    │
- │    │    └───raw
- │    │    │
- │    │    └───pdb
- │    │
- │    └───utils
- │        constants.py
- │        utils.py
- │
- .gitignore
- environment.yml
- LICENSE
- README.md
- requirements.txt
- setup.cfg
- setup.py
+      │
+      └───datasets
+           │
+           └───DB5
+           │    │
+           │    └───final
+           │    │    │
+           │    │    └───processed # task-ready features for each dataset example
+           │    │    │
+           │    │    └───raw # generic features for each dataset example
+           │    │
+           │    └───interim
+           │    │    │
+           │    │    └───complexes # metadata for each dataset example
+           │    │    │
+           │    │    └───external_feats # features curated for each dataset example using external tools
+           │    │    │
+           │    │    └───pairs # pair-wise features for each dataset example
+           │    │
+           │    └───raw # raw PDB data downloads for each dataset example
+           │
+           └───DIPS
+                │
+                └───filters # filters to apply to each (un-pruned) dataset example
+                │
+                └───final
+                │    │
+                │    └───processed # task-ready features for each dataset example
+                │    │
+                │    └───raw # generic features for each dataset example
+                │
+                └───interim
+                │    │
+                │    └───complexes # metadata for each dataset example
+                │    │
+                │    └───external_feats # features curated for each dataset example using external tools
+                │    │
+                │    └───pairs-pruned # filtered pair-wise features for each dataset example
+                │    │
+                │    └───parsed # pair-wise features for each dataset example after initial parsing
+                │
+                └───raw
+                     │
+                     └───pdb # raw PDB data downloads for each dataset example
```

## How to compile DIPS-Plus from scratch
@@ -122,7 +191,7 @@ Retrieve protein complexes from the RCSB PDB and build out directory structure:
rm project/datasets/DIPS/final/raw/pairs-postprocessed.txt project/datasets/DIPS/final/raw/pairs-postprocessed-train.txt project/datasets/DIPS/final/raw/pairs-postprocessed-val.txt project/datasets/DIPS/final/raw/pairs-postprocessed-test.txt

# Create data directories (if not already created):
- mkdir project/datasets/DIPS/raw project/datasets/DIPS/raw/pdb project/datasets/DIPS/interim project/datasets/DIPS/interim/external_feats project/datasets/DIPS/final project/datasets/DIPS/final/raw project/datasets/DIPS/final/processed
+ mkdir project/datasets/DIPS/raw project/datasets/DIPS/raw/pdb project/datasets/DIPS/interim project/datasets/DIPS/interim/pairs-pruned project/datasets/DIPS/interim/external_feats project/datasets/DIPS/final project/datasets/DIPS/final/raw project/datasets/DIPS/final/processed

# Download the raw PDB files:
rsync -rlpt -v -z --delete --port=33444 --include='*.gz' --include='*.xz' --include='*/' --exclude '*' \
@@ -139,7 +208,17 @@ python3 project/datasets/builder/prune_pairs.py project/datasets/DIPS/interim/pa

# Generate externally-sourced features:
python3 project/datasets/builder/generate_psaia_features.py "$PSAIADIR" "$PROJDIR"/project/datasets/builder/psaia_config_file_dips.txt "$PROJDIR"/project/datasets/DIPS/raw/pdb "$PROJDIR"/project/datasets/DIPS/interim/parsed "$PROJDIR"/project/datasets/DIPS/interim/pairs-pruned "$PROJDIR"/project/datasets/DIPS/interim/external_feats --source_type rcsb
- python3 project/datasets/builder/generate_hhsuite_features.py "$PROJDIR"/project/datasets/DIPS/interim/parsed "$PROJDIR"/project/datasets/DIPS/interim/pairs-pruned "$HHSUITE_DB" "$PROJDIR"/project/datasets/DIPS/interim/external_feats --num_cpu_jobs 4 --num_cpus_per_job 8 --num_iter 2 --source_type rcsb --write_file
+ python3 project/datasets/builder/generate_hhsuite_features.py "$PROJDIR"/project/datasets/DIPS/interim/parsed "$PROJDIR"/project/datasets/DIPS/interim/pairs-pruned "$HHSUITE_DB" "$PROJDIR"/project/datasets/DIPS/interim/external_feats --num_cpu_jobs 4 --num_cpus_per_job 8 --num_iter 2 --source_type rcsb --write_file # Note: After this, one needs to re-run this command with `--read_file` instead
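+ # For example, the follow-up pass mentioned in the note above reuses the same arguments with `--read_file`:
+ python3 project/datasets/builder/generate_hhsuite_features.py "$PROJDIR"/project/datasets/DIPS/interim/parsed "$PROJDIR"/project/datasets/DIPS/interim/pairs-pruned "$HHSUITE_DB" "$PROJDIR"/project/datasets/DIPS/interim/external_feats --num_cpu_jobs 4 --num_cpus_per_job 8 --num_iter 2 --source_type rcsb --read_file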
+
+ # Generate multiple sequence alignments (MSAs) using a smaller sequence database (if not already created using the standard BFD):
+ DOWNLOAD_DIR="$HHSUITE_DB_DIR" && ROOT_DIR="${DOWNLOAD_DIR}/small_bfd" && SOURCE_URL="https://storage.googleapis.com/alphafold-databases/reduced_dbs/bfd-first_non_consensus_sequences.fasta.gz" && BASENAME=$(basename "${SOURCE_URL}") && mkdir --parents "${ROOT_DIR}" && aria2c "${SOURCE_URL}" --dir="${ROOT_DIR}" && pushd "${ROOT_DIR}" && gunzip "${ROOT_DIR}/${BASENAME}" && popd # e.g., Download the small BFD
+ python3 project/datasets/builder/generate_hhsuite_features.py "$PROJDIR"/project/datasets/DIPS/interim/parsed "$PROJDIR"/project/datasets/DIPS/interim/pairs-pruned "$HHSUITE_DB_DIR"/small_bfd "$PROJDIR"/project/datasets/DIPS/interim/external_feats --num_cpu_jobs 4 --num_cpus_per_job 8 --num_iter 2 --source_type rcsb --generate_msa_only --write_file # Note: After this, one needs to re-run this command with `--read_file` instead
+
+ # Identify interfaces within intrinsically disordered regions (IDRs) #
+ # (1) Pull down the Docker image for `flDPnn`
+ docker pull docker.io/sinaghadermarzi/fldpnn
+ # (2) For all sequences in the dataset, predict which interface residues reside within IDRs
+ python3 project/datasets/builder/annotate_idr_interfaces.py "$PROJDIR"/project/datasets/DIPS/final/raw

# Add new features to the filtered pairs, ensuring that the pruned pairs' original PDB files are stored locally for DSSP:
python3 project/datasets/builder/download_missing_pruned_pair_pdbs.py "$PROJDIR"/project/datasets/DIPS/raw/pdb "$PROJDIR"/project/datasets/DIPS/interim/pairs-pruned --num_cpus 32 --rank "$1" --size "$2"