120 changes: 106 additions & 14 deletions README.md
@@ -5,20 +5,107 @@ Predicting enzyme kinetic parameters is a crucial task in enzyme discovery and e

<img src="models/catapro.png">

## Create the CataPro environment
To run CataPro, you should create a conda environment that includes the following packages:
---

pytorch >= 1.13.0
transformers
numpy
pandas
RDKit
## Installation

In addition, CataPro also relies on additional pre-trained models, including [prot_t5_xl_uniref50](https://huggingface.co/Rostlab/prot_t5_xl_uniref50) and [molt5-base-smiles2caption](https://huggingface.co/laituan245/molt5-base-smiles2caption). These two models are used for extracting features from enzymes and substrates, respectively. You need to place the weights for these two pre-trained models in the `models` directory.
## Set up a Python environment

To ensure a clean and isolated setup, we recommend using [uv](https://docs.astral.sh/uv/), a lightweight tool that simplifies Python environment and package management. If you don’t have it yet, install it first:

```bash
# macOS / Linux
curl -LsSf https://astral.sh/uv/install.sh | sh
```

```powershell
# Windows
powershell -ExecutionPolicy ByPass -c "irm https://astral.sh/uv/install.ps1 | iex"
$env:Path += ";$env:USERPROFILE\.local\bin"
```

Create and activate a virtual environment with uv:

```bash
# macOS / Linux
uv venv
source .venv/bin/activate
```

```powershell
# Windows
uv venv
.venv\Scripts\activate
```

## Install dependencies

```bash
uv pip install torch transformers numpy pandas RDKit sentencepiece
```
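
To confirm that the environment is complete, a small check along the following lines can help. It is only a sketch: the import names are assumed from the package list above (the RDKit distribution is imported as `rdkit`), and `missing_modules` is a hypothetical helper, not part of CataPro.

```python
from importlib.util import find_spec

# Import names assumed for the packages installed above. The RDKit
# distribution is imported as `rdkit` and PyTorch as `torch`.
REQUIRED_MODULES = ["torch", "transformers", "numpy", "pandas", "rdkit", "sentencepiece"]


def missing_modules(names):
    """Return the module names that cannot be found in the current environment."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(REQUIRED_MODULES)
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All CataPro dependencies are importable.")
```

Run it with the virtual environment activated; any name it reports can be installed with `uv pip install`.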

### 2. Clone the CataPro repository

```bash
git clone https://github.com/zchwang/CataPro
```

### 3. Set up Git LFS

CataPro uses [Git Large File Storage (LFS)](https://git-lfs.github.com/) to handle large model files.
If you don't have Git LFS installed, install it first by following the instructions on the Git LFS website, then set it up for your user account with the following command:

```bash
git lfs install
```

### 4. Download the models

CataPro also relies on two pre-trained models, [prot_t5_xl_uniref50](https://huggingface.co/Rostlab/prot_t5_xl_uniref50) and [molt5-base-smiles2caption](https://huggingface.co/laituan245/molt5-base-smiles2caption), which extract features from enzymes and substrates, respectively. The commands below download both into the `models` directory.

> [!WARNING]
> The models prot_t5_xl_uniref50 and molt5-base-smiles2caption required for CataPro are 64 and 1.9 GB,
> respectively.

```bash
# macOS / Linux
cd CataPro/models/

GIT_LFS_SKIP_SMUDGE=1 git clone https://huggingface.co/Rostlab/prot_t5_xl_uniref50
cd prot_t5_xl_uniref50
git lfs pull

cd ..

GIT_LFS_SKIP_SMUDGE=1 git clone https://huggingface.co/laituan245/molt5-base-smiles2caption
cd molt5-base-smiles2caption
git lfs pull

cd ../..
```

```powershell
# Windows
cd CataPro\models

git -c filter.lfs.smudge= -c filter.lfs.required=false clone https://huggingface.co/Rostlab/prot_t5_xl_uniref50
cd prot_t5_xl_uniref50
git lfs pull

cd ..

git -c filter.lfs.smudge= -c filter.lfs.required=false clone https://huggingface.co/laituan245/molt5-base-smiles2caption
cd molt5-base-smiles2caption
git lfs pull

cd ../..
```
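
Because the clone step skips the large files and `git lfs pull` fetches them afterwards, an interrupted pull can leave small Git LFS pointer files in place of the real weights. The sketch below looks for such leftovers; it relies on the fact that a pointer file starts with the line `version https://git-lfs.github.com/spec/v1`. The helper `find_lfs_pointers` is hypothetical, and the script assumes it is run from the CataPro folder.

```python
from pathlib import Path

# First line of every Git LFS pointer file, per the Git LFS pointer format.
LFS_POINTER_PREFIX = b"version https://git-lfs.github.com/spec/v1"


def find_lfs_pointers(model_dir):
    """Return files under model_dir that are still Git LFS pointers."""
    pointers = []
    for path in sorted(Path(model_dir).rglob("*")):
        # Skip Git's own metadata and anything that is not a regular file.
        if ".git" in path.parts or not path.is_file():
            continue
        with path.open("rb") as handle:
            if handle.read(len(LFS_POINTER_PREFIX)) == LFS_POINTER_PREFIX:
                pointers.append(path)
    return pointers


if __name__ == "__main__":
    for name in ("prot_t5_xl_uniref50", "molt5-base-smiles2caption"):
        model_dir = Path("models") / name
        if not model_dir.is_dir():
            print(f"{model_dir}: directory not found")
            continue
        leftover = find_lfs_pointers(model_dir)
        print(f"{model_dir}: {len(leftover)} file(s) still to be pulled")
```

If a directory reports leftover files, running `git lfs pull` again inside it should fetch the missing weights.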

---

## Contact
Zechen Wang, PhD, Shandong University, wangzch97@gmail.com</p>

---

## Usage
### 1. Prepare the input files for inference
Enzyme and substrate information should be organized in a CSV file (for example, one written from a pandas DataFrame). Each enzyme-substrate pair must include the Enzyme_id, the type (wild-type or mutant), the enzyme sequence, and the substrate's SMILES. The format is as follows:
@@ -33,14 +120,19 @@ You can also refer to a sample file samples/sample_inp.csv
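
As a rough illustration of preparing such a file, the sketch below writes enzyme-substrate pairs to CSV with the standard-library `csv` module (a pandas DataFrame saved with `to_csv` would work as well). Only `Enzyme_id` and `type` are named in the text; the column names `sequence` and `smiles`, the example values, and the helper `write_input_csv` are assumptions, so check `samples/sample_inp.csv` for the exact header before using it.

```python
import csv

# "Enzyme_id" and "type" are named in the README text; "sequence" and "smiles"
# are assumed column names. Check samples/sample_inp.csv for the real header.
COLUMNS = ["Enzyme_id", "type", "sequence", "smiles"]


def write_input_csv(rows, path, columns=COLUMNS):
    """Write enzyme-substrate pairs to a CSV file, rejecting incomplete rows."""
    for i, row in enumerate(rows):
        empty = [c for c in columns if not str(row.get(c, "")).strip()]
        if empty:
            raise ValueError(f"row {i} is missing values for: {', '.join(empty)}")
    with open(path, "w", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=columns)
        writer.writeheader()
        writer.writerows(rows)


# Placeholder pair: a made-up short sequence and ethanol as the substrate.
example_rows = [
    {"Enzyme_id": "enzyme_1", "type": "wild-type", "sequence": "MKTAYIAKQR", "smiles": "CCO"},
]
```

Calling `write_input_csv(example_rows, "my_input.csv")` produces a one-row file; a row with a blank field raises a `ValueError` instead of being written.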

### 2. Next, you can use the following command to run CataPro to infer the kinetic parameters of the enzymatic reaction:

python predict.py \
-inp_fpath samples/sample_inp.csv \
-model_dpath models \
-batch_size 64 \
-device cuda:0 \
-out_fpath catapro_prediction.csv
```bash
# In CataPro folder
python inference/predict.py \
-inp_fpath samples/sample_inp.csv \
-model_dpath models \
-batch_size 64 \
-device cuda:0 \
-out_fpath catapro_prediction.csv
```

Finally, the prediction results from CataPro are stored in the `catapro_prediction.csv` file. You can also run `bash run_catapro.sh` directly in the `inference` directory to perform the same steps.
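
When predictions are launched from another Python script, the same command can be assembled as an argument list. This is a minimal sketch that reuses only the flags shown in the command above; `build_predict_command` is a hypothetical helper, and it assumes the script is run from the CataPro folder.

```python
import subprocess
import sys


def build_predict_command(inp_fpath, out_fpath, model_dpath="models",
                          batch_size=64, device="cuda:0"):
    """Assemble the predict.py call from the README as an argument list."""
    return [
        sys.executable, "inference/predict.py",
        "-inp_fpath", str(inp_fpath),
        "-model_dpath", str(model_dpath),
        "-batch_size", str(batch_size),
        "-device", device,
        "-out_fpath", str(out_fpath),
    ]


if __name__ == "__main__":
    cmd = build_predict_command("samples/sample_inp.csv", "catapro_prediction.csv")
    print(" ".join(cmd))
    # Uncomment to launch the prediction from the CataPro folder:
    # subprocess.run(cmd, check=True)
```

Using `sys.executable` keeps the call inside the active virtual environment, and passing a list to `subprocess.run` avoids shell quoting issues with file paths.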

---

## Question and Answer
To be updated ...
4 changes: 2 additions & 2 deletions inference/act_model.py
@@ -50,8 +50,8 @@ def __init__(self, rate=0.0, alpha=0.4, device="cuda:0"):
super(ActivityModel, self).__init__()
self.alpha = alpha

self.kcat_model = KcatModel().to(device)
self.Km_model = KmModel().to(device)
self.kcat_model = KcatModel(device=device).to(device)
self.Km_model = KmModel(device=device).to(device)

self.prot_norm = nn.BatchNorm1d(1024).to(device)
self.molt5_norm = nn.BatchNorm1d(768).to(device)