Official implementation of "A unified pre-trained deep learning framework for cross-task reaction performance prediction and synthesis planning" (under review). We also provide a wiki document, auto-generated by DeepWiki from the latest commit. The preprint is available at ResearchSquare.
- Unified Architecture: Integrates graph neural networks with a Transformer to capture both intra- and inter-molecular information.
- Pre-trained Model: A pre-trained model that learns chemical bond transformation patterns from 13 million reactions.
- Cross-task Reaction Prediction: Simultaneous handling of (see the sketch after this list):
  - Yield, regioselectivity, and enantioselectivity estimation (regression)
  - Forward- and retro-synthesis planning (sequence generation)
- Performance: RXNGraphormer achieves state-of-the-art performance across eight benchmark datasets for reactivity/selectivity prediction and forward-/retro-synthesis planning, as well as three external real-world datasets for reactivity and selectivity prediction.
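All four tasks are served by a single inference entry point, `reaction_prediction`, with the checkpoint path and `task_type` argument selecting the task. A minimal sketch (the `TASKS` mapping and `predict` helper are illustrative, not part of the library; the paths and task-type strings are the ones used in the usage examples below):

from rxngraphormer.eval import reaction_prediction

# Illustrative only: paths and task_type values mirror the usage examples below.
# Note the expected input differs per task: full reaction SMILES for the two
# regression tasks, product SMILES for retro-synthesis, reactant SMILES for
# forward-synthesis.
TASKS = {
    "reactivity": "./model_path/buchwald_hartwig/seed0",
    "selectivity": "./model_path/thiol_addition/seed0",
    "retro-synthesis": "./model_path/USPTO_50k",
    "forward-synthesis": "./model_path/USPTO_480k",
}

def predict(task_type, smiles_lst):
    return reaction_prediction(TASKS[task_type], smiles_lst, task_type=task_type)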
conda create -n rxngraphormer python=3.8
conda activate rxngraphormer
# option 1: install pre-built package (recommended)
pip install rxngraphormer -i https://pypi.org/simple -f https://data.pyg.org/whl/torch-1.12.0+cu113.html --extra-index-url https://download.pytorch.org/whl/cu113
# option 2: install from source (development mode)
git clone https://github.com/licheng-xu-echo/RXNGraphormer.git
cd RXNGraphormer/
pip install -r requirements.txt -f https://data.pyg.org/whl/torch-1.12.0+cu113.html --extra-index-url https://download.pytorch.org/whl/cu113
pip install .
Note: all code was tested under Ubuntu 22.04.2 LTS.
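A quick sanity check that the installation resolved correctly (this simply imports the entry point used throughout the examples below):

python -c "from rxngraphormer.eval import reaction_prediction; print('RXNGraphormer import OK')"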
All model weights and preprocessed datasets are available via our figshare repository (before the manuscript is formally published, please visit the private link). Place these files in the model_path and dataset folders, following the structure below:
RXNGraphormer
├── rxngraphormer
├── model_path
│   ├── buchwald_hartwig/
│   ├── C_H_func/
│   ├── pretrained_classification_model/
│   ├── suzuki_miyaura/
│   ├── thiol_addition/
│   ├── USPTO_50k/
│   ├── USPTO_480k/
│   └── ...
├── dataset
│   ├── 50k_with_rxn_type/
│   ├── external_validation_dataset/
│   ├── OOS/
│   ├── pretrain/
│   ├── USPTO_50k/
│   ├── USPTO_480k/
│   ├── USPTO_full/
│   └── ...
└── ...
Reaction performance prediction
from rxngraphormer.eval import reaction_prediction
# reactivity prediction
bh_model_path = "./model_path/buchwald_hartwig/seed0"
rxn_smiles_lst = ["CC(C)c1cc(C(C)C)c(-c2ccccc2P(C2CCCCC2)C2CCCCC2)c(C(C)C)c1.CN(C)C(=NC(C)(C)C)N(C)C.Cc1ccc(N)cc1.FC(F)(F)c1ccc(I)cc1.c1ccc(-c2ccon2)cc1>>Cc1ccc(Nc2ccc(C(F)(F)F)cc2)cc1",
"CC(C)c1cc(C(C)C)c(-c2ccccc2P(C2CCCCC2)C2CCCCC2)c(C(C)C)c1.CCOC(=O)c1cc(C)on1.CN(C)C(=NC(C)(C)C)N(C)C.Cc1ccc(N)cc1.FC(F)(F)c1ccc(Cl)cc1>>Cc1ccc(Nc2ccc(C(F)(F)F)cc2)cc1"]
bh_react_preds = reaction_prediction(bh_model_path, rxn_smiles_lst, task_type="reactivity")
# selectivity prediction
thiol_add_model_path = "./model_path/thiol_addition/seed0"
rxn_smiles_lst = ["COCc1cccc(-c2cc3c(c4c2OP(=O)(O)Oc2c(-c5cccc(COC)c5)cc5c(c2-4)CCCC5)CCCC3)c1.O=C(/N=C/c1ccccc1)c1ccccc1.Sc1ccccc1>>O=C(NC(Sc1ccccc1)c1ccccc1)c1ccccc1",
"COCc1cccc(-c2cc3ccccc3c3c2OP(=O)(O)Oc2c(-c4cccc(COC)c4)cc4ccccc4c2-3)c1.O=C(/N=C/c1cccc2ccccc12)c1ccccc1.SC1CCCCC1>>O=C(NC(SC1CCCCC1)c1cccc2ccccc12)c1ccccc1"]
thiol_add_sel_preds = reaction_prediction(thiol_add_model_path, rxn_smiles_lst, task_type="selectivity")
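# a minimal sketch (not part of the library): pairing each input reaction with
# its prediction, assuming the regression tasks return one numeric value per reaction
for rxn, pred in zip(rxn_smiles_lst, thiol_add_sel_preds):
    print(f"{pred:.3f}  {rxn}")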
# It takes about 25 seconds to predict 100 reactions on a single AMD 7735HS CPU without GPU.
Reaction synthesis planning prediction
from rxngraphormer.eval import reaction_prediction
# retro-synthesis planning
uspto_50k_model_path = "./model_path/USPTO_50k"
pdt_smiles_lst = ['COC(=O)[C@H](CCCCN)NC(=O)Nc1cc(OC)cc(C(C)(C)C)c1O',
'O=C(Nc1cccc2cnccc12)c1cc([N+](=O)[O-])c(Sc2c(Cl)cncc2Cl)s1',
'CCN(CC)Cc1ccc(-c2nc(C)c(COc3ccc([C@H](CC(=O)N4C(=O)OC[C@@H]4Cc4ccccc4)c4ccon4)cc3)s2)cc1']
rct_preds = reaction_prediction(uspto_50k_model_path, pdt_smiles_lst, task_type="retro-synthesis")
# forward-synthesis planning
uspto_480k_model_path = "./model_path/USPTO_480k"
rct_smiles_lst = ['C1CCOC1.CC(C)C[Mg+].CON(C)C(=O)c1ccc(O)nc1.[Cl-]',
'CN.O.O=C(O)c1ccc(Cl)c([N+](=O)[O-])c1',
'CCn1cc(C(=O)O)c(=O)c2cc(F)c(-c3ccc(N)cc3)cc21.O=CO']
pdt_preds = reaction_prediction(uspto_480k_model_path, rct_smiles_lst, task_type="forward-synthesis")
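# a minimal sketch (not part of the library): pairing each reactant set with the
# predicted product, assuming the generation tasks return one SMILES string per input
for rct, pdt in zip(rct_smiles_lst, pdt_preds):
    print(rct, "->", pdt)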
# It takes about 40 seconds to predict 100 reactions on a single AMD 7735HS CPU without GPU.
Reaction embeddings generation
from rxngraphormer.rxn_emb import RXNEMB
# based on the pre-trained model
pretrain_model_path = "./model_path/pretrained_classification_model"
rxnemb_calc_pretrained = RXNEMB(pretrained_model_path=pretrain_model_path, model_type="classifier")
rxn_emb_pretrained = rxnemb_calc_pretrained.gen_rxn_emb(["C1CCCCC1.CCO.CS(=O)(=O)N1CCN(Cc2ccccc2)CC1.[OH-].[OH-].[Pd+2]>>CS(=O)(=O)N1CCNCC1",
"CCOC(C)=O.Cc1cc([N+](=O)[O-])ccc1NC(=O)c1ccccc1.Cl[Sn]Cl.O.O.O=C([O-])O.[Na+]>>Cc1cc(N)ccc1NC(=O)c1ccccc1",
"COc1ccc(-c2coc3ccc(-c4nnc(S)o4)cc23)cc1.COc1ccc(CCl)cc1F>>COc1ccc(-c2coc3ccc(-c4nnc(SCc5ccc(OC)c(F)c5)o4)cc23)cc1"])
# based on the fine-tuned model
finetune_model_path = "./model_path/buchwald_hartwig/seed0"
rxnemb_calc_finetuned = RXNEMB(pretrained_model_path=finetune_model_path, model_type="regressor")
rxn_emb_finetuned = rxnemb_calc_finetuned.gen_rxn_emb(["CC(C)c1cc(C(C)C)c(-c2ccccc2P(C2CCCCC2)C2CCCCC2)c(C(C)C)c1.CN(C)C(=NC(C)(C)C)N(C)C.Cc1ccc(N)cc1.FC(F)(F)c1ccc(I)cc1.c1ccc(-c2ccon2)cc1>>Cc1ccc(Nc2ccc(C(F)(F)F)cc2)cc1",
"CCN=P(N=P(N(C)C)(N(C)C)N(C)C)(N(C)C)N(C)C.COc1ccc(OC)c(P([C@]23C[C@H]4C[C@H](C[C@H](C4)C2)C3)[C@]23C[C@H]4C[C@H](C[C@H](C4)C2)C3)c1-c1c(C(C)C)cc(C(C)C)cc1C(C)C.Cc1ccc(N)cc1.Ic1cccnc1.c1ccc(-c2ccon2)cc1>>Cc1ccc(Nc2cccnc2)cc1",
"CC(C)c1cc(C(C)C)c(-c2ccccc2P(C2CCCCC2)C2CCCCC2)c(C(C)C)c1.CCc1ccc(I)cc1.CN1CCCN2CCCN=C12.Cc1ccc(N)cc1.c1ccc2oncc2c1>>CCc1ccc(Nc2ccc(C)cc2)cc1"])Although the script automatically preprocesses the data when you start the model training program, we recommend that you perform data preprocessing separately to avoid errors when using multi-gpu training models. Multi-gpu is usually used for model pretraining and fine-tuning for reaction synthesis planning tasks.
In the pre-training task, multiple data sub-files are used to build the training and validation sets. The file_num_trunck parameter in the config JSON controls how many sub-files are used; if it is set to 0, all sub-files are used.
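For example, to build the pre-training sets from only 10 sub-files, the corresponding entry in the config JSON would look like this (illustrative fragment; all other keys are omitted):

{
    "file_num_trunck": 10
}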
python data_preprocess.py --config_json ./config/pretrain_parameters.json # preprocess pretraining data
CUDA_VISIBLE_DEVICES="0,1" python -m torch.distributed.launch --nproc_per_node 2 train_model.py --config_json ./config/pretrain_parameters.json # pretrain model with multi-gpu
python train_model.py --config_json ./config/C_H_func_parameters.json # finetune regression model
python train_model.py --config_json ./config/uspto_50k_parameters.json # finetune sequence generation model
python eval_model.py --config_json ./config/buchwald_hartwig_eval.json # evaluate regression model
python eval_model.py --config_json ./config/uspto_50k_eval.json # evaluate sequence generation model
Here we provide a notebook demonstrating how to:
- generate fictitious reactions from real reactions
- generate delta-mol SMILES from reaction SMILES
- evaluate the regression performance of RXNGraphormer
- use embeddings from the pre-trained RXNGraphormer to measure the distance between different reaction types in USPTO-50k
- use embeddings from the fine-tuned RXNGraphormer to examine the model's understanding of reaction structure-performance relationships
Before the paper is formally published, please cite the following preprint: https://doi.org/10.21203/rs.3.rs-5994908/v1