MJ-Zeng/RXNGraphormer-Reproduction

RXNGraphormer Reproduction

This repository provides a reproduction of 🔗 RXNGraphormer, a unified pre-trained framework for reaction performance prediction and synthesis planning. The original work is by Xu et al.

This reproduction focuses on validating core functionalities, reaction-type analysis, and extending evaluation to external literature datasets.


📁 Directory Structure

RXNGraphormer/reproduction/
├── 1_basic_usage.ipynb           # Core model functionality verification
├── 2_Reaction_Type_Visual.ipynb  # Reaction type discrimination & clustering (part new work)
├── 2_ReactionType_Res/           # Saved visualization results for reaction type analysis
├── 3_regression.sh               # Regression task training script
├── 3_png/                        # Output figures from regression experiments
├── 4_USPTO.sh                    # USPTO-style dataset training for sequence generation
├── 4_uspto/                      # Logs and outputs for USPTO experiments
├── 5_SPR.ipynb                   # Structure-performance relationship analysis
├── 6_experiment_results.ipynb    # External validation on literature datasets (new work)
├── 7_finetune_guide.ipynb        # Fine-tuning user guide and example
├── finetune_grid_search.py       # Grid search script for fine-tuning
├── 7_finetune_results/           # Logs, configs, and results from fine-tuning experiments
└── README.md                     # Documentation

🗂️ Project Organization Update

For better reproducibility, the internal directory structures of config, dataset, and model_path have been reorganized compared to the original repository.


⚙️ Reproduction Setup

This reproduction uses the original pre-trained model weights; we only perform fine-tuning on downstream tasks (e.g., yield, selectivity prediction).
For sequence generation tasks, models are fine-tuned on USPTO-50k and USPTO-480k, while the USPTO-full model is evaluated without retraining.

All training logs and checkpoints are saved under corresponding subdirectories in model_path/.

# Install the additional dependency for reaction-type clustering
pip install hdbscan

✅ hdbscan is used in 2_Reaction_Type_Visual.ipynb for unsupervised clustering of reaction embeddings.


📦 Datasets and Training Artifacts

  • For all datasets (USPTO_STEREO, USPTO_full, USPTO_480k, USPTO_50k, OOS, external_validation_dataset, 50k_with_rxn_type, and benchmark):
    Download from the original model's Figshare repository.
    These preprocessed datasets are part of the original RXNGraphormer release.

  • For Test.zip:
    Download from our Figshare repository.
    This test set contains newly curated real-world reaction data from literature and high-throughput experimentation (HTE) for external validation.

Note:

  • All model checkpoints, training logs, and evaluation results are available in our Figshare repository and correspond to our independent reproduction runs.
  • Please follow the dataset directory structure outlined below after extraction.
  • 💡 This ensures full reproducibility of all experiments presented in the reproduction/ notebooks and scripts.

🧪 What This Reproduction Covers

  • ✅ Basic inference and embedding generation
  • ✅ Reaction type classification and unsupervised clustering
  • ✅ Regression tasks (yield, regioselectivity, enantioselectivity)
  • ✅ Sequence generation (forward synthesis and retrosynthesis) on the USPTO datasets
  • ✅ Structure-performance relationship (SPR) analysis
  • ✅ External validation on real-world HTE or literature datasets

📊 Experiment Results

Regression Performance Comparison

The following table compares the regression performance of the original RXNGraphormer and this reproduction across benchmark, out-of-sample (OOS), and external datasets.

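| Category | Dataset | Split | RXNGraphormer R² | RXNGraphormer MAE | RXNGraphormer Precision | RXNGraphormer ACC | Reproduction R² | Reproduction MAE | Reproduction Precision | Reproduction ACC |
|---|---|---|---|---|---|---|---|---|---|---|
| Benchmark datasets | Buchwald–Hartwig | | 0.971 | 2.980 | / | / | 0.970±0.003 | 3.079±0.144 | / | / |
| | Suzuki–Miyaura | | 0.876 | 6.300 | / | / | 0.871±0.009 | 6.431±0.187 | / | / |
| | C–H functionalization | | 0.992 | 0.266 | / | / | 0.992±0.001 | 0.273±0.007 | / | / |
| | Asymmetric thiol | | 0.915 | 0.134 | / | / | 0.916±0.010 | 0.135±0.007 | / | / |
| OOS | Buchwald–Hartwig | Additive 1 | 0.883 | 6.430 | / | / | 0.815 | 8.310 | / | / |
| | | Additive 2 | 0.906 | 6.000 | / | / | 0.897 | 6.280 | / | / |
| | | Additive 3 | 0.792 | 8.500 | / | / | 0.651 | 10.399 | / | / |
| | | Additive 4 | 0.736 | 9.940 | / | / | 0.643 | 10.966 | / | / |
| | | Bromide | 0.890 | 5.810 | / | / | 0.869 | 5.934 | / | / |
| | | Chloride | -0.053 | 15.120 | / | / | -0.377 | 18.879 | / | / |
| | | Iodide | 0.823 | 7.540 | / | / | 0.844 | 7.186 | / | / |
| | | Component-combination | 0.725 | 10.120 | / | / | 0.732 | 9.457 | / | / |
| | Thiol addition | Cat | 0.781 | 0.236 | / | / | 0.804 | 0.230 | / | / |
| | | Sub | 0.923 | 0.138 | / | / | 0.915 | 0.138 | / | / |
| | | Sub and Cat | 0.804 | 0.248 | / | / | 0.732 | 0.257 | / | / |
| External | Nicolit Avg | | 0.308 | 21.760 | 0.793 | 0.732 | 0.209 | 37.199 | 0.796 | 0.730 |
| | Asymmetric hydrogenation of olefins | | 0.832 | 0.371 | / | / | 0.739 | 0.477 | / | / |
| | Pallada-electrocatalyzed C–H activation | | 0.924 | 0.211 | / | / | 0.900 | 0.196 | / | / |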
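
For readers less familiar with the two regression metrics reported above, here is a minimal sketch of their conventional definitions. It is not taken from this repository's evaluation code: the function names and the toy values are illustrative assumptions, and using the sample standard deviation for the "mean±std" figures is also an assumption. The last example shows why R² can be negative, as in the Chloride row: the predictions fit worse than simply predicting the mean of the targets.

```python
from statistics import mean, stdev


def mae(y_true, y_pred):
    """Mean absolute error between targets and predictions."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_bar = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - y_bar) ** 2 for t in y_true)
    if ss_tot == 0:
        raise ValueError("R2 is undefined when all targets are identical")
    return 1.0 - ss_res / ss_tot


def summarize(values):
    """Format repeated-run scores as 'mean±std' (sample std, an assumption)."""
    return f"{mean(values):.3f}±{stdev(values):.3f}"


# Toy yields in percent (hypothetical values, not results from this repository).
y_true = [10.0, 20.0, 30.0, 40.0]
y_good = [12.0, 18.0, 33.0, 39.0]
y_bad = [60.0, 60.0, 60.0, 60.0]  # a constant, badly offset prediction

print(f"MAE={mae(y_true, y_good):.3f}  R2={r2_score(y_true, y_good):.3f}")  # MAE=2.000  R2=0.964
print(f"R2 of the offset constant predictor: {r2_score(y_true, y_bad):.3f}")  # -9.800
print(summarize([0.90, 0.92, 0.94]))  # 0.920±0.020
```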

Sequence Performance Comparison

This section evaluates synthesis planning performance via Top-n accuracy metrics on both retrosynthetic and forward synthesis tasks.

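All values are top-n accuracy (%).

| Task | Dataset | RXNGraphormer Top-1 | RXNGraphormer Top-3 | RXNGraphormer Top-5 | RXNGraphormer Top-10 | Reproduction Top-1 | Reproduction Top-3 | Reproduction Top-5 | Reproduction Top-10 | Note |
|---|---|---|---|---|---|---|---|---|---|---|
| Retrosynthetic | USPTO-50k | 51.0 | 69.0 | 74.2 | 79.2 | 50.3 | 69.3 | 73.7 | 78.0 | fine-tuned |
| | USPTO-full | 47.4 | 63.0 | 67.4 | 71.6 | 47.3 | 62.9 | 67.5 | 71.6 | inference-only |
| Forward | USPTO-480k | 90.6 | 94.3 | 94.9 | 95.5 | 90.5 | 94.4 | 95.1 | 95.7 | fine-tuned |
| | USPTO-STEREO | 78.2 | 85.1 | 86.5 | 87.8 | 78.1 | 84.9 | 86.4 | 87.7 | fine-tuned |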
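
As a reference for how a top-n accuracy figure is typically obtained, the sketch below counts a sample as a hit when the ground-truth string appears among the first n ranked candidates. It assumes that predictions and targets are already canonicalized SMILES so that exact string comparison is meaningful; the function name and the toy data are hypothetical and do not come from the repository's scripts.

```python
from typing import Sequence


def top_n_accuracy(ranked_predictions: Sequence[Sequence[str]],
                   targets: Sequence[str],
                   n: int) -> float:
    """Percentage of samples whose target is among the first n candidates."""
    if len(ranked_predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    if not targets:
        raise ValueError("at least one sample is required")
    hits = sum(
        1
        for candidates, target in zip(ranked_predictions, targets)
        if target in candidates[:n]
    )
    return 100.0 * hits / len(targets)


# Toy example: three reactions, candidates ordered from most to least likely.
ranked = [
    ["CCO", "CCN", "CCC"],   # correct at rank 1
    ["c1ccccc1", "CC=O"],    # correct at rank 2
    ["CC(=O)O", "CCO"],      # target not generated at all
]
truth = ["CCO", "CC=O", "CCCC"]

for n in (1, 3):
    print(f"top-{n}: {top_n_accuracy(ranked, truth, n):.1f}%")
# top-1: 33.3%
# top-3: 66.7%
```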

External Validation Eval

Model generalization is validated on newly introduced real-world datasets (e.g., HTE or literature-derived reactions), with results compared to baseline methods.

| Dataset | Subset | Split | Origin Model R² | Origin Model MAE (%) | RXNGraphormer R² | RXNGraphormer MAE (%) |
|---|---|---|---|---|---|---|
| Sulfoxonium | | Train Set | 0.89 | 6.60 | 0.91 | 5.70 |
| | | Validation Set | 0.77 | 8.00 | 0.60 | 9.46 |
| Meta_C_H | | Train Set | 0.75 | 9.30 | 0.82 | 5.76 |
| | | Independent Test Set | 0.74 | 9.10 | 0.78 | 6.49 |
| | | Strict Independent Test Set | 0.71 | 11.60 | -1.19 | 23.22 |
| Amide Coupling HTE | Full HTE (with NATURE intermediate) | Random split | 0.66 | 10.00 | 0.58 | 14.14 |
| | | Partial Novelty | 0.68 | 14.00 | 0.59 | 14.31 |
| | | Full Novelty | 0.63 | 15.00 | 0.58 | 12.56 |
| | Full HTE (without intermediate) | Random split | 0.66 | 10.00 | 0.58 | 14.31 |
| | | Partial Novelty | 0.68 | 14.00 | 0.66 | 13.12 |
| | | Full Novelty | 0.63 | 15.00 | 0.44 | 14.93 |
| | DCC (with intermediate) | Random split | 0.86 | 8.00 | 0.37 | 16.63 |
| | | Partial Novelty | 0.81 | 11.00 | 0.27 | 15.99 |
| | | Full Novelty | 0.67 | 7.00 | -0.41 | 13.05 |
| | EDC (with intermediate) | Random split | 0.89 | 6.10 | 0.23 | 18.46 |
| | | Partial Novelty | 0.88 | 9.00 | 0.20 | 18.32 |
| | | Full Novelty | 0.75 | 14.00 | -0.10 | 22.23 |
| | HATU (with intermediate) | Random split | 0.86 | 6.00 | 0.08 | 19.08 |
| | | Partial Novelty | 0.78 | 12.00 | -0.09 | 20.57 |
| | | Full Novelty | 0.84 | 7.00 | -0.37 | 16.34 |
| | PyBOP (with intermediate) | Random split | 0.90 | 5.00 | 0.35 | 15.54 |
| | | Partial Novelty | 0.82 | 10.00 | 0.38 | 14.09 |
| | | Full Novelty | 0.89 | 8.00 | 0.22 | 15.56 |
| | TBTU (without intermediate) | Random split | 0.71 | 10.00 | 0.49 | 13.40 |
| | | Partial Novelty | 0.57 | 16.00 | 0.31 | 11.70 |
| | | Full Novelty | 0.66 | 13.00 | 0.64 | 5.96 |
| | HBTU (without intermediate) | Random split | 0.83 | 8.00 | 0.23 | 18.02 |
| | | Partial Novelty | 0.72 | 13.00 | 0.04 | 18.75 |
| | | Full Novelty | 0.68 | 14.00 | -0.11 | 17.88 |
| Amide Coupling Literature | | | 0.39 | 13.30 | 0.35 | 12.48 |

Non-USPTO Sequence Generation (External Comparison)

To further evaluate generalization in synthesis planning, we compare our reproduced results with those reported in the original paper on non-USPTO datasets.

Forward Synthesis

| Setting | Model | Invalid SMILES (%) | Top-1 Acc (with SC) (%) | Top-1 Acc (w/o SC) (%) |
|---|---|---|---|---|
| Separated | Origin Model | 0.40 | 66.10 | 66.92 |
| Separated (USPTO_480k) | RXNGraphormer | 0.75 | 83.34 | 83.40 |
| Separated (USPTO_STEREO) | RXNGraphormer | 0.52 | 83.31 | 85.64 |
| Mixed | Origin Model | 0.27 | 84.12 | 85.20 |
| Mixed (USPTO_480k) | RXNGraphormer | 0.77 | 83.29 | 83.35 |
| Mixed (USPTO_STEREO) | RXNGraphormer | 0.48 | 83.30 | 85.64 |

Retrosynthesis

| Model | Invalid SMILES (%) | Top-1 Acc (with SC) (%) | Top-1 Acc (w/o SC) (%) |
|---|---|---|---|
| Origin Model | 0.27 | 37.22 | 37.42 |
| RXNGraphormer (USPTO_full) | 4.01 | 27.59 | 28.03 |
| RXNGraphormer (USPTO_50k) | 3.05 | 16.52 | 16.64 |
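
The "Invalid SMILES (%)" column can be read as the share of predicted strings that fail a validity check. The sketch below illustrates that calculation under stated assumptions: the validity check is passed in as a callable, and `balanced_brackets` is only a toy stand-in that checks bracket nesting, not a real SMILES parser. In practice a cheminformatics toolkit would supply the check. All names here are hypothetical and are not part of the repository.

```python
from typing import Callable, Sequence


def invalid_rate(predictions: Sequence[str],
                 is_valid: Callable[[str], bool]) -> float:
    """Percentage of predicted strings rejected by the validity check."""
    if not predictions:
        raise ValueError("at least one prediction is required")
    invalid = sum(1 for smiles in predictions if not is_valid(smiles))
    return 100.0 * invalid / len(predictions)


def balanced_brackets(smiles: str) -> bool:
    """Toy validator: non-empty string with properly nested () and [].

    This is a placeholder for illustration only; it accepts many strings
    that a real SMILES parser would reject.
    """
    closing_to_opening = {")": "(", "]": "["}
    stack = []
    for ch in smiles:
        if ch in "([":
            stack.append(ch)
        elif ch in closing_to_opening:
            if not stack or stack.pop() != closing_to_opening[ch]:
                return False
    return bool(smiles) and not stack


# Toy predictions: the second has an unclosed branch, the fourth is empty.
preds = ["CC(=O)O", "CC(=O", "c1ccccc1", ""]
print(f"invalid: {invalid_rate(preds, balanced_brackets):.2f}%")  # invalid: 50.00%
```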

📚 Acknowledgments

Thanks to the original authors for open-sourcing RXNGraphormer. This reproduction builds directly upon their codebase and methodology.

💡 Note: For full installation instructions and model details, please refer to the original README.
