Synthetic accessibility scoring is an invaluable tool for generative chemistry and, more generally, for filtering or scoring molecular designs, whether they come from AI or from human designers. Scoring approaches that do not consider how the target molecule can actually be synthesized are of limited practical use. This work aims to produce a tool that predicts synthetic routes within a time frame that is reasonable for many real-world applications.
The idea of the tool is to restrict the search depth in order to improve computation speed while retaining access to a large amount of synthetically accessible chemical space, complex enough to be useful in drug discovery applications. A number of obvious optimizations, such as caching and parallel execution, can significantly reduce run times, and other approaches could be tested in the future.
The tool is based on aizynthfinder, provided by the MolecularAI group at AstraZeneca (https://github.com/MolecularAI/aizynthfinder). Specifically, I use here:
- The expansion policy model and templates provided by the group
- A filter policy model provided by the group

Please refer to the detailed documentation at https://molecularai.github.io/aizynthfinder/ for instructions on how to train policy models, construct stocks, etc.

- A full tree search is implemented (DFS) instead of MCTS for increased accuracy of predictions
- Caching tree branches increases speed by 2x-3x
- Persistent Redis cache for sharing results across parallel workers and surviving process restarts
- Vectorization of hot loops resulted in 3x speed gains
- A maximum search depth can be set
- A context search mode is available for the analysis of collections of molecules with a common scaffold (e.g. parallel libraries) where generally there is only one disconnection of interest in the first step of the retrosynthesis. See also the Jupyter notebook example.
- A large amount of code has been removed or refactored for simplification and speed optimisation.
- Customized collections of templates. The current policy model and templates were generated from the USPTO dataset and cover chemical synthesis knowledge from before 2019. They therefore underrepresent, or omit entirely, important modern synthetic methods such as sp2-sp3 cross-coupling reactions and late-stage functionalization reactions. This feature gives the option to add new reaction templates and so enhance or otherwise modify the synthetic knowledge of the tool. This repo includes shallowtree/rules/direct.csv, an example collection of templates that cover standard cross-coupling reactions.
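
To make the combination of a depth-limited exhaustive search and branch caching concrete, here is a minimal sketch. It is not the code used in this repository: the stock, the disconnection table and the function name `min_steps` are hypothetical toy stand-ins for the real stock, the expansion policy with its templates, and the tree search.

```python
from functools import lru_cache
from typing import Optional

# Hypothetical toy data, for illustration only. In the real tool the stock
# comes from a stock file and the disconnections from the expansion policy.
STOCK = frozenset({"A", "B", "C"})
DISCONNECTIONS = {
    "AB": (("A", "B"),),
    "ABC": (("AB", "C"), ("A", "BC")),
    "BC": (("B", "C"),),
    "ABCD": (("ABC", "D"),),
}


@lru_cache(maxsize=None)  # branch cache: each (molecule, depth) is solved once
def min_steps(molecule: str, depth: int) -> Optional[int]:
    """Fewest reaction steps from stock to `molecule`, or None if no route
    exists within `depth` steps. Every disconnection is explored (DFS)."""
    if molecule in STOCK:
        return 0  # purchasable building block, nothing to make
    if depth == 0:
        return None  # depth budget exhausted
    best = None
    for precursors in DISCONNECTIONS.get(molecule, ()):
        steps = [min_steps(p, depth - 1) for p in precursors]
        if any(s is None for s in steps):
            continue  # at least one precursor is unsolved within the budget
        candidate = 1 + max(steps)
        if best is None or candidate < best:
            best = candidate
    return best


print(min_steps("ABC", 2))   # solved in two steps
print(min_steps("ABC", 1))   # None: the depth limit is too tight
print(min_steps("ABCD", 3))  # None: "D" is neither in stock nor makeable
```

The cache key here is the pair (molecule, depth), which is a simplification: a shared intermediate such as "B" is only reused when it is reached with the same remaining depth.
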
- Clone the GitHub repository, e.g.
git clone https://github.com/Arhs99/shallow-tree.git
- Then execute:
cd shallow-tree
conda env create -f env.yml
conda activate shallow-tree
poetry install
For use with a GPU, install tensorflow from conda as follows; it will pick the correct CUDA libraries compatible with your set-up:
conda install -y tensorflow-gpu=2.8.0 -c conda-forge
For Redis caching support (recommended for parallel execution):
poetry install -E cache
One needs to import the Expander class and use either of the two available methods:
- search_tree: provides scoring, predicted routes and starting materials for a set of query molecules
- context_search: a SMILES string parameter is required for the desired scaffold, which also indicates the attachment point

See the synth_score_NOTEBOOK.ipynb Jupyter notebook example in this repository.
search_cli is the command line tool. Here are two examples, one for each of the available search modes:
echo 'Clc1ccccc1COC5CC(Nc3n[nH]c4cc(c2ccccc2)ccc34)C5' | searchcli --config config.yml --scaffold '[*]c1n[nH]c2cc(-c3ccccc3)ccc12' --depth 2 --routes
and
searchcli --config config.yml --depth 2 --routes < smiles.txt > routes.csv
A significant gain in speed and a better ability to scale up can be achieved by serving the models with TensorFlow Serving and by parallelizing the work over batches of the SMILES inputs.
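
As a rough illustration of the batching idea, the sketch below splits a list of SMILES into fixed-size batches and hands them to a pool of workers. Everything in it is an assumption made for the example: `make_batches`, `score_batch` and `run_parallel` are hypothetical names, and `score_batch` is a placeholder that only returns the string length instead of calling the tool. Threads are used to keep the sketch self-contained; real CPU-bound searches would need separate processes, as described in the parallel folder.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List, Sequence, Tuple


def make_batches(smiles: Sequence[str], batch_size: int) -> List[List[str]]:
    """Split the inputs into consecutive batches of at most `batch_size`."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return [list(smiles[i:i + batch_size])
            for i in range(0, len(smiles), batch_size)]


def score_batch(batch: List[str]) -> List[Tuple[str, int]]:
    """Placeholder for the real search: pairs each SMILES with its length."""
    return [(smi, len(smi)) for smi in batch]


def run_parallel(smiles: Sequence[str], batch_size: int,
                 workers: int) -> List[Tuple[str, int]]:
    """Process all batches with a worker pool, keeping the input order."""
    batches = make_batches(smiles, batch_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Executor.map yields results in the order the batches were submitted
        per_batch = list(pool.map(score_batch, batches))
    return [item for batch in per_batch for item in batch]


if __name__ == "__main__":
    queries = ["CCO", "c1ccccc1", "CC(=O)O", "CCN", "C1CCCCC1"]
    for smi, value in run_parallel(queries, batch_size=2, workers=2):
        print(smi, value)
```

Batching amortizes the per-call overhead of each worker, and keeping the output in input order makes it easy to join the results back to the original SMILES file.
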
For optimal performance with parallel workers, enable Redis caching to share computed results across processes:
# In your config.yml
cache:
  enabled: true
  host: localhost
  port: 6379

See the parallel folder README for TensorFlow Serving setup, Redis installation, and usage examples.
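
The sketch below shows, under stated assumptions, how a shared key-value cache lets several workers reuse each other's results. It is not the cache implementation of this repository: `SharedCache`, `cached_search`, the key layout and the `shallowtree` prefix are hypothetical, and `InMemoryClient` is a dictionary-backed stand-in so that the example runs without a Redis server. Any client exposing `get` and `set`, such as a redis-py client, could be passed in its place.

```python
import json
from typing import Any, Callable, Optional


class InMemoryClient:
    """Dictionary-backed stand-in with get/set; NOT a real Redis client."""

    def __init__(self) -> None:
        self._data = {}

    def get(self, key: str) -> Optional[str]:
        return self._data.get(key)

    def set(self, key: str, value: str) -> None:
        self._data[key] = value


class SharedCache:
    """Stores JSON-serializable search results under (depth, SMILES) keys."""

    def __init__(self, client: Any, prefix: str = "shallowtree") -> None:
        self._client = client
        self._prefix = prefix  # hypothetical namespace for the keys

    def _key(self, smiles: str, depth: int) -> str:
        return f"{self._prefix}:{depth}:{smiles}"

    def lookup(self, smiles: str, depth: int) -> Optional[Any]:
        raw = self._client.get(self._key(smiles, depth))
        # json.loads accepts str as well as the bytes a Redis client returns
        return None if raw is None else json.loads(raw)

    def store(self, smiles: str, depth: int, result: Any) -> None:
        self._client.set(self._key(smiles, depth), json.dumps(result))


def cached_search(cache: SharedCache, search_fn: Callable[[str, int], Any],
                  smiles: str, depth: int) -> Any:
    """Return the cached result if present, else compute and store it."""
    hit = cache.lookup(smiles, depth)
    if hit is not None:
        return hit
    result = search_fn(smiles, depth)
    cache.store(smiles, depth, result)
    return result
```

Including the depth in the key keeps results computed under different depth limits apart, and serializing to JSON means the stored values survive a process restart when the client is backed by a persistent store.
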
For more information on the effect of caching intermediates on the performance of the algorithm, consult the AstraZeneca paper in reference 2 below.
- Genheden S, Thakkar A, Chadimova V, et al (2020) AiZynthFinder: a fast, robust and flexible open-source software for retrosynthetic planning. ChemRxiv. Preprint. https://doi.org/10.26434/chemrxiv.12465371.v1
- Pablo Iáñez Picazo, Alexey Voronov, Samuel Genheden, et al. Joint synthesis planning by leveraging common intermediates. ChemRxiv. 23 January 2026. DOI: https://doi.org/10.26434/chemrxiv.10001547/v1