Automated discovery of cycloaddition reactions with unsupervised machine learning and quantum chemistry
The search for new chemical transformations is a key problem in modern chemistry and an important factor in the development of other fields. In this work, an automated pipeline for the discovery of new reactivity was developed and applied to the search for cycloaddition reactions, which are important atom-economic processes. The problem was tackled by combining rule-based candidate generation, thermodynamic evaluation, and cluster-based sampling. Selected reactions were optimized and experimentally verified, which led to the discovery of X novel reactions.
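To make the combination of thermodynamic evaluation and cluster-based sampling more concrete, here is a minimal sketch in plain Python. It is not the code from this repository: the `Candidate` class, the function names, the ranking criterion (lowest reaction energy first), and all energy values are assumptions made for illustration only.

```python
from collections import defaultdict
from dataclasses import dataclass

# Standard conversion factor from Hartree to kcal/mol.
HARTREE_TO_KCAL_MOL = 627.5095


@dataclass(frozen=True)
class Candidate:
    """Hypothetical record for one candidate reaction (illustrative only)."""
    name: str
    cluster: int                 # cluster label assigned by some clustering step
    reactant_energies: tuple     # total energies of the reactants, in Hartree
    product_energy: float        # total energy of the product, in Hartree


def reaction_energy(candidate):
    """Reaction energy in kcal/mol: E(product) minus the sum of E(reactants)."""
    delta_hartree = candidate.product_energy - sum(candidate.reactant_energies)
    return delta_hartree * HARTREE_TO_KCAL_MOL


def sample_per_cluster(candidates, per_cluster):
    """Keep the `per_cluster` lowest-energy reactions from every cluster."""
    by_cluster = defaultdict(list)
    for candidate in candidates:
        by_cluster[candidate.cluster].append(candidate)
    return {
        label: sorted(members, key=reaction_energy)[:per_cluster]
        for label, members in by_cluster.items()
    }


# Made-up energies, chosen only to exercise the functions above.
candidates = [
    Candidate("rxn_a", 0, (-100.0, -50.0), -150.05),
    Candidate("rxn_b", 0, (-100.0, -50.0), -150.01),
    Candidate("rxn_c", 0, (-100.0, -50.0), -149.99),
    Candidate("rxn_d", 1, (-80.0, -40.0), -120.02),
]

selected = sample_per_cluster(candidates, per_cluster=2)
for label, members in sorted(selected.items()):
    print(label, [c.name for c in members])
```

Sampling a fixed number of reactions from every cluster, rather than taking the globally lowest energies, keeps structurally different reaction types represented in the selection. This is presumably the purpose of the `--number` option used in the filtering steps of the workflow.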
- Generate cycloaddition reaction templates using the mining_pubchem/generate_templates.py script:
cd mining_pubchem
python generate_templates.py --output ../../working_files/templates.pkl
- Use these templates to mine reactions from the QM9 dataset:
python processing_dataset.py --ds-path ../../working_files/dsgdb9nsd.xyz.tar.bz2 --db-path ../../working_files/cas --db-name 'CAS' --n-jobs 16 --output-dir ../../working_files/test_sep_dir/ --output ../../working_files/proc_dataset.pkl
- The previous step used n_jobs=16 processes to parallelize the program. Now run all 16 separate files through the mining_pubchem/reaction_find.sh script:
sh reaction_find.sh 16 ../../working_files/proc_dataset.pkl ../../working_files/test_sep_dir/ ../../working_files/templates.pkl ../../working_files/reactions_dir/ 'CAS'
- Combine all reactions using the mining_pubchem/reacts_concatenation.py script:
python reacts_concatenation.py --input ../../working_files/reactions_dir/ --output ../../working_files/reactions.pkl --output-rs ../../working_files/embeds/smiles_reags.txt --output-ps ../../working_files/embeds/smiles_prods.txt
- Convert the substances into vector form using the clustering_reactions/smi2vec.py script:
cd clustering_reactions
python smi2vec.py --p-vocab ../../working_files/vocab.pkl --p-trfm ../../working_files/trfm.pkl --smi ../../working_files/embeds/smiles_reags.txt --output ../../working_files/embeds/reags.npy & python smi2vec.py --p-vocab ../../working_files/vocab.pkl --p-trfm ../../working_files/trfm.pkl --smi ../../working_files/embeds/smiles_prods.txt --output ../../working_files/embeds/prods.npy
- Dimensionality reduction of the vectors can be done in three ways (PCA, t-SNE, UMAP) using the clustering_reactions/dimensionality_reduction.py script:
python dimensionality_reduction.py --emb-r ../../working_files/embeds/reags.npy --emb-p ../../working_files/embeds/prods.npy --smi-r ../../working_files/embeds/smiles_reags.txt --smi-p ../../working_files/embeds/smiles_prods.txt --reacts ../../working_files/reactions.pkl --method 't-SNE' --output ../../working_files/embeds_reacts.pkl -n_components 2 -perplexity 100
- Clustering can be done in several ways (AgglomerativeClustering, KMeans, SpectralClustering) using the clustering_reactions/cluster_reactions.py script:
python cluster_reactions.py --input ../../working_files/embeds_reacts.pkl --method 'AgglomerativeClustering' --metric 'euclidean' --plot ../../working_files/clusters.png --model ../../working_files/qm9_model.pkl -n_clusters 12
- Select a certain number of reactions from each cluster using the creation_reaction_cards/filter_reactions_by_energy.py script:
cd creation_reaction_cards
python filter_reactions_by_energy.py --reactions ../../working_files/reactions.pkl --db-name 'CAS' --model ../../working_files/qm9_model.pkl --number 18 --output ../../working_files/need_reactions.pkl --output-numbers ../../working_files/reactions_numbers.pkl
- Process the "reaction cards":
python parse_manual_labels.py --archive ../../working_files/marks_of_reacts.zip --output ../../working_files/marks_of_reacts/ --numbers ../../working_files/reactions_numbers.pkl --csv ../../working_files/reactions_data.csv
- Search for alkynes within the database using the laboratory_database_of_reagents/get_substituents.py script:
cd laboratory_database_of_reagents
python get_substituents.py --input-db ../../working_files/ReagentsLB30.sdf --output-db ../../working_files/smiles_alkynes_fin.txt
- Handle the labeled reactions using the generate_computation/generate_mopac_smiles.py script:
cd generate_computation
python generate_mopac_smiles.py ../../working_files/reacts_map.pkl ../../working_files/smiles_alkynes_fin.txt ../../working_files/products.txt
- Generate the MOPAC files using the generate_computation/mopac_generate.py script:
python mopac_generate.py ../../working_files/products.txt ../../working_files/mopac/
- Automatically process the MOPAC files and create the Gaussian files:
python gaussian_generate.py ../../working_files/mopac/ ../../working_files/gaussian_checks/ 16000 'B3LYP/6-31G(2df,p)' 16 ../../working_files/gaussian_calc/
- Automatically process the Gaussian files:
python withdrawal_energy.py ../../working_files/gaussian_calc/ ../../working_files/products.txt ../../working_files/products_energy.txt
- Collect the resulting data into a table:
python prop_rxns.py ../../working_files/reagents_energy.txt ../../working_files/alkynes_energy.txt ../../working_files/products_energy.txt ../../working_files/calc_reactions.csv
- Sort the reactions using the final_filter/fine_filter_reactions_by_energy.py script:
cd final_filter
python fine_filter_reactions_by_energy.py --input-csv ../../working_files/calc_reactions.csv --number 10 --output ../../working_files/need_reactions.csv

First, you would need to change the way to generate templates.