Automated discovery of cycloaddition reactions with unsupervised machine learning and quantum chemistry

The search for new chemical transformations is a key problem in modern chemistry and an important factor in the development of many other fields. In this work, an automated pipeline for the discovery of new reactivity was developed and applied to the search for cycloaddition reactions, which are important atom-economic processes. The problem was tackled by combining rule-based candidate generation, thermodynamic evaluation, and cluster-based sampling. Selected reactions were optimized and experimentally verified, which led to the discovery of X novel reactions.

Steps to reproduce the results

Reaction generation

Generate cycloaddition reaction templates using the mining_pubchem/generate_templates.py script.

cd mining_pubchem
python generate_templates.py --output ../../working_files/templates.pkl

Use these templates to mine reactions from the QM9 dataset:

python processing_dataset.py --ds-path ../../working_files/dsgdb9nsd.xyz.tar.bz2 --db-path ../../working_files/cas --db-name 'CAS' --n-jobs 16 --output-dir ../../working_files/test_sep_dir/ --output ../../working_files/proc_dataset.pkl

The previous step used 16 parallel processes (--n-jobs 16) and wrote 16 separate files. Run all of them through the mining_pubchem/reaction_find.sh script:

sh reaction_find.sh 16 ../../working_files/proc_dataset.pkl ../../working_files/test_sep_dir/ ../../working_files/templates.pkl ../../working_files/reactions_dir/ 'CAS'
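The number of separate files follows the --n-jobs value. As a rough illustration of the idea only, a list of records can be divided into near-equal parts as below. The helper `split_into_chunks` is hypothetical and is not taken from processing_dataset.py.

```python
def split_into_chunks(items, n_chunks):
    """Split items into n_chunks contiguous parts whose sizes differ by at most one."""
    if n_chunks < 1:
        raise ValueError("n_chunks must be positive")
    base, extra = divmod(len(items), n_chunks)
    chunks = []
    start = 0
    for i in range(n_chunks):
        # The first `extra` chunks receive one additional item.
        size = base + (1 if i < extra else 0)
        chunks.append(items[start:start + size])
        start += size
    return chunks


# Placeholder records standing in for molecules of the dataset.
molecules = [f"mol_{i}" for i in range(50)]
parts = split_into_chunks(molecules, 16)
```

Each part can then be handled by its own process, which is why the same count (16) is passed to reaction_find.sh.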

Combine all reactions using the mining_pubchem/reacts_concatenation.py script:

python reacts_concatenation.py --input ../../working_files/reactions_dir/ --output ../../working_files/reactions.pkl --output-rs ../../working_files/embeds/smiles_reags.txt --output-ps ../../working_files/embeds/smiles_prods.txt
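For orientation, the merging step might look like the sketch below. It assumes that every file in the input directory is a pickle holding a list of reactions; that file layout and the function name `concatenate_pickled_lists` are assumptions made for illustration, not the actual contents of reacts_concatenation.py.

```python
import pickle
from pathlib import Path


def concatenate_pickled_lists(input_dir, output_path):
    """Merge every *.pkl file in input_dir (each assumed to hold a list) into one pickle."""
    merged = []
    # sorted() fixes the file order so repeated runs give the same result.
    for path in sorted(Path(input_dir).glob("*.pkl")):
        with open(path, "rb") as handle:
            merged.extend(pickle.load(handle))
    with open(output_path, "wb") as handle:
        pickle.dump(merged, handle)
    return merged
```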

Processing reactions with unsupervised learning

Convert the substances into vector form using the clustering_reactions/smi2vec.py script:

cd clustering_reactions
python smi2vec.py --p-vocab ../../working_files/vocab.pkl --p-trfm ../../working_files/trfm.pkl --smi ../../working_files/embeds/smiles_reags.txt --output ../../working_files/embeds/reags.npy & python smi2vec.py --p-vocab ../../working_files/vocab.pkl --p-trfm ../../working_files/trfm.pkl --smi ../../working_files/embeds/smiles_prods.txt --output ../../working_files/embeds/prods.npy

Dimensionality reduction of the vectors can be done in three ways (PCA, t-SNE, UMAP) using the clustering_reactions/dimensionality_reduction.py script:

python dimensionality_reduction.py --emb-r ../../working_files/embeds/reags.npy --emb-p ../../working_files/embeds/prods.npy --smi-r ../../working_files/embeds/smiles_reags.txt --smi-p ../../working_files/embeds/smiles_prods.txt --reacts ../../working_files/reactions.pkl --method 't-SNE' --output ../../working_files/embeds_reacts.pkl -n_components 2 -perplexity 100

Clustering can be done in several ways (AgglomerativeClustering, KMeans, SpectralClustering) using the clustering_reactions/cluster_reactions.py script:

python cluster_reactions.py --input ../../working_files/embeds_reacts.pkl --method 'AgglomerativeClustering' --metric 'euclidean' --plot ../../working_files/clusters.png --model ../../working_files/qm9_model.pkl -n_clusters 12

For expert opinion

Select a certain number of reactions from each cluster using the creation_reaction_cards/filter_reactions_by_energy.py script:

cd creation_reaction_cards
python filter_reactions_by_energy.py --reactions ../../working_files/reactions.pkl --db-name 'CAS' --model ../../working_files/qm9_model.pkl --number 18 --output ../../working_files/need_reactions.pkl --output-numbers ../../working_files/reactions_numbers.pkl
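The idea of cluster-based sampling can be sketched as follows. The sketch assumes each reaction carries a cluster label and an energy estimate, and that lower energy is preferred; both the tuple layout and the ranking criterion are assumptions for illustration and may differ from what filter_reactions_by_energy.py does.

```python
from collections import defaultdict


def select_per_cluster(reactions, number):
    """Return up to `number` lowest-energy reactions from each cluster.

    `reactions` is an iterable of (reaction_id, cluster_label, energy) tuples;
    this layout is hypothetical and chosen only for the illustration.
    """
    by_cluster = defaultdict(list)
    for reaction_id, cluster, energy in reactions:
        by_cluster[cluster].append((energy, reaction_id))
    selected = {}
    for cluster, members in by_cluster.items():
        members.sort()  # ascending energy, so the lowest values come first
        selected[cluster] = [reaction_id for _, reaction_id in members[:number]]
    return selected
```

Sampling per cluster rather than globally keeps structurally different reaction families represented in the set shown to the experts.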

Process the "reaction cards" labeled by the experts:

python parse_manual_labels.py --archive ../../working_files/marks_of_reacts.zip --output ../../working_files/marks_of_reacts/ --numbers ../../working_files/reactions_numbers.pkl --csv ../../working_files/reactions_data.csv

Lab database

Search for alkynes within the database using the laboratory_database_of_reagents/get_substituents.py script:

cd laboratory_database_of_reagents
python get_substituents.py --input-db ../../working_files/ReagentsLB30.sdf --output-db ../../working_files/smiles_alkynes_fin.txt

Generation and calculation of products

Process the labeled reactions using the generate_computation/generate_mopac_smiles.py script:

cd generate_computation
python generate_mopac_smiles.py ../../working_files/reacts_map.pkl ../../working_files/smiles_alkynes_fin.txt ../../working_files/products.txt

Generate the MOPAC input files using the generate_computation/mopac_generate.py script:

python mopac_generate.py ../../working_files/products.txt ../../working_files/mopac/

Automatically process the MOPAC files and create the Gaussian input files:

python gaussian_generate.py ../../working_files/mopac/ ../../working_files/gaussian_checks/ 16000 'B3LYP/6-31G(2df,p)' 16 ../../working_files/gaussian_calc/
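The positional arguments of gaussian_generate.py are not described here. Assuming that 16000 is the memory in MB and 16 the number of processors (an assumption, not confirmed by the script), a Gaussian input file for a geometry optimization could be assembled as in this sketch. The function `build_gaussian_input` and the `opt` keyword in the route line are illustrative choices as well.

```python
def build_gaussian_input(title, atoms, method, mem_mb, n_proc, charge=0, multiplicity=1):
    """Return the text of a Gaussian geometry-optimization input file.

    `atoms` is a list of (element, x, y, z) tuples with coordinates in angstroms.
    """
    lines = [
        f"%mem={mem_mb}MB",          # Link 0: memory limit (assumed meaning of 16000)
        f"%nprocshared={n_proc}",    # Link 0: shared-memory processors (assumed meaning of 16)
        f"# {method} opt",           # route section; 'opt' is an assumed keyword
        "",
        title,
        "",
        f"{charge} {multiplicity}",
    ]
    for element, x, y, z in atoms:
        lines.append(f"{element:<2} {x:12.6f} {y:12.6f} {z:12.6f}")
    lines.append("")  # Gaussian expects a blank line after the geometry
    return "\n".join(lines) + "\n"


water = [("O", 0.0, 0.0, 0.0), ("H", 0.0, 0.757, 0.587), ("H", 0.0, -0.757, 0.587)]
text = build_gaussian_input("water", water, "B3LYP/6-31G(2df,p)", 16000, 16)
```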

Automatically process the Gaussian output files and extract the energies:

python withdrawal_energy.py ../../working_files/gaussian_calc/ ../../working_files/products.txt ../../working_files/products_energy.txt
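Gaussian log files report the electronic energy on lines beginning with "SCF Done:". A minimal sketch of pulling out the last such value is shown below; whether withdrawal_energy.py reads this quantity or another one (for example a thermally corrected energy) is not stated here, so treat the choice as an assumption.

```python
import re

# Matches lines such as: " SCF Done:  E(RB3LYP) =  -76.4089533236     A.U. after    8 cycles"
SCF_PATTERN = re.compile(r"SCF Done:\s+E\(\S+\)\s+=\s+(-?\d+\.\d+)")


def last_scf_energy(log_text):
    """Return the final 'SCF Done' energy (hartree) found in a Gaussian log, or None."""
    matches = SCF_PATTERN.findall(log_text)
    return float(matches[-1]) if matches else None
```

For an optimization run, the last match corresponds to the final geometry, which is why the sketch takes the last value rather than the first.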

Collect the resulting data into a table:

python prop_rxns.py ../../working_files/reagents_energy.txt ../../working_files/alkynes_energy.txt ../../working_files/products_energy.txt ../../working_files/calc_reactions.csv
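The three energy files suggest that a reaction energy is computed for each reagent, alkyne, and product triple. A sketch of that arithmetic, assuming all energies are in hartree and the result is wanted in kcal/mol, is given below; the function name and the choice of units are assumptions, not taken from prop_rxns.py.

```python
HARTREE_TO_KCAL_PER_MOL = 627.5095


def reaction_energy_kcal(e_reagent, e_alkyne, e_product):
    """Energy change (kcal/mol) for reagent + alkyne -> product, inputs in hartree."""
    return (e_product - (e_reagent + e_alkyne)) * HARTREE_TO_KCAL_PER_MOL
```

A negative value means the product is lower in energy than the separated reactants, i.e. the cycloaddition is thermodynamically favorable at this level of description.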

Selecting reactions from the calculated set

Sort the reactions using the final_filter/fine_filter_reactions_by_energy.py script:

cd final_filter
python fine_filter_reactions_by_energy.py --input-csv ../../working_files/calc_reactions.csv --number 10 --output ../../working_files/need_reactions.csv
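As an illustration of this final ranking step, the sketch below reads a CSV table and keeps the rows with the lowest energy. The column name `delta_e` and the rule "lowest value first" are hypothetical; the real column names in calc_reactions.csv and the criterion used by the script may differ.

```python
import csv
import io


def top_n_by_energy(csv_text, number, column="delta_e"):
    """Return the `number` rows with the lowest value in `column`."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    rows.sort(key=lambda row: float(row[column]))
    return rows[:number]


example = "reaction,delta_e\nA,-5.0\nB,-40.2\nC,3.1\n"
best = top_n_by_energy(example, 2)
```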

How to apply this to another type of reaction?

First, you would need to change how the templates are generated.
