- Python = 3.11
- RDKit
- Streamlit
- pandas
- NumPy
- NetworkX
- Matplotlib
- PuLP
- CPLEX solver
The files associated with minChemBio are attached separately here, since GitHub does not allow uploading files larger than 100 MB without LFS.
Dataset and code on Scholarsphere
- Create the Conda environment
  Set up the environment with all required dependencies by running:
  conda env create -f minchembio_yml.yml
- Prepare the dataset
  Ensure the dataset files are in the working directory.
- Extract reactant and product IDs
  Open the provided Jupyter notebook and run it to extract the IDs of reactants and products from S1.txt.
- Run the Streamlit web app
  Launch the web app interface by executing the following command:
  streamlit run minchembio_streamlit.py
  OR run the script version:
  python minchembio.py
  Enter the reactant and product IDs in the app to generate solutions using minChemBio.
- Visualize the solutions
  Use the visualization Jupyter notebook to explore the generated pathways. Provide the folder containing the pathway outputs as input to the notebook.
- Reference data (optional)
  The file S2.csv contains a list of USPTO patent IDs and their corresponding reaction SMILES strings. This can be used for further analysis or comparison.
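
Before launching the app, it can help to confirm that the dataset files really are in the working directory. The following is a minimal sketch, not part of the repository: the file names are copied from the "Required files" list for minchembio_streamlit.py, while the helper name `missing_files` is hypothetical.

```python
from pathlib import Path

# File names copied from the "Required files" list of minchembio_streamlit.py.
REQUIRED_FILES = [
    "all_rij_with_miss_cat.json",
    "all_sij_with_miss_cat.json",
    "bio_chem_smiles_ids_dict_updated.json",
    "rev_pair_90_nondup.json",
    "rxn_classify_with_miss_cat.json",
]


def missing_files(directory=".", required=REQUIRED_FILES):
    """Return the required file names that are not present in `directory`.

    Hypothetical helper for illustration; it is not shipped with minChemBio.
    """
    base = Path(directory)
    return [name for name in required if not (base / name).is_file()]


if __name__ == "__main__":
    missing = missing_files()
    if missing:
        print("Missing dataset files:", ", ".join(missing))
    else:
        print("All required dataset files are present.")
```

Running this from the working directory lists any file that still needs to be downloaded from ScholarSphere.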
This repository contains the code and datasets required to run minChemBio, a tool for exploring pathways using mixed-integer linear programming (MILP).
- all_rij_with_miss_cat.json
  A dictionary where molecule IDs are the keys. Each value is a dictionary listing all reactions involving that molecule, either as a reactant (-1) or a product (1).
- all_sij_with_miss_cat.json
  A dictionary where reaction IDs are the keys. Each value is a dictionary listing all molecules involved in the reaction as reactants (-1) or products (1).
- bio_chem_smiles_ids_dict_NEW.json
  A dictionary with canonical SMILES strings as keys and molecule IDs as values.
- bio_chem_smiles_ids_dict_updated.json
  A dictionary with canonical SMILES strings as keys and molecule IDs as values. Similar to earlier SMILES-ID mappings but includes fewer molecules.
- bio_chem_ids_dict_NEW.json
  A dictionary with molecule IDs as keys and canonical SMILES strings as values. Similar to earlier SMILES-ID mappings but includes fewer molecules.
- metanetx_metab_db.json
  A dictionary where MetaNetX molecule IDs are the keys. Each value is a dictionary listing the Name, Formula, Charge, Mass, InChI, InChIKey, SMILES, and Reference for each molecule. This dataset is extracted from the MetaNetX web platform.
- rev_pair_90_nondup.json
  A dictionary where reaction IDs are the keys. Each value is a list containing reverse mappings extracted from the same reaction. Used to ensure that forward and reverse reactions do not co-occur in the same pathway.
- rxn_classify_with_miss_cat.json
  A dictionary with reaction IDs as keys. The value for each key is:
  - 1 for chemical reactions
  - 0 for biological reactions
- S1.txt
  A text file containing molecule IDs and their corresponding canonical SMILES strings.
- S2.csv
  A CSV file containing reaction IDs from the USPTO dataset, along with their associated patent number, year, and reaction SMILES string.
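
The two stoichiometry dictionaries describe the same network from opposite directions: all_sij_with_miss_cat.json is keyed by reaction, all_rij_with_miss_cat.json by molecule. The sketch below illustrates that shape on toy data. All IDs are invented, and the assumption that the molecule-keyed view can be obtained by inverting the reaction-keyed one is ours, not something the dataset description guarantees.

```python
from collections import defaultdict

# Toy data with hypothetical IDs, shaped like all_sij_with_miss_cat.json:
# reaction ID -> {molecule ID: -1 (reactant) or 1 (product)}.
toy_sij = {
    "R1": {"M1": -1, "M2": 1},
    "R2": {"M2": -1, "M3": -1, "M4": 1},
}


def build_rij(sij):
    """Invert a reaction-keyed mapping into a molecule-keyed one.

    The result is shaped like all_rij_with_miss_cat.json:
    molecule ID -> {reaction ID: -1 or 1}.
    """
    rij = defaultdict(dict)
    for rxn_id, molecules in sij.items():
        for mol_id, role in molecules.items():
            rij[mol_id][rxn_id] = role
    return dict(rij)


def producers(rij, mol_id):
    """Reactions in which `mol_id` appears as a product (role 1)."""
    return sorted(r for r, role in rij.get(mol_id, {}).items() if role == 1)


def consumers(rij, mol_id):
    """Reactions in which `mol_id` appears as a reactant (role -1)."""
    return sorted(r for r, role in rij.get(mol_id, {}).items() if role == -1)


toy_rij = build_rij(toy_sij)
print(toy_rij["M2"])             # {'R1': 1, 'R2': -1}
print(producers(toy_rij, "M4"))  # ['R2']
print(consumers(toy_rij, "M2"))  # ['R2']
```

With the real files, the same lookups would start from `json.load` on each file instead of the toy dictionary.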
minchembio_streamlit.py
A Streamlit web app interface for minChemBio.
- Inputs: Product and reactant molecule IDs
- Required files:
  - all_rij_with_miss_cat.json
  - all_sij_with_miss_cat.json
  - bio_chem_smiles_ids_dict_updated.json
  - rev_pair_90_nondup.json
  - rxn_classify_with_miss_cat.json
- Output: A text file named in the format productID_from_reactionID-timestamp_.txt, containing all possible solutions (pathways). Each solution is a list of reaction IDs obtained by solving the MILP problem.
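
Since each solution is a list of reaction IDs, a pathway can be sanity-checked against the reverse-pair and classification dictionaries described for rev_pair_90_nondup.json and rxn_classify_with_miss_cat.json. The sketch below uses toy data with invented IDs. Both helper names are hypothetical, and this is not how the MILP itself enforces the constraint, only an after-the-fact check.

```python
# Toy stand-ins with hypothetical IDs.
# Shaped like rev_pair_90_nondup.json: reaction ID -> list of reverse reaction IDs.
toy_rev_pairs = {"R1": ["R1_rev"], "R1_rev": ["R1"]}
# Shaped like rxn_classify_with_miss_cat.json: 1 = chemical, 0 = biological.
toy_rxn_class = {"R1": 1, "R1_rev": 1, "R2": 0, "R3": 0}


def has_reverse_conflict(pathway, rev_pairs):
    """True if the pathway contains a reaction together with one of its reverses."""
    chosen = set(pathway)
    return any(rev in chosen for rxn in pathway for rev in rev_pairs.get(rxn, []))


def count_steps(pathway, rxn_class):
    """Count chemical (1) and biological (0) reactions in a pathway."""
    chemical = sum(1 for rxn in pathway if rxn_class[rxn] == 1)
    return {"chemical": chemical, "biological": len(pathway) - chemical}


print(has_reverse_conflict(["R1", "R2", "R3"], toy_rev_pairs))  # False
print(has_reverse_conflict(["R1", "R1_rev"], toy_rev_pairs))    # True
print(count_steps(["R1", "R2", "R3"], toy_rxn_class))           # {'chemical': 1, 'biological': 2}
```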
minchembio.py
A Python script version of the Streamlit app.
- Same functionality as minchembio_streamlit.py
- Users need to edit the main() function to input the desired molecule IDs.
visualize.ipynb
A Jupyter notebook for visualizing the output pathways.
- Inputs: Same as minchembio_streamlit.py, plus the results file generated by the MILP run
- Output: Visual representations of all identified pathways, saved as .png files.