Main use cases of this repo:
- Generate molecular features for ML models: src/gen_mol_fea.py
- Aggregate molecular features generated on HPC (Theta, Frontera): src/agg_fea_hpc.py
src/gen_mol_fea.py takes SMILES strings, canonicalizes them, and generates multiple feature sets, each stored in a separate file.
Mordred descriptors and fingerprints are stored as dataframes (e.g., parquet, csv).
Images are stored as Python dictionaries (pickle files).
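As a rough illustration of the pickle format, the sketch below writes a dictionary of images to disk and reads it back. The file name `images.pkl`, the molecule IDs, and the layout (ID mapped to pixel values, shown here as nested lists) are assumptions made for illustration; the dictionaries written by src/gen_mol_fea.py may be organized differently.

```python
import pickle
import tempfile
from pathlib import Path

# Hypothetical layout: molecule ID -> image pixels (nested lists stand in
# for the arrays a real image dictionary would likely hold).
images = {
    "mol_0001": [[0, 255], [255, 0]],
    "mol_0002": [[255, 255], [0, 0]],
}

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "images.pkl"  # hypothetical file name

    # Write the dictionary to disk.
    with open(path, "wb") as f:
        pickle.dump(images, f)

    # Read it back; only unpickle files that come from a trusted source.
    with open(path, "rb") as f:
        loaded = pickle.load(f)

print(sorted(loaded))
```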
Each feature (column) name in a dataframe is prefixed with an appropriate string indicating the type.
- Mordred descriptors (prefix: dd_)
- ECFP2 (prefix: ecfp2_)
- ECFP4 (prefix: ecfp4_)
- ECFP6 (prefix: ecfp6_)
- Images
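The prefixes make it possible to pick one feature set out of a combined table by column name alone. A minimal sketch, assuming a table that mixes an ID column with columns from several feature sets; the column names after each prefix and the helper `columns_with_prefix` are hypothetical and not part of the repo:

```python
# Column names as they might appear in a combined dataframe. Only the
# prefixes (dd_, ecfp2_, ...) come from the README; the rest is made up.
columns = [
    "TITLE",
    "dd_feature_a",
    "dd_feature_b",
    "ecfp2_0",
    "ecfp2_1",
    "ecfp4_0",
    "ecfp6_0",
]


def columns_with_prefix(names, prefix):
    """Return the column names that start with the given prefix, in order."""
    return [name for name in names if name.startswith(prefix)]


descriptor_cols = columns_with_prefix(columns, "dd_")
ecfp2_cols = columns_with_prefix(columns, "ecfp2_")

print(descriptor_cols)
print(ecfp2_cols)
```

With pandas, a list like `descriptor_cols` can be used to select the matching columns, e.g. `df[descriptor_cols]`.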
First, clone the repo.
$ git clone https://github.com/adpartin/mol-features/
Inside the project dir, create a folder that will contain raw SMILES.
$ cd mol-features
$ mkdir -p data/raw/
Get the data from Box or Petrel (e.g., from Box copy 2019-nCoV/drug-screening/Baseline-Screen-Datasets) to data/raw.
Then, launch a python script or use a bash script (you may need to change the bash script to specify your parameters).
$ python src/gen_mol_fea.py --smiles_path data/OZD-dock-2020-06-01/OZD.May29.unique.csv --id_name TITLE --fea_type descriptors fps --par_jobs 16 --ignore_3D
$ bash scripts/covid/gen_fea_OZD.bash
Clone the repo.
$ git clone https://github.com/adpartin/mol-features/
Inside the project dir, create a folder that will contain raw SMILES.
$ cd mol-features
$ mkdir -p data/raw/July2020
Get the data from /vol/ml/mshukla/data_frames/Jul2020/drug_info to data/raw/July2020.
Then, launch a python script or use a bash script (you may need to change the bash script to specify your parameters).
$ python src/gen_mol_fea.py --smiles_path data/July2020/drug_info --id_name ID --fea_type descriptors fps --par_jobs 16 --ignore_3D
$ bash scripts/pilot1/gen_fea_July2020.bash
Instead of calculating features with src/gen_mol_fea.py, we can take features computed on HPC and aggregate them into dataframes. src/agg_fea_hpc.py takes the output from github.com/globus-labs/covid-analyses and aggregates the files into a single dataframe. So far, the code has been tested only for generating a dataframe with Mordred descriptors.
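The aggregation step can be pictured as concatenating many per-batch files into one table. The sketch below is not the logic of src/agg_fea_hpc.py; it assumes, purely for illustration, that each batch is a CSV file with an identical header. The file names, column names, and the helper `aggregate_csv` are hypothetical.

```python
import csv
import tempfile
from pathlib import Path


def aggregate_csv(paths):
    """Concatenate CSV files that share one header into (header, rows)."""
    header, rows = None, []
    for path in paths:
        with open(path, newline="") as f:
            reader = csv.reader(f)
            file_header = next(reader)
            if header is None:
                header = file_header
            elif file_header != header:
                raise ValueError(f"header mismatch in {path}")
            rows.extend(reader)
    return header, rows


with tempfile.TemporaryDirectory() as tmp:
    # Two small hypothetical batches of descriptor values.
    batches = {
        "batch_0.csv": [["ID", "dd_feature_a"], ["mol_0001", "1.5"]],
        "batch_1.csv": [["ID", "dd_feature_a"], ["mol_0002", "2.5"]],
    }
    for name, content in batches.items():
        with open(Path(tmp) / name, "w", newline="") as f:
            csv.writer(f).writerows(content)

    # Sort the paths so the row order is deterministic.
    header, rows = aggregate_csv(sorted(Path(tmp).glob("batch_*.csv")))

print(header)
print(rows)
```

Checking that every file has the same header before appending its rows keeps columns from silently shifting when one batch was produced with different settings.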