Generate molecular feature sets for ML models.

nkoussa/mol-features

 
 


Main use cases of this repo:

  1. Generate molecular features for ML models: src/gen_mol_fea.py
  2. Aggregate molecular features generated on HPC (Theta, Frontera): src/agg_fea_hpc.py

Calculate molecular features

src/gen_mol_fea.py takes SMILES strings, canonicalizes them, and generates multiple feature sets, each stored in a separate file.
Mordred descriptors and fingerprints are stored as dataframes (e.g., parquet, csv).
Images are stored as Python dictionaries (pickle files).
Each feature (column) name in a dataframe carries a prefix that indicates the feature type.

  • Mordred descriptors (prefix: dd_)
  • ECFP2 (prefix: ecfp2_)
  • ECFP4 (prefix: ecfp4_)
  • ECFP6 (prefix: ecfp6_)
  • Images
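
The prefix convention makes it easy to pull one feature set out of a combined table. The sketch below groups column names by the prefixes listed above. The helper name `group_columns_by_prefix` and the example column names are hypothetical and are not taken from the repo.

```python
from collections import defaultdict

# Prefixes listed in this README; anything else (e.g. an ID column) goes under "other".
PREFIXES = ("dd_", "ecfp2_", "ecfp4_", "ecfp6_")


def group_columns_by_prefix(columns):
    """Map each known prefix to the column names that start with it."""
    groups = defaultdict(list)
    for name in columns:
        for prefix in PREFIXES:
            if name.startswith(prefix):
                groups[prefix].append(name)
                break
        else:
            # No known prefix matched this column name.
            groups["other"].append(name)
    return dict(groups)


# Hypothetical column names, for illustration only.
example_columns = ["TITLE", "dd_nAtom", "dd_MW", "ecfp2_0", "ecfp4_0", "ecfp6_0"]
groups = group_columns_by_prefix(example_columns)
print(groups["dd_"])
```

With a dataframe library, the same grouping could be used to select only the descriptor or only the fingerprint columns before training a model.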

First, clone the repo.

$ git clone https://github.com/adpartin/mol-features/

If you are working on Covid

Inside the project directory, create a folder to hold the raw SMILES files.

$ cd mol-features
$ mkdir -p data/raw/

Get the data from Box or Petrel (e.g., from Box, copy 2019-nCoV/drug-screening/Baseline-Screen-Datasets) to data/raw.
Then, run the Python script directly or use the bash script (you may need to edit the bash script to set your own parameters).

$ python src/gen_mol_fea.py --smiles_path data/OZD-dock-2020-06-01/OZD.May29.unique.csv --id_name TITLE --fea_type descriptors fps --par_jobs 16 --ignore_3D
$ bash scripts/covid/gen_fea_OZD.bash
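
To make the flags in the command above easier to read, here is a minimal sketch of a parser that accepts them. Only the flag names come from the command; the types, defaults, and help strings are assumptions and may not match what src/gen_mol_fea.py actually defines.

```python
import argparse


def build_parser():
    """Sketch of a parser for the flags shown above (types and defaults are assumed)."""
    parser = argparse.ArgumentParser(description="Sketch of the gen_mol_fea.py command line.")
    parser.add_argument("--smiles_path", type=str, required=True,
                        help="Path to the file containing SMILES strings.")
    parser.add_argument("--id_name", type=str, default=None,
                        help="Name of the column holding molecule IDs (e.g. TITLE).")
    parser.add_argument("--fea_type", type=str, nargs="+", default=[],
                        help="One or more feature types, e.g. descriptors fps.")
    parser.add_argument("--par_jobs", type=int, default=1,
                        help="Number of parallel jobs.")
    parser.add_argument("--ignore_3D", action="store_true",
                        help="Skip 3D descriptors.")
    return parser


# Parse the same arguments as the command shown above.
args = build_parser().parse_args([
    "--smiles_path", "data/OZD-dock-2020-06-01/OZD.May29.unique.csv",
    "--id_name", "TITLE",
    "--fea_type", "descriptors", "fps",
    "--par_jobs", "16",
    "--ignore_3D",
])
print(args.fea_type, args.par_jobs, args.ignore_3D)
```

Because `--fea_type` takes several values, multiple feature sets can be requested in a single run.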

If you are working on Pilot1 (cancer)

Clone the repo.

$ git clone https://github.com/adpartin/mol-features/

Inside the project directory, create a folder to hold the raw SMILES files.

$ cd mol-features
$ mkdir -p data/raw/July2020

Copy the data from /vol/ml/mshukla/data_frames/Jul2020/drug_info to data/raw/July2020.
Then, run the Python script directly or use the bash script (you may need to edit the bash script to set your own parameters).

$ python src/gen_mol_fea.py --smiles_path data/July2020/drug_info --id_name ID --fea_type descriptors fps --par_jobs 16 --ignore_3D
$ bash scripts/pilot1/gen_fea_July2020.bash

Aggregate molecular features from HPC runs

Instead of calculating features with src/gen_mol_fea.py, we can take features computed on HPC and aggregate them into dataframes. src/agg_fea_hpc.py takes the output from github.com/globus-labs/covid-analyses and aggregates the files into a single dataframe. So far, the code has been tested only for generating a dataframe of Mordred descriptors.
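
The aggregation idea can be sketched as concatenating per-job files that share the same columns. The example below is not the logic of src/agg_fea_hpc.py: it assumes CSV shards with identical headers, which may not match the actual output format of covid-analyses, and the function name and sample values are made up for illustration.

```python
import csv
import io


def aggregate_csv_shards(shards):
    """Concatenate CSV shards that share a header into one list of row dicts."""
    header = None
    rows = []
    for shard in shards:
        reader = csv.DictReader(shard)
        if header is None:
            # The first shard defines the expected columns.
            header = reader.fieldnames
        elif reader.fieldnames != header:
            raise ValueError("shard header does not match the first shard")
        rows.extend(reader)
    return header, rows


# Two in-memory shards with made-up values stand in for per-job output files.
shard_a = io.StringIO("TITLE,dd_MW\nmol1,18.0\n")
shard_b = io.StringIO("TITLE,dd_MW\nmol2,44.0\n")
header, rows = aggregate_csv_shards([shard_a, shard_b])
print(header, len(rows))
```

Checking the header of every shard guards against silently mixing files that were produced with different descriptor sets.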
