GitHub - nkoussa/mol-features: Generate molecular feature sets for ML models.

Main use cases of this repo:

Generate molecular features for ML models (src/gen_mol_fea.py)
Aggregate molecular features generated on HPC (Theta, Frontera) (src/agg_fea_hpc.py)

Calculate molecular features

src/gen_mol_fea.py takes SMILES strings, canonicalizes them, and generates multiple feature sets, each stored in a separate file.
Mordred descriptors and fingerprints are stored as dataframes (e.g., parquet, csv).
Images are stored in Python dictionaries (pickle files).
Each feature (column) name in a dataframe is prefixed with a string that indicates the feature type.
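The prefix convention makes it easy to pull one feature set out of a merged table. The sketch below illustrates the idea with plain Python lists of column names. The prefixes "desc_" and "fp_", the "ID" column, and both helper functions are hypothetical and chosen only for illustration; the actual prefixes used by src/gen_mol_fea.py may differ.

```python
def add_prefix(columns, prefix, keep=("ID",)):
    """Prefix every feature column; leave identifier columns unchanged."""
    return [c if c in keep else f"{prefix}{c}" for c in columns]


def select_by_prefix(columns, prefix):
    """Return only the columns that belong to one feature type."""
    return [c for c in columns if c.startswith(prefix)]


# Hypothetical column names for two feature sets that share an ID column.
desc_cols = add_prefix(["ID", "MW", "nAtom"], "desc_")
fp_cols = add_prefix(["ID", "bit0", "bit1"], "fp_")

# After merging on ID, the prefix tells us where each column came from.
merged = desc_cols + [c for c in fp_cols if c != "ID"]
print(select_by_prefix(merged, "fp_"))
```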

First, clone the repo.

$ git clone https://github.com/adpartin/mol-features/

If you are working on Covid

Inside the project directory, create a folder that will contain the raw SMILES.

$ cd mol-features
$ mkdir -p data/raw/

Copy the data from Box or Petrel (e.g., from Box, copy 2019-nCoV/drug-screening/Baseline-Screen-Datasets) to data/raw.
Then run the Python script directly, or use the bash script (you may need to edit the bash script to set your parameters).

$ python src/gen_mol_fea.py --smiles_path data/OZD-dock-2020-06-01/OZD.May29.unique.csv --id_name TITLE --fea_type descriptors fps --par_jobs 16 --ignore_3D

$ bash scripts/covid/gen_fea_OZD.bash
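For readers unfamiliar with the flags in the command above, here is a minimal sketch of how such a command line could be parsed with argparse. Only the flag names come from the command shown; the types, defaults, and help strings are assumptions, and the real parser in src/gen_mol_fea.py may be defined differently.

```python
import argparse


def build_parser():
    """Hypothetical parser mirroring the flags used in the example command."""
    parser = argparse.ArgumentParser(description="Generate molecular features.")
    parser.add_argument("--smiles_path", type=str, required=True,
                        help="Path to the file with raw SMILES.")
    parser.add_argument("--id_name", type=str, default=None,
                        help="Name of the column holding the molecule ID.")
    parser.add_argument("--fea_type", nargs="+", default=["descriptors"],
                        help="One or more feature types to generate.")
    parser.add_argument("--par_jobs", type=int, default=1,
                        help="Number of parallel jobs (default is assumed).")
    parser.add_argument("--ignore_3D", action="store_true",
                        help="Skip 3D descriptors.")
    return parser


# Parse the same arguments as in the Covid example command.
args = build_parser().parse_args([
    "--smiles_path", "data/OZD-dock-2020-06-01/OZD.May29.unique.csv",
    "--id_name", "TITLE",
    "--fea_type", "descriptors", "fps",
    "--par_jobs", "16",
    "--ignore_3D",
])
print(args.fea_type, args.par_jobs, args.ignore_3D)
```

Because --fea_type accepts several values, descriptors and fingerprints can be requested in a single run.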

If you are working on Pilot1 (cancer)

Clone the repo.

$ git clone https://github.com/adpartin/mol-features/

Inside the project directory, create a folder that will contain the raw SMILES.

$ cd mol-features
$ mkdir -p data/raw/July2020

Copy the data from /vol/ml/mshukla/data_frames/Jul2020/drug_info to data/raw/July2020.
Then run the Python script directly, or use the bash script (you may need to edit the bash script to set your parameters).

$ python src/gen_mol_fea.py --smiles_path data/July2020/drug_info --id_name ID --fea_type descriptors fps --par_jobs 16 --ignore_3D

$ bash scripts/pilot1/gen_fea_July2020.bash

Aggregate molecular features from HPC runs

Instead of calculating features with src/gen_mol_fea.py, we can take features that were computed on HPC and aggregate them into dataframes. src/agg_fea_hpc.py takes the output from github.com/globus-labs/covid-analyses and aggregates the files into a single dataframe. So far, the code has been tested only for generating a dataframe of Mordred descriptors.
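The aggregation step can be pictured as concatenating many per-chunk files into one table. The sketch below shows this with the standard library only, assuming the chunks are CSV files that share an identical header. That file format, the function name aggregate_csv_chunks, and the sample columns are assumptions for illustration; the actual output format of covid-analyses and the logic in src/agg_fea_hpc.py may differ.

```python
import csv
import tempfile
from pathlib import Path


def aggregate_csv_chunks(chunk_dir, out_path, pattern="*.csv"):
    """Concatenate CSV chunks with identical headers; return the row count."""
    out_path = Path(out_path)
    header = None
    n_rows = 0
    with open(out_path, "w", newline="") as fout:
        writer = csv.writer(fout)
        for chunk in sorted(Path(chunk_dir).glob(pattern)):
            if chunk.resolve() == out_path.resolve():
                continue  # never read the output file as an input chunk
            with open(chunk, newline="") as fin:
                reader = csv.reader(fin)
                chunk_header = next(reader, None)
                if chunk_header is None:
                    continue  # skip empty files
                if header is None:
                    header = chunk_header
                    writer.writerow(header)
                elif chunk_header != header:
                    raise ValueError(f"Header mismatch in {chunk}")
                for row in reader:
                    writer.writerow(row)
                    n_rows += 1
    return n_rows


# Demo on two small, made-up chunks written to a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    chunks = tmp / "chunks"
    chunks.mkdir()
    (chunks / "part0.csv").write_text("ID,desc_MW\nmol1,180.2\n")
    (chunks / "part1.csv").write_text("ID,desc_MW\nmol2,46.1\nmol3,78.1\n")
    total = aggregate_csv_chunks(chunks, tmp / "all.csv")
    merged_lines = (tmp / "all.csv").read_text().splitlines()

print(total)
```

Streaming row by row keeps memory use flat, which matters when the HPC output is split across many large files.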
