LightGBM pipeline for cocrystallization prediction with RDKit-based pair features.
The repository includes:
- a Python package for feature generation, training, inference, and tuning
- prepared data files for local runs
- simple scripts for prediction and Optuna-based hyperparameter search
Pharmaceutical cocrystals are an important strategy for improving the solubility, stability, and bioavailability of drug compounds. Reliable prediction of cocrystal formation can help speed up drug development and reduce the cost of experimental screening.
CoCrystalBoost/
├── cocrystalboost/
├── data/
├── notebooks/
├── scripts/
├── pyproject.toml
└── README.md
- Python 3.10+
- pip
- RDKit
Dependencies are defined in pyproject.toml.
git clone https://github.com/romandolgo/CoCrystalBoost.git
cd CoCrystalBoost
python -m venv .venv
source .venv/bin/activate
python -m pip install --upgrade pip
pip install -e .
If RDKit is easier to install through Conda on your machine:
conda create -n cocrystalboost python=3.11
conda activate cocrystalboost
conda install -c conda-forge rdkit
pip install -e .
Run the prediction pipeline:
python -m cocrystalboost
or:
python scripts/predict.py
This creates submission.csv.
Run hyperparameter tuning:
python -m cocrystalboost.tuning
or:
python scripts/tune_lgbm.py
This creates lgbm_params_generated.py. If that file exists, the main pipeline uses it automatically. Otherwise, default parameters from settings.py are used.
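The fallback between tuned and default parameters can be sketched as follows. This is a minimal illustration, not the package's actual loader: the function name `load_lgbm_params`, the variable name `LGBM_PARAMS`, and the default values are all hypothetical, since the README does not show the contents of lgbm_params_generated.py or settings.py.

```python
import runpy
from pathlib import Path

# Hypothetical stand-in for the defaults that settings.py provides.
# The keys are real LightGBM parameter names; the values are illustrative.
DEFAULT_PARAMS = {"learning_rate": 0.05, "num_leaves": 31}


def load_lgbm_params(generated_path, defaults=DEFAULT_PARAMS, var_name="LGBM_PARAMS"):
    """Return tuned parameters if the generated file exists, else the defaults.

    `var_name` is an assumed name for the dict defined in the generated file.
    """
    path = Path(generated_path)
    if not path.is_file():
        # No tuning run has produced a file yet: use the defaults.
        return dict(defaults)

    # Execute the generated module and read its top-level names.
    namespace = runpy.run_path(str(path))
    tuned = namespace.get(var_name)
    if not isinstance(tuned, dict):
        return dict(defaults)

    # Tuned values override defaults; keys absent from the file keep defaults.
    return {**defaults, **tuned}
```

Calling `load_lgbm_params("lgbm_params_generated.py")` before any tuning run would return a copy of the defaults.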
Expected files:
- data/train_dataset/train_extended.csv
- data/test.csv
- data/sample_submission.csv
Expected columns:
- train: SMILES1,SMILES2,result
- test: SMILES1,SMILES2
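A quick header check can catch a misnamed column before feature generation starts. The sketch below uses only the standard library; the helper `missing_columns` is hypothetical and not part of the package, and the sample row is invented for illustration.

```python
import csv
import io

TRAIN_COLUMNS = ("SMILES1", "SMILES2", "result")
TEST_COLUMNS = ("SMILES1", "SMILES2")


def missing_columns(csv_file, required):
    """Return the required column names that are absent from the CSV header."""
    reader = csv.reader(csv_file)
    header = next(reader, [])  # an empty file yields an empty header
    present = {name.strip() for name in header}
    return [name for name in required if name not in present]


# In-memory example with an invented row; a real run would open the data files.
sample = io.StringIO("SMILES1,SMILES2,result\nCCO,OC(=O)c1ccccc1,1\n")
print(missing_columns(sample, TRAIN_COLUMNS))  # prints []
```

For files on disk, the same helper can be used as `missing_columns(open("data/test.csv", newline=""), TEST_COLUMNS)`.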
- cocrystalboost/main.py — prediction pipeline
- cocrystalboost/tuning.py — Optuna tuning
- cocrystalboost/features.py — feature engineering
- cocrystalboost/modeling.py — training and threshold selection
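Threshold selection with F1 as the criterion can be sketched in plain Python. This is an illustration of the general technique, not the code in modeling.py: the function names, the candidate grid, and the assumption that labels are 0/1 and scores are predicted probabilities are all hypothetical.

```python
def f1_score(y_true, y_pred):
    """F1 for binary labels, with 1 as the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    return 2 * tp / (2 * tp + fp + fn)


def select_threshold(y_true, probabilities, candidates=None):
    """Scan candidate thresholds and return (threshold, f1) with the best F1.

    Ties keep the lowest threshold, because only strict improvements replace
    the current best.
    """
    if candidates is None:
        # Assumed grid: 0.05 to 0.95 in steps of 0.01.
        candidates = [i / 100 for i in range(5, 96)]

    best_threshold, best_f1 = 0.5, -1.0
    for threshold in candidates:
        predictions = [1 if p >= threshold else 0 for p in probabilities]
        score = f1_score(y_true, predictions)
        if score > best_f1:
            best_threshold, best_f1 = threshold, score
    return best_threshold, best_f1
```

In practice the probabilities passed in would typically be out-of-fold predictions, so that the threshold is not tuned on data the model was fitted to.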
- F1 is used for threshold selection and tuning.
- Grouped cross-validation is used to reduce leakage between equivalent pairs.
- Generated files such as submission.csv, train_features_cache.pkl, and lgbm_params_generated.py are ignored by git.
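The grouped cross-validation note above can be illustrated with a small sketch. It assumes that "equivalent pairs" means the same two molecules in either order, and it compares raw SMILES strings; the actual pipeline may canonicalize SMILES first and may rely on a library splitter such as scikit-learn's `GroupKFold`. The names `pair_key` and `grouped_folds` are hypothetical.

```python
from collections import defaultdict


def pair_key(smiles_a, smiles_b):
    """Order-independent key: (A, B) and (B, A) map to the same group."""
    return tuple(sorted((smiles_a, smiles_b)))


def grouped_folds(pairs, n_folds=5):
    """Assign each pair a fold index so that every group stays in one fold."""
    groups = defaultdict(list)
    for index, (a, b) in enumerate(pairs):
        groups[pair_key(a, b)].append(index)

    fold_sizes = [0] * n_folds
    assignment = [None] * len(pairs)
    # Greedy balancing: largest groups first, each into the smallest fold so far.
    for key in sorted(groups, key=lambda k: (-len(groups[k]), k)):
        fold = fold_sizes.index(min(fold_sizes))
        for index in groups[key]:
            assignment[index] = fold
        fold_sizes[fold] += len(groups[key])
    return assignment
```

Because a swapped pair always shares a fold with its counterpart, a model is never validated on a pair it has already seen in the other order during training.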