Pipeline for the EUOS 2025 Ochem challenge on predicting optical properties (transmittance and fluorescence) from molecular SMILES. Challenge link: https://ochem.eu/static/challenge2025.do
This repo reproduces the final submission using a stacking ensemble and final retraining on combined train + leaderboard data.
- Load competition training data for all four tasks.
- Load precomputed features and align them to molecules by canonical SMILES.
- Train three diverse base models per task using scaffold-based folds.
- Stack base predictions with a logistic regression meta-learner trained on leaderboard labels.
- Retrain base models on combined train + leaderboard data.
- Apply the saved meta-learners to final base predictions and write the submission.
Four binary classification targets:
- Transmittance(340)
- Transmittance(450)
- Fluorescence(340/450)
- Fluorescence(>480)
- Training data: data/euos25_challenge_train_*.csv
- Leaderboard labels: data/euos_challenge_2025_leaderboard.csv
- Test data: data/euos25_challenge_test.csv
- Features: features/train_rdkit_features_v2.csv, features/test_rdkit_features_v2.csv
Precomputed feature matrix per molecule:
- ECFP4 fingerprints: columns ecfp_* (2048 bits, radius=2)
- PubChem fingerprints: columns pubchem_* (881 bits)
- PhysChem fingerprints: columns physchem_* (2048 bits)
Molecules are aligned by canonical SMILES (salt-stripped, RDKit canonicalization).
Three diverse sources (base model variants) trained per task:
- Source 1: XGBoost (balanced class weights)
- Source 2: RandomForest (task-specific depth/leaf settings)
- Source 3: XGBoost (simpler baseline)
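As a sketch of the Source 2 style, a balanced RandomForest might look like the following. The hyperparameter values here are illustrative assumptions, not the task-specific settings used in the repo, and the data is synthetic.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(42)
X = rng.normal(size=(100, 8))
# Mildly imbalanced synthetic binary labels
y = (X[:, 0] + 0.3 * rng.normal(size=100) > 0.8).astype(int)

# Depth/leaf values are placeholders for the task-specific settings
rf = RandomForestClassifier(
    n_estimators=200, max_depth=12, min_samples_leaf=3,
    class_weight="balanced", random_state=42, n_jobs=-1,
)
rf.fit(X, y)
proba = rf.predict_proba(X)[:, 1]  # per-molecule probability for the task
```

The XGBoost sources follow the same fit/predict_proba pattern, with `scale_pos_weight` typically standing in for `class_weight="balanced"`.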
Training uses 5-fold StratifiedGroupKFold with Murcko scaffolds to reduce scaffold leakage. Fold models generate test predictions that are averaged per source.
A logistic regression meta-learner is trained on leaderboard labels using base model predictions as features. One meta-learner is saved per task.
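The stacking step amounts to fitting a logistic regression on a small matrix whose columns are the base sources' probabilities. A minimal sketch with synthetic leaderboard data (three base sources assumed, as above):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(42)
n_lb = 200
y_lb = rng.integers(0, 2, size=n_lb)  # synthetic leaderboard labels

# One probability column per base source, loosely correlated with y_lb
base_preds = np.clip(
    y_lb[:, None] * 0.6 + rng.uniform(0, 0.4, size=(n_lb, 3)), 0, 1
)

stacker = LogisticRegression(random_state=42)
stacker.fit(base_preds, y_lb)                       # one stacker per task
blended = stacker.predict_proba(base_preds)[:, 1]   # blended probability
```

In the real pipeline this fit happens once per task and the fitted model is dumped with joblib for reuse at submission time.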
For the final submission:
- Training data is merged with leaderboard labels.
- Base models are retrained on the merged data.
- Saved stackers combine the final base predictions.
- Submission is written to output/submission_final.csv.
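The last step, applying the saved per-task stackers and writing the submission, can be sketched like this. Everything here is synthetic: the `stacker_demo_*.joblib` and `submission_demo.csv` filenames are illustrative stand-ins for the repo's `models/base/stacker_{task}.joblib` and `output/submission_final.csv`.

```python
import csv

import joblib
import numpy as np
from sklearn.linear_model import LogisticRegression

tasks = ["Transmittance(340)", "Transmittance(450)",
         "Fluorescence(340/450)", "Fluorescence(>480)"]

rng = np.random.default_rng(42)
n_test = 5  # tiny synthetic "test set"

per_task_probs = []
for i, task in enumerate(tasks):
    # Stand-in for the saved stacker, trained here on synthetic data
    stacker = LogisticRegression(random_state=42).fit(
        rng.uniform(size=(50, 3)), rng.integers(0, 2, size=50)
    )
    path = f"stacker_demo_{i}.joblib"  # hypothetical filename
    joblib.dump(stacker, path)

    final_base_preds = rng.uniform(size=(n_test, 3))  # 3 base sources
    loaded = joblib.load(path)
    per_task_probs.append(loaded.predict_proba(final_base_preds)[:, 1])

# One column per task, one row per test molecule
with open("submission_demo.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(tasks)
    writer.writerows(zip(*per_task_probs))
```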
Generated artifacts during a run:
- Base models: models/base/base_source*_{task}.joblib
- Meta-learners: models/base/stacker_{task}.joblib
- Final models: models/final/final_source*_{task}.joblib
- Final submission: output/submission_final.csv
All scripts use random_state=42. Small numerical differences across systems
are possible, but the file shape and column order are deterministic.
Conda (recommended):
conda env create -f environment.yml
conda activate euos25_final

Pip:
python -m venv venv_euos25
source venv_euos25/bin/activate
pip install -r requirements.txt

Required data files in data/:
- euos25_challenge_train_transmittance340.csv
- euos25_challenge_train_transmittance450.csv
- euos25_challenge_train_fluorescence340_450.csv
- euos25_challenge_train_fluorescence480.csv
- euos_challenge_2025_leaderboard.csv
- euos25_challenge_test.csv
Required feature files in features/ (not included in the repo due to size):
- train_rdkit_features_v2.csv
- test_rdkit_features_v2.csv

Generate them locally using scripts/generate_rdkit_features.py (ECFP4 2048 + PubChem 881 + PhysChem 2048 via scikit-fingerprints) so train/test match the expected schema.

Run the full pipeline, regenerating features first:
./run_full_pipeline.sh --regen-features

If you already generated features, you can skip regeneration:
./run_full_pipeline.sh

You can also regenerate features explicitly:
python scripts/generate_rdkit_features.py --data-dir data --features-dir features

This runs:
- Base model training (3 sources, 5-fold scaffold CV)
- Stacking on leaderboard labels
- Final training on train + leaderboard, then submission generation
Models are saved after training:
- models/base/ (base models + meta-learners)
- models/final/ (final retrained models)
Output:
output/submission_final.csv
Expected columns:
Transmittance(340), Transmittance(450), Fluorescence(340/450), Fluorescence(>480)

Submitted file to the competition:
output/submission_final.csv
euos25/
├── data/ # Required CSV data
├── features/ # Required precomputed features
├── scripts/ # Training + stacking scripts
├── models/ # Generated models (created by pipeline)
├── source_submissions/ # Generated base predictions (created by pipeline)
├── output/ # submission_final.csv is written here
├── environment.yml
├── requirements.txt
└── run_full_pipeline.sh