
EUOS25 Challenge Pipeline

Pipeline for the EUOS 2025 OCHEM challenge on predicting optical properties (transmittance and fluorescence) from molecular SMILES. Challenge link: https://ochem.eu/static/challenge2025.do

This repo reproduces the final submission using a stacking ensemble and final retraining on combined train + leaderboard data.

Pipeline Summary

  1. Load competition training data for all four tasks.
  2. Load precomputed features and align them to molecules by canonical SMILES.
  3. Train three diverse base models per task using scaffold-based folds.
  4. Stack base predictions with a logistic regression meta-learner trained on leaderboard labels.
  5. Retrain base models on combined train + leaderboard data.
  6. Apply the saved meta-learners to final base predictions and write the submission.

Architecture (Concise Technical Report)

Tasks

Four binary classification targets:

  • Transmittance(340)
  • Transmittance(450)
  • Fluorescence(340/450)
  • Fluorescence(>480)

Data Flow

  • Training data: data/euos25_challenge_train_*.csv
  • Leaderboard labels: data/euos_challenge_2025_leaderboard.csv
  • Test data: data/euos25_challenge_test.csv
  • Features: features/train_rdkit_features_v2.csv, features/test_rdkit_features_v2.csv
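The alignment of precomputed features to the training rows can be sketched as a keyed merge. This is a minimal illustration, not the repo's actual schema: the column name "SMILES", the label column, and the tiny inline CSVs are assumptions.

```python
import pandas as pd
from io import StringIO

# Stand-ins for the real CSVs; column names are illustrative only.
train_csv = StringIO("SMILES,label\nc1ccccc1,1\nCCO,0\n")
feat_csv = StringIO("SMILES,ecfp_0,ecfp_1\nCCO,0,1\nc1ccccc1,1,0\n")

train = pd.read_csv(train_csv)
feats = pd.read_csv(feat_csv)

# Left-merge keeps the training-row order and surfaces molecules with
# missing features as NaN rows, which can then be inspected or dropped.
merged = train.merge(feats, on="SMILES", how="left", validate="one_to_one")
print(merged.shape)  # → (2, 4)
```

`validate="one_to_one"` makes the merge fail loudly if the canonical-SMILES key is duplicated on either side, which is the failure mode this alignment step most needs to catch.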

Features

Precomputed feature matrix per molecule:

  • ECFP4 fingerprints: columns ecfp_* (2048 bits, radius=2)
  • PubChem fingerprints: columns pubchem_* (881 bits)
  • PhysChem fingerprints: columns physchem_* (2048 bits)

Molecules are aligned by canonical SMILES (salt-stripped, RDKit canonicalization).
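A minimal sketch of the canonicalization key and the ECFP4 featurization, assuming standard RDKit calls (the helper names here are illustrative, not the repo's):

```python
from rdkit import Chem
from rdkit.Chem import AllChem
from rdkit.Chem.SaltRemover import SaltRemover

def canonical_smiles(smiles: str) -> str:
    """Salt-strip and canonicalize a SMILES string with RDKit."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"Unparseable SMILES: {smiles}")
    mol = SaltRemover().StripMol(mol)  # drop common counter-ions, e.g. [Na+]
    return Chem.MolToSmiles(mol)       # RDKit canonical form

def ecfp4(smiles: str, n_bits: int = 2048):
    """ECFP4 corresponds to a Morgan fingerprint with radius 2."""
    mol = Chem.MolFromSmiles(smiles)
    return AllChem.GetMorganFingerprintAsBitVect(mol, radius=2, nBits=n_bits)

# Two valid spellings of ethanol collapse to one canonical key.
print(canonical_smiles("OCC"))  # → CCO
```

Using the salt-stripped canonical SMILES as the join key is what makes the feature alignment in the previous section robust to differently written inputs.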

Base Models (Per Task)

Three diverse sources (base-model variants) are trained per task:

  • Source 1: XGBoost (balanced class weights)
  • Source 2: RandomForest (task-specific depth/leaf settings)
  • Source 3: XGBoost (simpler baseline)

Training uses 5-fold StratifiedGroupKFold with Murcko scaffolds to reduce scaffold leakage. Fold models generate test predictions that are averaged per source.

Stacking (Per Task)

A logistic regression meta-learner is trained on leaderboard labels using base model predictions as features. One meta-learner is saved per task.
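A minimal sketch of that stacking step, with synthetic stand-ins for the leaderboard labels and the three sources' probabilities (the real pipeline would use the actual base-model outputs):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(42)
n = 200
y_lb = rng.integers(0, 2, size=n)  # stand-in leaderboard labels

# Fake base-model probabilities, one column per source, loosely
# correlated with the labels so the stacker has signal to fit.
base_preds = np.clip(y_lb[:, None] * 0.5 + rng.random((n, 3)) * 0.5, 0, 1)

# The meta-learner sees only the 3 base predictions as features.
stacker = LogisticRegression(max_iter=1000)
stacker.fit(base_preds, y_lb)
blended = stacker.predict_proba(base_preds)[:, 1]
```

Because the meta-learner has only three input features, logistic regression is a natural choice: it blends the sources with per-source weights while being hard to overfit on a small leaderboard set.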

Final Training

For the final submission:

  • Training data is merged with leaderboard labels.
  • Base models are retrained on the merged data.
  • Saved stackers combine the final base predictions.
  • Submission is written to output/submission_final.csv.
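The last two bullets can be sketched as a save/load round trip with joblib. Paths, shapes, and the synthetic data are illustrative; in the repo the per-task stackers live under models/base/.

```python
import joblib
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Stand-in for the stacking stage: fit and persist a meta-learner.
X_meta = rng.random((100, 3))                    # base preds on leaderboard
y_meta = (X_meta.mean(axis=1) > 0.5).astype(int)  # synthetic labels
stacker = LogisticRegression(max_iter=1000).fit(X_meta, y_meta)
joblib.dump(stacker, "stacker_demo.joblib")

# Later, during final submission generation: the retrained base models
# score the test set (one column per source), and the *saved* stacker
# blends them; the stacker itself is not refit.
final_base_preds = rng.random((50, 3))
loaded = joblib.load("stacker_demo.joblib")
submission_col = loaded.predict_proba(final_base_preds)[:, 1]
```

Reusing the saved stacker is deliberate: refitting it after the leaderboard labels have been folded into training would let the meta-learner see its own training signal twice.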

Outputs

Artifacts generated during a run:

  • Base models: models/base/base_source*_{task}.joblib
  • Meta-learners: models/base/stacker_{task}.joblib
  • Final models: models/final/final_source*_{task}.joblib
  • Final submission: output/submission_final.csv

Reproducibility

All scripts use random_state=42. Small numerical differences across systems are possible, but the file shape and column order are deterministic.

Reproduce End-to-End

1) Create the environment

Conda (recommended):

conda env create -f environment.yml
conda activate euos25_final

Pip:

python -m venv venv_euos25
source venv_euos25/bin/activate
pip install -r requirements.txt

2) Verify required files

Required data files in data/:

  • euos25_challenge_train_transmittance340.csv
  • euos25_challenge_train_transmittance450.csv
  • euos25_challenge_train_fluorescence340_450.csv
  • euos25_challenge_train_fluorescence480.csv
  • euos_challenge_2025_leaderboard.csv
  • euos25_challenge_test.csv

Required feature files in features/ (not included in the repo due to size):

  • train_rdkit_features_v2.csv
  • test_rdkit_features_v2.csv

Generate them locally using scripts/generate_rdkit_features.py (ECFP4 2048 + PubChem 881 + PhysChem 2048 via scikit-fingerprints) so that train and test match the expected schema.

3) Run the full pipeline

./run_full_pipeline.sh --regen-features

If you already generated features, you can skip regeneration:

./run_full_pipeline.sh

You can also regenerate features explicitly:

python scripts/generate_rdkit_features.py --data-dir data --features-dir features

This runs:

  1. Base model training (3 sources, 5-fold scaffold CV)
  2. Stacking on leaderboard labels
  3. Final training on train + leaderboard, then submission generation

Models are saved after training:

  • models/base/ (base models + meta-learners)
  • models/final/ (final retrained models)

4) Final submission file

Output:

  • output/submission_final.csv

Expected columns:

  • Transmittance(340)
  • Transmittance(450)
  • Fluorescence(340/450)
  • Fluorescence(>480)

File submitted to the competition:

  • output/submission_final.csv

Folder Structure

euos25/
├── data/                # Required CSV data
├── features/            # Required precomputed features
├── scripts/             # Training + stacking scripts
├── models/              # Generated models (created by pipeline)
├── source_submissions/  # Generated base predictions (created by pipeline)
├── output/              # submission_final.csv is written here
├── environment.yml
├── requirements.txt
└── run_full_pipeline.sh
