A FastAPI backend that predicts molecular properties from SMILES strings using RDKit and a trained scikit-learn model.
This project demonstrates a complete cheminformatics workflow: validation → featurisation → model inference → API response.
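The four stages above can be sketched roughly as follows. This is an illustrative outline, not the project's actual code: the function names `featurise` and `predict_response`, the Morgan radius of 2, and the shape of the response for invalid input are assumptions, and `predict_fn` is a hypothetical stand-in for the trained model.

```python
from typing import Callable, List, Optional, Sequence, Tuple

N_BITS = 1024  # fingerprint length stated in this README
RADIUS = 2     # assumed Morgan radius; the README does not state it


def featurise(smiles: str) -> Optional[Tuple[str, List[int]]]:
    """Validate a SMILES string; return (canonical SMILES, bit vector) or None."""
    if not smiles or not smiles.strip():
        # RDKit parses an empty string into an atom-free molecule, so reject it here.
        return None
    # Imported lazily so this module can be imported without RDKit installed.
    from rdkit import Chem
    from rdkit.Chem import rdFingerprintGenerator

    mol = Chem.MolFromSmiles(smiles)  # None when the string cannot be parsed
    if mol is None:
        return None
    generator = rdFingerprintGenerator.GetMorganGenerator(radius=RADIUS, fpSize=N_BITS)
    fingerprint = generator.GetFingerprint(mol)
    bits = [0] * N_BITS
    for index in fingerprint.GetOnBits():
        bits[index] = 1
    return Chem.MolToSmiles(mol), bits


def predict_response(smiles: str, predict_fn: Callable[[Sequence[int]], float]) -> dict:
    """Run validation, featurisation and inference, then build the response body."""
    # Assumed shape for invalid input: same keys, with null SMILES and prediction.
    response = {
        "input_smiles": smiles,
        "canonical_smiles": None,
        "valid": False,
        "property_name": "aqueous_solubility",
        "prediction": None,
        "units": "logS",
    }
    result = featurise(smiles)
    if result is not None:
        canonical, bits = result
        response.update(
            canonical_smiles=canonical,
            valid=True,
            prediction=float(predict_fn(bits)),
        )
    return response
```

In a real service `predict_fn` would wrap the loaded model, for example `lambda bits: model.predict([bits])[0]` for a scikit-learn regressor.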
- Python, FastAPI, Uvicorn
- RDKit, NumPy, pandas
- scikit-learn, joblib
- pytest
This project uses the AqSolDB aqueous solubility dataset.
Download from Kaggle: https://www.kaggle.com/datasets/sorkun/aqsoldb-a-curated-aqueous-solubility-dataset
After downloading, place the CSV file at:
backend/data/raw/aqsoldb.csv
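Before training, it can help to confirm that the file is where the code expects it. The check below is a small sketch using only the standard library; the helper name `check_dataset` is hypothetical, and the column names `SMILES` and `Solubility` are assumptions about the CSV header that should be verified against the downloaded file.

```python
import csv
from pathlib import Path

# Path taken from this README; the column names are assumptions about the CSV header.
DATASET_PATH = Path("backend/data/raw/aqsoldb.csv")
REQUIRED_COLUMNS = {"SMILES", "Solubility"}


def check_dataset(path: Path = DATASET_PATH) -> int:
    """Return the number of data rows, raising if the file or columns are missing."""
    if not path.is_file():
        raise FileNotFoundError(f"Dataset not found at {path}")
    with path.open(newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle)
        missing = REQUIRED_COLUMNS - set(reader.fieldnames or [])
        if missing:
            raise ValueError(f"Missing expected columns: {sorted(missing)}")
        return sum(1 for _ in reader)
```

Calling `check_dataset()` from the repository root either returns the row count or fails with a message naming what is missing.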
# create environment
python -m venv .venv
# activate it (Windows shown; on macOS/Linux run: source .venv/bin/activate)
.venv\Scripts\activate
# install deps
pip install -r backend/requirements.txt
# train model (required once)
python -m backend.training.train
# start the API server
uvicorn backend.app.main:app --reload
Base URL:
http://127.0.0.1:8000
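With the server running at this base URL, the `POST /api/v1/predict` endpoint described below can be called from Python using only the standard library. This is a minimal client sketch; the helper names `build_predict_request` and `predict` and the 10-second timeout are illustrative choices, not part of the project.

```python
import json
import urllib.request

BASE_URL = "http://127.0.0.1:8000"


def build_predict_request(smiles: str, base_url: str = BASE_URL) -> urllib.request.Request:
    """Build the POST /api/v1/predict request without sending it."""
    body = json.dumps({"smiles": smiles}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/api/v1/predict",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def predict(smiles: str, base_url: str = BASE_URL, timeout: float = 10.0) -> dict:
    """Send the request and decode the JSON response body."""
    request = build_predict_request(smiles, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

Calling `predict("CCO")` against a running server should return a dictionary with the response fields documented in this README.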
GET /api/v1/health
GET /api/v1/model-info
POST /api/v1/predict
Request body:
{
"smiles": "CCO"
}
Response:
{
"input_smiles": "CCO",
"canonical_smiles": "CCO",
"valid": true,
"property_name": "aqueous_solubility",
"prediction": -0.77,
"units": "logS"
}
# run tests
pytest
# build and run the Docker image
docker build -t molprop-api ./backend
docker run -p 8000:8000 molprop-api
- Model: RandomForestRegressor
- Features: Morgan fingerprints (1024 bits)
- Training and inference are fully separated