A graph neural network framework for molecular solubility prediction that integrates both structural (graph-based) and physchem descriptors. The model supports uncertainty quantification, detailed feature importance analysis, and generates comprehensive evaluation reports.

nar-n/MolSol

GNNSol: Graph Neural Network for Molecular Solubility Prediction

GNNSol is a machine learning system that uses graph neural networks (GNNs) to predict aqueous solubility (LogS) together with an uncertainty estimate for each prediction. The framework converts SMILES strings into molecular graphs and applies a GNN architecture to predict LogS from them.

[Figure: Predictions with Uncertainty]

Key Features

  • Graph Representation of Molecules: Converts SMILES to molecular graphs
  • Physicochemical Descriptors: Incorporates critical solubility-related properties
  • GNN-Based Prediction: Uses GCN or GAT architectures for property prediction
  • Uncertainty Quantification: Combines epistemic and aleatoric uncertainty
  • Advanced Evaluation: Including cross-validation and feature importance analysis
  • Comprehensive Reporting: Detailed metrics and visualizations

Architecture

The system consists of several interconnected modules:

  1. SMILES to Graph Converter: Transforms molecular SMILES into graph data
  2. Molecular Descriptors: Calculates key physicochemical properties for solubility
  3. GNN Encoder: Learns molecular representations using graph neural networks
  4. Property Predictor: Predicts molecular properties from the learned representations
  5. Uncertainty Quantifier: Estimates epistemic and aleatoric uncertainty for each prediction
  6. Feature Importance Analysis: Identifies key molecular features
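To make the first module concrete, here is a minimal sketch of a SMILES-to-graph conversion using RDKit. It is not the repository's `SMILESToGraph` implementation: the function name `smiles_to_graph` and the four atom features are assumptions chosen for illustration, and the real converter may use a different feature set.

```python
from rdkit import Chem


def smiles_to_graph(smiles):
    """Return (node_features, edge_index) for a SMILES string, or None if parsing fails.

    Hypothetical helper for illustration; the atom features are an assumed choice.
    """
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:  # RDKit returns None for SMILES it cannot parse
        return None

    # One feature row per heavy atom: atomic number, degree, attached hydrogens, aromaticity.
    node_features = [
        [atom.GetAtomicNum(), atom.GetDegree(), atom.GetTotalNumHs(), int(atom.GetIsAromatic())]
        for atom in mol.GetAtoms()
    ]

    # Store every bond in both directions, as message-passing layers usually expect.
    sources, targets = [], []
    for bond in mol.GetBonds():
        i, j = bond.GetBeginAtomIdx(), bond.GetEndAtomIdx()
        sources += [i, j]
        targets += [j, i]
    return node_features, [sources, targets]


nodes, edges = smiles_to_graph("CCO")  # ethanol
print(len(nodes), "atoms,", len(edges[0]), "directed edges")
```

The two lists can be turned into tensors and wrapped in a PyTorch Geometric data object before being passed to a GNN encoder.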

Physicochemical Descriptors

The model incorporates the following physicochemical descriptors that are known to be critical for solubility prediction:

  1. LogP (octanol-water partition coefficient): Measures lipophilicity, one of the strongest single predictors of aqueous solubility
  2. Topological Polar Surface Area (TPSA): Critical for representing molecular polarity
  3. Hydrogen Bond Donors/Acceptors: Key factors in solute-solvent interactions
  4. Molecular Weight: Influences crystal lattice energy and solubility
  5. Rotatable Bonds: Represents molecular flexibility and entropy effects
  6. Aromatic and Aliphatic Rings: Impact molecular shape and packing
  7. Surface Area (Labute ASA): Approximates the solvent-accessible surface area
  8. Molecular Refractivity: Represents molecular volume and polarizability
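All of the descriptors listed above are available in RDKit. The following sketch shows one way to compute them; the function name `physchem_descriptors` and the dictionary keys are assumptions made for this example, and the repository's descriptor module may differ in naming and in the exact RDKit functions it calls.

```python
from rdkit import Chem
from rdkit.Chem import Descriptors, Lipinski, rdMolDescriptors


def physchem_descriptors(smiles):
    """Compute the solubility-related descriptors for one molecule (hypothetical helper)."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"Could not parse SMILES: {smiles!r}")
    return {
        "logp": Descriptors.MolLogP(mol),  # Crippen estimate of LogP
        "tpsa": rdMolDescriptors.CalcTPSA(mol),
        "hbd": Lipinski.NumHDonors(mol),
        "hba": Lipinski.NumHAcceptors(mol),
        "mol_wt": Descriptors.MolWt(mol),
        "rotatable_bonds": rdMolDescriptors.CalcNumRotatableBonds(mol),
        "aromatic_rings": rdMolDescriptors.CalcNumAromaticRings(mol),
        "aliphatic_rings": rdMolDescriptors.CalcNumAliphaticRings(mol),
        "labute_asa": rdMolDescriptors.CalcLabuteASA(mol),
        "mol_mr": Descriptors.MolMR(mol),  # Crippen molar refractivity
    }


desc = physchem_descriptors("CC(=O)OC1=CC=CC=C1C(=O)O")  # aspirin
for name, value in desc.items():
    print(f"{name}: {value:.2f}")
```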

These descriptors are combined with the graph features by a feature integration architecture that:

  1. Applies consistent normalization to ensure balanced contribution
  2. Evaluates group-based importance to capture synergistic effects
  3. Uses a gate mechanism to dynamically weight graph vs. physicochemical features
  4. Analyzes feature interactions to identify important joint contributions
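The normalization and gating steps can be illustrated with a small NumPy sketch. It assumes a common form of gating, a sigmoid gate computed from the concatenated embeddings that blends the two element-wise; the README does not specify the gate actually used, so `gated_fusion`, `zscore`, and all shapes here are hypothetical.

```python
import numpy as np


def zscore(x, mean, std, eps=1e-8):
    """Normalize descriptors with statistics taken from the training set."""
    return (x - mean) / (std + eps)


def gated_fusion(graph_emb, desc_emb, weight, bias):
    """Blend two same-sized embeddings with a sigmoid gate (assumed gating form)."""
    concat = np.concatenate([graph_emb, desc_emb], axis=-1)
    gate = 1.0 / (1.0 + np.exp(-(concat @ weight + bias)))  # values in (0, 1)
    fused = gate * graph_emb + (1.0 - gate) * desc_emb
    return fused, gate


rng = np.random.default_rng(0)
dim = 4
graph_emb = rng.normal(size=(2, dim))  # stand-in for GNN output
raw_desc = rng.normal(loc=50.0, scale=10.0, size=(2, dim))  # stand-in for projected descriptors
desc_emb = zscore(raw_desc, mean=50.0, std=10.0)
weight = rng.normal(size=(2 * dim, dim)) * 0.1  # would be learned during training
bias = np.zeros(dim)

fused, gate = gated_fusion(graph_emb, desc_emb, weight, bias)
print(fused.shape, gate.min(), gate.max())
```

A gate value near 1 means the fused representation leans on the graph embedding for that dimension, and a value near 0 means it leans on the descriptors.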

Our feature importance analysis now includes:

  • Individual feature assessment
  • Group-based feature evaluation
  • Feature interaction analysis
  • Separate visualization of physicochemical contributions

This combined approach is intended to keep the physicochemical descriptors from being underestimated and to show how much they actually contribute to prediction quality.
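One simple way to assess features individually and in groups is masking: replace a feature group with its column means and measure how much the error grows. The sketch below uses a toy linear predictor in place of the GNN; `masking_importance` and the group names are assumptions for illustration, not the repository's implementation.

```python
import numpy as np


def masking_importance(predict, X, y, groups):
    """Increase in RMSE when each feature group is replaced by its column means."""

    def rmse(pred):
        return float(np.sqrt(np.mean((pred - y) ** 2)))

    baseline = rmse(predict(X))
    col_means = X.mean(axis=0)
    scores = {}
    for name, cols in groups.items():
        X_masked = X.copy()
        X_masked[:, cols] = col_means[cols]  # remove the information in this group
        scores[name] = rmse(predict(X_masked)) - baseline
    return scores


# Toy data: feature 0 matters most, feature 2 not at all.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
true_w = np.array([2.0, 0.5, 0.0])
y = X @ true_w


def predict(X_in):
    return X_in @ true_w  # stand-in for a trained model


groups = {"f0": [0], "f1": [1], "f2": [2], "f0+f1": [0, 1]}
scores = masking_importance(predict, X, y, groups)
for name, score in scores.items():
    print(f"{name}: {score:.3f}")
```

Scoring a group such as `f0+f1` alongside its members is what allows joint contributions to be compared with individual ones.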

Latest Updates

1. Enhanced Feature Importance Analysis

  • Comprehensive Feature Evaluation: Now includes both graph-based and physicochemical descriptors
  • Visual Feature Importance: Automatically generates feature importance plots with color-coded categories
  • Customizable Visualization: Top features shown with their relative importance scores

2. Improved Reporting

  • Compound ID Integration: Uses compound identifiers in reporting instead of just SMILES strings
  • Unified Evaluation Report: Single comprehensive report that combines cross-validation and advanced evaluation metrics
  • Detailed Feature Breakdown: Includes category-based analysis of most important features

3. Batch Normalization Improvements

  • Single Sample Inference: Custom batch normalization to handle individual molecule predictions
  • Robust Uncertainty Quantification: Fixed issues with batch norm during MC dropout uncertainty estimation
  • Consistent Performance: Maintains model accuracy while enabling flexible batch sizes
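For context, the underlying problem is that MC dropout needs dropout active at inference, but calling `model.train()` also switches batch normalization to batch statistics, which fails for a single sample. The README describes a custom batch normalization layer; the sketch below shows a different, commonly used workaround in plain PyTorch (keep the model in eval mode and re-enable only the dropout layers). The helper names and the toy network are hypothetical.

```python
import torch
from torch import nn


def enable_mc_dropout(model):
    """Put the model in eval mode, then switch only nn.Dropout layers back to train mode."""
    model.eval()  # BatchNorm now uses running statistics, so batch size 1 works
    for module in model.modules():
        if isinstance(module, nn.Dropout):  # other dropout variants would need listing here
            module.train()
    return model


@torch.no_grad()
def mc_dropout_predict(model, x, n_samples=30):
    """Return the mean and standard deviation over stochastic forward passes."""
    enable_mc_dropout(model)
    samples = torch.stack([model(x) for _ in range(n_samples)])
    return samples.mean(dim=0), samples.std(dim=0)


# Toy network standing in for the GNN predictor.
toy = nn.Sequential(
    nn.Linear(8, 16),
    nn.BatchNorm1d(16),
    nn.ReLU(),
    nn.Dropout(0.2),
    nn.Linear(16, 1),
)
mean, std = mc_dropout_predict(toy, torch.randn(1, 8))  # a single sample
print(mean.shape, std.shape)
```

The spread across the stochastic passes serves as an estimate of epistemic uncertainty.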

4. Enhanced Molecule Processing

  • Better Error Handling: Improved robustness for problematic SMILES strings
  • Descriptor Integration: Seamless addition of physicochemical descriptors to graph representation
  • Feature Masking Analysis: Allows understanding of each feature's contribution to predictions

Usage

from model.smiles_to_graph import SMILESToGraph
from model.gnn_encoder import GNNEncoder
from model.property_predictor import MoleculeGNN
from model.uncertainty import UncertaintyQuantifier

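# Create the model. It is untrained at this point; load trained weights
# (e.g. the saved best_model.pt state dictionary) before relying on its predictions.
model = MoleculeGNN(input_dim=14, hidden_dim=128, latent_dim=64, n_tasks=1)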

# Convert SMILES to graph
converter = SMILESToGraph()
molecule_graph = converter.convert("CC(=O)OC1=CC=CC=C1C(=O)O")  # Aspirin

# Predict with uncertainty
mean_pred, total_std, ci_lower, ci_upper, epistemic_std, aleatoric_std = (
    UncertaintyQuantifier.combined_uncertainty(model, molecule_graph)
)

print(f"Predicted solubility: {mean_pred:.2f} ± {total_std:.2f}")
print(f"95% CI: [{ci_lower:.2f}, {ci_upper:.2f}]")
print(f"Epistemic uncertainty: {epistemic_std:.2f}, Aleatoric uncertainty: {aleatoric_std:.2f}")
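The values returned above are related in a simple way if the two uncertainty sources are treated as independent and the predictive distribution as roughly Gaussian. Under those assumptions, which the README does not confirm for `combined_uncertainty`, the total standard deviation is the quadrature sum of the two components and the 95% interval is the mean plus or minus 1.96 standard deviations. The function `combine_uncertainty` below is a hypothetical illustration.

```python
import math


def combine_uncertainty(mean, epistemic_std, aleatoric_std, z=1.96):
    """Combine independent uncertainty components and build a Gaussian interval.

    Assumed formula for illustration; not taken from the repository.
    """
    total_std = math.sqrt(epistemic_std ** 2 + aleatoric_std ** 2)
    return total_std, mean - z * total_std, mean + z * total_std


total_std, ci_lower, ci_upper = combine_uncertainty(
    mean=-2.0, epistemic_std=0.3, aleatoric_std=0.4
)
print(f"total std: {total_std:.2f}, 95% CI: [{ci_lower:.2f}, {ci_upper:.2f}]")
# total std: 0.50, 95% CI: [-2.98, -1.02]
```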

Output Files

The system generates several output files in the results directory:

  1. evaluation_report.txt: Comprehensive report with all metrics and results
  2. feature_importance.csv: CSV file with detailed feature importance scores
  3. feature_importance_plot.png: Visualization of feature importance by category
  4. training_progress.png: Plot showing training and validation loss over epochs
  5. unseen_predictions_with_uncertainty.png: Visualization of predictions with uncertainty
  6. uncertainty_*.png: Various uncertainty analysis plots
  7. best_model.pt: Saved PyTorch model state dictionary

Results are automatically saved in a timestamped directory under model_output_evaluation/.
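A timestamped results directory of this kind can be created with the standard library alone. The sketch below is an assumption about how such a directory might be built: the helper `make_run_dir` and the `YYYYMMDD_HHMMSS` naming are hypothetical, and the repository may use a different timestamp format.

```python
from datetime import datetime
from pathlib import Path


def make_run_dir(base="model_output_evaluation", now=None):
    """Create and return a timestamped subdirectory under the base directory."""
    now = now or datetime.now()
    run_dir = Path(base) / now.strftime("%Y%m%d_%H%M%S")  # assumed naming scheme
    run_dir.mkdir(parents=True, exist_ok=True)
    return run_dir


if __name__ == "__main__":
    run_dir = make_run_dir()
    # Output files would then be written inside this directory.
    print(run_dir / "evaluation_report.txt")
```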

Dependencies

  • PyTorch
  • PyTorch Geometric
  • RDKit
  • NumPy
  • Matplotlib
  • scikit-learn
  • pandas
