GNNSol is a machine learning system that predicts molecular solubility with uncertainty quantification. The framework converts SMILES representations of molecules into graph structures and uses Graph Neural Network (GNN) architectures to predict water solubility (LogS) values.
- Graph Representation of Molecules: Converts SMILES to molecular graphs
- Physicochemical Descriptors: Incorporates critical solubility-related properties
- GNN-Based Prediction: Uses GCN or GAT architectures for property prediction
- Uncertainty Quantification: Combines epistemic and aleatoric uncertainty
- Advanced Evaluation: Includes cross-validation and feature importance analysis
- Comprehensive Reporting: Detailed metrics and visualizations
The system consists of several interconnected modules:
- SMILES to Graph Converter: Transforms molecular SMILES into graph data
- Molecular Descriptors: Calculates key physicochemical properties for solubility
- GNN Encoder: Learns molecular representations using graph neural networks
- Property Predictor: Predicts molecular properties from the learned representations
- Uncertainty Quantifier: Provides reliable uncertainty estimates
- Feature Importance Analysis: Identifies key molecular features
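The Uncertainty Quantifier's combined estimate follows the standard rule that independent variances add. A minimal sketch of that arithmetic (the helper name `combine_uncertainty` here is illustrative, not GNNSol's API):

```python
import math

def combine_uncertainty(epistemic_std: float, aleatoric_std: float, z: float = 1.96):
    """Combine model (epistemic) and data (aleatoric) uncertainty.

    Independent variances add, so the total standard deviation is the
    root of the summed squared components; z scales it to a CI half-width.
    """
    total_std = math.sqrt(epistemic_std ** 2 + aleatoric_std ** 2)
    return total_std, z * total_std

total_std, half_width = combine_uncertainty(3.0, 4.0)
print(total_std)  # 5.0
```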
The model incorporates the following physicochemical descriptors that are known to be critical for solubility prediction:
- LogP (octanol-water partition coefficient): The primary determinant of molecular solubility
- Topological Polar Surface Area (TPSA): Critical for representing molecular polarity
- Hydrogen Bond Donors/Acceptors: Key factors in solute-solvent interactions
- Molecular Weight: Influences crystal lattice energy and solubility
- Rotatable Bonds: Represents molecular flexibility and entropy effects
- Aromatic and Aliphatic Rings: Impact molecular shape and packing
- Surface Area (Labute ASA): Approximates the solvent-accessible surface area
- Molecular Refractivity: Represents molecular volume and polarizability
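All of these descriptors are available off the shelf in RDKit. A sketch of how they could be computed for a single molecule (assuming RDKit is installed; the exact set, naming, and ordering GNNSol uses internally may differ):

```python
from rdkit import Chem
from rdkit.Chem import Descriptors, rdMolDescriptors

def solubility_descriptors(smiles: str) -> dict:
    """Compute the physicochemical descriptors listed above with RDKit."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"Could not parse SMILES: {smiles}")
    return {
        "LogP": Descriptors.MolLogP(mol),
        "TPSA": Descriptors.TPSA(mol),
        "HBD": Descriptors.NumHDonors(mol),
        "HBA": Descriptors.NumHAcceptors(mol),
        "MolWt": Descriptors.MolWt(mol),
        "RotatableBonds": Descriptors.NumRotatableBonds(mol),
        "AromaticRings": rdMolDescriptors.CalcNumAromaticRings(mol),
        "AliphaticRings": rdMolDescriptors.CalcNumAliphaticRings(mol),
        "LabuteASA": Descriptors.LabuteASA(mol),
        "MolMR": Descriptors.MolMR(mol),  # molecular refractivity
    }

print(solubility_descriptors("CC(=O)OC1=CC=CC=C1C(=O)O"))  # aspirin
```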
These descriptors are integrated using our advanced feature integration architecture that:
- Applies consistent normalization to ensure balanced contribution
- Evaluates group-based importance to capture synergistic effects
- Uses a gate mechanism to dynamically weight graph vs. physicochemical features
- Analyzes feature interactions to identify important joint contributions
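The gate mechanism can be sketched as an element-wise blend whose weights are computed from both feature sources. This is a hypothetical minimal version in plain NumPy with random stand-in parameters; the real architecture's dimensions and parameterization are not specified here:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Assumed dimensions: graph embedding and projected descriptors share d = 64.
d = 64
graph_emb = rng.standard_normal(d)   # output of the GNN encoder
phys_emb = rng.standard_normal(d)    # normalized, projected descriptors

# Learned gate parameters (random stand-ins for illustration).
W = rng.standard_normal((d, 2 * d)) * 0.1
b = np.zeros(d)

# The gate sees both feature sources, then blends them element-wise:
# values near 1 favor the graph embedding, values near 0 the descriptors.
gate = sigmoid(W @ np.concatenate([graph_emb, phys_emb]) + b)
fused = gate * graph_emb + (1.0 - gate) * phys_emb

print(fused.shape)  # (64,)
```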
Our feature importance analysis now includes:
- Individual feature assessment
- Group-based feature evaluation
- Feature interaction analysis
- Separate visualization of physicochemical contributions
This comprehensive approach prevents underestimation of physicochemical descriptors and provides insights into their true contribution to prediction quality.
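Group-based, masking-style importance can be illustrated with a toy stand-in model: zero out a whole feature group at once and measure how much the predictions move. The function and group names below are illustrative, not GNNSol's actual implementation:

```python
import numpy as np

def group_importance(predict, X, groups):
    """Masking-based group importance: zero out each feature group and
    measure the mean absolute change in the model's predictions."""
    base = predict(X)
    scores = {}
    for name, idx in groups.items():
        X_masked = X.copy()
        X_masked[:, idx] = 0.0          # mask the whole group at once
        scores[name] = float(np.mean(np.abs(predict(X_masked) - base)))
    return scores

# Toy stand-in model: a fixed linear map (the real predictor is the GNN).
rng = np.random.default_rng(1)
X = rng.standard_normal((100, 6))
w = np.array([2.0, 2.0, 0.1, 0.1, 0.0, 0.0])
predict = lambda X: X @ w

groups = {"graph": [0, 1], "physchem": [2, 3], "noise": [4, 5]}
scores = group_importance(predict, X, groups)
print(scores)  # the "graph" group should dominate here by construction
```

Masking a group jointly, rather than one feature at a time, is what captures the synergistic effects mentioned above: correlated features can individually look unimportant while their group is not.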
- Comprehensive Feature Evaluation: Now includes both graph-based and physicochemical descriptors
- Visual Feature Importance: Automatically generates feature importance plots with color-coded categories
- Customizable Visualization: Top features shown with their relative importance scores
- Compound ID Integration: Uses compound identifiers in reporting instead of just SMILES strings
- Unified Evaluation Report: Single comprehensive report that combines cross-validation and advanced evaluation metrics
- Detailed Feature Breakdown: Includes category-based analysis of most important features
- Single Sample Inference: Custom batch normalization to handle individual molecule predictions
- Robust Uncertainty Quantification: Fixed issues with batch norm during MC dropout uncertainty estimation
- Consistent Performance: Maintains model accuracy while enabling flexible batch sizes
- Better Error Handling: Improved robustness for problematic SMILES strings
- Descriptor Integration: Seamless addition of physicochemical descriptors to graph representation
- Feature Masking Analysis: Allows understanding of each feature's contribution to predictions
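The batch-norm fix referenced above follows a common pattern: during MC-dropout sampling, dropout layers must stay in training mode while batch-norm layers stay in eval mode, so stored running statistics are used even for a single-molecule batch. A sketch with a generic PyTorch model (not GNNSol's actual classes):

```python
import torch
import torch.nn as nn

def mc_dropout_predict(model: nn.Module, x: torch.Tensor, n_samples: int = 50):
    """MC-dropout sampling with batch norm frozen in eval mode."""
    model.eval()                      # freezes BatchNorm running statistics
    for m in model.modules():
        if isinstance(m, nn.Dropout):
            m.train()                 # re-enable only dropout for sampling
    with torch.no_grad():
        preds = torch.stack([model(x) for _ in range(n_samples)])
    return preds.mean(dim=0), preds.std(dim=0)

# Toy model: works even with a single-sample batch, because eval-mode
# BatchNorm1d does not try to compute statistics over a batch of one.
model = nn.Sequential(nn.Linear(14, 32), nn.BatchNorm1d(32), nn.ReLU(),
                      nn.Dropout(0.2), nn.Linear(32, 1))
mean, std = mc_dropout_predict(model, torch.randn(1, 14))
print(mean.shape, std.shape)  # torch.Size([1, 1]) torch.Size([1, 1])
```

Calling the model in full training mode here would raise an error for a batch of one, which is the failure mode the single-sample-inference fix addresses.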
```python
from model.smiles_to_graph import SMILESToGraph
from model.gnn_encoder import GNNEncoder
from model.property_predictor import MoleculeGNN
from model.uncertainty import UncertaintyQuantifier

# Create model
model = MoleculeGNN(input_dim=14, hidden_dim=128, latent_dim=64, n_tasks=1)

# Convert SMILES to graph
converter = SMILESToGraph()
molecule_graph = converter.convert("CC(=O)OC1=CC=CC=C1C(=O)O")  # Aspirin

# Predict with uncertainty
mean_pred, total_std, ci_lower, ci_upper, epistemic_std, aleatoric_std = (
    UncertaintyQuantifier.combined_uncertainty(model, molecule_graph)
)

print(f"Predicted solubility: {mean_pred:.2f} ± {total_std:.2f}")
print(f"95% CI: [{ci_lower:.2f}, {ci_upper:.2f}]")
print(f"Epistemic uncertainty: {epistemic_std:.2f}, Aleatoric uncertainty: {aleatoric_std:.2f}")
```

The system generates several output files in the results directory:
- evaluation_report.txt: Comprehensive report with all metrics and results
- feature_importance.csv: CSV file with detailed feature importance scores
- feature_importance_plot.png: Visualization of feature importance by category
- training_progress.png: Plot showing training and validation loss over epochs
- unseen_predictions_with_uncertainty.png: Visualization of predictions with uncertainty
- uncertainty_*.png: Various uncertainty analysis plots
- best_model.pt: Saved PyTorch model state dictionary
Results are automatically saved in a timestamped directory under model_output_evaluation/.
- PyTorch
- PyTorch Geometric
- RDKit
- NumPy
- Matplotlib
- scikit-learn
- pandas
