GitHub - pkaramertzanis/GNN


This repository accompanies the publication:

Modelling In vitro Mutagenicity Using Multi-Task Deep Learning and REACH Data
Chemical Research in Toxicology
Panagiotis G. Karamertzanis, Mike Rasenberg, Imran Shah and Grace Patlewicz
https://pubs.acs.org/doi/10.1021/acs.chemrestox.5c00152

The main scripts are as follows:
- PyG_app_model_fitting_BO.py: fits single- and multi-task GNN models
- PyTorch_app_model_fitting_BO.py: fits single- and multi-task feed-forward neural network models
- PyG_run_predictions.py: runs predictions using single- and multi-task GNN models
- PyTorch_run_predictions.py: runs predictions using single- and multi-task feed-forward neural network models

Dataset preparation
-------------------

The folder "data" contains one folder per data source. For example, the folder "data/Hansen_2009" contains the Ames dataset
from Hansen et al. 2009, whilst the folder "data/QSARToolbox" contains the genotoxicity data from several databases in the
QSAR Toolbox. For each source there is a Python script, named "{source}_flatten.py", which converts
the raw input data into a flat Excel file of fixed structure with the following columns:

- record ID
- source
- raw input file
- CAS number
- substance name
- smiles
- in vitro/in vivo
- endpoint
- assay
- cell line/species
- metabolic activation
- genotoxicity mode of action
- gene
- notes
- genotoxicity
- reference
- additional source data
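
As a minimal sketch of what "fixed structure" implies, the snippet below checks that one flat record carries exactly the columns listed above and a non-empty SMILES string. The column names are taken from the list; the function name check_flat_record, the use of plain dictionaries instead of an Excel file, and the example values are assumptions made for illustration and are not part of the repository.

```python
# Column names copied from the list above; everything else is illustrative.
FLAT_COLUMNS = [
    "record ID", "source", "raw input file", "CAS number", "substance name",
    "smiles", "in vitro/in vivo", "endpoint", "assay", "cell line/species",
    "metabolic activation", "genotoxicity mode of action", "gene", "notes",
    "genotoxicity", "reference", "additional source data",
]


def check_flat_record(record):
    """Return a list of problems found in one flat record (dict keyed by column name)."""
    problems = []
    missing = [col for col in FLAT_COLUMNS if col not in record]
    if missing:
        problems.append("missing columns: " + ", ".join(missing))
    unexpected = [col for col in record if col not in FLAT_COLUMNS]
    if unexpected:
        problems.append("unexpected columns: " + ", ".join(unexpected))
    # The flattening step only guarantees that a SMILES string is present;
    # the structure itself is not validated at this stage.
    if not str(record.get("smiles") or "").strip():
        problems.append("empty SMILES")
    return problems


# Hypothetical record: all values are placeholders, not real data.
example = dict.fromkeys(FLAT_COLUMNS, "unknown")
example.update({"record ID": 1, "smiles": "CCO", "genotoxicity": "negative"})
print(check_flat_record(example))
```

A check of this kind would typically run at the end of each flattening script, before the flat file is written.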

Each source requires its own flattening script because the raw data format varies. However, once the flat datasets are
produced, they can all be handled in the same way. Please note that the flattening scripts do not check or standardise
molecular structures; they only ensure that a SMILES string is included in the flat dataset. Molecular
structure checking and standardisation are done in the next step. To ensure that all sources are expressed in a
consistent way, the endpoint and assay are standardised to the values used
in the corresponding OECD harmonised templates (OHT 70 and 71, picklists "endpoint" and "type of assay",
https://iuclid6.echa.europa.eu/format).

In the public repo we only provide the raw Hansen 2009 Ames data and the corresponding flattening script. The manuscript contains the
final combined datasets that were used for building the models.

The function data.combine.create_sdf loads the provided flat files and produces a single SDF file that is suitable for building
models. This function also checks the molecular structures for errors and standardises them. The created SDF file contains
one mol block for each unique molecular structure across all data. The function accepts as input the level of aggregation
for multi-task modelling. As an example, if the user provides the task aggregation columns (task_aggregation_cols):
- in vitro/in vivo
- endpoint
- assay

the genotoxicity calls will be aggregated across all
- cell line/species
- metabolic activation
- genotoxicity mode of action
- gene

for the same values in the task aggregation columns. At the coarsest level of aggregation, one can model in vitro
genotoxicity as a single task. A much more granular model would treat each Ames strain separately (the strain is recorded in the
cell line/species column) and also distinguish by metabolic activation.
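
The aggregation described above can be sketched as a group-by over the structure and the task aggregation columns. The snippet below is only an illustration in plain Python and is not the code of data.combine.create_sdf: the function name aggregate_calls, the rule for resolving conflicting calls (unanimous calls are kept, anything else is labelled "ambiguous") and the example values are all assumptions.

```python
from collections import defaultdict


def aggregate_calls(records, task_aggregation_cols,
                    structure_col="smiles", call_col="genotoxicity"):
    """Aggregate genotoxicity calls per (structure, task).

    A task is the tuple of values in task_aggregation_cols; every column
    that is not listed there is aggregated over.
    """
    grouped = defaultdict(list)
    for rec in records:
        task = tuple(rec[col] for col in task_aggregation_cols)
        grouped[(rec[structure_col], task)].append(rec[call_col])

    aggregated = {}
    for key, calls in grouped.items():
        # Assumed conflict rule: keep the call only when all records agree.
        aggregated[key] = calls[0] if len(set(calls)) == 1 else "ambiguous"
    return aggregated


# Hypothetical records: strain labels and calls are placeholders.
base = {"in vitro/in vivo": "in vitro", "endpoint": "gene mutation", "assay": "Ames"}
records = [
    {**base, "smiles": "CCO", "cell line/species": "strain A", "genotoxicity": "positive"},
    {**base, "smiles": "CCO", "cell line/species": "strain B", "genotoxicity": "positive"},
    {**base, "smiles": "CCN", "cell line/species": "strain A", "genotoxicity": "positive"},
    {**base, "smiles": "CCN", "cell line/species": "strain B", "genotoxicity": "negative"},
]

coarse = aggregate_calls(records, ["in vitro/in vivo", "endpoint", "assay"])
for (smiles, task), call in coarse.items():
    print(smiles, task, call)
```

Adding "cell line/species" to the task aggregation columns in this sketch would yield one task per strain, so the two conflicting calls for the second structure would no longer be merged.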

The user can also choose to filter out any record that has the value "unknown" in any of the task aggregation columns.

The user can also choose to include only records for which one of the columns takes a value from a given set, e.g. to only
model a given set of Salmonella typhimurium strains.
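
The two filtering options can be sketched as follows. This is a plain-Python illustration under stated assumptions, not the interface of data.combine.create_sdf: the function name filter_records, the parameter names drop_unknown and allowed_values, and the strain labels are hypothetical.

```python
def filter_records(records, task_aggregation_cols,
                   drop_unknown=True, allowed_values=None):
    """Filter flat records before aggregation.

    drop_unknown   -- discard records with "unknown" in any task aggregation column
    allowed_values -- optional dict mapping a column name to the set of values to keep
    """
    allowed_values = allowed_values or {}
    kept = []
    for rec in records:
        if drop_unknown and any(rec[col] == "unknown" for col in task_aggregation_cols):
            continue
        if any(rec[col] not in allowed for col, allowed in allowed_values.items()):
            continue
        kept.append(rec)
    return kept


# Hypothetical records: strain labels are placeholders.
records = [
    {"assay": "Ames", "cell line/species": "strain A"},
    {"assay": "Ames", "cell line/species": "strain C"},
    {"assay": "unknown", "cell line/species": "strain A"},
]

selected = filter_records(
    records,
    task_aggregation_cols=["assay"],
    allowed_values={"cell line/species": {"strain A", "strain B"}},
)
print(selected)
```

In this sketch only the first record survives: the second is outside the requested set of strains and the third has an unknown assay.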

The user does not need to combine the raw datasets. Instead, they can start from the combined datasets in the supplementary information
of the manuscript and execute the model building scripts from the point where the general parameters are set onwards.