pkaramertzanis/GNN
This repository accompanies the publication:

Modelling In vitro Mutagenicity Using Multi-Task Deep Learning and REACH Data
Chemical Research in Toxicology
Panagiotis G. Karamertzanis, Mike Rasenberg, Imran Shah and Grace Patlewicz
https://pubs.acs.org/doi/10.1021/acs.chemrestox.5c00152

The main scripts are as follows:
- PyG_app_model_fitting_BO.py: fits single and multi-task GNN models
- PyTorch_app_model_fitting_BO.py: fits single and multi-task feed forward neural network models
- PyG_run_predictions.py: runs predictions using single and multi-task GNN models
- PyTorch_run_predictions.py: runs predictions using single and multi-task feed forward neural network models

Dataset preparation
-------------------
The folder "data" contains one folder per source of data. For example, the folder "data/Hansen_2009" contains the Ames dataset from Hansen et al. 2009, whilst the folder "data/QSARToolbox" contains the genotoxicity data from several databases in the QSAR Toolbox. For each source there is a Python script, named "{source}_flatten.py", the purpose of which is to convert each raw input data source into a flat Excel file of fixed structure, with the following columns:
- record ID
- source
- raw input file
- CAS number
- substance name
- smiles
- in vitro/in vivo
- endpoint
- assay
- cell line/species
- metabolic activation
- genotoxicity mode of action
- gene
- notes
- genotoxicity
- reference
- additional source data

Each source requires its own flattening script because the raw data format varies. However, once the flat datasets are produced, they can be handled in the same way. Please note that the flattening scripts do not do any molecular structure checking and standardisation, other than ensuring that a SMILES string is included in the flat dataset. The molecular structure checking and standardisation is done in the next step.
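
As an illustration of the fixed flat structure described above, the sketch below checks a single record against the listed columns and confirms that a SMILES string is present. The column names are taken from the list above. The helper name check_flat_record and the example values are hypothetical and are not part of the repository, and no structure checking is attempted, in line with what the flattening scripts do.

```python
# Column names copied from the list of flat-file columns in this README.
FLAT_COLUMNS = [
    "record ID", "source", "raw input file", "CAS number", "substance name",
    "smiles", "in vitro/in vivo", "endpoint", "assay", "cell line/species",
    "metabolic activation", "genotoxicity mode of action", "gene", "notes",
    "genotoxicity", "reference", "additional source data",
]


def check_flat_record(record):
    """Return a list of problems found in one flat record.

    Hypothetical helper for illustration; not a function of the repository.
    """
    problems = []
    missing = [col for col in FLAT_COLUMNS if col not in record]
    if missing:
        problems.append("missing columns: " + ", ".join(missing))
    extra = [col for col in record if col not in FLAT_COLUMNS]
    if extra:
        problems.append("unexpected columns: " + ", ".join(extra))
    # Only the presence of a SMILES string is checked, not its validity.
    smiles = record.get("smiles")
    if not isinstance(smiles, str) or not smiles.strip():
        problems.append("empty SMILES")
    return problems


# Made-up record: every field is a placeholder except the ones set below.
example = {col: "unknown" for col in FLAT_COLUMNS}
example.update({"record ID": 1, "smiles": "CCO", "genotoxicity": "negative"})
print(check_flat_record(example))
```

A check of this kind can be run on every row of a flat file before the files are combined, so that a source-specific flattening script that drifts from the common structure is caught early.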

To ensure that all sources are expressed in a consistent way, the endpoint and assay are standardised using their values in the corresponding OECD harmonised templates (OHT 70 and 71, picklists "endpoint" and "type of assay", https://iuclid6.echa.europa.eu/format). In the public repo we only provide the raw Hansen 2009 Ames data and the corresponding flattening script. The manuscript provides the final combined datasets that were used for building the models.

The function data.combine.create_sdf loads the provided flat files and produces a single SDF file that is suitable for building models. This function also checks the molecular structures for errors and standardises them. The created SDF file contains one mol block for each unique molecular structure across all data. The function accepts as input the level of aggregation for multi-task modelling. As an example, if the user provides the task aggregation columns (task_aggregation_cols):
- in vitro/in vivo
- endpoint
- assay

the genotoxicity calls will be aggregated across all values of:
- cell line/species
- metabolic activation
- genotoxicity mode of action
- gene

for the same values in the task aggregation columns. At the coarsest level of aggregation, one can model in vitro genotoxicity as a whole. A much more granular model would treat each Ames strain separately (the strain is recorded in the cell line/species column) and would also distinguish the metabolic activation. The user can choose to filter out any record that has the value "unknown" in any of the task aggregation columns. The user can also choose to include only records for which one of the columns takes one of a given set of values, e.g. to model only a given set of Salmonella Typhimurium strains.

The user does not need to combine the raw datasets. Instead, they can start from the combined datasets in the supplementary information of the manuscript and execute the model building scripts from the point where the general parameters are set onwards.
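
The effect of the task aggregation columns can be sketched with plain Python. This is not the implementation of data.combine.create_sdf. The function aggregate_calls, its parameters drop_unknown and keep_only, the helper make_record and all records are hypothetical. In particular, marking conflicting calls as "ambiguous" is an assumption made for this example, and the repository may resolve conflicts differently.

```python
from collections import defaultdict


def aggregate_calls(records, task_aggregation_cols, drop_unknown=True, keep_only=None):
    """Aggregate genotoxicity calls per structure and task (illustrative only).

    keep_only maps a column name to the set of values to retain, for example
    {"cell line/species": {"TA98", "TA100"}}.
    """
    calls = defaultdict(set)
    for record in records:
        task = tuple(record[col] for col in task_aggregation_cols)
        if drop_unknown and "unknown" in task:
            continue  # optional filter on "unknown" task values
        if keep_only and any(record[col] not in allowed
                             for col, allowed in keep_only.items()):
            continue  # optional filter on a given set of values
        calls[(record["smiles"], task)].add(record["genotoxicity"])
    # ASSUMPTION: conflicting calls are flagged as "ambiguous", not resolved.
    return {
        key: next(iter(values)) if len(values) == 1 else "ambiguous"
        for key, values in calls.items()
    }


def make_record(smiles, strain, call, setting="in vitro"):
    """Build a made-up flat record; the calls are not experimental results."""
    return {"smiles": smiles, "in vitro/in vivo": setting,
            "endpoint": "gene mutation", "assay": "Ames",
            "cell line/species": strain, "genotoxicity": call}


records = [
    make_record("CCO", "TA98", "negative"),
    make_record("CCO", "TA100", "negative"),
    make_record("CCN", "TA98", "positive"),
    make_record("CCN", "TA100", "negative"),
    make_record("CCC", "TA98", "positive", setting="unknown"),
]

coarse_cols = ["in vitro/in vivo", "endpoint", "assay"]
coarse = aggregate_calls(records, coarse_cols)
granular = aggregate_calls(records, coarse_cols + ["cell line/species"])
print(len(coarse), len(granular))
```

With the three coarse columns, the two strains of each structure collapse into a single task, so disagreement between strains surfaces as a conflict. Adding "cell line/species" to the task aggregation columns keeps one call per strain instead.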