This repository provides the code and resources supporting the manuscript "Deep Learning Paradigm for Precision Lung Cancer Therapy with AI-Driven Genotype-Phenotype Mining and Patient-Derived Organoid Validation" (under submission). It implements a deep learning framework that integrates patient genomic sequencing data with compound structural information, trained on phenotypic drug sensitivity profiles from lung cancer patient-derived organoids (PDOs), to enable rapid and accurate prediction of antitumor drug responses. The model achieves 81.6% prediction accuracy, validated through PDO experiments and a clinical lung cancer cohort, facilitating individualized therapy, novel compound evaluation, and drug repurposing in precision oncology.
import os
import random
import torch
from torch import nn
import torch.optim as optim
from tsme import tokenizers as tkz
from tsme import models, datasets
tokenizer = tkz.PmtedTokenizer.from_jsonfile(tkz.PmtedTokenizer.DEFAULT_CONF)
ds = datasets.TsmeBase(smi_tokenizer=tokenizer)
smiles = ["c1ccccc1", "CC[N+](C)(C)Cc1ccccc1Br"]
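# Each sample is a binary (0/1) mutation vector of length 3008; random values stand in for real genotypes
mutations = [[random.choice([0, 1]) for _ in range(3008)] for _ in range(2)]
# Target drug-response values for regression, one per sample
values = [0.85, 0.78]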
mut_x, smi_src, smi_tgt, out = ds(mutations, smiles, values)
# Regression training: start from the pretrained checkpoint, then run one optimization step
model = models.Tsme(models.Tsme.DEFUALT_CONF)
model.load_pretrained(
    torch.load('src/moltx.ckpt', map_location=torch.device('cpu'))
)
mse_loss = nn.MSELoss()
optimizer = optim.Adam(
    model.parameters(),
    lr=1e-04,
    foreach=False
)
optimizer.zero_grad()
pred = model(src=smi_src, tgt=smi_tgt, mutation=mut_x)
loss = mse_loss(pred, out)
loss.backward()
optimizer.step()
torch.save(model.state_dict(), '/path/to/tsme.ckpt')

import random
import torch
from tsme import tokenizers as tkz
from tsme import models, pipelines
# Regression inference
tokenizer = tkz.PmtedTokenizer.from_jsonfile(tkz.PmtedTokenizer.DEFAULT_CONF)
model = models.Tsme(models.Tsme.DEFUALT_CONF)
model.load_state_dict(
    torch.load('/path/to/tsme.ckpt', map_location=torch.device("cpu"))
)
pipeline = pipelines.TsmeTeg(
    smi_tokenizer=tokenizer, model=model
)
mutations = [random.choice([0, 1]) for _ in range(3008)]
smiles = "CC[N+](C)(C)Cc1ccccc1Br"
predict = pipeline(mut=mutations, smi=smiles)  # e.g. 0.85

The framework's embeddings build on two prior works. Genotype embeddings are adapted from the DrugCell model (Ma et al., Cancer Cell, 2020; https://pubmed.ncbi.nlm.nih.gov/33096023/; GitHub: https://github.com/idekerlab/DrugCell), which represents each tumor genotype as a binary mutation vector over ~3,000 frequently mutated genes and processes it with a visible neural network (VNN) whose structure mirrors the gene ontology hierarchy to encode cellular subsystems. Compound embeddings use our AdaMR model (arXiv:2401.06166; GitHub: https://github.com/js-ish/MolTx), which applies adaptive multi-resolution representation learning, pre-trained via molecular canonicalization, to encode chemical structures at both the substructure and atomic levels for predictive and generative tasks.
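The binary mutation vector mentioned above can be illustrated with a small sketch. This is not the repository's preprocessing code: the five-gene `GENE_PANEL` and the helper `encode_mutations` are hypothetical names chosen for illustration, and a real input would follow the model's fixed ordering of 3008 genes.

```python
from typing import Iterable, List, Sequence

# Hypothetical, tiny gene panel for illustration only; the model itself
# expects a fixed ordering of 3008 genes.
GENE_PANEL: Sequence[str] = ("EGFR", "KRAS", "TP53", "ALK", "BRAF")


def encode_mutations(
    mutated_genes: Iterable[str], panel: Sequence[str] = GENE_PANEL
) -> List[int]:
    """Return a 0/1 vector with 1 at each panel position whose gene is mutated."""
    mutated = set(mutated_genes)
    # Genes that are not part of the panel are ignored in this sketch.
    return [1 if gene in mutated else 0 for gene in panel]


vector = encode_mutations(["TP53", "EGFR", "STK11"])
print(vector)  # [1, 0, 1, 0, 0]
```

A vector built this way over the full gene ordering would take the place of the random `mutations` list in the snippets above.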
Sample datasets for demonstrating the model are located in the project's example/ directory. Access to the core datasets used for training requires approval through the NGDC platform.