ccle_example

Prerequisites

The first step is to download the necessary files from the Cancer Cell Line Encyclopedia. For the purpose of the HitWalker2 instance we will use the following files:

cell_line_annotations.txt
CCLE_hybrid_capture1650_hg19_NoCommonSNPs_NoNeutralVariants_CDS_2012.05.07.maf.gz
CCLE_NP24.2009_Drug_data_2012.02.20.csv 
CCLE_Expression.Arrays.sif_2012-10-18.txt
CCLE_Expression.Arrays_2013-03-18.tar.gz

Note that the RNeo4j package needs to be installed from github, this is done automatically for a Vagrant-based HitWalker2 build but can also be done manually using the devtools package:

library(devtools)
install_github("nicolewhite/RNeo4j")

We can then load up the necessary libraries.

suppressPackageStartupMessages(library(hwhelper))
library(RNeo4j)

Subject Data

First we will make the patient/cellLine to sample mappings. This involves creation of a subject data.frame. This object should contain the subject name(s), aliases and any other metadata which will be displayed to the end user. Users should exercise restraint when specifying metadata as only up to 20 total levels can be displayed at a given time. In the example below we will use gender as well as the 'primary site' as our metadata.

cell.annots <- read.delim("cell_line_annotations.txt", sep="\t", header=T, stringsAsFactors=F)

cell.annots$alias <- ifelse(cell.annots$Cell.line.aliase == "", cell.annots$Cell.line.primary.name,
    paste(cell.annots$Cell.line.primary.name, cell.annots$Cell.line.aliases, sep=" | "))
any(grepl("&", cell.annots))

## [1] FALSE

cell.annots$alias <- gsub("\\s+\\|\\s+", "&", cell.annots$alias)

Here we created an 'alias' column which will allow the end user's to search on the subject name as well as the aliases. The aliases should be delimited by '&' which is the default delimiter understood by load.neo4j--the internal function that loads these data into Neo4j.

Similarly we can keep/add additional columns which represent the metadata for these subjects. It is at this point that these data should be recoded and made user-digestable. Spaces should be replaced with '_'. After the subject data.frame is ready, we can create a Subject object. By specifying type here, it will be used as part of the initial subject/sample display.

subject.dta <- cell.annots[,c("CCLE.name", "alias", "Gender", "Site.Primary")]
names(subject.dta) <- c("cellLine", "alias", "gender", "primary_site")

subject.dta$gender[subject.dta$gender == ""] <- "U"
gend.tab <- c(U="Unknown", M="Male", F="Female")
subject.dta$gender <- as.character(gend.tab[subject.dta$gender])

subj <- Subject(subject.info=subject.dta, type="primary_site")

Variant Data

Variant data can come in several forms. For cancer next generation DNA sequencing experiments such as this one, the mutation annotation format (MAF) is a common way to summarize the variant data. Currently we provide parsing support for this CCLE MAF file though as MAF files are roughly standardized, this should be applicable to other variants as well.

maf.file <- "CCLE_hybrid_capture1650_hg19_NoCommonSNPs_NoNeutralVariants_CDS_2012.05.07.maf.gz"

maf.obj <- readMAF.ccle(maf.file)

#add in these sample annotations to the Subject object

all(sampleNames(maf.obj) %in% cell.annots$CCLE.name)

## [1] TRUE

If we wish to be able to directly subset cell lines based on variants/drug sensitivities of interest we can add them to the subject information data.frame and (re)create the Subject object. The findHits method for this object finds those samples with variants.

Note the use of the addSamples method which is used to automatically add Subject -> Sample mapping for the MAF object.

nras.var <- findHits(maf.obj, samples=sampleNames(maf.obj), genes="4893")

subject.dta$NRAS_mutant <- ifelse(subject.dta$cellLine %in% nras.var$Sample, "NRAS_pos", "NRAS_neg")

subj <- Subject(subject.info=subject.dta, type="primary_site")

addSamples(subj, type="DNASeq") <- maf.obj

Drug Data

Next we will prepare the drug assay data. For this purpose we can create a DrugMatrix object which consists of a drug x sample matrix as well as a mapping between drug and gene. The dowloaded file contains this information, though we need to perform semi-automatic gene symbol curation using the Entrez ID information already contained in the base Neo4j database.

library(RNeo4j)

graph = startGraph("http://localhost:7474/db/data/")

drug.dta <- read.csv("CCLE_NP24.2009_Drug_data_2012.02.20.csv", stringsAsFactors=F)

drug.names <- sub("c-", "", unique(drug.dta$Target))

targ.matches <- lapply(drug.names, function(x){

    cypher(graph, "MATCH (gene:EntrezID)-[r:REFERRED_TO]-(symb) WITH gene,symb, r,
                    CASE r.synonyms WHEN null THEN [symb.name] ELSE [symb.name]+r.synonyms END AS gene_syns WHERE ANY(x IN gene_syns WHERE x = {GENE})
                    RETURN gene.name, r.synonyms, symb.name", GENE=x)    
                    
                    })
                    
names(targ.matches) <- drug.names

For each of the records, we consider it a match if a single entry was found or if the input symbol matched the display name. If multiple equally good matches are found then we will weight them equally. In addition we will disregard any gene targets not found.

gene.ids <- mapply(function(x, name){
  
  if (is.null(x)){
    return(NULL)
  }else if (nrow(x) == 1){
    return(x$gene.name)
  }else if (nrow(x) > 1){
    
    is.match <- x$symb.name == name
    
    if (sum(is.match) == 1){
      
      return(x$gene.name[is.match])
      
    }else if (sum(is.match) > 1){
      return(x$gene.name[is.match])
    }else{
      return(x$gene.name)
    }
    
  }

}, targ.matches, names(targ.matches))

The missing targets may refer to gene families more than specific genes and that case would have to be treated specially but we will disregard them for now.

print (names(gene.ids[sapply(gene.ids, is.null)]))

## [1] "HSP90" "RTK"   "FGFR"

The automatic approach above is not perfect, for example, the ABL gene is likely actually mapped to ABL1 (gene 25) as opposed to MTTP.

print (gene.ids[sapply(gene.ids, length) > 1])

## $ABL
## [1] "25"   "4547"
## 
## $GS
## [1] "324"  "2752"

gene.ids[["ABL"]] <- "25"

The GS gene looks ambigous so we will leave it assigned to the two genes reported by NCBI. Now lets load in the drug data.

gene.stack <- stack(gene.ids)

## Warning in stack.default(gene.ids): non-vector elements will be ignored

gene.stack$ind <- as.character(gene.stack$ind)

#load in the drug/gene annotation

names(gene.stack) <- c("entrezID", "Target")
gene.stack$weight <- ifelse(gene.stack$Target == "GS", .5, 1)

drug.genes <- drug.dta[,c("Compound", "Target")]

drug.genes <- drug.genes[!duplicated(drug.genes),]

drug.genes.merge <- merge(drug.genes, gene.stack, by="Target", all=T, incomparables=NA, sort=F)

drug.genes.merge <- drug.genes.merge[complete.cases(drug.genes.merge),]
drug.genes.merge <- drug.genes.merge[,c("Compound", "entrezID", "weight")]
names(drug.genes.merge) <- c("drug", "gene", "weight")

#load both the annotations as well as the matrix data

drug.mat <- acast(formula=Compound~CCLE.Cell.Line.Name,data=drug.dta, value.var="IC50..uM.")

drug <- DrugMatrix(mat=drug.mat, mapping=drug.genes.merge)

Note the use of the 'weight' column to represent ambiguity in the 'drug' -> 'gene' assignments. The 'GeneScore' algorithm used internally by HitWalker2 is based off of the algorithm of Tyner et al. (Tyner et al. 2013) and these weights are used internally as part of the summations.

We can also check the implicit subject -> drug sample mapping

all(sampleNames(drug) %in% cell.annots$CCLE.name)

## [1] FALSE

setdiff(sampleNames(drug), cell.annots$CCLE.name)

## [1] "BGC823_STOMACH" "COLO677_LUNG"   "COLO699_LUNG"   "DOV13_OVARY"   
## [5] "GLC82_LUNG"     "M059J"          "RKN_OVARY"      "SF8657"        
## [9] "SNUC2B"

The addSamples method will deal with the discordant samples by creating sample nodes but not assigning relationships between them and the subjects. Which means that:

addSamples(subj, type="Drug_Assay") <- drug

Implies this:

addSamples(subj) <- data.frame(cellLine=intersect(sampleNames(drug), subject.sample.names),
                                sample=intersect(sampleNames(drug), subject.sample.names),
                                type="Drug_Assay", stringsAsFactors=F)

Processing expression data

To process the expression data we will choose the top 10% most variable genes (in terms of IQR) and perform a batch correction based on the Batch information in the array sample information file. The resulting matrix is then converted back to an ExpressionSet for loading into the database.

sif.dta <- read.delim("CCLE_Expression.Arrays.sif_2012-10-18.txt", sep="\t", header=T, stringsAsFactors=F)

library(affy)
library(genefilter)
library(sva)

all.genes <- ReadAffy(path="CCLE_Expression.Arrays_2013-03-18", filenames=list.files(pattern="CEL"))

all(sif.dta$ID %in% sub(".CEL", "", colnames(all.genes)))

all.exprs <- rma(all.genes[,paste0(sif.dta$ID, ".CEL")])

filt.exprs.list <- nsFilter(all.exprs, require.entrez=TRUE,
         remove.dupEntrez=FALSE, var.func=IQR,
         var.cutoff=0.90, var.filter=TRUE,
         filterByQuantile=TRUE, feature.exclude="^AFFX")

filt.exprs <- filt.exprs.list$eset

#Do combat correction based on batch

batch <- sif.dta$Batch
names(batch) <- paste0(sif.dta$ID, ".CEL")

batch <- batch[colnames(filt.exprs)]

int.mod <- model.matrix(~1, data=sif.dta)

comb.exprs <- ComBat(dat=exprs(filt.exprs), batch=batch, mod=int.mod, numCovs=NULL, par.prior=T)

ccle.eset <- ExpressionSet(assayData=comb.exprs, annotation=annotation(filt.exprs))

save(ccle.eset, file="ccle_eset.RData")

For expression, probeset->gene mapping info doesn't need to be supplied by an ExpressionSet as it can be populated using Bioconductor's annotation mechanism.

As before we still need to supply the mapping between the CEL file names and the subject names and we will do this base on the SIF file. Instead of applying addSamples an object, we will supply it a data.frame with the mapping info as below.

sif.dta$ID <- paste0(sif.dta$ID, ".CEL")

all(sif.dta$ID %in% sampleNames(ccle.eset))

## [1] TRUE

all(sif.dta$CCLE.name %in% subj@subject.info$cellLine)

## [1] TRUE

expr.annot <- sif.dta[,c("CCLE.name", "ID")]
names(expr.annot) <- c("cellLine","sample")
expr.annot$type <- "Affy_Expression"

addSamples(subj) <- expr.annot

Creating the database

Finally we will generate a HW2Config object. To do this we first specify the subject data and the type of gene models (currently only 'entrez'). Next we tell HitWalker2 which of the datatypes are going to be prioritized ('target') and which datatype(s) is going to be used to provide the basis of the prioritization ('seeds'). Finally we provide the data objects along with supplied names.

We can then populate the database and create the configuration files.

hw2.conf <- makeHW2Config(subject=subj, gene.model="entrez",
        data.types=list(seeds="GeneScore", target="Variants"),
        Expression=HW2exprSet(ccle.eset), GeneScore=drug, Variants=maf.obj)

save(hw2.conf, file="hw2_config.RData")

populate(hw2.conf)

## Missing CONSTRAINT(s) for: CellLine will build them first and continue with loading...
## Loading into Neo4j...
## Preprocessing...
## Missing CONSTRAINT(s) for: Sample will build them first and continue with loading...
## Loading into Neo4j...
## Starting: 1
## Preprocessing...
## Missing CONSTRAINT(s) for: ProbeSet will build them first and continue with loading...
## Loading into Neo4j...
## Loading required package: hgu133plus2.db
## Loading required package: AnnotationDbi
## Loading required package: stats4
## Loading required package: GenomeInfoDb
## Loading required package: S4Vectors
## Creating a generic function for 'nchar' from package 'base' in package 'S4Vectors'
## 
## Attaching package: 'S4Vectors'
## 
## The following object is masked from 'package:igraph':
## 
##     compare
## 
## Loading required package: IRanges
## 
## Attaching package: 'IRanges'
## 
## The following object is masked from 'package:igraph':
## 
##     simplify
## 
## Loading required package: org.Hs.eg.db
## Loading required package: DBI
## 
## 
## Preprocessing...
## Found all necessary CONSTRAINTS, continuing...
## Loading into Neo4j...
## Starting: 2
## Missing CONSTRAINT(s) for: Drug will build them first and continue with loading...
## Loading into Neo4j...
## Preprocessing...
## Found all necessary CONSTRAINTS, continuing...
## Loading into Neo4j...
## Preprocessing...
## Found all necessary CONSTRAINTS, continuing...
## Loading into Neo4j...
## Starting: 3
## Preprocessing...
## Missing CONSTRAINT(s) for: Variation will build them first and continue with loading...
## Loading into Neo4j...
## Preprocessing...
## Found all necessary CONSTRAINTS, continuing...
## Loading into Neo4j...

configure(hw2.conf)

## Creating config files from templates
## Staging config files
## Setting up config files
## Getting labels from DB
## Starting StringID
## Starting EntrezID
## Starting Symbol
## Starting Pathway
## Starting CellLine
## Starting Sample
## Starting ProbeSet
## Starting Drug
## Starting Variation

Finally we provide utilities to gain information on the Neo4j graph structure that was created

ccle.graph <- compute.graph.structure()

## Getting labels from DB
## Starting StringID
## Starting EntrezID
## Starting Symbol
## Starting Pathway
## Starting CellLine
## Starting Sample
## Starting ProbeSet
## Starting Drug
## Starting Variation

plot(ccle.graph)

sessionInfo()

## R version 3.2.1 (2015-06-18)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 14.04.2 LTS
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats4    parallel  methods   stats     graphics  grDevices utils    
## [8] datasets  base     
## 
## other attached packages:
##  [1] hgu133plus2.db_3.1.3 org.Hs.eg.db_3.1.2   RSQLite_1.0.0       
##  [4] DBI_0.3.1            AnnotationDbi_1.30.1 GenomeInfoDb_1.4.1  
##  [7] IRanges_2.2.5        S4Vectors_0.6.2      RNeo4j_1.4.2        
## [10] hwhelper_0.99.14     rjson_0.2.15         igraph_1.0.1        
## [13] reshape2_1.4.1       Biobase_2.28.0       BiocGenerics_0.14.0 
## [16] rmarkdown_0.7       
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_0.11.6     knitr_1.10.5    magrittr_1.5    R6_2.1.0       
##  [5] stringr_1.0.0   httr_1.0.0      plyr_1.8.3      tools_3.2.1    
##  [9] htmltools_0.2.6 yaml_2.1.13     digest_0.6.8    RJSONIO_1.3-0  
## [13] formatR_1.2     curl_0.9.1      evaluate_0.7    stringi_0.5-5  
## [17] jsonlite_0.9.16

References

Tyner, Jeffrey, Wayne Yang, Armand Bankhead, Guang Fan, Luke Fletcher, Jade Bryant, Jason Glover, et al. 2013. “Kinase Pathway Dependence in Primary Human Leukemias Determined by Rapid Inhibitor Screening.” Cancer Research 73. AACR: 285–96.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ccle_example

Prerequisites

Subject Data

Variant Data

Drug Data

Processing expression data

Creating the database

References

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Clone this wiki locally