SessionExprFiles
Brian Fox edited this page Sep 11, 2022 · 2 revisions
Three main files are required: the series info file (series_info.csv), the samples file (samples.csv), and the expression matrix (expr.txt). Each has its own page with detailed specifications.
These files are optional, or can be generated by performing score calculations:
- Score info: score_info.csv
- Coords: coord_XYZ.csv
The above files have a "GEO" prefix if they are in a GEO type session, and an "expr" prefix if they are in a base expression session. For example, the expression matrix is called "GEO.expr.txt" in a GEO session or "expr.expr.txt" in an expression session. Similarly, the samples file would be "GEO.samples.csv" or "expr.samples.csv".
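The naming rule above can be sketched as a small R helper. This is only an illustration: the function name `session_file_name` and the session type labels "GEO" and "expression" are assumptions made for this example, not part of the tool.

```r
# Hypothetical helper (not part of the tool): build the expected file name
# from the session type and the base file name.
session_file_name <- function(session_type, base_name) {
  prefix <- switch(session_type,
                   GEO = "GEO",          # GEO type session
                   expression = "expr",  # base expression session
                   stop("unknown session type: ", session_type))
  paste(prefix, base_name, sep = ".")
}

session_file_name("GEO", "expr.txt")            # "GEO.expr.txt"
session_file_name("expression", "expr.txt")     # "expr.expr.txt"
session_file_name("expression", "samples.csv")  # "expr.samples.csv"
```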
Read the specification pages for those first three files carefully; most problems occur when formatting them.
- I haven't worked out all the issues with getting non-ASCII Unicode characters through the whole workflow. I often see these characters in GEO descriptions, so you may need to edit the series info file, look for characters such as long dashes, the "ï" in "naïve", Greek letters, and "±", and change them to ASCII characters.
- As stated in its specification page, the expression matrix file (a tab-delimited file) should not have a column name for the row names (which are gene names). In other words, be sure that the first character in the expr.txt file is a tab.
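Both points above can be checked from R before loading a session. The following is a minimal sketch using base R only. The helper names `find_non_ascii` and `starts_with_tab` and the toy matrix are made up for this example. It assumes the text is UTF-8, and it assumes that `write.table()` with `col.names = NA` and `quote = FALSE` is an acceptable way to produce the leading tab.

```r
# Hypothetical helper: return the strings that contain non-ASCII characters.
# iconv() gives NA (or, on some platforms, a substituted string) when a
# string cannot be represented in ASCII, so both cases are checked.
find_non_ascii <- function(x) {
  converted <- iconv(x, from = "UTF-8", to = "ASCII")
  x[is.na(converted) | converted != x]
}

# Hypothetical helper: TRUE if the first character of the file is a tab.
starts_with_tab <- function(path) {
  first_line <- readLines(path, n = 1, warn = FALSE)
  length(first_line) == 1 && startsWith(first_line, "\t")
}

# Toy values standing in for the "value" column of a series info file.
find_non_ascii(c("naive T cells", "na\u00efve T cells"))

# Toy expression matrix: gene names as row names, sample ids as column names.
toy_expr <- matrix(1:4, nrow = 2,
                   dimnames = list(c("GENE1", "GENE2"), c("S1", "S2")))
toy_path <- tempfile(fileext = ".txt")

# col.names = NA writes an empty header field above the row names, so with
# quote = FALSE the header line begins with a tab.
write.table(toy_expr, file = toy_path, sep = "\t", quote = FALSE, col.names = NA)
starts_with_tab(toy_path)  # TRUE
```

Any strings returned by `find_non_ascii` are candidates for manual replacement with ASCII equivalents in the series info file.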
You can try running this R script to check your files for some of the potential errors:
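# Load the three required files
ser.info = read.csv(file="series_info.csv")
samples = read.csv(file="samples.csv")
expr = read.table(file="expr.txt", sep="\t", header=TRUE, quote="")
# If the first value in the first column of samples is "_id", use that column as row names
if (samples[1,1] == "_id") {
  if (any(duplicated(samples[,1]))) stop("First column in samples.csv file has duplicate values")
  rownames(samples) = samples[,1]
  samples[,1] = NULL
}
# Read only the first line of expr.txt and compare its field count with the parsed data columns
expr_first_line = read.table(file="expr.txt", sep="\t", header=FALSE, quote="", nrows=1)
if (length(expr_first_line) == ncol(expr)) stop("header row of expr.txt should only have colnames for data columns, no col name for the rownames")
# Gene names must be unique, ignoring case
if (any(duplicated(toupper(rownames(expr))))) stop("duplicate gene names")
# Sample ids must match the expression matrix column names, in the same order
if (! all(rownames(samples) == colnames(expr))) stop("samples ids don't match")
# series_info.csv must have exactly the columns "key" and "value"
if (! identical(colnames(ser.info), c("key", "value"))) stop("series_info.csv columns should be 'key' and 'value'")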
(c) 2015-2025, Needle Genomics LLC