iDecoder Update
| Date | Count of IDENTIFIERS | Count of IDENTIFIER_LINKS |
|---|---|---|
| Prior to 5/26/06 | 1,866,625 | 3,614,678 |
| 5/26/06 | 2,115,668 | 4,032,055 |
| 10/06/06 | 2,149,889 | 4,032,067 |
| 7/19/07 | 2,551,937 | 5,084,356 |
| 11/02/07 | 2,669,103 | 5,133,013 |
| 4/08/09 | 2,732,872 | 5,378,805 |
| 12/07/09 | 2,930,543 | 5,501,504 |
| 08/13/10 | 4,466,570 | 7,729,776 |
| 01/15/11 | 5,569,750 | 9,255,549 |
| 11/30/11 | 5,421,171 | 9,127,605 |
Data files should be downloaded onto stan and parsed there.

```shell
cp -R /Users/smahaffey/iDecoder/InputFiles /Users/smahaffey/iDecoder/InputFiles_OLD
cp -R /Users/smahaffey/iDecoder/Output /Users/smahaffey/iDecoder/Output_OLD
export INPUT_FILES=/Users/smahaffey/iDecoder/InputFiles
```

Log on to the production database before downloading the files. Run the 'getDistinctArrays.sql' script in /Users/chornbak/cherylh/sql to see the list of arrays currently used by experiments. Double-check against the list below to make sure all chips are included in the list of files to be downloaded.
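The backup-and-setup step above can be sketched as a small script. `IDECODER_HOME` and `backup_idecoder_dirs` are names introduced here for illustration; the paths mirror the cp commands above.

```shell
#!/bin/sh
# Sketch of the backup step: snapshot InputFiles and Output before a new
# run, then point INPUT_FILES at the working copy. IDECODER_HOME is an
# assumed variable standing in for /Users/smahaffey/iDecoder.
IDECODER_HOME="${IDECODER_HOME:-$HOME/iDecoder}"

backup_idecoder_dirs() {
    for dir in InputFiles Output; do
        src="$IDECODER_HOME/$dir"
        dest="${src}_OLD"
        if [ -d "$src" ]; then
            rm -rf "$dest"          # replace any stale backup from a prior run
            cp -R "$src" "$dest"
        else
            echo "warning: $src does not exist, skipping" >&2
        fi
    done
}

backup_idecoder_dirs
export INPUT_FILES="$IDECODER_HOME/InputFiles"
```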
In separate windows, edit:

```shell
vi $WEB/common/siteVersion.jsp
vi $SRC/edu/ucdenver/ccp/iDecoder/file.properties
```
Important Links:
Note: a login is required to fetch these files. Use chornbak@mindspring.com / affymetrix253. Download the following files into $INPUT_FILES/Affymetrix:
- Affymetrix Genechip Drosophila Genome [DrosGenome1]
- Affymetrix GeneChip Human Genome U133 Plus 2.0 [HG-U133_Plus_2]
- Affymetrix MoEx-1_0-st-v1 Probeset Annotations
- These two files are under the Technical Documentation tab, under the NetAffx Annotation Files heading
- Affymetrix MoEx-1_0-st-v1 Transcript Annotations
- Affymetrix GeneChip Mouse Expression Array MOE430A [MOE430A]
- Affymetrix GeneChip Mouse Expression Array MOE430B [MOE430B]
- Affymetrix GeneChip Mouse Genome 430 2.0 [Mouse430_2]
- Affymetrix GeneChip Murine Genome U74A [MG_U74A] (use file from InputFiles_OLD)
- Affymetrix GeneChip Murine Genome U74Av2 [MG_U74Av2] (use file from InputFiles_OLD)
- Affymetrix GeneChip Murine Genome U74Bv2 [MG_U74Bv2] (use file from InputFiles_OLD)
- Affymetrix GeneChip Murine Genome U74Cv2 [MG_U74Cv2] (use file from InputFiles_OLD)
- Affymetrix RnEx-1_0-st-v1 Probeset Annotations
- Affymetrix RnEx-1_0-st-v1 Transcript Annotations
- Affymetrix GeneChip Rat Expression Array RAE230A [RAE230A]
- Affymetrix GeneChip Rat Genome U34A [RG_U34A]
- Affymetrix GeneChip Rat Genome U34C [RG_U34C]
```shell
cd $INPUT_FILES/Affymetrix
unzip \*.zip
```
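To double-check the download against the chip list above, a short shell check can confirm each expected annotation file is present. `check_affy_files` is a hypothetical helper; the filename stems are the chip names from the list, and real NetAffx files carry a release suffix (e.g. `.na32.annot.csv`), which the glob allows for.

```shell
#!/bin/sh
# Sketch: verify one annotation CSV exists per chip after unzipping.
# The chip-name stems are assumptions based on the list above; adjust
# them if the NetAffx filenames differ.
check_affy_files() {
    dir="$1"
    missing=0
    for chip in DrosGenome1 HG-U133_Plus_2 MoEx-1_0-st-v1 MOE430A MOE430B \
                Mouse430_2 MG_U74A MG_U74Av2 MG_U74Bv2 MG_U74Cv2 \
                RnEx-1_0-st-v1 RAE230A RG_U34A RG_U34C; do
        if ! ls "$dir/$chip"*.csv >/dev/null 2>&1; then
            echo "missing annotation file for $chip"
            missing=1
        fi
    done
    return $missing
}
```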
CodeLink is out of business, so the following link no longer works:
- Download UniSet Mouse I
- Download Mouse Whole Genome
- Download Rat Whole Genome
During the last iDecoder update, the following custom arrays showed up in the getDistinctArrays list:
| Custom Array | Annotation File |
|---|---|
| MM_cDNA | /data/miamexpress_datafiles/arrays/ponomarev/array42/mm10_UTxAustin_ADF_annot_10152004.txt |
| Mu23k-Compugen-UCHSC 200um spot diam | /data/miamexpress_datafiles/arrays/bhaves/array41/Mu23KMAPV6-200micron_diameter_annot_adf.txt |
| Qiagen33k-Operon | /data/miamexpress_datafiles/arrays/bhaves/array102/Qiagen_Mu33KMAPV1_adf_Feb02-06.txt |
- Go to Ensembl BioMart Tool page
- Choose Ensembl Genes (Sanger)
- Choose Homo sapiens genes
- Click on Attributes link on left
- Expand Gene section
- Check Ensembl Gene ID, Chromosome Name, Gene Start, Gene End, and Strand
- Click Results button
- Export all results to File CSV, check Unique results only
- Click Go button
- Save file to $INPUT_FILES/Ensembl and call it Hs_Ensembl_Genes.csv
- Repeat above steps with Ensembl Transcript ID (instead of Ensembl Gene ID) and save to Hs_Ensembl_Transcripts.csv
- Repeat above steps with only Ensembl Gene ID and Ensembl Transcript ID selected and save as Hs_link_GtoTS.csv
- Go to Ensembl BioMart Tool page
- Choose Ensembl Genes (Sanger)
- Choose Mus musculus genes
- Click on Attributes link on left
- Expand Gene section
- Check Ensembl Gene ID, Chromosome Name, Gene Start, Gene End, and Strand
- Click Results button
- Export all results to File CSV, check Unique results only
- Click Go button
- Save file to $INPUT_FILES/Ensembl and call it Mm_Ensembl_Genes.csv
- Repeat above steps with Ensembl Transcript ID (instead of Ensembl Gene ID) and save to Mm_Ensembl_Transcripts.csv
- Repeat above steps with only Ensembl Gene ID and Ensembl Transcript ID selected and save as Mm_link_GtoTS.csv
- Go to Ensembl BioMart Tool page
- Choose Ensembl Genes (Sanger)
- Choose Rattus norvegicus genes
- Click on Attributes link on left
- Expand Gene section
- Check Ensembl Gene ID, Chromosome Name, Gene Start, Gene End, and Strand
- Click Results button
- Export all results to File CSV, check Unique results only
- Click Go button
- Save file to $INPUT_FILES/Ensembl and call it Rn_Ensembl_Genes.csv
- Repeat above steps with Ensembl Transcript ID (instead of Ensembl Gene ID) and save to Rn_Ensembl_Transcripts.csv
- Repeat above steps with only Ensembl Gene ID and Ensembl Transcript ID selected and save as Rn_link_GtoTS.csv
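The three near-identical BioMart pulls above can also be scripted. The sketch below only generates the BioMart XML query for the gene attributes; the dataset and attribute names (`hsapiens_gene_ensembl`, `ensembl_gene_id`, etc.) are the usual BioMart internal names and should be verified against the current Ensembl release before use.

```shell
#!/bin/sh
# Sketch: write one BioMart XML query per species gene file instead of
# clicking through the web form three times. Attribute names mirror the
# checkboxes listed above (gene ID, chromosome, start, end, strand).
write_gene_query() {
    species="$1"   # hsapiens, mmusculus, or rnorvegicus
    out="$2"
    cat > "$out" <<EOF
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE Query>
<Query virtualSchemaName="default" formatter="CSV" header="0" uniqueRows="1">
  <Dataset name="${species}_gene_ensembl" interface="default">
    <Attribute name="ensembl_gene_id"/>
    <Attribute name="chromosome_name"/>
    <Attribute name="start_position"/>
    <Attribute name="end_position"/>
    <Attribute name="strand"/>
  </Dataset>
</Query>
EOF
}

# The query could then be POSTed to the BioMart service, e.g.:
#   curl -o Hs_Ensembl_Genes.csv --data-urlencode query@hs_genes.xml \
#        "http://www.ensembl.org/biomart/martservice"
```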
The code for automatically downloading annotation files from FlyBase, MGI, NCBI, RGD, and SwissProt is in a class located here: $SRC/edu/ucdenver/ccp/iDecoder/FileGetter.java. The FlyBase filename will need to be updated -- just go to the FlyBase ftp site, find the most recent file, and change the name in FileGetter.java. Then run:

```shell
ant runFileGetter
```

Either add the gunzip step to FileGetter.java or unzip the .gz files in each directory.
Parse the files on stan:

```shell
cd $SRC/edu/ucdenver/ccp/iDecoder
vi file.properties
```

Edit file.properties to specify the list of files to be parsed. Review the file layouts for consistency with the last run and, if necessary, update the parsers using the instructions in Adding New Parsers. Run the parsers with ant:

```shell
ant -f $SRC/build.xml runParsers
```

Note: every parser runs, regardless of the "newness" of the file. After a few minutes, the parsers will finish parsing the input files. The results can be found in the /Users/chornbak/cherylh/iDecoder/Output directory. There are two output files per data source: one containing identifier Info and one containing Links between identifiers.
```shell
cd /Users/chornbak/cherylh/iDecoder/Output
tar -cvf ParserOutput.tar *.out
gzip -f6 ParserOutput.tar
```

The output files must be copied to phenogen, where the load scripts reside. You can copy the whole tar.gz file or only the one or two sources that have new data. Log on to phenogen as smahaffey:

```shell
cd iDecoder/ParsedFiles
rm *.out
scp chornbak@stan:/Users/chornbak/cherylh/iDecoder/Output/ParserOutput.tar.gz .
gunzip ParserOutput.tar.gz
tar -xvf ParserOutput.tar
```
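The tar/gzip round trip above can be wrapped in two small helpers to make it repeatable. `pack_output` and `unpack_output` are names introduced here; the flags mirror the commands above, and the scp hop between stan and phenogen is omitted.

```shell
#!/bin/sh
# Sketch: package the parser output on stan and unpack it on phenogen.
pack_output() {
    src="$1"; tarball="$2"
    ( cd "$src" && tar -cf - *.out ) | gzip -6 > "$tarball"
}

unpack_output() {
    tarball="$1"; dest="$2"
    mkdir -p "$dest"
    gzip -dc "$tarball" | ( cd "$dest" && tar -xf - )
}
```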
The java source is contained in the $SRC directory. There should be one xxxParser.java file for each source. ParserRunner executes them all.
The CodeLink files must be in the following format:
- The first two letters of the filename indicate the species
- Multiple entries within one column should be delimited by '///'
- If the chromosome field is larger than 9 characters (i.e., because it is something like 'Un|NW_20823'), the chromosome field will not be loaded. All other fields in that record will be loaded though.
The columns, in order, are:
- Probe Name
- NCBI Accession
- UniGene ID
- Description
- symbol
- LLID
- UGRepAcc
- Chromosome
- Cytoband
- GO_annotation
- Expression Areas
- SWISS-PROT
- mapview_chromosome
- start
- end
- strand
- mgi_id
- refseq_mrna_id
- refseq_protein_id
- Ensembl
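As an illustration of the chromosome rule above, a minimal awk filter might look like the following. This is a hypothetical sketch only; the real logic lives in the CodeLink xxxParser.java class. It assumes tab-delimited input with Chromosome as column 8, per the column list above, and leaves '///'-delimited multi-entries untouched.

```shell
#!/bin/sh
# Sketch: blank out a Chromosome value longer than 9 characters
# (e.g. "Un|NW_20823") while keeping the rest of the record.
filter_codelink() {
    awk -F'\t' 'BEGIN { OFS = FS }
    {
        if (length($8) > 9) $8 = ""
        print
    }'
}
```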
The loader scripts are on phenogen. Log on to phenogen as smahaffey:

```shell
cd iDecoder/scripts
```

The directory structure is as follows:
| Directory | File | Comment |
|---|---|---|
| /data/smahaffey/iDecoder/scripts | loadAll.sh | runs the loading process |
| /data/smahaffey/iDecoder/scripts/sql | createIdentifierIndexes.sql | |
| | createLinkIndexes.sql | |
| | createLoadIndexes.sql | |
| | createLoadSchema.sql | |
| | createMainSchema.sql | |
| | promoteInfo.sql | |
| | promoteLinks.sql | |
| | updateLocations.sql | |
| /data/smahaffey/iDecoder/scripts/ctl | Info.ctl | |
| | Links.ctl | |
| /data/smahaffey/iDecoder/ParsedFiles | Bad.out | |
| | Debug.out | |
| | Links.out | |
| | Info.out | |
| /data/smahaffey/iDecoder/log | *.log | Check these after every run |
| | *.bad | Check these after every run |
There may be BAD files left over from a prior run. If these need to be retained, move them to another directory; the loadAll.sh script deletes all files with a ".bad" extension.
Run loadAll.sh from the current directory. You must supply the ORACLE_SID and the INIA password as arguments. The ORACLE_SID can be dev, test, or prod. The script does not generate its own log, so redirect output to a file:
```shell
./loadAll.sh dev password > ../log/loadAllDev.log
```

The script runs sqlldr for each of the input files and then runs sqlplus to move the data into the final tables. It takes about an hour to run. Next, inspect the log and bad files:

```shell
vi ~/iDecoder/log/*.log
vi ~/iDecoder/log/*.bad
```

Make sure no errors occurred.
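Rather than eyeballing every file in vi, the error check can be scripted. `check_load_logs` is a hypothetical helper; the ORA-/SQL*Loader patterns cover common sqlldr and sqlplus failures but are not exhaustive.

```shell
#!/bin/sh
# Sketch: flag Oracle errors in the .log files and any non-empty .bad
# files (rejected rows) left behind by sqlldr.
check_load_logs() {
    logdir="$1"
    status=0
    for f in "$logdir"/*.log; do
        [ -e "$f" ] || continue
        if grep -q -E 'ORA-[0-9]+|SQL\*Loader-[0-9]+' "$f"; then
            echo "errors in $f"
            status=1
        fi
    done
    for f in "$logdir"/*.bad; do
        [ -s "$f" ] || continue
        echo "rejected rows in $f"
        status=1
    done
    return $status
}
```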
- 11/21/11 -- Not sure if this is a problem anymore: updateLocations.sql may generate an error -- check the log file. If so, update the locations values for those identifiers that contain them. See sql/updateLocations.sql for the script.
- Just FYI -- nothing needs to be done -- A new table called PUBLIC_EXPERIMENTS was added to store the IDs of the experiments whose arrays are available to all registered users. A file called PUBLIC_EXPERIMENTS.sql exists in the $ADMIN/schemaDefinition directory which creates the table in the database.
- ??? Still need to do this? -- it was necessary because Laura got location information differently than what iDecoder data sources gave us. But I don't think we need to do it anymore. -- update location and gene symbol information for Laura by running ~/sql/mouseInfo.sql and ~/sql/ratInfo.sql
If Laura has re-calculated eQTLs (or initially calculated for a new tissue, for example), you will need to do the following:
- Update the expression_qtls table using the instructions in the Expression_QTLs table population section.
- Update the gene_symbols table using the instructions in the Gene_symbols table population section (Update GENE_SYMBOLS).
- Add new gene symbols and links to probeset IDs from the expression_qtls table by running ~/sql/addExpressionQTLInfo.sql
- Run ~/sql/identifier_links3_table.sql
- update iDecoderDoc.jsp with the statistics and date of the update
- update $WEB/common/siteVersion.jsp with the statistics and date of the update
- Run ~/sql/validateObjects.sql to compute statistics on all tables and indexes -- takes about 30 minutes
- create the master reference set files required for SPIA (Sorin's pathway) R program:
- Edit the RefSetCreator program in $SRC/edu/ucdenver/ccp/iDecoder directory and un-comment any calls.
- Run the RefSetCreator program in $SRC/edu/ucdenver/ccp/iDecoder directory using 'ant runRefSetCreator'. This creates files in /Users/chornbak/Desktop/ReferenceFiles directory. This takes 6 hours!!!
- copy the xxx_Final files to ~/userFiles/public/GeneLists/ReferenceFiles directory on both stan and amc-kenny
- Download the latest version of JUNG.
- Put the jung-*, colt..., and collections... jars into WEB-INF/lib AND into $PHENOGEN/lib. Both copies are needed because the jnlp file has to reference the lib directory; it cannot reference WEB-INF/lib.
- cp the log4j jars from WEB-INF/lib into $PHENOGEN/lib (again, needed in both places).
- Create DrawGraph.java and compile it. The build.xml script needs the jar-classpath and a pathconvert to manifest-classpath to convert the classpath into a usable form for the manifest.
- The jar that contains the new code should have the class files, but NOT the dependent jars required by JUNG.
- Sign the new jar and the dependent jars all together at the same time.
- The jnlp file includes the new jar and all the JUNG dependent jars under the lib directory.
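The signing step above (the app jar plus all JUNG dependencies, signed together with the same key) can be sketched as a dry run that prints the jarsigner command for each jar so they can be reviewed first. The keystore path and key alias are placeholders.

```shell
#!/bin/sh
# Sketch: print a jarsigner command for every jar the jnlp references.
# Keystore and alias are hypothetical; review before running for real.
print_sign_commands() {
    libdir="$1"; keystore="$2"; keyalias="$3"
    for jar in "$libdir"/*.jar; do
        [ -e "$jar" ] || continue
        echo "jarsigner -keystore $keystore $jar $keyalias"
    done
}
```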