walkerhound edited this page Oct 18, 2012 · 37 revisions

This page explains the iDecoder update procedures.

Historical Identifier and Link Counts

Date               Count of IDENTIFIERS   Count of IDENTIFIER_LINKS
Prior to 5/26/06   1,866,625              3,614,678
5/26/06            2,115,668              4,032,055
10/06/06           2,149,889              4,032,067
7/19/07            2,551,937              5,084,356
11/02/07           2,669,103              5,133,013
4/08/09            2,732,872              5,378,805
12/07/09           2,930,543              5,501,504
08/13/10           4,466,570              7,729,776
01/15/11           5,569,750              9,255,549
11/30/11           5,421,171              9,127,605

Fetching New Data

Data files should be downloaded onto stan and parsed there.

Before Starting

cp -R /Users/smahaffey/iDecoder/InputFiles /Users/smahaffey/iDecoder/InputFiles_OLD
cp -R /Users/smahaffey/iDecoder/Output /Users/smahaffey/iDecoder/Output_OLD
export INPUT_FILES=/Users/smahaffey/iDecoder/InputFiles
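
One thing to watch with the backup commands above: if InputFiles_OLD or Output_OLD is still there from the previous update, `cp -R` copies the source directory inside the existing backup (for example InputFiles_OLD/InputFiles) instead of refreshing it. A minimal sketch of a safer backup follows. The function name `refresh_backup` and the dated-suffix scheme are assumptions for illustration, not part of the existing scripts.

```shell
# Hypothetical helper (not part of the existing iDecoder scripts):
# make DIR_OLD an exact copy of DIR, even when DIR_OLD already exists.
refresh_backup() (
    src=$1
    dst="${src}_OLD"
    [ -d "$src" ] || { echo "missing directory: $src" >&2; exit 1; }
    # Keep the previous backup under a dated name instead of deleting it.
    if [ -e "$dst" ]; then
        mv "$dst" "${dst}.$(date +%Y%m%d%H%M%S)"
    fi
    # DIR_OLD no longer exists here, so cp -R creates it as a copy of DIR.
    cp -R "$src" "$dst"
)
```

Usage would be `refresh_backup /Users/smahaffey/iDecoder/InputFiles`, and likewise for the Output directory.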
Log on to the production database before downloading the files. Run the 'getDistinctArrays.sql' script in /Users/chornbak/cherylh/sql to see the list of arrays currently used by experiments. Compare that list with the list below to make sure every chip is included in the files to be downloaded.

In separate windows, open the following files for editing:

vi $WEB/common/siteVersion.jsp
vi $SRC/edu/ucdenver/ccp/iDecoder/file.properties

Affymetrix

Important Links:

Note: a login is required to fetch these files. Use chornbak@mindspring.com/affymetrix253

Download the following files into $INPUT_FILES/Affymetrix:

  • Affymetrix GeneChip Drosophila Genome [DrosGenome1]
  • Affymetrix GeneChip Human Genome U133 Plus 2.0 [HG-U133_Plus_2]
  • Affymetrix MoEx-1_0-st-v1 Probeset Annotations
  • Affymetrix MoEx-1_0-st-v1 Transcript Annotations
    • You can find these 2 MoEx files under the Technical Documentation tab under the NetAffx Annotation Files heading
  • Affymetrix GeneChip Mouse Expression Array MOE430A [MOE430A]
  • Affymetrix GeneChip Mouse Expression Array MOE430B [MOE430B]
  • Affymetrix GeneChip Mouse Genome 430 2.0 [Mouse430_2]
  • Affymetrix GeneChip Murine Genome U74A [MG_U74A] (use file from InputFiles_OLD)
  • Affymetrix GeneChip Murine Genome U74Av2 [MG_U74Av2] (use file from InputFiles_OLD)
  • Affymetrix GeneChip Murine Genome U74Bv2 [MG_U74Bv2] (use file from InputFiles_OLD)
  • Affymetrix GeneChip Murine Genome U74Cv2 [MG_U74Cv2] (use file from InputFiles_OLD)
  • Affymetrix RnEx-1_0-st-v1 Probeset Annotations
  • Affymetrix RnEx-1_0-st-v1 Transcript Annotations
  • Affymetrix GeneChip Rat Expression Array RAE230A [RAE230A]
  • Affymetrix GeneChip Rat Genome U34A [RG_U34A]
  • Affymetrix GeneChip Rat Genome U34C [RG_U34C]

Then unzip the downloaded archives:

cd $INPUT_FILES/Affymetrix
unzip \*.zip

CodeLink

CodeLink is out of business, so the following link no longer works:

CodeLink main page

  • Download UniSet Mouse I
  • Download Mouse Whole Genome
  • Download Rat Whole Genome

Custom Arrays

During the last iDecoder update, the following custom arrays showed up in the getDistinctArrays list:

Custom Array                           Annotation File
MM_cDNA                                /data/miamexpress_datafiles/arrays/ponomarev/array42/mm10_UTxAustin_ADF_annot_10152004.txt
Mu23k-Compugen-UCHSC 200um spot diam   /data/miamexpress_datafiles/arrays/bhaves/array41/Mu23KMAPV6-200micron_diameter_annot_adf.txt
Qiagen33k-Operon                       /data/miamexpress_datafiles/arrays/bhaves/array102/Qiagen_Mu33KMAPV1_adf_Feb02-06.txt

At this time, these chips aren't annotated with iDecoder because we don't support Quality Control or other filtering or statistics programs. As soon as bug number ZZZ-434 is implemented, this section needs to be updated.

Ensembl

Human data files
  1. Go to Ensembl BioMart Tool page
  2. Choose Ensembl Genes (Sanger)
  3. Choose Homo sapiens genes
  4. Click on Attributes link on left
  5. Expand Gene section
  6. Check Ensembl Gene ID, Chromosome Name, Gene Start, Gene End, and Strand
  7. Click Results button
  8. Export all results to File CSV, check Unique results only
  9. Click Go button
  10. Save file to $INPUT_FILES/Ensembl and call it Hs_Ensembl_Genes.csv
  11. Repeat above steps with Ensembl Transcript ID (instead of Ensembl Gene ID) and save to Hs_Ensembl_Transcripts.csv
  12. Repeat above steps with only Ensembl Gene ID and Ensembl Transcript ID selected and save as Hs_link_GtoTS.csv
Mouse data files
  1. Go to Ensembl BioMart Tool page
  2. Choose Ensembl Genes (Sanger)
  3. Choose Mus musculus genes
  4. Click on Attributes link on left
  5. Expand Gene section
  6. Check Ensembl Gene ID, Chromosome Name, Gene Start, Gene End, and Strand
  7. Click Results button
  8. Export all results to File CSV, check Unique results only
  9. Click Go button
  10. Save file to $INPUT_FILES/Ensembl and call it Mm_Ensembl_Genes.csv
  11. Repeat above steps with Ensembl Transcript ID (instead of Ensembl Gene ID) and save to Mm_Ensembl_Transcripts.csv
  12. Repeat above steps with only Ensembl Gene ID and Ensembl Transcript ID selected and save as Mm_link_GtoTS.csv
Rat data files
  1. Go to Ensembl BioMart Tool page
  2. Choose Ensembl Genes (Sanger)
  3. Choose Rattus norvegicus genes
  4. Click on Attributes link on left
  5. Expand Gene section
  6. Check Ensembl Gene ID, Chromosome Name, Gene Start, Gene End, and Strand
  7. Click Results button
  8. Export all results to File CSV, check Unique results only
  9. Click Go button
  10. Save file to $INPUT_FILES/Ensembl and call it Rn_Ensembl_Genes.csv
  11. Repeat above steps with Ensembl Transcript ID (instead of Ensembl Gene ID) and save to Rn_Ensembl_Transcripts.csv
  12. Repeat above steps with only Ensembl Gene ID and Ensembl Transcript ID selected and save as Rn_link_GtoTS.csv
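
The steps above should leave nine CSV files in $INPUT_FILES/Ensembl, three per species. A quick way to confirm that none was skipped is sketched below. The function name `missing_ensembl_files` is hypothetical; the file names are the ones given in the steps.

```shell
# Hypothetical check (not part of the existing scripts): print every
# expected Ensembl export that is missing or empty in the given directory.
missing_ensembl_files() (
    dir=$1
    for species in Hs Mm Rn; do
        for suffix in Ensembl_Genes Ensembl_Transcripts link_GtoTS; do
            f="$dir/${species}_${suffix}.csv"
            # -s is true only for files that exist and are non-empty
            [ -s "$f" ] || echo "$f"
        done
    done
)
```

Running `missing_ensembl_files "$INPUT_FILES/Ensembl"` prints nothing when all nine files are present and non-empty.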

Flybase, MGI, NCBI, RGD & SwissProt

The code for automatically downloading annotation files from FlyBase, MGI, NCBI, RGD, and SwissProt is in a class located here:

$SRC/edu/ucdenver/ccp/iDecoder/FileGetter.java
The FlyBase filename needs to be updated before each run: go to the FlyBase ftp site, find the most recent file, and change the name in FileGetter.java. Then run:
ant runFileGetter
Afterwards, either add a gunzip step to FileGetter.java or gunzip the .gz files in each directory.
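
Until a gunzip step is added to FileGetter.java, the downloaded files can be decompressed in one pass. This is a sketch only: the function name `gunzip_tree` is hypothetical, and it assumes the downloads land in subdirectories of $INPUT_FILES.

```shell
# Hypothetical helper: decompress every .gz file below a directory.
gunzip_tree() (
    root=$1
    # -f overwrites a stale uncompressed copy left over from a previous run
    find "$root" -type f -name '*.gz' -exec gunzip -f {} +
)
```

For example, `gunzip_tree "$INPUT_FILES"` replaces each .gz file with its uncompressed version in place.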

Parsing Input Files

Parse the files on stan:
cd $SRC/edu/ucdenver/ccp/iDecoder
vi file.properties
Edit file.properties to specify the list of files to be parsed. Review the file layouts for consistency with the last run and, if necessary, update the parsers using the instructions in Adding New Parsers. Run the parsers with ant:

$ ant -f $SRC/build.xml runParsers
Note: every parser runs, whether or not its input file is new. The parsers finish after a few minutes. The results can be found in the /Users/chornbak/cherylh/iDecoder/Output directory. There are two output files per data source: one containing identifier Info and one containing Links between identifiers. Package the output files:
cd /Users/chornbak/cherylh/iDecoder/Output
tar -cvf ParserOutput.tar *.out
gzip -f6 ParserOutput.tar
The output files must be copied to phenogen, where the load scripts reside. You can copy the whole "tar.gz" file or only the one or two sources that have new data. Log on to phenogen as smahaffey:
cd iDecoder/ParsedFiles
rm *.out
scp chornbak@stan:/Users/chornbak/cherylh/iDecoder/Output/ParserOutput.tar.gz .
gunzip ParserOutput.tar.gz
tar -xvf ParserOutput.tar 
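
The separate tar/gzip and gunzip/tar steps above can each be done in one command, since tar's -z option handles the gzip compression itself. A sketch under that assumption follows; the function names `pack_outputs` and `unpack_outputs` are hypothetical.

```shell
# Hypothetical one-step equivalents of the tar/gzip and gunzip/tar pairs.
pack_outputs() (
    # Run in the parser Output directory on stan.
    cd "$1" || exit 1
    tar -czf ParserOutput.tar.gz *.out
)

unpack_outputs() (
    # Run in the ParsedFiles directory on phenogen, after the scp.
    cd "$1" || exit 1
    tar -xzf ParserOutput.tar.gz
)
```

Unlike gunzip followed by tar -xvf, this leaves the .tar.gz file in place after extraction.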

Adding New Parsers

The Java source is in the $SRC directory. There should be one xxxParser.java file for each source. ParserRunner executes them all.

Layout of the CodeLink File

The CodeLink files must be in the following format:

  • The first two letters of the filename indicate the species
  • Multiple entries within one column should be delimited by '///'
  • If the chromosome field is longer than 9 characters (e.g., 'Un|NW_20823'), the chromosome field will not be loaded. All other fields in that record will still be loaded.
The columns should be in the following order:
  1. Probe Name
  2. NCBI Accession
  3. UniGene ID
  4. Description
  5. symbol
  6. LLID
  7. UGRepAcc
  8. Chromosome
  9. Cytoband
  10. GO_annotation
  11. Expression Areas
  12. SWISS-PROT
  13. mapview_chromosome
  14. start
  15. end
  16. strand
  17. mgi_id
  18. refseq_mrna_id
  19. refseq_protein_id
  20. Ensembl
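
The layout rules above can be checked before running the parsers. The sketch below assumes the file is tab-delimited and that a header row, if present, starts with "Probe Name"; neither is stated on this page. The function name `check_codelink` is hypothetical.

```shell
# Hypothetical layout check for a CodeLink file.
# Assumption: columns are tab-delimited (the page does not say).
check_codelink() {
    awk -F'\t' '
        # Skip a header row if the file has one (assumption).
        NR == 1 && $1 == "Probe Name" { next }
        # Every record must have exactly the 20 columns listed above.
        NF != 20 {
            printf "line %d: expected 20 columns, found %d\n", NR, NF
            bad = 1
            next
        }
        # Column 8 is Chromosome; values longer than 9 characters are not loaded.
        length($8) > 9 {
            printf "line %d: chromosome \"%s\" is longer than 9 characters and will not be loaded\n", NR, $8
        }
        END { exit bad }
    ' "$1"
}
```

The function returns non-zero only for column-count problems. Long chromosome values are reported as warnings, because the rest of such a record still loads.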

Loading Data into Identifier Tables

The loader scripts are on phenogen. Log on to phenogen as smahaffey:

cd iDecoder/scripts
The directory structure is as follows:
Directory                              File                          Comment
/data/smahaffey/iDecoder/scripts       loadAll.sh                    runs loading process
/data/smahaffey/iDecoder/scripts/sql   createIdentifierIndexes.sql
                                       createLinkIndexes.sql
                                       createLoadIndexes.sql
                                       createLoadSchema.sql
                                       createMainSchema.sql
                                       promoteInfo.sql
                                       promoteLinks.sql
                                       updateLocations.sql
/data/smahaffey/iDecoder/scripts/ctl   Info.ctl
                                       Links.ctl
/data/smahaffey/iDecoder/ParsedFiles   Bad.out
                                       Debug.out
                                       Links.out
                                       Info.out
/data/smahaffey/iDecoder/log           .log                          Check these after every run
                                       .bad                          Check these after every run

There may be BAD files left over from a prior run. If these need to be retained, move them to another directory; the loadAll.sh script deletes all files with a ".bad" extension.

Run loadAll.sh from the current directory. You must supply the ORACLE_SID and the INIA password as arguments. The ORACLE_SID can be dev, test, or prod. The script does not generate its own log, so redirect output to a file:

$ ./loadAll.sh dev password > ../log/loadAllDev.log
The script runs sqlldr for each of the input files and then runs sqlplus to move the data into the final tables. It takes about an hour to run. Next, review the log and bad files:
vi ~/iDecoder/log/*.log
vi ~/iDecoder/log/*.bad 
Make sure no errors occurred.
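
Reading every log by eye is easy to get wrong, so a first pass can be automated. The sketch below searches the logs for the standard Oracle error prefixes (ORA- for the database, SP2- for SQL*Plus, SQL*Loader- for sqlldr) and flags any non-empty .bad file. The function name `check_load_logs` is hypothetical. Note also that `>` captures only standard output; appending `2>&1` to the loadAll.sh command would send error messages to the same log.

```shell
# Hypothetical helper: report Oracle error lines in the logs and any
# non-empty .bad files. Returns non-zero when something was found.
check_load_logs() (
    dir=$1
    status=0
    # /dev/null is listed so grep always prefixes matches with the file name.
    if grep -n -E 'ORA-[0-9]+|SP2-[0-9]+|SQL\*Loader-[0-9]+' /dev/null "$dir"/*.log 2>/dev/null; then
        status=1
    fi
    for f in "$dir"/*.bad; do
        # sqlldr writes rejected records to .bad files
        if [ -s "$f" ]; then
            echo "rejected records in $f"
            status=1
        fi
    done
    exit $status
)
```

For example, `check_load_logs ~/iDecoder/log` prints nothing and returns 0 after a clean load. A clean result does not replace reading the logs, since it only catches the error prefixes listed.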

Comments

  1. 11/21/11 -- Not sure if this is a problem anymore: updateLocations.sql may generate an error -- check the log file. If so, update the locations values for those identifiers that contain them. See sql/updateLocations.sql for the script.
  2. Just FYI -- nothing needs to be done -- A new table called PUBLIC_EXPERIMENTS was added to store the IDs of the experiments whose arrays are available to all registered users. A file called PUBLIC_EXPERIMENTS.sql exists in the $ADMIN/schemaDefinition directory which creates the table in the database.
  3. ??? Still need to do this? -- Update location and gene symbol information for Laura by running ~/sql/mouseInfo.sql and ~/sql/ratInfo.sql. This was necessary because Laura got location information differently than the iDecoder data sources provide it, but I don't think we need to do it anymore.

New eQTLs

If Laura has re-calculated eQTLs (or initially calculated for a new tissue, for example), you will need to do the following:

  • Update the expression_qtls table with the instructions in the Expression_QTLs table population section: Update EXPRESSION_QTLS
  • Update the gene_symbols table with the instructions in the Gene_symbols table population section: Update GENE_SYMBOLS
  • Add new gene symbols and links to probeset ids from the expression_qtls table by running ~/sql/addExpressionQTLInfo.sql

Final Steps

  • start ~/sql/identifier_links3_table.sql
  • Update iDecoderDoc.jsp with the statistics and date of the update
  • Update $WEB/common/siteVersion.jsp with the statistics and date of the update
  • start ~/sql/validateObjects.sql to compute statistics on all tables and indexes -- takes about 30 minutes
  • Create the master reference set files required for the SPIA (Sorin's pathway) R program:
    • Edit the RefSetCreator program in the $SRC/edu/ucdenver/ccp/iDecoder directory and un-comment any calls.
    • Run the RefSetCreator program in the $SRC/edu/ucdenver/ccp/iDecoder directory using 'ant runRefSetCreator'. This creates files in the /Users/chornbak/Desktop/ReferenceFiles directory. This takes 6 hours!
    • Copy the xxx_Final files to the ~/userFiles/public/GeneLists/ReferenceFiles directory on both stan and amc-kenny.

Creating Java Web Start Application

  • Download the latest version of JUNG.
  • Put the jung-*, colt..., collections... jars into WEB-INF/lib AND into $PHENOGEN/lib. They are needed in both places because the jnlp file has to reference the lib directory; it can't reference WEB-INF/lib. AAGH!
  • cp the log4j jars from WEB-INF/lib into $PHENOGEN/lib -- they are needed in both places as well.
  • Create DrawGraph.java and compile it.
  • The build.xml script needs the jar-classpath and pathconvert to manifest-classpath to convert it to a usable form for the manifest.
  • The jar that contains the code I've written should have the class files, but NOT the dependent jars required by JUNG.
  • Sign the new jar and the dependent jars all together at the same time.
  • The jnlp file includes the new jar and all the JUNG dependent jars under the lib directory.
