iDecoder Update
| Date | Count of IDENTIFIERS | Count of IDENTIFIER_LINKS |
|---|---|---|
| Prior to 5/26/06 | 1,866,625 | 3,614,678 |
| 5/26/06 | 2,115,668 | 4,032,055 |
| 10/06/06 | 2,149,889 | 4,032,067 |
| 7/19/07 | 2,551,937 | 5,084,356 |
| 11/02/07 | 2,669,103 | 5,133,013 |
| 4/08/09 | 2,732,872 | 5,378,805 |
| 12/07/09 | 2,930,543 | 5,501,504 |
| 08/13/10 | 4,466,570 | 7,729,776 |
| 01/15/11 | 5,569,750 | 9,255,549 |
| 11/30/11 | 5,421,171 | 9,127,605 |
Data files should be downloaded onto stan and parsed there.

```shell
cp -R /Users/smahaffey/iDecoder/InputFiles /Users/smahaffey/iDecoder/InputFiles_OLD
cp -R /Users/smahaffey/iDecoder/Output /Users/smahaffey/iDecoder/Output_OLD
export INPUT_FILES=/Users/smahaffey/iDecoder/InputFiles
```

Log on to the production database before downloading the files. Run the 'getDistinctArrays.sql' script in /Users/chornbak/cherylh/sql to see the list of arrays currently used by experiments. Double-check against the list below to make sure all chips are included in the list of files to be downloaded.
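The backup-and-setup step above can be sketched as a small script. `IDECODER_HOME` and `backup_idecoder_dirs` are names introduced here for illustration; the paths mirror the cp commands above.

```shell
#!/bin/sh
# Sketch of the backup step: snapshot InputFiles and Output before a new
# run, then point INPUT_FILES at the working copy. IDECODER_HOME is an
# assumed variable standing in for /Users/smahaffey/iDecoder.
IDECODER_HOME="${IDECODER_HOME:-$HOME/iDecoder}"

backup_idecoder_dirs() {
    for dir in InputFiles Output; do
        src="$IDECODER_HOME/$dir"
        dest="${src}_OLD"
        if [ -d "$src" ]; then
            rm -rf "$dest"          # replace any stale backup from a prior run
            cp -R "$src" "$dest"
        else
            echo "warning: $src does not exist, skipping" >&2
        fi
    done
}

backup_idecoder_dirs
export INPUT_FILES="$IDECODER_HOME/InputFiles"
```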
In separate windows, edit:

```shell
vi $WEB/common/siteVersion.jsp
vi $SRC/edu/ucdenver/ccp/iDecoder/file.properties
```
Important Links:
Note: a login is required to fetch these files. Use chornbak@mindspring.com / affymetrix253. Download the following files into $INPUT_FILES/Affymetrix:
- Affymetrix Genechip Drosophila Genome [DrosGenome1]
- Affymetrix GeneChip Human Genome U133 Plus 2.0 [HG-U133_Plus_2]
- Affymetrix MoEx-1_0-st-v1 Probeset Annotations
- These two files are under the Technical Documentation tab, under the NetAffx Annotation Files heading
- Affymetrix MoEx-1_0-st-v1 Transcript Annotations
- Affymetrix GeneChip Mouse Expression Array MOE430A [MOE430A]
- Affymetrix GeneChip Mouse Expression Array MOE430B [MOE430B]
- Affymetrix GeneChip Mouse Genome 430 2.0 [Mouse430_2]
- Affymetrix GeneChip Murine Genome U74A [MG_U74A] (use file from InputFiles_OLD)
- Affymetrix GeneChip Murine Genome U74Av2 [MG_U74Av2] (use file from InputFiles_OLD)
- Affymetrix GeneChip Murine Genome U74Bv2 [MG_U74Bv2] (use file from InputFiles_OLD)
- Affymetrix GeneChip Murine Genome U74Cv2 [MG_U74Cv2] (use file from InputFiles_OLD)
- Affymetrix RnEx-1_0-st-v1 Probeset Annotations
- Affymetrix RnEx-1_0-st-v1 Transcript Annotations
- Affymetrix GeneChip Rat Expression Array RAE230A [RAE230A]
- Affymetrix GeneChip Rat Genome U34A [RG_U34A]
- Affymetrix GeneChip Rat Genome U34C [RG_U34C]
```shell
cd $INPUT_FILES/Affymetrix
unzip \*.zip
```
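To double-check the download against the chip list above, a short shell check can confirm each expected annotation file is present. `check_affy_files` is a hypothetical helper; the filename stems are the chip names from the list, and real NetAffx files carry a release suffix (e.g. `.na32.annot.csv`), which the glob allows for.

```shell
#!/bin/sh
# Sketch: verify one annotation CSV exists per chip after unzipping.
# The chip-name stems are assumptions based on the list above; adjust
# them if the NetAffx filenames differ.
check_affy_files() {
    dir="$1"
    missing=0
    for chip in DrosGenome1 HG-U133_Plus_2 MoEx-1_0-st-v1 MOE430A MOE430B \
                Mouse430_2 MG_U74A MG_U74Av2 MG_U74Bv2 MG_U74Cv2 \
                RnEx-1_0-st-v1 RAE230A RG_U34A RG_U34C; do
        if ! ls "$dir/$chip"*.csv >/dev/null 2>&1; then
            echo "missing annotation file for $chip"
            missing=1
        fi
    done
    return $missing
}
```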
CodeLink is out of business, so the following link no longer works:
- Download UniSet Mouse I
- Download Mouse Whole Genome
- Download Rat Whole Genome
During the last iDecoder update, the following custom arrays showed up in the getDistinctArrays list:
| Custom Array | Annotation File |
|---|---|
| MM_cDNA | /data/miamexpress_datafiles/arrays/ponomarev/array42/mm10_UTxAustin_ADF_annot_10152004.txt |
| Mu23k-Compugen-UCHSC 200um spot diam | /data/miamexpress_datafiles/arrays/bhaves/array41/Mu23KMAPV6-200micron_diameter_annot_adf.txt |
| Qiagen33k-Operon | /data/miamexpress_datafiles/arrays/bhaves/array102/Qiagen_Mu33KMAPV1_adf_Feb02-06.txt |
- Go to Ensembl BioMart Tool page
- Choose Ensembl Genes (Sanger)
- Choose Homo sapiens genes
- Click on Attributes link on left
- Expand Gene section
- Check Ensembl Gene ID, Chromosome Name, Gene Start, Gene End, and Strand
- Click Results button
- Export all results to File CSV, check Unique results only
- Click Go button
- Save file to $INPUT_FILES/Ensembl and call it Hs_Ensembl_Genes.csv
- Repeat above steps with Ensembl Transcript ID (instead of Ensembl Gene ID) and save to Hs_Ensembl_Transcripts.csv
- Repeat above steps with only Ensembl Gene ID and Ensembl Transcript ID selected and save as Hs_link_GtoTS.csv
- Go to Ensembl BioMart Tool page
- Choose Ensembl Genes (Sanger)
- Choose Mus musculus genes
- Click on Attributes link on left
- Expand Gene section
- Check Ensembl Gene ID, Chromosome Name, Gene Start, Gene End, and Strand
- Click Results button
- Export all results to File CSV, check Unique results only
- Click Go button
- Save file to $INPUT_FILES/Ensembl and call it Mm_Ensembl_Genes.csv
- Repeat above steps with Ensembl Transcript ID (instead of Ensembl Gene ID) and save to Mm_Ensembl_Transcripts.csv
- Repeat above steps with only Ensembl Gene ID and Ensembl Transcript ID selected and save as Mm_link_GtoTS.csv
- Go to Ensembl BioMart Tool page
- Choose Ensembl Genes (Sanger)
- Choose Rattus norvegicus genes
- Click on Attributes link on left
- Expand Gene section
- Check Ensembl Gene ID, Chromosome Name, Gene Start, Gene End, and Strand
- Click Results button
- Export all results to File CSV, check Unique results only
- Click Go button
- Save file to $INPUT_FILES/Ensembl and call it Rn_Ensembl_Genes.csv
- Repeat above steps with Ensembl Transcript ID (instead of Ensembl Gene ID) and save to Rn_Ensembl_Transcripts.csv
- Repeat above steps with only Ensembl Gene ID and Ensembl Transcript ID selected and save as Rn_link_GtoTS.csv
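The three near-identical BioMart pulls above can also be scripted. The sketch below only generates the BioMart XML query for the gene attributes; the dataset and attribute names (`hsapiens_gene_ensembl`, `ensembl_gene_id`, etc.) are the usual BioMart internal names and should be verified against the current Ensembl release before use.

```shell
#!/bin/sh
# Sketch: write one BioMart XML query per species gene file instead of
# clicking through the web form three times. Attribute names mirror the
# checkboxes listed above (gene ID, chromosome, start, end, strand).
write_gene_query() {
    species="$1"   # hsapiens, mmusculus, or rnorvegicus
    out="$2"
    cat > "$out" <<EOF
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE Query>
<Query virtualSchemaName="default" formatter="CSV" header="0" uniqueRows="1">
  <Dataset name="${species}_gene_ensembl" interface="default">
    <Attribute name="ensembl_gene_id"/>
    <Attribute name="chromosome_name"/>
    <Attribute name="start_position"/>
    <Attribute name="end_position"/>
    <Attribute name="strand"/>
  </Dataset>
</Query>
EOF
}

# The query could then be POSTed to the BioMart service, e.g.:
#   curl -o Hs_Ensembl_Genes.csv --data-urlencode query@hs_genes.xml \
#        "http://www.ensembl.org/biomart/martservice"
```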
The code for automatically downloading annotation files from FlyBase, MGI, NCBI, RGD, and SwissProt is in a class located here: $SRC/edu/ucdenver/ccp/iDecoder/FileGetter.java. The FlyBase filename will need to be updated -- just go to the FlyBase ftp site, find the most recent file, and change the name in FileGetter.java. Then run:

```shell
ant runFileGetter
```

Either add the gunzip step to FileGetter.java or unzip the .gz files in each directory.
Parse the files on stan:

```shell
cd $SRC/edu/ucdenver/ccp/iDecoder
vi file.properties
```

Edit file.properties to specify the list of files to be parsed. Review the file layouts for consistency with the last run and, if necessary, update the parsers using the instructions in Adding New Parsers. Run the parsers with ant:

```shell
ant -f $SRC/build.xml runParsers
```

Note: every parser runs, regardless of the "newness" of the file. After a few minutes, the parsers will finish parsing the input files. The results can be found in the /Users/chornbak/cherylh/iDecoder/Output directory. There are two output files per data source: one containing identifier Info and one containing Links between identifiers.
```shell
cd /Users/chornbak/cherylh/iDecoder/Output
tar -cvf ParserOutput.tar *.out
gzip -f6 ParserOutput.tar
```

The output files must be copied to phenogen, where the load scripts reside. You can copy the whole tar.gz file or only the one or two sources that have new data. Log on to phenogen as smahaffey:

```shell
cd iDecoder/ParsedFiles
rm *.out
scp chornbak@stan:/Users/chornbak/cherylh/iDecoder/Output/ParserOutput.tar.gz .
gunzip ParserOutput.tar.gz
tar -xvf ParserOutput.tar
```
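The tar/gzip round trip above can be wrapped in two small helpers to make it repeatable. `pack_output` and `unpack_output` are names introduced here; the flags mirror the commands above, and the scp hop between stan and phenogen is omitted.

```shell
#!/bin/sh
# Sketch: package the parser output on stan and unpack it on phenogen.
pack_output() {
    src="$1"; tarball="$2"
    ( cd "$src" && tar -cf - *.out ) | gzip -6 > "$tarball"
}

unpack_output() {
    tarball="$1"; dest="$2"
    mkdir -p "$dest"
    gzip -dc "$tarball" | ( cd "$dest" && tar -xf - )
}
```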
The java source is contained in the $SRC directory. There should be one xxxParser.java file for each source. ParserRunner executes them all.
The CodeLink files must be in the following format:
- The first two letters of the filename indicate the species
- Multiple entries within one column should be delimited by '///'
- If the chromosome field is larger than 9 characters (i.e., because it is something like 'Un|NW_20823'), the chromosome field will not be loaded. All other fields in that record will be loaded though.
The columns, in order, are:
- Probe Name
- NCBI Accession
- UniGene ID
- Description
- symbol
- LLID
- UGRepAcc
- Chromosome
- Cytoband
- GO_annotation
- Expression Areas
- SWISS-PROT
- mapview_chromosome
- start
- end
- strand
- mgi_id
- refseq_mrna_id
- refseq_protein_id
- Ensembl
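As an illustration of the chromosome rule above, a minimal awk filter might look like the following. This is a hypothetical sketch only; the real logic lives in the CodeLink xxxParser.java class. It assumes tab-delimited input with Chromosome as column 8, per the column list above, and leaves '///'-delimited multi-entries untouched.

```shell
#!/bin/sh
# Sketch: blank out a Chromosome value longer than 9 characters
# (e.g. "Un|NW_20823") while keeping the rest of the record.
filter_codelink() {
    awk -F'\t' 'BEGIN { OFS = FS }
    {
        if (length($8) > 9) $8 = ""
        print
    }'
}
```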
The loader scripts are on phenogen. Log on to phenogen as smahaffey:

```shell
cd iDecoder/scripts
```

The directory structure is as follows:
| Directory | File | Comment |
|---|---|---|
| /data/smahaffey/iDecoder/scripts | loadAll.sh | runs the loading process |
| /data/smahaffey/iDecoder/scripts/sql | createIdentifierIndexes.sql | |
| | createLinkIndexes.sql | |
| | createLoadIndexes.sql | |
| | createLoadSchema.sql | |
| | createMainSchema.sql | |
| | promoteInfo.sql | |
| | promoteLinks.sql | |
| | updateLocations.sql | |
| /data/smahaffey/iDecoder/scripts/ctl | Info.ctl | |
| | Links.ctl | |
| /data/smahaffey/iDecoder/ParsedFiles | Bad.out | |
| | Debug.out | |
| | Links.out | |
| | Info.out | |
| /data/smahaffey/iDecoder/log | *.log | Check these after every run |
| | *.bad | Check these after every run |
There may be BAD files left over from a prior run. If these need to be retained, move them to another directory; the loadAll.sh script deletes all files with a ".bad" extension.
Run loadAll.sh from the current directory. You must supply the ORACLE_SID and the INIA password as arguments. The ORACLE_SID can be dev, test, or prod. The script does not generate its own log, so redirect output to a file:
```shell
./loadAll.sh dev password > ../log/loadAllDev.log
```

The script runs sqlldr for each of the input files and then runs sqlplus to move the data into the final tables. It takes about an hour to run. Next, inspect the log and bad files:

```shell
vi ~/iDecoder/log/*.log
vi ~/iDecoder/log/*.bad
```

Make sure no errors occurred.
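Rather than eyeballing every file in vi, the error check can be scripted. `check_load_logs` is a hypothetical helper; the ORA-/SQL*Loader patterns cover common sqlldr and sqlplus failures but are not exhaustive.

```shell
#!/bin/sh
# Sketch: flag Oracle errors in the .log files and any non-empty .bad
# files (rejected rows) left behind by sqlldr.
check_load_logs() {
    logdir="$1"
    status=0
    for f in "$logdir"/*.log; do
        [ -e "$f" ] || continue
        if grep -q -E 'ORA-[0-9]+|SQL\*Loader-[0-9]+' "$f"; then
            echo "errors in $f"
            status=1
        fi
    done
    for f in "$logdir"/*.bad; do
        [ -s "$f" ] || continue
        echo "rejected rows in $f"
        status=1
    done
    return $status
}
```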
- 11/21/11 -- Not sure if this is a problem anymore: updateLocations.sql may generate an error -- check the log file. If so, update the locations values for those identifiers that contain them. See sql/updateLocations.sql for the script.
- Just FYI -- nothing needs to be done -- A new table called PUBLIC_EXPERIMENTS was added to store the IDs of the experiments whose arrays are available to all registered users. A file called PUBLIC_EXPERIMENTS.sql exists in the $ADMIN/schemaDefinition directory which creates the table in the database.
- ??? Still need to do this? -- it was necessary because Laura got location information differently than what iDecoder data sources gave us. But I don't think we need to do it anymore. -- update location and gene symbol information for Laura by running ~/sql/mouseInfo.sql and ~/sql/ratInfo.sql
If Laura has re-calculated eQTLs (or initially calculated for a new tissue, for example), you will need to do the following:
- Update the expression_qtls table using the instructions in the Expression_QTLs table population section.
- Update the gene_symbols table using the instructions in the Gene_symbols table population section (Update GENE_SYMBOLS).
- Add new gene symbols and links to probeset IDs from the expression_qtls table by running ~/sql/addExpressionQTLInfo.sql
- Run ~/sql/identifier_links3_table.sql
- update iDecoderDoc.jsp with the statistics and date of the update
- update $WEB/common/siteVersion.jsp with the statistics and date of the update
- Run ~/sql/validateObjects.sql to compute statistics on all tables and indexes -- takes about 30 minutes
- create the master reference set files required for SPIA (Sorin's pathway) R program:
- Edit the RefSetCreator program in $SRC/edu/ucdenver/ccp/iDecoder directory and un-comment any calls.
- Run the RefSetCreator program in $SRC/edu/ucdenver/ccp/iDecoder directory using 'ant runRefSetCreator'. This creates files in /Users/chornbak/Desktop/ReferenceFiles directory. This takes 6 hours!!!
- copy the xxx_Final files to ~/userFiles/public/GeneLists/ReferenceFiles directory on both stan and amc-kenny
- Download the latest version of JUNG.
- Put the jung-*, colt..., and collections... jars into WEB-INF/lib AND into $PHENOGEN/lib. Both copies are needed because the jnlp file has to reference the lib directory; it cannot reference WEB-INF/lib.
- cp the log4j jars from WEB-INF/lib into $PHENOGEN/lib (again, needed in both places).
- Create DrawGraph.java and compile it. The build.xml script needs the jar-classpath and a pathconvert to manifest-classpath to convert the classpath into a usable form for the manifest.
- The jar that contains the new code should have the class files, but NOT the dependent jars required by JUNG.
- Sign the new jar and the dependent jars all together at the same time.
- The jnlp file includes the new jar and all the JUNG dependent jars under the lib directory.
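The signing step above (the app jar plus all JUNG dependencies, signed together with the same key) can be sketched as a dry run that prints the jarsigner command for each jar so they can be reviewed first. The keystore path and key alias are placeholders.

```shell
#!/bin/sh
# Sketch: print a jarsigner command for every jar the jnlp references.
# Keystore and alias are hypothetical; review before running for real.
print_sign_commands() {
    libdir="$1"; keystore="$2"; keyalias="$3"
    for jar in "$libdir"/*.jar; do
        [ -e "$jar" ] || continue
        echo "jarsigner -keystore $keystore $jar $keyalias"
    done
}
```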