Instructions on how to update the PubMedDB annually and how to use the non-relational database.
- infotojson.py - converts the information in the baseline XML files into a single JSON document
- jsontodb.py - reads the JSON document into the database
- gettfidf.py - queries the database based on user input to obtain TF-IDFs and writes the results to a file
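For orientation, the sketch below shows one common way to compute TF-IDF from per-article word counts. It is not the code in gettfidf.py: the function name `tf_idf`, the toy data, and the weighting variant (term frequency as a fraction of the abstract's words, inverse document frequency as the natural log of N over document frequency) are all assumptions made for illustration.

```python
import math
from collections import Counter


def tf_idf(word_counts_by_pmid):
    """Return {pmid: {word: tf-idf}} from per-article word counts.

    word_counts_by_pmid maps a PMID to {word: count}.
    """
    n_docs = len(word_counts_by_pmid)

    # Document frequency: the number of articles that contain each word.
    doc_freq = Counter()
    for counts in word_counts_by_pmid.values():
        doc_freq.update(counts.keys())

    scores = {}
    for pmid, counts in word_counts_by_pmid.items():
        total = sum(counts.values())
        scores[pmid] = {
            # tf = count / words in the abstract; idf = ln(N / df)
            word: (count / total) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        }
    return scores


# Hypothetical toy data, not real PubMed records.
toy_counts = {
    "1": {"gene": 2, "mouse": 1},
    "2": {"gene": 1, "protein": 3},
}
toy_scores = tf_idf(toy_counts)
```

With this weighting, a word that appears in every article (here `gene`) scores 0, while words confined to a single article score higher.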
All required packages are listed in the YML environment file. A conda environment named pubmeddb can be created and activated using the following commands.

    conda env create -f ./pubmeddb.yml
    conda activate pubmeddb

Please use the data transfer node of Sockeye to download the PubMed baseline (https://ftp.ncbi.nlm.nih.gov/pubmed/baseline/) and gene2pubmed (https://ftp.ncbi.nlm.nih.gov/gene/DATA/gene2pubmed.gz).

    ssh <cwl>@dtn.sockeye.arc.ubc.ca

Run the download script:

    bash ./utils/dl_pubmeddata.sh

Please edit the PBS -M line in pubmed_submit.sh with your email address.

    ##PBS -M <email>

Run the following commands on the compute node and submit the script as a job from a temporary/scratch directory (the project directory is currently read-only on the compute nodes).

    ssh <cwl>@sockeye.arc.ubc.ca
    cd <SCRATCH DIR>
    qsub /project/st-wasserww-1/PubMed_DB/pubmed_submit.sh

PubMedID Collection:

    {
        "PMID": "XX",
        "ArticleTitle": "xx",
        "Abstract": {
            "Text": "XX",
            "Words": {
                "Word1": {
                    "Stems": [xx, xx, xx],
                    "Count": 1
                },
                "Word2": {
                    "Stems": [xx, xx, xx],
                    "Count": 1
                }
            }
        },
        "Country": "XX",
        "MeshHeading": {
            "MeshIdentifier (Ex. D000818)": {
                "DescriptorName": "XX",
                "QualifierName": {}
            }
        }
    }

Gene Collection:

    {
        "GeneID": XX,
        "Name": XX,
        "TaxonomyID": XX,
        "PubMedID": [xx, xx, xx]
    }
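As a rough illustration of how one baseline record could be mapped onto the PubMedID Collection layout above, the sketch below parses a small inline XML sample with the standard library. It is not the code in infotojson.py: the sample record, the function name `article_to_doc`, the whitespace tokenisation, and the keying of `QualifierName` by qualifier identifier are assumptions, and stems are omitted.

```python
import json
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical, heavily trimmed record in the PubMed baseline XML layout.
SAMPLE_XML = """
<PubmedArticleSet>
  <PubmedArticle>
    <MedlineCitation>
      <PMID Version="1">0</PMID>
      <Article>
        <ArticleTitle>Example title</ArticleTitle>
        <Abstract>
          <AbstractText>Mice and more mice.</AbstractText>
        </Abstract>
      </Article>
      <MedlineJournalInfo>
        <Country>Canada</Country>
      </MedlineJournalInfo>
      <MeshHeadingList>
        <MeshHeading>
          <DescriptorName UI="D000818">Animals</DescriptorName>
          <QualifierName UI="Q000235">genetics</QualifierName>
        </MeshHeading>
      </MeshHeadingList>
    </MedlineCitation>
  </PubmedArticle>
</PubmedArticleSet>
"""


def article_to_doc(article):
    """Map one <PubmedArticle> element onto the PubMedID document layout."""
    citation = article.find("MedlineCitation")
    # An abstract may be split across several <AbstractText> elements.
    text = " ".join(
        "".join(node.itertext())
        for node in citation.iterfind("Article/Abstract/AbstractText")
    )
    # Naive tokenisation (assumption): split on whitespace, strip
    # surrounding punctuation, lower-case. Stems are left out here.
    tokens = [t.strip(".,;:()[]").lower() for t in text.split()]
    counts = Counter(t for t in tokens if t)

    mesh = {}
    for heading in citation.iterfind("MeshHeadingList/MeshHeading"):
        descriptor = heading.find("DescriptorName")
        mesh[descriptor.get("UI")] = {
            "DescriptorName": descriptor.text,
            # Assumption: qualifiers are keyed by their identifier.
            "QualifierName": {
                q.get("UI"): q.text for q in heading.iterfind("QualifierName")
            },
        }

    return {
        "PMID": citation.findtext("PMID"),
        "ArticleTitle": citation.findtext("Article/ArticleTitle"),
        "Abstract": {
            "Text": text,
            "Words": {w: {"Count": c} for w, c in counts.items()},
        },
        "Country": citation.findtext("MedlineJournalInfo/Country"),
        "MeshHeading": mesh,
    }


root = ET.fromstring(SAMPLE_XML.strip())
docs = [article_to_doc(a) for a in root.iterfind("PubmedArticle")]
print(json.dumps(docs[0], indent=2))
```

For the full baseline files, a streaming parser such as `xml.etree.ElementTree.iterparse` would likely be a better fit than loading a whole file into memory at once.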