This repository contains a workflow for finding public RNA-seq samples and studies related to specific diseases.
- bin:
  - Contains MONDO model files (`*.pkl`) that are used to make predictions. These models are pretrained and ready to use with the provided data.
- data:
  - `aggregated_metadata.json.gz`: Compressed JSON file containing metadata about RNA-seq experiments from refine.bio.
  - `true_label__inst_type=study__task=disease.csv.gz`: Compressed CSV file with true labels that includes redundant and non-redundant MONDO terms.
- src:
  - `extract_data.py`: Script to extract descriptions and accession codes from the compressed JSON metadata file.
  - `preprocess.py`: Script to preprocess the extracted descriptions.
  - `embedding_lookup_table.py`: Script to generate embeddings for preprocessed descriptions.
  - `tfidf_calculator.py`: Script to calculate TF-IDF scores for text data.
  - `predict.py`: Script to run predictions using pre-trained MONDO models.
- results: Contains the filtered descriptions and accession codes after preprocessing the metadata.
  - `IDs.tsv`: List of accession codes after filtering out studies with no description.
  - `refinebio_descriptions_filtered.tsv`: Descriptions of the RNA-seq experiments after filtering out studies with no description.
- run:
  - `run_extraction.sh`: Shell script for extracting and filtering descriptions.
  - `run_embedding_lookup_table.sh`: Shell script to generate embeddings for preprocessed descriptions.
  - `run_preprocess.sh`: Shell script to preprocess the extracted descriptions.
  - `run_predictions.sh`: Shell script to run predictions using the MONDO model files.
- `README.md`: This file, providing an overview of the project.
- Extract Descriptions: The script `extract_data.py` reads and parses the compressed JSON metadata file located in `data/aggregated_metadata.json.gz`. It filters out entries with no descriptions.
  - Output: Filtered descriptions saved in `results/refinebio_descriptions_filtered.tsv`.
  - Accession codes saved in `results/IDs.tsv`.
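The extraction step can be sketched as follows. This is a minimal illustration, not the repository's actual `extract_data.py`. It assumes the metadata has a top-level `experiments` object keyed by accession code, each entry carrying a `description` field, and that both output files hold one entry per line. The function names are hypothetical.

```python
import gzip
import json


def load_metadata(path):
    """Read a gzip-compressed JSON file into a Python object."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        return json.load(handle)


def extract_descriptions(metadata):
    """Return (accession, description) pairs, skipping empty descriptions.

    Assumed layout: {"experiments": {accession: {"description": ...}}}.
    """
    rows = []
    for accession, experiment in metadata.get("experiments", {}).items():
        description = (experiment.get("description") or "").strip()
        if description:
            # Collapse whitespace so each description stays on one line
            rows.append((accession, " ".join(description.split())))
    return rows


def write_outputs(rows, ids_path, descriptions_path):
    """Write accession codes and descriptions to two line-aligned files."""
    with open(ids_path, "w", encoding="utf-8") as ids_file, \
            open(descriptions_path, "w", encoding="utf-8") as desc_file:
        for accession, description in rows:
            ids_file.write(accession + "\n")
            desc_file.write(description + "\n")
```

Keeping the two files line-aligned means the n-th accession code in `IDs.tsv` corresponds to the n-th description, which is one simple way to rejoin them later.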
- Text Preprocessing: The `preprocess.py` script cleans and preprocesses the extracted descriptions by removing URLs, specific strings, file names, non-UTF-8 characters, and applying text normalization techniques.
  - Output: Preprocessed descriptions saved in `results/processed_refinebio_descriptions.tsv` for embedding generation.
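The kind of cleaning described in the preprocessing step might look like the sketch below. It is an illustration under stated assumptions, not the contents of `preprocess.py`: the list of file extensions, the `drop_phrases` parameter, the lowercasing, and the removal of punctuation are all hypothetical choices made for the example.

```python
import re
import unicodedata

URL_RE = re.compile(r"(?:https?://|www\.)\S+")
# Hypothetical set of extensions treated as file names (handles chains like .tsv.gz)
FILE_RE = re.compile(
    r"\b[\w\-]+(?:\.(?:txt|csv|tsv|gz|zip|fastq|bam|pdf))+\b", re.IGNORECASE
)
NON_ALNUM_RE = re.compile(r"[^a-z0-9\s]")


def preprocess(text, drop_phrases=()):
    """Clean one description; drop_phrases is an assumed list of strings to strip."""
    # Drop characters that cannot be encoded as UTF-8 (e.g. lone surrogates)
    text = text.encode("utf-8", errors="ignore").decode("utf-8")
    # Fold compatibility characters such as ligatures into plain forms
    text = unicodedata.normalize("NFKC", text)
    text = URL_RE.sub(" ", text)
    text = FILE_RE.sub(" ", text)
    for phrase in drop_phrases:
        text = re.sub(re.escape(phrase), " ", text, flags=re.IGNORECASE)
    text = text.lower()
    # Assumed normalization: keep only lowercase letters, digits, and spaces
    text = NON_ALNUM_RE.sub(" ", text)
    return " ".join(text.split())
```

URLs and file names are removed before punctuation is stripped, because both patterns rely on characters such as `.` and `/` still being present.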
- Embedding Generation: The `run_embedding_lookup_table.sh` script calls `embedding_lookup_table.py` to generate embeddings for the preprocessed descriptions using a pre-trained language model (BiomedBERT).
  - Output: Embeddings saved in `results/my_custom_embeddings.npz`.
- Predictions: The `predict.py` script is used to run predictions for each MONDO model file using the generated embeddings and preprocessed descriptions.
  - Output: Prediction results saved in the `prediction_results` folder. This script also needs `txt2onto2.0/data/disease_desc_embedding.npz` to run.
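Running one prediction per MONDO model file can be sketched as a loop over the `*.pkl` files. This README does not specify how each model is invoked, so the sketch below takes the prediction call as a caller-supplied function rather than assuming an API. The names `load_models` and `predict_all` are hypothetical and do not come from `predict.py`.

```python
import pickle
from pathlib import Path


def load_models(model_dir):
    """Map each model file stem (e.g. a MONDO ID) to its unpickled object."""
    models = {}
    for path in sorted(Path(model_dir).glob("*.pkl")):
        with path.open("rb") as handle:
            # Only unpickle files from a trusted source
            models[path.stem] = pickle.load(handle)
    return models


def predict_all(models, features, predict):
    """Apply predict(model, features) to every loaded model.

    `predict` is supplied by the caller because the model interface
    is an unknown here.
    """
    return {name: predict(model, features) for name, model in models.items()}
```

With this shape, the result is a dictionary keyed by model file name, which maps naturally onto writing one output per model into a results folder.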
- Clone this repository to your local machine:

      git clone https://github.com/krishnanlab/Workflow_related_studies.git
      cd Workflow_related_studies