Jacobson-CompSysBio/JAIL-Data-Curation-Benchmark

JAIL-Data-Curation-Benchmark

Table of Contents

  1. Overview
  2. Directory Structure
  3. Key Components
  4. Neo4j Setup and Querying
  5. Acknowledgments

Overview

JAIL-Data-Curation-Benchmark focuses on the curation of various biological datasets, including gene information, Gene Ontology (GO) terms, and biochemical reactions. These curated datasets are intended to serve as a foundation for training and evaluating Large Language Models (LLMs) on their ability to understand and retrieve specific biological information. The project includes scripts for data fetching, processing, formatting into question-answer pairs, and an example notebook for LLM evaluation using the DeepEval framework.

The core aims of the JAIL-Data-Curation-Benchmark are to:

  • Curate High-Quality Biological Datasets: Systematically gather and process information about human genes, Gene Ontology terms, and biochemical reactions from public databases.
  • Prepare Data for LLM Training: Transform curated data into formats suitable for supervised fine-tuning, such as question-answer pairs.

Directory Structure

The following directory structure is recommended at the root of your project:

JAIL-Data-Curation-Benchmark/
├── files/                
│   ├── (INPUT) hgnc_complete_set_2024-10-01.txt  # User must provide (update to the most recent file)
│   ├── (INPUT) go-basic.obo                      # User must provide
│   ├── (INPUT) goa_human.gaf                     # User must provide
│   └── (INPUT) neo4j_files/
│       └── (INPUT) neo4j-community-4.4.40        # Neo4j installation & Reactome Database migration
│
├── data/
│   ├── (OUTPUT) geneset_dict.json             # Generated by 20241017Gene_Dataset.ipynb
│   ├── (OUTPUT) geneset_qa_pairs.json         # Generated by 20250404GeneInfo_DataFormatting.py
│   ├── (OUTPUT) go_terms_dict.json            # Generated by 20241007_GOterm_Information_Dataset_Dictionary.ipynb
│   ├── (OUTPUT) gene_reactions.json           # Generated by 20250219Gene_Reactions_Dataset.ipynb
│
├── notebooks/
│   ├── 20240823_LLMeval_test_deepeval-checkpoint.ipynb
│   ├── 20241007_GOterm_Information_Dataset_Dictionary.ipynb
│   ├── 20241017Gene_Dataset.ipynb
│   ├── 20250219Gene_Reactions_Dataset.ipynb
│   └── (OUTPUT/INPUT) gene_ensembl.json       # Generated by 20250219Gene_Reactions_Dataset.ipynb to be used
│
├── scripts/
│   └── 20250404GeneInfo_DataFormatting.py
│
├── .env                      # (User-created at project root) For Neo4j credentials
└── README.md
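
Before running the notebooks, it can help to confirm that the user-provided inputs are in place. The following is a minimal sketch, not part of the repository: the helper name `missing_inputs` is hypothetical, and the path list simply mirrors the tree above (update the HGNC file name to the release you downloaded).

```python
from pathlib import Path

# Inputs the user must provide, copied from the directory tree above.
# Assumption: the HGNC file keeps the dated name shown in the tree.
REQUIRED_INPUTS = [
    "files/hgnc_complete_set_2024-10-01.txt",
    "files/go-basic.obo",
    "files/goa_human.gaf",
    "files/neo4j_files",
    ".env",
]


def missing_inputs(project_root):
    """Return the required paths that do not exist under project_root."""
    root = Path(project_root)
    return [rel for rel in REQUIRED_INPUTS if not (root / rel).exists()]


if __name__ == "__main__":
    # Run from the project root to list anything that still needs to be added.
    for rel in missing_inputs("."):
        print(f"missing: {rel}")
```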

Key Components

Notebooks

  1. notebooks/20240823_LLMeval_test_deepeval-checkpoint.ipynb

    • Purpose: To demonstrate and test the DeepEval framework for LLM evaluation.
    • Description: Sets up a custom Llama 3.1 8B model and evaluates it with DeepEval's HallucinationMetric, FaithfulnessMetric, and BiasMetric. The notebook is self-contained for its example but could be adapted to use the curated datasets.
    • Status: "Currently on pause."
  2. notebooks/20241007_GOterm_Information_Dataset_Dictionary.ipynb

  3. notebooks/20241017Gene_Dataset.ipynb

  4. notebooks/20250219Gene_Reactions_Dataset.ipynb

Datasets

Gene Ontology Term Dataset

A structured dataset of Gene Ontology (GO) terms in the Biological Process (BP) namespace.

  • Fields Included:
    • go_id: Gene Ontology ID (e.g., GO:0008150)
    • name: GO term name
    • definition: Description of the biological process
    • information_content: A score indicating specificity of the term
    • associated_genes: List of gene symbols linked to the GO term
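
This README does not state how information_content is computed. One common definition is IC(t) = -log2(p(t)), where p(t) is the fraction of annotated genes associated with term t, so rarer (more specific) terms score higher. The sketch below assumes that definition; the function name and the toy identifiers are illustrative, and the notebook may use a different formula.

```python
import math


def information_content(term_to_genes):
    """Compute IC(t) = -log2(|genes(t)| / |all annotated genes|) per term.

    Assumption: term_to_genes already includes genes propagated up from
    descendant terms, so broader terms have larger gene sets.
    """
    all_genes = set().union(*term_to_genes.values())
    total = len(all_genes)
    scores = {}
    for go_id, genes in term_to_genes.items():
        if genes:  # IC is undefined for terms without genes, so skip them
            scores[go_id] = -math.log2(len(set(genes)) / total)
    return scores


# Toy input with made-up term and gene identifiers.
toy = {"GO:broad": ["G1", "G2", "G3", "G4"], "GO:specific": ["G1"]}
toy_scores = information_content(toy)
```

Under this definition a term annotated to every gene scores 0, and a term annotated to one gene in four scores 2.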

Gene Information Dataset

Gene metadata curated using the HGNC gene list (as of 2024-10-01) and enriched via the MyGene.info API.

  • Fields Included:
    • ensembl_id: ENSEMBL gene identifier
    • symbol: Official gene symbol (e.g., TP53)
    • aliases: Alternate gene symbols
    • name: Full gene name
    • other_names: Synonyms and other designations
    • description: Functional summary (if available)
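
As a rough illustration of how a MyGene.info result could map onto these fields, the sketch below converts one hit (a plain dict) into a record. The function names are hypothetical. It assumes the MyGene.info field names symbol, name, alias, other_names, summary, and ensembl.gene, and it assumes that description is taken from summary and that the first Ensembl entry is used; the notebook may make different choices.

```python
def _as_list(value):
    """Normalise a field that may be missing, a single value, or a list."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


def to_gene_record(hit):
    """Map one MyGene.info-style hit (a dict) onto the fields listed above."""
    ensembl = _as_list(hit.get("ensembl"))  # a dict, or a list of dicts
    return {
        "ensembl_id": ensembl[0].get("gene") if ensembl else None,
        "symbol": hit.get("symbol"),
        "aliases": _as_list(hit.get("alias")),
        "name": hit.get("name"),
        "other_names": _as_list(hit.get("other_names")),
        "description": hit.get("summary"),  # None when no summary exists
    }


# Hand-written hit for illustration; a real response carries more fields.
example_hit = {
    "symbol": "TP53",
    "name": "tumor protein p53",
    "alias": "P53",
    "ensembl": {"gene": "ENSG00000141510"},
}
example_record = to_gene_record(example_hit)
```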

Gene Pathway Dataset

Biochemical pathway data extracted from the Reactome database for human genes.

  • Fields Included:
    • gene_id: ENSEMBL or HGNC identifier
    • pathways: List of pathways the gene participates in
    • reactions: List of biochemical reactions associated with each pathway
    • participants: Grouped by:
      • inputs: Molecules consumed in the reaction
      • outputs: Molecules produced
      • catalysts: Entities that facilitate the reaction
      • regulators: Positive/negative regulators involved
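
The nesting described above can be pictured as pathway → reaction → participant role → molecules. The sketch below groups flat (pathway, reaction, role, molecule) rows, such as a graph query might return, into that shape. The function name, the row layout, and the placeholder identifiers are assumptions for illustration; the actual layout of gene_reactions.json may differ.

```python
from collections import defaultdict

ROLES = ("inputs", "outputs", "catalysts", "regulators")


def group_participants(rows):
    """Group (pathway, reaction, role, molecule) rows into a nested dict.

    Result shape: {pathway: {reaction: {role: [molecule, ...]}}}
    """
    nested = defaultdict(lambda: defaultdict(lambda: {r: [] for r in ROLES}))
    for pathway, reaction, role, molecule in rows:
        if role not in ROLES:
            raise ValueError(f"unknown role: {role}")
        bucket = nested[pathway][reaction][role]
        if molecule not in bucket:  # keep each molecule once per role
            bucket.append(molecule)
    # Convert defaultdicts to plain dicts so the result serialises cleanly.
    return {p: dict(reactions) for p, reactions in nested.items()}


# Placeholder rows; real rows would carry Reactome names or identifiers.
toy_rows = [
    ("pathway_1", "reaction_1", "inputs", "molecule_A"),
    ("pathway_1", "reaction_1", "outputs", "molecule_B"),
    ("pathway_1", "reaction_1", "inputs", "molecule_A"),  # duplicate row
]
toy_grouped = group_participants(toy_rows)
```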

Scripts

  1. scripts/20250404GeneInfo_DataFormatting.py
    • Purpose: To transform a structured gene information dataset (JSON) into a list of question-answer pairs.
    • Description: This script reads a JSON file containing detailed information for multiple genes (expected from data/geneset_dict.json). It then iterates through each gene entry and, using a predefined mapping of question templates to JSON keys, generates specific questions about each gene. The corresponding values from the gene's data are used as answers.
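
The template-to-key approach described above can be sketched as follows. This is not the script itself: the question wording, the function names, and the assumption that the input is a list of gene records using the field names from the Gene Information Dataset are all illustrative.

```python
import json

# Hypothetical templates; the script's actual wording and keys may differ.
QUESTION_TEMPLATES = {
    "What is the full name of the gene {symbol}?": "name",
    "What are the aliases of the gene {symbol}?": "aliases",
    "What is the Ensembl ID of the gene {symbol}?": "ensembl_id",
    "What is the function of the gene {symbol}?": "description",
}


def make_qa_pairs(genes):
    """Build question-answer pairs from a list of gene records (dicts)."""
    pairs = []
    for gene in genes:
        symbol = gene.get("symbol")
        if not symbol:
            continue  # a question cannot be phrased without a symbol
        for template, key in QUESTION_TEMPLATES.items():
            answer = gene.get(key)
            if not answer:
                continue  # skip missing or empty fields
            if isinstance(answer, list):
                answer = ", ".join(answer)
            pairs.append(
                {"question": template.format(symbol=symbol), "answer": answer}
            )
    return pairs


def convert(in_path, out_path):
    """Read gene records from in_path and write QA pairs to out_path."""
    with open(in_path, encoding="utf-8") as handle:
        genes = json.load(handle)
    with open(out_path, "w", encoding="utf-8") as handle:
        json.dump(make_qa_pairs(genes), handle, indent=2)
```

Under these assumptions, calling convert("data/geneset_dict.json", "data/geneset_qa_pairs.json") would mirror the input and output files named in the directory structure. Fields that are missing or empty produce no question, which keeps unanswerable pairs out of the training data.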

Neo4j Setup and Querying

To run Reactome queries locally or remotely (e.g., from DGX, Andes, or any machine with multi-hop SSH access), follow the instructions below. These setup instructions are adapted from the official Reactome documentation, available at: https://reactome.org/dev/graph-database#GetStarted

Requirements

  • Neo4j (version used in this repo: neo4j-community-4.4.40)
  • Reactome graph database dump file (reactome.graphdb.dump)

Installation Steps

  1. Download and Install Neo4j v4

  2. Choose One of the Following Methods to Load Reactome Data

    Option A: Load from a Dump File

    • Download the latest graph database dump from:
      https://reactome.org/download-data
    • Load the dump file using the Neo4j Admin tool:
      ./path/to/neo4j/bin/neo4j-admin load --force --from=/path/to/reactome.graphdb.dump --database=graph.db

    Option B: Use the Extracted Graph Folder

    • Download and extract the graph folder archive (reactome.graph.db.tgz)
    • Move the extracted graph.db folder to:
      /path/to/neo4j/data/databases/
      
    • If a graph.db folder already exists, remove or rename it before replacing.
  3. Install and Configure the Database

    • Whichever option you used in step 2, confirm that the database folder is located in:
      /path/to/neo4j/data/databases/
      
    • Ensure the folder is named exactly graph.db
  4. Configure Neo4j Settings

    • Edit the configuration file at:
      /path/to/neo4j/conf/neo4j.conf
      
    • Recommended settings:
      dbms.default_database=graph.db
      dbms.recovery.fail_on_missing_files=false
      unsupported.dbms.tx_log.fail_on_corrupted_log_files=false
  5. Start the Neo4j Server

    ./path/to/neo4j/bin/neo4j start

Note: If Neo4j runs on a remote machine, forward its ports by running the following on your local machine, where <remote-host> is the machine on which Neo4j is installed: ssh -L 7474:localhost:7474 -L 7687:localhost:7687 <username>@<remote-host>

Information on how to query the Reactome database using Neo4j can be found here: https://reactome.org/dev/graph-database/extract-participating-molecules#retrieving-reactions

Note: Create a .env file in the root directory of the JAIL-Data-Curation-Benchmark project with your Neo4j credentials:

    ```env
    # JAIL-Data-Curation-Benchmark/.env
    URI="bolt://your_neo4j_host:7687"
    USERNAME="your_neo4j_username"
    PASSWORD="your_neo4j_password"
    ```
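
A minimal sketch of how these credentials might be used from Python is shown below. It assumes the official neo4j Python driver is installed (pip install neo4j) and that the Reactome graph exposes Pathway nodes with stId, displayName, and speciesName properties. The function names are hypothetical, and the notebooks may load the file differently (for example with python-dotenv).

```python
def read_env(path=".env"):
    """Parse simple KEY="value" lines, ignoring blanks and # comments."""
    values = {}
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def list_human_pathways(env_path=".env", limit=5):
    """Return (stId, displayName) tuples for a few human pathways."""
    from neo4j import GraphDatabase  # third-party driver, imported lazily

    env = read_env(env_path)
    # Assumption: Pathway nodes carry these properties in the Reactome graph.
    query = (
        "MATCH (p:Pathway {speciesName: 'Homo sapiens'}) "
        "RETURN p.stId AS st_id, p.displayName AS name LIMIT $limit"
    )
    driver = GraphDatabase.driver(
        env["URI"], auth=(env["USERNAME"], env["PASSWORD"])
    )
    try:
        with driver.session() as session:
            result = session.run(query, limit=limit)
            return [(record["st_id"], record["name"]) for record in result]
    finally:
        driver.close()
```

Reading the file directly, rather than going through environment variables, avoids picking up an unrelated USERNAME variable that some systems already define.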
  6. Stop Neo4j When Done

    ./path/to/neo4j/bin/neo4j stop

Acknowledgments

This project builds upon public biological resources and tools including:

  • HGNC – for standardized human gene nomenclature
  • MyGene.info – for programmatic gene metadata access
  • Gene Ontology Consortium – for GO term definitions and annotations
  • Reactome – for curated pathway and reaction data
  • Neo4j – for graph querying of biological entities
  • DeepEval – for evaluating LLM performance

Additional thanks to:

  • Anna Vlot
  • Peter Kruse
  • Kyle Sullivan
  • Ken Smith
  • Matt Lane
  • John Dandy
  • Mirko Pavicic Venegas
  • Doug Hyatt
  • Chanaka R. Abeyratne
  • Daniel Jacobson
