JAIL-Data-Curation-Benchmark

Overview

JAIL-Data-Curation-Benchmark curates biological datasets covering gene information, Gene Ontology (GO) terms, and biochemical reactions. These curated datasets are intended to serve as a foundation for training and evaluating Large Language Models (LLMs) on their ability to understand and retrieve specific biological information. The project includes scripts for fetching, processing, and formatting data into question-answer pairs, along with an example notebook for LLM evaluation using the DeepEval framework.

The core aims of the JAIL-Data-Curation-Benchmark are to:

- Curate High-Quality Biological Datasets: Systematically gather and process information about human genes, Gene Ontology terms, and biochemical reactions from public databases.
- Prepare Data for LLM Training: Transform curated data into formats suitable for supervised fine-tuning, such as question-answer pairs.

Directory Structure

The following directory structure is recommended at the root of your project:

JAIL-Data-Curation-Benchmark/
├── files/                
│   ├── (INPUT) hgnc_complete_set_2024-10-01.txt  # User must provide (update to the most recent file)
│   ├── (INPUT) go-basic.obo                      # User must provide
│   ├── (INPUT) goa_human.gaf                     # User must provide
│   └── (INPUT) neo4j_files/
│       └── (INPUT) neo4j-community-4.4.40        # Neo4j installation & Reactome Database migration
│
├── data/
│   ├── (OUTPUT) geneset_dict.json             # Generated by 20241017Gene_Dataset.ipynb
│   ├── (OUTPUT) geneset_qa_pairs.json         # Generated by 20250404GeneInfo_DataFormatting.py
│   ├── (OUTPUT) go_terms_dict.json            # Generated by 20241007_GOterm_Information_Dataset_Dictionary.ipynb
│   └── (OUTPUT) gene_reactions.json           # Generated by 20250219Gene_Reactions_Dataset.ipynb
│
├── notebooks/
│   ├── 20240823_LLMeval_test_deepeval-checkpoint.ipynb
│   ├── 20241007_GOterm_Information_Dataset_Dictionary.ipynb
│   ├── 20241017Gene_Dataset.ipynb
│   ├── 20250219Gene_Reactions_Dataset.ipynb
│   └── (OUTPUT/INPUT) gene_ensembl.json      # Generated by 20250219Gene_Reactions_Dataset.ipynb to be used
│
├── scripts/
│   └── 20250404GeneInfo_DataFormatting.py
│
├── .env                      # (User-created at project root) For Neo4j credentials
└── README.md

Key Components

Notebooks

notebooks/20240823_LLMeval_test_deepeval-checkpoint.ipynb
- Purpose: To demonstrate and test the DeepEval framework for LLM evaluation.
- Description: Sets up a custom Llama 3.1 8B model and applies DeepEval's HallucinationMetric, FaithfulnessMetric, and BiasMetric. The notebook runs on its own example inputs but could be adapted to use the curated datasets.
- Status: Currently on pause.

notebooks/20241007_GOterm_Information_Dataset_Dictionary.ipynb
- Purpose: To create the Gene Ontology Term Dataset (output: data/go_terms_dict.json).

notebooks/20241017Gene_Dataset.ipynb
- Purpose: To create the Gene Information Dataset (output: data/geneset_dict.json).

notebooks/20250219Gene_Reactions_Dataset.ipynb
- Purpose: To create the Gene Pathway Dataset (output: data/gene_reactions.json).

Datasets

Gene Ontology Term Dataset

A structured dataset of Gene Ontology (GO) terms in the Biological Process (BP) namespace.

Fields Included:
- go_id: Gene Ontology ID (e.g., GO:0008150)
- name: GO term name
- definition: Description of the biological process
- information_content: A score indicating specificity of the term
- associated_genes: List of gene symbols linked to the GO term
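
The README does not state how information_content is computed. A common convention, assumed here, scores a term as the negative log of the fraction of annotated genes associated with it, so rarer (more specific) terms score higher. The function name and the toy annotations below are hypothetical, and the notebook may use a different definition:

```python
import math


def information_content(term_to_genes: dict[str, set[str]]) -> dict[str, float]:
    """Return -log2(p) per GO term, where p is the share of all annotated genes on that term."""
    all_genes = set().union(*term_to_genes.values()) if term_to_genes else set()
    total = len(all_genes)
    scores = {}
    for go_id, genes in term_to_genes.items():
        if genes:  # skip terms without any annotated gene
            scores[go_id] = -math.log2(len(genes) / total)
    return scores


# Toy annotations for illustration only (not taken from goa_human.gaf)
toy = {
    "GO:0008150": {"TP53", "BRCA1", "EGFR", "MYC"},
    "GO:0006915": {"TP53"},
}
ic = information_content(toy)
```

This sketch counts only the genes listed directly for each term; whether the notebook propagates annotations to ancestor terms before scoring is not stated here.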

Gene Information Dataset

Gene metadata curated using the HGNC gene list (as of 2024-10-01) and enriched via the MyGene.info API.

Fields Included:
- ensembl_id: ENSEMBL gene identifier
- symbol: Official gene symbol (e.g., TP53)
- aliases: Alternate gene symbols
- name: Full gene name
- other_names: Synonyms and other designations
- description: Functional summary (if available)
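
To make the field list concrete, the sketch below maps one MyGene.info-style gene record onto the fields above. The source-side keys (alias, other_names, summary, ensembl.gene) are assumed to follow MyGene.info's gene annotation fields; the function names and the exact mapping used by the notebook are assumptions, and the sample record is illustrative:

```python
def as_list(value):
    """Wrap scalars in a list; a field may arrive as a single value or a list."""
    if value is None:
        return []
    return list(value) if isinstance(value, (list, tuple)) else [value]


def to_gene_record(hit: dict) -> dict:
    """Normalize one gene record into the Gene Information Dataset fields."""
    ensembl = as_list(hit.get("ensembl"))  # may be one dict or a list of dicts
    return {
        "ensembl_id": ensembl[0].get("gene") if ensembl else None,
        "symbol": hit.get("symbol"),
        "aliases": as_list(hit.get("alias")),
        "name": hit.get("name"),
        "other_names": as_list(hit.get("other_names")),
        "description": hit.get("summary"),  # None when no summary is available
    }


# Illustrative input shaped like an API response
sample_hit = {
    "symbol": "TP53",
    "name": "tumor protein p53",
    "alias": ["P53", "LFS1"],
    "ensembl": {"gene": "ENSG00000141510"},
}
record = to_gene_record(sample_hit)
```

Normalizing scalar-or-list fields up front keeps every entry in the output JSON the same shape, which simplifies the later question-answer formatting step.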

Gene Pathway Dataset

Biochemical pathway data extracted from the Reactome database for human genes.

Fields Included:
- gene_id: ENSEMBL or HGNC identifier
- pathways: List of pathways the gene participates in
- reactions: List of biochemical reactions associated with each pathway
- participants: Grouped by:
  - inputs: Molecules consumed in the reaction
  - outputs: Molecules produced
  - catalysts: Entities that facilitate the reaction
  - regulators: Positive/negative regulators involved
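
The nesting described above (gene, then pathways, then reactions, then participants grouped by role) can be sketched with dataclasses. The class names, the helper, and the placeholder values are hypothetical; the actual layout of data/gene_reactions.json may differ:

```python
from dataclasses import dataclass, field


@dataclass
class Participants:
    inputs: list[str] = field(default_factory=list)
    outputs: list[str] = field(default_factory=list)
    catalysts: list[str] = field(default_factory=list)
    regulators: list[str] = field(default_factory=list)


@dataclass
class Reaction:
    name: str
    participants: Participants = field(default_factory=Participants)


@dataclass
class Pathway:
    name: str
    reactions: list[Reaction] = field(default_factory=list)


@dataclass
class GeneEntry:
    gene_id: str
    pathways: list[Pathway] = field(default_factory=list)


def molecules_by_role(entry: GeneEntry, role: str) -> set[str]:
    """Collect every molecule playing `role` across all of a gene's reactions."""
    return {
        molecule
        for pathway in entry.pathways
        for reaction in pathway.reactions
        for molecule in getattr(reaction.participants, role)
    }


# Placeholder names only; not real Reactome content
entry = GeneEntry(
    gene_id="ENSG_PLACEHOLDER",
    pathways=[Pathway("pathway A", [
        Reaction("reaction 1", Participants(inputs=["X", "Y"], outputs=["Z"])),
        Reaction("reaction 2", Participants(inputs=["Z"], catalysts=["E"])),
    ])],
)
inputs = molecules_by_role(entry, "inputs")
```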

Scripts

scripts/20250404GeneInfo_DataFormatting.py
- Purpose: To transform a structured gene information dataset (JSON) into a list of question-answer pairs.
- Description: This script reads a JSON file containing detailed information for multiple genes (expected from data/geneset_dict.json). It then iterates through each gene entry and, using a predefined mapping of question templates to JSON keys, generates specific questions about each gene. The corresponding values from the gene's data are used as answers.
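
The template-to-key mapping described above can be sketched as follows. The question templates, the function name, and the assumption that the input is a dictionary keyed by gene symbol are all illustrative; the actual script may differ:

```python
# Hypothetical templates: question text -> key in each gene entry
QUESTION_TEMPLATES = {
    "What is the full name of the gene {symbol}?": "name",
    "What are the aliases of the gene {symbol}?": "aliases",
    "What is the Ensembl ID of the gene {symbol}?": "ensembl_id",
}


def make_qa_pairs(genes: dict[str, dict]) -> list[dict]:
    """Generate one question-answer pair per (gene, template) with a non-empty value."""
    pairs = []
    for symbol, entry in genes.items():
        for template, key in QUESTION_TEMPLATES.items():
            answer = entry.get(key)
            if answer in (None, "", []):
                continue  # no question for missing or empty fields
            if isinstance(answer, list):
                answer = ", ".join(answer)
            pairs.append({"question": template.format(symbol=symbol), "answer": answer})
    return pairs


# Toy input; in practice this would be loaded from data/geneset_dict.json
genes = {"TP53": {"name": "tumor protein p53", "aliases": ["P53", "LFS1"], "ensembl_id": None}}
pairs = make_qa_pairs(genes)
```

Skipping empty fields avoids emitting questions whose answer would be blank, which would otherwise add noise to a fine-tuning set.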

Neo4j Setup and Querying

To run Reactome queries locally or remotely (e.g., from DGX, Andes, or any machine with multi-hop SSH access), follow the instructions below. These setup instructions are adapted from the official Reactome documentation, available at: https://reactome.org/dev/graph-database#GetStarted

Requirements

- Neo4j (version used in this repo: neo4j-community-4.4.40)
- Reactome graph database dump file (reactome.graphdb.dump)

Installation Steps

Download and Install Neo4j v4
- Download Neo4j Community Edition v4.4.40 from:
  https://neo4j.com/download-center/#community

Choose One of the Following Methods to Load Reactome Data

Option A: Load from a Dump File
- Download the latest graph database dump from:
  https://reactome.org/download-data
- Load the dump file using the Neo4j Admin tool:
```
./path/to/neo4j/bin/neo4j-admin load --force --from=/path/to/reactome.graphdb.dump --database=graph.db
```

Option B: Use the Extracted Graph Folder
- Download and extract the graph folder archive (reactome.graph.db.tgz)
- Move the extracted graph.db folder to:
```
/path/to/neo4j/data/databases/
```
- If a graph.db folder already exists, remove or rename it before replacing.

Install and Configure the Database

Whichever option you used, the Reactome database should end up in:
```
/path/to/neo4j/data/databases/
```
Ensure the folder is named exactly graph.db.

Configure Neo4j Settings

Edit the configuration file at:
```
/path/to/neo4j/conf/neo4j.conf
```

Recommended settings:

dbms.default_database=graph.db
dbms.recovery.fail_on_missing_files=false
unsupported.dbms.tx_log.fail_on_corrupted_log_files=false

Start the Neo4j Server
```
./path/to/neo4j/bin/neo4j start
```

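Note: If Neo4j runs on a remote machine, forward its HTTP (7474) and Bolt (7687) ports by running the following on your local machine, where <remote-host> is the machine on which Neo4j is installed: ssh -L 7474:localhost:7474 -L 7687:localhost:7687 <username>@<remote-host>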

For examples of querying the Reactome database with Neo4j, see: https://reactome.org/dev/graph-database/extract-participating-molecules#retrieving-reactions

Note: Create a .env file in the root directory of the JAIL-Data-Curation-Benchmark project with your Neo4j credentials:

    ```env
    # JAIL-Data-Curation-Benchmark/.env
    URI="bolt://your_neo4j_host:7687"
    USERNAME="your_neo4j_username"
    PASSWORD="your_neo4j_password"
    ```
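
A minimal sketch of reading these credentials and opening a connection with the official neo4j Python driver follows. The helper names are hypothetical, and the small parser is a stand-in; the notebooks may load the file differently (for example with python-dotenv):

```python
from pathlib import Path


def read_env_file(path) -> dict[str, str]:
    """Parse simple KEY="value" lines, skipping blank lines and # comments."""
    values = {}
    for line in Path(path).read_text().splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def open_driver(env_path=".env"):
    """Create a Neo4j driver from the .env credentials (requires `pip install neo4j`)."""
    from neo4j import GraphDatabase  # third-party; imported lazily

    env = read_env_file(env_path)
    return GraphDatabase.driver(env["URI"], auth=(env["USERNAME"], env["PASSWORD"]))


# Usage, once the Neo4j server is running:
# driver = open_driver()
# driver.verify_connectivity()
# driver.close()
```

Reading the file directly also sidesteps a possible clash: USERNAME is often already set as an operating-system environment variable (notably on Windows), and python-dotenv's load_dotenv does not override existing variables by default.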

When done, stop Neo4j:
```
./path/to/neo4j/bin/neo4j stop
```

Acknowledgments

This project builds upon public biological resources and tools including:

- HGNC – for standardized human gene nomenclature
- MyGene.info – for programmatic gene metadata access
- Gene Ontology Consortium – for GO term definitions and annotations
- Reactome – for curated pathway and reaction data
- Neo4j – for graph querying of biological entities
- DeepEval – for evaluating LLM performance

Additional thanks to:

- Anna Vlot
- Peter Kruse
- Kyle Sullivan
- Ken Smith
- Matt Lane
- John Dandy
- Mirko Pavicic Venegas
- Doug Hyatt
- Chanaka R. Abeyratne
- Daniel Jacobson
