JAIL-Data-Curation-Benchmark focuses on the curation of various biological datasets, including gene information, Gene Ontology (GO) terms, and biochemical reactions. These curated datasets are intended to serve as a foundation for training and evaluating Large Language Models (LLMs) on their ability to understand and retrieve specific biological information. The project includes scripts for fetching, processing, and formatting data into question-answer pairs, plus an example notebook for LLM evaluation using the DeepEval framework.
The core aims of the JAIL-Data-Curation-Benchmark are to:
- Curate High-Quality Biological Datasets: Systematically gather and process information about human genes, Gene Ontology terms, and biochemical reactions from public databases.
- Prepare Data for LLM Training: Transform curated data into formats suitable for supervised fine-tuning, such as question-answer pairs.
The following directory structure is recommended at the root of your project:
JAIL-Data-Curation-Benchmark/
├── files/
│ ├── (INPUT) hgnc_complete_set_2024-10-01.txt # User must provide (update to the most recent file)
│ ├── (INPUT) go-basic.obo # User must provide
│ ├── (INPUT) goa_human.gaf # User must provide
│ ├── (INPUT) neo4j_files/
│ └── (INPUT) neo4j-community-4.4.40 # Neo4j installation & Reactome Database migration
│
├── data/
│ ├── (OUTPUT) geneset_dict.json # Generated by 20241017Gene_Dataset.ipynb
│ ├── (OUTPUT) geneset_qa_pairs.json # Generated by 20250404GeneInfo_DataFormatting.py
│ ├── (OUTPUT) go_terms_dict.json # Generated by 20241007_GOterm_Information_Dataset_Dictionary.ipynb
│ ├── (OUTPUT) gene_reactions.json # Generated by 20250219Gene_Reactions_Dataset.ipynb
│
├── notebooks/
│ ├── 20240823_LLMeval_test_deepeval-checkpoint.ipynb
│ ├── 20241007_GOterm_Information_Dataset_Dictionary.ipynb
│ ├── 20241017Gene_Dataset.ipynb
│ ├── 20250219Gene_Reactions_Dataset.ipynb
│ └── (OUTPUT/INPUT) gene_ensembl.json # Generated by 20250219Gene_Reactions_Dataset.ipynb to be used
│
├── scripts/
│ └── 20250404GeneInfo_DataFormatting.py
│
├── .env # (User-created at project root) For Neo4j credentials
└── README.md
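Before running the notebooks, it can help to confirm that the user-provided inputs are where the tree above expects them. The following is a minimal sketch, not part of the repository: the names `REQUIRED_INPUTS` and `missing_inputs` are hypothetical, and the HGNC file is matched with a glob pattern on the assumption that only its date suffix changes between releases.

```python
from pathlib import Path

# Inputs that the tree above marks as "User must provide".
# The HGNC file name carries a release date, so it is matched by pattern.
REQUIRED_INPUTS = [
    "files/hgnc_complete_set_*.txt",
    "files/go-basic.obo",
    "files/goa_human.gaf",
]


def missing_inputs(project_root):
    """Return the patterns from REQUIRED_INPUTS that match no file under project_root."""
    root = Path(project_root)
    return [pattern for pattern in REQUIRED_INPUTS if not any(root.glob(pattern))]


if __name__ == "__main__":
    # Run from the project root; prints one line per missing input.
    for pattern in missing_inputs("."):
        print(f"Missing input: {pattern}")
```

An empty result means all three inputs were found; the Neo4j installation under `files/neo4j_files/` is not checked here.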
- `notebooks/20240823_LLMeval_test_deepeval-checkpoint.ipynb`
  - Purpose: To demonstrate and test the DeepEval framework for LLM evaluation.
  - Description: Sets up a custom Llama 3.1 8B model and uses DeepEval's `HallucinationMetric`, `FaithfulnessMetric`, and `BiasMetric`. The notebook's example is self-contained, but it could be adapted to use the curated datasets.
  - Status: Currently on pause.
- `notebooks/20241007_GOterm_Information_Dataset_Dictionary.ipynb`
  - Purpose: To create the Gene Ontology Term Dataset.
- `notebooks/20241017Gene_Dataset.ipynb`
  - Purpose: To create the Gene Information Dataset.
- `notebooks/20250219Gene_Reactions_Dataset.ipynb`
  - Purpose: To create the Gene Pathway Dataset.
A structured dataset of Gene Ontology (GO) terms in the Biological Process (BP) namespace.
- Fields Included:
  - `go_id`: Gene Ontology ID (e.g., GO:0008150)
  - `name`: GO term name
  - `definition`: Description of the biological process
  - `information_content`: A score indicating specificity of the term
  - `associated_genes`: List of gene symbols linked to the GO term
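This README does not state how `information_content` is computed. One common definition, assumed here purely for illustration, is the negative log of the fraction of annotated genes that carry the term, so broad terms score near zero and rare terms score high. The function name and the counts in the usage note are hypothetical, and the notebook may use a different formula or log base.

```python
import math


def information_content(n_genes_with_term, n_genes_total):
    """Assumed definition: IC = -log2(p), where p is the fraction of genes annotated to the term.

    Written as log2(total / annotated) so that a term covering every gene yields 0.0.
    """
    if n_genes_total <= 0 or not 0 < n_genes_with_term <= n_genes_total:
        raise ValueError("expected 0 < n_genes_with_term <= n_genes_total")
    return math.log2(n_genes_total / n_genes_with_term)
```

Under this definition, a term annotated to every gene (such as the namespace root) has an information content of 0, while a term annotated to 250 of 1000 genes has an information content of 2.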
Gene metadata curated using the HGNC gene list (as of 2024-10-01) and enriched via the MyGene.info API.
- Fields Included:
  - `ensembl_id`: ENSEMBL gene identifier
  - `symbol`: Official gene symbol (e.g., TP53)
  - `aliases`: Alternate gene symbols
  - `name`: Full gene name
  - `other_names`: Synonyms and other designations
  - `description`: Functional summary (if available)
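As a rough illustration of the enrichment step, the sketch below builds a MyGene.info v3 query for one gene symbol and maps the response onto the fields listed above. It is not taken from the notebook: the field mapping (for example `summary` to `description`) and the helper names `build_query_url`, `to_record`, and `fetch_gene` are assumptions, and the notebook may use a client library or batch queries instead.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

MYGENE_QUERY_URL = "https://mygene.info/v3/query"
# MyGene.info fields requested; the mapping to dataset fields is assumed.
MYGENE_FIELDS = ["ensembl.gene", "symbol", "alias", "name", "other_names", "summary"]


def build_query_url(symbol):
    """Build a query URL that looks up one human gene by its official symbol."""
    params = {
        "q": f"symbol:{symbol}",
        "species": "human",
        "fields": ",".join(MYGENE_FIELDS),
        "size": 1,
    }
    return f"{MYGENE_QUERY_URL}?{urlencode(params)}"


def as_list(value):
    """MyGene.info may return a single string or a list; normalize to a list."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


def to_record(hit):
    """Map one MyGene.info hit onto the dataset fields described above."""
    ensembl = hit.get("ensembl") or {}
    if isinstance(ensembl, list):  # several Ensembl genes for one symbol
        ensembl = ensembl[0] if ensembl else {}
    return {
        "ensembl_id": ensembl.get("gene"),
        "symbol": hit.get("symbol"),
        "aliases": as_list(hit.get("alias")),
        "name": hit.get("name"),
        "other_names": as_list(hit.get("other_names")),
        "description": hit.get("summary"),
    }


def fetch_gene(symbol):
    """Query MyGene.info (network access required) and return a record or None."""
    with urlopen(build_query_url(symbol), timeout=30) as response:
        hits = json.load(response).get("hits", [])
    return to_record(hits[0]) if hits else None
```

Calling `fetch_gene("TP53")` would return a single dictionary with the six fields, with `description` left as `None` when MyGene.info has no summary for the gene.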
Biochemical pathway data extracted from the Reactome database for human genes.
- Fields Included:
  - `gene_id`: ENSEMBL or HGNC identifier
  - `pathways`: List of pathways the gene participates in
  - `reactions`: List of biochemical reactions associated with each pathway
  - `participants`: Grouped by:
    - `inputs`: Molecules consumed in the reaction
    - `outputs`: Molecules produced
    - `catalysts`: Entities that facilitate the reaction
    - `regulators`: Positive/negative regulators involved
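To make the nesting concrete, here is a small sketch that walks one gene record and collects participants by role. The exact layout of `gene_reactions.json` is not documented here, so the structure below (reactions nested inside pathways, participants nested inside reactions) is an assumption, and every name and value in `example_record` is a placeholder rather than real Reactome data.

```python
ROLES = ("inputs", "outputs", "catalysts", "regulators")


def participants_by_role(gene_record):
    """Collect the unique participant names per role across all pathways and reactions."""
    collected = {role: set() for role in ROLES}
    for pathway in gene_record.get("pathways", []):
        for reaction in pathway.get("reactions", []):
            participants = reaction.get("participants", {})
            for role in ROLES:
                collected[role].update(participants.get(role, []))
    return {role: sorted(names) for role, names in collected.items()}


# Placeholder record in the assumed layout; not real Reactome content.
example_record = {
    "gene_id": "GENE_ID_PLACEHOLDER",
    "pathways": [
        {
            "name": "pathway A",
            "reactions": [
                {
                    "name": "reaction 1",
                    "participants": {
                        "inputs": ["molecule X"],
                        "outputs": ["molecule Y"],
                        "catalysts": ["enzyme E"],
                        "regulators": [],
                    },
                },
                {
                    "name": "reaction 2",
                    "participants": {
                        "inputs": ["molecule Y"],
                        "outputs": ["molecule Z"],
                        "catalysts": ["enzyme E"],
                        "regulators": ["molecule R"],
                    },
                },
            ],
        }
    ],
}
```

If the real file stores reactions in a flat list alongside `pathways` instead, only the two loops in `participants_by_role` would need to change.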
- `scripts/20250404GeneInfo_DataFormatting.py`
  - Purpose: To transform a structured gene information dataset (JSON) into a list of question-answer pairs.
  - Description: This script reads a JSON file containing detailed information for multiple genes (expected from `data/geneset_dict.json`). It then iterates through each gene entry and, using a predefined mapping of question templates to JSON keys, generates specific questions about each gene. The corresponding values from the gene's data are used as answers.
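The template-to-key mapping described above can be sketched as follows. This is an illustration rather than the script itself: the question wording, the `question`/`answer` output keys, and the assumption that `geneset_dict.json` is either a list of gene entries or a dictionary of them are all hypothetical.

```python
import json

# Hypothetical question templates mapped to keys of a gene entry.
QUESTION_TEMPLATES = {
    "What is the ENSEMBL ID of the gene {symbol}?": "ensembl_id",
    "What are the aliases of the gene {symbol}?": "aliases",
    "What is the full name of the gene {symbol}?": "name",
    "What is the function of the gene {symbol}?": "description",
}


def format_answer(value):
    """Join list values into one string; convert anything else with str()."""
    if isinstance(value, list):
        return ", ".join(str(item) for item in value)
    return str(value)


def make_qa_pairs(genes):
    """Generate one question-answer pair per template whose key has a non-empty value."""
    pairs = []
    for gene in genes:
        symbol = gene.get("symbol")
        if not symbol:
            continue  # every template needs the symbol
        for template, key in QUESTION_TEMPLATES.items():
            value = gene.get(key)
            if value is None or value == "" or value == []:
                continue  # skip missing or empty fields
            pairs.append(
                {"question": template.format(symbol=symbol), "answer": format_answer(value)}
            )
    return pairs


def convert(in_path="data/geneset_dict.json", out_path="data/geneset_qa_pairs.json"):
    """Read the gene dataset, build the pairs, and write them as JSON."""
    with open(in_path, encoding="utf-8") as handle:
        data = json.load(handle)
    genes = list(data.values()) if isinstance(data, dict) else data
    with open(out_path, "w", encoding="utf-8") as handle:
        json.dump(make_qa_pairs(genes), handle, indent=2)
```

Skipping empty fields keeps the output free of questions whose answer would be blank, which matters for fields such as `description` that are only present when a summary is available.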
To run Reactome queries locally or remotely (e.g., from DGX, Andes, or any machine with multi-hop SSH access), follow the instructions below. These setup instructions are adapted from the official Reactome documentation, available at: https://reactome.org/dev/graph-database#GetStarted
- Neo4j (version used in this repo: neo4j-community-4.4.40)
- Reactome graph database dump file (`reactome.graphdb.dump`)
- Download and Install Neo4j v4
  - Download Neo4j Community Edition v4.4.40 from: https://neo4j.com/download-center/#community
- Choose One of the Following Methods to Load Reactome Data
  - Using the dump file:
    - Download the latest graph database dump from: https://reactome.org/download-data
    - Load the dump file using the Neo4j Admin tool: `./path/to/neo4j/bin/neo4j-admin load --force --from=/path/to/reactome.graphdb.dump --database=graph.db`
  - Using the graph folder archive:
    - Download and extract the graph folder archive (`reactome.graph.db.tgz`)
    - Move the extracted `graph.db` folder to: `/path/to/neo4j/data/databases/`
    - If a `graph.db` folder already exists, remove or rename it before replacing.
- Install and Configure the Database
  - Load the dump file into Neo4j: `./path/to/neo4j/bin/neo4j-admin load --force --from=/path/to/reactome.graphdb.dump --database=graph.db`
  - Move the `graph.db` folder to: `/path/to/neo4j/data/databases/`
  - Ensure the folder is named exactly `graph.db`
- Configure Neo4j Settings
  - Edit the configuration file at: `/path/to/neo4j/conf/neo4j.conf`
  - Recommended settings:
    - `dbms.default_database=graph.db`
    - `dbms.recovery.fail_on_missing_files=false`
    - `unsupported.dbms.tx_log.fail_on_corrupted_log_files=false`
- Start the Neo4j Server
  - `./path/to/neo4j/bin/neo4j start`
  - Note: If accessing remotely, run the following on your local machine to forward the Neo4j ports from the machine where Neo4j is installed: `ssh -L 7474:localhost:7474 -L 7687:localhost:7687 <username>@<remote-host>`
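Once the server is started (and, for remote access, the SSH tunnel is open), a quick way to confirm that both forwarded ports answer is a plain TCP connection test. This is a minimal sketch using only the standard library; the function name `port_is_open` is hypothetical, and a successful connection only shows that something is listening, not that Neo4j has finished starting.

```python
import socket


def port_is_open(host="localhost", port=7687, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # refused, unreachable, or timed out
        return False
```

Checking `port_is_open(port=7474)` and `port_is_open(port=7687)` covers the two ports named in the forwarding command above.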
Information on how to query the Reactome database using Neo4j can be found here: https://reactome.org/dev/graph-database/extract-participating-molecules#retrieving-reactions
Note: Create a .env file in the root directory of the JAIL-Data-Curation-Benchmark project with your Neo4j credentials:
```env
# JAIL-Data-Curation-Benchmark/.env
URI="bolt://your_neo4j_host:7687"
USERNAME="your_neo4j_username"
PASSWORD="your_neo4j_password"
```
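The sketch below shows one way these credentials could be read and used to run a query through the official `neo4j` Python driver (a third-party package). It is an illustration, not the notebook's code: `read_env` and `list_pathways` are hypothetical helpers, the Cypher query is only an example that assumes the `Pathway` label and the `stId`, `displayName`, and `speciesName` properties of the Reactome graph, and the notebooks may load the file differently (for example with python-dotenv).

```python
from pathlib import Path


def read_env(path=".env"):
    """Parse KEY="value" lines from a .env file, skipping blanks and comments."""
    values = {}
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip("\"'")
    return values


# Example query only; label and property names are assumptions about the Reactome graph.
PATHWAY_QUERY = """
MATCH (p:Pathway {speciesName: "Homo sapiens"})
RETURN p.stId AS stId, p.displayName AS name
LIMIT $limit
"""


def list_pathways(env_path=".env", limit=5):
    """Connect with the .env credentials and return a few human pathways."""
    from neo4j import GraphDatabase  # third-party driver: pip install neo4j

    env = read_env(env_path)
    driver = GraphDatabase.driver(env["URI"], auth=(env["USERNAME"], env["PASSWORD"]))
    try:
        # "graph.db" matches the database name used in the setup steps above.
        with driver.session(database="graph.db") as session:
            return [record.data() for record in session.run(PATHWAY_QUERY, limit=limit)]
    finally:
        driver.close()
```

Reading the file directly, rather than through process environment variables, avoids picking up a `USERNAME` value that the operating system or shell may already define.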
- When done, stop Neo4j: `./path/to/neo4j/bin/neo4j stop`
This project builds upon public biological resources and tools including:
- HGNC – for standardized human gene nomenclature
- MyGene.info – for programmatic gene metadata access
- Gene Ontology Consortium – for GO term definitions and annotations
- Reactome – for curated pathway and reaction data
- Neo4j – for graph querying of biological entities
- DeepEval – for evaluating LLM performance
Additional thanks to:
- Anna Vlot
- Peter Kruse
- Kyle Sullivan
- Ken Smith
- Matt Lane
- John Dandy
- Mirko Pavicic Venegas
- Doug Hyatt
- Chanaka R. Abeyratne
- Daniel Jacobson