27 commits
fc18746
feat: multi omics KG building
CHERRY-ui8 Dec 18, 2025
ce2b296
fix: remove remaining conflict markers
CHERRY-ui8 Dec 18, 2025
6d0868a
fix: restore files accidentally modified
CHERRY-ui8 Dec 18, 2025
b6e65c0
fix: update local blast database paths in omics qa config
CHERRY-ui8 Dec 18, 2025
29ab42f
fix: fix pylint problems
CHERRY-ui8 Dec 18, 2025
8b908b8
fix: fix pylint problems agaaaain
CHERRY-ui8 Dec 18, 2025
4173095
fix: fix pylint problems again
CHERRY-ui8 Dec 18, 2025
6466e27
chore: remove protein KG extraction template
CHERRY-ui8 Dec 18, 2025
82665bf
refactor: remove unused read_stream method from JSONReader
CHERRY-ui8 Dec 18, 2025
2a27f69
refactor: remove repeated image_exists method from JSONReader
CHERRY-ui8 Dec 18, 2025
91abfde
refactor: clean up logging in Engine
CHERRY-ui8 Dec 19, 2025
f2ee12f
refactor: enhance initialization of services with configurable backends
CHERRY-ui8 Dec 19, 2025
b21f4ce
refactor: remove unused progress bar from run_concurrent
CHERRY-ui8 Dec 19, 2025
23fa2bb
refactor: simplify anchor_type initialization in AnchorBFSPartitioner
CHERRY-ui8 Dec 19, 2025
41456e8
style: pylint problems in AnchorBFSPartitioner
CHERRY-ui8 Dec 19, 2025
610c48d
refactor: refactor search and db build scripts for DNA, RNA, and protein
CHERRY-ui8 Dec 19, 2025
8566fb4
refactor: remove unused async_to_sync_method
CHERRY-ui8 Dec 19, 2025
7dc02ff
fix: update omics qa generation template to avoid repetition
CHERRY-ui8 Dec 19, 2025
f29ab79
refactor: remove output_dir argument from multi-omics and search scripts
CHERRY-ui8 Dec 24, 2025
7701a62
merge
ChenZiHong-Gavin Dec 24, 2025
7c1ffeb
Merge branch 'feature/multi-omics-qa-clean' of https://github.com/CHE…
ChenZiHong-Gavin Dec 24, 2025
82666a7
Merge branch 'main' into feature/multi-omics-qa-clean
CHERRY-ui8 Dec 24, 2025
6dcfdf4
merge
ChenZiHong-Gavin Dec 24, 2025
664d305
Merge branch 'main' of https://github.com/open-sciencelab/GraphGen in…
ChenZiHong-Gavin Dec 24, 2025
29e4390
Merge branch 'main' of https://github.com/open-sciencelab/GraphGen in…
ChenZiHong-Gavin Dec 24, 2025
6cc75e8
refactor: remove unused multi_omics_search.py
CHERRY-ui8 Dec 24, 2025
36f17c5
Merge branch 'feature/multi-omics-qa-clean' of https://github.com/CHE…
CHERRY-ui8 Dec 24, 2025
216 changes: 216 additions & 0 deletions examples/generate/generate_omics_qa/README.md
@@ -0,0 +1,216 @@
# Multi-omics Knowledge Graph QA Generation

This example demonstrates how to build knowledge graphs from multi-omics data (DNA, RNA, protein) and generate question-answer pairs using the unified `omics_qa` method.

## Pipeline Overview

The pipeline includes the following steps:

1. **read**: Read input files (JSON/JSONL format with sequence queries or protein data)
2. **search**: Search biological databases (NCBI for DNA, RNAcentral for RNA, UniProt for protein) - *optional if input already contains search results*
3. **chunk**: Chunk sequences and metadata
4. **build_kg**: Extract entities and relationships to build knowledge graph
5. **partition**: Partition the knowledge graph into communities using anchor-based BFS
6. **generate**: Generate QA pairs from partitioned communities with automatic molecule caption extraction
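
The ordering of these steps follows from their dependencies. As a minimal sketch (the `PIPELINE` table and `execution_order` helper are illustrative names, not part of GraphGen), the same ordering can be derived with a topological sort over the step names listed above:

```python
from graphlib import TopologicalSorter

# Hypothetical dependency table: each step maps to the steps it depends on.
# The step names mirror the list above; the table itself is an illustration.
PIPELINE = {
    "read": [],
    "search": ["read"],
    "chunk": ["search"],
    "build_kg": ["chunk"],
    "partition": ["build_kg"],
    "generate": ["partition"],
}


def execution_order(pipeline):
    """Return step names in an order that respects every dependency."""
    return list(TopologicalSorter(pipeline).static_order())


print(execution_order(PIPELINE))
```

Because the pipeline is a single chain, only one valid order exists.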

## Key Features

- **Unified QA Generation**: Single `omics_qa` method supports DNA, RNA, and Protein
- **Automatic Caption Extraction**: Automatically extracts and attaches molecule-specific information (dna/rna/protein captions) to each QA pair
- **Flexible Configuration**: Easy to switch between DNA, RNA, and Protein by changing the input file and data source
- **Anchor-based Partitioning**: Uses molecule type as anchor for BFS partitioning (dna/rna/protein)

## Quick Start

### 1. Configure Input Data

Edit `omics_qa_config.yaml` to set the input file path:

**For DNA:**
```yaml
input_path:
- examples/input_examples/search_dna_demo.jsonl
```

**For RNA:**
```yaml
input_path:
- examples/input_examples/search_rna_demo.jsonl
```

**For Protein:**
```yaml
input_path:
- examples/input_examples/search_protein_demo.jsonl
```

### 2. Configure Data Source

Set the appropriate data source and parameters in the `search_data` node:

**For DNA (NCBI):**
```yaml
data_sources: [ncbi]
ncbi_params:
  email: your_email@example.com # Required!
  tool: GraphGen
  use_local_blast: true
  local_blast_db: refseq_release/refseq_release
  blast_num_threads: 2
  max_concurrent: 5
```

**For RNA (RNAcentral):**
```yaml
data_sources: [rnacentral]
rnacentral_params:
  use_local_blast: true
  local_blast_db: rnacentral_ensembl_gencode_YYYYMMDD/ensembl_gencode_YYYYMMDD
  blast_num_threads: 2
  max_concurrent: 5
```

**For Protein (UniProt):**
```yaml
data_sources: [uniprot]
uniprot_params:
  use_local_blast: true
  local_blast_db: ${RELEASE}/uniprot_sprot
  blast_num_threads: 2
  max_concurrent: 5
```

### 3. Configure Anchor Type

Set the `anchor_type` in the `partition` node to match your molecule type:

```yaml
partition:
  params:
    method: anchor_bfs
    method_params:
      anchor_type: protein # Change to "dna" or "rna" as needed
      max_units_per_community: 10
```
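
To make the role of the anchor concrete, here is a minimal sketch of anchor-based BFS partitioning. It is not GraphGen's `AnchorBFSPartitioner`: the function name, the adjacency-dict graph, and the toy node names are assumptions, and for brevity the cap counts nodes only, whereas `max_units_per_community` counts nodes and edges.

```python
from collections import deque


def anchor_bfs(adjacency, node_types, anchor_type, max_units):
    """Grow one community per anchor node using breadth-first search.

    Simplification (assumption): `max_units` caps nodes only.
    """
    communities = []
    assigned = set()
    for anchor, kind in node_types.items():
        if kind != anchor_type or anchor in assigned:
            continue
        community = []
        queue = deque([anchor])
        while queue and len(community) < max_units:
            node = queue.popleft()
            if node in assigned:
                continue
            assigned.add(node)
            community.append(node)
            # Enqueue unvisited neighbours so the community grows outward.
            queue.extend(n for n in adjacency.get(node, []) if n not in assigned)
        communities.append(community)
    return communities


# Toy graph with two protein anchors (hypothetical node names).
adjacency = {
    "P1": ["geneA", "orgX"],
    "geneA": ["P1"],
    "orgX": ["P1", "P2"],
    "P2": ["orgX", "geneB"],
    "geneB": ["P2"],
}
node_types = {
    "P1": "protein",
    "geneA": "gene",
    "orgX": "organism",
    "P2": "protein",
    "geneB": "gene",
}
print(anchor_bfs(adjacency, node_types, "protein", 3))
```

If `anchor_type` does not match any node type in the graph, the loop finds no anchors and no communities are produced, which is why the setting has to match the molecule type.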

### 4. Run the Pipeline

```bash
./generate_omics_qa.sh
```

Or run directly with Python:

```bash
python3 -m graphgen.run \
  --config_file examples/generate/generate_omics_qa/omics_qa_config.yaml
```

## Input Format

### For DNA/RNA (JSONL format):
```jsonl
{"type": "text", "content": "BRCA1"}
{"type": "text", "content": ">query\nATGCGATCG..."}
{"type": "text", "content": "ATGCGATCG..."}
```

### For Protein (JSONL format):
```jsonl
{"type": "text", "content": "P01308"}
{"type": "text", "content": "insulin"}
{"type": "text", "content": "MHHHHHHSSGVDLGTENLYFQSNAMDFPQQLEACVKQANQALSRFIAPLPFQNTPVVETMQYGALLGGKRLRPFLVYATGHMFGVSTNTLDAPAAAVECIHAYSLIHDDLPAMDDDDLRRGLPTCHVKFGEANAILAGDALQTLAFSILSDANMPEVSDRDRISMISELASASGIAGMCGGQALDLDAEGKHVPLDALERIHRHKTGALIRAAVRLGALSAGDKGRRALPVLDKYAESIGLAFQVQDDILDVVGDTATLGKRQGADQQLGKSTYPALLGLEQARKKARDLIDDARQALKQLAEQSLDTSALEALADYIIQRNK"}
```
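
Input files in this format can be produced programmatically. The sketch below uses only the standard `json` module; the helper names `to_jsonl` and `from_jsonl` are illustrative and not part of GraphGen. It also shows that a FASTA-style query keeps its embedded newline, because `json.dumps` escapes it inside the string.

```python
import json


def to_jsonl(queries):
    """Serialize each query as one {"type": "text", "content": ...} line."""
    lines = [json.dumps({"type": "text", "content": q}) for q in queries]
    return "\n".join(lines) + "\n"


def from_jsonl(text):
    """Parse JSONL text back into a list of records, skipping blank lines."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


# A gene name and a FASTA-style query with a header line.
payload = to_jsonl(["BRCA1", ">query\nATGCGATCG"])
records = from_jsonl(payload)
print(records[1]["content"])
```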

## Output Format

The `omics_qa` method automatically extracts and attaches molecule-specific captions to QA pairs:

### Alpaca Format:
```json
{
  "instruction": "What is the function of this protein?",
  "input": "",
  "output": "The protein functions as...",
  "dna": {...},
  "rna": {...},
  "protein": {...}
}
```

Each caption key is included only for its matching molecule type: `dna` for DNA, `rna` for RNA, and `protein` for protein.

### ChatML Format:
```json
{
  "messages": [
    {
      "role": "user",
      "content": [
        {
          "text": "What is the function of this protein?",
          "dna": {...},
          "rna": {...},
          "protein": {...}
        }
      ]
    },
    {
      "role": "assistant",
      "content": "The protein functions as..."
    }
  ]
}
```
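
Downstream code usually needs to pull the caption back out of a record. The following is a sketch based only on the two layouts shown above; the helper names and the sample caption values are made up for illustration.

```python
MOLECULE_KEYS = ("dna", "rna", "protein")


def captions_from_alpaca(record):
    """Collect the non-empty caption fields from an Alpaca-style record."""
    return {k: record[k] for k in MOLECULE_KEYS if record.get(k)}


def captions_from_chatml(record):
    """Collect caption fields from the list-valued message contents."""
    found = {}
    for message in record.get("messages", []):
        content = message.get("content")
        if not isinstance(content, list):
            continue  # assistant turns carry a plain string
        for part in content:
            for key in MOLECULE_KEYS:
                if part.get(key):
                    found[key] = part[key]
    return found


# Sample records with made-up caption values.
alpaca = {
    "instruction": "What is the function of this protein?",
    "input": "",
    "output": "The protein functions as...",
    "protein": {"protein_name": "Insulin"},
}
chatml = {
    "messages": [
        {"role": "user", "content": [{"text": "What is the function of this protein?",
                                      "protein": {"protein_name": "Insulin"}}]},
        {"role": "assistant", "content": "The protein functions as..."},
    ]
}
print(captions_from_alpaca(alpaca), captions_from_chatml(chatml))
```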

## Caption Information

The generator automatically extracts relevant caption information based on molecule type:

- **DNA**: gene_name, gene_description, organism, chromosome, genomic_location, function, gene_type, etc.
- **RNA**: rna_type, description, organism, related_genes, gene_name, so_term, modifications, etc.
- **Protein**: protein_name, gene_names, organism, function, sequence, entry_name, etc.

## Configuration Options

### Chunking Parameters
- `chunk_size`: Size for text metadata chunks (default: 1024)
- `chunk_overlap`: Overlap for text chunks (default: 100)
- `sequence_chunk_size`: Size for sequence chunks (default: 1000)
- `sequence_chunk_overlap`: Overlap for sequence chunks (default: 100)
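
The interaction between chunk size and overlap can be illustrated with a small sliding-window sketch. This is not the GraphGen chunker; `chunk_sequence` is a hypothetical helper whose defaults simply mirror the sequence defaults listed above.

```python
def chunk_sequence(sequence, chunk_size=1000, chunk_overlap=100):
    """Split a sequence into windows where neighbours share `chunk_overlap` symbols."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(sequence), step):
        chunks.append(sequence[start:start + chunk_size])
        if start + chunk_size >= len(sequence):
            break  # the last window already reaches the end
    return chunks


print(chunk_sequence("ABCDEFGHIJ", chunk_size=4, chunk_overlap=1))
```

With the defaults, each new window starts 900 positions after the previous one, so a 2,500-residue sequence yields three chunks.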

### Partition Parameters
- `method`: `anchor_bfs` (recommended for omics data)
- `anchor_type`: `dna`, `rna`, or `protein` (must match your data type); a list such as `[dna, rna, protein]` is also accepted for mixed multi-omics input
- `max_units_per_community`: Maximum nodes and edges per community (default: 10)

### Generation Parameters
- `method`: `omics_qa` (unified method for DNA/RNA/Protein)
- `data_format`: `Alpaca`, `ChatML`, or `Sharegpt`

## Notes

- **NCBI requires an email address** - Make sure to set `email` in `ncbi_params`
- **Anchor type must match molecule type** - Set `anchor_type` to match your data (dna/rna/protein)
- **Local BLAST** can be enabled if you have local databases set up (see `examples/search/build_db/`)
- **Caption extraction** is automatic - The generator detects molecule type and extracts relevant caption information
- Adjust `max_concurrent` based on your system resources and API rate limits
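
The first two notes can be checked before launching a run. The sketch below is a hypothetical pre-flight check, not a GraphGen feature: it takes the `params` of the `search_data` node as a plain dict plus the `anchor_type` value, and assumes the source-to-molecule pairing described in this README (NCBI for DNA, RNAcentral for RNA, UniProt for protein).

```python
SOURCE_TO_MOLECULE = {"ncbi": "dna", "rnacentral": "rna", "uniprot": "protein"}


def check_omics_config(search_params, anchor_type):
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    sources = search_params.get("data_sources", [])
    if "ncbi" in sources:
        email = search_params.get("ncbi_params", {}).get("email", "")
        # The placeholder address from the example config is treated as unset.
        if not email or email == "your_email@example.com":
            problems.append("ncbi_params.email must be set to a real address")
    anchors = [anchor_type] if isinstance(anchor_type, str) else list(anchor_type)
    for source in sources:
        molecule = SOURCE_TO_MOLECULE.get(source)
        if molecule is not None and molecule not in anchors:
            problems.append(
                f"anchor_type is missing '{molecule}' for data source '{source}'"
            )
    return problems


print(check_omics_config({"data_sources": ["uniprot"]}, "protein"))
```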

## Examples

### Generate QA for Protein Data
1. Set `input_path` to `examples/input_examples/search_protein_demo.jsonl`
2. Set `data_sources: [uniprot]`
3. Set `anchor_type: protein`
4. Run `./generate_omics_qa.sh`

### Generate QA for DNA Data
1. Set `input_path` to `examples/input_examples/search_dna_demo.jsonl`
2. Set `data_sources: [ncbi]`
3. Set `anchor_type: dna`
4. Run `./generate_omics_qa.sh`

### Generate QA for RNA Data
1. Set `input_path` to `examples/input_examples/search_rna_demo.jsonl`
2. Set `data_sources: [rnacentral]`
3. Set `anchor_type: rna`
4. Run `./generate_omics_qa.sh`
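
The three recipes above differ only in three values. As a convenience sketch (the `OMICS_SETTINGS` table and `settings_for` helper are illustrative, not shipped with the example), the values can be kept in one lookup and copied into `omics_qa_config.yaml` as needed:

```python
# Values copied from the three recipes above.
OMICS_SETTINGS = {
    "protein": {
        "input_path": "examples/input_examples/search_protein_demo.jsonl",
        "data_sources": ["uniprot"],
        "anchor_type": "protein",
    },
    "dna": {
        "input_path": "examples/input_examples/search_dna_demo.jsonl",
        "data_sources": ["ncbi"],
        "anchor_type": "dna",
    },
    "rna": {
        "input_path": "examples/input_examples/search_rna_demo.jsonl",
        "data_sources": ["rnacentral"],
        "anchor_type": "rna",
    },
}


def settings_for(molecule):
    """Return the input path, data source, and anchor type for one molecule type."""
    key = molecule.lower()
    if key not in OMICS_SETTINGS:
        raise ValueError(f"unknown molecule type: {molecule!r}")
    return dict(OMICS_SETTINGS[key])


print(settings_for("DNA"))
```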
2 changes: 2 additions & 0 deletions examples/generate/generate_omics_qa/generate_omics_qa.sh
@@ -0,0 +1,2 @@
python3 -m graphgen.run \
--config_file examples/generate/generate_omics_qa/omics_qa_config.yaml
Original file line number Diff line number Diff line change
@@ -0,0 +1,2 @@
python3 -m graphgen.run \
--config_file examples/generate/generate_omics_qa/omics_qa_config_searched.yaml
92 changes: 92 additions & 0 deletions examples/generate/generate_omics_qa/omics_qa_config.yaml
@@ -0,0 +1,92 @@
global_params:
  working_dir: cache
  graph_backend: kuzu # graph database backend, support: kuzu, networkx
  kv_backend: rocksdb # key-value store backend, support: rocksdb, json_kv

nodes:
  - id: read_files
    op_name: read
    type: source
    dependencies: []
    params:
      input_path:
        # three input files to generate DNA, RNA, and Protein data together
        - examples/input_examples/search_dna_demo.jsonl
        - examples/input_examples/search_rna_demo.jsonl
        - examples/input_examples/search_protein_demo.jsonl

  - id: search_data
    op_name: search
    type: map_batch
    dependencies:
      - read_files
    execution_params:
      replicas: 1
      batch_size: 10
    params:
      data_sources: [ncbi, rnacentral, uniprot] # Multi-omics: use all three data sources
      # DNA search parameters
      ncbi_params:
        email: your_email@example.com # Required for NCBI
        tool: GraphGen
        use_local_blast: true
        local_blast_db: path_to_your_local_blast_db/refseq_version/refseq_version
        blast_num_threads: 2
        max_concurrent: 5
      # RNA search parameters
      rnacentral_params:
        use_local_blast: true
        local_blast_db: path_to_your_local_blast_db/rnacentral_YYYYMMDD/rnacentral_YYYYMMDD
        blast_num_threads: 2
        max_concurrent: 5
      # Protein search parameters
      uniprot_params:
        use_local_blast: true
        local_blast_db: path_to_your_local_blast_db/${RELEASE}/uniprot_sprot
        blast_num_threads: 2
        max_concurrent: 5

  - id: chunk_documents
    op_name: chunk
    type: map_batch
    dependencies:
      - search_data
    execution_params:
      replicas: 4
    params:
      chunk_size: 1024 # chunk size for text splitting
      chunk_overlap: 100 # chunk overlap for text splitting
      sequence_chunk_size: 1000 # For sequence chunks (bp for DNA/RNA, aa for protein)
      sequence_chunk_overlap: 100

  - id: build_kg
    op_name: build_kg
    type: map_batch
    dependencies:
      - chunk_documents
    execution_params:
      replicas: 1
      batch_size: 128

  - id: partition
    op_name: partition
    type: aggregate
    dependencies:
      - build_kg
    params:
      method: anchor_bfs # partition method
      method_params:
        anchor_type: [dna, rna, protein] # Multi-omics: support multiple anchor types (list or single string)
        max_units_per_community: 10 # max nodes and edges per community

  - id: generate
    op_name: generate
    type: map_batch
    dependencies:
      - partition
    execution_params:
      replicas: 1
      batch_size: 128
    params:
      method: omics_qa # unified QA generation method for DNA/RNA/Protein
      data_format: ChatML # Alpaca, Sharegpt, ChatML
73 changes: 73 additions & 0 deletions examples/generate/generate_omics_qa/omics_qa_config_searched.yaml
@@ -0,0 +1,73 @@
global_params:
  working_dir: cache
  graph_backend: kuzu # graph database backend, support: kuzu, networkx
  kv_backend: rocksdb # key-value store backend, support: rocksdb, json_kv

nodes:
  - id: read_files
    op_name: read
    type: source
    dependencies: []
    params:
      input_path:
        # Use pre-searched data files (skip search step)
        # The search_service will automatically detect and skip search if data already contains search results
        - examples/input_examples/searched_dna_demo.jsonl
        - examples/input_examples/searched_rna_demo.jsonl
        - examples/input_examples/searched_protein_demo.jsonl

  - id: search_data
    op_name: search
    type: map_batch
    dependencies:
      - read_files
    execution_params:
      replicas: 1
      batch_size: 10
    # Note: search_service will automatically detect pre-searched data and skip search,
    # but it will still normalize the data format (ensure _doc_id, content, data_source fields exist)

  - id: chunk_documents
    op_name: chunk
    type: map_batch
    dependencies:
      - search_data
    execution_params:
      replicas: 4
    params:
      chunk_size: 1024 # chunk size for text splitting
      chunk_overlap: 100 # chunk overlap for text splitting
      sequence_chunk_size: 1000 # For sequence chunks (bp for DNA/RNA, aa for protein)
      sequence_chunk_overlap: 100

  - id: build_kg
    op_name: build_kg
    type: map_batch
    dependencies:
      - chunk_documents
    execution_params:
      replicas: 1
      batch_size: 128

  - id: partition
    op_name: partition
    type: aggregate
    dependencies:
      - build_kg
    params:
      method: anchor_bfs # partition method
      method_params:
        anchor_type: [dna, rna, protein] # Multi-omics: support multiple anchor types (list or single string)
        max_units_per_community: 10 # max nodes and edges per community

  - id: generate
    op_name: generate
    type: map_batch
    dependencies:
      - partition
    execution_params:
      replicas: 1
      batch_size: 128
    params:
      method: omics_qa # unified QA generation method for DNA/RNA/Protein
      data_format: ChatML # Alpaca, Sharegpt, ChatML