feat: multi-omics KG building #122
Open
CHERRY-ui8 wants to merge 27 commits into InternScience:main from CHERRY-ui8:feature/multi-omics-qa-clean
fc18746 feat: multi omics KG building (CHERRY-ui8)
ce2b296 fix: remove remaining conflict markers (CHERRY-ui8)
6d0868a fix: restore files accidentally modified (CHERRY-ui8)
b6e65c0 fix: update local blast database paths in omics qa config (CHERRY-ui8)
29ab42f fix: fix pylint problems (CHERRY-ui8)
8b908b8 fix: fix pylint problems agaaaain (CHERRY-ui8)
4173095 fix: fix pylint problems again (CHERRY-ui8)
6466e27 chore: remove protein KG extraction template (CHERRY-ui8)
82665bf refactor: remove unused read_stream method from JSONReader (CHERRY-ui8)
2a27f69 refactor: remove repeated image_exists method from JSONReader (CHERRY-ui8)
91abfde refactor: clean up logging in Engine (CHERRY-ui8)
f2ee12f refactor: enhance initialization of services with configurable backends (CHERRY-ui8)
b21f4ce refactor: remove unused progress bar from run_concurrent (CHERRY-ui8)
23fa2bb refactor: simplify anchor_type initialization in AnchorBFSPartitioner (CHERRY-ui8)
41456e8 style: pylint problems in AnchorBFSPartitioner (CHERRY-ui8)
610c48d refactor: refactor search and db build scripts for DNA, RNA, and protein (CHERRY-ui8)
8566fb4 refactor: remove unused async_to_sync_method (CHERRY-ui8)
7dc02ff fix: update omics qa generation template to avoid repetition (CHERRY-ui8)
f29ab79 refactor: remove output_dir argument from multi-omics and search scripts (CHERRY-ui8)
7701a62 merge (ChenZiHong-Gavin)
7c1ffeb Merge branch 'feature/multi-omics-qa-clean' of https://github.com/CHE… (ChenZiHong-Gavin)
82666a7 Merge branch 'main' into feature/multi-omics-qa-clean (CHERRY-ui8)
6dcfdf4 merge (ChenZiHong-Gavin)
664d305 Merge branch 'main' of https://github.com/open-sciencelab/GraphGen in… (ChenZiHong-Gavin)
29e4390 Merge branch 'main' of https://github.com/open-sciencelab/GraphGen in… (ChenZiHong-Gavin)
6cc75e8 refactor: remove unused multi_omics_search.py (CHERRY-ui8)
36f17c5 Merge branch 'feature/multi-omics-qa-clean' of https://github.com/CHE… (CHERRY-ui8)
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,216 @@ | ||
| # Multi-omics Knowledge Graph QA Generation | ||
|
|
||
| This example demonstrates how to build knowledge graphs from multi-omics data (DNA, RNA, protein) and generate question-answer pairs using the unified `omics_qa` method. | ||
|
|
||
| ## Pipeline Overview | ||
|
|
||
| The pipeline includes the following steps: | ||
|
|
||
| 1. **read**: Read input files (JSON/JSONL format with sequence queries or protein data) | ||
| 2. **search**: Search biological databases (NCBI for DNA, RNAcentral for RNA, UniProt for protein) - *optional if input already contains search results* | ||
| 3. **chunk**: Chunk sequences and metadata | ||
| 4. **build_kg**: Extract entities and relationships to build knowledge graph | ||
| 5. **partition**: Partition the knowledge graph into communities using anchor-based BFS | ||
| 6. **generate**: Generate QA pairs from partitioned communities with automatic molecule caption extraction | ||
|
|
||
| ## Key Features | ||
|
|
||
| - **Unified QA Generation**: A single `omics_qa` method handles DNA, RNA, and protein data | ||
| - **Automatic Caption Extraction**: Extracts molecule-specific information (dna/rna/protein captions) and attaches it to each QA pair | ||
| - **Flexible Configuration**: Switch between DNA, RNA, and protein by changing the input file and data source | ||
| - **Anchor-based Partitioning**: Uses the molecule type (dna/rna/protein) as the anchor for BFS partitioning | ||
|
|
||
| ## Quick Start | ||
|
|
||
| ### 1. Configure Input Data | ||
|
|
||
| Edit `omics_qa_config.yaml` to set the input file path: | ||
|
|
||
| **For DNA:** | ||
| ```yaml | ||
| input_path: | ||
| - examples/input_examples/search_dna_demo.jsonl | ||
| ``` | ||
|
|
||
| **For RNA:** | ||
| ```yaml | ||
| input_path: | ||
| - examples/input_examples/search_rna_demo.jsonl | ||
| ``` | ||
|
|
||
| **For Protein:** | ||
| ```yaml | ||
| input_path: | ||
| - examples/input_examples/search_protein_demo.jsonl | ||
| ``` | ||
|
|
||
| ### 2. Configure Data Source | ||
|
|
||
| Set the appropriate data source and parameters in the `search_data` node: | ||
|
|
||
| **For DNA (NCBI):** | ||
| ```yaml | ||
| data_sources: [ncbi] | ||
| ncbi_params: | ||
| email: your_email@example.com # Required! | ||
| tool: GraphGen | ||
| use_local_blast: true | ||
| local_blast_db: refseq_release/refseq_release | ||
| blast_num_threads: 2 | ||
| max_concurrent: 5 | ||
| ``` | ||
|
|
||
| **For RNA (RNAcentral):** | ||
| ```yaml | ||
| data_sources: [rnacentral] | ||
| rnacentral_params: | ||
| use_local_blast: true | ||
| local_blast_db: rnacentral_ensembl_gencode_YYYYMMDD/ensembl_gencode_YYYYMMDD | ||
| blast_num_threads: 2 | ||
| max_concurrent: 5 | ||
| ``` | ||
|
|
||
| **For Protein (UniProt):** | ||
| ```yaml | ||
| data_sources: [uniprot] | ||
| uniprot_params: | ||
| use_local_blast: true | ||
| local_blast_db: ${RELEASE}/uniprot_sprot | ||
| blast_num_threads: 2 | ||
| max_concurrent: 5 | ||
| ``` | ||
|
|
||
| ### 3. Configure Anchor Type | ||
|
|
||
| Set the `anchor_type` in the `partition` node to match your molecule type: | ||
|
|
||
| ```yaml | ||
| partition: | ||
| params: | ||
| method: anchor_bfs | ||
| method_params: | ||
| anchor_type: protein # Change to "dna" or "rna" as needed | ||
| max_units_per_community: 10 | ||
| ``` | ||
|
|
||
| ### 4. Run the Pipeline | ||
|
|
||
| ```bash | ||
| ./generate_omics_qa.sh | ||
| ``` | ||
|
|
||
| Or run directly with Python: | ||
|
|
||
| ```bash | ||
| python3 -m graphgen.run \ | ||
| --config_file examples/generate/generate_omics_qa/omics_qa_config.yaml | ||
| ``` | ||
|
|
||
| ## Input Format | ||
|
|
||
| ### For DNA/RNA (JSONL format): | ||
| ```jsonl | ||
| {"type": "text", "content": "BRCA1"} | ||
| {"type": "text", "content": ">query\nATGCGATCG..."} | ||
| {"type": "text", "content": "ATGCGATCG..."} | ||
| ``` | ||
|
|
||
| ### For Protein (JSONL format): | ||
| ```jsonl | ||
| {"type": "text", "content": "P01308"} | ||
| {"type": "text", "content": "insulin"} | ||
| {"type": "text", "content": "MHHHHHHSSGVDLGTENLYFQSNAMDFPQQLEACVKQANQALSRFIAPLPFQNTPVVETMQYGALLGGKRLRPFLVYATGHMFGVSTNTLDAPAAAVECIHAYSLIHDDLPAMDDDDLRRGLPTCHVKFGEANAILAGDALQTLAFSILSDANMPEVSDRDRISMISELASASGIAGMCGGQALDLDAEGKHVPLDALERIHRHKTGALIRAAVRLGALSAGDKGRRALPVLDKYAESIGLAFQVQDDILDVVGDTATLGKRQGADQQLGKSTYPALLGLEQARKKARDLIDDARQALKQLAEQSLDTSALEALADYIIQRNK"} | ||
| ``` | ||
|
|
||
| ## Output Format | ||
|
|
||
| The `omics_qa` method automatically extracts and attaches molecule-specific captions to QA pairs: | ||
|
|
||
| ### Alpaca Format: | ||
| ```json | ||
| { | ||
| "instruction": "What is the function of this protein?", | ||
| "input": "", | ||
| "output": "The protein functions as...", | ||
| "dna": {...}, # DNA caption (if molecule_type is DNA) | ||
| "rna": {...}, # RNA caption (if molecule_type is RNA) | ||
| "protein": {...} # Protein caption (if molecule_type is protein) | ||
| } | ||
| ``` | ||
|
|
||
| ### ChatML Format: | ||
| ```json | ||
| { | ||
| "messages": [ | ||
| { | ||
| "role": "user", | ||
| "content": [ | ||
| { | ||
| "text": "What is the function of this protein?", | ||
| "dna": {...}, | ||
| "rna": {...}, | ||
| "protein": {...} | ||
| } | ||
| ] | ||
| }, | ||
| { | ||
| "role": "assistant", | ||
| "content": "The protein functions as..." | ||
| } | ||
| ] | ||
| } | ||
| ``` | ||
|
|
||
| ## Caption Information | ||
|
|
||
| The generator automatically extracts relevant caption information based on molecule type: | ||
|
|
||
| - **DNA**: gene_name, gene_description, organism, chromosome, genomic_location, function, gene_type, etc. | ||
| - **RNA**: rna_type, description, organism, related_genes, gene_name, so_term, modifications, etc. | ||
| - **Protein**: protein_name, gene_names, organism, function, sequence, entry_name, etc. | ||
|
|
||
| ## Configuration Options | ||
|
|
||
| ### Chunking Parameters | ||
| - `chunk_size`: Size for text metadata chunks (default: 1024) | ||
| - `chunk_overlap`: Overlap for text chunks (default: 100) | ||
| - `sequence_chunk_size`: Size for sequence chunks (default: 1000) | ||
| - `sequence_chunk_overlap`: Overlap for sequence chunks (default: 100) | ||
|
|
||
| ### Partition Parameters | ||
| - `method`: `anchor_bfs` (recommended for omics data) | ||
| - `anchor_type`: `dna`, `rna`, or `protein`, or a list such as `[dna, rna, protein]` for multi-omics runs (must match your data type) | ||
| - `max_units_per_community`: Maximum nodes and edges per community (default: 10) | ||
|
|
||
| ### Generation Parameters | ||
| - `method`: `omics_qa` (unified method for DNA/RNA/Protein) | ||
| - `data_format`: `Alpaca`, `ChatML`, or `Sharegpt` | ||
|
|
||
| ## Notes | ||
|
|
||
| - **NCBI requires an email address** - Make sure to set `email` in `ncbi_params` | ||
| - **Anchor type must match molecule type** - Set `anchor_type` to match your data (dna/rna/protein) | ||
| - **Local BLAST** can be enabled if you have local databases set up (see `examples/search/build_db/`) | ||
| - **Caption extraction** is automatic - The generator detects molecule type and extracts relevant caption information | ||
| - Adjust `max_concurrent` based on your system resources and API rate limits | ||
|
|
||
| ## Examples | ||
|
|
||
| ### Generate QA for Protein Data | ||
| 1. Set `input_path` to `examples/input_examples/search_protein_demo.jsonl` | ||
| 2. Set `data_sources: [uniprot]` | ||
| 3. Set `anchor_type: protein` | ||
| 4. Run `./generate_omics_qa.sh` | ||
|
|
||
| ### Generate QA for DNA Data | ||
| 1. Set `input_path` to `examples/input_examples/search_dna_demo.jsonl` | ||
| 2. Set `data_sources: [ncbi]` | ||
| 3. Set `anchor_type: dna` | ||
| 4. Run `./generate_omics_qa.sh` | ||
|
|
||
| ### Generate QA for RNA Data | ||
| 1. Set `input_path` to `examples/input_examples/search_rna_demo.jsonl` | ||
| 2. Set `data_sources: [rnacentral]` | ||
| 3. Set `anchor_type: rna` | ||
| 4. Run `./generate_omics_qa.sh` |
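
Review note on the partition step documented in the README above: the anchor-based BFS idea can be sketched roughly as follows. This is a minimal illustration, not the `AnchorBFSPartitioner` from this PR. The graph schema (a `nodes` dict with a `type` attribute and an `edges` list of pairs), the function name `anchor_bfs_partition`, and the decision to count only nodes toward the community limit are assumptions made for the example.

```python
from collections import deque


def anchor_bfs_partition(nodes, edges, anchor_type, max_units=10):
    """Grow one community per anchor node by breadth-first search.

    nodes: dict mapping node id -> {"type": str}   (hypothetical schema)
    edges: list of (source_id, target_id) pairs
    anchor_type: a single type string or a list of type strings
    max_units: cap on nodes per community (simplification: edges are not counted)
    """
    anchor_types = {anchor_type} if isinstance(anchor_type, str) else set(anchor_type)

    # Build an undirected adjacency list.
    adjacency = {node_id: [] for node_id in nodes}
    for src, dst in edges:
        adjacency[src].append(dst)
        adjacency[dst].append(src)

    assigned = set()
    communities = []
    for node_id, attrs in nodes.items():
        # Only unassigned nodes of an anchor type start a new community.
        if attrs.get("type") not in anchor_types or node_id in assigned:
            continue
        community = []
        queue = deque([node_id])
        assigned.add(node_id)
        while queue and len(community) < max_units:
            current = queue.popleft()
            community.append(current)
            for neighbor in adjacency[current]:
                if neighbor not in assigned:
                    assigned.add(neighbor)
                    queue.append(neighbor)
        # Nodes that were queued but did not fit stay available for later anchors.
        for leftover in queue:
            assigned.discard(leftover)
        communities.append(community)
    return communities


if __name__ == "__main__":
    demo_nodes = {
        "p1": {"type": "protein"},
        "g1": {"type": "gene"},
        "f1": {"type": "function"},
        "p2": {"type": "protein"},
        "o1": {"type": "organism"},
    }
    demo_edges = [("p1", "g1"), ("p1", "f1"), ("p2", "o1")]
    print(anchor_bfs_partition(demo_nodes, demo_edges, "protein"))
```

Accepting either a string or a list for `anchor_type` mirrors the `anchor_type: [dna, rna, protein]` setting used in the multi-omics config; how the real partitioner handles overlapping communities may differ.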
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,2 @@ | ||
| python3 -m graphgen.run \ | ||
| --config_file examples/generate/generate_omics_qa/omics_qa_config.yaml |
2 changes: 2 additions & 0 deletions
examples/generate/generate_omics_qa/generate_omics_qa_searched.sh
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,2 @@ | ||
| python3 -m graphgen.run \ | ||
| --config_file examples/generate/generate_omics_qa/omics_qa_config_searched.yaml |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,92 @@ | ||
| global_params: | ||
| working_dir: cache | ||
| graph_backend: kuzu # graph database backend; supported: kuzu, networkx | ||
| kv_backend: rocksdb # key-value store backend; supported: rocksdb, json_kv | ||
|
|
||
| nodes: | ||
| - id: read_files | ||
| op_name: read | ||
| type: source | ||
| dependencies: [] | ||
| params: | ||
| input_path: | ||
| # three input files to generate DNA, RNA, and Protein data together | ||
| - examples/input_examples/search_dna_demo.jsonl | ||
| - examples/input_examples/search_rna_demo.jsonl | ||
| - examples/input_examples/search_protein_demo.jsonl | ||
|
|
||
| - id: search_data | ||
| op_name: search | ||
| type: map_batch | ||
| dependencies: | ||
| - read_files | ||
| execution_params: | ||
| replicas: 1 | ||
| batch_size: 10 | ||
| params: | ||
| data_sources: [ncbi, rnacentral, uniprot] # Multi-omics: use all three data sources | ||
| # DNA search parameters | ||
| ncbi_params: | ||
| email: your_email@example.com # Required for NCBI | ||
| tool: GraphGen | ||
| use_local_blast: true | ||
| local_blast_db: path_to_your_local_blast_db/refseq_version/refseq_version | ||
| blast_num_threads: 2 | ||
| max_concurrent: 5 | ||
| # RNA search parameters | ||
| rnacentral_params: | ||
| use_local_blast: true | ||
| local_blast_db: path_to_your_local_blast_db/rnacentral_YYYYMMDD/rnacentral_YYYYMMDD | ||
| blast_num_threads: 2 | ||
| max_concurrent: 5 | ||
| # Protein search parameters | ||
| uniprot_params: | ||
| use_local_blast: true | ||
| local_blast_db: path_to_your_local_blast_db/${RELEASE}/uniprot_sprot | ||
| blast_num_threads: 2 | ||
| max_concurrent: 5 | ||
|
|
||
| - id: chunk_documents | ||
| op_name: chunk | ||
| type: map_batch | ||
| dependencies: | ||
| - search_data | ||
| execution_params: | ||
| replicas: 4 | ||
| params: | ||
| chunk_size: 1024 # chunk size for text splitting | ||
| chunk_overlap: 100 # chunk overlap for text splitting | ||
| sequence_chunk_size: 1000 # For sequence chunks (bp for DNA/RNA, aa for protein) | ||
| sequence_chunk_overlap: 100 | ||
|
|
||
| - id: build_kg | ||
| op_name: build_kg | ||
| type: map_batch | ||
| dependencies: | ||
| - chunk_documents | ||
| execution_params: | ||
| replicas: 1 | ||
| batch_size: 128 | ||
|
|
||
| - id: partition | ||
| op_name: partition | ||
| type: aggregate | ||
| dependencies: | ||
| - build_kg | ||
| params: | ||
| method: anchor_bfs # partition method | ||
| method_params: | ||
| anchor_type: [dna, rna, protein] # Multi-omics: supports multiple anchor types (a list or a single string) | ||
| max_units_per_community: 10 # max nodes and edges per community | ||
|
|
||
| - id: generate | ||
| op_name: generate | ||
| type: map_batch | ||
| dependencies: | ||
| - partition | ||
| execution_params: | ||
| replicas: 1 | ||
| batch_size: 128 | ||
| params: | ||
| method: omics_qa # unified QA generation method for DNA/RNA/Protein | ||
| data_format: ChatML # Alpaca, Sharegpt, ChatML |
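
Review note on the node graph in the config above: each node lists its upstream `dependencies`, so the execution order can be sanity-checked with a topological sort. The sketch below uses Python's standard `graphlib` and a hand-copied list of the node ids and dependencies from this config. The helper name `execution_order` is hypothetical, and this is not a description of how GraphGen's engine actually schedules nodes.

```python
from graphlib import TopologicalSorter


def execution_order(nodes):
    """Return node ids in an order that respects every node's dependencies."""
    graph = {node["id"]: set(node.get("dependencies", [])) for node in nodes}
    known = set(graph)
    for node_id, deps in graph.items():
        missing = deps - known
        if missing:
            raise ValueError(f"{node_id} depends on unknown nodes: {sorted(missing)}")
    # TopologicalSorter expects a mapping of node -> predecessors and
    # raises graphlib.CycleError if the dependencies form a cycle.
    return list(TopologicalSorter(graph).static_order())


# Node ids and dependencies copied from the config above.
PIPELINE_NODES = [
    {"id": "read_files", "dependencies": []},
    {"id": "search_data", "dependencies": ["read_files"]},
    {"id": "chunk_documents", "dependencies": ["search_data"]},
    {"id": "build_kg", "dependencies": ["chunk_documents"]},
    {"id": "partition", "dependencies": ["build_kg"]},
    {"id": "generate", "dependencies": ["partition"]},
]

if __name__ == "__main__":
    print(" -> ".join(execution_order(PIPELINE_NODES)))
```

Because the config is a linear chain, only one valid order exists, and it matches the read, search, chunk, build_kg, partition, generate sequence described in the README.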
73 changes: 73 additions & 0 deletions
examples/generate/generate_omics_qa/omics_qa_config_searched.yaml
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,73 @@ | ||
| global_params: | ||
| working_dir: cache | ||
| graph_backend: kuzu # graph database backend; supported: kuzu, networkx | ||
| kv_backend: rocksdb # key-value store backend; supported: rocksdb, json_kv | ||
|
|
||
| nodes: | ||
| - id: read_files | ||
| op_name: read | ||
| type: source | ||
| dependencies: [] | ||
| params: | ||
| input_path: | ||
| # Use pre-searched data files (skip search step) | ||
| # The search_service will automatically detect and skip search if data already contains search results | ||
| - examples/input_examples/searched_dna_demo.jsonl | ||
| - examples/input_examples/searched_rna_demo.jsonl | ||
| - examples/input_examples/searched_protein_demo.jsonl | ||
|
|
||
| - id: search_data | ||
| op_name: search | ||
| type: map_batch | ||
| dependencies: | ||
| - read_files | ||
| execution_params: | ||
| replicas: 1 | ||
| batch_size: 10 | ||
| # Note: search_service will automatically detect pre-searched data and skip search, | ||
| # but it will still normalize the data format (ensure _doc_id, content, data_source fields exist) | ||
|
|
||
| - id: chunk_documents | ||
| op_name: chunk | ||
| type: map_batch | ||
| dependencies: | ||
| - search_data | ||
| execution_params: | ||
| replicas: 4 | ||
| params: | ||
| chunk_size: 1024 # chunk size for text splitting | ||
| chunk_overlap: 100 # chunk overlap for text splitting | ||
| sequence_chunk_size: 1000 # For sequence chunks (bp for DNA/RNA, aa for protein) | ||
| sequence_chunk_overlap: 100 | ||
|
|
||
| - id: build_kg | ||
| op_name: build_kg | ||
| type: map_batch | ||
| dependencies: | ||
| - chunk_documents | ||
| execution_params: | ||
| replicas: 1 | ||
| batch_size: 128 | ||
|
|
||
| - id: partition | ||
| op_name: partition | ||
| type: aggregate | ||
| dependencies: | ||
| - build_kg | ||
| params: | ||
| method: anchor_bfs # partition method | ||
| method_params: | ||
| anchor_type: [dna, rna, protein] # Multi-omics: supports multiple anchor types (a list or a single string) | ||
| max_units_per_community: 10 # max nodes and edges per community | ||
|
|
||
| - id: generate | ||
| op_name: generate | ||
| type: map_batch | ||
| dependencies: | ||
| - partition | ||
| execution_params: | ||
| replicas: 1 | ||
| batch_size: 128 | ||
| params: | ||
| method: omics_qa # unified QA generation method for DNA/RNA/Protein | ||
| data_format: ChatML # Alpaca, Sharegpt, ChatML |
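
Review note on `sequence_chunk_size` and `sequence_chunk_overlap` in the `chunk_documents` node: one plausible reading of these two parameters is a sliding window over the sequence, where each chunk starts `chunk_size - chunk_overlap` positions after the previous one. The sketch below illustrates that reading only. The function name `chunk_sequence` is hypothetical, and the chunker in this PR may handle boundaries differently.

```python
def chunk_sequence(sequence, chunk_size=1000, chunk_overlap=100):
    """Split a sequence into fixed-size windows that overlap by chunk_overlap.

    Units follow the input: base pairs for DNA/RNA, amino acids for protein.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(sequence), step):
        chunks.append(sequence[start:start + chunk_size])
        # Stop once the current window has reached the end of the sequence.
        if start + chunk_size >= len(sequence):
            break
    return chunks


if __name__ == "__main__":
    demo = "ACGT" * 625  # 2500 bases
    for index, chunk in enumerate(chunk_sequence(demo)):
        print(index, len(chunk))
```

With the defaults from the config, consecutive chunks share 100 positions, so a motif that straddles a chunk boundary still appears whole in at least one chunk as long as it is no longer than the overlap.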