feat: multi-omics KG building #122

CHERRY-ui8 · 2025-12-18T05:09:05Z

This pull request introduces a robust and unified pipeline for processing multi-omics data, enabling the construction of specialized knowledge graphs and the generation of high-quality, context-rich question-answer pairs for DNA, RNA, and protein sequences. The changes significantly enhance the platform's capabilities in bioinformatics by integrating diverse biological data sources, optimizing search and data handling, and streamlining the creation of domain-specific QA datasets.

gemini-code-assist · 2025-12-18T05:11:22Z

Summary of Changes

Hello @CHERRY-ui8, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

Highlights

Unified Multi-omics Pipeline: Introduced a comprehensive pipeline for building Knowledge Graphs and generating Question-Answering (QA) pairs from multi-omics data (DNA, RNA, Protein) using a unified omics_qa method.
Enhanced Data Search and Local BLAST Integration: Refactored NCBI, RNAcentral, and UniProt search services to prioritize local BLAST databases, improve concurrency with blast_num_threads and max_concurrent parameters, and include robust retry mechanisms for API calls. Local BLAST database build scripts were also updated for more flexible download options and better resume/merge capabilities.
Molecule-Specific Captioning and Partitioning: The new omics_qa generator automatically extracts and attaches molecule-specific caption information (DNA, RNA, Protein) to generated QA pairs. The AnchorBFSPartitioner was extended to support multiple anchor types, allowing for more precise community partitioning based on DNA, RNA, or Protein entities.
Improved Data Handling and Ray Integration: Added a dedicated JSONLReader with streaming capabilities and enhanced the JSONReader for memory efficiency. Ray Data integration was improved by reducing verbose output and suppressing non-critical warnings during execution.
New KG Builder and Splitter for Omics Data: Introduced OmicsKGBuilder for extracting entities and relationships from biological sequence chunks and metadata, alongside a SequenceSplitter designed for efficient chunking of DNA, RNA, and protein sequences.
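
As a rough illustration of the multi-anchor partitioning mentioned in the highlights, the sketch below grows one community per anchor node by breadth-first search, where a node may seed a community only if its type is in a set of accepted anchor types. This is not the code from this PR: the function name `anchor_bfs_partition`, the plain-dict graph representation, the type strings, and the `max_size` cap are all assumptions made for the example.

```python
from collections import deque


def anchor_bfs_partition(adjacency, node_types, anchor_types, max_size=3):
    """Grow one community per anchor node via BFS (hypothetical helper).

    adjacency:    dict mapping node id -> list of neighbour ids
    node_types:   dict mapping node id -> type string, e.g. "dna" or "protein"
    anchor_types: set of type strings that are allowed to seed a community
    max_size:     assumed cap on the number of nodes per community
    """
    assigned = set()   # nodes that already belong to some community
    communities = []
    # Iterate in sorted order so the result is deterministic.
    for seed in sorted(adjacency):
        if node_types.get(seed) not in anchor_types or seed in assigned:
            continue
        community = []
        queue = deque([seed])
        seen = {seed}  # nodes queued during this BFS only
        while queue and len(community) < max_size:
            node = queue.popleft()
            community.append(node)
            assigned.add(node)
            for neighbour in adjacency.get(node, []):
                if neighbour not in assigned and neighbour not in seen:
                    seen.add(neighbour)
                    queue.append(neighbour)
        communities.append(community)
    return communities


# Toy graph with made-up node ids and types.
adjacency = {
    "gene1": ["p1", "r1"],
    "p1": ["gene1", "go1"],
    "r1": ["gene1"],
    "go1": ["p1"],
    "p2": ["go2"],
    "go2": ["p2"],
}
node_types = {
    "gene1": "dna", "p1": "protein", "r1": "rna",
    "go1": "term", "p2": "protein", "go2": "term",
}
communities = anchor_bfs_partition(adjacency, node_types, {"dna", "protein"})
# communities == [["gene1", "p1", "r1"], ["p2", "go2"]]
```

In this sketch an anchor that has already been absorbed into an earlier community (here `p1`) does not seed a second one, and nodes cut off by `max_size` (here `go1`) stay unassigned. Whether the real partitioner behaves the same way is not stated in the summary.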

¹ Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

gemini-code-assist

Code Review

This pull request introduces a new multi-omics knowledge graph QA generation feature, enabling the processing of DNA, RNA, and protein data. Key changes include adding a new omics_qa generation method, an OmicsKGBuilder for extracting entities and relationships from multi-omics data, and an OmicsQAGenerator for creating QA pairs with molecule-specific captions. The AnchorBFSPartitioner was updated to support multiple anchor types (DNA, RNA, protein), and the BaseReader now supports dna and rna types. The NCBISearch, RNACentralSearch, and UniProtSearch modules were enhanced with improved local BLAST integration, including multi-threading, timeout handling, and better logging, along with new parameters for blast_num_threads and max_concurrent. The JSONReader was refactored to use a new JSONLReader for stream processing, reducing memory usage. Additionally, new example configurations and shell scripts for multi-omics QA generation and local BLAST database building were added, and .gitignore was updated to reflect new cache and database directories. Many files across the codebase show unresolved merge conflicts, indicated by <<<<<<<, =======, and >>>>>>> markers, which need to be addressed.
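
The memory benefit of line-by-line JSONL reading noted in the review can be sketched as follows. This is only an illustration of the general streaming technique, not the `JSONLReader` added in the PR: the function name `iter_jsonl` and the record fields used in the usage note are assumptions.

```python
import json
from typing import Iterator


def iter_jsonl(path: str) -> Iterator[dict]:
    """Yield one parsed record per non-empty line (hypothetical helper).

    Only the current line is held in memory, so the file size does not
    bound memory use the way json.load() on a whole document would.
    """
    with open(path, "r", encoding="utf-8") as handle:
        for line_number, line in enumerate(handle, start=1):
            line = line.strip()
            if not line:
                continue  # tolerate blank lines between records
            try:
                yield json.loads(line)
            except json.JSONDecodeError as exc:
                # Report where the bad record is instead of failing opaquely.
                raise ValueError(f"{path}:{line_number}: invalid JSON") from exc
```

A caller would consume it lazily, for example `for record in iter_jsonl("input.jsonl"): ...`, where each record might look like `{"type": "dna", "content": "ACGT"}`. That field layout is assumed here and may differ from the project's actual input schema.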

graphgen/models/reader/json_reader.py

graphgen/models/llm/local/sglang_wrapper.py

examples/search/build_db/build_rna_blast_db.sh

examples/search/search_dna.sh

examples/search/search_dna_config.yaml

graphgen/templates/generation/__init__.py

graphgen/templates/kg/__init__.py

graphgen/templates/kg/mm_kg_extraction.py

graphgen/utils/run_concurrent.py

graphgen/models/reader/jsonl_reader.py

graphgen/models/reader/json_reader.py

ChenZiHong-Gavin · 2025-12-19T02:55:18Z

examples/generate/generate_omics_qa/generate_omics_qa.sh

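This could follow the other examples: put each type of search in its own folder and add a README describing the steps, for example building the local BLAST database.

Graph building can use either the metadata from before the search or the results from after the search. Do these two cases need to be split into separate folders, for example dna_search/dna_searched?

Hmm, also, if the generate-qa scripts for DNA/RNA/protein are split up, how can multiple data sources be fed in at the same time? Shouldn't multi-omics-qa need to include DNA, RNA, and protein data together?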

graphgen/operators/chunk/chunk_service.py

ChenZiHong-Gavin · 2025-12-19T03:34:01Z

graphgen/utils/wrap.py

async_to_sync_method is no longer used anywhere and can be removed.

In that case, does the wrap file still need to be kept?

graphgen/models/partitioner/anchor_bfs_partitioner.py

ChenZiHong-Gavin · 2025-12-19T04:11:05Z

graphgen/templates/generation/omics_qa_generation.py

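The format template here needs to be changed, because the template may end up repeated:

Do you mean it may include empty molecule-information fields (for example, if there is only protein, the QA should not include the DNA/RNA fields)?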
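
One way to read the point raised in this thread is that the caption text should be assembled only from molecule fields that actually carry data. The sketch below shows that idea under stated assumptions: the function name `format_molecule_captions`, the `dna`/`rna`/`protein` dictionary keys, and the output layout are hypothetical and are not taken from omics_qa_generation.py.

```python
def format_molecule_captions(captions):
    """Render only the molecule sections that carry data (hypothetical helper).

    captions: dict with optional "dna", "rna", and "protein" keys (assumed layout).
    Returns one "Label: value" line per non-empty field, in a fixed order.
    """
    labels = {"dna": "DNA", "rna": "RNA", "protein": "Protein"}
    sections = []
    for key, label in labels.items():
        value = captions.get(key)
        if value:  # skips missing keys, None, and empty strings
            sections.append(f"{label}: {value}")
    return "\n".join(sections)
```

With input such as `{"protein": "example protein caption", "dna": "", "rna": None}`, only the protein line is produced, so a protein-only sample does not get empty DNA/RNA fields in its QA pair.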

gemini-code-assist bot reviewed Dec 18, 2025

CHERRY-ui8 added 2 commits December 18, 2025 14:49

feat: multi omics KG building

fc18746

fix: remove remaining conflict markers

ce2b296

CHERRY-ui8 force-pushed the feature/multi-omics-qa-clean branch from 4c6d68b to ce2b296 on December 18, 2025 06:51

CHERRY-ui8 added 6 commits December 18, 2025 18:46

fix: restore files accidentally modified

6d0868a

fix: update local blast database paths in omics qa config

b6e65c0

fix: fix pylint problems

29ab42f

fix: fix pylint problems agaaaain

8b908b8

fix: fix pylint problems again

4173095

chore: remove protein KG extraction template

6466e27

ChenZiHong-Gavin reviewed Dec 18, 2025

graphgen/models/reader/json_reader.py (outdated)

CHERRY-ui8 added 2 commits December 19, 2025 01:12

refactor: remove unused read_stream method from JSONReader

82665bf

refactor: remove repeated image_exists method from JSONReader

2a27f69

ChenZiHong-Gavin reviewed Dec 19, 2025

graphgen/operators/chunk/chunk_service.py

ChenZiHong-Gavin reviewed Dec 19, 2025

graphgen/models/partitioner/anchor_bfs_partitioner.py

refactor: clean up logging in Engine

91abfde

ChenZiHong-Gavin reviewed Dec 19, 2025

CHERRY-ui8 added 7 commits December 19, 2025 14:00

refactor: enhance initialization of services with configurable backends

f2ee12f

refactor: remove unused progress bar from run_concurrent

b21f4ce

refactor: simplify anchor_type initialization in AnchorBFSPartitioner

23fa2bb

style: pylint problems in AnchorBFSPartitioner

41456e8

refactor: refactor search and db build scripts for DNA, RNA, and protein

610c48d

refactor: remove unused async_to_sync_method

8566fb4

fix: update omics qa generation template to avoid repetition

7dc02ff

CHERRY-ui8 changed the title from "multi-omics KG building" to "feat: multi-omics KG building" on Dec 19, 2025

CHERRY-ui8 commented Dec 18, 2025 • edited by ChenZiHong-Gavin