@CHERRY-ui8 commented Dec 18, 2025

This pull request introduces a robust and unified pipeline for processing multi-omics data, enabling the construction of specialized knowledge graphs and the generation of high-quality, context-rich question-answer pairs for DNA, RNA, and protein sequences. The changes significantly enhance the platform's capabilities in bioinformatics by integrating diverse biological data sources, optimizing search and data handling, and streamlining the creation of domain-specific QA datasets.

@gemini-code-assist
Contributor

Summary of Changes

Hello @CHERRY-ui8, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

Highlights

  • Unified Multi-omics Pipeline: Introduced a comprehensive pipeline for building Knowledge Graphs and generating Question-Answering (QA) pairs from multi-omics data (DNA, RNA, Protein) using a unified omics_qa method.
  • Enhanced Data Search and Local BLAST Integration: Refactored NCBI, RNAcentral, and UniProt search services to prioritize local BLAST databases, improve concurrency with blast_num_threads and max_concurrent parameters, and include robust retry mechanisms for API calls. Local BLAST database build scripts were also updated for more flexible download options and better resume/merge capabilities.
  • Molecule-Specific Captioning and Partitioning: The new omics_qa generator automatically extracts and attaches molecule-specific caption information (DNA, RNA, Protein) to generated QA pairs. The AnchorBFSPartitioner was extended to support multiple anchor types, allowing for more precise community partitioning based on DNA, RNA, or Protein entities.
  • Improved Data Handling and Ray Integration: Added a dedicated JSONLReader with streaming capabilities and enhanced the JSONReader for memory efficiency. Ray Data integration was improved by reducing verbose output and suppressing non-critical warnings during execution.
  • New KG Builder and Splitter for Omics Data: Introduced OmicsKGBuilder for extracting entities and relationships from biological sequence chunks and metadata, alongside a SequenceSplitter designed for efficient chunking of DNA, RNA, and protein sequences.
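
The multi-anchor partitioning described in the highlights can be illustrated with a minimal sketch. This is not the PR's AnchorBFSPartitioner; the function name, the `max_size` cap, and the node and edge representation are assumptions made for illustration. It only shows the general idea of growing communities by breadth-first search from nodes whose type is in a set of anchor types.

```python
from collections import deque


def anchor_bfs_partition(nodes, edges, anchor_types, max_size=4):
    """Group nodes into communities grown by BFS from anchor nodes.

    nodes: dict mapping node id -> entity type (e.g. "dna", "protein")
    edges: iterable of undirected (u, v) pairs
    anchor_types: set of entity types allowed to seed a community
    max_size: hypothetical cap on the number of nodes per community
    """
    adjacency = {node: [] for node in nodes}
    for u, v in edges:
        adjacency[u].append(v)
        adjacency[v].append(u)

    assigned = set()  # nodes that already belong to a community
    communities = []
    for seed, node_type in nodes.items():
        if node_type not in anchor_types or seed in assigned:
            continue
        community = []
        queue = deque([seed])
        queued = {seed}
        while queue and len(community) < max_size:
            current = queue.popleft()
            community.append(current)
            assigned.add(current)  # assign only when the node is actually taken
            for neighbor in adjacency[current]:
                if neighbor not in assigned and neighbor not in queued:
                    queued.add(neighbor)
                    queue.append(neighbor)
        communities.append(community)
    return communities
```

In this sketch a community may absorb other anchor nodes it reaches before they get a chance to seed their own community, and non-anchor nodes that no anchor reaches stay unassigned. Whether the real partitioner behaves this way is not stated in the summary.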

@gemini-code-assist bot left a comment

Code Review

This pull request introduces a new multi-omics knowledge graph QA generation feature, enabling the processing of DNA, RNA, and protein data. Key changes include adding a new omics_qa generation method, an OmicsKGBuilder for extracting entities and relationships from multi-omics data, and an OmicsQAGenerator for creating QA pairs with molecule-specific captions. The AnchorBFSPartitioner was updated to support multiple anchor types (DNA, RNA, protein), and the BaseReader now supports dna and rna types. The NCBISearch, RNACentralSearch, and UniProtSearch modules were enhanced with improved local BLAST integration, including multi-threading, timeout handling, and better logging, along with new parameters for blast_num_threads and max_concurrent. The JSONReader was refactored to use a new JSONLReader for stream processing, reducing memory usage. Additionally, new example configurations and shell scripts for multi-omics QA generation and local BLAST database building were added, and .gitignore was updated to reflect new cache and database directories. Many files across the codebase show unresolved merge conflicts, indicated by <<<<<<<, =======, and >>>>>>> markers, which need to be addressed.
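
To locate the unresolved conflict markers mentioned at the end of the review, a small scan along the following lines may help. It is a heuristic sketch, not part of the PR: the function names and the `*.py` default pattern are assumptions, and a line consisting of exactly seven `=` characters (for example a heading underline in a text file) would be reported as a false positive.

```python
import re
from pathlib import Path

# A marker is exactly seven '<', '=' or '>' characters at the start of a line,
# followed by a space (as in "<<<<<<< HEAD") or by the end of the line.
CONFLICT_MARKER = re.compile(r"^(<{7}|={7}|>{7})( |$)")


def find_conflict_markers(text):
    """Return (line_number, line) pairs for lines that look like conflict markers."""
    hits = []
    for number, line in enumerate(text.splitlines(), start=1):
        if CONFLICT_MARKER.match(line):
            hits.append((number, line))
    return hits


def scan_tree(root, pattern="*.py"):
    """Map each file under root that matches pattern to its marker hits."""
    report = {}
    for path in Path(root).rglob(pattern):
        text = path.read_text(encoding="utf-8", errors="replace")
        hits = find_conflict_markers(text)
        if hits:
            report[str(path)] = hits
    return report
```

Calling `scan_tree(".")` from the repository root would list each affected Python file with the line numbers of its markers. Running `git diff --check` is another way to surface leftover markers.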

@CHERRY-ui8 force-pushed the feature/multi-omics-qa-clean branch from 4c6d68b to ce2b296 on December 18, 2025 06:51
Collaborator

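This could follow the other examples: put each type of search in its own folder and add a README that explains the steps, such as building the local BLAST database.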

Contributor Author

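There are two cases for graph building: using the metadata available before search, and using the results returned after search. Do these need to be split into separate folders, for example dna_search/dna_searched?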

Contributor Author

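Hmm, and if the generate-qa scripts for DNA/RNA/protein are split up, how would multiple data sources be fed in at the same time? Shouldn't multi-omics-qa include DNA, RNA, and protein data together?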

Collaborator

async_to_sync_method is no longer used anywhere, so it can be removed.

Contributor Author

Then does the wrap file still need to be kept?

Collaborator

The format template here needs to be changed, because the template may end up duplicated:
[image]

Contributor Author

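Do you mean that the output may contain empty molecule-information fields? (For example, if there is only protein data, the QA should not include the DNA and RNA fields.)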
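
If the concern is indeed empty molecule fields, one possible approach is to attach a caption only when it has content. The sketch below is hypothetical: the field names `dna`, `rna`, and `protein`, the record layout, and the function name are assumptions, not the PR's OmicsQAGenerator output format.

```python
MOLECULE_FIELDS = ("dna", "rna", "protein")  # hypothetical field names


def build_qa_record(question, answer, captions):
    """Build a QA record that carries only the non-empty molecule captions."""
    record = {"question": question, "answer": answer}
    for field in MOLECULE_FIELDS:
        value = captions.get(field)
        if value:  # skips missing keys, None, "" and empty containers
            record[field] = value
    return record
```

With this shape, a protein-only input yields a record with `question`, `answer`, and `protein` keys and no `dna` or `rna` keys at all, rather than keys that hold empty values.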

@CHERRY-ui8 changed the title from "multi-omics KG building" to "feat: multi-omics KG building" on Dec 19, 2025