feat: multi-omics KG building #122
base: main
Summary of Changes
Hello @CHERRY-ui8, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request introduces a robust and unified pipeline for processing multi-omics data, enabling the construction of specialized knowledge graphs and the generation of high-quality, context-rich question-answer pairs for DNA, RNA, and protein sequences. The changes significantly enhance the platform's capabilities in bioinformatics by integrating diverse biological data sources, optimizing search and data handling, and streamlining the creation of domain-specific QA datasets.
Code Review
This pull request introduces a new multi-omics knowledge graph QA generation feature, enabling the processing of DNA, RNA, and protein data. Key changes include adding a new omics_qa generation method, an OmicsKGBuilder for extracting entities and relationships from multi-omics data, and an OmicsQAGenerator for creating QA pairs with molecule-specific captions. The AnchorBFSPartitioner was updated to support multiple anchor types (DNA, RNA, protein), and the BaseReader now supports dna and rna types. The NCBISearch, RNACentralSearch, and UniProtSearch modules were enhanced with improved local BLAST integration, including multi-threading, timeout handling, and better logging, along with new parameters for blast_num_threads and max_concurrent. The JSONReader was refactored to use a new JSONLReader for stream processing, reducing memory usage. Additionally, new example configurations and shell scripts for multi-omics QA generation and local BLAST database building were added, and .gitignore was updated to reflect new cache and database directories. Many files across the codebase show unresolved merge conflicts, indicated by <<<<<<<, =======, and >>>>>>> markers, which need to be addressed.
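For the unresolved merge conflicts mentioned in the review, a small scan can list the offending lines before the next push. This is only a sketch: the helper names `find_conflict_markers` and `scan_tree` and the default suffix list are assumptions for illustration, not part of this repository.

```python
import re
from pathlib import Path

# A conflict marker is exactly seven '<', '=' or '>' characters at the start
# of a line, followed by whitespace or the end of the line.
CONFLICT_MARKER = re.compile(r"^(<{7}|={7}|>{7})(\s|$)")


def find_conflict_markers(text):
    """Return (line_number, line) pairs for lines that look like conflict markers."""
    return [
        (number, line)
        for number, line in enumerate(text.splitlines(), start=1)
        if CONFLICT_MARKER.match(line)
    ]


def scan_tree(root, suffixes=(".py", ".yaml", ".sh", ".md")):
    """Map each matching file under root to its list of marker hits (hypothetical helper)."""
    hits = {}
    for path in Path(root).rglob("*"):
        if path.is_file() and path.suffix in suffixes:
            text = path.read_text(encoding="utf-8", errors="replace")
            found = find_conflict_markers(text)
            if found:
                hits[str(path)] = found
    return hits
```

A line of exactly seven `=` characters can also be a legitimate heading underline in a text file, so hits in documentation should be checked by hand.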
4c6d68b to ce2b296
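This could follow the other examples: split each type of search into its own folder and add a README that explains the steps, for example how to build the local BLAST database.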
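Building the graph from the metadata before search versus from the results after search: do these two cases need to be split into different folders? For example, dna_search/dna_searched.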
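Hmm, and if the generate-qa scripts for DNA/RNA/protein are separated, how can multiple data sources be supplied at the same time? Shouldn't multi-omics-qa need to include DNA, RNA, and protein data together?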
graphgen/utils/wrap.py
Outdated
async_to_sync_method is no longer used anywhere, so it can be deleted.
Then does the wrap file still need to be kept?
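Do you mean the output might contain empty molecule-information fields? (For example, if there is only protein data, the QA should not include DNA/RNA fields.)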
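If the intent is indeed to omit empty molecule fields, one possible shape for that filter is sketched below. The function name `drop_empty_molecule_fields` and the field names `dna`, `rna`, and `protein` are assumptions for illustration; the actual QA record layout in this PR may differ.

```python
def drop_empty_molecule_fields(record, molecule_keys=("dna", "rna", "protein")):
    """Return a copy of record without molecule fields that are None or empty.

    The key names are hypothetical; adjust them to the real QA schema.
    """
    cleaned = {}
    for key, value in record.items():
        is_empty = value is None or (
            isinstance(value, (str, dict, list, tuple)) and len(value) == 0
        )
        if key in molecule_keys and is_empty:
            continue  # skip molecule types that carry no information
        cleaned[key] = value
    return cleaned


# Usage: a protein-only record keeps just the protein field.
qa = {"question": "q", "answer": "a", "dna": {}, "rna": None, "protein": {"id": "P12345"}}
protein_only = drop_empty_molecule_fields(qa)
```

Only the molecule keys are filtered, so an empty `answer` or other non-molecule field would pass through unchanged and remain visible for debugging.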
