
Minor updates and bug fixes #24

Open
Sreehari05055 wants to merge 2 commits into main from dev

Conversation

@Sreehari05055
Owner

This pull request introduces improvements to the research tools and service logic for handling academic papers, clarifying tool usage, enforcing input validation, and optimizing paper indexing. The changes ensure clearer guidance for tool selection, stricter input contracts, and more efficient handling of downloaded papers.

Tool usage and guidance improvements:

  • Clarified the docstring for the GetInfo tool to recommend its use for follow-up questions about already downloaded and indexed papers, and to avoid unnecessary calls to DeepScholarResearchAndHighlight unless new papers need to be searched or downloaded.
  • Updated the docstring for DeepScholarResearchAndHighlight to explicitly instruct users to use this tool only when searching for new papers or downloading/indexing papers not already available, and to use GetInfo for follow-up questions about previously indexed papers.
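The docstring guidance described above might look like the following sketch. This is an illustration only: apart from the tool names GetInfo and DeepScholarResearchAndHighlight, the field names (`query`, `search_query`) are assumptions, not the repository's actual schema.

```python
from pydantic import BaseModel, Field


class GetInfo(BaseModel):
    """Answer follow-up questions about papers that have already been
    downloaded and indexed. Prefer this tool over
    DeepScholarResearchAndHighlight unless new papers need to be
    searched for or downloaded."""

    # Hypothetical field name for illustration.
    query: str = Field(description="Follow-up question about indexed papers.")


class DeepScholarResearchAndHighlight(BaseModel):
    """Search for new papers, or download and index papers that are not
    already available. For follow-up questions about previously indexed
    papers, use GetInfo instead."""

    # Hypothetical field name for illustration.
    search_query: str = Field(default="", description="Query for new papers.")
```

With these docstrings, a tool-calling agent sees the routing advice directly in the tool schema, which is what steers it away from redundant re-download calls.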

Input validation and contract enforcement:

  • Added a model_validator to DeepScholarResearchAndHighlight to enforce that either a non-empty search_query or a non-empty paper_metadata list must be provided, and that paper_metadata must be non-empty if given.
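A minimal sketch of such a validator, assuming a pydantic v2 model with `search_query` and `paper_metadata` fields (the exact field types in the repository may differ):

```python
from typing import List, Optional

from pydantic import BaseModel, Field, model_validator


class DeepScholarResearchAndHighlight(BaseModel):
    # Field names mirror the PR description; types are assumptions.
    search_query: Optional[str] = Field(
        default=None, description="Query for finding new papers."
    )
    paper_metadata: Optional[List[dict]] = Field(
        default=None, description="Metadata of papers to download and index."
    )

    @model_validator(mode="after")
    def check_inputs(self):
        # paper_metadata, if supplied, must not be an empty list.
        if self.paper_metadata is not None and len(self.paper_metadata) == 0:
            raise ValueError("paper_metadata must be non-empty if given")
        # At least one of the two inputs must carry real content.
        has_query = bool(self.search_query and self.search_query.strip())
        if not has_query and not self.paper_metadata:
            raise ValueError(
                "Provide a non-empty search_query or a non-empty paper_metadata list"
            )
        return self
```

Running the validator in `mode="after"` means all fields are already parsed and typed before the cross-field check fires.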

Paper download and indexing optimizations:

  • Modified the download and indexing logic in deep_research_service.py to track newly downloaded files separately (new_downloaded_files), and only rebuild the RAG index if new papers were downloaded, reusing the existing index otherwise. [1] [2] [3] [4]
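The tracking logic can be sketched as follows. This is a simplified stand-in, not the actual `deep_research_service.py` code: `fetch_pdf` is a hypothetical callable for the real download step, and `state` is a plain dict standing in for `ResearchState`.

```python
import os
from typing import Callable, Dict, List


def download_papers(
    state: Dict,
    papers: List[Dict],
    fetch_pdf: Callable[[Dict, str], None],
    download_dir: str = "papers",
) -> Dict:
    """Download papers, tracking newly fetched files separately from all
    files present on disk."""
    os.makedirs(download_dir, exist_ok=True)
    downloaded_files: List[str] = []
    new_downloaded_files: List[str] = []
    for paper in papers:
        path = os.path.join(download_dir, f"{paper['id']}.pdf")
        if not os.path.exists(path):
            fetch_pdf(paper, path)  # only fetch files missing from disk
            new_downloaded_files.append(path)
        downloaded_files.append(path)
    state["downloaded_files"] = downloaded_files
    state["new_downloaded_files"] = new_downloaded_files
    return state


def should_rebuild_index(state: Dict) -> bool:
    # Rebuild the RAG index only when this run fetched something new;
    # otherwise the existing index already covers every paper on disk.
    return bool(state["new_downloaded_files"])
```

On a second run over the same papers, `new_downloaded_files` stays empty and the index rebuild is skipped, which is the optimization this PR introduces.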

Minor improvements and consistency fixes:

  • Updated the PaperMetadata model to use default=None for optional fields, improving clarity and consistency.
  • Standardized import ordering and variable naming in base_research.py and deep_research_service.py. [1] [2]

Sreehari05055 marked this pull request as ready for review March 7, 2026 08:42
Copilot AI review requested due to automatic review settings March 7, 2026 08:42

Copilot AI left a comment


Pull request overview

This PR improves the academic paper research pipeline by optimizing the download/index workflow, adding input validation, and clarifying tool docstrings. The key optimization avoids rebuilding the RAG index when papers already exist on disk from prior runs. The PR also adds a model_validator to DeepScholarResearchAndHighlight to enforce that valid inputs are provided, and standardizes parameter naming and defaults across the base research service.

Changes:

  • Added new_downloaded_files state tracking to skip redundant RAG index rebuilds when all requested papers are already on disk.
  • Added a model_validator on DeepScholarResearchAndHighlight to enforce that either search_query or paper_metadata is provided.
  • Standardized BaseResearchService.semantic_scholar_search parameter names (is_oa instead of is_open_access, count=10) and added default=None to optional PaperMetadata fields.
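The renamed signature might look like this sketch. Only the parameter names `is_oa` and `count=10` come from the review summary; the `is_oa` default, the return type, and the method body are assumptions.

```python
from typing import Dict, List


class BaseResearchService:
    def semantic_scholar_search(
        self, query: str, is_oa: bool = True, count: int = 10
    ) -> List[Dict]:
        """Search Semantic Scholar for papers.

        is_oa (formerly is_open_access) filters to open-access results;
        count defaults to 10 to match OpenAlexResearchService.
        """
        raise NotImplementedError  # real implementation lives in the service
```

Aligning parameter names and defaults across the Semantic Scholar and OpenAlex services lets callers swap backends without changing call sites.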

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

  • app/services/scholar_research/deep_research_service.py — Tracks newly downloaded files separately to skip index rebuild when no new papers are downloaded; adds new_downloaded_files to ResearchState and initial state.
  • app/services/scholar_research/base_research.py — Renames is_open_access to is_oa and changes the count default from 25 to 10 for consistency with OpenAlexResearchService.
  • app/services/langchain_handler/tool_definitions.py — Adds guidance docstrings for GetInfo and DeepScholarResearchAndHighlight, adds a model_validator for input validation, and sets explicit default=None on optional PaperMetadata fields.


Comment on lines +73 to +76
pdf_url: Optional[str] = Field(default=None,description="Direct PDF download URL. Used as fallback if the Content API fails.")
authors: Optional[List[str]] = Field(default=None,description="List of author names.")
publication_year: Optional[int] = Field(default=None,description="Year of publication.")
doi: Optional[str] = Field(default=None,description="DOI identifier.")

Copilot AI Mar 7, 2026


Minor formatting inconsistency: there's a missing space after the comma in Field(default=None,description=...) on these lines. The rest of the file consistently uses a space after the comma (e.g., lines 8-10, 152, 157). Add a space after the comma to be consistent: Field(default=None, description=...).

Suggested change

Before:

pdf_url: Optional[str] = Field(default=None,description="Direct PDF download URL. Used as fallback if the Content API fails.")
authors: Optional[List[str]] = Field(default=None,description="List of author names.")
publication_year: Optional[int] = Field(default=None,description="Year of publication.")
doi: Optional[str] = Field(default=None,description="DOI identifier.")

After:

pdf_url: Optional[str] = Field(default=None, description="Direct PDF download URL. Used as fallback if the Content API fails.")
authors: Optional[List[str]] = Field(default=None, description="List of author names.")
publication_year: Optional[int] = Field(default=None, description="Year of publication.")
doi: Optional[str] = Field(default=None, description="DOI identifier.")

Comment on lines +134 to 138
if not state['new_downloaded_files']:
    logger.info("No new papers downloaded. Reusing existing RAG index.")
    return {}

logger.info("Rebuilding RAG index with new papers...")

Copilot AI Mar 7, 2026


Bug: When paper PDFs already exist on disk but the RAG index doesn't (e.g., index was deleted, or fresh deployment with pre-existing files), new_downloaded_files will be empty (since the files are skipped as "already downloaded"), so index_papers won't rebuild the index. This means generate_report will call get_info against an empty/stale index and return no context.

The old code didn't have this issue because it checked downloaded_files, which included both newly downloaded and already-existing files.

Consider also checking whether the RAG index currently exists (e.g., BaseRAGPipeline.index is None) before deciding to skip the rebuild. This way, if the index is missing but papers exist on disk, the index will still be rebuilt.

Suggested change

Before:

if not state['new_downloaded_files']:
    logger.info("No new papers downloaded. Reusing existing RAG index.")
    return {}
logger.info("Rebuilding RAG index with new papers...")

After:

# Determine if an index already exists on the RAG service. If it doesn't,
# we should rebuild even when there are no newly downloaded files, since
# papers may already exist on disk (e.g., fresh deployment or deleted index).
existing_index = getattr(self.rag_service, "index", None)
has_index = existing_index is not None
if not state['new_downloaded_files'] and has_index:
    logger.info("No new papers downloaded. Reusing existing RAG index.")
    return {}
logger.info("Rebuilding RAG index with current papers...")
