Minor updates and bug fixes by Sreehari05055 · Pull Request #24 · Sreehari05055/ChatPilot

Sreehari05055 · 2026-03-07T08:42:46Z

This pull request introduces improvements to the research tools and service logic for handling academic papers, clarifying tool usage, enforcing input validation, and optimizing paper indexing. The changes ensure clearer guidance for tool selection, stricter input contracts, and more efficient handling of downloaded papers.

Tool usage and guidance improvements:

Clarified the docstring for the GetInfo tool to recommend its use for follow-up questions about already downloaded and indexed papers, and to avoid unnecessary calls to DeepScholarResearchAndHighlight unless new papers need to be searched or downloaded.
Updated the docstring for DeepScholarResearchAndHighlight to explicitly instruct users to use this tool only when searching for new papers or downloading/indexing papers not already available, and to use GetInfo for follow-up questions about previously indexed papers.

Input validation and contract enforcement:

Added a model_validator to DeepScholarResearchAndHighlight to enforce that either a non-empty search_query or a non-empty paper_metadata list must be provided, and that paper_metadata must be non-empty if given.
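A minimal sketch of how such a validator might look, assuming Pydantic v2 (where `model_validator` is available). The trimmed-down `PaperMetadata`, the field descriptions, the error messages, and the `is_valid` helper are illustrative assumptions, not the PR's actual code.

```python
from typing import List, Optional

from pydantic import BaseModel, Field, ValidationError, model_validator


class PaperMetadata(BaseModel):
    # Trimmed-down stand-in: only two of the optional fields named in the PR.
    pdf_url: Optional[str] = Field(default=None, description="Direct PDF download URL.")
    doi: Optional[str] = Field(default=None, description="DOI identifier.")


class DeepScholarResearchAndHighlight(BaseModel):
    # Both inputs are optional individually; the validator enforces the contract.
    search_query: Optional[str] = Field(default=None, description="Free-text search query.")
    paper_metadata: Optional[List[PaperMetadata]] = Field(
        default=None, description="Papers to download and index."
    )

    @model_validator(mode="after")
    def require_query_or_metadata(self) -> "DeepScholarResearchAndHighlight":
        # A whitespace-only query counts as empty.
        has_query = bool(self.search_query and self.search_query.strip())
        # An explicitly supplied but empty list is rejected.
        if self.paper_metadata is not None and len(self.paper_metadata) == 0:
            raise ValueError("paper_metadata must be non-empty when provided")
        if not has_query and not self.paper_metadata:
            raise ValueError("provide a non-empty search_query or paper_metadata")
        return self


def is_valid(**kwargs) -> bool:
    """Hypothetical helper: True if the tool input passes validation."""
    try:
        DeepScholarResearchAndHighlight(**kwargs)
        return True
    except ValidationError:
        return False
```

With this sketch, `is_valid(search_query="graph neural networks")` passes, while `is_valid()` and `is_valid(paper_metadata=[])` are both rejected before the tool body runs.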

Paper download and indexing optimizations:

Modified the download and indexing logic in deep_research_service.py to track newly downloaded files separately (new_downloaded_files), and only rebuild the RAG index if new papers were downloaded, reusing the existing index otherwise.
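The tracking described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in deep_research_service.py: the `download_missing` function, the `fetch` callback, and the `<paper_id>.pdf` naming scheme are all hypothetical. Only the two state keys, `downloaded_files` and `new_downloaded_files`, come from the PR.

```python
import tempfile
from pathlib import Path
from typing import Callable, Dict, Iterable, List


def download_missing(
    paper_ids: Iterable[str],
    target_dir: Path,
    fetch: Callable[[str], bytes],
) -> Dict[str, List[str]]:
    """Download only papers that are not on disk yet, tracking new files separately."""
    downloaded_files: List[str] = []      # every requested paper available on disk
    new_downloaded_files: List[str] = []  # only the papers fetched during this call
    for paper_id in paper_ids:
        path = target_dir / f"{paper_id}.pdf"  # hypothetical naming scheme
        if not path.exists():
            path.write_bytes(fetch(paper_id))
            new_downloaded_files.append(str(path))
        downloaded_files.append(str(path))
    return {
        "downloaded_files": downloaded_files,
        "new_downloaded_files": new_downloaded_files,
    }


# Usage: the second call finds both files on disk, so nothing new is recorded.
with tempfile.TemporaryDirectory() as tmp:
    fake_fetch = lambda paper_id: b"%PDF-1.4 placeholder"
    first = download_missing(["a", "b"], Path(tmp), fake_fetch)
    second = download_missing(["a", "b"], Path(tmp), fake_fetch)
```

An indexing step that inspects only `new_downloaded_files` would rebuild after the first call and skip after the second.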

Minor improvements and consistency fixes:

Updated the PaperMetadata model to use default=None for optional fields, improving clarity and consistency.
Standardized import ordering and variable naming in base_research.py and deep_research_service.py.
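On the `default=None` change to `PaperMetadata`: the PR does not show the earlier form of these fields, so the "before" class below is an assumption. The sketch only illustrates why an explicit default can matter under Pydantic v2, where an `Optional[...]` annotation alone does not make a field omittable. The class names and the `can_omit_doi` helper are hypothetical.

```python
from typing import Optional

from pydantic import BaseModel, Field, ValidationError


class WithoutDefault(BaseModel):
    # Assumed "before" form: Optional type, but no default supplied.
    doi: Optional[str] = Field(description="DOI identifier.")


class WithDefault(BaseModel):
    # Form used in the PR: explicit default=None.
    doi: Optional[str] = Field(default=None, description="DOI identifier.")


def can_omit_doi(model_cls) -> bool:
    """Hypothetical helper: True if the model can be built without passing doi."""
    try:
        model_cls()
        return True
    except ValidationError:
        return False
```

Under Pydantic v2, `can_omit_doi(WithDefault)` succeeds, whereas `WithoutDefault` still requires `doi` to be passed explicitly, even if the value passed is `None`.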

Copilot

Pull request overview

This PR improves the academic paper research pipeline by optimizing the download/index workflow, adding input validation, and clarifying tool docstrings. The key optimization avoids rebuilding the RAG index when papers already exist on disk from prior runs. The PR also adds a model_validator to DeepScholarResearchAndHighlight to enforce that valid inputs are provided, and standardizes parameter naming and defaults across the base research service.

Changes:

Added new_downloaded_files state tracking to skip redundant RAG index rebuilds when all requested papers are already on disk.
Added a model_validator on DeepScholarResearchAndHighlight to enforce that either search_query or paper_metadata is provided.
Standardized BaseResearchService.semantic_scholar_search parameter names (is_oa instead of is_open_access, count=10) and added default=None to optional PaperMetadata fields.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

File	Description
`app/services/scholar_research/deep_research_service.py`	Tracks newly downloaded files separately to skip index rebuild when no new papers are downloaded; adds `new_downloaded_files` to `ResearchState` and initial state.
`app/services/scholar_research/base_research.py`	Renames `is_open_access` → `is_oa` and changes `count` default from 25 to 10 for consistency with `OpenAlexResearchService`.
`app/services/langchain_handler/tool_definitions.py`	Adds guidance docstrings for `GetInfo` and `DeepScholarResearchAndHighlight`, adds `model_validator` for input validation, and sets explicit `default=None` on optional `PaperMetadata` fields.

Copilot · 2026-03-07T08:46:21Z

app/services/langchain_handler/tool_definitions.py

+    pdf_url: Optional[str] = Field(default=None,description="Direct PDF download URL. Used as fallback if the Content API fails.")
+    authors: Optional[List[str]] = Field(default=None,description="List of author names.")
+    publication_year: Optional[int] = Field(default=None,description="Year of publication.")
+    doi: Optional[str] = Field(default=None,description="DOI identifier.")


Minor formatting inconsistency: there's a missing space after the comma in Field(default=None,description=...) on these lines. The rest of the file consistently uses a space after the comma (e.g., lines 8-10, 152, 157). Add a space after the comma to be consistent: Field(default=None, description=...).

Suggested change

-    pdf_url: Optional[str] = Field(default=None,description="Direct PDF download URL. Used as fallback if the Content API fails.")
-    authors: Optional[List[str]] = Field(default=None,description="List of author names.")
-    publication_year: Optional[int] = Field(default=None,description="Year of publication.")
-    doi: Optional[str] = Field(default=None,description="DOI identifier.")
+    pdf_url: Optional[str] = Field(default=None, description="Direct PDF download URL. Used as fallback if the Content API fails.")
+    authors: Optional[List[str]] = Field(default=None, description="List of author names.")
+    publication_year: Optional[int] = Field(default=None, description="Year of publication.")
+    doi: Optional[str] = Field(default=None, description="DOI identifier.")

Copilot · 2026-03-07T08:46:21Z

app/services/scholar_research/deep_research_service.py

+        if not state['new_downloaded_files']:
+            logger.info("No new papers downloaded. Reusing existing RAG index.")
            return {}

        logger.info("Rebuilding RAG index with new papers...")


Bug: When paper PDFs already exist on disk but the RAG index doesn't (e.g., index was deleted, or fresh deployment with pre-existing files), new_downloaded_files will be empty (since the files are skipped as "already downloaded"), so index_papers won't rebuild the index. This means generate_report will call get_info against an empty/stale index and return no context.

The old code didn't have this issue because it checked downloaded_files, which included both newly downloaded and already-existing files.

Consider also checking whether the RAG index currently exists (e.g., BaseRAGPipeline.index is None) before deciding to skip the rebuild. This way, if the index is missing but papers exist on disk, the index will still be rebuilt.

Suggested change

-        if not state['new_downloaded_files']:
-            logger.info("No new papers downloaded. Reusing existing RAG index.")
-            return {}
-        logger.info("Rebuilding RAG index with new papers...")
+        # Determine if an index already exists on the RAG service. If it doesn't,
+        # we should rebuild even when there are no newly downloaded files, since
+        # papers may already exist on disk (e.g., fresh deployment or deleted index).
+        existing_index = getattr(self.rag_service, "index", None)
+        has_index = existing_index is not None
+        if not state['new_downloaded_files'] and has_index:
+            logger.info("No new papers downloaded. Reusing existing RAG index.")
+            return {}
+        logger.info("Rebuilding RAG index with current papers...")
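The condition in this suggestion reduces to a two-input decision: rebuild when there are new files, or when no index exists yet. The sketch below isolates that logic so the four cases can be checked directly. The function name `should_rebuild_index` is hypothetical, and treating a missing index as `None` follows the suggestion's `getattr(..., None)` check.

```python
from typing import Optional, Sequence


def should_rebuild_index(
    new_downloaded_files: Sequence[str],
    existing_index: Optional[object],
) -> bool:
    """Return True when the RAG index should be (re)built.

    The rebuild is skipped only when nothing new was downloaded AND an index
    already exists, which mirrors the condition in the suggested change.
    """
    has_index = existing_index is not None
    return bool(new_downloaded_files) or not has_index
```

The case that motivated the comment is `should_rebuild_index([], None)`: papers are already on disk, no index is loaded, and the rebuild must still run.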

SreehariA added 2 commits March 7, 2026 08:41

refactor: updated deep research service to include new downloaded files in state for llm to stop rebuilding index.

ec5f8d6

refactor: updated to change variable name and count to 10

5055c48

Sreehari05055 marked this pull request as ready for review March 7, 2026 08:42

Copilot AI review requested due to automatic review settings March 7, 2026 08:42

Copilot started reviewing on behalf of Sreehari05055 March 7, 2026 08:43

Copilot AI reviewed Mar 7, 2026

Sreehari05055 wants to merge 2 commits into main from dev


