Conversation
…es in state for llm to stop rebuilding index.
There was a problem hiding this comment.
Pull request overview
This PR improves the academic paper research pipeline by optimizing the download/index workflow, adding input validation, and clarifying tool docstrings. The key optimization avoids rebuilding the RAG index when papers already exist on disk from prior runs. The PR also adds a model_validator to DeepScholarResearchAndHighlight to enforce that valid inputs are provided, and standardizes parameter naming and defaults across the base research service.
Changes:
- Added
new_downloaded_filesstate tracking to skip redundant RAG index rebuilds when all requested papers are already on disk. - Added a
model_validatoronDeepScholarResearchAndHighlightto enforce that eithersearch_queryorpaper_metadatais provided. - Standardized
BaseResearchService.semantic_scholar_searchparameter names (is_oainstead ofis_open_access,count=10) and addeddefault=Noneto optionalPaperMetadatafields.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
app/services/scholar_research/deep_research_service.py |
Tracks newly downloaded files separately to skip index rebuild when no new papers are downloaded; adds new_downloaded_files to ResearchState and initial state. |
app/services/scholar_research/base_research.py |
Renames is_open_access → is_oa and changes count default from 25 to 10 for consistency with OpenAlexResearchService. |
app/services/langchain_handler/tool_definitions.py |
Adds guidance docstrings for GetInfo and DeepScholarResearchAndHighlight, adds model_validator for input validation, and sets explicit default=None on optional PaperMetadata fields. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| pdf_url: Optional[str] = Field(default=None,description="Direct PDF download URL. Used as fallback if the Content API fails.") | ||
| authors: Optional[List[str]] = Field(default=None,description="List of author names.") | ||
| publication_year: Optional[int] = Field(default=None,description="Year of publication.") | ||
| doi: Optional[str] = Field(default=None,description="DOI identifier.") |
There was a problem hiding this comment.
Minor formatting inconsistency: there's a missing space after the comma in Field(default=None,description=...) on these lines. The rest of the file consistently uses a space after the comma (e.g., lines 8-10, 152, 157). Add a space after the comma to be consistent: Field(default=None, description=...).
| pdf_url: Optional[str] = Field(default=None,description="Direct PDF download URL. Used as fallback if the Content API fails.") | |
| authors: Optional[List[str]] = Field(default=None,description="List of author names.") | |
| publication_year: Optional[int] = Field(default=None,description="Year of publication.") | |
| doi: Optional[str] = Field(default=None,description="DOI identifier.") | |
| pdf_url: Optional[str] = Field(default=None, description="Direct PDF download URL. Used as fallback if the Content API fails.") | |
| authors: Optional[List[str]] = Field(default=None, description="List of author names.") | |
| publication_year: Optional[int] = Field(default=None, description="Year of publication.") | |
| doi: Optional[str] = Field(default=None, description="DOI identifier.") |
| if not state['new_downloaded_files']: | ||
| logger.info("No new papers downloaded. Reusing existing RAG index.") | ||
| return {} | ||
|
|
||
| logger.info("Rebuilding RAG index with new papers...") |
There was a problem hiding this comment.
Bug: When paper PDFs already exist on disk but the RAG index doesn't (e.g., index was deleted, or fresh deployment with pre-existing files), new_downloaded_files will be empty (since the files are skipped as "already downloaded"), so index_papers won't rebuild the index. This means generate_report will call get_info against an empty/stale index and return no context.
The old code didn't have this issue because it checked downloaded_files, which included both newly downloaded and already-existing files.
Consider also checking whether the RAG index currently exists (e.g., BaseRAGPipeline.index is None) before deciding to skip the rebuild. This way, if the index is missing but papers exist on disk, the index will still be rebuilt.
| if not state['new_downloaded_files']: | |
| logger.info("No new papers downloaded. Reusing existing RAG index.") | |
| return {} | |
| logger.info("Rebuilding RAG index with new papers...") | |
| # Determine if an index already exists on the RAG service. If it doesn't, | |
| # we should rebuild even when there are no newly downloaded files, since | |
| # papers may already exist on disk (e.g., fresh deployment or deleted index). | |
| existing_index = getattr(self.rag_service, "index", None) | |
| has_index = existing_index is not None | |
| if not state['new_downloaded_files'] and has_index: | |
| logger.info("No new papers downloaded. Reusing existing RAG index.") | |
| return {} | |
| logger.info("Rebuilding RAG index with current papers...") |
This pull request introduces improvements to the research tools and service logic for handling academic papers, clarifying tool usage, enforcing input validation, and optimizing paper indexing. The changes ensure clearer guidance for tool selection, stricter input contracts, and more efficient handling of downloaded papers.
Tool usage and guidance improvements:
GetInfotool to recommend its use for follow-up questions about already downloaded and indexed papers, and to avoid unnecessary calls toDeepScholarResearchAndHighlightunless new papers need to be searched or downloaded.DeepScholarResearchAndHighlightto explicitly instruct users to use this tool only when searching for new papers or downloading/indexing papers not already available, and to useGetInfofor follow-up questions about previously indexed papers.Input validation and contract enforcement:
model_validatortoDeepScholarResearchAndHighlightto enforce that either a non-emptysearch_queryor a non-emptypaper_metadatalist must be provided, and thatpaper_metadatamust be non-empty if given.Paper download and indexing optimizations:
deep_research_service.pyto track newly downloaded files separately (new_downloaded_files), and only rebuild the RAG index if new papers were downloaded, reusing the existing index otherwise. [1] [2] [3] [4]Minor improvements and consistency fixes:
PaperMetadatamodel to usedefault=Nonefor optional fields, improving clarity and consistency.base_research.pyanddeep_research_service.py. [1] [2]