
Is this how the agent's document processing and caching works? #5

@dev3py


Hi! I've been using the project and noticed an interesting behavior — when I select a folder with 22 documents and ask my first question, it takes ~18 minutes to respond. But after that, all follow-up questions are answered within ~1 minute.

I dug into the code to understand why, and I'd like to confirm whether my understanding is correct:

My Understanding

First query is slow because the agent does a full multi-step exploration:

  1. scan_folder runs on all 22 documents in parallel to get previews
  2. parse_file is called individually on each document marked as RELEVANT for a full deep-read
  3. If cross-references are found between documents, more parse_file calls happen (backtracking)
  4. Each of these steps is a separate API round-trip to Gemini, and each call sends the entire accumulated conversation history — so by the end, we're sending hundreds of thousands of tokens per call
  5. This is what triggers the Gemini 429 rate limit (1M input tokens/minute), causing forced ~60s pauses between retries
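To illustrate step 4, here is a small back-of-the-envelope sketch of why re-sending the full history makes token usage blow up. The constants (`DOC_TOKENS`, `BASE_PROMPT`) are assumptions for illustration, not values from the project:

```python
# Hypothetical model of the exploration loop: each agent step is a
# separate API call that re-sends the whole accumulated history, so
# total input tokens grow roughly quadratically with document count.

DOC_TOKENS = 10_000   # assumed average tokens per fully parsed document
BASE_PROMPT = 2_000   # assumed system prompt + question tokens

def tokens_sent_per_call(num_docs: int) -> list[int]:
    """Tokens sent on each successive API call as parsed docs accumulate."""
    history = BASE_PROMPT
    sent = []
    for _ in range(num_docs):
        sent.append(history)   # this call re-sends everything so far
        history += DOC_TOKENS  # the newly parsed document joins the history
    return sent

calls = tokens_sent_per_call(22)
total = sum(calls)
# Under these assumptions the final call alone carries >200k input
# tokens, and the run's cumulative input crosses 1M tokens well before
# a minute is up, which would trip the rate limit described above.
```

Even with much smaller per-document sizes the quadratic shape holds, which is consistent with the 429s showing up late in the first query rather than at the start.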

Follow-up queries are fast because of the singleton agent pattern:

  1. FsExplorerAgent is created once as a module-level singleton via get_agent() in workflow.py
  2. _chat_history (a plain Python list in memory) is never cleared between queries
  3. So all 22 documents' content from the first scan/parse is already sitting in the history
  4. Gemini already "knows" the documents and can answer in just 1-2 API calls instead of 10+

Things I noticed that could be potential concerns:

  • No persistence — if the server restarts, all chat history is lost and the next query does the full 18-minute scan again
  • History only grows — every follow-up question adds more to _chat_history, meaning we're re-sending all previous documents + Q&A pairs every time, which increases cost and latency over time
  • Skipped documents — if the first query caused the agent to SKIP certain documents as irrelevant, a follow-up question about those skipped documents would still require new parse_file calls
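On the "history only grows" concern: a common mitigation (not something I saw in the project, just a sketch of the idea) is to trim the oldest turns once an assumed token budget is exceeded, keeping only the most recent context:

```python
# Sketch only: cap history at a token budget by dropping the oldest
# turns first. The 4-chars-per-token estimate is a rough heuristic,
# not a real tokenizer.

def estimate_tokens(text: str) -> int:
    return max(1, len(text) // 4)  # crude approximation

def prune_history(history: list[dict], budget: int) -> list[dict]:
    """Keep the most recent turns that fit within `budget` tokens."""
    kept: list[dict] = []
    used = 0
    for turn in reversed(history):       # newest first
        cost = estimate_tokens(turn["content"])
        if used + cost > budget:
            break                        # everything older is dropped
        kept.append(turn)
        used += cost
    return list(reversed(kept))          # restore chronological order
```

The trade-off, of course, is that pruned document content would have to be re-parsed if a later question needs it, which is the same cold-start cost the singleton currently avoids; summarizing dropped turns instead of discarding them is the usual middle ground.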

Questions

  1. Is my understanding above correct?
  2. Is the growing _chat_history by design, or is there a plan to implement some form of summarization/pruning to keep token usage in check?
  3. Has there been any consideration for persisting the chat history (e.g., in a database or cache) so that a server restart doesn't require a full re-scan?
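For question 3, even something minimal would avoid the full re-scan after a restart. This is purely a sketch of the idea, not project code, and `HISTORY_FILE` is an assumed location; a real deployment might use a database or cache instead:

```python
# Sketch: persist _chat_history across restarts as JSON on disk.
import json
from pathlib import Path

HISTORY_FILE = Path("chat_history.json")  # assumed location

def save_history(history: list[dict]) -> None:
    """Write the full history after each turn (cheap at this scale)."""
    HISTORY_FILE.write_text(json.dumps(history))

def load_history() -> list[dict]:
    """Restore history on startup; empty list means a cold start."""
    if HISTORY_FILE.exists():
        return json.loads(HISTORY_FILE.read_text())
    return []  # cold start: the agent would need a full scan
```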

Thanks for the great project!
