Skip to content

[Perf/Cost] Optimize prompt payloads and stop repeating full context each turn #447

@Al629176

Description

@Al629176

Summary

Current chat requests appear to resend large repeated prompt content (including prior markdown/context) on every turn. This inflates token usage, increases latency, and raises cost without improving output quality.

We need to send only the minimum required context per turn, while preserving response quality and tool-call reliability.

Problem

For small user inputs (e.g. “Fetch me msgs”), request payloads still include oversized repeated blocks from prior turns. This causes:

  • unnecessary token consumption,
  • slower response times,
  • higher inference cost,
  • higher probability of context-window pressure/truncation issues.

Observed Symptoms

  • Full/near-full previous context repeated each turn.
  • Markdown/system/context blocks duplicated across requests.
  • Token usage does not scale with user message size.
  • Long conversations degrade speed and consistency.

Expected Behavior

Each turn should send:

  1. Current user message
  2. Required system/developer instructions
  3. A compact, relevant conversation window (or summary)
  4. Essential memory/tool state only

…not the entire repeated thread unless explicitly required.

Scope

  • Request builder / prompt assembly path
  • Context windowing strategy
  • Memory injection policy
  • Tool-call context packaging
  • Token budget enforcement
  • Telemetry and regression tests

Proposed Approach

1) Prompt Assembly Audit

  • Trace exactly what is included per turn.
  • Identify repeated sections (markdown history, static instructions, duplicated memory blocks).

2) Context Window Strategy

  • Keep a bounded recent-turn window.
  • Summarize older context into compact state.
  • Inject only relevant memory snippets for the active intent.

3) Dedupe and Canonicalization

  • Ensure static sections are included once.
  • Remove duplicated blocks before sending.
  • Normalize message roles and avoid role-content repeats.

4) Token Budgeting

  • Add pre-send token estimation.
  • Enforce hard/soft caps per request.
  • Apply fallback truncation order (least relevant first).

5) Delta-Oriented Sending

  • Prefer incremental context updates over full transcript resend.
  • Reuse cached conversation summaries when valid.

6) Observability

  • Add before/after metrics:
    • prompt tokens per turn
    • completion tokens per turn
    • latency p50/p95
    • error/timeout rate
  • Add logs for trimming/summarization decisions.

Tasks

  • Map current prompt builder and capture per-turn payload composition.
  • Implement dedupe for repeated markdown/context blocks.
  • Introduce rolling summary for older conversation.
  • Add relevance filter for memory/tool context injection.
  • Add token budget estimator + cap enforcement.
  • Add delta/context-cache mechanism where safe.
  • Add unit tests for prompt assembly correctness.
  • Add integration tests for long-thread behavior.
  • Add benchmark script comparing before vs after token/latency.

Acceptance Criteria

  • Significant reduction in prompt tokens per turn (target to define after baseline; suggested >=30% on long threads).
  • Reduced latency for multi-turn chats.
  • No regression in answer quality or tool-call success rate.
  • No repeated full-context payload on standard turns.
  • Metrics dashboard/logging clearly shows optimization impact.

Risks / Considerations

  • Over-truncation can hurt answer quality; must preserve critical context.
  • Tool calls may need strict context completeness; apply domain-specific exceptions.
  • Summarization must be deterministic and testable to avoid drift.

Definition of Done

  • Code changes merged with tests,
  • measurable token + latency improvement documented,
  • monitoring in place,
  • fallback strategy documented for edge cases.

Metadata

Metadata

Labels

No labels
No labels

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions