Summary
Current chat requests appear to resend large repeated prompt content (including prior markdown/context) on every turn. This inflates token usage, increases latency, and raises cost without improving output quality.
We need to send only the minimum required context per turn, while preserving response quality and tool-call reliability.
Problem
For small user inputs (e.g., “Fetch me msgs”), request payloads still include oversized blocks repeated from prior turns. This causes:
- unnecessary token consumption,
- slower response times,
- higher inference cost,
- higher probability of context-window pressure/truncation issues.
Observed Symptoms
- Full/near-full previous context repeated each turn.
- Markdown/system/context blocks duplicated across requests.
- Token usage remains high regardless of user message size.
- Long conversations degrade speed and consistency.
Expected Behavior
Each turn should send:
- Current user message
- Required system/developer instructions
- A compact, relevant conversation window (or summary)
- Essential memory/tool state only
…not the entire repeated thread unless explicitly required.
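The shape of such a minimal per-turn payload could be sketched as follows. Field and function names here are illustrative assumptions, not an existing API:

```python
from dataclasses import dataclass, field

@dataclass
class TurnPayload:
    """Illustrative minimal per-turn payload; field names are assumptions."""
    user_message: str                  # current user message only
    system_instructions: str           # required instructions for this turn
    recent_window: list = field(default_factory=list)  # bounded recent turns
    summary: str = ""                  # compact summary of older context
    memory: list = field(default_factory=list)         # essential snippets only

def build_payload(user_message, system_instructions, history, summary, memory, window=4):
    """Assemble a payload from the last `window` turns plus a summary,
    instead of resending the full transcript."""
    return TurnPayload(
        user_message=user_message,
        system_instructions=system_instructions,
        recent_window=history[-window:],
        summary=summary,
        memory=memory,
    )
```

The `window` size and summary policy would be tuned per deployment.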
Scope
- Request builder / prompt assembly path
- Context windowing strategy
- Memory injection policy
- Tool-call context packaging
- Token budget enforcement
- Telemetry and regression tests
Proposed Approach
1) Prompt Assembly Audit
- Trace exactly what is included per turn.
- Identify repeated sections (markdown history, static instructions, duplicated memory blocks).
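One way to surface repeated sections is to hash each content block per request and count how many requests contain it; blocks present in nearly every request are dedup/caching candidates. A minimal sketch (the payload representation is an assumption):

```python
import hashlib
from collections import Counter

def audit_repeated_blocks(requests):
    """Count how many requests contain each content block (by SHA-256).
    `requests` is a list of per-turn payloads, each a list of text blocks."""
    counts = Counter()
    for blocks in requests:
        # Deduplicate within a single request so each block counts once per turn.
        seen = {hashlib.sha256(b.encode()).hexdigest() for b in blocks}
        counts.update(seen)
    return dict(counts)
```

Blocks whose count equals the number of requests are being resent on every turn.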
2) Context Window Strategy
- Keep a bounded recent-turn window.
- Summarize older context into compact state.
- Inject only relevant memory snippets for the active intent.
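The bounded-window-plus-summary idea can be sketched like this; the `summarize` callable is a placeholder where a real implementation would invoke a summarization model or template:

```python
def window_context(turns, max_recent=4,
                   summarize=lambda ts: f"[summary of {len(ts)} earlier turns]"):
    """Keep the last `max_recent` turns verbatim and collapse older turns
    into a single summary entry."""
    if len(turns) <= max_recent:
        return list(turns)
    older, recent = turns[:-max_recent], turns[-max_recent:]
    return [summarize(older)] + recent
```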
3) Dedupe and Canonicalization
- Ensure static sections are included once.
- Remove duplicated blocks before sending.
- Normalize message roles and avoid role-content repeats.
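A minimal dedupe pass that keeps the first occurrence of each block and preserves order might look like:

```python
def dedupe_blocks(blocks):
    """Drop exact duplicate blocks, keeping the first occurrence and
    the original order of survivors."""
    seen = set()
    out = []
    for b in blocks:
        key = b.strip()  # normalize trailing whitespace before comparing
        if key not in seen:
            seen.add(key)
            out.append(b)
    return out
```

Near-duplicate detection (e.g., hashing after whitespace normalization) could extend this if blocks vary trivially between turns.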
4) Token Budgeting
- Add pre-send token estimation.
- Enforce hard/soft caps per request.
- Apply fallback truncation order (least relevant first).
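A sketch of pre-send estimation with a least-relevant-first drop order. The 4-characters-per-token heuristic is an assumption; a real implementation would use the model's tokenizer, and `relevance` would be supplied by the caller:

```python
def estimate_tokens(text):
    """Rough pre-send estimate (~4 chars per token); replace with the
    model's actual tokenizer in a real implementation."""
    return max(1, len(text) // 4)

def enforce_budget(blocks, hard_cap, relevance):
    """Remove blocks in ascending relevance order until the estimated
    total fits under `hard_cap`; preserves the order of surviving blocks."""
    kept = list(blocks)
    while kept and sum(estimate_tokens(b) for b in kept) > hard_cap:
        kept.remove(min(kept, key=relevance))
    return kept
```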
5) Delta-Oriented Sending
- Prefer incremental context updates over full transcript resend.
- Reuse cached conversation summaries when valid.
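Summary reuse can be keyed on a hash of the summarized prefix, so resummarization happens only when older turns actually change. A sketch, with the `summarize` callable again a placeholder:

```python
import hashlib

class SummaryCache:
    """Reuse a cached conversation summary while the summarized prefix
    is unchanged; resummarize only when the older turns differ."""
    def __init__(self, summarize):
        self._summarize = summarize
        self._key = None
        self._summary = ""

    def get(self, older_turns):
        key = hashlib.sha256("\x00".join(older_turns).encode()).hexdigest()
        if key != self._key:
            self._summary = self._summarize(older_turns)
            self._key = key
        return self._summary
```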
6) Observability
- Add before/after metrics:
  - prompt tokens per turn
  - completion tokens per turn
  - latency p50/p95
  - error/timeout rate
- Add logs for trimming/summarization decisions.
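A minimal in-process metrics collector for the before/after comparison could look like this; the structure and names are illustrative, and a production system would likely export these to an existing metrics backend instead:

```python
import math
from dataclasses import dataclass, field

def percentile(values, pct):
    """Nearest-rank percentile (pct=50 for p50, pct=95 for p95)."""
    s = sorted(values)
    idx = max(0, math.ceil(pct / 100 * len(s)) - 1)
    return s[idx]

@dataclass
class TurnMetrics:
    """Per-turn counters for before/after comparison; names are illustrative."""
    prompt_tokens: list = field(default_factory=list)
    completion_tokens: list = field(default_factory=list)
    latencies_ms: list = field(default_factory=list)

    def record(self, prompt, completion, latency_ms):
        self.prompt_tokens.append(prompt)
        self.completion_tokens.append(completion)
        self.latencies_ms.append(latency_ms)
```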
Tasks
Acceptance Criteria
Risks / Considerations
- Over-truncation can hurt answer quality; must preserve critical context.
- Tool calls may need strict context completeness; apply domain-specific exceptions.
- Summarization must be deterministic and testable to avoid drift.
Definition of Done
- Code changes merged with tests.
- Measurable token and latency improvements documented.
- Monitoring in place.
- Fallback strategy documented for edge cases.