Summary
Current chat requests appear to resend large repeated prompt content (including prior markdown/context) on every turn. This inflates token usage, increases latency, and raises cost without improving output quality.
We need to send only the minimum required context per turn, while preserving response quality and tool-call reliability.
Problem
For small user inputs (e.g., “Fetch me msgs”), request payloads still include oversized blocks repeated from prior turns. This causes:
- unnecessary token consumption,
- slower response times,
- higher inference cost,
- higher probability of context-window pressure/truncation issues.
Observed Symptoms
- Full/near-full previous context repeated each turn.
- Markdown/system/context blocks duplicated across requests.
- Token usage remains high regardless of user message size.
- Long conversations degrade speed and consistency.
Expected Behavior
Each turn should send:
- Current user message
- Required system/developer instructions
- A compact, relevant conversation window (or summary)
- Essential memory/tool state only
…not the entire repeated thread unless explicitly required.
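The shape of such a minimal per-turn payload could be sketched as follows. Field and function names here are illustrative assumptions, not an existing API:

```python
from dataclasses import dataclass, field

@dataclass
class TurnPayload:
    """Illustrative minimal per-turn payload; field names are assumptions."""
    user_message: str                  # current user message only
    system_instructions: str           # required instructions for this turn
    recent_window: list = field(default_factory=list)  # bounded recent turns
    summary: str = ""                  # compact summary of older context
    memory: list = field(default_factory=list)         # essential snippets only

def build_payload(user_message, system_instructions, history, summary, memory, window=4):
    """Assemble a payload from the last `window` turns plus a summary,
    instead of resending the full transcript."""
    return TurnPayload(
        user_message=user_message,
        system_instructions=system_instructions,
        recent_window=history[-window:],
        summary=summary,
        memory=memory,
    )
```

The `window` size and summary policy would be tuned per deployment.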
Scope
- Request builder / prompt assembly path
- Context windowing strategy
- Memory injection policy
- Tool-call context packaging
- Token budget enforcement
- Telemetry and regression tests
Proposed Approach
1) Prompt Assembly Audit
- Trace exactly what is included per turn.
- Identify repeated sections (markdown history, static instructions, duplicated memory blocks).
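One way to surface repeated sections is to hash each content block per request and count how many requests contain it; blocks present in nearly every request are dedup/caching candidates. A minimal sketch (the payload representation is an assumption):

```python
import hashlib
from collections import Counter

def audit_repeated_blocks(requests):
    """Count how many requests contain each content block (by SHA-256).
    `requests` is a list of per-turn payloads, each a list of text blocks."""
    counts = Counter()
    for blocks in requests:
        # Deduplicate within a single request so each block counts once per turn.
        seen = {hashlib.sha256(b.encode()).hexdigest() for b in blocks}
        counts.update(seen)
    return dict(counts)
```

Blocks whose count equals the number of requests are being resent on every turn.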
2) Context Window Strategy
- Keep a bounded recent-turn window.
- Summarize older context into compact state.
- Inject only relevant memory snippets for the active intent.
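The bounded-window-plus-summary idea can be sketched like this; the `summarize` callable is a placeholder where a real implementation would invoke a summarization model or template:

```python
def window_context(turns, max_recent=4,
                   summarize=lambda ts: f"[summary of {len(ts)} earlier turns]"):
    """Keep the last `max_recent` turns verbatim and collapse older turns
    into a single summary entry."""
    if len(turns) <= max_recent:
        return list(turns)
    older, recent = turns[:-max_recent], turns[-max_recent:]
    return [summarize(older)] + recent
```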
3) Dedupe and Canonicalization
- Ensure static sections are included once.
- Remove duplicated blocks before sending.
- Normalize message roles and avoid role-content repeats.
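A minimal dedupe pass that keeps the first occurrence of each block and preserves order might look like:

```python
def dedupe_blocks(blocks):
    """Drop exact duplicate blocks, keeping the first occurrence and
    the original order of survivors."""
    seen = set()
    out = []
    for b in blocks:
        key = b.strip()  # normalize trailing whitespace before comparing
        if key not in seen:
            seen.add(key)
            out.append(b)
    return out
```

Near-duplicate detection (e.g., hashing after whitespace normalization) could extend this if blocks vary trivially between turns.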
4) Token Budgeting
- Add pre-send token estimation.
- Enforce hard/soft caps per request.
- Apply fallback truncation order (least relevant first).
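A sketch of pre-send estimation with a least-relevant-first drop order. The 4-characters-per-token heuristic is an assumption; a real implementation would use the model's tokenizer, and `relevance` would be supplied by the caller:

```python
def estimate_tokens(text):
    """Rough pre-send estimate (~4 chars per token); replace with the
    model's actual tokenizer in a real implementation."""
    return max(1, len(text) // 4)

def enforce_budget(blocks, hard_cap, relevance):
    """Remove blocks in ascending relevance order until the estimated
    total fits under `hard_cap`; preserves the order of surviving blocks."""
    kept = list(blocks)
    while kept and sum(estimate_tokens(b) for b in kept) > hard_cap:
        kept.remove(min(kept, key=relevance))
    return kept
```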
5) Delta-Oriented Sending
- Prefer incremental context updates over full transcript resend.
- Reuse cached conversation summaries when valid.
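Summary reuse can be keyed on a hash of the summarized prefix, so resummarization happens only when older turns actually change. A sketch, with the `summarize` callable again a placeholder:

```python
import hashlib

class SummaryCache:
    """Reuse a cached conversation summary while the summarized prefix
    is unchanged; resummarize only when the older turns differ."""
    def __init__(self, summarize):
        self._summarize = summarize
        self._key = None
        self._summary = ""

    def get(self, older_turns):
        key = hashlib.sha256("\x00".join(older_turns).encode()).hexdigest()
        if key != self._key:
            self._summary = self._summarize(older_turns)
            self._key = key
        return self._summary
```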
6) Observability
- Add before/after metrics:
  - prompt tokens per turn
  - completion tokens per turn
  - latency p50/p95
  - error/timeout rate
- Add logs for trimming/summarization decisions.
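A minimal in-process metrics collector for the before/after comparison could look like this; the structure and names are illustrative, and a production system would likely export these to an existing metrics backend instead:

```python
import math
from dataclasses import dataclass, field

def percentile(values, pct):
    """Nearest-rank percentile (pct=50 for p50, pct=95 for p95)."""
    s = sorted(values)
    idx = max(0, math.ceil(pct / 100 * len(s)) - 1)
    return s[idx]

@dataclass
class TurnMetrics:
    """Per-turn counters for before/after comparison; names are illustrative."""
    prompt_tokens: list = field(default_factory=list)
    completion_tokens: list = field(default_factory=list)
    latencies_ms: list = field(default_factory=list)

    def record(self, prompt, completion, latency_ms):
        self.prompt_tokens.append(prompt)
        self.completion_tokens.append(completion)
        self.latencies_ms.append(latency_ms)
```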
Tasks
Acceptance Criteria
Risks / Considerations
- Over-truncation can hurt answer quality; must preserve critical context.
- Tool calls may need strict context completeness; apply domain-specific exceptions.
- Summarization must be deterministic and testable to avoid drift.
Definition of Done
- Code changes merged with tests.
- Measurable token and latency improvements documented.
- Monitoring in place.
- Fallback strategy documented for edge cases.