
test: large transcript stress/correctness — repeated rolling compaction #174

@emesal


context

the roadmap calls for stress testing with large transcripts. correctness under load is the goal — not performance benchmarks.

scope

integration test (gated on env var, uses free/agentic or free/text-generation):

  1. build a synthetic transcript with 80–100 messages (mix of user, assistant, tool_call+tool_result pairs)
  2. run rolling compaction repeatedly until message count is stable (≤4 non-system messages)
  3. after each compaction round, assert invariants:
    • no orphaned tool results (every tool message has a corresponding assistant tool_call)
    • system messages preserved throughout
    • summary grows or stays non-empty
    • message count strictly decreases each round (or hits the ≤4 floor)
    • transcript anchor written after each round
  4. after full compaction: verify final context is coherent (loadable, parseable, no corrupt JSON)
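The per-round invariants in step 3 can be sketched as a helper the test calls between compaction rounds. This is a minimal sketch, not the project's real API: the message shape (dicts with `role`, `content`, `tool_calls`, `tool_call_id`) and the `check_invariants` name are assumptions.

```python
import json

def check_invariants(before, after):
    """Hypothetical per-round checks; message shape is an assumed OpenAI-style dict."""
    # no orphaned tool results: every tool message must reference an assistant tool_call
    call_ids = {
        tc["id"]
        for m in after
        if m["role"] == "assistant"
        for tc in m.get("tool_calls", [])
    }
    for m in after:
        if m["role"] == "tool":
            assert m["tool_call_id"] in call_ids, "orphaned tool result"

    # system messages preserved throughout
    sys_before = [m["content"] for m in before if m["role"] == "system"]
    sys_after = [m["content"] for m in after if m["role"] == "system"]
    assert sys_before == sys_after, "system messages changed"

    # message count strictly decreases each round, or we are at the <=4 floor
    n_before = sum(1 for m in before if m["role"] != "system")
    n_after = sum(1 for m in after if m["role"] != "system")
    assert n_after < n_before or n_after <= 4, "count did not decrease"

    # step 4's coherence check reduces to a JSON round-trip of the transcript
    assert json.loads(json.dumps(after)) == after, "transcript not serializable"
```

The test loop would then call a compaction pass, run `check_invariants(before, after)`, and repeat until the non-system count is stable at or below 4.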

notes

  • free models may be slow; set generous timeouts
  • if the LLM repeatedly returns empty IDs (triggering the fallback path), that's fine — the test should exercise the fallback too
  • seed the transcript deterministically so failures are reproducible
  • this test will be slow (~minutes); mark clearly in output
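Deterministic seeding could look like the sketch below. Only the 80–100 message target and the message mix come from the issue; the builder name, tool name, and exact shapes are hypothetical.

```python
import json
import random

def build_transcript(seed: int, n: int = 90):
    """Build a reproducible synthetic transcript of roughly n messages."""
    rng = random.Random(seed)  # seeded RNG makes failures reproducible
    msgs = [{"role": "system", "content": "you are a test harness"}]
    i = 0
    while len(msgs) < n:
        kind = rng.choice(["user", "assistant", "tool_pair"])
        if kind == "user":
            msgs.append({"role": "user", "content": f"user msg {i}"})
        elif kind == "assistant":
            msgs.append({"role": "assistant", "content": f"assistant msg {i}"})
        else:
            # tool_call + tool_result pair, kept adjacent so results never orphan
            call_id = f"call_{i}"
            msgs.append({"role": "assistant", "content": "",
                         "tool_calls": [{"id": call_id, "name": "lookup",
                                         "arguments": json.dumps({"q": i})}]})
            msgs.append({"role": "tool", "tool_call_id": call_id,
                         "content": f"result {i}"})
        i += 1
    return msgs
```

Because the only randomness flows through `random.Random(seed)`, the same seed always yields the same transcript, so a failing run can be replayed exactly.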


Labels

    enhancement (New feature or request)
