
test: large transcript stress/correctness — repeated rolling compaction #174

@emesal


context

the roadmap calls for stress testing with large transcripts. correctness under load is the goal — not performance benchmarks.

scope

integration test (gated on env var, uses free/agentic or free/text-generation):

  1. build a synthetic transcript with 80–100 messages (mix of user, assistant, tool_call+tool_result pairs)
  2. run rolling compaction repeatedly until message count is stable (≤4 non-system messages)
  3. after each compaction round, assert invariants:
    • no orphaned tool results (every tool message has a corresponding assistant tool_call)
    • system messages preserved throughout
    • summary grows or stays non-empty
    • message count strictly decreases each round (or hits the ≤4 floor)
    • transcript anchor written after each round
  4. after full compaction: verify final context is coherent (loadable, parseable, no corrupt JSON)
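The per-round invariants in step 3 can be sketched as a helper the test calls between compaction rounds. This is a minimal sketch, not the project's real API: the message shape (dicts with `role`, `content`, `tool_calls`, `tool_call_id`) and the `check_invariants` name are assumptions.

```python
import json

def check_invariants(before, after):
    """Hypothetical per-round checks; message shape is an assumed OpenAI-style dict."""
    # no orphaned tool results: every tool message must reference an assistant tool_call
    call_ids = {
        tc["id"]
        for m in after
        if m["role"] == "assistant"
        for tc in m.get("tool_calls", [])
    }
    for m in after:
        if m["role"] == "tool":
            assert m["tool_call_id"] in call_ids, "orphaned tool result"

    # system messages preserved throughout
    sys_before = [m["content"] for m in before if m["role"] == "system"]
    sys_after = [m["content"] for m in after if m["role"] == "system"]
    assert sys_before == sys_after, "system messages changed"

    # message count strictly decreases each round, or we are at the <=4 floor
    n_before = sum(1 for m in before if m["role"] != "system")
    n_after = sum(1 for m in after if m["role"] != "system")
    assert n_after < n_before or n_after <= 4, "count did not decrease"

    # step 4's coherence check reduces to a JSON round-trip of the transcript
    assert json.loads(json.dumps(after)) == after, "transcript not serializable"
```

The test loop would then call a compaction pass, run `check_invariants(before, after)`, and repeat until the non-system count is stable at or below 4.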

notes

  • free models may be slow; set generous timeouts
  • if the LLM repeatedly returns empty IDs (triggering the fallback path), that's fine — the test should exercise the fallback too
  • seed the transcript deterministically so failures are reproducible
  • this test will be slow (~minutes); mark clearly in output
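Deterministic seeding could look like the sketch below. Only the 80–100 message target and the message mix come from the issue; the builder name, tool name, and exact shapes are hypothetical.

```python
import json
import random

def build_transcript(seed: int, n: int = 90):
    """Build a reproducible synthetic transcript of roughly n messages."""
    rng = random.Random(seed)  # seeded RNG makes failures reproducible
    msgs = [{"role": "system", "content": "you are a test harness"}]
    i = 0
    while len(msgs) < n:
        kind = rng.choice(["user", "assistant", "tool_pair"])
        if kind == "user":
            msgs.append({"role": "user", "content": f"user msg {i}"})
        elif kind == "assistant":
            msgs.append({"role": "assistant", "content": f"assistant msg {i}"})
        else:
            # tool_call + tool_result pair, kept adjacent so results never orphan
            call_id = f"call_{i}"
            msgs.append({"role": "assistant", "content": "",
                         "tool_calls": [{"id": call_id, "name": "lookup",
                                         "arguments": json.dumps({"q": i})}]})
            msgs.append({"role": "tool", "tool_call_id": call_id,
                         "content": f"result {i}"})
        i += 1
    return msgs
```

Because the only randomness flows through `random.Random(seed)`, the same seed always yields the same transcript, so a failing run can be replayed exactly.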


Labels

    enhancement (New feature or request)
