Small, practical patterns and metrics for detecting early instability in multi-turn agents – using only traces, latency and simple structure, without any framework lock-in.
This repo is a technical notebook + small experiments collection for people who:
- work with multi-turn agents in production or research,
- already have traces / logs / latency data, and
- want better intuition and tools for spotting early signs of instability before they turn into visible failures.
You do not need to adopt any specific framework to use this repo. All examples are based on:
- JSONL traces (one event per line),
- simple structural assumptions (trace ID / span ID / timestamps),
- a few small Python utilities and notebooks.
Frameworks such as Langfuse, Arize, LangChain, LlamaIndex, Semantic Kernel and others are treated as examples, not dependencies.
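For orientation, a single JSONL trace event under these structural assumptions might look like the sketch below. The field names (`trace_id`, `span_id`, `ts`, `latency_ms`, etc.) are illustrative only, not a required schema:

```python
import json

# One hypothetical trace event per line; field names are illustrative,
# not a schema this repo requires.
event_line = (
    '{"trace_id": "t-001", "span_id": "s-003", "parent_span_id": "s-001", '
    '"ts": 1714000000.25, "kind": "tool_call", "latency_ms": 840}'
)

event = json.loads(event_line)
print(event["trace_id"], event["latency_ms"])  # → t-001 840
```

Any event format with a trace ID, a span ID, and a timestamp can be mapped onto the examples in this repo.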
This repo is aimed at:
- observability / tracing engineers for LLM systems,
- people building or operating multi-turn agents / tools / workflows,
- senior practitioners who “watch traces every day” but want a more systematic view of:
  - early instability signals,
  - correction and recovery loops,
  - long-horizon behavior.
If you already ship agents but feel that “the theory of stability” is scattered across your notes and dashboards, this repo is for you.
High-level layout:

```text
agent-instability-notes/
  README.md
  LICENSE
  requirements.txt
  concepts/
    00_overview.md
    early-instability-signals.md
    correction-and-recovery-loop.md
    trace-tree-divergence.md
    nested-tool-inconsistency.md
    instability-patterns_catalog.md
  metrics/
    _shared_definitions.md
    relative-latency-gap.md
    recovery-turn-distance.md
    post-correction-relapse-rate.md
    failover-frequency.md
    session-closure-profile.md
  examples/
    README.md
    long-horizon-toy-agent.md
    tool-chain-instability.md
    workflow-looping.md
    notebooks/
      01_inspect_trace_tree.ipynb
      02_early_instability_metrics.ipynb
      03_visualize_trace_divergence.ipynb
    synthetic_traces/
      README.md
      simple_correction_loop.jsonl
      noisy_mixed_sessions.jsonl
      generate_synthetic_traces.py
  scripts/
    compute_metrics_from_jsonl.py
    trace_tree_sanity_checks.py
  data/
    sample_configs/
      latency_thresholds.yaml
      instability_heuristics.yaml
  .github/
    workflows/
      lint-and-tests.yml
```
`concepts/` contains short, self-contained notes on:
- what we mean by early instability signals in multi-turn agents,
- how to think about correction and recovery loops,
- structural patterns like trace tree divergence and nested tool-call inconsistency,
- a small catalog of instability patterns seen in practice.
The goal is to give you language and shapes to describe what you already see in traces.
`metrics/` contains definitions and sketches for instability-focused metrics, such as:
- `relative-latency-gap` – relative latency differences across comparable spans,
- `recovery-turn-distance` – how many turns it takes for a session to behave “healthy” again,
- `post-correction-relapse-rate` – how often sessions relapse after an apparent correction,
- `failover-frequency` – how often flows fall back to alternative paths,
- `session-closure-profile` – how sessions actually end (natural completion, forced stop, abandonment, etc.).
Each metric doc focuses on:
- a plain-language definition,
- when the metric is useful,
- how to compute it from JSONL traces,
- caveats and interpretation notes.
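As a concrete illustration of “how to compute it from JSONL traces”, here is a minimal sketch of one way a relative-latency-gap style metric could be computed. This is not the canonical definition from `metrics/relative-latency-gap.md`; the field names (`span_id`, `kind`, `latency_ms`) and the median-based normalization are assumptions:

```python
import json
from collections import defaultdict

def relative_latency_gap(jsonl_lines, group_key="kind"):
    """Compare each span's latency to the median of comparable spans.

    Spans are grouped by a hypothetical `kind` field; the result maps
    span_id -> (latency - group_median) / group_median, so 0.0 means
    "typical" and large positive values flag unusually slow spans.
    """
    groups = defaultdict(list)
    for line in jsonl_lines:
        ev = json.loads(line)
        if "latency_ms" in ev:
            groups[ev[group_key]].append(ev)

    gaps = {}
    for spans in groups.values():
        latencies = sorted(s["latency_ms"] for s in spans)
        median = latencies[len(latencies) // 2]
        for s in spans:
            gaps[s["span_id"]] = (s["latency_ms"] - median) / median
    return gaps

lines = [
    '{"span_id": "a", "kind": "tool_call", "latency_ms": 100}',
    '{"span_id": "b", "kind": "tool_call", "latency_ms": 110}',
    '{"span_id": "c", "kind": "tool_call", "latency_ms": 400}',
]
gaps = relative_latency_gap(lines)
print(gaps)  # span "c" stands out far above the group median
```

The choice of baseline (median, trimmed mean, per-tool history) is exactly the kind of caveat the metric docs discuss.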
`examples/` contains concrete scenarios and artifacts:
- narrative examples (Markdown) for:
  - long-horizon toy agents,
  - tool-chain instability,
  - workflow looping and recovery;
- notebooks for:
  - inspecting and visualizing trace trees,
  - computing instability metrics on synthetic traces,
  - plotting divergence and latency gaps;
- synthetic trace files:
  - a small “correction loop” trace,
  - a noisier mixed set of sessions,
  - a generator script so you can create your own.
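To make the “correction loop” idea concrete, the sketch below shows the kind of toy session a synthetic generator can produce. This is not the actual `generate_synthetic_traces.py`, just an assumed shape with illustrative field names:

```python
import json
import random

def make_correction_loop_trace(trace_id, n_failed_attempts=2, seed=0):
    """Emit JSONL lines for a toy session: a few failed tool calls,
    then a successful correction. Field names are illustrative."""
    rng = random.Random(seed)
    ts = 0.0
    lines = []
    for i in range(n_failed_attempts + 1):
        ts += rng.uniform(0.2, 1.5)  # monotonically increasing timestamps
        lines.append(json.dumps({
            "trace_id": trace_id,
            "span_id": f"{trace_id}-s{i}",
            "ts": round(ts, 3),
            "kind": "tool_call",
            # last attempt succeeds; everything before it is an error
            "status": "ok" if i == n_failed_attempts else "error",
        }))
    return lines

for line in make_correction_loop_trace("t-42"):
    print(line)
```

Even a generator this small is enough to exercise metrics like recovery-turn-distance on known ground truth before pointing them at real traces.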
`scripts/` contains lightweight utilities, for example:
- `compute_metrics_from_jsonl.py` → load a trace file and compute a handful of metrics defined under `metrics/`.
- `trace_tree_sanity_checks.py` → check basic properties such as:
  - monotonic timestamps per trace,
  - uniqueness of IDs,
  - simple structural invariants.
These scripts are intended as reference helpers, not as a production SDK.
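In the same reference-helper spirit, here is a minimal sketch of what such sanity checks can look like. The field names (`trace_id`, `span_id`, `ts`) are assumptions, and this is not the actual `trace_tree_sanity_checks.py`:

```python
import json

def sanity_check_trace(jsonl_lines):
    """Minimal structural checks on a JSONL trace stream:
    timestamps must be non-decreasing per trace, span IDs unique.
    Returns a list of human-readable problem descriptions."""
    last_ts = {}        # trace_id -> last timestamp seen
    seen_spans = set()  # all span IDs seen so far
    problems = []
    for i, line in enumerate(jsonl_lines):
        ev = json.loads(line)
        tid, sid, ts = ev["trace_id"], ev["span_id"], ev["ts"]
        if sid in seen_spans:
            problems.append(f"line {i}: duplicate span_id {sid}")
        seen_spans.add(sid)
        if tid in last_ts and ts < last_ts[tid]:
            problems.append(f"line {i}: timestamp went backwards in trace {tid}")
        last_ts[tid] = ts
    return problems

ok_lines = [
    '{"trace_id": "t", "span_id": "a", "ts": 1.0}',
    '{"trace_id": "t", "span_id": "b", "ts": 2.0}',
]
print(sanity_check_trace(ok_lines))  # → []
```

Running checks like these before computing any metric avoids chasing “instability” that is really just broken instrumentation.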
`data/sample_configs/` contains a small collection of example configuration files, such as:
- `latency_thresholds.yaml` – example thresholds for flagging suspicious latency gaps,
- `instability_heuristics.yaml` – rule-of-thumb patterns for defining “instability episodes”.
You’re expected to adapt these values to your own system and domain.
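For orientation, a thresholds file of this kind might look like the fragment below. Every key and value here is illustrative, not the shipped config:

```yaml
# Illustrative values only — adapt to your own system and domain.
relative_latency_gap:
  warn: 1.5     # span latency 150% above its group median
  alert: 3.0
absolute_latency_ms:
  tool_call: 5000
  llm_call: 15000
```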
Create a virtual environment (recommended) and install:

```bash
pip install -r requirements.txt
```

Dependencies are kept minimal:
- `pandas`, `numpy` for data handling,
- `matplotlib` for basic visualizations,
- `jupyterlab` for notebooks,
- small utility libs only when needed.
No agent framework is required.
The quickest way to get a feel for the repo:
```bash
jupyter lab
```

Then start with:
- `examples/notebooks/01_inspect_trace_tree.ipynb`
- `examples/notebooks/02_early_instability_metrics.ipynb`
Use the synthetic traces under `examples/synthetic_traces/` as initial input.
Once you have your own JSONL traces, you can plug them into the CLI script:
```bash
python scripts/compute_metrics_from_jsonl.py \
  --file path/to/your_traces.jsonl
```

(Exact options and output format are documented inside the script.)
This repository is meant to be:
- a reference shelf for patterns and metrics around agent instability,
- a sandbox for your own experiments,
- a translation layer between raw traces and the questions you actually care about:
  - “Where do things start to wobble?”
  - “How long does it take to recover once they do?”
  - “When we fix something, how often does it quietly break again later?”
You’re encouraged to:
- fork and adapt the notebooks and scripts,
- plug the metrics into your own dashboards,
- add more patterns and examples that you see in the wild.
This repository is a framework-agnostic notebook for understanding instability in multi-turn agents.
If you want the formal governance model, event lifecycle, and a full behavioral contract for
agent execution stability, the parent conceptual project is:
👉 PLD — Phase Loop Dynamics
https://github.com/kiyoshisasano/agent-pld-metrics
PLD provides:
- a formal Drift → Repair → Reentry → Continue lifecycle,
- a runtime reference implementation,
- schema-level definitions, metrics, and a complete taxonomy,
- end-to-end examples of multi-turn stability governance.
While this repository (agent-instability-notes) stands alone,
it also serves as a practical companion to PLD —
focused on practitioner vocabulary, early-instability signals,
and empirical metrics derived from traces.
License and contribution guidelines are intentionally lightweight and will be documented in:
LICENSE
For now, feel free to open issues or pull requests if you have:
- new instability patterns,
- better visualizations,
- corrections or clarifications to the definitions.