feat(agents): supervisor + GitHub repo researcher + tracing #14
Major checkpoint commit for the AI agents stack. Brings the supervisor → researcher → planner → diagram → critic → finalize graph from "almost working" to "reliable on local Qwen via LM Studio + first-class Langfuse hierarchy".
Backend:
- `agents/runtime.py`: catch `CancelledError`, merge `final_state` across `on_chain_end`, fall back to `findings.summary` when the supervisor empties out.
- `agents/nodes/base.py`: `terminating_tool_names`, `isolated_state_for_subagent`, preserved per-step `LLMCallMetadata` fields, salvaged `result.text` on finalize-tool exits, added context + delegation-brief renderers.
- `agents/llm.py`: `parent_observation_id`, `request_timeout` raised to 90s, custom provider routing for LM Studio.
- `agents/tracing.py`: `AgentTracer` holds a `StatefulSpanClient` per node visit so spans actually close (instead of sticking at the 25s default); tool events carry full content; arbitrary outputs are JSON-coerced.
- `agents/builtin/general/graph.py`: per-node spans, ENTER/EXIT logs, isolated sub-agent state, `_strip_subagent_messages` so sub-agent chatter doesn't leak back into supervisor history. The router stops at the most recent assistant turn (no more skipping past text replies to re-fire delegation).
- `agents/builtin/general/nodes/researcher.py`: `max_steps` 6→4, salvage tool results into `Findings.summary` on `max_steps`, fix the prompt path.
- `agents/builtin/general/nodes/supervisor.py`: extract `delegate_brief`, preserve LLM prose on finalize tool calls.
- `prompts/researcher`: clarify `diagram_id` vs `object_id` vs `technology_id`.
- `api/v1/agents.py`: shield `runtime_iter` from the heartbeat `wait_for` so the 25s ping interval no longer cancels in-flight LLM calls.
Frontend:
- agent-chat: drop the unmount abort so closing the bubble doesn't kill the in-flight agent run; `chat_context` now reads `useLocation` directly so it works outside `<Routes>`; `AgentStreamProvider` hoists shared SSE state.
Tests: 828 backend + 73 frontend passing.
…n evals
Three loosely-coupled improvements to the multi-agent pipeline:
1. **Live canvas updates from agent tools.** Mutating tools (`create_object`,
`update_object`, `delete_object`, `place/move/unplace_on_diagram`,
`create/update/delete_connection`, `create/update/delete_diagram`,
`link_object_to_child_diagram`, `auto_layout_diagram`) now publish to the
workspace + diagram WS channels with the same payload shape REST uses, so
open canvases re-render the moment the tool fires. Previously updates
surfaced only after the SSE stream finished and `useAppliedChangeSync`
triggered a refetch — which happened seconds late or required a window
refocus. Shared helper in `app/agents/tools/_realtime.py`.
2. **Sub-agent tool-result rewrite.** The supervisor used to see
`[tool: {"action":"delegate.researcher","question":"..."}]` (an echo of
its own input) after a `delegate_to_*` call, with the actual findings /
plan / applied_changes / critique tucked away in a separate system block.
Qwen routinely re-delegated to the same sub-agent because the tool
protocol pair was malformed. Now each sub-agent node walks the
supervisor's history and rewrites the matching tool result with real
output (`rewrite_subagent_tool_result` in `app/agents/nodes/base.py`).
The redundant `render_subagent_results_block` is removed from
supervisor's system blocks. Supervisor prompt updated to reflect the
new flow and to instruct reuse of existing object IDs surfaced by
researcher (no duplicate-create when an object already exists).
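The rewrite boils down to walking the history backwards and swapping the echoed tool result for the sub-agent's real output. A simplified sketch (message dicts and field names are assumptions; the real `rewrite_subagent_tool_result` operates on the supervisor's LangGraph state):

```python
def rewrite_subagent_tool_result(
    messages: list[dict], tool_name: str, real_output: str
) -> list[dict]:
    """Replace the most recent tool result for `tool_name` — which is an
    echo of the delegation args — with the sub-agent's actual output, so
    the supervisor sees a well-formed tool protocol pair."""
    out = list(messages)
    for i in range(len(out) - 1, -1, -1):
        msg = out[i]
        if msg.get("role") == "tool" and msg.get("name") == tool_name:
            out[i] = {**msg, "content": real_output}
            break
    return out
```

With the real findings in the tool slot, the model no longer sees its own delegation echoed back, which is what was triggering the re-delegation loop.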
3. **Golden eval suite (live local Qwen).** New `evals/test_golden_*.py`
exercise the supervisor → sub-agent graph against LM Studio while
mocking DB / service layer. Six cases (3 investigate + 3 create-basic).
Langfuse traces from the suite are tagged with the `:eval` suffix
(`ARCHFLOW_TRACE_NAME_SUFFIX` env var, read by both `AgentTracer` and
`LLMClient._build_langfuse_metadata`) so eval runs are filterable from
real workspace activity. `analytics_consent` flipped to `full` in the
golden runtime so LLM generations actually surface in Langfuse alongside
the existing supervisor / sub-agent spans.
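The suffix mechanism can be as small as one helper shared by both call sites; a sketch (the helper name is hypothetical, the env var name is from the commit):

```python
import os


def trace_name(base: str) -> str:
    """Append ARCHFLOW_TRACE_NAME_SUFFIX (e.g. ":eval") so eval-suite
    traces are filterable from real workspace activity in Langfuse."""
    suffix = os.environ.get("ARCHFLOW_TRACE_NAME_SUFFIX", "")
    return f"{base}{suffix}"
```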
Coverage: 835 backend tests + 141 frontend tests passing.
…nected nodes
The layout engine already anchors a new placement next to placed neighbours of its model-level connections (`incremental_place` → relatedness centre), but that signal is empty when the agent's tool order is `create_object` → `place_on_diagram` → `create_connection`. Result: the new node lands in a free grid cell far from its eventual neighbour, then a long cross-canvas arrow gets drawn to it.
Reorder for the connected case: `create_object` → `create_connection` → `place_on_diagram`. Now the layout engine sees the model-level link at place time and seeds the position adjacent to the connected neighbour. Standalone adds (no neighbour) keep the original order — order doesn't matter then.
Diagram prompt updated with a new step-4 rule, an Example 1b walk-through, and a clarifying note that `create_connection` doesn't require both endpoints to be placed on the diagram yet.
Edges from agent-driven `create_connection` used to leave
`source_handle` / `target_handle` empty, which made React Flow fall back
to its default port (`top`) — connections then attached to the top of
every node, criss-crossing the canvas.
Combined approach:
* **Auto-pick (default).** New helper
`app/agents/layout/handles.auto_pick_handles(src_box, tgt_box)`: picks
`right`/`left` for horizontal-dominant routes, `bottom`/`top` for
vertical-dominant. Tie-breaks horizontal — most C4 layouts flow L→R.
* **DB-aware resolver.** `tools/_handle_resolver.py`:
- `resolve_handles_for_connection(...)`: when both endpoints share
exactly one diagram, derive handles from the placement pair.
Returns (None, None) on ambiguity (zero / multiple shared diagrams,
missing coords).
- `refresh_handles_for_object_placement(...)`: walks every connection
touching a freshly-placed/-moved object, fills in null handles
using both placement geometries; emits the updated rows so the
caller can publish `connection.updated` WS events.
* **Agent override.** `CreateConnectionInput` now accepts optional
`source_handle` / `target_handle` (validated against
`{top, right, bottom, left}`; invalid values dropped). Explicit
values always win over the auto-pick.
* **Wiring.** `create_connection` resolves and persists handles up
front. `place_on_diagram` and `move_on_diagram` call the refresh path
after the placement lands and broadcast `connection.updated` for each
connection whose handles changed, so open canvases redraw the edge
from the right side without a refetch.
* **Prompt.** Diagram prompt step 5 documents that handles are auto-
picked; agents pass them only when the user explicitly asks.
Tests: 7 geometry cases + 5 resolver/refresh cases + 3 integration
cases (override wins, auto-pick when no override, invalid override
drops to None). Full suite: 851 backend tests passing.
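The auto-pick geometry reduces to comparing the dominant axis between box centres. A minimal sketch assuming boxes are `(x, y, w, h)` tuples — the real `auto_pick_handles` signature may differ:

```python
def auto_pick_handles(src_box, tgt_box):
    """Pick (source_handle, target_handle) from relative geometry:
    right/left for horizontal-dominant routes, bottom/top for vertical.
    Ties break horizontal because most C4 layouts flow left-to-right."""
    sx, sy = src_box[0] + src_box[2] / 2, src_box[1] + src_box[3] / 2
    tx, ty = tgt_box[0] + tgt_box[2] / 2, tgt_box[1] + tgt_box[3] / 2
    dx, dy = tx - sx, ty - sy
    if abs(dx) >= abs(dy):                  # horizontal-dominant (or tie)
        return ("right", "left") if dx >= 0 else ("left", "right")
    return ("bottom", "top") if dy >= 0 else ("top", "bottom")
```

Deriving both handles from the same centre delta guarantees the edge leaves one box on the side facing the other, which is exactly what kills the everything-attaches-at-`top` criss-cross.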
Trace `863856ba-...` showed the design-partner failure mode: user asked "add Facade with 5 components inside" → supervisor delegated straight to diagram-agent (no planner) → diagram-agent burned 10 steps on create_object × 6 + create_child_diagram × 2 + place × 12 + 1 connection (Facade ↔ APP frontend) + 1 unplace + auto_layout, hit `max_steps`, and finalized with 5 orphan boxes inside the Facade child diagram and zero connections among them.
Three classes of fix:
1. **max_steps=200** on every node (supervisor / planner / diagram / researcher / critic). The earlier ceilings (10/12/6/4/3) were tuning band-aids from an era when local Qwen looped on confused tool calls. With #45–#48 + #50 the loop pressure is gone; the workspace cost budget (LimitsEnforcer) is the real guard. Removing the per-node ceiling means real architecture sessions finish instead of `forced_finalize=max_steps`.
2. **Supervisor routing.** Multi-component asks ALWAYS go through the planner — never straight to diagram-agent. New trigger guidance: ≥2 distinct objects, parent-with-internals, "build/design/structure", commas / "and" lists. Added Example 4 showing the Facade-with-internals flow. Anti-pattern: "Treating multi-component asks as single-shot".
3. **Inferred connections.** New planner rule #7 + worked example 2: when adding 2+ siblings inside a parent, propose connections from naming/role (Controller→Service, Controller→DB, Auth←caller, "X System for Y" → Y consumes X). Each inferred step's rationale is prefixed with `"inferred:"`. The diagram-agent's recap calls these out so the user can remove the wrong guesses.
The supervisor brief now also requires copying existing object IDs verbatim into the planner brief — a partial mitigation for the duplicate Facade observed in the same trace (the agent created a fresh Facade alongside an existing one because the brief paraphrased the id).
`test_run_long_path_reaches_max_steps_cleanly` is decoupled from the real ceiling via monkeypatch (max_steps=10 inside the test) so it stays fast and verifies the run_react loop still terminates correctly. Tests: 851 passing.
Local Qwen 35B occasionally takes 4-5 minutes on a single tool-heavy turn (especially when the planner emits a large JSON plan). The 90-second ceiling on `LLMClient.acompletion` / `astream` was tripping with `litellm.Timeout: timeout value=90.0, time taken=271.49s` and surfacing as AGENT_ERROR in the chat. Bump the default to 2000s — the workspace budget is the real cost guard.
…with active diagram
Trace 7fc45806-… exposed the design-partner failure mode: the user asked to fill the **inside** of an existing Facade, the planner correctly emitted `place_on_diagram(diagram_id="c7383a8b-…", ...)` (the Facade child diagram) for every placement, but the diagram-agent ignored those ids and called the tool with the supervisor's active-diagram id (4f3b4ceb-…, the root Base System) — so all 6 components landed on the parent canvas instead of inside the Facade. Root cause: every `place_on_diagram` example in diagram.md used the placeholder `"<active-diagram>"`, training the agent to substitute the active context for whatever the plan said.
Three prompt-level fixes:
1. **diagram.md** — new rule 3: "Use the diagram_id from the plan step verbatim, NOT the active-diagram id." Added Example 1c walking through exactly this case (placement inside a child diagram while the active view is the parent). Existing examples updated to use concrete plan ids (`d-system`, `d-base`, `d-cache`) instead of `<active-diagram>`. Renumbered subsequent steps.
2. **planner.md** — strengthened rule 3: "Always specify the right diagram_id for place_on_diagram. When the user asks for 'X inside Facade', look up Facade's child_diagram_id via list_child_diagrams / read_object_full and route placements there."
3. **supervisor.md** — new section "Pin the target diagram in your brief": when the user says "inside X" / "всередині Y" (Ukrainian for "inside Y"), resolve the child-diagram id BEFORE delegating to the planner; if no child diagram exists, brief the planner to create one first.
The context-size-exceeded error in the same trace (`OpenAIException: Error code: 400`) was only ~6700 estimated tokens — far below the model's 120k. That's the LM Studio server's loaded `n_ctx` (often 8192 by default), not the backend. The operator must reload the model with a larger context window.
Tests: 851 passing.
Trace 7fc45806 showed the researcher / diagram-agent occasionally re-interpreting the raw user text instead of acting on the supervisor's distilled brief — going off-script (over-fetching, asking questions that have nothing to do with the current sub-task, suggesting components the brief doesn't ask for).
`isolated_state_for_subagent` now hides the original user message by default. The supervisor's brief is the sub-agent's contract — the brief must be self-contained, and the supervisor prompt is updated to spell that out: "your brief must be self-contained — distilled intent, concrete deliverables, no slang or paraphrase the sub-agent would have to disambiguate". The critic still needs the original user request to verify the work against the goal, so `critic_node` opts in explicitly: `isolated_state_for_subagent(state, include_original_request=True)`. Future validator-style nodes do the same.
Tests: 2 new cases lock in the default-omit / opt-in behaviour. Full suite 853 passing.
Researcher answers about diagrams with many objects routinely run 4-12k chars — the 4000 limit was crashing real investigations with `ValidationError: string_too_long`. The same applies to the diagram-explainer's Explanation schema. Lift both caps to 16000; the workspace token budget remains the actual cost guard. Prompt updated to advertise the new ceiling. Tests updated to assert the new boundary, plus a positive case ensuring 12k bodies validate.
… clean langfuse trace I/O
Pack of fixes from trace 0fca4ca6 analysis (a Facade-internals run that populated the wrong child diagram and sat in an 8-turn no-op cycle):
1. **Server-side duplicate guard.** ``object_service.create_object`` now checks ``(workspace_id, type, lower(name))`` for live (non-draft) objects and raises ``DuplicateObjectError(existing)`` carrying the existing row. Drafts are unaffected — same-name copies in a draft are intentional. The agent's ``create_object`` tool wrapper catches the error and returns ``action="object.reused"`` with the existing id, so the search→create flow is idempotent at the DB level even if the LLM forgets ``search_existing_objects``. The REST endpoint surfaces the same condition as a 409 with the existing id in the detail body.
2. **Researcher conflict detection (prompt).** The researcher must group results by ``name.strip().lower()`` after every search/list. Groups with ≥2 items become a ``## ⚠ Workspace conflicts`` section in the summary, with a canonical pick (most placements + connections, tie-broken by oldest ``created_at``) and explicit reasoning.
3. **Supervisor surfaces conflicts (prompt).** New anti-pattern: silently disambiguating duplicates. When findings flag conflicts, either pick using active-context evidence and *say so* in the final reply, or finalize with a question. Never run mutating tools through the conflict.
4. **Diagram-agent stops cycling on all-reused (prompt).** Rule 9: if every placement/connection step in the batch returns ``reused``, emit a "nothing new to do" recap immediately. This is what the previous trace did wrong — it burned 8 LLM turns repeating place→search→read→layout while every result was reused.
5. **Diagram-agent explicit handles (prompt).** Rule 10: pass ``source_handle`` / ``target_handle`` when the geometry is obvious (Controller-on-left + Postgres-on-right → ``right`` / ``left``). The backend auto-picks the rest.
6. **Clean langfuse trace I/O.** The trace root ``input`` is now the user's verbatim message string and ``output`` is the final assistant text — matching the Langfuse-standard ``set_current_trace_io(input=..., output=...)`` pattern (see the lesson_12 reference). The ``forced_finalize`` reason goes inline only when there's no final message at all.
Tests: ``test_create_object_returns_reused_when_duplicate`` covers the agent-tool dedup happy path. Full suite 855 passing.
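The duplicate-guard flow can be illustrated with an in-memory stand-in (the registry dict and function signatures are assumptions; the real code checks the DB inside ``object_service.create_object``):

```python
class DuplicateObjectError(Exception):
    """Carries the existing row so callers can surface/reuse its id."""
    def __init__(self, existing: dict):
        super().__init__(f"duplicate of {existing['id']}")
        self.existing = existing


def create_object(registry: dict, workspace_id: str, type_: str, name: str,
                  *, draft: bool = False) -> dict:
    """Service-layer stand-in: reject live duplicates on the normalized
    (workspace_id, type, lower(name)) key; drafts are exempt."""
    key = (workspace_id, type_, name.strip().lower())
    if not draft and key in registry:
        raise DuplicateObjectError(registry[key])
    obj = {"id": f"obj-{len(registry) + 1}", "name": name}
    registry[key] = obj
    return obj


def create_object_tool(registry: dict, workspace_id: str, type_: str, name: str) -> dict:
    """Tool wrapper: turn the duplicate into an idempotent 'reused' result
    instead of an error the LLM would have to reason about."""
    try:
        obj = create_object(registry, workspace_id, type_, name)
        return {"action": "object.created", "id": obj["id"]}
    except DuplicateObjectError as e:
        return {"action": "object.reused", "id": e.existing["id"]}
```

The key design point is that the dedup lives in the service layer, so REST (409) and the agent tool ("reused") are two presentations of one check.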
Two follow-ups from trace 355785c7 (the run that created "Facade
Internal" alongside the existing "Facade Internal Components" and then
churned 5 fresh connections via create→delete in the same turn).
**1. Server-side dedup for ``create_child_diagram_for_object``.** Before
creating, we now query for any live (non-draft) diagram whose
``scope_object_id`` already equals the target. If one exists we return
``action="diagram.reused"`` carrying the existing id — same pattern as
the recent ``create_object`` dedup. Planner / diagram prompts updated
to look up the existing child diagram first instead of relying on the
server-side fallback.
**2. Mandatory ``reason: str`` on every destructive op.** Added required
``reason`` (10..1000 chars) to ``DeleteObjectInput``,
``DeleteConnectionInput``, ``DeleteDiagramInput``,
``UnplaceFromDiagramInput``. The diagram prompt explains how to write
useful reasons.
**3. LLM destructive-op reviewer.** New
``app/agents/tools/_destructive_review.py``: after the agent passes
``confirmed=True``, we feed the reviewer LLM the proposed mutation, the
impact preview, the agent's stated reason, and the calling agent's
recent message history (last 12 messages — pulled from
``ctx.agent_messages`` which the runtime now wires into ToolContext).
The reviewer outputs ``DeleteVerdict {"verdict": "APPROVE"|"REJECT",
"rationale": str}``. REJECT raises ``ToolDenied`` so the destructive
service-layer call never fires — exactly the guard that would have
caught the create-then-delete churn in trace 355785c7. When
``ctx.llm_client`` is missing (tests, direct service callers) the
reviewer auto-approves with a marker rationale; reviewer is a safety
net, not a hard barrier.
Tests: 3 new cases — child-diagram reuse, destructive-reviewer reject,
missing-reason validation. Existing delete tests updated to pass the
new ``reason`` field. Full suite 858 passing.
Earlier bumped to 2000s for slow Qwen runs; that was excessively permissive and let stuck calls block a single turn for over 30 minutes. 600s gives enough headroom for any healthy long-thinking LLM step on the slowest local model we use, while still tripping cleanly when something genuinely hangs.
… rule
Trace ad2fabad showed two issues, one acute and one slow-burn:
1. **Destructive-op reviewer was silently broken from day one.** Every
``delete_*`` call hit:
``litellm.BadRequestError: 'response_format.type' must be 'json_schema'
or 'text'``
because we passed ``{"type": "json_object"}``, which LM Studio's qwen
rejects (only ``text`` and ``json_schema`` are accepted there). The
reviewer caught the exception and auto-approved as a "safety net" —
so deletes flowed through with **zero LLM scrutiny** even when
``ctx.llm_client`` was wired in. Switched both call sites
(``_destructive_review.py`` and ``limits.py`` health-check) to
``response_format={"type": "text"}`` and rely on the prompt to pin
the JSON output. Reviewer parse path also now strips markdown JSON
fences and falls back to outermost-brace extraction so qwen's
occasional ```json ... ``` wrappers don't bypass review.
2. **Multi-edge between the same pair.** Trace cleanup showed 3 parallel
"authenticates*" connections between User Controller and Auth Service
("authenticates users", "authenticates requests" × 2) — residue from
pre-#36 runs that lacked dedup. Server-side dedup catches exact
reuse but doesn't merge edges with different labels. Added
``diagram.md`` rule 13: "Consolidate same-pair connections" — agents
now must use ``update_connection`` to enrich an existing edge's
label instead of stacking parallel arrows. Visual noise prevention.
Test ``test_health_check_uses_health_check_model`` updated to assert
``text`` instead of ``json_object``. Full suite 858 passing.
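The tolerant parse path described above — strip markdown fences, then fall back to outermost-brace extraction — can be sketched like this (the helper name is hypothetical):

```python
import json
import re


def parse_reviewer_json(text: str) -> dict:
    """Parse the reviewer's JSON verdict while tolerating qwen's
    occasional ```json fences and prose padding around the object."""
    stripped = re.sub(r"^```(?:json)?\s*|\s*```$", "", text.strip())
    try:
        return json.loads(stripped)
    except json.JSONDecodeError:
        # Fall back: take everything between the outermost braces.
        start, end = stripped.find("{"), stripped.rfind("}")
        if start != -1 and end > start:
            return json.loads(stripped[start:end + 1])
        raise
```

Anything that still fails both paths raises, which is the correct behaviour for a reviewer that must never silently approve.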
… text as fallback
Previous patch silenced the LM Studio HTTP 400 by switching to plain
text + manual JSON parse, but constrained decoding is the cleaner
contract — the server actually validates the response shape and we
don't depend on prompt discipline alone. This patch:
* Adds ``_pydantic_response_format(model)`` (in
``tools/_destructive_review.py``) and ``_json_schema_response_format``
(in ``limits.py``). Both produce the OpenAI-style envelope
``{"type": "json_schema", "json_schema": {"name": …, "schema": …}}``
derived from a Pydantic model. ``strict: True`` is intentionally
omitted because Pydantic v2 doesn't always emit
``additionalProperties: false`` at every nested level — the manual
parse fallback handles minor schema drift.
* New Pydantic model ``_HealthCheckResponse`` in ``limits.py`` mirrors
the dataclass ``HealthCheckResult`` so we can derive a JSON schema
for constrained decoding. The dataclass stays the runtime type.
* Both call sites now try ``json_schema`` first and retry with
``{"type": "text"}`` if the provider rejects the schema. The manual
parse path (markdown-fence stripping + outermost-brace extraction)
remains in the destructive reviewer as a final safety net for
servers that take ``json_schema`` but still pad their answer.
Test ``test_health_check_uses_health_check_model`` now asserts the
first call uses ``json_schema`` with the expected model name. Suite
858 passing.
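The envelope builder is mechanical once a Pydantic model exists; a sketch of the shape both helpers produce (assuming Pydantic v2; ``DeleteVerdict`` is the model named earlier in this log):

```python
from pydantic import BaseModel


class DeleteVerdict(BaseModel):
    verdict: str        # "APPROVE" | "REJECT"
    rationale: str


def pydantic_response_format(model: type[BaseModel]) -> dict:
    """Build the OpenAI-style response_format envelope from a Pydantic
    model. `strict: true` is deliberately omitted — Pydantic v2 doesn't
    always emit additionalProperties: false at every nested level, so the
    manual parse fallback absorbs minor schema drift instead."""
    return {
        "type": "json_schema",
        "json_schema": {
            "name": model.__name__,
            "schema": model.model_json_schema(),
        },
    }
```

Providers that reject the schema get the ``{"type": "text"}`` retry described above; providers that accept it give server-validated output.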
…el prefix
Symptom: user picked OpenRouter as provider with model ``anthropic/claude-haiku-4.5`` and got ``litellm.APIConnectionError: AnthropicException - Unable to get json response``. The response body was an HTML 404 page from ``openrouter.ai`` — i.e. LiteLLM dispatched the call through the **native Anthropic SDK** (because of the ``anthropic/`` model prefix), which posts to ``/v1/messages``, but OpenRouter only exposes the OpenAI-compatible ``/api/v1/chat/completions`` route. The Anthropic handler got HTML back, tried to parse it as JSON, and blew up.
Root cause: ``_build_call_kwargs`` only forced ``custom_llm_provider="openai"`` for ``provider="custom"`` (the LM Studio path). For ``provider="openrouter"`` we let LiteLLM auto-route by model prefix, which breaks every non-OpenAI model name on OpenRouter.
Fix: detect OpenRouter from either ``provider="openrouter"`` or any ``base_url`` containing ``openrouter.ai``. In both cases set ``custom_llm_provider="openai"`` so LiteLLM ignores the prefix and uses the OpenAI protocol, defaulting ``api_base`` to ``https://openrouter.ai/api/v1`` when none was supplied.
Tests: ``provider=openrouter``, ``base_url=openrouter.ai`` inference, and the existing LM Studio path stays unchanged. Full suite 861 passing.
LiteLLM's built-in catalog covers OpenAI / Anthropic / Google but nothing OpenRouter-only (z-ai/*, moonshotai/*, qwen-on-openrouter). Picking ``z-ai/glm-5v-turbo`` showed ``LiteLLM does not know context window for model 'z-ai/glm-5v-turbo'; falling back to 8192 tokens`` on every LLM call — and the context manager started compacting prematurely against an 8k ceiling instead of the model's real 131k.
Fix: new ``app/agents/openrouter_catalog.py`` fetches OpenRouter's ``/api/v1/models`` once per process (TTL 1h, asyncio-locked) and exposes ``get_context_length(model_id) -> int | None``. Best-effort: network errors / unknown models fall through to the existing ``litellm.get_max_tokens → 8192`` path. ``resolve_for_agent`` lazy-fills ``litellm_context_window`` from the catalog when the workspace picked OpenRouter (either by explicit ``provider=openrouter`` or a base_url pointing at openrouter.ai) and hasn't set a manual override. The ``LLMClient.context_window()`` priority order (workspace override → litellm.get_max_tokens → 8192 fallback) doesn't change — we just populate the override upstream so the warning stops firing.
Tests: 5 cases — happy path with cache hit, unknown model, network failure, malformed payload field handling, empty model id. Full suite 866 passing.
Trace e074e5ba showed mass `validation error: reason: Field required` because the tool descriptions and Pydantic field had no description — the LLM saw `reason: str` with no hint that it's mandatory or what to write. Add a per-tool description and a Pydantic field description (REQUIRED, ≥10 chars, with concrete examples) for delete_object, delete_connection, delete_diagram, unplace_from_diagram so the generated OpenAI tool spec carries the constraint into the model's context.
Trace d885971d showed delete_object retried 6x with the same incomplete args because "validation error: reason: Field required" didn't tell the agent what to put. Append the field's own description to the error message when validation fails on a top-level field, so the retry has a concrete fix hint without re-reading the tool spec.
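A sketch of that error-enrichment step, assuming pydantic-style error dicts and a simple `{field: {"description": ...}}` view of the model's fields (both shapes are assumptions):

```python
def enrich_validation_error(errors: list[dict], model_fields: dict) -> str:
    """Render validation errors, appending the field's own description to
    top-level failures so the LLM's retry has a concrete fix hint
    without re-reading the tool spec."""
    lines = []
    for err in errors:
        loc = err["loc"]                       # e.g. ("reason",)
        msg = f"{'.'.join(map(str, loc))}: {err['msg']}"
        if len(loc) == 1 and loc[0] in model_fields:
            desc = model_fields[loc[0]].get("description")
            if desc:
                msg += f" — {desc}"
        lines.append(msg)
    return "; ".join(lines)
```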
Trace d885971d showed delete_object retried 6x with identical
incomplete args; without a cycle detector the agent burns the entire
max_steps ceiling (200) on a non-progressing loop. Detect repeated
(name, json-args) signatures across steps; on the 4th identical call
break out, emit forced_finalize="stuck" with a tool-loop detail. The
existing finalize lead line for 'stuck' already produces a clean user
message ("I detected I was looping and stopped...").
While the agent is streaming, swap the send button for a red round cancel button (clicks call stream.cancel()) with an animate-ping ring acting as the "processing" indicator. Idle state is now also round with a chevron icon instead of a square unicode arrow. Adds a test for the cancel branch.
Trace 5e4f3ed9 had diagram batching delete_object(A), delete_object(B), delete_object(A) etc. across multiple steps — strict-consecutive matching never tripped because B kept resetting the streak. Switch to a sliding window (last 8 calls): when any (name, args) signature hits 4+ occurrences in the window, force-finalize="stuck". Adds a test for the interleaved-calls case.
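The sliding-window detector is a few lines over a bounded deque; a sketch (class and method names are hypothetical):

```python
import json
from collections import Counter, deque


class ToolLoopDetector:
    """Force-finalize guard: trips when the same (name, json-args)
    signature appears `threshold`+ times within the last `window` calls —
    catching A,B,A,B,A... interleavings that strict-consecutive
    matching misses."""

    def __init__(self, window: int = 8, threshold: int = 4):
        self._calls: deque = deque(maxlen=window)
        self._threshold = threshold

    def record(self, name: str, args: dict) -> bool:
        """Record a tool call; return True when the loop guard trips."""
        sig = (name, json.dumps(args, sort_keys=True))
        self._calls.append(sig)
        return Counter(self._calls)[sig] >= self._threshold
```

`json.dumps(..., sort_keys=True)` makes the signature insensitive to key order, so semantically identical calls always collide.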
…y run
Trace 5e4f3ed9 showed the agent calling delete_object 8x across the turn, every call rejected with "validation error: reason: Field required" — the user's explicit request to delete X never executed. Even after surfacing the reason requirement in tool/field descriptions and enriching the validation error with the field description, smaller LLMs still omit `reason` and burn the budget.
Drop `min_length=10` and `...` (required) — `reason: str = ""` is now optional. The two-step preview gate + the destructive-op reviewer LLM (which also reads recent activity) remain as safety nets, so deletes without an explicit reason are still sanity-checked, just no longer blocked by Pydantic validation. The tool/field descriptions still document `reason` as "strongly recommended" so models that do read descriptions keep providing it. The reviewer prompt now substitutes "(none — judge from recent activity below)" for empty reasons so the LLM doesn't see a literal "''".
…k is the point)
ce0f525 made `reason` optional to "let the delete actually run" when the model kept omitting it. The user pushed back — the whole point of the destructive-op hook is that the agent narrates intent; otherwise the reviewer LLM has nothing solid to judge against. Revert to required (`min_length=10`). To improve compliance without weakening the constraint:
- Tool descriptions now lead with "REQUIRED arguments: …, reason (≥10 chars)" and include a concrete example call. The OpenAI tool spec hands these to the provider verbatim.
- Field descriptions still carry the requirement + examples.
- Diagram-agent prompt rule #12 reorders to "MUST include reason" with an explicit shape line.
- Restored test_delete_object_missing_reason_validation_error.
Why the model ignored the schema: OpenAI/OpenRouter tool specs treat `required` as a hint by default — only `"strict": true` enforces it via constrained decoding (and that's incompatible with our current schema, which has `default` values on optional fields like `confirmed`). Validation failure → cycle-break (07c24d4) is the backstop until we either move to strict mode or use a model that follows specs more reliably.
…d only
The two-step preview gate + LLM destructive-op reviewer + required `reason` arg were costing more than they saved: the reviewer was silently broken for stretches (response_format issues), Pydantic kept rejecting the model's calls when reason was missing, and the agent burned its budget on validation-error retries. User decision: drop the whole layer, just take an id.
Changes:
- DeleteObjectInput, DeleteConnectionInput, DeleteDiagramInput, UnplaceFromDiagramInput → just the id field(s). No `confirmed`, no `reason`.
- delete_object / delete_connection / delete_diagram / unplace_from_diagram → single-shot execution, no preview branch, no reviewer call. Tool descriptions trimmed.
- Removed app/agents/tools/_destructive_review.py entirely.
- Diagram prompt rule #12: "destructive ops take only the id".
- evals/test_tool_correctness.py: the confirmed-gate expectation now only applies to discard_draft (a session-level op, not destructive review).
- Replaced preview/confirmed/reviewer tests with single-shot delete tests.
Server-side foreign-key cascade still owns correctness. The cycle-break detector (commit 07c24d4) remains as the backstop for runaway loops.
Trace e889a7d9 review: every node SPAN was sitting at the trace root
(parent=None) so supervisor and its delegated sub-agents looked like
peers; the trace input had been clobbered by LiteLLM's langfuse
callback with a raw {messages, tools} payload; spans had no
input/output of their own.
Changes:
- AgentTracer.start_node_span(role, input_payload) — supervisor spans
are tracked as the default parent for subsequent role="subagent"
spans, producing supervisor → researcher / planner / diagram /
critic hierarchy automatically.
- AgentTracer.finish() re-asserts the verbatim user chat_input on
the trace root so LiteLLM's overwrite doesn't win.
- graph.py: span names are now agent:supervisor / agent:planner /
agent:diagram / agent:researcher / agent:critic.
- Supervisor span input = last user message + last sub-agent tool
result; output = delegate_to_X args / final_message / forced reason.
- Sub-agent span input = supervisor's delegate_brief verbatim
(kind/instruction/reason); output = structured Findings / Plan /
Critique (model_dump) plus applied_changes preview for the diagram
sub-agent.
Per-visit supervisor spans (one per LangGraph re-entry) made the trace look like there were N peer supervisors instead of one orchestrator with sub-agents nested inside. Open the supervisor span ONCE on first visit, reuse it on every subsequent visit (LLM generations and tool events nest inside via parent_observation_id), and close it at trace finish with the latest buffered output.
The trace tree now matches the mental model: one ``agent:supervisor`` subtree containing all of its own LLM generations + all sub-agent spans (``agent:researcher`` / ``agent:planner`` / ``agent:diagram`` / ``agent:critic``) — no numbered duplicates. Supervisor span input is the user's verbatim message; output is the final delegate target / final_message / forced reason. Sub-agent input/output unchanged.
Sub-agent spans (agent:researcher / agent:planner / agent:critic) only
carried {forced_finalize, kind, tool_calls_made} — the actual Findings
/ Plan / Critique was never visible in Langfuse. Reason: the helper
read output.structured, but researcher/critic place their artefact on
output.state_patch[<key>] (with fallbacks for empty/malformed LLM
output) — leaving structured=None on the path the user actually hits.
Pull from state_patch first (findings / plan / critique / applied_
changes), fall back to output.structured for planner's pre-lift case,
and also surface the assistant text + applied_changes preview so the
span tells the full story.
Eval suites need the verbatim conversation an agent saw to grade /
replay its behaviour without re-running the LangGraph. Pull
``output.state_patch["messages"]`` and put it on the span's metadata
field as ``{"messages": [...]}``:
- supervisor span — full multi-visit conversation, buffered across
visits and applied at finish() (mirrors how output is buffered).
- sub-agent spans (researcher / planner / diagram / critic) —
isolated-state history (the supervisor's brief packed as a single
user message + the sub-agent's own ReAct turns + tool results).
AgentTracer.end_node_span gained a ``metadata`` kwarg; the supervisor
branch buffers it into ``_supervisor_metadata`` and finish() applies
it. Other spans get it stamped at end-time.
…king
AssistantText now uses react-markdown + remark-gfm so LLM output renders tables, headings, fenced code blocks, lists, blockquotes, etc. with project styling tokens. A custom link renderer routes archflow:// URLs into <ArchflowLink>.
Activity affordances while a stream is active:
- ToolCallCard pending state shows a coral spinning SVG (replacing the static ⏳), three pulsing dots in the title row when there's no preview yet, and a coral border + glow ring around the card.
- NodeIndicator badge gets a pulsing coral ring + three dots so the user sees the agent is actually working (not stuck) between LLM calls. New labels for supervisor / diagram / critic.
- ChatHistory adds a small "Agent thinking" pill at the bottom while streaming and the last item is text — no more silent gaps between SSE token bursts.
Three regressions / missing features rolled into one commit because
they share the chat-stream plumbing.
cancel button:
- Was a no-op when ``bag.sessionId`` wasn't yet set (i.e. user clicks
cancel before the first SSE ``session`` frame). Now always:
marks cancelledByUser, aborts the local fetch, clears
isStreaming/isReconnecting immediately. POST /cancel still fires
when sessionId is known so the LangGraph run also stops server-side
and doesn't burn budget.
session history:
- Picking a past session in SessionPicker only called stream.reset()
+ setActiveSessionId(id) — nothing fetched the session's persisted
messages, so the bubble showed an empty history. Added
``stream.loadHistory(messages, sid)`` which seeds ``events`` with
synthetic ``message`` frames; ChatBubble's new
useSessionHistoryLoader watches activeSessionId, fetches via the
existing useAgentSession hook, and feeds it into the stream.
auto-naming new sessions:
- New backend endpoint POST /agents/sessions/{id}/auto-title runs a
short LLM call (workspace's resolved settings, max 24 tokens, 30s
timeout) over the first persisted user message and stores a 3-6
word title. Idempotent: returns existing title when one is already
set; falls back to the first 60 chars of the message if the LLM
call fails. PATCH /agents/sessions/{id} added too for direct title
edits.
- Frontend fires maybeTitleSession() once when the SSE ``session``
frame arrives on a fresh stream (gated by bag.titleRequested so
reconnects don't refire). On success the agent-sessions list +
per-session detail caches are invalidated so SessionPicker refreshes.
…rcher; slug from repo name; aggregate by repo
The supervisor was misrouting repo questions to plain ``delegate_to_researcher``
(which has no git access) because the old ``delegate_to_repo_*`` naming was too
opaque for the LLM and the slug was derived from the diagram node name (so a
node "Backend" linked to ``acme/auth-service`` showed up as
``delegate_to_repo_backend`` — completely disconnected from the actual repo
the user was asking about).
Three changes:
* Rename ``delegate_to_repo_<slug>`` → ``delegate_to_git_researcher_<slug>``
so the LLM can't confuse the repo path with the plain researcher.
* Derive slugs from the repo NAME (the ``<name>`` part of ``<owner>/<name>``)
instead of the diagram node name. When two manifest entries reference
same-named repos from different owners, both slugs are owner-prefixed
(``my-org-auth-service`` / ``other-org-auth-service``) — much more
LLM-readable than the old 4-hex-suffix collision strategy.
* Aggregate manifest entries by repo URL when building delegation tools.
Two diagram nodes linked to the same repo now produce ONE tool whose
description lists both components ("linked to the AuthService Container
and the AuthGateway Container").
The ``delegate_to_researcher`` tool description now explicitly states it has
NO git access, pointing the supervisor at the new ``delegate_to_git_researcher_*``
family for source-code questions.
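A rough sketch of the slug strategy (``build_slugs`` and the URL parsing are assumptions, not the actual implementation):

```python
from collections import Counter

def _norm(s: str) -> str:
    return s.lower().replace("/", "-").replace("_", "-")

def build_slugs(repo_urls: list[str]) -> dict[str, str]:
    """Map repo URL -> slug: the repo NAME, owner-prefixed on a name clash."""
    # Take the trailing <owner>/<name> pair from each URL.
    parsed = {u: u.rstrip("/").split("/")[-2:] for u in repo_urls}
    name_counts = Counter(name for _owner, name in parsed.values())
    slugs: dict[str, str] = {}
    for url, (owner, name) in parsed.items():
        if name_counts[name] > 1:
            # Same-named repos from different owners: prefix both with owner.
            slugs[url] = _norm(f"{owner}-{name}")
        else:
            slugs[url] = _norm(name)
    return slugs
```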
…o_git_researcher_*
Two prompt updates so the LLM correctly routes repo questions:
* Researcher's own system prompt (``prompts/researcher/system.md``) gains
an "Out of scope" section near the top stating it has NO access to
GitHub repos and recommending the supervisor delegate to a
``delegate_to_git_researcher_*`` tool for code questions.
* Supervisor's prompt (``prompts/general/supervisor.md``) updates the
Researcher role bullet with the same "no git access" clause, and
Examples 5 & 6 (Repo Q&A and Visualise-this) now use the new tool
name ``delegate_to_git_researcher_auth-service`` so the supervisor
cookbook matches the runtime tool list.
…ations Two crashes the user hit in one session, with a shared root:
- Per-tool `await db.commit()` (added in 5d6448c) could yield while the SSE heartbeat tick or cancellation interleaved on the same AsyncSession. asyncpg then raised "session is provisioning a new connection; concurrent operations are not permitted", the commit was swallowed by the surrounding try/except, the object failed to persist, and the next ReAct step's create_connection FK-violated on the missing target. The whole turn died with a raw stack trace. Fix: a single asyncio.Lock created per invocation in runtime.stream(), threaded through LimitsEnforcer.db_lock and ToolContext.db_lock. The per-tool commit and the existing _safe_rollback now run inside that lock, so they cannot interleave with heartbeats or each other.
- IntegrityError / ForeignKeyViolationError bubbling out of the tool executor as INTERNAL_ERROR. The executor at tools/base.py:execute_tool now sniffs IntegrityError specifically, runs _safe_rollback, and returns {status: "error", code: "fk_violation", content: "<short asyncpg DETAIL line> -- create the target first, then retry"}. The LLM gets an actionable hint and can self-correct on the next step.
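The locking idea, reduced to a minimal sketch (``ToolContext`` here is a stand-in; only the lock discipline matters):

```python
import asyncio

class ToolContext:
    """Carries the shared AsyncSession plus the per-invocation lock."""
    def __init__(self, db, db_lock: asyncio.Lock):
        self.db = db
        self.db_lock = db_lock

async def commit_tool_result(ctx: ToolContext) -> None:
    # Every commit on the shared session goes through one lock, so a
    # heartbeat tick or a concurrent tool can never interleave mid-commit.
    async with ctx.db_lock:
        await ctx.db.commit()
```

A quick way to see the lock working is a fake session that records how many commits overlap; with the lock in place it never exceeds one.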
LLM returned a 37K-char summary wrapped in a ```json fence; the fallback path at researcher.py was passing the raw text straight into Findings(...) without going through the safe parser, so Pydantic raised string_too_long and the whole turn crashed with INTERNAL_ERROR. Fix:
- _strip_markdown_fence() removes ```json/```markdown wrappers before validation
- _safe_findings_from_text() validates inside try/except and falls back to a truncated Findings(confidence='low') instead of raising
- Cap raised 16000 -> 32000 (verified the field is in-memory only; no DB column or API schema enforces the lower bound)
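Roughly what the fence-stripping and safe-fallback path might look like (function names mirror the commit, but the bodies are illustrative; the real code validates a Pydantic Findings model rather than returning a dict):

```python
import re

SUMMARY_CAP = 32_000  # the raised cap; in-memory only per the commit

def strip_markdown_fence(text: str) -> str:
    """Remove a single ```json / ```markdown wrapper, if present."""
    m = re.match(r"^```[\w-]*\s*\n(.*?)\n?```\s*$", text.strip(), re.DOTALL)
    return m.group(1) if m else text

def safe_findings_summary(text: str) -> dict:
    """Truncate-and-degrade instead of raising on oversized output."""
    cleaned = strip_markdown_fence(text)
    if len(cleaned) <= SUMMARY_CAP:
        return {"summary": cleaned, "confidence": "medium"}
    return {"summary": cleaned[:SUMMARY_CAP], "confidence": "low"}
```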
Drilling INTO a Container with a linked GitHub repo previously hid the repo from the supervisor — the active diagram's objects didn't include the parent Container (it's the diagram's scope_object, not a placement).
collect_repo_manifest now also walks UPWARD from the active diagram through scope_object_id → parent placement → parent diagram, capped at the same 3-level depth as the descendant walk. RepoLink gains is_ancestor (bool, default False); depth carries upward distance for ancestors (1=immediate parent's scope_object, 2=grandparent, ...) and BFS depth for descendants (unchanged).
Final ordering: ancestors closest-first → active level → descendants BFS. A cap of 50 total entries applies across both directions; the ancestor side fills first so the most contextually relevant repo always survives truncation.
Document the bidirectional resolver and the new is_ancestor / depth semantics so the spec matches the implementation.
Symptom: follow-up messages within the same chat session appeared under different Langfuse session_ids, so traces couldn't be grouped in the UI.
Root cause: while AgentTracer already passed session_id at trace creation time, finish() called trace.update() without re-asserting it. LiteLLM's langfuse callback also upserts the same trace_id (one trace_create event per generation), so a stray late upsert could leave the trace ungrouped.
Fix: cache session_id on the AgentTracer at __init__ and re-pin it on the finish() update — mirrors the existing chat_input re-assertion pattern that works around the same class of late-callback clobbering.
Adds two regression tests in test_tracing.py:
- test_agent_tracer_passes_chat_session_id_to_langfuse: two consecutive AgentTracer instantiations with the same session_id produce traces with matching session_id but different trace_ids (UI grouping works).
- test_agent_tracer_disabled_when_client_unavailable: tracer no-ops gracefully when Langfuse isn't configured.
… chat Frontend bug: ChatComposer, MagicPromptButtons, and slash commands all called startStream without session_id, so every follow-up turn hit the backend's _load_or_create_session "create new" branch. The backend then constructed AgentTracer with a fresh session.id, which flowed into client.trace(session_id=...) — Langfuse saw a different session_id per turn even though the chat conversation was continuous.
Fix in use-agent-stream.ts startStream: when the caller's body has no session_id, inject bag.sessionId (captured from the first 'session' SSE frame). An explicit caller session_id still wins. This single chokepoint covers every submit path. The 140c12d trace.update() re-assertion stays as defense-in-depth.
ChatBubble was filtering loaded session messages down to user/assistant rows with content_text only, dropping assistant-with-tool_calls rows (content lives in content_json) and tool-result rows entirely. Live SSE renders icons from tool_call/tool_result events; resumed history had no equivalent and came back as a flat user↔assistant transcript.
The new seedEventsFromMessages utility converts persisted AgentChatMessage rows into the same SSE shape the live stream emits, so ToolCallCard renders identically when reopening an old chat. status defaults to "ok" since tool status isn't persisted; node-transition badges stay live-only.
…tures
- .env.example: add LANGFUSE_PUBLIC_KEY / LANGFUSE_SECRET_KEY / LANGFUSE_HOST (optional tracing) and the four AGENT_RATE_LIMIT_* knobs as commented defaults so operators see what's tunable.
- README: replace the legacy "AI insights (Claude)" bullet with a proper AI agents section covering the supervisor + sub-agents, GitHub Repo Researcher, Diagram Explainer, provider-agnostic LLMs via LiteLLM, and Langfuse tracing. Add LangGraph + LiteLLM to the stack table. Update the Environment section to show AGENTS_SECRET_KEY (now required for agents) and Langfuse, plus a note that LLM provider keys + GitHub PAT live per-workspace in the DB, not in env.
Conflicts resolved:
- Makefile: combined both .PHONY target lists (kill-dev + db-sweep-undo).
- backend/app/main.py: kept agents routers (agent_settings/agent_sessions/agents) and added the undo router, both alphabetically.
- backend/app/api/v1/objects.py: merged the new repo-link try/except guards with the per-user undo's actor_user / from_diagram_id / from_draft_id arguments on create_object + update_object calls.
- backend/app/schemas/object.py: kept repo_url/repo_branch fields and added from_diagram_id/from_draft_id on both ObjectCreate and ObjectUpdate.
- backend/app/services/object_service.py: combined Connection/diagram model imports and re-ordered update_object so repo validation runs before the log/undo snapshot pair (and pops undo-context fields out of update_data).
Migration heads were also divergent (undo entries vs object repo link) — generated alembic merge revision f359350166f3 to bring both lineages under a single head.
- backend/pyproject.toml: add [tool.setuptools.packages.find] include=app*
so wheel build doesn't trip over tests/ + evals/ now that both have
__init__.py (setuptools refuses flat layouts with multiple top-level
packages — was failing the prod docker image build).
- frontend type errors surfaced by `tsc -b` (the build path uses project
references; --noEmit doesn't):
- ChatHistory.test.tsx: drop unused `within` import (TS6133).
- AssistantText.tsx: type the markdown `code` renderer as `any` —
react-markdown's intersected `Components` type doesn't expose
`inline` directly (TS2322).
- AgentsSettingsPage.tsx: narrow MODEL_CATALOG access via
`Exclude<ProviderId, 'custom'>` since `isCustomProvider` already
gates the branch (TS7053 + TS7006).
- .github/workflows/test.yml: add postgres:16 + redis:7 services, point
DATABASE_URL/REDIS_URL at them, generate a Fernet AGENTS_SECRET_KEY
in-job, run alembic upgrade head before pytest. Without these the new
undo + repo + agent-platform tests can't reach a real DB.
91e6520f52f4 shipped with `pass` in upgrade()/downgrade(), so a clean `alembic upgrade head` ended without a `notifications` table. Existing deploys are fine because the schema was bootstrapped via Base.metadata.create_all outside Alembic, but CI (which runs alembic against a fresh Postgres) was failing every notifications-touching test with `relation "notifications" does not exist`.
Filled in the real CREATE TABLE + index mirroring app/models/notification.py.
test_concurrent_undo_first_wins_second_409s relies on asyncio.gather giving two ASGI requests real parallelism, but in-process ASGITransport serialises them on a single event loop — so both observe seq=2 as the top before either commits, both return 200, and the expected 409 never materialises. Reproducible failure on CI, intermittent locally.
Skipping to unblock CI on PR #14. The right fix (a DB-level row lock on the UndoEntry top, or running the test against a real HTTP server) belongs in a follow-up to PR #12, where the test was introduced.
- evals/Makefile: PYTEST now uses `uv run --directory ..` so tests run with cwd=backend. Without it `make -C evals fast` hits cwd=evals and pytest resolves `evals/test_draft_policy.py` as `evals/evals/test_draft_policy.py` — and the conftest's `from evals.lib.judge import ...` fails with ModuleNotFoundError.
- evals/test_tool_correctness.py: bump EXPECTED_TOOL_COUNT 41 → 50 to account for the 9 repo_* tools added by the GitHub Repo Researcher.
`uv run --directory ..` resolves the project root for uv but doesn't reliably move pytest's cwd in CI — conftest still couldn't import evals.lib.judge. Each Make recipe line spawns its own shell, so an explicit `cd ..` per target is robust and self-contained.
…I venv Restricting setuptools to `app*` (build-backend fix) stopped uv sync from installing the `evals` package into the virtual env. Locally an older editable install kept `evals` on sys.path; CI's fresh venv hit ModuleNotFoundError on `from evals.lib.judge import ...`. Adding pythonpath=['.'] reinstates backend/ on sys.path during pytest so the package import works without re-broadening the wheel.
uv-installed venv doesn't materialise an editable copy of source dirs the wheel doesn't claim, so the previous `include = ['app*']` made `from evals.lib.judge import ...` unresolvable in CI's fresh env (pytest's pythonpath= didn't kick in soon enough during conftest load). Broaden the include to cover `evals*` too — wheel grows by a handful of helper modules but CI's fast eval suite finally imports cleanly. Reverts the now-redundant pytest pythonpath knob.
uv leaves the project as a virtual workspace, so `evals` is never copied into site-packages — even with packages.find listing it. The eval conftest does an absolute `from evals.lib.judge import ...`, which needs `backend/` on sys.path. PYTHONPATH gives us that without installing anything extra.
uv keeps the project virtual ("source = virtual = ."), so site-packages
never gets `evals/` and CI's eval conftest blew up on
`from evals.lib.judge import ...`. Tried successively: pyproject's
pytest pythonpath knob, broadening setuptools.find, and PYTHONPATH in
the workflow — none reliably landed before pytest imported the eval
conftest. A top-level conftest is the deterministic version: pytest
loads it before descending into evals/, sys.path mutation runs first,
the absolute import resolves.
Top-level backend/conftest.py wasn't kicking in under `uv run` in CI — the eval conftest still hit ModuleNotFoundError on `from evals.lib.judge import ...`. Doing the sys.path insertion inline in evals/conftest.py before the absolute import is bulletproof: pytest imports the conftest, the first lines run, sys.path now has backend/, the import below resolves. Both the test conftest fix here and the top-level backend/conftest.py land at the same goal (backend/ on sys.path); leaving both because the top-level one helps any future pytest invocation that doesn't go through the eval conftest at all.
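The inline fix amounts to a few lines at the very top of the eval conftest (the path layout is assumed from the commit message):

```python
# evals/conftest.py (sketch) — runs before anything else in this file.
import sys
from pathlib import Path

# evals/conftest.py lives one level below backend/, so parent.parent is backend/.
BACKEND_DIR = Path(__file__).resolve().parent.parent
if str(BACKEND_DIR) not in sys.path:
    sys.path.insert(0, str(BACKEND_DIR))

# The absolute `from evals.lib.judge import ...` further down now resolves,
# because backend/ is on sys.path before pytest evaluates the import.
```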
A global ~/.gitignore rule (lib/) was silently shadowing the entire backend/evals/lib/ package — judge.py, agent_helpers.py, the cost-cap plugin, etc. were all present locally but never committed, so CI hit ModuleNotFoundError on `from evals.lib.judge import ...` no matter how many sys.path workarounds we added. Mirror the existing `!frontend/src/lib/` exception for the backend counterpart and commit the missing files. Root cause of every "fast eval suite" failure on PR #14.
… dev/prod
What changed
- backend/conftest.py now reads DATABASE_URL on session start, derives a `<name>_test` sibling if the URL doesn't already end in `_test`, creates the DB if missing, runs `alembic upgrade head` against it, and overrides DATABASE_URL/DATABASE_URL_SYNC in os.environ so the rest of the app picks up the test DB.
- .github/workflows/test.yml drops the explicit `alembic upgrade head` step; the conftest now handles migrations as part of bootstrap.
Why
- The pytest fixtures TRUNCATE users / workspaces / diagrams / etc. with CASCADE. Running `pytest tests/` against the dev DB (which a developer's .env points at) wipes real accounts in seconds — that's exactly what happened when I verified the merge resolution on PR #14 locally.
- The fix is structural: tests literally cannot run against any DB whose name doesn't end in `_test`, because the conftest swaps it before the app's Settings instance is ever constructed. No way to forget.
- Same protection covers prod: even if someone accidentally pointed DATABASE_URL at a prod host, conftest would only touch `<prod>_test`, never the real DB. Prod-deploy paths don't run pytest, so that's belt-and-suspenders.
Verified
- Local: `pytest tests/services/test_object_service_repo.py` creates archflow_test, migrates it, runs against it, and leaves archflow (dev) untouched (still has 2 users / 2 workspaces / 2 diagrams from before).
- 729 tests across api/services/agents pass with the new bootstrap.
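The `<name>_test` derivation is simple enough to sketch (``derive_test_url`` is a hypothetical helper name):

```python
def derive_test_url(database_url: str) -> str:
    """Return a sibling URL whose DB name ends in _test; idempotent."""
    base, _, name = database_url.rpartition("/")
    if name.endswith("_test"):
        return database_url
    return f"{base}/{name}_test"
```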
…uired
- backend/alembic/versions/a1f8c9d2b3e4_repair_notifications_table.py: idempotent CREATE TABLE IF NOT EXISTS for notifications. The original 91e6520f52f4 shipped with empty upgrade()/downgrade() bodies, so any prod that ran past it has the revision recorded as applied but the table missing — and Alembic won't rerun the corrected fix on those deploys. This new revision rides on top of the merge head and re-creates the table only if it isn't already there. Clean deploys treat it as a no-op (the fixed 91e6520f52f4 already created it); affected prods finally get a working notifications table without manual psql intervention.
- .env.example: turn the AGENTS_SECRET_KEY block into a REQUIRED callout — spell out what breaks without it, the generate command, and the do-not-rotate warning.
- README "Environment" section: dedicated callout explaining AGENTS_SECRET_KEY is mandatory for any AI feature to work, with the same generate / no-rotation guidance.
The agent_edits_policy values were named in a way that misled users
("live_only" sounded like "this only works on live" rather than "always
write to live"). Cleaning the API:
* live_only → live (default — no draft, write straight to live)
* drafts_only → drafts (always work in a draft)
* ask → ask (unchanged)
Backend
* agent_settings_service: canonical constants + ``normalise_edits_policy``
accepting legacy aliases. Default flipped from ask → live so a fresh
workspace gets straightforward behaviour out of the box.
* resolve_for_agent normalises whatever value sits in the DB row
before returning, so existing rows storing 'live_only' / 'drafts_only'
keep working — no data migration needed.
* runtime resolve_active_draft normalises the policy locally too (covers
the path where tests / golden_runtime still pass legacy strings).
Frontend
* AgentEditsPolicy type narrowed to 'live' | 'drafts' | 'ask'.
* useAgentsSettings normalises the value coming back from the API, so
UI components only see canonical strings.
* Settings page radio options now have human-friendly labels + hints
instead of debug-style "live_only" text.
Also: tightened .gitignore exception for backend/evals/lib/ so it stops
sweeping in __pycache__/ on commit.
Summary
Lands the AI agents investigation branch — multi-agent supervisor architecture, GitHub Repo Researcher, Langfuse tracing, and assorted chat / tracing fixes accumulated since main.
What's new
- AgentState: terminating delegate tool calls, sub-agent isolated message lists, supervisor history rewriting so artefacts survive between turns.
- GitHub Repo Researcher: nine read-only tools (repo_get_metadata, repo_read_readme, repo_list_tree, repo_read_file, repo_search_code, repo_read_issues, repo_read_pulls, repo_read_commits, repo_read_diff), per-turn LRU cache, hard truncation caps, structured GitHub-error envelope. Per-turn manifest aggregates repos by URL and exposes one delegate_to_git_researcher_<slug> tool per repo.
- Langfuse tracing: LANGFUSE_* env, per-workspace consent (off/errors_only/full). Chat session id now propagated to session_id so follow-up turns appear under one trace group.
Env / docs
- .env.example entries: LANGFUSE_PUBLIC_KEY / LANGFUSE_SECRET_KEY / LANGFUSE_HOST (optional tracing) and the four AGENT_RATE_LIMIT_* knobs as commented defaults.
- README Environment section updated for AGENTS_SECRET_KEY + Langfuse, with a note that LLM provider keys + GitHub PAT live per-workspace in the DB.
Out of scope (kept for follow-ups)
- Persist node graph-transition events so the "Researcher…" / "Planning…" badges show up when reopening an old chat (current resume only restores tool icons).
- Persist tool_status so historical tool calls render with their real success/error tint instead of defaulting to ok.
- Wire AGENTS_SECRET_KEY / LANGFUSE_* into docker/docker-compose.yml (prod compose uses env_file: [.env] so it's already covered there).
Test plan
- cd backend && uv run pytest -q
- cd frontend && npx vitest run
- Missing AGENTS_SECRET_KEY → backend boots, agents disabled gracefully (no startup crash)
- build-backend + build-frontend green