Add MAS visual debugger and evaluation visibility by cm2435 · Pull Request #41 · DeepFlow-research/ergon

cm2435 · 2026-04-27T17:01:36Z

Summary

Adds the MAS run visual debugger surfaces, activity stack, run workspace refinements, and supporting dashboard contract/state updates.
Adds evaluation visibility across backend DTOs and frontend UI: enriched evaluation criteria, cohort rubric status summaries/pips, graph rubric glyphs, container roll-ups, and an evaluation lens.
Expands smoke fixtures, recursive smoke topology coverage, generated contracts, and E2E assertions for the new dashboard/evaluation surfaces.
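As a rough illustration of the container roll-ups and rubric status pips described above, the sketch below folds per-criterion statuses into a single summary. The status values, `RubricSummary`, and `roll_up` are hypothetical names chosen for this example; the actual DTOs in this PR may be shaped differently.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum


class CriterionStatus(str, Enum):
    # Hypothetical status values; the real contract may use different names.
    PASSED = "passed"
    FAILED = "failed"
    PENDING = "pending"


@dataclass(frozen=True)
class RubricSummary:
    passed: int
    failed: int
    pending: int

    @property
    def glyph(self) -> str:
        # Worst status wins: any failure marks the whole container as failed.
        if self.failed:
            return "failed"
        if self.pending:
            return "pending"
        return "passed" if self.passed else "empty"


def roll_up(statuses: list[CriterionStatus]) -> RubricSummary:
    """Count criterion statuses so a container can show one summary pip."""
    counts = Counter(statuses)  # missing keys read as 0
    return RubricSummary(
        passed=counts[CriterionStatus.PASSED],
        failed=counts[CriterionStatus.FAILED],
        pending=counts[CriterionStatus.PENDING],
    )
```

The "worst status wins" ordering is an assumption made for the sketch, not something the PR description states.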

Test plan

uv run pytest tests/unit/runtime/test_evaluation_summary_contracts.py tests/unit/runtime/test_dynamic_task_evaluation_mapping.py tests/unit/runtime/test_cohort_rubric_status_summary.py -q
pnpm exec tsx --test tests/contracts/contracts.test.ts src/features/evaluation/selectors.test.ts
pnpm run typecheck
pnpm run test:contracts

Made with Cursor

Tighten the dashboard trace/debugger experience, make smoke e2e runs exercise the canonical sad path, and move sandbox test doubles behind explicit test-support boundaries so production sandbox setup fails loudly when E2B is not configured. Made-with: Cursor
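A minimal sketch of the "fail loudly" behavior described above, assuming the production path checks an `E2B_API_KEY` environment variable. The function and exception names are hypothetical and not taken from this repository.

```python
import os
from typing import Mapping, Optional


class SandboxNotConfiguredError(RuntimeError):
    """Raised when production sandbox setup has no E2B credentials."""


def require_e2b_api_key(env: Optional[Mapping[str, str]] = None) -> str:
    """Return the E2B API key, or raise instead of silently using a test double."""
    # Assumption: the key is read from E2B_API_KEY; the real setup may differ.
    env = os.environ if env is None else env
    key = env.get("E2B_API_KEY", "").strip()
    if not key:
        raise SandboxNotConfiguredError(
            "E2B_API_KEY is not set; refusing to fall back to a fake sandbox."
        )
    return key
```

Under this arrangement, test doubles would be constructed only from test-support modules, so a missing key in production surfaces as an error rather than a quietly faked sandbox.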

github-actions · 2026-04-28T17:08:21Z

E2E smoke — `researchrubrics`

No PNG screenshots were uploaded for this leg. See screenshots/pr-41 for the uploaded placeholder.

github-actions · 2026-04-29T10:52:17Z

E2E smoke — `researchrubrics`

No PNG screenshots were uploaded for this leg. See screenshots/pr-41 for the uploaded placeholder.

github-actions · 2026-04-29T10:52:19Z

E2E smoke — `minif2f`

No PNG screenshots were uploaded for this leg. See screenshots/pr-41 for the uploaded placeholder.

github-actions · 2026-04-29T10:52:24Z

E2E smoke — `swebench-verified`

No PNG screenshots were uploaded for this leg. See screenshots/pr-41 for the uploaded placeholder.

cm2435 added 30 commits April 26, 2026 13:39

feat(dashboard): add MAS visual debugger activity stack

da61fc8

Derive run activity from existing dashboard state and mutations so MAS runs can be replayed as a full graph plus overlapping activity stack without backend DTO changes. Made-with: Cursor
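One way an overlapping activity stack could be laid out is greedy interval partitioning: each activity takes the lowest lane that is free at its start time. The sketch below is an illustration only; `Activity` and `assign_lanes` are hypothetical names, and the dashboard's real derivation from state and mutations is not shown in this PR page.

```python
import heapq
from dataclasses import dataclass


@dataclass(frozen=True)
class Activity:
    task_id: str
    start: float
    end: float


def assign_lanes(activities: list[Activity]) -> dict[str, int]:
    """Greedy interval partitioning: overlapping activities get distinct lanes."""
    lanes: dict[str, int] = {}
    busy: list[tuple[float, int]] = []  # min-heap of (end_time, lane)
    free: list[int] = []                # min-heap of reusable lane indices
    next_lane = 0
    for act in sorted(activities, key=lambda a: (a.start, a.end)):
        # Release every lane whose activity ended at or before this start.
        while busy and busy[0][0] <= act.start:
            heapq.heappush(free, heapq.heappop(busy)[1])
        if free:
            lane = heapq.heappop(free)  # reuse the lowest free lane
        else:
            lane = next_lane            # open a new lane
            next_lane += 1
        lanes[act.task_id] = lane
        heapq.heappush(busy, (act.end, lane))
    return lanes
```

For example, activities spanning (0, 5), (1, 3), (3, 6), and (5, 7) need only two lanes, because the third can reuse the lane the second vacated.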

fix(dashboard): align visual debugger styling with Claude design

75d194c

Made-with: Cursor

Merge main into visual debugger branch

acf16e5

Resolve the execute_task comment conflict while preserving the skipped-task contract violation path from the sandbox boundary cleanup. Made-with: Cursor

Fix CI formatting and generated contracts

8589b83

Apply Ruff formatting and regenerate dashboard contracts so the Python and frontend drift checks agree with the committed sources. Made-with: Cursor

Fix type checks after main sync

070e30f

Regenerate REST OpenAPI contracts, carry cancelled task counts through dashboard state, and clean up Python suppression/type-check issues from the sandbox boundary refactor. Made-with: Cursor

Fix CI build and migration head

e9d92e5

Keep generated REST contracts lint-clean, update the e2e workflow guard for parallel smoke jobs, and rebase the thread-summary migration onto the latest main migration head. Made-with: Cursor

Add evaluation visibility and smoke coverage

8e54706

Made-with: Cursor

Make criterion rubric details first class

9bbde99

Made-with: Cursor

Fix graph rubric glyph test id collision

1d4a9c4

Keep rubric glyph selectors distinct from graph-node selectors so visual debugger geometry checks continue to measure only actual nodes. Made-with: Cursor

Improve research rubric rollout diagnostics

3fe6212

Add structured runtime error persistence, OpenRouter model resolution, workflow CLI hardening, and artifact health checks so small research-rubrics cohorts expose actionable failures instead of opaque worker crashes. Made-with: Cursor
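To make "structured runtime error persistence" concrete, here is a small sketch that turns a worker exception into a serializable record. The record fields and helper names are assumptions for illustration and do not reflect the schema this commit actually adds.

```python
import json
import traceback
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class RuntimeErrorRecord:
    # Hypothetical fields; the persisted schema in the repository may differ.
    task_id: str
    error_type: str
    message: str
    trace: str


def capture_error(task_id: str, exc: BaseException) -> RuntimeErrorRecord:
    """Convert an exception into a record that can be stored and displayed."""
    return RuntimeErrorRecord(
        task_id=task_id,
        error_type=type(exc).__name__,
        message=str(exc),
        trace="".join(
            traceback.format_exception(type(exc), exc, exc.__traceback__)
        ),
    )


def to_json(record: RuntimeErrorRecord) -> str:
    """Serialize the record with stable key order for storage or diffing."""
    return json.dumps(asdict(record), sort_keys=True)
```

Keeping the error type and message as separate fields, rather than one opaque string, is what would let a cohort view group failures by cause.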

Consolidate LLM context capture in builtins

ca3b720

Move PydanticAI transcript capture, replay assembly, and model resolution out of core so ReAct workers persist richer context events without framework-specific core dependencies. Made-with: Cursor

wip: cli fixes and refactors

2b9788a

docs: plan core schema deduplication

e361806

Made-with: Cursor

refactor: narrow public API surface

89b50b2

Move runtime DTOs and protocols to their core homes so internal packages no longer depend on public API facades. Made-with: Cursor

refactor: improve research rubric evaluation

4944832

Keep rubric scoring and real-LLM artifact handling explicit so evaluation behavior is easier to inspect and test. Made-with: Cursor

chore: apply lint and test cleanup

b423097

Normalize imports, typing, and small test helpers so the branch has a stable baseline for the schema refactor. Made-with: Cursor

test: keep OpenRouter budget helper in real LLM tests

98e64e4

Move the rollout budget helper out of core so live-test spending controls stay scoped to the test harness. Made-with: Cursor

docs: update OpenRouter budget helper references

05fe264

Point real-LLM harness notes at the test-local budget helper after the relocation. Made-with: Cursor

Consolidate graph status conventions

63f7f07

Made-with: Cursor

Use graph status conventions in propagation

eedabec

Made-with: Cursor

Align propagation contract with blocked successors

9beec0f

Made-with: Cursor

Use canonical evaluation criterion status

0d8facb

Made-with: Cursor

Unify graph mutation payload contracts

c0f3eba

Made-with: Cursor

Collapse duplicate task node projections

3ac0cfc

Made-with: Cursor

test: add missing tool budget module

b114390

Restore the module required by workflow CLI tool tests so focused verification collects cleanly. Made-with: Cursor

fix: emit assigned worker slug in task status events

3debc68

Keep dashboard task-status emissions aligned with the canonical worker slug field introduced by the task-node DTO cleanup. Made-with: Cursor

Centralize task cancellation causes

de7c73b

Made-with: Cursor

Share typed context event payload schemas

7803087

Made-with: Cursor

Guard generation to context event mapping

1f134ab

Made-with: Cursor

cm2435 added 26 commits April 28, 2026 18:41

Consolidate recovered branch cleanup work

1ec99e3

Preserve the recovered dashboard DI, public API cleanup, and criterion contract changes in the main checkout before continuing test fixes. Made-with: Cursor

docs: capture cleanup and layout plans

0da06aa

Made-with: Cursor

refactor: split public benchmark API package

323a1b2

Made-with: Cursor

refactor: split public criterion and rubric APIs

aed3d1c

Made-with: Cursor

refactor: split public worker API package

2dd0e12

Made-with: Cursor

refactor: introduce explicit component registry API

4f18c87

Made-with: Cursor

refactor: move shared core utilities into core shared package

8839ac2

Made-with: Cursor

refactor: move experiment domain models into domain package

34b84ce

Made-with: Cursor

refactor: move generation context models into domain package

94b3a29

Made-with: Cursor

refactor: move experiment application services

b6c7c17

Made-with: Cursor

refactor: move task and graph application services

85dbd79

Made-with: Cursor

refactor: move evaluation application services

1cba909

Made-with: Cursor

refactor: move workflow application services

857f0c5

Made-with: Cursor

refactor: move runtime jobs into application package

0499ecc

Made-with: Cursor

refactor: move read models out of runtime services

8172aba

Made-with: Cursor

refactor: move Inngest integration into infrastructure package

297a883

Made-with: Cursor

refactor: move sandbox infrastructure package

984a946

Made-with: Cursor

refactor: move tracing and dashboard infrastructure

5a2fd4b

Made-with: Cursor

refactor: move FastAPI routes into rest api package

23cf32f

Made-with: Cursor

refactor: update persistence imports for new core layout

c214063

Made-with: Cursor

refactor: extract builtin worker factories and shared helpers

0133cd6

Made-with: Cursor

refactor: register builtins through explicit registry hooks

5369bb9

Made-with: Cursor

refactor: update builtin benchmarks and evaluators for new APIs

57dea6b

Made-with: Cursor

refactor: update CLI experiment and benchmark flows

5202186

Made-with: Cursor

test: move smoke fixtures into shared test fixtures

db51579

Made-with: Cursor

test: update package tests and black-box harness layout

22b7e62

Made-with: Cursor

cm2435 wants to merge 69 commits into main from feature/mas-run-visual-debugger-plan
