
Add MAS visual debugger and evaluation visibility #41

Open
cm2435 wants to merge 69 commits into main from feature/mas-run-visual-debugger-plan

Contributor

cm2435 commented Apr 27, 2026

Summary

  • Adds the MAS run visual debugger surfaces, activity stack, run workspace refinements, and supporting dashboard contract/state updates.
  • Adds evaluation visibility across backend DTOs and frontend UI: enriched evaluation criteria, cohort rubric status summaries/pips, graph rubric glyphs, container roll-ups, and an evaluation lens.
  • Expands smoke fixtures, recursive smoke topology coverage, generated contracts, and E2E assertions for the new dashboard/evaluation surfaces.
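For reviewers who want a mental model of the cohort rubric status summaries, the roll-up could look roughly like the sketch below. It is not the code in this branch: `CriterionStatus`, `RubricStatusSummary`, and `summarize_rubric_statuses` are hypothetical names, and the three-state status vocabulary is an assumption.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum
from typing import List


class CriterionStatus(str, Enum):
    # Hypothetical status vocabulary; the real contract may use other values.
    PASSED = "passed"
    FAILED = "failed"
    PENDING = "pending"


@dataclass(frozen=True)
class RubricStatusSummary:
    passed: int
    failed: int
    pending: int

    @property
    def total(self) -> int:
        return self.passed + self.failed + self.pending


def summarize_rubric_statuses(statuses: List[CriterionStatus]) -> RubricStatusSummary:
    """Roll per-criterion statuses up into one summary for a cohort or container."""
    counts = Counter(statuses)  # missing keys count as zero
    return RubricStatusSummary(
        passed=counts[CriterionStatus.PASSED],
        failed=counts[CriterionStatus.FAILED],
        pending=counts[CriterionStatus.PENDING],
    )
```

The same function could feed both the cohort pips and the container roll-ups, since each only needs counts per status.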

Test plan

  • uv run pytest tests/unit/runtime/test_evaluation_summary_contracts.py tests/unit/runtime/test_dynamic_task_evaluation_mapping.py tests/unit/runtime/test_cohort_rubric_status_summary.py -q
  • pnpm exec tsx --test tests/contracts/contracts.test.ts src/features/evaluation/selectors.test.ts
  • pnpm run typecheck
  • pnpm run test:contracts

Made with Cursor

cm2435 added 30 commits April 26, 2026 13:39
Derive run activity from existing dashboard state and mutations so MAS runs can be replayed as a full graph plus an overlapping activity stack, without backend DTO changes.

Made-with: Cursor
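A minimal sketch of the idea in this commit, assuming the mutation log can be reduced to per-task status changes with timestamps. `Mutation`, `ActivitySpan`, `derive_activity`, and `active_at` are hypothetical names for illustration, and the "running" status string is an assumption, not the dashboard's actual contract.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Mutation:
    # Hypothetical shape: one status change for one task at a point in time.
    task_id: str
    status: str  # assumed values such as "running", "completed", "failed"
    at: float    # seconds since run start


@dataclass(frozen=True)
class ActivitySpan:
    task_id: str
    start: float
    end: Optional[float]  # None while the task is still running


def derive_activity(mutations: List[Mutation]) -> List[ActivitySpan]:
    """Turn a mutation log into per-task activity spans, ordered by start time."""
    open_starts: Dict[str, float] = {}
    spans: List[ActivitySpan] = []
    for m in sorted(mutations, key=lambda m: m.at):
        if m.status == "running":
            # Keep the earliest start if "running" is reported more than once.
            open_starts.setdefault(m.task_id, m.at)
        elif m.task_id in open_starts:
            spans.append(ActivitySpan(m.task_id, open_starts.pop(m.task_id), m.at))
    # Tasks that never left "running" stay open-ended.
    spans.extend(ActivitySpan(t, s, None) for t, s in open_starts.items())
    return sorted(spans, key=lambda s: s.start)


def active_at(spans: List[ActivitySpan], t: float) -> List[str]:
    """Return the task ids whose spans cover time t (the activity stack at t)."""
    return [s.task_id for s in spans if s.start <= t and (s.end is None or t < s.end)]
```

Because everything is computed from the mutation log, scrubbing a replay to time `t` only needs `active_at(spans, t)`; no additional backend fields are required in this sketch.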
Tighten the dashboard trace/debugger experience, make smoke e2e runs exercise the canonical sad path, and move sandbox test doubles behind explicit test-support boundaries so production sandbox setup fails loudly when E2B is not configured.

Made-with: Cursor
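The "fails loudly" behaviour could be sketched as below. This is an illustration only: `SandboxNotConfiguredError` and `require_e2b_api_key` are hypothetical names, and reading the key from an `E2B_API_KEY` environment variable is an assumption about how the branch is configured.

```python
import os
from typing import Mapping, Optional


class SandboxNotConfiguredError(RuntimeError):
    """Raised when production sandbox setup has no E2B credentials."""


def require_e2b_api_key(env: Optional[Mapping[str, str]] = None) -> str:
    """Return the E2B API key, or raise instead of silently using a test double."""
    # Assumption: the key lives in E2B_API_KEY; pass `env` explicitly in tests.
    env = os.environ if env is None else env
    key = env.get("E2B_API_KEY", "").strip()
    if not key:
        raise SandboxNotConfiguredError(
            "E2B is not configured: set E2B_API_KEY, or inject a sandbox "
            "test double explicitly from test support."
        )
    return key
```

The point of the design is that a missing key never falls back to a fake sandbox in production code paths; tests have to opt in to a double on purpose.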
Resolve the execute_task comment conflict while preserving the skipped-task contract violation path from the sandbox boundary cleanup.

Made-with: Cursor
Apply Ruff formatting and regenerate dashboard contracts so the Python and frontend drift checks agree with the committed sources.

Made-with: Cursor
Regenerate REST OpenAPI contracts, carry cancelled task counts through dashboard state, and clean up Python suppression/type-check issues from the sandbox boundary refactor.

Made-with: Cursor
Keep generated REST contracts lint-clean, update the e2e workflow guard for parallel smoke jobs, and rebase the thread-summary migration onto the latest main migration head.

Made-with: Cursor
Keep rubric glyph selectors distinct from graph-node selectors so visual debugger geometry checks continue to measure only actual nodes.

Made-with: Cursor
Add structured runtime error persistence, OpenRouter model resolution, workflow CLI hardening, and artifact health checks so small research-rubrics cohorts expose actionable failures instead of opaque worker crashes.

Made-with: Cursor
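As a rough picture of what structured runtime error persistence might store, here is a sketch under stated assumptions. `RuntimeErrorRecord`, `capture_error`, and `to_json` are hypothetical, and the field names are illustrative rather than the schema used in this branch.

```python
import json
import traceback
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class RuntimeErrorRecord:
    # Hypothetical record shape; field names are illustrative.
    error_type: str
    message: str
    task_id: str
    traceback_text: str


def capture_error(exc: BaseException, task_id: str) -> RuntimeErrorRecord:
    """Convert an exception into a record that can be stored and shown in the UI."""
    return RuntimeErrorRecord(
        error_type=type(exc).__name__,
        message=str(exc),
        task_id=task_id,
        traceback_text="".join(
            traceback.format_exception(type(exc), exc, exc.__traceback__)
        ),
    )


def to_json(record: RuntimeErrorRecord) -> str:
    """Serialize the record deterministically for persistence."""
    return json.dumps(asdict(record), sort_keys=True)
```

Keeping the exception type and message as separate fields is what lets a cohort view group failures by cause instead of showing one opaque worker crash.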
Move PydanticAI transcript capture, replay assembly, and model resolution out of core so ReAct workers persist richer context events without framework-specific core dependencies.

Made-with: Cursor
Move runtime DTOs and protocols to their core homes so internal packages no longer depend on public API facades.

Made-with: Cursor
Keep rubric scoring and real-LLM artifact handling explicit so evaluation behavior is easier to inspect and test.

Made-with: Cursor
Normalize imports, typing, and small test helpers so the branch has a stable baseline for the schema refactor.

Made-with: Cursor
Move the rollout budget helper out of core so live-test spending controls stay scoped to the test harness.

Made-with: Cursor
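A test-local budget helper of this kind might look like the following sketch. `RolloutBudget` and `BudgetExceededError` are hypothetical names, and tracking spend in US dollars is an assumption; the helper in this branch may be shaped differently.

```python
from dataclasses import dataclass


class BudgetExceededError(RuntimeError):
    """Raised when a live test would spend past its configured limit."""


@dataclass
class RolloutBudget:
    # Hypothetical helper: tracks cumulative spend for one live-test session.
    limit_usd: float
    spent_usd: float = 0.0

    def charge(self, cost_usd: float) -> float:
        """Record a cost and return the remaining budget; refuse to overspend."""
        if cost_usd < 0:
            raise ValueError("cost_usd must be non-negative")
        if self.spent_usd + cost_usd > self.limit_usd:
            raise BudgetExceededError(
                f"charge of {cost_usd} would exceed limit {self.limit_usd}"
            )
        self.spent_usd += cost_usd
        return self.limit_usd - self.spent_usd
```

Living in the test harness rather than core means production code never imports a spending guard it has no use for.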
Point real-LLM harness notes at the test-local budget helper after the relocation.

Made-with: Cursor
Restore the module required by workflow CLI tool tests so focused verification collects cleanly.

Made-with: Cursor
Keep dashboard task-status emissions aligned with the canonical worker slug field introduced by the task-node DTO cleanup.

Made-with: Cursor
@github-actions

E2E smoke — researchrubrics

No PNG screenshots were uploaded for this leg. See screenshots/pr-41 for the uploaded placeholder.

cm2435 added 26 commits April 28, 2026 18:41
Preserve the recovered dashboard DI, public API cleanup, and criterion contract changes in the main checkout before continuing test fixes.

Made-with: Cursor
@github-actions

E2E smoke — researchrubrics

No PNG screenshots were uploaded for this leg. See screenshots/pr-41 for the uploaded placeholder.

@github-actions

E2E smoke — minif2f

No PNG screenshots were uploaded for this leg. See screenshots/pr-41 for the uploaded placeholder.

@github-actions

E2E smoke — swebench-verified

No PNG screenshots were uploaded for this leg. See screenshots/pr-41 for the uploaded placeholder.
