Add MAS visual debugger and evaluation visibility #41
Open
Derive run activity from existing dashboard state and mutations so MAS runs can be replayed as a full graph plus overlapping activity stack without backend DTO changes. Made-with: Cursor
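One way to picture the "overlapping activity stack" is as interval partitioning: each task span gets the lowest free lane, so concurrent spans stack vertically. The sketch below is only an illustration under that assumption; `Activity` and `assign_lanes` are hypothetical names, not types or functions from this PR, and the real dashboard derives its spans from existing state and mutations rather than from a list like this.

```python
import heapq
from dataclasses import dataclass


@dataclass(frozen=True)
class Activity:
    """Hypothetical span of work derived from dashboard state (not the PR's type)."""

    task_id: str
    start: float
    end: float


def assign_lanes(activities: list[Activity]) -> dict[str, int]:
    """Greedy interval partitioning: reuse the lowest free lane, else open a new one."""
    lanes: dict[str, int] = {}
    busy: list[tuple[float, int]] = []  # min-heap of (end_time, lane) for running spans
    free: list[int] = []  # min-heap of lanes that have been released
    next_lane = 0
    for act in sorted(activities, key=lambda a: (a.start, a.end)):
        # Release every lane whose span ended at or before this span starts.
        while busy and busy[0][0] <= act.start:
            _, lane = heapq.heappop(busy)
            heapq.heappush(free, lane)
        if free:
            lane = heapq.heappop(free)
        else:
            lane = next_lane
            next_lane += 1
        lanes[act.task_id] = lane
        heapq.heappush(busy, (act.end, lane))
    return lanes
```

With spans `a(0, 5)`, `b(1, 3)`, `c(3, 6)`, and `d(5, 7)`, `b` overlaps `a` and takes a second lane, while `c` and `d` reuse lanes as soon as they are released.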
Tighten the dashboard trace/debugger experience, make smoke e2e runs exercise the canonical sad path, and move sandbox test doubles behind explicit test-support boundaries so production sandbox setup fails loudly when E2B is not configured. Made-with: Cursor
Resolve the execute_task comment conflict while preserving the skipped-task contract violation path from the sandbox boundary cleanup. Made-with: Cursor
Apply Ruff formatting and regenerate dashboard contracts so the Python and frontend drift checks agree with the committed sources. Made-with: Cursor
Regenerate REST OpenAPI contracts, carry cancelled task counts through dashboard state, and clean up Python suppression/type-check issues from the sandbox boundary refactor. Made-with: Cursor
Keep generated REST contracts lint-clean, update the e2e workflow guard for parallel smoke jobs, and rebase the thread-summary migration onto the latest main migration head. Made-with: Cursor
Keep rubric glyph selectors distinct from graph-node selectors so visual debugger geometry checks continue to measure only actual nodes. Made-with: Cursor
Add structured runtime error persistence, OpenRouter model resolution, workflow CLI hardening, and artifact health checks so small research-rubrics cohorts expose actionable failures instead of opaque worker crashes. Made-with: Cursor
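To illustrate the difference between an opaque worker crash and a structured, persistable error, here is a hedged sketch. `RuntimeErrorRecord`, `capture_error`, and `to_json` are hypothetical names, and the field set is an assumption; the PR's actual persistence schema is not shown on this page.

```python
import json
import traceback
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class RuntimeErrorRecord:
    """Hypothetical structured error row; fields are illustrative only."""

    task_id: str
    error_type: str
    message: str
    traceback_text: str


def capture_error(task_id: str, exc: BaseException) -> RuntimeErrorRecord:
    """Convert an exception into a record that can be stored and queried later."""
    formatted = traceback.format_exception(type(exc), exc, exc.__traceback__)
    return RuntimeErrorRecord(
        task_id=task_id,
        error_type=type(exc).__name__,
        message=str(exc),
        traceback_text="".join(formatted),
    )


def to_json(record: RuntimeErrorRecord) -> str:
    """Serialize the record with stable key order for storage or diffing."""
    return json.dumps(asdict(record), sort_keys=True)
```

A worker would call `capture_error` inside its `except` block and persist the JSON next to the run, so a failed cohort reports the error type and message per task instead of only a non-zero exit.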
Move PydanticAI transcript capture, replay assembly, and model resolution out of core so ReAct workers persist richer context events without framework-specific core dependencies. Made-with: Cursor
Move runtime DTOs and protocols to their core homes so internal packages no longer depend on public API facades. Made-with: Cursor
Keep rubric scoring and real-LLM artifact handling explicit so evaluation behavior is easier to inspect and test. Made-with: Cursor
Normalize imports, typing, and small test helpers so the branch has a stable baseline for the schema refactor. Made-with: Cursor
Move the rollout budget helper out of core so live-test spending controls stay scoped to the test harness. Made-with: Cursor
Point real-LLM harness notes at the test-local budget helper after the relocation. Made-with: Cursor
Restore the module required by workflow CLI tool tests so focused verification collects cleanly. Made-with: Cursor
Keep dashboard task-status emissions aligned with the canonical worker slug field introduced by the task-node DTO cleanup. Made-with: Cursor
Preserve the recovered dashboard DI, public API cleanup, and criterion contract changes in the main checkout before continuing test fixes. Made-with: Cursor
Summary
Test plan
uv run pytest tests/unit/runtime/test_evaluation_summary_contracts.py tests/unit/runtime/test_dynamic_task_evaluation_mapping.py tests/unit/runtime/test_cohort_rubric_status_summary.py -q
pnpm exec tsx --test tests/contracts/contracts.test.ts src/features/evaluation/selectors.test.ts
pnpm run typecheck
pnpm run test:contracts